Re: My spark job runs faster in spark 1.6 and much slower in spark 2.0

2017-02-14 Thread arun kumar Natva
Hi John,
The input file has 30 billion rows, and the input data is 72 GB in size. The output is expected to contain one reading per account and day combination for 50k sample accounts, which means the total output record count = 50k * 365.

On Tue, Feb 14, 2017 at 6:29 PM, Jörn Franke
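A quick back-of-the-envelope check of the output cardinality described above (this calculation is mine, not from the thread):

```python
# One reading per account per day, for a full year
sample_accounts = 50_000
days_per_year = 365

output_records = sample_accounts * days_per_year
print(output_records)  # 18250000
```

So the final result is only about 18.25 million rows, tiny compared with the 30-billion-row input, which is why the aggregation and shuffle stages dominate the runtime rather than the output write.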

Re: My spark job runs faster in spark 1.6 and much slower in spark 2.0

2017-02-14 Thread Jörn Franke
Can you check in the UI which tasks took most of the time? Even the 45 min looks a little long, given that most of the time you are only working with 50k rows.

> On 15 Feb 2017, at 00:03, Timur Shenkao wrote:
>
> Hello,
> I'm not sure that's your reason but check this

Re: My spark job runs faster in spark 1.6 and much slower in spark 2.0

2017-02-14 Thread Timur Shenkao
Hello,
I'm not sure that's your reason, but check this discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-td20803.html

On Tue, Feb 14, 2017 at 9:25 PM, anatva wrote:
> Hi,
> I am reading an ORC

My spark job runs faster in spark 1.6 and much slower in spark 2.0

2017-02-14 Thread anatva
Hi,
I am reading an ORC file, performing some joins and aggregations, and finally generating a dense vector for analytics. The code runs in 45 minutes on Spark 1.6 on a 4-node cluster. When the same code is migrated to Spark 2.0 on the same cluster, it takes around 4-5 hours. It is
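For a regression like this, one knob worth checking after a 1.6-to-2.x migration is the SQL shuffle partition count, since every join and aggregation in the pipeline shuffles. A minimal sketch of the relevant settings, assuming the job is submitted with spark-submit; the values are illustrative defaults, not taken from the thread:

```
# spark-defaults.conf (illustrative values, not from the thread)

# Number of partitions used when shuffling data for joins/aggregations;
# too few partitions on 30B input rows can leave tasks badly skewed.
spark.sql.shuffle.partitions    200

# Kryo is generally faster than Java serialization for shuffle-heavy jobs.
spark.serializer                org.apache.spark.serializer.KryoSerializer
```

Comparing the stage/task timings for the same job in the Spark UI on both versions, as suggested earlier in the thread, is the quickest way to see which of these stages actually regressed.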