Hi John,
The input file has 30 billion rows, and the input data is 72 GB in size. The
output is expected to have readings for each account & day combination for
the 50k sample accounts, so the total output record count is 50k * 365, i.e.
about 18.25 million records.
On Tue, Feb 14, 2017 at 6:29 PM, Jörn Franke wrote:
Can you check in the UI which tasks took most of the time?
Even the 45 minutes looks a bit high, given that most of the time you are
only working with 50k rows.
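If clicking through the UI gets tedious, you can also log stage durations
programmatically and let the slow stages stand out in the driver log. A rough
sketch (untested; the listener class name is mine):

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// Prints each completed stage's wall-clock duration.
class StageTimingListener extends SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    val info = stageCompleted.stageInfo
    val millis = for {
      start <- info.submissionTime
      end   <- info.completionTime
    } yield end - start
    println(s"Stage ${info.stageId} '${info.name}' took ${millis.getOrElse(-1L)} ms")
  }
}

// Register it on your existing SparkContext before running the job:
// sc.addSparkListener(new StageTimingListener)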
On 15 Feb 2017, at 00:03, Timur Shenkao wrote:
Hello,
I'm not sure that's your reason but check this discussion:
http://apache-spark-developers-list.1001551.n3.nabble.com/SQL-ML-Pipeline-performance-regression-between-1-6-and-2-x-td20803.html
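If that regression is what you are hitting, a quick check is to print the
physical plan on both versions and compare them; in Spark 2.x you can also
turn off whole-stage codegen to rule it out. (df below is just a placeholder
for the DataFrame right before your expensive action.)

// Compare what the planner produces on 1.6 vs 2.0:
df.explain(true)  // parsed, analyzed, optimized, and physical plans

// Spark 2.x only: disable whole-stage code generation to check
// whether it is the source of the slowdown.
spark.conf.set("spark.sql.codegen.wholeStage", "false")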
On Tue, Feb 14, 2017 at 9:25 PM, anatva wrote:
Hi,
I am reading an ORC file, performing some joins and aggregations, and
finally generating a dense vector for analytics.
The code runs in 45 minutes on Spark 1.6 on a 4-node cluster. When the same
code is migrated to Spark 2.0 on the same cluster, it takes around
4-5 hours.
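For reference, the job has roughly this shape (a simplified Spark 2.x sketch;
the paths and column names below are placeholders, not the actual code):

import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("orc-join-agg").getOrCreate()

val readings = spark.read.orc("/data/readings.orc")        // ~30B rows, ~72 GB
val accounts = spark.read.orc("/data/sample_accounts.orc") // ~50k sample accounts

// The account table is tiny, so broadcasting it avoids shuffling
// the large readings table for the join.
val joined = readings.join(broadcast(accounts), Seq("account_id"))

// One aggregated reading per account & day combination.
val daily = joined
  .groupBy("account_id", "reading_date")
  .agg(sum("reading").as("total_reading"))

// Feature vector per row for the downstream analytics step.
val features = new VectorAssembler()
  .setInputCols(Array("total_reading"))
  .setOutputCol("features")
  .transform(daily)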