Re: Tuning level of Parallelism: Increase or decrease?

2016-08-03 Thread Jacek Laskowski
Hi Jestin, I need to expand on this in the Spark notes. Spark can handle data locality itself, but if the Spark executors run on separate machines from the HDFS datanodes, there's always the network between them, which makes performance worse compared to co-locating the Spark and HDFS nodes. These are mere details
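A minimal sketch of the knob involved here: spark.locality.wait controls how long the scheduler holds a task hoping for a data-local executor before settling for a less-local one. The values below are illustrative, not recommendations:

    // Illustrative Scala sketch: locality-related settings. Co-locating Spark
    // executors with HDFS datanodes is what makes NODE_LOCAL tasks possible;
    // these settings only tune how long the scheduler waits for such a slot.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("locality-demo")
      .set("spark.locality.wait", "3s")       // overall wait (3s is the default)
      .set("spark.locality.wait.node", "3s")  // per-level override for node locality
    val sc = new SparkContext(conf)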

Re: Tuning level of Parallelism: Increase or decrease?

2016-08-03 Thread Yong Zhang
features than the Standalone mode. You can research more on Google about it. Yong

Re: Tuning level of Parallelism: Increase or decrease?

2016-08-02 Thread Sonal Goyal
Hi Jestin, which of your actions is the bottleneck? Is it the group by, the count, or the join? Or all of them? It may help to tune the most time-consuming task first.
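One rough way to answer that question yourself, as a sketch (the paths, the column name "key", and the two DataFrame names are hypothetical stand-ins for the real inputs):

    // Sketch: force each stage separately and time it, to see whether the
    // groupBy/count or the join dominates. Names and paths are placeholders.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("bottleneck-check").getOrCreate()
    val df  = spark.read.parquet("hdfs:///path/to/main")   // hypothetical path
    val df2 = spark.read.parquet("hdfs:///path/to/other")  // hypothetical path

    def timed[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(s"$label took ${(System.nanoTime() - start) / 1e9} s")
      result
    }

    timed("groupBy + count") { df.groupBy("key").count().count() }
    timed("outer join")      { df.join(df2, Seq("key"), "outer").count() }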

Re: Tuning level of Parallelism: Increase or decrease?

2016-08-01 Thread Nikolay Zhebet
Yes, Spark always tries to deliver the snippet of code to the data (not vice versa). But you should realize that if you run groupBy or join on a large dataset, you always have to migrate the temporary, locally grouped data from one worker node to another (it is a shuffle operation, as i
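To make that shuffle concrete, a sketch (the column name "key" and the DataFrame names df and df2 are assumed): repartitioning both sides by the join key up front lets the join reuse that partitioning instead of shuffling again later.

    // Sketch: a groupBy or join forces a shuffle because rows sharing a key
    // must land on the same worker. Pre-partitioning by the join key makes
    // that data movement explicit and reusable. "key", df, df2 are placeholders.
    import org.apache.spark.sql.functions.col

    val leftByKey  = df.repartition(col("key"))
    val rightByKey = df2.repartition(col("key"))
    val joined     = leftByKey.join(rightByKey, Seq("key"), "outer")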

Re: Tuning level of Parallelism: Increase or decrease?

2016-08-01 Thread Jestin Ma
Hi Nikolay, I'm looking at data locality improvements for Spark, and I have conflicting sources on using YARN for Spark. Reynold said that Spark workers automatically take care of data locality here:

Re: Tuning level of Parallelism: Increase or decrease?

2016-08-01 Thread Nikolay Zhebet
Hi. Maybe "data locality" can help you. If you use groupBy and joins, then most likely you will see a lot of network operations. This can be very slow. You can try to prepare and transform your information in a way that minimizes transferring temporary information between worker nodes. Try
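One standard way to minimize that transfer, when one side of the join is small, is a broadcast join; a sketch, assuming a small table dims that fits in executor memory (bigDf, dims, and "key" are assumed names):

    // Sketch: broadcasting the small table to every worker turns the join
    // into a local, map-side operation, so the big table is never shuffled.
    import org.apache.spark.sql.functions.broadcast

    val result = bigDf.join(broadcast(dims), Seq("key"))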

Re: Tuning level of Parallelism: Increase or decrease?

2016-07-31 Thread Jestin Ma
It seems that the number of tasks being this large does not matter. Each task size was set by default to the HDFS block size of 128 MB, which I've heard to be OK. I've tried tuning the block (task) size to be larger and smaller, to no avail. I tried coalescing to 50, but that introduced large data skew and
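For reference on that skew: coalesce only merges existing partitions without a shuffle, so uneven input partitions stay uneven, while repartition does a full shuffle and rebalances. A sketch of the trade-off (df is a stand-in for the loaded DataFrame):

    // Sketch: coalesce(50) is cheap but can inherit or worsen skew;
    // repartition(50) pays for a full shuffle and yields uniform partitions.
    val merged     = df.coalesce(50)
    val rebalanced = df.repartition(50)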

Re: Tuning level of Parallelism: Increase or decrease?

2016-07-31 Thread Andrew Ehrlich
15000 seems like a lot of tasks for that size. Test it out with a .coalesce(50) placed right after loading the data. It will probably either run faster or crash with out-of-memory errors.
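As a sketch of that test (the input path and column name are hypothetical stand-ins):

    // Sketch: coalesce right after loading, before any wide operations,
    // so all downstream stages run with 50 partitions.
    val df = spark.read.parquet("hdfs:///data/input").coalesce(50)
    df.groupBy("key").count().show()   // "key" is a placeholder column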

Tuning level of Parallelism: Increase or decrease?

2016-07-29 Thread Jestin Ma
I am processing ~2 TB of HDFS data using DataFrames. The size of a task is equal to the block size specified by HDFS, which happens to be 128 MB, leading to about 15000 tasks. I'm using 5 worker nodes with 16 cores each and ~25 GB RAM. I'm performing groupBy, count, and an outer-join with another
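For context on where those task counts come from, a sketch (names and values are illustrative): the ~15000 input tasks are driven by HDFS splits (2 TB / 128 MB), while the groupBy and join stages are instead governed by spark.sql.shuffle.partitions, which defaults to 200.

    // Sketch: with 5 nodes x 16 cores = 80 slots, a small multiple of 80 is
    // a common starting point for shuffle parallelism. df, other, and "key"
    // are placeholders for the actual inputs.
    spark.conf.set("spark.sql.shuffle.partitions", "160")
    val counts = df.groupBy("key").count()
    val joined = counts.join(other, Seq("key"), "outer")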