15000 seems like a lot of tasks for that size. Test it out with a .coalesce(50) 
placed right after loading the data. It will probably either run faster or 
crash with out of memory errors.
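
If it helps, here is roughly what I mean. This is just a sketch: the SparkSession 
entry point (Spark 2.0; on 1.6 use sqlContext.read instead), the parquet paths, 
and the column names are placeholders, not taken from your actual job.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("coalesce-test").getOrCreate()

// coalesce(50) right after the load, so everything downstream works with
// ~50 partitions instead of ~15000
val big   = spark.read.parquet("hdfs:///path/to/2tb-input").coalesce(50)
val small = spark.read.parquet("hdfs:///path/to/200mb-lookup")

// the groupBy/count and outer join you described, then save to disk
val counts = big.groupBy("key").count()
val joined = counts.join(small, Seq("key"), "full_outer")

joined.write.parquet("hdfs:///path/to/output")

One caveat: coalesce only merges the partitions produced by the read, without a 
full shuffle. The groupBy and join afterwards will still shuffle into 
spark.sql.shuffle.partitions partitions (200 by default), so you may want to 
lower that setting as well.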

> On Jul 29, 2016, at 9:02 AM, Jestin Ma <jestinwith.a...@gmail.com> wrote:
> 
> I am processing ~2 TB of HDFS data using DataFrames. The size of each task 
> equals the HDFS block size, which happens to be 128 MB, leading to about 
> 15000 tasks.
> 
> I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
> I'm performing a groupBy and count, and an outer join with another DataFrame 
> of ~200 MB (~80 MB cached, though I don't need to cache it), then saving to 
> disk.
> 
> Right now it takes about 55 minutes, and I've been trying to tune it.
> 
> I read on the Spark Tuning guide that:
> In general, we recommend 2-3 tasks per CPU core in your cluster.
> 
> This means that I should have about 30-50 tasks instead of 15000, and each 
> task would be much bigger in size. Is my understanding correct, and is this 
> suggested? I've read from different sources to decrease or increase 
> parallelism, or even to keep it at the default.
> 
> Thank you for your help,
> Jestin
