15000 seems like a lot of tasks for a cluster of that size. Test it out with a .coalesce(50) placed right after loading the data. It will probably either run faster or crash with out-of-memory errors.
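Roughly something like this (a minimal sketch assuming a Spark 2.0 SparkSession and Parquet input; the paths, column name, and small join DataFrame are placeholders, not your actual job):

    // Illustrative only: paths and column names are made up.
    val big = spark.read.parquet("hdfs:///path/to/big/input")
      .coalesce(50)                       // collapse the ~15000 input splits into 50 partitions

    val small = spark.read.parquet("hdfs:///path/to/small/input")

    val result = big.groupBy("key").count()
      .join(small, Seq("key"), "outer")   // outer join on the placeholder "key" column

    result.write.parquet("hdfs:///path/to/output")

coalesce(50) avoids a full shuffle, so the fewer, larger partitions are cheap to produce; the risk is that each partition now has to fit comfortably in executor memory.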
> On Jul 29, 2016, at 9:02 AM, Jestin Ma <jestinwith.a...@gmail.com> wrote:
>
> I am processing ~2 TB of HDFS data using DataFrames. The size of a task is
> equal to the block size specified by HDFS, which happens to be 128 MB,
> leading to about 15000 tasks.
>
> I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
> I'm performing groupBy, count, and an outer join with another DataFrame of
> ~200 MB size (~80 MB cached but I don't need to cache it), then saving to
> disk.
>
> Right now it takes about 55 minutes, and I've been trying to tune it.
>
> I read on the Spark Tuning guide that:
> In general, we recommend 2-3 tasks per CPU core in your cluster.
>
> This means that I should have about 30-50 tasks instead of 15000, and each
> task would be much bigger in size. Is my understanding correct, and is this
> suggested? I've read from different sources to decrease or increase
> parallelism, or even keep it default.
>
> Thank you for your help,
> Jestin