I am processing ~2 TB of HDFS data using DataFrames. The task size equals the
HDFS block size, which happens to be 128 MB, leading to roughly 15,000 tasks
(one per block).

I'm using 5 worker nodes with 16 cores each and ~25 GB RAM.
I'm performing a groupBy, a count, and an outer join with another DataFrame of
~200 MB (~80 MB when cached, although I don't actually need to cache it), and
then saving the result to disk.
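
Roughly, the job looks like this (run in spark-shell; the paths, Parquet
format, and the join column "key" are placeholders, not my real schema):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("groupby-count-join").getOrCreate()

// ~2 TB input; one input task per 128 MB HDFS block
val large = spark.read.parquet("hdfs:///path/to/large-input")

// ~200 MB DataFrame joined against the aggregated result
val small = spark.read.parquet("hdfs:///path/to/small-side")

// groupBy + count, then an outer join on the placeholder key column
val counts = large.groupBy("key").count()
val joined = counts.join(small, Seq("key"), "outer")

// save the result back to disk
joined.write.parquet("hdfs:///path/to/output")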

Right now it takes about 55 minutes, and I've been trying to tune it.

I read in the Spark Tuning Guide that:
*In general, we recommend 2-3 tasks per CPU core in your cluster.*

With 5 x 16 = 80 cores, this would mean I should have roughly 160-240 tasks
instead of 15,000, and each task would be much bigger in size. Is my
understanding correct, and is this advisable? I've read from different sources
to decrease or increase parallelism, or even to keep it at the default.
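
For context, these are the knobs I've been experimenting with to change the
parallelism (a minimal sketch, assuming the Spark 2.x SparkSession API, run in
spark-shell where `spark` is the predefined session; the path is a placeholder):

// Number of partitions used after shuffles (groupBy, join); default is 200.
// 80 cores x 2-3 tasks per core suggests something in the 160-240 range.
spark.conf.set("spark.sql.shuffle.partitions", "240")

// The number of *input* tasks is still driven by the 128 MB HDFS block size;
// coalescing after the read is what reduces the partition count downstream.
val fewerPartitions = spark.read.parquet("hdfs:///path/to/large-input").coalesce(240)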

Thank you for your help,
Jestin
