We encountered a similar problem. If all partitions are located on the same
node and every task finishes in under 3 seconds (the locality wait set by
spark.locality.wait, whose default is 3000 ms), the scheduler never falls
back to other nodes and all tasks run on that single node. Our solution is
to use org.apache.hadoop.mapred.lib.CombineTextInputFormat to combine the
many small files into larger input splits.
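A minimal sketch of that approach with the RDD API (the input path, split size, and app name here are illustrative, not from the original job):

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.lib.CombineTextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object CombineSmallFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("combine-small-files"))

    // Cap each combined split at ~128 MB; tune this for your part-file sizes.
    sc.hadoopConfiguration.setLong(
      "mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024)

    // CombineTextInputFormat packs many small files into one split,
    // so you get a few large partitions instead of one tiny partition per file.
    val lines = sc.hadoopFile(
        "/data/output/*",               // hypothetical input path
        classOf[CombineTextInputFormat],
        classOf[LongWritable],
        classOf[Text])
      .map(_._2.toString)

    println(lines.partitions.length)    // far fewer partitions than input files
    sc.stop()
  }
}
```

With fewer, larger partitions each task runs long enough that locality-based scheduling stops mattering as much.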
What is your cluster setup? Are you running a worker on the master node
as well?
1. Spark usually assigns a task to the worker that has the data locally
available. If one worker has enough tasks, then I believe it will start
assigning to others as well. You can control this with the level of
data locality the scheduler waits for (spark.locality.wait and its
per-level variants).
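For example, lowering the locality wait makes the scheduler give up on node-local placement sooner, so idle executors on other nodes pick up tasks. A hedged sketch in spark-defaults.conf (the values are illustrative):

```
# Fall back from node-local to rack-local/any placement immediately,
# instead of waiting the default 3s per locality level.
spark.locality.wait        0s

# Per-level overrides also exist if you only want to relax one level:
spark.locality.wait.node   0s
```

Setting the wait to 0 trades data locality for parallelism, which is usually the right trade when the inputs are tiny.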
I have a job that searches for input recursively and creates a
comma-separated string of pathnames to treat as one input.
The files are part-x files and they are fairly small. The job seems to take
a long time to complete considering the size of the total data (150 MB), and
it only runs on the master machine.
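For reference, the pattern described above can be sketched like this (the paths and the repartition step are illustrative assumptions, not the original code); repartitioning after the read is one way to spread many tiny part-files across all workers:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SmallPartFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("small-part-files"))

    // Hypothetical stand-in for the recursively collected part-file paths,
    // joined into one comma-separated input string.
    val paths = Seq("/out/job1/part-00000", "/out/job2/part-00000").mkString(",")

    val data = sc.textFile(paths)          // typically one partition per small file
      .repartition(sc.defaultParallelism)  // shuffle to spread work across the cluster

    println(data.count())
    sc.stop()
  }
}
```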