Re: Lots of small input files

2014-11-23 Thread Shixiong Zhu
We encountered a similar problem. If all partitions are located on the same node and every task finishes in less than 3 seconds (the locality wait, set by spark.locality.wait; the default value is 3000 ms), all the tasks will run on that single node. Our solution is using org.apache.hadoop.mapred.lib.CombineTextInputFormat to
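CombineTextInputFormat packs many small files into fewer, larger input splits so that each task gets a reasonable amount of data. A minimal Python sketch of the packing idea only (a hypothetical helper for illustration, not the Spark or Hadoop API):

```python
def combine_splits(file_sizes, max_split_size):
    """Greedily pack (path, size) pairs into splits of at most
    max_split_size bytes -- a rough sketch of what Hadoop's
    CombineTextInputFormat does, not its real interface."""
    splits, current, current_size = [], [], 0
    for path, size in file_sizes:
        # Flush the current split once adding this file would overflow it.
        if current and current_size + size > max_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        splits.append(current)
    return splits

# Example: six 25 MB part files packed into 64 MB splits -> 3 splits of 2 files
files = [(f"part-{i:05d}", 25 * 1024 * 1024) for i in range(6)]
print(combine_splits(files, 64 * 1024 * 1024))
```

With one combined split per group, the scheduler sees a few substantial tasks instead of hundreds of tiny ones, which also spreads work across nodes.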

Re: Lots of small input files

2014-11-22 Thread Akhil Das
What is your cluster setup? Are you running a worker on the master node as well? 1. Spark usually assigns a task to the worker that has the data locally available; if one worker has enough tasks, then I believe it will start assigning to others as well. You could control it with the level of
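The locality behavior described here is tunable. Assuming the Spark 1.x defaults in effect at the time (values shown are the documented defaults, in milliseconds), the relevant settings look like:

```
# spark-defaults.conf -- data-locality scheduling knobs (defaults shown)
spark.locality.wait          3000   # how long to wait for a data-local slot
spark.locality.wait.process  3000   # per-level override: process-local
spark.locality.wait.node     3000   # per-level override: node-local
spark.locality.wait.rack     3000   # per-level override: rack-local
```

Lowering spark.locality.wait makes the scheduler give up on locality sooner and spread tasks to other workers.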

Lots of small input files

2014-11-21 Thread Pat Ferrel
I have a job that searches for input recursively and creates a string of pathnames to treat as one input. The files are part-x files and they are fairly small. The job seems to take a long time to complete given the total data size (150m), and it only runs on the master machine.
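The recursive gathering step described above can be sketched as follows (a hypothetical Python helper; the original job's language and exact filtering are not stated in the post):

```python
import os

def gather_part_files(root):
    """Recursively collect Hadoop-style part-* files under root and
    join them into one comma-separated input string, mirroring the
    approach described in the post (hypothetical helper)."""
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.startswith("part-"):  # skip _SUCCESS, logs, etc.
                paths.append(os.path.join(dirpath, name))
    return ",".join(sorted(paths))
```

Note that passing such a comma-separated string to a text-file input still tends to yield one partition per small file, so each task processes very little data; this is why combining inputs (or repartitioning after load) helps.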