Re: Lots of small input files

2014-11-23 Thread Shixiong Zhu
We encountered a similar problem. If all partitions are located on the same
node and every task finishes in under 3 seconds (the threshold set by
`spark.locality.wait`, whose default value is 3000 ms), the scheduler never
waits long enough to give up on data locality, so all of the tasks run on
that single node. Our solution is to use
org.apache.hadoop.mapred.lib.CombineTextInputFormat to combine the small
files into tasks that are big enough. Of course, you can also reduce
`spark.locality.wait`, but that may not be efficient because it still
creates many tiny tasks.
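
For reference, a minimal sketch of that approach (the input path and the
128 MB split cap are hypothetical, and it assumes Hadoop 2's
mapreduce.input.fileinputformat.split.maxsize key and an existing
SparkContext named sc):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.lib.CombineTextInputFormat

// Cap each combined split; without a cap, CombineTextInputFormat may
// pack every small file into a single split (value is in bytes).
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024)

// One task per combined split instead of one task per tiny file.
val lines = sc
  .hadoopFile[LongWritable, Text, CombineTextInputFormat]("hdfs:///data/part-*")
  .map(_._2.toString) // copy out of the reused Text object

This keeps per-task overhead low while still producing enough partitions
to spread across the cluster.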

Best Regards,
Shixiong Zhu


Re: Lots of small input files

2014-11-22 Thread Akhil Das
What is your cluster setup? Are you running a worker on the master node as
well?

1. Spark usually assigns a task to the worker that has the data available
locally. Once one worker has enough tasks, I believe it will start
assigning tasks to the other workers as well. You can influence this with
the level of parallelism (see the sketch below).

2. If you coalesce the input into one partition, then I believe only one
worker will execute that single task.
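
A minimal sketch of raising the level of parallelism (the path and the
partition count of 64 are hypothetical):

// Ask for more input partitions up front (the second argument is the
// minimum number of partitions).
val lines = sc.textFile("hdfs:///data/part-*", 64)

// Or redistribute an existing RDD across the cluster before the map
// stage; repartition triggers a shuffle, so use it deliberately.
val spread = lines.repartition(64)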

Thanks
Best Regards

Lots of small input files

2014-11-21 Thread Pat Ferrel
I have a job that searches for input recursively and creates a string of
pathnames to treat as one input.

The files are part-x files and they are fairly small. The job seems to take
a long time to complete considering the size of the total data (150 MB),
and it runs only on the master machine. The job only does rdd.map-type
operations.
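
For context, a minimal sketch of that kind of setup (the root path is
hypothetical; sc.textFile accepts a comma-separated list of paths):

import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ArrayBuffer

// Recursively collect the part files under a root directory.
val fs = FileSystem.get(sc.hadoopConfiguration)
val files = fs.listFiles(new Path("hdfs:///jobs/output"), true)
val paths = ArrayBuffer[String]()
while (files.hasNext) {
  val f = files.next()
  if (f.getPath.getName.startsWith("part-")) paths += f.getPath.toString
}

// A comma-separated string of pathnames is treated as one input.
val input = sc.textFile(paths.mkString(","))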

1) Why doesn’t it use the other workers in the cluster?
2) Is there a downside to using a lot of small part files? Should I coalesce 
them into one input file?