We encountered a similar problem. If all partitions are located on the same node and every task finishes in less than 3 seconds (controlled by `spark.locality.wait`; the default is 3000 ms), the scheduler never falls back to other nodes, so all tasks run on that single node. Our solution is to use org.apache.hadoop.mapred.lib.CombineTextInputFormat to pack the small files into tasks that are big enough. Of course, you can also reduce `spark.locality.wait`, but that may not be efficient because it still creates many tiny tasks.
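For reference, a minimal sketch of the CombineTextInputFormat approach, assuming the old (`mapred`) Hadoop API as named above; the input path, split-size value, and the `mapred.max.split.size` key are illustrative assumptions, not taken from the original thread:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
import org.apache.hadoop.mapred.lib.CombineTextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object CombineSmallFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("combine-small-files"))

    // Hypothetical input path; point this at your directory of part-xxxxx files.
    val jobConf = new JobConf(sc.hadoopConfiguration)
    FileInputFormat.setInputPaths(jobConf, "hdfs:///data/parts")

    // Pack many small files into combined splits of up to ~64 MB each
    // (the key and size here are assumptions; adjust for your Hadoop version).
    jobConf.setLong("mapred.max.split.size", 64L * 1024 * 1024)

    val lines = sc
      .hadoopRDD(jobConf, classOf[CombineTextInputFormat],
        classOf[LongWritable], classOf[Text])
      .map(_._2.toString)

    // Far fewer, larger partitions than one-per-small-file.
    println(s"partitions: ${lines.partitions.length}")
    sc.stop()
  }
}
```

If you instead want to try shrinking the locality wait, it can be passed at submit time, e.g. `--conf spark.locality.wait=500ms`, at the cost of the many-tiny-tasks overhead mentioned above.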
Best Regards,
Shixiong Zhu

2014-11-22 17:17 GMT+08:00 Akhil Das <ak...@sigmoidanalytics.com>:

> What is your cluster setup? Are you running a worker on the master node
> also?
>
> 1. Spark usually assigns a task to the worker that has the data locally
> available. If one worker has enough tasks, then I believe it will start
> assigning to others as well. You can control this with the level of
> parallelism.
>
> 2. If you coalesce it into one partition, then I believe only one of the
> workers will execute the single task.
>
> Thanks
> Best Regards
>
> On Fri, Nov 21, 2014 at 9:49 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
>> I have a job that searches for input recursively and creates a string of
>> pathnames to treat as one input.
>>
>> The files are part-xxxxx files and they are fairly small. The job seems
>> to take a long time to complete considering the size of the total data
>> (150m) and only runs on the master machine. The job only does rdd.map-type
>> operations.
>>
>> 1) Why doesn't it use the other workers in the cluster?
>> 2) Is there a downside to using a lot of small part files? Should I
>> coalesce them into one input file?
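On the second question in the quoted thread: many tiny part files mean many tiny tasks, where per-task scheduling overhead can dominate the actual work. A common middle ground is to read everything and then reduce the partition count, rather than coalescing all the way down to one. A hedged sketch, where the glob path and the target of 16 partitions are illustrative assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RepartitionParts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repartition-parts"))

    // Hypothetical recursive glob over small part-xxxxx files.
    val raw = sc.textFile("hdfs:///data/*/part-*")
    println(s"input partitions: ${raw.partitions.length}")

    // coalesce(n) shrinks the partition count without a shuffle;
    // coalescing to 1 would serialize all work onto a single task,
    // which is what the original poster wants to avoid.
    val compact = raw.coalesce(16)

    // Example rdd.map-style work over the compacted partitions.
    val totalChars = compact.map(_.length.toLong).reduce(_ + _)
    println(s"total chars: $totalChars")
    sc.stop()
  }
}
```

Note that `coalesce` can only decrease the partition count without a shuffle; to spread already-loaded data across more workers, `repartition(n)` (which shuffles) is the usual choice.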