Re: Lots of small input files

2014-11-23 Thread Shixiong Zhu
We encountered a similar problem. If all partitions are located on the same
node and every task finishes in less than 3 seconds (the delay set by
"spark.locality.wait", whose default value is 3000 ms), the tasks will keep
running on that single node. Our solution is to
use org.apache.hadoop.mapred.lib.CombineTextInputFormat to create tasks that
are big enough. Of course, you can also reduce `spark.locality.wait`, but that
may not be efficient because it still creates many tiny tasks.
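
A rough, untested sketch of that approach (the SparkContext `sc`, the input
path, and the 128 MB split cap are assumptions, not from your job):

  import org.apache.hadoop.io.{LongWritable, Text}
  import org.apache.hadoop.mapred.lib.CombineTextInputFormat

  // Pack many small files into each split; 128 MB here is just an assumed cap.
  sc.hadoopConfiguration.setLong(
    "mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024)

  // "/data/parts" is a hypothetical directory holding the small part files.
  val lines = sc
    .hadoopFile[LongWritable, Text, CombineTextInputFormat]("/data/parts/part-*")
    .map { case (_, text) => text.toString }

  // Alternative mentioned above: lower the locality wait instead, e.g.
  // conf.set("spark.locality.wait", "0"), at the cost of still having tiny tasks.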

Best Regards,
Shixiong Zhu

2014-11-22 17:17 GMT+08:00 Akhil Das :

> What is your cluster setup? Are you also running a worker on the master
> node?
>
> 1. Spark usually assigns a task to the worker that has the data locally
> available. Once one worker has enough tasks, I believe Spark will start
> assigning tasks to the others as well. You can also control this through
> the level of parallelism.
>
> 2. If you coalesce the data into one partition, then I believe only one
> worker will execute that single task.
>
> Thanks
> Best Regards
>
> On Fri, Nov 21, 2014 at 9:49 PM, Pat Ferrel  wrote:
>
>> I have a job that searches for input recursively and creates a string of
>> pathnames to treat as one input.
>>
>> The files are part-x files and they are fairly small. The job seems to
>> take a long time to complete given the total data size (150m), and it
>> runs only on the master machine. The job only does rdd.map-type
>> operations.
>>
>> 1) Why doesn’t it use the other workers in the cluster?
>> 2) Is there a downside to using a lot of small part files? Should I
>> coalesce them into one input file?
>>
>>
>


Re: Lots of small input files

2014-11-22 Thread Akhil Das
What is your cluster setup? Are you also running a worker on the master
node?

1. Spark usually assigns a task to the worker that has the data locally
available. Once one worker has enough tasks, I believe Spark will start
assigning tasks to the others as well. You can also control this through
the level of parallelism (see the sketch after this list).

2. If you coalesce the data into one partition, then I believe only one
worker will execute that single task.
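
For example (an untested sketch; the path and partition counts are made up):

  // More input splits up front, or a repartition to spread work across workers.
  val rdd = sc.textFile("hdfs:///data/parts/part-*", minPartitions = 64)
  val spread = rdd.repartition(64)   // shuffles, but distributes the tasks

  // Point 2: coalescing to a single partition means one task on one worker.
  val single = rdd.coalesce(1)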

Thanks
Best Regards

On Fri, Nov 21, 2014 at 9:49 PM, Pat Ferrel  wrote:

> I have a job that searches for input recursively and creates a string of
> pathnames to treat as one input.
>
> The files are part-x files and they are fairly small. The job seems to
> take a long time to complete given the total data size (150m), and it
> runs only on the master machine. The job only does rdd.map-type
> operations.
>
> 1) Why doesn’t it use the other workers in the cluster?
> 2) Is there a downside to using a lot of small part files? Should I
> coalesce them into one input file?
>
>


Lots of small input files

2014-11-21 Thread Pat Ferrel
I have a job that searches for input recursively and creates a string of 
pathnames to treat as one input. 
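
(Roughly this pattern; an untested sketch with a made-up root directory:)

  import org.apache.hadoop.fs.{FileSystem, Path}

  val fs = FileSystem.get(sc.hadoopConfiguration)

  // Recursively collect part-* files under a root directory.
  def listParts(dir: Path): Seq[String] =
    fs.listStatus(dir).toSeq.flatMap { status =>
      if (status.isDirectory) listParts(status.getPath)
      else if (status.getPath.getName.startsWith("part-")) Seq(status.getPath.toString)
      else Seq.empty
    }

  // sc.textFile accepts a comma-separated list of paths as one input.
  val data = sc.textFile(listParts(new Path("hdfs:///data/output")).mkString(","))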

The files are part-x files and they are fairly small. The job seems to take
a long time to complete given the total data size (150m), and it runs only
on the master machine. The job only does rdd.map-type operations.

1) Why doesn’t it use the other workers in the cluster?
2) Is there a downside to using a lot of small part files? Should I coalesce 
them into one input file?