Re: Lots of small input files
We encountered a similar problem. If all partitions are located on the same node and every task runs in less than 3 seconds (the threshold set by "spark.locality.wait"; the default value is 3000 ms), all of the tasks will run on that single node. Our solution is to use org.apache.hadoop.mapred.lib.CombineTextInputFormat to create tasks that are big enough (a sketch follows below). Of course, you can reduce `spark.locality.wait`, but that may not be efficient because it still creates many tiny tasks.

Best Regards,
Shixiong Zhu

2014-11-22 17:17 GMT+08:00 Akhil Das:
[quoted reply trimmed; see the full message below in this thread.]
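A minimal sketch of the CombineTextInputFormat approach, assuming a Spark 1.x-era Scala job; the app name, the HDFS glob, and the 128 MB split cap are illustrative assumptions, not from the thread:

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.lib.CombineTextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf().setAppName("combine-small-files")
  // Optional: lower the locality wait (milliseconds in Spark 1.x,
  // default 3000) so tasks spill over to other nodes sooner.
  .set("spark.locality.wait", "500")
val sc = new SparkContext(conf)

// Cap each combined split at 128 MB (the value is in bytes). This key name
// assumes Hadoop 2.x; older releases use "mapred.max.split.size".
sc.hadoopConfiguration.setLong(
  "mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024)

// "hdfs:///data/part-*" is a hypothetical glob for the many small files.
// Hadoop reuses the Text object across records, so convert to String right away.
val lines = sc
  .hadoopFile("hdfs:///data/part-*",
    classOf[CombineTextInputFormat], classOf[LongWritable], classOf[Text])
  .map { case (_, text) => text.toString }

println(s"partitions: ${lines.partitions.length}") // far fewer than the file count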
Re: Lots of small input files
What is your cluster setup? Are you running a worker on the master node as well?

1. Spark usually assigns a task to the worker that has the data locally available. If one worker has enough tasks, then I believe it will start assigning tasks to the others as well. You can control this with the level of parallelism (see the sketch after this list).

2. If you coalesce the data into one partition, then I believe only one worker will execute the single task.

Thanks
Best Regards

On Fri, Nov 21, 2014 at 9:49 PM, Pat Ferrel wrote:
[quoted message trimmed; see the original message below in this thread.]
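A minimal sketch of both points, assuming an existing SparkContext `sc` and a hypothetical input glob:

// One partition is created per small file (or per block), so hundreds of
// small part files mean hundreds of tiny tasks.
val raw = sc.textFile("hdfs:///data/part-*")

// Point 1: set the parallelism explicitly instead of relying on the file
// layout, so tasks get spread across all workers.
val spread = raw.repartition(sc.defaultParallelism)

// Point 2: coalesce(1) collapses the data into a single partition, so one
// task on one worker processes everything.
val single = raw.coalesce(1)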
Lots of small input files
I have a job that searches for input recursively and creates a string of pathnames to treat as one input.

The files are part-x files and they are fairly small. The job seems to take a long time to complete considering the size of the total data (150 MB), and it runs only on the master machine. The job only does rdd.map-type operations.

1) Why doesn’t it use the other workers in the cluster?
2) Is there a downside to using a lot of small part files? Should I coalesce them into one input file?
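For reference, a sketch of the setup described above, assuming an existing SparkContext `sc`; the root path, the part-file filter, and the map function are illustrative assumptions:

import org.apache.hadoop.fs.{FileSystem, Path}

// Collect part-file paths under a hypothetical root directory. listStatus
// is not recursive; a real job would walk subdirectories as needed.
val fs = FileSystem.get(sc.hadoopConfiguration)
val parts = fs.listStatus(new Path("hdfs:///data"))
  .map(_.getPath.toString)
  .filter(_.contains("part-"))

// textFile accepts a comma-separated list of paths, so all of the
// collected files can be read as one input.
val input = sc.textFile(parts.mkString(","))
val result = input.map(_.trim) // stand-in for the job's rdd.map work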