We encountered a similar problem. If all partitions are located on the same node and every task finishes in less than 3 seconds (controlled by `spark.locality.wait`; the default is 3000 ms), the scheduler never falls back to other nodes, so all tasks run on that single node. Our solution is to use org.apache.hadoop.mapred.lib.CombineTextInputFormat to pack the small files into tasks that are big enough. Of course, you can also reduce `spark.locality.wait`, but that may not be efficient because it still creates many tiny tasks.
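For reference, a minimal sketch of the CombineTextInputFormat approach, assuming the old (`mapred`) Hadoop API as named above; the input path, split-size value, and the `mapred.max.split.size` key are illustrative assumptions, not taken from the original thread:

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}
import org.apache.hadoop.mapred.lib.CombineTextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

object CombineSmallFiles {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("combine-small-files"))

    // Hypothetical input path; point this at your directory of part-xxxxx files.
    val jobConf = new JobConf(sc.hadoopConfiguration)
    FileInputFormat.setInputPaths(jobConf, "hdfs:///data/parts")

    // Pack many small files into combined splits of up to ~64 MB each
    // (the key and size here are assumptions; adjust for your Hadoop version).
    jobConf.setLong("mapred.max.split.size", 64L * 1024 * 1024)

    val lines = sc
      .hadoopRDD(jobConf, classOf[CombineTextInputFormat],
        classOf[LongWritable], classOf[Text])
      .map(_._2.toString)

    // Far fewer, larger partitions than one-per-small-file.
    println(s"partitions: ${lines.partitions.length}")
    sc.stop()
  }
}
```

If you instead want to try shrinking the locality wait, it can be passed at submit time, e.g. `--conf spark.locality.wait=500ms`, at the cost of the many-tiny-tasks overhead mentioned above.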
Best Regards,
Shixiong Zhu

2014-11-22 17:17 GMT+08:00 Akhil Das <ak...@sigmoidanalytics.com>:

> What is your cluster setup? Are you running a worker on the master node
> also?
>
> 1. Spark usually assigns a task to the worker that has the data locally
> available. If one worker has enough tasks, then I believe it will start
> assigning to others as well. You can control this with the level of
> parallelism.
>
> 2. If you coalesce it into one partition, then I believe only one of the
> workers will execute the single task.
>
> Thanks
> Best Regards
>
> On Fri, Nov 21, 2014 at 9:49 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
>> I have a job that searches for input recursively and creates a string of
>> pathnames to treat as one input.
>>
>> The files are part-xxxxx files and they are fairly small. The job seems
>> to take a long time to complete considering the size of the total data
>> (150m) and only runs on the master machine. The job only does rdd.map-type
>> operations.
>>
>> 1) Why doesn't it use the other workers in the cluster?
>> 2) Is there a downside to using a lot of small part files? Should I
>> coalesce them into one input file?
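On the second question in the quoted thread: many tiny part files mean many tiny tasks, where per-task scheduling overhead can dominate the actual work. A common middle ground is to read everything and then reduce the partition count, rather than coalescing all the way down to one. A hedged sketch, where the glob path and the target of 16 partitions are illustrative assumptions:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RepartitionParts {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("repartition-parts"))

    // Hypothetical recursive glob over small part-xxxxx files.
    val raw = sc.textFile("hdfs:///data/*/part-*")
    println(s"input partitions: ${raw.partitions.length}")

    // coalesce(n) shrinks the partition count without a shuffle;
    // coalescing to 1 would serialize all work onto a single task,
    // which is what the original poster wants to avoid.
    val compact = raw.coalesce(16)

    // Example rdd.map-style work over the compacted partitions.
    val totalChars = compact.map(_.length.toLong).reduce(_ + _)
    println(s"total chars: $totalChars")
    sc.stop()
  }
}
```

Note that `coalesce` can only decrease the partition count without a shuffle; to spread already-loaded data across more workers, `repartition(n)` (which shuffles) is the usual choice.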