I am using Spark Streaming to create a large volume of small nano-batch input
files: thousands of 'part-xxxxx' files, ~4k per file. When I read the nano-batch
files and run a distributed calculation, my tasks run only on the machine
where the job was launched. I'm launching in "yarn-client" mode. The RDD is
created using sc.textFile("list of thousand files").

What would cause the read to occur only on the machine that launched the 

Do I need to do something to the RDD after reading? Has some partition factor
been applied to all derived RDDs?
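
For reference, here is a minimal sketch of how the partitioning of such an RDD
could be inspected and, if needed, changed after the read. It assumes the
PySpark API (the post does not say which language is in use) and an existing
SparkContext passed in as sc. The helper names join_paths and
read_small_files and the target_partitions parameter are hypothetical. It
only shows where the partition count can be checked and does not diagnose why
the tasks stay on one machine.

```python
def join_paths(paths):
    """Build the comma-separated path string that sc.textFile accepts."""
    return ",".join(paths)


def read_small_files(sc, paths, target_partitions=None):
    """Read many small text files into one RDD and optionally repartition it.

    sc is an existing SparkContext (assumed); paths is a list of file paths.
    """
    # textFile takes a single string; several paths are separated by commas.
    rdd = sc.textFile(join_paths(paths))
    print("partitions after read:", rdd.getNumPartitions())

    if target_partitions is not None:
        # repartition shuffles the data into the requested number of partitions.
        rdd = rdd.repartition(target_partitions)
        print("partitions after repartition:", rdd.getNumPartitions())
    return rdd
```

A call might look like read_small_files(sc, part_files, target_partitions=64),
where part_files and the value 64 are placeholders rather than recommendations.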