Where are the file splits? Meaning, is it possible they were available on only one node, and that node was also your driver?
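One quick way to check is to count how many partitions each executor host actually reads. A minimal sketch, assuming "sc" is your SparkContext and "source" is the same comma-delimited file list from your snippet below (the variable names are just placeholders):

    import java.net.InetAddress

    // Tag every partition with the hostname of the executor that reads it,
    // then count partitions per host. If everything comes back from one host,
    // the splits (or the data itself) are local to that machine.
    val hostCounts = sc.textFile(source)
      .mapPartitions { _ => Iterator((InetAddress.getLocalHost.getHostName, 1)) }
      .reduceByKey(_ + _)
      .collect()

    hostCounts.foreach { case (host, n) => println(s"$host read $n partitions") }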
On Thu, Apr 23, 2015 at 1:21 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> Sure.
>
>     var columns = mc.textFile(source).map { line => line.split(delimiter) }
>
> Here “source” is a comma-delimited list of files or directories. Both the
> textFile and .map tasks happen only on the machine they were launched from.
>
> Later, other distributed operations happen, but I suspect that if I can
> figure out why the first line runs only on the client machine, the rest
> will clear up too. Here are some subsequent lines:
>
>     if (filterColumn != -1) {
>       columns = columns.filter { tokens => tokens(filterColumn) == filterBy }
>     }
>
>     val interactions = columns.map { tokens =>
>       tokens(rowIDColumn) -> tokens(columnIDPosition)
>     }
>
>     interactions.cache()
>
> On Apr 23, 2015, at 10:14 AM, Jeetendra Gangele <gangele...@gmail.com> wrote:
>
> Will you be able to paste the code here?
>
> On 23 April 2015 at 22:21, Pat Ferrel <p...@occamsmachete.com> wrote:
>>
>> I’m using Spark Streaming to create a large volume of small nano-batch
>> input files, ~4k per file, thousands of ‘part-xxxxx’ files. When reading
>> the nano-batch files and doing a distributed calculation, my tasks run
>> only on the machine where the job was launched. I’m launching in
>> “yarn-client” mode. The RDD is created using sc.textFile(“list of
>> thousand files”).
>>
>> What would cause the read to occur only on the machine that launched the
>> driver?
>>
>> Do I need to do something to the RDD after reading? Has some partition
>> factor been applied to all derived RDDs?
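For what it's worth, if the splits really do all live on the driver node (for example, if the part files were written to local disk rather than HDFS), one workaround is to ask textFile for more input splits and/or force a shuffle so downstream tasks can run anywhere. A hedged sketch using the names from the snippet above; the partition count of 200 is arbitrary:

    // The second argument to textFile is a minimum partition count.
    var columns = mc.textFile(source, 200).map { line => line.split(delimiter) }

    // repartition forces a shuffle, so later stages are no longer tied to
    // wherever the original blocks happened to be read.
    columns = columns.repartition(200)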