Sure

    // read the input files and split each line into fields
    var columns = mc.textFile(source).map { line => line.split(delimiter) }

Here “source” is a comma-delimited list of files or directories. Both the 
textFile and the .map tasks happen only on the machine they were launched from.
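
A quick first check (not in the snippet above) is how many partitions that 
read actually produced; a single partition would by itself explain everything 
running on one machine. A minimal sketch, assuming “mc” is the SparkContext 
and “source”/“delimiter” are defined as above:

    // Sketch: count the partitions the read produced. With thousands of
    // small input files this should be in the thousands, not 1.
    val check = mc.textFile(source).map { line => line.split(delimiter) }
    println(s"partitions: ${check.partitions.length}")

In general textFile yields at least one partition per input file, so a very 
small count here would point at the file listing itself.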

Later, other distributed operations happen, but I suspect that if I can figure 
out why the first line runs only on the client machine, the rest will clear up 
too. Here are some subsequent lines.

    // optionally keep only the rows whose filterColumn matches filterBy
    if (filterColumn != -1) {
      columns = columns.filter { tokens => tokens(filterColumn) == filterBy }
    }

    // pair each row ID with its column ID
    val interactions = columns.map { tokens =>
      tokens(rowIDColumn) -> tokens(columnIDPosition)
    }

    // keep the pairs in memory for the operations that follow
    interactions.cache()
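
If the partition count really is too low, an explicit repartition before the 
cache (a sketch, not part of the original code) would spread the pairs across 
the cluster:

    // Sketch: redistribute across the cluster before caching.
    // 128 is a hypothetical target, e.g. 2-4x the total executor cores.
    val spread = interactions.repartition(128)
    spread.cache()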

On Apr 23, 2015, at 10:14 AM, Jeetendra Gangele <gangele...@gmail.com> wrote:

Will you be able to paste code here?

On 23 April 2015 at 22:21, Pat Ferrel <p...@occamsmachete.com> wrote:
Using Spark Streaming to create a large volume of small nano-batch input files, 
~4k per file, thousands of ‘part-xxxxx’ files. When reading the nano-batch 
files and doing a distributed calculation, my tasks run only on the machine 
where the job was launched. I’m launching in “yarn-client” mode. The RDD is 
created using sc.textFile(“list of thousand files”).

What would cause the read to occur only on the machine that launched the driver?

Do I need to do something to the RDD after reading? Has some partitioning 
factor been applied to all derived RDDs?
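
One knob worth knowing about (an aside about stock Spark, not something from 
the thread): sc.textFile takes a minPartitions hint, so the read can be asked 
for more splits up front:

    // Sketch: request a minimum partition count at read time.
    // "fileList" is a hypothetical stand-in for the comma-separated
    // list of part files.
    val rdd = sc.textFile(fileList, minPartitions = 1024)

Note that minPartitions only splits the input further; it does not change 
where the tasks prefer to run, so if everything still pins to the driver 
machine, the file listing or input format deserves a look.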