Where are the file splits? Meaning, is it possible they were available on only one node, and that node was also your driver?
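One quick way to check is to count how many partitions each executor host actually reads. A minimal sketch, assuming "sc" is your SparkContext and "source" is the same comma-delimited file list from your snippet below (the variable names are just placeholders):

    import java.net.InetAddress

    // Tag every partition with the hostname of the executor that reads it,
    // then count partitions per host. If everything comes back from one host,
    // the splits (or the data itself) are local to that machine.
    val hostCounts = sc.textFile(source)
      .mapPartitions { _ => Iterator((InetAddress.getLocalHost.getHostName, 1)) }
      .reduceByKey(_ + _)
      .collect()

    hostCounts.foreach { case (host, n) => println(s"$host read $n partitions") }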
On Thu, Apr 23, 2015 at 1:21 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

> Sure.
>
>     var columns = mc.textFile(source).map { line => line.split(delimiter) }
>
> Here “source” is a comma-delimited list of files or directories. Both the
> textFile and .map tasks happen only on the machine they were launched from.
>
> Later, other distributed operations happen, but I suspect that if I can
> figure out why the first line runs only on the client machine, the rest
> will clear up too. Here are some subsequent lines:
>
>     if (filterColumn != -1) {
>       columns = columns.filter { tokens => tokens(filterColumn) == filterBy }
>     }
>
>     val interactions = columns.map { tokens =>
>       tokens(rowIDColumn) -> tokens(columnIDPosition)
>     }
>
>     interactions.cache()
>
> On Apr 23, 2015, at 10:14 AM, Jeetendra Gangele <gangele...@gmail.com> wrote:
>
> Will you be able to paste the code here?
>
> On 23 April 2015 at 22:21, Pat Ferrel <p...@occamsmachete.com> wrote:
>>
>> I’m using Spark Streaming to create a large volume of small nano-batch
>> input files, ~4k per file, thousands of ‘part-xxxxx’ files. When reading
>> the nano-batch files and doing a distributed calculation, my tasks run
>> only on the machine where the job was launched. I’m launching in
>> “yarn-client” mode. The RDD is created using sc.textFile(“list of
>> thousand files”).
>>
>> What would cause the read to occur only on the machine that launched the
>> driver?
>>
>> Do I need to do something to the RDD after reading? Has some partition
>> factor been applied to all derived RDDs?
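For what it's worth, if the splits really do all live on the driver node (for example, if the part files were written to local disk rather than HDFS), one workaround is to ask textFile for more input splits and/or force a shuffle so downstream tasks can run anywhere. A hedged sketch using the names from the snippet above; the partition count of 200 is arbitrary:

    // The second argument to textFile is a minimum partition count.
    var columns = mc.textFile(source, 200).map { line => line.split(delimiter) }

    // repartition forces a shuffle, so later stages are no longer tied to
    // wherever the original blocks happened to be read.
    columns = columns.repartition(200)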