Physically? I'm not sure; they were written using the nano-batch RDDs in a 
streaming job that runs in a separate driver. The job is a Kafka consumer. 

Would that affect all derived RDDs? If so, is there something I can do to mix it 
up, or does Spark know best about execution speed here?
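For example, would something like this spread the work out? (Just a sketch; the
RDD name comes from the code quoted below, and the partition count is an
arbitrary guess.)

   // hypothetical: force a shuffle so derived partitions land on all executors
   val spread = columns.repartition(16)
   spread.cache()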


On Apr 23, 2015, at 10:23 AM, Sean Owen <so...@cloudera.com> wrote:

Where are the file splits? Meaning, is it possible they were also
(only) available on one node, and that node was also your driver?

On Thu, Apr 23, 2015 at 1:21 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> Sure
> 
>    var columns = mc.textFile(source).map { line => line.split(delimiter) }
> 
> Here “source” is a comma-delimited list of files or directories. Both the
> textFile and .map tasks happen only on the machine they were launched from.
> 
> Later, other distributed operations happen, but I suspect that if I can figure
> out why the first line runs only on the client machine, the rest will clear up
> too. Here are some subsequent lines.
> 
>    if (filterColumn != -1) {
>      columns = columns.filter { tokens => tokens(filterColumn) == filterBy }
>    }
> 
>    val interactions = columns.map { tokens =>
>      tokens(rowIDColumn) -> tokens(columnIDPosition)
>    }
> 
>    interactions.cache()
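> 
> As a sanity check (just a diagnostic sketch against the RDDs above), I can
> print how many partitions were created and where Spark prefers to place them:
> 
>    // diagnostic only: partition count and preferred host locations
>    println(interactions.partitions.length)
>    interactions.partitions.take(3).foreach { p =>
>      println(interactions.preferredLocations(p))
>    }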
> 
> On Apr 23, 2015, at 10:14 AM, Jeetendra Gangele <gangele...@gmail.com>
> wrote:
> 
> Would you be able to paste the code here?
> 
> On 23 April 2015 at 22:21, Pat Ferrel <p...@occamsmachete.com> wrote:
>> 
>> Using Spark Streaming to create a large volume of small nano-batch input
>> files, ~4k per file, thousands of ‘part-xxxxx’ files. When reading the
>> nano-batch files and doing a distributed calculation, my tasks run only on
>> the machine where the job was launched. I’m launching in “yarn-client” mode.
>> The RDD is created using sc.textFile(“list of thousand files”).
>> 
>> What would cause the read to occur only on the machine that launched the
>> driver?
>> 
>> Do I need to do something to the RDD after reading? Has some partition
>> factor been applied to all derived RDDs?
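>> 
>> For what it's worth, one thing I could try (sketch only; the minPartitions
>> value is an arbitrary guess) is asking textFile for more splits up front:
>> 
>>    // hypothetical: request a minimum number of input partitions
>>    val rdd = sc.textFile("list of thousand files", minPartitions = 64)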

