Then every worker would have to hold the whole RDD in memory, which
has some significant drawbacks. As long as every task can run local to
its partition, additional copies of the data don't improve locality.
And in general you need far fewer than N copies of the data (one per
worker) to get there.

If you have some data that you want to access locally, in memory, on
every worker, consider just broadcasting it instead.
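
For what it's worth, here is a minimal PySpark sketch of the broadcast
approach, assuming the data you want on every worker is a small lookup
table. The names join_with_lookup, run_with_spark, countries and orders
are made up for illustration and are not from your job.

```python
from typing import Dict, Iterable, Iterator, Tuple


def join_with_lookup(
    records: Iterable[Tuple[str, int]],
    lookup: Dict[str, str],
) -> Iterator[Tuple[str, int, str]]:
    """Map-side join: attach the lookup value to each record whose key is present."""
    for key, amount in records:
        if key in lookup:
            yield key, amount, lookup[key]


def run_with_spark() -> None:
    # Requires pyspark; imported here so the helper above stays usable without it.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "broadcast-sketch")
    try:
        # Hypothetical small dataset, made available read-only on each executor.
        countries = sc.broadcast({"de": "Germany", "uk": "United Kingdom"})
        # Hypothetical larger dataset, partitioned across the cluster.
        orders = sc.parallelize([("de", 10), ("uk", 7), ("fr", 3)], 2)

        # Each partition is joined against the local broadcast copy,
        # so this step needs no shuffle.
        joined = orders.mapPartitions(
            lambda part: join_with_lookup(part, countries.value)
        )
        print(joined.collect())
    finally:
        sc.stop()
```

Calling run_with_spark() on a machine with pyspark installed runs the
join in local mode. This only makes sense when the broadcast data is
small enough to fit comfortably in each executor's memory; for two large
RDDs a regular join is still the way to go.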

On Wed, Feb 25, 2015 at 10:47 AM, Marius Soutier <mps....@gmail.com> wrote:
> Yes. Effectively, could it avoid network transfers? Or put differently, would 
> an option like persist(MEMORY_ALL) improve job speed by caching an RDD on 
> every worker?
>
>> On 25.02.2015, at 11:42, Sean Owen <so...@cloudera.com> wrote:
>>
>> If you mean, can both copies of the blocks be used for computations?
>> Yes, they can.
>>
>> On Wed, Feb 25, 2015 at 10:36 AM, Marius Soutier <mps....@gmail.com> wrote:
>>> Hi,
>>>
>>> Just a quick question about calling persist with the _2 option. Is the 2x
>>> replication only useful for fault tolerance, or will it also increase job
>>> speed by avoiding network transfers, assuming I’m doing joins or other
>>> shuffle operations?
>>>
>>> Thanks
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>
