The re-use of shuffle files is always a nice surprise to me.

On Sun, Jan 17, 2016 at 3:17 PM, Mark Hamstra <m...@clearstorydata.com>
wrote:

> Same SparkContext means same pool of Workers.  It's up to the Scheduler,
> not the SparkContext, whether the exact same Workers or Executors will be
> used to calculate simultaneous actions against the same RDD.  It is likely
> that many of the same Workers and Executors will be used as the Scheduler
> tries to preserve data locality, but that is not guaranteed.  In fact, what
> is most likely to happen is that the shared Stages and Tasks being
> calculated for the simultaneous actions will not actually be run at exactly
> the same time, which means that shuffle files produced for one action will
> be reused by the other(s), and repeated calculations will be avoided even
> without explicitly caching/persisting the RDD.
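
A minimal sketch of the re-use Mark describes, assuming the Scala API; the object name, the local[4] master, and the numbers are made up purely for illustration:

    import org.apache.spark.{SparkConf, SparkContext}

    object ShuffleReuseSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("shuffle-reuse-sketch").setMaster("local[4]"))

        // A transformation with a shuffle dependency; deliberately not cached.
        val counts = sc.parallelize(1 to 1000000)
          .map(i => (i % 100, 1))
          .reduceByKey(_ + _)

        // First action: runs the map-side stage and writes shuffle files.
        val total = counts.values.sum()

        // Second action on the same RDD, still without caching: the scheduler
        // sees that the map-side stage's shuffle output already exists, skips
        // that stage, and only the reduce-side stage runs again.
        val frequentKeys = counts.filter(_._2 > 5000).count()

        println(s"total=$total, frequentKeys=$frequentKeys")
        sc.stop()
      }
    }

In the web UI the second job should show the map-side stage as "skipped", which is the behaviour described above.
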
>
> On Sun, Jan 17, 2016 at 8:06 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> Same RDD means the same SparkContext, which means the same pool of Workers.
>>
>> Cache/persist the RDD to avoid repeated jobs.
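
Concretely, the cache/persist approach Koert mentions looks roughly like this (a sketch only; the input path "data.txt" and the chosen storage level are placeholders, not anything from the original poster's job):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object PersistSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("persist-sketch").setMaster("local[2]"))

        // "data.txt" is a placeholder input path.
        val words = sc.textFile("data.txt").flatMap(_.split("\\s+"))

        // Persist before the first action so later actions reuse the cached
        // partitions instead of recomputing the whole lineage.
        words.persist(StorageLevel.MEMORY_AND_DISK)

        val n      = words.count()   // first action materializes and caches the partitions
        val sample = words.take(10)  // later actions are served from the cache where possible

        println(s"n=$n, sample=${sample.mkString(", ")}")

        words.unpersist()            // release the cached blocks once no more actions need them
        sc.stop()
      }
    }
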
>> On Jan 17, 2016 5:21 AM, "Mennour Rostom" <mennou...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Thank you all for your answers,
>>>
>>> If I understand correctly, actions (in my case foreach) can be run
>>> concurrently and simultaneously on the SAME RDD (which is logical, because
>>> RDDs are read-only objects). However, are the same workers used for the
>>> concurrent analysis?
>>>
>>> Thank you
>>>
>>> 2016-01-15 21:11 GMT+01:00 Jakob Odersky <joder...@gmail.com>:
>>>
>>>> I stand corrected. How considerable are the benefits though? Will the
>>>> scheduler be able to dispatch jobs from both actions simultaneously (or on
>>>> a when-workers-become-available basis)?
>>>>
>>>> On 15 January 2016 at 11:44, Koert Kuipers <ko...@tresata.com> wrote:
>>>>
>>>>> We run multiple actions on the same (cached) RDD all the time, and yes,
>>>>> in different threads indeed (it's in Akka).
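
A stripped-down version of that pattern, using plain Scala Futures rather than Akka (the object name and numbers are invented for the sketch): each thread submits its own action against the shared cached RDD, and the scheduler interleaves the resulting jobs.

    import org.apache.spark.{SparkConf, SparkContext}

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    object ConcurrentActionsSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("concurrent-actions-sketch").setMaster("local[4]"))

        // Cache so both actions share the materialized partitions.
        val rdd = sc.parallelize(1 to 1000000).map(_ * 2).cache()

        // Each Future invokes an action from its own thread, so the two jobs
        // are submitted concurrently and the scheduler interleaves their tasks.
        val countF = Future { rdd.count() }
        val sumF   = Future { rdd.sum() }

        val count = Await.result(countF, 10.minutes)
        val sum   = Await.result(sumF, 10.minutes)

        println(s"count=$count, sum=$sum")
        sc.stop()
      }
    }

With the default FIFO scheduler the first submitted job's stages get priority for free task slots; setting spark.scheduler.mode to FAIR lets the concurrent jobs share the executors more evenly.
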
>>>>>
>>>>> On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <
>>>>> matei.zaha...@gmail.com> wrote:
>>>>>
>>>>>> RDDs actually are thread-safe, and quite a few applications use them
>>>>>> this way, e.g. the JDBC server.
>>>>>>
>>>>>> Matei
>>>>>>
>>>>>> On Jan 15, 2016, at 2:10 PM, Jakob Odersky <joder...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> I don't think RDDs are thread-safe.
>>>>>> More fundamentally, however, why would you want to run RDD actions in
>>>>>> parallel? The idea behind RDDs is to provide you with an abstraction for
>>>>>> computing parallel operations on distributed data. Even if you were to call
>>>>>> actions from several threads at once, the individual executors of your
>>>>>> Spark environment would still have to perform operations sequentially.
>>>>>>
>>>>>> As an alternative, I would suggest restructuring your RDD
>>>>>> transformations to compute the required results in a single operation.
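
For instance (a generic sketch, not the original poster's computation): instead of running count(), sum() and max() as three separate actions over the same data, a single aggregate() can produce all three in one pass.

    import org.apache.spark.{SparkConf, SparkContext}

    object SinglePassSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(
          new SparkConf().setAppName("single-pass-sketch").setMaster("local[2]"))

        val data = sc.parallelize(Seq(3.0, 1.0, 4.0, 1.0, 5.0, 9.0))

        // One action computes count, sum and max together.
        val (count, sum, max) = data.aggregate((0L, 0.0, Double.MinValue))(
          // fold one value into a partition-local accumulator
          (acc, x) => (acc._1 + 1, acc._2 + x, math.max(acc._3, x)),
          // merge two partition-local accumulators
          (a, b) => (a._1 + b._1, a._2 + b._2, math.max(a._3, b._3))
        )

        println(s"count=$count, sum=$sum, max=$max")
        sc.stop()
      }
    }
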
>>>>>>
>>>>>> On 15 January 2016 at 06:18, Jonathan Coveney <jcove...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Threads
>>>>>>>
>>>>>>>
>>>>>>> On Friday, January 15, 2016, Kira <mennou...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Can we run *simultaneous* actions on the *same RDD*? If yes, how can
>>>>>>>> this be done?
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Regards
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> View this message in context:
>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
>>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>>> Nabble.com.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>
