Same SparkContext means same pool of Workers.  It's up to the Scheduler,
not the SparkContext, whether the exact same Workers or Executors will be
used to compute simultaneous actions against the same RDD.  Many of the
same Workers and Executors are likely to be used, since the Scheduler
tries to preserve data locality, but that is not guaranteed.  In fact, the
most likely outcome is that the Stages and Tasks shared by the
simultaneous actions will not run at exactly the same time, in which case
the shuffle files produced for one action will be reused by the other(s),
avoiding repeated computation even without explicitly caching/persisting
the RDD.
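To make the driver-side pattern concrete, here is a minimal sketch in plain Python (not Spark code): the shared list stands in for a cached RDD, and the two functions stand in for independent actions such as a sum and a filtered count.  It only illustrates the threading shape being discussed, i.e. submitting independent "actions" against one shared, read-only dataset from separate threads; in a real application each thread would invoke an actual RDD action.

```python
from concurrent.futures import ThreadPoolExecutor

# Shared, read-only dataset standing in for a cached RDD.
data = list(range(1000))

def total(xs):
    # Stand-in for one action, e.g. a sum over the RDD.
    return sum(xs)

def count_even(xs):
    # Stand-in for a second, independent action, e.g. a filtered count.
    return sum(1 for x in xs if x % 2 == 0)

# Submit both "actions" from separate driver threads.  Each blocking
# result() call blocks only its own thread, so both can be in flight
# at once -- which is exactly why multiple Spark actions fired from
# different threads can overlap on the same context.
with ThreadPoolExecutor(max_workers=2) as pool:
    f1 = pool.submit(total, data)
    f2 = pool.submit(count_even, data)
    results = (f1.result(), f2.result())

print(results)  # (499500, 500)
```

The same structure works with real RDD actions because, as noted above, RDDs are thread-safe and jobs submitted from separate threads go to the same scheduler.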

On Sun, Jan 17, 2016 at 8:06 AM, Koert Kuipers <ko...@tresata.com> wrote:

> Same rdd means same sparkcontext means same workers
>
> Cache/persist the rdd to avoid repeated jobs
> On Jan 17, 2016 5:21 AM, "Mennour Rostom" <mennou...@gmail.com> wrote:
>
>> Hi,
>>
>> Thank you all for your answers,
>>
>> If I correctly understand, actions (in my case foreach) can be run
>> concurrently and simultaneously on the SAME RDD (which is logical, because
>> RDDs are read-only objects). However, I want to know whether the same
>> workers are used for the concurrent analyses?
>>
>> Thank you
>>
>> 2016-01-15 21:11 GMT+01:00 Jakob Odersky <joder...@gmail.com>:
>>
>>> I stand corrected. How considerable are the benefits though? Will the
>>> scheduler be able to dispatch jobs from both actions simultaneously (or on
>>> a when-workers-become-available basis)?
>>>
>>> On 15 January 2016 at 11:44, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> we run multiple actions on the same (cached) rdd all the time, i guess
>>>> in different threads indeed (it's in akka)
>>>>
>>>> On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>
>>>>> RDDs actually are thread-safe, and quite a few applications use them
>>>>> this way, e.g. the JDBC server.
>>>>>
>>>>> Matei
>>>>>
>>>>> On Jan 15, 2016, at 2:10 PM, Jakob Odersky <joder...@gmail.com> wrote:
>>>>>
>>>>> I don't think RDDs are thread-safe.
>>>>> More fundamentally however, why would you want to run RDD actions in
>>>>> parallel? The idea behind RDDs is to provide you with an abstraction for
>>>>> computing parallel operations on distributed data. Even if you were to call
>>>>> actions from several threads at once, the individual executors of your
>>>>> spark environment would still have to perform operations sequentially.
>>>>>
>>>>> As an alternative, I would suggest to restructure your RDD
>>>>> transformations to compute the required results in one single operation.
>>>>>
>>>>> On 15 January 2016 at 06:18, Jonathan Coveney <jcove...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Threads
>>>>>>
>>>>>>
>>>>>> El viernes, 15 de enero de 2016, Kira <mennou...@gmail.com> escribió:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Can we run *simultaneous* actions on the *same RDD* ?; if yes how
>>>>>>> can this
>>>>>>> be done ?
>>>>>>>
>>>>>>> Thank you,
>>>>>>> Regards
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
