Re: simultaneous actions

Debasish Das Mon, 18 Jan 2016 09:42:08 -0800

Simultaneous actions work fine on a cluster if they are independent. In
local mode I never paid attention, but the code path should be similar.
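
For reference, here is a minimal sketch of what "simultaneous actions from
threads" can look like. It is an illustration under assumptions, not code
from anyone in this thread: the object name, the local[4] master, the
two-thread pool and the sample data are all made up for the example. Only
the Spark calls themselves (SparkConf, SparkContext, parallelize, cache,
count, reduce) and the spark.scheduler.mode setting are standard API.

```scala
import java.util.concurrent.Executors

import org.apache.spark.{SparkConf, SparkContext}

import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical example object; the name is not from the thread.
object SimultaneousActions {
  def main(args: Array[String]): Unit = {
    // Assumption: local mode with 4 task slots. With plain "local" there is
    // only one slot, so tasks from the two jobs could only run one at a time.
    val conf = new SparkConf()
      .setAppName("simultaneous-actions")
      .setMaster("local[4]")
      .set("spark.scheduler.mode", "FAIR") // the default mode is FIFO

    val sc = new SparkContext(conf)

    // Dedicated pool of two driver-side threads, one per action.
    val pool = Executors.newFixedThreadPool(2)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)

    try {
      // One shared RDD. cache() is optional; it avoids recomputing the map
      // for each action.
      val rdd = sc.parallelize(1 to 1000000).map(_.toLong).cache()

      // Each Future calls its action from its own thread, so the two jobs
      // are submitted to the scheduler independently of each other.
      val countF = Future { rdd.count() }
      val sumF = Future { rdd.reduce(_ + _) }

      val count = Await.result(countF, 10.minutes)
      val sum = Await.result(sumF, 10.minutes)
      println(s"count=$count sum=$sum")
    } finally {
      sc.stop()
      pool.shutdown()
    }
  }
}
```

Whether the two jobs actually overlap depends on how many free task slots
the scheduler has, so on a single machine it is worth checking the master
setting before suspecting the threading.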
On Jan 18, 2016 8:00 AM, "Koert Kuipers" <ko...@tresata.com> wrote:


> stacktrace? details?
>
> On Mon, Jan 18, 2016 at 5:58 AM, Mennour Rostom <mennou...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am running my app on a single machine first, before moving it to the
>> cluster. Simultaneous actions are not working for me now; is this
>> coming from the fact that I am using a single machine? I am using the
>> FAIR scheduler, though.
>>
>> 2016-01-17 21:23 GMT+01:00 Mark Hamstra <m...@clearstorydata.com>:
>>
>>> It can be far more than that (e.g.
>>> https://issues.apache.org/jira/browse/SPARK-11838), and is generally
>>> either unrecognized or a greatly under-appreciated and underused feature of
>>> Spark.
>>>
>>> On Sun, Jan 17, 2016 at 12:20 PM, Koert Kuipers <ko...@tresata.com>
>>> wrote:
>>>
>>>> the re-use of shuffle files is always a nice surprise to me
>>>>
>>>> On Sun, Jan 17, 2016 at 3:17 PM, Mark Hamstra <m...@clearstorydata.com>
>>>> wrote:
>>>>
>>>>> Same SparkContext means same pool of Workers.  It's up to the
>>>>> Scheduler, not the SparkContext, whether the exact same Workers or
>>>>> Executors will be used to calculate simultaneous actions against the same
>>>>> RDD.  It is likely that many of the same Workers and Executors will be
>>>>> used as the Scheduler tries to preserve data locality, but that is not
>>>>> guaranteed.  In fact, what is most likely to happen is that the shared
>>>>> Stages and Tasks being calculated for the simultaneous actions will not
>>>>> actually be run at exactly the same time, which means that shuffle files
>>>>> produced for one action will be reused by the other(s), and repeated
>>>>> calculations will be avoided even without explicitly caching/persisting
>>>>> the RDD.
>>>>>
>>>>> On Sun, Jan 17, 2016 at 8:06 AM, Koert Kuipers <ko...@tresata.com>
>>>>> wrote:
>>>>>
>>>>>> Same rdd means same sparkcontext means same workers
>>>>>>
>>>>>> Cache/persist the rdd to avoid repeated jobs
>>>>>> On Jan 17, 2016 5:21 AM, "Mennour Rostom" <mennou...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> Thank you all for your answers,
>>>>>>>
>>>>>>> If I understand correctly, actions (in my case foreach) can be run
>>>>>>> concurrently on the SAME RDD (which is logical, because RDDs are
>>>>>>> read-only objects). However, I want to know whether the same workers
>>>>>>> are used for the concurrent analysis?
>>>>>>>
>>>>>>> Thank you
>>>>>>>
>>>>>>> 2016-01-15 21:11 GMT+01:00 Jakob Odersky <joder...@gmail.com>:
>>>>>>>
>>>>>>>> I stand corrected. How considerable are the benefits though? Will
>>>>>>>> the scheduler be able to dispatch jobs from both actions
>>>>>>>> simultaneously (or on a when-workers-become-available basis)?
>>>>>>>>
>>>>>>>> On 15 January 2016 at 11:44, Koert Kuipers <ko...@tresata.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> We run multiple actions on the same (cached) RDD all the time, I
>>>>>>>>> guess in different threads indeed (it's in Akka).
>>>>>>>>>
>>>>>>>>> On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <
>>>>>>>>> matei.zaha...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> RDDs actually are thread-safe, and quite a few applications use
>>>>>>>>>> them this way, e.g. the JDBC server.
>>>>>>>>>>
>>>>>>>>>> Matei
>>>>>>>>>>
>>>>>>>>>> On Jan 15, 2016, at 2:10 PM, Jakob Odersky <joder...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> I don't think RDDs are threadsafe.
>>>>>>>>>> More fundamentally however, why would you want to run RDD actions
>>>>>>>>>> in parallel? The idea behind RDDs is to provide you with an
>>>>>>>>>> abstraction for computing parallel operations on distributed data.
>>>>>>>>>> Even if you were to call actions from several threads at once, the
>>>>>>>>>> individual executors of your Spark environment would still have to
>>>>>>>>>> perform operations sequentially.
>>>>>>>>>>
>>>>>>>>>> As an alternative, I would suggest restructuring your RDD
>>>>>>>>>> transformations to compute the required results in one single
>>>>>>>>>> operation.
>>>>>>>>>>
>>>>>>>>>> On 15 January 2016 at 06:18, Jonathan Coveney <jcove...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Threads
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Friday, January 15, 2016, Kira <mennou...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi,
>>>>>>>>>>>>
>>>>>>>>>>>> Can we run *simultaneous* actions on the *same RDD*? If yes,
>>>>>>>>>>>> how can this be done?
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you,
>>>>>>>>>>>> Regards
