Re: simultaneous actions

2016-01-18 Thread Debasish Das
Simultaneous actions work fine on a cluster if they are independent... on
local mode I never paid attention, but the code path should be similar...
On Jan 18, 2016 8:00 AM, "Koert Kuipers" <ko...@tresata.com> wrote:

> stacktrace? details?
>
> On Mon, Jan 18, 2016 at 5:58 AM, Mennour Rostom <mennou...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am running my app on a single machine first before moving it to the
>> cluster; simultaneous actions are not working for me at the moment. Is this
>> coming from the fact that I am using a single machine, even though I am
>> using the FAIR scheduler?
>>

Re: simultaneous actions

2016-01-18 Thread Mennour Rostom
Hi,

I am running my app on a single machine first before moving it to the
cluster; simultaneous actions are not working for me at the moment. Is this
coming from the fact that I am using a single machine, even though I am
using the FAIR scheduler?



Re: simultaneous actions

2016-01-18 Thread Koert Kuipers
stacktrace? details?



Re: simultaneous actions

2016-01-17 Thread Koert Kuipers
the re-use of shuffle files is always a nice surprise to me



Re: simultaneous actions

2016-01-17 Thread Matei Zaharia
They'll be able to run concurrently and share workers / data. Take a look at
http://spark.apache.org/docs/latest/job-scheduling.html for how scheduling
happens across multiple running jobs in the same SparkContext.

Matei
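The scheduling behavior that document describes can be sketched with a pure-Python analogy (hypothetical names; plain thread pools stand in for the Spark scheduler and its worker pool, not real Spark API): two jobs are submitted at once, and their per-partition tasks are dispatched to one shared pool as workers become available.

```python
from concurrent.futures import ThreadPoolExecutor

workers = ThreadPoolExecutor(max_workers=4)  # the shared "cluster"

def run_job(partitions):
    # each "task" handles one partition; tasks from both jobs land in the
    # same worker pool, interleaving as workers free up
    futures = [workers.submit(sum, p) for p in partitions]
    return sum(f.result() for f in futures)

job_a = [range(0, 100), range(100, 200)]
job_b = [range(200, 300), range(300, 400)]

# two driver-side threads submit their jobs simultaneously
with ThreadPoolExecutor(max_workers=2) as drivers:
    a, b = drivers.map(run_job, [job_a, job_b])

workers.shutdown()
print(a, b)
```

In real Spark the FAIR scheduler (set via spark.scheduler.mode) additionally shares cluster resources fairly between the two concurrently running jobs instead of running them FIFO.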




Re: simultaneous actions

2016-01-17 Thread Mark Hamstra
Same SparkContext means same pool of Workers.  It's up to the Scheduler,
not the SparkContext, whether the exact same Workers or Executors will be
used to calculate simultaneous actions against the same RDD.  It is likely
that many of the same Workers and Executors will be used as the Scheduler
tries to preserve data locality, but that is not guaranteed.  In fact, what
is most likely to happen is that the shared Stages and Tasks being
calculated for the simultaneous actions will not actually be run at exactly
the same time, which means that shuffle files produced for one action will
be reused by the other(s), and repeated calculations will be avoided even
without explicitly caching/persisting the RDD.
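Mark's point — that stages shared by near-simultaneous actions are typically computed once and then reused — can be illustrated with a memoized function standing in for Spark's reuse of materialized shuffle files (a loose analogy under stated assumptions, not Spark's actual mechanism):

```python
from functools import lru_cache

runs = 0  # counts how often the shared upstream "stage" really executes

@lru_cache(maxsize=None)  # plays the role of reusable shuffle files
def shuffle_stage():
    global runs
    runs += 1
    return tuple(x * x for x in range(1000))  # expensive shared work

# two "actions" over the same upstream stage; since they rarely hit the
# stage at exactly the same instant, the second reuses the first's output
total = sum(shuffle_stage())
biggest = max(shuffle_stage())
print(runs)  # the shared stage ran only once
```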



Re: simultaneous actions

2016-01-17 Thread Mark Hamstra
It can be far more than that (e.g.
https://issues.apache.org/jira/browse/SPARK-11838), and is generally either
unrecognized or a greatly under-appreciated and underused feature of Spark.



Re: simultaneous actions

2016-01-17 Thread Mennour Rostom
Hi,

Thank you all for your answers,

If I correctly understand, actions (in my case foreach) can be run
concurrently and simultaneously on the SAME rdd (which is logical, because
RDDs are read-only objects). However, I want to know whether the same
workers are used for the concurrent analyses?

Thank you

2016-01-15 21:11 GMT+01:00 Jakob Odersky <joder...@gmail.com>:

> I stand corrected. How considerable are the benefits though? Will the
> scheduler be able to dispatch jobs from both actions simultaneously (or on
> a when-workers-become-available basis)?
>
> On 15 January 2016 at 11:44, Koert Kuipers <ko...@tresata.com> wrote:
>
>> we run multiple actions on the same (cached) rdd all the time, I guess in
>> different threads indeed (it's in Akka)
>>
>> On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <matei.zaha...@gmail.com>
>> wrote:
>>
>>> RDDs actually are thread-safe, and quite a few applications use them
>>> this way, e.g. the JDBC server.
>>>
>>> Matei
>>>
>>> On Jan 15, 2016, at 2:10 PM, Jakob Odersky <joder...@gmail.com> wrote:
>>>
>>> I don't think RDDs are threadsafe.
>>> More fundamentally however, why would you want to run RDD actions in
>>> parallel? The idea behind RDDs is to provide you with an abstraction for
>>> computing parallel operations on distributed data. Even if you were to call
>>> actions from several threads at once, the individual executors of your
>>> spark environment would still have to perform operations sequentially.
>>>
>>> As an alternative, I would suggest restructuring your RDD
>>> transformations to compute the required results in a single operation.
>>>
>>> On 15 January 2016 at 06:18, Jonathan Coveney <jcove...@gmail.com>
>>> wrote:
>>>
>>>> Threads
>>>>
>>>>
>>>> On Friday, 15 January 2016, Kira <mennou...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Can we run *simultaneous* actions on the *same RDD* ?; if yes how can
>>>>> this
>>>>> be done ?
>>>>>
>>>>> Thank you,
>>>>> Regards
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>> Nabble.com <http://nabble.com>.
>>>>>
>>>>> -
>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>
>>>>>
>>>
>>>
>>
>


Re: simultaneous actions

2016-01-17 Thread Koert Kuipers
Same rdd means same sparkcontext means same workers

Cache/persist the rdd to avoid repeated jobs
On Jan 17, 2016 5:21 AM, "Mennour Rostom" <mennou...@gmail.com> wrote:

> Hi,
>
> Thank you all for your answers,
>
> If I correctly understand, actions (in my case foreach) can be run
> concurrently and simultaneously on the SAME rdd, (which is logical because
> they are read only object). however, I want to know if the same workers are
> used for the concurrent analysis ?
>
> Thank you
>
> 2016-01-15 21:11 GMT+01:00 Jakob Odersky <joder...@gmail.com>:
>
>> I stand corrected. How considerable are the benefits though? Will the
>> scheduler be able to dispatch jobs from both actions simultaneously (or on
>> a when-workers-become-available basis)?
>>
>> On 15 January 2016 at 11:44, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> we run multiple actions on the same (cached) rdd all the time, i guess
>>> in different threads indeed (its in akka)
>>>
>>> On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <matei.zaha...@gmail.com>
>>> wrote:
>>>
>>>> RDDs actually are thread-safe, and quite a few applications use them
>>>> this way, e.g. the JDBC server.
>>>>
>>>> Matei
>>>>
>>>> On Jan 15, 2016, at 2:10 PM, Jakob Odersky <joder...@gmail.com> wrote:
>>>>
>>>> I don't think RDDs are threadsafe.
>>>> More fundamentally however, why would you want to run RDD actions in
>>>> parallel? The idea behind RDDs is to provide you with an abstraction for
>>>> computing parallel operations on distributed data. Even if you were to call
>>>> actions from several threads at once, the individual executors of your
>>>> spark environment would still have to perform operations sequentially.
>>>>
>>>> As an alternative, I would suggest to restructure your RDD
>>>> transformations to compute the required results in one single operation.
>>>>
>>>> On 15 January 2016 at 06:18, Jonathan Coveney <jcove...@gmail.com>
>>>> wrote:
>>>>
>>>>> Threads
>>>>>
>>>>>
>>>>> On Friday, January 15, 2016, Kira <mennou...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Can we run *simultaneous* actions on the *same RDD* ?; if yes how can
>>>>>> this
>>>>>> be done ?
>>>>>>
>>>>>> Thank you,
>>>>>> Regards
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>
>>
>


simultaneous actions

2016-01-15 Thread Kira
Hi,

Can we run *simultaneous* actions on the *same RDD*? If yes, how can this
be done?

Thank you,
Regards






Re: simultaneous actions

2016-01-15 Thread Jonathan Coveney
Threads

On Friday, January 15, 2016, Kira <mennou...@gmail.com> wrote:

> Hi,
>
> Can we run *simultaneous* actions on the *same RDD* ?; if yes how can this
> be done ?
>
> Thank you,
> Regards
>
>
>
>
>


Re: simultaneous actions

2016-01-15 Thread Sean Owen
Can you run N jobs depending on the same RDD in parallel on the
driver? Certainly. The context / scheduling is thread-safe and the RDD
is immutable. I've done this to, for example, build and evaluate a
bunch of models simultaneously on a big cluster.
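A driver-side thread pool is one way to sketch that build-and-evaluate fan-out. This is a stdlib-only illustration, not Spark code: `evaluate` is a hypothetical stand-in for whatever trains one model and scores it with an action over a shared, cached RDD.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(reg_param):
    # Stand-in for: train a model with this setting, then run an action
    # such as rdd.map(score).mean() to get its error on a cached RDD.
    return reg_param, reg_param ** 2  # pretend the "error" is param^2

params = [0.0, 0.1, 0.5, 1.0]

# Each call blocks on its own Spark job, so a pool of driver threads is
# what lets the scheduler see all N jobs at once.
with ThreadPoolExecutor(max_workers=len(params)) as pool:
    errors = dict(pool.map(evaluate, params))

best = min(errors, key=errors.get)
print(best)  # 0.0
```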

On Fri, Jan 15, 2016 at 7:10 PM, Jakob Odersky <joder...@gmail.com> wrote:
> I don't think RDDs are threadsafe.
> More fundamentally however, why would you want to run RDD actions in
> parallel? The idea behind RDDs is to provide you with an abstraction for
> computing parallel operations on distributed data. Even if you were to call
> actions from several threads at once, the individual executors of your spark
> environment would still have to perform operations sequentially.
>
> As an alternative, I would suggest to restructure your RDD transformations
> to compute the required results in one single operation.
>
> On 15 January 2016 at 06:18, Jonathan Coveney <jcove...@gmail.com> wrote:
>>
>> Threads
>>
>>
>>> On Friday, January 15, 2016, Kira <mennou...@gmail.com> wrote:
>>>
>>> Hi,
>>>
>>> Can we run *simultaneous* actions on the *same RDD* ?; if yes how can
>>> this
>>> be done ?
>>>
>>> Thank you,
>>> Regards
>>>
>>>
>>>
>>>
>




Re: simultaneous actions

2016-01-15 Thread Jakob Odersky
I don't think RDDs are thread-safe.
More fundamentally, however: why would you want to run RDD actions in
parallel? The idea behind RDDs is to provide you with an abstraction for
computing parallel operations on distributed data. Even if you were to call
actions from several threads at once, the individual executors of your
Spark environment would still have to perform operations sequentially.

As an alternative, I would suggest restructuring your RDD transformations
to compute the required results in a single operation.

On 15 January 2016 at 06:18, Jonathan Coveney <jcove...@gmail.com> wrote:

> Threads
>
>
> On Friday, January 15, 2016, Kira <mennou...@gmail.com> wrote:
>
>> Hi,
>>
>> Can we run *simultaneous* actions on the *same RDD* ?; if yes how can this
>> be done ?
>>
>> Thank you,
>> Regards
>>
>>
>>
>>
>>


Re: simultaneous actions

2016-01-15 Thread Jonathan Coveney
SparkContext is thread-safe, and RDDs just describe operations.

While I generally agree that you want to model as much as possible as
transformations, that is not always possible. In that case, you have no
option but to use threads.

Spark's designers should have made all actions return Futures, but alas...
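That Future-shaped API can be approximated today by submitting each blocking action to an executor. A stdlib-only sketch, with `run_action` a hypothetical placeholder for something like `rdd.count()` or `rdd.collect()`:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_action(name):
    # Placeholder for a blocking Spark action; here it just measures
    # the length of its own name.
    return name, len(name)

pool = ThreadPoolExecutor(max_workers=2)

# submit() returns a Future immediately -- effectively the asynchronous
# action API wished for above.
futures = [pool.submit(run_action, "count"),
           pool.submit(run_action, "collect")]

# Consume results in completion order rather than submission order.
done = dict(fut.result() for fut in as_completed(futures))
pool.shutdown()

print(done["count"], done["collect"])  # 5 7
```

(For what it's worth, Spark does ship a small async API — `AsyncRDDActions` such as `rdd.countAsync()` — for a handful of actions, though not for all of them.)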

On Friday, January 15, 2016, Jakob Odersky <joder...@gmail.com> wrote:

> I don't think RDDs are threadsafe.
> More fundamentally however, why would you want to run RDD actions in
> parallel? The idea behind RDDs is to provide you with an abstraction for
> computing parallel operations on distributed data. Even if you were to call
> actions from several threads at once, the individual executors of your
> spark environment would still have to perform operations sequentially.
>
> As an alternative, I would suggest to restructure your RDD transformations
> to compute the required results in one single operation.
>
> On 15 January 2016 at 06:18, Jonathan Coveney <jcove...@gmail.com> wrote:
>
>> Threads
>>
>>
>> On Friday, January 15, 2016, Kira <mennou...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Can we run *simultaneous* actions on the *same RDD* ?; if yes how can
>>> this
>>> be done ?
>>>
>>> Thank you,
>>> Regards
>>>
>>>
>>>
>>>
>>>
>


Re: simultaneous actions

2016-01-15 Thread Matei Zaharia
RDDs actually are thread-safe, and quite a few applications use them this way, 
e.g. the JDBC server.

Matei

> On Jan 15, 2016, at 2:10 PM, Jakob Odersky <joder...@gmail.com> wrote:
> 
> I don't think RDDs are threadsafe.
> More fundamentally however, why would you want to run RDD actions in 
> parallel? The idea behind RDDs is to provide you with an abstraction for 
> computing parallel operations on distributed data. Even if you were to call 
> actions from several threads at once, the individual executors of your spark 
> environment would still have to perform operations sequentially.
> 
> As an alternative, I would suggest to restructure your RDD transformations to 
> compute the required results in one single operation.
> 
> On 15 January 2016 at 06:18, Jonathan Coveney <jcove...@gmail.com> wrote:
> Threads
> 
> 
> On Friday, January 15, 2016, Kira <mennou...@gmail.com> wrote:
> Hi,
> 
> Can we run *simultaneous* actions on the *same RDD* ?; if yes how can this
> be done ?
> 
> Thank you,
> Regards
> 
> 
> 
> 
> 



Re: simultaneous actions

2016-01-15 Thread Koert Kuipers
we run multiple actions on the same (cached) RDD all the time, I guess in
different threads indeed (it's in Akka)

On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <matei.zaha...@gmail.com>
wrote:

> RDDs actually are thread-safe, and quite a few applications use them this
> way, e.g. the JDBC server.
>
> Matei
>
> On Jan 15, 2016, at 2:10 PM, Jakob Odersky <joder...@gmail.com> wrote:
>
> I don't think RDDs are threadsafe.
> More fundamentally however, why would you want to run RDD actions in
> parallel? The idea behind RDDs is to provide you with an abstraction for
> computing parallel operations on distributed data. Even if you were to call
> actions from several threads at once, the individual executors of your
> spark environment would still have to perform operations sequentially.
>
> As an alternative, I would suggest to restructure your RDD transformations
> to compute the required results in one single operation.
>
> On 15 January 2016 at 06:18, Jonathan Coveney <jcove...@gmail.com> wrote:
>
>> Threads
>>
>>
>> On Friday, January 15, 2016, Kira <mennou...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Can we run *simultaneous* actions on the *same RDD* ?; if yes how can
>>> this
>>> be done ?
>>>
>>> Thank you,
>>> Regards
>>>
>>>
>>>
>>>
>>>
>
>


Re: simultaneous actions

2016-01-15 Thread Jakob Odersky
I stand corrected. How considerable are the benefits though? Will the
scheduler be able to dispatch jobs from both actions simultaneously (or on
a when-workers-become-available basis)?

On 15 January 2016 at 11:44, Koert Kuipers <ko...@tresata.com> wrote:

> we run multiple actions on the same (cached) rdd all the time, i guess in
> different threads indeed (its in akka)
>
> On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <matei.zaha...@gmail.com>
> wrote:
>
>> RDDs actually are thread-safe, and quite a few applications use them this
>> way, e.g. the JDBC server.
>>
>> Matei
>>
>> On Jan 15, 2016, at 2:10 PM, Jakob Odersky <joder...@gmail.com> wrote:
>>
>> I don't think RDDs are threadsafe.
>> More fundamentally however, why would you want to run RDD actions in
>> parallel? The idea behind RDDs is to provide you with an abstraction for
>> computing parallel operations on distributed data. Even if you were to call
>> actions from several threads at once, the individual executors of your
>> spark environment would still have to perform operations sequentially.
>>
>> As an alternative, I would suggest to restructure your RDD
>> transformations to compute the required results in one single operation.
>>
>> On 15 January 2016 at 06:18, Jonathan Coveney <jcove...@gmail.com> wrote:
>>
>>> Threads
>>>
>>>
>>> On Friday, January 15, 2016, Kira <mennou...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> Can we run *simultaneous* actions on the *same RDD* ?; if yes how can
>>>> this
>>>> be done ?
>>>>
>>>> Thank you,
>>>> Regards
>>>>
>>>>
>>>>
>>>>
>>>>
>>
>>
>


Re: simultaneous actions

2016-01-15 Thread Sean Owen
It makes sense if you're parallelizing jobs that have relatively few
tasks, and have a lot of execution slots available. It makes sense to
turn them loose all at once and try to use the parallelism available.

There are downsides, eventually: for example, N jobs accessing one
cached RDD may recompute the RDD's partitions many times since the
cached copy may not be available when many of them start. At some
level, way oversubscribing your cluster with a backlog of tasks is
bad. And you might find it's a net loss if a bunch of tasks try to
schedule at the same time that all access the same data, since only
some can be local to the data.
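One way to sidestep the recomputation race described above is to materialize the cache with a single cheap action before fanning out, and to bound the fan-out itself. Another stdlib-only sketch: `materialize` and `job` are hypothetical stand-ins for `rdd.cache(); rdd.count()` and for actions reading that RDD.

```python
from concurrent.futures import ThreadPoolExecutor

cache = {}  # plays the role of Spark's block-manager cache

def materialize():
    # Stand-in for rdd.cache(); rdd.count() -- run ONCE before the fan-out
    # so every later job finds the cached copy instead of recomputing it.
    cache["data"] = list(range(10))

def job(i):
    # Stand-in for an action that reads the now-cached RDD.
    return i * sum(cache["data"])

materialize()  # populate the cache first...

# ...then release the N jobs; a bounded pool also keeps the cluster from
# being oversubscribed with a huge backlog of tasks.
with ThreadPoolExecutor(max_workers=4) as pool:
    out = list(pool.map(job, range(3)))

print(out)  # [0, 45, 90]
```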

On Fri, Jan 15, 2016 at 8:11 PM, Jakob Odersky  wrote:
> I stand corrected. How considerable are the benefits though? Will the
> scheduler be able to dispatch jobs from both actions simultaneously (or on a
> when-workers-become-available basis)?
