Re: simultaneous actions

2016-01-18 Thread Debasish Das
Simultaneous actions work fine on a cluster if they are independent... on
local mode I never paid attention, but the code path should be similar...

Re: simultaneous actions

2016-01-18 Thread Mennour Rostom
Hi,

I am running my app on a single machine first, before moving it to the
cluster. Simultaneous actions are not working for me at the moment; could this
come from the fact that I am using a single machine? I am already using the
FAIR scheduler.


Re: simultaneous actions

2016-01-18 Thread Koert Kuipers
stacktrace? details?


Re: simultaneous actions

2016-01-17 Thread Koert Kuipers
the re-use of shuffle files is always a nice surprise to me


Re: simultaneous actions

2016-01-17 Thread Matei Zaharia
They'll be able to run concurrently and share workers / data. Take a look at
http://spark.apache.org/docs/latest/job-scheduling.html for how scheduling
happens across multiple running jobs in the same SparkContext.

Matei
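As a concrete sketch of what the linked job-scheduling page describes: setting `spark.scheduler.mode=FAIR` on the SparkContext lets concurrent jobs share the cluster instead of queueing FIFO, and threads can be pinned to named pools via a pool file referenced by `spark.scheduler.allocation.file`. The file format below follows the Spark docs; the pool name `concurrent-actions` is made up for illustration.

```
<!-- fairscheduler.xml, referenced via spark.scheduler.allocation.file -->
<allocations>
  <pool name="concurrent-actions">
    <schedulingMode>FAIR</schedulingMode>
    <weight>1</weight>
    <minShare>2</minShare>
  </pool>
</allocations>
```

A driver thread then opts into the pool with `sc.setLocalProperty("spark.scheduler.pool", "concurrent-actions")` before calling its action.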



Re: simultaneous actions

2016-01-17 Thread Mark Hamstra
Same SparkContext means same pool of Workers.  It's up to the Scheduler,
not the SparkContext, whether the exact same Workers or Executors will be
used to calculate simultaneous actions against the same RDD.  It is likely
that many of the same Workers and Executors will be used as the Scheduler
tries to preserve data locality, but that is not guaranteed.  In fact, what
is most likely to happen is that the shared Stages and Tasks being
calculated for the simultaneous actions will not actually be run at exactly
the same time, which means that shuffle files produced for one action will
be reused by the other(s), and repeated calculations will be avoided even
without explicitly caching/persisting the RDD.


Re: simultaneous actions

2016-01-17 Thread Mark Hamstra
It can be far more than that (e.g.
https://issues.apache.org/jira/browse/SPARK-11838), and is generally either
unrecognized or a greatly under-appreciated and underused feature of Spark.


Re: simultaneous actions

2016-01-17 Thread Mennour Rostom
Hi,

Thank you all for your answers,

If I understand correctly, actions (in my case foreach) can be run
concurrently and simultaneously on the SAME RDD (which is logical, because
RDDs are read-only objects). However, I want to know whether the same workers
are used for the concurrent analysis?

Thank you


Re: simultaneous actions

2016-01-17 Thread Koert Kuipers
Same rdd means same sparkcontext means same workers

Cache/persist the rdd to avoid repeated jobs

Re: simultaneous actions

2016-01-15 Thread Jonathan Coveney
Threads

On Friday, January 15, 2016, Kira wrote:

> Hi,
>
> Can we run *simultaneous* actions on the *same RDD*? If yes, how can this
> be done?
>
> Thank you,
> Regards

Re: simultaneous actions

2016-01-15 Thread Sean Owen
Can you run N jobs depending on the same RDD in parallel on the
driver? certainly. The context / scheduling is thread-safe and the RDD
is immutable. I've done this to, for example, build and evaluate a
bunch of models simultaneously on a big cluster.
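The driver-side pattern Sean describes is plain threading. A minimal sketch, with no live cluster: `evaluate_model` is a made-up stand-in for a real blocking Spark action (e.g. training and scoring one model against a shared, read-only RDD), and the thread pool plays the role of the driver threads that submit the jobs.

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_model(reg_param, data):
    # Stand-in for a blocking Spark action; returns
    # (hyperparameter, fake "score").
    return reg_param, sum(x * reg_param for x in data)

shared_data = [1.0, 2.0, 3.0]   # plays the role of the immutable, shared RDD
params = [0.1, 0.5, 1.0]

# One job per hyperparameter, dispatched concurrently from the driver side;
# the pool decides when each runs, as Spark's scheduler does for actions.
with ThreadPoolExecutor(max_workers=len(params)) as pool:
    results = dict(pool.map(lambda p: evaluate_model(p, shared_data), params))

assert results[0.5] == 3.0      # 0.5 * (1 + 2 + 3)
assert results[1.0] == 6.0
```

Because the data is read-only and each job writes only its own result, no locking is needed, which is the same property that makes concurrent actions on one RDD safe.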



Re: simultaneous actions

2016-01-15 Thread Jakob Odersky
I don't think RDDs are threadsafe.
More fundamentally however, why would you want to run RDD actions in
parallel? The idea behind RDDs is to provide you with an abstraction for
computing parallel operations on distributed data. Even if you were to call
actions from several threads at once, the individual executors of your
spark environment would still have to perform operations sequentially.

As an alternative, I would suggest to restructure your RDD transformations
to compute the required results in one single operation.

On 15 January 2016 at 06:18, Jonathan Coveney  wrote:

> Threads
>
>
> El viernes, 15 de enero de 2016, Kira  escribió:
>
>> Hi,
>>
>> Can we run *simultaneous* actions on the *same RDD* ?; if yes how can this
>> be done ?
>>
>> Thank you,
>> Regards
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>


Re: simultaneous actions

2016-01-15 Thread Jonathan Coveney
SparkContext is thread safe. And RDDs just describe operations.

While I generally agree that you want to model as much as possible as
transformations, this is not always possible. In that case, you have no
option other than to use threads.

Spark's designers should have made all actions return Futures, but alas...
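
[Editor's note: a minimal sketch of the thread-based pattern described above, wrapping actions in Futures by hand. It assumes an existing SparkContext named `sc`; the RDD and variable names are illustrative, not from this thread.]

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Assumes an existing SparkContext `sc`; names are illustrative.
val rdd = sc.parallelize(1 to 1000000).cache()

// Wrapping each action in a Future submits both jobs to the scheduler
// at once; each Future completes when its Spark job finishes.
val sumF   = Future { rdd.map(_.toLong).sum() }
val countF = Future { rdd.count() }

val (sum, count) = Await.result(sumF.zip(countF), 10.minutes)
```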

On Friday, January 15, 2016, Jakob Odersky 
wrote:

> I don't think RDDs are threadsafe.
> More fundamentally however, why would you want to run RDD actions in
> parallel? The idea behind RDDs is to provide you with an abstraction for
> computing parallel operations on distributed data. Even if you were to call
> actions from several threads at once, the individual executors of your
> spark environment would still have to perform operations sequentially.
>
> As an alternative, I would suggest to restructure your RDD transformations
> to compute the required results in one single operation.
>
> On 15 January 2016 at 06:18, Jonathan Coveney  wrote:
>
>> Threads
>>
>>
>> El viernes, 15 de enero de 2016, Kira  escribió:
>>
>>> Hi,
>>>
>>> Can we run *simultaneous* actions on the *same RDD* ?; if yes how can
>>> this
>>> be done ?
>>>
>>> Thank you,
>>> Regards
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>


Re: simultaneous actions

2016-01-15 Thread Matei Zaharia
RDDs actually are thread-safe, and quite a few applications use them this way, 
e.g. the JDBC server.

Matei

> On Jan 15, 2016, at 2:10 PM, Jakob Odersky  wrote:
> 
> I don't think RDDs are threadsafe.
> More fundamentally however, why would you want to run RDD actions in 
> parallel? The idea behind RDDs is to provide you with an abstraction for 
> computing parallel operations on distributed data. Even if you were to call 
> actions from several threads at once, the individual executors of your spark 
> environment would still have to perform operations sequentially.
> 
> As an alternative, I would suggest to restructure your RDD transformations to 
> compute the required results in one single operation.
> 
> On 15 January 2016 at 06:18, Jonathan Coveney  wrote:
> Threads
> 
> 
> El viernes, 15 de enero de 2016, Kira  escribió:
> Hi,
> 
> Can we run *simultaneous* actions on the *same RDD* ?; if yes how can this
> be done ?
> 
> Thank you,
> Regards
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> 
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
> 
> 



Re: simultaneous actions

2016-01-15 Thread Koert Kuipers
we run multiple actions on the same (cached) rdd all the time, i guess in
different threads indeed (it's in akka)

On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia 
wrote:

> RDDs actually are thread-safe, and quite a few applications use them this
> way, e.g. the JDBC server.
>
> Matei
>
> On Jan 15, 2016, at 2:10 PM, Jakob Odersky  wrote:
>
> I don't think RDDs are threadsafe.
> More fundamentally however, why would you want to run RDD actions in
> parallel? The idea behind RDDs is to provide you with an abstraction for
> computing parallel operations on distributed data. Even if you were to call
> actions from several threads at once, the individual executors of your
> spark environment would still have to perform operations sequentially.
>
> As an alternative, I would suggest to restructure your RDD transformations
> to compute the required results in one single operation.
>
> On 15 January 2016 at 06:18, Jonathan Coveney  wrote:
>
>> Threads
>>
>>
>> El viernes, 15 de enero de 2016, Kira  escribió:
>>
>>> Hi,
>>>
>>> Can we run *simultaneous* actions on the *same RDD* ?; if yes how can
>>> this
>>> be done ?
>>>
>>> Thank you,
>>> Regards
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>
>


Re: simultaneous actions

2016-01-15 Thread Jakob Odersky
I stand corrected. How significant are the benefits, though? Will the
scheduler be able to dispatch jobs from both actions simultaneously (or on
a when-workers-become-available basis)?

On 15 January 2016 at 11:44, Koert Kuipers  wrote:

> we run multiple actions on the same (cached) rdd all the time, i guess in
> different threads indeed (its in akka)
>
> On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia 
> wrote:
>
>> RDDs actually are thread-safe, and quite a few applications use them this
>> way, e.g. the JDBC server.
>>
>> Matei
>>
>> On Jan 15, 2016, at 2:10 PM, Jakob Odersky  wrote:
>>
>> I don't think RDDs are threadsafe.
>> More fundamentally however, why would you want to run RDD actions in
>> parallel? The idea behind RDDs is to provide you with an abstraction for
>> computing parallel operations on distributed data. Even if you were to call
>> actions from several threads at once, the individual executors of your
>> spark environment would still have to perform operations sequentially.
>>
>> As an alternative, I would suggest to restructure your RDD
>> transformations to compute the required results in one single operation.
>>
>> On 15 January 2016 at 06:18, Jonathan Coveney  wrote:
>>
>>> Threads
>>>
>>>
>>> El viernes, 15 de enero de 2016, Kira  escribió:
>>>
 Hi,

 Can we run *simultaneous* actions on the *same RDD* ?; if yes how can
 this
 be done ?

 Thank you,
 Regards



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


>>
>>
>


Re: simultaneous actions

2016-01-15 Thread Sean Owen
It makes sense if you're parallelizing jobs that have relatively few
tasks, and have a lot of execution slots available. It makes sense to
turn them loose all at once and try to use the parallelism available.

There are downsides, eventually: for example, N jobs accessing one
cached RDD may recompute the RDD's partitions many times since the
cached copy may not be available when many of them start. At some
level, way oversubscribing your cluster with a backlog of tasks is
bad. And you might find it's a net loss if a bunch of tasks try to
schedule at the same time that all access the same data, since only
some can be local to the data.
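
[Editor's note: a hedged sketch of one way to avoid the recomputation
problem described above — materialize the cache with a single cheap action
before launching the concurrent jobs, and optionally route each thread to a
fair-scheduler pool so neither job starves the other. The input path and
pool names are hypothetical, and it assumes a SparkContext `sc` created
with spark.scheduler.mode=FAIR.]

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

// Illustrative input path; assumes `sc` was built with
// spark.scheduler.mode=FAIR so jobs share execution slots round-robin.
val rdd = sc.textFile("hdfs:///data/input").cache()
rdd.count()  // force the cache to materialize before any concurrent job starts

val errorsF = Future {
  sc.setLocalProperty("spark.scheduler.pool", "poolA")  // pool is per-thread
  rdd.filter(_.contains("ERROR")).count()
}
val bytesF = Future {
  sc.setLocalProperty("spark.scheduler.pool", "poolB")
  rdd.map(_.length.toLong).sum()
}
```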

On Fri, Jan 15, 2016 at 8:11 PM, Jakob Odersky  wrote:
> I stand corrected. How considerable are the benefits though? Will the
> scheduler be able to dispatch jobs from both actions simultaneously (or on a
> when-workers-become-available basis)?

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org