Re: simultaneous actions

2016-01-18 Thread Mennour Rostom
Hi,

I am running my app on a single machine first, before moving it to the
cluster. Right now simultaneous actions are not working for me; could this
come from the fact that I am using a single machine, even though I am using
the FAIR scheduler?
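
For reference, here is a minimal sketch of the setup I have in mind, assuming
a local[4] master; the RDD and the pool names are just placeholders, not my
real code:

    import org.apache.spark.{SparkConf, SparkContext}

    object SimultaneousActionsSketch {
      def main(args: Array[String]): Unit = {
        // FAIR mode only matters when jobs come from different threads;
        // on one machine you also need local[N] with N > 1 so that tasks
        // from the two jobs can actually overlap.
        val conf = new SparkConf()
          .setMaster("local[4]")
          .setAppName("simultaneous-actions")
          .set("spark.scheduler.mode", "FAIR")
        val sc = new SparkContext(conf)

        val rdd = sc.parallelize(1 to 1000000).cache()

        val t1 = new Thread(new Runnable {
          def run(): Unit = {
            sc.setLocalProperty("spark.scheduler.pool", "pool1") // hypothetical pool
            rdd.foreach(_ => ())                                 // action 1
          }
        })
        val t2 = new Thread(new Runnable {
          def run(): Unit = {
            sc.setLocalProperty("spark.scheduler.pool", "pool2") // hypothetical pool
            println(rdd.count())                                 // action 2
          }
        })
        t1.start(); t2.start()
        t1.join(); t2.join()
        sc.stop()
      }
    }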

2016-01-17 21:23 GMT+01:00 Mark Hamstra :

> It can be far more than that (e.g.
> https://issues.apache.org/jira/browse/SPARK-11838), and is generally
> either unrecognized or a greatly under-appreciated and underused feature of
> Spark.
>
> On Sun, Jan 17, 2016 at 12:20 PM, Koert Kuipers  wrote:
>
>> the re-use of shuffle files is always a nice surprise to me
>>
>> On Sun, Jan 17, 2016 at 3:17 PM, Mark Hamstra 
>> wrote:
>>
>>> Same SparkContext means same pool of Workers.  It's up to the Scheduler,
>>> not the SparkContext, whether the exact same Workers or Executors will be
>>> used to calculate simultaneous actions against the same RDD.  It is likely
>>> that many of the same Workers and Executors will be used as the Scheduler
>>> tries to preserve data locality, but that is not guaranteed.  In fact, what
>>> is most likely to happen is that the shared Stages and Tasks being
>>> calculated for the simultaneous actions will not actually be run at exactly
>>> the same time, which means that shuffle files produced for one action will
>>> be reused by the other(s), and repeated calculations will be avoided even
>>> without explicitly caching/persisting the RDD.
>>>
>>> On Sun, Jan 17, 2016 at 8:06 AM, Koert Kuipers 
>>> wrote:
>>>
>>>> Same rdd means same sparkcontext means same workers
>>>>
>>>> Cache/persist the rdd to avoid repeated jobs
>>>> On Jan 17, 2016 5:21 AM, "Mennour Rostom"  wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Thank you all for your answers,
>>>>>
>>>>> If I understand correctly, actions (in my case foreach) can be run
>>>>> concurrently and simultaneously on the SAME RDD (which is logical because
>>>>> RDDs are read-only objects). However, I want to know whether the same
>>>>> workers are used for the concurrent analysis?
>>>>>
>>>>> Thank you
>>>>>
>>>>> 2016-01-15 21:11 GMT+01:00 Jakob Odersky :
>>>>>
>>>>>> I stand corrected. How considerable are the benefits though? Will the
>>>>>> scheduler be able to dispatch jobs from both actions simultaneously (or 
>>>>>> on
>>>>>> a when-workers-become-available basis)?
>>>>>>
>>>>>> On 15 January 2016 at 11:44, Koert Kuipers  wrote:
>>>>>>
>>>>>>> we run multiple actions on the same (cached) RDD all the time, I
>>>>>>> guess in different threads indeed (it's in Akka)
>>>>>>>
>>>>>>> On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <
>>>>>>> matei.zaha...@gmail.com> wrote:
>>>>>>>
>>>>>>>> RDDs actually are thread-safe, and quite a few applications use
>>>>>>>> them this way, e.g. the JDBC server.
>>>>>>>>
>>>>>>>> Matei
>>>>>>>>
>>>>>>>> On Jan 15, 2016, at 2:10 PM, Jakob Odersky 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I don't think RDDs are threadsafe.
>>>>>>>> More fundamentally however, why would you want to run RDD actions
>>>>>>>> in parallel? The idea behind RDDs is to provide you with an 
>>>>>>>> abstraction for
>>>>>>>> computing parallel operations on distributed data. Even if you were to 
>>>>>>>> call
>>>>>>>> actions from several threads at once, the individual executors of your
>>>>>>>> spark environment would still have to perform operations sequentially.
>>>>>>>>
>>>>>>>> As an alternative, I would suggest restructuring your RDD
>>>>>>>> transformations to compute the required results in a single operation.
>>>>>>>>
>>>>>>>> On 15 January 2016 at 06:18, Jonathan Coveney 
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Threads
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Friday, January 15, 2016, Kira wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> Can we run *simultaneous* actions on the *same RDD*? If yes, how can
>>>>>>>>>> this be done?
>>>>>>>>>>
>>>>>>>>>> Thank you,
>>>>>>>>>> Regards
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>
>>
>


Re: simultaneous actions

2016-01-17 Thread Mennour Rostom
Hi,

Thank you all for your answers,

If I understand correctly, actions (in my case foreach) can be run
concurrently and simultaneously on the SAME RDD (which is logical because
RDDs are read-only objects). However, I want to know whether the same workers
are used for the concurrent analysis?
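
To make the question concrete, this is roughly what I mean by concurrent
actions on the same RDD (a sketch only, assuming an existing SparkContext sc
from the shell; the input path is a placeholder). Both futures go through the
same SparkContext, so they share the same pool of executors:

    import scala.concurrent.{Await, Future}
    import scala.concurrent.ExecutionContext.Implicits.global
    import scala.concurrent.duration._

    val rdd = sc.textFile("data.txt").cache()   // placeholder input

    // two actions submitted from different threads of the same driver
    val f1 = Future { rdd.foreach(_ => ()) }    // analysis 1
    val f2 = Future { rdd.map(_.length).sum() } // analysis 2

    Await.result(f1, 10.minutes)
    println(Await.result(f2, 10.minutes))

Whether the tasks of the two jobs land on the same executors would then be up
to the scheduler, which is exactly what I am asking about.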

Thank you

2016-01-15 21:11 GMT+01:00 Jakob Odersky :

> I stand corrected. How considerable are the benefits though? Will the
> scheduler be able to dispatch jobs from both actions simultaneously (or on
> a when-workers-become-available basis)?
>
> On 15 January 2016 at 11:44, Koert Kuipers  wrote:
>
>> we run multiple actions on the same (cached) RDD all the time, I guess in
>> different threads indeed (it's in Akka)
>>
>> On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia 
>> wrote:
>>
>>> RDDs actually are thread-safe, and quite a few applications use them
>>> this way, e.g. the JDBC server.
>>>
>>> Matei
>>>
>>> On Jan 15, 2016, at 2:10 PM, Jakob Odersky  wrote:
>>>
>>> I don't think RDDs are threadsafe.
>>> More fundamentally however, why would you want to run RDD actions in
>>> parallel? The idea behind RDDs is to provide you with an abstraction for
>>> computing parallel operations on distributed data. Even if you were to call
>>> actions from several threads at once, the individual executors of your
>>> spark environment would still have to perform operations sequentially.
>>>
>>> As an alternative, I would suggest restructuring your RDD
>>> transformations to compute the required results in a single operation.
>>>
>>> On 15 January 2016 at 06:18, Jonathan Coveney 
>>> wrote:
>>>
 Threads


 On Friday, January 15, 2016, Kira wrote:

> Hi,
>
> Can we run *simultaneous* actions on the *same RDD*? If yes, how can this
> be done?
>
> Thank you,
> Regards
>
>
>
>
>
>>>
>>>
>>
>


Re: Read Accumulator value while running

2016-01-14 Thread Mennour Rostom
Hi Daniel, Andrew

Thank you for your answers. So it is not possible to read the accumulator
value until the action that manipulates it finishes; that's unfortunate, I'll
think of something else. However, the most important thing in my application
is the ability to launch 2 (or more) actions in parallel and concurrently:
"*within* each Spark application, multiple “jobs” (Spark actions) may be
running concurrently if they were submitted by different threads" [official
Job Scheduling guide].

My actions have to run on the same RDD, e.g.:

RDD D;
D.ac1(func1) [foreach for instance]  //  D.ac2(func2) [foreachPartition or
whatever]


Can this be done with asynchronous actions? (It is not working for me on a
single node :/ )
Can I use the same broadcast variable in the two actions? If yes, what happens
if I change the value of the broadcast variable?
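
For the asynchronous part, what I am trying looks roughly like this (a sketch
only, assuming a SparkContext sc from the shell; func1/func2 are placeholders
for my real per-record work). foreachAsync and foreachPartitionAsync come from
AsyncRDDActions and each returns a FutureAction, so both jobs are submitted
without blocking the driver thread:

    import scala.concurrent.Await
    import scala.concurrent.duration.Duration
    import scala.concurrent.ExecutionContext.Implicits.global

    def func1(x: Int): Unit = {}                 // placeholder for ac1's work
    def func2(x: Int): Unit = {}                 // placeholder for ac2's work

    val d = sc.parallelize(1 to 1000000).cache() // placeholder for RDD D

    val ac1 = d.foreachAsync(func1)                      // job 1, non-blocking
    val ac2 = d.foreachPartitionAsync(_.foreach(func2))  // job 2, non-blocking

    ac1.onComplete(_ => println("ac1 finished"))
    Await.result(ac2, Duration.Inf)

My understanding is that on a single node this only overlaps if the master is
local[N] with N > 1; with local[1] the two jobs still run one task at a time,
which might explain what I am seeing.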

On the other hand, I would really like to know more about Spark 2.0.

Thank you,
Regards

2016-01-13 20:31 GMT+01:00 Andrew Or :

> Hi Kira,
>
> As you suspected, accumulator values are only updated after the task
> completes. We do send accumulator updates from the executors to the driver
> on periodic heartbeats, but these only concern internal accumulators, not
> the ones created by the user.
>
> In short, I'm afraid there is not currently a way (in Spark 1.6 and
> before) to access the accumulator values until after the tasks that updated
> them have completed. This will change in Spark 2.0, the next version,
> however.
>
> Please let me know if you have more questions.
> -Andrew
>
> 2016-01-13 11:24 GMT-08:00 Daniel Imberman :
>
>> Hi Kira,
>>
>> I'm having some trouble understanding your question. Could you please
>> give a code example?
>>
>>
>>
>> From what I think you're asking there are two issues with what you're
>> looking to do. (Please keep in mind I could be totally wrong on both of
>> these assumptions, but this is what I've been led to believe.)
>>
>> 1. The contract of an accumulator is that you can't actually read the
>> value as the function is performing because the values in the accumulator
>> don't actually mean anything until they are reduced. If you were looking
>> for progress in a local context, you could do mapPartitions and have a
>> local accumulator per partition, but I don't think it's possible to get the
>> actual accumulator value in the middle of the map job.
>>
>> 2. As far as performing ac2 while ac1 is "always running", I'm pretty
>> sure that's not possible. The way that lazy evaluation works in Spark, the
>> transformations have to be done serially. Having it any other way would
>> actually be really bad because then you could have ac1 changing the data,
>> thereby making ac2's output unpredictable.
>>
>> That being said, with a more specific example it might be possible to
>> help figure out a solution that accomplishes what you are trying to do.
>>
>> On Wed, Jan 13, 2016 at 5:43 AM Kira  wrote:
>>
>>> Hi,
>>>
>>> So I have an action on one RDD that is relatively long; let's call it ac1.
>>> What I want to do is execute another action (ac2) on the same RDD to see
>>> the evolution of the first one (ac1); to this end I want to use an
>>> accumulator and read its value progressively to see the changes on it (on
>>> the fly) while ac1 is still running. My problem is that the accumulator is
>>> only updated once ac1 has finished, which is not helpful for me :/ .
>>>
>>> I've seen here
>>> <http://apache-spark-user-list.1001560.n3.nabble.com/Asynchronous-Broadcast-from-driver-to-workers-is-it-possible-td15758.html>
>>> what may seem like a solution for me, but it doesn't work: "While Spark
>>> already offers support for asynchronous reduce (collect data from workers,
>>> while not interrupting execution of a parallel transformation) through
>>> accumulator"
>>>
>>> Another post suggested using SparkListener to do that.
>>>
>>> Are these solutions correct? If yes, could you give me a simple example?
>>> Are there other solutions?
>>>
>>> Thank you.
>>> Regards
>>>
>>>
>>>
>>>
>>>
>