the re-use of shuffle files is always a nice surprise to me

On Sun, Jan 17, 2016 at 3:17 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> Same SparkContext means same pool of Workers. It's up to the Scheduler,
> not the SparkContext, whether the exact same Workers or Executors will be
> used to calculate simultaneous actions against the same RDD. It is likely
> that many of the same Workers and Executors will be used as the Scheduler
> tries to preserve data locality, but that is not guaranteed. In fact, what
> is most likely to happen is that the shared Stages and Tasks being
> calculated for the simultaneous actions will not actually be run at exactly
> the same time, which means that shuffle files produced for one action will
> be reused by the other(s), and repeated calculations will be avoided even
> without explicitly caching/persisting the RDD.
>
> On Sun, Jan 17, 2016 at 8:06 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> Same RDD means same SparkContext means same Workers.
>>
>> Cache/persist the RDD to avoid repeated jobs.
>>
>> On Jan 17, 2016 5:21 AM, "Mennour Rostom" <mennou...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Thank you all for your answers.
>>>
>>> If I understand correctly, actions (in my case foreach) can be run
>>> concurrently and simultaneously on the SAME RDD (which is logical,
>>> because RDDs are read-only objects). However, I want to know whether
>>> the same workers are used for the concurrent analyses?
>>>
>>> Thank you
>>>
>>> 2016-01-15 21:11 GMT+01:00 Jakob Odersky <joder...@gmail.com>:
>>>
>>>> I stand corrected. How considerable are the benefits, though? Will the
>>>> scheduler be able to dispatch jobs from both actions simultaneously
>>>> (or on a when-workers-become-available basis)?
>>>>
>>>> On 15 January 2016 at 11:44, Koert Kuipers <ko...@tresata.com> wrote:
>>>>
>>>>> We run multiple actions on the same (cached) RDD all the time, I
>>>>> guess in different threads indeed (it's in Akka).
>>>>>
>>>>> On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>>
>>>>>> RDDs actually are thread-safe, and quite a few applications use them
>>>>>> this way, e.g. the JDBC server.
>>>>>>
>>>>>> Matei
>>>>>>
>>>>>> On Jan 15, 2016, at 2:10 PM, Jakob Odersky <joder...@gmail.com> wrote:
>>>>>>
>>>>>> I don't think RDDs are thread-safe.
>>>>>> More fundamentally, however, why would you want to run RDD actions
>>>>>> in parallel? The idea behind RDDs is to provide you with an
>>>>>> abstraction for computing parallel operations on distributed data.
>>>>>> Even if you were to call actions from several threads at once, the
>>>>>> individual executors of your Spark environment would still have to
>>>>>> perform operations sequentially.
>>>>>>
>>>>>> As an alternative, I would suggest restructuring your RDD
>>>>>> transformations to compute the required results in one single
>>>>>> operation.
>>>>>>
>>>>>> On 15 January 2016 at 06:18, Jonathan Coveney <jcove...@gmail.com> wrote:
>>>>>>
>>>>>>> Threads
>>>>>>>
>>>>>>> On Friday, January 15, 2016, Kira <mennou...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Can we run *simultaneous* actions on the *same RDD*? If yes, how
>>>>>>>> can this be done?
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> --
>>>>>>>> View this message in context:
>>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
>>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>>> Nabble.com.
>>>>>>>>
>>>>>>>> ---------------------------------------------------------------------
>>>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
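[Editor's note] The advice in the thread above (persist the RDD, then fire the actions from separate threads) can be sketched as follows. This is a minimal local-mode sketch, not code from the thread; the app name, RDD contents, and timeout are illustrative.

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Local-mode context for illustration; in a real app this already exists.
val sc = new SparkContext(
  new SparkConf().setAppName("simultaneous-actions").setMaster("local[*]"))

// One shared RDD; persist it so the two jobs below don't both recompute it.
val rdd = sc.parallelize(1 to 1000000).map(_ * 2).persist(StorageLevel.MEMORY_ONLY)

// Fire two actions on the SAME RDD from separate threads.
// SparkContext and RDDs are thread-safe, so this is supported.
val countF = Future { rdd.count() }
val sumF   = Future { rdd.map(_.toLong).sum() }

val count = Await.result(countF, 10.minutes)
val sum   = Await.result(sumF, 10.minutes)

sc.stop()
```

As Mark notes, this works even without `persist`: shuffle files produced by one job are often reused by the other, so persisting just makes the reuse explicit. Note that jobs submitted this way are scheduled FIFO by default; setting `spark.scheduler.mode` to `FAIR` lets the simultaneous jobs share executors more evenly.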