Re: simultaneous actions
Simultaneous actions work fine on a cluster if they are independent. On local I never paid attention, but the code path should be similar.

On Jan 18, 2016 8:00 AM, "Koert Kuipers" <ko...@tresata.com> wrote:
> stacktrace? details?
Re: simultaneous actions
Hi,

I am running my app on a single machine first before moving it to the cluster; simultaneous actions are not working for me right now. Is this coming from the fact that I am using a single machine? Yet I am using the FAIR scheduler.

2016-01-17 21:23 GMT+01:00 Mark Hamstra <m...@clearstorydata.com>:
> It can be far more than that (e.g. https://issues.apache.org/jira/browse/SPARK-11838), and is generally either unrecognized or a greatly under-appreciated and underused feature of Spark.
Re: simultaneous actions
stacktrace? details?

On Mon, Jan 18, 2016 at 5:58 AM, Mennour Rostom <mennou...@gmail.com> wrote:
> I am running my app on a single machine first before moving it to the cluster; simultaneous actions are not working for me right now. Is this coming from the fact that I am using a single machine? Yet I am using the FAIR scheduler.
Re: simultaneous actions
the re-use of shuffle files is always a nice surprise to me

On Sun, Jan 17, 2016 at 3:17 PM, Mark Hamstra <m...@clearstorydata.com> wrote:
> Same SparkContext means same pool of Workers. It's up to the Scheduler, not the SparkContext, whether the exact same Workers or Executors will be used to calculate simultaneous actions against the same RDD.
Re: simultaneous actions
They'll be able to run concurrently and share workers / data. Take a look at http://spark.apache.org/docs/latest/job-scheduling.html for how scheduling happens across multiple running jobs in the same SparkContext.

Matei

> On Jan 17, 2016, at 8:06 AM, Koert Kuipers <ko...@tresata.com> wrote:
>
> Same rdd means same sparkcontext means same workers
>
> Cache/persist the rdd to avoid repeated jobs
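The scheduling behaviour Matei points at — several jobs submitted to one context, with tasks dispatched as workers free up — can be sketched without Spark at all. The following is a hypothetical, plain-Python stand-in: a fixed pool of two "executor" threads is shared by two jobs submitted at the same time, and each task runs whenever a worker becomes available rather than one job draining the pool first. None of the names here are Spark API.

```python
from concurrent.futures import ThreadPoolExecutor

# A fixed pool of two "executor" threads, shared by all jobs —
# roughly the role the cluster's executors play for one SparkContext.
executors = ThreadPoolExecutor(max_workers=2)

def run_job(name, n_tasks):
    # Submit this job's tasks to the shared executor pool and collect results.
    futures = [executors.submit(lambda i=i: (name, i)) for i in range(n_tasks)]
    return [f.result() for f in futures]

# Two "driver threads" submit their jobs simultaneously; their tasks
# interleave on the shared executors on a workers-become-available basis.
with ThreadPoolExecutor(max_workers=2) as drivers:
    job_a = drivers.submit(run_job, "a", 3)
    job_b = drivers.submit(run_job, "b", 3)
    results_a, results_b = job_a.result(), job_b.result()

executors.shutdown()
print(results_a, results_b)
```

In real Spark, which job's tasks run first within the shared pool is governed by the scheduling mode (FIFO by default, FAIR if configured), as described in the job-scheduling page linked above.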
Re: simultaneous actions
Same SparkContext means same pool of Workers. It's up to the Scheduler, not the SparkContext, whether the exact same Workers or Executors will be used to calculate simultaneous actions against the same RDD. It is likely that many of the same Workers and Executors will be used as the Scheduler tries to preserve data locality, but that is not guaranteed. In fact, what is most likely to happen is that the shared Stages and Tasks being calculated for the simultaneous actions will not actually be run at exactly the same time, which means that shuffle files produced for one action will be reused by the other(s), and repeated calculations will be avoided even without explicitly caching/persisting the RDD.

On Sun, Jan 17, 2016 at 8:06 AM, Koert Kuipers <ko...@tresata.com> wrote:
> Same rdd means same sparkcontext means same workers
>
> Cache/persist the rdd to avoid repeated jobs
Re: simultaneous actions
It can be far more than that (e.g. https://issues.apache.org/jira/browse/SPARK-11838), and is generally either unrecognized or a greatly under-appreciated and underused feature of Spark.

On Sun, Jan 17, 2016 at 12:20 PM, Koert Kuipers <ko...@tresata.com> wrote:
> the re-use of shuffle files is always a nice surprise to me
Re: simultaneous actions
Hi,

Thank you all for your answers.

If I understand correctly, actions (in my case foreach) can be run concurrently and simultaneously on the SAME RDD, which is logical because RDDs are read-only objects. However, I want to know whether the same workers are used for the concurrent analysis?

Thank you

2016-01-15 21:11 GMT+01:00 Jakob Odersky <joder...@gmail.com>:
> I stand corrected. How considerable are the benefits though? Will the scheduler be able to dispatch jobs from both actions simultaneously (or on a when-workers-become-available basis)?
>
> On 15 January 2016 at 11:44, Koert Kuipers <ko...@tresata.com> wrote:
>> we run multiple actions on the same (cached) rdd all the time, i guess in different threads indeed (its in akka)
>>
>> On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>> RDDs actually are thread-safe, and quite a few applications use them this way, e.g. the JDBC server.
>>>
>>> Matei
Re: simultaneous actions
Same RDD means same SparkContext, which means the same workers.

Cache/persist the RDD to avoid repeated jobs.

On Jan 17, 2016 5:21 AM, "Mennour Rostom" <mennou...@gmail.com> wrote:

> Hi,
>
> Thank you all for your answers.
>
> If I understand correctly, actions (in my case foreach) can be run
> concurrently and simultaneously on the SAME RDD (which is logical, because
> RDDs are read-only objects). However, I want to know whether the same
> workers are used for the concurrent analysis?
>
> Thank you
>
> [...]
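Koert's "cache/persist to avoid repeated jobs" point can be sketched without a cluster. The toy class below is a plain-Python stand-in (not the Spark API; `FakeRDD` and its methods are illustrative only): an uncached "RDD" re-runs its lineage for every action, while a cached one computes once and serves both concurrent actions.

```python
import threading

class FakeRDD:
    """Toy stand-in for an RDD: recomputes its lineage unless cached."""
    def __init__(self, compute):
        self._compute = compute          # the "lineage"
        self._cache = None
        self.compute_count = 0
        self._lock = threading.Lock()

    def cache(self):
        with self._lock:
            if self._cache is None:
                self.compute_count += 1
                self._cache = self._compute()
        return self

    def collect(self):                   # an "action"
        with self._lock:
            if self._cache is not None:
                return list(self._cache)
            self.compute_count += 1
            return list(self._compute())

# Two simultaneous actions on an uncached RDD: the lineage runs twice.
rdd = FakeRDD(lambda: [x * x for x in range(5)])
threads = [threading.Thread(target=rdd.collect) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(rdd.compute_count)   # 2: each action recomputed the data

# Same two actions after cache(): computed once, reused by both.
rdd2 = FakeRDD(lambda: [x * x for x in range(5)]).cache()
threads = [threading.Thread(target=rdd2.collect) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print(rdd2.compute_count)  # 1: cached once, shared by both actions
```

As the thread later notes, real Spark can also dodge some recomputation via shuffle-file reuse even without caching, but an explicit `cache()`/`persist()` is the reliable way to share work between simultaneous actions.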
simultaneous actions
Hi,

Can we run *simultaneous* actions on the *same RDD*? If yes, how can this be done?

Thank you,
Regards

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/simultaneous-actions-tp25977.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
Re: simultaneous actions
Threads.

On Friday, January 15, 2016, Kira <mennou...@gmail.com> wrote:

> Hi,
>
> Can we run *simultaneous* actions on the *same RDD*? If yes, how can this
> be done?
>
> [...]
Re: simultaneous actions
Can you run N jobs depending on the same RDD in parallel on the driver? Certainly. The context/scheduling is thread-safe, and the RDD is immutable. I've done this to, for example, build and evaluate a bunch of models simultaneously on a big cluster.

On Fri, Jan 15, 2016 at 7:10 PM, Jakob Odersky <joder...@gmail.com> wrote:

> I don't think RDDs are threadsafe.
> More fundamentally however, why would you want to run RDD actions in
> parallel? [...]
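The "build and evaluate a bunch of models at once" pattern is just N driver threads, each submitting its own job. A minimal stdlib sketch of the shape (no Spark here; `train` and the configs are made-up stand-ins for whatever each job would actually do against a shared cached RDD):

```python
from concurrent.futures import ThreadPoolExecutor

def train(regularization):
    """Hypothetical stand-in: in real code this would trigger a Spark
    action, e.g. fit a model and score it on a shared cached RDD."""
    return {"reg": regularization, "score": 1.0 / (1.0 + regularization)}

configs = [0.0, 0.1, 1.0, 10.0]

# Each submitted call becomes an independent "job"; with a shared
# SparkContext they would all be scheduled on the same pool of workers.
with ThreadPoolExecutor(max_workers=len(configs)) as pool:
    results = list(pool.map(train, configs))

best = max(results, key=lambda r: r["score"])
print(best["reg"])   # prints 0.0: the highest score in this toy scorer
```

The driver threads are cheap; the heavy lifting stays on the cluster, which is why this scales to "a bunch of models on a big cluster."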
Re: simultaneous actions
I don't think RDDs are thread-safe.

More fundamentally, however, why would you want to run RDD actions in parallel? The idea behind RDDs is to provide you with an abstraction for computing parallel operations on distributed data. Even if you were to call actions from several threads at once, the individual executors of your Spark environment would still have to perform operations sequentially.

As an alternative, I would suggest restructuring your RDD transformations to compute the required results in a single operation.

On 15 January 2016 at 06:18, Jonathan Coveney <jcove...@gmail.com> wrote:

> Threads
>
> [...]
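Jakob's alternative — folding several actions into one pass — can be sketched as computing, say, a sum and a count in a single traversal instead of two separate actions. In Spark this is roughly what `RDD.aggregate` is for; the plain-Python analogue:

```python
from functools import reduce

data = list(range(10))

# Two separate "actions": two full passes over the data.
total = sum(data)
count = len(data)

# One combined pass: a single fold whose accumulator carries both
# results, analogous to aggregate((0, 0), seqOp, combOp) in Spark.
total2, count2 = reduce(lambda acc, x: (acc[0] + x, acc[1] + 1),
                        data, (0, 0))

print(total2, count2)  # prints: 45 10
assert (total, count) == (total2, count2)
```

When the results genuinely derive from one traversal, this saves a whole job; the rest of the thread covers the cases where separate simultaneous actions are unavoidable.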
Re: simultaneous actions
SparkContext is thread-safe, and RDDs just describe operations. While I generally agree that you want to model as much as possible as transformations, this is not always possible, and in that case you have no option but to use threads. Spark's designers should have made all actions return Futures, but alas...

On Friday, January 15, 2016, Jakob Odersky <joder...@gmail.com> wrote:

> I don't think RDDs are threadsafe.
> More fundamentally however, why would you want to run RDD actions in
> parallel? [...]
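The "actions should have returned Futures" point is easy to recover by hand: wrap each blocking call in `concurrent.futures`. A sketch with stand-in actions (the sleeps simulate blocking Spark actions like `count()`; no real cluster involved):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def action_a():
    time.sleep(0.2)          # pretend this is a blocking rdd.count()
    return 42

def action_b():
    time.sleep(0.2)          # pretend this is a blocking rdd.collect()
    return "saved"

start = time.monotonic()
with ThreadPoolExecutor(max_workers=2) as pool:
    fa = pool.submit(action_a)   # submit() returns a Future immediately
    fb = pool.submit(action_b)
    results = (fa.result(), fb.result())
elapsed = time.monotonic() - start

print(results, round(elapsed, 2))  # the two sleeps overlap: ~0.2 s, not 0.4 s
```

Against a real SparkContext the same two futures would become two jobs competing for the same executors, which is exactly the scheduling question discussed below in the thread.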
Re: simultaneous actions
RDDs actually are thread-safe, and quite a few applications use them this way, e.g. the JDBC server.

Matei

> On Jan 15, 2016, at 2:10 PM, Jakob Odersky <joder...@gmail.com> wrote:
>
> I don't think RDDs are threadsafe.
> More fundamentally however, why would you want to run RDD actions in
> parallel? [...]
Re: simultaneous actions
We run multiple actions on the same (cached) RDD all the time, I guess in different threads indeed (it's in Akka).

On Fri, Jan 15, 2016 at 2:40 PM, Matei Zaharia <matei.zaha...@gmail.com> wrote:

> RDDs actually are thread-safe, and quite a few applications use them this
> way, e.g. the JDBC server.
>
> Matei
>
> [...]
Re: simultaneous actions
I stand corrected. How considerable are the benefits, though? Will the scheduler be able to dispatch jobs from both actions simultaneously (or on a when-workers-become-available basis)?

On 15 January 2016 at 11:44, Koert Kuipers <ko...@tresata.com> wrote:

> we run multiple actions on the same (cached) rdd all the time, i guess in
> different threads indeed (its in akka)
>
> [...]
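The dispatch behaviour Jakob asks about is governed by the application's scheduler mode: with the default FIFO mode the first submitted job's stages get priority, while FAIR mode (which comes up later in the thread) lets concurrent jobs share executors round-robin. A hedged sketch of the standard configuration from the Spark job-scheduling documentation (the file path and pool name are made up):

```
# spark-defaults.conf: switch the in-application scheduler to FAIR
spark.scheduler.mode             FAIR
spark.scheduler.allocation.file  /path/to/fairscheduler.xml
```

Each driver thread can then opt its jobs into a pool before running an action with `sc.setLocalProperty("spark.scheduler.pool", "myPool")`, where `myPool` is a pool defined in the allocation file.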
Re: simultaneous actions
It makes sense if you're parallelizing jobs that have relatively few tasks and you have a lot of execution slots available: turn them all loose at once and try to use the parallelism available.

There are downsides, eventually. For example, N jobs accessing one cached RDD may recompute the RDD's partitions many times, since the cached copy may not be available when many of them start. At some level, way oversubscribing your cluster with a backlog of tasks is bad. And you might find it's a net loss if a bunch of tasks that all access the same data try to schedule at the same time, since only some can be local to the data.

On Fri, Jan 15, 2016 at 8:11 PM, Jakob Odersky <joder...@gmail.com> wrote:

> I stand corrected. How considerable are the benefits though? Will the
> scheduler be able to dispatch jobs from both actions simultaneously (or on
> a when-workers-become-available basis)?
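The oversubscription caveat above suggests bounding how many jobs you launch at once rather than turning them all loose. A small worker-pool cap does exactly that; stdlib sketch (the jobs are stand-ins, and the counters just verify the cap holds):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

in_flight = 0
peak = 0
lock = threading.Lock()

def job(i):
    """Stand-in for one action; tracks how many run concurrently."""
    global in_flight, peak
    with lock:
        in_flight += 1
        peak = max(peak, in_flight)
    result = i * i               # the pretend "work"
    with lock:
        in_flight -= 1
    return result

# Cap concurrency at 3 instead of launching all 20 jobs at once, so the
# backlog of competing tasks never oversubscribes the "cluster".
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(job, range(20)))

print(sorted(results)[:5])  # prints: [0, 1, 4, 9, 16]
assert peak <= 3
```

The cap keeps a few jobs in flight (good utilization) without queueing every job's tasks simultaneously, which also reduces the cached-partition-eviction and data-locality contention described above.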