Ashwin,

What is your motivation for sharing RDDs between jobs? Are you trying to
optimize for reusing data across jobs?

If so, you may want to look into Tachyon. My understanding is that Tachyon
acts as a caching layer: you can designate data that will be reused by
multiple jobs, so it knows to keep it in memory or on local disk for faster
access. But my knowledge of Tachyon is second-hand, so forgive me if I have
it wrong :)
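
For what it's worth, I believe the simplest way to try it is to treat Tachyon
as a Hadoop-compatible filesystem and point the usual save/load calls at a
tachyon:// URI. A rough sketch (the master host/port and paths are made up,
so adjust them for your deployment):

    // Job A: compute the dataset once and park the result in Tachyon.
    val cleaned = sc.textFile("hdfs:///logs/raw").filter(_.nonEmpty)
    cleaned.saveAsTextFile("tachyon://tachyon-master:19998/shared/cleaned")

    // Job B: a separate Spark app (its own SparkContext) reads the same
    // data back at memory speed instead of recomputing it.
    val reused = sc.textFile("tachyon://tachyon-master:19998/shared/cleaned")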

RJ

On Friday, October 24, 2014, Evan Chan <velvia.git...@gmail.com> wrote:

> Ashwin,
>
> I would say the strategies in general are:
>
> 1) Have each user submit a separate Spark app (each with its own
> SparkContext and its own resource settings), and share data through HDFS
> or something like Tachyon for speed.
>
> 2) Share a single SparkContext among multiple users, using the fair
> scheduler (rough sketch below).  This is sort of like having a Hadoop
> resource pool.  It has some obvious HA/SPOF issues, namely that if the
> context dies then every user using it is also dead.  Also, sharing RDDs in
> cached memory has the same resiliency problem: if any executor dies, Spark
> must recompute / rebuild the RDD (it tries to rebuild only the missing
> part, but sometimes it must rebuild everything).
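>
> A minimal sketch of the fair-scheduler setup for #2 (pool names and the
> allocation file path are just examples):
>
>     import org.apache.spark.{SparkConf, SparkContext}
>
>     val conf = new SparkConf()
>       .setAppName("shared-context")
>       .set("spark.scheduler.mode", "FAIR")  // fair scheduling instead of FIFO
>       // optional: define named pools (weights, minShare) in an XML file
>       .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
>     val sc = new SparkContext(conf)
>
>     // Tag work submitted on behalf of a given user with a pool so jobs
>     // from different users get a fair share of the context's resources.
>     sc.setLocalProperty("spark.scheduler.pool", "user_ashwin")
>     sc.textFile("hdfs:///data/input").count()
>     sc.setLocalProperty("spark.scheduler.pool", null)  // back to the default pool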
>
> Job server can help with 1 or 2, 2 in particular.  If you have any
> questions about job server, feel free to ask on the spark-jobserver
> Google group.  I am the maintainer.
>
> -Evan
>
>
> > On Thu, Oct 23, 2014 at 1:06 PM, Marcelo Vanzin <van...@cloudera.com> wrote:
> > You may want to take a look at https://issues.apache.org/jira/browse/SPARK-3174.
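> >
> > That patch adds elastic scaling of executors on YARN. If it lands in the
> > form proposed there, the knobs would look roughly like this (treat the
> > setting names and values as a sketch, since they could still change):
> >
> >     import org.apache.spark.SparkConf
> >
> >     val conf = new SparkConf()
> >       .set("spark.dynamicAllocation.enabled", "true")
> >       .set("spark.dynamicAllocation.minExecutors", "2")   // example floor
> >       .set("spark.dynamicAllocation.maxExecutors", "50")  // example cap
> >       // an external shuffle service lets executors be removed without
> >       // losing their shuffle output
> >       .set("spark.shuffle.service.enabled", "true")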
> >
> > On Thu, Oct 23, 2014 at 2:56 AM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
> >> Upvote for the multitenancy requirement.
> >>
> >> I'm also building a data analytics platform, and there will be multiple
> >> users running queries and computations simultaneously. One of the pain
> >> points is controlling resource sizes: users don't really know how many
> >> nodes they need, so they always ask for as much as possible... The result
> >> is a lot of wasted resources in our YARN cluster.
> >>
> >> A way to either 1) allow multiple Spark contexts to share the same
> >> resources, or 2) add dynamic resource management for YARN mode, is very
> >> much wanted.
> >>
> >> Jianshi
> >>
> >> On Thu, Oct 23, 2014 at 5:36 AM, Marcelo Vanzin <van...@cloudera.com> wrote:
> >>>
> >>> On Wed, Oct 22, 2014 at 2:17 PM, Ashwin Shankar
> >>> <ashwinshanka...@gmail.com> wrote:
> >>> >> That's not something you might want to do usually. In general, a
> >>> >> SparkContext maps to a user application
> >>> >
> >>> > My question was basically this. In this page of the official docs, under
> >>> > the "Scheduling within an application" section, it talks about multi-user
> >>> > fair sharing within an app. How does multi-user within an application
> >>> > work (how do users connect to an app and run their stuff)? When would I
> >>> > want to use this?
> >>>
> >>> I see. The way I read that page is that Spark supports all those
> >>> scheduling options, but Spark doesn't give you the means to submit jobs
> >>> from different users to a running SparkContext hosted in a different
> >>> process. For that, you'll need something like the job server that I
> >>> referenced before, or you'll have to write your own framework to
> >>> support that.
> >>>
> >>> Personally, I'd use the information on that page when dealing with
> >>> concurrent jobs in the same SparkContext, but still restricted to the
> >>> same user. I'd avoid trying to build any application where a single
> >>> SparkContext is shared by multiple users in any way.
> >>>
> >>> >> As far as I understand, this will cause executors to be killed, which
> >>> >> means that Spark will start retrying tasks to rebuild the data that
> >>> >> was held by those executors when needed.
> >>> >
> >>> > I basically wanted to find out if there were any "gotchas" related to
> >>> > preemption on Spark. For example, if half of an application's executors
> >>> > get preempted, say while doing a reduceByKey, will the application still
> >>> > make progress with the remaining resources/fair share?
> >>>
> >>> Jobs should still make progress as long as at least one executor is
> >>> available. The gotcha would be the one I mentioned, where Spark will
> >>> fail your job after "x" executors failed, which might be a common
> >>> occurrence when preemption is enabled. That being said, it's a
> >>> configurable option, so you can set "x" to a very large value and your
> >>> job should keep on chugging along.
> >>>
> >>> The options you'd want to take a look at are: spark.task.maxFailures
> >>> and spark.yarn.max.executor.failures
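> >>>
> >>> For example, something like this in your SparkConf (the numbers are just
> >>> placeholders; tune them to how aggressive preemption is on your cluster):
> >>>
> >>>     import org.apache.spark.SparkConf
> >>>
> >>>     val conf = new SparkConf()
> >>>       // allow many task retries before failing the whole job
> >>>       .set("spark.task.maxFailures", "32")
> >>>       // tolerate many lost executors before aborting the application
> >>>       .set("spark.yarn.max.executor.failures", "200")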
> >>>
> >>> --
> >>> Marcelo
> >>>
> >>
> >>
> >>
> >> --
> >> Jianshi Huang
> >>
> >> LinkedIn: jianshi
> >> Twitter: @jshuang
> >> Github & Blog: http://huangjs.github.com/
> >
> >
> >
> > --
> > Marcelo
> >
>

-- 
em rnowl...@gmail.com
c 954.496.2314
