I believe the Spark Job Server by Ooyala can help you share data across
multiple jobs; take a look at
http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server. It
seems to fit closely with what you need.
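
For example, the Job Server's named-RDD support lets one job cache your HDFS
corpus and later jobs look it up by name, so the corpus isn't re-read from
disk on every request. Here is a rough sketch (the class name, the "corpus"
RDD name, the config key, and the HDFS path are placeholders I made up;
please check the Job Server docs for the exact trait signatures):

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

object CorpusCompareJob extends SparkJob with NamedRddSupport {

  // Called before runJob; a real job would verify that "input.query" is set.
  override def validate(sc: SparkContext, config: Config): SparkJobValidation =
    SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    // Reuse the corpus if an earlier job already registered it under this
    // name; otherwise load it from HDFS, cache it, and register it so that
    // subsequent jobs in the same context can find it.
    val corpus: RDD[String] = namedRdds.get[String]("corpus").getOrElse {
      namedRdds.update("corpus", sc.textFile("hdfs:///path/to/corpus").cache())
    }
    // Placeholder comparison: count corpus records containing the query.
    val query = config.getString("input.query")
    corpus.filter(_.contains(query)).count()
  }
}

Because the Job Server keeps SparkContexts alive across jobs, this pattern
should also sidestep the per-request context start-up cost.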

Best Regards,
Sonal
Founder, Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>



On Tue, Nov 11, 2014 at 7:20 PM, bethesda <swearinge...@mac.com> wrote:

> We are relatively new to Spark. So far, during our development process, we
> have been manually submitting one job at a time for ML training, using
> spark-submit.  Each job accepts a small user-submitted data set and
> compares it to every data set in our HDFS corpus, which changes only
> incrementally on a daily basis.  (That detail is relevant to question 3
> below.)
>
> Now we are ready to start building out the front-end, which will allow a
> team of data scientists to submit their problems to the system via a web
> front-end (the web tier will be Java).  Users could of course be
> submitting jobs more or less simultaneously.  We want to make sure we
> understand how best to structure this.
>
> Questions:
>
> 1 - Does a new SparkContext get created in the web tier for each new
> processing request?
>
> 2 - If so, how long should we expect context setup to take?  Our goal is
> to return a response to users in under 10 seconds, but if it takes many
> seconds to create a new context or otherwise set up the job, then we need
> to adjust our expectations for what is possible.  From using spark-shell,
> one might conclude that creating a context takes more than 10 seconds;
> however, it's not clear how much of that is context creation vs. other
> things.
>
> 3 - (This last question perhaps deserves a post in and of itself.)  If
> every job always compares some small data structure to the same HDFS
> corpus of data, what is the best pattern for caching the RDDs from HDFS so
> they don't always have to be re-constituted from disk?  That is, how can
> RDDs be "shared" from the context of one job to the context of subsequent
> jobs?  Or does something like memcached have to be used?
>
> Thanks!
> David
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Best-practice-for-multi-user-web-controller-in-front-of-Spark-tp18581.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
