David,

Here is what I would suggest:

1 - Does a new SparkContext get created in the web tier for each new request
for processing?
Create a single SparkContext and share it across all web requests. Depending 
on the framework you are using for the web tier, it should not be difficult to 
create a global singleton object that holds the SparkContext.
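
For example, a minimal sketch in Scala; the object name, app name, and master 
URL below are just placeholders, not taken from your setup:

  import org.apache.spark.{SparkConf, SparkContext}

  // Illustrative singleton: the context is created once when the web tier
  // starts and is reused by every request handler.
  object SparkContextHolder {
    private val conf = new SparkConf()
      .setAppName("web-tier")            // placeholder app name
      .setMaster("spark://master:7077")  // placeholder master URL

    lazy val sc: SparkContext = new SparkContext(conf)
  }

Request handlers would then use SparkContextHolder.sc instead of constructing 
their own context.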

2 - If so, how much time should we expect it to take for setting up the
context?  Our goal is to return a response to the users in under 10 seconds,
but if it takes many seconds to create a new context or otherwise set up the
job, then we need to adjust our expectations for what is possible.  From
using spark-shell one might conclude that creating a context takes more than 
10 seconds; however, it's not clear how much of that time is context creation 
vs. other things.

3 - (This last question perhaps deserves a post in and of itself:) if every
job is always comparing some little data structure to the same HDFS corpus
of data, what is the best pattern to use to cache the RDDs from HDFS so
they don't always have to be reconstituted from disk?  I.e., how can RDDs
be "shared" from the context of one job to the context of subsequent jobs?
Or does something like memcache have to be used?
Create a cached RDD in a global singleton object that gets accessed by 
multiple web requests. You could put the cached RDD in the same object that 
holds the SparkContext if you like. I would need more details about your 
application to be more specific, but hopefully you get the idea.
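
Along the same lines, here is a rough sketch; it assumes the corpus is plain 
text in HDFS, and the path and element type are placeholders:

  import org.apache.spark.rdd.RDD

  // Illustrative: load the HDFS corpus once, cache it in memory, and let
  // every web request reuse the same RDD through this singleton.
  object CorpusCache {
    lazy val corpus: RDD[String] =
      SparkContextHolder.sc
        .textFile("hdfs:///path/to/corpus")  // placeholder path
        .cache()
  }

Since your corpus changes incrementally each day, you would also want a 
refresh step that unpersists the old RDD and reloads it.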


Mohammed

From: Evan R. Sparks [mailto:evan.spa...@gmail.com]
Sent: Tuesday, November 11, 2014 8:54 AM
To: Sonal Goyal
Cc: bethesda; u...@spark.incubator.apache.org
Subject: Re: Best practice for multi-user web controller in front of Spark

For sharing RDDs across multiple jobs, you could also have a look at Tachyon. 
It provides an HDFS-compatible in-memory storage layer that keeps data in 
memory across multiple jobs/frameworks: http://tachyon-project.org/.
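
As a rough illustration only, assuming the Tachyon client is on Spark's 
classpath; the host, port, and paths are placeholders:

  import org.apache.spark.{SparkConf, SparkContext}

  // One job writes the corpus into Tachyon's in-memory storage; later jobs
  // (even ones using a different SparkContext) can read it back without
  // going to disk, because Tachyon speaks the HDFS API.
  object TachyonShareSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("tachyon-share"))

      sc.textFile("hdfs:///path/to/corpus")                       // placeholder path
        .saveAsTextFile("tachyon://tachyon-master:19998/corpus")  // placeholder host/port

      val shared = sc.textFile("tachyon://tachyon-master:19998/corpus")
      println(shared.count())

      sc.stop()
    }
  }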

-

On Tue, Nov 11, 2014 at 8:11 AM, Sonal Goyal <sonalgoy...@gmail.com> wrote:
I believe the Spark Job Server by Ooyala can help you share data across 
multiple jobs; take a look at 
http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server. It seems 
to be a close fit for what you need.
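
Roughly, a job for the job server looks like the sketch below: the server owns 
the SparkContext and hands it to each job, which is how data can be shared 
between jobs. The trait and method names follow the project's README as I 
recall it, so please double-check them there; the HDFS path is a placeholder:

  import com.typesafe.config.Config
  import org.apache.spark.SparkContext
  import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

  // Illustrative job: the job server creates and reuses the SparkContext,
  // so RDDs cached in it survive across submitted jobs.
  object CorpusCompareJob extends SparkJob {
    override def validate(sc: SparkContext, config: Config): SparkJobValidation =
      SparkJobValid

    override def runJob(sc: SparkContext, config: Config): Any = {
      // placeholder logic: count lines in the shared corpus
      sc.textFile("hdfs:///path/to/corpus").count()
    }
  }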

Best Regards,
Sonal
Founder, Nube Technologies <http://www.nubetech.co>




On Tue, Nov 11, 2014 at 7:20 PM, bethesda <swearinge...@mac.com> wrote:
We are relatively new to Spark and so far, during our development process, 
have been manually submitting single jobs at a time for ML training, using 
spark-submit.  Each job accepts a small user-submitted data set and compares 
it to every data set in our HDFS corpus, which only changes incrementally on 
a daily basis.  (That detail is relevant to question 3 below.)

Now we are ready to start building out the front end, which will allow a 
team of data scientists to submit their problems to the system via a web 
interface (the web tier will be Java).  Users could of course be submitting 
jobs more or less simultaneously.  We want to make sure we understand how 
best to structure this.

Questions:

1 - Does a new SparkContext get created in the web tier for each new request
for processing?

2 - If so, how much time should we expect it to take for setting up the
context?  Our goal is to return a response to the users in under 10 seconds,
but if it takes many seconds to create a new context or otherwise set up the
job, then we need to adjust our expectations for what is possible.  From
using spark-shell one might conclude that creating a context takes more than 
10 seconds; however, it's not clear how much of that time is context creation 
vs. other things.

3 - (This last question perhaps deserves a post in and of itself:) if every
job is always comparing some little data structure to the same HDFS corpus
of data, what is the best pattern to use to cache the RDDs from HDFS so
they don't always have to be reconstituted from disk?  I.e., how can RDDs
be "shared" from the context of one job to the context of subsequent jobs?
Or does something like memcache have to be used?

Thanks!
David


