Re: Best practice for multi-user web controller in front of Spark
I believe the Spark Job Server by Ooyala can help you share data across multiple jobs; take a look at http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server. It seems to be a close fit for what you need.

Best Regards,
Sonal
Founder, Nube Technologies
http://www.nubetech.co
http://in.linkedin.com/in/sonalgoyal

On Tue, Nov 11, 2014 at 7:20 PM, bethesda swearinge...@mac.com wrote:

> We are relatively new to Spark, and so far, during our development process, we have been manually submitting single jobs at a time for ML training, using spark-submit. Each job accepts a small user-submitted data set and compares it to every data set in our HDFS corpus, which changes only incrementally on a daily basis. (That detail is relevant to question 3 below.)
>
> Now we are ready to start building out the front end, which will allow a team of data scientists to submit their problems to the system via a web front end (the web tier will be Java). Users could of course be submitting jobs more or less simultaneously. We want to make sure we understand how best to structure this.
>
> Questions:
>
> 1 - Does a new SparkContext get created in the web tier for each new request for processing?
>
> 2 - If so, how much time should we expect it to take to set up the context? Our goal is to return a response to users in under 10 seconds, but if it takes many seconds to create a new context or otherwise set up the job, then we need to adjust our expectations for what is possible. From using spark-shell, one might conclude that it takes more than 10 seconds to create a context, but it's not clear how much of that is context creation vs. other things.
>
> 3 - (This last question perhaps deserves a post in and of itself.) If every job is always comparing some small data structure to the same HDFS corpus of data, what is the best pattern for caching the RDDs from HDFS so they don't always have to be reconstituted from disk? I.e., how can RDDs be shared from the context of one job to the context of subsequent jobs? Or does something like memcached have to be used?
>
> Thanks!
> David
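For what it's worth, the Job Server bears on questions 1 and 2 directly: it keeps one or more long-lived SparkContexts alive and lets clients trigger jobs against them over HTTP, so the per-request cost of creating a context disappears. Roughly, a job looks like the sketch below; this is based on my reading of the project, so the trait and package names should be double-checked against the Job Server README, and the corpus path and config key are hypothetical.

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import spark.jobserver.{SparkJob, SparkJobValid, SparkJobValidation}

    // A job the Job Server runs against a long-lived, shared SparkContext,
    // so nothing is paid per request for context startup.
    object CompareToCorpusJob extends SparkJob {

      // Called before runJob; lets the server reject bad requests cheaply.
      override def validate(sc: SparkContext, config: Config): SparkJobValidation =
        SparkJobValid

      // The server hands in an already-created SparkContext.
      override def runJob(sc: SparkContext, config: Config): Any = {
        val query  = config.getString("input.query")     // user-submitted data
        val corpus = sc.textFile("hdfs:///data/corpus")  // hypothetical path
        corpus.filter(_.contains(query)).count()
      }
    }

Jobs are packaged as a jar, uploaded to the server, and triggered via its REST endpoints; because the context persists between calls, RDDs cached in it persist as well, which also bears on question 3.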
Re: Best practice for multi-user web controller in front of Spark
For sharing RDDs across multiple jobs, you could also have a look at Tachyon. It provides an HDFS-compatible in-memory storage layer that keeps data in memory across multiple jobs/frameworks: http://tachyon-project.org/
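The usual pattern there, as I understand it, is to write an RDD to a tachyon:// path from one job and read it back from another, so the bytes stay in memory even across separate SparkContexts. A rough sketch, assuming the Tachyon client is on the classpath and configured, a master on the default port 19998, and hypothetical host and path names:

    import org.apache.spark.{SparkConf, SparkContext}

    object TachyonShareSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("tachyon-share"))

        // Job A: materialize the corpus once into Tachyon's in-memory store.
        sc.textFile("hdfs:///data/corpus")
          .saveAsTextFile("tachyon://tachyon-master:19998/corpus")

        // A later job (even one with a different SparkContext) reads it back
        // from memory instead of re-reading HDFS from disk.
        val corpus = sc.textFile("tachyon://tachyon-master:19998/corpus")
        println(corpus.count())

        sc.stop()
      }
    }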
RE: Best practice for multi-user web controller in front of Spark
David,

Here is what I would suggest:

> 1 - Does a new SparkContext get created in the web tier for each new request for processing?

Create a single SparkContext that gets shared across multiple web requests. Depending on the framework you are using for the web tier, it should not be difficult to create a global singleton object that holds the SparkContext.

> 3 - If every job is always comparing some small data structure to the same HDFS corpus of data, what is the best pattern for caching the RDDs from HDFS so they don't always have to be reconstituted from disk?

Create a cached RDD in a global singleton object, which gets accessed by multiple web requests. You could put the cached RDD in the same object that holds the SparkContext, if you like. I would need to know more about the specifics of your application to be more specific, but hopefully you get the idea.

Mohammed
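To make the singleton idea concrete, here is a minimal sketch in Scala (the same shape works from a Java web tier through a static holder class). The master URL, app name, and corpus path are placeholders for illustration, not anything from David's actual setup:

    import org.apache.spark.rdd.RDD
    import org.apache.spark.{SparkConf, SparkContext}

    // One SparkContext and one cached corpus RDD shared by the whole web tier.
    // A Scala `object` is initialized lazily and exactly once, so every
    // request handler that touches SparkHolder reuses the same context.
    object SparkHolder {
      val sc: SparkContext = new SparkContext(
        new SparkConf()
          .setMaster("spark://master:7077") // hypothetical master URL
          .setAppName("ml-web-frontend"))   // hypothetical app name

      // Loaded from HDFS once; cache() keeps it in executor memory,
      // so later requests skip the disk read. Path is hypothetical.
      val corpus: RDD[String] = sc.textFile("hdfs:///data/corpus").cache()
    }

    object RequestHandlers {
      // Called per web request: the first call pays the HDFS load;
      // subsequent calls hit the in-memory cached RDD.
      def handleRequest(query: String): Long =
        SparkHolder.corpus.filter(_.contains(query)).count()
    }

A SparkContext can safely submit jobs from multiple threads, so concurrent users simply become concurrent jobs on the one shared context, and only the first request pays the cost of loading the corpus.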
Re: Best practice for multi-user web controller in front of Spark
Hi,

there is also Spindle (https://github.com/adobe-research/spindle), which was introduced on this list some time ago. I haven't looked into it deeply, but you might gain some valuable insights from their architecture; they are also using Spark to fulfill requests coming from the web.

Tobias