For sharing RDDs across multiple jobs, you could also have a look at
Tachyon. It provides an HDFS-compatible in-memory storage layer that keeps
data in memory across multiple jobs/frameworks: http://tachyon-project.org/
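
As a rough illustration (not from the original thread), the pattern is
that one job writes the corpus to a tachyon:// path and later jobs read
it back from Tachyon's memory instead of re-reading HDFS. The host name,
port, and paths below are placeholders, and it assumes the Tachyon client
jar is on the classpath and registered as a Hadoop-compatible filesystem:

import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: host, port, and paths are placeholders, and the Tachyon
// client must be configured as a Hadoop-compatible filesystem.
object TachyonShareSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tachyon-share"))

    // "Job A": read the corpus once from HDFS and materialize it in Tachyon.
    val corpus = sc.textFile("hdfs:///corpus/daily")
    corpus.saveAsTextFile("tachyon://tachyon-master:19998/corpus/daily")

    // "Job B" (could be a separate application or framework): read it back
    // from Tachyon's in-memory storage instead of going to disk again.
    val cached = sc.textFile("tachyon://tachyon-master:19998/corpus/daily")
    println(cached.count())

    sc.stop()
  }
}

Spark also has StorageLevel.OFF_HEAP, which keeps RDD blocks in Tachyon,
but as far as I understand those blocks are still tracked by the owning
SparkContext; writing to an explicit tachyon:// path is what lets a
completely separate job or framework pick the data up.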


On Tue, Nov 11, 2014 at 8:11 AM, Sonal Goyal <sonalgoy...@gmail.com> wrote:

> I believe the Spark Job Server by Ooyala can help you share data across
> multiple jobs; take a look at
> http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server. It
> seems to be a close fit for what you need.
>
> Best Regards,
> Sonal
> Founder, Nube Technologies <http://www.nubetech.co>
>
> <http://in.linkedin.com/in/sonalgoyal>
>
>
>
> On Tue, Nov 11, 2014 at 7:20 PM, bethesda <swearinge...@mac.com> wrote:
>
>> We are relatively new to Spark and so far, during our development
>> process, have been manually submitting single jobs at a time for ML
>> training using spark-submit.  Each job accepts a small user-submitted
>> data set and compares it to every data set in our HDFS corpus, which
>> only changes incrementally on a daily basis.  (That detail is relevant
>> to question 3 below.)
>>
>> Now we are ready to start building out the front-end, which will allow
>> a team of data scientists to submit their problems to the system via a
>> web front-end (the web tier will be Java).  Users could of course be
>> submitting jobs more or less simultaneously.  We want to make sure we
>> understand how best to structure this.
>>
>> Questions:
>>
>> 1 - Does a new SparkContext get created in the web tier for each new
>> request for processing?
>>
>> 2 - If so, how much time should we expect it to take to set up the
>> context?  Our goal is to return a response to users in under 10
>> seconds, but if it takes many seconds to create a new context or
>> otherwise set up the job, then we need to adjust our expectations for
>> what is possible.  From using spark-shell, one might conclude that it
>> takes more than 10 seconds to create a context; however, it's not clear
>> how much of that is context creation vs. other things.
>>
>> 3 - (This last question perhaps deserves a post in and of itself.)  If
>> every job is always comparing some small data structure to the same
>> HDFS corpus of data, what is the best pattern to use to cache the RDDs
>> from HDFS so they don't always have to be reconstituted from disk?
>> I.e., how can RDDs be "shared" from the context of one job to the
>> context of subsequent jobs?  Or does something like memcached have to
>> be used?
>>
>> Thanks!
>> David
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Best-practice-for-multi-user-web-controller-in-front-of-Spark-tp18581.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>>
>
