For sharing RDDs across multiple jobs, you could also have a look at Tachyon. It provides an HDFS-compatible in-memory storage layer that keeps data in memory across multiple jobs/frameworks: http://tachyon-project.org/ .
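As a very rough sketch (untested; it assumes a Tachyon master at tachyon://tachyon-master:19998, the Tachyon client jar on Spark's classpath, and made-up paths), one job could materialize the corpus into Tachyon once and later jobs could read it back from memory instead of re-reading HDFS:

import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("corpus-to-tachyon"))

// Write the HDFS corpus into Tachyon once (host and paths are placeholders)
sc.textFile("hdfs:///data/corpus")
  .saveAsTextFile("tachyon://tachyon-master:19998/corpus")

// A later job (with its own SparkContext, or even a different framework)
// can then read the same data back from Tachyon memory:
val corpus = sc.textFile("tachyon://tachyon-master:19998/corpus")
println(corpus.count())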
-

On Tue, Nov 11, 2014 at 8:11 AM, Sonal Goyal <sonalgoy...@gmail.com> wrote:

> I believe the Spark Job Server by Ooyala can help you share data across
> multiple jobs; take a look at
> http://engineering.ooyala.com/blog/open-sourcing-our-spark-job-server.
> It seems to fit closely to what you need.
>
> Best Regards,
> Sonal
> Founder, Nube Technologies <http://www.nubetech.co>
> <http://in.linkedin.com/in/sonalgoyal>
>
> On Tue, Nov 11, 2014 at 7:20 PM, bethesda <swearinge...@mac.com> wrote:
>
>> We are relatively new to Spark, and so far during our development process
>> we have been manually submitting single jobs at a time for ML training,
>> using spark-submit. Each job accepts a small user-submitted data set and
>> compares it to every data set in our HDFS corpus, which changes only
>> incrementally on a daily basis (that detail is relevant to question 3
>> below).
>>
>> Now we are ready to start building out the front-end, which will allow a
>> team of data scientists to submit their problems to the system via a web
>> front-end (the web tier will be Java). Users could of course be submitting
>> jobs more or less simultaneously. We want to make sure we understand how
>> best to structure this.
>>
>> Questions:
>>
>> 1 - Does a new SparkContext get created in the web tier for each new
>> processing request?
>>
>> 2 - If so, how much time should we expect it to take to set up the
>> context? Our goal is to return a response to users in under 10 seconds,
>> but if it takes many seconds to create a new context or otherwise set up
>> the job, then we need to adjust our expectations of what is possible. From
>> using spark-shell one might conclude that it takes more than 10 seconds to
>> create a context, but it's not clear how much of that is context creation
>> vs. other startup work.
>>
>> 3 - (This last question perhaps deserves a post of its own.) If every job
>> always compares some small data structure to the same HDFS corpus, what is
>> the best pattern for caching the RDDs built from HDFS so they don't have
>> to be reconstituted from disk every time? I.e., how can RDDs be "shared"
>> from the context of one job to the context of subsequent jobs? Or does
>> something like memcached have to be used?
>>
>> Thanks!
>> David
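On questions 1 and 3: you don't need to create a new SparkContext per request. A common pattern is to keep one long-lived SparkContext in a backend process (the web tier itself, or a separate job-server process the web tier talks to), load the HDFS corpus once, cache() it, and run each user request as a job against that cached RDD. Jobs from different users can run concurrently on the same context, and the fair scheduler can keep one request from starving the others. A very rough sketch (untested; names and paths are made up):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// One SparkContext for the lifetime of the server process, not one per request.
object SharedSpark {
  val sc = new SparkContext(new SparkConf().setAppName("ml-web-backend"))

  // Load the corpus once and keep it in cluster memory; when the daily
  // increment lands, unpersist() and rebuild it from HDFS.
  val corpus: RDD[String] = sc.textFile("hdfs:///data/corpus").cache()
}

// Each incoming web request becomes a job on the shared context.
def handleRequest(userData: Seq[String]): Long = {
  val userSet = SharedSpark.sc.broadcast(userData.toSet)
  SharedSpark.corpus.filter(line => userSet.value.contains(line)).count()
}

With the corpus already cached, per-request latency is the job itself rather than context startup.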
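If you'd rather not build and operate that shared-context process yourself, the Spark Job Server Sonal mentioned does this for you: it keeps a persistent SparkContext and has "named RDDs" for sharing cached RDDs between jobs submitted over its REST API. From memory, a job looks roughly like the following; please check the job server docs for the exact traits and signatures, as I may be misremembering details:

import com.typesafe.config.Config
import org.apache.spark.SparkContext
import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobValid, SparkJobValidation}

object CompareJob extends SparkJob with NamedRddSupport {
  override def validate(sc: SparkContext, config: Config): SparkJobValidation = SparkJobValid

  override def runJob(sc: SparkContext, config: Config): Any = {
    // getOrElseCreate builds (and caches) the corpus RDD on first use, then
    // hands the same RDD to every later job that runs on this context.
    val corpus = namedRdds.getOrElseCreate("corpus") {
      sc.textFile("hdfs:///data/corpus")   // placeholder path
    }
    val query = config.getString("input.query")   // the small user-submitted data
    corpus.filter(_.contains(query)).count()
  }
}

The web tier would then just POST each user's data to the job server and wait on (or poll for) the result.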