Hi Jean,

As others have mentioned, you can use Alluxio with Spark DataFrames
<https://www.alluxio.com/blog/effective-spark-dataframes-with-alluxio> to keep
the data in memory and let other jobs read it back from memory again.
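For example, something along these lines might work. This is only a rough
sketch: it assumes the Alluxio client jar is on the Spark classpath, and the
alluxio:// master address, input path, and shared path below are placeholders
you would replace with your own.

    import org.apache.spark.sql.SparkSession

    // Program A: compute a result and write it through Alluxio so it stays
    // in Alluxio's memory tier instead of going to disk-backed storage.
    object ProgramA {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("program-a").getOrCreate()
        val resultA = spark.read.json("hdfs:///input/datasetA") // placeholder input
          .groupBy("key")
          .count()
        resultA.write.parquet("alluxio://alluxio-master:19998/shared/resultA")
        spark.stop()
      }
    }

    // Program C: a separate application (run later) reads the cached result
    // back from Alluxio and can combine it with other data.
    object ProgramC {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("program-c").getOrCreate()
        val resultA = spark.read.parquet("alluxio://alluxio-master:19998/shared/resultA")
        resultA.show()
        spark.stop()
      }
    }

The nice part is that programs A, B, and C stay completely independent Spark
applications; they only agree on the shared Alluxio paths.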
Hope this helps,
Gene

On Wed, Jun 21, 2017 at 8:08 AM, Jean Georges Perrin <j...@jgp.net> wrote:

> I have looked at Livy in the (very recent) past and it will not do the
> trick for me. It seems pretty greedy in terms of resources (or at least
> that was our experience). I will investigate how job-server could do the
> trick.
>
> (On a side note, I tried to find a paper on the memory lifecycle within
> Spark but was not very successful; maybe someone has a link to spare.)
>
> My need is to keep one/several dataframes in memory (well, within Spark)
> so it/they can be reused at a later time, without persisting it/them to
> disk (unless Spark wants to, of course).
>
>
> On Jun 21, 2017, at 10:47 AM, Michael Mior <mm...@uwaterloo.ca> wrote:
>
> This is a puzzling suggestion to me. It's unclear what features the OP
> needs, so it's really hard to say whether Livy or job-server aren't
> sufficient. It's true that neither is particularly mature, but they're
> much more mature than a homemade project which hasn't started yet.
>
> That said, I'm not very familiar with either project, so perhaps there are
> some big concerns I'm not aware of.
>
> --
> Michael Mior
> mm...@apache.org
>
> 2017-06-21 3:19 GMT-04:00 Rick Moritz <rah...@gmail.com>:
>
>> Keeping it inside the same program/SparkContext is the most performant
>> solution, since you can avoid serialization and deserialization.
>> In-memory persistence between jobs involves a memcopy, uses a lot of RAM,
>> and invokes serialization and deserialization. Technologies that can help
>> you do that easily are Ignite (as mentioned), but also Alluxio, Cassandra
>> with in-memory tables, and a memory-backed HDFS directory (see tiered
>> storage).
>> Although Livy and job-server provide the ability to share a single
>> SparkContext across multiple programs, I would recommend you build your
>> own framework for integrating different jobs, since many features you may
>> need aren't present yet, while others may cause issues due to lack of
>> maturity. Artificially splitting jobs is in general a bad idea, since it
>> breaks the DAG and thus prevents some potential push-down optimizations.
>>
>> On Tue, Jun 20, 2017 at 10:17 PM, Jean Georges Perrin <j...@jgp.net>
>> wrote:
>>
>>> Thanks Vadim & Jörn... I will look into those.
>>>
>>> jg
>>>
>>> On Jun 20, 2017, at 2:12 PM, Vadim Semenov <vadim.seme...@datadoghq.com>
>>> wrote:
>>>
>>> You can launch one permanent Spark context and then execute your jobs
>>> within that context. And since they'll be running in the same context,
>>> they can share data easily.
>>>
>>> These two projects provide the functionality that you need:
>>> https://github.com/spark-jobserver/spark-jobserver#persistent-context-mode---faster--required-for-related-jobs
>>> https://github.com/cloudera/livy#post-sessions
>>>
>>> On Tue, Jun 20, 2017 at 1:46 PM, Jean Georges Perrin <j...@jgp.net>
>>> wrote:
>>>
>>>> Hey,
>>>>
>>>> Here is my need: program A does something on a set of data and produces
>>>> results, program B does that on another set, and finally, program C
>>>> combines the data of A and B. Of course, the easy way is to dump it all
>>>> to disk after A and B are done, but I wanted to avoid this.
>>>>
>>>> I was thinking of creating a temp view, but I do not really like the
>>>> temp aspect of it ;).
>>>> Any idea? (they are all worth sharing)
>>>>
>>>> jg
>>>>
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>>>>
>>>
>>
>