Hi Jean,

As others have mentioned, you can use Alluxio with Spark dataframes
<https://www.alluxio.com/blog/effective-spark-dataframes-with-alluxio> to
keep the data in memory so that other jobs can read it back from memory
later.
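
For a rough idea of what that looks like in practice (a sketch only,
assuming Spark 2.x, the Alluxio client on the classpath, and an Alluxio
master at alluxio://alluxio-master:19998, all of which are placeholders):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().getOrCreate()

  // Job A: write the result through Alluxio so it stays in Alluxio's
  // memory tier
  val resultA = spark.range(100).toDF("id")  // stand-in for the real data
  resultA.write.mode("overwrite")
    .parquet("alluxio://alluxio-master:19998/shared/resultA")

  // Job B (possibly a completely separate Spark application) reads it
  // back, served from memory by Alluxio rather than recomputed
  val fromA =
    spark.read.parquet("alluxio://alluxio-master:19998/shared/resultA")
  fromA.show()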

Hope this helps,
Gene

On Wed, Jun 21, 2017 at 8:08 AM, Jean Georges Perrin <j...@jgp.net> wrote:

> I have looked at Livy in the (very recent) past and it will not do
> the trick for me. It seems pretty greedy in terms of resources (or at least
> that was our experience). I will investigate how job-server could do the
> trick.
>
> (On a side note, I tried to find a paper on the memory lifecycle within
> Spark but was not very successful; maybe someone has a link to spare.)
>
> My need is to keep one or more dataframes in memory (well, within Spark)
> so they can be reused at a later time, without persisting them to disk
> (unless Spark wants to, of course).
>
>
>
> On Jun 21, 2017, at 10:47 AM, Michael Mior <mm...@uwaterloo.ca> wrote:
>
> This is a puzzling suggestion to me. It's unclear what features the OP
> needs, so it's really hard to say whether Livy or job-server would be
> sufficient. It's true that neither is particularly mature, but they're
> much more mature than a homemade project that hasn't been started yet.
>
> That said, I'm not very familiar with either project, so perhaps there are
> some big concerns I'm not aware of.
>
> --
> Michael Mior
> mm...@apache.org
>
> 2017-06-21 3:19 GMT-04:00 Rick Moritz <rah...@gmail.com>:
>
>> Keeping it inside the same program/SparkContext is the most performant
>> solution, since you can avoid serialization and deserialization.
>> In-memory persistence between jobs involves a memory copy, uses a lot of
>> RAM, and still incurs serialization and deserialization. Technologies
>> that can help you do this easily are Ignite (as mentioned), but also
>> Alluxio, Cassandra with in-memory tables, and a memory-backed HDFS
>> directory (see tiered storage).
>> Although Livy and job-server can expose a single SparkContext to
>> multiple programs, I would recommend building your own framework for
>> integrating different jobs, since many features you may need aren't
>> present yet, while others may cause issues due to lack of maturity.
>> Artificially splitting jobs is in general a bad idea, since it breaks
>> the DAG and thus prevents some potential push-down optimizations.
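
As a minimal sketch of that first approach, with everything inside one
SparkContext (the Spark 2.x API, paths, and column names below are
assumptions):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.storage.StorageLevel

  val spark = SparkSession.builder().appName("a-b-c-combined").getOrCreate()

  // "Program A" and "program B" become parts of the same application
  val a = spark.read.parquet("/data/setA").groupBy("key").count()
  val b = spark.read.parquet("/data/setB").groupBy("key").count()

  // Mark both for in-memory caching: they are materialized on first use
  // and reused afterwards without being written out by hand
  a.persist(StorageLevel.MEMORY_ONLY)
  b.persist(StorageLevel.MEMORY_ONLY)

  // "Program C" combines the cached results inside the same context
  val c = a.join(b, "key")
  c.show()
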
>>
>> On Tue, Jun 20, 2017 at 10:17 PM, Jean Georges Perrin <j...@jgp.net>
>> wrote:
>>
>>> Thanks Vadim & Jörn... I will look into those.
>>>
>>> jg
>>>
>>> On Jun 20, 2017, at 2:12 PM, Vadim Semenov <vadim.seme...@datadoghq.com>
>>> wrote:
>>>
>>> You can launch one permanent Spark context and then execute your jobs
>>> within that context. Since they'll all be running in the same context,
>>> they can share data easily.
>>>
>>> These two projects provide the functionality that you need:
>>> https://github.com/spark-jobserver/spark-jobserver#persistent-context-mode---faster--required-for-related-jobs
>>> https://github.com/cloudera/livy#post-sessions
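
With either of those, jobs submitted to the shared context can hand
results to each other, for example through a cached temporary view. An
illustrative sketch, assuming the shared context exposes a SparkSession
named spark and a placeholder input path:

  // First job / statement submitted to the long-lived context
  val a = spark.read.parquet("/data/setA")
  a.cache()
  a.createOrReplaceTempView("result_a")

  // A later job / statement in the same context finds the view in the
  // session catalog and reuses the cached data instead of re-reading it
  val reused = spark.sql("SELECT COUNT(*) AS row_count FROM result_a")
  reused.show()
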
>>>
>>> On Tue, Jun 20, 2017 at 1:46 PM, Jean Georges Perrin <j...@jgp.net>
>>> wrote:
>>>
>>>> Hey,
>>>>
>>>> Here is my need: program A does something on a set of data and produces
>>>> results, program B does that on another set, and finally, program C
>>>> combines the data of A and B. Of course, the easy way is to dump
>>>> everything to disk after A and B are done, but I wanted to avoid this.
>>>>
>>>> I was thinking of creating a temp view, but I do not really like the
>>>> temp aspect of it ;). Any ideas? (They are all worth sharing.)
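
For reference, the temp-view approach looks roughly like this when A, B,
and C all run inside a single Spark application (a sketch assuming Spark
2.1+ and placeholder paths; a global temp view disappears when the
application stops, which is the "temp" aspect):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().getOrCreate()

  // Program A's part: register its result so that other sessions in the
  // same application can see it
  spark.read.parquet("/data/setA").createGlobalTempView("result_a")

  // Program B's part: same idea for the second data set
  spark.read.parquet("/data/setB").createGlobalTempView("result_b")

  // Program C's part: combine both via the global_temp database
  val combined = spark.sql(
    "SELECT * FROM global_temp.result_a" +
      " UNION ALL SELECT * FROM global_temp.result_b")
  combined.show()
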
>>>>
>>>> jg
>>>>
>>>>
>>>>
>>>
>>>
>>
>
>
