Thanks! Alluxio looks quite promising, but also quite new. What did people do before?
On Mon, Jun 27, 2016 at 12:33 PM, Gene Pang <gene.p...@gmail.com> wrote:

> Yes, Alluxio (http://www.alluxio.org/) can be used to store data
> in-memory between stages in a pipeline.
>
> Here is more information about running Spark with Alluxio:
> http://www.alluxio.org/documentation/v1.1.0/en/Running-Spark-on-Alluxio.html
>
> Hope that helps,
> Gene
>
> On Mon, Jun 27, 2016 at 10:38 AM, Sathish Kumaran Vairavelu
> <vsathishkuma...@gmail.com> wrote:
>
>> Alluxio off-heap memory would help to share cached objects.
>>
>> On Mon, Jun 27, 2016 at 11:14 AM Everett Anderson
>> <ever...@nuna.com.invalid> wrote:
>>
>>> Hi,
>>>
>>> We have a pipeline of components strung together via Airflow running on
>>> AWS. Some of them are implemented in Spark, but some aren't. Generally
>>> they can all talk to a JDBC/ODBC endpoint or read/write files on S3.
>>>
>>> Ideally, we wouldn't suffer the I/O cost of writing all the data to HDFS
>>> or S3 and reading it back in again in every component if it could stay
>>> cached in memory in a Spark cluster.
>>>
>>> Our current investigation seems to lead us toward exploring whether the
>>> following things are possible:
>>>
>>> - Using a Hive metastore with S3 as its backing data store to keep a
>>>   mapping from table name to files on S3 (not sure if one can cache a
>>>   Hive table in Spark across contexts, though)
>>> - Using something like the spark-jobserver to keep a Spark SQLContext
>>>   open across Spark components so they could avoid file I/O for cached
>>>   tables
>>>
>>> What's the best practice for handing tables between Spark programs? What
>>> about between Spark and non-Spark programs?
>>>
>>> Thanks!
>>>
>>> - Everett
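As a rough sketch of the Alluxio hand-off Gene describes: a producing Spark
job writes to an alluxio:// path, and a separate consuming job reads it back
without another round trip to S3. The master hostname, port, and paths below
are placeholders, not values from this thread.

    // Sketch only: assumes Spark 1.6 with the Alluxio 1.1 client jar on the
    // classpath and an Alluxio master at alluxio-master:19998 (placeholder).
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("alluxio-handoff"))

    // Producer: persist an intermediate result to Alluxio instead of S3.
    val results = sc.textFile("s3a://my-bucket/input/") // hypothetical input
      .filter(_.nonEmpty)
    results.saveAsTextFile("alluxio://alluxio-master:19998/pipeline/stage1")

    // Consumer (possibly a different Spark program): read the same path;
    // the bytes are served from Alluxio memory rather than from S3.
    val stage1 = sc.textFile("alluxio://alluxio-master:19998/pipeline/stage1")
    println(stage1.count())

And a minimal sketch of the Hive-metastore idea in Everett's list: register
an external table whose data lives on S3, so anything that can reach the
metastore resolves the table name to the same files. The bucket, path, and
schema are made up; note that cache() is scoped to a single SparkContext,
which is exactly the cross-context limitation Everett flags.

    // Sketch only: assumes the same sc as above and a shared Hive metastore.
    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc)

    // Record a table-name -> S3-files mapping in the shared metastore.
    sqlContext.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS events (id BIGINT, payload STRING)
        |STORED AS PARQUET
        |LOCATION 's3a://my-bucket/warehouse/events'""".stripMargin)

    // Any Spark program pointed at the same metastore can now read it by name.
    val events = sqlContext.table("events")
    events.cache() // cached within this SparkContext only, not across programs

Non-Spark programs can reach the same table through a JDBC/ODBC endpoint such
as the Spark Thrift Server or HiveServer2, since both consult the same
metastore.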