Yes, Alluxio (http://www.alluxio.org/) can be used to store data in memory between stages in a pipeline.
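For example, once the Alluxio client jar is on Spark's classpath, components can hand data to each other through Alluxio's memory tier simply by reading and writing alluxio:// paths. A minimal sketch (the master address and paths are placeholders, and `df`/`sqlContext` are assumed to already exist in each application):

    // Component A writes its output into Alluxio instead of S3/HDFS:
    df.write.parquet("alluxio://master:19998/pipeline/stage1")

    // Component B (a separate Spark application) reads it back from
    // Alluxio's memory tier, skipping the round trip to S3:
    val stage1 = sqlContext.read.parquet("alluxio://master:19998/pipeline/stage1")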
Here is more information about running Spark with Alluxio:
http://www.alluxio.org/documentation/v1.1.0/en/Running-Spark-on-Alluxio.html

Hope that helps,
Gene

On Mon, Jun 27, 2016 at 10:38 AM, Sathish Kumaran Vairavelu <
vsathishkuma...@gmail.com> wrote:

> Alluxio's off-heap memory would help to share cached objects.
>
> On Mon, Jun 27, 2016 at 11:14 AM Everett Anderson <ever...@nuna.com.invalid>
> wrote:
>
>> Hi,
>>
>> We have a pipeline of components strung together via Airflow running on
>> AWS. Some of them are implemented in Spark, but some aren't. Generally,
>> they can all talk to a JDBC/ODBC endpoint or read/write files from S3.
>>
>> Ideally, we wouldn't suffer the I/O cost of writing all the data to HDFS
>> or S3 and reading it back in again in every component if it could stay
>> cached in memory in a Spark cluster.
>>
>> Our current investigation seems to lead us toward exploring whether the
>> following things are possible:
>>
>> - Using a Hive metastore with S3 as its backing data store to try to
>>   keep a mapping from table name to files on S3 (not sure if one can
>>   cache a Hive table in Spark across contexts, though)
>> - Using something like the spark-jobserver to keep a Spark SQLContext
>>   open across Spark components so they could avoid file I/O for cached
>>   tables
>>
>> What's the best practice for handing tables between Spark programs? What
>> about between Spark and non-Spark programs?
>>
>> Thanks!
>>
>> - Everett
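P.S. On the Hive-metastore idea in the quoted thread above: the metastore only holds table metadata (schema plus S3 location), so any Spark application pointed at the same metastore can resolve a table by name, but cached data is per SparkContext, so each new context still reads the files from S3 rather than a shared cache. A sketch of registering such a table (the table name, schema, and bucket are hypothetical):

    // Run once from any Spark application configured with the shared
    // metastore; assumes an existing HiveContext `hiveContext`.
    hiveContext.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS pipeline_events (id BIGINT, payload STRING)
      STORED AS PARQUET
      LOCATION 's3a://my-bucket/pipeline/events/'
    """)

    // Any other Spark (or Hive-aware, e.g. Presto) component can then
    // find the table by name:
    val events = hiveContext.sql("SELECT * FROM pipeline_events")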