Alluxio's off-heap memory would help to share cached objects across programs.
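For example, here is a rough sketch of the hand-off, not a drop-in solution: it assumes Spark 1.x with the Alluxio client jar on the Spark classpath, and the master hostname ("alluxio-master"), the default port 19998, and all paths are placeholders you'd swap for your own.

```scala
// Sketch: one job writes an intermediate DataFrame into Alluxio as Parquet,
// and a later job reads it back without a round trip to S3.
// Assumes the Alluxio client jar is on the classpath; hostname and paths
// below are placeholders.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object AlluxioHandoff {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("alluxio-handoff"))
    val sqlContext = new SQLContext(sc)

    // Producer side: park the intermediate table in Alluxio's in-memory tier.
    val df = sqlContext.read.parquet("s3a://my-bucket/input/")
    df.write.mode("overwrite")
      .parquet("alluxio://alluxio-master:19998/tables/events")

    // Consumer side (typically a separate application): read it straight back.
    val events = sqlContext.read
      .parquet("alluxio://alluxio-master:19998/tables/events")
    events.registerTempTable("events")
    sqlContext.sql("SELECT COUNT(*) FROM events").show()

    sc.stop()
  }
}
```

Non-Spark components can reach the same files through Alluxio's Hadoop-compatible filesystem interface or its FUSE-based POSIX mount, so the in-memory copy isn't limited to Spark readers.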
On Mon, Jun 27, 2016 at 11:14 AM Everett Anderson <ever...@nuna.com.invalid> wrote:

> Hi,
>
> We have a pipeline of components strung together via Airflow running
> on AWS. Some of them are implemented in Spark, but some aren't.
> Generally, they can all talk to a JDBC/ODBC endpoint or read/write
> files from S3.
>
> Ideally, we wouldn't suffer the I/O cost of writing all the data to
> HDFS or S3 and reading it back in every component if it could stay
> cached in memory in a Spark cluster.
>
> Our current investigation is leading us to explore whether the
> following are possible:
>
>    - Using a Hive metastore with S3 as its backing data store to keep
>    a mapping from table name to files on S3 (not sure whether one can
>    cache a Hive table in Spark across contexts, though)
>    - Using something like the spark-jobserver to keep a Spark
>    SQLContext open across Spark components so they could avoid file
>    I/O for cached tables
>
> What's the best practice for handing off tables between Spark
> programs? What about between Spark and non-Spark programs?
>
> Thanks!
>
> - Everett
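Regarding the first bullet above: the metastore holds only the table-name-to-S3-location mapping, so it works across engines, but Spark's in-memory table cache is scoped to a single application, and a new SparkContext has to re-read the files from S3. A rough sketch of the external-table setup (Spark 1.x HiveContext; the shared metastore wired up via hive-site.xml, plus the bucket, schema, and table name, are all assumptions):

```scala
// Sketch: register an S3-backed external table in a shared Hive metastore.
// Assumes hive-site.xml points every application at the same metastore;
// bucket, schema, and table name are hypothetical.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object S3BackedTable {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("s3-backed-table"))
    val hive = new HiveContext(sc)

    // The metastore stores only the name -> location mapping; the bytes stay
    // on S3, so Hive, Presto, or JDBC clients can resolve the same table.
    hive.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS events (id BIGINT, payload STRING)
        |STORED AS PARQUET
        |LOCATION 's3a://my-bucket/warehouse/events/'""".stripMargin)

    // Caching is per-application: this pins the table in memory for this
    // SparkContext only; a new context re-reads from S3.
    hive.cacheTable("events")
    hive.sql("SELECT COUNT(*) FROM events").show()

    sc.stop()
  }
}
```

That per-context scoping is also why the spark-jobserver idea helps: it keeps one long-lived context (and thus the cache) alive across jobs submitted to it.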