Hi,

We have a pipeline of components strung together via Airflow running on
AWS. Some of them are implemented in Spark, but some aren't. Generally
they can all talk to a JDBC/ODBC endpoint or read and write files on S3.

Ideally, we would avoid the I/O cost of writing all the data out to HDFS
or S3 and reading it back in at every component if it could instead stay
cached in memory in a Spark cluster.

Our current investigation points towards exploring whether the following
things are possible:

   - Using a Hive metastore with S3 as its backing data store to keep a
   mapping from table name to files on S3 (not sure whether one can cache
   a Hive table in Spark across contexts, though); see the sketch after
   this list
   - Using something like the spark-jobserver to keep a Spark SQLContext
   open across Spark components so they could avoid file I/O for cached tables
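
For the first idea, here's a rough sketch of what we're imagining, using
the Spark 1.x HiveContext API (the bucket, paths, table, and column names
below are made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    object MetastoreSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("component-a"))
        // HiveContext picks up hive-site.xml, so the metastore itself can
        // live outside the cluster (e.g. RDS) while the data stays on S3.
        val hc = new HiveContext(sc)

        // Component A: register its output as an external table whose
        // files live on S3, so downstream components only need the name.
        val df = hc.read.parquet("s3a://our-bucket/raw/events/")
        df.write
          .format("parquet")
          .option("path", "s3a://our-bucket/tables/events_clean")
          .saveAsTable("events_clean")

        // A later component (ideally reusing the same long-lived context,
        // e.g. via spark-jobserver): look the table up by name and pin it
        // in memory for subsequent queries.
        hc.cacheTable("events_clean")
        hc.sql("SELECT dt, count(*) AS n FROM events_clean GROUP BY dt").show()
      }
    }

What we can't tell is whether cacheTable in a later component would have
to re-read the Parquet files from S3 anyway, or whether a persistent
shared context is the only way to keep that cache warm.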

What's the best practice for handing tables between Spark programs? What
about between Spark and non-Spark programs?

Thanks!

- Everett
