Yes, Alluxio (http://www.alluxio.org/) can be used to store data in memory between stages in a pipeline.
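For example, once the Alluxio client jar is on Spark's classpath, components can hand data to each other through Alluxio's memory tier simply by reading and writing alluxio:// paths. A minimal sketch (the master address and paths are placeholders, and `df`/`sqlContext` are assumed to already exist in each application):

    // Component A writes its output into Alluxio instead of S3/HDFS:
    df.write.parquet("alluxio://master:19998/pipeline/stage1")

    // Component B (a separate Spark application) reads it back from
    // Alluxio's memory tier, skipping the round trip to S3:
    val stage1 = sqlContext.read.parquet("alluxio://master:19998/pipeline/stage1")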
Here is more information about running Spark with Alluxio:
http://www.alluxio.org/documentation/v1.1.0/en/Running-Spark-on-Alluxio.html

Hope that helps,
Gene

On Mon, Jun 27, 2016 at 10:38 AM, Sathish Kumaran Vairavelu <
vsathishkuma...@gmail.com> wrote:

> Alluxio's off-heap memory would help to share cached objects.
>
> On Mon, Jun 27, 2016 at 11:14 AM Everett Anderson <ever...@nuna.com.invalid>
> wrote:
>
>> Hi,
>>
>> We have a pipeline of components strung together via Airflow running on
>> AWS. Some of them are implemented in Spark, but some aren't. Generally,
>> they can all talk to a JDBC/ODBC endpoint or read/write files from S3.
>>
>> Ideally, we wouldn't suffer the I/O cost of writing all the data to HDFS
>> or S3 and reading it back in again in every component if it could stay
>> cached in memory in a Spark cluster.
>>
>> Our current investigation seems to lead us toward exploring whether the
>> following things are possible:
>>
>> - Using a Hive metastore with S3 as its backing data store to try to
>>   keep a mapping from table name to files on S3 (not sure if one can
>>   cache a Hive table in Spark across contexts, though)
>> - Using something like the spark-jobserver to keep a Spark SQLContext
>>   open across Spark components so they could avoid file I/O for cached
>>   tables
>>
>> What's the best practice for handing tables between Spark programs? What
>> about between Spark and non-Spark programs?
>>
>> Thanks!
>>
>> - Everett
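P.S. On the Hive-metastore idea in the quoted thread above: the metastore only holds table metadata (schema plus S3 location), so any Spark application pointed at the same metastore can resolve a table by name, but cached data is per SparkContext, so each new context still reads the files from S3 rather than a shared cache. A sketch of registering such a table (the table name, schema, and bucket are hypothetical):

    // Run once from any Spark application configured with the shared
    // metastore; assumes an existing HiveContext `hiveContext`.
    hiveContext.sql("""
      CREATE EXTERNAL TABLE IF NOT EXISTS pipeline_events (id BIGINT, payload STRING)
      STORED AS PARQUET
      LOCATION 's3a://my-bucket/pipeline/events/'
    """)

    // Any other Spark (or Hive-aware, e.g. Presto) component can then
    // find the table by name:
    val events = hiveContext.sql("SELECT * FROM pipeline_events")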