Hi,

We have a pipeline of components strung together via Airflow, running on AWS. Some of them are implemented in Spark, but some aren't. Generally they can all talk to a JDBC/ODBC endpoint or read/write files on S3.
Ideally, we wouldn't suffer the I/O cost of writing all the data out to HDFS or S3 and reading it back in again in every component if it could instead stay cached in memory in a Spark cluster. Our current investigation points toward exploring whether the following are possible:

- Using a Hive metastore with S3 as its backing data store to keep a mapping from table names to files on S3 (not sure whether a Hive table can be cached in Spark across contexts, though; a rough sketch of what we mean is at the end of this mail)
- Using something like spark-jobserver to keep a Spark SQLContext open across Spark components so they could avoid file I/O for cached tables

What's the best practice for handing tables between Spark programs? What about between Spark and non-Spark programs?

Thanks!

- Everett
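P.S. In case it helps clarify the question, here's a rough sketch of what we have in mind for the Hive-metastore option (PySpark, SQLContext-era API since that's what we're on; the bucket, table name, and schema are made up):

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext(appName="register-s3-table")
sqlContext = HiveContext(sc)  # talks to the shared Hive metastore

# Register an external table whose data lives on S3, so any component
# (Spark or otherwise) that can reach the metastore can find it by name.
sqlContext.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events (
        id BIGINT,
        payload STRING
    )
    STORED AS PARQUET
    LOCATION 's3a://our-bucket/warehouse/events/'
""")

# Cache it for this application's queries. Our understanding is that this
# cache is scoped to this one SQLContext and doesn't survive into another
# Spark application -- which is exactly the part we're unsure about.
sqlContext.cacheTable("events")
events = sqlContext.table("events")
print(events.count())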