Alluxio's off-heap memory would help to share cached objects across programs.
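For example, here is a rough sketch of the hand-off, not a drop-in solution: it assumes Spark 1.x with the Alluxio client jar on the Spark classpath, and the master hostname ("alluxio-master"), the default port 19998, and all paths are placeholders you'd swap for your own.

```scala
// Sketch: one job writes an intermediate DataFrame into Alluxio as Parquet,
// and a later job reads it back without a round trip to S3.
// Assumes the Alluxio client jar is on the classpath; hostname and paths
// below are placeholders.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object AlluxioHandoff {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("alluxio-handoff"))
    val sqlContext = new SQLContext(sc)

    // Producer side: park the intermediate table in Alluxio's in-memory tier.
    val df = sqlContext.read.parquet("s3a://my-bucket/input/")
    df.write.mode("overwrite")
      .parquet("alluxio://alluxio-master:19998/tables/events")

    // Consumer side (typically a separate application): read it straight back.
    val events = sqlContext.read
      .parquet("alluxio://alluxio-master:19998/tables/events")
    events.registerTempTable("events")
    sqlContext.sql("SELECT COUNT(*) FROM events").show()

    sc.stop()
  }
}
```

Non-Spark components can reach the same files through Alluxio's Hadoop-compatible filesystem interface or its FUSE-based POSIX mount, so the in-memory copy isn't limited to Spark readers.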
On Mon, Jun 27, 2016 at 11:14 AM Everett Anderson <ever...@nuna.com.invalid> wrote:

> Hi,
>
> We have a pipeline of components strung together via Airflow running
> on AWS. Some of them are implemented in Spark, but some aren't.
> Generally, they can all talk to a JDBC/ODBC endpoint or read/write
> files from S3.
>
> Ideally, we wouldn't suffer the I/O cost of writing all the data to
> HDFS or S3 and reading it back in every component if it could stay
> cached in memory in a Spark cluster.
>
> Our current investigation is leading us to explore whether the
> following are possible:
>
>    - Using a Hive metastore with S3 as its backing data store to keep
>    a mapping from table name to files on S3 (not sure whether one can
>    cache a Hive table in Spark across contexts, though)
>    - Using something like the spark-jobserver to keep a Spark
>    SQLContext open across Spark components so they could avoid file
>    I/O for cached tables
>
> What's the best practice for handing off tables between Spark
> programs? What about between Spark and non-Spark programs?
>
> Thanks!
>
> - Everett
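Regarding the first bullet above: the metastore holds only the table-name-to-S3-location mapping, so it works across engines, but Spark's in-memory table cache is scoped to a single application, and a new SparkContext has to re-read the files from S3. A rough sketch of the external-table setup (Spark 1.x HiveContext; the shared metastore wired up via hive-site.xml, plus the bucket, schema, and table name, are all assumptions):

```scala
// Sketch: register an S3-backed external table in a shared Hive metastore.
// Assumes hive-site.xml points every application at the same metastore;
// bucket, schema, and table name are hypothetical.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object S3BackedTable {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("s3-backed-table"))
    val hive = new HiveContext(sc)

    // The metastore stores only the name -> location mapping; the bytes stay
    // on S3, so Hive, Presto, or JDBC clients can resolve the same table.
    hive.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS events (id BIGINT, payload STRING)
        |STORED AS PARQUET
        |LOCATION 's3a://my-bucket/warehouse/events/'""".stripMargin)

    // Caching is per-application: this pins the table in memory for this
    // SparkContext only; a new context re-reads from S3.
    hive.cacheTable("events")
    hive.sql("SELECT COUNT(*) FROM events").show()

    sc.stop()
  }
}
```

That per-context scoping is also why the spark-jobserver idea helps: it keeps one long-lived context (and thus the cache) alive across jobs submitted to it.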