Thanks! Alluxio looks quite promising, but also quite new. What did people do before?
On Mon, Jun 27, 2016 at 12:33 PM, Gene Pang <gene.p...@gmail.com> wrote:

> Yes, Alluxio (http://www.alluxio.org/) can be used to store data
> in-memory between stages in a pipeline.
>
> Here is more information about running Spark with Alluxio:
> http://www.alluxio.org/documentation/v1.1.0/en/Running-Spark-on-Alluxio.html
>
> Hope that helps,
> Gene
>
> On Mon, Jun 27, 2016 at 10:38 AM, Sathish Kumaran Vairavelu
> <vsathishkuma...@gmail.com> wrote:
>
>> Alluxio off-heap memory would help to share cached objects.
>>
>> On Mon, Jun 27, 2016 at 11:14 AM Everett Anderson
>> <ever...@nuna.com.invalid> wrote:
>>
>>> Hi,
>>>
>>> We have a pipeline of components strung together via Airflow running on
>>> AWS. Some of them are implemented in Spark, but some aren't. Generally
>>> they can all talk to a JDBC/ODBC endpoint or read/write files on S3.
>>>
>>> Ideally, we wouldn't suffer the I/O cost of writing all the data to HDFS
>>> or S3 and reading it back in again in every component if it could stay
>>> cached in memory in a Spark cluster.
>>>
>>> Our current investigation seems to lead us toward exploring whether the
>>> following things are possible:
>>>
>>> - Using a Hive metastore with S3 as its backing data store to keep a
>>>   mapping from table name to files on S3 (not sure if one can cache a
>>>   Hive table in Spark across contexts, though)
>>> - Using something like the spark-jobserver to keep a Spark SQLContext
>>>   open across Spark components so they could avoid file I/O for cached
>>>   tables
>>>
>>> What's the best practice for handing tables between Spark programs? What
>>> about between Spark and non-Spark programs?
>>>
>>> Thanks!
>>>
>>> - Everett
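As a rough sketch of the Alluxio hand-off Gene describes: a producing Spark
job writes to an alluxio:// path, and a separate consuming job reads it back
without another round trip to S3. The master hostname, port, and paths below
are placeholders, not values from this thread.

    // Sketch only: assumes Spark 1.6 with the Alluxio 1.1 client jar on the
    // classpath and an Alluxio master at alluxio-master:19998 (placeholder).
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("alluxio-handoff"))

    // Producer: persist an intermediate result to Alluxio instead of S3.
    val results = sc.textFile("s3a://my-bucket/input/") // hypothetical input
      .filter(_.nonEmpty)
    results.saveAsTextFile("alluxio://alluxio-master:19998/pipeline/stage1")

    // Consumer (possibly a different Spark program): read the same path;
    // the bytes are served from Alluxio memory rather than from S3.
    val stage1 = sc.textFile("alluxio://alluxio-master:19998/pipeline/stage1")
    println(stage1.count())

And a minimal sketch of the Hive-metastore idea in Everett's list: register
an external table whose data lives on S3, so anything that can reach the
metastore resolves the table name to the same files. The bucket, path, and
schema are made up; note that cache() is scoped to a single SparkContext,
which is exactly the cross-context limitation Everett flags.

    // Sketch only: assumes the same sc as above and a shared Hive metastore.
    import org.apache.spark.sql.hive.HiveContext

    val sqlContext = new HiveContext(sc)

    // Record a table-name -> S3-files mapping in the shared metastore.
    sqlContext.sql(
      """CREATE EXTERNAL TABLE IF NOT EXISTS events (id BIGINT, payload STRING)
        |STORED AS PARQUET
        |LOCATION 's3a://my-bucket/warehouse/events'""".stripMargin)

    // Any Spark program pointed at the same metastore can now read it by name.
    val events = sqlContext.table("events")
    events.cache() // cached within this SparkContext only, not across programs

Non-Spark programs can reach the same table through a JDBC/ODBC endpoint such
as the Spark Thrift Server or HiveServer2, since both consult the same
metastore.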