Hi Everett,

We have been using Alluxio for the last 2 months. We use it to share data between Spark jobs: Spark serves only as the processing layer, and Alluxio serves as the storage layer.
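
To make that concrete, here is a rough sketch of how one Spark job could hand a table to another through Alluxio. It is not our exact code: the host name `alluxio-master`, the path `/pipeline/stage1`, and the helper names are placeholders I made up, 19998 is assumed to be the Alluxio master port (its default), and it assumes the Alluxio client jar is on the Spark classpath.

```python
# Sketch only: host, port, paths, and helper names are illustrative placeholders.

def alluxio_uri(path, host="alluxio-master", port=19998):
    """Build an alluxio:// URI for a path in the Alluxio namespace."""
    return "alluxio://{}:{}/{}".format(host, port, path.lstrip("/"))


def write_stage_output(df, path):
    """Producer job: persist a Spark DataFrame to Alluxio as Parquet."""
    df.write.mode("overwrite").parquet(alluxio_uri(path))


def read_stage_input(spark, path):
    """Consumer job: load what the producer wrote.

    `spark` is whatever entry point exposes `.read`
    (a SQLContext or a SparkSession, depending on the Spark version).
    """
    return spark.read.parquet(alluxio_uri(path))


if __name__ == "__main__":
    # Show the URI both jobs would agree on.
    print(alluxio_uri("/pipeline/stage1"))
```

The consumer can run in a different Spark application from the producer, because the data lives in Alluxio rather than in one Spark context's cache.
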

> On Jun 29, 2016, at 2:52 AM, Everett Anderson <ever...@nuna.com.INVALID> wrote:
>
> Thanks! Alluxio looks quite promising, but also quite new.
>
> What did people do before?
>
> On Mon, Jun 27, 2016 at 12:33 PM, Gene Pang <gene.p...@gmail.com> wrote:
> Yes, Alluxio (http://www.alluxio.org/) can be used
> to store data in-memory between stages in a pipeline.
>
> Here is more information about running Spark with Alluxio:
> http://www.alluxio.org/documentation/v1.1.0/en/Running-Spark-on-Alluxio.html
>
> Hope that helps,
> Gene
>
> On Mon, Jun 27, 2016 at 10:38 AM, Sathish Kumaran Vairavelu
> <vsathishkuma...@gmail.com> wrote:
> Alluxio off heap memory would help to share cached objects
>
> On Mon, Jun 27, 2016 at 11:14 AM Everett Anderson <ever...@nuna.com.invalid> wrote:
> Hi,
>
> We have a pipeline of components strung together via Airflow running on AWS.
> Some of them are implemented in Spark, but some aren't. Generally they can
> all talk to a JDBC/ODBC end point or read/write files from S3.
>
> Ideally, we wouldn't suffer the I/O cost of writing all the data to HDFS or
> S3 and reading it back in, again, in every component, if it could stay cached
> in memory in a Spark cluster.
>
> Our current investigation seems to lead us towards exploring if the following
> things are possible:
>   * Using a Hive metastore with S3 as its backing data store to try to keep a
>     mapping from table name to files on S3 (not sure if one can cache a Hive
>     table in Spark across contexts, though)
>   * Using something like the spark-jobserver to keep a Spark SQLContext open
>     across Spark components so they could avoid file I/O for cached tables
>
> What's the best practice for handing tables between Spark programs? What
> about between Spark and non-Spark programs?
>
> Thanks!
>
> - Everett