Re: Best practice for handing tables between pipeline components

Chanh Le Wed, 29 Jun 2016 00:00:40 -0700

Hi Everett,
We have been using Alluxio for the last 2 months. We use it to share data 
between Spark jobs, keeping Spark as the processing layer only and Alluxio as 
the storage layer.
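
A minimal sketch of that kind of split, assuming the Alluxio client jar is on 
the Spark classpath so that alluxio:// paths resolve. The master host, the 
/pipeline directory and the helper names are placeholders for illustration, 
not taken from a real deployment; 19998 is the default Alluxio master port.

```python
# Sketch: one Spark job writes a table to Alluxio, a later job reads it back.
# Host, port and directory below are hypothetical placeholders.

ALLUXIO_BASE = "alluxio://alluxio-master:19998/pipeline"  # hypothetical


def table_uri(table_name, base=ALLUXIO_BASE):
    """Build the Alluxio path under which a named table is stored."""
    return "{}/{}".format(base.rstrip("/"), table_name)


def write_table(df, table_name):
    """Producer job: persist a Spark DataFrame as Parquet in Alluxio."""
    df.write.mode("overwrite").parquet(table_uri(table_name))


def read_table(sql_context, table_name):
    """Consumer job: load the table written by an earlier job."""
    return sql_context.read.parquet(table_uri(table_name))
```

The producer would call write_table(df, "events") and a later Spark job would 
call read_table(sqlContext, "events"), so the only thing the jobs share is the 
path convention.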




> On Jun 29, 2016, at 2:52 AM, Everett Anderson <ever...@nuna.com.INVALID> 
> wrote:
> 
> Thanks! Alluxio looks quite promising, but also quite new.
> 
> What did people do before?
> 
> On Mon, Jun 27, 2016 at 12:33 PM, Gene Pang <gene.p...@gmail.com> wrote:
> Yes, Alluxio (http://www.alluxio.org/) can be used to store data in-memory 
> between stages in a pipeline.
> 
> Here is more information about running Spark with Alluxio: 
> http://www.alluxio.org/documentation/v1.1.0/en/Running-Spark-on-Alluxio.html
> 
> Hope that helps,
> Gene
> 
> On Mon, Jun 27, 2016 at 10:38 AM, Sathish Kumaran Vairavelu 
> <vsathishkuma...@gmail.com> wrote:
> Alluxio off-heap memory would help to share cached objects.
> 
> On Mon, Jun 27, 2016 at 11:14 AM Everett Anderson <ever...@nuna.com.invalid> 
> wrote:
> Hi,
> 
> We have a pipeline of components strung together via Airflow running on AWS. 
> Some of them are implemented in Spark, but some aren't. Generally they can 
> all talk to a JDBC/ODBC end point or read/write files from S3.
> 
> Ideally, we wouldn't suffer the I/O cost of writing all the data to HDFS or 
> S3 and reading it back in, again, in every component, if it could stay cached 
> in memory in a Spark cluster. 
> 
> Our current investigation seems to lead us towards exploring whether the 
> following things are possible:
> * Using a Hive metastore with S3 as its backing data store to try to keep a 
>   mapping from table name to files on S3 (not sure if one can cache a Hive 
>   table in Spark across contexts, though)
> * Using something like the spark-jobserver to keep a Spark SQLContext open 
>   across Spark components so they could avoid file I/O for cached tables
> 
> What's the best practice for handing tables between Spark programs? What 
> about between Spark and non-Spark programs?
> 
> Thanks!
> 
> - Everett
> 
> 
>
