Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-14 Thread Gene Pang
Hi Dmitry, I am not familiar with all of the details you have just described, but I think Tachyon should be able to help you. If you store all of your resource files in HDFS or S3 or both, you can run Tachyon to use those storage systems as the under storage (

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-14 Thread Dmitry Goldenberg
OK, so it looks like Tachyon is a cluster memory plugin marked as "experimental" in Spark. In any case, we've got a few requirements for the system we're working on which may drive the decision on how to implement large resource file management. The system is a framework of N data analyzers

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-14 Thread Dmitry Goldenberg
The other thing some folks on this list recommended was Apache Ignite. Their In-Memory File System (https://ignite.apache.org/features/igfs.html) looks quite interesting. On Thu, Jan 14, 2016 at 7:54 AM, Dmitry Goldenberg wrote: > OK so it looks like
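
A minimal sketch of reading a resource through IGFS from Spark, assuming an existing SparkContext sc (as in spark-shell), the Ignite Hadoop accelerator jars on the classpath, and core-site.xml mapping the igfs:// scheme to Ignite's Hadoop-compatible file system; the URI below is only a placeholder:

    // Read the shared dictionary through IGFS instead of bundling it in the job jar.
    // The igfs:// URI is a placeholder for a real IGFS endpoint.
    val dict = sc.textFile("igfs://igfs@ignite-host/resources/dictionary.txt")
    println(s"dictionary entries: ${dict.count()}")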

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Jörn Franke
Ignite can also cache RDDs. > On 12 Jan 2016, at 13:06, Dmitry Goldenberg wrote: > > Jorn, you said Ignite or ... ? What was the second choice you were thinking > of? It seems that got omitted. > >> On Jan 12, 2016, at 2:44 AM, Jörn Franke
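
A rough sketch of caching a resource RDD in Ignite via the ignite-spark module (IgniteContext / IgniteRDD), assuming an existing SparkContext sc; the cache name, file layout, and default IgniteConfiguration are placeholders, and the exact signatures should be checked against the Ignite docs for your version:

    import org.apache.ignite.configuration.IgniteConfiguration
    import org.apache.ignite.spark.IgniteContext

    // IgniteRDD backed by an Ignite cache; other Spark jobs can read the same
    // cache without reloading the resource file.
    val ic = new IgniteContext[String, String](sc, () => new IgniteConfiguration())
    val sharedDict = ic.fromCache("dictionary")
    sharedDict.savePairs(
      sc.textFile("hdfs:///resources/dictionary.tsv").map { line =>
        val Array(k, v) = line.split("\t", 2)   // assumes two tab-separated columns
        (k, v)
      }
    )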

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
I'd guess that if the resources are broadcast, Spark would put them into Tachyon... > On Jan 12, 2016, at 7:04 AM, Dmitry Goldenberg > wrote: > > Would it make sense to load them into Tachyon and read and broadcast them > from there since Tachyon is already a part of

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
Jorn, you said Ignite or ... ? What was the second choice you were thinking of? It seems that got omitted. > On Jan 12, 2016, at 2:44 AM, Jörn Franke wrote: > > You can look at ignite as a HDFS cache or for storing rdds. > >> On 11 Jan 2016, at 21:14, Dmitry Goldenberg

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
Would it make sense to load them into Tachyon and read and broadcast them from there, since Tachyon is already a part of the Spark stack? If so, I wonder if I could do that Tachyon read/write via a Spark API? > On Jan 12, 2016, at 2:21 AM, Sabarish Sasidharan >

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Dmitry Goldenberg
Thanks, Gene. Does Spark use Tachyon under the covers anyway for implementing its "cluster memory" support? It seems that the practice I hear about most is loading resources as RDDs and then doing joins against them to achieve the lookup effect. The other approach would be to
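
For reference, a minimal sketch of that RDD-join lookup pattern, assuming an existing SparkContext sc and a made-up tab-separated layout for both files:

    // Load the resource file as a keyed RDD.
    val dictionary = sc.textFile("hdfs:///resources/dictionary.tsv").map { line =>
      val Array(term, category) = line.split("\t", 2)   // assumes two columns
      (term, category)
    }

    // Key the data to enrich the same way.
    val events = sc.textFile("hdfs:///data/events.tsv").map { line =>
      val Array(term, payload) = line.split("\t", 2)
      (term, payload)
    }

    // The join against the resource RDD gives the lookup effect;
    // leftOuterJoin keeps events with no dictionary match.
    events.leftOuterJoin(dictionary).saveAsTextFile("hdfs:///data/events-enriched")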

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-12 Thread Gene Pang
Hi Dmitry, Yes, Tachyon can help with your use case. You can read and write to Tachyon via the filesystem api ( http://tachyon-project.org/documentation/File-System-API.html). There is a native Java API as well as a Hadoop-compatible API. Spark is also able to interact with Tachyon via the
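
A minimal sketch of reading such a resource from Tachyon in Spark, assuming an existing SparkContext sc (as in spark-shell), the Tachyon client on the classpath, and a placeholder master host (19998 is Tachyon's default master port):

    // Read the dictionary through the Hadoop-compatible API and broadcast it.
    val dictLines = sc.textFile("tachyon://tachyon-master:19998/resources/dictionary.txt")
    val dict = sc.broadcast(dictLines.collect().toSet)

    // Use the broadcast copy for lookups on the executors.
    val matched = sc.textFile("hdfs:///data/input")
      .filter(line => dict.value.contains(line.trim))
    println(s"matched ${matched.count()} lines")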

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-11 Thread Jörn Franke
You can look at Ignite as an HDFS cache or for storing RDDs. > On 11 Jan 2016, at 21:14, Dmitry Goldenberg wrote: > > We have a bunch of Spark jobs deployed and a few large resource files such as > e.g. a dictionary for lookups or a statistical model. > > Right now,

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-11 Thread Gavin Yue
Has anyone used Ignite in a production system? On Mon, Jan 11, 2016 at 11:44 PM, Jörn Franke wrote: > You can look at ignite as a HDFS cache or for storing rdds. > > > On 11 Jan 2016, at 21:14, Dmitry Goldenberg > wrote: > > > > We have a bunch

Re: Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-11 Thread Sabarish Sasidharan
One option could be to store them as blobs in a cache like Redis and then read + broadcast them from the driver. Or you could store them in HDFS and read + broadcast from the driver. Regards Sab On Tue, Jan 12, 2016 at 1:44 AM, Dmitry Goldenberg wrote: > We have a
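
A minimal sketch of the HDFS variant, read on the driver and then broadcast, assuming an existing SparkContext sc and a made-up key/value layout for the resource file:

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path
    import scala.io.Source

    // Read the whole resource on the driver via the HDFS client.
    val path = new Path("hdfs:///resources/dictionary.tsv")
    val in = path.getFileSystem(new Configuration()).open(path)
    val dict = try {
      Source.fromInputStream(in).getLines().map { line =>
        val Array(k, v) = line.split("\t", 2)   // assumes two tab-separated columns
        k -> v
      }.toMap
    } finally in.close()

    // One copy per executor instead of one per task.
    val dictB = sc.broadcast(dict)
    sc.textFile("hdfs:///data/input")
      .map(line => dictB.value.getOrElse(line.trim, "UNKNOWN"))
      .saveAsTextFile("hdfs:///data/output")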

Best practices for sharing/maintaining large resource files for Spark jobs

2016-01-11 Thread Dmitry Goldenberg
We have a bunch of Spark jobs deployed and a few large resource files, such as a dictionary for lookups or a statistical model. Right now, these are deployed as part of the Spark jobs, which will eventually make the mongo-jars too bloated for deployments. What are some of the best practices
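
For context, a minimal sketch of the status quo described here, with the dictionary bundled into the job jar and loaded from the classpath (the resource path is a placeholder):

    import scala.io.Source

    // The dictionary ships inside the job jar, which is what bloats the deployments.
    val dict: Set[String] = {
      val in = getClass.getResourceAsStream("/dictionary.txt")
      try Source.fromInputStream(in).getLines().toSet finally in.close()
    }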