Re: Sharing RDDS across applications and users

2016-10-28 Thread vincent gromakowski
Bad idea. No caching, cluster over-consumption... Have a look at instantiating a custom Thrift server on temp tables with the fair scheduler to allow concurrent SQL requests. It's not a public API but you can find some examples. On 28 Oct 2016 11:12 AM, "Mich Talebzadeh" wrote: > Hi, > > I think

Re: Sharing RDDS across applications and users

2016-10-28 Thread Mich Talebzadeh
Hi, I think tempTable is private to the session that creates it. In Hive, temp tables created by "CREATE TEMPORARY TABLE" are all private to the session. Spark is no different. The alternative may be for everyone to create a tempTable from the same DF? HTH Dr Mich Talebzadeh LinkedIn * https://www.l

Re: Sharing RDDS across applications and users

2016-10-28 Thread Chanh Le
> Can you elaborate on how to implement "shared sparkcontext and fair > scheduling" option? It just reuses one SparkContext by not letting it stop when the application is done. You should check: Livy, spark-jobserver. FAIR scheduling: https://spark.apache.org/docs/1.2.0/job-scheduling.html

Re: Sharing RDDS across applications and users

2016-10-28 Thread Mich Talebzadeh
Thanks all for your advice. As I understand it, in layman's terms, if I had two applications running successfully where app 2 was dependent on app 1, I would finish app 1, store the results in HDFS, and then app 2 would start reading the results from HDFS and work on them. Using Alluxio or others replaces HDFS
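
The hand-off described in this message (app 1 finishes and stores its results, then app 2 reads them) can be sketched roughly as follows. This is only an illustration under stated assumptions: a local temporary directory stands in for HDFS or Alluxio, JSON lines stand in for whatever format the real jobs would use, and the names app1_write_results, app2_read_results and app1_results.jsonl are hypothetical, not taken from the thread.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List

RESULT_NAME = "app1_results.jsonl"  # hypothetical file name, not from the thread


def app1_write_results(shared_dir: Path, rows: List[Dict]) -> Path:
    """App 1: persist the finished result set to the shared storage location."""
    out = shared_dir / RESULT_NAME
    tmp = out.with_suffix(".tmp")
    with tmp.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    # Rename only after the write completes, so app 2 never sees a partial file.
    tmp.replace(out)
    return out


def app2_read_results(shared_dir: Path) -> List[Dict]:
    """App 2: start from the results that app 1 left in the shared location."""
    with (shared_dir / RESULT_NAME).open(encoding="utf-8") as f:
        return [json.loads(line) for line in f]


if __name__ == "__main__":
    # The temporary directory is an assumption standing in for an HDFS/Alluxio path.
    with tempfile.TemporaryDirectory() as d:
        shared = Path(d)
        app1_write_results(shared, [{"id": 1, "total": 10}, {"id": 2, "total": 20}])
        print(app2_read_results(shared))
```

The two functions share nothing except the storage path, which is the point of the pattern: the applications stay independent, and only the storage layer underneath changes if HDFS is swapped for something else.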

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
Hi, Just point all users to the same app with a common Spark context. For instance, Akka HTTP receives queries from users and launches concurrent Spark SQL queries in different actor threads. The only prerequisite is to launch the different jobs in different threads (like with actors). Be careful, it's no
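
A rough sketch of the "different jobs in different threads" idea from this message, written against anything that exposes sql(query).collect(), as a PySpark SparkSession does. It uses a plain thread pool rather than Akka actors, and FakeSession / FakeFrame are hypothetical stand-ins so that the snippet runs without a cluster; none of these names come from the thread.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Dict, List


def run_concurrently(session: Any, queries: Dict[str, str],
                     max_workers: int = 4) -> Dict[str, List[Any]]:
    """Run each user's SQL in its own thread against one shared session."""
    def run_one(sql: str) -> List[Any]:
        return session.sql(sql).collect()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so the queries can overlap, then gather results.
        futures = {user: pool.submit(run_one, sql) for user, sql in queries.items()}
        return {user: fut.result() for user, fut in futures.items()}


class FakeFrame:
    """Hypothetical stand-in for a DataFrame: only collect() is modelled."""
    def __init__(self, rows: List[Any]) -> None:
        self._rows = rows

    def collect(self) -> List[Any]:
        return self._rows


class FakeSession:
    """Hypothetical stand-in for the shared session: only sql() is modelled."""
    def sql(self, query: str) -> FakeFrame:
        return FakeFrame(["result of: " + query])


if __name__ == "__main__":
    # With a real cluster, the session would instead be built once, e.g.
    #   SparkSession.builder.config("spark.scheduler.mode", "FAIR").getOrCreate()
    # and shared by all threads (assumption: PySpark is installed and configured).
    print(run_concurrently(FakeSession(), {"alice": "SELECT 1", "bob": "SELECT 2"}))
```

The session is created once and passed in, so every user request reuses it; whether the jobs are then scheduled fairly depends on the scheduler mode of the real context, which this sketch does not model.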

Re: Sharing RDDS across applications and users

2016-10-27 Thread Victor Shafran
Hi Vincent, Can you elaborate on how to implement the "shared sparkcontext and fair scheduling" option? My approach was to use the sparkSession.getOrCreate() method and register a temp table in one application. However, I was not able to access this tempTable in another application. Your help is highly appr

Re: Sharing RDDS across applications and users

2016-10-27 Thread Gene Pang
Hi Mich, Yes, Alluxio is commonly used to cache and share Spark RDDs and DataFrames among different applications and contexts. The data typically stays in memory, but with Alluxio's tiered storage, the "colder" data can be evicted to other media, like SSDs and HDDs. Here is a blog post discus

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
For this you will need to contribute... On 27 Oct 2016 1:35 PM, "Mich Talebzadeh" wrote: > so I assume Ignite will not work with Spark version >=2? > > Dr Mich Talebzadeh > > > > LinkedIn * > https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >

Re: Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
so I assume Ignite will not work with Spark version >=2? Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw * http://talebzadehmich.wordpress.com *

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
Some options:
- Ignite for Spark 1.5, can deep store on Cassandra
- Alluxio for all Spark versions, can deep store on HDFS, Gluster...
==> these are best for sharing between jobs
- shared SparkContext and fair scheduling, seems to be not thread safe
- spark jobserver and namedRDD, CRUD thread saf

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
I would prefer sharing the Spark context and using the FAIR scheduler for user concurrency. On 27 Oct 2016 12:48 PM, "Mich Talebzadeh" wrote: > thanks Vince. > > So Ignite uses some hash/in-memory indexing. > > The question is in practice is there much use case to use these two > fabrics for sha

Re: Sharing RDDS across applications and users

2016-10-27 Thread Chanh Le
Hi Mich, I have only tried Alluxio, so I can’t give you a comparison. In my experience, I use Alluxio for big data sets (50GB - 100GB) that are the input of the pipeline jobs, so you can reuse the result from a previous job. > On Oct 27, 2016, at 5:39 PM, Mich Talebzadeh > wrote: > > Thanks Chanh,

Re: Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
Thanks Vince. So Ignite uses some hash/in-memory indexing. The question is whether, in practice, there is much of a use case for using these two fabrics to share RDDs. Remember all RDBMSs do this through shared memory. In layman's terms, if I have two independent spark-submit jobs running, can they share a result set?

Re: Sharing RDDS across applications and users

2016-10-27 Thread vincent gromakowski
Ignite works only with Spark 1.5. Ignite leverages indexes. Alluxio provides tiering. Alluxio easily integrates with the underlying FS. On 27 Oct 2016 12:39 PM, "Mich Talebzadeh" wrote: > Thanks Chanh, > > Can it share RDDs. > > Personally I have not used either Alluxio or Ignite. > > >1. Are the

Re: Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
Thanks Chanh, Can it share RDDs? Personally I have not used either Alluxio or Ignite. 1. Are there major differences between these two? 2. Have you tried Alluxio for sharing Spark RDDs and, if so, do you have any experience you can kindly share? Regards Dr Mich Talebzadeh LinkedIn *

Re: Sharing RDDS across applications and users

2016-10-27 Thread Chanh Le
Hi Mich, Alluxio is a good option to go with. Regards, Chanh > On Oct 27, 2016, at 5:28 PM, Mich Talebzadeh > wrote: > > > There was a mention of using Zeppelin to share RDDs with many users. From the > notes on Zeppelin it appears that this is sharing UI and I am not sure how > easy it is go

Sharing RDDS across applications and users

2016-10-27 Thread Mich Talebzadeh
There was a mention of using Zeppelin to share RDDs with many users. From the notes on Zeppelin it appears that this is sharing the UI, and I am not sure how easy it is going to be to change the result set with different users modifying, say, SQL queries. There is also the idea of caching RDDs with someth