Re: share datasets across multiple spark-streaming applications for lookup
Or Databricks Delta (announced at Spark Summit) or IBM Event Store, depending on the use case.

On Oct 31, 2017, at 14:30, Joseph Pride <jos...@versanalytics.com> wrote:

> Folks:
>
> SnappyData.
>
> I’m fairly new to working with it myself, but it looks pretty promising. It marries Spark with a co-located in-memory GemFire (or something gem-related) database. So you can access the data with SQL, JDBC, ODBC (if you wanna go Enterprise instead of open-source) or natively as mutable RDDs and DataFrames.
>
> You can run it so the storage and Spark compute are co-located in the same JVM on each machine, so you get data locality instead of a bottleneck between load, save, and compute. The data is supposed to persist between applications, cluster startups, or multiple applications doing stuff to the data at the same time.
>
> I hope it works for what I’m doing and isn’t too buggy. But it looks pretty good.
>
> -- Joe Pride
Re: share datasets across multiple spark-streaming applications for lookup
Folks:

SnappyData.

I’m fairly new to working with it myself, but it looks pretty promising. It marries Spark with a co-located in-memory GemFire (or something gem-related) database. So you can access the data with SQL, JDBC, ODBC (if you wanna go Enterprise instead of open-source) or natively as mutable RDDs and DataFrames.

You can run it so the storage and Spark compute are co-located in the same JVM on each machine, so you get data locality instead of a bottleneck between load, save, and compute. The data is supposed to persist between applications, cluster startups, or multiple applications doing stuff to the data at the same time.

I hope it works for what I’m doing and isn’t too buggy. But it looks pretty good.

-- Joe Pride

On Oct 31, 2017, at 11:14 AM, Gene Pang wrote:

> Hi,
>
> Alluxio enables sharing dataframes across different applications. This blog post talks about dataframes and Alluxio, and this Spark Summit presentation has additional information.
>
> Thanks,
> Gene
Re: share datasets across multiple spark-streaming applications for lookup
Hi,

Alluxio enables sharing dataframes across different applications. This blog post <https://www.alluxio.com/blog/effective-spark-dataframes-with-alluxio> talks about dataframes and Alluxio, and this Spark Summit presentation <https://spark-summit.org/2017/events/best-practices-for-using-alluxio-with-apache-spark/> has additional information.

Thanks,
Gene

On Tue, Oct 31, 2017 at 6:04 PM, Revin Chalil wrote:

> Any info on the below will be really appreciated.
>
> I read about Alluxio and Ignite. Has anybody used any of them? Do they work well with multiple Apps doing lookups simultaneously? Are there better options? Thank you.
Re: share datasets across multiple spark-streaming applications for lookup
Any info on the below will be really appreciated.

I read about Alluxio and Ignite. Has anybody used any of them? Do they work well with multiple Apps doing lookups simultaneously? Are there better options? Thank you.

From: roshan joe
Date: Monday, October 30, 2017 at 7:53 PM
To: "user@spark.apache.org"
Subject: share datasets across multiple spark-streaming applications for lookup

> Hi,
>
> What is the recommended way to share datasets across multiple spark-streaming applications, so that the incoming data can be looked up against this shared dataset?
>
> The shared dataset is also incrementally refreshed and stored on S3. Below is the scenario.
>
> Streaming App-1 consumes data from Source-1 and writes to DS-1 in S3.
> Streaming App-2 consumes data from Source-2 and writes to DS-2 in S3.
>
> Streaming App-3 consumes data from Source-3, needs to lookup against DS-1 and DS-2 and write to DS-3 in S3.
> Streaming App-4 consumes data from Source-4, needs to lookup against DS-1 and DS-2 and write to DS-3 in S3.
> Streaming App-n consumes data from Source-n, needs to lookup against DS-1 and DS-2 and write to DS-n in S3.
>
> So DS-1 and DS-2 ideally should be shared for lookup across multiple streaming apps. Any input is appreciated. Thank you!
share datasets across multiple spark-streaming applications for lookup
Hi,

What is the recommended way to share datasets across multiple spark-streaming applications, so that the incoming data can be looked up against this shared dataset?

The shared dataset is also incrementally refreshed and stored on S3. Below is the scenario.

Streaming App-1 consumes data from Source-1 and writes to DS-1 in S3.
Streaming App-2 consumes data from Source-2 and writes to DS-2 in S3.

Streaming App-3 consumes data from Source-3, *needs to lookup against DS-1 and DS-2* and write to DS-3 in S3.
Streaming App-4 consumes data from Source-4, *needs to lookup against DS-1 and DS-2* and write to DS-3 in S3.
Streaming App-n consumes data from Source-n, *needs to lookup against DS-1 and DS-2* and write to DS-n in S3.

So DS-1 and DS-2 ideally should be shared for lookup across multiple streaming apps. Any input is appreciated. Thank you!
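
To make the lookup side of the scenario above a bit more concrete, here is a minimal plain-Python sketch. It is not Spark code and not the API of any particular store; it only illustrates the general pattern of a consumer (App-3, App-4, ...) enriching incoming records against DS-1 and DS-2 while periodically reloading them, since those datasets are refreshed incrementally. The names RefreshingLookup, enrich, loader and ttl_seconds are made up for illustration, the join key "key" is an assumption, and the in-memory loaders stand in for whatever actually reads the datasets from S3 or a shared cache.

```python
import time
from typing import Any, Callable, Dict, Iterable, Iterator, Optional


class RefreshingLookup:
    """Key/value lookup table that reloads its snapshot once a TTL has elapsed.

    `loader` is a hypothetical callable returning the current snapshot of a
    shared dataset (e.g. DS-1) as a dict; how it reads the data is left open.
    """

    def __init__(
        self,
        loader: Callable[[], Dict[Any, Any]],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Dict[Any, Any] = {}
        self._loaded_at: Optional[float] = None

    def _refresh_if_stale(self) -> None:
        # Reload on first use, or when the snapshot is at least TTL seconds old.
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._data = dict(self._loader())
            self._loaded_at = now

    def get(self, key: Any, default: Any = None) -> Any:
        self._refresh_if_stale()
        return self._data.get(key, default)


def enrich(
    records: Iterable[Dict[str, Any]],
    ds1: RefreshingLookup,
    ds2: RefreshingLookup,
) -> Iterator[Dict[str, Any]]:
    """Attach the DS-1 and DS-2 values for each record's key (None if absent)."""
    for rec in records:
        out = dict(rec)
        out["ds1"] = ds1.get(rec["key"])
        out["ds2"] = ds2.get(rec["key"])
        yield out


if __name__ == "__main__":
    # In-memory stand-ins for the shared datasets (assumption for the demo).
    ds1 = RefreshingLookup(lambda: {"u1": "gold"}, ttl_seconds=60)
    ds2 = RefreshingLookup(lambda: {"u1": "US"}, ttl_seconds=60)
    for row in enrich([{"key": "u1"}, {"key": "u2"}], ds1, ds2):
        print(row)
```

The TTL trades freshness against reload cost: a shorter TTL picks up refreshed data sooner but re-reads the dataset more often. Whether each app holds its own copy like this, or all apps read from one shared in-memory layer, is exactly the design choice the question is asking about.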