Re: share datasets across multiple spark-streaming applications for lookup

2017-11-02 Thread JG Perrin
Or Databricks Delta (announced at Spark Summit) or IBM Event Store, depending 
on the use case.

On Oct 31, 2017, at 14:30, Joseph Pride <jos...@versanalytics.com> wrote:

Folks:

SnappyData.

I’m fairly new to working with it myself, but it looks pretty promising. It 
marries Spark with a co-located in-memory GemFire (or something gem-related) 
database. So you can access the data with SQL, JDBC, ODBC (if you wanna go 
Enterprise instead of open-source) or natively as mutable RDDs and DataFrames.

You can run it so the storage and Spark compute are co-located in the same JVM 
on each machine, so you get data locality instead of a bottleneck between load, 
save, and compute. The data is supposed to persist between applications, 
cluster startups, or multiple applications doing stuff to the data at the same 
time.

I hope it works for what I’m doing and isn’t too buggy. But it looks pretty 
good.

—Joe Pride

On Oct 31, 2017, at 11:14 AM, Gene Pang <gene.p...@gmail.com> wrote:

Hi,

Alluxio enables sharing dataframes across different applications. This blog 
post <https://www.alluxio.com/blog/effective-spark-dataframes-with-alluxio> 
talks about dataframes and Alluxio, and this Spark Summit presentation 
<https://spark-summit.org/2017/events/best-practices-for-using-alluxio-with-apache-spark/> 
has additional information.

Thanks,
Gene

On Tue, Oct 31, 2017 at 6:04 PM, Revin Chalil <rcha...@expedia.com> wrote:
Any info on the below will be really appreciated.

I read about Alluxio and Ignite. Has anybody used any of them? Do they work 
well with multiple Apps doing lookups simultaneously? Are there better options? 
Thank you.

From: roshan joe <impdocs2...@gmail.com>
Date: Monday, October 30, 2017 at 7:53 PM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: share datasets across multiple spark-streaming applications for lookup

Hi,

What is the recommended way to share datasets across multiple spark-streaming 
applications, so that the incoming data can be looked up against this shared 
dataset?

The shared dataset is also incrementally refreshed and stored on S3. Below is 
the scenario.

Streaming App-1 consumes data from Source-1 and writes to DS-1 in S3.
Streaming App-2 consumes data from Source-2 and writes to DS-2 in S3.


Streaming App-3 consumes data from Source-3, needs to lookup against DS-1 and 
DS-2 and write to DS-3 in S3.
Streaming App-4 consumes data from Source-4, needs to lookup against DS-1 and 
DS-2 and write to DS-3 in S3.
Streaming App-n consumes data from Source-n, needs to lookup against DS-1 and 
DS-2 and write to DS-n in S3.

So DS-1 and DS-2 ideally should be shared for lookup across multiple streaming 
apps. Any input is appreciated. Thank you!




Re: share datasets across multiple spark-streaming applications for lookup

2017-10-31 Thread Joseph Pride
Folks:

SnappyData.

I’m fairly new to working with it myself, but it looks pretty promising. It 
marries Spark with a co-located in-memory GemFire (or something gem-related) 
database. So you can access the data with SQL, JDBC, ODBC (if you wanna go 
Enterprise instead of open-source) or natively as mutable RDDs and DataFrames.

You can run it so the storage and Spark compute are co-located in the same JVM 
on each machine, so you get data locality instead of a bottleneck between load, 
save, and compute. The data is supposed to persist between applications, 
cluster startups, or multiple applications doing stuff to the data at the same 
time.

I hope it works for what I’m doing and isn’t too buggy. But it looks pretty 
good.

—Joe Pride



Re: share datasets across multiple spark-streaming applications for lookup

2017-10-31 Thread Gene Pang
Hi,

Alluxio enables sharing dataframes across different applications. This blog
post <https://www.alluxio.com/blog/effective-spark-dataframes-with-alluxio>
talks about dataframes and Alluxio, and this Spark Summit presentation
<https://spark-summit.org/2017/events/best-practices-for-using-alluxio-with-apache-spark/>
has additional information.

Thanks,
Gene



Re: share datasets across multiple spark-streaming applications for lookup

2017-10-31 Thread Revin Chalil
Any info on the below will be really appreciated.

I read about Alluxio and Ignite. Has anybody used any of them? Do they work 
well with multiple Apps doing lookups simultaneously? Are there better options? 
Thank you.



share datasets across multiple spark-streaming applications for lookup

2017-10-30 Thread roshan joe
Hi,

What is the recommended way to share datasets across multiple
spark-streaming applications, so that the incoming data can be looked up
against this shared dataset?

The shared dataset is also incrementally refreshed and stored on S3. Below
is the scenario.

Streaming App-1 consumes data from Source-1 and writes to DS-1 in S3.
Streaming App-2 consumes data from Source-2 and writes to DS-2 in S3.


Streaming App-3 consumes data from Source-3, *needs to look up against DS-1
and DS-2* and write to DS-3 in S3.
Streaming App-4 consumes data from Source-4, *needs to look up against DS-1
and DS-2* and write to DS-3 in S3.
Streaming App-n consumes data from Source-n, *needs to look up against DS-1
and DS-2* and write to DS-n in S3.

So DS-1 and DS-2 ideally should be shared for lookup across multiple
streaming apps. Any input is appreciated. Thank you!
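
To make the scenario concrete, here is a rough sketch (not something proposed
in this thread) of the simplest pattern: each consuming app keeps its own
in-memory copy of DS-1 and DS-2 and reloads it once it is older than a chosen
interval, so incremental refreshes on S3 are eventually picked up. Everything
named here is hypothetical: `RefreshingLookup`, `enrich`, the `loader`
callable, the TTL, and the `key` field on each record. In a real job the
loader would read the dataset from S3; a plain function stands in for that.

```python
import time
from typing import Callable, Dict, List, Optional


class RefreshingLookup:
    """In-memory lookup table that reloads from `loader` once it is older than the TTL."""

    def __init__(
        self,
        loader: Callable[[], Dict[str, dict]],  # hypothetical: would read DS-1/DS-2 from S3
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._table: Dict[str, dict] = {}
        self._loaded_at: Optional[float] = None

    def _refresh_if_stale(self) -> None:
        # Reload on first use, or when the cached copy has reached the TTL.
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._table = self._loader()
            self._loaded_at = now

    def get(self, key: str) -> Optional[dict]:
        self._refresh_if_stale()
        return self._table.get(key)


def enrich(batch: List[dict], ds1: RefreshingLookup, ds2: RefreshingLookup) -> List[dict]:
    """Attach the DS-1 and DS-2 matches (or None) to every incoming record."""
    return [
        {**rec, "ds1": ds1.get(rec["key"]), "ds2": ds2.get(rec["key"])}
        for rec in batch
    ]
```

Each app pays the cost of loading and holding its own copy, which is the
duplication that the shared stores suggested in the replies are meant to avoid.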