[jira] [Comment Edited] (BAHIR-135) Add Spark Streaming Hazelcast Extension

Eren Avsarogullari (JIRA) Sat, 23 Sep 2017 04:57:23 -0700

    [ 
https://issues.apache.org/jira/browse/BAHIR-135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16177751#comment-16177751
 ]


Eren Avsarogullari edited comment on BAHIR-135 at 9/23/17 11:56 AM:
--------------------------------------------------------------------

Hi [~ckadner],

Firstly, many thanks for your reply.

Please find my comments as follows:

*1-* I was aware of the hazelcast-spark connector. It aims to read data stored 
in Hazelcast as RDDs and to write RDDs to Hazelcast as Distributed Object 
Entries. The Spark Streaming Hazelcast Connector, on the other hand, aims to 
create a DStream by listening to Hazelcast Distributed Object Events 
(Distributed Map, List, Set, Queue, Topic, MultiMap and Replicated Map). These 
events are fired when a distributed data structure changes (an entry is added, 
updated, removed or evicted). As you mentioned, one of them serves as a 
DataSource, and the other provides a DStream by treating distributed events as 
stream data.
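
To illustrate the difference, here is a minimal sketch of the listener-to-stream 
idea in plain Python. It does not use the Spark or Hazelcast APIs. The names 
EntryEvent, EventReceiver, on_event and next_batch are hypothetical and are 
not taken from the proposed extension. The assumption is that a listener pushes 
each event into a buffer and that the streaming side drains the buffer once per 
batch interval.

```python
import threading
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass(frozen=True)
class EntryEvent:
    """Hypothetical stand-in for a distributed-object event."""
    structure: str   # e.g. "map", "queue", "topic"
    event_type: str  # "ADDED", "UPDATED", "REMOVED" or "EVICTED"
    key: str
    member: str      # "host:port" of the instance that fired the event


class EventReceiver:
    """Buffers events pushed by a listener and hands them out in batches."""

    def __init__(self) -> None:
        self._buffer: Deque[EntryEvent] = deque()
        self._lock = threading.Lock()

    def on_event(self, event: EntryEvent) -> None:
        # Called by the (hypothetical) listener for every structure change.
        with self._lock:
            self._buffer.append(event)

    def next_batch(self) -> List[EntryEvent]:
        # Called once per batch interval; drains everything received so far.
        with self._lock:
            batch = list(self._buffer)
            self._buffer.clear()
        return batch
```

Each call to next_batch() returns the events received since the previous call, 
in arrival order, which roughly corresponds to one micro-batch of a DStream.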

I think it can be useful for users to get these events as a Spark DStream and 
take advantage of Spark Streaming features such as:
*   analytics, e.g.:
       How many entries are added, updated, removed or evicted in the 
       Hazelcast Distributed Objects over a specific duration?
       How many entries are handled by Hazelcast Instances (host:port) / 
       Nodes (host) over a specific duration?
*   storing the data in data/disk stores (HDFS/S3)
*   transforming data
*   querying data via DataFrames (converting RDDs to DFs)
*   joining with other streams
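
The analytics questions above can be sketched as simple per-batch counts. This 
is plain Python, not Spark Streaming code. It assumes each event is reduced to 
a hypothetical (event_type, member) pair, where member is a "host:port" 
string, and that one list of such pairs represents the events of one batch 
duration.

```python
from collections import Counter
from typing import Iterable, Tuple

# Hypothetical event shape: (event_type, member), with member as "host:port".
Event = Tuple[str, str]


def count_by_type(batch: Iterable[Event]) -> Counter:
    """Count events per type (ADDED, UPDATED, REMOVED, EVICTED)."""
    return Counter(event_type for event_type, _ in batch)


def count_by_instance(batch: Iterable[Event]) -> Counter:
    """Count events per instance, identified by host:port."""
    return Counter(member for _, member in batch)


def count_by_node(batch: Iterable[Event]) -> Counter:
    """Count events per node, identified by host only."""
    # Strip the port so instances on the same host are counted together.
    return Counter(member.rsplit(":", 1)[0] for _, member in batch)
```

In a real job, the same grouping would be applied to each batch or window of 
the stream instead of to an in-memory list.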

*2-* I think it would definitely be useful to get some feedback from the 
Hazelcast team first, so I will contact them and let you know.


was (Author: erenavsarogullari):
Hi [~ckadner],

Firstly, many thanks for your reply.

Please find my comments as follows:

1- I was aware of hazelcast-spark connector and it aims reading data stored in 
Hazelcast as RDD and writing RDD to Hazelcast as Distributed Object Entries. On 
the other hand, Spark Streaming Hazelcast Connector aims creating DStream by 
listening Hazelcast Distributed Object Events(Distributed Map, List, Set, 
Queue, Topic, MultiMap and Replicated Map). Also these events are fired in the 
light of distributed data structure changes(a new entry is added, updated, 
removed or evicted). As you mentioned, one of them aims DataSource and the 
other one aims DStream by focusing distributed events as stream data.

I think it can be useful for the users to get these events as Spark DStream and 
taking advantage of Spark Streaming features as follows:
*   analytics
       e.g:
       How many entries is added, updated, removed or evicted for the Hazelcast 
Distributed Objects for specific duration?
       How many entries is handled by Hazelcast Instances(host:port) / 
Nodes(host) for specific duration?
*   to store the data to data/disk stores(HDFS/S3)
*   to transform data 
*   to query data via DF(converting RDDs to DFs)
*   to join with other streams

2- I think it can be useful definitely to get some feedbacks from Hazelcast 
team first so i will be contacting them and let you know.

> Add Spark Streaming Hazelcast Extension
> ---------------------------------------
>
>                 Key: BAHIR-135
>                 URL: https://issues.apache.org/jira/browse/BAHIR-135
>             Project: Bahir
>          Issue Type: New Feature
>          Components: Spark Streaming Connectors
>            Reporter: Eren Avsarogullari
>
> I would like to propose a Spark Streaming Hazelcast extension. 
> Hazelcast is an in-memory data grid (IMDG) solution under the Apache 2 License 
> that provides distributed data structures such as distributed map, list, set, 
> queue, etc. When an entry is _added_, _updated_, _removed_ or _evicted_, 
> a new event is fired by Hazelcast. This flow is almost the same for all of the 
> above distributed data structures. This extension aims to subscribe to these 
> distributed events via Hazelcast Event Listeners and create a DStream from 
> the distributed data structure changes. This extension supports 
> Distributed Map, List, Set, Queue, Topic, MultiMap and Replicated Map.
> Please find the following documentation for further details.
> *Proposal:* 
> [https://docs.google.com/document/d/1YN_9u72Wv699g8ivM3c8K_zZUbUl73JtquWy-g71Tm4/edit?usp=sharing]
> The repo is also ready for review. It covers the implementation, full unit 
> test coverage and examples.
> *Repo:* [https://github.com/erenavsarogullari/bahir/tree/Hazelcast_Streaming]
> This extension can be useful for both the Spark and Hazelcast communities to 
> listen to these Hazelcast events, analyze them and transform the event 
> payloads via Spark.
> Please let me know if you need further details. All feedback is welcome.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
