[jira] [Assigned] (S2GRAPH-185) Support Spark Structured Streaming to work with data in streaming and batch

Chul Kang (JIRA) Wed, 28 Mar 2018 01:08:43 -0700

     [ 
https://issues.apache.org/jira/browse/S2GRAPH-185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Chul Kang reassigned S2GRAPH-185:
---------------------------------

    Assignee: Chul Kang

> Support Spark Structured Streaming to work with data in streaming and batch
> ---------------------------------------------------------------------------
>
>                 Key: S2GRAPH-185
>                 URL: https://issues.apache.org/jira/browse/S2GRAPH-185
>             Project: S2Graph
>          Issue Type: New Feature
>            Reporter: Chul Kang
>            Assignee: Chul Kang
>            Priority: Major
>
> By default, S2Graph publishes all edge/vertex requests to Kafka in WAL 
> format.
>  At Kakao, S2Graph has been used as the master database that stores all 
> user activities.
>  I have been developing several ETL jobs suited to these use cases, and I 
> want to contribute them.
> The use cases are as follows:
> {code:java}
> Save edges/vertices arriving through Kafka to other storage systems
> - Druid sink for slice and dice
> - Elasticsearch (ES) sink for search
> - file sink for storing edges/vertices
> Ingest from various storage systems into S2Graph
> - MySQL binlog
> - HDFS/Hive/HBase
> ETL jobs on edge/vertex data
> - merge all user activities by userId
> - generate statistical information
> - apply ML libraries to graph-format data
> {code}
>  
> Below are some simple requirements for this:
>  * support both streaming and static source data processing
>  * the computation flow is reusable and shared between streaming and batch
>  * jobs are operated through a simple job description
>  
> Spark Structured Streaming provides a unified API for both streaming and 
> batch by using the DataFrame/Dataset API from Spark SQL.
>  It allows the same operations to be executed on bounded and unbounded data 
> sources and provides end-to-end exactly-once fault-tolerance guarantees 
> through checkpointing and write-ahead logs.
>  Structured Streaming provides several built-in sources and sinks, and it 
> supports custom implementations of the Source/Sink interface.
> Using this, we can easily develop ETL jobs that can be linked to various 
> repositories.
>  
> Reference: 
> [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html]
>  
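
The requirements quoted above (one computation flow shared between streaming and batch, driven by a simple job description) can be illustrated with a small sketch. This is plain Python using only the standard library, not Spark or S2Graph code: the record fields (`from`, `to`, `label`), the step names, and the `JOB` layout are all hypothetical and chosen for illustration only. In Spark itself, the analogous idea would be applying the same DataFrame transformations to a source opened with `spark.read` or with `spark.readStream`.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

# Hypothetical edge record; the actual S2Graph WAL layout is not given in the issue.
Edge = Dict[str, str]
Step = Callable[[Iterable], Iterator]


def keep_label(label: str) -> Step:
    """Build a step that keeps only edges carrying the given label."""
    def step(edges: Iterable[Edge]) -> Iterator[Edge]:
        return (e for e in edges if e["label"] == label)
    return step


def to_user_activity() -> Step:
    """Build a step that projects each edge to a (userId, target) pair."""
    def step(edges: Iterable[Edge]) -> Iterator[Tuple[str, str]]:
        return ((e["from"], e["to"]) for e in edges)
    return step


# Registry mapping step names used in a job description to step builders.
STEP_BUILDERS = {"filter_label": keep_label, "user_activity": to_user_activity}

# Hypothetical job description: an ordered list of named steps with arguments.
JOB = {
    "steps": [
        {"type": "filter_label", "args": {"label": "click"}},
        {"type": "user_activity"},
    ]
}


def build_flow(job: dict) -> Step:
    """Compose the steps named in the job description into one lazy flow."""
    steps = [STEP_BUILDERS[s["type"]](**s.get("args", {})) for s in job["steps"]]

    def flow(records: Iterable) -> Iterator:
        for step in steps:
            records = step(records)
        return iter(records)
    return flow


def merge_by_user(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Merge all activities per userId, preserving arrival order."""
    merged: Dict[str, List[str]] = defaultdict(list)
    for user, target in pairs:
        merged[user].append(target)
    return dict(merged)


SAMPLE_EDGES: List[Edge] = [
    {"from": "u1", "to": "item1", "label": "click"},
    {"from": "u2", "to": "item2", "label": "like"},
    {"from": "u1", "to": "item3", "label": "click"},
]


def edge_stream(edges: Iterable[Edge]) -> Iterator[Edge]:
    """Stand-in for a streaming source: yields one record at a time."""
    for edge in edges:
        yield edge


flow = build_flow(JOB)
batch_result = merge_by_user(flow(SAMPLE_EDGES))                # bounded input
stream_result = merge_by_user(flow(edge_stream(SAMPLE_EDGES)))  # record-at-a-time input
print(batch_result)
print(stream_result)
```

Both calls go through the same `flow` object, so the filtering and projection logic is written once and only the source differs. A real streaming job would also need incremental state and checkpointing, which this sketch leaves out.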



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
