[jira] [Commented] (HUDI-310) DynamoDB/Kinesis Change Capture using Delta Streamer

2022-03-11 Thread Madhavan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504958#comment-17504958
 ] 

Madhavan commented on HUDI-310:
---

Hi [~vinaypatil18] - Any update on the above? Which release are we targeting 
for this feature?

> DynamoDB/Kinesis Change Capture using Delta Streamer
> 
>
> Key: HUDI-310
> URL: https://issues.apache.org/jira/browse/HUDI-310
> Project: Apache Hudi
> Issue Type: New Feature
> Components: deltastreamer
> Reporter: Vinoth Chandar
> Assignee: Vinay
> Priority: Major
>
> The goal here is to do CDC from DynamoDB and then have it be ingested into S3
> as a Hudi dataset.
> A few resources:
>  # DynamoDB Streams
> [https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html]
>   provides change capture logs through a Kinesis-like API.
>  # Walkthrough
> [https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.Walkthrough.html]
>  Code [https://github.com/awslabs/dynamodb-streams-kinesis-adapter]
>  # Spark Streaming has support for reading Kinesis streams
> [https://spark.apache.org/docs/2.4.4/streaming-kinesis-integration.html]. One
> of the many resources showing how to change the Spark Kinesis example code to
> consume a DynamoDB stream:
> [https://medium.com/@ravi72munde/using-spark-streaming-with-dynamodb-d325b9a73c79]
>  # In DeltaStreamer, we need to add some form of KinesisSource that returns an
> RDD with new data every time `fetchNewData` is called
> [https://github.com/apache/incubator-hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/Source.java].
> DeltaStreamer itself does not use Spark Streaming APIs.
>  # Internally, we have Avro, Json, Row sources that extract data in these
> formats.
> Open questions:
>  # Should this just be a KinesisSource inside Hudi that needs to be
> configured differently, or do we need two sources: DynamoDBKinesisSource (that
> does some DynamoDB Stream specific setup/assumptions) and a plain
> KinesisSource? Which is more valuable to do, if we have to pick one?
>  # For Kafka integration, we just reused the KafkaRDD in Spark Streaming
> easily and avoided writing a lot of code by hand. Could we pull the same
> thing off for Kinesis? (probably needs digging through Spark code)
>  # What's the format of the data for DynamoDB streams?
>
>
> We should probably flesh these out before going ahead with implementation.
>  
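
To make the `fetchNewData` idea and the data-format question in the description
above more concrete, here is a minimal sketch in Python. It is not Hudi code and
not a proposed implementation. It assumes the record layout documented for
DynamoDB Streams (`eventName`, plus a `dynamodb` object holding `Keys`,
`NewImage` and `SequenceNumber`, with values in AttributeValue form such as
{"S": "a"} or {"N": "2"}). The function names, the `_event_name`,
`_sequence_number` and `_is_deleted` columns, and the in-memory `SAMPLE_STREAM`
list standing in for a single shard are all hypothetical, chosen only for
illustration.

```python
from typing import Any, Dict, List, Optional, Tuple


def unwrap_attribute(value: Dict[str, Any]) -> Any:
    """Convert one AttributeValue, e.g. {"S": "abc"}, into a plain Python value."""
    type_tag, payload = next(iter(value.items()))
    if type_tag in ("S", "B", "BOOL"):
        return payload
    if type_tag == "N":
        # Numbers arrive as strings; keep integers exact, otherwise use float.
        return int(payload) if payload.lstrip("-").isdigit() else float(payload)
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [unwrap_attribute(item) for item in payload]
    if type_tag == "M":
        return {key: unwrap_attribute(item) for key, item in payload.items()}
    if type_tag in ("SS", "BS"):
        return list(payload)
    if type_tag == "NS":
        return [unwrap_attribute({"N": item}) for item in payload]
    raise ValueError(f"unsupported AttributeValue type: {type_tag}")


def flatten_change_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Turn one change record into a flat row (hypothetical column names)."""
    change = record["dynamodb"]
    # REMOVE events (and KEYS_ONLY streams) carry no NewImage, so use the keys.
    image = change.get("NewImage") or change["Keys"]
    row = {name: unwrap_attribute(value) for name, value in image.items()}
    row["_event_name"] = record["eventName"]
    row["_sequence_number"] = change["SequenceNumber"]
    row["_is_deleted"] = record["eventName"] == "REMOVE"
    return row


def fetch_new_data(
    stream: List[Dict[str, Any]], last_checkpoint: Optional[str], limit: int
) -> Tuple[List[Dict[str, Any]], Optional[str]]:
    """Return up to `limit` rows after `last_checkpoint`, plus the next checkpoint.

    `stream` is an in-memory stand-in for one shard, already ordered by
    sequence number. A real source would call the stream service instead.
    """
    pending = [
        record
        for record in stream
        if last_checkpoint is None
        or int(record["dynamodb"]["SequenceNumber"]) > int(last_checkpoint)
    ]
    batch = pending[:limit]
    rows = [flatten_change_record(record) for record in batch]
    # With nothing new to read, the checkpoint stays where it was.
    new_checkpoint = batch[-1]["dynamodb"]["SequenceNumber"] if batch else last_checkpoint
    return rows, new_checkpoint


SAMPLE_STREAM = [
    {"eventName": "INSERT",
     "dynamodb": {"Keys": {"id": {"S": "a"}},
                  "NewImage": {"id": {"S": "a"}, "qty": {"N": "2"}},
                  "SequenceNumber": "100"}},
    {"eventName": "MODIFY",
     "dynamodb": {"Keys": {"id": {"S": "a"}},
                  "NewImage": {"id": {"S": "a"}, "qty": {"N": "5"}},
                  "SequenceNumber": "101"}},
    {"eventName": "REMOVE",
     "dynamodb": {"Keys": {"id": {"S": "a"}}, "SequenceNumber": "102"}},
]

if __name__ == "__main__":
    checkpoint = None
    while True:
        rows, checkpoint = fetch_new_data(SAMPLE_STREAM, checkpoint, limit=2)
        if not rows:
            break
        print(checkpoint, rows)
```

Each call takes the previous checkpoint and returns a bounded batch together
with the checkpoint to resume from, so no long-running streaming context is
needed. Whether a real KinesisSource should checkpoint per shard, and how
deletes should be represented in the Hudi dataset, remain open in this sketch.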



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (HUDI-310) DynamoDB/Kinesis Change Capture using Delta Streamer

2022-01-09 Thread Vinay (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471309#comment-17471309
 ] 

Vinay commented on HUDI-310:


[~vinoth] I remember discussing this. Sorry, it went to the backlog. I am 
taking this up. 

> DynamoDB/Kinesis Change Capture using Delta Streamer
> 
>
> Key: HUDI-310
> URL: https://issues.apache.org/jira/browse/HUDI-310
> Project: Apache Hudi
> Issue Type: New Feature
> Components: DeltaStreamer
> Reporter: Vinoth Chandar
> Assignee: Vinay
> Priority: Major
>





[jira] [Commented] (HUDI-310) DynamoDB/Kinesis Change Capture using Delta Streamer

2021-11-05 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17439600#comment-17439600
 ] 

Vinoth Chandar commented on HUDI-310:
-

This has changed hands quite a bit.  We still want to take this on.  Interested?

> DynamoDB/Kinesis Change Capture using Delta Streamer
> 
>
> Key: HUDI-310
> URL: https://issues.apache.org/jira/browse/HUDI-310
> Project: Apache Hudi
> Issue Type: New Feature
> Components: DeltaStreamer
> Reporter: Vinoth Chandar
> Assignee: Suneel Marthi
> Priority: Major
>





[jira] [Commented] (HUDI-310) DynamoDB/Kinesis Change Capture using Delta Streamer

2021-11-03 Thread Madhavan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17438424#comment-17438424
 ] 

Madhavan commented on HUDI-310:
---

Is this still happening? 

> DynamoDB/Kinesis Change Capture using Delta Streamer
> 
>
> Key: HUDI-310
> URL: https://issues.apache.org/jira/browse/HUDI-310
> Project: Apache Hudi
> Issue Type: New Feature
> Components: DeltaStreamer
> Reporter: Vinoth Chandar
> Assignee: Suneel Marthi
> Priority: Major
>





[jira] [Commented] (HUDI-310) DynamoDB/Kinesis Change Capture using Delta Streamer

2021-01-26 Thread sivabalan narayanan (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272392#comment-17272392
 ] 

sivabalan narayanan commented on HUDI-310:
--

[~vinoth]: Is this still relevant? Do we keep it open? 

> DynamoDB/Kinesis Change Capture using Delta Streamer
> 
>
> Key: HUDI-310
> URL: https://issues.apache.org/jira/browse/HUDI-310
> Project: Apache Hudi
> Issue Type: New Feature
> Components: DeltaStreamer
> Reporter: Vinoth Chandar
> Assignee: Suneel Marthi
> Priority: Major
>





[jira] [Commented] (HUDI-310) DynamoDB/Kinesis Change Capture using Delta Streamer

2019-11-12 Thread Vinoth Chandar (Jira)


[ 
https://issues.apache.org/jira/browse/HUDI-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973020#comment-16973020
 ] 

Vinoth Chandar commented on HUDI-310:
-

Hi [~vinaypatil18], any updates for us? 

> DynamoDB/Kinesis Change Capture using Delta Streamer
> 
>
> Key: HUDI-310
> URL: https://issues.apache.org/jira/browse/HUDI-310
> Project: Apache Hudi (incubating)
> Issue Type: New Feature
> Components: deltastreamer
> Reporter: Vinoth Chandar
> Assignee: Vinay
> Priority: Major
> Fix For: 0.5.1
>
>



