[jira] [Commented] (HUDI-310) DynamoDB/Kinesis Change Capture using Delta Streamer
[ https://issues.apache.org/jira/browse/HUDI-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17504958#comment-17504958 ]

Madhavan commented on HUDI-310:
-------------------------------

Hi [~vinaypatil18] - Any update on the above? Which release are we targeting for this feature?

> DynamoDB/Kinesis Change Capture using Delta Streamer
> ----------------------------------------------------
>
>                 Key: HUDI-310
>                 URL: https://issues.apache.org/jira/browse/HUDI-310
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: deltastreamer
>            Reporter: Vinoth Chandar
>            Assignee: Vinay
>            Priority: Major
>
> The goal here is to do CDC from DynamoDB and then have it be ingested into S3
> as a Hudi dataset.
> Few resources:
> # DynamoDB Streams
> [https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html]
> provides change capture logs in Kinesis.
> # Walkthrough
> [https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.Walkthrough.html]
> Code [https://github.com/awslabs/dynamodb-streams-kinesis-adapter]
> # Spark Streaming has support for reading Kinesis streams
> [https://spark.apache.org/docs/2.4.4/streaming-kinesis-integration.html]. One
> of the many resources showing how to change the Spark Kinesis example code to
> consume a DynamoDB stream:
> [https://medium.com/@ravi72munde/using-spark-streaming-with-dynamodb-d325b9a73c79]
> # In DeltaStreamer, we need to add some form of KinesisSource that returns an
> RDD with new data every time `fetchNewData` is called
> [https://github.com/apache/incubator-hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/Source.java].
> DeltaStreamer itself does not use Spark Streaming APIs.
> # Internally, we have Avro, Json, Row sources that extract data in these
> formats.
>
> Open questions:
> # Should this just be a KinesisSource inside Hudi that is configured
> differently, or do we need two sources: DynamoDBKinesisSource (that
> does some DynamoDB Stream specific setup/assumptions) and a plain
> KinesisSource? Which is more valuable to do, if we have to pick one?
> # For Kafka integration, we just reused the KafkaRDD in Spark Streaming
> easily and avoided writing a lot of code by hand. Could we pull the same
> thing off for Kinesis? (probably needs digging through Spark code)
> # What's the format of the data for DynamoDB streams?
>
> We should probably flesh these out before going ahead with implementation.

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
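For readers following the thread, the pull model the description asks for (a source that hands back only new data each time `fetchNewData` is called, with no Spark Streaming involved) can be sketched roughly as below. This is a minimal illustration, not Hudi's actual `Source` API: `ChangeRecord`, `InMemoryChangeStream`, `PullingSource`, and the `fetch_new_data(last_checkpoint, source_limit)` signature are all hypothetical names chosen for the example, and a plain integer stands in for a real Kinesis or DynamoDB Streams sequence number.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ChangeRecord:
    """One change event. Field names are illustrative, not a real stream schema."""
    sequence_number: int
    payload: str


class InMemoryChangeStream:
    """Hypothetical stand-in for a single stream shard, kept in memory."""

    def __init__(self, records: List[ChangeRecord]) -> None:
        # Keep records ordered so "everything after N" is well defined.
        self._records = sorted(records, key=lambda r: r.sequence_number)

    def read_after(self, sequence_number: int, limit: int) -> List[ChangeRecord]:
        """Return up to `limit` records with a sequence number above the given one."""
        newer = [r for r in self._records if r.sequence_number > sequence_number]
        return newer[:limit]


class PullingSource:
    """Hypothetical pull-based source: each call returns a batch plus a checkpoint."""

    def __init__(self, stream: InMemoryChangeStream) -> None:
        self._stream = stream

    def fetch_new_data(
        self, last_checkpoint: Optional[str], source_limit: int
    ) -> Tuple[List[ChangeRecord], Optional[str]]:
        # No checkpoint means "start from the beginning of the stream".
        start = int(last_checkpoint) if last_checkpoint is not None else -1
        batch = self._stream.read_after(start, source_limit)
        if not batch:
            # Nothing new: hand back the old checkpoint unchanged.
            return [], last_checkpoint
        return batch, str(batch[-1].sequence_number)


if __name__ == "__main__":
    stream = InMemoryChangeStream([ChangeRecord(i, f"event-{i}") for i in range(5)])
    source = PullingSource(stream)
    checkpoint = None
    while True:
        batch, checkpoint = source.fetch_new_data(checkpoint, source_limit=2)
        if not batch:
            break
        print([r.payload for r in batch], "checkpoint:", checkpoint)
```

The caller owns the checkpoint and passes it back on the next call, so the source itself stays stateless between runs. A real implementation would presumably have to track one position per shard rather than a single integer.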
[jira] [Commented] (HUDI-310) DynamoDB/Kinesis Change Capture using Delta Streamer
[ https://issues.apache.org/jira/browse/HUDI-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17471309#comment-17471309 ]

Vinay commented on HUDI-310:
----------------------------

[~vinoth] I remember discussing this; sorry, it went to the backlog. I am taking this up.

> DynamoDB/Kinesis Change Capture using Delta Streamer
> ----------------------------------------------------
>
>                 Key: HUDI-310
>                 URL: https://issues.apache.org/jira/browse/HUDI-310
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: DeltaStreamer
>            Reporter: Vinoth Chandar
>            Assignee: Vinay
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.20.1#820001)
[jira] [Commented] (HUDI-310) DynamoDB/Kinesis Change Capture using Delta Streamer
[ https://issues.apache.org/jira/browse/HUDI-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17439600#comment-17439600 ]

Vinoth Chandar commented on HUDI-310:
-------------------------------------

This has changed hands quite a bit. We still want to take this on. Interested?

> DynamoDB/Kinesis Change Capture using Delta Streamer
> ----------------------------------------------------
>
>                 Key: HUDI-310
>                 URL: https://issues.apache.org/jira/browse/HUDI-310
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: DeltaStreamer
>            Reporter: Vinoth Chandar
>            Assignee: Suneel Marthi
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (HUDI-310) DynamoDB/Kinesis Change Capture using Delta Streamer
[ https://issues.apache.org/jira/browse/HUDI-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17438424#comment-17438424 ]

Madhavan commented on HUDI-310:
-------------------------------

Is this still happening?

> DynamoDB/Kinesis Change Capture using Delta Streamer
> ----------------------------------------------------
>
>                 Key: HUDI-310
>                 URL: https://issues.apache.org/jira/browse/HUDI-310
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: DeltaStreamer
>            Reporter: Vinoth Chandar
>            Assignee: Suneel Marthi
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (HUDI-310) DynamoDB/Kinesis Change Capture using Delta Streamer
[ https://issues.apache.org/jira/browse/HUDI-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272392#comment-17272392 ]

sivabalan narayanan commented on HUDI-310:
------------------------------------------

[~vinoth]: Is this still relevant? Do we keep it open?

> DynamoDB/Kinesis Change Capture using Delta Streamer
> ----------------------------------------------------
>
>                 Key: HUDI-310
>                 URL: https://issues.apache.org/jira/browse/HUDI-310
>             Project: Apache Hudi
>          Issue Type: New Feature
>          Components: DeltaStreamer
>            Reporter: Vinoth Chandar
>            Assignee: Suneel Marthi
>            Priority: Major

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
[jira] [Commented] (HUDI-310) DynamoDB/Kinesis Change Capture using Delta Streamer
[ https://issues.apache.org/jira/browse/HUDI-310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16973020#comment-16973020 ]

Vinoth Chandar commented on HUDI-310:
-------------------------------------

Hi [~vinaypatil18] any updates for us?

> DynamoDB/Kinesis Change Capture using Delta Streamer
> ----------------------------------------------------
>
>                 Key: HUDI-310
>                 URL: https://issues.apache.org/jira/browse/HUDI-310
>             Project: Apache Hudi (incubating)
>          Issue Type: New Feature
>          Components: deltastreamer
>            Reporter: Vinoth Chandar
>            Assignee: Vinay
>            Priority: Major
>             Fix For: 0.5.1
>

--
This message was sent by Atlassian Jira
(v8.3.4#803005)