hudi-bot opened a new issue, #14503:
URL: https://github.com/apache/hudi/issues/14503
The goal here is to capture changes (CDC) from DynamoDB and ingest them into
S3 as a Hudi dataset.
A few resources:
# DynamoDB Streams
[https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html]
provides change capture logs through a Kinesis-like API.
# Walkthrough
[https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.Walkthrough.html]
Code [https://github.com/awslabs/dynamodb-streams-kinesis-adapter]
# Spark Streaming has support for reading Kinesis streams
[https://spark.apache.org/docs/2.4.4/streaming-kinesis-integration.html]. One of
the many resources showing how to change the Spark Kinesis example code to
consume a DynamoDB stream:
[https://medium.com/@ravi72munde/using-spark-streaming-with-dynamodb-d325b9a73c79]
# In DeltaStreamer, we need to add some form of KinesisSource that returns
an RDD with new data every time `fetchNewData` is called
[https://github.com/apache/incubator-hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/Source.java].
DeltaStreamer itself does not use Spark Streaming APIs.
# Internally, we have Avro, Json, Row sources that extract data in these
formats.
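The `fetchNewData` idea above (each call returns only data that arrived since the last checkpoint) could look roughly like the sketch below. This is only an illustration under stated assumptions: `FakeShard`, `Batch`, and `SketchKinesisSource` are hypothetical names, an in-memory list stands in for a Kinesis shard, a plain `List` stands in for the RDD, and the checkpoint is just a stringified offset. It is not the actual Hudi `Source` API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for one Kinesis/DynamoDB stream shard: an ordered, append-only log.
final class FakeShard {
    private final List<String> records = new ArrayList<>();

    void append(String record) {
        records.add(record);
    }

    int size() {
        return records.size();
    }

    // Returns a copy of the records in [fromInclusive, toExclusive).
    List<String> read(int fromInclusive, int toExclusive) {
        return new ArrayList<>(records.subList(fromInclusive, toExclusive));
    }
}

// A batch of new records plus the checkpoint to pass into the next call.
final class Batch {
    final List<String> records;
    final String checkpoint;

    Batch(List<String> records, String checkpoint) {
        this.records = records;
        this.checkpoint = checkpoint;
    }
}

// Hypothetical source: every call returns only records after the given checkpoint.
final class SketchKinesisSource {
    private final FakeShard shard;

    SketchKinesisSource(FakeShard shard) {
        this.shard = shard;
    }

    Batch fetchNewData(Optional<String> lastCheckpoint, int maxRecords) {
        // With no checkpoint, start from the beginning of the shard.
        int start = lastCheckpoint.map(Integer::parseInt).orElse(0);
        // Never read past the end of the shard or beyond the per-call limit.
        int end = Math.min(shard.size(), start + maxRecords);
        return new Batch(shard.read(start, end), Integer.toString(end));
    }
}
```

The caller persists `checkpoint` along with the write and passes it back on the next run, so no streaming context has to stay alive between batches. A real implementation would presumably track one sequence number per shard rather than a single offset.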
Open questions:
# Should this just be a KinesisSource inside Hudi that is configured
differently, or do we need two sources: a DynamoDBKinesisSource (that
does some DynamoDB Streams-specific setup/assumptions) and a plain
KinesisSource? Which is more valuable to do if we have to pick one?
# For Kafka integration, we just reused the KafkaRDD in Spark Streaming
easily and avoided writing a lot of code by hand. Could we pull the same thing
off for Kinesis? (probably needs digging through Spark code)
# What's the format of the data for DynamoDB streams?
We should probably flesh these out before going ahead with implementation.
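On the data format question: a DynamoDB Streams record carries an `eventName` (`INSERT`, `MODIFY`, or `REMOVE`) and a `dynamodb` section with `Keys` and, depending on the stream view type, `NewImage` and/or `OldImage`. Each attribute inside those images is wrapped in a type tag, such as `{"S": "abc"}` or `{"N": "42"}`. A minimal sketch of turning one image into a flat row might look like the following. The class and method names are hypothetical, the image is assumed to be already parsed into nested maps, and only the `S`, `N`, `BOOL`, and `NULL` tags are handled.

```java
import java.math.BigDecimal;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not part of Hudi or the AWS SDK.
final class DynamoStreamRecordSketch {
    private DynamoStreamRecordSketch() {
    }

    // Unwraps a single typed attribute value such as {"S": "abc"} or {"N": "42"}.
    static Object unwrap(Map<String, Object> attribute) {
        if (attribute.size() != 1) {
            throw new IllegalArgumentException("Expected exactly one type tag, got: " + attribute);
        }
        Map.Entry<String, Object> entry = attribute.entrySet().iterator().next();
        switch (entry.getKey()) {
            case "S":
            case "BOOL":
                return entry.getValue();
            case "N":
                // Numbers are transported as strings to preserve precision.
                return new BigDecimal((String) entry.getValue());
            case "NULL":
                return null;
            default:
                throw new IllegalArgumentException("Type not handled in this sketch: " + entry.getKey());
        }
    }

    // Flattens an image (attribute name -> typed value) into attribute name -> plain value.
    static Map<String, Object> flatten(Map<String, Map<String, Object>> image) {
        Map<String, Object> row = new LinkedHashMap<>();
        for (Map.Entry<String, Map<String, Object>> field : image.entrySet()) {
            row.put(field.getKey(), unwrap(field.getValue()));
        }
        return row;
    }

    // REMOVE events have no NewImage, so they would have to be written as deletes.
    static boolean isDelete(String eventName) {
        return "REMOVE".equals(eventName);
    }
}
```

For `INSERT` and `MODIFY` events the flattened `NewImage` would be the upsert payload; for `REMOVE` events only `Keys` (and possibly `OldImage`) are available, which is one of the DynamoDB-specific assumptions a dedicated source would need to encode.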
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-310
- Type: New Feature
- Epic: https://issues.apache.org/jira/browse/HUDI-1385
---
## Comments
13/Nov/19 05:11;vinoth;Hi [~vinaypatil18] any updates for us? ;;;
---
26/Jan/21 20:23;shivnarayan;[~vinoth]: Is this still relevant? do we keep it
open. ;;;
---
04/Nov/21 01:33;maddy2u;Is this still happening? ;;;
---
06/Nov/21 05:09;vinoth;This has changed hands quite a bit. We still want to
take this on. Interested?;;;
---
09/Jan/22 08:57;vinaypatil18;[~vinoth] I remember discussing about this,
sry it went to backlog, I am taking this up. ;;;
---
11/Mar/22 15:00;maddy2u;Hi [~vinaypatil18] - Any update on the above? Which
release are we targeting this feature?;;;