hudi-bot opened a new issue, #14503:
URL: https://github.com/apache/hudi/issues/14503

   The goal here is to capture changes (CDC) from DynamoDB and ingest them into 
S3 as a Hudi dataset. 
   
   A few resources: 
    # DynamoDB Streams 
[https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html] 
 exposes change capture logs that can be read through a Kinesis-style 
interface (see the adapter below). 
    # Walkthrough 
[https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.KCLAdapter.Walkthrough.html]
 Code [https://github.com/awslabs/dynamodb-streams-kinesis-adapter] 
    # Spark Streaming has support for reading Kinesis streams: 
[https://spark.apache.org/docs/2.4.4/streaming-kinesis-integration.html]. One of 
the many resources showing how to change the Spark Kinesis example code to 
consume a DynamoDB stream: 
[https://medium.com/@ravi72munde/using-spark-streaming-with-dynamodb-d325b9a73c79]
    # In DeltaStreamer, we need to add some form of KinesisSource that returns 
an RDD with new data every time `fetchNewData` is called 
[https://github.com/apache/incubator-hudi/blob/master/hudi-utilities/src/main/java/org/apache/hudi/utilities/sources/Source.java]
  . DeltaStreamer itself does not use Spark Streaming APIs.
    # Internally, we have Avro, JSON, and Row sources that extract data in these 
formats. 
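
   As a rough illustration of the pull model described in the DeltaStreamer 
item above (each call returns only the records after the last checkpoint, 
with no long-running streaming receiver), here is a minimal Python sketch. 
Everything in it is hypothetical: `InMemoryShard`, `fetch_new_data`, and the 
checkpoint string format are stand-ins for illustration, not Hudi or AWS APIs. 

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class InMemoryShard:
    """Hypothetical stand-in for one stream shard: an append-only record list."""

    records: List[str]

    def read_after(self, sequence: int, limit: int) -> List[Tuple[int, str]]:
        # Return up to `limit` (sequence_number, payload) pairs that come
        # strictly after `sequence`.
        start = sequence + 1
        end = min(start + limit, len(self.records))
        return [(i, self.records[i]) for i in range(start, end)]


def fetch_new_data(
    shard: InMemoryShard, last_checkpoint: Optional[str], source_limit: int
) -> Tuple[List[str], Optional[str]]:
    """Return (new payloads, new checkpoint) for a single pull.

    Assumption: the checkpoint is simply the last sequence number read,
    stored as a string. A real source would track one position per shard.
    """
    last_seq = int(last_checkpoint) if last_checkpoint is not None else -1
    batch = shard.read_after(last_seq, source_limit)
    if not batch:
        # Nothing new: hand back the old checkpoint unchanged.
        return [], last_checkpoint
    new_checkpoint = str(batch[-1][0])
    return [payload for _, payload in batch], new_checkpoint
```

   Calling `fetch_new_data` repeatedly with the checkpoint returned by the 
previous call walks through the shard in bounded batches, which is the 
behaviour a batch-oriented KinesisSource would need. 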
   
   Open questions : 
    # Should this just be a single KinesisSource inside Hudi that is 
configured differently for DynamoDB, or do we need two sources: a 
DynamoDBKinesisSource (that does some DynamoDB Streams specific 
setup/assumptions) and a plain KinesisSource? Which is more valuable if we 
have to pick one? 
    # For Kafka integration, we just reused the KafkaRDD in Spark Streaming 
easily and avoided writing a lot of code by hand. Could we pull the same thing 
off for Kinesis? (probably needs digging through Spark code) 
    # What's the format of the data for DynamoDB streams? 
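
   On the data format question: as far as I know, each DynamoDB Streams 
record carries an `eventName` (`INSERT`, `MODIFY`, or `REMOVE`) and a 
`dynamodb` object holding `Keys` plus, depending on the stream view type, 
`NewImage` and/or `OldImage`, all encoded in DynamoDB's typed attribute-value 
JSON. The Python sketch below shows one way such a record could be flattened 
into a plain row. The sample record is made up, and `from_attribute_value`, 
`flatten_record`, and the `_op` column are hypothetical names, not part of 
Hudi or any AWS SDK. 

```python
from decimal import Decimal
from typing import Any, Dict


def from_attribute_value(av: Dict[str, Any]) -> Any:
    """Convert one typed attribute value, e.g. {"N": "42"}, to a Python value."""
    type_tag, value = next(iter(av.items()))
    if type_tag == "S":
        return value
    if type_tag == "N":
        return Decimal(value)  # numbers arrive as strings
    if type_tag == "BOOL":
        return value
    if type_tag == "NULL":
        return None
    if type_tag == "M":
        return {k: from_attribute_value(v) for k, v in value.items()}
    if type_tag == "L":
        return [from_attribute_value(v) for v in value]
    if type_tag == "SS":
        return set(value)
    if type_tag == "NS":
        return {Decimal(v) for v in value}
    # Binary types (B, BS) are left out of this sketch.
    raise ValueError(f"unsupported attribute type: {type_tag}")


def flatten_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Turn one stream record into a flat row plus an operation marker."""
    change = record["dynamodb"]
    # REMOVE events have no NewImage, so fall back to OldImage, then Keys.
    image = change.get("NewImage") or change.get("OldImage") or change["Keys"]
    row = {name: from_attribute_value(av) for name, av in image.items()}
    row["_op"] = record["eventName"]  # hypothetical column name
    return row


# Made-up sample record, for illustration only.
SAMPLE_RECORD = {
    "eventName": "MODIFY",
    "dynamodb": {
        "Keys": {"id": {"S": "u1"}},
        "NewImage": {
            "id": {"S": "u1"},
            "score": {"N": "42"},
            "tags": {"L": [{"S": "a"}]},
        },
        "SequenceNumber": "111",
        "StreamViewType": "NEW_AND_OLD_IMAGES",
    },
}
```

   Whether a Hudi source should emit the raw typed JSON or a flattened row 
like this is itself part of the open question. 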
   
   We should probably flesh these out before going ahead with the implementation. 
   
    
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-310
   - Type: New Feature
   - Epic: https://issues.apache.org/jira/browse/HUDI-1385
   
   
   ---
   
   
   ## Comments
   
   13/Nov/19 05:11;vinoth;Hi [~vinaypatil18] any updates for us? ;;;
   
   ---
   
   26/Jan/21 20:23;shivnarayan;[~vinoth]: Is this still relevant? Do we keep it 
open? ;;;
   
   ---
   
   04/Nov/21 01:33;maddy2u;Is this still happening? ;;;
   
   ---
   
   06/Nov/21 05:09;vinoth;This has changed hands quite a bit.  We still want to 
take this on.  Interested?;;;
   
   ---
   
   09/Jan/22 08:57;vinaypatil18;[~vinoth]  I remember discussing this; 
sorry, it went to the backlog. I am taking this up. ;;;
   
   ---
   
   11/Mar/22 15:00;maddy2u;Hi [~vinaypatil18]  - Any update on the above? Which 
release are we targeting for this feature?;;;
