[ 
https://issues.apache.org/jira/browse/FLINK-4582?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16665593#comment-16665593
 ] 

Devin Thomson commented on FLINK-4582:
--------------------------------------

[~yxu-apache] I appreciate the quick response!

"The use of _DynamodbProxy.getShardList()_ is interesting." Haha very kind of 
you to not point out the potential performance issue of always fetching all the 
shard ids. I didn't want to duplicate the code of DynamoDBProxy, but yes this 
approach suffers from a complete traversal of the shard ids every time. I have 
also observed that the shard ids are always returned in the same (unsorted) 
ordering, so your approach sounds good to me.

Due to the fact that we run our Flink clusters in AWS EMR, which does not 
support high-availability master nodes, we will not be using multi-stream 
consumers so I did not implement that support here and saw DynamoDBProxy as a 
natural solution. The alternative was to use one instance of DynamoDBProxy per 
stream. I didn't look into the memory implications but I assume it's worse than 
what you have built :)

We are still in development over here using the solution I built, but I would 
be glad to cutover to your solution once it's available! One question would be 
- is it compatible with Flink 1.5.2? As I mentioned, we run in AWS EMR which 
only supports 1.5.2 in the latest release.

 

Thank you!!

Devin

> Allow FlinkKinesisConsumer to adapt for AWS DynamoDB Streams
> ------------------------------------------------------------
>
>                 Key: FLINK-4582
>                 URL: https://issues.apache.org/jira/browse/FLINK-4582
>             Project: Flink
>          Issue Type: New Feature
>          Components: Kinesis Connector, Streaming Connectors
>            Reporter: Tzu-Li (Gordon) Tai
>            Assignee: Ying Xu
>            Priority: Major
>
> AWS DynamoDB is a NoSQL database service that has a CDC-like (change data 
> capture) feature called DynamoDB Streams 
> (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html),
>  which is a stream feed of item-level table activities.
> The DynamoDB Streams shard abstraction follows that of Kinesis Streams with 
> only a slight difference in resharding behaviours, so it is possible to build 
> on the internals of our Flink Kinesis Consumer for an exactly-once DynamoDB 
> Streams source.
> I propose an API something like this:
> {code}
> DataStream dynamoItemsCdc = 
>   FlinkKinesisConsumer.asDynamoDBStream(tableNames, schema, config)
> {code}
> The feature adds more connectivity to popular AWS services for Flink, and 
> combining what Flink has for exactly-once semantics, out-of-core state 
> backends, and queryable state with CDC can have very strong use cases. For 
> this feature there should only be an extra dependency to the AWS Java SDK for 
> DynamoDB, which has Apache License 2.0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to