Hi,

I am developing a Beam job that sinks mutable data to DynamoDB, and I
found that DynamoDBIO throws an error when multiple write requests to the
same key are made within a short time.

DynamoDBIO.Write sinks items with the AWS SDK's batchWriteItem method,
and a single batchWriteItem call cannot contain more than one request for
the same key.

Currently DynamoDBIO.Write performs no key deduplication before flushing
a batch, which causes "ValidationException: Provided list of item keys
contains duplicates" whenever consecutive updates to a single key fall
within the same batch (the batch size is currently hardcoded to 25).
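For illustration, here is a minimal sketch of the failure at the SDK
level, independent of Beam (table name, key name, and values are made
up); this is the same call DynamoDBIO.Write makes when it flushes a
batch:

import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB;
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.BatchWriteItemRequest;
import com.amazonaws.services.dynamodbv2.model.PutRequest;
import com.amazonaws.services.dynamodbv2.model.WriteRequest;

public class DuplicateKeyRepro {
  private static Map<String, AttributeValue> item(String id, String value) {
    Map<String, AttributeValue> item = new HashMap<>();
    item.put("id", new AttributeValue(id));
    item.put("value", new AttributeValue(value));
    return item;
  }

  public static void main(String[] args) {
    AmazonDynamoDB client = AmazonDynamoDBClientBuilder.defaultClient();
    // Two consecutive updates to the same key "k1" end up in one batch.
    List<WriteRequest> batch = Arrays.asList(
        new WriteRequest(new PutRequest(item("k1", "v1"))),
        new WriteRequest(new PutRequest(item("k1", "v2"))));
    Map<String, List<WriteRequest>> requestItems = new HashMap<>();
    requestItems.put("my-table", batch);
    // Throws "ValidationException: Provided list of item keys contains
    // duplicates".
    client.batchWriteItem(new BatchWriteItemRequest().withRequestItems(requestItems));
  }
}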

I have created an issue on JIRA at
https://issues.apache.org/jira/browse/BEAM-10706

The AWS support team confirmed to me that the Java SDK for DynamoDB does
not currently handle deduplication. Taking the Python SDK boto3, which
does support it, as a reference, I modified DynamoDBIO, which solved the
problem for my application. I applied the change against 2.23.0, where I
also updated the test and ran it successfully.
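In spirit, the change looks like the sketch below (a hypothetical helper,
not the exact patch; the method and parameter names are made up): the
last write request per key wins and the order of the surviving requests
is preserved, which mirrors what boto3's batch_writer does when
overwrite_by_pkeys is set.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import com.amazonaws.services.dynamodbv2.model.AttributeValue;
import com.amazonaws.services.dynamodbv2.model.WriteRequest;

public class WriteRequestDeduplicator {
  // Hypothetical helper: keep only the last WriteRequest per key so that
  // a flushed batch never carries duplicate keys.
  public static List<WriteRequest> deduplicateByKeys(
      List<WriteRequest> batch, List<String> keyNames) {
    Map<Map<String, AttributeValue>, WriteRequest> lastPerKey =
        new LinkedHashMap<>();
    for (WriteRequest request : batch) {
      // A WriteRequest wraps either a PutRequest (the full item) or a
      // DeleteRequest (the key only).
      Map<String, AttributeValue> attributes =
          request.getPutRequest() != null
              ? request.getPutRequest().getItem()
              : request.getDeleteRequest().getKey();
      Map<String, AttributeValue> key = new HashMap<>();
      for (String keyName : keyNames) {
        key.put(keyName, attributes.get(keyName));
      }
      lastPerKey.remove(key);       // forget any earlier request for this key...
      lastPerKey.put(key, request); // ...and append the newer one instead
    }
    return new ArrayList<>(lastPerKey.values());
  }
}

A helper like this would run just before a batch is flushed, so only
requests that would land in the same batchWriteItem call are affected.
Note that, as with boto3's overwrite_by_pkeys, the key attribute names
have to be supplied by the user, since the connector cannot know the
table's key schema without an extra DescribeTable call.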

Shall I apply the change to master and then create a PR? Note that I have
only changed the v1 AWS module, not the v2 one, and I haven't submitted a
PR before, so I may need some guidance (I have read the contribution
guide, though).

Thanks!
