[
https://issues.apache.org/jira/browse/BEAM-10706?focusedWorklogId=521997&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-521997
]
ASF GitHub Bot logged work on BEAM-10706:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 09/Dec/20 01:47
Start Date: 09/Dec/20 01:47
Worklog Time Spent: 10m
Work Description: dennisylyung edited a comment on pull request #12583:
URL: https://github.com/apache/beam/pull/12583#issuecomment-741419939
In the current implementation, `private List<KV<String, WriteRequest>>
batch`, the key is the table name, not the primary key.
For example, in a table `user` whose primary key is `id`, an element in Beam
would look like this:
`KV("user", WriteRequest(id=1, name=Chris, age=30))`
We have no way to know that `id` is the key to deduplicate on unless the
user specifies it.
In theory, operating on DynamoDB should not require setting keys for
de-duplication, since repeated writes to the same key simply update the
value. However, the DynamoDB batch put API (batchWriteItem) rejects batches
that contain duplicate keys, so users need to set the overwrite keys
explicitly.
You are right that the overwrite keys are necessary to completely avoid
`ValidationError`. As long as the sink operates with upsert logic (i.e. the
data may contain duplicate keys), there is a risk of the same key landing in
a single batch. This is also the problem I face when developing pipelines
with DynamoDB sinks.
There is one special case, though: if users are certain that their keys
will never contain duplicates, such as when their pipelines are logically
append-only, they will never encounter `ValidationError`. In that case,
requiring them to specify the keys could be unnecessary.
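Since repeated writes to the same key are upserts, a batch can be made valid
by keeping only the last write for each key before flushing. A minimal
sketch of that dedup step in plain Java (the `WriteRequest` class,
`dedupe` method, and `keyFn` extractor here are simplified stand-ins for
illustration, not the actual Beam or AWS SDK code):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Simplified stand-in for the AWS SDK's WriteRequest: just item attributes.
class WriteRequest {
    final Map<String, String> item;
    WriteRequest(Map<String, String> item) { this.item = item; }
}

class DedupSketch {
    // Keep only the last WriteRequest per primary key. A LinkedHashMap
    // overwrites the value in place, so surviving entries keep the
    // position of the key's first appearance in the batch.
    static List<WriteRequest> dedupe(List<WriteRequest> batch,
                                     Function<WriteRequest, String> keyFn) {
        Map<String, WriteRequest> byKey = new LinkedHashMap<>();
        for (WriteRequest wr : batch) {
            byKey.put(keyFn.apply(wr), wr); // later write wins
        }
        return new ArrayList<>(byKey.values());
    }

    public static void main(String[] args) {
        List<WriteRequest> batch = new ArrayList<>();
        batch.add(new WriteRequest(Map.of("id", "1", "name", "Chris", "age", "30")));
        batch.add(new WriteRequest(Map.of("id", "2", "name", "Dana", "age", "25")));
        batch.add(new WriteRequest(Map.of("id", "1", "name", "Chris", "age", "31")));

        // The user-supplied key extractor tells us "id" is the primary key.
        List<WriteRequest> deduped = DedupSketch.dedupe(batch, wr -> wr.item.get("id"));
        System.out.println(deduped.size());                 // 2
        System.out.println(deduped.get(0).item.get("age")); // 31: last write for id=1 wins
    }
}
```

Writes that lose the dedup would need to be carried over into the next
batch (or dropped, under upsert semantics) rather than silently ignored;
the sketch above only shows the key-extraction step that requires user
input.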
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
Issue Time Tracking
-------------------
Worklog Id: (was: 521997)
Time Spent: 4h 10m (was: 4h)
> DynamoDBIO fail to write to the same key in short consecution
> -------------------------------------------------------------
>
> Key: BEAM-10706
> URL: https://issues.apache.org/jira/browse/BEAM-10706
> Project: Beam
> Issue Type: Bug
> Components: io-java-aws
> Affects Versions: 2.23.0
> Reporter: Dennis Yung
> Assignee: Dennis Yung
> Priority: P2
> Fix For: 2.27.0
>
> Time Spent: 4h 10m
> Remaining Estimate: 0h
>
> Internally, DynamoDBIO.Write uses the batchWriteItem method from the AWS SDK
> to sink items. However, the AWS SDK has a limitation that a single call to
> batchWriteItem cannot contain duplicate keys.
> Currently, DynamoDBIO.Write performs no key deduplication before flushing a
> batch, which can cause "ValidationException: Provided list of item keys
> contains duplicates" if consecutive updates to a single key fall within the
> batch size (currently hardcoded to 25).
> To fix this bug, the batch of write requests needs to be deduplicated before
> being sent to batchRequest.addRequestItemsEntry.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)