[jira] [Commented] (CAMEL-16594) DynamoDB stream updates are missed when there are more than one active shards

Pierre-Yves Bigourdan (Jira) Thu, 23 Sep 2021 01:50:04 -0700


    [ 
https://issues.apache.org/jira/browse/CAMEL-16594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17419045#comment-17419045
 ]


Pierre-Yves Bigourdan commented on CAMEL-16594:
-----------------------------------------------

Some personal context: this is my last day of work at my current job, and I'll 
probably not be able to carry this work over to my next employer. However, 
I've put in some extra effort in the past few days and the good news is that 
I'm able to provide a working PR. At the very least, I'm hoping it will be a 
good basis to fix this issue.
----
I've managed to do some quick testing with a real DynamoDB stream, and 
haven't been able to fault the new implementation, apart from one caveat which 
I'll expand on in the next section.

As far as testing is concerned, to end up with a complex tree structure as 
described in the issue, one can put DynamoDB under a huge data load. However, 
that's inconvenient for manual testing purposes. Fortunately, there's a trick 
one can use:
 # create a brand new DynamoDB table with provisioned capacity.
 # enable streams.
 # add one or two items in it.
 # at that point, you should have a "tree" with exactly one shard.
 # switch the table to On-demand provisioning.
 # wait a few minutes.
 # you'll notice that two new shards have been added as children of the first 
one. Your tree structure is starting to grow!
 # wait a few more minutes (we're talking ~15 minutes in total).
 # you'll notice that you've ended up with 7 shards, in a tree-like structure 
as depicted in the issue's description.
 # adding new items will place them in any of the four leaf shards.
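
To check which shards are active at each step, the following is a minimal sketch of how the leaf shards can be picked out of a shard list. {{ShardInfo}}, {{ShardTreeSketch}} and the sequence numbers are hypothetical stand-ins for illustration, not the AWS SDK types or the code in the PR; the only assumption carried over from the issue is that active shards have no ending sequence number and no children.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a stream shard; not the AWS SDK class.
final class ShardInfo {
    final String id;
    final String parentId;              // null for the root shard
    final String endingSequenceNumber;  // null while the shard is still open

    ShardInfo(String id, String parentId, String endingSequenceNumber) {
        this.id = id;
        this.parentId = parentId;
        this.endingSequenceNumber = endingSequenceNumber;
    }
}

final class ShardTreeSketch {
    // Returns the ids of shards that are still open and have no children,
    // i.e. the leaves where new stream records can appear.
    static List<String> activeLeafIds(List<ShardInfo> shards) {
        Set<String> parentIds = new HashSet<>();
        for (ShardInfo shard : shards) {
            if (shard.parentId != null) {
                parentIds.add(shard.parentId);
            }
        }
        List<String> leaves = new ArrayList<>();
        for (ShardInfo shard : shards) {
            if (shard.endingSequenceNumber == null && !parentIds.contains(shard.id)) {
                leaves.add(shard.id);
            }
        }
        return leaves;
    }

    // The seven-shard tree from the issue; sequence numbers are made-up placeholders.
    static List<ShardInfo> sevenShardTree() {
        return Arrays.asList(
            new ShardInfo("Shard0", null, "100"),
            new ShardInfo("Shard1", "Shard0", "200"),
            new ShardInfo("Shard2", "Shard0", "200"),
            new ShardInfo("Shard3", "Shard1", null),
            new ShardInfo("Shard4", "Shard1", null),
            new ShardInfo("Shard5", "Shard2", null),
            new ShardInfo("Shard6", "Shard2", null));
    }
}
```

Calling {{ShardTreeSketch.activeLeafIds(ShardTreeSketch.sevenShardTree())}} yields the four leaf shards, which are the ones a consumer would need an iterator for.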

----
I've somewhat followed the caching approach of the previous implementation. 
Indeed, it would probably be quite bad for performance to issue a new 
{{ListStreamsRequest}} and parse the shard tree on every single Camel poll.

However, this leads to a potential race-condition edge case when the tree is 
rebalanced. In step 7 above, I've not observed what actually happens under 
the hood, but I believe the initial shard is assigned an end sequence number, 
and one child shard is added. A second child shard is added a little later to 
the inactive parent shard. My revised implementation will see that the initial 
shard is no longer active (the returned shard iterator will be null) and will 
reload the entire shard tree as it no longer has any active shards cached. It 
will then see the first child, but potentially not the second one, depending on 
whether Camel polls in that transient one-child state or not. You may end up in 
a situation where you'll be missing updates from the second child until the 
first child itself becomes inactive.

In addition to the performance hit, assuming that new branches can arbitrarily 
grow from inactive parent nodes would require more complex logic. I believe 
that tree rebalancing is a rare event in DynamoDB, so in my opinion improved 
performance and simpler code are worth the trade-off.
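
To make the edge case concrete, here is a simplified simulation of the caching behaviour described above. It is a sketch under assumptions, not the code from the PR: {{ShardCacheSketch}} and its {{poll}} method are hypothetical, a closed shard stands in for a null shard iterator, and the list of current leaves stands in for a freshly parsed shard tree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical model of the cache-and-reload strategy; not the Camel implementation.
final class ShardCacheSketch {
    private final List<String> cachedActive = new ArrayList<>();

    // closedShards: shards whose iterator came back null on this poll.
    // currentLeaves: the active leaves a full reload would find right now.
    List<String> poll(Set<String> closedShards, List<String> currentLeaves) {
        cachedActive.removeAll(closedShards);
        if (cachedActive.isEmpty()) {
            // Only reload the tree when no cached shard is active any more.
            cachedActive.addAll(currentLeaves);
        }
        return new ArrayList<>(cachedActive);
    }
}
```

Polling first with only {{Shard0}} active, then with {{Shard0}} closed and only {{Shard1}} visible, then with both {{Shard1}} and {{Shard2}} visible leaves the cache at {{Shard1}} on the third poll: {{Shard2}} is not picked up until {{Shard1}} closes and triggers another reload.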

> DynamoDB stream updates are missed when there are more than one active shards
> -----------------------------------------------------------------------------
>
>                 Key: CAMEL-16594
>                 URL: https://issues.apache.org/jira/browse/CAMEL-16594
>             Project: Camel
>          Issue Type: Bug
>          Components: camel-aws
>            Reporter: Pierre-Yves Bigourdan
>            Assignee: Andrea Cosentino
>            Priority: Major
>         Attachments: shards.json
>
>
> The current Camel ddbstream implementation seems to incorrectly apply the 
> concept of {{ShardIteratorType}} to the list of shards forming a DynamoDB 
> stream rather than each shard individually.
> According to the [AWS 
> documentation|https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_streams_GetShardIterator.html#DDB-streams_GetShardIterator-request-ShardIteratorType]:
> {noformat}
> ShardIteratorType determines how the shard iterator is used to start reading 
> stream records from the shard.
> {noformat}
> For example, for a given shard, when {{ShardIteratorType}} is equal to 
> {{LATEST}}, the AWS SDK will read the most recent data in that particular 
> shard. However, when {{ShardIteratorType}} is equal to {{LATEST}}, Camel will 
> additionally use {{ShardIteratorType}} to determine which shard it considers 
> amongst all the available ones in the stream: 
> https://github.com/apache/camel/blob/6119fdc379db343030bd25b191ab88bbec34d6b6/components/camel-aws/camel-aws2-ddb/src/main/java/org/apache/camel/component/aws2/ddbstream/ShardIteratorHandler.java#L132
> If my understanding is correct, shards in DynamoDB are modelled as a tree, 
> with the child leaf nodes being the shards that are still active, i.e. the 
> ones where new stream data will appear. These child shards will have a 
> {{StartingSequenceNumber}}, but no {{EndingSequenceNumber}}.
> The most common case is to have a single shard, or a single branch of parent 
> and child nodes:
> {noformat}
> Shard0
>    |
> Shard1
> {noformat}
> In the above case, new data will be added to {{Shard1}}, and the Camel 
> implementation, which looks only at the last shard when {{ShardIteratorType}} 
> is equal to {{LATEST}}, will be correct.
> However, the tree can also look like this (see related example in the 
> attached JSON output from the AWS CLI, where the shard number matches the 
> index in the JSON list):
> {noformat}
>              Shard0
>             /      \
>      Shard1          Shard2
>     /      \        /      \ 
> Shard3   Shard4  Shard5   Shard6
> {noformat}
> In this case, Camel will only consider Shard6, even though new data may be 
> added to any of Shard3, Shard4, Shard5 or Shard6. This leads to updates being 
> missed.
> As far as I can tell, DynamoDB will split into multiple shards depending on 
> the number of table partitions, which will grow either for a table with huge 
> amounts of data, or when an existing table with provisioned capacity is 
> migrated to on-demand provisioning.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
