[ https://issues.apache.org/jira/browse/BEAM-8382?focusedWorklogId=331576&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-331576 ]
ASF GitHub Bot logged work on BEAM-8382: ---------------------------------------- Author: ASF GitHub Bot Created on: 21/Oct/19 18:40 Start Date: 21/Oct/19 18:40 Worklog Time Spent: 10m Work Description: cmachgodaddy commented on issue #9765: [BEAM-8382] Add polling interval to KinesisIO.Read URL: https://github.com/apache/beam/pull/9765#issuecomment-544639586 @aromanenko-dev , I think we all understand it right, from what you described above we just say it in different ways ;-) . In simple words, each split, run on each worker, will create a number of threads base off num of shards. That's means if we have 10 splits, we will have 10 guys/reader read one shard (of course, it will read all shards)? Hope you agree with me this point? Now, what Amazon recommend is we should have one client read one shard. Here is what they say: > Typically, when you use the KCL, you should ensure that the number of instances does not exceed the number of shards (except for failure standby purposes). Each shard is processed by exactly one KCL worker and has exactly one corresponding record processor, so you never need multiple instances to process one shard. However, one worker can process any number of shards, so it's fine if the number of shards exceeds the number of instances. And, here is why they do that, https://docs.aws.amazon.com/streams/latest/dev/kinesis-low-latency.html But, don't argue that here they use KCL, so pls consider one KCL is one KinesisClient that our split use to connect to Kinesis and read a shard. And don't misunderstand the `one worker can read a number of shard`. Here in their context, or the example in their doc, their worker is an application that run on EC2 instance, and when that scale this application, it will loadbalance the KCL, e.g. two EC2 will have two KCL, each read two shards (assume a stream has 4 shards). The point is we don't want to have number readers greater number of shards in one application (or one pipeline in our context)? Imagine, if we have 10 pipelines deployed in our runners, and which has parallelism of 10, then we will have 10 x 10 readers reading one shard ? And this 10 x 10 readers are not loadbalanced since we are not using KCL? (even with KCL, this number still exceed num of shard?). I am sure we will get "throughput exception" in the log :-) That's why in my subscribing POC (for the enhanced-fan-out), I design it in a way that each split read only one shard (1:1). ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org Issue Time Tracking ------------------- Worklog Id: (was: 331576) Time Spent: 4h (was: 3h 50m) > Add polling interval to KinesisIO.Read > -------------------------------------- > > Key: BEAM-8382 > URL: https://issues.apache.org/jira/browse/BEAM-8382 > Project: Beam > Issue Type: Improvement > Components: io-java-kinesis > Affects Versions: 2.13.0, 2.14.0, 2.15.0 > Reporter: Jonothan Farr > Assignee: Jonothan Farr > Priority: Major > Time Spent: 4h > Remaining Estimate: 0h > > With the current implementation we are observing Kinesis throttling due to > ReadProvisionedThroughputExceeded on the order of hundreds of times per > second, regardless of the actual Kinesis throughput. This is because the > ShardReadersPool readLoop() method is polling getRecords() as fast as > possible. > From the KDS documentation: > {quote}Each shard can support up to five read transactions per second. > {quote} > and > {quote}For best results, sleep for at least 1 second (1,000 milliseconds) > between calls to getRecords to avoid exceeding the limit on getRecords > frequency. > {quote} > [https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html] > [https://docs.aws.amazon.com/streams/latest/dev/developing-consumers-with-sdk.html] -- This message was sent by Atlassian Jira (v8.3.4#803005)