In addition to lack of watermark support, the Kinesis consumer suffers from a discovery related issue that also needs to be resolved. Shard discovery runs periodically in all subtasks. That's not just inefficient but becomes a real problem when there is a large number of subtasks due to rate limiting ( https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html). The discovery interval should be minimized to cap latency (new shards not consumed until discovered).
How about moving discovery out of the fetcher into a separate singleton source and then broadcast the result to the parallel fetchers, following the pattern applied to file input? https://github.com/apache/flink/blob/5f523e6ab31afeab5b1d9bbf62c6d4ef726ffe1b/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.java#L1336 This would also ensure that all subtasks consistently see the same shard list. Thoughts? Thanks, Thomas On Mon, Feb 5, 2018 at 5:31 PM, Thomas Weise <t...@apache.org> wrote: > Hi, > > The Kinesis consumer currently does not emit watermarks, and this can lead > to problems when a single subtask reads from multiple shards and offsets > are not closely aligned with respect to the event time. > > The Kafka consumer has support for periodic and punctuated watermarks, > although there is also the unresolved issue https://issues.apache.org/ > jira/browse/FLINK-5479 that would equally apply for Kinesis. > > I propose adding support for timestamp assigner and watermark generator to > the Kinesis consumer. > > As for handling of idle shards, is there a preference? Perhaps a > customization point on the assigner that defers the decision to the user > would be appropriate? > > Thanks, > Thomas > >