In addition to the lack of watermark support, the Kinesis consumer has a
discovery-related issue that also needs to be resolved. Shard discovery
runs periodically in every subtask. That is not just inefficient; with a
large number of subtasks it becomes a real problem due to API rate
limiting
(https://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html).
At the same time, the discovery interval should be kept short to cap
latency, since new shards are not consumed until they are discovered.

How about moving discovery out of the fetcher into a separate singleton
source that broadcasts the result to the parallel fetchers, following
the pattern applied to file input?

https://github.com/apache/flink/blob/5f523e6ab31afeab5b1d9bbf62c6d4ef726ffe1b/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/environment/StreamExecutionEnvironment.java#L1336

This would also ensure that all subtasks consistently see the same shard
list.
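
To make the idea concrete, here is a rough, Flink-independent sketch in
plain Java. It is a toy model only: the class and method names are made up
for illustration, the shard IDs are hard-coded placeholders, and the
hash-modulo assignment is just one possible way for a fetcher to pick its
shards from the broadcast list. A real implementation would list shards
through the Kinesis API and ship the result through a broadcast stream.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of "one discoverer, many fetchers". Hypothetical names throughout;
// none of this is existing Flink or AWS API.
final class ShardBroadcastSketch {

    // Stand-in for the singleton discovery source. Only this one task would
    // call the (rate-limited) shard listing API, once per discovery interval.
    static List<String> discoverShards() {
        return List.of("shardId-000", "shardId-001", "shardId-002",
                       "shardId-003", "shardId-004");
    }

    // Every parallel fetcher receives the same broadcast list and keeps only
    // the shards that hash to its own subtask index.
    static List<String> shardsForSubtask(List<String> broadcastShards,
                                         int subtaskIndex, int parallelism) {
        List<String> owned = new ArrayList<>();
        for (String shardId : broadcastShards) {
            // floorMod keeps the result non-negative for negative hash codes.
            if (Math.floorMod(shardId.hashCode(), parallelism) == subtaskIndex) {
                owned.add(shardId);
            }
        }
        return owned;
    }

    public static void main(String[] args) {
        List<String> shards = discoverShards(); // single discovery call
        int parallelism = 3;
        for (int subtask = 0; subtask < parallelism; subtask++) {
            System.out.println("subtask " + subtask + " -> "
                    + shardsForSubtask(shards, subtask, parallelism));
        }
    }
}
```

Because every fetcher works from the identical list, each shard ends up
with exactly one owner, and the number of discovery calls no longer grows
with the parallelism.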

Thoughts?

Thanks,
Thomas


On Mon, Feb 5, 2018 at 5:31 PM, Thomas Weise <t...@apache.org> wrote:

> Hi,
>
> The Kinesis consumer currently does not emit watermarks, and this can lead
> to problems when a single subtask reads from multiple shards and offsets
> are not closely aligned with respect to the event time.
>
> The Kafka consumer has support for periodic and punctuated watermarks,
> although there is also the unresolved issue
> https://issues.apache.org/jira/browse/FLINK-5479 that would equally apply
> for Kinesis.
>
> I propose adding support for timestamp assigner and watermark generator to
> the Kinesis consumer.
>
> As for handling of idle shards, is there a preference? Perhaps a
> customization point on the assigner that defers the decision to the user
> would be appropriate?
>
> Thanks,
> Thomas
>
>
