Hi, I think watermark / event-time skew is a problem that many users are struggling with. A built-in primitive to align event-time would be a great feature!
However, there are also some cases where it would be useful for different streams to have diverging event-time, such as an interval join [1] (DataStream API) or time-windowed join (SQL) that joins one stream with events from another stream that happened 2 hours to 1 hour ago. Granted, this is a very specific case and not the norm, but it might make sense to keep it in the back of our heads when designing this feature.

Best, Fabian

[1] https://ci.apache.org/projects/flink/flink-docs-release-1.6/dev/stream/operators/joining.html#interval-join

On Tue, Oct 9, 2018 at 10:25 AM Aljoscha Krettek <aljos...@apache.org> wrote:

> Yes, I think this is the way to go.
>
> This would also go well with a redesign of the source interface that has
> been floated for a while now. I also created a prototype a while back:
> https://github.com/aljoscha/flink/tree/refactor-source-interface. Just
> as a refresher, the redesign aims at several things:
>
> 1. Make partitions/splits explicit in the interface. Currently, the fact
> that there are file splits or Kafka partitions or Kinesis shards is hidden
> in the source implementation, while it would be beneficial for the system
> to know of these and to be able to track watermarks for them. Currently,
> there is a custom implementation of per-partition watermark tracking in
> the Kafka Consumer that this redesign would obviate.
>
> 2. Split split/partition/shard discovery from the reading part. This would
> allow rebalancing work, and again makes the nature of sources more
> explicit in the interfaces.
>
> 3. Move from the push model to a pull model. The problem with the current
> source interface is that the source controls the read loop and has to
> acquire the checkpoint lock for emitting elements/updating state. If we
> get the loop out of the source, this leaves more potential for Flink to be
> clever about reading from sources.
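(Editorial aside: the diverging-event-time case Fabian mentions above, joining one stream with events from another that happened 2 hours to 1 hour ago, boils down to pairing each left-side timestamp l with right-side timestamps r in [l + lowerOffset, l + upperOffset]. A minimal Flink-free sketch of that semantics in plain Java follows; all names here are illustrative, not Flink API.)

```java
import java.util.ArrayList;
import java.util.List;

public class IntervalJoinSketch {
    /**
     * Pairs each left-side timestamp l with every right-side timestamp r
     * satisfying l + lowerOffset <= r <= l + upperOffset. For "events that
     * happened 2 hours to 1 hour ago", lowerOffset = -2h and upperOffset = -1h
     * (in milliseconds), so the right side intentionally lags the left side
     * in event time.
     */
    static List<long[]> intervalJoin(
            List<Long> leftTs, List<Long> rightTs, long lowerOffset, long upperOffset) {
        List<long[]> joined = new ArrayList<>();
        for (long l : leftTs) {
            for (long r : rightTs) {
                if (r >= l + lowerOffset && r <= l + upperOffset) {
                    joined.add(new long[] {l, r});
                }
            }
        }
        return joined;
    }
}
```

Note how the right side must be read up to 2 hours behind the left side: forcing both streams onto one aligned event-time clock would stall exactly this kind of join.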
> The prototype posted above defines three new interfaces: Source,
> SplitEnumerator, and SplitReader, along with a naive example and a working
> Kafka Consumer (with checkpointing, actually).
>
> If we had this source interface, along with a service for propagating
> watermark information, the code that reads from the splits could
> de-prioritise certain splits, and we would get the event-time alignment
> behaviour for all sources that are implemented using the new interface,
> without requiring special code in each source implementation.
>
> @Elias Do you know if Kafka Consumers do this alignment across multiple
> consumers, or only within one Consumer across the partitions that it reads
> from?
>
> > On 9. Oct 2018, at 00:55, Elias Levy <fearsome.lucid...@gmail.com> wrote:
> >
> > Kafka Streams handles this problem, time alignment, by processing records
> > from the partitions with the lowest timestamp on a best-effort basis.
> > See KIP-353 for the details. The same could be done within the Kafka
> > source and multiple-input stream operators. I opened FLINK-4558
> > <https://issues.apache.org/jira/browse/FLINK-4558> a while ago regarding
> > this topic.
> >
> > On Mon, Oct 8, 2018 at 3:41 PM Jamie Grier <jgr...@lyft.com.invalid> wrote:
> >
> >> I'd be very curious to hear others' thoughts on this. I would expect many
> >> people to have run into similar issues. I also wonder if anybody has
> >> already been working on similar issues. It seems there is room for some
> >> core Flink changes to address this as well, and I'm guessing people have
> >> already thought about it.
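(Editorial aside: to make the KIP-353-style best-effort alignment Elias describes concrete, here is a minimal sketch in plain Java, with no Kafka or Flink APIs and all names illustrative: at each step, consume from the non-empty partition whose head record carries the lowest timestamp. Aljoscha's "de-prioritise certain splits" idea is the same policy applied at the split level.)

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class LowestTimestampFirst {
    /**
     * Returns the index of the non-empty partition whose head record has the
     * lowest timestamp, or -1 if every partition is empty. Draining records
     * in this order keeps the partitions roughly aligned in event time,
     * best-effort: already-fetched records are reordered, but nothing blocks.
     */
    static int pickPartition(List<Deque<Long>> partitions) {
        int best = -1;
        long bestTs = Long.MAX_VALUE;
        for (int i = 0; i < partitions.size(); i++) {
            Deque<Long> p = partitions.get(i);
            if (!p.isEmpty() && p.peekFirst() < bestTs) {
                bestTs = p.peekFirst();
                best = i;
            }
        }
        return best;
    }
}
```

A source-level variant would do the same selection across splits/shards instead of Kafka partitions, which is why surfacing splits in the source interface (point 1 of the redesign) would let Flink apply this policy generically.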