Thanks, it's true that looser watermark can guarantee more data not be dropped, but at the same time more state need to be kept. I just consider if there is sth like kafka-partition-aware watermark in flink in SS may be a better solution.
Tathagata Das <[email protected]>于2017年8月31日周四 上午9:13写道: > Why not set the watermark to be looser, one that works across all > partitions? The main usage of watermark is to drop state. If you loosen the > watermark threshold (e.g. from 1 hour to 10 hours), then you will keep more > state with older data, but you are guaranteed that you will not drop > important data. > > On Wed, Aug 30, 2017 at 7:41 AM, KevinZwx <[email protected]> wrote: > >> Hi, >> >> I'm working with Structured Streaming to process logs from kafka and use >> watermark to handle late events. Currently the watermark is computed by >> (max >> event time seen by the engine - late threshold), and the same watermark is >> used for all partitions. >> >> But in production environment it happens frequently that different >> partition >> is consumed at different speed, the consumption of some partitions may be >> left behind, so the newest event time in these partitions may be much >> smaller than than the others'. In this case using the same watermark for >> all >> partitions may cause heavy data loss. >> >> So is there any way to achieve different watermark for different kafka >> partition or any plan to work on this? >> >> >> >> -- >> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/ >> >> --------------------------------------------------------------------- >> To unsubscribe e-mail: [email protected] >> >> >
