Re: Different watermark for different kafka partitions in Structured Streaming

2017-09-04 Thread 张万新
@Rayn It's frequently observed in our production environment that different partition's consumption rate vary for kinds of reasons, including performance difference of machines holding the partitions, unevenly distribution of messages and so on. So I hope there can be some advice on how to design

Re: Different watermark for different kafka partitions in Structured Streaming

2017-09-01 Thread Ryan
I don't think ss now support "partitioned" watermark. and why different partition's consumption rate vary? If the handling logic is quite different, using different topic is a better way. On Fri, Sep 1, 2017 at 4:59 PM, 张万新 wrote: > Thanks, it's true that looser

Re: Different watermark for different kafka partitions in Structured Streaming

2017-09-01 Thread 张万新
Thanks, it's true that looser watermark can guarantee more data not be dropped, but at the same time more state need to be kept. I just consider if there is sth like kafka-partition-aware watermark in flink in SS may be a better solution. Tathagata Das 于2017年8月31日周四

Re: Different watermark for different kafka partitions in Structured Streaming

2017-08-30 Thread Tathagata Das
Why not set the watermark to be looser, one that works across all partitions? The main usage of watermark is to drop state. If you loosen the watermark threshold (e.g. from 1 hour to 10 hours), then you will keep more state with older data, but you are guaranteed that you will not drop important

Different watermark for different kafka partitions in Structured Streaming

2017-08-30 Thread KevinZwx
Hi, I'm working with Structured Streaming to process logs from kafka and use watermark to handle late events. Currently the watermark is computed by (max event time seen by the engine - late threshold), and the same watermark is used for all partitions. But in production environment it happens

Different watermark for different kafka partitions in Structured Streaming

2017-08-30 Thread 张万新
Hi, I'm working with Structured Streaming to process logs from kafka and use watermark to handle late events. Currently the watermark is computed by (max event time seen by the engine - late threshold), and the same watermark is used for all partitions. But in production environment it happens

Different watermark for different kafka partitions in Structured Streaming

2017-08-30 Thread 张万新
Hi, I'm working with Structured Streaming to process logs from kafka and use watermark to handle late events. Currently the watermark is computed by (max event time seen by the engine - late threshold), and the same watermark is used for all partitions. But in production environment it happens