Hi,

I'm working with Structured Streaming to process logs from kafka and use
watermark to handle late events. Currently the watermark is computed by (max
event time seen by the engine - late threshold), and the same watermark is
used for all partitions.

But in production environment it happens frequently that different partition
is consumed at different speed, the consumption of some partitions may be
left behind, so the newest event time in these partitions may be much
smaller than than the others'. In this case using the same watermark for all
partitions may cause heavy data loss. 

So is there any way to achieve different watermark for different kafka
partition or any plan to work on this?



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

Reply via email to