Hi all,

My use case is to read from a single Kafka topic using a batch Spark SQL job (ideally not Structured Streaming). Each time the batch job starts, it should pick up the last offset it stopped at, read from there until it catches up to the latest offset, store the result, and stop. Given that the DataFrame has partition and offset columns, my first thought for offset management is to groupBy partition, aggregate the max offset, and store that in HDFS. The next time the job runs, it would read those stored values and resume from them via startingOffsets.
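To make the bookkeeping concrete, here is a minimal sketch of the resume logic I have in mind. The helper turns stored per-partition max offsets into the JSON string that the Kafka source's startingOffsets option expects; it adds 1 to each stored offset on the assumption that startingOffsets positions are inclusive start points. The topic name, function name, and max_offsets dict are illustrative, not any Spark API:

```python
import json

def build_starting_offsets(topic, max_offsets):
    """Build a `startingOffsets` JSON string for Spark's Kafka source.

    `max_offsets` maps partition number -> last offset already read
    (e.g. the per-partition max stored in HDFS by the previous run).
    We resume at max + 1, assuming startingOffsets is an inclusive
    start position. Names here are illustrative, not a Spark API.
    """
    return json.dumps({topic: {str(p): o + 1 for p, o in max_offsets.items()}})

# The result would be passed to the batch read, roughly:
#   spark.read.format("kafka")
#        .option("subscribe", topic)
#        .option("startingOffsets", build_starting_offsets(topic, stored))
#        .option("endingOffsets", "latest")
#        .load()
```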
However, I'm not sure this will work. If the Kafka producer fails to write a record and later resends it, will I have skipped it, since I'm starting from the max offset seen so far? How does Spark Structured Streaming know to continue onwards: does it keep state of every offset it has seen? If so, how can I replicate that in a batch job without missing data? Any help would be appreciated.

--
Cheers,
Ruijing Li