Hi all,

My use case is to read from a single Kafka topic with a batch Spark SQL job
(not Structured Streaming, ideally). Every time the job starts, I want it to
pick up the last offset it stopped at, read from there until it catches up to
the latest offset, store the result, and stop. Since the dataframe has
partition and offset columns, my first thought for offset management is to
group by partition, aggregate the max offset, and store that in HDFS. The
next time the job runs, it reads this value back and starts from that max
offset via startingOffsets.
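
To make the idea concrete, here is a rough sketch of what I have in mind.
The topic name (my_topic), broker address, and HDFS offset path are just
placeholders, and I'm assuming the offsets file holds a single JSON string
in the startingOffsets format:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

object KafkaBatchRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-batch-offsets").getOrCreate()
    import spark.implicits._

    // Placeholder checkpoint location for the offsets JSON, e.g. {"my_topic":{"0":42,"1":17}}
    val offsetPath = "hdfs:///tmp/kafka-offsets/my_topic"

    // Load the last stored offsets; fall back to "earliest" on the first run.
    val startingOffsets =
      try spark.read.textFile(offsetPath).head()
      catch { case _: Exception => "earliest" }

    val df = spark.read
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092") // placeholder
      .option("subscribe", "my_topic")                  // placeholder
      .option("startingOffsets", startingOffsets)
      .option("endingOffsets", "latest")
      .load()

    // ... process and store df here ...

    // Record max offset per partition, plus 1 because startingOffsets is inclusive,
    // so the next run should begin one past the last record already consumed.
    val nextOffsets = df
      .groupBy($"partition")
      .agg(max($"offset").as("maxOffset"))
      .collect()
      .map(r => s""""${r.getAs[Int]("partition")}":${r.getAs[Long]("maxOffset") + 1}""")
      .mkString("""{"my_topic":{""", ",", "}}")

    Seq(nextOffsets).toDS().coalesce(1).write.mode("overwrite").text(offsetPath)
    spark.stop()
  }
}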

However, I'm wondering whether this will actually work. If the Kafka producer
fails to send a record and later resends it, I will have skipped it, since I'm
starting from the max offset seen so far. How does Spark Structured Streaming
know to continue onwards - does it keep state of all offsets seen? If so, how
can I replicate this for batch without missing data? Any help would be
appreciated.


-- 
Cheers,
Ruijing Li
