Kafka can keep track of the offsets you've consumed (in an internal topic,
__consumer_offsets, keyed by your consumer group), but those commits are best
effort, so you're probably better off also keeping track of your offsets
yourself.
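
Something along these lines is roughly what I mean by tracking offsets
yourself with the Spark Kafka batch source. It's only a sketch; the broker
address, topic name and HDFS path are placeholders, not anything from your
setup:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder().appName("kafka-batch-offsets").getOrCreate()

// Batch read; on later runs build startingOffsets from the offsets you saved.
val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "my-topic")
  .option("startingOffsets", "earliest")  // or the per-partition JSON saved last run
  .option("endingOffsets", "latest")
  .load()

// Record the highest offset consumed per partition and persist it yourself.
val maxOffsets = df.groupBy("partition").agg(max("offset").as("maxOffset"))
maxOffsets.write.mode("overwrite").json("hdfs:///checkpoints/my-topic-offsets")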

If the producer resends a message you will have to dedupe it, since you've
most likely already seen it; how you handle that depends on your data. I
believe the offset increments automatically, and you will generally not see
the same offset occur more than once within a Kafka topic partition (feel
free to correct me on this though). So the most likely scenario you need to
handle is the producer sending the same message twice under two different
offsets.
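
If your messages carry some unique id, a crude way to dedupe in Spark would
be something like the following. It assumes the value is JSON with an
"eventId" field, which is purely an assumption about your data:

import org.apache.spark.sql.functions.{col, get_json_object}

val deduped = df
  .withColumn("eventId", get_json_object(col("value").cast("string"), "$.eventId"))
  .dropDuplicates("eventId")  // keep one row per business key, regardless of offset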

The alternative is to rewind and reprocess from the offset where you think
the message was last seen.
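
For that you can pass per-partition offsets to the batch read, e.g. (topic
name and offset numbers are placeholders):

// Rewind and reprocess from an earlier point; the JSON maps topic -> partition -> offset.
val reread = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "my-topic")
  .option("startingOffsets", """{"my-topic":{"0":1000,"1":950}}""")
  .option("endingOffsets", "latest")
  .load()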

Kind regards
Chris

On Mon, 3 Feb 2020, 7:39 pm Ruijing Li, <liruijin...@gmail.com> wrote:

> Hi all,
>
> My use case is to read from a single kafka topic using a batch spark sql job
> (not structured streaming ideally). I want this batch job, every time it
> starts, to get the last offset it stopped at, start reading from there
> until it catches up to the latest offset, store the result, and stop the job.
> Given the dataframe has a partition and offset column, my first thought for
> offset management is to groupBy partition and agg the max offset, then
> store it in HDFS. Next time the job runs, it will read and start from this
> max offset using startingOffsets.
>
> However, I was wondering if this will work. If the kafka producer fails to
> send a message at an offset and later decides to resend it, I will have
> skipped it, since I'm starting from the max offset sent. How does spark
> structured streaming know
> to continue onwards - does it keep a state of all offsets seen? If so, how
> can I replicate this for batch without missing data? Any help would be
> appreciated.
>
>
> --
> Cheers,
> Ruijing Li
>
