Hi Chris,

Thanks for the answer. So if I understand correctly:
- there will be a need to dedupe, since I should be expecting at-least-once
delivery.
- storing the result of (group by partition and aggregate max offsets) is
enough, since Kafka messages are immutable, so a resent message will arrive
with a different offset instead of the same offset.

So Spark, when reading from Kafka, is acting as an at-least-once consumer?
Why does Spark not do checkpointing for batch reads of Kafka?

(I've put a rough sketch of the offset-tracking flow I have in mind at the
very bottom of this mail, after the quoted thread.)

On Mon, Feb 3, 2020 at 1:36 AM Chris Teoh <chris.t...@gmail.com> wrote:

> Kafka can keep track of the offsets you've seen (in a separate topic,
> based on your consumer group), but it is usually best effort and you're
> probably better off also keeping track of your offsets yourself.
>
> If the producer resends a message you would have to dedupe it, as you've
> most likely already seen it; how you handle that depends on your data.
> I think the offset will increment automatically, so you will generally
> not see the same offset occur more than once in a Kafka topic partition
> (feel free to correct me on this though). So the most likely scenario you
> need to handle is the producer sending a duplicate message under two
> different offsets.
>
> Alternatively, you can reprocess the offsets back from where you thought
> the message was last seen.
>
> Kind regards
> Chris
>
> On Mon, 3 Feb 2020, 7:39 pm Ruijing Li, <liruijin...@gmail.com> wrote:
>
>> Hi all,
>>
>> My use case is to read from a single Kafka topic using a batch Spark SQL
>> job (not Structured Streaming, ideally). I want this batch job, every
>> time it starts, to get the last offset it stopped at, start reading from
>> there until it has caught up to the latest offset, store the result, and
>> stop the job. Given that the dataframe has a partition and an offset
>> column, my first thought for offset management is to groupBy partition
>> and agg the max offset, then store it in HDFS. Next time the job runs,
>> it will read this max offset and start from it using startingOffsets.
>>
>> However, I was wondering if this will work. If the Kafka producer failed
>> on an offset and later decides to resend it, I will have skipped it,
>> since I'm starting from the max offset sent. How does Spark Structured
>> Streaming know to continue onwards - does it keep a state of all offsets
>> seen? If so, how can I replicate this for batch without missing data?
>> Any help would be appreciated.
>>
>> --
>> Cheers,
>> Ruijing Li
>>
>
--
Cheers,
Ruijing Li
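
P.S. Here's the rough sketch I mentioned above, assuming the Spark 2.x batch
Kafka source and parquet for the stored offsets; the topic name, broker
address, and HDFS path are all made-up placeholders:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.max

val spark = SparkSession.builder.appName("kafka-batch-read").getOrCreate()

val topic      = "events"                 // placeholder topic name
val offsetPath = "hdfs:///offsets/events" // placeholder path for stored offsets

// Build the startingOffsets JSON the Kafka source expects, e.g.
// {"events":{"0":101,"1":57}}. The stored value is the max offset already
// read and startingOffsets is inclusive, so resume from maxOffset + 1.
// On the very first run nothing is stored yet, so fall back to "earliest".
val startingOffsets =
  try {
    val rows = spark.read.parquet(offsetPath).collect()
    val perPartition = rows
      .map(r => s""""${r.getInt(0)}":${r.getLong(1) + 1}""")
      .mkString(",")
    s"""{"$topic":{$perPartition}}"""
  } catch {
    case _: org.apache.spark.sql.AnalysisException => "earliest"
  }

val df = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092") // placeholder brokers
  .option("subscribe", topic)
  .option("startingOffsets", startingOffsets)
  .option("endingOffsets", "latest")
  .load()

// ... transform df here, deduping on a business key in case the producer
// resent a message under a new offset ...

// Persist the new per-partition max offsets for the next run.
df.groupBy("partition")
  .agg(max("offset").as("maxOffset"))
  .write.mode("overwrite").parquet(offsetPath)

One gap in this sketch: a partition added to the topic between runs won't
appear in the stored offsets, so the startingOffsets JSON wouldn't cover it
and that case would need extra handling.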