Hi Chris,

Thanks for the answer. So if I understand correctly:

- there will be a need to dedupe, since I should be expecting at-least-once
delivery.

- storing the result of (group by partition and aggregate max offset) is
enough, since Kafka messages are immutable, so a resent message will get a
different offset instead of reusing the same offset.

So Spark, when reading from Kafka, is acting as an at-least-once consumer? Why
does Spark not do checkpointing for batch reads of Kafka?
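
To make sure I'm describing the same flow, here is a rough Scala sketch of how
I imagine the batch job working (the topic name "mytopic", the broker address,
and the HDFS path "/offsets/mytopic" are just placeholders, and deduping on the
raw value is only a stand-in for deduping on a real business key):

  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.functions.max

  val spark = SparkSession.builder.appName("kafka-batch-read").getOrCreate()

  // Placeholder HDFS location where the previous run stored (partition, maxOffset).
  val offsetPath = "/offsets/mytopic"

  // Build the startingOffsets JSON, e.g. {"mytopic":{"0":1051,"1":2301}},
  // starting one past the last offset stored for each partition.
  // (A first run with nothing stored would use startingOffsets = "earliest".)
  val stored = spark.read.parquet(offsetPath)
    .collect()
    .map(r => s""""${r.getInt(0)}":${r.getLong(1) + 1}""")
    .mkString(",")
  val startingOffsets = s"""{"mytopic":{$stored}}"""

  val df = spark.read
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "mytopic")
    .option("startingOffsets", startingOffsets)
    .option("endingOffsets", "latest")
    .load()

  // Dedupe, since at-least-once delivery means the same message can show up
  // under two different offsets. A real job would parse the value and dedupe
  // on a business key rather than on the raw string.
  val deduped = df
    .selectExpr("CAST(value AS STRING) AS value", "partition", "offset")
    .dropDuplicates("value")

  // ... store the deduped result wherever it needs to go ...

  // Persist the new per-partition max offsets for the next run.
  df.groupBy("partition")
    .agg(max("offset").as("maxOffset"))
    .write.mode("overwrite").parquet(offsetPath)

Does that roughly match what you had in mind?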

On Mon, Feb 3, 2020 at 1:36 AM Chris Teoh <chris.t...@gmail.com> wrote:

> Kafka can keep track of the offsets you've seen (in a separate topic, based
> on your consumer group), but it is usually best effort and you're probably
> better off also keeping track of your offsets yourself.
>
> If the producer resends a message, you would have to dedupe it as you've
> most likely already seen it; how you handle that depends on your data. I
> think the offset will increment automatically, and you will generally not
> see the same offset occur more than once in a Kafka topic partition (feel
> free to correct me on this, though). So the most likely scenario you need to
> handle is the producer sending a duplicate message under two different
> offsets.
>
> The alternative is you can reprocess the offsets back from where you
> thought the message was last seen.
>
> Kind regards
> Chris
>
> On Mon, 3 Feb 2020, 7:39 pm Ruijing Li, <liruijin...@gmail.com> wrote:
>
>> Hi all,
>>
>> My use case is to read from a single Kafka topic using a batch Spark SQL
>> job (ideally not Structured Streaming). I want this batch job, every time
>> it starts, to get the last offset it stopped at, start reading from there
>> until it has caught up to the latest offset, store the result, and stop the
>> job. Given the dataframe has a partition and offset column, my first
>> thought for offset management is to groupBy partition and agg the max
>> offset, then store it in HDFS. Next time the job runs, it will read and
>> start from this max offset using startingOffsets.
>>
>> However, I was wondering if this will work. If the Kafka producer failed on
>> an offset and later decides to resend the message, I will have skipped it,
>> since I'm starting from the max offset already seen. How does Spark
>> Structured Streaming know to continue onwards - does it keep a state of all
>> offsets seen? If so, how can I replicate this for batch without missing
>> data? Any help would be appreciated.
>>
>>
>> --
>> Cheers,
>> Ruijing Li
>>
--
Cheers,
Ruijing Li
