The most common delivery semantic for a Kafka producer is at-least-once.

So your consumers have to handle dedupe.
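
For example, a minimal sketch of deduping in a batch job (broker, topic,
and the eventId field are placeholders; use whatever uniquely identifies
a message in your data):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.get_json_object

val spark = SparkSession.builder().appName("dedupe-sketch").getOrCreate()
import spark.implicits._

// Batch read from Kafka; broker and topic names are placeholders.
val raw = spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()

// Drop redelivered copies by a business key inside the payload
// ("eventId" is hypothetical).
val deduped = raw
  .selectExpr("CAST(value AS STRING) AS json")
  .withColumn("eventId", get_json_object($"json", "$.eventId"))
  .dropDuplicates("eventId")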

Spark can checkpoint, but you have to be explicit about it. It only makes
sense if your DataFrame lineage gets too long (i.e. you're running a highly
iterative algorithm) and you need to trim it so a failure doesn't force a
recompute from the start. It does not keep track of your Kafka offsets for
you.
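
If you do need it, it's explicit, e.g. (the checkpoint dir path is a
placeholder, and someIterativelyBuiltDf stands in for your own DataFrame):

// Assuming an existing SparkSession called spark.
spark.sparkContext.setCheckpointDir("hdfs:///tmp/spark-checkpoints")

// Materialises the DataFrame under the checkpoint dir and returns a new
// DataFrame whose lineage starts from the checkpointed data.
val trimmed = someIterativelyBuiltDf.checkpoint()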

In the context of reading from Kafka, your consumers can explicitly commit
an offset so Kafka knows you've read up to that point.
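
Spark's Kafka source won't do that commit for you, but with the plain Kafka
consumer API it looks roughly like this (broker, group id and topic are
placeholders):

import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer

val props = new Properties()
props.put("bootstrap.servers", "broker:9092")
props.put("group.id", "my-batch-job")
props.put("enable.auto.commit", "false") // commit explicitly instead
props.put("key.deserializer",
  "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer",
  "org.apache.kafka.common.serialization.StringDeserializer")

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("events"))

val records = consumer.poll(Duration.ofSeconds(5))
// ... process records ...
consumer.commitSync() // tells Kafka you've consumed up to here
consumer.close()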


On Tue, 4 Feb 2020, 6:13 am Ruijing Li, <liruijin...@gmail.com> wrote:

> Hi Chris,
>
> Thanks for the answer. So if I understand correctly:
>
> - there will be a need to dedupe, since I should be expecting
> at-least-once delivery.
>
> - storing the result of (group by partition and aggregate max offsets) is
> enough, since a Kafka message is immutable, so a resent message will get a
> different offset instead of the same offset.
>
> So Spark, when reading from Kafka, is acting as an at-least-once consumer?
> Why does Spark not do checkpointing for batch reads of Kafka?
>
> On Mon, Feb 3, 2020 at 1:36 AM Chris Teoh <chris.t...@gmail.com> wrote:
>
>> Kafka can keep track of the offsets you've seen (in a separate topic,
>> based on your consumer group), but it is usually best effort and you're
>> probably better off also keeping track of your offsets yourself.
>>
>> If the producer resends a message, you would have to dedupe it, as you've
>> most likely already seen it; how you handle that depends on your data. I
>> think the offset will increment automatically, and you will generally not
>> see the same offset occur more than once in a Kafka topic partition (feel
>> free to correct me on this though). So the most likely scenario you need
>> to handle is the producer sending a duplicate message under two different
>> offsets.
>>
>> The alternative is to reprocess from the offset where you think the
>> message was last seen.
>>
>> Kind regards
>> Chris
>>
>> On Mon, 3 Feb 2020, 7:39 pm Ruijing Li, <liruijin...@gmail.com> wrote:
>>
>>> Hi all,
>>>
>>> My use case is to read from a single Kafka topic using a batch Spark SQL
>>> job (not Structured Streaming, ideally). I want this batch job, every
>>> time it starts, to get the last offset it stopped at, start reading from
>>> there until it has caught up to the latest offset, store the result, and
>>> stop. Given the dataframe has a partition and offset column, my first
>>> thought for offset management is to groupBy partition and agg the max
>>> offset, then store it in HDFS. The next time the job runs, it will read
>>> this max offset and start from it using startingOffsets.
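>>>
>>> Roughly something like this (broker, topic and paths are placeholders,
>>> and I'm assuming an existing SparkSession called spark):
>>>
>>> import org.apache.spark.sql.functions.max
>>>
>>> // In practice this JSON would be built from the offsets stored in HDFS.
>>> // startingOffsets is inclusive, so presumably I'd store maxOffset + 1
>>> // to avoid re-reading the last row of each partition.
>>> val lastOffsetsJson = """{"mytopic":{"0":42,"1":17}}"""
>>>
>>> val df = spark.read
>>>   .format("kafka")
>>>   .option("kafka.bootstrap.servers", "broker:9092")
>>>   .option("subscribe", "mytopic")
>>>   .option("startingOffsets", lastOffsetsJson)
>>>   .option("endingOffsets", "latest")
>>>   .load()
>>>
>>> // Max offset seen per partition, persisted for the next run.
>>> val maxOffsets = df.groupBy("partition").agg(max("offset").as("maxOffset"))
>>> maxOffsets.write.mode("overwrite").parquet("hdfs:///jobs/mytopic-offsets")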
>>>
>>> However, I was wondering if this will work. If the Kafka producer fails
>>> to deliver a message at an offset and later decides to resend it, I will
>>> have skipped it, since I'm starting from the max offset sent. How does
>>> Spark Structured Streaming know to continue onwards? Does it keep a state
>>> of all offsets seen? If so, how can I replicate this for batch without
>>> missing data? Any help would be appreciated.
>>>
>>>
>>> --
>>> Cheers,
>>> Ruijing Li
>>>
>> --
> Cheers,
> Ruijing Li
>
