Hi,
We have 4 different Kafka clusters running:
2 on 0.10.1.0
1 on 0.10.0.1
1 that was on 0.11.0.0 and last week updated to 0.11.0.1

I've only seen the issue happen twice in production on the cluster that was on
0.11.0.0, in the roughly 3 months it has been running.

But I'll monitor and report it here if it ever happens again. We'll also
upgrade all our clusters to 0.11.0.1 in the next few days.

🤞🏻!

> On 11 Oct 2017, at 17:47, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> 
> Yeah, it just popped up in my list. Thanks, I'll take a look.
> 
> Vincent Dautremont, if you're still reading this, did you try upgrading to
> 0.11.0.1? Did it fix the issue?
> 
> On Wed, Oct 11, 2017 at 6:46 PM, Ben Davison <ben.davi...@7digital.com>
> wrote:
> 
>> Hi Dmitriy,
>> 
>> Did you check out the thread "Incorrect consumer offsets after broker
>> restart 0.11.0.0" from Phil Luckhurst? It sounds similar.
>> 
>> Thanks,
>> 
>> Ben
>> 
>> On Wed, Oct 11, 2017 at 4:44 PM Dmitriy Vsekhvalnov
>> <dvsekhval...@gmail.com> wrote:
>> 
>>> Hey, want to resurrect this thread.
>>> 
>>> Decided to do an idle test, where no data is produced to the topic at all.
>>> When we kill #101 or #102, nothing happens. But when we kill #200,
>>> consumers start to re-consume old events from a random position.
>>> 
>>> Anybody have ideas on what to check? I really expected Kafka to fail
>>> symmetrically with respect to any broker.
>>> 
>>> On Mon, Oct 9, 2017 at 6:26 PM, Dmitriy Vsekhvalnov
>>> <dvsekhval...@gmail.com> wrote:
>>> 
>>>> Hi tao,
>>>> 
>>>> we had unclean leader election enabled at the beginning. But we then
>>>> disabled it and also reduced the 'max.poll.records' value. It helped a
>>>> little.
>>>> 
>>>> But after today's testing there is a strong correlation between the lag
>>>> spike and which broker we crash. For the lowest-ID broker (100):
>>>>  1. the lag is always at least 1-2 orders of magnitude higher
>>>>  2. we start getting
>>>> 
>>>> org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be
>>>> completed since the group has already rebalanced and assigned the
>>>> partitions to another member. This means that the time between subsequent
>>>> calls to poll() was longer than the configured max.poll.interval.ms,
>>>> which typically implies that the poll loop is spending too much time
>>>> message processing. You can address this either by increasing the session
>>>> timeout or by reducing the maximum size of batches returned in poll()
>>>> with max.poll.records.
>>>> 
>>>>  3. sometimes re-consumption from a random position
>>>> 
>>>> And when we crash the other brokers (101, 102), there is just a lag spike
>>>> on the order of ~10Ks that settles down quite quickly, with no consumer
>>>> exceptions.
>>>> 
>>>> Totally lost as to what to try next.
>>>> 
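
The exception quoted above points at ordinary consumer configs. A minimal
sketch of the two knobs it mentions, using kafka-clients 0.11.x (the values
here are illustrative assumptions, not recommendations):

    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;

    Properties props = new Properties();
    // Give each poll loop iteration more headroom before the group
    // coordinator considers the member dead and rebalances.
    props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "600000"); // default 300000
    // And/or hand back fewer records per poll() so processing a batch
    // reliably finishes within that interval.
    props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");        // default 500
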
>>>>> On Sat, Oct 7, 2017 at 2:41 AM, tao xiao <xiaotao...@gmail.com> wrote:
>>>>> 
>>>>> Do you have unclean leader election turned on? If killing 100 is the
>>>>> only way to reproduce the problem, it is possible, with unclean leader
>>>>> election turned on, that leadership was transferred to an out-of-ISR
>>>>> follower which may not have the latest high watermark.
>>>>> On Sat, Oct 7, 2017 at 3:51 AM Dmitriy Vsekhvalnov
>>>>> <dvsekhval...@gmail.com> wrote:
>>>>> 
>>>>>> About to verify the hypothesis on Monday, but it looked that way in the
>>>>>> latest tests. Need to double-check.
>>>>>> 
>>>>>> On Fri, Oct 6, 2017 at 11:25 PM, Stas Chizhov <schiz...@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> So no matter in what sequence you shut down brokers, it is only 1 that
>>>>>>> causes the major problem? That would indeed be a bit weird. Have you
>>>>>>> checked the offsets of your consumer right after they jump back - does
>>>>>>> it start from the beginning of the topic or does it go back to some
>>>>>>> random position? Have you checked whether all offsets are actually
>>>>>>> being committed by the consumers?
>>>>>>> 
>>>>>>> On Fri, 6 Oct 2017 at 20:59, Dmitriy Vsekhvalnov
>>>>>>> <dvsekhval...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Yeah, probably we can dig around.
>>>>>>>> 
>>>>>>>> One more observation: most of the lag/re-consumption trouble happens
>>>>>>>> when we kill the broker with the lowest id (e.g. 100 from
>>>>>>>> [100,101,102]). When crashing the other brokers nothing special
>>>>>>>> happens; the lag grows a little, but nothing crazy (e.g. thousands,
>>>>>>>> not millions).
>>>>>>>> 
>>>>>>>> Does that sound suspicious?
>>>>>>>> 
>>>>>>>> On Fri, Oct 6, 2017 at 9:23 PM, Stas Chizhov <schiz...@gmail.com>
>>>>>>>> wrote:
>>>>>>>> 
>>>>>>>>> Ted: when choosing earliest/latest you are saying: if it happens
>>>>>>>>> that there is no "valid" offset committed for a consumer (for
>>>>>>>>> whatever reason: bug/misconfiguration/no luck), it will be OK to
>>>>>>>>> start from the beginning or the end of the topic. So if you are not
>>>>>>>>> OK with that, you should choose none.
>>>>>>>>> 
>>>>>>>>> Dmitriy: OK. Then it is spring-kafka that maintains this
>>>>>>>>> per-partition offset state for you. It might also have the problem
>>>>>>>>> of leaving stale offsets lying around. After quickly looking through
>>>>>>>>> https://github.com/spring-projects/spring-kafka/blob/1945f29d5518e3c4a9950ba82135420dfb61e808/spring-kafka/src/main/java/org/springframework/kafka/listener/KafkaMessageListenerContainer.java
>>>>>>>>> it looks possible, since the offsets map is not cleared upon
>>>>>>>>> partition revocation, but that is just a hypothesis. I have no
>>>>>>>>> experience with spring-kafka. However, since you say your consumers
>>>>>>>>> were always active, I find this theory worth investigating.
>>>>>>>>> 
>>>>>>>>> 
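
To make that hypothesis concrete: a minimal sketch, using the plain
kafka-clients API, of how an application-side offsets map can go stale (the
map and its handling are my own illustration, not spring-kafka's actual
code). If entries are not dropped in onPartitionsRevoked(), a later commit
can push an old offset for a partition the consumer has since re-acquired:

    import java.util.Collection;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    // Offsets accumulated by the application between commits.
    final Map<TopicPartition, OffsetAndMetadata> pending = new HashMap<>();

    ConsumerRebalanceListener listener = new ConsumerRebalanceListener() {
        @Override
        public void onPartitionsRevoked(Collection<TopicPartition> revoked) {
            // Without this cleanup, entries for revoked partitions survive
            // the rebalance and can be committed again later - exactly the
            // stale-offset failure mode described above.
            for (TopicPartition tp : revoked) {
                pending.remove(tp);
            }
        }

        @Override
        public void onPartitionsAssigned(Collection<TopicPartition> assigned) {
            // Nothing to clean up here; committed offsets are re-fetched
            // from the group coordinator on assignment.
        }
    };
    // Registered at subscription time: consumer.subscribe(topics, listener);
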
>>>>>>>>> 2017-10-06 18:20 GMT+02:00 Vincent Dautremont
>>>>>>>>> <vincent.dautrem...@olamobile.com.invalid>:
>>>>>>>>> 
>>>>>>>>>> Is there a way to read messages on a topic partition from a
>>>>>>>>>> specific node that we choose (and not from the topic partition
>>>>>>>>>> leader)? I would like to check myself that each of the
>>>>>>>>>> __consumer_offsets partition replicas has the same consumer group
>>>>>>>>>> offset written in it.
>>>>>>>>>> 
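
As far as I know, the 0.11 consumer protocol always fetches from the
partition leader, so you cannot read from a chosen replica over the wire.
What you can do is inspect each replica's copy on disk with the stock
DumpLogSegments tool, run on every broker hosting the partition (the log
path below is an assumption for a default setup):

    bin/kafka-run-class.sh kafka.tools.DumpLogSegments --offsets-decoder \
      --files /var/kafka-logs/__consumer_offsets-37/00000000000000000000.log
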
>>>>>>>>>> On Fri, Oct 6, 2017 at 6:08 PM, Dmitriy Vsekhvalnov
>>>>>>>>>> <dvsekhval...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Stas:
>>>>>>>>>>> 
>>>>>>>>>>> we rely on spring-kafka; it commits offsets "manually" for us
>>>>>>>>>>> after the event handler completes. So it's kind of automatic once
>>>>>>>>>>> there is a constant stream of events (no idle time, which is true
>>>>>>>>>>> for us). Though it's not what the pure kafka-client calls
>>>>>>>>>>> "automatic" (flushing commits at fixed intervals).
>>>>>>>>>>> 
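
For comparison, a minimal sketch of that "commit after the handler
completes" pattern on the plain kafka-clients API (my own illustration of
the pattern, not spring-kafka's actual code; handle() stands in for the
application's event handler):

    import java.util.Collections;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    // Assumes enable.auto.commit=false on the consumer.
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(1000);
        for (ConsumerRecord<String, String> record : records) {
            handle(record); // process first, commit after
            consumer.commitSync(Collections.singletonMap(
                new TopicPartition(record.topic(), record.partition()),
                // The committed value is the NEXT offset to read, hence +1.
                new OffsetAndMetadata(record.offset() + 1)));
        }
    }
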
>>>>>>>>>>> On Fri, Oct 6, 2017 at 7:04 PM, Stas Chizhov
>>>>>>>>>>> <schiz...@gmail.com> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> You don't have autocommit enabled; that means you commit offsets
>>>>>>>>>>>> yourself - correct? If you store them per partition somewhere
>>>>>>>>>>>> and fail to clean that up upon rebalance, then the next time the
>>>>>>>>>>>> consumer gets this partition assigned it can commit an old,
>>>>>>>>>>>> stale offset - can this be the case?
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Fri, 6 Oct 2017 at 17:59, Dmitriy Vsekhvalnov
>>>>>>>>>>>> <dvsekhval...@gmail.com> wrote:
>>>>>>>>>>>> 
>>>>>>>>>>>>> Reprocessing the same events again is fine for us (we're
>>>>>>>>>>>>> idempotent), while losing data is more critical.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> What are the reasons for such behaviour? The consumers are never
>>>>>>>>>>>>> idle and always committing, so probably something is wrong with
>>>>>>>>>>>>> the broker setup then?
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <yuzhih...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Stas:
>>>>>>>>>>>>>> bq.  using anything but none is not really an option
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> If you have time, can you explain a bit more?
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> Thanks
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov
>>>>>>>>>>>>>> <schiz...@gmail.com> wrote:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> If you set auto.offset.reset to none, the next time it
>>>>>>>>>>>>>>> happens you will be in a much better position to find out
>>>>>>>>>>>>>>> what is going on. Also, in general, with the current
>>>>>>>>>>>>>>> semantics of the offset reset policy, IMO using anything but
>>>>>>>>>>>>>>> none is not really an option unless it is OK for the consumer
>>>>>>>>>>>>>>> to lose some data (latest) or reprocess it a second time
>>>>>>>>>>>>>>> (earliest).
>>>>>>>>>>>>>>> 
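
A minimal sketch of what "none" looks like in practice with kafka-clients:
instead of silently jumping to earliest/latest, poll() throws when no valid
committed offset exists, and the application decides how to recover (the
recovery shown is just an illustrative choice):

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerConfig;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.NoOffsetForPartitionException;

    Properties props = new Properties();
    props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "none");
    // ... bootstrap.servers, group.id and deserializers as usual ...
    KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
    consumer.subscribe(Collections.singletonList("events")); // illustrative topic

    try {
        consumer.poll(1000);
    } catch (NoOffsetForPartitionException e) {
        // No valid committed offset: log loudly and make an explicit,
        // audited decision instead of an implicit jump.
        System.err.println("No committed offset for " + e.partitions());
        consumer.seekToBeginning(e.partitions()); // illustrative recovery
    }
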
>>>>>>>>>>>>>>> On Fri, 6 Oct 2017 at 17:44, Ted Yu <yuzhih...@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> Should Kafka log a warning if log.retention.hours is lower
>>>>>>>>>>>>>>>> than the number of hours specified by
>>>>>>>>>>>>>>>> offsets.retention.minutes?
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> On Fri, Oct 6, 2017 at 8:35 AM, Manikumar
>>>>>>>>>>>>>>>> <manikumar.re...@gmail.com> wrote:
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> Normally, shouldn't log.retention.hours (168 hrs) be higher
>>>>>>>>>>>>>>>>> than offsets.retention.minutes (336 hrs)?
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
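
Putting the two retention settings from this thread in one unit: 20160
minutes / 60 = 336 hours = 14 days, versus log.retention.hours = 168 hours
= 7 days. So as configured here the committed offsets outlive the log data,
which is generally the safe direction; the classic failure mode is the
opposite one, where offsets expire while the data is still there.
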
>>>>>>>>>>>>>>>>> On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy Vsekhvalnov
>>>>>>>>>>>>>>>>> <dvsekhval...@gmail.com> wrote:
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Hi Ted,
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Broker: v0.11.0.0
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> Consumer:
>>>>>>>>>>>>>>>>>> kafka-clients v0.11.0.0
>>>>>>>>>>>>>>>>>> auto.offset.reset = earliest
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <yuzhih...@gmail.com>
>>>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> What's the value for auto.offset.reset?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Which release are you using?
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> Cheers
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy Vsekhvalnov
>>>>>>>>>>>>>>>>>>> <dvsekhval...@gmail.com> wrote:
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> we have several times faced a situation where a consumer
>>>>>>>>>>>>>>>>>>>> group started to re-consume old events from the
>>>>>>>>>>>>>>>>>>>> beginning. Here is the scenario:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 1. 3-broker Kafka cluster on top of a 3-node ZooKeeper
>>>>>>>>>>>>>>>>>>>> ensemble
>>>>>>>>>>>>>>>>>>>> 2. RF=3 for all topics
>>>>>>>>>>>>>>>>>>>> 3. log.retention.hours=168 and
>>>>>>>>>>>>>>>>>>>> offsets.retention.minutes=20160
>>>>>>>>>>>>>>>>>>>> 4. running a sustained load (pushing events)
>>>>>>>>>>>>>>>>>>>> 5. doing disaster testing by randomly shutting down 1 of
>>>>>>>>>>>>>>>>>>>> the 3 broker nodes (then provisioning a new broker back)
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Several times after bouncing a broker we faced a
>>>>>>>>>>>>>>>>>>>> situation where the consumer group started to re-consume
>>>>>>>>>>>>>>>>>>>> old events.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> The consumer group:
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> 1. enable.auto.commit = false
>>>>>>>>>>>>>>>>>>>> 2. we tried graceful group shutdown, kill -9, and
>>>>>>>>>>>>>>>>>>>> terminating AWS nodes
>>>>>>>>>>>>>>>>>>>> 3. we never experienced re-consumption in those cases
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> What can cause that re-consumption of old events? Is it
>>>>>>>>>>>>>>>>>>>> related to bouncing one of the brokers? What should we
>>>>>>>>>>>>>>>>>>>> search for in the logs? Any broker settings to try?
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>>> Thanks in advance.
>>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>> 
>> 
