Hey guys, just want to post that upgrading to 0.11.0.1 solved the issue. After extensive disaster testing, no re-consumption of old offsets was observed.
On Thu, Oct 12, 2017 at 1:35 AM, Vincent Dautremont <vincent.dautrem...@olamobile.com.invalid> wrote:
> Hi,
> We have 4 different Kafka clusters running:
> 2 on 0.10.1.0
> 1 on 0.10.0.1
> 1 that was on 0.11.0.0 and was updated last week to 0.11.0.1
>
> I've only seen the issue happen 2 times in production usage on the
> cluster on 0.11.0.0 since it has been running (about 3 months).
>
> But I'll monitor and report it here if it ever happens again. We'll
> also upgrade all our clusters to 0.11.0.1 in the next days.
>
> 🤞🏻!

On 11 Oct 2017, at 17:47, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> Yeah, it just popped up in my list. Thanks, I'll take a look.
>
> Vincent Dautremont, if you are still reading this: did you try
> upgrading to 0.11.0.1? Did it fix the issue?

On Wed, Oct 11, 2017 at 6:46 PM, Ben Davison <ben.davi...@7digital.com> wrote:
> Hi Dmitriy,
>
> Did you check out the thread "Incorrect consumer offsets after broker
> restart 0.11.0.0" from Phil Luckhurst? It sounds similar.
>
> Thanks,
>
> Ben

On Wed, Oct 11, 2017 at 4:44 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> Hey, want to resurrect this thread.
>
> We decided to do an idle test, where no data is produced to the topic
> at all. When we kill #101 or #102, nothing happens. But when we kill
> #200, consumers start to re-consume old events from a random position.
>
> Anybody have ideas what to check? I really expected that Kafka would
> fail symmetrically with respect to any broker.

On Mon, Oct 9, 2017 at 6:26 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> Hi tao,
>
> we had unclean leader election enabled at the beginning, but then
> disabled it and also reduced the 'max.poll.records' value. It helped a
> little.
>
> But after today's testing there is a strong correlation between the
> lag spike and which broker we crash. For the broker with the lowest
> ID (100):
>
> 1. the lag is always at least 1-2 orders of magnitude higher
> 2. we start getting:
>
>    org.apache.kafka.clients.consumer.CommitFailedException: Commit
>    cannot be completed since the group has already rebalanced and
>    assigned the partitions to another member. This means that the time
>    between subsequent calls to poll() was longer than the configured
>    max.poll.interval.ms, which typically implies that the poll loop is
>    spending too much time message processing. You can address this
>    either by increasing the session timeout or by reducing the maximum
>    size of batches returned in poll() with max.poll.records.
>
> 3. sometimes re-consumption from a random position
>
> When we crash the other brokers (101, 102), there is just a lag spike
> on the order of ~10K that settles down quite quickly, with no consumer
> exceptions.
>
> Totally lost as to what to try next.
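For reference, the two knobs that exception message points at (max.poll.records and max.poll.interval.ms), combined with an explicit commit-after-processing loop, look like this on a plain 0.11 kafka-clients consumer. A minimal sketch; the broker address, group id, topic name and handler are placeholders:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-100:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                 // placeholder
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        // Smaller batches finish faster between poll() calls...
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");
        // ...and a longer allowed gap between poll() calls gives slow handlers headroom.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // placeholder topic
            while (true) {
                // poll(long) is the 0.11-era signature (replaced by poll(Duration) in 2.0).
                ConsumerRecords<String, String> records = consumer.poll(500);
                for (ConsumerRecord<String, String> record : records) {
                    handle(record); // your event handler
                }
                consumer.commitSync(); // commit only after the whole batch is processed
            }
        }
    }

    private static void handle(ConsumerRecord<String, String> record) {
        // process the event (idempotently, per the discussion below)
    }
}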
On Sat, Oct 7, 2017 at 2:41 AM, tao xiao <xiaotao...@gmail.com> wrote:
> Do you have unclean leader election turned on? If killing 100 is the
> only way to reproduce the problem, it is possible with unclean leader
> election turned on that leadership was transferred to an out-of-ISR
> follower which may not have the latest high watermark.

On Sat, Oct 7, 2017 at 3:51 AM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> About to verify the hypothesis on Monday, but it looks like that in
> the latest tests. Need to double-check.

On Fri, Oct 6, 2017 at 11:25 PM, Stas Chizhov <schiz...@gmail.com> wrote:
> So no matter in what sequence you shut down brokers, it is only one
> that causes the major problem? That would indeed be a bit weird. Have
> you checked the offsets of your consumer right after they jump back:
> does it start from the beginning of the topic, or does it go back to
> some random position? Have you checked whether all offsets are
> actually being committed by the consumers?

On Fri, 6 Oct 2017 at 20:59, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> Yeah, probably we can dig around.
>
> One more observation: the most lag/re-consumption trouble happens when
> we kill the broker with the lowest id (e.g. 100 from [100, 101, 102]).
> When crashing the other brokers, nothing special happens; the lag
> grows a little, but nothing crazy (e.g. thousands, not millions).
>
> Does that sound suspicious?

On Fri, Oct 6, 2017 at 9:23 PM, Stas Chizhov <schiz...@gmail.com> wrote:
> Ted: when choosing earliest/latest you are saying: if it happens that
> there is no "valid" offset committed for a consumer (for whatever
> reason: bug/misconfiguration/no luck), it will be ok to start from the
> beginning or the end of the topic. So if you are not ok with that, you
> should choose none.
>
> Dmitriy: OK. Then it is spring-kafka that maintains this
> offset-per-partition state for you. It might also have that problem of
> leaving stale offsets lying around. After quickly looking through
> https://github.com/spring-projects/spring-kafka/blob/1945f29d5518e3c4a9950ba82135420dfb61e808/spring-kafka/src/main/java/org/springframework/kafka/listener/KafkaMessageListenerContainer.java
> it looks possible, since the offsets map is not cleared upon partition
> revocation, but that is just a hypothesis: I have no experience with
> spring-kafka. However, since you say your consumers were always
> active, I find this theory worth investigating.

2017-10-06 18:20 GMT+02:00 Vincent Dautremont <vincent.dautrem...@olamobile.com.invalid>:
> Is there a way to read messages on a topic partition from a specific
> node that we choose (and not from the topic partition leader)?
> I would like to verify myself that each of the __consumer_offsets
> partition replicas has the same consumer group offset written in it.
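On the stale-offsets hypothesis above: if application code keeps its own per-partition offset map, the usual guard is to drop entries when partitions are revoked, so nothing stale can be committed after a rebalance. A sketch of the idea only, not spring-kafka's actual implementation; all names here are illustrative:

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class OffsetTrackingListener implements ConsumerRebalanceListener {
    // Offsets recorded after processing, pending commit.
    private final Map<TopicPartition, OffsetAndMetadata> pending = new HashMap<>();

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Without this, an entry recorded before the rebalance could be
        // committed later for a partition this consumer no longer owns -
        // exactly the kind of stale commit hypothesized above.
        pending.keySet().removeAll(partitions);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Positions are restored from the committed offsets; nothing to do here.
    }

    public void record(TopicPartition tp, long nextOffsetToRead) {
        pending.put(tp, new OffsetAndMetadata(nextOffsetToRead));
    }

    public Map<TopicPartition, OffsetAndMetadata> drain() {
        Map<TopicPartition, OffsetAndMetadata> out = new HashMap<>(pending);
        pending.clear();
        return out;
    }
}

The listener would be attached with consumer.subscribe(topics, listener), and drain() would feed consumer.commitSync(offsets).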
On Fri, Oct 6, 2017 at 6:08 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> Stas:
>
> we rely on spring-kafka; it commits offsets "manually" for us after
> the event handler completes. So it's kind of automatic once there is a
> constant stream of events (no idle time, which is true for us), though
> it's not what the pure kafka-client calls "automatic" (flushing
> commits at fixed intervals).

On Fri, Oct 6, 2017 at 7:04 PM, Stas Chizhov <schiz...@gmail.com> wrote:
> You don't have autocommit enabled, which means you commit offsets
> yourself - correct? If you store them per partition somewhere and fail
> to clean that up upon rebalance, then the next time the consumer gets
> this partition assigned it can commit an old stale offset - can this
> be the case?

On Fri, 6 Oct 2017 at 17:59, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> Reprocessing the same events again is fine for us (idempotent), while
> losing data is more critical.
>
> What are the reasons for such behaviour? Consumers are never idle,
> always committing - probably something wrong with the broker setup
> then?

On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> Stas:
> bq. using anything but none is not really an option
>
> If you have time, can you explain a bit more?
>
> Thanks

On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <schiz...@gmail.com> wrote:
> If you set auto.offset.reset to none, next time it happens you will be
> in a much better position to find out what happened. Also, in general,
> with the current semantics of the offset reset policy, IMO using
> anything but none is not really an option unless it is ok for the
> consumer to lose some data (latest) or reprocess it a second time
> (earliest).
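To make that concrete: with auto.offset.reset=none, a missing committed offset surfaces as an exception instead of a silent jump, so any reset becomes an explicit, logged decision. A minimal sketch; the recovery choice at the end is just an example, and the broker, group and topic names are placeholders:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.NoOffsetForPartitionException;

public class NoResetConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-100:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                 // placeholder
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        // "none": fail loudly when no valid committed offset exists, instead
        // of silently rewinding (earliest) or skipping ahead (latest).
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "none");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // placeholder topic
            while (true) {
                try {
                    ConsumerRecords<String, String> records = consumer.poll(500);
                    records.forEach(r -> { /* handle the event */ });
                    consumer.commitSync();
                } catch (NoOffsetForPartitionException e) {
                    // Committed offsets are missing (expired, lost, or never written).
                    // Alert first, then reset deliberately - rewinding is an
                    // explicit choice here, not a default.
                    System.err.println("No committed offset: " + e.getMessage());
                    consumer.seekToBeginning(consumer.assignment());
                }
            }
        }
    }
}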
On Fri, 6 Oct 2017 at 17:44, Ted Yu <yuzhih...@gmail.com> wrote:
> Should Kafka log a warning if log.retention.hours is lower than the
> number of hours specified by offsets.retention.minutes?

On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <manikumar.re...@gmail.com> wrote:
> Normally, log.retention.hours (168 hrs) should be higher than
> offsets.retention.minutes (336 hrs)?

On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> Hi Ted,
>
> Broker: v0.11.0.0
>
> Consumer:
> kafka-clients v0.11.0.0
> auto.offset.reset = earliest

On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> What's the value for auto.offset.reset?
>
> Which release are you using?
>
> Cheers

On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> Hi all,
>
> we have several times faced a situation where a consumer group started
> to re-consume old events from the beginning. Here is the scenario:
>
> 1. x3 broker Kafka cluster on top of an x3 node ZooKeeper
> 2. RF=3 for all topics
> 3. log.retention.hours=168 and offsets.retention.minutes=20160
> 4. running sustained load (pushing events)
> 5. doing disaster testing by randomly shutting down 1 of the 3 broker
>    nodes (then provisioning a new broker back)
>
> Several times after bouncing a broker we faced a situation where the
> consumer group started to re-consume old events.
>
> Consumer group:
>
> 1. enable.auto.commit = false
> 2. tried graceful group shutdown, kill -9 and terminating AWS nodes
> 3. never experienced re-consumption in those cases
>
> What can cause that re-consumption of old events? Is it related to
> bouncing one of the brokers? What to search for in the logs? Any
> broker settings to try?
>
> Thanks in advance.
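One cheap diagnostic for this scenario, in the spirit of the offset questions above, is to dump what the group actually has committed next to the log end offsets right after a broker bounce; a backwards jump shows up immediately. A sketch with placeholder names (note it reads via the partition leaders; fetching a specific replica's copy of __consumer_offsets, as asked earlier, is not possible with the regular 0.11 consumer):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class OffsetAudit {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-100:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                 // group to audit
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> parts = new ArrayList<>();
            for (PartitionInfo p : consumer.partitionsFor("events")) { // placeholder topic
                parts.add(new TopicPartition(p.topic(), p.partition()));
            }
            Map<TopicPartition, Long> ends = consumer.endOffsets(parts);
            for (TopicPartition tp : parts) {
                OffsetAndMetadata committed = consumer.committed(tp); // null if none
                System.out.printf("%s committed=%s end=%d%n", tp,
                        committed == null ? "none" : committed.offset(), ends.get(tp));
            }
        }
    }
}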