Hey, I want to resurrect this thread. We decided to run an idle test, where no load data is produced to the topic at all. When we kill #101 or #102, nothing happens. But when we kill #200, consumers start to re-consume old events from a random position.

Anybody have ideas what to check? I really expected Kafka to fail symmetrically with respect to any broker.
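(For reference, a minimal sketch of the check suggested further down the thread: snapshot the group's committed offsets against the log end offsets right after the jump, to see whether the group restarted from the topic start or from some arbitrary position. The broker address, group id and topic name below are placeholders, not our actual setup.)

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.PartitionInfo;
    import org.apache.kafka.common.TopicPartition;

    public class OffsetSnapshot {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker-101:9092");      // placeholder address
            props.put("group.id", "my-consumer-group");              // placeholder group id
            props.put("enable.auto.commit", "false");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.ByteArrayDeserializer");

            try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
                List<TopicPartition> partitions = new ArrayList<>();
                for (PartitionInfo p : consumer.partitionsFor("my-topic")) {  // placeholder topic
                    partitions.add(new TopicPartition(p.topic(), p.partition()));
                }
                Map<TopicPartition, Long> end = consumer.endOffsets(partitions);
                for (TopicPartition tp : partitions) {
                    // Last offset committed by the group for this partition, or null if none.
                    OffsetAndMetadata committed = consumer.committed(tp);
                    System.out.printf("%s committed=%s end=%d%n",
                            tp, committed == null ? "none" : committed.offset(), end.get(tp));
                }
            }
        }
    }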
On Mon, Oct 9, 2017 at 6:26 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> Hi tao,
>
> we had unclean leader election enabled at the beginning, but then disabled
> it and also reduced 'max.poll.records'. It helped a little bit.
>
> After today's testing there is a strong correlation between the lag spike
> and which broker we crash. For the lowest-ID broker (100):
>
> 1. the lag is always at least 1-2 orders of magnitude higher
> 2. we start getting:
>
>    org.apache.kafka.clients.consumer.CommitFailedException: Commit cannot be
>    completed since the group has already rebalanced and assigned the
>    partitions to another member. This means that the time between subsequent
>    calls to poll() was longer than the configured max.poll.interval.ms,
>    which typically implies that the poll loop is spending too much time
>    message processing. You can address this either by increasing the session
>    timeout or by reducing the maximum size of batches returned in poll() with
>    max.poll.records.
>
> 3. sometimes re-consumption from a random position
>
> When we crash the other brokers (101, 102), it is just a lag spike on the
> order of ~10K that settles down quite quickly, with no consumer exceptions.
>
> Totally lost on what to try next.
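(That exception fires when one poll()'s worth of records takes longer to process than max.poll.interval.ms, so the group evicts the member and rebalances; the commit then fails because the partitions were handed to someone else. A rough sketch of the two knobs it points at - the values here are illustrative, not what we run:)

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class SlowHandlerConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker-101:9092");   // placeholder
            props.put("group.id", "my-consumer-group");           // placeholder
            props.put("enable.auto.commit", "false");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            // Fewer records per poll() so one batch is always processed well within
            // max.poll.interval.ms and the member is not kicked out of the group.
            props.put("max.poll.records", "50");
            props.put("max.poll.interval.ms", "600000");           // 10 min of headroom

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic"));  // placeholder
                while (true) {
                    ConsumerRecords<String, String> records = consumer.poll(1000);
                    for (ConsumerRecord<String, String> r : records) {
                        handle(r);                                 // slow per-record work
                    }
                    consumer.commitSync();                         // commit after the batch
                }
            }
        }

        private static void handle(ConsumerRecord<String, String> r) { /* event handler */ }
    }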
> On Sat, Oct 7, 2017 at 2:41 AM, tao xiao <xiaotao...@gmail.com> wrote:
> Do you have unclean leader election turned on? If killing 100 is the only
> way to reproduce the problem, it is possible with unclean leader election
> turned on that leadership was transferred to an out-of-ISR follower which
> may not have the latest high watermark.
>
> On Sat, Oct 7, 2017 at 3:51 AM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> About to verify the hypothesis on Monday, but it looks like that in the
> latest tests. Need to double check.
>
> On Fri, Oct 6, 2017 at 11:25 PM, Stas Chizhov <schiz...@gmail.com> wrote:
> So no matter in what sequence you shut down brokers, it is only one that
> causes the major problem? That would indeed be a bit weird. Have you checked
> the offsets of your consumer right after they jump back - does it start from
> the beginning of the topic, or does it go back to some random position? Have
> you checked whether all offsets are actually being committed by the consumers?
>
> On Fri, Oct 6, 2017 at 20:59, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> Yeah, probably we can dig around.
>
> One more observation: most of the lag/re-consumption trouble happens when we
> kill the broker with the lowest id (e.g. 100 from [100, 101, 102]).
> When crashing the other brokers there is nothing special happening - the lag
> grows a little, but nothing crazy (thousands, not millions).
>
> Does that sound suspicious?
>
> On Fri, Oct 6, 2017 at 9:23 PM, Stas Chizhov <schiz...@gmail.com> wrote:
> Ted: when choosing earliest/latest you are saying: if it happens that there
> is no "valid" offset committed for a consumer (for whatever reason:
> bug/misconfiguration/no luck), it is ok to start from the beginning or the
> end of the topic. So if you are not ok with that, you should choose none.
>
> Dmitriy: OK. Then it is spring-kafka that maintains this offset-per-partition
> state for you. It might also have the problem of leaving stale offsets lying
> around. After quickly looking through
> https://github.com/spring-projects/spring-kafka/blob/1945f29d5518e3c4a9950ba82135420dfb61e808/spring-kafka/src/main/java/org/springframework/kafka/listener/KafkaMessageListenerContainer.java
> it looks possible, since the offsets map is not cleared upon partition
> revocation, but that is just a hypothesis. I have no experience with
> spring-kafka. However, since you say your consumers were always active, I
> find this theory worth investigating.
>
> On 2017-10-06 at 18:20 GMT+02:00, Vincent Dautremont
> <vincent.dautrem...@olamobile.com.invalid> wrote:
> Is there a way to read messages on a topic partition from a specific node
> that we choose (and not from the topic partition leader)?
> I would like to check for myself that each of the __consumer_offsets
> partition replicas has the same consumer group offset written in it.
>
> On Fri, Oct 6, 2017 at 6:08 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> Stas:
>
> we rely on spring-kafka; it commits offsets "manually" for us after the
> event handler completes. So it is kind of automatic once there is a constant
> stream of events (no idle time, which is true for us), though it is not what
> the pure kafka-client calls "automatic" (flushing commits at fixed intervals).
>
> On Fri, Oct 6, 2017 at 7:04 PM, Stas Chizhov <schiz...@gmail.com> wrote:
> You don't have autocommit enabled, which means you commit offsets yourself -
> correct? If you store them per partition somewhere and fail to clean that up
> upon rebalance, then the next time the consumer gets this partition assigned
> it can commit an old stale offset - can this be the case?
>
> On Fri, Oct 6, 2017 at 17:59, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> Reprocessing the same events again is fine for us (idempotent), while losing
> data is more critical.
>
> What could be the reasons for such behaviour? The consumers are never idle,
> always committing - probably something wrong with the broker setup then?
>
> On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> Stas:
> bq. using anything but none is not really an option
>
> If you have time, can you explain a bit more?
>
> Thanks
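(On Stas's stale-offset theory above: the failure mode would be per-partition offset state that survives a revocation and gets committed later, after the partition has already moved to another member. A sketch of that pattern with the fix he implies - commit and drop the pending entries in onPartitionsRevoked. This is not spring-kafka's actual code, just an illustration with made-up names.)

    import java.util.Collection;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Properties;
    import java.util.concurrent.ConcurrentHashMap;
    import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.OffsetAndMetadata;
    import org.apache.kafka.common.TopicPartition;

    public class RevocationAwareConsumer {
        // Offsets the handler has finished with but that are not committed yet.
        private final Map<TopicPartition, OffsetAndMetadata> pending = new ConcurrentHashMap<>();
        private final KafkaConsumer<String, String> consumer;

        RevocationAwareConsumer(Properties props) {
            this.consumer = new KafkaConsumer<>(props);
            consumer.subscribe(Collections.singletonList("my-topic"),     // placeholder topic
                new ConsumerRebalanceListener() {
                    @Override
                    public void onPartitionsRevoked(Collection<TopicPartition> parts) {
                        // Commit what we have, then forget it; otherwise a stale entry
                        // could be committed later, after the partition was re-assigned.
                        consumer.commitSync(new HashMap<>(pending));
                        pending.keySet().removeAll(parts);
                    }
                    @Override
                    public void onPartitionsAssigned(Collection<TopicPartition> parts) { }
                });
        }

        void pollOnce() {
            for (ConsumerRecord<String, String> r : consumer.poll(1000)) {
                // ... handle the record, then remember the next offset to commit.
                pending.put(new TopicPartition(r.topic(), r.partition()),
                            new OffsetAndMetadata(r.offset() + 1));
            }
            consumer.commitSync(new HashMap<>(pending));
            pending.clear();
        }
    }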
> On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <schiz...@gmail.com> wrote:
> If you set auto.offset.reset to none, the next time it happens you will be
> in a much better position to find out what happened. Also, in general, with
> the current semantics of the offset reset policy, IMO using anything but
> none is not really an option unless it is ok for the consumer to lose some
> data (latest) or to reprocess it a second time (earliest).
>
> On Fri, Oct 6, 2017 at 17:44, Ted Yu <yuzhih...@gmail.com> wrote:
> Should Kafka log a warning if log.retention.hours is lower than the number
> of hours specified by offsets.retention.minutes?
>
> On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <manikumar.re...@gmail.com> wrote:
> Normally, shouldn't log.retention.hours (168 hrs) be higher than
> offsets.retention.minutes (336 hrs)?
>
> On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> Hi Ted,
>
> Broker: v0.11.0.0
>
> Consumer:
> kafka-clients v0.11.0.0
> auto.offset.reset = earliest
>
> On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> What's the value for auto.offset.reset?
>
> Which release are you using?
>
> Cheers
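(For completeness, this is what auto.offset.reset=none looks like from the consumer side: instead of silently rewinding to earliest, poll() throws when the group has no valid committed offset for a partition, so the incident becomes visible and you decide explicitly where to seek. Group and topic names are placeholders.)

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.KafkaConsumer;
    import org.apache.kafka.clients.consumer.NoOffsetForPartitionException;

    public class NoResetConsumer {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker-101:9092");   // placeholder
            props.put("group.id", "my-consumer-group");           // placeholder
            props.put("enable.auto.commit", "false");
            props.put("key.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer",
                    "org.apache.kafka.common.serialization.StringDeserializer");
            // "none": fail fast instead of jumping to earliest/latest when there is
            // no (or no longer a) committed offset for the group.
            props.put("auto.offset.reset", "none");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic"));  // placeholder
                try {
                    consumer.poll(1000);
                } catch (NoOffsetForPartitionException e) {
                    // Alert and seek to a known-good position deliberately, rather than
                    // re-consuming a week of data or skipping ahead without noticing.
                    System.err.println("No committed offset: " + e.getMessage());
                }
            }
        }
    }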
> On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> Hi all,
>
> we have several times faced a situation where a consumer group started to
> re-consume old events from the beginning. Here is the scenario:
>
> 1. 3-broker Kafka cluster on top of a 3-node ZooKeeper ensemble
> 2. RF=3 for all topics
> 3. log.retention.hours=168 and offsets.retention.minutes=20160
> 4. running sustained load (pushing events)
> 5. doing disaster testing by randomly shutting down 1 of the 3 broker nodes
>    (then provisioning a new broker back)
>
> Several times after bouncing a broker we faced a situation where the
> consumer group started to re-consume old events.
>
> consumer group:
>
> 1. enable.auto.commit = false
> 2. tried graceful group shutdown, kill -9 and terminating AWS nodes
> 3. never experienced re-consumption for the given cases.
>
> What can cause that re-consumption of old events? Is it related to bouncing
> one of the brokers? What should we search for in the logs? Any broker
> settings to try?
>
> Thanks in advance.
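(A partial answer to Vincent's earlier question: a 0.11 consumer can only fetch from the partition leader, so you cannot read each replica of __consumer_offsets directly. What you can do is confirm who leads every partition of that topic and whether its replicas stay in the ISR while one broker is down - the group coordinator is the leader of the group's __consumer_offsets partition, so this is worth watching during the disaster test. A sketch using the AdminClient that ships with 0.11; the broker address is a placeholder.)

    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.admin.AdminClient;
    import org.apache.kafka.clients.admin.TopicDescription;
    import org.apache.kafka.common.TopicPartitionInfo;

    public class OffsetsTopicHealth {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker-101:9092");   // placeholder address
            try (AdminClient admin = AdminClient.create(props)) {
                TopicDescription desc = admin
                        .describeTopics(Collections.singletonList("__consumer_offsets"))
                        .all().get()
                        .get("__consumer_offsets");
                for (TopicPartitionInfo p : desc.partitions()) {
                    // Partitions whose ISR shrinks while a broker is down are worth a
                    // closer look: the group coordinator moves with their leader.
                    System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
                            p.partition(), p.leader(), p.replicas(), p.isr());
                }
            }
        }
    }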