Hey guys, just want to post that upgrading to 0.11.0.1 solved the issue. After extensive disaster testing, no re-consumption of old offsets was observed.
On Thu, Oct 12, 2017 at 1:35 AM, Vincent Dautremont <vincent.dautrem...@olamobile.com.invalid> wrote:
> Hi,
> We have 4 different Kafka clusters running:
> 2 on 0.10.1.0
> 1 on 0.10.0.1
> 1 that was on 0.11.0.0 and was updated last week to 0.11.0.1
>
> I've only seen the issue happen 2 times in production usage on the
> cluster on 0.11.0.0 since it has been running (about 3 months).
>
> But I'll monitor and report it here if it ever happens again. We'll
> also upgrade all our clusters to 0.11.0.1 in the next days.
>
> 🤞🏻!

On 11 Oct 2017, at 17:47, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> Yeah, it just popped up in my list. Thanks, I'll take a look.
>
> Vincent Dautremont, if you are still reading this: did you try
> upgrading to 0.11.0.1? Did it fix the issue?

On Wed, Oct 11, 2017 at 6:46 PM, Ben Davison <ben.davi...@7digital.com> wrote:
> Hi Dmitriy,
>
> Did you check out the thread "Incorrect consumer offsets after broker
> restart 0.11.0.0" from Phil Luckhurst? It sounds similar.
>
> Thanks,
>
> Ben

On Wed, Oct 11, 2017 at 4:44 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> Hey, want to resurrect this thread.
>
> We decided to do an idle test, where no data is produced to the topic
> at all. When we kill #101 or #102, nothing happens. But when we kill
> #200, consumers start to re-consume old events from a random position.
>
> Anybody have ideas what to check? I really expected that Kafka would
> fail symmetrically with respect to any broker.

On Mon, Oct 9, 2017 at 6:26 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> Hi tao,
>
> we had unclean leader election enabled at the beginning, but then
> disabled it and also reduced the 'max.poll.records' value. It helped a
> little.
>
> But after today's testing there is a strong correlation between the
> lag spike and which broker we crash. For the broker with the lowest
> ID (100):
>
> 1. the lag is always at least 1-2 orders of magnitude higher
> 2. we start getting:
>
>    org.apache.kafka.clients.consumer.CommitFailedException: Commit
>    cannot be completed since the group has already rebalanced and
>    assigned the partitions to another member. This means that the time
>    between subsequent calls to poll() was longer than the configured
>    max.poll.interval.ms, which typically implies that the poll loop is
>    spending too much time message processing. You can address this
>    either by increasing the session timeout or by reducing the maximum
>    size of batches returned in poll() with max.poll.records.
>
> 3. sometimes re-consumption from a random position
>
> When we crash the other brokers (101, 102), there is just a lag spike
> on the order of ~10K that settles down quite quickly, with no consumer
> exceptions.
>
> Totally lost as to what to try next.
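For reference, the two knobs that exception message points at (max.poll.records and max.poll.interval.ms), combined with an explicit commit-after-processing loop, look like this on a plain 0.11 kafka-clients consumer. A minimal sketch; the broker address, group id, topic name and handler are placeholders:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ManualCommitLoop {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-100:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                 // placeholder
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        // Smaller batches finish faster between poll() calls...
        props.put(ConsumerConfig.MAX_POLL_RECORDS_CONFIG, "100");
        // ...and a longer allowed gap between poll() calls gives slow handlers headroom.
        props.put(ConsumerConfig.MAX_POLL_INTERVAL_MS_CONFIG, "300000");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // placeholder topic
            while (true) {
                // poll(long) is the 0.11-era signature (replaced by poll(Duration) in 2.0).
                ConsumerRecords<String, String> records = consumer.poll(500);
                for (ConsumerRecord<String, String> record : records) {
                    handle(record); // your event handler
                }
                consumer.commitSync(); // commit only after the whole batch is processed
            }
        }
    }

    private static void handle(ConsumerRecord<String, String> record) {
        // process the event (idempotently, per the discussion below)
    }
}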
On Sat, Oct 7, 2017 at 2:41 AM, tao xiao <xiaotao...@gmail.com> wrote:
> Do you have unclean leader election turned on? If killing 100 is the
> only way to reproduce the problem, it is possible with unclean leader
> election turned on that leadership was transferred to an out-of-ISR
> follower which may not have the latest high watermark.

On Sat, Oct 7, 2017 at 3:51 AM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> About to verify the hypothesis on Monday, but it looks like that in
> the latest tests. Need to double-check.

On Fri, Oct 6, 2017 at 11:25 PM, Stas Chizhov <schiz...@gmail.com> wrote:
> So no matter in what sequence you shut down brokers, it is only one
> that causes the major problem? That would indeed be a bit weird. Have
> you checked the offsets of your consumer right after they jump back:
> does it start from the beginning of the topic, or does it go back to
> some random position? Have you checked whether all offsets are
> actually being committed by the consumers?

On Fri, 6 Oct 2017 at 20:59, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> Yeah, probably we can dig around.
>
> One more observation: the most lag/re-consumption trouble happens when
> we kill the broker with the lowest id (e.g. 100 from [100, 101, 102]).
> When crashing the other brokers, nothing special happens; the lag
> grows a little, but nothing crazy (e.g. thousands, not millions).
>
> Does that sound suspicious?

On Fri, Oct 6, 2017 at 9:23 PM, Stas Chizhov <schiz...@gmail.com> wrote:
> Ted: when choosing earliest/latest you are saying: if it happens that
> there is no "valid" offset committed for a consumer (for whatever
> reason: bug/misconfiguration/no luck), it will be ok to start from the
> beginning or the end of the topic. So if you are not ok with that, you
> should choose none.
>
> Dmitriy: OK. Then it is spring-kafka that maintains this
> offset-per-partition state for you. It might also have that problem of
> leaving stale offsets lying around. After quickly looking through
> https://github.com/spring-projects/spring-kafka/blob/1945f29d5518e3c4a9950ba82135420dfb61e808/spring-kafka/src/main/java/org/springframework/kafka/listener/KafkaMessageListenerContainer.java
> it looks possible, since the offsets map is not cleared upon partition
> revocation, but that is just a hypothesis: I have no experience with
> spring-kafka. However, since you say your consumers were always
> active, I find this theory worth investigating.

2017-10-06 18:20 GMT+02:00 Vincent Dautremont <vincent.dautrem...@olamobile.com.invalid>:
> Is there a way to read messages on a topic partition from a specific
> node that we choose (and not from the topic partition leader)?
> I would like to verify myself that each of the __consumer_offsets
> partition replicas has the same consumer group offset written in it.
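On the stale-offsets hypothesis above: if application code keeps its own per-partition offset map, the usual guard is to drop entries when partitions are revoked, so nothing stale can be committed after a rebalance. A sketch of the idea only, not spring-kafka's actual implementation; all names here are illustrative:

import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class OffsetTrackingListener implements ConsumerRebalanceListener {
    // Offsets recorded after processing, pending commit.
    private final Map<TopicPartition, OffsetAndMetadata> pending = new HashMap<>();

    @Override
    public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
        // Without this, an entry recorded before the rebalance could be
        // committed later for a partition this consumer no longer owns -
        // exactly the kind of stale commit hypothesized above.
        pending.keySet().removeAll(partitions);
    }

    @Override
    public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
        // Positions are restored from the committed offsets; nothing to do here.
    }

    public void record(TopicPartition tp, long nextOffsetToRead) {
        pending.put(tp, new OffsetAndMetadata(nextOffsetToRead));
    }

    public Map<TopicPartition, OffsetAndMetadata> drain() {
        Map<TopicPartition, OffsetAndMetadata> out = new HashMap<>(pending);
        pending.clear();
        return out;
    }
}

The listener would be attached with consumer.subscribe(topics, listener), and drain() would feed consumer.commitSync(offsets).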
On Fri, Oct 6, 2017 at 6:08 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> Stas:
>
> we rely on spring-kafka; it commits offsets "manually" for us after
> the event handler completes. So it's kind of automatic once there is a
> constant stream of events (no idle time, which is true for us), though
> it's not what the pure kafka-client calls "automatic" (flushing
> commits at fixed intervals).

On Fri, Oct 6, 2017 at 7:04 PM, Stas Chizhov <schiz...@gmail.com> wrote:
> You don't have autocommit enabled, which means you commit offsets
> yourself - correct? If you store them per partition somewhere and fail
> to clean that up upon rebalance, then the next time the consumer gets
> this partition assigned it can commit an old stale offset - can this
> be the case?

On Fri, 6 Oct 2017 at 17:59, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> Reprocessing the same events again is fine for us (idempotent), while
> losing data is more critical.
>
> What are the reasons for such behaviour? Consumers are never idle,
> always committing - probably something wrong with the broker setup
> then?

On Fri, Oct 6, 2017 at 6:58 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> Stas:
> bq. using anything but none is not really an option
>
> If you have time, can you explain a bit more?
>
> Thanks

On Fri, Oct 6, 2017 at 8:55 AM, Stas Chizhov <schiz...@gmail.com> wrote:
> If you set auto.offset.reset to none, next time it happens you will be
> in a much better position to find out what happened. Also, in general,
> with the current semantics of the offset reset policy, IMO using
> anything but none is not really an option unless it is ok for the
> consumer to lose some data (latest) or reprocess it a second time
> (earliest).
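To make that concrete: with auto.offset.reset=none, a missing committed offset surfaces as an exception instead of a silent jump, so any reset becomes an explicit, logged decision. A minimal sketch; the recovery choice at the end is just an example, and the broker, group and topic names are placeholders:

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.NoOffsetForPartitionException;

public class NoResetConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-100:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                 // placeholder
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        // "none": fail loudly when no valid committed offset exists, instead
        // of silently rewinding (earliest) or skipping ahead (latest).
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "none");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // placeholder topic
            while (true) {
                try {
                    ConsumerRecords<String, String> records = consumer.poll(500);
                    records.forEach(r -> { /* handle the event */ });
                    consumer.commitSync();
                } catch (NoOffsetForPartitionException e) {
                    // Committed offsets are missing (expired, lost, or never written).
                    // Alert first, then reset deliberately - rewinding is an
                    // explicit choice here, not a default.
                    System.err.println("No committed offset: " + e.getMessage());
                    consumer.seekToBeginning(consumer.assignment());
                }
            }
        }
    }
}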
On Fri, 6 Oct 2017 at 17:44, Ted Yu <yuzhih...@gmail.com> wrote:
> Should Kafka log a warning if log.retention.hours is lower than the
> number of hours specified by offsets.retention.minutes?

On Fri, Oct 6, 2017 at 8:35 AM, Manikumar <manikumar.re...@gmail.com> wrote:
> Normally, log.retention.hours (168 hrs) should be higher than
> offsets.retention.minutes (336 hrs)?

On Fri, Oct 6, 2017 at 8:58 PM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> Hi Ted,
>
> Broker: v0.11.0.0
>
> Consumer:
> kafka-clients v0.11.0.0
> auto.offset.reset = earliest

On Fri, Oct 6, 2017 at 6:24 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> What's the value for auto.offset.reset?
>
> Which release are you using?
>
> Cheers

On Fri, Oct 6, 2017 at 7:52 AM, Dmitriy Vsekhvalnov <dvsekhval...@gmail.com> wrote:
> Hi all,
>
> we have several times faced a situation where a consumer group started
> to re-consume old events from the beginning. Here is the scenario:
>
> 1. x3 broker Kafka cluster on top of an x3 node ZooKeeper
> 2. RF=3 for all topics
> 3. log.retention.hours=168 and offsets.retention.minutes=20160
> 4. running sustained load (pushing events)
> 5. doing disaster testing by randomly shutting down 1 of the 3 broker
>    nodes (then provisioning a new broker back)
>
> Several times after bouncing a broker we faced a situation where the
> consumer group started to re-consume old events.
>
> Consumer group:
>
> 1. enable.auto.commit = false
> 2. tried graceful group shutdown, kill -9 and terminating AWS nodes
> 3. never experienced re-consumption in those cases
>
> What can cause that re-consumption of old events? Is it related to
> bouncing one of the brokers? What to search for in the logs? Any
> broker settings to try?
>
> Thanks in advance.
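One cheap diagnostic for this scenario, in the spirit of the offset questions above, is to dump what the group actually has committed next to the log end offsets right after a broker bounce; a backwards jump shows up immediately. A sketch with placeholder names (note it reads via the partition leaders; fetching a specific replica's copy of __consumer_offsets, as asked earlier, is not possible with the regular 0.11 consumer):

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.PartitionInfo;
import org.apache.kafka.common.TopicPartition;

public class OffsetAudit {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker-100:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "my-group");                 // group to audit
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> parts = new ArrayList<>();
            for (PartitionInfo p : consumer.partitionsFor("events")) { // placeholder topic
                parts.add(new TopicPartition(p.topic(), p.partition()));
            }
            Map<TopicPartition, Long> ends = consumer.endOffsets(parts);
            for (TopicPartition tp : parts) {
                OffsetAndMetadata committed = consumer.committed(tp); // null if none
                System.out.printf("%s committed=%s end=%d%n", tp,
                        committed == null ? "none" : committed.offset(), ends.get(tp));
            }
        }
    }
}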