Hi Tom,

Thanks for your question!

In Kafka 3.5 there was an overhaul of the offset translation logic to avoid a data loss bug. This [1] is the ticket that I've been using to represent that overhaul. In the implementation for that fix, we traded a lot of availability (read: consumer lag) for correctness (the translated offset is never too far ahead).

We've been working to claw some of that availability back, most notably through [2] in 3.6.0, [3] in 3.7.0, and [4] in 3.8.0, but it's still not back to original levels. I would definitely suggest trying 3.8.0 to see if it is suitable for your use-case, as that last fix, KAFKA-15905, is very impactful for MirrorMaker instances that undergo restarts or rebalances.

From the description "Some topics are dozen or so behind, others are hundreds of messages behind", it is possible that the translation is already working to the best of its ability, or that it may benefit from a lower `offset.lag.max`; without more information I can't be sure. I do know that running versions 3.5-3.7 with `offset.lag.max=0` is not sufficient to get good translation; the latest patches are quite important.

There are open issues [5, 6] for future improvements to the algorithm, but there hasn't been much movement on those recently.

With the current implementation, I would expect the target consumer lag to be approximately double the source consumer lag. The only time you should expect "perfect" translation is for a consumer group that has committed at the very end of a stable topic.

Thanks,
Greg

[1] https://issues.apache.org/jira/browse/KAFKA-12468
[2] https://issues.apache.org/jira/browse/KAFKA-15202
[3] https://issues.apache.org/jira/browse/KAFKA-15906
[4] https://issues.apache.org/jira/browse/KAFKA-15905
[5] https://issues.apache.org/jira/browse/KAFKA-16364
[6] https://issues.apache.org/jira/browse/KAFKA-16641

On Thu, Aug 22, 2024 at 1:35 PM Harty, Tom A <tom.a.ha...@healthpartners.com> wrote:

> We’ve been using MM2 one way for Active/Passive clusters for several years now.
> We started running into issues with 3.5.1. It hasn’t been keeping up with consumer offsets like it used to.
> To test this out we’ve done rollbacks to 3.4.0, and the offset issue corrects itself. Looking at the issue log, it seems like some things around offset management have been corrected in 3.6.1 and 3.7.0. Unfortunately, we tried upgrading all the way to 3.7.0 and found the issue still remains.
> It doesn’t seem to matter if it’s the low-volume non-prod clusters or high-volume prod clusters. Some topics are dozen or so behind, others are hundreds of messages behind. When I look at the offsets-sync topic it seems to be producing meaningful data. And of course the messages themselves are fully in sync.
>
> Since we’re using Strimzi, we put the question to Jakob and he suggested changing offset.lag.max to different lower values, but that didn’t really move the needle.
>
> What changed between 3.4.0 and later versions? Are there configuration changes we should look at?
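
For illustration, the `offset.lag.max` setting discussed in this thread could be set in a dedicated MirrorMaker 2 properties file roughly as follows. This is only a minimal sketch: the cluster aliases `source` and `target` and the bootstrap addresses are placeholders that do not come from the thread, and `sync.group.offsets.enabled` is included on the assumption that translated offsets should be written to the target's consumer groups. On Strimzi the equivalent options would go in the connector configuration of the KafkaMirrorMaker2 resource rather than in a properties file.

```properties
# Placeholder cluster aliases and addresses (assumptions); replace with your own.
clusters = source, target
source.bootstrap.servers = source-kafka:9092
target.bootstrap.servers = target-kafka:9092

# One-way (active/passive) replication flow.
source->target.enabled = true
source->target.topics = .*

# How out-of-sync a remote partition may be before it is resynced.
# Lower values emit offset syncs more often.
offset.lag.max = 0

# Assumption: only needed if MM2 should apply translated offsets
# to consumer groups on the target cluster.
sync.group.offsets.enabled = true
```

As noted in the reply, on versions 3.5-3.7 lowering `offset.lag.max` alone is not expected to give good translation, so a setting like this is best tested together with an upgrade to 3.8.0.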