I ran into the same issue today. In a production cluster, I noticed the "Shrinking ISR for partition" log messages for a topic deleted two months ago. Our staging cluster shows the same messages for all the topics deleted in that cluster. Both clusters are on 0.8.2.
Yifan, Guozhang, did you find a way to get rid of them?

Thanks in advance,
Alexis

On Tue, Apr 5, 2016 at 4:16 PM Guozhang Wang <wangg...@gmail.com> wrote:
> It is possible; there are some discussions about a similar issue in a KIP:
>
> https://cwiki.apache.org/confluence/display/KAFKA/KIP-53+-+Add+custom+policies+for+reconnect+attempts+to+NetworkdClient
>
> and in its mailing thread:
>
> https://www.mail-archive.com/dev@kafka.apache.org/msg46868.html
>
> Guozhang
>
> On Tue, Apr 5, 2016 at 2:34 PM, Yifan Ying <nafan...@gmail.com> wrote:
>> Some updates:
>>
>> Yesterday, right after a release (producers and consumers reconnected to
>> Kafka/Zookeeper, but no code change in our producers and consumers), all
>> under-replication issues were resolved automatically and there was no more
>> high latency in either Kafka or Zookeeper. But right after today's release
>> (producers and consumers reconnected again), the under-replication and
>> high-latency issues happened again. So could the all-at-once reconnecting
>> from producers and consumers be causing the problem? And all of this has
>> only happened since I deleted a deprecated topic in production.
>>
>> Yifan
>>
>> On Tue, Apr 5, 2016 at 9:04 AM, Guozhang Wang <wangg...@gmail.com> wrote:
>>> These configs mainly depend on your publish throughput, since the
>>> replication throughput is upper-bounded by the publish throughput. If the
>>> publish throughput is not high, then setting lower threshold values in
>>> these two configs will cause churn in shrinking/expanding ISRs.
>>>
>>> Guozhang
>>>
>>> On Mon, Apr 4, 2016 at 11:55 PM, Yifan Ying <nafan...@gmail.com> wrote:
>>>> Thanks for replying, Guozhang. We did increase both settings:
>>>>
>>>> replica.lag.max.messages=20000
>>>> replica.lag.time.max.ms=20000
>>>>
>>>> But we're not sure if these are good enough. And yes, that's a good
>>>> suggestion to monitor ZK performance.
>>>>
>>>> Thanks.
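For reference, the two settings Yifan quotes are broker-side properties. A sketch of the relevant `server.properties` fragment, using the values from the thread (in the 0.8.x line the shipped defaults were `replica.lag.max.messages=4000` and `replica.lag.time.max.ms=10000` — worth double-checking against the docs for your exact version):

```properties
# server.properties (broker config) -- values taken from the thread above.
# A follower is dropped from the ISR if it falls more than this many
# messages behind the leader (removed in later releases in favor of
# the time-based check alone)...
replica.lag.max.messages=20000
# ...or if it has not caught up within this many milliseconds.
replica.lag.time.max.ms=20000
```

As Guozhang notes below, whether these values are "good enough" depends on publish throughput: thresholds that are tight relative to it cause constant ISR shrink/expand churn.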
>>>> On Mon, Apr 4, 2016 at 8:58 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>> Hmm, it seems like your broker configs "replica.lag.max.messages" and
>>>>> "replica.lag.time.max.ms" are misconfigured relative to your replication
>>>>> traffic, and the deletion of the topic actually pushed it below the
>>>>> threshold. What are the values of these two configs? And could you try
>>>>> increasing them to see if that helps?
>>>>>
>>>>> In 0.8.2.1, kafka-consumer-offset-checker.sh accesses ZK to query the
>>>>> consumer offsets one by one, so if your ZK read latency is high it can
>>>>> take a long time. You may want to monitor your ZK cluster's performance
>>>>> to check its read/write latencies.
>>>>>
>>>>> Guozhang
>>>>>
>>>>> On Mon, Apr 4, 2016 at 10:59 AM, Yifan Ying <nafan...@gmail.com> wrote:
>>>>>> Hi Guozhang,
>>>>>>
>>>>>> It's 0.8.2.1. So it should be fixed? We also tried to start from
>>>>>> scratch by wiping out the data directory on both Kafka and Zookeeper.
>>>>>> It's odd that the constant shrinking and expanding happened after a
>>>>>> fresh restart, and high request latency as well. The brokers are using
>>>>>> the same config as before the topic deletion.
>>>>>>
>>>>>> Another observation is that using kafka-consumer-offset-checker.sh is
>>>>>> extremely slow. Any suggestion would be appreciated! Thanks.
>>>>>>
>>>>>> On Sun, Apr 3, 2016 at 2:29 PM, Guozhang Wang <wangg...@gmail.com> wrote:
>>>>>>> Yifan,
>>>>>>>
>>>>>>> Are you on 0.8.0 or 0.8.1/2? There are some issues with zkVersion
>>>>>>> checking in 0.8.0 that are fixed in later minor releases of 0.8.
>>>>>>>
>>>>>>> Guozhang
>>>>>>>
>>>>>>> On Fri, Apr 1, 2016 at 7:46 PM, Yifan Ying <nafan...@gmail.com> wrote:
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> We deleted a deprecated topic on a Kafka cluster (0.8) and started
>>>>>>>> observing constant 'Expanding ISR for partition' and 'Shrinking ISR
>>>>>>>> for partition' messages for other topics. As a result we saw a huge
>>>>>>>> number of under-replicated partitions and very high request latency
>>>>>>>> from Kafka, and the cluster doesn't seem able to recover by itself.
>>>>>>>>
>>>>>>>> Anyone know what caused this issue and how to resolve it?
>>>>>>>
>>>>>>> --
>>>>>>> -- Guozhang
>>
>> --
>> Yifan
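Guozhang's suggestion above to watch ZooKeeper read/write latencies can be scripted against ZooKeeper's `mntr` four-letter command (available since ZK 3.4, e.g. `echo mntr | nc <zk-host> 2181`). A minimal sketch in Python that parses that output — the sample response and the 10 ms alert threshold below are made up for illustration:

```python
def parse_mntr(raw: str) -> dict:
    """Parse `mntr` output (tab-separated key/value lines) into a dict.

    Numeric values are converted to int; everything else stays a string.
    """
    stats = {}
    for line in raw.strip().splitlines():
        key, _, value = line.partition("\t")
        stats[key] = int(value) if value.lstrip("-").isdigit() else value
    return stats


# Sample response, made up for illustration; pipe in real `mntr` output.
SAMPLE = (
    "zk_version\t3.4.6-1569965\n"
    "zk_avg_latency\t12\n"
    "zk_max_latency\t210\n"
    "zk_min_latency\t0\n"
    "zk_outstanding_requests\t0\n"
)

stats = parse_mntr(SAMPLE)
# Flag the ensemble if average latency looks high (threshold is arbitrary --
# tune it to your own baseline).
if stats["zk_avg_latency"] > 10:
    print(f"ZK avg latency high: {stats['zk_avg_latency']} ms")
```

Sustained high `zk_avg_latency` here would be consistent with both the slow offset checker (which in 0.8.2.1 reads offsets from ZK one by one) and general controller/ISR instability.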