Hey, I think this is a known issue in Kafka 2.1.0. Check this out: https://issues.apache.org/jira/browse/KAFKA-7697. It has been fixed in 2.1.1.
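If you want to confirm you're actually hitting that deadlock before upgrading, a thread dump of the broker that goes quiet is the quickest check (plain jstack works fine). Below is only a rough sketch of a remote check, assuming the broker exposes JMX (e.g. started with JMX_PORT=9999); "broker-host", the port and the class name are placeholders for your environment. If I remember the ticket right, the stuck threads point at the read/write lock in kafka.cluster.Partition.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

// Connects to a broker's JMX port and asks its JVM for deadlocked threads.
// "broker-host" and port 9999 are placeholders; adjust for your setup.
public class BrokerDeadlockCheck {
    public static void main(String[] args) throws Exception {
        JMXServiceURL url =
                new JMXServiceURL("service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection conn = connector.getMBeanServerConnection();
            ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                    conn, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);

            // Note: findDeadlockedThreads() only reports deadlocks the JVM can prove
            // (monitors and ownable synchronizers), so a full jstack dump is still
            // worth taking in parallel.
            long[] deadlocked = threads.findDeadlockedThreads();
            if (deadlocked == null) {
                System.out.println("No JVM-detectable deadlock reported.");
                return;
            }
            for (ThreadInfo info : threads.getThreadInfo(deadlocked, Integer.MAX_VALUE)) {
                System.out.println(info); // prints the first frames of each stuck thread
            }
        } finally {
            connector.close();
        }
    }
}

If the stacks (from this or from jstack) show threads blocked in kafka.cluster.Partition, that matches the ticket and 2.1.1 should fix it.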
On Wed, Mar 13, 2019 at 12:25 AM Joe Ammann <[email protected]> wrote:

> Yes, to the best of our knowledge. We had the option to take the cluster
> down on Sat, so we stopped everything, upgraded the software, set the
> interbroker and message versions to 2.1 and restarted. On Sun we did
> another rolling restart of all 4 brokers.
>
> All looked good after that (and we had done the same in test and preprod,
> where we didn't see such behavior).
>
> CU, Joe
>
> On 13 March 2019 at 00:22:51 CET, "M. Manna" <[email protected]> wrote:
> >When you upgraded, did you follow the rolling upgrade instructions from
> >the Kafka site? And did you change the message version etc. as described
> >on the site?
> >
> >Thanks,
> >
> >
> >On Tue, 12 Mar 2019 at 23:20, Joe Ammann <[email protected]> wrote:
> >
> >> Hi all
> >>
> >> Last weekend we upgraded a small cluster (4 brokers) from 0.10.2.1
> >> to 2.1.0. Everything seemed to go well over the weekend.
> >>
> >> For 2 nights in a row we have now had strange behaviour on one of the 4
> >> nodes. At almost exactly 00:08 on both nights, 1 out of the 4 brokers
> >> stopped writing anything to any of the logs for about 7 minutes. Soon
> >> after, the other 3 brokers started issuing exceptions "Connection to 5
> >> was disconnected before the response was read" [1] for all the
> >> partitions for which this broker was the leader. So far so good. What I
> >> would have expected was that this broker would be "taken out of the
> >> cluster" and another broker would become the leader of the partitions
> >> in question.
> >>
> >> But after ~7 minutes, that broker started to log things again (all
> >> looked like normal processing) while the other 3 brokers kept logging
> >> messages "Failed to connect within 30000 ms" [2]. This kept going
> >> until we detected the situation in the morning. During this time, all
> >> listeners on partitions led by this broker did not get any messages
> >> (but also no errors). After the restart of the broker, things
> >> normalized and a flurry of messages that had been sent during the night
> >> was received by the listeners. So the messages had actually been
> >> produced and stored on the erroneous broker, just not delivered and
> >> replicated.
> >>
> >> What we discovered before restarting the broker was that lsof showed
> >> thousands of connections like
> >>
> >> java 2977050 user 2579u sock 0,6 0t0
> >> 517152881 can't identify protocol
> >>
> >> The total number of open files was very close to 4096 (but the ulimit
> >> of the process was actually 64k).
> >>
> >> Our best guess is that something in the JVM (not necessarily Kafka
> >> itself; we're running some monitoring tools in the JVM) did some kind
> >> of housekeeping around midnight, failed to close some connections
> >> properly (hence the many files with "can't identify protocol") and
> >> brought the JVM into a state where it was unable to accept new
> >> connections. It seemed like it could hold on to the established
> >> connection to Zookeeper; the node was registered under /brokers/ids
> >> when we checked in the morning. Still, we were very surprised that this
> >> broker kept working to some extent and that no failover to another
> >> leader was initiated.
> >>
> >> Does anybody have an idea what to look for? Maybe increase some log
> >> levels to get more insight into what's happening?
> >>
> >> --
> >> CU, Joe
> >>
> >> [1]
> >>
> >> [2019-03-11 00:10:36,528] WARN [ReplicaFetcher replicaId=4, leaderId=5,
> >> fetcherId=0] Error in response for fetch request (type=FetchRequest,
> >> replicaId=4, maxWait=500, minBytes=1, maxBytes=10485760,
> >> fetchData={__consumer_offsets-24=(offset=149155836, logStartOffset=0,
> >> maxBytes=1048576, currentLeaderEpoch=Optional[813])},
> >> isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=519957053,
> >> epoch=1244473)) (kafka.server.ReplicaFetcherThread)
> >> java.io.IOException: Connection to 5 was disconnected before the response was read
> >>         at org.apache.kafka.clients.NetworkClientUtils.sendAndReceive(NetworkClientUtils.java:97)
> >>         at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:97)
> >>         at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)
> >>         at kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)
> >>         at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)
> >>         at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)
> >>         at scala.Option.foreach(Option.scala:257)
> >>         at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
> >>         at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
> >>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> >>
> >> [2]
> >>
> >> [2019-03-11 02:59:51,144] WARN [ReplicaFetcher replicaId=4, leaderId=5,
> >> fetcherId=0] Error in response for fetch request (type=FetchRequest,
> >> replicaId=4, maxWait=500, minBytes=1, maxBytes=10485760,
> >> fetchData={__consumer_offsets-0=(offset=0, logStartOffset=0,
> >> maxBytes=1048576, currentLeaderEpoch=Optional[817]),
> >> __consumer_offsets-28=(offset=0, logStartOffset=0, maxBytes=1048576,
> >> currentLeaderEpoch=Optional[873]), PR4_CLONE_LIST_1-0=(offset=0,
> >> logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[567]),
> >> __consumer_offsets-4=(offset=860199640, logStartOffset=0, maxBytes=1048576,
> >> currentLeaderEpoch=Optional[868]), __consumer_offsets-16=(offset=0,
> >> logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[879]),
> >> __consumer_offsets-36=(offset=0, logStartOffset=0, maxBytes=1048576,
> >> currentLeaderEpoch=Optional[825]), __consumer_offsets-12=(offset=0,
> >> logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[815]),
> >> __consumer_offsets-40=(offset=0, logStartOffset=0, maxBytes=1048576,
> >> currentLeaderEpoch=Optional[864]), __consumer_offsets-48=(offset=5,
> >> logStartOffset=0, maxBytes=1048576, currentLeaderEpoch=Optional[817]),
> >> __consumer_offsets-24=(offset=149155836, logStartOffset=0,
> >> maxBytes=1048576, currentLeaderEpoch=Optional[813])},
> >> isolationLevel=READ_UNCOMMITTED, toForget=, metadata=(sessionId=519957053,
> >> epoch=INITIAL)) (kafka.server.ReplicaFetcherThread)
> >> java.net.SocketTimeoutException: Failed to connect within 30000 ms
> >>         at kafka.server.ReplicaFetcherBlockingSend.sendRequest(ReplicaFetcherBlockingSend.scala:93)
> >>         at kafka.server.ReplicaFetcherThread.fetchFromLeader(ReplicaFetcherThread.scala:190)
> >>         at kafka.server.AbstractFetcherThread.kafka$server$AbstractFetcherThread$$processFetchRequest(AbstractFetcherThread.scala:241)
> >>         at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:130)
> >>         at kafka.server.AbstractFetcherThread$$anonfun$maybeFetch$1.apply(AbstractFetcherThread.scala:129)
> >>         at scala.Option.foreach(Option.scala:257)
> >>         at kafka.server.AbstractFetcherThread.maybeFetch(AbstractFetcherThread.scala:129)
> >>         at kafka.server.AbstractFetcherThread.doWork(AbstractFetcherThread.scala:111)
> >>         at kafka.utils.ShutdownableThread.run(ShutdownableThread.scala:82)
> >>

--
Thanks,
Ankur Rana
Software Developer
FarEye
