I will provide what I can (we don't have separate logs for the controller, etc.; it's all integrated into a single log via log4j for us, since we embed Kafka in a Java container). Are there specific log classes you'd be interested in seeing? (I can look at the default log4j configs to see what's normally set up for the 'controller' log.)
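For reference, I believe the stock config/log4j.properties that ships with Kafka splits the controller classes out into their own controller.log, roughly along these lines (quoting from memory, so please double-check against your 0.8.x distribution):

    # controller section of the stock config/log4j.properties (as I remember it; verify locally)
    log4j.appender.controllerAppender=org.apache.log4j.DailyRollingFileAppender
    log4j.appender.controllerAppender.DatePattern='.'yyyy-MM-dd-HH
    log4j.appender.controllerAppender.File=${kafka.logs.dir}/controller.log
    log4j.appender.controllerAppender.layout=org.apache.log4j.PatternLayout
    log4j.appender.controllerAppender.layout.ConversionPattern=[%d] %p %m (%c)%n

    log4j.logger.kafka.controller=TRACE, controllerAppender
    log4j.additivity.kafka.controller=false

Since we log everything to one file, grepping that file for the kafka.controller and state.change.logger classes should give us the equivalent of the separate controller and state-change logs.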
We have most of the logs saved away (but not all of them). Sorry, I didn't mean to hijack this thread (so I will start a new thread soon).

Jason

On Fri, Apr 3, 2015 at 1:37 PM, Jiangjie Qin <j...@linkedin.com.invalid> wrote:

> This sounds like a very serious issue. Could you provide the controller log and the log for the first broker on which you tried controlled shutdown and upgrade?

> On 4/3/15, 8:57 AM, "Jason Rosenberg" <j...@squareup.com> wrote:

> >I'm preparing a longer post here, but we recently ran into a similar scenario. Not sure yet if it's the same thing you saw (but it feels similar). We were also doing a rolling upgrade from 0.8.1.1 to 0.8.2.1, and during the controlled shutdown of the first node (of a 4-node cluster), the controlled shutdown was taking longer than normal (it timed out several times and was retrying controlled shutdown), and unfortunately, our deployment system decided to kill it hard (so it was in the middle of its 4th controlled shutdown retry, etc.).

> >Anyway, when the node came back, it naturally decided to 'restore' most of its partitions, which took some time (but only about 5 minutes). What's weird is that it didn't resync data from other replicas; instead it just restored the partitions locally. During this time, the rest of the cluster failed to elect any new leaders, so for 5 minutes those partitions were unavailable (and we saw a flood of failed FetcherManager exceptions from the other nodes in the cluster). Most of the partitions were empty (i.e. there's no way the other replicas were behind and not in the ISR normally). During these 5 minutes, producers were unable to send messages due to NotLeaderForPartition exceptions. Apparently the controller was still pointing them at the unavailable broker.

> >When the first node finally came up, the other nodes were somewhat happy again (but a few partitions remained under-replicated indefinitely). Because of this, we decided to pause the rolling restart and try to wait for the under-replicated partitions to get in sync. Unfortunately, about an hour later, the whole cluster went foobar (e.g. partitions became unavailable, brokers logged a flood of Fetcher errors, producers couldn't find a valid leader, metadata requests timed out, etc.). In a panic, we reverted that first node back to 0.8.1.1. This did not help, unfortunately, so, deciding we'd probably already lost data at this point (and producers could not send data due to NotLeaderForPartition exceptions), we decided to just forcibly do the upgrade to 0.8.2.1. This was all a bad situation, of course.

> >So, now we have the cluster stable at 0.8.2.1, but like you, we are very, very nervous about doing any kind of restart to any of our nodes. We lost data, primarily in the form of producers failing to send during the periods of unavailability.

> >It looks like the root cause, in our case, was a flood of topics created (long since unused and empty). This appears to have caused the longer-than-normal controlled shutdown, which, in turn, led to the follow-on problems. However, in the past, we've seen a controlled shutdown failure result in an unclean shutdown, but usually the cluster recovers (e.g. it elects new leaders, and when the new node comes back, it recovers its partitions that were uncleanly shut down).
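(A quick aside on the retries mentioned above, in case it helps anyone reproducing this: controlled shutdown is governed by a handful of broker settings. The ones I'm aware of on 0.8.x are below; the values are only examples, so please check the defaults for your version rather than taking these as gospel.)

    # broker server.properties (example values only; verify against your 0.8.x docs/defaults)
    controlled.shutdown.enable=true
    controlled.shutdown.max.retries=3
    controlled.shutdown.retry.backoff.ms=5000

(The effective window is also bounded by however long your deploy tooling waits before a hard kill, which is what bit us.)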
> >That did not happen this time (the rest of the cluster got into an apparent infinite loop where it tried repeatedly (e.g. 500K times a minute) to fetch partitions that were unavailable).

> >I'm preparing a longer post with more detail (will take a bit of time).

> >Jason

> >On Thu, Apr 2, 2015 at 10:19 PM, Gwen Shapira <gshap...@cloudera.com> wrote:

> >> wow, that's scary for sure.

> >> Just to be clear - all you did was restart *one* broker in the cluster? Everything else was OK before the restart? And that was a controlled shutdown?

> >> Gwen

> >> On Wed, Apr 1, 2015 at 11:54 AM, Thunder Stumpges <tstump...@ntent.com> wrote:

> >> > Well, it appears we lost all the data on the one node again. It appears to be all or part of KAFKA-1647 (https://issues.apache.org/jira/browse/KAFKA-1647), as we saw this in our logs (for all topics):

> >> > [2015-04-01 10:46:58,901] WARN Partition [logactivity-redirect,3] on broker 6: No checkpointed highwatermark is found for partition [logactivity-redirect,3] (kafka.cluster.Partition)
> >> > [2015-04-01 10:46:58,902] WARN Partition [pageimpression,1] on broker 6: No checkpointed highwatermark is found for partition [pageimpression,1] (kafka.cluster.Partition)
> >> > [2015-04-01 10:46:58,904] WARN Partition [campaignplatformtarget,6] on broker 6: No checkpointed highwatermark is found for partition [campaignplatformtarget,6] (kafka.cluster.Partition)
> >> > [2015-04-01 10:46:58,905] WARN Partition [trackingtags-c2,1] on broker 6: No checkpointed highwatermark is found for partition [trackingtags-c2,1] (kafka.cluster.Partition)

> >> > Followed by:

> >> > [2015-04-01 10:46:58,911] INFO Truncating log trafficshaperlog-3 to offset 0. (kafka.log.Log)
> >> > [2015-04-01 10:46:58,928] INFO Truncating log videorecrequest-0 to offset 0. (kafka.log.Log)
> >> > [2015-04-01 10:46:58,928] INFO Truncating log filteredredirect-2 to offset 0. (kafka.log.Log)
> >> > [2015-04-01 10:46:58,985] INFO Truncating log precheckrequest-3 to offset 0. (kafka.log.Log)
> >> > [2015-04-01 10:46:58,990] INFO Truncating log filteredclicklog-8 to offset 0. (kafka.log.Log)

> >> > Followed by:

> >> > [2015-04-01 10:46:59,107] INFO Scheduling log segment 90276704 for log pageview-0 for deletion. (kafka.log.Log)
> >> > [2015-04-01 10:46:59,107] INFO Scheduling log segment 90593329 for log pageview-0 for deletion. (kafka.log.Log)
> >> > [2015-04-01 10:46:59,107] INFO Scheduling log segment 90906048 for log pageview-0 for deletion. (kafka.log.Log)
> >> > [2015-04-01 10:46:59,107] INFO Scheduling log segment 91226773 for log pageview-0 for deletion. (kafka.log.Log)

> >> > The strange thing, however, is that I don't believe we ever had the scenario mentioned in the bug (all brokers for a topic down): we have 0 unclean leader elections, our applications never complained that the cluster was down, and we never actually stopped more than one node (the node in question).
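(Another aside, for anyone chasing the same 'No checkpointed highwatermark' warnings: those high watermarks are persisted in a replication-offset-checkpoint file in each data directory, and KAFKA-1647 is about that checkpoint effectively being lost around hard kills and restarts. A quick sanity check before/after bouncing a broker, assuming log.dirs points at /var/kafka/data and ZooKeeper is at zk1:2181, both placeholders for your own setup:)

    # checkpoint file layout: a version line, a count, then one "topic partition offset" entry per line
    head /var/kafka/data/replication-offset-checkpoint
    # ask the cluster which partitions it currently considers under-replicated
    bin/kafka-topics.sh --zookeeper zk1:2181 --describe --under-replicated-partitions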
> >> > Anyway, really unsure if this could have somehow been related to the attempted upgrade to 0.8.2.1 (the occurrence this morning was NOT; it was a simple stop and restart of the broker).

> >> > We are really nervous about doing any Kafka service restarts now that this has happened twice on this machine. Any suggestions? Should we go back and make another attempt to upgrade to 0.8.2.1?

> >> > Thanks,
> >> > Thunder

> >> > From: Thunder Stumpges
> >> > Sent: Wednesday, April 01, 2015 10:30 AM
> >> > To: kafka-us...@incubator.apache.org
> >> > Subject: Problem with node after restart no partitions?

> >> > Hi group,

> >> > We've been trying to track down a problem this morning for a little while, and thought I'd ask here while we keep looking.

> >> > We have 3 nodes (rep-3) running 0.8.1.1. We attempted a rolling upgrade yesterday to 0.8.2.1, and on the first node, after restarting, a single topic (a samza intermediate topic) started throwing replica fetcher errors over and over ("NotLeaderForPartition"). There may or may not have been other things attempted at this time (not by me, so I cannot say for sure). Anyway, we ended up rolling back to 0.8.1.1, and ALL data had been DELETED from that node. It spent most of yesterday re-syncing, came into sync last night, and a rebalance made everything run smoothly (except for these damn replica fetcher errors for that one partition).

> >> > Today my colleague attempted the "unsupported" topic delete command for the "bad" partition, and bounced that one troublesome node.

> >> > Upon coming up, I can see in server.log that it is reading in all of the segments, and then it starts spitting out a samza topic fetch error, and through JMX the "ReplicaManager".LeaderCount is 0. It is not attempting to fetch or load any topics.

> >> > The other two brokers are showing under-replicated (obviously). What is going wrong? How can we get that samza topic really and truly gone? (if that is the cause of the broker not coming up)

> >> > Thanks,
> >> > Thunder
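Re the question above about getting that samza topic really and truly gone: as I understand it, topic deletion isn't actually functional in 0.8.1.1, which is presumably why the 'unsupported' delete command didn't take. On 0.8.2.1 it is supposed to work once every broker runs with delete.topic.enable=true; after that, something like the following should do it (treat this as a pointer rather than a tested recipe; the ZooKeeper host and topic name are placeholders):

    # requires delete.topic.enable=true in server.properties on all brokers (restart to pick it up)
    bin/kafka-topics.sh --zookeeper zk1:2181 --delete --topic samza-intermediate-topic

If that flag isn't enabled, I believe the topic just gets marked for deletion and nothing ever actually removes it.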