Forgot to mention: we are using Kafka 0.8.1.1.

Jason

On Thu, Nov 6, 2014 at 9:31 AM, Jason Rosenberg <j...@squareup.com> wrote:

> Hi,
>
> We recently had a Kafka node go down suddenly. When it came back up, it
> apparently had a corrupt recovery file, and refused to start up:
>
> 2014-11-06 08:17:19,299 WARN [main] server.KafkaServer - Error starting up KafkaServer
> java.lang.NumberFormatException: For input string:
> "^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
> ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@"
>     at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>     at java.lang.Integer.parseInt(Integer.java:481)
>     at java.lang.Integer.parseInt(Integer.java:527)
>     at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
>     at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
>     at kafka.server.OffsetCheckpoint.read(OffsetCheckpoint.scala:76)
>     at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:106)
>     at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:105)
>     at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>     at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
>     at kafka.log.LogManager.loadLogs(LogManager.scala:105)
>     at kafka.log.LogManager.<init>(LogManager.scala:57)
>     at kafka.server.KafkaServer.createLogManager(KafkaServer.scala:275)
>     at kafka.server.KafkaServer.startup(KafkaServer.scala:72)
>
> The app runs under a monitor, so it was repeatedly restarting and failing
> with this error for several minutes before we got to it.
>
> We moved the ‘recovery-point-offset-checkpoint’ file out of the way, and
> it then restarted cleanly (but of course re-synced all its data from
> replicas, so we had no data loss).
>
> Anyway, I’m wondering if that’s the expected behavior? Or shouldn't it
> declare the file corrupt and then proceed automatically to an unclean
> restart?
>
> Should this NumberFormatException be handled a bit more gracefully?
>
> We saved the corrupt file if it’s worth inspecting (although I doubt it
> will be useful!).
>
> Jason
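
To make the "handled a bit more gracefully" question concrete, here is a minimal sketch of a tolerant checkpoint parser. It is not Kafka's actual code: the class name `CheckpointReader` and the method `parse` are hypothetical, and the file layout it assumes (a version line, an entry-count line, then one `topic partition offset` line per entry) is an assumption about the 0.8.x checkpoint format. The idea is only that a parse failure, such as a file full of NUL bytes, is reported as "no usable checkpoint" instead of escaping as a NumberFormatException during startup.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Kafka: parses checkpoint lines defensively.
final class CheckpointReader {

    // Returns the parsed offsets keyed by "topic-partition", or Optional.empty()
    // if the content is truncated, malformed, or otherwise looks corrupt.
    static Optional<Map<String, Long>> parse(List<String> lines) {
        try {
            if (lines.size() < 2) {
                return Optional.empty(); // missing version or entry-count line
            }
            // trim() also strips NUL characters, so an all-NUL line becomes ""
            // and parseInt throws NumberFormatException, which is caught below.
            int version = Integer.parseInt(lines.get(0).trim());
            if (version != 0) {
                return Optional.empty(); // assumed format: version 0 only
            }
            int expected = Integer.parseInt(lines.get(1).trim());

            Map<String, Long> offsets = new LinkedHashMap<>();
            for (String line : lines.subList(2, lines.size())) {
                String trimmed = line.trim();
                if (trimmed.isEmpty()) {
                    continue; // tolerate blank trailing lines
                }
                String[] parts = trimmed.split("\\s+");
                if (parts.length != 3) {
                    return Optional.empty();
                }
                int partition = Integer.parseInt(parts[1]);
                long offset = Long.parseLong(parts[2]);
                offsets.put(parts[0] + "-" + partition, offset);
            }
            // An entry count that does not match suggests a partial write.
            return offsets.size() == expected ? Optional.of(offsets) : Optional.empty();
        } catch (NumberFormatException e) {
            return Optional.empty(); // corrupt content: caller decides how to recover
        }
    }
}
```

A caller that gets an empty result could log a warning and fall back to full log recovery, which is roughly what moving the file out of the way achieved by hand. Whether the broker should do that automatically is exactly the open question above.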