Forgot to mention: we are using Kafka 0.8.1.1.

Jason

On Thu, Nov 6, 2014 at 9:31 AM, Jason Rosenberg <j...@squareup.com> wrote:

> Hi,
>
> We recently had a Kafka node go down suddenly. When it came back up, it
> apparently had a corrupt recovery file, and refused to start up:
>
> 2014-11-06 08:17:19,299  WARN [main] server.KafkaServer - Error starting up KafkaServer
> java.lang.NumberFormatException: For input string: 
> "^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
> ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@"
>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>         at java.lang.Integer.parseInt(Integer.java:481)
>         at java.lang.Integer.parseInt(Integer.java:527)
>         at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
>         at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
>         at kafka.server.OffsetCheckpoint.read(OffsetCheckpoint.scala:76)
>         at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:106)
>         at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:105)
>         at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>         at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
>         at kafka.log.LogManager.loadLogs(LogManager.scala:105)
>         at kafka.log.LogManager.<init>(LogManager.scala:57)
>         at kafka.server.KafkaServer.createLogManager(KafkaServer.scala:275)
>         at kafka.server.KafkaServer.startup(KafkaServer.scala:72)
>
> Since the app runs under a monitor, it was repeatedly restarting and
> failing with this error for several minutes before we got to it.
>
> We moved the ‘recovery-point-offset-checkpoint’ file out of the way, and
> it then restarted cleanly (but of course re-synced all its data from
> replicas, so we had no data loss).
>
> Anyway, I’m wondering whether that’s the expected behavior. Or shouldn’t it
> declare the file corrupt and then proceed automatically to an unclean restart?
>
> Should this NumberFormatException be handled a bit more gracefully?
>
> We saved the corrupt file in case it’s worth inspecting (although I doubt it
> will be useful).
>
> Jason
>
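
To make the "handled a bit more gracefully" question concrete, here is a rough sketch of what a tolerant checkpoint read could look like. This is not Kafka's actual OffsetCheckpoint code: CheckpointReader, parseOrEmpty and readOrEmpty are hypothetical names, and the file layout (a version line, an entry-count line, then one "topic partition offset" line per entry) is my assumption about the 0.8.x format. The idea is just that an unparseable file is logged and treated as "no checkpoint" instead of aborting startup.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Kafka.
final class CheckpointReader {

    // Assumed layout: line 1 = version (0), line 2 = entry count,
    // then one "topic partition offset" line per entry.
    // Returns an empty map whenever the content cannot be trusted.
    public static Map<String, Long> parseOrEmpty(List<String> lines) {
        Map<String, Long> empty = Collections.emptyMap();
        try {
            if (lines.size() < 2) {
                return empty;
            }
            int version = Integer.parseInt(lines.get(0).trim());
            if (version != 0) {
                return empty;
            }
            int expected = Integer.parseInt(lines.get(1).trim());
            Map<String, Long> offsets = new HashMap<>();
            for (String line : lines.subList(2, lines.size())) {
                if (line.trim().isEmpty()) {
                    continue; // tolerate trailing blank lines
                }
                String[] parts = line.trim().split("\\s+");
                if (parts.length != 3) {
                    return empty;
                }
                int partition = Integer.parseInt(parts[1]);
                offsets.put(parts[0] + "-" + partition, Long.parseLong(parts[2]));
            }
            if (offsets.size() != expected) {
                return empty; // count mismatch: treat the file as corrupt
            }
            return offsets;
        } catch (NumberFormatException e) {
            // This is the failure from the stack trace above (a file full of NUL bytes).
            System.err.println("Checkpoint unreadable, ignoring it: " + e.getMessage());
            return empty;
        }
    }

    // Reads the file and falls back to "no checkpoint" on I/O errors as well.
    public static Map<String, Long> readOrEmpty(Path file) {
        try {
            return parseOrEmpty(Files.readAllLines(file, StandardCharsets.UTF_8));
        } catch (IOException e) {
            System.err.println("Checkpoint unreadable, ignoring it: " + e.getMessage());
            return Collections.emptyMap();
        }
    }
}
```

With something along these lines, an empty result would put the broker in roughly the same position as moving the file out of the way by hand, just without the restart loop. Whether silently ignoring a corrupt checkpoint is actually the right policy is exactly the open question.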
