Jason Rosenberg created KAFKA-1758:
--------------------------------------
Summary: corrupt recovery file prevents startup
Key: KAFKA-1758
URL: https://issues.apache.org/jira/browse/KAFKA-1758
Project: Kafka
Issue Type: Bug
Reporter: Jason Rosenberg
Hi,
We recently had a kafka node go down suddenly. When it came back up, it
apparently had a corrupt recovery file, and refused to startup:
{code}
2014-11-06 08:17:19,299 WARN [main] server.KafkaServer - Error starting up
KafkaServer
java.lang.NumberFormatException: For input string:
"^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@"
at
java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Integer.parseInt(Integer.java:481)
at java.lang.Integer.parseInt(Integer.java:527)
at
scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
at kafka.server.OffsetCheckpoint.read(OffsetCheckpoint.scala:76)
at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:106)
at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:105)
at
scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
at kafka.log.LogManager.loadLogs(LogManager.scala:105)
at kafka.log.LogManager.<init>(LogManager.scala:57)
at kafka.server.KafkaServer.createLogManager(KafkaServer.scala:275)
at kafka.server.KafkaServer.startup(KafkaServer.scala:72)
{code}
And the app is under a monitor (so it was repeatedly restarting and failing
with this error for several minutes before we got to it)…
We moved the ‘recovery-point-offset-checkpoint’ file out of the way, and it
then restarted cleanly (but of course re-synced all it’s data from replicas, so
we had no data loss).
Anyway, I’m wondering if that’s the expected behavior? Or should it not declare
it corrupt and then proceed automatically to an unclean restart?
Should this NumberFormatException be handled a bit more gracefully?
We saved the corrupt file if it’s worth inspecting (although I doubt it will be
useful!)….
The corrupt files appeared to be all zeroes.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)