[ 
https://issues.apache.org/jira/browse/KAFKA-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14310991#comment-14310991
 ] 

Jay Kreps edited comment on KAFKA-1758 at 2/7/15 11:13 PM:
-----------------------------------------------------------

This is actually not a very difficult change--in LogManager.loadLogs we would 
need to basically handle an error in reading the recovery checkpoint, log it, 
and then just treat it as though our recovery point was 0 (or something like 
that) for all logs.
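
A rough sketch of that idea follows, written in Java and purely illustrative. It is not the actual LogManager or OffsetCheckpoint code. The class and method names (RecoveryCheckpointSketch, readStrict, readOrEmpty, recoveryPoint) are hypothetical, and the file layout (a version line, an entry-count line, then one "topic partition offset" line per entry) is an assumption. Any failure to read or parse the checkpoint is logged and answered with an empty map, so every log falls back to a recovery point of 0.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; not Kafka's LogManager or OffsetCheckpoint.
final class RecoveryCheckpointSketch {

    // Strict parser for the ASSUMED layout: version line, entry-count line,
    // then one "topic partition offset" line per entry.
    static Map<String, Long> readStrict(Path file) throws IOException {
        List<String> lines = Files.readAllLines(file, StandardCharsets.UTF_8);
        if (lines.size() < 2) {
            throw new IOException("Checkpoint file too short: " + file);
        }
        // parseInt throws NumberFormatException on garbage such as NUL bytes.
        int version = Integer.parseInt(lines.get(0).trim());
        if (version != 0) {
            throw new IOException("Unrecognized checkpoint version: " + version);
        }
        int expected = Integer.parseInt(lines.get(1).trim());
        Map<String, Long> points = new HashMap<>();
        for (String line : lines.subList(2, lines.size())) {
            String trimmed = line.trim();
            if (trimmed.isEmpty()) {
                continue;
            }
            String[] parts = trimmed.split("\\s+");
            if (parts.length != 3) {
                throw new IOException("Malformed checkpoint line: " + line);
            }
            String key = parts[0] + "-" + Integer.parseInt(parts[1]);
            points.put(key, Long.parseLong(parts[2]));
        }
        if (points.size() != expected) {
            throw new IOException("Expected " + expected + " entries, found " + points.size());
        }
        return points;
    }

    // Lenient wrapper: log the problem and return no recovery points at all,
    // instead of letting the exception abort startup.
    static Map<String, Long> readOrEmpty(Path file) {
        try {
            return readStrict(file);
        } catch (IOException | RuntimeException e) {
            System.err.println("WARN: could not read recovery checkpoint " + file
                    + "; recovering all logs from offset 0: " + e);
            return Collections.emptyMap();
        }
    }

    // A log with no checkpoint entry is recovered from offset 0.
    static long recoveryPoint(Map<String, Long> points, String topicPartition) {
        return points.getOrDefault(topicPartition, 0L);
    }
}
```

With a wrapper like this, a checkpoint file full of zero bytes would cost a full recovery of the logs rather than a failed startup.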


was (Author: jkreps):
This is actually not a very difficult change--in LogManager.loadLogs we would 
need to basically handle an error in reading the recovery checkpoint, log it, 
and then just start a full recovery.

> corrupt recovery file prevents startup
> --------------------------------------
>
>                 Key: KAFKA-1758
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1758
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jason Rosenberg
>
> Hi,
> We recently had a Kafka node go down suddenly. When it came back up, it 
> apparently had a corrupt recovery file, and refused to start up:
> {code}
> 2014-11-06 08:17:19,299  WARN [main] server.KafkaServer - Error starting up KafkaServer
> java.lang.NumberFormatException: For input string: 
> "^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
> ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@"
>         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
>         at java.lang.Integer.parseInt(Integer.java:481)
>         at java.lang.Integer.parseInt(Integer.java:527)
>         at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
>         at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
>         at kafka.server.OffsetCheckpoint.read(OffsetCheckpoint.scala:76)
>         at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:106)
>         at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:105)
>         at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>         at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
>         at kafka.log.LogManager.loadLogs(LogManager.scala:105)
>         at kafka.log.LogManager.<init>(LogManager.scala:57)
>         at kafka.server.KafkaServer.createLogManager(KafkaServer.scala:275)
>         at kafka.server.KafkaServer.startup(KafkaServer.scala:72)
> {code}
> And the app is under a monitor (so it was repeatedly restarting and failing 
> with this error for several minutes before we got to it)…
> We moved the ‘recovery-point-offset-checkpoint’ file out of the way, and it 
> then restarted cleanly (but of course re-synced all its data from replicas, 
> so we had no data loss).
> Anyway, I’m wondering if that’s the expected behavior? Or should it not 
> declare it corrupt and then proceed automatically to an unclean restart?
> Should this NumberFormatException be handled a bit more gracefully?
> We saved the corrupt file if it’s worth inspecting (although I doubt it will 
> be useful!)….
> The corrupt files appeared to be all zeroes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
