Jun,

Looking at the OffsetCheckpoint.write function: if
"fileOutputStream.getFD.sync" throws an exception, it is just caught and
forgotten, and the swap still happens. Maybe we need to catch the
SyncFailedException and re-throw it as a FATAL error to skip the swap.
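
A minimal Java sketch of the write-to-tmp, sync, then swap pattern discussed here, with the sync failure allowed to propagate so the swap is skipped. This is not Kafka's actual OffsetCheckpoint code: the class name, the writeCheckpoint helper, the ".tmp" suffix, and the line-per-entry file layout are all assumptions made for illustration.

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.StandardCopyOption;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the Kafka implementation.
class CheckpointWriteSketch {

    // Writes the offsets to "<target>.tmp", forces them to disk, and only
    // then renames the tmp file over the target.
    static void writeCheckpoint(File target, Map<String, Long> offsets) throws IOException {
        File tmp = new File(target.getAbsolutePath() + ".tmp");

        try (FileOutputStream out = new FileOutputStream(tmp);
             BufferedWriter writer = new BufferedWriter(
                     new OutputStreamWriter(out, StandardCharsets.UTF_8))) {
            // Illustrative layout (an assumption): entry count, then one entry per line.
            writer.write(Integer.toString(offsets.size()));
            writer.newLine();
            for (Map.Entry<String, Long> entry : offsets.entrySet()) {
                writer.write(entry.getKey() + " " + entry.getValue());
                writer.newLine();
            }
            writer.flush();
            // SyncFailedException is an IOException. It is deliberately not
            // caught here, so a failed sync aborts the method before the swap.
            out.getFD().sync();
        }

        // Reached only if the write and the sync both succeeded.
        Files.move(tmp.toPath(), target.toPath(),
                StandardCopyOption.REPLACE_EXISTING,
                StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("checkpoint-sketch").toFile();
        File target = new File(dir, "recovery-point-offset-checkpoint");
        Map<String, Long> offsets = new LinkedHashMap<>();
        offsets.put("my-topic 0", 42L);
        writeCheckpoint(target, offsets);
        System.out.println(Files.readAllLines(target.toPath()));
    }
}
```

With this shape, a caller that sees the IOException can treat it as fatal, and the previous checkpoint file is left untouched because the rename never runs.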

Guozhang


On Thu, Nov 6, 2014 at 8:50 PM, Jason Rosenberg <j...@squareup.com> wrote:

> I'm still not sure what caused the system to reboot (but yes, it appears
> to have crashed hard). The file system is XFS, on CentOS Linux. I'm not
> yet sure, but I think the system might also have become wedged before the
> crash.
>
> It appears the corrupt recovery file actually contained all zero bytes,
> after looking at it with odb.
>
> I'll file a Jira.
>
> On Thu, Nov 6, 2014 at 7:09 PM, Jun Rao <jun...@gmail.com> wrote:
>
> > I am also wondering how the corruption happened. The way we update the
> > OffsetCheckpoint file is to first write to a tmp file and flush the data.
> > We then rename the tmp file to the final file. This is done to prevent
> > corruption caused by a crash in the middle of the writes. In your case,
> > did the host crash? What kind of storage system are you using? Is there any
> > non-volatile cache on the storage system?
> >
> > Thanks,
> >
> > Jun
> >
> > On Thu, Nov 6, 2014 at 6:31 AM, Jason Rosenberg <j...@squareup.com> wrote:
> >
> > > Hi,
> > >
> > > We recently had a Kafka node go down suddenly. When it came back up, it
> > > apparently had a corrupt recovery file, and refused to start up:
> > >
> > > 2014-11-06 08:17:19,299  WARN [main] server.KafkaServer - Error
> > > starting up KafkaServer
> > > java.lang.NumberFormatException: For input string:
> "^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@
> ^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@"
> > >         at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
> > >         at java.lang.Integer.parseInt(Integer.java:481)
> > >         at java.lang.Integer.parseInt(Integer.java:527)
> > >         at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:229)
> > >         at scala.collection.immutable.StringOps.toInt(StringOps.scala:31)
> > >         at kafka.server.OffsetCheckpoint.read(OffsetCheckpoint.scala:76)
> > >         at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:106)
> > >         at kafka.log.LogManager$$anonfun$loadLogs$1.apply(LogManager.scala:105)
> > >         at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> > >         at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:34)
> > >         at kafka.log.LogManager.loadLogs(LogManager.scala:105)
> > >         at kafka.log.LogManager.<init>(LogManager.scala:57)
> > >         at kafka.server.KafkaServer.createLogManager(KafkaServer.scala:275)
> > >         at kafka.server.KafkaServer.startup(KafkaServer.scala:72)
> > >
> > > And since the app is under a monitor, it was repeatedly restarting and
> > > failing with this error for several minutes before we got to it.
> > >
> > > We moved the ‘recovery-point-offset-checkpoint’ file out of the way, and
> > > it then restarted cleanly (but of course re-synced all its data from
> > > replicas, so we had no data loss).
> > >
> > > Anyway, I’m wondering if that’s the expected behavior? Or should it not
> > > declare the file corrupt and then proceed automatically to an unclean
> > > restart?
> > >
> > > Should this NumberFormatException be handled a bit more gracefully?
> > >
> > > We saved the corrupt file if it’s worth inspecting (although I doubt it
> > > will be useful!).
> > >
> > > Jason
> > >
> >
>



-- 
-- Guozhang
