You are right. The swap will be skipped in that case. It seems this
mechanism does not prevent corruption in scenarios where the storage system
crashes hard.
An orthogonal note: I originally thought renameTo in Linux is atomic, but
after reading some JavaDocs I think maybe we should use nio.File.move to be
safe.
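For reference, a minimal sketch of the NIO-based swap Guozhang suggests (the class and method names here are illustrative, not Kafka's actual code). Unlike File.renameTo, which merely returns false on failure, Files.move with ATOMIC_MOVE throws, so a failed swap cannot go unnoticed:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicSwap {
    // Atomically replace `target` with `tmp`. ATOMIC_MOVE makes the move
    // fail loudly (e.g. AtomicMoveNotSupportedException) rather than
    // silently degrading to copy-then-delete; File.renameTo just returns
    // false, which is easy to ignore.
    static void swap(Path tmp, Path target) throws IOException {
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```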
Guozhang,
In OffsetCheckpoint.write(), we don't catch any exceptions. There is only a
finally clause to close the writer. So, if there is any exception during
write or sync, the exception will be propagated back to the caller and
swapping will be skipped.
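To make the control flow concrete, here is a rough sketch of the structure Jun describes; the names are illustrative, not Kafka's actual code. With no catch block, an exception from write or sync unwinds past the finally clause (which only closes the writer) and reaches the caller before any swap happens:

```java
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.util.Map;

public class CheckpointWriteSketch {
    // Write offsets to a temp file. There is no catch block: an IOException
    // from write(), flush(), or sync() propagates to the caller, and the
    // caller then skips the rename onto the real checkpoint file.
    static void writeTmp(File tmp, Map<String, Long> offsets) throws IOException {
        FileOutputStream out = new FileOutputStream(tmp);
        BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(out));
        try {
            for (Map.Entry<String, Long> e : offsets.entrySet()) {
                writer.write(e.getKey() + " " + e.getValue());
                writer.newLine();
            }
            writer.flush();
            out.getFD().sync(); // a SyncFailedException here also aborts the swap
        } finally {
            writer.close();     // finally only cleans up; nothing is swallowed
        }
    }
}
```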
Thanks,
Jun
On Fri, Nov 7, 2014 at
Jun,
Checking the OffsetCheckpoint.write function, if
fileOutputStream.getFD.sync throws an exception it will just be caught and
forgotten, and the swap will still happen. Maybe we need to catch the
SyncFailedException and re-throw it as a FATAL error to skip the swap.
Guozhang
On Thu, Nov 6,
Hi,
We recently had a kafka node go down suddenly. When it came back up, it
apparently had a corrupt recovery file, and refused to start up:
2014-11-06 08:17:19,299 WARN [main] server.KafkaServer - Error
starting up KafkaServer
java.lang.NumberFormatException: For input string:
Forgot to mention: we are using 0.8.1.1.
Jason
On Thu, Nov 6, 2014 at 9:31 AM, Jason Rosenberg j...@squareup.com wrote:
Hi,
We recently had a kafka node go down suddenly. When it came back up, it
apparently had a corrupt recovery file, and refused to start up:
2014-11-06 08:17:19,299
Jason,
Yes, I agree with you. We should handle this more gracefully, as the
checkpoint file dump is not guaranteed to be atomic. Could you file a JIRA?
Guozhang
On Thu, Nov 6, 2014 at 6:31 AM, Jason Rosenberg j...@squareup.com wrote:
Hi,
We recently had a kafka node go down suddenly. When it came
I am also wondering how the corruption happened. The way that we update the
OffsetCheckpoint file is to first write to a tmp file and flush the data.
We then rename the tmp file to the final file. This is done to prevent
corruption caused by a crash in the middle of the writes. In your case, was
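Put together, the update sequence Jun describes looks roughly like this (a sketch under the assumptions above, not the actual Kafka code): write the new contents to a .tmp sibling, force it to disk, then rename it over the final file, so a crash mid-write leaves the previous checkpoint intact:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class SafeCheckpointUpdate {
    // Write to <file>.tmp, fsync, then atomically rename over <file>.
    // A crash before the rename leaves the old checkpoint untouched;
    // a crash after it leaves the complete new one.
    static void update(Path file, byte[] contents) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        try (FileOutputStream out = new FileOutputStream(tmp.toFile())) {
            out.write(contents);
            out.flush();
            out.getFD().sync(); // flush OS buffers before the swap
        }
        Files.move(tmp, file, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

Note that none of this helps if the drive or filesystem loses the synced data anyway (as may have happened in a hard crash); it only rules out a torn, half-written checkpoint from a crash during the write itself.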
I'm still not sure what caused the reboot of the system (but yes, it appears
to have crashed hard). The file system is xfs, on CentOS Linux. I'm not
yet sure, but I think the system might also have become wedged before the
crash.
It appears the corrupt recovery files actually contained all zeros.
Filed: https://issues.apache.org/jira/browse/KAFKA-1758
On Thu, Nov 6, 2014 at 11:50 PM, Jason Rosenberg j...@squareup.com wrote:
I'm still not sure what caused the reboot of the system (but yes it
appears to have crashed hard). The file system is xfs, on CentOS Linux.
I'm not yet sure, but