Re: corrupt recovery checkpoint file issue....

2014-11-10 Thread Guozhang Wang
You are right. The swap will be skipped in that case. It seems this mechanism does not protect against scenarios where the storage system crashes hard. An orthogonal note: I originally thought renameTo in Linux is atomic, but after reading some JavaDocs I think maybe we should use java.nio.file.Files.move to be

Re: corrupt recovery checkpoint file issue....

2014-11-09 Thread Jun Rao
Guozhang, In OffsetCheckpoint.write(), we don't catch any exceptions. There is only a finally clause to close the writer. So, if there is any exception during write or sync, the exception will be propagated back to the caller and swapping will be skipped. Thanks, Jun On Fri, Nov 7, 2014 at

Re: corrupt recovery checkpoint file issue....

2014-11-07 Thread Guozhang Wang
Jun, Checking the OffsetCheckpoint.write function: if fileOutputStream.getFD.sync throws an exception, it will just be caught and forgotten, and the swap will still happen. Maybe we need to catch the SyncFailedException and re-throw it as a FATAL error to skip the swap. Guozhang On Thu, Nov 6,

corrupt recovery checkpoint file issue....

2014-11-06 Thread Jason Rosenberg
Hi, We recently had a Kafka node go down suddenly. When it came back up, it apparently had a corrupt recovery file, and refused to start up: 2014-11-06 08:17:19,299 WARN [main] server.KafkaServer - Error starting up KafkaServer java.lang.NumberFormatException: For input string:

Re: corrupt recovery checkpoint file issue....

2014-11-06 Thread Jason Rosenberg
Forgot to mention, we are using 0.8.1.1. Jason On Thu, Nov 6, 2014 at 9:31 AM, Jason Rosenberg j...@squareup.com wrote: Hi, We recently had a Kafka node go down suddenly. When it came back up, it apparently had a corrupt recovery file, and refused to start up: 2014-11-06 08:17:19,299

Re: corrupt recovery checkpoint file issue....

2014-11-06 Thread Guozhang Wang
Jason, Yes, I agree with you. We should handle this more gracefully, as the checkpoint file dump is not guaranteed to be atomic. Could you file a JIRA? Guozhang On Thu, Nov 6, 2014 at 6:31 AM, Jason Rosenberg j...@squareup.com wrote: Hi, We recently had a Kafka node go down suddenly. When it came
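
One way to read "handle this more gracefully" is for the reader to report an unusable checkpoint instead of letting a NumberFormatException abort startup. The sketch below is only an illustration: the class name TolerantCheckpointReader and the one-"name offset"-pair-per-line format are assumptions made for this example, not Kafka's actual OffsetCheckpoint code or file layout.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical reader, NOT Kafka's OffsetCheckpoint. Assumed format: one
// "name offset" pair per line, separated by a single space.
public class TolerantCheckpointReader {

    // Returns the parsed offsets, or Optional.empty() if the file is missing,
    // unreadable, or malformed, so the caller can choose a fallback.
    public static Optional<Map<String, Long>> read(Path file) {
        Map<String, Long> offsets = new HashMap<>();
        try {
            for (String line : Files.readAllLines(file, StandardCharsets.UTF_8)) {
                if (line.isEmpty()) {
                    continue; // tolerate blank lines
                }
                String[] parts = line.split(" ");
                if (parts.length != 2) {
                    return Optional.empty(); // wrong shape: treat file as corrupt
                }
                offsets.put(parts[0], Long.parseLong(parts[1]));
            }
        } catch (IOException | NumberFormatException e) {
            // I/O failure or a non-numeric offset: report "no usable checkpoint"
            return Optional.empty();
        }
        return Optional.of(offsets);
    }
}
```

What the caller does with an empty result (for example, recovering from the beginning of the log) is a separate policy decision that this sketch leaves open.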

Re: corrupt recovery checkpoint file issue....

2014-11-06 Thread Jun Rao
I am also wondering how the corruption happened. The way that we update the OffsetCheckpoint file is to first write to a tmp file and flush the data. We then rename the tmp file to the final file. This is done to prevent corruption caused by a crash in the middle of the writes. In your case, was
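
The write-to-tmp, flush, then rename sequence described in this message can be sketched roughly as follows. This is a minimal illustration, not Kafka's OffsetCheckpoint implementation: the class name AtomicCheckpointWriter and the ".tmp" suffix are assumptions made for this example. It uses java.nio.file.Files.move with ATOMIC_MOVE, which throws if an atomic move is not possible.

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper, NOT Kafka's OffsetCheckpoint.
public class AtomicCheckpointWriter {

    public static void write(Path target, String contents) throws IOException {
        // Temp file lives next to the target so the move stays on one file system.
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");

        try (FileOutputStream out = new FileOutputStream(tmp.toFile())) {
            out.write(contents.getBytes(StandardCharsets.UTF_8));
            out.flush();
            // Ask the OS to push the data to the device; throws
            // SyncFailedException (an IOException) on failure.
            out.getFD().sync();
        }
        // Any exception above propagates to the caller, so the swap below is skipped.

        // Throws AtomicMoveNotSupportedException if the move cannot be atomic.
        // Whether an existing target is replaced is implementation specific;
        // on Linux this maps to rename(2), which replaces it.
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }
}
```

Even with this sequence, a hard crash can still lose data if the rename reaches the disk in an unexpected order or the directory entry itself is never synced; the sketch does not address that.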

Re: corrupt recovery checkpoint file issue....

2014-11-06 Thread Jason Rosenberg
I'm still not sure what caused the reboot of the system (but yes, it appears to have crashed hard). The file system is xfs, on CentOS Linux. I'm not yet sure, but I think the system might also have become wedged before the crash. It appears the corrupt recovery files actually contained all zero

Re: corrupt recovery checkpoint file issue....

2014-11-06 Thread Jason Rosenberg
Filed: https://issues.apache.org/jira/browse/KAFKA-1758 On Thu, Nov 6, 2014 at 11:50 PM, Jason Rosenberg j...@squareup.com wrote: I'm still not sure what caused the reboot of the system (but yes, it appears to have crashed hard). The file system is xfs, on CentOS Linux. I'm not yet sure, but