[jira] [Updated] (KAFKA-757) System Test Hard Failure cases : "Fatal error during KafkaServerStable startup" when hard-failed broker is re-started

Swapnil Ghike (JIRA) Wed, 13 Feb 2013 01:52:18 -0800

     [ https://issues.apache.org/jira/browse/KAFKA-757?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Swapnil Ghike updated KAFKA-757:
--------------------------------

    Attachment: kafka-757-v1.patch

This patch has two parts:

A. Move the sanity check that detects corrupt index files from the OffsetIndex 
constructor to the Log constructor, below the recovery logic. After a hard 
kill, checking for corrupt index files before the last segment has been 
recovered fails the require() assertion.
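
The ordering issue in part A can be illustrated with a toy model. This is a minimal Python sketch, not Kafka's code: the ToyIndex class, its sanity_check() and recover() methods, and the exact form of the check (empty, or last offset greater than base offset) are assumptions inferred from the description in this comment.

```python
class ToyIndex:
    """Hypothetical stand-in for an index file: a list of absolute offsets."""

    def __init__(self, base_offset, offsets):
        self.base_offset = base_offset
        self.offsets = list(offsets)

    def last_offset(self):
        # An empty index reports the base offset as its last offset.
        return self.offsets[-1] if self.offsets else self.base_offset

    def sanity_check(self):
        # Assumed form of the corruption check: the index is either empty
        # or its last offset lies beyond the segment's base offset.
        if not (len(self.offsets) == 0 or self.last_offset() > self.base_offset):
            raise ValueError("corrupt index")

    def recover(self):
        # Assumed recovery: drop trailing slots that were never written
        # (they read back as the base offset, i.e. relative offset 0).
        while self.offsets and self.offsets[-1] <= self.base_offset:
            self.offsets.pop()


def load(index, check_before_recovery):
    """Run the sanity check either before or after recovery."""
    if check_before_recovery:
        index.sanity_check()
        index.recover()
    else:
        index.recover()
        index.sanity_check()
    return index
```

In this model, an index left with unwritten trailing slots after a hard kill fails the check when it runs first, and passes once recovery has run.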

B. The following corner case is possible:
1. A broker rolls a new log segment file and an index file of non-zero size, 
and is hard killed before any appends to the index file are flushed. 
2. When the broker restarts and tries to load the existing log segments, it 
encounters this index file, which has non-zero size but no data. 
3. Since the broker was hard killed, it enters the recovery logic in 
Log.loadLogSegments(). 
4. The recovery logic tries to truncate the index file to the base offset of 
the segment, so it looks up indexSlotFor(baseOffset). indexSlotFor() returns a 
non-zero slot, because relativeOffset(idx, mid) == relOffset == 0. 
5. This sets the size of the index file to a non-zero value (half of its 
original size, which was maxIndexSize * 8). 
6. Thus, the require() check for a corrupted index file in the Log constructor 
fails, since #entries == size != 0 && lastOffset == baseOffset. 
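
Steps 4 and 5 can be reproduced with a small model of the slot lookup. The sketch below is Python, not the Scala code in the patch: index_slot_for() and entries_after_truncate() are hypothetical names, the index is modeled as a plain list of relative offsets, and the truncation rule is an assumption chosen to match the "half of its original size" behavior described above.

```python
def index_slot_for(rel_offsets, rel_target):
    """Return the largest slot whose relative offset is <= rel_target, or -1."""
    entries = len(rel_offsets)
    if entries == 0 or rel_offsets[0] > rel_target:
        return -1
    lo, hi = 0, entries - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        found = rel_offsets[mid]
        if found == rel_target:
            return mid  # an all-zero index matches here on the first probe
        elif found < rel_target:
            lo = mid
        else:
            hi = mid - 1
    return lo


def entries_after_truncate(rel_offsets, rel_target):
    """Assumed truncation rule: drop the matching slot and everything after it."""
    slot = index_slot_for(rel_offsets, rel_target)
    if slot < 0:
        return 0
    if rel_offsets[slot] == rel_target:
        return slot
    return slot + 1


# A pre-allocated index that was never flushed reads back as all zeros.
zeroed = [0] * 8
slot = index_slot_for(zeroed, 0)
remaining = entries_after_truncate(zeroed, 0)
```

For the zeroed index, the first probe lands on the middle slot and matches immediately, so the truncated index keeps half of its slots even though none of them holds real data.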

The solution is to modify indexSlotFor() so that it returns -1 for a non-zero 
sized index file whose lastOffset is 0 (assuming that setLength() fills the 
new bytes with 0), so that the index file is truncated to #entries == size == 
0. 
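
A rough sketch of the proposed guard, again in Python rather than the Scala of the patch. The names index_slot_for_fixed() and entries_after_truncate_fixed() are hypothetical, "lastOffset is 0" is interpreted here as the last relative offset being 0, and the truncation rule is an assumption, so treat this as an illustration of the idea and not as the patched code.

```python
def index_slot_for_fixed(rel_offsets, rel_target):
    """Slot lookup with the extra guard for a zero-filled index."""
    entries = len(rel_offsets)
    # Guard: an index whose last relative offset is 0 is treated as holding
    # no data, on the assumption that unwritten bytes read back as 0.
    if entries == 0 or rel_offsets[-1] == 0:
        return -1
    if rel_offsets[0] > rel_target:
        return -1
    lo, hi = 0, entries - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        found = rel_offsets[mid]
        if found == rel_target:
            return mid
        elif found < rel_target:
            lo = mid
        else:
            hi = mid - 1
    return lo


def entries_after_truncate_fixed(rel_offsets, rel_target):
    """Assumed truncation rule: a slot of -1 empties the index."""
    slot = index_slot_for_fixed(rel_offsets, rel_target)
    if slot < 0:
        return 0
    if rel_offsets[slot] == rel_target:
        return slot
    return slot + 1


# The never-flushed index from the corner case now truncates to zero entries.
remaining_fixed = entries_after_truncate_fixed([0] * 8, 0)
```

With the guard in place, the zero-filled index truncates to zero entries, while an index with real entries is searched as before.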


Testing done: 
1. Unit tests pass.
2. Set the flush interval and index append interval to very low values. 
Produce data using the console producer (so the index file has flushed 
entries), hard kill the broker, and restart it. Without A, the exception 
appears; with A, the broker starts cleanly. Stop the broker with ctrl+C.
3. Clean up the kafka-logs directory, but do not clean up zookeeper. Restart 
the broker (this creates empty log and index files for the topics created in 
step 2); once it boots, hard kill it. Restart the broker again: it fails 
without B and boots successfully with B.

                
> System Test Hard Failure cases : "Fatal error during KafkaServerStable 
> startup" when hard-failed broker is re-started
> ---------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-757
>                 URL: https://issues.apache.org/jira/browse/KAFKA-757
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: John Fung
>            Assignee: Swapnil Ghike
>            Priority: Blocker
>              Labels: 0.8, replication-testing
>         Attachments: kafka-757-v1.patch
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
