Thanks for triaging and fixing this so quickly, Adar.

> On Dec 7, 2016, at 5:20 AM, Adar Dembo (JIRA) <j...@apache.org> wrote:
> 
> 
>     [ https://issues.apache.org/jira/browse/KUDU-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
> 
> Adar Dembo resolved KUDU-1791.
> ------------------------------
>       Resolution: Fixed
>         Assignee: Adar Dembo
>    Fix Version/s: 1.2.0
> 
> Fixed in commit 2453a67310f62c01216e5a0ed08f192a08adc005.
> 
>> read-only log block manager should not truncate metadata files
>> --------------------------------------------------------------
>> 
>>                Key: KUDU-1791
>>                URL: https://issues.apache.org/jira/browse/KUDU-1791
>>            Project: Kudu
>>         Issue Type: Bug
>>         Components: fs
>>   Affects Versions: 1.2.0
>>           Reporter: Adar Dembo
>>           Assignee: Adar Dembo
>>            Fix For: 1.2.0
>> 
>> 
>> This appears to happen extremely rarely (i.e. not even on the flaky test 
>> dashboard); I'm noting it here in case it shows up again.
>> The error:
>> {noformat}
>> F1206 15:43:33.546993 21974 open-readonly-fs-itest.cc:121] Check failed: _s.ok() Bad status: Corruption: Could not read records from container /tmp/run_tha_testB5l6uo/test-tmp/open-readonly-fs-itest.OpenReadonlyFsITest.TestWriteAndVerify.1481038978495057-21754/minicluster-data/ts-0/data/6a60c05828f24e168f34f3c2e8b664a8: Data length checksum does not match: Incorrect checksum in file /tmp/run_tha_testB5l6uo/test-tmp/open-readonly-fs-itest.OpenReadonlyFsITest.TestWriteAndVerify.1481038978495057-21754/minicluster-data/ts-0/data/6a60c05828f24e168f34f3c2e8b664a8.metadata at offset 4085: Checksum does not match. Expected: 0. Actual: 1214729159
>> *** Check failure stack trace: ***
>>    @     0x7eff9150b21d  google::LogMessage::Fail() at ??:0
>>    @     0x7eff9150d28c  google::LogMessage::SendToLog() at ??:0
>>    @     0x7eff9150ad79  google::LogMessage::Flush() at ??:0
>>    @     0x7eff9150dc1f  google::LogMessageFatal::~LogMessageFatal() at ??:0
>>    @           0x40530f  _ZZN4kudu5itest43OpenReadonlyFsITest_TestWriteAndVerify_Test8TestBodyEvENKUlvE_clEv at /home/jenkins-slave/workspace/kudu-0/thirdparty/installed/uninstrumented/include/glog/logging.h:697
>>    @     0x7eff912aca40  (unknown) at ??:0
>>    @     0x7eff8c8a3184  start_thread at ??:0
>>    @     0x7eff90d1a37d  clone at ??:0
>>    @              (nil)  (unknown)
>> {noformat}
>> In this test, a client workload is performed concurrently with a looping 
>> thread that opens a read-only FsManager. Opening the FsManager forces the 
>> log block manager to reload all of the on-disk metadata every time; this 
>> test approximates the (real) use case of a read-only CLI filesystem tool 
>> running concurrently with a live Kudu server.
>> The error itself shows the thread attempting to validate the length of a 
>> particular metadata record in a container. The validation does an 8-byte 
>> read: 4 bytes hold the record length and 4 bytes hold the length's 
>> checksum. The validation fails because the second 4 bytes are 0 while the 
>> checksum computed from the length is non-zero.
>> I scanned the reading/writing code in pb_util.cc but I can't see any obvious 
>> places where we're misusing the filesystem in such a way that we'd expect to 
>> see intermediate 0s in this field. For example, we always issue a single 
>> write() syscall to write a record to disk, including its length, checksum, 
>> body, and body checksum.
>> I took another look at the test log and I think I've found the smoking gun: 
>> {noformat}
>> W1206 15:43:32.967667 24555 log_block_manager.cc:502] Log block manager: Found partial trailing metadata record in container /tmp/run_tha_testB5l6uo/test-tmp/open-readonly-fs-itest.OpenReadonlyFsITest.TestWriteAndVerify.1481038978495057-21754/minicluster-data/ts-0/data/6a60c05828f24e168f34f3c2e8b664a8: Truncating metadata file to last valid offset: 4081
>> {noformat}
>> This shows a log block manager that, during startup, found a metadata file 
>> with a partial record and decided to truncate it. The problem: this must be 
>> the read-only FsManager thread because it's the only entity starting up over 
>> and over. Indeed, there's no read-only protection for this case, and there 
>> should be.
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)
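
For anyone reading this thread later, here is a rough sketch of the failure mode described in the quoted report: records framed as length, length checksum, body, body checksum, and a loader that truncates a partial trailing record only when it is not opened read-only. This is not Kudu's code. The function names, the little-endian layout, the 4-byte body checksum, and the use of zlib.crc32 as the checksum are all assumptions made for illustration.

```python
import os
import struct
import zlib

# Hypothetical layout: 4-byte length followed by a 4-byte checksum of the length.
HEADER = struct.Struct("<II")
BODY_CRC_SIZE = 4  # assumed size of the trailing body checksum


def encode_record(body: bytes) -> bytes:
    """Frame a record as: length, checksum(length), body, checksum(body)."""
    length = struct.pack("<I", len(body))
    return (length + struct.pack("<I", zlib.crc32(length))
            + body + struct.pack("<I", zlib.crc32(body)))


def last_valid_offset(data: bytes) -> int:
    """Return the offset just past the last complete record.

    Stops at a partial trailing record; raises ValueError when a complete
    header or body fails its checksum.
    """
    offset = 0
    while offset + HEADER.size <= len(data):
        length, length_crc = HEADER.unpack_from(data, offset)
        actual = zlib.crc32(data[offset:offset + 4])
        if actual != length_crc:
            raise ValueError(
                f"Checksum does not match at offset {offset + 4}. "
                f"Expected: {length_crc}. Actual: {actual}")
        end = offset + HEADER.size + length + BODY_CRC_SIZE
        if end > len(data):
            break  # header is intact but the rest of the record is missing
        body = data[offset + HEADER.size:end - BODY_CRC_SIZE]
        (body_crc,) = struct.unpack_from("<I", data, end - BODY_CRC_SIZE)
        if zlib.crc32(body) != body_crc:
            raise ValueError(f"Body checksum does not match at offset {offset}")
        offset = end
    return offset


def open_container(path: str, read_only: bool) -> int:
    """Load a metadata file and return its last valid offset.

    A partial trailing record is truncated away only when read_only is False;
    a read-only opener must leave the file untouched, since a live writer may
    still be appending to it.
    """
    with open(path, "rb") as f:
        data = f.read()
    valid = last_valid_offset(data)
    if valid < len(data) and not read_only:
        os.truncate(path, valid)
    return valid
```

Calling open_container(path, read_only=True) on a file with a half-written last record reports the last valid offset and leaves the file size alone, while read_only=False shrinks the file to that offset. The read_only check is the kind of guard the report says was missing.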
