Thanks for triaging/fixing this Adar in such a short time. > On Dec 7, 2016, at 5:20 AM, Adar Dembo (JIRA) <j...@apache.org> wrote: > > > [ > https://issues.apache.org/jira/browse/KUDU-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel > ] > > Adar Dembo resolved KUDU-1791. > ------------------------------ > Resolution: Fixed > Assignee: Adar Dembo > Fix Version/s: 1.2.0 > > Fixed in commit 2453a67310f62c01216e5a0ed08f192a08adc005. > >> read-only log block manager should not truncate metadata files >> -------------------------------------------------------------- >> >> Key: KUDU-1791 >> URL: https://issues.apache.org/jira/browse/KUDU-1791 >> Project: Kudu >> Issue Type: Bug >> Components: fs >> Affects Versions: 1.2.0 >> Reporter: Adar Dembo >> Assignee: Adar Dembo >> Fix For: 1.2.0 >> >> >> This appears to happen extremely rarely (i.e. not even on the flaky test >> dashboard); I'm noting it here in case it shows up again. >> The error: >> {noformat} >> F1206 15:43:33.546993 21974 open-readonly-fs-itest.cc:121] Check failed: >> _s.ok() Bad status: Corruption: Could not read records from container >> /tmp/run_tha_testB5l6uo/test-tmp/open-readonly-fs-itest.OpenReadonlyFsITest.TestWriteAndVerify.1481038978495057-21754/minicluster-data/ts-0/data/6a60c05828f24e168f34f3c2e8b664a8: >> Data length checksum does not match: Incorrect checksum in file >> /tmp/run_tha_testB5l6uo/test-tmp/open-readonly-fs-itest.OpenReadonlyFsITest.TestWriteAndVerify.1481038978495057-21754/minicluster-data/ts-0/data/6a60c05828f24e168f34f3c2e8b664a8.metadata >> at offset 4085: Checksum does not match. Expected: 0. Actual: 1214729159 >> *** Check failure stack trace: *** >> @ 0x7eff9150b21d google::LogMessage::Fail() at ??:0 >> @ 0x7eff9150d28c google::LogMessage::SendToLog() at ??:0 >> @ 0x7eff9150ad79 google::LogMessage::Flush() at ??:0 >> @ 0x7eff9150dc1f google::LogMessageFatal::~LogMessageFatal() at ??:0 >> @ 0x40530f >> _ZZN4kudu5itest43OpenReadonlyFsITest_TestWriteAndVerify_Test8TestBodyEvENKUlvE_clEv >> at >> /home/jenkins-slave/workspace/kudu-0/thirdparty/installed/uninstrumented/include/glog/logging.h:697 >> @ 0x7eff912aca40 (unknown) at ??:0 >> @ 0x7eff8c8a3184 start_thread at ??:0 >> @ 0x7eff90d1a37d clone at ??:0 >> @ (nil) (unknown) >> {noformat} >> In this test, a client workload is performed concurrently with a looping >> thread that opens a read-only FsManager. Opening the FsManager forces the >> log block manager to reload all of the on-disk metadata every time; this >> test approximates the (real) use case of a read-only CLI filesystem tool >> running concurrently with a live Kudu server. >> The error itself shows the thread attempting to validate the length of a >> particular metadata record in a container. The validation does an 8 byte >> read, 4 bytes of which are the record length and 4 bytes of which are the >> length's checksum. The validation fails because the second 4 bytes are 0 >> while the length's actual checksum was non-zero. >> I scanned the reading/writing code in pb_util.cc but I can't see any obvious >> places where we're misusing the filesystem in such a way that we'd expect to >> see intermediate 0s in this field. For example, we always issue a single >> write() syscall to write a record to disk, including its length, checksum, >> body, and body checksum. >> I took another look at the test log and I think I've found the smoking gun: >> {noformat} >> W1206 15:43:32.967667 24555 log_block_manager.cc:502] Log block manager: >> Found partial trailing metadata record in container >> /tmp/run_tha_testB5l6uo/test-tmp/open-readonly-fs-itest.OpenReadonlyFsITest.TestWriteAndVerify.1481038978495057-21754/minicluster-data/ts-0/data/6a60c05828f24e168f34f3c2e8b664a8: >> Truncating metadata file to last valid offset: 4081 >> {noformat} >> This shows a log block manager that, during startup, found a metadata file >> with a partial record and decided to truncate it. The problem: this must be >> the read-only FsManager thread because it's the only entity starting up over >> and over. Indeed, there's no read-only protection for this case, and there >> should be. > > > > -- > This message was sent by Atlassian JIRA > (v6.3.4#6332)