[ https://issues.apache.org/jira/browse/KUDU-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16353204#comment-16353204 ]
Todd Lipcon commented on KUDU-1989:
-----------------------------------

Had another report of this from a community user running Ubuntu 14. The end of their metadata file looks like:

{code}
000dc500: fc46 1600 0000 88e8 4665 0a09 094b 6bf3  .F......Fe...Kk.
000dc510: 0000 0000 0010 0218 eb86 bdab 95ee d802  ................
000dc520: 7a11 35a2 1600 0000 88e8 4665 0a09 09bf  z.5.......Fe....
000dc530: 6bf3 0000 0000 0010 0218 a1d4 ccad 95ee  k...............
000dc540: d802 a8a4 df12 1600 0000 88e8 4665 0a09  ............Fe..
000dc550: 09a7 6bf3 0000 0000 0010 0218 dec1 dbc0  ..k.............
000dc560: 95ee d802 d7f0 9876 0000 0000 0000 0000  .......v........
000dc570: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc580: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc590: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc5a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc5b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc5c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc5d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc5e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc5f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc600: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc610: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc620: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc630: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc640: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc650: 0000 0000 0000                           ......
{code}

I looked into a few angles:

- Looking at the kernel source for writev, it seems like it writes in a loop, and if the process gets a fatal signal in the middle it could exit mid-loop, resulting in a partial write. However, I don't see any case where it would extend the file length without writing the data.
- I tried reproducing on Ubuntu 14 with a simple C++ program that uses pwritev in the same manner as our PBC code and gets a fatal KILL signal in the middle:

{code}
#include <stdio.h>
#include <signal.h>
#include <stdint.h>
#include <sys/types.h>
#include <pthread.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/uio.h>
#include <string>
#include <assert.h>

// Sends SIGKILL to the whole process group after 10ms, so that the
// write loop in main() is killed while it is running.
void* killer(void*) {
  usleep(10000);
  kill(0, SIGKILL);
  return NULL;
}

int main() {
  // For reference, the calls this program imitates:
  // open("/tmp/kudutest-1000/pb_util-test.TestPBUtil.TestBlah.1517877028228116-22829/pb_container.meta", O_RDWR|O_CREAT|O_TRUNC, 0666) = 3
  // pwritev(3, [{"kuducntr\2\0\0\0_\10+\265\216\0\0\0l\363\306(\nq\no\n$ku"..., 170}], 1, 0) = 170
  // pwritev(3, [{"\32'\0\0\340%\224}\n\220Nxxxxxxxxxxxxxxxxxxxxx"..., 10022}], 1, 170) = 10022
  std::string x(10000, 'x');
  int fd = open("/mnt/f", O_RDWR | O_CREAT | O_TRUNC, 0666);
  assert(fd >= 0);
  int64_t offset = 0;
  pthread_t thr;
  pthread_create(&thr, NULL, &killer, NULL);
  // Append 10000-byte chunks of 'x' until the killer thread fires.
  while (true) {
    struct iovec iov[1];
    iov[0].iov_base = (void*)x.data();
    iov[0].iov_len = x.size();
    ssize_t rc = pwritev(fd, iov, 1, offset);
    assert(rc >= 0);
    offset += x.size();
  }
  return 0;
}
{code}

I was unable to repro the issue on either ext4 or xfs mount points using this script:

{code}
$ while tail --bytes=100 /mnt/f | xxd | grep -v '0000 0000' ; do rm -f /mnt/f ; ./test ; done
{code}

So it's still a mystery.

> kudu-tserver met checksum mismatch after node crash and restart.
> ----------------------------------------------------------------
>
>                 Key: KUDU-1989
>                 URL: https://issues.apache.org/jira/browse/KUDU-1989
>             Project: Kudu
>          Issue Type: Bug
>          Components: fs
>            Reporter: zhangsong
>            Priority: Major
>
> kudu-tserver version: 1.0.0
> 1 firstly node crashed
> 2 when trying to restart the kudu-tserver , found it could not be restarted
> successfully.
> 3 log content in kudu-tserver.FATAL:
> "
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> F0421 16:01:09.283123 20127 tablet_server_main.cc:55] Check failed: _s.ok()
> Bad status: Corruption: Failed to load FS layout: Could not read records from
> container
> /export/servers/kudu/1.0-sp/tserver_data/data/a22af504ca16421aad511b14c51130a9:
> Data length checksum does not match: Incorrect checksum in file
> /export/servers/kudu/1.0-sp/tserver_data/data/a22af504ca16421aad511b14c51130a9.metadata
> at offset 753661: Checksum does not match. Expected: 843507848. Actual:
> 1699145864
> "
> Not sure if this has been reported , create it here.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
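
The symptom described in the comment (a metadata file whose tail is a run of zero bytes) can also be checked programmatically rather than by eyeballing xxd output. The following is only a minimal sketch: CountTrailingZeroBytes and ReadFileOrEmpty are hypothetical helper names invented for illustration and are not part of Kudu, and the whole file is read into memory for simplicity.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>

// Hypothetical helper (not part of Kudu): returns the number of
// consecutive zero bytes at the end of `data`.
std::size_t CountTrailingZeroBytes(const std::string& data) {
  std::size_t n = 0;
  while (n < data.size() && data[data.size() - 1 - n] == '\0') {
    ++n;
  }
  return n;
}

// Hypothetical helper: reads the whole file at `path` into memory.
// Returns an empty string if the file cannot be opened.
std::string ReadFileOrEmpty(const std::string& path) {
  std::ifstream in(path, std::ios::binary);
  return std::string(std::istreambuf_iterator<char>(in),
                     std::istreambuf_iterator<char>());
}
```

Calling CountTrailingZeroBytes(ReadFileOrEmpty("/mnt/f")) after each run of the test program would report how many zero bytes end the file. A nonzero result only flags a candidate: a record could legitimately end in a zero byte, so a long run is the interesting case. If the hex dump in the comment runs to the end of the file, the zero run there covers offsets 0xdc568 through 0xdc655, which is 238 bytes.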