[ 
https://issues.apache.org/jira/browse/KUDU-1989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16353204#comment-16353204
 ] 

Todd Lipcon commented on KUDU-1989:
-----------------------------------

Got another report of this from a community user running Ubuntu 14. The end of 
their metadata file looks like this:

{code}
000dc500: fc46 1600 0000 88e8 4665 0a09 094b 6bf3  .F......Fe...Kk.
000dc510: 0000 0000 0010 0218 eb86 bdab 95ee d802  ................
000dc520: 7a11 35a2 1600 0000 88e8 4665 0a09 09bf  z.5.......Fe....
000dc530: 6bf3 0000 0000 0010 0218 a1d4 ccad 95ee  k...............
000dc540: d802 a8a4 df12 1600 0000 88e8 4665 0a09  ............Fe..
000dc550: 09a7 6bf3 0000 0000 0010 0218 dec1 dbc0  ..k.............
000dc560: 95ee d802 d7f0 9876 0000 0000 0000 0000  .......v........
000dc570: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc580: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc590: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc5a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc5b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc5c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc5d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc5e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc5f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc600: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc610: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc620: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc630: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc640: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000dc650: 0000 0000 0000                           ......
{code}
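
Reading the dump above, each record appears to be framed as a 4-byte
little-endian length (0x16 = 22), a 4-byte value after it, 22 bytes of payload,
and a 4-byte trailer, which gives 34 bytes per record. That layout is inferred
only from the bytes shown here, not from the PBC code, and the names
ReadLE32 and EndOfFramedRecords are hypothetical. A minimal sketch that walks
such framing to find where the records stop and the zero fill begins, without
verifying any checksums:

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Reads a little-endian uint32 at `pos`. The caller guarantees that
// pos + 4 <= buf.size().
static uint32_t ReadLE32(const std::string& buf, size_t pos) {
  uint32_t v = 0;
  for (int i = 3; i >= 0; i--) {
    v = (v << 8) | static_cast<uint8_t>(buf[pos + i]);
  }
  return v;
}

// Walks records starting at `pos` and returns the offset just past the last
// record whose framing fits in the buffer and whose length is non-zero.
// ASSUMPTION: 4-byte length + 4-byte field + payload + 4-byte trailer, as
// inferred from the hex dump. Checksums are skipped, not verified.
size_t EndOfFramedRecords(const std::string& buf, size_t pos) {
  const size_t kOverhead = 12;  // the three 4-byte fields around the payload
  while (pos + kOverhead <= buf.size()) {
    uint32_t len = ReadLE32(buf, pos);
    // A zero length (zero-filled region) or a length that runs past the end
    // of the buffer means the framing stops here.
    if (len == 0 || len > buf.size() - pos - kOverhead) break;
    pos += kOverhead + len;
  }
  return pos;
}
```

If the framing assumption holds, calling this on the whole file with pos set
to 0xdc502 (the first record start visible in the dump) should stop at
0xdc568, which is where the zeros begin.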

I looked into a few angles:
- Looking at the kernel source for writev, it appears to write in a loop, and 
if the process gets a fatal signal in the middle it could exit mid-loop, 
resulting in a partial write. However, I don't see any case where it would 
extend the file length without writing the data.
- I tried reproducing on Ubuntu 14 with a simple C++ program that uses pwritev 
in the same manner as our PBC code and gets a fatal KILL signal in the middle:

{code}
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#include <string>

// Sleeps for 10ms, then sends SIGKILL to the whole process group (pid 0) so
// that the writer loop in main() is killed while it is issuing pwritev calls.
void* killer(void*) {
  usleep(10000);
  kill(0, SIGKILL);
  return NULL;
}

int main() {
  // Syscalls being imitated:
  // open("/tmp/kudutest-1000/pb_util-test.TestPBUtil.TestBlah.1517877028228116-22829/pb_container.meta", O_RDWR|O_CREAT|O_TRUNC, 0666) = 3
  // pwritev(3, [{"kuducntr\2\0\0\0_\10+\265\216\0\0\0l\363\306(\nq\no\n$ku"..., 170}], 1, 0) = 170
  // pwritev(3, [{"\32'\0\0\340%\224}\n\220Nxxxxxxxxxxxxxxxxxxxxx"..., 10022}], 1, 170) = 10022
  std::string x(10000, 'x');

  int fd = open("/mnt/f", O_RDWR | O_CREAT | O_TRUNC, 0666);
  assert(fd >= 0);
  int64_t offset = 0;
  pthread_t thr;
  pthread_create(&thr, NULL, &killer, NULL);

  // Append 10000-byte chunks of 'x' until the killer thread fires.
  while (true) {
    struct iovec iov[1];
    iov[0].iov_base = (void*)x.data();
    iov[0].iov_len = x.size();
    ssize_t rc = pwritev(fd, iov, 1, offset);
    assert(rc >= 0);
    offset += x.size();
  }

  return 0;
}
{code}

I was unable to repro the issue on either ext4 or xfs mount points using this 
script:
{code}
$ while tail --bytes=100 /mnt/f | xxd | grep -v '0000 0000' ; do rm -f /mnt/f ; ./test ; done
{code}
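
The loop above only inspects the last 100 bytes of the file. Since the test
program writes nothing but 10000-byte runs of 'x', a stricter check could tell
an ordinary partial write apart from the case being hunted here (file length
extended but data missing). The following is a minimal sketch of such a check.
TailState and ClassifyTestFile are hypothetical names, and it assumes the
file contents have already been read into a string:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

enum class TailState {
  kWholeChunks,     // only fill bytes, length is a multiple of the chunk size
  kPartialChunk,    // only fill bytes, but the last chunk was cut short
  kUnwrittenBytes,  // some byte is not the fill byte (e.g. a zero-filled tail)
};

// Classifies the contents of a file produced by a writer that only ever
// writes `chunk`-sized runs of the byte `fill`.
TailState ClassifyTestFile(const std::string& contents, size_t chunk,
                           char fill) {
  assert(chunk > 0);
  // Any byte other than `fill` means the length covers data that never landed.
  if (contents.find_first_not_of(fill) != std::string::npos) {
    return TailState::kUnwrittenBytes;
  }
  return contents.size() % chunk == 0 ? TailState::kWholeChunks
                                      : TailState::kPartialChunk;
}
```

For the program above this would be called as
ClassifyTestFile(contents, 10000, 'x') after each run. Only a
kUnwrittenBytes result would correspond to the corruption seen in the
reported metadata file.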

So it's still a mystery.

> kudu-tserver met checksum mismatch after node crash and restart.
> ----------------------------------------------------------------
>
>                 Key: KUDU-1989
>                 URL: https://issues.apache.org/jira/browse/KUDU-1989
>             Project: Kudu
>          Issue Type: Bug
>          Components: fs
>            Reporter: zhangsong
>            Priority: Major
>
> kudu-tserver version: 1.0.0
> 1. First, the node crashed.
> 2. When trying to restart kudu-tserver, it could not be restarted 
> successfully.
> 3. Log content in kudu-tserver.FATAL:
> "
> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
> F0421 16:01:09.283123 20127 tablet_server_main.cc:55] Check failed: _s.ok() 
> Bad status: Corruption: Failed to load FS layout: Could not read records from 
> container 
> /export/servers/kudu/1.0-sp/tserver_data/data/a22af504ca16421aad511b14c51130a9:
>  Data length checksum does not match: Incorrect checksum in file 
> /export/servers/kudu/1.0-sp/tserver_data/data/a22af504ca16421aad511b14c51130a9.metadata
>  at offset 753661: Checksum does not match. Expected: 843507848. Actual: 
> 1699145864
> "
> Not sure if this has already been reported, so I am creating it here.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
