As far as I know, both of these are basically known issues, with no good solution, other than not using O_DIRECT. Certainly the first issue is one I recognise. The second isn't one I recognise directly, but is unsurprising to me.
There have been discussions -- including developers -- on this list as recent as a month or so ago. The general outcome seems to be that any problems with O_DIRECT are not going to be fixed. Hugo. On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote: > Hi all, > > I discovered two bugs when O_DIRECT is used... > > 1) a corrupted file doesn't return -EIO when O_DIRECT is used > > Normally BTRFS prevents to access the contents of a corrupted file; however I > was able read the content of a corrupted file simply using O_DIRECT > > # in a new btrfs filesystem, create a file > $ sudo mkfs.btrfs -f /dev/sdd5 > $ mount /dev/sdd5 t > $ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd > bs=$((16*1024)) iflag=fullblock count=1024 > > # corrupt the file > $ sudo filefrag -v t/abcd > Filesystem type is: 9123683e > File size of t/abcd is 16777216 (4096 blocks of 4096 bytes) > ext: logical_offset: physical_offset: length: expected: flags: > 0: 0.. 3475: 70656.. 74131: 3476: > 1: 3476.. 4095: 74212.. 74831: 620: 74132: last,eof > t/abcd: 2 extents found > $ sudo umount t > $ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 > /dev/sdd5 > mirror 1 logical 289406976 physical 289406976 device /dev/sdd5 > corrupting 289406976 copy 1 > > # try to access the file; expected result: -EIO > $ sudo mount /dev/sdd5 t > $ dd if=t/abcd | hexdump -c | head > dd: error reading 't/abcd': Input/output error > 0+0 records in > 0+0 records out > 0 bytes copied, 0.000477413 s, 0.0 kB/s > > > # try to access the file using O_DIRECT; expected result: -EIO, instead the > file is accessible > $ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head > 0000000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 > * > 0001000 f g a b c e f g a b c e f g a b > 0001010 c e f g a b c e f g a b c e f g > 0001020 a b c e f g a b c e f g a b c e > 0001030 f g a b c e f g a b c e f g a b > 0001040 c e f g a b c e f g a b c e f g > 0001050 a b c e f g a b c e f g a b c e > 0001060 f g a b c e f g a b c e f g a b > 0001070 c e f g a b c e f g a b c e f g > > (dmesg report the checksum mismatch) > [13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off 0 > csum 0x98f94189 expected csum 0x0ab6be80 mirror 1 > > Note the first 4k filled by 0x01 !!!!! > > Conclusion: even if the file is corrupted and normally BTRFS prevent to > access it, using O_DIRECT > a) no error is returned to the caller > b) instead of the page stored on the disk, it is returned a page filled with > 0x01 (according also with the function __readpage_endio_check()) > > > 2) The second bug, is a more severe bug. If during a writing of a buffer with > O_DIRECT, the buffer is updated at the same time by a second process, the > checksum may be incorrect. > > At the end of the email there is the code which shows the problem: two > process share the same memory: the first write it to the disk, the second > update the buffer continuously. A third process try to read the file, but it > got time to time -EIO > > If you ran my code in a btrfs filesystem you got a lot of > > ERROR: read thread; r = 8192, expected = 16384 > ERROR: read thread; r = 8192, expected = 16384 > ERROR: read thread; e = 5 - Input/output error > ERROR: read thread; e = 5 - Input/output error > > The firsts lines are related to a shorter read (which may happens). The lasts > lines are related to a checksum mismatch. The dmesg is filled by lines like > [...] > [14873.573547] BTRFS warning (device sdd5): csum failed root 5 ino 259 off > 4096 csum 0x0683c6df expected csum 0x55eb85e6 mirror 1 > [...] > > This is definitely a bug. > > I think that using O_DIRECT and updating a page at the same time could happen > in a VM. In BTRFS this could lead to a wrong checksum. The problem is that > if BTRFS detects a checksum error during a reading: > a) if O_DIRECT is not used in the read > * -EIO is returned > Definitely BAD > > b) if O_DIRECT is used in the read > * it doesn't return the error to the caller > * it returns a page filled by 0x01 instead of the data from the disk > Even worse than a) > > Note1: even using O_DIRECT with O_SYNC, the problem still persist. > Note2: the man page of open(2) is filled by a lot of notes about O_DIRECT, > but also it stated that using O_DIRECT+fork()+mmap(... MAP_SHARED) is legally. > Note3: even "ZFS on linux" has its trouble with O_DIRECT: if fact ZFS doesn't > support it; see https://github.com/zfsonlinux/zfs/issues/224 > > BR > G.Baroncelli > > ----- cut --- cut --- cut ---- > > #define _GNU_SOURCE > #include <stdio.h> > #include <stdlib.h> > #include <sys/mman.h> > #include <assert.h> > #include <errno.h> > #include <sys/types.h> > #include <sys/stat.h> > #include <fcntl.h> > #include <string.h> > #include <unistd.h> > > #define FILESIZE (4096*4) > > int fd; > char *buffer = NULL; > > void read_thread(const char *nf) { > > void *data = mmap(NULL, FILESIZE, > PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); > > assert(data); > fprintf(stderr, "read_thread: data = %p\n", data); > int rfd; > rfd = open(nf, O_RDONLY); > > for(;;) { > ssize_t r = pread(rfd, data, FILESIZE, 0); > if (r < 0) { > int e = errno; > fprintf(stderr, "ERROR: read thread; e = %d - %s\n", > e, strerror(e)); > > } else if (r != FILESIZE) { > fprintf(stderr, "ERROR: read thread; r = %ld, expected > = %d\n", > r, FILESIZE); > } > } > } > > void write_thread(void) { > > for(;;) { > ssize_t r = pwrite(fd, buffer, FILESIZE, 0); > assert(r == FILESIZE); > } > } > > void update_thread(void) { > > for(;;) { > int i; > for (i = 0 ; i < FILESIZE ; i++) > buffer[i] += i+10; > } > } > > > int main(int argc, char **argv) { > > if (argc < 2) { > fprintf(stderr, "usage: %s <fname>\n", argv[0]); > exit(100); > } > > > buffer = mmap(NULL, FILESIZE, > PROT_READ|PROT_WRITE, > MAP_SHARED|MAP_ANONYMOUS, -1, 0); > > assert(buffer); > fprintf(stderr, "main: data = %p\n", buffer); > > fd = open(argv[1], O_RDWR|O_DIRECT|O_CREAT, 0660); > assert(fd>=0); > > ssize_t r = pwrite(fd, buffer, FILESIZE, 0); > assert(r == FILESIZE); > > pid_t child; > > child = fork(); > assert(child >= 0); > if (child == 0) > write_thread(); > fprintf(stderr, "write_thread pid = %d\n", child); > > child = fork(); > assert(child >= 0); > if (child == 0) > read_thread(argv[1]); > fprintf(stderr, "read_thread pid = %d\n", child); > > child = fork(); > assert(child >= 0); > if (child == 0) > update_thread(); > fprintf(stderr, "update_thread pid = %d\n", child); > > for(;;) > sleep(100*100*100); > > > return 0; > } > > ----- cut --- cut --- cut -- -- Hugo Mills | "There's more than one way to do it" is not a hugo@... carfax.org.uk | commandment. It is a dire warning. http://carfax.org.uk/ | PGP: E2AB1DE4 |
signature.asc
Description: Digital signature