On 09/15/2017 10:26 AM, Hugo Mills wrote: > On Fri, Sep 15, 2017 at 08:04:35AM +0200, Goffredo Baroncelli wrote: >> On 09/15/2017 12:18 AM, Hugo Mills wrote: >>> As far as I know, both of these are basically known issues, with no >>> good solution, other than not using O_DIRECT. Certainly the first >>> issue is one I recognise. The second isn't one I recognise directly, >>> but is unsurprising to me. >>> >>> There have been discussions -- including developers -- on this list >>> as recent as a month or so ago. The general outcome seems to be that >>> any problems with O_DIRECT are not going to be fixed. >> >> I missed this thread; could you point it to me ? > > No, you didn't miss it -- you were part of it. :) > > http://www.spinics.net/lists/linux-btrfs/msg68244.html > I hoped that there was more deeper analysis. This messages was more or less an acknowledge than an analysis :(
> Hugo. > >> If csum and O_DIRECT are not reliable, why not disallow one of them: i.e >> allow O_DIRECT only on nodatasum files... ZFS (on linux) do not support >> O_DIRECT at all... >> >> In fact most of the applications which benefit from O_DIRECT (it comes to me >> VM e DB), are the ones which need also nodatasum to have good performance. >> >> One of the strongest point of BTRFS was the checksums; but these are not >> effective when the file is opened with O_DIRECT; worse there are cases where >> the file is corrupted and the application got -EIO; not mentioning that the >> dmesg is filled by "csum failed ...." >> >> >>> >>> Hugo. >>> >>> On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote: >>>> Hi all, >>>> >>>> I discovered two bugs when O_DIRECT is used... >>>> >>>> 1) a corrupted file doesn't return -EIO when O_DIRECT is used >>>> >>>> Normally BTRFS prevents to access the contents of a corrupted file; >>>> however I was able read the content of a corrupted file simply using >>>> O_DIRECT >>>> >>>> # in a new btrfs filesystem, create a file >>>> $ sudo mkfs.btrfs -f /dev/sdd5 >>>> $ mount /dev/sdd5 t >>>> $ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd >>>> bs=$((16*1024)) iflag=fullblock count=1024 >>>> >>>> # corrupt the file >>>> $ sudo filefrag -v t/abcd >>>> Filesystem type is: 9123683e >>>> File size of t/abcd is 16777216 (4096 blocks of 4096 bytes) >>>> ext: logical_offset: physical_offset: length: expected: >>>> flags: >>>> 0: 0.. 3475: 70656.. 74131: 3476: >>>> 1: 3476.. 4095: 74212.. 74831: 620: 74132: >>>> last,eof >>>> t/abcd: 2 extents found >>>> $ sudo umount t >>>> $ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 >>>> /dev/sdd5 >>>> mirror 1 logical 289406976 physical 289406976 device /dev/sdd5 >>>> corrupting 289406976 copy 1 >>>> >>>> # try to access the file; expected result: -EIO >>>> $ sudo mount /dev/sdd5 t >>>> $ dd if=t/abcd | hexdump -c | head >>>> dd: error reading 't/abcd': Input/output error >>>> 0+0 records in >>>> 0+0 records out >>>> 0 bytes copied, 0.000477413 s, 0.0 kB/s >>>> >>>> >>>> # try to access the file using O_DIRECT; expected result: -EIO, instead >>>> the file is accessible >>>> $ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head >>>> 0000000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 >>>> * >>>> 0001000 f g a b c e f g a b c e f g a b >>>> 0001010 c e f g a b c e f g a b c e f g >>>> 0001020 a b c e f g a b c e f g a b c e >>>> 0001030 f g a b c e f g a b c e f g a b >>>> 0001040 c e f g a b c e f g a b c e f g >>>> 0001050 a b c e f g a b c e f g a b c e >>>> 0001060 f g a b c e f g a b c e f g a b >>>> 0001070 c e f g a b c e f g a b c e f g >>>> >>>> (dmesg report the checksum mismatch) >>>> [13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off >>>> 0 csum 0x98f94189 expected csum 0x0ab6be80 mirror 1 >>>> >>>> Note the first 4k filled by 0x01 !!!!! >>>> >>>> Conclusion: even if the file is corrupted and normally BTRFS prevent to >>>> access it, using O_DIRECT >>>> a) no error is returned to the caller >>>> b) instead of the page stored on the disk, it is returned a page filled >>>> with 0x01 (according also with the function __readpage_endio_check()) >>>> >>>> >>>> 2) The second bug, is a more severe bug. If during a writing of a buffer >>>> with O_DIRECT, the buffer is updated at the same time by a second process, >>>> the checksum may be incorrect. >>>> >>>> At the end of the email there is the code which shows the problem: two >>>> process share the same memory: the first write it to the disk, the second >>>> update the buffer continuously. A third process try to read the file, but >>>> it got time to time -EIO >>>> >>>> If you ran my code in a btrfs filesystem you got a lot of >>>> >>>> ERROR: read thread; r = 8192, expected = 16384 >>>> ERROR: read thread; r = 8192, expected = 16384 >>>> ERROR: read thread; e = 5 - Input/output error >>>> ERROR: read thread; e = 5 - Input/output error >>>> >>>> The firsts lines are related to a shorter read (which may happens). The >>>> lasts lines are related to a checksum mismatch. The dmesg is filled by >>>> lines like >>>> [...] >>>> [14873.573547] BTRFS warning (device sdd5): csum failed root 5 ino 259 off >>>> 4096 csum 0x0683c6df expected csum 0x55eb85e6 mirror 1 >>>> [...] >>>> >>>> This is definitely a bug. >>>> >>>> I think that using O_DIRECT and updating a page at the same time could >>>> happen in a VM. In BTRFS this could lead to a wrong checksum. The problem >>>> is that if BTRFS detects a checksum error during a reading: >>>> a) if O_DIRECT is not used in the read >>>> * -EIO is returned >>>> Definitely BAD >>>> >>>> b) if O_DIRECT is used in the read >>>> * it doesn't return the error to the caller >>>> * it returns a page filled by 0x01 instead of the data from the disk >>>> Even worse than a) >>>> >>>> Note1: even using O_DIRECT with O_SYNC, the problem still persist. >>>> Note2: the man page of open(2) is filled by a lot of notes about O_DIRECT, >>>> but also it stated that using O_DIRECT+fork()+mmap(... MAP_SHARED) is >>>> legally. >>>> Note3: even "ZFS on linux" has its trouble with O_DIRECT: if fact ZFS >>>> doesn't support it; see https://github.com/zfsonlinux/zfs/issues/224 >>>> >>>> BR >>>> G.Baroncelli >>>> >>>> ----- cut --- cut --- cut ---- >>>> >>>> #define _GNU_SOURCE >>>> #include <stdio.h> >>>> #include <stdlib.h> >>>> #include <sys/mman.h> >>>> #include <assert.h> >>>> #include <errno.h> >>>> #include <sys/types.h> >>>> #include <sys/stat.h> >>>> #include <fcntl.h> >>>> #include <string.h> >>>> #include <unistd.h> >>>> >>>> #define FILESIZE (4096*4) >>>> >>>> int fd; >>>> char *buffer = NULL; >>>> >>>> void read_thread(const char *nf) { >>>> >>>> void *data = mmap(NULL, FILESIZE, >>>> PROT_READ|PROT_WRITE, >>>> MAP_PRIVATE|MAP_ANONYMOUS, -1, 0); >>>> >>>> assert(data); >>>> fprintf(stderr, "read_thread: data = %p\n", data); >>>> int rfd; >>>> rfd = open(nf, O_RDONLY); >>>> >>>> for(;;) { >>>> ssize_t r = pread(rfd, data, FILESIZE, 0); >>>> if (r < 0) { >>>> int e = errno; >>>> fprintf(stderr, "ERROR: read thread; e = %d - %s\n", >>>> e, strerror(e)); >>>> >>>> } else if (r != FILESIZE) { >>>> fprintf(stderr, "ERROR: read thread; r = %ld, expected >>>> = %d\n", >>>> r, FILESIZE); >>>> } >>>> } >>>> } >>>> >>>> void write_thread(void) { >>>> >>>> for(;;) { >>>> ssize_t r = pwrite(fd, buffer, FILESIZE, 0); >>>> assert(r == FILESIZE); >>>> } >>>> } >>>> >>>> void update_thread(void) { >>>> >>>> for(;;) { >>>> int i; >>>> for (i = 0 ; i < FILESIZE ; i++) >>>> buffer[i] += i+10; >>>> } >>>> } >>>> >>>> >>>> int main(int argc, char **argv) { >>>> >>>> if (argc < 2) { >>>> fprintf(stderr, "usage: %s <fname>\n", argv[0]); >>>> exit(100); >>>> } >>>> >>>> >>>> buffer = mmap(NULL, FILESIZE, >>>> PROT_READ|PROT_WRITE, >>>> MAP_SHARED|MAP_ANONYMOUS, -1, 0); >>>> >>>> assert(buffer); >>>> fprintf(stderr, "main: data = %p\n", buffer); >>>> >>>> fd = open(argv[1], O_RDWR|O_DIRECT|O_CREAT, 0660); >>>> assert(fd>=0); >>>> >>>> ssize_t r = pwrite(fd, buffer, FILESIZE, 0); >>>> assert(r == FILESIZE); >>>> >>>> pid_t child; >>>> >>>> child = fork(); >>>> assert(child >= 0); >>>> if (child == 0) >>>> write_thread(); >>>> fprintf(stderr, "write_thread pid = %d\n", child); >>>> >>>> child = fork(); >>>> assert(child >= 0); >>>> if (child == 0) >>>> read_thread(argv[1]); >>>> fprintf(stderr, "read_thread pid = %d\n", child); >>>> >>>> child = fork(); >>>> assert(child >= 0); >>>> if (child == 0) >>>> update_thread(); >>>> fprintf(stderr, "update_thread pid = %d\n", child); >>>> >>>> for(;;) >>>> sleep(100*100*100); >>>> >>>> >>>> return 0; >>>> } >>>> >>>> ----- cut --- cut --- cut -- >>> >> >> > -- gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it> Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html