On 09/15/2017 12:18 AM, Hugo Mills wrote:
>    As far as I know, both of these are basically known issues, with no
> good solution, other than not using O_DIRECT. Certainly the first
> issue is one I recognise. The second isn't one I recognise directly,
> but is unsurprising to me.
> 
>    There have been discussions -- including developers -- on this list
> as recent as a month or so ago. The general outcome seems to be that
> any problems with O_DIRECT are not going to be fixed.

I missed this thread; could you point it to me ?


If csum and O_DIRECT are not reliable, why not disallow one of them: i.e allow 
O_DIRECT only on nodatasum files... ZFS (on linux) do not support O_DIRECT at 
all...

In fact most of the applications which benefit from O_DIRECT (it comes to me VM 
e DB), are the ones which need also nodatasum to have good performance.

One of the strongest point of BTRFS was the checksums; but these are not 
effective when the file is opened with O_DIRECT; worse there are cases where 
the file is corrupted and the application got -EIO; not mentioning that the 
dmesg is filled by "csum failed  ...."


> 
>    Hugo.
> 
> On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote:
>> Hi all,
>>
>> I discovered two bugs when O_DIRECT is used...
>>
>> 1) a corrupted file doesn't return -EIO when O_DIRECT is used
>>
>> Normally BTRFS prevents to access the contents of a corrupted file; however 
>> I was able read the content of a corrupted file simply using O_DIRECT
>>
>> # in a new btrfs filesystem, create a file
>> $ sudo mkfs.btrfs -f /dev/sdd5
>> $ mount /dev/sdd5 t
>> $ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd 
>> bs=$((16*1024)) iflag=fullblock count=1024
>>
>> # corrupt the file
>> $ sudo filefrag -v t/abcd 
>> Filesystem type is: 9123683e
>> File size of t/abcd is 16777216 (4096 blocks of 4096 bytes)
>>  ext:     logical_offset:        physical_offset: length:   expected: flags:
>>    0:        0..    3475:      70656..     74131:   3476:            
>>    1:     3476..    4095:      74212..     74831:    620:      74132: 
>> last,eof
>> t/abcd: 2 extents found
>> $ sudo umount t
>> $ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 
>> /dev/sdd5
>> mirror 1 logical 289406976 physical 289406976 device /dev/sdd5
>> corrupting 289406976 copy 1
>>
>> # try to access the file; expected result: -EIO
>> $ sudo mount /dev/sdd5 t
>> $ dd if=t/abcd | hexdump -c | head
>> dd: error reading 't/abcd': Input/output error
>> 0+0 records in
>> 0+0 records out
>> 0 bytes copied, 0.000477413 s, 0.0 kB/s
>>
>>
>> # try to access the file using O_DIRECT; expected result: -EIO, instead the 
>> file is accessible
>> $ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head
>> 0000000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001
>> *
>> 0001000   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
>> 0001010   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
>> 0001020   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
>> 0001030   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
>> 0001040   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
>> 0001050   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
>> 0001060   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
>> 0001070   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
>>
>> (dmesg report the checksum mismatch)
>> [13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off 0 
>> csum 0x98f94189 expected csum 0x0ab6be80 mirror 1
>>
>> Note the first 4k filled by 0x01 !!!!!
>>
>> Conclusion: even if the file is corrupted and normally BTRFS prevent to 
>> access it, using O_DIRECT
>> a) no error is returned to the caller
>> b) instead of the page stored on the disk, it is returned a page filled with 
>> 0x01 (according also with the function __readpage_endio_check())
>>
>>
>> 2) The second bug, is a more severe bug. If during a writing of a buffer 
>> with O_DIRECT, the buffer is updated at the same time by a second process, 
>> the checksum may be incorrect.
>>
>> At the end of the email there is the code which shows the problem: two 
>> process share the same memory: the first write it to the disk, the second 
>> update the buffer continuously. A third process try to read the file, but it 
>> got time to time -EIO
>>
>> If you ran my code in a btrfs filesystem you got a lot of 
>>
>> ERROR: read thread; r = 8192, expected = 16384
>> ERROR: read thread; r = 8192, expected = 16384
>> ERROR: read thread; e = 5 - Input/output error
>> ERROR: read thread; e = 5 - Input/output error
>>
>> The firsts lines are related to a shorter read (which may happens). The 
>> lasts lines are related to a checksum mismatch. The dmesg is filled by lines 
>> like
>> [...]
>> [14873.573547] BTRFS warning (device sdd5): csum failed root 5 ino 259 off 
>> 4096 csum 0x0683c6df expected csum 0x55eb85e6 mirror 1
>> [...]
>>
>> This is definitely a bug. 
>>
>> I think that using O_DIRECT and updating a page at the same time could 
>> happen in a VM. In BTRFS this  could lead to a wrong checksum. The problem 
>> is that if BTRFS detects a checksum error during a reading:
>> a) if O_DIRECT is not used in the read
>>      * -EIO is returned
>> Definitely BAD
>>
>> b) if O_DIRECT is used in the read
>>      * it doesn't return the error to the caller
>>      * it returns a page filled by 0x01 instead of the data from the disk
>> Even worse than a)   
>>
>> Note1: even using O_DIRECT with O_SYNC, the problem still persist.
>> Note2: the man page of open(2) is filled by a lot of notes about O_DIRECT, 
>> but also it stated that using O_DIRECT+fork()+mmap(... MAP_SHARED) is 
>> legally.
>> Note3: even "ZFS on linux" has its trouble with O_DIRECT: if fact ZFS 
>> doesn't support it; see https://github.com/zfsonlinux/zfs/issues/224
>>
>> BR
>> G.Baroncelli
>>
>> ----- cut --- cut --- cut ----
>>
>> #define _GNU_SOURCE
>> #include <stdio.h>
>> #include <stdlib.h>
>> #include <sys/mman.h>
>> #include <assert.h>
>> #include <errno.h>
>> #include <sys/types.h>
>> #include <sys/stat.h>
>> #include <fcntl.h>
>> #include <string.h>
>> #include <unistd.h>
>>
>> #define FILESIZE     (4096*4)
>>
>> int fd;
>> char *buffer = NULL;
>>
>> void read_thread(const char *nf) {
>>      
>>      void *data = mmap(NULL,  FILESIZE,
>>                              PROT_READ|PROT_WRITE, 
>>                              MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
>>      
>>      assert(data);
>>      fprintf(stderr, "read_thread:  data = %p\n", data);
>>      int rfd;
>>      rfd = open(nf, O_RDONLY);
>>      
>>      for(;;) {
>>              ssize_t r = pread(rfd, data, FILESIZE, 0);
>>              if (r < 0) {
>>                      int e = errno;
>>                      fprintf(stderr, "ERROR: read thread; e = %d - %s\n", 
>>                             e, strerror(e));
>>
>>              } else if (r != FILESIZE) {
>>                      fprintf(stderr, "ERROR: read thread; r = %ld, expected 
>> = %d\n", 
>>                             r, FILESIZE);
>>              }
>>      }
>> }
>>
>> void write_thread(void) {
>>
>>      for(;;) {
>>              ssize_t r = pwrite(fd, buffer, FILESIZE, 0);
>>              assert(r == FILESIZE);
>>      }
>> }
>>
>> void update_thread(void) {
>>
>>      for(;;) {
>>              int i;
>>              for (i = 0 ; i < FILESIZE ; i++)
>>                      buffer[i] += i+10;
>>      }
>> }
>>
>>
>> int main(int argc, char **argv) {
>>      
>>      if (argc < 2) {
>>              fprintf(stderr, "usage: %s <fname>\n", argv[0]);
>>              exit(100);
>>      }
>>      
>>      
>>      buffer = mmap(NULL,  FILESIZE,
>>                              PROT_READ|PROT_WRITE, 
>>                              MAP_SHARED|MAP_ANONYMOUS, -1, 0);
>>      
>>      assert(buffer);
>>      fprintf(stderr, "main:  data = %p\n", buffer);
>>      
>>      fd = open(argv[1], O_RDWR|O_DIRECT|O_CREAT, 0660);
>>      assert(fd>=0);
>>      
>>      ssize_t r = pwrite(fd, buffer, FILESIZE, 0);
>>      assert(r == FILESIZE);
>>      
>>      pid_t child;
>>      
>>      child = fork();
>>      assert(child >= 0);
>>      if (child == 0)
>>              write_thread();
>>      fprintf(stderr, "write_thread pid = %d\n", child);
>>      
>>      child = fork();
>>      assert(child >= 0);
>>      if (child == 0)
>>              read_thread(argv[1]);
>>      fprintf(stderr, "read_thread pid = %d\n", child);
>>      
>>      child = fork();
>>      assert(child >= 0);
>>      if (child == 0)
>>              update_thread();
>>      fprintf(stderr, "update_thread pid = %d\n", child);
>>      
>>      for(;;)
>>              sleep(100*100*100);
>>
>>      
>>      return 0;
>> }
>>
>> ----- cut --- cut --- cut -- 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to