Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data
On Fri, Sep 15, 2017 at 08:57:41PM +0200, Goffredo Baroncelli wrote: > On 09/15/2017 07:01 PM, Liu Bo wrote: > >> Conclusion: even if the file is corrupted and normally BTRFS prevent to > >> access it, using O_DIRECT > >> a) no error is returned to the caller > >> b) instead of the page stored on the disk, it is returned a page filled > >> with 0x01 (according also with the function __readpage_endio_check()) > >> > > We've queued a patch[1] to fix it, the behavior was introduced a long > > time ago and we guess it was for testing purpose. > > > > With the patch, you'll get -EIO from dio read, which is consistent > > with buffered read. > > I tried your patch, but it doesn't seem to address this issue. I still got > -EIO with a normal file access, and a page filled by 0x01 with O_DIRECT. Oh, nice, thanks for trying it out. I see it now, the regression is actually introduced by a cleanup patch, commit 4246a0b63bd8f56a1469b12eafeb875b1041a451 block: add a bi_error field to struct bio It overrides our -EIO which should be returned by dio_end_io(). But I think we need further clarification on the correct behavior, ie. If a file got corrupted (wrong) data, both buffer'd read and O_DIRECT read will return -EIO to userspace, and that parcularly corrupted page will be reset with 0x01 for not exposing data to userspace. No exception should exist. > > If I understand your patch, this address the case where there is no any > checksum, which is not this case. The checksum exists and it is valid. It is > the data wrong. > Not for the case of nodatacsum, it's actually for the case where some checksum somehow got zero'd, and btrfs was not returning -EIO even if it got a mismatch. Anyway, your problem is surely a regression, thanks for reporting it. Thanks, -liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data
On 09/15/2017 07:01 PM, Liu Bo wrote: >> Conclusion: even if the file is corrupted and normally BTRFS prevent to >> access it, using O_DIRECT >> a) no error is returned to the caller >> b) instead of the page stored on the disk, it is returned a page filled with >> 0x01 (according also with the function __readpage_endio_check()) >> > We've queued a patch[1] to fix it, the behavior was introduced a long > time ago and we guess it was for testing purpose. > > With the patch, you'll get -EIO from dio read, which is consistent > with buffered read. I tried your patch, but it doesn't seem to address this issue. I still got -EIO with a normal file access, and a page filled by 0x01 with O_DIRECT. If I understand your patch, this address the case where there is no any checksum, which is not this case. The checksum exists and it is valid. It is the data wrong. BR G.Baroncelli > > [1], > https://patchwork.kernel.org/patch/9835505/ > Btrfs: report errors when checksum is not found > > Thanks, > > -liubo > -- gpg @keyserver.linux.it: Goffredo Baroncelli Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data
On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote: > Hi all, > > I discovered two bugs when O_DIRECT is used... > > 1) a corrupted file doesn't return -EIO when O_DIRECT is used > > Normally BTRFS prevents to access the contents of a corrupted file; however I > was able read the content of a corrupted file simply using O_DIRECT > > # in a new btrfs filesystem, create a file > $ sudo mkfs.btrfs -f /dev/sdd5 > $ mount /dev/sdd5 t > $ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd > bs=$((16*1024)) iflag=fullblock count=1024 > > # corrupt the file > $ sudo filefrag -v t/abcd > Filesystem type is: 9123683e > File size of t/abcd is 16777216 (4096 blocks of 4096 bytes) > ext: logical_offset:physical_offset: length: expected: flags: >0:0..3475: 70656.. 74131: 3476: >1: 3476..4095: 74212.. 74831:620: 74132: last,eof > t/abcd: 2 extents found > $ sudo umount t > $ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 > /dev/sdd5 > mirror 1 logical 289406976 physical 289406976 device /dev/sdd5 > corrupting 289406976 copy 1 > > # try to access the file; expected result: -EIO > $ sudo mount /dev/sdd5 t > $ dd if=t/abcd | hexdump -c | head > dd: error reading 't/abcd': Input/output error > 0+0 records in > 0+0 records out > 0 bytes copied, 0.000477413 s, 0.0 kB/s > > > # try to access the file using O_DIRECT; expected result: -EIO, instead the > file is accessible > $ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head > 000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 > * > 0001000 f g a b c e f g a b c e f g a b > 0001010 c e f g a b c e f g a b c e f g > 0001020 a b c e f g a b c e f g a b c e > 0001030 f g a b c e f g a b c e f g a b > 0001040 c e f g a b c e f g a b c e f g > 0001050 a b c e f g a b c e f g a b c e > 0001060 f g a b c e f g a b c e f g a b > 0001070 c e f g a b c e f g a b c e f g > > (dmesg report the checksum mismatch) > [13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off 0 > csum 0x98f94189 expected csum 0x0ab6be80 mirror 1 > > Note the first 4k filled by 0x01 ! > > Conclusion: even if the file is corrupted and normally BTRFS prevent to > access it, using O_DIRECT > a) no error is returned to the caller > b) instead of the page stored on the disk, it is returned a page filled with > 0x01 (according also with the function __readpage_endio_check()) > We've queued a patch[1] to fix it, the behavior was introduced a long time ago and we guess it was for testing purpose. With the patch, you'll get -EIO from dio read, which is consistent with buffered read. [1], https://patchwork.kernel.org/patch/9835505/ Btrfs: report errors when checksum is not found Thanks, -liubo > > 2) The second bug, is a more severe bug. If during a writing of a buffer with > O_DIRECT, the buffer is updated at the same time by a second process, the > checksum may be incorrect. > > At the end of the email there is the code which shows the problem: two > process share the same memory: the first write it to the disk, the second > update the buffer continuously. A third process try to read the file, but it > got time to time -EIO > > If you ran my code in a btrfs filesystem you got a lot of > > ERROR: read thread; r = 8192, expected = 16384 > ERROR: read thread; r = 8192, expected = 16384 > ERROR: read thread; e = 5 - Input/output error > ERROR: read thread; e = 5 - Input/output error > > The firsts lines are related to a shorter read (which may happens). The lasts > lines are related to a checksum mismatch. The dmesg is filled by lines like > [...] > [14873.573547] BTRFS warning (device sdd5): csum failed root 5 ino 259 off > 4096 csum 0x0683c6df expected csum 0x55eb85e6 mirror 1 > [...] > > This is definitely a bug. > > I think that using O_DIRECT and updating a page at the same time could happen > in a VM. In BTRFS this could lead to a wrong checksum. The problem is that > if BTRFS detects a checksum error during a reading: > a) if O_DIRECT is not used in the read > * -EIO is returned > Definitely BAD > > b) if O_DIRECT is used in the read > * it doesn't return the error to the caller > * it returns a page filled by 0x01 instead of the data from the disk > Even worse than a) > > Note1: even using O_DIRECT with O_SYNC, the problem still persist. > Note2: the man page of open(2) is filled by a lot of notes about O_DIRECT, > but also it stated that using O_DIRECT+fork()+mmap(... MAP_SHARED) is legally. > Note3: even "ZFS on linux" has its trouble with O_DIRECT: if fact ZFS doesn't > support it; see
Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data
On 09/15/2017 11:50 AM, Marat Khalili wrote: > May I state my user's point of view: > > I know one applications that uses O_DIRECT, and it is subtly broken > on BTRFS. I know no applications that use O_DIRECT and are not > broken. (Really more statistics would help here, probably some exist > that provably work.) According to developers making O_DIRECT work on > BTRFS is difficult if not impossible. Isn't it time to disable > O_DIRECT like ZFS does AFAIU? Data safety is certainly more important > than performance gain it may or may not give some applications. I agree with you, but it should be sufficient to disable O_DIRECT when the file has the checksums. I was unable to observe filesystem corruption when the checksums are disabled. The use cases where O_DIRECT is useful are VM and databasesM and in these cases also nodatacsum/nodatacow are useful. BR G.Baroncelli > > -- > > With Best Regards, Marat Khalili > > -- To unsubscribe from this list: send the line "unsubscribe > linux-btrfs" in the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- gpg @keyserver.linux.it: Goffredo Baroncelli Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data
On 09/15/2017 07:01 PM, Andrei Borzenkov wrote: > 15.09.2017 08:50, Goffredo Baroncelli пишет: >> On 09/15/2017 05:55 AM, Andrei Borzenkov wrote: >>> 15.09.2017 01:00, Goffredo Baroncelli пишет: 2) The second bug, is a more severe bug. If during a writing of a buffer with O_DIRECT, the buffer is updated at the same time by a second process, the checksum may be incorrect. >>> >>> Is it btrfs specific ? If buffer is updated before it was actually >>> consumed by kernel, this likely means data corruption on any filesystem. >> >> I don't see any corruption in other FS. The fact that application push to >> filesystem garbage, doesn't allow the filesystem to be corrupted. > > I did not say "filesystem corruption", I did it.. because there is a filesystem corruption only in BTRFS . I agree with you that a concurrent access between different processes to the same page which is in writing is bad. However only in BTRFS this lead to a filesystem corruption. This is the first problem. The second one is that BTRFS doesn't warn the user about that (read more below), however it returns wrong data (a page filled by 0x01) > I said "data corruption". And, are you sure that there is data corruption ? don't have evidence of that in other filesystem. Proof is that there are a lot of problem in BTRFS when O_DIRECT is used, but not in other filesystem. If this is due to a better serialization or is due to the application touch the page which is going to written but doesn't care about the data I don't know. Anyway from open(2) man page: [...] O_DIRECT I/Os should never be run concurrently with the fork(2)[...] if the memory buffer is a private mapping (i.e., any mapping created with the mmap(2) MAP_PRIVATE flag. [...]This restriction does not apply when the memory buffer for the O_DIRECT I/Os was created using shmat(2) or mmap(2) with the MAP_SHARED flag. [...] So it is a valid and accepted use case O_DIRECT+mmap(MAP_SHARED)+fork() > >> In this case the filesystem became corrupted, because another application >> which try to read the data (without O_DIRECT) may got -EIO. >> > > No. *Data* on this filesystem was corrupted and luckily btrfs makes you > aware of it. Please re-read my messages. BTRFS doesn't make the user aware of it: if the data is read with O_DIRECT, it doesn't return an error in case of checksum mismatch (this is the second problem) > On different filesystem you still may have the same data > corruption, but silent. > >> I repeat, the problem is a data race when the data is in the FS camp, and >> the kernel does wrong checksum. >> > > Of course it is race. But again - I expect that when pwrite() returns it > means data buffer can be reused. Otherwise I cannot see how O_DIRECT can > be sensibly used at all. In this case you need to demonstrate that data > corruption happens after pwrite() returns - this makes it btrfs issue > indeed. If data corruption happens while thread is waiting for pwrite() > to return, I say this is expected behavior and application fault - it > need to protect against concurrent write and modification. > >> >> IMHO, BTRFS should disallow O_DIRECT (which is the same thing that does ZFS >> on linux); I think that it could be allowed only for nodatasum files. >> >>> I.e. there should be clear indication from kernel that buffer can be >>> reused by application, in your example - when pwrite returns. So when >>> data corruption happens - during pwrite or after? >>> If data is corrupted >>> during pwrite, it is arguably application fault - it should disallow >>> concurrent access. >> >> >> >> >> >>> >> >> > > -- gpg @keyserver.linux.it: Goffredo Baroncelli Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data
On 09/15/2017 10:26 AM, Hugo Mills wrote: > On Fri, Sep 15, 2017 at 08:04:35AM +0200, Goffredo Baroncelli wrote: >> On 09/15/2017 12:18 AM, Hugo Mills wrote: >>>As far as I know, both of these are basically known issues, with no >>> good solution, other than not using O_DIRECT. Certainly the first >>> issue is one I recognise. The second isn't one I recognise directly, >>> but is unsurprising to me. >>> >>>There have been discussions -- including developers -- on this list >>> as recent as a month or so ago. The general outcome seems to be that >>> any problems with O_DIRECT are not going to be fixed. >> >> I missed this thread; could you point it to me ? > >No, you didn't miss it -- you were part of it. :) > >http://www.spinics.net/lists/linux-btrfs/msg68244.html > I hoped that there was more deeper analysis. This messages was more or less an acknowledge than an analysis :( >Hugo. > >> If csum and O_DIRECT are not reliable, why not disallow one of them: i.e >> allow O_DIRECT only on nodatasum files... ZFS (on linux) do not support >> O_DIRECT at all... >> >> In fact most of the applications which benefit from O_DIRECT (it comes to me >> VM e DB), are the ones which need also nodatasum to have good performance. >> >> One of the strongest point of BTRFS was the checksums; but these are not >> effective when the file is opened with O_DIRECT; worse there are cases where >> the file is corrupted and the application got -EIO; not mentioning that the >> dmesg is filled by "csum failed " >> >> >>> >>>Hugo. >>> >>> On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote: Hi all, I discovered two bugs when O_DIRECT is used... 1) a corrupted file doesn't return -EIO when O_DIRECT is used Normally BTRFS prevents to access the contents of a corrupted file; however I was able read the content of a corrupted file simply using O_DIRECT # in a new btrfs filesystem, create a file $ sudo mkfs.btrfs -f /dev/sdd5 $ mount /dev/sdd5 t $ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd bs=$((16*1024)) iflag=fullblock count=1024 # corrupt the file $ sudo filefrag -v t/abcd Filesystem type is: 9123683e File size of t/abcd is 16777216 (4096 blocks of 4096 bytes) ext: logical_offset:physical_offset: length: expected: flags: 0:0..3475: 70656.. 74131: 3476: 1: 3476..4095: 74212.. 74831:620: 74132: last,eof t/abcd: 2 extents found $ sudo umount t $ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 /dev/sdd5 mirror 1 logical 289406976 physical 289406976 device /dev/sdd5 corrupting 289406976 copy 1 # try to access the file; expected result: -EIO $ sudo mount /dev/sdd5 t $ dd if=t/abcd | hexdump -c | head dd: error reading 't/abcd': Input/output error 0+0 records in 0+0 records out 0 bytes copied, 0.000477413 s, 0.0 kB/s # try to access the file using O_DIRECT; expected result: -EIO, instead the file is accessible $ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head 000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 * 0001000 f g a b c e f g a b c e f g a b 0001010 c e f g a b c e f g a b c e f g 0001020 a b c e f g a b c e f g a b c e 0001030 f g a b c e f g a b c e f g a b 0001040 c e f g a b c e f g a b c e f g 0001050 a b c e f g a b c e f g a b c e 0001060 f g a b c e f g a b c e f g a b 0001070 c e f g a b c e f g a b c e f g (dmesg report the checksum mismatch) [13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off 0 csum 0x98f94189 expected csum 0x0ab6be80 mirror 1 Note the first 4k filled by 0x01 ! Conclusion: even if the file is corrupted and normally BTRFS prevent to access it, using O_DIRECT a) no error is returned to the caller b) instead of the page stored on the disk, it is returned a page filled with 0x01 (according also with the function __readpage_endio_check()) 2) The second bug, is a more severe bug. If during a writing of a buffer with O_DIRECT, the buffer is updated at the same time by a second process, the checksum may be incorrect. At the end of the email there is the code which shows the problem: two process share the same memory: the first write it to the disk, the second update the buffer continuously. A third process try to read the file, but it got time to
Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data
15.09.2017 08:50, Goffredo Baroncelli пишет: > On 09/15/2017 05:55 AM, Andrei Borzenkov wrote: >> 15.09.2017 01:00, Goffredo Baroncelli пишет: >>> >>> 2) The second bug, is a more severe bug. If during a writing of a buffer >>> with O_DIRECT, the buffer is updated at the same time by a second process, >>> the checksum may be incorrect. >>> >> >> Is it btrfs specific ? If buffer is updated before it was actually >> consumed by kernel, this likely means data corruption on any filesystem. > > I don't see any corruption in other FS. The fact that application push to > filesystem garbage, doesn't allow the filesystem to be corrupted. I did not say "filesystem corruption", I said "data corruption". > In this case the filesystem became corrupted, because another application > which try to read the data (without O_DIRECT) may got -EIO. > No. *Data* on this filesystem was corrupted and luckily btrfs makes you aware of it. On different filesystem you still may have the same data corruption, but silent. > I repeat, the problem is a data race when the data is in the FS camp, and the > kernel does wrong checksum. > Of course it is race. But again - I expect that when pwrite() returns it means data buffer can be reused. Otherwise I cannot see how O_DIRECT can be sensibly used at all. In this case you need to demonstrate that data corruption happens after pwrite() returns - this makes it btrfs issue indeed. If data corruption happens while thread is waiting for pwrite() to return, I say this is expected behavior and application fault - it need to protect against concurrent write and modification. > > IMHO, BTRFS should disallow O_DIRECT (which is the same thing that does ZFS > on linux); I think that it could be allowed only for nodatasum files. > >> I.e. there should be clear indication from kernel that buffer can be >> reused by application, in your example - when pwrite returns. So when >> data corruption happens - during pwrite or after? >> If data is corrupted >> during pwrite, it is arguably application fault - it should disallow >> concurrent access. > > > > > >> > > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data
> -Original Message- > From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs- > ow...@vger.kernel.org] On Behalf Of Marat Khalili > Sent: Friday, 15 September 2017 7:50 PM > To: Hugo Mills; Goffredo Baroncelli > ; linux-btrfs > Subject: Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and > wrong data > > May I state my user's point of view: > > I know one applications that uses O_DIRECT, and it is subtly broken on BTRFS. > I know no applications that use O_DIRECT and are not broken. > (Really more statistics would help here, probably some exist that provably > work.) According to developers making O_DIRECT work on BTRFS is difficult if > not impossible. Isn't it time to disable O_DIRECT like ZFS does AFAIU? Data > safety is certainly more important than performance gain it may or may not > give some applications. I agree - I've had trouble with this before because I didn't understand the consequences of using it. It would be better to hide it behind a mount option or similar IMHO. Paul.
Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data
May I state my user's point of view: I know one applications that uses O_DIRECT, and it is subtly broken on BTRFS. I know no applications that use O_DIRECT and are not broken. (Really more statistics would help here, probably some exist that provably work.) According to developers making O_DIRECT work on BTRFS is difficult if not impossible. Isn't it time to disable O_DIRECT like ZFS does AFAIU? Data safety is certainly more important than performance gain it may or may not give some applications. -- With Best Regards, Marat Khalili -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data
On Fri, Sep 15, 2017 at 08:04:35AM +0200, Goffredo Baroncelli wrote: > On 09/15/2017 12:18 AM, Hugo Mills wrote: > >As far as I know, both of these are basically known issues, with no > > good solution, other than not using O_DIRECT. Certainly the first > > issue is one I recognise. The second isn't one I recognise directly, > > but is unsurprising to me. > > > >There have been discussions -- including developers -- on this list > > as recent as a month or so ago. The general outcome seems to be that > > any problems with O_DIRECT are not going to be fixed. > > I missed this thread; could you point it to me ? No, you didn't miss it -- you were part of it. :) http://www.spinics.net/lists/linux-btrfs/msg68244.html Hugo. > If csum and O_DIRECT are not reliable, why not disallow one of them: i.e > allow O_DIRECT only on nodatasum files... ZFS (on linux) do not support > O_DIRECT at all... > > In fact most of the applications which benefit from O_DIRECT (it comes to me > VM e DB), are the ones which need also nodatasum to have good performance. > > One of the strongest point of BTRFS was the checksums; but these are not > effective when the file is opened with O_DIRECT; worse there are cases where > the file is corrupted and the application got -EIO; not mentioning that the > dmesg is filled by "csum failed " > > > > > >Hugo. > > > > On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote: > >> Hi all, > >> > >> I discovered two bugs when O_DIRECT is used... > >> > >> 1) a corrupted file doesn't return -EIO when O_DIRECT is used > >> > >> Normally BTRFS prevents to access the contents of a corrupted file; > >> however I was able read the content of a corrupted file simply using > >> O_DIRECT > >> > >> # in a new btrfs filesystem, create a file > >> $ sudo mkfs.btrfs -f /dev/sdd5 > >> $ mount /dev/sdd5 t > >> $ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd > >> bs=$((16*1024)) iflag=fullblock count=1024 > >> > >> # corrupt the file > >> $ sudo filefrag -v t/abcd > >> Filesystem type is: 9123683e > >> File size of t/abcd is 16777216 (4096 blocks of 4096 bytes) > >> ext: logical_offset:physical_offset: length: expected: > >> flags: > >>0:0..3475: 70656.. 74131: 3476: > >>1: 3476..4095: 74212.. 74831:620: 74132: > >> last,eof > >> t/abcd: 2 extents found > >> $ sudo umount t > >> $ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 > >> /dev/sdd5 > >> mirror 1 logical 289406976 physical 289406976 device /dev/sdd5 > >> corrupting 289406976 copy 1 > >> > >> # try to access the file; expected result: -EIO > >> $ sudo mount /dev/sdd5 t > >> $ dd if=t/abcd | hexdump -c | head > >> dd: error reading 't/abcd': Input/output error > >> 0+0 records in > >> 0+0 records out > >> 0 bytes copied, 0.000477413 s, 0.0 kB/s > >> > >> > >> # try to access the file using O_DIRECT; expected result: -EIO, instead > >> the file is accessible > >> $ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head > >> 000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 > >> * > >> 0001000 f g a b c e f g a b c e f g a b > >> 0001010 c e f g a b c e f g a b c e f g > >> 0001020 a b c e f g a b c e f g a b c e > >> 0001030 f g a b c e f g a b c e f g a b > >> 0001040 c e f g a b c e f g a b c e f g > >> 0001050 a b c e f g a b c e f g a b c e > >> 0001060 f g a b c e f g a b c e f g a b > >> 0001070 c e f g a b c e f g a b c e f g > >> > >> (dmesg report the checksum mismatch) > >> [13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off > >> 0 csum 0x98f94189 expected csum 0x0ab6be80 mirror 1 > >> > >> Note the first 4k filled by 0x01 ! > >> > >> Conclusion: even if the file is corrupted and normally BTRFS prevent to > >> access it, using O_DIRECT > >> a) no error is returned to the caller > >> b) instead of the page stored on the disk, it is returned a page filled > >> with 0x01 (according also with the function __readpage_endio_check()) > >> > >> > >> 2) The second bug, is a more severe bug. If during a writing of a buffer > >> with O_DIRECT, the buffer is updated at the same time by a second process, > >> the checksum may be incorrect. > >> > >> At the end of the email there is the code which shows the problem: two > >> process share the same memory: the first write it to the disk, the second > >> update the buffer continuously. A third process try to read the file, but > >> it got time to time -EIO > >> > >> If you ran my code in a btrfs filesystem you got a lot of > >> > >> ERROR: read thread; r = 8192, expected = 16384 > >> ERROR: read thread; r = 8192, expected =
Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data
On 09/15/2017 12:18 AM, Hugo Mills wrote: >As far as I know, both of these are basically known issues, with no > good solution, other than not using O_DIRECT. Certainly the first > issue is one I recognise. The second isn't one I recognise directly, > but is unsurprising to me. > >There have been discussions -- including developers -- on this list > as recent as a month or so ago. The general outcome seems to be that > any problems with O_DIRECT are not going to be fixed. I missed this thread; could you point it to me ? If csum and O_DIRECT are not reliable, why not disallow one of them: i.e allow O_DIRECT only on nodatasum files... ZFS (on linux) do not support O_DIRECT at all... In fact most of the applications which benefit from O_DIRECT (it comes to me VM e DB), are the ones which need also nodatasum to have good performance. One of the strongest point of BTRFS was the checksums; but these are not effective when the file is opened with O_DIRECT; worse there are cases where the file is corrupted and the application got -EIO; not mentioning that the dmesg is filled by "csum failed " > >Hugo. > > On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote: >> Hi all, >> >> I discovered two bugs when O_DIRECT is used... >> >> 1) a corrupted file doesn't return -EIO when O_DIRECT is used >> >> Normally BTRFS prevents to access the contents of a corrupted file; however >> I was able read the content of a corrupted file simply using O_DIRECT >> >> # in a new btrfs filesystem, create a file >> $ sudo mkfs.btrfs -f /dev/sdd5 >> $ mount /dev/sdd5 t >> $ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd >> bs=$((16*1024)) iflag=fullblock count=1024 >> >> # corrupt the file >> $ sudo filefrag -v t/abcd >> Filesystem type is: 9123683e >> File size of t/abcd is 16777216 (4096 blocks of 4096 bytes) >> ext: logical_offset:physical_offset: length: expected: flags: >>0:0..3475: 70656.. 74131: 3476: >>1: 3476..4095: 74212.. 74831:620: 74132: >> last,eof >> t/abcd: 2 extents found >> $ sudo umount t >> $ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 >> /dev/sdd5 >> mirror 1 logical 289406976 physical 289406976 device /dev/sdd5 >> corrupting 289406976 copy 1 >> >> # try to access the file; expected result: -EIO >> $ sudo mount /dev/sdd5 t >> $ dd if=t/abcd | hexdump -c | head >> dd: error reading 't/abcd': Input/output error >> 0+0 records in >> 0+0 records out >> 0 bytes copied, 0.000477413 s, 0.0 kB/s >> >> >> # try to access the file using O_DIRECT; expected result: -EIO, instead the >> file is accessible >> $ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head >> 000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 >> * >> 0001000 f g a b c e f g a b c e f g a b >> 0001010 c e f g a b c e f g a b c e f g >> 0001020 a b c e f g a b c e f g a b c e >> 0001030 f g a b c e f g a b c e f g a b >> 0001040 c e f g a b c e f g a b c e f g >> 0001050 a b c e f g a b c e f g a b c e >> 0001060 f g a b c e f g a b c e f g a b >> 0001070 c e f g a b c e f g a b c e f g >> >> (dmesg report the checksum mismatch) >> [13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off 0 >> csum 0x98f94189 expected csum 0x0ab6be80 mirror 1 >> >> Note the first 4k filled by 0x01 ! >> >> Conclusion: even if the file is corrupted and normally BTRFS prevent to >> access it, using O_DIRECT >> a) no error is returned to the caller >> b) instead of the page stored on the disk, it is returned a page filled with >> 0x01 (according also with the function __readpage_endio_check()) >> >> >> 2) The second bug, is a more severe bug. If during a writing of a buffer >> with O_DIRECT, the buffer is updated at the same time by a second process, >> the checksum may be incorrect. >> >> At the end of the email there is the code which shows the problem: two >> process share the same memory: the first write it to the disk, the second >> update the buffer continuously. A third process try to read the file, but it >> got time to time -EIO >> >> If you ran my code in a btrfs filesystem you got a lot of >> >> ERROR: read thread; r = 8192, expected = 16384 >> ERROR: read thread; r = 8192, expected = 16384 >> ERROR: read thread; e = 5 - Input/output error >> ERROR: read thread; e = 5 - Input/output error >> >> The firsts lines are related to a shorter read (which may happens). The >> lasts lines are related to a checksum mismatch. The dmesg is filled by lines >> like >> [...] >> [14873.573547] BTRFS warning (device sdd5): csum failed root 5 ino 259 off >> 4096 csum 0x0683c6df expected csum 0x55eb85e6
Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data
On 09/15/2017 05:55 AM, Andrei Borzenkov wrote: > 15.09.2017 01:00, Goffredo Baroncelli пишет: >> >> 2) The second bug, is a more severe bug. If during a writing of a buffer >> with O_DIRECT, the buffer is updated at the same time by a second process, >> the checksum may be incorrect. >> > > Is it btrfs specific ? If buffer is updated before it was actually > consumed by kernel, this likely means data corruption on any filesystem. I don't see any corruption in other FS. The fact that application push to filesystem garbage, doesn't allow the filesystem to be corrupted. In this case the filesystem became corrupted, because another application which try to read the data (without O_DIRECT) may got -EIO. I repeat, the problem is a data race when the data is in the FS camp, and the kernel does wrong checksum. IMHO, BTRFS should disallow O_DIRECT (which is the same thing that does ZFS on linux); I think that it could be allowed only for nodatasum files. > I.e. there should be clear indication from kernel that buffer can be > reused by application, in your example - when pwrite returns. So when > data corruption happens - during pwrite or after? > If data is corrupted > during pwrite, it is arguably application fault - it should disallow > concurrent access. > -- gpg @keyserver.linux.it: Goffredo Baroncelli Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data
15.09.2017 01:00, Goffredo Baroncelli пишет: > > 2) The second bug, is a more severe bug. If during a writing of a buffer with > O_DIRECT, the buffer is updated at the same time by a second process, the > checksum may be incorrect. > Is it btrfs specific? If buffer is updated before it was actually consumed by kernel, this likely means data corruption on any filesystem. I.e. there should be clear indication from kernel that buffer can be reused by application, in your example - when pwrite returns. So when data corruption happens - during pwrite or after? If data is corrupted during pwrite, it is arguably application fault - it should disallow concurrent access. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data
As far as I know, both of these are basically known issues, with no good solution, other than not using O_DIRECT. Certainly the first issue is one I recognise. The second isn't one I recognise directly, but is unsurprising to me. There have been discussions -- including developers -- on this list as recent as a month or so ago. The general outcome seems to be that any problems with O_DIRECT are not going to be fixed. Hugo. On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote: > Hi all, > > I discovered two bugs when O_DIRECT is used... > > 1) a corrupted file doesn't return -EIO when O_DIRECT is used > > Normally BTRFS prevents to access the contents of a corrupted file; however I > was able read the content of a corrupted file simply using O_DIRECT > > # in a new btrfs filesystem, create a file > $ sudo mkfs.btrfs -f /dev/sdd5 > $ mount /dev/sdd5 t > $ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd > bs=$((16*1024)) iflag=fullblock count=1024 > > # corrupt the file > $ sudo filefrag -v t/abcd > Filesystem type is: 9123683e > File size of t/abcd is 16777216 (4096 blocks of 4096 bytes) > ext: logical_offset:physical_offset: length: expected: flags: >0:0..3475: 70656.. 74131: 3476: >1: 3476..4095: 74212.. 74831:620: 74132: last,eof > t/abcd: 2 extents found > $ sudo umount t > $ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 > /dev/sdd5 > mirror 1 logical 289406976 physical 289406976 device /dev/sdd5 > corrupting 289406976 copy 1 > > # try to access the file; expected result: -EIO > $ sudo mount /dev/sdd5 t > $ dd if=t/abcd | hexdump -c | head > dd: error reading 't/abcd': Input/output error > 0+0 records in > 0+0 records out > 0 bytes copied, 0.000477413 s, 0.0 kB/s > > > # try to access the file using O_DIRECT; expected result: -EIO, instead the > file is accessible > $ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head > 000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 > * > 0001000 f g a b c e f g a b c e f g a b > 0001010 c e f g a b c e f g a b c e f g > 0001020 a b c e f g a b c e f g a b c e > 0001030 f g a b c e f g a b c e f g a b > 0001040 c e f g a b c e f g a b c e f g > 0001050 a b c e f g a b c e f g a b c e > 0001060 f g a b c e f g a b c e f g a b > 0001070 c e f g a b c e f g a b c e f g > > (dmesg report the checksum mismatch) > [13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off 0 > csum 0x98f94189 expected csum 0x0ab6be80 mirror 1 > > Note the first 4k filled by 0x01 ! > > Conclusion: even if the file is corrupted and normally BTRFS prevent to > access it, using O_DIRECT > a) no error is returned to the caller > b) instead of the page stored on the disk, it is returned a page filled with > 0x01 (according also with the function __readpage_endio_check()) > > > 2) The second bug, is a more severe bug. If during a writing of a buffer with > O_DIRECT, the buffer is updated at the same time by a second process, the > checksum may be incorrect. > > At the end of the email there is the code which shows the problem: two > process share the same memory: the first write it to the disk, the second > update the buffer continuously. A third process try to read the file, but it > got time to time -EIO > > If you ran my code in a btrfs filesystem you got a lot of > > ERROR: read thread; r = 8192, expected = 16384 > ERROR: read thread; r = 8192, expected = 16384 > ERROR: read thread; e = 5 - Input/output error > ERROR: read thread; e = 5 - Input/output error > > The firsts lines are related to a shorter read (which may happens). The lasts > lines are related to a checksum mismatch. The dmesg is filled by lines like > [...] > [14873.573547] BTRFS warning (device sdd5): csum failed root 5 ino 259 off > 4096 csum 0x0683c6df expected csum 0x55eb85e6 mirror 1 > [...] > > This is definitely a bug. > > I think that using O_DIRECT and updating a page at the same time could happen > in a VM. In BTRFS this could lead to a wrong checksum. The problem is that > if BTRFS detects a checksum error during a reading: > a) if O_DIRECT is not used in the read > * -EIO is returned > Definitely BAD > > b) if O_DIRECT is used in the read > * it doesn't return the error to the caller > * it returns a page filled by 0x01 instead of the data from the disk > Even worse than a) > > Note1: even using O_DIRECT with O_SYNC, the problem still persist. > Note2: the man page of open(2) is filled by a lot of notes about O_DIRECT, > but also it stated that using O_DIRECT+fork()+mmap(... MAP_SHARED) is legally. > Note3: even