Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Liu Bo
On Fri, Sep 15, 2017 at 08:57:41PM +0200, Goffredo Baroncelli wrote:
> On 09/15/2017 07:01 PM, Liu Bo wrote:
> >> Conclusion: even if the file is corrupted and normally BTRFS prevent to 
> >> access it, using O_DIRECT
> >> a) no error is returned to the caller
> >> b) instead of the page stored on the disk, it is returned a page filled 
> >> with 0x01 (according also with the function __readpage_endio_check())
> >>
> > We've queued a patch[1] to fix it, the behavior was introduced a long
> > time ago and we guess it was for testing purpose.
> > 
> > With the patch, you'll get -EIO from dio read, which is consistent
> > with buffered read.
> 
> I tried your patch, but it doesn't seem to address this issue. I still got 
> -EIO with a normal file access, and a page filled by 0x01 with O_DIRECT.

Oh, nice, thanks for trying it out.

I see it now, the regression is actually introduced by a cleanup
patch,

commit 4246a0b63bd8f56a1469b12eafeb875b1041a451
block: add a bi_error field to struct bio

It overrides our -EIO which should be returned by dio_end_io().


But I think we need further clarification on the correct behavior,
ie.

If a file got corrupted (wrong) data, both buffer'd read and O_DIRECT
read will return -EIO to userspace, and that parcularly corrupted page
will be reset with 0x01 for not exposing data to userspace.  No
exception should exist.

> 
> If I understand your patch, this address the case where there is no any 
> checksum, which is not this case. The checksum exists and it is valid. It is 
> the data wrong.
>

Not for the case of nodatacsum, it's actually for the case where some
checksum somehow got zero'd, and btrfs was not returning -EIO even if
it got a mismatch.

Anyway, your problem is surely a regression, thanks for reporting it.

Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Goffredo Baroncelli
On 09/15/2017 07:01 PM, Liu Bo wrote:
>> Conclusion: even if the file is corrupted and normally BTRFS prevent to 
>> access it, using O_DIRECT
>> a) no error is returned to the caller
>> b) instead of the page stored on the disk, it is returned a page filled with 
>> 0x01 (according also with the function __readpage_endio_check())
>>
> We've queued a patch[1] to fix it, the behavior was introduced a long
> time ago and we guess it was for testing purpose.
> 
> With the patch, you'll get -EIO from dio read, which is consistent
> with buffered read.

I tried your patch, but it doesn't seem to address this issue. I still got -EIO 
with a normal file access, and a page filled by 0x01 with O_DIRECT.

If I understand your patch, this address the case where there is no any 
checksum, which is not this case. The checksum exists and it is valid. It is 
the data wrong.

BR
G.Baroncelli

> 
> [1],
> https://patchwork.kernel.org/patch/9835505/
> Btrfs: report errors when checksum is not found
> 
> Thanks,
> 
> -liubo
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Liu Bo
On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote:
> Hi all,
> 
> I discovered two bugs when O_DIRECT is used...
> 
> 1) a corrupted file doesn't return -EIO when O_DIRECT is used
> 
> Normally BTRFS prevents to access the contents of a corrupted file; however I 
> was able read the content of a corrupted file simply using O_DIRECT
> 
> # in a new btrfs filesystem, create a file
> $ sudo mkfs.btrfs -f /dev/sdd5
> $ mount /dev/sdd5 t
> $ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd 
> bs=$((16*1024)) iflag=fullblock count=1024
> 
> # corrupt the file
> $ sudo filefrag -v t/abcd 
> Filesystem type is: 9123683e
> File size of t/abcd is 16777216 (4096 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..3475:  70656.. 74131:   3476:
>1: 3476..4095:  74212.. 74831:620:  74132: last,eof
> t/abcd: 2 extents found
> $ sudo umount t
> $ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 
> /dev/sdd5
> mirror 1 logical 289406976 physical 289406976 device /dev/sdd5
> corrupting 289406976 copy 1 
> 
> # try to access the file; expected result: -EIO
> $ sudo mount /dev/sdd5 t
> $ dd if=t/abcd | hexdump -c | head
> dd: error reading 't/abcd': Input/output error
> 0+0 records in
> 0+0 records out
> 0 bytes copied, 0.000477413 s, 0.0 kB/s
> 
> 
> # try to access the file using O_DIRECT; expected result: -EIO, instead the 
> file is accessible
> $ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head
> 000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001
> *
> 0001000   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> 0001010   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> 0001020   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
> 0001030   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> 0001040   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> 0001050   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
> 0001060   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> 0001070   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> 
> (dmesg report the checksum mismatch)
> [13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off 0 
> csum 0x98f94189 expected csum 0x0ab6be80 mirror 1
> 
> Note the first 4k filled by 0x01 !
> 
> Conclusion: even if the file is corrupted and normally BTRFS prevent to 
> access it, using O_DIRECT
> a) no error is returned to the caller
> b) instead of the page stored on the disk, it is returned a page filled with 
> 0x01 (according also with the function __readpage_endio_check())
>

We've queued a patch[1] to fix it, the behavior was introduced a long
time ago and we guess it was for testing purpose.

With the patch, you'll get -EIO from dio read, which is consistent
with buffered read.

[1],
https://patchwork.kernel.org/patch/9835505/
Btrfs: report errors when checksum is not found

Thanks,

-liubo

> 
> 2) The second bug, is a more severe bug. If during a writing of a buffer with 
> O_DIRECT, the buffer is updated at the same time by a second process, the 
> checksum may be incorrect.
> 
> At the end of the email there is the code which shows the problem: two 
> process share the same memory: the first write it to the disk, the second 
> update the buffer continuously. A third process try to read the file, but it 
> got time to time -EIO
> 
> If you ran my code in a btrfs filesystem you got a lot of 
> 
> ERROR: read thread; r = 8192, expected = 16384
> ERROR: read thread; r = 8192, expected = 16384
> ERROR: read thread; e = 5 - Input/output error
> ERROR: read thread; e = 5 - Input/output error
> 
> The firsts lines are related to a shorter read (which may happens). The lasts 
> lines are related to a checksum mismatch. The dmesg is filled by lines like
> [...]
> [14873.573547] BTRFS warning (device sdd5): csum failed root 5 ino 259 off 
> 4096 csum 0x0683c6df expected csum 0x55eb85e6 mirror 1
> [...]
> 
> This is definitely a bug. 
> 
> I think that using O_DIRECT and updating a page at the same time could happen 
> in a VM. In BTRFS this  could lead to a wrong checksum. The problem is that 
> if BTRFS detects a checksum error during a reading:
> a) if O_DIRECT is not used in the read
>   * -EIO is returned
> Definitely BAD
> 
> b) if O_DIRECT is used in the read
>   * it doesn't return the error to the caller
>   * it returns a page filled by 0x01 instead of the data from the disk
> Even worse than a)
> 
> Note1: even using O_DIRECT with O_SYNC, the problem still persist.
> Note2: the man page of open(2) is filled by a lot of notes about O_DIRECT, 
> but also it stated that using O_DIRECT+fork()+mmap(... MAP_SHARED) is legally.
> Note3: even "ZFS on linux" has its trouble with O_DIRECT: if fact ZFS doesn't 
> support it; see 

Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Goffredo Baroncelli
On 09/15/2017 11:50 AM, Marat Khalili wrote:
> May I state my user's point of view:
> 
> I know one applications that uses O_DIRECT, and it is subtly broken
> on BTRFS. I know no applications that use O_DIRECT and are not
> broken. (Really more statistics would help here, probably some exist
> that provably work.) According to developers making O_DIRECT work on
> BTRFS is difficult if not impossible. Isn't it time to disable
> O_DIRECT like ZFS does AFAIU? Data safety is certainly more important
> than performance gain it may or may not give some applications.

I agree with you, but it should be sufficient to disable O_DIRECT when the file 
has the checksums. I was unable to observe filesystem corruption when the 
checksums are disabled.

The use cases where O_DIRECT is useful are VM and databasesM and in these cases 
also nodatacsum/nodatacow are useful.

BR
G.Baroncelli

> 
> --
> 
> With Best Regards, Marat Khalili
> 
> -- To unsubscribe from this list: send the line "unsubscribe
> linux-btrfs" in the body of a message to majord...@vger.kernel.org 
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Goffredo Baroncelli
On 09/15/2017 07:01 PM, Andrei Borzenkov wrote:
> 15.09.2017 08:50, Goffredo Baroncelli пишет:
>> On 09/15/2017 05:55 AM, Andrei Borzenkov wrote:
>>> 15.09.2017 01:00, Goffredo Baroncelli пишет:

 2) The second bug, is a more severe bug. If during a writing of a buffer 
 with O_DIRECT, the buffer is updated at the same time by a second process, 
 the checksum may be incorrect.

>>>
>>> Is it btrfs specific ? If buffer is updated before it was actually
>>> consumed by kernel, this likely means data corruption on any filesystem.
>>
>> I don't see any corruption in other FS. The fact that application push to 
>> filesystem garbage, doesn't allow the filesystem to be corrupted. 
> 
> I did not say "filesystem corruption", 
I did it.. because there is a filesystem corruption only in BTRFS .
I agree with you that a concurrent access between different processes to the 
same page which is in writing is bad. However only in BTRFS this lead to a 
filesystem corruption. This is the first problem. The second one is that BTRFS 
doesn't warn the user about that (read more below), however it returns wrong 
data (a page filled by 0x01)

> I said "data corruption".
And, are you sure that there is data corruption ?  don't have evidence of that 
in other filesystem. Proof is that there are a lot of problem in BTRFS when 
O_DIRECT is used, but not in other filesystem. 
If this is due to a better serialization or is due to the application touch the 
page which is going to written but doesn't care about the data I don't know. 

Anyway from open(2) man page:

[...]
O_DIRECT  I/Os should never be run concurrently with the fork(2)[...] if the 
memory buffer is a private mapping (i.e., any mapping created  with the mmap(2) 
MAP_PRIVATE flag.
[...]This restriction does not apply when the  memory buffer for the O_DIRECT 
I/Os was created using shmat(2) or mmap(2) with the MAP_SHARED flag.
[...]

So it is a valid and accepted use case O_DIRECT+mmap(MAP_SHARED)+fork()


> 
>> In this case the filesystem became corrupted, because another application 
>> which try to read the data (without O_DIRECT) may got -EIO.
>>
> 
> No. *Data* on this filesystem was corrupted and luckily btrfs makes you
> aware of it. 

Please re-read my messages. BTRFS doesn't make the user aware of it: if the 
data is read with O_DIRECT, it doesn't return an error in case of checksum 
mismatch (this is the second problem)

> On different filesystem you still may have the same data
> corruption, but silent.
> 
>> I repeat, the problem is a data race when the data is in the FS camp, and 
>> the kernel does wrong checksum.
>>
> 
> Of course it is race. But again - I expect that when pwrite() returns it
> means data buffer can be reused. Otherwise I cannot see how O_DIRECT can
> be sensibly used at all. In this case you need to demonstrate that data
> corruption happens after pwrite() returns - this makes it btrfs issue
> indeed. If data corruption happens while thread is waiting for pwrite()
> to return, I say this is expected behavior and application fault - it
> need to protect against concurrent write and modification.
> 
>>
>> IMHO, BTRFS should disallow O_DIRECT (which is the same thing that does ZFS 
>> on linux); I think that it could be allowed only for  nodatasum files.
>>
>>> I.e. there should be clear indication from kernel that buffer can be
>>> reused by application, in your example - when pwrite returns. So when
>>> data corruption happens - during pwrite or after? 
>>> If data is corrupted
>>> during pwrite, it is arguably application fault - it should disallow
>>> concurrent access.
>>
>>
>>
>>
>>
>>>
>>
>>
> 
> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Goffredo Baroncelli
On 09/15/2017 10:26 AM, Hugo Mills wrote:
> On Fri, Sep 15, 2017 at 08:04:35AM +0200, Goffredo Baroncelli wrote:
>> On 09/15/2017 12:18 AM, Hugo Mills wrote:
>>>As far as I know, both of these are basically known issues, with no
>>> good solution, other than not using O_DIRECT. Certainly the first
>>> issue is one I recognise. The second isn't one I recognise directly,
>>> but is unsurprising to me.
>>>
>>>There have been discussions -- including developers -- on this list
>>> as recent as a month or so ago. The general outcome seems to be that
>>> any problems with O_DIRECT are not going to be fixed.
>>
>> I missed this thread; could you point it to me ?
> 
>No, you didn't miss it -- you were part of it. :)
> 
>http://www.spinics.net/lists/linux-btrfs/msg68244.html
> 
I hoped that there was more deeper analysis. This messages was more or less an 
acknowledge than an analysis :(


>Hugo.
> 
>> If csum and O_DIRECT are not reliable, why not disallow one of them: i.e 
>> allow O_DIRECT only on nodatasum files... ZFS (on linux) do not support 
>> O_DIRECT at all...
>>
>> In fact most of the applications which benefit from O_DIRECT (it comes to me 
>> VM e DB), are the ones which need also nodatasum to have good performance.
>>
>> One of the strongest point of BTRFS was the checksums; but these are not 
>> effective when the file is opened with O_DIRECT; worse there are cases where 
>> the file is corrupted and the application got -EIO; not mentioning that the 
>> dmesg is filled by "csum failed  "
>>
>>
>>>
>>>Hugo.
>>>
>>> On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote:
 Hi all,

 I discovered two bugs when O_DIRECT is used...

 1) a corrupted file doesn't return -EIO when O_DIRECT is used

 Normally BTRFS prevents to access the contents of a corrupted file; 
 however I was able read the content of a corrupted file simply using 
 O_DIRECT

 # in a new btrfs filesystem, create a file
 $ sudo mkfs.btrfs -f /dev/sdd5
 $ mount /dev/sdd5 t
 $ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd 
 bs=$((16*1024)) iflag=fullblock count=1024

 # corrupt the file
 $ sudo filefrag -v t/abcd 
 Filesystem type is: 9123683e
 File size of t/abcd is 16777216 (4096 blocks of 4096 bytes)
  ext: logical_offset:physical_offset: length:   expected: 
 flags:
0:0..3475:  70656.. 74131:   3476:
1: 3476..4095:  74212.. 74831:620:  74132: 
 last,eof
 t/abcd: 2 extents found
 $ sudo umount t
 $ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 
 /dev/sdd5
 mirror 1 logical 289406976 physical 289406976 device /dev/sdd5
 corrupting 289406976 copy 1

 # try to access the file; expected result: -EIO
 $ sudo mount /dev/sdd5 t
 $ dd if=t/abcd | hexdump -c | head
 dd: error reading 't/abcd': Input/output error
 0+0 records in
 0+0 records out
 0 bytes copied, 0.000477413 s, 0.0 kB/s


 # try to access the file using O_DIRECT; expected result: -EIO, instead 
 the file is accessible
 $ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head
 000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001
 *
 0001000   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
 0001010   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
 0001020   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
 0001030   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
 0001040   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
 0001050   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
 0001060   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
 0001070   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g

 (dmesg report the checksum mismatch)
 [13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off 
 0 csum 0x98f94189 expected csum 0x0ab6be80 mirror 1

 Note the first 4k filled by 0x01 !

 Conclusion: even if the file is corrupted and normally BTRFS prevent to 
 access it, using O_DIRECT
 a) no error is returned to the caller
 b) instead of the page stored on the disk, it is returned a page filled 
 with 0x01 (according also with the function __readpage_endio_check())


 2) The second bug, is a more severe bug. If during a writing of a buffer 
 with O_DIRECT, the buffer is updated at the same time by a second process, 
 the checksum may be incorrect.

 At the end of the email there is the code which shows the problem: two 
 process share the same memory: the first write it to the disk, the second 
 update the buffer continuously. A third process try to read the file, but 
 it got time to 

Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Andrei Borzenkov
15.09.2017 08:50, Goffredo Baroncelli пишет:
> On 09/15/2017 05:55 AM, Andrei Borzenkov wrote:
>> 15.09.2017 01:00, Goffredo Baroncelli пишет:
>>>
>>> 2) The second bug, is a more severe bug. If during a writing of a buffer 
>>> with O_DIRECT, the buffer is updated at the same time by a second process, 
>>> the checksum may be incorrect.
>>>
>>
>> Is it btrfs specific ? If buffer is updated before it was actually
>> consumed by kernel, this likely means data corruption on any filesystem.
> 
> I don't see any corruption in other FS. The fact that application push to 
> filesystem garbage, doesn't allow the filesystem to be corrupted. 

I did not say "filesystem corruption", I said "data corruption".

> In this case the filesystem became corrupted, because another application 
> which try to read the data (without O_DIRECT) may got -EIO.
> 

No. *Data* on this filesystem was corrupted and luckily btrfs makes you
aware of it. On different filesystem you still may have the same data
corruption, but silent.

> I repeat, the problem is a data race when the data is in the FS camp, and the 
> kernel does wrong checksum.
> 

Of course it is race. But again - I expect that when pwrite() returns it
means data buffer can be reused. Otherwise I cannot see how O_DIRECT can
be sensibly used at all. In this case you need to demonstrate that data
corruption happens after pwrite() returns - this makes it btrfs issue
indeed. If data corruption happens while thread is waiting for pwrite()
to return, I say this is expected behavior and application fault - it
need to protect against concurrent write and modification.

> 
> IMHO, BTRFS should disallow O_DIRECT (which is the same thing that does ZFS 
> on linux); I think that it could be allowed only for  nodatasum files.
> 
>> I.e. there should be clear indication from kernel that buffer can be
>> reused by application, in your example - when pwrite returns. So when
>> data corruption happens - during pwrite or after? 
>> If data is corrupted
>> during pwrite, it is arguably application fault - it should disallow
>> concurrent access.
> 
> 
> 
> 
> 
>>
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


RE: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Paul Jones
> -Original Message-
> From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-
> ow...@vger.kernel.org] On Behalf Of Marat Khalili
> Sent: Friday, 15 September 2017 7:50 PM
> To: Hugo Mills ; Goffredo Baroncelli
> ; linux-btrfs 
> Subject: Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and
> wrong data
> 
> May I state my user's point of view:
> 
> I know one applications that uses O_DIRECT, and it is subtly broken on BTRFS.
> I know no applications that use O_DIRECT and are not broken.
> (Really more statistics would help here, probably some exist that provably
> work.) According to developers making O_DIRECT work on BTRFS is difficult if
> not impossible. Isn't it time to disable O_DIRECT like ZFS does AFAIU? Data
> safety is certainly more important than performance gain it may or may not
> give some applications.

I agree - I've had trouble with this before because I didn't understand the 
consequences of using it.
It would be better to hide it behind a mount option or similar IMHO.

Paul.





Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Marat Khalili

May I state my user's point of view:

I know one applications that uses O_DIRECT, and it is subtly broken on 
BTRFS. I know no applications that use O_DIRECT and are not broken. 
(Really more statistics would help here, probably some exist that 
provably work.) According to developers making O_DIRECT work on BTRFS is 
difficult if not impossible. Isn't it time to disable O_DIRECT like ZFS 
does AFAIU? Data safety is certainly more important than performance 
gain it may or may not give some applications.


--

With Best Regards,
Marat Khalili

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Hugo Mills
On Fri, Sep 15, 2017 at 08:04:35AM +0200, Goffredo Baroncelli wrote:
> On 09/15/2017 12:18 AM, Hugo Mills wrote:
> >As far as I know, both of these are basically known issues, with no
> > good solution, other than not using O_DIRECT. Certainly the first
> > issue is one I recognise. The second isn't one I recognise directly,
> > but is unsurprising to me.
> > 
> >There have been discussions -- including developers -- on this list
> > as recent as a month or so ago. The general outcome seems to be that
> > any problems with O_DIRECT are not going to be fixed.
> 
> I missed this thread; could you point it to me ?

   No, you didn't miss it -- you were part of it. :)

   http://www.spinics.net/lists/linux-btrfs/msg68244.html

   Hugo.

> If csum and O_DIRECT are not reliable, why not disallow one of them: i.e 
> allow O_DIRECT only on nodatasum files... ZFS (on linux) do not support 
> O_DIRECT at all...
> 
> In fact most of the applications which benefit from O_DIRECT (it comes to me 
> VM e DB), are the ones which need also nodatasum to have good performance.
> 
> One of the strongest point of BTRFS was the checksums; but these are not 
> effective when the file is opened with O_DIRECT; worse there are cases where 
> the file is corrupted and the application got -EIO; not mentioning that the 
> dmesg is filled by "csum failed  "
> 
> 
> > 
> >Hugo.
> > 
> > On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote:
> >> Hi all,
> >>
> >> I discovered two bugs when O_DIRECT is used...
> >>
> >> 1) a corrupted file doesn't return -EIO when O_DIRECT is used
> >>
> >> Normally BTRFS prevents to access the contents of a corrupted file; 
> >> however I was able read the content of a corrupted file simply using 
> >> O_DIRECT
> >>
> >> # in a new btrfs filesystem, create a file
> >> $ sudo mkfs.btrfs -f /dev/sdd5
> >> $ mount /dev/sdd5 t
> >> $ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd 
> >> bs=$((16*1024)) iflag=fullblock count=1024
> >>
> >> # corrupt the file
> >> $ sudo filefrag -v t/abcd 
> >> Filesystem type is: 9123683e
> >> File size of t/abcd is 16777216 (4096 blocks of 4096 bytes)
> >>  ext: logical_offset:physical_offset: length:   expected: 
> >> flags:
> >>0:0..3475:  70656.. 74131:   3476:
> >>1: 3476..4095:  74212.. 74831:620:  74132: 
> >> last,eof
> >> t/abcd: 2 extents found
> >> $ sudo umount t
> >> $ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 
> >> /dev/sdd5
> >> mirror 1 logical 289406976 physical 289406976 device /dev/sdd5
> >> corrupting 289406976 copy 1
> >>
> >> # try to access the file; expected result: -EIO
> >> $ sudo mount /dev/sdd5 t
> >> $ dd if=t/abcd | hexdump -c | head
> >> dd: error reading 't/abcd': Input/output error
> >> 0+0 records in
> >> 0+0 records out
> >> 0 bytes copied, 0.000477413 s, 0.0 kB/s
> >>
> >>
> >> # try to access the file using O_DIRECT; expected result: -EIO, instead 
> >> the file is accessible
> >> $ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head
> >> 000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001
> >> *
> >> 0001000   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> >> 0001010   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> >> 0001020   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
> >> 0001030   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> >> 0001040   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> >> 0001050   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
> >> 0001060   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> >> 0001070   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> >>
> >> (dmesg report the checksum mismatch)
> >> [13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off 
> >> 0 csum 0x98f94189 expected csum 0x0ab6be80 mirror 1
> >>
> >> Note the first 4k filled by 0x01 !
> >>
> >> Conclusion: even if the file is corrupted and normally BTRFS prevent to 
> >> access it, using O_DIRECT
> >> a) no error is returned to the caller
> >> b) instead of the page stored on the disk, it is returned a page filled 
> >> with 0x01 (according also with the function __readpage_endio_check())
> >>
> >>
> >> 2) The second bug, is a more severe bug. If during a writing of a buffer 
> >> with O_DIRECT, the buffer is updated at the same time by a second process, 
> >> the checksum may be incorrect.
> >>
> >> At the end of the email there is the code which shows the problem: two 
> >> process share the same memory: the first write it to the disk, the second 
> >> update the buffer continuously. A third process try to read the file, but 
> >> it got time to time -EIO
> >>
> >> If you ran my code in a btrfs filesystem you got a lot of 
> >>
> >> ERROR: read thread; r = 8192, expected = 16384
> >> ERROR: read thread; r = 8192, expected = 

Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-15 Thread Goffredo Baroncelli
On 09/15/2017 12:18 AM, Hugo Mills wrote:
>As far as I know, both of these are basically known issues, with no
> good solution, other than not using O_DIRECT. Certainly the first
> issue is one I recognise. The second isn't one I recognise directly,
> but is unsurprising to me.
> 
>There have been discussions -- including developers -- on this list
> as recent as a month or so ago. The general outcome seems to be that
> any problems with O_DIRECT are not going to be fixed.

I missed this thread; could you point it to me ?


If csum and O_DIRECT are not reliable, why not disallow one of them: i.e allow 
O_DIRECT only on nodatasum files... ZFS (on linux) do not support O_DIRECT at 
all...

In fact most of the applications which benefit from O_DIRECT (it comes to me VM 
e DB), are the ones which need also nodatasum to have good performance.

One of the strongest point of BTRFS was the checksums; but these are not 
effective when the file is opened with O_DIRECT; worse there are cases where 
the file is corrupted and the application got -EIO; not mentioning that the 
dmesg is filled by "csum failed  "


> 
>Hugo.
> 
> On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote:
>> Hi all,
>>
>> I discovered two bugs when O_DIRECT is used...
>>
>> 1) a corrupted file doesn't return -EIO when O_DIRECT is used
>>
>> Normally BTRFS prevents to access the contents of a corrupted file; however 
>> I was able read the content of a corrupted file simply using O_DIRECT
>>
>> # in a new btrfs filesystem, create a file
>> $ sudo mkfs.btrfs -f /dev/sdd5
>> $ mount /dev/sdd5 t
>> $ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd 
>> bs=$((16*1024)) iflag=fullblock count=1024
>>
>> # corrupt the file
>> $ sudo filefrag -v t/abcd 
>> Filesystem type is: 9123683e
>> File size of t/abcd is 16777216 (4096 blocks of 4096 bytes)
>>  ext: logical_offset:physical_offset: length:   expected: flags:
>>0:0..3475:  70656.. 74131:   3476:
>>1: 3476..4095:  74212.. 74831:620:  74132: 
>> last,eof
>> t/abcd: 2 extents found
>> $ sudo umount t
>> $ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 
>> /dev/sdd5
>> mirror 1 logical 289406976 physical 289406976 device /dev/sdd5
>> corrupting 289406976 copy 1
>>
>> # try to access the file; expected result: -EIO
>> $ sudo mount /dev/sdd5 t
>> $ dd if=t/abcd | hexdump -c | head
>> dd: error reading 't/abcd': Input/output error
>> 0+0 records in
>> 0+0 records out
>> 0 bytes copied, 0.000477413 s, 0.0 kB/s
>>
>>
>> # try to access the file using O_DIRECT; expected result: -EIO, instead the 
>> file is accessible
>> $ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head
>> 000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001
>> *
>> 0001000   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
>> 0001010   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
>> 0001020   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
>> 0001030   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
>> 0001040   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
>> 0001050   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
>> 0001060   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
>> 0001070   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
>>
>> (dmesg report the checksum mismatch)
>> [13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off 0 
>> csum 0x98f94189 expected csum 0x0ab6be80 mirror 1
>>
>> Note the first 4k filled by 0x01 !
>>
>> Conclusion: even if the file is corrupted and normally BTRFS prevent to 
>> access it, using O_DIRECT
>> a) no error is returned to the caller
>> b) instead of the page stored on the disk, it is returned a page filled with 
>> 0x01 (according also with the function __readpage_endio_check())
>>
>>
>> 2) The second bug, is a more severe bug. If during a writing of a buffer 
>> with O_DIRECT, the buffer is updated at the same time by a second process, 
>> the checksum may be incorrect.
>>
>> At the end of the email there is the code which shows the problem: two 
>> process share the same memory: the first write it to the disk, the second 
>> update the buffer continuously. A third process try to read the file, but it 
>> got time to time -EIO
>>
>> If you ran my code in a btrfs filesystem you got a lot of 
>>
>> ERROR: read thread; r = 8192, expected = 16384
>> ERROR: read thread; r = 8192, expected = 16384
>> ERROR: read thread; e = 5 - Input/output error
>> ERROR: read thread; e = 5 - Input/output error
>>
>> The firsts lines are related to a shorter read (which may happens). The 
>> lasts lines are related to a checksum mismatch. The dmesg is filled by lines 
>> like
>> [...]
>> [14873.573547] BTRFS warning (device sdd5): csum failed root 5 ino 259 off 
>> 4096 csum 0x0683c6df expected csum 0x55eb85e6 

Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-14 Thread Goffredo Baroncelli
On 09/15/2017 05:55 AM, Andrei Borzenkov wrote:
> 15.09.2017 01:00, Goffredo Baroncelli пишет:
>>
>> 2) The second bug, is a more severe bug. If during a writing of a buffer 
>> with O_DIRECT, the buffer is updated at the same time by a second process, 
>> the checksum may be incorrect.
>>
> 
> Is it btrfs specific ? If buffer is updated before it was actually
> consumed by kernel, this likely means data corruption on any filesystem.

I don't see any corruption in other FS. The fact that application push to 
filesystem garbage, doesn't allow the filesystem to be corrupted. 
In this case the filesystem became corrupted, because another application which 
try to read the data (without O_DIRECT) may got -EIO.

I repeat, the problem is a data race when the data is in the FS camp, and the 
kernel does wrong checksum.


IMHO, BTRFS should disallow O_DIRECT (which is the same thing that does ZFS on 
linux); I think that it could be allowed only for  nodatasum files.

> I.e. there should be clear indication from kernel that buffer can be
> reused by application, in your example - when pwrite returns. So when
> data corruption happens - during pwrite or after? 
> If data is corrupted
> during pwrite, it is arguably application fault - it should disallow
> concurrent access.





> 


-- 
gpg @keyserver.linux.it: Goffredo Baroncelli 
Key fingerprint BBF5 1610 0B64 DAC6 5F7D  17B2 0EDA 9B37 8B82 E0B5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-14 Thread Andrei Borzenkov
15.09.2017 01:00, Goffredo Baroncelli пишет:
> 
> 2) The second bug, is a more severe bug. If during a writing of a buffer with 
> O_DIRECT, the buffer is updated at the same time by a second process, the 
> checksum may be incorrect.
> 

Is it btrfs specific? If buffer is updated before it was actually
consumed by kernel, this likely means data corruption on any filesystem.
I.e. there should be clear indication from kernel that buffer can be
reused by application, in your example - when pwrite returns. So when
data corruption happens - during pwrite or after? If data is corrupted
during pwrite, it is arguably application fault - it should disallow
concurrent access.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: BUG: BTRFS and O_DIRECT could lead to wrong checksum and wrong data

2017-09-14 Thread Hugo Mills
   As far as I know, both of these are basically known issues, with no
good solution, other than not using O_DIRECT. Certainly the first
issue is one I recognise. The second isn't one I recognise directly,
but is unsurprising to me.

   There have been discussions -- including developers -- on this list
as recent as a month or so ago. The general outcome seems to be that
any problems with O_DIRECT are not going to be fixed.

   Hugo.

On Fri, Sep 15, 2017 at 12:00:19AM +0200, Goffredo Baroncelli wrote:
> Hi all,
> 
> I discovered two bugs when O_DIRECT is used...
> 
> 1) a corrupted file doesn't return -EIO when O_DIRECT is used
> 
> Normally BTRFS prevents to access the contents of a corrupted file; however I 
> was able read the content of a corrupted file simply using O_DIRECT
> 
> # in a new btrfs filesystem, create a file
> $ sudo mkfs.btrfs -f /dev/sdd5
> $ mount /dev/sdd5 t
> $ (while true; do echo -n "abcefg" ; done )| sudo dd of=t/abcd 
> bs=$((16*1024)) iflag=fullblock count=1024
> 
> # corrupt the file
> $ sudo filefrag -v t/abcd 
> Filesystem type is: 9123683e
> File size of t/abcd is 16777216 (4096 blocks of 4096 bytes)
>  ext: logical_offset:physical_offset: length:   expected: flags:
>0:0..3475:  70656.. 74131:   3476:
>1: 3476..4095:  74212.. 74831:620:  74132: last,eof
> t/abcd: 2 extents found
> $ sudo umount t
> $ sudo ~/btrfs/btrfs-progs/btrfs-corrupt-block -l $((70656*4096)) -b 10 
> /dev/sdd5
> mirror 1 logical 289406976 physical 289406976 device /dev/sdd5
> corrupting 289406976 copy 1
> 
> # try to access the file; expected result: -EIO
> $ sudo mount /dev/sdd5 t
> $ dd if=t/abcd | hexdump -c | head
> dd: error reading 't/abcd': Input/output error
> 0+0 records in
> 0+0 records out
> 0 bytes copied, 0.000477413 s, 0.0 kB/s
> 
> 
> # try to access the file using O_DIRECT; expected result: -EIO, instead the 
> file is accessible
> $ dd if=t/abcd iflag=direct bs=4096 | hexdump -c | head
> 000 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001 001
> *
> 0001000   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> 0001010   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> 0001020   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
> 0001030   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> 0001040   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> 0001050   a   b   c   e   f   g   a   b   c   e   f   g   a   b   c   e
> 0001060   f   g   a   b   c   e   f   g   a   b   c   e   f   g   a   b
> 0001070   c   e   f   g   a   b   c   e   f   g   a   b   c   e   f   g
> 
> (dmesg report the checksum mismatch)
> [13265.085645] BTRFS warning (device sdd5): csum failed root 5 ino 257 off 0 
> csum 0x98f94189 expected csum 0x0ab6be80 mirror 1
> 
> Note the first 4k filled by 0x01 !
> 
> Conclusion: even if the file is corrupted and normally BTRFS prevent to 
> access it, using O_DIRECT
> a) no error is returned to the caller
> b) instead of the page stored on the disk, it is returned a page filled with 
> 0x01 (according also with the function __readpage_endio_check())
> 
> 
> 2) The second bug, is a more severe bug. If during a writing of a buffer with 
> O_DIRECT, the buffer is updated at the same time by a second process, the 
> checksum may be incorrect.
> 
> At the end of the email there is the code which shows the problem: two 
> process share the same memory: the first write it to the disk, the second 
> update the buffer continuously. A third process try to read the file, but it 
> got time to time -EIO
> 
> If you ran my code in a btrfs filesystem you got a lot of 
> 
> ERROR: read thread; r = 8192, expected = 16384
> ERROR: read thread; r = 8192, expected = 16384
> ERROR: read thread; e = 5 - Input/output error
> ERROR: read thread; e = 5 - Input/output error
> 
> The firsts lines are related to a shorter read (which may happens). The lasts 
> lines are related to a checksum mismatch. The dmesg is filled by lines like
> [...]
> [14873.573547] BTRFS warning (device sdd5): csum failed root 5 ino 259 off 
> 4096 csum 0x0683c6df expected csum 0x55eb85e6 mirror 1
> [...]
> 
> This is definitely a bug. 
> 
> I think that using O_DIRECT and updating a page at the same time could happen 
> in a VM. In BTRFS this  could lead to a wrong checksum. The problem is that 
> if BTRFS detects a checksum error during a reading:
> a) if O_DIRECT is not used in the read
>   * -EIO is returned
> Definitely BAD
> 
> b) if O_DIRECT is used in the read
>   * it doesn't return the error to the caller
>   * it returns a page filled by 0x01 instead of the data from the disk
> Even worse than a)
> 
> Note1: even using O_DIRECT with O_SYNC, the problem still persist.
> Note2: the man page of open(2) is filled by a lot of notes about O_DIRECT, 
> but also it stated that using O_DIRECT+fork()+mmap(... MAP_SHARED) is legally.
> Note3: even