Dear all,
this is an RFC. I have done very little testing, so use it at your own risk.
In the past there were several reports [1][2] about corruption when O_DIRECT is
used.
Every time the analysis was the same: it is too difficult to sync the checksum
with the data when O_DIRECT is used.
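The aim of this patch is to avoid the corruption by disabling O_DIRECT writes
when the file is NOT marked as NODATASUM. In other words, O_DIRECT is honored
only for files that are not protected by a checksum, so O_DIRECT and DATASUM
become mutually exclusive (O_DIRECT ^ DATASUM). For a file with checksums the
write falls back to the buffered path.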
User space is not informed when O_DIRECT is not honored.
To the best of my knowledge, ZFS does the same thing.
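
Since user space is not told when O_DIRECT falls back to buffered I/O, an
application could at least inspect the file attributes beforehand. The sketch
below is one possible approach and is not part of the patch: it uses the generic
FS_IOC_GETFLAGS ioctl and tests FS_NOCOW_FL (the attribute set by "chattr +C"),
which on btrfs implies NODATASUM. A file that is NODATASUM only because of the
nodatasum mount option would not be detected this way. The helper name
file_is_nocow() is hypothetical.

```c
#include <fcntl.h>
#include <linux/fs.h>
#include <sys/ioctl.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if the file has the NOCOW attribute
 * (chattr +C), 0 if it does not, and -1 if the flags cannot be read
 * (for example, the file cannot be opened or the filesystem does not
 * support FS_IOC_GETFLAGS).
 */
int file_is_nocow(const char *path)
{
	int flags = 0;
	int ret;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;

	ret = ioctl(fd, FS_IOC_GETFLAGS, &flags);
	close(fd);
	if (ret < 0)
		return -1;

	return (flags & FS_NOCOW_FL) ? 1 : 0;
}
```

A return value of 1 suggests that, with this patch applied, an O_DIRECT write to
the file would still take the direct path.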
It is not a regression: today an O_DIRECT update of a file may compromise the
checksum calculation. So the file content may be correct while the checksum is
not, and this prevents the file from being read.
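
The failure mode can be illustrated with a user-space toy model. This is only a
sketch, assuming the checksum is computed from the caller's buffer and the
buffer is then modified before the data reaches the disk. The checksum is a
simple stand-in (FNV-1a, not the crc32c that btrfs uses by default) and all the
names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16

/* Stand-in checksum (FNV-1a hash), NOT the one used by btrfs. */
static uint32_t toy_csum(const unsigned char *buf, size_t len)
{
	uint32_t h = 2166136261u;
	size_t i;

	for (i = 0; i < len; i++) {
		h ^= buf[i];
		h *= 16777619u;
	}
	return h;
}

/* A "disk block": the stored data plus the checksum recorded for it. */
struct toy_block {
	unsigned char data[TOY_BLOCK_SIZE];
	uint32_t csum;
};

/*
 * Model a direct write in which the user buffer may change between the
 * checksum calculation and the data transfer. Returns 1 if the stored
 * block still verifies against its checksum, 0 otherwise.
 */
int toy_direct_write(struct toy_block *blk, unsigned char *user_buf,
		     int modify_in_flight)
{
	/* Step 1: the checksum is computed from the user buffer. */
	blk->csum = toy_csum(user_buf, TOY_BLOCK_SIZE);

	/* Step 2: the application rewrites the buffer while I/O is in flight. */
	if (modify_in_flight)
		user_buf[0] ^= 0xff;

	/* Step 3: the (possibly changed) data is what lands on the disk. */
	memcpy(blk->data, user_buf, TOY_BLOCK_SIZE);

	return toy_csum(blk->data, TOY_BLOCK_SIZE) == blk->csum;
}
```

With modify_in_flight set, the stored data is exactly what the application last
wrote, yet verification fails because the checksum describes the older content.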
In [2] there is a test program which triggers the problem. With this patch
applied, the program no longer triggers it.
Open questions:
- should the kernel return an error when a file is opened with O_DIRECT and
  the file has checksums?
- does it make sense to have a mount option to select among different
  behaviours:
	1) let the kernel behave as today (O_DIRECT is allowed even at the risk
	   of some form of corruption)
	2) let the kernel ignore O_DIRECT if the file is protected by a checksum
	3) let the kernel return an error when O_DIRECT is used on a file with
	   a checksum
  ?
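
To make the three behaviours above concrete, here is a small user-space model
of how such a policy could be evaluated. It is not kernel code. The enums, the
function name and the choice of EINVAL as the error are all hypothetical and
only illustrate the decision table.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical values for the proposed mount option. */
enum dio_csum_policy {
	DIO_CSUM_ALLOW,		/* 1) behave as today */
	DIO_CSUM_FALLBACK,	/* 2) ignore O_DIRECT, use buffered I/O */
	DIO_CSUM_ERROR,		/* 3) fail the request */
};

enum write_path {
	WRITE_DIRECT,
	WRITE_BUFFERED,
};

/*
 * Decide which write path to take. Returns 0 and sets *path on success,
 * or a negative errno if the policy rejects the request.
 */
int choose_write_path(enum dio_csum_policy policy, bool o_direct,
		      bool has_csum, enum write_path *path)
{
	/* Without O_DIRECT the write is always buffered. */
	if (!o_direct) {
		*path = WRITE_BUFFERED;
		return 0;
	}

	/* No checksum to keep in sync, or the admin accepted the risk. */
	if (!has_csum || policy == DIO_CSUM_ALLOW) {
		*path = WRITE_DIRECT;
		return 0;
	}

	if (policy == DIO_CSUM_FALLBACK) {
		*path = WRITE_BUFFERED;
		return 0;
	}

	/* DIO_CSUM_ERROR: O_DIRECT on a checksummed file is refused. */
	return -EINVAL;
}
```

In this model, the patch in this mail corresponds to DIO_CSUM_FALLBACK being the
only behaviour.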
Comments are welcome.
BR
G.Baroncelli
[1]
https://lore.kernel.org/linux-btrfs/1ad3962943592e9a60f88aecdb493f368c70bbe1.ca...@infradead.org/#r
[2]
https://lore.kernel.org/linux-btrfs/cf8a733f-2c9d-7ffe-e865-4c13d99df...@libero.it/
-----
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 0e41459b8de6..af73157e8200 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2018,7 +2018,12 @@ static ssize_t btrfs_file_write_iter(struct kiocb *iocb,
 	if (sync)
 		atomic_inc(&BTRFS_I(inode)->sync_writers);
 
-	if (iocb->ki_flags & IOCB_DIRECT)
+	/*
+	 * O_DIRECT doesn't play well with CSUM, so allow the O_DIRECT
+	 * only if the file is marked BTRFS_INODE_NODATASUM
+	 */
+	if (iocb->ki_flags & IOCB_DIRECT &&
+	    (BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
 		num_written = btrfs_direct_write(iocb, from);
 	else
 		num_written = btrfs_buffered_write(iocb, from);
-----------
--
gpg @keyserver.linux.it: Goffredo Baroncelli <kreijackATinwind.it>
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5