[PATCH] btrfs-progs: Fix cross-compile error for mktables
When cross compiling btrfs-progs, the following error prevents
btrfs-progs from being compiled:

    [CC]     mktables
    [TABLE]  kernel-lib/tables.c
  /bin/sh: ./mktables: cannot execute binary file: Exec format error
  make: *** No rule to make target 'kernel-lib/tables.c', needed by
  'kernel-lib/tables.o'.  Stop.

"mktables" should only be executed in the host environment, while @CC
set by autoconf follows the host/build/target setting, causing mktables
to be cross-compiled.

The fix is to introduce a new @HOSTCC for mktables, which is not
affected by host/build/target settings.

Reported-by: Hallo32
Suggested-by: David Sterba
Signed-off-by: Qu Wenruo
---
Tested with AArch64 cross-toolchain created by buildroot.
---
 Makefile        | 2 +-
 Makefile.inc.in | 1 +
 configure.ac    | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/Makefile b/Makefile
index b3e2b636..0395e37f 100644
--- a/Makefile
+++ b/Makefile
@@ -323,7 +323,7 @@ version.h: version.sh version.h.in configure.ac
 
 mktables: kernel-lib/mktables.c
 	@echo "    [CC]     $@"
-	$(Q)$(CC) $(CFLAGS) $< -o $@
+	$(Q)$(HOSTCC) $(CFLAGS) $< -o $@
 
 kernel-lib/tables.c: mktables
 	@echo "    [TABLE]  $@"

diff --git a/Makefile.inc.in b/Makefile.inc.in
index 4e1b68cb..308acca3 100644
--- a/Makefile.inc.in
+++ b/Makefile.inc.in
@@ -4,6 +4,7 @@ export
 
 CC = @CC@
+HOSTCC = @HOSTCC@
 LN_S = @LN_S@
 AR = @AR@
 RM = @RM@

diff --git a/configure.ac b/configure.ac
index 30055f85..f6051ebd 100644
--- a/configure.ac
+++ b/configure.ac
@@ -26,6 +26,7 @@ AC_CONFIG_SRCDIR([btrfs.c])
 AC_PREFIX_DEFAULT([/usr/local])
 
 AC_PROG_CC
+AC_PATH_PROGS([HOSTCC], [gcc clang])
 AC_CANONICAL_HOST
 AC_C_CONST
 AC_C_VOLATILE
-- 
2.14.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[RESEND PATCH] btrfs-progs: Specify C standard to gnu90 explicitly
Different C compilers have different default language standards. This
sometimes causes problems on different systems.

For distributions like CentOS/RHEL 7, gcc is still 4.8 and will report
an error for declarations that are not valid c90, while most developers
are using a newer gcc which will just accept them. This makes it hard
for us to detect such language standard problems.

This patch specifies the standard as gnu90 explicitly to avoid such
problems. Gnu90 is a good mix of c90 while providing a lot of useful
GNU extensions, and it is supported by all modern gcc and clang.

Reported-by: Marco Lorenzo Crociani
Signed-off-by: Qu Wenruo
---
 Makefile | 1 +
 1 file changed, 1 insertion(+)

diff --git a/Makefile b/Makefile
index 81598df1..cb1b3c5f 100644
--- a/Makefile
+++ b/Makefile
@@ -58,6 +58,7 @@ TOPDIR := $(shell pwd)
 
 # Common build flags
 CFLAGS = $(SUBST_CFLAGS) \
+	-std=gnu90 \
 	-include config.h \
 	-DBTRFS_FLAT_INCLUDES \
 	-D_XOPEN_SOURCE=700 \
-- 
2.13.2
Re: btrfs fi du -s gives Inappropriate ioctl for device
On Mon, Aug 14, 2017 at 4:57 PM, Piotr Szymaniak wrote:
>
> and... some issues:
> ~ # btrfs fi du -s /mnt/red/\@backup/
>      Total   Exclusive  Set shared  Filename
> ERROR: cannot check space of '/mnt/red/@backup/': Inappropriate ioctl
> for device

It's a bug, but I don't know if any devs are working on a fix yet. The
problem is that the subvolume being snapshotted contains subvolumes.
The resulting snapshot contains an empty directory in place of the
nested subvolume(s), and that is the cause of the error.

Other messages on this:
https://www.spinics.net/lists/linux-btrfs/msg67330.html
https://www.spinics.net/lists/linux-btrfs/msg61883.html

-- 
Chris Murphy
Re: btrfs-progs-v4.12: cross compiling
On 2017年08月15日 02:57, Goffredo Baroncelli wrote:
> On 08/14/2017 05:10 PM, David Sterba wrote:
>> On Mon, Aug 14, 2017 at 10:14:42PM +0800, Qu Wenruo wrote:
> [...]
>> mktables.c is synced from kernel sources, taking updates from there
>> is easier than porting any changes to the proposed scripted
>> implementation.
>>
>> The workflow is simple:
>> - copy kernel mktables.c changes to btrfs-progs mktables.c
>
> How does the kernel deal with this kind of problem?
>
> Looking at the source of the btrfs Makefile, it is simpler to replace
>
>   mktables: kernel-lib/mktables.c
>   	@echo "    [CC]     $@"
>   	$(Q)$(CC) $(CFLAGS) $< -o $@
>
> with
>
>   mktables: kernel-lib/mktables.c
>   	@echo "    [HOSTCC] $@"
>   	$(Q)$(HOSTCC) $(CFLAGS) $< -o $@
>
> where HOSTCC is defined as HOSTCC=gcc
> (maybe the same applies also to CFLAGS <-> HOSTCFLAGS?)

If using HOSTCC then I'm fine with it.

Thanks,
Qu

>> - compile mktables
>> - run 'make kernel-lib/tables.c'
>> - commit the changes to git
>>
>> All of that happens very rarely, if ever; the raid56 tables and
>> correction algorithms are set in stone. Any extensions need to be
>> done on both sides kernel/userspace.
>>
>>>>> What about using script to generate it?
>>>>
>>>> We do have the mktables utility to generate it and I'll regenerate
>>>> it should there be a change to kernel-lib/mktables.c
>>>
>>> I mean to replace mktables.c with a script.
>>> So no cross compiler problems at all, and even easier Makefile.
>>> No dependence on "mktables" program.
>>
>> Somebody has to implement the script and verify that the output is
>> the same, eventually sync changes. The cross-compilation should be
>> fixed with the pregenerated tables.c. Is Makefile size really a
>> concern? The number of related lines is like 7. I don't see any
>> benefit in what you propose and hopefully explained my viewpoint
>> enough so I don't have to continue.
btrfs fi du -s gives Inappropriate ioctl for device
Hi list,

I have some weird issue. So, I have a btrfs fs and some subvols, like
this:

~ # btrfs sub list /mnt/red/
ID 260 gen 827956 top level 5 path @home
ID 261 gen 827926 top level 5 path @backup
ID 645 gen 827911 top level 5 path @svn
*snip*

and some snapshots on those subvols:

*snip*
ID 2501 gen 827894 top level 260 path @home/.snapshot/monthly_2017-05-01_05:30:01
ID 2603 gen 827894 top level 260 path @home/.snapshot/monthly_2017-06-01_05:30:01
ID 2604 gen 827927 top level 261 path @backup/.snapshot/monthly_2017-06-01_05:30:02
ID 2680 gen 827895 top level 260 path @home/.snapshot/monthly_2017-07-01_05:30:01
ID 2681 gen 827927 top level 261 path @backup/.snapshot/monthly_2017-07-01_05:30:01
ID 2734 gen 827895 top level 260 path @home/.snapshot/weekly_2017-07-22_04:20:01
ID 2735 gen 827905 top level 645 path @svn/.snapshot/weekly_2017-07-22_04:20:02
ID 2754 gen 827895 top level 260 path @home/.snapshot/weekly_2017-07-29_04:20:01
ID 2755 gen 827906 top level 645 path @svn/.snapshot/weekly_2017-07-29_04:20:01
ID 2763 gen 827896 top level 260 path @home/.snapshot/monthly_2017-08-01_05:30:01
*snip*

and... some issues:

~ # btrfs fi du -s /mnt/red/\@backup/
     Total   Exclusive  Set shared  Filename
ERROR: cannot check space of '/mnt/red/@backup/': Inappropriate ioctl for device

and it works for other subvols:

~ # btrfs fi du -s /mnt/red/\@svn/
     Total   Exclusive  Set shared  Filename
  52.22GiB    10.59MiB     4.13GiB  /mnt/red/@svn/
~ # btrfs fi du -s /mnt/red/\@home/
     Total   Exclusive  Set shared  Filename
   2.08TiB     4.54GiB   124.51GiB  /mnt/red/@home/

On the other hand, those values also look wrong (2.08TiB?):

~ # btrfs fi df /mnt/red/\@home/
Data, RAID1: total=448.00GiB, used=447.39GiB
System, RAID1: total=32.00MiB, used=80.00KiB
Metadata, RAID1: total=3.00GiB, used=2.02GiB
GlobalReserve, single: total=512.00MiB, used=0.00B

~ # btrfs fi usage /mnt/red/\@home/
Overall:
    Device size:                   1.36TiB
    Device allocated:            902.06GiB
    Device unallocated:          493.21GiB
    Device missing:                  0.00B
    Used:                        898.81GiB
    Free (estimated):            247.22GiB  (min: 247.22GiB)
    Data ratio:                       2.00
    Metadata ratio:                   2.00
    Global reserve:              512.00MiB  (used: 0.00B)

Data,RAID1: Size:448.00GiB, Used:447.39GiB
   /dev/mapper/wd0  448.00GiB
   /dev/mapper/wd1  448.00GiB

Metadata,RAID1: Size:3.00GiB, Used:2.02GiB
   /dev/mapper/wd0    3.00GiB
   /dev/mapper/wd1    3.00GiB

System,RAID1: Size:32.00MiB, Used:80.00KiB
   /dev/mapper/wd0   32.00MiB
   /dev/mapper/wd1   32.00MiB

Unallocated:
   /dev/mapper/wd0  246.60GiB
   /dev/mapper/wd1  246.60GiB

~ # uname -sr
Linux 4.4.76
~ # btrfs --version
btrfs-progs v4.12

What's wrong with one of my subvols? What's wrong with the space
reported by fi du?

Best regards,
Piotr Szymaniak.
Re: [RFC] Checksum of the parity
On Mon, Aug 14, 2017 at 2:18 PM, Goffredo Baroncelli wrote:
>> Anyway, I do wish I read the code better, so I knew exactly where, if
>> at all, the RMW code was happening on disk rather than just in memory.
>> There very clearly is RMW in memory code as a performance optimizer:
>> before a stripe gets written out it's possible to RMW it to add in
>> more changes or new files; that way raid56 isn't dog slow CoW'ing
>> literally a handful of 16KiB leaves each time, that then translate
>> into a minimum of 384K of writes.
>
> In case of a full stripe write, there is no RMW cycle, so no "write
> hole".

That is conflating disk writes and in-memory RMW. They have to be
separated. For sure there's in-memory RMW + CoW of the entire stripe to
disk for a tiny (1 byte) change to a file, because I've seen it. What I
don't know, and can't tell from the code, is if there's ever such a
thing as a partial stripe RMW (an overwrite of just a data strip and a
corresponding overwrite for parity). Any overwrite is where the write
hole comes into play. But inferring from the work Liu is apparently
doing on a journal, it must be true that there is such a thing as
overwrites with Btrfs raid56.

> Just out of curiosity, what is "minimum of 384k"? In a 3-disk raid5
> case, the minimum data is 64k * 2 (+ 64k of parity).

Bad addition on my part.

-- 
Chris Murphy
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 08/14/2017 09:08 PM, Chris Murphy wrote:
> On Mon, Aug 14, 2017 at 8:23 AM, Goffredo Baroncelli wrote:
>
>> From a theoretical point of view, if you have a "PURE" COW
>> file-system, you don't need a journal. Unfortunately a RAID5/6 stripe
>> update is a RMW cycle, so you need a journal to keep it in sync. The
>> same is true for NOCOW files (and their checksums)
>
> I'm pretty sure the raid56 rmw is in memory only, I don't think we
> have a case where a stripe is getting partial writes (a block in a
> stripe is being overwritten). Partial stripe updates with rmw *on
> disk* would mean Btrfs raid56 is not CoW.

I am not sure about that. Consider the following cases:
- what if we have to write less than a stripe?
- suppose we remove a file with a length of 4k. If you don't allow a
  RMW cycle, this means the space would be lost forever.

Pay attention that the size of a stripe (theoretically) could be quite
big: suppose you have an (insanely large) raid composed of 20 disks;
the stripe would be about 20 * 64k = ~1.2MB.

BR
G.Baroncelli

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: [RFC] Checksum of the parity
On 08/14/2017 09:28 PM, Chris Murphy wrote:
> On Mon, Aug 14, 2017 at 8:12 AM, Goffredo Baroncelli wrote:
>> On 08/13/2017 08:45 PM, Chris Murphy wrote:
>>> [2]
>>> Is Btrfs subject to the write hole problem manifesting on disk? I'm
>>> not sure, sadly I don't read the code well enough. But if all Btrfs
>>> raid56 writes are full stripe CoW writes, and if the prescribed order
>>> guarantees still happen: data CoW to disk > metadata CoW to disk >
>>> superblock update, then I don't see how the write hole happens. Write
>>> hole requires: RMW of a stripe, which is a partial stripe overwrite,
>>> and a crash during the modification of the stripe making that stripe
>>> inconsistent as well as still pointed to by metadata.
>>
>> RAID5 is *single* failure proof. And in order to have the write hole
>> bug we need two failures:
>> 1) a transaction is aborted (e.g. due to a power failure) and the
>>    result is that data and parity are misaligned
>> 2) a disk disappears
>>
>> These two events may happen even at different moments.
>>
>> The key is that when a disk disappears, all remaining ones are used
>> to rebuild the missing one. So if data and parity are misaligned, the
>> rebuilt disk is wrong.
>>
>> Let me show an example:
>>
>>   Disk 1    Disk 2    Disk 3 (parity)
>>   AA        BB        CC
>>
>> where CC = A ^ B
>>
>> Note1: A is valid data
>>
>> Suppose you update B, and due to a power failure you can't update the
>> parity; you have:
>>
>>   Disk 1    Disk 2    Disk 3 (parity)
>>   AA        DD        CC
>>
>> Of course CC != A ^ D (data and parity are misaligned).
>>
>> Pay attention that AA is still valid data.
>>
>> Now suppose you lose disk 1. If you want to read from it, you have to
>> perform a read of disk 2 and disk 3 to compute disk 1.
>>
>> However, disk 2 and disk 3 are misaligned, so doing D ^ C you don't
>> get A anymore.
>>
>> Note that it is not important if DD or B are valid or invalid data.
>
> Doesn't matter on Btrfs. Bad reconstruction due to wrong parity
> results in csum mismatch. This I've tested.

I never argued about that. The write hole is related to the *loss* of
valid data due to a misalignment between data and parity. The fact that
BTRFS is capable of detecting the problem and returning an -EIO doesn't
mitigate the loss of valid data. Pay attention that in my example A
reached the disk before the "failure events".

> I vaguely remember a while ago doing a dd conv=notrunc modification of
> a file that's raid5, and there was no RMW; what happened is the whole
> stripe was CoW'd and had the modification. So that would, hardware
> behaving correctly, mean that the raid5 data CoW succeeds, then there
> is a metadata CoW to point to it, then the super block is updated to
> point to the new tree.
>
> At any point, if there's an interruption, we have the old super
> pointing to the old tree which points to premodified data.
>
> Anyway, I do wish I read the code better, so I knew exactly where, if
> at all, the RMW code was happening on disk rather than just in memory.
> There very clearly is RMW in memory code as a performance optimizer:
> before a stripe gets written out it's possible to RMW it to add in
> more changes or new files; that way raid56 isn't dog slow CoW'ing
> literally a handful of 16KiB leaves each time, that then translate
> into a minimum of 384K of writes.

In case of a full stripe write, there is no RMW cycle, so no "write
hole". Unfortunately not all writes are a full stripe in size. I never
checked the code, but I hope that during a commit of the transaction
the writes are grouped into "full stripe writes" as far as possible.

Just out of curiosity, what is "minimum of 384k"? In a 3-disk raid5
case, the minimum data is 64k * 2 (+ 64k of parity).

> But yeah, Qu just said in another thread that Liu is working on a
> journal for the raid56 write hole problem. Thing is I don't see when
> it happens in the code or in practice (so far, it's really tedious to
> poke a file system with a stick).

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On Mon, 2017-08-14 at 11:53 -0400, Austin S. Hemmelgarn wrote:
> Quite a few applications actually _do_ have some degree of secondary
> verification or protection from a crash. Go look at almost any
> database software.

Then please give proper references for this!

This is from 2015, where you claimed this already and I looked up all
the bigger DBs, and they either couldn't do it at all, didn't do it per
default, or it required application support (i.e. from the programs
using the DB):
https://www.spinics.net/lists/linux-btrfs/msg50258.html

> It usually will not have checksumming, but it will almost always have
> support for a journal, which is enough to cover the particular data
> loss scenario we're talking about (unexpected unclean shutdown).

I don't think we're talking about this: we're talking about people
wanting checksumming to notice e.g. silent data corruption. The crash
case is only the corner case of what happens if data is written
correctly but csums are not.

> In my own experience, the things that use nodatacow fall into one of
> 4 categories:
> 1. Cases where the data is non-critical, and data loss will be
>    inconvenient but not fatal. Systemd journal files are a good
>    example of this, as are web browser profiles when you're using
>    profile sync.

I'd guess many people would want to have their log files valid and
complete. Same for their profiles (especially since people concerned
about their integrity might not want to have these synced to
Mozilla/Google etc.)

> 2. Cases where the upper level can reasonably be expected to have
>    some degree of handling, even if it's not correction. VM disk
>    images and most database applications fall into this category.

No. Wrong. Or prove to me that I'm wrong ;-)
And these two (VMs, DBs) are actually *the* main cases for nodatacow.

> And I and most other sysadmins I know would prefer the opposite with
> the addition of a secondary notification method. You can still hook
> the notification to stop the application, but you don't have to if
> you don't want to (and in cases 1 and 2 I listed above, you probably
> don't want to).

Then I guess btrfs is generally not the right thing for such people, as
in the CoW case it will also give them EIO on any corruptions and their
programs will fail.

Cheers,
Chris.
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On Mon, 2017-08-14 at 10:23 -0400, Austin S. Hemmelgarn wrote:
> Assume you have higher level verification. Would you rather not be
> able to read the data regardless of if it's correct or not, or be
> able to read it and determine yourself if it's correct or not?

What would then be the difference to the CoW + checksumming +
some-data-corruption case?! btrfs would also give EIO, and all these
applications you mention would fail then.

As I've said previously, one could provide end users with the means to
still access the faulty data. Or they could simply mount with
nochecksum.

> For almost anybody, the answer is going to be the second case,
> because the application knows better than the OS if the data is
> correct (and 'correct' may be a threshold, not some binary
> determination).

You've made that claim already once with VMs and DBs, and your claim
proved simply wrong. Most applications don't do this kind of
verification. And those that do probably rather just check whether the
data is valid and, if not, give an error or at best fall back to some
automatic backups (e.g. what package managers do).

I'd know of only few programs that are really capable of using data
they know is bogus and recovering from that automagically... the only
examples I'd know are some archive formats which include error
correcting codes. And I really mean using the blocks for recovery for
which the csum wouldn't verify (i.e. the ones that give an EIO)...
without ECCs, how would a program know what to do with such data?

I cannot imagine that many people would choose the second option, to be
honest. Working with bogus data?! What would be the benefit of that?

> At that point, you need to make the checksum error a warning instead
> of returning -EIO. How do you intend to communicate that warning back
> to the application? The kernel log won't work, because on any
> reasonably secure system it's not visible to anyone but root.

Still the same problem with CoW + any data corruption...

> There's also no side channel for the read() system calls that you can
> utilize. That then means that the checksums end up just being a means
> for the administrator to know some data wasn't written correctly, but
> they should know that anyway because the system crashed.

No, they'd have no idea if any / which data was written during the
crash.

> Looking at this from a different angle: without background, what
> would you assume the behavior to be for this? For most people, the
> assumption would be that this provides the same degree of data safety
> that the checksums do when the data is CoW.

I don't think the average user would have any such assumption. Most
people likely don't even know that there is implicitly no checksumming
if nodatacow is enabled.

What people may however have heard is that btrfs does do checksumming,
and they'd assume that their filesystem always gives them just valid
data (or an error)... and IMO that's actually what each modern fs
should do per default. Relying on higher levels to provide such means
is simply not realistic.

Cheers,
Chris.
Re: [RFC] Checksum of the parity
On Mon, Aug 14, 2017 at 8:12 AM, Goffredo Baroncelli wrote:
> On 08/13/2017 08:45 PM, Chris Murphy wrote:
>> [2]
>> Is Btrfs subject to the write hole problem manifesting on disk? I'm
>> not sure, sadly I don't read the code well enough. But if all Btrfs
>> raid56 writes are full stripe CoW writes, and if the prescribed order
>> guarantees still happen: data CoW to disk > metadata CoW to disk >
>> superblock update, then I don't see how the write hole happens. Write
>> hole requires: RMW of a stripe, which is a partial stripe overwrite,
>> and a crash during the modification of the stripe making that stripe
>> inconsistent as well as still pointed to by metadata.
>
> RAID5 is *single* failure proof. And in order to have the write hole
> bug we need two failures:
> 1) a transaction is aborted (e.g. due to a power failure) and the
>    result is that data and parity are misaligned
> 2) a disk disappears
>
> These two events may happen even at different moments.
>
> The key is that when a disk disappears, all remaining ones are used
> to rebuild the missing one. So if data and parity are misaligned, the
> rebuilt disk is wrong.
>
> Let me show an example:
>
>   Disk 1    Disk 2    Disk 3 (parity)
>   AA        BB        CC
>
> where CC = A ^ B
>
> Note1: A is valid data
>
> Suppose you update B, and due to a power failure you can't update the
> parity; you have:
>
>   Disk 1    Disk 2    Disk 3 (parity)
>   AA        DD        CC
>
> Of course CC != A ^ D (data and parity are misaligned).
>
> Pay attention that AA is still valid data.
>
> Now suppose you lose disk 1. If you want to read from it, you have to
> perform a read of disk 2 and disk 3 to compute disk 1.
>
> However, disk 2 and disk 3 are misaligned, so doing D ^ C you don't
> get A anymore.
>
> Note that it is not important if DD or B are valid or invalid data.

Doesn't matter on Btrfs. Bad reconstruction due to wrong parity results
in csum mismatch. This I've tested.

I vaguely remember a while ago doing a dd conv=notrunc modification of
a file that's raid5, and there was no RMW; what happened is the whole
stripe was CoW'd and had the modification. So that would, hardware
behaving correctly, mean that the raid5 data CoW succeeds, then there
is a metadata CoW to point to it, then the super block is updated to
point to the new tree.

At any point, if there's an interruption, we have the old super
pointing to the old tree which points to premodified data.

Anyway, I do wish I read the code better, so I knew exactly where, if
at all, the RMW code was happening on disk rather than just in memory.
There very clearly is RMW in memory code as a performance optimizer:
before a stripe gets written out it's possible to RMW it to add in more
changes or new files; that way raid56 isn't dog slow CoW'ing literally
a handful of 16KiB leaves each time, that then translate into a minimum
of 384K of writes.

But yeah, Qu just said in another thread that Liu is working on a
journal for the raid56 write hole problem. Thing is I don't see when it
happens in the code or in practice (so far, it's really tedious to poke
a file system with a stick).

-- 
Chris Murphy
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On Mon, Aug 14, 2017 at 8:23 AM, Goffredo Baroncelli wrote:
> From a theoretical point of view, if you have a "PURE" COW
> file-system, you don't need a journal. Unfortunately a RAID5/6 stripe
> update is a RMW cycle, so you need a journal to keep it in sync. The
> same is true for NOCOW files (and their checksums)

I'm pretty sure the raid56 rmw is in memory only, I don't think we have
a case where a stripe is getting partial writes (a block in a stripe is
being overwritten). Partial stripe updates with rmw *on disk* would
mean Btrfs raid56 is not CoW.

-- 
Chris Murphy
Re: btrfs-progs-v4.12: cross compiling
On 08/14/2017 05:10 PM, David Sterba wrote:
> On Mon, Aug 14, 2017 at 10:14:42PM +0800, Qu Wenruo wrote:
[...]
> mktables.c is synced from kernel sources, taking updates from there is
> easier than porting any changes to the proposed scripted
> implementation.
>
> The workflow is simple:
> - copy kernel mktables.c changes to btrfs-progs mktables.c

How does the kernel deal with this kind of problem?

Looking at the source of the btrfs Makefile, it is simpler to replace

  mktables: kernel-lib/mktables.c
  	@echo "    [CC]     $@"
  	$(Q)$(CC) $(CFLAGS) $< -o $@

with

  mktables: kernel-lib/mktables.c
  	@echo "    [HOSTCC] $@"
  	$(Q)$(HOSTCC) $(CFLAGS) $< -o $@

where HOSTCC is defined as HOSTCC=gcc
(maybe the same applies also to CFLAGS <-> HOSTCFLAGS?)

> - compile mktables
> - run 'make kernel-lib/tables.c'
> - commit the changes to git
>
> All of that happens very rarely, if ever; the raid56 tables and
> correction algorithms are set in stone. Any extensions need to be done
> on both sides kernel/userspace.
>
>>>> What about using script to generate it?
>>>
>>> We do have the mktables utility to generate it and I'll regenerate
>>> it should there be a change to kernel-lib/mktables.c
>>
>> I mean to replace mktables.c with a script.
>> So no cross compiler problems at all, and even easier Makefile.
>> No dependence on "mktables" program.
>
> Somebody has to implement the script and verify that the output is the
> same, eventually sync changes. The cross-compilation should be fixed
> with the pregenerated tables.c. Is Makefile size really a concern? The
> number of related lines is like 7. I don't see any benefit in what you
> propose and hopefully explained my viewpoint enough so I don't have to
> continue.

-- 
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
[PATCH] btrfs: test incremental send with compression and extent cloning
From: Filipe Manana

Test that an incremental send/receive operation will not fail when the
destination filesystem has compression enabled and the source
filesystem has a 4K extent at a file offset 0 that is not compressed
and that is shared.

This currently fails on btrfs and is fixed by the following patch for
the linux kernel:

  "Btrfs: incremental send, fix emission of invalid clone operations"

Signed-off-by: Filipe Manana
---
 tests/btrfs/149     | 113
 tests/btrfs/149.out |  14 +++
 tests/btrfs/group   |   1 +
 3 files changed, 128 insertions(+)
 create mode 100755 tests/btrfs/149
 create mode 100644 tests/btrfs/149.out

diff --git a/tests/btrfs/149 b/tests/btrfs/149
new file mode 100755
index ..432ec707
--- /dev/null
+++ b/tests/btrfs/149
@@ -0,0 +1,113 @@
+#! /bin/bash
+# FS QA Test No. btrfs/149
+#
+# Test that an incremental send/receive operation will not fail when the
+# destination filesystem has compression enabled and the source filesystem
+# has a 4K extent at a file offset 0 that is not compressed and that is
+# shared.
+#
+#---
+#
+# Copyright (C) 2017 SUSE Linux Products GmbH. All Rights Reserved.
+# Author: Filipe Manana
+#
+# This program is free software; you can redistribute it and/or
+# modify it under the terms of the GNU General Public License as
+# published by the Free Software Foundation.
+#
+# This program is distributed in the hope that it would be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+# GNU General Public License for more details.
+#
+# You should have received a copy of the GNU General Public License
+# along with this program; if not, write the Free Software Foundation,
+# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+#---
+#
+
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+tmp=/tmp/$$
+status=1	# failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+	cd /
+	rm -fr $send_files_dir
+	rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+. ./common/reflink
+
+# real QA test starts here
+_supported_fs btrfs
+_supported_os Linux
+_require_test
+_require_scratch
+_require_scratch_reflink
+_require_odirect
+
+send_files_dir=$TEST_DIR/btrfs-test-$seq
+
+rm -f $seqres.full
+rm -fr $send_files_dir
+mkdir $send_files_dir
+
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount "-o compress"
+
+# Write to our file using direct IO, so that this way the write ends up not
+# getting compressed, that is, we get a regular extent which is neither
+# inlined nor compressed.
+# Alternatively, we could have mounted the fs without compression enabled,
+# which would result as well in an uncompressed regular extent.
+$XFS_IO_PROG -f -d -c "pwrite -S 0xab 0 4K" $SCRATCH_MNT/foobar | _filter_xfs_io
+
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
+	$SCRATCH_MNT/mysnap1 > /dev/null
+
+# Clone the regular (not inlined) extent.
+$XFS_IO_PROG -c "reflink $SCRATCH_MNT/foobar 0 8K 4K" $SCRATCH_MNT/foobar \
+	| _filter_xfs_io
+
+$BTRFS_UTIL_PROG subvolume snapshot -r $SCRATCH_MNT \
+	$SCRATCH_MNT/mysnap2 > /dev/null
+
+$BTRFS_UTIL_PROG send -f $send_files_dir/1.snap \
+	$SCRATCH_MNT/mysnap1 2>&1 >/dev/null | _filter_scratch
+
+# Now do an incremental send of the second snapshot. The send stream can have
+# a clone operation to clone the extent at offset 0 to offset 8K. This operation
+# would fail on the receiver if it has compression enabled, since the write
+# operation of the extent at offset 0 was compressed because it was a buffered
+# write operation, and btrfs' clone implementation does not allow cloning inline
+# extents to offsets different from 0.
+$BTRFS_UTIL_PROG send -p $SCRATCH_MNT/mysnap1 -f $send_files_dir/2.snap \
+	$SCRATCH_MNT/mysnap2 2>&1 >/dev/null | _filter_scratch
+
+echo "File digests in the original filesystem:"
+md5sum $SCRATCH_MNT/mysnap1/foobar | _filter_scratch
+md5sum $SCRATCH_MNT/mysnap2/foobar | _filter_scratch
+
+# Now recreate the filesystem by receiving both send streams and verify we get
+# the same file content that the original filesystem had.
+_scratch_unmount
+_scratch_mkfs >>$seqres.full 2>&1
+_scratch_mount "-o compress"
+
+$BTRFS_UTIL_PROG receive -f $send_files_dir/1.snap $SCRATCH_MNT > /dev/null
+$BTRFS_UTIL_PROG receive -f $send_files_dir/2.snap $SCRATCH_MNT > /dev/null
+
+echo "File digests in the new filesystem:"
+md5sum $SCRATCH_MNT/mysnap1/foobar | _filter_scratch
+md5sum $SCRATCH_MNT/mysnap2/foobar | _filter_scratch
+
+status=0
+exit
diff --git a/tests/btrfs/149.out b/tests/btrfs/149.out
[PATCH] Btrfs: incremental send, fix emission of invalid clone operations
From: Filipe Manana

When doing an incremental send it's possible that the computed send
stream contains clone operations that will fail on the receiver if the
receiver has compression enabled and the clone operations target a
sector sized extent that starts at a zero file offset, is not compressed
on the source filesystem but ends up being compressed and inlined at the
destination filesystem.

Example scenario:

 $ mkfs.btrfs -f /dev/sdb
 $ mount -o compress /dev/sdb /mnt

 # By doing a direct IO write, the data is not compressed.
 $ xfs_io -f -d -c "pwrite -S 0xab 0 4K" /mnt/foobar

 $ btrfs subvolume snapshot -r /mnt /mnt/mysnap1
 $ xfs_io -c "reflink /mnt/foobar 0 8K 4K" /mnt/foobar
 $ btrfs subvolume snapshot -r /mnt /mnt/mysnap2

 $ btrfs send -f /tmp/1.snap /mnt/mysnap1
 $ btrfs send -f /tmp/2.snap -p /mnt/mysnap1 /mnt/mysnap2

 $ umount /mnt
 $ mkfs.btrfs -f /dev/sdc
 $ mount -o compress /dev/sdc /mnt
 $ btrfs receive -f /tmp/1.snap /mnt
 $ btrfs receive -f /tmp/2.snap /mnt
 ERROR: failed to clone extents to foobar
 Operation not supported

The same could be achieved by mounting the source filesystem without
compression and doing a buffered IO write instead of a direct IO one,
and mounting the destination filesystem with compression enabled.

So fix this by issuing regular write operations in the send stream
instead of clone operations when the source offset is zero and the range
has a length matching the sector size.

Signed-off-by: Filipe Manana
---
 fs/btrfs/send.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/fs/btrfs/send.c b/fs/btrfs/send.c
index b082210df9c8..460be72ab78b 100644
--- a/fs/btrfs/send.c
+++ b/fs/btrfs/send.c
@@ -4992,6 +4992,25 @@ static int clone_range(struct send_ctx *sctx,
 	struct btrfs_key key;
 	int ret;
 
+	/*
+	 * Prevent cloning from a zero offset with a length matching the sector
+	 * size because in some scenarios this will make the receiver fail.
+	 *
+	 * For example, if in the source filesystem the extent at offset 0
+	 * has a length of sectorsize and it was written using direct IO, then
+	 * it can never be an inline extent (even if compression is enabled).
+	 * Then this extent can be cloned in the original filesystem to a non
+	 * zero file offset, but it may not be possible to clone in the
+	 * destination filesystem because it can be inlined due to compression
+	 * on the destination filesystem (as the receiver's write operations are
+	 * always done using buffered IO). The same happens when the original
+	 * filesystem does not have compression enabled but the destination
+	 * filesystem has.
+	 */
+	if (clone_root->offset == 0 &&
+	    len == sctx->send_root->fs_info->sectorsize)
+		return send_extent_data(sctx, offset, len);
+
 	path = alloc_path_for_send();
 	if (!path)
 		return -ENOMEM;
-- 
2.11.0
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 14/08/17 16:53, Austin S. Hemmelgarn wrote: > Quite a few applications actually _do_ have some degree of secondary > verification or protection from a crash. I am glad your applications do and you have no need of this feature. You are welcome not to use it. I, on the other hand, definitely want this feature and would have it enabled by default on all my systems despite the need for manual actions after some unclean shutdowns. > Go look at almost any database > software. It usually will not have checksumming, but it will almost > always have support for a journal, which is enough to cover the > particular data loss scenario we're talking about (unexpected unclean > shutdown). No, the problem we are talking about is the data-at-rest corruption that checksumming is designed to deal with. That is why I want it. The unclean shutdown is a side issue that means there is a trade-off to using it. No one is suggesting that checksums are any significant help with the unclean shutdown case, just that the existence of that atomicity issue does not **prevent** them being very useful for the function for which they were designed. The degree to which any particular sysadmin will choose to enable or disable checksums on nodatacow files will depend on how much they value the checksum protection vs. the impact of manually fixing problems after some unclean shutdowns. In my particular case, many of these nodatacow files are large, very long-lived and only in use intermittently. I would like my monthly "btrfs scrub" to know they haven't gone bad but they are extremely unlikely to be in the middle of a write during an unclean shutdown so I am likely to have very few false errors. They are all backed up, but without checksumming I don't know that the backup needs to be restored (or even that I am not backing up now-bad data). 
Graham
Re: [PATCH v3] btrfs: preserve i_mode if __btrfs_set_acl() fails
On Wed, Aug 02, 2017 at 03:18:27AM -0300, Ernesto A. Fernández wrote:
> When changing a file's acl mask, btrfs_set_acl() will first set the
> group bits of i_mode to the value of the mask, and only then set the
> actual extended attribute representing the new acl.
>
> If the second part fails (due to lack of space, for example) and the
> file had no acl attribute to begin with, the system will from now on
> assume that the mask permission bits are actual group permission bits,
> potentially granting access to the wrong users.
>
> Prevent this by restoring the original mode bits if __btrfs_set_acl
> fails.
>
> Signed-off-by: Ernesto A. Fernández
> ---
> Please ignore the two previous versions, this is far simpler and has the
> same effect. To Josef Bacik: thank you for your review, I'm sorry I
> wasted your time.

This version is much better, starting the transaction would add some
overhead and would need to be measured so we have an idea about the
impact.

Reviewed-by: David Sterba
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 2017-08-14 11:13, Graham Cobb wrote:
> On 14/08/17 15:23, Austin S. Hemmelgarn wrote:
>> Assume you have higher level verification.
> But almost no applications do. In real life, the decision
> making/correction process will be manual and labour-intensive (for
> example, running fsck on a virtual disk or restoring a file from
> backup).
Quite a few applications actually _do_ have some degree of secondary
verification or protection from a crash. Go look at almost any database
software. It usually will not have checksumming, but it will almost
always have support for a journal, which is enough to cover the
particular data loss scenario we're talking about (unexpected unclean
shutdown).

>> Would you rather not be able to read the data regardless of if it's
>> correct or not, or be able to read it and determine yourself if it's
>> correct or not?
> It must be controllable on a per-file basis, of course. For the tiny
> number of files where the app can both spot the problem and correct it
> (for example if it has a journal) the current behaviour could be used.
In my own experience, the things that use nodatacow fall into one of 4
categories:

1. Cases where the data is non-critical, and data loss will be
   inconvenient but not fatal. Systemd journal files are a good example
   of this, as are web browser profiles when you're using profile sync.
2. Cases where the upper level can reasonably be expected to have some
   degree of handling, even if it's not correction. VM disk images and
   most database applications fall into this category.
3. Cases where data corruption will take out the application anyway.
   Poorly written database software is the primary example of this.
4. Things that shouldn't be using nodatacow because data safety is the
   most important aspect of the system.

The first two cases work perfectly fine with the current behavior and
are arguably no better off either way. The third is functionally fine
with the current behavior provided that the crash doesn't change state
(which isn't a guarantee), but could theoretically benefit from the
determinism of knowing the app will die if the data is bad. The fourth
is what most people seem to want this for, and don't realize that even
if this is implemented, they will be no better off on average.

> But, on MY system, I absolutely would **always** select the first
> option (-EIO). I need to know that a potential problem may have
> occurred and will take manual action to decide what to do. Of course,
> this also needs a special utility (as Christoph proposed) to be able
> to force the read (to allow me to examine the data) and to be able to
> reset the checksum (although that is presumably as simple as rewriting
> the data).
And I and most other sysadmins I know would prefer the opposite with
the addition of a secondary notification method. You can still hook the
notification to stop the application, but you don't have to if you
don't want to (and in cases 1 and 2 I listed above, you probably don't
want to).

> This is what happens normally with any filesystem when a disk block
> goes bad, but with the additional benefit of being able to examine a
> "possibly valid" version of the data block before overwriting it.

>> Looking at this from a different angle: Without background, what
>> would you assume the behavior to be for this? For most people, the
>> assumption would be that this provides the same degree of data safety
>> that the checksums do when the data is CoW.
> Exactly. The naive expectation is that turning off datacow does not
> prevent the bitrot checking from working. Also, the naive expectation
> (for any filesystem operation) is that if there is any doubt about the
> reliability of the data, the error is reported for the user to deal
> with.
The problem is that the naive expectation about data safety appears to
be that adding checksumming support for nodatacow will improve safety,
which it WILL NOT do. All it will do is add some reporting that will
have a 50%+ rate of false positives (there is the very real possibility
that the unexpected power loss will corrupt the checksum or the data if
you're on anything but a traditional hard drive). If you have something
that you need data safety for and can't be arsed to pay attention to
whether or not your system had an unclean shutdown, then you have two
practical options:
1. Don't use nodatacow.
2. Do some form of higher level verification.
Nothing about that is going to magically change because you suddenly
have checksums telling you the data might be bad. Now, you _might_ be
better off in a situation where the data got corrupted for some other
reason (say, a media error for example), but even then you should have
higher level verification, and it won't provide much benefit unless
you're using replication or parity in BTRFS.
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 14/08/17 15:23, Austin S. Hemmelgarn wrote: > Assume you have higher level verification. But almost no applications do. In real life, the decision making/correction process will be manual and labour-intensive (for example, running fsck on a virtual disk or restoring a file from backup). > Would you rather not be able > to read the data regardless of if it's correct or not, or be able to > read it and determine yourself if it's correct or not? It must be controllable on a per-file basis, of course. For the tiny number of files where the app can both spot the problem and correct it (for example if it has a journal) the current behaviour could be used. But, on MY system, I absolutely would **always** select the first option (-EIO). I need to know that a potential problem may have occurred and will take manual action to decide what to do. Of course, this also needs a special utility (as Christoph proposed) to be able to force the read (to allow me to examine the data) and to be able to reset the checksum (although that is presumably as simple as rewriting the data). This is what happens normally with any filesystem when a disk block goes bad, but with the additional benefit of being able to examine a "possibly valid" version of the data block before overwriting it. > Looking at this from a different angle: Without background, what would > you assume the behavior to be for this? For most people, the assumption > would be that this provides the same degree of data safety that the > checksums do when the data is CoW. Exactly. The naive expectation is that turning off datacow does not prevent the bitrot checking from working. Also, the naive expectation (for any filesystem operation) is that if there is any doubt about the reliability of the data, the error is reported for the user to deal with. 
Re: btrfs-progs-v4.12: cross compiling
On Mon, Aug 14, 2017 at 10:14:42PM +0800, Qu Wenruo wrote:
> On 2017年08月14日 22:03, David Sterba wrote:
> > On Mon, Aug 14, 2017 at 09:17:08PM +0800, Qu Wenruo wrote:
> >> On 2017年08月14日 21:06, David Sterba wrote:
> >>> On Mon, Aug 14, 2017 at 02:17:26PM +0200, Hallo32 wrote:
> >>>> Since versions 4.12 btrfs-progs is complicated to cross compile for
> >>>> other systems.
> >>>> The problem is, that this version includes mktables, which needs to
> >>>> be compiled for the host system and executed there for the creation
> >>>> of tables.c.
> >>>>
> >>>> Are there any changes planed for the next version of btrfs-progs to
> >>>> make the cross compiling as simple as in the past? A included
> >>>> tables.c for example?
> >>>
> >>> Yes, keeping the generated tables.c around is fine. There's no reason
> >>> it needs to be generated each time during build. I'll fix that in
> >>> 4.12.1.
> >>
> >> But the number of lines and impossibility to review it makes it not
> >> suitable to be managed by git.
> >
> > I don't understand your concern. The file is generated from a set of
> > formulas, not intended to be updated directly.
>
> Yes, it should never be updated directly, so it's generated by a less
> than 400 lines program, instead of a whole 10K+ lines file managed by
> git.

mktables.c is synced from kernel sources, taking updates from there is
easier than porting any changes to the proposed scripted implementation.
The workflow is simple:

- copy kernel mktables.c changes to btrfs-progs mktables.c
- compile mktables
- run 'make kernel-lib/tables.c'
- commit the changes to git

All of that happens very rarely, if ever, the raid56 tables and
correction algorithms are set in stone. Any extensions need to be done
on both sides kernel/userspace.

> >> What about using script to generate it?
> >
> > We do have the mktables utility to generate it and I'll regenerate it
> > should there be a change to kernel-lib/mktables.c
>
> I mean to replace mktables.c with a script.
> So no cross compiler problems at all, and even easier Makefile.
> No dependence on "mktables" program.

Somebody has to implement the script and verify that the output is the
same, eventually sync changes. The cross-compilation should be fixed
with the pregenerated tables.c.

Is Makefile size really a concern? The number of related lines is like
7. I don't see any benefit in what you propose and hopefully explained
my viewpoint enough so I don't have to continue.
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 2017-08-14 08:24, Christoph Anton Mitterer wrote:
> On Mon, 2017-08-14 at 14:36 +0800, Qu Wenruo wrote:
>> And how are you going to write your data and checksum atomically when
>> doing in-place updates?
>> Exactly, that's the main reason I can figure out why btrfs disables
>> checksum for nodatacow.
> Still, I don't get the problem here...
>
> Yes it cannot be done atomically (without workarounds like a journal
> or so), but this should be only an issue in case of a crash or
> similar. And in this case nodatacow+nochecksum is anyway already bad,
> it's also not atomic, so data may be completely garbage (e.g. half
> written)... just that no one will ever notice.
>
> The only problem that nodatacow + checksuming + nonatomic should give
> is when the data was actually correctly written at a crash, but the
> cheksum was not, in which case the bogus checksum would invalidate the
> good data on next read.
> Or do I miss something?
>
> To me that sounds still much better than having no protection at all.
Assume you have higher level verification. Would you rather not be able
to read the data regardless of if it's correct or not, or be able to
read it and determine yourself if it's correct or not? For almost
anybody, the answer is going to be the second case, because the
application knows better than the OS if the data is correct (and
'correct' may be a threshold, not some binary determination). At that
point, you need to make the checksum error a warning instead of
returning -EIO. How do you intend to communicate that warning back to
the application? The kernel log won't work, because on any reasonably
secure system it's not visible to anyone but root. There's also no side
channel for the read() system calls that you can utilize. That then
means that the checksums end up just being a means for the
administrator to know some data wasn't written correctly, but they
should know that anyway because the system crashed. As a result, the
whole thing ends up reduced to some extra work for a pointless
notification that some people may not even see.

Looking at this from a different angle: Without background, what would
you assume the behavior to be for this? For most people, the assumption
would be that this provides the same degree of data safety that the
checksums do when the data is CoW. We already have enough issues with
people misunderstanding how things work and losing data as a result
(keep in mind that the average user doesn't read documentation and will
often blindly follow any random advice they see online), and we don't
need more that are liable to cause data loss.
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 08/14/2017 09:08 AM, Qu Wenruo wrote:
>> Supposing to log, for each BTRFS transaction, which "data NOCOW
>> blocks" will be updated and their checksums, in case a transaction is
>> interrupted you know which blocks have to be checked and are able to
>> verify if the checksum matches and correct the mismatch. Logging also
>> the checksum could help to identify if:
>> - the data is old
>> - the data is updated
>> - the updated data is correct
>>
>> The same approach could be used also to solve the issue related to
>> the infamous RAID5/6 hole: logging which blocks are updated, in case
>> of an aborted transaction you can check the parity which has to be
>> rebuilt.
>
> Indeed Liu is using a journal to solve the RAID5/6 write hole.
>
> But to address the lack-of-journal nature of btrfs, he introduced a
> journal device to handle it, since btrfs metadata is either written or
> trashed, we can't rely on existing btrfs metadata to handle a journal.

Liu's solution is a lot heavier: with it, you need to write both the
data and the parity twice. I am only suggesting to track which blocks
will be updated, and it would only be needed for the stripes involved
in a RMW cycle. This is a lot less data to write (8 bytes vs 4 KiB).

> PS: This reminds me why ZFS is still using a journal (called the ZFS
> intent log) but not the mandatory metadata CoW of btrfs.

From a theoretical point of view, if you have a "PURE" COW filesystem,
you don't need a journal. Unfortunately a RAID5/6 stripe update is a
RMW cycle, so you need a journal to keep it in sync. The same is true
for NOCOW files (and their checksums).

> Thanks,
> Qu

--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
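Goffredo's proposal amounts to a write-intent log: persist the stripe
numbers about to undergo a RMW update (a few bytes each), clear them
once data and parity are both on disk, and after a crash scrub only the
logged stripes. A toy sketch of the idea (not btrfs code; the class and
method names here are invented for illustration):

```python
class StripeIntentLog:
    """Toy write-intent log: record stripe numbers before a RMW update,
    clear them once data and parity have both reached disk."""

    def __init__(self):
        self.dirty = set()  # stripes with a possibly interrupted RMW

    def begin_rmw(self, stripe):
        # In a real implementation this entry is persisted BEFORE the
        # stripe is touched, so a crash can never hide an in-flight RMW.
        self.dirty.add(stripe)

    def commit_rmw(self, stripe):
        # Persisted AFTER data and parity are both durable.
        self.dirty.discard(stripe)

    def stripes_to_scrub_after_crash(self):
        # Only these stripes can have misaligned data/parity; recovery
        # cost is proportional to in-flight writes, not device size.
        return sorted(self.dirty)

log = StripeIntentLog()
log.begin_rmw(7)
log.begin_rmw(42)
log.commit_rmw(7)  # this RMW completed before the crash
print(log.stripes_to_scrub_after_crash())  # -> [42]
```

This is the same shape as md's write-intent bitmap: the log never
carries the data itself (unlike a full data+parity journal), which is
why it is much lighter but only narrows the scrub window rather than
closing the hole completely.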
Re: btrfs-progs-v4.12: cross compiling
On 2017年08月14日 22:03, David Sterba wrote:
> On Mon, Aug 14, 2017 at 09:17:08PM +0800, Qu Wenruo wrote:
>> On 2017年08月14日 21:06, David Sterba wrote:
>>> On Mon, Aug 14, 2017 at 02:17:26PM +0200, Hallo32 wrote:
>>>> Since versions 4.12 btrfs-progs is complicated to cross compile for
>>>> other systems.
>>>> The problem is, that this version includes mktables, which needs to
>>>> be compiled for the host system and executed there for the creation
>>>> of tables.c.
>>>>
>>>> Are there any changes planed for the next version of btrfs-progs to
>>>> make the cross compiling as simple as in the past? A included
>>>> tables.c for example?
>>>
>>> Yes, keeping the generated tables.c around is fine. There's no reason
>>> it needs to be generated each time during build. I'll fix that in
>>> 4.12.1.
>>
>> But the number of lines and impossibility to review it makes it not
>> suitable to be managed by git.
>
> I don't understand your concern. The file is generated from a set of
> formulas, not intended to be updated directly.

Yes, it should never be updated directly, so it's generated by a less
than 400 lines program, instead of a whole 10K+ lines file managed by
git.

>> What about using script to generate it?
>
> We do have the mktables utility to generate it and I'll regenerate it
> should there be a change to kernel-lib/mktables.c

I mean to replace mktables.c with a script.
So no cross compiler problems at all, and even easier Makefile.
No dependence on "mktables" program.

Thanks,
Qu
Re: [RFC] Checksum of the parity
On 08/13/2017 08:45 PM, Chris Murphy wrote:
> [2]
> Is Btrfs subject to the write hole problem manifesting on disk? I'm
> not sure, sadly I don't read the code well enough. But if all Btrfs
> raid56 writes are full stripe CoW writes, and if the prescribed order
> guarantees still happen: data CoW to disk > metadata CoW to disk >
> superblock update, then I don't see how the write hole happens. Write
> hole requires: RMW of a stripe, which is a partial stripe overwrite,
> and a crash during the modification of the stripe making that stripe
> inconsistent as well as still pointed to by metadata.

RAID5 is *single* failure proof. In order to hit the write hole bug we
need two failures:

1) a transaction is aborted (e.g. due to a power failure) and the
   result is that data and parity are misaligned
2) a disk disappears

These two events may even happen at different moments. The key is that
when a disk disappears, all the remaining ones are used to rebuild the
missing one. So if data and parity are misaligned, the rebuilt disk is
wrong.

Let me show an example:

	Disk 1		Disk 2		Disk 3 (parity)
	AA		BB		CC

where CC = AA ^ BB. Note: AA is valid data.

Supposing you update BB and, due to a power failure, you can't update
the parity, you have:

	Disk 1		Disk 2		Disk 3 (parity)
	AA		DD		CC

Of course CC != AA ^ DD (data and parity are misaligned). Pay attention
that AA is still valid data.

Now suppose you lose disk 1. If you want to read from it, you have to
read disk 2 and disk 3 to compute the contents of disk 1. However disk
2 and disk 3 are misaligned, so doing DD ^ CC you don't get AA anymore.
Note that it is not important whether DD or BB is valid or invalid
data.

Moreover, I have to point out that a simple scrub between 1) and 2) is
able to rebuild a correct parity. This would reduce the likelihood of
the "write hole" bug. The only case which would still exist is when 1)
and 2) happen at the same time (which is not impossible: e.g. if a disk
dies, it is not infrequent that the user shuts down the machine without
waiting for a clean shutdown).

BR
G.Baroncelli
--
gpg @keyserver.linux.it: Goffredo Baroncelli
Key fingerprint BBF5 1610 0B64 DAC6 5F7D 17B2 0EDA 9B37 8B82 E0B5
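The disk table above can be replayed with a few lines of XOR
arithmetic; this sketch (my own illustration, not code from the thread)
shows why the rebuilt block is garbage once data and parity are
misaligned, and why a scrub between the two failures closes the window:

```python
def xor(a, b):
    """Byte-wise XOR of two equal-length buffers (RAID5 parity math)."""
    return bytes(x ^ y for x, y in zip(a, b))

A = b"\xaa" * 4   # disk 1, still-valid data
B = b"\xbb" * 4   # disk 2
C = xor(A, B)     # disk 3, parity: C = A ^ B

# Power failure: disk 2 is rewritten with D but the parity update is lost.
D = b"\xdd" * 4
disk2, disk3 = D, C          # misaligned: C != A ^ D

# Disk 1 dies; RAID5 reconstructs it from the surviving members.
rebuilt = xor(disk2, disk3)
print(rebuilt == A)          # -> False: the still-valid data A is lost

# A scrub between failure 1) and 2) recomputes the parity and fixes it.
disk3 = xor(A, disk2)
print(xor(disk2, disk3) == A)  # -> True: reconstruction works again
```

Note the data on disk 1 was never touched by the interrupted write;
it is lost purely because reconstruction depends on a consistent
data/parity pair.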
Re: btrfs-progs-v4.12: cross compiling
On Mon, Aug 14, 2017 at 09:17:08PM +0800, Qu Wenruo wrote:
> On 2017年08月14日 21:06, David Sterba wrote:
> > On Mon, Aug 14, 2017 at 02:17:26PM +0200, Hallo32 wrote:
> >> Since versions 4.12 btrfs-progs is complicated to cross compile for
> >> other systems.
> >> The problem is, that this version includes mktables, which needs to be
> >> compiled for the host system and executed there for the creation of
> >> tables.c.
> >>
> >> Are there any changes planed for the next version of btrfs-progs to
> >> make the cross compiling as simple as in the past? A included tables.c
> >> for example?
> >
> > Yes, keeping the generated tables.c around is fine. There's no reason
> > it needs to be generated each time during build. I'll fix that in
> > 4.12.1.
>
> But the number of lines and impossibility to review it makes it not
> suitable to be managed by git.

I don't understand your concern. The file is generated from a set of
formulas, not intended to be updated directly.

> What about using script to generate it?

We do have the mktables utility to generate it and I'll regenerate it
should there be a change to kernel-lib/mktables.c
Re: Building a BTRFS test machine
On 2017-08-13 21:01, Cerem Cem ASLAN wrote: Would that be useful to build a BTRFS test machine, which will perform both software tests (btrfs send | btrfs receive, read/write random data etc.) and hardware tests, such as abrupt power off test, abruptly removing a raid-X disk physically, etc. In general, yes. There are already a couple of people (at least myself and Adam Borowski) who pick out patches we have interest in from the mailing list and test them in VM's, but having more people involved in testing is never a bad thing (cross verification of the testing is very helpful, because it can help identify when one of the test systems is suspect). Based on my own experience, if you do go with a VM, I've found that QEMU with LVM as the backing storage is probably one of the simplest setups to automate. You can easily script things like adding and removing disks, and using LVM for storage means you can add and remove (and snapshot) backend devices as needed. If it would be useful, what tests should it cover? Qu covered this well, so there's not much for me to add here. My own testing is pretty consistent with what Qu mentioned, plus a few special cases I've set up myself that I still need to get pushed upstream somewhere. For reference, the big ones I test that aren't (AFAIK at least) in any of the standard test sets are: * Large scale bulk parallel creation and removal of subvolumes. I've got a script that creates a 16 subvolumes, and then in parallel snapshots them 65536 times each, and then calls `btrfs subvolume delete` on all 1048576 subvolumes simultaneously. This is _really_ good at showing performance differences in handling of snapshots and subvolumes. * Large scale bulk reflink creation and deletion. Similar to the above, but using a 1GB file and the clone ioctl to create a similar number of reflinked files. * Scaling performance of directories with very large numbers of entries. 
In essence, I create directories with power of 2 numbers of files
starting at 512 and ending with 1048576, with random names and random
metadata, and see how long `ls -als` takes on the directory. This came
about because of performance issues seen on a file server where I work
with a directory that has well over four thousand files in it that got
noticeably worse performance with BTRFS than with ext4.
* Kernel handling of mixed profile filesystems. I've got a script that
generates a BTRFS filesystem with a data, metadata, and system chunk of
each possible profile (single, dup, raid0, raid1, raid10, raid5, and
raid6), and then makes sure the kernel can mount this and that balances
to each profile work correctly.
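The directory-scaling test described above can be sketched in a few
lines; this is my own minimal approximation (the real test uses random
names and metadata and walks powers of two from 512 up to 1048576):

```python
import os
import subprocess
import tempfile
import time

def scaling_run(counts=(512, 2048, 8192), base=None):
    """Create directories with increasing entry counts and time
    `ls -als` on each; returns {entry_count: seconds}."""
    base = base or tempfile.mkdtemp(prefix="dir-scaling-")
    results = {}
    for count in counts:
        d = os.path.join(base, f"d{count}")
        os.makedirs(d, exist_ok=True)
        # Empty files are enough to exercise directory-index lookups.
        for i in range(count):
            open(os.path.join(d, f"f{i}"), "w").close()
        t0 = time.monotonic()
        subprocess.run(["ls", "-als", d],
                       stdout=subprocess.DEVNULL, check=True)
        results[count] = time.monotonic() - t0
    return results

# Small sanity run; absolute timings vary per filesystem and cache state.
print(scaling_run((64, 256)))
```

Comparing the same run on a BTRFS and an ext4 mount point (by passing
`base=` on each) is enough to reproduce the kind of divergence described
above, though a serious benchmark would also drop caches between runs.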
Re: [PATCH v5 2/5] lib: Add zstd modules
On Fri, Aug 11, 2017 at 09:20:10AM -0400, Chris Mason wrote: > > > On 08/10/2017 03:25 PM, Hugo Mills wrote: > > On Thu, Aug 10, 2017 at 01:41:21PM -0400, Chris Mason wrote: > >> On 08/10/2017 04:30 AM, Eric Biggers wrote: > >>> > >>> Theses benchmarks are misleading because they compress the whole file as a > >>> single stream without resetting the dictionary, which isn't how data will > >>> typically be compressed in kernel mode. With filesystem compression the > >>> data > >>> has to be divided into small chunks that can each be decompressed > >>> independently. > >>> That eliminates one of the primary advantages of Zstandard (support for > >>> large > >>> dictionary sizes). > >> > >> I did btrfs benchmarks of kernel trees and other normal data sets as > >> well. The numbers were in line with what Nick is posting here. > >> zstd is a big win over both lzo and zlib from a btrfs point of view. > >> > >> It's true Nick's patches only support a single compression level in > >> btrfs, but that's because btrfs doesn't have a way to pass in the > >> compression ratio. It could easily be a mount option, it was just > >> outside the scope of Nick's initial work. > > > > Could we please not add more mount options? I get that they're easy > > to implement, but it's a very blunt instrument. What we tend to see > > (with both nodatacow and compress) is people using the mount options, > > then asking for exceptions, discovering that they can't do that, and > > then falling back to doing it with attributes or btrfs properties. > > Could we just start with btrfs properties this time round, and cut out > > the mount option part of this cycle. > > > > In the long run, it'd be great to see most of the btrfs-specific > > mount options get deprecated and ultimately removed entirely, in > > favour of attributes/properties, where feasible. > > > > It's a good point, and as was commented later down I'd just do mount -o > compress=zstd:3 or something. 
We've solved that already, here's a proposed patch that extends current
mount options,

https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg66248.html

and a patch that could be backported to older kernels so the new mount
options do not break mounts on older kernels with the new syntax

https://patchwork.kernel.org/patch/9845697/
Re: lazytime mount option—no support in Btrfs
On 2017-08-13 07:50, Adam Hunt wrote:
> Back in 2014 Ted Tso introduced the lazytime mount option for ext4 and
> shortly thereafter a more generic VFS implementation which was then
> merged into mainline. His early patches included support for Btrfs but
> those changes were removed prior to the feature being merged. His
> changelog includes the following note about the removal:
>
>   - Per Christoph's suggestion, drop support for btrfs and xfs for
> now, issues with how btrfs and xfs handle dirty inode tracking. We can
> add btrfs and xfs support back later or at the end of this series if
> we want to revisit this decision.
>
> My reading of the current mainline shows that Btrfs still lacks any
> support for lazytime. Has any thought been given to adding support for
> lazytime to Btrfs?
It has been at least lightly discussed (I forget the thread, but I did
a reasonably specific explanation of the interaction of the *atime and
lazytime options in a thread a while back when trying to explain to
someone why I wanted to be able to run with noatime and lazytime), but
I don't think the discussion got anywhere. I would personally love to
see support for it myself (or, you know, at least have some warning
that it isn't supported instead of just silently accepting and ignoring
it like we do currently), but I unfortunately don't have the time or
expertise to work on implementing it. If someone does post a patch
though, I'll be more than happy to throw a few dozen VM's at testing
it.
Re: btrfs-progs-v4.12: cross compiling
On 2017年08月14日 21:06, David Sterba wrote:
> On Mon, Aug 14, 2017 at 02:17:26PM +0200, Hallo32 wrote:
>> Since version 4.12 btrfs-progs is complicated to cross compile for
>> other systems. The problem is that this version includes mktables,
>> which needs to be compiled for the host system and executed there for
>> the creation of tables.c.
>>
>> Are there any changes planned for the next version of btrfs-progs to
>> make the cross compiling as simple as in the past? An included
>> tables.c, for example?
>
> Yes, keeping the generated tables.c around is fine. There's no reason
> it needs to be generated each time during build. I'll fix that in
> 4.12.1.

But the number of lines and the impossibility of reviewing it make it not suitable to be managed by git. What about using a script to generate it?

Thanks,
Qu

> Thanks for the report, cross-compilation should stay supported.
Re: btrfs-progs-v4.12: cross compiling
On 2017年08月14日 20:17, Hallo32 wrote:
> Hello at all,
>
> I'm new at this list. If the mail is not in line with your standards
> please inform me.
>
> Since version 4.12 btrfs-progs is complicated to cross compile for
> other systems. The problem is that this version includes mktables,
> which needs to be compiled for the host system and executed there for
> the creation of tables.c.
>
> Are there any changes planned for the next version of btrfs-progs to
> make the cross compiling as simple as in the past? An included
> tables.c, for example?

Including tables.c seems quite crazy to me. Just check how many lines tables.c has. If we screw up any number in it, RAID6 is screwed up, and we will never find the bug by reviewing such meaningless numbers.

We could enhance the makefile to use a host CC to compile and execute mktables, but that may make the makefile more complicated just for one file. (Well, I just don't want to mess with the traditional Makefile/configure any more.)

Another solution is to generate tables.c using a script instead of a binary executable. I'll try to port mktables to bash or python, which should solve the problem without further complicating the makefile. (And maybe I can update the kernel too, but the kernel already distinguishes its host CC and cross CC well, so it may not help much.)

Thanks,
Qu

> Best regards
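The script-based generation Qu proposes is small enough to sketch: the RAID6 tables are plain GF(2^8) arithmetic. The Python below is a rough illustration, not the actual port — it assumes the kernel's RAID6 generator polynomial 0x11d, and the table name `raid6_gfexp` is only an example of the kind of C array mktables emits into tables.c:

```python
# Sketch: generate a RAID6 Galois-field table in Python instead of
# compiling and running mktables on the build host.
# Assumes the kernel's GF(2^8) generator polynomial 0x11d.

POLY = 0x11d

def gf_mul(a, b):
    """Carry-less multiply in GF(2^8), reduced modulo POLY."""
    p = 0
    for _ in range(8):
        if b & 1:
            p ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= POLY
    return p

# gfexp[i] = 2^i in GF(2^8); RAID6 uses powers of 2 as disk coefficients.
gfexp = [0] * 256
v = 1
for i in range(256):
    gfexp[i] = v
    v = gf_mul(v, 2)

def emit_table(name, values):
    """Format a C array definition, 8 values per line, tables.c style."""
    lines = [", ".join("0x%02x" % x for x in values[i:i + 8])
             for i in range(0, len(values), 8)]
    return "const u8 %s[256] = {\n\t%s\n};\n" % (name, ",\n\t".join(lines))

print(emit_table("raid6_gfexp", gfexp))
```

Because every value is recomputed from the polynomial at build time, a script like this avoids both the host-executable problem and the "review 256 magic numbers" problem.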
Re: btrfs-progs-v4.12: cross compiling
On Mon, Aug 14, 2017 at 02:17:26PM +0200, Hallo32 wrote: > Since version 4.12 btrfs-progs is complicated to cross compile for > other systems. > The problem is that this version includes mktables, which needs to be > compiled for the host system and executed there for the creation of > tables.c. > > Are there any changes planned for the next version of btrfs-progs to make > the cross compiling as simple as in the past? An included tables.c for > example? Yes, keeping the generated tables.c around is fine. There's no reason it needs to be generated each time during build. I'll fix that in 4.12.1. Thanks for the report, cross-compilation should stay supported.
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 2017年08月14日 20:32, Christoph Anton Mitterer wrote:
> On Mon, 2017-08-14 at 15:46 +0800, Qu Wenruo wrote:
>> The problem here is, if you enable csum and even data is updated
>> correctly, only metadata is trashed, then you can't even read out the
>> correct data.
>
> So what? This problem occurs anyway *only* in case of a crash, and
> *only* if nodatacow+checksumming would be used. A case in which
> currently, the user can either only hope that his data is fine (unless
> higher levels provide some checksumming means[0]), or anyway needs to
> recover from a backup.

Let's make clear the combinations and their results in the power-loss case:

Datacow + datasum:     Good old data
Datacow + nodatasum:   Good old data
Nodatacow + datasum:   Good old data (data not committed yet) or -EIO (data updated)
                       (Not supported yet, so I just assume it would use the
                       current csum checking behavior.)
Nodatacow + nodatasum: Good old data (data not committed yet) or uncertain data.

The uncertain part is what should happen when the data is updated. If we really need to implement nodatacow + datasum, I prefer to make it consistent with the nodatacow + nodatasum behavior: at least read out the data and give some csum warning, instead of refusing to read and returning -EIO.

> Intuitively I'd also say it's much less likely that the data (which is
> more in terms of space) is written correctly while the checksum is
> not. Or is it?

Checksums are protected by mandatory metadata CoW, so a metadata update is always atomic. The checksum will either be updated correctly or trashed entirely, unlike data.

And it's highly possible for this to happen. When synchronising a filesystem, we write data first, then metadata (data and metadata may be cached by the disk controller, but at least we submit the requests to disk in that order), then flush all data and metadata to disk, and update the superblock last. Since metadata is updated with CoW, unless the superblock has been written to disk we are always reading the old metadata trees (including the csum tree).

So if power loss happens between the data being written to disk and the final superblock update, it's highly possible to hit the problem. And considering the data/metadata ratio, we spend more time flushing data than metadata, which increases the possibility even further.

> [0] And when I've investigated back when discussion rose up the first
> time and some list member claimed that most typical cases (DBs, VM
> images) would anyway do their own checksumming,... I came to the
> conclusion that most did not even support it, and even if they did it's
> not enabled per default and not really a *full* checksumming in most
> cases.

>> As btrfs csum checker will just prevent you from reading out any data
>> which doesn't match with csum.
>
> As I've said before, a tool could be provided that re-computes the
> checksums then (making the data accessible again)... or one could
> simply mount the fs with nochecksum or some other special option,
> which allows bypassing any checks.

Just as you pointed out, such csum bypassing should be the prerequisite for nodatacow + datasum. And unfortunately, we don't have such a facility yet.

>> Now it's not just data corruption, but data loss then.
>
> I think the former is worse than the latter. The latter gives you a
> chance of noticing it, and either recovering from a backup,
> regenerating the data (if possible) or manually marking the data as
> being "good" (though corrupted) again.

This depends. If the upper layer has its own error detection mechanism, like keeping a special file fsynced before a write (or just call it a journal), then allowing it to read out the corrupted data gives it a chance to find the data good and continue, while just returning -EIO kills that chance entirely.

BTW, normal user-space programs can handle a csum mismatch better than -EIO. Zip files, for example, have their own checksum but can't handle -EIO at all.

Thanks,
Qu

> Cheers,
> Chris.
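The ordering argument above can be made concrete with a toy model. The Python below is purely illustrative (not btrfs code): data is overwritten in place immediately, while the checksum only becomes visible at "superblock commit" time, so a crash between the two steps fails the read even though the on-disk data is intact:

```python
# Toy model of nodatacow + datasum under power loss (illustrative only,
# not btrfs code). The data block is overwritten in place right away;
# its csum lives in CoW metadata that only becomes visible once the
# superblock commit happens.
import zlib

class ToyFs:
    def __init__(self, data):
        self.data = data                         # in-place (nodatacow) block
        self.committed_csum = zlib.crc32(data)   # csum visible via old sb

    def write(self, new_data, crash_before_commit=False):
        self.data = new_data                     # step 1: overwrite in place
        if crash_before_commit:
            return                               # power loss: sb never updated
        self.committed_csum = zlib.crc32(new_data)  # step 2: sb commit

    def read(self):
        if zlib.crc32(self.data) != self.committed_csum:
            raise IOError("csum mismatch: intact new data, stale csum")
        return self.data

fs = ToyFs(b"old")
fs.write(b"new", crash_before_commit=True)
# fs.data is now fully written and correct, yet fs.read() would raise,
# because the "csum tree" still describes the pre-crash state.
```

This is exactly the window described above: the more time spent flushing data relative to metadata, the larger the chance the crash lands between step 1 and step 2.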
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On Mon, 2017-08-14 at 15:46 +0800, Qu Wenruo wrote:
> The problem here is, if you enable csum and even data is updated
> correctly, only metadata is trashed, then you can't even read out the
> correct data.

So what? This problem occurs anyway *only* in case of a crash, and *only* if nodatacow+checksumming would be used. A case in which currently, the user can either only hope that his data is fine (unless higher levels provide some checksumming means[0]), or anyway needs to recover from a backup.

Intuitively I'd also say it's much less likely that the data (which is more in terms of space) is written correctly while the checksum is not. Or is it?

[0] And when I've investigated back when discussion rose up the first time and some list member claimed that most typical cases (DBs, VM images) would anyway do their own checksumming,... I came to the conclusion that most did not even support it, and even if they did it's not enabled per default and not really a *full* checksumming in most cases.

> As btrfs csum checker will just prevent you from reading out any data
> which doesn't match with csum.

As I've said before, a tool could be provided that re-computes the checksums then (making the data accessible again)... or one could simply mount the fs with nochecksum or some other special option, which allows bypassing any checks.

> Now it's not just data corruption, but data loss then.

I think the former is worse than the latter. The latter gives you a chance of noticing it, and either recovering from a backup, regenerating the data (if possible) or manually marking the data as being "good" (though corrupted) again.

Cheers,
Chris.
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On Mon, 2017-08-14 at 14:36 +0800, Qu Wenruo wrote:
>> And how are you going to write your data and checksum atomically
>> when doing in-place updates?
>
> Exactly, that's the main reason I can figure out why btrfs disables
> checksum for nodatacow.

Still, I don't get the problem here... Yes, it cannot be done atomically (without workarounds like a journal or so), but this should only be an issue in case of a crash or similar. And in that case nodatacow+nochecksum is anyway already bad; it's also not atomic, so data may be completely garbage (e.g. half written)... just that no one will ever notice.

The only problem that nodatacow + checksumming + non-atomicity should give is when the data was actually correctly written at a crash but the checksum was not, in which case the bogus checksum would invalidate the good data on next read. Or do I miss something?

To me that sounds still much better than having no protection at all.

Cheers,
Chris.
btrfs-progs-v4.12: cross compiling
Hello at all,

I'm new at this list. If the mail is not in line with your standards please inform me.

Since version 4.12 btrfs-progs is complicated to cross compile for other systems. The problem is that this version includes mktables, which needs to be compiled for the host system and executed there for the creation of tables.c.

Are there any changes planned for the next version of btrfs-progs to make the cross compiling as simple as in the past? An included tables.c, for example?

Best regards
Re: how to benchmark schedulers
On 2017年08月08日 21:19, Janos Toth F. wrote:
> I think you should consider using Linux 4.12 which has bfq (bfq-mq)
> for blk-mq. So, you don't need out-of-tree BFQ patches if you switch
> to blk-mq (but now you are free to do so even if you have HDDs or SSDs
> which benefit from software schedulers, since you have some
> multi-queue schedulers for them). Just make sure to enable blk-mq (has
> to be a boot parameter or build-time choice) in order to gain access
> to bfq-mq. And remember that bfq-mq has to be activated manually (the
> build-time choice for a default scheduler is not valid for multi-queue
> schedulers; you will default to "none", which is effectively the new
> "no-op").
>
> Note: there is only one BFQ in 4.12 and it's bfq-mq, which runs under
> the name of simply BFQ (not bfq-mq; I only used that name to make it
> clear that BFQ in 4.12 is a multi-queue version of BFQ).
>
> I always wondered if Btrfs makes any use of FUA if it's enabled for
> compatible SATA devices (it's disabled by default because there are
> some drives with faulty firmware).

At least, if you use blktrace, you will find that the superblock write of btrfs submits its bio using FWFSM (Flush, Write, FUA, Sync, Meta). So for a compatible SATA device, it is possible to use FUA. It's the block driver that emulates FUA by issuing Write then Flush for incompatible disks.

However, in my previous test with a FUA-compatible HDD (you can check the kernel's "Write cache:" line), it turned out that at least for a DB workload the performance improvement is hard to notice, and it sometimes even reduces performance. Maybe it's related to the workload (PostgreSQL), but FUA support may or may not affect performance as expected.

Thanks,
Qu

> I also wonder if RAID10 is any better (or actually worse?) for
> metadata (and system) chunks than RAID1.
>
> On Tue, Aug 8, 2017 at 1:59 PM, Bernhard Landauer wrote:
>> Hello everyone
>>
>> I am looking for a way to test different available schedulers with
>> Manjaro's bfq-patched kernels on systems with both SSD and spinning
>> drives. Since phoronix-test-suite apparently is currently useless for
>> this task due to its bad config for bfq, I am looking for
>> alternatives. Do you have any suggestions for me?
>>
>> Thank you.
>> kind regards
>> Bernhard Landauer
Re: Building a BTRFS test machine
Definitely it will be useful. For quite some time, I was thinking about:

1. Creating test cases for each BTRFS feature (ex: https://btrfs.wiki.kernel.org/index.php/Scrub_corruption_cases).
2. Automating these feature-specific test scripts using bash/python while maintaining them in public git repos.
3. Launching BTRFS test machines in the cloud (aws spot instances are very cheap).
4. Running those feature-specific tests and publishing the results into the wiki or/and bugs into bugzilla.
5. Destroying the test machines.

Such a setup will require quite a good effort. As Qu mentioned, the challenging part will be integrating the existing different tests and making them automatic.

Another thought on BTRFS testing: at this moment, when someone sends a patch, I assume it is automatically pushed into patchwork.kernel.org (based on some back-end configuration, I guess). On the Btrfs mailing list we have a lot of bug reports, so using something like [1] we could write a program which parses the mailing list for a specific keyword like "BUGREPORT:", similar to "PATCH:", and creates a new bug or updates the details on bugzilla.kernel.org. This will ensure we can keep track of bugs on bugzilla. I think updating bugs based on mailing-list responses will be a little trickier than creating new bugs. Something like that can be useful as well.

I can help a bit on such a testbed setup :-)

Thanks!
Cheers,
Lakshmipathi.G
http://www.giis.co.in
http://www.webminal.org

On Mon, Aug 14, 2017 at 6:31 AM, Cerem Cem ASLAN wrote:
> Would that be useful to build a BTRFS test machine, which will perform
> both software tests (btrfs send | btrfs receive, read/write random
> data etc.) and hardware tests, such as abrupt power off test, abruptly
> removing a raid-X disk physically, etc.
>
> If it would be useful, what tests should it cover?
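The "BUGREPORT:" parsing idea is easy to prototype with Python's stdlib alone. The sketch below is only an illustration of the proposal: the keyword, the sample message, and the function name are assumptions, and the actual bugzilla filing step is left out:

```python
# Sketch: scan raw mailing-list messages for a "BUGREPORT:" subject tag,
# similar to how patchwork picks up "[PATCH]". The tag and the sample
# message are assumptions from the proposal; filing the result to
# bugzilla.kernel.org would be a separate step.
import email

BUG_TAG = "BUGREPORT:"

def find_bug_reports(raw_messages):
    """Return (subject, sender) pairs for messages tagged as bug reports."""
    reports = []
    for raw in raw_messages:
        msg = email.message_from_string(raw)
        subject = msg.get("Subject", "")
        # Only match the thread-starting tag, not "Re:" follow-ups, so one
        # thread yields a single new bugzilla entry.
        if subject.startswith(BUG_TAG):
            reports.append((subject, msg.get("From", "")))
    return reports

sample = (
    "From: tester@example.org\n"
    "Subject: BUGREPORT: csum mismatch after power loss\n"
    "\n"
    "dmesg excerpt and reproduction steps here\n"
)
print(find_bug_reports([sample]))
```

As noted above, the harder half is the update path: matching later replies in a thread back to an existing bug rather than opening a new one.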
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 2017年08月14日 15:43, Paul Jones wrote:
>> -----Original Message-----
>> From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-
>> ow...@vger.kernel.org] On Behalf Of Qu Wenruo
>> Sent: Monday, 14 August 2017 4:37 PM
>> To: Christoph Hellwig; Christoph Anton Mitterer
>> Cc: Btrfs BTRFS
>> Subject: Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
>>
>> On 2017年08月12日 15:42, Christoph Hellwig wrote:
>>> On Sat, Aug 12, 2017 at 02:10:18AM +0200, Christoph Anton Mitterer wrote:
>>>> Qu Wenruo wrote:
>>>>> Although Btrfs can disable data CoW, nodatacow also disables data
>>>>> checksum, which is another main feature for btrfs.
>>>>
>>>> Then the two should probably be decoupled and support for
>>>> nodatacow+checksumming be implemented?!
>>>
>>> And how are you going to write your data and checksum atomically when
>>> doing in-place updates?
>>
>> Exactly, that's the main reason I can figure out why btrfs disables
>> checksum for nodatacow.
>
> But does it matter if it's not strictly atomic? By turning off COW it
> implies you accept the risk of an ill-timed failure.

The problem here is, if you enable csum, and even if the data is updated correctly but only the metadata is trashed, then you can't even read out the correct data. The btrfs csum checker will just prevent you from reading out any data which doesn't match its csum.

Now it's not just data corruption, but data loss.

Thanks,
Qu

> Although from my point of view, any reason that would require COW to be
> disabled implies you're using the wrong filesystem anyway.
>
> Paul.
RE: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
> -----Original Message-----
> From: linux-btrfs-ow...@vger.kernel.org [mailto:linux-btrfs-
> ow...@vger.kernel.org] On Behalf Of Qu Wenruo
> Sent: Monday, 14 August 2017 4:37 PM
> To: Christoph Hellwig; Christoph Anton Mitterer
> Cc: Btrfs BTRFS
> Subject: Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
>
> On 2017年08月12日 15:42, Christoph Hellwig wrote:
> > On Sat, Aug 12, 2017 at 02:10:18AM +0200, Christoph Anton Mitterer wrote:
> >> Qu Wenruo wrote:
> >>> Although Btrfs can disable data CoW, nodatacow also disables data
> >>> checksum, which is another main feature for btrfs.
> >>
> >> Then the two should probably be decoupled and support for
> >> nodatacow+checksumming be implemented?!
> >
> > And how are you going to write your data and checksum atomically when
> > doing in-place updates?
>
> Exactly, that's the main reason I can figure out why btrfs disables checksum
> for nodatacow.

But does it matter if it's not strictly atomic? By turning off COW it implies you accept the risk of an ill-timed failure. Although from my point of view, any reason that would require COW to be disabled implies you're using the wrong filesystem anyway.

Paul.
Re: Building a BTRFS test machine
On 2017年08月14日 09:01, Cerem Cem ASLAN wrote:
> Would that be useful to build a BTRFS test machine, which will perform
> both software tests (btrfs send | btrfs receive, read/write random
> data etc.) and hardware tests, such as abrupt power off test, abruptly
> removing a raid-X disk physically, etc.

If you want to contribute to btrfs development/bug reporting, then it's definitely helpful.

> If it would be useful, what tests should it cover?

At least fstests (the old xfstests; some tests are known to fail, so you need some knowledge to judge which failures are real problems), the selftests from btrfs-progs, and the fs-related tests from the Linux Test Project. The hard part is to integrate all these different tests and make them automatic.

I'm not familiar with hardware crash tests, but they would definitely help, as currently we're only using dm-flakey to *emulate* power loss, which is still not a true power-off.

Other tests, like performance tests from various other projects like postgresql/apache, will also help.

Thanks,
Qu
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 2017年08月13日 22:08, Goffredo Baroncelli wrote:
> On 08/12/2017 02:12 PM, Hugo Mills wrote:
>> On Sat, Aug 12, 2017 at 01:51:46PM +0200, Christoph Anton Mitterer wrote:
>>> On Sat, 2017-08-12 at 00:42 -0700, Christoph Hellwig wrote:
>>> [...] good, but csum is not
>>
>> I don't think this is a particularly good description of the problem.
>> I'd say it's more like this: if you write data and metadata
>> separately (which you have to do in the nodatacow case), and the
>> system halts between the two writes, then you either have the new
>> data with the old csum, or the old csum with the new data. Both data
>> and csum are "good", but good from different states of the FS.
>>
>> In both cases (data first or metadata first), the csum doesn't match
>> the data, and so you now have an I/O error reported when trying to
>> read that data. You can't easily fix this, because when the data and
>> csum don't match, you need to know the _reason_ they don't match --
>> is it because the machine was interrupted during write (in which case
>> you can fix it), or is it because the hard disk has had someone write
>> data to it directly, and the data is now toast (in which case you
>> shouldn't fix the I/O error)?
>
> I am still inclined to think that this kind of problem could be solved
> using a journal: if you track which blocks are updated in the
> transaction and their checksums, then if the transaction is
> interrupted, you can always rebuild the data/checksum pair. In case of
> interruption of a transaction:
> - all COW data are trashed
> - some NOCOW data might be written
> - all metadata (which are COW) are trashed

The idea itself sounds good; however, btrfs doesn't use a journal (yet), and that means we would need to introduce one, while btrfs currently uses metadata CoW to handle most of the work a journal would do.

> Supposing we log, for each transaction, which "data NOCOW blocks" will
> be updated and their checksums: in case a transaction is interrupted,
> you know which blocks have to be checked, and you are able to verify
> whether the checksum matches and correct the mismatch.
>
> Logging also the checksum could help to identify if:
> - the data is old
> - the data is updated
> - the updated data is correct
>
> The same approach could also be used to solve the issue related to the
> infamous RAID5/6 hole: logging which blocks are updated, so that in
> case of an aborted transaction you can check the parity which has to
> be rebuilt.

Indeed, Liu is using a journal to solve the RAID5/6 write hole. But to address the lack-of-journal nature of btrfs, he introduced a separate journal device to handle it; since btrfs metadata is either written or trashed, we can't rely on the existing btrfs metadata to hold a journal.

PS: This reminds me why ZFS still uses a journal (called the ZFS intent log) rather than the mandatory metadata CoW of btrfs.

Thanks,
Qu

>> Basically, nodatacow bypasses the very mechanisms that are meant to
>> provide consistency in the filesystem.
>>
>> Hugo.
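Goffredo's logging scheme can be sketched as a tiny intent log (illustrative only — btrfs has no such journal today, and all names here are made up): record the old and new checksums for each NOCOW block before the in-place overwrite, and on recovery compare the on-disk block against both to distinguish the three cases he lists:

```python
# Illustrative intent log for NOCOW writes (btrfs has no such journal
# today; function and field names are invented for this sketch). Before
# overwriting a block in place, we would log (block, old_csum, new_csum);
# after a crash, recovery compares the on-disk block against both to
# tell old data, new data and corruption apart.
import zlib

def csum(data):
    return zlib.crc32(data)

def classify(on_disk, log_entry):
    """Return 'old', 'new' or 'corrupt' for a logged NOCOW block."""
    block_nr, old_csum, new_csum = log_entry
    c = csum(on_disk)
    if c == new_csum:
        return "new"      # overwrite completed: repair csum tree to new_csum
    if c == old_csum:
        return "old"      # overwrite never started: keep the old csum
    return "corrupt"      # torn write: only here is -EIO really justified

entry = (42, csum(b"old data"), csum(b"new data"))
print(classify(b"new data", entry))
```

With such a log, the "good new data, stale csum" case discussed earlier in the thread becomes repairable instead of producing a spurious I/O error; the cost is the extra log write before every NOCOW overwrite, which is exactly the trade-off Liu's RAID5/6 journal device makes.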
Re: [PATCH] btrfs: use BTRFS_FSID_SIZE for fsid
On 13.08.2017 06:58, Anand Jain wrote:
> We have a define for FSID size so use it.
>
> Signed-off-by: Anand Jain

I like consistency!

Reviewed-by: Nikolay Borisov
[PATCH] fstests: btrfs: enhance regression test for nocsum dio read's repair
I caught the following error from dmesg when this test case fails.

[17446.661127] Buffer I/O error on dev sdb1, logical block 64, async page read

We expect to inject disk IO errors on the device when xfs_io reads the specific file, but other processes may trigger the IO error earlier. So we can use task-filter to solve this problem.

Signed-off-by: Lu Fengqi
---
 tests/btrfs/142 | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/tests/btrfs/142 b/tests/btrfs/142
index 414af1b2..5bd8b728 100755
--- a/tests/btrfs/142
+++ b/tests/btrfs/142
@@ -75,6 +75,7 @@ start_fail()
 {
 	echo 100 > $DEBUGFS_MNT/fail_make_request/probability
 	echo 2 > $DEBUGFS_MNT/fail_make_request/times
+	echo 1 > $DEBUGFS_MNT/fail_make_request/task-filter
 	echo 0 > $DEBUGFS_MNT/fail_make_request/verbose
 	echo 1 > $SYSFS_BDEV/make-it-fail
 }
@@ -83,6 +84,7 @@ stop_fail()
 {
 	echo 0 > $DEBUGFS_MNT/fail_make_request/probability
 	echo 0 > $DEBUGFS_MNT/fail_make_request/times
+	echo 0 > $DEBUGFS_MNT/fail_make_request/task-filter
 	echo 0 > $SYSFS_BDEV/make-it-fail
 }
@@ -118,16 +120,17 @@ echo "step 3..repair the bad copy" >>$seqres.full
 # since raid1 consists of two copies, and the bad copy was put on stripe #1
 # while the good copy lies on stripe #0, the bad copy only gets access when the
 # reader's pid % 2 == 1 is true
-while true; do
-	# start_fail only fails the following dio read so the repair is
+start_fail
+while [[ -z ${result} ]]; do
+	# enable task-filter only fails the following dio read so the repair is
 	# supposed to work.
-	start_fail
-	$XFS_IO_PROG -d -c "pread -b 128K 0 128K" "$SCRATCH_MNT/foobar" > /dev/null &
-	pid=$!
-	wait
-	stop_fail
-	[ $((pid % 2)) == 1 ] && break
+	result=$(bash -c "
+	if [[ \$((\$\$ % 2)) -eq 1 ]]; then
+		echo 1 > /proc/\$\$/make-it-fail
+		exec $XFS_IO_PROG -d -c \"pread -b 128K 0 128K\" \"$SCRATCH_MNT/foobar\"
+	fi");
 done
+stop_fail
 _scratch_unmount
-- 
2.14.1
Re: RedHat 7.4 Release Notes: "Btrfs has been deprecated" - wut?
On 2017年08月12日 15:42, Christoph Hellwig wrote:
> On Sat, Aug 12, 2017 at 02:10:18AM +0200, Christoph Anton Mitterer wrote:
>> Qu Wenruo wrote:
>>> Although Btrfs can disable data CoW, nodatacow also disables data
>>> checksum, which is another main feature for btrfs.
>>
>> Then the two should probably be decoupled and support for
>> nodatacow+checksumming be implemented?!
>
> And how are you going to write your data and checksum atomically when
> doing in-place updates?

Exactly, that's the main reason I can figure out why btrfs disables checksum for nodatacow.

Thanks,
Qu