Re: [DOC] BTRFS Volume operations, Device Lists and Locks all in one page
On 2018年07月13日 13:32, Anand Jain wrote: > > > > > >>> But if you are planning to >>> record and start at transaction [14] then its an overkill because >>> transaction [19 and [20] are already in the disk. >> >> Yes, I'm doing it overkilled. > > Ah. Ok. > >> But it's already much better than scrub all block groups (my original >> plan). > > That's true. Which can be optimized later, but how? and scrub can't > fix RAID1. How could scrub not fix RAID1? For metadata or data with csum, just goes normal scrub. For data without csum, we know which device is resilvering, just use the other copy. Thanks, Qu > > Thanks, Anand > > >> Thanks, >> Qu >> >>> >>> Thanks, Anand >>> >>> Thanks, Qu > [3] https://patchwork.kernel.org/patch/10403311/ > > Further, as we do a self adapting chunk allocation in RAID1, it > needs > balance-convert to fix. IMO at some point we have to provide > degraded > raid1 chunk allocation and also modify the scrub to be chunk > granular. > > Thanks, Anand > >> Any idea on this? >> >> Thanks, >> Qu >> >>> Unlock: btrfs_fs_info::chunk_mutex >>> Unlock: btrfs_fs_devices::device_list_mutex >>> >>> --- >>> >>> >>> >>> Thanks, Anand >>> -- >>> To unsubscribe from this list: send the line "unsubscribe >>> linux-btrfs" in >>> the body of a message to majord...@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > -- > To unsubscribe from this list: send the line "unsubscribe > linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >>> -- >>> To unsubscribe from this list: send the line "unsubscribe >>> linux-btrfs" in >>> the body of a message to majord...@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >> signature.asc Description: OpenPGP digital signature
Re: [DOC] BTRFS Volume operations, Device Lists and Locks all in one page
But if you are planning to record and start at transaction [14] then its an overkill because transaction [19 and [20] are already in the disk. Yes, I'm doing it overkilled. Ah. Ok. But it's already much better than scrub all block groups (my original plan). That's true. Which can be optimized later, but how? and scrub can't fix RAID1. Thanks, Anand Thanks, Qu Thanks, Anand Thanks, Qu [3] https://patchwork.kernel.org/patch/10403311/ Further, as we do a self adapting chunk allocation in RAID1, it needs balance-convert to fix. IMO at some point we have to provide degraded raid1 chunk allocation and also modify the scrub to be chunk granular. Thanks, Anand Any idea on this? Thanks, Qu Unlock: btrfs_fs_info::chunk_mutex Unlock: btrfs_fs_devices::device_list_mutex --- Thanks, Anand -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] btrfs: use customized batch size for total_bytes_pinned
Omar Sandoval 於 2018-07-13 06:19 寫到: On Wed, Jul 11, 2018 at 11:59:36PM +0800, Ethan Lien wrote: In commit b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly pinned bytes") we use total_bytes_pinned to track how many bytes we are going to free in this transaction. When we are close to ENOSPC, we check it and know if we can make the allocation by commit the current transaction. For every data/metadata extent we are going to free, we add total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(), and release it in unpin_extent_range() when we finish the transaction. So this is a variable we frequently update but rarely read - just the suitable use of percpu_counter. But in previous commit we update total_bytes_pinned by default 32 batch size, making every update essentially a spin lock protected update. Since every spin lock/unlock operation involves syncing a globally used variable and some kind of barrier in a SMP system, this is more expensive than using total_bytes_pinned as a simple atomic64_t. So fix this by using a customized batch size. Since we only read total_bytes_pinned when we are close to ENOSPC and fail to alloc new chunk, we can use a really large batch size and have nearly no penalty in most cases. [Test] We test the patch on a 4-cores x86 machine: 1. falloate a 16GiB size test file. 2. take snapshot (so all following writes will be cow write). 3. run a 180 sec, 4 jobs, 4K random write fio on test file. We also add a temporary lockdep class on percpu_counter's spin lock used by total_bytes_pinned to track lock_stat. [Results] unpatched: lock_stat version 0.4 --- class namecon-bouncescontentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg total_bytes_pinned_percpu:82 82 0.21 0.61 29.46 0.36 298340 635973 0.09 11.01 173476.25 0.27 patched: lock_stat version 0.4 --- class namecon-bouncescontentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg total_bytes_pinned_percpu: 1 1 0.62 0.62 0.62 0.62 13601 31542 0.14 9.61 11016.90 0.35 [Analysis] Since the spin lock only protect a single in-memory variable, the contentions (number of lock acquisitions that had to wait) in both unpatched and patched version are low. But when we see acquisitions and acq-bounces, we get much lower counts in patched version. Here the most important metric is acq-bounces. It means how many times the lock get transferred between different cpus, so the patch can really recude cacheline bouncing of spin lock (also the global counter of percpu_counter) in a SMP system. Fixes: b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly pinned bytes") Signed-off-by: Ethan Lien --- V2: Rewrite commit comments. Add lock_stat test. Pull dirty_metadata_bytes out to a separate patch. fs/btrfs/ctree.h | 1 + fs/btrfs/extent-tree.c | 46 -- 2 files changed, 32 insertions(+), 15 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 118346aceea9..df682a521635 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -422,6 +422,7 @@ struct btrfs_space_info { * time the transaction commits. */ struct percpu_counter total_bytes_pinned; + s32 total_bytes_pinned_batch; Can this just be a constant instead of adding it to space_info? Yes constant is better here, I'll resend it, thanks. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [DOC] BTRFS Volume operations, Device Lists and Locks all in one page
On 2018年07月13日 08:20, Qu Wenruo wrote: > > > [snip] >>> In this case, it depends on when and how we mark the device resilvering. >>> If we record the generation of write error happens, then just initial a >>> scrub for generation greater than that generation. >> >> If we record all the degraded transactions then yes. Not just the last >> failed transaction. > > The last successful generation won't be upgraded until the scrub success. > >> >>> In the list, some guys mentioned that for LVM/mdraid they will record >>> the generation when some device(s) get write error or missing, and do >>> self cure. >>> I have been scratching on fix for this [3] for some time now. Thanks for the participation. In my understanding we are missing across-tree parent transid verification at the lowest possible granular OR >>> >>> Maybe the newly added first_key and level check could help detect such >>> mismatch? >>> other approach is to modify Liubo approach to provide a list of degraded chunks but without a journal disk. >>> >>> Currently, DEV_ITEM::generation is seldom used. (only for seed sprout >>> case) >>> Maybe we could reuse that member to record the last successful written >>> transaction to that device and do above purposed LVM/mdraid style self >>> cure? >> >> Record of just the last successful transaction won't help. OR its an >> overkill to fix a write hole. >> >> Transactions: 10 11 [12] [13] [14] < write hole > [19] [20] >> In the above example >> disk disappeared at transaction 11 and when it reappeared at >> the transaction 19, there were new writes as well as the resilver >> writes, > > Then the last good generation will be 11 and we will commit current > transaction as soon as we find a device disappear, and won't upgrade the > last good generation until the scrub finishes. > >> so we were able to write 12 13 14 and 19 20 and then >> the disk disappears again leaving a write hole. > > Only if in above transactions, the auto scrub finishes, the device will > has generation updated, or it will stay generation 11. > >> Now next time when >> disk reappears the last transaction indicates 20 on both-disks >> but leaving a write hole in one of disk. > > That will only happens if auto-scrub finishes in transaction 20, or its > last successful generation will stay 11. > >> But if you are planning to >> record and start at transaction [14] then its an overkill because >> transaction [19 and [20] are already in the disk. > > Yes, I'm doing it overkilled. > But it's already much better than scrub all block groups (my original plan). Well, my idea has a major problem, that's we don't have generation for block group item, that's to say either we use free space cache generation or add new BLOCK_GROUP_ITEM member for generation detection. Thanks, Qu > > Thanks, > Qu > >> >> Thanks, Anand >> >> >>> Thanks, >>> Qu >>> [3] https://patchwork.kernel.org/patch/10403311/ Further, as we do a self adapting chunk allocation in RAID1, it needs balance-convert to fix. IMO at some point we have to provide degraded raid1 chunk allocation and also modify the scrub to be chunk granular. Thanks, Anand > Any idea on this? > > Thanks, > Qu > >> Unlock: btrfs_fs_info::chunk_mutex >> Unlock: btrfs_fs_devices::device_list_mutex >> >> --- >> >> >> Thanks, Anand >> -- >> To unsubscribe from this list: send the line "unsubscribe >> linux-btrfs" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > signature.asc Description: OpenPGP digital signature
Re: [LKP] [lkp-robot] [mm] 9092c71bb7: blogbench.write_score -12.3% regression
Hi, Chris, Chris Mason writes: > On 19 Jun 2018, at 23:51, Huang, Ying wrote: "Huang, Ying" writes: > Hi, Josef, > > Do you have time to take a look at the regression? > > kernel test robot writes: > >> Greeting, >> >> FYI, we noticed a -12.3% regression of blogbench.write_score and >> a +9.6% improvement >> of blogbench.read_score due to commit: >> >> >> commit: 9092c71bb724dba2ecba849eae69e5c9d39bd3d2 ("mm: use >> sc->priority for slab shrink targets") >> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git >> master >> >> in testcase: blogbench >> on test machine: 16 threads Intel(R) Xeon(R) CPU D-1541 @ >> 2.10GHz with 8G memory >> with following parameters: >> >> disk: 1SSD >> fs: btrfs >> cpufreq_governor: performance >> >> test-description: Blogbench is a portable filesystem benchmark >> that tries to reproduce the load of a real-world busy file >> server. >> test-url: > > I'm surprised, this patch is a big win in production here at FB. I'll > have to reproduce these results to better understand what is going on. > My first guess is that since we have fewer inodes in slab, we're > reading more inodes from disk in order to do the writes. > > But that should also make our read scores lower. Any update on this? Best Regards, Huang, Ying -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Why original mode doesn't use swap? (Original: Re: btrfs check lowmem, take 2)
On 2018年07月13日 07:14, Marc MERLIN wrote: > On Thu, Jul 12, 2018 at 01:26:41PM +0800, Qu Wenruo wrote: >> >> >> On 2018年07月12日 01:09, Chris Murphy wrote: >>> On Tue, Jul 10, 2018 at 12:09 PM, Marc MERLIN wrote: Thanks to Su and Qu, I was able to get my filesystem to a point that it's mountable. I then deleted loads of snapshots and I'm down to 26. IT now looks like this: gargamel:~# btrfs fi show /mnt/mnt Label: 'dshelf2' uuid: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d Total devices 1 FS bytes used 12.30TiB devid1 size 14.55TiB used 13.81TiB path /dev/mapper/dshelf2 gargamel:~# btrfs fi df /mnt/mnt Data, single: total=13.57TiB, used=12.19TiB System, DUP: total=32.00MiB, used=1.55MiB Metadata, DUP: total=124.50GiB, used=115.62GiB Metadata, single: total=216.00MiB, used=0.00B GlobalReserve, single: total=512.00MiB, used=0.00B Problems 1) btrfs check --repair _still_ takes all 32GB of RAM and crashes the server, despite my deleting lots of snapshots. Is it because I have too many files then? >>> >>> I think originally needs most of metdata in memory. >>> >>> I'm not understanding why btrfs check won't use swap like at least >>> xfs_repair and pretty sure e2fsck will as well. >> >> I don't understand either. >> >> Isn't memory from malloc() swappable? > > I never looked at the code and why/how it crashes, but my guess was > that it somehow causes the kernel to grab a lot of memory in the btrfs > driver and that is what is what is crashing the system. Btrfs check is done completely at user space, so it should not be related to kernel btrfs module. > If it were just malloc() the btrfs user space tool, it should be both > swappable like you said, and should also get OOM'ed. That's the case, but then why xfs/ext check tool could take up tons of swap without get killed by OOM? > > I suppose I can still be completely wrong, but I can't find another > logical explanation. > > I just tried running it again to trigger the problem, but because I > freed a lot of snapshots, btrfs check --repair goes back to only using > 10GB instead of 32GB, so I wasn't able to replicate OOM for you. At least it's a good news for you. > > Incidently, it died with: > gargamel:~# btrfs check --repair /dev/mapper/dshelf2 > enabling repair mode > Checking filesystem on /dev/mapper/dshelf2 > UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d > root 18446744073709551607 has a root item with a more recent gen (143376) > compared to the found > root node (139061) > ERROR: failed to repair root items: Invalid argument > > That said, when it was using a fair amount of RAM, I captured this: > USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND > root 1376 1.4 25.2 8256368 8240392 pts/18 R+ 14:52 1:07 btrfs check > --repair /dev/mapper/dshelf2 > > I don't know how to read /proc/meminfo, but that's what it said: > MemTotal: 32643792 kB > MemFree: 1367516 kB > MemAvailable: 15554836 kB > Buffers: 3491672 kB > Cached: 15900320 kB > SwapCached: 2092 kB > Active: 14577228 kB > Inactive: 15028608 kB > Active(anon): 12122180 kB > Inactive(anon): 2643176 kB > Active(file):2455048 kB > Inactive(file): 12385432 kB > Unevictable:8068 kB > Mlocked:8068 kB > SwapTotal: 15616764 kB < swap was totally unused and stays unused when > I get the system to crash > SwapFree: 15578020 kB > Dirty: 71956 kB > Writeback:64 kB > AnonPages: 10219976 kB > Mapped: 4033568 kB > Shmem: 4545552 kB > Slab: 713300 kB > SReclaimable: 395508 kB > SUnreclaim: 317792 kB > KernelStack: 11788 kB > PageTables:52592 kB > NFS_Unstable: 0 kB > Bounce:0 kB > WritebackTmp: 0 kB > CommitLimit:31938660 kB > Committed_AS: 20070736 kB > VmallocTotal: 34359738367 kB > VmallocUsed: 0 kB > VmallocChunk: 0 kB > HardwareCorrupted: 0 kB > AnonHugePages: 0 kB > ShmemHugePages:0 kB > ShmemPmdMapped:0 kB > CmaTotal: 16384 kB > CmaFree: 0 kB > HugePages_Total: 0 > HugePages_Free:0 > HugePages_Rsvd:0 > HugePages_Surp:0 > Hugepagesize: 2048 kB > Hugetlb: 0 kB > DirectMap4k: 1207572 kB > DirectMap2M:32045056 kB > > Does it help figure out where the memory was going and wehther kernel > memory was being used? Not really, much similar to what I observed. I also tried to over-commit my memory usage on my system, however it just freeze for several seconds and then get killed by OOM, failed to capture any useful info during that freeze. Thanks, Qu > > Marc > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo
Re: [DOC] BTRFS Volume operations, Device Lists and Locks all in one page
[snip] >> In this case, it depends on when and how we mark the device resilvering. >> If we record the generation of write error happens, then just initial a >> scrub for generation greater than that generation. > > If we record all the degraded transactions then yes. Not just the last > failed transaction. The last successful generation won't be upgraded until the scrub success. > >> In the list, some guys mentioned that for LVM/mdraid they will record >> the generation when some device(s) get write error or missing, and do >> self cure. >> >>> >>> I have been scratching on fix for this [3] for some time now. Thanks >>> for the participation. In my understanding we are missing across-tree >>> parent transid verification at the lowest possible granular OR >> >> Maybe the newly added first_key and level check could help detect such >> mismatch? >> >>> other approach is to modify Liubo approach to provide a list of >>> degraded chunks but without a journal disk. >> >> Currently, DEV_ITEM::generation is seldom used. (only for seed sprout >> case) >> Maybe we could reuse that member to record the last successful written >> transaction to that device and do above purposed LVM/mdraid style self >> cure? > > Record of just the last successful transaction won't help. OR its an > overkill to fix a write hole. > > Transactions: 10 11 [12] [13] [14] < write hole > [19] [20] > In the above example > disk disappeared at transaction 11 and when it reappeared at > the transaction 19, there were new writes as well as the resilver > writes, Then the last good generation will be 11 and we will commit current transaction as soon as we find a device disappear, and won't upgrade the last good generation until the scrub finishes. > so we were able to write 12 13 14 and 19 20 and then > the disk disappears again leaving a write hole. Only if in above transactions, the auto scrub finishes, the device will has generation updated, or it will stay generation 11. > Now next time when > disk reappears the last transaction indicates 20 on both-disks > but leaving a write hole in one of disk. That will only happens if auto-scrub finishes in transaction 20, or its last successful generation will stay 11. > But if you are planning to > record and start at transaction [14] then its an overkill because > transaction [19 and [20] are already in the disk. Yes, I'm doing it overkilled. But it's already much better than scrub all block groups (my original plan). Thanks, Qu > > Thanks, Anand > > >> Thanks, >> Qu >> >>> [3] https://patchwork.kernel.org/patch/10403311/ >>> >>> Further, as we do a self adapting chunk allocation in RAID1, it needs >>> balance-convert to fix. IMO at some point we have to provide degraded >>> raid1 chunk allocation and also modify the scrub to be chunk granular. >>> >>> Thanks, Anand >>> Any idea on this? Thanks, Qu > Unlock: btrfs_fs_info::chunk_mutex > Unlock: btrfs_fs_devices::device_list_mutex > > --- > > > Thanks, Anand > -- > To unsubscribe from this list: send the line "unsubscribe > linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html >>> -- >>> To unsubscribe from this list: send the line "unsubscribe >>> linux-btrfs" in >>> the body of a message to majord...@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html signature.asc Description: OpenPGP digital signature
Re: [PATCH v2 1/2] btrfs-progs: Rename OPEN_CTREE_FS_PARTIAL to OPEN_CTREE_TEMPORARY_SUPER
Hey. Better late than never ;-) Just to confirm: At least since 4.16.1, I could btrfs-restore from the broken fs image again (that I've described in "spurious full btrfs corruption" from around mid March). So the regression in btrfsprogs has in fact been fixed by these patches, it seems. Thanks, Chris. On Wed, 2018-04-11 at 15:29 +0800, Qu Wenruo wrote: > The old flag OPEN_CTREE_FS_PARTIAL is in fact quite easy to be > confused > with OPEN_CTREE_PARTIAL, which allow btrfs-progs to open damaged > filesystem (like corrupted extent/csum tree). > > However OPEN_CTREE_FS_PARTIAL, unlike its name, is just allowing > btrfs-progs to open fs with temporary superblocks (which only has 6 > basic trees on SINGLE meta/sys chunks). > > The usage of FS_PARTIAL is really confusing here. > > So rename OPEN_CTREE_FS_PARTIAL to OPEN_CTREE_TEMPORARY_SUPER, and > add > extra comment for its behavior. > Also rename BTRFS_MAGIC_PARTIAL to BTRFS_MAGIC_TEMPORARY to keep the > naming consistent. > > And with above comment, the usage of FS_PARTIAL in dump-tree is > obviously incorrect, fix it. > > Fixes: 8698a2b9ba89 ("btrfs-progs: Allow inspect dump-tree to show > specified tree block even some tree roots are corrupted") > Signed-off-by: Qu Wenruo > --- > changelog: > v2: > New patch > --- > cmds-inspect-dump-tree.c | 2 +- > convert/main.c | 4 ++-- > ctree.h | 8 +--- > disk-io.c| 12 ++-- > disk-io.h| 10 +++--- > mkfs/main.c | 2 +- > 6 files changed, 22 insertions(+), 16 deletions(-) > > diff --git a/cmds-inspect-dump-tree.c b/cmds-inspect-dump-tree.c > index 0802b31e9596..e6510851e8f4 100644 > --- a/cmds-inspect-dump-tree.c > +++ b/cmds-inspect-dump-tree.c > @@ -220,7 +220,7 @@ int cmd_inspect_dump_tree(int argc, char **argv) > int uuid_tree_only = 0; > int roots_only = 0; > int root_backups = 0; > - unsigned open_ctree_flags = OPEN_CTREE_FS_PARTIAL; > + unsigned open_ctree_flags = OPEN_CTREE_PARTIAL; > u64 block_only = 0; > struct btrfs_root *tree_root_scan; > u64 tree_id = 0; > diff --git a/convert/main.c b/convert/main.c > index 6bdfab40d0b0..80f3bed84c84 100644 > --- a/convert/main.c > +++ b/convert/main.c > @@ -1140,7 +1140,7 @@ static int do_convert(const char *devname, u32 > convert_flags, u32 nodesize, > } > > root = open_ctree_fd(fd, devname, mkfs_cfg.super_bytenr, > - OPEN_CTREE_WRITES | > OPEN_CTREE_FS_PARTIAL); > + OPEN_CTREE_WRITES | > OPEN_CTREE_TEMPORARY_SUPER); > if (!root) { > error("unable to open ctree"); > goto fail; > @@ -1230,7 +1230,7 @@ static int do_convert(const char *devname, u32 > convert_flags, u32 nodesize, > } > > root = open_ctree_fd(fd, devname, 0, > - OPEN_CTREE_WRITES | OPEN_CTREE_FS_PARTIAL); > + OPEN_CTREE_WRITES | > OPEN_CTREE_TEMPORARY_SUPER); > if (!root) { > error("unable to open ctree for finalization"); > goto fail; > diff --git a/ctree.h b/ctree.h > index fa861ba0b4c3..80d4e59a66ce 100644 > --- a/ctree.h > +++ b/ctree.h > @@ -45,10 +45,12 @@ struct btrfs_free_space_ctl; > #define BTRFS_MAGIC 0x4D5F53665248425FULL /* ascii _BHRfS_M, no null > */ > > /* > - * Fake signature for an unfinalized filesystem, structures might be > partially > - * created or missing. > + * Fake signature for an unfinalized filesystem, which only has > barebone tree > + * structures (normally 6 near empty trees, on SINGLE meta/sys > temporary chunks) > + * > + * ascii !BHRfS_M, no null > */ > -#define BTRFS_MAGIC_PARTIAL 0x4D5F536652484221ULL /* ascii !BHRfS_M, > no null */ > +#define BTRFS_MAGIC_TEMPORARY 0x4D5F536652484221ULL > > #define BTRFS_MAX_MIRRORS 3 > > diff --git a/disk-io.c b/disk-io.c > index 58eae709e0e8..9e8b1e9d295c 100644 > --- a/disk-io.c > +++ b/disk-io.c > @@ -1117,14 +1117,14 @@ static struct btrfs_fs_info > *__open_ctree_fd(int fp, const char *path, > fs_info->ignore_chunk_tree_error = 1; > > if ((flags & OPEN_CTREE_RECOVER_SUPER) > - && (flags & OPEN_CTREE_FS_PARTIAL)) { > + && (flags & OPEN_CTREE_TEMPORARY_SUPER)) { > fprintf(stderr, > - "cannot open a partially created filesystem for > recovery"); > + "cannot open a filesystem with temporary super block for > recovery"); > goto out; > } > > - if (flags & OPEN_CTREE_FS_PARTIAL) > - sbflags = SBREAD_PARTIAL; > + if (flags & OPEN_CTREE_TEMPORARY_SUPER) > + sbflags = SBREAD_TEMPORARY; > > ret = btrfs_scan_fs_devices(fp, path, _devices, > sb_bytenr, sbflags, > (flags & OPEN_CTREE_NO_DEVICES)); > @@ -1285,8 +1285,8 @@ static int check_super(struct btrfs_super_block > *sb, unsigned sbflags) > int csum_size; > > if
Re: Why original mode doesn't use swap? (Original: Re: btrfs check lowmem, take 2)
On Thu, Jul 12, 2018 at 01:26:41PM +0800, Qu Wenruo wrote: > > > On 2018年07月12日 01:09, Chris Murphy wrote: > > On Tue, Jul 10, 2018 at 12:09 PM, Marc MERLIN wrote: > >> Thanks to Su and Qu, I was able to get my filesystem to a point that > >> it's mountable. > >> I then deleted loads of snapshots and I'm down to 26. > >> > >> IT now looks like this: > >> gargamel:~# btrfs fi show /mnt/mnt > >> Label: 'dshelf2' uuid: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d > >> Total devices 1 FS bytes used 12.30TiB > >> devid1 size 14.55TiB used 13.81TiB path /dev/mapper/dshelf2 > >> > >> gargamel:~# btrfs fi df /mnt/mnt > >> Data, single: total=13.57TiB, used=12.19TiB > >> System, DUP: total=32.00MiB, used=1.55MiB > >> Metadata, DUP: total=124.50GiB, used=115.62GiB > >> Metadata, single: total=216.00MiB, used=0.00B > >> GlobalReserve, single: total=512.00MiB, used=0.00B > >> > >> > >> Problems > >> 1) btrfs check --repair _still_ takes all 32GB of RAM and crashes the > >> server, despite my deleting lots of snapshots. > >> Is it because I have too many files then? > > > > I think originally needs most of metdata in memory. > > > > I'm not understanding why btrfs check won't use swap like at least > > xfs_repair and pretty sure e2fsck will as well. > > I don't understand either. > > Isn't memory from malloc() swappable? I never looked at the code and why/how it crashes, but my guess was that it somehow causes the kernel to grab a lot of memory in the btrfs driver and that is what is what is crashing the system. If it were just malloc() the btrfs user space tool, it should be both swappable like you said, and should also get OOM'ed. I suppose I can still be completely wrong, but I can't find another logical explanation. I just tried running it again to trigger the problem, but because I freed a lot of snapshots, btrfs check --repair goes back to only using 10GB instead of 32GB, so I wasn't able to replicate OOM for you. Incidently, it died with: gargamel:~# btrfs check --repair /dev/mapper/dshelf2 enabling repair mode Checking filesystem on /dev/mapper/dshelf2 UUID: 0f1a0c9f-4e54-4fa7-8736-fd50818ff73d root 18446744073709551607 has a root item with a more recent gen (143376) compared to the found root node (139061) ERROR: failed to repair root items: Invalid argument That said, when it was using a fair amount of RAM, I captured this: USER PID %CPU %MEMVSZ RSS TTY STAT START TIME COMMAND root 1376 1.4 25.2 8256368 8240392 pts/18 R+ 14:52 1:07 btrfs check --repair /dev/mapper/dshelf2 I don't know how to read /proc/meminfo, but that's what it said: MemTotal: 32643792 kB MemFree: 1367516 kB MemAvailable: 15554836 kB Buffers: 3491672 kB Cached: 15900320 kB SwapCached: 2092 kB Active: 14577228 kB Inactive: 15028608 kB Active(anon): 12122180 kB Inactive(anon): 2643176 kB Active(file):2455048 kB Inactive(file): 12385432 kB Unevictable:8068 kB Mlocked:8068 kB SwapTotal: 15616764 kB < swap was totally unused and stays unused when I get the system to crash SwapFree: 15578020 kB Dirty: 71956 kB Writeback:64 kB AnonPages: 10219976 kB Mapped: 4033568 kB Shmem: 4545552 kB Slab: 713300 kB SReclaimable: 395508 kB SUnreclaim: 317792 kB KernelStack: 11788 kB PageTables:52592 kB NFS_Unstable: 0 kB Bounce:0 kB WritebackTmp: 0 kB CommitLimit:31938660 kB Committed_AS: 20070736 kB VmallocTotal: 34359738367 kB VmallocUsed: 0 kB VmallocChunk: 0 kB HardwareCorrupted: 0 kB AnonHugePages: 0 kB ShmemHugePages:0 kB ShmemPmdMapped:0 kB CmaTotal: 16384 kB CmaFree: 0 kB HugePages_Total: 0 HugePages_Free:0 HugePages_Rsvd:0 HugePages_Surp:0 Hugepagesize: 2048 kB Hugetlb: 0 kB DirectMap4k: 1207572 kB DirectMap2M:32045056 kB Does it help figure out where the memory was going and wehther kernel memory was being used? Marc -- "A mouse is a device used to point at the xterm you want to type in" - A.S.R. Microsoft is to operating systems what McDonalds is to gourmet cooking Home page: http://marc.merlins.org/ | PGP 7F55D5F27AAF9D08 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RESEND 2/2] btrfs-progs: make all programs and libraries optional
From: Omar Sandoval We have a build system internally which only needs to build the libraries out of a repository, not any binaries. I looked at how this works with other projects, and the best example was util-linux, which makes it possible to enable or disable everything individually. This is nice and really flexible, so let's do the same. This way, if you only want to build and install libbtrfsutil, you can simply do ./configure --disable-documentation --disable-all-programs --enable-libbtrfsutil make make install Signed-off-by: Omar Sandoval --- Makefile| 129 +++- Makefile.inc.in | 16 +- configure.ac| 138 +--- 3 files changed, 226 insertions(+), 57 deletions(-) diff --git a/Makefile b/Makefile index 62102baf..fe71b694 100644 --- a/Makefile +++ b/Makefile @@ -206,22 +206,40 @@ endif MAKEOPTS = --no-print-directory Q=$(Q) -# build all by default -progs = $(progs_install) btrfsck btrfs-corrupt-block - -# install only selected -progs_install = btrfs mkfs.btrfs btrfs-map-logical btrfs-image \ - btrfs-find-root btrfstune \ - btrfs-select-super - -# other tools, not built by default -progs_extra = btrfs-fragments - -progs_static = $(foreach p,$(progs),$(p).static) - -ifneq ($(DISABLE_BTRFSCONVERT),1) +ifeq ($(BUILD_BTRFS),1) +progs_install += btrfs +progs += btrfsck +endif +ifeq ($(BUILD_CONVERT),1) progs_install += btrfs-convert endif +ifeq ($(BUILD_CORRUPT_BLOCK),1) +progs += btrfs-corrupt-block +endif +ifeq ($(BUILD_FIND_ROOT),1) +progs_install += btrfs-find-root +endif +ifeq ($(BUILD_FRAGMENTS),1) +progs += btrfs-fragments +endif +ifeq ($(BUILD_IMAGE),1) +progs_install += btrfs-image +endif +ifeq ($(BUILD_MAP_LOGICAL),1) +progs_install += btrfs-map-logical +endif +ifeq ($(BUILD_MKFS),1) +progs_install += mkfs.btrfs +endif +ifeq ($(BUILD_SELECT_SUPER),1) +progs_install += btrfs-select-super +endif +ifeq ($(BUILD_TUNE),1) +progs_install += btrfstune +endif + +progs += $(progs_install) +progs_static = $(foreach p,$(progs),$(p).static) # external libs required by various binaries; for btrfs-foo, # specify btrfs_foo_libs = ; see $($(subst...)) rules below @@ -233,7 +251,7 @@ cmds_restore_cflags = -DBTRFSRESTORE_ZSTD=$(BTRFSRESTORE_ZSTD) CHECKER_FLAGS += $(btrfs_convert_cflags) # collect values of the variables above -standalone_deps = $(foreach dep,$(patsubst %,%_objects,$(subst -,_,$(filter btrfs-%, $(progs) $(progs_extra,$($(dep))) +standalone_deps = $(foreach dep,$(patsubst %,%_objects,$(subst -,_,$(filter btrfs-%, $(progs,$($(dep))) SUBDIRS = BUILDDIRS = $(patsubst %,build-%,$(SUBDIRS)) @@ -262,10 +280,21 @@ static_convert_objects = $(patsubst %.o, %.static.o, $(convert_objects)) static_mkfs_objects = $(patsubst %.o, %.static.o, $(mkfs_objects)) static_image_objects = $(patsubst %.o, %.static.o, $(image_objects)) -libs_shared = libbtrfs.so.0.1 libbtrfsutil.so.$(libbtrfsutil_version) -libs_static = libbtrfs.a libbtrfsutil.a +ifeq ($(BUILD_LIBBTRFS),1) +ifeq ($(BUILD_SHARED),1) +libs_shared += libbtrfs.so.0.1 +lib_links += libbtrfs.so.0 libbtrfs.so +endif +libs_static += libbtrfs.a +endif +ifeq ($(BUILD_LIBBTRFSUTIL),1) +ifeq ($(BUILD_SHARED),1) +libs_shared += libbtrfsutil.so.$(libbtrfsutil_version) +lib_links += libbtrfsutil.so.$(libbtrfsutil_major) libbtrfsutil.so +endif +libs_static += libbtrfsutil.a +endif libs = $(libs_shared) $(libs_static) -lib_links = libbtrfs.so.0 libbtrfs.so libbtrfsutil.so.$(libbtrfsutil_major) libbtrfsutil.so # make C=1 to enable sparse ifdef C @@ -303,7 +332,7 @@ endif $($(subst -,_,btrfs-$(@:%/$(notdir $@)=%)-cflags)) all: $(progs) $(libs) $(lib_links) $(BUILDDIRS) -ifeq ($(PYTHON_BINDINGS),1) +ifeq ($(BUILD_PYTHON),1) all: libbtrfsutil_python endif $(SUBDIRS): $(BUILDDIRS) @@ -353,7 +382,7 @@ testsuite: btrfs-corrupt-block fssum @echo "Export tests as a package" $(Q)cd tests && ./export-testsuite.sh -ifeq ($(PYTHON_BINDINGS),1) +ifeq ($(BUILD_PYTHON),1) test-libbtrfsutil: libbtrfsutil_python mkfs.btrfs $(Q)cd libbtrfsutil/python; \ LD_LIBRARY_PATH=../.. $(PYTHON) -m unittest discover -v tests @@ -413,7 +442,7 @@ libbtrfsutil.so.$(libbtrfsutil_major) libbtrfsutil.so: libbtrfsutil.so.$(libbtrf @echo "[LN] $@" $(Q)$(LN_S) -f $< $@ -ifeq ($(PYTHON_BINDINGS),1) +ifeq ($(BUILD_PYTHON),1) libbtrfsutil_python: libbtrfsutil.so.$(libbtrfsutil_major) libbtrfsutil.so libbtrfsutil/btrfsutil.h @echo "[PY] libbtrfsutil" $(Q)cd libbtrfsutil/python; \ @@ -439,14 +468,14 @@ btrfs-%.static: btrfs-%.static.o $(static_objects) $(patsubst %.o,%.static.o,$(s $(static_libbtrfs_objects) $(STATIC_LDFLAGS) \ $($(subst -,_,$(subst .static,,$@)-libs)) $(STATIC_LIBS) -btrfs-%: btrfs-%.o $(objects) $(standalone_deps) $(libs_static) +btrfs-%: btrfs-%.o $(objects) $(standalone_deps)
[PATCH RESEND 0/2] btrfs-progs: build improvements
From: Omar Sandoval Hi, Dave, This is a resend of a couple of patches I sent back in April, rebased on the current devel branch. Patch 1 cleans up some stale build targets, and patch 2 makes the btrfs-progs build more flexible, so that it's possible to pick and choose what gets built. Please consider these for the next progs release. Thanks! Omar Sandoval (2): btrfs-progs: remove stale dir-test and quick-test btrfs-progs: make all programs and libraries optional Makefile| 139 - Makefile.inc.in | 16 +- configure.ac| 138 +++-- dir-test.c | 518 quick-test.c| 226 - 5 files changed, 227 insertions(+), 810 deletions(-) delete mode 100644 dir-test.c delete mode 100644 quick-test.c -- 2.18.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH RESEND 1/2] btrfs-progs: remove stale dir-test and quick-test
From: Omar Sandoval These don't build anymore and don't appear to be used for anything. Signed-off-by: Omar Sandoval --- Makefile | 10 +- dir-test.c | 518 --- quick-test.c | 226 -- 3 files changed, 1 insertion(+), 753 deletions(-) delete mode 100644 dir-test.c delete mode 100644 quick-test.c diff --git a/Makefile b/Makefile index 544410e6..62102baf 100644 --- a/Makefile +++ b/Makefile @@ -495,14 +495,6 @@ btrfs-convert.static: $(static_convert_objects) $(static_objects) $(static_libbt @echo "[LD] $@" $(Q)$(CC) -o $@ $^ $(STATIC_LDFLAGS) $(btrfs_convert_libs) $(STATIC_LIBS) -dir-test: dir-test.o $(objects) $(libs) - @echo "[LD] $@" - $(Q)$(CC) -o $@ $^ $(LDFLAGS) $(LIBS) - -quick-test: quick-test.o $(objects) $(libs) - @echo "[LD] $@" - $(Q)$(CC) -o $@ $^ $(LDFLAGS) $(LIBS) - ioctl-test.o: ioctl-test.c ioctl.h kerncompat.h ctree.h @echo "[CC] $@" $(Q)$(CC) $(CFLAGS) -c $< -o $@ @@ -603,7 +595,7 @@ clean: $(CLEANDIRS) image/*.o image/*.o.d \ convert/*.o convert/*.o.d \ mkfs/*.o mkfs/*.o.d check/*.o check/*.o.d \ - dir-test ioctl-test quick-test library-test library-test-static \ + ioctl-test library-test library-test-static \ mktables btrfs.static mkfs.btrfs.static fssum \ $(check_defs) \ $(libs) $(lib_links) \ diff --git a/dir-test.c b/dir-test.c deleted file mode 100644 index cfb77f2a.. --- a/dir-test.c +++ /dev/null @@ -1,518 +0,0 @@ -/* - * Copyright (C) 2007 Oracle. All rights reserved. - * - * This program is free software; you can redistribute it and/or - * modify it under the terms of the GNU General Public - * License v2 as published by the Free Software Foundation. - * - * This program is distributed in the hope that it will be useful, - * but WITHOUT ANY WARRANTY; without even the implied warranty of - * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU - * General Public License for more details. - * - * You should have received a copy of the GNU General Public - * License along with this program; if not, write to the - * Free Software Foundation, Inc., 59 Temple Place - Suite 330, - * Boston, MA 021110-1307, USA. - */ - -#include -#include -#include -#include -#include -#include -#include "kerncompat.h" -#include "radix-tree.h" -#include "ctree.h" -#include "disk-io.h" -#include "print-tree.h" -#include "hash.h" -#include "transaction.h" - -int keep_running = 1; -struct btrfs_super_block super; -static u64 dir_oid = 0; -static u64 file_oid = 33778; - -static int find_num(struct radix_tree_root *root, unsigned long *num_ret, -int exists) -{ - unsigned long num = rand(); - unsigned long res[2]; - int ret; - -again: - ret = radix_tree_gang_lookup(root, (void **)res, num, 2); - if (exists) { - if (ret == 0) - return -1; - num = res[0]; - } else if (ret != 0 && num == res[0]) { - num++; - if (ret > 1 && num == res[1]) { - num++; - goto again; - } - } - *num_ret = num; - return 0; -} - -static void initial_inode_init(struct btrfs_root *root, - struct btrfs_inode_item *inode_item) -{ - memset(inode_item, 0, sizeof(*inode_item)); - btrfs_set_inode_generation(inode_item, root->fs_info->generation); - btrfs_set_inode_mode(inode_item, S_IFREG | 0700); -} - -static int ins_one(struct btrfs_trans_handle *trans, struct btrfs_root *root, - struct radix_tree_root *radix) -{ - int ret; - char buf[128]; - unsigned long oid; - u64 objectid; - struct btrfs_path path; - struct btrfs_key inode_map; - struct btrfs_inode_item inode_item; - - find_num(radix, , 0); - sprintf(buf, "str-%lu", oid); - - ret = btrfs_find_free_objectid(trans, root, dir_oid + 1, ); - if (ret) - goto error; - - inode_map.objectid = objectid; - inode_map.flags = 0; - inode_map.type = BTRFS_INODE_ITEM_KEY; - inode_map.offset = 0; - - initial_inode_init(root, _item); - ret = btrfs_insert_inode(trans, root, objectid, _item); - if (ret) - goto error; - ret = btrfs_insert_dir_item(trans, root, buf, strlen(buf), dir_oid, - _map, BTRFS_FT_UNKNOWN); - if (ret) - goto error; - - radix_tree_preload(GFP_KERNEL); - ret = radix_tree_insert(radix, oid, (void *)oid); - radix_tree_preload_end(); - if (ret) - goto error; - return ret; -error: - if (ret != -EEXIST) - goto fatal; - - /* -* if we got an
Re: [PATCH v2] btrfs: use customized batch size for total_bytes_pinned
On Wed, Jul 11, 2018 at 11:59:36PM +0800, Ethan Lien wrote: > In commit b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly > pinned bytes") we use total_bytes_pinned to track how many bytes we are > going to free in this transaction. When we are close to ENOSPC, we check it > and know if we can make the allocation by commit the current transaction. > For every data/metadata extent we are going to free, we add > total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(), and > release it in unpin_extent_range() when we finish the transaction. So this > is a variable we frequently update but rarely read - just the suitable > use of percpu_counter. But in previous commit we update total_bytes_pinned > by default 32 batch size, making every update essentially a spin lock > protected update. Since every spin lock/unlock operation involves syncing > a globally used variable and some kind of barrier in a SMP system, this is > more expensive than using total_bytes_pinned as a simple atomic64_t. So > fix this by using a customized batch size. Since we only read > total_bytes_pinned when we are close to ENOSPC and fail to alloc new chunk, > we can use a really large batch size and have nearly no penalty in most > cases. > > > [Test] > We test the patch on a 4-cores x86 machine: > 1. falloate a 16GiB size test file. > 2. take snapshot (so all following writes will be cow write). > 3. run a 180 sec, 4 jobs, 4K random write fio on test file. > > We also add a temporary lockdep class on percpu_counter's spin lock used > by total_bytes_pinned to track lock_stat. > > > [Results] > unpatched: > lock_stat version 0.4 > --- > class namecon-bouncescontentions > waittime-min waittime-max waittime-total waittime-avgacq-bounces > acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg > >total_bytes_pinned_percpu:82 82 > 0.21 0.61 29.46 0.36 298340 > 635973 0.09 11.01 173476.25 0.27 > > > patched: > lock_stat version 0.4 > --- > class namecon-bouncescontentions > waittime-min waittime-max waittime-total waittime-avgacq-bounces > acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg > >total_bytes_pinned_percpu: 1 1 > 0.62 0.62 0.62 0.62 13601 >31542 0.14 9.61 11016.90 0.35 > > > [Analysis] > Since the spin lock only protect a single in-memory variable, the > contentions (number of lock acquisitions that had to wait) in both > unpatched and patched version are low. But when we see acquisitions and > acq-bounces, we get much lower counts in patched version. Here the most > important metric is acq-bounces. It means how many times the lock get > transferred between different cpus, so the patch can really recude > cacheline bouncing of spin lock (also the global counter of percpu_counter) > in a SMP system. > > Fixes: b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly > pinned bytes") > > Signed-off-by: Ethan Lien > --- > > V2: > Rewrite commit comments. > Add lock_stat test. > Pull dirty_metadata_bytes out to a separate patch. > > fs/btrfs/ctree.h | 1 + > fs/btrfs/extent-tree.c | 46 -- > 2 files changed, 32 insertions(+), 15 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 118346aceea9..df682a521635 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -422,6 +422,7 @@ struct btrfs_space_info { >* time the transaction commits. >*/ > struct percpu_counter total_bytes_pinned; > + s32 total_bytes_pinned_batch; Can this just be a constant instead of adding it to space_info? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Transaction aborted (error -28) btrfs_run_delayed_refs*0x163/0x190
On 07/12/2018 07:07 PM, Pete wrote: > On 07/12/2018 08:11 AM, Nikolay Borisov wrote: >> >> >> This one shouldn't have gone RO since it has plenty of unallocated and >> free space. What was the workload at the time it went RO? Hard to say, >> it's best if you can provide output with the debug patch applied when >> this issue re-appears. >> > > I'll respond in a separate email to list, as the above is the more > important issue and I need to trawl the logs. Nothing seen, though I recently had the disks go read-only. I'll wait and see what happens. Pete -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Transaction aborted (error -28) btrfs_run_delayed_refs*0x163/0x190
On 07/12/2018 08:11 AM, Nikolay Borisov wrote: > > > This one shouldn't have gone RO since it has plenty of unallocated and > free space. What was the workload at the time it went RO? Hard to say, > it's best if you can provide output with the debug patch applied when > this issue re-appears. > I'll respond in a separate email to list, as the above is the more important issue and I need to trawl the logs. >> >> >> Incidental I've been running out of space on my ssd which contains / and >> /home - which I am sorting. >> >> root@phoenix:~# btrfs fi usage / >> >> Overall: >> >> Device size: 350.00GiB >> >> Device allocated:350.00GiB >> >> Device unallocated:1.00MiB >> >> Device missing: 0.00B >> >> Used:324.74GiB >> >> Free (estimated): 23.28GiB (min: 23.28GiB) >> >> Data ratio: 1.00 >> >> Metadata ratio: 1.00 >> >> Global reserve: 512.00MiB (used: 0.00B) >> > SO this doesn't look healthy, essentially you don't have any unallocated > space on your device. I will suggest you run balance to try and compact > the space in your data groups and hopefully free up some space. As a > first step you can try and run : > > btrfs balance start -dusage=0 -musage=0 / > > This will try and reclaim any non-used block groups i.e it could bring > some unallocated space back. Then you can run 'btrfs fi us / ' to see if > this is the case. Then I'd suggest you run something like: > > 'btrfs balance start -dusage=60 -musage=60 /' this will try and compact > all data/metadata chunks which are less than 60% full. Thank you, appreciated. I have run the above and listed the usage. root@phoenix:~# btrfs balance start -dusage=0 -musage=0 / Done, had to relocate 0 out of 351 chunks root@phoenix:~# btrfs fi usage / Overall: Device size: 350.00GiB Device allocated:349.03GiB Device unallocated: 992.00MiB Device missing: 0.00B Used:324.77GiB Free (estimated): 24.22GiB (min: 24.22GiB) Data ratio: 1.00 Metadata ratio: 1.00 Global reserve: 512.00MiB (used: 0.00B) Data,single: Size:343.00GiB, Used:319.75GiB /dev/disk/by-label/desk-system343.00GiB Metadata,single: Size:6.00GiB, Used:5.02GiB /dev/disk/by-label/desk-system 6.00GiB System,single: Size:32.00MiB, Used:64.00KiB /dev/disk/by-label/desk-system 32.00MiB Unallocated: /dev/disk/by-label/desk-system992.00MiB root@phoenix:~# root@phoenix:~# btrfs balance start -dusage=60 -musage=60 / Done, had to relocate 15 out of 351 chunks root@phoenix:~# btrfs fi us / Overall: Device size: 350.00GiB Device allocated:335.06GiB Device unallocated: 14.94GiB Device missing: 0.00B Used:324.77GiB Free (estimated): 24.22GiB (min: 24.22GiB) Data ratio: 1.00 Metadata ratio: 1.00 Global reserve: 512.00MiB (used: 0.00B) Data,single: Size:329.03GiB, Used:319.75GiB /dev/disk/by-label/desk-system329.03GiB Metadata,single: Size:6.00GiB, Used:5.02GiB /dev/disk/by-label/desk-system 6.00GiB System,single: Size:32.00MiB, Used:64.00KiB /dev/disk/by-label/desk-system 32.00MiB Unallocated: /dev/disk/by-label/desk-system 14.94GiB root@phoenix:~# I believe this is simply coincidental to the error I was getting. I've got to do some housekeeping. I think running out of disk space is a combination of having root on the disk, installing another distro in another subvolume, having losts of old kernels and modules around and snapshotting directores that are likely churning files (.mozilla?). -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] btrfs: use customized batch size for total_bytes_pinned
Nikolay Borisov 於 2018-07-12 15:07 寫到: On 11.07.2018 18:59, Ethan Lien wrote: In commit b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly pinned bytes") we use total_bytes_pinned to track how many bytes we are going to free in this transaction. When we are close to ENOSPC, we check it and know if we can make the allocation by commit the current transaction. For every data/metadata extent we are going to free, we add total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(), and release it in unpin_extent_range() when we finish the transaction. So this is a variable we frequently update but rarely read - just the suitable use of percpu_counter. But in previous commit we update total_bytes_pinned by default 32 batch size, making every update essentially a spin lock protected update. Since every spin lock/unlock operation involves syncing a globally used variable and some kind of barrier in a SMP system, this is more expensive than using total_bytes_pinned as a simple atomic64_t. So fix this by using a customized batch size. Since we only read total_bytes_pinned when we are close to ENOSPC and fail to alloc new chunk, we can use a really large batch size and have nearly no penalty in most cases. [Test] We test the patch on a 4-cores x86 machine: 1. falloate a 16GiB size test file. 2. take snapshot (so all following writes will be cow write). 3. run a 180 sec, 4 jobs, 4K random write fio on test file. We also add a temporary lockdep class on percpu_counter's spin lock used by total_bytes_pinned to track lock_stat. [Results] unpatched: lock_stat version 0.4 --- class namecon-bouncescontentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg total_bytes_pinned_percpu:82 82 0.21 0.61 29.46 0.36 298340 635973 0.09 11.01 173476.25 0.27 patched: lock_stat version 0.4 --- class namecon-bouncescontentions waittime-min waittime-max waittime-total waittime-avg acq-bounces acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg total_bytes_pinned_percpu: 1 1 0.62 0.62 0.62 0.62 13601 31542 0.14 9.61 11016.90 0.35 According to these numbers though, the best-case (waittime-min and the average waittime-avg) have actually regressed. So what really saves you In patched version we only got one contentions count so the min/max/avg waittime all come from that count. I think the min/max/avg waittime here are inconclusive since we have only one sample count so it could be biased. is the fact that the number of time we had to go to the best/average case is reduced, due to the larger batch. I guess in that regard you are in the clear. Another pertinent question is did you observe any significant impact on run times of actual workloads or, say, transaction commit times or anything like that? I think this case will only be hit when the filesystem is struggling to satisfy a metadata allocation has to cycle through all stages . The frequently update of total_bytes_pinned can happen in normal cow write path. In our test, any re-write to a existing 4K data block will trigger cow and we release old data block by btrfs_free_extent() and in that time we add total_bytes_pinned. As the test file is written out, eventually almost every 4K write will trigger a +4K and a -4K update to total_bytes_pinned. But so far we haven't seen significant performance improvement from the test. [Analysis] Since the spin lock only protect a single in-memory variable, the contentions (number of lock acquisitions that had to wait) in both unpatched and patched version are low. But when we see acquisitions and acq-bounces, we get much lower counts in patched version. Here the most important metric is acq-bounces. It means how many times the lock get transferred between different cpus, so the patch can really recude nit: s/recude/reduce - no need to resend i guess david will fix it on the way into the tree cacheline bouncing of spin lock (also the global counter of percpu_counter) in a SMP system. Fixes: b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly pinned bytes") Signed-off-by: Ethan Lien Reviewed-by: Nikolay Borisov --- V2: Rewrite commit comments. Add lock_stat test. Pull dirty_metadata_bytes out to a separate patch. fs/btrfs/ctree.h | 1 + fs/btrfs/extent-tree.c | 46 -- 2 files
Re: [DOC] BTRFS Volume operations, Device Lists and Locks all in one page
On 07/12/2018 08:59 PM, Qu Wenruo wrote: On 2018年07月12日 20:33, Anand Jain wrote: On 07/12/2018 01:43 PM, Qu Wenruo wrote: On 2018年07月11日 15:50, Anand Jain wrote: BTRFS Volume operations, Device Lists and Locks all in one page: Devices are managed in two contexts, the scan context and the mounted context. In scan context the threads originate from the btrfs_control ioctl and in the mounted context the threads originates from the mount point ioctl. Apart from these two context, there also can be two transient state where device state are transitioning from the scan to the mount context or from the mount to the scan context. Device List and Locks:- Count: btrfs_fs_devices::num_devices List : btrfs_fs_devices::devices -> btrfs_devices::dev_list Lock : btrfs_fs_devices::device_list_mutex Count: btrfs_fs_devices::rw_devices So btrfs_fs_devices::num_devices = btrfs_fs_devices::rw_devices + RO devices. How seed and ro devices are different in this case? Given: btrfs_fs_devices::total_devices = btrfs_super_num_devices(disk_super); Consider no missing devices, no replace target, no seeding. Then, btrfs_fs_devices::total_devices == btrfs_fs_devices::num_devices And in case of seeding. btrfs_fs_devices::total_devices == (btrfs_fs_devices::num_devices + btrfs_fs_devices::seed::total_devices All devices in the list [1] are RW/Sprout [1] fs_info::btrfs_fs_devices::devices All devices in the list [2] are RO/Seed to avoid confusion I shall remove RO here All devices in the list [2] are Seed [2] fs_info::btrfs_fs_devices::seed::devices Thanks for asking will add this part to the doc. Another question is, what if a device is RO but not seed? E.g. loopback device set to RO. IMHO it won't be mounted RW for single device case, but not sure for multi device case. RO devices are different from the seed devices. If any one device is RO then FS is mounted in RO. And the btrfs_fs_devices::seed will still be NULL. List : btrfs_fs_devices::alloc_list -> btrfs_devices::dev_alloc_list Lock : btrfs_fs_info::chunk_mutex At least the chunk_mutex is also shared with chunk allocator, Right. or we should have some mutex in btrfs_fs_devices other than fs_info. Right? More locks? no. But some of the locks-and-flags are wrongly belong to fs_info instead it should have been in fs_devices. When the dust settles planning to propose to migrate them to fs_devices. OK, migrating to fs_devices looks good to me then. Lock: set_bit btrfs_fs_info::flags::BTRFS_FS_EXCL_OP FSID List and Lock:- Count : None HEAD : Global::fs_uuids -> btrfs_fs_devices::fs_list Lock : Global::uuid_mutex After the fs_devices is mounted, the btrfs_fs_devices::opened > 0. fs_devices::opended should be btrfs_fs_devices::num_devices if no device is missing and -1 or -2 for degraded case, right? No. I think you are getting confused with btrfs_fs_devices::open_devices btrfs_fs_devices::opened indicate how many times the volume is opened. And in reality it would stay at 1 always. (except for a short duration of time during subsequent subvol mount). Thanks, this makes sense. In the scan context we have the following device operations.. Device SCAN:- which creates the btrfs_fs_devices and its corresponding btrfs_device entries, also checks and frees the duplicate device entries. Lock: uuid_mutex SCAN if (found_duplicate && btrfs_fs_devices::opened == 0) Free_duplicate Unlock: uuid_mutex Device READY:- check if the volume is ready. Also does an implicit scan and duplicate device free as in Device SCAN. Lock: uuid_mutex SCAN if (found_duplicate && btrfs_fs_devices::opened == 0) Free_duplicate Check READY Unlock: uuid_mutex Device FORGET:- (planned) free a given or all unmounted devices and empty fs_devices if any. Lock: uuid_mutex if (found_duplicate && btrfs_fs_devices::opened == 0) Free duplicate Unlock: uuid_mutex Device mount operation -> A Transient state leading to the mounted context Lock: uuid_mutex Find, SCAN, btrfs_fs_devices::opened++ Unlock: uuid_mutex Device umount operation -> A transient state leading to the unmounted context or scan context Lock: uuid_mutex btrfs_fs_devices::opened-- Unlock: uuid_mutex In the mounted context we have the following device operations.. Device Rename through SCAN:- This is a special case where the device path gets renamed after its been mounted. (Ubuntu changes the boot path during boot up so we need this feature). Currently, this is part of Device SCAN as above. And we need the locks as below, because the dynamic disappearing device might cleanup the btrfs_device::name Lock: btrfs_fs_devices::device_list_mutex Rename Unlock: btrfs_fs_devices::device_list_mutex Commit Transaction:- Write All supers. Lock: btrfs_fs_devices::device_list_mutex Write all super of
Re: [DOC] BTRFS Volume operations, Device Lists and Locks all in one page
On 2018年07月12日 20:33, Anand Jain wrote: > > > On 07/12/2018 01:43 PM, Qu Wenruo wrote: >> >> >> On 2018年07月11日 15:50, Anand Jain wrote: >>> >>> >>> BTRFS Volume operations, Device Lists and Locks all in one page: >>> >>> Devices are managed in two contexts, the scan context and the mounted >>> context. In scan context the threads originate from the btrfs_control >>> ioctl and in the mounted context the threads originates from the mount >>> point ioctl. >>> Apart from these two context, there also can be two transient state >>> where device state are transitioning from the scan to the mount context >>> or from the mount to the scan context. >>> >>> Device List and Locks:- >>> >>> Count: btrfs_fs_devices::num_devices >>> List : btrfs_fs_devices::devices -> btrfs_devices::dev_list >>> Lock : btrfs_fs_devices::device_list_mutex >>> >>> Count: btrfs_fs_devices::rw_devices >> >> So btrfs_fs_devices::num_devices = btrfs_fs_devices::rw_devices + RO >> devices. >> How seed and ro devices are different in this case? > > Given: > btrfs_fs_devices::total_devices = btrfs_super_num_devices(disk_super); > > Consider no missing devices, no replace target, no seeding. Then, > btrfs_fs_devices::total_devices == btrfs_fs_devices::num_devices > > And in case of seeding. > btrfs_fs_devices::total_devices == (btrfs_fs_devices::num_devices + > btrfs_fs_devices::seed::total_devices > > All devices in the list [1] are RW/Sprout > [1] fs_info::btrfs_fs_devices::devices > All devices in the list [2] are RO/Seed > [2] fs_info::btrfs_fs_devices::seed::devices > > > Thanks for asking will add this part to the doc. Another question is, what if a device is RO but not seed? E.g. loopback device set to RO. IMHO it won't be mounted RW for single device case, but not sure for multi device case. > > >> >>> List : btrfs_fs_devices::alloc_list -> btrfs_devices::dev_alloc_list >>> Lock : btrfs_fs_info::chunk_mutex >> >> At least the chunk_mutex is also shared with chunk allocator, > > Right. > >> or we >> should have some mutex in btrfs_fs_devices other than fs_info. >> Right? > > More locks? no. But some of the locks-and-flags are wrongly > belong to fs_info instead it should have been in fs_devices. > When the dust settles planning to propose to migrate them > to fs_devices. OK, migrating to fs_devices looks good to me then. > >>> Lock: set_bit btrfs_fs_info::flags::BTRFS_FS_EXCL_OP >>> >>> FSID List and Lock:- >>> >>> Count : None >>> HEAD : Global::fs_uuids -> btrfs_fs_devices::fs_list >>> Lock : Global::uuid_mutex >>> >>> >>> After the fs_devices is mounted, the btrfs_fs_devices::opened > 0. >> >> fs_devices::opended should be btrfs_fs_devices::num_devices if no device >> is missing and -1 or -2 for degraded case, right? > > No. I think you are getting confused with > btrfs_fs_devices::open_devices > > btrfs_fs_devices::opened > indicate how many times the volume is opened. And in reality it would > stay at 1 always. (except for a short duration of time during > subsequent subvol mount). Thanks, this makes sense. > > >>> In the scan context we have the following device operations.. >>> >>> Device SCAN:- which creates the btrfs_fs_devices and its corresponding >>> btrfs_device entries, also checks and frees the duplicate device >>> entries. >>> Lock: uuid_mutex >>> SCAN >>> if (found_duplicate && btrfs_fs_devices::opened == 0) >>> Free_duplicate >>> Unlock: uuid_mutex >>> >>> Device READY:- check if the volume is ready. Also does an implicit scan >>> and duplicate device free as in Device SCAN. >>> Lock: uuid_mutex >>> SCAN >>> if (found_duplicate && btrfs_fs_devices::opened == 0) >>> Free_duplicate >>> Check READY >>> Unlock: uuid_mutex >>> >>> Device FORGET:- (planned) free a given or all unmounted devices and >>> empty fs_devices if any. >>> Lock: uuid_mutex >>> if (found_duplicate && btrfs_fs_devices::opened == 0) >>> Free duplicate >>> Unlock: uuid_mutex >>> >>> Device mount operation -> A Transient state leading to the mounted >>> context >>> Lock: uuid_mutex >>> Find, SCAN, btrfs_fs_devices::opened++ >>> Unlock: uuid_mutex >>> >>> Device umount operation -> A transient state leading to the unmounted >>> context or scan context >>> Lock: uuid_mutex >>> btrfs_fs_devices::opened-- >>> Unlock: uuid_mutex >>> >>> >>> In the mounted context we have the following device operations.. >>> >>> Device Rename through SCAN:- This is a special case where the device >>> path gets renamed after its been mounted. (Ubuntu changes the boot path >>> during boot up so we need this feature). Currently, this is part of >>> Device SCAN as above. And we need the locks as below, because the >>> dynamic disappearing device might cleanup the btrfs_device::name >>> Lock: btrfs_fs_devices::device_list_mutex >>> Rename >>> Unlock: btrfs_fs_devices::device_list_mutex >>> >>> Commit Transaction:-
Re: About hung task on generic/041
On Thu, Jul 12, 2018 at 11:40:54AM +0100, Filipe Manana wrote: >On Wed, Jul 11, 2018 at 10:02 AM, Lu Fengqi wrote: >> Hi, >> >> When I run generic/041 with v4.18-rc3 (turn on kasan and hung task >> detection), btrfs-transaction kthread will trigger the hung task timeout >> (stall at wait_event in btrfs_commit_transaction). At the same time, you >> can see that xfs_io -c fsync will occupy 100% of the CPU. I am not sure >> whether this is a problem. Any suggestion? > >Well, something at 100% cpu and that seems hang forever is definitely >a problem, specially a workload as simple as the one in generic/041 To clarify, the hung task will end within 500s. Without KASAN, it will end within 80s, so it won't trigger hung task timeout 120s. I'm not sure if this is just slow, or have some problem? >(never happened to me, even on vanilla 4.18-rc4). >Do you have the stack trace for the fsync task? What you pasted below I will send the stack trace tomorrow. -- Thanks, Lu >is only for the transaction kthread and that alone doesn't help. > >> >> [Wed Jul 11 15:50:08 2018] INFO: task btrfs-transacti:1053 blocked for more >> than 120 seconds. >> [Wed Jul 11 15:50:08 2018] Not tainted 4.18.0-rc3-custom #14 >> [Wed Jul 11 15:50:08 2018] "echo 0 > >> /proc/sys/kernel/hung_task_timeout_secs" disables this message. >> [Wed Jul 11 15:50:08 2018] btrfs-transacti D0 1053 2 0x8000 >> [Wed Jul 11 15:50:08 2018] Call Trace: >> [Wed Jul 11 15:50:08 2018] ? __schedule+0x5b2/0x1380 >> [Wed Jul 11 15:50:08 2018] ? check_flags.part.23+0x240/0x240 >> [Wed Jul 11 15:50:08 2018] ? firmware_map_remove+0x187/0x187 >> [Wed Jul 11 15:50:08 2018] ? ___preempt_schedule+0x16/0x18 >> [Wed Jul 11 15:50:08 2018] ? mark_held_locks+0x6e/0x90 >> [Wed Jul 11 15:50:08 2018] ? _raw_spin_unlock_irqrestore+0x59/0x70 >> [Wed Jul 11 15:50:08 2018] ? preempt_count_sub+0x14/0xc0 >> [Wed Jul 11 15:50:08 2018] ? _raw_spin_unlock_irqrestore+0x46/0x70 >> [Wed Jul 11 15:50:08 2018] ? prepare_to_wait_event+0x191/0x410 >> [Wed Jul 11 15:50:08 2018] ? prepare_to_wait_exclusive+0x210/0x210 >> [Wed Jul 11 15:50:08 2018] ? print_usage_bug+0x3a0/0x3a0 >> [Wed Jul 11 15:50:08 2018] ? do_raw_spin_unlock+0x10f/0x1e0 >> [Wed Jul 11 15:50:08 2018] ? do_raw_spin_trylock+0x120/0x120 >> [Wed Jul 11 15:50:08 2018] schedule+0xca/0x260 >> [Wed Jul 11 15:50:08 2018] ? rcu_lockdep_current_cpu_online+0x12b/0x160 >> [Wed Jul 11 15:50:08 2018] ? __schedule+0x1380/0x1380 >> [Wed Jul 11 15:50:08 2018] ? ___might_sleep+0x126/0x370 >> [Wed Jul 11 15:50:08 2018] ? init_wait_entry+0xc7/0x100 >> [Wed Jul 11 15:50:08 2018] ? __wake_up_locked_key_bookmark+0x20/0x20 >> [Wed Jul 11 15:50:08 2018] ? __btrfs_run_delayed_items+0x1e5/0x280 [btrfs] >> [Wed Jul 11 15:50:08 2018] ? __might_sleep+0x31/0xd0 >> [Wed Jul 11 15:50:08 2018] btrfs_commit_transaction+0x122a/0x1640 [btrfs] >> [Wed Jul 11 15:50:08 2018] ? btrfs_apply_pending_changes+0x90/0x90 [btrfs] >> [Wed Jul 11 15:50:08 2018] ? wait_woken+0x150/0x150 >> [Wed Jul 11 15:50:08 2018] ? ret_from_fork+0x27/0x50 >> [Wed Jul 11 15:50:08 2018] ? ret_from_fork+0x27/0x50 >> [Wed Jul 11 15:50:08 2018] ? deref_stack_reg+0xe0/0xe0 >> [Wed Jul 11 15:50:08 2018] ? __module_text_address+0x63/0xa0 >> [Wed Jul 11 15:50:08 2018] ? preempt_count_sub+0x14/0xc0 >> [Wed Jul 11 15:50:08 2018] ? transaction_kthread+0x161/0x240 [btrfs] >> [Wed Jul 11 15:50:08 2018] ? is_module_text_address+0x2b/0x50 >> [Wed Jul 11 15:50:08 2018] ? transaction_kthread+0x1d9/0x240 [btrfs] >> [Wed Jul 11 15:50:08 2018] ? kernel_text_address+0x5a/0x100 >> [Wed Jul 11 15:50:08 2018] ? deactivate_slab.isra.27+0x64f/0x7a0 >> [Wed Jul 11 15:50:08 2018] ? __save_stack_trace+0x82/0x100 >> [Wed Jul 11 15:50:08 2018] ? kasan_kmalloc+0x142/0x170 >> [Wed Jul 11 15:50:08 2018] ? kmem_cache_alloc+0xfc/0x2e0 >> [Wed Jul 11 15:50:08 2018] ? start_transaction+0x596/0x930 [btrfs] >> [Wed Jul 11 15:50:08 2018] ? transaction_kthread+0x1d9/0x240 [btrfs] >> [Wed Jul 11 15:50:08 2018] ? kthread+0x1b9/0x1e0 >> [Wed Jul 11 15:50:08 2018] ? ret_from_fork+0x27/0x50 >> [Wed Jul 11 15:50:08 2018] ? deactivate_slab.isra.27+0x64f/0x7a0 >> [Wed Jul 11 15:50:08 2018] ? mark_lock+0x149/0xa80 >> [Wed Jul 11 15:50:08 2018] ? init_object+0x6b/0x80 >> [Wed Jul 11 15:50:08 2018] ? print_usage_bug+0x3a0/0x3a0 >> [Wed Jul 11 15:50:08 2018] ? ___slab_alloc+0x62a/0x690 >> [Wed Jul 11 15:50:08 2018] ? ___slab_alloc+0x62a/0x690 >> [Wed Jul 11 15:50:08 2018] ? __lock_is_held+0x8c/0xe0 >> [Wed Jul 11 15:50:08 2018] ? start_transaction+0x596/0x930 [btrfs] >> [Wed Jul 11 15:50:08 2018] ? preempt_count_sub+0x14/0xc0 >> [Wed Jul 11 15:50:08 2018] ? rcu_lockdep_current_cpu_online+0x12b/0x160 >> [Wed Jul 11 15:50:08 2018] ? rcu_oom_callback+0x40/0x40 >> [Wed Jul 11 15:50:08 2018] ? __lock_is_held+0x8c/0xe0 >> [Wed Jul 11 15:50:08 2018] ? start_transaction+0x596/0x930 [btrfs] >> [Wed Jul 11 15:50:08 2018] ? rcu_read_lock_sched_held+0x8f/0xa0 >> [Wed Jul 11
Re: [DOC] BTRFS Volume operations, Device Lists and Locks all in one page
On 07/12/2018 01:43 PM, Qu Wenruo wrote: On 2018年07月11日 15:50, Anand Jain wrote: BTRFS Volume operations, Device Lists and Locks all in one page: Devices are managed in two contexts, the scan context and the mounted context. In scan context the threads originate from the btrfs_control ioctl and in the mounted context the threads originates from the mount point ioctl. Apart from these two context, there also can be two transient state where device state are transitioning from the scan to the mount context or from the mount to the scan context. Device List and Locks:- Count: btrfs_fs_devices::num_devices List : btrfs_fs_devices::devices -> btrfs_devices::dev_list Lock : btrfs_fs_devices::device_list_mutex Count: btrfs_fs_devices::rw_devices So btrfs_fs_devices::num_devices = btrfs_fs_devices::rw_devices + RO devices. How seed and ro devices are different in this case? Given: btrfs_fs_devices::total_devices = btrfs_super_num_devices(disk_super); Consider no missing devices, no replace target, no seeding. Then, btrfs_fs_devices::total_devices == btrfs_fs_devices::num_devices And in case of seeding. btrfs_fs_devices::total_devices == (btrfs_fs_devices::num_devices + btrfs_fs_devices::seed::total_devices All devices in the list [1] are RW/Sprout [1] fs_info::btrfs_fs_devices::devices All devices in the list [2] are RO/Seed [2] fs_info::btrfs_fs_devices::seed::devices Thanks for asking will add this part to the doc. List : btrfs_fs_devices::alloc_list -> btrfs_devices::dev_alloc_list Lock : btrfs_fs_info::chunk_mutex At least the chunk_mutex is also shared with chunk allocator, Right. or we should have some mutex in btrfs_fs_devices other than fs_info. Right? More locks? no. But some of the locks-and-flags are wrongly belong to fs_info instead it should have been in fs_devices. When the dust settles planning to propose to migrate them to fs_devices. Lock: set_bit btrfs_fs_info::flags::BTRFS_FS_EXCL_OP FSID List and Lock:- Count : None HEAD : Global::fs_uuids -> btrfs_fs_devices::fs_list Lock : Global::uuid_mutex After the fs_devices is mounted, the btrfs_fs_devices::opened > 0. fs_devices::opended should be btrfs_fs_devices::num_devices if no device is missing and -1 or -2 for degraded case, right? No. I think you are getting confused with btrfs_fs_devices::open_devices btrfs_fs_devices::opened indicate how many times the volume is opened. And in reality it would stay at 1 always. (except for a short duration of time during subsequent subvol mount). In the scan context we have the following device operations.. Device SCAN:- which creates the btrfs_fs_devices and its corresponding btrfs_device entries, also checks and frees the duplicate device entries. Lock: uuid_mutex SCAN if (found_duplicate && btrfs_fs_devices::opened == 0) Free_duplicate Unlock: uuid_mutex Device READY:- check if the volume is ready. Also does an implicit scan and duplicate device free as in Device SCAN. Lock: uuid_mutex SCAN if (found_duplicate && btrfs_fs_devices::opened == 0) Free_duplicate Check READY Unlock: uuid_mutex Device FORGET:- (planned) free a given or all unmounted devices and empty fs_devices if any. Lock: uuid_mutex if (found_duplicate && btrfs_fs_devices::opened == 0) Free duplicate Unlock: uuid_mutex Device mount operation -> A Transient state leading to the mounted context Lock: uuid_mutex Find, SCAN, btrfs_fs_devices::opened++ Unlock: uuid_mutex Device umount operation -> A transient state leading to the unmounted context or scan context Lock: uuid_mutex btrfs_fs_devices::opened-- Unlock: uuid_mutex In the mounted context we have the following device operations.. Device Rename through SCAN:- This is a special case where the device path gets renamed after its been mounted. (Ubuntu changes the boot path during boot up so we need this feature). Currently, this is part of Device SCAN as above. And we need the locks as below, because the dynamic disappearing device might cleanup the btrfs_device::name Lock: btrfs_fs_devices::device_list_mutex Rename Unlock: btrfs_fs_devices::device_list_mutex Commit Transaction:- Write All supers. Lock: btrfs_fs_devices::device_list_mutex Write all super of btrfs_devices::dev_list Unlock: btrfs_fs_devices::device_list_mutex Device add:- Add a new device to the existing mounted volume. set_bit: btrfs_fs_info::flags::BTRFS_FS_EXCL_OP Lock: btrfs_fs_devices::device_list_mutex Lock: btrfs_fs_info::chunk_mutex List_add btrfs_devices::dev_list List_add btrfs_devices::dev_alloc_list Unlock: btrfs_fs_info::chunk_mutex Unlock: btrfs_fs_devices::device_list_mutex Device remove:- Remove a device from the mounted volume. set_bit: btrfs_fs_info::flags::BTRFS_FS_EXCL_OP Lock: btrfs_fs_devices::device_list_mutex Lock: btrfs_fs_info::chunk_mutex List_del btrfs_devices::dev_list List_del
[v10.4 2/5] btrfs-progs: dedupe: Add enable command for dedupe command group
From: Qu Wenruo Add enable subcommand for dedupe commmand group. Signed-off-by: Qu Wenruo Signed-off-by: Lu Fengqi --- v10.4: 1. s/btrfs-dedupe/btrfs-dedupe-inband 2. replace strerror(errno) with %m Documentation/btrfs-dedupe-inband.asciidoc | 114 +- btrfs-completion | 6 +- cmds-dedupe-ib.c | 238 + ioctl.h| 2 + 4 files changed, 358 insertions(+), 2 deletions(-) diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc index 83113f5487e2..d895aafbcf45 100644 --- a/Documentation/btrfs-dedupe-inband.asciidoc +++ b/Documentation/btrfs-dedupe-inband.asciidoc @@ -22,7 +22,119 @@ use with caution. SUBCOMMAND -- -Nothing yet +*enable* [options] :: +Enable in-band de-duplication for a filesystem. ++ +`Options` ++ +-f|--force +Force 'enable' command to be exected. +Will skip memory limit check and allow 'enable' to be executed even in-band +de-duplication is already enabled. ++ +NOTE: If re-enable dedupe with '-f' option, any unspecified parameter will be +reset to its default value. + +-s|--storage-backend +Specify de-duplication hash storage backend. +Only 'inmemory' backend is supported yet. +If not specified, default value is 'inmemory'. ++ +Refer to *BACKENDS* sector for more information. + +-b|--blocksize +Specify dedupe block size. +Supported values are power of 2 from '16K' to '8M'. +Default value is '128K'. ++ +Refer to *BLOCKSIZE* sector for more information. + +-a|--hash-algorithm +Specify hash algorithm. +Only 'sha256' is supported yet. + +-l|--limit-hash +Specify maximum number of hashes stored in memory. +Only works for 'inmemory' backend. +Conflicts with '-m' option. ++ +Only positive values are valid. +Default value is '32K'. + +-m|--limit-memory +Specify maximum memory used for hashes. +Only works for 'inmemory' backend. +Conflicts with '-l' option. ++ +Only value larger than or equal to '1024' is valid. +No default value. ++ +NOTE: Memory limit will be rounded down to kernel internal hash size, +so the memory limit shown in 'btrfs dedupe-inband status' may be different +from the . + +WARNING: Too large value for '-l' or '-m' will easily trigger OOM. +Please use with caution according to system memory. + +NOTE: In-band de-duplication is not compactible with compression yet. +And compression has higher priority than in-band de-duplication, means if +compression and de-duplication is enabled at the same time, only compression +will work. + +BACKENDS + +Btrfs in-band de-duplication will support different storage backends, with +different use case and features. + +In-memory backend:: +This backend provides backward-compatibility, and more fine-tuning options. +But hash pool is non-persistent and may exhaust kernel memory if not setup +properly. ++ +This backend can be used on old btrfs(without '-O dedupe' mkfs option). +When used on old btrfs, this backend needs to be enabled manually after mount. ++ +Designed for fast hash search speed, in-memory backend will keep all dedupe +hashes in memory. (Although overall performance is still much the same with +'ondisk' backend if all 'ondisk' hash can be cached in memory) ++ +And only keeps limited number of hash in memory to avoid exhausting memory. +Hashes over the limit will be dropped following Last-Recent-Use behavior. +So this backend has a consistent overhead for given limit but can\'t ensure +all duplicated blocks will be de-duplicated. ++ +After umount and mount, in-memory backend need to refill its hash pool. + +On-disk backend:: +This backend provides persistent hash pool, with more smart memory management +for hash pool. +But it\'s not backward-compatible, meaning it must be used with '-O dedupe' mkfs +option and older kernel can\'t mount it read-write. ++ +Designed for de-duplication rate, hash pool is stored as btrfs B+ tree on disk. +This behavior may cause extra disk IO for hash search under high memory +pressure. ++ +After umount and mount, on-disk backend still has its hash on disk, no need to +refill its dedupe hash pool. + +Currently, only 'inmemory' backend is supported in btrfs-progs. + +DEDUPE BLOCK SIZE + +In-band de-duplication is done at dedupe block size. +Any data smaller than dedupe block size won\'t go through in-band +de-duplication. + +And dedupe block size affects dedupe rate and fragmentation heavily. + +Smaller block size will cause more fragments, but higher dedupe rate. + +Larger block size will cause less fragments, but lower dedupe rate. + +In-band de-duplication rate is highly related to the workload pattern. +So it\'s highly recommended to align dedupe block size to the workload +block size to make full use of de-duplication. EXIT STATUS --- diff --git a/btrfs-completion b/btrfs-completion index ae683f4ecf61..cfdf70966e47 100644 --- a/btrfs-completion +++
[v10.4 1/5] btrfs-progs: Basic framework for dedupe-inband command group
From: Qu Wenruo Add basic ioctl header and command group framework for later use. Alone with basic man page doc. Signed-off-by: Qu Wenruo Signed-off-by: Lu Fengqi --- v10.4: 1. s/btrfs-dedupe/btrfs-dedupe-inband in man page 2. use SZ_* instead of the intermediate number Documentation/Makefile.in | 1 + Documentation/btrfs-dedupe-inband.asciidoc | 40 ++ Documentation/btrfs.asciidoc | 4 +++ Makefile | 3 +- btrfs.c| 2 ++ cmds-dedupe-ib.c | 35 +++ commands.h | 2 ++ dedupe-ib.h| 28 +++ ioctl.h| 36 +++ 9 files changed, 150 insertions(+), 1 deletion(-) create mode 100644 Documentation/btrfs-dedupe-inband.asciidoc create mode 100644 cmds-dedupe-ib.c create mode 100644 dedupe-ib.h diff --git a/Documentation/Makefile.in b/Documentation/Makefile.in index 184647c41940..402155fae001 100644 --- a/Documentation/Makefile.in +++ b/Documentation/Makefile.in @@ -28,6 +28,7 @@ MAN8_TXT += btrfs-qgroup.asciidoc MAN8_TXT += btrfs-replace.asciidoc MAN8_TXT += btrfs-restore.asciidoc MAN8_TXT += btrfs-property.asciidoc +MAN8_TXT += btrfs-dedupe-inband.asciidoc # Category 5 manual page MAN5_TXT += btrfs-man5.asciidoc diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc new file mode 100644 index ..83113f5487e2 --- /dev/null +++ b/Documentation/btrfs-dedupe-inband.asciidoc @@ -0,0 +1,40 @@ +btrfs-dedupe-inband(8) +== + +NAME + +btrfs-dedupe-inband - manage in-band (write time) de-duplication of a btrfs +filesystem + +SYNOPSIS + +*btrfs dedupe-inband* + +DESCRIPTION +--- +*btrfs dedupe-inband* is used to enable/disable or show current in-band de-duplication +status of a btrfs filesystem. + +Kernel support for in-band de-duplication starts from 4.19. + +WARNING: In-band de-duplication is still an experimental feautre of btrfs, +use with caution. + +SUBCOMMAND +-- +Nothing yet + +EXIT STATUS +--- +*btrfs dedupe-inband* returns a zero exit status if it succeeds. Non zero is +returned in case of failure. + +AVAILABILITY + +*btrfs* is part of btrfs-progs. +Please refer to the btrfs wiki http://btrfs.wiki.kernel.org for +further details. + +SEE ALSO + +`mkfs.btrfs`(8), diff --git a/Documentation/btrfs.asciidoc b/Documentation/btrfs.asciidoc index 7316ac094413..1cf5bddec335 100644 --- a/Documentation/btrfs.asciidoc +++ b/Documentation/btrfs.asciidoc @@ -50,6 +50,10 @@ COMMANDS Do off-line check on a btrfs filesystem. + See `btrfs-check`(8) for details. +*dedupe-inband*:: + Control btrfs in-band(write time) de-duplication. + + See `btrfs-dedupe-inband`(8) for details. + *device*:: Manage devices managed by btrfs, including add/delete/scan and so on. + diff --git a/Makefile b/Makefile index 544410e6440c..1ebed7135714 100644 --- a/Makefile +++ b/Makefile @@ -123,7 +123,8 @@ cmds_objects = cmds-subvolume.o cmds-filesystem.o cmds-device.o cmds-scrub.o \ cmds-restore.o cmds-rescue.o chunk-recover.o super-recover.o \ cmds-property.o cmds-fi-usage.o cmds-inspect-dump-tree.o \ cmds-inspect-dump-super.o cmds-inspect-tree-stats.o cmds-fi-du.o \ - mkfs/common.o check/mode-common.o check/mode-lowmem.o + mkfs/common.o check/mode-common.o check/mode-lowmem.o \ + cmds-dedupe-ib.o libbtrfs_objects = send-stream.o send-utils.o kernel-lib/rbtree.o btrfs-list.o \ kernel-lib/crc32c.o messages.o \ uuid-tree.o utils-lib.o rbtree-utils.o diff --git a/btrfs.c b/btrfs.c index 2d39f2ced3e8..2168f5a8bc7f 100644 --- a/btrfs.c +++ b/btrfs.c @@ -255,6 +255,8 @@ static const struct cmd_group btrfs_cmd_group = { { "quota", cmd_quota, NULL, _cmd_group, 0 }, { "qgroup", cmd_qgroup, NULL, _cmd_group, 0 }, { "replace", cmd_replace, NULL, _cmd_group, 0 }, + { "dedupe-inband", cmd_dedupe_ib, NULL, _ib_cmd_group, + 0 }, { "help", cmd_help, cmd_help_usage, NULL, 0 }, { "version", cmd_version, cmd_version_usage, NULL, 0 }, NULL_CMD_STRUCT diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c new file mode 100644 index ..73c923a797da --- /dev/null +++ b/cmds-dedupe-ib.c @@ -0,0 +1,35 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (C) 2017 Fujitsu. All rights reserved. + */ + +#include +#include +#include + +#include "ctree.h" +#include "ioctl.h" + +#include "commands.h" +#include "utils.h" +#include "kerncompat.h" +#include "dedupe-ib.h" + +static const char * const dedupe_ib_cmd_group_usage[] = { +
[v10.4 0/5] In-band de-duplication for btrfs-progs
Patchset can be fetched from github: https://github.com/littleroad/btrfs-progs.git dedupe_latest Inband dedupe(in-memory backend only) ioctl support for btrfs-progs. v7 changes: Update ctree.h to follow kernel structure change Update print-tree to follow kernel structure change V8 changes: Move dedup props and on-disk backend support out of the patchset Change command group name to "dedupe-inband", to avoid confusion with possible out-of-band dedupe. Suggested by Mark. Rebase to latest devel branch. V9 changes: Follow kernels ioctl change to support FORCE flag, new reconf ioctl, and more precious error reporting. v10 changes: Rebase to v4.10. Add BUILD_ASSERT for btrfs_ioctl_dedupe_args v10.1 changes: Rebase to v4.14. v10.2 changes: Rebase to v4.16.1. v10.3 changes: Rebase to v4.17. v10.4 changes: Deal with offline reviews from Misono Tomohiro. 1. s/btrfs-dedupe/btrfs-dedupe-inband 2. Replace strerror(errno) with %m 3. Use SZ_* instead of intermedia number 4. update btrfs-completion for reconfigure subcommand Qu Wenruo (5): btrfs-progs: Basic framework for dedupe-inband command group btrfs-progs: dedupe: Add enable command for dedupe command group btrfs-progs: dedupe: Add disable support for inband dedupelication btrfs-progs: dedupe: Add status subcommand btrfs-progs: dedupe: introduce reconfigure subcommand Documentation/Makefile.in | 1 + Documentation/btrfs-dedupe-inband.asciidoc | 167 Documentation/btrfs.asciidoc | 4 + Makefile | 3 +- btrfs-completion | 6 +- btrfs.c| 2 + cmds-dedupe-ib.c | 437 + commands.h | 2 + dedupe-ib.h| 28 ++ ioctl.h| 38 ++ 10 files changed, 686 insertions(+), 2 deletions(-) create mode 100644 Documentation/btrfs-dedupe-inband.asciidoc create mode 100644 cmds-dedupe-ib.c create mode 100644 dedupe-ib.h -- 2.18.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[v10.4 4/5] btrfs-progs: dedupe: Add status subcommand
From: Qu Wenruo Add status subcommand for dedupe command group. Signed-off-by: Qu Wenruo Signed-off-by: Lu Fengqi --- v10.4: 1. s/btrfs-dedupe/btrfs-dedupe-inband 2. replace strerror(errno) with %m Documentation/btrfs-dedupe-inband.asciidoc | 3 + btrfs-completion | 2 +- cmds-dedupe-ib.c | 80 ++ 3 files changed, 84 insertions(+), 1 deletion(-) diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc index 3452f690e3e5..6096389cb0b4 100644 --- a/Documentation/btrfs-dedupe-inband.asciidoc +++ b/Documentation/btrfs-dedupe-inband.asciidoc @@ -86,6 +86,9 @@ And compression has higher priority than in-band de-duplication, means if compression and de-duplication is enabled at the same time, only compression will work. +*status* :: +Show current in-band de-duplication status of a filesystem. + BACKENDS Btrfs in-band de-duplication will support different storage backends, with diff --git a/btrfs-completion b/btrfs-completion index a74a23f42022..62a7bdd4d0d5 100644 --- a/btrfs-completion +++ b/btrfs-completion @@ -41,7 +41,7 @@ _btrfs() commands_quota='enable disable rescan' commands_qgroup='assign remove create destroy show limit' commands_replace='start status cancel' - commands_dedupe_inband='enable disable' + commands_dedupe_inband='enable disable status' if [[ "$cur" == -* && $cword -le 3 && "$cmd" != "help" ]]; then COMPREPLY=( $( compgen -W '--help' -- "$cur" ) ) diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c index 91b6fe234043..e778457e25a8 100644 --- a/cmds-dedupe-ib.c +++ b/cmds-dedupe-ib.c @@ -298,12 +298,92 @@ out: return 0; } +static const char * const cmd_dedupe_ib_status_usage[] = { + "btrfs dedupe-inband status ", + "Show current in-band(write time) de-duplication status of a btrfs.", + NULL +}; + +static int cmd_dedupe_ib_status(int argc, char **argv) +{ + struct btrfs_ioctl_dedupe_args dargs; + DIR *dirstream; + char *path; + int fd; + int ret; + int print_limit = 1; + + if (check_argc_exact(argc, 2)) + usage(cmd_dedupe_ib_status_usage); + + path = argv[1]; + fd = open_file_or_dir(path, ); + if (fd < 0) { + error("failed to open file or directory: %s", path); + ret = 1; + goto out; + } + memset(, 0, sizeof(dargs)); + dargs.cmd = BTRFS_DEDUPE_CTL_STATUS; + + ret = ioctl(fd, BTRFS_IOC_DEDUPE_CTL, ); + if (ret < 0) { + error("failed to get inband deduplication status: %m"); + ret = 1; + goto out; + } + ret = 0; + if (dargs.status == 0) { + printf("Status: \t\t\tDisabled\n"); + goto out; + } + printf("Status:\t\t\tEnabled\n"); + + if (dargs.hash_algo == BTRFS_DEDUPE_HASH_SHA256) + printf("Hash algorithm:\t\tSHA-256\n"); + else + printf("Hash algorithm:\t\tUnrecognized(%x)\n", + dargs.hash_algo); + + if (dargs.backend == BTRFS_DEDUPE_BACKEND_INMEMORY) { + printf("Backend:\t\tIn-memory\n"); + print_limit = 1; + } else { + printf("Backend:\t\tUnrecognized(%x)\n", + dargs.backend); + } + + printf("Dedup Blocksize:\t%llu\n", dargs.blocksize); + + if (print_limit) { + u64 cur_mem; + + /* Limit nr may be 0 */ + if (dargs.limit_nr) + cur_mem = dargs.current_nr * (dargs.limit_mem / + dargs.limit_nr); + else + cur_mem = 0; + + printf("Number of hash: \t[%llu/%llu]\n", dargs.current_nr, + dargs.limit_nr); + printf("Memory usage: \t\t[%s/%s]\n", + pretty_size(cur_mem), + pretty_size(dargs.limit_mem)); + } +out: + close_file_or_dir(fd, dirstream); + return ret; +} + const struct cmd_group dedupe_ib_cmd_group = { dedupe_ib_cmd_group_usage, dedupe_ib_cmd_group_info, { { "enable", cmd_dedupe_ib_enable, cmd_dedupe_ib_enable_usage, NULL, 0}, { "disable", cmd_dedupe_ib_disable, cmd_dedupe_ib_disable_usage, NULL, 0}, + { "status", cmd_dedupe_ib_status, cmd_dedupe_ib_status_usage, + NULL, 0}, NULL_CMD_STRUCT } }; -- 2.18.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[v10.4 3/5] btrfs-progs: dedupe: Add disable support for inband dedupelication
From: Qu Wenruo Add disable subcommand for dedupe command group. Signed-off-by: Qu Wenruo Signed-off-by: Lu Fengqi --- v10.4: 1. s/btrfs-dedupe/btrfs-dedupe-inband 2. replace strerror(errno) with %m Documentation/btrfs-dedupe-inband.asciidoc | 5 +++ btrfs-completion | 2 +- cmds-dedupe-ib.c | 41 ++ 3 files changed, 47 insertions(+), 1 deletion(-) diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc index d895aafbcf45..3452f690e3e5 100644 --- a/Documentation/btrfs-dedupe-inband.asciidoc +++ b/Documentation/btrfs-dedupe-inband.asciidoc @@ -22,6 +22,11 @@ use with caution. SUBCOMMAND -- +*disable* :: +Disable in-band de-duplication for a filesystem. ++ +This will trash all stored dedupe hash. ++ *enable* [options] :: Enable in-band de-duplication for a filesystem. + diff --git a/btrfs-completion b/btrfs-completion index cfdf70966e47..a74a23f42022 100644 --- a/btrfs-completion +++ b/btrfs-completion @@ -41,7 +41,7 @@ _btrfs() commands_quota='enable disable rescan' commands_qgroup='assign remove create destroy show limit' commands_replace='start status cancel' - commands_dedupe_inband='enable' + commands_dedupe_inband='enable disable' if [[ "$cur" == -* && $cword -le 3 && "$cmd" != "help" ]]; then COMPREPLY=( $( compgen -W '--help' -- "$cur" ) ) diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c index 4d499677d9ae..91b6fe234043 100644 --- a/cmds-dedupe-ib.c +++ b/cmds-dedupe-ib.c @@ -259,10 +259,51 @@ out: return ret; } +static const char * const cmd_dedupe_ib_disable_usage[] = { + "btrfs dedupe-inband disable ", + "Disable in-band(write time) de-duplication of a btrfs.", + NULL +}; + +static int cmd_dedupe_ib_disable(int argc, char **argv) +{ + struct btrfs_ioctl_dedupe_args dargs; + DIR *dirstream; + char *path; + int fd; + int ret; + + if (check_argc_exact(argc, 2)) + usage(cmd_dedupe_ib_disable_usage); + + path = argv[1]; + fd = open_file_or_dir(path, ); + if (fd < 0) { + error("failed to open file or directory: %s", path); + return 1; + } + memset(, 0, sizeof(dargs)); + dargs.cmd = BTRFS_DEDUPE_CTL_DISABLE; + + ret = ioctl(fd, BTRFS_IOC_DEDUPE_CTL, ); + if (ret < 0) { + error("failed to disable inband deduplication: %m"); + ret = 1; + goto out; + } + ret = 0; + +out: + close_file_or_dir(fd, dirstream); + return 0; +} + const struct cmd_group dedupe_ib_cmd_group = { dedupe_ib_cmd_group_usage, dedupe_ib_cmd_group_info, { { "enable", cmd_dedupe_ib_enable, cmd_dedupe_ib_enable_usage, NULL, 0}, + { "disable", cmd_dedupe_ib_disable, cmd_dedupe_ib_disable_usage, + NULL, 0}, NULL_CMD_STRUCT } }; -- 2.18.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[v10.4 5/5] btrfs-progs: dedupe: introduce reconfigure subcommand
From: Qu Wenruo Introduce reconfigure subcommand to co-operate with new kernel ioctl modification. Signed-off-by: Qu Wenruo Signed-off-by: Lu Fengqi --- v10.4: 1. update btrfs-completion for reconfigure 2. replace strerror(errno) with %m Documentation/btrfs-dedupe-inband.asciidoc | 7 +++ btrfs-completion | 2 +- cmds-dedupe-ib.c | 73 +- 3 files changed, 66 insertions(+), 16 deletions(-) diff --git a/Documentation/btrfs-dedupe-inband.asciidoc b/Documentation/btrfs-dedupe-inband.asciidoc index 6096389cb0b4..78c806f772d6 100644 --- a/Documentation/btrfs-dedupe-inband.asciidoc +++ b/Documentation/btrfs-dedupe-inband.asciidoc @@ -86,6 +86,13 @@ And compression has higher priority than in-band de-duplication, means if compression and de-duplication is enabled at the same time, only compression will work. +*reconfigure* [options] :: +Re-configure in-band de-duplication parameters of a filesystem. ++ +In-band de-duplication must be enbaled first before re-configuration. ++ +[Options] are the same with 'btrfs dedupe-inband enable'. + *status* :: Show current in-band de-duplication status of a filesystem. diff --git a/btrfs-completion b/btrfs-completion index 62a7bdd4d0d5..6ff48e4c2f6a 100644 --- a/btrfs-completion +++ b/btrfs-completion @@ -41,7 +41,7 @@ _btrfs() commands_quota='enable disable rescan' commands_qgroup='assign remove create destroy show limit' commands_replace='start status cancel' - commands_dedupe_inband='enable disable status' + commands_dedupe_inband='enable disable status reconfigure' if [[ "$cur" == -* && $cword -le 3 && "$cmd" != "help" ]]; then COMPREPLY=( $( compgen -W '--help' -- "$cur" ) ) diff --git a/cmds-dedupe-ib.c b/cmds-dedupe-ib.c index e778457e25a8..e52f939c9ced 100644 --- a/cmds-dedupe-ib.c +++ b/cmds-dedupe-ib.c @@ -56,7 +56,6 @@ static const char * const cmd_dedupe_ib_enable_usage[] = { NULL }; - #define report_fatal_parameter(dargs, old, member, type, err_val, fmt) \ ({ \ if (dargs->member != old->member && \ @@ -88,6 +87,12 @@ static void report_parameter_error(struct btrfs_ioctl_dedupe_args *dargs, } report_option_parameter(dargs, old, flags, u8, -1, x); } + + if (dargs->status == 0 && old->cmd == BTRFS_DEDUPE_CTL_RECONF) { + error("must enable dedupe before reconfiguration"); + return; + } + if (report_fatal_parameter(dargs, old, cmd, u16, -1, u) || report_fatal_parameter(dargs, old, blocksize, u64, -1, llu) || report_fatal_parameter(dargs, old, backend, u16, -1, u) || @@ -100,14 +105,17 @@ static void report_parameter_error(struct btrfs_ioctl_dedupe_args *dargs, old->limit_nr, old->limit_mem); } -static int cmd_dedupe_ib_enable(int argc, char **argv) +static int enable_reconfig_dedupe(int argc, char **argv, int reconf) { int ret; int fd = -1; char *path; u64 blocksize = BTRFS_DEDUPE_BLOCKSIZE_DEFAULT; + int blocksize_set = 0; u16 hash_algo = BTRFS_DEDUPE_HASH_SHA256; + int hash_algo_set = 0; u16 backend = BTRFS_DEDUPE_BACKEND_INMEMORY; + int backend_set = 0; u64 limit_nr = 0; u64 limit_mem = 0; u64 sys_mem = 0; @@ -134,15 +142,17 @@ static int cmd_dedupe_ib_enable(int argc, char **argv) break; switch (c) { case 's': - if (!strcasecmp("inmemory", optarg)) + if (!strcasecmp("inmemory", optarg)) { backend = BTRFS_DEDUPE_BACKEND_INMEMORY; - else { + backend_set = 1; + } else { error("unsupported dedupe backend: %s", optarg); exit(1); } break; case 'b': blocksize = parse_size(optarg); + blocksize_set = 1; break; case 'a': if (strcmp("sha256", optarg)) { @@ -224,26 +234,40 @@ static int cmd_dedupe_ib_enable(int argc, char **argv) return 1; } memset(, -1, sizeof(dargs)); - dargs.cmd = BTRFS_DEDUPE_CTL_ENABLE; - dargs.blocksize = blocksize; - dargs.hash_algo = hash_algo; - dargs.limit_nr = limit_nr; - dargs.limit_mem = limit_mem; - dargs.backend = backend; - if (force) - dargs.flags |= BTRFS_DEDUPE_FLAG_FORCE; - else - dargs.flags = 0; + if (reconf) { + dargs.cmd = BTRFS_DEDUPE_CTL_RECONF; + if (blocksize_set) +
Re: About hung task on generic/041
On Wed, Jul 11, 2018 at 10:02 AM, Lu Fengqi wrote: > Hi, > > When I run generic/041 with v4.18-rc3 (turn on kasan and hung task > detection), btrfs-transaction kthread will trigger the hung task timeout > (stall at wait_event in btrfs_commit_transaction). At the same time, you > can see that xfs_io -c fsync will occupy 100% of the CPU. I am not sure > whether this is a problem. Any suggestion? Well, something at 100% cpu and that seems hang forever is definitely a problem, specially a workload as simple as the one in generic/041 (never happened to me, even on vanilla 4.18-rc4). Do you have the stack trace for the fsync task? What you pasted below is only for the transaction kthread and that alone doesn't help. > > [Wed Jul 11 15:50:08 2018] INFO: task btrfs-transacti:1053 blocked for more > than 120 seconds. > [Wed Jul 11 15:50:08 2018] Not tainted 4.18.0-rc3-custom #14 > [Wed Jul 11 15:50:08 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" > disables this message. > [Wed Jul 11 15:50:08 2018] btrfs-transacti D0 1053 2 0x8000 > [Wed Jul 11 15:50:08 2018] Call Trace: > [Wed Jul 11 15:50:08 2018] ? __schedule+0x5b2/0x1380 > [Wed Jul 11 15:50:08 2018] ? check_flags.part.23+0x240/0x240 > [Wed Jul 11 15:50:08 2018] ? firmware_map_remove+0x187/0x187 > [Wed Jul 11 15:50:08 2018] ? ___preempt_schedule+0x16/0x18 > [Wed Jul 11 15:50:08 2018] ? mark_held_locks+0x6e/0x90 > [Wed Jul 11 15:50:08 2018] ? _raw_spin_unlock_irqrestore+0x59/0x70 > [Wed Jul 11 15:50:08 2018] ? preempt_count_sub+0x14/0xc0 > [Wed Jul 11 15:50:08 2018] ? _raw_spin_unlock_irqrestore+0x46/0x70 > [Wed Jul 11 15:50:08 2018] ? prepare_to_wait_event+0x191/0x410 > [Wed Jul 11 15:50:08 2018] ? prepare_to_wait_exclusive+0x210/0x210 > [Wed Jul 11 15:50:08 2018] ? print_usage_bug+0x3a0/0x3a0 > [Wed Jul 11 15:50:08 2018] ? do_raw_spin_unlock+0x10f/0x1e0 > [Wed Jul 11 15:50:08 2018] ? do_raw_spin_trylock+0x120/0x120 > [Wed Jul 11 15:50:08 2018] schedule+0xca/0x260 > [Wed Jul 11 15:50:08 2018] ? rcu_lockdep_current_cpu_online+0x12b/0x160 > [Wed Jul 11 15:50:08 2018] ? __schedule+0x1380/0x1380 > [Wed Jul 11 15:50:08 2018] ? ___might_sleep+0x126/0x370 > [Wed Jul 11 15:50:08 2018] ? init_wait_entry+0xc7/0x100 > [Wed Jul 11 15:50:08 2018] ? __wake_up_locked_key_bookmark+0x20/0x20 > [Wed Jul 11 15:50:08 2018] ? __btrfs_run_delayed_items+0x1e5/0x280 [btrfs] > [Wed Jul 11 15:50:08 2018] ? __might_sleep+0x31/0xd0 > [Wed Jul 11 15:50:08 2018] btrfs_commit_transaction+0x122a/0x1640 [btrfs] > [Wed Jul 11 15:50:08 2018] ? btrfs_apply_pending_changes+0x90/0x90 [btrfs] > [Wed Jul 11 15:50:08 2018] ? wait_woken+0x150/0x150 > [Wed Jul 11 15:50:08 2018] ? ret_from_fork+0x27/0x50 > [Wed Jul 11 15:50:08 2018] ? ret_from_fork+0x27/0x50 > [Wed Jul 11 15:50:08 2018] ? deref_stack_reg+0xe0/0xe0 > [Wed Jul 11 15:50:08 2018] ? __module_text_address+0x63/0xa0 > [Wed Jul 11 15:50:08 2018] ? preempt_count_sub+0x14/0xc0 > [Wed Jul 11 15:50:08 2018] ? transaction_kthread+0x161/0x240 [btrfs] > [Wed Jul 11 15:50:08 2018] ? is_module_text_address+0x2b/0x50 > [Wed Jul 11 15:50:08 2018] ? transaction_kthread+0x1d9/0x240 [btrfs] > [Wed Jul 11 15:50:08 2018] ? kernel_text_address+0x5a/0x100 > [Wed Jul 11 15:50:08 2018] ? deactivate_slab.isra.27+0x64f/0x7a0 > [Wed Jul 11 15:50:08 2018] ? __save_stack_trace+0x82/0x100 > [Wed Jul 11 15:50:08 2018] ? kasan_kmalloc+0x142/0x170 > [Wed Jul 11 15:50:08 2018] ? kmem_cache_alloc+0xfc/0x2e0 > [Wed Jul 11 15:50:08 2018] ? start_transaction+0x596/0x930 [btrfs] > [Wed Jul 11 15:50:08 2018] ? transaction_kthread+0x1d9/0x240 [btrfs] > [Wed Jul 11 15:50:08 2018] ? kthread+0x1b9/0x1e0 > [Wed Jul 11 15:50:08 2018] ? ret_from_fork+0x27/0x50 > [Wed Jul 11 15:50:08 2018] ? deactivate_slab.isra.27+0x64f/0x7a0 > [Wed Jul 11 15:50:08 2018] ? mark_lock+0x149/0xa80 > [Wed Jul 11 15:50:08 2018] ? init_object+0x6b/0x80 > [Wed Jul 11 15:50:08 2018] ? print_usage_bug+0x3a0/0x3a0 > [Wed Jul 11 15:50:08 2018] ? ___slab_alloc+0x62a/0x690 > [Wed Jul 11 15:50:08 2018] ? ___slab_alloc+0x62a/0x690 > [Wed Jul 11 15:50:08 2018] ? __lock_is_held+0x8c/0xe0 > [Wed Jul 11 15:50:08 2018] ? start_transaction+0x596/0x930 [btrfs] > [Wed Jul 11 15:50:08 2018] ? preempt_count_sub+0x14/0xc0 > [Wed Jul 11 15:50:08 2018] ? rcu_lockdep_current_cpu_online+0x12b/0x160 > [Wed Jul 11 15:50:08 2018] ? rcu_oom_callback+0x40/0x40 > [Wed Jul 11 15:50:08 2018] ? __lock_is_held+0x8c/0xe0 > [Wed Jul 11 15:50:08 2018] ? start_transaction+0x596/0x930 [btrfs] > [Wed Jul 11 15:50:08 2018] ? rcu_read_lock_sched_held+0x8f/0xa0 > [Wed Jul 11 15:50:08 2018] ? btrfs_record_root_in_trans+0x1f/0xa0 [btrfs] > [Wed Jul 11 15:50:08 2018] ? start_transaction+0x26b/0x930 [btrfs] > [Wed Jul 11 15:50:08 2018] ? btrfs_commit_transaction+0x1640/0x1640 [btrfs] > [Wed Jul 11 15:50:08 2018] ? check_flags.part.23+0x240/0x240 > [Wed Jul 11 15:50:08 2018] ? lock_downgrade+0x380/0x380 > [Wed Jul 11 15:50:08 2018] ?
[PATCH v2] generic: add test for fsync after cloning file range
From: Filipe Manana Test that if we do a buffered write to a file, fsync it, clone a range from another file into our file that overlaps the previously written range, fsync the file again and then power fail, after we mount again the filesystem, no file data was lost or corrupted. This test is motivated by a bug found in btrfs, which is fixed by a patch for the linux kernel titled: "Btrfs: fix file data corruption after cloning a range and fsync" Signed-off-by: Filipe Manana --- V2: Fixed title of referenced btrfs patch. tests/generic/500 | 70 +++ tests/generic/500.out | 5 tests/generic/group | 1 + 3 files changed, 76 insertions(+) create mode 100755 tests/generic/500 create mode 100644 tests/generic/500.out diff --git a/tests/generic/500 b/tests/generic/500 new file mode 100755 index ..b7baca34 --- /dev/null +++ b/tests/generic/500 @@ -0,0 +1,70 @@ +#! /bin/bash +# SPDX-License-Identifier: GPL-2.0 +# Copyright (C) 2018 SUSE Linux Products GmbH. All Rights Reserved. +# +# FS QA Test No. 500 +# +# Test that if we do a buffered write to a file, fsync it, clone a range from +# another file into our file that overlaps the previously written range, fsync +# the file again and then power fail, after we mount again the filesystem, no +# file data was lost or corrupted. +# +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" +tmp=/tmp/$$ +status=1 # failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + _cleanup_flakey + cd / + rm -f $tmp.* +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter +. ./common/reflink +. ./common/dmflakey + +# real QA test starts here +_supported_fs generic +_supported_os Linux +_require_scratch_reflink +_require_dm_target flakey + +rm -f $seqres.full + +_scratch_mkfs >>$seqres.full 2>&1 +_require_metadata_journaling $SCRATCH_DEV +_init_flakey +_mount_flakey + +$XFS_IO_PROG -f -c "pwrite -S 0x18 9000K 6908K" $SCRATCH_MNT/foo >>$seqres.full +$XFS_IO_PROG -f -c "pwrite -S 0x20 2572K 156K" $SCRATCH_MNT/bar >>$seqres.full + +# We clone from file foo into a range of file bar that overlaps the existing +# extent at file bar. The destination offset of the reflink operation matches +# the eof position of file bar minus 4Kb. +$XFS_IO_PROG -c "fsync" \ +-c "reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" \ +-c "fsync" \ +$SCRATCH_MNT/bar >>$seqres.full + +echo "File bar digest before power failure:" +md5sum $SCRATCH_MNT/bar | _filter_scratch + +# Simulate a power failure and mount the filesystem to check that no file data +# was lost or corrupted. +_flakey_drop_and_remount + +echo "File bar digest after power failure:" +md5sum $SCRATCH_MNT/bar | _filter_scratch + +_unmount_flakey +_cleanup_flakey + +status=0 +exit diff --git a/tests/generic/500.out b/tests/generic/500.out new file mode 100644 index ..f590154e --- /dev/null +++ b/tests/generic/500.out @@ -0,0 +1,5 @@ +QA output created by 500 +File bar digest before power failure: +95a95813a8c2abc9aa75a6c2914a077e SCRATCH_MNT/bar +File bar digest after power failure: +95a95813a8c2abc9aa75a6c2914a077e SCRATCH_MNT/bar diff --git a/tests/generic/group b/tests/generic/group index b2a093f4..a84321dd 100644 --- a/tests/generic/group +++ b/tests/generic/group @@ -502,3 +502,4 @@ 497 auto quick swap collapse 498 auto quick log 499 auto quick rw collapse zero +500 auto quick clone log -- 2.11.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] Btrfs: fix file data corruption after cloning a range and fsync
From: Filipe Manana When we clone a range into a file we can end up dropping existing extent maps (or trimming them) and replacing them with new ones if the range to be cloned overlaps with a range in the destination inode. When that happens we add the new extent maps to the list of modified extents in the inode's extent map tree, so that a "fast" fsync (the flag BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps and log corresponding extent items. However, at the end of range cloning operation we do truncate all the pages in the affected range (in order to ensure future reads will not get stale data). Sometimes this truncation will release the corresponding extent maps besides the pages from the page cache. If this happens, then a "fast" fsync operation will miss logging some extent items, because it relies exclusively on the extent maps being present in the inode's extent tree, leading to data loss/corruption if the fsync ends up using the same transaction used by the clone operation (that transaction was not committed in the meanwhile). An extent map is released through the callback btrfs_invalidatepage(), which gets called by truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The later ends up calling try_release_extent_mapping() which will release the extent map if some conditions are met, like the file size being greater than 16Mb, gfp flags allow blocking and the range not being locked (which is the case during the clone operation) nor being the extent map flagged as pinned (also the case for cloning). The following example, turned into a test for fstests, reproduces the issue: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo $ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar $ xfs_io -c "fsync" /mnt/bar # reflink destination offset corresponds to the size of file bar, # 2728Kb minus 4Kb. $ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar $ xfs_io -c "fsync" /mnt/bar $ md5sum /mnt/bar 95a95813a8c2abc9aa75a6c2914a077e /mnt/bar $ mount /dev/sdb /mnt $ md5sum /mnt/bar 207fd8d0b161be8a84b945f0df8d5f8d /mnt/bar # digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the # power failure In the above example, the destination offset of the clone operation corresponds to the size of the "bar" file minus 4Kb. So during the clone operation, the extent map covering the range from 2572Kb to 2728Kb gets trimmed so that it ends at offset 2724Kb, and a new extent map covering the range from 2724Kb to 11724Kb is created. So at the end of the clone operation when we ask to truncate the pages in the range from 2724Kb to 2724Kb + 15908Kb, the page invalidation callback ends up removing the new extent map (through try_release_extent_mapping()) when the page at offset 2724Kb is passed to that callback. Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent map is removed at try_release_extent_mapping(), forcing the next fsync to search for modified extents in the fs/subvolume tree instead of relying on the presence of extent maps in memory. This way we can continue doing a "fast" fsync if the destination range of a clone operation does not overlap with an existing range or if any of the criteria necessary to remove an extent map at try_release_extent_mapping() is not met (file size not bigger then 16Mb or gfp flags do not allow blocking). CC: sta...@vger.kernel.org # 3.16+ Signed-off-by: Filipe Manana --- V2: Added missing "fix" word to subject only. fs/btrfs/extent_io.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index e55843f536bc..b3e45714d28f 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -4238,8 +4238,9 @@ int try_release_extent_mapping(struct page *page, gfp_t mask) struct extent_map *em; u64 start = page_offset(page); u64 end = start + PAGE_SIZE - 1; - struct extent_io_tree *tree = _I(page->mapping->host)->io_tree; - struct extent_map_tree *map = _I(page->mapping->host)->extent_tree; + struct btrfs_inode *btrfs_inode = BTRFS_I(page->mapping->host); + struct extent_io_tree *tree = _inode->io_tree; + struct extent_map_tree *map = _inode->extent_tree; if (gfpflags_allow_blocking(mask) && page->mapping->host->i_size > SZ_16M) { @@ -4262,6 +4263,8 @@ int try_release_extent_mapping(struct page *page, gfp_t mask) extent_map_end(em) - 1, EXTENT_LOCKED | EXTENT_WRITEBACK, 0, NULL)) { + set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, + _inode->runtime_flags); remove_extent_mapping(map, em); /* once for the rb tree */
[PATCH] generic: add test for fsync after cloning file range
From: Filipe Manana Test that if we do a buffered write to a file, fsync it, clone a range from another file into our file that overlaps the previously written range, fsync the file again and then power fail, after we mount again the filesystem, no file data was lost or corrupted. This test is motivated by a bug found in btrfs, which is fixed by a patch for the linux kernel titled: "Btrfs: file data corruption after cloning a range and fsync" Signed-off-by: Filipe Manana --- tests/generic/500 | 70 +++ tests/generic/500.out | 5 tests/generic/group | 1 + 3 files changed, 76 insertions(+) create mode 100755 tests/generic/500 create mode 100644 tests/generic/500.out diff --git a/tests/generic/500 b/tests/generic/500 new file mode 100755 index ..b7baca34 --- /dev/null +++ b/tests/generic/500 @@ -0,0 +1,70 @@ +#! /bin/bash +# SPDX-License-Identifier: GPL-2.0 +# Copyright (C) 2018 SUSE Linux Products GmbH. All Rights Reserved. +# +# FS QA Test No. 500 +# +# Test that if we do a buffered write to a file, fsync it, clone a range from +# another file into our file that overlaps the previously written range, fsync +# the file again and then power fail, after we mount again the filesystem, no +# file data was lost or corrupted. +# +seq=`basename $0` +seqres=$RESULT_DIR/$seq +echo "QA output created by $seq" +tmp=/tmp/$$ +status=1 # failure is the default! +trap "_cleanup; exit \$status" 0 1 2 3 15 + +_cleanup() +{ + _cleanup_flakey + cd / + rm -f $tmp.* +} + +# get standard environment, filters and checks +. ./common/rc +. ./common/filter +. ./common/reflink +. ./common/dmflakey + +# real QA test starts here +_supported_fs generic +_supported_os Linux +_require_scratch_reflink +_require_dm_target flakey + +rm -f $seqres.full + +_scratch_mkfs >>$seqres.full 2>&1 +_require_metadata_journaling $SCRATCH_DEV +_init_flakey +_mount_flakey + +$XFS_IO_PROG -f -c "pwrite -S 0x18 9000K 6908K" $SCRATCH_MNT/foo >>$seqres.full +$XFS_IO_PROG -f -c "pwrite -S 0x20 2572K 156K" $SCRATCH_MNT/bar >>$seqres.full + +# We clone from file foo into a range of file bar that overlaps the existing +# extent at file bar. The destination offset of the reflink operation matches +# the eof position of file bar minus 4Kb. +$XFS_IO_PROG -c "fsync" \ +-c "reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" \ +-c "fsync" \ +$SCRATCH_MNT/bar >>$seqres.full + +echo "File bar digest before power failure:" +md5sum $SCRATCH_MNT/bar | _filter_scratch + +# Simulate a power failure and mount the filesystem to check that no file data +# was lost or corrupted. +_flakey_drop_and_remount + +echo "File bar digest after power failure:" +md5sum $SCRATCH_MNT/bar | _filter_scratch + +_unmount_flakey +_cleanup_flakey + +status=0 +exit diff --git a/tests/generic/500.out b/tests/generic/500.out new file mode 100644 index ..f590154e --- /dev/null +++ b/tests/generic/500.out @@ -0,0 +1,5 @@ +QA output created by 500 +File bar digest before power failure: +95a95813a8c2abc9aa75a6c2914a077e SCRATCH_MNT/bar +File bar digest after power failure: +95a95813a8c2abc9aa75a6c2914a077e SCRATCH_MNT/bar diff --git a/tests/generic/group b/tests/generic/group index b2a093f4..a84321dd 100644 --- a/tests/generic/group +++ b/tests/generic/group @@ -502,3 +502,4 @@ 497 auto quick swap collapse 498 auto quick log 499 auto quick rw collapse zero +500 auto quick clone log -- 2.11.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] Btrfs: file data corruption after cloning a range and fsync
From: Filipe Manana When we clone a range into a file we can end up dropping existing extent maps (or trimming them) and replacing them with new ones if the range to be cloned overlaps with a range in the destination inode. When that happens we add the new extent maps to the list of modified extents in the inode's extent map tree, so that a "fast" fsync (the flag BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps and log corresponding extent items. However, at the end of range cloning operation we do truncate all the pages in the affected range (in order to ensure future reads will not get stale data). Sometimes this truncation will release the corresponding extent maps besides the pages from the page cache. If this happens, then a "fast" fsync operation will miss logging some extent items, because it relies exclusively on the extent maps being present in the inode's extent tree, leading to data loss/corruption if the fsync ends up using the same transaction used by the clone operation (that transaction was not committed in the meanwhile). An extent map is released through the callback btrfs_invalidatepage(), which gets called by truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The later ends up calling try_release_extent_mapping() which will release the extent map if some conditions are met, like the file size being greater than 16Mb, gfp flags allow blocking and the range not being locked (which is the case during the clone operation) nor being the extent map flagged as pinned (also the case for cloning). The following example, turned into a test for fstests, reproduces the issue: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo $ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar $ xfs_io -c "fsync" /mnt/bar # reflink destination offset corresponds to the size of file bar, # 2728Kb minus 4Kb. $ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar $ xfs_io -c "fsync" /mnt/bar $ md5sum /mnt/bar 95a95813a8c2abc9aa75a6c2914a077e /mnt/bar $ mount /dev/sdb /mnt $ md5sum /mnt/bar 207fd8d0b161be8a84b945f0df8d5f8d /mnt/bar # digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the # power failure In the above example, the destination offset of the clone operation corresponds to the size of the "bar" file minus 4Kb. So during the clone operation, the extent map covering the range from 2572Kb to 2728Kb gets trimmed so that it ends at offset 2724Kb, and a new extent map covering the range from 2724Kb to 11724Kb is created. So at the end of the clone operation when we ask to truncate the pages in the range from 2724Kb to 2724Kb + 15908Kb, the page invalidation callback ends up removing the new extent map (through try_release_extent_mapping()) when the page at offset 2724Kb is passed to that callback. Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent map is removed at try_release_extent_mapping(), forcing the next fsync to search for modified extents in the fs/subvolume tree instead of relying on the presence of extent maps in memory. This way we can continue doing a "fast" fsync if the destination range of a clone operation does not overlap with an existing range or if any of the criteria necessary to remove an extent map at try_release_extent_mapping() is not met (file size not bigger then 16Mb or gfp flags do not allow blocking). CC: sta...@vger.kernel.org # 3.16+ Signed-off-by: Filipe Manana --- fs/btrfs/extent_io.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index e55843f536bc..b3e45714d28f 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -4238,8 +4238,9 @@ int try_release_extent_mapping(struct page *page, gfp_t mask) struct extent_map *em; u64 start = page_offset(page); u64 end = start + PAGE_SIZE - 1; - struct extent_io_tree *tree = _I(page->mapping->host)->io_tree; - struct extent_map_tree *map = _I(page->mapping->host)->extent_tree; + struct btrfs_inode *btrfs_inode = BTRFS_I(page->mapping->host); + struct extent_io_tree *tree = _inode->io_tree; + struct extent_map_tree *map = _inode->extent_tree; if (gfpflags_allow_blocking(mask) && page->mapping->host->i_size > SZ_16M) { @@ -4262,6 +4263,8 @@ int try_release_extent_mapping(struct page *page, gfp_t mask) extent_map_end(em) - 1, EXTENT_LOCKED | EXTENT_WRITEBACK, 0, NULL)) { + set_bit(BTRFS_INODE_NEEDS_FULL_SYNC, + _inode->runtime_flags); remove_extent_mapping(map, em); /* once for the rb tree */ free_extent_map(em);
Re: [PATCH 3/3] btrfs: add helper function check device delete able
On 10.07.2018 21:22, Anand Jain wrote: > Move the section of the code which performs the check if the device is > indelible, move that into a helper function. > > Signed-off-by: Anand Jain > --- > fs/btrfs/volumes.c | 49 ++--- > 1 file changed, 30 insertions(+), 19 deletions(-) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index 59a6d8f42c98..feb29c5b44f6 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -1945,6 +1945,33 @@ static inline u64 btrfs_num_devices(struct > btrfs_fs_info *fs_info) > return num_devices; > } > > +static struct btrfs_device *btrfs_device_delete_able( Ugliest name ever! So this function is not really a predicate, rather it's used to fetch the struct btrfs_device * to delete. So a more becoming name would be: btrfs_get_device_for_delete - though this a bit verbose. I guess btrfs_can_delete_device is more suitable if you want to follow this predicate style. At the very least, though, the correct form of the adjective is deletable so it should be btrfs_device_deletable. But as I said this function is not really used as a predicate. > + struct btrfs_fs_info *fs_info, > + const char *device_path, u64 devid) > +{ > + int ret; > + struct btrfs_device *device; > + > + ret = btrfs_check_raid_min_devices(fs_info, > +btrfs_num_devices(fs_info) - 1); > + if (ret) > + return ERR_PTR(ret); > + > + ret = btrfs_find_device_by_devspec(fs_info, devid, device_path, > +); Not really related to this patchset, but I think the whole btrfs_find_device_by_devspec -> btrfs_find_device_missing_or_by_path could be simplified by making those functions return a pointer to btrfs_device rather than an int error value. That way you eliminate the ugly "argument as return value" convention. > + if (ret) > + return ERR_PTR(ret); > + > + if (test_bit(BTRFS_DEV_STATE_REPLACE_TGT, >dev_state)) > + return ERR_PTR(BTRFS_ERROR_DEV_TGT_REPLACE); > + > + if (test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state) && > + fs_info->fs_devices->rw_devices == 1) > + return ERR_PTR(BTRFS_ERROR_DEV_ONLY_WRITABLE); > + > + return device; > +} > + > int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path, > u64 devid) > { > @@ -1958,25 +1985,9 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, > const char *device_path, > > mutex_lock(_mutex); > > - num_devices = btrfs_num_devices(fs_info); > - > - ret = btrfs_check_raid_min_devices(fs_info, num_devices - 1); > - if (ret) > - goto out; > - > - ret = btrfs_find_device_by_devspec(fs_info, devid, device_path, > -); > - if (ret) > - goto out; > - > - if (test_bit(BTRFS_DEV_STATE_REPLACE_TGT, >dev_state)) { > - ret = BTRFS_ERROR_DEV_TGT_REPLACE; > - goto out; > - } > - > - if (test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state) && > - fs_info->fs_devices->rw_devices == 1) { > - ret = BTRFS_ERROR_DEV_ONLY_WRITABLE; > + device = btrfs_device_delete_able(fs_info, device_path, devid); > + if (IS_ERR(device)) { > + ret = PTR_ERR(device); > goto out; > } > > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 2/3] btrfs: add helper btrfs_num_devices() to deduce num_devices
On 10.07.2018 21:22, Anand Jain wrote: > When the replace is running the fs_devices::num_devices also includes > the replace device, however in some operations like device delete and > balance it needs the actual num_devices without the repalce devices, so > now the function btrfs_num_devices() just provides that. > > Signed-off-by: Anand Jain > --- > fs/btrfs/volumes.c | 31 +-- > 1 file changed, 17 insertions(+), 14 deletions(-) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index ce6faeb8bcf8..59a6d8f42c98 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -1931,6 +1931,20 @@ void btrfs_assign_next_active_device(struct > btrfs_fs_info *fs_info, > fs_info->fs_devices->latest_bdev = next_device->bdev; > } > Put a comment above the function saying it returns the number of devices minus the replace device (in case replace is working) otherwise one have to go and 'git blame' it > +static inline u64 btrfs_num_devices(struct btrfs_fs_info *fs_info) > +{ > + u64 num_devices = fs_info->fs_devices->num_devices; > + > + btrfs_dev_replace_read_lock(_info->dev_replace); > + if (btrfs_dev_replace_is_ongoing(_info->dev_replace)) { > + WARN_ON(num_devices < 1); > + num_devices--; > + } > + btrfs_dev_replace_read_unlock(_info->dev_replace); > + > + return num_devices; > +} > + > int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path, > u64 devid) > { > @@ -1944,13 +1958,7 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, > const char *device_path, > > mutex_lock(_mutex); > > - num_devices = fs_devices->num_devices; > - btrfs_dev_replace_read_lock(_info->dev_replace); > - if (btrfs_dev_replace_is_ongoing(_info->dev_replace)) { > - WARN_ON(num_devices < 1); > - num_devices--; > - } > - btrfs_dev_replace_read_unlock(_info->dev_replace); > + num_devices = btrfs_num_devices(fs_info); > > ret = btrfs_check_raid_min_devices(fs_info, num_devices - 1); > if (ret) > @@ -3810,13 +3818,8 @@ int btrfs_balance(struct btrfs_fs_info *fs_info, > } > } > > - num_devices = fs_info->fs_devices->num_devices; > - btrfs_dev_replace_read_lock(_info->dev_replace); > - if (btrfs_dev_replace_is_ongoing(_info->dev_replace)) { > - WARN_ON(num_devices < 1); > - num_devices--; > - } > - btrfs_dev_replace_read_unlock(_info->dev_replace); > + num_devices = btrfs_num_devices(fs_info); > + > allowed = BTRFS_AVAIL_ALLOC_BIT_SINGLE | BTRFS_BLOCK_GROUP_DUP; > if (num_devices > 1) > allowed |= (BTRFS_BLOCK_GROUP_RAID0 | BTRFS_BLOCK_GROUP_RAID1); > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 1/3] btrfs: warn for num_devices below 0
On 10.07.2018 21:22, Anand Jain wrote: > In preparation to de-duplicate a section of code where we deduce the > num_devices, use warn instead of bug. > > Signed-off-by: Anand Jain > --- > fs/btrfs/volumes.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index eb78bb8d1108..ce6faeb8bcf8 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -3813,7 +3813,7 @@ int btrfs_balance(struct btrfs_fs_info *fs_info, > num_devices = fs_info->fs_devices->num_devices; > btrfs_dev_replace_read_lock(_info->dev_replace); > if (btrfs_dev_replace_is_ongoing(_info->dev_replace)) { > - BUG_ON(num_devices < 1); > + WARN_ON(num_devices < 1); Isn't dev_replace_is_ongoing && num_devices < 1 indeed a logical bug situation? Under what condition can it happen that you deem "non critical" ? > num_devices--; > } > btrfs_dev_replace_read_unlock(_info->dev_replace); > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Transaction aborted (error -28) btrfs_run_delayed_refs*0x163/0x190
On 12.07.2018 10:04, Peter Chant wrote: > On 07/12/2018 07:10 AM, Nikolay Borisov wrote: >> >> >> On 10.07.2018 10:04, Pete wrote: >>> I've just had the error in the subject which caused the file system to >>> go read-only. >>> >>> Further part of error message: >>> WARNING: CPU: 14 PID: 1351 at fs/btrfs/extent-tree.c:3076 >>> btrfs_run_delayed_refs*0x163/0x190 >>> >>> 'Screenshot' here: >>> https://drive.google.com/file/d/1qw7TE1bec8BKcmffrOmg2LS15IOq8Jwc/view?usp=sharing >>> >>> The kernel is 4.17.4. There are three hard drives in the file system. >>> dmcrypt (luks) is used between btrfs and the disks. >>> >>> I'm about to run a scrub. On reboot the disks mounted fine. >> >> Show the output of : >> >> btrfs fi usage /path/to/fs >> > > On the hdd system, which recently went ro again: > root@phoenix:~# btrfs fi usage /home_data/ > Overall: > Device size: 9.10TiB > Device allocated: 6.38TiB > Device unallocated:2.71TiB > Device missing: 0.00B > Used: 5.50TiB > Free (estimated): 1.80TiB (min: 1.80TiB) > Data ratio: 2.00 > Metadata ratio: 2.00 > Global reserve: 512.00MiB (used: 0.00B)> > Data,RAID1: Size:3.15TiB, Used:2.71TiB >/dev/mapper/data_disk_1 1.80TiB >/dev/mapper/data_disk_2 1.80TiB >/dev/mapper/data_disk_3 2.70TiB > > Metadata,RAID1: Size:42.00GiB, Used:40.56GiB >/dev/mapper/data_disk_125.00GiB >/dev/mapper/data_disk_226.00GiB >/dev/mapper/data_disk_333.00GiB > > System,RAID1: Size:64.00MiB, Used:480.00KiB >/dev/mapper/data_disk_164.00MiB >/dev/mapper/data_disk_232.00MiB >/dev/mapper/data_disk_332.00MiB > > Unallocated: >/dev/mapper/data_disk_1 926.46GiB >/dev/mapper/data_disk_2 925.49GiB >/dev/mapper/data_disk_3 924.99GiB > > root@phoenix:~# This one shouldn't have gone RO since it has plenty of unallocated and free space. What was the workload at the time it went RO? Hard to say, it's best if you can provide output with the debug patch applied when this issue re-appears. > > > Incidental I've been running out of space on my ssd which contains / and > /home - which I am sorting. > > root@phoenix:~# btrfs fi usage / > > Overall: > > Device size: 350.00GiB > > Device allocated:350.00GiB > > Device unallocated:1.00MiB > > Device missing: 0.00B > > Used:324.74GiB > > Free (estimated): 23.28GiB (min: 23.28GiB) > > Data ratio: 1.00 > > Metadata ratio: 1.00 > > Global reserve: 512.00MiB (used: 0.00B) > SO this doesn't look healthy, essentially you don't have any unallocated space on your device. I will suggest you run balance to try and compact the space in your data groups and hopefully free up some space. As a first step you can try and run : btrfs balance start -dusage=0 -musage=0 / This will try and reclaim any non-used block groups i.e it could bring some unallocated space back. Then you can run 'btrfs fi us / ' to see if this is the case. Then I'd suggest you run something like: 'btrfs balance start -dusage=60 -musage=60 /' this will try and compact all data/metadata chunks which are less than 60% full. > > > Data,single: Size:343.00GiB, Used:319.72GiB > >/dev/disk/by-label/desk-system343.00GiB > > Metadata,single: Size:6.97GiB, Used:5.02GiB >/dev/disk/by-label/desk-system 6.97GiB > > System,single: Size:32.00MiB, Used:64.00KiB >/dev/disk/by-label/desk-system 32.00MiB > > Unallocated: >/dev/disk/by-label/desk-system 1.00MiB > root@phoenix:~# > > Pete > > > > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] btrfs: use customized batch size for total_bytes_pinned
On 11.07.2018 18:59, Ethan Lien wrote: > In commit b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly > pinned bytes") we use total_bytes_pinned to track how many bytes we are > going to free in this transaction. When we are close to ENOSPC, we check it > and know if we can make the allocation by commit the current transaction. > For every data/metadata extent we are going to free, we add > total_bytes_pinned in btrfs_free_extent() and btrfs_free_tree_block(), and > release it in unpin_extent_range() when we finish the transaction. So this > is a variable we frequently update but rarely read - just the suitable > use of percpu_counter. But in previous commit we update total_bytes_pinned > by default 32 batch size, making every update essentially a spin lock > protected update. Since every spin lock/unlock operation involves syncing > a globally used variable and some kind of barrier in a SMP system, this is > more expensive than using total_bytes_pinned as a simple atomic64_t. So > fix this by using a customized batch size. Since we only read > total_bytes_pinned when we are close to ENOSPC and fail to alloc new chunk, > we can use a really large batch size and have nearly no penalty in most > cases. > > > [Test] > We test the patch on a 4-cores x86 machine: > 1. falloate a 16GiB size test file. > 2. take snapshot (so all following writes will be cow write). > 3. run a 180 sec, 4 jobs, 4K random write fio on test file. > > We also add a temporary lockdep class on percpu_counter's spin lock used > by total_bytes_pinned to track lock_stat. > > > [Results] > unpatched: > lock_stat version 0.4 > --- > class namecon-bouncescontentions > waittime-min waittime-max waittime-total waittime-avgacq-bounces > acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg > >total_bytes_pinned_percpu:82 82 > 0.21 0.61 29.46 0.36 298340 > 635973 0.09 11.01 173476.25 0.27 > > > patched: > lock_stat version 0.4 > --- > class namecon-bouncescontentions > waittime-min waittime-max waittime-total waittime-avgacq-bounces > acquisitions holdtime-min holdtime-max holdtime-total holdtime-avg > >total_bytes_pinned_percpu: 1 1 > 0.62 0.62 0.62 0.62 13601 >31542 0.14 9.61 11016.90 0.35 > According to these numbers though, the best-case (waittime-min and the average waittime-avg) have actually regressed. So what really saves you is the fact that the number of time we had to go to the best/average case is reduced, due to the larger batch. I guess in that regard you are in the clear. Another pertinent question is did you observe any significant impact on run times of actual workloads or, say, transaction commit times or anything like that? I think this case will only be hit when the filesystem is struggling to satisfy a metadata allocation has to cycle through all stages . > > [Analysis] > Since the spin lock only protect a single in-memory variable, the > contentions (number of lock acquisitions that had to wait) in both > unpatched and patched version are low. But when we see acquisitions and > acq-bounces, we get much lower counts in patched version. Here the most > important metric is acq-bounces. It means how many times the lock get > transferred between different cpus, so the patch can really recude nit: s/recude/reduce - no need to resend i guess david will fix it on the way into the tree > cacheline bouncing of spin lock (also the global counter of percpu_counter) > in a SMP system. > > Fixes: b150a4f10d878 ("Btrfs: use a percpu to keep track of possibly > pinned bytes") > > Signed-off-by: Ethan Lien Reviewed-by: Nikolay Borisov > --- > > V2: > Rewrite commit comments. > Add lock_stat test. > Pull dirty_metadata_bytes out to a separate patch. > > fs/btrfs/ctree.h | 1 + > fs/btrfs/extent-tree.c | 46 -- > 2 files changed, 32 insertions(+), 15 deletions(-) > > diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h > index 118346aceea9..df682a521635 100644 > --- a/fs/btrfs/ctree.h > +++ b/fs/btrfs/ctree.h > @@ -422,6 +422,7 @@ struct btrfs_space_info { >* time the transaction commits. >*/ > struct percpu_counter total_bytes_pinned; > + s32 total_bytes_pinned_batch; > > struct list_head list; > /* Protected by the spinlock 'lock'. */ > diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c > index 3d9fe58c0080..937113534ef4 100644 > --- a/fs/btrfs/extent-tree.c > +++ b/fs/btrfs/extent-tree.c >
Re: Transaction aborted (error -28) btrfs_run_delayed_refs*0x163/0x190
On 07/12/2018 07:10 AM, Nikolay Borisov wrote: > > > On 10.07.2018 10:04, Pete wrote: >> I've just had the error in the subject which caused the file system to >> go read-only. >> >> Further part of error message: >> WARNING: CPU: 14 PID: 1351 at fs/btrfs/extent-tree.c:3076 >> btrfs_run_delayed_refs*0x163/0x190 >> >> 'Screenshot' here: >> https://drive.google.com/file/d/1qw7TE1bec8BKcmffrOmg2LS15IOq8Jwc/view?usp=sharing >> >> The kernel is 4.17.4. There are three hard drives in the file system. >> dmcrypt (luks) is used between btrfs and the disks. >> >> I'm about to run a scrub. On reboot the disks mounted fine. > > Show the output of : > > btrfs fi usage /path/to/fs > On the hdd system, which recently went ro again: root@phoenix:~# btrfs fi usage /home_data/ Overall: Device size: 9.10TiB Device allocated: 6.38TiB Device unallocated:2.71TiB Device missing: 0.00B Used: 5.50TiB Free (estimated): 1.80TiB (min: 1.80TiB) Data ratio: 2.00 Metadata ratio: 2.00 Global reserve: 512.00MiB (used: 0.00B) Data,RAID1: Size:3.15TiB, Used:2.71TiB /dev/mapper/data_disk_1 1.80TiB /dev/mapper/data_disk_2 1.80TiB /dev/mapper/data_disk_3 2.70TiB Metadata,RAID1: Size:42.00GiB, Used:40.56GiB /dev/mapper/data_disk_125.00GiB /dev/mapper/data_disk_226.00GiB /dev/mapper/data_disk_333.00GiB System,RAID1: Size:64.00MiB, Used:480.00KiB /dev/mapper/data_disk_164.00MiB /dev/mapper/data_disk_232.00MiB /dev/mapper/data_disk_332.00MiB Unallocated: /dev/mapper/data_disk_1 926.46GiB /dev/mapper/data_disk_2 925.49GiB /dev/mapper/data_disk_3 924.99GiB root@phoenix:~# Incidental I've been running out of space on my ssd which contains / and /home - which I am sorting. root@phoenix:~# btrfs fi usage / Overall: Device size: 350.00GiB Device allocated:350.00GiB Device unallocated:1.00MiB Device missing: 0.00B Used:324.74GiB Free (estimated): 23.28GiB (min: 23.28GiB) Data ratio: 1.00 Metadata ratio: 1.00 Global reserve: 512.00MiB (used: 0.00B) Data,single: Size:343.00GiB, Used:319.72GiB /dev/disk/by-label/desk-system343.00GiB Metadata,single: Size:6.97GiB, Used:5.02GiB /dev/disk/by-label/desk-system 6.97GiB System,single: Size:32.00MiB, Used:64.00KiB /dev/disk/by-label/desk-system 32.00MiB Unallocated: /dev/disk/by-label/desk-system 1.00MiB root@phoenix:~# Pete -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: Introduce compile time structure size check
On 2018年07月12日 14:27, Nikolay Borisov wrote: > > > On 12.07.2018 09:19, Qu Wenruo wrote: >> Introduce a new macro based compile time check for ioctl structures. >> >> The new macro is BTRFS_ASSERT_SIZE(), which is mostly copied from >> VMMDEV_ASSERT_SIZE(). >> >> Such check is only added to structure pended to power of 2. >> And exposed one structure, btrfs_ioctl_get_dev_stats() is not aligned >> well. >> The misalign is introduced by commit b27f7c0c150f ("btrfs: join DEV_STATS >> ioctls to one"). > > So what's the negative effect if a structure is not aligned to a power of 2? No real negative effect, until one day one may find the comment is not correct. Thanks, Qu > >> >> Signed-off-by: Qu Wenruo >> --- >> include/uapi/linux/btrfs.h | 18 +- >> 1 file changed, 17 insertions(+), 1 deletion(-) >> >> diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h >> index 5ca1d21fc4a7..be1213c10080 100644 >> --- a/include/uapi/linux/btrfs.h >> +++ b/include/uapi/linux/btrfs.h >> @@ -22,6 +22,14 @@ >> #include >> #include >> >> +/* >> + * Compile time check on structure size, same method as >> + * VMMDEV_ASSERT_SIZE() from linux/vbox_vmmdev_types.h, to trigger a >> negative >> + * array size at compile time if size doesn't match. >> + */ >> +#define BTRFS_ASSERT_SIZE(type, size) \ >> +typedef char type ## _asrt_size[1 - 2 * !!(sizeof(struct type) != >> (size))] >> + >> #define BTRFS_IOCTL_MAGIC 0x94 >> #define BTRFS_VOL_NAME_MAX 255 >> #define BTRFS_LABEL_SIZE 256 >> @@ -32,6 +40,7 @@ struct btrfs_ioctl_vol_args { >> __s64 fd; >> char name[BTRFS_PATH_NAME_MAX + 1]; >> }; >> +BTRFS_ASSERT_SIZE(btrfs_ioctl_vol_args, 4096); >> >> #define BTRFS_DEVICE_PATH_NAME_MAX 1024 >> #define BTRFS_SUBVOL_NAME_MAX 4039 >> @@ -171,6 +180,7 @@ struct btrfs_ioctl_scrub_args { >> /* pad to 1k */ >> __u64 unused[(1024-32-sizeof(struct btrfs_scrub_progress))/8]; >> }; >> +BTRFS_ASSERT_SIZE(btrfs_ioctl_scrub_args, 1024); >> >> #define BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS>> 0 >> #define BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID 1 >> @@ -223,6 +233,7 @@ struct btrfs_ioctl_dev_info_args { >> __u64 unused[379]; /* pad to 4k */ >> __u8 path[BTRFS_DEVICE_PATH_NAME_MAX]; /* out */ >> }; >> +BTRFS_ASSERT_SIZE(btrfs_ioctl_dev_info_args, 4096); >> >> struct btrfs_ioctl_fs_info_args { >> __u64 max_id; /* out */ >> @@ -234,6 +245,7 @@ struct btrfs_ioctl_fs_info_args { >> __u32 reserved32; >> __u64 reserved[122];/* pad to 1k */ >> }; >> +BTRFS_ASSERT_SIZE(btrfs_ioctl_fs_info_args, 1024); >> >> /* >> * feature flags >> @@ -414,6 +426,7 @@ struct btrfs_ioctl_balance_args { >> >> __u64 unused[72]; /* pad to 1k */ >> }; >> +BTRFS_ASSERT_SIZE(btrfs_ioctl_balance_args, 1024); >> >> #define BTRFS_INO_LOOKUP_PATH_MAX 4080 >> struct btrfs_ioctl_ino_lookup_args { >> @@ -421,6 +434,7 @@ struct btrfs_ioctl_ino_lookup_args { >> __u64 objectid; >> char name[BTRFS_INO_LOOKUP_PATH_MAX]; >> }; >> +BTRFS_ASSERT_SIZE(btrfs_ioctl_ino_lookup_args, 4096); >> >> #define BTRFS_INO_LOOKUP_USER_PATH_MAX (4080 - BTRFS_VOL_NAME_MAX - 1) >> struct btrfs_ioctl_ino_lookup_user_args { >> @@ -436,6 +450,7 @@ struct btrfs_ioctl_ino_lookup_user_args { >> */ >> char path[BTRFS_INO_LOOKUP_USER_PATH_MAX]; >> }; >> +BTRFS_ASSERT_SIZE(btrfs_ioctl_ino_lookup_args, 4096); >> >> /* Search criteria for the btrfs SEARCH ioctl family. */ >> struct btrfs_ioctl_search_key { >> @@ -664,8 +679,9 @@ struct btrfs_ioctl_get_dev_stats { >> /* out values: */ >> __u64 values[BTRFS_DEV_STAT_VALUES_MAX]; >> >> -__u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */ >> +__u64 unused[128 - 3 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */ >> }; >> +BTRFS_ASSERT_SIZE(btrfs_ioctl_get_dev_stats, 1024); >> >> #define BTRFS_QUOTA_CTL_ENABLE 1 >> #define BTRFS_QUOTA_CTL_DISABLE 2 >> > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Transaction aborted (error -28) btrfs_run_delayed_refs*0x163/0x190
On 11.07.2018 23:59, Pete wrote: > On 07/10/2018 09:38 AM, Martin Raiber wrote: > >> This is probably a known issue. See >> https://www.spinics.net/lists/linux-btrfs/msg75647.html >> You could apply the patch in this thread and mount with enospc_debug to >> confirm it is the same issue. >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > > OK, I've applied the patch, by hand, and hopefully put it in the right > place. Need to learn to patch better. > > Booted the (rebuilt with make ; make modules_install syslinux etc) > kernel with the option enospc_debug for the two btrfs file systems (1st > entry for each in fstb. I was not expecting to get the issue to appear > quickly as it took several days to hit previously. However, on checking > I see another error, not sure if it is related, still is in extent-tree.c. > > https://drive.google.com/file/d/1K12MfpWFB1aHSXBga1Rym5terbmHeDfI/view?usp=sharing This is just a warn_on saying that btrfs didn't manage to reserve space for a CoW operation on the checksum tree ( so metadata). But did your machine crash at that time, i.e why don't you paste the text but post images? > > > Pete > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 1/2] btrfs: make fs_devices to be a local variable
On 12.07.2018 09:23, Gu Jinxiang wrote: > fs_devices is always passed to btrfs_scan_one_device which > overrides it. And in the call stack below fs_devices is passed to > btrfs_scan_one_device from btrfs_mount_root. > And in btrfs_mount_root the output fs_devices of this call stack > is not used. > btrfs_mount_root > -> btrfs_parse_early_options > ->btrfs_scan_one_device > So, there is no necessary to pass fs_devices from btrfs_mount_root, > use a local variable in btrfs_parse_early_options is enough. > > Signed-off-by: Gu Jinxiang > Reviewed-by: Anand Jain Reviewed-by: Nikolay Borisov > --- > > Changelog: > v4: changed a line warp, and adjusted the order of two rows. > v3: rebase to misc-next. > v2: deal with Nikolay's comment, make changelog more clair. > > fs/btrfs/super.c | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c > index 39d8e39b2fe1..44f58bdb5fa6 100644 > --- a/fs/btrfs/super.c > +++ b/fs/btrfs/super.c > @@ -884,10 +884,11 @@ int btrfs_parse_options(struct btrfs_fs_info *info, > char *options, > * only when we need to allocate a new super block. > */ > static int btrfs_parse_early_options(const char *options, fmode_t flags, > - void *holder, struct btrfs_fs_devices **fs_devices) > + void *holder) > { > substring_t args[MAX_OPT_ARGS]; > char *device_name, *opts, *orig, *p; > + struct btrfs_fs_devices *fs_devices = NULL; > int error = 0; > > if (!options) > @@ -916,7 +917,7 @@ static int btrfs_parse_early_options(const char *options, > fmode_t flags, > goto out; > } > error = btrfs_scan_one_device(device_name, > - flags, holder, fs_devices); > + flags, holder, _devices); > kfree(device_name); > if (error) > goto out; > @@ -1524,8 +1525,7 @@ static struct dentry *btrfs_mount_root(struct > file_system_type *fs_type, > if (!(flags & SB_RDONLY)) > mode |= FMODE_WRITE; > > - error = btrfs_parse_early_options(data, mode, fs_type, > - _devices); > + error = btrfs_parse_early_options(data, mode, fs_type); > if (error) { > return ERR_PTR(error); > } > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 2/2] btrfs: get device pointer from btrfs_scan_one_device
On 12.07.2018 09:23, Gu Jinxiang wrote: > Instead of pointer to btrfs_fs_devices as an arg in > btrfs_scan_one_device, better to make it as a return value. > > And since btrfs_fs_devices can be get by btrfs_device, > better to return btrfs_device than fs_btrfs_devices. > > Signed-off-by: Gu Jinxiang Reviewed-by: Nikolay Borisov > --- > > Changelog: > v4: as suggested by Anand, change return value of > btrfs_scan_one_device from btrfs_fs_devices to btrfs_device. > v3: as comment by robot, use PTR_ERR_OR_ZERO, and rebase to misc-next. > v2: as comment by Nikolay, use ERR_CAST instead of cast type manually. > > fs/btrfs/super.c | 37 - > fs/btrfs/volumes.c | 17 ++--- > fs/btrfs/volumes.h | 4 ++-- > 3 files changed, 32 insertions(+), 26 deletions(-) > > diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c > index 44f58bdb5fa6..e73d547eacff 100644 > --- a/fs/btrfs/super.c > +++ b/fs/btrfs/super.c > @@ -888,7 +888,7 @@ static int btrfs_parse_early_options(const char *options, > fmode_t flags, > { > substring_t args[MAX_OPT_ARGS]; > char *device_name, *opts, *orig, *p; > - struct btrfs_fs_devices *fs_devices = NULL; > + struct btrfs_device *device = NULL; > int error = 0; > > if (!options) > @@ -916,11 +916,13 @@ static int btrfs_parse_early_options(const char > *options, fmode_t flags, > error = -ENOMEM; > goto out; > } > - error = btrfs_scan_one_device(device_name, > - flags, holder, _devices); > + device = btrfs_scan_one_device(device_name, > + flags, holder); > kfree(device_name); > - if (error) > + if (IS_ERR(device)) { > + error = PTR_ERR(device); > goto out; > + } > } > } > > @@ -1516,6 +1518,7 @@ static struct dentry *btrfs_mount_root(struct > file_system_type *fs_type, > { > struct block_device *bdev = NULL; > struct super_block *s; > + struct btrfs_device *device = NULL; > struct btrfs_fs_devices *fs_devices = NULL; > struct btrfs_fs_info *fs_info = NULL; > struct security_mnt_opts new_sec_opts; > @@ -1537,9 +1540,13 @@ static struct dentry *btrfs_mount_root(struct > file_system_type *fs_type, > return ERR_PTR(error); > } > > - error = btrfs_scan_one_device(device_name, mode, fs_type, _devices); > - if (error) > + device = btrfs_scan_one_device(device_name, mode, fs_type); > + if (IS_ERR(device)) { > + error = PTR_ERR(device); > goto error_sec_opts; > + } > + > + fs_devices = device->fs_devices; > > /* >* Setup a dummy root and fs_info for test/set super. This is because > @@ -2220,7 +2227,7 @@ static long btrfs_control_ioctl(struct file *file, > unsigned int cmd, > unsigned long arg) > { > struct btrfs_ioctl_vol_args *vol; > - struct btrfs_fs_devices *fs_devices; > + struct btrfs_device *device = NULL; > int ret = -ENOTTY; > > if (!capable(CAP_SYS_ADMIN)) > @@ -2232,15 +2239,19 @@ static long btrfs_control_ioctl(struct file *file, > unsigned int cmd, > > switch (cmd) { > case BTRFS_IOC_SCAN_DEV: > - ret = btrfs_scan_one_device(vol->name, FMODE_READ, > - _root_fs_type, _devices); > + device = btrfs_scan_one_device(vol->name, FMODE_READ, > + _root_fs_type); > + ret = PTR_ERR_OR_ZERO(device); > break; > case BTRFS_IOC_DEVICES_READY: > - ret = btrfs_scan_one_device(vol->name, FMODE_READ, > - _root_fs_type, _devices); > - if (ret) > + device = btrfs_scan_one_device(vol->name, FMODE_READ, > + _root_fs_type); > + if (IS_ERR(device)) { > + ret = PTR_ERR(device); > break; > - ret = !(fs_devices->num_devices == fs_devices->total_devices); > + } > + ret = !(device->fs_devices->num_devices == > + device->fs_devices->total_devices); > break; > case BTRFS_IOC_GET_SUPPORTED_FEATURES: > ret = btrfs_ioctl_get_supported_features((void __user*)arg); > diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c > index 52abada5c789..3bc479e8cc22 100644 > --- a/fs/btrfs/volumes.c > +++ b/fs/btrfs/volumes.c > @@ -1212,14 +1212,13 @@ static int btrfs_read_disk_super(struct block_device > *bdev, u64 bytenr, > * and we are not allowed to call set_blocksize during the scan. The
[PATCH v4 2/2] btrfs: get device pointer from btrfs_scan_one_device
Instead of pointer to btrfs_fs_devices as an arg in btrfs_scan_one_device, better to make it as a return value. And since btrfs_fs_devices can be get by btrfs_device, better to return btrfs_device than fs_btrfs_devices. Signed-off-by: Gu Jinxiang --- Changelog: v4: as suggested by Anand, change return value of btrfs_scan_one_device from btrfs_fs_devices to btrfs_device. v3: as comment by robot, use PTR_ERR_OR_ZERO, and rebase to misc-next. v2: as comment by Nikolay, use ERR_CAST instead of cast type manually. fs/btrfs/super.c | 37 - fs/btrfs/volumes.c | 17 ++--- fs/btrfs/volumes.h | 4 ++-- 3 files changed, 32 insertions(+), 26 deletions(-) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 44f58bdb5fa6..e73d547eacff 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -888,7 +888,7 @@ static int btrfs_parse_early_options(const char *options, fmode_t flags, { substring_t args[MAX_OPT_ARGS]; char *device_name, *opts, *orig, *p; - struct btrfs_fs_devices *fs_devices = NULL; + struct btrfs_device *device = NULL; int error = 0; if (!options) @@ -916,11 +916,13 @@ static int btrfs_parse_early_options(const char *options, fmode_t flags, error = -ENOMEM; goto out; } - error = btrfs_scan_one_device(device_name, - flags, holder, _devices); + device = btrfs_scan_one_device(device_name, + flags, holder); kfree(device_name); - if (error) + if (IS_ERR(device)) { + error = PTR_ERR(device); goto out; + } } } @@ -1516,6 +1518,7 @@ static struct dentry *btrfs_mount_root(struct file_system_type *fs_type, { struct block_device *bdev = NULL; struct super_block *s; + struct btrfs_device *device = NULL; struct btrfs_fs_devices *fs_devices = NULL; struct btrfs_fs_info *fs_info = NULL; struct security_mnt_opts new_sec_opts; @@ -1537,9 +1540,13 @@ static struct dentry *btrfs_mount_root(struct file_system_type *fs_type, return ERR_PTR(error); } - error = btrfs_scan_one_device(device_name, mode, fs_type, _devices); - if (error) + device = btrfs_scan_one_device(device_name, mode, fs_type); + if (IS_ERR(device)) { + error = PTR_ERR(device); goto error_sec_opts; + } + + fs_devices = device->fs_devices; /* * Setup a dummy root and fs_info for test/set super. This is because @@ -2220,7 +2227,7 @@ static long btrfs_control_ioctl(struct file *file, unsigned int cmd, unsigned long arg) { struct btrfs_ioctl_vol_args *vol; - struct btrfs_fs_devices *fs_devices; + struct btrfs_device *device = NULL; int ret = -ENOTTY; if (!capable(CAP_SYS_ADMIN)) @@ -2232,15 +2239,19 @@ static long btrfs_control_ioctl(struct file *file, unsigned int cmd, switch (cmd) { case BTRFS_IOC_SCAN_DEV: - ret = btrfs_scan_one_device(vol->name, FMODE_READ, - _root_fs_type, _devices); + device = btrfs_scan_one_device(vol->name, FMODE_READ, + _root_fs_type); + ret = PTR_ERR_OR_ZERO(device); break; case BTRFS_IOC_DEVICES_READY: - ret = btrfs_scan_one_device(vol->name, FMODE_READ, - _root_fs_type, _devices); - if (ret) + device = btrfs_scan_one_device(vol->name, FMODE_READ, + _root_fs_type); + if (IS_ERR(device)) { + ret = PTR_ERR(device); break; - ret = !(fs_devices->num_devices == fs_devices->total_devices); + } + ret = !(device->fs_devices->num_devices == + device->fs_devices->total_devices); break; case BTRFS_IOC_GET_SUPPORTED_FEATURES: ret = btrfs_ioctl_get_supported_features((void __user*)arg); diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 52abada5c789..3bc479e8cc22 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -1212,14 +1212,13 @@ static int btrfs_read_disk_super(struct block_device *bdev, u64 bytenr, * and we are not allowed to call set_blocksize during the scan. The superblock * is read via pagecache */ -int btrfs_scan_one_device(const char *path, fmode_t flags, void *holder, - struct btrfs_fs_devices
[PATCH v4 1/2] btrfs: make fs_devices to be a local variable
fs_devices is always passed to btrfs_scan_one_device which overrides it. And in the call stack below fs_devices is passed to btrfs_scan_one_device from btrfs_mount_root. And in btrfs_mount_root the output fs_devices of this call stack is not used. btrfs_mount_root -> btrfs_parse_early_options ->btrfs_scan_one_device So, there is no necessary to pass fs_devices from btrfs_mount_root, use a local variable in btrfs_parse_early_options is enough. Signed-off-by: Gu Jinxiang Reviewed-by: Anand Jain --- Changelog: v4: changed a line warp, and adjusted the order of two rows. v3: rebase to misc-next. v2: deal with Nikolay's comment, make changelog more clair. fs/btrfs/super.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 39d8e39b2fe1..44f58bdb5fa6 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -884,10 +884,11 @@ int btrfs_parse_options(struct btrfs_fs_info *info, char *options, * only when we need to allocate a new super block. */ static int btrfs_parse_early_options(const char *options, fmode_t flags, - void *holder, struct btrfs_fs_devices **fs_devices) + void *holder) { substring_t args[MAX_OPT_ARGS]; char *device_name, *opts, *orig, *p; + struct btrfs_fs_devices *fs_devices = NULL; int error = 0; if (!options) @@ -916,7 +917,7 @@ static int btrfs_parse_early_options(const char *options, fmode_t flags, goto out; } error = btrfs_scan_one_device(device_name, - flags, holder, fs_devices); + flags, holder, _devices); kfree(device_name); if (error) goto out; @@ -1524,8 +1525,7 @@ static struct dentry *btrfs_mount_root(struct file_system_type *fs_type, if (!(flags & SB_RDONLY)) mode |= FMODE_WRITE; - error = btrfs_parse_early_options(data, mode, fs_type, - _devices); + error = btrfs_parse_early_options(data, mode, fs_type); if (error) { return ERR_PTR(error); } -- 2.17.1 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] btrfs: Introduce compile time structure size check
On 12.07.2018 09:19, Qu Wenruo wrote: > Introduce a new macro based compile time check for ioctl structures. > > The new macro is BTRFS_ASSERT_SIZE(), which is mostly copied from > VMMDEV_ASSERT_SIZE(). > > Such check is only added to structure pended to power of 2. > And exposed one structure, btrfs_ioctl_get_dev_stats() is not aligned > well. > The misalign is introduced by commit b27f7c0c150f ("btrfs: join DEV_STATS > ioctls to one"). So what's the negative effect if a structure is not aligned to a power of 2? > > Signed-off-by: Qu Wenruo > --- > include/uapi/linux/btrfs.h | 18 +- > 1 file changed, 17 insertions(+), 1 deletion(-) > > diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h > index 5ca1d21fc4a7..be1213c10080 100644 > --- a/include/uapi/linux/btrfs.h > +++ b/include/uapi/linux/btrfs.h > @@ -22,6 +22,14 @@ > #include > #include > > +/* > + * Compile time check on structure size, same method as > + * VMMDEV_ASSERT_SIZE() from linux/vbox_vmmdev_types.h, to trigger a negative > + * array size at compile time if size doesn't match. > + */ > +#define BTRFS_ASSERT_SIZE(type, size) \ > + typedef char type ## _asrt_size[1 - 2 * !!(sizeof(struct type) != > (size))] > + > #define BTRFS_IOCTL_MAGIC 0x94 > #define BTRFS_VOL_NAME_MAX 255 > #define BTRFS_LABEL_SIZE 256 > @@ -32,6 +40,7 @@ struct btrfs_ioctl_vol_args { > __s64 fd; > char name[BTRFS_PATH_NAME_MAX + 1]; > }; > +BTRFS_ASSERT_SIZE(btrfs_ioctl_vol_args, 4096); > > #define BTRFS_DEVICE_PATH_NAME_MAX 1024 > #define BTRFS_SUBVOL_NAME_MAX4039 > @@ -171,6 +180,7 @@ struct btrfs_ioctl_scrub_args { > /* pad to 1k */ > __u64 unused[(1024-32-sizeof(struct btrfs_scrub_progress))/8]; > }; > +BTRFS_ASSERT_SIZE(btrfs_ioctl_scrub_args, 1024); > > #define BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS 0 > #define BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID 1 > @@ -223,6 +233,7 @@ struct btrfs_ioctl_dev_info_args { > __u64 unused[379]; /* pad to 4k */ > __u8 path[BTRFS_DEVICE_PATH_NAME_MAX]; /* out */ > }; > +BTRFS_ASSERT_SIZE(btrfs_ioctl_dev_info_args, 4096); > > struct btrfs_ioctl_fs_info_args { > __u64 max_id; /* out */ > @@ -234,6 +245,7 @@ struct btrfs_ioctl_fs_info_args { > __u32 reserved32; > __u64 reserved[122];/* pad to 1k */ > }; > +BTRFS_ASSERT_SIZE(btrfs_ioctl_fs_info_args, 1024); > > /* > * feature flags > @@ -414,6 +426,7 @@ struct btrfs_ioctl_balance_args { > > __u64 unused[72]; /* pad to 1k */ > }; > +BTRFS_ASSERT_SIZE(btrfs_ioctl_balance_args, 1024); > > #define BTRFS_INO_LOOKUP_PATH_MAX 4080 > struct btrfs_ioctl_ino_lookup_args { > @@ -421,6 +434,7 @@ struct btrfs_ioctl_ino_lookup_args { > __u64 objectid; > char name[BTRFS_INO_LOOKUP_PATH_MAX]; > }; > +BTRFS_ASSERT_SIZE(btrfs_ioctl_ino_lookup_args, 4096); > > #define BTRFS_INO_LOOKUP_USER_PATH_MAX (4080 - BTRFS_VOL_NAME_MAX - 1) > struct btrfs_ioctl_ino_lookup_user_args { > @@ -436,6 +450,7 @@ struct btrfs_ioctl_ino_lookup_user_args { >*/ > char path[BTRFS_INO_LOOKUP_USER_PATH_MAX]; > }; > +BTRFS_ASSERT_SIZE(btrfs_ioctl_ino_lookup_args, 4096); > > /* Search criteria for the btrfs SEARCH ioctl family. */ > struct btrfs_ioctl_search_key { > @@ -664,8 +679,9 @@ struct btrfs_ioctl_get_dev_stats { > /* out values: */ > __u64 values[BTRFS_DEV_STAT_VALUES_MAX]; > > - __u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */ > + __u64 unused[128 - 3 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */ > }; > +BTRFS_ASSERT_SIZE(btrfs_ioctl_get_dev_stats, 1024); > > #define BTRFS_QUOTA_CTL_ENABLE 1 > #define BTRFS_QUOTA_CTL_DISABLE 2 > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] btrfs: Introduce compile time structure size check
Introduce a new macro based compile time check for ioctl structures. The new macro is BTRFS_ASSERT_SIZE(), which is mostly copied from VMMDEV_ASSERT_SIZE(). Such check is only added to structure pended to power of 2. And exposed one structure, btrfs_ioctl_get_dev_stats() is not aligned well. The misalign is introduced by commit b27f7c0c150f ("btrfs: join DEV_STATS ioctls to one"). Signed-off-by: Qu Wenruo --- include/uapi/linux/btrfs.h | 18 +- 1 file changed, 17 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/btrfs.h b/include/uapi/linux/btrfs.h index 5ca1d21fc4a7..be1213c10080 100644 --- a/include/uapi/linux/btrfs.h +++ b/include/uapi/linux/btrfs.h @@ -22,6 +22,14 @@ #include #include +/* + * Compile time check on structure size, same method as + * VMMDEV_ASSERT_SIZE() from linux/vbox_vmmdev_types.h, to trigger a negative + * array size at compile time if size doesn't match. + */ +#define BTRFS_ASSERT_SIZE(type, size) \ + typedef char type ## _asrt_size[1 - 2 * !!(sizeof(struct type) != (size))] + #define BTRFS_IOCTL_MAGIC 0x94 #define BTRFS_VOL_NAME_MAX 255 #define BTRFS_LABEL_SIZE 256 @@ -32,6 +40,7 @@ struct btrfs_ioctl_vol_args { __s64 fd; char name[BTRFS_PATH_NAME_MAX + 1]; }; +BTRFS_ASSERT_SIZE(btrfs_ioctl_vol_args, 4096); #define BTRFS_DEVICE_PATH_NAME_MAX 1024 #define BTRFS_SUBVOL_NAME_MAX 4039 @@ -171,6 +180,7 @@ struct btrfs_ioctl_scrub_args { /* pad to 1k */ __u64 unused[(1024-32-sizeof(struct btrfs_scrub_progress))/8]; }; +BTRFS_ASSERT_SIZE(btrfs_ioctl_scrub_args, 1024); #define BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_ALWAYS 0 #define BTRFS_IOCTL_DEV_REPLACE_CONT_READING_FROM_SRCDEV_MODE_AVOID1 @@ -223,6 +233,7 @@ struct btrfs_ioctl_dev_info_args { __u64 unused[379]; /* pad to 4k */ __u8 path[BTRFS_DEVICE_PATH_NAME_MAX]; /* out */ }; +BTRFS_ASSERT_SIZE(btrfs_ioctl_dev_info_args, 4096); struct btrfs_ioctl_fs_info_args { __u64 max_id; /* out */ @@ -234,6 +245,7 @@ struct btrfs_ioctl_fs_info_args { __u32 reserved32; __u64 reserved[122];/* pad to 1k */ }; +BTRFS_ASSERT_SIZE(btrfs_ioctl_fs_info_args, 1024); /* * feature flags @@ -414,6 +426,7 @@ struct btrfs_ioctl_balance_args { __u64 unused[72]; /* pad to 1k */ }; +BTRFS_ASSERT_SIZE(btrfs_ioctl_balance_args, 1024); #define BTRFS_INO_LOOKUP_PATH_MAX 4080 struct btrfs_ioctl_ino_lookup_args { @@ -421,6 +434,7 @@ struct btrfs_ioctl_ino_lookup_args { __u64 objectid; char name[BTRFS_INO_LOOKUP_PATH_MAX]; }; +BTRFS_ASSERT_SIZE(btrfs_ioctl_ino_lookup_args, 4096); #define BTRFS_INO_LOOKUP_USER_PATH_MAX (4080 - BTRFS_VOL_NAME_MAX - 1) struct btrfs_ioctl_ino_lookup_user_args { @@ -436,6 +450,7 @@ struct btrfs_ioctl_ino_lookup_user_args { */ char path[BTRFS_INO_LOOKUP_USER_PATH_MAX]; }; +BTRFS_ASSERT_SIZE(btrfs_ioctl_ino_lookup_args, 4096); /* Search criteria for the btrfs SEARCH ioctl family. */ struct btrfs_ioctl_search_key { @@ -664,8 +679,9 @@ struct btrfs_ioctl_get_dev_stats { /* out values: */ __u64 values[BTRFS_DEV_STAT_VALUES_MAX]; - __u64 unused[128 - 2 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */ + __u64 unused[128 - 3 - BTRFS_DEV_STAT_VALUES_MAX]; /* pad to 1k */ }; +BTRFS_ASSERT_SIZE(btrfs_ioctl_get_dev_stats, 1024); #define BTRFS_QUOTA_CTL_ENABLE 1 #define BTRFS_QUOTA_CTL_DISABLE2 -- 2.18.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Transaction aborted (error -28) btrfs_run_delayed_refs*0x163/0x190
On 10.07.2018 10:04, Pete wrote: > I've just had the error in the subject which caused the file system to > go read-only. > > Further part of error message: > WARNING: CPU: 14 PID: 1351 at fs/btrfs/extent-tree.c:3076 > btrfs_run_delayed_refs*0x163/0x190 > > 'Screenshot' here: > https://drive.google.com/file/d/1qw7TE1bec8BKcmffrOmg2LS15IOq8Jwc/view?usp=sharing > > The kernel is 4.17.4. There are three hard drives in the file system. > dmcrypt (luks) is used between btrfs and the disks. > > I'm about to run a scrub. On reboot the disks mounted fine. Show the output of : btrfs fi usage /path/to/fs > > Pete > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html