Re: [RFC v2] Improve subvolume usability for a normal user
And also, how to prevent creation of the snapshots by the user. On 12/11/2017 08:30 AM, ein wrote: > On 12/11/2017 07:38 AM, Misono, Tomohiro wrote: >> - Change the default behavior to allow a user to delete subvolume which is >> empty > From sysadmin point of view I think it's worth considering the following > scenario(s): > what if admin wants one persistent snapshot undeletable by the user? > - snapshots created by the root in user work tree should not be deleted > by the user (snapshot owner should be root?), but we may want also > permissions, filesystem ACLs and extend ACLs consistency > - snapshots with chattr +i should be not deleted by the user, even if he > created it. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC v2] Improve subvolume usability for a normal user
On 12/11/2017 07:38 AM, Misono, Tomohiro wrote: > - Change the default behavior to allow a user to delete subvolume which is > empty >From sysadmin point of view I think it's worth considering the following scenario(s): what if admin wants one persistent snapshot undeletable by the user? - snapshots created by the root in user work tree should not be deleted by the user (snapshot owner should be root?), but we may want also permissions, filesystem ACLs and extend ACLs consistency - snapshots with chattr +i should be not deleted by the user, even if he created it. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[RFC v2] Improve subvolume usability for a normal user
Hello all, I reflected the comments of the first version of the RFC[1]. Thanks for all those who commented. The summary of updated proposal is: - Change the default behavior to allow a user to delete subvolume which is empty - Add 2 new non-root ioctls to get subvolume/quota info under the specified path Please see the 'Proposal' section below for the detail. More comments are welcome. Regards, Tomohiro Misono [1] https://www.mail-archive.com/linux-btrfs@vger.kernel.org/msg70666.html == - Goal and current problem The goal of this RFC is to give a normal user more control to their own subvolumes. Currently the control to subvolumes for a normal user is restricted as below: +-+--+--+ | command | root | user | +-+--+--+ | sub create | Y| Y| | sub snap| Y| Y| | sub del | Y| N| | sub list| Y| N| | sub show| Y| N| | qgroup show | Y| N| +-+--+--+ In short, I want to change this as below in order to improve user's usability: +-+--++ | command | root | user | +-+--++ | sub create | Y| Y | | sub snap| Y| Y | | sub del | Y| N -> Y | | sub list| Y| N -> Y | | sub show| Y| N -> Y | | qgroup show | Y| N -> Y | +-+--++ In words, (1) allow deletion of subvolume (if it is empty) and (2) allow getting subvolume/quota info (under the specified path) I think other commands not listed above (qgroup limit, send/receive etc.) should be done by root and not be allowed for a normal user. - Proposal (1) deletion of subvolume Change the default behavior for a user to allow to delete a subvolume (by "subvol del") if 1. the user has write+exec right to it, and 2. it is empty So, it is the same as user_subvol_rm_allowed option with emptiness check. Emptiness check is needed because Snapshot creation wont' check the permission and can copy a dir which cannot be deleted by the user, and therefore just allowing deletion may cause data loss. Summary of behavior by different condition is as follows: +===++==+ | |Current | Proposal | +===++==+ | root | Can delete all | Same as the current | +---++--+ | user (user_subvol_rm_allowed) | Can delete if he | Same as the current | | | has write+exec right | | +---++--+ | user (default)| Cannot delete anything | Can delete if he | | || has write+exec right | | || and is empty | +---++--+ (2) getting subvolume/quota info Introduce 2 new ioctls to get subol/quota info under the specified path (which needs to be able to be opened by the user) and modify INO_LOOKUP to check permission during path construction for a normal user. Current approach cannot be used directly for a normal user as explained below: TREE_SEARCH ioctl is used to retrieve the subvolume/quota info by btrfs-progs (sub show/list, qgroup show etc.). This requires the root permission. Also, in order to construct the path, INO_LOOKUP will be called afterwards, which also requires root permission and omits the permission check during path construction. The easiest way to allow a user to get subvolume/quota info is just relaxing the permission of TREE_SEARCH. However, since all the tree information (inc. file name) will be exposed, this poses a sequrity risk and is not acceptable. The detail of new ioctls and approach is here: [subvolume info] Searching ROOT tree for ROOT_ITEM/ROOT_BACKREF under the specified path, and checking its read right by searching FS/FILE tree and comparing the mode with caller's uid. After this ioctl is called, btrfs-progs calls modified INO_LOOKUP to construct the path with permission check. In case path construction fails due to permission, btrfs-progs skips to output the infomation of the subvolume. [quota info] Same as above, but mainly searching QUOTA tree. - Summary of Proposal - Change the default behavior to allow a user to delete subvolume which is empty - Add 2 new non-root ioctls to get subvolume/quota info under the specified path == -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs"
Re: defragmenting best practice?
2017-12-11 8:18 GMT+03:00 Dave: > On Tue, Oct 31, 2017 someone wrote: >> >> >> > 2. Put $HOME/.cache on a separate BTRFS subvolume that is mounted >> > nocow -- it will NOT be snapshotted > > I did exactly this. It servers the purpose of avoiding snapshots. > However, today I saw the following at > https://wiki.archlinux.org/index.php/Btrfs > > Note: From Btrfs Wiki Mount options: within a single file system, it > is not possible to mount some subvolumes with nodatacow and others > with datacow. The mount option of the first mounted subvolume applies > to any other subvolumes. > > That makes me think my nodatacow mount option on $HOME/.cache is not > effective. True? > > (My subjective performance results have not been as good as hoped for > with the tweaks I have tried so far.) > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html True, for magic dirs, that you may want mark as no cow, you need to use chattr, like: rm -rf ~/.cache mkdir ~/.cache chattr +C ~/.cache -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: defragmenting best practice?
On Tue, Oct 31, 2017 someone wrote: > > > > 2. Put $HOME/.cache on a separate BTRFS subvolume that is mounted > > nocow -- it will NOT be snapshotted I did exactly this. It servers the purpose of avoiding snapshots. However, today I saw the following at https://wiki.archlinux.org/index.php/Btrfs Note: From Btrfs Wiki Mount options: within a single file system, it is not possible to mount some subvolumes with nodatacow and others with datacow. The mount option of the first mounted subvolume applies to any other subvolumes. That makes me think my nodatacow mount option on $HOME/.cache is not effective. True? (My subjective performance results have not been as good as hoped for with the tweaks I have tried so far.) -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Mon, Dec 11, 2017 at 10:57:45AM +1100, Dave Chinner wrote: > i.e. the fact the cmpxchg failed may not have anything to do with a > race condtion - it failed because the slot wasn't empty like we > expected it to be. There can be any number of reasons the slot isn't > empty - the API should not "document" that the reason the insert > failed was a race condition. It should document the case that we > "couldn't insert because there was an existing entry in the slot". > Let the surrounding code document the reason why that might have > happened - it's not for the API to assume reasons for failure. > > i.e. this API and potential internal implementation makes much > more sense: > > int > xa_store_iff_empty(...) > { > curr = xa_cmpxchg(>pag_ici_xa, agino, NULL, ip, GFP_NOFS); > if (!curr) > return 0; /* success! */ > if (!IS_ERR(curr)) > return -EEXIST; /* failed - slot not empty */ > return PTR_ERR(curr); /* failed - XA internal issue */ > } > > as it replaces the existing preload and insert code in the XFS code > paths whilst letting us handle and document the "insert failed > because slot not empty" case however we want. It implies nothing > about *why* the slot wasn't empty, just that we couldn't do the > insert because it wasn't empty. Yeah, OK. So, over the top of the recent changes I'm looking at this: I'm not in love with xa_store_empty() as a name. I almost want xa_store_weak(), but after my MAP_FIXED_WEAK proposed name got shot down, I'm leery of it. "empty" is at least a concept we already have in the API with the comment for xa_init() talking about an empty array and xa_empty(). I also considered xa_store_null and xa_overwrite_null and xa_replace_null(). Naming remains hard. diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c index 941f38bb94a4..586b43836905 100644 --- a/fs/xfs/xfs_icache.c +++ b/fs/xfs/xfs_icache.c @@ -451,7 +451,7 @@ xfs_iget_cache_miss( int flags, int lock_flags) { - struct xfs_inode*ip, *curr; + struct xfs_inode*ip; int error; xfs_agino_t agino = XFS_INO_TO_AGINO(mp, ino); int iflags; @@ -498,8 +498,7 @@ xfs_iget_cache_miss( xfs_iflags_set(ip, iflags); /* insert the new inode */ - curr = xa_cmpxchg(>pag_ici_xa, agino, NULL, ip, GFP_NOFS); - error = __xa_race(curr, -EAGAIN); + error = xa_store_empty(>pag_ici_xa, agino, ip, GFP_NOFS, -EAGAIN); if (error) goto out_unlock; diff --git a/include/linux/xarray.h b/include/linux/xarray.h index 5792b6dbb040..cc7cc5253a67 100644 --- a/include/linux/xarray.h +++ b/include/linux/xarray.h @@ -271,43 +271,30 @@ static inline int xa_err(void *entry) } /** - * __xa_race() - Turn a cmpxchg result into an errno. - * @entry: Result from calling an XArray function. - * @errno: Error number to return if we lost the race. + * xa_store_empty() - Store this entry in the XArray unless another entry is + * already present. + * @xa: XArray. + * @index: Index into array. + * @entry: New entry. + * @gfp: Memory allocation flags. + * @rc: Number to return if another entry was present. * - * Like xa_race(), but returns the error number of your choice. Calling - * __xa_race(entry, 0) has the same result (but is less efficient) as - * calling xa_err(). + * Like xa_store(), but will fail and return the supplied error number if + * the existing entry at @index is not %NULL. * * Return: A negative errno or 0. */ -static inline int __xa_race(void *entry, int errno) +static inline int xa_store_empty(struct xarray *xa, unsigned long index, + void *entry, gfp_t gfp, int errno) { - if (!entry) + void *curr = xa_cmpxchg(xa, index, NULL, entry, gfp); + if (!curr) return 0; - if (xa_is_err(entry)) - return (long)entry >> 2; + if (xa_is_err(curr)) + return xa_err(curr); return errno; } -/** - * xa_race() - Turn a cmpxchg result into an errno. - * @entry: Result from calling an XArray function. - * - * It is common to use xa_cmpxchg() to ensure that only one thread assigns - * a value to an index. Pass the result from xa_cmpxchg() to xa_race() to - * get an errno back. This function also handles any other error which - * may have been returned by xa_cmpxchg() such as ENOMEM. - * - * If you don't care that you lost the race, you can use xa_err() instead. - * - * Return: A negative errno or 0. - */ -static inline int xa_race(void *entry) -{ - return __xa_race(entry, -EEXIST); -} - /* Everything below here is the Advanced API. Proceed with caution. */ #define xa_trylock(xa) spin_trylock(&(xa)->xa_lock) diff --git a/mm/backing-dev.c b/mm/backing-dev.c index 85d1bc963ab6..87ed55af823e 100644 --- a/mm/backing-dev.c +++
Ref doesn't match the record start and is compressed
Hello! Since my system was temporarily overloaded (load 500+ for probably several hours but it recovered, no reboot needed), I'm seeing btrfs-related backtraces in dmesg with the fs going RO. I tried to fix it but it fails: > Fixed 0 roots. > checking extents > ref mismatch on [3043919785984 57344] extent item 1, found 3 > Data backref 3043919785984 root 265 owner 83535 offset 131072 num_refs 0 not > found in extent tree > Incorrect local backref count on 3043919785984 root 265 owner 83535 offset > 131072 found 1 wanted 0 back 0x61854760 > Backref disk bytenr does not match extent record, bytenr=3043919785984, ref > bytenr=3043919790080 > Backref bytes do not match extent backref, bytenr=3043919785984, ref > bytes=57344, backref bytes=40960 > Data backref 3043919785984 root 259 owner 11522804 offset 0 num_refs 0 not > found in extent tree > Incorrect local backref count on 3043919785984 root 259 owner 11522804 offset > 0 found 1 wanted 0 back 0x57892430 > Backref disk bytenr does not match extent record, bytenr=3043919785984, ref > bytenr=3043919831040 > Backref bytes do not match extent backref, bytenr=3043919785984, ref > bytes=57344, backref bytes=8192 > backpointer mismatch on [3043919785984 57344] > attempting to repair backref discrepency for bytenr 3043919785984 > Ref doesn't match the record start and is compressed, please take a > btrfs-image of this file system and send it to a btrfs developer so they can > complete this functionality for bytenr 3043919790080 > failed to repair damaged filesystem, aborting > enabling repair mode > Checking filesystem on /dev/bcacache48 8 UUID: > bc201ce5-8f2b-4263-995a-6641e89d4c88 When I identify the files with "btrfs inspect-internal" and try to delete them, btrfs fails and goes into RO mode: [ 735.877040] [ cut here ] [ 735.877048] WARNING: CPU: 2 PID: 809 at fs/btrfs/extent-tree.c:7053 __btrfs_free_extent.isra.71+0x740/0xbc0 [ 735.877049] Modules linked in: uas usb_storage r8168(O) nvidia_drm(PO) vboxpci(O) vboxnetadp(O) vboxnetflt(O) vboxdrv(O) nvidia_uvm(PO) nvidia_modeset(PO) nvidia(PO) nct6775 hwmon_vid coretemp hwmon efivarfs [ 735.877065] CPU: 2 PID: 809 Comm: umount Tainted: P O 4.14.0-pf4 #1 [ 735.877066] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z68 Pro3, BIOS L2.16A 02/22/2013 [ 735.877067] task: 88040b6cf080 task.stack: c900035b [ 735.877070] RIP: 0010:__btrfs_free_extent.isra.71+0x740/0xbc0 [ 735.877071] RSP: 0018:c900035b3c00 EFLAGS: 00010246 [ 735.877073] RAX: fffe RBX: 02c4b7c2 RCX: c900035b3bb8 [ 735.877074] RDX: 02c4b7c2d000 RSI: 88021c22f607 RDI: c900035b3bb8 [ 735.877075] RBP: fffe R08: R09: 0001464f [ 735.877076] R10: 0002 R11: 076d R12: 880414618000 [ 735.877077] R13: 880406b3d850 R14: R15: 0109
Re: ERROR: failed to repair root items: Input/output error
Tomasz Pala posted on Sun, 10 Dec 2017 16:38:05 +0100 as excerpted: > On Sun, Dec 10, 2017 at 15:18:32 +, constantine wrote: > >> I have a laptop root hard drive (Samsung SSD 850 EVO 1TB), which is >> within warranty. >> I can't mount it read-write ("no rw mounting after error"). > > There is a data-corruption issue with this controller! > The same as 840 EVO - just google this. That's a bit alarmist... It's not a problem with recent kernels (I too have a Samsung 850 EVO 1 TB), as you mention below. Given the focus of this list, btrfs, and the fact that btrfs is still stabilizing, not yet fully stable and mature, so reasonably new kernels are strongly recommended (with the second newest mainline-LTS kernel series, now 4.9 since 4.14 is LTS, being the oldest recommended)... Anyone already following btrfs-list kernel recommendations is already /well/ beyond the kernel versions you mention below as adding the blacklisting, so it shouldn't be a problem. And he mentions Ubuntu with kernel 4.13, so he's at least well beyond the problem kernels now, tho of course it's possible he was running an older one previously, and that's what did the damage. > In short: either use recent kernel (AFAIR 4.0.5+ for 840 EVO and some > newer for entire 8* Samsung SSD family blacklisting) or disable NCQ. > > Using queued TRIM on this drive leads to data loss! Firmware zeroes > fist 512 bytes of a block, sorry. > > If you only had smaller drive, as 850s up to 512 GB have different > controller... > >> checksum verify failed on 103009173504 found 25334496 wanted 3500 >> bytenr mismatch, want=103009173504, have=889192478 >> ERROR: failed to repair root items: Input/output error >> >> What do these errors mean? >> What should I do to fix the filesystem and be able to mount it >> read-write? > > You probably can't fix this - there is data missing on bare metal, so > you should recover using backups. If you don't have one, you need to > perform manual data recovery procedures (like photorec) with little > chances to restore complete files due to the nature of data loss > (beginning of blocks). I'll agree here. First sysadmin rule of backups. The value of your data is defined not by empty claims, but by the number of backups you consider it worth having of that data. No backups simply defines the data as of only trivial value, worth less than the time, trouble and resources necessary to make that backup. So in the event of loss of the working copy, you can always rest easy. Either there was a backup and recovery can proceed from it, or there wasn't, in which case you can still be happy, since after all you still saved what was self-evidently more important than that data, the time, trouble and resources you'd have put into backing it up if it /were/ worth it. Meanwhile, apparently the filesystem can still be mounted read-only, just not read-write, so the first thing I'd do would be to mount it read-only and see how much of the data I can successfully copy off to some other filesystem. With a bit of luck, only a few files are damaged and will need restored from backup (assuming they were valuable enough to have a backup, of course), while the rest are fine. If that's /not/ the case, and the filesystem won't mount read-only either, or there's too much damaged, then it's time to try btrfs restore, to try to restore some of the files to some other location. Tho note that btrfs restore is an effort at recovery, and as such, it doesn't verify checksums as normal btrfs file operations will, so some of the otherwise successfully restored files may still be bitrotted. (Tho most filesystems don't checksum in any case, so the danger of bitrot is no worse than using a normal filesystem that doesn't do checksumming anyway.) -- Duncan - List replies preferred. No HTML msgs. "Every nonfree program has a lord, a master -- and if you use the program, he is your master." Richard Stallman -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: exclusive subvolume space missing
On 2017年12月11日 07:44, Qu Wenruo wrote: > > > On 2017年12月10日 19:27, Tomasz Pala wrote: >> On Mon, Dec 04, 2017 at 08:34:28 +0800, Qu Wenruo wrote: >> 1. is there any switch resulting in 'defrag only exclusive data'? >>> >>> IIRC, no. >> >> I have found a directory - pam_abl databases, which occupy 10 MB (yes, >> TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after >> defrag. After defragging files were not snapshotted again and I've lost >> 3.6 GB again, so I got this fully reproducible. >> There are 7 files, one of which is 99% of the space (10 MB). None of >> them has nocow set, so they're riding all-btrfs. >> >> I could debug something before I'll clean this up, is there anything you >> want to me to check/know about the files? > > fiemap result along with btrfs dump-tree -t2 result. > > Both output has nothing related to file name/dir name, but only some > "meaningless" bytenr, so it should be completely OK to share them. > >> >> The fragmentation impact is HUGE here, 1000-ratio is almost a DoS >> condition which could be triggered by malicious user during a few hours >> or faster > > You won't want to hear this: > The biggest ratio in theory is, 128M / 4K = 32768. > >> - I've lost 3.6 GB during the night with reasonably small >> amount of writes, I guess it might be possible to trash entire >> filesystem within 10 minutes if doing this on purpose. > > That's a little complex. > To get into such situation, snapshot must be used and one must know > which file extent is shared and how it's shared. > > But yes, it's possible. > > While on the other hand, XFS, which also supports reflink, handles it > quite well, so I'm wondering if it's possible for btrfs to follow its > behavior. > >> 3. I guess there aren't, so how could I accomplish my target, i.e. reclaiming space that was lost due to fragmentation, without breaking spanshoted CoW where it would be not only pointless, but actually harmful? >>> >>> What about using old kernel, like v4.13? >> >> Unfortunately (I guess you had 3.13 on mind), I need the new ones and >> will be pushing towards 4.14. > > No, I really mean v4.13. My fault, it is v3.13. What a stupid error... > > From btrfs(5): > --- >Warning >Defragmenting with Linux kernel versions < 3.9 or ≥ > 3.14-rc2 as >well as with Linux stable kernel versions ≥ 3.10.31, ≥ > 3.12.12 >or ≥ 3.13.4 will break up the ref-links of CoW data (for >example files copied with cp --reflink, snapshots or >de-duplicated data). This may cause considerable increase of >space usage depending on the broken up ref-links. > --- > >> 4. How can I prevent this from happening again? All the files, that are written constantly (stats collector here, PostgreSQL database and logs on other machines), are marked with nocow (+C); maybe some new attribute to mark file as autodefrag? +t? >>> >>> Unfortunately, nocow only works if there is no other subvolume/inode >>> referring to it. >> >> This shouldn't be my case anymore after defrag (==breaking links). >> I guess no easy way to check refcounts of the blocks? > > No easy way unfortunately. > It's either time consuming (used by qgroup) or complex (manually tree > search and do the backref walk by yourself) > >> >>> But in my understanding, btrfs is not suitable for such conflicting >>> situation, where you want to have snapshots of frequent partial updates. >>> >>> IIRC, btrfs is better for use case where either update is less frequent, >>> or update is replacing the whole file, not just part of it. >>> >>> So btrfs is good for root filesystem like /etc /usr (and /bin /lib which >>> is pointing to /usr/bin and /usr/lib) , but not for /var or /run. >> >> That is something coherent with my conclusions after 2 years on btrfs, >> however I didn't expect a single file to eat 1000 times more space than it >> should... >> >> >> I wonder how many other filesystems were trashed like this - I'm short >> of ~10 GB on other system, many other users might be affected by that >> (telling the Internet stories about btrfs running out of space). > > Firstly, no other filesystem supports snapshot. > So it's pretty hard to get a baseline. > > But as I mentioned, XFS supports reflink, which means file extent can be > shared between several inodes. > > From the message I got from XFS guys, they free any unused space of a > file extent, so it should handle it quite well. > > But it's quite a hard work to achieve in btrfs, needs years development > at least. > >> >> It is not a problem that I need to defrag a file, the problem is I don't >> know: >> 1. whether I need to defrag, >> 2. *what* should I defrag >> nor have a tool that would defrag smart - only the exclusive data or, in >> general, the block that are worth defragging if space released from >> extents is greater than space lost on
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Fri, Dec 08, 2017 at 03:01:31PM -0800, Matthew Wilcox wrote: > On Thu, Dec 07, 2017 at 11:38:43AM +1100, Dave Chinner wrote: > > > > cmpxchg is for replacing a known object in a store - it's not really > > > > intended for doing initial inserts after a lookup tells us there is > > > > nothing in the store. The radix tree "insert only if empty" makes > > > > sense here, because it naturally takes care of lookup/insert races > > > > via the -EEXIST mechanism. > > > > > > > > I think that providing xa_store_excl() (which would return -EEXIST > > > > if the entry is not empty) would be a better interface here, because > > > > it matches the semantics of lookup cache population used all over > > > > the kernel > > > > > > I'm not thrilled with xa_store_excl(), but I need to think about that > > > a bit more. > > > > Not fussed about the name - I just think we need a function that > > matches the insert semantics of the code > > I think I have something that works better for you than returning -EEXIST > (because you don't actually want -EEXIST, you want -EAGAIN): > > /* insert the new inode */ > - spin_lock(>pag_ici_lock); > - error = radix_tree_insert(>pag_ici_root, agino, ip); > - if (unlikely(error)) { > - WARN_ON(error != -EEXIST); > - XFS_STATS_INC(mp, xs_ig_dup); > - error = -EAGAIN; > - goto out_preload_end; > - } > - spin_unlock(>pag_ici_lock); > - radix_tree_preload_end(); > + curr = xa_cmpxchg(>pag_ici_xa, agino, NULL, ip, GFP_NOFS); > + error = __xa_race(curr, -EAGAIN); > + if (error) > + goto out_unlock; > > ... > > -out_preload_end: > - spin_unlock(>pag_ici_lock); > - radix_tree_preload_end(); > +out_unlock: > + if (error == -EAGAIN) > + XFS_STATS_INC(mp, xs_ig_dup); > > I've changed the behaviour slightly in that returning an -ENOMEM used to > hit a WARN_ON, and I don't think that's the right response -- GFP_NOFS > returning -ENOMEM probably gets you a nice warning already from the > mm code. It's been a couple of days since I first looked at this, and my initial reaction of "that's horrible!" hasn't changed. What you are proposing here might be a perfectly reasonable *internal implemention* of xa_store_excl(), but it makes for a terrible external API because the sematics and behaviour are so vague. e.g. what does "race" mean here with respect to an insert failure? i.e. the fact the cmpxchg failed may not have anything to do with a race condtion - it failed because the slot wasn't empty like we expected it to be. There can be any number of reasons the slot isn't empty - the API should not "document" that the reason the insert failed was a race condition. It should document the case that we "couldn't insert because there was an existing entry in the slot". Let the surrounding code document the reason why that might have happened - it's not for the API to assume reasons for failure. i.e. this API and potential internal implementation makes much more sense: int xa_store_iff_empty(...) { curr = xa_cmpxchg(>pag_ici_xa, agino, NULL, ip, GFP_NOFS); if (!curr) return 0; /* success! */ if (!IS_ERR(curr)) return -EEXIST; /* failed - slot not empty */ return PTR_ERR(curr); /* failed - XA internal issue */ } as it replaces the existing preload and insert code in the XFS code paths whilst letting us handle and document the "insert failed because slot not empty" case however we want. It implies nothing about *why* the slot wasn't empty, just that we couldn't do the insert because it wasn't empty. Cheers, Dave. -- Dave Chinner da...@fromorbit.com -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Chunk-Recovery fails with alignment error
On 2017年12月11日 05:16, Benjamin Beichler wrote: > The patch let the chunk-recover be successful. But I'm no lucky man, > the recovered chunk tree does not work or other metadata is also > broken. > > Mounting the system is not successful (dmesg): > BTRFS critical (device dm-0): corrupt node, invalid item slot: > block=16384, root=1, slot=0 > BTRFS error (device dm-0): failed to read chunk root > BTRFS error (device dm-0): open_ctree failed For this error, you could apply this diff to by-pass it: -- diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c index ce4ed6ec8f39..355220e21c2e 100644 --- a/fs/btrfs/tree-checker.c +++ b/fs/btrfs/tree-checker.c @@ -413,12 +413,6 @@ int btrfs_check_node(struct btrfs_root *root, struct extent_buffer *node) btrfs_node_key_to_cpu(node, , slot); btrfs_node_key_to_cpu(node, _key, slot + 1); - if (!bytenr) { - generic_err(root, node, slot, - "invalid NULL node pointer"); - ret = -EUCLEAN; - goto out; - } if (!IS_ALIGNED(bytenr, root->fs_info->sectorsize)) { generic_err(root, node, slot, "unaligned pointer, have %llu should be aligned to %u", -- Thanks, Qu > > Therefore I tried a btrfs check --repair, this time without error: > https://gist.github.com/anonymous/5cf7ad9e187032d2c94db4f91bb62c24 > > Then I tried btrfs check --init-extent-tree and this produces much > output. I put the beginning into here: > https://gist.github.com/anonymous/70e2482646a8235ee2327105d920dadd > From a fast view, the messages keep to be similar to the last of the > gist, but the messages in the beginning are not repeating. If it helps > I have complete compressed log. > > Then I tried a btrfs recover to get files, but for many files (also > improtant data, but I filtered the output) I get outputs like in: > https://gist.github.com/anonymous/1cc7f7ab5af33e76d0bf80960bb300eb > > Any new suggestions? > signature.asc Description: OpenPGP digital signature
btrfs-progs: replace: error message can be improved when other operation is running
Dear all, when trying to replace a device of a file system for which a balance is running, btrfs-progs fails with the error message: ERROR: ioctl(DEV_REPLACE_START) on '/mnt/xyz' returns error: This might also be true for alike operations, such as "add", "delete" and "resize", since those cases do not seem to be considered in cmds-replace.c [0]. Apparently, this is not very helpful to the user (if not scary). In contrast, other commands give very helpful output in similar situations (e.g., "add/delete/… operation in progress" [1]). Other users' confusions might also be related to this potential issue [2]. This is probably very easy to fix for someone into ioctl return values and all this. Thanks and cheers, Lukas GNU/Linux 4.13.0-0.bpo.1-amd64 #1 SMP Debian 4.13.13-1~bpo9+1 (2017-11-22) x86_64 btrfs-progs v4.13.3 [0] https://github.com/kdave/btrfs-progs/blob/11c83cefb8b4a03b1835efaf603ddc95430a0c9e/cmds-replace.c#L48 [1] https://github.com/kdave/btrfs-progs/blob/9fe889ac02b9c49b885c8999f5dd4e192697fa83/ioctl.h#L709 [2] https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=866734 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: exclusive subvolume space missing
On 2017年12月10日 19:27, Tomasz Pala wrote: > On Mon, Dec 04, 2017 at 08:34:28 +0800, Qu Wenruo wrote: > >>> 1. is there any switch resulting in 'defrag only exclusive data'? >> >> IIRC, no. > > I have found a directory - pam_abl databases, which occupy 10 MB (yes, > TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after > defrag. After defragging files were not snapshotted again and I've lost > 3.6 GB again, so I got this fully reproducible. > There are 7 files, one of which is 99% of the space (10 MB). None of > them has nocow set, so they're riding all-btrfs. > > I could debug something before I'll clean this up, is there anything you > want to me to check/know about the files? fiemap result along with btrfs dump-tree -t2 result. Both output has nothing related to file name/dir name, but only some "meaningless" bytenr, so it should be completely OK to share them. > > The fragmentation impact is HUGE here, 1000-ratio is almost a DoS > condition which could be triggered by malicious user during a few hours > or faster You won't want to hear this: The biggest ratio in theory is, 128M / 4K = 32768. > - I've lost 3.6 GB during the night with reasonably small > amount of writes, I guess it might be possible to trash entire > filesystem within 10 minutes if doing this on purpose. That's a little complex. To get into such situation, snapshot must be used and one must know which file extent is shared and how it's shared. But yes, it's possible. While on the other hand, XFS, which also supports reflink, handles it quite well, so I'm wondering if it's possible for btrfs to follow its behavior. > >>> 3. I guess there aren't, so how could I accomplish my target, i.e. >>>reclaiming space that was lost due to fragmentation, without breaking >>>spanshoted CoW where it would be not only pointless, but actually >>> harmful? >> >> What about using old kernel, like v4.13? > > Unfortunately (I guess you had 3.13 on mind), I need the new ones and > will be pushing towards 4.14. No, I really mean v4.13. From btrfs(5): --- Warning Defragmenting with Linux kernel versions < 3.9 or ≥ 3.14-rc2 as well as with Linux stable kernel versions ≥ 3.10.31, ≥ 3.12.12 or ≥ 3.13.4 will break up the ref-links of CoW data (for example files copied with cp --reflink, snapshots or de-duplicated data). This may cause considerable increase of space usage depending on the broken up ref-links. --- > >>> 4. How can I prevent this from happening again? All the files, that are >>>written constantly (stats collector here, PostgreSQL database and >>>logs on other machines), are marked with nocow (+C); maybe some new >>>attribute to mark file as autodefrag? +t? >> >> Unfortunately, nocow only works if there is no other subvolume/inode >> referring to it. > > This shouldn't be my case anymore after defrag (==breaking links). > I guess no easy way to check refcounts of the blocks? No easy way unfortunately. It's either time consuming (used by qgroup) or complex (manually tree search and do the backref walk by yourself) > >> But in my understanding, btrfs is not suitable for such conflicting >> situation, where you want to have snapshots of frequent partial updates. >> >> IIRC, btrfs is better for use case where either update is less frequent, >> or update is replacing the whole file, not just part of it. >> >> So btrfs is good for root filesystem like /etc /usr (and /bin /lib which >> is pointing to /usr/bin and /usr/lib) , but not for /var or /run. > > That is something coherent with my conclusions after 2 years on btrfs, > however I didn't expect a single file to eat 1000 times more space than it > should... > > > I wonder how many other filesystems were trashed like this - I'm short > of ~10 GB on other system, many other users might be affected by that > (telling the Internet stories about btrfs running out of space). Firstly, no other filesystem supports snapshot. So it's pretty hard to get a baseline. But as I mentioned, XFS supports reflink, which means file extent can be shared between several inodes. From the message I got from XFS guys, they free any unused space of a file extent, so it should handle it quite well. But it's quite a hard work to achieve in btrfs, needs years development at least. > > It is not a problem that I need to defrag a file, the problem is I don't know: > 1. whether I need to defrag, > 2. *what* should I defrag > nor have a tool that would defrag smart - only the exclusive data or, in > general, the block that are worth defragging if space released from > extents is greater than space lost on inter-snapshot duplication. > > I can't just defrag entire filesystem since it breaks links with snapshots. > This change was a real deal-breaker here... IIRC it's better to add a option to make defrag snapshot-aware. (Don't break snapshot sharing but only to
Re: Chunk-Recovery fails with alignment error
On 2017年12月11日 05:16, Benjamin Beichler wrote: > The patch let the chunk-recover be successful. But I'm no lucky man, > the recovered chunk tree does not work or other metadata is also > broken. > > Mounting the system is not successful (dmesg): > BTRFS critical (device dm-0): corrupt node, invalid item slot: > block=16384, root=1, slot=0 What does btrfs dump-tree -b 16384 say? IIRC the same 0 bytenr problem in kernel. > BTRFS error (device dm-0): failed to read chunk root > BTRFS error (device dm-0): open_ctree failed > > Therefore I tried a btrfs check --repair, this time without error: > https://gist.github.com/anonymous/5cf7ad9e187032d2c94db4f91bb62c24 > > Then I tried btrfs check --init-extent-tree and this produces much > output. I put the beginning into here: > https://gist.github.com/anonymous/70e2482646a8235ee2327105d920dadd That's common for --init-extent-tree, but I don't think extent tree is related in this case. Thanks, Qu > From a fast view, the messages keep to be similar to the last of the > gist, but the messages in the beginning are not repeating. If it helps > I have complete compressed log. > > Then I tried a btrfs recover to get files, but for many files (also > improtant data, but I filtered the output) I get outputs like in: > https://gist.github.com/anonymous/1cc7f7ab5af33e76d0bf80960bb300eb> > Any new suggestions? > signature.asc Description: OpenPGP digital signature
Just curious, whats happened with btrfs & rcu skiplist in 2013?
Subj, I found that https://lwn.net/Articles/554885/, https://lwn.net/Articles/553047/ and some others messages like judy rcu & etc. But nothing after june 2013, just curious, may be some one know why that has been stalled? or just what happened in the end? (i.e. ex, RT guys, just said that too bad for us..) Thanks! -- Have a nice day, Timofey. -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Chunk-Recovery fails with alignment error
The patch let the chunk-recover be successful. But I'm no lucky man, the recovered chunk tree does not work or other metadata is also broken. Mounting the system is not successful (dmesg): BTRFS critical (device dm-0): corrupt node, invalid item slot: block=16384, root=1, slot=0 BTRFS error (device dm-0): failed to read chunk root BTRFS error (device dm-0): open_ctree failed Therefore I tried a btrfs check --repair, this time without error: https://gist.github.com/anonymous/5cf7ad9e187032d2c94db4f91bb62c24 Then I tried btrfs check --init-extent-tree and this produces much output. I put the beginning into here: https://gist.github.com/anonymous/70e2482646a8235ee2327105d920dadd >From a fast view, the messages keep to be similar to the last of the gist, but the messages in the beginning are not repeating. If it helps I have complete compressed log. Then I tried a btrfs recover to get files, but for many files (also improtant data, but I filtered the output) I get outputs like in: https://gist.github.com/anonymous/1cc7f7ab5af33e76d0bf80960bb300eb Any new suggestions? -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: exclusive subvolume space missing
On Sun, Dec 10, 2017 at 12:27:38 +0100, Tomasz Pala wrote: > I have found a directory - pam_abl databases, which occupy 10 MB (yes, > TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after # df Filesystem Size Used Avail Use% Mounted on /dev/sda264G 61G 2.8G 96% / # btrfs fi du . Total Exclusive Set shared Filename 0.00B 0.00B - ./1/__db.register 10.00MiB10.00MiB - ./1/log.01 16.00KiB 0.00B - ./1/hosts.db 16.00KiB 0.00B - ./1/users.db 168.00KiB 0.00B - ./1/__db.001 40.00KiB 0.00B - ./1/__db.002 44.00KiB 0.00B - ./1/__db.003 10.28MiB10.00MiB - ./1 0.00B 0.00B - ./__db.register 16.00KiB16.00KiB - ./hosts.db 16.00KiB16.00KiB - ./users.db 10.00MiB10.00MiB - ./log.13 0.00B 0.00B - ./__db.001 0.00B 0.00B - ./__db.002 0.00B 0.00B - ./__db.003 20.31MiB20.03MiB 284.00KiB . # btrfs fi defragment log.13 # df /dev/sda264G 54G 9.4G 86% / 6.6 GB / 10 MB = 660:1 overhead within 1 day of uptime. -- Tomasz Pala-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: ERROR: failed to repair root items: Input/output error
On Sun, Dec 10, 2017 at 15:18:32 +, constantine wrote: > I have a laptop root hard drive (Samsung SSD 850 EVO 1TB), which is > within warranty. > I can't mount it read-write ("no rw mounting after error"). There is a data-corruption issue with this controller! The same as 840 EVO - just google this. In short: either use recent kernel (AFAIR 4.0.5+ for 840 EVO and some newer for entire 8* Samsung SSD family blacklisting) or disable NCQ. Using queued TRIM on this drive leads to data loss! Firmware zeroes fist 512 bytes of a block, sorry. If you only had smaller drive, as 850s up to 512 GB have different controller... > checksum verify failed on 103009173504 found 25334496 wanted 3500 > bytenr mismatch, want=103009173504, have=889192478 > ERROR: failed to repair root items: Input/output error > > What do these errors mean? > What should I do to fix the filesystem and be able to mount it read-write? You probably can't fix this - there is data missing on bare metal, so you should recover using backups. If you don't have one, you need to perform manual data recovery procedures (like photorec) with little chances to restore complete files due to the nature of data loss (beginning of blocks). -- Tomasz Pala-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
ERROR: failed to repair root items: Input/output error
I have a laptop root hard drive (Samsung SSD 850 EVO 1TB), which is within warranty. I can't mount it read-write ("no rw mounting after error"). The data are not really critical (I will overcome the shock of losing them within a couple of days). Btrfs check --repair throws an error: sudo btrfs check --repair /dev/sdb1 enabling repair mode Checking filesystem on /dev/sdb1 UUID: a955bc5f-e5f0-42ce-bd5a-de5eb8d5d3aa checking extents ERROR: add_tree_backref failed (non-leaf block): File exists parent transid verify failed on 103009185792 wanted 5026 found 345954 parent transid verify failed on 103009185792 wanted 5026 found 345954 Ignoring transid failure leaf parent key incorrect 103009185792 bad block 103009185792 ERROR: errors found in extent allocation tree or chunk allocation checksum verify failed on 103009173504 found 25334496 wanted 3500 checksum verify failed on 103009173504 found 25334496 wanted 3500 bytenr mismatch, want=103009173504, have=889192478 ERROR: failed to repair root items: Input/output error What do these errors mean? What should I do to fix the filesystem and be able to mount it read-write? Thank you, Konstantinos Tsardounis PS: I login with an Ubuntu LiveCD now which returns: uname -a Linux ubuntu 4.13.0-16-generic #19-Ubuntu SMP Wed Oct 11 18:35:14 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux btrfs --version btrfs-progs v4.12 btrfs fi show Label: 'arch' uuid: a955bc5f-e5f0-42ce-bd5a-de5eb8d5d3aa Total devices 1 FS bytes used 876.26GiB devid1 size 931.01GiB used 931.01GiB path /dev/sdb1 btrfs fi df sdb1 Data, single: total=926.25GiB, used=872.65GiB System, single: total=4.00MiB, used=128.00KiB Metadata, single: total=4.76GiB, used=3.60GiB GlobalReserve, single: total=512.00MiB, used=0.00B and the dmesg.log is on: https://gist.github.com/anonymous/16344244259bb2989701f3ec43e26f39 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: exclusive subvolume space missing
On Mon, Dec 04, 2017 at 08:34:28 +0800, Qu Wenruo wrote: >> 1. is there any switch resulting in 'defrag only exclusive data'? > > IIRC, no. I have found a directory - pam_abl databases, which occupy 10 MB (yes, TEN MEGAbytes) and released ...8.7 GB (almost NINE GIGAbytes) after defrag. After defragging files were not snapshotted again and I've lost 3.6 GB again, so I got this fully reproducible. There are 7 files, one of which is 99% of the space (10 MB). None of them has nocow set, so they're riding all-btrfs. I could debug something before I'll clean this up, is there anything you want to me to check/know about the files? The fragmentation impact is HUGE here, 1000-ratio is almost a DoS condition which could be triggered by malicious user during a few hours or faster - I've lost 3.6 GB during the night with reasonably small amount of writes, I guess it might be possible to trash entire filesystem within 10 minutes if doing this on purpose. >> 3. I guess there aren't, so how could I accomplish my target, i.e. >>reclaiming space that was lost due to fragmentation, without breaking >>spanshoted CoW where it would be not only pointless, but actually harmful? > > What about using old kernel, like v4.13? Unfortunately (I guess you had 3.13 on mind), I need the new ones and will be pushing towards 4.14. >> 4. How can I prevent this from happening again? All the files, that are >>written constantly (stats collector here, PostgreSQL database and >>logs on other machines), are marked with nocow (+C); maybe some new >>attribute to mark file as autodefrag? +t? > > Unfortunately, nocow only works if there is no other subvolume/inode > referring to it. This shouldn't be my case anymore after defrag (==breaking links). I guess no easy way to check refcounts of the blocks? > But in my understanding, btrfs is not suitable for such conflicting > situation, where you want to have snapshots of frequent partial updates. > > IIRC, btrfs is better for use case where either update is less frequent, > or update is replacing the whole file, not just part of it. > > So btrfs is good for root filesystem like /etc /usr (and /bin /lib which > is pointing to /usr/bin and /usr/lib) , but not for /var or /run. That is something coherent with my conclusions after 2 years on btrfs, however I didn't expect a single file to eat 1000 times more space than it should... I wonder how many other filesystems were trashed like this - I'm short of ~10 GB on other system, many other users might be affected by that (telling the Internet stories about btrfs running out of space). It is not a problem that I need to defrag a file, the problem is I don't know: 1. whether I need to defrag, 2. *what* should I defrag nor have a tool that would defrag smart - only the exclusive data or, in general, the block that are worth defragging if space released from extents is greater than space lost on inter-snapshot duplication. I can't just defrag entire filesystem since it breaks links with snapshots. This change was a real deal-breaker here... Any way to fed the deduplication code with snapshots maybe? There are directories and files in the same layout, this could be fast-tracked to check and deduplicate. -- Tomasz Pala-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: exclusive subvolume space missing
On Sun, Dec 03, 2017 at 01:45:45 +, Duncan wrote: > OTOH, it's also quite possible that people chose btrfs at least partly > for other reasons, say the "storage pool" qualities, and would rather Well, to name some: 1. filesystem-level backups via snapshot/send/receive - much cleaner and faster than rsyncs or other old-fashioned methods. This obviously requires the CoW-once feature; - caveat: for btrfs-killing usage patterns all the snapshots but the last one need to be removed; 2. block-level checksums with RAID1-awareness - in contrary to mdadm RAIDx, which chooses random data copy from underlying devices, this is much less susceptible to bit rot; - caveats: requires CoW enabled, RAID1 reading is dumb (even/odd PID instead of real balancing), no N-way mirroring nor write-mostly flag. 3. compression - there is no real alternative, however: - caveat: requires CoW enabled, which makes it not suitable for ...systemd journals, which compress with great ratio (c.a. 1:10), nor for various databases, as they will be nocowed sooner or later; 4. storage pools you've mentioned - they are actually not much superior to LVM-based approach; until one could create subvolume with different profile (e.g. 'disable RAID1 for /var/log/journal') it is still better to create separate filesystems, meaning one have to use LVM or (the hard way) paritioning. Some of the drawbacks above are immanent to CoW and so shouldn't be expected to be fixed internally, as the needs are conflicting, but their impact might be nullified by some housekeeping. -- Tomasz Pala-- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 4/5] btrfs: cleanup device states define BTRFS_DEV_STATE_REPLACE_TGT
Currently device state is being managed by each individual int variable such as struct btrfs_device::is_tgtdev_for_dev_replace. Instead of that declare btrfs_device::dev_state BTRFS_DEV_STATE_MISSING and use the bit operations. Signed-off-by: Anand Jain--- v3: Define BTRFS_DEV_STATE_REPLACE_TGT as bit nr fs/btrfs/dev-replace.c | 5 +++-- fs/btrfs/extent-tree.c | 3 ++- fs/btrfs/ioctl.c | 2 +- fs/btrfs/scrub.c | 2 +- fs/btrfs/super.c | 5 +++-- fs/btrfs/volumes.c | 39 ++- fs/btrfs/volumes.h | 2 +- 7 files changed, 33 insertions(+), 25 deletions(-) diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index 559db7667f38..12fd8a203735 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -172,7 +172,8 @@ int btrfs_init_dev_replace(struct btrfs_fs_info *fs_info) dev_replace->tgtdev->commit_bytes_used = dev_replace->srcdev->commit_bytes_used; } - dev_replace->tgtdev->is_tgtdev_for_dev_replace = 1; + set_bit(BTRFS_DEV_STATE_REPLACE_TGT, + _replace->tgtdev->dev_state); btrfs_init_dev_replace_tgtdev_for_resume(fs_info, dev_replace->tgtdev); } @@ -564,7 +565,7 @@ static int btrfs_dev_replace_finishing(struct btrfs_fs_info *fs_info, dev_missing_or_rcu_str(src_device), src_device->devid, rcu_str_deref(tgt_device->name)); - tgt_device->is_tgtdev_for_dev_replace = 0; + clear_bit(BTRFS_DEV_STATE_REPLACE_TGT, _device->dev_state); tgt_device->devid = src_device->devid; src_device->devid = BTRFS_DEV_REPLACE_DEVID; memcpy(uuid_tmp, tgt_device->uuid, sizeof(uuid_tmp)); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 2cd323d184a0..1e65d5d54a8a 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -9692,7 +9692,8 @@ int btrfs_can_relocate(struct btrfs_fs_info *fs_info, u64 bytenr) * space to fit our block group in. */ if (device->total_bytes > device->bytes_used + min_free && - !device->is_tgtdev_for_dev_replace) { + !test_bit(BTRFS_DEV_STATE_REPLACE_TGT, + >dev_state)) { ret = find_free_dev_extent(trans, device, min_free, _offset, NULL); if (!ret) diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index e59004a17166..953563138020 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1528,7 +1528,7 @@ static noinline int btrfs_ioctl_resize(struct file *file, } } - if (device->is_tgtdev_for_dev_replace) { + if (test_bit(BTRFS_DEV_STATE_REPLACE_TGT, >dev_state)) { ret = -EPERM; goto out_free; } diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index b6de017066b3..b5a33db38874 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -4131,7 +4131,7 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start, mutex_lock(_info->scrub_lock); if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, >dev_state) || - dev->is_tgtdev_for_dev_replace) { + test_bit(BTRFS_DEV_STATE_REPLACE_TGT, >dev_state)) { mutex_unlock(_info->scrub_lock); mutex_unlock(_info->fs_devices->device_list_mutex); return -EIO; diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 6bae2e046257..b16e3fbd5895 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -1973,8 +1973,9 @@ static int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info, rcu_read_lock(); list_for_each_entry_rcu(device, _devices->devices, dev_list) { if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, - >dev_state) || - !device->bdev || device->is_tgtdev_for_dev_replace) + >dev_state) || !device->bdev || + test_bit(BTRFS_DEV_STATE_REPLACE_TGT, + >dev_state)) continue; if (i >= nr_devices) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index c6f7f4935dc4..37b1aed14353 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -845,8 +845,8 @@ void btrfs_close_extra_devices(struct btrfs_fs_devices *fs_devices, int step) list_for_each_entry_safe(device, next, _devices->devices, dev_list) { if (test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, >dev_state)) {
Re: [bug report] btrfs: cleanup device states define BTRFS_DEV_STATE_IN_FS_METADATA
On 12/08/2017 07:08 PM, Dan Carpenter wrote: Hello Anand Jain, The patch 2dcfcdc43524: "btrfs: cleanup device states define BTRFS_DEV_STATE_IN_FS_METADATA" from Dec 4, 2017, leads to the following static checker warning: fs/btrfs/super.c:2007 btrfs_calc_avail_data_space() warn: test_bit() takes a bit number fs/btrfs/super.c 2004 2005 rcu_read_lock(); 2006 list_for_each_entry_rcu(device, _devices->devices, dev_list) { 2007 if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, ^^ This BTRFS_DEV_STATE_IN_FS_METADATA define is BIT(1) but for test_bit() we should just take 1. In other words we are doing double BIT(BIT(1)). 2008 >dev_state) || 2009 !device->bdev || 2010 test_bit(BTRFS_DEV_STATE_REPLACE_TGT, >dev_state)) 2011 continue; 2012 2013 if (i >= nr_devices) 2014 break; 2015 Thanks Dan. I fixed all these in v3. - Anand -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 3/5] btrfs: cleanup device states define BTRFS_DEV_STATE_MISSING
Currently device state is being managed by each individual int variable such as struct btrfs_device::missing. Instead of that declare btrfs_device::dev_state BTRFS_DEV_STATE_MISSING and use the bit operations. Signed-off-by: Anand JainReviewed-by : Nikolay Borisov --- v3: Define BTRFS_DEV_STATE_MISSING as bit nr fs/btrfs/dev-replace.c | 3 ++- fs/btrfs/disk-io.c | 4 ++-- fs/btrfs/scrub.c | 7 --- fs/btrfs/super.c | 2 +- fs/btrfs/volumes.c | 34 -- fs/btrfs/volumes.h | 2 +- 6 files changed, 30 insertions(+), 22 deletions(-) diff --git a/fs/btrfs/dev-replace.c b/fs/btrfs/dev-replace.c index 4b6ceb38cb5f..559db7667f38 100644 --- a/fs/btrfs/dev-replace.c +++ b/fs/btrfs/dev-replace.c @@ -306,7 +306,8 @@ void btrfs_after_dev_replace_commit(struct btrfs_fs_info *fs_info) static inline char *dev_missing_or_rcu_str(struct btrfs_device *device) { - return device->missing ? "" : rcu_str_deref(device->name); + return test_bit(BTRFS_DEV_STATE_MISSING, >dev_state) ? + "" : rcu_str_deref(device->name); } int btrfs_dev_replace_start(struct btrfs_fs_info *fs_info, diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 634e8eb51cc8..890e3a6a2f3e 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3399,7 +3399,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info) /* send down all the barriers */ head = >fs_devices->devices; list_for_each_entry_rcu(dev, head, dev_list) { - if (dev->missing) + if (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state)) continue; if (!dev->bdev) continue; @@ -3415,7 +3415,7 @@ static int barrier_all_devices(struct btrfs_fs_info *info) /* wait for all the barriers */ list_for_each_entry_rcu(dev, head, dev_list) { - if (dev->missing) + if (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state)) continue; if (!dev->bdev) { errors_wait++; diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index c4705de2ec26..b6de017066b3 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -2535,7 +2535,7 @@ static int scrub_pages(struct scrub_ctx *sctx, u64 logical, u64 len, } WARN_ON(sblock->page_count == 0); - if (dev->missing) { + if (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state)) { /* * This case should only be hit for RAID 5/6 device replace. See * the comment in scrub_missing_raid56_pages() for details. @@ -2870,7 +2870,7 @@ static int scrub_extent_for_parity(struct scrub_parity *sparity, u8 csum[BTRFS_CSUM_SIZE]; u32 blocksize; - if (dev->missing) { + if (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state)) { scrub_parity_mark_sectors_error(sparity, logical, len); return 0; } @@ -4112,7 +4112,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start, mutex_lock(_info->fs_devices->device_list_mutex); dev = btrfs_find_device(fs_info, devid, NULL, NULL); - if (!dev || (dev->missing && !is_dev_replace)) { + if (!dev || (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state) && + !is_dev_replace)) { mutex_unlock(_info->fs_devices->device_list_mutex); return -ENODEV; } diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index f0906fbfa731..6bae2e046257 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -2270,7 +2270,7 @@ static int btrfs_show_devname(struct seq_file *m, struct dentry *root) while (cur_devices) { head = _devices->devices; list_for_each_entry(dev, head, dev_list) { - if (dev->missing) + if (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state)) continue; if (!dev->name) continue; diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 7100c877748d..c6f7f4935dc4 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -758,9 +758,9 @@ static noinline int device_list_add(const char *path, return -ENOMEM; rcu_string_free(device->name); rcu_assign_pointer(device->name, name); - if (device->missing) { + if (test_bit(BTRFS_DEV_STATE_MISSING, >dev_state)) { fs_devices->missing_devices--; - device->missing = 0; + clear_bit(BTRFS_DEV_STATE_MISSING, >dev_state); } } @@ -944,7 +944,7 @@ static void btrfs_prepare_close_one_device(struct btrfs_device *device) fs_devices->rw_devices--;
[PATCH v3 0/5] define BTRFS_DEV_STATE
As of now device properties and states are being represented as int variable, patches here makes them bit flags instead. Further, wip patches such as device failed state needs this cleanup. v2: Adds BTRFS_DEV_STATE_REPLACE_TGT Adds BTRFS_DEV_STATE_FLUSH_SENT Drops BTRFS_DEV_STATE_CAN_DISCARD Starts bit flag from the bit 0 Drops unrelated change - declare btrfs_device v3: Fix static checker warning, define respective dev state as bit number Anand Jain (5): btrfs: cleanup device states define BTRFS_DEV_STATE_WRITEABLE btrfs: cleanup device states define BTRFS_DEV_STATE_IN_FS_METADATA btrfs: cleanup device states define BTRFS_DEV_STATE_MISSING btrfs: cleanup device states define BTRFS_DEV_STATE_REPLACE_TGT btrfs: cleanup device states define BTRFS_DEV_STATE_FLUSH_SENT fs/btrfs/dev-replace.c | 8 ++- fs/btrfs/disk-io.c | 29 ++--- fs/btrfs/extent-tree.c | 5 +- fs/btrfs/extent_io.c | 3 +- fs/btrfs/ioctl.c | 4 +- fs/btrfs/scrub.c | 13 +++-- fs/btrfs/super.c | 8 ++- fs/btrfs/volumes.c | 156 - fs/btrfs/volumes.h | 11 ++-- 9 files changed, 143 insertions(+), 94 deletions(-) -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 5/5] btrfs: cleanup device states define BTRFS_DEV_STATE_FLUSH_SENT
Currently device state is being managed by each individual int variable such as struct btrfs_device::is_tgtdev_for_dev_replace. Instead of that declare btrfs_device::dev_state BTRFS_DEV_STATE_FLUSH_SENT and use the bit operations. Signed-off-by: Anand Jain--- v3: Define BTRFS_DEV_STATE_FLUSH_SENT as bit nr fs/btrfs/disk-io.c | 6 +++--- fs/btrfs/volumes.h | 1 + 2 files changed, 4 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 890e3a6a2f3e..9b20c1f3563b 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3359,7 +3359,7 @@ static void write_dev_flush(struct btrfs_device *device) bio->bi_private = >flush_wait; btrfsic_submit_bio(bio); - device->flush_bio_sent = 1; + set_bit(BTRFS_DEV_STATE_FLUSH_SENT, >dev_state); } /* @@ -3369,10 +3369,10 @@ static blk_status_t wait_dev_flush(struct btrfs_device *device) { struct bio *bio = device->flush_bio; - if (!device->flush_bio_sent) + if (!test_bit(BTRFS_DEV_STATE_FLUSH_SENT, >dev_state)) return BLK_STS_OK; - device->flush_bio_sent = 0; + clear_bit(BTRFS_DEV_STATE_FLUSH_SENT, >dev_state); wait_for_completion_io(>flush_wait); return bio->bi_status; diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h index a15f8b103072..826fce3d6825 100644 --- a/fs/btrfs/volumes.h +++ b/fs/btrfs/volumes.h @@ -51,6 +51,7 @@ struct btrfs_pending_bios { #define BTRFS_DEV_STATE_IN_FS_METADATA 1 #define BTRFS_DEV_STATE_MISSING2 #define BTRFS_DEV_STATE_REPLACE_TGT3 +#define BTRFS_DEV_STATE_FLUSH_SENT 4 struct btrfs_device { struct list_head dev_list; -- 2.7.0 -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 2/5] btrfs: cleanup device states define BTRFS_DEV_STATE_IN_FS_METADATA
Currently device state is being managed by each individual int variable such as struct btrfs_device::in_fs_metadata. Instead of that declare device state BTRFS_DEV_STATE_IN_FS_METADATA and use the bit operations. Signed-off-by: Anand JainReviewed-by: Nikolay Borisov --- v3: Define BTRFS_DEV_STATE_IN_FS_METADATA as bit nr fs/btrfs/disk-io.c | 21 ++--- fs/btrfs/scrub.c | 3 ++- fs/btrfs/super.c | 5 +++-- fs/btrfs/volumes.c | 29 + fs/btrfs/volumes.h | 2 +- 5 files changed, 37 insertions(+), 23 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 56198cb02b35..634e8eb51cc8 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3403,8 +3403,10 @@ static int barrier_all_devices(struct btrfs_fs_info *info) continue; if (!dev->bdev) continue; - if (!dev->in_fs_metadata || - !test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state)) + if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, + >dev_state) || + !test_bit(BTRFS_DEV_STATE_WRITEABLE, + >dev_state)) continue; write_dev_flush(dev); @@ -3419,8 +3421,10 @@ static int barrier_all_devices(struct btrfs_fs_info *info) errors_wait++; continue; } - if (!dev->in_fs_metadata || - !test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state)) + if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, + >dev_state) || + !test_bit(BTRFS_DEV_STATE_WRITEABLE, + >dev_state)) continue; ret = wait_dev_flush(dev); @@ -3517,7 +3521,8 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors) total_errors++; continue; } - if (!dev->in_fs_metadata || + if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, + >dev_state) || !test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state)) continue; @@ -3557,8 +3562,10 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors) list_for_each_entry_rcu(dev, head, dev_list) { if (!dev->bdev) continue; - if (!dev->in_fs_metadata || - !test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state)) + if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, + >dev_state) || + !test_bit(BTRFS_DEV_STATE_WRITEABLE, + >dev_state)) continue; ret = wait_dev_supers(dev, max_mirrors); diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index fa70ff9b7762..c4705de2ec26 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -4129,7 +4129,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start, } mutex_lock(_info->scrub_lock); - if (!dev->in_fs_metadata || dev->is_tgtdev_for_dev_replace) { + if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, >dev_state) || + dev->is_tgtdev_for_dev_replace) { mutex_unlock(_info->scrub_lock); mutex_unlock(_info->fs_devices->device_list_mutex); return -EIO; diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 3a4dce153645..f0906fbfa731 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -1972,8 +1972,9 @@ static int btrfs_calc_avail_data_space(struct btrfs_fs_info *fs_info, rcu_read_lock(); list_for_each_entry_rcu(device, _devices->devices, dev_list) { - if (!device->in_fs_metadata || !device->bdev || - device->is_tgtdev_for_dev_replace) + if (!test_bit(BTRFS_DEV_STATE_IN_FS_METADATA, + >dev_state) || + !device->bdev || device->is_tgtdev_for_dev_replace) continue; if (i >= nr_devices) diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index 9d14d83ab8dc..7100c877748d 100644 --- a/fs/btrfs/volumes.c +++ b/fs/btrfs/volumes.c @@ -636,7 +636,7 @@ static int btrfs_open_one_device(struct btrfs_fs_devices *fs_devices, fs_devices->rotating = 1; device->bdev = bdev; - device->in_fs_metadata = 0; + clear_bit(BTRFS_DEV_STATE_IN_FS_METADATA, >dev_state); device->mode = flags; fs_devices->open_devices++; @@ -843,7 +843,8 @@ void
[PATCH v3 1/5] btrfs: cleanup device states define BTRFS_DEV_STATE_WRITEABLE
Currently device state is being managed by each individual int variable such as struct btrfs_device::writeable. Instead of that declare device state BTRFS_DEV_STATE_WRITEABLE and use the bit operations. Signed-off-by: Anand Jain--- v2: Drop unrelated change to declare btrfs_device Start bit flag defines from bit 0 v3: Define BTRFS_DEV_STATE_WRITEABLE as bit nr fs/btrfs/disk-io.c | 12 ++ fs/btrfs/extent-tree.c | 2 +- fs/btrfs/extent_io.c | 3 ++- fs/btrfs/ioctl.c | 2 +- fs/btrfs/scrub.c | 3 ++- fs/btrfs/volumes.c | 60 +- fs/btrfs/volumes.h | 4 +++- 7 files changed, 52 insertions(+), 34 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 10a2a579cc7f..56198cb02b35 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -3403,7 +3403,8 @@ static int barrier_all_devices(struct btrfs_fs_info *info) continue; if (!dev->bdev) continue; - if (!dev->in_fs_metadata || !dev->writeable) + if (!dev->in_fs_metadata || + !test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state)) continue; write_dev_flush(dev); @@ -3418,7 +3419,8 @@ static int barrier_all_devices(struct btrfs_fs_info *info) errors_wait++; continue; } - if (!dev->in_fs_metadata || !dev->writeable) + if (!dev->in_fs_metadata || + !test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state)) continue; ret = wait_dev_flush(dev); @@ -3515,7 +3517,8 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors) total_errors++; continue; } - if (!dev->in_fs_metadata || !dev->writeable) + if (!dev->in_fs_metadata || + !test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state)) continue; btrfs_set_stack_device_generation(dev_item, 0); @@ -3554,7 +3557,8 @@ int write_all_supers(struct btrfs_fs_info *fs_info, int max_mirrors) list_for_each_entry_rcu(dev, head, dev_list) { if (!dev->bdev) continue; - if (!dev->in_fs_metadata || !dev->writeable) + if (!dev->in_fs_metadata || + !test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state)) continue; ret = wait_dev_supers(dev, max_mirrors); diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 15c01014e5e1..2cd323d184a0 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -10877,7 +10877,7 @@ static int btrfs_trim_free_extents(struct btrfs_device *device, *trimmed = 0; /* Not writeable = nothing to do. */ - if (!device->writeable) + if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state)) return 0; /* No free space = nothing to do. */ diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c index 012d63870b99..25682c5a0dd5 100644 --- a/fs/btrfs/extent_io.c +++ b/fs/btrfs/extent_io.c @@ -2027,7 +2027,8 @@ int repair_io_failure(struct btrfs_fs_info *fs_info, u64 ino, u64 start, bio->bi_iter.bi_sector = sector; dev = bbio->stripes[bbio->mirror_num - 1].dev; btrfs_put_bbio(bbio); - if (!dev || !dev->bdev || !dev->writeable) { + if (!dev || !dev->bdev || + !test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state)) { btrfs_bio_counter_dec(fs_info); bio_put(bio); return -EIO; diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c index d748ad1c3620..e59004a17166 100644 --- a/fs/btrfs/ioctl.c +++ b/fs/btrfs/ioctl.c @@ -1503,7 +1503,7 @@ static noinline int btrfs_ioctl_resize(struct file *file, goto out_free; } - if (!device->writeable) { + if (!test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state)) { btrfs_info(fs_info, "resizer unable to apply on readonly device %llu", devid); diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index b2f871d80982..fa70ff9b7762 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -4117,7 +4117,8 @@ int btrfs_scrub_dev(struct btrfs_fs_info *fs_info, u64 devid, u64 start, return -ENODEV; } - if (!is_dev_replace && !readonly && !dev->writeable) { + if (!is_dev_replace && !readonly && + !test_bit(BTRFS_DEV_STATE_WRITEABLE, >dev_state)) { mutex_unlock(_info->fs_devices->device_list_mutex); rcu_read_lock(); name = rcu_dereference(dev->name); diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c index c19a49167966..9d14d83ab8dc