Re: [RFC] Improve subvolume usability for a normal user
On 2017-12-07 15:32, Marat Khalili wrote:
> On 07/12/17 08:27, Qu Wenruo wrote:
>> When doing a snapshot, btrfs only needs to increase the references of the
>> 2nd-highest-level tree blocks of the original subvolume, rather than
>> "walking the tree".
>> (If the tree root level is 2, then the level-2 node is copied, while the
>> references of all level-1 tree blocks get increased.)
>
> Out of curiosity, how does it interact with nocow files? Does every
> write to these files involve a backref walk?

For the details I would need to check the code, but at least a partial
backref walk must be done.

By "partial" I mean it doesn't need to get every root referring to the file
extent; it only needs to check whether the extent is referred to only by the
current inode and offset. If any other inode (or even the same inode with a
different offset) is found, the walk is finished.
(It's the same walk we do to determine the FIEMAP_EXTENT_SHARED flag.)

In contrast, for qgroup and balance the full backref walk is done, where
btrfs needs every root involved (for qgroup) or even every inode and offset
involved (for balance).

Thanks,
Qu

> --
> With Best Regards,
> Marat Khalili
Re: [RFC] Improve subvolume usability for a normal user
On 07/12/17 08:27, Qu Wenruo wrote:
> When doing a snapshot, btrfs only needs to increase the references of the
> 2nd-highest-level tree blocks of the original subvolume, rather than
> "walking the tree".
> (If the tree root level is 2, then the level-2 node is copied, while the
> references of all level-1 tree blocks get increased.)

Out of curiosity, how does it interact with nocow files? Does every write
to these files involve a backref walk?

--
With Best Regards,
Marat Khalili

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] Improve subvolume usability for a normal user
On 2017/12/07 11:56, Duncan wrote:
> Austin S. Hemmelgarn posted on Wed, 06 Dec 2017 07:39:56 -0500 as
> excerpted:
>
>> Somewhat OT, but the only operation that's remotely 'instant' is
>> creating an empty subvolume. Snapshot creation has to walk the tree in
>> the subvolume being snapshotted, which can take a long time (and as a
>> result of its implementation, also means BTRFS snapshots are _not_
>> atomic). Subvolume deletion has to do a bunch of cleanup work in the
>> background (though it may be fairly quick if it was a snapshot and the
>> source subvolume hasn't changed much).
>
> Indeed, while btrfs in general has taken a strategy of making /creating/
> snapshots and subvolumes fast, snapshot deletion in particular can take
> some time[1].
>
> And in that regard a question just occurred to me regarding this whole
> very tough problem of a user being able to create but not delete
> subvolumes and snapshots:
>
> Given that at least snapshot deletion (not so sure about non-snapshot
> subvolume deletion, tho I strongly suspect it would depend on the number
> of cross-subvolume reflinks) is already a task that can take some time,
> why /not/ just bite the bullet and make the behavior much more like
> directory deletion, given that subvolumes already behave much like
> directories? Yes, for non-root, that /does/ mean tracing the entire
> subtree and checking permissions, and yes, that's going to take time and
> lower performance somewhat. But subvolume and in particular snapshot
> deletion is already an operation that takes time, so this wouldn't be
> unduly changing the situation, and it would eliminate the entire class of
> security issues that come with either asymmetrically restricting deletion
> (but not creation) to root on the one hand, or, on the other, possible
> data loss due to allowing a user to delete a subvolume they couldn't
> delete were it an ordinary directory, due to not owning stuff further
> down the tree.

But this is also the very reason I'm for "sub del" instead of unlink().

Since snapshot creation doesn't check the permissions of the contained
files/dirs, it can copy a directory which cannot be deleted by the user.
Therefore, if we don't allow "sub del" for the user, he couldn't remove
the snapshot.

> ---
> [1] Based on comments and reports here. My own use-case doesn't involve
> snapshots/subvolumes.
Re: [RFC] Improve subvolume usability for a normal user
On 2017-12-06 20:39, Austin S. Hemmelgarn wrote:
> Somewhat OT, but the only operation that's remotely 'instant' is
> creating an empty subvolume. Snapshot creation has to walk the tree in
> the subvolume being snapshotted, which can take a long time (and as a
> result of its implementation, also means BTRFS snapshots are _not_
> atomic).

Nope, this is not true.

Btrfs was designed from the very beginning to make snapshots fast. The
most obvious evidence is its extent tree design.

When doing a snapshot, btrfs only needs to increase the references of the
2nd-highest-level tree blocks of the original subvolume, rather than
"walking the tree".
(If the tree root level is 2, then the level-2 node is copied, while the
references of all level-1 tree blocks get increased.)

This design hugely speeds up snapshot creation, making btrfs snapshots
obviously faster than any block-level snapshot.

The trade-off is that it obviously slows down backref walks, as btrfs must
walk all the way up to the tree root to see whether any tree block between
a leaf and the root is shared with another snapshot.

As for snapshot atomicity: btrfs delays snapshot creation to transaction
commit time, and buffered writes are not flushed for snapshot creation,
which is what makes snapshots not so atomic.

Thanks,
Qu

> Subvolume deletion has to do a bunch of cleanup work in the
> background (though it may be fairly quick if it was a snapshot and the
> source subvolume hasn't changed much).
Re: [PATCH] fstests: btrfs: Add test case to check if btrfs can handle full fs trim correctly
On Thu, Dec 07, 2017 at 08:43:43AM +0800, Qu Wenruo wrote:
> Ping.
>
> Any comment on this?

It's been pushed out to upstream, see commit 88231c0c0b9d.

Thanks,
Eryu
Re: [RFC] Improve subvolume usability for a normal user
Austin S. Hemmelgarn posted on Wed, 06 Dec 2017 07:39:56 -0500 as excerpted:

> Somewhat OT, but the only operation that's remotely 'instant' is
> creating an empty subvolume. Snapshot creation has to walk the tree in
> the subvolume being snapshotted, which can take a long time (and as a
> result of its implementation, also means BTRFS snapshots are _not_
> atomic). Subvolume deletion has to do a bunch of cleanup work in the
> background (though it may be fairly quick if it was a snapshot and the
> source subvolume hasn't changed much).

Indeed, while btrfs in general has taken a strategy of making /creating/
snapshots and subvolumes fast, snapshot deletion in particular can take
some time[1].

And in that regard a question just occurred to me regarding this whole
very tough problem of a user being able to create but not delete
subvolumes and snapshots:

Given that at least snapshot deletion (not so sure about non-snapshot
subvolume deletion, tho I strongly suspect it would depend on the number
of cross-subvolume reflinks) is already a task that can take some time,
why /not/ just bite the bullet and make the behavior much more like
directory deletion, given that subvolumes already behave much like
directories? Yes, for non-root, that /does/ mean tracing the entire
subtree and checking permissions, and yes, that's going to take time and
lower performance somewhat. But subvolume and in particular snapshot
deletion is already an operation that takes time, so this wouldn't be
unduly changing the situation, and it would eliminate the entire class of
security issues that come with either asymmetrically restricting deletion
(but not creation) to root on the one hand, or, on the other, possible
data loss due to allowing a user to delete a subvolume they couldn't
delete were it an ordinary directory, due to not owning stuff further
down the tree.

---
[1] Based on comments and reports here. My own use-case doesn't involve
snapshots/subvolumes.

--
Duncan - List replies preferred. No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master." Richard Stallman
Re: mail notifications to root once quota limit reached
On Wed, Nov 22, 2017 at 9:39 AM, Chris Murphy wrote:
> On Wed, Nov 22, 2017 at 8:36 AM, ST wrote:
>> Hello,
>>
>> is it possible to get mail notifications to root once a quota limit is
>> reached? If not, should I file a feature request?
>
> You should look at Nagios or OpenNMS, etc. That's where such
> functionality belongs.

I ran across this:
http://docs.gluster.org/en/latest/Administrator%20Guide/Gluster%20On%20ZFS/

It would need modification to do what you want with Btrfs, but maybe it's
a start.

--
Chris Murphy
Re: system OOM when using ghettoVCB script to backup VMs to nfs server (sync mode) running btrfs
I think no special option:

/dev/md127 on /data type btrfs (rw,noatime,nodiratime,nospace_cache,subvolid=5,subvol=/)

OK, will try the latest kernel.

thanks.

On Thu, Dec 7, 2017 at 9:21 AM, Qu Wenruo wrote:
>
> On 2017-12-07 09:19, Taibai Li wrote:
>> thanks for the quick response, I tried to test this on the 4.4.100
>> kernel with quota disabled:
>> # btrfs qgroup show /data/
>> ERROR: can't list qgroups: quotas not enabled
>>
>> But it seems it still OOMs after about 7 hours, having copied 144G of
>> files. Any other ideas? Maybe I will try to test with quota disabled
>> on the 4.14 kernel too.
>
> Trying the latest kernel is always a good idea.
>
> Qgroups aside, I am not quite sure what the cause could be.
>
> Is there any special mount option used?
>
> Thanks,
> Qu
>
>> thanks.
>>
>> On Wed, Dec 6, 2017 at 2:25 PM, Qu Wenruo wrote:
>>>
>>> On 2017-12-06 14:22, taibai li wrote:
>>>> Hi Guys,
>>>>
>>>> I hit the OOM issue with a box running a 4.4.x kernel, so I tried to
>>>> build a 4.14.3 kernel to try that. The testbed is:
>>>> a NAS box with 2G memory and a single-disk raid. I set up an nfs
>>>> server in sync mode, added the storage on ESXi 6.0 servers and
>>>> backed up all the VMs on it with the ghettoVCB script. After about
>>>> 14 hours, an Input/Output error happened; checked the box, found OOM.
# cat /etc/exports "/data/Videos" *(insecure,insecure_locks,no_subtree_check,crossmnt,anonuid=99,anongid=99,root_squash,rw,sync) # uname -a Linux lzx-314-desk 4.14.3.x86_64.1 #1 SMP Fri Dec 1 01:31:25 UTC 2017 x86_64 GNU/Linux # btrfs fi show /data/ Label: '43f611ae:data' uuid: 6fefb319-a21d-476e-9642-565e0600a049 Total devices 1 FS bytes used 292.78GiB devid 1 size 1.81TiB used 296.02GiB path /dev/md127 The stack is : Dec 04 23:47:43 lzx-314-desk kernel: nfsd: page allocation stalls for 621031ms, order:0, mode:0x14000c0(GFP_KERNEL), nodemask=(null) Dec 04 23:47:43 lzx-314-desk kernel: nfsd cpuset=/ mems_allowed=0 Dec 04 23:47:43 lzx-314-desk kernel: CPU: 0 PID: 3376 Comm: nfsd Not tainted 4.14.3.x86_64.1 #1 Dec 04 23:47:43 lzx-314-desk kernel: Hardware name: NETGEAR ReadyNAS 314/To be filled by O.E.M., BIOS 4.6.5 11/05/2013 Dec 04 23:47:43 lzx-314-desk kernel: Call Trace: Dec 04 23:47:43 lzx-314-desk kernel: dump_stack+0x4d/0x6a Dec 04 23:47:43 lzx-314-desk kernel: warn_alloc+0xe3/0x180 Dec 04 23:47:43 lzx-314-desk kernel: __alloc_pages_nodemask+0xb1e/0xed0 Dec 04 23:47:43 lzx-314-desk kernel: svc_recv+0x99/0x900 Dec 04 23:47:43 lzx-314-desk kernel: ? svc_process+0x241/0x690 Dec 04 23:47:43 lzx-314-desk kernel: nfsd+0xd2/0x150 Dec 04 23:47:43 lzx-314-desk kernel: kthread+0x11a/0x150 Dec 04 23:47:43 lzx-314-desk kernel: ? nfsd_destroy+0x60/0x60 Dec 04 23:47:43 lzx-314-desk kernel: ? 
kthread_create_on_node+0x40/0x40 Dec 04 23:47:43 lzx-314-desk kernel: ret_from_fork+0x22/0x30 Dec 04 23:47:43 lzx-314-desk kernel: readynasd invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=-1000 Dec 04 23:47:43 lzx-314-desk kernel: readynasd cpuset=/ mems_allowed=0 Dec 04 23:47:43 lzx-314-desk kernel: CPU: 3 PID: 3307 Comm: readynasd Not tainted 4.14.3.x86_64.1 #1 Dec 04 23:47:43 lzx-314-desk kernel: Hardware name: NETGEAR ReadyNAS 314/To be filled by O.E.M., BIOS 4.6.5 11/05/2013 Dec 04 23:47:43 lzx-314-desk kernel: Call Trace: Dec 04 23:47:43 lzx-314-desk kernel: dump_stack+0x4d/0x6a Dec 04 23:47:43 lzx-314-desk kernel: dump_header+0x9a/0x21b Dec 04 23:47:43 lzx-314-desk kernel: ? pick_next_task_fair+0x1d5/0x4b0 Dec 04 23:47:43 lzx-314-desk kernel: ? security_capable_noaudit+0x40/0x60 Dec 04 23:47:43 lzx-314-desk kernel: oom_kill_process+0x216/0x430 Dec 04 23:47:43 lzx-314-desk kernel: out_of_memory+0xf9/0x2e0 Dec 04 23:47:43 lzx-314-desk kernel: __alloc_pages_nodemask+0xd6c/0xed0 Dec 04 23:47:43 lzx-314-desk kernel: __read_swap_cache_async+0x11d/0x190 Dec 04 23:47:43 lzx-314-desk kernel: read_swap_cache_async+0x17/0x40 Dec 04 23:47:43 lzx-314-desk kernel: swapin_readahead+0x1f1/0x230 Dec 04 23:47:43 lzx-314-desk kernel: ? find_get_entry+0x19/0xf0 Dec 04 23:47:43 lzx-314-desk kernel: ? pagecache_get_page+0x27/0x210 Dec 04 23:47:43 lzx-314-desk kernel: do_swap_page+0x432/0x590 Dec 04 23:47:43 lzx-314-desk kernel: ? do_swap_page+0x432/0x590 Dec 04 23:47:43 lzx-314-desk kernel: ? poll_select_copy_remaining+0x120/0x120 Dec 04 23:47:43 lzx-314-desk kernel: __handle_mm_fault+0x33e/0xa20 Dec 04 23:47:43 lzx-314-desk kernel: handle_mm_fault+0x14a/0x1d0 Dec 04 23:47:43 lzx-314-desk kernel: __do_page_fault+0x212/0x440 Dec 04 23:47:43 lzx-314-desk kernel: page_fault+0x22/0x30 Dec 04 23:47:43 lzx-314-desk kernel: RIP: 0010:copy_user_generic_unrolled+0x89/0xc0 Dec 04 23:47:43 lzx-314-desk ke
Re: system OOM when using ghettoVCB script to backup VMs to nfs server (sync mode) running btrfs
On 2017-12-07 09:19, Taibai Li wrote:
> thanks for the quick response, I tried to test this on the 4.4.100
> kernel with quota disabled:
> # btrfs qgroup show /data/
> ERROR: can't list qgroups: quotas not enabled
>
> But it seems it still OOMs after about 7 hours, having copied 144G of
> files. Any other ideas? Maybe I will try to test with quota disabled
> on the 4.14 kernel too.

Trying the latest kernel is always a good idea.

Qgroups aside, I am not quite sure what the cause could be.

Is there any special mount option used?

Thanks,
Qu

> thanks.
>
> On Wed, Dec 6, 2017 at 2:25 PM, Qu Wenruo wrote:
>>
>> On 2017-12-06 14:22, taibai li wrote:
>>> Hi Guys,
>>>
>>> I hit the OOM issue with a box running a 4.4.x kernel, so I tried to
>>> build a 4.14.3 kernel to try that. The testbed is:
>>> a NAS box with 2G memory and a single-disk raid. I set up an nfs
>>> server in sync mode, added the storage on ESXi 6.0 servers and backed
>>> up all the VMs on it with the ghettoVCB script. After about 14 hours,
>>> an Input/Output error happened; checked the box, found OOM.
>>> # cat /etc/exports >>> "/data/Videos" >>> *(insecure,insecure_locks,no_subtree_check,crossmnt,anonuid=99,anongid=99,root_squash,rw,sync) >>> # uname -a >>> Linux lzx-314-desk 4.14.3.x86_64.1 #1 SMP Fri Dec 1 01:31:25 UTC 2017 >>> x86_64 GNU/Linux >>> # btrfs fi show /data/ >>> Label: '43f611ae:data' uuid: 6fefb319-a21d-476e-9642-565e0600a049 >>> Total devices 1 FS bytes used 292.78GiB >>> devid 1 size 1.81TiB used 296.02GiB path /dev/md127 >>> >>> The stack is : >>> Dec 04 23:47:43 lzx-314-desk kernel: nfsd: page allocation stalls for >>> 621031ms, order:0, mode:0x14000c0(GFP_KERNEL), nodemask=(null) >>> Dec 04 23:47:43 lzx-314-desk kernel: nfsd cpuset=/ mems_allowed=0 >>> Dec 04 23:47:43 lzx-314-desk kernel: CPU: 0 PID: 3376 Comm: nfsd Not >>> tainted 4.14.3.x86_64.1 #1 >>> Dec 04 23:47:43 lzx-314-desk kernel: Hardware name: NETGEAR ReadyNAS >>> 314/To be filled by O.E.M., BIOS 4.6.5 11/05/2013 >>> Dec 04 23:47:43 lzx-314-desk kernel: Call Trace: >>> Dec 04 23:47:43 lzx-314-desk kernel: dump_stack+0x4d/0x6a >>> Dec 04 23:47:43 lzx-314-desk kernel: warn_alloc+0xe3/0x180 >>> Dec 04 23:47:43 lzx-314-desk kernel: __alloc_pages_nodemask+0xb1e/0xed0 >>> Dec 04 23:47:43 lzx-314-desk kernel: svc_recv+0x99/0x900 >>> Dec 04 23:47:43 lzx-314-desk kernel: ? svc_process+0x241/0x690 >>> Dec 04 23:47:43 lzx-314-desk kernel: nfsd+0xd2/0x150 >>> Dec 04 23:47:43 lzx-314-desk kernel: kthread+0x11a/0x150 >>> Dec 04 23:47:43 lzx-314-desk kernel: ? nfsd_destroy+0x60/0x60 >>> Dec 04 23:47:43 lzx-314-desk kernel: ? 
kthread_create_on_node+0x40/0x40 >>> Dec 04 23:47:43 lzx-314-desk kernel: ret_from_fork+0x22/0x30 >>> Dec 04 23:47:43 lzx-314-desk kernel: readynasd invoked oom-killer: >>> gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, >>> oom_score_adj=-1000 >>> Dec 04 23:47:43 lzx-314-desk kernel: readynasd cpuset=/ mems_allowed=0 >>> Dec 04 23:47:43 lzx-314-desk kernel: CPU: 3 PID: 3307 Comm: readynasd >>> Not tainted 4.14.3.x86_64.1 #1 >>> Dec 04 23:47:43 lzx-314-desk kernel: Hardware name: NETGEAR ReadyNAS >>> 314/To be filled by O.E.M., BIOS 4.6.5 11/05/2013 >>> Dec 04 23:47:43 lzx-314-desk kernel: Call Trace: >>> Dec 04 23:47:43 lzx-314-desk kernel: dump_stack+0x4d/0x6a >>> Dec 04 23:47:43 lzx-314-desk kernel: dump_header+0x9a/0x21b >>> Dec 04 23:47:43 lzx-314-desk kernel: ? pick_next_task_fair+0x1d5/0x4b0 >>> Dec 04 23:47:43 lzx-314-desk kernel: ? security_capable_noaudit+0x40/0x60 >>> Dec 04 23:47:43 lzx-314-desk kernel: oom_kill_process+0x216/0x430 >>> Dec 04 23:47:43 lzx-314-desk kernel: out_of_memory+0xf9/0x2e0 >>> Dec 04 23:47:43 lzx-314-desk kernel: __alloc_pages_nodemask+0xd6c/0xed0 >>> Dec 04 23:47:43 lzx-314-desk kernel: __read_swap_cache_async+0x11d/0x190 >>> Dec 04 23:47:43 lzx-314-desk kernel: read_swap_cache_async+0x17/0x40 >>> Dec 04 23:47:43 lzx-314-desk kernel: swapin_readahead+0x1f1/0x230 >>> Dec 04 23:47:43 lzx-314-desk kernel: ? find_get_entry+0x19/0xf0 >>> Dec 04 23:47:43 lzx-314-desk kernel: ? pagecache_get_page+0x27/0x210 >>> Dec 04 23:47:43 lzx-314-desk kernel: do_swap_page+0x432/0x590 >>> Dec 04 23:47:43 lzx-314-desk kernel: ? do_swap_page+0x432/0x590 >>> Dec 04 23:47:43 lzx-314-desk kernel: ? 
>>> poll_select_copy_remaining+0x120/0x120 >>> Dec 04 23:47:43 lzx-314-desk kernel: __handle_mm_fault+0x33e/0xa20 >>> Dec 04 23:47:43 lzx-314-desk kernel: handle_mm_fault+0x14a/0x1d0 >>> Dec 04 23:47:43 lzx-314-desk kernel: __do_page_fault+0x212/0x440 >>> Dec 04 23:47:43 lzx-314-desk kernel: page_fault+0x22/0x30 >>> Dec 04 23:47:43 lzx-314-desk kernel: RIP: >>> 0010:copy_user_generic_unrolled+0x89/0xc0 >>> Dec 04 23:47:43 lzx-314-desk kernel: RSP: :c9cdbd70 EFLAGS: >>> 00010202 >>> Dec 04 23:47:43 lzx-314-desk kernel: RAX: RBX: >>> 0008 RCX: 0001 >>> Dec 04 23:47:43 lzx-314-desk kernel: RDX: RSI: >>> c9cdbdd8 RDI: 7ff45e7fba80 >>> Dec 04 23:47:43 lzx-314-de
Re: system OOM when using ghettoVCB script to backup VMs to nfs server (sync mode) running btrfs
thanks for the quick response, I tried to test this on the 4.4.100 kernel
with quota disabled:
# btrfs qgroup show /data/
ERROR: can't list qgroups: quotas not enabled

But it seems it still OOMs after about 7 hours, having copied 144G of
files. Any other ideas? Maybe I will try to test with quota disabled on
the 4.14 kernel too.

thanks.

On Wed, Dec 6, 2017 at 2:25 PM, Qu Wenruo wrote:
>
> On 2017-12-06 14:22, taibai li wrote:
>> Hi Guys,
>>
>> I hit the OOM issue with a box running a 4.4.x kernel, so I tried to
>> build a 4.14.3 kernel to try that. The testbed is:
>> a NAS box with 2G memory and a single-disk raid. I set up an nfs
>> server in sync mode, added the storage on ESXi 6.0 servers and backed
>> up all the VMs on it with the ghettoVCB script. After about 14 hours,
>> an Input/Output error happened; checked the box, found OOM.
>> # cat /etc/exports
>> "/data/Videos"
>> *(insecure,insecure_locks,no_subtree_check,crossmnt,anonuid=99,anongid=99,root_squash,rw,sync)
>> # uname -a
>> Linux lzx-314-desk 4.14.3.x86_64.1 #1 SMP Fri Dec 1 01:31:25 UTC 2017
>> x86_64 GNU/Linux
>> # btrfs fi show /data/
>> Label: '43f611ae:data' uuid: 6fefb319-a21d-476e-9642-565e0600a049
>> Total devices 1 FS bytes used 292.78GiB
>> devid 1 size 1.81TiB used 296.02GiB path /dev/md127
>>
>> The stack is:
>> Dec 04 23:47:43 lzx-314-desk kernel: nfsd: page allocation stalls for 621031ms, order:0, mode:0x14000c0(GFP_KERNEL), nodemask=(null)
>> Dec 04 23:47:43 lzx-314-desk kernel: nfsd cpuset=/ mems_allowed=0
>> Dec 04 23:47:43 lzx-314-desk kernel: CPU: 0 PID: 3376 Comm: nfsd Not tainted 4.14.3.x86_64.1 #1
>> Dec 04 23:47:43 lzx-314-desk kernel: Hardware name: NETGEAR ReadyNAS 314/To be filled by O.E.M., BIOS 4.6.5 11/05/2013
>> Dec 04 23:47:43 lzx-314-desk kernel: Call Trace:
>> Dec 04 23:47:43 lzx-314-desk kernel: dump_stack+0x4d/0x6a
>> Dec 04 23:47:43 lzx-314-desk kernel: warn_alloc+0xe3/0x180
>> Dec 04 23:47:43 lzx-314-desk kernel: __alloc_pages_nodemask+0xb1e/0xed0
>> Dec 04 23:47:43 lzx-314-desk kernel: svc_recv+0x99/0x900
>> Dec 04 23:47:43 lzx-314-desk kernel: ? svc_process+0x241/0x690
>> Dec 04 23:47:43 lzx-314-desk kernel: nfsd+0xd2/0x150
>> Dec 04 23:47:43 lzx-314-desk kernel: kthread+0x11a/0x150
>> Dec 04 23:47:43 lzx-314-desk kernel: ? nfsd_destroy+0x60/0x60
>> Dec 04 23:47:43 lzx-314-desk kernel: ? kthread_create_on_node+0x40/0x40
>> Dec 04 23:47:43 lzx-314-desk kernel: ret_from_fork+0x22/0x30
>> Dec 04 23:47:43 lzx-314-desk kernel: readynasd invoked oom-killer: gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=-1000
>> Dec 04 23:47:43 lzx-314-desk kernel: readynasd cpuset=/ mems_allowed=0
>> Dec 04 23:47:43 lzx-314-desk kernel: CPU: 3 PID: 3307 Comm: readynasd Not tainted 4.14.3.x86_64.1 #1
>> Dec 04 23:47:43 lzx-314-desk kernel: Hardware name: NETGEAR ReadyNAS 314/To be filled by O.E.M., BIOS 4.6.5 11/05/2013
>> Dec 04 23:47:43 lzx-314-desk kernel: Call Trace:
>> Dec 04 23:47:43 lzx-314-desk kernel: dump_stack+0x4d/0x6a
>> Dec 04 23:47:43 lzx-314-desk kernel: dump_header+0x9a/0x21b
>> Dec 04 23:47:43 lzx-314-desk kernel: ? pick_next_task_fair+0x1d5/0x4b0
>> Dec 04 23:47:43 lzx-314-desk kernel: ? security_capable_noaudit+0x40/0x60
>> Dec 04 23:47:43 lzx-314-desk kernel: oom_kill_process+0x216/0x430
>> Dec 04 23:47:43 lzx-314-desk kernel: out_of_memory+0xf9/0x2e0
>> Dec 04 23:47:43 lzx-314-desk kernel: __alloc_pages_nodemask+0xd6c/0xed0
>> Dec 04 23:47:43 lzx-314-desk kernel: __read_swap_cache_async+0x11d/0x190
>> Dec 04 23:47:43 lzx-314-desk kernel: read_swap_cache_async+0x17/0x40
>> Dec 04 23:47:43 lzx-314-desk kernel: swapin_readahead+0x1f1/0x230
>> Dec 04 23:47:43 lzx-314-desk kernel: ? find_get_entry+0x19/0xf0
>> Dec 04 23:47:43 lzx-314-desk kernel: ? pagecache_get_page+0x27/0x210
>> Dec 04 23:47:43 lzx-314-desk kernel: do_swap_page+0x432/0x590
>> Dec 04 23:47:43 lzx-314-desk kernel: ? do_swap_page+0x432/0x590
>> Dec 04 23:47:43 lzx-314-desk kernel: ? poll_select_copy_remaining+0x120/0x120
>> Dec 04 23:47:43 lzx-314-desk kernel: __handle_mm_fault+0x33e/0xa20
>> Dec 04 23:47:43 lzx-314-desk kernel: handle_mm_fault+0x14a/0x1d0
>> Dec 04 23:47:43 lzx-314-desk kernel: __do_page_fault+0x212/0x440
>> Dec 04 23:47:43 lzx-314-desk kernel: page_fault+0x22/0x30
>> Dec 04 23:47:43 lzx-314-desk kernel: RIP: 0010:copy_user_generic_unrolled+0x89/0xc0
>> Dec 04 23:47:43 lzx-314-desk kernel: RSP: :c9cdbd70 EFLAGS: 00010202
>> Dec 04 23:47:43 lzx-314-desk kernel: RAX: RBX: 0008 RCX: 0001
>> Dec 04 23:47:43 lzx-314-desk kernel: RDX: RSI: c9cdbdd8 RDI: 7ff45e7fba80
>> Dec 04 23:47:43 lzx-314-desk kernel: RBP: c9cdbee8 R08: R09: 0104
>> Dec 04 23:47:43 lzx-314-desk kernel: R10: c9cdbd78 R11: 0104 R12:
>> Dec 04 23:47:43 lzx-314-desk kernel: R13: c9cdbdc0 R14: 7ff45e7fba80 R15: c9cdbdc0
>> Dec 04 23
Re: [PATCH] fstests: btrfs: Add test case to check if btrfs can handle full fs trim correctly
Ping.

Any comment on this?

Thanks,
Qu

On 2017-11-29 14:14, Qu Wenruo wrote:
> Ancient commit f4c697e6406d ("btrfs: return EINVAL if start > total_bytes in
> fitrim ioctl") introduced a regression where btrfs may fail to trim any
> free space in existing block groups.
>
> It's caused by confusion between btrfs_super_block->total_bytes and the
> btrfs logical address space.
> Unlike a physical address, any aligned bytenr in the range [0, U64_MAX) is
> valid in the btrfs logical address space, and it's the chunk mapping
> mechanism of btrfs that handles the logical<->physical mapping.
>
> The test case will craft a btrfs with the following features:
> 0) Single data/meta profile
>    Makes trimmed-bytes reporting and chunk allocation more predictable.
>
> 1) All chunks start beyond super_block->total_bytes (1G)
>    By relocating these blocks several times.
>
> 2) Unallocated space is less than 50% of the whole fs
>
> 3) Fragmented data chunks
>    Data chunks will be full of fragments; 50% of data chunks will be
>    free space.
>
> So in theory fstrim should be able to trim over 50% of the space of the fs
> (after the fix, 64% of the fs can be trimmed), while the regression makes
> btrfs only able to trim unallocated space, which is less than 50% of the
> total space (without the fix, it's only 31%).
>
> Fixed by the patch named "btrfs: Ensure btrfs_trim_fs can trim the whole fs".
>
> Signed-off-by: Qu Wenruo
> ---
>  tests/btrfs/155     | 120
>  tests/btrfs/155.out |   2 +
>  tests/btrfs/group   |   1 +
>  3 files changed, 123 insertions(+)
>  create mode 100755 tests/btrfs/155
>  create mode 100644 tests/btrfs/155.out
>
> diff --git a/tests/btrfs/155 b/tests/btrfs/155
> new file mode 100755
> index ..6918f093
> --- /dev/null
> +++ b/tests/btrfs/155
> @@ -0,0 +1,120 @@
> +#! /bin/bash
> +# FS QA Test 155
> +#
> +# Check if btrfs can correctly trim free space in block groups
> +#
> +# An ancient regression prevents btrfs from trimming free space inside
> +# existing block groups, if the bytenr of a block group starts beyond
> +# btrfs_super_block->total_bytes.
> +# However all bytenrs in btrfs are in the btrfs logical address space,
> +# where any bytenr in range [0, U64_MAX] is valid.
> +#
> +# Fixed by patch named "btrfs: Ensure btrfs_trim_fs can trim the whole fs".
> +#
> +#---
> +# Copyright (c) 2017 SUSE Linux Products GmbH. All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1	# failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> +	cd /
> +	rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +_require_fstrim
> +
> +# 1024fs size
> +fs_size=$((1024 * 1024 * 1024))
> +
> +# Use small files to fill half of the fs
> +file_size=$(( 1024 * 1024 ))
> +nr_files=$(( $fs_size / $file_size / 2))
> +
> +# Force to use single data and meta profile.
> +# Since the test relies on fstrim output, which will differ for different
> +# profiles
> +_scratch_mkfs -b $fs_size -m single -d single > /dev/null
> +_scratch_mount
> +
> +_require_batched_discard "$SCRATCH_MNT"
> +
> +for n in $(seq -w 0 $(( $nr_files - 1))); do
> +	$XFS_IO_PROG -f -c "pwrite 0 $file_size" "$SCRATCH_MNT/file_$n" \
> +		> /dev/null
> +done
> +
> +# Flush all buffer data into disk, to trigger chunk allocation
> +sync
> +
> +# Now we have taken at least 50% of the filesystem, relocate all chunks twice
> +# so all chunks will start after 1G in logical space.
> +# (Btrfs chunk allocation will not rewind to reuse lower space)
> +_run_btrfs_util_prog balance start --full-balance "$SCRATCH_MNT"
> +
> +# To avoid possible false ENOSPC alert on v4.15-rc1, seems to be a
> +# reserved space related bug (maybe related to outstanding space rework?),
> +# but that's another story.
> +sync
> +
> +_run_btrfs_util_pro
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Wed, Dec 06, 2017 at 06:06:48AM -0800, Matthew Wilcox wrote: > On Wed, Dec 06, 2017 at 07:44:04PM +1100, Dave Chinner wrote: > > On Tue, Dec 05, 2017 at 08:45:49PM -0800, Matthew Wilcox wrote: > > > That said, using xa_cmpxchg() in the dquot code looked like the right > > > thing to do? Since we'd dropped the qi mutex and the ILOCK, it looks > > > entirely reasonable for another thread to come in and set up the dquot. > > > But I'm obviously quite ignorant of the XFS internals, so maybe there's > > > something else going on that makes this essentially a "can't happen". > > > > It's no different to the inode cache code, which drops the RCU > > lock on lookup miss, instantiates the new inode (maybe reading it > > off disk), then locks the tree and attempts to insert it. Both cases > > use "insert if empty, otherwise retry lookup from start" semantics. > > Ah. I had my focus set a little narrow on the inode cache code and didn't > recognise the pattern. > > Why do you sleep for one jiffy after encountering a miss, then seeing > someone else insert the inode for you? The sleep is a backoff that allows whatever we raced with to complete, be it a hit that raced with an inode being reclaimed and removed, or a miss that raced with another insert. Ideally we'd sleep on the XFS_INEW bit, similar to the vfs I_NEW flag, but it's not quite that simple with the reclaim side of things... > > cmpxchg is for replacing a known object in a store - it's not really > > intended for doing initial inserts after a lookup tells us there is > > nothing in the store. The radix tree "insert only if empty" makes > > sense here, because it naturally takes care of lookup/insert races > > via the -EEXIST mechanism. 
> > > > I think that providing xa_store_excl() (which would return -EEXIST > > if the entry is not empty) would be a better interface here, because > > it matches the semantics of lookup cache population used all over > > the kernel > > I'm not thrilled with xa_store_excl(), but I need to think about that > a bit more. Not fussed about the name - I just think we need a function that matches the insert semantics of the code > > > I'm quite happy to have normal API variants that don't save/restore > > > interrupts. Just need to come up with good names ... I don't think > > > xa_store_noirq() is a good name, but maybe you do? > > > > I'd prefer not to have to deal with such things at all. :P > > > > How many subsystems actually require irq safety in the XA locking > > code? Make them use irqsafe versions, not make everyone else use > > "noirq" versions, as is the convention for the rest of the kernel > > code > > Hard to say how many existing radix tree users require the irq safety. The mapping tree requires it because it gets called from IO completion contexts to clear page writeback state, but I don't know about any of the others. > Also hard to say how many potential users (people currently using > linked lists, people using resizable arrays, etc) need irq safety. > My thinking was "make it safe by default and let people who know better > have a way to opt out", but there's definitely something to be said for > "make it fast by default and let people who need the unusual behaviour > type those extra few letters". > > So, you're arguing for providing xa_store(), xa_store_irq(), xa_store_bh() > and xa_store_irqsafe()? (at least on demand, as users come to light?) > At least the read side doesn't require any variants; everybody can use > RCU for read side protection. 
That would follow the pattern of the rest of the kernel APIs, though I think it might be cleaner to simply state the locking requirement to xa_init() and keep all those details completely internal rather than encoding them into API calls. After all, the "irqsafe-ness" of the locking needs to be consistent across the entire XA instance > ("safe", not "save" because I wouldn't make the caller provide the > "flags" argument). > > > > At least, not today. One of the future plans is to allow xa_nodes to > > > be allocated from ZONE_MOVABLE. In order to do that, we have to be > > > able to tell which lock protects any given node. With the XArray, > > > we can find that out (xa_node->root->xa_lock); with the radix tree, > > > we don't even know what kind of lock protects the tree. > > > > Yup, this is a prime example of why we shouldn't be creating > > external dependencies by smearing the locking context outside the XA > > structure itself. It's not a stretch to see something like a > > ZONE_MOVEABLE dependency because some other object indexed in a XA > > is stored in the same page as the xa_node that points to it, and > > both require the same xa_lock to move/update... > > That is a bit of a stretch. Christoph Lameter and I had a discussion about it > here: https://www.spinics.net/lists/linux-mm/msg122902.html > > There's no situation where you need to acquire two locks in order to > free an object; ZONE_MOVEABLE is for m
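Dave's suggestion above — stating the locking requirement once at init time rather than via per-call variants — might look like this (hypothetical sketch; neither this flag nor an init-time variant appears in the posted v4 series):

```c
/*
 * Hypothetical: declare once, at initialization, that this XArray's
 * lock must be irq-safe, and let xa_store() pick the right locking
 * internally.  XA_FLAGS_LOCK_IRQ and xa_init_flags() are assumed
 * names, not part of the posted patches.
 */
xa_init_flags(&mapping->pages, XA_FLAGS_LOCK_IRQ);

/* Callers then never choose between _irq/_bh/_irqsafe variants. */
xa_store(&mapping->pages, index, page, GFP_NOFS);
```

This keeps the irqsafe-ness consistent across the whole instance by construction, which is exactly the invariant the per-call-variant scheme would have to maintain by convention.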
Re: [PATCH] btrfs: tree-checker: use %zu format string for size_t
On 2017-12-06 22:18, Arnd Bergmann wrote: > The return value of sizeof() is of type size_t, so we must print it > using the %z format modifier rather than %l to avoid this warning > on some architectures: > > fs/btrfs/tree-checker.c: In function 'check_dir_item': > fs/btrfs/tree-checker.c:273:50: error: format '%lu' expects argument of type > 'long unsigned int', but argument 5 has type 'u32' {aka 'unsigned int'} > [-Werror=format=] Any idea about which architecture will cause such warning? On x86_64 I always fail to get such warning. > > Fixes: 005887f2e3e0 ("btrfs: tree-checker: Add checker for dir item") Reviewed-by: Qu Wenruo Thanks, Qu > Signed-off-by: Arnd Bergmann > --- > fs/btrfs/tree-checker.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c > index 66dac0a4b01f..7c55e3ba5a6c 100644 > --- a/fs/btrfs/tree-checker.c > +++ b/fs/btrfs/tree-checker.c > @@ -270,7 +270,7 @@ static int check_dir_item(struct btrfs_root *root, > /* header itself should not cross item boundary */ > if (cur + sizeof(*di) > item_size) { > dir_item_err(root, leaf, slot, > - "dir item header crosses item boundary, have %lu boundary %u", > + "dir item header crosses item boundary, have %zu boundary %u", > cur + sizeof(*di), item_size); > return -EUCLEAN; > } >
Re: [PATCH 3/3] Btrfs: make raid6 rebuild retry more
On Wed, Dec 06, 2017 at 08:11:30AM +0800, Qu Wenruo wrote: > > > On 2017年12月06日 06:55, Liu Bo wrote: > > On Tue, Dec 05, 2017 at 07:09:25PM +0100, David Sterba wrote: > >> On Tue, Dec 05, 2017 at 04:07:35PM +0800, Qu Wenruo wrote: > @@ -2166,11 +2166,21 @@ int raid56_parity_recover(struct btrfs_fs_info > *fs_info, struct bio *bio, > } > > /* > - * reconstruct from the q stripe if they are > - * asking for mirror 3 > + * Loop retry: > + * for 'mirror == 2', reconstruct from all other stripes. > >>> > >>> What about using macro to makes the reassemble method more human readable? > >> > >> Yeah, that's definetelly needed and should be based on > >> BTRFS_MAX_MIRRORS, not just hardcoded to 3. > > > > OK. > > > > In case of raid5/6, BTRFS_MAX_MIRRORS is an abused name, it's more a > > raid1/10 concept, either BTRFS_RAID56_FULL_REBUILD or > > BTRFS_RAID56_FULL_CHK is better to me, which one do you guys like? > > For mirror > 2 case, the mirror_num is no longer a single indicator, but > a ranged iterator for later rebuild retries. > > Something like set_raid_fail_from_mirror_num() seems better to me. I feel like having the logic open-code'd plus necessary comments is probably better as we can see which stripe failb is. For those who are new to this raid6 retry logic are likely to feel confused by a helper function like set_raid_fail_from_mirror_num() and will go check the helper function about the retry logic. Thanks, -liubo -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 00/73] XArray version 4
On Wed, Dec 06, 2017 at 04:58:29PM -0700, Ross Zwisler wrote: > Maybe I missed this from a previous version, but can you explain the > motivation for replacing the radix tree with an xarray? (I think this should > probably still be part of the cover letter?) Do we have a performance problem > we need to solve? A code complexity issue we need to solve? Something else? Sure! Something else I screwed up in the v4 announcement ... I'll need it again for v5, so here's a quick update of the v1 announcement's justification: I wrote the xarray to replace the radix tree with a better API based on observing how programmers are currently using the radix tree, and on how (and why) they aren't. Conceptually, an xarray is an array of ULONG_MAX pointers which is initially full of NULL pointers. Improvements the xarray has over the radix tree: - The radix tree provides operations like other trees do; 'insert' and 'delete'. But what users really want is an automatically resizing array, and so it makes more sense to give users an API that is like an array -- 'load' and 'store'. - Locking is part of the API. This simplifies a lot of users who formerly had to manage their own locking just for the radix tree. It also improves code generation as we can now tell RCU that we're holding a lock and it doesn't need to generate as much fencing code. The other advantage is that tree nodes can be moved (not yet implemented). - GFP flags are now parameters to calls which may need to allocate memory. The radix tree forced users to decide what the allocation flags would be at creation time. It's much clearer to specify them at allocation time. I know the MM people disapprove of the radix tree using the top bits of the GFP flags for its own purpose, so they'll like this aspect. - Memory is not preloaded; we don't tie up dozens of pages on the off chance that the slab allocator fails. Instead, we drop the lock, allocate a new node and retry the operation. 
- The xarray provides a conditional-replace operation. The radix tree forces users to roll their own (and at least four have). - Iterators now take a 'max' parameter. That simplifies many users and will reduce the amount of iteration done. - Iteration can proceed backwards. We only have one user for this, but since it's called as part of the pagefault readahead algorithm, that seemed worth mentioning. - RCU-protected pointers are not exposed as part of the API. There are some fun bugs where the page cache forgets to use rcu_dereference() in the current codebase. - Any function which wants it can now call the update_node() callback. There were a few places missing that I noticed as part of this rewrite. - Exceptional entries may now be BITS_PER_LONG-1 in size, rather than the BITS_PER_LONG-2 that they had in the radix tree. That gives us the extra bit we need to put huge page swap entries in the page cache. The API comes in two parts, normal and advanced. The normal API takes care of the locking and memory allocation for you. You can get the value of a pointer by calling xa_load() and set the value of a pointer by calling xa_store(). You can conditionally update the value of a pointer by calling xa_cmpxchg(). Each pointer which isn't NULL can be tagged with up to 3 bits of extra information, accessed through xa_get_tag(), xa_set_tag() and xa_clear_tag(). You can copy batches of pointers out of the array by calling xa_get_entries() or xa_get_tagged(). You can iterate over pointers in the array by calling xa_find(), xa_find_after() or xa_for_each(). The advanced API allows users to build their own operations. You have to take care of your own locking and handle memory allocation failures. Most of the advanced operations are based around the xa_state which keeps state between sub-operations. Read the xarray.h header file for more information on the advanced API, and see the implementation of the normal API for examples of how to use the advanced API. 
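As a rough usage sketch of the normal API described above (argument order inferred from the posted v4 patches — treat the exact signatures as provisional until the series lands):

```c
/*
 * Provisional sketch of the normal XArray API per the cover letter;
 * signatures follow the posted v4 series and may still change.
 */
static void xarray_example(struct xarray *xa, struct page *page)
{
	void *entry, *old;

	/* store overwrites whatever is there, like an assignment */
	xa_store(xa, 42, page, GFP_KERNEL);

	/* conditional replace: succeeds only if index 42 still holds 'page' */
	old = xa_cmpxchg(xa, 42, page, NULL, GFP_KERNEL);

	/* loads are RCU-protected and need no external locking */
	entry = xa_load(xa, 42);

	/* each non-NULL entry carries up to three tag bits */
	xa_set_tag(xa, 42, XA_TAG_0);
	if (xa_get_tag(xa, 42, XA_TAG_0))
		xa_clear_tag(xa, 42, XA_TAG_0);
}
```

Note how no GFP flags are baked in at creation time and no locks appear at the call sites — both differences from the radix tree that the list above calls out.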
Those familiar with the radix tree may notice certain similarities between the implementation of the xarray and the radix tree. That's entirely intentional, but the implementation will certainly adapt in the future. For example, one of the impediments I see to using xarrays instead of kvmalloced arrays is memory consumption, so I have a couple of ideas to reduce memory usage for smaller arrays. I have reimplemented the IDR and the IDA based on the xarray. They are roughly the same complexity as they were when implemented on top of the radix tree (although much less intertwined). When converting code from the radix tree to the xarray, the biggest thing to bear in mind is that 'store' overwrites anything which happens to be in the xarray. Just like the assignment operator. The equivalent to the insert operation is to replace NULL with the new value. A quick reference guide to help when converting radix tree code. Functions which
Re: [PATCH v4 00/73] XArray version 4
On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote: > From: Matthew Wilcox > > I looked through some notes and decided this was version 4 of the XArray. > Last posted two weeks ago, this version includes a *lot* of changes. > I'd like to thank Dave Chinner for his feedback, encouragement and > distracting ideas for improvement, which I'll get to once this is merged. > > Highlights: > - Over 2000 words of documentation in patch 8! And lots more kernel-doc. > - The page cache is now fully converted to the XArray. > - Many more tests in the test-suite. > > This patch set is not for applying. 0day is still reporting problems, > and I'd feel bad for eating someone's data. These patches apply on top > of a set of preparatory patches which just aren't interesting. If you > want to see the patches applied to a tree, I suggest pulling my git tree: > http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-12-04 > I also left out the idr_preload removals. They're still in the git tree, > but I'm not looking for feedback on them. Hey Matthew, Maybe I missed this from a previous version, but can you explain the motivation for replacing the radix tree with an xarray? (I think this should probably still be part of the cover letter?) Do we have a performance problem we need to solve? A code complexity issue we need to solve? Something else? - Ross
Re: [PATCH] fstests: btrfs: reproduce a read failure on raid6 setup
On Wed, Dec 06, 2017 at 04:56:11PM +0800, Eryu Guan wrote: > On Mon, Dec 04, 2017 at 03:33:23PM -0700, Liu Bo wrote: > > This test case is to reproduce a bug of raid6 reconstruction process. > > > > The kernel fix are > > Btrfs: do not merge rbios if their fail stripe index are not identical > > Btrfs: make raid6 rebuild retry more > > > > Signed-off-by: Liu Bo > > Test failed as expected with v4.15-rc2 kernel and passed with above > patches applied. > > I fixed two trailing whitespace issues and some of the indentions to > avoid too-long line, also replaced 'repair' group with a new 'raid' > group. I think 'repair' group is meant for testing filesystem repair > tools, not the internal raid repair mechanism, so a new 'raid' group > looks better to me, as we're having more raid-specific tests recently. Sounds good to me, thanks for the efforts. Thanks, -liubo
Re: [PATCH v2 06/11] writeback: add counters for metadata usage
On Wed, Dec 06, 2017 at 03:18:35PM -0500, Josef Bacik wrote: > On Mon, Dec 04, 2017 at 02:06:30PM +0100, Jan Kara wrote: > > On Wed 22-11-17 16:16:01, Josef Bacik wrote: > > > diff --git a/mm/util.c b/mm/util.c > > > index 34e57fae959d..681d62631ee0 100644 > > > --- a/mm/util.c > > > +++ b/mm/util.c > > > @@ -616,6 +616,7 @@ int __vm_enough_memory(struct mm_struct *mm, long > > > pages, int cap_sys_admin) > > > if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) { > > > free = global_zone_page_state(NR_FREE_PAGES); > > > free += global_node_page_state(NR_FILE_PAGES); > > > + free += global_node_page_state(NR_METADATA_BYTES) >> PAGE_SHIFT; > > > > > > I'm not really sure this is OK. It depends on whether mm is really able to > > reclaim these pages easily enough... Summon mm people for help :) > > > > Well we count NR_SLAB_RECLAIMABLE here, so it's no different than that. The > point is that it's theoretically reclaimable, and we should at least try. I agree with including metadata in the equation. The (somewhat dusty) overcommit code is mostly for containing swap storms, which is why it adds up all the reclaimable pools that aren't backed by swap. The metadata pool belongs in that category. 
> > > @@ -3812,6 +3813,7 @@ static inline unsigned long > > > node_unmapped_file_pages(struct pglist_data *pgdat) > > > static unsigned long node_pagecache_reclaimable(struct pglist_data > > > *pgdat) > > > { > > > unsigned long nr_pagecache_reclaimable; > > > + unsigned long nr_metadata_reclaimable; > > > unsigned long delta = 0; > > > > > > /* > > > @@ -3833,7 +3835,20 @@ static unsigned long > > > node_pagecache_reclaimable(struct pglist_data *pgdat) > > > if (unlikely(delta > nr_pagecache_reclaimable)) > > > delta = nr_pagecache_reclaimable; > > > > > > - return nr_pagecache_reclaimable - delta; > > > + nr_metadata_reclaimable = > > > + node_page_state(pgdat, NR_METADATA_BYTES) >> PAGE_SHIFT; > > > + /* > > > + * We don't do writeout through the shrinkers so subtract any > > > + * dirty/writeback metadata bytes from the reclaimable count. > > > + */ > > > + if (nr_metadata_reclaimable) { > > > + unsigned long unreclaimable = > > > + node_page_state(pgdat, NR_METADATA_DIRTY_BYTES) + > > > + node_page_state(pgdat, NR_METADATA_WRITEBACK_BYTES); > > > + unreclaimable >>= PAGE_SHIFT; > > > + nr_metadata_reclaimable -= unreclaimable; > > > + } > > > + return nr_metadata_reclaimable + nr_pagecache_reclaimable - delta; > > > } > > > > Ditto as with __vm_enough_memory(). In particular I'm unsure whether the > > watermarks like min_unmapped_pages or min_slab_pages would still work as > > designed. > > > > Yeah agreed I'd like an MM person's thoughts on this as well. We don't count > SLAB_RECLAIMABLE here, but that's because it's just not related to pagecache. > I > guess it only matters for node reclaim and we have our node reclaim stuff > turned > off, which means it doesn't help us anyway, so I'm happy to just drop it and > let > somebody who cares about node reclaim think about it later ;). 
Thanks, Few people care about node reclaim at this point, see 4f9b16a64753 ("mm: disable zone_reclaim_mode by default"), and it's honestly a bit baffling why we made min_slab_ratio a tunable in the first place. Who knows how/if anybody relies on that behavior. I'd just leave it alone.
Re: [PATCH v2 06/11] writeback: add counters for metadata usage
On Mon, Dec 04, 2017 at 02:06:30PM +0100, Jan Kara wrote: > On Wed 22-11-17 16:16:01, Josef Bacik wrote: > > From: Josef Bacik > > > > Btrfs has no bounds except memory on the amount of dirty memory that we > > have in > > use for metadata. Historically we have used a special inode so we could > > take > > advantage of the balance_dirty_pages throttling that comes with using > > pagecache. > > However as we'd like to support different blocksizes it would be nice to not > > have to rely on pagecache, but still get the balance_dirty_pages throttling > > without having to do it ourselves. > > > > So introduce *METADATA_DIRTY_BYTES and *METADATA_WRITEBACK_BYTES. These are > > zone and bdi_writeback counters to keep track of how many bytes we have in > > flight for METADATA. We need to count in bytes as blocksizes could be > > percentages of pagesize. We simply convert the bytes to number of pages > > where > > it is needed for the throttling. > > > > Also introduce NR_METADATA_BYTES so we can keep track of the total amount of > > pages used for metadata on the system. This is also needed so things like > > dirty > > throttling know that this is dirtyable memory as well and easily reclaimed. > > I'll defer to mm guys for final decision but the fact is the memory for > metadata is likely to be allocated from some slab cache and that actually > goes against the 'easily reclaimed' statement. Granted these are going to > be relatively large objects (1k at least I assume) so fragmentation issues > are not as bad but still getting actual free pages out of slab cache isn't > that easy... More on this below. 
> > > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h > > index 356a814e7c8e..fd516a0f0bfe 100644 > > --- a/include/linux/mmzone.h > > +++ b/include/linux/mmzone.h > > @@ -179,6 +179,9 @@ enum node_stat_item { > > NR_VMSCAN_IMMEDIATE,/* Prioritise for reclaim when writeback ends */ > > NR_DIRTIED, /* page dirtyings since bootup */ > > NR_WRITTEN, /* page writings since bootup */ > > + NR_METADATA_DIRTY_BYTES,/* Metadata dirty bytes */ > > + NR_METADATA_WRITEBACK_BYTES,/* Metadata writeback bytes */ > > + NR_METADATA_BYTES, /* total metadata bytes in use. */ > > NR_VM_NODE_STAT_ITEMS > > }; > > I think you didn't address my comment from last version of the series. > > 1) Per-cpu node-stat batching will be basically useless for these counters > as the batch size is <128. Maybe we don't care but it would deserve a > comment. > > 2) These counters are tracked in atomic_long_t type. That means max 2GB of > metadata on 32-bit machines. I *guess* that should be OK since you would > not be able to address that much of slab cache on such machine anyway but > still worth a comment I think. > You're right I missed this, sorry about that. I've resolved the batching problem, and I'll add a comment about the 32bit machines problem. > > diff --git a/mm/util.c b/mm/util.c > > index 34e57fae959d..681d62631ee0 100644 > > --- a/mm/util.c > > +++ b/mm/util.c > > @@ -616,6 +616,7 @@ int __vm_enough_memory(struct mm_struct *mm, long > > pages, int cap_sys_admin) > > if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) { > > free = global_zone_page_state(NR_FREE_PAGES); > > free += global_node_page_state(NR_FILE_PAGES); > > + free += global_node_page_state(NR_METADATA_BYTES) >> PAGE_SHIFT; > > > I'm not really sure this is OK. It depends on whether mm is really able to > reclaim these pages easily enough... Summon mm people for help :) > Well we count NR_SLAB_RECLAIMABLE here, so it's no different than that. 
The point is that it's theoretically reclaimable, and we should at least try. > > diff --git a/mm/vmscan.c b/mm/vmscan.c > > index 13d711dd8776..415b003e475c 100644 > > --- a/mm/vmscan.c > > +++ b/mm/vmscan.c > > @@ -225,7 +225,8 @@ unsigned long pgdat_reclaimable_pages(struct > > pglist_data *pgdat) > > > > nr = node_page_state_snapshot(pgdat, NR_ACTIVE_FILE) + > > node_page_state_snapshot(pgdat, NR_INACTIVE_FILE) + > > -node_page_state_snapshot(pgdat, NR_ISOLATED_FILE); > > +node_page_state_snapshot(pgdat, NR_ISOLATED_FILE) + > > +(node_page_state_snapshot(pgdat, NR_METADATA_BYTES) >> PAGE_SHIFT); > > > > if (get_nr_swap_pages() > 0) > > nr += node_page_state_snapshot(pgdat, NR_ACTIVE_ANON) + > > Just drop this hunk. The function is going away (and is currently unused). > Will do. > > @@ -3812,6 +3813,7 @@ static inline unsigned long > > node_unmapped_file_pages(struct pglist_data *pgdat) > > static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat) > > { > > unsigned long nr_pagecache_reclaimable; > > + unsigned long nr_metadata_reclaimable; > > unsigned long delta = 0; > > > > /* > > @@ -3833,7 +3835,20 @@ static unsigned long > > node_pagecache_reclaimable(struct pglist_data *pgdat) > >
Re: [PATCH] btrfs: Handle btrfs_set_extent_delalloc failure
On Wed, Dec 06, 2017 at 09:27:50AM +0200, Nikolay Borisov wrote: > On 5.12.2017 19:50, David Sterba wrote: > > On Tue, Dec 05, 2017 at 04:10:59PM +0800, Qu Wenruo wrote: > >> > >> > >> On 2017-12-05 15:29, Nikolay Borisov wrote: > >>> This function was introduced by 247e743cbe6e ("Btrfs: Use async helpers > >>> to deal > >>> with pages that have been improperly dirtied") and it didn't do any error > >>> handling then. This function might very well fail in ENOMEM situation, yet > >>> it's not handled, this could lead to inconsistent state. So let's handle > >>> the > >>> failure by setting the mapping error bit. > >>> > >>> Signed-off-by: Nikolay Borisov > >>> Cc: sta...@vger.kernel.org > >> > >> Reviewed-by: Qu Wenruo > >> > >> That's the only missing one. Nice catch. > > > > You mean the only unhandled call of btrfs_set_extent_delalloc? There's > > one more in relocate_file_extent_cluster. > > I'd prefer this call site be handled in a separate patch. Agreed, just that Qu indicated there's no such call site :)
Re: [PATCH] btrfs: put btrfs_ioctl_vol_args_v2 related defines together
On Wed, Dec 06, 2017 at 11:40:10AM +0800, Anand Jain wrote: > Just a code spatial rearrangement, no functional change. > > Signed-off-by: Anand Jain Reviewed-by: David Sterba
[PATCH 23/45] fs: btrfs: remove duplicate includes
These duplicate includes have been found with scripts/checkincludes.pl but they have been removed manually to avoid removing false positives. Signed-off-by: Pravin Shedge --- fs/btrfs/super.c | 1 - 1 file changed, 1 deletion(-) diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c index 3a4dce1..b34fe15 100644 --- a/fs/btrfs/super.c +++ b/fs/btrfs/super.c @@ -61,7 +61,6 @@ #include "tests/btrfs-tests.h" #include "qgroup.h" -#include "backref.h" #define CREATE_TRACE_POINTS #include -- 2.7.4
Re: [PATCH 5/5] btrfs: Greatly simplify btrfs_read_dev_super
On Mon, Dec 04, 2017 at 06:20:09PM +0200, Nikolay Borisov wrote: > >> I don't understand what problem *should* be solved here... > > > > Without any further checks and validation, we cannot simply iterate over > > all superblocks and try to mount anything that looks ok. Even if the > > offsets exist on the block device. The fsid should be at least checked, > > but that's still not enough to ensure the 1st copy is what we want to > > mount. > > I'm more inclined to agree with Anand here, that if the users wants to > mount with -t btrfs then the filesystem should assume it's correct to > use any of the additional superblocks. If and only if the additional superblocks are valid. And if we can't read the primary superblock, we don't have enough information to establish the validation. EIO? We don't have anything. CRC mismatch? Can't trust the whole data. We need FSID and total_size at least. Other actions are limited from inside kernel. > Otherwise we should explicitly > state somewhere that the superblock copies are there only for the sake > of the user-space tools, in which case this patch can go in, albeit with > some rewording. Unless there's something I'm missing to address the concerns above, I'm for (an updated version) of your patch.
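The minimal validation David describes — checksum, then FSID, then total size, before trusting a copy — could be sketched as follows. This is an illustrative helper only, not code from the patch: the fsid and total_bytes fields do exist in struct btrfs_super_block, but csum_matches() and the 'expected' argument are hypothetical.

```c
/*
 * Illustrative sketch of the minimal checks discussed above before a
 * backup superblock copy may be trusted for mounting.  csum_matches()
 * and 'expected' are hypothetical; fsid and total_bytes are real
 * fields of struct btrfs_super_block.
 */
static bool backup_sb_usable(const struct btrfs_super_block *sb,
			     const struct btrfs_super_block *expected)
{
	/* a CRC mismatch means none of the fields can be trusted */
	if (!csum_matches(sb))
		return false;

	/* the FSID must match the filesystem we were asked to mount */
	if (memcmp(sb->fsid, expected->fsid, BTRFS_FSID_SIZE) != 0)
		return false;

	/*
	 * A stale total_bytes betrays a leftover superblock from a
	 * previous, differently-sized filesystem on this device.
	 */
	if (sb->total_bytes != expected->total_bytes)
		return false;

	return true;
}
```

The open question in the thread is where 'expected' comes from when the primary superblock itself is unreadable, which is exactly why the kernel-side options are limited.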
[PATCH v8 0/5] Add the ability to do BPF directed error injection
Jon noticed that I had a typo in my _ASM_KPROBE_ERROR_INJECT macro. I went to figure out why the compiler didn't catch it and it's because it was not used anywhere. I had copied it from the trace blacklist code without understanding where it was used as cscope didn't find the original macro I was looking for, so I assumed it was some voodoo and left it in place. Turns out cscope failed me and I didn't need the macro at all, the trace blacklist thing I was looking at was for marking assembly functions as blacklisted and I have no intention of marking assembly functions as error injectable at the moment. v7->v8: - removed the _ASM_KPROBE_ERROR_INJECT since it was not needed. v6->v7: - moved the opt-in macro to bpf.h out of kprobes.h. v5->v6: - add BPF_ALLOW_ERROR_INJECTION() tagging for functions that will support this feature. This way only functions that opt-in will be allowed to be overridden. - added a btrfs patch to allow error injection for open_ctree() so that the bpf sample actually works. v4->v5: - disallow kprobe_override programs from being put in the prog map array so we don't tail call into something we didn't check. This allows us to make the normal path still fast without a bunch of percpu operations. v3->v4: - fix a build error found by kbuild test bot (I didn't wait long enough apparently.) - Added a warning message as per Daniels suggestion. v2->v3: - added a ->kprobe_override flag to bpf_prog. - added some sanity checks to disallow attaching bpf progs that have ->kprobe_override set that aren't for ftrace kprobes. - added the trace_kprobe_ftrace helper to check if the trace_event_call is a ftrace kprobe. - renamed bpf_kprobe_state to bpf_kprobe_override, fixed it so we only read this value in the kprobe path, and thus only write to it if we're overriding or clearing the override. v1->v2: - moved things around to make sure that bpf_override_return could really only be used for an ftrace kprobe. - killed the special return values from trace_call_bpf. 
- renamed pc_modified to bpf_kprobe_state so bpf_override_return could tell if it was being called from an ftrace kprobe context. - reworked the logic in kprobe_perf_func to take advantage of bpf_kprobe_state. - updated the test as per Alexei's review. - Original message - A lot of our error paths are not well tested because we have no good way of injecting errors generically. Some subsystems (block, memory) have ways to inject errors, but they are random so it's hard to get reproducible results. With BPF we can add determinism to our error injection. We can use kprobes and other things to verify we are injecting errors at the exact case we are trying to test. This patch gives us the tool to actually do the error injection part. It is very simple, we just set the return value of the pt_regs we're given to whatever we provide, and then override the PC with a dummy function that simply returns. Right now this only works on x86, but it would be simple enough to expand to other architectures. Thanks, Josef
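Putting the two halves together, usage looks roughly like this (a sketch based on the cover letter and patches 1 and 3 — the opt-in macro and helper names follow the posted series, but the program details may differ from the series' actual sample):

```c
/*
 * Kernel side: opt a function in for error injection.  The series'
 * btrfs patch uses open_ctree() as its example target; only functions
 * tagged this way may have their return overridden.
 */
BPF_ALLOW_ERROR_INJECTION(open_ctree);

/*
 * BPF side: a kprobe program that forces the probed call to fail.
 * bpf_override_return() is the helper added in patch 3; the section
 * name convention is the usual libbpf kprobe form.
 */
SEC("kprobe/open_ctree")
int inject_open_ctree_failure(struct pt_regs *ctx)
{
	/* make open_ctree() appear to return -ENOMEM to its caller */
	bpf_override_return(ctx, (unsigned long)-ENOMEM);
	return 0;
}
```

Because the opt-in list is checked at attach time, a program calling bpf_override_return() against an untagged function is simply refused, which is what keeps the fast path free of per-call checks.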
[PATCH v8 1/5] add infrastructure for tagging functions as error injectable
From: Josef Bacik Using BPF we can override kprob'ed functions and return arbitrary values. Obviously this can be a bit unsafe, so make this feature opt-in for functions. Simply tag a function with KPROBE_ERROR_INJECT_SYMBOL in order to give BPF access to that function for error injection purposes. Signed-off-by: Josef Bacik Acked-by: Ingo Molnar --- include/asm-generic/vmlinux.lds.h | 10 +++ include/linux/bpf.h | 11 +++ include/linux/kprobes.h | 1 + include/linux/module.h| 5 ++ kernel/kprobes.c | 163 ++ kernel/module.c | 6 +- 6 files changed, 195 insertions(+), 1 deletion(-) diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h index 8acfc1e099e1..85822804861e 100644 --- a/include/asm-generic/vmlinux.lds.h +++ b/include/asm-generic/vmlinux.lds.h @@ -136,6 +136,15 @@ #define KPROBE_BLACKLIST() #endif +#ifdef CONFIG_BPF_KPROBE_OVERRIDE +#define ERROR_INJECT_LIST(). = ALIGN(8); \ + VMLINUX_SYMBOL(__start_kprobe_error_inject_list) = .; \ + KEEP(*(_kprobe_error_inject_list)) \ + VMLINUX_SYMBOL(__stop_kprobe_error_inject_list) = .; +#else +#define ERROR_INJECT_LIST() +#endif + #ifdef CONFIG_EVENT_TRACING #define FTRACE_EVENTS(). 
= ALIGN(8); \ VMLINUX_SYMBOL(__start_ftrace_events) = .; \ @@ -560,6 +569,7 @@ FTRACE_EVENTS() \ TRACE_SYSCALLS()\ KPROBE_BLACKLIST() \ + ERROR_INJECT_LIST() \ MEM_DISCARD(init.rodata)\ CLK_OF_TABLES() \ RESERVEDMEM_OF_TABLES() \ diff --git a/include/linux/bpf.h b/include/linux/bpf.h index 520aeebe0d93..552a666a338b 100644 --- a/include/linux/bpf.h +++ b/include/linux/bpf.h @@ -530,4 +530,15 @@ extern const struct bpf_func_proto bpf_sock_map_update_proto; void bpf_user_rnd_init_once(void); u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5); +#if defined(__KERNEL__) && !defined(__ASSEMBLY__) +#ifdef CONFIG_BPF_KPROBE_OVERRIDE +#define BPF_ALLOW_ERROR_INJECTION(fname) \ +static unsigned long __used\ + __attribute__((__section__("_kprobe_error_inject_list"))) \ + _eil_addr_##fname = (unsigned long)fname; +#else +#define BPF_ALLOW_ERROR_INJECTION(fname) +#endif +#endif + #endif /* _LINUX_BPF_H */ diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h index bd2684700b74..4f501cb73aec 100644 --- a/include/linux/kprobes.h +++ b/include/linux/kprobes.h @@ -271,6 +271,7 @@ extern bool arch_kprobe_on_func_entry(unsigned long offset); extern bool kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, unsigned long offset); extern bool within_kprobe_blacklist(unsigned long addr); +extern bool within_kprobe_error_injection_list(unsigned long addr); struct kprobe_insn_cache { struct mutex mutex; diff --git a/include/linux/module.h b/include/linux/module.h index fe5aa3736707..7bb1a9b9a322 100644 --- a/include/linux/module.h +++ b/include/linux/module.h @@ -475,6 +475,11 @@ struct module { ctor_fn_t *ctors; unsigned int num_ctors; #endif + +#ifdef CONFIG_BPF_KPROBE_OVERRIDE + unsigned int num_kprobe_ei_funcs; + unsigned long *kprobe_ei_funcs; +#endif } cacheline_aligned __randomize_layout; #ifndef MODULE_ARCH_INIT #define MODULE_ARCH_INIT {} diff --git a/kernel/kprobes.c b/kernel/kprobes.c index a1606a4224e1..bdd7dd724f6f 100644 --- a/kernel/kprobes.c +++ 
b/kernel/kprobes.c @@ -83,6 +83,16 @@ static raw_spinlock_t *kretprobe_table_lock_ptr(unsigned long hash) return &(kretprobe_table_locks[hash].lock); } +/* List of symbols that can be overriden for error injection. */ +static LIST_HEAD(kprobe_error_injection_list); +static DEFINE_MUTEX(kprobe_ei_mutex); +struct kprobe_ei_entry { + struct list_head list; + unsigned long start_addr; + unsigned long end_addr; + void *priv; +}; + /* Blacklist -- list of struct kprobe_blacklist_entry */ static LIST_HEAD(kprobe_blacklist); @@ -1392,6 +1402,17 @@ bool within_kprobe_blacklist(unsigned long addr) return false; } +bool within_kprobe_error_injection_list(unsigned long addr) +{ + struct kprobe_ei_entry *ent; + + list_for_each_entry(ent, &kprobe_error_injection_list, list) { + if (addr >= ent->start_addr && addr < ent->end_addr) + return true;
[PATCH v8 3/5] bpf: add a bpf_override_function helper
From: Josef Bacik

Error injection is sloppy and very ad-hoc. BPF could fill this niche perfectly with its kprobe functionality. We could make sure errors are only triggered in specific call chains that we care about with very specific situations. Accomplish this with the bpf_override_function helper. This will modify the probed function's return value to the specified value and set the PC to an override function that simply returns, bypassing the originally probed function. This gives us a nice clean way to implement systematic error injection for all of our code paths.

Acked-by: Alexei Starovoitov
Acked-by: Ingo Molnar
Signed-off-by: Josef Bacik
---
 arch/Kconfig | 3 +++
 arch/x86/Kconfig | 1 +
 arch/x86/include/asm/kprobes.h | 4 +++
 arch/x86/include/asm/ptrace.h | 5
 arch/x86/kernel/kprobes/ftrace.c | 14 ++
 include/linux/filter.h | 3 ++-
 include/linux/trace_events.h | 1 +
 include/uapi/linux/bpf.h | 7 -
 kernel/bpf/core.c | 3 +++
 kernel/bpf/verifier.c | 2 ++
 kernel/events/core.c | 7 +
 kernel/trace/Kconfig | 11
 kernel/trace/bpf_trace.c | 38 +++
 kernel/trace/trace_kprobe.c | 55 +++-
 kernel/trace/trace_probe.h | 12 +
 15 files changed, 157 insertions(+), 9 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index d789a89cb32c..4fb618082259 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -195,6 +195,9 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
 	bool
+config HAVE_KPROBE_OVERRIDE
+	bool
+
 config HAVE_NMI
 	bool

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 971feac13506..5126d2750dd0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -152,6 +152,7 @@ config X86
 	select HAVE_KERNEL_XZ
 	select HAVE_KPROBES
 	select HAVE_KPROBES_ON_FTRACE
+	select HAVE_KPROBE_OVERRIDE
 	select HAVE_KRETPROBES
 	select HAVE_KVM
 	select HAVE_LIVEPATCH if X86_64

diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index 6cf65437b5e5..c6c3b1f4306a 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -67,6 +67,10 @@ extern const int
kretprobe_blacklist_size; void arch_remove_kprobe(struct kprobe *p); asmlinkage void kretprobe_trampoline(void); +#ifdef CONFIG_KPROBES_ON_FTRACE +extern void arch_ftrace_kprobe_override_function(struct pt_regs *regs); +#endif + /* Architecture specific copy of original instruction*/ struct arch_specific_insn { /* copy of the original instruction */ diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h index 91c04c8e67fa..f04e71800c2f 100644 --- a/arch/x86/include/asm/ptrace.h +++ b/arch/x86/include/asm/ptrace.h @@ -108,6 +108,11 @@ static inline unsigned long regs_return_value(struct pt_regs *regs) return regs->ax; } +static inline void regs_set_return_value(struct pt_regs *regs, unsigned long rc) +{ + regs->ax = rc; +} + /* * user_mode(regs) determines whether a register set came from user * mode. On x86_32, this is true if V8086 mode was enabled OR if the diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c index 041f7b6dfa0f..3c455bf490cb 100644 --- a/arch/x86/kernel/kprobes/ftrace.c +++ b/arch/x86/kernel/kprobes/ftrace.c @@ -97,3 +97,17 @@ int arch_prepare_kprobe_ftrace(struct kprobe *p) p->ainsn.boostable = false; return 0; } + +asmlinkage void override_func(void); +asm( + ".type override_func, @function\n" + "override_func:\n" + " ret\n" + ".size override_func, .-override_func\n" +); + +void arch_ftrace_kprobe_override_function(struct pt_regs *regs) +{ + regs->ip = (unsigned long)&override_func; +} +NOKPROBE_SYMBOL(arch_ftrace_kprobe_override_function); diff --git a/include/linux/filter.h b/include/linux/filter.h index cdd78a7beaae..dfa44fd74bae 100644 --- a/include/linux/filter.h +++ b/include/linux/filter.h @@ -458,7 +458,8 @@ struct bpf_prog { locked:1, /* Program image locked? */ gpl_compatible:1, /* Is filter GPL compatible? */ cb_access:1,/* Is control block accessed? */ - dst_needed:1; /* Do we need dst entry? */ + dst_needed:1, /* Do we need dst entry? 
*/ + kprobe_override:1; /* Do we override a kprobe? */ kmemcheck_bitfield_end(meta); enum bpf_prog_type type; /* Type of BPF program */ u32 len;/* Number of filter blocks */ diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h index fc6aeca945db..be8bd5a8efaa 100644 --- a/include/linux/trace_even
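The helper pairs with the whitelist from patch 1/5 (quoted further up): only addresses registered in the error-injection list may have their return value overridden, checked by a linear walk over [start_addr, end_addr) ranges. A minimal userspace sketch of that membership test, with hypothetical names modelled on the kernel's `within_kprobe_error_injection_list()`:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace analogue of the kernel's error-injection
 * whitelist: a singly linked list of half-open address ranges. */
struct ei_entry {
    struct ei_entry *next;
    unsigned long start_addr;
    unsigned long end_addr;   /* exclusive, as in the kernel check */
};

/* Mirrors within_kprobe_error_injection_list(): linear scan over the
 * list, half-open interval test on each entry. */
int within_error_injection_list(const struct ei_entry *head, unsigned long addr)
{
    const struct ei_entry *ent;

    for (ent = head; ent; ent = ent->next) {
        if (addr >= ent->start_addr && addr < ent->end_addr)
            return 1;
    }
    return 0;
}
```

The half-open interval matters: a function's end address is the first byte past it, so it must not test as "inside" the function.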
[PATCH v8 4/5] samples/bpf: add a test for bpf_override_return
From: Josef Bacik This adds a basic test for bpf_override_return to verify it works. We override the main function for mounting a btrfs fs so it'll return -ENOMEM and then make sure that trying to mount a btrfs fs will fail. Acked-by: Alexei Starovoitov Acked-by: Ingo Molnar Signed-off-by: Josef Bacik --- samples/bpf/Makefile | 4 samples/bpf/test_override_return.sh | 15 +++ samples/bpf/tracex7_kern.c| 16 samples/bpf/tracex7_user.c| 28 tools/include/uapi/linux/bpf.h| 7 ++- tools/testing/selftests/bpf/bpf_helpers.h | 3 ++- 6 files changed, 71 insertions(+), 2 deletions(-) create mode 100755 samples/bpf/test_override_return.sh create mode 100644 samples/bpf/tracex7_kern.c create mode 100644 samples/bpf/tracex7_user.c diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile index ea2b9e6135f3..83d06bc1f710 100644 --- a/samples/bpf/Makefile +++ b/samples/bpf/Makefile @@ -14,6 +14,7 @@ hostprogs-y += tracex3 hostprogs-y += tracex4 hostprogs-y += tracex5 hostprogs-y += tracex6 +hostprogs-y += tracex7 hostprogs-y += test_probe_write_user hostprogs-y += trace_output hostprogs-y += lathist @@ -58,6 +59,7 @@ tracex3-objs := bpf_load.o $(LIBBPF) tracex3_user.o tracex4-objs := bpf_load.o $(LIBBPF) tracex4_user.o tracex5-objs := bpf_load.o $(LIBBPF) tracex5_user.o tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o +tracex7-objs := bpf_load.o $(LIBBPF) tracex7_user.o load_sock_ops-objs := bpf_load.o $(LIBBPF) load_sock_ops.o test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o @@ -100,6 +102,7 @@ always += tracex3_kern.o always += tracex4_kern.o always += tracex5_kern.o always += tracex6_kern.o +always += tracex7_kern.o always += sock_flags_kern.o always += test_probe_write_user_kern.o always += trace_output_kern.o @@ -153,6 +156,7 @@ HOSTLOADLIBES_tracex3 += -lelf HOSTLOADLIBES_tracex4 += -lelf -lrt HOSTLOADLIBES_tracex5 += -lelf HOSTLOADLIBES_tracex6 += -lelf +HOSTLOADLIBES_tracex7 += 
-lelf HOSTLOADLIBES_test_cgrp2_sock2 += -lelf HOSTLOADLIBES_load_sock_ops += -lelf HOSTLOADLIBES_test_probe_write_user += -lelf diff --git a/samples/bpf/test_override_return.sh b/samples/bpf/test_override_return.sh new file mode 100755 index ..e68b9ee6814b --- /dev/null +++ b/samples/bpf/test_override_return.sh @@ -0,0 +1,15 @@ +#!/bin/bash + +rm -f testfile.img +dd if=/dev/zero of=testfile.img bs=1M seek=1000 count=1 +DEVICE=$(losetup --show -f testfile.img) +mkfs.btrfs -f $DEVICE +mkdir tmpmnt +./tracex7 $DEVICE +if [ $? -eq 0 ] +then + echo "SUCCESS!" +else + echo "FAILED!" +fi +losetup -d $DEVICE diff --git a/samples/bpf/tracex7_kern.c b/samples/bpf/tracex7_kern.c new file mode 100644 index ..1ab308a43e0f --- /dev/null +++ b/samples/bpf/tracex7_kern.c @@ -0,0 +1,16 @@ +#include +#include +#include +#include "bpf_helpers.h" + +SEC("kprobe/open_ctree") +int bpf_prog1(struct pt_regs *ctx) +{ + unsigned long rc = -12; + + bpf_override_return(ctx, rc); + return 0; +} + +char _license[] SEC("license") = "GPL"; +u32 _version SEC("version") = LINUX_VERSION_CODE; diff --git a/samples/bpf/tracex7_user.c b/samples/bpf/tracex7_user.c new file mode 100644 index ..8a52ac492e8b --- /dev/null +++ b/samples/bpf/tracex7_user.c @@ -0,0 +1,28 @@ +#define _GNU_SOURCE + +#include +#include +#include +#include "libbpf.h" +#include "bpf_load.h" + +int main(int argc, char **argv) +{ + FILE *f; + char filename[256]; + char command[256]; + int ret; + + snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]); + + if (load_bpf_file(filename)) { + printf("%s", bpf_log_buf); + return 1; + } + + snprintf(command, 256, "mount %s tmpmnt/", argv[1]); + f = popen(command, "r"); + ret = pclose(f); + + return ret ? 
0 : 1; +} diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h index 4a4b6e78c977..3756dde69834 100644 --- a/tools/include/uapi/linux/bpf.h +++ b/tools/include/uapi/linux/bpf.h @@ -673,6 +673,10 @@ union bpf_attr { * @buf: buf to fill * @buf_size: size of the buf * Return : 0 on success or negative error code + * + * int bpf_override_return(pt_regs, rc) + * @pt_regs: pointer to struct pt_regs + * @rc: the return value to set */ #define __BPF_FUNC_MAPPER(FN) \ FN(unspec), \ @@ -732,7 +736,8 @@ union bpf_attr { FN(xdp_adjust_meta),\ FN(perf_event_read_value), \ FN(perf_prog_read_value), \ - FN(getsockopt), + FN(getsockopt), \ + FN(override_return), /* integer value in 'imm' field of BPF_CALL instruction selects which helper * function eBPF program intends
[PATCH v8 5/5] btrfs: allow us to inject errors at io_ctl_init
From: Josef Bacik

This was instrumental in reproducing a space cache bug.

Signed-off-by: Josef Bacik
Acked-by: Ingo Molnar
---
 fs/btrfs/free-space-cache.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index cdc9f4015ec3..daa98dc1f844 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -22,6 +22,7 @@
 #include
 #include
 #include
+#include
 #include "ctree.h"
 #include "free-space-cache.h"
 #include "transaction.h"
@@ -332,6 +333,7 @@ static int io_ctl_init(struct btrfs_io_ctl *io_ctl, struct inode *inode,
 	return 0;
 }
+BPF_ALLOW_ERROR_INJECTION(io_ctl_init);

 static void io_ctl_free(struct btrfs_io_ctl *io_ctl)
 {
--
2.7.5
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v8 2/5] btrfs: make open_ctree error injectable
From: Josef Bacik

This allows us to do error injection with BPF for open_ctree.

Signed-off-by: Josef Bacik
Acked-by: Ingo Molnar
---
 fs/btrfs/disk-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index dfdab849037b..69d17a640b94 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -30,6 +30,7 @@
 #include
 #include
 #include
+#include
 #include
 #include "ctree.h"
 #include "disk-io.h"
@@ -3283,6 +3284,7 @@ int open_ctree(struct super_block *sb,
 		goto fail_block_groups;
 	goto retry_root_backup;
 }
+BPF_ALLOW_ERROR_INJECTION(open_ctree);

 static void btrfs_end_buffer_write_sync(struct buffer_head *bh, int uptodate)
 {
--
2.7.5
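The `BPF_ALLOW_ERROR_INJECTION()` macro from patch 1/5 works by placing each registered function's address into a dedicated ELF section, which the build then bounds with linker-provided `__start_`/`__stop_` symbols. The same trick can be sketched in userspace C, assuming an ELF target and GCC/Clang section attributes (names here are illustrative, not the kernel's):

```c
#include <assert.h>

/* Two functions we want to register for "error injection". */
static int demo_a(void) { return 1; }
static int demo_b(void) { return 2; }

/* Userspace sketch of the BPF_ALLOW_ERROR_INJECTION section trick:
 * each use drops one function address into the "ei_list" section. */
#define REGISTER_FN(fname) \
    static unsigned long __attribute__((used, section("ei_list"))) \
    _eil_addr_##fname = (unsigned long)fname

REGISTER_FN(demo_a);
REGISTER_FN(demo_b);

/* The linker synthesizes these bounds for any section whose name is
 * a valid C identifier. */
extern unsigned long __start_ei_list[];
extern unsigned long __stop_ei_list[];

int registered_count(void)
{
    return (int)(__stop_ei_list - __start_ei_list);
}

int is_registered(unsigned long addr)
{
    unsigned long *p;

    for (p = __start_ei_list; p < __stop_ei_list; p++)
        if (*p == addr)
            return 1;
    return 0;
}
```

In the kernel, modules get the same treatment: the section contents are copied into `kprobe_ei_funcs` at module load so the whitelist stays accurate as code comes and goes.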
Re: [PATCH 4/5] btrfs: Remove redundant NULL check
On Fri, Dec 01, 2017 at 11:19:43AM +0200, Nikolay Borisov wrote: > Before returning hole_em in btrfs_get_fiemap_extent we check if it's different > than null. However, by the time this null check is triggered we already know > hole_em is not null because it means it points to the em we found and it > has already been dereferenced. Well, ok The whole function could use some cleanups and reordering, it's hard to track all the variable changes. > > Signed-off-by: Nikolay Borisov > --- > fs/btrfs/inode.c | 3 +-- > 1 file changed, 1 insertion(+), 2 deletions(-) > > diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c > index 92d140b06271..9e0473c883ce 100644 > --- a/fs/btrfs/inode.c > +++ b/fs/btrfs/inode.c > @@ -7300,9 +7300,8 @@ struct extent_map *btrfs_get_extent_fiemap(struct > btrfs_inode *inode, > em->block_start = EXTENT_MAP_DELALLOC; > em->block_len = found; > } > - } else if (hole_em) { > + } else > return hole_em; The { } should stay. I'll fix it. > - } > out: > > free_extent_map(hole_em); > -- > 2.7.4 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/5] btrfs: Fix possible off-by-one in btrfs_search_path_in_tree
On Fri, Dec 01, 2017 at 11:19:42AM +0200, Nikolay Borisov wrote:
> The name char array passed to btrfs_search_path_in_tree is of size
> BTRFS_INO_LOOKUP_PATH_MAX (4080). So the actual accessible char indexes
> are in the range of [0, 4079]. Currently the code uses the define but this
> represents an off-by-one.

For such fixes it's good to think about and document the potential implications, like where the extra byte could end up being written, and whether this is a patch for stable.

The size of btrfs_ioctl_ino_lookup_args is 4096, so the new byte would be written into that extra space, not into some padding that could be provided by the allocator. btrfs-progs stores the arguments on the stack, but the kernel makes its own copy of the ioctl buffer, so the off-by-one overwrite does not affect userspace, though the terminating 0 might be lost. The kernel ioctl buffer is allocated dynamically, so we'd be overwriting somebody else's memory, and the ioctl is privileged if args.objectid is not 256. Which it is in most cases, but resolving a subvolume stored in another directory will trigger that path.

Fixes: ac8e9819d71f907 ("Btrfs: add search and inode lookup ioctls")

Before this patch the buffer was one byte larger, but then the -1 was not added.
> Signed-off-by: Nikolay Borisov > --- > fs/btrfs/ioctl.c | 6 +++--- > 1 file changed, 3 insertions(+), 3 deletions(-) > > diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c > index e8adebc8c1b0..fc148b7c4265 100644 > --- a/fs/btrfs/ioctl.c > +++ b/fs/btrfs/ioctl.c > @@ -2206,7 +2206,7 @@ static noinline int btrfs_search_path_in_tree(struct > btrfs_fs_info *info, > if (!path) > return -ENOMEM; > > - ptr = &name[BTRFS_INO_LOOKUP_PATH_MAX]; > + ptr = &name[BTRFS_INO_LOOKUP_PATH_MAX - 1]; > > key.objectid = tree_id; > key.type = BTRFS_ROOT_ITEM_KEY; > @@ -2272,8 +2272,8 @@ static noinline int btrfs_search_path_in_tree(struct > btrfs_fs_info *info, > static noinline int btrfs_ioctl_ino_lookup(struct file *file, > void __user *argp) > { > - struct btrfs_ioctl_ino_lookup_args *args; > - struct inode *inode; > + struct btrfs_ioctl_ino_lookup_args *args; > + struct inode *inode; Unrelated change. > int ret = 0; > > args = memdup_user(argp, sizeof(*args)); > -- > 2.7.4 > > -- > To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
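The fix moves the right-to-left path cursor from `&name[MAX]` (one past the end) to `&name[MAX - 1]` (the last valid byte). A small sketch of that pattern, with a tiny stand-in buffer size so the boundary is easy to see:

```c
#include <assert.h>
#include <string.h>

#define PATH_MAX_SKETCH 8   /* stand-in for BTRFS_INO_LOOKUP_PATH_MAX (4080) */

/* Builds a path right-to-left into name[], the way
 * btrfs_search_path_in_tree assembles a path from leaf to root.
 * The last valid index is PATH_MAX_SKETCH - 1, so the cursor must
 * start there, not at PATH_MAX_SKETCH (the off-by-one being fixed).
 * Returns the offset of the first byte of the built string. */
int build_path_tail(char *name, const char *component)
{
    char *ptr = &name[PATH_MAX_SKETCH - 1];  /* last valid index */
    size_t len;

    *ptr = '\0';                 /* terminator lands inside the buffer */
    len = strlen(component);
    ptr -= len;
    memcpy(ptr, component, len); /* copy component just before it */
    return (int)(ptr - name);
}
```

With the old `&name[PATH_MAX_SKETCH]` starting point, the `*ptr = '\0'` store would land one byte past the array, exactly the overwrite discussed above.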
Re: [PATCH] btrfs: tree-checker: use %zu format string for size_t
On Wed, Dec 06, 2017 at 03:18:14PM +0100, Arnd Bergmann wrote:
> The return value of sizeof() is of type size_t, so we must print it
> using the %z format modifier rather than %l to avoid this warning
> on some architectures:
>
> fs/btrfs/tree-checker.c: In function 'check_dir_item':
> fs/btrfs/tree-checker.c:273:50: error: format '%lu' expects argument of type
> 'long unsigned int', but argument 5 has type 'u32' {aka 'unsigned int'}
> [-Werror=format=]
>
> Fixes: 005887f2e3e0 ("btrfs: tree-checker: Add checker for dir item")
> Signed-off-by: Arnd Bergmann

Reviewed-by: David Sterba
Re: [PATCH] Btrfs: disable FUA if mounted with nobarrier
On Tue, Dec 05, 2017 at 10:54:02PM -0800, Omar Sandoval wrote:
> From: Omar Sandoval
>
> I was seeing disk flushes still happening when I mounted a Btrfs
> filesystem with nobarrier for testing. This is because we use FUA to
> write out the first super block, and on devices without FUA support, the
> block layer translates FUA to a flush. Even on devices supporting true
> FUA, using FUA when we asked for no barriers is surprising.
>
> Signed-off-by: Omar Sandoval

Reviewed-by: David Sterba

The documented nobarrier behaviour matches the updated code.
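The fix reduces to a flag-selection decision when submitting the primary super block. A sketch of that logic with illustrative stand-in flag values (the real code uses the block layer's REQ_* flags, not these):

```c
#include <assert.h>

/* Hypothetical stand-ins for REQ_SYNC / REQ_FUA and the nobarrier
 * mount option; the numeric values are illustrative only. */
#define SK_REQ_SYNC 0x1u
#define SK_REQ_FUA  0x2u

/* With barriers enabled, write the first super block with FUA so it
 * is durable on media before completion. With nobarrier, skip FUA:
 * on devices without native FUA the block layer would otherwise
 * emulate it with a flush, defeating the mount option. */
unsigned super_write_flags(int nobarrier)
{
    unsigned flags = SK_REQ_SYNC;

    if (!nobarrier)
        flags |= SK_REQ_FUA;
    return flags;
}
```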
[PATCH] btrfs: tree-checker: use %zu format string for size_t
The return value of sizeof() is of type size_t, so we must print it
using the %z format modifier rather than %l to avoid this warning
on some architectures:

fs/btrfs/tree-checker.c: In function 'check_dir_item':
fs/btrfs/tree-checker.c:273:50: error: format '%lu' expects argument of type
'long unsigned int', but argument 5 has type 'u32' {aka 'unsigned int'}
[-Werror=format=]

Fixes: 005887f2e3e0 ("btrfs: tree-checker: Add checker for dir item")
Signed-off-by: Arnd Bergmann
---
 fs/btrfs/tree-checker.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 66dac0a4b01f..7c55e3ba5a6c 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -270,7 +270,7 @@ static int check_dir_item(struct btrfs_root *root,
 		/* header itself should not cross item boundary */
 		if (cur + sizeof(*di) > item_size) {
 			dir_item_err(root, leaf, slot,
-		"dir item header crosses item boundary, have %lu boundary %u",
+		"dir item header crosses item boundary, have %zu boundary %u",
 				cur + sizeof(*di), item_size);
 			return -EUCLEAN;
 		}
--
2.9.0
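The underlying issue: `size_t` is 4 bytes on 32-bit targets and 8 on 64-bit ones, so `%lu` is only accidentally correct on some architectures, while `%zu` is defined by C99 to match `size_t` everywhere. A standalone sketch of the fixed format string (hypothetical struct and function names; the real code formats into a tracepoint, not a buffer):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Toy stand-in for btrfs_dir_item's fixed header. */
struct di { unsigned short len; unsigned short name_len; };

/* Formats the boundary-check message the way the patched code does:
 * the cur + sizeof(...) expression has type size_t, so it takes %zu;
 * item_size is a plain unsigned (u32 in the kernel), so %u. */
int format_boundary_msg(char *out, size_t outsz, size_t cur, unsigned item_size)
{
    return snprintf(out, outsz,
                    "dir item header crosses item boundary, have %zu boundary %u",
                    cur + sizeof(struct di), item_size);
}
```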
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Wed, Dec 06, 2017 at 07:44:04PM +1100, Dave Chinner wrote: > On Tue, Dec 05, 2017 at 08:45:49PM -0800, Matthew Wilcox wrote: > > That said, using xa_cmpxchg() in the dquot code looked like the right > > thing to do? Since we'd dropped the qi mutex and the ILOCK, it looks > > entirely reasonable for another thread to come in and set up the dquot. > > But I'm obviously quite ignorant of the XFS internals, so maybe there's > > something else going on that makes this essentially a "can't happen". > > It's no different to the inode cache code, which drops the RCU > lock on lookup miss, instantiates the new inode (maybe reading it > off disk), then locks the tree and attempts to insert it. Both cases > use "insert if empty, otherwise retry lookup from start" semantics. Ah. I had my focus set a little narrow on the inode cache code and didn't recognise the pattern. Why do you sleep for one jiffy after encountering a miss, then seeing someone else insert the inode for you? > cmpxchg is for replacing a known object in a store - it's not really > intended for doing initial inserts after a lookup tells us there is > nothing in the store. The radix tree "insert only if empty" makes > sense here, because it naturally takes care of lookup/insert races > via the -EEXIST mechanism. > > I think that providing xa_store_excl() (which would return -EEXIST > if the entry is not empty) would be a better interface here, because > it matches the semantics of lookup cache population used all over > the kernel I'm not thrilled with xa_store_excl(), but I need to think about that a bit more. > > I'm quite happy to have normal API variants that don't save/restore > > interrupts. Just need to come up with good names ... I don't think > > xa_store_noirq() is a good name, but maybe you do? > > I'd prefer not to have to deal with such things at all. :P > > How many subsystems actually require irq safety in the XA locking > code? 
Make them use irqsafe versions, not make everyone else use > "noirq" versions, as is the convention for the rest of the kernel > code Hard to say how many existing radix tree users require the irq safety. Also hard to say how many potential users (people currently using linked lists, people using resizable arrays, etc) need irq safety. My thinking was "make it safe by default and let people who know better have a way to opt out", but there's definitely something to be said for "make it fast by default and let people who need the unusual behaviour type those extra few letters". So, you're arguing for providing xa_store(), xa_store_irq(), xa_store_bh() and xa_store_irqsafe()? (at least on demand, as users come to light?) At least the read side doesn't require any variants; everybody can use RCU for read side protection. ("safe", not "save" because I wouldn't make the caller provide the "flags" argument). > > At least, not today. One of the future plans is to allow xa_nodes to > > be allocated from ZONE_MOVABLE. In order to do that, we have to be > > able to tell which lock protects any given node. With the XArray, > > we can find that out (xa_node->root->xa_lock); with the radix tree, > > we don't even know what kind of lock protects the tree. > > Yup, this is a prime example of why we shouldn't be creating > external dependencies by smearing the locking context outside the XA > structure itself. It's not a stretch to see something like a > ZONE_MOVEABLE dependency because some other object indexed in a XA > is stored in the same page as the xa_node that points to it, and > both require the same xa_lock to move/update... That is a bit of a stretch. 
Christoph Lameter and I had a discussion about it here: https://www.spinics.net/lists/linux-mm/msg122902.html There's no situation where you need to acquire two locks in order to free an object; you'd create odd locking dependencies between objects if you did that (eg we already have a locking dependency between pag_ici and perag from __xfs_inode_set_eofblocks_tag). It'd be a pretty horrible shrinker design where you had to get all the locks on all the objects, regardless of what locking order the real code had. > > There are other costs to not having a lock. The lockdep/RCU > > analysis done on the radix tree code is none. Because we have > > no idea what lock might protect any individual radix tree, we use > > rcu_dereference_raw(), disabling lockdep's ability to protect us. > > Unfortunately for you, I don't find arguments along the lines of > "lockdep will save us" at all convincing. lockdep already throws > too many false positives to be useful as a tool that reliably and > accurately points out rare, exciting, complex, intricate locking > problems. But it does reliably and accurately point out "dude, you forgot to take the lock". It's caught a number of real problems in my own testing that you never got to see. > That problem has not gone away - very few people who read and have > to maintain this code understandxs all t
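The "insert if empty, otherwise retry lookup from start" semantics Dave describes can be sketched over a plain array standing in for the radix tree/XArray, with all locking and RCU elided (so this shows only the control flow, not the concurrency):

```c
#include <assert.h>
#include <stddef.h>

#define NSLOTS 16

/* Toy store standing in for the index tree. */
static void *slots[NSLOTS];

/* Returns the existing entry, or NULL on miss. */
void *toy_lookup(unsigned long index)
{
    return slots[index % NSLOTS];
}

/* Insert only if the slot is empty; -1 (an -EEXIST analogue) if not. */
int toy_insert_excl(unsigned long index, void *entry)
{
    if (slots[index % NSLOTS])
        return -1;
    slots[index % NSLOTS] = entry;
    return 0;
}

/* The cache-population idiom: look up, instantiate on miss (done by
 * the caller here, as `fresh`), attempt an exclusive insert, and on
 * the -EEXIST analogue retry the lookup from the start. */
void *toy_find_or_create(unsigned long index, void *fresh)
{
    for (;;) {
        void *found = toy_lookup(index);

        if (found)
            return found;            /* someone else's instantiation won */
        if (toy_insert_excl(index, fresh) == 0)
            return fresh;            /* our insert succeeded */
        /* raced with a concurrent insert: retry lookup from start */
    }
}
```

In the real code the lookup happens under RCU, the instantiation may involve disk I/O, and the exclusive insert is done under the tree lock, which is why the race window exists at all.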
Re: [RFC] Improve subvolume usability for a normal user
On 2017-12-05 23:52, Misono, Tomohiro wrote:
On 2017/12/05 21:41, Austin S. Hemmelgarn wrote:
On 2017-12-05 03:43, Qu Wenruo wrote:
On 2017/12/05 16:25, Misono, Tomohiro wrote:

Hello all,

I want to address some issues of subvolume usability for a normal user, i.e. a user can create subvolumes, but
- Cannot delete their own subvolume (by default)
- Cannot tell subvolumes from directories (in a straightforward way)
- Cannot check the quota limit when qgroup is enabled

Here I show the initial thoughts and approaches to this problem. I want to check if this is a right approach or not before I start writing code. Comments are welcome.

Tomohiro Misono

==
- Goal and current problem

The goal of this RFC is to give a normal user more control over their own subvolumes. Currently the control of subvolumes for a normal user is restricted as below:

+-------------+------+------+
| command     | root | user |
+-------------+------+------+
| sub create  | Y    | Y    |
| sub snap    | Y    | Y    |
| sub del     | Y    | N    |
| sub list    | Y    | N    |
| sub show    | Y    | N    |
| qgroup show | Y    | N    |
+-------------+------+------+

In short, I want to change this as below in order to improve a user's usability:

+-------------+------+--------+
| command     | root | user   |
+-------------+------+--------+
| sub create  | Y    | Y      |
| sub snap    | Y    | Y      |
| sub del     | Y    | N -> Y |
| sub list    | Y    | N -> Y |
| sub show    | Y    | N -> Y |
| qgroup show | Y    | N -> Y |
+-------------+------+--------+

In words, (1) allow deletion of a subvolume if a user owns it, and (2) allow getting subvolume/quota info if a user has read access to it (sub list/qgroup show just list the subvolumes which are readable by the user). I think the other commands not listed above (qgroup limit, send/receive etc.) should be done by root and not be allowed for a normal user.

- Outside the scope of this RFC

There is a qualitative problem in using qgroup for limiting a user's disk amount: the quota limit can easily be averted by creating a subvolume. I think that forcing inheritance of the parent subvolume's quota is a solution, but I won't address nor discuss this here.
- Proposal (1) deletion of subvolume I want to change the default behavior to allow a user to delete their own subvolumes. This is not the same behavior as when user_subvol_rm_alowed mount option is specified; in that case a user can delete the subvolume to which they have write+exec right. Since snapshot creation is already restricted to the subvolume owner, it is consistent that only the owner of the subvolume (or root) can delete it. The implementation should be straightforward. Personally speaking, I prefer to do the complex owner check in user daemon. And do the privilege in user daemon (call it btrfsd for example). So btrfs-progs will works in 2 modes, if root calls it, do as it used to do. If normal user calls it, proxy the request to btrfsd, and btrfsd does the privilege checking and call ioctl (with root privilege). Then no impact to kernel, all complex work is done in user space. Exactly how hard is it to just check ownership of the root inode of a subvolume from the ioctl context? You could just as easily push all the checking out to the VFS layer by taking an open fd for the subvolume root (and probably implicitly closing it) instead of taking a path, and that would give you all the benefits of ACL's and whatever security modules the local system is using. Alternatively, quit treating subvolumes specially for `unlink()`, and make them behave like regular directories (and thus avoid the one possible complication resulting from home directories being subvolumes but needing to be owned by the users). Thanks to all for the comments. Let me explain my stance clearly: I'm for limiting the subvolume operation to btrfs-progs commands since a subvolume looks like a regular directory, but is different in many points. This is the reason I want to fix not only "sub delete" but also "sub show/list" etc. I agree that it needs to behave more sanely, but the approach of treating it differently from a directory is the entire cause of this issue in the first place. 
If the original implementation had just treated them as regular directories once created (with the obvious caveat regarding read-only snapshots), we wouldn't be having this conversation right now, except possibly regarding fetching info about subvolumes. Personally, I would really rather just let unlink() work the same on subvolumes as it does with directories, and only need special handling for creation and metadata retrieval (with the possible option of 'quick' deletion through a btrfs-progs command). I think "sub delete" should be the counterpart of "sub create" and "sub snap". Since "sub snap" can be used for the subvolume the user owns, "sub delete" should be allowed to the suvolume the use
Re: [RFC] Improve subvolume usability for a normal user
On 2017-12-05 17:08, Goffredo Baroncelli wrote: On 12/05/2017 09:17 PM, Austin S. Hemmelgarn wrote: On 2017-12-05 14:09, Goffredo Baroncelli wrote: On 12/05/2017 07:46 PM, Graham Cobb wrote: On 05/12/17 18:01, Goffredo Baroncelli wrote: On 12/05/2017 04:42 PM, Graham Cobb wrote: [] Then no impact to kernel, all complex work is done in user space. Exactly how hard is it to just check ownership of the root inode of a subvolume from the ioctl context? You could just as easily push all the checking out to the VFS layer by taking an open fd for the subvolume root (and probably implicitly closing it) instead of taking a path, and that would give you all the benefits of ACL's and whatever security modules the local system is using. +1 - stop inventing new access control rules for each different action! A subvolume is like a directory; In all filesystems you cannot remove an inode if you don't have write access to the parent directory. I assume that this is a POSIX requirement; and if so this should be true even for BTRFS. This means that in order to remove a subvolume and you are not root, you should check all the contained directories. It is not sufficient to check one inode. In the past I create a patch [1][2] which extended the unlink (2) syscall to allow removing a subvolume if: a) the user is root or b.1) if the subvolume is empty and b.2) the user has the write capability to the parent directory That is also OK, as it follows a logical and consistent model without introducing new rules, but it has two disadvantages with snapshots the user creates and then wants to delete: i) they have to change the snapshot to readwrite to remove all the contents -- how many users will know that that can even be done? I think that this is an orthogonal question. However the user should use btrfs set Anyway I am not against to use something more specific like "btrfs sub del...", where it could be possible to implement a more specific checks. 
I am only highlight that destroy a subvolume without looking to its content could break the POSIX rules No, it doesn't break POSIX rules, because you can't do it with POSIX defined interfaces (not counting ioctls, but ioctls are by definition system dependent). user$ mkdir -p dir1/dir2 user$ touch dir1/dir2/file user$ sudo chown foo:foo dir1/dir2 user$ rm -rf dir1/ rm: cannot remove 'dir1/dir2/file1': Permission denied In a POSIX filesystem, you (as user) cannot remove file1. But if instead of dir1 you have a subvolume (called sub1), and the check of removing the subvolume is performed only on the root of subvolume, you could remove it. Is it a problem ? I think no, but I am not a security expert. But even if POSIX doesn't have the concept of subvolume, this is definitely a break of a POSIX rule. Except it isn't a break of a POSIX rule, because deleting the subvolume doesn't go through regular POSIX API's (again, ioctl really doesn't count because it's designed for random extended stuff). POSIX makes absolutely zero guarantees about the behavior of things not done through POSIX API's. Yes, this doesn't use POSIX style permission checks, but it also doesn't violate POSIX standards either (because it's not defined in the POSIX standard, and things that aren't defined by the standards are explicitly stated to be implementation defined). The data loss risk this entails is however the reason users can't delete subvolumes by default. On the other side, if the user makes a snapshot of a subvolume containing a not owned and not empty directory, for the user it would be impossible to delete its snapshot (!). I would argue that by default they shouldn't be able to make a snapshot because: 1. With the default behavior of users not being able to delete subvolumes, users can trivially cause a resource exhaustion situation that they can't resolve themself. 2. Whitelisting is usually better for security than blacklisting. 
ii) if the snapshot contains a file the user cannot remove then the user cannot remove the snapshot at all (even though they could create it). It is the same for the other filesystem: if in a filesystem tree, there is a directory which is not changeable by the user, the user cannot remove all the tree. Again, I am only highlighting that there is a possible break of POSIX rules. POSIX says nothing about non-permissions related reasons for not being able to remove a directory. Nobody has say the opposite Claiming that it's potentially POSIX incompatible behavior directly implies that POSIX has provisions regarding non-permissions related reasons for not being able to remove a directory. A subvolume is treated (too much in my opinion) like a mount point, regardless of whether or not it's explicitly mounted, and that seems to be most of why you can't remove it with unlink(). That is why I preferred Austin's first proposal. I suppose your proposal could be extended: a) the user is root or b
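The RFC's proposed default for deletion ("only the owner of the subvolume, or root") reduces to a small predicate, in contrast to the recursive, per-inode checks a POSIX-style `rm -r` performs. A sketch of that rule with hypothetical types (real code would take these values from the subvolume's root inode and the caller's credentials):

```c
#include <assert.h>

typedef unsigned uid_sketch_t;   /* stand-in for kuid_t / uid_t */

struct subvol_info {
    uid_sketch_t owner_uid;      /* uid on the subvolume's root inode */
};

/* Proposed default from the RFC: root may always delete; an
 * unprivileged user may delete only subvolumes they own. This is
 * deliberately different from the user_subvol_rm_allowed mount
 * option, which keys on write+exec access to the subvolume instead. */
int may_delete_subvol(uid_sketch_t caller_uid, const struct subvol_info *sv)
{
    if (caller_uid == 0)
        return 1;
    return caller_uid == sv->owner_uid;
}
```

The thread's disagreement is precisely about whether this single ownership check is enough, or whether deletion should walk the tree and honor per-inode permissions like an ordinary recursive directory removal.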
Re: btrfs subvolume list for users?
On Wed 2017-12-06 (08:24), Ulli Horlacher wrote:

> Is there a way for a non-root user to list his subvolumes?

I must have Alzheimer's: I asked this already 3 months ago and found a workaround myself:

Date: Sun, 17 Sep 2017 10:48:10 +0200
Message-ID: <20170917084810.gp32...@rus.uni-stuttgart.de>

Sorry!

--
Ullrich Horlacher          Server und Virtualisierung
Rechenzentrum TIK
Universitaet Stuttgart     E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30a            Tel: ++49-711-68565868
70569 Stuttgart (Germany)  WWW: http://www.tik.uni-stuttgart.de/
REF:<20171206072436.ga29...@rus.uni-stuttgart.de>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fstests: btrfs: reproduce a read failure on raid6 setup
On Mon, Dec 04, 2017 at 03:33:23PM -0700, Liu Bo wrote:

> This test case is to reproduce a bug of raid6 reconstruction process.
>
> The kernel fixes are
>   Btrfs: do not merge rbios if their fail stripe index are not identical
>   Btrfs: make raid6 rebuild retry more
>
> Signed-off-by: Liu Bo

The test failed as expected with a v4.15-rc2 kernel and passed with the above patches applied.

I fixed two trailing-whitespace issues and some of the indentation to avoid overly long lines, and also replaced the 'repair' group with a new 'raid' group. I think the 'repair' group is meant for testing filesystem repair tools, not the internal raid repair mechanism, so a new 'raid' group looks better to me, as we're getting more raid-specific tests recently.

Thanks,
Eryu
Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray
On Tue, Dec 05, 2017 at 08:45:49PM -0800, Matthew Wilcox wrote:

> On Wed, Dec 06, 2017 at 02:14:56PM +1100, Dave Chinner wrote:
> > > The other conversions use the normal API instead of the advanced API, so
> > > all of this gets hidden away. For example, the inode cache does this:
> >
> > Ah, OK, that's not obvious from the code changes. :/
>
> Yeah, it's a lot easier to understand (I think!) if you build the
> docs in that tree and look at
> file:///home/willy/kernel/xarray-3/Documentation/output/core-api/xarray.html
> (mutatis mutandis). I've tried to tell a nice story about how to put
> all the pieces together from the normal to the advanced API.
>
> > However, it's probably overkill for XFS. In all the cases, when we
> > insert there should be no entry in the tree - the
> > radix tree insert error handling code there was simply catching
> > "should never happen" cases and handling it without crashing.
>
> I thought it was probably overkill to be using xa_cmpxchg() in the
> pag_ici patch. I didn't want to take away your error handling as part
> of the conversion, but I think a rational person implementing it today
> would just call xa_store() and not even worry about the return value
> except to check it for IS_ERR().

*nod*

> That said, using xa_cmpxchg() in the dquot code looked like the right
> thing to do? Since we'd dropped the qi mutex and the ILOCK, it looks
> entirely reasonable for another thread to come in and set up the dquot.
> But I'm obviously quite ignorant of the XFS internals, so maybe there's
> something else going on that makes this essentially a "can't happen".

It's no different to the inode cache code, which drops the RCU lock on lookup miss, instantiates the new inode (maybe reading it off disk), then locks the tree and attempts to insert it. Both cases use "insert if empty, otherwise retry lookup from start" semantics.
cmpxchg is for replacing a known object in a store - it's not really intended for doing initial inserts after a lookup tells us there is nothing in the store. The radix tree "insert only if empty" makes sense here, because it naturally takes care of lookup/insert races via the -EEXIST mechanism.

I think that providing xa_store_excl() (which would return -EEXIST if the entry is not empty) would be a better interface here, because it matches the semantics of lookup cache population used all over the kernel.

> > Now that I've looked at this, I have to say that having a return
> > value of NULL meaning "success" is quite counter-intuitive. That's
> > going to fire my "that looks so wrong" detector every time I look at
> > the code and notice it's erroring out on a non-null return value
> > that isn't a PTR_ERR case.
>
> It's the same convention as cmpxchg(). I think it's triggering your
> "looks so wrong" detector because it's fundamentally not the natural
> thing to write.

Most definitely the case, and this is why it's a really bad interface for the semantics we have. This is how we end up with code that makes it easy for programmers to screw up pointer checks in error handling... :/

> I'm quite happy to have normal API variants that don't save/restore
> interrupts. Just need to come up with good names ... I don't think
> xa_store_noirq() is a good name, but maybe you do?

I'd prefer not to have to deal with such things at all. :P

How many subsystems actually require irq safety in the XA locking code? Make them use irqsafe versions, rather than making everyone else use "noirq" versions, as is the convention for the rest of the kernel code.

> > > It's the design pattern I've always intended to use. Naturally, the
> > > xfs radix trees weren't my initial target; it was the page cache, and
> > > the page cache does the same thing; uses the tree_lock to protect both
> > > the radix tree and several other fields in that same data structure.
> > > I'm open to argument on this though ... particularly if you have a better
> > > design pattern in mind!
> >
> > I don't mind structures having internal locking - I have a problem
> > with leaking them into contexts outside the structure they protect.
> > That way lies madness - you can't change the internal locking in
> > future because of external dependencies, and the moment you need
> > something different externally we've got to go back to an external
> > lock anyway.
> >
> > This is demonstrated by the way you converted the XFS dquot tree -
> > you didn't replace the dquot tree lock with the internal xa_lock
> > because it's a mutex and we have to sleep holding it. IOWs, we've
> > added another layer of locking here, not simplified the code.
>
> I agree the dquot code is no simpler than it was, but it's also no more
> complicated from a locking analysis point of view; the xa_lock is just
> not providing you with any useful exclusion.

Sure, that's fine. All I'm doing is pointing out that we can't use the internal xa_lock to handle everything the indexed objects require, and so we're going to still n
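The return-value convention debated above - cmpxchg-style "the previous value comes back, so NULL means success" versus an exclusive insert that fails with -EEXIST - can be sketched with a toy keyed store. This is a simplified illustration only, not the kernel XArray API: the dict-backed `Store` class and its method names are invented here, and `store_excl` models the hypothetical xa_store_excl() proposed in the thread.

```python
class Store:
    """Toy keyed store contrasting the two insertion conventions."""

    def __init__(self):
        self._slots = {}

    def cmpxchg(self, key, old, new):
        """xa_cmpxchg-style: replace only if the current value equals
        `old`, and return the PREVIOUS value. When inserting with
        old=None, success therefore reads as a None return - the
        counter-intuitive convention criticised above."""
        cur = self._slots.get(key)
        if cur == old:
            self._slots[key] = new
        return cur

    def store_excl(self, key, new):
        """Hypothetical xa_store_excl()-style: insert only if the slot
        is empty, otherwise fail loudly (standing in for -EEXIST)."""
        if key in self._slots:
            raise FileExistsError(key)
        self._slots[key] = new


s = Store()
# First inserter wins; success is signalled by getting None back.
assert s.cmpxchg(1, None, "a") is None
# A racing second inserter gets the existing value back instead.
assert s.cmpxchg(1, None, "b") == "a"

# The exclusive-insert convention reports the same race as an error,
# matching the "insert if empty, otherwise retry lookup" pattern.
try:
    s.store_excl(1, "b")
    raced = False
except FileExistsError:
    raced = True
assert raced
```

The point of the sketch is only the shape of the contract at the call site: with cmpxchg semantics the caller must treat a non-None, non-error return as "lost the race", whereas an exclusive insert makes that outcome an explicit error path.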