date:20171206

Re: [RFC] Improve subvolume usability for a normal user

2017-12-06 Thread Qu Wenruo

On 2017年12月07日 15:32, Marat Khalili wrote:
> On 07/12/17 08:27, Qu Wenruo wrote:
>> When doing snapshot, btrfs only needs to increase reference of 2nd
>> highest level tree blocks of original snapshot, other than "walking the
>> tree".
>> (If tree root level is 2, then level 2 node is copied, while all
>> reference of level 1 tree blocks get increased)
> 
> Out of curiosity, how does it interacts with nocow files? Does every
> write to these files involves backref walk?

For details I need to check the code.
But at least, partial backref walk must be done.

For "partial" I mean it doesn't need to get every root referring to the
file extent, it only needs to check if the extent is only referred by
current inode and offset.

If any other inode (or even the same inode with different offset) is
found, then the walk is finished.
(The same walk we do to determine the FIEMAP_EXTENT_SHARED flag)

In contract, for qgroup and balance, the full backref walk is done,
where btrfs needs every root involved (for qgroup) or even every inode
and offset involved (for balance).

Thanks,
Qu

> 
> -- 
> 
> With Best Regards,
> Marat Khalili

signature.asc
Description: OpenPGP digital signature

Re: [RFC] Improve subvolume usability for a normal user

2017-12-06 Thread Marat Khalili


On 07/12/17 08:27, Qu Wenruo wrote:

When doing snapshot, btrfs only needs to increase reference of 2nd
highest level tree blocks of original snapshot, other than "walking the
tree".
(If tree root level is 2, then level 2 node is copied, while all
reference of level 1 tree blocks get increased)


Out of curiosity, how does it interacts with nocow files? Does every 
write to these files involves backref walk?


--

With Best Regards,
Marat Khalili
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] Improve subvolume usability for a normal user

2017-12-06 Thread Misono, Tomohiro

On 2017/12/07 11:56, Duncan wrote:
> Austin S. Hemmelgarn posted on Wed, 06 Dec 2017 07:39:56 -0500 as
> excerpted:
> 
>> Somewhat OT, but the only operation that's remotely 'instant' is
>> creating an empty subvolume.  Snapshot creation has to walk the tree in
>> the subvolume being snapshotted, which can take a long time (and as a
>> result of it's implementation, also means BTRFS snapshots are _not_
>> atomic). Subvolume deletion has to do a bunch of cleanup work in the
>> background (though it may be fairly quick if it was a snapshot and the
>> source subvolume hasn't changed much).
> 
> Indeed, while btrfs in general has taken a strategy of making /creating/ 
> snapshots and subvolumes fast, snapshot deletion in particular can take 
> some time[1].
> 
> And in that regard a question just occurred to me regarding this whole 
> very tough problem of a user being able to create but not delete 
> subvolumes and snapshots:
> 
> Given that at least snapshot deletion (not so sure about non-snapshot 
> subvolume deletion, tho I strongly suspect it would depend on the number 
> of cross-subvolume reflinks) is already a task that can take some time, 
> why /not/ just bite the bullet and make the behavior much more like the 
> directory deletion, given that subvolumes already behave much like 
> directories.  Yes, for non-root, that /does/ mean tracing the entire 
> subtree and checking permissions, and yes, that's going to take time and 
> lower performance somewhat, but subvolume and in particular snapshot 
> deletion is already an operation that takes time, so this wouldn't be 
> unduly changing the situation, and it would eliminate the entire class of 
> security issues that come with either asymmetrically restricting deletion 
> (but not creation) to root on the one hand,

> or possible data loss due to 
> allowing a user to delete a subvolume they couldn't delete were it an 
> ordinary directory due to not owning stuff further down the tree.

But, this is also the very reason I'm for "sub del" instead of unlink().
Since snapshot creation won't check the permissions of the containing 
files/dirs,
it can copy a directory which cannot be deleted by the user.
Therefore if we won't allow "sub del" for the user, he couldn't remove the 
snapshot.

> 
> ---
> [1] Based on comments and reports here.  My own use-case doesn't involve 
> snapshots/subvolumes.
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] Improve subvolume usability for a normal user

2017-12-06 Thread Qu Wenruo

On 2017年12月06日 20:39, Austin S. Hemmelgarn wrote:
> Somewhat OT, but the only operation that's remotely 'instant' is
> creating an empty subvolume.  Snapshot creation has to walk the tree in
> the subvolume being snapshotted, which can take a long time (and as a
> result of it's implementation, also means BTRFS snapshots are _not_
> atomic).

Nope, this is not true.

Btrfs from the very beginning is designed to speed up snapshot.
The most obvious evidence is its extent tree design.

When doing snapshot, btrfs only needs to increase reference of 2nd
highest level tree blocks of original snapshot, other than "walking the
tree".
(If tree root level is 2, then level 2 node is copied, while all
reference of level 1 tree blocks get increased)

This design hugely speeds up snapshot creation, making snapshot
obviously faster than any block level snapshot.

Although this design obviously slows down backref walk.
As btrfs must do a full walk until tree root, to see if any tree blocks
between leaf and root is shared with another snapshot.

And for btrfs snapshot atomic thing, it's just because btrfs delayed
snapshot creation to transaction commit time, and buffered write is not
triggered for snapshot creation, so it makes snapshot not so atomic.

Thanks,
Qu

>  Subvolume deletion has to do a bunch of cleanup work in the
> background (though it may be fairly quick if it was a snapshot and the
> source subvolume hasn't changed much).

signature.asc
Description: OpenPGP digital signature

Re: [PATCH] fstests: btrfs: Add test case to check if btrfs can handle full fs trim correctly

2017-12-06 Thread Eryu Guan

On Thu, Dec 07, 2017 at 08:43:43AM +0800, Qu Wenruo wrote:
> Ping.
> 
> Any comment on this?

It's been pushed out to upstream, see commit 88231c0c0b9d

Thanks,
Eryu
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [RFC] Improve subvolume usability for a normal user

2017-12-06 Thread Duncan

Austin S. Hemmelgarn posted on Wed, 06 Dec 2017 07:39:56 -0500 as
excerpted:

> Somewhat OT, but the only operation that's remotely 'instant' is
> creating an empty subvolume.  Snapshot creation has to walk the tree in
> the subvolume being snapshotted, which can take a long time (and as a
> result of it's implementation, also means BTRFS snapshots are _not_
> atomic). Subvolume deletion has to do a bunch of cleanup work in the
> background (though it may be fairly quick if it was a snapshot and the
> source subvolume hasn't changed much).

Indeed, while btrfs in general has taken a strategy of making /creating/ 
snapshots and subvolumes fast, snapshot deletion in particular can take 
some time[1].

And in that regard a question just occurred to me regarding this whole 
very tough problem of a user being able to create but not delete 
subvolumes and snapshots:

Given that at least snapshot deletion (not so sure about non-snapshot 
subvolume deletion, tho I strongly suspect it would depend on the number 
of cross-subvolume reflinks) is already a task that can take some time, 
why /not/ just bite the bullet and make the behavior much more like the 
directory deletion, given that subvolumes already behave much like 
directories.  Yes, for non-root, that /does/ mean tracing the entire 
subtree and checking permissions, and yes, that's going to take time and 
lower performance somewhat, but subvolume and in particular snapshot 
deletion is already an operation that takes time, so this wouldn't be 
unduly changing the situation, and it would eliminate the entire class of 
security issues that come with either asymmetrically restricting deletion 
(but not creation) to root on the one hand, or possible data loss due to 
allowing a user to delete a subvolume they couldn't delete were it an 
ordinary directory due to not owning stuff further down the tree.

---
[1] Based on comments and reports here.  My own use-case doesn't involve 
snapshots/subvolumes.

-- 
Duncan - List replies preferred.   No HTML msgs.
"Every nonfree program has a lord, a master --
and if you use the program, he is your master."  Richard Stallman

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: mail notifications to root once quota limit reached

2017-12-06 Thread Chris Murphy

On Wed, Nov 22, 2017 at 9:39 AM, Chris Murphy  wrote:
> On Wed, Nov 22, 2017 at 8:36 AM, ST  wrote:
>> Hello,
>>
>> is it possible to get mail notifications to root once quota limit is
>> reached? If not should I file a feature request?
>
> You should look at Nagios or OpenNMS, etc. That's where such
> functionality belongs.

I ran across this:

http://docs.gluster.org/en/latest/Administrator%20Guide/Gluster%20On%20ZFS/

It would need modification to do what you want with Btrfs, but maybe
it's a start.

-- 
Chris Murphy
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: system OOM when using ghettoVCB script to backup VMs to nfs server (sync mode) running btrfs

2017-12-06 Thread Taibai Li

I think no special option:
/dev/md127 on /data type btrfs
(rw,noatime,nodiratime,nospace_cache,subvolid=5,subvol=/)

ok, will try laster kernel.

thanks.

On Thu, Dec 7, 2017 at 9:21 AM, Qu Wenruo  wrote:
>
>
> On 2017年12月07日 09:19, Taibai Li wrote:
>> thanks for the quick response,  I tired to test this on 4.4.100
>> kernel, disabled quota :
>> # btrfs qgroup show /data/
>> ERROR: can't list qgroups: quotas not enabled
>>
>> But seems it still OOM after about 7 hours copyed  144G files,  any
>> other ideas?  Maybe I will try to test by disable quota on 4.14 kernel
>> too.
>
> Trying latest kernel is always a good idea.
>
> Despite qgroup, I am not pretty sure which can be the cause.
>
> Is there any special mount option used?
>
> Thanks,
> Qu
>
>>
>> thanks.
>>
>> On Wed, Dec 6, 2017 at 2:25 PM, Qu Wenruo  wrote:
>>>
>>>
>>> On 2017年12月06日 14:22, taibai li wrote:
 Hi Guys,

 I hit the OOM issues with as Box running 4.4.x kernel,   so I tried to
 build a 4.14.3 kernel to try that.  The testbed is :
 NAS box with 2G memory, and a single disk raid ,  I setup a nfs server
 with sync mode,  add the storage on ESXi servers 6.0 and backup all
 the VMs on it by the ghettoVCB script , after about 14 hours,
 Inpot/Output error happened,
 checked the box , found OOM.
 # cat /etc/exports
 "/data/Videos" 
 *(insecure,insecure_locks,no_subtree_check,crossmnt,anonuid=99,anongid=99,root_squash,rw,sync)
 # uname -a
 Linux lzx-314-desk 4.14.3.x86_64.1 #1 SMP Fri Dec 1 01:31:25 UTC 2017
 x86_64 GNU/Linux
 # btrfs fi show /data/
 Label: '43f611ae:data' uuid: 6fefb319-a21d-476e-9642-565e0600a049
 Total devices 1 FS bytes used 292.78GiB
 devid 1 size 1.81TiB used 296.02GiB path /dev/md127

 The stack is :
 Dec 04 23:47:43 lzx-314-desk kernel: nfsd: page allocation stalls for
 621031ms, order:0, mode:0x14000c0(GFP_KERNEL), nodemask=(null)
 Dec 04 23:47:43 lzx-314-desk kernel: nfsd cpuset=/ mems_allowed=0
 Dec 04 23:47:43 lzx-314-desk kernel: CPU: 0 PID: 3376 Comm: nfsd Not
 tainted 4.14.3.x86_64.1 #1
 Dec 04 23:47:43 lzx-314-desk kernel: Hardware name: NETGEAR ReadyNAS
 314/To be filled by O.E.M., BIOS 4.6.5 11/05/2013
 Dec 04 23:47:43 lzx-314-desk kernel: Call Trace:
 Dec 04 23:47:43 lzx-314-desk kernel: dump_stack+0x4d/0x6a
 Dec 04 23:47:43 lzx-314-desk kernel: warn_alloc+0xe3/0x180
 Dec 04 23:47:43 lzx-314-desk kernel: __alloc_pages_nodemask+0xb1e/0xed0
 Dec 04 23:47:43 lzx-314-desk kernel: svc_recv+0x99/0x900
 Dec 04 23:47:43 lzx-314-desk kernel: ? svc_process+0x241/0x690
 Dec 04 23:47:43 lzx-314-desk kernel: nfsd+0xd2/0x150
 Dec 04 23:47:43 lzx-314-desk kernel: kthread+0x11a/0x150
 Dec 04 23:47:43 lzx-314-desk kernel: ? nfsd_destroy+0x60/0x60
 Dec 04 23:47:43 lzx-314-desk kernel: ? kthread_create_on_node+0x40/0x40
 Dec 04 23:47:43 lzx-314-desk kernel: ret_from_fork+0x22/0x30
 Dec 04 23:47:43 lzx-314-desk kernel: readynasd invoked oom-killer:
 gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0,
 oom_score_adj=-1000
 Dec 04 23:47:43 lzx-314-desk kernel: readynasd cpuset=/ mems_allowed=0
 Dec 04 23:47:43 lzx-314-desk kernel: CPU: 3 PID: 3307 Comm: readynasd
 Not tainted 4.14.3.x86_64.1 #1
 Dec 04 23:47:43 lzx-314-desk kernel: Hardware name: NETGEAR ReadyNAS
 314/To be filled by O.E.M., BIOS 4.6.5 11/05/2013
 Dec 04 23:47:43 lzx-314-desk kernel: Call Trace:
 Dec 04 23:47:43 lzx-314-desk kernel: dump_stack+0x4d/0x6a
 Dec 04 23:47:43 lzx-314-desk kernel: dump_header+0x9a/0x21b
 Dec 04 23:47:43 lzx-314-desk kernel: ? pick_next_task_fair+0x1d5/0x4b0
 Dec 04 23:47:43 lzx-314-desk kernel: ? security_capable_noaudit+0x40/0x60
 Dec 04 23:47:43 lzx-314-desk kernel: oom_kill_process+0x216/0x430
 Dec 04 23:47:43 lzx-314-desk kernel: out_of_memory+0xf9/0x2e0
 Dec 04 23:47:43 lzx-314-desk kernel: __alloc_pages_nodemask+0xd6c/0xed0
 Dec 04 23:47:43 lzx-314-desk kernel: __read_swap_cache_async+0x11d/0x190
 Dec 04 23:47:43 lzx-314-desk kernel: read_swap_cache_async+0x17/0x40
 Dec 04 23:47:43 lzx-314-desk kernel: swapin_readahead+0x1f1/0x230
 Dec 04 23:47:43 lzx-314-desk kernel: ? find_get_entry+0x19/0xf0
 Dec 04 23:47:43 lzx-314-desk kernel: ? pagecache_get_page+0x27/0x210
 Dec 04 23:47:43 lzx-314-desk kernel: do_swap_page+0x432/0x590
 Dec 04 23:47:43 lzx-314-desk kernel: ? do_swap_page+0x432/0x590
 Dec 04 23:47:43 lzx-314-desk kernel: ? 
 poll_select_copy_remaining+0x120/0x120
 Dec 04 23:47:43 lzx-314-desk kernel: __handle_mm_fault+0x33e/0xa20
 Dec 04 23:47:43 lzx-314-desk kernel: handle_mm_fault+0x14a/0x1d0
 Dec 04 23:47:43 lzx-314-desk kernel: __do_page_fault+0x212/0x440
 Dec 04 23:47:43 lzx-314-desk kernel: page_fault+0x22/0x30
 Dec 04 23:47:43 lzx-314-desk kernel: RIP:
 0010:copy_user_generic_unrolled+0x89/0xc0
 Dec 04 23:47:43 lzx-314-desk ke

Re: system OOM when using ghettoVCB script to backup VMs to nfs server (sync mode) running btrfs

2017-12-06 Thread Qu Wenruo



On 2017年12月07日 09:19, Taibai Li wrote:
> thanks for the quick response,  I tired to test this on 4.4.100
> kernel, disabled quota :
> # btrfs qgroup show /data/
> ERROR: can't list qgroups: quotas not enabled
> 
> But seems it still OOM after about 7 hours copyed  144G files,  any
> other ideas?  Maybe I will try to test by disable quota on 4.14 kernel
> too.

Trying latest kernel is always a good idea.

Despite qgroup, I am not pretty sure which can be the cause.

Is there any special mount option used?

Thanks,
Qu

> 
> thanks.
> 
> On Wed, Dec 6, 2017 at 2:25 PM, Qu Wenruo  wrote:
>>
>>
>> On 2017年12月06日 14:22, taibai li wrote:
>>> Hi Guys,
>>>
>>> I hit the OOM issues with as Box running 4.4.x kernel,   so I tried to
>>> build a 4.14.3 kernel to try that.  The testbed is :
>>> NAS box with 2G memory, and a single disk raid ,  I setup a nfs server
>>> with sync mode,  add the storage on ESXi servers 6.0 and backup all
>>> the VMs on it by the ghettoVCB script , after about 14 hours,
>>> Inpot/Output error happened,
>>> checked the box , found OOM.
>>> # cat /etc/exports
>>> "/data/Videos" 
>>> *(insecure,insecure_locks,no_subtree_check,crossmnt,anonuid=99,anongid=99,root_squash,rw,sync)
>>> # uname -a
>>> Linux lzx-314-desk 4.14.3.x86_64.1 #1 SMP Fri Dec 1 01:31:25 UTC 2017
>>> x86_64 GNU/Linux
>>> # btrfs fi show /data/
>>> Label: '43f611ae:data' uuid: 6fefb319-a21d-476e-9642-565e0600a049
>>> Total devices 1 FS bytes used 292.78GiB
>>> devid 1 size 1.81TiB used 296.02GiB path /dev/md127
>>>
>>> The stack is :
>>> Dec 04 23:47:43 lzx-314-desk kernel: nfsd: page allocation stalls for
>>> 621031ms, order:0, mode:0x14000c0(GFP_KERNEL), nodemask=(null)
>>> Dec 04 23:47:43 lzx-314-desk kernel: nfsd cpuset=/ mems_allowed=0
>>> Dec 04 23:47:43 lzx-314-desk kernel: CPU: 0 PID: 3376 Comm: nfsd Not
>>> tainted 4.14.3.x86_64.1 #1
>>> Dec 04 23:47:43 lzx-314-desk kernel: Hardware name: NETGEAR ReadyNAS
>>> 314/To be filled by O.E.M., BIOS 4.6.5 11/05/2013
>>> Dec 04 23:47:43 lzx-314-desk kernel: Call Trace:
>>> Dec 04 23:47:43 lzx-314-desk kernel: dump_stack+0x4d/0x6a
>>> Dec 04 23:47:43 lzx-314-desk kernel: warn_alloc+0xe3/0x180
>>> Dec 04 23:47:43 lzx-314-desk kernel: __alloc_pages_nodemask+0xb1e/0xed0
>>> Dec 04 23:47:43 lzx-314-desk kernel: svc_recv+0x99/0x900
>>> Dec 04 23:47:43 lzx-314-desk kernel: ? svc_process+0x241/0x690
>>> Dec 04 23:47:43 lzx-314-desk kernel: nfsd+0xd2/0x150
>>> Dec 04 23:47:43 lzx-314-desk kernel: kthread+0x11a/0x150
>>> Dec 04 23:47:43 lzx-314-desk kernel: ? nfsd_destroy+0x60/0x60
>>> Dec 04 23:47:43 lzx-314-desk kernel: ? kthread_create_on_node+0x40/0x40
>>> Dec 04 23:47:43 lzx-314-desk kernel: ret_from_fork+0x22/0x30
>>> Dec 04 23:47:43 lzx-314-desk kernel: readynasd invoked oom-killer:
>>> gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0,
>>> oom_score_adj=-1000
>>> Dec 04 23:47:43 lzx-314-desk kernel: readynasd cpuset=/ mems_allowed=0
>>> Dec 04 23:47:43 lzx-314-desk kernel: CPU: 3 PID: 3307 Comm: readynasd
>>> Not tainted 4.14.3.x86_64.1 #1
>>> Dec 04 23:47:43 lzx-314-desk kernel: Hardware name: NETGEAR ReadyNAS
>>> 314/To be filled by O.E.M., BIOS 4.6.5 11/05/2013
>>> Dec 04 23:47:43 lzx-314-desk kernel: Call Trace:
>>> Dec 04 23:47:43 lzx-314-desk kernel: dump_stack+0x4d/0x6a
>>> Dec 04 23:47:43 lzx-314-desk kernel: dump_header+0x9a/0x21b
>>> Dec 04 23:47:43 lzx-314-desk kernel: ? pick_next_task_fair+0x1d5/0x4b0
>>> Dec 04 23:47:43 lzx-314-desk kernel: ? security_capable_noaudit+0x40/0x60
>>> Dec 04 23:47:43 lzx-314-desk kernel: oom_kill_process+0x216/0x430
>>> Dec 04 23:47:43 lzx-314-desk kernel: out_of_memory+0xf9/0x2e0
>>> Dec 04 23:47:43 lzx-314-desk kernel: __alloc_pages_nodemask+0xd6c/0xed0
>>> Dec 04 23:47:43 lzx-314-desk kernel: __read_swap_cache_async+0x11d/0x190
>>> Dec 04 23:47:43 lzx-314-desk kernel: read_swap_cache_async+0x17/0x40
>>> Dec 04 23:47:43 lzx-314-desk kernel: swapin_readahead+0x1f1/0x230
>>> Dec 04 23:47:43 lzx-314-desk kernel: ? find_get_entry+0x19/0xf0
>>> Dec 04 23:47:43 lzx-314-desk kernel: ? pagecache_get_page+0x27/0x210
>>> Dec 04 23:47:43 lzx-314-desk kernel: do_swap_page+0x432/0x590
>>> Dec 04 23:47:43 lzx-314-desk kernel: ? do_swap_page+0x432/0x590
>>> Dec 04 23:47:43 lzx-314-desk kernel: ? 
>>> poll_select_copy_remaining+0x120/0x120
>>> Dec 04 23:47:43 lzx-314-desk kernel: __handle_mm_fault+0x33e/0xa20
>>> Dec 04 23:47:43 lzx-314-desk kernel: handle_mm_fault+0x14a/0x1d0
>>> Dec 04 23:47:43 lzx-314-desk kernel: __do_page_fault+0x212/0x440
>>> Dec 04 23:47:43 lzx-314-desk kernel: page_fault+0x22/0x30
>>> Dec 04 23:47:43 lzx-314-desk kernel: RIP:
>>> 0010:copy_user_generic_unrolled+0x89/0xc0
>>> Dec 04 23:47:43 lzx-314-desk kernel: RSP: :c9cdbd70 EFLAGS: 
>>> 00010202
>>> Dec 04 23:47:43 lzx-314-desk kernel: RAX:  RBX:
>>> 0008 RCX: 0001
>>> Dec 04 23:47:43 lzx-314-desk kernel: RDX:  RSI:
>>> c9cdbdd8 RDI: 7ff45e7fba80
>>> Dec 04 23:47:43 lzx-314-de

Re: system OOM when using ghettoVCB script to backup VMs to nfs server (sync mode) running btrfs

2017-12-06 Thread Taibai Li

thanks for the quick response,  I tired to test this on 4.4.100
kernel, disabled quota :
# btrfs qgroup show /data/
ERROR: can't list qgroups: quotas not enabled

But seems it still OOM after about 7 hours copyed  144G files,  any
other ideas?  Maybe I will try to test by disable quota on 4.14 kernel
too.

thanks.

On Wed, Dec 6, 2017 at 2:25 PM, Qu Wenruo  wrote:
>
>
> On 2017年12月06日 14:22, taibai li wrote:
>> Hi Guys,
>>
>> I hit the OOM issues with as Box running 4.4.x kernel,   so I tried to
>> build a 4.14.3 kernel to try that.  The testbed is :
>> NAS box with 2G memory, and a single disk raid ,  I setup a nfs server
>> with sync mode,  add the storage on ESXi servers 6.0 and backup all
>> the VMs on it by the ghettoVCB script , after about 14 hours,
>> Inpot/Output error happened,
>> checked the box , found OOM.
>> # cat /etc/exports
>> "/data/Videos" 
>> *(insecure,insecure_locks,no_subtree_check,crossmnt,anonuid=99,anongid=99,root_squash,rw,sync)
>> # uname -a
>> Linux lzx-314-desk 4.14.3.x86_64.1 #1 SMP Fri Dec 1 01:31:25 UTC 2017
>> x86_64 GNU/Linux
>> # btrfs fi show /data/
>> Label: '43f611ae:data' uuid: 6fefb319-a21d-476e-9642-565e0600a049
>> Total devices 1 FS bytes used 292.78GiB
>> devid 1 size 1.81TiB used 296.02GiB path /dev/md127
>>
>> The stack is :
>> Dec 04 23:47:43 lzx-314-desk kernel: nfsd: page allocation stalls for
>> 621031ms, order:0, mode:0x14000c0(GFP_KERNEL), nodemask=(null)
>> Dec 04 23:47:43 lzx-314-desk kernel: nfsd cpuset=/ mems_allowed=0
>> Dec 04 23:47:43 lzx-314-desk kernel: CPU: 0 PID: 3376 Comm: nfsd Not
>> tainted 4.14.3.x86_64.1 #1
>> Dec 04 23:47:43 lzx-314-desk kernel: Hardware name: NETGEAR ReadyNAS
>> 314/To be filled by O.E.M., BIOS 4.6.5 11/05/2013
>> Dec 04 23:47:43 lzx-314-desk kernel: Call Trace:
>> Dec 04 23:47:43 lzx-314-desk kernel: dump_stack+0x4d/0x6a
>> Dec 04 23:47:43 lzx-314-desk kernel: warn_alloc+0xe3/0x180
>> Dec 04 23:47:43 lzx-314-desk kernel: __alloc_pages_nodemask+0xb1e/0xed0
>> Dec 04 23:47:43 lzx-314-desk kernel: svc_recv+0x99/0x900
>> Dec 04 23:47:43 lzx-314-desk kernel: ? svc_process+0x241/0x690
>> Dec 04 23:47:43 lzx-314-desk kernel: nfsd+0xd2/0x150
>> Dec 04 23:47:43 lzx-314-desk kernel: kthread+0x11a/0x150
>> Dec 04 23:47:43 lzx-314-desk kernel: ? nfsd_destroy+0x60/0x60
>> Dec 04 23:47:43 lzx-314-desk kernel: ? kthread_create_on_node+0x40/0x40
>> Dec 04 23:47:43 lzx-314-desk kernel: ret_from_fork+0x22/0x30
>> Dec 04 23:47:43 lzx-314-desk kernel: readynasd invoked oom-killer:
>> gfp_mask=0x14200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0,
>> oom_score_adj=-1000
>> Dec 04 23:47:43 lzx-314-desk kernel: readynasd cpuset=/ mems_allowed=0
>> Dec 04 23:47:43 lzx-314-desk kernel: CPU: 3 PID: 3307 Comm: readynasd
>> Not tainted 4.14.3.x86_64.1 #1
>> Dec 04 23:47:43 lzx-314-desk kernel: Hardware name: NETGEAR ReadyNAS
>> 314/To be filled by O.E.M., BIOS 4.6.5 11/05/2013
>> Dec 04 23:47:43 lzx-314-desk kernel: Call Trace:
>> Dec 04 23:47:43 lzx-314-desk kernel: dump_stack+0x4d/0x6a
>> Dec 04 23:47:43 lzx-314-desk kernel: dump_header+0x9a/0x21b
>> Dec 04 23:47:43 lzx-314-desk kernel: ? pick_next_task_fair+0x1d5/0x4b0
>> Dec 04 23:47:43 lzx-314-desk kernel: ? security_capable_noaudit+0x40/0x60
>> Dec 04 23:47:43 lzx-314-desk kernel: oom_kill_process+0x216/0x430
>> Dec 04 23:47:43 lzx-314-desk kernel: out_of_memory+0xf9/0x2e0
>> Dec 04 23:47:43 lzx-314-desk kernel: __alloc_pages_nodemask+0xd6c/0xed0
>> Dec 04 23:47:43 lzx-314-desk kernel: __read_swap_cache_async+0x11d/0x190
>> Dec 04 23:47:43 lzx-314-desk kernel: read_swap_cache_async+0x17/0x40
>> Dec 04 23:47:43 lzx-314-desk kernel: swapin_readahead+0x1f1/0x230
>> Dec 04 23:47:43 lzx-314-desk kernel: ? find_get_entry+0x19/0xf0
>> Dec 04 23:47:43 lzx-314-desk kernel: ? pagecache_get_page+0x27/0x210
>> Dec 04 23:47:43 lzx-314-desk kernel: do_swap_page+0x432/0x590
>> Dec 04 23:47:43 lzx-314-desk kernel: ? do_swap_page+0x432/0x590
>> Dec 04 23:47:43 lzx-314-desk kernel: ? poll_select_copy_remaining+0x120/0x120
>> Dec 04 23:47:43 lzx-314-desk kernel: __handle_mm_fault+0x33e/0xa20
>> Dec 04 23:47:43 lzx-314-desk kernel: handle_mm_fault+0x14a/0x1d0
>> Dec 04 23:47:43 lzx-314-desk kernel: __do_page_fault+0x212/0x440
>> Dec 04 23:47:43 lzx-314-desk kernel: page_fault+0x22/0x30
>> Dec 04 23:47:43 lzx-314-desk kernel: RIP:
>> 0010:copy_user_generic_unrolled+0x89/0xc0
>> Dec 04 23:47:43 lzx-314-desk kernel: RSP: :c9cdbd70 EFLAGS: 
>> 00010202
>> Dec 04 23:47:43 lzx-314-desk kernel: RAX:  RBX:
>> 0008 RCX: 0001
>> Dec 04 23:47:43 lzx-314-desk kernel: RDX:  RSI:
>> c9cdbdd8 RDI: 7ff45e7fba80
>> Dec 04 23:47:43 lzx-314-desk kernel: RBP: c9cdbee8 R08:
>>  R09: 0104
>> Dec 04 23:47:43 lzx-314-desk kernel: R10: c9cdbd78 R11:
>> 0104 R12: 
>> Dec 04 23:47:43 lzx-314-desk kernel: R13: c9cdbdc0 R14:
>> 7ff45e7fba80 R15: c9cdbdc0
>> Dec 04 23

Re: [PATCH] fstests: btrfs: Add test case to check if btrfs can handle full fs trim correctly

2017-12-06 Thread Qu Wenruo

Ping.

Any comment on this?

Thanks,
Qu

On 2017年11月29日 14:14, Qu Wenruo wrote:
> Ancient commit f4c697e6406d ("btrfs: return EINVAL if start > total_bytes in
> fitrim ioctl") introduced a regression where btrfs may fail to trim any
> free space in existing block groups.
> 
> It's caused by confusion with btrfs_super_block->total_bytes and btrfs
> logical address space.
> Unlike physical address, any aligned bytenr in range [0, U64_MAX) is
> valid in btrfs logical address space, and it's chunk mapping mechanism
> of btrfs to handle the logical<->physical mapping.
> 
> The test case will craft a btrfs with the following features:
> 0) Single data/meta profile
>Make trimmed bytes reporting and chunk allocation more predictable.
> 
> 1) All chunks start beyond super_block->total_bytes (1G)
>By relocating these blocks several times.
> 
> 2) Unallocated space is less than 50% of the whole fs
> 
> 3) Fragmented data chunks
>Data chunks will be full of fragments, 50% of data chunks will be
>free space.
> 
> So in theory fstrim should be able to trim over 50% space of the fs.
> (after fix, 64% of the fs can be trimmed)
> While the regression makes btrfs only able to trim unallocated space,
> which is less than 50% of the total space.
> (without fix, it's only 31%)
> 
> Fixed by patch named "btrfs: Ensure btrfs_trim_fs can trim the whole fs".
> 
> Signed-off-by: Qu Wenruo 
> ---
>  tests/btrfs/155 | 120 
> 
>  tests/btrfs/155.out |   2 +
>  tests/btrfs/group   |   1 +
>  3 files changed, 123 insertions(+)
>  create mode 100755 tests/btrfs/155
>  create mode 100644 tests/btrfs/155.out
> 
> diff --git a/tests/btrfs/155 b/tests/btrfs/155
> new file mode 100755
> index ..6918f093
> --- /dev/null
> +++ b/tests/btrfs/155
> @@ -0,0 +1,120 @@
> +#! /bin/bash
> +# FS QA Test 155
> +#
> +# Check if btrfs can correctly trim free space in block groups
> +#
> +# An ancient regression prevent btrfs from trimming free space inside
> +# existing block groups, if bytenr of block group starts beyond
> +# btrfs_super_block->total_bytes.
> +# However all bytenr in btrfs is in btrfs logical address space,
> +# where any bytenr in range [0, U64_MAX] is valid.
> +#
> +# Fixed by patch named "btrfs: Ensure btrfs_trim_fs can trim the whole fs".
> +#
> +#---
> +# Copyright (c) 2017 SUSE Linux Products GmbH.  All Rights Reserved.
> +#
> +# This program is free software; you can redistribute it and/or
> +# modify it under the terms of the GNU General Public License as
> +# published by the Free Software Foundation.
> +#
> +# This program is distributed in the hope that it would be useful,
> +# but WITHOUT ANY WARRANTY; without even the implied warranty of
> +# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> +# GNU General Public License for more details.
> +#
> +# You should have received a copy of the GNU General Public License
> +# along with this program; if not, write the Free Software Foundation,
> +# Inc.,  51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
> +#---
> +#
> +
> +seq=`basename $0`
> +seqres=$RESULT_DIR/$seq
> +echo "QA output created by $seq"
> +
> +here=`pwd`
> +tmp=/tmp/$$
> +status=1 # failure is the default!
> +trap "_cleanup; exit \$status" 0 1 2 3 15
> +
> +_cleanup()
> +{
> + cd /
> + rm -f $tmp.*
> +}
> +
> +# get standard environment, filters and checks
> +. ./common/rc
> +. ./common/filter
> +
> +# remove previous $seqres.full before test
> +rm -f $seqres.full
> +
> +# real QA test starts here
> +
> +# Modify as appropriate.
> +_supported_fs btrfs
> +_supported_os Linux
> +_require_scratch
> +_require_fstrim
> +
> +# 1024fs size
> +fs_size=$((1024 * 1024 * 1024))
> +
> +# Use small files to fill half of the fs
> +file_size=$(( 1024 * 1024 ))
> +nr_files=$(( $fs_size / $file_size / 2))
> +
> +# Force to use single data and meta profile.
> +# Since the test relies on fstrim output, which will differ for different
> +# profiles
> +_scratch_mkfs -b $fs_size -m single -d single > /dev/null
> +_scratch_mount
> +
> +_require_batched_discard "$SCRATCH_MNT"
> +
> +for n in $(seq -w 0 $(( $nr_files - 1))); do
> + $XFS_IO_PROG -f -c "pwrite 0 $file_size" "$SCRATCH_MNT/file_$n" \
> + > /dev/null
> +done
> +
> +# Flush all buffer data into disk, to trigger chunk allocation
> +sync
> +
> +# Now we have take at least 50% of the filesystem, relocate all chunks twice
> +# so all chunks will start after 1G in logical space.
> +# (Btrfs chunk allocation will not rewind to reuse lower space)
> +_run_btrfs_util_prog balance start --full-balance "$SCRATCH_MNT"
> +
> +# To avoid possible false ENOSPC alert on v4.15-rc1, seems to be a
> +# reserved space related bug (maybe related to outstanding space rework?),
> +# but that's another story.
> +sync
> +
> +_run_btrfs_util_pro

Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-06 Thread Dave Chinner

On Wed, Dec 06, 2017 at 06:06:48AM -0800, Matthew Wilcox wrote:
> On Wed, Dec 06, 2017 at 07:44:04PM +1100, Dave Chinner wrote:
> > On Tue, Dec 05, 2017 at 08:45:49PM -0800, Matthew Wilcox wrote:
> > > That said, using xa_cmpxchg() in the dquot code looked like the right
> > > thing to do?  Since we'd dropped the qi mutex and the ILOCK, it looks
> > > entirely reasonable for another thread to come in and set up the dquot.
> > > But I'm obviously quite ignorant of the XFS internals, so maybe there's
> > > something else going on that makes this essentially a "can't happen".
> > 
> > It's no different to the inode cache code, which drops the RCU
> > lock on lookup miss, instantiates the new inode (maybe reading it
> > off disk), then locks the tree and attempts to insert it. Both cases
> > use "insert if empty, otherwise retry lookup from start" semantics.
> 
> Ah.  I had my focus set a little narrow on the inode cache code and didn't
> recognise the pattern.
> 
> Why do you sleep for one jiffy after encountering a miss, then seeing
> someone else insert the inode for you?

The sleep is a backoff that allows whatever we raced with to
complete, be it a hit that raced with an inode being reclaimed and
removed, or a miss that raced with another insert. Ideally we'd
sleep on the XFS_INEW bit, similar to the vfs I_NEW flag, but it's
not quite that simple with the reclaim side of things...

> > cmpxchg is for replacing a known object in a store - it's not really
> > intended for doing initial inserts after a lookup tells us there is
> > nothing in the store.  The radix tree "insert only if empty" makes
> > sense here, because it naturally takes care of lookup/insert races
> > via the -EEXIST mechanism.
> > 
> > I think that providing xa_store_excl() (which would return -EEXIST
> > if the entry is not empty) would be a better interface here, because
> > it matches the semantics of lookup cache population used all over
> > the kernel
> 
> I'm not thrilled with xa_store_excl(), but I need to think about that
> a bit more.

Not fussed about the name - I just think we need a function that
matches the insert semantics of the code

> > > I'm quite happy to have normal API variants that don't save/restore
> > > interrupts.  Just need to come up with good names ... I don't think
> > > xa_store_noirq() is a good name, but maybe you do?
> > 
> > I'd prefer not to have to deal with such things at all. :P
> > 
> > How many subsystems actually require irq safety in the XA locking
> > code? Make them use irqsafe versions, not make everyone else use
> > "noirq" versions, as is the convention for the rest of the kernel
> > code
> 
> Hard to say how many existing radix tree users require the irq safety.

The mapping tree requires it because it gets called from IO
completion contexts to clear page writeback state, but I don't know
about any of the others.

> Also hard to say how many potential users (people currently using
> linked lists, people using resizable arrays, etc) need irq safety.
> My thinking was "make it safe by default and let people who know better
> have a way to opt out", but there's definitely something to be said for
> "make it fast by default and let people who need the unusual behaviour
> type those extra few letters".
> 
> So, you're arguing for providing xa_store(), xa_store_irq(), xa_store_bh()
> and xa_store_irqsafe()?  (at least on demand, as users come to light?)
> At least the read side doesn't require any variants; everybody can use
> RCU for read side protection.

That would follow the pattern of the rest of the kernel APIs, though
I think it might be cleaner to simply state the locking requirement
to xa_init() and keep all those details completely internal rather
than encoding them into API calls. After all, the "irqsafe-ness" of
the locking needs to be consistent across the entire XA instance

> ("safe", not "save" because I wouldn't make the caller provide the
> "flags" argument).
> 
> > > At least, not today.  One of the future plans is to allow xa_nodes to
> > > be allocated from ZONE_MOVABLE.  In order to do that, we have to be
> > > able to tell which lock protects any given node.  With the XArray,
> > > we can find that out (xa_node->root->xa_lock); with the radix tree,
> > > we don't even know what kind of lock protects the tree.
> > 
> > Yup, this is a prime example of why we shouldn't be creating
> > external dependencies by smearing the locking context outside the XA
> > structure itself. It's not a stretch to see something like a
> > ZONE_MOVEABLE dependency because some other object indexed in a XA
> > is stored in the same page as the xa_node that points to it, and
> > both require the same xa_lock to move/update...
> 
> That is a bit of a stretch.  Christoph Lameter and I had a discussion about it
> here: https://www.spinics.net/lists/linux-mm/msg122902.html
> 
> There's no situation where you need to acquire two locks in order to
> free an object;

ZONE_MOVEABLE is for m

Re: [PATCH] btrfs: tree-checker: use %zu format string for size_t

2017-12-06 Thread Qu Wenruo



On 2017年12月06日 22:18, Arnd Bergmann wrote:
> The return value of sizeof() is of type size_t, so we must print it
> using the %z format modifier rather than %l to avoid this warning
> on some architectures:
> 
> fs/btrfs/tree-checker.c: In function 'check_dir_item':
> fs/btrfs/tree-checker.c:273:50: error: format '%lu' expects argument of type 
> 'long unsigned int', but argument 5 has type 'u32' {aka 'unsigned int'} 
> [-Werror=format=]

Any idea about which architecture will cause such warning?
On x86_64 I always fail to get such warning.

> 
> Fixes: 005887f2e3e0 ("btrfs: tree-checker: Add checker for dir item")

Reviewed-by: Qu Wenruo 

Thanks,
Qu

> Signed-off-by: Arnd Bergmann 
> ---
>  fs/btrfs/tree-checker.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index 66dac0a4b01f..7c55e3ba5a6c 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -270,7 +270,7 @@ static int check_dir_item(struct btrfs_root *root,
>   /* header itself should not cross item boundary */
>   if (cur + sizeof(*di) > item_size) {
>   dir_item_err(root, leaf, slot,
> - "dir item header crosses item boundary, have %lu boundary %u",
> + "dir item header crosses item boundary, have %zu boundary %u",
>   cur + sizeof(*di), item_size);
>   return -EUCLEAN;
>   }
> 



signature.asc
Description: OpenPGP digital signature

Re: [PATCH 3/3] Btrfs: make raid6 rebuild retry more

2017-12-06 Thread Liu Bo

On Wed, Dec 06, 2017 at 08:11:30AM +0800, Qu Wenruo wrote:
> 
> 
> On 2017年12月06日 06:55, Liu Bo wrote:
> > On Tue, Dec 05, 2017 at 07:09:25PM +0100, David Sterba wrote:
> >> On Tue, Dec 05, 2017 at 04:07:35PM +0800, Qu Wenruo wrote:
>  @@ -2166,11 +2166,21 @@ int raid56_parity_recover(struct btrfs_fs_info 
>  *fs_info, struct bio *bio,
>   }
>   
>   /*
>  - * reconstruct from the q stripe if they are
>  - * asking for mirror 3
>  + * Loop retry:
>  + * for 'mirror == 2', reconstruct from all other stripes.
> >>>
> >>> What about using macro to makes the reassemble method more human readable?
> >>
> >> Yeah, that's definetelly needed and should be based on
> >> BTRFS_MAX_MIRRORS, not just hardcoded to 3.
> > 
> > OK.
> > 
> > In case of raid5/6, BTRFS_MAX_MIRRORS is an abused name, it's more a
> > raid1/10 concept, either BTRFS_RAID56_FULL_REBUILD or
> > BTRFS_RAID56_FULL_CHK is better to me, which one do you guys like?
> 
> For mirror > 2 case, the mirror_num is no longer a single indicator, but
> a ranged iterator for later rebuild retries.
> 
> Something like set_raid_fail_from_mirror_num() seems better to me.

I feel like having the logic open-code'd plus necessary comments is
probably better as we can see which stripe failb is.

For those who are new to this raid6 retry logic are likely to feel
confused by a helper function like set_raid_fail_from_mirror_num() and
will go check the helper function about the retry logic.

Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 00/73] XArray version 4

2017-12-06 Thread Matthew Wilcox

On Wed, Dec 06, 2017 at 04:58:29PM -0700, Ross Zwisler wrote:
> Maybe I missed this from a previous version, but can you explain the
> motivation for replacing the radix tree with an xarray?  (I think this should
> probably still be part of the cover letter?)  Do we have a performance problem
> we need to solve?  A code complexity issue we need to solve?  Something else?

Sure!  Something else I screwed up in the v4 announcement ... I'll
need it again for v5, so here's a quick update of the v1 announcement's
justification:

I wrote the xarray to replace the radix tree with a better API based
on observing how programmers are currently using the radix tree, and
on how (and why) they aren't.  Conceptually, an xarray is an array of
ULONG_MAX pointers which is initially full of NULL pointers.

Improvements the xarray has over the radix tree:

 - The radix tree provides operations like other trees do; 'insert' and
   'delete'.  But what users really want is an automatically resizing
   array, and so it makes more sense to give users an API that is like
   an array -- 'load' and 'store'.
 - Locking is part of the API.  This simplifies a lot of users who
   formerly had to manage their own locking just for the radix tree.
   It also improves code generation as we can now tell RCU that we're
   holding a lock and it doesn't need to generate as much fencing code.
   The other advantage is that tree nodes can be moved (not yet
   implemented).
 - GFP flags are now parameters to calls which may need to allocate
   memory.  The radix tree forced users to decide what the allocation
   flags would be at creation time.  It's much clearer to specify them
   at allocation time.  I know the MM people disapprove of the radix
   tree using the top bits of the GFP flags for its own purpose, so
   they'll like this aspect.
 - Memory is not preloaded; we don't tie up dozens of pages on the
   off chance that the slab allocator fails.  Instead, we drop the lock,
   allocate a new node and retry the operation.
 - The xarray provides a conditional-replace operation.  The radix tree
   forces users to roll their own (and at least four have).
 - Iterators now take a 'max' parameter.  That simplifies many users and
   will reduce the amount of iteration done.
 - Iteration can proceed backwards.  We only have one user for this, but
   since it's called as part of the pagefault readahead algorithm, that
   seemed worth mentioning.
 - RCU-protected pointers are not exposed as part of the API.  There are
   some fun bugs where the page cache forgets to use rcu_dereference()
   in the current codebase.
 - Any function which wants it can now call the update_node() callback.
   There were a few places missing that I noticed as part of this rewrite.
 - Exceptional entries may now be BITS_PER_LONG-1 in size, rather than the
   BITS_PER_LONG-2 that they had in the radix tree.  That gives us the
   extra bit we need to put huge page swap entries in the page cache.

The API comes in two parts, normal and advanced.  The normal API takes
care of the locking and memory allocation for you.  You can get the
value of a pointer by calling xa_load() and set the value of a pointer by
calling xa_store().  You can conditionally update the value of a pointer
by calling xa_cmpxchg().  Each pointer which isn't NULL can be tagged
with up to 3 bits of extra information, accessed through xa_get_tag(),
xa_set_tag() and xa_clear_tag().  You can copy batches of pointers out
of the array by calling xa_get_entries() or xa_get_tagged().  You can
iterate over pointers in the array by calling xa_find(), xa_find_after()
or xa_for_each().

The advanced API allows users to build their own operations.  You have
to take care of your own locking and handle memory allocation failures.
Most of the advanced operations are based around the xa_state which
keeps state between sub-operations.  Read the xarray.h header file for
more information on the advanced API, and see the implementation of the
normal API for examples of how to use the advanced API.

Those familiar with the radix tree may notice certain similarities between
the implementation of the xarray and the radix tree.  That's entirely
intentional, but the implementation will certainly adapt in the future.
For example, one of the impediments I see to using xarrays instead of
kvmalloced arrays is memory consumption, so I have a couple of ideas to
reduce memory usage for smaller arrays.

I have reimplementated the IDR and the IDA based on the xarray.  They are
roughly the same complexity as they were when implemented on top of the
radix tree (although much less intertwined).

When converting code from the radix tree to the xarray, the biggest thing
to bear in mind is that 'store' overwrites anything which happens to be
in the xarray.  Just like the assignment operator.  The equivalent to
the insert operation is to replace NULL with the new value.

A quick reference guide to help when converting radix tree code.
Functions which

Re: [PATCH v4 00/73] XArray version 4

2017-12-06 Thread Ross Zwisler

On Tue, Dec 05, 2017 at 04:40:46PM -0800, Matthew Wilcox wrote:
> From: Matthew Wilcox 
> 
> I looked through some notes and decided this was version 4 of the XArray.
> Last posted two weeks ago, this version includes a *lot* of changes.
> I'd like to thank Dave Chinner for his feedback, encouragement and
> distracting ideas for improvement, which I'll get to once this is merged.
> 
> Highlights:
>  - Over 2000 words of documentation in patch 8!  And lots more kernel-doc.
>  - The page cache is now fully converted to the XArray.
>  - Many more tests in the test-suite.
> 
> This patch set is not for applying.  0day is still reporting problems,
> and I'd feel bad for eating someone's data.  These patches apply on top
> of a set of prepatory patches which just aren't interesting.  If you
> want to see the patches applied to a tree, I suggest pulling my git tree:
> http://git.infradead.org/users/willy/linux-dax.git/shortlog/refs/heads/xarray-2017-12-04
> I also left out the idr_preload removals.  They're still in the git tree,
> but I'm not looking for feedback on them.

Hey Matthew,

Maybe I missed this from a previous version, but can you explain the
motivation for replacing the radix tree with an xarray?  (I think this should
probably still be part of the cover letter?)  Do we have a performance problem
we need to solve?  A code complexity issue we need to solve?  Something else?

- Ross
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] fstests: btrfs: reproduce a read failure on raid6 setup

2017-12-06 Thread Liu Bo

On Wed, Dec 06, 2017 at 04:56:11PM +0800, Eryu Guan wrote:
> On Mon, Dec 04, 2017 at 03:33:23PM -0700, Liu Bo wrote:
> > This test case is to reproduce a bug of raid6 reconstruction process.
> > 
> > The kernel fix are
> > Btrfs: do not merge rbios if their fail stripe index are not identical
> > Btrfs: make raid6 rebuild retry more
> > 
> > Signed-off-by: Liu Bo 
> 
> Test failed as expected with v4.15-rc2 kernel and passed with above
> patches applied.
> 
> I fixed two trailing whitespace issues and some of the indentions to
> avoid too-long line, also replaced 'repair' group with a new 'raid'
> group. I think 'repair' group is meant for testing filesystem repair
> tools, not the internal raid repair mechanism, so a new 'raid' group
> looks better to me, as we're having more raid-specific tests recently.

Sounds good to me, thanks for the efforts.

Thanks,

-liubo
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 06/11] writeback: add counters for metadata usage

2017-12-06 Thread Johannes Weiner

On Wed, Dec 06, 2017 at 03:18:35PM -0500, Josef Bacik wrote:
> On Mon, Dec 04, 2017 at 02:06:30PM +0100, Jan Kara wrote:
> > On Wed 22-11-17 16:16:01, Josef Bacik wrote:
> > > diff --git a/mm/util.c b/mm/util.c
> > > index 34e57fae959d..681d62631ee0 100644
> > > --- a/mm/util.c
> > > +++ b/mm/util.c
> > > @@ -616,6 +616,7 @@ int __vm_enough_memory(struct mm_struct *mm, long 
> > > pages, int cap_sys_admin)
> > >   if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
> > >   free = global_zone_page_state(NR_FREE_PAGES);
> > >   free += global_node_page_state(NR_FILE_PAGES);
> > > + free += global_node_page_state(NR_METADATA_BYTES) >> PAGE_SHIFT;
> > 
> > 
> > I'm not really sure this is OK. It depends on whether mm is really able to
> > reclaim these pages easily enough... Summon mm people for help :)
> > 
> 
> Well we count NR_SLAB_RECLAIMABLE here, so it's no different than that.  The
> point is that it's theoretically reclaimable, and we should at least try.

I agree with including metadata in the equation. The (somewhat dusty)
overcommit code is mostly for containing swap storms, which is why it
adds up all the reclaimable pools that aren't backed by swap. The
metadata pool belongs in that category.

> > > @@ -3812,6 +3813,7 @@ static inline unsigned long 
> > > node_unmapped_file_pages(struct pglist_data *pgdat)
> > >  static unsigned long node_pagecache_reclaimable(struct pglist_data 
> > > *pgdat)
> > >  {
> > >   unsigned long nr_pagecache_reclaimable;
> > > + unsigned long nr_metadata_reclaimable;
> > >   unsigned long delta = 0;
> > >  
> > >   /*
> > > @@ -3833,7 +3835,20 @@ static unsigned long 
> > > node_pagecache_reclaimable(struct pglist_data *pgdat)
> > >   if (unlikely(delta > nr_pagecache_reclaimable))
> > >   delta = nr_pagecache_reclaimable;
> > >  
> > > - return nr_pagecache_reclaimable - delta;
> > > + nr_metadata_reclaimable =
> > > + node_page_state(pgdat, NR_METADATA_BYTES) >> PAGE_SHIFT;
> > > + /*
> > > +  * We don't do writeout through the shrinkers so subtract any
> > > +  * dirty/writeback metadata bytes from the reclaimable count.
> > > +  */
> > > + if (nr_metadata_reclaimable) {
> > > + unsigned long unreclaimable =
> > > + node_page_state(pgdat, NR_METADATA_DIRTY_BYTES) +
> > > + node_page_state(pgdat, NR_METADATA_WRITEBACK_BYTES);
> > > + unreclaimable >>= PAGE_SHIFT;
> > > + nr_metadata_reclaimable -= unreclaimable;
> > > + }
> > > + return nr_metadata_reclaimable + nr_pagecache_reclaimable - delta;
> > >  }
> > 
> > Ditto as with __vm_enough_memory(). In particular I'm unsure whether the
> > watermarks like min_unmapped_pages or min_slab_pages would still work as
> > designed.
> > 
> 
> Yeah agreed I'd like an MM person's thoughts on this as well.  We don't count
> SLAB_RECLAIMABLE here, but that's because it's just not related to pagecache. 
>  I
> guess it only matters for node reclaim and we have our node reclaim stuff 
> turned
> off, which means it doesn't help us anyway, so I'm happy to just drop it and 
> let
> somebody who cares about node reclaim think about it later ;).  Thanks,

Few people care about node reclaim at this point, see 4f9b16a64753
("mm: disable zone_reclaim_mode by default"), and it's honestly a bit
baffling why we made min_slab_ratio a tunable in the first place. Who
knows how/if anybody relies on that behavior. I'd just leave it alone.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v2 06/11] writeback: add counters for metadata usage

2017-12-06 Thread Josef Bacik

On Mon, Dec 04, 2017 at 02:06:30PM +0100, Jan Kara wrote:
> On Wed 22-11-17 16:16:01, Josef Bacik wrote:
> > From: Josef Bacik 
> > 
> > Btrfs has no bounds except memory on the amount of dirty memory that we 
> > have in
> > use for metadata.  Historically we have used a special inode so we could 
> > take
> > advantage of the balance_dirty_pages throttling that comes with using 
> > pagecache.
> > However as we'd like to support different blocksizes it would be nice to not
> > have to rely on pagecache, but still get the balance_dirty_pages throttling
> > without having to do it ourselves.
> > 
> > So introduce *METADATA_DIRTY_BYTES and *METADATA_WRITEBACK_BYTES.  These are
> > zone and bdi_writeback counters to keep track of how many bytes we have in
> > flight for METADATA.  We need to count in bytes as blocksizes could be
> > percentages of pagesize.  We simply convert the bytes to number of pages 
> > where
> > it is needed for the throttling.
> > 
> > Also introduce NR_METADATA_BYTES so we can keep track of the total amount of
> > pages used for metadata on the system.  This is also needed so things like 
> > dirty
> > throttling know that this is dirtyable memory as well and easily reclaimed.
> 
> I'll defer to mm guys for final decision but the fact is the memory for
> metadata is likely to be allocated from some slab cache and that actually
> goes against the 'easily reclaimed' statement. Granted these are going to
> be relatively large objects (1k at least I assume) so fragmentation issues
> are not as bad but still getting actual free pages out of slab cache isn't
> that easy... More on this below.
> 
> > diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> > index 356a814e7c8e..fd516a0f0bfe 100644
> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -179,6 +179,9 @@ enum node_stat_item {
> > NR_VMSCAN_IMMEDIATE,/* Prioritise for reclaim when writeback ends */
> > NR_DIRTIED, /* page dirtyings since bootup */
> > NR_WRITTEN, /* page writings since bootup */
> > +   NR_METADATA_DIRTY_BYTES,/* Metadata dirty bytes */
> > +   NR_METADATA_WRITEBACK_BYTES,/* Metadata writeback bytes */
> > +   NR_METADATA_BYTES,  /* total metadata bytes in use. */
> > NR_VM_NODE_STAT_ITEMS
> >  };
> 
> I think you didn't address my comment from last version of the series.
> 
> 1) Per-cpu node-stat batching will be basically useless for these counters
> as the batch size is <128. Maybe we don't care but it would deserve a
> comment.
> 
> 2) These counters are tracked in atomic_long_t type. That means max 2GB of
> metadata on 32-bit machines. I *guess* that should be OK since you would
> not be able to address that much of slab cache on such machine anyway but 
> still worth a comment I think.
> 

You're right I missed this, sorry about that.  I've resolved the batching
problem, and I'll add a comment about the 32bit machines problem.

> > diff --git a/mm/util.c b/mm/util.c
> > index 34e57fae959d..681d62631ee0 100644
> > --- a/mm/util.c
> > +++ b/mm/util.c
> > @@ -616,6 +616,7 @@ int __vm_enough_memory(struct mm_struct *mm, long 
> > pages, int cap_sys_admin)
> > if (sysctl_overcommit_memory == OVERCOMMIT_GUESS) {
> > free = global_zone_page_state(NR_FREE_PAGES);
> > free += global_node_page_state(NR_FILE_PAGES);
> > +   free += global_node_page_state(NR_METADATA_BYTES) >> PAGE_SHIFT;
> 
> 
> I'm not really sure this is OK. It depends on whether mm is really able to
> reclaim these pages easily enough... Summon mm people for help :)
> 

Well we count NR_SLAB_RECLAIMABLE here, so it's no different than that.  The
point is that it's theoretically reclaimable, and we should at least try.

> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 13d711dd8776..415b003e475c 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -225,7 +225,8 @@ unsigned long pgdat_reclaimable_pages(struct 
> > pglist_data *pgdat)
> >  
> > nr = node_page_state_snapshot(pgdat, NR_ACTIVE_FILE) +
> >  node_page_state_snapshot(pgdat, NR_INACTIVE_FILE) +
> > -node_page_state_snapshot(pgdat, NR_ISOLATED_FILE);
> > +node_page_state_snapshot(pgdat, NR_ISOLATED_FILE) +
> > +(node_page_state_snapshot(pgdat, NR_METADATA_BYTES) >> PAGE_SHIFT);
> >  
> > if (get_nr_swap_pages() > 0)
> > nr += node_page_state_snapshot(pgdat, NR_ACTIVE_ANON) +
> 
> Just drop this hunk. The function is going away (and is currently unused).
> 

Will do.

> > @@ -3812,6 +3813,7 @@ static inline unsigned long 
> > node_unmapped_file_pages(struct pglist_data *pgdat)
> >  static unsigned long node_pagecache_reclaimable(struct pglist_data *pgdat)
> >  {
> > unsigned long nr_pagecache_reclaimable;
> > +   unsigned long nr_metadata_reclaimable;
> > unsigned long delta = 0;
> >  
> > /*
> > @@ -3833,7 +3835,20 @@ static unsigned long 
> > node_pagecache_reclaimable(struct pglist_data *pgdat)
> >

Re: [PATCH] btrfs: Handle btrfs_set_extent_delalloc failure

2017-12-06 Thread David Sterba

On Wed, Dec 06, 2017 at 09:27:50AM +0200, Nikolay Borisov wrote:
> On  5.12.2017 19:50, David Sterba wrote:
> > On Tue, Dec 05, 2017 at 04:10:59PM +0800, Qu Wenruo wrote:
> >>
> >>
> >> On 2017年12月05日 15:29, Nikolay Borisov wrote:
> >>> This function was introduced by 247e743cbe6e ("Btrfs: Use async helpers 
> >>> to deal
> >>> with pages that have been improperly dirtied") and it didn't do any error
> >>> handling then. This function might very well fail in ENOMEM situation, yet
> >>> it's not handled, this could lead to inconsistent state. So let's handle 
> >>> the
> >>> failure by setting the mapping error bit.
> >>>
> >>> Signed-off-by: Nikolay Borisov 
> >>> Cc: sta...@vger.kernel.org
> >>
> >> Reviewed-by: Qu Wenruo 
> >>
> >> That's the only missing one. Nice catch.
> > 
> > You mean the only unhandled call of btrfs_set_extent_delalloc? There's
> > one more in relocate_file_extent_cluster.
> 
> I'd prefer this call site be handled in a separate patch.

Agreed, just that Qu indicated there's no such call site :)
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs: put btrfs_ioctl_vol_args_v2 related defines together

2017-12-06 Thread David Sterba

On Wed, Dec 06, 2017 at 11:40:10AM +0800, Anand Jain wrote:
> Just a code spatial rearrangement, no functional change.
> 
> Signed-off-by: Anand Jain 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH 23/45] fs: btrfs: remove duplicate includes

2017-12-06 Thread Pravin Shedge

These duplicate includes have been found with scripts/checkincludes.pl but
they have been removed manually to avoid removing false positives.

Signed-off-by: Pravin Shedge 
---
 fs/btrfs/super.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 3a4dce1..b34fe15 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -61,7 +61,6 @@
 #include "tests/btrfs-tests.h"
 
 #include "qgroup.h"
-#include "backref.h"
 #define CREATE_TRACE_POINTS
 #include 
 
-- 
2.7.4

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 5/5] btrfs: Greatly simplify btrfs_read_dev_super

2017-12-06 Thread David Sterba

On Mon, Dec 04, 2017 at 06:20:09PM +0200, Nikolay Borisov wrote:
> >> I don't understand what problem *should* be solved here...
> > 
> > Without any further checks and validation, we cannot simply iterate over
> > all superblocks and try to mount anything that looks ok. Even if the
> > offsets exist on the block device.  The fsid should be at least checked,
> > but that's still not enough to ensure the 1st copy is what we want to
> > mount.
> 
> I'm more inclined to agree with Anand here, that if the users wants to
> mount with -t btrfs then the filesystem should assume it's correct to
> use any of the additional superblocks.

If and only if the additional superblocks are valid. And if we can't
read the primary superblock, we don't have enough information to
establish the validation.  EIO?  We don't have anything.  CRC mismatch?
Can't trust the whole data.  We need FSID and total_size at least. Other
actions are limited from inside kernel.

> Otherwise we should explicitly
> state somewhere that the superblock copies are there only for the sake
> of the user-space tools, in which case this patch can go in, albeit with
> some rewording.

Unless there's something I'm missing to address the concerns above, I'm
for (an updated version) of your patch.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 0/5] Add the ability to do BPF directed error injection

2017-12-06 Thread Josef Bacik

Jon noticed that I had a typo in my _ASM_KPROBE_ERROR_INJECT macro.  I went to
figure out why the compiler didn't catch it and it's because it was not used
anywhere.  I had copied it from the trace blacklist code without understanding
where it was used as cscope didn't find the original macro I was looking for, so
I assumed it was some voodoo and left it in place.  Turns out cscope failed me
and I didn't need the macro at all, the trace blacklist thing I was looking at
was for marking assembly functions as blacklisted and I have no intention of
marking assembly functions as error injectable at the moment.

v7->v8:
- removed the _ASM_KPROBE_ERROR_INJECT since it was not needed.

v6->v7:
- moved the opt-in macro to bpf.h out of kprobes.h.

v5->v6:
- add BPF_ALLOW_ERROR_INJECTION() tagging for functions that will support this
  feature.  This way only functions that opt-in will be allowed to be
  overridden.
- added a btrfs patch to allow error injection for open_ctree() so that the bpf
  sample actually works.

v4->v5:
- disallow kprobe_override programs from being put in the prog map array so we
  don't tail call into something we didn't check.  This allows us to make the
  normal path still fast without a bunch of percpu operations.

v3->v4:
- fix a build error found by kbuild test bot (I didn't wait long enough
  apparently.)
- Added a warning message as per Daniels suggestion.

v2->v3:
- added a ->kprobe_override flag to bpf_prog.
- added some sanity checks to disallow attaching bpf progs that have
  ->kprobe_override set that aren't for ftrace kprobes.
- added the trace_kprobe_ftrace helper to check if the trace_event_call is a
  ftrace kprobe.
- renamed bpf_kprobe_state to bpf_kprobe_override, fixed it so we only read this
  value in the kprobe path, and thus only write to it if we're overriding or
  clearing the override.

v1->v2:
- moved things around to make sure that bpf_override_return could really only be
  used for an ftrace kprobe.
- killed the special return values from trace_call_bpf.
- renamed pc_modified to bpf_kprobe_state so bpf_override_return could tell if
  it was being called from an ftrace kprobe context.
- reworked the logic in kprobe_perf_func to take advantage of bpf_kprobe_state.
- updated the test as per Alexei's review.

- Original message -

A lot of our error paths are not well tested because we have no good way of
injecting errors generically.  Some subystems (block, memory) have ways to
inject errors, but they are random so it's hard to get reproduceable results.

With BPF we can add determinism to our error injection.  We can use kprobes and
other things to verify we are injecting errors at the exact case we are trying
to test.  This patch gives us the tool to actual do the error injection part.
It is very simple, we just set the return value of the pt_regs we're given to
whatever we provide, and then override the PC with a dummy function that simply
returns.

Right now this only works on x86, but it would be simple enough to expand to
other architectures.  Thanks,

Josef
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 1/5] add infrastructure for tagging functions as error injectable

2017-12-06 Thread Josef Bacik

From: Josef Bacik 

Using BPF we can override kprob'ed functions and return arbitrary
values.  Obviously this can be a bit unsafe, so make this feature opt-in
for functions.  Simply tag a function with KPROBE_ERROR_INJECT_SYMBOL in
order to give BPF access to that function for error injection purposes.

Signed-off-by: Josef Bacik 
Acked-by: Ingo Molnar 
---
 include/asm-generic/vmlinux.lds.h |  10 +++
 include/linux/bpf.h   |  11 +++
 include/linux/kprobes.h   |   1 +
 include/linux/module.h|   5 ++
 kernel/kprobes.c  | 163 ++
 kernel/module.c   |   6 +-
 6 files changed, 195 insertions(+), 1 deletion(-)

diff --git a/include/asm-generic/vmlinux.lds.h 
b/include/asm-generic/vmlinux.lds.h
index 8acfc1e099e1..85822804861e 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -136,6 +136,15 @@
 #define KPROBE_BLACKLIST()
 #endif
 
+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+#define ERROR_INJECT_LIST(). = ALIGN(8);   
\
+   
VMLINUX_SYMBOL(__start_kprobe_error_inject_list) = .;   \
+   KEEP(*(_kprobe_error_inject_list))  
\
+   VMLINUX_SYMBOL(__stop_kprobe_error_inject_list) 
= .;
+#else
+#define ERROR_INJECT_LIST()
+#endif
+
 #ifdef CONFIG_EVENT_TRACING
 #define FTRACE_EVENTS(). = ALIGN(8);   
\
VMLINUX_SYMBOL(__start_ftrace_events) = .;  \
@@ -560,6 +569,7 @@
FTRACE_EVENTS() \
TRACE_SYSCALLS()\
KPROBE_BLACKLIST()  \
+   ERROR_INJECT_LIST() \
MEM_DISCARD(init.rodata)\
CLK_OF_TABLES() \
RESERVEDMEM_OF_TABLES() \
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 520aeebe0d93..552a666a338b 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -530,4 +530,15 @@ extern const struct bpf_func_proto 
bpf_sock_map_update_proto;
 void bpf_user_rnd_init_once(void);
 u64 bpf_user_rnd_u32(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5);
 
+#if defined(__KERNEL__) && !defined(__ASSEMBLY__)
+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+#define BPF_ALLOW_ERROR_INJECTION(fname)   \
+static unsigned long __used\
+   __attribute__((__section__("_kprobe_error_inject_list")))   \
+   _eil_addr_##fname = (unsigned long)fname;
+#else
+#define BPF_ALLOW_ERROR_INJECTION(fname)
+#endif
+#endif
+
 #endif /* _LINUX_BPF_H */
diff --git a/include/linux/kprobes.h b/include/linux/kprobes.h
index bd2684700b74..4f501cb73aec 100644
--- a/include/linux/kprobes.h
+++ b/include/linux/kprobes.h
@@ -271,6 +271,7 @@ extern bool arch_kprobe_on_func_entry(unsigned long offset);
 extern bool kprobe_on_func_entry(kprobe_opcode_t *addr, const char *sym, 
unsigned long offset);
 
 extern bool within_kprobe_blacklist(unsigned long addr);
+extern bool within_kprobe_error_injection_list(unsigned long addr);
 
 struct kprobe_insn_cache {
struct mutex mutex;
diff --git a/include/linux/module.h b/include/linux/module.h
index fe5aa3736707..7bb1a9b9a322 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -475,6 +475,11 @@ struct module {
ctor_fn_t *ctors;
unsigned int num_ctors;
 #endif
+
+#ifdef CONFIG_BPF_KPROBE_OVERRIDE
+   unsigned int num_kprobe_ei_funcs;
+   unsigned long *kprobe_ei_funcs;
+#endif
 } cacheline_aligned __randomize_layout;
 #ifndef MODULE_ARCH_INIT
 #define MODULE_ARCH_INIT {}
diff --git a/kernel/kprobes.c b/kernel/kprobes.c
index a1606a4224e1..bdd7dd724f6f 100644
--- a/kernel/kprobes.c
+++ b/kernel/kprobes.c
@@ -83,6 +83,16 @@ static raw_spinlock_t *kretprobe_table_lock_ptr(unsigned 
long hash)
return &(kretprobe_table_locks[hash].lock);
 }
 
+/* List of symbols that can be overriden for error injection. */
+static LIST_HEAD(kprobe_error_injection_list);
+static DEFINE_MUTEX(kprobe_ei_mutex);
+struct kprobe_ei_entry {
+   struct list_head list;
+   unsigned long start_addr;
+   unsigned long end_addr;
+   void *priv;
+};
+
 /* Blacklist -- list of struct kprobe_blacklist_entry */
 static LIST_HEAD(kprobe_blacklist);
 
@@ -1392,6 +1402,17 @@ bool within_kprobe_blacklist(unsigned long addr)
return false;
 }
 
+bool within_kprobe_error_injection_list(unsigned long addr)
+{
+   struct kprobe_ei_entry *ent;
+
+   list_for_each_entry(ent, &kprobe_error_injection_list, list) {
+   if (addr >= ent->start_addr && addr < ent->end_addr)
+   return true;

[PATCH v8 3/5] bpf: add a bpf_override_function helper

2017-12-06 Thread Josef Bacik

From: Josef Bacik 

Error injection is sloppy and very ad-hoc.  BPF could fill this niche
perfectly with it's kprobe functionality.  We could make sure errors are
only triggered in specific call chains that we care about with very
specific situations.  Accomplish this with the bpf_override_funciton
helper.  This will modify the probe'd callers return value to the
specified value and set the PC to an override function that simply
returns, bypassing the originally probed function.  This gives us a nice
clean way to implement systematic error injection for all of our code
paths.

Acked-by: Alexei Starovoitov 
Acked-by: Ingo Molnar 
Signed-off-by: Josef Bacik 
---
 arch/Kconfig |  3 +++
 arch/x86/Kconfig |  1 +
 arch/x86/include/asm/kprobes.h   |  4 +++
 arch/x86/include/asm/ptrace.h|  5 
 arch/x86/kernel/kprobes/ftrace.c | 14 ++
 include/linux/filter.h   |  3 ++-
 include/linux/trace_events.h |  1 +
 include/uapi/linux/bpf.h |  7 -
 kernel/bpf/core.c|  3 +++
 kernel/bpf/verifier.c|  2 ++
 kernel/events/core.c |  7 +
 kernel/trace/Kconfig | 11 
 kernel/trace/bpf_trace.c | 38 +++
 kernel/trace/trace_kprobe.c  | 55 +++-
 kernel/trace/trace_probe.h   | 12 +
 15 files changed, 157 insertions(+), 9 deletions(-)

diff --git a/arch/Kconfig b/arch/Kconfig
index d789a89cb32c..4fb618082259 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -195,6 +195,9 @@ config HAVE_OPTPROBES
 config HAVE_KPROBES_ON_FTRACE
bool
 
+config HAVE_KPROBE_OVERRIDE
+   bool
+
 config HAVE_NMI
bool
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 971feac13506..5126d2750dd0 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -152,6 +152,7 @@ config X86
select HAVE_KERNEL_XZ
select HAVE_KPROBES
select HAVE_KPROBES_ON_FTRACE
+   select HAVE_KPROBE_OVERRIDE
select HAVE_KRETPROBES
select HAVE_KVM
select HAVE_LIVEPATCH   if X86_64
diff --git a/arch/x86/include/asm/kprobes.h b/arch/x86/include/asm/kprobes.h
index 6cf65437b5e5..c6c3b1f4306a 100644
--- a/arch/x86/include/asm/kprobes.h
+++ b/arch/x86/include/asm/kprobes.h
@@ -67,6 +67,10 @@ extern const int kretprobe_blacklist_size;
 void arch_remove_kprobe(struct kprobe *p);
 asmlinkage void kretprobe_trampoline(void);
 
+#ifdef CONFIG_KPROBES_ON_FTRACE
+extern void arch_ftrace_kprobe_override_function(struct pt_regs *regs);
+#endif
+
 /* Architecture specific copy of original instruction*/
 struct arch_specific_insn {
/* copy of the original instruction */
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 91c04c8e67fa..f04e71800c2f 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -108,6 +108,11 @@ static inline unsigned long regs_return_value(struct 
pt_regs *regs)
return regs->ax;
 }
 
+static inline void regs_set_return_value(struct pt_regs *regs, unsigned long 
rc)
+{
+   regs->ax = rc;
+}
+
 /*
  * user_mode(regs) determines whether a register set came from user
  * mode.  On x86_32, this is true if V8086 mode was enabled OR if the
diff --git a/arch/x86/kernel/kprobes/ftrace.c b/arch/x86/kernel/kprobes/ftrace.c
index 041f7b6dfa0f..3c455bf490cb 100644
--- a/arch/x86/kernel/kprobes/ftrace.c
+++ b/arch/x86/kernel/kprobes/ftrace.c
@@ -97,3 +97,17 @@ int arch_prepare_kprobe_ftrace(struct kprobe *p)
p->ainsn.boostable = false;
return 0;
 }
+
+asmlinkage void override_func(void);
+asm(
+   ".type override_func, @function\n"
+   "override_func:\n"
+   "   ret\n"
+   ".size override_func, .-override_func\n"
+);
+
+void arch_ftrace_kprobe_override_function(struct pt_regs *regs)
+{
+   regs->ip = (unsigned long)&override_func;
+}
+NOKPROBE_SYMBOL(arch_ftrace_kprobe_override_function);
diff --git a/include/linux/filter.h b/include/linux/filter.h
index cdd78a7beaae..dfa44fd74bae 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -458,7 +458,8 @@ struct bpf_prog {
locked:1,   /* Program image locked? */
gpl_compatible:1, /* Is filter GPL compatible? 
*/
cb_access:1,/* Is control block accessed? */
-   dst_needed:1;   /* Do we need dst entry? */
+   dst_needed:1,   /* Do we need dst entry? */
+   kprobe_override:1; /* Do we override a kprobe? 
*/
kmemcheck_bitfield_end(meta);
enum bpf_prog_type  type;   /* Type of BPF program */
u32 len;/* Number of filter blocks */
diff --git a/include/linux/trace_events.h b/include/linux/trace_events.h
index fc6aeca945db..be8bd5a8efaa 100644
--- a/include/linux/trace_even

[PATCH v8 4/5] samples/bpf: add a test for bpf_override_return

2017-12-06 Thread Josef Bacik

From: Josef Bacik 

This adds a basic test for bpf_override_return to verify it works.  We
override the main function for mounting a btrfs fs so it'll return
-ENOMEM and then make sure that trying to mount a btrfs fs will fail.

Acked-by: Alexei Starovoitov 
Acked-by: Ingo Molnar 
Signed-off-by: Josef Bacik 
---
 samples/bpf/Makefile  |  4 
 samples/bpf/test_override_return.sh   | 15 +++
 samples/bpf/tracex7_kern.c| 16 
 samples/bpf/tracex7_user.c| 28 
 tools/include/uapi/linux/bpf.h|  7 ++-
 tools/testing/selftests/bpf/bpf_helpers.h |  3 ++-
 6 files changed, 71 insertions(+), 2 deletions(-)
 create mode 100755 samples/bpf/test_override_return.sh
 create mode 100644 samples/bpf/tracex7_kern.c
 create mode 100644 samples/bpf/tracex7_user.c

diff --git a/samples/bpf/Makefile b/samples/bpf/Makefile
index ea2b9e6135f3..83d06bc1f710 100644
--- a/samples/bpf/Makefile
+++ b/samples/bpf/Makefile
@@ -14,6 +14,7 @@ hostprogs-y += tracex3
 hostprogs-y += tracex4
 hostprogs-y += tracex5
 hostprogs-y += tracex6
+hostprogs-y += tracex7
 hostprogs-y += test_probe_write_user
 hostprogs-y += trace_output
 hostprogs-y += lathist
@@ -58,6 +59,7 @@ tracex3-objs := bpf_load.o $(LIBBPF) tracex3_user.o
 tracex4-objs := bpf_load.o $(LIBBPF) tracex4_user.o
 tracex5-objs := bpf_load.o $(LIBBPF) tracex5_user.o
 tracex6-objs := bpf_load.o $(LIBBPF) tracex6_user.o
+tracex7-objs := bpf_load.o $(LIBBPF) tracex7_user.o
 load_sock_ops-objs := bpf_load.o $(LIBBPF) load_sock_ops.o
 test_probe_write_user-objs := bpf_load.o $(LIBBPF) test_probe_write_user_user.o
 trace_output-objs := bpf_load.o $(LIBBPF) trace_output_user.o
@@ -100,6 +102,7 @@ always += tracex3_kern.o
 always += tracex4_kern.o
 always += tracex5_kern.o
 always += tracex6_kern.o
+always += tracex7_kern.o
 always += sock_flags_kern.o
 always += test_probe_write_user_kern.o
 always += trace_output_kern.o
@@ -153,6 +156,7 @@ HOSTLOADLIBES_tracex3 += -lelf
 HOSTLOADLIBES_tracex4 += -lelf -lrt
 HOSTLOADLIBES_tracex5 += -lelf
 HOSTLOADLIBES_tracex6 += -lelf
+HOSTLOADLIBES_tracex7 += -lelf
 HOSTLOADLIBES_test_cgrp2_sock2 += -lelf
 HOSTLOADLIBES_load_sock_ops += -lelf
 HOSTLOADLIBES_test_probe_write_user += -lelf
diff --git a/samples/bpf/test_override_return.sh 
b/samples/bpf/test_override_return.sh
new file mode 100755
index ..e68b9ee6814b
--- /dev/null
+++ b/samples/bpf/test_override_return.sh
@@ -0,0 +1,15 @@
+#!/bin/bash
+
+rm -f testfile.img
+dd if=/dev/zero of=testfile.img bs=1M seek=1000 count=1
+DEVICE=$(losetup --show -f testfile.img)
+mkfs.btrfs -f $DEVICE
+mkdir tmpmnt
+./tracex7 $DEVICE
+if [ $? -eq 0 ]
+then
+   echo "SUCCESS!"
+else
+   echo "FAILED!"
+fi
+losetup -d $DEVICE
diff --git a/samples/bpf/tracex7_kern.c b/samples/bpf/tracex7_kern.c
new file mode 100644
index ..1ab308a43e0f
--- /dev/null
+++ b/samples/bpf/tracex7_kern.c
@@ -0,0 +1,16 @@
+#include 
+#include 
+#include 
+#include "bpf_helpers.h"
+
+SEC("kprobe/open_ctree")
+int bpf_prog1(struct pt_regs *ctx)
+{
+   unsigned long rc = -12;
+
+   bpf_override_return(ctx, rc);
+   return 0;
+}
+
+char _license[] SEC("license") = "GPL";
+u32 _version SEC("version") = LINUX_VERSION_CODE;
diff --git a/samples/bpf/tracex7_user.c b/samples/bpf/tracex7_user.c
new file mode 100644
index ..8a52ac492e8b
--- /dev/null
+++ b/samples/bpf/tracex7_user.c
@@ -0,0 +1,28 @@
+#define _GNU_SOURCE
+
+#include 
+#include 
+#include 
+#include "libbpf.h"
+#include "bpf_load.h"
+
+int main(int argc, char **argv)
+{
+   FILE *f;
+   char filename[256];
+   char command[256];
+   int ret;
+
+   snprintf(filename, sizeof(filename), "%s_kern.o", argv[0]);
+
+   if (load_bpf_file(filename)) {
+   printf("%s", bpf_log_buf);
+   return 1;
+   }
+
+   snprintf(command, 256, "mount %s tmpmnt/", argv[1]);
+   f = popen(command, "r");
+   ret = pclose(f);
+
+   return ret ? 0 : 1;
+}
diff --git a/tools/include/uapi/linux/bpf.h b/tools/include/uapi/linux/bpf.h
index 4a4b6e78c977..3756dde69834 100644
--- a/tools/include/uapi/linux/bpf.h
+++ b/tools/include/uapi/linux/bpf.h
@@ -673,6 +673,10 @@ union bpf_attr {
  * @buf: buf to fill
  * @buf_size: size of the buf
  * Return : 0 on success or negative error code
+ *
+ * int bpf_override_return(pt_regs, rc)
+ * @pt_regs: pointer to struct pt_regs
+ * @rc: the return value to set
  */
 #define __BPF_FUNC_MAPPER(FN)  \
FN(unspec), \
@@ -732,7 +736,8 @@ union bpf_attr {
FN(xdp_adjust_meta),\
FN(perf_event_read_value),  \
FN(perf_prog_read_value),   \
-   FN(getsockopt),
+   FN(getsockopt), \
+   FN(override_return),
 
 /* integer value in 'imm' field of BPF_CALL instruction selects which helper
  * function eBPF program intends

[PATCH v8 5/5] btrfs: allow us to inject errors at io_ctl_init

2017-12-06 Thread Josef Bacik

From: Josef Bacik 

This was instrumental in reproducing a space cache bug.

Signed-off-by: Josef Bacik 
Acked-by: Ingo Molnar 
---
 fs/btrfs/free-space-cache.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/free-space-cache.c b/fs/btrfs/free-space-cache.c
index cdc9f4015ec3..daa98dc1f844 100644
--- a/fs/btrfs/free-space-cache.c
+++ b/fs/btrfs/free-space-cache.c
@@ -22,6 +22,7 @@
 #include 
 #include 
 #include 
+#include 
 #include "ctree.h"
 #include "free-space-cache.h"
 #include "transaction.h"
@@ -332,6 +333,7 @@ static int io_ctl_init(struct btrfs_io_ctl *io_ctl, struct 
inode *inode,
 
return 0;
 }
+BPF_ALLOW_ERROR_INJECTION(io_ctl_init);
 
 static void io_ctl_free(struct btrfs_io_ctl *io_ctl)
 {
-- 
2.7.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH v8 2/5] btrfs: make open_ctree error injectable

2017-12-06 Thread Josef Bacik

From: Josef Bacik 

This allows us to do error injection with BPF for open_ctree.

Signed-off-by: Josef Bacik 
Acked-by: Ingo Molnar 
---
 fs/btrfs/disk-io.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index dfdab849037b..69d17a640b94 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -30,6 +30,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ctree.h"
 #include "disk-io.h"
@@ -3283,6 +3284,7 @@ int open_ctree(struct super_block *sb,
goto fail_block_groups;
goto retry_root_backup;
 }
+BPF_ALLOW_ERROR_INJECTION(open_ctree);
 
 static void btrfs_end_buffer_write_sync(struct buffer_head *bh, int uptodate)
 {
-- 
2.7.5

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 4/5] btrfs: Remove redundant NULL check

2017-12-06 Thread David Sterba

On Fri, Dec 01, 2017 at 11:19:43AM +0200, Nikolay Borisov wrote:
> Before returning hole_em in btrfs_get_fiemap_extent we check if it's different
> than null. However, by the time this null check is triggered we already know
> hole_em is not null because it means it points to the em we found and it
> has already been dereferenced.

Well, ok The whole function could use some cleanups and reordering,
it's hard to track all the variable changes.
> 
> Signed-off-by: Nikolay Borisov 
> ---
>  fs/btrfs/inode.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
> index 92d140b06271..9e0473c883ce 100644
> --- a/fs/btrfs/inode.c
> +++ b/fs/btrfs/inode.c
> @@ -7300,9 +7300,8 @@ struct extent_map *btrfs_get_extent_fiemap(struct 
> btrfs_inode *inode,
>   em->block_start = EXTENT_MAP_DELALLOC;
>   em->block_len = found;
>   }
> - } else if (hole_em) {
> + } else
>   return hole_em;

The { } should stay. I'll fix it.

> - }
>  out:
>  
>   free_extent_map(hole_em);
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH 3/5] btrfs: Fix possible off-by-one in btrfs_search_path_in_tree

2017-12-06 Thread David Sterba

On Fri, Dec 01, 2017 at 11:19:42AM +0200, Nikolay Borisov wrote:
> The name char array passed to btrfs_search_path_in_tree is of size
> BTRFS_INO_LOOKUP_PATH_MAX (4080). So the actual accessible char indexes
> are in the range of [0, 4079]. Currently the code uses the define but this
> represents an off-by-one.

For such fixes it's good to think about and document potential
implications, like where could the extra byte end be written, and if this
is a patch stable.

Size of btrfs_ioctl_ino_lookup_args is 4096, so the new byte will be
written to extra space, not some padding that could be provided by the
allocator.

btrfs-progs store the arguments on stack, but kernel does own copy of
the ioctl buffer and the off-by-one overwrite does not affect userspace,
but the ending 0 might be lost.

kernel ioctl buffer is allocated dynamically so we're overwriting
somebody else's memory, and the ioctl is privileged if args.objectid is
not 256. Which is in most cases, but resolving a subvolume stored in
another directory will trigger that path.

Fixes: ac8e9819d71f907 ("Btrfs: add search and inode lookup ioctls")

Before this patch the buffer was one byte larger, but then the -1 was
not added.

> Signed-off-by: Nikolay Borisov 
> ---
>  fs/btrfs/ioctl.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
> index e8adebc8c1b0..fc148b7c4265 100644
> --- a/fs/btrfs/ioctl.c
> +++ b/fs/btrfs/ioctl.c
> @@ -2206,7 +2206,7 @@ static noinline int btrfs_search_path_in_tree(struct 
> btrfs_fs_info *info,
>   if (!path)
>   return -ENOMEM;
>  
> - ptr = &name[BTRFS_INO_LOOKUP_PATH_MAX];
> + ptr = &name[BTRFS_INO_LOOKUP_PATH_MAX - 1];
>  
>   key.objectid = tree_id;
>   key.type = BTRFS_ROOT_ITEM_KEY;
> @@ -2272,8 +2272,8 @@ static noinline int btrfs_search_path_in_tree(struct 
> btrfs_fs_info *info,
>  static noinline int btrfs_ioctl_ino_lookup(struct file *file,
>  void __user *argp)
>  {
> -  struct btrfs_ioctl_ino_lookup_args *args;
> -  struct inode *inode;
> + struct btrfs_ioctl_ino_lookup_args *args;
> + struct inode *inode;

Unrelated change.

>   int ret = 0;
>  
>   args = memdup_user(argp, sizeof(*args));
> -- 
> 2.7.4
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] btrfs: tree-checker: use %zu format string for size_t

2017-12-06 Thread David Sterba

On Wed, Dec 06, 2017 at 03:18:14PM +0100, Arnd Bergmann wrote:
> The return value of sizeof() is of type size_t, so we must print it
> using the %z format modifier rather than %l to avoid this warning
> on some architectures:
> 
> fs/btrfs/tree-checker.c: In function 'check_dir_item':
> fs/btrfs/tree-checker.c:273:50: error: format '%lu' expects argument of type 
> 'long unsigned int', but argument 5 has type 'u32' {aka 'unsigned int'} 
> [-Werror=format=]
> 
> Fixes: 005887f2e3e0 ("btrfs: tree-checker: Add checker for dir item")
> Signed-off-by: Arnd Bergmann 

Reviewed-by: David Sterba 
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] Btrfs: disable FUA if mounted with nobarrier

2017-12-06 Thread David Sterba

On Tue, Dec 05, 2017 at 10:54:02PM -0800, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> I was seeing disk flushes still happening when I mounted a Btrfs
> filesystem with nobarrier for testing. This is because we use FUA to
> write out the first super block, and on devices without FUA support, the
> block layer translates FUA to a flush. Even on devices supporting true
> FUA, using FUA when we asked for no barriers is surprising.
> 
> Signed-off-by: Omar Sandoval 

Reviewed-by: David Sterba 

The documented nobarrier behaviour matches the updated code.
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

[PATCH] btrfs: tree-checker: use %zu format string for size_t

2017-12-06 Thread Arnd Bergmann

The return value of sizeof() is of type size_t, so we must print it
using the %z format modifier rather than %l to avoid this warning
on some architectures:

fs/btrfs/tree-checker.c: In function 'check_dir_item':
fs/btrfs/tree-checker.c:273:50: error: format '%lu' expects argument of type 
'long unsigned int', but argument 5 has type 'u32' {aka 'unsigned int'} 
[-Werror=format=]

Fixes: 005887f2e3e0 ("btrfs: tree-checker: Add checker for dir item")
Signed-off-by: Arnd Bergmann 
---
 fs/btrfs/tree-checker.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
index 66dac0a4b01f..7c55e3ba5a6c 100644
--- a/fs/btrfs/tree-checker.c
+++ b/fs/btrfs/tree-checker.c
@@ -270,7 +270,7 @@ static int check_dir_item(struct btrfs_root *root,
/* header itself should not cross item boundary */
if (cur + sizeof(*di) > item_size) {
dir_item_err(root, leaf, slot,
-   "dir item header crosses item boundary, have %lu boundary %u",
+   "dir item header crosses item boundary, have %zu boundary %u",
cur + sizeof(*di), item_size);
return -EUCLEAN;
}
-- 
2.9.0

--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-06 Thread Matthew Wilcox

On Wed, Dec 06, 2017 at 07:44:04PM +1100, Dave Chinner wrote:
> On Tue, Dec 05, 2017 at 08:45:49PM -0800, Matthew Wilcox wrote:
> > That said, using xa_cmpxchg() in the dquot code looked like the right
> > thing to do?  Since we'd dropped the qi mutex and the ILOCK, it looks
> > entirely reasonable for another thread to come in and set up the dquot.
> > But I'm obviously quite ignorant of the XFS internals, so maybe there's
> > something else going on that makes this essentially a "can't happen".
> 
> It's no different to the inode cache code, which drops the RCU
> lock on lookup miss, instantiates the new inode (maybe reading it
> off disk), then locks the tree and attempts to insert it. Both cases
> use "insert if empty, otherwise retry lookup from start" semantics.

Ah.  I had my focus set a little narrow on the inode cache code and didn't
recognise the pattern.

Why do you sleep for one jiffy after encountering a miss, then seeing
someone else insert the inode for you?

> cmpxchg is for replacing a known object in a store - it's not really
> intended for doing initial inserts after a lookup tells us there is
> nothing in the store.  The radix tree "insert only if empty" makes
> sense here, because it naturally takes care of lookup/insert races
> via the -EEXIST mechanism.
> 
> I think that providing xa_store_excl() (which would return -EEXIST
> if the entry is not empty) would be a better interface here, because
> it matches the semantics of lookup cache population used all over
> the kernel

I'm not thrilled with xa_store_excl(), but I need to think about that
a bit more.

> > I'm quite happy to have normal API variants that don't save/restore
> > interrupts.  Just need to come up with good names ... I don't think
> > xa_store_noirq() is a good name, but maybe you do?
> 
> I'd prefer not to have to deal with such things at all. :P
> 
> How many subsystems actually require irq safety in the XA locking
> code? Make them use irqsafe versions, not make everyone else use
> "noirq" versions, as is the convention for the rest of the kernel
> code

Hard to say how many existing radix tree users require the irq safety.
Also hard to say how many potential users (people currently using
linked lists, people using resizable arrays, etc) need irq safety.
My thinking was "make it safe by default and let people who know better
have a way to opt out", but there's definitely something to be said for
"make it fast by default and let people who need the unusual behaviour
type those extra few letters".

So, you're arguing for providing xa_store(), xa_store_irq(), xa_store_bh()
and xa_store_irqsafe()?  (at least on demand, as users come to light?)
At least the read side doesn't require any variants; everybody can use
RCU for read side protection.

("safe", not "save" because I wouldn't make the caller provide the
"flags" argument).

> > At least, not today.  One of the future plans is to allow xa_nodes to
> > be allocated from ZONE_MOVABLE.  In order to do that, we have to be
> > able to tell which lock protects any given node.  With the XArray,
> > we can find that out (xa_node->root->xa_lock); with the radix tree,
> > we don't even know what kind of lock protects the tree.
> 
> Yup, this is a prime example of why we shouldn't be creating
> external dependencies by smearing the locking context outside the XA
> structure itself. It's not a stretch to see something like a
> ZONE_MOVEABLE dependency because some other object indexed in a XA
> is stored in the same page as the xa_node that points to it, and
> both require the same xa_lock to move/update...

That is a bit of a stretch.  Christoph Lameter and I had a discussion about it
here: https://www.spinics.net/lists/linux-mm/msg122902.html

There's no situation where you need to acquire two locks in order to
free an object; you'd create odd locking dependencies between objects
if you did that (eg we already have a locking dependency between pag_ici
and perag from __xfs_inode_set_eofblocks_tag).  It'd be a pretty horrible
shrinker design where you had to get all the locks on all the objects,
regardless of what locking order the real code had.

> > There are other costs to not having a lock.  The lockdep/RCU
> > analysis done on the radix tree code is none.  Because we have
> > no idea what lock might protect any individual radix tree, we use
> > rcu_dereference_raw(), disabling lockdep's ability to protect us.
> 
> Unfortunately for you, I don't find arguments along the lines of
> "lockdep will save us" at all convincing.  lockdep already throws
> too many false positives to be useful as a tool that reliably and
> accurately points out rare, exciting, complex, intricate locking
> problems.

But it does reliably and accurately point out "dude, you forgot to take
the lock".  It's caught a number of real problems in my own testing that
you never got to see.

> That problem has not gone away - very few people who read and have
> to maintain this code understandxs all t

Re: [RFC] Improve subvolume usability for a normal user

2017-12-06 Thread Austin S. Hemmelgarn


On 2017-12-05 23:52, Misono, Tomohiro wrote:

On 2017/12/05 21:41, Austin S. Hemmelgarn wrote:

On 2017-12-05 03:43, Qu Wenruo wrote:



On 2017年12月05日 16:25, Misono, Tomohiro wrote:

Hello all,

I want to address some issues of subvolume usability for a normal user.
i.e. a user can create subvolumes, but
   - Cannot delete their own subvolume (by default)
   - Cannot tell subvolumes from directories (in a straightforward way)
   - Cannot check the quota limit when qgroup is enabled

Here I show the initial thoughts and approaches to this problem.
I want to check if this is a right approach or not before I start writing code.

Comments are welcome.
Tomohiro Misono

==
- Goal and current problem
The goal of this RFC is to give a normal user more control to their own 
subvolumes.
Currently the control to subvolumes for a normal user is restricted as below:

+-+--+--+
|   command   | root | user |
+-+--+--+
| sub create  | Y| Y|
| sub snap| Y| Y|
| sub del | Y| N|
| sub list| Y| N|
| sub show| Y| N|
| qgroup show | Y| N|
+-+--+--+

In short, I want to change this as below in order to improve user's usability:

+-+--++
|   command   | root | user   |
+-+--++
| sub create  | Y| Y  |
| sub snap| Y| Y  |
| sub del | Y| N -> Y |
| sub list| Y| N -> Y |
| sub show| Y| N -> Y |
| qgroup show | Y| N -> Y |
+-+--++

In words,
(1) allow deletion of subvolume if a user owns it, and
(2) allow getting subvolume/quota info if a user has read access to it
   (sub list/qgroup show just lists the subvolumes which are readable by the 
user)

I think other commands not listed above (qgroup limit, send/receive etc.) should
be done by root and not be allowed for a normal user.


- Outside the scope of this RFC
There is a qualitative problem to use qgroup for limiting user disk amount;
quota limit can easily be averted by creating a subvolume. I think that forcing
inheriting quota of parent subvolume is a solution, but I won't address nor
discuss this here.


- Proposal
   (1) deletion of subvolume

I want to change the default behavior to allow a user to delete their own
subvolumes.

This is not the same behavior as when user_subvol_rm_alowed mount option is

specified; in that case a user can delete the subvolume to which they have
write+exec right.

Since snapshot creation is already restricted to the subvolume owner, it is

consistent that only the owner of the subvolume (or root) can delete it.

The implementation should be straightforward.


Personally speaking, I prefer to do the complex owner check in user daemon.

And do the privilege in user daemon (call it btrfsd for example).

So btrfs-progs will works in 2 modes, if root calls it, do as it used to do.
If normal user calls it, proxy the request to btrfsd, and btrfsd does
the privilege checking and call ioctl (with root privilege).

Then no impact to kernel, all complex work is done in user space.

Exactly how hard is it to just check ownership of the root inode of a
subvolume from the ioctl context?  You could just as easily push all the
checking out to the VFS layer by taking an open fd for the subvolume
root (and probably implicitly closing it) instead of taking a path, and
that would give you all the benefits of ACL's and whatever security
modules the local system is using.  Alternatively, quit treating
subvolumes specially for `unlink()`, and make them behave like regular
directories (and thus avoid the one possible complication resulting from
home directories being subvolumes but needing to be owned by the users).


Thanks to all for the comments.
Let me explain my stance clearly:

I'm for limiting the subvolume operation to btrfs-progs commands since a
subvolume looks like a regular directory, but is different in many points.
This is the reason I want to fix not only "sub delete" but also "sub show/list" 
etc.
I agree that it needs to behave more sanely, but the approach of 
treating it differently from a directory is the entire cause of this 
issue in the first place.  If the original implementation had just 
treated them as regular directories once created (with the obvious 
caveat regarding read-only snapshots), we wouldn't be having this 
conversation right now, except possibly regarding fetching info about 
subvolumes.


Personally, I would really rather just let unlink() work the same on 
subvolumes as it does with directories, and only need special handling 
for creation and metadata retrieval (with the possible option of 'quick' 
deletion through a btrfs-progs command).


I think "sub delete" should be the counterpart of "sub create" and "sub snap".
Since "sub snap" can be used for the subvolume the user owns, "sub delete"
should be allowed to the suvolume the use

Re: [RFC] Improve subvolume usability for a normal user

2017-12-06 Thread Austin S. Hemmelgarn


On 2017-12-05 17:08, Goffredo Baroncelli wrote:

On 12/05/2017 09:17 PM, Austin S. Hemmelgarn wrote:

On 2017-12-05 14:09, Goffredo Baroncelli wrote:

On 12/05/2017 07:46 PM, Graham Cobb wrote:

On 05/12/17 18:01, Goffredo Baroncelli wrote:

On 12/05/2017 04:42 PM, Graham Cobb wrote:

[]

Then no impact to kernel, all complex work is done in user space.

Exactly how hard is it to just check ownership of the root inode of a
subvolume from the ioctl context?  You could just as easily push all the
checking out to the VFS layer by taking an open fd for the subvolume
root (and probably implicitly closing it) instead of taking a path, and
that would give you all the benefits of ACL's and whatever security
modules the local system is using.


+1 - stop inventing new access control rules for each different action!


A subvolume is like a directory; In all filesystems you cannot remove an inode 
if you don't have write access to the parent directory. I assume that this is a 
POSIX requirement; and if so this should be true even for BTRFS.
This means that in order to remove a subvolume and you are not root, you should 
check all the contained directories. It is not sufficient to check one inode.

In the past I create a patch [1][2] which extended the unlink (2) syscall to 
allow removing a subvolume if:
a) the user is root  or
b.1) if the subvolume is empty and
b.2) the user has the write capability to the parent directory


That is also OK, as it follows a logical and consistent model without
introducing new rules, but it has two disadvantages with snapshots the
user creates and then wants to delete:

i) they have to change the snapshot to readwrite to remove all the
contents -- how many users will know that that can even be done?


I think that this is an orthogonal question.  However the user should use btrfs 
set

Anyway I am not against to use something more specific like "btrfs sub del...", 
where it could be possible to implement a more specific checks. I am only highlight that 
destroy a subvolume without looking to its content could break the POSIX rules

No, it doesn't break POSIX rules, because you can't do it with POSIX defined 
interfaces (not counting ioctls, but ioctls are by definition system dependent).


user$ mkdir -p dir1/dir2
user$ touch dir1/dir2/file
user$ sudo chown foo:foo dir1/dir2

user$ rm -rf dir1/
rm: cannot remove 'dir1/dir2/file1': Permission denied

In a POSIX filesystem, you (as user) cannot remove file1.

But if instead of dir1 you have a subvolume (called sub1), and the check of 
removing the subvolume is performed only on the root of subvolume, you could 
remove it.

Is it a problem ? I think no, but I am not a security expert. But even if POSIX 
doesn't have the concept of subvolume, this is definitely a break of a POSIX 
rule.
Except it isn't a break of a POSIX rule, because deleting the subvolume 
doesn't go through regular POSIX API's (again, ioctl really doesn't 
count because it's designed for random extended stuff).  POSIX makes 
absolutely zero guarantees about the behavior of things not done through 
POSIX API's.  Yes, this doesn't use POSIX style permission checks, but 
it also doesn't violate POSIX standards either (because it's not defined 
in the POSIX standard, and things that aren't defined by the standards 
are explicitly stated to be implementation defined).


The data loss risk this entails is however the reason users can't delete 
subvolumes by default.


On the other side, if the user makes a snapshot of a subvolume containing a not 
owned and not empty directory, for the user it would be impossible to delete 
its snapshot (!).
I would argue that by default they shouldn't be able to make a snapshot 
because:
1. With the default behavior of users not being able to delete 
subvolumes, users can trivially cause a resource exhaustion situation 
that they can't resolve themself.

2. Whitelisting is usually better for security than blacklisting.







ii) if the snapshot contains a file the user cannot remove then the user
cannot remove the snapshot at all (even though they could create it).


It is the same for the other filesystem: if in a filesystem tree, there is a 
directory which is not changeable by the user, the user cannot remove all the 
tree. Again, I am only highlighting that there is a possible break of POSIX 
rules.

POSIX says nothing about non-permissions related reasons for not being able to 
remove a directory.

Nobody has say the opposite
Claiming that it's potentially POSIX incompatible behavior directly 
implies that POSIX has provisions regarding non-permissions related 
reasons for not being able to remove a directory.



   A subvolume is treated (too much in my opinion) like a mount point, 
regardless of whether or not it's explicitly mounted, and that seems to be most 
of why you can't remove it with unlink().




That is why I preferred Austin's first proposal. I suppose your proposal
could be extended:

a) the user is root  or
b

Re: btrfs subvolume list for users?

2017-12-06 Thread Ulli Horlacher

On Wed 2017-12-06 (08:24), Ulli Horlacher wrote:
> Is there a way for a non-root user to list his subvolumes?

I must have Alzheimer:
I have asked this already 3 month ago and found a workaround by muself

Date: Sun, 17 Sep 2017 10:48:10 +0200
Message-ID: <20170917084810.gp32...@rus.uni-stuttgart.de>


Sorry!

-- 
Ullrich Horlacher  Server und Virtualisierung
Rechenzentrum TIK 
Universitaet Stuttgart E-Mail: horlac...@tik.uni-stuttgart.de
Allmandring 30aTel:++49-711-68565868
70569 Stuttgart (Germany)  WWW:http://www.tik.uni-stuttgart.de/
REF:<20171206072436.ga29...@rus.uni-stuttgart.de>
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH] fstests: btrfs: reproduce a read failure on raid6 setup

2017-12-06 Thread Eryu Guan

On Mon, Dec 04, 2017 at 03:33:23PM -0700, Liu Bo wrote:
> This test case is to reproduce a bug of raid6 reconstruction process.
> 
> The kernel fix are
> Btrfs: do not merge rbios if their fail stripe index are not identical
> Btrfs: make raid6 rebuild retry more
> 
> Signed-off-by: Liu Bo 

Test failed as expected with v4.15-rc2 kernel and passed with above
patches applied.

I fixed two trailing whitespace issues and some of the indentions to
avoid too-long line, also replaced 'repair' group with a new 'raid'
group. I think 'repair' group is meant for testing filesystem repair
tools, not the internal raid repair mechanism, so a new 'raid' group
looks better to me, as we're having more raid-specific tests recently.

Thanks,
Eryu
--
To unsubscribe from this list: send the line "unsubscribe linux-btrfs" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Re: [PATCH v4 72/73] xfs: Convert mru cache to XArray

2017-12-06 Thread Dave Chinner

On Tue, Dec 05, 2017 at 08:45:49PM -0800, Matthew Wilcox wrote:
> On Wed, Dec 06, 2017 at 02:14:56PM +1100, Dave Chinner wrote:
> > > The other conversions use the normal API instead of the advanced API, so
> > > all of this gets hidden away.  For example, the inode cache does this:
> > 
> > Ah, OK, that's not obvious from the code changes. :/
> 
> Yeah, it's a lot easier to understand (I think!) if you build the
> docs in that tree and look at
> file:///home/willy/kernel/xarray-3/Documentation/output/core-api/xarray.html
> (mutatis mutandi).  I've tried to tell a nice story about how to put
> all the pieces together from the normal to the advanced API.
> 
> > However, it's probably overkill for XFS. In all the cases, when we
> > insert there should be no entry in the tree - the
> > radix tree insert error handling code there was simply catching
> > "should never happen" cases and handling it without crashing.
> 
> I thought it was probably overkill to be using xa_cmpxchg() in the
> pag_ici patch.  I didn't want to take away your error handling as part
> of the conversion, but I think a rational person implementing it today
> would just call xa_store() and not even worry about the return value
> except to check it for IS_ERR().

*nod*

> That said, using xa_cmpxchg() in the dquot code looked like the right
> thing to do?  Since we'd dropped the qi mutex and the ILOCK, it looks
> entirely reasonable for another thread to come in and set up the dquot.
> But I'm obviously quite ignorant of the XFS internals, so maybe there's
> something else going on that makes this essentially a "can't happen".

It's no different to the inode cache code, which drops the RCU
lock on lookup miss, instantiates the new inode (maybe reading it
off disk), then locks the tree and attempts to insert it. Both cases
use "insert if empty, otherwise retry lookup from start" semantics.

cmpxchg is for replacing a known object in a store - it's not really
intended for doing initial inserts after a lookup tells us there is
nothing in the store.  The radix tree "insert only if empty" makes
sense here, because it naturally takes care of lookup/insert races
via the -EEXIST mechanism.

I think that providing xa_store_excl() (which would return -EEXIST
if the entry is not empty) would be a better interface here, because
it matches the semantics of lookup cache population used all over
the kernel

> > Now that I've looked at this, I have to say that having a return
> > value of NULL meaning "success" is quite counter-intuitive. That's
> > going to fire my "that looks so wrong" detector every time I look at
> > the code and notice it's erroring out on a non-null return value
> > that isn't a PTR_ERR case
> 
> It's the same convention as cmpxchg().  I think it's triggering your
> "looks so wrong" detector because it's fundamentally not the natural
> thing to write.

Most definitely the case, and this is why it's a really bad
interface for the semantics we have. This how we end up with code
that makes it easy for programmers to screw up pointer checks in
error handling... :/

> I'm quite happy to have normal API variants that don't save/restore
> interrupts.  Just need to come up with good names ... I don't think
> xa_store_noirq() is a good name, but maybe you do?

I'd prefer not to have to deal with such things at all. :P

How many subsystems actually require irq safety in the XA locking
code? Make them use irqsafe versions, not make everyone else use
"noirq" versions, as is the convention for the rest of the kernel
code

> > > It's the design pattern I've always intended to use.  Naturally, the
> > > xfs radix trees weren't my initial target; it was the page cache, and
> > > the page cache does the same thing; uses the tree_lock to protect both
> > > the radix tree and several other fields in that same data structure.
> > > 
> > > I'm open to argument on this though ... particularly if you have a better
> > > design pattern in mind!
> > 
> > I don't mind structures having internal locking - I have a problem
> > with leaking them into contexts outside the structure they protect.
> > That way lies madness - you can't change the internal locking in
> > future because of external dependencies, and the moment you need
> > something different externally we've got to go back to an external
> > lock anyway.
> > 
> > This is demonstrated by the way you converted the XFS dquot tree -
> > you didn't replace the dquot tree lock with the internal xa_lock
> > because it's a mutex and we have to sleep holding it. IOWs, we've
> > added another layer of locking here, not simplified the code.
> 
> I agree the dquot code is no simpler than it was, but it's also no more
> complicated from a locking analysis point of view; the xa_lock is just
> not providing you with any useful exclusion.

Sure, that's fine. All I'm doing is pointing out that we can't use
the internal xa_lock to handle everything the indexed objects
require, and so we're going to still n

40 matches

Mail list logo