Re: btrfs panic problem

2018-09-25 Thread sunny.s.zhang



On 2018-09-20 02:36, Liu Bo wrote:

On Mon, Sep 17, 2018 at 5:28 PM, sunny.s.zhang  wrote:

Hi All,

My OS(4.1.12) panic in kmem_cache_alloc, which is called by
btrfs_get_or_create_delayed_node.

I found that the freelist of the slub is wrong.

crash> struct kmem_cache_cpu 887e7d7a24b0

struct kmem_cache_cpu {
   freelist = 0x2026,   <<< the value is id of one inode
   tid = 29567861,
   page = 0xea0132168d00,
   partial = 0x0
}

Also, I found two different btrfs inodes pointing to the same delayed_node,
which means the same slab object was handed out twice.

I think this object was freed twice, so its freelist next pointer points to
itself and the allocator returns the same object twice.

Using that object again then corrupts the freelist.

The following interleaving would free the delayed node twice, though I
haven't identified which processes actually race.

Process A (btrfs_evict_inode)             Process B (btrfs_get_delayed_node)

call btrfs_remove_delayed_node            call btrfs_get_delayed_node

                                          node = ACCESS_ONCE(btrfs_inode->delayed_node);

BTRFS_I(inode)->delayed_node = NULL;
btrfs_release_delayed_node(delayed_node);

                                          if (node) {
                                                  atomic_inc(&node->refs);
                                                  return node;
                                          }

                                          ...

                                          btrfs_release_delayed_node(delayed_node);


By looking at the race, it seems the following commit has addressed it.

btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ec35e48b286959991cdbb886f1bdeda4575c80b4

thanks,
liubo


I don't think so.
That patch resolved a problem in the radix_tree_lookup path. I don't think
it can resolve my problem, where the race occurs after
ACCESS_ONCE(btrfs_inode->delayed_node), because if
ACCESS_ONCE(btrfs_inode->delayed_node) returns the node,
btrfs_get_delayed_node returns immediately and never reaches that path.


Thanks,
Sunny




1313 void btrfs_remove_delayed_node(struct inode *inode)
1314 {
1315 struct btrfs_delayed_node *delayed_node;
1316
1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
1318 if (!delayed_node)
1319 return;
1320
1321 BTRFS_I(inode)->delayed_node = NULL;
1322 btrfs_release_delayed_node(delayed_node);
1323 }


   87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct inode
*inode)
   88 {
   89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
   90 struct btrfs_root *root = btrfs_inode->root;
   91 u64 ino = btrfs_ino(inode);
   92 struct btrfs_delayed_node *node;
   93
   94 node = ACCESS_ONCE(btrfs_inode->delayed_node);
   95 if (node) {
   96 atomic_inc(&node->refs);
   97 return node;
   98 }


Thanks,

Sunny


PS:



panic informations

PID: 73638  TASK: 887deb586200  CPU: 38  COMMAND: "dockerd"
  #0 [88130404f940] machine_kexec at 8105ec10
  #1 [88130404f9b0] crash_kexec at 811145b8
  #2 [88130404fa80] oops_end at 8101a868
  #3 [88130404fab0] no_context at 8106ea91
  #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d
  #5 [88130404fb50] bad_area_nosemaphore at 8106eda3
  #6 [88130404fb60] __do_page_fault at 8106f328
  #7 [88130404fbd0] do_page_fault at 8106f637
  #8 [88130404fc10] page_fault at 816f6308
 [exception RIP: kmem_cache_alloc+121]
 RIP: 811ef019  RSP: 88130404fcc8  RFLAGS: 00010286
 RAX:   RBX:   RCX: 01c32b76
 RDX: 01c32b75  RSI:   RDI: 000224b0
 RBP: 88130404fd08   R8: 887e7d7a24b0   R9: 
 R10: 8802668b6618  R11: 0002  R12: 887e3e230a00
 R13: 2026  R14: 887e3e230a00  R15: a01abf49
 ORIG_RAX:   CS: 0010  SS: 0018
  #9 [88130404fd10] btrfs_get_or_create_delayed_node at a01abf49
[btrfs]
#10 [88130404fd60] btrfs_delayed_update_inode at a01aea12
[btrfs]
#11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs]
#12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs]
#13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs]
#14 [88130404fe50] touch_atime at 812286d3
#15 [88130404fe90] iterate_dir at 81221929
#16 [88130404fee0] sys_getdents64 at 81221a19
#17 [88130404ff50] system_call_fastpath at 816f2594
 RIP: 006b68e4  RSP: 00c866259080  RFLAGS: 0246
 RAX: ffda  RBX: 00c828dbbe00  RCX: 006b68e4
 RDX: 1000  RSI: 00c83da14000  RDI: 0011
 RBP:    R8:    R9: 

Re: btrfs panic problem

2018-09-25 Thread sunny.s.zhang



On 2018-09-20 00:12, Nikolay Borisov wrote:

On 19.09.2018 02:53, sunny.s.zhang wrote:

Hi Duncan,

Thank you for your advice; I understand what you mean. But I have
reviewed the latest btrfs code, and I think the issue still exists.

If btrfs_get_delayed_node has just executed line 71 and we then switch to
another process that runs through line 1282 and releases the delayed node
at the end, then when we switch back, btrfs_get_delayed_node finds the
node non-NULL and uses it as normal - that means we use freed memory.

At some point this memory will be freed again.

The latest code is below.

1278 void btrfs_remove_delayed_node(struct btrfs_inode *inode)
1279 {
1280 struct btrfs_delayed_node *delayed_node;
1281
1282 delayed_node = READ_ONCE(inode->delayed_node);
1283 if (!delayed_node)
1284 return;
1285
1286 inode->delayed_node = NULL;
1287 btrfs_release_delayed_node(delayed_node);
1288 }


   64 static struct btrfs_delayed_node *btrfs_get_delayed_node(
   65 struct btrfs_inode *btrfs_inode)
   66 {
   67 struct btrfs_root *root = btrfs_inode->root;
   68 u64 ino = btrfs_ino(btrfs_inode);
   69 struct btrfs_delayed_node *node;
   70
   71 node = READ_ONCE(btrfs_inode->delayed_node);
   72 if (node) {
   73 refcount_inc(&node->refs);
   74 return node;
   75 }
   76
   77 spin_lock(&root->inode_lock);
   78 node = radix_tree_lookup(&root->delayed_nodes_tree, ino);



Your analysis is correct, however it's missing one crucial point -
btrfs_remove_delayed_node is called only from btrfs_evict_inode, and
inodes are evicted only when all other references have been dropped. Check
the code in evict_inodes(): inodes are added to the dispose list when
their i_count is 0, at which point there should be no references to the
inode. This invalidates your analysis...

Thanks.
Yes, I know this, and I know that another process cannot use this inode
while it is in the I_FREEING state.
But Chris has fixed a similar bug that was found in production, which
means this can occur under some conditions.


btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ec35e48b286959991cdbb886f1bdeda4575c80b4


On 2018-09-18 13:05, Duncan wrote:

sunny.s.zhang posted on Tue, 18 Sep 2018 08:28:14 +0800 as excerpted:


My OS(4.1.12) panic in kmem_cache_alloc, which is called by
btrfs_get_or_create_delayed_node.

I found that the freelist of the slub is wrong.

[Not a dev, just a btrfs list regular and user, myself.  But here's a
general btrfs list recommendations reply...]

You appear to mean kernel 4.1.12 -- confirmed by the version reported in
the posted dump:  4.1.12-112.14.13.el6uek.x86_64

OK, so from the perspective of this forward-development-focused list,
kernel 4.1 is pretty ancient history, but you do have a number of
options.

First let's consider the general situation.  Most people choose an
enterprise distro for supported stability, and that's certainly a valid
thing to want.  However, btrfs, while now reaching early maturity for the
basics (single device in single or dup mode, and multi-device in single/
raid0/1/10 modes, note that raid56 mode is newer and less mature),
remains under quite heavy development, and keeping reasonably current is
recommended for that reason.

So you chose an enterprise distro presumably to lock in supported
stability for several years, but you chose a filesystem, btrfs, that's
still under heavy development, with reasonably current kernels and
userspace recommended as tending to have the known bugs fixed.  There's a
bit of a conflict there, and the /general/ recommendation would thus be
to consider whether one or the other of those choices are inappropriate
for your use-case, because it's really quite likely that if you really
want the stability of an enterprise distro and kernel, that btrfs isn't
as stable a filesystem as you're likely to want to match with it.
Alternatively, if you want something newer to match the still under heavy
development btrfs, you very likely want a distro that's not focused on
years-old stability just for the sake of it.  One or the other is likely
to be a poor match for your needs, and choosing something else that's a
better match is likely to be a much better experience for you.

But perhaps you do have reason to want to run the newer and not quite to
traditional enterprise-distro level stability btrfs, on an otherwise
older and very stable enterprise distro.  That's fine, provided you know
what you're getting yourself into, and are pre

Re: btrfs panic problem

2018-09-25 Thread Nikolay Borisov



On 25.09.2018 11:20, sunny.s.zhang wrote:
> 
> On 2018-09-20 02:36, Liu Bo wrote:
>> On Mon, Sep 17, 2018 at 5:28 PM, sunny.s.zhang
>>  wrote:
>>> Hi All,
>>>
>>> My OS(4.1.12) panic in kmem_cache_alloc, which is called by
>>> btrfs_get_or_create_delayed_node.
>>>
>>> I found that the freelist of the slub is wrong.
>>>
>>> crash> struct kmem_cache_cpu 887e7d7a24b0
>>>
>>> struct kmem_cache_cpu {
>>>    freelist = 0x2026,   <<< the value is id of one inode
>>>    tid = 29567861,
>>>    page = 0xea0132168d00,
>>>    partial = 0x0
>>> }
>>>
>>> Also, I found two different btrfs inodes pointing to the same
>>> delayed_node, which means the same slab object was handed out twice.
>>>
>>> I think this object was freed twice, so its freelist next pointer points
>>> to itself and the allocator returns the same object twice.
>>>
>>> Using that object again then corrupts the freelist.
>>>
>>> The following interleaving would free the delayed node twice, though I
>>> haven't identified which processes actually race.
>>>
>>> Process A (btrfs_evict_inode)             Process B (btrfs_get_delayed_node)
>>>
>>> call btrfs_remove_delayed_node            call btrfs_get_delayed_node
>>>
>>>                                           node = ACCESS_ONCE(btrfs_inode->delayed_node);
>>>
>>> BTRFS_I(inode)->delayed_node = NULL;
>>> btrfs_release_delayed_node(delayed_node);
>>>
>>>                                           if (node) {
>>>                                                   atomic_inc(&node->refs);
>>>                                                   return node;
>>>                                           }
>>>
>>>                                           ...
>>>
>>>                                           btrfs_release_delayed_node(delayed_node);
>>>
>> By looking at the race, it seems the following commit has addressed it.
>>
>> btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
>> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ec35e48b286959991cdbb886f1bdeda4575c80b4
>>
>>
>> thanks,
>> liubo
> 
> I don't think so.
> That patch resolved a problem in the radix_tree_lookup path. I don't think
> it can resolve my problem, where the race occurs after
> ACCESS_ONCE(btrfs_inode->delayed_node), because if
> ACCESS_ONCE(btrfs_inode->delayed_node) returns the node,
> btrfs_get_delayed_node returns immediately and never reaches that path.

Can you reproduce the problem on an upstream kernel with added delays?
The original report is from a RHEL-based distro (presumably Oracle
Unbreakable Linux), so there is currently no indication that this is a
genuine problem in upstream kernels.

> 
> Thanks,
> Sunny
> 
>>
>>> 1313 void btrfs_remove_delayed_node(struct inode *inode)
>>> 1314 {
>>> 1315 struct btrfs_delayed_node *delayed_node;
>>> 1316
>>> 1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
>>> 1318 if (!delayed_node)
>>> 1319 return;
>>> 1320
>>> 1321 BTRFS_I(inode)->delayed_node = NULL;
>>> 1322 btrfs_release_delayed_node(delayed_node);
>>> 1323 }
>>>
>>>
>>>    87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct
>>> inode
>>> *inode)
>>>    88 {
>>>    89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
>>>    90 struct btrfs_root *root = btrfs_inode->root;
>>>    91 u64 ino = btrfs_ino(inode);
>>>    92 struct btrfs_delayed_node *node;
>>>    93
>>>    94 node = ACCESS_ONCE(btrfs_inode->delayed_node);
>>>    95 if (node) {
>>>    96 atomic_inc(&node->refs);
>>>    97 return node;
>>>    98 }
>>>
>>>
>>> Thanks,
>>>
>>> Sunny
>>>
>>>
>>> PS:
>>>
>>> 
>>>
>>> panic informations
>>>
>>> PID: 73638  TASK: 887deb586200  CPU: 38  COMMAND: "dockerd"
>>>   #0 [88130404f940] machine_kexec at 8105ec10
>>>   #1 [88130404f9b0] crash_kexec at 811145b8
>>>   #2 [88130404fa80] oops_end at 8101a868
>>>   #3 [88130404fab0] no_context at 8106ea91
>>>   #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d
>>>   #5 [88130404fb50] bad_area_nosemaphore at 8106eda3
>>>   #6 [88130404fb60] __do_page_fault at 8106f328
>>>   #7 [88130404fbd0] do_page_fault at 8106f637
>>>   #8 [88130404fc10] page_fault at 816f6308
>>>  [exception RIP: kmem_cache_alloc+121]
>>>  RIP: 811ef019  RSP: 88130404fcc8  RFLAGS: 00010286
>>>  RAX:   RBX:   RCX: 01c32b76
>>>  RDX: 01c32b75  RSI:   RDI: 000224b0
>>>  RBP: 88130404fd08   R8: 887e7d7a24b0   R9: 
>>>  R10: 8802668b6618  R11: 0002  R12: 887e3e230a00
>>>  R13: 2026  R14: 887e3e230a00  R15: a01abf49
>>>  ORIG_RAX:   CS: 0010  SS: 0018
>>>   #9 [88130404fd10] btrfs_get_or_create_delayed_node at
>>> a01abf49
>>> [btrfs]
>>> #10 [88130404fd60] btrfs_delayed_update_inode at fff

Re: btrfs panic problem

2018-09-25 Thread sunny.s.zhang




On 2018-09-25 16:31, Nikolay Borisov wrote:


On 25.09.2018 11:20, sunny.s.zhang wrote:

On 2018-09-20 02:36, Liu Bo wrote:

On Mon, Sep 17, 2018 at 5:28 PM, sunny.s.zhang
 wrote:

Hi All,

My OS(4.1.12) panic in kmem_cache_alloc, which is called by
btrfs_get_or_create_delayed_node.

I found that the freelist of the slub is wrong.

crash> struct kmem_cache_cpu 887e7d7a24b0

struct kmem_cache_cpu {
    freelist = 0x2026,   <<< the value is id of one inode
    tid = 29567861,
    page = 0xea0132168d00,
    partial = 0x0
}

Also, I found two different btrfs inodes pointing to the same
delayed_node, which means the same slab object was handed out twice.

I think this object was freed twice, so its freelist next pointer points to
itself and the allocator returns the same object twice.

Using that object again then corrupts the freelist.

The following interleaving would free the delayed node twice, though I
haven't identified which processes actually race.

Process A (btrfs_evict_inode)             Process B (btrfs_get_delayed_node)

call btrfs_remove_delayed_node            call btrfs_get_delayed_node

                                          node = ACCESS_ONCE(btrfs_inode->delayed_node);

BTRFS_I(inode)->delayed_node = NULL;
btrfs_release_delayed_node(delayed_node);

                                          if (node) {
                                                  atomic_inc(&node->refs);
                                                  return node;
                                          }

                                          ...

                                          btrfs_release_delayed_node(delayed_node);


By looking at the race, it seems the following commit has addressed it.

btrfs: fix refcount_t usage when deleting btrfs_delayed_nodes
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ec35e48b286959991cdbb886f1bdeda4575c80b4


thanks,
liubo

I don't think so.
That patch resolved a problem in the radix_tree_lookup path. I don't think
it can resolve my problem, where the race occurs after
ACCESS_ONCE(btrfs_inode->delayed_node), because if
ACCESS_ONCE(btrfs_inode->delayed_node) returns the node,
btrfs_get_delayed_node returns immediately and never reaches that path.

Can you reproduce the problem on an upstream kernel with added delays?
The original report is from a RHEL-based distro (presumably Oracle
Unbreakable Linux), so there is currently no indication that this is a
genuine problem in upstream kernels.

Not yet; I will try to reproduce it later.
But I don't have any clue about the race window yet.
Thanks,
Sunny




Thanks,
Sunny


1313 void btrfs_remove_delayed_node(struct inode *inode)
1314 {
1315 struct btrfs_delayed_node *delayed_node;
1316
1317 delayed_node = ACCESS_ONCE(BTRFS_I(inode)->delayed_node);
1318 if (!delayed_node)
1319 return;
1320
1321 BTRFS_I(inode)->delayed_node = NULL;
1322 btrfs_release_delayed_node(delayed_node);
1323 }


    87 static struct btrfs_delayed_node *btrfs_get_delayed_node(struct
inode
*inode)
    88 {
    89 struct btrfs_inode *btrfs_inode = BTRFS_I(inode);
    90 struct btrfs_root *root = btrfs_inode->root;
    91 u64 ino = btrfs_ino(inode);
    92 struct btrfs_delayed_node *node;
    93
    94 node = ACCESS_ONCE(btrfs_inode->delayed_node);
    95 if (node) {
    96 atomic_inc(&node->refs);
    97 return node;
    98 }


Thanks,

Sunny


PS:



panic informations

PID: 73638  TASK: 887deb586200  CPU: 38  COMMAND: "dockerd"
   #0 [88130404f940] machine_kexec at 8105ec10
   #1 [88130404f9b0] crash_kexec at 811145b8
   #2 [88130404fa80] oops_end at 8101a868
   #3 [88130404fab0] no_context at 8106ea91
   #4 [88130404fb00] __bad_area_nosemaphore at 8106ec8d
   #5 [88130404fb50] bad_area_nosemaphore at 8106eda3
   #6 [88130404fb60] __do_page_fault at 8106f328
   #7 [88130404fbd0] do_page_fault at 8106f637
   #8 [88130404fc10] page_fault at 816f6308
  [exception RIP: kmem_cache_alloc+121]
  RIP: 811ef019  RSP: 88130404fcc8  RFLAGS: 00010286
  RAX:   RBX:   RCX: 01c32b76
  RDX: 01c32b75  RSI:   RDI: 000224b0
  RBP: 88130404fd08   R8: 887e7d7a24b0   R9: 
  R10: 8802668b6618  R11: 0002  R12: 887e3e230a00
  R13: 2026  R14: 887e3e230a00  R15: a01abf49
  ORIG_RAX:   CS: 0010  SS: 0018
   #9 [88130404fd10] btrfs_get_or_create_delayed_node at
a01abf49
[btrfs]
#10 [88130404fd60] btrfs_delayed_update_inode at a01aea12
[btrfs]
#11 [88130404fdb0] btrfs_update_inode at a015b199 [btrfs]
#12 [88130404fdf0] btrfs_dirty_inode at a015cd11 [btrfs]
#13 [88130404fe20] btrfs_update_time at a015fa25 [btrfs]
#14 [88130404fe50] touch_atime at 812286d3
#1

Re: [PATCH] Btrfs: get rid of btrfs_symlink_aops

2018-09-25 Thread David Sterba
On Mon, Sep 24, 2018 at 03:16:55PM -0700, Omar Sandoval wrote:
> From: Omar Sandoval 
> 
> The only aops we define for symlinks are identical to the aops for
> regular files. This has been the case since symlink support was added in
> commit 2b8d99a723a3 ("Btrfs: symlinks and hard links"). As far as I can
> tell, there wasn't a good reason to have separate aops then, and there
> isn't now, so let's just do what most other filesystems do and reuse the
> same structure.
> 
> Signed-off-by: Omar Sandoval 

Reviewed-by: David Sterba 

Can we also reuse btrfs_inode_operations for
btrfs_special_inode_operations? The only difference is the fiemap
operation, so if special files must not implement that callback,
that could be decided inside the callback itself.


Re: [PATCH v2 1/9] fstests: btrfs: _scratch_mkfs_sized fix min size without mixed option

2018-09-25 Thread Nikolay Borisov



On 25.09.2018 07:24, Anand Jain wrote:
> As of now _scratch_mkfs_sized() checks if the requested size is below 1G
> and forces the --mixed option for mkfs.btrfs. However, the correct size
> below which we need to force the mixed option, considering all possible
> group profiles, is roughly 256MB. So fix that.
> 
> Signed-off-by: Anand Jain 

Have you considered the implications of this w.r.t commit
d4da414a9a9d ("common/rc: raise btrfs mixed mode threshold to 1GB")

Initially this threshold was 100mb then Omar changed it to 1g. Does this
change affect generic/427?

> ---
>  common/rc | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/common/rc b/common/rc
> index d5bb1feee2c3..90dc3002bc3d 100644
> --- a/common/rc
> +++ b/common/rc
> @@ -969,7 +969,10 @@ _scratch_mkfs_sized()
>   ;;
>  btrfs)
>   local mixed_opt=
> - (( fssize <= 1024 * 1024 * 1024 )) && mixed_opt='--mixed'
> + # minimum size that's needed without the mixed option.
> + # Ref: btrfs-prog: btrfs_min_dev_size()
> + # Non mixed mode is also the default option.
> + (( fssize < $((256 * 1024 *1024)) )) && mixed_opt='--mixed'
>   $MKFS_BTRFS_PROG $MKFS_OPTIONS $mixed_opt -b $fssize $SCRATCH_DEV
>   ;;
>  jfs)
> 
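The patched threshold can be sanity-checked standalone; a hypothetical shell sketch mirroring the hunk above (not part of fstests itself):

```shell
# Hypothetical standalone version of the patched threshold check:
# force --mixed only below the ~256MB floor of btrfs_min_dev_size().
check_mixed_opt() {
    local fssize=$1 mixed_opt=
    (( fssize < 256 * 1024 * 1024 )) && mixed_opt='--mixed'
    echo "${mixed_opt:-none}"
}

check_mixed_opt $((100 * 1024 * 1024))   # below the floor
check_mixed_opt $((512 * 1024 * 1024))   # large enough for all profiles
```

With the old code, any size up to 1G would have forced --mixed; the new cutoff only does so below roughly 256MB.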


Re: [PATCH v2 5/9] generic/102 open code dev_size _scratch_mkfs_sized()

2018-09-25 Thread Nikolay Borisov



On 25.09.2018 07:24, Anand Jain wrote:
> Open code helps to grep and find out parameter sent to the
> _scratch_mkfs_sized here.
> 
> Signed-off-by: Anand Jain 

IMO this is noise; you can just as easily do
"grep _scratch_mkfs_sized" and then open the file to inspect the actual
argument. But it's up to the xfstests maintainers.
> ---
>  tests/generic/102 | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/tests/generic/102 b/tests/generic/102
> index faf940ac5070..aad496a5bc69 100755
> --- a/tests/generic/102
> +++ b/tests/generic/102
> @@ -31,8 +31,7 @@ _require_scratch
>  
>  rm -f $seqres.full
>  
> -dev_size=$((512 * 1024 * 1024)) # 512MB filesystem
> -_scratch_mkfs_sized $dev_size >>$seqres.full 2>&1
> +_scratch_mkfs_sized $((512 * 1024 * 1024)) >>$seqres.full 2>&1
>  _scratch_mount
>  
>  for ((i = 0; i < 10; i++)); do
> 


Re: [PATCH v2 2/9] generic/015 fix to test the default non-mixed mode

2018-09-25 Thread Nikolay Borisov



On 25.09.2018 07:24, Anand Jain wrote:
> Commit 97575acd7495 (generic/015: Change the test filesystem size to
> 101mb) created a 101MB FS instead of a 100MB FS to make sure the FS is
> created in non-mixed mode.
> 
> btrfs-progs commit 18e2663db3e1 (btrfs-progs: Add minimum device size
> check) added a more accurate minimum space required to create a btrfs
> FS in non-mixed mode depending on the group profile; considering all
> group profiles we need at least 256MB (rounded upward).
> 
> So this patch changes the FS size created by _scratch_mkfs_sized()
> to 256MB so that the FS is created in non-mixed mode for any group
> profile.
> 
> Mixed block groups can still be tested by setting MKFS_OPTIONS explicitly.
> 
> Signed-off-by: Anand Jain 
> ---
>  tests/generic/015 | 8 +++-----
>  1 file changed, 3 insertions(+), 5 deletions(-)
> 
> diff --git a/tests/generic/015 b/tests/generic/015
> index 0f4d29800f4f..e6c8d7c37c07 100755
> --- a/tests/generic/015
> +++ b/tests/generic/015
> @@ -37,11 +37,9 @@ _supported_os Linux
>  _require_scratch
>  _require_no_large_scratch_dev
>  
> -# With filesystems less than 100mb btrfs is created in mixed mode
> -# which can lead to slight accounting errors of 1mb. Having the
> -# fs be at least 101 mb ensures those errors are within the error
> -# tolerance of 1%
> -_scratch_mkfs_sized `expr 101 \* 1024 \* 1024` >/dev/null 2>&1 \
> +# btrfs needs at least 256MB (with upward round off) to create a non-mixed 
> mode
> +# fs. Ref: btrfs-progs: btrfs_min_dev_size()
> +_scratch_mkfs_sized $((256 * 1024 * 1024)) >> $seqres.full 2>&1 \

Yeah, this test seems rather flaky with mixed block groups. I had fixed
it, and then Omar changed the mixed bg threshold to 1G and it started
failing again internally. I'm happy with this change provided the first
patch lands as well.

>  || _fail "mkfs failed"
>  _scratch_mount
>  out=$SCRATCH_MNT/fillup.$$
> 


Re: [PATCH] btrfs: relocation: Add basic extent backref related comment for build_backref_tree()

2018-09-25 Thread David Sterba
On Tue, Sep 25, 2018 at 02:37:46PM +0800, Qu Wenruo wrote:
> fs/btrfs/relocation.c:build_backref_tree() is code from the 2009 era;
> although it works pretty well, it's not easy to understand, especially
> combined with the complex btrfs backref format.
> 
> This patch adds some basic comments to the backref-building part of the
> code, making it less hard to read, at least for the backref searching part.
> 
> Signed-off-by: Qu Wenruo 

Added to misc-next, thanks.


Re: [PATCH 1/2] btrfs: Use NAME_MAX to replace intermediate number of BTRFS_NAME_LEN

2018-09-25 Thread David Sterba
On Tue, Sep 25, 2018 at 08:06:25AM +0800, Qu Wenruo wrote:
> Since we're following the name size limit of linux, just use NAME_MAX.
> 
> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/ctree.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index 53af9f5253f4..5ab6d1f6e055 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -65,7 +65,7 @@ struct btrfs_ordered_sum;
>   * we can actually store much bigger names, but lets not confuse the rest
>   * of linux
>   */
> -#define BTRFS_NAME_LEN 255
> +#define BTRFS_NAME_LEN NAME_MAX

While the values are the same, the symbolic names have a slightly
different meaning. NAME_MAX is from the public API, BTRFS_NAME_LEN is
defined as btrfs limit, and de facto part of the on-disk format. These
are independent, although compatible for all practical purposes. I would
not conflate the two in the define, the comment could be updated to
document that better though.


Re: [PATCH 2/2] btrfs: tree-checker: Avoid using max() for stack array allocation

2018-09-25 Thread David Sterba
On Tue, Sep 25, 2018 at 08:06:26AM +0800, Qu Wenruo wrote:
> Although BTRFS_NAME_LEN and XATTR_NAME_MAX is the same value (255),
> max(BTRFS_NAME_LEN, XATTR_NAME_MAX) should be optimized as const at
> runtime.
> 
> However S390x' arch dependent option "-mwarn-dynamicstack" could still
> report it as dyanamic stack allocation.
> 
> Just use BTRFS_NAME_LEN directly to avoid such false alert.

Same reasoning as for the NAME_MAX, these are different things.

> Signed-off-by: Qu Wenruo 
> ---
>  fs/btrfs/tree-checker.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
> index db835635372f..4c045609909b 100644
> --- a/fs/btrfs/tree-checker.c
> +++ b/fs/btrfs/tree-checker.c
> @@ -336,7 +336,7 @@ static int check_dir_item(struct btrfs_fs_info *fs_info,
>*/
>   if (key->type == BTRFS_DIR_ITEM_KEY ||
>   key->type == BTRFS_XATTR_ITEM_KEY) {
> - char namebuf[max(BTRFS_NAME_LEN, XATTR_NAME_MAX)];
> + char namebuf[BTRFS_NAME_LEN];

The updated implementation of max() can now handle the expression
without a warning, with sufficiently new compiler so I don't think we
need to fix that.

Alternatively, you could use BTRFS_NAME_LEN and add a
BUILD_BUG_ON(BTRFS_NAME_LEN < XATTR_NAME_MAX) with a comment why.

>  
>   read_extent_buffer(leaf, namebuf,
>   (unsigned long)(di + 1), name_len);
> -- 
> 2.19.0


[PATCH V6] Btrfs: enhance raid1/10 balance heuristic

2018-09-25 Thread Timofey Titovets
Currently the btrfs raid1/10 balancer distributes requests across mirrors
based on pid % number of mirrors.

Make the logic aware of:
 - whether one of the underlying devices is non-rotational
 - the queue length of the underlying devices

By default, keep the pid % num_mirrors guess, but:
 - if one of the mirrors is non-rotational, repick it as the optimal one
 - if an underlying mirror has a shorter queue length than the optimal
   one, repick that mirror

To avoid round-robin bouncing between mirrors, round the queue length
down:
 - by 8 for rotational devices
 - by 2 for non-rotational devices

Some bench results from mail list
(Dmitrii Tcvetkov ):
Benchmark summary (arithmetic mean of 3 runs):
 Mainline Patch

RAID1  | 18.9 MiB/s | 26.5 MiB/s
RAID10 | 30.7 MiB/s | 30.7 MiB/s

mainline, fio got lucky to read from first HDD (quite slow HDD):
Jobs: 1 (f=1): [r(1)][100.0%][r=8456KiB/s,w=0KiB/s][r=264,w=0 IOPS]
  read: IOPS=265, BW=8508KiB/s (8712kB/s)(499MiB/60070msec)
  lat (msec): min=2, max=825, avg=60.17, stdev=65.06

mainline, fio got lucky to read from second HDD (much more modern):
Jobs: 1 (f=1): [r(1)][8.7%][r=11.9MiB/s,w=0KiB/s][r=380,w=0 IOPS]
  read: IOPS=378, BW=11.8MiB/s (12.4MB/s)(710MiB/60051msec)
  lat (usec): min=416, max=644286, avg=42312.74, stdev=48518.56

mainline, fio got lucky to read from an SSD:
Jobs: 1 (f=1): [r(1)][100.0%][r=436MiB/s,w=0KiB/s][r=13.9k,w=0 IOPS]
  read: IOPS=13.9k, BW=433MiB/s (454MB/s)(25.4GiB/60002msec) 
  lat (usec): min=343, max=16319, avg=1152.52, stdev=245.36

With the patch, 2 HDDs:
Jobs: 1 (f=1): [r(1)][100.0%][r=17.5MiB/s,w=0KiB/s][r=560,w=0 IOPS]
  read: IOPS=560, BW=17.5MiB/s (18.4MB/s)(1053MiB/60052msec)
  lat (usec): min=435, max=341037, avg=28511.64, stdev=3.14

With the patch, HDD(old one)+SSD:
Jobs: 1 (f=1): [r(1)][100.0%][r=371MiB/s,w=0KiB/s][r=11.9k,w=0 IOPS]
  read: IOPS=11.6k, BW=361MiB/s (379MB/s)(21.2GiB/60084msec)
  lat  (usec): min=363, max=346752, avg=1381.73, stdev=6948.32

Changes:
  v1 -> v2:
- Use helper part_in_flight() from genhd.c
  to get queue length
- Move guess code to guess_optimal()
- Change balancer logic, try use pid % mirror by default
  Make balancing on spinning rust if one of underline devices
  are overloaded
  v2 -> v3:
- Fix arg for RAID10 - use sub_stripes, instead of num_stripes
  v3 -> v4:
- Rebased on latest misc-next
  v4 -> v5:
- Rebased on latest misc-next
  v5 -> v6:
- Fix spelling
- Include bench results

Signed-off-by: Timofey Titovets 
Tested-by: Dmitrii Tcvetkov 
Reviewed-by: Dmitrii Tcvetkov 
---
 block/genhd.c  |   1 +
 fs/btrfs/volumes.c | 111 -
 2 files changed, 110 insertions(+), 2 deletions(-)

diff --git a/block/genhd.c b/block/genhd.c
index 9656f9e9f99e..5ea5acc88d3c 100644
--- a/block/genhd.c
+++ b/block/genhd.c
@@ -81,6 +81,7 @@ void part_in_flight(struct request_queue *q, struct hd_struct 
*part,
atomic_read(&part->in_flight[1]);
}
 }
+EXPORT_SYMBOL_GPL(part_in_flight);
 
 void part_in_flight_rw(struct request_queue *q, struct hd_struct *part,
   unsigned int inflight[2])
diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index c95af358b71f..fa7dd6ac087f 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -16,6 +16,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ctree.h"
 #include "extent_map.h"
@@ -5201,6 +5202,111 @@ int btrfs_is_parity_mirror(struct btrfs_fs_info 
*fs_info, u64 logical, u64 len)
return ret;
 }
 
+/**
+ * bdev_get_queue_len - return the rounded-down in-flight queue length of bdev
+ *
+ * @bdev: target bdev
+ * @round_down: rounding factor, large for HDD and small for SSD (e.g. 8 and 2)
+ */
+static int bdev_get_queue_len(struct block_device *bdev, int round_down)
+{
+   int sum;
+   struct hd_struct *bd_part = bdev->bd_part;
+   struct request_queue *rq = bdev_get_queue(bdev);
+   uint32_t inflight[2] = {0, 0};
+
+   part_in_flight(rq, bd_part, inflight);
+
+   sum = max_t(uint32_t, inflight[0], inflight[1]);
+
+   /*
+    * Avoid switching mirrors on every small fluctuation
+    * by rounding the count down.
+    */
+   return ALIGN_DOWN(sum, round_down);
+}
+
+/**
+ * guess_optimal - return guessed optimal mirror
+ *
+ * The optimal mirror is expected to be pid % num_stripes.
+ *
+ * That's generally fine for spreading load; add some balancing
+ * based on per-device queue length on top of it.
+ *
+ * Basic ideas:
+ *  - Sequential reads generate a low number of requests,
+ *so if the drives are equally loaded, use pid % num_stripes balancing
+ *  - For mixed r

Re: [PATCH 2/2] btrfs: tree-checker: Avoid using max() for stack array allocation

2018-09-25 Thread Qu Wenruo


On 2018-09-25 23:34, David Sterba wrote:
> On Tue, Sep 25, 2018 at 08:06:26AM +0800, Qu Wenruo wrote:
>> Although BTRFS_NAME_LEN and XATTR_NAME_MAX are the same value (255),
>> max(BTRFS_NAME_LEN, XATTR_NAME_MAX) should be optimized to a constant
>> at compile time.
>>
>> However, S390x's arch-dependent option "-mwarn-dynamicstack" can still
>> report it as a dynamic stack allocation.
>>
>> Just use BTRFS_NAME_LEN directly to avoid this false alert.
> 
> Same reasoning as for the NAME_MAX, these are different things.
> 
>> Signed-off-by: Qu Wenruo 
>> ---
>>  fs/btrfs/tree-checker.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/fs/btrfs/tree-checker.c b/fs/btrfs/tree-checker.c
>> index db835635372f..4c045609909b 100644
>> --- a/fs/btrfs/tree-checker.c
>> +++ b/fs/btrfs/tree-checker.c
>> @@ -336,7 +336,7 @@ static int check_dir_item(struct btrfs_fs_info *fs_info,
>>   */
>>  if (key->type == BTRFS_DIR_ITEM_KEY ||
>>  key->type == BTRFS_XATTR_ITEM_KEY) {
>> -char namebuf[max(BTRFS_NAME_LEN, XATTR_NAME_MAX)];
>> +char namebuf[BTRFS_NAME_LEN];
> 
> The updated implementation of max() can now handle the expression
> without a warning, with sufficiently new compiler so I don't think we
> need to fix that.

Yes, it's mostly a workaround to make S390 happy.

And if it can be fixed by kernel config/compiler, it doesn't make much
sense to fix it here.

So please discard these 2 patches.

Thanks,
Qu

> 
> Alternatively, you could use BTRFS_NAME_LEN and add a
> BUILD_BUG_ON(BTRFS_NAME_LEN < XATTR_NAME_MAX) with a comment why.
> 
>>  
>>  read_extent_buffer(leaf, namebuf,
>>  (unsigned long)(di + 1), name_len);
>> -- 
>> 2.19.0





Re: [PATCH v2 1/9] fstests: btrfs: _scratch_mkfs_sized fix min size without mixed option

2018-09-25 Thread Anand Jain




On 09/25/2018 06:51 PM, Nikolay Borisov wrote:



On 25.09.2018 07:24, Anand Jain wrote:

As of now _scratch_mkfs_sized() checks if the requested size is below 1G
and forces the --mixed option for mkfs.btrfs. However, considering all
possible group profiles, the correct size at which we need to force the
mixed option is roughly 256MB. So fix that.

Signed-off-by: Anand Jain 


Have you considered the implications of this w.r.t commit
d4da414a9a9d ("common/rc: raise btrfs mixed mode threshold to 1GB")


Initially this threshold was 100MB, then Omar changed it to 1G. Does this
change affect generic/427?


d4da414a9a9d does not explain what problem Omar wanted to address,
mainly what the failure was about.


And no, it does not affect it. I have verified generic/427 with kernels 4.1
and 4.19-rc5 and btrfs-progs 4.1, 4.9 and the latest from kdave; they all
run fine. Good to integrate.


Thanks, Anand




---
  common/rc | 5 -
  1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/common/rc b/common/rc
index d5bb1feee2c3..90dc3002bc3d 100644
--- a/common/rc
+++ b/common/rc
@@ -969,7 +969,10 @@ _scratch_mkfs_sized()
;;
  btrfs)
local mixed_opt=
-   (( fssize <= 1024 * 1024 * 1024 )) && mixed_opt='--mixed'
+   # minimum size that's needed without the mixed option.
+   # Ref: btrfs-prog: btrfs_min_dev_size()
+   # Non mixed mode is also the default option.
+   (( fssize < $((256 * 1024 *1024)) )) && mixed_opt='--mixed'
$MKFS_BTRFS_PROG $MKFS_OPTIONS $mixed_opt -b $fssize $SCRATCH_DEV
;;
  jfs)



Re: [PATCH v2 5/9] generic/102 open code dev_size _scratch_mkfs_sized()

2018-09-25 Thread Anand Jain




On 09/25/2018 06:54 PM, Nikolay Borisov wrote:



On 25.09.2018 07:24, Anand Jain wrote:

Open coding helps to grep and find out the parameter sent to
_scratch_mkfs_sized here.

Signed-off-by: Anand Jain 


IMO this is noise, you can just as simply do
"grep _scratch_mkfs_sized" and then open the file to inspect the actual
argument. But it's up to the xfstests maintainers.


 I am OK with it. It's just a nice cleanup.

Thanks, Anand


---
  tests/generic/102 | 3 +--
  1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/tests/generic/102 b/tests/generic/102
index faf940ac5070..aad496a5bc69 100755
--- a/tests/generic/102
+++ b/tests/generic/102
@@ -31,8 +31,7 @@ _require_scratch
  
  rm -f $seqres.full
  
-dev_size=$((512 * 1024 * 1024)) # 512MB filesystem

-_scratch_mkfs_sized $dev_size >>$seqres.full 2>&1
+_scratch_mkfs_sized $((512 * 1024 * 1024)) >>$seqres.full 2>&1
  _scratch_mount
  
  for ((i = 0; i < 10; i++)); do




Re: [PATCH v2 1/9] fstests: btrfs: _scratch_mkfs_sized fix min size without mixed option

2018-09-25 Thread Nikolay Borisov



On 26.09.2018 07:07, Anand Jain wrote:
> 
> 
> On 09/25/2018 06:51 PM, Nikolay Borisov wrote:
>>
>>
>> On 25.09.2018 07:24, Anand Jain wrote:
>>> As of now _scratch_mkfs_sized() checks if the requested size is below 1G
>>> and forces the --mixed option for the mkfs.btrfs. Well the correct size
>>> considering all possible group profiles at which we need to force the
>>> mixed option is roughly 256Mbytes. So fix that.
>>>
>>> Signed-off-by: Anand Jain 
>>
>> Have you considered the implications of this w.r.t commit
>> d4da414a9a9d ("common/rc: raise btrfs mixed mode threshold to 1GB")
>>
>> Initially this threshold was 100mb then Omar changed it to 1g. Does this
>> change affect generic/427?
> 
> d4da414a9a9d does not explain what was the problem that Omar wanted to
> address, mainly what was the failure about.

I just retested on upstream 4.19.0-rc3 with Omar's patch reverted (so
anything above 100m for fs size is created with non-mixed block groups)
and the test succeeded. So indeed your change seems to not make a
difference for this test.

> 
> And no it does not affect. I have verified generic/427 with kernel 4.1
> and 4.19-rc5 with  btrfs-progs 4.1, 4.9 and latest from kdave they all
> run fine. Good to integrate.
> 
> Thanks, Anand
> 
>>
>>> ---
>>>   common/rc | 5 -
>>>   1 file changed, 4 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/common/rc b/common/rc
>>> index d5bb1feee2c3..90dc3002bc3d 100644
>>> --- a/common/rc
>>> +++ b/common/rc
>>> @@ -969,7 +969,10 @@ _scratch_mkfs_sized()
>>>   ;;
>>>   btrfs)
>>>   local mixed_opt=
>>> -    (( fssize <= 1024 * 1024 * 1024 )) && mixed_opt='--mixed'
>>> +    # minimum size that's needed without the mixed option.
>>> +    # Ref: btrfs-prog: btrfs_min_dev_size()
>>> +    # Non mixed mode is also the default option.
>>> +    (( fssize < $((256 * 1024 *1024)) )) && mixed_opt='--mixed'
>>>   $MKFS_BTRFS_PROG $MKFS_OPTIONS $mixed_opt -b $fssize $SCRATCH_DEV
>>>   ;;
>>>   jfs)
>>>
> 


Re: [PATCH v2 1/9] fstests: btrfs: _scratch_mkfs_sized fix min size without mixed option

2018-09-25 Thread Anand Jain




On 09/26/2018 02:34 PM, Nikolay Borisov wrote:



On 26.09.2018 07:07, Anand Jain wrote:



On 09/25/2018 06:51 PM, Nikolay Borisov wrote:



On 25.09.2018 07:24, Anand Jain wrote:

As of now _scratch_mkfs_sized() checks if the requested size is below 1G
and forces the --mixed option for the mkfs.btrfs. Well the correct size
considering all possible group profiles at which we need to force the
mixed option is roughly 256Mbytes. So fix that.

Signed-off-by: Anand Jain 


Have you considered the implications of this w.r.t commit
d4da414a9a9d ("common/rc: raise btrfs mixed mode threshold to 1GB")

Initially this threshold was 100mb then Omar changed it to 1g. Does this
change affect generic/427?


d4da414a9a9d does not explain what was the problem that Omar wanted to
address, mainly what was the failure about.


I just retested on upstream 4.19.0-rc3 with Omar's patch reverted (so
anything above 100m for fs size is created with non-mixed block groups)
and the test succeeded. So indeed your change seems to not make a
difference for this test.


 Thanks for testing.
Anand



And no it does not affect. I have verified generic/427 with kernel 4.1
and 4.19-rc5 with  btrfs-progs 4.1, 4.9 and latest from kdave they all
run fine. Good to integrate.

Thanks, Anand




---
   common/rc | 5 -
   1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/common/rc b/common/rc
index d5bb1feee2c3..90dc3002bc3d 100644
--- a/common/rc
+++ b/common/rc
@@ -969,7 +969,10 @@ _scratch_mkfs_sized()
   ;;
   btrfs)
   local mixed_opt=
-    (( fssize <= 1024 * 1024 * 1024 )) && mixed_opt='--mixed'
+    # minimum size that's needed without the mixed option.
+    # Ref: btrfs-prog: btrfs_min_dev_size()
+    # Non mixed mode is also the default option.
+    (( fssize < $((256 * 1024 *1024)) )) && mixed_opt='--mixed'
   $MKFS_BTRFS_PROG $MKFS_OPTIONS $mixed_opt -b $fssize $SCRATCH_DEV
   ;;
   jfs)





[PATCH v2] test unaligned punch hole at ENOSPC

2018-09-25 Thread Anand Jain
Try to punch a hole with an unaligned size and offset when the FS
returns ENOSPC.

Signed-off-by: Anand Jain 
---
v1->v2: Use at least 256MB to test.
This test case fails on btrfs as of now.

 tests/btrfs/172 | 66 +
 tests/btrfs/172.out |  2 ++
 tests/btrfs/group   |  1 +
 3 files changed, 69 insertions(+)
 create mode 100755 tests/btrfs/172
 create mode 100644 tests/btrfs/172.out

diff --git a/tests/btrfs/172 b/tests/btrfs/172
new file mode 100755
index ..e84e742a3f1a
--- /dev/null
+++ b/tests/btrfs/172
@@ -0,0 +1,66 @@
+#! /bin/bash
+# SPDX-License-Identifier: GPL-2.0
+# Copyright (c) 2018 Oracle. All Rights Reserved.
+#
+# FS QA Test 172
+#
+# Test if the unaligned (by size and offset) punch hole is successful when FS
+# is at ENOSPC.
+#
+seq=`basename $0`
+seqres=$RESULT_DIR/$seq
+echo "QA output created by $seq"
+
+here=`pwd`
+tmp=/tmp/$$
+status=1   # failure is the default!
+trap "_cleanup; exit \$status" 0 1 2 3 15
+
+_cleanup()
+{
+   cd /
+   rm -f $tmp.*
+}
+
+# get standard environment, filters and checks
+. ./common/rc
+. ./common/filter
+
+# remove previous $seqres.full before test
+rm -f $seqres.full
+
+# real QA test starts here
+
+# Modify as appropriate.
+_supported_fs generic
+_supported_os Linux
+_require_scratch
+
+_scratch_mkfs_sized $((256 * 1024 *1024)) >> $seqres.full
+
+# max_inline helps to create regular extent
+_scratch_mount "-o max_inline=0,nodatacow"
+
+echo "Fill fs upto ENOSPC" >> $seqres.full
+dd status=none if=/dev/zero of=$SCRATCH_MNT/filler bs=512 >> $seqres.full 2>&1
+
+extent_size=$(_scratch_btrfs_sectorsize)
+unalign_by=512
+echo extent_size=$extent_size unalign_by=$unalign_by >> $seqres.full
+
+hole_offset=0
+hole_len=$unalign_by
+run_check fallocate -p -o $hole_offset -l $hole_len $SCRATCH_MNT/filler
+
+hole_offset=$(($extent_size + $unalign_by))
+hole_len=$(($extent_size - $unalign_by))
+run_check fallocate -p -o $hole_offset -l $hole_len $SCRATCH_MNT/filler
+
+hole_offset=$(($extent_size * 2 + $unalign_by))
+hole_len=$(($extent_size * 5))
+run_check fallocate -p -o $hole_offset -l $hole_len $SCRATCH_MNT/filler
+
+# success, all done
+echo "Silence is golden"
+status=0
+exit
diff --git a/tests/btrfs/172.out b/tests/btrfs/172.out
new file mode 100644
index ..ce2de3f0d107
--- /dev/null
+++ b/tests/btrfs/172.out
@@ -0,0 +1,2 @@
+QA output created by 172
+Silence is golden
diff --git a/tests/btrfs/group b/tests/btrfs/group
index feffc45b6564..7e1a638ab7e1 100644
--- a/tests/btrfs/group
+++ b/tests/btrfs/group
@@ -174,3 +174,4 @@
 169 auto quick send
 170 auto quick snapshot
 171 auto quick qgroup
+172 auto quick
-- 
1.8.3.1



[PATCH 1/8] mm: push vm_fault into the page fault handlers

2018-09-25 Thread Josef Bacik
In preparation for caching pages during filemap faults we need to push
the struct vm_fault up a level into the arch page fault handlers, since
they are the ones responsible for retrying if we unlock the mmap_sem.

Signed-off-by: Josef Bacik 
---
 arch/alpha/mm/fault.c |  4 ++-
 arch/arc/mm/fault.c   |  2 ++
 arch/arm/mm/fault.c   | 18 -
 arch/arm64/mm/fault.c | 18 +++--
 arch/hexagon/mm/vm_fault.c|  4 ++-
 arch/ia64/mm/fault.c  |  4 ++-
 arch/m68k/mm/fault.c  |  5 ++--
 arch/microblaze/mm/fault.c|  4 ++-
 arch/mips/mm/fault.c  |  4 ++-
 arch/nds32/mm/fault.c |  5 ++--
 arch/nios2/mm/fault.c |  4 ++-
 arch/openrisc/mm/fault.c  |  5 ++--
 arch/parisc/mm/fault.c|  5 ++--
 arch/powerpc/mm/copro_fault.c |  4 ++-
 arch/powerpc/mm/fault.c   |  4 ++-
 arch/riscv/mm/fault.c |  2 ++
 arch/s390/mm/fault.c  |  4 ++-
 arch/sh/mm/fault.c|  4 ++-
 arch/sparc/mm/fault_32.c  |  6 -
 arch/sparc/mm/fault_64.c  |  2 ++
 arch/um/kernel/trap.c |  4 ++-
 arch/unicore32/mm/fault.c | 17 +++-
 arch/x86/mm/fault.c   |  4 ++-
 arch/xtensa/mm/fault.c|  4 ++-
 drivers/iommu/amd_iommu_v2.c  |  4 ++-
 drivers/iommu/intel-svm.c |  6 +++--
 include/linux/mm.h| 16 +---
 mm/gup.c  |  8 --
 mm/hmm.c  |  4 ++-
 mm/ksm.c  | 10 ---
 mm/memory.c   | 61 +--
 31 files changed, 157 insertions(+), 89 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index d73dc473fbb9..3c98dfef03a9 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -84,6 +84,7 @@ asmlinkage void
 do_page_fault(unsigned long address, unsigned long mmcsr,
  long cause, struct pt_regs *regs)
 {
+   struct vm_fault vmf = {};
struct vm_area_struct * vma;
struct mm_struct *mm = current->mm;
const struct exception_table_entry *fixup;
@@ -148,7 +149,8 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
/* If for any reason at all we couldn't handle the fault,
   make sure we exit gracefully rather than endlessly redo
   the fault.  */
-   fault = handle_mm_fault(vma, address, flags);
+   vm_fault_init(&vmf, vma, flags, address);
+   fault = handle_mm_fault(&vmf);
 
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index db6913094be3..7aeb81ff5070 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -63,6 +63,7 @@ noinline static int handle_kernel_vaddr_fault(unsigned long 
address)
 
 void do_page_fault(unsigned long address, struct pt_regs *regs)
 {
+   struct vm_fault vmf = {};
struct vm_area_struct *vma = NULL;
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->mm;
@@ -141,6 +142,7 @@ void do_page_fault(unsigned long address, struct pt_regs 
*regs)
 * make sure we exit gracefully rather than endlessly redo
 * the fault.
 */
+   vm_fault_init(&vmf, vma, address, flags);
fault = handle_mm_fault(vma, address, flags);
 
/* If Pagefault was interrupted by SIGKILL, exit page fault "early" */
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 3232afb6fdc0..885a24385a0a 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -225,17 +225,17 @@ static inline bool access_error(unsigned int fsr, struct 
vm_area_struct *vma)
 }
 
 static vm_fault_t __kprobes
-__do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
-   unsigned int flags, struct task_struct *tsk)
+__do_page_fault(struct mm_struct *mm, struct vm_fault *vmf, unsigned int fsr,
+   struct task_struct *tsk)
 {
struct vm_area_struct *vma;
vm_fault_t fault;
 
-   vma = find_vma(mm, addr);
+   vma = find_vma(mm, vmf->address);
fault = VM_FAULT_BADMAP;
if (unlikely(!vma))
goto out;
-   if (unlikely(vma->vm_start > addr))
+   if (unlikely(vma->vm_start > vmf->address))
goto check_stack;
 
/*
@@ -248,12 +248,14 @@ __do_page_fault(struct mm_struct *mm, unsigned long addr, 
unsigned int fsr,
goto out;
}
 
-   return handle_mm_fault(vma, addr & PAGE_MASK, flags);
+   vmf->vma = vma;
+   return handle_mm_fault(vmf);
 
 check_stack:
/* Don't allow expansion below FIRST_USER_ADDRESS */
if (vma->vm_flags & VM_GROWSDOWN &&
-   addr >= FIRST_USER_ADDRESS && !expand_stack(vma, addr))
+   vmf->address >= FIRST_USER_ADDRESS &&
+   !expand_stack(vma, vmf->address))
goto good_area;
 out:
return fault;
@@ -262,6 +264,7 @@ __do_page_fault(struct mm_struct *mm, unsigned long 

[RFC][PATCH 0/8] drop the mmap_sem when doing IO in the fault path

2018-09-25 Thread Josef Bacik
Now that we have proper isolation in place with cgroups2 we have started going
through and fixing the various priority inversions.  Most are gone now, but
this one is sort of weird since it's not necessarily a priority inversion that
happens within the kernel, but rather one caused by something userspace does.

We have giant applications that we want to protect, and parts of these giant
applications do things like watch the system state to determine how healthy the
box is for load balancing and such.  This involves running 'ps' or other such
utilities.  These utilities will often walk /proc/<pid>/whatever, and these
files can sometimes need to down_read(&task->mmap_sem).  Not usually a big deal,
but we noticed while stress testing that sometimes our protected
application has latency spikes trying to get the mmap_sem for tasks that are in
lower priority cgroups.

This is because any down_write() on a semaphore essentially turns it into a
mutex, so even if we currently have it held for reading, any new readers will
not be allowed in, to keep from starving the writer.  This is fine, except a
lower priority task could be stuck doing IO because it has been throttled to the
point that its IO is taking much longer than normal.  But because a higher
priority group depends on this completing it is now stuck behind lower priority
work.

In order to avoid this particular priority inversion we want to use the existing
retry mechanism to avoid holding the mmap_sem at all if we are going to do
IO.  This already exists, sort of, in the read case, but it needed to be
extended for more than just grabbing the page lock.  With io.latency we throttle at
submit_bio() time, so the readahead stuff can block and even page_cache_read can
block, so all these paths need to have the mmap_sem dropped.

The other big thing is ->page_mkwrite.  btrfs is particularly shitty here
because we have to reserve space for the dirty page, which can be a very
expensive operation.  We use the same retry method as the read path, and simply
cache the page and verify the page is still setup properly the next pass through
->page_mkwrite().

I've tested these patches with xfstests and there are no regressions.  Let me
know what you think.  Thanks,

Josef


[PATCH 7/8] mm: add a flag to indicate we used a cached page

2018-09-25 Thread Josef Bacik
This is preparation for dropping the mmap_sem in page_mkwrite.  We need
to know if we used our cached page so we can be sure it is the page we
already did the page_mkwrite work on, so we don't have to redo all of
it.

Signed-off-by: Josef Bacik 
---
 include/linux/mm.h | 6 +-
 mm/filemap.c   | 5 -
 2 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 724514be03b2..10a0118f5485 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -318,6 +318,9 @@ extern pgprot_t protection_map[16];
 #define FAULT_FLAG_USER0x40/* The fault originated in 
userspace */
 #define FAULT_FLAG_REMOTE  0x80/* faulting for non current tsk/mm */
 #define FAULT_FLAG_INSTRUCTION  0x100  /* The fault was during an instruction 
fetch */
+#define FAULT_FLAG_USED_CACHED 0x200   /* Our vmf->page was from a previous
+* loop through the fault handler.
+*/
 
 #define FAULT_FLAG_TRACE \
{ FAULT_FLAG_WRITE, "WRITE" }, \
@@ -328,7 +331,8 @@ extern pgprot_t protection_map[16];
{ FAULT_FLAG_TRIED, "TRIED" }, \
{ FAULT_FLAG_USER,  "USER" }, \
{ FAULT_FLAG_REMOTE,"REMOTE" }, \
-   { FAULT_FLAG_INSTRUCTION,   "INSTRUCTION" }
+   { FAULT_FLAG_INSTRUCTION,   "INSTRUCTION" }, \
+   { FAULT_FLAG_USED_CACHED,   "USED_CACHED" }
 
 /*
  * vm_fault is filled by the pagefault handler and passed to the vma's
diff --git a/mm/filemap.c b/mm/filemap.c
index 49b35293fa95..75a8b252814a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2556,6 +2556,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
if (cached_page->mapping == mapping &&
cached_page->index == offset) {
page = cached_page;
+   vmf->flags |= FAULT_FLAG_USED_CACHED;
goto have_cached_page;
}
unlock_page(cached_page);
@@ -2618,8 +2619,10 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
 * We have a locked page in the page cache, now we need to check
 * that it's up-to-date. If not, it is going to be due to an error.
 */
-   if (unlikely(!PageUptodate(page)))
+   if (unlikely(!PageUptodate(page))) {
+   vmf->flags &= ~(FAULT_FLAG_USED_CACHED);
goto page_not_uptodate;
+   }
 
/*
 * Found the page and have a reference on it.
-- 
2.14.3



[PATCH 6/8] mm: keep the page we read for the next loop

2018-09-25 Thread Josef Bacik
If we drop the mmap_sem we need to redo the vma lookup and then
re-lookup the page.  This is kind of a waste since we've already done
the work, and we could even possibly evict the page, causing a refault.
Instead just hold a reference to the page and save it in our vm_fault.
The next time we go through filemap_fault we'll grab our page, verify
that it's the one we want and carry on.

Signed-off-by: Josef Bacik 
---
 arch/alpha/mm/fault.c |  7 +--
 arch/arc/mm/fault.c   |  6 +-
 arch/arm/mm/fault.c   |  2 ++
 arch/arm64/mm/fault.c |  2 ++
 arch/hexagon/mm/vm_fault.c|  6 +-
 arch/ia64/mm/fault.c  |  6 +-
 arch/m68k/mm/fault.c  |  6 +-
 arch/microblaze/mm/fault.c|  6 +-
 arch/mips/mm/fault.c  |  6 +-
 arch/nds32/mm/fault.c |  3 +++
 arch/nios2/mm/fault.c |  6 +-
 arch/openrisc/mm/fault.c  |  6 +-
 arch/parisc/mm/fault.c|  6 +-
 arch/powerpc/mm/copro_fault.c |  3 ++-
 arch/powerpc/mm/fault.c   |  3 +++
 arch/riscv/mm/fault.c |  6 +-
 arch/s390/mm/fault.c  |  1 +
 arch/sh/mm/fault.c|  8 ++--
 arch/sparc/mm/fault_32.c  |  8 +++-
 arch/sparc/mm/fault_64.c  |  6 +-
 arch/um/kernel/trap.c |  6 +-
 arch/unicore32/mm/fault.c |  5 -
 arch/x86/mm/fault.c   |  2 ++
 arch/xtensa/mm/fault.c|  6 +-
 drivers/iommu/amd_iommu_v2.c  |  1 +
 drivers/iommu/intel-svm.c |  1 +
 include/linux/mm.h| 14 ++
 mm/filemap.c  | 31 ---
 mm/gup.c  |  3 +++
 mm/hmm.c  |  1 +
 mm/ksm.c  |  1 +
 31 files changed, 151 insertions(+), 23 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 3c98dfef03a9..ed5929787d4a 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -152,10 +152,13 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
	vm_fault_init(&vmf, vma, flags, address);
fault = handle_mm_fault(&vmf);
 
-   if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+   if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
+   vm_fault_cleanup(&vmf);
return;
+   }
 
if (unlikely(fault & VM_FAULT_ERROR)) {
+   vm_fault_cleanup(&vmf);
if (fault & VM_FAULT_OOM)
goto out_of_memory;
else if (fault & VM_FAULT_SIGSEGV)
@@ -181,7 +184,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
goto retry;
}
}
-
+   vm_fault_cleanup(&vmf);
up_read(&mm->mmap_sem);
 
return;
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index 7aeb81ff5070..38a6c5e94fac 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -149,8 +149,10 @@ void do_page_fault(unsigned long address, struct pt_regs 
*regs)
if (unlikely(fatal_signal_pending(current))) {
if ((fault & VM_FAULT_ERROR) && !(fault & VM_FAULT_RETRY))
up_read(&mm->mmap_sem);
-   if (user_mode(regs))
+   if (user_mode(regs)) {
+   vm_fault_cleanup(&vmf);
return;
+   }
}
 
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
@@ -176,10 +178,12 @@ void do_page_fault(unsigned long address, struct pt_regs 
*regs)
}
 
/* Fault Handled Gracefully */
+   vm_fault_cleanup(&vmf);
up_read(&mm->mmap_sem);
return;
}
 
+   vm_fault_cleanup(&vmf);
if (fault & VM_FAULT_OOM)
goto out_of_memory;
else if (fault & VM_FAULT_SIGSEGV)
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 885a24385a0a..f08946e78bd9 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -325,6 +325,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct 
pt_regs *regs)
 * it would already be released in __lock_page_or_retry in
 * mm/filemap.c. */
if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)) {
+   vm_fault_cleanup(&vmf);
if (!user_mode(regs))
goto no_context;
return 0;
@@ -356,6 +357,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct 
pt_regs *regs)
}
}
 
+   vm_fault_cleanup(&vmf);
up_read(&mm->mmap_sem);
 
/*
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 31e86a74cbe0..6f3e908a3820 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -506,6 +506,7 @@ static int __kprobes do_page_fault(unsigned long addr, 
unsigned int esr,
 * in __lock_page_or_retry in mm/filemap.c.
 */
if (fatal_signal_pendin

[PATCH 8/8] btrfs: drop mmap_sem in mkwrite for btrfs

2018-09-25 Thread Josef Bacik
->page_mkwrite is extremely expensive in btrfs.  We have to reserve
space, which can take 6 lifetimes, and we could possibly have to wait on
writeback on the page, another several lifetimes.  To avoid this, simply
drop the mmap_sem if we didn't have the cached page, do all of our
work, and return the appropriate retry error.  If we have the cached page
we know we did all the right things to set this page up and we can just
carry on.

Signed-off-by: Josef Bacik 
---
 fs/btrfs/inode.c   | 40 ++--
 include/linux/mm.h | 14 ++
 mm/filemap.c   |  3 ++-
 3 files changed, 54 insertions(+), 3 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 3ea5339603cf..34c33b96d335 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -8809,7 +8809,9 @@ static void btrfs_invalidatepage(struct page *page, 
unsigned int offset,
 vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 {
struct page *page = vmf->page;
-   struct inode *inode = file_inode(vmf->vma->vm_file);
+   struct file *file = vmf->vma->vm_file, *fpin;
+   struct mm_struct *mm = vmf->vma->vm_mm;
+   struct inode *inode = file_inode(file);
struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb);
struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree;
struct btrfs_ordered_extent *ordered;
@@ -8828,6 +8830,29 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 
reserved_space = PAGE_SIZE;
 
+   /*
+* We have our cached page from a previous mkwrite, check it to make
+* sure it's still dirty and our file size matches when we ran mkwrite
+* the last time.  If everything is OK then return VM_FAULT_LOCKED,
+* otherwise do the mkwrite again.
+*/
+   if (vmf->flags & FAULT_FLAG_USED_CACHED) {
+   lock_page(page);
+   if (vmf->cached_size == i_size_read(inode) &&
+   PageDirty(page))
+   return VM_FAULT_LOCKED;
+   unlock_page(page);
+   }
+
+   /*
+* mkwrite is extremely expensive, and we are holding the mmap_sem
+* during this, which means we can starve out anybody trying to
+* down_write(mmap_sem) for a long while, especially if we throw cgroups
+* into the mix.  So just drop the mmap_sem and do all of our work,
+* we'll loop back through and verify everything is ok the next time and
+* hopefully avoid doing the work twice.
+*/
+   fpin = maybe_unlock_mmap_for_io(vmf->vma, vmf->flags);
sb_start_pagefault(inode->i_sb);
page_start = page_offset(page);
page_end = page_start + PAGE_SIZE - 1;
@@ -8844,7 +8869,7 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
ret2 = btrfs_delalloc_reserve_space(inode, &data_reserved, page_start,
   reserved_space);
if (!ret2) {
-   ret2 = file_update_time(vmf->vma->vm_file);
+   ret2 = file_update_time(file);
reserved = 1;
}
if (ret2) {
@@ -8943,6 +8968,13 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
btrfs_delalloc_release_extents(BTRFS_I(inode), PAGE_SIZE, true);
sb_end_pagefault(inode->i_sb);
extent_changeset_free(data_reserved);
+   if (fpin) {
+   unlock_page(page);
+   fput(fpin);
+   vmf->cached_size = size;
+   down_read(&mm->mmap_sem);
+   return VM_FAULT_RETRY;
+   }
return VM_FAULT_LOCKED;
}
 
@@ -8955,6 +8987,10 @@ vm_fault_t btrfs_page_mkwrite(struct vm_fault *vmf)
 out_noreserve:
sb_end_pagefault(inode->i_sb);
extent_changeset_free(data_reserved);
+   if (fpin) {
+   fput(fpin);
+   down_read(&mm->mmap_sem);
+   }
return ret;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 10a0118f5485..b9ad6cb3de84 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -370,6 +370,13 @@ struct vm_fault {
 * next time we loop through the fault
 * handler for faster lookup.
 */
+   loff_t cached_size; /* ->page_mkwrite handlers may drop
+* the mmap_sem to avoid starvation, in
+* which case they need to save the
+* i_size in order to verify the cached
+* page we're using the next loop
+* through hasn't changed under us.
+*/
/* These three entries are valid only while holding ptl lock */
pte_t *pte; /* Pointer to 

[PATCH 4/8] mm: drop mmap_sem for swap read IO submission

2018-09-25 Thread Josef Bacik
From: Johannes Weiner 

We don't need to hold the mmap_sem while we're doing the IO; simply drop
it and retry appropriately.

Signed-off-by: Johannes Weiner 
Signed-off-by: Josef Bacik 
---
 mm/page_io.c | 14 ++
 1 file changed, 14 insertions(+)

diff --git a/mm/page_io.c b/mm/page_io.c
index aafd19ec1db4..bf21b56a964e 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -365,6 +365,20 @@ int swap_readpage(struct page *page, bool synchronous)
goto out;
}
 
+   /*
+* XXX:
+*
+* Propagate mm->mmap_sem into this function. Then:
+*
+* get_file(sis->swap_file)
+* up_read(mm->mmap_sem)
+* submit io request
+* fput
+*
+* After mmap_sem is dropped, sis is no longer valid. Go
+* through swap_file->blah->bdev.
+*/
+
if (sis->flags & SWP_FILE) {
struct file *swap_file = sis->swap_file;
struct address_space *mapping = swap_file->f_mapping;
-- 
2.14.3



[PATCH 5/8] mm: drop the mmap_sem in all read fault cases

2018-09-25 Thread Josef Bacik
Johannes' patches didn't quite cover all of the IO cases that we need to
drop the mmap_sem for; this patch covers the rest of them.

Signed-off-by: Josef Bacik 
---
 mm/filemap.c | 11 +++
 1 file changed, 11 insertions(+)

diff --git a/mm/filemap.c b/mm/filemap.c
index 1ed35cd99b2c..65395ee132a0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2523,6 +2523,7 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
int error;
struct mm_struct *mm = vmf->vma->vm_mm;
struct file *file = vmf->vma->vm_file;
+   struct file *fpin = NULL;
struct address_space *mapping = file->f_mapping;
struct file_ra_state *ra = &file->f_ra;
struct inode *inode = mapping->host;
@@ -2610,11 +2611,15 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
return ret | VM_FAULT_LOCKED;
 
 no_cached_page:
+   fpin = maybe_unlock_mmap_for_io(vmf->vma, vmf->flags);
+
/*
 * We're only likely to ever get here if MADV_RANDOM is in
 * effect.
 */
error = page_cache_read(file, offset, vmf->gfp_mask);
+   if (fpin)
+   goto out_retry;
 
/*
 * The page we want has now been added to the page cache.
@@ -2634,6 +2639,8 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
return VM_FAULT_SIGBUS;
 
 page_not_uptodate:
+   fpin = maybe_unlock_mmap_for_io(vmf->vma, vmf->flags);
+
/*
 * Umm, take care of errors if the page isn't up-to-date.
 * Try to re-read it _once_. We do this synchronously,
@@ -2647,6 +2654,8 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
if (!PageUptodate(page))
error = -EIO;
}
+   if (fpin)
+   goto out_retry;
put_page(page);
 
if (!error || error == AOP_TRUNCATED_PAGE)
@@ -2665,6 +2674,8 @@ vm_fault_t filemap_fault(struct vm_fault *vmf)
}
 
 out_retry:
+   if (fpin)
+   fput(fpin);
if (page)
put_page(page);
return ret | VM_FAULT_RETRY;
-- 
2.14.3



[PATCH 3/8] mm: clean up swapcache lookup and creation function names

2018-09-25 Thread Josef Bacik
From: Johannes Weiner 

__read_swap_cache_async() has a misleading name. All it does is look
up or create a page in swapcache; it doesn't initiate any IO.

The swapcache has many parallels to the page cache, and shares naming
schemes with it elsewhere. Analogous to the cache lookup and creation
API, rename __read_swap_cache_async() to find_or_create_swap_cache() and
lookup_swap_cache() to find_swap_cache().

Signed-off-by: Johannes Weiner 
Signed-off-by: Josef Bacik 
---
 include/linux/swap.h | 14 --
 mm/memory.c  |  2 +-
 mm/shmem.c   |  2 +-
 mm/swap_state.c  | 43 ++-
 mm/zswap.c   |  8 
 5 files changed, 36 insertions(+), 33 deletions(-)

diff --git a/include/linux/swap.h b/include/linux/swap.h
index 8e2c11e692ba..293a84c34448 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -412,15 +412,17 @@ extern void __delete_from_swap_cache(struct page *);
 extern void delete_from_swap_cache(struct page *);
 extern void free_page_and_swap_cache(struct page *);
 extern void free_pages_and_swap_cache(struct page **, int);
-extern struct page *lookup_swap_cache(swp_entry_t entry,
- struct vm_area_struct *vma,
- unsigned long addr);
+extern struct page *find_swap_cache(swp_entry_t entry,
+   struct vm_area_struct *vma,
+   unsigned long addr);
+extern struct page *find_or_create_swap_cache(swp_entry_t entry,
+ gfp_t gfp_mask,
+ struct vm_area_struct *vma,
+ unsigned long addr,
+ bool *created);
 extern struct page *read_swap_cache_async(swp_entry_t, gfp_t,
struct vm_area_struct *vma, unsigned long addr,
bool do_poll);
-extern struct page *__read_swap_cache_async(swp_entry_t, gfp_t,
-   struct vm_area_struct *vma, unsigned long addr,
-   bool *new_page_allocated);
 extern struct page *swap_cluster_readahead(swp_entry_t entry, gfp_t flag,
struct vm_fault *vmf);
 extern struct page *swapin_readahead(swp_entry_t entry, gfp_t flag,
diff --git a/mm/memory.c b/mm/memory.c
index 9152c2a2c9f6..f27295c1c91d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2935,7 +2935,7 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 
 
delayacct_set_flag(DELAYACCT_PF_SWAPIN);
-   page = lookup_swap_cache(entry, vma, vmf->address);
+   page = find_swap_cache(entry, vma, vmf->address);
swapcache = page;
 
if (!page) {
diff --git a/mm/shmem.c b/mm/shmem.c
index 0376c124b043..9854903ae92f 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1679,7 +1679,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 
if (swap.val) {
/* Look it up and read it in.. */
-   page = lookup_swap_cache(swap, NULL, 0);
+   page = find_swap_cache(swap, NULL, 0);
if (!page) {
/* Or update major stats only when swapin succeeds?? */
if (fault_type) {
diff --git a/mm/swap_state.c b/mm/swap_state.c
index ecee9c6c4cc1..bae758e19f7a 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -330,8 +330,8 @@ static inline bool swap_use_vma_readahead(void)
  * lock getting page table operations atomic even if we drop the page
  * lock before returning.
  */
-struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma,
-  unsigned long addr)
+struct page *find_swap_cache(swp_entry_t entry, struct vm_area_struct *vma,
+unsigned long addr)
 {
struct page *page;
 
@@ -374,19 +374,20 @@ struct page *lookup_swap_cache(swp_entry_t entry, struct vm_area_struct *vma,
return page;
 }
 
-struct page *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
+struct page *find_or_create_swap_cache(swp_entry_t entry, gfp_t gfp_mask,
struct vm_area_struct *vma, unsigned long addr,
-   bool *new_page_allocated)
+   bool *created)
 {
struct page *found_page, *new_page = NULL;
struct address_space *swapper_space = swap_address_space(entry);
int err;
-   *new_page_allocated = false;
+
+   *created = false;
 
do {
/*
 * First check the swap cache.  Since this is normally
-* called after lookup_swap_cache() failed, re-calling
+* called after find_swap_cache() failed, re-calling
 * that would confuse statistics.
 */
found_page = find_get_page(swapper_space, swp_offset(entry));
@@ -449,7 +450,7 @@ struct page *__read_swap_cache_

[PATCH 2/8] mm: drop mmap_sem for page cache read IO submission

2018-09-25 Thread Josef Bacik
From: Johannes Weiner 

Reads can take a long time, and if anybody needs to take a write lock on
the mmap_sem it'll block any subsequent readers to the mmap_sem while
the read is outstanding, which could cause long delays.  Instead drop
the mmap_sem if we do any reads at all.

Signed-off-by: Johannes Weiner 
Signed-off-by: Josef Bacik 
---
 mm/filemap.c | 119 ---
 1 file changed, 90 insertions(+), 29 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 52517f28e6f4..1ed35cd99b2c 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2366,6 +2366,18 @@ generic_file_read_iter(struct kiocb *iocb, struct iov_iter *iter)
 EXPORT_SYMBOL(generic_file_read_iter);
 
 #ifdef CONFIG_MMU
+static struct file *maybe_unlock_mmap_for_io(struct vm_area_struct *vma, int flags)
+{
+   if ((flags & (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_RETRY_NOWAIT)) == FAULT_FLAG_ALLOW_RETRY) {
+   struct file *file;
+
+   file = get_file(vma->vm_file);
+   up_read(&vma->vm_mm->mmap_sem);
+   return file;
+   }
+   return NULL;
+}
+
 /**
  * page_cache_read - adds requested page to the page cache if not already there
  * @file:  file to read
@@ -2405,23 +2417,28 @@ static int page_cache_read(struct file *file, pgoff_t offset, gfp_t gfp_mask)
  * Synchronous readahead happens when we don't even find
  * a page in the page cache at all.
  */
-static void do_sync_mmap_readahead(struct vm_area_struct *vma,
-  struct file_ra_state *ra,
-  struct file *file,
-  pgoff_t offset)
+static int do_sync_mmap_readahead(struct vm_area_struct *vma,
+ struct file_ra_state *ra,
+ struct file *file,
+ pgoff_t offset,
+ int flags)
 {
struct address_space *mapping = file->f_mapping;
+   struct file *fpin;
 
/* If we don't want any read-ahead, don't bother */
if (vma->vm_flags & VM_RAND_READ)
-   return;
+   return 0;
if (!ra->ra_pages)
-   return;
+   return 0;
 
if (vma->vm_flags & VM_SEQ_READ) {
+   fpin = maybe_unlock_mmap_for_io(vma, flags);
page_cache_sync_readahead(mapping, ra, file, offset,
  ra->ra_pages);
-   return;
+   if (fpin)
+   fput(fpin);
+   return fpin ? -EAGAIN : 0;
}
 
/* Avoid banging the cache line if not needed */
@@ -2433,7 +2450,9 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
 * stop bothering with read-ahead. It will only hurt.
 */
if (ra->mmap_miss > MMAP_LOTSAMISS)
-   return;
+   return 0;
+
+   fpin = maybe_unlock_mmap_for_io(vma, flags);
 
/*
 * mmap read-around
@@ -2442,28 +2461,40 @@ static void do_sync_mmap_readahead(struct vm_area_struct *vma,
ra->size = ra->ra_pages;
ra->async_size = ra->ra_pages / 4;
ra_submit(ra, mapping, file);
+
+   if (fpin)
+   fput(fpin);
+
+   return fpin ? -EAGAIN : 0;
 }
 
 /*
  * Asynchronous readahead happens when we find the page and PG_readahead,
  * so we want to possibly extend the readahead further..
  */
-static void do_async_mmap_readahead(struct vm_area_struct *vma,
-   struct file_ra_state *ra,
-   struct file *file,
-   struct page *page,
-   pgoff_t offset)
+static int do_async_mmap_readahead(struct vm_area_struct *vma,
+  struct file_ra_state *ra,
+  struct file *file,
+  struct page *page,
+  pgoff_t offset,
+  int flags)
 {
struct address_space *mapping = file->f_mapping;
+   struct file *fpin;
 
/* If we don't want any read-ahead, don't bother */
if (vma->vm_flags & VM_RAND_READ)
-   return;
+   return 0;
if (ra->mmap_miss > 0)
ra->mmap_miss--;
-   if (PageReadahead(page))
-   page_cache_async_readahead(mapping, ra, file,
-  page, offset, ra->ra_pages);
+   if (!PageReadahead(page))
+   return 0;
+   fpin = maybe_unlock_mmap_for_io(vma, flags);
+   page_cache_async_readahead(mapping, ra, file,
+  page, offset, ra->ra_pages);
+   if (fpin)
+   fput(fpin);
+   return fpin ? -EAGAIN : 0;
 }
 
 /**
@@ -2479,10 +2510,8 @@ static void do_async_mmap_readahead(struct vm_area_struct *vma,
  *
  * vma->v

Re: [PATCH 1/9] btrfs: Add support for reading a filesystem with a RAID 5 or RAID 6 profile.

2018-09-25 Thread Daniel Kiper
On Wed, Sep 19, 2018 at 08:40:32PM +0200, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli 
>
> Signed-off-by: Goffredo Baroncelli 
> ---
>  grub-core/fs/btrfs.c | 66 
>  1 file changed, 66 insertions(+)
>
> diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
> index be195448d..56c42746d 100644
> --- a/grub-core/fs/btrfs.c
> +++ b/grub-core/fs/btrfs.c
> @@ -119,6 +119,8 @@ struct grub_btrfs_chunk_item
>  #define GRUB_BTRFS_CHUNK_TYPE_RAID1 0x10
>  #define GRUB_BTRFS_CHUNK_TYPE_DUPLICATED0x20
>  #define GRUB_BTRFS_CHUNK_TYPE_RAID100x40
> +#define GRUB_BTRFS_CHUNK_TYPE_RAID5 0x80
> +#define GRUB_BTRFS_CHUNK_TYPE_RAID6 0x100
>grub_uint8_t dummy2[0xc];
>grub_uint16_t nstripes;
>grub_uint16_t nsubstripes;
> @@ -764,6 +766,70 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, grub_disk_addr_t addr,
> stripe_offset = low + chunk_stripe_length
>   * high;
> csize = chunk_stripe_length - low;
> +   break;
> + }
> +   case GRUB_BTRFS_CHUNK_TYPE_RAID5:
> +   case GRUB_BTRFS_CHUNK_TYPE_RAID6:
> + {
> +   grub_uint64_t nparities, block_nr, high, low;
> +
> +   redundancy = 1;   /* no redundancy for now */
> +
> +   if (grub_le_to_cpu64 (chunk->type) & GRUB_BTRFS_CHUNK_TYPE_RAID5)
> + {
> +   grub_dprintf ("btrfs", "RAID5\n");
> +   nparities = 1;
> + }
> +   else
> + {
> +   grub_dprintf ("btrfs", "RAID6\n");
> +   nparities = 2;
> + }
> +
> +   /*
> +* A RAID 6 layout consists of several blocks spread on the disks.
> +* The raid terminology is used to call all the blocks of a row
> +* "stripe". Unfortunately the BTRFS terminology confuses block

A stripe is the data set or the parity (parity stripe) on one disk; a block has
a different meaning. Please stick to the btrfs terminology and say it clearly
in the comment. And even add a link to the btrfs wiki page to ease reading.

I think about this one:
  
https://btrfs.wiki.kernel.org/index.php/Manpage/mkfs.btrfs#BLOCK_GROUPS.2C_CHUNKS.2C_RAID

> +* and stripe.

I do not think so. Or at least not so much...

> +*
> +*   Disk0  Disk1  Disk2  Disk3
> +*
> +*A1 B1 P1 Q1
> +*Q2 A2 B2 P2
> +*P3 Q3 A3 B3
> +*  [...]
> +*
> +*  Note that the placement of the parities depends on row index.
> +*  In the code below:
> +*  - block_nr is the block number without the parities

Well, it seems to me that the btrfs code introduces the confusion, not the
spec itself. I would leave the code as is but s/block number/stripe number/.

> +*(A1 = 0, B1 = 1, A2 = 2, B2 = 3, ...),
> +*  - high is the row number (0 for A1...Q1, 1 for Q2...P2, ...),
> +*  - stripen is the disk number (0 for A1,Q2,P3, 1 for B1...),

s/disk number/disk number in a row/

> +*  - off is the logical address to read
> +*  - chunk_stripe_length is the size of a block (typically 64k),

s/a block/a stripe/

> +*  - nstripes is the number of disks,

s/number of disks/number of disks in a row/

I am missing a description of nparities here...

> +*  - low is the offset of the data inside a stripe,
> +*  - stripe_offset is the disk offset,

s/the disk offset/the data offset in an array/?

> +*  - csize is the "potential" data to read. It will be reduced to
> +*size if the latter is smaller.
> +*/
> +   block_nr = grub_divmod64 (off, chunk_stripe_length, &low);
> +
> +   /*
> +* stripen is computed without the parities (0 for A1, A2, A3...
> +* 1 for B1, B2...).
> +*/
> +   high = grub_divmod64 (block_nr, nstripes - nparities, &stripen);

This is clear...

> +   /*
> +* stripen is recomputed considering the parities (0 for A1, 1 for
> +* A2, 2 for A3).
> +*/
> +   grub_divmod64 (high + stripen, nstripes, &stripen);

... but this looks strange... You add the disk number to the row number. Hmmm...
It looks like it works, but this is not obvious at first sight. Could you
explain that?

> +   stripe_offset = low + chunk_stripe_length * high;
> +   csize = chunk_stripe_length - low;
> +
> break;
>   }
> default:

Daniel


Re: [PATCH 3/9] btrfs: Move the error logging from find_device() to its caller.

2018-09-25 Thread Daniel Kiper
On Wed, Sep 19, 2018 at 08:40:34PM +0200, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli 
>
> The caller knows better whether this error is fatal or not, i.e. whether
> another disk is available or not.
>
> This is a preparatory patch.
>
> Signed-off-by: Goffredo Baroncelli 
> Reviewed-by: Daniel Kiper 
> ---
>  grub-core/fs/btrfs.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
> index 4f404f4b2..0cdfaf7c0 100644
> --- a/grub-core/fs/btrfs.c
> +++ b/grub-core/fs/btrfs.c
> @@ -604,9 +604,6 @@ find_device (struct grub_btrfs_data *data, grub_uint64_t id, int do_rescan)
>  grub_device_iterate (find_device_iter, &ctx);
>if (!ctx.dev_found)
>  {
> -  grub_error (GRUB_ERR_BAD_FS,
> -   N_("couldn't find a necessary member device "
> -  "of multi-device filesystem"));
>return NULL;
>  }

I think that you can drop the curly brackets too.
If you do that, you can retain my Reviewed-by.

Daniel


Re: [PATCH 4/9] btrfs: Avoid a rescan for a device which was already not found.

2018-09-25 Thread Daniel Kiper
On Wed, Sep 19, 2018 at 08:40:35PM +0200, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli 
>
> If a device is not found, do not return immediately but
> record this failure by storing NULL in data->devices_attached[].

Still the same question: where does the store happen in the code?
I cannot find it in the patch below. This has to be clarified.

Daniel

> This way we avoid an unnecessary device rescan and speed up the
> reads in the case of a degraded array.
>
> Signed-off-by: Goffredo Baroncelli 
> ---
>  grub-core/fs/btrfs.c | 19 +--
>  1 file changed, 9 insertions(+), 10 deletions(-)
>
> diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
> index 0cdfaf7c0..6e42c33f6 100644
> --- a/grub-core/fs/btrfs.c
> +++ b/grub-core/fs/btrfs.c
> @@ -588,7 +588,7 @@ find_device_iter (const char *name, void *data)
>  }
>
>  static grub_device_t
> -find_device (struct grub_btrfs_data *data, grub_uint64_t id, int do_rescan)
> +find_device (struct grub_btrfs_data *data, grub_uint64_t id)
>  {
>struct find_device_ctx ctx = {
>  .data = data,
> @@ -600,12 +600,9 @@ find_device (struct grub_btrfs_data *data, grub_uint64_t id, int do_rescan)
>for (i = 0; i < data->n_devices_attached; i++)
>  if (id == data->devices_attached[i].id)
>return data->devices_attached[i].dev;
> -  if (do_rescan)
> -grub_device_iterate (find_device_iter, &ctx);
> -  if (!ctx.dev_found)
> -{
> -  return NULL;
> -}
> +
> +  grub_device_iterate (find_device_iter, &ctx);
> +
>data->n_devices_attached++;
>if (data->n_devices_attached > data->n_devices_allocated)
>  {
> @@ -617,7 +614,8 @@ find_device (struct grub_btrfs_data *data, grub_uint64_t id, int do_rescan)
>   * sizeof (data->devices_attached[0]));
>if (!data->devices_attached)
>   {
> -   grub_device_close (ctx.dev_found);
> +   if (ctx.dev_found)
> + grub_device_close (ctx.dev_found);
> data->devices_attached = tmp;
> return NULL;
>   }
> @@ -892,7 +890,7 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, grub_disk_addr_t addr,
> " for laddr 0x%" PRIxGRUB_UINT64_T "\n", paddr,
> addr);
>
> - dev = find_device (data, stripe->device_id, j);
> + dev = find_device (data, stripe->device_id);
>   if (!dev)
> {
>   grub_dprintf ("btrfs",
> @@ -969,7 +967,8 @@ grub_btrfs_unmount (struct grub_btrfs_data *data)
>unsigned i;
>/* The device 0 is closed one layer upper.  */
>for (i = 1; i < data->n_devices_attached; i++)
> -grub_device_close (data->devices_attached[i].dev);
> +if (data->devices_attached[i].dev)
> +grub_device_close (data->devices_attached[i].dev);
>grub_free (data->devices_attached);
>grub_free (data->extent);
>grub_free (data);
> --
> 2.19.0
>


Re: [PATCH 7/9] btrfs: Add support for recovery for a RAID 5 btrfs profiles.

2018-09-25 Thread Daniel Kiper
On Wed, Sep 19, 2018 at 08:40:38PM +0200, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli 
>
> Add support for recovery for a RAID 5 btrfs profile. In addition,
> some code is added as preparatory work for the RAID 6 recovery code.
>
> Signed-off-by: Goffredo Baroncelli 
> ---
>  grub-core/fs/btrfs.c | 169 +--
>  1 file changed, 164 insertions(+), 5 deletions(-)
>
> diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
> index 5c1ebae77..55a7eeffc 100644
> --- a/grub-core/fs/btrfs.c
> +++ b/grub-core/fs/btrfs.c
> @@ -29,6 +29,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  GRUB_MOD_LICENSE ("GPLv3+");
>
> @@ -665,6 +666,148 @@ btrfs_read_from_chunk (struct grub_btrfs_data *data,
>  return err;
>  }
>
> +struct raid56_buffer {
> +  void *buf;
> +  int  data_is_valid;
> +};
> +
> +static void
> +rebuild_raid5 (char *dest, struct raid56_buffer *buffers,
> +grub_uint64_t nstripes, grub_uint64_t csize)
> +{
> +  grub_uint64_t i;
> +  int first;
> +
> +  i = 0;
> +  while (buffers[i].data_is_valid && i < nstripes)
> +++i;

for (i = 0; buffers[i].data_is_valid && i < nstripes; i++);

> +  if (i == nstripes)
> +{
>   grub_dprintf ("btrfs", "called rebuild_raid5(), but all disks are OK\n");
> +  return;
> +}
> +
> +  grub_dprintf ("btrfs", "rebuilding RAID 5 stripe #%" PRIuGRUB_UINT64_T 
> "\n",
> + i);

One line here please.

> +  for (i = 0, first = 1; i < nstripes; i++)
> +{
> +  if (!buffers[i].data_is_valid)
> + continue;
> +
> +  if (first) {
> + grub_memcpy(dest, buffers[i].buf, csize);
> + first = 0;
> +  } else
> + grub_crypto_xor (dest, dest, buffers[i].buf, csize);
> +
> +}

Hmmm... I think that this function can be simpler. You can drop the first
while/for and "if (i == nstripes)". Then here:

if (first) {
  grub_dprintf ("btrfs", "called rebuild_raid5(), but all disks are OK\n");

Am I right?

> +}
> +
> +static grub_err_t
> +raid56_read_retry (struct grub_btrfs_data *data,
> +struct grub_btrfs_chunk_item *chunk,
> +grub_uint64_t stripe_offset,
> +grub_uint64_t csize, void *buf)
> +{
> +  struct raid56_buffer *buffers;
> +  grub_uint64_t nstripes = grub_le_to_cpu16 (chunk->nstripes);
> +  grub_uint64_t chunk_type = grub_le_to_cpu64 (chunk->type);
> +  grub_err_t ret = GRUB_ERR_NONE;

s/GRUB_ERR_NONE/GRUB_ERR_OUT_OF_MEMORY/ and then you can drop at
least two relevant assignments and some curly brackets. Of course,
before the cleanup label you have to add ret = GRUB_ERR_NONE.

> +  grub_uint64_t i, failed_devices;
> +
> +  buffers = grub_zalloc (sizeof(*buffers) * nstripes);
> +  if (!buffers)
> +{
> +  ret = GRUB_ERR_OUT_OF_MEMORY;
> +  goto cleanup;
> +}
> +
> +  for (i = 0; i < nstripes; i++)
> +{
> +  buffers[i].buf = grub_zalloc (csize);
> +  if (!buffers[i].buf)
> + {
> +   ret = GRUB_ERR_OUT_OF_MEMORY;
> +   goto cleanup;
> + }
> +}
> +
> +  for (failed_devices = 0, i = 0; i < nstripes; i++)
> +{
> +  struct grub_btrfs_chunk_stripe *stripe;
> +  grub_disk_addr_t paddr;
> +  grub_device_t dev;
> +  grub_err_t err2;

s/err2/err/?

> +
> +  stripe = (struct grub_btrfs_chunk_stripe *) (chunk + 1);
> +  stripe += i;

Why not stripe = ((struct grub_btrfs_chunk_stripe *) (chunk + 1)) + i;?

Daniel


Re: [PATCH 9/9] btrfs: Add RAID 6 recovery for a btrfs filesystem.

2018-09-25 Thread Daniel Kiper
On Wed, Sep 19, 2018 at 08:40:40PM +0200, Goffredo Baroncelli wrote:
> From: Goffredo Baroncelli 
>
> Add the RAID 6 recovery, in order to use a RAID 6 filesystem even if some
> disks (up to two) are missing. This code uses the md RAID 6 code already
> present in GRUB.
>
> Signed-off-by: Goffredo Baroncelli 
> ---
>  grub-core/fs/btrfs.c | 54 +++-
>  1 file changed, 48 insertions(+), 6 deletions(-)
>
> diff --git a/grub-core/fs/btrfs.c b/grub-core/fs/btrfs.c
> index 55a7eeffc..400cd56b6 100644
> --- a/grub-core/fs/btrfs.c
> +++ b/grub-core/fs/btrfs.c
> @@ -30,6 +30,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>
>  GRUB_MOD_LICENSE ("GPLv3+");
>
> @@ -705,11 +706,36 @@ rebuild_raid5 (char *dest, struct raid56_buffer *buffers,
>  }
>  }
>
> +static grub_err_t
> +raid6_recover_read_buffer (void *data, int disk_nr,
> +grub_uint64_t addr __attribute__ ((unused)),
> +void *dest, grub_size_t size)
> +{
> +struct raid56_buffer *buffers = data;
> +
> +if (!buffers[disk_nr].data_is_valid)
> + return grub_errno = GRUB_ERR_READ_ERROR;
> +
> +grub_memcpy(dest, buffers[disk_nr].buf, size);
> +
> +return grub_errno = GRUB_ERR_NONE;
> +}
> +
> +static void
> +rebuild_raid6 (struct raid56_buffer *buffers, grub_uint64_t nstripes,
> +   grub_uint64_t csize, grub_uint64_t parities_pos, void *dest,
> +   grub_uint64_t stripen)
> +
> +{
> +  grub_raid6_recover_gen (buffers, nstripes, stripen, parities_pos,
> +  dest, 0, csize, 0, raid6_recover_read_buffer);
> +}
> +
>  static grub_err_t
>  raid56_read_retry (struct grub_btrfs_data *data,
>  struct grub_btrfs_chunk_item *chunk,
> -grub_uint64_t stripe_offset,
> -grub_uint64_t csize, void *buf)
> +grub_uint64_t stripe_offset, grub_uint64_t stripen,
> +grub_uint64_t csize, void *buf, grub_uint64_t parities_pos)
>  {
>struct raid56_buffer *buffers;
>grub_uint64_t nstripes = grub_le_to_cpu16 (chunk->nstripes);
> @@ -787,6 +813,15 @@ raid56_read_retry (struct grub_btrfs_data *data,
>ret = GRUB_ERR_READ_ERROR;
>goto cleanup;
>  }
> +  else if (failed_devices > 2 && (chunk_type & GRUB_BTRFS_CHUNK_TYPE_RAID6))
> +{
> +  grub_dprintf ("btrfs",
> + "not enough disks for raid6: total %" PRIuGRUB_UINT64_T
> + ", missing %" PRIuGRUB_UINT64_T "\n",
> + nstripes, failed_devices);
> +  ret = GRUB_ERR_READ_ERROR;
> +  goto cleanup;
> +}
>else
>  grub_dprintf ("btrfs",
> "enough disks for RAID 5 rebuilding: total %"
> @@ -797,7 +832,7 @@ raid56_read_retry (struct grub_btrfs_data *data,
>if (chunk_type & GRUB_BTRFS_CHUNK_TYPE_RAID5)
>  rebuild_raid5 (buf, buffers, nstripes, csize);
>else
> -grub_dprintf ("btrfs", "called rebuild_raid6(), NOT IMPLEMENTED\n");
> +rebuild_raid6 (buffers, nstripes, csize, parities_pos, buf, stripen);
>
>   cleanup:
>if (buffers)
> @@ -886,9 +921,11 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, grub_disk_addr_t addr,
>   unsigned redundancy = 1;
>   unsigned i, j;
>   int is_raid56;
> + grub_uint64_t parities_pos = 0;
>
> - is_raid56 = !!(grub_le_to_cpu64 (chunk->type) &
> -GRUB_BTRFS_CHUNK_TYPE_RAID5);
> +is_raid56 = !!(grub_le_to_cpu64 (chunk->type) &
> +(GRUB_BTRFS_CHUNK_TYPE_RAID5 |
> + GRUB_BTRFS_CHUNK_TYPE_RAID6));
>
>   if (grub_le_to_cpu64 (chunk->size) <= off)
> {
> @@ -1015,6 +1052,8 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, grub_disk_addr_t addr,
>  *  - stripe_offset is the disk offset,
>  *  - csize is the "potential" data to read. It will be reduced to
>  *size if the latter is smaller.
> +*  - parities_pos is the position of the parity inside a row (

s/inside/in/

> +*2 for P1, 3 for P2...)
>  */
> block_nr = grub_divmod64 (off, chunk_stripe_length, &low);
>
> @@ -1030,6 +1069,9 @@ grub_btrfs_read_logical (struct grub_btrfs_data *data, grub_disk_addr_t addr,
>  */
> grub_divmod64 (high + stripen, nstripes, &stripen);
>
> +   grub_divmod64 (high + nstripes - nparities, nstripes,
> +  &parities_pos);

I think that this math requires a bit of explanation in the comment
before grub_divmod64(). In particular, I am interested in why high +
nstripes - nparities works as expected.

Daniel


Re: [PATCH 1/8] mm: push vm_fault into the page fault handlers

2018-09-25 Thread Dave Chinner
On Tue, Sep 25, 2018 at 11:30:04AM -0400, Josef Bacik wrote:
> In preparation for caching pages during filemap faults we need to push
> the struct vm_fault up a level into the arch page fault handlers, since
> they are the ones responsible for retrying if we unlock the mmap_sem.
> 
> Signed-off-by: Josef Bacik 
> ---
>  arch/alpha/mm/fault.c |  4 ++-
>  arch/arc/mm/fault.c   |  2 ++
>  arch/arm/mm/fault.c   | 18 -
>  arch/arm64/mm/fault.c | 18 +++--
>  arch/hexagon/mm/vm_fault.c|  4 ++-
>  arch/ia64/mm/fault.c  |  4 ++-
>  arch/m68k/mm/fault.c  |  5 ++--
>  arch/microblaze/mm/fault.c|  4 ++-
>  arch/mips/mm/fault.c  |  4 ++-
>  arch/nds32/mm/fault.c |  5 ++--
>  arch/nios2/mm/fault.c |  4 ++-
>  arch/openrisc/mm/fault.c  |  5 ++--
>  arch/parisc/mm/fault.c|  5 ++--
>  arch/powerpc/mm/copro_fault.c |  4 ++-
>  arch/powerpc/mm/fault.c   |  4 ++-
>  arch/riscv/mm/fault.c |  2 ++
>  arch/s390/mm/fault.c  |  4 ++-
>  arch/sh/mm/fault.c|  4 ++-
>  arch/sparc/mm/fault_32.c  |  6 -
>  arch/sparc/mm/fault_64.c  |  2 ++
>  arch/um/kernel/trap.c |  4 ++-
>  arch/unicore32/mm/fault.c | 17 +++-
>  arch/x86/mm/fault.c   |  4 ++-
>  arch/xtensa/mm/fault.c|  4 ++-
>  drivers/iommu/amd_iommu_v2.c  |  4 ++-
>  drivers/iommu/intel-svm.c |  6 +++--
>  include/linux/mm.h| 16 +---
>  mm/gup.c  |  8 --
>  mm/hmm.c  |  4 ++-
>  mm/ksm.c  | 10 ---
>  mm/memory.c   | 61 
> +--
>  31 files changed, 157 insertions(+), 89 deletions(-)
> 
> diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
> index d73dc473fbb9..3c98dfef03a9 100644
> --- a/arch/alpha/mm/fault.c
> +++ b/arch/alpha/mm/fault.c
> @@ -84,6 +84,7 @@ asmlinkage void
>  do_page_fault(unsigned long address, unsigned long mmcsr,
> long cause, struct pt_regs *regs)
>  {
> + struct vm_fault vmf = {};
>   struct vm_area_struct * vma;
>   struct mm_struct *mm = current->mm;
>   const struct exception_table_entry *fixup;
> @@ -148,7 +149,8 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
>   /* If for any reason at all we couldn't handle the fault,
>  make sure we exit gracefully rather than endlessly redo
>  the fault.  */
> - fault = handle_mm_fault(vma, address, flags);
> + vm_fault_init(&vmfs, vma, flags, address);
> + fault = handle_mm_fault(&vmf);

Doesn't compile.

> --- a/arch/arm/mm/fault.c
> +++ b/arch/arm/mm/fault.c
> @@ -225,17 +225,17 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma)
>  }
>  
>  static vm_fault_t __kprobes
> -__do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
> - unsigned int flags, struct task_struct *tsk)
> +__do_page_fault(struct mm_struct *mm, struct vm_fault *vm, unsigned int fsr,

vm_fault is *vm

> + struct task_struct *tsk)
>  {
>   struct vm_area_struct *vma;
>   vm_fault_t fault;
>  
> - vma = find_vma(mm, addr);
> + vma = find_vma(mm, vmf->address);

So this doesn't compile.

>  
>  check_stack:
> - if (vma->vm_flags & VM_GROWSDOWN && !expand_stack(vma, addr))
> + if (vma->vm_flags & VM_GROWSDOWN && !expand_stack(vma, vmf->address))
>   goto good_area;
>  out:
>   return fault;
> @@ -424,6 +424,7 @@ static bool is_el0_instruction_abort(unsigned int esr)
>  static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
>  struct pt_regs *regs)
>  {
> + struct vm_fault vmf = {};
>   struct task_struct *tsk;
>   struct mm_struct *mm;
>   struct siginfo si;
> @@ -493,7 +494,8 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
>  #endif
>   }
>  
> - fault = __do_page_fault(mm, addr, mm_flags, vm_flags, tsk);
> + vm_fault_init(&vmf, NULL, addr, mm_flags);
> + fault = __do_page_fault(mm, vmf, vm_flags, tsk);

I'm betting this doesn't compile, either.

/me stops looking.

Cheers,

Dave.
-- 
Dave Chinner
da...@fromorbit.com


Re: [PATCH 8/8] btrfs: drop mmap_sem in mkwrite for btrfs

2018-09-25 Thread Dave Chinner
On Tue, Sep 25, 2018 at 11:30:11AM -0400, Josef Bacik wrote:
> @@ -1454,6 +1463,11 @@ static inline int fixup_user_fault(struct task_struct *tsk,
>   BUG();
>   return -EFAULT;
>  }
> +stiatc inline struct file *maybe_unlock_mmap_for_io(struct vm_area_struct *vma,
> + int flags)
> +{
> + return NULL;
> +}

This doesn't compile either.

-Dave.
-- 
Dave Chinner
da...@fromorbit.com