Re: net/core: BUG in unregister_netdevice_many

2017-04-21 Thread Nikolay Aleksandrov

On 21/04/17 20:42, Linus Torvalds wrote:
> On Fri, Apr 21, 2017 at 10:25 AM, Linus Torvalds wrote:
>>
>> I'm assuming that the real cause is simply that "dev->reg_state" ends
>> up being NETREG_UNREGISTERING or something. Maybe the BUG_ON() could
>> be just removed, and replaced by the previous warning about
>> NETREG_UNINITIALIZED.
>>
>> Something like the attached (TOTALLY UNTESTED) patch.
> 
> .. might as well test it.
> 
> That patch doesn't fix the problem, but it does show that yes, it was
> NETREG_UNREGISTERING:
> 
>   unregister_netdevice: device pim6reg/962dc4606000 was not registered (2)
> 
> but then immediately afterwards we get
> 
>   general protection fault:  [#1] SMP
>   Workqueue: netns cleanup_net
>   RIP: 0010:dev_shutdown+0xe/0xc0
>   Call Trace:
>  rollback_registered_many+0x2a5/0x440
>  unregister_netdevice_many+0x1e/0xb0
>  default_device_exit_batch+0x145/0x170
> 
> which is due to a
> 
>     mov    0x388(%rdi),%eax
> 
> where %rdi is 0xdead0090. That is at the very beginning of
> dev_shutdown, it's "dev" itself that has that value, so it comes from
> (_another_) invocation of rollback_registered_many(), when it does
> that
> 
> list_for_each_entry(dev, head, unreg_list) {
> 
> so it seems to be a case of another "list_del() leaves list in bad
> state", and it was the added test for "dev->reg_state !=
> NETREG_REGISTERED" that did that
> 
>     list_del(&dev->unreg_list);
> 
> and left random contents in the unreg_list.
> 
> So that "handle error case" was almost certainly just buggy too.
> 
> And the bug seems to be that we're trying to unregister a netdevice
> that has already been unregistered.
> 
> Over to Eric and networking people. This oops is user-triggerable, and
> leaves the machine in a bad state (the original BUG_ON() and the new
> GP fault both happen while holding the RTNL, so networking is not
> healthy afterwards).
> 
>   Linus
> 

Right, I've already posted a patch for ip6mr that should fix the issue.
CCed you and LKML just now.

Thanks,
 Nik
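
The failure mode Linus describes, following a list pointer that list_del() has already poisoned, can be illustrated with a minimal userspace sketch. The struct, the helper names and the poison constants below are stand-ins written for this illustration (the kernel's real LIST_POISON values depend on architecture and configuration); this is not the netdev code itself.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's struct list_head. */
struct list_head {
    struct list_head *next, *prev;
};

/* Illustrative poison values only; the kernel uses different constants. */
#define POISON_NEXT ((struct list_head *)(uintptr_t)0x100)
#define POISON_PREV ((struct list_head *)(uintptr_t)0x200)

/* An empty list points at itself in both directions. */
static void list_init(struct list_head *head)
{
    head->next = head;
    head->prev = head;
}

/* Link entry in just before head, i.e. at the tail of the list. */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
    entry->prev = head->prev;
    entry->next = head;
    head->prev->next = entry;
    head->prev = entry;
}

/*
 * Unlink entry from its neighbours, then poison its own pointers.
 * The neighbours stay consistent, but anything that later follows
 * entry->next (for example a second walker that still holds the
 * entry) ends up dereferencing the poison value.
 */
static void list_del(struct list_head *entry)
{
    entry->prev->next = entry->next;
    entry->next->prev = entry->prev;
    entry->next = POISON_NEXT;
    entry->prev = POISON_PREV;
}

/* True if the entry has been removed with list_del() and not re-added. */
static int list_entry_is_poisoned(const struct list_head *entry)
{
    return entry->next == POISON_NEXT || entry->prev == POISON_PREV;
}
```

Under these assumptions, a deleted entry must never be walked again: a loop that reaches it through a stale pointer reads the poison as its next element, which matches the shape of the fault in the quoted trace.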

Re: [PATCH 2/4] fs/block_dev: always invalidate cleancache in invalidate_bdev()

2017-04-18 Thread Nikolay Borisov



On 14.04.2017 17:07, Andrey Ryabinin wrote:
> invalidate_bdev() calls cleancache_invalidate_inode() iff ->nrpages != 0
> which doesn't make any sense.
> Make invalidate_bdev() always invalidate cleancache data.
> 
> Fixes: c515e1fd361c ("mm/fs: add hooks to support cleancache")
> Signed-off-by: Andrey Ryabinin 
> ---
>  fs/block_dev.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/fs/block_dev.c b/fs/block_dev.c
> index e405d8e..7af4787 100644
> --- a/fs/block_dev.c
> +++ b/fs/block_dev.c
> @@ -103,12 +103,11 @@ void invalidate_bdev(struct block_device *bdev)
>  {
>  	struct address_space *mapping = bdev->bd_inode->i_mapping;
>  
> -	if (mapping->nrpages == 0)
> -		return;
> -
> -	invalidate_bh_lrus();
> -	lru_add_drain_all();	/* make sure all lru add caches are flushed */
> -	invalidate_mapping_pages(mapping, 0, -1);
> +	if (mapping->nrpages) {
> +		invalidate_bh_lrus();
> +		lru_add_drain_all();	/* make sure all lru add caches are flushed */
> +		invalidate_mapping_pages(mapping, 0, -1);
> +	}

How is this different from the current code? You will still only
invalidate the mapping iff ->nrpages > 0 (I assume it can't go below 0).
Perhaps just remove the if altogether?

>  	/* 99% of the time, we don't need to flush the cleancache on the bdev.
>  	 * But, for the strange corners, lets be cautious
>  	 */
>
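
One way to read the patch description against the question above is to compare the two control flows side by side. The sketch below assumes, as the commit message and the trailing comment suggest (the quoted hunk is cut off before that point), that the cleancache call sits after the new if block. The struct and its counters are hypothetical stand-ins for illustration, not kernel APIs.

```c
#include <assert.h>

/* Hypothetical stand-in for the bdev mapping; only the page count matters. */
struct mapping_model {
    unsigned long nrpages;
    int pagecache_invalidations;   /* models the LRU drain + invalidate_mapping_pages() */
    int cleancache_invalidations;  /* models cleancache_invalidate_inode() */
};

/* Old flow: the early return also skips the cleancache invalidation. */
static void invalidate_bdev_old(struct mapping_model *m)
{
    if (m->nrpages == 0)
        return;

    m->pagecache_invalidations++;
    m->cleancache_invalidations++;
}

/* Patched flow (as assumed here): only the page cache work is conditional. */
static void invalidate_bdev_new(struct mapping_model *m)
{
    if (m->nrpages)
        m->pagecache_invalidations++;

    m->cleancache_invalidations++;
}
```

Read this way, the if still guards the page cache work exactly as before; the behavioural change is limited to the nrpages == 0 case, where the cleancache invalidation is no longer skipped.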

Re: [RFC PATCH 1/4] fs: new infrastructure for writeback error handling and reporting

2017-04-03 Thread Nikolay Borisov



On 31.03.2017 22:26, Jeff Layton wrote:
> Most filesystems currently use mapping_set_error and
> filemap_check_errors for setting and reporting/clearing writeback errors
> at the mapping level. filemap_check_errors is indirectly called from
> most of the filemap_fdatawait_* functions and from
> filemap_write_and_wait*. These functions are called from all sorts of
> contexts to wait on writeback to finish -- e.g. mostly in fsync, but
> also in truncate calls, getattr, etc.
> 
> It's those non-fsync callers that are problematic. We should be
> reporting writeback errors during fsync, but many places in the code
> clear out errors before they can be properly reported, or report errors
> at nonsensical times. If I get -EIO on a stat() call, how do I know that
> was because writeback failed?
> 
> This patch adds a small bit of new infrastructure for setting and
> reporting errors during pagecache writeback. While the above was my
> original impetus for adding this, I think it's also the case that
> current fsync semantics are just problematic for userland. Most
> applications that call fsync do so to ensure that the data they wrote
> has hit the backing store.
> 
> In the case where there are multiple writers to the file at the same
> time, this is really hard to determine. The first one to call fsync will
> see any stored error, and the rest get back 0. The processes with open
> fd may not be associated with one another in any way. They could even be
> in different containers, so ensuring coordination between all fsync
> callers is not really an option.
> 
> One way to remedy this would be to track what file descriptor was used
> to dirty the file, but that's rather cumbersome and would likely be
> slow. However, there is a simpler way to improve the semantics here
> without incurring too much overhead.
> 
> This set adds a wb_error field and a sequence counter to the
> address_space, and a corresponding sequence counter in the struct file.
> When errors are reported during writeback, we set the error field in the
> mapping and increment the sequence counter.
> 
> When fsync or flush is called, we check the sequence in the file vs. the
> one in the mapping. If the file's counter is behind the one in the
> mapping, then we update the sequence counter in the file to the value of
> the one in the mapping and report the error. If the file is "caught up"
> then we just report 0.
> 
> This changes the semantics of fsync such that applications can now use
> it to determine whether there were any writeback errors since fsync(fd)
> was last called (or since the file was opened in the case of fsync
> having never been called).
> 
> Note that those writeback errors may have occurred when writing data
> that was dirtied via an entirely different fd, but that's the case now
> with the current mapping_set_error/filemap_check_error infrastructure.
> This will at least prevent you from getting a false report of success.
> 
> The basic idea here is for filesystems to use filemap_set_wb_error to
> set the error in the mapping when there are writeback errors, and then
> have the fsync and flush operations call filemap_report_wb_error just
> before returning to ensure that those errors get reported properly.
> 
> Eventually, it may make sense to move the reporting into the generic
> vfs_fsync_range helper, but doing it this way for now makes it simpler
> to convert filesystems to the new API individually.

There is already a mapping_set_error API, which sets flags in
mapping->flags (AS_EIO/AS_ENOSPC). Aren't you essentially duplicating
some of the semantics of that API?
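
The sequence-counter scheme described in the quoted cover text can be modelled in a few lines of userspace C. This is only a sketch of the idea as described in the mail: the struct layouts and the function names (open_file_model, set_wb_error_model, report_wb_error_model) are invented for this illustration and are not the interfaces added by the patch.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of the per-mapping state described above. */
struct mapping_model {
    int wb_error;         /* most recent writeback error (negative errno), 0 if none */
    unsigned int wb_seq;  /* incremented every time an error is recorded */
};

/* Hypothetical model of the per-open-file state. */
struct file_model {
    struct mapping_model *mapping;
    unsigned int wb_seq;  /* mapping->wb_seq as of open or the last report */
};

/* A freshly opened file starts "caught up" with the mapping. */
static void open_file_model(struct file_model *f, struct mapping_model *m)
{
    f->mapping = m;
    f->wb_seq = m->wb_seq;
}

/* Writeback failed: store the error and bump the sequence counter. */
static void set_wb_error_model(struct mapping_model *m, int err)
{
    m->wb_error = err;
    m->wb_seq++;
}

/*
 * Called from an fsync-like path. If this file has not yet seen the
 * latest error, catch up and report it; otherwise report success.
 * Comparing for inequality rather than "behind" is a simplification
 * chosen here so that counter wraparound does not matter.
 */
static int report_wb_error_model(struct file_model *f)
{
    if (f->wb_seq == f->mapping->wb_seq)
        return 0;

    f->wb_seq = f->mapping->wb_seq;
    return f->mapping->wb_error;
}
```

In this model every file that was open when the error occurred sees it exactly once, regardless of which one calls fsync first, while a file opened afterwards reports 0.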

Re: [PATCHv2] fs: Handle register_shrinker failure

2017-04-01 Thread Nikolay Borisov



On 24.03.2017 10:25, Nikolay Borisov wrote:
> register_shrinker allocates dynamic memory and thus is susceptible to failures
> in low-memory situations. Currently, sget_userns ignores the return value of
> register_shrinker, potentially exposing a not fully initialised object. This
> can lead to a NULL-ptr deref every time shrinker->nr_deferred is referenced.
> 
> Fix this by failing to register the filesystem in case there is not enough
> memory to fully construct the shrinker object.
> 
> Signed-off-by: Nikolay Borisov <nbori...@suse.com>
> Fixes: 1d3d4437eae1 ("vmscan: per-node deferred work")
> Link: lkml.kernel.org/r/CACT4Y+b-purC3HHbw=sctms3ma8fkqtnyzus_kco2wmcttw...@mail.gmail.com
> ---

PING. Al, is there something bothering you about this patch that needs
fixing before it's merged? Also, I think it should be tagged for stable.

> 
> Add Fixes and Link tags for better traceability
> 
>  fs/super.c | 14 +++++++++++++-
>  1 file changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/super.c b/fs/super.c
> index b8b6a086c03b..964b18447c92 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -518,7 +518,19 @@ struct super_block *sget_userns(struct file_system_type *type,
>  	hlist_add_head(&s->s_instances, &type->fs_supers);
>  	spin_unlock(&sb_lock);
>  	get_filesystem(type);
> -	register_shrinker(&s->s_shrink);
> +	err = register_shrinker(&s->s_shrink);
> +	if (err) {
> +		spin_lock(&sb_lock);
> +		list_del(&s->s_list);
> +		hlist_del(&s->s_instances);
> +		spin_unlock(&sb_lock);
> +
> +		up_write(&s->s_umount);
> +		destroy_super(s);
> +		put_filesystem(type);
> +		return ERR_PTR(err);
> +	}
> +
>  	return s;
>  }
>  
>

[PATCH] fs: Handle register_shrinker failure

2017-03-24 Thread Nikolay Borisov

register_shrinker allocates dynamic memory and thus is susceptible to failures
in low-memory situations. Currently, sget_userns ignores the return value of
register_shrinker, potentially exposing a not fully initialised object. This
can lead to a NULL-ptr deref every time shrinker->nr_deferred is referenced.

Fix this by failing to register the filesystem in case there is not enough
memory to fully construct the shrinker object.

Signed-off-by: Nikolay Borisov <nbori...@suse.com>
---
 fs/super.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index b8b6a086c03b..964b18447c92 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -518,7 +518,19 @@ struct super_block *sget_userns(struct file_system_type *type,
 	hlist_add_head(&s->s_instances, &type->fs_supers);
 	spin_unlock(&sb_lock);
 	get_filesystem(type);
-	register_shrinker(&s->s_shrink);
+	err = register_shrinker(&s->s_shrink);
+	if (err) {
+		spin_lock(&sb_lock);
+		list_del(&s->s_list);
+		hlist_del(&s->s_instances);
+		spin_unlock(&sb_lock);
+
+		up_write(&s->s_umount);
+		destroy_super(s);
+		put_filesystem(type);
+		return ERR_PTR(err);
+	}
+
 	return s;
 }
 
-- 
2.7.4
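
The pattern the patch follows (check the registration result, undo the partial setup, and hand the error back as an encoded pointer) can be sketched in userspace. Everything below is a simplified model made up for illustration: the ERR_PTR/IS_ERR/PTR_ERR helpers imitate the kernel macros of the same names, and the *_model types, the sget_model function and the fail_alloc knob are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Userspace imitation of the kernel's error-pointer helpers. */
#define MAX_ERRNO 4095

static inline void *ERR_PTR(long err)
{
    return (void *)(intptr_t)err;
}

static inline long PTR_ERR(const void *ptr)
{
    return (long)(intptr_t)ptr;
}

static inline int IS_ERR(const void *ptr)
{
    return (uintptr_t)ptr >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical models of the shrinker and the superblock that embeds it. */
struct shrinker_model {
    long *nr_deferred;  /* dynamically allocated, so registration can fail */
};

struct super_model {
    struct shrinker_model shrink;
};

/* fail_alloc is a fault-injection knob that exists only in this sketch. */
static int register_shrinker_model(struct shrinker_model *sh, int fail_alloc)
{
    sh->nr_deferred = fail_alloc ? NULL : calloc(1, sizeof(*sh->nr_deferred));
    return sh->nr_deferred ? 0 : -ENOMEM;
}

/* Build a superblock; on registration failure, unwind and return the error. */
static struct super_model *sget_model(int fail_alloc)
{
    struct super_model *s = calloc(1, sizeof(*s));
    int err;

    if (!s)
        return ERR_PTR(-ENOMEM);

    err = register_shrinker_model(&s->shrink, fail_alloc);
    if (err) {
        free(s);  /* stands in for the list removal and destroy_super() */
        return ERR_PTR(err);
    }
    return s;
}

/* Release a successfully built superblock model. */
static void put_super_model(struct super_model *s)
{
    free(s->shrink.nr_deferred);
    free(s);
}
```

A caller in this model checks IS_ERR() on the result before touching it, so a half-constructed object with a NULL nr_deferred never becomes visible.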

[PATCHv2] fs: Handle register_shrinker failure

2017-03-24 Thread Nikolay Borisov

register_shrinker allocates dynamic memory and thus is susceptible to failures
in low-memory situations. Currently, sget_userns ignores the return value of
register_shrinker, potentially exposing a not fully initialised object. This
can lead to a NULL-ptr deref every time shrinker->nr_deferred is referenced.

Fix this by failing to register the filesystem in case there is not enough
memory to fully construct the shrinker object.

Signed-off-by: Nikolay Borisov <nbori...@suse.com>
Fixes: 1d3d4437eae1 ("vmscan: per-node deferred work")
Link: lkml.kernel.org/r/CACT4Y+b-purC3HHbw=sctms3ma8fkqtnyzus_kco2wmcttw...@mail.gmail.com
---

Add Fixes and Link tags for better traceability

 fs/super.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/fs/super.c b/fs/super.c
index b8b6a086c03b..964b18447c92 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -518,7 +518,19 @@ struct super_block *sget_userns(struct file_system_type *type,
 	hlist_add_head(&s->s_instances, &type->fs_supers);
 	spin_unlock(&sb_lock);
 	get_filesystem(type);
-	register_shrinker(&s->s_shrink);
+	err = register_shrinker(&s->s_shrink);
+	if (err) {
+		spin_lock(&sb_lock);
+		list_del(&s->s_list);
+		hlist_del(&s->s_instances);
+		spin_unlock(&sb_lock);
+
+		up_write(&s->s_umount);
+		destroy_super(s);
+		put_filesystem(type);
+		return ERR_PTR(err);
+	}
+
 	return s;
 }
 
-- 
2.7.4

Re: kasan behavior when built with unsupported compiler

2017-03-09 Thread Nikolay Borisov



On  9.03.2017 11:46, Andrey Ryabinin wrote:
> On 03/08/2017 11:10 AM, Nikolay Borisov wrote:
> 
>>
>> So apparently this is indeed a false positive, resulting from using the old 
>> compiler. I used the attached patch to verify it. 
>>
>> And what it prints is : 
>> [   17.184288] Assigned fbdev-blacklist.conff(880001ea8020)20 whole 
>> object: 88006ae8fdb0 inode:88006bff60d0
>> [   17.185808] Calling filldir with 88006ae8fdb0
>>
>> So the first line essentially happens when the object 88006ae8fdb0 is
>> being allocated and the second when it's used in filldir. The warning in 
>> ext4_ext_map_blocks doesn't trigger. However, if I remove the check for 
>> the value of ext4_global_pointer then I see multiple lines such as: 
>> [   17.386283] ext4_ext_map_blocks:freeing  pointer used in 
>> ext4_htree_store_dirent: 88006ae8fdb0 inode: 88006bff60d0
>> [   17.387601] Assigned fbdev-blacklist.conff(880001eb3020)20 whole 
>> object: 88006ae8fdb0 inode:88006bff60d0
>> [   17.388740] Calling filldir with 88006ae8fdb0
>>
>> so that same object was used right before it is allocated again in 
>> ext4_htree_store_dirent. And when you think about it it is logical since 
>> before filling in the dentry names in ext4_htree_store_dirent ext4 has to 
>> fetch the 
>> contents of the directory from disk.
>>
>> This leads me to believe that kasan is getting confused thinking that 
>> the object is being freed 
> 
> As I said before, this is *not* use-after-free. It's out-of-bounds access.
> No, kasan is not confused, it doesn't think that object is freed.
> Object is allocated and kasan see it as allocated object.
> The problem is that filldir reads past the end of that allocated object.
> 
> I don't see any sign that it's a false-positive.

Okay, in that case I will continue digging. The name is indeed housed
at the end of struct fname. In ext4_htree_store_dirent this structure
seems to be allocated correctly, with a size of
sizeof(struct fname) + ent_name->len + 1.

Also, the read should indeed be 20 bytes, since the filename in question,
fbdev-blacklist.conf, is 20 bytes long. This implies that the 'namlen'
passed to copy_to_user is also correct. I guess I will have to compare
the assembly generated by the 2 gcc versions and see if anything
pops out.
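
The sizing argument above can be checked with a small userspace model. The struct below is a deliberately simplified stand-in for ext4's struct fname (only the length field and the trailing name are kept), and the helper names are invented for this sketch; it illustrates the arithmetic, not the kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Simplified stand-in: the name is stored directly after the header. */
struct fname_model {
    unsigned char name_len;
    char name[];  /* C99 flexible array member */
};

/* Allocation size used in the mail: header + name bytes + 1 terminator. */
static size_t fname_alloc_size(size_t len)
{
    return sizeof(struct fname_model) + len + 1;
}

/* Allocate an entry and copy the name in; the trailing byte stays zero. */
static struct fname_model *fname_alloc(const char *name, size_t len)
{
    struct fname_model *fn;

    assert(len <= 255);  /* must fit in the one-byte length field */
    fn = calloc(1, fname_alloc_size(len));
    if (!fn)
        return NULL;

    fn->name_len = (unsigned char)len;
    memcpy(fn->name, name, len);
    return fn;
}

/*
 * Copy exactly name_len bytes out, the way a filldir-style consumer
 * would. Returns the number of bytes copied, or 0 if dst is too small.
 */
static size_t fname_copy_name(const struct fname_model *fn, char *dst, size_t dst_len)
{
    if (dst_len < fn->name_len)
        return 0;

    memcpy(dst, fn->name, fn->name_len);
    return fn->name_len;
}
```

With a 20-byte name, a 20-byte read starting at name stays inside the allocation with one spare byte, which is consistent with the reasoning in the mail that the length itself is not the problem.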

> 
> 
>> AFTER being allocated in 
>> ext4_htree_store_dirent but testing shows it's being freed BEFORE. So 
>> I deem this a false positive, triggered by the compiler. 
>>
>>
>>

Re: Race condition in ext4 (was Re: 4.11-rc1 acpi stomping ext4 slabs)

2017-03-08 Thread Nikolay Borisov



On  9.03.2017 03:58, Theodore Ts'o wrote:
> On Tue, Mar 07, 2017 at 10:40:53PM +0200, Nikolay Borisov wrote:
>> So this is wrong, the reason why the issues seemed fix is because I
>> switched my compiler to version 5.4.0. So this manifests only if I'm
>> using gcc 4.7.4. With the pr_info added here is the output of a boot. So
>> there are multiple invocations of ext4_ext_map_blocks and the freeing,
>> including with the address being used in subsequent kasan reports :
>> 88006ae8fdb0
> 
> Can you help bisect this, then?  I'm using Debian Testing, and the
> default gcc is gcc 6.3.0.  I'm currently forcing the use of gcc 5.4.1
> because I was running into problems with gcc 6.x a while back.  (TBH,
> I was thinking about trying to see if gcc 6.3 was stable for kernel
> compiles when I had some spare time.)  But I don't have access to
> *any* gcc 4.x on my development system, and I don't think I've tried
> using gcc 4.x in a long, Long, LONG time.
> 
> I'm currently kicking off a test run using 5.4.1 with KASAN enabled to
> see if I can trigger it myself.  Can you send me a copy of your
> .config so I can see what else might be interesting with your config?
> (e.g., SLAB vs SLUB, etc.)

Attached the config. After further debugging and talking with the kasan
developers, I think this might actually be a kasan problem when used with
an old compiler. I bisected this all the way to 1771c6e1a567ea0ba2,
which is the commit introducing the user access instrumentation. Here is
a mail thread where I confirmed that this might be a kasan issue:
https://lkml.org/lkml/2017/3/8/69

What I believe is happening is that the manual checks inserted in the user
access code miss some context information because the instrumentation is
not inserted by the compiler. Kasan gets confused as a result, hence the
warnings.


> 
> Thanks,
> 
>  - Ted
> 
#
# Automatically generated file; DO NOT EDIT.
# Linux/x86 4.7.0 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_HWEIGHT_CFLAGS="-fcall-saved-rdi -fcall-saved-rsi -fcall-saved-rdx -fcall-saved-rcx -fcall-saved-r8 -fcall-saved-r9 -fcall-saved-r10 -fcall-saved-r11"
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_DEBUG_RODATA=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION="-nbor"
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
# CONFIG_POSIX_MQUEUE is not set
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_FHANDLE=y
# CONFIG_USELIB is not set
# CONFIG_AUDIT is not set
CONFIG_HAVE_ARCH_AUDITSYSCALL=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO

Re: kasan behavior when built with unsupported compiler

2017-03-08 Thread Nikolay Borisov



On  7.03.2017 17:54, Dmitry Vyukov wrote:
> On Tue, Mar 7, 2017 at 4:35 PM, Nikolay Borisov
> <n.borisov.l...@gmail.com> wrote:
>> Hello,
>>
>> I've been chasing a particular UAF as reported by kasan
>> (https://www.spinics.net/lists/kernel/msg2458136.html). However, one
>> thing which I took notice of rather lately is that I was building my
>> kernel with gcc 4.7.4 which is not supported by kasan as indicated by
>> the following string:
>>
>> scripts/Makefile.kasan:19: Cannot use CONFIG_KASAN:
>> -fsanitize=kernel-address is not supported by compiler
>>
>>
>> Nevertheless, the kernel compiles and when I boot it I see the kasan
>> splats as per the referenced thread. If, however, I build the kernel
>> with a newer compiler version 5.4.0 kasan no longer complains.
>>
>>
>> At this point I'm wondering whether the splats can be due to old
>> compiler being used e.g. false positives or are they genuine splats and
>> gcc 5 somehow obfuscates them ? Clearly despite the warning about not
>> being able to use CONFIG_KASAN it is still working since I'm seeing the
>> splats. Is this valid behavior ?
> 
> 
> Hi,
> 
> Re the message that kasan is not supported while it's still enabled in the 
> end.
> I think it's an issue related to gcc plugins. Originally kasan was
> supported with 5.0+ thus the message. However, later we extended this
> support to 4.5+ with gcc plugins. However, that confusing message from
> build system was not fixed. So yes, it's confusing and it's something
> to fix, but mostly you can just ignore it.
> 
> Re false positives with 4.7. By default I would assume that it is true
> positive. Should be easy to check with manual printfs.

So apparently this is indeed a false positive, resulting from using the old 
compiler. I used the attached patch to verify it. 

And what it prints is:
[   17.184288] Assigned fbdev-blacklist.conff(880001ea8020)20 whole object: 88006ae8fdb0 inode:88006bff60d0
[   17.185808] Calling filldir with 88006ae8fdb0

So the first line is printed when the object 88006ae8fdb0 is
allocated and the second when it's used in filldir. The warning in
ext4_ext_map_blocks doesn't trigger. However, if I remove the check for
the value of global_ext4_pointer then I see multiple lines such as:
[   17.386283] ext4_ext_map_blocks:freeing  pointer used in ext4_htree_store_dirent: 88006ae8fdb0 inode: 88006bff60d0
[   17.387601] Assigned fbdev-blacklist.conff(880001eb3020)20 whole object: 88006ae8fdb0 inode:88006bff60d0
[   17.388740] Calling filldir with 88006ae8fdb0

So that same object was used right before it is allocated again in
ext4_htree_store_dirent. And when you think about it, this is logical, since
before filling in the dentry names in ext4_htree_store_dirent ext4 has to
fetch the contents of the directory from disk.

This leads me to believe that kasan is getting confused thinking that 
the object is being freed AFTER being allocated in 
ext4_htree_store_dirent but testing shows it's being freed BEFORE. So 
I deem this a false positive, triggered by the compiler. 
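
The ordering test described above can be sketched in userspace roughly as
follows. This is an illustration of the idea only, not the attached patch:
tracked_alloc, tracked_free, tracked_ptr and frees_after_assign are
hypothetical names, and malloc/free stand in for the kernel allocator.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the global pointer recorded at allocation time. */
static void *tracked_ptr;
/* Number of frees observed for tracked_ptr AFTER it was recorded. */
static int frees_after_assign;

/* Allocate a copy of 'name'; remember the object if it matches 'wanted'. */
static void *tracked_alloc(const char *name, const char *wanted)
{
	size_t n = strlen(name) + 1;
	char *p = malloc(n);

	if (!p)
		return NULL;
	memcpy(p, name, n);
	if (strcmp(name, wanted) == 0)
		tracked_ptr = p;
	return p;
}

/*
 * Free 'p'. A hit here means the object was freed after the allocation
 * we recorded; a free that happened before the allocation never matches,
 * because the pointer had not been recorded yet.
 */
static void tracked_free(void *p)
{
	if (p && p == tracked_ptr) {
		frees_after_assign++;
		fprintf(stderr, "freeing tracked object %p\n", p);
		tracked_ptr = NULL;
	}
	free(p);
}
```

If the free-path message never fires while the pointer is recorded, every
free of that address happened before the allocation of interest, which is
the ordering the log output above suggests.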



> 
> Re why 5.4 does not detect it. Difficult to say.
> If you confirm that it's a real bug and provide repro instructions,
> then I can recheck it with latest gcc. If it's a real bug and the
> latest gcc does not detect it, then we need to look more closely at
> it. I afraid 5.4 won't be fixed.
> It's also possible that it's a false positive in the old compiler (I
> think there were some bugs). If so, I would recommend switching to a
> newer compiler.
> 
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 68323e3da3fa..31f5153b3df4 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -429,6 +429,7 @@ void ext4_htree_free_dir_info(struct dir_private_info *p)
  * encrypted filename, while the htree will hold decrypted filename.
  * The decrypted filename is passed in via ent_name.  parameter.
  */
+void *global_ext4_pointer = NULL;
 int ext4_htree_store_dirent(struct file *dir_file, __u32 hash,
 			 __u32 minor_hash,
 			struct ext4_dir_entry_2 *dirent,
@@ -454,7 +455,10 @@ int ext4_htree_store_dirent(struct file *dir_file, __u32 hash,
 	new_fn->file_type = dirent->file_type;
 	memcpy(new_fn->name, ent_name->name, ent_name->len);
 	new_fn->name[ent_name->len] = 0;
-
+	if (!strcmp(new_fn->name, "fbdev-blacklist.conf")) {
+		pr_info("Assigned %s(%p)%u whole object: %p inode:%p\n", ent_name->name, ent_name->name, ent_name->len, new_fn, file_inode(dir_file));
+		global_ext4_pointer = new_fn; 
+	}
 	while (*p) {
 		parent = *p;
 		fname = rb_entry(parent, struct fname, rb_hash);
@@ -507,6 +511,8 @@ static int call_filldir(struct file *file, struct dir_context *ctx,
 	}
 	ctx->po

Re: net: BUG in unix_notinflight

2017-03-07 Thread Nikolay Borisov


>>
>>
>> New report from linux-next/c0b7b2b33bd17f7155956d0338ce92615da686c9
>>
>> [ cut here ]
>> kernel BUG at net/unix/garbage.c:149!
>> invalid opcode:  [#1] SMP KASAN
>> Dumping ftrace buffer:
>>(ftrace buffer empty)
>> Modules linked in:
>> CPU: 0 PID: 1806 Comm: syz-executor7 Not tainted 4.10.0-next-20170303+ #6
>> Hardware name: Google Google Compute Engine/Google Compute Engine,
>> BIOS Google 01/01/2011
>> task: 880121c64740 task.stack: 88012c9e8000
>> RIP: 0010:unix_notinflight+0x417/0x5d0 net/unix/garbage.c:149
>> RSP: 0018:88012c9ef0f8 EFLAGS: 00010297
>> RAX: 880121c64740 RBX: 11002593de23 RCX: 8801c490c628
>> RDX:  RSI: 11002593de27 RDI: 8557e504
>> RBP: 88012c9ef220 R08: 0001 R09: 
>> R10: dc00 R11: ed002593de55 R12: 8801c490c0c0
>> R13: 88012c9ef1f8 R14: 85101620 R15: dc00
>> FS:  013d3940() GS:8801dbe0() knlGS:
>> CS:  0010 DS:  ES:  CR0: 80050033
>> CR2: 01fd8cd8 CR3: 0001cce69000 CR4: 001426f0
>> Call Trace:
>>  unix_detach_fds.isra.23+0xfa/0x170 net/unix/af_unix.c:1490
>>  unix_destruct_scm+0xf4/0x200 net/unix/af_unix.c:1499
> 
> The problem here is there is no lock protecting concurrent unix_detach_fds()
> even though unix_notinflight() is already serialized, if we call
> unix_notinflight()
> twice on the same file pointer, we trigger this bug...
> 
> I don't know what is the right lock here to serialize it.
> 


I reported something similar a while ago
https://lists.gt.net/linux/kernel/2534612

And Miklos Szeredi then produced the following patch :

https://patchwork.kernel.org/patch/9305121/

However, this was never applied. I wonder if the patch makes sense?
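
For illustration only, the failure mode in the quoted analysis (the
"not in flight" accounting being run twice for the same file pointer) can
be modelled in userspace as below. This is not the patch linked above and
not the kernel's data structures: toy_file, toy_scm, toy_notinflight and
toy_detach are hypothetical names, and claiming the pointer with an atomic
exchange is just one assumed way to make the detach step run once.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Toy file with an in-flight reference count. */
struct toy_file {
	long inflight;
};

/* Toy message control block holding the file being passed. */
struct toy_scm {
	_Atomic(struct toy_file *) fp;
};

/* Stand-in for the accounting step; the assert plays the role of BUG_ON. */
static void toy_notinflight(struct toy_file *f)
{
	assert(f->inflight > 0);
	/* The real counter is protected by its own lock; omitted here. */
	f->inflight--;
}

/*
 * Detach the file from the control block. The atomic exchange hands the
 * pointer to exactly one caller, so a second (or concurrent) detach sees
 * NULL and skips the accounting. Returns 1 if this caller did the detach.
 */
static int toy_detach(struct toy_scm *scm)
{
	struct toy_file *f = atomic_exchange(&scm->fp, (struct toy_file *)NULL);

	if (!f)
		return 0;
	toy_notinflight(f);
	return 1;
}
```

Calling toy_detach() twice on the same toy_scm decrements the counter only
once; without the exchange, the second call would hit the assertion in the
same way the report hits the BUG.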

Re: Race condition in ext4 (was Re: 4.11-rc1 acpi stomping ext4 slabs)

2017-03-07 Thread Nikolay Borisov



On  7.03.2017 16:33, Nikolay Borisov wrote:
> 
> 
> On  7.03.2017 11:38, Nikolay Borisov wrote:
>>
>>
>> On  7.03.2017 00:35, Rafael J. Wysocki wrote:
>>> On Mon, Mar 6, 2017 at 9:31 PM, Nikolay Borisov
>>> <n.borisov.l...@gmail.com> wrote:
>>>> Hello,
>>>>
>>>> Booting 4.11-rc1 with kasan enabled and "slub_debug=F" produces the 
>>>> following errors:
>>>>
>>>> [7.070797] 
>>>> ==
>>>> [7.071724] BUG: KASAN: slab-out-of-bounds in filldir+0xc3/0x160 at 
>>>> addr 88006bc2b0ae
>>>> [7.071724] Read of size 20 by task systemd/1
>>>> [7.071724] CPU: 1 PID: 1 Comm: systemd Not tainted 4.11.0-rc1-nbor #150
>>>> [7.071724] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>>> Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>>> [7.071724] Call Trace:
>>>> [7.071724]  dump_stack+0x85/0xc9
>>>> [7.071724]  kasan_object_err+0x2c/0x90
>>>> [7.071724]  kasan_report+0x285/0x510
>>>> [7.071724]  check_memory_region+0x137/0x160
>>>> [7.071724]  kasan_check_read+0x11/0x20
>>>> [7.071724]  filldir+0xc3/0x160
>>>> [7.071724]  call_filldir+0x88/0x140
>>>> [7.071724]  ext4_readdir+0x757/0x920
>>>> [7.071724]  ? iterate_dir+0x49/0x190
>>>> [7.071724]  iterate_dir+0x7d/0x190
>>>> [7.071724]  ? entry_SYSCALL_64_fastpath+0x5/0xc6
>>>> [7.071724]  SyS_getdents+0xac/0x170
>>>> [7.071724]  ? filldir64+0x170/0x170
>>>> [7.071724]  entry_SYSCALL_64_fastpath+0x23/0xc6
>>>> [7.071724] RIP: 0033:0x7fa37ca2dd3b
>>>> [7.071724] RSP: 002b:7ffc63daf400 EFLAGS: 0206 ORIG_RAX: 
>>>> 004e
>>>> [7.071724] RAX: ffda RBX: 0046 RCX: 
>>>> 7fa37ca2dd3b
>>>> [7.071724] RDX: 8000 RSI: 560b369e4a10 RDI: 
>>>> 0004
>>>> [7.071724] RBP: 7fa37cd29b20 R08: 7fa37cd29bd8 R09: 
>>>> 
>>>> [7.071724] R10: 008f R11: 0206 R12: 
>>>> 8041
>>>> [7.071724] R13: 7fa37cd29b78 R14: 270f R15: 
>>>> 7fa37cd29b78
>>>> [7.071724] Object at 88006bc2b080, in cache kmalloc-96 size: 96
>>>> [7.071724] Allocated:
>>>> [7.071724] PID = 1
>>>> [7.071724]  save_stack_trace+0x1b/0x20
>>>> [7.071724]  kasan_kmalloc.part.4+0x64/0xf0
>>>> [7.071724]  kasan_kmalloc+0x85/0xb0
>>>> [7.071724]  __kmalloc+0x12b/0x320
>>>> [7.071724]  ext4_htree_store_dirent+0x3e/0x120
>>>> [7.071724]  htree_dirblock_to_tree+0xb9/0x1a0
>>>> [7.071724]  ext4_htree_fill_tree+0xa3/0x310
>>>> [7.071724]  ext4_readdir+0x6a9/0x920
>>>> [7.071724]  iterate_dir+0x7d/0x190
>>>> [7.071724]  SyS_getdents+0xac/0x170
>>>> [7.071724]  entry_SYSCALL_64_fastpath+0x23/0xc6
>>>> [7.071724] Freed:
>>>> [7.071724] PID = 1
>>>> [7.071724]  save_stack_trace+0x1b/0x20
>>>> [7.071724]  kasan_slab_free+0xbe/0x190
>>>> [7.071724]  kfree+0xff/0x2f0
>>>> [7.071724]  acpi_ut_evaluate_object+0x18e/0x19d
>>>> [7.071724]  acpi_ut_execute_STA+0x26/0x53
>>>> [7.071724]  acpi_ns_get_device_callback+0x73/0x163
>>>> [7.071724]  acpi_ns_walk_namespace+0xc0/0x17a
>>>> [7.071724]  acpi_get_devices+0x66/0x7d
>>>> [7.071724]  pnpacpi_init+0x52/0x74
>>>> [7.071724]  do_one_initcall+0x51/0x1b0
>>>> [7.071724]  kernel_init_freeable+0x20a/0x2a1
>>>> [7.071724]  kernel_init+0xe/0x100
>>>> [7.071724]  ret_from_fork+0x31/0x40
>>>> [7.071724] Memory state around the buggy address:
>>>> [7.071724]  88006bc2af80: fc fc fc fc fc fc fc fc fc fc fc fc fc 
>>>> fc fc fc
>>>> [7.071724]  88006bc2b000: fb fb fb fb fb fb fb fb fb fb fb fb fc 
>>>> fc fc fc
>>>> [7.071724] >88006bc2b080: 00 00 00 00 00 00 00 00 05 fc fc fc fc 
>>>> fc fc fc
>>>> [7.071724]^
>>>> [7.071724]  88006bc2b100: 00 00 00 00 00 00 00 00 00 04 fc fc fc 
>>>> f

Re: kasan behavior when built with unsupported compiler

2017-03-07 Thread Nikolay Borisov



On  7.03.2017 19:51, Alexander Potapenko wrote:
> On Tue, Mar 7, 2017 at 6:33 PM, Nikolay Borisov
> <n.borisov.l...@gmail.com> wrote:
>>
>>
>> On  7.03.2017 18:05, Alexander Potapenko wrote:
>>> On Tue, Mar 7, 2017 at 4:54 PM, Dmitry Vyukov <dvyu...@google.com> wrote:
>>>> On Tue, Mar 7, 2017 at 4:35 PM, Nikolay Borisov
>>>> <n.borisov.l...@gmail.com> wrote:
>>>>> Hello,
>>>>>
>>>>> I've been chasing a particular UAF as reported by kasan
>>>>> (https://www.spinics.net/lists/kernel/msg2458136.html). However, one
>>>>> thing which I took notice of rather lately is that I was building my
>>>>> kernel with gcc 4.7.4 which is not supported by kasan as indicated by
>>>>> the following string:
>>>>>
>>>>> scripts/Makefile.kasan:19: Cannot use CONFIG_KASAN:
>>>>> -fsanitize=kernel-address is not supported by compiler
>>>>>
>>>>>
>>>>> Nevertheless, the kernel compiles and when I boot it I see the kasan
>>>>> splats as per the referenced thread. If, however, I build the kernel
>>>>> with a newer compiler version 5.4.0 kasan no longer complains.
>>>>>
>>>>>
>>>>> At this point I'm wondering whether the splats can be due to old
>>>>> compiler being used e.g. false positives or are they genuine splats and
>>>>> gcc 5 somehow obfuscates them ? Clearly despite the warning about not
>>>>> being able to use CONFIG_KASAN it is still working since I'm seeing the
>>>>> splats. Is this valid behavior ?
>>>>
>>>>
>>>> Hi,
>>>>
>>>> Re the message that kasan is not supported while it's still enabled in the 
>>>> end.
>>>> I think it's an issue related to gcc plugins. Originally kasan was
>>>> supported with 5.0+ thus the message. However, later we extended this
>>>> support to 4.5+ with gcc plugins. However, that confusing message from
>>>> build system was not fixed. So yes, it's confusing and it's something
>>>> to fix, but mostly you can just ignore it.
>>>>
>>>> Re false positives with 4.7. By default I would assume that it is true
>>>> positive. Should be easy to check with manual printfs.
>>>>
>>>> Re why 5.4 does not detect it. Difficult to say.
>>>> If you confirm that it's a real bug and provide repro instructions,
>>>> then I can recheck it with latest gcc. If it's a real bug and the
>>>> latest gcc does not detect it, then we need to look more closely at
>>>> it. I afraid 5.4 won't be fixed.
>>>> It's also possible that it's a false positive in the old compiler (I
>>>> think there were some bugs). If so, I would recommend switching to a
>>>> newer compiler.
>>>
>>> I wonder if there's actual KASAN instrumentation in the kernel in
>>> question built with GCC 4.7.
>>> If there's none, there's little point in investigating this further,
>>> as the tool is anyway barely usable.
>>> Note that the report originates from something like copy_to_user() (or
>>> hard to tell the exact place - Nikolay, could you please symbolize the
>>> report?), i.e. it could be triggered even without KASAN
>>> instrumentation.
>>
>> Of course there is kasan instrumentation, otherwise I won't see kasan 
>> reports, no ?
> Not necessarily.
> There's KASAN instrumentation inserted by the compiler, and KASAN
> instrumentation added manually to the places the compiler can't
> instrument.
>> I bisected this to commit 1771c6e1a567ea0ba2cc which adds user memory access 
>> API
> Commit 1771c6e1a567ea0ba2cc added exactly these calls to
> check_memory_region() you are seeing.
> If there is any other instrumentation inserted by the compiler, it's
> possible that it may catch accesses to the same object (if something
> else except copy_to_user() is being done).
> Otherwise the only thing you can do to investigate this bug with GCC
> 4.7 is to bisect further by rolling to earlier revisions and applying
> 1771c6e1a567ea0ba2cc on top of them.
> I won't be surprised though if at some point the bisection may stop
> for a different reason.
>> instrumentation. Here is a symbolized report:
>>
>> ==
>> BUG: KASAN: slab-out-of-bounds in filldir+0xc8/0x170 at addr 88006a22560e
>> Read of size 20 by task systemd/1
>> =

Re: kasan behavior when built with unsupported compiler

2017-03-07 Thread Nikolay Borisov



On  7.03.2017 18:05, Alexander Potapenko wrote:
> On Tue, Mar 7, 2017 at 4:54 PM, Dmitry Vyukov <dvyu...@google.com> wrote:
>> On Tue, Mar 7, 2017 at 4:35 PM, Nikolay Borisov
>> <n.borisov.l...@gmail.com> wrote:
>>> Hello,
>>>
>>> I've been chasing a particular UAF as reported by kasan
>>> (https://www.spinics.net/lists/kernel/msg2458136.html). However, one
>>> thing which I took notice of rather lately is that I was building my
>>> kernel with gcc 4.7.4 which is not supported by kasan as indicated by
>>> the following string:
>>>
>>> scripts/Makefile.kasan:19: Cannot use CONFIG_KASAN:
>>> -fsanitize=kernel-address is not supported by compiler
>>>
>>>
>>> Nevertheless, the kernel compiles and when I boot it I see the kasan
>>> splats as per the referenced thread. If, however, I build the kernel
>>> with a newer compiler version 5.4.0 kasan no longer complains.
>>>
>>>
>>> At this point I'm wondering whether the splats can be due to old
>>> compiler being used e.g. false positives or are they genuine splats and
>>> gcc 5 somehow obfuscates them ? Clearly despite the warning about not
>>> being able to use CONFIG_KASAN it is still working since I'm seeing the
>>> splats. Is this valid behavior ?
>>
>>
>> Hi,
>>
>> Re the message that kasan is not supported while it's still enabled in
>> the end.
>> I think it's an issue related to gcc plugins. Originally kasan was
>> supported with 5.0+ thus the message. However, later we extended this
>> support to 4.5+ with gcc plugins. However, that confusing message from
>> build system was not fixed. So yes, it's confusing and it's something
>> to fix, but mostly you can just ignore it.
>>
>> Re false positives with 4.7. By default I would assume that it is true
>> positive. Should be easy to check with manual printfs.
>>
>> Re why 5.4 does not detect it. Difficult to say.
>> If you confirm that it's a real bug and provide repro instructions,
>> then I can recheck it with latest gcc. If it's a real bug and the
>> latest gcc does not detect it, then we need to look more closely at
>> it. I'm afraid 5.4 won't be fixed.
>> It's also possible that it's a false positive in the old compiler (I
>> think there were some bugs). If so, I would recommend switching to a
>> newer compiler.
> 
> I wonder if there's actual KASAN instrumentation in the kernel in
> question built with GCC 4.7.
> If there's none, there's little point in investigating this further,
> as the tool is anyway barely usable.
> Note that the report originates from something like copy_to_user() (or
> hard to tell the exact place - Nikolay, could you please symbolize the
> report?), i.e. it could be triggered even without KASAN
> instrumentation.

Of course there is kasan instrumentation; otherwise I wouldn't see kasan
reports, no? I bisected this to commit 1771c6e1a567ea0ba2cc, which adds
user memory access API instrumentation. Here is a symbolized report:

==
BUG: KASAN: slab-out-of-bounds in filldir+0xc8/0x170 at addr 88006a22560e
Read of size 20 by task systemd/1
=
BUG kmalloc-96 (Not tainted): kasan: bad access detected
-

Disabling lock debugging due to kernel taint
INFO: Allocated in ext4_htree_store_dirent+0x3e/0x120 age=0 cpu=2 pid=1
[<none>] ___slab_alloc+0x636/0x6a0 mm/slub.c:2446
[<none>] __slab_alloc+0x4f/0x86 mm/slub.c:2475
[< inline >] slab_alloc_node mm/slub.c:2538
[< inline >] slab_alloc mm/slub.c:2580
[<none>] __kmalloc+0x27a/0x340 mm/slub.c:3561
[< inline >] kmalloc include/linux/slab.h:483
[< inline >] kzalloc include/linux/slab.h:622
[<none>] ext4_htree_store_dirent+0x3e/0x120 fs/ext4/dir.c:447
[<none>] htree_dirblock_to_tree+0x16a/0x190 fs/ext4/namei.c:1001
[<none>] ext4_htree_fill_tree+0xaa/0x310 fs/ext4/namei.c:1075
[< inline >] ext4_dx_readdir fs/ext4/dir.c:571
[<none>] ext4_readdir+0x698/0x950 fs/ext4/dir.c:121
[<none>] iterate_dir+0x7d/0x190 fs/readdir.c:50
[< inline >] SYSC_getdents fs/readdir.c:230
[<none>] SyS_getdents+0x91/0x120 fs/readdir.c:212
[<none>] entry_SYSCALL_64_fastpath+0x23/0xc1 arch/x86/entry/entry_64.S:207

INFO: Freed in ext4_ext_map_blocks+0x7f9/0x23e0 age=1 cpu=2 pid=1
[<none>] __slab_free+0x31b/0x440
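
The point that a report like this could fire even without compiler-inserted checks can be pictured with a toy model: a copy helper that performs an explicit range check itself, in the spirit of a hand-placed kasan_check_read() call. Everything below (the class, the function names, the set-based "shadow") is invented for illustration and is not kernel code:

```python
class ToyShadow:
    """Toy stand-in for shadow memory: remembers which byte offsets are valid."""

    def __init__(self) -> None:
        self.valid = set()

    def allow(self, start: int, size: int) -> None:
        """Mark [start, start + size) as addressable."""
        self.valid.update(range(start, start + size))

    def check_read(self, addr: int, size: int) -> bool:
        """Explicit check: True only if every byte in the range is addressable."""
        return all(a in self.valid for a in range(addr, addr + size))


def copy_out(shadow: ToyShadow, memory: bytearray, src: int, size: int,
             reports: list) -> bytes:
    """Copy helper with a manual check; no compiler support is involved."""
    if not shadow.check_read(src, size):
        reports.append((src, size))  # the model's equivalent of a report
    return bytes(memory[src:src + size])


memory = bytearray(128)
shadow = ToyShadow()
shadow.allow(0, 69)  # pretend a 69-byte object lives at offset 0

reports = []
copy_out(shadow, memory, 46, 20, reports)  # stays inside the object
copy_out(shadow, memory, 60, 20, reports)  # runs past the object's end
print(reports)
```

In this model only the second copy is recorded, and the check lives entirely in the helper. Whether the compiler also instruments ordinary loads and stores is a separate question.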

kasan behavior when built with unsupported compiler

2017-03-07 Thread Nikolay Borisov

Hello,

I've been chasing a particular UAF as reported by kasan
(https://www.spinics.net/lists/kernel/msg2458136.html). However, one
thing I noticed only recently is that I was building my kernel with
gcc 4.7.4, which is not supported by kasan, as indicated by the
following message:

scripts/Makefile.kasan:19: Cannot use CONFIG_KASAN:
-fsanitize=kernel-address is not supported by compiler


Nevertheless, the kernel compiles and when I boot it I see the kasan
splats as per the referenced thread. If, however, I build the kernel
with a newer compiler version 5.4.0 kasan no longer complains.


At this point I'm wondering whether the splats could be due to the old
compiler being used (i.e. false positives), or whether they are genuine
and gcc 5 somehow hides them. Clearly, despite the warning about not
being able to use CONFIG_KASAN, it is still working, since I'm seeing the
splats. Is this valid behavior?


Regards,
Nikolay
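
For what it's worth, the warning quoted above reads like the result of a build-time probe for the -fsanitize=kernel-address flag. A minimal sketch of such a probe, assuming a gcc-style compiler driver that accepts "-x c -c -" to compile from standard input. The function name is made up for illustration, and this is not the kernel's actual makefile logic:

```python
import os
import subprocess


def compiler_accepts_flag(cc: str, flag: str) -> bool:
    """Return True if `cc` compiles an empty C source with `flag` and no errors."""
    cmd = [cc, flag, "-Werror", "-x", "c", "-c", "-", "-o", os.devnull]
    try:
        result = subprocess.run(
            cmd,
            input=b"",                  # empty translation unit on stdin
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # Compiler binary missing or not executable.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    ok = compiler_accepts_flag("gcc", "-fsanitize=kernel-address")
    print("kernel-address sanitizer flag accepted:", ok)
```

A probe like this only says whether the compiler understands the flag. It says nothing about checks that are written by hand in the source.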

Race condition in ext4 (was Re: 4.11-rc1 acpi stomping ext4 slabs)

2017-03-07 Thread Nikolay Borisov



On  7.03.2017 11:38, Nikolay Borisov wrote:
> 
> 
> On  7.03.2017 00:35, Rafael J. Wysocki wrote:
>> On Mon, Mar 6, 2017 at 9:31 PM, Nikolay Borisov
>> <n.borisov.l...@gmail.com> wrote:
>>> Hello,
>>>
>>> Booting 4.11-rc1 with kasan enabled and "slub_debug=F" produces the
>>> following errors:
>>>
>>> [7.070797] ==
>>> [7.071724] BUG: KASAN: slab-out-of-bounds in filldir+0xc3/0x160 at addr 88006bc2b0ae
>>> [7.071724] Read of size 20 by task systemd/1
>>> [7.071724] CPU: 1 PID: 1 Comm: systemd Not tainted 4.11.0-rc1-nbor #150
>>> [7.071724] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>>> [7.071724] Call Trace:
>>> [7.071724]  dump_stack+0x85/0xc9
>>> [7.071724]  kasan_object_err+0x2c/0x90
>>> [7.071724]  kasan_report+0x285/0x510
>>> [7.071724]  check_memory_region+0x137/0x160
>>> [7.071724]  kasan_check_read+0x11/0x20
>>> [7.071724]  filldir+0xc3/0x160
>>> [7.071724]  call_filldir+0x88/0x140
>>> [7.071724]  ext4_readdir+0x757/0x920
>>> [7.071724]  ? iterate_dir+0x49/0x190
>>> [7.071724]  iterate_dir+0x7d/0x190
>>> [7.071724]  ? entry_SYSCALL_64_fastpath+0x5/0xc6
>>> [7.071724]  SyS_getdents+0xac/0x170
>>> [7.071724]  ? filldir64+0x170/0x170
>>> [7.071724]  entry_SYSCALL_64_fastpath+0x23/0xc6
>>> [7.071724] RIP: 0033:0x7fa37ca2dd3b
>>> [7.071724] RSP: 002b:7ffc63daf400 EFLAGS: 0206 ORIG_RAX: 004e
>>> [7.071724] RAX: ffda RBX: 0046 RCX: 7fa37ca2dd3b
>>> [7.071724] RDX: 8000 RSI: 560b369e4a10 RDI: 0004
>>> [7.071724] RBP: 7fa37cd29b20 R08: 7fa37cd29bd8 R09:
>>> [7.071724] R10: 008f R11: 0206 R12: 8041
>>> [7.071724] R13: 7fa37cd29b78 R14: 270f R15: 7fa37cd29b78
>>> [7.071724] Object at 88006bc2b080, in cache kmalloc-96 size: 96
>>> [7.071724] Allocated:
>>> [7.071724] PID = 1
>>> [7.071724]  save_stack_trace+0x1b/0x20
>>> [7.071724]  kasan_kmalloc.part.4+0x64/0xf0
>>> [7.071724]  kasan_kmalloc+0x85/0xb0
>>> [7.071724]  __kmalloc+0x12b/0x320
>>> [7.071724]  ext4_htree_store_dirent+0x3e/0x120
>>> [7.071724]  htree_dirblock_to_tree+0xb9/0x1a0
>>> [7.071724]  ext4_htree_fill_tree+0xa3/0x310
>>> [7.071724]  ext4_readdir+0x6a9/0x920
>>> [7.071724]  iterate_dir+0x7d/0x190
>>> [7.071724]  SyS_getdents+0xac/0x170
>>> [7.071724]  entry_SYSCALL_64_fastpath+0x23/0xc6
>>> [7.071724] Freed:
>>> [7.071724] PID = 1
>>> [7.071724]  save_stack_trace+0x1b/0x20
>>> [7.071724]  kasan_slab_free+0xbe/0x190
>>> [7.071724]  kfree+0xff/0x2f0
>>> [7.071724]  acpi_ut_evaluate_object+0x18e/0x19d
>>> [7.071724]  acpi_ut_execute_STA+0x26/0x53
>>> [7.071724]  acpi_ns_get_device_callback+0x73/0x163
>>> [7.071724]  acpi_ns_walk_namespace+0xc0/0x17a
>>> [7.071724]  acpi_get_devices+0x66/0x7d
>>> [7.071724]  pnpacpi_init+0x52/0x74
>>> [7.071724]  do_one_initcall+0x51/0x1b0
>>> [7.071724]  kernel_init_freeable+0x20a/0x2a1
>>> [7.071724]  kernel_init+0xe/0x100
>>> [7.071724]  ret_from_fork+0x31/0x40
>>> [7.071724] Memory state around the buggy address:
>>> [7.071724]  88006bc2af80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>>> [7.071724]  88006bc2b000: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>>> [7.071724] >88006bc2b080: 00 00 00 00 00 00 00 00 05 fc fc fc fc fc fc fc
>>> [7.071724]^
>>> [7.071724]  88006bc2b100: 00 00 00 00 00 00 00 00 00 04 fc fc fc fc fc fc
>>> [7.071724]  88006bc2b180: 00 00 00 00 00 00 00 00 00 00 fc fc fc fc fc fc
>>>
>>> Not killing the VM instantly produces a continuous stream of kasan errors.
>>> Most of them are identical to the one above, however there was one which
>>> was different:
>>>
>>> [5.846193] =

Re: 4.11-rc1 acpi stomping ext4 slabs

2017-03-07 Thread Nikolay Borisov



On  7.03.2017 00:35, Rafael J. Wysocki wrote:
> On Mon, Mar 6, 2017 at 9:31 PM, Nikolay Borisov
> <n.borisov.l...@gmail.com> wrote:
>> Hello,
>>
>> Booting 4.11-rc1 with kasan enabled and "slub_debug=F" produces the
>> following errors:
>>
>> [7.070797] ==
>> [7.071724] BUG: KASAN: slab-out-of-bounds in filldir+0xc3/0x160 at addr 88006bc2b0ae
>> [7.071724] Read of size 20 by task systemd/1
>> [7.071724] CPU: 1 PID: 1 Comm: systemd Not tainted 4.11.0-rc1-nbor #150
>> [7.071724] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
>> [7.071724] Call Trace:
>> [7.071724]  dump_stack+0x85/0xc9
>> [7.071724]  kasan_object_err+0x2c/0x90
>> [7.071724]  kasan_report+0x285/0x510
>> [7.071724]  check_memory_region+0x137/0x160
>> [7.071724]  kasan_check_read+0x11/0x20
>> [7.071724]  filldir+0xc3/0x160
>> [7.071724]  call_filldir+0x88/0x140
>> [7.071724]  ext4_readdir+0x757/0x920
>> [7.071724]  ? iterate_dir+0x49/0x190
>> [7.071724]  iterate_dir+0x7d/0x190
>> [7.071724]  ? entry_SYSCALL_64_fastpath+0x5/0xc6
>> [7.071724]  SyS_getdents+0xac/0x170
>> [7.071724]  ? filldir64+0x170/0x170
>> [7.071724]  entry_SYSCALL_64_fastpath+0x23/0xc6
>> [7.071724] RIP: 0033:0x7fa37ca2dd3b
>> [7.071724] RSP: 002b:7ffc63daf400 EFLAGS: 0206 ORIG_RAX: 004e
>> [7.071724] RAX: ffda RBX: 0046 RCX: 7fa37ca2dd3b
>> [7.071724] RDX: 8000 RSI: 560b369e4a10 RDI: 0004
>> [7.071724] RBP: 7fa37cd29b20 R08: 7fa37cd29bd8 R09:
>> [7.071724] R10: 008f R11: 0206 R12: 8041
>> [7.071724] R13: 7fa37cd29b78 R14: 270f R15: 7fa37cd29b78
>> [7.071724] Object at 88006bc2b080, in cache kmalloc-96 size: 96
>> [7.071724] Allocated:
>> [7.071724] PID = 1
>> [7.071724]  save_stack_trace+0x1b/0x20
>> [7.071724]  kasan_kmalloc.part.4+0x64/0xf0
>> [7.071724]  kasan_kmalloc+0x85/0xb0
>> [7.071724]  __kmalloc+0x12b/0x320
>> [7.071724]  ext4_htree_store_dirent+0x3e/0x120
>> [7.071724]  htree_dirblock_to_tree+0xb9/0x1a0
>> [7.071724]  ext4_htree_fill_tree+0xa3/0x310
>> [7.071724]  ext4_readdir+0x6a9/0x920
>> [7.071724]  iterate_dir+0x7d/0x190
>> [7.071724]  SyS_getdents+0xac/0x170
>> [7.071724]  entry_SYSCALL_64_fastpath+0x23/0xc6
>> [7.071724] Freed:
>> [7.071724] PID = 1
>> [7.071724]  save_stack_trace+0x1b/0x20
>> [7.071724]  kasan_slab_free+0xbe/0x190
>> [7.071724]  kfree+0xff/0x2f0
>> [7.071724]  acpi_ut_evaluate_object+0x18e/0x19d
>> [7.071724]  acpi_ut_execute_STA+0x26/0x53
>> [7.071724]  acpi_ns_get_device_callback+0x73/0x163
>> [7.071724]  acpi_ns_walk_namespace+0xc0/0x17a
>> [7.071724]  acpi_get_devices+0x66/0x7d
>> [7.071724]  pnpacpi_init+0x52/0x74
>> [7.071724]  do_one_initcall+0x51/0x1b0
>> [7.071724]  kernel_init_freeable+0x20a/0x2a1
>> [7.071724]  kernel_init+0xe/0x100
>> [7.071724]  ret_from_fork+0x31/0x40
>> [7.071724] Memory state around the buggy address:
>> [7.071724]  88006bc2af80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>> [7.071724]  88006bc2b000: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>> [7.071724] >88006bc2b080: 00 00 00 00 00 00 00 00 05 fc fc fc fc fc fc fc
>> [7.071724]^
>> [7.071724]  88006bc2b100: 00 00 00 00 00 00 00 00 00 04 fc fc fc fc fc fc
>> [7.071724]  88006bc2b180: 00 00 00 00 00 00 00 00 00 00 fc fc fc fc fc fc
>>
>> Not killing the VM instantly produces a continuous stream of kasan errors.
>> Most of them are identical to the one above, however there was one which
>> was different:
>>
>> [5.846193] ==
>> [5.846787] BUG: KASAN: slab-out-of-bounds in filldir+0xc3/0x160 at addr 88006c783eae
>> [5.847177] Read of size 22 by task systemd/1
>> [5.847177] CPU: 3 PID: 1 Comm: systemd Tainted: GB   4.11.0-rc1-nbor #150
>> [5.847177] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
>>

4.11-rc1 acpi stomping ext4 slabs

2017-03-06 Thread Nikolay Borisov

Hello, 

Booting 4.11-rc1 with kasan enabled and "slub_debug=F" produces the
following errors:

[7.070797] ==
[7.071724] BUG: KASAN: slab-out-of-bounds in filldir+0xc3/0x160 at addr 88006bc2b0ae
[7.071724] Read of size 20 by task systemd/1
[7.071724] CPU: 1 PID: 1 Comm: systemd Not tainted 4.11.0-rc1-nbor #150
[7.071724] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[7.071724] Call Trace:
[7.071724]  dump_stack+0x85/0xc9
[7.071724]  kasan_object_err+0x2c/0x90
[7.071724]  kasan_report+0x285/0x510
[7.071724]  check_memory_region+0x137/0x160
[7.071724]  kasan_check_read+0x11/0x20
[7.071724]  filldir+0xc3/0x160
[7.071724]  call_filldir+0x88/0x140
[7.071724]  ext4_readdir+0x757/0x920
[7.071724]  ? iterate_dir+0x49/0x190
[7.071724]  iterate_dir+0x7d/0x190
[7.071724]  ? entry_SYSCALL_64_fastpath+0x5/0xc6
[7.071724]  SyS_getdents+0xac/0x170
[7.071724]  ? filldir64+0x170/0x170
[7.071724]  entry_SYSCALL_64_fastpath+0x23/0xc6
[7.071724] RIP: 0033:0x7fa37ca2dd3b
[7.071724] RSP: 002b:7ffc63daf400 EFLAGS: 0206 ORIG_RAX: 004e
[7.071724] RAX: ffda RBX: 0046 RCX: 7fa37ca2dd3b
[7.071724] RDX: 8000 RSI: 560b369e4a10 RDI: 0004
[7.071724] RBP: 7fa37cd29b20 R08: 7fa37cd29bd8 R09: 
[7.071724] R10: 008f R11: 0206 R12: 8041
[7.071724] R13: 7fa37cd29b78 R14: 270f R15: 7fa37cd29b78
[7.071724] Object at 88006bc2b080, in cache kmalloc-96 size: 96
[7.071724] Allocated:
[7.071724] PID = 1
[7.071724]  save_stack_trace+0x1b/0x20
[7.071724]  kasan_kmalloc.part.4+0x64/0xf0
[7.071724]  kasan_kmalloc+0x85/0xb0
[7.071724]  __kmalloc+0x12b/0x320
[7.071724]  ext4_htree_store_dirent+0x3e/0x120
[7.071724]  htree_dirblock_to_tree+0xb9/0x1a0
[7.071724]  ext4_htree_fill_tree+0xa3/0x310
[7.071724]  ext4_readdir+0x6a9/0x920
[7.071724]  iterate_dir+0x7d/0x190
[7.071724]  SyS_getdents+0xac/0x170
[7.071724]  entry_SYSCALL_64_fastpath+0x23/0xc6
[7.071724] Freed:
[7.071724] PID = 1
[7.071724]  save_stack_trace+0x1b/0x20
[7.071724]  kasan_slab_free+0xbe/0x190
[7.071724]  kfree+0xff/0x2f0
[7.071724]  acpi_ut_evaluate_object+0x18e/0x19d
[7.071724]  acpi_ut_execute_STA+0x26/0x53
[7.071724]  acpi_ns_get_device_callback+0x73/0x163
[7.071724]  acpi_ns_walk_namespace+0xc0/0x17a
[7.071724]  acpi_get_devices+0x66/0x7d
[7.071724]  pnpacpi_init+0x52/0x74
[7.071724]  do_one_initcall+0x51/0x1b0
[7.071724]  kernel_init_freeable+0x20a/0x2a1
[7.071724]  kernel_init+0xe/0x100
[7.071724]  ret_from_fork+0x31/0x40
[7.071724] Memory state around the buggy address:
[7.071724]  88006bc2af80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[7.071724]  88006bc2b000: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
[7.071724] >88006bc2b080: 00 00 00 00 00 00 00 00 05 fc fc fc fc fc fc fc
[7.071724]^
[7.071724]  88006bc2b100: 00 00 00 00 00 00 00 00 00 04 fc fc fc fc fc fc
[7.071724]  88006bc2b180: 00 00 00 00 00 00 00 00 00 00 fc fc fc fc fc fc
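
As a reading aid for the memory-state dump above, here is a small sketch that decodes the row marked with '>'. It assumes the usual generic-KASAN shadow encoding: one shadow byte per 8 bytes of memory, 00 meaning all 8 bytes are addressable, 01..07 meaning only the first N bytes are, and values such as fc/fb marking redzone or freed memory. The helper names are invented for illustration, and the addresses are used exactly as printed in the log:

```python
SHADOW_SCALE = 8  # bytes of memory described by one shadow byte (assumed)


def addressable_bytes(shadow_row: bytes) -> int:
    """Count the addressable bytes at the start of the region a shadow row covers."""
    total = 0
    for s in shadow_row:
        if s == 0x00:          # whole 8-byte granule is addressable
            total += SHADOW_SCALE
        elif 1 <= s <= 7:      # only the first s bytes of this granule
            total += s
            break
        else:                  # redzone / freed marker: object ends here
            break
    return total


def access_in_bounds(shadow_row: bytes, offset: int, size: int) -> bool:
    """True if [offset, offset + size) lies within the addressable prefix."""
    return offset >= 0 and offset + size <= addressable_bytes(shadow_row)


# Row marked '>' in the dump, starting at the object address.
row = bytes.fromhex("00 00 00 00 00 00 00 00 05 fc fc fc fc fc fc fc")
object_base = 0x88006BC2B080   # "Object at 88006bc2b080"
access_addr = 0x88006BC2B0AE   # "at addr 88006bc2b0ae"
offset = access_addr - object_base

print(addressable_bytes(row), offset, access_in_bounds(row, offset, 20))
```

Under those assumptions the object has 69 addressable bytes and the reported read covers offsets 46 to 65, so the sketch reports the access as in bounds. That is only an arithmetic cross-check of the dump, not a diagnosis.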

Not killing the VM instantly produces a continuous stream of kasan errors.
Most of them are identical to the one above, however there was one which
was different:

[5.846193] ==
[5.846787] BUG: KASAN: slab-out-of-bounds in filldir+0xc3/0x160 at addr 88006c783eae
[5.847177] Read of size 22 by task systemd/1
[5.847177] CPU: 3 PID: 1 Comm: systemd Tainted: GB   4.11.0-rc1-nbor #150
[5.847177] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
[5.847177] Call Trace:
[5.847177]  dump_stack+0x85/0xc9
[5.847177]  kasan_object_err+0x2c/0x90
[5.847177]  kasan_report+0x285/0x510
[5.847177]  check_memory_region+0x137/0x160
[5.847177]  kasan_check_read+0x11/0x20
[5.847177]  filldir+0xc3/0x160
[5.847177]  call_filldir+0x88/0x140
[5.847177]  ext4_readdir+0x757/0x920
[5.847177]  ? iterate_dir+0x49/0x190
[5.847177]  iterate_dir+0x7d/0x190
[5.847177]  ? entry_SYSCALL_64_fastpath+0x5/0xc6
[5.847177]  SyS_getdents+0xac/0x170
[5.847177]  ? filldir64+0x170/0x170
[5.847177]  entry_SYSCALL_64_fastpath+0x23/0xc6
[5.847177] RIP: 0033:0x7f9dbd4e1d3b
[5.847177] RSP: 002b:7ffee6b51a60 EFLAGS: 0206 ORIG_RAX: 004e
[5.847177] RAX: ffda RBX: 0046 RCX: 7f9dbd4e1d3b
[5.847177] RDX: 8000 RSI: 55c046802a10 RDI: 0004
[5.847177] RBP: 7f9dbd7ddb20 R08: 7f9dbd7ddbd8 R09:

[PATCH v3] lockdep: Teach lockdep about memalloc_noio_save

2017-03-01 Thread Nikolay Borisov

Commit 21caf2fc1931 ("mm: teach mm by current context info to not do I/O
during memory allocation") added the memalloc_noio_(save|restore) functions
to enable people to modify the MM behavior by disabling I/O during memory
allocation. This was further extended in commit 934f3072c17c ("mm: clear
__GFP_FS when PF_MEMALLOC_NOIO is set"). The memalloc_noio_* functions prevent
allocation paths from recursing back into the filesystem without explicitly
changing the flags for every allocation site. However, lockdep hasn't been
keeping up with the changes and entirely misses the memalloc_noio
adjustments. Instead, it is left to the callers of __lockdep_trace_alloc to
call the function after they have shaved the respective GFP flags.

Let's fix this by making lockdep explicitly do the shaving of the respective
GFP flags.

Fixes: 934f3072c17c ("mm: clear __GFP_FS when PF_MEMALLOC_NOIO is set")
Signed-off-by: Nikolay Borisov <nbori...@suse.com>
---
 kernel/locking/lockdep.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Changes since v2: 
* Incorporate Michal's suggestion of using memalloc_noio_flags 
explicitly. 
* Tune the commit message to make the problem statement a bit more
descriptive. 

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 9812e5dd409e..565506c9e99c 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2861,6 +2861,8 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 	if (unlikely(!debug_locks))
 		return;
 
+	gfp_mask = memalloc_noio_flags(gfp_mask);
+
 	/* no reclaim without waiting on it */
 	if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
 		return;
@@ -3852,7 +3854,7 @@ EXPORT_SYMBOL_GPL(lock_unpin_lock);
 
 void lockdep_set_current_reclaim_state(gfp_t gfp_mask)
 {
-	current->lockdep_reclaim_gfp = gfp_mask;
+	current->lockdep_reclaim_gfp = memalloc_noio_flags(gfp_mask);
 }
 
 void lockdep_clear_current_reclaim_state(void)
-- 
2.7.4

Re: [PATCH] lockdep: Teach lockdep about memalloc_noio_save

2017-03-01 Thread Nikolay Borisov



On  1.03.2017 12:31, Michal Hocko wrote:
> On Wed 01-03-17 11:22:51, Vlastimil Babka wrote:
>> On 03/01/2017 08:48 AM, Nikolay Borisov wrote:
>>> Commit 21caf2fc1931 ("mm: teach mm by current context info to not do I/O
>>> during memory allocation") added the memalloc_noio_(save|restore) functions
>>> to enable people to modify the MM behavior by disabling I/O during memory
>>> allocation. This prevents allocation paths recursing back into the
>>> filesystem without explicitly changing the flags for every allocation
>>> site. Yet, lockdep not being aware of that is prone to showing false
>>> positives. Fix this by teaching it that the presence of PF_MEMALLOC_NOIO
>>> flag means we are not going to issue any I/O.
>>>
>>> Signed-off-by: Nikolay Borisov <nbori...@suse.com>
>>> ---
>>>  kernel/locking/lockdep.c | 3 ++-
>>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
>>> index 9812e5dd409e..5715fdcede28 100644
>>> --- a/kernel/locking/lockdep.c
>>> +++ b/kernel/locking/lockdep.c
>>> @@ -2866,7 +2866,8 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
>>> return;
>>>  
>>> /* this guy won't enter reclaim */
>>> -   if ((curr->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
>>> +   if (((curr->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC)) ||
>>> +   curr->flags & PF_MEMALLOC_NOIO)
>>
>> It would be slightly better to use memalloc_noio_flags() here. Michal is
>> planning to convert it to take also a new PF_MEMALLOC_NOFS flag into
>> account, and there would be less chance of forgetting to update this place.
> 
> Yes, you are right. The following should do the trick. I am really
> surprised we haven't noticed this before. I thought we were shaving the
> gfp_mask before the allocator goes the lockdep_trace_alloc way. But it
> is not and what is worse SLAB tracks this as well so we cannot rely on
> the proper gfp mask. The positive thing is that the recursion avoidance
> works because we always clear GFP_IO and GFP_FS when doing reclaim.

Okay I will send a revised patch, doing it the way you suggested.

> 
> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
> index 7c38f8f3d97b..0c70b26849ce 100644
> --- a/kernel/locking/lockdep.c
> +++ b/kernel/locking/lockdep.c
> @@ -2861,6 +2861,8 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
>   if (unlikely(!debug_locks))
>   return;
>  
> + gfp_mask = memalloc_noio_flags(gfp_mask);
> +
>   /* no reclaim without waiting on it */
>   if (!(gfp_mask & __GFP_DIRECT_RECLAIM))
>   return;
>

Re: [PATCH v2] lockdep: Teach lockdep about memalloc_noio_save

2017-03-01 Thread Nikolay Borisov



On  1.03.2017 11:46, Peter Zijlstra wrote:
> On Wed, Mar 01, 2017 at 09:59:00AM +0200, Nikolay Borisov wrote:
>> Commit 21caf2fc1931 ("mm: teach mm by current context info to not do I/O
>> during memory allocation") added the memalloc_noio_(save|restore) functions
>> to enable people to modify the MM behavior by disabling I/O during memory
>> allocation. This prevents allocation paths recursing back into the filesystem
>> without explicitly changing the flags for every allocation site. Yet, lockdep
>> not being aware of that is prone to showing false positives. Fix this
>> by teaching it that the presence of PF_MEMALLOC_NOIO flag means we are not
>> going to issue any I/O.
> 
> I'm not up to date on the specifics, but GFP_IO is separate from GFP_FS.
> 
> And MEMALLOC_NOIO only clears GFP_IO but leaves GFP_FS set.

static inline gfp_t memalloc_noio_flags(gfp_t flags)
{
	if (unlikely(current->flags & PF_MEMALLOC_NOIO))
		flags &= ~(__GFP_IO | __GFP_FS);
	return flags;
}
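
To make the point about the helper easier to check outside the kernel, here is
a minimal userspace sketch of the same masking logic. Everything prefixed
demo_/DEMO_ is made up for illustration: the bit values are not the kernel's,
and the task is passed explicitly instead of using "current". It only shows
that, when the NOIO task flag is set, both the IO and the FS bits are cleared
while unrelated bits are left alone.

```c
/* Userspace model of memalloc_noio_flags(); all DEMO_ values are
 * illustrative assumptions, not the real kernel constants. */
typedef unsigned int demo_gfp_t;

#define DEMO_GFP_IO            0x1u
#define DEMO_GFP_FS            0x2u
#define DEMO_GFP_OTHER         0x4u  /* stands in for any unrelated GFP bit */
#define DEMO_PF_MEMALLOC_NOIO  0x1u

/* Hypothetical stand-in for the task whose flags the kernel reads
 * through "current". */
struct demo_task {
	unsigned int flags;
};

static demo_gfp_t demo_memalloc_noio_flags(const struct demo_task *task,
					   demo_gfp_t flags)
{
	/* With NOIO set on the task, drop both the IO and the FS bits. */
	if (task->flags & DEMO_PF_MEMALLOC_NOIO)
		flags &= ~(DEMO_GFP_IO | DEMO_GFP_FS);
	return flags;
}
```

A task without the NOIO flag gets its mask back unchanged, so callers that
never use memalloc_noio_save() see no difference.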


> 
> Therefore I think your change is wrong, but I might have overlooked a
> detail. Added original authors to Cc to clarify.
> 
> 
>> Signed-off-by: Nikolay Borisov <nbori...@suse.com>
>> ---
>>  kernel/locking/lockdep.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>>
>>  It turned out there was another place where RECLAIM_FS was being set; fix it
>>  by using memalloc_noio_flags when setting ->lockdep_reclaim_gfp.
>>
>> diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
>> index 9812e5dd409e..87cf9910e66f 100644
>> --- a/kernel/locking/lockdep.c
>> +++ b/kernel/locking/lockdep.c
>> @@ -2866,7 +2866,8 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
>>  return;
>>  
>>  /* this guy won't enter reclaim */
>> -if ((curr->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
>> +if (((curr->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC)) ||
>> +curr->flags & PF_MEMALLOC_NOIO)
>>  return;
>>  
>>  /* We're only interested __GFP_FS allocations for now */
>> @@ -3852,7 +3853,7 @@ EXPORT_SYMBOL_GPL(lock_unpin_lock);
>>  
>>  void lockdep_set_current_reclaim_state(gfp_t gfp_mask)
>>  {
>> -current->lockdep_reclaim_gfp = gfp_mask;
>> +current->lockdep_reclaim_gfp = memalloc_noio_flags(gfp_mask);
>>  }
>>  
>>  void lockdep_clear_current_reclaim_state(void)
>> -- 
>> 2.7.4
>>
>

[PATCH v2] lockdep: Teach lockdep about memalloc_noio_save

2017-03-01 Thread Nikolay Borisov

Commit 21caf2fc1931 ("mm: teach mm by current context info to not do I/O
during memory allocation") added the memalloc_noio_(save|restore) functions
to enable people to modify the MM behavior by disabling I/O during memory
allocation. This prevents allocation paths recursing back into the filesystem
without explicitly changing the flags for every allocation site. Yet, lockdep
not being aware of that is prone to showing false positives. Fix this
by teaching it that the presence of PF_MEMALLOC_NOIO flag means we are not
going to issue any I/O.

Signed-off-by: Nikolay Borisov <nbori...@suse.com>
---
 kernel/locking/lockdep.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)


 It turned out there was another place where RECLAIM_FS was being set; fix it
 by using memalloc_noio_flags when setting ->lockdep_reclaim_gfp.

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 9812e5dd409e..87cf9910e66f 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2866,7 +2866,8 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 		return;
 
 	/* this guy won't enter reclaim */
-	if ((curr->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
+	if (((curr->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC)) ||
+	    curr->flags & PF_MEMALLOC_NOIO)
 		return;
 
 	/* We're only interested __GFP_FS allocations for now */
@@ -3852,7 +3853,7 @@ EXPORT_SYMBOL_GPL(lock_unpin_lock);
 
 void lockdep_set_current_reclaim_state(gfp_t gfp_mask)
 {
-	current->lockdep_reclaim_gfp = gfp_mask;
+	current->lockdep_reclaim_gfp = memalloc_noio_flags(gfp_mask);
 }
 
 void lockdep_clear_current_reclaim_state(void)
-- 
2.7.4

[PATCH] lockdep: Teach lockdep about memalloc_noio_save

2017-02-28 Thread Nikolay Borisov

Commit 21caf2fc1931 ("mm: teach mm by current context info to not do I/O
during memory allocation") added the memalloc_noio_(save|restore) functions
to enable people to modify the MM behavior by disabling I/O during memory
allocation. This prevents allocation paths recursing back into the filesystem
without explicitly changing the flags for every allocation site. Yet, lockdep
not being aware of that is prone to showing false positives. Fix this
by teaching it that the presence of PF_MEMALLOC_NOIO flag means we are not
going to issue any I/O.

Signed-off-by: Nikolay Borisov <nbori...@suse.com>
---
 kernel/locking/lockdep.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/lockdep.c b/kernel/locking/lockdep.c
index 9812e5dd409e..5715fdcede28 100644
--- a/kernel/locking/lockdep.c
+++ b/kernel/locking/lockdep.c
@@ -2866,7 +2866,8 @@ static void __lockdep_trace_alloc(gfp_t gfp_mask, unsigned long flags)
 		return;
 
 	/* this guy won't enter reclaim */
-	if ((curr->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC))
+	if (((curr->flags & PF_MEMALLOC) && !(gfp_mask & __GFP_NOMEMALLOC)) ||
+	    curr->flags & PF_MEMALLOC_NOIO)
 		return;
 
 	/* We're only interested __GFP_FS allocations for now */
-- 
2.7.4

Re: [PATCH][net-next] net: bridge: remove redundant check to see if err is set

2017-02-07 Thread Nikolay Aleksandrov

On 07/02/17 11:56, Colin King wrote:
> From: Colin Ian King <colin.k...@canonical.com>
> 
> The error check on err is redundant as it is being checked
> previously each time it has been updated.  Remove this redundant
> check.
> 
> Detected with CoverityScan, CID#140030("Logically dead code")
> 
> Signed-off-by: Colin Ian King <colin.k...@canonical.com>
> ---
>  net/bridge/br_netlink.c | 3 ---
>  1 file changed, 3 deletions(-)
> 
> diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c
> index fc5d885..cdc4e2a 100644
> --- a/net/bridge/br_netlink.c
> +++ b/net/bridge/br_netlink.c
> @@ -612,9 +612,6 @@ static int br_afspec(struct net_bridge *br,
>   return err;
>   break;
>   }
> -
> - if (err)
> - return err;
>   }
>  
>   return err;
> 

Actually, that code can be reduced further; I'll follow up with a patch later.

Reviewed-by: Nikolay Aleksandrov <niko...@cumulusnetworks.com>

Re: [PATCH net-next v5] bridge: multicast to unicast

2017-01-21 Thread Nikolay Aleksandrov

On 21/01/17 21:01, Linus Lüssing wrote:
> From: Felix Fietkau <n...@nbd.name>
> 
> Implements an optional, per bridge port flag and feature to deliver
> multicast packets to any host on the according port via unicast
> individually. This is done by copying the packet per host and
> changing the multicast destination MAC to a unicast one accordingly.
> 
> multicast-to-unicast works on top of the multicast snooping feature of
> the bridge, which means unicast copies are only delivered to hosts which
> are interested in them and signalled this via IGMP/MLD reports
> previously.
> 
> This feature is intended for interface types which have a more reliable
> and/or efficient way to deliver unicast packets than broadcast ones
> (e.g. wifi).
> 
> However, it should only be enabled on interfaces where no IGMPv2/MLDv1
> report suppression takes place. This feature is disabled by default.
> 
> The initial patch and idea is from Felix Fietkau.
> 
> Signed-off-by: Felix Fietkau <n...@nbd.name>
> [linus.luess...@c0d3.blue: various bug + style fixes, commit message]
> Signed-off-by: Linus Lüssing <linus.luess...@c0d3.blue>
> 
> ---
> 
> This feature is used and enabled by default in OpenWRT and LEDE for AP
> interfaces for more than a year now to allow both a more robust multicast
> delivery and multicast at higher rates (e.g. multicast streaming).
> 
> In OpenWRT/LEDE the IGMP/MLD report suppression issue is overcome by
> the network daemon enabling AP isolation and by that separating all STAs.
> Delivery of STA-to-STA IP multicast is made possible again by
> enabling and utilizing the bridge hairpin mode, which considers the
> incoming port as a potential outgoing port, too.
> 
> Hairpin mode is performed after multicast snooping, so reports are
> only delivered to STAs running a multicast router.
> 
> Changes in v5:
> * fix a potential pagefault in br_ip6_multicast_mld2_report():
>   -> a pskb_may_pull() might reallocate skb->data, therefore perform
>  the "src = eth_hdr(skb)->h_source" only afterwards
> * simplify code by always adding ether source address to a port group
>   and checking the per-port multicast-to-unicast flag instead of a
>   per-port-group one (thanks Stephen!)
> 
> Changes in v4:
> * readd "From: Felix Fietkau [...]" (missed it again in v3)
> 
> Changes in v3:
> * fix an uninitialized variable bug introduced in br_multicast_flood()
>   in v2, found by kbuild test bot
> 
> Changes in v2:
> * netlink support (thanks Nik!)
> * fixed switching between multicast_to_unicast on/off
>   -> even after toggling, an already existing entry would
>      stay stale in its mode and would never be replaced
>   -> new extra check in br_port_group_equal()
> * reduced checks in br_multicast_flood() from two to one
>   to address fast-path concerns (thanks Nik!)
> * rev-christmas tree ordering (thanks Nik!)
> * removed "net_bridge_port_group::unicast", using
>   ::flags instead (thanks Nik!)
> * BR_MULTICAST_TO_UCAST -> BR_MULTICAST_TO_UNICAST
>   (BR_MULTICAST_FAST_LEAVE has the same length anyway)
> * simplified maybe_deliver_addr()
>   (no return, no "prev" parameter -> was a NOP anyway)
> * added "From: Felix Fietkau [...]"
> * added "Signed-off-by: Felix Fietkau [...]"
> ---
>  include/linux/if_bridge.h    |  1 +
>  include/uapi/linux/if_link.h |  1 +
>  net/bridge/br_forward.c      | 39 ++++++-
>  net/bridge/br_mdb.c          |  2 +-
>  net/bridge/br_multicast.c    | 90
>  net/bridge/br_netlink.c      |  5 +++
>  net/bridge/br_private.h      |  3 +-
>  net/bridge/br_sysfs_if.c     |  2 +
>  8 files changed, 114 insertions(+), 29 deletions(-)
> 

Reviewed-by: Nikolay Aleksandrov <niko...@cumulusnetworks.com>
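
For readers unfamiliar with the idea, the per-host copy described in the
commit message can be modelled in a few lines of userspace C. This is only a
sketch under simplifying assumptions, not the bridge code: struct demo_frame
and demo_mcast_to_ucast() are hypothetical names, the list of interested
hosts is passed in directly (in the bridge it would come from the IGMP/MLD
snooping state), and the "delivery" is just a write into an output array.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_ETH_ALEN 6

/* Hypothetical, heavily simplified Ethernet frame. */
struct demo_frame {
	unsigned char dst[DEMO_ETH_ALEN];
	unsigned char src[DEMO_ETH_ALEN];
	unsigned char payload[32];
};

/*
 * Make one copy of the multicast frame per interested host and replace
 * the multicast destination MAC with that host's unicast MAC.
 * Returns the number of copies written to out[] (at most out_cap).
 */
static size_t demo_mcast_to_ucast(const struct demo_frame *in,
				  const unsigned char (*hosts)[DEMO_ETH_ALEN],
				  size_t nhosts,
				  struct demo_frame *out, size_t out_cap)
{
	size_t n = 0;

	for (size_t i = 0; i < nhosts && n < out_cap; i++) {
		out[n] = *in;                                /* copy per host */
		memcpy(out[n].dst, hosts[i], DEMO_ETH_ALEN); /* rewrite dst MAC */
		n++;
	}
	return n;
}
```

The original frame is left untouched, so the caller can still forward it as
multicast on ports where the flag is not set.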

Re: namespace: deadlock in dec_pid_namespaces

2017-01-20 Thread Nikolay Borisov



On 20.01.2017 20:05, Eric W. Biederman wrote:
> Nikolay Borisov <n.borisov.l...@gmail.com> writes:
> 
>> On 20.01.2017 15:07, Dmitry Vyukov wrote:
>>> Hello,
>>>
>>> I've got the following deadlock report while running syzkaller fuzzer
>>> on eec0d3d065bfcdf9cd5f56dd2a36b94d12d32297 of linux-next (on odroid
>>> device if it matters):
> 
> I am puzzled; I thought we had fixed this with:
>   add7c65ca426 ("pid: fix lockdep deadlock warning due to ucount_lock")
> But apparently not.  We just moved it from hardirq to softirq context.  Bah.
> 
> Thank you very much for the report.
> 
> Nikolay can you make your change use spin_lock_irq?  And have put_ucounts
> do spin_lock_irqsave?  That way we just don't care where we call this.

Like the one attached? I haven't yet looked carefully at whether the
callers of the functions that now use the _irq versions manipulate the
irq state themselves, since unconditionally re-enabling irqs on unlock
might cause a problem there.

> 
> I am tired of being clever.
> 
> Eric
> 
> 
From 0aa66c85afdac0cd07fabdf899c173c6dca2b6e7 Mon Sep 17 00:00:00 2001
From: Nikolay Borisov <n.borisov.l...@gmail.com>
Date: Fri, 20 Jan 2017 15:21:35 +0200
Subject: [PATCH] userns: Make ucounts lock softirq-safe

The ucounts_lock is being used to protect various ucounts lifecycle
management functionalities. However, those services can also be invoked
when a pidns is being freed in an RCU callback (i.e. softirq context).
This can lead to deadlocks. There were already efforts trying to
prevent similar deadlocks in add7c65ca426 ("pid: fix lockdep deadlock
warning due to ucount_lock"), however they just moved the context
from hardirq to softirq. Fix this issue once and for all by explicitly
making the lock disable irqs altogether.

Fixes: add7c65ca426 ("pid: fix lockdep deadlock warning due to ucount_lock")
Link: https://www.spinics.net/lists/kernel/msg2426637.html
Signed-off-by: Nikolay Borisov <n.borisov.l...@gmail.com>
---
 kernel/ucount.c | 14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff --git a/kernel/ucount.c b/kernel/ucount.c
index b4aaee935b3e..68716403b261 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -132,10 +132,10 @@ static struct ucounts *get_ucounts(struct user_namespace *ns, kuid_t uid)
 	struct hlist_head *hashent = ucounts_hashentry(ns, uid);
 	struct ucounts *ucounts, *new;
 
-	spin_lock(&ucounts_lock);
+	spin_lock_irq(&ucounts_lock);
 	ucounts = find_ucounts(ns, uid, hashent);
 	if (!ucounts) {
-		spin_unlock(&ucounts_lock);
+		spin_unlock_irq(&ucounts_lock);
 
 		new = kzalloc(sizeof(*new), GFP_KERNEL);
 		if (!new)
@@ -145,7 +145,7 @@ static struct ucounts *get_ucounts(struct user_namespace *ns, kuid_t uid)
 		new->uid = uid;
 		atomic_set(&new->count, 0);
 
-		spin_lock(&ucounts_lock);
+		spin_lock_irq(&ucounts_lock);
 		ucounts = find_ucounts(ns, uid, hashent);
 		if (ucounts) {
 			kfree(new);
@@ -156,16 +156,18 @@ static struct ucounts *get_ucounts(struct user_namespace *ns, kuid_t uid)
 	}
 	if (!atomic_add_unless(&ucounts->count, 1, INT_MAX))
 		ucounts = NULL;
-	spin_unlock(&ucounts_lock);
+	spin_unlock_irq(&ucounts_lock);
 	return ucounts;
 }
 
 static void put_ucounts(struct ucounts *ucounts)
 {
+	unsigned long flags;
+
 	if (atomic_dec_and_test(&ucounts->count)) {
-		spin_lock(&ucounts_lock);
+		spin_lock_irqsave(&ucounts_lock, flags);
 		hlist_del_init(&ucounts->node);
-		spin_unlock(&ucounts_lock);
+		spin_unlock_irqrestore(&ucounts_lock, flags);
 
 		kfree(ucounts);
 	}
-- 
2.7.4
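
The worry raised above about unconditionally enabling interrupts can be
illustrated with a toy userspace model. Everything below is hypothetical:
irqs_enabled stands in for the CPU interrupt flag, and the toy_* helpers only
mimic the flag handling of spin_lock_irq()/spin_unlock_irq() and
spin_lock_irqsave()/spin_unlock_irqrestore(). No real locking is performed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the local CPU's interrupt-enable flag. */
static bool irqs_enabled = true;

/* Mimics spin_lock_irq(): disables interrupts, forgets the old state. */
static void toy_lock_irq(void)
{
	irqs_enabled = false;
}

/* Mimics spin_unlock_irq(): enables interrupts unconditionally. */
static void toy_unlock_irq(void)
{
	irqs_enabled = true;
}

/* Mimics spin_lock_irqsave(): remembers the old state, then disables. */
static void toy_lock_irqsave(bool *flags)
{
	*flags = irqs_enabled;
	irqs_enabled = false;
}

/* Mimics spin_unlock_irqrestore(): puts the remembered state back. */
static void toy_unlock_irqrestore(bool flags)
{
	irqs_enabled = flags;
}

/* Interrupt state seen by a caller after a lock/unlock with the _irq pair. */
static bool state_after_irq_pair(bool entry_state)
{
	irqs_enabled = entry_state;
	toy_lock_irq();
	toy_unlock_irq();
	return irqs_enabled;
}

/* Same, using the irqsave/irqrestore pair. */
static bool state_after_irqsave_pair(bool entry_state)
{
	bool flags;

	irqs_enabled = entry_state;
	toy_lock_irqsave(&flags);
	toy_unlock_irqrestore(flags);
	assert(irqs_enabled == entry_state);	/* state is always preserved */
	return irqs_enabled;
}
```

In this model a caller that enters with interrupts disabled gets them back
enabled after the _irq pair, while the irqsave pair returns the state it
found. That matches the split in the patch: put_ucounts(), which may be
reached from an RCU callback, uses the irqsave variant, and get_ucounts(),
which presumably always runs in process context because it allocates with
GFP_KERNEL, uses plain _irq.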

Re: namespace: deadlock in dec_pid_namespaces

2017-01-20 Thread Nikolay Borisov

> [] _raw_spin_lock+0x90/0xd0 kernel/locking/spinlock.c:151
> [< inline >] spin_lock ./include/linux/spinlock.h:302
> [] put_ucounts+0x60/0x138 kernel/ucount.c:162
> [] dec_ucount+0xf4/0x158 kernel/ucount.c:214
> [< inline >] dec_pid_namespaces kernel/pid_namespace.c:89
> [] delayed_free_pidns+0x40/0xe0 kernel/pid_namespace.c:156
> [< inline >] __rcu_reclaim kernel/rcu/rcu.h:118
> [< inline >] rcu_do_batch kernel/rcu/tree.c:2919
> [< inline >] invoke_rcu_callbacks kernel/rcu/tree.c:3182
> [< inline >] __rcu_process_callbacks kernel/rcu/tree.c:3149
> [] rcu_process_callbacks+0x768/0xc28 kernel/rcu/tree.c:3166
> [] __do_softirq+0x324/0x6e0 kernel/softirq.c:284
> [< inline >] do_softirq_own_stack ./include/linux/interrupt.h:488
> [< inline >] invoke_softirq kernel/softirq.c:371
> [] irq_exit+0x264/0x308 kernel/softirq.c:405
> [] __handle_domain_irq+0xc0/0x150 kernel/irq/irqdesc.c:636
> [] gic_handle_irq+0x68/0xd8
> Exception stack(0x8000648e7dd0 to 0x8000648e7f00)
> 7dc0:   8000648d4b3c 0007
> 7de0:  1c91a967 1c91a967 1c91a967
> 7e00: 2a4b6b68 0001 0007 0001
> 7e20: 1fffe4000149ae90 29d35000  0002
> 7e40:   02624a1a 
> 7e60:  29cbcd88 60006d2ed000 0140
> 7e80: 29cff000 29cb6000 29cc2020 29d2159d
> 7ea0:  8000648d4380  8000648e7f00
> 7ec0: 2820a478 8000648e7f00 2820a47c 1145
> 7ee0: 0140 dfff2000  2820a478
> [] el1_irq+0xb8/0x130 arch/arm64/kernel/entry.S:486
> [< inline >] arch_local_irq_restore ./arch/arm64/include/asm/irqflags.h:81
> [] rcu_idle_exit+0x64/0xa8 kernel/rcu/tree.c:1030
> [< inline >] cpuidle_idle_call kernel/sched/idle.c:200
> [] do_idle+0x1dc/0x2d0 kernel/sched/idle.c:243
> [] cpu_startup_entry+0x24/0x28 kernel/sched/idle.c:345
> [] secondary_start_kernel+0x2cc/0x358 arch/arm64/kernel/smp.c:276
> [<0279f1a4>] 0x279f1a4
> 


So it seems that ucounts_lock is being used in a softirq context (in an
RCU callback) when a pidns is being freed. But this lock is not
softirq-safe, i.e. it doesn't disable bh. Can you try the attached patch?



From 86565326b382b41cb988a83791eff1c4d600040b Mon Sep 17 00:00:00 2001
From: Nikolay Borisov <n.borisov.l...@gmail.com>
Date: Fri, 20 Jan 2017 15:21:35 +0200
Subject: [PATCH] userns: Make ucounts lock softirq-safe

The ucounts_lock is being used to protect various ucounts lifecycle
management functionalities. However, those services can also be invoked
when a pidns is being freed in an RCU callback (i.e. softirq context).
This can lead to deadlocks. Fix it by making the spinlock disable softirqs.

Signed-off-by: Nikolay Borisov <n.borisov.l...@gmail.com>
---
 kernel/ucount.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/kernel/ucount.c b/kernel/ucount.c
index b4aaee935b3e..23a44ea894cd 100644
--- a/kernel/ucount.c
+++ b/kernel/ucount.c
@@ -132,10 +132,10 @@ static struct ucounts *get_ucounts(struct user_namespace *ns, kuid_t uid)
 	struct hlist_head *hashent = ucounts_hashentry(ns, uid);
 	struct ucounts *ucounts, *new;
 
-	spin_lock(&ucounts_lock);
+	spin_lock_bh(&ucounts_lock);
 	ucounts = find_ucounts(ns, uid, hashent);
 	if (!ucounts) {
-		spin_unlock(&ucounts_lock);
+		spin_unlock_bh(&ucounts_lock);
 
 		new = kzalloc(sizeof(*new), GFP_KERNEL);
 		if (!new)
@@ -145,7 +145,7 @@ static struct ucounts *get_ucounts(struct user_namespace *ns, kuid_t uid)
 		new->uid = uid;
 		atomic_set(&new->count, 0);
 
-		spin_lock(&ucounts_lock);
+		spin_lock_bh(&ucounts_lock);
 		ucounts = find_ucounts(ns, uid, hashent);
 		if (ucounts) {
 			kfree(new);
@@ -156,16 +156,16 @@ static struct ucounts *get_ucounts(struct user_namespace *ns, kuid_t uid)
 	}
 	if (!atomic_add_unless(&ucounts->count, 1, INT_MAX))
 		ucounts = NULL;
-	spin_unlock(&ucounts_lock);
+	spin_unlock_bh(&ucounts_lock);
 	return ucounts;
 }
 
 static void put_ucounts(struct ucounts *ucounts)
 {
 	if (atomic_dec_and_test(&ucounts->count)) {
-		spin_lock(&ucounts_lock);
+		spin_lock_bh(&ucounts_lock);
 		hlist_del_init(&ucounts->node);
-		spin_unlock(&ucounts_lock);
+		spin_unlock_bh(&ucounts_lock);
 
 		kfree(ucounts);
 	}
-- 
2.7.4
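
The deadlock described above can be sketched as a single-CPU toy model.
This is only an illustration under stated assumptions: struct toy_cpu and the
toy_* functions are hypothetical, the lock is a plain flag, and "deadlock" is
recorded instead of actually spinning. It shows why a lock that is also taken
from softirq context must be taken with bottom halves disabled elsewhere.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-CPU model; none of these names are kernel API. */
struct toy_cpu {
	bool lock_held;		/* models ucounts_lock being held */
	bool bh_disabled;	/* models bottom halves being disabled */
	bool softirq_pending;	/* an RCU callback is waiting to run */
	bool deadlocked;	/* softirq would spin on a lock this CPU holds */
	int softirq_runs;	/* completed softirq invocations */
};

/* Models the RCU callback path, which takes the lock in put_ucounts(). */
static void toy_softirq(struct toy_cpu *c)
{
	if (c->lock_held) {
		c->deadlocked = true;	/* real code would spin forever here */
		return;
	}
	c->softirq_runs++;		/* lock taken and released without contention */
}

/* Models interrupt exit: pending softirqs run unless bh is disabled. */
static void toy_irq_exit(struct toy_cpu *c)
{
	if (c->softirq_pending && !c->bh_disabled) {
		c->softirq_pending = false;
		toy_softirq(c);
	}
}

/* Process context takes the lock; an interrupt lands inside the section. */
static struct toy_cpu run_critical_section(bool use_bh)
{
	struct toy_cpu c = {0};

	c.softirq_pending = true;
	c.bh_disabled = use_bh;		/* spin_lock_bh() vs. spin_lock() */
	c.lock_held = true;

	toy_irq_exit(&c);		/* interrupt inside the critical section */

	c.lock_held = false;
	c.bh_disabled = false;		/* unlock; bh enabled again */
	toy_irq_exit(&c);		/* deferred softirq, if any, runs now */
	return c;
}
```

With the plain lock the softirq runs while the lock is held and the model
records a deadlock. With the _bh variant the softirq is deferred until after
the unlock and completes normally.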

Re: [PATCH net-next v4] bridge: multicast to unicast

2017-01-19 Thread Nikolay Aleksandrov

On 19/01/17 03:45, Linus Lüssing wrote:
> From: Felix Fietkau <n...@nbd.name>
> 
> Implements an optional, per bridge port flag and feature to deliver
> multicast packets to each host on the respective port via unicast
> individually. This is done by copying the packet per host and
> changing the multicast destination MAC to a unicast one accordingly.
> 
> multicast-to-unicast works on top of the multicast snooping feature of
> the bridge, which means unicast copies are only delivered to hosts which
> are interested in them and have signaled this via IGMP/MLD reports
> previously.
> 
> This feature is intended for interface types which have a more reliable
> and/or efficient way to deliver unicast packets than broadcast ones
> (e.g. wifi).
> 
> However, it should only be enabled on interfaces where no IGMPv2/MLDv1
> report suppression takes place. This feature is disabled by default.
> 
> The initial patch and idea is from Felix Fietkau.
> 
> Signed-off-by: Felix Fietkau <n...@nbd.name>
> [linus.luess...@c0d3.blue: various bug + style fixes, commit message]
> Signed-off-by: Linus Lüssing <linus.luess...@c0d3.blue>
> 
> ---
> 
> This feature has been used and enabled by default in OpenWRT and LEDE for
> AP interfaces for more than a year now to allow both a more robust multicast
> delivery and multicast at higher rates (e.g. multicast streaming).
> 
> In OpenWRT/LEDE the IGMP/MLD report suppression issue is overcome by
> the network daemon enabling AP isolation and by that separating all STAs.
> Delivery of STA-to-STA IP multicast is made possible again by
> enabling and utilizing the bridge hairpin mode, which considers the
> incoming port as a potential outgoing port, too.
> 
> Hairpin-mode is performed after multicast snooping, so reports are
> only delivered to STAs running a multicast router.
> 
> Changes in v4:
> * readd "From: Felix Fietkau [...]" (missed it again in v3)
> 
> Changes in v3:
> * fix an uninitialized variable bug introduced in br_multicast_flood()
>   in v2, found by kbuild test bot
> 
> Changes in v2:
> * netlink support (thanks Nik!)
> * fixed switching between multicast_to_unicast on/off
>   -> even after toggling, an already existing entry would
>  stay stale in its old mode and would never be replaced
>   -> new extra check in br_port_group_equal()
> * reduced checks in br_multicast_flood() from two to one
>   to address fast-path concerns (thanks Nik!)
> * rev-christmas tree ordering (thanks Nik!)
> * removed "net_bridge_port_group::unicast", using
>   ::flags instead (thanks Nik!)
> * BR_MULTICAST_TO_UCAST -> BR_MULTICAST_TO_UNICAST
>   (BR_MULTICAST_FAST_LEAVE has the same length anyway)
> * simplified maybe_deliver_addr()
>   (no return, no "prev" parameter -> was a NOP anyway)
> * added "From: Felix Fietkau [...]"
> * added "Signed-off-by: Felix Fietkau [...]"
> ---
>  include/linux/if_bridge.h|  1 +
>  include/uapi/linux/if_link.h |  1 +
>  net/bridge/br_forward.c  | 37 -
>  net/bridge/br_mdb.c  |  2 +-
>  net/bridge/br_multicast.c| 96 
> 
>  net/bridge/br_netlink.c  |  5 +++
>  net/bridge/br_private.h  |  8 ++--
>  net/bridge/br_sysfs_if.c |  2 +
>  8 files changed, 121 insertions(+), 31 deletions(-)
> 

Looks good to me,
Reviewed-by: Nikolay Aleksandrov <niko...@cumulusnetworks.com>
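
The per-host copy described in the commit message can be sketched in
userspace as below. This is a minimal illustration, not the bridge code:
toy_mc_to_uc() is a hypothetical helper operating on a raw Ethernet frame
instead of an sk_buff, and its "no soliloquies" check is modeled on
maybe_deliver_addr() from the first posting of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define ETH_ALEN 6

/*
 * Hypothetical helper: returns a heap copy of 'frame' whose destination
 * MAC is rewritten to 'host', or NULL if the frame is too short, the copy
 * is skipped, or allocation fails. The caller frees the result.
 */
static unsigned char *toy_mc_to_uc(const unsigned char *frame, size_t len,
				   const unsigned char *host, bool same_port)
{
	unsigned char *copy;

	assert(frame && host);
	if (len < 2 * ETH_ALEN)
		return NULL;

	/* Even with hairpin, do not send a frame back to its own sender. */
	if (same_port && memcmp(frame + ETH_ALEN, host, ETH_ALEN) == 0)
		return NULL;

	copy = malloc(len);
	if (!copy)
		return NULL;

	memcpy(copy, frame, len);	/* one private copy per interested host */
	memcpy(copy, host, ETH_ALEN);	/* multicast dest MAC -> unicast host MAC */
	return copy;
}
```

A caller would invoke this once per host that joined the group on the port
and transmit each returned copy; only the first six bytes differ between the
copies.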

Re: [PATCH net-next] bridge: multicast to unicast

2017-01-03 Thread Nikolay Aleksandrov


On 02/01/17 20:32, Linus Lüssing wrote:

Implements an optional, per bridge port flag and feature to deliver
multicast packets to each host on the respective port via unicast
individually. This is done by copying the packet per host and
changing the multicast destination MAC to a unicast one accordingly.

multicast-to-unicast works on top of the multicast snooping feature of
the bridge, which means unicast copies are only delivered to hosts which
are interested in them and have signaled this via IGMP/MLD reports
previously.

This feature is intended for interface types which have a more reliable
and/or efficient way to deliver unicast packets than broadcast ones
(e.g. wifi).

However, it should only be enabled on interfaces where no IGMPv2/MLDv1
report suppression takes place. This feature is disabled by default.

The initial patch and idea is from Felix Fietkau.

Cc: Felix Fietkau 
Signed-off-by: Linus Lüssing 

---



Hi Linus,
A few comments below. In general I have two concerns: the new mcast fast-path
tests and cache line ref, and adding netlink support for the new flag.


This feature has been used and enabled by default in OpenWRT and LEDE for
AP interfaces for more than a year now to allow both a more robust multicast
delivery and multicast at higher rates (e.g. multicast streaming).

In OpenWRT/LEDE the IGMP/MLD report suppression issue is overcome by
the network daemon enabling AP isolation and by that separating all STAs.
Delivery of STA-to-STA IP multicast is made possible again by
enabling and utilizing the bridge hairpin mode, which considers the
incoming port as a potential outgoing port, too.

Hairpin-mode is performed after multicast snooping, so reports are
only delivered to STAs running a multicast router.
---
 include/linux/if_bridge.h |  1 +
 net/bridge/br_forward.c   | 44 +--
 net/bridge/br_mdb.c   |  2 +-
 net/bridge/br_multicast.c | 92 ++-
 net/bridge/br_private.h   |  4 ++-
 net/bridge/br_sysfs_if.c  |  2 ++
 6 files changed, 115 insertions(+), 30 deletions(-)

diff --git a/include/linux/if_bridge.h b/include/linux/if_bridge.h
index c6587c0..f1b0d78 100644
--- a/include/linux/if_bridge.h
+++ b/include/linux/if_bridge.h
@@ -46,6 +46,7 @@ struct br_ip_list {
 #define BR_LEARNING_SYNC   BIT(9)
 #define BR_PROXYARP_WIFI   BIT(10)
 #define BR_MCAST_FLOOD BIT(11)
+#define BR_MULTICAST_TO_UCAST  BIT(12)

 #define BR_DEFAULT_AGEING_TIME (300 * HZ)

diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index 7cb41ae..49d742d 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -174,6 +174,33 @@ static struct net_bridge_port *maybe_deliver(
return p;
 }

+static struct net_bridge_port *maybe_deliver_addr(
+	struct net_bridge_port *prev, struct net_bridge_port *p,
+	struct sk_buff *skb, const unsigned char *addr,
+	bool local_orig)
+{
+	struct net_device *dev = BR_INPUT_SKB_CB(skb)->brdev;
+	const unsigned char *src = eth_hdr(skb)->h_source;
+
+	if (!should_deliver(p, skb))
+		return prev;
+
+	/* Even with hairpin, no soliloquies - prevent breaking IPv6 DAD */
+	if (skb->dev == p->dev && ether_addr_equal(src, addr))
+		return prev;
+
+	skb = skb_copy(skb, GFP_ATOMIC);
+	if (!skb) {
+		dev->stats.tx_dropped++;
+		return prev;
+	}
+
+	memcpy(eth_hdr(skb)->h_dest, addr, ETH_ALEN);
+	__br_forward(p, skb, local_orig);
+
+	return prev;
+}
+
 /* called under rcu_read_lock */
 void br_flood(struct net_bridge *br, struct sk_buff *skb,
  enum br_pkt_type pkt_type, bool local_rcv, bool local_orig)
@@ -231,6 +258,7 @@ void br_multicast_flood(struct net_bridge_mdb_entry *mdst,
 	struct net_bridge_port *prev = NULL;
 	struct net_bridge_port_group *p;
 	struct hlist_node *rp;
+	const unsigned char *addr;


nit: please arrange these into reverse christmas tree



rp = rcu_dereference(hlist_first_rcu(&br->router_list));
p = mdst ? rcu_dereference(mdst->ports) : NULL;
@@ -241,10 +269,20 @@ void br_multicast_flood(struct net_bridge_mdb_entry *mdst,
rport = rp ? hlist_entry(rp, struct net_bridge_port, rlist) :
 NULL;

-   port = (unsigned long)lport > (unsigned long)rport ?
-  lport : rport;
+   if ((unsigned long)lport > (unsigned long)rport) {
+   port = lport;
+   addr = p->unicast ? p->eth_addr : NULL;
+   } else {
+   port = rport;
+   addr = NULL;
+   }
+
+   if (addr)
+   prev = maybe_deliver_addr(prev, port, skb, addr,
+ local_orig);
+   else
+   prev = maybe_deliver(prev, port, skb, local_orig);


This hunk

Re: [PATCH 4/7] mm, vmscan: show LRU name in mm_vmscan_lru_isolate tracepoint

2016-12-28 Thread Nikolay Borisov

On 28.12.2016 18:00, Michal Hocko wrote:
> On Wed 28-12-16 17:50:31, Nikolay Borisov wrote:
>>
>>
>> On 28.12.2016 17:30, Michal Hocko wrote:
>>> From: Michal Hocko <mho...@suse.com>
>>>
>>> mm_vmscan_lru_isolate currently prints only whether the LRU we isolate
>>> from is file or anonymous but we do not know which LRU this is. It is
>>> useful to know whether the list is file or anonymous as well. Change
>>
>> Maybe you wanted to say whether the list is ACTIVE/INACTIVE ?
> 
> You are right. I will update the wording to:
> "
> mm_vmscan_lru_isolate currently prints only whether the LRU we isolate
> from is file or anonymous but we do not know which LRU this is. It is
> useful to know whether the list is active or inactive as well as we
> use the same function to isolate pages for both of them. Change
> the tracepoint to show symbolic names of the lru rather.
> "
> 
> Does it sound better?

It's better. Just one more nit about the " as well as we
use the same function to isolate pages for both of them"

I think this can be worded better. The way I understand it, it's
better to know whether it's active/inactive since we are using the same
function to do both, correct? If so, then perhaps the following is a
bit clearer:

"
It is useful to know whether the list is active or inactive, since we
are using the same function to isolate pages from both of them and it's
hard to distinguish otherwise.
"

But as I said - it's a minor nit.
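
To make the point concrete, here is a small userspace sketch (not kernel code) of why the old "file=%d" output cannot tell the active and inactive lists apart while a symbolic name can. The enum order and is_file_lru() mirror the kernel definitions; lru_name() and print_isolate_event() are hypothetical stand-ins for show_lru_name() and the tracepoint output.

```c
#include <stdio.h>

/* Same order as the kernel's enum lru_list, redefined for a userspace sketch */
enum lru_list {
	LRU_INACTIVE_ANON,
	LRU_ACTIVE_ANON,
	LRU_INACTIVE_FILE,
	LRU_ACTIVE_FILE,
	LRU_UNEVICTABLE,
	NR_LRU_LISTS
};

/* Same test as the kernel helper: true for both file LRUs */
static int is_file_lru(enum lru_list lru)
{
	return lru == LRU_INACTIVE_FILE || lru == LRU_ACTIVE_FILE;
}

/* Hypothetical stand-in for show_lru_name()/__print_symbolic() */
static const char *lru_name(enum lru_list lru)
{
	static const char *const names[NR_LRU_LISTS] = {
		[LRU_INACTIVE_ANON] = "LRU_INACTIVE_ANON",
		[LRU_ACTIVE_ANON]   = "LRU_ACTIVE_ANON",
		[LRU_INACTIVE_FILE] = "LRU_INACTIVE_FILE",
		[LRU_ACTIVE_FILE]   = "LRU_ACTIVE_FILE",
		[LRU_UNEVICTABLE]   = "LRU_UNEVICTABLE",
	};

	return (unsigned int)lru < NR_LRU_LISTS ? names[lru] : "UNKNOWN";
}

/* Print what the tracepoint field would look like before and after the patch */
static void print_isolate_event(enum lru_list lru)
{
	printf("old: file=%d  new: lru=%s\n", is_file_lru(lru), lru_name(lru));
}
```

With this, LRU_ACTIVE_FILE and LRU_INACTIVE_FILE both produce file=1 in the old format, and only the symbolic name separates them.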

> 
> Thanks!
>

Re: [PATCH 4/7] mm, vmscan: show LRU name in mm_vmscan_lru_isolate tracepoint

2016-12-28 Thread Nikolay Borisov



On 28.12.2016 17:30, Michal Hocko wrote:
> From: Michal Hocko 
> 
> mm_vmscan_lru_isolate currently prints only whether the LRU we isolate
> from is file or anonymous but we do not know which LRU this is. It is
> useful to know whether the list is file or anonymous as well. Change
Maybe you wanted to say whether the list is ACTIVE/INACTIVE ?

> the tracepoint to show symbolic names of the lru rather.
> 
> Signed-off-by: Michal Hocko 
> ---
>  include/trace/events/vmscan.h | 20 ++--
>  mm/vmscan.c   |  2 +-
>  2 files changed, 15 insertions(+), 7 deletions(-)
> 
> diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h
> index 6af4dae46db2..cc0b4c456c78 100644
> --- a/include/trace/events/vmscan.h
> +++ b/include/trace/events/vmscan.h
> @@ -36,6 +36,14 @@
>   (RECLAIM_WB_ASYNC) \
>   )
>  
> +#define show_lru_name(lru) \
> + __print_symbolic(lru, \
> + {LRU_INACTIVE_ANON, "LRU_INACTIVE_ANON"}, \
> + {LRU_ACTIVE_ANON, "LRU_ACTIVE_ANON"}, \
> + {LRU_INACTIVE_FILE, "LRU_INACTIVE_FILE"}, \
> + {LRU_ACTIVE_FILE, "LRU_ACTIVE_FILE"}, \
> + {LRU_UNEVICTABLE, "LRU_UNEVICTABLE"})
> +
>  TRACE_EVENT(mm_vmscan_kswapd_sleep,
>  
>   TP_PROTO(int nid),
> @@ -277,9 +285,9 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
>   unsigned long nr_skipped,
>   unsigned long nr_taken,
>   isolate_mode_t isolate_mode,
> - int file),
> + int lru),
>  
> - TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_skipped, nr_taken, isolate_mode, file),
> + TP_ARGS(classzone_idx, order, nr_requested, nr_scanned, nr_skipped, nr_taken, isolate_mode, lru),
>  
>   TP_STRUCT__entry(
>   __field(int, classzone_idx)
> @@ -289,7 +297,7 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
>   __field(unsigned long, nr_skipped)
>   __field(unsigned long, nr_taken)
>   __field(isolate_mode_t, isolate_mode)
> - __field(int, file)
> + __field(int, lru)
>   ),
>  
>   TP_fast_assign(
> @@ -300,10 +308,10 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
>   __entry->nr_skipped = nr_skipped;
>   __entry->nr_taken = nr_taken;
>   __entry->isolate_mode = isolate_mode;
> - __entry->file = file;
> + __entry->lru = lru;
>   ),
>  
> - TP_printk("isolate_mode=%d classzone=%d order=%d nr_requested=%lu nr_scanned=%lu nr_skipped=%lu nr_taken=%lu file=%d",
> + TP_printk("isolate_mode=%d classzone=%d order=%d nr_requested=%lu nr_scanned=%lu nr_skipped=%lu nr_taken=%lu lru=%s",
>   __entry->isolate_mode,
>   __entry->classzone_idx,
>   __entry->order,
> @@ -311,7 +319,7 @@ TRACE_EVENT(mm_vmscan_lru_isolate,
>   __entry->nr_scanned,
>   __entry->nr_skipped,
>   __entry->nr_taken,
> - __entry->file)
> + show_lru_name(__entry->lru))
>  );
>  
>  TRACE_EVENT(mm_vmscan_writepage,
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 4f7c0d66d629..3f0774f30a42 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1500,7 +1500,7 @@ static unsigned long isolate_lru_pages(unsigned long nr_to_scan,
>   }
>   *nr_scanned = scan + total_skipped;
>   trace_mm_vmscan_lru_isolate(sc->reclaim_idx, sc->order, nr_to_scan, scan,
> - skipped, nr_taken, mode, is_file_lru(lru));
> + skipped, nr_taken, mode, lru);
>   update_lru_sizes(lruvec, lru, nr_zone_taken, nr_taken);
>   return nr_taken;
>  }
>

Re: [PATCH] ipv4: Namespaceify tcp_tw_reuse knob

2016-12-24 Thread Nikolay Borisov



On 24.12.2016 14:43, Haishuang Yan wrote:
> Signed-off-by: Haishuang Yan <yanhaishu...@cmss.chinamobile.com>

Reviewed-by: Nikolay Borisov <n.borisov.l...@gmail.com>

Re: [GIT PULL] kbuild changes for v4.9-rc1

2016-12-18 Thread Nikolay Borisov

On 18.12.2016 16:45, Jiri Slaby wrote:
> Moreover, with some modules, __put_user_1 and others are reported
> instead of mcount.

nm vmlinux | grep __fentry__
nm vmlinux | grep mcount

What do these report? I bet that in your vmlinux the first one
would return something like:

822f1810 T __fentry__
827fdc20 r __kcrctab___fentry__
82809461 r __kstrtab___fentry__
827e6c20 R __ksymtab___fentry__
and nothing for the second, whereas running nm on the module in question
would give nothing for __fentry__ and something like: U mcount
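
The same check can be scripted. The sketch below is only an illustration: it takes nm output as text and looks a symbol up by exact name, so that __kcrctab___fentry__ and friends (which a plain grep also matches, as in the output above) are not mistaken for __fentry__ itself. The helper name nm_symbol_type() is made up for this example.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the nm type letter ('T', 'U', ...) for an exact symbol name,
 * or 0 if the symbol does not appear in the given nm output.
 * Hypothetical helper, written only for this illustration.
 */
static char nm_symbol_type(const char *nm_output, const char *symbol)
{
	const char *line = nm_output;

	while (*line) {
		char a[128], b[128], c[128];
		char buf[400];
		size_t len = strcspn(line, "\n");
		size_t copy = len < sizeof(buf) - 1 ? len : sizeof(buf) - 1;
		int n;

		memcpy(buf, line, copy);
		buf[copy] = '\0';

		n = sscanf(buf, "%127s %127s %127s", a, b, c);
		if (n == 3 && strcmp(c, symbol) == 0)
			return b[0];	/* "<address> <type> <name>" */
		if (n == 2 && strcmp(b, symbol) == 0)
			return a[0];	/* "<type> <name>", e.g. an undefined symbol */

		/* Move to the start of the next line, if any */
		line += len;
		if (*line == '\n')
			line++;
	}

	return 0;
}
```

If vmlinux reports 'T' for __fentry__ and nothing for mcount while the module reports 'U' for mcount, the two were most likely built with different profiling hooks.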

Re: [GIT PULL] kbuild changes for v4.9-rc1

2016-12-18 Thread Nikolay Borisov



On 18.12.2016 13:03, Arend Van Spriel wrote:
> On 18-12-2016 11:49, Jiri Slaby wrote:
>> On 12/18/2016, 12:59 AM, Linus Torvalds wrote:
>>> On Sat, Dec 17, 2016 at 12:57 AM, Jiri Slaby  wrote:
>>>>
>>>> Yes, disk drivers won't load:
>>>> [2.141973] virtio_pci: disagrees about version of symbol mcount
>>>> [2.144415] virtio_pci: Unknown symbol mcount (err -22)
>>>
>>> This makes no sense.
>>>
>>> mcount isn't even one of the symbols that the patch by Adam is touching.
>>
>> asm-prototypes.h in his patch includes asm/ftrace.h, where the function
>> is declared. That should be enough IIUC scripts/Makefile.build.
>>
>>> There's something else screwed up here. Not to mention that others
>>> don't have your issue.
>>
>> I suppose people don't run i386 kernels or have different config.
>>
>>> Do you have some other hacks in this area? Are you testing actual
>>> plain 4.9, or do you (for example) still carry Arnd's patch around
>>> that turned out to not work (reverted by f27c2f69cc8e in my tree)?
>>
>> Not at all. This was plain 4.9 packaged by suse -- only rpm-related
>> fixes. I tried plain 4.9 without rpm right now with the same output:
>> # insmod soundcore.ko
>> [   31.582326] soundcore: disagrees about version of symbol mcount
>> [   31.586183] soundcore: Unknown symbol mcount (err -22)
>> insmod: ERROR: could not insert module soundcore.ko: Invalid parameters
> 
> I hit an mcount issue a while back (years?) which was due to building a
> driver with gcc v4.x while kernel was built using gcc v4.y. Not claiming
> that is your issue though.

I've usually had the same thing happen to me when things were compiled
with different gcc versions. Essentially, in newer gcc (starting with
4.6, I believe) CC_USING_FENTRY is defined, meaning that there is no
mcount() symbol but rather __fentry__. This is the likely problem here.


> 
> Regards,
> Arend
>

Re: default 0 if KASAN expression not working in kbuild

2016-12-15 Thread Nikolay Borisov



On 16.12.2016 09:50, Nikolay Borisov wrote:
> 
> 
> On 15.12.2016 23:32, Randy Dunlap wrote:
>> On 12/15/16 10:09, Nikolay Borisov wrote:
>>> Hello,
>>>
>>> I was doing some kasan-related debugging and when I enabled it I started
>>> getting warnings for large stackframes. So CONFIG_FRAME_WARN has :
>>>
>>> int "Warn for stack frames larger than (needs gcc 4.4)"
>>> range 0 8192
>>> default 0 if KASAN
>>> default 2048 if GCC_PLUGIN_LATENT_ENTROPY
>>> default 1024 if !64BIT
>>> default 2048 if 64BIT
>>>
>>> This means that frame_warns should effectively be disabled when kasan is
>>> enabled. However in my case this is not the situation.
>>> http://sprunge.us/FiGf here is the config file. It does have
>>> CONFIG_KASAN=y and CONFIG_FRAME_WARN=1024 . And even this is erroneous
>>> since it's a 64bit kernel, so it should be 2k. I haven't manually set
>>> the limit to 1k either.
>>
>> Yeah, it set FRAME_WARN=1024 for me also.
>>
>> It seems to be dependent on order of kconfig symbols in
>> lib/Kconfig.debug.
>>
>> If I move the line:
>>   source "lib/Kconfig.kasan"
>> to just after this line:
>>   menu "Compile-time checks and compiler options"
>> it seems to work for me.
>>
>> Can you test the patch below?
> 
> This patch has another problem that if I move the source line then I no
> longer get the kasan option in Memory Debugging section, furthermore the
> frame_warn wasn't changed either.

So actually kasan is being moved to the "Compile-time checks" menu, yet
the frame size still isn't changed for me.

> 
> 
>>

Re: default 0 if KASAN expression not working in kbuild

2016-12-15 Thread Nikolay Borisov



On 15.12.2016 23:32, Randy Dunlap wrote:
> On 12/15/16 10:09, Nikolay Borisov wrote:
>> Hello,
>>
>> I was doing some kasan-related debugging and when I enabled it I started
>> getting warnings for large stackframes. So CONFIG_FRAME_WARN has :
>>
>> int "Warn for stack frames larger than (needs gcc 4.4)"
>> range 0 8192
>> default 0 if KASAN
>> default 2048 if GCC_PLUGIN_LATENT_ENTROPY
>> default 1024 if !64BIT
>> default 2048 if 64BIT
>>
>> This means that frame_warns should effectively be disabled when kasan is
>> enabled. However in my case this is not the situation.
>> http://sprunge.us/FiGf here is the config file. It does have
>> CONFIG_KASAN=y and CONFIG_FRAME_WARN=1024 . And even this is erroneous
>> since it's a 64bit kernel, so it should be 2k. I haven't manually set
>> the limit to 1k either.
> 
> Yeah, it set FRAME_WARN=1024 for me also.
> 
> It seems to be dependent on order of kconfig symbols in
> lib/Kconfig.debug.
> 
> If I move the line:
>   source "lib/Kconfig.kasan"
> to just after this line:
>   menu "Compile-time checks and compiler options"
> it seems to work for me.
> 
> Can you test the patch below?

This patch has another problem that if I move the source line then I no
longer get the kasan option in Memory Debugging section, furthermore the
frame_warn wasn't changed either.


>

default 0 if KASAN expression not working in kbuild

2016-12-15 Thread Nikolay Borisov

Hello,

I was doing some kasan-related debugging and when I enabled it I started
getting warnings for large stackframes. So CONFIG_FRAME_WARN has :

int "Warn for stack frames larger than (needs gcc 4.4)"
range 0 8192
default 0 if KASAN
default 2048 if GCC_PLUGIN_LATENT_ENTROPY
default 1024 if !64BIT
default 2048 if 64BIT

This means that frame_warns should effectively be disabled when kasan is
enabled. However in my case this is not the situation.
http://sprunge.us/FiGf here is the config file. It does have
CONFIG_KASAN=y and CONFIG_FRAME_WARN=1024 . And even this is erroneous
since it's a 64bit kernel, so it should be 2k. I haven't manually set
the limit to 1k either.

Regards,
Nikolay
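
For reference, a minimal model of the behaviour I expected from that default list: Kconfig uses the first default whose condition holds. This is a userspace sketch, not kconfig code; the struct and function names are invented for the illustration, and it assumes every condition symbol already has its final value when the list is evaluated.

```c
#include <stdbool.h>
#include <stddef.h>

/* One "default <value> if <cond>" line; names are hypothetical */
struct kconfig_default {
	int value;
	bool cond;
};

/* First default whose condition is true wins; fallback if none match */
static int pick_default(const struct kconfig_default *d, size_t n, int fallback)
{
	for (size_t i = 0; i < n; i++)
		if (d[i].cond)
			return d[i].value;
	return fallback;
}

/* The FRAME_WARN default list from the mail, in the same order */
static int frame_warn_default(bool kasan, bool latent_entropy, bool is_64bit)
{
	const struct kconfig_default defaults[] = {
		{ 0,    kasan },		/* default 0 if KASAN */
		{ 2048, latent_entropy },	/* default 2048 if GCC_PLUGIN_LATENT_ENTROPY */
		{ 1024, !is_64bit },		/* default 1024 if !64BIT */
		{ 2048, is_64bit },		/* default 2048 if 64BIT */
	};

	return pick_default(defaults, sizeof(defaults) / sizeof(defaults[0]), 0);
}
```

With KASAN=y on a 64-bit build this model yields 0. The list only produces 1024 when KASAN, GCC_PLUGIN_LATENT_ENTROPY and 64BIT all evaluate as false, which is an observation about the list rather than a diagnosis.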

Re: [inotify] fee1df54b6: BUG_kmalloc-#(Not_tainted):Freepointer_corrupt

2016-12-13 Thread Nikolay Borisov



On 13.12.2016 20:51, Eric W. Biederman wrote:
> Nikolay Borisov <n.borisov.l...@gmail.com> writes:
> 
>> So this thing resurfaced again and I took a hard look into the code but
>> couldn't find anything suspicious. So the allocating and freeing
>> contexts leads me to believe it's the 'tbl' pointer that is being
>> corrupted. The only thing which I do with it is to increase it by two.
>>
>> Perhaps some liveness issues.
> 
> To me it feels like a double free somewhere.  Like we call dec_ucount
> and thus put_ucount multiple times in a way that goes to 0.
> 
> Perhaps there is a peculiarity in the existing code which allows the
> count to go to zero which we don't notice because we don't free anything
> when the count goes to zero today.
> 
> Perhaps there is some subtle semantic mismatch between your conversion
> and the inotify code.
> 
> I don't know if you made a subtle misreading of the code, or if
> there is an existing bug that your changes took from harmless to
> problematic, but the evidence is overwhelming that something
> is going wrong and it is your patch that brings it out.
> 
> If it helps the openvz folks apparently reproduced this with the criu
> regression tests and the appropriate kernel debug options, and confirmed
> the failure was your patch.

Great, but I think I missed this conversation; care to send the relevant
threads? I'd like to get to the bottom of this and have it merged.

@openvz guys - if you care to shout with more details I'd love to work
on getting this fixed!
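
To spell out the double-free theory for myself, here is a toy model of it. It is deliberately not the kernel's ucount code: struct toy_counts, toy_get() and toy_put() are made up, and the only point is that an unbalanced put is invisible while nothing is freed at zero, and becomes a second release once reaching zero frees the object.

```c
#include <stdbool.h>

/* Toy reference-counted object; NOT the kernel's struct ucounts */
struct toy_counts {
	int refs;		/* outstanding references */
	bool freed;		/* set once the object has been released */
	int double_frees;	/* release attempts after the first one */
};

/* Take one reference */
static void toy_get(struct toy_counts *c)
{
	c->refs++;
}

/* Drop one reference; returns true if this call released the object */
static bool toy_put(struct toy_counts *c)
{
	if (--c->refs > 0)
		return false;

	if (c->freed) {
		/* Count already hit zero once: this would be a double free */
		c->double_frees++;
		return false;
	}

	c->freed = true;
	return true;
}
```

One get followed by two puts releases the object on the first put and trips the double_frees counter on the second, which is the pattern the SLUB report would be consistent with if the theory holds.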

> 
> The current state of play is that I would love to merge this if we can
> track down this issue.  I dropped this from my tree before I sent my pull
> request to Linus so there is no emergency to get this fixed.
> 
> Eric
> 
>

Re: [inotify] fee1df54b6: BUG_kmalloc-#(Not_tainted):Freepointer_corrupt

2016-12-13 Thread Nikolay Borisov

So this thing resurfaced again and I took a hard look into the code but
couldn't find anything suspicious. So the allocating and freeing
contexts leads me to believe it's the 'tbl' pointer that is being
corrupted. The only thing which I do with it is to increase it by two.

Perhaps some liveness issues.

On 13.12.2016 05:22, kernel test robot wrote:
> FYI, we noticed the following commit:
> 
> commit: fee1df54b64871f8c097a53fcb02145af48c0b48 ("inotify: Convert to using per-namespace limits")
> https://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-next
> 
> in testcase: trinity
> with following parameters:
> 
>   runtime: 300s
> 
> test-description: Trinity is a linux system call fuzz tester.
> test-url: http://codemonkey.org.uk/projects/trinity/
> 
> 
> on test machine: qemu-system-x86_64 -enable-kvm -cpu qemu64,+ssse3 -smp 4 -m 4G
> 
> caused below changes:
> 
> 
> +-------------------------------------------------------+------------+------------+
> |                                                       | 19339c2516 | fee1df54b6 |
> +-------------------------------------------------------+------------+------------+
> | boot_successes                                        | 14         | 3          |
> | boot_failures                                         | 2          | 13         |
> | BUG:kernel_hang_in_test_stage                         | 2          |            |
> | BUG_kmalloc-#(Not_tainted):Freepointer_corrupt        | 0          | 13         |
> | INFO:Allocated_in_setup_userns_sysctls_age=#cpu=#pid= | 0          | 13         |
> | INFO:Freed_in_assoc_array_rcu_cleanup_age=#cpu=#pid=  | 0          | 2          |
> | INFO:Slab#objects=#used=#fp=#flags=                   | 0          | 13         |
> | INFO:Object#@offset=#fp=                              | 0          | 13         |
> | calltrace:free_user_ns                                | 0          | 13         |
> | INFO:Freed_in_load_elf_binary_age=#cpu=#pid=          | 0          | 3          |
> | INFO:Freed_in_kvfree_age=#cpu=#pid=                   | 0          | 3          |
> | INFO:Freed_in_skb_free_head_age=#cpu=#pid=            | 0          | 1          |
> | INFO:Freed_in_do_readv_writev_age=#cpu=#pid=          | 0          | 2          |
> | INFO:Freed_in_process_vm_rw_age=#cpu=#pid=            | 0          | 2          |
> +-------------------------------------------------------+------------+------------+
> 
> 
> 
> [   67.135026] [child2:457] Tried 8 32-bit syscalls unsuccessfully. Disabling all 32-bit syscalls.
> [   67.170798] 
> [   67.195253] =============================================================================
> [   67.199676] BUG kmalloc-512 (Not tainted): Freepointer corrupt
> [   67.202508] -----------------------------------------------------------------------------
> [   67.202508] 
> [   67.208161] Disabling lock debugging due to kernel taint
> [   67.210870] INFO: Allocated in setup_userns_sysctls+0x44/0xd0 age=63 cpu=0 pid=459
> [   67.237533] INFO: Freed in assoc_array_rcu_cleanup+0x5b/0x60 age=194 cpu=0 pid=442
> [   67.270428] INFO: Slab 0x88013ee3c000 objects=19 used=7 fp=0x880119082478 flags=0x470004080
> [   67.274025] INFO: Object 0x880119080008 @offset=8 fp=0x8801127941b0
> [   67.274025] 
> [   67.277379] Redzone 88011908: cc cc cc cc cc cc cc cc
> [   67.280871] Object 880119080008: ce cd c8 81 ff ff ff ff 90 41 79 12 01 88 ff ff  .Ay.
> [   67.297444] Object 880119080018: 04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00
> [   67.301144] Object 880119080028: 50 6e 0a 81 ff ff ff ff 00 00 00 00 00 00 00 00  Pn..
> [   67.304870] Object 880119080038: a0 d3 0c 82 ff ff ff ff 40 f0 e4 81 ff ff ff ff  @...
> [   67.308378] Object 880119080048: e2 cd c8 81 ff ff ff ff 94 41 79 12 01 88 ff ff  .Ay.
> [   67.325144] Object 880119080058: 04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00
> [   67.328715] Object 880119080068: 50 6e 0a 81 ff ff ff ff 00 00 00 00 00 00 00 00  Pn..
> [   67.332349] Object 880119080078: a0 d3 0c 82 ff ff ff ff 40 f0 e4 81 ff ff ff ff  @...
> [   67.348963] Object 880119080088: f5 cd c8 81 ff ff ff ff 98 41 79 12 01 88 ff ff  .Ay.
> [   67.352342] Object 880119080098: 04 00 00 00 a4 01 00 00 00 00 00 00 00 00 00 00
> [   67.355934] Object 8801190800a8: 50 6e 0a 81 ff ff ff ff 00 00 00 00 00 00 00 00  Pn..
> [   67.359495] Object 8801190800b8: a0 d3 0c 82 ff ff ff ff 40 f0 e4 81 ff ff ff ff  @...
> [   67.376219] Object 8801190800c8: 08 ce c8 81 ff ff ff ff 9c 41 79 12 01 88 ff ff  .Ay.
> [   67.380179] Object 8801190800d8: 04 00 00 00 a4 01 00 00 00 00 00

Re: [PATCH v2] inotify: Convert to using per-namespace limits

2016-12-08 Thread Nikolay Borisov



On  8.12.2016 08:58, Nikolay Borisov wrote:
> 
> 
> On  8.12.2016 03:40, Eric W. Biederman wrote:
>> Nikolay Borisov <ker...@kyup.com> writes:
>>
>>> Gentle ping, now that rc1 has shipped and Jan's sysctl concern hopefully
>>> resolved.
>>
>> After getting slowed down by some fixes I am now taking a hard look at
>> your patch in the hopes of merging it.
>>
>> Did you happen to see the kbuild test robot boot failures and did you
>> happen to look into what caused them?  I have just skimmed them and it
>> appears to be related to your patch.
> 
> I saw them in the beginning but they did look like a generic memory
> corruption and I believe at the time those patches were submitted there
> was a lingering memory corruption hitting various patches. Thus I didn't
> think it was related to my patches. I've since left my work so been
> taking a bit of time off and haven't looked really hard, so those
> patches have been kind of lingering.
> 
> 
> But now that you mention it I will try and take a second look to see
> what might cause the memory corruption? Is there a way to force 0day to
> re-run them to see whether the failure was indeed caused by my patches
> or were intermittent?

OK, I took another look at the report. Bear in mind that the
corruption did indeed happen in retire_userns_sysctls. However, these
lines in the report lead me to believe that my patch is not the culprit:

[   65.527277] INFO: Allocated in setup_userns_sysctls+0x3f/0xa6 age=5 cpu=1 pid=418
[   65.558397] INFO: Freed in free_ctx+0x1d/0x20 age=6 cpu=0 pid=19


So the object was originally freed by a free_ctx function, which likely
caused the corruption, and no such function is involved in the code I'm
touching.
> 
> Regards,
> Nikolay
> 
> 
>>
>> Eric
>>
>> ___
>> Containers mailing list
>> contain...@lists.linux-foundation.org
>> https://lists.linuxfoundation.org/mailman/listinfo/containers
>>

Re: [PATCH v2] inotify: Convert to using per-namespace limits

2016-12-07 Thread Nikolay Borisov

On  8.12.2016 03:40, Eric W. Biederman wrote:
> Nikolay Borisov <ker...@kyup.com> writes:
> 
>> Gentle ping, now that rc1 has shipped and Jan's sysctl concern hopefully
>> resolved.
> 
> After getting slowed down by some fixes I am now taking a hard look at
> your patch in the hopes of merging it.
> 
> Did you happen to see the kbuild test robot boot failures and did you
> happen to look into what caused them?  I have just skimmed them and it
> appears to be related to your patch.

I saw them at the beginning, but they looked like generic memory
corruption, and I believe that at the time those patches were submitted
there was a lingering memory corruption hitting various patches. Thus I
didn't think it was related to my patches. I have since left my job and
have been taking some time off, so I haven't looked very hard and those
patches have been lingering.

But now that you mention it, I will take a second look to see what
might cause the memory corruption. Is there a way to force 0day to
re-run the tests to see whether the failures were indeed caused by my
patches or were intermittent?

Regards,
Nikolay

> 
> Eric
> 
> ___
> Containers mailing list
> contain...@lists.linux-foundation.org
> https://lists.linuxfoundation.org/mailman/listinfo/containers
>

list/count mismatch warning in rcu_do_batch (possibly triggered by a btrfs bug).

2016-11-15 Thread Nikolay Borisov

]
[1626691.339242]  [] kthread+0xef/0x110
[1626691.339412]  [] ? kthread_park+0x60/0x60
[1626691.339585]  [] ret_from_fork+0x3f/0x70
[1626691.339755]  [] ? kthread_park+0x60/0x60
[1626691.339924] ---[ end trace dacbbac64b357f79 ]---

That warning is this code in rcu_do_batch: 

WARN_ON_ONCE((rdp->nxtlist == NULL) != (rdp->qlen == 0));

Eventually the machine crashes in kmem_cache_alloc: 

[1626694.130460] BUG: unable to handle kernel paging request at 039ac000
[1626694.130731] IP: [] kmem_cache_alloc+0x77/0x220
[1626694.130954] PGD 29b86b067 PUD 38d8bd067 PMD 0 
[1626694.131260] Oops:  [#1] SMP 
[1626694.134847] CPU: 1 PID: 731 Comm: rsync Tainted: PB   W  O 4.4.26-clouder1 #3
[1626694.135135] Hardware name: Supermicro X9DRD-iF/LF/X9DRD-iF, BIOS 3.0b 12/05/2013
[1626694.135422] task: 88027bdb9b80 ti: 8801de078000 task.ti: 8801de078000
[1626694.135706] RIP: 0010:[]  [] kmem_cache_alloc+0x77/0x220
[1626694.136041] RSP: 0018:8801de07b900  EFLAGS: 00010282
[1626694.136210] RAX:  RBX: 02408840 RCX: 0089da3e
[1626694.136499] RDX: 0089da3d RSI: 0507 RDI: 81a0ce11
[1626694.136787] RBP: 8801de07b930 R08: 60fb80008b60 R09: 039ac000
[1626694.137071] R10: 8803d1ec3520 R11:  R12: 02408840
[1626694.139976] R13: a072c73a R14: 8803eb6d1900 R15: 8803eb6d1900
[1626694.140263] FS:  7fd7a4a38700() GS:88047fc2() knlGS:
[1626694.140562] CS:  0010 DS:  ES:  CR0: 80050033
[1626694.140736] CR2: 039ac000 CR3: 00039aec6000 CR4: 000406e0
[1626694.141027] Stack:
[1626694.141193]  8801de07b900 8801e5ffc000 4000 8801e5ffc000
[1626694.141670]  0399 4000 8801de07b960 a072c73a
[1626694.142139]  8801e5ffc000  0399 8801f7bdd908
[1626694.142608] Call Trace:
[1626694.142799]  [] __alloc_extent_buffer+0x2a/0xe0 [btrfs]
[1626694.142988]  [] alloc_extent_buffer+0x67/0x360 [btrfs]
[1626694.143175]  [] read_tree_block+0x20/0x70 [btrfs]
[1626694.143357]  [] read_block_for_search.isra.32+0x129/0x340 [btrfs]
[1626694.143657]  [] btrfs_search_slot+0x3e1/0x9d0 [btrfs]
[1626694.143830]  [] ? inode_init_always+0x105/0x1b0
[1626694.144014]  [] btrfs_lookup_inode+0x2f/0xa0 [btrfs]
[1626694.144202]  [] btrfs_iget+0xd7/0x6a0 [btrfs]
[1626694.144385]  [] btrfs_lookup_dentry+0x3e4/0x530 [btrfs]
[1626694.144570]  [] btrfs_lookup+0x12/0x40 [btrfs]
[1626694.144743]  [] lookup_real+0x1d/0x60
[1626694.144912]  [] __lookup_hash+0x33/0x40
[1626694.145084]  [] walk_component+0x212/0x4e0
[1626694.145255]  [] path_lookupat+0x5d/0x110
[1626694.145427]  [] filename_lookup+0x9a/0x110
[1626694.145614]  [] ? btrfs_delayed_update_inode+0x14d/0x4e0 [btrfs]
[1626694.145902]  [] ? getname_flags+0x37/0x1a0
[1626694.146073]  [] ? kmem_cache_alloc+0x1ba/0x220
[1626694.146245]  [] ? getname_flags+0x37/0x1a0
[1626694.146416]  [] user_path_at_empty+0x36/0x40
[1626694.146588]  [] vfs_fstatat+0x53/0xa0
[1626694.146758]  [] SYSC_newlstat+0x22/0x40
[1626694.146930]  [] SyS_newlstat+0xe/0x10
[1626694.147102]  [] entry_SYSCALL_64_fastpath+0x16/0x6e

I know most of this is outside your area of expertise, but I'm hoping
that the RCU corruption at least points in the right direction as to the
root cause. Under what conditions is it "expected" to have a list/count
mismatch when running the RCU callbacks? Is it plausible that memory
corruption induced by btrfs could have such an effect on core RCU data
structures? What exactly does the warning mean?

Regards, 
Nikolay

Re: [PATCH v2] inotify: Convert to using per-namespace limits

2016-10-24 Thread Nikolay Borisov



On 10/11/2016 10:36 AM, Nikolay Borisov wrote:
> This patchset converts inotify to using the newly introduced
> per-userns sysctl infrastructure.
> 
> Currently the inotify instances/watches are being accounted in the
> user_struct structure. This means that in setups where multiple
> users in unprivileged containers map to the same underlying
> real user (i.e. pointing to the same user_struct) the inotify limits
> are going to be shared as well, allowing one user (or application) to exhaust
> all others' limits.
> 
> Fix this by switching the inotify sysctls to using the
> per-namespace/per-user limits. This will allow the server admin to
> set sensible global limits, which can further be tuned inside every
> individual user namespace. Additionally, in order to preserve the
> sysctl ABI make the existing inotify instances/watches sysctls
> modify the values of the initial user namespace.
> 
> Signed-off-by: Nikolay Borisov <ker...@kyup.com>
> ---
> 
> So here is a revised version which retains the existing sysctls,
> and hooks them to the init_user_ns values. 

Gentle ping, now that rc1 has shipped and Jan's sysctl concern is
hopefully resolved.

> 
>  fs/notify/inotify/inotify.h  | 17 +
>  fs/notify/inotify/inotify_fsnotify.c |  6 ++
>  fs/notify/inotify/inotify_user.c | 34 +-
>  include/linux/fsnotify_backend.h |  3 ++-
>  include/linux/sched.h|  4 
>  include/linux/user_namespace.h   |  4 
>  kernel/ucount.c  |  6 +-
>  7 files changed, 47 insertions(+), 27 deletions(-)
> 
> diff --git a/fs/notify/inotify/inotify.h b/fs/notify/inotify/inotify.h
> index ed855ef6f077..b5536f8ad3e0 100644
> --- a/fs/notify/inotify/inotify.h
> +++ b/fs/notify/inotify/inotify.h
> @@ -30,3 +30,20 @@ extern int inotify_handle_event(struct fsnotify_group 
> *group,
>   const unsigned char *file_name, u32 cookie);
>  
>  extern const struct fsnotify_ops inotify_fsnotify_ops;
> +
> +#ifdef CONFIG_INOTIFY_USER
> +static void dec_inotify_instances(struct ucounts *ucounts)
> +{
> + dec_ucount(ucounts, UCOUNT_INOTIFY_INSTANCES);
> +}
> +
> +static struct ucounts *inc_inotify_watches(struct ucounts *ucounts)
> +{
> + return inc_ucount(ucounts->ns, ucounts->uid, UCOUNT_INOTIFY_WATCHES);
> +}
> +
> +static void dec_inotify_watches(struct ucounts *ucounts)
> +{
> + dec_ucount(ucounts, UCOUNT_INOTIFY_WATCHES);
> +}
> +#endif
> diff --git a/fs/notify/inotify/inotify_fsnotify.c 
> b/fs/notify/inotify/inotify_fsnotify.c
> index 2cd900c2c737..1e6b3cfc2cfd 100644
> --- a/fs/notify/inotify/inotify_fsnotify.c
> +++ b/fs/notify/inotify/inotify_fsnotify.c
> @@ -165,10 +165,8 @@ static void inotify_free_group_priv(struct 
> fsnotify_group *group)
>   /* ideally the idr is empty and we won't hit the BUG in the callback */
>   idr_for_each(&group->inotify_data.idr, idr_callback, group);
>   idr_destroy(&group->inotify_data.idr);
> - if (group->inotify_data.user) {
> - atomic_dec(&group->inotify_data.user->inotify_devs);
> - free_uid(group->inotify_data.user);
> - }
> + if (group->inotify_data.ucounts)
> + dec_inotify_instances(group->inotify_data.ucounts);
>  }
>  
>  static void inotify_free_event(struct fsnotify_event *fsn_event)
> diff --git a/fs/notify/inotify/inotify_user.c 
> b/fs/notify/inotify/inotify_user.c
> index b8d08d0d0a4d..7d769a824742 100644
> --- a/fs/notify/inotify/inotify_user.c
> +++ b/fs/notify/inotify/inotify_user.c
> @@ -44,10 +44,8 @@
>  
>  #include 
>  
> -/* these are configurable via /proc/sys/fs/inotify/ */
> -static int inotify_max_user_instances __read_mostly;
> +/* configurable via /proc/sys/fs/inotify/ */
>  static int inotify_max_queued_events __read_mostly;
> -static int inotify_max_user_watches __read_mostly;
>  
>  static struct kmem_cache *inotify_inode_mark_cachep __read_mostly;
>  
> @@ -60,7 +58,7 @@ static int zero;
>  struct ctl_table inotify_table[] = {
>   {
>   .procname   = "max_user_instances",
> - .data   = &inotify_max_user_instances,
> + .data   = &init_user_ns.ucount_max[UCOUNT_INOTIFY_INSTANCES],
>   .maxlen = sizeof(int),
>   .mode   = 0644,
>   .proc_handler   = proc_dointvec_minmax,
> @@ -68,7 +66,7 @@ struct ctl_table inotify_table[] = {
>   },
>   {
>   .procname   = "max_user_watches",
> - .data   = &inotify_max_user_watches,
> + .data   = 
> _user_

[PATCHv2] cephfs: Fix scheduler warning due to nested blocking

2016-10-11 Thread Nikolay Borisov

try_get_cap_refs can be used as a condition in wait_event* calls.
This is all fine until it has to call __ceph_do_pending_vmtruncate,
which in turn acquires the i_truncate_mutex. This leads to a situation
in which a task's state is !TASK_RUNNING and at the same time it's
trying to acquire a sleeping primitive. In essence, nested sleeping
primitives are being used. This causes the following warning:

WARNING: CPU: 22 PID: 11064 at kernel/sched/core.c:7631 __might_sleep+0x9f/0xb0()
do not call blocking ops when !TASK_RUNNING; state=1 set at [] prepare_to_wait_event+0x5d/0x110
 ipmi_msghandler tcp_scalable ib_qib dca ib_mad ib_core ib_addr ipv6
CPU: 22 PID: 11064 Comm: fs_checker.pl Tainted: G   O 4.4.20-clouder2 #6
Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1a 10/16/2015
  8838b416fa88 812f4409 8838b416fad0
 81a034f2 8838b416fac0 81052b46 81a0432c
 0061   88167bda54a0
Call Trace:
 [] dump_stack+0x67/0x9e
 [] warn_slowpath_common+0x86/0xc0
 [] warn_slowpath_fmt+0x4c/0x50
 [] ? prepare_to_wait_event+0x5d/0x110
 [] ? prepare_to_wait_event+0x5d/0x110
 [] __might_sleep+0x9f/0xb0
 [] mutex_lock+0x20/0x40
 [] __ceph_do_pending_vmtruncate+0x44/0x1a0 [ceph]
 [] try_get_cap_refs+0xa2/0x320 [ceph]
 [] ceph_get_caps+0x255/0x2b0 [ceph]
 [] ? wait_woken+0xb0/0xb0
 [] ceph_write_iter+0x2b1/0xde0 [ceph]
 [] ? schedule_timeout+0x202/0x260
 [] ? kmem_cache_free+0x1ea/0x200
 [] ? iput+0x9e/0x230
 [] ? __might_sleep+0x52/0xb0
 [] ? __might_fault+0x37/0x40
 [] ? cp_new_stat+0x153/0x170
 [] __vfs_write+0xaa/0xe0
 [] vfs_write+0xa9/0x190
 [] ? set_close_on_exec+0x31/0x70
 [] SyS_write+0x46/0xa0

This happens because wait_event_interruptible can interfere with the
mutex locking code, as they both fiddle with the task state.

Fix the issue by using the newly-added nested blocking infrastructure
in 61ada528dea0 ("sched/wait: Provide infrastructure to deal with
nested blocking").

Link: https://lwn.net/Articles/628628/
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 fs/ceph/caps.c | 12 +---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index c69e1253b47b..9d401520b981 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2467,6 +2467,7 @@ int ceph_get_caps(struct ceph_inode_info *ci, int need, 
int want,
  loff_t endoff, int *got, struct page **pinned_page)
 {
int _got, ret, err = 0;
+   DEFINE_WAIT_FUNC(wait, woken_wake_function);
 
ret = ceph_pool_perm_check(ci, need);
if (ret < 0)
@@ -2486,9 +2487,14 @@ int ceph_get_caps(struct ceph_inode_info *ci, int need, 
int want,
if (err < 0)
return err;
} else {
-   ret = wait_event_interruptible(ci->i_cap_wq,
-   try_get_cap_refs(ci, need, want, endoff,
-true, &_got, &err));
+   add_wait_queue(&ci->i_cap_wq, &wait);
+
+   while (!try_get_cap_refs(ci, need, want, endoff,
+ true, &_got, &err))
+   wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
+
+   remove_wait_queue(&ci->i_cap_wq, &wait);
+
if (err == -EAGAIN)
continue;
if (err < 0)
-- 
2.5.0

[PATCH] cephfs: Fix scheduler warning due to nested blocking

2016-10-11 Thread Nikolay Borisov

try_get_cap_refs can be used as a condition in wait_event* calls.
This is all fine until it has to call __ceph_do_pending_vmtruncate,
which in turn acquires the i_truncate_mutex. This leads to a situation
in which a task's state is !TASK_RUNNING and at the same time it's
trying to acquire a sleeping primitive. In essence, nested sleeping
primitives are being used. This causes the following warning:

WARNING: CPU: 22 PID: 11064 at kernel/sched/core.c:7631 
__might_sleep+0x9f/0xb0()
do not call blocking ops when !TASK_RUNNING; state=1 set at 
[] prepare_to_wait_event+0x5d/0x110
 ipmi_msghandler tcp_scalable ib_qib dca ib_mad ib_core ib_addr ipv6
CPU: 22 PID: 11064 Comm: fs_checker.pl Tainted: G   O
4.4.20-clouder2 #6
Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1a 10/16/2015
  8838b416fa88 812f4409 8838b416fad0
 81a034f2 8838b416fac0 81052b46 81a0432c
 0061   88167bda54a0
Call Trace:
 [] dump_stack+0x67/0x9e
 [] warn_slowpath_common+0x86/0xc0
 [] warn_slowpath_fmt+0x4c/0x50
 [] ? prepare_to_wait_event+0x5d/0x110
 [] ? prepare_to_wait_event+0x5d/0x110
 [] __might_sleep+0x9f/0xb0
 [] mutex_lock+0x20/0x40
 [] __ceph_do_pending_vmtruncate+0x44/0x1a0 [ceph]
 [] try_get_cap_refs+0xa2/0x320 [ceph]
 [] ceph_get_caps+0x255/0x2b0 [ceph]
 [] ? wait_woken+0xb0/0xb0
 [] ceph_write_iter+0x2b1/0xde0 [ceph]
 [] ? schedule_timeout+0x202/0x260
 [] ? kmem_cache_free+0x1ea/0x200
 [] ? iput+0x9e/0x230
 [] ? __might_sleep+0x52/0xb0
 [] ? __might_fault+0x37/0x40
 [] ? cp_new_stat+0x153/0x170
 [] __vfs_write+0xaa/0xe0
 [] vfs_write+0xa9/0x190
 [] ? set_close_on_exec+0x31/0x70
 [] SyS_write+0x46/0xa0

This happens because wait_event_interruptible can interfere with the
mutex locking code: both of them manipulate the task state.

Fix the issue by using the newly-added nested blocking infrastructure
from 61ada528dea0 ("sched/wait: Provide infrastructure to deal with
nested blocking").

Link: https://lwn.net/Articles/628628/
Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---
 fs/ceph/caps.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index c69e1253b47b..c6bf34e29ea4 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2467,6 +2467,7 @@ int ceph_get_caps(struct ceph_inode_info *ci, int need, int want,
 		  loff_t endoff, int *got, struct page **pinned_page)
 {
 	int _got, ret, err = 0;
+	DEFINE_WAIT_FUNC(wait, woken_wait_function);
 
 	ret = ceph_pool_perm_check(ci, need);
 	if (ret < 0)
@@ -2486,9 +2487,14 @@ int ceph_get_caps(struct ceph_inode_info *ci, int need, int want,
 			if (err < 0)
 				return err;
 		} else {
-			ret = wait_event_interruptible(ci->i_cap_wq,
-					try_get_cap_refs(ci, need, want, endoff,
-							 true, &_got, &err));
+			add_wait_queue(ci->i_cap_wq, &wait);
+
+			while (!try_get_cap_refs(ci, need, want, endoff,
+						 true, &_got, &err))
+				wait_woken(&wait, TASK_INTERRUPTIBLE, MAX_SCHEDULE_TIMEOUT);
+
+			remove_wait_queue(ci->i_cap_wq, &wait);
+
 			if (err == -EAGAIN)
 				continue;
 			if (err < 0)
-- 
2.5.0
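
The task-state clash described in the commit message above can be
illustrated with a small userspace toy model in C. This is only a
sketch under stated assumptions: every model_* and run_* name is
hypothetical, nothing here is a kernel API, and the "condition" simply
succeeds on its third poll. It shows a wait_event-style loop setting
the state before the condition runs (so the mutex inside the condition
sees a non-running task), and a wait_woken-style loop in which the
condition always runs in the running state.

```c
#include <assert.h>

/* Toy userspace model. The names mirror kernel concepts but are NOT kernel APIs. */
enum model_state { MODEL_RUNNING, MODEL_INTERRUPTIBLE };

static enum model_state state;
static int warnings; /* times a "blocking op" ran while state != RUNNING */
static int polls;    /* times the wait condition was evaluated */

static void model_reset(void)
{
	state = MODEL_RUNNING;
	warnings = 0;
	polls = 0;
}

/* Stand-in for mutex_lock(): a sleeping primitive that expects RUNNING. */
static void model_mutex_lock(void)
{
	if (state != MODEL_RUNNING)
		warnings++;        /* the case __might_sleep() warns about */
	state = MODEL_RUNNING; /* sleeping on the mutex overwrites the wait state */
}

/*
 * Stand-in for try_get_cap_refs(): the wait condition itself takes a mutex.
 * It reports success on the third poll (an arbitrary choice for the demo).
 */
static int model_condition(void)
{
	model_mutex_lock();
	return ++polls >= 3;
}

/* wait_event-style loop: the state is set before the condition runs. */
static int run_wait_event_style(void)
{
	model_reset();
	for (;;) {
		state = MODEL_INTERRUPTIBLE; /* like prepare_to_wait_event() */
		if (model_condition())
			break;
		/* schedule() would be called here */
	}
	state = MODEL_RUNNING;           /* like finish_wait() */
	return warnings;
}

/* Stand-in for wait_woken(): the state changes only inside the helper. */
static void model_wait_woken(void)
{
	assert(state == MODEL_RUNNING);  /* the condition ran while RUNNING */
	state = MODEL_INTERRUPTIBLE;
	/* schedule_timeout() would be called here */
	state = MODEL_RUNNING;
}

/* wait_woken-style loop: the condition is evaluated in the RUNNING state. */
static int run_wait_woken_style(void)
{
	model_reset();
	while (!model_condition())
		model_wait_woken();
	return warnings;
}
```

In this model run_wait_event_style() counts one warning per poll (three
in total), while run_wait_woken_style() counts none.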

[PATCH v2] inotify: Convert to using per-namespace limits

2016-10-11 Thread Nikolay Borisov

This patchset converts inotify to using the newly introduced
per-userns sysctl infrastructure.

Currently the inotify instances/watches are being accounted in the
user_struct structure. This means that in setups where multiple
users in unprivileged containers map to the same underlying
real user (i.e. pointing to the same user_struct) the inotify limits
are going to be shared as well, allowing one user (or application) to
exhaust all others' limits.

Fix this by switching the inotify sysctls to using the
per-namespace/per-user limits. This will allow the server admin to
set sensible global limits, which can further be tuned inside every
individual user namespace. Additionally, in order to preserve the
sysctl ABI, make the existing inotify instances/watches sysctls
modify the values of the initial user namespace.

Signed-off-by: Nikolay Borisov <ker...@kyup.com>
---

So here is a revised version which retains the existing sysctls,
and hooks them to the init_user_ns values. 

 fs/notify/inotify/inotify.h  | 17 +
 fs/notify/inotify/inotify_fsnotify.c |  6 ++
 fs/notify/inotify/inotify_user.c | 34 +-
 include/linux/fsnotify_backend.h |  3 ++-
 include/linux/sched.h|  4 
 include/linux/user_namespace.h   |  4 
 kernel/ucount.c  |  6 +-
 7 files changed, 47 insertions(+), 27 deletions(-)

diff --git a/fs/notify/inotify/inotify.h b/fs/notify/inotify/inotify.h
index ed855ef6f077..b5536f8ad3e0 100644
--- a/fs/notify/inotify/inotify.h
+++ b/fs/notify/inotify/inotify.h
@@ -30,3 +30,20 @@ extern int inotify_handle_event(struct fsnotify_group *group,
const unsigned char *file_name, u32 cookie);
 
 extern const struct fsnotify_ops inotify_fsnotify_ops;
+
+#ifdef CONFIG_INOTIFY_USER
+static void dec_inotify_instances(struct ucounts *ucounts)
+{
+   dec_ucount(ucounts, UCOUNT_INOTIFY_INSTANCES);
+}
+
+static struct ucounts *inc_inotify_watches(struct ucounts *ucounts)
+{
+   return inc_ucount(ucounts->ns, ucounts->uid, UCOUNT_INOTIFY_WATCHES);
+}
+
+static void dec_inotify_watches(struct ucounts *ucounts)
+{
+   dec_ucount(ucounts, UCOUNT_INOTIFY_WATCHES);
+}
+#endif
diff --git a/fs/notify/inotify/inotify_fsnotify.c b/fs/notify/inotify/inotify_fsnotify.c
index 2cd900c2c737..1e6b3cfc2cfd 100644
--- a/fs/notify/inotify/inotify_fsnotify.c
+++ b/fs/notify/inotify/inotify_fsnotify.c
@@ -165,10 +165,8 @@ static void inotify_free_group_priv(struct fsnotify_group *group)
 	/* ideally the idr is empty and we won't hit the BUG in the callback */
 	idr_for_each(&group->inotify_data.idr, idr_callback, group);
 	idr_destroy(&group->inotify_data.idr);
-	if (group->inotify_data.user) {
-		atomic_dec(&group->inotify_data.user->inotify_devs);
-		free_uid(group->inotify_data.user);
-	}
+	if (group->inotify_data.ucounts)
+		dec_inotify_instances(group->inotify_data.ucounts);
 }
 
 static void inotify_free_event(struct fsnotify_event *fsn_event)
diff --git a/fs/notify/inotify/inotify_user.c b/fs/notify/inotify/inotify_user.c
index b8d08d0d0a4d..7d769a824742 100644
--- a/fs/notify/inotify/inotify_user.c
+++ b/fs/notify/inotify/inotify_user.c
@@ -44,10 +44,8 @@
 
 #include 
 
-/* these are configurable via /proc/sys/fs/inotify/ */
-static int inotify_max_user_instances __read_mostly;
+/* configurable via /proc/sys/fs/inotify/ */
 static int inotify_max_queued_events __read_mostly;
-static int inotify_max_user_watches __read_mostly;
 
 static struct kmem_cache *inotify_inode_mark_cachep __read_mostly;
 
@@ -60,7 +58,7 @@ static int zero;
 struct ctl_table inotify_table[] = {
 	{
 		.procname	= "max_user_instances",
-		.data		= &inotify_max_user_instances,
+		.data		= &init_user_ns.ucount_max[UCOUNT_INOTIFY_INSTANCES],
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_minmax,
@@ -68,7 +66,7 @@ struct ctl_table inotify_table[] = {
 	},
 	{
 		.procname	= "max_user_watches",
-		.data		= &inotify_max_user_watches,
+		.data		= &init_user_ns.ucount_max[UCOUNT_INOTIFY_WATCHES],
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_minmax,
@@ -500,7 +498,7 @@ void inotify_ignored_and_remove_idr(struct fsnotify_mark *fsn_mark,
 	/* remove this mark from the idr */
 	inotify_remove_from_idr(group, i_mark);
 
-	atomic_dec(&group->inotify_data.user->inotify_watches);
+	dec_inotify_watches(group->inotify_data.ucounts);
 }
 
 /* ding dong the mark is dead */
@@ -584,14 +582,17 @@ static int inotify_new_watch(struct fsnotify_group *group,
tmp_i_mark->fsn_ma
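
The idea of a global limit that can be tightened per user namespace,
as described in the commit message above, can be sketched with a toy
model in C. It is a simplification under loud assumptions: struct
ns_model, model_inc() and model_dec() are hypothetical names, the
per-uid dimension of the real counters is dropped, and the sketch
assumes that an object is charged at every level of the namespace
hierarchy so that a child can never exceed what its ancestors allow.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified model of per-namespace counters (no per-uid split). */
struct ns_model {
	struct ns_model *parent; /* NULL for the initial namespace */
	int max;                 /* limit configured at this level */
	int count;               /* objects currently charged at this level */
};

/*
 * Charge one object to ns and to every ancestor.
 * Returns 1 on success. Returns 0, leaving all counters unchanged,
 * if any level is already at its limit.
 */
static int model_inc(struct ns_model *ns)
{
	struct ns_model *iter, *undo;

	for (iter = ns; iter; iter = iter->parent) {
		if (iter->count >= iter->max)
			goto fail;
		iter->count++;
	}
	return 1;

fail:
	/* Roll back the levels that were charged before the failing one. */
	for (undo = ns; undo != iter; undo = undo->parent)
		undo->count--;
	return 0;
}

/* Release one object previously charged with model_inc(). */
static void model_dec(struct ns_model *ns)
{
	for (; ns; ns = ns->parent) {
		assert(ns->count > 0);
		ns->count--;
	}
}
```

With an initial namespace limited to 3 and two child namespaces limited
to 2 each, one child can take at most 2 objects, and both children
together can take at most 3. Neither child can exhaust the other's
allowance beyond what the initial namespace permits.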

Re: [PATCH] inotify: Convert to using per-namespace limits

2016-10-10 Thread Nikolay Borisov

On Mon, Oct 10, 2016 at 11:49 PM, Eric W. Biederman
<ebied...@xmission.com> wrote:
> Jan Kara <j...@suse.cz> writes:
>
>> On Mon 10-10-16 09:44:19, Nikolay Borisov wrote:
>>> On 10/07/2016 09:14 PM, Eric W. Biederman wrote:
>>> > Nikolay Borisov <ker...@kyup.com> writes:
>>> >
>>> >> This patchset converts inotify to using the newly introduced
>>> >> per-userns sysctl infrastructure.
>>> >>
>>> >> Currently the inotify instances/watches are being accounted in the
>>> >> user_struct structure. This means that in setups where multiple
>>> >> users in unprivileged containers map to the same underlying
>>> >> real user (i.e. pointing to the same user_struct) the inotify limits
>>> >> are going to be shared as well, allowing one user (or application) to
>>> >> exhaust all others' limits.
>>> >>
>>> >> Fix this by switching the inotify sysctls to using the
>>> >> per-namespace/per-user limits. This will allow the server admin to
>>> >> set sensible global limits, which can further be tuned inside every
>>> >> individual user namespace.
>>> >>
>>> >> Signed-off-by: Nikolay Borisov <ker...@kyup.com>
>>> >> ---
>>> >> Hello Eric,
>>> >>
>>> >> I saw you've finally sent your pull request for 4.9 and it
>>> >> includes your implementation of the ucount infrastructure. So
>>> >> here is my respin of the inotify patches using that.
>>> >
>>> > Thanks.  I will take a good hard look at this after -rc1 when things are
>>> > stable enough that I can start a new development branch.
>>> >
>>> > I am a little concerned that the old sysctls have gone away.  If no one
>>> > cares it is fine, but if someone depends on them existing that may count
>>> > as an unnecessary userspace regression.  But otherwise skimming through
>>> > this code it looks good.
>>>
>>> So this is indeed a real issue and I meant to write something about
>>> it. Anyway, in order to preserve those sysctl what can be done is to
>>> hook them up with a custom sysctl handler taking the ns from the proc
>>> mount and the euid of current? I think this is a good approach, but
>>> let's wait and see if anyone will have objections to completely
>>> eliminating those sysctls.
>>
>> Well, I believe just discarding those sysctls is not an option - I'm pretty
>> sure there are scripts out there which tune these sysctls and those would
>> stop working. IMO not acceptable regression.
>
> Nikolay there is your objection.
>
> So since it should be straight forward let's preserve the existing
> sysctls.  Then this change doesn't need to prove there are no scripts
> that tweak those sysctls.
>
> We are just talking changing the values in the initial user namespace so
> it should be completely compatible and straight forward to implement
> unless I am missing something.

Well, I'm not so sure about this. Let's say those sysctls are going to
modify the ucount values in the init_user_ns. That's fine, but for
which particular user should they do this? Should it be hardcoded for
kuid 0, or should it be current_euid? I personally think they should
be changing the values for the current_euid.

>
> Eric

Re: [PATCHv2] ceph: Fix error handling in ceph_read_iter

2016-10-10 Thread Nikolay Borisov



On 10/10/2016 04:11 PM, Yan, Zheng wrote:
> 
>> On 10 Oct 2016, at 20:56, Nikolay Borisov <ker...@kyup.com> wrote:
>>
>> In case __ceph_do_getattr returns an error and the retry_op in
>> ceph_read_iter is not READ_INLINE, then it's possible to invoke
>> __free_page on a page which is NULL. This naturally leads to a crash.
>> This can happen when, for example, a process waiting on an MDS reply
>> receives SIGTERM.
>>
>> Fix this by explicitly checking whether the page is set or not.
>>
>> Signed-off-by: Nikolay Borisov <ker...@kyup.com>
>> Link: http://www.spinics.net/lists/ceph-users/msg31592.html
>> ---
>>
>> Inverted the condition, so resending with correct condition
>> this time. 
>>
>> fs/ceph/file.c | 3 ++-
>> 1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index 3c68e6aee2f0..7413313ae6c8 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -929,7 +929,8 @@ again:
>>  statret = __ceph_do_getattr(inode, page,
>>  CEPH_STAT_CAP_INLINE_DATA, !!page);
>>  if (statret < 0) {
>> - __free_page(page);
>> +if (page)
>> +__free_page(page);
>>  if (statret == -ENODATA) {
>>  BUG_ON(retry_op != READ_INLINE);
>>  goto again;
>> -- 
> Reviewed-by: Yan, Zheng <z...@redhat.com>

I believe this needs to also be tagged as stable. To whomever is going
to merge it: can you please do that?


> 
>> 2.5.0
>>
>

[PATCHv2] ceph: Fix error handling in ceph_read_iter

2016-10-10 Thread Nikolay Borisov

In case __ceph_do_getattr returns an error and the retry_op in
ceph_read_iter is not READ_INLINE, then it's possible to invoke
__free_page on a page which is NULL. This naturally leads to a crash.
This can happen when, for example, a process waiting on an MDS reply
receives SIGTERM.

Fix this by explicitly checking whether the page is set or not.

Signed-off-by: Nikolay Borisov <ker...@kyup.com>
Link: http://www.spinics.net/lists/ceph-users/msg31592.html
---

Inverted the condition, so resending with correct condition
this time. 

 fs/ceph/file.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 3c68e6aee2f0..7413313ae6c8 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -929,7 +929,8 @@ again:
statret = __ceph_do_getattr(inode, page,
CEPH_STAT_CAP_INLINE_DATA, !!page);
if (statret < 0) {
-__free_page(page);
+   if (page)
+   __free_page(page);
if (statret == -ENODATA) {
BUG_ON(retry_op != READ_INLINE);
goto again;
-- 
2.5.0
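
The guarded release in the patch above can be mirrored in a small
userspace sketch. It is only an illustration under stated assumptions:
struct page_model, model_free_page() and model_read_error_path() are
hypothetical names, and the assert() in the free helper stands in for
the crash that the commit message attributes to calling __free_page on
a NULL page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a page; not the kernel's struct page. */
struct page_model {
	int refcount;
};

static int frees; /* number of pages released so far */

/* Must not be handed NULL: the assert models the crash from the report. */
static void model_free_page(struct page_model *page)
{
	assert(page != NULL);
	page->refcount--;
	free(page);
	frees++;
}

/*
 * Models the error path: a page exists only for the inline-read case,
 * so the failure branch may release it only when it was allocated.
 * Returns statret on failure, 0 on success, -1 if allocation fails.
 */
static int model_read_error_path(int inline_read, int statret)
{
	struct page_model *page = NULL;

	if (inline_read) {
		page = calloc(1, sizeof(*page));
		if (!page)
			return -1;
		page->refcount = 1;
	}

	if (statret < 0) {
		if (page) /* the check the patch adds */
			model_free_page(page);
		return statret;
	}

	if (page)
		model_free_page(page);
	return 0;
}
```

Without the `if (page)` check, a call with inline_read == 0 and a
negative statret would reach model_free_page(NULL) and trip the assert.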

[PATCH] ceph: Fix error handling in ceph_read_iter

2016-10-10 Thread Nikolay Borisov

In case __ceph_do_getattr returns an error and the retry_op in
ceph_read_iter is not READ_INLINE, then it's possible to invoke
__free_page on a page which is NULL. This naturally leads to a crash.
This can happen when, for example, a process waiting on an MDS reply
receives SIGTERM.

Fix this by explicitly checking whether the page is set or not.

Signed-off-by: Nikolay Borisov <ker...@kyup.com>
Link: http://www.spinics.net/lists/ceph-users/msg31592.html
---
 fs/ceph/file.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 3c68e6aee2f0..7413313ae6c8 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -929,7 +929,8 @@ again:
statret = __ceph_do_getattr(inode, page,
CEPH_STAT_CAP_INLINE_DATA, !!page);
if (statret < 0) {
-__free_page(page);
+   if (!page)
+   __free_page(page);
if (statret == -ENODATA) {
BUG_ON(retry_op != READ_INLINE);
goto again;
-- 
2.5.0

Re: [PATCH] inotify: Convert to using per-namespace limits

2016-10-10 Thread Nikolay Borisov



On 10/07/2016 09:14 PM, Eric W. Biederman wrote:
> Nikolay Borisov <ker...@kyup.com> writes:
> 
>> This patchset converts inotify to using the newly introduced
>> per-userns sysctl infrastructure.
>>
>> Currently the inotify instances/watches are being accounted in the
>> user_struct structure. This means that in setups where multiple
>> users in unprivileged containers map to the same underlying
>> real user (i.e. pointing to the same user_struct) the inotify limits
>> are going to be shared as well, allowing one user (or application) to exhaust
>> all others' limits.
>>
>> Fix this by switching the inotify sysctls to using the
>> per-namespace/per-user limits. This will allow the server admin to
>> set sensible global limits, which can further be tuned inside every
>> individual user namespace.
>>
>> Signed-off-by: Nikolay Borisov <ker...@kyup.com>
>> ---
>> Hello Eric, 
>>
>> I saw you've finally sent your pull request for 4.9 and it 
>> includes your implementation of the ucount infrastructure. So 
>> here is my respin of the inotify patches using that.
> 
> Thanks.  I will take a good hard look at this after -rc1 when things are
> stable enough that I can start a new development branch.
> 
> I am a little concerned that the old sysctls have gone away.  If no one
> cares it is fine, but if someone depends on them existing that may count
> as an unnecessary userspace regression.  But otherwise skimming through
> this code it looks good.

So this is indeed a real issue and I meant to write something about
it. Anyway, in order to preserve those sysctls, how about hooking them
up with a custom sysctl handler taking the ns from the proc mount and
the euid of current? I think this is a good approach, but let's wait
and see if anyone will have objections to completely eliminating
those sysctls.


> 
[SNIP]
