[PATCH v2] vhost/vsock: specify module version

2024-09-29 Thread Alexander Mikhalitsyn
Add an explicit MODULE_VERSION("0.0.1") specification for the vhost_vsock
module.

This is useful because it allows userspace to check that vhost_vsock is
present even when it is configured as a built-in.

This is what we have *without* this change and when vhost_vsock is configured
as a module and loaded:

$ ls -la /sys/module/vhost_vsock
total 0
drwxr-xr-x   5 root root    0 Sep 29 19:00 .
drwxr-xr-x 337 root root    0 Sep 29 18:59 ..
-r--r--r--   1 root root 4096 Sep 29 20:05 coresize
drwxr-xr-x   2 root root    0 Sep 29 20:05 holders
-r--r--r--   1 root root 4096 Sep 29 20:05 initsize
-r--r--r--   1 root root 4096 Sep 29 20:05 initstate
drwxr-xr-x   2 root root    0 Sep 29 20:05 notes
-r--r--r--   1 root root 4096 Sep 29 20:05 refcnt
drwxr-xr-x   2 root root    0 Sep 29 20:05 sections
-r--r--r--   1 root root 4096 Sep 29 20:05 srcversion
-r--r--r--   1 root root 4096 Sep 29 20:05 taint
--w-------   1 root root 4096 Sep 29 19:00 uevent

When vhost_vsock is configured as a built-in, there is *no*
/sys/module/vhost_vsock directory at all, which is inconsistent.

With this change, when vhost_vsock is configured as a built-in we get:
$ ls -la /sys/module/vhost_vsock/
total 0
drwxr-xr-x   2 root root    0 Sep 26 15:59 .
drwxr-xr-x 100 root root    0 Sep 26 15:59 ..
--w-------   1 root root 4096 Sep 26 15:59 uevent
-r--r--r--   1 root root 4096 Sep 26 15:59 version
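
For illustration, a minimal userspace check that this enables could be as
simple as the following (not part of the patch; it relies only on the
/sys/module/vhost_vsock path shown above and works for both the module and
built-in cases once the version attribute is exported):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
	/* Present for both the loaded-module and built-in cases after this change. */
	if (access("/sys/module/vhost_vsock", F_OK) == 0)
		puts("vhost_vsock is available");
	else
		puts("vhost_vsock is not present");
	return 0;
}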

Signed-off-by: Alexander Mikhalitsyn 
---
 drivers/vhost/vsock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 802153e23073..287ea8e480b5 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -956,6 +956,7 @@ static void __exit vhost_vsock_exit(void)
 
 module_init(vhost_vsock_init);
 module_exit(vhost_vsock_exit);
+MODULE_VERSION("0.0.1");
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR("Asias He");
 MODULE_DESCRIPTION("vhost transport for vsock ");
-- 
2.34.1




[PATCH] vhost/vsock: specify module version

2024-09-26 Thread Alexander Mikhalitsyn
Add an explicit MODULE_VERSION("0.0.1") specification
for the vhost_vsock module. This is useful because it allows
userspace to check that vhost_vsock is present even when it is
configured as a built-in.

Without this change, a built-in vhost_vsock has no
/sys/module/vhost_vsock directory at all.

With this change:
$ ls -la /sys/module/vhost_vsock/
total 0
drwxr-xr-x   2 root root    0 Sep 26 15:59 .
drwxr-xr-x 100 root root    0 Sep 26 15:59 ..
--w-------   1 root root 4096 Sep 26 15:59 uevent
-r--r--r--   1 root root 4096 Sep 26 15:59 version

Signed-off-by: Alexander Mikhalitsyn 
---
 drivers/vhost/vsock.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
index 802153e23073..287ea8e480b5 100644
--- a/drivers/vhost/vsock.c
+++ b/drivers/vhost/vsock.c
@@ -956,6 +956,7 @@ static void __exit vhost_vsock_exit(void)
 
 module_init(vhost_vsock_init);
 module_exit(vhost_vsock_exit);
+MODULE_VERSION("0.0.1");
 MODULE_LICENSE("GPL v2");
 MODULE_AUTHOR("Asias He");
 MODULE_DESCRIPTION("vhost transport for vsock ");
-- 
2.34.1




Re: [PATCH net-next v17 02/14] mm: move the page fragment allocator from page_alloc into its own file

2024-09-04 Thread Alexander Duyck
On Mon, Sep 2, 2024 at 5:09 AM Yunsheng Lin  wrote:
>
> Inspired by [1], move the page fragment allocator from page_alloc
> into its own C file and header file, as we are about to make more
> changes to it so that it can replace another page_frag implementation
> in sock.c.
>
> As this patchset is going to replace 'struct page_frag' with
> 'struct page_frag_cache' in sched.h, including page_frag_cache.h
> in sched.h has a compiler error caused by interdependence between
> mm_types.h and mm.h for asm-offsets.c, see [2]. So avoid the compiler
> error by moving 'struct page_frag_cache' to mm_types_task.h as
> suggested by Alexander, see [3].
>
> 1. https://lore.kernel.org/all/20230411160902.4134381-3-dhowe...@redhat.com/
> 2. https://lore.kernel.org/all/15623dac-9358-4597-b3ee-3694a5956...@gmail.com/
> 3. 
> https://lore.kernel.org/all/CAKgT0UdH1yD=LSCXFJ=ym_aia4oomd-2wxyko42bizawmt_...@mail.gmail.com/
> CC: David Howells 
> CC: Alexander Duyck 
> Signed-off-by: Yunsheng Lin 
> Acked-by: Andrew Morton 
> ---
>  include/linux/gfp.h   |  22 ---
>  include/linux/mm_types.h  |  18 ---
>  include/linux/mm_types_task.h |  18 +++
>  include/linux/page_frag_cache.h   |  31 
>  include/linux/skbuff.h|   1 +
>  mm/Makefile   |   1 +
>  mm/page_alloc.c   | 136 
>  mm/page_frag_cache.c  | 145 ++
>  .../selftests/mm/page_frag/page_frag_test.c   |   2 +-
>  9 files changed, 197 insertions(+), 177 deletions(-)
>  create mode 100644 include/linux/page_frag_cache.h
>  create mode 100644 mm/page_frag_cache.c
>

Looks good to me.

Reviewed-by: Alexander Duyck 



Re: [PATCH net-next v17 01/14] mm: page_frag: add a test module for page_frag

2024-09-04 Thread Alexander Duyck
On Mon, Sep 2, 2024 at 5:09 AM Yunsheng Lin  wrote:
>
> The testing is done by ensuring that a fragment allocated
> from a page_frag_cache instance is pushed into a ptr_ring
> instance by a kthread bound to a specified CPU, and a kthread
> bound to a specified CPU pops the fragment from the
> ptr_ring and frees it.
>
> CC: Alexander Duyck 
> Signed-off-by: Yunsheng Lin 
> ---
>  tools/testing/selftests/mm/Makefile   |   3 +
>  tools/testing/selftests/mm/page_frag/Makefile |  18 ++
>  .../selftests/mm/page_frag/page_frag_test.c   | 170 +
>  tools/testing/selftests/mm/run_vmtests.sh |   8 +
>  tools/testing/selftests/mm/test_page_frag.sh  | 171 ++
>  5 files changed, 370 insertions(+)
>  create mode 100644 tools/testing/selftests/mm/page_frag/Makefile
>  create mode 100644 tools/testing/selftests/mm/page_frag/page_frag_test.c
>  create mode 100755 tools/testing/selftests/mm/test_page_frag.sh
>

...

> diff --git a/tools/testing/selftests/mm/test_page_frag.sh 
> b/tools/testing/selftests/mm/test_page_frag.sh
> new file mode 100755
> index ..d2b0734a90b5
> --- /dev/null
> +++ b/tools/testing/selftests/mm/test_page_frag.sh
> @@ -0,0 +1,171 @@

...

> +check_test_requirements()
> +{
> +   uid=$(id -u)
> +   if [ $uid -ne 0 ]; then
> +   echo "$0: Must be run as root"
> +   exit $ksft_skip
> +   fi
> +
> +   if ! which insmod > /dev/null 2>&1; then
> +   echo "$0: You need insmod installed"
> +   exit $ksft_skip
> +   fi
> +
> +   if [ ! -f $DRIVER ]; then
> +   echo "$0: You need to compile page_frag_test module"
> +   exit $ksft_skip
> +   fi
> +}
> +
> +run_nonaligned_check()
> +{
> +   echo "Run performance tests to evaluate how fast nonaligned alloc API 
> is."
> +
> +   insmod $DRIVER $NONALIGNED_PARAM > /dev/null 2>&1
> +   echo "Done."
> +   echo "Ccheck the kernel ring buffer to see the summary."

Typo, should be "Check".

> +}
> +
> +run_aligned_check()
> +{
> +   echo "Run performance tests to evaluate how fast aligned alloc API 
> is."
> +
> +   insmod $DRIVER $ALIGNED_PARAM > /dev/null 2>&1
> +   echo "Done."
> +   echo "Check the kernel ring buffer to see the summary."
> +}
> +

Other than the one typo it looks fine to me.

Reviewed-by: Alexander Duyck 
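
For readers unfamiliar with ptr_ring, the producer/consumer pattern the
commit message describes is roughly the following (a simplified sketch using
the current page_frag API names; includes, error handling and the kthread
setup are omitted, and this is not the test module itself):

	struct page_frag_cache nc = {};
	struct ptr_ring ring;
	void *frag;

	ptr_ring_init(&ring, 256, GFP_KERNEL);

	/* Producer kthread, bound to one CPU: allocate a fragment and push it. */
	frag = page_frag_alloc(&nc, 64, GFP_KERNEL);
	if (frag && ptr_ring_produce(&ring, frag))
		page_frag_free(frag);		/* ring full, drop the fragment */

	/* Consumer kthread, bound to another CPU: pop the fragment and free it. */
	frag = ptr_ring_consume(&ring);
	if (frag)
		page_frag_free(frag);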



[PATCH v4 15/15] fs/fuse/virtio_fs: allow idmapped mounts

2024-09-03 Thread Alexander Mikhalitsyn
Allow idmapped mounts for virtiofs.
This is safe because virtiofs uses the same feature
negotiation mechanism as classical FUSE filesystems,
so it does not affect any existing setups.

virtiofsd support:
https://gitlab.com/virtio-fs/virtiofsd/-/merge_requests/245
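
As background, once a filesystem advertises FS_ALLOW_IDMAP, userspace can
create an idmapped mount of it with the new mount API. A rough sketch (not
part of this patch; it assumes headers that define SYS_open_tree,
SYS_mount_setattr and SYS_move_mount, and a userns_fd that carries the
desired ID mapping):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/mount.h>

static int make_idmapped_mount(const char *src, const char *dst, int userns_fd)
{
	struct mount_attr attr = {
		.attr_set = MOUNT_ATTR_IDMAP,
		.userns_fd = userns_fd,
	};
	int tree_fd, ret = -1;

	/* Detach a private copy of the mount tree at @src. */
	tree_fd = syscall(SYS_open_tree, AT_FDCWD, src,
			  OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
	if (tree_fd < 0)
		return -1;

	/* Attach the ID mapping from @userns_fd, then mount the result at @dst. */
	if (syscall(SYS_mount_setattr, tree_fd, "", AT_EMPTY_PATH,
		    &attr, sizeof(attr)) == 0 &&
	    syscall(SYS_move_mount, tree_fd, "", AT_FDCWD, dst,
		    MOVE_MOUNT_F_EMPTY_PATH) == 0)
		ret = 0;

	close(tree_fd);
	return ret;
}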

Cc: Christian Brauner 
Cc: Seth Forshee 
Cc: Miklos Szeredi 
Cc: Vivek Goyal 
Cc: German Maglione 
Cc: Amir Goldstein 
Cc: Bernd Schubert 
Cc: 
Signed-off-by: Alexander Mikhalitsyn 
Reviewed-by: Christian Brauner 
---
v3:
- this commit added
---
 fs/fuse/virtio_fs.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index dd5260141615..7e5bbaef6f76 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1628,6 +1628,7 @@ static struct file_system_type virtio_fs_type = {
.name   = "virtiofs",
.init_fs_context = virtio_fs_init_fs_context,
.kill_sb= virtio_kill_sb,
+   .fs_flags   = FS_ALLOW_IDMAP,
 };
 
 static int virtio_fs_uevent(const struct kobject *kobj, struct kobj_uevent_env 
*env)
-- 
2.34.1




[PATCH v3 11/11] fs/fuse/virtio_fs: allow idmapped mounts

2024-08-15 Thread Alexander Mikhalitsyn
Allow idmapped mounts for virtiofs.
This is safe because virtiofs uses the same feature
negotiation mechanism as classical FUSE filesystems,
so it does not affect any existing setups.

virtiofsd support:
https://gitlab.com/virtio-fs/virtiofsd/-/merge_requests/245

Cc: Christian Brauner 
Cc: Seth Forshee 
Cc: Miklos Szeredi 
Cc: Vivek Goyal 
Cc: German Maglione 
Cc: Amir Goldstein 
Cc: Bernd Schubert 
Cc: 
Signed-off-by: Alexander Mikhalitsyn 
---
v3:
- this commit added
---
 fs/fuse/virtio_fs.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
index dd5260141615..7e5bbaef6f76 100644
--- a/fs/fuse/virtio_fs.c
+++ b/fs/fuse/virtio_fs.c
@@ -1628,6 +1628,7 @@ static struct file_system_type virtio_fs_type = {
.name   = "virtiofs",
.init_fs_context = virtio_fs_init_fs_context,
.kill_sb= virtio_kill_sb,
+   .fs_flags   = FS_ALLOW_IDMAP,
 };
 
 static int virtio_fs_uevent(const struct kobject *kobj, struct kobj_uevent_env 
*env)
-- 
2.34.1




Re: [PATCH net-next v12 04/14] mm: page_frag: add '_va' suffix to page_frag API

2024-08-05 Thread Alexander Duyck
On Sun, Aug 4, 2024 at 10:00 AM Yunsheng Lin  wrote:
>
> On 8/3/2024 1:00 AM, Alexander Duyck wrote:
>
> >>
> >>>
> >>> As far as your API extension and naming go, maybe you should look at
> >>> something like bio_vec and borrow the naming from that, since what you
> >>> are passing back and forth is essentially that, instead of a page frag,
> >>> which is normally a virtual address.
> >>
> >> I thought about adding something like bio_vec before, but I am not sure
> >> what you have in mind is something like what I considered before?
> >> Let's say that we reuse bio_vec like something below for the new APIs:
> >>
> >> struct bio_vec {
> >>  struct page *bv_page;
> >>  void*va;
> >>  unsigned intbv_len;
> >>  unsigned intbv_offset;
> >> };
> >
> > I wasn't suggesting changing the bio_vec. I was suggesting that be
> > what you pass as a pointer reference instead of the offset. Basically
> > your use case is mostly just for populating bio_vec style structures
> > anyway.
>
> I wasn't trying/going to reuse/change bio_vec for page_frag, I was just
> having a hard time coming with a good new name for it.
> The best one I came up with is pfrag_vec, but I am not sure about the
> 'vec' as the "vec" portion of the name would suggest, iovec structures
> tend to come in arrays, mentioned in the below article:
> https://lwn.net/Articles/625077/
>
> > Another one is page_frag, which is currently in use.
>
> Or any better one in your mind?

I was suggesting using bio_vec, not some new structure. The general
idea is that almost all the values you are using are exposed by that
structure already in the case of the page based calls you were adding,
so it makes sense to use what is there rather than reinventing the
wheel.

> >
> >> It seems we have the below options for the new API:
> >>
> >> option 1, it seems like a better option from API naming point of view, but
> >> it needs to return a bio_vec pointer to the caller, it seems we need to 
> >> have
> >> extra space for the pointer, I am not sure how we can avoid the memory 
> >> waste
> >> for sk_page_frag() case in patch 12:
> >> struct bio_vec *page_frag_alloc_bio(struct page_frag_cache *nc,
> >>  unsigned int fragsz, gfp_t gfp_mask);
> >>
> >> option 2, it needs both the caller and callee to have their own local space
> >> for 'struct bio_vec ', I am not sure if passing the content instead of
> >> the pointer of a struct through the function returning is the common 
> >> pattern
> >> and if it has any performance impact yet:
> >> struct bio_vec page_frag_alloc_bio(struct page_frag_cache *nc,
> >> unsigned int fragsz, gfp_t gfp_mask);
> >>
> >> option 3, the caller passes the pointer of 'struct bio_vec ' to the callee,
> >> and page_frag_alloc_bio() fills in the data, I am not sure what is the 
> >> point
> >> of indirect using 'struct bio_vec ' instead of passing 'va' & 'fragsz' &
> >> 'offset' through pointers directly:
> >> bool page_frag_alloc_bio(struct page_frag_cache *nc,
> >>   unsigned int fragsz, gfp_t gfp_mask, struct 
> >> bio_vec *bio);
> >>
> >> If one of the above option is something in your mind? Yes, please be more 
> >> specific
> >> about which one is the prefer option, and why it is the prefer option than 
> >> the one
> >> introduced in this patchset?
> >>
> >> If no, please be more specific what that is in your mind?
> >
> > Option 3 is more or less what I had in mind. Basically you would
> > return an int to indicate any errors and you would be populating a
> > bio_vec during your allocation. In addition you would use the bio_vec
>
> Actually, using this new bio_vec-style structure does not seem to solve
> the API naming issue this patch is trying to address, as I understand it,
> because the new struct is only about passing one pointer or multiple
> pointers from the API naming perspective. It is part of the API naming,
> but not all of it.

I have no idea what you are talking about. The issue was that you were
splitting things into page_frag_alloc_va and page_frag_alloc_pg. Now it
would be page_frag_alloc and page_frag_alloc_bio, or maybe
page_frag_fill_bio, which would better explain what you are doing with
this function.
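
To make the shape of that suggestion concrete, here is a purely illustrative
sketch of the "fill a caller-provided bio_vec" variant; the name and
signature are hypothetical, taken from the naming discussed above rather
than from any posted patch:

	/*
	 * Allocate a fragment from @nc and describe it in the caller-provided
	 * @bv (bv_page/bv_offset/bv_len) instead of returning a virtual address.
	 * Returns 0 on success or a negative errno on failure.
	 */
	int page_frag_fill_bio(struct page_frag_cache *nc, unsigned int fragsz,
			       gfp_t gfp_mask, struct bio_vec *bv);

	/* Caller side (nc and fragsz as in the existing page_frag users): */
	struct bio_vec bv;

	if (page_frag_fill_bio(nc, fragsz, GFP_ATOMIC, &bv))
		return -ENOMEM;
	/* bv.bv_page, bv.bv_offset and bv.bv_len now describe the fragment. */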

Re: [PATCH net-next v12 04/14] mm: page_frag: add '_va' suffix to page_frag API

2024-08-02 Thread Alexander Duyck
On Fri, Aug 2, 2024 at 3:05 AM Yunsheng Lin  wrote:
>
> On 2024/8/1 23:21, Alexander Duyck wrote:
> > On Thu, Aug 1, 2024 at 6:01 AM Yunsheng Lin  wrote:
> >>
> >> On 2024/8/1 2:13, Alexander Duyck wrote:
> >>> On Wed, Jul 31, 2024 at 5:50 AM Yunsheng Lin  
> >>> wrote:
> >>>>
> >>>> Currently the page_frag API is returning 'virtual address'
> >>>> or 'va' when allocing and expecting 'virtual address' or
> >>>> 'va' as input when freeing.
> >>>>
> >>>> As we are about to support new use cases that the caller
> >>>> need to deal with 'struct page' or need to deal with both
> >>>> 'va' and 'struct page'. In order to differentiate the API
> >>>> handling between 'va' and 'struct page', add '_va' suffix
> >>>> to the corresponding API mirroring the page_pool_alloc_va()
> >>>> API of the page_pool. So that callers expecting to deal with
> >>>> va, page or both va and page may call page_frag_alloc_va*,
> >>>> page_frag_alloc_pg*, or page_frag_alloc* API accordingly.
> >>>>
> >>>> CC: Alexander Duyck 
> >>>> Signed-off-by: Yunsheng Lin 
> >>>> Reviewed-by: Subbaraya Sundeep 
> >>>
> >>> I am naking this patch. It is a pointless rename that is just going to
> >>> obfuscate the git history for these callers.
> >>
> >> I responded to your above similar comment in v2, and then responded more
> >> detailedly in v11, both got not direct responding, it would be good to
> >> have more concrete feedback here instead of abstract argument.
> >>
> >> https://lore.kernel.org/all/74e7259a-c462-e3c1-73ac-8e3f49fb8...@huawei.com/
> >> https://lore.kernel.org/all/11187fe4-9419-4341-97b5-6dad7583b...@huawei.com/
> >
> > I will make this much more understandable. This patch is one of the
> > ones that will permanently block this set in my opinion. As such I
> > will never ack this patch as I see no benefit to it. Arguing with me
> > on this is moot as you aren't going to change my mind, and I don't
> > have all day to argue back and forth with you on every single patch.
>
> Let's move on to more specific technical discussion then.
>
> >
> > As far as your API extension and naming go, maybe you should look at
> > something like bio_vec and borrow the naming from that, since what you
> > are passing back and forth is essentially that, instead of a page frag,
> > which is normally a virtual address.
>
> I thought about adding something like bio_vec before, but I am not sure
> what you have in mind is something like what I considered before?
> Let's say that we reuse bio_vec like something below for the new APIs:
>
> struct bio_vec {
> struct page *bv_page;
> void*va;
> unsigned intbv_len;
> unsigned intbv_offset;
> };

I wasn't suggesting changing the bio_vec. I was suggesting that be
what you pass as a pointer reference instead of the offset. Basically
your use case is mostly just for populating bio_vec style structures
anyway.

> It seems we have the below options for the new API:
>
> option 1, it seems like a better option from API naming point of view, but
> it needs to return a bio_vec pointer to the caller, it seems we need to have
> extra space for the pointer, I am not sure how we can avoid the memory waste
> for sk_page_frag() case in patch 12:
> struct bio_vec *page_frag_alloc_bio(struct page_frag_cache *nc,
> unsigned int fragsz, gfp_t gfp_mask);
>
> option 2, it needs both the caller and callee to have their own local space
> for 'struct bio_vec ', I am not sure if passing the content instead of
> the pointer of a struct through the function returning is the common pattern
> and if it has any performance impact yet:
> struct bio_vec page_frag_alloc_bio(struct page_frag_cache *nc,
>unsigned int fragsz, gfp_t gfp_mask);
>
> option 3, the caller passes the pointer of 'struct bio_vec ' to the callee,
> and page_frag_alloc_bio() fills in the data, I am not sure what is the point
> of indirect using 'struct bio_vec ' instead of passing 'va' & 'fragsz' &
> 'offset' through pointers directly:
> bool page_frag_alloc_bio(struct page_frag_cache *nc,
>  unsigned int fragsz, gfp_t gfp_mask, struct bio_vec 
> *bio);

Re: [PATCH net-next v12 04/14] mm: page_frag: add '_va' suffix to page_frag API

2024-08-01 Thread Alexander Duyck
On Thu, Aug 1, 2024 at 6:01 AM Yunsheng Lin  wrote:
>
> On 2024/8/1 2:13, Alexander Duyck wrote:
> > On Wed, Jul 31, 2024 at 5:50 AM Yunsheng Lin  wrote:
> >>
> >> Currently the page_frag API is returning 'virtual address'
> >> or 'va' when allocing and expecting 'virtual address' or
> >> 'va' as input when freeing.
> >>
> >> As we are about to support new use cases that the caller
> >> need to deal with 'struct page' or need to deal with both
> >> 'va' and 'struct page'. In order to differentiate the API
> >> handling between 'va' and 'struct page', add '_va' suffix
> >> to the corresponding API mirroring the page_pool_alloc_va()
> >> API of the page_pool. So that callers expecting to deal with
> >> va, page or both va and page may call page_frag_alloc_va*,
> >> page_frag_alloc_pg*, or page_frag_alloc* API accordingly.
> >>
> >> CC: Alexander Duyck 
> >> Signed-off-by: Yunsheng Lin 
> >> Reviewed-by: Subbaraya Sundeep 
> >
> > I am naking this patch. It is a pointless rename that is just going to
> > obfuscate the git history for these callers.
>
> I responded to your above similar comment in v2, and then responded more
> detailedly in v11, both got not direct responding, it would be good to
> have more concrete feedback here instead of abstract argument.
>
> https://lore.kernel.org/all/74e7259a-c462-e3c1-73ac-8e3f49fb8...@huawei.com/
> https://lore.kernel.org/all/11187fe4-9419-4341-97b5-6dad7583b...@huawei.com/

I will make this much more understandable. This patch is one of the
ones that will permanently block this set in my opinion. As such I
will never ack this patch as I see no benefit to it. Arguing with me
on this is moot as you aren't going to change my mind, and I don't
have all day to argue back and forth with you on every single patch.

As far as your API extension and naming go, maybe you should look at
something like bio_vec and borrow the naming from that, since what you
are passing back and forth is essentially that, instead of a page frag,
which is normally a virtual address.



Re: [PATCH net-next v12 04/14] mm: page_frag: add '_va' suffix to page_frag API

2024-07-31 Thread Alexander Duyck
On Wed, Jul 31, 2024 at 5:50 AM Yunsheng Lin  wrote:
>
> Currently the page_frag API is returning 'virtual address'
> or 'va' when allocing and expecting 'virtual address' or
> 'va' as input when freeing.
>
> As we are about to support new use cases that the caller
> need to deal with 'struct page' or need to deal with both
> 'va' and 'struct page'. In order to differentiate the API
> handling between 'va' and 'struct page', add '_va' suffix
> to the corresponding API mirroring the page_pool_alloc_va()
> API of the page_pool. So that callers expecting to deal with
> va, page or both va and page may call page_frag_alloc_va*,
> page_frag_alloc_pg*, or page_frag_alloc* API accordingly.
>
> CC: Alexander Duyck 
> Signed-off-by: Yunsheng Lin 
> Reviewed-by: Subbaraya Sundeep 

I am naking this patch. It is a pointless rename that is just going to
obfuscate the git history for these callers.

As I believe I said before I would prefer to see this work more like
the handling of __get_free_pages and __free_pages in terms of the use
of pages versus pointers and/or longs. Pushing this API aside because
you want to reuse the name for something different isn't a valid
reason to rename an existing API and will just lead to confusion.



Re: [RFC v11 05/14] mm: page_frag: avoid caller accessing 'page_frag_cache' directly

2024-07-21 Thread Alexander H Duyck
On Fri, 2024-07-19 at 17:33 +0800, Yunsheng Lin wrote:
> Use appropriate frag_page API instead of caller accessing
> 'page_frag_cache' directly.
> 
> CC: Alexander Duyck 
> Signed-off-by: Yunsheng Lin 
> ---
>  drivers/vhost/net.c |  2 +-
>  include/linux/page_frag_cache.h | 10 ++
>  mm/page_frag_test.c |  2 +-
>  net/core/skbuff.c   |  6 +++---
>  net/rxrpc/conn_object.c |  4 +---
>  net/rxrpc/local_object.c|  4 +---
>  net/sunrpc/svcsock.c|  6 ++
>  7 files changed, 19 insertions(+), 15 deletions(-)
> 

Looks fine to me.

Reviewed-by: Alexander Duyck 




Re: [RFC v11 04/14] mm: page_frag: add '_va' suffix to page_frag API

2024-07-21 Thread Alexander Duyck
On Fri, Jul 19, 2024 at 2:37 AM Yunsheng Lin  wrote:
>
> Currently the page_frag API is returning 'virtual address'
> or 'va' when allocing and expecting 'virtual address' or
> 'va' as input when freeing.
>
> As we are about to support new use cases that the caller
> need to deal with 'struct page' or need to deal with both
> 'va' and 'struct page'. In order to differentiate the API
> handling between 'va' and 'struct page', add '_va' suffix
> to the corresponding API mirroring the page_pool_alloc_va()
> API of the page_pool. So that callers expecting to deal with
> va, page or both va and page may call page_frag_alloc_va*,
> page_frag_alloc_pg*, or page_frag_alloc* API accordingly.
>
> CC: Alexander Duyck 
> Signed-off-by: Yunsheng Lin 
> Reviewed-by: Subbaraya Sundeep 

Rather than renaming the existing API, I would prefer to see this follow
the same approach we use with the other memory subsystem functions.
A specific example is free_page(), which is essentially passed a
virtual address, while the double-underscore version is passed a
page. I would be more okay with us renaming the double-underscore
version of any functions we might have to address that, rather than
renaming all the functions with "va".
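
For reference, that long-standing convention looks like this (error checks
omitted):

	/* Address-based pair: works on virtual addresses. */
	unsigned long addr = __get_free_page(GFP_KERNEL);
	free_page(addr);

	/* Page-based pair: works on struct page pointers. */
	struct page *page = alloc_page(GFP_KERNEL);
	__free_page(page);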

In general I would say this patch is adding no value as what it is
doing is essentially pushing the primary users of this API to change
to support use cases that won't impact most of them. It is just
creating a ton of noise in terms of changes with no added value so we
can reuse the function names.



Re: [PATCH v6 16/39] kmsan: Expose KMSAN_WARN_ON()

2024-06-21 Thread Alexander Potapenko
On Fri, Jun 21, 2024 at 2:26 AM Ilya Leoshkevich  wrote:
>
> KMSAN_WARN_ON() is required for implementing s390-specific KMSAN
> functions, but right now it's available only to the KMSAN internal
> functions. Expose it to subsystems through .
>
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH v6 32/39] s390/ptdump: Add KMSAN page markers

2024-06-21 Thread Alexander Potapenko
On Fri, Jun 21, 2024 at 2:27 AM Ilya Leoshkevich  wrote:
>
> Add KMSAN vmalloc metadata areas to kernel_page_tables.
>
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH v5 13/37] kmsan: Support SLAB_POISON

2024-06-20 Thread Alexander Potapenko
On Wed, Jun 19, 2024 at 5:45 PM Ilya Leoshkevich  wrote:
>
> Avoid false KMSAN negatives with SLUB_DEBUG by allowing
> kmsan_slab_free() to poison the freed memory, and by preventing
> init_object() from unpoisoning new allocations by using __memset().
>
> There are two alternatives to this approach. First, init_object()
> can be marked with __no_sanitize_memory. This annotation should be used
> with great care, because it drops all instrumentation from the
> function, and any shadow writes will be lost. Even though this is not a
> concern with the current init_object() implementation, this may change
> in the future.
>
> Second, kmsan_poison_memory() calls may be added after memset() calls.
> The downside is that init_object() is called from
> free_debug_processing(), in which case poisoning will erase the
> distinction between simply uninitialized memory and UAF.
>
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH v5 36/37] s390/kmsan: Implement the architecture-specific functions

2024-06-20 Thread Alexander Potapenko
On Thu, Jun 20, 2024 at 4:18 PM Alexander Potapenko  wrote:
>
> On Thu, Jun 20, 2024 at 3:38 PM Ilya Leoshkevich  wrote:
> >
> > On Thu, 2024-06-20 at 11:25 +0200, Alexander Gordeev wrote:
> > > On Wed, Jun 19, 2024 at 05:44:11PM +0200, Ilya Leoshkevich wrote:
> > >
> > > Hi Ilya,
> > >
> > > > +static inline bool is_lowcore_addr(void *addr)
> > > > +{
> > > > +   return addr >= (void *)&S390_lowcore &&
> > > > +  addr < (void *)(&S390_lowcore + 1);
> > > > +}
> > > > +
> > > > +static inline void *arch_kmsan_get_meta_or_null(void *addr, bool
> > > > is_origin)
> > > > +{
> > > > +   if (is_lowcore_addr(addr)) {
> > > > +   /*
> > > > +* Different lowcores accessed via S390_lowcore
> > > > are described
> > > > +* by the same struct page. Resolve the prefix
> > > > manually in
> > > > +* order to get a distinct struct page.
> > > > +*/
> > >
> > > > +   addr += (void
> > > > *)lowcore_ptr[raw_smp_processor_id()] -
> > > > +   (void *)&S390_lowcore;
> > >
> > > If I am not mistaken neither raw_smp_processor_id() itself, nor
> > > lowcore_ptr[raw_smp_processor_id()] are atomic. Should the preemption
> > > be disabled while the addr is calculated?
> > >
> > > But then the question arises - how meaningful the returned value is?
> > > AFAICT kmsan_get_metadata() is called from a preemptable context.
> > > So if the CPU is changed - how useful the previous CPU lowcore meta
> > > is?
> >
> > This code path will only be triggered by instrumented code that
> > accesses lowcore. That code is supposed to disable preemption;
> > if it didn't, it's a bug in that code and it should be fixed there.
> >
> > >
> > > Is it a memory block that needs to be ignored instead?
> > >
> > > > +   if (WARN_ON_ONCE(is_lowcore_addr(addr)))
> > > > +   return NULL;
> > >
> > > lowcore_ptr[] pointing into S390_lowcore is rather a bug.
> >
> > Right, but AFAIK BUG() calls are discouraged. I guess in a debug tool
> > the rules are more relaxed, but we can recover from this condition here
> > easily, that's why I still went for WARN_ON_ONCE().
>
> We have KMSAN_WARN_ON() for that, sorry for not pointing it out
> earlier: https://elixir.bootlin.com/linux/latest/source/mm/kmsan/kmsan.h#L46

Apart from that:

Reviewed-by: Alexander Potapenko 



Re: [PATCH v5 36/37] s390/kmsan: Implement the architecture-specific functions

2024-06-20 Thread Alexander Potapenko
On Thu, Jun 20, 2024 at 3:38 PM Ilya Leoshkevich  wrote:
>
> On Thu, 2024-06-20 at 11:25 +0200, Alexander Gordeev wrote:
> > On Wed, Jun 19, 2024 at 05:44:11PM +0200, Ilya Leoshkevich wrote:
> >
> > Hi Ilya,
> >
> > > +static inline bool is_lowcore_addr(void *addr)
> > > +{
> > > +   return addr >= (void *)&S390_lowcore &&
> > > +  addr < (void *)(&S390_lowcore + 1);
> > > +}
> > > +
> > > +static inline void *arch_kmsan_get_meta_or_null(void *addr, bool
> > > is_origin)
> > > +{
> > > +   if (is_lowcore_addr(addr)) {
> > > +   /*
> > > +* Different lowcores accessed via S390_lowcore
> > > are described
> > > +* by the same struct page. Resolve the prefix
> > > manually in
> > > +* order to get a distinct struct page.
> > > +*/
> >
> > > +   addr += (void
> > > *)lowcore_ptr[raw_smp_processor_id()] -
> > > +   (void *)&S390_lowcore;
> >
> > If I am not mistaken neither raw_smp_processor_id() itself, nor
> > lowcore_ptr[raw_smp_processor_id()] are atomic. Should the preemption
> > be disabled while the addr is calculated?
> >
> > But then the question arises - how meaningful the returned value is?
> > AFAICT kmsan_get_metadata() is called from a preemptable context.
> > So if the CPU is changed - how useful the previous CPU lowcore meta
> > is?
>
> This code path will only be triggered by instrumented code that
> accesses lowcore. That code is supposed to disable preemption;
> if it didn't, it's a bug in that code and it should be fixed there.
>
> >
> > Is it a memory block that needs to be ignored instead?
> >
> > > +   if (WARN_ON_ONCE(is_lowcore_addr(addr)))
> > > +   return NULL;
> >
> > lowcore_ptr[] pointing into S390_lowcore is rather a bug.
>
> Right, but AFAIK BUG() calls are discouraged. I guess in a debug tool
> the rules are more relaxed, but we can recover from this condition here
> easily, that's why I still went for WARN_ON_ONCE().

We have KMSAN_WARN_ON() for that, sorry for not pointing it out
earlier: https://elixir.bootlin.com/linux/latest/source/mm/kmsan/kmsan.h#L46
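
In the hunk quoted above that would simply become (sketch; like WARN_ON(),
KMSAN_WARN_ON() evaluates to its condition, so the control flow stays the
same):

	if (KMSAN_WARN_ON(is_lowcore_addr(addr)))
		return NULL;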



Re: [PATCH v5 36/37] s390/kmsan: Implement the architecture-specific functions

2024-06-20 Thread Alexander Gordeev
On Wed, Jun 19, 2024 at 05:44:11PM +0200, Ilya Leoshkevich wrote:
> arch_kmsan_get_meta_or_null() finds the lowcore shadow by querying the
> prefix and calling kmsan_get_metadata() again.
> 
> kmsan_virt_addr_valid() delegates to virt_addr_valid().
> 
> Signed-off-by: Ilya Leoshkevich 
> ---
>  arch/s390/include/asm/kmsan.h | 59 +++
>  1 file changed, 59 insertions(+)
>  create mode 100644 arch/s390/include/asm/kmsan.h


Acked-by: Alexander Gordeev 



Re: [PATCH v5 33/37] s390/uaccess: Add KMSAN support to put_user() and get_user()

2024-06-20 Thread Alexander Potapenko
On Thu, Jun 20, 2024 at 1:19 PM Ilya Leoshkevich  wrote:
>
> On Thu, 2024-06-20 at 10:36 +0200, Alexander Potapenko wrote:
> > On Wed, Jun 19, 2024 at 5:45 PM Ilya Leoshkevich 
> > wrote:
> > >
> > > put_user() uses inline assembly with precise constraints, so Clang
> > > is
> > > in principle capable of instrumenting it automatically.
> > > Unfortunately,
> > > one of the constraints contains a dereferenced user pointer, and
> > > Clang
> > > does not currently distinguish user and kernel pointers. Therefore
> > > KMSAN attempts to access shadow for user pointers, which is not a
> > > right
> > > thing to do.
> >
> > By the way, how does this problem manifest?
> > I was expecting KMSAN to generate dummy shadow accesses in this case,
> > and reading/writing 1-8 bytes from dummy shadow shouldn't be a
> > problem.
> >
> > (On the other hand, not inlining the get_user/put_user functions is
> > probably still faster than retrieving the dummy shadow, so I'm fine
> > either way)
>
> We have two problems here: not only clang can't distinguish user and
> kernel pointers, the KMSAN runtime - which is supposed to clean that
> up - can't do that either due to overlapping kernel and user address
> spaces on s390. So the instrumentation ultimately tries to access the
> real shadow.
>
> I forgot what the consequences of that were exactly, so I reverted the
> patch and now I get:
>
> Unable to handle kernel pointer dereference in virtual kernel address
> space
> Failing address: 03fed25fa000 TEID: 03fed25fa403
> Fault in home space mode while using kernel ASCE.
> AS:05a70007 R3:824d8007 S:0020
> Oops: 0010 ilc:2 [#1] SMP
> Modules linked in:
> CPU: 3 PID: 1 Comm: init Tainted: GBN 6.10.0-rc4-
> g8aadb00f495e #11
> Hardware name: IBM 3931 A01 704 (KVM/Linux)
> Krnl PSW : 0704c0018000 03ffe288975a (memset+0x3a/0xa0)
>R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
> Krnl GPRS:  03fed25fa180 03fed25fa180
> 03ffe28897a6
>0007 03ffe000 
> 02ee06e68190
>02ee06f19000 03fed25fa180 03ffd25fa180
> 03ffd25fa180
>0008  03ffe17262e0
> 037ee000f730
> Krnl Code: 03ffe288974c: 41101100   la  %r1,256(%r1)
>03ffe2889750: a737fffb   brctg
> %r3,03ffe2889746
>   #03ffe2889754: c0300029   larl
> %r3,03ffe28897a6
>   >03ffe288975a: 44403000   ex  %r4,0(%r3)
>03ffe288975e: 07fe   bcr 15,%r14
>03ffe2889760: a74f0001   cghi%r4,1
>03ffe2889764: b9040012   lgr %r1,%r2
>03ffe2889768: a784001c   brc
> 8,03ffe28897a0
> Call Trace:
>  [<03ffe288975a>] memset+0x3a/0xa0
> ([<03ffe17262bc>] kmsan_internal_set_shadow_origin+0x21c/0x3a0)
>  [<03ffe1725fb6>] kmsan_internal_unpoison_memory+0x26/0x30
>  [<03ffe1c1c646>] create_elf_tables+0x13c6/0x2620
>  [<03ffe1c0ebaa>] load_elf_binary+0x50da/0x68f0
>  [<03ffe18c41fc>] bprm_execve+0x201c/0x2f40
>  [<03ffe18bff9a>] kernel_execve+0x2cda/0x2d00
>  [<03ffe49b745a>] kernel_init+0x9ba/0x1630
>  [<03ffe000cd5c>] __ret_from_fork+0xbc/0x180
>  [<03ffe4a1907a>] ret_from_fork+0xa/0x30
> Last Breaking-Event-Address:
>  [<03ffe2889742>] memset+0x22/0xa0
> Kernel panic - not syncing: Fatal exception: panic_on_oops
>
> So is_bad_asm_addr() returned false for a userspace address.
> Why? Because it happened to collide with the kernel modules area:
> precisely the effect of overlapping.
>
> VMALLOC_START: 0x37ee000
> VMALLOC_END:   0x3a96000
> MODULES_VADDR: 0x3ff6000
> Address:   0x3ffd157a580
> MODULES_END:   0x3ffe000

I see, thanks for the clarification!

> Now the question is, why do we crash when accessing shadow for modules?
> I'll need to investigate, this does not look normal. But even if that
> worked, we clearly wouldn't want userspace accesses to pollute module
> shadow, so I think we need this patch in its current form.

Ok, it indeed makes sense.

Reviewed-by: Alexander Potapenko 



Re: [PATCH v5 36/37] s390/kmsan: Implement the architecture-specific functions

2024-06-20 Thread Alexander Gordeev
On Wed, Jun 19, 2024 at 05:44:11PM +0200, Ilya Leoshkevich wrote:

Hi Ilya,

> +static inline bool is_lowcore_addr(void *addr)
> +{
> + return addr >= (void *)&S390_lowcore &&
> +addr < (void *)(&S390_lowcore + 1);
> +}
> +
> +static inline void *arch_kmsan_get_meta_or_null(void *addr, bool is_origin)
> +{
> + if (is_lowcore_addr(addr)) {
> + /*
> +  * Different lowcores accessed via S390_lowcore are described
> +  * by the same struct page. Resolve the prefix manually in
> +  * order to get a distinct struct page.
> +  */

> + addr += (void *)lowcore_ptr[raw_smp_processor_id()] -
> + (void *)&S390_lowcore;

If I am not mistaken neither raw_smp_processor_id() itself, nor
lowcore_ptr[raw_smp_processor_id()] are atomic. Should the preemption
be disabled while the addr is calculated?

But then the question arises - how meaningful the returned value is?
AFAICT kmsan_get_metadata() is called from a preemptable context.
So if the CPU is changed - how useful the previous CPU lowcore meta is?

Is it a memory block that needs to be ignored instead?

> + if (WARN_ON_ONCE(is_lowcore_addr(addr)))
> + return NULL;

lowcore_ptr[] pointing into S390_lowcore is rather a bug.

> + return kmsan_get_metadata(addr, is_origin);
> + }
> + return NULL;
> +}

Thanks!



Re: [PATCH v5 17/37] mm: slub: Disable KMSAN when checking the padding bytes

2024-06-20 Thread Alexander Potapenko
On Wed, Jun 19, 2024 at 5:45 PM Ilya Leoshkevich  wrote:
>
> Even though the KMSAN warnings generated by memchr_inv() are suppressed
> by metadata_access_enable(), its return value may still be poisoned.
>
> The reason is that the last iteration of memchr_inv() returns
> `*start != value ? start : NULL`, where *start is poisoned. Because of
> this, somewhat counterintuitively, the shadow value computed by
> visitSelectInst() is equal to `(uintptr_t)start`.
>
> One possibility to fix this, since the intention behind guarding
> memchr_inv() behind metadata_access_enable() is to touch poisoned
> metadata without triggering KMSAN, is to unpoison its return value.
> However, this approach is too fragile. So simply disable the KMSAN
> checks in the respective functions.
>
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH v5 33/37] s390/uaccess: Add KMSAN support to put_user() and get_user()

2024-06-20 Thread Alexander Potapenko
ize] "K" (size)   \
> : "cc", "0");   \
> -   __rc;   \
> -})
> +   return rc;  \
> +}  \
> +   \
> +static __always_inline int \
> +__get_user_##type(unsigned type *to, unsigned type __user *from,   \
> + unsigned long size)   \
> +{  \
> +   int rc; \
> +   \
> +   rc = __get_user_##type##_noinstr(to, from, size);   \
> +   instrument_get_user(*to);   \
> +   return rc;  \
> +}
> +
> +DEFINE_GET_USER(char);
> +DEFINE_GET_USER(short);
> +DEFINE_GET_USER(int);
> +DEFINE_GET_USER(long);
>
>  static __always_inline int __get_user_fn(void *x, const void __user *ptr, 
> unsigned long size)
>  {
> @@ -163,24 +210,24 @@ static __always_inline int __get_user_fn(void *x, const 
> void __user *ptr, unsign
>
> switch (size) {
> case 1:
> -   rc = __get_user_asm((unsigned char *)x,
> -   (unsigned char __user *)ptr,
> -   size);
> +   rc = __get_user_char((unsigned char *)x,
> +(unsigned char __user *)ptr,
> +size);
> break;
> case 2:
> -   rc = __get_user_asm((unsigned short *)x,
> -   (unsigned short __user *)ptr,
> -   size);
> +   rc = __get_user_short((unsigned short *)x,
> + (unsigned short __user *)ptr,
> + size);
> break;
> case 4:
> -   rc = __get_user_asm((unsigned int *)x,
> +   rc = __get_user_int((unsigned int *)x,
>     (unsigned int __user *)ptr,
> size);
> break;
> case 8:
> -   rc = __get_user_asm((unsigned long *)x,
> -   (unsigned long __user *)ptr,
> -   size);
> +   rc = __get_user_long((unsigned long *)x,
> +(unsigned long __user *)ptr,
> +size);
> break;
> default:
> __get_user_bad();
> --
> 2.45.1
>
> --
> You received this message because you are subscribed to the Google Groups 
> "kasan-dev" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to kasan-dev+unsubscr...@googlegroups.com.
> To view this discussion on the web visit 
> https://groups.google.com/d/msgid/kasan-dev/20240619154530.163232-34-iii%40linux.ibm.com.



-- 
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg



Re: [PATCH v5 34/37] s390/uaccess: Add the missing linux/instrumented.h #include

2024-06-20 Thread Alexander Potapenko
On Wed, Jun 19, 2024 at 5:45 PM Ilya Leoshkevich  wrote:
>
> uaccess.h uses instrument_get_user() and instrument_put_user(), which
> are defined in linux/instrumented.h. Currently we get this header from
> somewhere else by accident; prefer to be explicit about it and include
> it directly.
>
> Suggested-by: Alexander Potapenko 
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH v5 12/37] kmsan: Introduce memset_no_sanitize_memory()

2024-06-20 Thread Alexander Potapenko
On Wed, Jun 19, 2024 at 5:45 PM Ilya Leoshkevich  wrote:
>
> Add a wrapper for memset() that prevents unpoisoning. This is useful
> for filling memory allocator redzones.
>
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 

> ---
>  include/linux/kmsan.h | 13 +
>  1 file changed, 13 insertions(+)
>
> diff --git a/include/linux/kmsan.h b/include/linux/kmsan.h
> index 23de1b3d6aee..5f50885f2023 100644
> --- a/include/linux/kmsan.h
> +++ b/include/linux/kmsan.h
> @@ -255,6 +255,14 @@ void kmsan_enable_current(void);
>   */
>  void kmsan_disable_current(void);
>
> +/*
> + * memset_no_sanitize_memory(): memset() without KMSAN instrumentation.
> + */
Please make this a doc comment, like in the rest of the file.
(Please also fix kmsan_enable_current/kmsan_disable_current in the
respective patch)
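
I.e. something along these lines (a sketch of the kernel-doc form; the exact
wording is of course up to you):

/**
 * memset_no_sanitize_memory - fill memory without KMSAN unpoisoning it.
 * @s: destination.
 * @c: byte to fill @s with.
 * @n: number of bytes to fill.
 *
 * Useful for filling memory allocator redzones while keeping them poisoned.
 */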



Re: [PATCH v4 35/35] kmsan: Enable on s390

2024-06-18 Thread Alexander Potapenko
On Thu, Jun 13, 2024 at 5:40 PM Ilya Leoshkevich  wrote:
>
> Now that everything else is in place, enable KMSAN in Kconfig.
>
> Acked-by: Heiko Carstens 
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH v4 09/35] kmsan: Expose kmsan_get_metadata()

2024-06-18 Thread Alexander Potapenko
On Thu, Jun 13, 2024 at 5:39 PM Ilya Leoshkevich  wrote:
>
> Each s390 CPU has lowcore pages associated with it. Each CPU sees its
> own lowcore at virtual address 0 through a hardware mechanism called
> prefixing. Additionally, all lowcores are mapped to non-0 virtual
> addresses stored in the lowcore_ptr[] array.
>
> When lowcore is accessed through virtual address 0, one needs to
> resolve metadata for lowcore_ptr[raw_smp_processor_id()].
>
> Expose kmsan_get_metadata() to make it possible to do this from the
> arch code.
>
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH v4 12/35] kmsan: Support SLAB_POISON

2024-06-18 Thread Alexander Potapenko
On Fri, Jun 14, 2024 at 1:44 AM Ilya Leoshkevich  wrote:
>
> On Thu, 2024-06-13 at 16:30 -0700, SeongJae Park wrote:
> > Hi Ilya,
> >
> > On Thu, 13 Jun 2024 17:34:14 +0200 Ilya Leoshkevich
> >  wrote:
> >
> > > Avoid false KMSAN negatives with SLUB_DEBUG by allowing
> > > kmsan_slab_free() to poison the freed memory, and by preventing
> > > init_object() from unpoisoning new allocations by using __memset().
> > >
> > > There are two alternatives to this approach. First, init_object()
> > > can be marked with __no_sanitize_memory. This annotation should be
> > > used
> > > with great care, because it drops all instrumentation from the
> > > function, and any shadow writes will be lost. Even though this is
> > > not a
> > > concern with the current init_object() implementation, this may
> > > change
> > > in the future.
> > >
> > > Second, kmsan_poison_memory() calls may be added after memset()
> > > calls.
> > > The downside is that init_object() is called from
> > > free_debug_processing(), in which case poisoning will erase the
> > > distinction between simply uninitialized memory and UAF.
> > >
> > > Signed-off-by: Ilya Leoshkevich 
> > > ---
> > >  mm/kmsan/hooks.c |  2 +-
> > >  mm/slub.c| 13 +
> > >  2 files changed, 10 insertions(+), 5 deletions(-)
> > >
> > [...]
> > > --- a/mm/slub.c
> > > +++ b/mm/slub.c
> > > @@ -1139,7 +1139,12 @@ static void init_object(struct kmem_cache
> > > *s, void *object, u8 val)
> > > unsigned int poison_size = s->object_size;
> > >
> > > if (s->flags & SLAB_RED_ZONE) {
> > > -   memset(p - s->red_left_pad, val, s->red_left_pad);
> > > +   /*
> > > +* Use __memset() here and below in order to avoid
> > > overwriting
> > > +* the KMSAN shadow. Keeping the shadow makes it
> > > possible to
> > > +* distinguish uninit-value from use-after-free.
> > > +*/
> > > +   __memset(p - s->red_left_pad, val, s-
> > > >red_left_pad);
> >
> > I found my build test[1] fails with below error on latest mm-unstable
> > branch.
> > 'git bisect' points me this patch.
> >
> >   CC  mm/slub.o
> > /mm/slub.c: In function 'init_object':
> > /mm/slub.c:1147:17: error: implicit declaration of function
> > '__memset'; did you mean 'memset'? [-Werror=implicit-function-
> > declaration]
> >  1147 | __memset(p - s->red_left_pad, val, s-
> > >red_left_pad);
> >   | ^~~~
> >   | memset
> > cc1: some warnings being treated as errors
> >
> > I haven't looked in deep, but reporting first.  Do you have any idea?
> >
> > [1]
> > https://github.com/awslabs/damon-tests/blob/next/corr/tests/build_m68k.sh
> >
> >
> > Thanks,
> > SJ
> >
> > [...]
>
> Thanks for the report.
>
> Apparently not all architectures have __memset(). We should probably go
> back to memset_no_sanitize_memory() [1], but this time mark it with
> noinline __maybe_unused __no_sanitize_memory, like it's done in, e.g.,
> 32/35.
>
> Alexander, what do you think?

We could probably go without __no_sanitize_memory assuming that
platforms supporting KMSAN always have __memset():

  #if defined(CONFIG_KMSAN)
  static inline void *memset_no_sanitize_memory(void *s, int c, size_t n)
  {
  return __memset(s, c, n);
  }
  #else
  static inline void *memset_no_sanitize_memory(void *s, int c, size_t n)
  {
  return memset(s, c, n);
  }
  #endif
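
With a wrapper like that, the init_object() hunk from the patch could stay
arch-independent, e.g. (sketch):

  /* Keep the KMSAN shadow so uninit-value and use-after-free stay distinguishable. */
  if (s->flags & SLAB_RED_ZONE)
          memset_no_sanitize_memory(p - s->red_left_pad, val, s->red_left_pad);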



Re: [PATCH v4 16/35] mm: slub: Unpoison the memchr_inv() return value

2024-06-18 Thread Alexander Potapenko
On Thu, Jun 13, 2024 at 5:39 PM Ilya Leoshkevich  wrote:
>
> Even though the KMSAN warnings generated by memchr_inv() are suppressed
> by metadata_access_enable(), its return value may still be poisoned.
>
> The reason is that the last iteration of memchr_inv() returns
> `*start != value ? start : NULL`, where *start is poisoned. Because of
> this, somewhat counterintuitively, the shadow value computed by
> visitSelectInst() is equal to `(uintptr_t)start`.
>
> The intention behind guarding memchr_inv() behind
> metadata_access_enable() is to touch poisoned metadata without
> triggering KMSAN, so unpoison its return value.

What do you think about applying __no_kmsan_checks to these functions instead?
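
E.g. roughly the following; the helper name is hypothetical and only meant
to show where the annotation goes:

  __no_kmsan_checks
  static bool slab_pad_is_clean(const void *pad, size_t len)
  {
          /*
           * KMSAN still instruments this function but does not report
           * uses of uninitialized values inside it, so the poisoned
           * memchr_inv() result is tolerated.
           */
          return memchr_inv(pad, POISON_INUSE, len) == NULL;
  }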



Re: [PATCH v4 14/35] kmsan: Do not round up pg_data_t size

2024-06-18 Thread Alexander Potapenko
On Thu, Jun 13, 2024 at 5:39 PM Ilya Leoshkevich  wrote:
>
> x86's alloc_node_data() rounds up node data size to PAGE_SIZE. It's not
> explained why it's needed, but it's most likely for performance
> reasons, since the padding bytes are not used anywhere. Some other
> architectures do it as well, e.g., mips rounds it up to the cache line
> size.
>
> kmsan_init_shadow() initializes metadata for each node data and assumes
> the x86 rounding, which does not match other architectures. This may
> cause the range end to overshoot the end of available memory, in turn
> causing virt_to_page_or_null() in kmsan_init_alloc_meta_for_range() to
> return NULL, which leads to kernel panic shortly after.
>
> Since the padding bytes are not used, drop the rounding.

Nice catch, thanks!

> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH v4 15/35] mm: slub: Let KMSAN access metadata

2024-06-18 Thread Alexander Potapenko
On Thu, Jun 13, 2024 at 5:39 PM Ilya Leoshkevich  wrote:
>
> Building the kernel with CONFIG_SLUB_DEBUG and CONFIG_KMSAN causes
> KMSAN to complain about touching redzones in kfree().
>
> Fix by extending the existing KASAN-related metadata_access_enable()
> and metadata_access_disable() functions to KMSAN.
>
> Acked-by: Vlastimil Babka 
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH v4 11/35] kmsan: Allow disabling KMSAN checks for the current task

2024-06-18 Thread Alexander Potapenko
On Thu, Jun 13, 2024 at 5:39 PM Ilya Leoshkevich  wrote:
>
> Like for KASAN, it's useful to temporarily disable KMSAN checks around,
> e.g., redzone accesses. Introduce kmsan_disable_current() and
> kmsan_enable_current(), which are similar to their KASAN counterparts.
>
> Make them reentrant in order to handle memory allocations in interrupt
> context. Repurpose the allow_reporting field for this.

I am still a bit reluctant, because these nested counters always end
up being inconsistent.
But your patch series fixes support for SLUB_DEBUG, and I don't have
better ideas how to do this.

Could you please extend "Disabling the instrumentation" in kmsan.rst
so that it explains the new enable/disable API?
I think we should mention that the users need to be careful with it,
keeping the regions short and preferring other ways to disable
instrumentation, where possible.
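
Even a short usage sketch like this would help (placeholder names, just to
show the intended pattern):

	kmsan_disable_current();
	/* Short region that deliberately touches poisoned metadata, e.g. a redzone. */
	check_redzone_bytes(object, size);	/* placeholder for the guarded access */
	kmsan_enable_current();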

> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-18 Thread Alexander Graf


On 18.06.24 12:21, Ard Biesheuvel wrote:

On Mon, 17 Jun 2024 at 23:01, Alexander Graf  wrote:

On 17.06.24 22:40, Steven Rostedt wrote:

On Mon, 17 Jun 2024 09:07:29 +0200
Alexander Graf  wrote:


Hey Steve,


I believe we're talking about 2 different things :). Let me rephrase a
bit and make a concrete example.

Imagine you have passed the "reserve_mem=12M:4096:trace" kernel command
line option. The kernel now comes up and allocates a random chunk of
memory that - by (admittedly good) chance - may be at the same physical
location as before. Let's assume it deemed 0x100 as a good offset.

Note, it's not random. It picks from the top of available memory every
time. But things can mess with it (see below).


Let's now assume you're running on a UEFI system. There, you always have
non-volatile storage available to you even in the pre-boot phase. That
means the kernel could create a UEFI variable that says "12M:4096:trace
-> 0x100". The pre-boot phase takes all these UEFI variables and
marks them as reserved. When you finally reach your command line parsing
logic for reserve_mem=, you can flip all reservations that were not on
the command line back to normal memory.

That way you have pretty much guaranteed persistent memory regions, even
with KASLR changing your memory layout across boots.

The nice thing is that the above is an extension of what you've already
built: Systems with UEFI simply get better guarantees that their regions
persist.

This could be an added feature, but it is very architecture specific,
and would likely need architecture specific updates.


It definitely would be an added feature, yes. But one that allows you to
ensure persistence a lot more safely :).



Requirement:

Need a way to reserve memory that will be at a consistent location for
every boot, if the kernel and system are the same. Does not need to work
if rebooting to a different kernel, or if the system can change the
memory layout between boots.

The reserved memory can not be an hard coded address, as the same kernel /
command line needs to run on several different machines. The picked memory
reservation just needs to be the same for a given machine, but may be

With KASLR is enabled, doesn't this approach break too often to be
reliable enough for the data you want to extract?

Picking up the idea above, with a persistent variable we could even make
KASLR avoid that reserved pstore region in its search for a viable KASLR
offset.

I think I was hit by it once in all my testing. For our use case, the
few times it fails to map is not going to affect what we need this for
at all.

Once is pretty good. Do you know why? Also once out of how many runs? Is
the randomness source not as random as it should be or are the number of
bits for KASLR maybe so few on your target architecture that the odds of
hitting anything become low? Do these same constraints hold true outside
of your testing environment?

So I just ran it a hundred times in a loop. I added a patch to print
the location of "_text". The loop was this:

for i in `seq 100`; do
  ssh root@debiantesting-x86-64 "dmesg | grep -e 'text starts' -e 'mapped boot' >> text; grub-reboot '1>0'; sleep 0.5; reboot"
  sleep 25
done

It searches dmesg for my added printk as well as the print of were the
ring buffer was loaded in physical memory.

It takes about 15 seconds to reboot, so I waited 25. The results are
attached. I found that it was consistent 76 times, which means 1 out of
4 it's not. Funny enough, it broke whenever it loaded the kernel below
0x1. And then it would be off by a little.

It was consistently at:

0x27d00

And when it failed, it was at 0x27ce0.

Note, when I used the e820 tables to do this, I never saw a failure. My
assumption is that when it is below 0x1, something else gets
allocated causing this to get pushed down.


Thinking about it again: What if you run the allocation super early (see
arch/x86/boot/compressed/kaslr.c:handle_mem_options())?

That code is not used by EFI boot anymore.

In general, I would recommend (and have recommended) against these
kinds of hacks in mainline, because -especially on x86- there is
always someone that turns up with some kind of convoluted use case
that gets broken if we try to change anything in the boot code.

I spent considerable time over the past year making the EFI/x86 boot
code compatible with the new MS imposed requirements on PC boot
firmware (related to secure boot and NX restrictions on memory
mappings). This involved some radical refactoring of the boot
sequence, including the KASLR logic. Adding fragile code there that
will result in regressions observable to end users when it gets broken
is really not what I'd like to see.

So I would personally prefer for this code not to go in at all. But if
it does go in (and Steven has a

Re: [PATCH v4 32/35] s390/uaccess: Add KMSAN support to put_user() and get_user()

2024-06-18 Thread Alexander Potapenko
On Tue, Jun 18, 2024 at 11:40 AM Ilya Leoshkevich  wrote:
>
> On Tue, 2024-06-18 at 11:24 +0200, Alexander Potapenko wrote:
> > On Thu, Jun 13, 2024 at 5:39 PM Ilya Leoshkevich 
> > wrote:
> > >
> > > put_user() uses inline assembly with precise constraints, so Clang
> > > is
> > > in principle capable of instrumenting it automatically.
> > > Unfortunately,
> > > one of the constraints contains a dereferenced user pointer, and
> > > Clang
> > > does not currently distinguish user and kernel pointers. Therefore
> > > KMSAN attempts to access shadow for user pointers, which is not a
> > > right
> > > thing to do.
> > >
> > > An obvious fix to add __no_sanitize_memory to __put_user_fn() does
> > > not
> > > work, since it's __always_inline. And __always_inline cannot be
> > > removed
> > > due to the __put_user_bad() trick.
> > >
> > > A different obvious fix of using the "a" instead of the "+Q"
> > > constraint
> > > degrades the code quality, which is very important here, since it's
> > > a
> > > hot path.
> > >
> > > Instead, repurpose the __put_user_asm() macro to define
> > > __put_user_{char,short,int,long}_noinstr() functions and mark them
> > > with
> > > __no_sanitize_memory. For the non-KMSAN builds make them
> > > __always_inline in order to keep the generated code quality. Also
> > > define __put_user_{char,short,int,long}() functions, which call the
> > > aforementioned ones and which *are* instrumented, because they call
> > > KMSAN hooks, which may be implemented as macros.
> >
> > I am not really familiar with s390 assembly, but I think you still
> > need to call kmsan_copy_to_user() and kmsan_copy_from_user() to
> > properly initialize the copied data and report infoleaks.
> > Would it be possible to insert calls to linux/instrumented.h hooks
> > into uaccess functions?
>
> Aren't the existing instrument_get_user() / instrument_put_user() calls
> sufficient?

Oh, sorry, I overlooked them. Yes, those should be sufficient.
But you don't include linux/instrumented.h, do you?



Re: [PATCH v4 26/35] s390/diag: Unpoison diag224() output buffer

2024-06-18 Thread Alexander Potapenko
On Thu, Jun 13, 2024 at 5:40 PM Ilya Leoshkevich  wrote:
>
> Diagnose 224 stores 4k bytes, which currently cannot be deduced from
> the inline assembly constraints. This leads to KMSAN false positives.
>
> Fix the constraints by using a 4k-sized struct instead of a raw
> pointer. While at it, prettify them too.
>
> Suggested-by: Heiko Carstens 
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH v4 32/35] s390/uaccess: Add KMSAN support to put_user() and get_user()

2024-06-18 Thread Alexander Potapenko
On Thu, Jun 13, 2024 at 5:39 PM Ilya Leoshkevich  wrote:
>
> put_user() uses inline assembly with precise constraints, so Clang is
> in principle capable of instrumenting it automatically. Unfortunately,
> one of the constraints contains a dereferenced user pointer, and Clang
> does not currently distinguish user and kernel pointers. Therefore
> KMSAN attempts to access shadow for user pointers, which is not a right
> thing to do.
>
> An obvious fix to add __no_sanitize_memory to __put_user_fn() does not
> work, since it's __always_inline. And __always_inline cannot be removed
> due to the __put_user_bad() trick.
>
> A different obvious fix of using the "a" instead of the "+Q" constraint
> degrades the code quality, which is very important here, since it's a
> hot path.
>
> Instead, repurpose the __put_user_asm() macro to define
> __put_user_{char,short,int,long}_noinstr() functions and mark them with
> __no_sanitize_memory. For the non-KMSAN builds make them
> __always_inline in order to keep the generated code quality. Also
> define __put_user_{char,short,int,long}() functions, which call the
> aforementioned ones and which *are* instrumented, because they call
> KMSAN hooks, which may be implemented as macros.

I am not really familiar with s390 assembly, but I think you still
need to call kmsan_copy_to_user() and kmsan_copy_from_user() to
properly initialize the copied data and report infoleaks.
Would it be possible to insert calls to linux/instrumented.h hooks
into uaccess functions?
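
For readers who want to see the shape of the construct described in the quoted
commit message, a rough sketch follows. The attribute macro, the function names
and the asm body are placeholders rather than the actual s390 code; only the
wrapper/inner split and the linux/instrumented.h hook are the point.

/* Needs <linux/instrumented.h>; everything below is illustrative only. */
#ifdef CONFIG_KMSAN
#define __putuser_attr noinline __no_sanitize_memory
#else
#define __putuser_attr __always_inline
#endif

static __putuser_attr int
__put_user_int_noinstr(unsigned int __user *ptr, unsigned int *x)
{
	int rc = 0;

	/* Placeholder for the real "+Q"-constrained store instruction. */
	asm volatile("" : "+Q" (*ptr) : "d" (*x) : "cc");
	return rc;
}

static __always_inline int
__put_user_int(unsigned int __user *ptr, unsigned int *x)
{
	/* Instrumented wrapper: KMSAN sees the user copy via this hook. */
	instrument_put_user(*x, ptr, sizeof(*x));
	return __put_user_int_noinstr(ptr, x);
}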



Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-17 Thread Alexander Graf

[resend because Thunderbird decided to send the previous version as HTML :(]


On 17.06.24 22:40, Steven Rostedt wrote:

On Mon, 17 Jun 2024 09:07:29 +0200
Alexander Graf  wrote:


Hey Steve,


I believe we're talking about 2 different things :). Let me rephrase a
bit and make a concrete example.

Imagine you have passed the "reserve_mem=12M:4096:trace" kernel command
line option. The kernel now comes up and allocates a random chunk of
memory that - by (admittedly good) chance - may be at the same physical
location as before. Let's assume it deemed 0x100 as a good offset.

Note, it's not random. It picks from the top of available memory every
time. But things can mess with it (see below).


Let's now assume you're running on a UEFI system. There, you always have
non-volatile storage available to you even in the pre-boot phase. That
means the kernel could create a UEFI variable that says "12M:4096:trace
-> 0x100". The pre-boot phase takes all these UEFI variables and
marks them as reserved. When you finally reach your command line parsing
logic for reserve_mem=, you can flip all reservations that were not on
the command line back to normal memory.

That way you have pretty much guaranteed persistent memory regions, even
with KASLR changing your memory layout across boots.

The nice thing is that the above is an extension of what you've already
built: Systems with UEFI simply get better guarantees that their regions
persist.

This could be an added feature, but it is very architecture specific,
and would likely need architecture specific updates.



It definitely would be an added feature, yes. But one that allows you to 
ensure persistence a lot more safely :).




Requirement:

Need a way to reserve memory that will be at a consistent location for
every boot, if the kernel and system are the same. Does not need to work
if rebooting to a different kernel, or if the system can change the
memory layout between boots.

The reserved memory cannot be a hard-coded address, as the same kernel /
command line needs to run on several different machines. The picked memory
reservation just needs to be the same for a given machine, but may be

With KASLR enabled, doesn't this approach break too often to be
reliable enough for the data you want to extract?

Picking up the idea above, with a persistent variable we could even make
KASLR avoid that reserved pstore region in its search for a viable KASLR
offset.

I think I was hit by it once in all my testing. For our use case, the
few times it fails to map is not going to affect what we need this for
at all.

Once is pretty good. Do you know why? Also once out of how many runs? Is
the randomness source not as random as it should be or are the number of
bits for KASLR maybe so few on your target architecture that the odds of
hitting anything become low? Do these same constraints hold true outside
of your testing environment?

So I just ran it a hundred times in a loop. I added a patch to print
the location of "_text". The loop was this:

   for i in `seq 100`; do
     ssh root@debiantesting-x86-64 "dmesg | grep -e 'text starts' -e 'mapped boot' >> text; grub-reboot '1>0'; sleep 0.5; reboot"
     sleep 25
   done

It searches dmesg for my added printk as well as the print of where the
ring buffer was loaded in physical memory.

It takes about 15 seconds to reboot, so I waited 25. The results are
attached. I found that it was consistent 76 times, which means 1 out of
4 it's not. Funny enough, it broke whenever it loaded the kernel below
0x1. And then it would be off by a little.

It was consistently at:

   0x27d00

And when it failed, it was at 0x27ce0.

Note, when I used the e820 tables to do this, I never saw a failure. My
assumption is that when it is below 0x1, something else gets
allocated causing this to get pushed down.



Thinking about it again: What if you run the allocation super early (see 
arch/x86/boot/compressed/kaslr.c:handle_mem_options())? If you stick to 
allocating only from top, you're effectively kernel version independent 
for your allocations because none of the kernel code ran yet and 
definitely KASLR independent because you're running deterministically 
before KASLR even gets allocated.



As this code relies on memblock_phys_alloc() being consistent, if
something gets allocated before it differently depending on where the
kernel is, it can also move the location. A plugin to UEFI would mean
that it would need to reserve the memory, and the code here will need
to know where it is. We could always make the function reserve_mem()
global and weak so that architectures can override it.



Yes, the in-kernel UEFI loader (efi-stub) could simply populate a new 
type of memblock with the respective reservations and you later call 
memblock_find_in_range_node() instead of memblock_

Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-17 Thread Alexander Graf

Hey Steve,

On 13.06.24 19:12, Steven Rostedt wrote:

On Thu, 13 Jun 2024 18:54:12 +0200
Alexander Graf  wrote:


Do you have a "real" pstore on these systems that you could store
non-volatile variables in, such as persistent UEFI variables? If so, you
could create an actually persistent mapping for your trace pstore even
across kernel version updates as a general mechanism to create reserved
memblocks at fixed offsets.

After implementing all this, I don't think I can use pstore for my
purpose. pstore is a generic interface for persistent storage, and
requires an interface to access it. From what I understand, it's not
the place to just ask for an area of RAM.

For this, I have a single patch that allows the tracing instance to use
an area reserved by reserve_mem.

   reserve_mem=12M:4096:trace trace_instance=boot_mapped@trace

I've already tested this on qemu and a couple of chromebooks. It works
well.



I believe we're talking about 2 different things :). Let me rephrase a 
bit and make a concrete example.


Imagine you have passed the "reserve_mem=12M:4096:trace" kernel command 
line option. The kernel now comes up and allocates a random chunk of 
memory that - by (admittedly good) chance - may be at the same physical 
location as before. Let's assume it deemed 0x100 as a good offset.


Let's now assume you're running on a UEFI system. There, you always have 
non-volatile storage available to you even in the pre-boot phase. That 
means the kernel could create a UEFI variable that says "12M:4096:trace 
-> 0x100". The pre-boot phase takes all these UEFI variables and 
marks them as reserved. When you finally reach your command line parsing 
logic for reserve_mem=, you can flip all reservations that were not on 
the command line back to normal memory.


That way you have pretty much guaranteed persistent memory regions, even 
with KASLR changing your memory layout across boots.


The nice thing is that the above is an extension of what you've already 
built: Systems with UEFI simply get better guarantees that their regions 
persist.








Requirement:

Need a way to reserve memory that will be at a consistent location for
every boot, if the kernel and system are the same. Does not need to work
if rebooting to a different kernel, or if the system can change the
memory layout between boots.

The reserved memory cannot be a hard-coded address, as the same kernel /
command line needs to run on several different machines. The picked memory
reservation just needs to be the same for a given machine, but may be


With KASLR enabled, doesn't this approach break too often to be
reliable enough for the data you want to extract?

Picking up the idea above, with a persistent variable we could even make
KASLR avoid that reserved pstore region in its search for a viable KASLR
offset.

I think I was hit by it once in all my testing. For our use case, the
few times it fails to map is not going to affect what we need this for
at all.



Once is pretty good. Do you know why? Also once out of how many runs? Is 
the randomness source not as random as it should be or are the number of 
bits for KASLR maybe so few on your target architecture that the odds of 
hitting anything become low? Do these same constraints hold true outside 
of your testing environment?



Alex




Amazon Web Services Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597


Re: [PATCH v6 0/2] mm/memblock: Add "reserve_mem" to reserved named memory at boot up

2024-06-13 Thread Alexander Graf

Hey Steve,

On 13.06.24 17:55, Steven Rostedt wrote:

Reserve unspecified location of physical memory from kernel command line

Background:

In ChromeOS, we have 1 MB of pstore ramoops reserved so that we can extract
dmesg output and some other information when a crash happens in the field.
(This is only done when the user selects "Allow Google to collect data for
  improving the system"). But there are cases when there's a bug that
requires more data to be retrieved to figure out what is happening. We would
like to increase the pstore size, either temporarily, or maybe even
permanently. The pstore on these devices is at a fixed location in RAM (as
the RAM is not cleared on soft reboots nor crashes). The location is chosen
by the BIOS (coreboot) and passed to the kernel via ACPI tables on x86.
There's a driver that queries for this to initialize the pstore for
ChromeOS:

   See drivers/platform/chrome/chromeos_pstore.c

Problem:

The problem is that, even though there's a process to change the kernel on
these systems, and is done regularly to install updates, the firmware is
updated much less frequently. Choosing the place in RAM also takes special
care, and may be at a different address for different boards. Updating the
size via firmware is a large effort and not something that many are willing
to do for a temporary pstore size change.



(sorry for not commenting on earlier versions, I didn't see v1-v5 in my 
inbox)


Do you have a "real" pstore on these systems that you could store 
non-volatile variables in, such as persistent UEFI variables? If so, you 
could create an actually persistent mapping for your trace pstore even 
across kernel version updates as a general mechanism to create reserved 
memblocks at fixed offsets.




Requirement:

Need a way to reserve memory that will be at a consistent location for
every boot, if the kernel and system are the same. Does not need to work
if rebooting to a different kernel, or if the system can change the
memory layout between boots.

The reserved memory cannot be a hard-coded address, as the same kernel /
command line needs to run on several different machines. The picked memory
reservation just needs to be the same for a given machine, but may be



With KASLR enabled, doesn't this approach break too often to be 
reliable enough for the data you want to extract?


Picking up the idea above, with a persistent variable we could even make 
KASLR avoid that reserved pstore region in its search for a viable KASLR 
offset.



Alex




Amazon Web Services Development Center Germany GmbH
Krausenstr. 38
10117 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597


[PATCH v4 2/2] ipvs: allow some sysctls in non-init user namespaces

2024-05-06 Thread Alexander Mikhalitsyn
Let's make all IPVS sysctls writable even when the
network namespace is owned by a non-initial user namespace.

Let's make a few sysctls read-only for non-privileged users:
- sync_qlen_max
- sync_sock_size
- run_estimation
- est_cpulist
- est_nice

I'm trying to be conservative with this to prevent
introducing any security issues in there. Maybe,
we can allow more sysctls to be writable, but let's
do this on-demand and when we see real use-case.

This patch is motivated by user request in the LXC
project [1]. Having this can help with running some
Kubernetes [2] or Docker Swarm [3] workloads inside the system
containers.

Link: https://github.com/lxc/lxc/issues/4278 [1]
Link: 
https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103
 [2]
Link: 
https://github.com/moby/libnetwork/blob/3797618f9a38372e8107d8c06f6ae199e1133ae8/osl/namespace_linux.go#L682
 [3]

Cc: Julian Anastasov 
Cc: Simon Horman 
Cc: Pablo Neira Ayuso 
Cc: Jozsef Kadlecsik 
Cc: Florian Westphal 
Signed-off-by: Alexander Mikhalitsyn 
---
 net/netfilter/ipvs/ip_vs_ctl.c | 19 +++
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index e122fa367b81..b6d0dcf3a5c3 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -4269,6 +4269,7 @@ static int __net_init 
ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
struct ctl_table *tbl;
int idx, ret;
size_t ctl_table_size = ARRAY_SIZE(vs_vars);
+   bool unpriv = net->user_ns != &init_user_ns;
 
atomic_set(&ipvs->dropentry, 0);
spin_lock_init(&ipvs->dropentry_lock);
@@ -4283,10 +4284,6 @@ static int __net_init 
ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
tbl = kmemdup(vs_vars, sizeof(vs_vars), GFP_KERNEL);
if (tbl == NULL)
return -ENOMEM;
-
-   /* Don't export sysctls to unprivileged users */
-   if (net->user_ns != &init_user_ns)
-   ctl_table_size = 0;
} else
tbl = vs_vars;
/* Initialize sysctl defaults */
@@ -4312,10 +4309,17 @@ static int __net_init 
ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
ipvs->sysctl_sync_ports = 1;
tbl[idx++].data = &ipvs->sysctl_sync_ports;
tbl[idx++].data = &ipvs->sysctl_sync_persist_mode;
+
ipvs->sysctl_sync_qlen_max = nr_free_buffer_pages() / 32;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx++].data = &ipvs->sysctl_sync_qlen_max;
+
ipvs->sysctl_sync_sock_size = 0;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx++].data = &ipvs->sysctl_sync_sock_size;
+
tbl[idx++].data = &ipvs->sysctl_cache_bypass;
tbl[idx++].data = &ipvs->sysctl_expire_nodest_conn;
tbl[idx++].data = &ipvs->sysctl_sloppy_tcp;
@@ -4338,15 +4342,22 @@ static int __net_init 
ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
tbl[idx++].data = &ipvs->sysctl_conn_reuse_mode;
tbl[idx++].data = &ipvs->sysctl_schedule_icmp;
tbl[idx++].data = &ipvs->sysctl_ignore_tunneled;
+
ipvs->sysctl_run_estimation = 1;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx].extra2 = ipvs;
tbl[idx++].data = &ipvs->sysctl_run_estimation;
 
ipvs->est_cpulist_valid = 0;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx].extra2 = ipvs;
tbl[idx++].data = &ipvs->sysctl_est_cpulist;
 
ipvs->sysctl_est_nice = IPVS_EST_NICE;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx].extra2 = ipvs;
tbl[idx++].data = &ipvs->sysctl_est_nice;
 
-- 
2.34.1




[PATCH v4 1/2] ipvs: add READ_ONCE barrier for ipvs->sysctl_amemthresh

2024-05-06 Thread Alexander Mikhalitsyn
Cc: Julian Anastasov 
Cc: Simon Horman 
Cc: Pablo Neira Ayuso 
Cc: Jozsef Kadlecsik 
Cc: Florian Westphal 
Suggested-by: Julian Anastasov 
Signed-off-by: Alexander Mikhalitsyn 
---
 net/netfilter/ipvs/ip_vs_ctl.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 50b5dbe40eb8..e122fa367b81 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -94,6 +94,7 @@ static void update_defense_level(struct netns_ipvs *ipvs)
 {
struct sysinfo i;
int availmem;
+   int amemthresh;
int nomem;
int to_change = -1;
 
@@ -105,7 +106,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
/* si_swapinfo(&i); */
/* availmem = availmem - (i.totalswap - i.freeswap); */
 
-   nomem = (availmem < ipvs->sysctl_amemthresh);
+   amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);
+   nomem = (availmem < amemthresh);
 
local_bh_disable();
 
@@ -145,9 +147,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
break;
case 1:
if (nomem) {
-   ipvs->drop_rate = ipvs->drop_counter
-   = ipvs->sysctl_amemthresh /
-   (ipvs->sysctl_amemthresh-availmem);
+   ipvs->drop_counter = amemthresh / (amemthresh - 
availmem);
+   ipvs->drop_rate = ipvs->drop_counter;
ipvs->sysctl_drop_packet = 2;
} else {
ipvs->drop_rate = 0;
@@ -155,9 +156,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
break;
case 2:
if (nomem) {
-   ipvs->drop_rate = ipvs->drop_counter
-   = ipvs->sysctl_amemthresh /
-   (ipvs->sysctl_amemthresh-availmem);
+   ipvs->drop_counter = amemthresh / (amemthresh - 
availmem);
+   ipvs->drop_rate = ipvs->drop_counter;
} else {
ipvs->drop_rate = 0;
ipvs->sysctl_drop_packet = 1;
-- 
2.34.1
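
As an aside for readers, the race that the READ_ONCE() closes can be
illustrated with a condensed, hypothetical helper. This is not a function in
the kernel; it only mirrors the logic of update_defense_level() shown above.

/*
 * Without READ_ONCE(), ipvs->sysctl_amemthresh could be re-read between
 * the comparison and the division while an admin is writing the sysctl,
 * so the divisor (amemthresh - availmem) could become zero or negative.
 * Reading it once into a local and clamping it avoids that.
 */
static int compute_drop_counter(struct netns_ipvs *ipvs, int availmem)
{
	int amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);

	if (availmem >= amemthresh)
		return 0;

	return amemthresh / (amemthresh - availmem);
}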




[PATCH net-next v3 2/2] ipvs: allow some sysctls in non-init user namespaces

2024-04-18 Thread Alexander Mikhalitsyn
Let's make all IPVS sysctls writable even when the
network namespace is owned by a non-initial user namespace.

Let's make a few sysctls read-only for non-privileged users:
- sync_qlen_max
- sync_sock_size
- run_estimation
- est_cpulist
- est_nice

I'm trying to be conservative with this to prevent
introducing any security issues in there. Maybe,
we can allow more sysctls to be writable, but let's
do this on-demand and when we see real use-case.

This patch is motivated by user request in the LXC
project [1]. Having this can help with running some
Kubernetes [2] or Docker Swarm [3] workloads inside the system
containers.

Link: https://github.com/lxc/lxc/issues/4278 [1]
Link: 
https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103
 [2]
Link: 
https://github.com/moby/libnetwork/blob/3797618f9a38372e8107d8c06f6ae199e1133ae8/osl/namespace_linux.go#L682
 [3]

Cc: Stéphane Graber 
Cc: Christian Brauner 
Cc: Julian Anastasov 
Cc: Simon Horman 
Cc: Pablo Neira Ayuso 
Cc: Jozsef Kadlecsik 
Cc: Florian Westphal 
Signed-off-by: Alexander Mikhalitsyn 
---
 net/netfilter/ipvs/ip_vs_ctl.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 32be24f0d4e4..c3ba71aa2654 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -4270,6 +4270,7 @@ static int __net_init 
ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
struct ctl_table *tbl;
int idx, ret;
size_t ctl_table_size = ARRAY_SIZE(vs_vars);
+   bool unpriv = net->user_ns != &init_user_ns;
 
atomic_set(&ipvs->dropentry, 0);
spin_lock_init(&ipvs->dropentry_lock);
@@ -4284,12 +4285,6 @@ static int __net_init 
ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
tbl = kmemdup(vs_vars, sizeof(vs_vars), GFP_KERNEL);
if (tbl == NULL)
return -ENOMEM;
-
-   /* Don't export sysctls to unprivileged users */
-   if (net->user_ns != &init_user_ns) {
-   tbl[0].procname = NULL;
-   ctl_table_size = 0;
-   }
} else
tbl = vs_vars;
/* Initialize sysctl defaults */
@@ -4315,10 +4310,17 @@ static int __net_init 
ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
ipvs->sysctl_sync_ports = 1;
tbl[idx++].data = &ipvs->sysctl_sync_ports;
tbl[idx++].data = &ipvs->sysctl_sync_persist_mode;
+
ipvs->sysctl_sync_qlen_max = nr_free_buffer_pages() / 32;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx++].data = &ipvs->sysctl_sync_qlen_max;
+
ipvs->sysctl_sync_sock_size = 0;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx++].data = &ipvs->sysctl_sync_sock_size;
+
tbl[idx++].data = &ipvs->sysctl_cache_bypass;
tbl[idx++].data = &ipvs->sysctl_expire_nodest_conn;
tbl[idx++].data = &ipvs->sysctl_sloppy_tcp;
@@ -4341,15 +4343,22 @@ static int __net_init 
ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
tbl[idx++].data = &ipvs->sysctl_conn_reuse_mode;
tbl[idx++].data = &ipvs->sysctl_schedule_icmp;
tbl[idx++].data = &ipvs->sysctl_ignore_tunneled;
+
ipvs->sysctl_run_estimation = 1;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx].extra2 = ipvs;
tbl[idx++].data = &ipvs->sysctl_run_estimation;
 
ipvs->est_cpulist_valid = 0;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx].extra2 = ipvs;
tbl[idx++].data = &ipvs->sysctl_est_cpulist;
 
ipvs->sysctl_est_nice = IPVS_EST_NICE;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx].extra2 = ipvs;
tbl[idx++].data = &ipvs->sysctl_est_nice;
 
-- 
2.34.1




[PATCH net-next v3 1/2] ipvs: add READ_ONCE barrier for ipvs->sysctl_amemthresh

2024-04-18 Thread Alexander Mikhalitsyn
Cc: Julian Anastasov 
Cc: Simon Horman 
Cc: Pablo Neira Ayuso 
Cc: Jozsef Kadlecsik 
Cc: Florian Westphal 
Suggested-by: Julian Anastasov 
Signed-off-by: Alexander Mikhalitsyn 
---
 net/netfilter/ipvs/ip_vs_ctl.c | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 143a341bbc0a..32be24f0d4e4 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -94,6 +94,7 @@ static void update_defense_level(struct netns_ipvs *ipvs)
 {
struct sysinfo i;
int availmem;
+   int amemthresh;
int nomem;
int to_change = -1;
 
@@ -105,7 +106,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
/* si_swapinfo(&i); */
/* availmem = availmem - (i.totalswap - i.freeswap); */
 
-   nomem = (availmem < ipvs->sysctl_amemthresh);
+   amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);
+   nomem = (availmem < amemthresh);
 
local_bh_disable();
 
@@ -145,9 +147,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
break;
case 1:
if (nomem) {
-   ipvs->drop_rate = ipvs->drop_counter
-   = ipvs->sysctl_amemthresh /
-   (ipvs->sysctl_amemthresh-availmem);
+   ipvs->drop_counter = amemthresh / (amemthresh - 
availmem);
+   ipvs->drop_rate = ipvs->drop_counter;
ipvs->sysctl_drop_packet = 2;
} else {
ipvs->drop_rate = 0;
@@ -155,9 +156,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
break;
case 2:
if (nomem) {
-   ipvs->drop_rate = ipvs->drop_counter
-   = ipvs->sysctl_amemthresh /
-   (ipvs->sysctl_amemthresh-availmem);
+   ipvs->drop_counter = amemthresh / (amemthresh - 
availmem);
+   ipvs->drop_rate = ipvs->drop_counter;
} else {
ipvs->drop_rate = 0;
ipvs->sysctl_drop_packet = 1;
-- 
2.34.1




[PATCH net-next v2 2/2] ipvs: allow some sysctls in non-init user namespaces

2024-04-18 Thread Alexander Mikhalitsyn
Let's make all IPVS sysctls writable even when the
network namespace is owned by a non-initial user namespace.

Let's make a few sysctls read-only for non-privileged users:
- sync_qlen_max
- sync_sock_size
- run_estimation
- est_cpulist
- est_nice

I'm trying to be conservative with this to prevent
introducing any security issues in there. Maybe,
we can allow more sysctls to be writable, but let's
do this on-demand and when we see real use-case.

This patch is motivated by user request in the LXC
project [1]. Having this can help with running some
Kubernetes [2] or Docker Swarm [3] workloads inside the system
containers.

[1] https://github.com/lxc/lxc/issues/4278
[2] 
https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103
[3] 
https://github.com/moby/libnetwork/blob/3797618f9a38372e8107d8c06f6ae199e1133ae8/osl/namespace_linux.go#L682

Cc: Stéphane Graber 
Cc: Christian Brauner 
Cc: Julian Anastasov 
Cc: Simon Horman 
Cc: Pablo Neira Ayuso 
Cc: Jozsef Kadlecsik 
Cc: Florian Westphal 
Signed-off-by: Alexander Mikhalitsyn 
---
 net/netfilter/ipvs/ip_vs_ctl.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index daa62b8b2dd1..f84f091626ef 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -4272,6 +4272,7 @@ static int __net_init 
ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
struct ctl_table *tbl;
int idx, ret;
size_t ctl_table_size = ARRAY_SIZE(vs_vars);
+   bool unpriv = net->user_ns != &init_user_ns;
 
atomic_set(&ipvs->dropentry, 0);
spin_lock_init(&ipvs->dropentry_lock);
@@ -4286,12 +4287,6 @@ static int __net_init 
ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
tbl = kmemdup(vs_vars, sizeof(vs_vars), GFP_KERNEL);
if (tbl == NULL)
return -ENOMEM;
-
-   /* Don't export sysctls to unprivileged users */
-   if (net->user_ns != &init_user_ns) {
-   tbl[0].procname = NULL;
-   ctl_table_size = 0;
-   }
} else
tbl = vs_vars;
/* Initialize sysctl defaults */
@@ -4317,10 +4312,17 @@ static int __net_init 
ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
ipvs->sysctl_sync_ports = 1;
tbl[idx++].data = &ipvs->sysctl_sync_ports;
tbl[idx++].data = &ipvs->sysctl_sync_persist_mode;
+
ipvs->sysctl_sync_qlen_max = nr_free_buffer_pages() / 32;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx++].data = &ipvs->sysctl_sync_qlen_max;
+
ipvs->sysctl_sync_sock_size = 0;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx++].data = &ipvs->sysctl_sync_sock_size;
+
tbl[idx++].data = &ipvs->sysctl_cache_bypass;
tbl[idx++].data = &ipvs->sysctl_expire_nodest_conn;
tbl[idx++].data = &ipvs->sysctl_sloppy_tcp;
@@ -4343,15 +4345,22 @@ static int __net_init 
ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
tbl[idx++].data = &ipvs->sysctl_conn_reuse_mode;
tbl[idx++].data = &ipvs->sysctl_schedule_icmp;
tbl[idx++].data = &ipvs->sysctl_ignore_tunneled;
+
ipvs->sysctl_run_estimation = 1;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx].extra2 = ipvs;
tbl[idx++].data = &ipvs->sysctl_run_estimation;
 
ipvs->est_cpulist_valid = 0;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx].extra2 = ipvs;
tbl[idx++].data = &ipvs->sysctl_est_cpulist;
 
ipvs->sysctl_est_nice = IPVS_EST_NICE;
+   if (unpriv)
+   tbl[idx].mode = 0444;
tbl[idx].extra2 = ipvs;
tbl[idx++].data = &ipvs->sysctl_est_nice;
 
-- 
2.34.1




[PATCH net-next v2 1/2] ipvs: add READ_ONCE barrier for ipvs->sysctl_amemthresh

2024-04-18 Thread Alexander Mikhalitsyn
Cc: Julian Anastasov 
Cc: Simon Horman 
Cc: Pablo Neira Ayuso 
Cc: Jozsef Kadlecsik 
Cc: Florian Westphal 
Suggested-by: Julian Anastasov 
Signed-off-by: Alexander Mikhalitsyn 
---
 net/netfilter/ipvs/ip_vs_ctl.c | 12 +++-
 1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 143a341bbc0a..daa62b8b2dd1 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -94,6 +94,7 @@ static void update_defense_level(struct netns_ipvs *ipvs)
 {
struct sysinfo i;
int availmem;
+   int amemthresh;
int nomem;
int to_change = -1;
 
@@ -105,7 +106,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
/* si_swapinfo(&i); */
/* availmem = availmem - (i.totalswap - i.freeswap); */
 
-   nomem = (availmem < ipvs->sysctl_amemthresh);
+   amemthresh = max(READ_ONCE(ipvs->sysctl_amemthresh), 0);
+   nomem = (availmem < amemthresh);
 
local_bh_disable();
 
@@ -146,8 +148,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
case 1:
if (nomem) {
ipvs->drop_rate = ipvs->drop_counter
-   = ipvs->sysctl_amemthresh /
-   (ipvs->sysctl_amemthresh-availmem);
+   = amemthresh /
+   (amemthresh-availmem);
ipvs->sysctl_drop_packet = 2;
} else {
ipvs->drop_rate = 0;
@@ -156,8 +158,8 @@ static void update_defense_level(struct netns_ipvs *ipvs)
case 2:
if (nomem) {
ipvs->drop_rate = ipvs->drop_counter
-   = ipvs->sysctl_amemthresh /
-   (ipvs->sysctl_amemthresh-availmem);
+   = amemthresh /
+   (amemthresh-availmem);
} else {
ipvs->drop_rate = 0;
ipvs->sysctl_drop_packet = 1;
-- 
2.34.1




Re: [PATCH net-next v2 07/15] mm: page_frag: add '_va' suffix to page_frag API

2024-04-16 Thread Alexander H Duyck
On Mon, 2024-04-15 at 21:19 +0800, Yunsheng Lin wrote:
> Currently most of the page_frag API is returning
> 'virtual address' as output or expecting 'virtual address'
> as input, in order to differentiate the API handling between
> 'virtual address' and 'struct page', add '_va' suffix to the
> corresponding API mirroring the page_pool_alloc_va() API of
> the page_pool.
> 
> Signed-off-by: Yunsheng Lin 

This patch is a total waste of time. By that logic we should be
renaming __get_free_pages since it essentially does the same thing.

This just seems like more code changes for the sake of adding code
changes rather than fixing anything. In my opinion it should be dropped
from the set.




[PATCH net-next] ipvs: allow some sysctls in non-init user namespaces

2024-04-16 Thread Alexander Mikhalitsyn
Let's make all IPVS sysctls visible and read-only even when the
network namespace is owned by a non-initial user namespace.

Let's make a few sysctls writable:
- conntrack
- conn_reuse_mode
- expire_nodest_conn
- expire_quiescent_template

I'm trying to be conservative with this to prevent
introducing any security issues in there. Maybe,
we can allow more sysctls to be writable, but let's
do this on-demand and when we see real use-case.

This list of sysctls was chosen because I can't
see any security risks allowing them and also
Kubernetes uses [2] these specific sysctls.

This patch is motivated by user request in the LXC
project [1].

[1] https://github.com/lxc/lxc/issues/4278
[2] 
https://github.com/kubernetes/kubernetes/blob/b722d017a34b300a2284b890448e5a605f21d01e/pkg/proxy/ipvs/proxier.go#L103

Cc: Stéphane Graber 
Cc: Christian Brauner 
Cc: Julian Anastasov 
Cc: Simon Horman 
Cc: Pablo Neira Ayuso 
Cc: Jozsef Kadlecsik 
Cc: Florian Westphal 
Signed-off-by: Alexander Mikhalitsyn 
---
 net/netfilter/ipvs/ip_vs_ctl.c | 18 +++---
 1 file changed, 15 insertions(+), 3 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index 143a341bbc0a..92a818c2f783 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -4285,10 +4285,22 @@ static int __net_init 
ip_vs_control_net_init_sysctl(struct netns_ipvs *ipvs)
if (tbl == NULL)
return -ENOMEM;
 
-   /* Don't export sysctls to unprivileged users */
+   /* Let's show all sysctls in non-init user namespace-owned
+* net namespaces, but make them read-only.
+*
+* Allow only a few specific sysctls to be writable.
+*/
if (net->user_ns != &init_user_ns) {
-   tbl[0].procname = NULL;
-   ctl_table_size = 0;
+   for (idx = 0; idx < ARRAY_SIZE(vs_vars); idx++) {
+   if (!tbl[idx].procname)
+   continue;
+
+   if (!((strcmp(tbl[idx].procname, "conntrack") 
== 0) ||
+ (strcmp(tbl[idx].procname, 
"conn_reuse_mode") == 0) ||
+ (strcmp(tbl[idx].procname, 
"expire_nodest_conn") == 0) ||
+ (strcmp(tbl[idx].procname, 
"expire_quiescent_template") == 0)))
+   tbl[idx].mode = 0444;
+   }
}
} else
tbl = vs_vars;
-- 
2.34.1




Re: [syzbot] [virtualization?] KMSAN: uninit-value in virtqueue_add (4)

2024-01-24 Thread Alexander Potapenko
On Thu, Jan 4, 2024 at 9:45 PM Stefan Hajnoczi  wrote:
>
> On Tue, Jan 02, 2024 at 08:03:46AM -0500, Michael S. Tsirkin wrote:
> > On Mon, Jan 01, 2024 at 05:38:24AM -0800, syzbot wrote:
> > > Hello,
> > >
> > > syzbot found the following issue on:
> > >
> > > HEAD commit:fbafc3e621c3 Merge tag 'for_linus' of 
> > > git://git.kernel.org..
> > > git tree:   upstream
> > > console+strace: https://syzkaller.appspot.com/x/log.txt?x=173df3e9e8
> > > kernel config:  https://syzkaller.appspot.com/x/.config?x=e0c7078a6b901aa3
> > > dashboard link: 
> > > https://syzkaller.appspot.com/bug?extid=d7521c1e3841ed075a42
> > > compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for 
> > > Debian) 2.40
> > > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1300b4a1e8
> > > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=130b0379e8
> > >
> > > Downloadable assets:
> > > disk image: 
> > > https://storage.googleapis.com/syzbot-assets/1520f7b6daa4/disk-fbafc3e6.raw.xz
> > > vmlinux: 
> > > https://storage.googleapis.com/syzbot-assets/8b490af009d5/vmlinux-fbafc3e6.xz
> > > kernel image: 
> > > https://storage.googleapis.com/syzbot-assets/202ca200f4a4/bzImage-fbafc3e6.xz
> > >
> > > IMPORTANT: if you fix the issue, please add the following tag to the 
> > > commit:
> > > Reported-by: syzbot+d7521c1e3841ed075...@syzkaller.appspotmail.com
> > >
> > > =
>
> Hi Alexander,
> Please take a look at this KMSAN failure. The uninitialized memory was
> created for the purpose of writing a coredump. vring_map_one_sg() should
> have direction=DMA_TO_DEVICE.
>
Hi Stefan,

I took a closer look, and am pretty confident this is a false positive.
I tried adding memset(..., 0xab, PAGE_SIZE << order) to alloc_pages()
and never saw
the 0xab pattern in the buffers for which KMSAN reported an error.

This probably isn't an error in 88938359e2df ("virtio: kmsan:
check/unpoison scatterlist in
vring_map_one_sg()"), which by itself should be doing a sane thing:
report an error if an
uninitialized buffer is passed to it. It is more likely that we're
missing some initialization that
happens in coredump.c

Does anyone have an idea where coredump.c is supposed to be
initializing these pages?
Maybe there are some inline assembly functions involved in copying the data?
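
For reference, a hypothetical sketch of the experiment Alexander describes,
i.e. tagging freshly allocated pages with a known pattern so that buffers
KMSAN complains about can be checked against it. The helper name and placement
are made up for illustration; the actual test patched the allocator directly.

static struct page *debug_alloc_poisoned_pages(gfp_t gfp, unsigned int order)
{
	struct page *page = alloc_pages(gfp, order);

	/* Fill with 0xab so uninitialized-looking data is easy to identify. */
	if (page && !PageHighMem(page))
		memset(page_address(page), 0xab, PAGE_SIZE << order);

	return page;
}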



Re: [PATCH net-next 2/6] page_frag: unify gfp bits for order 3 page allocation

2024-01-08 Thread Alexander Duyck
On Mon, Jan 8, 2024 at 12:25 AM Yunsheng Lin  wrote:
>
> On 2024/1/5 23:35, Alexander H Duyck wrote:
> > On Wed, 2024-01-03 at 17:56 +0800, Yunsheng Lin wrote:
> >> Currently there seem to be three page frag implementations
> >> which all try to allocate an order 3 page; if that fails, it
> >> then falls back to allocating an order 0 page, and each of them
> >> all allow order 3 page allocation to fail under certain
> >> condition by using specific gfp bits.
> >>
> >> The gfp bits for order 3 page allocation are different
> >> between the different implementations, __GFP_NOMEMALLOC is
> >> or'd to forbid access to emergency reserves memory for
> >> __page_frag_cache_refill(), but it is not or'd in other
> >> implementations, __GFP_DIRECT_RECLAIM is masked off to avoid
> >> direct reclaim in skb_page_frag_refill(), but it is not
> >> masked off in __page_frag_cache_refill().
> >>
> >> This patch unifies the gfp bits used between different
> >> implementations by or'ing __GFP_NOMEMALLOC and masking off
> >> __GFP_DIRECT_RECLAIM for order 3 page allocation to avoid
> >> possible pressure for mm.
> >>
> >> Signed-off-by: Yunsheng Lin 
> >> CC: Alexander Duyck 
> >> ---
> >>  drivers/vhost/net.c | 2 +-
> >>  mm/page_alloc.c | 4 ++--
> >>  net/core/sock.c | 2 +-
> >>  3 files changed, 4 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> >> index f2ed7167c848..e574e21cc0ca 100644
> >> --- a/drivers/vhost/net.c
> >> +++ b/drivers/vhost/net.c
> >> @@ -670,7 +670,7 @@ static bool vhost_net_page_frag_refill(struct 
> >> vhost_net *net, unsigned int sz,
> >>  /* Avoid direct reclaim but allow kswapd to wake */
> >>  pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> >>__GFP_COMP | __GFP_NOWARN |
> >> -  __GFP_NORETRY,
> >> +  __GFP_NORETRY | __GFP_NOMEMALLOC,
> >>SKB_FRAG_PAGE_ORDER);
> >>  if (likely(pfrag->page)) {
> >>  pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
> >> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> >> index 9a16305cf985..1f0b36dd81b5 100644
> >> --- a/mm/page_alloc.c
> >> +++ b/mm/page_alloc.c
> >> @@ -4693,8 +4693,8 @@ static struct page *__page_frag_cache_refill(struct 
> >> page_frag_cache *nc,
> >>  gfp_t gfp = gfp_mask;
> >>
> >>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> >> -gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
> >> -__GFP_NOMEMALLOC;
> >> +gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
> >> +   __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
> >>  page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
> >>  PAGE_FRAG_CACHE_MAX_ORDER);
> >>  nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
> >> diff --git a/net/core/sock.c b/net/core/sock.c
> >> index 446e945f736b..d643332c3ee5 100644
> >> --- a/net/core/sock.c
> >> +++ b/net/core/sock.c
> >> @@ -2900,7 +2900,7 @@ bool skb_page_frag_refill(unsigned int sz, struct 
> >> page_frag *pfrag, gfp_t gfp)
> >>  /* Avoid direct reclaim but allow kswapd to wake */
> >>  pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> >>__GFP_COMP | __GFP_NOWARN |
> >> -  __GFP_NORETRY,
> >> +  __GFP_NORETRY | __GFP_NOMEMALLOC,
> >>SKB_FRAG_PAGE_ORDER);
> >>  if (likely(pfrag->page)) {
> >>  pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
> >
> > Looks fine to me.
> >
> > One thing you may want to consider would be to place this all in an
> > inline function that could just consolidate all the code.
>
> Do you think it is possible to further unify the implementations of the
> 'struct page_frag_cache' and 'struct page_frag', so adding a inline
> function for above is unnecessary?

Actually the skb_page_frag_refill seems to function more similarly to
how the Intel drivers do in terms of handling fragments. It is
basically slicing off pieces until it either runs out of them and
allocates a new one, or reuses the page when its reference count is one,
all without pre-allocating the references.

However, with that said, many of the core bits are the same, so it might
be possible to look at unifying at least pieces of this. For example
the page_frag has the same first 3 members as the page_frag_cache so
it might be possible to look at refactoring things further to unify
more of the frag_refill logic.



Re: [PATCH net-next 4/6] vhost/net: remove vhost_net_page_frag_refill()

2024-01-08 Thread Alexander Duyck
On Mon, Jan 8, 2024 at 1:06 AM Yunsheng Lin  wrote:
>
> On 2024/1/6 0:06, Alexander H Duyck wrote:
> >>
> >>  static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
> >> @@ -1353,8 +1318,7 @@ static int vhost_net_open(struct inode *inode, 
> >> struct file *f)
> >>  vqs[VHOST_NET_VQ_RX]);
> >>
> >>  f->private_data = n;
> >> -n->page_frag.page = NULL;
> >> -n->refcnt_bias = 0;
> >> +n->pf_cache.va = NULL;
> >>
> >>  return 0;
> >>  }
> >> @@ -1422,8 +1386,9 @@ static int vhost_net_release(struct inode *inode, 
> >> struct file *f)
> >>  kfree(n->vqs[VHOST_NET_VQ_RX].rxq.queue);
> >>  kfree(n->vqs[VHOST_NET_VQ_TX].xdp);
> >>  kfree(n->dev.vqs);
> >> -if (n->page_frag.page)
> >> -__page_frag_cache_drain(n->page_frag.page, n->refcnt_bias);
> >> +if (n->pf_cache.va)
> >> +__page_frag_cache_drain(virt_to_head_page(n->pf_cache.va),
> >> +n->pf_cache.pagecnt_bias);
> >>  kvfree(n);
> >>  return 0;
> >>  }
> >
> > I would recommend reordering this patch with patch 5. Then you could
> > remove the block that is setting "n->pf_cache.va = NULL" above and just
> > make use of page_frag_cache_drain in the lower block which would also
> > return the va to NULL.
>
> I am not sure if we can as there is no zeroing for 'struct vhost_net' in
> vhost_net_open().
>
> If we don't have "n->pf_cache.va = NULL", don't we use the uninitialized data
> when calling page_frag_alloc_align() for the first time?

I see. So kvmalloc is used instead of kvzalloc when allocating the
structure. That might be an opportunity to clean things up a bit by
making that change to reduce the risk of some piece of memory
initialization being missed.

That said, I still think reordering the two patches might be useful as
it would help to make it so that the change you make to vhost_net is
encapsulated in one patch to fully enable the use of the new page pool
API.
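
A minimal sketch of the allocation-time zeroing being suggested a few
paragraphs up, assuming vhost_net_open() keeps its current GFP flags
(kvmalloc() with GFP_KERNEL | __GFP_RETRY_MAYFAIL is an assumption about the
existing code, not a quote of it):

	/* Zero the structure at allocation time so members such as
	 * pf_cache.va no longer need individual initialization. */
	n = kvzalloc(sizeof(*n), GFP_KERNEL | __GFP_RETRY_MAYFAIL);
	if (!n)
		return -ENOMEM;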



Re: [PATCH net-next 4/6] vhost/net: remove vhost_net_page_frag_refill()

2024-01-05 Thread Alexander H Duyck
On Wed, 2024-01-03 at 17:56 +0800, Yunsheng Lin wrote:
> The page frag in vhost_net_page_frag_refill() uses the
> 'struct page_frag' from skb_page_frag_refill(), but its
> implementation is similar to page_frag_alloc_align() now.
> 
> This patch removes vhost_net_page_frag_refill() by using
> 'struct page_frag_cache' instead of 'struct page_frag',
> and allocating frag using page_frag_alloc_align().
> 
> The added benefit is that it not only unifies the page frag
> implementation a little, but also gives about a 0.5% performance
> boost when testing with the vhost_net_test introduced in the
> last patch.
> 
> Signed-off-by: Yunsheng Lin 
> Acked-by: Jason Wang 
> ---
>  drivers/vhost/net.c | 93 ++---
>  1 file changed, 29 insertions(+), 64 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index e574e21cc0ca..805e11d598e4 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -141,10 +141,8 @@ struct vhost_net {
>   unsigned tx_zcopy_err;
>   /* Flush in progress. Protected by tx vq lock. */
>   bool tx_flush;
> - /* Private page frag */
> - struct page_frag page_frag;
> - /* Refcount bias of page frag */
> - int refcnt_bias;
> + /* Private page frag cache */
> + struct page_frag_cache pf_cache;
>  };
>  
>  static unsigned vhost_net_zcopy_mask __read_mostly;
> @@ -655,41 +653,6 @@ static bool tx_can_batch(struct vhost_virtqueue *vq, 
> size_t total_len)
>  !vhost_vq_avail_empty(vq->dev, vq);
>  }
>  
> -static bool vhost_net_page_frag_refill(struct vhost_net *net, unsigned int 
> sz,
> -struct page_frag *pfrag, gfp_t gfp)
> -{
> - if (pfrag->page) {
> - if (pfrag->offset + sz <= pfrag->size)
> - return true;
> - __page_frag_cache_drain(pfrag->page, net->refcnt_bias);
> - }
> -
> - pfrag->offset = 0;
> - net->refcnt_bias = 0;
> - if (SKB_FRAG_PAGE_ORDER) {
> - /* Avoid direct reclaim but allow kswapd to wake */
> - pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> -   __GFP_COMP | __GFP_NOWARN |
> -   __GFP_NORETRY | __GFP_NOMEMALLOC,
> -   SKB_FRAG_PAGE_ORDER);
> - if (likely(pfrag->page)) {
> - pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
> - goto done;
> - }
> - }
> - pfrag->page = alloc_page(gfp);
> - if (likely(pfrag->page)) {
> - pfrag->size = PAGE_SIZE;
> - goto done;
> - }
> - return false;
> -
> -done:
> - net->refcnt_bias = USHRT_MAX;
> - page_ref_add(pfrag->page, USHRT_MAX - 1);
> - return true;
> -}
> -
>  #define VHOST_NET_RX_PAD (NET_IP_ALIGN + NET_SKB_PAD)
>  
>  static int vhost_net_build_xdp(struct vhost_net_virtqueue *nvq,
> @@ -699,7 +662,6 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue 
> *nvq,
>   struct vhost_net *net = container_of(vq->dev, struct vhost_net,
>dev);
>   struct socket *sock = vhost_vq_get_backend(vq);
> - struct page_frag *alloc_frag = &net->page_frag;
>   struct virtio_net_hdr *gso;
>   struct xdp_buff *xdp = &nvq->xdp[nvq->batched_xdp];
>   struct tun_xdp_hdr *hdr;
> @@ -710,6 +672,7 @@ static int vhost_net_build_xdp(struct vhost_net_virtqueue 
> *nvq,
>   int sock_hlen = nvq->sock_hlen;
>   void *buf;
>   int copied;
> + int ret;
>  
>   if (unlikely(len < nvq->sock_hlen))
>   return -EFAULT;
> @@ -719,18 +682,17 @@ static int vhost_net_build_xdp(struct 
> vhost_net_virtqueue *nvq,
>   return -ENOSPC;
>  
>   buflen += SKB_DATA_ALIGN(len + pad);
> - alloc_frag->offset = ALIGN((u64)alloc_frag->offset, SMP_CACHE_BYTES);
> - if (unlikely(!vhost_net_page_frag_refill(net, buflen,
> -  alloc_frag, GFP_KERNEL)))
> + buf = page_frag_alloc_align(&net->pf_cache, buflen, GFP_KERNEL,
> + SMP_CACHE_BYTES);

If your changes from patch 1 are just to make it fit into this layout
might I suggest just splitting up page_frag_alloc_align into an inline
that accepts the arguments you have here, and adding
__page_frag_alloc_align which is passed the mask the original function
expected. By doing that you should be able to maintain the same level
of performance and still get most of the code cleanup.
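
A rough sketch of the split being suggested here; the names mirror the
existing page_frag API, but the exact shape is an assumption rather than the
merged code:

/* Core allocator keeps the original mask-based interface. */
void *__page_frag_alloc_align(struct page_frag_cache *nc, unsigned int fragsz,
			      gfp_t gfp_mask, unsigned int align_mask);

/* Thin wrapper converts an alignment in bytes into the mask, so callers
 * such as vhost can pass SMP_CACHE_BYTES directly at no extra cost. */
static inline void *page_frag_alloc_align(struct page_frag_cache *nc,
					  unsigned int fragsz, gfp_t gfp_mask,
					  unsigned int align)
{
	WARN_ON_ONCE(!is_power_of_2(align));
	return __page_frag_alloc_align(nc, fragsz, gfp_mask, -align);
}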

> + if (unlikely(!buf))
>   return -ENOMEM;
>  
> - buf = (char *)page_address(alloc_frag->page) + alloc_frag->offset;
> - copied = copy_page_from_iter(alloc_frag->page,
> -  alloc_frag->offset +
> -  offsetof(struct tun_xdp_hdr, gso),
> -  sock_hlen, from);
> - if (copied != soc

Re: [PATCH net-next 5/6] net: introduce page_frag_cache_drain()

2024-01-05 Thread Alexander H Duyck
On Wed, 2024-01-03 at 17:56 +0800, Yunsheng Lin wrote:
> When draining a page_frag_cache, most users are doing
> similar steps, so introduce an API to avoid code
> duplication.
> 
> Signed-off-by: Yunsheng Lin 
> Acked-by: Jason Wang 

Looks good to me.

Reviewed-by: Alexander Duyck 
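
For context, the helper under review has roughly the following shape,
reconstructed from how the vhost patch earlier in the series drains its cache;
treat it as a sketch rather than the exact merged code:

void page_frag_cache_drain(struct page_frag_cache *nc)
{
	if (!nc->va)
		return;

	__page_frag_cache_drain(virt_to_head_page(nc->va), nc->pagecnt_bias);
	nc->va = NULL;
}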




Re: [PATCH net-next 2/6] page_frag: unify gfp bits for order 3 page allocation

2024-01-05 Thread Alexander H Duyck
On Wed, 2024-01-03 at 17:56 +0800, Yunsheng Lin wrote:
> Currently there seem to be three page frag implementations
> which all try to allocate an order 3 page; if that fails, it
> then falls back to allocating an order 0 page, and each of them
> all allow order 3 page allocation to fail under certain
> condition by using specific gfp bits.
> 
> The gfp bits for order 3 page allocation are different
> between the different implementations, __GFP_NOMEMALLOC is
> or'd to forbid access to emergency reserves memory for
> __page_frag_cache_refill(), but it is not or'd in other
> implementations, __GFP_DIRECT_RECLAIM is masked off to avoid
> direct reclaim in skb_page_frag_refill(), but it is not
> masked off in __page_frag_cache_refill().
> 
> This patch unifies the gfp bits used between different
> implementations by or'ing __GFP_NOMEMALLOC and masking off
> __GFP_DIRECT_RECLAIM for order 3 page allocation to avoid
> possible pressure for mm.
> 
> Signed-off-by: Yunsheng Lin 
> CC: Alexander Duyck 
> ---
>  drivers/vhost/net.c | 2 +-
>  mm/page_alloc.c | 4 ++--
>  net/core/sock.c | 2 +-
>  3 files changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index f2ed7167c848..e574e21cc0ca 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -670,7 +670,7 @@ static bool vhost_net_page_frag_refill(struct vhost_net 
> *net, unsigned int sz,
>   /* Avoid direct reclaim but allow kswapd to wake */
>   pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> __GFP_COMP | __GFP_NOWARN |
> -   __GFP_NORETRY,
> +   __GFP_NORETRY | __GFP_NOMEMALLOC,
> SKB_FRAG_PAGE_ORDER);
>   if (likely(pfrag->page)) {
>   pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 9a16305cf985..1f0b36dd81b5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4693,8 +4693,8 @@ static struct page *__page_frag_cache_refill(struct 
> page_frag_cache *nc,
>   gfp_t gfp = gfp_mask;
>  
>  #if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
> - gfp_mask |= __GFP_COMP | __GFP_NOWARN | __GFP_NORETRY |
> - __GFP_NOMEMALLOC;
> + gfp_mask = (gfp_mask & ~__GFP_DIRECT_RECLAIM) |  __GFP_COMP |
> +__GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;
>   page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
>   PAGE_FRAG_CACHE_MAX_ORDER);
>   nc->size = page ? PAGE_FRAG_CACHE_MAX_SIZE : PAGE_SIZE;
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 446e945f736b..d643332c3ee5 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -2900,7 +2900,7 @@ bool skb_page_frag_refill(unsigned int sz, struct 
> page_frag *pfrag, gfp_t gfp)
>   /* Avoid direct reclaim but allow kswapd to wake */
>   pfrag->page = alloc_pages((gfp & ~__GFP_DIRECT_RECLAIM) |
> __GFP_COMP | __GFP_NOWARN |
> -   __GFP_NORETRY,
> +   __GFP_NORETRY | __GFP_NOMEMALLOC,
> SKB_FRAG_PAGE_ORDER);
>   if (likely(pfrag->page)) {
>   pfrag->size = PAGE_SIZE << SKB_FRAG_PAGE_ORDER;

Looks fine to me.

One thing you may want to consider would be to place this all in an
inline function that could just consolidate all the code.

Reviewed-by: Alexander Duyck 
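
A sketch of the consolidating inline suggested above, i.e. one helper that
encodes the unified order-3 policy in a single place (the helper name is
illustrative and not an existing kernel API):

static inline struct page *page_frag_alloc_large(gfp_t gfp)
{
	/* Unified policy: no direct reclaim, no emergency reserves. */
	gfp_t gfp_mask = (gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
			 __GFP_NOWARN | __GFP_NORETRY | __GFP_NOMEMALLOC;

	return alloc_pages_node(NUMA_NO_NODE, gfp_mask,
				PAGE_FRAG_CACHE_MAX_ORDER);
}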




Re: [PATCH v3 28/34] s390/mm: Define KMSAN metadata for vmalloc and modules

2024-01-04 Thread Alexander Gordeev
On Tue, Jan 02, 2024 at 04:05:31PM +0100, Heiko Carstens wrote:
Hi Heiko,
...
> > @@ -253,9 +253,17 @@ static unsigned long setup_kernel_memory_layout(void)
> > MODULES_END = round_down(__abs_lowcore, _SEGMENT_SIZE);
> > MODULES_VADDR = MODULES_END - MODULES_LEN;
> > VMALLOC_END = MODULES_VADDR;
> > +#ifdef CONFIG_KMSAN
> > +   VMALLOC_END -= MODULES_LEN * 2;
> > +#endif
> >  
> > /* allow vmalloc area to occupy up to about 1/2 of the rest virtual 
> > space left */
> > vmalloc_size = min(vmalloc_size, round_down(VMALLOC_END / 2, 
> > _REGION3_SIZE));
> > +#ifdef CONFIG_KMSAN
> > +   /* take 2/3 of vmalloc area for KMSAN shadow and origins */
> > +   vmalloc_size = round_down(vmalloc_size / 3, _REGION3_SIZE);
> > +   VMALLOC_END -= vmalloc_size * 2;
> > +#endif
> 
> Please use
> 
>   if (IS_ENABLED(CONFIG_KMSAN))
> 
> above, since this way we get more compile time checks.

This way we will get a mixture of CONFIG_KASAN and CONFIG_KMSAN
#ifdef vs IS_ENABLED() checks within one function. I guess, we
would rather address it with a separate cleanup?

Thanks!



Re: [PATCH v3 17/34] lib/zlib: Unpoison DFLTCC output buffers

2023-12-22 Thread Alexander Potapenko
On Thu, Dec 14, 2023 at 12:36 AM Ilya Leoshkevich  wrote:
>
> The constraints of the DFLTCC inline assembly are not precise: they
> do not communicate the size of the output buffers to the compiler, so
> it cannot automatically instrument it.
>
> Add the manual kmsan_unpoison_memory() calls for the output buffers.
> The logic is the same as in [1].
>
> [1] 
> https://github.com/zlib-ng/zlib-ng/commit/1f5ddcc009ac3511e99fc88736a9e1a6381168c5
>
> Reported-by: Alexander Gordeev 
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 


> @@ -34,6 +37,7 @@ static inline dfltcc_cc dfltcc(
>  )
>  {
>  Byte *t2 = op1 ? *op1 : NULL;
> +unsigned char *orig_t2 = t2;
>  size_t t3 = len1 ? *len1 : 0;
>  const Byte *t4 = op2 ? *op2 : NULL;
>  size_t t5 = len2 ? *len2 : 0;
> @@ -59,6 +63,26 @@ static inline dfltcc_cc dfltcc(
>   : "cc", "memory");
>  t2 = r2; t3 = r3; t4 = r4; t5 = r5;
>
> +switch (fn & DFLTCC_FN_MASK) {

It might be a good idea to add a comment explaining what this block of
code does.
(And that it is a no-op in non-KMSAN builds.)
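
The kind of comment (and unpoisoning) being asked for might look roughly like
this; the hunk below is a sketch based only on the quoted context, not the
actual patch:

	/*
	 * The DFLTCC constraints do not tell the compiler how much of the
	 * output buffer was written, so inform KMSAN explicitly.  This is
	 * a no-op when CONFIG_KMSAN is not set.
	 */
	switch (fn & DFLTCC_FN_MASK) {
	case DFLTCC_CMPR:
		kmsan_unpoison_memory(orig_t2, t2 - orig_t2);
		break;
	/* The expansion case would be handled similarly. */
	}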



Re: [PATCH v3 27/34] s390/irqflags: Do not instrument arch_local_irq_*() with KMSAN

2023-12-22 Thread Alexander Potapenko
On Thu, Dec 14, 2023 at 12:36 AM Ilya Leoshkevich  wrote:
>
> KMSAN generates the following false positives on s390x:
>
> [6.063666] DEBUG_LOCKS_WARN_ON(lockdep_hardirqs_enabled())
> [ ...]
> [6.577050] Call Trace:
> [6.619637]  [<0690d2de>] check_flags+0x1fe/0x210
> [6.665411] ([<0690d2da>] check_flags+0x1fa/0x210)
> [6.707478]  [<006cec1a>] lock_acquire+0x2ca/0xce0
> [6.749959]  [<069820ea>] _raw_spin_lock_irqsave+0xea/0x190
> [6.794912]  [<041fc988>] __stack_depot_save+0x218/0x5b0
> [6.838420]  [<0197affe>] __msan_poison_alloca+0xfe/0x1a0
> [6.882985]  [<07c5827c>] start_kernel+0x70c/0xd50
> [6.927454]  [<00100036>] startup_continue+0x36/0x40
>
> Between trace_hardirqs_on() and `stosm __mask, 3` lockdep thinks that
> interrupts are on, but on the CPU they are still off. KMSAN
> instrumentation takes spinlocks, giving lockdep a chance to see and
> complain about this discrepancy.
>
> KMSAN instrumentation is inserted in order to poison the __mask
> variable. Disable instrumentation in the respective functions. They are
> very small and it's easy to see that no important metadata updates are
> lost because of this.
>
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH v3 28/34] s390/mm: Define KMSAN metadata for vmalloc and modules

2023-12-21 Thread Alexander Gordeev
On Thu, Dec 14, 2023 at 12:24:48AM +0100, Ilya Leoshkevich wrote:
> The pages for the KMSAN metadata associated with most kernel mappings
> are taken from memblock by the common code. However, vmalloc and module
> metadata needs to be defined by the architectures.
> 
> Be a little bit more careful than x86: allocate exactly MODULES_LEN
> for the module shadow and origins, and then take 2/3 of vmalloc for
> the vmalloc shadow and origins. This ensures that users passing small
> vmalloc= values on the command line do not cause module metadata
> collisions.
> 
> Reviewed-by: Alexander Potapenko 
> Signed-off-by: Ilya Leoshkevich 
> ---
>  arch/s390/boot/startup.c|  8 
>  arch/s390/include/asm/pgtable.h | 10 ++
>  2 files changed, 18 insertions(+)
> 
> diff --git a/arch/s390/boot/startup.c b/arch/s390/boot/startup.c
> index 8104e0e3d188..e37e7ffda430 100644
> --- a/arch/s390/boot/startup.c
> +++ b/arch/s390/boot/startup.c
> @@ -253,9 +253,17 @@ static unsigned long setup_kernel_memory_layout(void)
>   MODULES_END = round_down(__abs_lowcore, _SEGMENT_SIZE);
>   MODULES_VADDR = MODULES_END - MODULES_LEN;
>   VMALLOC_END = MODULES_VADDR;
> +#ifdef CONFIG_KMSAN
> + VMALLOC_END -= MODULES_LEN * 2;
> +#endif
>  
>   /* allow vmalloc area to occupy up to about 1/2 of the rest virtual 
> space left */
>   vmalloc_size = min(vmalloc_size, round_down(VMALLOC_END / 2, 
> _REGION3_SIZE));

Since commit 2a65c6e1ad06 ("s390/boot: always align vmalloc area on segment 
boundary")
vmalloc_size is aligned on _SEGMENT_SIZE boundary.

> +#ifdef CONFIG_KMSAN
> + /* take 2/3 of vmalloc area for KMSAN shadow and origins */
> + vmalloc_size = round_down(vmalloc_size / 3, _REGION3_SIZE);

And thus, the alignment here should be _SEGMENT_SIZE as well.

> + VMALLOC_END -= vmalloc_size * 2;
> +#endif
>   VMALLOC_START = VMALLOC_END - vmalloc_size;
>  
>   /* split remaining virtual space between 1:1 mapping & vmemmap array */

...

With the above fixup:
Acked-by: Alexander Gordeev 

Thanks!
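
To make the arithmetic in the quoted hunk easier to follow, the resulting
split of the virtual space can be summarized as follows (a sketch; the exact
placement of shadow versus origins is defined elsewhere):

/*
 * MODULES_END ....... __abs_lowcore rounded down to a segment boundary
 * MODULES_VADDR ..... MODULES_END - MODULES_LEN
 * module metadata ... 2 * MODULES_LEN (shadow + origins)
 * vmalloc metadata .. 2 * vmalloc_size (shadow + origins)
 * VMALLOC_END/START . vmalloc_size of usable vmalloc area
 *
 * The original vmalloc budget is split three ways, which is why
 * vmalloc_size is divided by 3 and VMALLOC_END then moves down by
 * twice that amount.
 */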



Re: [PATCH] tracing / synthetic: Disable events after testing in synth_event_gen_test_init()

2023-12-21 Thread Alexander Graf

Hi Steve,

On 20.12.23 17:15, Steven Rostedt wrote:


From: "Steven Rostedt (Google)" 

The synth_event_gen_test module can be built in, if someone wants to run
the tests at boot up and not have to load them.

The synth_event_gen_test_init() function creates and enables the synthetic
events and runs its tests.

The synth_event_gen_test_exit() disables the events it created and
destroys the events.

If the module is builtin, the events are never disabled. The issue is, the
events should be disabled after the tests are run. This could be an issue
if the rest of the boot up tests are enabled, as they expect the events to
be in a known state before testing. That known state happens to be
disabled.

When CONFIG_SYNTH_EVENT_GEN_TEST=y and CONFIG_EVENT_TRACE_STARTUP_TEST=y
a warning will trigger:

  Running tests on trace events:
  Testing event create_synth_test:
  Enabled event during self test!
  [ cut here ]
  WARNING: CPU: 2 PID: 1 at kernel/trace/trace_events.c:4150 
event_trace_self_tests+0x1c2/0x480
  Modules linked in:
  CPU: 2 PID: 1 Comm: swapper/0 Not tainted 
6.7.0-rc2-test-00031-gb803d7c664d5-dirty #276
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.2-debian-1.16.2-1 04/01/2014
  RIP: 0010:event_trace_self_tests+0x1c2/0x480
  Code: bb e8 a2 ab 5d fc 48 8d 7b 48 e8 f9 3d 99 fc 48 8b 73 48 40 f6 c6 01 0f 84 d6 
fe ff ff 48 c7 c7 20 b6 ad bb e8 7f ab 5d fc 90 <0f> 0b 90 48 89 df e8 d3 3d 99 
fc 48 8b 1b 4c 39 f3 0f 85 2c ff ff
  RSP: :c901fdc0 EFLAGS: 00010246
  RAX: 0029 RBX: 88810399ca80 RCX: 
  RDX:  RSI: b9f19478 RDI: 88823c734e64
  RBP: 88810399f300 R08:  R09: fbfff79eb32a
  R10: bcf59957 R11: 0001 R12: 888104068090
  R13: bc89f0a0 R14: bc8a0f08 R15: 0078
  FS:  () GS:88823c70() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2:  CR3: 0001f6282001 CR4: 00170ef0
  Call Trace:
   
   ? __warn+0xa5/0x200
   ? event_trace_self_tests+0x1c2/0x480
   ? report_bug+0x1f6/0x220
   ? handle_bug+0x6f/0x90
   ? exc_invalid_op+0x17/0x50
   ? asm_exc_invalid_op+0x1a/0x20
   ? tracer_preempt_on+0x78/0x1c0
   ? event_trace_self_tests+0x1c2/0x480
   ? __pfx_event_trace_self_tests_init+0x10/0x10
   event_trace_self_tests_init+0x27/0xe0
   do_one_initcall+0xd6/0x3c0
   ? __pfx_do_one_initcall+0x10/0x10
   ? kasan_set_track+0x25/0x30
   ? rcu_is_watching+0x38/0x60
   kernel_init_freeable+0x324/0x450
   ? __pfx_kernel_init+0x10/0x10
   kernel_init+0x1f/0x1e0
   ? _raw_spin_unlock_irq+0x33/0x50
   ret_from_fork+0x34/0x60
   ? __pfx_kernel_init+0x10/0x10
   ret_from_fork_asm+0x1b/0x30
   

This is because the synth_event_gen_test_init() left the synthetic events
that it created enabled. By having it disable them after testing, the
other selftests will run fine.

Cc: sta...@vger.kernel.org
Fixes: 9fe41efaca084 ("tracing: Add synth event generation test module")
Reported-by: Alexander Graf 
Signed-off-by: Steven Rostedt (Google) 



Thanks a bunch for the super quick turnaround time for the fix! I can 
confirm that I'm no longer seeing the warning :)


Tested-by: Alexander Graf 


Do we need another similar patch for the kprobe self tests? The below is 
with 55cb5f43689d7 plus an unrelated initrd patch plus this patch and 
the following .config: http://csgraf.de/tmp2/config-ftrace.xz


[  919.717134] Testing all events: OK
[  924.418194] Testing ftrace filter: OK
[  924.418887] trace_kprobe: Testing kprobe tracing:
[  924.424244] [ cut here ]
[  924.424952] WARNING: CPU: 2 PID: 1 at 
kernel/trace/trace_kprobe.c:2073 kprobe_trace_self_tests_init+0x192/0x540

[  924.425659] Modules linked in:
[  924.425886] CPU: 2 PID: 1 Comm: swapper/0 Not tainted 
6.7.0-rc6-00024-gc10698ad3c9a #298
[  924.426448] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014

[  924.427230] RIP: 0010:kprobe_trace_self_tests_init+0x192/0x540
[  924.427639] Code: 7e 10 74 3b 48 8b 36 48 39 d6 75 f2 0f 0b 48 c7 c7 
58 71 79 a5 e8 ee 3e 5a fe 48 c7 c7 20 38 b7 a5 e8 a2 51 68 fe 85 c0 74 
33 <0f> 0b 48 c7 c7 38 73 79 a5 e8 d0 3e 5a fe e8 4b 64 62 fe eb 23 48

[  924.428922] RSP: :ab508001be58 EFLAGS: 00010286
[  924.429288] RAX: fff0 RBX: 005a RCX: 
0202
[  924.429800] RDX:  RSI: 0002e970 RDI: 
a5b708a0
[  924.430295] RBP:  R08: 0001 R09: 
a4855a90
[  924.430794] R10: 0007 R11: 028a R12: 
0001
[  924.431286] R13: a5cc9a00 R14: 8cec81bebe00 R15: 
a619f0f0
[  924.431785] FS:  () GS:8cecf910() 
knlGS:

[  924.432346] CS:  0010 DS:  ES:  CR0

Re: [PATCH v3 33/34] s390: Implement the architecture-specific kmsan functions

2023-12-20 Thread Alexander Potapenko
On Thu, Dec 14, 2023 at 12:37 AM Ilya Leoshkevich  wrote:
>
> arch_kmsan_get_meta_or_null() finds the lowcore shadow by querying the
> prefix and calling kmsan_get_metadata() again.
>
> kmsan_virt_addr_valid() delegates to virt_addr_valid().
>
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH v3 24/34] s390/cpumf: Unpoison STCCTM output buffer

2023-12-20 Thread Alexander Potapenko
On Thu, Dec 14, 2023 at 12:37 AM Ilya Leoshkevich  wrote:
>
> stcctm() uses the "Q" constraint for dest, therefore KMSAN does not
> understand that it fills multiple doublewords pointed to by dest, not
> just one. This results in false positives.
>
> Unpoison the whole dest manually with kmsan_unpoison_memory().
>
> Reported-by: Alexander Gordeev 
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH RESEND] mac802154: Fix uninit-value access in ieee802154_hdr_push_sechdr

2023-12-14 Thread Alexander Aring
Hi,

On Thu, Dec 14, 2023 at 10:01 PM Alexander Aring  wrote:
>
> Hi,
>
> On Mon, Dec 4, 2023 at 3:57 AM Miquel Raynal  
> wrote:
> >
> > Hi Zhang,
> >
> > zhang_shur...@foxmail.com wrote on Sat,  2 Dec 2023 22:58:52 +0800:
> >
> > > The syzkaller reported an issue:
> >
> > Subject should start with [PATCH wpan]
> >
> > >
> > > BUG: KMSAN: uninit-value in ieee802154_hdr_push_sechdr 
> > > net/ieee802154/header_ops.c:54 [inline]
> > > BUG: KMSAN: uninit-value in ieee802154_hdr_push+0x971/0xb90 
> > > net/ieee802154/header_ops.c:108
> > >  ieee802154_hdr_push_sechdr net/ieee802154/header_ops.c:54 [inline]
> > >  ieee802154_hdr_push+0x971/0xb90 net/ieee802154/header_ops.c:108
> > >  ieee802154_header_create+0x9c0/0xc00 net/mac802154/iface.c:396
> > >  wpan_dev_hard_header include/net/cfg802154.h:494 [inline]
> > >  dgram_sendmsg+0xd1d/0x1500 net/ieee802154/socket.c:677
> > >  ieee802154_sock_sendmsg+0x91/0xc0 net/ieee802154/socket.c:96
> > >  sock_sendmsg_nosec net/socket.c:725 [inline]
> > >  sock_sendmsg net/socket.c:748 [inline]
> > >  sys_sendmsg+0x9c2/0xd60 net/socket.c:2494
> > >  ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2548
> > >  __sys_sendmsg+0x225/0x3c0 net/socket.c:2577
> > >  __compat_sys_sendmsg net/compat.c:346 [inline]
> > >  __do_compat_sys_sendmsg net/compat.c:353 [inline]
> > >  __se_compat_sys_sendmsg net/compat.c:350 [inline]
> > >
> > > We found hdr->key_id_mode is uninitialized in 
> > > mac802154_set_header_security()
> > > which indicates hdr.fc.security_enabled should be 0. However, it is set 
> > > to be cb->secen before.
> > > Later, ieee802154_hdr_push_sechdr is invoked, causing KMSAN to complain
> > > about an uninit-value issue.
> >
> > I am not too deeply involved in the security header but for me it feels
> > like your patch does the opposite of what's needed. We should maybe
> > initialize hdr->key_id_mode based on the value in cb->secen, no? (maybe
> > Alexander will have a better understanding than I have).
>
> I can't help yet with a better answer as to why syzkaller reports it, but it
> will break things, as we are using skb->cb to pass additional parameters
> through header_ops->create()... in this case it is some sockopts of
> af802154, I guess.
>

Maybe we just need to init some "more" defaults in [0]

- Alex

[0] 
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/net/ieee802154/socket.c?h=v6.7-rc5#n474




Re: [PATCH RESEND] mac802154: Fix uninit-value access in ieee802154_hdr_push_sechdr

2023-12-14 Thread Alexander Aring
Hi,

On Mon, Dec 4, 2023 at 3:57 AM Miquel Raynal  wrote:
>
> Hi Zhang,
>
> zhang_shur...@foxmail.com wrote on Sat,  2 Dec 2023 22:58:52 +0800:
>
> > The syzkaller reported an issue:
>
> Subject should start with [PATCH wpan]
>
> >
> > BUG: KMSAN: uninit-value in ieee802154_hdr_push_sechdr 
> > net/ieee802154/header_ops.c:54 [inline]
> > BUG: KMSAN: uninit-value in ieee802154_hdr_push+0x971/0xb90 
> > net/ieee802154/header_ops.c:108
> >  ieee802154_hdr_push_sechdr net/ieee802154/header_ops.c:54 [inline]
> >  ieee802154_hdr_push+0x971/0xb90 net/ieee802154/header_ops.c:108
> >  ieee802154_header_create+0x9c0/0xc00 net/mac802154/iface.c:396
> >  wpan_dev_hard_header include/net/cfg802154.h:494 [inline]
> >  dgram_sendmsg+0xd1d/0x1500 net/ieee802154/socket.c:677
> >  ieee802154_sock_sendmsg+0x91/0xc0 net/ieee802154/socket.c:96
> >  sock_sendmsg_nosec net/socket.c:725 [inline]
> >  sock_sendmsg net/socket.c:748 [inline]
> >  sys_sendmsg+0x9c2/0xd60 net/socket.c:2494
> >  ___sys_sendmsg+0x28d/0x3c0 net/socket.c:2548
> >  __sys_sendmsg+0x225/0x3c0 net/socket.c:2577
> >  __compat_sys_sendmsg net/compat.c:346 [inline]
> >  __do_compat_sys_sendmsg net/compat.c:353 [inline]
> >  __se_compat_sys_sendmsg net/compat.c:350 [inline]
> >
> > We found hdr->key_id_mode is uninitialized in 
> > mac802154_set_header_security()
> > which indicates hdr.fc.security_enabled should be 0. However, it is set to 
> > be cb->secen before.
> > Later, ieee802154_hdr_push_sechdr is invoked, causing KMSAN to complain
> > about an uninit-value issue.
>
> I am not too deeply involved in the security header but for me it feels
> like your patch does the opposite of what's needed. We should maybe
> initialize hdr->key_id_mode based on the value in cb->secen, no? (maybe
> Alexander will have a better understanding than I have).

I can't help yet with a better answer as to why syzkaller reports it, but it
will break things, as we are using skb->cb to pass additional parameters
through header_ops->create()... in this case it is some sockopts of
af802154, I guess.

Side note: we should stop doing that with skb->cb and introduce an
802.15.4-specific header_ops callback structure, instead of depending on such
a generic callback that does not fit here (and maybe somebody wraps it and
reuses skb->cb for their own needs, etc.).
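
A rough sketch of that idea (all names hypothetical, nothing like this exists
yet):

#include <linux/skbuff.h>
#include <net/cfg802154.h>

/*
 * Hypothetical: pass the security parameters explicitly instead of
 * smuggling them through skb->cb via the generic header_ops->create().
 */
struct wpan_hdr_create_args {
	struct ieee802154_addr daddr;
	struct ieee802154_addr saddr;
	bool secen;
	bool secen_override;
	u8 seclevel;
	bool seclevel_override;
};

struct wpan_header_ops {
	int (*create)(struct sk_buff *skb, struct net_device *dev,
		      const struct wpan_hdr_create_args *args,
		      unsigned int len);
};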

- Alex




Re: [PATCH v2 12/33] kmsan: Allow disabling KMSAN checks for the current task

2023-12-11 Thread Alexander Potapenko
On Tue, Nov 21, 2023 at 11:06 PM Ilya Leoshkevich  wrote:
>
> Like for KASAN, it's useful to temporarily disable KMSAN checks around,
> e.g., redzone accesses. Introduce kmsan_disable_current() and
> kmsan_enable_current(), which are similar to their KASAN counterparts.

Initially we used to have this disablement counter in KMSAN, but
adding it uncontrollably can result in KMSAN not functioning properly.
E.g. forgetting to call kmsan_disable_current() or underflowing the
counter will break reporting.
We'd better put this API in include/linux/kmsan.h to indicate it
should be discouraged.

> Even though it's not strictly necessary, make them reentrant, in order
> to match the KASAN behavior.

Until this becomes strictly necessary, I think we'd better
KMSAN_WARN_ON if the counter is re-entered.
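
For reference, a minimal sketch of the API being discussed, assuming a
per-task depth counter similar to KASAN's (the field name is illustrative):

/* Illustrative only; the real patch may place the counter elsewhere. */
static inline void kmsan_disable_current(void)
{
	current->kmsan_ctx.depth++;
}

static inline void kmsan_enable_current(void)
{
	/* Catch unbalanced calls early instead of silently breaking reports. */
	WARN_ON_ONCE(!current->kmsan_ctx.depth);
	current->kmsan_ctx.depth--;
}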



Re: [PATCH v2 28/33] s390/string: Add KMSAN support

2023-12-11 Thread Alexander Potapenko
On Tue, Nov 21, 2023 at 11:03 PM Ilya Leoshkevich  wrote:
>
> Add KMSAN support for the s390 implementations of the string functions.
> Do this similar to how it's already done for KASAN, except that the
> optimized memset{16,32,64}() functions need to be disabled: it's
> important for KMSAN to know that they initialized something.
>
> The way boot code is built with regard to string functions is
> problematic, since most files think it's configured with sanitizers,
> but boot/string.c doesn't. This creates various problems with the
> memset64() definitions, depending on whether the code is built with
> sanitizers or fortify. This should probably be streamlined, but in the
> meantime resolve the issues by introducing the IN_BOOT_STRING_C macro,
> similar to the existing IN_ARCH_STRING_C macro.
>
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH v2 30/33] s390/uaccess: Add KMSAN support to put_user() and get_user()

2023-12-11 Thread Alexander Potapenko
On Tue, Nov 21, 2023 at 11:03 PM Ilya Leoshkevich  wrote:
>
> put_user() uses inline assembly with precise constraints, so Clang is
> in principle capable of instrumenting it automatically. Unfortunately,
> one of the constraints contains a dereferenced user pointer, and Clang
> does not currently distinguish user and kernel pointers. Therefore
> KMSAN attempts to access shadow for user pointers, which is not the right
> thing to do.
>
> An obvious fix to add __no_sanitize_memory to __put_user_fn() does not
> work, since it's __always_inline. And __always_inline cannot be removed
> due to the __put_user_bad() trick.
>
> A different obvious fix of using the "a" instead of the "+Q" constraint
> degrades the code quality, which is very important here, since it's a
> hot path.
>
> Instead, repurpose the __put_user_asm() macro to define
> __put_user_{char,short,int,long}_noinstr() functions and mark them with
> __no_sanitize_memory. For the non-KMSAN builds make them
> __always_inline in order to keep the generated code quality. Also
> define __put_user_{char,short,int,long}() functions, which call the
> aforementioned ones and which *are* instrumented, because they call
> KMSAN hooks, which may be implemented as macros.
>
> The same applies to get_user() as well.
>
> Acked-by: Heiko Carstens 
> Signed-off-by: Ilya Leoshkevich 

I think this patch makes sense, but I don't feel qualified enough to stamp
it. I hope Heiko's ack is enough.
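
For readers following along, a simplified sketch of the split being described,
with a plain store standing in for the real s390 inline assembly (names,
attributes and the exact KMSAN hook are illustrative):

#include <linux/kmsan-checks.h>

#ifdef CONFIG_KMSAN
#define __uaccess_kmsan_or_inline noinline __no_sanitize_memory
#else
#define __uaccess_kmsan_or_inline __always_inline
#endif

/* Uninstrumented inner helper: KMSAN must not touch the user pointer. */
static __uaccess_kmsan_or_inline int
__put_user_int_noinstr(unsigned int __user *ptr, unsigned int *val)
{
	/* Stand-in for the "+Q"-constrained inline assembly. */
	*(unsigned int __force *)ptr = *val;
	return 0;
}

/* Instrumented wrapper: compiled with KMSAN, so its hook calls take effect. */
static __always_inline int
__put_user_int(unsigned int __user *ptr, unsigned int *val)
{
	kmsan_copy_to_user(ptr, val, sizeof(*val), 0);
	return __put_user_int_noinstr(ptr, val);
}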



Re: [PATCH v2 32/33] s390: Implement the architecture-specific kmsan functions

2023-12-11 Thread Alexander Potapenko
> > Is there a possibility for infinite recursion here? E.g. can
> > `lowcore_ptr[raw_smp_processor_id()]` point somewhere in between
> > `(void *)&S390_lowcore` and `(void *)(&S390_lowcore + 1))`?
>
> No, it's allocated with __get_free_pages() or memblock_alloc_low().
> But since this question came up, I should probably add a check and
> a WARN_ON_ONCE() here.

Yes, please.


-- 
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg



Re: [PATCH v2 25/33] s390/cpacf: Unpoison the results of cpacf_trng()

2023-12-11 Thread Alexander Potapenko
On Tue, Nov 21, 2023 at 11:02 PM Ilya Leoshkevich  wrote:
>
> Prevent KMSAN from complaining about buffers filled by cpacf_trng()
> being uninitialized.
>
> Tested-by: Alexander Gordeev 
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH v2 32/33] s390: Implement the architecture-specific kmsan functions

2023-12-11 Thread Alexander Potapenko
> +static inline void *arch_kmsan_get_meta_or_null(void *addr, bool is_origin)
> +{
> +   if (addr >= (void *)&S390_lowcore &&
> +   addr < (void *)(&S390_lowcore + 1)) {
> +   /*
> +* Different lowcores accessed via S390_lowcore are described
> +* by the same struct page. Resolve the prefix manually in
> +* order to get a distinct struct page.
> +*/
> +   addr += (void *)lowcore_ptr[raw_smp_processor_id()] -
> +   (void *)&S390_lowcore;
> +   return kmsan_get_metadata(addr, is_origin);
> +   }
> +   return NULL;
> +}

Is there a possibility for infinite recursion here? E.g. can
`lowcore_ptr[raw_smp_processor_id()]` point somewhere in between
`(void *)&S390_lowcore` and `(void *)(&S390_lowcore + 1))`?



Re: [PATCH v2 27/33] s390/mm: Define KMSAN metadata for vmalloc and modules

2023-12-11 Thread Alexander Potapenko
On Tue, Nov 21, 2023 at 11:07 PM Ilya Leoshkevich  wrote:
>
> The pages for the KMSAN metadata associated with most kernel mappings
> are taken from memblock by the common code. However, vmalloc and module
> metadata needs to be defined by the architectures.
>
> Be a little bit more careful than x86: allocate exactly MODULES_LEN
> for the module shadow and origins, and then take 2/3 of vmalloc for
> the vmalloc shadow and origins. This ensures that users passing small
> vmalloc= values on the command line do not cause module metadata
> collisions.
>
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 

(hope some s390 maintainer acks this as well)



Re: [PATCH v2 10/33] kmsan: Expose kmsan_get_metadata()

2023-12-11 Thread Alexander Potapenko
> +static inline void *kmsan_get_metadata(void *addr, bool is_origin)
> +{
> +   return NULL;
> +}
> +
>  #endif

We shouldn't need this part, as kmsan_get_metadata() should never be
called in non-KMSAN builds.



Re: [PATCH v2 05/33] kmsan: Fix is_bad_asm_addr() on arches with overlapping address spaces

2023-12-11 Thread Alexander Potapenko
On Tue, Nov 21, 2023 at 11:02 PM Ilya Leoshkevich  wrote:
>
> Comparing pointers with TASK_SIZE does not make sense when kernel and
> userspace overlap. Skip the comparison when this is the case.
>
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH v2 23/33] s390/boot: Add the KMSAN runtime stub

2023-12-08 Thread Alexander Potapenko
On Tue, Nov 21, 2023 at 11:02 PM Ilya Leoshkevich  wrote:
>
> It should be possible to have inline functions in the s390 header
> files, which call kmsan_unpoison_memory(). The problem is that these
> header files might be included by the decompressor, which does not
> contain KMSAN runtime, causing linker errors.
>
> Not compiling these calls if __SANITIZE_MEMORY__ is not defined -
> either by changing kmsan-checks.h or at the call sites - may cause
> unintended side effects, since calling these functions from
> uninstrumented code that is linked into the kernel is a valid use case.
>
> One might want to explicitly distinguish between the kernel and the
> decompressor. Checking for a decompressor-specific #define is quite
> heavy-handed, and will have to be done at all call sites.
>
> A more generic approach is to provide a dummy kmsan_unpoison_memory()
> definition. This produces some runtime overhead, but only when building
> with CONFIG_KMSAN. The benefit is that it does not disturb the existing
> KMSAN build logic and call sites don't need to be changed.
>
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH v2 09/33] kmsan: Introduce kmsan_memmove_metadata()

2023-12-08 Thread Alexander Potapenko
On Tue, Nov 21, 2023 at 11:07 PM Ilya Leoshkevich  wrote:
>
> It is useful to manually copy metadata in order to describe the effects
> of memmove()-like logic in uninstrumented code or inline asm. Introduce
> kmsan_memmove_metadata() for this purpose.
>
> Signed-off-by: Ilya Leoshkevich 
> ---
>  include/linux/kmsan-checks.h | 14 ++
>  mm/kmsan/hooks.c | 11 +++
>  2 files changed, 25 insertions(+)
>
> diff --git a/include/linux/kmsan-checks.h b/include/linux/kmsan-checks.h
> index c4cae333deec..5218973f0ad0 100644
> --- a/include/linux/kmsan-checks.h
> +++ b/include/linux/kmsan-checks.h
> @@ -61,6 +61,17 @@ void kmsan_check_memory(const void *address, size_t size);
>  void kmsan_copy_to_user(void __user *to, const void *from, size_t to_copy,
> size_t left);
>
> +/**
> + * kmsan_memmove_metadata() - Copy kernel memory range metadata.
> + * @dst: start of the destination kernel memory range.
> + * @src: start of the source kernel memory range.
> + * @n:   size of the memory ranges.
> + *
> + * KMSAN will treat the destination range as if its contents were memmove()d
> + * from the source range.
> + */
> +void kmsan_memmove_metadata(void *dst, const void *src, size_t n);

As noted in patch 18/33, I am pretty sure we shouldn't need this function.



Re: [PATCH v2 18/33] lib/string: Add KMSAN support to strlcpy() and strlcat()

2023-12-08 Thread Alexander Potapenko
On Tue, Nov 21, 2023 at 11:02 PM Ilya Leoshkevich  wrote:
>
> Currently KMSAN does not fully propagate metadata in strlcpy() and
> strlcat(), because they are built with -ffreestanding and call
> memcpy(). In this combination memcpy() calls are not instrumented.

Is this something specific to s390?

> Fix by copying the metadata manually. Add the __STDC_HOSTED__ #ifdef in
> case the code is compiled with different flags in the future.
>
> Signed-off-by: Ilya Leoshkevich 
> ---
>  lib/string.c | 6 ++
>  1 file changed, 6 insertions(+)
>
> diff --git a/lib/string.c b/lib/string.c
> index be26623953d2..e83c6dd77ec6 100644
> --- a/lib/string.c
> +++ b/lib/string.c
> @@ -111,6 +111,9 @@ size_t strlcpy(char *dest, const char *src, size_t size)
> if (size) {
> size_t len = (ret >= size) ? size - 1 : ret;
> __builtin_memcpy(dest, src, len);

On x86, I clearly see this __builtin_memcpy() being replaced with
__msan_memcpy().



Re: [PATCH v2 04/33] kmsan: Increase the maximum store size to 4096

2023-12-08 Thread Alexander Potapenko
On Tue, Nov 21, 2023 at 11:07 PM Ilya Leoshkevich  wrote:
>
> The inline assembly block in s390's chsc() stores that much.
>
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH v2 13/33] kmsan: Introduce memset_no_sanitize_memory()

2023-12-08 Thread Alexander Potapenko
> A problem with __memset() is that, at least for me, it always ends
> up being a call. There is a use case where we need to write only 1
> byte, so I thought that introducing a call there (when compiling
> without KMSAN) would be unacceptable.

I wonder what happens with that use case if we e.g. build with fortify-source.
Calling memset() for a single byte might indicate that the code is not hot.

> > ...
> >
> > > +__no_sanitize_memory
> > > +static inline void *memset_no_sanitize_memory(void *s, int c,
> > > size_t n)
> > > +{
> > > +   return memset(s, c, n);
> > > +}
> >
> > I think depending on the compiler optimizations this might end up
> > being a call to the normal memset(), which would still change the shadow
> > bytes.
>
> Interesting, do you have some specific scenario in mind? I vaguely
> remember that in the past there were cases when sanitizer annotations
> were lost after inlining, but I thought they were sorted out?

Sanitizer annotations are indeed lost after inlining, and we cannot do
much about that.
They are implemented using function attributes, and if a function
dissolves after inlining, we cannot possibly know which instructions
belonged to it.

Consider the following example (also available at
https://godbolt.org/z/5r7817G8e):

==
void *kmalloc(int size);

__attribute__((no_sanitize("kernel-memory")))
__attribute__((always_inline))
static void *memset_nosanitize(void *s, int c, int n) {
  return __builtin_memset(s, c, n);
}

void *do_something_nosanitize(int size) {
  void *ptr = kmalloc(size);
  memset_nosanitize(ptr, 0, size);
  return ptr;
}

void *do_something_sanitize(int size) {
  void *ptr = kmalloc(size);
  __builtin_memset(ptr, 0, size);
  return ptr;
}
==

If memset_nosanitize() has __attribute__((always_inline)), the
compiler generates the same LLVM IR calling __msan_memset() for both
do_something_nosanitize() and do_something_sanitize().
If we comment out this attribute, do_something_nosanitize() calls
memset_nosanitize(), which doesn't have the sanitize_memory attribute.

But even now __builtin_memset() is still calling __msan_memset(),
because __attribute__((no_sanitize("kernel-memory"))) somewhat
counterintuitively still preserves some instrumentation (see
include/linux/compiler-clang.h for details).
Replacing __attribute__((no_sanitize("kernel-memory"))) with
__attribute__((disable_sanitizer_instrumentation)) fixes this
situation:

define internal fastcc noundef ptr @memset_nosanitize(void*, int,
int)(ptr noundef returned writeonly %s, i32 noundef %n) unnamed_addr
#2 {
entry:
%conv = sext i32 %n to i64
tail call void @llvm.memset.p0.i64(ptr align 1 %s, i8 0, i64 %conv, i1 false)
ret ptr %s
}

>
> And, in any case, if this were to happen, wouldn't it be considered a
> compiler bug that needs fixing there, and not in the kernel?

As stated above, I don't think this is a compiler bug - it is more or less
working as intended.
If we really want the ability to inline __memset(), we could transform
it into memset() in non-sanitizer builds, but perhaps having a call is
also acceptable?
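
A minimal sketch of that last option (assuming nothing about the existing s390
string headers):

/*
 * Illustrative only: keep the out-of-line __memset() for KMSAN builds, and
 * let it degrade to the ordinary, inlinable memset() otherwise.
 */
#ifdef CONFIG_KMSAN
void *__memset(void *s, int c, size_t n);
#else
#define __memset(s, c, n) memset((s), (c), (n))
#endif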



Re: [PATCH v2 19/33] lib/zlib: Unpoison DFLTCC output buffers

2023-12-08 Thread Alexander Potapenko
On Fri, Dec 8, 2023 at 3:14 PM Ilya Leoshkevich  wrote:
>
> On Fri, 2023-12-08 at 14:32 +0100, Alexander Potapenko wrote:
> > On Tue, Nov 21, 2023 at 11:07 PM Ilya Leoshkevich 
> > wrote:
> > >
> > > The constraints of the DFLTCC inline assembly are not precise: they
> > > do not communicate the size of the output buffers to the compiler,
> > > so
> > > it cannot automatically instrument it.
> >
> > KMSAN usually does a poor job instrumenting inline assembly.
> > Wouldn't it be better to switch to the pure C ZLIB implementation, making
> > ZLIB_DFLTCC depend on !KMSAN?
>
> Normally I would agree, but the kernel DFLTCC code base is synced with
> the zlib-ng code base to the extent that it uses the zlib-ng code style
> instead of the kernel code style, and MSAN annotations are already a
> part of the zlib-ng code base. So I would prefer to keep them for
> consistency.

Hm, I didn't realize this code is being taken from elsewhere.
If so, maybe we should come up with an annotation that can be
contributed to zlib-ng, so that it doesn't cause merge conflicts every
time Mikhail is doing an update?
(leaving this up to you to decide).

If you decide to go with the current solution, please consider adding
an #include for kmsan-checks.h, which introduces
kmsan_unpoison_memory().
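
One possible shape for such a shared annotation (purely hypothetical, not
something zlib-ng provides today):

#ifndef __has_feature
#define __has_feature(x) 0
#endif

/* Hypothetical macro the kernel and userspace zlib-ng could both carry. */
#if defined(__KERNEL__) && defined(CONFIG_KMSAN)
#include <linux/kmsan-checks.h>
#define DFLTCC_UNPOISON(ptr, size) kmsan_unpoison_memory((ptr), (size))
#elif __has_feature(memory_sanitizer)
#include <sanitizer/msan_interface.h>
#define DFLTCC_UNPOISON(ptr, size) __msan_unpoison((ptr), (size))
#else
#define DFLTCC_UNPOISON(ptr, size) do { } while (0)
#endif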



Re: [PATCH v2 26/33] s390/ftrace: Unpoison ftrace_regs in kprobe_ftrace_handler()

2023-12-08 Thread Alexander Potapenko
On Tue, Nov 21, 2023 at 11:02 PM Ilya Leoshkevich  wrote:
>
> s390 uses assembly code to initialize ftrace_regs and call
> kprobe_ftrace_handler(). Therefore, from the KMSAN's point of view,
> ftrace_regs is poisoned on kprobe_ftrace_handler() entry. This causes
> KMSAN warnings when running the ftrace testsuite.
>
> Fix by trusting the assembly code and always unpoisoning ftrace_regs in
> kprobe_ftrace_handler().
>
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH v2 01/33] ftrace: Unpoison ftrace_regs in ftrace_ops_list_func()

2023-12-08 Thread Alexander Potapenko
On Tue, Nov 21, 2023 at 11:02 PM Ilya Leoshkevich  wrote:
>
> Architectures use assembly code to initialize ftrace_regs and call
> ftrace_ops_list_func(). Therefore, from the KMSAN's point of view,
> ftrace_regs is poisoned on ftrace_ops_list_func entry(). This causes
> KMSAN warnings when running the ftrace testsuite.

I couldn't reproduce these warnings on x86, hope you really need this
change on s390 :)

> Fix by trusting the architecture-specific assembly code and always
> unpoisoning ftrace_regs in ftrace_ops_list_func.
>
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH v2 17/33] mm: kfence: Disable KMSAN when checking the canary

2023-12-08 Thread Alexander Potapenko
On Fri, Dec 8, 2023 at 1:53 PM Alexander Potapenko  wrote:
>
> On Tue, Nov 21, 2023 at 11:02 PM Ilya Leoshkevich  wrote:
> >
> > KMSAN warns about check_canary() accessing the canary.
> >
> > The reason is that, even though set_canary() is properly instrumented
> > and sets shadow, slub explicitly poisons the canary's address range
> > afterwards.
> >
> > Unpoisoning the canary is not the right thing to do: only
> > check_canary() is supposed to ever touch it. Instead, disable KMSAN
> > checks around canary read accesses.
> >
> > Signed-off-by: Ilya Leoshkevich 
> Reviewed-by: Alexander Potapenko 

and even

Tested-by: Alexander Potapenko 



Re: [PATCH v2 14/33] kmsan: Support SLAB_POISON

2023-12-08 Thread Alexander Potapenko
On Tue, Nov 21, 2023 at 11:02 PM Ilya Leoshkevich  wrote:
>
> Avoid false KMSAN negatives with SLUB_DEBUG by allowing
> kmsan_slab_free() to poison the freed memory, and by preventing
> init_object() from unpoisoning new allocations. The usage of
> memset_no_sanitize_memory() does not degrade the generated code
> quality.
>
> There are two alternatives to this approach. First, init_object()
> can be marked with __no_sanitize_memory. This annotation should be used
> with great care, because it drops all instrumentation from the
> function, and any shadow writes will be lost. Even though this is not a
> concern with the current init_object() implementation, this may change
> in the future.
>
> Second, kmsan_poison_memory() calls may be added after memset() calls.
> The downside is that init_object() is called from
> free_debug_processing(), in which case poisoning will erase the
> distinction between simply uninitialized memory and UAF.
>
> Signed-off-by: Ilya Leoshkevich 
> ---
>  mm/kmsan/hooks.c |  2 +-
>  mm/slub.c| 10 ++
>  2 files changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c
> index 7b5814412e9f..7a30274b893c 100644
> --- a/mm/kmsan/hooks.c
> +++ b/mm/kmsan/hooks.c
> @@ -76,7 +76,7 @@ void kmsan_slab_free(struct kmem_cache *s, void *object)
> return;
>
> /* RCU slabs could be legally used after free within the RCU period */
> -   if (unlikely(s->flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)))
> +   if (unlikely(s->flags & SLAB_TYPESAFE_BY_RCU))
> return;
> /*
>  * If there's a constructor, freed memory must remain in the same 
> state
> diff --git a/mm/slub.c b/mm/slub.c
> index 63d281dfacdb..169e5f645ea8 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1030,7 +1030,8 @@ static void init_object(struct kmem_cache *s, void 
> *object, u8 val)
> unsigned int poison_size = s->object_size;
>
> if (s->flags & SLAB_RED_ZONE) {
> -   memset(p - s->red_left_pad, val, s->red_left_pad);
> +   memset_no_sanitize_memory(p - s->red_left_pad, val,

As I wrote in patch 13/33, let's try to use __memset() here (with a
comment that we want to preserve the previously poisoned memory)
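
Roughly like this (a sketch of the suggestion, kernel context assumed, not the
actual patch):

/* Write the left redzone without letting KMSAN mark it as initialized. */
static void slub_init_left_redzone(struct kmem_cache *s, u8 *p, u8 val)
{
	/*
	 * __memset() is not replaced by an instrumented call, so the redzone
	 * keeps the "uninitialized" shadow that kmsan_slab_free() set when
	 * the object was freed.
	 */
	__memset(p - s->red_left_pad, val, s->red_left_pad);
}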



Re: [PATCH v2 13/33] kmsan: Introduce memset_no_sanitize_memory()

2023-12-08 Thread Alexander Potapenko
On Tue, Nov 21, 2023 at 11:06 PM Ilya Leoshkevich  wrote:
>
> Add a wrapper for memset() that prevents unpoisoning.

We have __memset() already, won't it work for this case?
On the other hand, I am not sure you want to preserve the redzone in
its previous state (unless it's known to be poisoned).
You might consider explicitly unpoisoning the redzone instead.

...

> +__no_sanitize_memory
> +static inline void *memset_no_sanitize_memory(void *s, int c, size_t n)
> +{
> +   return memset(s, c, n);
> +}

I think depending on the compiler optimizations this might end up
being a call to the normal memset(), which would still change the shadow
bytes.



Re: [PATCH v2 24/33] s390/checksum: Add a KMSAN check

2023-12-08 Thread Alexander Potapenko
On Tue, Nov 21, 2023 at 11:02 PM Ilya Leoshkevich  wrote:
>
> Add a KMSAN check to the CKSM inline assembly, similar to how it was
> done for ASAN in commit e42ac7789df6 ("s390/checksum: always use cksm
> instruction").
>
> Acked-by: Alexander Gordeev 
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH v2 19/33] lib/zlib: Unpoison DFLTCC output buffers

2023-12-08 Thread Alexander Potapenko
On Tue, Nov 21, 2023 at 11:07 PM Ilya Leoshkevich  wrote:
>
> The constraints of the DFLTCC inline assembly are not precise: they
> do not communicate the size of the output buffers to the compiler, so
> it cannot automatically instrument it.

KMSAN usually does a poor job instrumenting inline assembly.
Wouldn't it be better to switch to the pure C ZLIB implementation, making
ZLIB_DFLTCC depend on !KMSAN?



Re: [PATCH v2 17/33] mm: kfence: Disable KMSAN when checking the canary

2023-12-08 Thread Alexander Potapenko
On Tue, Nov 21, 2023 at 11:02 PM Ilya Leoshkevich  wrote:
>
> KMSAN warns about check_canary() accessing the canary.
>
> The reason is that, even though set_canary() is properly instrumented
> and sets shadow, slub explicitly poisons the canary's address range
> afterwards.
>
> Unpoisoning the canary is not the right thing to do: only
> check_canary() is supposed to ever touch it. Instead, disable KMSAN
> checks around canary read accesses.
>
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH v2 33/33] kmsan: Enable on s390

2023-11-29 Thread Alexander Potapenko
Hi Ilya,

Sorry for this taking so long, I'll probably take a closer look next week.
Overall, the s390 part looks good to me, but I wanted to check the x86
behavior once again (and perhaps figure out how to avoid introducing
another way to disable KMSAN).
Do you happen to have a Git repo with your patches somewhere?



Re: [PATCH 26/32] s390/mm: Define KMSAN metadata for vmalloc and modules

2023-11-17 Thread Alexander Gordeev
On Thu, Nov 16, 2023 at 04:03:13PM +0100, Alexander Potapenko wrote:

Hi Alexander!

> > /* allow vmalloc area to occupy up to about 1/2 of the rest virtual 
> > space left */
> > vmalloc_size = min(vmalloc_size, round_down(VMALLOC_END / 2, 
> > _REGION3_SIZE));
> > +#ifdef CONFIG_KMSAN
> > +   /* take 2/3 of vmalloc area for KMSAN shadow and origins */
> > +   vmalloc_size = round_down(vmalloc_size / 3, PAGE_SIZE);
> Is it okay that vmalloc_size is only aligned on PAGE_SIZE?
> E.g. above the alignment is _REGION3_SIZE.

Good question!

This patch does not break anything, although the _REGION3_SIZE
alignment would be more consistent here. That said, we might rethink this
whole code piece, and the next version will reflect that.

Thanks!



Re: [PATCH 28/32] s390/traps: Unpoison the kernel_stack_overflow()'s pt_regs

2023-11-16 Thread Alexander Potapenko
On Wed, Nov 15, 2023 at 9:35 PM Ilya Leoshkevich  wrote:
>
> This is normally done by the generic entry code, but the
> kernel_stack_overflow() flow bypasses it.
>
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 

> ---
>  arch/s390/kernel/traps.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/s390/kernel/traps.c b/arch/s390/kernel/traps.c
> index 1d2aa448d103..dd7362806dbb 100644
> --- a/arch/s390/kernel/traps.c
> +++ b/arch/s390/kernel/traps.c
> @@ -27,6 +27,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  #include 
>  #include 
>  #include 
> @@ -260,6 +261,7 @@ static void monitor_event_exception(struct pt_regs *regs)
>
>  void kernel_stack_overflow(struct pt_regs *regs)
>  {
> +   kmsan_unpoison_entry_regs(regs);

I suggest adding a comment here.
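
For example, one possible wording (sketch only, kernel context assumed):

void kernel_stack_overflow(struct pt_regs *regs)
{
	/*
	 * This path is entered straight from the low-level assembly and
	 * bypasses the generic entry code that would normally unpoison
	 * pt_regs, so do it here explicitly.
	 */
	kmsan_unpoison_entry_regs(regs);

	/* existing overflow reporting / panic logic continues unchanged */
}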



Re: [PATCH 26/32] s390/mm: Define KMSAN metadata for vmalloc and modules

2023-11-16 Thread Alexander Potapenko
On Wed, Nov 15, 2023 at 9:35 PM Ilya Leoshkevich  wrote:
>
> The pages for the KMSAN metadata associated with most kernel mappings
> are taken from memblock by the common code. However, vmalloc and module
> metadata needs to be defined by the architectures.
>
> Be a little bit more careful than x86: allocate exactly MODULES_LEN
> for the module shadow and origins, and then take 2/3 of vmalloc for
> the vmalloc shadow and origins. This ensures that users passing small
> vmalloc= values on the command line do not cause module metadata
> collisions.
>
> Signed-off-by: Ilya Leoshkevich 
> ---
>  arch/s390/boot/startup.c|  8 
>  arch/s390/include/asm/pgtable.h | 10 ++
>  2 files changed, 18 insertions(+)
>
> diff --git a/arch/s390/boot/startup.c b/arch/s390/boot/startup.c
> index 8104e0e3d188..297c1062372a 100644
> --- a/arch/s390/boot/startup.c
> +++ b/arch/s390/boot/startup.c
> @@ -253,9 +253,17 @@ static unsigned long setup_kernel_memory_layout(void)
> MODULES_END = round_down(__abs_lowcore, _SEGMENT_SIZE);
> MODULES_VADDR = MODULES_END - MODULES_LEN;
> VMALLOC_END = MODULES_VADDR;
> +#ifdef CONFIG_KMSAN
> +   VMALLOC_END -= MODULES_LEN * 2;
> +#endif
>
> /* allow vmalloc area to occupy up to about 1/2 of the rest virtual 
> space left */
> vmalloc_size = min(vmalloc_size, round_down(VMALLOC_END / 2, 
> _REGION3_SIZE));
> +#ifdef CONFIG_KMSAN
> +   /* take 2/3 of vmalloc area for KMSAN shadow and origins */
> +   vmalloc_size = round_down(vmalloc_size / 3, PAGE_SIZE);
Is it okay that vmalloc_size is only aligned on PAGE_SIZE?
E.g. above the alignment is _REGION3_SIZE.



Re: [PATCH 13/32] kmsan: Support SLAB_POISON

2023-11-16 Thread Alexander Potapenko
On Wed, Nov 15, 2023 at 9:34 PM Ilya Leoshkevich  wrote:
>
> Avoid false KMSAN negatives with SLUB_DEBUG by allowing
> kmsan_slab_free() to poison the freed memory, and by preventing
> init_object() from unpoisoning new allocations.
>
> Signed-off-by: Ilya Leoshkevich 
> ---
>  mm/kmsan/hooks.c | 2 +-
>  mm/slub.c| 3 ++-
>  2 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/kmsan/hooks.c b/mm/kmsan/hooks.c
> index 7b5814412e9f..7a30274b893c 100644
> --- a/mm/kmsan/hooks.c
> +++ b/mm/kmsan/hooks.c
> @@ -76,7 +76,7 @@ void kmsan_slab_free(struct kmem_cache *s, void *object)
> return;
>
> /* RCU slabs could be legally used after free within the RCU period */
> -   if (unlikely(s->flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)))
> +   if (unlikely(s->flags & SLAB_TYPESAFE_BY_RCU))
> return;
> /*
>  * If there's a constructor, freed memory must remain in the same 
> state
> diff --git a/mm/slub.c b/mm/slub.c
> index 63d281dfacdb..8d9aa4d7cb7e 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -1024,7 +1024,8 @@ static __printf(3, 4) void slab_err(struct kmem_cache 
> *s, struct slab *slab,
> add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
>  }
>
> -static void init_object(struct kmem_cache *s, void *object, u8 val)
> +__no_sanitize_memory static void

__no_sanitize_memory should be used with great care, because it drops
all instrumentation from the function, and any shadow writes will be
lost.
Wouldn't it be better to add kmsan_poison() to init_object() if you want
it to stay uninitialized?
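
Something along these lines, assuming kmsan_poison_memory() from
<linux/kmsan-checks.h> is the hook meant here (sketch only):

#include <linux/kmsan-checks.h>

/*
 * Sketch: keep init_object() fully instrumented, then tell KMSAN that the
 * freshly written poison pattern should still count as uninitialized data.
 */
static void write_poison_pattern(void *p, u8 val, size_t poison_size)
{
	memset(p, val, poison_size);
	kmsan_poison_memory(p, poison_size);
}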



Re: [PATCH 07/32] kmsan: Remove a useless assignment from kmsan_vmap_pages_range_noflush()

2023-11-16 Thread Alexander Potapenko
On Wed, Nov 15, 2023 at 9:34 PM Ilya Leoshkevich  wrote:
>
> The value assigned to prot is immediately overwritten on the next line
> with PAGE_KERNEL. The right hand side of the assignment has no
> side-effects.
>
> Fixes: b073d7f8aee4 ("mm: kmsan: maintain KMSAN metadata for page operations")
> Suggested-by: Alexander Gordeev 
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 

> ---
>  mm/kmsan/shadow.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/mm/kmsan/shadow.c b/mm/kmsan/shadow.c
> index b9d05aff313e..2d57408c78ae 100644
> --- a/mm/kmsan/shadow.c
> +++ b/mm/kmsan/shadow.c
> @@ -243,7 +243,6 @@ int kmsan_vmap_pages_range_noflush(unsigned long start, 
> unsigned long end,
> s_pages[i] = shadow_page_for(pages[i]);
> o_pages[i] = origin_page_for(pages[i]);
> }
> -   prot = __pgprot(pgprot_val(prot) | _PAGE_NX);
> prot = PAGE_KERNEL;

This bug dates back to 5.1-rc2, when KMSAN didn't exist upstream.
The commit introducing vmap support already had it:
https://github.com/google/kmsan/commit/3ff9d7c640d378485286e1a99d85984ae6901f23
I don't remember what exactly required the more relaxed PAGE_KERNEL
mask though :)



Re: [PATCH 19/32] kmsan: Accept ranges starting with 0 on s390

2023-11-16 Thread Alexander Potapenko
On Wed, Nov 15, 2023 at 9:34 PM Ilya Leoshkevich  wrote:
>
> On s390 the virtual address 0 is valid (current CPU's lowcore is mapped
> there), therefore KMSAN should not complain about it.
>
> Disable the respective check on s390. There doesn't seem to be a
> Kconfig option to describe this situation, so explicitly check for
> s390.
>
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 
(see the nit below)

> ---
>  mm/kmsan/init.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/mm/kmsan/init.c b/mm/kmsan/init.c
> index ffedf4dbc49d..14f4a432fddd 100644
> --- a/mm/kmsan/init.c
> +++ b/mm/kmsan/init.c
> @@ -33,7 +33,9 @@ static void __init kmsan_record_future_shadow_range(void 
> *start, void *end)
> bool merged = false;
>
> KMSAN_WARN_ON(future_index == NUM_FUTURE_RANGES);
> -   KMSAN_WARN_ON((nstart >= nend) || !nstart || !nend);
> +   KMSAN_WARN_ON((nstart >= nend) ||
> + (!IS_ENABLED(CONFIG_S390) && !nstart) ||
Please add a comment explaining this bit.

> + !nend);
> nstart = ALIGN_DOWN(nstart, PAGE_SIZE);
> nend = ALIGN(nend, PAGE_SIZE);
>
> --
> 2.41.0
>
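
The requested comment could look something like this (sketch):

	KMSAN_WARN_ON((nstart >= nend) ||
		      /*
		       * On s390 the lowcore is mapped at virtual address 0,
		       * so a range starting at 0 is legitimate there.
		       */
		      (!IS_ENABLED(CONFIG_S390) && !nstart) ||
		      !nend);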



Re: [PATCH 06/32] kmsan: Fix kmsan_copy_to_user() on arches with overlapping address spaces

2023-11-16 Thread Alexander Potapenko
On Wed, Nov 15, 2023 at 9:34 PM Ilya Leoshkevich  wrote:
>
> Comparing pointers with TASK_SIZE does not make sense when kernel and
> userspace overlap. Assume that we are handling user memory access in
> this case.
>
> Reported-by: Alexander Gordeev 
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH 14/32] kmsan: Use ALIGN_DOWN() in kmsan_get_metadata()

2023-11-16 Thread Alexander Potapenko
On Wed, Nov 15, 2023 at 9:34 PM Ilya Leoshkevich  wrote:
>
> Improve the readability by replacing the custom aligning logic with
> ALIGN_DOWN(). Unlike other places where a similar sequence is used,
> there is no size parameter that needs to be adjusted, so the standard
> macro fits.

Good catch, thank you!

> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH 21/32] s390: Use a larger stack for KMSAN

2023-11-16 Thread Alexander Potapenko
On Wed, Nov 15, 2023 at 9:34 PM Ilya Leoshkevich  wrote:
>
> Adjust the stack size for the KMSAN-enabled kernel like it was done
> for the KASAN-enabled one in commit 7fef92ccadd7 ("s390/kasan: double
> the stack size"). Both tools have similar requirements.
>
> Reviewed-by: Alexander Gordeev 
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH 08/32] kmsan: Remove an x86-specific #include from kmsan.h

2023-11-16 Thread Alexander Potapenko
On Wed, Nov 15, 2023 at 9:34 PM Ilya Leoshkevich  wrote:
>
> Replace the x86-specific asm/pgtable_64_types.h #include with the
> linux/pgtable.h one, which all architectures have.
>
> Fixes: f80be4571b19 ("kmsan: add KMSAN runtime core")
> Suggested-by: Heiko Carstens 
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 
(see the comment below)

>
> -#include 
> +#include 

For the sake of consistency with other KMSAN code, please keep the
headers sorted alphabetically.



Re: [PATCH 03/32] kmsan: Disable KMSAN when DEFERRED_STRUCT_PAGE_INIT is enabled

2023-11-16 Thread Alexander Potapenko
On Wed, Nov 15, 2023 at 9:34 PM Ilya Leoshkevich  wrote:
>
> KMSAN relies on memblock returning all available pages to it
> (see kmsan_memblock_free_pages()). It partitions these pages into 3
> categories: pages available to the buddy allocator, shadow pages and
> origin pages. This partitioning is static.
>
> If new pages appear after kmsan_init_runtime(), it is considered
> an error. DEFERRED_STRUCT_PAGE_INIT causes this, so mark it as
> incompatible with KMSAN.

In the future we could probably collect the deferred pages as well,
but it's okay to disable KMSAN for now.

> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH 02/32] kmsan: Make the tests compatible with kmsan.panic=1

2023-11-16 Thread Alexander Potapenko
On Wed, Nov 15, 2023 at 9:34 PM Ilya Leoshkevich  wrote:
>
> It's useful to have both tests and kmsan.panic=1 during development,
> but right now the warnings that the tests cause lead to kernel
> panics.
>
> Temporarily set kmsan.panic=0 for the duration of the KMSAN testing.

Nice!

> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH 20/32] s390: Turn off KMSAN for boot, vdso and purgatory

2023-11-16 Thread Alexander Potapenko
On Wed, Nov 15, 2023 at 9:34 PM Ilya Leoshkevich  wrote:
>
> All other sanitizers are disabled for these components as well.
>
> Reviewed-by: Alexander Gordeev 
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 
(see a nit below)

> ---
>  arch/s390/boot/Makefile  | 1 +
>  arch/s390/kernel/vdso32/Makefile | 1 +
>  arch/s390/kernel/vdso64/Makefile | 1 +
>  arch/s390/purgatory/Makefile | 1 +
>  4 files changed, 4 insertions(+)
>
> diff --git a/arch/s390/boot/Makefile b/arch/s390/boot/Makefile
> index c7c81e5f9218..5a05c927f703 100644
> --- a/arch/s390/boot/Makefile
> +++ b/arch/s390/boot/Makefile
> @@ -8,6 +8,7 @@ GCOV_PROFILE := n
>  UBSAN_SANITIZE := n
>  KASAN_SANITIZE := n
>  KCSAN_SANITIZE := n
> +KMSAN_SANITIZE := n

Nit: I think having even a one-line comment before this block
(something similar to
https://elixir.bootlin.com/linux/latest/source/arch/x86/boot/Makefile#L12)
will make it more clear.

But given that the comment wasn't there before, leaving this up to you.



Re: [PATCH 11/32] kmsan: Export panic_on_kmsan

2023-11-16 Thread Alexander Potapenko
On Wed, Nov 15, 2023 at 9:34 PM Ilya Leoshkevich  wrote:
>
> When building the kmsan test as a module, modpost fails with the
> following error message:
>
> ERROR: modpost: "panic_on_kmsan" [mm/kmsan/kmsan_test.ko] undefined!
>
> Export panic_on_kmsan in order to improve the KMSAN usability for
> modules.
>
> Signed-off-by: Ilya Leoshkevich 
Reviewed-by: Alexander Potapenko 



Re: [PATCH 30/32] s390/unwind: Disable KMSAN checks

2023-11-16 Thread Alexander Potapenko
On Thu, Nov 16, 2023 at 10:04 AM Alexander Potapenko  wrote:
>
> On Wed, Nov 15, 2023 at 9:35 PM Ilya Leoshkevich  wrote:
> >
> > The unwind code can read uninitialized frames. Furthermore, even in
> > the good case, KMSAN does not emit shadow for backchains. Therefore
> > disable it for the unwinding functions.
> >
> > Signed-off-by: Ilya Leoshkevich 
> > ---
> >  arch/s390/kernel/unwind_bc.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/arch/s390/kernel/unwind_bc.c b/arch/s390/kernel/unwind_bc.c
> > index 0ece156fdd7c..7ecaab24783f 100644
> > --- a/arch/s390/kernel/unwind_bc.c
> > +++ b/arch/s390/kernel/unwind_bc.c
> > @@ -49,6 +49,7 @@ static inline bool is_final_pt_regs(struct unwind_state 
> > *state,
> >READ_ONCE_NOCHECK(regs->psw.mask) & PSW_MASK_PSTATE;
> >  }
> >
> > +__no_kmsan_checks
>
> Please add some comments to the source file to back this annotation,
> so that the intent is not lost in git history.

Apart from that,

Reviewed-by: Alexander Potapenko 


-- 
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Paul Manicle, Liana Sebastian
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg


