Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-05-01 Thread Michael S. Tsirkin
On Wed, May 01, 2024 at 10:57:38AM -0500, Mike Christie wrote:
> On 5/1/24 2:50 AM, Hillf Danton wrote:
> > On Wed, 1 May 2024 02:01:20 -0400 Michael S. Tsirkin 
> >>
> >> and then it failed testing.
> >>
> > So did my patch [1] but then the reason was spotted [2,3]
> > 
> > [1] https://lore.kernel.org/lkml/20240430110209.4310-1-hdan...@sina.com/
> > [2] https://lore.kernel.org/lkml/20240430225005.4368-1-hdan...@sina.com/
> > [3] https://lore.kernel.org/lkml/a7f8470617589...@google.com/
> 
> Just to make sure I understand the conclusion.
> 
> Edward's patch that just swaps the order of the calls:
> 
> https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/
> 
> fixes the UAF. I tested the same in my setup. However, when you guys tested it
> with syzbot, it also triggered a softirq/RCU warning.
> 
> The softirq/RCU part of the issue is fixed with this commit:
> 
> https://lore.kernel.org/all/20240427102808.29356-1-qiang.zhang1...@gmail.com/
> 
> commit 1dd1eff161bd55968d3d46bc36def62d71fb4785
> Author: Zqiang 
> Date:   Sat Apr 27 18:28:08 2024 +0800
> 
> softirq: Fix suspicious RCU usage in __do_softirq()
> 
> The problem was that I was testing with -next master which has that patch.
> It looks like you guys were testing against bb7a2467e6be which didn't have
> the patch, and so that's why you guys still hit the softirq/RCU issue. Later
> when you added that patch to your patch, it worked with syzbot.
> 
> So is it safe to assume that the softirq/RCU patch above will be upstream
> when the vhost changes go in or is there a tag I need to add to my patches?

That patch is upstream now. I rebased and asked syzbot to test
https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/
on top.

If that passes I will squash.




Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-05-01 Thread Michael S. Tsirkin
On Wed, May 01, 2024 at 10:57:38AM -0500, Mike Christie wrote:
> On 5/1/24 2:50 AM, Hillf Danton wrote:
> > On Wed, 1 May 2024 02:01:20 -0400 Michael S. Tsirkin 
> >>
> >> and then it failed testing.
> >>
> > So did my patch [1] but then the reason was spotted [2,3]
> > 
> > [1] https://lore.kernel.org/lkml/20240430110209.4310-1-hdan...@sina.com/
> > [2] https://lore.kernel.org/lkml/20240430225005.4368-1-hdan...@sina.com/
> > [3] https://lore.kernel.org/lkml/a7f8470617589...@google.com/
> 
> Just to make sure I understand the conclusion.
> 
> Edward's patch that just swaps the order of the calls:
> 
> https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/
> 
> fixes the UAF. I tested the same in my setup. However, when you guys tested it
> with syzbot, it also triggered a softirq/RCU warning.
> 
> The softirq/RCU part of the issue is fixed with this commit:
> 
> https://lore.kernel.org/all/20240427102808.29356-1-qiang.zhang1...@gmail.com/
> 
> commit 1dd1eff161bd55968d3d46bc36def62d71fb4785
> Author: Zqiang 
> Date:   Sat Apr 27 18:28:08 2024 +0800
> 
> softirq: Fix suspicious RCU usage in __do_softirq()
> 
> The problem was that I was testing with -next master which has that patch.
> It looks like you guys were testing against bb7a2467e6be which didn't have
> the patch, and so that's why you guys still hit the softirq/RCU issue. Later
> when you added that patch to your patch, it worked with syzbot.
> 
> So is it safe to assume that the softirq/RCU patch above will be upstream
> when the vhost changes go in or is there a tag I need to add to my patches?

Two points:
- I do not want bisect broken. If you depend on this patch, either I pick
  it too before your patch, or we defer until 1dd1eff161bd55968d3d46bc36def62d71fb4785
  is merged. You can also ask for that patch to be merged in this cycle.
- Do not assume - pls push somewhere a hash based on vhost that syzbot can test
  and confirm all is well. Thanks!




Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-05-01 Thread Mike Christie
On 5/1/24 2:50 AM, Hillf Danton wrote:
> On Wed, 1 May 2024 02:01:20 -0400 Michael S. Tsirkin 
>>
>> and then it failed testing.
>>
> So did my patch [1] but then the reason was spotted [2,3]
> 
> [1] https://lore.kernel.org/lkml/20240430110209.4310-1-hdan...@sina.com/
> [2] https://lore.kernel.org/lkml/20240430225005.4368-1-hdan...@sina.com/
> [3] https://lore.kernel.org/lkml/a7f8470617589...@google.com/

Just to make sure I understand the conclusion.

Edward's patch that just swaps the order of the calls:

https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/

fixes the UAF. I tested the same in my setup. However, when you guys tested it
with syzbot, it also triggered a softirq/RCU warning.

The softirq/RCU part of the issue is fixed with this commit:

https://lore.kernel.org/all/20240427102808.29356-1-qiang.zhang1...@gmail.com/

commit 1dd1eff161bd55968d3d46bc36def62d71fb4785
Author: Zqiang 
Date:   Sat Apr 27 18:28:08 2024 +0800

softirq: Fix suspicious RCU usage in __do_softirq()

The problem was that I was testing with -next master which has that patch.
It looks like you guys were testing against bb7a2467e6be which didn't have
the patch, and so that's why you guys still hit the softirq/RCU issue. Later
when you added that patch to your patch, it worked with syzbot.

So is it safe to assume that the softirq/RCU patch above will be upstream
when the vhost changes go in or is there a tag I need to add to my patches?



Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-05-01 Thread Hillf Danton
On Wed, 1 May 2024 02:01:20 -0400 Michael S. Tsirkin 
> 
> and then it failed testing.
> 
So did my patch [1] but then the reason was spotted [2,3]

[1] https://lore.kernel.org/lkml/20240430110209.4310-1-hdan...@sina.com/
[2] https://lore.kernel.org/lkml/20240430225005.4368-1-hdan...@sina.com/
[3] https://lore.kernel.org/lkml/a7f8470617589...@google.com/



Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-05-01 Thread Michael S. Tsirkin
On Wed, May 01, 2024 at 08:15:44AM +0800, Hillf Danton wrote:
> On Tue, Apr 30, 2024 at 11:23:04AM -0500, Mike Christie wrote:
> > On 4/30/24 8:05 AM, Edward Adam Davis wrote:
> > >  static int vhost_task_fn(void *data)
> > >  {
> > >   struct vhost_task *vtsk = data;
> > > @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
> > >   schedule();
> > >   }
> > >  
> > > - mutex_lock(&vtsk->exit_mutex);
> > > + mutex_lock(&exit_mutex);
> > >   /*
> > >    * If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
> > >    * When the vhost layer has called vhost_task_stop it's already stopped
> > > @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
> > >   vtsk->handle_sigkill(vtsk->data);
> > >   }
> > >   complete(&vtsk->exited);
> > > - mutex_unlock(&vtsk->exit_mutex);
> > > + mutex_unlock(&exit_mutex);
> > >  
> > 
> > Edward, thanks for the patch. I think though I just needed to swap the
> > order of the calls above.
> > 
> > Instead of:
> > 
> > complete(&vtsk->exited);
> > mutex_unlock(&vtsk->exit_mutex);
> > 
> > it should have been:
> > 
> > mutex_unlock(&vtsk->exit_mutex);
> > complete(&vtsk->exited);
> 
> JFYI Edward did it [1]
> 
> [1] 
> https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/

and then it failed testing.

> > 
> > If my analysis is correct, then Michael do you want me to resubmit a
> > patch on top of your vhost branch or resubmit the entire patchset?




Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-04-30 Thread Michael S. Tsirkin
On Tue, Apr 30, 2024 at 08:01:11PM -0500, Mike Christie wrote:
> On 4/30/24 7:15 PM, Hillf Danton wrote:
> > On Tue, Apr 30, 2024 at 11:23:04AM -0500, Mike Christie wrote:
> >> On 4/30/24 8:05 AM, Edward Adam Davis wrote:
> >>>  static int vhost_task_fn(void *data)
> >>>  {
> >>>   struct vhost_task *vtsk = data;
> >>> @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
> >>>   schedule();
> >>>   }
> >>>  
> >>> - mutex_lock(&vtsk->exit_mutex);
> >>> + mutex_lock(&exit_mutex);
> >>>   /*
> >>>    * If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
> >>>    * When the vhost layer has called vhost_task_stop it's already stopped
> >>> @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
> >>>   vtsk->handle_sigkill(vtsk->data);
> >>>   }
> >>>   complete(&vtsk->exited);
> >>> - mutex_unlock(&vtsk->exit_mutex);
> >>> + mutex_unlock(&exit_mutex);
> >>>  
> >>
> >> Edward, thanks for the patch. I think though I just needed to swap the
> >> order of the calls above.
> >>
> >> Instead of:
> >>
> >> complete(&vtsk->exited);
> >> mutex_unlock(&vtsk->exit_mutex);
> >>
> >> it should have been:
> >>
> >> mutex_unlock(&vtsk->exit_mutex);
> >> complete(&vtsk->exited);
> > 
> > JFYI Edward did it [1]
> > 
> > [1] 
> > https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/
> 
> Thanks.
> 
> I tested the code with that change and it no longer triggers the UAF.

Weird, but syzkaller said that yes, it triggers.

Compare
dcc0ca06174e6...@google.com
which tests the order
mutex_unlock(&vtsk->exit_mutex);
complete(&vtsk->exited);
that you like and says it triggers

and
97bda90617521...@google.com
which says it does not trigger.

Whatever you do, please send it to syzbot in the original
thread, and then when you post, please include the syzbot report.

Given this gets confusing I'm fine with just a fixup patch,
and note in the commit log where I should squash it.


> I've fixed up the original patch that had the bug and am going to
> resubmit the patchset like how Michael requested.
> 




Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-04-30 Thread Mike Christie
On 4/30/24 7:15 PM, Hillf Danton wrote:
> On Tue, Apr 30, 2024 at 11:23:04AM -0500, Mike Christie wrote:
>> On 4/30/24 8:05 AM, Edward Adam Davis wrote:
>>>  static int vhost_task_fn(void *data)
>>>  {
>>> struct vhost_task *vtsk = data;
>>> @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
>>> schedule();
>>> }
>>>  
>>> -   mutex_lock(&vtsk->exit_mutex);
>>> +   mutex_lock(&exit_mutex);
>>> /*
>>>  * If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
>>>  * When the vhost layer has called vhost_task_stop it's already stopped
>>> @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
>>> vtsk->handle_sigkill(vtsk->data);
>>> }
>>> complete(&vtsk->exited);
>>> -   mutex_unlock(&vtsk->exit_mutex);
>>> +   mutex_unlock(&exit_mutex);
>>>  
>>
>> Edward, thanks for the patch. I think though I just needed to swap the
>> order of the calls above.
>>
>> Instead of:
>>
>> complete(&vtsk->exited);
>> mutex_unlock(&vtsk->exit_mutex);
>>
>> it should have been:
>>
>> mutex_unlock(&vtsk->exit_mutex);
>> complete(&vtsk->exited);
> 
> JFYI Edward did it [1]
> 
> [1] 
> https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/

Thanks.

I tested the code with that change and it no longer triggers the UAF.

I've fixed up the original patch that had the bug and am going to
resubmit the patchset like how Michael requested.




Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-04-30 Thread Hillf Danton
On Tue, Apr 30, 2024 at 11:23:04AM -0500, Mike Christie wrote:
> On 4/30/24 8:05 AM, Edward Adam Davis wrote:
> >  static int vhost_task_fn(void *data)
> >  {
> > struct vhost_task *vtsk = data;
> > @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
> > schedule();
> > }
> >  
> > -   mutex_lock(&vtsk->exit_mutex);
> > +   mutex_lock(&exit_mutex);
> > /*
> >  * If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
> >  * When the vhost layer has called vhost_task_stop it's already stopped
> > @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
> > vtsk->handle_sigkill(vtsk->data);
> > }
> > complete(&vtsk->exited);
> > -   mutex_unlock(&vtsk->exit_mutex);
> > +   mutex_unlock(&exit_mutex);
> >  
> 
> Edward, thanks for the patch. I think though I just needed to swap the
> order of the calls above.
> 
> Instead of:
> 
> complete(&vtsk->exited);
> mutex_unlock(&vtsk->exit_mutex);
> 
> it should have been:
> 
> mutex_unlock(&vtsk->exit_mutex);
> complete(&vtsk->exited);

JFYI Edward did it [1]

[1] 
https://lore.kernel.org/lkml/tencent_546da49414e876eebecf2c78d26d242ee...@qq.com/
> 
> If my analysis is correct, then Michael do you want me to resubmit a
> patch on top of your vhost branch or resubmit the entire patchset?



Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-04-30 Thread Michael S. Tsirkin
On Tue, Apr 30, 2024 at 11:23:04AM -0500, Mike Christie wrote:
> On 4/30/24 8:05 AM, Edward Adam Davis wrote:
> >  static int vhost_task_fn(void *data)
> >  {
> > struct vhost_task *vtsk = data;
> > @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
> > schedule();
> > }
> >  
> > -   mutex_lock(&vtsk->exit_mutex);
> > +   mutex_lock(&exit_mutex);
> > /*
> >  * If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
> >  * When the vhost layer has called vhost_task_stop it's already stopped
> > @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
> > vtsk->handle_sigkill(vtsk->data);
> > }
> > complete(&vtsk->exited);
> > -   mutex_unlock(&vtsk->exit_mutex);
> > +   mutex_unlock(&exit_mutex);
> >  
> 
> Edward, thanks for the patch. I think though I just needed to swap the
> order of the calls above.
> 
> Instead of:
> 
> complete(&vtsk->exited);
> mutex_unlock(&vtsk->exit_mutex);
> 
> it should have been:
> 
> mutex_unlock(&vtsk->exit_mutex);
> complete(&vtsk->exited);
> 
> If my analysis is correct, then Michael do you want me to resubmit a
> patch on top of your vhost branch or resubmit the entire patchset?

Resubmit all please.

-- 
MST




Re: [PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-04-30 Thread Mike Christie
On 4/30/24 8:05 AM, Edward Adam Davis wrote:
>  static int vhost_task_fn(void *data)
>  {
>   struct vhost_task *vtsk = data;
> @@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
>   schedule();
>   }
>  
> - mutex_lock(&vtsk->exit_mutex);
> + mutex_lock(&exit_mutex);
>   /*
>    * If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
>    * When the vhost layer has called vhost_task_stop it's already stopped
> @@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
>   vtsk->handle_sigkill(vtsk->data);
>   }
>   complete(&vtsk->exited);
> - mutex_unlock(&vtsk->exit_mutex);
> + mutex_unlock(&exit_mutex);
>  

Edward, thanks for the patch. I think though I just needed to swap the
order of the calls above.

Instead of:

complete(&vtsk->exited);
mutex_unlock(&vtsk->exit_mutex);

it should have been:

mutex_unlock(&vtsk->exit_mutex);
complete(&vtsk->exited);

If my analysis is correct, then Michael do you want me to resubmit a
patch on top of your vhost branch or resubmit the entire patchset?




[PATCH next] vhost_task: after freeing vhost_task it should not be accessed in vhost_task_fn

2024-04-30 Thread Edward Adam Davis
[syzbot reported]
BUG: KASAN: slab-use-after-free in instrument_atomic_read include/linux/instrumented.h:68 [inline]
BUG: KASAN: slab-use-after-free in atomic_long_read include/linux/atomic/atomic-instrumented.h:3188 [inline]
BUG: KASAN: slab-use-after-free in __mutex_unlock_slowpath+0xef/0x750 kernel/locking/mutex.c:921
Read of size 8 at addr 888023632880 by task vhost-5104/5105

CPU: 0 PID: 5105 Comm: vhost-5104 Not tainted 6.9.0-rc5-next-20240426-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/27/2024
Call Trace:
 <TASK>
 __dump_stack lib/dump_stack.c:88 [inline]
 dump_stack_lvl+0x241/0x360 lib/dump_stack.c:114
 print_address_description mm/kasan/report.c:377 [inline]
 print_report+0x169/0x550 mm/kasan/report.c:488
 kasan_report+0x143/0x180 mm/kasan/report.c:601
 kasan_check_range+0x282/0x290 mm/kasan/generic.c:189
 instrument_atomic_read include/linux/instrumented.h:68 [inline]
 atomic_long_read include/linux/atomic/atomic-instrumented.h:3188 [inline]
 __mutex_unlock_slowpath+0xef/0x750 kernel/locking/mutex.c:921
 vhost_task_fn+0x3bc/0x3f0 kernel/vhost_task.c:65
 ret_from_fork+0x4b/0x80 arch/x86/kernel/process.c:147
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244
 </TASK>

Allocated by task 5104:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
 poison_kmalloc_redzone mm/kasan/common.c:370 [inline]
 __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:387
 kasan_kmalloc include/linux/kasan.h:211 [inline]
 kmalloc_trace_noprof+0x19c/0x2b0 mm/slub.c:4146
 kmalloc_noprof include/linux/slab.h:660 [inline]
 kzalloc_noprof include/linux/slab.h:778 [inline]
 vhost_task_create+0x149/0x300 kernel/vhost_task.c:134
 vhost_worker_create+0x17b/0x3f0 drivers/vhost/vhost.c:667
 vhost_dev_set_owner+0x563/0x940 drivers/vhost/vhost.c:945
 vhost_dev_ioctl+0xda/0xda0 drivers/vhost/vhost.c:2108
 vhost_vsock_dev_ioctl+0x2bb/0xfa0 drivers/vhost/vsock.c:875
 vfs_ioctl fs/ioctl.c:51 [inline]
 __do_sys_ioctl fs/ioctl.c:907 [inline]
 __se_sys_ioctl+0xfc/0x170 fs/ioctl.c:893
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Freed by task 5104:
 kasan_save_stack mm/kasan/common.c:47 [inline]
 kasan_save_track+0x3f/0x80 mm/kasan/common.c:68
 kasan_save_free_info+0x40/0x50 mm/kasan/generic.c:579
 poison_slab_object+0xe0/0x150 mm/kasan/common.c:240
 __kasan_slab_free+0x37/0x60 mm/kasan/common.c:256
 kasan_slab_free include/linux/kasan.h:184 [inline]
 slab_free_hook mm/slub.c:2190 [inline]
 slab_free mm/slub.c:4430 [inline]
 kfree+0x149/0x350 mm/slub.c:4551
 vhost_worker_destroy drivers/vhost/vhost.c:629 [inline]
 vhost_workers_free drivers/vhost/vhost.c:648 [inline]
 vhost_dev_cleanup+0x9b0/0xba0 drivers/vhost/vhost.c:1051
 vhost_vsock_dev_release+0x3aa/0x410 drivers/vhost/vsock.c:751
 __fput+0x406/0x8b0 fs/file_table.c:422
 __do_sys_close fs/open.c:1555 [inline]
 __se_sys_close fs/open.c:1540 [inline]
 __x64_sys_close+0x7f/0x110 fs/open.c:1540
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xf5/0x240 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f
[Fix]
Remove the exit_mutex member from struct vhost_task and replace it with a
global static mutex.

Fixes: a3df30984f4f ("vhost_task: Handle SIGKILL by flushing work and exiting")
Reported-by: syzbot+98edc2df894917b34...@syzkaller.appspotmail.com
Signed-off-by: Edward Adam Davis 
---
 kernel/vhost_task.c | 13 ++---
 1 file changed, 6 insertions(+), 7 deletions(-)

diff --git a/kernel/vhost_task.c b/kernel/vhost_task.c
index 48c289947b99..375356499867 100644
--- a/kernel/vhost_task.c
+++ b/kernel/vhost_task.c
@@ -20,10 +20,10 @@ struct vhost_task {
struct completion exited;
unsigned long flags;
struct task_struct *task;
-   /* serialize SIGKILL and vhost_task_stop calls */
-   struct mutex exit_mutex;
 };
 
+static DEFINE_MUTEX(exit_mutex); //serialize SIGKILL and vhost_task_stop calls
+
 static int vhost_task_fn(void *data)
 {
struct vhost_task *vtsk = data;
@@ -51,7 +51,7 @@ static int vhost_task_fn(void *data)
schedule();
}
 
-   mutex_lock(&vtsk->exit_mutex);
+   mutex_lock(&exit_mutex);
/*
 * If a vhost_task_stop and SIGKILL race, we can ignore the SIGKILL.
 * When the vhost layer has called vhost_task_stop it's already stopped
@@ -62,7 +62,7 @@ static int vhost_task_fn(void *data)
vtsk->handle_sigkill(vtsk->data);
}
complete(&vtsk->exited);
-   mutex_unlock(&vtsk->exit_mutex);
+   mutex_unlock(&exit_mutex);
 
do_exit(0);
 }
@@ -88,12 +88,12 @@ EXPORT_SYMBOL_GPL(vhost_task_wake);
  */
 void vhost_task_stop(struct vhost_task *vtsk)
 {
-   mutex_lock(&vtsk->exit_mutex);
+   mutex_lock(&exit_mutex);
if (!test_bit(VHOST_TASK_FLAGS_KILLED, &vtsk->flags)) {
set_bit(VHOST_TASK_FLAGS_STOP, &vtsk->flags);