Re: [PATCH virt] virt: fix uninit-value in vhost_vsock_dev_open

2024-04-22 Thread Stefan Hajnoczi
On Sun, Apr 21, 2024 at 12:06:06PM +0900, Jeongjun Park wrote:
> static bool vhost_transport_seqpacket_allow(u32 remote_cid)
> {
> 	struct vhost_vsock *vsock;
> 	bool seqpacket_allow = false;
> 
> 	rcu_read_lock();
> 	vsock = vhost_vsock_get(remote_cid);
> 
> 	if (vsock)
> 		seqpacket_allow = vsock->seqpacket_allow;
> 
> 	rcu_read_unlock();
> 
> 	return seqpacket_allow;
> }
> 
> I think this is due to reading the uninitialized vsock->seqpacket_allow
> field inside vhost_transport_seqpacket_allow(), which is invoked through
> a function pointer in an if statement.

CCing Arseny, author of commit ced7b713711f ("vhost/vsock: support
SEQPACKET for transport").

Looks like a genuine bug in the commit. vhost_vsock_set_features() sets
seqpacket_allow to true when the feature is negotiated. The assumption
is that the field defaults to false.

The rest of the vhost_vsock.ko code is written to initialize the
vhost_vsock fields, so you could argue seqpacket_allow should just be
explicitly initialized to false.

However, eliminating this class of errors by zeroing seems reasonable in
this code path. vhost_vsock_dev_open() is not performance-critical.
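
For illustration, a minimal sketch of the two options (assuming
vhost_vsock_dev_open() currently allocates the struct with kvmalloc()):

  /* Option 1: zero the whole struct at allocation time. This eliminates
   * the entire class of uninitialized-field bugs and is fine here
   * because vhost_vsock_dev_open() is not performance-critical. */
  vsock = kvzalloc(sizeof(*vsock), GFP_KERNEL);

  /* Option 2: keep kvmalloc() and explicitly initialize the one field
   * that vhost_vsock_set_features() assumes defaults to false. */
  vsock->seqpacket_allow = false;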

Acked-by: Stefan Hajnoczi 




Re: [syzbot] [virtualization?] upstream boot error: WARNING: refcount bug in __free_pages_ok

2024-03-21 Thread Stefan Hajnoczi
On Thu, Mar 21, 2024 at 08:52:03AM -0700, syzbot wrote:
> Hello,
> 
> syzbot tried to test the proposed patch but the build/boot failed:
> 
> usbcore: registered new interface driver viperboard
> [7.297712][T1] usbcore: registered new interface driver dln2
> [7.299149][T1] usbcore: registered new interface driver pn533_usb
> [7.304759][  T924] kworker/u4:1 (924) used greatest stack depth: 22768 
> bytes left
> [7.308971][T1] nfcsim 0.2 initialized
> [7.310068][T1] usbcore: registered new interface driver port100
> [7.311312][T1] usbcore: registered new interface driver nfcmrvl
> [7.318405][T1] Loading iSCSI transport class v2.0-870.
> [7.334687][T1] virtio_scsi virtio0: 1/0/0 default/read/poll queues
> [7.344927][T1] [ cut here ]
> [7.345739][T1] refcount_t: decrement hit 0; leaking memory.

This confirms that the following commit introduced this issue:

  commit 217b2119b9e260609958db413876f211038f00ee
  Author: Oscar Salvador 
  Date:   Thu Feb 15 22:59:04 2024 +0100

  mm,page_owner: implement the tracking of the stacks count

Mike: thanks for pointing out the fix that Oscar is working on!

Oscar: Please add the syzbot trailer to the next revision of your
"[PATCH v2 0/2] page_owner: Refcount fixups" series so this issue can be
closed.

> [7.346982][T1] WARNING: CPU: 0 PID: 1 at lib/refcount.c:31 
> refcount_warn_saturate+0xfa/0x1d0
> [7.348761][T1] Modules linked in:
> [7.349418][T1] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> 6.8.0-rc5-syzkaller-00257-g217b2119b9e2 #0
> [7.351070][T1] Hardware name: Google Google Compute Engine/Google 
> Compute Engine, BIOS Google 02/29/2024
> [7.352824][T1] RIP: 0010:refcount_warn_saturate+0xfa/0x1d0
> [7.353979][T1] Code: b2 00 00 00 e8 97 2d fc fc 5b 5d c3 cc cc cc cc 
> e8 8b 2d fc fc c6 05 0d d9 d6 0a 01 90 48 c7 c7 a0 46 fd 8b e8 e7 2c c0 fc 90 
> <0f> 0b 90 90 eb d9 e8 6b 2d fc fc c6 05 ea d8 d6 0a 01 90 48 c7 c7
> [7.358181][T1] RSP: :c9066e10 EFLAGS: 00010246
> [7.360206][T1] RAX: 67b097fa09053300 RBX: 88814073377c RCX: 
> 8880166c
> [7.362234][T1] RDX:  RSI:  RDI: 
> 
> [7.363496][T1] RBP: 0004 R08: 81589d62 R09: 
> 1920cd14
> [7.365139][T1] R10: dc00 R11: f520cd15 R12: 
> ea000501edc0
> [7.366612][T1] R13: ea000501edc8 R14: 1d4000a03db9 R15: 
> 
> [7.368171][T1] FS:  () GS:8880b940() 
> knlGS:
> [7.370111][T1] CS:  0010 DS:  ES:  CR0: 80050033
> [7.371030][T1] CR2: 88823000 CR3: 0df34000 CR4: 
> 003506f0
> [7.372121][T1] DR0:  DR1:  DR2: 
> 
> [7.373506][T1] DR3:  DR6: fffe0ff0 DR7: 
> 0400
> [7.374889][T1] Call Trace:
> [7.375371][T1]  
> [7.375798][T1]  ? __warn+0x162/0x4b0
> [7.376442][T1]  ? refcount_warn_saturate+0xfa/0x1d0
> [7.377482][T1]  ? report_bug+0x2b3/0x500
> [7.378161][T1]  ? refcount_warn_saturate+0xfa/0x1d0
> [7.379268][T1]  ? handle_bug+0x3e/0x70
> [7.379887][T1]  ? exc_invalid_op+0x1a/0x50
> [7.380563][T1]  ? asm_exc_invalid_op+0x1a/0x20
> [7.381253][T1]  ? __warn_printk+0x292/0x360
> [7.381912][T1]  ? refcount_warn_saturate+0xfa/0x1d0
> [7.382752][T1]  __free_pages_ok+0xc36/0xd60
> [7.384180][T1]  make_alloc_exact+0xc4/0x140
> [7.385037][T1]  vring_alloc_queue_split+0x20a/0x600
> [7.386037][T1]  ? __pfx_vring_alloc_queue_split+0x10/0x10
> [7.387029][T1]  ? vp_find_vqs+0x4c/0x4e0
> [7.387719][T1]  ? virtscsi_probe+0x3ea/0xf60
> [7.388408][T1]  ? virtio_dev_probe+0x991/0xaf0
> [7.389665][T1]  ? really_probe+0x29e/0xc50
> [7.390429][T1]  ? driver_probe_device+0x50/0x430
> [7.391176][T1]  vring_create_virtqueue_split+0xc6/0x310
> [7.392014][T1]  ? ret_from_fork+0x4b/0x80
> [7.392800][T1]  ? __pfx_vring_create_virtqueue_split+0x10/0x10
> [7.394115][T1]  vring_create_virtqueue+0xca/0x110
> [7.395151][T1]  ? __pfx_vp_notify+0x10/0x10
> [7.395888][T1]  ? __pfx_virtscsi_ctrl_done+0x10/0x10
> [7.396674][T1]  setup_vq+0xe9/0x2d0
> [7.397283][T1]  ? __pfx_vp_notify+0x10/0x10
> [7.397938][T1]  ? __pfx_virtscsi_ctrl_done+0x10/0x10
> [7.398806][T1]  ? __pfx_virtscsi_ctrl_done+0x10/0x10
> [7.399938][T1]  ? __pfx_virtscsi_ctrl_done+0x10/0x10
> [7.400951][T1]  vp_setup_vq+0xbf/0x330
> [7.401889][T1]  ? __pfx_vp_config_changed+0x10/0x10
> [7.403092][T1]  ? ioread16+0x2f/0x90
> [7.403909][T1]  ? __pfx_virtscsi_ctrl_done+0x10/0x10
> [ 

Re: [syzbot] [virtualization?] upstream boot error: WARNING: refcount bug in __free_pages_ok

2024-03-21 Thread Stefan Hajnoczi
On Wed, Mar 20, 2024 at 01:08:02PM -0700, syzbot wrote:
> Hello,
> 
> syzbot has tested the proposed patch and the reproducer did not trigger any 
> issue:
> 
> Reported-and-tested-by: syzbot+70f57d8a3ae84934c...@syzkaller.appspotmail.com
> 
> Tested on:
> 
> commit: 4bedfb31 mm,page_owner: maintain own list of stack_rec..
> git tree:   
> git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
> kernel config:  https://syzkaller.appspot.com/x/.config?x=527195e149aa3091
> dashboard link: https://syzkaller.appspot.com/bug?extid=70f57d8a3ae84934c003
> compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for Debian) 
> 2.40
> 
> Note: no patches were applied.
> Note: testing is done by a robot and is best-effort only.
> 

Good, that was the expected last working commit. Here is the next commit
(it should fail):

#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
217b2119b9e260609958db413876f211038f00ee




Re: [syzbot] [virtualization?] upstream boot error: WARNING: refcount bug in __free_pages_ok

2024-03-20 Thread Stefan Hajnoczi
On Tue, Mar 19, 2024 at 03:51:18PM -0500, Mike Christie wrote:
> On 3/19/24 12:19 PM, Stefan Hajnoczi wrote:
> > On Tue, Mar 19, 2024 at 03:40:53AM -0400, Michael S. Tsirkin wrote:
> >> On Tue, Mar 19, 2024 at 12:32:26AM -0700, syzbot wrote:
> >>> Hello,
> >>>
> >>> syzbot found the following issue on:
> >>>
> >>> HEAD commit:b3603fcb79b1 Merge tag 'dlm-6.9' of 
> >>> git://git.kernel.org/p..
> >>> git tree:   upstream
> >>> console output: https://syzkaller.appspot.com/x/log.txt?x=10f04c8118
> >>> kernel config:  https://syzkaller.appspot.com/x/.config?x=fcb5bfbee0a42b54
> >>> dashboard link: 
> >>> https://syzkaller.appspot.com/bug?extid=70f57d8a3ae84934c003
> >>> compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for 
> >>> Debian) 2.40
> >>>
> >>> Downloadable assets:
> >>> disk image: 
> >>> https://storage.googleapis.com/syzbot-assets/43969dffd4a6/disk-b3603fcb.raw.xz
> >>> vmlinux: 
> >>> https://storage.googleapis.com/syzbot-assets/ef48ab3b378b/vmlinux-b3603fcb.xz
> >>> kernel image: 
> >>> https://storage.googleapis.com/syzbot-assets/728f5ff2b6fe/bzImage-b3603fcb.xz
> >>>
> >>> IMPORTANT: if you fix the issue, please add the following tag to the 
> >>> commit:
> >>> Reported-by: syzbot+70f57d8a3ae84934c...@syzkaller.appspotmail.com
> >>>
> >>> Key type pkcs7_test registered
> >>> Block layer SCSI generic (bsg) driver version 0.4 loaded (major 239)
> >>> io scheduler mq-deadline registered
> >>> io scheduler kyber registered
> >>> io scheduler bfq registered
> >>> input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> >>> ACPI: button: Power Button [PWRF]
> >>> input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
> >>> ACPI: button: Sleep Button [SLPF]
> >>> ioatdma: Intel(R) QuickData Technology Driver 5.00
> >>> ACPI: \_SB_.LNKC: Enabled at IRQ 11
> >>> virtio-pci :00:03.0: virtio_pci: leaving for legacy driver
> >>> ACPI: \_SB_.LNKD: Enabled at IRQ 10
> >>> virtio-pci :00:04.0: virtio_pci: leaving for legacy driver
> >>> ACPI: \_SB_.LNKB: Enabled at IRQ 10
> >>> virtio-pci :00:06.0: virtio_pci: leaving for legacy driver
> >>> virtio-pci :00:07.0: virtio_pci: leaving for legacy driver
> >>> N_HDLC line discipline registered with maxframe=4096
> >>> Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> >>> 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> >>> 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
> >>> 00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud = 115200) is a 16550A
> >>> 00:06: ttyS3 at I/O 0x2e8 (irq = 7, base_baud = 115200) is a 16550A
> >>> Non-volatile memory driver v1.3
> >>> Linux agpgart interface v0.103
> >>> ACPI: bus type drm_connector registered
> >>> [drm] Initialized vgem 1.0.0 20120112 for vgem on minor 0
> >>> [drm] Initialized vkms 1.0.0 20180514 for vkms on minor 1
> >>> Console: switching to colour frame buffer device 128x48
> >>> platform vkms: [drm] fb0: vkmsdrmfb frame buffer device
> >>> usbcore: registered new interface driver udl
> >>> brd: module loaded
> >>> loop: module loaded
> >>> zram: Added device: zram0
> >>> null_blk: disk nullb0 created
> >>> null_blk: module loaded
> >>> Guest personality initialized and is inactive
> >>> VMCI host device registered (name=vmci, major=10, minor=118)
> >>> Initialized host personality
> >>> usbcore: registered new interface driver rtsx_usb
> >>> usbcore: registered new interface driver viperboard
> >>> usbcore: registered new interface driver dln2
> >>> usbcore: registered new interface driver pn533_usb
> >>> nfcsim 0.2 initialized
> >>> usbcore: registered new interface driver port100
> >>> usbcore: registered new interface driver nfcmrvl
> >>> Loading iSCSI transport class v2.0-870.
> >>> virtio_scsi virtio0: 1/0/0 default/read/poll queues
> >>> [ cut here ]
> >>> refcount_t: decrement hit 0; leaking memory.
> >>> WARNING: CPU: 0 PID: 1 at lib/refcount.c:31 
> >>> refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
> >>> Module

Re: [syzbot] [virtualization?] upstream boot error: WARNING: refcount bug in __free_pages_ok

2024-03-19 Thread Stefan Hajnoczi
#syz test: git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 
52998cdd8d3438df9a77c858a827b8932da1bb28

This is the most recent commit that touched virtio_scsi.c. If the test
passes, the issue is probably in another subsystem and we can bisect
more recent commits. If it fails, then older virtio_scsi.c commits need
to be bisected.

Stefan




Re: [syzbot] [virtualization?] upstream boot error: WARNING: refcount bug in __free_pages_ok

2024-03-19 Thread Stefan Hajnoczi
On Tue, Mar 19, 2024 at 03:40:53AM -0400, Michael S. Tsirkin wrote:
> On Tue, Mar 19, 2024 at 12:32:26AM -0700, syzbot wrote:
> > Hello,
> > 
> > syzbot found the following issue on:
> > 
> > HEAD commit:b3603fcb79b1 Merge tag 'dlm-6.9' of git://git.kernel.org/p..
> > git tree:   upstream
> > console output: https://syzkaller.appspot.com/x/log.txt?x=10f04c8118
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=fcb5bfbee0a42b54
> > dashboard link: https://syzkaller.appspot.com/bug?extid=70f57d8a3ae84934c003
> > compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for 
> > Debian) 2.40
> > 
> > Downloadable assets:
> > disk image: 
> > https://storage.googleapis.com/syzbot-assets/43969dffd4a6/disk-b3603fcb.raw.xz
> > vmlinux: 
> > https://storage.googleapis.com/syzbot-assets/ef48ab3b378b/vmlinux-b3603fcb.xz
> > kernel image: 
> > https://storage.googleapis.com/syzbot-assets/728f5ff2b6fe/bzImage-b3603fcb.xz
> > 
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+70f57d8a3ae84934c...@syzkaller.appspotmail.com
> > 
> > Key type pkcs7_test registered
> > Block layer SCSI generic (bsg) driver version 0.4 loaded (major 239)
> > io scheduler mq-deadline registered
> > io scheduler kyber registered
> > io scheduler bfq registered
> > input: Power Button as /devices/LNXSYSTM:00/LNXPWRBN:00/input/input0
> > ACPI: button: Power Button [PWRF]
> > input: Sleep Button as /devices/LNXSYSTM:00/LNXSLPBN:00/input/input1
> > ACPI: button: Sleep Button [SLPF]
> > ioatdma: Intel(R) QuickData Technology Driver 5.00
> > ACPI: \_SB_.LNKC: Enabled at IRQ 11
> > virtio-pci :00:03.0: virtio_pci: leaving for legacy driver
> > ACPI: \_SB_.LNKD: Enabled at IRQ 10
> > virtio-pci :00:04.0: virtio_pci: leaving for legacy driver
> > ACPI: \_SB_.LNKB: Enabled at IRQ 10
> > virtio-pci :00:06.0: virtio_pci: leaving for legacy driver
> > virtio-pci :00:07.0: virtio_pci: leaving for legacy driver
> > N_HDLC line discipline registered with maxframe=4096
> > Serial: 8250/16550 driver, 4 ports, IRQ sharing enabled
> > 00:03: ttyS0 at I/O 0x3f8 (irq = 4, base_baud = 115200) is a 16550A
> > 00:04: ttyS1 at I/O 0x2f8 (irq = 3, base_baud = 115200) is a 16550A
> > 00:05: ttyS2 at I/O 0x3e8 (irq = 6, base_baud = 115200) is a 16550A
> > 00:06: ttyS3 at I/O 0x2e8 (irq = 7, base_baud = 115200) is a 16550A
> > Non-volatile memory driver v1.3
> > Linux agpgart interface v0.103
> > ACPI: bus type drm_connector registered
> > [drm] Initialized vgem 1.0.0 20120112 for vgem on minor 0
> > [drm] Initialized vkms 1.0.0 20180514 for vkms on minor 1
> > Console: switching to colour frame buffer device 128x48
> > platform vkms: [drm] fb0: vkmsdrmfb frame buffer device
> > usbcore: registered new interface driver udl
> > brd: module loaded
> > loop: module loaded
> > zram: Added device: zram0
> > null_blk: disk nullb0 created
> > null_blk: module loaded
> > Guest personality initialized and is inactive
> > VMCI host device registered (name=vmci, major=10, minor=118)
> > Initialized host personality
> > usbcore: registered new interface driver rtsx_usb
> > usbcore: registered new interface driver viperboard
> > usbcore: registered new interface driver dln2
> > usbcore: registered new interface driver pn533_usb
> > nfcsim 0.2 initialized
> > usbcore: registered new interface driver port100
> > usbcore: registered new interface driver nfcmrvl
> > Loading iSCSI transport class v2.0-870.
> > virtio_scsi virtio0: 1/0/0 default/read/poll queues
> > [ cut here ]
> > refcount_t: decrement hit 0; leaking memory.
> > WARNING: CPU: 0 PID: 1 at lib/refcount.c:31 
> > refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
> > Modules linked in:
> > CPU: 0 PID: 1 Comm: swapper/0 Not tainted 
> > 6.8.0-syzkaller-11567-gb3603fcb79b1 #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > Google 02/29/2024
> > RIP: 0010:refcount_warn_saturate+0xfa/0x1d0 lib/refcount.c:31
> > Code: b2 00 00 00 e8 57 d4 f2 fc 5b 5d c3 cc cc cc cc e8 4b d4 f2 fc c6 05 
> > 0c f9 ef 0a 01 90 48 c7 c7 a0 5d 1e 8c e8 b7 75 b5 fc 90 <0f> 0b 90 90 eb 
> > d9 e8 2b d4 f2 fc c6 05 e9 f8 ef 0a 01 90 48 c7 c7
> > RSP: :c9066e18 EFLAGS: 00010246
> > RAX: 76f86e452fcad900 RBX: 8880210d2aec RCX: 888016ac8000
> > RDX:  RSI:  RDI: 
> > RBP: 0004 R08: 8157ffe2 R09: fbfff1c396e0
> > R10: dc00 R11: fbfff1c396e0 R12: ea000502cdc0
> > R13: ea000502cdc8 R14: 1d4000a059b9 R15: 
> > FS:  () GS:8880b940() knlGS:
> > CS:  0010 DS:  ES:  CR0: 80050033
> > CR2: 88823000 CR3: 0e132000 CR4: 003506f0
> > DR0:  DR1:  DR2: 
> > DR3:  DR6: fffe0ff0 DR7: 0400
> > Call Trace:
> >  
> > 

Re: [PATCH] virtiofs: don't mark virtio_fs_sysfs_exit as __exit

2024-02-28 Thread Stefan Hajnoczi
On Wed, 28 Feb 2024 at 16:47, Arnd Bergmann  wrote:
>
> From: Arnd Bergmann 
>
> Calling an __exit function from an __init function is not allowed
> and will result in undefined behavior when the code is built-in:
>
> WARNING: modpost: vmlinux: section mismatch in reference: virtio_fs_init+0x50 
> (section: .init.text) -> virtio_fs_sysfs_exit (section: .exit.text)
>
> Remove the incorrect annotation.
>
> Fixes: a8f62f50b4e4 ("virtiofs: export filesystem tags through sysfs")
> Signed-off-by: Arnd Bergmann 
> ---
>  fs/fuse/virtio_fs.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Thanks, Arnd. Please see the duplicate patch that Miklos applied:
https://lore.kernel.org/linux-fsdevel/cajfpegsjcz-dnzyft3b5gbgcntmdr6r1n8pm5yclmw9fjy1...@mail.gmail.com/T/#t

Stefan

>
> diff --git a/fs/fuse/virtio_fs.c b/fs/fuse/virtio_fs.c
> index 3a7dd48b534f..36d87dd3cb48 100644
> --- a/fs/fuse/virtio_fs.c
> +++ b/fs/fuse/virtio_fs.c
> @@ -1595,7 +1595,7 @@ static int __init virtio_fs_sysfs_init(void)
> return 0;
>  }
>
> -static void __exit virtio_fs_sysfs_exit(void)
> +static void virtio_fs_sysfs_exit(void)
>  {
> kset_unregister(virtio_fs_kset);
> virtio_fs_kset = NULL;
> --
> 2.39.2
>
>



Re: [PATCH v3] vduse: Fix off by one in vduse_dev_mmap()

2024-02-28 Thread Stefan Hajnoczi
On Wed, 28 Feb 2024 at 13:24, Dan Carpenter  wrote:
>
> The dev->vqs[] array has "dev->vq_num" elements.  It's allocated in
> vduse_dev_init_vqs().  Thus, this > comparison needs to be >= to avoid
> reading one element beyond the end of the array.
>
> Add an array_index_nospec() as well to prevent speculation issues.
>
> Fixes: 316ecd1346b0 ("vduse: Add file operation for mmap")
> Signed-off-by: Dan Carpenter 
> ---
> v2: add array_index_nospec()
> v3: I accidentally corrupted v2.  Try again.
>
>  drivers/vdpa/vdpa_user/vduse_dev.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi 

> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> b/drivers/vdpa/vdpa_user/vduse_dev.c
> index b7a1fb88c506..eb914084c650 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -1532,9 +1532,10 @@ static int vduse_dev_mmap(struct file *file, struct 
> vm_area_struct *vma)
> if ((vma->vm_flags & VM_SHARED) == 0)
> return -EINVAL;
>
> -   if (index > dev->vq_num)
> +   if (index >= dev->vq_num)
> return -EINVAL;
>
> +   index = array_index_nospec(index, dev->vq_num);
> vq = dev->vqs[index];
> vaddr = vq->vdpa_reconnect_vaddr;
> if (vaddr == 0)
> --
> 2.43.0
>
>



Re: [PATCH v2] vduse: Fix off by one in vduse_dev_mmap()

2024-02-28 Thread Stefan Hajnoczi
On Wed, 28 Feb 2024 at 12:44, Dan Carpenter  wrote:
>
> The dev->vqs[] array has "dev->vq_num" elements.  It's allocated in
> vduse_dev_init_vqs().  Thus, this > comparison needs to be >= to avoid
> reading one element beyond the end of the array.
>
> Add an array_index_nospec() as well to prevent speculation issues.
>
> Fixes: 316ecd1346b0 ("vduse: Add file operation for mmap")
> Signed-off-by: Dan Carpenter 
> ---
> v2: add array_index_nospec().

Did you forget to update the patch? I don't see array_index_nospec().

>
>  drivers/vdpa/vdpa_user/vduse_dev.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c 
> b/drivers/vdpa/vdpa_user/vduse_dev.c
> index b7a1fb88c506..eb914084c650 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -1532,9 +1532,10 @@ static int vduse_dev_mmap(struct file *file, struct 
> vm_area_struct *vma)
> if ((vma->vm_flags & VM_SHARED) == 0)
> return -EINVAL;
>
> -   if (index > dev->vq_num)
> +   if (index >= dev->vq_num)
> return -EINVAL;
>
> vq = dev->vqs[index];
> vaddr = vq->vdpa_reconnect_vaddr;
> if (vaddr == 0)
> --
> 2.43.0
>
>



Re: [syzbot] [virtualization?] KMSAN: uninit-value in virtqueue_add (4)

2024-01-24 Thread Stefan Hajnoczi
On Wed, Jan 24, 2024 at 11:47:32AM +0100, Alexander Potapenko wrote:
> On Thu, Jan 4, 2024 at 9:45 PM Stefan Hajnoczi  wrote:
> >
> > On Tue, Jan 02, 2024 at 08:03:46AM -0500, Michael S. Tsirkin wrote:
> > > On Mon, Jan 01, 2024 at 05:38:24AM -0800, syzbot wrote:
> > > > Hello,
> > > >
> > > > syzbot found the following issue on:
> > > >
> > > > HEAD commit:fbafc3e621c3 Merge tag 'for_linus' of 
> > > > git://git.kernel.org..
> > > > git tree:   upstream
> > > > console+strace: https://syzkaller.appspot.com/x/log.txt?x=173df3e9e8
> > > > kernel config:  
> > > > https://syzkaller.appspot.com/x/.config?x=e0c7078a6b901aa3
> > > > dashboard link: 
> > > > https://syzkaller.appspot.com/bug?extid=d7521c1e3841ed075a42
> > > > compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for 
> > > > Debian) 2.40
> > > > syz repro:  
> > > > https://syzkaller.appspot.com/x/repro.syz?x=1300b4a1e8
> > > > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=130b0379e8
> > > >
> > > > Downloadable assets:
> > > > disk image: 
> > > > https://storage.googleapis.com/syzbot-assets/1520f7b6daa4/disk-fbafc3e6.raw.xz
> > > > vmlinux: 
> > > > https://storage.googleapis.com/syzbot-assets/8b490af009d5/vmlinux-fbafc3e6.xz
> > > > kernel image: 
> > > > https://storage.googleapis.com/syzbot-assets/202ca200f4a4/bzImage-fbafc3e6.xz
> > > >
> > > > IMPORTANT: if you fix the issue, please add the following tag to the 
> > > > commit:
> > > > Reported-by: syzbot+d7521c1e3841ed075...@syzkaller.appspotmail.com
> > > >
> > > > =
> >
> > Hi Alexander,
> > Please take a look at this KMSAN failure. The uninitialized memory was
> > created for the purpose of writing a coredump. vring_map_one_sg() should
> > have direction=DMA_TO_DEVICE.
> >
> Hi Stefan,
> 
> I took a closer look, and am pretty confident this is a false positive.
> I tried adding memset(..., 0xab, PAGE_SIZE << order) to alloc_pages()
> and never saw the 0xab pattern in the buffers for which KMSAN reported
> an error.
> 
> This probably isn't an error in 88938359e2df ("virtio: kmsan:
> check/unpoison scatterlist in vring_map_one_sg()"), which by itself
> should be doing a sane thing: report an error if an uninitialized
> buffer is passed to it. It is more likely that we're missing some
> initialization that happens in coredump.c.
> 
> Does anyone have an idea where coredump.c is supposed to be
> initializing these pages?
> Maybe there are some inline assembly functions involved in copying the data?

Thanks for your time looking into this!

Stefan




Re: [syzbot] [virtualization?] KMSAN: uninit-value in virtqueue_add (4)

2024-01-04 Thread Stefan Hajnoczi
On Tue, Jan 02, 2024 at 08:03:46AM -0500, Michael S. Tsirkin wrote:
> On Mon, Jan 01, 2024 at 05:38:24AM -0800, syzbot wrote:
> > Hello,
> > 
> > syzbot found the following issue on:
> > 
> > HEAD commit:fbafc3e621c3 Merge tag 'for_linus' of git://git.kernel.org..
> > git tree:   upstream
> > console+strace: https://syzkaller.appspot.com/x/log.txt?x=173df3e9e8
> > kernel config:  https://syzkaller.appspot.com/x/.config?x=e0c7078a6b901aa3
> > dashboard link: https://syzkaller.appspot.com/bug?extid=d7521c1e3841ed075a42
> > compiler:   Debian clang version 15.0.6, GNU ld (GNU Binutils for 
> > Debian) 2.40
> > syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=1300b4a1e8
> > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=130b0379e8
> > 
> > Downloadable assets:
> > disk image: 
> > https://storage.googleapis.com/syzbot-assets/1520f7b6daa4/disk-fbafc3e6.raw.xz
> > vmlinux: 
> > https://storage.googleapis.com/syzbot-assets/8b490af009d5/vmlinux-fbafc3e6.xz
> > kernel image: 
> > https://storage.googleapis.com/syzbot-assets/202ca200f4a4/bzImage-fbafc3e6.xz
> > 
> > IMPORTANT: if you fix the issue, please add the following tag to the commit:
> > Reported-by: syzbot+d7521c1e3841ed075...@syzkaller.appspotmail.com
> > 
> > =

Hi Alexander,
Please take a look at this KMSAN failure. The uninitialized memory was
created for the purpose of writing a coredump. vring_map_one_sg() should
have direction=DMA_TO_DEVICE.

I can't easily tell whether this is a genuine bug or an issue with
commit 88938359e2df ("virtio: kmsan: check/unpoison scatterlist in
vring_map_one_sg()"). Maybe coredump.c is writing out pages that KMSAN
thinks are uninitialized?

Stefan

> > BUG: KMSAN: uninit-value in vring_map_one_sg 
> > drivers/virtio/virtio_ring.c:380 [inline]
> > BUG: KMSAN: uninit-value in virtqueue_add_split 
> > drivers/virtio/virtio_ring.c:614 [inline]
> > BUG: KMSAN: uninit-value in virtqueue_add+0x21c6/0x6530 
> > drivers/virtio/virtio_ring.c:2210
> >  vring_map_one_sg drivers/virtio/virtio_ring.c:380 [inline]
> >  virtqueue_add_split drivers/virtio/virtio_ring.c:614 [inline]
> >  virtqueue_add+0x21c6/0x6530 drivers/virtio/virtio_ring.c:2210
> >  virtqueue_add_sgs+0x186/0x1a0 drivers/virtio/virtio_ring.c:2244
> >  __virtscsi_add_cmd drivers/scsi/virtio_scsi.c:467 [inline]
> >  virtscsi_add_cmd+0x838/0xad0 drivers/scsi/virtio_scsi.c:501
> >  virtscsi_queuecommand+0x896/0xa60 drivers/scsi/virtio_scsi.c:598
> >  scsi_dispatch_cmd drivers/scsi/scsi_lib.c:1516 [inline]
> >  scsi_queue_rq+0x4874/0x5790 drivers/scsi/scsi_lib.c:1758
> >  blk_mq_dispatch_rq_list+0x13f8/0x3600 block/blk-mq.c:2049
> >  __blk_mq_do_dispatch_sched block/blk-mq-sched.c:170 [inline]
> >  blk_mq_do_dispatch_sched block/blk-mq-sched.c:184 [inline]
> >  __blk_mq_sched_dispatch_requests+0x10af/0x2500 block/blk-mq-sched.c:309
> >  blk_mq_sched_dispatch_requests+0x160/0x2d0 block/blk-mq-sched.c:333
> >  blk_mq_run_work_fn+0xd0/0x280 block/blk-mq.c:2434
> >  process_one_work kernel/workqueue.c:2627 [inline]
> >  process_scheduled_works+0x104e/0x1e70 kernel/workqueue.c:2700
> >  worker_thread+0xf45/0x1490 kernel/workqueue.c:2781
> >  kthread+0x3ed/0x540 kernel/kthread.c:388
> >  ret_from_fork+0x66/0x80 arch/x86/kernel/process.c:147
> >  ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
> > 
> > Uninit was created at:
> >  __alloc_pages+0x9a4/0xe00 mm/page_alloc.c:4591
> >  alloc_pages_mpol+0x62b/0x9d0 mm/mempolicy.c:2133
> >  alloc_pages mm/mempolicy.c:2204 [inline]
> >  folio_alloc+0x1da/0x380 mm/mempolicy.c:2211
> >  filemap_alloc_folio+0xa5/0x430 mm/filemap.c:974
> >  __filemap_get_folio+0xa5a/0x1760 mm/filemap.c:1918
> >  ext4_da_write_begin+0x7f8/0xec0 fs/ext4/inode.c:2891
> >  generic_perform_write+0x3f5/0xc40 mm/filemap.c:3918
> >  ext4_buffered_write_iter+0x564/0xaa0 fs/ext4/file.c:299
> >  ext4_file_write_iter+0x20f/0x3460
> >  __kernel_write_iter+0x329/0x930 fs/read_write.c:517
> >  dump_emit_page fs/coredump.c:888 [inline]
> >  dump_user_range+0x593/0xcd0 fs/coredump.c:915
> >  elf_core_dump+0x528d/0x5a40 fs/binfmt_elf.c:2077
> >  do_coredump+0x32c9/0x4920 fs/coredump.c:764
> >  get_signal+0x2185/0x2d10 kernel/signal.c:2890
> >  arch_do_signal_or_restart+0x53/0xca0 arch/x86/kernel/signal.c:309
> >  exit_to_user_mode_loop+0xe8/0x320 kernel/entry/common.c:168
> >  exit_to_user_mode_prepare+0x163/0x220 kernel/entry/common.c:204
> >  irqentry_exit_to_user_mode+0xd/0x30 kernel/entry/common.c:309
> >  irqentry_exit+0x16/0x40 kernel/entry/common.c:412
> >  exc_page_fault+0x246/0x6f0 arch/x86/mm/fault.c:1564
> >  asm_exc_page_fault+0x2b/0x30 arch/x86/include/asm/idtentry.h:570
> > 
> > Bytes 0-4095 of 4096 are uninitialized
> > Memory access of size 4096 starts at 88812c79c000
> > 
> > CPU: 0 PID: 997 Comm: kworker/0:1H Not tainted 
> > 6.7.0-rc7-syzkaller-3-gfbafc3e621c3 #0
> > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS 
> > 

Re: [PATCH] virtio_blk: Add support for lifetime feature

2021-04-14 Thread Stefan Hajnoczi
On Mon, Apr 12, 2021 at 10:42:17AM +0100, Christoph Hellwig wrote:
> A note to the virtio committee:  eMMC is the worst of all the currently
> active storage standards by a large margin.  It defines very strange
> ad-hoc interfaces that expose very specific internals and often provides
> very poor abstractions.  It would be great it you could reach out to the
> wider storage community before taking bad ideas from the eMMC standard
> and putting it into virtio.

As Michael mentioned, there is still time to change the virtio-blk spec
since this feature hasn't been released yet.

Why exactly is exposing eMMC-style lifetime information problematic?

Can you and Enrico discuss the use case to figure out an alternative
interface?

Thanks,
Stefan




Re: [PATCH] virtiofs: remove useless function

2021-04-14 Thread Stefan Hajnoczi
On Tue, Apr 13, 2021 at 05:22:23PM +0800, Jiapeng Chong wrote:
> Fix the following clang warning:
> 
> fs/fuse/virtio_fs.c:130:35: warning: unused function 'vq_to_fpq'
> [-Wunused-function].
> 
> Reported-by: Abaci Robot 
> Signed-off-by: Jiapeng Chong 
> ---
>  fs/fuse/virtio_fs.c | 5 -
>  1 file changed, 5 deletions(-)

The function was never used...

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH] virtio_blk: Add support for lifetime feature

2021-04-12 Thread Stefan Hajnoczi
On Tue, Mar 30, 2021 at 11:16:02PM +, Enrico Granata wrote:
> The VirtIO TC has adopted a new feature in virtio-blk enabling
> discovery of lifetime information.
> 
> This commit adds support for the VIRTIO_BLK_T_LIFETIME command
> to the virtio_blk driver, and adds two new attributes to the
> sysfs entry for virtio_blk:
> * pre_eol_info
> * life_time
> 
> which are defined in the same manner as the files of the same name
> for the eMMC driver, in line with the VirtIO specification.
> 
> Signed-off-by: Enrico Granata 
> ---
>  drivers/block/virtio_blk.c  | 76 -
>  include/uapi/linux/virtio_blk.h | 11 +
>  2 files changed, 86 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index b9fa3ef5b57c..1fc0ec000b4f 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -246,7 +246,7 @@ static blk_status_t virtio_queue_rq(struct blk_mq_hw_ctx 
> *hctx,
>   unmap = !(req->cmd_flags & REQ_NOUNMAP);
>   break;
>   case REQ_OP_DRV_IN:
> - type = VIRTIO_BLK_T_GET_ID;
> + type = vbr->out_hdr.type;

This patch changes the endianness of vbr->out_hdr.type from
virtio-endian to cpu endian before virtio_queue_rq. That is error-prone
because someone skimming through the code will see some accesses with
cpu_to_virtio32() and others without it. They would have to audit the
code carefully to understand what is going on.

The following is cleaner:

  case REQ_OP_DRV_IN:
          break; /* type already set for custom requests */
  ...
  if (req_op(req) != REQ_OP_DRV_IN)
          vbr->out_hdr.type = cpu_to_virtio32(vblk->vdev, type);

Now vbr->out_hdr.type is virtio-endian everywhere. If we need to support
REQ_OP_DRV_OUT in the future it can use the same approach.

virtblk_get_id() and virtblk_get_lifetime() would be updated like this:

  vbreq->out_hdr.type = cpu_to_virtio32(vblk->vdev, VIRTIO_BLK_T_GET_*);

>   break;
>   default:
>   WARN_ON_ONCE(1);
> @@ -310,11 +310,14 @@ static int virtblk_get_id(struct gendisk *disk, char 
> *id_str)
>   struct virtio_blk *vblk = disk->private_data;
>   struct request_queue *q = vblk->disk->queue;
>   struct request *req;
> + struct virtblk_req *vbreq;

It's called vbr elsewhere in the driver. It would be nice to keep naming
consistent.

>   int err;
>  
>   req = blk_get_request(q, REQ_OP_DRV_IN, 0);
>   if (IS_ERR(req))
>   return PTR_ERR(req);
> + vbreq = blk_mq_rq_to_pdu(req);
> + vbreq->out_hdr.type = VIRTIO_BLK_T_GET_ID;
>  
>   err = blk_rq_map_kern(q, req, id_str, VIRTIO_BLK_ID_BYTES, GFP_KERNEL);
>   if (err)
> @@ -327,6 +330,34 @@ static int virtblk_get_id(struct gendisk *disk, char 
> *id_str)
>   return err;
>  }
>  
> +static int virtblk_get_lifetime(struct gendisk *disk, struct 
> virtio_blk_lifetime *lifetime)
> +{
> + struct virtio_blk *vblk = disk->private_data;
> + struct request_queue *q = vblk->disk->queue;
> + struct request *req;
> + struct virtblk_req *vbreq;
> + int err;
> +
> + if (!virtio_has_feature(vblk->vdev, VIRTIO_BLK_F_LIFETIME))
> + return -EOPNOTSUPP;
> +
> + req = blk_get_request(q, REQ_OP_DRV_IN, 0);
> + if (IS_ERR(req))
> + return PTR_ERR(req);
> + vbreq = blk_mq_rq_to_pdu(req);
> + vbreq->out_hdr.type = VIRTIO_BLK_T_GET_LIFETIME;
> +
> + err = blk_rq_map_kern(q, req, lifetime, sizeof(*lifetime), GFP_KERNEL);
> + if (err)
> + goto out;
> +
> + blk_execute_rq(vblk->disk, req, false);
> + err = blk_status_to_errno(virtblk_result(blk_mq_rq_to_pdu(req)));
> +out:
> + blk_put_request(req);
> + return err;
> +}
> +
>  static void virtblk_get(struct virtio_blk *vblk)
>  {
>   refcount_inc(&vblk->refs);
> @@ -435,6 +466,46 @@ static ssize_t serial_show(struct device *dev,
>  
>  static DEVICE_ATTR_RO(serial);
>  
> +static ssize_t pre_eol_info_show(struct device *dev,
> +struct device_attribute *attr, char *buf)
> +{
> + struct gendisk *disk = dev_to_disk(dev);
> + struct virtio_blk_lifetime lft;
> + int err;
> +
> + /* sysfs gives us a PAGE_SIZE buffer */
> + BUILD_BUG_ON(sizeof(lft) >= PAGE_SIZE);

Why is this necessary? In serial_show() it protects against a buffer
overflow. That's not the case here since sprintf() is used to write to
buf and the size of lft doesn't really matter.
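
To make the contrast concrete, a hedged sketch (hypothetical bodies, not
the patch's actual code):

  /* serial_show(): raw ID bytes are copied directly into the PAGE_SIZE
   * sysfs buffer, so the source size must be provably smaller. */
  BUILD_BUG_ON(VIRTIO_BLK_ID_BYTES >= PAGE_SIZE);
  err = virtblk_get_id(disk, buf);

  /* pre_eol_info_show(): only a short formatted string reaches buf, so
   * the size of the local struct is irrelevant to buffer safety
   * (field name and endianness handling are illustrative only). */
  return sprintf(buf, "0x%x\n", le16_to_cpu(lft.pre_eol_info));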




Re: memory leak in virtio_transport_send_pkt_info

2021-03-31 Thread Stefan Hajnoczi
On Mon, Feb 08, 2021 at 08:39:30AM -0800, syzbot wrote:
> Hello,
> 
> syzbot found the following issue on:
> 
> HEAD commit:9f29bd8b Merge tag 'fs_for_v5.11-rc5' of git://git.kernel...
> git tree:   upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=11e435af50
> kernel config:  https://syzkaller.appspot.com/x/.config?x=162a0109d6ff731f
> dashboard link: https://syzkaller.appspot.com/bug?extid=24452624fc4c571eedd9
> compiler:   gcc (GCC) 10.1.0-syz 20200507
> syz repro:  https://syzkaller.appspot.com/x/repro.syz?x=135dd494d0
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=128787e750
> 
> IMPORTANT: if you fix the issue, please add the following tag to the commit:
> Reported-by: syzbot+24452624fc4c571ee...@syzkaller.appspotmail.com

Hi Stefano,
Looks like tx packets are still queued when the syzkaller leak check
runs. I don't see a fix for this in linux.git. Have you already looked
at this?

Stefan

> 
> executing program
> BUG: memory leak
> unreferenced object 0x88811477d380 (size 96):
>   comm "syz-executor196", pid 8793, jiffies 4294968272 (age 26.670s)
>   hex dump (first 32 bytes):
> 01 00 00 00 00 00 00 00 ff ff ff ff 00 00 00 00  
> ba 34 8c fe 00 00 00 00 00 00 01 00 01 00 05 00  .4..
>   backtrace:
> [<51681be3>] kmalloc include/linux/slab.h:552 [inline]
> [<51681be3>] kzalloc include/linux/slab.h:682 [inline]
> [<51681be3>] virtio_transport_alloc_pkt+0x41/0x290 
> net/vmw_vsock/virtio_transport_common.c:51
> [<36c2d6e6>] virtio_transport_send_pkt_info+0x105/0x1b0 
> net/vmw_vsock/virtio_transport_common.c:209
> [] virtio_transport_stream_enqueue+0x58/0x80 
> net/vmw_vsock/virtio_transport_common.c:674
> [] vsock_stream_sendmsg+0x4f7/0x590 
> net/vmw_vsock/af_vsock.c:1800
> [] sock_sendmsg_nosec net/socket.c:652 [inline]
> [] sock_sendmsg+0x56/0x80 net/socket.c:672
> [<803a63df>] sys_sendmsg+0x17a/0x390 net/socket.c:2345
> [<9d42f642>] ___sys_sendmsg+0x8b/0xd0 net/socket.c:2399
> [<0a37ed0e>] __sys_sendmmsg+0xed/0x290 net/socket.c:2489
> [<324c1c73>] __do_sys_sendmmsg net/socket.c:2518 [inline]
> [<324c1c73>] __se_sys_sendmmsg net/socket.c:2515 [inline]
> [<324c1c73>] __x64_sys_sendmmsg+0x24/0x30 net/socket.c:2515
> [<4e7b2960>] do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> [<2615f594>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> BUG: memory leak
> unreferenced object 0x88811477d280 (size 96):
>   comm "syz-executor196", pid 8793, jiffies 4294968272 (age 26.670s)
>   hex dump (first 32 bytes):
> 01 00 00 00 00 00 00 00 ff ff ff ff 00 00 00 00  
> ba 34 8c fe 00 00 00 00 00 00 01 00 01 00 05 00  .4..
>   backtrace:
> [<51681be3>] kmalloc include/linux/slab.h:552 [inline]
> [<51681be3>] kzalloc include/linux/slab.h:682 [inline]
> [<51681be3>] virtio_transport_alloc_pkt+0x41/0x290 
> net/vmw_vsock/virtio_transport_common.c:51
> [<36c2d6e6>] virtio_transport_send_pkt_info+0x105/0x1b0 
> net/vmw_vsock/virtio_transport_common.c:209
> [] virtio_transport_stream_enqueue+0x58/0x80 
> net/vmw_vsock/virtio_transport_common.c:674
> [] vsock_stream_sendmsg+0x4f7/0x590 
> net/vmw_vsock/af_vsock.c:1800
> [] sock_sendmsg_nosec net/socket.c:652 [inline]
> [] sock_sendmsg+0x56/0x80 net/socket.c:672
> [<803a63df>] sys_sendmsg+0x17a/0x390 net/socket.c:2345
> [<9d42f642>] ___sys_sendmsg+0x8b/0xd0 net/socket.c:2399
> [<0a37ed0e>] __sys_sendmmsg+0xed/0x290 net/socket.c:2489
> [<324c1c73>] __do_sys_sendmmsg net/socket.c:2518 [inline]
> [<324c1c73>] __se_sys_sendmmsg net/socket.c:2515 [inline]
> [<324c1c73>] __x64_sys_sendmmsg+0x24/0x30 net/socket.c:2515
> [<4e7b2960>] do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
> [<2615f594>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> BUG: memory leak
> unreferenced object 0x88811477d200 (size 96):
>   comm "syz-executor196", pid 8793, jiffies 4294968272 (age 26.670s)
>   hex dump (first 32 bytes):
> 01 00 00 00 00 00 00 00 ff ff ff ff 00 00 00 00  
> ba 34 8c fe 00 00 00 00 00 00 01 00 01 00 05 00  .4..
>   backtrace:
> [<51681be3>] kmalloc include/linux/slab.h:552 [inline]
> [<51681be3>] kzalloc include/linux/slab.h:682 [inline]
> [<51681be3>] virtio_transport_alloc_pkt+0x41/0x290 
> net/vmw_vsock/virtio_transport_common.c:51
> [<36c2d6e6>] virtio_transport_send_pkt_info+0x105/0x1b0 
> net/vmw_vsock/virtio_transport_common.c:209
> [] 

Re: [PATCH] virtio_blk: make virtio blks as non-rotational devices

2021-03-29 Thread Stefan Hajnoczi
On Fri, Mar 26, 2021 at 11:39:13AM +0800, Sochin Jiang wrote:
> This confuses some users, who see a rotational block device in the
> guest (/sys/block/vdX/queue/rotational). Let's mark virtio block
> devices as virtual block devices, just like the xen block devices.
> As we know, QUEUE_FLAG_VIRT is actually defined as QUEUE_FLAG_NONROT.
> See also the rbd and nbd block devices.
> 
> Signed-off-by: Sochin Jiang 
> ---
>  drivers/block/virtio_blk.c | 1 +
>  1 file changed, 1 insertion(+)

I would like to make this change because it seems consistent and often
the disk really is non-rotational. However, a justification and
performance results are needed especially since this has been reverted
previously. Please see commit f8b12e513b953aebf30f8ff7d2de9be7e024dbbe
("virtio_blk: revert QUEUE_FLAG_VIRT addition").

That was in 2009, so maybe things have changed now. Please explain what
has changed in the commit description.

This flag can affect performance. Please include performance data.

> diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> index b9fa3ef5b57c..31a978a4dab5 100644
> --- a/drivers/block/virtio_blk.c
> +++ b/drivers/block/virtio_blk.c
> @@ -822,6 +822,7 @@ static int virtblk_probe(struct virtio_device *vdev)
>   max_size = min(max_size, v);
> 
>   blk_queue_max_segment_size(q, max_size);
> + blk_queue_flag_set(QUEUE_FLAG_VIRT, q);
> 
>   /* Host can optionally specify the block size of the device */
>   err = virtio_cread_feature(vdev, VIRTIO_BLK_F_BLK_SIZE,
> --
> 2.11.0
> 




Re: [PATCH 2/3] virtiofs: split requests that exceed virtqueue size

2021-03-22 Thread Stefan Hajnoczi
On Thu, Mar 18, 2021 at 08:52:22AM -0500, Connor Kuehl wrote:
> If an incoming FUSE request can't fit on the virtqueue, the request is
> placed onto a workqueue so a worker can try to resubmit it later where
> there will (hopefully) be space for it next time.
> 
> This is fine for requests that aren't larger than a virtqueue's maximum
> capacity. However, if a request's size exceeds the maximum capacity of
> the virtqueue (even if the virtqueue is empty), it will be doomed to a
> life of being placed on the workqueue, removed, discovered it won't fit,
> and placed on the workqueue yet again.
> 
> Furthermore, from section 2.6.5.3.1 (Driver Requirements: Indirect
> Descriptors) of the virtio spec:
> 
>   "A driver MUST NOT create a descriptor chain longer than the Queue
>   Size of the device."
> 
> To fix this, limit the number of pages FUSE will use for an overall
> request. This way, each request can realistically fit on the virtqueue
> when it is decomposed into a scattergather list and avoid violating
> section 2.6.5.3.1 of the virtio spec.
> 
> Signed-off-by: Connor Kuehl 
> ---
>  fs/fuse/fuse_i.h|  5 +
>  fs/fuse/inode.c |  7 +++
>  fs/fuse/virtio_fs.c | 14 ++
>  3 files changed, 26 insertions(+)

Nice that FUSE already has max_pages :-).

Reviewed-by: Stefan Hajnoczi 
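
For illustration, a hedged sketch of the idea (hypothetical wiring; the
actual patch plumbs the limit through FUSE's existing max_pages):

  /* Cap the pages used per FUSE request so that a request, once
   * decomposed into a scatter-gather list, always fits in a single
   * virtqueue. Leave slack for the request's header/status
   * descriptors. */
  unsigned int vq_size = virtqueue_get_vring_size(fsvq->vq);

  fc->max_pages = min_t(unsigned int, fc->max_pages, vq_size - 2);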




Re: [PATCH v2] virtiofs: fix memory leak in virtio_fs_probe()

2021-03-17 Thread Stefan Hajnoczi
On Wed, Mar 17, 2021 at 08:44:43AM +, Luis Henriques wrote:
> When accidentally passing twice the same tag to qemu, kmemleak ended up
> reporting a memory leak in virtiofs.  Also, looking at the log I saw the
> following error (that's when I realised the duplicated tag):
> 
>   virtiofs: probe of virtio5 failed with error -17
> 
> Here's the kmemleak log for reference:
> 
> unreferenced object 0x888103d47800 (size 1024):
>   comm "systemd-udevd", pid 118, jiffies 4294893780 (age 18.340s)
>   hex dump (first 32 bytes):
> 00 00 00 00 ad 4e ad de ff ff ff ff 00 00 00 00  .N..
> ff ff ff ff ff ff ff ff 80 90 02 a0 ff ff ff ff  
>   backtrace:
> [<0ebb87c1>] virtio_fs_probe+0x171/0x7ae [virtiofs]
> [<f8aca419>] virtio_dev_probe+0x15f/0x210
> [<4d6baf3c>] really_probe+0xea/0x430
> [<a6ceeac8>] device_driver_attach+0xa8/0xb0
> [<196f47a7>] __driver_attach+0x98/0x140
> [<0b20601d>] bus_for_each_dev+0x7b/0xc0
> [<399c7b7f>] bus_add_driver+0x11b/0x1f0
> [<32b09ba7>] driver_register+0x8f/0xe0
> [<cdd55998>] 0xa002c013
> [<0ea196a2>] do_one_initcall+0x64/0x2e0
> [<08f727ce>] do_init_module+0x5c/0x260
> [<3cdedab6>] __do_sys_finit_module+0xb5/0x120
> [<ad2f48c6>] do_syscall_64+0x33/0x40
> [<809526b5>] entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Luis Henriques 
> ---
> Changes since v1:
> - Use kfree() to free fs->vqs instead of calling virtio_fs_put()
> 
>  fs/fuse/virtio_fs.c | 1 +
>  1 file changed, 1 insertion(+)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH] scsi: target: vhost-scsi: remove redundant initialization of variable ret

2021-03-04 Thread Stefan Hajnoczi
On Wed, Mar 03, 2021 at 01:43:39PM +, Colin King wrote:
> From: Colin Ian King 
> 
> The variable ret is being initialized with a value that is never read
> and it is being updated later with a new value.  The initialization is
> redundant and can be removed.
> 
> Addresses-Coverity: ("Unused value")
> Signed-off-by: Colin Ian King 
> ---
>  drivers/vhost/scsi.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Which kernel version is this patch based on?

If it's a fix for a patch that hasn't landed yet, please indicate this.
A "Fixes: ..." tag should be added to this patch as well.

I looked at linux.git/master commit f69d02e37a85645aa90d18cacfff36dba370f797 
and see this:

  static int __init vhost_scsi_init(void)
  {
  int ret = -ENOMEM;

  pr_debug("TCM_VHOST fabric module %s on %s/%s"
  " on "UTS_RELEASE"\n", VHOST_SCSI_VERSION, utsname()->sysname,
  utsname()->machine);

  /*
   * Use our own dedicated workqueue for submitting I/O into
   * target core to avoid contention within system_wq.
   */
  vhost_scsi_workqueue = alloc_workqueue("vhost_scsi", 0, 0);
  if (!vhost_scsi_workqueue)
  goto out;

We need ret's initialization value here ^

  ret = vhost_scsi_register();
  if (ret < 0)
  goto out_destroy_workqueue;

  ret = target_register_template(&vhost_scsi_ops);
  if (ret < 0)
  goto out_vhost_scsi_deregister;

  return 0;

  out_vhost_scsi_deregister:
  vhost_scsi_deregister();
  out_destroy_workqueue:
  destroy_workqueue(vhost_scsi_workqueue);
  out:
  return ret;
  };


> 
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index d16c04dcc144..9129ab8187fd 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -2465,7 +2465,7 @@ static const struct target_core_fabric_ops 
> vhost_scsi_ops = {
>  
>  static int __init vhost_scsi_init(void)
>  {
> - int ret = -ENOMEM;
> + int ret;
>  
>   pr_debug("TCM_VHOST fabric module %s on %s/%s"
>   " on "UTS_RELEASE"\n", VHOST_SCSI_VERSION, utsname()->sysname,
> -- 
> 2.30.0
> 




Re: [PATCH v5] i2c: virtio: add a virtio i2c frontend driver

2021-03-03 Thread Stefan Hajnoczi
On Tue, Mar 02, 2021 at 11:54:02AM +0100, Arnd Bergmann wrote:
> On Tue, Mar 2, 2021 at 10:51 AM Stefan Hajnoczi  wrote:
> > On Tue, Mar 02, 2021 at 10:42:06AM +0800, Jie Deng wrote:
> > > > > +/*
> > > > > + * Definitions for virtio I2C Adapter
> > > > > + *
> > > > > + * Copyright (c) 2021 Intel Corporation. All rights reserved.
> > > > > + */
> > > > > +
> > > > > +#ifndef _UAPI_LINUX_VIRTIO_I2C_H
> > > > > +#define _UAPI_LINUX_VIRTIO_I2C_H
> > > > Why is this a uapi header? Can't this all be moved into the driver
> > > > itself?
> >
> > Linux VIRTIO drivers provide a uapi header with structs and constants
> > that describe the device interface. This allows other software like
> > QEMU, other operating systems, etc to reuse these headers instead of
> > redefining everything.
> >
> > These files should contain:
> > 1. Device-specific feature bits (VIRTIO_<device>_F_<feature>)
> > 2. VIRTIO Configuration Space layout (struct virtio_<device>_config)
> > 3. Virtqueue request layout (struct virtio_<device>_<request>)
> >
> > For examples, see <linux/virtio_blk.h> and <linux/virtio_net.h>.
> 
> Ok, makes sense. So it's not strictly uapi, but just helpful for
> virtio applications to reference. I suppose it is slightly harder
> when building qemu on other operating systems, though: how does
> it get these headers on BSD or Windows?

qemu.git imports Linux headers from time to time and can use them
instead of system headers:

  
https://gitlab.com/qemu-project/qemu/-/blob/master/scripts/update-linux-headers.sh

Stefan




Re: [PATCH v5] i2c: virtio: add a virtio i2c frontend driver

2021-03-02 Thread Stefan Hajnoczi
On Tue, Mar 02, 2021 at 10:42:06AM +0800, Jie Deng wrote:
> 
> On 2021/3/1 23:19, Arnd Bergmann wrote:
> > On Mon, Mar 1, 2021 at 7:41 AM Jie Deng  wrote:
> > 
> > > --- /dev/null
> > > +++ b/include/uapi/linux/virtio_i2c.h
> > > @@ -0,0 +1,56 @@
> > > +/* SPDX-License-Identifier: GPL-2.0-or-later WITH Linux-syscall-note */

The uapi VIRTIO header files are BSD licensed so they can be easily used
by other projects (including other operating systems and hypervisors
that don't use Linux). Please see the license headers in
<linux/virtio_blk.h> or <linux/virtio_net.h>.

> > > +/*
> > > + * Definitions for virtio I2C Adapter
> > > + *
> > > + * Copyright (c) 2021 Intel Corporation. All rights reserved.
> > > + */
> > > +
> > > +#ifndef _UAPI_LINUX_VIRTIO_I2C_H
> > > +#define _UAPI_LINUX_VIRTIO_I2C_H
> > Why is this a uapi header? Can't this all be moved into the driver
> > itself?

Linux VIRTIO drivers provide a uapi header with structs and constants
that describe the device interface. This allows other software like
QEMU, other operating systems, etc to reuse these headers instead of
redefining everything.

These files should contain:
1. Device-specific feature bits (VIRTIO_<device>_F_<feature>)
2. VIRTIO Configuration Space layout (struct virtio_<device>_config)
3. Virtqueue request layout (struct virtio_<device>_<request>)

For examples, see <linux/virtio_blk.h> and <linux/virtio_net.h>.
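
As a purely hypothetical illustration for an imaginary "foo" device
(placeholder names only; not a real virtio device type):

  /* 1. Device-specific feature bit */
  #define VIRTIO_FOO_F_BAZ	0

  /* 2. Configuration space layout */
  struct virtio_foo_config {
  	__le32 baz_count;
  };

  /* 3. Virtqueue request layout */
  struct virtio_foo_req {
  	__le32 type;
  	__u8 data[];
  };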

> > > +/**
> > > + * struct virtio_i2c_req - the virtio I2C request structure
> > > + * @out_hdr: the OUT header of the virtio I2C message
> > > + * @write_buf: contains one I2C segment being written to the device
> > > + * @read_buf: contains one I2C segment being read from the device
> > > + * @in_hdr: the IN header of the virtio I2C message
> > > + */
> > > +struct virtio_i2c_req {
> > > +   struct virtio_i2c_out_hdr out_hdr;
> > > +   u8 *write_buf;
> > > +   u8 *read_buf;
> > > +   struct virtio_i2c_in_hdr in_hdr;
> > > +};
> > In particular, this structure looks like it is only ever usable between
> > the transfer functions in the driver itself, it is shared with neither
> > user space nor the virtio host side.

I agree. This struct is not part of the device interface. It's part of
the Linux driver implementation. This belongs inside the driver code and
not in include/uapi/ where public headers are located.

Stefan




Re: [RFC PATCH] blk-core: remove blk_put_request()

2021-02-24 Thread Stefan Hajnoczi
On Mon, Feb 22, 2021 at 01:11:15PM -0800, Chaitanya Kulkarni wrote:
> The function blk_put_request() is just a wrapper to
> blk_mq_free_request(), remove the unnecessary wrapper.
> 
> Any feedback is welcome on this RFC.
> 
> Signed-off-by: Chaitanya Kulkarni 
> ---
>  block/blk-core.c   |  6 --
>  block/blk-merge.c  |  2 +-
>  block/bsg-lib.c|  4 ++--
>  block/bsg.c|  4 ++--
>  block/scsi_ioctl.c |  6 +++---
>  drivers/block/paride/pd.c  |  2 +-
>  drivers/block/pktcdvd.c|  2 +-
>  drivers/block/virtio_blk.c |  2 +-
>  drivers/cdrom/cdrom.c  |  4 ++--
>  drivers/ide/ide-atapi.c|  2 +-
>  drivers/ide/ide-cd.c   |  4 ++--
>  drivers/ide/ide-cd_ioctl.c |  2 +-
>  drivers/ide/ide-devsets.c  |  2 +-
>  drivers/ide/ide-disk.c |  2 +-
>  drivers/ide/ide-ioctls.c   |  4 ++--
>  drivers/ide/ide-park.c |  2 +-
>  drivers/ide/ide-pm.c   |  4 ++--
>  drivers/ide/ide-tape.c |  2 +-
>  drivers/ide/ide-taskfile.c |  2 +-
>  drivers/md/dm-mpath.c  |  2 +-
>  drivers/mmc/core/block.c   | 10 +-
>  drivers/scsi/scsi_error.c  |  2 +-
>  drivers/scsi/scsi_lib.c|  2 +-
>  drivers/scsi/sg.c  |  6 +++---
>  drivers/scsi/st.c  |  4 ++--
>  drivers/scsi/ufs/ufshcd.c  |  6 +++---
>  drivers/target/target_core_pscsi.c |  4 ++--
>  fs/nfsd/blocklayout.c  |  4 ++--
>  include/linux/blkdev.h |  1 -
>  29 files changed, 46 insertions(+), 53 deletions(-)
> 
> diff --git a/block/blk-core.c b/block/blk-core.c
> index fc60ff208497..1754f5e7cc80 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -642,12 +642,6 @@ struct request *blk_get_request(struct request_queue *q, 
> unsigned int op,
>  }
>  EXPORT_SYMBOL(blk_get_request);
>  
> -void blk_put_request(struct request *req)
> -{
> - blk_mq_free_request(req);
> -}
> -EXPORT_SYMBOL(blk_put_request);

blk_get_request() still exists after this patch. A "get" API usually has
a corresponding "put" API. I'm not sure this patch helps the consistency
and clarity of the code.

If you do go ahead, please update the blk_get_request() doc comment
explicitly mentioning that blk_mq_free_request() needs to be called.

Stefan




Re: [PATCH RFC v2 08/10] vdpa: add vdpa simulator for block device

2021-02-03 Thread Stefan Hajnoczi
On Tue, Feb 02, 2021 at 04:49:50PM +0100, Stefano Garzarella wrote:
> On Tue, Feb 02, 2021 at 09:34:12AM +0000, Stefan Hajnoczi wrote:
> > On Thu, Jan 28, 2021 at 03:41:25PM +0100, Stefano Garzarella wrote:
> > > +static void vdpasim_blk_work(struct work_struct *work)
> > > +{
> > > +	struct vdpasim *vdpasim = container_of(work, struct vdpasim, work);
> > > +	u8 status = VIRTIO_BLK_S_OK;
> > > +	int i;
> > > +
> > > +	spin_lock(&vdpasim->lock);
> > > +
> > > +	if (!(vdpasim->status & VIRTIO_CONFIG_S_DRIVER_OK))
> > > +		goto out;
> > > +
> > > +	for (i = 0; i < VDPASIM_BLK_VQ_NUM; i++) {
> > > +		struct vdpasim_virtqueue *vq = &vdpasim->vqs[i];
> > > +
> > > +		if (!vq->ready)
> > > +			continue;
> > > +
> > > +		while (vringh_getdesc_iotlb(&vq->vring, &vq->out_iov,
> > > +					    &vq->in_iov, &vq->head,
> > > +					    GFP_ATOMIC) > 0) {
> > > +			int write;
> > > +
> > > +			vq->in_iov.i = vq->in_iov.used - 1;
> > > +			write = vringh_iov_push_iotlb(&vq->vring, &vq->in_iov,
> > > +						      &status, 1);
> > > +			if (write <= 0)
> > > +				break;
> > 
> > This code looks fragile:
> > 
> > 1. Relying on unsigned underflow and the while loop in
> >   vringh_iov_push_iotlb() to handle the case where in_iov.used == 0 is
> >   risky and could break.
> > 
> > 2. Does this assume that the last in_iov element has size 1? For
> >   example, the guest driver may send a single "in" iovec with size 513
> >   when reading 512 bytes (with an extra byte for the request status).
> > 
> > Please validate inputs fully, even in test/development code, because
> > it's likely to be copied by others when writing production code (or
> > deployed in production by unsuspecting users) :).
> 
> Perfectly agree on that, so I addressed these things, also following your
> review on the previous version, on the next patch of this series:
> "vdpa_sim_blk: implement ramdisk behaviour".
> 
> Do you think should I move these checks in this patch?
> 
> I did this to leave Max credit for this patch and add more code to emulate a
> ramdisk in later patches.

You could update the commit description so it's clear that input
validation is missing and will be added in the next commit.

Stefan




Re: [PATCH RFC v2 10/10] vdpa_sim_blk: handle VIRTIO_BLK_T_GET_ID

2021-02-02 Thread Stefan Hajnoczi
On Thu, Jan 28, 2021 at 03:41:27PM +0100, Stefano Garzarella wrote:
> Handle VIRTIO_BLK_T_GET_ID request, always answering the
> "vdpa_blk_sim" string.
> 
> Signed-off-by: Stefano Garzarella 
> ---
> v2:
> - made 'vdpasim_blk_id' static [Jason]
> ---
>  drivers/vdpa/vdpa_sim/vdpa_sim_blk.c | 15 +++
>  1 file changed, 15 insertions(+)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH RFC v2 08/10] vdpa: add vdpa simulator for block device

2021-02-02 Thread Stefan Hajnoczi
On Thu, Jan 28, 2021 at 03:41:25PM +0100, Stefano Garzarella wrote:
> +static void vdpasim_blk_work(struct work_struct *work)
> +{
> +	struct vdpasim *vdpasim = container_of(work, struct vdpasim, work);
> +	u8 status = VIRTIO_BLK_S_OK;
> +	int i;
> +
> +	spin_lock(&vdpasim->lock);
> +
> +	if (!(vdpasim->status & VIRTIO_CONFIG_S_DRIVER_OK))
> +		goto out;
> +
> +	for (i = 0; i < VDPASIM_BLK_VQ_NUM; i++) {
> +		struct vdpasim_virtqueue *vq = &vdpasim->vqs[i];
> +
> +		if (!vq->ready)
> +			continue;
> +
> +		while (vringh_getdesc_iotlb(&vq->vring, &vq->out_iov,
> +					    &vq->in_iov, &vq->head,
> +					    GFP_ATOMIC) > 0) {
> +			int write;
> +
> +			vq->in_iov.i = vq->in_iov.used - 1;
> +			write = vringh_iov_push_iotlb(&vq->vring, &vq->in_iov,
> +						      &status, 1);
> +			if (write <= 0)
> +				break;

This code looks fragile:

1. Relying on unsigned underflow and the while loop in
   vringh_iov_push_iotlb() to handle the case where in_iov.used == 0 is
   risky and could break.

2. Does this assume that the last in_iov element has size 1? For
   example, the guest driver may send a single "in" iovec with size 513
   when reading 512 bytes (with an extra byte for the request status).

Please validate inputs fully, even in test/development code, because
it's likely to be copied by others when writing production code (or
deployed in production by unsuspecting users) :).
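
For illustration, a hedged sketch of fuller validation (assuming the
vringh_kiov_length() helper that the follow-up patch's changelog
mentions):

  /* Before writing the status, check that the "in" buffer can hold at
   * least the 1-byte request status -- and write it at the end of the
   * buffer rather than assuming the last iovec element has size 1. */
  if (vringh_kiov_length(&vq->in_iov) < 1) {
  	dev_err(&vdpasim->vdpa.dev, "request without status byte\n");
  	break;
  }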




Re: [PATCH RFC v2 09/10] vdpa_sim_blk: implement ramdisk behaviour

2021-02-02 Thread Stefan Hajnoczi
On Thu, Jan 28, 2021 at 03:41:26PM +0100, Stefano Garzarella wrote:
> The previous implementation wrote only the status of each request.
> This patch implements a more accurate block device simulator,
> providing a ramdisk-like behavior.
> 
> Signed-off-by: Stefano Garzarella 
> ---
> v2:
> - used %zd %zx to print size_t and ssize_t variables in dev_err()
> - removed unnecessary new line [Jason]
> - moved VIRTIO_BLK_T_GET_ID in another patch [Jason]
> - used push/pull instead of write/read terminology
> - added vdpasim_blk_check_range() to avoid overflows [Stefan]
> - use vdpasim*_to_cpu instead of le*_to_cpu
> - used vringh_kiov_length() helper [Jason]
> ---
>  drivers/vdpa/vdpa_sim/vdpa_sim_blk.c | 164 ---
>  1 file changed, 146 insertions(+), 18 deletions(-)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH v2] vhost/vsock: add IOTLB API support

2021-01-04 Thread Stefan Hajnoczi
On Wed, Dec 23, 2020 at 03:36:38PM +0100, Stefano Garzarella wrote:
> This patch enables the IOTLB API support for vhost-vsock devices,
> allowing the userspace to emulate an IOMMU for the guest.
> 
> These changes were made following vhost-net, in details this patch:
> - exposes VIRTIO_F_ACCESS_PLATFORM feature and inits the iotlb
>   device if the feature is acked
> - implements VHOST_GET_BACKEND_FEATURES and
>   VHOST_SET_BACKEND_FEATURES ioctls
> - calls vq_meta_prefetch() before vq processing to prefetch vq
>   metadata address in IOTLB
> - provides .read_iter, .write_iter, and .poll callbacks for the
>   chardev; they are used by the userspace to exchange IOTLB messages
> 
> This patch was tested specifying "intel_iommu=strict" in the guest
> kernel command line. I used QEMU with a patch applied [1] to fix a
> simple issue (that patch was merged in QEMU v5.2.0):
> $ qemu -M q35,accel=kvm,kernel-irqchip=split \
>-drive file=fedora.qcow2,format=qcow2,if=virtio \
>-device intel-iommu,intremap=on,device-iotlb=on \
>-device vhost-vsock-pci,guest-cid=3,iommu_platform=on,ats=on
> 
> [1] https://lists.gnu.org/archive/html/qemu-devel/2020-10/msg09077.html
> 
> Reviewed-by: Stefan Hajnoczi 
> Signed-off-by: Stefano Garzarella 
> ---
> 
> The patch is the same of v1, but I re-tested it with:
> - QEMU v5.2.0-551-ga05f8ecd88
> - Linux 5.9.15 (host)
> - Linux 5.9.15 and 5.10.0 (guest)
> Now, enabling 'ats' it works well, there are just a few simple changes.
> 
> v1: https://www.spinics.net/lists/kernel/msg3716022.html
> v2:
> - updated commit message about QEMU version and string used to test
> - rebased on mst/vhost branch
> 
> Thanks,
> Stefano
> ---
>  drivers/vhost/vsock.c | 68 +--
>  1 file changed, 65 insertions(+), 3 deletions(-)

Acked-by: Stefan Hajnoczi 


Re: [PATCH 06/21] vdpa: introduce virtqueue groups

2021-01-04 Thread Stefan Hajnoczi
On Wed, Dec 16, 2020 at 02:48:03PM +0800, Jason Wang wrote:
> This patch introduces virtqueue groups to vDPA device. The virtqueue
> group is the minimal set of virtqueues that must share an address
> space. And the adddress space identifier could only be attached to
> a specific virtqueue group.
> 
> A new mandated bus operation is introduced to get the virtqueue group
> ID for a specific virtqueue.
> 
> All the vDPA device drivers were converted to simply support a single
> virtqueue group.
> 
> Signed-off-by: Jason Wang 
> ---
>  drivers/vdpa/ifcvf/ifcvf_main.c   |  9 -
>  drivers/vdpa/mlx5/net/mlx5_vnet.c |  8 +++-
>  drivers/vdpa/vdpa.c   |  4 +++-
>  drivers/vdpa/vdpa_sim/vdpa_sim.c  | 11 ++-
>  include/linux/vdpa.h  | 12 +---
>  5 files changed, 37 insertions(+), 7 deletions(-)

Maybe consider calling it iotlb_group or iommu_group so the purpose of
the group is clear?


Re: [PATCH] vhost scsi: fix error return code in vhost_scsi_set_endpoint()

2020-12-07 Thread Stefan Hajnoczi
On Fri, Dec 04, 2020 at 04:43:30PM +0800, Zhang Changzhong wrote:
> Fix to return a negative error code from the error handling
> case instead of 0, as done elsewhere in this function.
> 
> Fixes: 25b98b64e284 ("vhost scsi: alloc cmds per vq instead of session")
> Reported-by: Hulk Robot 
> Signed-off-by: Zhang Changzhong 
> ---
>  drivers/vhost/scsi.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)

Acked-by: Stefan Hajnoczi 


Re: [PATCH] vfio iommu type1: Bypass the vma permission check in vfio_pin_pages_remote()

2020-12-07 Thread Stefan Hajnoczi
On Thu, Dec 03, 2020 at 05:01:32PM +0100, David Hildenbrand wrote:
> The real question is: do we even *need* DMA from vfio devices to
> virtio-fs regions? If not (do guests rely on it? what does the spec
> state?), just don't care about vfio at all and don't map anything.

Can DMA to/from the virtio-fs DAX Window happen? Yes, the guest could do
it although it's rare.

Is it a great idea to combine VFIO and virtio-fs DAX? Maybe not. It
involves pinning host page cache pages O_o.

Stefan


Re: [PATCH] vfio iommu type1: Bypass the vma permission check in vfio_pin_pages_remote()

2020-12-03 Thread Stefan Hajnoczi
On Wed, Dec 02, 2020 at 10:45:11AM -0500, Peter Xu wrote:
> On Wed, Dec 02, 2020 at 02:33:56PM +0000, Stefan Hajnoczi wrote:
> > On Wed, Nov 25, 2020 at 10:57:11AM -0500, Peter Xu wrote:
> > > On Wed, Nov 25, 2020 at 01:05:25AM +, Justin He wrote:
> > > > > I'd appreciate if you could explain why vfio needs to dma map some
> > > > > PROT_NONE
> > > > 
> > > > Virtiofs will map a PROT_NONE cache window region firstly, then remap 
> > > > the sub
> > > > region of that cache window with read or write permission. I guess this 
> > > > might
> > > > be an security concern. Just CC virtiofs expert Stefan to answer it 
> > > > more accurately.
> > > 
> > > Yep.  Since my previous sentence was cut off, I'll rephrase: I was 
> > > thinking
> > > whether qemu can do vfio maps only until it remaps the PROT_NONE regions 
> > > into
> > > PROT_READ|PROT_WRITE ones, rather than trying to map dma pages upon 
> > > PROT_NONE.
> > 
> > Userspace processes sometimes use PROT_NONE to reserve virtual address
> > space. That way future mmap(NULL, ...) calls will not accidentally
> > allocate an address from the reserved range.
> > 
> > virtio-fs needs to do this because the DAX window mappings change at
> > runtime. Initially the entire DAX window is just reserved using
> > PROT_NONE. When it's time to mmap a portion of a file into the DAX
> > window an mmap(fixed_addr, ...) call will be made.
> 
> Yes I can understand the rational on why the region is reserved.  However IMHO
> the real question is why such reservation behavior should affect qemu memory
> layout, and even further to VFIO mappings.
> 
> Note that PROT_NONE should likely mean that there's no backing page at all in
> this case.  Since vfio will pin all the pages before mapping the DMAs, it also
> means that it's at least inefficient, because when we try to map all the
> PROT_NONE pages we'll try to fault in every single page of it, even if they 
> may
> not ever be used.
> 
> So I still think this patch is not doing the right thing.  Instead we should
> somehow teach qemu that the virtiofs memory region should only be the size of
> enabled regions (with PROT_READ|PROT_WRITE), rather than the whole reserved
> PROT_NONE region.

virtio-fs was not implemented with IOMMUs in mind. The idea is just to
install a kvm.ko memory region that exposes the DAX window.

Perhaps we need to treat the DAX window like an IOMMU? That way the
virtio-fs code can send map/unmap notifications and hw/vfio/ can
propagate them to the host kernel.

Stefan


Re: [PATCH net-next v1 1/3] vm_sockets: Include flag field in the vsock address data structure

2020-12-03 Thread Stefan Hajnoczi
On Tue, Dec 01, 2020 at 05:25:03PM +0200, Andra Paraschiv wrote:
> vsock enables communication between virtual machines and the host they
> are running on. With the multi transport support (guest->host and
> host->guest), nested VMs can also use vsock channels for communication.
> 
> In addition to this, by default, all the vsock packets are forwarded to
> the host, if no host->guest transport is loaded. This behavior can be
> implicitly used for enabling vsock communication between sibling VMs.
> 
> Add a flag field in the vsock address data structure that can be used to
> explicitly mark the vsock connection as being targeted for a certain
> type of communication. This way, can distinguish between nested VMs and
> sibling VMs use cases and can also setup them at the same time. Till
> now, could either have nested VMs or sibling VMs at a time using the
> vsock communication stack.
> 
> Use the already available "svm_reserved1" field and mark it as a flag
> field instead. This flag can be set when initializing the vsock address
> variable used for the connect() call.
> 
> Signed-off-by: Andra Paraschiv 
> ---
>  include/uapi/linux/vm_sockets.h | 18 +-
>  1 file changed, 17 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/vm_sockets.h b/include/uapi/linux/vm_sockets.h
> index fd0ed7221645d..58da5a91413ac 100644
> --- a/include/uapi/linux/vm_sockets.h
> +++ b/include/uapi/linux/vm_sockets.h
> @@ -114,6 +114,22 @@
>  
>  #define VMADDR_CID_HOST 2
>  
> +/* This sockaddr_vm flag value covers the current default use case:
> + * local vsock communication between guest and host and nested VMs setup.
> + * In addition to this, implicitly, the vsock packets are forwarded to the 
> host
> + * if no host->guest vsock transport is set.
> + */
> +#define VMADDR_FLAG_DEFAULT_COMMUNICATION 0x0000
> +
> +/* Set this flag value in the sockaddr_vm corresponding field if the vsock
> + * channel needs to be setup between two sibling VMs running on the same 
> host.
> + * This way can explicitly distinguish between vsock channels created for 
> nested
> + * VMs (or local communication between guest and host) and the ones created 
> for
> + * sibling VMs. And vsock channels for multiple use cases (nested / sibling 
> VMs)
> + * can be setup at the same time.
> + */
> +#define VMADDR_FLAG_SIBLING_VMS_COMMUNICATION 0x0001

vsock has the h2g and g2h concept. It would be more general to call this
flag VMADDR_FLAG_G2H or less cryptically VMADDR_FLAG_TO_HOST.

That way it just tells the driver in which direction to send packets
without implying that sibling communication is possible (it's not
allowed by default on any transport).
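
Either way, userspace usage would look roughly like this (untested
sketch; the flag constant comes from this patch, I'm assuming the
renamed field is called svm_flags, and the rest is standard AF_VSOCK
code):

  #include <sys/socket.h>
  #include <unistd.h>
  #include <linux/vm_sockets.h>

  /* Untested sketch: connect to a sibling VM using the proposed flag. */
  int connect_sibling(unsigned int sibling_cid, unsigned int port)
  {
          struct sockaddr_vm addr = {
                  .svm_family = AF_VSOCK,
                  .svm_cid = sibling_cid,
                  .svm_port = port,
                  .svm_flags = VMADDR_FLAG_SIBLING_VMS_COMMUNICATION,
          };
          int fd = socket(AF_VSOCK, SOCK_STREAM, 0);

          if (fd < 0)
                  return -1;

          if (connect(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
                  close(fd);
                  return -1;
          }

          return fd;
  }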

I don't have a strong opinion on this but wanted to suggest the idea.

Stefan


Re: [PATCH] vfio iommu type1: Bypass the vma permission check in vfio_pin_pages_remote()

2020-12-02 Thread Stefan Hajnoczi
On Wed, Nov 25, 2020 at 10:57:11AM -0500, Peter Xu wrote:
> On Wed, Nov 25, 2020 at 01:05:25AM +, Justin He wrote:
> > > I'd appreciate if you could explain why vfio needs to dma map some
> > > PROT_NONE
> > 
> > Virtiofs will map a PROT_NONE cache window region firstly, then remap the 
> > sub
> > region of that cache window with read or write permission. I guess this 
> > might
> > be an security concern. Just CC virtiofs expert Stefan to answer it more 
> > accurately.
> 
> Yep.  Since my previous sentence was cut off, I'll rephrase: I was thinking
> whether qemu can do vfio maps only until it remaps the PROT_NONE regions into
> PROT_READ|PROT_WRITE ones, rather than trying to map dma pages upon PROT_NONE.

Userspace processes sometimes use PROT_NONE to reserve virtual address
space. That way future mmap(NULL, ...) calls will not accidentally
allocate an address from the reserved range.

virtio-fs needs to do this because the DAX window mappings change at
runtime. Initially the entire DAX window is just reserved using
PROT_NONE. When it's time to mmap a portion of a file into the DAX
window an mmap(fixed_addr, ...) call will be made.
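
In code the pattern looks like this (a minimal sketch; sizes and fds
are illustrative):

  #include <sys/mman.h>

  /* Reserve address space only; nothing is accessible yet. */
  void *reserve_window(size_t size)
  {
          return mmap(NULL, size, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
  }

  /* Later, map a file range over part of the reservation. */
  int map_into_window(void *window, size_t off, size_t len,
                      int fd, off_t foff)
  {
          void *p = mmap((char *)window + off, len,
                         PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_FIXED, fd, foff);

          return p == MAP_FAILED ? -1 : 0;
  }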

Stefan


Re: [PATCH net] vsock: forward all packets to the host when no H2G is registered

2020-11-19 Thread Stefan Hajnoczi
On Thu, Nov 12, 2020 at 02:38:37PM +0100, Stefano Garzarella wrote:
> Before commit c0cfa2d8a788 ("vsock: add multi-transports support"),
> if a G2H transport was loaded (e.g. virtio transport), every packet
> was forwarded to the host, regardless of the destination CID.
> The H2G transports implemented until then (vhost-vsock, VMCI) always
> responded with an error, if the destination CID was not
> VMADDR_CID_HOST.
> 
> From that commit, we are using the remote CID to decide which
> transport to use, so packets with remote CID > VMADDR_CID_HOST(2)
> are sent only through H2G transport. If no H2G is available, packets
> are discarded directly in the guest.
> 
> Some use cases (e.g. Nitro Enclaves [1]) rely on the old behaviour
> to implement sibling VMs communication, so we restore the old
> behavior when no H2G is registered.
> It will be up to the host to discard packets if the destination is
> not the right one. As it was already implemented before adding
> multi-transport support.
> 
> Tested with nested QEMU/KVM by me and Nitro Enclaves by Andra.
> 
> [1] Documentation/virt/ne_overview.rst
> 
> Cc: Jorgen Hansen 
> Cc: Dexuan Cui 
> Fixes: c0cfa2d8a788 ("vsock: add multi-transports support")
> Reported-by: Andra Paraschiv 
> Tested-by: Andra Paraschiv 
> Signed-off-by: Stefano Garzarella 
> ---
>  net/vmw_vsock/af_vsock.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Acked-by: Stefan Hajnoczi 


Re: [PATCH RFC 04/12] vdpa: add vdpa simulator for block device

2020-11-18 Thread Stefan Hajnoczi
On Tue, Nov 17, 2020 at 06:38:11PM +0100, Stefano Garzarella wrote:
> On Tue, Nov 17, 2020 at 04:43:42PM +0000, Stefan Hajnoczi wrote:
> > On Tue, Nov 17, 2020 at 03:16:20PM +0100, Stefano Garzarella wrote:
> > > On Tue, Nov 17, 2020 at 11:11:21AM +0000, Stefan Hajnoczi wrote:
> > > > On Fri, Nov 13, 2020 at 02:47:04PM +0100, Stefano Garzarella wrote:
> > > > > +static void vdpasim_blk_work(struct work_struct *work)
> > > > > +{
> > > > > + struct vdpasim *vdpasim = container_of(work, struct vdpasim, 
> > > > > work);
> > > > > + u8 status = VIRTIO_BLK_S_OK;
> > > > > + int i;
> > > > > +
> > > > > + spin_lock(&vdpasim->lock);
> > > > > +
> > > > > + if (!(vdpasim->status & VIRTIO_CONFIG_S_DRIVER_OK))
> > > > > + goto out;
> > > > > +
> > > > > + for (i = 0; i < VDPASIM_BLK_VQ_NUM; i++) {
> > > > > + struct vdpasim_virtqueue *vq = &vdpasim->vqs[i];
> > > > > +
> > > > > + if (!vq->ready)
> > > > > + continue;
> > > > > +
> > > > > + while (vringh_getdesc_iotlb(&vq->vring, &vq->iov, &vq->iov,
> > > > > + &vq->head, GFP_ATOMIC) > 0) {
> > > > > +
> > > > > + int write;
> > > > > +
> > > > > + vq->iov.i = vq->iov.used - 1;
> > > > > + write = vringh_iov_push_iotlb(&vq->vring, &vq->iov, &status, 1);
> > > > > + if (write <= 0)
> > > > > + break;
> > > >
> > > > We're lucky the guest driver doesn't crash after VIRTIO_BLK_T_GET_ID? :)
> > > 
> > > The crash could happen if the simulator doesn't put the string terminator,
> > > but in virtio_blk.c, the serial_show() initialize the buffer putting the
> > > string terminator in the VIRTIO_BLK_ID_BYTES element:
> > > 
> > > buf[VIRTIO_BLK_ID_BYTES] = '\0';
> > > err = virtblk_get_id(disk, buf);
> > > 
> > > This should prevent the issue, right?
> > > 
> > > However in the last patch of this series I implemented VIRTIO_BLK_T_GET_ID
> > > support :-)
> > 
> > Windows, BSD, macOS, etc guest drivers aren't necessarily going to
> > terminate or initialize the serial string buffer.
> 
> Unfortunately I discovered that VIRTIO_BLK_T_GET_ID is not in the VIRTIO
> specs, so, just for curiosity, I checked the QEMU code and I found this:
> 
> case VIRTIO_BLK_T_GET_ID:
> {
> /*
>  * NB: per existing s/n string convention the string is
>  * terminated by '\0' only when shorter than buffer.
>  */
> const char *serial = s->conf.serial ? s->conf.serial : "";
> size_t size = MIN(strlen(serial) + 1,
>   MIN(iov_size(in_iov, in_num),
>   VIRTIO_BLK_ID_BYTES));
> iov_from_buf(in_iov, in_num, 0, serial, size);
> virtio_blk_req_complete(req, VIRTIO_BLK_S_OK);
> virtio_blk_free_request(req);
> break;
> }
> 
> It seems that the device emulation in QEMU expects that the driver
> terminates the serial string buffer.
> 
> Do you know why VIRTIO_BLK_T_GET_ID is not in the specs?
> Should we add it?

It's about to be merged into the VIRTIO spec:
https://github.com/oasis-tcs/virtio-spec/issues/63

Stefan


Re: [PATCH RFC 04/12] vdpa: add vdpa simulator for block device

2020-11-17 Thread Stefan Hajnoczi
On Tue, Nov 17, 2020 at 03:16:20PM +0100, Stefano Garzarella wrote:
> On Tue, Nov 17, 2020 at 11:11:21AM +0000, Stefan Hajnoczi wrote:
> > On Fri, Nov 13, 2020 at 02:47:04PM +0100, Stefano Garzarella wrote:
> > > +static void vdpasim_blk_work(struct work_struct *work)
> > > +{
> > > + struct vdpasim *vdpasim = container_of(work, struct vdpasim, work);
> > > + u8 status = VIRTIO_BLK_S_OK;
> > > + int i;
> > > +
> > > + spin_lock(&vdpasim->lock);
> > > +
> > > + if (!(vdpasim->status & VIRTIO_CONFIG_S_DRIVER_OK))
> > > + goto out;
> > > +
> > > + for (i = 0; i < VDPASIM_BLK_VQ_NUM; i++) {
> > > + struct vdpasim_virtqueue *vq = &vdpasim->vqs[i];
> > > +
> > > + if (!vq->ready)
> > > + continue;
> > > +
> > > + while (vringh_getdesc_iotlb(&vq->vring, &vq->iov, &vq->iov,
> > > + &vq->head, GFP_ATOMIC) > 0) {
> > > +
> > > + int write;
> > > +
> > > + vq->iov.i = vq->iov.used - 1;
> > > + write = vringh_iov_push_iotlb(&vq->vring, &vq->iov,
> > > + &status, 1);
> > > + if (write <= 0)
> > > + break;
> > 
> > We're lucky the guest driver doesn't crash after VIRTIO_BLK_T_GET_ID? :)
> 
> The crash could happen if the simulator doesn't put the string terminator,
> but in virtio_blk.c, the serial_show() initialize the buffer putting the
> string terminator in the VIRTIO_BLK_ID_BYTES element:
> 
> buf[VIRTIO_BLK_ID_BYTES] = '\0';
> err = virtblk_get_id(disk, buf);
> 
> This should prevent the issue, right?
> 
> However in the last patch of this series I implemented VIRTIO_BLK_T_GET_ID
> support :-)

Windows, BSD, macOS, etc guest drivers aren't necessarily going to
terminate or initialize the serial string buffer.

Anyway, the later patch that implements VIRTIO_BLK_T_GET_ID solves this
issue! Thanks.

Stefan


Re: [PATCH RFC 12/12] vdpa_sim_blk: implement ramdisk behaviour

2020-11-17 Thread Stefan Hajnoczi
On Fri, Nov 13, 2020 at 02:47:12PM +0100, Stefano Garzarella wrote:
> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim_blk.c 
> b/drivers/vdpa/vdpa_sim/vdpa_sim_blk.c
> index 8e41b3ab98d5..68e74383322f 100644
> --- a/drivers/vdpa/vdpa_sim/vdpa_sim_blk.c
> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim_blk.c
> @@ -7,6 +7,7 @@
>   */
>  
>  #include 
> +#include 
>  #include 
>  
>  #include "vdpa_sim.h"
> @@ -24,10 +25,137 @@
>  
>  static struct vdpasim *vdpasim_blk_dev;
>  
> +static int vdpasim_blk_handle_req(struct vdpasim *vdpasim,
> +   struct vdpasim_virtqueue *vq)

This function has a non-standard int return value. Please document it.

> +{
> + size_t wrote = 0, to_read = 0, to_write = 0;
> + struct virtio_blk_outhdr hdr;
> + uint8_t status;
> + uint32_t type;
> + ssize_t bytes;
> + loff_t offset;
> + int i, ret;
> +
> + vringh_kiov_cleanup(&vq->riov);
> + vringh_kiov_cleanup(&vq->wiov);
> +
> + ret = vringh_getdesc_iotlb(&vq->vring, &vq->riov, &vq->wiov,
> +&vq->head, GFP_ATOMIC);
> + if (ret != 1)
> + return ret;
> +
> + for (i = 0; i < vq->wiov.used; i++)
> + to_write += vq->wiov.iov[i].iov_len;
> + to_write -= 1; /* last byte is the status */

What if vq->wiov.used == 0?

> +
> + for (i = 0; i < vq->riov.used; i++)
> + to_read += vq->riov.iov[i].iov_len;
> +
> + bytes = vringh_iov_pull_iotlb(&vq->vring, &vq->riov, &hdr, sizeof(hdr));
> + if (bytes != sizeof(hdr))
> + return 0;
> +
> + to_read -= bytes;
> +
> + type = le32_to_cpu(hdr.type);
> + offset = le64_to_cpu(hdr.sector) << SECTOR_SHIFT;
> + status = VIRTIO_BLK_S_OK;
> +
> + switch (type) {
> + case VIRTIO_BLK_T_IN:
> + if (offset + to_write > VDPASIM_BLK_CAPACITY << SECTOR_SHIFT) {

Integer overflow is not handled.
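
An overflow-safe check could compare in sector units and bound the
length before any shift or addition, along these lines (untested
sketch; the helper name is hypothetical and VDPASIM_BLK_CAPACITY is in
sectors, as in this patch):

  /* Untested sketch: reject requests outside the disk without risking
   * wraparound in "offset + to_write".
   */
  static bool vdpasim_blk_check_range(u64 start_sector, size_t num_sectors)
  {
          if (start_sector > VDPASIM_BLK_CAPACITY)
                  return false;

          if (num_sectors > VDPASIM_BLK_CAPACITY - start_sector)
                  return false;

          return true;
  }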


Re: [PATCH RFC 10/12] vdpa_sim: split vdpasim_virtqueue's iov field in riov and wiov

2020-11-17 Thread Stefan Hajnoczi
On Fri, Nov 13, 2020 at 02:47:10PM +0100, Stefano Garzarella wrote:
> vringh_getdesc_iotlb() manages 2 iovs for writable and readable
> descriptors. This is very useful for the block device, where for
> each request we have both types of descriptor.
> 
> Let's split the vdpasim_virtqueue's iov field in riov and wiov
> to use them with vringh_getdesc_iotlb().

Is riov/wiov naming common? VIRTIO uses "in" (device-to-driver) and
"out" (driver-to-device). Using VIRTIO terminology might be clearer.

Stefan


Re: [PATCH RFC 06/12] vdpa_sim: add struct vdpasim_device to store device properties

2020-11-17 Thread Stefan Hajnoczi
On Fri, Nov 13, 2020 at 02:47:06PM +0100, Stefano Garzarella wrote:
> Move device properties used during the entire life cycle in a new
> structure to simplify the copy of these fields during the vdpasim
> initialization.
> 
> Signed-off-by: Stefano Garzarella 
> ---
>  drivers/vdpa/vdpa_sim/vdpa_sim.h | 17 --
>  drivers/vdpa/vdpa_sim/vdpa_sim.c | 33 ++--
>  drivers/vdpa/vdpa_sim/vdpa_sim_blk.c |  8 +--
>  drivers/vdpa/vdpa_sim/vdpa_sim_net.c |  9 +---
>  4 files changed, 38 insertions(+), 29 deletions(-)
> 
> diff --git a/drivers/vdpa/vdpa_sim/vdpa_sim.h 
> b/drivers/vdpa/vdpa_sim/vdpa_sim.h
> index 6a1267c40d5e..76e642042eb0 100644
> --- a/drivers/vdpa/vdpa_sim/vdpa_sim.h
> +++ b/drivers/vdpa/vdpa_sim/vdpa_sim.h
> @@ -40,12 +40,17 @@ struct vdpasim_virtqueue {
>   irqreturn_t (*cb)(void *data);
>  };
>  
> +struct vdpasim_device {
> + u64 supported_features;
> + u32 id;
> + int nvqs;
> +};
> +
>  struct vdpasim_init_attr {
> - u32 device_id;
> - u64 features;
> + struct vdpasim_device device;

It's unclear to me what the exact purpose of struct vdpasim_device is.
At least the name reminds me of struct device, which this is not.

Should this be called just struct vdpasim_attr or struct
vdpasim_dev_attr? In other words, the attributes that are needed even
after initialization?


Re: [PATCH RFC 04/12] vdpa: add vdpa simulator for block device

2020-11-17 Thread Stefan Hajnoczi
On Fri, Nov 13, 2020 at 02:47:04PM +0100, Stefano Garzarella wrote:
> +static void vdpasim_blk_work(struct work_struct *work)
> +{
> + struct vdpasim *vdpasim = container_of(work, struct vdpasim, work);
> + u8 status = VIRTIO_BLK_S_OK;
> + int i;
> +
> + spin_lock(&vdpasim->lock);
> +
> + if (!(vdpasim->status & VIRTIO_CONFIG_S_DRIVER_OK))
> + goto out;
> +
> + for (i = 0; i < VDPASIM_BLK_VQ_NUM; i++) {
> + struct vdpasim_virtqueue *vq = &vdpasim->vqs[i];
> +
> + if (!vq->ready)
> + continue;
> +
> + while (vringh_getdesc_iotlb(&vq->vring, &vq->iov, &vq->iov,
> + &vq->head, GFP_ATOMIC) > 0) {
> +
> + int write;
> +
> + vq->iov.i = vq->iov.used - 1;
> + write = vringh_iov_push_iotlb(&vq->vring, &vq->iov,
> + &status, 1);
> + if (write <= 0)
> + break;

We're lucky the guest driver doesn't crash after VIRTIO_BLK_T_GET_ID? :)


Re: [PATCH RFC 01/12] vhost-vdpa: add support for vDPA blk devices

2020-11-17 Thread Stefan Hajnoczi
On Fri, Nov 13, 2020 at 02:47:01PM +0100, Stefano Garzarella wrote:
> diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
> index 2754f3069738..fb0411594963 100644
> --- a/drivers/vhost/vdpa.c
> +++ b/drivers/vhost/vdpa.c
> @@ -22,6 +22,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include "vhost.h"
>  
> @@ -194,6 +195,9 @@ static int vhost_vdpa_config_validate(struct vhost_vdpa 
> *v,
>   case VIRTIO_ID_NET:
>   size = sizeof(struct virtio_net_config);
>   break;
> + case VIRTIO_ID_BLOCK:
> + size = sizeof(struct virtio_blk_config);
> + break;
>   }
>  
>   if (c->len == 0)

Can vdpa_config_ops->get/set_config() handle the size check instead of
hardcoding device-specific knowledge into drivers/vhost/vdpa.c?
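
For example (untested sketch; the get_config_size op is an assumption
and is not part of this patch):

  /* Untested sketch: let the parent driver report its config size so
   * vhost_vdpa_config_validate() needs no per-device-ID switch.
   */
  static int vhost_vdpa_config_validate(struct vhost_vdpa *v,
                                        struct vhost_vdpa_config *c)
  {
          size_t size = v->vdpa->config->get_config_size(v->vdpa);

          if (c->len == 0)
                  return -EINVAL;

          if (c->off > size || c->len > size - c->off)
                  return -E2BIG;

          return 0;
  }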


Re: [PATCH] vhost/vsock: add IOTLB API support

2020-10-30 Thread Stefan Hajnoczi
On Thu, Oct 29, 2020 at 06:43:51PM +0100, Stefano Garzarella wrote:
> This patch enables the IOTLB API support for vhost-vsock devices,
> allowing the userspace to emulate an IOMMU for the guest.
> 
> These changes were made following vhost-net, in details this patch:
> - exposes VIRTIO_F_ACCESS_PLATFORM feature and inits the iotlb
>   device if the feature is acked
> - implements VHOST_GET_BACKEND_FEATURES and
>   VHOST_SET_BACKEND_FEATURES ioctls
> - calls vq_meta_prefetch() before vq processing to prefetch vq
>   metadata address in IOTLB
> - provides .read_iter, .write_iter, and .poll callbacks for the
>   chardev; they are used by the userspace to exchange IOTLB messages
> 
> This patch was tested with QEMU and a patch applied [1] to fix a
> simple issue:
> $ qemu -M q35,accel=kvm,kernel-irqchip=split \
>-drive file=fedora.qcow2,format=qcow2,if=virtio \
>-device intel-iommu,intremap=on \
>-device vhost-vsock-pci,guest-cid=3,iommu_platform=on
> 
> [1] https://lists.gnu.org/archive/html/qemu-devel/2020-10/msg09077.html
> 
> Signed-off-by: Stefano Garzarella 
> ---
>  drivers/vhost/vsock.c | 68 +++++--
>  1 file changed, 65 insertions(+), 3 deletions(-)

Reviewed-by: Stefan Hajnoczi 


Re: virtiofs: WARN_ON(out_sgs + in_sgs != total_sgs)

2020-10-06 Thread Stefan Hajnoczi
On Sun, Oct 04, 2020 at 10:31:19AM -0400, Vivek Goyal wrote:
> On Fri, Oct 02, 2020 at 10:44:37PM -0400, Qian Cai wrote:
> > On Fri, 2020-10-02 at 12:28 -0400, Qian Cai wrote:
> > > Running some fuzzing on virtiofs from a non-privileged user could trigger 
> > > a
> > > warning in virtio_fs_enqueue_req():
> > > 
> > > WARN_ON(out_sgs + in_sgs != total_sgs);
> > 
> > Okay, I can reproduce this after running for a few hours:
> > 
> > out_sgs = 3, in_sgs = 2, total_sgs = 6
> 
> Thanks. I can also reproduce it simply by calling.
> 
> ioctl(fd, 0x5a004000, buf);
> 
> I think following WARN_ON() is not correct.
> 
> WARN_ON(out_sgs + in_sgs != total_sgs)
> 
> total_sgs should actually be max sgs. It looks at ap->num_pages and
> counts one sg for each page. And it assumes that same number of
> pages will be used both for input and output.
> 
> But there are no such guarantees. With above ioctl() call, I noticed
> we are using 2 pages for input (out_sgs) and one page for output (in_sgs).
> 
> So out_sgs=4, in_sgs=3 and total_sgs=8 and warning triggers.
> 
> I think total sgs is actually max number of sgs and warning
> should probably be.
> 
> WARN_ON(out_sgs + in_sgs >  total_sgs)
> 
> Stefan, WDYT?

It should be possible to calculate total_sgs precisely (not a maximum).
Treating it as a maximum could hide bugs.

Maybe sg_count_fuse_req() should count in_args/out_args[numargs -
1].size pages instead of adding ap->num_pages.
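
Something like this (untested sketch; names are approximate, based on
my reading of fs/fuse/virtio_fs.c) would count only the pages actually
covered by the argument's byte length:

  /* Untested sketch: count sg entries from the number of bytes actually
   * transferred rather than assuming all of ap->num_pages in both
   * directions.
   */
  static unsigned int sg_count_fuse_pages(struct fuse_page_desc *page_descs,
                                          unsigned int num_pages,
                                          unsigned int total_len)
  {
          unsigned int i;
          unsigned int total_sgs = 0;

          for (i = 0; i < num_pages && total_len; i++) {
                  total_sgs++;
                  total_len -= min(page_descs[i].length, total_len);
          }

          return total_sgs;
  }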

Do you have the details of struct fuse_req and struct fuse_args_pages
fields for the ioctl in question?

Thanks,
Stefan


Re: [RFC PATCH 00/24] Control VQ support in vDPA

2020-09-24 Thread Stefan Hajnoczi
On Thu, Sep 24, 2020 at 11:21:01AM +0800, Jason Wang wrote:
> This series tries to add the support for control virtqueue in vDPA.

Please include documentation for both driver authors and vhost-vdpa
ioctl users. vhost-vdpa ioctls are only documented with a single
sentence. Please add full information on arguments, return values, and a
high-level explanation of the feature (like this cover letter) to
introduce the API.

What is the policy for using virtqueue groups? My guess is:
1. virtio_vdpa simply enables all virtqueue groups.
2. vhost_vdpa relies on userspace policy on how to use virtqueue groups.
   Are the semantics of virtqueue groups documented somewhere so
   userspace knows what to do? If a vDPA driver author decides to create
   N virtqueue groups, N/2 virtqueue groups, or just 1 virtqueue group,
   how will userspace know what to do?

Maybe a document is needed to describe the recommended device-specific
virtqueue groups that vDPA drivers should implement (e.g. "put the net
control vq into its own virtqueue group")?

This could become messy with guidelines. For example, drivers might be
shipped that aren't usable for certain use cases just because the author
didn't know that a certain virtqueue grouping is advantageous.

BTW I like how general this feature is. It seems to allow vDPA devices
to be split into sub-devices for further passthrough. Who will write the
first vDPA-on-vDPA driver? :)

Stefan


Re: [PATCH v3 11/18] fuse: implement FUSE_INIT map_alignment field

2020-08-26 Thread Stefan Hajnoczi
On Wed, Aug 26, 2020 at 11:51:42AM -0400, Vivek Goyal wrote:
> On Wed, Aug 26, 2020 at 04:06:35PM +0200, Miklos Szeredi wrote:
> > On Thu, Aug 20, 2020 at 12:21 AM Vivek Goyal  wrote:
> > >
> > > The device communicates FUSE_SETUPMAPPING/FUSE_REMOVEMAPPING alignment
> > > constraints via the FUSE_INIT map_alignment field.  Parse this field and
> > > ensure our DAX mappings meet the alignment constraints.
> > >
> > > We don't actually align anything differently since our mappings are
> > > already 2MB aligned.  Just check the value when the connection is
> > > established.  If it becomes necessary to honor arbitrary alignments in
> > > the future we'll have to adjust how mappings are sized.
> > >
> > > The upshot of this commit is that we can be confident that mappings will
> > > work even when emulating x86 on Power and similar combinations where the
> > > host page sizes are different.
> > >
> > > Signed-off-by: Stefan Hajnoczi 
> > > Signed-off-by: Vivek Goyal 
> > > ---
> > >  fs/fuse/fuse_i.h  |  5 -
> > >  fs/fuse/inode.c   | 18 --
> > >  include/uapi/linux/fuse.h |  4 +++-
> > >  3 files changed, 23 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
> > > index 478c940b05b4..4a46e35222c7 100644
> > > --- a/fs/fuse/fuse_i.h
> > > +++ b/fs/fuse/fuse_i.h
> > > @@ -47,7 +47,10 @@
> > >  /** Number of dentries for each connection in the control filesystem */
> > >  #define FUSE_CTL_NUM_DENTRIES 5
> > >
> > > -/* Default memory range size, 2MB */
> > > +/*
> > > + * Default memory range size.  A power of 2 so it agrees with common 
> > > FUSE_INIT
> > > + * map_alignment values 4KB and 64KB.
> > > + */
> > > >  #define FUSE_DAX_SZ (2*1024*1024)
> > >  #define FUSE_DAX_SHIFT (21)
> > >  #define FUSE_DAX_PAGES (FUSE_DAX_SZ/PAGE_SIZE)
> > > diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> > > index b82eb61d63cc..947abdd776ca 100644
> > > --- a/fs/fuse/inode.c
> > > +++ b/fs/fuse/inode.c
> > > @@ -980,9 +980,10 @@ static void process_init_reply(struct fuse_conn *fc, 
> > > struct fuse_args *args,
> > >  {
> > > struct fuse_init_args *ia = container_of(args, typeof(*ia), args);
> > > struct fuse_init_out *arg = >out;
> > > +   bool ok = true;
> > >
> > > if (error || arg->major != FUSE_KERNEL_VERSION)
> > > -   fc->conn_error = 1;
> > > +   ok = false;
> > > else {
> > > unsigned long ra_pages;
> > >
> > > @@ -1045,6 +1046,13 @@ static void process_init_reply(struct fuse_conn 
> > > *fc, struct fuse_args *args,
> > > min_t(unsigned int, 
> > > FUSE_MAX_MAX_PAGES,
> > > max_t(unsigned int, 
> > > arg->max_pages, 1));
> > > }
> > > +   if ((arg->flags & FUSE_MAP_ALIGNMENT) &&
> > > +   (FUSE_DAX_SZ % (1ul << arg->map_alignment))) {
> > 
> > This just obfuscates "arg->map_alignment != FUSE_DAX_SHIFT".
> > 
> > So the intention was that userspace can ask the kernel for a
> > particular alignment, right?
> 
> My understanding is that device will specify alignment for
> the foffset/moffset fields in fuse_setupmapping_in/fuse_removemapping_one.
> And DAX mapping can be any size meeting that alignment constraint.
> 
> > 
> > In that case kernel can definitely succeed if the requested alignment
> > is smaller than the kernel provided one, no? 
> 
> Yes. So if map_alignment is 64K and DAX mapping size is 2MB, that's just
> fine because it meets 4K alignment constraint. Just that we can't use
> 4K size DAX mapping in that case.
> 
> > It would also make
> > sense to make this a two way negotiation.  I.e. send the largest
> > alignment (FUSE_DAX_SHIFT in this implementation) that the kernel can
> > provide in fuse_init_in.   In that case the only error would be if
> > userspace ignored the given constraints.
> 
> We could make it two way negotiation if it helps. So if we support
> multiple mapping sizes in future, say 4K, 64K, 2MB, 1GB. So idea is
> to send alignment of largest mapping size to device/user_space (1GB)
> in this case? And that will allow device to choose an a

Re: [PATCH v2 0/5] virtio mmio specification enhancement

2020-07-31 Thread Stefan Hajnoczi
On Thu, Jul 30, 2020 at 08:15:12PM +, Pincus, Josh wrote:
> We were looking into a similar enhancement for the Virt I/O MMIO transport 
> and came across this project.
> This enhancement would be perfect for us.
> 
> Has there been any progress since Feb, 2020?  It looks like the effort might 
> have stalled?

CCing Alex Bennee. I think he recently asked the same question.

Stefan


Re: [RFC PATCH 0/5] madvise MADV_DOEXEC

2020-07-31 Thread Stefan Hajnoczi
Hi,
Sileby looks interesting! I had just written up the following idea which
seems similar but includes a mechanism for revoking mappings.

Alexander Graf recently brought up an idea that solves the following
problem:

When process A passes shared memory file descriptors to process B there
is no way for process A to revoke access or change page protection bits
after passing the fd.

I'll describe the idea (not sure if it's exactly what Alexander had in
mind).

Memory view driver
--
The memory view driver allows process A to control the page table
entries of an mmap in process B. It is a character device driver that
process A opens like this:

  int fd = open("/dev/memory-view", O_RDWR);

This returns a file descriptor to a new memory view.

Next process A sets the size of the memory view:

  ftruncate(fd, 16 * GiB);

The size determines how large the memory view will be. The size is a
virtual memory concept and does not consume resources (there is no
physical memory backing this).

Process A populates the memory view with ranges from file descriptors it
wishes to share. The file descriptor can be a shared memory file
descriptor:

  int memfd = memfd_create("guest-ram", 0);
  ftruncate(memfd, 32 * GiB);

  /* Map [8GB, 10GB) at 8GB into the memory view */
  struct memview_map_fd_info map_fd_info = {
  .fd = memfd,
  .fd_offset = 8 * GiB,
  .size = 2 * GiB,
  .mem_offset = 8 * GiB,
  .flags = MEMVIEW_MAP_READ | MEMVIEW_MAP_WRITE,
  };
  ioctl(fd, MEMVIEW_MAP_FD, &map_fd_info);

It is also possible to populate the memory view from the page cache:

  int filefd = open("big-file.iso", O_RDONLY);

  /* Map [4GB, 12GB) at 0B into the memory view */
  struct memview_map_fd_info map_fd_info = {
  .fd = filefd,
  .fd_offset = 4 * GiB,
  .size = 8 * GiB,
  .mem_offset = 0,
  .flags = MEMVIEW_MAP_READ,
  };
  ioctl(fd, MEMVIEW_MAP_FD, &map_fd_info);

The memory view has now been populated like this:

Range (GiB)   Fd             Permissions
0-8           big-file.iso   read
8-10          guest-ram      read+write
10-16         (unmapped)     -

Now process A gets the "view" file descriptor for this memory view. The
view file descriptor does not allow ioctls. It can be safely passed to
process B in the knowledge that process B can only mmap or close it:

  int viewfd = ioctl(fd, MEMVIEW_GET_VIEWFD);

  ...pass viewfd to process B...

Process B receives viewfd and mmaps it:

  void *ptr = mmap(NULL, 16 * GiB, PROT_READ | PROT_WRITE, MAP_SHARED,
   viewfd, 0);

When process B accesses a page in the mmap region the memory view
driver resolves the page fault by checking if the page is mapped to an
fd and what its permissions are.

For example, accessing the page at 4GB from the start of the memory view
is an access at 8GB into big-file.iso. That's because 8GB of
big-file.iso was mapped at 0 with fd_offset 4GB.

To summarize, there is one vma in process B and the memory view driver
maps pages from the file descriptors added with ioctl(MEMVIEW_MAP_FD) by
process A.

Page protection bits are the AND of the mmap
PROT_READ/PROT_WRITE/PROT_EXEC flags with the memory view driver's
MEMVIEW_MAP_READ/MEMVIEW_MAP_WRITE/MEMVIEW_MAP_EXEC flags for the
mapping in question.

Does vmf_insert_mixed_prot() or a similar Linux API allow this?

Can the memory view driver map pages from fds without pinning the pages?

Process A can make further ioctl(MEMVIEW_MAP_FD) calls and also
ioctl(MEMVIEW_UNMAP_FD) calls to change the mappings. This requires
zapping affected process B ptes. When process B accesses those pages
again the fault handler will handle the page fault based on the latest
memory view layout.

If process B accesses a page with incorrect permissions or that has not
been configured by process A ioctl calls, a SIGSEGV/SIGBUS signal is
raised.

When process B uses mprotect(2) and other virtual memory syscalls it
is unable to increase page permissions. Instead it can only reduce them
because the pte protection bits are the AND of the mmap flags and the
memory view driver's MEMVIEW_MAP_READ/MEMVIEW_MAP_WRITE/MEMVIEW_MAP_EXEC
flags.

Use cases
-
How to use the memory view driver for vhost-user

vhost-user and other out-of-process device emulation interfaces need a
way for the VMM to enforce the IOMMU mappings on the device emulation
process.

Today the VMM passes all guest RAM fds to the device emulation process
and has no way of restricting access or revoking it later. With the
memory view driver the VMM will pass one or more memory view fds instead
of the actual guest RAM fds. This allows the VMM to invoke
ioctl(MEMVIEW_MAP_FD/MEMVIEW_UNMAP_FD) to enforce permissions or revoke
access.

How to use the memory view driver for virtio-fs
~~~
The virtiofsd vhost-user process creates a memory view for the device's
DAX Window and passes it to QEMU. QEMU installs it as a kvm.ko memory
region so that the 

Re: [PATCH 02/10] block: virtio-blk: check logical block size

2020-07-28 Thread Stefan Hajnoczi
On Tue, Jul 21, 2020 at 01:52:31PM +0300, Maxim Levitsky wrote:
> Linux kernel only supports logical block sizes which are power of two,
> at least 512 bytes and no more that PAGE_SIZE.
> 
> Check this instead of crashing later on.
> 
> Note that there is no need to check physical block size since it is
> only a hint, and virtio-blk already only supports power of two values.
> 
> Bugzilla link: https://bugzilla.redhat.com/show_bug.cgi?id=1664619
> 
> Signed-off-by: Maxim Levitsky 
> ---
>  drivers/block/virtio_blk.c | 15 +--
>  1 file changed, 13 insertions(+), 2 deletions(-)

Reviewed-by: Stefan Hajnoczi 


Re: strace of io_uring events?

2020-07-23 Thread Stefan Hajnoczi
On Tue, Jul 21, 2020 at 05:58:48PM +0200, Stefano Garzarella wrote:
> On Tue, Jul 21, 2020 at 08:27:34AM -0700, Andy Lutomirski wrote:
> > On Fri, Jul 17, 2020 at 1:02 AM Stefano Garzarella  
> > wrote:
> > >
> > > On Thu, Jul 16, 2020 at 08:12:35AM -0700, Kees Cook wrote:
> > > > On Thu, Jul 16, 2020 at 03:14:04PM +0200, Stefano Garzarella wrote:
> > 
> > > > access (IIUC) is possible without actually calling any of the io_uring
> > > > syscalls. Is that correct? A process would receive an fd (via 
> > > > SCM_RIGHTS,
> > > > pidfd_getfd, or soon seccomp addfd), and then call mmap() on it to gain
> > > > access to the SQ and CQ, and off it goes? (The only glitch I see is
> > > > waking up the worker thread?)
> > >
> > > It is true only if the io_uring istance is created with SQPOLL flag (not 
> > > the
> > > default behaviour and it requires CAP_SYS_ADMIN). In this case the
> > > kthread is created and you can also set an higher idle time for it, so
> > > also the waking up syscall can be avoided.
> > 
> > I stared at the io_uring code for a while, and I'm wondering if we're
> > approaching this the wrong way. It seems to me that most of the
> > complications here come from the fact that io_uring SQEs don't clearly
> > belong to any particular security principle.  (We have struct creds,
> > but we don't really have a task or mm.)  But I'm also not convinced
> > that io_uring actually supports cross-mm submission except by accident
> > -- as it stands, unless a user is very careful to only submit SQEs
> > that don't use user pointers, the results will be unpredictable.
> > Perhaps we can get away with this:
> > 
> > diff --git a/fs/io_uring.c b/fs/io_uring.c
> > index 74bc4a04befa..92266f869174 100644
> > --- a/fs/io_uring.c
> > +++ b/fs/io_uring.c
> > @@ -7660,6 +7660,20 @@ SYSCALL_DEFINE6(io_uring_enter, unsigned int,
> > fd, u32, to_submit,
> >  if (!percpu_ref_tryget(&ctx->refs))
> >  goto out_fput;
> > 
> > +if (unlikely(current->mm != ctx->sqo_mm)) {
> > +/*
> > + * The mm used to process SQEs will be current->mm or
> > + * ctx->sqo_mm depending on which submission path is used.
> > + * It's also unclear who is responsible for an SQE submitted
> > + * out-of-process from a security and auditing perspective.
> > + *
> > + * Until a real usecase emerges and there are clear semantics
> > + * for out-of-process submission, disallow it.
> > + */
> > +ret = -EACCES;
> > +goto out;
> > +}
> > +
> >  /*
> >   * For SQ polling, the thread will do all submissions and completions.
> >   * Just return the requested submit count, and wake the thread if
> > 
> > If we can do that, then we could bind seccomp-like io_uring filters to
> > an mm, and we get obvious semantics that ought to cover most of the
> > bases.
> > 
> > Jens, Christoph?
> > 
> > Stefano, what's your intended usecase for your restriction patchset?
> > 
> 
> Hi Andy,
> my use case concerns virtualization. The idea, that I described in the
> proposal of io-uring restrictions [1], is to share io_uring CQ and SQ queues
> with a guest VM for block operations.
> 
> In the PoC that I realized, there is a block device driver in the guest that
> uses io_uring queues coming from the host to submit block requests.
> 
> Since the guest is not trusted, we need restrictions to allow only
> a subset of syscalls on a subset of file descriptors and memory.

BTW there's only a single mm in the kvm.ko use case.

Stefan


Re: [PATCH v5 01/18] nitro_enclaves: Add ioctl interface definition

2020-07-16 Thread Stefan Hajnoczi
On Wed, Jul 15, 2020 at 10:45:23PM +0300, Andra Paraschiv wrote:
> + * A NE CPU pool has be set before calling this function. The pool can be set

s/has be/has to be/

Thanks, this looks good!

Reviewed-by: Stefan Hajnoczi 


Re: [PATCH] vhost/scsi: fix up req type endian-ness

2020-07-13 Thread Stefan Hajnoczi
On Fri, Jul 10, 2020 at 06:48:51AM -0400, Michael S. Tsirkin wrote:
> vhost/scsi doesn't handle type conversion correctly
> for request type when using virtio 1.0 and up for BE,
> or cross-endian platforms.
> 
> Fix it up using vhost_32_to_cpu.
> 
> Cc: sta...@vger.kernel.org
> Signed-off-by: Michael S. Tsirkin 
> ---
>  drivers/vhost/scsi.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi 


Re: [PATCH] vsock/virtio: annotate 'the_virtio_vsock' RCU pointer

2020-07-13 Thread Stefan Hajnoczi
On Fri, Jul 10, 2020 at 02:12:43PM +0200, Stefano Garzarella wrote:
> Commit 0deab087b16a ("vsock/virtio: use RCU to avoid use-after-free
> on the_virtio_vsock") starts to use RCU to protect 'the_virtio_vsock'
> pointer, but we forgot to annotate it.
> 
> This patch adds the annotation to fix the following sparse errors:
> 
> net/vmw_vsock/virtio_transport.c:73:17: error: incompatible types in 
> comparison expression (different address spaces):
> net/vmw_vsock/virtio_transport.c:73:17:struct virtio_vsock [noderef] 
> __rcu *
> net/vmw_vsock/virtio_transport.c:73:17:struct virtio_vsock *
> net/vmw_vsock/virtio_transport.c:171:17: error: incompatible types in 
> comparison expression (different address spaces):
> net/vmw_vsock/virtio_transport.c:171:17:struct virtio_vsock [noderef] 
> __rcu *
> net/vmw_vsock/virtio_transport.c:171:17:struct virtio_vsock *
> net/vmw_vsock/virtio_transport.c:207:17: error: incompatible types in 
> comparison expression (different address spaces):
> net/vmw_vsock/virtio_transport.c:207:17:struct virtio_vsock [noderef] 
> __rcu *
> net/vmw_vsock/virtio_transport.c:207:17:struct virtio_vsock *
> net/vmw_vsock/virtio_transport.c:561:13: error: incompatible types in 
> comparison expression (different address spaces):
> net/vmw_vsock/virtio_transport.c:561:13:struct virtio_vsock [noderef] 
> __rcu *
> net/vmw_vsock/virtio_transport.c:561:13:struct virtio_vsock *
> net/vmw_vsock/virtio_transport.c:612:9: error: incompatible types in 
> comparison expression (different address spaces):
> net/vmw_vsock/virtio_transport.c:612:9:struct virtio_vsock [noderef] 
> __rcu *
> net/vmw_vsock/virtio_transport.c:612:9:struct virtio_vsock *
> net/vmw_vsock/virtio_transport.c:631:9: error: incompatible types in 
> comparison expression (different address spaces):
> net/vmw_vsock/virtio_transport.c:631:9:struct virtio_vsock [noderef] 
> __rcu *
> net/vmw_vsock/virtio_transport.c:631:9:struct virtio_vsock *
> 
> Fixes: 0deab087b16a ("vsock/virtio: use RCU to avoid use-after-free on 
> the_virtio_vsock")
> Reported-by: Michael S. Tsirkin 
> Signed-off-by: Stefano Garzarella 
> ---
>  net/vmw_vsock/virtio_transport.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi 


Re: [PATCH RFC 0/3] io_uring: add restrictions to support untrusted applications and guests

2020-07-13 Thread Stefan Hajnoczi
On Fri, Jul 10, 2020 at 06:20:17PM +0200, Stefano Garzarella wrote:
> On Fri, Jul 10, 2020 at 11:33:09AM -0400, Konrad Rzeszutek Wilk wrote:
> > .snip..
> > > Just to recap the proposal, the idea is to add some restrictions to the
> > > operations (sqe, register, fixed file) to safely allow untrusted 
> > > applications
> > > or guests to use io_uring queues.
> > 
> > Hi!
> > 
> > This is neat and quite cool - but one thing that keeps nagging me is
> > what how much overhead does this cut from the existing setup when you use
> > virtio (with guests obviously)?
> 
> I need to do more tests, but the preliminary results that I reported on
> the original proposal [1] show an overhead of ~ 4.17 uS (with iodepth=1)
> when I'm using virtio ring processed in a dedicated iothread:
> 
>   - 73 kIOPS using virtio-blk + QEMU iothread + io_uring backend
>   - 104 kIOPS using io_uring passthrough
> 
> > That is from a high level view the
> > beaty of io_uring being passed in the guest is you don't have the
> > virtio ring -> io_uring processing, right?
> 
> Right, and potentially we can share the io_uring queues directly to the
> guest userspace applications, cutting down the cost of Linux block
> layer in the guest.

Another factor is that the guest submits requests directly to the host
kernel sqpoll thread. When a virtqueue is used the sqpoll thread cannot
poll it directly so another host thread (QEMU) needs to poll the
virtqueue. The same applies for the completion code path.

Stefan


Re: [PATCH v3 02/14] iommu: Report domain nesting info

2020-07-03 Thread Stefan Hajnoczi
On Tue, Jun 30, 2020 at 02:00:49AM +, Tian, Kevin wrote:
> > From: Liu, Yi L 
> > Sent: Monday, June 29, 2020 8:23 PM
> > 
> > Hi Stefan,
> > 
> > > From: Stefan Hajnoczi 
> > > Sent: Monday, June 29, 2020 5:25 PM
> > >
> > > On Wed, Jun 24, 2020 at 01:55:15AM -0700, Liu Yi L wrote:
> > > > +/*
> > > > + * struct iommu_nesting_info - Information for nesting-capable IOMMU.
> > > > + * user space should check it before using
> > > > + * nesting capability.
> > > > + *
> > > > + * @size:  size of the whole structure
> > > > + * @format: PASID table entry format, the same definition with
> > > > + * @format of struct iommu_gpasid_bind_data.
> > > > + * @features:  supported nesting features.
> > > > + * @flags: currently reserved for future extension.
> > > > + * @data:  vendor specific cap info.
> > > > + *
> > > > + * +---------------+-----------------------------------------------------+
> > > > + * | feature       |  Notes                                              |
> > > > + * +===============+=====================================================+
> > > > + * | SYSWIDE_PASID |  Kernel manages PASID in system wide, PASIDs used   |
> > > > + * |               |  in the system should be allocated by host kernel   |
> > > > + * +---------------+-----------------------------------------------------+
> > > > + * | BIND_PGTBL    |  bind page tables to host PASID, the PASID could    |
> > > > + * |               |  either be a host PASID passed in bind request or   |
> > > > + * |               |  default PASIDs (e.g. default PASID of aux-domain)  |
> > > > + * +---------------+-----------------------------------------------------+
> > > > + * | CACHE_INVLD   |  mandatory feature for nesting capable IOMMU        |
> > > > + * +---------------+-----------------------------------------------------+
> > >
> > > This feature description is vague about what CACHE_INVLD does and how
> > to
> > > use it. If I understand correctly, the presence of this feature means
> > > that VFIO_IOMMU_NESTING_OP_CACHE_INVLD must be used?
> > >
> > > The same kind of clarification could be done for SYSWIDE_PASID and
> > > BIND_PGTBL too.
> > 
> > For SYSWIDE_PASID and BIND_PGTBL, yes, presence of the feature bit
> > means must use. So the two are requirements to user space if it wants
> > to setup nesting. While for CACHE_INVLD, it's kind of availability
> > here. How about removing CACHE_INVLD as presence of BIND_PGTBL should
> > indicates support of CACHE_INVLD?
> > 
> 
> So far this assumption is correct but it may not be true when thinking 
> forward.
> For example, a vendor might find a way to allow the owner of 1st-level page
> table to directly invalidate cache w/o going through host IOMMU driver. From
> this angle I feel explicitly reporting this capability is more robust.
> 
> Regarding to the description, what about below?
> 
> --
> SYSWIDE_PASID: PASIDs are managed in system-wide, instead of per device.
> When a device is assigned to userspace or VM, proper uAPI (provided by 
> userspace driver framework, e.g. VFIO) must be used to allocate/free PASIDs
> for the assigned device.
> 
> BIND_PGTBL: The owner of the first-level/stage-1 page table must explicitly 
> bind the page table to associated PASID (either the one specified in bind 
> request or the default PASID of the iommu domain), through VFIO_IOMMU
> _NESTING_OP
> 
> CACHE_INVLD: The owner of the first-level/stage-1 page table must
> explicitly invalidate the IOMMU cache through VFIO_IOMMU_NESTING_OP,
> according to vendor-specific requirement when changing the page table.
> --

Mentioning the API to allocate/free PASIDs and VFIO_IOMMU_NESTING_OP has
made this clearer. This lets someone reading the documentation know
where to look for further information on using these features.

Thank you!

Stefan


Re: [PATCH v3 02/14] iommu: Report domain nesting info

2020-06-29 Thread Stefan Hajnoczi
On Wed, Jun 24, 2020 at 01:55:15AM -0700, Liu Yi L wrote:
> +/*
> + * struct iommu_nesting_info - Information for nesting-capable IOMMU.
> + *   user space should check it before using
> + *   nesting capability.
> + *
> + * @size:     size of the whole structure
> + * @format:   PASID table entry format, the same definition with
> + *            @format of struct iommu_gpasid_bind_data.
> + * @features: supported nesting features.
> + * @flags:    currently reserved for future extension.
> + * @data:     vendor specific cap info.
> + *
> + * +---------------+-----------------------------------------------------+
> + * | feature       |  Notes                                              |
> + * +===============+=====================================================+
> + * | SYSWIDE_PASID |  Kernel manages PASID in system wide, PASIDs used   |
> + * |               |  in the system should be allocated by host kernel   |
> + * +---------------+-----------------------------------------------------+
> + * | BIND_PGTBL    |  bind page tables to host PASID, the PASID could    |
> + * |               |  either be a host PASID passed in bind request or   |
> + * |               |  default PASIDs (e.g. default PASID of aux-domain)  |
> + * +---------------+-----------------------------------------------------+
> + * | CACHE_INVLD   |  mandatory feature for nesting capable IOMMU        |
> + * +---------------+-----------------------------------------------------+

This feature description is vague about what CACHE_INVLD does and how to
use it. If I understand correctly, the presence of this feature means
that VFIO_IOMMU_NESTING_OP_CACHE_INVLD must be used?

The same kind of clarification could be done for SYSWIDE_PASID and
BIND_PGTBL too.

Stefan


Re: [PATCH v3 13/14] vfio: Document dual stage control

2020-06-29 Thread Stefan Hajnoczi
On Wed, Jun 24, 2020 at 01:55:26AM -0700, Liu Yi L wrote:
> +Details can be found in Documentation/userspace-api/iommu.rst. For Intel
> +VT-d, each stage 1 page table is bound to host by:
> +
> +nesting_op->flags = VFIO_IOMMU_NESTING_OP_BIND_PGTBL;
> +memcpy(&nesting_op->data, &bind_data, sizeof(bind_data));
> +ioctl(container->fd, VFIO_IOMMU_NESTING_OP, nesting_op);
> +
> +As mentioned above, guest OS may use stage 1 for GIOVA->GPA or GVA->GPA.
> +GVA->GPA page tables are available when PASID (Process Address Space ID)
> +is exposed to guest. e.g. guest with PASID-capable devices assigned. For
> +such page table binding, the bind_data should include PASID info, which
> +is allocated by guest itself or by host. This depends on hardware vendor
> +e.g. Intel VT-d requires to allocate PASID from host. This requirement is
> +defined by the Virtual Command Support in VT-d 3.0 spec, guest software
> +running on VT-d should allocate PASID from host kernel. To allocate PASID
> +from host, user space should +check the IOMMU_NESTING_FEAT_SYSWIDE_PASID

s/+check/check/g

Reviewed-by: Stefan Hajnoczi 


Re: [PATCH v4 01/18] nitro_enclaves: Add ioctl interface definition

2020-06-25 Thread Stefan Hajnoczi
On Wed, Jun 24, 2020 at 05:02:54PM +0300, Paraschiv, Andra-Irina wrote:
> On 23/06/2020 11:56, Stefan Hajnoczi wrote:
> > On Mon, Jun 22, 2020 at 11:03:12PM +0300, Andra Paraschiv wrote:
> > > +/* User memory region flags */
> > > +
> > > +/* Memory region for enclave general usage. */
> > > +#define NE_DEFAULT_MEMORY_REGION (0x00)
> > > +
> > > +/* Memory region to be set for an enclave (write). */
> > > +struct ne_user_memory_region {
> > > + /**
> > > +  * Flags to determine the usage for the memory region (write).
> > > +  */
> > > + __u64 flags;
> > Where is the write flag defined?
> > 
> > I guess it's supposed to be:
> > 
> >#define NE_USER_MEMORY_REGION_FLAG_WRITE (0x01)
> 
> For now, the flags field is included in the NE ioctl interface for
> extensions, it is not part of the NE PCI device interface yet.
...
> Ah, and just as a note, that "read" / "write" in parentheses means that a
> certain data structure / field is read / written by user space. I updated to
> use "in" / "out" instead of "read" / "write" in v5.

Oops, I got confused. I thought "(write)" was an example of a flag that
can be set on the memory region. Now I realize "write" means this field
is an input to the ioctl. :)

Thanks for updating the docs.

Stefan


Re: [PATCH v4 17/18] nitro_enclaves: Add overview documentation

2020-06-25 Thread Stefan Hajnoczi
On Wed, Jun 24, 2020 at 05:39:39PM +0300, Paraschiv, Andra-Irina wrote:
> 
> 
> On 23/06/2020 11:59, Stefan Hajnoczi wrote:
> > On Mon, Jun 22, 2020 at 11:03:28PM +0300, Andra Paraschiv wrote:
> > > +The kernel bzImage, the kernel command line, the ramdisk(s) are part of 
> > > the
> > > +Enclave Image Format (EIF); plus an EIF header including metadata such 
> > > as magic
> > > +number, eif version, image size and CRC.
> > > +
> > > +Hash values are computed for the entire enclave image (EIF), the kernel 
> > > and
> > > +ramdisk(s). That's used, for example, to check that the enclave image 
> > > that is
> > > +loaded in the enclave VM is the one that was intended to be run.
> > > +
> > > +These crypto measurements are included in a signed attestation document
> > > +generated by the Nitro Hypervisor and further used to prove the identity 
> > > of the
> > > +enclave; KMS is an example of service that NE is integrated with and 
> > > that checks
> > > +the attestation doc.
> > > +
> > > +The enclave image (EIF) is loaded in the enclave memory at offset 8 MiB. 
> > > The
> > > +init process in the enclave connects to the vsock CID of the primary VM 
> > > and a
> > > +predefined port - 9000 - to send a heartbeat value - 0xb7. This 
> > > mechanism is
> > > +used to check in the primary VM that the enclave has booted.
> > > +
> > > +If the enclave VM crashes or gracefully exits, an interrupt event is 
> > > received by
> > > +the NE driver. This event is sent further to the user space enclave 
> > > process
> > > +running in the primary VM via a poll notification mechanism. Then the 
> > > user space
> > > +enclave process can exit.
> > > +
> > > +[1] https://aws.amazon.com/ec2/nitro/nitro-enclaves/
> > > +[2] https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
> > > +[3] https://lwn.net/Articles/807108/
> > > +[4] 
> > > https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
> > > +[5] https://man7.org/linux/man-pages/man7/vsock.7.html
> > Is the EIF specification and the attestation protocol available?
> 
> For now, they are not publicly available. Once the refs are available (e.g.
> AWS documentation, GitHub documentation), I'll include them in the kernel
> documentation as well.
> 
> As a note here, the NE project is currently in preview
> (https://aws.amazon.com/ec2/nitro/nitro-enclaves/) and part of the
> documentation / codebase will be publicly available when NE is generally
> available (GA). This will be in addition to the ones already publicly
> available, like the NE kernel driver.
> 
> Let me know if I can help with any particular questions / clarifications.

Thanks!

Stefan


Re: [PATCH v4 17/18] nitro_enclaves: Add overview documentation

2020-06-23 Thread Stefan Hajnoczi
On Mon, Jun 22, 2020 at 11:03:28PM +0300, Andra Paraschiv wrote:
> +The kernel bzImage, the kernel command line, the ramdisk(s) are part of the
> +Enclave Image Format (EIF); plus an EIF header including metadata such as 
> magic
> +number, eif version, image size and CRC.
> +
> +Hash values are computed for the entire enclave image (EIF), the kernel and
> +ramdisk(s). That's used, for example, to check that the enclave image that is
> +loaded in the enclave VM is the one that was intended to be run.
> +
> +These crypto measurements are included in a signed attestation document
> +generated by the Nitro Hypervisor and further used to prove the identity of 
> the
> +enclave; KMS is an example of service that NE is integrated with and that 
> checks
> +the attestation doc.
> +
> +The enclave image (EIF) is loaded in the enclave memory at offset 8 MiB. The
> +init process in the enclave connects to the vsock CID of the primary VM and a
> +predefined port - 9000 - to send a heartbeat value - 0xb7. This mechanism is
> +used to check in the primary VM that the enclave has booted.
> +
> +If the enclave VM crashes or gracefully exits, an interrupt event is 
> received by
> +the NE driver. This event is sent further to the user space enclave process
> +running in the primary VM via a poll notification mechanism. Then the user 
> space
> +enclave process can exit.
> +
> +[1] https://aws.amazon.com/ec2/nitro/nitro-enclaves/
> +[2] https://www.kernel.org/doc/Documentation/vm/hugetlbpage.txt
> +[3] https://lwn.net/Articles/807108/
> +[4] https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html
> +[5] https://man7.org/linux/man-pages/man7/vsock.7.html

Are the EIF specification and the attestation protocol available?

Stefan


signature.asc
Description: PGP signature


Re: [PATCH v4 01/18] nitro_enclaves: Add ioctl interface definition

2020-06-23 Thread Stefan Hajnoczi
On Mon, Jun 22, 2020 at 11:03:12PM +0300, Andra Paraschiv wrote:
> diff --git a/include/uapi/linux/nitro_enclaves.h 
> b/include/uapi/linux/nitro_enclaves.h
> new file mode 100644
> index ..3270eb939a97
> --- /dev/null
> +++ b/include/uapi/linux/nitro_enclaves.h
> @@ -0,0 +1,137 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
> + */
> +
> +#ifndef _UAPI_LINUX_NITRO_ENCLAVES_H_
> +#define _UAPI_LINUX_NITRO_ENCLAVES_H_
> +
> +#include <linux/types.h>
> +
> +/* Nitro Enclaves (NE) Kernel Driver Interface */
> +
> +#define NE_API_VERSION (1)
> +
> +/**
> + * The command is used to get the version of the NE API. This way the user 
> space
> + * processes can be aware of the feature sets provided by the NE kernel 
> driver.
> + *
> + * The NE API version is returned as result of this ioctl call.
> + */
> +#define NE_GET_API_VERSION _IO(0xAE, 0x20)
> +
> +/**
> + * The command is used to create a slot that is associated with an enclave 
> VM.
> + *
> + * The generated unique slot id is a read parameter of this command. An 
> enclave
> + * file descriptor is returned as result of this ioctl call. The enclave fd 
> can
> + * be further used with ioctl calls to set vCPUs and memory regions, then 
> start
> + * the enclave.
> + */
> +#define NE_CREATE_VM _IOR(0xAE, 0x21, __u64)

Information that would be useful for the ioctls:

1. Which fd the ioctl must be invoked on (/dev/nitro-enclaves, enclave fd,
   vCPU fd)

2. Errnos and their meanings

3. Which state(s) the ioctls may be invoked in (e.g. enclave
   created/started/etc), as in the sketch below
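
For example, a hypothetical doc comment covering all three points (the
errno values and state names are illustrative, not taken from the patch):

    /**
     * NE_CREATE_VM - create a slot for an enclave VM
     *
     * Invoked on: the /dev/nitro-enclaves device fd
     * Returns:    an enclave fd on success, -1 with errno set on failure
     * Errors:     ENOMEM - no enclave slot could be allocated
     *             EINVAL - invalid argument pointer
     * State:      valid at any time; the new enclave starts out in the
     *             "created" state, before vCPUs and memory are added
     */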

> +/* User memory region flags */
> +
> +/* Memory region for enclave general usage. */
> +#define NE_DEFAULT_MEMORY_REGION (0x00)
> +
> +/* Memory region to be set for an enclave (write). */
> +struct ne_user_memory_region {
> + /**
> +  * Flags to determine the usage for the memory region (write).
> +  */
> + __u64 flags;

Where is the write flag defined?

I guess it's supposed to be:

  #define NE_USER_MEMORY_REGION_FLAG_WRITE (0x01)


signature.asc
Description: PGP signature


Re: [PATCH v2 14/15] vfio: Document dual stage control

2020-06-22 Thread Stefan Hajnoczi
On Wed, Jun 17, 2020 at 06:27:27AM +, Liu, Yi L wrote:
> > From: Stefan Hajnoczi 
> > Sent: Monday, June 15, 2020 5:41 PM
> > On Thu, Jun 11, 2020 at 05:15:33AM -0700, Liu Yi L wrote:
> >
> > > From: Eric Auger 
> > >
> > > The VFIO API was enhanced to support nested stage control: a bunch of
> > > new ioctls and usage guidelines.
> > >
> > > Let's document the process to follow to set up nested mode.
> > >
> > > Cc: Kevin Tian 
> > > CC: Jacob Pan 
> > > Cc: Alex Williamson 
> > > Cc: Eric Auger 
> > > Cc: Jean-Philippe Brucker 
> > > Cc: Joerg Roedel 
> > > Cc: Lu Baolu 
> > > Signed-off-by: Eric Auger 
> > > Signed-off-by: Liu Yi L 
> > > ---
> > > v1 -> v2:
> > > *) new in v2, compared with Eric's original version, pasid table bind
> > >and fault reporting is removed as this series doesn't cover them.
> > >Original version from Eric.
> > >https://lkml.org/lkml/2020/3/20/700
> > >
> > >  Documentation/driver-api/vfio.rst | 64
> > > +++
> > >  1 file changed, 64 insertions(+)
> > >
> > > diff --git a/Documentation/driver-api/vfio.rst
> > > b/Documentation/driver-api/vfio.rst
> > > index f1a4d3c..06224bd 100644
> > > --- a/Documentation/driver-api/vfio.rst
> > > +++ b/Documentation/driver-api/vfio.rst
> > > @@ -239,6 +239,70 @@ group and can access them as follows::
> > >   /* Gratuitous device reset and go... */
> > >   ioctl(device, VFIO_DEVICE_RESET);
> > >
> > > +IOMMU Dual Stage Control
> > > +
> > > +
> > > +Some IOMMUs support 2 stages/levels of translation. Stage corresponds
> > > +to the ARM terminology while level corresponds to Intel's VTD 
> > > terminology.
> > > +In the following text we use either without distinction.
> > > +
> > > +This is useful when the guest is exposed with a virtual IOMMU and
> > > +some devices are assigned to the guest through VFIO. Then the guest
> > > +OS can use stage 1 (GIOVA -> GPA or GVA->GPA), while the hypervisor
> > > +uses stage 2 for VM isolation (GPA -> HPA).
> > > +
> > > +Under dual stage translation, the guest gets ownership of the stage 1
> > > +page tables and also owns stage 1 configuration structures. The
> > > +hypervisor owns the root configuration structure (for security
> > > +reason), including stage 2 configuration. This works as long
> > > +configuration structures and page table
> > 
> > s/as long configuration/as long as configuration/
> 
> got it.
> 
> > 
> > > +format are compatible between the virtual IOMMU and the physical IOMMU.
> > 
> > s/format/formats/
> 
> I see.
> 
> > > +
> > > +Assuming the HW supports it, this nested mode is selected by choosing
> > > +the VFIO_TYPE1_NESTING_IOMMU type through:
> > > +
> > > +ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU);
> > > +
> > > +This forces the hypervisor to use the stage 2, leaving stage 1
> > > +available for guest usage. The guest stage 1 format depends on IOMMU
> > > +vendor, and it is the same with the nesting configuration method.
> > > +User space should check the format and configuration method after
> > > +setting nesting type by
> > > +using:
> > > +
> > > +ioctl(container->fd, VFIO_IOMMU_GET_INFO, &nesting_info);
> > > +
> > > +Details can be found in Documentation/userspace-api/iommu.rst. For
> > > +Intel VT-d, each stage 1 page table is bound to host by:
> > > +
> > > +nesting_op->flags = VFIO_IOMMU_NESTING_OP_BIND_PGTBL;
> > > +memcpy(&nesting_op->data, &bind_data, sizeof(bind_data));
> > > +ioctl(container->fd, VFIO_IOMMU_NESTING_OP, nesting_op);
> > > +
> > > +As mentioned above, guest OS may use stage 1 for GIOVA->GPA or GVA->GPA.
> > > +GVA->GPA page tables are available when PASID (Process Address Space ID)
> > > +is exposed to guest. e.g. guest with PASID-capable devices assigned.
> > > +For such page table binding, the bind_data should include PASID info,
> > > +which is allocated by guest itself or by host. This depends on
> > > +hardware vendor e.g. Intel VT-d requires to allocate PASID from host.
> > > +This requirement is available by VFIO_IOMMU_GET_INFO. User space
> > > +could allocate PASID 

Re: [PATCH v2 00/15] vfio: expose virtual Shared Virtual Addressing to VMs

2020-06-22 Thread Stefan Hajnoczi
On Tue, Jun 16, 2020 at 12:09:16PM -0400, Peter Xu wrote:
> On Tue, Jun 16, 2020 at 04:49:28PM +0100, Stefan Hajnoczi wrote:
> > Isolation between applications is preserved but there is no isolation
> > between the device and the application itself. The application needs to
> > trust the device.
> > 
> > Examples:
> > 
> > 1. The device can snoop secret data from readable pages in the
> >application's virtual memory space.
> > 
> > 2. The device can gain arbitrary execution on the CPU by overwriting
> >control flow addresses (e.g. function pointers, stack return
> >addresses) in writable pages.
> 
> To me, SVA seems to be that "middle layer" of secure where it's not as safe as
> VFIO_IOMMU_MAP_DMA which has buffer level granularity of control (but of 
> course
> we pay overhead on buffer setups and on-the-fly translations), however it's 
> far
> better than DMA with no IOMMU which can ruin the whole host/guest, because
> after all we do a lot of isolations as process based.
> 
> IMHO it's the same as when we see a VM (or the QEMU process) as a whole along
> with the guest code.  In some cases we don't care if the guest did some bad
> things to mess up with its own QEMU process.  It is still ideal if we can even
> stop the guest from doing so, but when it's not easy to do it the ideal way, 
> we
> just lower the requirement to not spread the influence to the host and other
> VMs.

Makes sense.

Stefan


signature.asc
Description: PGP signature


Re: [PATCH v2 00/15] vfio: expose virtual Shared Virtual Addressing to VMs

2020-06-22 Thread Stefan Hajnoczi
On Tue, Jun 16, 2020 at 10:00:16AM -0700, Raj, Ashok wrote:
> On Tue, Jun 16, 2020 at 04:49:28PM +0100, Stefan Hajnoczi wrote:
> > On Tue, Jun 16, 2020 at 02:26:38AM +, Tian, Kevin wrote:
> > > > From: Stefan Hajnoczi 
> > > > Sent: Monday, June 15, 2020 6:02 PM
> > > > 
> > > > On Thu, Jun 11, 2020 at 05:15:19AM -0700, Liu Yi L wrote:
> > > > > Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
> > > > > Intel platforms allows address space sharing between device DMA and
> > > > > applications. SVA can reduce programming complexity and enhance
> > > > security.
> > > > >
> > > > > This VFIO series is intended to expose SVA usage to VMs. i.e. Sharing
> > > > > guest application address space with passthru devices. This is called
> > > > > vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU
> > > > > changes. For IOMMU and QEMU changes, they are in separate series 
> > > > > (listed
> > > > > in the "Related series").
> > > > >
> > > > > The high-level architecture for SVA virtualization is as below, the 
> > > > > key
> > > > > design of vSVA support is to utilize the dual-stage IOMMU translation 
> > > > > (
> > > > > also known as IOMMU nesting translation) capability in host IOMMU.
> > > > >
> > > > >
> > > > > .-------------.  .---------------------------.
> > > > > |   vIOMMU    |  | Guest process CR3, FL only|
> > > > > |             |  '---------------------------'
> > > > > .----------------/
> > > > > | PASID Entry |--- PASID cache flush -
> > > > > '-------------'   |
> > > > > |             |   V
> > > > > |             |  CR3 in GPA
> > > > > '-------------'
> > > > > Guest
> > > > > ------| Shadow |--------------------------|------
> > > > >       v        v                          v
> > > > > Host
> > > > > .-------------.  .----------------------------.
> > > > > |   pIOMMU    |  | Bind FL for GVA-GPA        |
> > > > > |             |  '----------------------------'
> > > > > .----------------/  |
> > > > > | PASID Entry |     V (Nested xlate)
> > > > > '----------------\.------------------------------.
> > > > > |             |   |SL for GPA-HPA, default domain|
> > > > > |             |   '------------------------------'
> > > > > '-------------'
> > > > > Where:
> > > > >  - FL = First level/stage one page tables
> > > > >  - SL = Second level/stage two page tables
> > > > 
> > > > Hi,
> > > > Looks like an interesting feature!
> > > > 
> > > > To check I understand this feature: can applications now pass virtual
> > > > addresses to devices instead of translating to IOVAs?
> > > > 
> > > > If yes, can guest applications restrict the vSVA address space so the
> > > > device only has access to certain regions?
> > > > 
> > > > On one hand replacing IOVA translation with virtual addresses simplifies
> > > > the application programming model, but does it give up isolation if the
> > > > device can now access all application memory?
> > > > 
> > > 
> > > with SVA each application is allocated with a unique PASID to tag its
> > > virtual address space. The device that claims SVA support must guarantee 
> > > that one application can only program the device to access its own virtual
> > > address space (i.e. all DMAs triggered by this application are tagged with
> > > the application's PASID, and are translated by IOMMU's PASID-granular
> > > page table). So, isolation is not sacrificed in SVA.
> > 
> > Isolation between applications is preserved but there is no isolation
> > between the device and the application itself. The application needs to
> > trust the device.
> 
> Right. With all convenience comes security trust. With SVA there is an
> expectation that the device has the required security boundaries properly
> implemented. FWIW, what is our guarantee today that VF's are secure from
> one another or even its own PF? They can also generate transactions with
>

Re: [RFC v9 09/11] vhost/scsi: switch to buf APIs

2020-06-22 Thread Stefan Hajnoczi
On Fri, Jun 19, 2020 at 08:23:00PM +0200, Eugenio Pérez wrote:
> @@ -1139,9 +1154,9 @@ vhost_scsi_send_tmf_reject(struct vhost_scsi *vs,
>  	iov_iter_init(&iov_iter, READ, &vq->iov[vc->out], vc->in, sizeof(rsp));
>  
>  	ret = copy_to_iter(&rsp, sizeof(rsp), &iov_iter);
> -	if (likely(ret == sizeof(rsp)))
> -		vhost_add_used_and_signal(&vs->dev, vq, vc->head, 0);
> -	else
> +	if (likely(ret == sizeof(rsp))) {
> +		vhost_scsi_signal_noinput(&vs->dev, vq, &vc->buf);
> +	} else
>   pr_err("Faulted on virtio_scsi_ctrl_tmf_resp\n");
>  }

The curly brackets are not necessary, but the patch still looks fine:

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [RFC v9 10/11] vhost/vsock: switch to the buf API

2020-06-22 Thread Stefan Hajnoczi
On Fri, Jun 19, 2020 at 08:23:01PM +0200, Eugenio Pérez wrote:
> From: "Michael S. Tsirkin" 
> 
> A straight-forward conversion.
> 
> Signed-off-by: Michael S. Tsirkin 
> Signed-off-by: Eugenio Pérez 
> ---
>  drivers/vhost/vsock.c | 30 ++
>  1 file changed, 18 insertions(+), 12 deletions(-)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH V2] drivers/block: Use kobj_to_dev() API

2020-06-22 Thread Stefan Hajnoczi
On Sat, Jun 20, 2020 at 09:53:43AM +0800, Wang Qing wrote:
> Use kobj_to_dev() API instead of container_of().
> 
> Signed-off-by: Wang Qing 
> ---
>  drivers/block/virtio_blk.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH] drivers\block: Use kobj_to_dev() API

2020-06-19 Thread Stefan Hajnoczi
On Fri, Jun 12, 2020 at 03:10:56PM +0800, Wang Qing wrote:
> Use kobj_to_dev() API instead of container_of().
> 
> Signed-off-by: Wang Qing 
> ---
>  drivers/block/virtio_blk.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>  mode change 100644 => 100755 drivers/block/virtio_blk.c

Please fix the '\' -> '/' in the commit message. Looks good otherwise:

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


[RFC 1/2] genirq: honor device NUMA node when allocating descs

2020-06-17 Thread Stefan Hajnoczi
Use the device's NUMA node instead of the first masked CPU's node when
descs are allocated. The mask may include all CPUs and therefore not
correspond to the home NUMA node of the device.

Signed-off-by: Stefan Hajnoczi 
---
 kernel/irq/irqdesc.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/irq/irqdesc.c b/kernel/irq/irqdesc.c
index 1a7723604399..b9c4160d72c4 100644
--- a/kernel/irq/irqdesc.c
+++ b/kernel/irq/irqdesc.c
@@ -488,7 +488,8 @@ static int alloc_descs(unsigned int start, unsigned int cnt, int node,
 					IRQD_MANAGED_SHUTDOWN;
 			}
 			mask = &affinity->mask;
-			node = cpu_to_node(cpumask_first(mask));
+			if (node == NUMA_NO_NODE)
+				node = cpu_to_node(cpumask_first(mask));
 			affinity++;
 		}
 
-- 
2.26.2



[RFC 0/2] genirq: take device NUMA node into account for managed IRQs

2020-06-17 Thread Stefan Hajnoczi
Devices with a small number of managed IRQs do not benefit from spreading
across all CPUs. Instead they benefit from NUMA node affinity so that IRQs are
handled on the device's NUMA node.

For example, here is a machine with a virtio-blk PCI device on NUMA node 1:

  # lstopo-no-graphics
  Machine (958MB total)
Package L#0
  NUMANode L#0 (P#0 491MB)
  L3 L#0 (16MB) + L2 L#0 (4096KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
Package L#1
  NUMANode L#1 (P#1 466MB)
  L3 L#1 (16MB) + L2 L#1 (4096KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
  HostBridge
PCIBridge
  PCI c9:00.0 (SCSI)
Block "vdb"
HostBridge
  PCIBridge
PCI 02:00.0 (Ethernet)
  Net "enp2s0"
  PCIBridge
PCI 05:00.0 (SCSI)
  Block "vda"
  PCI 00:1f.2 (SATA)

Currently the virtio5-req.0 IRQ for the vdb device gets assigned to CPU 0:

  # cat /proc/interrupts
 CPU0   CPU1
  ...
   36:  0  0   PCI-MSI 105381888-edge  virtio5-config
   37: 81  0   PCI-MSI 105381889-edge  virtio5-req.0

If managed IRQ assignment takes the device's NUMA node into account then CPU 1
will be used instead:

  # cat /proc/interrupts
 CPU0   CPU1
  ...
   36:  0  0   PCI-MSI 105381888-edge  virtio5-config
   37:  0 92   PCI-MSI 105381889-edge  virtio5-req.0

The fio benchmark with 4KB random read running on CPU 1 increases IOPS by 58%:

  Name    IOPS       Error
  Before  26720.59   ± 0.28%
  After   42373.79   ± 0.54%

Now most of this improvement is not due to NUMA but just because the requests
complete on the same CPU where they were submitted. However, if the IRQ is on
CPU 0 and fio also runs on CPU 0 only 39600 IOPS is achieved, not the full
42373 IOPS that we get when NUMA affinity is honored. So it is worth taking
NUMA into account to achieve maximum performance.
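
As a quick sanity check, something along these lines can compare the
device's NUMA node with the IRQ's effective affinity (a sketch using the
standard sysfs/procfs attributes; the PCI address and IRQ number are the
ones from the example above, and error handling is omitted):

    #include <stdio.h>

    int main(void)
    {
            char node[16], cpus[64];
            FILE *f;

            /* NUMA node of the vdb device (PCI c9:00.0 above) */
            f = fopen("/sys/bus/pci/devices/0000:c9:00.0/numa_node", "r");
            fgets(node, sizeof(node), f);
            fclose(f);

            /* CPUs that virtio5-req.0 (IRQ 37 above) is routed to */
            f = fopen("/proc/irq/37/effective_affinity_list", "r");
            fgets(cpus, sizeof(cpus), f);
            fclose(f);

            printf("device node: %s irq cpus: %s", node, cpus);
            return 0;
    }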

The following patches are a hack that uses the device's NUMA node when
assigning managed IRQs. They are not mergeable but I hope they will help start
the discussion. One bug is that they affect all managed IRQs, even for devices
with many IRQs where spreading across all CPUs is a good policy.

Please let me know what you think:

1. Is there a reason why managed IRQs should *not* take NUMA into account that
   I've missed?

2. Is there a better place to implement this logic? For example,
   pci_alloc_irq_vectors_affinity() where the cpumasks are calculated.

Any suggestions on how to proceed would be appreciated. Thanks!

Stefan Hajnoczi (2):
  genirq: honor device NUMA node when allocating descs
  genirq/matrix: take NUMA into account for managed IRQs

 include/linux/irq.h   |  2 +-
 arch/x86/kernel/apic/vector.c |  3 ++-
 kernel/irq/irqdesc.c  |  3 ++-
 kernel/irq/matrix.c   | 16 
 4 files changed, 17 insertions(+), 7 deletions(-)

-- 
2.26.2



[RFC 2/2] genirq/matrix: take NUMA into account for managed IRQs

2020-06-17 Thread Stefan Hajnoczi
Select CPUs from the IRQ's NUMA node in preference over other CPUs. This
ensures that managed IRQs are assigned to the same NUMA node as the
device.

Signed-off-by: Stefan Hajnoczi 
---
 include/linux/irq.h   |  2 +-
 arch/x86/kernel/apic/vector.c |  3 ++-
 kernel/irq/matrix.c   | 16 
 3 files changed, 15 insertions(+), 6 deletions(-)

diff --git a/include/linux/irq.h b/include/linux/irq.h
index 8d5bc2c237d7..bdc3faa3c280 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -1202,7 +1202,7 @@ void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit, bool replace
 int irq_matrix_reserve_managed(struct irq_matrix *m, const struct cpumask *msk);
 void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk);
 int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk,
-			     unsigned int *mapped_cpu);
+			     int node, unsigned int *mapped_cpu);
 void irq_matrix_reserve(struct irq_matrix *m);
 void irq_matrix_remove_reserved(struct irq_matrix *m);
 int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk,
diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 67768e54438b..8eb10b0d981d 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -309,6 +309,7 @@ assign_managed_vector(struct irq_data *irqd, const struct cpumask *dest)
 {
const struct cpumask *affmsk = irq_data_get_affinity_mask(irqd);
struct apic_chip_data *apicd = apic_chip_data(irqd);
+   int node = irq_data_get_node(irqd);
int vector, cpu;
 
cpumask_and(vector_searchmask, dest, affmsk);
@@ -317,7 +318,7 @@ assign_managed_vector(struct irq_data *irqd, const struct cpumask *dest)
if (apicd->vector && cpumask_test_cpu(apicd->cpu, vector_searchmask))
return 0;
 	vector = irq_matrix_alloc_managed(vector_matrix, vector_searchmask,
-					  &cpu);
+					  node, &cpu);
trace_vector_alloc_managed(irqd->irq, vector, vector);
if (vector < 0)
return vector;
diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c
index 30cc217b8631..ee35b6172b64 100644
--- a/kernel/irq/matrix.c
+++ b/kernel/irq/matrix.c
@@ -148,7 +148,8 @@ static unsigned int matrix_find_best_cpu(struct irq_matrix *m,
 
 /* Find the best CPU which has the lowest number of managed IRQs allocated */
 static unsigned int matrix_find_best_cpu_managed(struct irq_matrix *m,
-						 const struct cpumask *msk)
+						 const struct cpumask *msk,
+						 int node)
 {
unsigned int cpu, best_cpu, allocated = UINT_MAX;
struct cpumap *cm;
@@ -156,6 +157,9 @@ static unsigned int matrix_find_best_cpu_managed(struct irq_matrix *m,
best_cpu = UINT_MAX;
 
for_each_cpu(cpu, msk) {
+   if (node != NUMA_NO_NODE && cpu_to_node(cpu) != node)
+   continue;
+
cm = per_cpu_ptr(m->maps, cpu);
 
if (!cm->online || cm->managed_allocated > allocated)
@@ -280,10 +284,12 @@ void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk)
 /**
  * irq_matrix_alloc_managed - Allocate a managed interrupt in a CPU map
  * @m: Matrix pointer
- * @cpu:	On which CPU the interrupt should be allocated
+ * @mask:	The mask of CPUs on which the interrupt can be allocated
+ * @node:	The preferred NUMA node
+ * @mapped_cpu:	The resulting CPU on which the interrupt should be allocated
  */
 int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk,
-			     unsigned int *mapped_cpu)
+			     int node, unsigned int *mapped_cpu)
 {
unsigned int bit, cpu, end = m->alloc_end;
struct cpumap *cm;
@@ -291,7 +297,9 @@ int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk,
if (cpumask_empty(msk))
return -EINVAL;
 
-   cpu = matrix_find_best_cpu_managed(m, msk);
+   cpu = matrix_find_best_cpu_managed(m, msk, node);
+   if (cpu == UINT_MAX)
+   cpu = matrix_find_best_cpu_managed(m, msk, NUMA_NO_NODE);
if (cpu == UINT_MAX)
return -ENOSPC;
 
-- 
2.26.2



Re: [PATCH v2 00/15] vfio: expose virtual Shared Virtual Addressing to VMs

2020-06-16 Thread Stefan Hajnoczi
On Tue, Jun 16, 2020 at 02:26:38AM +, Tian, Kevin wrote:
> > From: Stefan Hajnoczi 
> > Sent: Monday, June 15, 2020 6:02 PM
> > 
> > On Thu, Jun 11, 2020 at 05:15:19AM -0700, Liu Yi L wrote:
> > > Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
> > > Intel platforms allows address space sharing between device DMA and
> > > applications. SVA can reduce programming complexity and enhance
> > security.
> > >
> > > This VFIO series is intended to expose SVA usage to VMs. i.e. Sharing
> > > guest application address space with passthru devices. This is called
> > > vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU
> > > changes. For IOMMU and QEMU changes, they are in separate series (listed
> > > in the "Related series").
> > >
> > > The high-level architecture for SVA virtualization is as below, the key
> > > design of vSVA support is to utilize the dual-stage IOMMU translation (
> > > also known as IOMMU nesting translation) capability in host IOMMU.
> > >
> > >
> > > .-------------.  .---------------------------.
> > > |   vIOMMU    |  | Guest process CR3, FL only|
> > > |             |  '---------------------------'
> > > .----------------/
> > > | PASID Entry |--- PASID cache flush -
> > > '-------------'   |
> > > |             |   V
> > > |             |  CR3 in GPA
> > > '-------------'
> > > Guest
> > > ------| Shadow |--------------------------|------
> > >       v        v                          v
> > > Host
> > > .-------------.  .----------------------------.
> > > |   pIOMMU    |  | Bind FL for GVA-GPA        |
> > > |             |  '----------------------------'
> > > .----------------/  |
> > > | PASID Entry |     V (Nested xlate)
> > > '----------------\.------------------------------.
> > > |             |   |SL for GPA-HPA, default domain|
> > > |             |   '------------------------------'
> > > '-------------'
> > > Where:
> > >  - FL = First level/stage one page tables
> > >  - SL = Second level/stage two page tables
> > 
> > Hi,
> > Looks like an interesting feature!
> > 
> > To check I understand this feature: can applications now pass virtual
> > addresses to devices instead of translating to IOVAs?
> > 
> > If yes, can guest applications restrict the vSVA address space so the
> > device only has access to certain regions?
> > 
> > On one hand replacing IOVA translation with virtual addresses simplifies
> > the application programming model, but does it give up isolation if the
> > device can now access all application memory?
> > 
> 
> with SVA each application is allocated with a unique PASID to tag its
> virtual address space. The device that claims SVA support must guarantee 
> that one application can only program the device to access its own virtual
> address space (i.e. all DMAs triggered by this application are tagged with
> the application's PASID, and are translated by IOMMU's PASID-granular
> page table). So, isolation is not sacrificed in SVA.

Isolation between applications is preserved but there is no isolation
between the device and the application itself. The application needs to
trust the device.

Examples:

1. The device can snoop secret data from readable pages in the
   application's virtual memory space.

2. The device can gain arbitrary execution on the CPU by overwriting
   control flow addresses (e.g. function pointers, stack return
   addresses) in writable pages.

Stefan


signature.asc
Description: PGP signature


Re: [PATCH v2 00/15] vfio: expose virtual Shared Virtual Addressing to VMs

2020-06-16 Thread Stefan Hajnoczi
On Mon, Jun 15, 2020 at 12:39:40PM +, Liu, Yi L wrote:
> > From: Stefan Hajnoczi 
> > Sent: Monday, June 15, 2020 6:02 PM
> > 
> > On Thu, Jun 11, 2020 at 05:15:19AM -0700, Liu Yi L wrote:
> > > Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
> > > Intel platforms allows address space sharing between device DMA and
> > > applications. SVA can reduce programming complexity and enhance security.
> > >
> > > This VFIO series is intended to expose SVA usage to VMs. i.e. Sharing
> > > guest application address space with passthru devices. This is called
> > > vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU
> > > changes. For IOMMU and QEMU changes, they are in separate series (listed
> > > in the "Related series").
> > >
> > > The high-level architecture for SVA virtualization is as below, the key
> > > design of vSVA support is to utilize the dual-stage IOMMU translation (
> > > also known as IOMMU nesting translation) capability in host IOMMU.
> > >
> > >
> > > .-------------.  .---------------------------.
> > > |   vIOMMU    |  | Guest process CR3, FL only|
> > > |             |  '---------------------------'
> > > .----------------/
> > > | PASID Entry |--- PASID cache flush -
> > > '-------------'   |
> > > |             |   V
> > > |             |  CR3 in GPA
> > > '-------------'
> > > Guest
> > > ------| Shadow |--------------------------|------
> > >       v        v                          v
> > > Host
> > > .-------------.  .----------------------------.
> > > |   pIOMMU    |  | Bind FL for GVA-GPA        |
> > > |             |  '----------------------------'
> > > .----------------/  |
> > > | PASID Entry |     V (Nested xlate)
> > > '----------------\.------------------------------.
> > > |             |   |SL for GPA-HPA, default domain|
> > > |             |   '------------------------------'
> > > '-------------'
> > > Where:
> > >  - FL = First level/stage one page tables
> > >  - SL = Second level/stage two page tables
> > 
> > Hi,
> > Looks like an interesting feature!
> 
> thanks for the interest. Stefan :-)
> 
> > To check I understand this feature: can applications now pass virtual
> > addresses to devices instead of translating to IOVAs?
> 
> yes, application could pass virtual addresses to device directly. As
> long as the virtual address is mapped in cpu page table, then IOMMU
> would get it translated to physical address.
> 
> > If yes, can guest applications restrict the vSVA address space so the
> > device only has access to certain regions?
> 
> do you mean restrict the access of certain virtual address regions of
> guest application ? or certain guest memory? :-)

Your reply below answered my question. I was wondering if applications
can protect parts of their virtual memory space that should not be
accessed by the device. It makes sense that there is a trade-off to
simplify the programming model and performance might also be better if
the application doesn't need to DMA map/unmap buffers frequently.
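
For contrast, the per-buffer model being traded away looks roughly like
this with the classic type1 API (a sketch; container and buf are assumed
to exist as in the snippets earlier in the thread, and the IOVA and size
values are illustrative):

    struct vfio_iommu_type1_dma_map map = {
            .argsz = sizeof(map),
            .flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE,
            .vaddr = (__u64)(uintptr_t)buf,  /* application buffer */
            .iova  = 0x100000,               /* illustrative IOVA */
            .size  = 4096,
    };
    struct vfio_iommu_type1_dma_unmap unmap = {
            .argsz = sizeof(unmap),
            .iova  = 0x100000,
            .size  = 4096,
    };

    ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
    /* ... device DMAs to/from the IOVA window ... */
    ioctl(container, VFIO_IOMMU_UNMAP_DMA, &unmap);

With SVA that map/unmap round trip per buffer goes away, which is where
the performance argument comes from.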

Stefan


signature.asc
Description: PGP signature


Re: [PATCH v2 00/15] vfio: expose virtual Shared Virtual Addressing to VMs

2020-06-15 Thread Stefan Hajnoczi
On Thu, Jun 11, 2020 at 05:15:19AM -0700, Liu Yi L wrote:
> Shared Virtual Addressing (SVA), a.k.a, Shared Virtual Memory (SVM) on
> Intel platforms allows address space sharing between device DMA and
> applications. SVA can reduce programming complexity and enhance security.
> 
> This VFIO series is intended to expose SVA usage to VMs. i.e. Sharing
> guest application address space with passthru devices. This is called
> vSVA in this series. The whole vSVA enabling requires QEMU/VFIO/IOMMU
> changes. For IOMMU and QEMU changes, they are in separate series (listed
> in the "Related series").
> 
> The high-level architecture for SVA virtualization is as below, the key
> design of vSVA support is to utilize the dual-stage IOMMU translation (
> also known as IOMMU nesting translation) capability in host IOMMU.
> 
> 
> .-------------.  .---------------------------.
> |   vIOMMU    |  | Guest process CR3, FL only|
> |             |  '---------------------------'
> .----------------/
> | PASID Entry |--- PASID cache flush -
> '-------------'   |
> |             |   V
> |             |  CR3 in GPA
> '-------------'
> Guest
> ------| Shadow |--------------------------|------
>       v        v                          v
> Host
> .-------------.  .----------------------------.
> |   pIOMMU    |  | Bind FL for GVA-GPA        |
> |             |  '----------------------------'
> .----------------/  |
> | PASID Entry |     V (Nested xlate)
> '----------------\.------------------------------.
> |             |   |SL for GPA-HPA, default domain|
> |             |   '------------------------------'
> '-------------'
> Where:
>  - FL = First level/stage one page tables
>  - SL = Second level/stage two page tables

Hi,
Looks like an interesting feature!

To check I understand this feature: can applications now pass virtual
addresses to devices instead of translating to IOVAs?

If yes, can guest applications restrict the vSVA address space so the
device only has access to certain regions?

On one hand replacing IOVA translation with virtual addresses simplifies
the application programming model, but does it give up isolation if the
device can now access all application memory?

Thanks,
Stefan


signature.asc
Description: PGP signature


Re: [PATCH v2 14/15] vfio: Document dual stage control

2020-06-15 Thread Stefan Hajnoczi
On Thu, Jun 11, 2020 at 05:15:33AM -0700, Liu Yi L wrote:
> From: Eric Auger 
> 
> The VFIO API was enhanced to support nested stage control: a bunch of
> new ioctls and usage guidelines.
> 
> Let's document the process to follow to set up nested mode.
> 
> Cc: Kevin Tian 
> CC: Jacob Pan 
> Cc: Alex Williamson 
> Cc: Eric Auger 
> Cc: Jean-Philippe Brucker 
> Cc: Joerg Roedel 
> Cc: Lu Baolu 
> Signed-off-by: Eric Auger 
> Signed-off-by: Liu Yi L 
> ---
> v1 -> v2:
> *) new in v2, compared with Eric's original version, pasid table bind
>and fault reporting is removed as this series doesn't cover them.
>Original version from Eric.
>https://lkml.org/lkml/2020/3/20/700
> 
>  Documentation/driver-api/vfio.rst | 64 
> +++
>  1 file changed, 64 insertions(+)
> 
> diff --git a/Documentation/driver-api/vfio.rst 
> b/Documentation/driver-api/vfio.rst
> index f1a4d3c..06224bd 100644
> --- a/Documentation/driver-api/vfio.rst
> +++ b/Documentation/driver-api/vfio.rst
> @@ -239,6 +239,70 @@ group and can access them as follows::
>   /* Gratuitous device reset and go... */
>   ioctl(device, VFIO_DEVICE_RESET);
>  
> +IOMMU Dual Stage Control
> +
> +
> +Some IOMMUs support 2 stages/levels of translation. Stage corresponds to
> +the ARM terminology while level corresponds to Intel's VTD terminology.
> +In the following text we use either without distinction.
> +
> +This is useful when the guest is exposed with a virtual IOMMU and some
> +devices are assigned to the guest through VFIO. Then the guest OS can use
> +stage 1 (GIOVA -> GPA or GVA->GPA), while the hypervisor uses stage 2 for
> +VM isolation (GPA -> HPA).
> +
> +Under dual stage translation, the guest gets ownership of the stage 1 page
> +tables and also owns stage 1 configuration structures. The hypervisor owns
> +the root configuration structure (for security reason), including stage 2
> +configuration. This works as long configuration structures and page table

s/as long configuration/as long as configuration/

> +format are compatible between the virtual IOMMU and the physical IOMMU.

s/format/formats/

> +
> +Assuming the HW supports it, this nested mode is selected by choosing the
> +VFIO_TYPE1_NESTING_IOMMU type through:
> +
> +ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_NESTING_IOMMU);
> +
> +This forces the hypervisor to use the stage 2, leaving stage 1 available
> +for guest usage. The guest stage 1 format depends on IOMMU vendor, and
> +it is the same with the nesting configuration method. User space should
> +check the format and configuration method after setting nesting type by
> +using:
> +
> +ioctl(container->fd, VFIO_IOMMU_GET_INFO, &nesting_info);
> +
> +Details can be found in Documentation/userspace-api/iommu.rst. For Intel
> +VT-d, each stage 1 page table is bound to host by:
> +
> +nesting_op->flags = VFIO_IOMMU_NESTING_OP_BIND_PGTBL;
> +memcpy(&nesting_op->data, &bind_data, sizeof(bind_data));
> +ioctl(container->fd, VFIO_IOMMU_NESTING_OP, nesting_op);
> +
> +As mentioned above, guest OS may use stage 1 for GIOVA->GPA or GVA->GPA.
> +GVA->GPA page tables are available when PASID (Process Address Space ID)
> +is exposed to guest. e.g. guest with PASID-capable devices assigned. For
> +such page table binding, the bind_data should include PASID info, which
> +is allocated by guest itself or by host. This depends on hardware vendor
> +e.g. Intel VT-d requires to allocate PASID from host. This requirement is
> +available by VFIO_IOMMU_GET_INFO. User space could allocate PASID from
> +host by:
> +
> +req.flags = VFIO_IOMMU_ALLOC_PASID;
> +ioctl(container, VFIO_IOMMU_PASID_REQUEST, &req);

It is not clear how the userspace application determines whether PASIDs
must be allocated from the host via VFIO_IOMMU_PASID_REQUEST or if the
guest itself can allocate PASIDs. The text mentions VFIO_IOMMU_GET_INFO
but what exactly should the userspace application check?
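
For the record, a rough sketch of the kind of check I mean, assuming the
information were exposed through the existing VFIO_IOMMU_GET_INFO
capability chain (the nesting capability and its flag are hypothetical
names, only for illustration):

    struct vfio_iommu_type1_info info = { .argsz = sizeof(info) };

    ioctl(container, VFIO_IOMMU_GET_INFO, &info);
    if (info.flags & VFIO_IOMMU_INFO_CAPS) {
            /* Re-issue the ioctl with argsz large enough for the
             * capability chain, then walk it from info.cap_offset to
             * find a (hypothetical) nesting capability and test e.g.
             * cap->flags & NESTING_CAP_SYSWIDE_PASID to decide whether
             * VFIO_IOMMU_PASID_REQUEST must be used.
             */
    }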


signature.asc
Description: PGP signature


Re: [PATCH RFC v5 12/13] vhost/vsock: switch to the buf API

2020-06-08 Thread Stefan Hajnoczi
On Sun, Jun 07, 2020 at 10:11:49AM -0400, Michael S. Tsirkin wrote:
> A straight-forward conversion.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  drivers/vhost/vsock.c | 30 ++
>  1 file changed, 18 insertions(+), 12 deletions(-)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH RFC v5 11/13] vhost/scsi: switch to buf APIs

2020-06-08 Thread Stefan Hajnoczi
On Sun, Jun 07, 2020 at 10:11:46AM -0400, Michael S. Tsirkin wrote:
> Switch to buf APIs. Doing this exposes a spec violation in vhost scsi:
> all used bufs are marked with length 0.
> Fix that is left for another day.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  drivers/vhost/scsi.c | 73 ++--
>  1 file changed, 44 insertions(+), 29 deletions(-)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH RFC 11/13] vhost/scsi: switch to buf APIs

2020-06-05 Thread Stefan Hajnoczi
On Tue, Jun 02, 2020 at 09:06:20AM -0400, Michael S. Tsirkin wrote:
> Switch to buf APIs. Doing this exposes a spec violation in vhost scsi:
> all used bufs are marked with length 0.
> Fix that is left for another day.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  drivers/vhost/scsi.c | 73 ++--
>  1 file changed, 44 insertions(+), 29 deletions(-)
> 
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index c39952243fd3..c426c4e899c7 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -71,8 +71,8 @@ struct vhost_scsi_inflight {
>  };
>  
>  struct vhost_scsi_cmd {
> - /* Descriptor from vhost_get_vq_desc() for virt_queue segment */
> - int tvc_vq_desc;
> + /* Descriptor from vhost_get_avail_buf() for virt_queue segment */
> + struct vhost_buf tvc_vq_desc;
>   /* virtio-scsi initiator task attribute */
>   int tvc_task_attr;
>   /* virtio-scsi response incoming iovecs */
> @@ -213,7 +213,7 @@ struct vhost_scsi {
>   * Context for processing request and control queue operations.
>   */
>  struct vhost_scsi_ctx {
> - int head;
> + struct vhost_buf buf;
>   unsigned int out, in;
>   size_t req_size, rsp_size;
>   size_t out_size, in_size;
> @@ -443,6 +443,20 @@ static int vhost_scsi_check_stop_free(struct se_cmd 
> *se_cmd)
>   return target_put_sess_cmd(se_cmd);
>  }
>  
> +/* Signal to guest that request finished with no input buffer. */
> +/* TODO calling this when writing into buffer and most likely a bug */
> +static void vhost_scsi_signal_noinput(struct vhost_dev *vdev,
> +				      struct vhost_virtqueue *vq,
> +				      struct vhost_buf *bufp)
> +{
> +	struct vhost_buf buf = *bufp;
> +
> +	buf.in_len = 0;
> +	vhost_put_used_buf(vq, &buf);

Yes, this behavior differs from the QEMU virtio-scsi device
implementation. I think it's just a quirk that is probably my fault (I
guess I thought the length information is already encoded in the payload
SCSI headers so we have no use for the used descriptor length field).

Whether it's worth changing now is an interesting question. In theory
it would make vhost-scsi more spec compliant and guest drivers might be
happier (especially drivers for niche OSes that were only tested against
QEMU's virtio-scsi). On the other hand, it's a guest-visible change that
could break similar niche drivers that assume length is always 0.

I'd leave it as-is unless people hit issues that justify the risk of
changing it.

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH RFC 12/13] vhost/vsock: switch to the buf API

2020-06-05 Thread Stefan Hajnoczi
On Tue, Jun 02, 2020 at 09:06:22AM -0400, Michael S. Tsirkin wrote:
> A straight-forward conversion.
> 
> Signed-off-by: Michael S. Tsirkin 
> ---
>  drivers/vhost/vsock.c | 30 ++
>  1 file changed, 18 insertions(+), 12 deletions(-)

Reviewed-by: Stefan Hajnoczi 


signature.asc
Description: PGP signature


Re: [PATCH v3 01/18] nitro_enclaves: Add ioctl interface definition

2020-06-05 Thread Stefan Hajnoczi
On Mon, Jun 01, 2020 at 10:20:18AM +0300, Paraschiv, Andra-Irina wrote:
> 
> 
> On 01/06/2020 06:02, Benjamin Herrenschmidt wrote:
> > On Wed, 2020-05-27 at 09:49 +0100, Stefan Hajnoczi wrote:
> > > What about feature bits or a API version number field? If you add
> > > features to the NE driver, how will userspace detect them?
> > > 
> > > Even if you intend to always compile userspace against the exact kernel
> > > headers that the program will run on, it can still be useful to have an
> > > API version for informational purposes and to easily prevent user
> > > errors (running a new userspace binary on an old kernel where the API is
> > > different).
> > > 
> > > Finally, reserved struct fields may come in handy in the future. That
> > > way userspace and the kernel don't need to explicitly handle multiple
> > > struct sizes.
> > Beware, Greg might disagree :)
> > 
> > That said, yes, at least a way to query the API version would be
> > useful.
> 
> I see there are several thoughts with regard to extensions possibilities. :)
> 
> I added an ioctl for getting the API version, we have now a way to query
> that info. Also, I updated the sample in this patch series to check for the
> API version.

Great. The ideas are orthogonal and not all of them need to be used
together. As long as there is a way of extending the API cleanly in the
future, extensions can be made without breaking userspace.
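
A minimal sketch of the version check on the application side (the device
node name is illustrative, and the required-version constant is
application-defined):

    #include <err.h>
    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <linux/nitro_enclaves.h>

    #define APP_REQUIRED_NE_API_VERSION 1  /* app-side assumption */

    int main(void)
    {
            int fd = open("/dev/nitro_enclaves", O_RDWR);
            int version = ioctl(fd, NE_GET_API_VERSION);

            if (version < APP_REQUIRED_NE_API_VERSION)
                    errx(1, "NE API version %d is too old", version);
            return 0;
    }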

Stefan


signature.asc
Description: PGP signature


[PATCH v2] capabilities: add description for CAP_SETFCAP

2020-06-02 Thread Stefan Hajnoczi
Document the purpose of CAP_SETFCAP.  For some reason this capability
had no description while the others did.

Signed-off-by: Stefan Hajnoczi 
---
v2:
 * Rebased onto git master

 include/uapi/linux/capability.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
index e58c9636741b..c4532b0fe00f 100644
--- a/include/uapi/linux/capability.h
+++ b/include/uapi/linux/capability.h
@@ -332,6 +332,8 @@ struct vfs_ns_cap_data {
 
 #define CAP_AUDIT_CONTROL    30
 
+/* Set or remove capabilities on files */
+
 #define CAP_SETFCAP 31
 
 /* Override MAC access.
-- 
2.25.4



Re: [PATCH v3 01/18] nitro_enclaves: Add ioctl interface definition

2020-05-27 Thread Stefan Hajnoczi
On Tue, May 26, 2020 at 01:13:17AM +0300, Andra Paraschiv wrote:
> The Nitro Enclaves driver handles the enclave lifetime management. This
> includes enclave creation, termination and setting up its resources such
> as memory and CPU.
> 
> An enclave runs alongside the VM that spawned it. It is abstracted as a
> process running in the VM that launched it. The process interacts with
> the NE driver, that exposes an ioctl interface for creating an enclave
> and setting up its resources.
> 
> Include part of the KVM ioctls in the provided ioctl interface, with
> additional NE ioctl commands that e.g. triggers the enclave run.
> 
> Signed-off-by: Alexandru Vasile 
> Signed-off-by: Andra Paraschiv 
> ---
> Changelog
> 
> v2 -> v3
> 
> * Remove the GPL additional wording as SPDX-License-Identifier is already in
> place.
> 
> v1 -> v2
> 
> * Add ioctl for getting enclave image load metadata.
> * Update NE_ENCLAVE_START ioctl name to NE_START_ENCLAVE. 
> * Add entry in Documentation/userspace-api/ioctl/ioctl-number.rst for NE 
> ioctls.
> * Update NE ioctls definition based on the updated ioctl range for major and
> minor.
> ---
>  .../userspace-api/ioctl/ioctl-number.rst  |  5 +-
>  include/linux/nitro_enclaves.h| 11 
>  include/uapi/linux/nitro_enclaves.h   | 65 +++
>  3 files changed, 80 insertions(+), 1 deletion(-)
>  create mode 100644 include/linux/nitro_enclaves.h
>  create mode 100644 include/uapi/linux/nitro_enclaves.h
> 
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst 
> b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index f759edafd938..8a19b5e871d3 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -325,8 +325,11 @@ Code  Seq#    Include File            Comments
>  0xAC  00-1F  linux/raw.h
>  0xAD  00     Netfilter device in development:
>  
> -0xAE  all    linux/kvm.h             Kernel-based Virtual Machine
>  
> +0xAE  00-1F  linux/kvm.h             Kernel-based Virtual Machine
>  
> +0xAE  40-FF  linux/kvm.h             Kernel-based Virtual Machine
> +
> +0xAE  20-3F  linux/nitro_enclaves.h  Nitro Enclaves
>  0xAF  00-1F  linux/fsl_hypervisor.h  Freescale hypervisor
>  0xB0  all    RATIO devices in development:
> 

Reusing KVM ioctls seems a little hacky. Even the ioctls that are used
by this driver don't use all the fields or behave in the same way as
kvm.ko.

For example, the memory regions slot number is not used by Nitro
Enclaves.

It would be cleaner to define NE-specific ioctls instead.
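
For illustration, a dedicated definition could drop the unused slot field
entirely, along the lines of the struct ne_user_memory_region quoted
earlier in this archive from the v4 posting (the ioctl number here is
hypothetical):

    struct ne_user_memory_region {
            __u64 flags;
            __u64 memory_size;
            __u64 userspace_addr;
    };

    #define NE_SET_USER_MEMORY_REGION \
            _IOW(0xAE, 0x23, struct ne_user_memory_region)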

> diff --git a/include/linux/nitro_enclaves.h b/include/linux/nitro_enclaves.h
> new file mode 100644
> index ..d91ef2bfdf47
> --- /dev/null
> +++ b/include/linux/nitro_enclaves.h
> @@ -0,0 +1,11 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/*
> + * Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
> + */
> +
> +#ifndef _LINUX_NITRO_ENCLAVES_H_
> +#define _LINUX_NITRO_ENCLAVES_H_
> +
> +#include <uapi/linux/nitro_enclaves.h>
> +
> +#endif /* _LINUX_NITRO_ENCLAVES_H_ */
> diff --git a/include/uapi/linux/nitro_enclaves.h 
> b/include/uapi/linux/nitro_enclaves.h
> new file mode 100644
> index ..3413352baf32
> --- /dev/null
> +++ b/include/uapi/linux/nitro_enclaves.h
> @@ -0,0 +1,65 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +/*
> + * Copyright 2020 Amazon.com, Inc. or its affiliates. All Rights Reserved.
> + */
> +
> +#ifndef _UAPI_LINUX_NITRO_ENCLAVES_H_
> +#define _UAPI_LINUX_NITRO_ENCLAVES_H_
> +
> +#include 
> +#include 
> +
> +/* Nitro Enclaves (NE) Kernel Driver Interface */
> +
> +/**
> + * The command is used to get information needed for in-memory enclave image
> + * loading e.g. offset in enclave memory to start placing the enclave image.
> + *
> + * The image load metadata is an in / out data structure. It includes info
> + * provided by the caller - flags - and returns the offset in enclave memory
> + * where to start placing the enclave image.
> + */
> +#define NE_GET_IMAGE_LOAD_METADATA _IOWR(0xAE, 0x20, struct image_load_metadata)
> +
> +/**
> + * The command is used to trigger enclave start after the enclave resources,
> + * such as memory and CPU, have been set.
> + *
> + * The enclave 

Re: [PATCH v1 00/15] Add support for Nitro Enclaves

2020-05-11 Thread Stefan Hajnoczi
On Sun, May 10, 2020 at 11:02:18AM +, Herrenschmidt, Benjamin wrote:
> On Sat, 2020-05-09 at 21:21 +0200, Pavel Machek wrote:
> > 
> > On Fri 2020-05-08 10:00:27, Paraschiv, Andra-Irina wrote:
> > > 
> > > 
> > > On 07/05/2020 20:44, Pavel Machek wrote:
> > > > 
> > > > Hi!
> > > > 
> > > > > > it uses its own memory and CPUs + its virtio-vsock emulated device 
> > > > > > for
> > > > > > communication with the primary VM.
> > > > > > 
> > > > > > The memory and CPUs are carved out of the primary VM, they are 
> > > > > > dedicated
> > > > > > for the enclave. The Nitro hypervisor running on the host ensures 
> > > > > > memory
> > > > > > and CPU isolation between the primary VM and the enclave VM.
> > > > > > 
> > > > > > These two components need to reflect the same state e.g. when the
> > > > > > enclave abstraction process (1) is terminated, the enclave VM (2) is
> > > > > > terminated as well.
> > > > > > 
> > > > > > With regard to the communication channel, the primary VM has its own
> > > > > > emulated virtio-vsock PCI device. The enclave VM has its own 
> > > > > > emulated
> > > > > > virtio-vsock device as well. This channel is used, for example, to 
> > > > > > fetch
> > > > > > data in the enclave and then process it. An application that sets 
> > > > > > up the
> > > > > > vsock socket and connects or listens, depending on the use case, is 
> > > > > > then
> > > > > > developed to use this channel; this happens on both ends - primary 
> > > > > > VM
> > > > > > and enclave VM.
> > > > > > 
> > > > > > Let me know if further clarifications are needed.
> > > > > 
> > > > > Thanks, this is all useful.  However can you please clarify the
> > > > > low-level details here?
> > > > 
> > > > Is the virtual machine manager open-source? If so, I guess pointer for 
> > > > sources
> > > > would be useful.
> > > 
> > > Hi Pavel,
> > > 
> > > Thanks for reaching out.
> > > 
> > > The VMM that is used for the primary / parent VM is not open source.
> > 
> > Do we want to merge code that opensource community can not test?
> 
> Hehe.. this isn't quite the story Pavel :)
> 
> We merge support for proprietary hypervisors, this is no different. You
> can test it, well at least you'll be able to ... when AWS deploys the
> functionality. You don't need the hypervisor itself to be open source.
> 
> In fact, in this case, it's not even low level invasive arch code like
> some of the above can be. It's a driver for a PCI device :-) Granted a
> virtual one. We merge drivers for PCI devices routinely without the RTL
> or firmware of those devices being open source.
> 
> So yes, we probably want this if it's going to be a useful features to
> users when running on AWS EC2. (Disclaimer: I work for AWS these days).

I agree that the VMM does not need to be open source.

What is missing though are details of the enclave's initial state and
the image format required to boot code. Until this documentation is
available only Amazon can write a userspace application that does
anything useful with this driver.

Some of the people from Amazon are long-time Linux contributors (such as
yourself!) and the intent to publish this information has been
expressed, so I'm sure that will be done.

Until then, it's cool but no one else can play with it.

Stefan


signature.asc
Description: PGP signature


Re: linux-next: Fixes tag needs some work in the vhost tree

2020-05-05 Thread Stefan Hajnoczi
On Sat, May 02, 2020 at 10:30:18AM +1000, Stephen Rothwell wrote:
> Hi all,
> 
> In commit
> 
>   ab8be610c87d ("virtio-blk: handle block_device_operations callbacks after 
> hot unplug")
> 
> Fixes tag
> 
>   Fixes: 48e4043d4529523cbc7fa8dd745bd8e2c45ce1d3
> 
> has these problem(s):
> 
>   - missing subject
> 
> Should be
> 
> Fixes: 48e4043d4529 ("virtio: add virtio disk geometry feature")
> 
> Please don't split Fixes tags over more than one line.

Got it, thanks for letting me know. I'll keep the tag on one line in the
future.

Stefan


signature.asc
Description: PGP signature


[PATCH v4] virtio-blk: handle block_device_operations callbacks after hot unplug

2020-04-30 Thread Stefan Hajnoczi
A userspace process holding a file descriptor to a virtio_blk device can
still invoke block_device_operations after hot unplug.  This leads to a
use-after-free accessing vblk->vdev in virtblk_getgeo() when
ioctl(HDIO_GETGEO) is invoked:

  BUG: unable to handle kernel NULL pointer dereference at 0090
  IP: [] virtio_check_driver_offered_feature+0x10/0x90 
[virtio]
  PGD 80003a92f067 PUD 3a930067 PMD 0
  Oops:  [#1] SMP
  CPU: 0 PID: 1310 Comm: hdio-getgeo Tainted: G   OE     
3.10.0-1062.el7.x86_64 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  task: 9be5fbfb8000 ti: 9be5fa89 task.ti: 9be5fa89
  RIP: 0010:[]  [] 
virtio_check_driver_offered_feature+0x10/0x90 [virtio]
  RSP: 0018:9be5fa893dc8  EFLAGS: 00010246
  RAX: 9be5fc3f3400 RBX: 9be5fa893e30 RCX: 
  RDX:  RSI: 0004 RDI: 9be5fbc10b40
  RBP: 9be5fa893dc8 R08: 0301 R09: 0301
  R10:  R11:  R12: 9be5fdc24680
  R13: 9be5fbc10b40 R14: 9be5fbc10480 R15: 
  FS:  7f1bfb968740() GS:9be5ffc0() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 0090 CR3: 3a894000 CR4: 00360ff0
  DR0:  DR1:  DR2: 
  DR3:  DR6: fffe0ff0 DR7: 0400
  Call Trace:
   [] virtblk_getgeo+0x47/0x110 [virtio_blk]
   [] ? handle_mm_fault+0x39d/0x9b0
   [] blkdev_ioctl+0x1f5/0xa20
   [] block_ioctl+0x41/0x50
   [] do_vfs_ioctl+0x3a0/0x5a0
   [] SyS_ioctl+0xa1/0xc0

A related problem is that virtblk_remove() leaks the vd_index_ida index
when something still holds a reference to vblk->disk during hot unplug.
This causes virtio-blk device names to be lost (vda, vdb, etc).

Fix these issues by protecting vblk->vdev with a mutex and reference
counting vblk so the vd_index_ida index can be removed in all cases.

Fixes: 48e4043d4529523cbc7fa8dd745bd8e2c45ce1d3
   ("virtio: add virtio disk geometry feature")
Reported-by: Lance Digby 
Signed-off-by: Stefan Hajnoczi 
---
v4:
 * Clarify vdev_mutex usage [Stefano and Michael]

 drivers/block/virtio_blk.c | 86 ++
 1 file changed, 78 insertions(+), 8 deletions(-)
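
For reference, the use-after-free can be reproduced from userspace along
these lines (a sketch; the device path and the hot-unplug step are
environment-specific):

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/hdreg.h>

    int main(void)
    {
            struct hd_geometry geo;
            int fd = open("/dev/vda", O_RDONLY);

            /* Hot-unplug the virtio-blk device here, e.g. via the
             * hypervisor's device removal command, then wait: */
            sleep(30);

            /* Before this patch: potential use-after-free on vblk->vdev.
             * After this patch: fails cleanly with -ENXIO. */
            ioctl(fd, HDIO_GETGEO, &geo);
            close(fd);
            return 0;
    }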

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 93468b7c6701..9d21bf0f155e 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -33,6 +33,15 @@ struct virtio_blk_vq {
 } cacheline_aligned_in_smp;
 
 struct virtio_blk {
+   /*
+* This mutex must be held by anything that may run after
+* virtblk_remove() sets vblk->vdev to NULL.
+*
+* blk-mq, virtqueue processing, and sysfs attribute code paths are
+* shut down before vblk->vdev is set to NULL and therefore do not need
+* to hold this mutex.
+*/
+   struct mutex vdev_mutex;
struct virtio_device *vdev;
 
/* The disk structure for the kernel. */
@@ -44,6 +53,13 @@ struct virtio_blk {
/* Process context for config space updates */
struct work_struct config_work;
 
+   /*
+* Tracks references from block_device_operations open/release and
+* virtio_driver probe/remove so this object can be freed once no
+* longer in use.
+*/
+   refcount_t refs;
+
/* What host tells us, plus 2 for header & tailer. */
unsigned int sg_elems;
 
@@ -295,10 +311,55 @@ static int virtblk_get_id(struct gendisk *disk, char *id_str)
return err;
 }
 
+static void virtblk_get(struct virtio_blk *vblk)
+{
+	refcount_inc(&vblk->refs);
+}
+
+static void virtblk_put(struct virtio_blk *vblk)
+{
+	if (refcount_dec_and_test(&vblk->refs)) {
+		ida_simple_remove(&vd_index_ida, vblk->index);
+		mutex_destroy(&vblk->vdev_mutex);
+		kfree(vblk);
+	}
+}
+
+static int virtblk_open(struct block_device *bd, fmode_t mode)
+{
+	struct virtio_blk *vblk = bd->bd_disk->private_data;
+	int ret = 0;
+
+	mutex_lock(&vblk->vdev_mutex);
+
+	if (vblk->vdev)
+		virtblk_get(vblk);
+	else
+		ret = -ENXIO;
+
+	mutex_unlock(&vblk->vdev_mutex);
+	return ret;
+}
+
+static void virtblk_release(struct gendisk *disk, fmode_t mode)
+{
+   struct virtio_blk *vblk = disk->private_data;
+
+   virtblk_put(vblk);
+}
+
 /* We provide getgeo only to please some old bootloader/partitioning tools */
 static int virtblk_getgeo(struct block_device *bd, struct hd_geometry *geo)
 {
struct virtio_blk *vblk = bd->bd_disk->private_data;
+   int ret = 0;
+
+	mutex_lock(&vblk->vdev_mutex);
+
+   if (!vblk->vdev) {
+   ret = -ENXIO;
+   got

Re: [PATCH v3] virtio-blk: handle block_device_operations callbacks after hot unplug

2020-04-30 Thread Stefan Hajnoczi
On Thu, Apr 30, 2020 at 10:43:23AM +0200, Stefano Garzarella wrote:
> On Wed, Apr 29, 2020 at 05:53:45PM +0100, Stefan Hajnoczi wrote:
> > A userspace process holding a file descriptor to a virtio_blk device can
> > still invoke block_device_operations after hot unplug.  This leads to a
> > use-after-free accessing vblk->vdev in virtblk_getgeo() when
> > ioctl(HDIO_GETGEO) is invoked:
> > 
> >   BUG: unable to handle kernel NULL pointer dereference at 0090
> >   IP: [] virtio_check_driver_offered_feature+0x10/0x90 
> > [virtio]
> >   PGD 80003a92f067 PUD 3a930067 PMD 0
> >   Oops:  [#1] SMP
> >   CPU: 0 PID: 1310 Comm: hdio-getgeo Tainted: G   OE    
> >  3.10.0-1062.el7.x86_64 #1
> >   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
> > rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
> >   task: 9be5fbfb8000 ti: 9be5fa89 task.ti: 9be5fa89
> >   RIP: 0010:[]  [] 
> > virtio_check_driver_offered_feature+0x10/0x90 [virtio]
> >   RSP: 0018:9be5fa893dc8  EFLAGS: 00010246
> >   RAX: 9be5fc3f3400 RBX: 9be5fa893e30 RCX: 
> >   RDX:  RSI: 0004 RDI: 9be5fbc10b40
> >   RBP: 9be5fa893dc8 R08: 0301 R09: 0301
> >   R10:  R11:  R12: 9be5fdc24680
> >   R13: 9be5fbc10b40 R14: 9be5fbc10480 R15: 
> >   FS:  7f1bfb968740() GS:9be5ffc0() 
> > knlGS:
> >   CS:  0010 DS:  ES:  CR0: 80050033
> >   CR2: 0090 CR3: 3a894000 CR4: 00360ff0
> >   DR0:  DR1:  DR2: 
> >   DR3:  DR6: fffe0ff0 DR7: 0400
> >   Call Trace:
> >[] virtblk_getgeo+0x47/0x110 [virtio_blk]
> >[] ? handle_mm_fault+0x39d/0x9b0
> >[] blkdev_ioctl+0x1f5/0xa20
> >[] block_ioctl+0x41/0x50
> >[] do_vfs_ioctl+0x3a0/0x5a0
> >[] SyS_ioctl+0xa1/0xc0
> > 
> > A related problem is that virtblk_remove() leaks the vd_index_ida index
> > when something still holds a reference to vblk->disk during hot unplug.
> > This causes virtio-blk device names to be lost (vda, vdb, etc).
> > 
> > Fix these issues by protecting vblk->vdev with a mutex and reference
> > counting vblk so the vd_index_ida index can be removed in all cases.
> > 
> > Fixes: 48e4043d4529523cbc7fa8dd745bd8e2c45ce1d3
> >("virtio: add virtio disk geometry feature")
> > Reported-by: Lance Digby 
> > Signed-off-by: Stefan Hajnoczi 
> > ---
> >  drivers/block/virtio_blk.c | 87 ++
> >  1 file changed, 79 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > index 93468b7c6701..6f7f277495f4 100644
> > --- a/drivers/block/virtio_blk.c
> > +++ b/drivers/block/virtio_blk.c
> > @@ -33,6 +33,16 @@ struct virtio_blk_vq {
> >  } cacheline_aligned_in_smp;
> >  
> >  struct virtio_blk {
> > +   /*
> > +* vdev may be accessed without taking this mutex in blk-mq and
> > +* virtqueue code paths because virtblk_remove() stops them before vdev
> > +* is freed.
> > +*
> > +* Everything else must hold this mutex when accessing vdev and must
> > +* handle the case where vdev is NULL after virtblk_remove() has been
> > +* called.
> > +*/
> > +   struct mutex vdev_mutex;
> 
> The patch LGTM, I'm just worried about cache_type_store() and
> cache_type_show() because IIUC they aren't in blk-mq and virtqueue code
> paths, but they use vdev.
> 
> Do we have to take the mutex there too?

No, because del_gendisk() in virtblk_remove() deletes the sysfs
attributes before vblk->vdev is set to NULL.  kernfs deactivates the
kernfs nodes so that further read()/write() operations fail with ENODEV
(see fs/kernfs/dir.c and fs/kernfs/file.c).

I have tested that a userspace process with a sysfs attr file open
cannot access the attribute after virtblk_remove().
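
(To illustrate the scenario tested, a fragment assuming a virtio-blk disk
named vda; the read fails once kernfs has deactivated the node:)

    int fd = open("/sys/block/vda/cache_type", O_RDONLY);
    char buf[16];

    /* ... hot-unplug the device; del_gendisk() removes the attr ... */

    ssize_t n = read(fd, buf, sizeof(buf));  /* -1, errno == ENODEV */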

Stefan


signature.asc
Description: PGP signature


[PATCH v3] virtio-blk: handle block_device_operations callbacks after hot unplug

2020-04-29 Thread Stefan Hajnoczi
A userspace process holding a file descriptor to a virtio_blk device can
still invoke block_device_operations after hot unplug.  This leads to a
use-after-free accessing vblk->vdev in virtblk_getgeo() when
ioctl(HDIO_GETGEO) is invoked:

  BUG: unable to handle kernel NULL pointer dereference at 0090
  IP: [] virtio_check_driver_offered_feature+0x10/0x90 
[virtio]
  PGD 80003a92f067 PUD 3a930067 PMD 0
  Oops:  [#1] SMP
  CPU: 0 PID: 1310 Comm: hdio-getgeo Tainted: G   OE     
3.10.0-1062.el7.x86_64 #1
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 
rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
  task: 9be5fbfb8000 ti: 9be5fa89 task.ti: 9be5fa89
  RIP: 0010:[]  [] 
virtio_check_driver_offered_feature+0x10/0x90 [virtio]
  RSP: 0018:9be5fa893dc8  EFLAGS: 00010246
  RAX: 9be5fc3f3400 RBX: 9be5fa893e30 RCX: 
  RDX:  RSI: 0004 RDI: 9be5fbc10b40
  RBP: 9be5fa893dc8 R08: 0301 R09: 0301
  R10:  R11:  R12: 9be5fdc24680
  R13: 9be5fbc10b40 R14: 9be5fbc10480 R15: 
  FS:  7f1bfb968740() GS:9be5ffc0() knlGS:
  CS:  0010 DS:  ES:  CR0: 80050033
  CR2: 0090 CR3: 3a894000 CR4: 00360ff0
  DR0:  DR1:  DR2: 
  DR3:  DR6: fffe0ff0 DR7: 0400
  Call Trace:
   [] virtblk_getgeo+0x47/0x110 [virtio_blk]
   [] ? handle_mm_fault+0x39d/0x9b0
   [] blkdev_ioctl+0x1f5/0xa20
   [] block_ioctl+0x41/0x50
   [] do_vfs_ioctl+0x3a0/0x5a0
   [] SyS_ioctl+0xa1/0xc0

A related problem is that virtblk_remove() leaks the vd_index_ida index
when something still holds a reference to vblk->disk during hot unplug.
This causes virtio-blk device names to be lost (vda, vdb, etc).

Fix these issues by protecting vblk->vdev with a mutex and reference
counting vblk so the vd_index_ida index can be removed in all cases.

Fixes: 48e4043d4529523cbc7fa8dd745bd8e2c45ce1d3
   ("virtio: add virtio disk geometry feature")
Reported-by: Lance Digby 
Signed-off-by: Stefan Hajnoczi 
---
 drivers/block/virtio_blk.c | 87 ++
 1 file changed, 79 insertions(+), 8 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 93468b7c6701..6f7f277495f4 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -33,6 +33,16 @@ struct virtio_blk_vq {
 } cacheline_aligned_in_smp;
 
 struct virtio_blk {
+   /*
+* vdev may be accessed without taking this mutex in blk-mq and
+* virtqueue code paths because virtblk_remove() stops them before vdev
+* is freed.
+*
+* Everything else must hold this mutex when accessing vdev and must
+* handle the case where vdev is NULL after virtblk_remove() has been
+* called.
+*/
+   struct mutex vdev_mutex;
struct virtio_device *vdev;
 
/* The disk structure for the kernel. */
@@ -44,6 +54,13 @@ struct virtio_blk {
/* Process context for config space updates */
struct work_struct config_work;
 
+   /*
+* Tracks references from block_device_operations open/release and
+* virtio_driver probe/remove so this object can be freed once no
+* longer in use.
+*/
+   refcount_t refs;
+
/* What host tells us, plus 2 for header & tailer. */
unsigned int sg_elems;
 
@@ -295,10 +312,55 @@ static int virtblk_get_id(struct gendisk *disk, char 
*id_str)
return err;
 }
 
+static void virtblk_get(struct virtio_blk *vblk)
+{
+   refcount_inc(&vblk->refs);
+}
+
+static void virtblk_put(struct virtio_blk *vblk)
+{
+   if (refcount_dec_and_test(&vblk->refs)) {
+   ida_simple_remove(&vd_index_ida, vblk->index);
+   mutex_destroy(&vblk->vdev_mutex);
+   kfree(vblk);
+   }
+}
+
+static int virtblk_open(struct block_device *bd, fmode_t mode)
+{
+   struct virtio_blk *vblk = bd->bd_disk->private_data;
+   int ret = 0;
+
+   mutex_lock(&vblk->vdev_mutex);
+
+   if (vblk->vdev)
+   virtblk_get(vblk);
+   else
+   ret = -ENXIO;
+
+   mutex_unlock(&vblk->vdev_mutex);
+   return ret;
+}
+
+static void virtblk_release(struct gendisk *disk, fmode_t mode)
+{
+   struct virtio_blk *vblk = disk->private_data;
+
+   virtblk_put(vblk);
+}
+
 /* We provide getgeo only to please some old bootloader/partitioning tools */
 static int virtblk_getgeo(struct block_device *bd, struct hd_geometry *geo)
 {
struct virtio_blk *vblk = bd->bd_disk->private_data;
+   int ret = 0;
+
+   mutex_lock(&vblk->vdev_mutex);
+
+   if (!vblk->vdev) {
+   ret = -ENXIO;
+   goto out;

Re: [PATCH v2] virtio-blk: handle block_device_operations callbacks after hot unplug

2020-04-28 Thread Stefan Hajnoczi
On Tue, Apr 28, 2020 at 11:25:07AM -0400, Michael S. Tsirkin wrote:
> On Tue, Apr 28, 2020 at 03:30:09PM +0100, Stefan Hajnoczi wrote:
> > A userspace process holding a file descriptor to a virtio_blk device can
> > still invoke block_device_operations after hot unplug.  For example, a
> > program that has /dev/vdb open can call ioctl(HDIO_GETGEO) after hot
> > unplug to invoke virtblk_getgeo().
> 
> 
> which causes what? a use after free?

Yes, use after free.  I will include the kernel panic in the next
revision.

virtio_check_driver_offered_feature() accesses vdev->dev.driver, which
doesn't work after virtblk_remove() has freed vdev :).
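
Concretely, the faulting path is roughly this (paraphrased from
drivers/virtio/virtio.c; exact code varies by kernel version):

    unsigned int virtio_check_driver_offered_feature(
            const struct virtio_device *vdev, unsigned int fbit)
    {
        /* vdev has been freed by virtblk_remove(), so dev.driver is
         * stale and the pointer computed here is garbage/NULL.
         */
        struct virtio_driver *drv = drv_to_virtio(vdev->dev.driver);
        unsigned int i;

        for (i = 0; i < drv->feature_table_size; i++)   /* faults here */
            if (drv->feature_table[i] == fbit)
                return 1;
        return 0;
    }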

> > 
> > Introduce a reference count in struct virtio_blk so that its lifetime
> > covers both virtio_driver probe/remove and block_device_operations
> > open/release users.  This ensures that block_device_operations functions
> > like virtblk_getgeo() can safely access struct virtio_blk.
> > 
> > Add remove_mutex to prevent block_device_operations functions from
> > accessing vblk->vdev during virtblk_remove() and let the safely check
> 
> let the -> let them?

Thanks, will fix.

> 
> > for !vblk->vdev after virtblk_remove() returns.
> > 
> > Switching to a reference count also solves the vd_index_ida leak where
> > vda, vdb, vdc, etc indices were lost when the device was hot unplugged
> > while the block device was still open.
> 
> Can you move this statement up so we list both issues (use after free
> and leak) upfront, then discuss the fix?

Thanks, will fix.

> 
> > 
> > Reported-by: Lance Digby 
> > Signed-off-by: Stefan Hajnoczi 
> > ---
> > If someone has a simpler solution please let me know.  I looked at
> > various approaches including reusing device_lock(&vblk->vdev.dev) but
> > they were more complex and extending the lifetime of virtio_device after
> > remove() has been called seems questionable.
> > ---
> >  drivers/block/virtio_blk.c | 85 ++
> >  1 file changed, 77 insertions(+), 8 deletions(-)
> > 
> > diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
> > index 93468b7c6701..3dd53b445cc1 100644
> > --- a/drivers/block/virtio_blk.c
> > +++ b/drivers/block/virtio_blk.c
> > @@ -44,6 +44,13 @@ struct virtio_blk {
> > /* Process context for config space updates */
> > struct work_struct config_work;
> >  
> > +   /*
> > +* Tracks references from block_device_operations open/release and
> > +* virtio_driver probe/remove so this object can be freed once no
> > +* longer in use.
> > +*/
> > +   refcount_t refs;
> > +
> > /* What host tells us, plus 2 for header & tailer. */
> > unsigned int sg_elems;
> >  
> > @@ -53,6 +60,9 @@ struct virtio_blk {
> > /* num of vqs */
> > int num_vqs;
> > struct virtio_blk_vq *vqs;
> > +
> > +   /* Provides mutual exclusion with virtblk_remove(). */
> 
> This is not the best way to document access rules.
> Which fields does this protect, exactly?
> I think it's just vdev. Right?
> Pls add to the comment.

Yes, vblk->vdev cannot be dereferenced after virtblk_remove() is
entered.

I'll document this mutex as protecting vdev.

> 
> > +   struct mutex remove_mutex;
> >  };
> >  
> >  struct virtblk_req {
> > @@ -295,10 +305,54 @@ static int virtblk_get_id(struct gendisk *disk, char 
> > *id_str)
> > return err;
> >  }
> >  
> > +static void virtblk_get(struct virtio_blk *vblk)
> > +{
> > +   refcount_inc(&vblk->refs);
> > +}
> > +
> > +static void virtblk_put(struct virtio_blk *vblk)
> > +{
> > +   if (refcount_dec_and_test(&vblk->refs)) {
> > +   ida_simple_remove(&vd_index_ida, vblk->index);
> > +   mutex_destroy(&vblk->remove_mutex);
> > +   kfree(vblk);
> > +   }
> > +}
> > +
> > +static int virtblk_open(struct block_device *bd, fmode_t mode)
> > +{
> > +   struct virtio_blk *vblk = bd->bd_disk->private_data;
> > +   int ret = -ENXIO;
> 
> 
> It's more common to do
> 
>   int ret = 0;
> 
> and on error:
>   ret = -ENXIO;
> 
> 
> let's do this.

Will fix.

> > +
> > +   mutex_lock(&vblk->remove_mutex);
> > +
> > +   if (vblk->vdev) {
> > +   virtblk_get(vblk);
> > +   ret = 0;
> > +   }
> 
> I prefer
>   else
>   ret = -ENXIO
> 
> here.

Got it.

> > +
> > +   mutex_unlock(&vblk->remove_mutex);
> > +   return ret;

[PATCH v2] virtio-blk: handle block_device_operations callbacks after hot unplug

2020-04-28 Thread Stefan Hajnoczi
A userspace process holding a file descriptor to a virtio_blk device can
still invoke block_device_operations after hot unplug.  For example, a
program that has /dev/vdb open can call ioctl(HDIO_GETGEO) after hot
unplug to invoke virtblk_getgeo().

Introduce a reference count in struct virtio_blk so that its lifetime
covers both virtio_driver probe/remove and block_device_operations
open/release users.  This ensures that block_device_operations functions
like virtblk_getgeo() can safely access struct virtio_blk.

Add remove_mutex to prevent block_device_operations functions from
accessing vblk->vdev during virtblk_remove() and let the safely check
for !vblk->vdev after virtblk_remove() returns.

Switching to a reference count also solves the vd_index_ida leak where
vda, vdb, vdc, etc indices were lost when the device was hot unplugged
while the block device was still open.

Reported-by: Lance Digby 
Signed-off-by: Stefan Hajnoczi 
---
If someone has a simpler solution please let me know.  I looked at
various approaches including reusing device_lock(&vblk->vdev.dev) but
they were more complex and extending the lifetime of virtio_device after
remove() has been called seems questionable.
---
 drivers/block/virtio_blk.c | 85 ++
 1 file changed, 77 insertions(+), 8 deletions(-)

diff --git a/drivers/block/virtio_blk.c b/drivers/block/virtio_blk.c
index 93468b7c6701..3dd53b445cc1 100644
--- a/drivers/block/virtio_blk.c
+++ b/drivers/block/virtio_blk.c
@@ -44,6 +44,13 @@ struct virtio_blk {
/* Process context for config space updates */
struct work_struct config_work;
 
+   /*
+* Tracks references from block_device_operations open/release and
+* virtio_driver probe/remove so this object can be freed once no
+* longer in use.
+*/
+   refcount_t refs;
+
/* What host tells us, plus 2 for header & tailer. */
unsigned int sg_elems;
 
@@ -53,6 +60,9 @@ struct virtio_blk {
/* num of vqs */
int num_vqs;
struct virtio_blk_vq *vqs;
+
+   /* Provides mutual exclusion with virtblk_remove(). */
+   struct mutex remove_mutex;
 };
 
 struct virtblk_req {
@@ -295,10 +305,54 @@ static int virtblk_get_id(struct gendisk *disk, char 
*id_str)
return err;
 }
 
+static void virtblk_get(struct virtio_blk *vblk)
+{
+   refcount_inc(&vblk->refs);
+}
+
+static void virtblk_put(struct virtio_blk *vblk)
+{
+   if (refcount_dec_and_test(&vblk->refs)) {
+   ida_simple_remove(&vd_index_ida, vblk->index);
+   mutex_destroy(&vblk->remove_mutex);
+   kfree(vblk);
+   }
+}
+
+static int virtblk_open(struct block_device *bd, fmode_t mode)
+{
+   struct virtio_blk *vblk = bd->bd_disk->private_data;
+   int ret = -ENXIO;
+
+   mutex_lock(&vblk->remove_mutex);
+
+   if (vblk->vdev) {
+   virtblk_get(vblk);
+   ret = 0;
+   }
+
+   mutex_unlock(&vblk->remove_mutex);
+   return ret;
+}
+
+static void virtblk_release(struct gendisk *disk, fmode_t mode)
+{
+   struct virtio_blk *vblk = disk->private_data;
+
+   virtblk_put(vblk);
+}
+
 /* We provide getgeo only to please some old bootloader/partitioning tools */
 static int virtblk_getgeo(struct block_device *bd, struct hd_geometry *geo)
 {
struct virtio_blk *vblk = bd->bd_disk->private_data;
+   int ret = -ENXIO;
+
+   mutex_lock(&vblk->remove_mutex);
+
+   if (!vblk->vdev) {
+   goto out;
+   }
 
/* see if the host passed in geometry config */
if (virtio_has_feature(vblk->vdev, VIRTIO_BLK_F_GEOMETRY)) {
@@ -314,11 +368,17 @@ static int virtblk_getgeo(struct block_device *bd, struct 
hd_geometry *geo)
geo->sectors = 1 << 5;
geo->cylinders = get_capacity(bd->bd_disk) >> 11;
}
-   return 0;
+
+   ret = 0;
+out:
+   mutex_unlock(&vblk->remove_mutex);
+   return ret;
 }
 
 static const struct block_device_operations virtblk_fops = {
.owner  = THIS_MODULE,
+   .open = virtblk_open,
+   .release = virtblk_release,
.getgeo = virtblk_getgeo,
 };
 
@@ -655,6 +715,10 @@ static int virtblk_probe(struct virtio_device *vdev)
goto out_free_index;
}
 
+   /* This reference is dropped in virtblk_remove(). */
+   refcount_set(&vblk->refs, 1);
+   mutex_init(&vblk->remove_mutex);
+
vblk->vdev = vdev;
vblk->sg_elems = sg_elems;
 
@@ -820,8 +884,12 @@ static int virtblk_probe(struct virtio_device *vdev)
 static void virtblk_remove(struct virtio_device *vdev)
 {
struct virtio_blk *vblk = vdev->priv;
-   int index = vblk->index;
-   int refc;
+
+   /*
+* Virtqueue processing is stopped safely here but mutual exclusion is
+* needed for block_device_operations.
+*/
+   mutex_lock(&vblk->remove_mutex);
 
  

Re: [PATCH] capabilities: add description for CAP_SETFCAP

2020-04-28 Thread Stefan Hajnoczi
On Tue, Apr 14, 2020 at 04:49:45PM +0100, Stefan Hajnoczi wrote:
> Document the purpose of CAP_SETFCAP.  For some reason this capability
> had no description while the others did.
> 
> Signed-off-by: Stefan Hajnoczi 
> ---
>  include/uapi/linux/capability.h | 2 ++
>  1 file changed, 2 insertions(+)

Ping?

> diff --git a/include/uapi/linux/capability.h b/include/uapi/linux/capability.h
> index 272dc69fa080..7288f0ad44af 100644
> --- a/include/uapi/linux/capability.h
> +++ b/include/uapi/linux/capability.h
> @@ -332,6 +332,8 @@ struct vfs_ns_cap_data {
>  
>  #define CAP_AUDIT_CONTROL30
>  
> +/* Set or remove capabilities on files */
> +
>  #define CAP_SETFCAP   31
>  
>  /* Override MAC access.
> -- 
> 2.25.1
> 
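
For context, this is what CAP_SETFCAP gates in practice. A minimal userspace
sketch using libcap (link with -lcap; the binary path is hypothetical):

    #include <stdio.h>
    #include <sys/capability.h>

    int main(void)
    {
        /* Writing the security.capability xattr requires CAP_SETFCAP. */
        cap_t caps = cap_from_text("cap_net_raw+ep");

        if (!caps)
            return 1;
        if (cap_set_file("/usr/local/bin/myping", caps) != 0)
            perror("cap_set_file");  /* EPERM without CAP_SETFCAP */
        cap_free(caps);
        return 0;
    }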




Re: [PATCH -next] virtiofs: remove unused variable 'fc'

2019-10-23 Thread Stefan Hajnoczi
On Wed, Oct 23, 2019 at 02:21:30PM +0800, YueHaibing wrote:
> fs/fuse/virtio_fs.c:983:20: warning:
>  variable fc set but not used [-Wunused-but-set-variable]
> 
> It is not used since commit 7ee1e2e631db ("virtiofs:
> No need to check fpq->connected state")
> 
> Signed-off-by: YueHaibing 
> ---
>  fs/fuse/virtio_fs.c | 2 --
>  1 file changed, 2 deletions(-)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH] vsock/virtio: remove unused 'work' field from 'struct virtio_vsock_pkt'

2019-10-16 Thread Stefan Hajnoczi
On Tue, Oct 15, 2019 at 05:00:51PM +0200, Stefano Garzarella wrote:
> The 'work' field was introduced with commit 06a8fc78367d0
> ("VSOCK: Introduce virtio_vsock_common.ko")
> but it is never used in the code, so we can remove it to save
> memory allocated in the per-packet 'struct virtio_vsock_pkt'
> 
> Suggested-by: Michael S. Tsirkin 
> Signed-off-by: Stefano Garzarella 
> ---
>  include/linux/virtio_vsock.h | 1 -
>  1 file changed, 1 deletion(-)

Reviewed-by: Stefan Hajnoczi 




Re: [PATCH net 0/2] vsock: don't allow half-closed socket in the host transports

2019-10-15 Thread Stefan Hajnoczi
On Sat, Oct 12, 2019 at 06:38:46PM -0400, Michael S. Tsirkin wrote:
> On Fri, Oct 11, 2019 at 04:34:57PM +0200, Stefano Garzarella wrote:
> > On Fri, Oct 11, 2019 at 10:19:13AM -0400, Michael S. Tsirkin wrote:
> > > On Fri, Oct 11, 2019 at 03:07:56PM +0200, Stefano Garzarella wrote:
> > > > We are implementing a test suite for the VSOCK sockets and we discovered
> > > > that vmci_transport never allowed half-closed sockets on the host side.
> > > > 
> > > > As Jorgen explained [1] this is due to the implementation of VMCI.
> > > > 
> > > > Since we want to have the same behaviour across all transports, this
> > > > series adds a section in the "Implementation notes" to explain this
> > > > behaviour, and changes the vhost_transport to behave the same way.
> > > > 
> > > > [1] https://patchwork.ozlabs.org/cover/847998/#1831400
> > > 
> > > Half closed sockets are very useful, and lots of
> > > applications use tricks to swap a vsock for a tcp socket,
> > > which might as a result break.
> > 
> > Got it!
> > 
> > > 
> > > If VMCI really cares it can implement an ioctl to
> > > allow applications to detect that half closed sockets aren't supported.
> > > 
> > > It does not look like VMCI wants to bother (users do not read
> > > kernel implementation notes) so it does not really care.
> > > So why do we want to cripple other transports intentionally?
> > 
> > The main reason is that we are developing the test suite and we noticed
> > the miss match. Since we want to make sure that applications behave in
> > the same way on different transports, we thought we would solve it that
> > way.
> > 
> > But what you are saying (also in the reply of the patches) is actually
> > quite right. Since it is not publicized, applications do not expect this behavior,
> > so please discard this series.
> > 
> > My problem during the tests, was trying to figure out if half-closed
> > sockets were supported or not, so as you say adding an IOCTL or maybe
> > better a getsockopt() could solve the problem.
> > 
> > What do you think?
> > 
> > Thanks,
> > Stefano
> 
> Sure, why not.

The aim is for applications using AF_VSOCK sockets to run on any
transport.  When the semantics differ between transports it creates a
compatibility problem.

That said, I do think keeping the standard sockets behavior is
reasonable.  If applications have problems on VMCI a sockopt may be
necessary :(.
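
If such a sockopt were added, an application-side probe might look like the
sketch below. SO_VM_SOCKETS_HALF_CLOSED is an invented name for illustration;
no such option exists today:

    #include <sys/socket.h>

    static void maybe_half_close(int fd)
    {
        int supported = 0;
        socklen_t len = sizeof(supported);

        /* Hypothetical option: ask the transport whether half-closed
         * sockets work before relying on shutdown(SHUT_WR) to deliver
         * EOF while still reading the peer's reply.
         */
        if (getsockopt(fd, AF_VSOCK, SO_VM_SOCKETS_HALF_CLOSED,
                       &supported, &len) == 0 && supported)
            shutdown(fd, SHUT_WR);
    }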

Stefan




Re: [PATCH V3 0/7] mdev based hardware virtio offloading support

2019-10-14 Thread Stefan Hajnoczi
On Fri, Oct 11, 2019 at 04:15:50PM +0800, Jason Wang wrote:
> There is hardware that can do virtio datapath offloading while having
> its own control path. This series tries to implement an mdev based
> unified API to support using the kernel virtio driver to drive those
> devices. This is done by introducing a new mdev transport for virtio
> (virtio_mdev) and register itself as a new kind of mdev driver. Then
> it provides a unified way for kernel virtio driver to talk with mdev
> device implementation.
> 
> Though the series only contains kernel driver support, the goal is to
> make the transport generic enough to support userspace drivers. This
> means vhost-mdev[1] could be built on top as well by reusing the
> transport.
> 
> A sample driver is also implemented which simulates a virtio-net
> loopback ethernet device on top of vringh + workqueue. This could be
> used as a reference implementation for real hardware driver.
> 
> Considering the mdev framework only supports VFIO devices and drivers
> right now, this series also extends it to support other types. This is
> done by introducing a class id to the device and pairing it with the
> id_table claimed by the driver. On top, this series also decouples
> device-specific parent ops from the common ones.

I was curious so I took a quick look and posted comments.

I guess this driver runs inside the guest since it registers virtio
devices?

If this is used with physical PCI devices that support datapath
offloading then how are physical devices presented to the guest without
SR-IOV?

Stefan




Re: [PATCH v4 1/5] vsock/virtio: limit the memory used per-socket

2019-10-14 Thread Stefan Hajnoczi
On Fri, Oct 11, 2019 at 03:40:48PM +0200, Stefano Garzarella wrote:
> On Sun, Sep 1, 2019 at 8:56 AM Michael S. Tsirkin  wrote:
> > On Fri, Aug 30, 2019 at 11:40:59AM +0200, Stefano Garzarella wrote:
> > > On Mon, Jul 29, 2019 at 10:04:29AM -0400, Michael S. Tsirkin wrote:
> > > > On Wed, Jul 17, 2019 at 01:30:26PM +0200, Stefano Garzarella wrote:
> > > > > Since virtio-vsock was introduced, the buffers filled by the host
> > > > > and pushed to the guest using the vring, are directly queued in
> > > > > a per-socket list. These buffers are preallocated by the guest
> > > > > with a fixed size (4 KB).
> > > > >
> > > > > The maximum amount of memory used by each socket should be
> > > > > controlled by the credit mechanism.
> > > > > The default credit available per-socket is 256 KB, but if we use
> > > > > only 1 byte per packet, the guest can queue up to 262144 of 4 KB
> > > > > buffers, using up to 1 GB of memory per-socket. In addition, the
> > > > > guest will continue to fill the vring with new 4 KB free buffers
> > > > > to avoid starvation of other sockets.
> > > > >
> > > > > This patch mitigates this issue copying the payload of small
> > > > > packets (< 128 bytes) into the buffer of last packet queued, in
> > > > > order to avoid wasting memory.
> > > > >
> > > > > Reviewed-by: Stefan Hajnoczi 
> > > > > Signed-off-by: Stefano Garzarella 
> > > >
> > > > This is good enough for net-next, but for net I think we
> > > > should figure out how to address the issue completely.
> > > > Can we make the accounting precise? What happens to
> > > > performance if we do?
> > > >
> > >
> > > Since I'm back from holidays, I'm restarting this thread to figure out
> > > how to address the issue completely.
> > >
> > > I did a better analysis of the credit mechanism that we implemented in
> > > virtio-vsock to get a clearer view, and I'd like to share it with you:
> > >
> > > This issue affects only the "host->guest" path. In this case, when the
> > > host wants to send a packet to the guest, it uses a "free" buffer
> > > allocated by the guest (4KB).
> > > The "free" buffers available for the host are shared between all
> > > sockets; the credit mechanism, instead, is per-socket, I think to
> > > avoid the starvation of other sockets.
> > > The guest re-fills the "free" queue when the available buffers are
> > > less than half.
> > >
> > > Each peer has these variables in the per-socket state:
> > >/* local vars */
> > >buf_alloc/* max bytes usable by this socket
> > >[exposed to the other peer] */
> > >fwd_cnt  /* increased when RX packet is consumed by the
> > >user space [exposed to the other peer] */
> > >tx_cnt /* increased when TX packet is sent to the 
> > > other peer */
> > >
> > >/* remote vars  */
> > >peer_buf_alloc   /* peer's buf_alloc */
> > >peer_fwd_cnt /* peer's fwd_cnt */
> > >
> > > When a peer sends a packet, it increases the 'tx_cnt'; when the
> > > receiver consumes the packet (copy it to the user-space buffer), it
> > > increases the 'fwd_cnt'.
> > > Note: increments are made considering the payload length and not the
> > > buffer length.
> > >
> > > The values of 'buf_alloc' and 'fwd_cnt' are sent to the other peer in
> > > all packet headers or with an explicit CREDIT_UPDATE packet.
> > >
> > > The local 'buf_alloc' value can be modified by the user space using
> > > setsockopt() with optname=SO_VM_SOCKETS_BUFFER_SIZE.
> > >
> > > Before sending a packet, the peer checks the space available:
> > >   credit_available = peer_buf_alloc - (tx_cnt - peer_fwd_cnt)
> > > and it will send up to credit_available bytes to the other peer.
> > >
> > > Possible solutions considering Michael's advice:
> > > 1. Use the buffer length instead of the payload length when we increment
> > >the counters:
> > >   - This approach will account precisely the memory used per socket.
> > >   - This requir
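
For illustration, the credit check described in the quoted analysis amounts
to the following sketch (field names taken from the mail, assumed to mirror
struct virtio_vsock_sock):

    /* Bytes the sender may still transmit without overrunning the
     * peer's per-socket buffer; unsigned arithmetic handles counter
     * wraparound.
     */
    static u32 credit_available(const struct virtio_vsock_sock *vvs)
    {
        u32 in_flight = vvs->tx_cnt - vvs->peer_fwd_cnt;

        return vvs->peer_buf_alloc - in_flight;
    }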

Re: [RFC PATCH 07/13] vsock: handle buffer_size sockopts in the core

2019-10-11 Thread Stefan Hajnoczi
On Thu, Oct 10, 2019 at 11:32:54AM +0200, Stefano Garzarella wrote:
> On Wed, Oct 09, 2019 at 01:30:26PM +0100, Stefan Hajnoczi wrote:
> > On Fri, Sep 27, 2019 at 01:26:57PM +0200, Stefano Garzarella wrote:
> > Another issue is that this patch drops the VIRTIO_VSOCK_MAX_BUF_SIZE
> > limit that used to be enforced by virtio_transport_set_buffer_size().
> > Now the limit is only applied at socket init time.  If the buffer size
> > is changed later then VIRTIO_VSOCK_MAX_BUF_SIZE can be exceeded.  If
> > that doesn't matter, why even bother with VIRTIO_VSOCK_MAX_BUF_SIZE
> > here?
> > 
> 
> The .notify_buffer_size() should avoid this issue, since it allows the
> transport to limit the buffer size requested after the initialization.
> 
> But again the min set by the user cannot be respected, and in the
> previous implementation we forced it to VIRTIO_VSOCK_MAX_BUF_SIZE.
> 
> Now we don't limit the min, but we guarantee only that vsk->buffer_size
> is lower than VIRTIO_VSOCK_MAX_BUF_SIZE.
> 
> Can that be an acceptable compromise?

I think so.

Setting buffer sizes was never tested or used much by userspace
applications that I'm aware of.  We should probably include tests for
changing buffer sizes in the test suite.
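
The compromise above, clamping in the transport rather than in the core, can
be sketched as follows (assuming the .notify_buffer_size() callback shape
from the multi-transports series):

    static void virtio_transport_notify_buffer_size(struct vsock_sock *vsk,
                                                    u64 *val)
    {
        /* The transport may lower, but never raise, the value the core
         * is about to store in vsk->buffer_size.
         */
        if (*val > VIRTIO_VSOCK_MAX_BUF_SIZE)
            *val = VIRTIO_VSOCK_MAX_BUF_SIZE;
    }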

Stefan




Re: [PATCH v2 07/11] VSOCK: add AF_VSOCK test cases

2019-10-09 Thread Stefan Hajnoczi
On Wed, Oct 09, 2019 at 12:03:53PM +0200, Stefano Garzarella wrote:
> Hi Stefan,
> I'm thinking about dividing this test into single applications, one
> for each test, do you think it makes sense?
> Or is it just a useless complication?

I don't mind either way but personally I would leave it as a single
program.

Stefan




Re: [RFC PATCH 00/13] vsock: add multi-transports support

2019-10-09 Thread Stefan Hajnoczi
On Fri, Sep 27, 2019 at 01:26:50PM +0200, Stefano Garzarella wrote:
> Hi all,
> this series adds the multi-transports support to vsock, following
> this proposal:
> https://www.spinics.net/lists/netdev/msg575792.html

Nice series!  I have left a few comments but overall it looks promising.

Stefan




Re: [RFC PATCH 13/13] vsock: fix bind() behaviour taking care of CID

2019-10-09 Thread Stefan Hajnoczi
On Fri, Sep 27, 2019 at 01:27:03PM +0200, Stefano Garzarella wrote:
> When we are looking for a socket bound to a specific address,
> we also have to take into account the CID.
> 
> This patch is useful with multi-transports support because it
> allows the binding of the same port with different CID, and
> it prevents a connection to a wrong socket bound to the same
> port, but with different CID.
> 
> Signed-off-by: Stefano Garzarella 
> ---
>  net/vmw_vsock/af_vsock.c | 10 --
>  1 file changed, 8 insertions(+), 2 deletions(-)

Reviewed-by: Stefan Hajnoczi 
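
The behavior described above boils down to a match like this (sketch with an
invented helper name; the actual patch adjusts the bound-socket lookup in
af_vsock.c):

    /* A bound socket matches only if both the port and the CID agree;
     * VMADDR_CID_ANY acts as a wildcard.
     */
    static bool bound_addr_matches(const struct sockaddr_vm *addr,
                                   const struct sockaddr_vm *bound)
    {
        if (addr->svm_port != bound->svm_port)
            return false;

        return addr->svm_cid == VMADDR_CID_ANY ||
               bound->svm_cid == VMADDR_CID_ANY ||
               addr->svm_cid == bound->svm_cid;
    }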




Re: [RFC PATCH 12/13] vsock: prevent transport modules unloading

2019-10-09 Thread Stefan Hajnoczi
On Fri, Sep 27, 2019 at 01:27:02PM +0200, Stefano Garzarella wrote:
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index c5f46b8242ce..750b62711b01 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -416,13 +416,28 @@ int vsock_assign_transport(struct vsock_sock *vsk, 
> struct vsock_sock *psk)
>   return -ESOCKTNOSUPPORT;
>   }
>  
> - if (!vsk->transport)
> + /* We increase the module refcnt to prevent the tranport unloading

s/tranport/transport/

Otherwise:

Reviewed-by: Stefan Hajnoczi 
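
For context, the pattern under review pins the transport module for as long
as a socket uses it, roughly (invented helper name, sketching the series):

    static int vsock_pin_transport(struct vsock_sock *vsk,
                                   const struct vsock_transport *t)
    {
        /* Prevent the transport module from being unloaded while any
         * socket still references it; dropped with module_put() when
         * the socket releases the transport.
         */
        if (!try_module_get(t->module))
            return -ENODEV;

        vsk->transport = t;
        return 0;
    }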




Re: [RFC PATCH 11/13] vsock: add 'transport_hg' to handle g2h\h2g transports

2019-10-09 Thread Stefan Hajnoczi
On Fri, Sep 27, 2019 at 01:27:01PM +0200, Stefano Garzarella wrote:
> VMCI transport provides both g2h and h2g behaviors in a single
> transport.
> We are able to set (or not) the g2h behavior, detecting if we
> are in a VMware guest (or not), but the h2g feature is always set.
> This prevents loading other h2g transports while we are in a
> VMware guest.

In the vhost_vsock.ko case we only register the h2g transport when
userspace has loaded the module (by opening /dev/vhost-vsock).

VMCI has something kind of similar: /dev/vmci and the
vmci_host_active_users counter.  Maybe we can use this instead of
introducing the transport_hg concept?
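
For comparison, this is roughly how vhost_vsock.ko ends up registered only on
demand (abridged from drivers/vhost/vsock.c; details vary by kernel version):

    static struct miscdevice vhost_vsock_misc = {
        .minor = VHOST_VSOCK_MINOR,
        .name  = "vhost-vsock",
        .fops  = &vhost_vsock_fops,
    };

    static int __init vhost_vsock_init(void)
    {
        int ret = vsock_core_init(&vhost_transport.transport);

        if (ret < 0)
            return ret;
        return misc_register(&vhost_vsock_misc);
    }
    module_init(vhost_vsock_init);

    /* The devname alias means opening /dev/vhost-vsock auto-loads the
     * module, so the h2g transport only exists once userspace asks.
     */
    MODULE_ALIAS_MISCDEV(VHOST_VSOCK_MINOR);
    MODULE_ALIAS("devname:vhost-vsock");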



