Re: [PATCH net] VSOCK: check sk state before receive

2018-09-24 Thread Jorgen S. Hansen
On Sep 22, 2018, at 8:27 AM, Hangbin Liu  wrote:
> 
> On Fri, Sep 21, 2018 at 07:48:25AM +, Jorgen S. Hansen wrote:
>> Hi Hangbin,
>> 
>> I finaly got to the bottom of this - the issue was indeed in the VMCI 
>> driver. The patch is posted here:
>> 
>> https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Flkml.org%2Flkml%2F2018%2F9%2F21%2F326data=02%7C01%7Cjhansen%40vmware.com%7C280a9c79e99248db3d1108d6205488de%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C1%7C0%7C636731944784367201sdata=BEExXdB%2BF0SSh83Epa5fAuyoG4qn%2FAeM2J0wG2gM6GQ%3Dreserved=0
>> 
>> I used your reproduce.log to test the fix. Thanks for discovering this issue.
> 
> Hi Jorgen,
> 
> Thanks for your patch. I built a test kernel with your fix, run my
> reproducer and syzkaller socket vnet test for a while. There is no such
> error. So I think your patch fixed this issue.

Great. Thanks a lot for trying out the patch.

> BTW, with FAULT_INJECTION enabled. I got another call trace:

The vhost_* stuff is for Virtio. Stefan would know better what is going on 
there.

> [  251.166377] FAULT_INJECTION: forcing a failure.
> [  251.178736] CPU: 15 PID: 10448 Comm: syz-executor7 Not tainted 4.19.0
> -rc4.syz.vnet+ #3
> [  251.187577] Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.1.5 
> 04/11/2016
> [  251.187578] Call Trace:
> [  251.187586]  dump_stack+0x8c/0xce
> [  251.187594]  should_fail+0x5dd/0x6b0
> [  251.199932]  ? fault_create_debugfs_attr+0x1d0/0x1d0
> [  251.199937]  __should_failslab+0xe8/0x120
> [  251.199945]  should_failslab+0xa/0x20
> [  251.228430]  kmem_cache_alloc_trace+0x43/0x1f0
> [  251.233392]  ? vhost_dev_set_owner+0x366/0x790 [vhost]
> [  251.239129]  vhost_dev_set_owner+0x366/0x790 [vhost]
> [  251.244672]  ? vhost_poll_wakeup+0xa0/0xa0 [vhost]
> [  251.250018]  ? kasan_unpoison_shadow+0x30/0x40
> [  251.254978]  ? vhost_worker+0x370/0x370 [vhost]
> [  251.260035]  ? kasan_kmalloc_large+0x71/0xe0
> [  251.264799]  ? kmalloc_order+0x54/0x60
> [  251.268985]  vhost_net_ioctl+0xc2e/0x14c0 [vhost_net]
> [  251.274635]  ? avc_ss_reset+0x150/0x150
> [  251.278915]  ? kstrtouint_from_user+0xe5/0x140
> [  251.283876]  ? handle_tx_kick+0x40/0x40 [vhost_net]
> [  251.289320]  ? save_stack+0x89/0xb0
> [  251.293213]  ? __kasan_slab_free+0x12e/0x180
> [  251.297979]  ? kmem_cache_free+0x7a/0x210
> [  251.302452]  ? putname+0xe2/0x120
> [  251.306151]  ? get_pid_task+0x6e/0x90
> [  251.310238]  ? proc_fail_nth_write+0x91/0x1c0
> [  251.315100]  ? map_files_get_link+0x3c0/0x3c0
> [  251.319963]  ? exit_robust_list+0x1c0/0x1c0
> [  251.324633]  ? __vfs_write+0xf7/0x6a0
> [  251.328711]  ? handle_tx_kick+0x40/0x40 [vhost_net]
> [  251.334154]  do_vfs_ioctl+0x1a5/0xfb0
> [  251.338241]  ? ioctl_preallocate+0x1c0/0x1c0
> [  251.343009]  ? selinux_file_ioctl+0x382/0x560
> [  251.347872]  ? selinux_capable+0x40/0x40
> [  251.352250]  ? __fget+0x211/0x2e0
> [  251.355949]  ? iterate_fd+0x1c0/0x1c0
> [  251.360038]  ? syscall_trace_enter+0x285/0xaa0
> [  251.365011]  ? security_file_ioctl+0x5d/0xb0
> [  251.369776]  ? selinux_capable+0x40/0x40
> [  251.374153]  ksys_ioctl+0x89/0xa0
> [  251.377853]  __x64_sys_ioctl+0x74/0xb0
> [  251.382036]  do_syscall_64+0xc3/0x390
> [  251.386123]  ? syscall_return_slowpath+0x14c/0x230
> [  251.391473]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  251.397111] RIP: 0033:0x451b89
> [  251.400519] Code: fc ff 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 
> f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 
> 01 f0 ff ff 0f 83 0b 67 fc ff c3 66 2e 0f 1f 84 00 00 00 00
> [  251.421476] RSP: 002b:7fc0d9673c48 EFLAGS: 0246 ORIG_RAX: 
> 0010
> [  251.429927] RAX: ffda RBX: 7fc0d96746b4 RCX: 
> 00451b89
> [  251.437889] RDX:  RSI: af01 RDI: 
> 0003
> [  251.445852] RBP: 0073bf00 R08:  R09: 
> 
> [  251.453815] R10:  R11: 0246 R12: 
> 0004
> [  251.461778] R13: 6450 R14: 004d3090 R15: 
> 7fc0d9674700
> 
> Thanks
> Hangbin

Thanks,
Jorgen

Re: [PATCH net] VSOCK: check sk state before receive

2018-09-21 Thread Jorgen S. Hansen
Hi Hangbin,

I finaly got to the bottom of this - the issue was indeed in the VMCI driver. 
The patch is posted here:

https://lkml.org/lkml/2018/9/21/326

I used your reproduce.log to test the fix. Thanks for discovering this issue.

Thanks,
Jørgen


From: Hangbin Liu 
Sent: Wednesday, June 13, 2018 3:44 AM
To: Jorgen S. Hansen
Cc: Stefan Hajnoczi; netdev@vger.kernel.org; David S. Miller
Subject: Re: [PATCH net] VSOCK: check sk state before receive

On Mon, Jun 04, 2018 at 04:02:39PM +, Jorgen S. Hansen wrote:
>
> > On May 30, 2018, at 11:17 AM, Stefan Hajnoczi  wrote:
> >
> > On Sun, May 27, 2018 at 11:29:45PM +0800, Hangbin Liu wrote:
> >> Hmm...Although I won't reproduce this bug with my reproducer after
> >> apply my patch. I could still get a similiar issue with syzkaller sock 
> >> vnet test.
> >>
> >> It looks this patch is not complete. Here is the KASAN call trace with my 
> >> patch.
> >> I can also reproduce it without my patch.
> >
> > Seems like a race between vmci_datagram_destroy_handle() and the
> > delayed callback, vmci_transport_recv_dgram_cb().
> >
> > I don't know the VMCI transport well so I'll leave this to Jorgen.
>
> Yes, it looks like we are calling the delayed callback after we return from 
> vmci_datagram_destroy_handle(). I’ll take a closer look at the VMCI side here 
> - the refcounting of VMCI datagram endpoints should guard against this, since 
> the delayed callback does a get on the datagram resource, so this could a 
> VMCI driver issue, and not a problem in the VMCI transport for AF_VSOCK.

Hi Jorgen,

Thanks for helping look at this. I'm happy to run test for you patch.

Thanks
Hangbin


Re: [PATCH net] VSOCK: check sk state before receive

2018-06-04 Thread Jorgen S. Hansen

> On May 30, 2018, at 11:17 AM, Stefan Hajnoczi  wrote:
> 
> On Sun, May 27, 2018 at 11:29:45PM +0800, Hangbin Liu wrote:
>> Hmm...Although I won't reproduce this bug with my reproducer after
>> apply my patch. I could still get a similiar issue with syzkaller sock vnet 
>> test.
>> 
>> It looks this patch is not complete. Here is the KASAN call trace with my 
>> patch.
>> I can also reproduce it without my patch.
> 
> Seems like a race between vmci_datagram_destroy_handle() and the
> delayed callback, vmci_transport_recv_dgram_cb().
> 
> I don't know the VMCI transport well so I'll leave this to Jorgen.

Yes, it looks like we are calling the delayed callback after we return from 
vmci_datagram_destroy_handle(). I’ll take a closer look at the VMCI side here - 
the refcounting of VMCI datagram endpoints should guard against this, since the 
delayed callback does a get on the datagram resource, so this could a VMCI 
driver issue, and not a problem in the VMCI transport for AF_VSOCK.

> 
>> ==
>> BUG: KASAN: use-after-free in vmci_transport_allow_dgram.part.7+0x155/0x1a0 
>> [vmw_vsock_vmci_transport]
>> Read of size 4 at addr 880026a3a914 by task kworker/0:2/96
>> 
>> CPU: 0 PID: 96 Comm: kworker/0:2 Not tainted 4.17.0-rc6.vsock+ #28
>> Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
>> Workqueue: events dg_delayed_dispatch [vmw_vmci]
>> Call Trace:
>> __dump_stack lib/dump_stack.c:77 [inline]
>> dump_stack+0xdd/0x18e lib/dump_stack.c:113
>> print_address_description+0x7a/0x3e0 mm/kasan/report.c:256
>> kasan_report_error mm/kasan/report.c:354 [inline]
>> kasan_report+0x1dd/0x460 mm/kasan/report.c:412
>> vmci_transport_allow_dgram.part.7+0x155/0x1a0 [vmw_vsock_vmci_transport]
>> vmci_transport_recv_dgram_cb+0x5d/0x200 [vmw_vsock_vmci_transport]
>> dg_delayed_dispatch+0x99/0x1b0 [vmw_vmci]
>> process_one_work+0xa4e/0x1720 kernel/workqueue.c:2145
>> worker_thread+0x1df/0x1400 kernel/workqueue.c:2279
>> kthread+0x343/0x4b0 kernel/kthread.c:240
>> ret_from_fork+0x35/0x40 arch/x86/entry/entry_64.S:412
>> 
>> Allocated by task 2684:
>> set_track mm/kasan/kasan.c:460 [inline]
>> kasan_kmalloc+0xa0/0xd0 mm/kasan/kasan.c:553
>> slab_post_alloc_hook mm/slab.h:444 [inline]
>> slab_alloc_node mm/slub.c:2741 [inline]
>> slab_alloc mm/slub.c:2749 [inline]
>> kmem_cache_alloc+0x105/0x330 mm/slub.c:2754
>> sk_prot_alloc+0x6a/0x2c0 net/core/sock.c:1468
>> sk_alloc+0xc9/0xbb0 net/core/sock.c:1528
>> __vsock_create+0xc8/0x9b0 [vsock]
>> vsock_create+0xfd/0x1a0 [vsock]
>> __sock_create+0x310/0x690 net/socket.c:1285
>> sock_create net/socket.c:1325 [inline]
>> __sys_socket+0x101/0x240 net/socket.c:1355
>> __do_sys_socket net/socket.c:1364 [inline]
>> __se_sys_socket net/socket.c:1362 [inline]
>> __x64_sys_socket+0x7d/0xd0 net/socket.c:1362
>> do_syscall_64+0x175/0x630 arch/x86/entry/common.c:287
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> 
>> Freed by task 2684:
>> set_track mm/kasan/kasan.c:460 [inline]
>> __kasan_slab_free+0x130/0x180 mm/kasan/kasan.c:521
>> slab_free_hook mm/slub.c:1388 [inline]
>> slab_free_freelist_hook mm/slub.c:1415 [inline]
>> slab_free mm/slub.c:2988 [inline]
>> kmem_cache_free+0xce/0x410 mm/slub.c:3004
>> sk_prot_free net/core/sock.c:1509 [inline]
>> __sk_destruct+0x629/0x940 net/core/sock.c:1593
>> sk_destruct+0x4e/0x90 net/core/sock.c:1601
>> __sk_free+0xd3/0x320 net/core/sock.c:1612
>> sk_free+0x2a/0x30 net/core/sock.c:1623
>> __vsock_release+0x431/0x610 [vsock]
>> vsock_release+0x3c/0xc0 [vsock]
>> sock_release+0x91/0x200 net/socket.c:594
>> sock_close+0x17/0x20 net/socket.c:1149
>> __fput+0x368/0xa20 fs/file_table.c:209
>> task_work_run+0x1c5/0x2a0 kernel/task_work.c:113
>> exit_task_work include/linux/task_work.h:22 [inline]
>> do_exit+0x1876/0x26c0 kernel/exit.c:865
>> do_group_exit+0x159/0x3e0 kernel/exit.c:968
>> get_signal+0x65a/0x1780 kernel/signal.c:2482
>> do_signal+0xa4/0x1fe0 arch/x86/kernel/signal.c:810
>> exit_to_usermode_loop+0x1b8/0x260 arch/x86/entry/common.c:162
>> prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
>> syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
>> do_syscall_64+0x505/0x630 arch/x86/entry/common.c:290
>> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> 
>> The buggy address belongs to the object at 880026a3a600
>> which belongs to the cache AF_VSOCK of size 1056
>> The buggy address is located 788 bytes inside of
>> 1056-byte region [880026a3a600, 880026a3aa20)
>> The buggy address belongs to the page:
>> page:ea9a8e00 count:1 mapcount:0 mapping: index:0x0 
>> compound_mapcount: 0
>> flags: 0xfc0008100(slab|head)
>> raw: 000fc0008100   0001000d000d
>> raw: dead0100 dead0200 880034471a40 
>> page dumped because: kasan: bad access detected
>> 
>> Memory state around the buggy address:
>> 880026a3a800: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>> 

Re: [PATCH] VSOCK: make af_vsock.ko removable again

2018-04-17 Thread Jorgen S. Hansen

> On Apr 17, 2018, at 8:25 AM, Stefan Hajnoczi  wrote:
> 
> Commit c1eef220c1760762753b602c382127bfccee226d ("vsock: always call
> vsock_init_tables()") introduced a module_init() function without a
> corresponding module_exit() function.
> 
> Modules with an init function can only be removed if they also have an
> exit function.  Therefore the vsock module was considered "permanent"
> and could not be removed.
> 
> This patch adds an empty module_exit() function so that "rmmod vsock"
> works.  No explicit cleanup is required because:
> 
> 1. Transports call vsock_core_exit() upon exit and cannot be removed
>   while sockets are still alive.
> 2. vsock_diag.ko does not perform any action that requires cleanup by
>   vsock.ko.
> 
> Reported-by: Xiumei Mu 
> Cc: Cong Wang 
> Cc: Jorgen Hansen 
> Signed-off-by: Stefan Hajnoczi 
> ---
> net/vmw_vsock/af_vsock.c | 6 ++
> 1 file changed, 6 insertions(+)
> 
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index aac9b8f6552e..c1076c19b858 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -2018,7 +2018,13 @@ const struct vsock_transport 
> *vsock_core_get_transport(void)
> }
> EXPORT_SYMBOL_GPL(vsock_core_get_transport);
> 
> +static void __exit vsock_exit(void)
> +{
> + /* Do nothing.  This function makes this module removable. */
> +}
> +
> module_init(vsock_init_tables);
> +module_exit(vsock_exit);
> 
> MODULE_AUTHOR("VMware, Inc.");
> MODULE_DESCRIPTION("VMware Virtual Socket Family");
> -- 
> 2.14.3
> 

Looks good to me.

Reviewed-by: Jorgen Hansen 



Re: [PATCH 0/5] VSOCK: add vsock_test test suite

2018-01-03 Thread Jorgen S. Hansen

> On Jan 2, 2018, at 1:05 PM, Stefan Hajnoczi <stefa...@redhat.com> wrote:
> 
> On Wed, Dec 20, 2017 at 02:48:43PM +0000, Jorgen S. Hansen wrote:
>> 
>>> On Dec 13, 2017, at 3:49 PM, Stefan Hajnoczi <stefa...@redhat.com> wrote:
>>> 
>>> The vsock_diag.ko module already has a test suite but the core AF_VSOCK
>>> functionality has no tests.  This patch series adds several test cases that
>>> exercise AF_VSOCK SOCK_STREAM socket semantics (send/recv, connect/accept,
>>> half-closed connections, simultaneous connections).
>>> 
>>> The test suite is modest but I hope to cover additional cases in the future.
>>> My goal is to have a shared test suite so VMCI, Hyper-V, and KVM can ensure
>>> that our transports behave the same.
>>> 
>>> I have tested virtio-vsock.
>>> 
>>> Jorgen: Please test the VMCI transport and let me know if anything needs to 
>>> be
>>> adjusted.  See tools/testing/vsock/README for information on how to run the
>>> test suite.
>>> 
>> 
>> I tried running the vsock_test on VMCI, and all the tests failed in one way 
>> or
>> another:
> 
> Great, thank you for testing and looking into the failures!
> 
>> 1) connection reset test: when the guest tries to connect to the host, we
>>  get EINVAL as the error instead of ECONNRESET. I’ll fix that.
> 
> Yay, the tests found a bug!
> 
>> 2) client close and server close tests: On the host side, VMCI doesn’t
>>  support reading data from a socket that has been closed by the
>>  guest. When the guest closes a connection, all data is gone, and
>>  we return EOF on the host side. So the tests that try to read data
>>  after close, should not attempt that on VMCI host side. I got the
>>  tests to pass by adding a getsockname call to determine if
>>  the local CID was the host CID, and then skip the read attempt
>>  in that case. We could add a vmci flag, that would enable
>>  this behavior.
> 
> Interesting behavior.  Is there a reason for disallowing half-closed
> sockets on the host side?

This is a consequence of the way the underlying VMCI queue pairs are
implemented. When the guest side closes a connection, it signals this
to the peer by detaching from the VMCI queue pair used for the data
transfer (the detach will result in an event being generated on the
peer side). However, the VMCI queue pair is allocated as part of the
guest memory, so when the guest detaches, that memory is reclaimed.
So the host side would need to create a copy of the contents of that
queue pair in kernel memory as part of the detach operation. When
this was implemented, it was decided that it was better to avoid
a potential large kernel memory allocation and the data copy at
detach time than to maintain the half close behavior of INET.

> 
>> 5) multiple connections tests: with the standard socket sizes,
>>  VMCI is only able to support about 100 concurrent stream
>>  connections so this test passes with MULTICONN_NFDS
>>  set to 100.
> 
> The 1000 magic number is because many distros have a maximum number of
> file descriptors ulimit of 1024.  But it's an arbitrary number and we
> could lower it to 100.
> 
> Is this VMCI concurrent stream limit a denial of service vector?  Can an
> unprivileged guest userspace process open many sockets to prevent
> legitimate connections from other users within the same guest?

vSocket uses VMCI queue pairs for the stream, and the VMCI device
only allows a limited amount of memory to be used for queue pairs
per VM. So it is possible to exhaust this shared resource. The queue
pairs are created as part of the connection establishment process, so
it would require the user process to both create and connect the sockets
to a host side endpoint (connections between guest processes will not
allocate VMCI queue pairs).

Thanks,
Jorgen

Re: [PATCH 0/5] VSOCK: add vsock_test test suite

2017-12-20 Thread Jorgen S. Hansen

> On Dec 13, 2017, at 3:49 PM, Stefan Hajnoczi  wrote:
> 
> The vsock_diag.ko module already has a test suite but the core AF_VSOCK
> functionality has no tests.  This patch series adds several test cases that
> exercise AF_VSOCK SOCK_STREAM socket semantics (send/recv, connect/accept,
> half-closed connections, simultaneous connections).
> 
> The test suite is modest but I hope to cover additional cases in the future.
> My goal is to have a shared test suite so VMCI, Hyper-V, and KVM can ensure
> that our transports behave the same.
> 
> I have tested virtio-vsock.
> 
> Jorgen: Please test the VMCI transport and let me know if anything needs to be
> adjusted.  See tools/testing/vsock/README for information on how to run the
> test suite.
> 

I tried running the vsock_test on VMCI, and all the tests failed in one way or
another:
1) connection reset test: when the guest tries to connect to the host, we
  get EINVAL as the error instead of ECONNRESET. I’ll fix that.
2) client close and server close tests: On the host side, VMCI doesn’t
  support reading data from a socket that has been closed by the
  guest. When the guest closes a connection, all data is gone, and
  we return EOF on the host side. So the tests that try to read data
  after close, should not attempt that on VMCI host side. I got the
  tests to pass by adding a getsockname call to determine if
  the local CID was the host CID, and then skip the read attempt
  in that case. We could add a vmci flag, that would enable
  this behavior.
3) send_byte(fd, -EPIPE): for the VMCI transport, the close
 isn’t necessarily visible immediately on the peer. So in most
 cases, these send operations would complete with success.
 I was running these tests using nested virtualization, so I
 suspect that the problem is more likely to occur here, but
 I had to add a sleep to be sure to get the EPIPE error.
4) server close test: the connect would sometimes fail. This looks
  like an issue where we detect the peer close on the client side
  before we complete the connection handshake on the client
  side. There are two different channels used for the connection
  handshake and the disconnect. I’ll look into this to see what
  exactly is going on.
5) multiple connections tests: with the standard socket sizes,
  VMCI is only able to support about 100 concurrent stream
  connections so this test passes with MULTICONN_NFDS
  set to 100.

Thanks,
Jorgen



Re: AF_VSOCK connection refused errno

2017-12-13 Thread Jorgen S. Hansen

> On Dec 12, 2017, at 4:53 PM, Stefan Hajnoczi  wrote:
> 
> When connect(2) fails because the peer is not listening the virtio vsock
> transport returns ECONNRESET.  I believe the VMCI transport does the
> same (based on code inspection).
> 
> Jorgen: Can you confirm this VMCI transport behavior?

Yes, that is correct.

> I'd like to change to ECONNREFUSED for all transports because developers
> will be surprised when they get ECONNRESET.  It makes porting AF_INET
> code harder.
> 
> On the other hand, it may be too late to fix this if there userspace
> applications that rely on ECONNRESET?  I'm not aware of any such
> applications myself.

In the past, I’ve explained to customers that an ECONNRESET error on connect
can indicate that the peer isn’t listening on the dest address. Whether they 
went
and used that information isn’t clear, but changing this behavior now would
risk breaking applications. While it is unfortunate that we deviate from INET in
this case, I would prefer it to stay as is.

Thanks,
Jorgen

Re: [PATCH v2] vsock.7: document VSOCK socket address family

2017-12-06 Thread Jorgen S. Hansen

> On Dec 5, 2017, at 11:56 AM, Stefan Hajnoczi  wrote:
> 
> The AF_VSOCK address family has been available since Linux 3.9 without a
> corresponding man page.
> 
> This patch adds vsock.7 and describes its use along the same lines as
> existing ip.7, unix.7, and netlink.7 man pages.
> 
> CC: Jorgen Hansen 
> CC: Dexuan Cui 
> Signed-off-by: Stefan Hajnoczi 
> ---
> man7/vsock.7 | 180 +++
> 1 file changed, 180 insertions(+)
> create mode 100644 man7/vsock.7
> 
> diff --git a/man7/vsock.7 b/man7/vsock.7
> new file mode 100644
> index 0..46dc561f5
> --- /dev/null
> +++ b/man7/vsock.7
> @@ -0,0 +1,180 @@
> +.TH VSOCK 7 2017-11-30 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +vsock \- Linux VSOCK address family
> +.SH SYNOPSIS
> +.B #include 
> +.br
> +.B #include 
> +.PP
> +.IB stream_socket " = socket(AF_VSOCK, SOCK_STREAM, 0);"
> +.br
> +.IB datagram_socket " = socket(AF_VSOCK, SOCK_DGRAM, 0);"
> +.SH DESCRIPTION
> +The VSOCK address family facilitates communication between virtual machines 
> and
> +the host they are running on.  This address family is used by guest agents 
> and
> +hypervisor services that need a communications channel that is independent of
> +virtual machine network configuration.
> +.PP
> +Valid socket types are
> +.B SOCK_STREAM
> +and
> +.BR SOCK_DGRAM .
> +.B SOCK_STREAM
> +provides connection-oriented byte streams with guaranteed, in-order delivery.
> +.B SOCK_DGRAM
> +provides a connectionless datagram packet service with best-effort delivery 
> and
> +best-effort ordering.  Availability of these socket types is dependent on the
> +underlying hypervisor.
> +.PP
> +A new socket is created with
> +.PP
> +socket(AF_VSOCK, socket_type, 0);
> +.PP
> +When a process wants to establish a connection it calls
> +.BR connect (2)
> +with a given destination socket address.  The socket is automatically bound 
> to
> +a free port if unbound.
> +.PP
> +A process can listen for incoming connections by first binding to a socket
> +address using
> +.BR bind (2)
> +and then calling
> +.BR listen (2).
> +.PP
> +Data is transferred using the usual
> +.BR send (2)
> +and
> +.BR recv (2)
> +family of socket system calls.
> +.SS Address format
> +A socket address is defined as a combination of a 32-bit Context Identifier
> +(CID) and a 32-bit port number.  The CID identifies the source or 
> destination,
> +which is either a virtual machine or the host.  The port number 
> differentiates
> +between multiple services running on a single machine.
> +.PP
> +.in +4n
> +.EX
> +struct sockaddr_vm {
> +sa_family_t svm_family; /* address family: AF_VSOCK */
> +unsigned short  svm_reserved1;
> +unsigned intsvm_port;   /* port in native byte order */
> +unsigned intsvm_cid;/* address in native byte order */
> +};
> +.EE
> +.in
> +.PP
> +.I svm_family
> +is always set to
> +.BR AF_VSOCK .
> +.I svm_reserved1
> +is always set to 0.
> +.I svm_port
> +contains the port in native byte order.
> +The port numbers below 1024 are called
> +.IR "privileged ports" .
> +Only a process with
> +.B CAP_NET_BIND_SERVER
> +capability may
> +.BR bind (2)
> +to these port numbers.
> +.PP
> +There are several special addresses:
> +.B VMADDR_CID_ANY
> +(-1U)
> +means any address for binding;
> +.B VMADDR_CID_HYPERVISOR
> +(0) is reserved for services built into the hypervisor;
> +.B VMADDR_CID_RESERVED
> +(1) must not be used;
> +.B VMADDR_CID_HOST
> +(2)
> +is the well-known address of the host.
> +.PP
> +The special constant
> +.B VMADDR_PORT_ANY
> +(-1U)
> +means any port number for binding.
> +.SS Live migration
> +Sockets are affected by live migration of virtual machines.  Connected
> +.B SOCK_STREAM
> +sockets become disconnected when the virtual machine migrates to a new host.
> +Applications must reconnect when this happens.
> +.PP
> +The local CID may change across live migration if the old CID is not 
> available
> +on the new host.  Bound sockets are automatically updated to the new CID.
> +.SS Ioctls
> +.TP
> +.B IOCTL_VM_SOCKETS_GET_LOCAL_CID
> +Get the CID of the local machine.  The argument is a pointer to an unsigned 
> int.
> +.IP
> +.in +4n
> +.EX
> +.IB error " = ioctl(" socket ", " IOCTL_VM_SOCKETS_GET_LOCAL_CID ", "  
> ");"
> +.EE
> +.in
> +.IP
> +Consider using
> +.B VMADDR_CID_ANY
> +when binding instead of getting the local CID with
> +.BR IOCTL_VM_SOCKETS_GET_LOCAL_CID .
> +.SH ERRORS
> +.TP
> +.B EACCES
> +Unable to bind to a privileged port without the
> +.B CAP_NET_BIND_SERVICE
> +capability.
> +.TP
> +.B EINVAL
> +Invalid parameters.  This includes:
> +attempting to bind a socket that is already bound, providing an invalid 
> struct
> +.BR sockaddr_vm ,
> +and other input validation errors.
> +.TP
> +.B EOPNOTSUPP
> +Operation not supported.  This includes:
> +the
> +.B MSG_OOB
> +flag that is not 

Re: [PATCH] VSOCK: fix outdated sk_state value in hvs_release()

2017-12-05 Thread Jorgen S. Hansen

> On Dec 5, 2017, at 12:31 PM, Stefan Hajnoczi  wrote:
> 
> Since commit 3b4477d2dcf2709d0be89e2a8dced3d0f4a017f2 ("VSOCK: use TCP
> state constants for sk_state") VSOCK has used TCP_* constants for
> sk_state.
> 
> Commit b4562ca7925a3bedada87a3dd072dd5bad043288 ("hv_sock: add locking
> in the open/close/release code paths") reintroduced the SS_DISCONNECTING
> constant.
> 
> This patch replaces the old SS_DISCONNECTING with the new TCP_CLOSING
> constant.
> 
> CC: Dexuan Cui 
> CC: Cathy Avery 
> Signed-off-by: Stefan Hajnoczi 
> ---
> net/vmw_vsock/hyperv_transport.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/net/vmw_vsock/hyperv_transport.c 
> b/net/vmw_vsock/hyperv_transport.c
> index 5583df708b8c..a827547aa102 100644
> --- a/net/vmw_vsock/hyperv_transport.c
> +++ b/net/vmw_vsock/hyperv_transport.c
> @@ -487,7 +487,7 @@ static void hvs_release(struct vsock_sock *vsk)
> 
>   lock_sock(sk);
> 
> - sk->sk_state = SS_DISCONNECTING;
> + sk->sk_state = TCP_CLOSING;
>   vsock_remove_sock(vsk);
> 
>   release_sock(sk);
> -- 
> 2.14.3
> 

Reviewed-by: Jorgen Hansen 



Re: [PATCH] vsock.7: document VSOCK socket address family

2017-12-01 Thread Jorgen S. Hansen

> On Dec 1, 2017, at 2:09 PM, Stefan Hajnoczi <stefa...@redhat.com> wrote:
> 
> On Thu, Nov 30, 2017 at 01:21:26PM +0000, Jorgen S. Hansen wrote:
>>> On Nov 30, 2017, at 12:21 PM, Stefan Hajnoczi <stefa...@redhat.com> wrote:
> 
> Thanks for the quick review!
> 
> I forgot to ask you: Is SOCK_DGRAM reliable and in-order over VMCI?

No, that isn’t guaranteed.

> 
>>> +.PP
>>> +Valid socket types are
>>> +.B SOCK_STREAM
>>> +and
>>> +.B SOCK_DGRAM .
>> 
>> The space here results in a space between SOCK_DGRAM and the “.” in the 
>> formatted text. Is that intentional?
> 
> I haven't figured out the groff syntax to avoid the space :(.  Any
> ideas?

Looks like man 7 unix has a few examples with bold and “,” - maybe look at the 
source for that.

> 
>> Apart from the nits, this looks great.
> 
> Please let me know if there are any other topics that we should cover in
> the man page.

Sure, I’ll get back to you soon, if I think of anything.

Thanks,
Jorgen

Re: [PATCH] vsock.7: document VSOCK socket address family

2017-11-30 Thread Jorgen S. Hansen

> On Nov 30, 2017, at 12:21 PM, Stefan Hajnoczi  wrote:
> 
> The AF_VSOCK address family has been available since Linux 3.9 without a
> corresponding man page.
> 
> This patch adds vsock.7 and describes its use along the same lines as
> existing ip.7, unix.7, and netlink.7 man pages.
> 
> CC: Jorgen Hansen 
> CC: Dexuan Cui 
> Signed-off-by: Stefan Hajnoczi 
> ---
> man7/vsock.7 | 175 +++
> 1 file changed, 175 insertions(+)
> create mode 100644 man7/vsock.7
> 
> diff --git a/man7/vsock.7 b/man7/vsock.7
> new file mode 100644
> index 0..48c6c2e1e
> --- /dev/null
> +++ b/man7/vsock.7
> @@ -0,0 +1,175 @@
> +.TH VSOCK 7 2017-11-30 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +vsock \- Linux VSOCK address family
> +.SH SYNOPSIS
> +.B #include 
> +.br
> +.B #include 
> +.PP
> +.IB stream_socket " = socket(AF_VSOCK, SOCK_STREAM, 0);"
> +.br
> +.IB datagram_socket " = socket(AF_VSOCK, SOCK_DGRAM, 0);"
> +.SH DESCRIPTION
> +The VSOCK address family facilitates communication between virtual machines 
> and
> +the host they are running on.  This address family is used by guest agents 
> and
> +hypervisor services that need a communications channel that is independent of
> +virtual machine network configuration.
> +.PP
> +Valid socket types are
> +.B SOCK_STREAM
> +and
> +.B SOCK_DGRAM .

The space here results in a space between SOCK_DGRAM and the “.” in the 
formatted text. Is that intentional?

> +.B SOCK_STREAM
> +provides connection-oriented byte streams with guaranteed, in-order delivery.
> +.B SOCK_DGRAM
> +provides a connectionless datagram packet service.  Availability of these
> +socket types is dependent on the underlying hypervisor.
> +.PP
> +A new socket is created with
> +.PP
> +socket(AF_VSOCK, socket_type, 0);
> +.PP
> +When a process wants to establish a connection it calls
> +.BR connect (2)
> +with a given destination socket address.  The socket is automatically bound 
> to
> +a free port if unbound.
> +.PP
> +A process can listen for incoming connections by first binding to a socket 
> address using
> +.BR bind (2)
> +and then calling
> +.BR listen (2).
> +.PP
> +Data is transferred using the usual
> +.BR send (2)
> +and
> +.BR recv (2)
> +family of socket system calls.
> +.SS Address format
> +A socket address is defined as a combination of a 32-bit Context Identifier 
> (CID) and a 32-bit port number.  The CID identifies the source or 
> destination, which is either a virtual machine or the host.  The port number 
> differentiates between multiple services running on a single machine.
> +.PP
> +.in +4n
> +.EX
> +struct sockaddr_vm {
> +sa_family_t svm_family; /* address family: AF_VSOCK */
> +unsigned short  svm_reserved1;
> +unsigned intsvm_port;   /* port in native byte order */
> +unsigned intsvm_cid;/* address in native byte order */
> +};
> +.EE
> +.in
> +.PP
> +.I svm_family
> +is always set to
> +.BR AF_VSOCK .

Again the space before “."

> +.I svm_reserved1
> +is always set to 0.
> +.I svm_port
> +contains the port in native byte order.
> +The port numbers below 1024 are called
> +.IR "privileged ports" .
> +Only a process with
> +.B CAP_NET_BIND_SERVER
> +capability may
> +.BR bind (2)
> +to these port numbers.
> +.PP
> +There are several special addresses:
> +.B VMADDR_CID_ANY
> +(-1U)
> +means any address for binding;
> +.B VMADDR_CID_HYPERVISOR

We use VMADDR_CID_HYPERVISOR for communicating with services in the hypervisor, 
so you could describe this as “an address reserved for services built into the 
hypervisor”.

> +(0) and
> +.B VMADDR_CID_RESERVED
> +(1) are unused addresses;
> +.B VMADDR_CID_HOST
> +(2)
> +is the well-known address of the host.
> +.PP
> +The special constant
> +.B VMADDR_PORT_ANY
> +(-1U)
> +means any port number for binding.
> +.SS Live migration
> +Sockets are affected by live migration of virtual machines.  Connected
> +.B SOCK_STREAM
> +sockets become disconnected when the virtual machine migrates to a new host.
> +Applications must reconnect when this happens.
> +.PP
> +The local CID may change across live migration if the old CID is not 
> available
> +on the new host.  Bound sockets are automatically updated to the new CID.
> +.SS Ioctls
> +.TP
> +.B IOCTL_VM_SOCKETS_GET_LOCAL_CID
> +Get the CID of the local machine.  The argument is a pointer to an unsigned 
> int.
> +.IP
> +.in +4n
> +.EX
> +.IB error " = ioctl(" socket ", " IOCTL_VM_SOCKETS_GET_LOCAL_CID ", "  
> ");"
> +.EE
> +.in
> +.IP
> +Consider using
> +.B VMADDR_CID_ANY
> +when binding instead of getting the local CID with
> +.B IOCTL_VM_SOCKETS_GET_LOCAL_CID .

space before “.”

> +.SH ERRORS
> +.TP
> +.B EACCES
> +Unable to bind to a privileged port without the
> +.B CAP_NET_BIND_SERVICE
> +capability.
> +.TP
> +.B EINVAL
> +Invalid parameters.  This includes:
> +attempting to bind a socket 

Re: scheduling while atomic from vmci_transport_recv_stream_cb in 3.16 kernels

2017-09-13 Thread Jorgen S. Hansen

> On Sep 13, 2017, at 5:19 PM, Michal Hocko <mho...@kernel.org> wrote:
> 
> On Wed 13-09-17 15:07:26, Jorgen S. Hansen wrote:
>> 
>>> On Sep 12, 2017, at 11:08 AM, Michal Hocko <mho...@kernel.org> wrote:
>>> 
>>> Hi,
>>> we are seeing the following splat with Debian 3.16 stable kernel
>>> 
>>> BUG: scheduling while atomic: MATLAB/26771/0x0100
>>> Modules linked in: veeamsnap(O) hmac cbc cts nfsv4 dns_resolver 
>>> rpcsec_gss_krb5 nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache 
>>> sunrpc vmw_vso$
>>> CPU: 0 PID: 26771 Comm: MATLAB Tainted: G   O  3.16.0-4-amd64 #1 
>>> Debian 3.16.7-ckt20-1+deb8u3
>>> Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference 
>>> Platform, BIOS 6.00 09/21/2015
>>> 88315c1e4c20 8150db3f 88193f803dc8 8150acdf
>>> 815103a2 00012f00 8819423dbfd8 00012f00
>>> 88315c1e4c20 88193f803dc8 88193f803d50 88193f803dc0
>>> Call Trace:
>>>   [] ? dump_stack+0x41/0x51
>>> [] ? __schedule_bug+0x48/0x55
>>> [] ? __schedule+0x5d2/0x700
>>> [] ? schedule_timeout+0x229/0x2a0
>>> [] ? select_task_rq_fair+0x390/0x700
>>> [] ? check_preempt_wakeup+0x120/0x1d0
>>> [] ? wait_for_completion+0xa8/0x120
>>> [] ? wake_up_state+0x10/0x10
>>> [] ? call_rcu_bh+0x20/0x20
>>> [] ? wait_rcu_gp+0x4b/0x60
>>> [] ? ftrace_raw_output_rcu_utilization+0x40/0x40
>>> [] ? vmci_event_unsubscribe+0x75/0xb0 [vmw_vmci]
>>> [] ? vmci_transport_destruct+0x1d/0xe0 
>>> [vmw_vsock_vmci_transport]
>>> [] ? vsock_sk_destruct+0x13/0x60 [vsock]
>>> [] ? __sk_free+0x1a/0x130
>>> [] ? vmci_transport_recv_stream_cb+0x1e8/0x2d0 
>>> [vmw_vsock_vmci_transport]
>>> [] ? vmci_datagram_invoke_guest_handler+0xaa/0xd0 
>>> [vmw_vmci]
>>> [] ? vmci_dispatch_dgs+0xc1/0x200 [vmw_vmci]
>>> [] ? tasklet_action+0xf4/0x100
>>> [] ? __do_softirq+0xf1/0x290
>>> [] ? irq_exit+0x95/0xa0
>>> [] ? do_IRQ+0x52/0xe0
>>> [] ? common_interrupt+0x6d/0x6d
>>> 
>>> AFAICS this has been fixed by 4ef7ea9195ea ("VSOCK: sock_put wasn't safe
>>> to call in interrupt context") but this patch hasn't been backported to
>>> stable trees. It applies cleanly on top of 3.16 stable tree but I am not
>>> familiar with the code to send the backport to the stable maintainer
>>> directly.
>>> 
>>> Could you double check that the patch below (just a blind cherry-pick)
>>> is correct and it doesn't need additional patches on top?
>> 
>> Hi,
>> 
>> The patch below has been used to fix the above issue by other distros
>> - among them Redhat for the 3.10 kernel, so it should work for 3.16 as
>> well.
> 
> Thanks for the confirmation. I do not see 4ef7ea9195ea ("VSOCK: sock_put
> wasn't safe to call in interrupt context") in 3.10 stable branch
> though.
> 
>> In addition to the patch above, there are two other patches that
>> need to be applied on top for the fix to be correct:
>> 
>> 8566b86ab9f0f45bc6f7dd422b21de9d0cf5415a "VSOCK: Fix lockdep issue."
>> 
>> and
>> 
>> 8ab18d71de8b07d2c4d6f984b718418c09ea45c5 "VSOCK: Detach QP check should 
>> filter out non matching QPs."
> 
> Good to know. I will send all three patches cherry-picked on top of the
> current 3.16 stable branch. Could you have a look please?

The patch series look good to me.

Thanks for taking care of this,
Jorgen



Re: scheduling while atomic from vmci_transport_recv_stream_cb in 3.16 kernels

2017-09-13 Thread Jorgen S. Hansen

> On Sep 12, 2017, at 11:08 AM, Michal Hocko  wrote:
> 
> Hi,
> we are seeing the following splat with Debian 3.16 stable kernel
> 
> BUG: scheduling while atomic: MATLAB/26771/0x0100
> Modules linked in: veeamsnap(O) hmac cbc cts nfsv4 dns_resolver 
> rpcsec_gss_krb5 nfsd auth_rpcgss oid_registry nfs_acl nfs lockd fscache 
> sunrpc vmw_vso$
> CPU: 0 PID: 26771 Comm: MATLAB Tainted: G   O  3.16.0-4-amd64 #1 
> Debian 3.16.7-ckt20-1+deb8u3
> Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference 
> Platform, BIOS 6.00 09/21/2015
> 88315c1e4c20 8150db3f 88193f803dc8 8150acdf
> 815103a2 00012f00 8819423dbfd8 00012f00
> 88315c1e4c20 88193f803dc8 88193f803d50 88193f803dc0
> Call Trace:
>   [] ? dump_stack+0x41/0x51
> [] ? __schedule_bug+0x48/0x55
> [] ? __schedule+0x5d2/0x700
> [] ? schedule_timeout+0x229/0x2a0
> [] ? select_task_rq_fair+0x390/0x700
> [] ? check_preempt_wakeup+0x120/0x1d0
> [] ? wait_for_completion+0xa8/0x120
> [] ? wake_up_state+0x10/0x10
> [] ? call_rcu_bh+0x20/0x20
> [] ? wait_rcu_gp+0x4b/0x60
> [] ? ftrace_raw_output_rcu_utilization+0x40/0x40
> [] ? vmci_event_unsubscribe+0x75/0xb0 [vmw_vmci]
> [] ? vmci_transport_destruct+0x1d/0xe0 
> [vmw_vsock_vmci_transport]
> [] ? vsock_sk_destruct+0x13/0x60 [vsock]
> [] ? __sk_free+0x1a/0x130
> [] ? vmci_transport_recv_stream_cb+0x1e8/0x2d0 
> [vmw_vsock_vmci_transport]
> [] ? vmci_datagram_invoke_guest_handler+0xaa/0xd0 [vmw_vmci]
> [] ? vmci_dispatch_dgs+0xc1/0x200 [vmw_vmci]
> [] ? tasklet_action+0xf4/0x100
> [] ? __do_softirq+0xf1/0x290
> [] ? irq_exit+0x95/0xa0
> [] ? do_IRQ+0x52/0xe0
> [] ? common_interrupt+0x6d/0x6d
> 
> AFAICS this has been fixed by 4ef7ea9195ea ("VSOCK: sock_put wasn't safe
> to call in interrupt context") but this patch hasn't been backported to
> stable trees. It applies cleanly on top of 3.16 stable tree but I am not
> familiar with the code to send the backport to the stable maintainer
> directly.
> 
> Could you double check that the patch below (just a blind cherry-pick)
> is correct and it doesn't need additional patches on top?

Hi,

The patch below has been used to fix the above issue by other distros - among 
them Redhat for the 3.10 kernel, so it should work for 3.16 as well. In 
addition to the patch above, there are two other patches that need to be 
applied on top for the fix to be correct:

8566b86ab9f0f45bc6f7dd422b21de9d0cf5415a "VSOCK: Fix lockdep issue."

and

8ab18d71de8b07d2c4d6f984b718418c09ea45c5 "VSOCK: Detach QP check should filter 
out non matching QPs."

Thanks,
Jorgen

> 
> Ben could you take this to your stable 3.16 branch if the patch is correct?
> 
> I am CCing Sasha for 4.1 stable tree as well. I haven't checked whether
> pre 3.16 kernels are affected as well.
> ---
> commit fac774c40b5c512113b6373cad498f35bee7a409
> Author: Jorgen Hansen 
> Date:   Wed Oct 21 04:53:56 2015 -0700
> 
>VSOCK: sock_put wasn't safe to call in interrupt context
> 
>commit 4ef7ea9195ea73262cd9730fb54e1eb726da157b upstream.
> 
>In the vsock vmci_transport driver, sock_put wasn't safe to call
>in interrupt context, since that may call the vsock destructor
>which in turn calls several functions that should only be called
>from process context. This change defers the callling of these
>functions  to a worker thread. All these functions were
>deallocation of resources related to the transport itself.
> 
>Furthermore, an unused callback was removed to simplify the
>cleanup.
> 
>Multiple customers have been hitting this issue when using
>VMware tools on vSphere 2015.
> 
>Also added a version to the vmci transport module (starting from
>1.0.2.0-k since up until now it appears that this module was
>sharing version with vsock that is currently at 1.0.1.0-k).
> 
>Reviewed-by: Aditya Asarwade 
>Reviewed-by: Thomas Hellstrom 
>Signed-off-by: Jorgen Hansen 
>Signed-off-by: David S. Miller 
> 
> diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
> index 9bb63ffec4f2..aed136d27b01 100644
> --- a/net/vmw_vsock/vmci_transport.c
> +++ b/net/vmw_vsock/vmci_transport.c
> @@ -40,13 +40,11 @@
> 
> static int vmci_transport_recv_dgram_cb(void *data, struct vmci_datagram *dg);
> static int vmci_transport_recv_stream_cb(void *data, struct vmci_datagram 
> *dg);
> -static void vmci_transport_peer_attach_cb(u32 sub_id,
> -   const struct vmci_event_data *ed,
> -   void *client_data);
> static void vmci_transport_peer_detach_cb(u32 sub_id,
> const struct vmci_event_data *ed,
> void *client_data);
> static void vmci_transport_recv_pkt_work(struct work_struct *work);
> +static 

Re: [PATCH] VSOCK: fix uapi/linux/vm_sockets.h incomplete types

2017-09-13 Thread Jorgen S. Hansen

> On Sep 12, 2017, at 6:34 PM, Stefan Hajnoczi  wrote:
> 
> This patch fixes the following compiler errors when userspace
> applications use the vm_sockets.h header:
> 
>  include/uapi/linux/vm_sockets.h:148:32: error: invalid application of 
> ‘sizeof’ to incomplete type ‘struct sockaddr’
>unsigned char svm_zero[sizeof(struct sockaddr) -
>  ^~
>  include/uapi/linux/vm_sockets.h:149:18: error: ‘sa_family_t’ undeclared here 
> (not in a function)
> sizeof(sa_family_t) -
>^~~
> 
> Two issues:
> 1. In the kernel struct sockaddr comes in via  but in
>   userspace  is required.
> 2. struct sockaddr_vm has a __kernel_sa_family_t field so let's be
>   consistent and use the same type for the sizeof(sa_family_t)
>   calculation.
> 
> Currently userspace applications work around this broken header by first
> including .  In the kernel there is no compiler error
> because  provides everything.  It's worth fixing the
> header file though.
> 
> Cc: Jorgen Hansen 
> Signed-off-by: Stefan Hajnoczi 
> ---
> include/uapi/linux/vm_sockets.h | 6 +-
> 1 file changed, 5 insertions(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/vm_sockets.h b/include/uapi/linux/vm_sockets.h
> index b4ed5d895699..4ae5c625ac56 100644
> --- a/include/uapi/linux/vm_sockets.h
> +++ b/include/uapi/linux/vm_sockets.h
> @@ -18,6 +18,10 @@
> 
> #include 
> 
> +#ifndef __KERNEL__
> +#include  /* struct sockaddr */
> +#endif
> +
> /* Option name for STREAM socket buffer size.  Use as the option name in
>  * setsockopt(3) or getsockopt(3) to set or get an unsigned long long that
>  * specifies the size of the buffer underlying a vSockets STREAM socket.
> @@ -146,7 +150,7 @@ struct sockaddr_vm {
>   unsigned int svm_port;
>   unsigned int svm_cid;
>   unsigned char svm_zero[sizeof(struct sockaddr) -
> -sizeof(sa_family_t) -
> +sizeof(__kernel_sa_family_t) -
>  sizeof(unsigned short) -
>  sizeof(unsigned int) - sizeof(unsigned int)];
> };
> -- 
> 2.13.5
> 

Thanks for fixing this.

Reviewed-by: Jorgen Hansen 

Re: [PATCH] vsock: only load vmci transport on VMware hypervisor by default

2017-09-06 Thread Jorgen S. Hansen

> On Aug 31, 2017, at 1:54 PM, Stefan Hajnoczi <stefa...@redhat.com> wrote:
> 
> On Tue, Aug 29, 2017 at 03:37:07PM +0000, Jorgen S. Hansen wrote:
>>> On Aug 29, 2017, at 4:36 AM, Dexuan Cui <de...@microsoft.com> wrote:
>> If we allow multiple host side transports, virtio host side support and
>> vmci should be able to coexist regardless of the order of initialization.
> 
> That sounds good to me.
> 
> This means af_vsock.c needs to be aware of CID allocation.  Currently the
> vhost_vsock.ko driver handles this itself (it keeps a list of CIDs and
> checks that they are not used twice).  It should be possible to move
> that state into af_vsock.c so we have <cid, host_transport> pairs.
> 

Yes, that was my thinking as well.

> I'm currently working on NFS over AF_VSOCK and sock_diag support (for
> ss(8) and netstat-like tools).
> 
> Multi-transport support is lower priority for me at the moment.  I'm
> happy to review patches though.  If there is no progress on this by the
> end of the year then I will have time to work on it.
> 

I’ll try to find time to write a more coherent proposal in the coming weeks,
and we can discuss that.

> Are either of you are in Prague, Czech Republic on October 25-27 for
> Linux Kernel Summit, Open Source Summit Europe, Embedded Linux
> Conference Europe, KVM Forum, or MesosCon Europe?

No, I won’t be there.

Thanks,
Jorgen



Re: [PATCH] vsock: only load vmci transport on VMware hypervisor by default

2017-08-29 Thread Jorgen S. Hansen

> On Aug 29, 2017, at 4:36 AM, Dexuan Cui  wrote:
> 
>> From: Dexuan Cui
>> Sent: Tuesday, August 22, 2017 21:21
>>> ...
>>> ...
>>> The only problem here would be the potential for a guest and a host app
>> to
>>> have a conflict wrt port numbers, even though they would be able to
>>> operate fine, if restricted to their appropriate transport.
>>> 
>>> Thanks,
>>> Jorgen
>> 
>> Hi Jorgen, Stefan,
>> Thank you for the detailed analysis!
>> You have a much better understanding than me about the complex
>> scenarios. Can you please work out a patch? :-)
> 
> Hi Jorgen, Stefan,
> May I know your plan for this? 

I’d be happy to discuss this now, but I don’t have time to work on the
actual implementation in the next couple of weeks.

For the guest, we agree that registering a guest transport should be
tied to either the hypervisor running the guest, or the existence of
the appropriate virtual hardware.

My main concern is that our existing software relies on the current
behavior of auto-loading the VMCI transport on the host side. So
changing that behavior could cause trouble for our existing Linux
users.

I’m wondering whether it would be possible to support multiple host
side transports. Since it is theoretically possible to run multiple
hypervisors at the same time, there would even be a use case for this.

If the assignment of CIDs is made unique across all transports, a
connection initiated by the host side will get directed to the
appropriate transport based on the CID. Any traffic coming from a guest
will naturally be using the appropriate transport. Host side applications
will have to share the port space as well. This would require tracking
CIDs as a common resource across all transports, but make CIDs
unique VM addresses on a given host.

If we allow multiple host side transports, virtio host side support and
vmci should be able to coexist regardless of the order of initialization.

Thanks,
Jorgen



Re: [PATCH] vsock: only load vmci transport on VMware hypervisor by default

2017-08-22 Thread Jorgen S. Hansen

> On Aug 22, 2017, at 11:54 AM, Stefan Hajnoczi  wrote:
> 
> On Fri, Aug 18, 2017 at 11:07:37PM +, Dexuan Cui wrote:
>>> Not all features need to be supported.  For example, VMCI supports
>>> SOCK_DGRAM while Hyper-V and virtio do not.  But features that are
>>> available should behave identically.
>> I totally agree, though I'm afraid Hyper-V may have a little more limitations
>> compared to VMware/KVM duo to the  <--> 
>> mapping.
>> 
 Can we use the 'protocol' parameter in the socket() function:
 int socket(int domain, int type, int protocol)
 
 IMO currently the 'protocol' is not really used.
 I think we can modify __vsock_core_init() to allow multiple transport 
 layers
>>> to
 be registered, and we can define different 'protocol' numbers for
 VMware/KVM/Hyper-V, and ask the application to explicitly specify what
>>> should
 be used. Considering compatibility, we can use the default transport in a
>>> given
 VM depending on the underlying hypervisor.
>>> 
>>> I think AF_VSOCK should hide the transport from users/applications.
>> Ideally yes, but let's consider the KVM-on-KVM nested scenario: when
>> an application in the Level-1 VM creates an AF_VSOCK socket and call
>> connect() for it, how can we know if the app is trying to connect to
>> the Level-0 host, or connect to the Level-2 VM? We can't.
> 
> We *can* by looking at the destination CID.  Please take a look at
> drivers/misc/vmw_vmci/vmci_route.c:vmci_route() to see how VMCI handles
> nested virt.
> 
> It boils down to something like this:
> 
>  static int vsock_stream_connect(struct socket *sock, struct sockaddr *addr,
>  int addr_len, int flags)
>  {
>  ...
>  if (remote_addr.svm_cid == VMADDR_CID_HOST)
>  transport = host_transport;
>  else
>  transport = guest_transport;
> 
> It's easy for connect(2) but Jorgen mentioned it's harder for listen(2)
> because the socket would need to listen on both transports.  We define
> two new constants VMADDR_CID_LISTEN_FROM_GUEST and
> VMADDR_CID_LISTEN_FROM_HOST for bind(2) so that applications can decide
> which side to listen on.

If a socket is bound to VMADDR_CID_HOST, we would consider that socket as bound 
to the host side transport, so that would be the same as 
VMADDR_CID_LISTEN_FROM_GUEST. For the guest, we have 
IOCTL_VM_SOCKETS_GET_LOCAL_CID, so that could be used to get and bind a socket 
to the guest transport (VMCI will always return the guest CID as the local one, 
if the VMCI driver is used in a guest, and it looks like virtio will do the 
same). We could treat VMADDR_CID_ANY as always being the guest transport, since 
that is the use case where you don’t know upfront what your CID is, if we don’t 
want to listen on all transports. So we would use the host transport, if a 
socket is bound to VMADDR_CID_HOST, or if there is no guest transport, and in 
all other cases use the guest transport. However, having a couple of symbolic 
names like you suggest certainly makes it more obvious, and could be used in 
combination with this. It would be a plus if existing applications would 
function as intended in most cases.

>   Or the listen socket could simply listen to
> both sides.

The only problem here would be the potential for a guest and a host app to have 
a conflict wrt port numbers, even though they would be able to operate fine, if 
restricted to their appropriate transport.

Thanks,
Jorgen

Re: [PATCH] vsock: only load vmci transport on VMware hypervisor by default

2017-08-17 Thread Jorgen S. Hansen

> On Aug 17, 2017, at 3:55 PM, Stefan Hajnoczi  wrote:
> 
> On Thu, Aug 17, 2017 at 08:00:29AM +, Dexuan Cui wrote:
>> 
>> Without the patch, vmw_vsock_vmci_transport.ko can automatically load
>> when an application creates an AF_VSOCK socket.
>> 
>> This is the expected good behavior on VMware hypervisor, but as we
>> are going to add hv_sock.ko (i.e. Hyper-V transport for AF_VSOCK), we
>> should make sure vmw_vsock_vmci_transport.ko can't load on Hyper-V,
>> otherwise there is a -EBUSY conflict when both vmw_vsock_vmci_transport.ko
>> and hv_sock.ko try to call vsock_core_init() on Hyper-V.
>> 
>> On the other hand, hv_sock.ko can only load on Hyper-V, because it
>> depends on hv_vmbus.ko, which detects Hyper-V in hv_acpi_init().
>> 
>> KVM's vsock_virtio_transport doesn't have the issue because it doesn't
>> define MODULE_ALIAS_NETPROTO(PF_VSOCK).
> 
> Thanks for sending this patch, vmci's MODULE_ALIAS_NETPROTO(PF_VSOCK) is
> a problem for vhost_vsock.ko (the virtio host driver) too.  A host
> userspace program can create a AF_VSOCK socket before vhost_vsock is
> loaded.  The vmci transport will be unconditionally loaded and that's
> not the right behavior.
> 
> Putting aside nested virtualization, I want to load the transport (vmci,
> Hyper-V, vsock) for which there is paravirtualized hardware present
> inside the guest.

Good points. Completely agree that this is the desired behavior for a guest.


> It's a little tricker on the host side (doesn't matter for Hyper-V and
> probably also doesn't for VMware) because the host-side driver is a
> software device with no hardware backing it.  In KVM we assume the
> vhost_vsock.ko kernel module will be loaded sufficiently early.

Since the vmci driver is currently tied to PF_VSOCK it hasn’t been a problem, 
but on the host side the VMCI driver has no hardware backing it either, so when 
we move to a more appropriate solution, this will be an issue for VMCI as well. 
I’ll check our shipped products, but they most likely assume that if an 
upstreamed vmci module is present, it will be loaded automatically.

> 
> Things get trickier with nested virtualization because the VM might want
> to talk to its host but also to its nested VMs.  The simple way of
> fixing this would be to allow two transports loaded simultaneously and
> route traffic destined to CID 2 to the host transport and all other
> traffic to the guest transport.

This is close to the routing the VMCI driver does in a nested environment, but 
that is with the assumption that there is only one type of transport. Having 
two different transports would require that we delay resolving the transport 
type until the socket endpoint has been bound to an address. Things get 
trickier if listening sockets use VMADDR_CID_ANY - if only one transport is 
present, this would allow the socket to accept connections from both guests and 
outer host, but with multiple transports that won’t work, since we can’t 
associate a socket with a transport until the socket is bound.
 
> 
> Perhaps we should discuss these cases a bit more to figure out how to
> avoid conflicts over MODULE_ALIAS_NETPROTO(PF_VSOCK).

Agreed.

> 
>> 
>> The patch also adds a module parameter "skip_hypervisor_check" for
>> vmw_vsock_vmci_transport.ko.
>> 
>> Signed-off-by: Dexuan Cui 
>> Cc: Alok Kataria 
>> Cc: Andy King 
>> Cc: Adit Ranadive 
>> Cc: George Zhang 
>> Cc: Jorgen Hansen 
>> Cc: K. Y. Srinivasan 
>> Cc: Haiyang Zhang 
>> Cc: Stephen Hemminger 
>> ---
>> net/vmw_vsock/Kconfig  |  2 +-
>> net/vmw_vsock/vmci_transport.c | 11 +++
>> 2 files changed, 12 insertions(+), 1 deletion(-)
>> 
>> diff --git a/net/vmw_vsock/Kconfig b/net/vmw_vsock/Kconfig
>> index a24369d..3f52929 100644
>> --- a/net/vmw_vsock/Kconfig
>> +++ b/net/vmw_vsock/Kconfig
>> @@ -17,7 +17,7 @@ config VSOCKETS
>> 
>> config VMWARE_VMCI_VSOCKETS
>>  tristate "VMware VMCI transport for Virtual Sockets"
>> -depends on VSOCKETS && VMWARE_VMCI
>> +depends on VSOCKETS && VMWARE_VMCI && HYPERVISOR_GUEST
>>  help
>>This module implements a VMCI transport for Virtual Sockets.
>> 
>> diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
>> index 10ae782..c068873 100644
>> --- a/net/vmw_vsock/vmci_transport.c
>> +++ b/net/vmw_vsock/vmci_transport.c
>> @@ -16,6 +16,7 @@
>> #include 
>> #include 
>> #include 
>> +#include 
>> #include 
>> #include 
>> #include 
>> @@ -73,6 +74,10 @@ struct vmci_transport_recv_pkt_info {
>>  struct vmci_transport_packet pkt;
>> };
>> 
>> +static bool skip_hypervisor_check;
>> +module_param(skip_hypervisor_check, bool, 0444);
>> +MODULE_PARM_DESC(hot_add, "If set, attempt to load on non-VMware 
>> platforms");
>> +
>> static LIST_HEAD(vmci_transport_cleanup_list);
>> static 

Re: [PATCH net-next 2/3] vsock: fix vsock_dequeue/enqueue_accept race

2017-08-17 Thread Jorgen S. Hansen

> On Aug 16, 2017, at 12:15 AM, Dexuan Cui  wrote:
> 
> 
> With the current code, when vsock_dequeue_accept() is removing a sock
> from the list, nothing prevents vsock_enqueue_accept() from adding a new
> sock into the list concurrently. We should add a lock to protect the list.
> 

For the VMCI socket transport, we always lock the sockets before calling into 
vsock_enqueue_accept and af_vsock.c locks the socket before calling 
vsock_dequeue_accept, so from our point of view these operations are already 
protected, but with finer granularity than a single global lock. As far as I 
can see, the virtio transport also locks the socket before calling 
vsock_enqueue_accept, so they should be fine with the current version as well, 
but Stefan can comment on that.

Thanks,
Jorgen

Re: [PATCH net-next 1/3] VMCI: only load on VMware hypervisor

2017-08-16 Thread Jorgen S. Hansen

> On Aug 16, 2017, at 12:13 AM, Dexuan Cui  wrote:
> 
> 
> Without the patch, vmw_vsock_vmci_transport.ko and vmw_vmci.ko can
> automatically load when an application creates an AF_VSOCK socket.
> 
> This is the expected good behavior on VMware hypervisor, but as we
> are going to add hv_sock.ko (i.e. Hyper-V transport for AF_VSOCK), we
> should make sure vmw_vsock_vmci_transport.ko doesn't load on Hyper-V,
> otherwise there is a -EBUSY conflict when both vmw_vsock_vmci_transport.ko
> and hv_sock.ko try to call vsock_core_init() on Hyper-V.

The VMCI driver (vmw_vmci.ko) is used both by the VMware guest support (VMware 
Tools primarily) and by our Workstation product. Always disabling the VMCI 
driver on Hyper-V means that user won’t be able to run Workstation nested in 
Linux VMs on Hyper-V. Since the VMCI driver itself isn’t the problem here, 
maybe we could move the check to vmw_vsock_vmci_transport.ko? Ideally, there 
should be some way for a user to have access to both protocols, but for now 
disabling the VMCI socket transport for Hyper-V (possibly with a module option 
to skip that check and always load it) but leaving the VMCI driver functional 
would be better,

Thanks,
Jorgen

Re: AF_VSOCK unimplemented sockopts

2017-08-11 Thread Jorgen S. Hansen
Hi Stefan,

> On Aug 3, 2017, at 12:41 PM, Stefan Hajnoczi  wrote:
> 
> Hi Jorgen,
> There are 3 sockopts defined in include/uapi/linux/vm_sockets.h that are
> currently not implemented in net/vmw_vsock/af_vsock.c:
> 
> * SO_VM_SOCKETS_PEER_HOST_VM_ID
> * SO_VM_SOCKETS_TRUSTED
> * SO_VM_SOCKETS_NONBLOCK_TXRX
> 
> I noticed this because SO_VM_SOCKETS_TRUSTED is interesting for
> virtio-vsock.  Services listening on AF_VSOCK inside the guest may not
> want arbitrary unprivileged host processes to connect.  Instead of
> inventing a new solution I wanted to look into SO_VM_SOCKETS_TRUSTED but
> found it is not implemented in linux.git.
> 
> What is the status of these sockets?

These options were only implemented for ESX host endpoints, so were never part 
of the Linux host side support. It looks like they could have been omitted from 
vm_sockets.h, when the initial upstreaming was performed.

On ESX, the equivalent of SO_VM_SOCKETS_TRUSTED, is used for retrieving the 
value of s->trusted of a VMCI socket. It cannot be used to mark a socket as 
trusted. On Linux, trusted is tied to the CAP_NET_ADMIN capability of the 
socket creator. VMCI based vSockets will per default only allow host side 
sockets that are trusted, or are created by the same user as the VM, to 
communicate with a given VM. This is achieved by per default creating VMs with 
the VMCI privilege flag VMCI_PRIVILEGE_FLAG_RESTRICTED. It is possible to 
create a VM that isn’t restricted, in which case any host process will be able 
to communicate with the VM.

So it should be straight forward to implement the getsockopt part of 
SO_VM_SOCKETS_TRUSTED, since it just needs to return s->trusted.

Thanks,
Jorgen

Re: [PATCH v5 1/3] VSOCK: Add vsockmon tap functions

2017-04-21 Thread Jorgen S. Hansen

> On Apr 21, 2017, at 11:10 AM, Stefan Hajnoczi  wrote:
> 
> From: Gerard Garcia 
> 
> Add tap functions that can be used by the vsock transports to
> deliver packets to vsockmon virtual network devices.
> 
> Signed-off-by: Gerard Garcia 
> Signed-off-by: Stefan Hajnoczi 
> ---
> v5:
> * Change vsock_deliver_tap() API to avoid unnecessary skb creation
>   [Jorgen]
> * Fix skb leak when no taps are registered [Jorgen]
> * Add af_vsock_tap.c entry in MAINTAINERS
> * checkpatch.pl fixes
> v4:
> * Call synchronize_net() before module_put() [Michael]
> v3:
> * Include missing  header in af_vsock_tap.c
> ---
> MAINTAINERS  |   1 +
> net/vmw_vsock/Makefile   |   2 +-
> include/net/af_vsock.h   |  13 +
> include/uapi/linux/if_arp.h  |   1 +
> net/vmw_vsock/af_vsock_tap.c | 114 +++
> 5 files changed, 130 insertions(+), 1 deletion(-)
> create mode 100644 net/vmw_vsock/af_vsock_tap.c
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index fdd5350..449e55f 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -13276,6 +13276,7 @@ L:netdev@vger.kernel.org
> S:Maintained
> F:include/linux/virtio_vsock.h
> F:include/uapi/linux/virtio_vsock.h
> +F:   net/vmw_vsock/af_vsock_tap.c
> F:net/vmw_vsock/virtio_transport_common.c
> F:net/vmw_vsock/virtio_transport.c
> F:drivers/vhost/vsock.c
> diff --git a/net/vmw_vsock/Makefile b/net/vmw_vsock/Makefile
> index bc27c70..09fc2eb 100644
> --- a/net/vmw_vsock/Makefile
> +++ b/net/vmw_vsock/Makefile
> @@ -3,7 +3,7 @@ obj-$(CONFIG_VMWARE_VMCI_VSOCKETS) += 
> vmw_vsock_vmci_transport.o
> obj-$(CONFIG_VIRTIO_VSOCKETS) += vmw_vsock_virtio_transport.o
> obj-$(CONFIG_VIRTIO_VSOCKETS_COMMON) += vmw_vsock_virtio_transport_common.o
> 
> -vsock-y += af_vsock.o vsock_addr.o
> +vsock-y += af_vsock.o af_vsock_tap.o vsock_addr.o
> 
> vmw_vsock_vmci_transport-y += vmci_transport.o vmci_transport_notify.o \
>   vmci_transport_notify_qstate.o
> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> index f32ed9a..f9fb566 100644
> --- a/include/net/af_vsock.h
> +++ b/include/net/af_vsock.h
> @@ -188,4 +188,17 @@ struct sock *vsock_find_connected_socket(struct 
> sockaddr_vm *src,
> void vsock_remove_sock(struct vsock_sock *vsk);
> void vsock_for_each_connected_socket(void (*fn)(struct sock *sk));
> 
> +/ TAP /
> +
> +struct vsock_tap {
> + struct net_device *dev;
> + struct module *module;
> + struct list_head list;
> +};
> +
> +int vsock_init_tap(void);
> +int vsock_add_tap(struct vsock_tap *vt);
> +int vsock_remove_tap(struct vsock_tap *vt);
> +void vsock_deliver_tap(struct sk_buff *build_skb(void *opaque), void 
> *opaque);
> +
> #endif /* __AF_VSOCK_H__ */
> diff --git a/include/uapi/linux/if_arp.h b/include/uapi/linux/if_arp.h
> index 4d024d7..cf73510 100644
> --- a/include/uapi/linux/if_arp.h
> +++ b/include/uapi/linux/if_arp.h
> @@ -95,6 +95,7 @@
> #define ARPHRD_IP6GRE 823 /* GRE over IPv6*/
> #define ARPHRD_NETLINK824 /* Netlink header   
> */
> #define ARPHRD_6LOWPAN825 /* IPv6 over LoWPAN 
> */
> +#define ARPHRD_VSOCKMON  826 /* Vsock monitor header 
> */
> 
> #define ARPHRD_VOID 0x/* Void type, nothing is known */
> #define ARPHRD_NONE 0xFFFE/* zero header length */
> diff --git a/net/vmw_vsock/af_vsock_tap.c b/net/vmw_vsock/af_vsock_tap.c
> new file mode 100644
> index 000..98f09b5
> --- /dev/null
> +++ b/net/vmw_vsock/af_vsock_tap.c
> @@ -0,0 +1,114 @@
> +/*
> + * Tap functions for AF_VSOCK sockets.
> + *
> + * Code based on net/netlink/af_netlink.c tap functions.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +
> +static DEFINE_SPINLOCK(vsock_tap_lock);
> +static struct list_head vsock_tap_all __read_mostly =
> + LIST_HEAD_INIT(vsock_tap_all);
> +
> +int vsock_add_tap(struct vsock_tap *vt)
> +{
> + if (unlikely(vt->dev->type != ARPHRD_VSOCKMON))
> + return -EINVAL;
> +
> + __module_get(vt->module);
> +
> + spin_lock(_tap_lock);
> + list_add_rcu(>list, _tap_all);
> + spin_unlock(_tap_lock);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(vsock_add_tap);
> +
> +int vsock_remove_tap(struct vsock_tap *vt)
> +{
> + struct vsock_tap *tmp;
> + bool found = false;
> +
> + spin_lock(_tap_lock);
> +
> + list_for_each_entry(tmp, _tap_all, list) {
> + if (vt == tmp) {
> + list_del_rcu(>list);
> + found = true;
> + goto 

Re: [PATCH v5 3/3] VSOCK: Add virtio vsock vsockmon hooks

2017-04-21 Thread Jorgen S. Hansen

> On Apr 21, 2017, at 11:10 AM, Stefan Hajnoczi  wrote:
> 
> From: Gerard Garcia 
> 
> The virtio drivers deal with struct virtio_vsock_pkt.  Add
> virtio_transport_deliver_tap_pkt(pkt) for handing packets to the
> vsockmon device.
> 
> We call virtio_transport_deliver_tap_pkt(pkt) from
> net/vmw_vsock/virtio_transport.c and drivers/vhost/vsock.c instead of
> common code.  This is because the drivers may drop packets before
> handing them to common code - we still want to capture them.
> 
> Signed-off-by: Gerard Garcia 
> Signed-off-by: Stefan Hajnoczi 
> ---
> v5:
> * s/cpu_to_le16(pkt->hdr.op)/le16_to_cpu(pkt->hdr.op)/ [Michael]
> * checkpatch.pl fixes
> v3:
> * Hook virtio_transport.c (guest driver), not just
>   drivers/vhost/vsock.c (host driver)
> ---
> include/linux/virtio_vsock.h|  1 +
> drivers/vhost/vsock.c   |  8 +
> net/vmw_vsock/virtio_transport.c|  3 ++
> net/vmw_vsock/virtio_transport_common.c | 64 +
> 4 files changed, 76 insertions(+)
> 
> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> index 584f9a6..ab13f07 100644
> --- a/include/linux/virtio_vsock.h
> +++ b/include/linux/virtio_vsock.h
> @@ -153,5 +153,6 @@ void virtio_transport_free_pkt(struct virtio_vsock_pkt 
> *pkt);
> void virtio_transport_inc_tx_pkt(struct virtio_vsock_sock *vvs, struct 
> virtio_vsock_pkt *pkt);
> u32 virtio_transport_get_credit(struct virtio_vsock_sock *vvs, u32 wanted);
> void virtio_transport_put_credit(struct virtio_vsock_sock *vvs, u32 credit);
> +void virtio_transport_deliver_tap_pkt(struct virtio_vsock_pkt *pkt);
> 
> #endif /* _LINUX_VIRTIO_VSOCK_H */
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index 44eed8e..d939ac1 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -176,6 +176,11 @@ vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
>   restart_tx = true;
>   }
> 
> + /* Deliver to monitoring devices all correctly transmitted
> +  * packets.
> +  */
> + virtio_transport_deliver_tap_pkt(pkt);
> +
>   virtio_transport_free_pkt(pkt);
>   }
>   if (added)
> @@ -383,6 +388,9 @@ static void vhost_vsock_handle_tx_kick(struct vhost_work 
> *work)
> 
>   len = pkt->len;
> 
> + /* Deliver to monitoring devices all received packets */
> + virtio_transport_deliver_tap_pkt(pkt);
> +
>   /* Only accept correctly addressed packets */
>   if (le64_to_cpu(pkt->hdr.src_cid) == vsock->guest_cid)
>   virtio_transport_recv_pkt(pkt);
> diff --git a/net/vmw_vsock/virtio_transport.c 
> b/net/vmw_vsock/virtio_transport.c
> index 68675a1..9dffe02 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -144,6 +144,8 @@ virtio_transport_send_pkt_work(struct work_struct *work)
>   list_del_init(>list);
>   spin_unlock_bh(>send_pkt_list_lock);
> 
> + virtio_transport_deliver_tap_pkt(pkt);
> +
>   reply = pkt->reply;
> 
>   sg_init_one(, >hdr, sizeof(pkt->hdr));
> @@ -370,6 +372,7 @@ static void virtio_transport_rx_work(struct work_struct 
> *work)
>   }
> 
>   pkt->len = len - sizeof(pkt->hdr);
> + virtio_transport_deliver_tap_pkt(pkt);
>   virtio_transport_recv_pkt(pkt);
>   }
>   } while (!virtqueue_enable_cb(vq));
> diff --git a/net/vmw_vsock/virtio_transport_common.c 
> b/net/vmw_vsock/virtio_transport_common.c
> index af087b4..18e2479 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -16,6 +16,7 @@
> #include 
> #include 
> #include 
> +#include 
> 
> #include 
> #include 
> @@ -85,6 +86,69 @@ virtio_transport_alloc_pkt(struct virtio_vsock_pkt_info 
> *info,
>   return NULL;
> }
> 
> +/* Packet capture */
> +static struct sk_buff *virtio_transport_build_skb(void *opaque)
> +{
> + struct virtio_vsock_pkt *pkt = opaque;
> + unsigned char *t_hdr, *payload;
> + struct af_vsockmon_hdr *hdr;
> + struct sk_buff *skb;
> +
> + skb = alloc_skb(sizeof(*hdr) + sizeof(pkt->hdr) + pkt->len,
> + GFP_ATOMIC);
> + if (!skb)
> + return NULL;
> +
> + hdr = (struct af_vsockmon_hdr *)skb_put(skb, sizeof(*hdr));
> +
> + /* pkt->hdr is little-endian so no need to byteswap here */
> + hdr->src_cid = pkt->hdr.src_cid;
> + hdr->src_port = pkt->hdr.src_port;
> + hdr->dst_cid = pkt->hdr.dst_cid;
> + hdr->dst_port = pkt->hdr.dst_port;
> +
> + hdr->transport = cpu_to_le16(AF_VSOCK_TRANSPORT_VIRTIO);
> + hdr->len = cpu_to_le16(sizeof(pkt->hdr));
> + memset(hdr->reserved, 0, sizeof(hdr->reserved));
> +
> +   

Re: [PATCH v4 3/3] VSOCK: Add virtio vsock vsockmon hooks

2017-04-20 Thread Jorgen S. Hansen
> 
> +/* Packet capture */
> +void virtio_transport_deliver_tap_pkt(struct virtio_vsock_pkt *pkt)
> +{
> + struct sk_buff *skb;
> + struct af_vsockmon_hdr *hdr;
> + unsigned char *t_hdr, *payload;
> +
> + skb = alloc_skb(sizeof(*hdr) + sizeof(pkt->hdr) + pkt->len,
> + GFP_ATOMIC);

So with the current API, an skb will always be allocated, even if there are no 
listeners? And we’ll copy the payload into the skb as well, if there is any. 
Would it make sense to introduce a check here (exposed as part of the vsock tap 
API), that skips all that in the case of no listeners? In the common case, 
there won’t be any listeners, so it would save some cycles.

> + if (!skb)
> + return; /* nevermind if we cannot capture the packet */
> +
> + hdr = (struct af_vsockmon_hdr *)skb_put(skb, sizeof(*hdr));
> +
> + /* pkt->hdr is little-endian so no need to byteswap here */
> + hdr->src_cid = pkt->hdr.src_cid;
> + hdr->src_port = pkt->hdr.src_port;
> + hdr->dst_cid = pkt->hdr.dst_cid;
> + hdr->dst_port = pkt->hdr.dst_port;
> +
> + hdr->transport = cpu_to_le16(AF_VSOCK_TRANSPORT_VIRTIO);
> + hdr->len = cpu_to_le16(sizeof(pkt->hdr));
> + hdr->reserved[0] = hdr->reserved[1] = 0;
> +
> + switch(cpu_to_le16(pkt->hdr.op)) {
> + case VIRTIO_VSOCK_OP_REQUEST:
> + case VIRTIO_VSOCK_OP_RESPONSE:
> + hdr->op = cpu_to_le16(AF_VSOCK_OP_CONNECT);
> + break;
> + case VIRTIO_VSOCK_OP_RST:
> + case VIRTIO_VSOCK_OP_SHUTDOWN:
> + hdr->op = cpu_to_le16(AF_VSOCK_OP_DISCONNECT);
> + break;
> + case VIRTIO_VSOCK_OP_RW:
> + hdr->op = cpu_to_le16(AF_VSOCK_OP_PAYLOAD);
> + break;
> + case VIRTIO_VSOCK_OP_CREDIT_UPDATE:
> + case VIRTIO_VSOCK_OP_CREDIT_REQUEST:
> + hdr->op = cpu_to_le16(AF_VSOCK_OP_CONTROL);
> + break;
> + default:
> + hdr->op = cpu_to_le16(AF_VSOCK_OP_UNKNOWN);
> + break;
> + }
> +
> + t_hdr = skb_put(skb, sizeof(pkt->hdr));
> + memcpy(t_hdr, >hdr, sizeof(pkt->hdr));
> +
> + if (pkt->len) {
> + payload = skb_put(skb, pkt->len);
> + memcpy(payload, pkt->buf, pkt->len);
> + }
> +
> + vsock_deliver_tap(skb);
> +}
> +EXPORT_SYMBOL_GPL(virtio_transport_deliver_tap_pkt);
> +
> static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
> struct virtio_vsock_pkt_info *info)
> {
> -- 
> 2.9.3
> 



Re: [PATCH v4 1/3] VSOCK: Add vsockmon tap functions

2017-04-20 Thread Jorgen S. Hansen

> On Apr 13, 2017, at 6:18 PM, Stefan Hajnoczi  wrote:
> 
> +
> +static void __vsock_deliver_tap(struct sk_buff *skb)
> +{
> + int ret;
> + struct vsock_tap *tmp;
> +
> + list_for_each_entry_rcu(tmp, _tap_all, list) {
> + ret = __vsock_deliver_tap_skb(skb, tmp->dev);
> + if (unlikely(ret))
> + break;
> + }
> +
> + consume_skb(skb);

It looks like the caller will allocate the skb regardless of whether 
vsock_tap_all is empty, so shouldn’t the skb be consumed in vsock_deliver_tap?

> +}
> +
> +void vsock_deliver_tap(struct sk_buff *skb)
> +{
> + rcu_read_lock();
> +
> + if (unlikely(!list_empty(_tap_all)))
> + __vsock_deliver_tap(skb);
> +
> + rcu_read_unlock();
> +}
> +EXPORT_SYMBOL_GPL(vsock_deliver_tap);
> -- 
> 2.9.3
> 



Re: [PATCH][V2] VSOCK: remove unnecessary ternary operator on return value

2017-03-30 Thread Jorgen S. Hansen

> On Mar 29, 2017, at 5:33 PM, Colin King  wrote:
> 
> From: Colin Ian King 
> 
> Rather than assign the positive errno values to ret and then
> checking if it is positive and flip the sign, just return the
> errno value.
> 
> Detected by CoverityScan, CID#986649 ("Logically Dead Code")
> 
> Signed-off-by: Colin Ian King 
> ---
> net/vmw_vsock/vmci_transport.c | 22 +++---
> 1 file changed, 7 insertions(+), 15 deletions(-)
> 
> diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
> index 4be4fbbc0b50..10ae7823a19d 100644
> --- a/net/vmw_vsock/vmci_transport.c
> +++ b/net/vmw_vsock/vmci_transport.c
> @@ -96,31 +96,23 @@ static int PROTOCOL_OVERRIDE = -1;
> 
> static s32 vmci_transport_error_to_vsock_error(s32 vmci_error)
> {
> - int err;
> -
>   switch (vmci_error) {
>   case VMCI_ERROR_NO_MEM:
> - err = ENOMEM;
> - break;
> + return -ENOMEM;
>   case VMCI_ERROR_DUPLICATE_ENTRY:
>   case VMCI_ERROR_ALREADY_EXISTS:
> - err = EADDRINUSE;
> - break;
> + return -EADDRINUSE;
>   case VMCI_ERROR_NO_ACCESS:
> - err = EPERM;
> - break;
> + return -EPERM;
>   case VMCI_ERROR_NO_RESOURCES:
> - err = ENOBUFS;
> - break;
> + return -ENOBUFS;
>   case VMCI_ERROR_INVALID_RESOURCE:
> - err = EHOSTUNREACH;
> - break;
> + return -EHOSTUNREACH;
>   case VMCI_ERROR_INVALID_ARGS:
>   default:
> - err = EINVAL;
> + break;
>   }
> -
> - return err > 0 ? -err : err;
> + return -EINVAL;
> }
> 
> static u32 vmci_transport_peer_rid(u32 peer_cid)
> -- 
> 2.11.0
> 

Reviewed-by: Jorgen Hansen 



[RESEND] Request for VSOCK backport to 4.4 stable

2017-03-01 Thread Jorgen S. Hansen
Hello,

We’d like the following commit to be backported to 4.4 stable. The commit fixes 
a bug that was introduced in 4.3 and fixed in 4.6. The bug may cause wrongful 
termination of vSocket connections, if there is more than one active vSocket 
connection in a given VM. This affects customers running the normal VMware 
services inside VMs :

commit  8ab18d71de8b07d2c4d6f984b718418c09ea45c5 

VSOCK: Detach QP check should filter out non matching QPs.

The check in vmci_transport_peer_detach_cb should only allow a
detach when the qp handle of the transport matches the one in
the detach message.

Thanks,
Jørgen

Re: [PATCH v4 0/4] vsock: cancel connect packets when failing to connect

2017-01-17 Thread Jorgen S. Hansen

> On Jan 9, 2017, at 11:06 AM, Stefan Hajnoczi  wrote:
> 
> On Fri, Jan 06, 2017 at 10:22:56AM +0800, Peng Tao wrote:
>> On Tue, Dec 13, 2016 at 5:50 PM, Stefan Hajnoczi  wrote:
>>> On Mon, Dec 12, 2016 at 08:21:05PM +0800, Peng Tao wrote:
 Currently, if a connect call fails on a signal or timeout (e.g., guest is 
 still
 in the process of starting up), we'll just return to caller and leave the 
 connect
 packet queued and they are sent even though the connection is considered a 
 failure,
 which can confuse applications with unwanted false connect attempt.
 
 The patchset enables vsock (both host and guest) to cancel queued packets 
 when
 a connect attempt is considered to fail.
 
 v4 changelog:
  - drop two unnecessary void * cast
  - update new callback commnet
 v3 changelog:
  - define cancel_pkt callback in struct vsock_transport rather than struct 
 virtio_transport
  - rename virtio_vsock_pkt->vsk to virtio_vsock_pkt->cancel_token
 v2 changelog:
  - fix queued_replies counting and resume tx/rx when necessary
 
 Cheers,
 Tao
 
 Peng Tao (4):
  vsock: track pkt owner vsock
  vhost-vsock: add pkt cancel capability
  vsock: add pkt cancel capability
  vsock: cancel packets when failing to connect
 
 drivers/vhost/vsock.c   | 41 
 
 include/linux/virtio_vsock.h|  2 ++
 include/net/af_vsock.h  |  3 +++
 net/vmw_vsock/af_vsock.c| 14 +++
 net/vmw_vsock/virtio_transport.c| 42 
 +
 net/vmw_vsock/virtio_transport_common.c |  7 ++
 6 files changed, 109 insertions(+)
 
 --
 2.7.4
 
 ___
 Virtualization mailing list
 virtualizat...@lists.linux-foundation.org
 https://lists.linuxfoundation.org/mailman/listinfo/virtualization
>>> 
>>> Reviewed-by: Stefan Hajnoczi 
>> ping~
>> 
>> It looks like the patchsets are reviewed but not merged. Is there any 
>> blocker?
> 
> If Jorgen is sending pull requests then it should go through him.
> 
> Otherwise David Miller could apply it.

We are usually not involved in applying changes from outside VMware; those are 
applied by David Miller.

Thanks,
Jørgen

Re: [PATCH v4 2/4] vhost-vsock: add pkt cancel capability

2016-12-13 Thread Jorgen S. Hansen

> On Dec 12, 2016, at 1:21 PM, Peng Tao  wrote:
> 
> To allow canceling all packets of a connection.
> 
> Reviewed-by: Stefan Hajnoczi 
> Signed-off-by: Peng Tao 
> ---
> drivers/vhost/vsock.c  | 41 +
> include/net/af_vsock.h |  3 +++
> 2 files changed, 44 insertions(+)
> 
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index a504e2e0..fef8808 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -218,6 +218,46 @@ vhost_transport_send_pkt(struct virtio_vsock_pkt *pkt)
>   return len;
> }
> 
> +static int
> +vhost_transport_cancel_pkt(struct vsock_sock *vsk)
> +{
> + struct vhost_vsock *vsock;
> + struct virtio_vsock_pkt *pkt, *n;
> + int cnt = 0;
> + LIST_HEAD(freeme);
> +
> + /* Find the vhost_vsock according to guest context id  */
> + vsock = vhost_vsock_get(vsk->remote_addr.svm_cid);
> + if (!vsock)
> + return -ENODEV;
> +
> + spin_lock_bh(>send_pkt_list_lock);
> + list_for_each_entry_safe(pkt, n, >send_pkt_list, list) {
> + if (pkt->cancel_token != vsk)
> + continue;
> + list_move(>list, );
> + }
> + spin_unlock_bh(>send_pkt_list_lock);
> +
> + list_for_each_entry_safe(pkt, n, , list) {
> + if (pkt->reply)
> + cnt++;
> + list_del(>list);
> + virtio_transport_free_pkt(pkt);
> + }
> +
> + if (cnt) {
> + struct vhost_virtqueue *tx_vq = >vqs[VSOCK_VQ_TX];
> + int new_cnt;
> +
> + new_cnt = atomic_sub_return(cnt, >queued_replies);
> + if (new_cnt + cnt >= tx_vq->num && new_cnt < tx_vq->num)
> + vhost_poll_queue(_vq->poll);
> + }
> +
> + return 0;
> +}
> +
> static struct virtio_vsock_pkt *
> vhost_vsock_alloc_pkt(struct vhost_virtqueue *vq,
> unsigned int out, unsigned int in)
> @@ -664,6 +704,7 @@ static struct virtio_transport vhost_transport = {
>   .release  = virtio_transport_release,
>   .connect  = virtio_transport_connect,
>   .shutdown = virtio_transport_shutdown,
> + .cancel_pkt   = vhost_transport_cancel_pkt,
> 
>   .dgram_enqueue= virtio_transport_dgram_enqueue,
>   .dgram_dequeue= virtio_transport_dgram_dequeue,
> diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
> index f275896..f32ed9a 100644
> --- a/include/net/af_vsock.h
> +++ b/include/net/af_vsock.h
> @@ -100,6 +100,9 @@ struct vsock_transport {
>   void (*destruct)(struct vsock_sock *);
>   void (*release)(struct vsock_sock *);
> 
> + /* Cancel all pending packets sent on vsock. */
> + int (*cancel_pkt)(struct vsock_sock *vsk);
> +
>   /* Connections. */
>   int (*connect)(struct vsock_sock *);
> 
> -- 
> 2.7.4
> 

Reviewed-by: Jorgen Hansen 



Re: [PATCH v3 4/4] vsock: cancel packets when failing to connect

2016-12-12 Thread Jorgen S. Hansen

> On Dec 8, 2016, at 6:12 PM, Peng Tao  wrote:
> 
> Otherwise we'll leave the packets queued until releasing vsock device.
> E.g., if guest is slow to start up, resulting ETIMEDOUT on connect, guest
> will get the connect requests from failed host sockets.
> 
> Reviewed-by: Stefan Hajnoczi 
> Signed-off-by: Peng Tao 
> ---
> net/vmw_vsock/af_vsock.c | 14 ++
> 1 file changed, 14 insertions(+)
> 
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index 8a398b3..c73b03a 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -1101,10 +1101,19 @@ static const struct proto_ops vsock_dgram_ops = {
>   .sendpage = sock_no_sendpage,
> };
> 
> +static int vsock_transport_cancel_pkt(struct vsock_sock *vsk)
> +{
> + if (!transport->cancel_pkt)
> + return -EOPNOTSUPP;
> +
> + return transport->cancel_pkt(vsk);
> +}
> +
> static void vsock_connect_timeout(struct work_struct *work)
> {
>   struct sock *sk;
>   struct vsock_sock *vsk;
> + int cancel = 0;
> 
>   vsk = container_of(work, struct vsock_sock, dwork.work);
>   sk = sk_vsock(vsk);
> @@ -1115,8 +1124,11 @@ static void vsock_connect_timeout(struct work_struct 
> *work)
>   sk->sk_state = SS_UNCONNECTED;
>   sk->sk_err = ETIMEDOUT;
>   sk->sk_error_report(sk);
> + cancel = 1;
>   }
>   release_sock(sk);
> + if (cancel)
> + vsock_transport_cancel_pkt(vsk);
> 
>   sock_put(sk);
> }
> @@ -1223,11 +1235,13 @@ static int vsock_stream_connect(struct socket *sock, 
> struct sockaddr *addr,
>   err = sock_intr_errno(timeout);
>   sk->sk_state = SS_UNCONNECTED;
>   sock->state = SS_UNCONNECTED;
> + vsock_transport_cancel_pkt(vsk);
>   goto out_wait;
>   } else if (timeout == 0) {
>   err = -ETIMEDOUT;
>   sk->sk_state = SS_UNCONNECTED;
>   sock->state = SS_UNCONNECTED;
> + vsock_transport_cancel_pkt(vsk);
>   goto out_wait;
>   }
> 
> -- 
> 2.7.4
> 

This looks fine to me:

Reviewed-by: Jorgen Hansen 



Re: [PATCH v3 2/4] vhost-vsock: add pkt cancel capability

2016-12-12 Thread Jorgen S. Hansen

> On Dec 8, 2016, at 6:12 PM, Peng Tao  wrote:
> 
> --- a/include/net/af_vsock.h
> +++ b/include/net/af_vsock.h
> @@ -100,6 +100,9 @@ struct vsock_transport {
>   void (*destruct)(struct vsock_sock *);
>   void (*release)(struct vsock_sock *);
> 
> + /* Cancel packets belonging the same vsock */

How about “/* Cancel all pending packets sent on vsock. */“ ? 

> + int (*cancel_pkt)(struct vsock_sock *vsk);
> +
>   /* Connections. */
>   int (*connect)(struct vsock_sock *);
> 
> -- 
> 2.7.4
> 

Thanks,
Jørgen

Re: AF_VSOCK network namespace support

2016-11-28 Thread Jorgen S. Hansen
Hi Stefan,

> On Nov 23, 2016, at 3:55 PM, Stefan Hajnoczi  wrote:
> 
> Hi Jorgen,
> There are two use cases where network namespace support in AF_VSOCK
> could be useful:
> 
> 1. Claudio Imbrenda pointed out that a machine cannot act as both host
>   and guest at the same time.  This is necessary for nested
>   virtualization.  Currently only one transport (the host side or the
>   guest side) can be registered at a time.

VMCI based AF_VSOCK relies on the VMCI driver for nested virtualization 
support. The VMCI driver is a combined host/guest driver with a routing 
component, that will either direct traffic to VMs managed by the host 
“personality” of the driver, or to the outer host. So any VMCI driver driver is 
able to function simultaneously as both a guest and a host driver - exactly to 
be able to support nested virtualization.

Since, for VMCI based vSocket, the host has a fixed CID (2), any traffic 
generated by an application inside a VM destined for CID 2 will be routed out 
of the VM (to the host - either a virtual or physical one). Any traffic for a 
CID > 2 will be directed towards VMs managed by the host personality of the 
VMCI driver.

Since VMCI predates nested virtualization, the solution above was partly a 
result of having to support existing configurations in a transparent way.

> 2. Users may wish to isolate the AF_VSOCK address namespace so that two
>   VMs have completely independent CID and ports (they could even use
>   the same CID and ports because they're in separate namespaces).  This
>   ensures that a host service visible to VM1 is not automatically
>   visible to VM2.

If the goal is to provide fine grained service access control, won’t this end 
up requiring a namespace per VM? For ESX, we have a mechanism to tag VMs that 
allows them to be granted access to a service offered through AF_VSOCK, but 
this is not part of the Linux hypervisor.

If the intent is to be able to support multi tenancy, then this sounds like a 
better fit. Also, in the multi tenancy case, isolating the other AFs is 
probably what you want as well.

> Network namespaces could solve both problems.
> 
> A drawback of namespaces is that existing configurations using network
> namespaces for IPv4/6 or other purposes break if AF_VSOCK gains network
> namespace support.  This is not a big problem for virtio-vsock if we
> implement namespace support soon since there are no existing users.
> 
> I wonder how other address families have solved this transition to
> network namespaces.  It's almost like we need fine-grained namespaces
> instead of a blanket network namespace that applies across all address
> families...
> 
> I'm playing around with the code now but wanted to get your thoughts in
> case you've already considered these problems.
> 
> Stefan

Thanks,
Jørgen

Re: [PATCH] VSOCK: add loopback to virtio_transport

2016-11-21 Thread Jorgen S. Hansen
Hi Stefan,

That should make it on par with the VMCI transport.

Thanks,
Jørgen


From: Stefan Hajnoczi <stefa...@redhat.com>
Sent: Thursday, November 17, 2016 4:49 PM
To: netdev@vger.kernel.org
Cc: cav...@redhat.com; Claudio Imbrenda; Jorgen S. Hansen; David S . Miller; 
Stefan Hajnoczi
Subject: [PATCH] VSOCK: add loopback to virtio_transport

The VMware VMCI transport supports loopback inside virtual machines.
This patch implements loopback for virtio-vsock.

Flow control is handled by the virtio-vsock protocol as usual.  The
sending process stops transmitting on a connection when the peer's
receive buffer space is exhausted.

Cathy Avery <cav...@redhat.com> noticed this difference between VMCI and
virtio-vsock when a test case using loopback failed.  Although loopback
isn't the main point of AF_VSOCK, it is useful for testing and
virtio-vsock must match VMCI semantics so that userspace programs run
regardless of the underlying transport.

My understanding is that loopback is not supported on the host side with
VMCI.  Follow that by implementing it only in the guest driver, not the
vhost host driver.

Cc: Jorgen Hansen <jhan...@vmware.com>
Reported-by: Cathy Avery <cav...@redhat.com>
Signed-off-by: Stefan Hajnoczi <stefa...@redhat.com>
---
 net/vmw_vsock/virtio_transport.c | 57 
 1 file changed, 57 insertions(+)

diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 936d7ee..f2c4071 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -44,6 +44,10 @@ struct virtio_vsock {
spinlock_t send_pkt_list_lock;
struct list_head send_pkt_list;

+   struct work_struct loopback_work;
+   spinlock_t loopback_list_lock;
+   struct list_head loopback_list;
+
atomic_t queued_replies;

/* The following fields are protected by rx_lock.  vqs[VSOCK_VQ_RX]
@@ -74,6 +78,42 @@ static u32 virtio_transport_get_local_cid(void)
return vsock->guest_cid;
 }

+static void virtio_transport_loopback_work(struct work_struct *work)
+{
+   struct virtio_vsock *vsock =
+   container_of(work, struct virtio_vsock, loopback_work);
+   LIST_HEAD(pkts);
+
+   spin_lock_bh(>loopback_list_lock);
+   list_splice_init(>loopback_list, );
+   spin_unlock_bh(>loopback_list_lock);
+
+   mutex_lock(>rx_lock);
+   while (!list_empty()) {
+   struct virtio_vsock_pkt *pkt;
+
+   pkt = list_first_entry(, struct virtio_vsock_pkt, list);
+   list_del_init(>list);
+
+   virtio_transport_recv_pkt(pkt);
+   }
+   mutex_unlock(>rx_lock);
+}
+
+static int virtio_transport_send_pkt_loopback(struct virtio_vsock *vsock,
+ struct virtio_vsock_pkt *pkt)
+{
+   int len = pkt->len;
+
+   spin_lock_bh(>loopback_list_lock);
+   list_add_tail(>list, >loopback_list);
+   spin_unlock_bh(>loopback_list_lock);
+
+   queue_work(virtio_vsock_workqueue, >loopback_work);
+
+   return len;
+}
+
 static void
 virtio_transport_send_pkt_work(struct work_struct *work)
 {
@@ -159,6 +199,10 @@ virtio_transport_send_pkt(struct virtio_vsock_pkt *pkt)
return -ENODEV;
}

+   if (le32_to_cpu(pkt->hdr.dst_cid) == vsock->guest_cid) {
+   return virtio_transport_send_pkt_loopback(vsock, pkt);
+   }
+
if (pkt->reply)
atomic_inc(>queued_replies);

@@ -510,10 +554,13 @@ static int virtio_vsock_probe(struct virtio_device *vdev)
mutex_init(>event_lock);
spin_lock_init(>send_pkt_list_lock);
INIT_LIST_HEAD(>send_pkt_list);
+   spin_lock_init(>loopback_list_lock);
+   INIT_LIST_HEAD(>loopback_list);
INIT_WORK(>rx_work, virtio_transport_rx_work);
INIT_WORK(>tx_work, virtio_transport_tx_work);
INIT_WORK(>event_work, virtio_transport_event_work);
INIT_WORK(>send_pkt_work, virtio_transport_send_pkt_work);
+   INIT_WORK(>loopback_work, virtio_transport_loopback_work);

mutex_lock(>rx_lock);
virtio_vsock_rx_fill(vsock);
@@ -539,6 +586,7 @@ static void virtio_vsock_remove(struct virtio_device *vdev)
struct virtio_vsock *vsock = vdev->priv;
struct virtio_vsock_pkt *pkt;

+   flush_work(>loopback_work);
flush_work(>rx_work);
flush_work(>tx_work);
flush_work(>event_work);
@@ -565,6 +613,15 @@ static void virtio_vsock_remove(struct virtio_device *vdev)
}
spin_unlock_bh(>send_pkt_list_lock);

+   spin_lock_bh(>loopback_list_lock);
+   while (!list_empty(>loopback_list)) {
+   pkt = list_first_entry(>loopback_list,
+  struct virtio_vsock_pkt,

Re: AF_VSOCK loopback

2016-11-11 Thread Jorgen S. Hansen
Hi Stefan,

All datagram communication in VMCI based AF_VSOCK is going through the host - 
also for loopback communication. The only difference wrt loopback is that the 
VMCI queue pairs implementing the shared queues for the stream protocols aren't 
registered with the hypervisor - they are created specifying the 
VMCI_QPFLAG_LOCAL flag, and exist only as local guest memory.

So in the current form, there isn't much loopback code in the vmci AF_VSOCK 
implementation, so it doesn't seem like there would be much to share either.

Thanks,
Jørgen


From: Stefan Hajnoczi <stefa...@redhat.com>
Sent: Thursday, November 10, 2016 3:43 PM
To: Jorgen S. Hansen
Cc: cav...@redhat.com; netdev@vger.kernel.org
Subject: AF_VSOCK loopback

Hi Jorgen,
Cathy Avery found that the AF_VSOCK VMCI transport does loopback inside
the guest (but not on the host?).  The virtio transport currently does
no loopback.

The loopback scenario I'm thinking of is where process A listens on port
1234 and process B on the same machine connects to port 1234 both with
the same CID.

I'd like to make the virtio transport compatible with VMCI transport
semantics so AF_VSOCK behaves the same regardless of the transport.
This means loopback must be added to virtio-vsock.

The core net/vmware/af_vsock.c code does not implement loopback.  How
does VMCI do loopback?  Are the loopback packets reflected back from the
host?  Or does the guest driver notice the loopback and avoid passing
packets to the host in the first place?

Maybe we can make the loopback code common in af_vsock.c if that avoids
code duplication.

Thanks,
Stefan


Request for backport to 4.4 stable

2016-09-29 Thread Jorgen S. Hansen
Hello,

It would be great if the following commit could be backported to 4.4 stable. 
The commit fixes a bug that was introduced in 4.3 and fixed in 4.6. The bug may 
cause wrongful termination of vSocket connections, if there is more than one 
active vSocket connection in a given VM. This affects VMware services running 
inside VMs:

commit  8ab18d71de8b07d2c4d6f984b718418c09ea45c5 

VSOCK: Detach QP check should filter out non matching QPs.

The check in vmci_transport_peer_detach_cb should only allow a
detach when the qp handle of the transport matches the one in
the detach message.

Thanks,
Jørgen


Re: [PATCH] vsock: make listener child lock ordering explicit

2016-06-24 Thread Jorgen S. Hansen

> On Jun 23, 2016, at 5:28 PM, Stefan Hajnoczi  wrote:
> 
> There are several places where the listener and pending or accept queue
> child sockets are accessed at the same time.  Lockdep is unhappy that
> two locks from the same class are held.
> 
> Tell lockdep that it is safe and document the lock ordering.
> 
> Originally Claudio Imbrenda  sent a similar
> patch asking whether this is safe.  I have audited the code and also
> covered the vsock_pending_work() function.
> 
> Suggested-by: Claudio Imbrenda 
> Signed-off-by: Stefan Hajnoczi 
> ---
> net/vmw_vsock/af_vsock.c | 12 ++--
> 1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
> index b5f1221..b96ac91 100644
> --- a/net/vmw_vsock/af_vsock.c
> +++ b/net/vmw_vsock/af_vsock.c
> @@ -61,6 +61,14 @@
>  * function will also cleanup rejected sockets, those that reach the connected
>  * state but leave it before they have been accepted.
>  *
> + * - Lock ordering for pending or accept queue sockets is:
> + *
> + * lock_sock(listener);
> + * lock_sock_nested(pending, SINGLE_DEPTH_NESTING);
> + *
> + * Using explicit nested locking keeps lockdep happy since normally only one
> + * lock of a given class may be taken at a time.
> + *
>  * - Sockets created by user action will be cleaned up when the user process
>  * calls close(2), causing our release implementation to be called. Our 
> release
>  * implementation will perform some cleanup then drop the last reference so 
> our
> @@ -443,7 +451,7 @@ void vsock_pending_work(struct work_struct *work)
>   cleanup = true;
> 
>   lock_sock(listener);
> - lock_sock(sk);
> + lock_sock_nested(sk, SINGLE_DEPTH_NESTING);
> 
>   if (vsock_is_pending(sk)) {
>   vsock_remove_pending(listener, sk);
> @@ -1292,7 +1300,7 @@ static int vsock_accept(struct socket *sock, struct 
> socket *newsock, int flags)
>   if (connected) {
>   listener->sk_ack_backlog--;
> 
> - lock_sock(connected);
> + lock_sock_nested(connected, SINGLE_DEPTH_NESTING);
>   vconnected = vsock_sk(connected);
> 
>   /* If the listener socket has received an error, then we should
> -- 
> 2.7.4
> 

Looks good to me - thanks for fixing this!

/jsh



Re: vmw_vsock sk_ack_backlog double decrement bug

2016-06-24 Thread Jorgen S. Hansen
Hi Stefan,

Good catch. Thanks for pointing this out. I'll take care of fixing and testing 
this.
 
 Thanks,
 Jørgen


From: Stefan Hajnoczi <stefa...@redhat.com>
Sent: Thursday, June 23, 2016 5:40 PM
To: Jorgen S. Hansen
Cc: netdev@vger.kernel.org
Subject: vmw_vsock sk_ack_backlog double decrement bug

Hi Jorgen,
virtio-vsock doesn't use vsock_pending_work() but I may have spotted a
problem that affects the VMCI transport.  I'm not sending a patch
because I can't test it.

1. During vsock_accept() listener->sk_ack_backlog is decremented.
2. vsock_pending_work() will decrement listener->sk_ack_backlog again if
   vsk->rejected.

The result is that sk_ack_backlog can be invalid.  It only happens in
the case where the listener socket has an error.  Maybe in practice it's
not a problem because the server application will close the listener
socket if there is an error...

Stefan


Re: [RFC 1/3] vsockmon: Add tap functions.

2016-06-14 Thread Jorgen S. Hansen
On 6/10/16, 5:44 PM, "Stefan Hajnoczi"  wrote:



>On Thu, Jun 09, 2016 at 05:02:47PM +0200, Gerard Garcia wrote:
>> On 06/01/2016 11:07 PM, Stefan Hajnoczi wrote:
>> > On Sat, May 28, 2016 at 06:29:05PM +0200, ggar...@abra.uab.cat wrote:
>> > > diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
>> > > index 6b158ab..ec7a05d 100644
>> > > --- a/net/vmw_vsock/af_vsock.c
>> > > +++ b/net/vmw_vsock/af_vsock.c
>> > > @@ -96,6 +96,7 @@
>> > >   #include 
>> > >   #include 
>> > >   #include 
>> > > +#include 
>> > >   #include 
>> > >   #include 
>> > > @@ -2012,6 +2013,110 @@ const struct vsock_transport 
>> > > *vsock_core_get_transport(void)
>> > >   }
>> > >   EXPORT_SYMBOL_GPL(vsock_core_get_transport);
>> > > +/ TAP /
>> > Feel free to put this in a separate source file.  The Kbuild can link
>> > multiple objects into a single kernel module.  That would be cleaner
>> > than using a big comment to separate it from af_vsock.c code.
>> I'm following the af_vsock.c style, where different logic is separated using
>> this style of comments. It is not a lot of code
>> so I thought it would be cleaner to have it in the same file.
>
>It's up to the af_vsock.c maintainer, but if we keep appending
>independent chunks of code to one file it becomes hard to manage and
>chances of conflicts during patch merging increases.

I agree with Stefan - af_vsock.c is already undesirably large, so it would be 
great if you could make this a separate file.