[PATCH v12 net-next 1/1] hv_sock: introduce Hyper-V Sockets

2016-06-24 Thread Dexuan Cui
Hyper-V Sockets (hv_sock) supplies a byte-stream based communication
mechanism between the host and the guest. It's somewhat like TCP over
VMBus, but the transport layer (VMBus) is much simpler than IP.

With Hyper-V Sockets, applications on the host and in the guest can talk
to each other directly using the traditional BSD-style socket APIs.

Hyper-V Sockets is only available on new Windows hosts, like Windows Server
2016. More info is in this article "Make your own integration services":
https://msdn.microsoft.com/en-us/virtualization/hyperv_on_windows/develop/make_mgmt_service

The patch implements the necessary support on the guest side by introducing
a new socket address family, AF_HYPERV.

Signed-off-by: Dexuan Cui 
Cc: "K. Y. Srinivasan" 
Cc: Haiyang Zhang 
Cc: Vitaly Kuznetsov 
Cc: Cathy Avery 
---

You can also get the patch here:
https://github.com/dcui/linux/commits/decui/hv_sock/net-next/20160620_v12

For the change log before v12, please see https://lkml.org/lkml/2016/5/15/31


In v12, the changes are mainly the following:

1) Remove the module params as David suggested.

2) Use exactly 5 pages each for the VMBus send and recv rings.
The host side's design of the feature requires exactly 5 pages for each of
the recv/send rings. This is suboptimal in terms of memory consumption, but
unfortunately we have to live with it until the host comes up with
a new design in the future. :-(

3) Remove the per-connection static send/recv buffers.
Instead, we allocate and free the buffers dynamically only when we recv/send
data. This means that when a connection is idle, no memory is consumed as
recv/send buffers at all.

Looking forward to your comments!

 MAINTAINERS                 |    2 +
 include/linux/hyperv.h      |   14 +
 include/linux/socket.h      |    4 +-
 include/net/af_hvsock.h     |   59 ++
 include/uapi/linux/hyperv.h |   25 +
 net/Kconfig                 |    1 +
 net/Makefile                |    1 +
 net/hv_sock/Kconfig         |   10 +
 net/hv_sock/Makefile        |    3 +
 net/hv_sock/af_hvsock.c     | 1514 +++
 10 files changed, 1632 insertions(+), 1 deletion(-)

diff --git a/MAINTAINERS b/MAINTAINERS
index 50f69ba..6eaa26f 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -5514,7 +5514,9 @@ F:drivers/pci/host/pci-hyperv.c
 F: drivers/net/hyperv/
 F: drivers/scsi/storvsc_drv.c
 F: drivers/video/fbdev/hyperv_fb.c
+F: net/hv_sock/
 F: include/linux/hyperv.h
+F: include/net/af_hvsock.h
 F: tools/hv/
 F: Documentation/ABI/stable/sysfs-bus-vmbus
 
diff --git a/include/linux/hyperv.h b/include/linux/hyperv.h
index 50f493e..95d159e 100644
--- a/include/linux/hyperv.h
+++ b/include/linux/hyperv.h
@@ -1509,4 +1509,18 @@ static inline void commit_rd_index(struct vmbus_channel *channel)
 }
 
 
+struct vmpipe_proto_header {
+   u32 pkt_type;
+   u32 data_size;
+};
+
+#define HVSOCK_HEADER_LEN  (sizeof(struct vmpacket_descriptor) + \
+sizeof(struct vmpipe_proto_header))
+
+/* See 'prev_indices' in hv_ringbuffer_read(), hv_ringbuffer_write() */
+#define PREV_INDICES_LEN   (sizeof(u64))
+
+#define HVSOCK_PKT_LEN(payload_len)    (HVSOCK_HEADER_LEN + \
+                                        ALIGN((payload_len), 8) + \
+                                        PREV_INDICES_LEN)
 #endif /* _HYPERV_H */
diff --git a/include/linux/socket.h b/include/linux/socket.h
index b5cc5a6..0b68b58 100644
--- a/include/linux/socket.h
+++ b/include/linux/socket.h
@@ -202,8 +202,9 @@ struct ucred {
 #define AF_VSOCK   40  /* vSockets */
 #define AF_KCM 41  /* Kernel Connection Multiplexor*/
 #define AF_QIPCRTR 42  /* Qualcomm IPC Router  */
+#define AF_HYPERV  43  /* Hyper-V Sockets  */
 
-#define AF_MAX 43  /* For now.. */
+#define AF_MAX 44  /* For now.. */
 
 /* Protocol families, same as address families. */
 #define PF_UNSPEC  AF_UNSPEC
@@ -251,6 +252,7 @@ struct ucred {
 #define PF_VSOCK   AF_VSOCK
 #define PF_KCM AF_KCM
 #define PF_QIPCRTR AF_QIPCRTR
+#define PF_HYPERV  AF_HYPERV
 #define PF_MAX AF_MAX
 
 /* Maximum queue length specifiable by listen.  */
diff --git a/include/net/af_hvsock.h b/include/net/af_hvsock.h
new file mode 100644
index 000..20d23d5
--- /dev/null
+++ b/include/net/af_hvsock.h
@@ -0,0 +1,59 @@
+#ifndef __AF_HVSOCK_H__
+#define __AF_HVSOCK_H__
+
+#include 
+#include 
+#include 
+
+/* The host side's design of the feature requires 5 exact pages for recv/send
+ * rings respectively -- this is suboptimal considering memory consumption,
+ * however unluckily we have to live with it, before the host comes up with
+ * a better new design in the future.
+ */
+#define RINGBUFFER_HVSOCK_RCV_SIZE (PAGE_SIZE * 5)
+#define RINGBUFFER_HVSOCK_SND_SIZE (PAGE_SIZE * 5)
+
+#define sk_to_hvsock(__sk)   ((struct hvsock_sock *)(__sk))
+#define hvsock_to_s
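
For illustration only (not part of the patch): the HVSOCK_PKT_LEN() macro
in the include/linux/hyperv.h hunk above can be evaluated in userspace with
stand-in types. The layout of struct vmpacket_descriptor below (16 bytes) is
an assumption taken from the kernel header, not something shown in this
posting; ALIGN_UP() and hvsock_pkt_len() are hypothetical names used only in
this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout of the VMBus packet descriptor (16 bytes). */
struct vmpacket_descriptor {
	uint16_t type;
	uint16_t offset8;
	uint16_t len8;
	uint16_t flags;
	uint64_t trans_id;
};

/* Same fields as the struct added by the patch, with userspace types. */
struct vmpipe_proto_header {
	uint32_t pkt_type;
	uint32_t data_size;
};

static_assert(sizeof(struct vmpacket_descriptor) == 16, "assumed size");
static_assert(sizeof(struct vmpipe_proto_header) == 8, "assumed size");

/* Userspace replacement for the kernel's ALIGN(); 'a' must be a power of 2. */
#define ALIGN_UP(x, a)	(((x) + ((a) - 1)) & ~((size_t)(a) - 1))

#define HVSOCK_HEADER_LEN	(sizeof(struct vmpacket_descriptor) + \
				 sizeof(struct vmpipe_proto_header))
#define PREV_INDICES_LEN	(sizeof(uint64_t))
#define HVSOCK_PKT_LEN(payload_len)	(HVSOCK_HEADER_LEN + \
					 ALIGN_UP((size_t)(payload_len), 8) + \
					 PREV_INDICES_LEN)

/* Ring-buffer bytes occupied by one packet carrying payload_len bytes. */
size_t hvsock_pkt_len(size_t payload_len)
{
	return HVSOCK_PKT_LEN(payload_len);
}
```

Under these assumptions a 1-byte payload occupies 40 bytes of ring space:
24 bytes of headers, 8 bytes of padded payload, and 8 bytes for the
previous-indices word.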

Re: [PATCH v12 net-next 1/1] hv_sock: introduce Hyper-V Sockets

2016-06-28 Thread David Miller
From: Dexuan Cui 
Date: Fri, 24 Jun 2016 07:45:24 +

> + while ((ret = vmalloc(size)) == NULL)
> + ssleep(1);

This is completely, and entirely, unacceptable.

If the allocation fails, you return an error and release
your resources.

You don't just loop forever waiting for it to succeed.
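
A minimal userspace sketch of the pattern described above, assuming a
temporary buffer that is allocated per call: if the allocation fails, the
call returns -ENOMEM at once instead of sleeping and retrying. The names
copy_via_temp_buffer(), alloc_fn and always_fail_alloc() are hypothetical
and do not come from the patch; malloc() stands in for vmalloc().

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Allocator signature, injectable so the failure path can be exercised. */
typedef void *(*alloc_fn)(size_t size);

/* Helper for demonstration: an allocator that always fails. */
void *always_fail_alloc(size_t size)
{
	(void)size;
	return NULL;
}

/*
 * Copy len bytes from src to dst through a temporary buffer.
 * Returns the number of bytes copied, or -ENOMEM if the temporary
 * buffer cannot be allocated. There is no retry loop.
 */
long copy_via_temp_buffer(alloc_fn alloc, void *dst, const void *src,
			  size_t len)
{
	void *tmp;

	if (len == 0)
		return 0;

	tmp = alloc(len);
	if (!tmp)
		return -ENOMEM;	/* fail the call and let the caller decide */

	memcpy(tmp, src, len);	/* stand-in for reading from the ring */
	memcpy(dst, tmp, len);
	free(tmp);		/* free(NULL-safe); tmp came from alloc() */
	return (long)len;
}
```

Calling it with malloc copies the data; calling it with always_fail_alloc
returns -ENOMEM without touching dst.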


RE: [PATCH v12 net-next 1/1] hv_sock: introduce Hyper-V Sockets

2016-06-28 Thread Dexuan Cui
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Tuesday, June 28, 2016 17:34
> 
> From: Dexuan Cui 
> Date: Fri, 24 Jun 2016 07:45:24 +
> 
> > +   while ((ret = vmalloc(size)) == NULL)
> > +   ssleep(1);
> 
> This is completely, and entirely, unacceptable.
> 
> If the allocation fails, you return an error and release
> your resources.
> 
> You don't just loop forever waiting for it to succeed.

Hi David,
I agree this is ugly...

The idea here is: IMO the syscalls sys_read()/write() shouldn't return
-ENOMEM, so I have to make sure the buffer allocation succeeds?

I tried to use kmalloc with __GFP_NOFAIL, but I hit a warning
in mm/page_alloc.c:
WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));

What error code do you think I should return? 
EAGAIN, ERESTARTSYS, or something else?

May I have your suggestion? Thanks!

-- Dexuan
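
For illustration only: the WARN_ON_ONCE() condition quoted above depends on
the page-allocation order of the request. The sketch below assumes 4 KiB
pages and a request that ends up in the page allocator; order_for_size() and
nofail_would_warn() are hypothetical helper names, not kernel functions.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumption: 4 KiB pages */

/*
 * Smallest order such that (SKETCH_PAGE_SIZE << order) >= size.
 * The loop stops doubling before 'span' could overflow, so the result
 * is only meaningful for sizes up to SIZE_MAX / 2.
 */
unsigned int order_for_size(size_t size)
{
	unsigned int order = 0;
	size_t span = SKETCH_PAGE_SIZE;

	while (span < size && span <= SIZE_MAX / 2) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Mirrors the quoted check: __GFP_NOFAIL combined with order > 1 warns. */
int nofail_would_warn(size_t size)
{
	return order_for_size(size) > 1;
}
```

Under these assumptions any request larger than two pages (8192 bytes) has
order 2 or higher and would trip the warning; a 5-page request has order 3.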



Re: [PATCH v12 net-next 1/1] hv_sock: introduce Hyper-V Sockets

2016-06-28 Thread David Miller
From: Dexuan Cui 
Date: Tue, 28 Jun 2016 09:59:21 +

> The idea here is: IMO the syscalls sys_read()/write() shouldn't return
> -ENOMEM, so I have to make sure the buffer allocation succeeds?

You have to fail if resources cannot be allocated.


RE: [PATCH v12 net-next 1/1] hv_sock: introduce Hyper-V Sockets

2016-06-28 Thread Dexuan Cui
> From: David Miller [mailto:da...@davemloft.net]
> Sent: Tuesday, June 28, 2016 21:45
> 
> From: Dexuan Cui 
> Date: Tue, 28 Jun 2016 09:59:21 +
> 
> > The idea here is: IMO the syscalls sys_read()/write() shouldn't return
> > -ENOMEM, so I have to make sure the buffer allocation succeeds?
> 
> You have to fail if resources cannot be allocated.

OK, I'll try to fix this, probably by returning -EAGAIN or -ERESTARTSYS.

I'll report back ASAP.

Thanks,
-- Dexuan


Re: [PATCH v12 net-next 1/1] hv_sock: introduce Hyper-V Sockets

2016-06-28 Thread Rick Jones

On 06/28/2016 02:59 AM, Dexuan Cui wrote:
> The idea here is: IMO the syscalls sys_read()/write() shouldn't return
> -ENOMEM, so I have to make sure the buffer allocation succeeds?
>
> I tried to use kmalloc with __GFP_NOFAIL, but I hit a warning
> in mm/page_alloc.c:
> WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
>
> What error code do you think I should return?
> EAGAIN, ERESTARTSYS, or something else?
>
> May I have your suggestion? Thanks!


What happens as far as errno is concerned when an application makes a
read() call against a (say TCP) socket associated with a connection
which has been reset?  Is it limited to those errno values listed in the
read() manpage, or does it end up getting an errno value from those
listed in the recv() manpage?  Or, perhaps even one not (presently)
listed in either?


rick jones



RE: [PATCH v12 net-next 1/1] hv_sock: introduce Hyper-V Sockets

2016-06-29 Thread Dexuan Cui
> From: Rick Jones [mailto:rick.jon...@hpe.com]
> Sent: Tuesday, June 28, 2016 23:43
> 
> On 06/28/2016 02:59 AM, Dexuan Cui wrote:
> > The idea here is: IMO the syscalls sys_read()/write() shouldn't return
> > -ENOMEM, so I have to make sure the buffer allocation succeeds?
> >
> > I tried to use kmalloc with __GFP_NOFAIL, but I hit a warning
> > in mm/page_alloc.c:
> > WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> >
> > What error code do you think I should return?
> > EAGAIN, ERESTARTSYS, or something else?
> >
> > May I have your suggestion? Thanks!
> 
> What happens as far as errno is concerned when an application makes a
> read() call against a (say TCP) socket associated with a connection
> which has been reset? 
I suppose it is ECONNRESET (Connection reset by peer).

>  Is it limited to those errno values listed in the
> read() manpage, or does it end up getting an errno value from those
> listed in the recv() manpage?  Or, perhaps even one not (presently)
> listed in either?
> 
> rick jones

Actually "man read/write" says "Other errors may occur, depending on the
object connected to fd".

"man send/recv" indeed lists ENOMEM.

Considering AF_HYPERV is a new socket type, ENOMEM seems OK to me
and I'm going to post a new version of the patch.

In the long run, I think we should add a new API in the VMBus driver that
allows data to be copied from the VMBus ring buffer directly into a
user-mode buffer. This way, we can even eliminate this temporary buffer.

Thanks,
-- Dexuan
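
For illustration only: from the application's point of view, an ENOMEM
failure would surface like any other errno value documented for recv(). The
sketch below is a generic caller-side wrapper, not code from the patch;
recv_checked() is a hypothetical name, and nothing in it is specific to
AF_HYPERV.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

/*
 * Receive up to len bytes from fd.
 * Returns the number of bytes received, 0 on orderly shutdown, or a
 * negative errno value (for example -ENOMEM or -ECONNRESET) on failure.
 * Only EINTR is retried; every other error is handed to the caller.
 */
ssize_t recv_checked(int fd, void *buf, size_t len)
{
	ssize_t n;

	do {
		n = recv(fd, buf, len, 0);
	} while (n < 0 && errno == EINTR);

	if (n < 0)
		return -errno;
	return n;
}
```

A caller can then decide per error code whether to retry later, close the
connection, or report the failure.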