Re: [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM)

2015-09-22 Thread Tom Herbert
On Tue, Sep 22, 2015 at 2:14 AM, Thomas Martitz wrote:
> Am 21.09.2015 um 00:29 schrieb Tom Herbert:
>>
>> Kernel Connection Multiplexor (KCM) is a facility that provides a
>> message based interface over TCP for generic application protocols.
>> The motivation for this is based on the observation that although
>> TCP is a byte stream transport protocol with no concept of message
>> boundaries, a common use case is to implement a framed application
>> layer protocol running over TCP. To date, most TCP stacks offer a
>> byte stream API for applications, which places the burden of message
>> delineation, message I/O operation atomicity, and load balancing
>> in the application. With KCM an application can efficiently send
>> and receive application protocol messages over TCP using a
>> datagram interface.
>>
>
>
> Hello Tom,
>
> on a general note I'm wondering what this offers over the existing SCTP
> support. SCTP offers reliable message exchange and multihoming already.
>
Hi Thomas,

The idea of KCM is to provide an improved interface that is usable
with currently deployed protocols without invoking any change on the
wire whatsoever. Deploying SCTP to replace TCP use cases would entail
a major operations change for us, AFAIK hardware support is scant, and
we probably still can't reliably use it on the Internet (you can
Google why is SCTP not widely deployed). I would point out that two of
the four limitations of TCP listed in RFC4960 should be addressed by
KCM :-)

> On a specific note, I'm wondering if the dup(2) family of system calls
> isn't a lot more suitable/natural instead of overloading accept.
>
dup duplicates a file descriptor, not a socket. Socket operations
on duplicated file descriptors operate on the same socket, so they need
to contend for the socket lock, etc. This is a similar rationale as to
why we need SO_REUSEPORT instead of just dup'ing listener fds.
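
A minimal sketch of the dup point, assuming a Linux/POSIX system. The
function name dup_shares_socket is made up for illustration and is not
part of the patch set. It sets SO_KEEPALIVE through one descriptor and
reads it back through the dup'ed copy, which only works because both
descriptors refer to the same underlying socket:

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/*
 * Returns 1 if an option set through one descriptor is visible through
 * its dup()'ed copy (both refer to the same socket), 0 if it is not,
 * and -1 on error. Illustrative helper, not part of KCM.
 */
int dup_shares_socket(void)
{
	int on = 1, seen = 0, ret = -1;
	socklen_t len = sizeof(seen);
	int fd, fd2;

	fd = socket(AF_INET, SOCK_STREAM, 0);
	if (fd < 0)
		return -1;

	fd2 = dup(fd);
	if (fd2 < 0) {
		close(fd);
		return -1;
	}

	/* Set the option via the original descriptor, read it via the copy. */
	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) == 0 &&
	    getsockopt(fd2, SOL_SOCKET, SO_KEEPALIVE, &seen, &len) == 0)
		ret = (seen != 0);

	close(fd2);
	close(fd);
	return ret;
}
```

Since the two descriptors share one socket, concurrent users also share
its lock and queues, which is the contention described above.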

Thanks,
Tom

> Best regards.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM)

2015-09-22 Thread Thomas Martitz

Am 21.09.2015 um 00:29 schrieb Tom Herbert:

> Kernel Connection Multiplexor (KCM) is a facility that provides a
> message based interface over TCP for generic application protocols.
> The motivation for this is based on the observation that although
> TCP is a byte stream transport protocol with no concept of message
> boundaries, a common use case is to implement a framed application
> layer protocol running over TCP. To date, most TCP stacks offer a
> byte stream API for applications, which places the burden of message
> delineation, message I/O operation atomicity, and load balancing
> in the application. With KCM an application can efficiently send
> and receive application protocol messages over TCP using a
> datagram interface.




Hello Tom,

on a general note I'm wondering what this offers over the existing SCTP 
support. SCTP offers reliable message exchange and multihoming already.


On a specific note, I'm wondering if the dup(2) family of system calls 
isn't a lot more suitable/natural instead of overloading accept.


Best regards.


Re: [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM)

2015-09-21 Thread Tom Herbert
On Mon, Sep 21, 2015 at 2:26 PM, Sowmini Varadhan wrote:
> On (09/21/15 10:33), Tom Herbert wrote:
>> >
>> > Some things that were not clear to me from the patch-set:
>> >
>> > The doc states that we re-assemble packets the "stated length" -
>> > but how will the receiver know the "stated length"?
>>
>> The BPF program returns the length of the next message. In my testing
>> so far I've been using HTTP/2, which defines a frame format whose
>> first 3 bytes are the length field of the header. The BPF program
>> (using LLVM/Clang-- thanks Alexei!) is just:
>
> Maybe I dont see something about the mux/demux here (I have to
> take a closer look at reserve_psock/unreserve_psock), but
> will every tcp segment have a 3 byte length in the payload?
>
No, there is no provision in TCP that application layer headers align
with TCP segments or that message boundaries are respected with TCP
segments. What we need to do, which you're probably doing for RDS, is
do message delineation on the stream as a sequence of:

1) Read protocol header to determine message length (BPF used here)
2) Read data up to the length of the message
3) Deliver message
4) Goto #1 (i.e. process next message in the stream).
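
The four steps above can be sketched in userspace C as follows. This is
only an illustration under stated assumptions, not the KCM code: the
callback types msg_len_fn and deliver_fn, the function delineate, and
the 2-byte length-prefixed framing used by len16_parser are all made up
for the example. The parser callback plays the role that BPF plays in
KCM.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical parser: total length of the message starting at hdr,
 * or 0 if the header itself is not complete yet. */
typedef size_t (*msg_len_fn)(const uint8_t *hdr, size_t avail);

/* Hypothetical delivery hook for one complete message. */
typedef void (*deliver_fn)(const uint8_t *msg, size_t len, void *ctx);

/*
 * Deliver every complete message in buf[0..len). Returns the number of
 * bytes consumed; the caller keeps the remainder (a partial message)
 * and retries once more stream data has arrived.
 */
size_t delineate(const uint8_t *buf, size_t len,
		 msg_len_fn parse, deliver_fn deliver, void *ctx)
{
	size_t off = 0;

	while (off < len) {
		/* 1) Read the protocol header to get the message length. */
		size_t msg_len = parse(buf + off, len - off);

		/* 2) Stop unless the whole message is already buffered. */
		if (msg_len == 0 || msg_len > len - off)
			break;

		/* 3) Deliver the message. */
		deliver(buf + off, msg_len, ctx);

		/* 4) Go to 1: process the next message in the stream. */
		off += msg_len;
	}
	return off;
}

/* Example framing (made up): 2-byte big-endian payload length. */
size_t len16_parser(const uint8_t *hdr, size_t avail)
{
	if (avail < 2)
		return 0;
	return 2 + (((size_t)hdr[0] << 8) | hdr[1]);
}

/* Example delivery hook: count delivered messages in *ctx. */
void count_msgs(const uint8_t *msg, size_t len, void *ctx)
{
	(void)msg;
	(void)len;
	++*(int *)ctx;
}
```

Note that nothing here depends on where TCP segment boundaries fall:
the loop only looks at the byte stream, so a message that spans several
segments is simply left in the buffer until the rest arrives.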

> Not every TCP segment in the RDS-TCP case will have a RDS header,
> thus the comments before rds_send_xmit(), thus applying the bpf filter
> to a TCP segment holding some "from-the-middle" piece of the RDS dgram
> may not be possible
>
>> > the notes say one can "accept()" over a kcm socket- but "accept()"
>> > is itself a connection-oriented concept- one does not accept() on
> :
>> The accept method is overloaded on KCM sockets to do the socket
>> cloning operation. This is unrelated to TCP semantics, connection
>> management is performed on TCP sockets (i.e. before being attached to
>> a KCM multiplexor).
>
> If possible, it might be better to use some other
> glibc-func/name/syscall/sockopt/whatever
> for this, rather than overloading accept().. feels like that would
> keep the semantics cleaner, and probably less likely to trip
> up on accept code in the kernel..
>
I'll take a look at alternatives, but I sort of think this is okay since
the semantics of accept are defined per protocol (in this case the
"protocol" is KCM).

Thanks,
Tom

> --Sowmini


Re: [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM)

2015-09-21 Thread Sowmini Varadhan
On (09/21/15 15:36), Tom Herbert wrote:
> segments. What we need to do, which you're probably doing for RDS, is
> do message delineation on the stream as a sequence of:
> 
> 1) Read protocol header to determine message length (BPF used here)

right, that's what rds does- first reads the sizeof(rds_header),
and from that, figures out payload len, to stitch each rds dgram 
together from intermediate tcp segments..

> 2) Read data up to the length of the message
> 3) Deliver message
> 4) Goto #1 (i.e. process next message in the stream).

Thanks for the rest of the responses.

--Sowmini



Re: [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM)

2015-09-21 Thread Tom Herbert
Hi Sowmini,

Thanks for your comments, some replies are in line.

> A lot of this design is very similar to the PF_RDS/RDS-TCP
> design. There too, we have a PF_RDS dgram socket (that already
> supports SEQPACKET semantics today) that can be tunneled over TCP.
>
> The biggest design difference that I see in your proposal is
> that you are using BPF so presumably the demux has more flexibility
> than RDS, which does the demux based on RDS port numbers?
>
I did look a bit at RDS. Major differences with KCM are:

- KCM does not implement any specific protocol in the kernel. Parsing
in receive is accomplished using BPF which allows protocol parsing to
be programmed from userspace.
- Connection management is done in userspace. This is particularly
important when connections need to switch into a protocol mode, like
when doing HTTP/2, Web sockets, SPDY, etc. over port 80.

> Would it make sense to build your solution on top of RDS,
> rather than re-invent solutions for many of the challenges
> that one encounters when building a dgram-over-stream hybrid
> socket (see "lessons learned" list below)?

There might be some points of leverage, but as I pointed out the
primary goal of KCM is the multiplexing and datagram interface over
TCP not application protocol implementation in the kernel. It might be
interesting if there were a common protocol generic library to handle
the user interface.

>
> Some things that were not clear to me from the patch-set:
>
> The doc states that we re-assemble packets the "stated length" -
> but how will the receiver know the "stated length"?

The BPF program returns the length of the next message. In my testing
so far I've been using HTTP/2, which defines a frame format whose
first 3 bytes are the length field of the header. The BPF program
(using LLVM/Clang-- thanks Alexei!) is just:

int bpf_prog1(struct __sk_buff *skb)
{
	/* Top 3 bytes of the first word are the length; add the 9 byte header. */
	return (load_word(skb, 0) >> 8) + 9;
}
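
For readers who want to check the arithmetic outside the kernel, here
is a userspace sketch of the same computation. load_be32 is a
hypothetical stand-in for the BPF load_word helper (which loads a
32-bit word in network byte order), and http2_msg_len is an
illustrative name. It assumes the HTTP/2 frame layout of a 24-bit
payload length followed by type, flags and stream id, 9 header bytes
in total, so shifting the first word right by 8 leaves the payload
length.

```c
#include <stdint.h>

/* Stand-in for BPF load_word(): 32-bit big-endian load at offset off. */
static uint32_t load_be32(const uint8_t *p, unsigned int off)
{
	return ((uint32_t)p[off] << 24) | ((uint32_t)p[off + 1] << 16) |
	       ((uint32_t)p[off + 2] << 8) | (uint32_t)p[off + 3];
}

/* Total frame length: 24-bit payload length plus the 9-byte header. */
uint32_t http2_msg_len(const uint8_t *hdr)
{
	return (load_be32(hdr, 0) >> 8) + 9;
}
```

For example, a header starting with the bytes 00 00 10 announces a
16 byte payload, so the message length returned is 25.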

> (fwiw, RDS figures that out from the header len in RDS,
> and elsewhere I think you allude to some similar encaps
> header - is that a correct understanding?)
>
KCM does not define any encaps header, it is intended to support
existing ones. For instance, BPF code to get length from an RDS
message would be:

int bpf_prog1(struct __sk_buff *skb)
{
	/* 32-bit length field at byte offset 16, plus 40 bytes of header. */
	return load_word(skb, 16) + 40;
}

> not clear from the diagram: Is there one TCP socket per kcm-socket?
> what is the relation (one-one, many-one etc.)  between a kcm-socket and
> a psock?  How does the ksock-psock-tcp-sock association get set up?
>
Each multiplexor is logically one destination. At the top multiple KCM
sockets allow concurrent operations in userspace, at the bottom
multiple TCP connections allow for load balancing. An application
controls construction of the multiplexor and would presumably create a
multiplexor for each peer. See Documentation/networking/kcm.txt for the
details on interfaces for plumbing.

> the notes say one can "accept()" over a kcm socket- but "accept()"
> is itself a connection-oriented concept- one does not accept() on
> a dgram socket. So what exactly does this mean, and why not just
> use the well-defined TCP socket semantics at that point (with something
> like XDR for message boundary marking)?
>
The accept method is overloaded on KCM sockets to do the socket
cloning operation. This is unrelated to TCP semantics, connection
management is performed on TCP sockets (i.e. before being attached to
a KCM multiplexor).

> In the "fwiw" bucket of lessons learned from RDS..  please ignore if
> you were already aware of these-
>
> In the case of RDS, since multiple rds/dgram sockets share a single TCP
> socket, some issues that have to be dealt with are
>
> - congestion/starvation: we dont want tcp to start advertising
>   zero-window because one dgram socket pair has flooded the pipe
>   and the peer is not reading. So the RDS protocol has port-congestion
>   RDS control plane messages that track congestion at the RDS port.
>
In KCM all upper sockets are equivalent, so there is no HOL blocking
on receive or transmit. A message received on a multiplexor can be
steered to any socket that is receiving. Conceivably, we could
implement some message affinity, for instance sending an RPC reply to
the same socket that made the request, but even that I think should
only be best effort to avoid having to deal with blocking.
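
As a rough illustration of "steered to any socket that is receiving",
the selection could look like the sketch below. struct ksock, its
fields, and pick_receiver are hypothetical names for this example and
do not reflect the data structures in the patch. What happens when no
socket is waiting (here: return NULL so the caller can hold the
message) is likewise an assumption.

```c
#include <stddef.h>

/* Hypothetical per-socket state: waiting != 0 means blocked in receive. */
struct ksock {
	int id;
	int waiting;
};

/*
 * All upper sockets are equivalent, so any waiting one will do.
 * Returns NULL when nobody is receiving.
 */
struct ksock *pick_receiver(struct ksock *socks, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (socks[i].waiting)
			return &socks[i];
	return NULL;
}
```

Because no message is tied to a particular socket, a slow reader on one
socket does not hold up messages that another socket could take.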

> - imposes some constraints on the TCP send side- if sock1 and sock2
>   are sharing a tcp socket, and both are sending dgrams over the
>   stream, dgrams from sock1 may get interleaved  (see comments above
>   rds_send_xmit() for a note on how rds deals with this). There are ways
>   to fan this out over multiple tcp sockets (and I'm working on those,
>   to improve the scaling), but just a note that there is some complexity
>   to be dealt with here. Not sure if this was considered in the "KCM
>   sockets" section in patch2..
>
Writing and reading messages atomically is a critical operation of the
multiplexor. This is implemented using a reservation 

Re: [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM)

2015-09-21 Thread Sowmini Varadhan
On (09/20/15 15:29), Tom Herbert wrote:
> 
> Kernel Connection Multiplexor (KCM) is a facility that provides a
> message based interface over TCP for generic application protocols.
> The motivation for this is based on the observation that although
> TCP is a byte stream transport protocol with no concept of message
> boundaries, a common use case is to implement a framed application
> layer protocol running over TCP. To date, most TCP stacks offer a
> byte stream API for applications, which places the burden of message
> delineation, message I/O operation atomicity, and load balancing
> in the application. With KCM an application can efficiently send
> and receive application protocol messages over TCP using a
> datagram interface.

A lot of this design is very similar to the PF_RDS/RDS-TCP
design. There too, we have a PF_RDS dgram socket (that already 
supports SEQPACKET semantics today) that can be tunneled over TCP.

The biggest design difference that I see in your proposal is 
that you are using BPF so presumably the demux has more flexibility
than RDS, which does the demux based on RDS port numbers?

Would it make sense to build your solution on top of RDS,
rather than re-invent solutions for many of the challenges
that one encounters when building a dgram-over-stream hybrid
socket (see "lessons learned" list below)?

Some things that were not clear to me from the patch-set:

The doc states that we re-assemble packets the "stated length" -
but how will the receiver know the "stated length"? 
(fwiw, RDS figures that out from the header len in RDS,
and elsewhere I think you allude to some similar encaps
header - is that a correct understanding?)

not clear from the diagram: Is there one TCP socket per kcm-socket? 
what is the relation (one-one, many-one etc.)  between a kcm-socket and
a psock?  How does the ksock-psock-tcp-sock association get set up? 

the notes say one can "accept()" over a kcm socket- but "accept()"
is itself a connection-oriented concept- one does not accept() on
a dgram socket. So what exactly does this mean, and why not just
use the well-defined TCP socket semantics at that point (with something
like XDR for message boundary marking)?

In the "fwiw" bucket of lessons learned from RDS..  please ignore if
you were already aware of these- 

In the case of RDS, since multiple rds/dgram sockets share a single TCP
socket, some issues that have to be dealt with are

- congestion/starvation: we dont want tcp to start advertising
  zero-window because one dgram socket pair has flooded the pipe
  and the peer is not reading. So the RDS protocol has port-congestion
  RDS control plane messages that track congestion at the RDS port.

- imposes some constraints on the TCP send side- if sock1 and sock2
  are sharing a tcp socket, and both are sending dgrams over the 
  stream, dgrams from sock1 may get interleaved  (see comments above
  rds_send_xmit() for a note on how rds deals with this). There are ways
  to fan this out over multiple tcp sockets (and I'm working on those,
  to improve the scaling), but just a note that there is some complexity
  to be dealt with here. Not sure if this was considered in the "KCM
  sockets" section in patch2..

- in general the "dgram-over-stream" hybrid has some peculiar issues. E.g.,
  dgram APIs like BINDTODEVICE and IP_PKTINFO cannot be applied
  to the underlying stream. In the typical use case for RDS (database 
  clusters) there's a reasonable workaround for this using network
  namespaces to define bundles of outgoing interfaces, but that solution
  may not always be workable for other use-cases. Thus it might actually
  be more obvious to simply use tcp sockets (and use something like XDR
  for message boundary markers on the stream).

--Sowmini
 




Re: [PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM)

2015-09-21 Thread Sowmini Varadhan
On (09/21/15 10:33), Tom Herbert wrote:
> >
> > Some things that were not clear to me from the patch-set:
> >
> > The doc states that we re-assemble packets the "stated length" -
> > but how will the receiver know the "stated length"?
> 
> The BPF program returns the length of the next message. In my testing
> so far I've been using HTTP/2, which defines a frame format whose
> first 3 bytes are the length field of the header. The BPF program
> (using LLVM/Clang-- thanks Alexei!) is just:

Maybe I dont see something about the mux/demux here (I have to 
take a closer look at reserve_psock/unreserve_psock), but 
will every tcp segment have a 3 byte length in the payload?

Not every TCP segment in the RDS-TCP case will have a RDS header,
thus the comments before rds_send_xmit(), thus applying the bpf filter
to a TCP segment holding some "from-the-middle" piece of the RDS dgram
may not be possible 

> > the notes say one can "accept()" over a kcm socket- but "accept()"
> > is itself a connection-oriented concept- one does not accept() on
:
> The accept method is overloaded on KCM sockets to do the socket
> cloning operation. This is unrelated to TCP semantics, connection
> management is performed on TCP sockets (i.e. before being attached to
> a KCM multiplexor).

If possible, it might be better to use some other
glibc-func/name/syscall/sockopt/whatever
for this, rather than overloading accept().. feels like that would
keep the semantics cleaner, and probably less likely to trip 
up on accept code in the kernel..

--Sowmini


[PATCH RFC 0/3] kcm: Kernel Connection Multiplexor (KCM)

2015-09-20 Thread Tom Herbert
Kernel Connection Multiplexor (KCM) is a facility that provides a
message based interface over TCP for generic application protocols.
The motivation for this is based on the observation that although
TCP is a byte stream transport protocol with no concept of message
boundaries, a common use case is to implement a framed application
layer protocol running over TCP. To date, most TCP stacks offer a
byte stream API for applications, which places the burden of message
delineation, message I/O operation atomicity, and load balancing
in the application. With KCM an application can efficiently send
and receive application protocol messages over TCP using a
datagram interface.

In order to delineate messages in a TCP stream for receive in KCM, the
kernel implements a message parser. For this we chose to employ BPF,
which is applied to the TCP stream. BPF code parses application layer
messages and returns a message length. Nearly all binary application
protocols are parsable in this manner, so KCM should be applicable
across a wide range of applications. Other than message length
determination in receive, KCM does not require any other application
specific awareness. KCM does not implement any other application
protocol semantics-- these are provided in userspace or could be
implemented in a kernel module layered above KCM.

KCM implements an NxM multiplexor in the kernel as diagrammed below:

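+------------+   +------------+   +------------+   +------------+
| KCM socket |   | KCM socket |   | KCM socket |   | KCM socket |
+------------+   +------------+   +------------+   +------------+
      |                |                |                |
      +-----------+    |                |     +----------+
                  |    |                |     |
               +----------------------------------+
               |           Multiplexor            |
               +----------------------------------+
                 |   |           |           |   |
      +----------+   |           |           |   +-----------+
      |              |           |           |               |
+----------+  +----------+  +----------+  +----------+ +----------+
|  Psock   |  |  Psock   |  |  Psock   |  |  Psock   | |  Psock   |
+----------+  +----------+  +----------+  +----------+ +----------+
      |              |           |           |               |
+----------+  +----------+  +----------+  +----------+ +----------+
| TCP sock |  | TCP sock |  | TCP sock |  | TCP sock | | TCP sock |
+----------+  +----------+  +----------+  +----------+ +----------+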

The KCM sockets provide the datagram interface to applications;
Psocks are the state for each attached TCP connection (i.e. where
message delineation is performed on receive).

A description of the APIs and design can be found in the included
Documentation/networking/kcm.txt.

Testing:

For testing I have been developing kcmperf and super_kcmperf which
should allow functional verification and some baseline comparisons
with netperf using TCP. For the test results listed below, one
instance of kcmperf is run which creates a trivial MUX with one KCM
socket and one attached TCP connection.

netperf TCP_RR
   - 1 instance, 1 byte RR size
 34219 tps

   - 1 instance, 100 byte RR size
 464 tps

   - 200 instances, 1 byte RR size
 1721552 tps
 86.86% CPU utilization

   - 200 instances, 100 byte RR size
 1165 tps
 7.16% CPU utilization

kcmperf
   - 1 instance, byte RR size
 32679 tps

   - 1 instance, byte RR size
 412 tps

   - 200 instances, 1 byte RR size
 1420454 tps
 80.02% CPU utilization 

   - 200 instances, 100 byte RR size
 1130 tps
 10.08% CPU utilization
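
To put the 1 byte RR rows side by side, the relative throughput can be
computed as below. The helper name relative_pct is made up for this
note. Using only figures listed above, relative_pct(32679, 34219)
comes to roughly 95.5 for one instance and
relative_pct(1420454, 1721552) to roughly 82.5 for 200 instances.

```c
/* Throughput of kcmperf relative to netperf TCP_RR, in percent. */
double relative_pct(double kcm_tps, double netperf_tps)
{
	return 100.0 * kcm_tps / netperf_tps;
}
```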

Future support:

The implementation provided here should be thought of as a first cut,
for which the goal is to establish a robust base implementation. There
are many avenues for extending this basic implementation and improving
upon this:

 - Sample application support
 - SOCK_SEQPACKET support
 - Integration with TLS (TLS-in-kernel is a separate initiative).
 - Page operations/splice support
 - sendmmsg, recvmmsg support
 - Unconnected KCM sockets. Will be able to attach sockets to different
   destinations, AF_KCM addresses will be used in sendmsg and recvmsg
   to indicate destination
 - Explore more utility in performing BPF inline with a TCP data stream
   (setting SO_MARK, rxhash for messages being sent or received on
   KCM sockets).
 - Performance work
   - Reduce locking (MUX lock is currently a bottleneck).
   - KCM socket to TCP socket affinity
   - Small message coalescing, direct calls to TCP send functions

Tom Herbert (3):
  rcu: Add list_next_or_null_rcu
  kcm: Kernel Connection Multiplexor module
  kcm: Add statistics and proc interfaces

 Documentation/networking/kcm.txt |  173 
 include/linux/rculist.h  |   21 +
 include/linux/socket.h   |6 +-
 include/net/kcm.h|  211 +
 include/uapi/linux/errqueue.h|1 +