Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-25 Thread Sowmini Varadhan
On (11/24/15 17:25), Florian Westphal wrote:
> Its a well-written document, but I don't see how moving the burden of
> locking a single logical tcp connection (to prevent threads from
> reading a partial record) from userspace to kernel is an improvement.
> 
> If you really have 100 threads and must use a single tcp connection
> to multiplex some arbitrarily complex record-format in atomic fashion,
> then your requirements suck.

In the interest of providing some context from the rds-tcp use case
here (without drifting into hyperbole): RDS-TCP, like KCM, provides
a dgram-over-stream socket with SEQPACKET semantics, including an
upper bound on record size per POSIX/SEQPACKET semantics.
The major difference from KCM is that it does not use BPF, but
instead has its own protocol header for each datagram.

There seems to be some misconception in this thread that this model
is about allowing applications to be "lazy" and do a 1:1 mapping between
streams. That's not the case for RDS.

In the case of cluster apps, we have DB apps that want to have a single
dgram socket to talk to multiple peers (i.e., a star network, with the
node in the center of the star wanting to have dgram sockets to everyone
else). Scale is more than a mere 100 threads.

If that central node wants reliable, ordered, congestion-managed
delivery, it would have to use UDP + a bunch of its own code for
seq#, rexmit etc. And they are doing that today, but don't want
to reinvent TCP's congestion avoidance. (In fact, in the absence of
congestion, one complaint is that udp latency is 2x-3x better than
rds-tcp for a 512 byte req, 8K resp that is typical for DB workloads.
I'm still investigating.)

From the TCP standpoint of rds-tcp, we have a many-to-one mapping:
multiple RDS sockets funneling to a single tcp connection, sharing
a single congestion state-machine.

I don't know if this is a "poorly designed application". I'm sure
it's not perfect, but we have a ton of Oracle clustering s/w that's
already doing this with IB, so extending this with rds-tcp made
sense for us at this point.

--Sowmini
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-25 Thread Hannes Frederic Sowa
Hi,

On Tue, Nov 24, 2015, at 23:21, Alexei Starovoitov wrote:
> On Tue, Nov 24, 2015 at 10:58:08PM +0100, Hannes Frederic Sowa wrote:
> > > > > 
> > > > > interesting idea, but then remote host will be influencing local cpu
> > > > > selection?
> > > > > how remote can figure out the number of local cpus?
> > > > 
> > > > Via rpc! :)
> > > > 
> > > > The configuration shouldn't change all the time and some get_info rpc
> > > > call could provide info for the topology of the machine, or...
> > > 
> > > Configuration changes all the time. Machines crash, traffic redirected
> > > because of load, etc, etc
> > 
> > Yeah, same problem you have to handle with the kcm approach.
> 
> not at all. No new remote configuration things are needed.

At some point you have to manage your ip address and if you move a host
subnet or an ip address it should not matter a lot.

> > Just use GRO engine and TCP PSH flag, maybe by making it more
> > intentional via user space APIs on one connection. Wake up thread with
> > one complete RPC message. If receiver eats more data it is no problem,
> > if less, it will block as in your approach, too. Your worker threads
> > must handle that anyway and can buffer and concatenate it to the next
> > message to be processed.
> 
> of course not. kcm reading threads don't have to deal with it.

If there is no fallback for this you have to drop frames and kill the
TCP connection for all your RPC endpoints. If you lose synchronization
with the headers there is no way you will resync.

> PS
> just noticed that you've excluded all. By accident?
> Feel free to reply to all.

Oh, thanks, yes it was a fault on my side.

For full transparency I reply.

Bye,
Hannes


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Florian Westphal
David Miller  wrote:
> From: Florian Westphal 
> Date: Tue, 24 Nov 2015 23:22:42 +0100
> 
> > Yes, I get that point, but I maintain that KCM is a strange workaround
> > for bad userspace design.
> 
> I fundamentally disagree with you.

Fair enough.  Still, I do not see how what KCM intends to do
can be achieved while at the same time imposing some upper bound on
the amount of kernel memory we can allocate globally and per socket.

Once such a limit is enforced, the question becomes how the kernel
could handle such an error other than by closing the underlying tcp
connection.

Not adding any limit is not a good idea in my opinion.


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread David Miller
From: Florian Westphal 
Date: Tue, 24 Nov 2015 23:22:42 +0100

> Yes, I get that point, but I maintain that KCM is a strange workaround
> for bad userspace design.

I fundamentally disagree with you.

And even if I didn't, I would be remiss to completely dismiss the
difficulty in changing existing protocols and existing large scale
implementations of them.  If we can facilitate them somehow then
I see nothing wrong with that.

Neither you nor Hannes have made a strong enough argument for me
to consider Tom's work not suitable for upstream.

Have you even looked at the example userspace use case he referenced
and considered the constraints under which it operates?  I seriously
doubt you did.


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Florian Westphal
Tom Herbert  wrote:
> On Tue, Nov 24, 2015 at 12:55 PM, Florian Westphal  wrote:
> > Why anyone would invest such a huge amount of work in making this
> > kernel-based framing for single-stream tcp record (de)mux rather than
> > improving the userspace protocol to use UDP or SCTP or at least
> > one tcp connection per worker is beyond me.
> >
> From the /0 patch:
> 
> Q: Why not use an existing message-oriented protocol such as RUDP,
>DCCP, SCTP, RDS, and others?
> 
> A: Because that would entail using a completely new transport protocol.

That's why I wrote 'or at least one tcp connection per worker'.

> > For TX side, why is writev not good enough?
> 
> writev on a TCP stream does not guarantee atomicity of the operation.

Are you talking about short writes?

> It writes atomic without user space needing to implement locking when
> a socket is shared amongst threads.

Yes, I get that point, but I maintain that KCM is a strange workaround
for bad userspace design.

1 tcp connection per thread -> no userspace sockfd lock needed

Sender side can use writev, sendmsg, sendmmsg, etc to avoid sending
sub-record sized frames.
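
For illustration only, a minimal sketch of the sender side described above: one record handed to writev() as header plus payload, with a retry loop for short writes. The 4-byte big-endian length prefix and the name send_record() are assumptions made for this example, not something taken from the patch set.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical wire format: 4-byte big-endian length, then the payload.
 * Sends one record with writev(), retrying after short writes.
 * Returns 0 on success, -1 on error (errno is set by writev). */
int send_record(int fd, const void *payload, uint32_t len)
{
    uint32_t hdr = htonl(len);
    struct iovec iov[2] = {
        { .iov_base = &hdr,            .iov_len = sizeof(hdr) },
        { .iov_base = (void *)payload, .iov_len = len },
    };
    struct iovec *cur = iov;
    int cnt = 2;

    assert(payload != NULL || len == 0);

    while (cnt > 0) {
        ssize_t n = writev(fd, cur, cnt);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        /* Skip iovecs that were written completely ... */
        size_t done = (size_t)n;
        while (cnt > 0 && done >= cur->iov_len) {
            done -= cur->iov_len;
            cur++;
            cnt--;
        }
        /* ... and trim the one that was written partially. */
        if (cnt > 0) {
            cur->iov_base = (char *)cur->iov_base + done;
            cur->iov_len -= done;
        }
    }
    return 0;
}
```

With one connection per thread this needs no lock. With a shared socket, the retry after a short write is exactly where another thread's record could interleave.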

Is user space really so bad that instead of fixing it it's simpler to
work around it with even more kernel bloat?

Since for KCM userspace has to be adjusted anyway I find that hard
to believe.

I don't know if the 'dynamic RCVLOWAT' that you want is needed
(you say 'yes', Eric's reply seems to indicate it's not, at least assuming
 a sane/friendly peer that doesn't intentionally xmit byte-by-byte).

But assuming there would really be a benefit, maybe a RCVLOWAT2 could
be added?  Of course we could only make it a hint and would have to
make a blocking read return with less data than desired when tcp rmem limit
gets hit.  But at least we'd avoid the 'unbounded allocation of large
amount of kernel memory' thing that we have with current proposal.

Thanks,
Florian


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Hannes Frederic Sowa
Hi David,

On Tue, Nov 24, 2015, at 23:25, David Miller wrote:
> From: Florian Westphal 
> Date: Tue, 24 Nov 2015 23:22:42 +0100
> 
> > Yes, I get that point, but I maintain that KCM is a strange workaround
> > for bad userspace design.
> 
> I fundamentally disagree with you.
> 
> And even if I didn't, I would be remiss to completely dismiss the
> difficulty in changing existing protocols and existing large scale
> implementations of them.  If we can facilitate them somehow then
> I see nothing wrong with that.
> 
> Neither you nor Hannes have made a strong enough argument for me
> to consider Tom's work not suitable for upstream.
> 
> Have you even looked at the example userspace use case he referenced
> and considered the constraints under which it operates?  I seriously
> doubt you did.

If you are referring to thrift and tls framing, yes indeed, I did. I
have experience with google protocol buffers and once took care of an
in-house RPC implementation. All I learned is that this approach is
prone to starving or building up huge messages in kernel space. That is
why xml streaming in the form of StAX from the Java world is used more and
more, and even Apache Jackson provides a streaming API for JSON, which
I once used because JSON messages streamed as hash tables got too big
and were prone to starve. Even user space needs to be careful about what
sizes of messages it accepts, otherwise DoS attacks are possible and jvms
with small heaps get OutOfMemoryExceptions. The same holds in other
high-level languages; non-GCed languages (or those without a copying
garbage collector) just reallocate and cause fragmentation in the long
term. Even keeping multiple 16MB chunks for HTTP/2 in the kernel heap so
that user space can read them in one go seems like a very bad idea in my
opinion.

None of those approaches delimits datagrams by "read() barriers".
I think the alternatives should be tried. I think this framework is only
applicable to a small fraction of RPC systems.

Thanks for following up, :)
Hannes


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Alexei Starovoitov
On Tue, Nov 24, 2015 at 08:16:25PM +0100, Hannes Frederic Sowa wrote:
> Hello,
> 
> On Tue, Nov 24, 2015, at 19:59, Alexei Starovoitov wrote:
> > On Tue, Nov 24, 2015 at 07:23:30PM +0100, Hannes Frederic Sowa wrote:
> > > Hello,
> > > 
> > > On Tue, Nov 24, 2015, at 17:25, Florian Westphal wrote:
> > > > Its a well-written document, but I don't see how moving the burden of
> > > > locking a single logical tcp connection (to prevent threads from
> > > > reading a partial record) from userspace to kernel is an improvement.
> > > > 
> > > > If you really have 100 threads and must use a single tcp connection
> > > > to multiplex some arbitrarily complex record-format in atomic fashion,
> > > > then your requirements suck.
> > > 
> > > Right, if we are in a datacenter I would probably write a script and use
> > > all those IPv6 addresses to set up mappings a la:
> > > 
> > > for each $cpu; do
> > >   $ip address add 2000::$host:$cpu/64 dev if0 pref_cpu $cpu
> > > done
> > 
> > interesting idea, but then remote host will be influencing local cpu
> > selection?
> > how remote can figure out the number of local cpus?
> 
> Via rpc! :)
> 
> The configuration shouldn't change all the time and some get_info rpc
> call could provide info for the topology of the machine, or...

Configuration changes all the time. Machines crash, traffic gets
redirected because of load, etc.

> > Consider scenario where you have a ton of tcp sockets feeding into
> > bigger or smaller set of kcm sockets processed by threads or fibers.
> > Pinning sockets to cpu is not going to work.
> > 
> > Also note that opimizing byte copies between kernel and user space is
> > important,
> > but we lose a lot more in user space due to scheduling and re-scheduling
> > when demux-ing user space thread is feeding other worker threads.
> 
> ...also ipvs/netfilter could be used to only inspect the header and
> reroute the packet to some better fitting CPU. Complete hierarchies
> could be build with NUMA and addresses, packets could be rerouted into
> namespaces, etc.

or tc+bpf redirect...
but the reason it won't work is the same reason af_packet+bpf fanout doesn't apply:
it's not packet-based demuxing.
The kernel needs to deal with the TCP stream first, and different messages within a single
TCP stream go to different workers.
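
To make the "stream first, then demux" point concrete, here is a small userspace sketch, not KCM code. It reassembles length-prefixed messages from arbitrarily split chunks and hands each complete message to the next worker round-robin. The 4-byte big-endian header, MAX_MSG, struct demux and log_delivery() are all hypothetical names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_MSG 256 /* hypothetical cap on header + payload */

struct demux {
    unsigned char buf[MAX_MSG];
    size_t have;          /* bytes buffered for the current message */
    unsigned next_worker; /* round-robin cursor */
    unsigned nworkers;
    void (*deliver)(unsigned worker, const unsigned char *msg, uint32_t len);
};

/* Example callback: just records which worker got how many bytes. */
unsigned seen_worker[8];
uint32_t seen_len[8];
unsigned seen_count;

void log_delivery(unsigned worker, const unsigned char *msg, uint32_t len)
{
    (void)msg;
    if (seen_count < 8) {
        seen_worker[seen_count] = worker;
        seen_len[seen_count] = len;
        seen_count++;
    }
}

/* Feed one chunk of the byte stream; chunk boundaries are arbitrary and
 * need not line up with message boundaries.
 * Returns 0 on success, -1 if a message would exceed MAX_MSG. */
int demux_feed(struct demux *d, const unsigned char *data, size_t n)
{
    assert(d->nworkers > 0);

    for (;;) {
        size_t want = 4; /* until the header is complete */

        if (d->have >= 4) {
            uint32_t len = (uint32_t)d->buf[0] << 24 |
                           (uint32_t)d->buf[1] << 16 |
                           (uint32_t)d->buf[2] << 8 |
                           (uint32_t)d->buf[3];
            if (len > MAX_MSG - 4)
                return -1;
            want = 4 + (size_t)len;
            if (d->have == want) {
                /* Complete message: hand it to the next worker. */
                d->deliver(d->next_worker, d->buf + 4, len);
                d->next_worker = (d->next_worker + 1) % d->nworkers;
                d->have = 0;
                continue;
            }
        }
        if (n == 0)
            return 0;

        size_t take = want - d->have;
        if (take > n)
            take = n;
        memcpy(d->buf + d->have, data, take);
        d->have += take;
        data += take;
        n -= take;
    }
}
```

A packet-level steering rule never sees these boundaries, since one segment may carry the tail of one message and the head of the next.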



Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Florian Westphal
Tom Herbert  wrote:
> Message size limits can be enforced in BPF or we could add a limit
> enforced by KCM. For instance, the message size limit in http/2 is
> 16M. If it's needed, it wouldn't be much trouble to add a streaming
> interface for large messages.

That still won't change the fact that KCM allows eating large
amount of kernel memory (you could just open a lot of sockets...).

For tcp we cannot exceed the total rmem limits, even if I can open
4k sockets.

Why anyone would invest such a huge amount of work in making this
kernel-based framing for single-stream tcp record (de)mux rather than
improving the userspace protocol to use UDP or SCTP or at least
one tcp connection per worker is beyond me.

For TX side, why is writev not good enough?
Is KCM tx just so that userspace doesn't need to handle partial writes?


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Tom Herbert
On Tue, Nov 24, 2015 at 12:55 PM, Florian Westphal  wrote:
> Tom Herbert  wrote:
>> Message size limits can be enforced in BPF or we could add a limit
>> enforced by KCM. For instance, the message size limit in http/2 is
>> 16M. If it's needed, it wouldn't be much trouble to add a streaming
>> interface for large messages.
>
> That still won't change the fact that KCM allows eating large
> amount of kernel memory (you could just open a lot of sockets...).
>
> For tcp we cannot exceed the total rmem limits, even if I can open
> 4k sockets.
>
> Why anyone would invest such a huge amount of work in making this
> kernel-based framing for single-stream tcp record (de)mux rather than
> improving the userspace protocol to use UDP or SCTP or at least
> one tcp connection per worker is beyond me.
>
From the /0 patch:

Q: Why not use an existing message-oriented protocol such as RUDP,
   DCCP, SCTP, RDS, and others?

A: Because that would entail using a completely new transport protocol.
   Deploying a new protocol at scale is either a huge undertaking or
   fundamentally infeasible. This is true in both the Internet and
   the data center, due in large part to protocol ossification.
   Besides, we want KCM to work with existing, widely deployed application
   protocols that we couldn't change even if we wanted to (e.g. http/2).

   KCM simply defines a new interface method; it does not redefine any
   aspect of the transport protocol or application protocol, nor set
   any new requirements on these. Neither does KCM attempt to implement
   any application protocol logic other than message delineation in the
   stream. These are fundamental requirements of KCM.


> For TX side, why is writev not good enough?

writev on a TCP stream does not guarantee atomicity of the operation.

> Is KCM tx just so that userspace doesn't need to handle partial writes?

It writes atomically, without user space needing to implement locking when
a socket is shared amongst threads.
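
For comparison, a rough sketch of the userspace locking that would otherwise be needed when several threads share one stream socket. The struct shared_sock wrapper and shared_send() are hypothetical names for this example. Note that an error in the middle of a record leaves the stream unusable for everyone.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper: one stream fd shared by many sender threads. */
struct shared_sock {
    int fd;
    pthread_mutex_t lock;
};

/* Write one whole record while holding the lock, so that records from
 * different threads never interleave on the stream.
 * Returns 0 on success, -1 on error (errno is set by write). */
int shared_send(struct shared_sock *s, const void *rec, size_t len)
{
    const char *p = rec;
    int ret = 0;

    assert(rec != NULL || len == 0);

    pthread_mutex_lock(&s->lock);
    while (len > 0) {
        ssize_t n = write(s->fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            ret = -1; /* a partial record may already be on the wire */
            break;
        }
        p += n;
        len -= (size_t)n;
    }
    pthread_mutex_unlock(&s->lock);
    return ret;
}
```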


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Tom Herbert
On Tue, Nov 24, 2015 at 9:16 AM, Florian Westphal  wrote:
> Tom Herbert  wrote:
>> No one is being forced to use any of this.
>
> Right.  But it will need to be maintained.
> Lets ignore ktls for the time being and focus on KCM.
>
> I'm currently trying to figure out how memory handling in KCM
> is supposed to work.
>
> say we have following record framing:
>
> struct record {
> u32 len;
> char data[];
> };
>
> And I have a epbf filter that returns record->len within KCM.
> Now this program says 'length 128mbyte' (or whatever).
>
> If this was userspace, things are simple, userspace can either
> decide to hang up or start to read this in chunks as data arrives.
>
> AFAICS, with KCM, the kernel now has to keep 128mb of allocated
> memory around, rmem limits are ignored.
>
> Is that correct?  What if next record claims 4g in size?
> I don't really see how we can make any guarantees wrt.
> kernel stability...
>
> Am I missing something?

Message size limits can be enforced in BPF or we could add a limit
enforced by KCM. For instance, the message size limit in http/2 is
16M. If it's needed, it wouldn't be much trouble to add a streaming
interface for large messages.
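
As a rough illustration of what such a limit check could look like, here is the logic in plain C. This is not a BPF program and not the KCM interface; the 4-byte big-endian header, the 16M cap and the name parse_msg_len() are assumptions for the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HDR_LEN 4
#define MAX_PAYLOAD (16u * 1024 * 1024) /* assumed cap, the 16M figure above */

/* Decide how long the next message in the stream is.
 * Returns the total length (header + payload), 0 if the header is not
 * complete yet, or -1 if the claimed payload exceeds the cap. */
int64_t parse_msg_len(const unsigned char *buf, size_t avail)
{
    assert(buf != NULL || avail == 0);

    if (avail < HDR_LEN)
        return 0; /* need more bytes before deciding */

    uint32_t len = (uint32_t)buf[0] << 24 | (uint32_t)buf[1] << 16 |
                   (uint32_t)buf[2] << 8 | (uint32_t)buf[3];
    if (len > MAX_PAYLOAD)
        return -1; /* refuse instead of buffering an oversized message */

    return (int64_t)HDR_LEN + len;
}
```

What the caller does on -1 (drop the connection, fall back to streaming) is the policy question raised in this thread; the sketch leaves it open.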

>
> Thanks,
> Florian


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Hannes Frederic Sowa
Hello,

On Tue, Nov 24, 2015, at 17:25, Florian Westphal wrote:
> Its a well-written document, but I don't see how moving the burden of
> locking a single logical tcp connection (to prevent threads from
> reading a partial record) from userspace to kernel is an improvement.
> 
> If you really have 100 threads and must use a single tcp connection
> to multiplex some arbitrarily complex record-format in atomic fashion,
> then your requirements suck.

Right, if we are in a datacenter I would probably write a script and use
all those IPv6 addresses to set up mappings a la:

for each $cpu; do
  $ip address add 2000::$host:$cpu/64 dev if0 pref_cpu $cpu
done

where ip-address would also ensure via ntuples that the packets get
routed to the correct cpu. Some management layer could then ensure
correct usage of all CPUs. This way less cross-cpu traffic is ensured
for the TCP synchronization.

I only dealt with medium to low busy RPC systems but that would be my
first approach. Would something like that make sense? Would the NxM
federation overhead between the RPC system kill this approach? This way
you could also make use of the TCP PSH flag like Eric described.

Bye,
Hannes


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Florian Westphal
David Miller  wrote:
> From: Florian Westphal 
> Date: Tue, 24 Nov 2015 16:27:44 +0100
> 
> > Aside from Hannes comment -- KCM seems to be tied to the TLS work, i.e.
> > I have the impression that KCM without ability to do TLS in the kernel
> > is pretty much useless for whatever use case Tom has in mind.
> 
> I do not get this impression at all.
> 
> Tom's design document in the final patch looks legitimately what the
> core use case is.

You mean
https://patchwork.ozlabs.org/patch/547054/ ?

It's a well-written document, but I don't see how moving the burden of
locking a single logical tcp connection (to prevent threads from
reading a partial record) from userspace to kernel is an improvement.

If you really have 100 threads and must use a single tcp connection
to multiplex some arbitrarily complex record-format in atomic fashion,
then your requirements suck.

Now, arguably, maybe the requirements of Tom's use case are restricted
/cannot be avoided.

But that still begs the question: Why should mainline care?

Once it's in, the next step will be 'my single tcp connection that I use
for multiplexing via KCM now has a requirement to use TLS'.

How far are you willing to take the KCM concept?


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Tom Herbert
On Tue, Nov 24, 2015 at 8:25 AM, Florian Westphal  wrote:
> David Miller  wrote:
>> From: Florian Westphal 
>> Date: Tue, 24 Nov 2015 16:27:44 +0100
>>
>> > Aside from Hannes comment -- KCM seems to be tied to the TLS work, i.e.
>> > I have the impression that KCM without ability to do TLS in the kernel
>> > is pretty much useless for whatever use case Tom has in mind.
>>
>> I do not get this impression at all.
>>
>> Tom's design document in the final patch looks legitimately what the
>> core use case is.
>
> You mean
> https://patchwork.ozlabs.org/patch/547054/ ?
>
> Its a well-written document, but I don't see how moving the burden of
> locking a single logical tcp connection (to prevent threads from
> reading a partial record) from userspace to kernel is an improvement.
>
> If you really have 100 threads and must use a single tcp connection
> to multiplex some arbitrarily complex record-format in atomic fashion,
> then your requirements suck.
>
Well, this is the sort of thing that multi threaded applications do.

> Now, arguably, maybe the requirements of Toms use case are restricted
> /cannot be avoided.
>
> But that still begs the question: Why should mainline care?
>
I have no idea. I guess it's the same reason that mainline would care
about RDS, iSCSI, FCOE, RDMA, or anything of that nature. No one is
being forced to use any of this.

> Once its in, next step will be 'my single tcp connection that I use
> for multiplexing via KCM now has requirement to use TLS'.
>
> How far are you willing to take the KCM concept?

Obviously we are looking forward to TLS+KCM. But it does open up a bunch
of other possibilities.


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Rick Jones

On 11/24/2015 07:49 AM, Eric Dumazet wrote:

> But in the end, latencies were bigger, because the application had to
> copy from kernel to user (read()) the full message in one go. While if
> you wake up application for every incoming GRO message, we prefill cpu
> caches, and the last read() only has to copy the remaining part and
> benefit from hot caches (RFS up2date state, TCP socket structure, but
> also data in the application)


You can see something similar (at least in terms of latency) when 
messing about with MTU sizes.  For some message sizes - 8KB being a 
popular one - you will see higher latency on the likes of netperf TCP_RR 
with JumboFrames than you would with the standard 1500 byte MTU. 
Something I saw on GbE links years back anyway.  I chalked it up to 
getting better parallelism between the NIC and the host.


Of course the service demands were lower with JumboFrames...

rick jones


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Florian Westphal
Tom Herbert  wrote:
> No one is being forced to use any of this.

Right.  But it will need to be maintained.
Let's ignore ktls for the time being and focus on KCM.

I'm currently trying to figure out how memory handling in KCM
is supposed to work.

Say we have the following record framing:

struct record {
    u32 len;
    char data[];
};

And I have an eBPF filter that returns record->len within KCM.
Now this program says 'length 128mbyte' (or whatever).

If this was userspace, things are simple, userspace can either
decide to hang up or start to read this in chunks as data arrives.
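
A minimal sketch of that userspace behaviour, assuming the len field travels in network byte order and assuming a 1 MiB policy limit (MAX_RECORD). The names consume_record() and count_bytes() are made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <arpa/inet.h>
#include <sys/types.h>
#include <unistd.h>

#define CHUNK 4096
#define MAX_RECORD (1u << 20) /* assumed policy limit, not from the thread */

/* Example sink: only counts the payload bytes it is handed. */
unsigned long sunk_total;

void count_bytes(const char *chunk, size_t n)
{
    (void)chunk;
    sunk_total += n;
}

/* Read exactly len bytes; returns 0 on success, -1 on error or EOF. */
static int read_full(int fd, void *buf, size_t len)
{
    char *p = buf;

    while (len > 0) {
        ssize_t n = read(fd, p, len);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        p += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Consume one record: read the length, refuse it if it is over the
 * limit, otherwise pass the payload to sink() in bounded chunks as it
 * arrives. Returns the payload length, or -1 if the caller should
 * hang up. */
long consume_record(int fd, void (*sink)(const char *chunk, size_t n))
{
    uint32_t netlen;
    char chunk[CHUNK];

    assert(sink != NULL);

    if (read_full(fd, &netlen, sizeof(netlen)) < 0)
        return -1;

    uint32_t len = ntohl(netlen);
    if (len > MAX_RECORD)
        return -1; /* claimed size too large: hang up */

    uint32_t left = len;
    while (left > 0) {
        size_t want = left < CHUNK ? left : CHUNK;
        ssize_t n = read(fd, chunk, want);
        if (n < 0 && errno == EINTR)
            continue;
        if (n <= 0)
            return -1;
        sink(chunk, (size_t)n);
        left -= (uint32_t)n;
    }
    return (long)len;
}
```

Memory use stays bounded by CHUNK regardless of what the peer claims in len.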

AFAICS, with KCM, the kernel now has to keep 128mb of allocated
memory around, rmem limits are ignored.

Is that correct?  What if next record claims 4g in size?
I don't really see how we can make any guarantees wrt.
kernel stability...

Am I missing something?

Thanks,
Florian


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Alexei Starovoitov
On Tue, Nov 24, 2015 at 07:23:30PM +0100, Hannes Frederic Sowa wrote:
> Hello,
> 
> On Tue, Nov 24, 2015, at 17:25, Florian Westphal wrote:
> > Its a well-written document, but I don't see how moving the burden of
> > locking a single logical tcp connection (to prevent threads from
> > reading a partial record) from userspace to kernel is an improvement.
> > 
> > If you really have 100 threads and must use a single tcp connection
> > to multiplex some arbitrarily complex record-format in atomic fashion,
> > then your requirements suck.
> 
> Right, if we are in a datacenter I would probably write a script and use
> all those IPv6 addresses to set up mappings a la:
> 
> for each $cpu; do
>   $ip address add 2000::$host:$cpu/64 dev if0 pref_cpu $cpu
> done

Interesting idea, but then the remote host will be influencing local cpu selection?
How can the remote figure out the number of local cpus?

Consider a scenario where you have a ton of tcp sockets feeding into
a bigger or smaller set of kcm sockets processed by threads or fibers.
Pinning sockets to a cpu is not going to work.

Also note that optimizing byte copies between kernel and user space is important,
but we lose a lot more in user space due to scheduling and re-scheduling
when a demux-ing user space thread is feeding other worker threads.



Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Hannes Frederic Sowa
Hello,

On Tue, Nov 24, 2015, at 19:59, Alexei Starovoitov wrote:
> On Tue, Nov 24, 2015 at 07:23:30PM +0100, Hannes Frederic Sowa wrote:
> > Hello,
> > 
> > On Tue, Nov 24, 2015, at 17:25, Florian Westphal wrote:
> > > Its a well-written document, but I don't see how moving the burden of
> > > locking a single logical tcp connection (to prevent threads from
> > > reading a partial record) from userspace to kernel is an improvement.
> > > 
> > > If you really have 100 threads and must use a single tcp connection
> > > to multiplex some arbitrarily complex record-format in atomic fashion,
> > > then your requirements suck.
> > 
> > Right, if we are in a datacenter I would probably write a script and use
> > all those IPv6 addresses to set up mappings a la:
> > 
> > for each $cpu; do
> >   $ip address add 2000::$host:$cpu/64 dev if0 pref_cpu $cpu
> > done
> 
> interesting idea, but then remote host will be influencing local cpu
> selection?
> how remote can figure out the number of local cpus?

Via rpc! :)

The configuration shouldn't change all the time and some get_info rpc
call could provide info for the topology of the machine, or...

> Consider scenario where you have a ton of tcp sockets feeding into
> bigger or smaller set of kcm sockets processed by threads or fibers.
> Pinning sockets to cpu is not going to work.
> 
> Also note that opimizing byte copies between kernel and user space is
> important,
> but we lose a lot more in user space due to scheduling and re-scheduling
> when demux-ing user space thread is feeding other worker threads.

...also ipvs/netfilter could be used to only inspect the header and
reroute the packet to some better fitting CPU. Complete hierarchies
could be built with NUMA and addresses, packets could be rerouted into
namespaces, etc.

Bye,
Hannes


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Hannes Frederic Sowa

On Tue, Nov 24, 2015, at 20:16, Hannes Frederic Sowa wrote:
> ...also ipvs/netfilter could be used to only inspect the header and
> reroute the packet to some better fitting CPU. Complete hierarchies
> could be build with NUMA and addresses, packets could be rerouted into
> namespaces, etc.

Maybe also redirects could be useful. But they might be a bit intrusive
and hard to debug.

Bye,
Hannes


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Florian Westphal
David Miller  wrote:
> From: Tom Herbert 
> Date: Mon, 23 Nov 2015 09:33:44 -0800
> 
> > The TCP PSH flag is not defined for message delineation (neither is
> > urgent pointer). We can't change that (many people have tried to add
> > message semantics to TCP protocol but have always failed miserably).
>
> Agreed.
>
> My only gripe with kcm right now is a lack of a native sendpage.

Aside from Hannes' comment -- KCM seems to be tied to the TLS work, i.e.
I have the impression that KCM without ability to do TLS in the kernel
is pretty much useless for whatever use case Tom has in mind.

And that ktls thing just gives me the creeps.

For KCM itself I don't even get the use case -- it's in the 'yeah, you
can do that, but... why?' category 8-/


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Hannes Frederic Sowa
Hello,

David Miller  writes:

> From: Tom Herbert 
> Date: Mon, 23 Nov 2015 09:33:44 -0800
>
>> The TCP PSH flag is not defined for message delineation (neither is
>> urgent pointer). We can't change that (many people have tried to add
>> message semantics to TCP protocol but have always failed miserably).
>
> Agreed.
>
> My only gripe with kcm right now is a lack of a native sendpage.
> We should be able to zero copy data through KCM streams without
> any problems whatsoever.

I understood from Tom's last mail that the messages are being
constructed *in kernel memory* before being sent out of the tcp
socket. What advantage does sendpage give? The message construction must
actually happen beforehand to fill in the necessary headers for the length,
so the receiver can dissect it again. Streaming semantics don't really fit
here?

Bye,
Hannes


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread Eric Dumazet
On Tue, 2015-11-24 at 16:27 +0100, Florian Westphal wrote:
> David Miller  wrote:
> > From: Tom Herbert 
> > Date: Mon, 23 Nov 2015 09:33:44 -0800
> > 
> > > The TCP PSH flag is not defined for message delineation (neither is
> > > urgent pointer). We can't change that (many people have tried to add
> > > message semantics to TCP protocol but have always failed miserably).
> >
> > Agreed.
> >
> > My only gripe with kcm right now is a lack of a native sendpage.
> 
> Aside from Hannes comment -- KCM seems to be tied to the TLS work, i.e.
> I have the impression that KCM without ability to do TLS in the kernel
> is pretty much useless for whatever use case Tom has in mind.
> 
> And that ktls thing just gives me the creeps.
> 
> For KCM itself I don't even get the use case -- its in the 'yeah, you
> can do that, but... why?' category 8-/

Note that I also played with a similar idea in the TCP stack, trying to
wake up the receiver only when a full RPC was present in the receive queue.

(When dealing with our internal Google RPC format, we can easily delimit
RPC boundaries)

But in the end, latencies were bigger, because the application had to
copy from kernel to user (read()) the full message in one go. While if
you wake up application for every incoming GRO message, we prefill cpu
caches, and the last read() only has to copy the remaining part and
benefit from hot caches (RFS up2date state, TCP socket structure, but
also data in the application)

I focused on making TSO/GRO more effective, and had better results.

One nice idea was to mark the PSH flag for every TSO packet we send.
(One line patch in TCP senders)

This had the nice effect of keeping the number of flows in the receiver
GRO engine as small as possible, avoiding the evictions that we do when
we reach 8 flows per RX queue (This is because the PSH flag tells GRO to
immediately complete the GRO packet to upper stacks)

-> Less cpu spent in GRO engine, as the gro_list is kept small,
better aggregation efficiency.





Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread David Miller
From: Hannes Frederic Sowa 
Date: Tue, 24 Nov 2015 12:25:39 +0100

> Hello,
> 
> David Miller  writes:
> 
>> From: Tom Herbert 
>> Date: Mon, 23 Nov 2015 09:33:44 -0800
>>
>>> The TCP PSH flag is not defined for message delineation (neither is
>>> urgent pointer). We can't change that (many people have tried to add
>>> message semantics to TCP protocol but have always failed miserably).
>>
>> Agreed.
>>
>> My only gripe with kcm right now is a lack of a native sendpage.
>> We should be able to zero copy data through KCM streams without
>> any problems whatsoever.
> 
> I understood from Tom's last mail that the messages are being
> constructed *in kernel memory* before sending out of the tcp
> socket. What advantage gives sendpage? The message construction must
> actually happen before to fill in the necessary headers for the length,
> so receiver can dissect again. Streaming semantics don't really fit
> here?

You can sendpage part of a filesystem file, and push the headers in
front (the "KCM stuff") without disturbing the page frags.
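
As a rough user-space analogy only (this is not the kernel sendpage
path, and the helper name and 4-byte length header are assumptions made
for illustration), a gather write shows the same idea of putting a
header in front of a payload without first copying the payload into a
combined buffer:

```python
import socket
import struct


def send_framed(sock: socket.socket, payload) -> int:
    """Send a hypothetical 4-byte big-endian length header plus payload.

    The header and payload are handed to the kernel as two separate
    buffers in one sendmsg() call, so user space never builds a joined
    copy. Returns the number of bytes the kernel accepted, which may be
    short for large payloads.
    """
    header = struct.pack("!I", len(payload))
    return sock.sendmsg([header, memoryview(payload)])
```

A caller that needs the whole message on the wire would have to loop on
the return value; that is left out to keep the sketch short.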


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-24 Thread David Miller
From: Florian Westphal 
Date: Tue, 24 Nov 2015 16:27:44 +0100

> Aside from Hannes comment -- KCM seems to be tied to the TLS work, i.e.
> I have the impression that KCM without ability to do TLS in the kernel
> is pretty much useless for whatever use case Tom has in mind.

I do not get this impression at all.

Tom's design document in the final patch looks like a legitimate
description of what the core use case is.


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-23 Thread Hannes Frederic Sowa
Hello Tom,

On Mon, Nov 23, 2015, at 18:33, Tom Herbert wrote:
> > For me this still looks a little bit like messages could be delimited by
> > TCP PSH flag, where we might need to have some more fine grained control
> > over and besides that just adding better fanout semantics to TCP, no?
> >
> The TCP PSH flag is not defined for message delineation (neither is
> urgent pointer). We can't change that (many people have tried to add
> message semantics to TCP protocol but have always failed miserably).
> The fact is TCP is always going to be a stream based protocol. Period!
> :-) It is up to the application to interpret the stream and extract
> messages. Even if we could somehow apply the PSH bit to "help" in
> message delineation, we would need to change senders to use the PSH
> bit in that fashion for it to be of benefit to receivers.

I see TCP PSH flags as an optimization and I agree it is hard to
properly make use of them in the internet. But in a datacenter where
everything is under control, this could be done?

Anyway, decoding arbitrary messages in the kernel with maybe huge
lengths could result in starvation problems if you adhere to the socket
receive buffer limits at all times. So I wonder if a forward progress
guarantee can be achieved here, agnostic of the eBPF program? I really
see this becoming a problem as soon as people use it for privilege
separation. Will there be central error handling?

Also, would a TCP option make sense here to add instead of using the TCP
PSH flag? Not sure, yet...

> > Do kcm sockets still allow streaming unlimited amounts of data? E.g. if
> > you want to pass a data stream attached to a rpc message? I think not
> > allowing streaming is a major shortcoming then (even though this will
> > induce head of line blocking).
> >
> RPC messages can be of arbitrary size and with SOCK_SEQPACKET,
> messages can be sent or received in multiple calls. No HOL blocking
> since messages are constructed on KCM sockets before starting to send
> on TCP sockets. Socket buffer limits are respected. KCM does not
> enforce a maximum message size; if an application does have a maximum
> then that can be checked in the BPF code.

I was referring to HOL blocking at the receiver's end, the same as in
user space TCP, where one data stream (or huge message) keeps the byte
stream busy so no other datagrams in there can be delivered. For low
latency I would actually use multiple streams or switch to UDP with
user space based retry.
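
To make the receiver-side concern concrete, here is a small arithmetic
sketch (the function name and the message sizes are made up for
illustration): on one ordered byte stream, a message is only complete
once every byte queued ahead of it has arrived.

```python
def completion_offsets(message_sizes):
    """Return, for each message queued back to back on one ordered
    stream, how many bytes must arrive before that message is complete."""
    offsets = []
    total = 0
    for size in message_sizes:
        total += size
        offsets.append(total)
    return offsets


# One shared stream: the 64-byte message waits behind the 1 MB one.
shared = completion_offsets([1_000_000, 64])

# Separate streams: each message only waits for its own bytes.
separate = [completion_offsets([size])[0] for size in (1_000_000, 64)]
```

With these sizes the small message completes after 1000064 bytes on the
shared stream, but after 64 bytes when it has a stream of its own.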

I think this problem more and more comes down to improving the epoll
interface with somewhat better CPU-steered wake-up capabilities to make
it more agnostic. Some programs, e.g., also want to be woken up only
once an HTTP header has been received completely; SO_RCVLOWAT was made
for this, and FreeBSD has accept_filter for this kind of thing.
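
For reference, a minimal sketch of setting the SO_RCVLOWAT option
mentioned above from user space. Only the socket option itself is a
real interface; the helper name is hypothetical, and how far poll/epoll
honour the low-water mark depends on the kernel and socket type.

```python
import socket


def set_receive_low_water(sock: socket.socket, nbytes: int) -> int:
    """Ask the kernel not to wake a blocking reader until at least
    nbytes are queued, then return the value the kernel reports back."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVLOWAT, nbytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVLOWAT)
```

A fixed byte count only helps when the interesting unit has a known
minimum size; it cannot express "wake me when the header is complete"
for variable-length headers.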

You want to use this in thrift which is mainly Java based and reuse the
existing NIO infrastructure?

> >> Future support:
> >>
> >>  - Integration with TLS (TLS-in-kernel is a separate initiative).
> >
> > This is interesting:
> >
> > Regarding the last week's discussion about better OOB support in TCP
> > e.g. for SOCKET_DESTROY, do you already have a plan to handle TLS alerts
> > and do CHANGE_CIPHER on the socket synchronously?
> >
> Dave should be posting the basic TLS-in-the-kernel patches shortly,
> those will be a better context for discussion.

Thanks, I am looking at them right now. :)

Thanks,
Hannes


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-23 Thread David Miller
From: Tom Herbert 
Date: Mon, 23 Nov 2015 09:33:44 -0800

> The TCP PSH flag is not defined for message delineation (neither is
> urgent pointer). We can't change that (many people have tried to add
> message semantics to TCP protocol but have always failed miserably).

Agreed.

My only gripe with kcm right now is a lack of a native sendpage.
We should be able to zero copy data through KCM streams without
any problems whatsoever.


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-23 Thread Tom Herbert
On Mon, Nov 23, 2015 at 11:54 AM, David Miller  wrote:
> From: Tom Herbert 
> Date: Mon, 23 Nov 2015 09:33:44 -0800
>
>> The TCP PSH flag is not defined for message delineation (neither is
>> urgent pointer). We can't change that (many people have tried to add
>> message semantics to TCP protocol but have always failed miserably).
>
> Agreed.
>
> My only gripe with kcm right now is a lack of a native sendpage.
> We should be able to zero copy data through KCM streams without
> any problems whatsoever.

Right, there is no reason zero copy won't work here. I was just trying
to keep the initial implementation small for reviewability.


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-23 Thread Sowmini Varadhan
On (11/23/15 10:53), Hannes Frederic Sowa wrote:
> > 
> >  - Integration with TLS (TLS-in-kernel is a separate initiative).
> 
> This is interesting:
> 
> Regarding the last week's discussion about better OOB support in TCP
> e.g. for SOCKET_DESTROY, do you already have a plan to handle TLS alerts
> and do CHANGE_CIPHER on the socket synchronously?

I have had that same question too. In fact I pointed this out already
in the thread at http://permalink.gmane.org/gmane.linux.network/382278

In addition to CCS, TLS does other complex things such as mid-session
regeneration of new session keys based on the master-secret. If you
move TLS to the kernel, there may be a lot of 
synchronicity/security/inter-op issues to resolve.

Perhaps it's not a good idea to use "TLS" on the TCP socket, but let
each kcm application negotiate a crypto key (in any manner that it wants) 
and set it on the PF_KCM socket, then use that key to encrypt application
data just before passing it off to tcp. (Of course, then you have to deal 
with the fact that BPF still needs to get to the clear data somehow)

--Sowmini




Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-23 Thread Hannes Frederic Sowa
Hello,

On Fri, Nov 20, 2015, at 22:21, Tom Herbert wrote:
> Kernel Connection Multiplexor (KCM) is a facility that provides a
> message based interface over TCP for generic application protocols.
> The motivation for this is based on the observation that although
> TCP is byte stream transport protocol with no concept of message
> boundaries, a common use case is to implement a framed application
> layer protocol running over TCP. To date, most TCP stacks offer
> byte stream API for applications, which places the burden of message
> delineation, message I/O operation atomicity, and load balancing
> in the application. With KCM an application can efficiently send
> and receive application protocol messages over TCP using a
> datagram interface.

I am struggling a bit to see a real need to come up with a new socket
type and subsystem for that. It looks like you want to solve the same
problem that PACKET_FANOUT does? TCP has the TCP-PSH flag, which could
help delimit messages, and a way to improve FANOUT like PACKET_FANOUT
would solve this same problem, too? A proper fallback has to be in user
space anyway, but messages could maybe simply be flagged with an
skb->mark and fanout could push them to the correct FANOUT-subsocket.

> In order to delineate message in a TCP stream for receive in KCM, the
> kernel implements a message parser. For this we chose to employ BPF
> which is applied to the TCP stream. BPF code parses application layer
> messages and returns a message length. Nearly all binary application
> protocols are parsable in this manner, so KCM should be applicable
> across a wide range of applications. Other than message length
> determination in receive, KCM does not require any other application
> specific awareness. KCM does not implement any other application
> protocol semantics-- these are provided in userspace or could be
> implemented in a kernel module layered above KCM.

For me this still looks a little bit like messages could be delimited by
TCP PSH flag, where we might need to have some more fine grained control
over and besides that just adding better fanout semantics to TCP, no?

Do kcm sockets still allow streaming unlimited amounts of data? E.g. if
you want to pass a data stream attached to a rpc message? I think not
allowing streaming is a major shortcoming then (even though this will
induce head of line blocking).

> Future support:
> 
>  - Integration with TLS (TLS-in-kernel is a separate initiative).

This is interesting:

Regarding the last week's discussion about better OOB support in TCP
e.g. for SOCKET_DESTROY, do you already have a plan to handle TLS alerts
and do CHANGE_CIPHER on the socket synchronously?

Thanks,
Hannes


Re: [PATCH net-next 0/6] kcm: Kernel Connection Multiplexor (KCM)

2015-11-23 Thread Tom Herbert
> For me this still looks a little bit like messages could be delimited by
> TCP PSH flag, where we might need to have some more fine grained control
> over and besides that just adding better fanout semantics to TCP, no?
>
The TCP PSH flag is not defined for message delineation (neither is
urgent pointer). We can't change that (many people have tried to add
message semantics to TCP protocol but have always failed miserably).
The fact is TCP is always going to be a stream based protocol. Period!
:-) It is up to the application to interpret the stream and extract
messages. Even if we could somehow apply the PSH bit to "help" in
message delineation, we would need to change senders to use the PSH
bit in that fashion for it to be of benefit to receivers.

> Do kcm sockets still allow streaming unlimited amounts of data? E.g. if
> you want to pass a data stream attached to a rpc message? I think not
> allowing streaming is a major shortcoming then (even though this will
> induce head of line blocking).
>
RPC messages can be of arbitrary size and with SOCK_SEQPACKET,
messages can be sent or received in multiple calls. No HOL blocking
since messages are constructed on KCM sockets before starting to send
on TCP sockets. Socket buffer limits are respected. KCM does not
enforce a maximum message size; if an application does have a maximum
then that can be checked in the BPF code.
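
The parsing step can be sketched in user space as follows. This is not
BPF and not the KCM code; the 4-byte big-endian length header, the
64 KB cap, and the return convention (0 for "need more bytes", -1 for
"too large") are all assumptions made for illustration.

```python
import struct

HEADER = struct.Struct("!I")  # hypothetical 4-byte length prefix
MAX_MSG = 64 * 1024           # hypothetical application maximum


def message_length(stream: bytes) -> int:
    """Return the total length (header + payload) of the first message,
    0 if the header is not complete yet, or -1 if it exceeds MAX_MSG."""
    if len(stream) < HEADER.size:
        return 0
    (payload_len,) = HEADER.unpack_from(stream)
    total = HEADER.size + payload_len
    return -1 if total > MAX_MSG else total


def split_messages(stream: bytes):
    """Cut a byte stream into complete payloads.

    Returns (payloads, leftover), where leftover is the partial message
    still waiting for more bytes."""
    payloads = []
    while True:
        n = message_length(stream)
        if n < 0:
            raise ValueError("message exceeds MAX_MSG")
        if n == 0 or n > len(stream):
            return payloads, stream
        payloads.append(stream[HEADER.size:n])
        stream = stream[n:]
```

The parser only has to report a length; everything else about the
message stays opaque to the layer doing the delineation.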

>> Future support:
>>
>>  - Integration with TLS (TLS-in-kernel is a separate initiative).
>
> This is interesting:
>
> Regarding the last week's discussion about better OOB support in TCP
> e.g. for SOCKET_DESTROY, do you already have a plan to handle TLS alerts
> and do CHANGE_CIPHER on the socket synchronously?
>
Dave should be posting the basic TLS-in-the-kernel patches shortly,
those will be a better context for discussion.

Tom