Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.

2016-04-30 Thread Vijay Pandurangan
On Sat, Apr 30, 2016 at 5:52 PM, Ben Greear  wrote:
>>
>> Good point, so if you had:
>>
>> eth0 <-> raw <-> user space-bridge <-> raw <-> vethA <-> veth B <->
>> userspace-stub <->eth1
>>
>> and user-space hub enabled this elide flag, things would work, right?
>> Then, it seems like what we need is a way to tell the kernel
>> router/bridge logic to follow elide signals in packets coming from
>> veth. I'm not sure what the best way to do this is because I'm less
>> familiar with conventions in that part of the kernel, but assuming
>> there's a way to do this, would it be acceptable?
>
>
> You cannot receive on one veth without transmitting on the other, so
> I think the elide csum logic can go on the raw-socket, and apply to packets
> in the transmit-from-user-space direction.  Just allowing the socket to make
> the veth behave like it used to before this patch in question should be good
> enough, since that worked for us for years.  So, just an option to modify
> the
> ip_summed for pkts sent on a socket is probably sufficient.

I don't think this is right. Consider:

- App A  sends out corrupt packets 50% of the time and discards inbound data.
- App B doesn't care about corrupt packets and is happy to receive
them and has some way of dealing with them (special case)
- App C is a regular app, say nc or something.

In your world, where A decides what happens to data it transmits,
then
A<--veth-->B and A<---wire-->B will have the same behaviour

but

A<-- veth --> C and A<-- wire --> C will have _different_ behaviour: C
will behave incorrectly if it's connected over veth but correctly if
connected with a wire. That is a bug.

Since A cannot know what the app it's talking to will desire, I argue
that both sides of a message must be opted in to this optimization.






>
>>> There may be no sockets on the vethB port.  And reader/writer is not
>>> a good way to look at it since I am implementing a bi-directional bridge
>>> in
>>> user-space and each packet-socket is for both rx and tx.
>>
>>
>> Sure, but we could model a bidrectional connection as two
>> unidirectional sockets for our discussions here, right?
>
>
> Best not to I think, you want to make sure that one socket can
> correctly handle tx and rx.  As long as that works, then using
> uni-directional sockets should work too.
>
>
> Thanks,
> Ben
>
> --
> Ben Greear 
> Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.

2016-04-30 Thread Vijay Pandurangan
On Sat, Apr 30, 2016 at 5:29 PM, Ben Greear <gree...@candelatech.com> wrote:
>
>
> On 04/30/2016 02:13 PM, Vijay Pandurangan wrote:
>>
>> On Sat, Apr 30, 2016 at 4:59 PM, Ben Greear <gree...@candelatech.com>
>> wrote:
>>>
>>>
>>>
>>> On 04/30/2016 12:54 PM, Tom Herbert wrote:
>>>>
>>>>
>>>> We've put considerable effort into cleaning up the checksum interface
>>>> to make it as unambiguous as possible, please be very careful to
>>>> follow it. Broken checksum processing is really hard to detect and
>>>> debug.
>>>>
>>>> CHECKSUM_UNNECESSARY means that some number of _specific_ checksums
>>>> (indicated by csum_level) have been verified to be correct in a
>>>> packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is
>>>> never right. If CHECKSUM_UNNECESSARY is set in such a manner but the
>>>> checksum it would refer to has not been verified and is incorrect this
>>>> is a major bug.
>>>
>>>
>>>
>>> Suppose I know that the packet received on a packet-socket has
>>> already been verified by a NIC that supports hardware checksumming.
>>>
>>> Then, I want to transmit it on a veth interface using a second
>>> packet socket.  I do not want veth to recalculate the checksum on
>>> transmit, nor to validate it on the peer veth on receive, because I do
>>> not want to waste the CPU cycles.  I am assuming that my app is not
>>> accidentally corrupting frames, so the checksum can never be bad.
>>>
>>> How should the checksumming be configured for the packets going into
>>> the packet-socket from user-space?
>>
>>
>>
>> It seems like that only the receiver should decide whether or not to
>> checksum packets on the veth, not the sender.
>>
>> How about:
>>
>> We could add a receiving socket option for "don't checksum packets
>> received from a veth when the other side has marked them as
>> elide-checksum-suggested" (similar to UDP_NOCHECKSUM), and a sending
>> socket option for "mark all data sent via this socket to a veth as
>> elide-checksum-suggested".
>>
>> So the process would be:
>>
>> Writer:
>> 1. open read socket
>> 2. open write socket, with option elide-checksum-for-veth-suggested
>> 3. write data
>>
>> Reader:
>> 1. open read socket with "follow-elide-checksum-suggestions-on-veth"
>> 2. read data
>>
>> The kernel / module would then need to persist the flag on all packets
>> that traverse a veth, and drop these data when they leave the veth
>> module.
>
>
> I'm not sure this works completely.  In my app, the packet flow might be:
>
> eth0 <-> raw-socket <-> user-space-bridge <-> raw-socket <-> vethA <-> vethB
> <-> [kernel router/bridge logic ...] <-> eth1

Good point, so if you had:

eth0 <-> raw <-> user space-bridge <-> raw <-> vethA <-> veth B <->
userspace-stub <->eth1

and user-space hub enabled this elide flag, things would work, right?
Then, it seems like what we need is a way to tell the kernel
router/bridge logic to follow elide signals in packets coming from
veth. I'm not sure what the best way to do this is because I'm less
familiar with conventions in that part of the kernel, but assuming
there's a way to do this, would it be acceptable?



>
> There may be no sockets on the vethB port.  And reader/writer is not
> a good way to look at it since I am implementing a bi-directional bridge in
> user-space and each packet-socket is for both rx and tx.

Sure, but we could model a bidrectional connection as two
unidirectional sockets for our discussions here, right?



>
>>> Also, I might want to send raw frames that do have
>>> broken checksums (lets assume a real NIC, not veth), and I want them
>>> to hit the wire with those bad checksums.
>>>
>>>
>>> How do I configure the checksumming in this case?
>>
>>
>>
>> Correct me if I'm wrong but I think this is already possible now. You
>> can have packets with incorrect checksum hitting the wire as is. What
>> you cannot do is instruct the receiving end to ignore the checksum
>> from the sending end when using a physical device (and something I
>> think we should mimic on the sending device).
>
>
> Yes, it does work currently (or, last I checked)...I just want to make sure
> it keeps working.

Yeah, good point. It would be great if we could write a test, or at
the very least, a list of invariants about what kinds of things should
work somewhere.

>
>
> Thanks,
> Ben
>
> --
> Ben Greear <gree...@candelatech.com>
> Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.

2016-04-30 Thread Vijay Pandurangan
On Sat, Apr 30, 2016 at 4:59 PM, Ben Greear  wrote:
>
>
> On 04/30/2016 12:54 PM, Tom Herbert wrote:
>>
>> We've put considerable effort into cleaning up the checksum interface
>> to make it as unambiguous as possible, please be very careful to
>> follow it. Broken checksum processing is really hard to detect and
>> debug.
>>
>> CHECKSUM_UNNECESSARY means that some number of _specific_ checksums
>> (indicated by csum_level) have been verified to be correct in a
>> packet. Blindly promoting CHECKSUM_NONE to CHECKSUM_UNNECESSARY is
>> never right. If CHECKSUM_UNNECESSARY is set in such a manner but the
>> checksum it would refer to has not been verified and is incorrect this
>> is a major bug.
>
>
> Suppose I know that the packet received on a packet-socket has
> already been verified by a NIC that supports hardware checksumming.
>
> Then, I want to transmit it on a veth interface using a second
> packet socket.  I do not want veth to recalculate the checksum on
> transmit, nor to validate it on the peer veth on receive, because I do
> not want to waste the CPU cycles.  I am assuming that my app is not
> accidentally corrupting frames, so the checksum can never be bad.
>
> How should the checksumming be configured for the packets going into
> the packet-socket from user-space?


It seems like that only the receiver should decide whether or not to
checksum packets on the veth, not the sender.

How about:

We could add a receiving socket option for "don't checksum packets
received from a veth when the other side has marked them as
elide-checksum-suggested" (similar to UDP_NOCHECKSUM), and a sending
socket option for "mark all data sent via this socket to a veth as
elide-checksum-suggested".

So the process would be:

Writer:
1. open read socket
2. open write socket, with option elide-checksum-for-veth-suggested
3. write data

Reader:
1. open read socket with "follow-elide-checksum-suggestions-on-veth"
2. read data

The kernel / module would then need to persist the flag on all packets
that traverse a veth, and drop these data when they leave the veth
module.


>
>
> Also, I might want to send raw frames that do have
> broken checksums (lets assume a real NIC, not veth), and I want them
> to hit the wire with those bad checksums.
>
>
> How do I configure the checksumming in this case?


Correct me if I'm wrong but I think this is already possible now. You
can have packets with incorrect checksum hitting the wire as is. What
you cannot do is instruct the receiving end to ignore the checksum
from the sending end when using a physical device (and something I
think we should mimic on the sending device).


>
>
>
> Thanks,
> Ben
>
>
>
>>
>> Tom
>>
>> On Sat, Apr 30, 2016 at 12:40 PM, Ben Greear  wrote:
>>>
>>>
>>>
>>> On 04/30/2016 11:33 AM, Ben Hutchings wrote:


 On Thu, 2016-04-28 at 12:29 +0200, Sabrina Dubroca wrote:
>
>
> Hello,
>>>
>>>
>>>
>>
>> http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a
>
>
> Actually, no, this is not really a regression.


 [...]

 It really is.  Even though the old behaviour was a bug (raw packets
 should not be changed), if there are real applications that depend on
 that then we have to keep those applications working somehow.
>>>
>>>
>>>
>>> To be honest, I fail to see why the old behaviour is a bug when sending
>>> raw packets from user-space.  If raw packets should not be changed, then
>>> we need some way to specify what the checksum setting is to begin with,
>>> otherwise, user-space has not enough control.
>>>
>>> A socket option for new programs, and sysctl configurable defaults for raw
>>> sockets
>>> for old binary programs would be sufficient I think.
>>>
>>>
>>> Thanks,
>>> Ben
>>>
>>> --
>>> Ben Greear 
>>> Candela Technologies Inc  http://www.candelatech.com
>>
>>
>
> --
> Ben Greear 
> Candela Technologies Inc  http://www.candelatech.com


Re: [PATCH 3.2 085/115] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.

2016-04-30 Thread Vijay Pandurangan
[oops – resending this because I was using gmail in HTML mode before
by accident]

There was a discussion on a separate thread about this. I agree with
Sabrina fully. I believe veth should provide an abstraction layer that
correctly emulates a physical network in all ways.

Consider an environment where we have multiple physical computers.
Each computer runs one or more containers, each of which has a
publicly routable ip address.  When adding a new app to the cluster, a
scheduler might decide to run this container on any physical machine
of its choice, assuming that apps have a way of routing traffic to
their backends (we did something similar Google >10 years ago). This
is something we might imagine happening with docker and ipv6 for
instance.

If you have an app, A, which sends raw ethernet frames (the simplest
case I could imagine) with TCP data that has invalid checksums to app
B, which is receiving it, the behaviour of the system _will be
different_ depending upon whether app B is scheduled to run on the
same machine as app A or not. This seems like a clear bug and a broken
abstraction (especially as the default case), and something we should
endeavour to avoid.

I do see Ben's point about enabling sw checksum verification as
potentially incurring a huge performance penalty (I haven't had a
chance to measure it) that is completely wasteful in the vast majority
of cases.

Unfortunately I just don't see how we can solve this problem in a way
that preserves a correct abstraction layer while also avoiding excess
work. I guess the first piece of data that would be helpful is to
determine just how big of a performance penalty this is. If it's
small, then maybe it doesn't matter.




On Thu, Apr 28, 2016 at 6:29 AM, Sabrina Dubroca  wrote:
> Hello,
>
> 2016-04-27, 17:14:44 -0700, Ben Greear wrote:
>> On 04/27/2016 05:00 PM, Hannes Frederic Sowa wrote:
>> > Hi Ben,
>> >
>> > On Wed, Apr 27, 2016, at 20:07, Ben Hutchings wrote:
>> > > On Wed, 2016-04-27 at 08:59 -0700, Ben Greear wrote:
>> > > > On 04/26/2016 04:02 PM, Ben Hutchings wrote:
>> > > > >
>> > > > > 3.2.80-rc1 review patch.  If anyone has any objections, please let 
>> > > > > me know.
>> > > > I would be careful about this.  It causes regressions when sending
>> > > > PACKET_SOCKET buffers from user-space to veth devices.
>> > > >
>> > > > There was a proposed upstream fix for the regression, but it has not 
>> > > > gone
>> > > > into the tree as far as I know.
>> > > >
>> > > > http://www.spinics.net/lists/netdev/msg370436.html
>> > > [...]
>> > >
>> > > OK, I'll drop this for now.
>> >
>> > The fall out from not having this patch is in my opinion a bigger
>> > fallout than not having this patch. This patch fixes silent data
>> > corruption vs. the problem Ben Greear is talking about, which might not
>> > be that a common usage.
>> >
>> > What do others think?
>> >
>> > Bye,
>> > Hannes
>> >
>>
>> This patch from Cong Wang seems to fix the regression for me, I think it 
>> should be added and
>> tested in the main tree, and then apply them to stable as a pair.
>>
>> http://dmz2.candelatech.com/?p=linux-4.4.dev.y/.git;a=commitdiff;h=8153e983c0e5eba1aafe1fc296248ed2a553f1ac;hp=454b07405d694dad52e7f41af5816eed0190da8a
>
> Actually, no, this is not really a regression.
>
> If you capture packets on a device with checksum offloading enabled,
> the TCP/UDP checksum isn't filled.  veth also behaves that way.  What
> the "veth: don't modify ip_summed" patch does is enable proper
> checksum validation on veth.  This really was a bug in veth.
>
> Cong's patch would also break cases where we choose to inject packets
> with invalid checksums, and they would now be accepted as correct.
>
> Your use case is invalid, it just happened to work because of a
> bug.  If you want the stack to fill checksums so that you want capture
> and reinject packets, you have to disable checksum offloading (or
> compute the checksum yourself in userspace).
>
> Thanks.
>
> --
> Sabrina


Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-04-07 Thread Vijay Pandurangan
On Fri, Mar 25, 2016 at 7:46 PM, Ben Greear  wrote:
> A real NIC can either do hardware checksums, or it cannot.  If it
> cannot, then the host must do it on the CPU for both transmit and
> receive.
>
> Veth is not a real NIC, and it cannot do hardware checksum offloading.
>
> So, we either lie and pretend it does, or we eat massive amounts
> of CPU usage to calculate and check checksums when sending across
> a veth pair.
>

That's a good point. Does anyone know what the overhead actually is these days?

>>> Any frame sent from a socket can be considered to be a local packet in my
>>> opinion.
>>
>>
>> I'm not sure that's totally right. Your bridge is adding a delay to
>> your packets; it could just as easily be simulating corruption by
>> corrupting 5% of packets going through it. If this change allows
>> corrupt packets to be delivered to an application when they could not
>> be delivered if the packets were routed via physical eths, I think
>> that is a bug.
>
>
> I actually do support corrupting the frame, but what I normally do is
> corrupt the contents
> of the packet, and then recalculate the IP checksum (and TCP if it applies)
> and send it on its way.  The receiving NIC and stack will pass the frame up
> to
> the application since the checksums match, and it would be up the
> application
> to deal with it.  So, I can easily cause an application to receive corrupted
> frames over physical eths.
>
> I can also corrupt without updating the checksums in case you want to
> test another systems NIC and/or stack.
>
> But, if I am purposely corrupting a frame destined for veth, then the only
> reason
> I would want the stack to check the checksums is if I were testing my own
> stack's checksum logic, and that seems to be a pretty limited use.


In the common case you're 100% right.  OTOH, there's something
disconcerting about an abstraction layer lying and behaving
unexpectedly.  Most traffic that originates on a machine can have its
checksums safely ignored.  Whatever the reason is (maybe, as you say
you're testing checksums – on the other hand maybe there's a bug in
your code somewhere), I really feel like we should try to figure out a
way to ensure that this optimization is at the very least opt-in…

>
>
> Thanks,
> Ben
>
> --
> Ben Greear 
> Candela Technologies Inc  http://www.candelatech.com
>


Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-03-25 Thread Vijay Pandurangan
On Fri, Mar 25, 2016 at 6:23 PM, Ben Greear <gree...@candelatech.com> wrote:
> On 03/25/2016 02:59 PM, Vijay Pandurangan wrote:
>>
>> consider two scenarios, where process a sends raw ethernet frames
>> containing UDP packets to b
>>
>> I) process a --> veth --> process b
>>
>> II) process a -> eth -> wire -> eth -> process b
>>
>> I believe (I) is the simplest setup we can create that will replicate this
>> bug.
>>
>> If process a sends frames that contain UDP packets to process b, what
>> is the behaviour we want if the UDP packet *has an incorrect
>> checksum*?
>>
>> It seems to me that I and II should have identical behaviour, and I
>> would think that (II) would not deliver the packets to the
>> application.
>>
>> In (I) with Cong's patch would we be delivering corrupt UDP packets to
>> process b despite an incorrect checksum in (I)?
>>
>> If so, I would argue that this patch isn't right.
>
>
> Checksums are normally used to deal with flaky transport mechanisms,
> and once a machine receives the frame, we do not keep re-calculating
> checksums
> as we move it through various drivers and subsystems.
>
> In particular, checksums are NOT a security mechanism and can be easily
> faked.
>
> Since packets sent on one veth never actually hit any unreliable transport
> before they are received on the peer veth, then there should be no need to
> checksum packets whose origin is known to be on the local machine.

That's a good argument.  I'm trying to figure out how to reconcile
your thoughts with the argument that virtual ethernet devices are an
abstraction that should behave identically to perfectly-functional
physical ethernet devices when connected with a wire.

In my view, the invariant must be identical functionality, and if I
were writing a regression test for this system, that's what I would
test. I think optimizations for eliding checksums should be
implemented only if they don't alter this functionality.

There must be a way to structure / write this code so that we can
optimize veths without causing different behaviour ...


>
> Any frame sent from a socket can be considered to be a local packet in my
> opinion.

I'm not sure that's totally right. Your bridge is adding a delay to
your packets; it could just as easily be simulating corruption by
corrupting 5% of packets going through it. If this change allows
corrupt packets to be delivered to an application when they could not
be delivered if the packets were routed via physical eths, I think
that is a bug.

>
> That is what Cong's patch does as far as I can tell.
>
>
> Thanks,
> Ben
>
>
> --
> Ben Greear <gree...@candelatech.com>
> Candela Technologies Inc  http://www.candelatech.com
>


Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-03-25 Thread Vijay Pandurangan
consider two scenarios, where process a sends raw ethernet frames
containing UDP packets to b

I) process a --> veth --> process b

II) process a -> eth -> wire -> eth -> process b

I believe (I) is the simplest setup we can create that will replicate this bug.

If process a sends frames that contain UDP packets to process b, what
is the behaviour we want if the UDP packet *has an incorrect
checksum*?

It seems to me that I and II should have identical behaviour, and I
would think that (II) would not deliver the packets to the
application.

In (I) with Cong's patch would we be delivering corrupt UDP packets to
process b despite an incorrect checksum in (I)?

If so, I would argue that this patch isn't right.


On Fri, Mar 25, 2016 at 4:56 PM, Ben Greear  wrote:
> On 03/24/2016 10:33 PM, Cong Wang wrote:
>
>> Here we go:
>>
>> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
>> index 1ecfa71..ab66080 100644
>> --- a/net/packet/af_packet.c
>> +++ b/net/packet/af_packet.c
>> @@ -1925,6 +1925,7 @@ static int packet_sendmsg_spkt(struct socket
>> *sock, struct msghdr *msg,
>>  goto out_unlock;
>>  }
>>
>> +   skb->ip_summed = CHECKSUM_UNNECESSARY;
>>  skb->protocol = proto;
>>  skb->dev = dev;
>>  skb->priority = sk->sk_priority;
>> @@ -2496,6 +2497,7 @@ static int tpacket_fill_skb(struct packet_sock
>> *po, struct sk_buff *skb,
>>
>>  ph.raw = frame;
>>
>> +   skb->ip_summed = CHECKSUM_UNNECESSARY;
>>  skb->protocol = proto;
>>  skb->dev = dev;
>>  skb->priority = po->sk.sk_priority;
>> @@ -2805,6 +2807,7 @@ static struct sk_buff *packet_alloc_skb(struct
>> sock *sk, size_t prepad,
>>  skb_put(skb, linear);
>>  skb->data_len = len - linear;
>>  skb->len += len - linear;
>> +   skb->ip_summed = CHECKSUM_UNNECESSARY;
>>
>>  return skb;
>>   }
>
>
> I have tested UDP, TCP, TCPv6 and custom Ethernet frames across a veth pair.
>
> And, UDP, TCP, and pktgen across a pair of veth pairs
> bridged by my user-space packet filter.
>
> All of these tests work fine with your patch as far as I can tell.
>
> So, you can add:
>
> Tested-by: Ben Greear 
>
> That said, it could easily break some drivers and/or other scenarios that I
> have not tested, so at the least it should cook a while upstream before
> going into the
> stable tree
>
>
> Thanks,
> Ben
>
> --
> Ben Greear 
> Candela Technologies Inc  http://www.candelatech.com
>


Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-03-25 Thread Vijay Pandurangan
On Fri, Mar 25, 2016 at 10:35 AM, Ben Greear <gree...@candelatech.com> wrote:
>
>
> On 03/24/2016 10:24 PM, Vijay Pandurangan wrote:
>>
>> On Fri, Mar 25, 2016 at 1:07 AM, Ben Greear <gree...@candelatech.com>
>> wrote:
>>>
>>> On 03/24/2016 09:45 PM, Vijay Pandurangan wrote:
>>>>
>>>>
>>>> Actually, maybe they should be set to CHECKSUM_PARTIAL if we want veth
>>>> to drop the packets if they have bad checksums before they hit the
>>>> application level.
>>>
>>>
>>>
>>> VETH is pretty special in that when you transmit a frame on one
>>> device, it's pair receives it, and unless there is RAM corruption
>>> or bugs in the kernel, then it cannot be corrupted.
>>
>>
>> Yeah, you're right that that's an optimization. However, I think that
>> we should first ensure that
>>
>> a->veth->b
>>
>> operates exactly like:
>>
>> a->physical eth 1 -> physical eth 2->b
>>
>> in all cases.  Once we have that working everywhere we could think
>> about optimizations.
>>
>>
>> If we're willing to refactor, we could implement the optimization by
>> allowing veth devices to know whether their immediate peer is. If a
>> veth knows it's talking to another veth, it could under some
>> circumstances elide checksum calculation and verification.  I'm not
>> sure what abstractions that would break, though. What do you guys
>> think?
>
>
> veth ALWAYS transmits to another VETH.  The problem is that when veth is
> given a packet to transmit, it is difficult to know where that packet
> came from.

Yeah you're totally right – I guess what I was trying to express (but
failed at) was that we might need to be able to track the original
source of the packet for optimizations.
>
> And, adding software checksumming to veth for every frame would be a huge
> performance hit.
>
>
> Thanks,
> Ben
>
> --
> Ben Greear <gree...@candelatech.com>
> Candela Technologies Inc  http://www.candelatech.com


Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-03-24 Thread Vijay Pandurangan
On Fri, Mar 25, 2016 at 1:07 AM, Ben Greear <gree...@candelatech.com> wrote:
> On 03/24/2016 09:45 PM, Vijay Pandurangan wrote:
>>
>> Actually, maybe they should be set to CHECKSUM_PARTIAL if we want veth
>> to drop the packets if they have bad checksums before they hit the
>> application level.
>
>
> VETH is pretty special in that when you transmit a frame on one
> device, it's pair receives it, and unless there is RAM corruption
> or bugs in the kernel, then it cannot be corrupted.

Yeah, you're right that that's an optimization. However, I think that
we should first ensure that

a->veth->b

operates exactly like:

a->physical eth 1 -> physical eth 2->b

in all cases.  Once we have that working everywhere we could think
about optimizations.


If we're willing to refactor, we could implement the optimization by
allowing veth devices to know whether their immediate peer is. If a
veth knows it's talking to another veth, it could under some
circumstances elide checksum calculation and verification.  I'm not
sure what abstractions that would break, though. What do you guys
think?



>
> But, if you are routing frames from the network to veth
> devices, then the original packet could be corrupted, as
> described in your patch.
>
> Probably the right behaviour is to keep the old logic for packets
> originating from sockets, at least.  I am not sure of a good way
> to implement that.
>
> As for testing, I am not sure.  Probably you have to make a good
> effort and then just deal with fallout like what I found.  I would guess
> that any of us who have ever written an interesting patch have also written
> an interesting bug!
>
>
> Thanks,
> Ben
>
> --
> Ben Greear <gree...@candelatech.com>
> Candela Technologies Inc  http://www.candelatech.com


Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-03-24 Thread Vijay Pandurangan
Actually, maybe they should be set to CHECKSUM_PARTIAL if we want veth
to drop the packets if they have bad checksums before they hit the
application level.

On Fri, Mar 25, 2016 at 12:41 AM, Vijay Pandurangan <vij...@vijayp.ca> wrote:
> agreed. It should maybe be set to CHECKSUM_UNNECESSARY. The comment
> seems to imply that it's treated the same as CHECKSUM_NONE but that's
> evidently not true. I think that would fix the checksumming issue but
> I'm fearful it may break something else:
> http://lxr.free-electrons.com/source/include/linux/skbuff.h#L177
>
> I'm really worried there are other equally subtle bugs hidden in this
> code. Do we have any kind of regression test, or any automated way to
> test all possible values on an skb to determine side effects to any
> change here? (I'm new to the kernel so sorry if there's an answer in
> an FAQ somewhere).
>
> If not,
> 1. how should we ensure that our change doesn't break something else?
> 2. Should we audit / simplify the checksum code or come up with a list
> of test cases that covers all uses?
>
>
> On Fri, Mar 25, 2016 at 12:34 AM, Ben Greear <gree...@candelatech.com> wrote:
>>
>>
>> On 03/24/2016 06:44 PM, Vijay Pandurangan wrote:
>>>
>>> Oops, I think my last email didn't go through due to an inadvertent
>>> html attachment from my phone mail client.
>>>
>>> Can you send us a copy of a packet you're sending and/or confirm that
>>> the IP and UDP4 checksums are set correctly in the packet?
>>>
>>> If those are set right, I think we need to read through the networking
>>> code again to see why this is broken...
>>
>>
>> Wireshark decodes the packet as having no checksum errors.
>>
>> I think the contents of the packet is correct, but the 'ip_summed'
>> field is set incorrectly to 'NONE' when transmitting on a raw packet
>> socket.
>>
>>
>> Thanks,
>> Ben
>>
>> --
>> Ben Greear <gree...@candelatech.com>
>> Candela Technologies Inc  http://www.candelatech.com


Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-03-24 Thread Vijay Pandurangan
agreed. It should maybe be set to CHECKSUM_UNNECESSARY. The comment
seems to imply that it's treated the same as CHECKSUM_NONE but that's
evidently not true. I think that would fix the checksumming issue but
I'm fearful it may break something else:
http://lxr.free-electrons.com/source/include/linux/skbuff.h#L177

I'm really worried there are other equally subtle bugs hidden in this
code. Do we have any kind of regression test, or any automated way to
test all possible values on an skb to determine side effects to any
change here? (I'm new to the kernel so sorry if there's an answer in
an FAQ somewhere).

If not,
1. how should we ensure that our change doesn't break something else?
2. Should we audit / simplify the checksum code or come up with a list
of test cases that covers all uses?


On Fri, Mar 25, 2016 at 12:34 AM, Ben Greear <gree...@candelatech.com> wrote:
>
>
> On 03/24/2016 06:44 PM, Vijay Pandurangan wrote:
>>
>> Oops, I think my last email didn't go through due to an inadvertent
>> html attachment from my phone mail client.
>>
>> Can you send us a copy of a packet you're sending and/or confirm that
>> the IP and UDP4 checksums are set correctly in the packet?
>>
>> If those are set right, I think we need to read through the networking
>> code again to see why this is broken...
>
>
> Wireshark decodes the packet as having no checksum errors.
>
> I think the contents of the packet is correct, but the 'ip_summed'
> field is set incorrectly to 'NONE' when transmitting on a raw packet
> socket.
>
>
> Thanks,
> Ben
>
> --
> Ben Greear <gree...@candelatech.com>
> Candela Technologies Inc  http://www.candelatech.com


Re: veth regression with "don’t modify ip_summed; doing so treats packets with bad checksums as good."

2016-03-24 Thread Vijay Pandurangan
Oops, I think my last email didn't go through due to an inadvertent
html attachment from my phone mail client.

Can you send us a copy of a packet you're sending and/or confirm that
the IP and UDP4 checksums are set correctly in the packet?

If those are set right, I think we need to read through the networking
code again to see why this is broken...



On Thu, Mar 24, 2016 at 9:13 PM, Ben Greear  wrote:
> On 03/24/2016 06:11 PM, Ben Greear wrote:
>>
>> On 03/24/2016 05:06 PM, Ben Greear wrote:
>>>
>>> On 03/24/2016 04:56 PM, Cong Wang wrote:

 On Thu, Mar 24, 2016 at 3:01 PM, Ben Greear 
 wrote:
>
> I have an application that creates two pairs of veth devices.
>
> a <-> b   c <-> d
>
> b and c have a raw packet socket opened on them and I 'bridge' frames
> between b and c to provide network emulation (ie, configurable delay).
>

 IIUC, you create two raw sockets in order to bridge these two veth
 pairs?
 That is, to receive packets on one socket and deliver packets on the
 other?
>>>
>>>
>>> Yes.
>>>
> I put IP 1.1.1.1/24 on a, 1.1.1.2/24 on d, and then create a UDP
> connection
> (using policy based routing to ensure frames are sent on the
> appropriate
> interfaces).
>
> This is user-space only app, and kernel in this case is completely
> unmodified.
>
> The commit below breaks this feature:  UDP frames are sniffed on both a
> and
> d ports
> (in both directions), but the UDP socket does not receive frames.
>
> Using normal ethernet ports, this network emulation feature works fine,
> so
> it is
> specific to VETH.
>
> A similar test with just sending UDP between a single veth pair:  e <->
> f
> works fine.  Maybe it has something to do with raw packets?
>

 Yeah, I have the same feeling. Could you trace kfree_skb() to see
 where these packets are dropped? At UDP layer?
>>>
>>>
>>> Since reverting the patch fixes this, it almost certainly has to be due
>>> to some
>>> checksum checking logic.  Since UDP sockets (between single veth pair)
>>> works, it would appear to be related to my packet bridge, so maybe
>>> it is specific to raw packets and/or sendmmsg api.
>>>
>>> I'll investigate it better tomorrow.
>>
>>
>> So, I found time to poke at it this evening:
>>
>> Sending between two veth pairs, no packet bridge involved.
>
>
> E, to be clear:  I mean sending between two ends of a single veth pair
> here.
>
>>
>> UDP:  ip_summed is 3 (CHECKSUM_PARTIAL)   # Works fine.
>> raw packet frames, custom ether protocol (0x type):  ip_summed is 0
>> (NONE) # Works fine.
>>
>> When I try to send UDP through the veth pairs & pkt bridge, I see this:
>
>
> Thanks,
> Ben
>
> --
> Ben Greear 
> Candela Technologies Inc  http://www.candelatech.com
>


Re: [PATCH] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.

2016-02-11 Thread Vijay Pandurangan
Just as a followup, I wrote a short blog detailing the bug and our
resolution: (https://twitter.com/vijayp/status/697837808417779716)

Thanks again for your help in guiding us through our first kernel
patch. This was a great experience!

direct link: 
https://medium.com/vijay-pandurangan/linux-kernel-bug-delivers-corrupt-tcp-ip-data-to-mesos-kubernetes-docker-containers-4986f88f7a19#.aymvnbaa8

On Wed, Dec 23, 2015 at 9:57 AM, Cong Wang <xiyou.wangc...@gmail.com> wrote:
> On Tue, Dec 22, 2015 at 11:37 PM, Vijay Pandurangan <vij...@vijayp.ca> wrote:
>> Cool, thanks! I see it in the -stable queue. Is there anything else I need
>> to do to help with getting this into main or backporting?  Happy to pitch in
>> if I can be helpful.
>>
>
> DaveM usually just backports it to a few recent stable tree, if you want
> to backport further, for example 3.14, you probably need to send
> the commit ID to Greg KH.
>
> Thanks.


[PATCH] veth: don't modify ip-summed; doing so treats packets with bad checksums as good.

2015-12-18 Thread Vijay Pandurangan
Packets that arrive from real hardware devices have ip_summed ==
CHECKSUM_UNNECESSARY if the hardware verified the checksums, or
CHECKSUM_NONE if the packet is bad or it was unable to verify it. The
current version of veth will replace CHECKSUM_NONE with
CHECKSUM_UNNECESSARY, which causes corrupt packets routed from hardware to
a veth device to be delivered to the application. This caused applications
at Twitter to receive corrupt data when network hardware was corrupting
packets.

We believe this was added as an optimization to skip computing and
verifying checksums for communication between containers. However, locally
generated packets have ip_summed == CHECKSUM_PARTIAL, so the code as
written does nothing for them. As far as we can tell, after removing this
code, these packets are transmitted from one stack to another unmodified
(tcpdump shows invalid checksums on both sides, as expected), and they are
delivered correctly to applications. We didn’t test every possible network
configuration, but we tried a few common ones such as bridging containers,
using NAT between the host and a container, and routing from hardware
devices to containers. We have effectively deployed this in production at
Twitter (by disabling RX checksum offloading on veth devices).

This code dates back to the first version of the driver, commit
 ("[NET]: Virtual ethernet device driver"), so I
suspect this bug occurred mostly because the driver API has evolved
significantly since then. Commit <0b7967503dc97864f283a> ("net/veth: Fix
packet checksumming") (in December 2010) fixed this for packets that get
created locally and sent to hardware devices, by not changing
CHECKSUM_PARTIAL. However, the same issue still occurs for packets coming
in from hardware devices.

Co-authored-by: Evan Jones <e...@evanjones.ca>
Signed-off-by: Evan Jones <e...@evanjones.ca>
Cc: Nicolas Dichtel <nicolas.dich...@6wind.com>
Cc: Phil Sutter <p...@nwl.cc>
Cc: Toshiaki Makita <makita.toshi...@lab.ntt.co.jp>
Cc: netdev@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Vijay Pandurangan <vij...@vijayp.ca>
---
 drivers/net/veth.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 0ef4a5a..ba21d07 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -117,12 +117,6 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb,
struct net_device *dev)
  kfree_skb(skb);
  goto drop;
  }
- /* don't change ip_summed == CHECKSUM_PARTIAL, as that
- * will cause bad checksum on forwarded packets
- */
- if (skb->ip_summed == CHECKSUM_NONE &&
-rcv->features & NETIF_F_RXCSUM)
- skb->ip_summed = CHECKSUM_UNNECESSARY;

  if (likely(dev_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
  struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
-- 
2.5.0
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] veth: don’t modify ip_summed; doing so treats packets with bad checksums as good.

2015-12-18 Thread Vijay Pandurangan
Packets that arrive from real hardware devices have ip_summed ==
CHECKSUM_UNNECESSARY if the hardware verified the checksums, or
CHECKSUM_NONE if the packet is bad or it was unable to verify it. The
current version of veth will replace CHECKSUM_NONE with
CHECKSUM_UNNECESSARY, which causes corrupt packets routed from hardware to
a veth device to be delivered to the application. This caused applications
at Twitter to receive corrupt data when network hardware was corrupting
packets.

We believe this was added as an optimization to skip computing and
verifying checksums for communication between containers. However, locally
generated packets have ip_summed == CHECKSUM_PARTIAL, so the code as
written does nothing for them. As far as we can tell, after removing this
code, these packets are transmitted from one stack to another unmodified
(tcpdump shows invalid checksums on both sides, as expected), and they are
delivered correctly to applications. We didn’t test every possible network
configuration, but we tried a few common ones such as bridging containers,
using NAT between the host and a container, and routing from hardware
devices to containers. We have effectively deployed this in production at
Twitter (by disabling RX checksum offloading on veth devices).

This code dates back to the first version of the driver, commit
 ("[NET]: Virtual ethernet device driver"), so I
suspect this bug occurred mostly because the driver API has evolved
significantly since then. Commit <0b7967503dc97864f283a> ("net/veth: Fix
packet checksumming") (in December 2010) fixed this for packets that get
created locally and sent to hardware devices, by not changing
CHECKSUM_PARTIAL. However, the same issue still occurs for packets coming
in from hardware devices.

Co-authored-by: Evan Jones <e...@evanjones.ca>
Signed-off-by: Evan Jones <e...@evanjones.ca>
Cc: Nicolas Dichtel <nicolas.dich...@6wind.com>
Cc: Phil Sutter <p...@nwl.cc>
Cc: Toshiaki Makita <makita.toshi...@lab.ntt.co.jp>
Cc: netdev@vger.kernel.org
Cc: linux-ker...@vger.kernel.org
Signed-off-by: Vijay Pandurangan <vij...@vijayp.ca>
---
 drivers/net/veth.c | 6 --
 1 file changed, 6 deletions(-)

diff --git a/drivers/net/veth.c b/drivers/net/veth.c
index 0ef4a5a..ba21d07 100644
--- a/drivers/net/veth.c
+++ b/drivers/net/veth.c
@@ -117,12 +117,6 @@ static netdev_tx_t veth_xmit(struct sk_buff *skb, struct 
net_device *dev)
kfree_skb(skb);
goto drop;
}
-   /* don't change ip_summed == CHECKSUM_PARTIAL, as that
-* will cause bad checksum on forwarded packets
-*/
-   if (skb->ip_summed == CHECKSUM_NONE &&
-   rcv->features & NETIF_F_RXCSUM)
-   skb->ip_summed = CHECKSUM_UNNECESSARY;
 
if (likely(dev_forward_skb(rcv, skb) == NET_RX_SUCCESS)) {
struct pcpu_vstats *stats = this_cpu_ptr(dev->vstats);
-- 
2.5.0

--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] veth: don't modify ip-summed; doing so treats packets with bad checksums as good.

2015-12-18 Thread Vijay Pandurangan
Evan and I have demonstrated this bug on Kubernetes as well, so it's
not just a problem in Mesos. (See
https://github.com/kubernetes/kubernetes/issues/18898)

Sorry about my email client, I've re-sent the patch in another thread
from git-email as I should have initially.

I'll read through the TX path again to see if we missed something, but
I'd love input from anyone else!

--

Vijay Pandurangan
https://www.twitter.com/vijayp
http://www.vijayp.ca


On Fri, Dec 18, 2015 at 2:00 PM, Cong Wang <xiyou.wangc...@gmail.com> wrote:
> (Cc'ing Eric B and Tom)
>
> On Fri, Dec 18, 2015 at 9:54 AM, Vijay Pandurangan <vij...@vijayp.ca> wrote:
>> Packets that arrive from real hardware devices have ip_summed ==
>> CHECKSUM_UNNECESSARY if the hardware verified the checksums, or
>> CHECKSUM_NONE if the packet is bad or it was unable to verify it. The
>> current version of veth will replace CHECKSUM_NONE with
>> CHECKSUM_UNNECESSARY, which causes corrupt packets routed from hardware to
>> a veth device to be delivered to the application. This caused applications
>> at Twitter to receive corrupt data when network hardware was corrupting
>> packets.
>
> Yeah, https://reviews.apache.org/r/41158/.
>
> This is because normally packets to a veth device are _only_ from its pair
> device, Mesos network isolator redirects packets from a hardware interface
> to veth, which violates this expectation. This is also why no one else sees
> this bug. ;)
>
>>
>> We believe this was added as an optimization to skip computing and
>> verifying checksums for communication between containers. However, locally
>> generated packets have ip_summed == CHECKSUM_PARTIAL, so the code as
>> written does nothing for them. As far as we can tell, after removing this
>> code, these packets are transmitted from one stack to another unmodified
>> (tcpdump shows invalid checksums on both sides, as expected), and they are
>> delivered correctly to applications. We didn’t test every possible network
>> configuration, but we tried a few common ones such as bridging containers,
>> using NAT between the host and a container, and routing from hardware
>> devices to containers. We have effectively deployed this in production at
>> Twitter (by disabling RX checksum offloading on veth devices).
>
>
> I am wondering if there is any other CHECKSUM_NONE case in the tx
> path we could miss here. Mesos case is too special not only because
> it redirects packets from hardware to veth, but also because it moves
> packets from RX path to TX path.
>
> Eric? Tom?
>
>>
>> This code dates back to the first version of the driver, commit
>>  ("[NET]: Virtual ethernet device driver"), so I
>> suspect this bug occurred mostly because the driver API has evolved
>> significantly since then. Commit <0b7967503dc97864f283a> ("net/veth: Fix
>> packet checksumming") (in December 2010) fixed this for packets that get
>> created locally and sent to hardware devices, by not changing
>> CHECKSUM_PARTIAL. However, the same issue still occurs for packets coming
>> in from hardware devices.
>>
>> Co-authored-by: Evan Jones <e...@evanjones.ca>
>> Signed-off-by: Evan Jones <e...@evanjones.ca>
>> Cc: Nicolas Dichtel <nicolas.dich...@6wind.com>
>> Cc: Phil Sutter <p...@nwl.cc>
>> Cc: Toshiaki Makita <makita.toshi...@lab.ntt.co.jp>
>> Cc: netdev@vger.kernel.org
>> Cc: linux-ker...@vger.kernel.org
>> Signed-off-by: Vijay Pandurangan <vij...@vijayp.ca>
>
> Your patch looks good to me but your email client corrupts your patch,
> so please resend.
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html