Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-07-08 Thread Roland Dreier
On Fri, Jul 8, 2016 at 9:51 AM, Jason Gunthorpe
 wrote:
> So, it appears, the dst and neigh can be used for all performances cases.
>
> For the non performance dst == null case, can we just burn cycles and
> stuff the daddr in front of the packet at hardheader time, even if we
> have to copy?

OK, sounds interesting.

Unfortunately the scope of this work has gotten to the point where I
can't take it on right now.  My system is running 4.4.y for now
(before struct skb_gso_cb grew) so I think shrinking struct skb_gso_cb
to 8 bytes plus changing SKB_SGO_CB_OFFSET to 20 will work for now.
Hope someone is able to come up with a real fix before I need to
upgrade to 4.10.y...

 - R.


Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-07-08 Thread Roland Dreier
On Fri, Jul 8, 2016 at 9:51 AM, Jason Gunthorpe
 wrote:
> So, it appears, the dst and neigh can be used for all performances cases.
>
> For the non performance dst == null case, can we just burn cycles and
> stuff the daddr in front of the packet at hardheader time, even if we
> have to copy?

OK, sounds interesting.

Unfortunately the scope of this work has gotten to the point where I
can't take it on right now.  My system is running 4.4.y for now
(before struct skb_gso_cb grew) so I think shrinking struct skb_gso_cb
to 8 bytes plus changing SKB_SGO_CB_OFFSET to 20 will work for now.
Hope someone is able to come up with a real fix before I need to
upgrade to 4.10.y...

 - R.


Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-07-08 Thread Jason Gunthorpe
On Fri, Jul 08, 2016 at 07:18:11AM -0700, Roland Dreier wrote:
> On Thu, Jul 7, 2016 at 4:14 PM, Jason Gunthorpe
>  wrote:
> > We have neighbour_priv, and ndo_neigh_construct/destruct now ..
> >
> > A first blush that would seem to be enough to let ipoib store the AH
> > and other path information in the neigh and avoid the cb? At least the
> > example in clip sure looks like what ipoib needs to do.
> 
> Do you think those new facilities let us go back to using the neigh
> and still avoid the issues that led to commit b63b70d87741 ("IPoIB:
> Use a private hash table for path lookup in xmit path")?

Well, the priv stuff were brought up in the discussion around
b63b70d87741 but never fully analyzed. Maybe it could have been used
to solve that problem, who knows.. I guess it doesn't help this exact
issue because we don't have a dst at hard header time anyhow.

But, DaveM suggested how to handle our current problem in the above thread:

http://marc.info/?l=linux-rdma=132813323907877=2

Which is the same route CLIP took:

331 struct dst_entry *dst = skb_dst(skb);
347 rt = (struct rtable *) dst;
348 if (rt->rt_gateway)
349 daddr = >rt_gateway;
350 else
351 daddr = _hdr(skb)->daddr;
352 n = dst_neigh_lookup(dst, daddr);

(DaveM said it should be /ipv6_hdr(skb)->daddr, not the rtable cast)

Last time this was brought up you were concerned about ARP, ARP
sets skb_dst after calling dev_hard_header:

310 skb = arp_create(type, ptype, dest_ip, dev, src_ip,
311  dest_hw, src_hw, target_hw);
312 if (!skb)
313 return;
314
315 skb_dst_set(skb, dst_clone(dst));

However, there is at least one fringe case (arp_send) where the dst is
left NULL. Presumably there are other fringe cases too..

So, it appears, the dst and neigh can be used for all performances cases.

For the non performance dst == null case, can we just burn cycles and
stuff the daddr in front of the packet at hardheader time, even if we
have to copy?

Jason


Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-07-08 Thread Jason Gunthorpe
On Fri, Jul 08, 2016 at 07:18:11AM -0700, Roland Dreier wrote:
> On Thu, Jul 7, 2016 at 4:14 PM, Jason Gunthorpe
>  wrote:
> > We have neighbour_priv, and ndo_neigh_construct/destruct now ..
> >
> > A first blush that would seem to be enough to let ipoib store the AH
> > and other path information in the neigh and avoid the cb? At least the
> > example in clip sure looks like what ipoib needs to do.
> 
> Do you think those new facilities let us go back to using the neigh
> and still avoid the issues that led to commit b63b70d87741 ("IPoIB:
> Use a private hash table for path lookup in xmit path")?

Well, the priv stuff were brought up in the discussion around
b63b70d87741 but never fully analyzed. Maybe it could have been used
to solve that problem, who knows.. I guess it doesn't help this exact
issue because we don't have a dst at hard header time anyhow.

But, DaveM suggested how to handle our current problem in the above thread:

http://marc.info/?l=linux-rdma=132813323907877=2

Which is the same route CLIP took:

331 struct dst_entry *dst = skb_dst(skb);
347 rt = (struct rtable *) dst;
348 if (rt->rt_gateway)
349 daddr = >rt_gateway;
350 else
351 daddr = _hdr(skb)->daddr;
352 n = dst_neigh_lookup(dst, daddr);

(DaveM said it should be /ipv6_hdr(skb)->daddr, not the rtable cast)

Last time this was brought up you were concerned about ARP, ARP
sets skb_dst after calling dev_hard_header:

310 skb = arp_create(type, ptype, dest_ip, dev, src_ip,
311  dest_hw, src_hw, target_hw);
312 if (!skb)
313 return;
314
315 skb_dst_set(skb, dst_clone(dst));

However, there is at least one fringe case (arp_send) where the dst is
left NULL. Presumably there are other fringe cases too..

So, it appears, the dst and neigh can be used for all performances cases.

For the non performance dst == null case, can we just burn cycles and
stuff the daddr in front of the packet at hardheader time, even if we
have to copy?

Jason


Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-07-08 Thread Roland Dreier
On Thu, Jul 7, 2016 at 4:14 PM, Jason Gunthorpe
 wrote:
> We have neighbour_priv, and ndo_neigh_construct/destruct now ..
>
> A first blush that would seem to be enough to let ipoib store the AH
> and other path information in the neigh and avoid the cb? At least the
> example in clip sure looks like what ipoib needs to do.

Do you think those new facilities let us go back to using the neigh
and still avoid the issues that led to commit b63b70d87741 ("IPoIB:
Use a private hash table for path lookup in xmit path")?

 - R.


Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-07-08 Thread Roland Dreier
On Thu, Jul 7, 2016 at 4:14 PM, Jason Gunthorpe
 wrote:
> We have neighbour_priv, and ndo_neigh_construct/destruct now ..
>
> A first blush that would seem to be enough to let ipoib store the AH
> and other path information in the neigh and avoid the cb? At least the
> example in clip sure looks like what ipoib needs to do.

Do you think those new facilities let us go back to using the neigh
and still avoid the issues that led to commit b63b70d87741 ("IPoIB:
Use a private hash table for path lookup in xmit path")?

 - R.


Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-07-07 Thread Jason Gunthorpe
On Thu, Jul 07, 2016 at 03:01:40PM -0700, Roland Dreier wrote:

> The reason we moved to the cb storage is that in the past, trying to
> hide some data in the actual skb buffer that we don't actually send

We have neighbour_priv, and ndo_neigh_construct/destruct now ..

A first blush that would seem to be enough to let ipoib store the AH
and other path information in the neigh and avoid the cb? At least the
example in clip sure looks like what ipoib needs to do.

Jason


Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-07-07 Thread Jason Gunthorpe
On Thu, Jul 07, 2016 at 03:01:40PM -0700, Roland Dreier wrote:

> The reason we moved to the cb storage is that in the past, trying to
> hide some data in the actual skb buffer that we don't actually send

We have neighbour_priv, and ndo_neigh_construct/destruct now ..

A first blush that would seem to be enough to let ipoib store the AH
and other path information in the neigh and avoid the cb? At least the
example in clip sure looks like what ipoib needs to do.

Jason


Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-07-07 Thread Alexander Duyck
On Thu, Jul 7, 2016 at 3:01 PM, Roland Dreier  wrote:
>>> struct skb_gso_cb {
>>> int mac_offset;
>>> int encap_level;
>>> __u16   csum_start;
>>> };
>
>> This is based on an out-dated version of this struct.  The 4.7 RC
>> kernel has a few more fields that were added to support local checksum
>> offload for encapsulated frames.
>
> Thanks for pointing that out.  I hit the perf regression on 4.4.y
> (stable) and looked at the struct there.  I see that latest upstream
> has changed, and I agree that this struct really can't shrink below 10
> bytes.
>
> Since IP needs 20 bytes, GSO needs 10 bytes and IPoIB needs 20 bytes,
> we're 2 bytes over the 48 that are available in cb[].  So this is
> harder to fix than just changing skb_gso_cb and SKB_SGO_CB_OFFSET
> unfortunately.
>
>>> What is the best way to keep the crash fix but not kill IPoIB performance?
>>
>> It seems like what would probably need to happen is to move where the
>> IPoIB address is stored.  I'm not sure the control buffer is really
>> the best place for it since the cb gets overwritten at various levels,
>> and storing 20 bytes makes it hard to avoid bumping up against the
>> size restrictions of the buffer.  Seeing as how the IPoIB hwaddr is
>> generated around the same time we generate the L2 header for the
>> frame, I wonder if you couldn't get away with using a bit of extra skb
>> headroom to store it and then use a offset from the MAC header to
>> access it.  An added bonus would be that with a few tricks with
>> SKB_GSO_CB(skb)->mac_offset you might even be able to set things up so
>> that you copy the hwaddr when you copy the header for each fragment
>> instead of having to go and copy the hwaddr out of the cb and clone it
>> for each frame.
>
> Can we assume there are 20 bytes of skb headroom available?  What if
> we're forwarding an skb received on an Ethernet device?

The space should be there since a standard Ethernet device should be
reserving about 64B for headroom for Rx traffic assuming it is using
something like napi_alloc_skb.  I'm not sure how much you would need
for Infiniband headers and such, but I know the 20B for the address
would at least be there.

> The reason we moved to the cb storage is that in the past, trying to
> hide some data in the actual skb buffer that we don't actually send
> led to some awkward-at-best code.  (As I recall GRO was difficult to
> handle before commit 936d7de3d736 "IPoIB: Stop lying about
> hard_header_len and use skb->cb to stash LL addresses")  But maybe
> there's a third way to handle this other than the old way and the
> skb->cb way.

Well the description for that patch seems to indicate the problem was
the pseudo header length being included in the hard_header_len.  It
seems like other drivers include the length of such headers in
needed_headroom, although usually those types of headers don't get
added until after the devices ndo_start_xmit is called so I am not
sure if there would be any issues with trying to use needed_headroom
to indicate space needed for a pseudo header added in the header
creation call.

- Alex


Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-07-07 Thread Alexander Duyck
On Thu, Jul 7, 2016 at 3:01 PM, Roland Dreier  wrote:
>>> struct skb_gso_cb {
>>> int mac_offset;
>>> int encap_level;
>>> __u16   csum_start;
>>> };
>
>> This is based on an out-dated version of this struct.  The 4.7 RC
>> kernel has a few more fields that were added to support local checksum
>> offload for encapsulated frames.
>
> Thanks for pointing that out.  I hit the perf regression on 4.4.y
> (stable) and looked at the struct there.  I see that latest upstream
> has changed, and I agree that this struct really can't shrink below 10
> bytes.
>
> Since IP needs 20 bytes, GSO needs 10 bytes and IPoIB needs 20 bytes,
> we're 2 bytes over the 48 that are available in cb[].  So this is
> harder to fix than just changing skb_gso_cb and SKB_SGO_CB_OFFSET
> unfortunately.
>
>>> What is the best way to keep the crash fix but not kill IPoIB performance?
>>
>> It seems like what would probably need to happen is to move where the
>> IPoIB address is stored.  I'm not sure the control buffer is really
>> the best place for it since the cb gets overwritten at various levels,
>> and storing 20 bytes makes it hard to avoid bumping up against the
>> size restrictions of the buffer.  Seeing as how the IPoIB hwaddr is
>> generated around the same time we generate the L2 header for the
>> frame, I wonder if you couldn't get away with using a bit of extra skb
>> headroom to store it and then use a offset from the MAC header to
>> access it.  An added bonus would be that with a few tricks with
>> SKB_GSO_CB(skb)->mac_offset you might even be able to set things up so
>> that you copy the hwaddr when you copy the header for each fragment
>> instead of having to go and copy the hwaddr out of the cb and clone it
>> for each frame.
>
> Can we assume there are 20 bytes of skb headroom available?  What if
> we're forwarding an skb received on an Ethernet device?

The space should be there since a standard Ethernet device should be
reserving about 64B for headroom for Rx traffic assuming it is using
something like napi_alloc_skb.  I'm not sure how much you would need
for Infiniband headers and such, but I know the 20B for the address
would at least be there.

> The reason we moved to the cb storage is that in the past, trying to
> hide some data in the actual skb buffer that we don't actually send
> led to some awkward-at-best code.  (As I recall GRO was difficult to
> handle before commit 936d7de3d736 "IPoIB: Stop lying about
> hard_header_len and use skb->cb to stash LL addresses")  But maybe
> there's a third way to handle this other than the old way and the
> skb->cb way.

Well the description for that patch seems to indicate the problem was
the pseudo header length being included in the hard_header_len.  It
seems like other drivers include the length of such headers in
needed_headroom, although usually those types of headers don't get
added until after the devices ndo_start_xmit is called so I am not
sure if there would be any issues with trying to use needed_headroom
to indicate space needed for a pseudo header added in the header
creation call.

- Alex


Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-07-07 Thread Roland Dreier
>> struct skb_gso_cb {
>> int mac_offset;
>> int encap_level;
>> __u16   csum_start;
>> };

> This is based on an out-dated version of this struct.  The 4.7 RC
> kernel has a few more fields that were added to support local checksum
> offload for encapsulated frames.

Thanks for pointing that out.  I hit the perf regression on 4.4.y
(stable) and looked at the struct there.  I see that latest upstream
has changed, and I agree that this struct really can't shrink below 10
bytes.

Since IP needs 20 bytes, GSO needs 10 bytes and IPoIB needs 20 bytes,
we're 2 bytes over the 48 that are available in cb[].  So this is
harder to fix than just changing skb_gso_cb and SKB_SGO_CB_OFFSET
unfortunately.

>> What is the best way to keep the crash fix but not kill IPoIB performance?
>
> It seems like what would probably need to happen is to move where the
> IPoIB address is stored.  I'm not sure the control buffer is really
> the best place for it since the cb gets overwritten at various levels,
> and storing 20 bytes makes it hard to avoid bumping up against the
> size restrictions of the buffer.  Seeing as how the IPoIB hwaddr is
> generated around the same time we generate the L2 header for the
> frame, I wonder if you couldn't get away with using a bit of extra skb
> headroom to store it and then use a offset from the MAC header to
> access it.  An added bonus would be that with a few tricks with
> SKB_GSO_CB(skb)->mac_offset you might even be able to set things up so
> that you copy the hwaddr when you copy the header for each fragment
> instead of having to go and copy the hwaddr out of the cb and clone it
> for each frame.

Can we assume there are 20 bytes of skb headroom available?  What if
we're forwarding an skb received on an Ethernet device?

The reason we moved to the cb storage is that in the past, trying to
hide some data in the actual skb buffer that we don't actually send
led to some awkward-at-best code.  (As I recall GRO was difficult to
handle before commit 936d7de3d736 "IPoIB: Stop lying about
hard_header_len and use skb->cb to stash LL addresses")  But maybe
there's a third way to handle this other than the old way and the
skb->cb way.

 - R.


Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-07-07 Thread Roland Dreier
>> struct skb_gso_cb {
>> int mac_offset;
>> int encap_level;
>> __u16   csum_start;
>> };

> This is based on an out-dated version of this struct.  The 4.7 RC
> kernel has a few more fields that were added to support local checksum
> offload for encapsulated frames.

Thanks for pointing that out.  I hit the perf regression on 4.4.y
(stable) and looked at the struct there.  I see that latest upstream
has changed, and I agree that this struct really can't shrink below 10
bytes.

Since IP needs 20 bytes, GSO needs 10 bytes and IPoIB needs 20 bytes,
we're 2 bytes over the 48 that are available in cb[].  So this is
harder to fix than just changing skb_gso_cb and SKB_SGO_CB_OFFSET
unfortunately.

>> What is the best way to keep the crash fix but not kill IPoIB performance?
>
> It seems like what would probably need to happen is to move where the
> IPoIB address is stored.  I'm not sure the control buffer is really
> the best place for it since the cb gets overwritten at various levels,
> and storing 20 bytes makes it hard to avoid bumping up against the
> size restrictions of the buffer.  Seeing as how the IPoIB hwaddr is
> generated around the same time we generate the L2 header for the
> frame, I wonder if you couldn't get away with using a bit of extra skb
> headroom to store it and then use a offset from the MAC header to
> access it.  An added bonus would be that with a few tricks with
> SKB_GSO_CB(skb)->mac_offset you might even be able to set things up so
> that you copy the hwaddr when you copy the header for each fragment
> instead of having to go and copy the hwaddr out of the cb and clone it
> for each frame.

Can we assume there are 20 bytes of skb headroom available?  What if
we're forwarding an skb received on an Ethernet device?

The reason we moved to the cb storage is that in the past, trying to
hide some data in the actual skb buffer that we don't actually send
led to some awkward-at-best code.  (As I recall GRO was difficult to
handle before commit 936d7de3d736 "IPoIB: Stop lying about
hard_header_len and use skb->cb to stash LL addresses")  But maybe
there's a third way to handle this other than the old way and the
skb->cb way.

 - R.


Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-07-07 Thread Alexander Duyck
On Wed, Jul 6, 2016 at 11:25 PM, Roland Dreier  wrote:
> On Thu, Jan 7, 2016 at 3:00 AM, Konstantin Khlebnikov  
> wrote:
>> Or just shift GSO CB and add couple checks like
>> BUILD_BUG_ON(sizeof(SKB_GSO_CB(skb)->room) < sizeof(*IPCB(skb)));
>
> Resurrecting this old thread, because the patch that ultimately went
> upstream (commit 9207f9d45b0a / net: preserve IP control block during
> GSO segmentation) causes a huge IPoIB performance regression (to the
> point of being unusable):
> https://bugzilla.kernel.org/show_bug.cgi?id=111921
>
> I don't think anyone has explained what goes wrong or why IPoIB works
> the way it does.  The underlying difference that IPoIB has from other
> drivers is that there are two levels of address resolution.  First,
> normal ARP / ND resolves an IP address to a "hardware" address.  The
> difference is that in IPoIB, the hardware address is an IB GID (plus a
> QPN, but we can ignore that).  To actually send data to that GID, the
> IPoIB driver has to do a second lookup - it needs to ask the IB subnet
> manager for a path record that tells it how to reach that GID.
>
> In particular this means that "destination address" (as the IP / ARP
> layer understands it) actually isn't in the packet anywhere - there's
> nothing like an ethernet header as there is for "normal" network
> drivers.  Instead, the driver stashes the address in skb->cb during
> hard_header_ops->create() and then looks at it in the xmit routine -
> this was designed way back around when commit a0417fa3a18a / net: Make
> qdisc_skb_cb upper size bound explicit. was merged.  The expectation
> was that the part of the cb after sizeof (struct qdisc_skb_cb) would
> be preserved.
>
> The problem with commit 9207f9d45b0a is that GSO operations now access
> cb after SKB_SGO_CB_OFFSET==32, which lands right in the middle of
> where IPoIB stashes its hwaddr.

Really I don't know if the problem is GSO so much as the desire to
maintain the control buffer between layers.  Really this kind of
violates how we have documented the control buffer in skbuff.h.

> It seems that the intent of the commit is to preserve the IP control
> block - struct inet_skb_parm (and presumably struct inet6_skb_parm) -
> even when using SKB_GSO_CB().  Seems like both inet_skb_parm and
> inet6_skb_parm are 20 bytes.  IPoIB uses the part of cb after 28
> bytes, so if we could squeeze struct skb_gso_cb down to 8 bytes and
> set SKB_SGO_CB_OFFSET to 20, then everything would work.  The struct
> is
>
> struct skb_gso_cb {
> int mac_offset;
> int encap_level;
> __u16   csum_start;
> };

This is based on an out-dated version of this struct.  The 4.7 RC
kernel has a few more fields that were added to support local checksum
offload for encapsulated frames.

> is it feasible to make encap_level a __u16 (which would make the
> overall struct exactly 8 bytes)?  If I understand this correctly, 64K
> nested encapsulations seems like quite a bit for a packet...

The smallest we could probably get this structure at this point would
be around 10 bytes assuming we could make the offsets and encap level
only be a 16 bit field.  The added size is due to a wsum that was
added to keep a running checksum so we could keep csum_offset and
csum_start fields while generating a running checksum for the frame in
GSO.

> Or, earlier in this thread, having the GSO in ip_output and other gso
> paths save and restore the IP/IP6 control block was suggested as an
> alternate approach.  I don't know if there are performance
> implications to that.

Copying a block of data back and forth will always come at a cost.  In
addition, I don't see why we need to be moving the IP/IP6 control
block around.  If the IPoIB hwaddr is the only concern why couldn't it
be saved/restored instead?

> What is the best way to keep the crash fix but not kill IPoIB performance?

It seems like what would probably need to happen is to move where the
IPoIB address is stored.  I'm not sure the control buffer is really
the best place for it since the cb gets overwritten at various levels,
and storing 20 bytes makes it hard to avoid bumping up against the
size restrictions of the buffer.  Seeing as how the IPoIB hwaddr is
generated around the same time we generate the L2 header for the
frame, I wonder if you couldn't get away with using a bit of extra skb
headroom to store it and then use a offset from the MAC header to
access it.  An added bonus would be that with a few tricks with
SKB_GSO_CB(skb)->mac_offset you might even be able to set things up so
that you copy the hwaddr when you copy the header for each fragment
instead of having to go and copy the hwaddr out of the cb and clone it
for each frame.

Anyway I don't have that strong an understanding on IPoIB, but thought
I would add my $.02 since I noticed you were referencing an outdated
skb_gso_cb structure.

- Alex


Re: Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-07-07 Thread Alexander Duyck
On Wed, Jul 6, 2016 at 11:25 PM, Roland Dreier  wrote:
> On Thu, Jan 7, 2016 at 3:00 AM, Konstantin Khlebnikov  
> wrote:
>> Or just shift GSO CB and add couple checks like
>> BUILD_BUG_ON(sizeof(SKB_GSO_CB(skb)->room) < sizeof(*IPCB(skb)));
>
> Resurrecting this old thread, because the patch that ultimately went
> upstream (commit 9207f9d45b0a / net: preserve IP control block during
> GSO segmentation) causes a huge IPoIB performance regression (to the
> point of being unusable):
> https://bugzilla.kernel.org/show_bug.cgi?id=111921
>
> I don't think anyone has explained what goes wrong or why IPoIB works
> the way it does.  The underlying difference that IPoIB has from other
> drivers is that there are two levels of address resolution.  First,
> normal ARP / ND resolves an IP address to a "hardware" address.  The
> difference is that in IPoIB, the hardware address is an IB GID (plus a
> QPN, but we can ignore that).  To actually send data to that GID, the
> IPoIB driver has to do a second lookup - it needs to ask the IB subnet
> manager for a path record that tells it how to reach that GID.
>
> In particular this means that "destination address" (as the IP / ARP
> layer understands it) actually isn't in the packet anywhere - there's
> nothing like an ethernet header as there is for "normal" network
> drivers.  Instead, the driver stashes the address in skb->cb during
> hard_header_ops->create() and then looks at it in the xmit routine -
> this was designed way back around when commit a0417fa3a18a / net: Make
> qdisc_skb_cb upper size bound explicit. was merged.  The expectation
> was that the part of the cb after sizeof (struct qdisc_skb_cb) would
> be preserved.
>
> The problem with commit 9207f9d45b0a is that GSO operations now access
> cb after SKB_SGO_CB_OFFSET==32, which lands right in the middle of
> where IPoIB stashes its hwaddr.

Really I don't know if the problem is GSO so much as the desire to
maintain the control buffer between layers.  Really this kind of
violates how we have documented the control buffer in skbuff.h.

> It seems that the intent of the commit is to preserve the IP control
> block - struct inet_skb_parm (and presumably struct inet6_skb_parm) -
> even when using SKB_GSO_CB().  Seems like both inet_skb_parm and
> inet6_skb_parm are 20 bytes.  IPoIB uses the part of cb after 28
> bytes, so if we could squeeze struct skb_gso_cb down to 8 bytes and
> set SKB_SGO_CB_OFFSET to 20, then everything would work.  The struct
> is
>
> struct skb_gso_cb {
> int mac_offset;
> int encap_level;
> __u16   csum_start;
> };

This is based on an out-dated version of this struct.  The 4.7 RC
kernel has a few more fields that were added to support local checksum
offload for encapsulated frames.

> is it feasible to make encap_level a __u16 (which would make the
> overall struct exactly 8 bytes)?  If I understand this correctly, 64K
> nested encapsulations seems like quite a bit for a packet...

The smallest we could probably get this structure at this point would
be around 10 bytes assuming we could make the offsets and encap level
only be a 16 bit field.  The added size is due to a wsum that was
added to keep a running checksum so we could keep csum_offset and
csum_start fields while generating a running checksum for the frame in
GSO.

> Or, earlier in this thread, having the GSO in ip_output and other gso
> paths save and restore the IP/IP6 control block was suggested as an
> alternate approach.  I don't know if there are performance
> implications to that.

Copying a block of data back and forth will always come at a cost.  In
addition, I don't see why we need to be moving the IP/IP6 control
block around.  If the IPoIB hwaddr is the only concern why couldn't it
be saved/restored instead?

> What is the best way to keep the crash fix but not kill IPoIB performance?

It seems like what would probably need to happen is to move where the
IPoIB address is stored.  I'm not sure the control buffer is really
the best place for it since the cb gets overwritten at various levels,
and storing 20 bytes makes it hard to avoid bumping up against the
size restrictions of the buffer.  Seeing as how the IPoIB hwaddr is
generated around the same time we generate the L2 header for the
frame, I wonder if you couldn't get away with using a bit of extra skb
headroom to store it and then use a offset from the MAC header to
access it.  An added bonus would be that with a few tricks with
SKB_GSO_CB(skb)->mac_offset you might even be able to set things up so
that you copy the hwaddr when you copy the header for each fragment
instead of having to go and copy the hwaddr out of the cb and clone it
for each frame.

Anyway I don't have that strong an understanding on IPoIB, but thought
I would add my $.02 since I noticed you were referencing an outdated
skb_gso_cb structure.

- Alex


Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-07-07 Thread Roland Dreier
On Thu, Jan 7, 2016 at 3:00 AM, Konstantin Khlebnikov  wrote:
> Or just shift GSO CB and add couple checks like
> BUILD_BUG_ON(sizeof(SKB_GSO_CB(skb)->room) < sizeof(*IPCB(skb)));

Resurrecting this old thread, because the patch that ultimately went
upstream (commit 9207f9d45b0a / net: preserve IP control block during
GSO segmentation) causes a huge IPoIB performance regression (to the
point of being unusable):
https://bugzilla.kernel.org/show_bug.cgi?id=111921

I don't think anyone has explained what goes wrong or why IPoIB works
the way it does.  The underlying difference that IPoIB has from other
drivers is that there are two levels of address resolution.  First,
normal ARP / ND resolves an IP address to a "hardware" address.  The
difference is that in IPoIB, the hardware address is an IB GID (plus a
QPN, but we can ignore that).  To actually send data to that GID, the
IPoIB driver has to do a second lookup - it needs to ask the IB subnet
manager for a path record that tells it how to reach that GID.

In particular this means that "destination address" (as the IP / ARP
layer understands it) actually isn't in the packet anywhere - there's
nothing like an ethernet header as there is for "normal" network
drivers.  Instead, the driver stashes the address in skb->cb during
hard_header_ops->create() and then looks at it in the xmit routine -
this was designed way back around when commit a0417fa3a18a / net: Make
qdisc_skb_cb upper size bound explicit. was merged.  The expectation
was that the part of the cb after sizeof (struct qdisc_skb_cb) would
be preserved.

The problem with commit 9207f9d45b0a is that GSO operations now access
cb after SKB_SGO_CB_OFFSET==32, which lands right in the middle of
where IPoIB stashes its hwaddr.

It seems that the intent of the commit is to preserve the IP control
block - struct inet_skb_parm (and presumably struct inet6_skb_parm) -
even when using SKB_GSO_CB().  Seems like both inet_skb_parm and
inet6_skb_parm are 20 bytes.  IPoIB uses the part of cb after 28
bytes, so if we could squeeze struct skb_gso_cb down to 8 bytes and
set SKB_SGO_CB_OFFSET to 20, then everything would work.  The struct
is

struct skb_gso_cb {
int mac_offset;
int encap_level;
__u16   csum_start;
};

is it feasible to make encap_level a __u16 (which would make the
overall struct exactly 8 bytes)?  If I understand this correctly, 64K
nested encapsulations seems like quite a bit for a packet...

Or, earlier in this thread, having the GSO in ip_output and other gso
paths save and restore the IP/IP6 control block was suggested as an
alternate approach.  I don't know if there are performance
implications to that.

What is the best way to keep the crash fix but not kill IPoIB performance?

Thanks!
 - R.


Resurrecting due to huge ipoib perf regression - [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-07-07 Thread Roland Dreier
On Thu, Jan 7, 2016 at 3:00 AM, Konstantin Khlebnikov  wrote:
> Or just shift GSO CB and add couple checks like
> BUILD_BUG_ON(sizeof(SKB_GSO_CB(skb)->room) < sizeof(*IPCB(skb)));

Resurrecting this old thread, because the patch that ultimately went
upstream (commit 9207f9d45b0a / net: preserve IP control block during
GSO segmentation) causes a huge IPoIB performance regression (to the
point of being unusable):
https://bugzilla.kernel.org/show_bug.cgi?id=111921

I don't think anyone has explained what goes wrong or why IPoIB works
the way it does.  The underlying difference that IPoIB has from other
drivers is that there are two levels of address resolution.  First,
normal ARP / ND resolves an IP address to a "hardware" address.  The
difference is that in IPoIB, the hardware address is an IB GID (plus a
QPN, but we can ignore that).  To actually send data to that GID, the
IPoIB driver has to do a second lookup - it needs to ask the IB subnet
manager for a path record that tells it how to reach that GID.

In particular this means that "destination address" (as the IP / ARP
layer understands it) actually isn't in the packet anywhere - there's
nothing like an ethernet header as there is for "normal" network
drivers.  Instead, the driver stashes the address in skb->cb during
hard_header_ops->create() and then looks at it in the xmit routine -
this was designed way back around when commit a0417fa3a18a / net: Make
qdisc_skb_cb upper size bound explicit. was merged.  The expectation
was that the part of the cb after sizeof (struct qdisc_skb_cb) would
be preserved.

The problem with commit 9207f9d45b0a is that GSO operations now access
cb after SKB_SGO_CB_OFFSET==32, which lands right in the middle of
where IPoIB stashes its hwaddr.

It seems that the intent of the commit is to preserve the IP control
block - struct inet_skb_parm (and presumably struct inet6_skb_parm) -
even when using SKB_GSO_CB().  Seems like both inet_skb_parm and
inet6_skb_parm are 20 bytes.  IPoIB uses the part of cb after 28
bytes, so if we could squeeze struct skb_gso_cb down to 8 bytes and
set SKB_SGO_CB_OFFSET to 20, then everything would work.  The struct
is

struct skb_gso_cb {
int mac_offset;
int encap_level;
__u16   csum_start;
};

is it feasible to make encap_level a __u16 (which would make the
overall struct exactly 8 bytes)?  If I understand this correctly, 64K
nested encapsulations seems like quite a bit for a packet...

Or, earlier in this thread, having the GSO in ip_output and other gso
paths save and restore the IP/IP6 control block was suggested as an
alternate approach.  I don't know if there are performance
implications to that.

What is the best way to keep the crash fix but not kill IPoIB performance?

Thanks!
 - R.


Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-01-07 Thread Eric Dumazet
On Thu, Jan 7, 2016 at 7:04 AM, Konstantin Khlebnikov  wrote:
> On Thu, Jan 7, 2016 at 2:59 PM, Eric Dumazet  wrote:
>> On Thu, Jan 7, 2016 at 6:38 AM, Konstantin Khlebnikov  
>> wrote:
>>>
>>> Also I've found strange thing: reason of expanding skb->cb from 40 to
>>> 48 bypes in 2006
>>> 3e3850e989c5d2eb1aab6f0fd9257759f0f4cbc6 was that struct inet6_skb_parm does
>>> not fit. But it's is only 24 bytes. Does some arches add pad after
>>> each _u16 field?
>>
>> "struct inet6_skb_parm" is part of struct tcp_skb_cb
>>
>> This is why Patrick had to increase skb->cb[]
>
> Whoa. Funny. TCP moves that chunk back and forward instead of just
> putting it at the first place in struct.

You probably want to look at git history to find out why it is done this way.

TCP performance is critical for some of us, and doing such trick avoid
one cache miss per skb in some critical list traversals.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-01-07 Thread Konstantin Khlebnikov
On Thu, Jan 7, 2016 at 2:59 PM, Eric Dumazet  wrote:
> On Thu, Jan 7, 2016 at 6:38 AM, Konstantin Khlebnikov  
> wrote:
>>
>> Also I've found strange thing: reason of expanding skb->cb from 40 to
>> 48 bypes in 2006
>> 3e3850e989c5d2eb1aab6f0fd9257759f0f4cbc6 was that struct inet6_skb_parm does
>> not fit. But it's is only 24 bytes. Does some arches add pad after
>> each _u16 field?
>
> "struct inet6_skb_parm" is part of struct tcp_skb_cb
>
> This is why Patrick had to increase skb->cb[]

Whoa. Funny. TCP moves that chunk back and forward instead of just
putting it at the first place in struct.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-01-07 Thread Eric Dumazet
On Thu, Jan 7, 2016 at 6:38 AM, Konstantin Khlebnikov  wrote:
>
> Also I've found strange thing: reason of expanding skb->cb from 40 to
> 48 bypes in 2006
> 3e3850e989c5d2eb1aab6f0fd9257759f0f4cbc6 was that struct inet6_skb_parm does
> not fit. But it's is only 24 bytes. Does some arches add pad after
> each _u16 field?

"struct inet6_skb_parm" is part of struct tcp_skb_cb

This is why Patrick had to increase skb->cb[]
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-01-07 Thread Konstantin Khlebnikov
On Thu, Jan 7, 2016 at 2:00 PM, Konstantin Khlebnikov  wrote:
> On Thu, Jan 7, 2016 at 2:49 AM, Florian Westphal  wrote:
>> Florian Westphal  wrote:
>>> Thadeu Lima de Souza Cascardo  wrote:
>>> > On Wed, Jan 06, 2016 at 11:11:41PM +0300, Konstantin Khlebnikov wrote:
>>
>> [ skb_gso_segment uses skb->cb[], causes oops if ip_fragment is invoked
>>   on segmented skbs ]
>>
>>> > I have hit this as well, this fixes it for me on an older kernel. Can you 
>>> > try it
>>> > on latest kernel?
>>>
>>> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
>>> > index d8a1745..f44bc91 100644
>>> > --- a/net/ipv4/ip_output.c
>>> > +++ b/net/ipv4/ip_output.c
>>> > @@ -216,6 +216,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>>> > netdev_features_t features;
>>> > struct sk_buff *segs;
>>> > int ret = 0;
>>> > +   struct inet_skb_parm ipcb;
>>> >
>>> > if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>>> > return ip_finish_output2(skb);
>>> > @@ -227,6 +228,10 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>>> >  * 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
>>> >  * from host network stack.
>>> >  */
>>> > +   /* We need to save IPCB here because skb_gso_segment will use
>>> > +* SKB_GSO_CB.
>>> > +*/
>>> > +   ipcb = *IPCB(skb);
>>> > features = netif_skb_features(skb);
>>> > segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
>>> > if (IS_ERR_OR_NULL(segs)) {
>>> > @@ -241,6 +246,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>>> > int err;
>>> >
>>> > segs->next = NULL;
>>> > +   *IPCB(segs) = ipcb;
>>> > err = ip_fragment(segs, ip_finish_output2);
>>> >
>>> > if (err && ret == 0)
>>>
>>> I'm worried that this doesn't solve all cases. f.e. xfrm may also
>>> call skb_gso_segment(), and it will call into ipv4/ipv6 netfilter
>>> postrouting + ipv4 output functions...
>>>
>>> nfqnl_enqueue_packet() is also affected.
>>
>> ... but it seems that those three are the only affected callers
>> of skb_gso_segment (tbf is ok since skb isn't owned by anyone,
>> ovs does save/restore already).
>>
>> I think this patch is the right way, we just need similar
>> save/restore in nfqnl_enqueue_packet and xfrm_output_gso().
>
> Which CB could be here? at this point skb isn't owned by netlink yet.
>
>>
>> The latter two can be used by either ipv4 or ipv6 so it might
>> be preferable to just save/restore sizeof(struct skb_gso_cb);
>> or a union of inet_skb_parm+inet6_skb_parm.
>
> Or just shift GSO CB and add couple checks like
> BUILD_BUG_ON(sizeof(SKB_GSO_CB(skb)->room) < sizeof(*IPCB(skb)));

Somethin like this (in attachment)

Also I've found strange thing: reason of expanding skb->cb from 40 to
48 bypes in 2006
3e3850e989c5d2eb1aab6f0fd9257759f0f4cbc6 was that struct inet6_skb_parm does
not fit. But it's is only 24 bytes. Does some arches add pad after
each _u16 field?


net-preserve-lower-bytes-of-skb-cb-during-gso-segmentation
Description: Binary data


Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-01-07 Thread Konstantin Khlebnikov
On Thu, Jan 7, 2016 at 2:00 PM, Konstantin Khlebnikov  wrote:
> On Thu, Jan 7, 2016 at 2:49 AM, Florian Westphal  wrote:
>> Florian Westphal  wrote:
>>> Thadeu Lima de Souza Cascardo  wrote:
>>> > On Wed, Jan 06, 2016 at 11:11:41PM +0300, Konstantin Khlebnikov wrote:
>>
>> [ skb_gso_segment uses skb->cb[], causes oops if ip_fragment is invoked
>>   on segmented skbs ]
>>
>>> > I have hit this as well, this fixes it for me on an older kernel. Can you 
>>> > try it
>>> > on latest kernel?
>>>
>>> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
>>> > index d8a1745..f44bc91 100644
>>> > --- a/net/ipv4/ip_output.c
>>> > +++ b/net/ipv4/ip_output.c
>>> > @@ -216,6 +216,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>>> > netdev_features_t features;
>>> > struct sk_buff *segs;
>>> > int ret = 0;
>>> > +   struct inet_skb_parm ipcb;
>>> >
>>> > if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>>> > return ip_finish_output2(skb);
>>> > @@ -227,6 +228,10 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>>> >  * 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
>>> >  * from host network stack.
>>> >  */
>>> > +   /* We need to save IPCB here because skb_gso_segment will use
>>> > +* SKB_GSO_CB.
>>> > +*/
>>> > +   ipcb = *IPCB(skb);
>>> > features = netif_skb_features(skb);
>>> > segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
>>> > if (IS_ERR_OR_NULL(segs)) {
>>> > @@ -241,6 +246,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>>> > int err;
>>> >
>>> > segs->next = NULL;
>>> > +   *IPCB(segs) = ipcb;
>>> > err = ip_fragment(segs, ip_finish_output2);
>>> >
>>> > if (err && ret == 0)
>>>
>>> I'm worried that this doesn't solve all cases. f.e. xfrm may also
>>> call skb_gso_segment(), and it will call into ipv4/ipv6 netfilter
>>> postrouting + ipv4 output functions...
>>>
>>> nfqnl_enqueue_packet() is also affected.
>>
>> ... but it seems that those three are the only affected callers
>> of skb_gso_segment (tbf is ok since skb isn't owned by anyone,
>> ovs does save/restore already).
>>
>> I think this patch is the right way, we just need similar
>> save/restore in nfqnl_enqueue_packet and xfrm_output_gso().
>
> Which CB could be here? at this point skb isn't owned by netlink yet.
>
>>
>> The latter two can be used by either ipv4 or ipv6 so it might
>> be preferable to just save/restore sizeof(struct skb_gso_cb);
>> or a union of inet_skb_parm+inet6_skb_parm.
>
> Or just shift GSO CB and add couple checks like
> BUILD_BUG_ON(sizeof(SKB_GSO_CB(skb)->room) < sizeof(*IPCB(skb)));

Somethin like this (in attachment)

Also I've found strange thing: reason of expanding skb->cb from 40 to
48 bypes in 2006
3e3850e989c5d2eb1aab6f0fd9257759f0f4cbc6 was that struct inet6_skb_parm does
not fit. But it's is only 24 bytes. Does some arches add pad after
each _u16 field?


net-preserve-lower-bytes-of-skb-cb-during-gso-segmentation
Description: Binary data


Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-01-07 Thread Konstantin Khlebnikov
On Thu, Jan 7, 2016 at 2:59 PM, Eric Dumazet  wrote:
> On Thu, Jan 7, 2016 at 6:38 AM, Konstantin Khlebnikov  
> wrote:
>>
>> Also I've found strange thing: reason of expanding skb->cb from 40 to
>> 48 bypes in 2006
>> 3e3850e989c5d2eb1aab6f0fd9257759f0f4cbc6 was that struct inet6_skb_parm does
>> not fit. But it's is only 24 bytes. Does some arches add pad after
>> each _u16 field?
>
> "struct inet6_skb_parm" is part of struct tcp_skb_cb
>
> This is why Patrick had to increase skb->cb[]

Whoa. Funny. TCP moves that chunk back and forward instead of just
putting it at the first place in struct.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-01-07 Thread Eric Dumazet
On Thu, Jan 7, 2016 at 6:38 AM, Konstantin Khlebnikov  wrote:
>
> Also I've found strange thing: reason of expanding skb->cb from 40 to
> 48 bypes in 2006
> 3e3850e989c5d2eb1aab6f0fd9257759f0f4cbc6 was that struct inet6_skb_parm does
> not fit. But it's is only 24 bytes. Does some arches add pad after
> each _u16 field?

"struct inet6_skb_parm" is part of struct tcp_skb_cb

This is why Patrick had to increase skb->cb[]
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-01-07 Thread Eric Dumazet
On Thu, Jan 7, 2016 at 7:04 AM, Konstantin Khlebnikov  wrote:
> On Thu, Jan 7, 2016 at 2:59 PM, Eric Dumazet  wrote:
>> On Thu, Jan 7, 2016 at 6:38 AM, Konstantin Khlebnikov  
>> wrote:
>>>
>>> Also I've found strange thing: reason of expanding skb->cb from 40 to
>>> 48 bypes in 2006
>>> 3e3850e989c5d2eb1aab6f0fd9257759f0f4cbc6 was that struct inet6_skb_parm does
>>> not fit. But it's is only 24 bytes. Does some arches add pad after
>>> each _u16 field?
>>
>> "struct inet6_skb_parm" is part of struct tcp_skb_cb
>>
>> This is why Patrick had to increase skb->cb[]
>
> Whoa. Funny. TCP moves that chunk back and forward instead of just
> putting it at the first place in struct.

You probably want to look at git history to find out why it is done this way.

TCP performance is critical for some of us, and doing such trick avoid
one cache miss per skb in some critical list traversals.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-01-06 Thread Florian Westphal
Florian Westphal  wrote:
> Thadeu Lima de Souza Cascardo  wrote:
> > On Wed, Jan 06, 2016 at 11:11:41PM +0300, Konstantin Khlebnikov wrote:

[ skb_gso_segment uses skb->cb[], causes oops if ip_fragment is invoked
  on segmented skbs ]

> > I have hit this as well, this fixes it for me on an older kernel. Can you 
> > try it
> > on latest kernel?
> 
> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > index d8a1745..f44bc91 100644
> > --- a/net/ipv4/ip_output.c
> > +++ b/net/ipv4/ip_output.c
> > @@ -216,6 +216,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
> > netdev_features_t features;
> > struct sk_buff *segs;
> > int ret = 0;
> > +   struct inet_skb_parm ipcb;
> >  
> > if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
> > return ip_finish_output2(skb);
> > @@ -227,6 +228,10 @@ static int ip_finish_output_gso(struct sk_buff *skb)
> >  * 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
> >  * from host network stack.
> >  */
> > +   /* We need to save IPCB here because skb_gso_segment will use
> > +* SKB_GSO_CB.
> > +*/
> > +   ipcb = *IPCB(skb);
> > features = netif_skb_features(skb);
> > segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
> > if (IS_ERR_OR_NULL(segs)) {
> > @@ -241,6 +246,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
> > int err;
> >  
> > segs->next = NULL;
> > +   *IPCB(segs) = ipcb;
> > err = ip_fragment(segs, ip_finish_output2);
> >  
> > if (err && ret == 0)
> 
> I'm worried that this doesn't solve all cases. f.e. xfrm may also
> call skb_gso_segment(), and it will call into ipv4/ipv6 netfilter
> postrouting + ipv4 output functions...
> 
> nfqnl_enqueue_packet() is also affected.

... but it seems that those three are the only affected callers
of skb_gso_segment (tbf is ok since skb isn't owned by anyone,
ovs does save/restore already).

I think this patch is the right way, we just need similar
save/restore in nfqnl_enqueue_packet and xfrm_output_gso().

The latter two can be used by either ipv4 or ipv6 so it might
be preferable to just save/restore sizeof(struct skb_gso_cb);
or a union of inet_skb_parm+inet6_skb_parm.

Cascardo, can you cook a patch?

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-01-06 Thread Florian Westphal
Thadeu Lima de Souza Cascardo  wrote:
> On Wed, Jan 06, 2016 at 11:11:41PM +0300, Konstantin Khlebnikov wrote:
> > On Wed, Jan 6, 2016 at 10:59 PM, Cong Wang  wrote:
> > > On Wed, Jan 6, 2016 at 11:15 AM, Konstantin Khlebnikov  
> > > wrote:
> > >> Looks like this happens because ip_options_fragment() relies on
> > >> correct ip options length in ip control block in skb. But in
> > >> ip_finish_output_gso() control block in segments is reused by
> > >> skb_gso_segment(). following ip_fragment() sees some garbage.
> > >>
> > >> In my case there was no ip options but length becomes non-zero and
> > >> ip_options_fragment() picked some bytes from payload and decides to
> > >> fill huge range with IPOPT_NOOP (1). One of that ones flipped nr_frags
> > >> in skb_shared_info at the end of data =)
> > >>
> > >
> > > Hmm, it looks like SKB_GSO_CB should be cleared after skb_gso_segment()
> > > since all the gso information should be saved in shared_info after it 
> > > finishes.
> > >
> > > Does a memset(0) on SKB_GSO_CB after skb_gso_segment() work as well?
> > 
> > This will break present logic around ip_options_fragment() - it clears
> > options from
> > second and following fragments. With zeroed cb it will do nothing.
> > 
> > ip_options_fragment() can get required information directly from ip header 
> > but
> > it also resets fields in IPCB -- probably it should stay valid here
> > and somebody else will use it later.
[..]
> I have hit this as well, this fixes it for me on an older kernel. Can you try 
> it
> on latest kernel?

> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index d8a1745..f44bc91 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -216,6 +216,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>   netdev_features_t features;
>   struct sk_buff *segs;
>   int ret = 0;
> + struct inet_skb_parm ipcb;
>  
>   if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>   return ip_finish_output2(skb);
> @@ -227,6 +228,10 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>* 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
>* from host network stack.
>*/
> + /* We need to save IPCB here because skb_gso_segment will use
> +  * SKB_GSO_CB.
> +  */
> + ipcb = *IPCB(skb);
>   features = netif_skb_features(skb);
>   segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
>   if (IS_ERR_OR_NULL(segs)) {
> @@ -241,6 +246,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>   int err;
>  
>   segs->next = NULL;
> + *IPCB(segs) = ipcb;
>   err = ip_fragment(segs, ip_finish_output2);
>  
>   if (err && ret == 0)

I'm worried that this doesn't solve all cases. f.e. xfrm may also
call skb_gso_segment(), and it will call into ipv4/ipv6 netfilter
postrouting + ipv4 output functions...

nfqnl_enqueue_packet() is also affected.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-01-06 Thread Thadeu Lima de Souza Cascardo
On Wed, Jan 06, 2016 at 11:11:41PM +0300, Konstantin Khlebnikov wrote:
> On Wed, Jan 6, 2016 at 10:59 PM, Cong Wang  wrote:
> > On Wed, Jan 6, 2016 at 11:15 AM, Konstantin Khlebnikov  
> > wrote:
> >> Looks like this happens because ip_options_fragment() relies on
> >> correct ip options length in ip control block in skb. But in
> >> ip_finish_output_gso() control block in segments is reused by
> >> skb_gso_segment(). following ip_fragment() sees some garbage.
> >>
> >> In my case there was no ip options but length becomes non-zero and
> >> ip_options_fragment() picked some bytes from payload and decides to
> >> fill huge range with IPOPT_NOOP (1). One of that ones flipped nr_frags
> >> in skb_shared_info at the end of data =)
> >>
> >
> > Hmm, it looks like SKB_GSO_CB should be cleared after skb_gso_segment()
> > since all the gso information should be saved in shared_info after it 
> > finishes.
> >
> > Does a memset(0) on SKB_GSO_CB after skb_gso_segment() work as well?
> 
> This will break present logic around ip_options_fragment() - it clears
> options from
> second and following fragments. With zeroed cb it will do nothing.
> 
> ip_options_fragment() can get required information directly from ip header but
> it also resets fields in IPCB -- probably it should stay valid here
> and somebody else will use it later.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

I have hit this as well, this fixes it for me on an older kernel. Can you try it
on latest kernel?

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index d8a1745..f44bc91 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -216,6 +216,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
netdev_features_t features;
struct sk_buff *segs;
int ret = 0;
+   struct inet_skb_parm ipcb;
 
if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
return ip_finish_output2(skb);
@@ -227,6 +228,10 @@ static int ip_finish_output_gso(struct sk_buff *skb)
 * 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
 * from host network stack.
 */
+   /* We need to save IPCB here because skb_gso_segment will use
+* SKB_GSO_CB.
+*/
+   ipcb = *IPCB(skb);
features = netif_skb_features(skb);
segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
if (IS_ERR_OR_NULL(segs)) {
@@ -241,6 +246,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
int err;
 
segs->next = NULL;
+   *IPCB(segs) = ipcb;
err = ip_fragment(segs, ip_finish_output2);
 
if (err && ret == 0)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-01-06 Thread Konstantin Khlebnikov
On Wed, Jan 6, 2016 at 10:59 PM, Cong Wang  wrote:
> On Wed, Jan 6, 2016 at 11:15 AM, Konstantin Khlebnikov  
> wrote:
>> Looks like this happens because ip_options_fragment() relies on
>> correct ip options length in ip control block in skb. But in
>> ip_finish_output_gso() control block in segments is reused by
>> skb_gso_segment(). following ip_fragment() sees some garbage.
>>
>> In my case there was no ip options but length becomes non-zero and
>> ip_options_fragment() picked some bytes from payload and decides to
>> fill huge range with IPOPT_NOOP (1). One of that ones flipped nr_frags
>> in skb_shared_info at the end of data =)
>>
>
> Hmm, it looks like SKB_GSO_CB should be cleared after skb_gso_segment()
> since all the gso information should be saved in shared_info after it 
> finishes.
>
> Does a memset(0) on SKB_GSO_CB after skb_gso_segment() work as well?

This will break present logic around ip_options_fragment() - it clears
options from
second and following fragments. With zeroed cb it will do nothing.

ip_options_fragment() can get required information directly from ip header but
it also resets fields in IPCB -- probably it should stay valid here
and somebody else will use it later.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-01-06 Thread Cong Wang
On Wed, Jan 6, 2016 at 11:15 AM, Konstantin Khlebnikov  wrote:
> Looks like this happens because ip_options_fragment() relies on
> correct ip options length in ip control block in skb. But in
> ip_finish_output_gso() control block in segments is reused by
> skb_gso_segment(). following ip_fragment() sees some garbage.
>
> In my case there was no ip options but length becomes non-zero and
> ip_options_fragment() picked some bytes from payload and decides to
> fill huge range with IPOPT_NOOP (1). One of that ones flipped nr_frags
> in skb_shared_info at the end of data =)
>

Hmm, it looks like SKB_GSO_CB should be cleared after skb_gso_segment()
since all the gso information should be saved in shared_info after it finishes.

Does a memset(0) on SKB_GSO_CB after skb_gso_segment() work as well?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


[BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-01-06 Thread Konstantin Khlebnikov
I've got some of these:

[84408.314676] BUG: unable to handle kernel NULL pointer dereference
at   (null)
[84408.317324] IP: [] put_page+0x5/0x50
[84408.319985] PGD 0
[84408.322583] Oops:  [#1] SMP
[84408.325156] Modules linked in: ppp_mppe ppp_async ppp_generic slhc
8021q fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace
sunrpc bridge stp llc xt_HL xt_TCPMSS xt_state w83627ehf hwmon_vid
snd_hda_codec_realtek snd_hda_codec_generic radeon snd_hda_codec_hdmi
snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm
snd_hda_core edac_core k10temp snd_timer snd drm_kms_helper soundcore
ath9k ttm ath9k_common ath9k_hw ath r8169 mii
[84408.336804] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.1.15-zurg #1
[84408.339839] Hardware name: To Be Filled By O.E.M. To Be Filled By
O.E.M./RS880D, BIOS 080015  04/12/2011
[84408.342964] task: 880216d56f50 ti: 880216e04000 task.ti:
880216e04000
[84408.346136] RIP: 0010:[]  []
put_page+0x5/0x50
[84408.349301] RSP: 0018:88021fcc37c0  EFLAGS: 00010216
[84408.352433] RAX: 0030 RBX: 0001 RCX: 0077
[84408.355602] RDX: 880213d8818e RSI: 0200 RDI: 
[84408.358765] RBP: 88021fcc37e8 R08: 0076 R09: 880216c01900
[84408.361885] R10: ea000859a840 R11: 0001 R12: 8802166a1300
[84408.364988] R13: 88021280d8c0 R14: 8802166a1300 R15: 88021280d410
[84408.368059] FS:  7f9ada2de700() GS:88021fcc()
knlGS:
[84408.371211] CS:  0010 DS:  ES:  CR0: 8005003b
[84408.374336] CR2:  CR3: 000216575000 CR4: 06e0
[84408.377484] Stack:
[84408.380623]  81576ac8 88021fcc37e8 8802166a1300
8802166a1300
[84408.383843]   88021fcc3808 81576b48
88021fcc3808
[84408.387022]  6e00 88021fcc3828 81576cdd
6e00
[84408.390187] Call Trace:
[84408.393293]  
[84408.393323]  [] ? skb_release_data+0x78/0xd0
[84408.399488]  [] skb_release_all+0x28/0x30
[84408.402553]  [] consume_skb+0x5d/0x80
[84408.405630]  [] ip_fragment+0x5c4/0x970
[84408.408676]  [] ? ip_copy_metadata+0x160/0x160
[84408.411733]  [] ip_finish_output+0x601/0x900
[84408.414788]  [] ? nf_hook_slow+0x99/0x100
[84408.417828]  [] ip_output+0x66/0xc0
[84408.420847]  [] ? ip_fragment+0x970/0x970
[84408.423864]  [] ip_forward_finish+0x73/0xa0
[84408.426864]  [] ip_forward+0x3af/0x490
[84408.429833]  [] ? ip_frag_mem+0x50/0x50
[84408.432782]  [] ip_rcv_finish+0x81/0x370
[84408.435778]  [] ip_rcv+0x2a2/0x3c0
[84408.438780]  [] ? inet_del_offload+0x40/0x40
[84408.441780]  [] __netif_receive_skb_core+0x673/0x810
[84408.444785]  [] __netif_receive_skb+0x18/0x60
[84408.447766]  [] netif_receive_skb_internal+0x23/0x90
[84408.450739]  [] netif_receive_skb_sk+0x1c/0x70
[84408.453726]  [] br_handle_frame_finish+0x27c/0x520 [bridge]
[84408.456774]  [] ? ipv4_confirm+0xb8/0xe0
[84408.459787]  [] br_handle_frame+0x161/0x290 [bridge]
[84408.462803]  [] ? ip_local_deliver+0x46/0xa0
[84408.465796]  [] __netif_receive_skb_core+0x32e/0x810
[84408.468822]  [] __netif_receive_skb+0x18/0x60
[84408.471748]  [] netif_receive_skb_internal+0x23/0x90
[84408.474615]  [] ? tcp4_gro_complete+0x73/0x80
[84408.477378]  [] napi_gro_complete+0x9c/0xe0
[84408.480045]  [] dev_gro_receive+0x230/0x360
[84408.482675]  [] napi_gro_receive+0x30/0x100
[84408.485240]  [] rtl8169_poll+0x2c6/0x6b0 [r8169]
[84408.487766]  [] net_rx_action+0x1fa/0x320
[84408.490241]  [] __do_softirq+0x10b/0x2d0
[84408.492672]  [] irq_exit+0xd5/0xe0
[84408.495072]  [] do_IRQ+0x58/0xf0
[84408.497463]  [] common_interrupt+0x6e/0x6e
[84408.499879]  
[84408.499909]  [] ? native_safe_halt+0x6/0x10
[84408.504697]  [] ? tick_broadcast_oneshot_control+0xbe/0x200
[84408.507126]  [] default_idle+0x1e/0xc0
[84408.509516]  [] amd_e400_idle+0x6e/0xf0
[84408.511879]  [] arch_cpu_idle+0xf/0x20
[84408.514181]  [] cpu_startup_entry+0x327/0x3a0
[84408.516456]  [] ? clockevents_register_device+0xec/0x1d0
[84408.518760]  [] start_secondary+0x138/0x160
[84408.521066] Code: 48 89 d7 e8 2e f7 ff ff e9 a1 fe ff ff 48 89 d7
e8 51 f7 ff ff e9 94 fe ff ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 66
66 66 66 90 <48> f7 07 00 c0 00 00 55 48 89 e5 75 1e 8b 47 1c 85 c0 74
27 f0
[84408.526216] RIP  [] put_page+0x5/0x50
[84408.528705]  RSP 
[84408.531178] CR2: 

Looks like this happens because ip_options_fragment() relies on
correct ip options length in ip control block in skb. But in
ip_finish_output_gso() control block in segments is reused by
skb_gso_segment(). following ip_fragment() sees some garbage.

In my case there was no ip options but length becomes non-zero and
ip_options_fragment() picked some bytes from payload and decides to
fill huge range with IPOPT_NOOP (1). One of that ones flipped nr_frags
in skb_shared_info at the end of data =)

Here is quick hack: just make room for ip control block in gso control 

[BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-01-06 Thread Konstantin Khlebnikov
I've got some of these:

[84408.314676] BUG: unable to handle kernel NULL pointer dereference
at   (null)
[84408.317324] IP: [] put_page+0x5/0x50
[84408.319985] PGD 0
[84408.322583] Oops:  [#1] SMP
[84408.325156] Modules linked in: ppp_mppe ppp_async ppp_generic slhc
8021q fuse nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace
sunrpc bridge stp llc xt_HL xt_TCPMSS xt_state w83627ehf hwmon_vid
snd_hda_codec_realtek snd_hda_codec_generic radeon snd_hda_codec_hdmi
snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm
snd_hda_core edac_core k10temp snd_timer snd drm_kms_helper soundcore
ath9k ttm ath9k_common ath9k_hw ath r8169 mii
[84408.336804] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.1.15-zurg #1
[84408.339839] Hardware name: To Be Filled By O.E.M. To Be Filled By
O.E.M./RS880D, BIOS 080015  04/12/2011
[84408.342964] task: 880216d56f50 ti: 880216e04000 task.ti:
880216e04000
[84408.346136] RIP: 0010:[]  []
put_page+0x5/0x50
[84408.349301] RSP: 0018:88021fcc37c0  EFLAGS: 00010216
[84408.352433] RAX: 0030 RBX: 0001 RCX: 0077
[84408.355602] RDX: 880213d8818e RSI: 0200 RDI: 
[84408.358765] RBP: 88021fcc37e8 R08: 0076 R09: 880216c01900
[84408.361885] R10: ea000859a840 R11: 0001 R12: 8802166a1300
[84408.364988] R13: 88021280d8c0 R14: 8802166a1300 R15: 88021280d410
[84408.368059] FS:  7f9ada2de700() GS:88021fcc()
knlGS:
[84408.371211] CS:  0010 DS:  ES:  CR0: 8005003b
[84408.374336] CR2:  CR3: 000216575000 CR4: 06e0
[84408.377484] Stack:
[84408.380623]  81576ac8 88021fcc37e8 8802166a1300
8802166a1300
[84408.383843]   88021fcc3808 81576b48
88021fcc3808
[84408.387022]  6e00 88021fcc3828 81576cdd
6e00
[84408.390187] Call Trace:
[84408.393293]  
[84408.393323]  [] ? skb_release_data+0x78/0xd0
[84408.399488]  [] skb_release_all+0x28/0x30
[84408.402553]  [] consume_skb+0x5d/0x80
[84408.405630]  [] ip_fragment+0x5c4/0x970
[84408.408676]  [] ? ip_copy_metadata+0x160/0x160
[84408.411733]  [] ip_finish_output+0x601/0x900
[84408.414788]  [] ? nf_hook_slow+0x99/0x100
[84408.417828]  [] ip_output+0x66/0xc0
[84408.420847]  [] ? ip_fragment+0x970/0x970
[84408.423864]  [] ip_forward_finish+0x73/0xa0
[84408.426864]  [] ip_forward+0x3af/0x490
[84408.429833]  [] ? ip_frag_mem+0x50/0x50
[84408.432782]  [] ip_rcv_finish+0x81/0x370
[84408.435778]  [] ip_rcv+0x2a2/0x3c0
[84408.438780]  [] ? inet_del_offload+0x40/0x40
[84408.441780]  [] __netif_receive_skb_core+0x673/0x810
[84408.444785]  [] __netif_receive_skb+0x18/0x60
[84408.447766]  [] netif_receive_skb_internal+0x23/0x90
[84408.450739]  [] netif_receive_skb_sk+0x1c/0x70
[84408.453726]  [] br_handle_frame_finish+0x27c/0x520 [bridge]
[84408.456774]  [] ? ipv4_confirm+0xb8/0xe0
[84408.459787]  [] br_handle_frame+0x161/0x290 [bridge]
[84408.462803]  [] ? ip_local_deliver+0x46/0xa0
[84408.465796]  [] __netif_receive_skb_core+0x32e/0x810
[84408.468822]  [] __netif_receive_skb+0x18/0x60
[84408.471748]  [] netif_receive_skb_internal+0x23/0x90
[84408.474615]  [] ? tcp4_gro_complete+0x73/0x80
[84408.477378]  [] napi_gro_complete+0x9c/0xe0
[84408.480045]  [] dev_gro_receive+0x230/0x360
[84408.482675]  [] napi_gro_receive+0x30/0x100
[84408.485240]  [] rtl8169_poll+0x2c6/0x6b0 [r8169]
[84408.487766]  [] net_rx_action+0x1fa/0x320
[84408.490241]  [] __do_softirq+0x10b/0x2d0
[84408.492672]  [] irq_exit+0xd5/0xe0
[84408.495072]  [] do_IRQ+0x58/0xf0
[84408.497463]  [] common_interrupt+0x6e/0x6e
[84408.499879]  
[84408.499909]  [] ? native_safe_halt+0x6/0x10
[84408.504697]  [] ? tick_broadcast_oneshot_control+0xbe/0x200
[84408.507126]  [] default_idle+0x1e/0xc0
[84408.509516]  [] amd_e400_idle+0x6e/0xf0
[84408.511879]  [] arch_cpu_idle+0xf/0x20
[84408.514181]  [] cpu_startup_entry+0x327/0x3a0
[84408.516456]  [] ? clockevents_register_device+0xec/0x1d0
[84408.518760]  [] start_secondary+0x138/0x160
[84408.521066] Code: 48 89 d7 e8 2e f7 ff ff e9 a1 fe ff ff 48 89 d7
e8 51 f7 ff ff e9 94 fe ff ff 66 90 66 2e 0f 1f 84 00 00 00 00 00 66
66 66 66 90 <48> f7 07 00 c0 00 00 55 48 89 e5 75 1e 8b 47 1c 85 c0 74
27 f0
[84408.526216] RIP  [] put_page+0x5/0x50
[84408.528705]  RSP 
[84408.531178] CR2: 

Looks like this happens because ip_options_fragment() relies on
correct ip options length in ip control block in skb. But in
ip_finish_output_gso() control block in segments is reused by
skb_gso_segment(). following ip_fragment() sees some garbage.

In my case there was no ip options but length becomes non-zero and
ip_options_fragment() picked some bytes from payload and decides to
fill huge range with IPOPT_NOOP (1). One of that ones flipped nr_frags
in skb_shared_info at the end of data =)

Here is quick hack: just make room for ip control block in gso control 

Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-01-06 Thread Florian Westphal
Thadeu Lima de Souza Cascardo  wrote:
> On Wed, Jan 06, 2016 at 11:11:41PM +0300, Konstantin Khlebnikov wrote:
> > On Wed, Jan 6, 2016 at 10:59 PM, Cong Wang  wrote:
> > > On Wed, Jan 6, 2016 at 11:15 AM, Konstantin Khlebnikov  
> > > wrote:
> > >> Looks like this happens because ip_options_fragment() relies on
> > >> correct ip options length in ip control block in skb. But in
> > >> ip_finish_output_gso() control block in segments is reused by
> > >> skb_gso_segment(). following ip_fragment() sees some garbage.
> > >>
> > >> In my case there was no ip options but length becomes non-zero and
> > >> ip_options_fragment() picked some bytes from payload and decides to
> > >> fill huge range with IPOPT_NOOP (1). One of that ones flipped nr_frags
> > >> in skb_shared_info at the end of data =)
> > >>
> > >
> > > Hmm, it looks like SKB_GSO_CB should be cleared after skb_gso_segment()
> > > since all the gso information should be saved in shared_info after it 
> > > finishes.
> > >
> > > Does a memset(0) on SKB_GSO_CB after skb_gso_segment() work as well?
> > 
> > This will break present logic around ip_options_fragment() - it clears
> > options from
> > second and following fragments. With zeroed cb it will do nothing.
> > 
> > ip_options_fragment() can get required information directly from ip header 
> > but
> > it also resets fields in IPCB -- probably it should stay valid here
> > and somebody else will use it later.
[..]
> I have hit this as well, this fixes it for me on an older kernel. Can you try 
> it
> on latest kernel?

> diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> index d8a1745..f44bc91 100644
> --- a/net/ipv4/ip_output.c
> +++ b/net/ipv4/ip_output.c
> @@ -216,6 +216,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>   netdev_features_t features;
>   struct sk_buff *segs;
>   int ret = 0;
> + struct inet_skb_parm ipcb;
>  
>   if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
>   return ip_finish_output2(skb);
> @@ -227,6 +228,10 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>* 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
>* from host network stack.
>*/
> + /* We need to save IPCB here because skb_gso_segment will use
> +  * SKB_GSO_CB.
> +  */
> + ipcb = *IPCB(skb);
>   features = netif_skb_features(skb);
>   segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
>   if (IS_ERR_OR_NULL(segs)) {
> @@ -241,6 +246,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
>   int err;
>  
>   segs->next = NULL;
> + *IPCB(segs) = ipcb;
>   err = ip_fragment(segs, ip_finish_output2);
>  
>   if (err && ret == 0)

I'm worried that this doesn't solve all cases. f.e. xfrm may also
call skb_gso_segment(), and it will call into ipv4/ipv6 netfilter
postrouting + ipv4 output functions...

nfqnl_enqueue_packet() is also affected.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-01-06 Thread Florian Westphal
Florian Westphal  wrote:
> Thadeu Lima de Souza Cascardo  wrote:
> > On Wed, Jan 06, 2016 at 11:11:41PM +0300, Konstantin Khlebnikov wrote:

[ skb_gso_segment uses skb->cb[], causes oops if ip_fragment is invoked
  on segmented skbs ]

> > I have hit this as well, this fixes it for me on an older kernel. Can you 
> > try it
> > on latest kernel?
> 
> > diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
> > index d8a1745..f44bc91 100644
> > --- a/net/ipv4/ip_output.c
> > +++ b/net/ipv4/ip_output.c
> > @@ -216,6 +216,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
> > netdev_features_t features;
> > struct sk_buff *segs;
> > int ret = 0;
> > +   struct inet_skb_parm ipcb;
> >  
> > if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
> > return ip_finish_output2(skb);
> > @@ -227,6 +228,10 @@ static int ip_finish_output_gso(struct sk_buff *skb)
> >  * 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
> >  * from host network stack.
> >  */
> > +   /* We need to save IPCB here because skb_gso_segment will use
> > +* SKB_GSO_CB.
> > +*/
> > +   ipcb = *IPCB(skb);
> > features = netif_skb_features(skb);
> > segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
> > if (IS_ERR_OR_NULL(segs)) {
> > @@ -241,6 +246,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
> > int err;
> >  
> > segs->next = NULL;
> > +   *IPCB(segs) = ipcb;
> > err = ip_fragment(segs, ip_finish_output2);
> >  
> > if (err && ret == 0)
> 
> I'm worried that this doesn't solve all cases. f.e. xfrm may also
> call skb_gso_segment(), and it will call into ipv4/ipv6 netfilter
> postrouting + ipv4 output functions...
> 
> nfqnl_enqueue_packet() is also affected.

... but it seems that those three are the only affected callers
of skb_gso_segment (tbf is ok since skb isn't owned by anyone,
ovs does save/restore already).

I think this patch is the right way, we just need similar
save/restore in nfqnl_enqueue_packet and xfrm_output_gso().

The latter two can be used by either ipv4 or ipv6 so it might
be preferable to just save/restore sizeof(struct skb_gso_cb);
or a union of inet_skb_parm+inet6_skb_parm.

Cascardo, can you cook a patch?

Thanks!
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-01-06 Thread Cong Wang
On Wed, Jan 6, 2016 at 11:15 AM, Konstantin Khlebnikov  wrote:
> Looks like this happens because ip_options_fragment() relies on
> correct ip options length in ip control block in skb. But in
> ip_finish_output_gso() control block in segments is reused by
> skb_gso_segment(). following ip_fragment() sees some garbage.
>
> In my case there was no ip options but length becomes non-zero and
> ip_options_fragment() picked some bytes from payload and decides to
> fill huge range with IPOPT_NOOP (1). One of that ones flipped nr_frags
> in skb_shared_info at the end of data =)
>

Hmm, it looks like SKB_GSO_CB should be cleared after skb_gso_segment()
since all the gso information should be saved in shared_info after it finishes.

Does a memset(0) on SKB_GSO_CB after skb_gso_segment() work as well?
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-01-06 Thread Konstantin Khlebnikov
On Wed, Jan 6, 2016 at 10:59 PM, Cong Wang  wrote:
> On Wed, Jan 6, 2016 at 11:15 AM, Konstantin Khlebnikov  
> wrote:
>> Looks like this happens because ip_options_fragment() relies on
>> correct ip options length in ip control block in skb. But in
>> ip_finish_output_gso() control block in segments is reused by
>> skb_gso_segment(). following ip_fragment() sees some garbage.
>>
>> In my case there was no ip options but length becomes non-zero and
>> ip_options_fragment() picked some bytes from payload and decides to
>> fill huge range with IPOPT_NOOP (1). One of that ones flipped nr_frags
>> in skb_shared_info at the end of data =)
>>
>
> Hmm, it looks like SKB_GSO_CB should be cleared after skb_gso_segment()
> since all the gso information should be saved in shared_info after it 
> finishes.
>
> Does a memset(0) on SKB_GSO_CB after skb_gso_segment() work as well?

This will break present logic around ip_options_fragment() - it clears
options from
second and following fragments. With zeroed cb it will do nothing.

ip_options_fragment() can get required information directly from ip header but
it also resets fields in IPCB -- probably it should stay valid here
and somebody else will use it later.
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/


Re: [BUG] skb corruption and kernel panic at forwarding with fragmentation

2016-01-06 Thread Thadeu Lima de Souza Cascardo
On Wed, Jan 06, 2016 at 11:11:41PM +0300, Konstantin Khlebnikov wrote:
> On Wed, Jan 6, 2016 at 10:59 PM, Cong Wang  wrote:
> > On Wed, Jan 6, 2016 at 11:15 AM, Konstantin Khlebnikov  
> > wrote:
> >> Looks like this happens because ip_options_fragment() relies on
> >> correct ip options length in ip control block in skb. But in
> >> ip_finish_output_gso() control block in segments is reused by
> >> skb_gso_segment(). following ip_fragment() sees some garbage.
> >>
> >> In my case there was no ip options but length becomes non-zero and
> >> ip_options_fragment() picked some bytes from payload and decides to
> >> fill huge range with IPOPT_NOOP (1). One of that ones flipped nr_frags
> >> in skb_shared_info at the end of data =)
> >>
> >
> > Hmm, it looks like SKB_GSO_CB should be cleared after skb_gso_segment()
> > since all the gso information should be saved in shared_info after it 
> > finishes.
> >
> > Does a memset(0) on SKB_GSO_CB after skb_gso_segment() work as well?
> 
> This will break present logic around ip_options_fragment() - it clears
> options from
> second and following fragments. With zeroed cb it will do nothing.
> 
> ip_options_fragment() can get required information directly from ip header but
> it also resets fields in IPCB -- probably it should stay valid here
> and somebody else will use it later.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

I have hit this as well, this fixes it for me on an older kernel. Can you try it
on latest kernel?

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index d8a1745..f44bc91 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -216,6 +216,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
netdev_features_t features;
struct sk_buff *segs;
int ret = 0;
+   struct inet_skb_parm ipcb;
 
if (skb_gso_network_seglen(skb) <= ip_skb_dst_mtu(skb))
return ip_finish_output2(skb);
@@ -227,6 +228,10 @@ static int ip_finish_output_gso(struct sk_buff *skb)
 * 2) skb arrived via virtio-net, we thus get TSO/GSO skbs directly
 * from host network stack.
 */
+   /* We need to save IPCB here because skb_gso_segment will use
+* SKB_GSO_CB.
+*/
+   ipcb = *IPCB(skb);
features = netif_skb_features(skb);
segs = skb_gso_segment(skb, features & ~NETIF_F_GSO_MASK);
if (IS_ERR_OR_NULL(segs)) {
@@ -241,6 +246,7 @@ static int ip_finish_output_gso(struct sk_buff *skb)
int err;
 
segs->next = NULL;
+   *IPCB(segs) = ipcb;
err = ip_fragment(segs, ip_finish_output2);
 
if (err && ret == 0)
--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/