Re: IPv6 routing/fragmentation panic

2015-09-16 Thread David Woodhouse
On Wed, 2015-09-16 at 01:48 +0200, Florian Westphal wrote:
> 
> What I don't understand is why you see this with fragmented ipv6 
> packets only (and not with all ipv6 forwarded skbs).
> 
> Something like this copy-pastry from ip_finish_output2 should fix it:

That works; thanks.

Tested-by: David Woodhouse 

A little extra debugging output shows that the offending fragments were
arriving here with skb_headroom(skb)==10. Which is reasonable, being
the Solos ADSL card's header of 8 bytes followed by 2 bytes of PPP
frame type.

The non-fragmented packets, on the other hand, are arriving with a
headroom of 42 bytes. Could something else already have reallocated
them before they get that far? (Do we have any way to gather statistics
on such reallocations? It seems that might be useful for performance
investigation.)

Johannes and I were talking on IRC yesterday about trying to make this
kind of thing easier to reproduce without odd hardware. We postulated a
skb_torture() function which, when an appropriate debugging option was
enabled, would randomly screw around with the skb in various
interesting ways — shifting the data down so that there's no headroom,
deliberately making it *non-linear*, temporarily cloning it and freeing
the clone a couple of seconds later, etc.

Then we could insert calls to skb_torture() in interesting places like
netif_rx(), ip6_finish_output2() and anywhere else that seems
appropriate (perhaps with flags to indicate *what* kind of torture is
permissible in certain locations). And see what breaks...
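
A userspace toy of the "no headroom" variant of that idea might look like the sketch below. Everything in it is hypothetical: struct toy_skb, toy_skb_torture() and the TORTURE_NO_HEADROOM flag are illustration names, not kernel API, and the model is a single linear buffer with no clones or page frags.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for struct sk_buff: one linear buffer, nothing else. */
struct toy_skb {
	unsigned char buf[256];
	size_t data;	/* offset of the first payload byte == headroom */
	size_t len;	/* payload length in bytes */
};

#define TORTURE_NO_HEADROOM 0x1u	/* hypothetical torture flag */

static size_t toy_headroom(const struct toy_skb *skb)
{
	return skb->data;
}

/* Slide the payload to the start of the buffer so no headroom is left. */
static void toy_skb_torture(struct toy_skb *skb, unsigned int flags)
{
	assert(skb->data + skb->len <= sizeof(skb->buf));

	if ((flags & TORTURE_NO_HEADROOM) && skb->data > 0) {
		memmove(skb->buf, skb->buf + skb->data, skb->len);
		skb->data = 0;
	}
}

/* Prepend n bytes; returns 0 on success, -1 if that would underrun. */
static int toy_push(struct toy_skb *skb, size_t n)
{
	if (n > toy_headroom(skb))
		return -1;
	skb->data -= n;
	skb->len += n;
	return 0;
}
```

A caller that blindly pushes a 14-byte header after toy_skb_torture() gets -1 back here; the real skb_push() would hit skb_under_panic instead, which is the class of bug such a debugging option would be meant to flush out.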

-- 
David Woodhouse                            Open Source Technology Centre
david.woodho...@intel.com                              Intel Corporation



smime.p7s
Description: S/MIME cryptographic signature


Re: IPv6 routing/fragmentation panic

2015-09-16 Thread Florian Westphal
David Woodhouse  wrote:
> On Wed, 2015-09-16 at 01:48 +0200, Florian Westphal wrote:
> > 
> > What I don't understand is why you see this with fragmented ipv6 
> > packets only (and not with all ipv6 forwarded skbs).
> > 
> > Something like this copy-pastry from ip_finish_output2 should fix it:
> 
> That works; thanks.
> 
> Tested-by: David Woodhouse 
> 
> A little extra debugging output shows that the offending fragments were
> arriving here with skb_headroom(skb)==10. Which is reasonable, being
> the Solos ADSL card's header of 8 bytes followed by 2 bytes of PPP
> frame type.
> 
> The non-fragmented packets, on the other hand, are arriving with a
> headroom of 42 bytes. Could something else already have reallocated
> them before they get that far?

Yep.  I missed the

if (skb_cow(skb, dst->dev->hard_header_len)) {

call in ip6_forward().

The problem is of course that we only expand the headroom of the skb
and not of the fragment(s) stored in that skb's frag list.

So we have several options for a fix.

- expand headroom in ip6_finish_output2, like we do for ipv4
- expand headroom in ip6_fragment
- defer to slowpath if frags don't have enough headroom.

The last option is the smallest patch and would not add a test for
locally generated, non-fragmented skbs.

(not even compile tested)
David, could you test this?  I'd do an official patch submission then.

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -586,6 +586,7 @@ int ip6_fragment(struct sock *sk, struct sk_buff *skb,
 	frag_id = ipv6_select_ident(net, &ipv6_hdr(skb)->daddr,
 				    &ipv6_hdr(skb)->saddr);
 
+	hroom = LL_RESERVED_SPACE(rt->dst.dev);
 	if (skb_has_frag_list(skb)) {
 		int first_len = skb_pagelen(skb);
 		struct sk_buff *frag2;
@@ -599,7 +600,7 @@ int ip6_fragment(struct sock *sk, struct sk_buff *skb,
 			/* Correct geometry. */
 			if (frag->len > mtu ||
 			    ((frag->len & 7) && frag->next) ||
-			    skb_headroom(frag) < hlen)
+			    skb_headroom(frag) < (hlen + hroom))
 				goto slow_path_clean;
 
 			/* Partially cloned skb? */
@@ -724,7 +725,6 @@ slow_path:
 	 */
 
 	*prevhdr = NEXTHDR_FRAGMENT;
-	hroom = LL_RESERVED_SPACE(rt->dst.dev);
 	troom = rt->dst.dev->needed_tailroom;
 
 	/*
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: IPv6 routing/fragmentation panic

2015-09-16 Thread David Woodhouse
On Wed, 2015-09-16 at 15:27 +0200, Florian Westphal wrote:
> @@ -599,7 +600,7 @@ int ip6_fragment(struct sock *sk, struct sk_buff *skb,
> /* Correct geometry. */
> if (frag->len > mtu ||
> ((frag->len & 7) && frag->next) ||
> -   skb_headroom(frag) < hlen)
> +   skb_headroom(frag) < (hlen + hroom))
> goto slow_path_clean;
>  
> /* Partially cloned skb? */

My test is 'ping -s 2000', and I end up with a fragment of 1280 bytes
followed by a fragment of 776 bytes.

The test cited above is only actually running on the latter fragment
(which for some reason is fine and has headroom of 58 bytes).

The first, larger, fragment isn't being checked. And that's the one
with only 10 bytes of headroom.

[   62.027984] has frag list
[   62.030616] line 604 check frag ddc5fcc0 len 776 headroom 58 (hlen 40 hroom 16)
[   62.036720] line 678 send skb ded050c0 len 1280 headroom 10
[   62.041096] skbuff: skb_under_panic: text:c125f9ca len:1294 put:14 head:dec89000 data:dec88ffc tail:0xdec8950a end:0xdec89f50 dev:br-lan
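
The numbers in that last line follow from pushing a 14-byte header into 10 bytes of headroom. A small sketch of the pointer arithmetic, using only values from the debug output above (struct push_result and model_push() are made-up names for illustration, not kernel functions):

```c
#include <assert.h>
#include <stdint.h>

/* Pointer values after a modelled push, as 32-bit addresses. */
struct push_result {
	uint32_t data;	/* data pointer after the push */
	uint32_t tail;	/* tail pointer (a push does not move it) */
	uint32_t len;	/* length after the push */
	int underrun;	/* nonzero if data ended up below head */
};

/*
 * Model prepending 'put' bytes to a linear buffer that starts at 'head',
 * has 'headroom' free bytes in front of the data and 'len' bytes of data.
 */
static struct push_result model_push(uint32_t head, uint32_t headroom,
				     uint32_t len, uint32_t put)
{
	struct push_result r;

	assert(head >= put);	/* keep the 32-bit arithmetic from wrapping */

	r.data = head + headroom - put;
	r.tail = head + headroom + len;
	r.len = len + put;
	r.underrun = r.data < head;
	return r;
}
```

With head 0xdec89000, headroom 10, len 1280 and put 14 this gives data 0xdec88ffc (4 bytes below head), tail 0xdec8950a and len 1294, matching the skb_under_panic line.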

-- 
dwmw2





Re: IPv6 routing/fragmentation panic

2015-09-16 Thread David Woodhouse
On Wed, 2015-09-16 at 15:27 +0200, Florian Westphal wrote:
> 
> David, could you test this?  I'd do an official patch submission
> then.

Compiles. Doesn't fix the problem.

-- 
dwmw2





Re: IPv6 routing/fragmentation panic

2015-09-16 Thread Florian Westphal
David Woodhouse  wrote:
> > if (frag->len > mtu ||
> > ((frag->len & 7) && frag->next) ||
> > -   skb_headroom(frag) < hlen)
> > +   skb_headroom(frag) < (hlen + hroom))
> > goto slow_path_clean;
> >  
> > /* Partially cloned skb? */
> 
> My test is 'ping -s 2000', and I end up with a fragment of 1280 bytes
> followed by a fragment of 776 bytes.
> 
> The test cited above is only actually running on the latter fragment
> (which for some reason is fine and has headroom of 58 bytes).
> 
> The first, larger, fragment isn't being checked. And that's the one
> with only 10 bytes of headroom.

Thanks for this detailed analysis.
I've sent a patch that should address all of these issues.

Turns out that all tests are wrong in your case.

ip6_fragment doesn't expand headroom, since this skb had the ipv6
fragment header pulled, so that part thinks there are 18 bytes
available (we later push the frag header back when sending fragments).

The 'skb_headroom(frag) < hlen' test is wrong since it accounts for
neither the device header length nor the fragment header that we need
to push.
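
As a rough userspace sketch of a check that accounts for all three pieces: the function below is only an illustration of the reasoning in this thread, not the patch that was submitted. It assumes the head skb already carries its IPv6 header and so needs room for the fragment header plus the link-layer header, while each frag on the frag list additionally needs room for hlen. struct toy_frag and toy_fast_path_ok() are made-up names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_FRAG_HDR_LEN 8u	/* the IPv6 fragment header is 8 bytes */

/* Toy stand-in for one skb on the frag list. */
struct toy_frag {
	unsigned int headroom;
	const struct toy_frag *next;
};

/*
 * Return true if the fast path (reusing the existing fragments) is safe:
 * every buffer must have room for the headers that will be pushed.
 *   hlen  - length of the unfragmentable part (40 in the log above)
 *   hroom - link-layer reserve, i.e. LL_RESERVED_SPACE() (16 in the log)
 */
static bool toy_fast_path_ok(unsigned int head_headroom,
			     const struct toy_frag *frags,
			     unsigned int hlen, unsigned int hroom)
{
	const struct toy_frag *f;

	/* Head skb: fragment header + link-layer header still to be pushed. */
	if (head_headroom < hroom + TOY_FRAG_HDR_LEN)
		return false;

	/* Other frags: IPv6 header(s) + fragment header + link-layer header. */
	for (f = frags; f != NULL; f = f->next) {
		if (f->headroom < hlen + hroom + TOY_FRAG_HDR_LEN)
			return false;
	}
	return true;
}
```

With the values reported here (head skb with 18 bytes before the fragment header is pushed back, second fragment with 58, hlen 40, hroom 16), both buffers fall short (18 < 24 and 58 < 64), so under these assumptions the fast path would have to be refused.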


Re: IPv6 routing/fragmentation panic

2015-09-15 Thread Florian Westphal
David Woodhouse  wrote:
> I can repeatably crash my router with 'ping6 -s 2000' to an external
> machine:
> [   61.741618] skbuff: skb_under_panic: text:c1277f1e len:1294 put:14 
> head:dec98000 data:dec97ffc tail:0xdec9850a end:0xdec98f40 dev:br-lan
> [   61.754128] [ cut here ]
> [   61.758754] Kernel BUG at c1201b1f [verbose debug info unavailable]
> [   61.764005] invalid opcode:  [#1] 
> [   61.764005] Modules linked in: sch_teql 8139cp mii iptable_nat pppoe 
> nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_REJECT ipt_MASQUERADE 
> xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit 
> xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_CT solos_pci pppox 
> ppp_async nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_nat_ftp 
> nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_ftp 
> nf_conntrack iptable_raw iptable_mangle iptable_filter ip_tables crc_ccitt 
> act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw 
> sch_hfsc sch_ingress ledtrig_heartbeat ledtrig_gpio ip6t_REJECT 
> nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_raw ip6table_mangle 
> ip6table_filter ip6_tables x_tables pppoatm ppp_generic slhc br2684 atm 
> geode_aes cbc arc4 aes_i586
> [   61.764005] CPU: 0 PID: 0 Comm: swapper Not tainted 4.2.0+ #2
> [   61.764005] task: c138d540 ti: c1386000 task.ti: c1386000
> [   61.764005] EIP: 0060:[] EFLAGS: 00210286 CPU: 0
> [   61.764005] EIP is at skb_panic+0x3b/0x3d
> [   61.764005] EAX: 007c EBX: deca3000 ECX: c13a0910 EDX: c139f3c4
> [   61.764005] ESI: dee85d8c EDI: dec9800a EBP: defe3b40 ESP: dec0bd50
> [   61.764005]  DS: 007b ES: 007b FS:  GS:  SS: 0068
> [   61.764005] CR0: 8005003b CR2: b7704474 CR3: 1ef0d000 CR4: 0090
> [   61.764005] Stack:
> [   61.764005]  c135e48c c12e1580 c1277f1e 050e 000e dec98000 
> dec97ffc dec9850a
> [   61.764005]  dec98f40 deca3000 dee85d00 c120337b c12e1580 c1277f1e 
>  000e
> [   61.764005]  dee85d7c ff671e02 deca3000 c109afd3 00200282 1d91 
> 0028 dec98012
> [   61.764005] Call Trace:
> [   61.764005]  [] ? ip6_finish_output2+0x196/0x4da

Hmm, unlike the IPv4 stack, the IPv6 stack doesn't check the headroom
size before adding the hardware header (hh).
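
For illustration, the "check headroom, reallocate if short" pattern can be modelled in userspace roughly as follows. This is a sketch under simplifying assumptions (one heap buffer, no tailroom, no reference counting); struct toy_skb and toy_ensure_headroom() are hypothetical names, not kernel API.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a linear skb living in one malloc()ed buffer. */
struct toy_skb {
	unsigned char *head;	/* start of the allocation */
	size_t headroom;	/* free bytes between head and the data */
	size_t len;		/* bytes of data */
};

/*
 * Make sure at least 'need' bytes of headroom exist before a header is
 * prepended. Returns 0 on success, -1 if the reallocation failed (the
 * buffer is left untouched in that case).
 */
static int toy_ensure_headroom(struct toy_skb *skb, size_t need)
{
	unsigned char *nhead;

	assert(skb != NULL && skb->head != NULL);

	if (skb->headroom >= need)
		return 0;	/* common case: nothing to do */

	nhead = malloc(need + skb->len);
	if (nhead == NULL)
		return -1;

	/* Copy the payload behind the enlarged headroom. */
	memcpy(nhead + need, skb->head + skb->headroom, skb->len);
	free(skb->head);
	skb->head = nhead;
	skb->headroom = need;
	return 0;
}
```

The common case costs one comparison; only buffers that arrive short, such as the 10-byte-headroom fragments in this report, pay for a copy.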

> But should the kernel *panic* without it? If there are requirements on
> the headroom I must leave on received packets, where are they
> documented? Or is this a bug in the IPv6 fragmentation code, to make
> such assumptions?

I'm not sure the ipv6 (re)fragmentation code is to blame here.
In particular, we could have setups where additional headers need to be
inserted which could also require headroom expansion.

> I'm not entirely sure how to interpret the above stack trace. Is the
> incoming IPv6 packet being reassembled for netfilter's benefit, then
> re-fragmented for transmission?

Yes, ipv6 connection tracking depends on defragmentation.

ip6_fragment should use the frag_list of the (reassembled) skb so no
refragmentation should be happening, we should just be re-using the
original fragmented skbs from that fraglist.

What I don't understand is why you see this with fragmented ipv6 packets only
(and not with all ipv6 forwarded skbs).

Something like this copy-pastry from ip_finish_output2 should fix it:

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -62,6 +62,7 @@ static int ip6_finish_output2(struct sock *sk, struct sk_buff *skb)
 	struct net_device *dev = dst->dev;
 	struct neighbour *neigh;
 	struct in6_addr *nexthop;
+	unsigned int hh_len;
 	int ret;
 
 	skb->protocol = htons(ETH_P_IPV6);
@@ -104,6 +105,21 @@ static int ip6_finish_output2(struct sock *sk, struct sk_buff *skb)
 		}
 	}
 
+	hh_len = LL_RESERVED_SPACE(dev);
+	if (unlikely(skb_headroom(skb) < hh_len && dev->header_ops)) {
+		struct sk_buff *skb2;
+
+		skb2 = skb_realloc_headroom(skb, hh_len);
+		if (!skb2) {
+			kfree_skb(skb);
+			return -ENOMEM;
+		}
+		if (skb->sk)
+			skb_set_owner_w(skb2, skb->sk);
+		consume_skb(skb);
+		skb = skb2;
+	}
+
 	rcu_read_lock_bh();
 	nexthop = rt6_nexthop((struct rt6_info *)dst, &ipv6_hdr(skb)->daddr);
 	neigh = __ipv6_neigh_lookup_noref(dst->dev, nexthop);


Re: IPv6 routing/fragmentation panic

2015-09-15 Thread Michal Kubecek
On Tue, Sep 15, 2015 at 04:53:20PM +0100, David Woodhouse wrote:
> I'm not entirely sure how to interpret the above stack trace. Is the
> incoming IPv6 packet being reassembled for netfilter's benefit, then
> re-fragmented for transmission?

Not refragmented. Both the reassembled packet and the original fragments
are kept. Reassembled packet is used for connection tracking and (since
3.13) netfilter rule matching, the original fragments are then forwarded
on (if it passes the rules).

  Michal Kubecek


IPv6 routing/fragmentation panic

2015-09-15 Thread David Woodhouse
I can repeatably crash my router with 'ping6 -s 2000' to an external
machine:

[   61.741618] skbuff: skb_under_panic: text:c1277f1e len:1294 put:14 head:dec98000 data:dec97ffc tail:0xdec9850a end:0xdec98f40 dev:br-lan
[   61.754128] [ cut here ]
[   61.758754] Kernel BUG at c1201b1f [verbose debug info unavailable]
[   61.764005] invalid opcode:  [#1] 
[   61.764005] Modules linked in: sch_teql 8139cp mii iptable_nat pppoe 
nf_nat_ipv4 nf_conntrack_ipv6 nf_conntrack_ipv4 ipt_REJECT ipt_MASQUERADE 
xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit 
xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_CT solos_pci pppox 
ppp_async nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_nat_ftp 
nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack 
iptable_raw iptable_mangle iptable_filter ip_tables crc_ccitt act_skbedit 
act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_hfsc 
sch_ingress ledtrig_heartbeat ledtrig_gpio ip6t_REJECT nf_reject_ipv6 
nf_log_ipv6 nf_log_common ip6table_raw ip6table_mangle ip6table_filter 
ip6_tables x_tables pppoatm ppp_generic slhc br2684 atm geode_aes cbc arc4 
aes_i586
[   61.764005] CPU: 0 PID: 0 Comm: swapper Not tainted 4.2.0+ #2
[   61.764005] task: c138d540 ti: c1386000 task.ti: c1386000
[   61.764005] EIP: 0060:[] EFLAGS: 00210286 CPU: 0
[   61.764005] EIP is at skb_panic+0x3b/0x3d
[   61.764005] EAX: 007c EBX: deca3000 ECX: c13a0910 EDX: c139f3c4
[   61.764005] ESI: dee85d8c EDI: dec9800a EBP: defe3b40 ESP: dec0bd50
[   61.764005]  DS: 007b ES: 007b FS:  GS:  SS: 0068
[   61.764005] CR0: 8005003b CR2: b7704474 CR3: 1ef0d000 CR4: 0090
[   61.764005] Stack:
[   61.764005]  c135e48c c12e1580 c1277f1e 050e 000e dec98000 dec97ffc 
dec9850a
[   61.764005]  dec98f40 deca3000 dee85d00 c120337b c12e1580 c1277f1e  
000e
[   61.764005]  dee85d7c ff671e02 deca3000 c109afd3 00200282 1d91 0028 
dec98012
[   61.764005] Call Trace:
[   61.764005]  [] ? ip6_finish_output2+0x196/0x4da
[   61.764005]  [] ? skb_push+0x2c/0x2c
[   61.764005]  [] ? ip6_finish_output2+0x196/0x4da
[   61.764005]  [] ? __kmalloc_track_caller+0x5a/0xd9
[   61.764005]  [] ? kmemdup+0x15/0x4a
[   61.764005]  [] ? ip6_forward_finish+0xa/0xa
[   61.764005]  [] ? ip6_fragment+0x924/0xb49
[   61.764005]  [] ? ip6_forward_finish+0xa/0xa
[   61.764005]  [] ? nf_hook_slow+0x50/0x92
[   61.764005]  [] ? ip6_output+0x85/0xeb
[   61.764005]  [] ? ip6_fragment+0xb49/0xb49
[   61.764005]  [] ? ip6_forward+0x4a9/0x6b9
[   61.764005]  [] ? ac6_proc_exit+0xd/0xd
[   61.764005]  [] ? ip6_make_skb+0x15f/0x15f
[   61.764005]  [] ? ip6_rcv_finish+0x7a/0x7e
[   61.764005]  [] ? ipv6_defrag+0xc3/0xc5 [nf_defrag_ipv6]
[   61.764005]  [] ? ip6_make_skb+0x15f/0x15f
[   61.764005]  [] ? nf_iterate+0x5b/0x64
[   61.764005]  [] ? nf_hook_slow+0x50/0x92
[   61.764005]  [] ? ipv6_rcv+0x305/0x470
[   61.764005]  [] ? ip6_make_skb+0x15f/0x15f
[   61.764005]  [] ? __netif_receive_skb_core+0x643/0x836
[   61.764005]  [] ? nommu_map_page+0x2d/0x4d
[   61.764005]  [] ? solos_bh+0x681/0x751 [solos_pci]
[   61.764005]  [] ? process_backlog+0x45/0x96
[   61.764005]  [] ? net_rx_action+0x15b/0x238
[   61.764005]  [] ? __do_softirq+0xb4/0x18a
[   61.764005]  [] ? __hrtimer_tasklet_trampoline+0x12/0x12
[   61.764005]  [] ? do_softirq_own_stack+0x1b/0x20
[   61.764005]   
[   61.764005]  [] ? do_IRQ+0x38/0x9a
[   61.764005]  [] ? common_interrupt+0x29/0x30
[   61.764005]  [] ? default_idle+0x2/0x3
[   61.764005]  [] ? arch_cpu_idle+0x6/0x7
[   61.764005]  [] ? cpu_startup_entry+0xed/0x189
[   61.764005]  [] ? start_kernel+0x2e5/0x2e8
[   61.764005] Code: ff b0 9c 00 00 00 ff b0 98 00 00 00 ff b0 a4 00 00 00 ff 
b0 a0 00 00 00 52 ff 70 54 51 ff 74 24 28 68 8c e4 35 c1 e8 9c 73 0b 00 <0f> 0b 
89 c1 83 79 58 00 8b 80 98 00 00 00 75 17 53 8d 1c 10 01
[   61.764005] EIP: [] skb_panic+0x3b/0x3d SS:ESP 0068:dec0bd50
[   62.120408] ---[ end trace 45d5375a04f3aef4 ]---
[   62.125034] Kernel panic - not syncing: Fatal exception in interrupt
[   62.130381] Kernel Offset: disabled
[   62.130381] Rebooting in 3 seconds..

I can 'fix' it thus (which demonstrates that the issue was with incoming
packets arriving over PPPoATM and being routed out the internal Ethernet):

--- drivers/atm/solos-pci.c~	2015-08-31 23:19:23.0 +0100
+++ drivers/atm/solos-pci.c	2015-09-15 15:10:42.534125968 +0100
@@ -869,8 +869,9 @@ static void solos_bh(unsigned long card_
 		/* Allocate RX skbs for any ports which need them */
 		if (card->using_dma && card->atmdev[port] &&
 		    !card->rx_skb[port]) {
-			struct sk_buff *skb = alloc_skb(RX_DMA_SIZE, GFP_ATOMIC);
+			struct sk_buff *skb = alloc_skb(RX_DMA_SIZE + 16, GFP_ATOMIC);
 			if (skb) {
+				skb_reserve(skb, 16);