Re: sky2 hw csum failure [was Re: sky2 large MTU problems]

2006-05-30 Thread Daniel J Blueman

On 25/05/06, Patrick McHardy <[EMAIL PROTECTED]> wrote:

> Daniel J Blueman wrote:
>> On 25/05/06, Patrick McHardy <[EMAIL PROTECTED]> wrote:
>>
>>> Daniel, is there an easy way to reproduce the checksum failure?
>>
>> In short, no. This was seen when packets may have been truncated by
>> large MTU (eg 9000) problems in the sky2 driver transmit path.
>>
>> There is a small chance that this could relate to transmitting with an
>> MTU of 9000 (possibly with receiving with an MTU of 1500 too)
>
> Unfortunately I can't test this myself because my other NICs don't
> support MTUs > 1500.
>
>> On that interface, the only rules that were being exercised were:
>>
>> iptables -t filter -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
>> iptables -t filter -A INPUT -p tcp -m tcp --dport 445 --syn -j ACCEPT # SMB
>> iptables -t filter -A INPUT -j DROP
>
> That shouldn't cause any packet modifications. Can you trigger the
> checksum failures without netfilter?


When testing, I always run into the "kernel: sky2 lan0: rx error,
status 0x977d977d length 0" problem before anything else.

I need to eliminate the sky2 driver from the equation before I'm able
to prove if there is a problem elsewhere or not. I did have some e1000
NICs, but not any longer, so it'll have to wait until I can find a
stable scenario for my sky2 NIC...
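
To see which of the two messages shows up first in a given log, and how often each occurs, something along these lines might help. It is only a sketch: the function name tally_events is made up for illustration, and it assumes the kernel log is fed on stdin using the two message formats quoted in this thread.

```shell
#!/bin/sh
# tally_events (hypothetical helper): reads kernel log lines on stdin and
# prints how many "rx error" and "hw csum failure" lines were seen, plus
# which of the two appeared first.
tally_events() {
    awk '
        /rx error, status/ { rx++; if (first == "") first = "rx error" }
        /hw csum failure/  { cs++; if (first == "") first = "hw csum failure" }
        END {
            if (first == "") first = "none"
            printf "rx_error=%d hw_csum=%d first=%s\n", rx, cs, first
        }
    '
}

# On a live system (not run here):
#   dmesg | tally_events
```

This only counts lines; it does not say anything about whether the two problems are related.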
--
Daniel J Blueman
-
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: sky2 hw csum failure [was Re: sky2 large MTU problems]

2006-05-25 Thread Patrick McHardy
Daniel J Blueman wrote:
> On 25/05/06, Patrick McHardy <[EMAIL PROTECTED]> wrote:
> 
>> Daniel, is there an easy way to reproduce the checksum failure?
> 
> 
> In short, no. This was seen when packets may have been truncated by
> large MTU (eg 9000) problems in the sky2 driver transmit path.
> 
> There is a small chance that this could relate to transmitting with an
> MTU of 9000 (possibly with receiving with an MTU of 1500 too)

Unfortunately I can't test this myself because my other NICs don't
support MTUs > 1500.

> On that interface, the only rules that were being exercised were:
> 
> iptables -t filter -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
> iptables -t filter -A INPUT -p tcp -m tcp --dport 445 --syn -j ACCEPT # SMB
> iptables -t filter -A INPUT -j DROP

That shouldn't cause any packet modifications. Can you trigger the
checksum failures without netfilter?



Re: sky2 hw csum failure [was Re: sky2 large MTU problems]

2006-05-25 Thread Daniel J Blueman

On 25/05/06, Patrick McHardy <[EMAIL PROTECTED]> wrote:

> Stephen Hemminger wrote:
>> On Wed, 24 May 2006 10:28:52 +0100
>> "Daniel J Blueman" <[EMAIL PROTECTED]> wrote:
>>
>>> Having done some more stress testing with sky2 1.4 (in 2.6.17-rc4) and
>>> the latest patch, I have found problems when streaming lots of data
>>> out of the sky2 interface (eg via samba serving a large file to GigE
>>> client). Ultimately, the interface will stop sending.
>>>
>>> Before this happens, I see lots of:
>>>
>>> kernel: lan0: hw csum failure.
>>> kernel:  [__skb_checksum_complete+86/96] __skb_checksum_complete+0x56/0x60
>>> kernel:  [tcp_error+300/512] tcp_error+0x12c/0x200
>>> kernel:  [poison_obj+41/96] poison_obj+0x29/0x60
>>> kernel:  [tcp_error+0/512] tcp_error+0x0/0x200
>>> kernel:  [ip_conntrack_in+157/1072] ip_conntrack_in+0x9d/0x430
>>> kernel:  [kfree_skbmem+8/128] kfree_skbmem+0x8/0x80
>>> kernel:  [arp_process+102/1408] arp_process+0x66/0x580
>>> kernel:  [check_poison_obj+36/416] check_poison_obj+0x24/0x1a0
>>> kernel:  [arp_process+102/1408] arp_process+0x66/0x580
>>> kernel:  [nf_iterate+99/144] nf_iterate+0x63/0x90
>>> kernel:  [ip_rcv_finish+0/608] ip_rcv_finish+0x0/0x260
>>> kernel:  [nf_hook_slow+89/240] nf_hook_slow+0x59/0xf0
>>> kernel:  [ip_rcv_finish+0/608] ip_rcv_finish+0x0/0x260
>>> kernel:  [ip_rcv+386/1104] ip_rcv+0x182/0x450
>>> kernel:  [ip_rcv_finish+0/608] ip_rcv_finish+0x0/0x260
>>> kernel:  [packet_rcv_spkt+216/320] packet_rcv_spkt+0xd8/0x140
>>> kernel:  [netif_receive_skb+476/784] netif_receive_skb+0x1dc/0x310
>>> kernel:  [sky2_poll+879/2096] sky2_poll+0x36f/0x830
>>> kernel:  [_spin_lock_irqsave+9/16] _spin_lock_irqsave+0x9/0x10
>>> kernel:  [run_timer_softirq+290/416] run_timer_softirq+0x122/0x1a0
>>> kernel:  [net_rx_action+108/256] net_rx_action+0x6c/0x100
>>> kernel:  [__do_softirq+66/160] __do_softirq+0x42/0xa0
>>> kernel:  [do_softirq+78/96] do_softirq+0x4e/0x60
>>> kernel:  ===
>>> kernel:  [do_IRQ+90/160] do_IRQ+0x5a/0xa0
>>> kernel:  [remove_vma+69/80] remove_vma+0x45/0x50
>>> kernel:  [common_interrupt+26/32] common_interrupt+0x1a/0x20
>>> kernel:  [get_offset_pmtmr+151/3584] get_offset_pmtmr+0x97/0xe00
>>> kernel:  [do_gettimeofday+26/208] do_gettimeofday+0x1a/0xd0
>>> kernel:  [sys_gettimeofday+26/144] sys_gettimeofday+0x1a/0x90
>>> kernel:  [syscall_call+7/11] syscall_call+0x7/0xb
>>
>> Whatever the netfilter chain is, it is trimming or altering the packet
>> without clearing or altering the hardware checksum. It is not a driver
>> problem, we saw these in VLANs and ebtables already.
>
> The call chain looks pretty messed up, but the point where an
> invalid HW checksum is detected is in TCP connection tracking,
> which is basically the first thing netfilter does, unless
> you use the raw table. There are no packet modifications done
> by conntrack, so I doubt that netfilter is the culprit here.
> Of course we had some big checksumming cleanups, so there is
> a possibility of bugs there, but I did test them with sky2 and
> HW checksumming, so I don't think that's the case.
>
> Daniel, is there an easy way to reproduce the checksum failure?


In short, no. This was seen when packets may have been truncated by
large MTU (eg 9000) problems in the sky2 driver transmit path.

There is a small chance that this could relate to transmitting with an
MTU of 9000 (possibly with receiving with an MTU of 1500 too)

On that interface, the only rules that were being exercised were:

iptables -t filter -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -t filter -A INPUT -p tcp -m tcp --dport 445 --syn -j ACCEPT # SMB
iptables -t filter -A INPUT -j DROP

HTB and SFQ are active on other interfaces.
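
One way to confirm which INPUT rules are actually being hit is to look at the per-rule packet counters. A rough sketch, assuming the usual column layout of `iptables -L -v -n -x` output (pkts, bytes, target, ...) with two header lines; the helper name hit_rules is made up for illustration:

```shell
#!/bin/sh
# hit_rules (hypothetical helper): reads `iptables -t filter -L INPUT -v -n -x`
# output on stdin and prints "<pkts> <target>" for every rule whose packet
# counter is nonzero. Assumes the first two lines are the chain and column
# headers.
hit_rules() {
    awk 'NR > 2 && ($1 + 0) > 0 { print $1, $3 }'
}

# On a live system (needs root, not run here):
#   iptables -t filter -L INPUT -v -n -x | hit_rules
```

The -x flag asks for exact counter values, so the first column stays a plain number.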
--
Daniel J Blueman


Re: sky2 hw csum failure [was Re: sky2 large MTU problems]

2006-05-25 Thread Patrick McHardy
Stephen Hemminger wrote:
> On Wed, 24 May 2006 10:28:52 +0100
> "Daniel J Blueman" <[EMAIL PROTECTED]> wrote:
> 
> 
>>Having done some more stress testing with sky2 1.4 (in 2.6.17-rc4) and
>>the latest patch, I have found problems when streaming lots of data
>>out of the sky2 interface (eg via samba serving a large file to GigE
>>client). Ultimately, the interface will stop sending.
>>
>>Before this happens, I see lots of:
>>
>>kernel: lan0: hw csum failure.
>>kernel:  [__skb_checksum_complete+86/96] __skb_checksum_complete+0x56/0x60
>>kernel:  [tcp_error+300/512] tcp_error+0x12c/0x200
>>kernel:  [poison_obj+41/96] poison_obj+0x29/0x60
>>kernel:  [tcp_error+0/512] tcp_error+0x0/0x200
>>kernel:  [ip_conntrack_in+157/1072] ip_conntrack_in+0x9d/0x430
>>kernel:  [kfree_skbmem+8/128] kfree_skbmem+0x8/0x80
>>kernel:  [arp_process+102/1408] arp_process+0x66/0x580
>>kernel:  [check_poison_obj+36/416] check_poison_obj+0x24/0x1a0
>>kernel:  [arp_process+102/1408] arp_process+0x66/0x580
>>kernel:  [nf_iterate+99/144] nf_iterate+0x63/0x90
>>kernel:  [ip_rcv_finish+0/608] ip_rcv_finish+0x0/0x260
>>kernel:  [nf_hook_slow+89/240] nf_hook_slow+0x59/0xf0
>>kernel:  [ip_rcv_finish+0/608] ip_rcv_finish+0x0/0x260
>>kernel:  [ip_rcv+386/1104] ip_rcv+0x182/0x450
>>kernel:  [ip_rcv_finish+0/608] ip_rcv_finish+0x0/0x260
>>kernel:  [packet_rcv_spkt+216/320] packet_rcv_spkt+0xd8/0x140
>>kernel:  [netif_receive_skb+476/784] netif_receive_skb+0x1dc/0x310
>>kernel:  [sky2_poll+879/2096] sky2_poll+0x36f/0x830
>>kernel:  [_spin_lock_irqsave+9/16] _spin_lock_irqsave+0x9/0x10
>>kernel:  [run_timer_softirq+290/416] run_timer_softirq+0x122/0x1a0
>>kernel:  [net_rx_action+108/256] net_rx_action+0x6c/0x100
>>kernel:  [__do_softirq+66/160] __do_softirq+0x42/0xa0
>>kernel:  [do_softirq+78/96] do_softirq+0x4e/0x60
>>kernel:  ===
>>kernel:  [do_IRQ+90/160] do_IRQ+0x5a/0xa0
>>kernel:  [remove_vma+69/80] remove_vma+0x45/0x50
>>kernel:  [common_interrupt+26/32] common_interrupt+0x1a/0x20
>>kernel:  [get_offset_pmtmr+151/3584] get_offset_pmtmr+0x97/0xe00
>>kernel:  [do_gettimeofday+26/208] do_gettimeofday+0x1a/0xd0
>>kernel:  [sys_gettimeofday+26/144] sys_gettimeofday+0x1a/0x90
>>kernel:  [syscall_call+7/11] syscall_call+0x7/0xb
> 
> 
> 
> What ever the netfilter chain is, it is trimming or altering the packet
> without clearing or altering the hardware checksum. It is not a driver
> problem, we saw these in VLAN's and ebtables already.


The call chain looks pretty messed up, but the point where an
invalid HW checksum is detected is in TCP connection tracking,
which is basically the first thing netfilter does, unless
you use the raw table. There are no packet modifications done
by conntrack, so I doubt that netfilter is the culprit here.
Of course we had some big checksumming cleanups, so there is
a possibility of bugs there, but I did test them with sky2 and
HW checksumming, so I don't think that's the case.

Daniel, is there an easy way to reproduce the checksum failure?


Re: sky2 hw csum failure [was Re: sky2 large MTU problems]

2006-05-25 Thread Daniel J Blueman

Hi Stephen,

Thanks for your feedback.

On 24/05/06, Stephen Hemminger <[EMAIL PROTECTED]> wrote:

> "Daniel J Blueman" <[EMAIL PROTECTED]> wrote:
>> Having done some more stress testing with sky2 1.4 (in 2.6.17-rc4) and
>> the latest patch, I have found problems when streaming lots of data
>> out of the sky2 interface (eg via samba serving a large file to GigE
>> client). Ultimately, the interface will stop sending.
>>
>> Before this happens, I see lots of:
>>
>> kernel: lan0: hw csum failure.
>> kernel:  [__skb_checksum_complete+86/96] __skb_checksum_complete+0x56/0x60
>> kernel:  [tcp_error+300/512] tcp_error+0x12c/0x200
>> kernel:  [poison_obj+41/96] poison_obj+0x29/0x60
>> kernel:  [tcp_error+0/512] tcp_error+0x0/0x200
>> kernel:  [ip_conntrack_in+157/1072] ip_conntrack_in+0x9d/0x430
>> kernel:  [kfree_skbmem+8/128] kfree_skbmem+0x8/0x80
>> kernel:  [arp_process+102/1408] arp_process+0x66/0x580
>> kernel:  [check_poison_obj+36/416] check_poison_obj+0x24/0x1a0
>> kernel:  [arp_process+102/1408] arp_process+0x66/0x580
>> kernel:  [nf_iterate+99/144] nf_iterate+0x63/0x90
>> kernel:  [ip_rcv_finish+0/608] ip_rcv_finish+0x0/0x260
>> kernel:  [nf_hook_slow+89/240] nf_hook_slow+0x59/0xf0
>> kernel:  [ip_rcv_finish+0/608] ip_rcv_finish+0x0/0x260
>> kernel:  [ip_rcv+386/1104] ip_rcv+0x182/0x450
>> kernel:  [ip_rcv_finish+0/608] ip_rcv_finish+0x0/0x260
>> kernel:  [packet_rcv_spkt+216/320] packet_rcv_spkt+0xd8/0x140
>> kernel:  [netif_receive_skb+476/784] netif_receive_skb+0x1dc/0x310
>> kernel:  [sky2_poll+879/2096] sky2_poll+0x36f/0x830
>> kernel:  [_spin_lock_irqsave+9/16] _spin_lock_irqsave+0x9/0x10
>> kernel:  [run_timer_softirq+290/416] run_timer_softirq+0x122/0x1a0
>> kernel:  [net_rx_action+108/256] net_rx_action+0x6c/0x100
>> kernel:  [__do_softirq+66/160] __do_softirq+0x42/0xa0
>> kernel:  [do_softirq+78/96] do_softirq+0x4e/0x60
>> kernel:  ===
>> kernel:  [do_IRQ+90/160] do_IRQ+0x5a/0xa0
>> kernel:  [remove_vma+69/80] remove_vma+0x45/0x50
>> kernel:  [common_interrupt+26/32] common_interrupt+0x1a/0x20
>> kernel:  [get_offset_pmtmr+151/3584] get_offset_pmtmr+0x97/0xe00
>> kernel:  [do_gettimeofday+26/208] do_gettimeofday+0x1a/0xd0
>> kernel:  [sys_gettimeofday+26/144] sys_gettimeofday+0x1a/0x90
>> kernel:  [syscall_call+7/11] syscall_call+0x7/0xb
>
> Whatever the netfilter chain is, it is trimming or altering the packet
> without clearing or altering the hardware checksum. It is not a driver
> problem, we saw these in VLANs and ebtables already.


No ebtables or VLAN used; the relevant part of iptables I have:

iptables -t filter -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -t filter -A INPUT -p tcp -m tcp --dport 445 --syn -j ACCEPT # SMB
iptables -t filter -A INPUT -j DROP

This may be linked to using a large MTU (7500 or 9000) on the sky2
Linux box while the client was transmitting back to the sky2 with an
MTU of 1500.


>> One of these was preceded by:
>>
>> kernel: sky2 lan0: rx error, status 0x977d977d length 0
>
> The receive FIFO got overrun. You must not be running hardware flow
> control.


This 'status 0x977d977d' message is received before the above problem
occurs, and I couldn't reproduce the 'hw csum failure' last night. The
client is a Broadcom NetXtreme PCI-E card, purportedly with flow
control on. I have got the reproducer down to:

1. use 2.6.17-rc4 w/ sky2 MTU patch
2. increase MTU to >= 7500
3. decrease MTU to 1500
4. send ~1-2GB out of sky2 NIC
5. "rx error, status 0x977d977d length 0" messages received

I have found that without raising the MTU initially to 7500/9000, this
problem does not occur. Perhaps chip tx buffers aren't shrunk when the
MTU is dropped?
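
The reproducer steps above could be scripted roughly as follows. This is a sketch under assumptions, not a tested procedure: it assumes iproute2's `ip` is available, uses 9000 as the raised MTU, leaves the bulk transfer in step 4 to be done by hand, and defaults to a dry run that only prints the commands. The names run, reproduce, IFACE and DRY_RUN are made up for illustration.

```shell
#!/bin/sh
# Sketch of the reproducer. Step 1 (2.6.17-rc4 with the sky2 MTU patch) is a
# precondition and is not handled here.
IFACE="${IFACE:-lan0}"      # interface name as used in this thread
DRY_RUN="${DRY_RUN:-1}"     # 1 = only print commands, 0 = execute (needs root)

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reproduce() {
    run ip link set dev "$IFACE" mtu 9000    # step 2: raise MTU to >= 7500
    run ip link set dev "$IFACE" mtu 1500    # step 3: drop MTU back to 1500
    echo "# step 4: send ~1-2GB out of $IFACE (e.g. serve a large file over SMB)"
    echo "# step 5: afterwards, look for the rx error in the kernel log:"
    run sh -c "dmesg | grep 'rx error'"
}

reproduce
```

Running it with DRY_RUN=0 changes the interface MTU, so that mode is only meant for a test machine.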

Is there a tunable low-watermark for starting the DMA transfer from
the chip on rx? The client isn't sending back that much (TCP acks
every segment, SMB protocol acks every 64KB), but I guess
fewer rx buffers are available, as larger tx buffers are used on the
sky2 chip for the large tx packets.


>> This was happening with the default MTU of 1500, not just at MTU size
>> 9000 (but it was changed down from 9000). Hardware is Yukon-EC (0xb6)
>> rev 1.
>>
>> I'll do some more stress testing tonight without the MTU patch and
>> without the MTU being raised to 9000 initially and see what happens.
>>
>> Thanks for all your great work so far!


Let me know if this is a scenario that isn't expected to work, or if
there is anything else I can look at or try.

Thanks again!
--
Daniel J Blueman


Re: sky2 hw csum failure [was Re: sky2 large MTU problems]

2006-05-24 Thread Stephen Hemminger
On Wed, 24 May 2006 10:28:52 +0100
"Daniel J Blueman" <[EMAIL PROTECTED]> wrote:

> Having done some more stress testing with sky2 1.4 (in 2.6.17-rc4) and
> the latest patch, I have found problems when streaming lots of data
> out of the sky2 interface (eg via samba serving a large file to GigE
> client). Ultimately, the interface will stop sending.
> 
> Before this happens, I see lots of:
> 
> kernel: lan0: hw csum failure.
> kernel:  [__skb_checksum_complete+86/96] __skb_checksum_complete+0x56/0x60
> kernel:  [tcp_error+300/512] tcp_error+0x12c/0x200
> kernel:  [poison_obj+41/96] poison_obj+0x29/0x60
> kernel:  [tcp_error+0/512] tcp_error+0x0/0x200
> kernel:  [ip_conntrack_in+157/1072] ip_conntrack_in+0x9d/0x430
> kernel:  [kfree_skbmem+8/128] kfree_skbmem+0x8/0x80
> kernel:  [arp_process+102/1408] arp_process+0x66/0x580
> kernel:  [check_poison_obj+36/416] check_poison_obj+0x24/0x1a0
> kernel:  [arp_process+102/1408] arp_process+0x66/0x580
> kernel:  [nf_iterate+99/144] nf_iterate+0x63/0x90
> kernel:  [ip_rcv_finish+0/608] ip_rcv_finish+0x0/0x260
> kernel:  [nf_hook_slow+89/240] nf_hook_slow+0x59/0xf0
> kernel:  [ip_rcv_finish+0/608] ip_rcv_finish+0x0/0x260
> kernel:  [ip_rcv+386/1104] ip_rcv+0x182/0x450
> kernel:  [ip_rcv_finish+0/608] ip_rcv_finish+0x0/0x260
> kernel:  [packet_rcv_spkt+216/320] packet_rcv_spkt+0xd8/0x140
> kernel:  [netif_receive_skb+476/784] netif_receive_skb+0x1dc/0x310
> kernel:  [sky2_poll+879/2096] sky2_poll+0x36f/0x830
> kernel:  [_spin_lock_irqsave+9/16] _spin_lock_irqsave+0x9/0x10
> kernel:  [run_timer_softirq+290/416] run_timer_softirq+0x122/0x1a0
> kernel:  [net_rx_action+108/256] net_rx_action+0x6c/0x100
> kernel:  [__do_softirq+66/160] __do_softirq+0x42/0xa0
> kernel:  [do_softirq+78/96] do_softirq+0x4e/0x60
> kernel:  ===
> kernel:  [do_IRQ+90/160] do_IRQ+0x5a/0xa0
> kernel:  [remove_vma+69/80] remove_vma+0x45/0x50
> kernel:  [common_interrupt+26/32] common_interrupt+0x1a/0x20
> kernel:  [get_offset_pmtmr+151/3584] get_offset_pmtmr+0x97/0xe00
> kernel:  [do_gettimeofday+26/208] do_gettimeofday+0x1a/0xd0
> kernel:  [sys_gettimeofday+26/144] sys_gettimeofday+0x1a/0x90
> kernel:  [syscall_call+7/11] syscall_call+0x7/0xb


Whatever the netfilter chain is, it is trimming or altering the packet
without clearing or altering the hardware checksum. It is not a driver
problem, we saw these in VLANs and ebtables already.


> One of these was preceded by:
> 
> kernel: sky2 lan0: rx error, status 0x977d977d length 0

The receive FIFO got overrun. You must not be running hardware flow
control.
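
Whether pause frames (hardware flow control) are enabled on an interface can be checked with `ethtool -a`. The sketch below only parses that output; it assumes the common "RX:" / "TX:" line format, and the helper name pause_state is made up for illustration.

```shell
#!/bin/sh
# pause_state (hypothetical helper): reads `ethtool -a <iface>` output on
# stdin and prints "rx=<on|off> tx=<on|off>". Assumes lines of the form
# "RX:<whitespace>on" and "TX:<whitespace>off".
pause_state() {
    awk '
        $1 == "RX:" { rx = $2 }
        $1 == "TX:" { tx = $2 }
        END { printf "rx=%s tx=%s\n", rx, tx }
    '
}

# On a live system (not run here):
#   ethtool -a lan0 | pause_state
#   ethtool -A lan0 rx on tx on    # request pause frames in both directions
```

Both ends of the link need flow control enabled for it to take effect, so the same check applies to the client.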

> 
> This was happening with the default MTU of 1500, not just at MTU size
> 9000 (but it was changed down from 9000). Hardware is Yukon-EC (0xb6)
> rev 1.
> 
> I'll do some more stress testing tonight without the MTU patch and
> without the MTU being raised to 9000 initially and see what happens.
> 
> Thanks for all your great work so far!
