On Mon, Oct 29, 2018 at 8:52 PM, Eric Dumazet <eric.duma...@gmail.com> wrote:
>
>
> On 10/29/2018 07:53 PM, Eric Dumazet wrote:
>>
>>
>> On 10/29/2018 07:27 PM, Cong Wang wrote:
>>> Hi,
>>>
>>> On Mon, Oct 29, 2018 at 5:19 PM Paweł Staszewski <pstaszew...@itcare.pl> 
>>> wrote:
>>>>
>>>> Sorry not complete - followed by hw csum:
>>>>
>>>> [  342.190831] vlan1490: hw csum failure
>>>> [  342.190835] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1
>>>> [  342.190836] Call Trace:
>>>> [  342.190839]  <IRQ>
>>>> [  342.190849]  dump_stack+0x46/0x5b
>>>> [  342.190856]  __skb_checksum_complete+0x9a/0xa0
>>>> [  342.190859]  tcp_v4_rcv+0xef/0x960
>>>> [  342.190864]  ip_local_deliver_finish+0x49/0xd0
>>>> [  342.190866]  ip_local_deliver+0x5e/0xe0
>>>> [  342.190869]  ? ip_sublist_rcv_finish+0x50/0x50
>>>> [  342.190870]  ip_rcv+0x41/0xc0
>>>> [  342.190874]  __netif_receive_skb_one_core+0x4b/0x70
>>>> [  342.190877]  netif_receive_skb_internal+0x2f/0xd0
>>>> [  342.190879]  napi_gro_receive+0xb7/0xe0
>>>> [  342.190884]  mlx5e_handle_rx_cqe+0x7a/0xd0
>>>> [  342.190886]  mlx5e_poll_rx_cq+0xc6/0x930
>>>> [  342.190888]  mlx5e_napi_poll+0xab/0xc90
>>>
>>>
>>> We got exactly the same backtrace in our data center. However,
>>> it is not easy for us to reproduce it, do you have any clue to reproduce it?
>>>
>>> If you do, try to tcpdump the packets triggering this warning, it could
>>> be useful for debugging.
>>>
>>> Also, we tried to apply commit d55bef5059dd057bd, the warning _still_
>>> occurs. We tried to revert the offending commit 88078d98d1bb, it
>>> disappears. So it is likely that commit 88078d98d1bb introduces
>>> more troubles than the one fixed by d55bef5059dd057bd.
>>>
>>
>> Or this could be that mlx5 driver is buggy when dealing with VLAN tags.
>>
>> It both uses vlan_tci (hardware vlan offload) in skb _and_ this piece of 
>> code in mlx5e_handle_csum()
>>
>>               if (network_depth > ETH_HLEN)
>>                       /* CQE csum is calculated from the IP header and does
>>                        * not cover VLAN headers (if present). This will add
>>                        * the checksum manually.
>>                        */
>>                       skb->csum = csum_partial(skb->data + ETH_HLEN,
>>                                                network_depth - ETH_HLEN,
>>                                                skb->csum);
>>
>>
>> That seems strange to me, because skb_vlan_untag() will not adjust skb->csum 
>> in this case.
>>
>
> Bug might be in NETIF_F_RXFCS mlx5 handling btw...
>
> Code does :
>
> if (unlikely(netdev->features & NETIF_F_RXFCS))
>      skb->csum = csum_add(skb->csum,
>                           (__force __wsum)mlx5e_get_fcs(skb));
>
> But Dimitris told us that we need to take into account if FCS starts at odd 
> or even offset.
>
> ->
> if (unlikely(netdev->features & NETIF_F_RXFCS))
>      skb->csum = csum_block_add(skb->csum,
>                                 (__force __wsum)mlx5e_get_fcs(skb),
>                                 skb->len);
>

Indeed this is a bug. I would expect it to produce frequent errors,
though, as many odd-length packets would trigger it. Do you have RXFCS
enabled? Regardless, how frequently do you see the problem?
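
To illustrate why the offset matters, here is a minimal userspace sketch
(not the kernel's csum_add()/csum_block_add() code; fold16, sum16 and
sum16_block_add are hypothetical names used only for this example). When
the appended block starts at an odd byte offset, its bytes fall into the
opposite halves of the 16-bit words, so its partial sum has to be
byte-swapped before it is added:

```c
#include <stddef.h>
#include <stdint.h>

/* Fold a 32-bit accumulator into 16 bits with end-around carry. */
static uint16_t fold16(uint32_t sum)
{
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* Ones'-complement sum of buf taken as big-endian 16-bit words,
 * zero-padded at the end when len is odd.
 */
static uint16_t sum16(const uint8_t *buf, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum = fold16(sum + (((uint32_t)buf[i] << 8) | buf[i + 1]));
	if (i < len)
		sum = fold16(sum + ((uint32_t)buf[i] << 8));
	return (uint16_t)sum;
}

/* Add the partial sum of a block that starts at byte 'offset' of the
 * overall buffer. An odd offset requires swapping the block's bytes.
 */
static uint16_t sum16_block_add(uint16_t sum, uint16_t block, size_t offset)
{
	if (offset & 1)
		block = (uint16_t)((block << 8) | (block >> 8));
	return fold16((uint32_t)sum + block);
}
```

With a 7-byte payload followed by a 4-byte FCS, adding the two partial
sums directly generally gives a different value than summing all 11
bytes in one pass, while sum16_block_add() with offset 7 matches it.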

There is some other questionable code in the driver's RXFCS implementation.
Code like

                return *(__be32 *)(skb->data + skb->len - ETH_FCS_LEN);

doesn't work on processors with strict alignment requirements, because
skb->data + skb->len - ETH_FCS_LEN is not guaranteed to be 4-byte aligned.
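
A minimal userspace sketch of an alignment-safe read, assuming a plain
byte buffer rather than an skb (FCS_LEN, read_fcs_raw and read_fcs_be
are hypothetical names for this example). In the kernel the
get_unaligned() family of helpers would be the usual tool for this:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FCS_LEN 4 /* stand-in for the kernel's ETH_FCS_LEN */

/* Return the trailing 4 bytes in memory order, like the __be32 load in
 * the driver, but without dereferencing a possibly misaligned pointer.
 * The caller must guarantee len >= FCS_LEN.
 */
static uint32_t read_fcs_raw(const uint8_t *data, size_t len)
{
	uint32_t fcs;

	memcpy(&fcs, data + len - FCS_LEN, sizeof(fcs));
	return fcs;
}

/* Same bytes, interpreted as a big-endian value in host order. */
static uint32_t read_fcs_be(const uint8_t *data, size_t len)
{
	const uint8_t *p = data + len - FCS_LEN;

	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}
```

Compilers typically turn the memcpy() into a single load on architectures
that allow unaligned access, so this should cost nothing there.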
