Re: Latest net-next kernel 4.19.0+

2018-11-08 Thread Cong Wang
On Thu, Nov 1, 2018 at 3:59 PM Paweł Staszewski  wrote:
>
>
>
> On 31.10.2018 at 22:17, Cong Wang wrote:
> > On Wed, Oct 31, 2018 at 2:05 PM Saeed Mahameed  wrote:
> >> Cong, How often does this happen ? can you some how verify if the
> >> problematic packet has extra end padding after the ip payload ?
> > For us, we need 10+ hours to get one warning. This is also
> > why we never capture the packet that causes this warning.
> >
> >
> >> It would be cool if we had a feature in kernel to store such SKB in
> >> memory when such issue occurs, and let the user dump it later (via
> >> tcpdump) and send the dump to the vendor for debug so we could just
> >> replay and see what happens.
> >>
> > Yeah, the warning kinda sucks, it tells almost nothing, the SKB
> > should be dumped up on this warning.
> >
>
> So another vlan and same hw csum - this time this vlan have less traffic
> so i catch traffic with tcpdump
> Nov  1 23:46:22 kernel: vlan2805: hw csum failure
> but the problem is there is about 1986 frames in that second
> Will tcpdump output helps ?

Looks like you don't have any IP fragments.

Did you try Eric's debugging patch? Does it make a difference?

Also, if doable, can you try to remove vlan from your setup to see if
the warning will be gone?

Thanks!


Re: Latest net-next kernel 4.19.0+

2018-10-31 Thread Paweł Staszewski




On 30.10.2018 at 15:16, Eric Dumazet wrote:
>
> On 10/30/2018 01:09 AM, Paweł Staszewski wrote:
>>
>> On 30.10.2018 at 08:29, Eric Dumazet wrote:
>>>
>>> On 10/29/2018 11:09 PM, Dimitris Michailidis wrote:
>>>
>>>> Indeed this is a bug. I would expect it to produce frequent errors
>>>> though as many odd-length
>>>> packets would trigger it. Do you have RXFCS? Regardless, how
>>>> frequently do you see the problem?
>>>>
>>> Old kernels (before 88078d98d1bb) were simply resetting ip_summed to
>>> CHECKSUM_NONE
>>>
>>> And before your fix (commit d55bef5059dd057bd), mlx5 bug was canceling the
>>> bug you fixed.
>>>
>>> So we now need to also fix mlx5.
>>>
>>> And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned
>>> earlier,
>>> plus __get_unaligned_cpu32() as you hinted.
>>>
>>
>> No RXFCS
>>
>> And this trace is rly frequently like once per 3/4 seconds
>> like below:
>> [28965.776864] vlan1490: hw csum failure
>
> Might be vlan related.
>
> Can you first check this :
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index 94224c22ecc310a87b6715051e335446f29bec03..6f4bfebf0d9a3ae7567062abb3ea6532b3aaf3d6 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -789,13 +789,8 @@ static inline void mlx5e_handle_csum(struct net_device *netdev,
>  		skb->ip_summed = CHECKSUM_COMPLETE;
>  		skb->csum = csum_unfold((__force __sum16)cqe->check_sum);
>  		if (network_depth > ETH_HLEN)
> -			/* CQE csum is calculated from the IP header and does
> -			 * not cover VLAN headers (if present). This will add
> -			 * the checksum manually.
> -			 */
> -			skb->csum = csum_partial(skb->data + ETH_HLEN,
> -						 network_depth - ETH_HLEN,
> -						 skb->csum);
> +			/* Temporary debugging */
> +			skb->ip_summed = CHECKSUM_NONE;
>  		if (unlikely(netdev->features & NETIF_F_RXFCS))
>  			skb->csum = csum_add(skb->csum,
>  					     (__force __wsum)mlx5e_get_fcs(skb));
>

Ok thanks - will try it.




Re: Latest net-next kernel 4.19.0+

2018-10-31 Thread Paweł Staszewski




On 31.10.2018 at 22:05, Saeed Mahameed wrote:

On Tue, 2018-10-30 at 10:32 -0700, Cong Wang wrote:

On Tue, Oct 30, 2018 at 7:16 AM Eric Dumazet 
wrote:



On 10/30/2018 01:09 AM, Paweł Staszewski wrote:


On 30.10.2018 at 08:29, Eric Dumazet wrote:

On 10/29/2018 11:09 PM, Dimitris Michailidis wrote:


Indeed this is a bug. I would expect it to produce frequent
errors
though as many odd-length
packets would trigger it. Do you have RXFCS? Regardless, how
frequently do you see the problem?


Old kernels (before 88078d98d1bb) were simply resetting
ip_summed to CHECKSUM_NONE

And before your fix (commit d55bef5059dd057bd), mlx5 bug was
canceling the bug you fixed.

So we now need to also fix mlx5.

And of course use skb_header_pointer() in mlx5e_get_fcs() as I
mentioned earlier,
plus __get_unaligned_cpu32() as you hinted.





No RXFCS


Same with Pawel, RXFCS is disabled by default.



And this trace is rly frequently like once per 3/4 seconds
like below:
[28965.776864] vlan1490: hw csum failure

Might be vlan related.

Hi Pawel, is the vlan stripping offload disabled or enabled in your
case ?

To verify:
ethtool -k  | grep rx-vlan-offload
rx-vlan-offload: on
To set:
ethtool -K  rxvlan on/off

Enabled:
ethtool -k enp175s0f0
Features for enp175s0f0:
rx-checksumming: on
tx-checksumming: on
    tx-checksum-ipv4: on
    tx-checksum-ip-generic: off [fixed]
    tx-checksum-ipv6: on
    tx-checksum-fcoe-crc: off [fixed]
    tx-checksum-sctp: off [fixed]
scatter-gather: on
    tx-scatter-gather: on
    tx-scatter-gather-fraglist: off [fixed]
tcp-segmentation-offload: on
    tx-tcp-segmentation: on
    tx-tcp-ecn-segmentation: off [fixed]
    tx-tcp-mangleid-segmentation: off
    tx-tcp6-segmentation: on
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off [fixed]
rx-vlan-offload: on
tx-vlan-offload: on
ntuple-filters: off
receive-hashing: on
highdma: on [fixed]
rx-vlan-filter: on
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
netns-local: off [fixed]
tx-gso-robust: off [fixed]
tx-fcoe-segmentation: off [fixed]
tx-gre-segmentation: on
tx-gre-csum-segmentation: on
tx-ipxip4-segmentation: off [fixed]
tx-ipxip6-segmentation: off [fixed]
tx-udp_tnl-segmentation: on
tx-udp_tnl-csum-segmentation: on
tx-gso-partial: on
tx-sctp-segmentation: off [fixed]
tx-esp-segmentation: off [fixed]
tx-udp-segmentation: on
fcoe-mtu: off [fixed]
tx-nocache-copy: off
loopback: off [fixed]
rx-fcs: off
rx-all: off
tx-vlan-stag-hw-insert: on
rx-vlan-stag-hw-parse: off [fixed]
rx-vlan-stag-filter: on [fixed]
l2-fwd-offload: off [fixed]
hw-tc-offload: on
esp-hw-offload: off [fixed]
esp-tx-csum-hw-offload: off [fixed]
rx-udp_tunnel-port-offload: on
tls-hw-tx-offload: off [fixed]
tls-hw-rx-offload: off [fixed]
rx-gro-hw: off [fixed]
tls-hw-record: off [fixed]




if the vlan offload is off then it will trigger the mlx5e vlan csum
adjustment code pointed out by Eric.

Anyhow, it should work in both cases, but i am trying to narrow down
the possibilities.

Also could it be a double tagged packet ?

no double tagged packets there






Unlike Pawel's case, we don't use vlan at all, maybe this is why we
see
it much less frequently than Pawel.

Also, it is probably not specific to mlx5, as there is another report
which
is probably a non-mlx5 driver.


Cong, How often does this happen ? can you some how verify if the
problematic packet has extra end padding after the ip payload ?

It would be cool if we had a feature in kernel to store such SKB in
memory when such issue occurs, and let the user dump it later (via
tcpdump) and send the dump to the vendor for debug so we could just
replay and see what happens.


Thanks.




Re: Latest net-next kernel 4.19.0+

2018-10-31 Thread Cong Wang
On Wed, Oct 31, 2018 at 2:05 PM Saeed Mahameed  wrote:
>
> Cong, How often does this happen ? can you some how verify if the
> problematic packet has extra end padding after the ip payload ?

For us, we need 10+ hours to get one warning. This is also
why we never capture the packet that causes this warning.


>
> It would be cool if we had a feature in kernel to store such SKB in
> memory when such issue occurs, and let the user dump it later (via
> tcpdump) and send the dump to the vendor for debug so we could just
> replay and see what happens.
>

Yeah, the warning kinda sucks: it tells us almost nothing. The SKB
should be dumped upon this warning.


Re: Latest net-next kernel 4.19.0+

2018-10-31 Thread Saeed Mahameed
On Tue, 2018-10-30 at 10:32 -0700, Cong Wang wrote:
> On Tue, Oct 30, 2018 at 7:16 AM Eric Dumazet 
> wrote:
> > 
> > 
> > 
> > On 10/30/2018 01:09 AM, Paweł Staszewski wrote:
> > > 
> > > 
> > > > On 30.10.2018 at 08:29, Eric Dumazet wrote:
> > > > 
> > > > On 10/29/2018 11:09 PM, Dimitris Michailidis wrote:
> > > > 
> > > > > Indeed this is a bug. I would expect it to produce frequent
> > > > > errors
> > > > > though as many odd-length
> > > > > packets would trigger it. Do you have RXFCS? Regardless, how
> > > > > frequently do you see the problem?
> > > > > 
> > > > 
> > > > Old kernels (before 88078d98d1bb) were simply resetting
> > > > ip_summed to CHECKSUM_NONE
> > > > 
> > > > And before your fix (commit d55bef5059dd057bd), mlx5 bug was
> > > > canceling the bug you fixed.
> > > > 
> > > > So we now need to also fix mlx5.
> > > > 
> > > > And of course use skb_header_pointer() in mlx5e_get_fcs() as I
> > > > mentioned earlier,
> > > > plus __get_unaligned_cpu32() as you hinted.
> > > > 
> > > > 
> > > > 
> > > > 
> > > 
> > > No RXFCS
> 
> 
> Same with Pawel, RXFCS is disabled by default.
> 
> 
> > > 
> > > And this trace is rly frequently like once per 3/4 seconds
> > > like below:
> > > [28965.776864] vlan1490: hw csum failure
> > 
> > Might be vlan related.
> 

Hi Pawel, is the vlan stripping offload disabled or enabled in your
case ? 

To verify:
ethtool -k <dev> | grep rx-vlan-offload
rx-vlan-offload: on
To set:
ethtool -K <dev> rxvlan on|off

if the vlan offload is off then it will trigger the mlx5e vlan csum
adjustment code pointed out by Eric.

Anyhow, it should work in both cases, but i am trying to narrow down
the possibilities. 

Also could it be a double tagged packet ?


> Unlike Pawel's case, we don't use vlan at all, maybe this is why we
> see
> it much less frequently than Pawel.
> 
> Also, it is probably not specific to mlx5, as there is another report
> which
> is probably a non-mlx5 driver.
> 

Cong, how often does this happen? Can you somehow verify whether the
problematic packet has extra end padding after the IP payload?

It would be cool if we had a feature in kernel to store such SKB in
memory when such issue occurs, and let the user dump it later (via
tcpdump) and send the dump to the vendor for debug so we could just
replay and see what happens.

> Thanks.


Re: Latest net-next kernel 4.19.0+

2018-10-30 Thread Cong Wang
On Tue, Oct 30, 2018 at 10:50 AM Eric Dumazet  wrote:
>
>
>
> On 10/30/2018 10:32 AM, Cong Wang wrote:
>
> > Unlike Pawel's case, we don't use vlan at all, maybe this is why we see
> > it much less frequently than Pawel.
> >
> > Also, it is probably not specific to mlx5, as there is another report which
> > is probably a non-mlx5 driver.
>
> Not sure if you provided a stack trace ?

I said it is the same as Pawel's. Here it is anyway:

[ 3731.075989] eth0: hw csum failure
[ 3731.079316] CPU: 15 PID: 0 Comm: swapper/15 Not tainted 4.14.74.x86_64 #1
[ 3731.086703] Hardware name: Wiwynn F4WW/Y 300-0284/F4WW MAIN BOARD, BIOS F4WWP02 10/19/2018
[ 3731.094961] Call Trace:
[ 3731.097408]  <IRQ>
[ 3731.099432]  dump_stack+0x46/0x59
[ 3731.102751]  __skb_checksum_complete+0xb8/0xd0
[ 3731.107194]  tcp_v4_rcv+0x116/0xa30
[ 3731.110688]  ip_local_deliver_finish+0x5d/0x1f0
[ 3731.115218]  ip_local_deliver+0x6b/0xe0
[ 3731.119056]  ? ip_rcv_finish+0x400/0x400
[ 3731.122973]  ip_rcv+0x287/0x360
[ 3731.126112]  ? inet_del_offload+0x40/0x40
[ 3731.130124]  __netif_receive_skb_core+0x404/0xc10
[ 3731.134831]  ? netif_receive_skb_internal+0x34/0xd0
[ 3731.139709]  netif_receive_skb_internal+0x34/0xd0
[ 3731.144415]  napi_gro_receive+0xb8/0xe0
[ 3731.148271]  mlx5e_handle_rx_cqe_mpwrq+0x4e3/0x7f0 [mlx5_core]
[ 3731.154099]  ? enqueue_entity+0x103/0x7f0
[ 3731.158114]  mlx5e_poll_rx_cq+0xba/0x850 [mlx5_core]
[ 3731.163080]  mlx5e_napi_poll+0x91/0x290 [mlx5_core]
[ 3731.167955]  net_rx_action+0x14a/0x3e0
[ 3731.171707]  ? credit_entropy_bits+0x23d/0x260
[ 3731.176153]  __do_softirq+0xe2/0x2c3
[ 3731.179734]  irq_exit+0xbc/0xd0
[ 3731.182878]  do_IRQ+0x89/0xd0
[ 3731.185851]  common_interrupt+0x7a/0x7a
[ 3731.189690]  </IRQ>
[ 3731.191799] RIP: 0010:cpuidle_enter_state+0xa6/0x2d0
[ 3731.196761] RSP: 0018:bb950c6f7eb0 EFLAGS: 0246 ORIG_RAX: ff60
[ 3731.204328] RAX: 9fe25fbe14c0 RBX: 0364b57553af RCX: 001f
[ 3731.211459] RDX: 20c49ba5e353f7cf RSI: 68294248f469 RDI: 
[ 3731.218583] RBP: db7d003c3300 R08: c3be R09: 8612
[ 3731.225709] R10: bb950c6f7e98 R11: c3be R12: 0003
[ 3731.232841] R13: 912c9d18 R14:  R15: 0364b396207a
[ 3731.239968]  do_idle+0x166/0x1a0
[ 3731.243199]  cpu_startup_entry+0x6f/0x80
[ 3731.247128]  start_secondary+0x19c/0x1f0
[ 3731.251052]  secondary_startup_64+0xa5/0xb0



>
> Have you tried IPv6 frags maybe ?
>

We have no IPv6 traffic. I asked people to try to generate IPv4 fragment
traffic to see if it would be more reproducible, no progress yet.


Re: Latest net-next kernel 4.19.0+

2018-10-30 Thread Eric Dumazet



On 10/30/2018 10:32 AM, Cong Wang wrote:

> Unlike Pawel's case, we don't use vlan at all, maybe this is why we see
> it much less frequently than Pawel.
> 
> Also, it is probably not specific to mlx5, as there is another report which
> is probably a non-mlx5 driver.

Not sure if you provided a stack trace ?

Have you tried IPv6 frags maybe ?



Re: Latest net-next kernel 4.19.0+

2018-10-30 Thread Cong Wang
On Tue, Oct 30, 2018 at 7:16 AM Eric Dumazet  wrote:
>
>
>
> On 10/30/2018 01:09 AM, Paweł Staszewski wrote:
> >
> >
> > On 30.10.2018 at 08:29, Eric Dumazet wrote:
> >>
> >> On 10/29/2018 11:09 PM, Dimitris Michailidis wrote:
> >>
> >>> Indeed this is a bug. I would expect it to produce frequent errors
> >>> though as many odd-length
> >>> packets would trigger it. Do you have RXFCS? Regardless, how
> >>> frequently do you see the problem?
> >>>
> >> Old kernels (before 88078d98d1bb) were simply resetting ip_summed to 
> >> CHECKSUM_NONE
> >>
> >> And before your fix (commit d55bef5059dd057bd), mlx5 bug was canceling the 
> >> bug you fixed.
> >>
> >> So we now need to also fix mlx5.
> >>
> >> And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned 
> >> earlier,
> >> plus __get_unaligned_cpu32() as you hinted.
> >>
> >>
> >>
> >>
> >
> > No RXFCS


Same with Pawel, RXFCS is disabled by default.


> >
> > And this trace is rly frequently like once per 3/4 seconds
> > like below:
> > [28965.776864] vlan1490: hw csum failure
>
> Might be vlan related.

Unlike Pawel's case, we don't use vlan at all, maybe this is why we see
it much less frequently than Pawel.

Also, it is probably not specific to mlx5, as there is another report which
is probably a non-mlx5 driver.

Thanks.


Re: Latest net-next kernel 4.19.0+

2018-10-30 Thread Eric Dumazet



On 10/30/2018 01:09 AM, Paweł Staszewski wrote:
> 
> 
> On 30.10.2018 at 08:29, Eric Dumazet wrote:
>>
>> On 10/29/2018 11:09 PM, Dimitris Michailidis wrote:
>>
>>> Indeed this is a bug. I would expect it to produce frequent errors
>>> though as many odd-length
>>> packets would trigger it. Do you have RXFCS? Regardless, how
>>> frequently do you see the problem?
>>>
>> Old kernels (before 88078d98d1bb) were simply resetting ip_summed to 
>> CHECKSUM_NONE
>>
>> And before your fix (commit d55bef5059dd057bd), mlx5 bug was canceling the 
>> bug you fixed.
>>
>> So we now need to also fix mlx5.
>>
>> And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned 
>> earlier,
>> plus __get_unaligned_cpu32() as you hinted.
>>
>>
>>
>>
> 
> No RXFCS
> 
> And this trace is rly frequently like once per 3/4 seconds
> like below:
> [28965.776864] vlan1490: hw csum failure

Might be vlan related.

Can you first check this :

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 94224c22ecc310a87b6715051e335446f29bec03..6f4bfebf0d9a3ae7567062abb3ea6532b3aaf3d6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -789,13 +789,8 @@ static inline void mlx5e_handle_csum(struct net_device *netdev,
 		skb->ip_summed = CHECKSUM_COMPLETE;
 		skb->csum = csum_unfold((__force __sum16)cqe->check_sum);
 		if (network_depth > ETH_HLEN)
-			/* CQE csum is calculated from the IP header and does
-			 * not cover VLAN headers (if present). This will add
-			 * the checksum manually.
-			 */
-			skb->csum = csum_partial(skb->data + ETH_HLEN,
-						 network_depth - ETH_HLEN,
-						 skb->csum);
+			/* Temporary debugging */
+			skb->ip_summed = CHECKSUM_NONE;
 		if (unlikely(netdev->features & NETIF_F_RXFCS))
 			skb->csum = csum_add(skb->csum,
 					     (__force __wsum)mlx5e_get_fcs(skb));



Re: Latest net-next kernel 4.19.0+

2018-10-30 Thread Paweł Staszewski




On 30.10.2018 at 08:29, Eric Dumazet wrote:


On 10/29/2018 11:09 PM, Dimitris Michailidis wrote:


Indeed this is a bug. I would expect it to produce frequent errors
though as many odd-length
packets would trigger it. Do you have RXFCS? Regardless, how
frequently do you see the problem?


Old kernels (before 88078d98d1bb) were simply resetting ip_summed to 
CHECKSUM_NONE

And before your fix (commit d55bef5059dd057bd), mlx5 bug was canceling the bug 
you fixed.

So we now need to also fix mlx5.

And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned 
earlier,
plus __get_unaligned_cpu32() as you hinted.






No RXFCS

And this trace is rly frequently like once per 3/4 seconds
like below:
[28965.776864] vlan1490: hw csum failure
[28965.776867] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.0+ #1
[28965.776868] Call Trace:
[28965.776870]  <IRQ>
[28965.776876]  dump_stack+0x46/0x5b
[28965.776879]  __skb_checksum_complete+0x9a/0xa0
[28965.776882]  tcp_v4_rcv+0xef/0x960
[28965.776884]  ip_local_deliver_finish+0x49/0xd0
[28965.776886]  ip_local_deliver+0x5e/0xe0
[28965.776888]  ? ip_sublist_rcv_finish+0x50/0x50
[28965.776889]  ip_rcv+0x41/0xc0
[28965.776891]  __netif_receive_skb_one_core+0x4b/0x70
[28965.776893]  netif_receive_skb_internal+0x2f/0xd0
[28965.776894]  napi_gro_receive+0xb7/0xe0
[28965.776897]  mlx5e_handle_rx_cqe+0x7a/0xd0
[28965.776899]  mlx5e_poll_rx_cq+0xc6/0x930
[28965.776900]  mlx5e_napi_poll+0xab/0xc90
[28965.776904]  ? kmem_cache_free_bulk+0x1e4/0x280
[28965.776905]  net_rx_action+0x1f1/0x320
[28965.776909]  __do_softirq+0xec/0x2b7
[28965.776912]  irq_exit+0x7b/0x80
[28965.776913]  do_IRQ+0x45/0xc0
[28965.776915]  common_interrupt+0xf/0xf
[28965.776916]  </IRQ>
[28965.776918] RIP: 0010:mwait_idle+0x5f/0x1b0
[28965.776919] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 
01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 
c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0
[28965.776920] RSP: 0018:82203e98 EFLAGS: 0246 ORIG_RAX: 
ffd3
[28965.776921] RAX:  RBX:  RCX: 

[28965.776922] RDX:  RSI:  RDI: 

[28965.776922] RBP:  R08: 00aa R09: 
88046f81fbc0
[28965.776923] R10:  R11: 0001006d5985 R12: 
8220f780
[28965.776924] R13: 8220f780 R14:  R15: 


[28965.776927]  do_idle+0x1a3/0x1c0
[28965.776929]  cpu_startup_entry+0x14/0x20
[28965.776932]  start_kernel+0x488/0x4a8
[28965.776935]  secondary_startup_64+0xa4/0xb0
[28965.981529] vlan1490: hw csum failure
[28965.981531] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.0+ #1
[28965.981532] Call Trace:
[28965.981534]  
[28965.981539]  dump_stack+0x46/0x5b
[28965.981543]  __skb_checksum_complete+0x9a/0xa0
[28965.981545]  tcp_v4_rcv+0xef/0x960
[28965.981548]  ip_local_deliver_finish+0x49/0xd0
[28965.981550]  ip_local_deliver+0x5e/0xe0
[28965.981551]  ? ip_sublist_rcv_finish+0x50/0x50
[28965.981552]  ip_rcv+0x41/0xc0
[28965.981555]  __netif_receive_skb_one_core+0x4b/0x70
[28965.981556]  netif_receive_skb_internal+0x2f/0xd0
[28965.981558]  napi_gro_receive+0xb7/0xe0
[28965.981560]  mlx5e_handle_rx_cqe+0x7a/0xd0
[28965.981562]  mlx5e_poll_rx_cq+0xc6/0x930
[28965.981563]  mlx5e_napi_poll+0xab/0xc90
[28965.981567]  ? kmem_cache_free_bulk+0x1e4/0x280
[28965.981568]  net_rx_action+0x1f1/0x320
[28965.981571]  __do_softirq+0xec/0x2b7
[28965.981575]  irq_exit+0x7b/0x80
[28965.981576]  do_IRQ+0x45/0xc0
[28965.981578]  common_interrupt+0xf/0xf
[28965.981579]  
[28965.981580] RIP: 0010:mwait_idle+0x5f/0x1b0
[28965.981582] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 
01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 
c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0
[28965.981583] RSP: 0018:82203e98 EFLAGS: 0246 ORIG_RAX: 
ffd3
[28965.981584] RAX:  RBX:  RCX: 

[28965.981585] RDX:  RSI:  RDI: 

[28965.981586] RBP:  R08: 0383 R09: 
88046f81fbc0
[28965.981586] R10:  R11: 0001006d59b8 R12: 
8220f780
[28965.981587] R13: 8220f780 R14:  R15: 


[28965.981591]  do_idle+0x1a3/0x1c0
[28965.981592]  cpu_startup_entry+0x14/0x20
[28965.981596]  start_kernel+0x488/0x4a8
[28965.981600]  secondary_startup_64+0xa4/0xb0
[28966.511782] vlan1490: hw csum failure
[28966.511785] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.19.0+ #1
[28966.511785] Call Trace:
[28966.511787]  
[28966.511793]  dump_stack+0x46/0x5b
[28966.511797]  __skb_checksum_complete+0x9a/0xa0
[28966.511799]  tcp_v4_rcv+0xef/0x960
[28966.511802]  ip_local_deliver_finish+0x49/0xd0
[28966.511804]  ip_local_deliver+0x5e/0xe0
[28966.511806]  ? ip_sublist_rcv_finish+0x50/0x50

Re: Latest net-next kernel 4.19.0+

2018-10-30 Thread Eric Dumazet



On 10/29/2018 11:09 PM, Dimitris Michailidis wrote:

> 
> Indeed this is a bug. I would expect it to produce frequent errors
> though as many odd-length
> packets would trigger it. Do you have RXFCS? Regardless, how
> frequently do you see the problem?
> 

Old kernels (before 88078d98d1bb) were simply resetting ip_summed to 
CHECKSUM_NONE

And before your fix (commit d55bef5059dd057bd), mlx5 bug was canceling the bug 
you fixed.

So we now need to also fix mlx5.

And of course use skb_header_pointer() in mlx5e_get_fcs() as I mentioned 
earlier,
plus __get_unaligned_cpu32() as you hinted.





Re: Latest net-next kernel 4.19.0+

2018-10-30 Thread Dimitris Michailidis
On Mon, Oct 29, 2018 at 8:52 PM, Eric Dumazet  wrote:
>
>
> On 10/29/2018 07:53 PM, Eric Dumazet wrote:
>>
>>
>> On 10/29/2018 07:27 PM, Cong Wang wrote:
>>> Hi,
>>>
>>> On Mon, Oct 29, 2018 at 5:19 PM Paweł Staszewski  
>>> wrote:

>>>> Sorry not complete - followed by hw csum:
>>>>
>>>> [  342.190831] vlan1490: hw csum failure
>>>> [  342.190835] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1
>>>> [  342.190836] Call Trace:
>>>> [  342.190839]
>>>> [  342.190849]  dump_stack+0x46/0x5b
>>>> [  342.190856]  __skb_checksum_complete+0x9a/0xa0
>>>> [  342.190859]  tcp_v4_rcv+0xef/0x960
>>>> [  342.190864]  ip_local_deliver_finish+0x49/0xd0
>>>> [  342.190866]  ip_local_deliver+0x5e/0xe0
>>>> [  342.190869]  ? ip_sublist_rcv_finish+0x50/0x50
>>>> [  342.190870]  ip_rcv+0x41/0xc0
>>>> [  342.190874]  __netif_receive_skb_one_core+0x4b/0x70
>>>> [  342.190877]  netif_receive_skb_internal+0x2f/0xd0
>>>> [  342.190879]  napi_gro_receive+0xb7/0xe0
>>>> [  342.190884]  mlx5e_handle_rx_cqe+0x7a/0xd0
>>>> [  342.190886]  mlx5e_poll_rx_cq+0xc6/0x930
>>>> [  342.190888]  mlx5e_napi_poll+0xab/0xc90
>>>
>>>
>>> We got exactly the same backtrace in our data center. However,
>>> it is not easy for us to reproduce it, do you have any clue to reproduce it?
>>>
>>> If you do, try to tcpdump the packets triggering this warning, it could
>>> be useful for debugging.
>>>
>>> Also, we tried to apply commit d55bef5059dd057bd, the warning _still_
>>> occurs. We tried to revert the offending commit 88078d98d1bb, it
>>> disappears. So it is likely that commit 88078d98d1bb introduces
>>> more troubles than the one fixed by d55bef5059dd057bd.
>>>
>>
>> Or this could be that mlx5 driver is buggy when dealing with VLAN tags.
>>
>> It both uses vlan_tci (hardware vlan offload) in skb _and_ this piece of 
>> code in mlx5e_handle_csum()
>>
>>   if (network_depth > ETH_HLEN)
>>   /* CQE csum is calculated from the IP header and does
>>* not cover VLAN headers (if present). This will add
>>* the checksum manually.
>>*/
>>   skb->csum = csum_partial(skb->data + ETH_HLEN,
>>network_depth - ETH_HLEN,
>>skb->csum);
>>
>>
>> That seems strange to me, because skb_vlan_untag() will not adjust skb->csum 
>> in this case.
>>
>
> Bug might be in NETIF_F_RXFCS mlx5 handling btw...
>
> Code does :
>
> if (unlikely(netdev->features & NETIF_F_RXFCS))
>  skb->csum = csum_add(skb->csum,
>   (__force __wsum)mlx5e_get_fcs(skb));
>
> But Dimitris told us that we need to take into account if FCS starts at odd 
> or even offset.
>
> ->
> if (unlikely(netdev->features & NETIF_F_RXFCS))
>  skb->csum = csum_block_add(skb->csum,
> (__force __wsum)mlx5e_get_fcs(skb),
> skb->len);
>

Indeed this is a bug. I would expect it to produce frequent errors
though as many odd-length
packets would trigger it. Do you have RXFCS? Regardless, how
frequently do you see the problem?

There is some other questionable code in the driver's RXFCS implementation.
Code like

return *(__be32 *)(skb->data + skb->len - ETH_FCS_LEN);

doesn't work on processors with alignment requirements.
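
To illustrate the alignment point, here is a minimal userspace sketch, not the driver's code. The helper name read_trailing_fcs is hypothetical, and the sketch assumes the frame is one contiguous buffer of at least 4 bytes. It copies the trailing 4 bytes with memcpy() instead of dereferencing a cast pointer, which is valid at any alignment; kernel code would normally reach for the get_unaligned() family of helpers for the same purpose.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: return the last 4 bytes of a contiguous frame.
 * memcpy() into a local variable is legal at any alignment, whereas
 * *(uint32_t *)(data + len - 4) may fault or misbehave on CPUs that
 * require aligned 32-bit loads. The caller must guarantee len >= 4.
 * The result keeps the in-memory byte order, just like the cast would.
 */
static uint32_t read_trailing_fcs(const uint8_t *data, size_t len)
{
	uint32_t fcs;

	memcpy(&fcs, data + len - sizeof(fcs), sizeof(fcs));
	return fcs;
}
```

With a 7-byte buffer the trailing word starts at offset 3, which is misaligned for a 32-bit load, yet the function still returns exactly those four bytes.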


Re: Latest net-next kernel 4.19.0+

2018-10-29 Thread Eric Dumazet



On 10/29/2018 07:53 PM, Eric Dumazet wrote:
> 
> 
> On 10/29/2018 07:27 PM, Cong Wang wrote:
>> Hi,
>>
>> On Mon, Oct 29, 2018 at 5:19 PM Paweł Staszewski  
>> wrote:
>>>
>>> Sorry not complete - followed by hw csum:
>>>
>>> [  342.190831] vlan1490: hw csum failure
>>> [  342.190835] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1
>>> [  342.190836] Call Trace:
>>> [  342.190839]  
>>> [  342.190849]  dump_stack+0x46/0x5b
>>> [  342.190856]  __skb_checksum_complete+0x9a/0xa0
>>> [  342.190859]  tcp_v4_rcv+0xef/0x960
>>> [  342.190864]  ip_local_deliver_finish+0x49/0xd0
>>> [  342.190866]  ip_local_deliver+0x5e/0xe0
>>> [  342.190869]  ? ip_sublist_rcv_finish+0x50/0x50
>>> [  342.190870]  ip_rcv+0x41/0xc0
>>> [  342.190874]  __netif_receive_skb_one_core+0x4b/0x70
>>> [  342.190877]  netif_receive_skb_internal+0x2f/0xd0
>>> [  342.190879]  napi_gro_receive+0xb7/0xe0
>>> [  342.190884]  mlx5e_handle_rx_cqe+0x7a/0xd0
>>> [  342.190886]  mlx5e_poll_rx_cq+0xc6/0x930
>>> [  342.190888]  mlx5e_napi_poll+0xab/0xc90
>>
>>
>> We got exactly the same backtrace in our data center. However,
>> it is not easy for us to reproduce it, do you have any clue to reproduce it?
>>
>> If you do, try to tcpdump the packets triggering this warning, it could
>> be useful for debugging.
>>
>> Also, we tried to apply commit d55bef5059dd057bd, the warning _still_
>> occurs. We tried to revert the offending commit 88078d98d1bb, it
>> disappears. So it is likely that commit 88078d98d1bb introduces
>> more troubles than the one fixed by d55bef5059dd057bd.
>>
> 
> Or this could be that mlx5 driver is buggy when dealing with VLAN tags.
> 
> It both uses vlan_tci (hardware vlan offload) in skb _and_ this piece of code 
> in mlx5e_handle_csum() 
> 
>   if (network_depth > ETH_HLEN)
>   /* CQE csum is calculated from the IP header and does
>* not cover VLAN headers (if present). This will add
>* the checksum manually.
>*/
>   skb->csum = csum_partial(skb->data + ETH_HLEN,
>network_depth - ETH_HLEN,
>skb->csum);
> 
> 
> That seems strange to me, because skb_vlan_untag() will not adjust skb->csum 
> in this case.
> 

Bug might be in NETIF_F_RXFCS mlx5 handling btw...

Code does :

if (unlikely(netdev->features & NETIF_F_RXFCS))
 skb->csum = csum_add(skb->csum,
  (__force __wsum)mlx5e_get_fcs(skb));

But Dimitris told us that we need to take into account if FCS starts at odd or 
even offset.

->
if (unlikely(netdev->features & NETIF_F_RXFCS))
 skb->csum = csum_block_add(skb->csum,
(__force __wsum)mlx5e_get_fcs(skb),
skb->len);
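
The odd/even offset issue can be reproduced in a small userspace model. This is only a sketch under stated assumptions: csum16, csum16_add and csum16_block_add are hypothetical 16-bit stand-ins for the kernel's 32-bit csum_partial(), csum_add() and csum_block_add(), and they pair bytes in big-endian order. The point it shows is that a partial sum computed over a block that starts at an odd offset must have its two bytes swapped before it is added to the running sum.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement 16-bit sum of a block, computed as if the block
 * started at an even offset; an odd trailing byte is padded with zero. */
static uint16_t csum16(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	if (len & 1)
		sum += (uint32_t)p[len - 1] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* One's-complement addition with end-around carry. */
static uint16_t csum16_add(uint16_t a, uint16_t b)
{
	uint32_t s = (uint32_t)a + b;

	s = (s & 0xffff) + (s >> 16);
	return (uint16_t)s;
}

/* Add the sum of a block located at byte 'offset' of the packet.
 * At an odd offset every byte of the block lands in the other half
 * of its 16-bit word, so the block's sum is byte-swapped first. */
static uint16_t csum16_block_add(uint16_t a, uint16_t b, size_t offset)
{
	if (offset & 1)
		b = (uint16_t)((b << 8) | (b >> 8));
	return csum16_add(a, b);
}
```

For the 7 bytes 01 02 03 04 05 06 07, split into a 3-byte head and a 4-byte tail, the sum over the whole buffer is 0x100c. Combining the two parts with csum16_block_add(head, tail, 3) also gives 0x100c, while a plain csum16_add(head, tail) gives 0x0e0e, which is the kind of mismatch an odd-length packet would produce.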



Re: Latest net-next kernel 4.19.0+

2018-10-29 Thread Eric Dumazet



On 10/29/2018 07:27 PM, Cong Wang wrote:
> Hi,
> 
> On Mon, Oct 29, 2018 at 5:19 PM Paweł Staszewski  
> wrote:
>>
>> Sorry not complete - followed by hw csum:
>>
>> [  342.190831] vlan1490: hw csum failure
>> [  342.190835] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1
>> [  342.190836] Call Trace:
>> [  342.190839]  
>> [  342.190849]  dump_stack+0x46/0x5b
>> [  342.190856]  __skb_checksum_complete+0x9a/0xa0
>> [  342.190859]  tcp_v4_rcv+0xef/0x960
>> [  342.190864]  ip_local_deliver_finish+0x49/0xd0
>> [  342.190866]  ip_local_deliver+0x5e/0xe0
>> [  342.190869]  ? ip_sublist_rcv_finish+0x50/0x50
>> [  342.190870]  ip_rcv+0x41/0xc0
>> [  342.190874]  __netif_receive_skb_one_core+0x4b/0x70
>> [  342.190877]  netif_receive_skb_internal+0x2f/0xd0
>> [  342.190879]  napi_gro_receive+0xb7/0xe0
>> [  342.190884]  mlx5e_handle_rx_cqe+0x7a/0xd0
>> [  342.190886]  mlx5e_poll_rx_cq+0xc6/0x930
>> [  342.190888]  mlx5e_napi_poll+0xab/0xc90
> 
> 
> We got exactly the same backtrace in our data center. However,
> it is not easy for us to reproduce it, do you have any clue to reproduce it?
> 
> If you do, try to tcpdump the packets triggering this warning, it could
> be useful for debugging.
> 
> Also, we tried to apply commit d55bef5059dd057bd, the warning _still_
> occurs. We tried to revert the offending commit 88078d98d1bb, it
> disappears. So it is likely that commit 88078d98d1bb introduces
> more troubles than the one fixed by d55bef5059dd057bd.
> 

Or this could be that mlx5 driver is buggy when dealing with VLAN tags.

It both uses vlan_tci (hardware vlan offload) in skb _and_ this piece of code 
in mlx5e_handle_csum() 

if (network_depth > ETH_HLEN)
/* CQE csum is calculated from the IP header and does
 * not cover VLAN headers (if present). This will add
 * the checksum manually.
 */
skb->csum = csum_partial(skb->data + ETH_HLEN,
 network_depth - ETH_HLEN,
 skb->csum);


That seems strange to me, because skb_vlan_untag() will not adjust skb->csum in 
this case.
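
For readers following along, the adjustment described in that driver comment can be modelled in userspace. This is a rough sketch, not the mlx5 code: csum16, csum16_add and extend_over_vlan are hypothetical names, and the model uses 16-bit big-endian sums instead of the kernel's 32-bit __wsum. It only shows that a sum starting at the IP header can be extended to cover the 4-byte VLAN tag in front of it by adding the tag's own sum. It does not settle whether that adjustment is correct once the tag has been stripped by hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement 16-bit sum of a block that starts at an even offset. */
static uint16_t csum16(const uint8_t *p, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)p[i] << 8) | p[i + 1];
	if (len & 1)
		sum += (uint32_t)p[len - 1] << 8;
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);
	return (uint16_t)sum;
}

/* One's-complement addition with end-around carry. */
static uint16_t csum16_add(uint16_t a, uint16_t b)
{
	uint32_t s = (uint32_t)a + b;

	s = (s & 0xffff) + (s >> 16);
	return (uint16_t)s;
}

/*
 * Hypothetical model of the adjustment: 'sum_from_ip' covers the packet
 * from the IP header onward, 'after_eth' points just past the Ethernet
 * header, and 'tag_len' bytes of VLAN tag sit between the two.
 * Because the tag has an even length, a plain add is sufficient.
 */
static uint16_t extend_over_vlan(uint16_t sum_from_ip,
				 const uint8_t *after_eth, size_t tag_len)
{
	return csum16_add(sum_from_ip, csum16(after_eth, tag_len));
}
```

With the bytes 05 d2 08 00 (VLAN ID 1490, inner EtherType IPv4) followed by 45 00 00 1c, the sum from the IP bytes alone is 0x451c, and extending it over the tag gives 0x52ee, the same value as summing all eight bytes directly.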



Re: Latest net-next kernel 4.19.0+

2018-10-29 Thread Cong Wang
(Adding Eric and Dimitris into Cc)

On Mon, Oct 29, 2018 at 7:27 PM Cong Wang  wrote:
>
> Hi,
>
> On Mon, Oct 29, 2018 at 5:19 PM Paweł Staszewski  
> wrote:
> >
> > Sorry not complete - followed by hw csum:
> >
> > [  342.190831] vlan1490: hw csum failure
> > [  342.190835] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1
> > [  342.190836] Call Trace:
> > [  342.190839]  
> > [  342.190849]  dump_stack+0x46/0x5b
> > [  342.190856]  __skb_checksum_complete+0x9a/0xa0
> > [  342.190859]  tcp_v4_rcv+0xef/0x960
> > [  342.190864]  ip_local_deliver_finish+0x49/0xd0
> > [  342.190866]  ip_local_deliver+0x5e/0xe0
> > [  342.190869]  ? ip_sublist_rcv_finish+0x50/0x50
> > [  342.190870]  ip_rcv+0x41/0xc0
> > [  342.190874]  __netif_receive_skb_one_core+0x4b/0x70
> > [  342.190877]  netif_receive_skb_internal+0x2f/0xd0
> > [  342.190879]  napi_gro_receive+0xb7/0xe0
> > [  342.190884]  mlx5e_handle_rx_cqe+0x7a/0xd0
> > [  342.190886]  mlx5e_poll_rx_cq+0xc6/0x930
> > [  342.190888]  mlx5e_napi_poll+0xab/0xc90
>
>
> We got exactly the same backtrace in our data center. However,
> it is not easy for us to reproduce it, do you have any clue to reproduce it?
>
> If you do, try to tcpdump the packets triggering this warning, it could
> be useful for debugging.
>
> Also, we tried to apply commit d55bef5059dd057bd, the warning _still_
> occurs. We tried to revert the offending commit 88078d98d1bb, it
> disappears. So it is likely that commit 88078d98d1bb introduces
> more troubles than the one fixed by d55bef5059dd057bd.


Re: Latest net-next kernel 4.19.0+

2018-10-29 Thread Cong Wang
Hi,

On Mon, Oct 29, 2018 at 5:19 PM Paweł Staszewski  wrote:
>
> Sorry not complete - followed by hw csum:
>
> [  342.190831] vlan1490: hw csum failure
> [  342.190835] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1
> [  342.190836] Call Trace:
> [  342.190839]  
> [  342.190849]  dump_stack+0x46/0x5b
> [  342.190856]  __skb_checksum_complete+0x9a/0xa0
> [  342.190859]  tcp_v4_rcv+0xef/0x960
> [  342.190864]  ip_local_deliver_finish+0x49/0xd0
> [  342.190866]  ip_local_deliver+0x5e/0xe0
> [  342.190869]  ? ip_sublist_rcv_finish+0x50/0x50
> [  342.190870]  ip_rcv+0x41/0xc0
> [  342.190874]  __netif_receive_skb_one_core+0x4b/0x70
> [  342.190877]  netif_receive_skb_internal+0x2f/0xd0
> [  342.190879]  napi_gro_receive+0xb7/0xe0
> [  342.190884]  mlx5e_handle_rx_cqe+0x7a/0xd0
> [  342.190886]  mlx5e_poll_rx_cq+0xc6/0x930
> [  342.190888]  mlx5e_napi_poll+0xab/0xc90


We got exactly the same backtrace in our data center. However,
it is not easy for us to reproduce. Do you have any clue how to reproduce it?

If you do, try to tcpdump the packets triggering this warning; it could
be useful for debugging.
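
As a possible aid for lining a capture up with the warning, the sketch below pulls the timestamp and device name out of each "hw csum failure" line in saved kernel log text. The function name `find_csum_failures` and the sample input are illustrative assumptions, not part of any existing tool. The bracketed value is the kernel's seconds-since-boot timestamp, so it still has to be mapped to wall-clock time before it can be compared with tcpdump output.

```python
import re

# Matches lines such as "[  342.190831] vlan1490: hw csum failure".
# A leading "> " quote prefix is tolerated because search() scans the line.
CSUM_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+(\S+): hw csum failure")


def find_csum_failures(log_text):
    """Return a list of (seconds_since_boot, device) for each warning line."""
    hits = []
    for line in log_text.splitlines():
        match = CSUM_RE.search(line)
        if match:
            hits.append((float(match.group(1)), match.group(2)))
    return hits


# Sample input copied from the trace in this thread.
sample = (
    "[  342.190831] vlan1490: hw csum failure\n"
    "[  342.190835] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1\n"
)

for seconds, device in find_csum_failures(sample):
    print(f"{device}: warning at {seconds:.6f}s since boot")
```

With several warnings in one log, the returned list gives the time window and the vlan device on which to focus the capture.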

Also, we tried applying commit d55bef5059dd057bd, and the warning _still_
occurs. When we reverted the offending commit 88078d98d1bb, it
disappeared. So it is likely that commit 88078d98d1bb introduces
more trouble than the one problem fixed by d55bef5059dd057bd.
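
For context on why trailing bytes can matter here: elsewhere in this thread, extra padding after the IP payload is raised as a suspect. The sketch below only illustrates the arithmetic and is not kernel or driver code. It computes a 16-bit one's-complement sum (the RFC 1071 style sum used by IP, TCP and UDP) and shows that a sum taken over payload plus padding equals the payload-only sum when the padding is all zeros, but can differ when the padding carries non-zero bytes. The helper name and the byte values are made up for the example.

```python
def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum of data (RFC 1071 style, not inverted)."""
    if len(data) % 2:
        data += b"\x00"  # an odd-length buffer is summed as if zero-padded
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:  # fold any carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total


payload = b"\x01\x02\x03\x04"   # hypothetical packet bytes
zero_pad = b"\x00\x00"          # hypothetical all-zero trailer
dirty_pad = b"\xaa\xbb"         # hypothetical non-zero trailer

print(hex(ones_complement_sum(payload)))
print(hex(ones_complement_sum(payload + zero_pad)))
print(hex(ones_complement_sum(payload + dirty_pad)))
```

A sum reported over the whole frame therefore has to be corrected for any non-zero trailing bytes before it is compared with a sum over the IP packet alone.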


Re: Latest net-next kernel 4.19.0+

2018-10-29 Thread Paweł Staszewski

On 30.10.2018 at 01:11, Paweł Staszewski wrote:

Sorry, that was not complete - it was followed by a hw csum failure:

[  342.190831] vlan1490: hw csum failure
[  342.190835] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1
[  342.190836] Call Trace:
[  342.190839]  <IRQ>
[  342.190849]  dump_stack+0x46/0x5b
[  342.190856]  __skb_checksum_complete+0x9a/0xa0
[  342.190859]  tcp_v4_rcv+0xef/0x960
[  342.190864]  ip_local_deliver_finish+0x49/0xd0
[  342.190866]  ip_local_deliver+0x5e/0xe0
[  342.190869]  ? ip_sublist_rcv_finish+0x50/0x50
[  342.190870]  ip_rcv+0x41/0xc0
[  342.190874]  __netif_receive_skb_one_core+0x4b/0x70
[  342.190877]  netif_receive_skb_internal+0x2f/0xd0
[  342.190879]  napi_gro_receive+0xb7/0xe0
[  342.190884]  mlx5e_handle_rx_cqe+0x7a/0xd0
[  342.190886]  mlx5e_poll_rx_cq+0xc6/0x930
[  342.190888]  mlx5e_napi_poll+0xab/0xc90
[  342.190893]  ? kmem_cache_free_bulk+0x1e4/0x280
[  342.190895]  net_rx_action+0x1f1/0x320
[  342.190901]  __do_softirq+0xec/0x2b7
[  342.190908]  irq_exit+0x7b/0x80
[  342.190910]  do_IRQ+0x45/0xc0
[  342.190912]  common_interrupt+0xf/0xf
[  342.190914]  </IRQ>
[  342.190916] RIP: 0010:mwait_idle+0x5f/0x1b0
[  342.190917] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0
[  342.190918] RSP: 0018:c900034e7eb8 EFLAGS: 0246 ORIG_RAX: ffdd
[  342.190920] RAX:  RBX: 0034 RCX:
[  342.190921] RDX:  RSI:  RDI:
[  342.190922] RBP: 0034 R08: 0057 R09: 88086fa1fbc0
[  342.190923] R10:  R11: 000128cc R12: 88086d18
[  342.190923] R13: 88086d18 R14:  R15:
[  342.190929]  do_idle+0x1a3/0x1c0
[  342.190931]  cpu_startup_entry+0x14/0x20
[  342.190934]  start_secondary+0x165/0x190
[  342.190939]  secondary_startup_64+0xa4/0xb0
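
For readers who want to compare traces like the one above mechanically, here is a small sketch that splits each call-trace entry into its parts. Each entry has the form symbol+offset/size, and a leading "?" marks an address the unwinder found on the stack but could not confirm as part of the call chain. The function name `parse_frames` and the dictionary keys are assumptions made for this example.

```python
import re

# "[  342.190884]  mlx5e_handle_rx_cqe+0x7a/0xd0" -> symbol, offset, size.
# An optional "? " before the symbol marks an unreliable frame.
FRAME_RE = re.compile(
    r"\]\s+(\?\s+)?([A-Za-z_][\w.]*)\+0x([0-9a-f]+)/0x([0-9a-f]+)"
)


def parse_frames(trace_text):
    """Return one dict per call-trace entry found in trace_text."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frames.append({
                "symbol": match.group(2),
                "offset": int(match.group(3), 16),
                "size": int(match.group(4), 16),
                "reliable": match.group(1) is None,
            })
    return frames


# Sample lines copied from the trace in this thread.
sample_trace = (
    "[  342.190884]  mlx5e_handle_rx_cqe+0x7a/0xd0\n"
    "[  342.190893]  ? kmem_cache_free_bulk+0x1e4/0x280\n"
    "[  342.190916] RIP: 0010:mwait_idle+0x5f/0x1b0\n"
)

for frame in parse_frames(sample_trace):
    print(frame)
```

The RIP line is skipped on purpose, since it describes the interrupted context rather than a call-chain entry.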


On 30.10.2018 at 01:10, Paweł Staszewski wrote:

Hi


I just checked the latest kernel in the test lab and see weird traces:

[  219.888673] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1
[  219.888674] Call Trace:
[  219.888676]  <IRQ>
[  219.888685]  dump_stack+0x46/0x5b
[  219.888691]  __skb_checksum_complete+0x9a/0xa0
[  219.888694]  tcp_v4_rcv+0xef/0x960
[  219.888698]  ip_local_deliver_finish+0x49/0xd0
[  219.888700]  ip_local_deliver+0x5e/0xe0
[  219.888702]  ? ip_sublist_rcv_finish+0x50/0x50
[  219.888703]  ip_rcv+0x41/0xc0
[  219.888706]  __netif_receive_skb_one_core+0x4b/0x70
[  219.888708]  netif_receive_skb_internal+0x2f/0xd0
[  219.888710]  napi_gro_receive+0xb7/0xe0
[  219.888714]  mlx5e_handle_rx_cqe+0x7a/0xd0
[  219.888716]  mlx5e_poll_rx_cq+0xc6/0x930
[  219.888717]  mlx5e_napi_poll+0xab/0xc90
[  219.888722]  ? enqueue_task_fair+0x286/0xc40
[  219.888723]  ? enqueue_task_fair+0x1d6/0xc40
[  219.888725]  net_rx_action+0x1f1/0x320
[  219.888730]  __do_softirq+0xec/0x2b7
[  219.888736]  irq_exit+0x7b/0x80
[  219.888737]  do_IRQ+0x45/0xc0
[  219.888740]  common_interrupt+0xf/0xf
[  219.888742]  </IRQ>
[  219.888743] RIP: 0010:mwait_idle+0x5f/0x1b0
[  219.888745] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0
[  219.888746] RSP: 0018:c900034e7eb8 EFLAGS: 0246 ORIG_RAX: ffde
[  219.888749] RAX:  RBX: 0034 RCX:
[  219.888749] RDX:  RSI:  RDI:
[  219.888750] RBP: 0034 R08: 003b R09: 88086fa1fbc0
[  219.888751] R10:  R11: b15d R12: 88086d18
[  219.888752] R13: 88086d18 R14:  R15:
[  219.888754]  do_idle+0x1a3/0x1c0
[  219.888757]  cpu_startup_entry+0x14/0x20
[  219.888760]  start_secondary+0x165/0x190

Also, some perf top output attached to this - 14G of rx traffic on vlans (generated by
pktgen with random destination IPs and forwarded by the test server)


   PerfTop:   45296 irqs/sec  kernel:99.3%  exact:  0.0% [4000Hz cycles],  (all, 56 CPUs)

---

 7.43%  [kernel]   [k] mlx5e_skb_from_cqe_linear
 5.17%  [kernel]   [k] mlx5e_sq_xmit
 3.83%  [kernel]   [k] fib_table_lookup
 3.41%  [kernel]   [k] irq_entries_start
 2.91%  [kernel]   [k] build_skb
 2.50%  [kernel]   [k] mlx5_eq_int
 2.29%  [kernel]   [k] _raw_spin_lock
 2.27%  [kernel]   [k] tasklet_action_common.isra.21
 1.99%  [kernel]   [k] _raw_spin_lock_irqsave
 1.91%  [kernel]   [k] memcpy_erms
 1.77%  [kernel]   

Re: Latest net-next kernel 4.19.0+

2018-10-29 Thread Paweł Staszewski

Sorry, that was not complete - it was followed by a hw csum failure:

[  342.190831] vlan1490: hw csum failure
[  342.190835] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1
[  342.190836] Call Trace:
[  342.190839]  <IRQ>
[  342.190849]  dump_stack+0x46/0x5b
[  342.190856]  __skb_checksum_complete+0x9a/0xa0
[  342.190859]  tcp_v4_rcv+0xef/0x960
[  342.190864]  ip_local_deliver_finish+0x49/0xd0
[  342.190866]  ip_local_deliver+0x5e/0xe0
[  342.190869]  ? ip_sublist_rcv_finish+0x50/0x50
[  342.190870]  ip_rcv+0x41/0xc0
[  342.190874]  __netif_receive_skb_one_core+0x4b/0x70
[  342.190877]  netif_receive_skb_internal+0x2f/0xd0
[  342.190879]  napi_gro_receive+0xb7/0xe0
[  342.190884]  mlx5e_handle_rx_cqe+0x7a/0xd0
[  342.190886]  mlx5e_poll_rx_cq+0xc6/0x930
[  342.190888]  mlx5e_napi_poll+0xab/0xc90
[  342.190893]  ? kmem_cache_free_bulk+0x1e4/0x280
[  342.190895]  net_rx_action+0x1f1/0x320
[  342.190901]  __do_softirq+0xec/0x2b7
[  342.190908]  irq_exit+0x7b/0x80
[  342.190910]  do_IRQ+0x45/0xc0
[  342.190912]  common_interrupt+0xf/0xf
[  342.190914]  </IRQ>
[  342.190916] RIP: 0010:mwait_idle+0x5f/0x1b0
[  342.190917] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0
[  342.190918] RSP: 0018:c900034e7eb8 EFLAGS: 0246 ORIG_RAX: ffdd
[  342.190920] RAX:  RBX: 0034 RCX:
[  342.190921] RDX:  RSI:  RDI:
[  342.190922] RBP: 0034 R08: 0057 R09: 88086fa1fbc0
[  342.190923] R10:  R11: 000128cc R12: 88086d18
[  342.190923] R13: 88086d18 R14:  R15:
[  342.190929]  do_idle+0x1a3/0x1c0
[  342.190931]  cpu_startup_entry+0x14/0x20
[  342.190934]  start_secondary+0x165/0x190
[  342.190939]  secondary_startup_64+0xa4/0xb0


On 30.10.2018 at 01:10, Paweł Staszewski wrote:

Hi


I just checked the latest kernel in the test lab and see weird traces:

[  219.888673] CPU: 52 PID: 0 Comm: swapper/52 Not tainted 4.19.0+ #1
[  219.888674] Call Trace:
[  219.888676]  <IRQ>
[  219.888685]  dump_stack+0x46/0x5b
[  219.888691]  __skb_checksum_complete+0x9a/0xa0
[  219.888694]  tcp_v4_rcv+0xef/0x960
[  219.888698]  ip_local_deliver_finish+0x49/0xd0
[  219.888700]  ip_local_deliver+0x5e/0xe0
[  219.888702]  ? ip_sublist_rcv_finish+0x50/0x50
[  219.888703]  ip_rcv+0x41/0xc0
[  219.888706]  __netif_receive_skb_one_core+0x4b/0x70
[  219.888708]  netif_receive_skb_internal+0x2f/0xd0
[  219.888710]  napi_gro_receive+0xb7/0xe0
[  219.888714]  mlx5e_handle_rx_cqe+0x7a/0xd0
[  219.888716]  mlx5e_poll_rx_cq+0xc6/0x930
[  219.888717]  mlx5e_napi_poll+0xab/0xc90
[  219.888722]  ? enqueue_task_fair+0x286/0xc40
[  219.888723]  ? enqueue_task_fair+0x1d6/0xc40
[  219.888725]  net_rx_action+0x1f1/0x320
[  219.888730]  __do_softirq+0xec/0x2b7
[  219.888736]  irq_exit+0x7b/0x80
[  219.888737]  do_IRQ+0x45/0xc0
[  219.888740]  common_interrupt+0xf/0xf
[  219.888742]  </IRQ>
[  219.888743] RIP: 0010:mwait_idle+0x5f/0x1b0
[  219.888745] Code: a8 01 0f 85 3f 01 00 00 31 d2 65 48 8b 04 25 80 4c 01 00 48 89 d1 0f 01 c8 48 8b 00 a8 08 0f 85 40 01 00 00 31 c0 fb 0f 01 c9 <65> 8b 2d 2a c9 6a 7e 0f 1f 44 00 00 65 48 8b 04 25 80 4c 01 00 f0
[  219.888746] RSP: 0018:c900034e7eb8 EFLAGS: 0246 ORIG_RAX: ffde
[  219.888749] RAX:  RBX: 0034 RCX:
[  219.888749] RDX:  RSI:  RDI:
[  219.888750] RBP: 0034 R08: 003b R09: 88086fa1fbc0
[  219.888751] R10:  R11: b15d R12: 88086d18
[  219.888752] R13: 88086d18 R14:  R15:
[  219.888754]  do_idle+0x1a3/0x1c0
[  219.888757]  cpu_startup_entry+0x14/0x20
[  219.888760]  start_secondary+0x165/0x190