Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)

Chris Clayton Wed, 10 Oct 2018 01:52:04 -0700

Sorry, I forgot that editing r8169.c and rebuilding would result in rc7+, so I 
tested the wrong kernel/module to get the
results I provided below. That, however, may make the results more interesting 
because they happened with a virgin rc7
kernel/module.


I'll test your proposals properly later.

Chris

On 10/10/2018 09:09, Chris Clayton wrote:
> 
> 
> On 10/10/2018 01:24, Maciej S. Szmigiero wrote:
>> On 09.10.2018 22:36, Heiner Kallweit wrote:
>>> On 09.10.2018 16:40, Chris Clayton wrote:
>>>> Thanks to Maciej and Heiner for their replies.
>>>>
>>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote:
>>>>> On 07.10.2018 21:36, Chris Clayton wrote:
>>>>>> Hi again,
>>>>>>
>>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, 
>>>>>> but tried it anyway. I can confirm that the
>>>>>> regression is still present and my network still fails when, after a 
>>>>>> resume from suspend (to ram or disk), I open my
>>>>>> browser or my mail client. In both those cases the failure is almost 
>>>>>> immediate - e.g. my home page doesn't get displayed
>>>>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite 
>>>>>> so quickly but the reported time increases from
>>>>>> 14-15ms to more than 1000ms.
>>>>>
>>>>> You can try comparing chip registers (ethtool -d eth0) in the working
>>>>> state (before a suspend) and in the broken state (after a resume).
>>>>> Maybe there will be some obvious in the difference.
>>>>>
>>>>> The same goes for the PCI configuration (lspci -d :8168 -vv).
>>>>>
>>>> Maciej suggested comparing the output from lspci -vv for the ethernet 
>>>> device. They are identical.
>>>>
>>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" 
>>>> pre and post suspend. Again, they are identical.
>>>> Heiner specifically suggested looking at the RxConfig. The value of that 
>>>> is 0x0002870e both pre and post suspend.
>>>>
>>> Hmm, this is very weird, especially taking into account that in your 
>>> original
>>> report you state that removing the call to rtl_init_rxcfg() from 
>>> rtl_hw_start()
>>> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and
>>> register values seem to be the same before and after resume. So how can the
>>> chip behave differently?
>>> So far my best guess is that some chip quirk causes it to accept writes to
>>> register RxConfig, but to misinterpret or ignore the written value.
>>> So far your report is the only one (affecting RTL8411), but we don't know
>>> whether other chip versions are affected too.
>>
>> Also, it is interesting that even if one removes a call to
>> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get
>> written to moments later by rtl_set_rx_mode().
>>
>> The only chip accesses in the meantime seems to be a write to TxConfig by
>> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes
>> to MAR0 earlier in rtl_set_rx_mode().
>>
>> My proposals are:
>> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);"
>> in rtl_hw_start().
>> Maybe the chip does not like sometimes that RxConfig is written before
>> TxConfig.
>>
> After testing your first proposal, which made no  difference, I founf the 
> following in dmesg in the output from dmesg:
> 
> [  761.999468] ------------[ cut here ]------------
> [  761.999471] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out
> [  761.999483] WARNING: CPU: 0 PID: 8938 at net/sched/sch_generic.c:461 
> dev_watchdog+0x1e9/0x1f0
> [  761.999484] Modules linked in: btusb btintel r8169 rfcomm bnep 
> iptable_filter xt_conntrack iptable_nat ipt_MASQUERADE
> nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv4 uvcvideo videobuf2_vmalloc 
> videobuf2_memops snd_hda_codec_via
> videobuf2_v4l2 snd_hda_codec_hdmi snd_hda_codec_generic videobuf2_common 
> usbhid realtek coretemp snd_hda_intel hwmon
> snd_hda_codec x86_pkg_temp_thermal snd_hwdep libphy snd_hda_core [last 
> unloaded: btintel]
> [  761.999503] CPU: 0 PID: 8938 Comm: kworker/0:0 Not tainted 4.19.0-rc7 #328
> [  761.999504] Hardware name: Notebook                         W65_67SZ       
>                  /W65_67SZ
>        , BIOS 1.03.05 02/26/2014
> [  761.999508] Workqueue: events rtl_task [r8169]
> [  761.999510] RIP: 0010:dev_watchdog+0x1e9/0x1f0
> [  761.999512] Code: 00 48 63 4d e8 eb 99 4c 89 ef c6 05 b6 13 a6 00 01 e8 1b 
> c7 fd ff 89 d9 4c 89 ee 48 c7 c7 40 53 e1
> 81 48 89 c2 e8 ae f4 a3 ff <0f> 0b eb c0 0f 1f 00 48 c7 47 08 00 00 00 00 48 
> c7 07 00 00 00 00
> [  761.999513] RSP: 0018:ffff88040f803e98 EFLAGS: 00010282
> [  761.999514] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 
> 0000000000000006
> [  761.999516] RDX: 0000000000000007 RSI: 0000000000000096 RDI: 
> ffff88040f8153d0
> [  761.999517] RBP: ffff88040ca9a3b8 R08: ffffffff813565f0 R09: 
> 000000000000034e
> [  761.999517] R10: 0000000000000007 R11: 0000000000000000 R12: 
> ffff88040ca9a39c
> [  761.999518] R13: ffff88040ca9a000 R14: 0000000000000001 R15: 
> ffff8803ea17cc80
> [  761.999520] FS:  0000000000000000(0000) GS:ffff88040f800000(0000) 
> knlGS:0000000000000000
> [  761.999521] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  761.999522] CR2: 00007f67280206b8 CR3: 000000000200a002 CR4: 
> 00000000001606f0
> [  761.999523] Call Trace:
> [  761.999525]  <IRQ>
> [  761.999527]  ? qdisc_reset+0xe0/0xe0
> [  761.999529]  ? qdisc_reset+0xe0/0xe0
> [  761.999532]  call_timer_fn+0x11/0x70
> [  761.999534]  expire_timers+0x8e/0xa0
> [  761.999535]  run_timer_softirq+0x7e/0x150
> [  761.999538]  ? __hrtimer_run_queues+0x12b/0x1a0
> [  761.999541]  ? recalibrate_cpu_khz+0x10/0x10
> [  761.999543]  ? ktime_get+0x32/0x90
> [  761.999546]  ? lapic_next_event+0x20/0x20
> [  761.999549]  __do_softirq+0xcc/0x1fc
> [  761.999552]  irq_exit+0x82/0xb0
> [  761.999554]  smp_apic_timer_interrupt+0x61/0x90
> [  761.999556]  apic_timer_interrupt+0xf/0x20
> [  761.999557]  </IRQ>
> [  761.999560] RIP: 0010:rtl_slow_event_work+0x2a/0x1f0 [r8169]
> [  761.999562] Code: 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 10 4c 8b 67 10 
> 65 48 8b 04 25 28 00 00 00 48 89 44 24 08
> 31 c0 48 8b 07 66 8b 68 3e <66> 23 af da 0d 00 00 48 8b 07 66 89 68 3e 40 f6 
> c5 40 0f 85 3b 01
> [  761.999563] RSP: 0018:ffffc900014d7e40 EFLAGS: 00000246 ORIG_RAX: 
> ffffffffffffff13
> [  761.999564] RAX: ffffc900000b9000 RBX: ffff88040ca9a7c0 RCX: 
> ffff88040f81f160
> [  761.999565] RDX: ffff8803ea21b300 RSI: 0000000000000000 RDI: 
> ffff88040ca9a7c0
> [  761.999566] RBP: ffff88040ca90050 R08: 0000000000000000 R09: 
> 000073746e657665
> [  761.999567] R10: 8080808080808080 R11: ffff88040f81ea68 R12: 
> ffff88040ca9a000
> [  761.999568] R13: ffff88040ca9a000 R14: ffff88040f81f140 R15: 
> 0000000000000000
> [  761.999571]  ? __switch_to_asm+0x34/0x70
> [  761.999573]  rtl_task+0x4f/0x70 [r8169]
> [  761.999576]  process_one_work+0x1bc/0x2f0
> [  761.999577]  worker_thread+0x28/0x3c0
> [  761.999579]  ? process_one_work+0x2f0/0x2f0
> [  761.999581]  kthread+0x109/0x120
> [  761.999583]  ? kthread_park+0x80/0x80
> [  761.999585]  ret_from_fork+0x35/0x40
> [  761.999586] ---[ end trace fd5800440feffc06 ]---
> 
> I haven't seen this before, but maybe it's a consequence of swapping the 
> order of the two functions calls.
> 
> I'll work on the second proposal later today.
> 
> Chris
>> 2) Check the original value of RxConfig (after a resume) before
>> rtl_init_rxcfg() overwrites it (compile tested only):
>> --- r8169.c.ori
>> +++ r8169.c
>> @@ -5155,6 +5155,9 @@
>>      /* Initially a 10 us delay. Turned it into a PCI commit. - FR */
>>      RTL_R8(tp, IntrMask);
>>      RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb);
>> +
>> +    pr_notice("RxConfig before init was %.8x\n",
>> +            (unsigned int)RTL_R32(tp, RxConfig));
>>      rtl_init_rxcfg(tp);
>>      rtl_set_tx_config_registers(tp);
>>  
>>
>> This should be the value that you got when you removed the call to
>> rtl_init_rxcfg() for testing.
>> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg()
>> writes (under the "default:" label for your NIC model).
>>
>> Hope this helps,
>> Maciej
>>

Re: R8169: Network lockups in 4.18.{8,9,10} (and 4.19 dev)

Reply via email to