Sorry, I forgot that editing r8169.c and rebuilding would result in rc7+, so I tested the wrong kernel/module to get the results I provided below. That, however, may make the results more interesting because they happened with a virgin rc7 kernel/module.
I'll test your proposals properly later. Chris On 10/10/2018 09:09, Chris Clayton wrote: > > > On 10/10/2018 01:24, Maciej S. Szmigiero wrote: >> On 09.10.2018 22:36, Heiner Kallweit wrote: >>> On 09.10.2018 16:40, Chris Clayton wrote: >>>> Thanks to Maciej and Heiner for their replies. >>>> >>>> On 09/10/2018 13:32, Maciej S. Szmigiero wrote: >>>>> On 07.10.2018 21:36, Chris Clayton wrote: >>>>>> Hi again, >>>>>> >>>>>> I didn't think there was anything in 4.19-rc7 to fix this regression, >>>>>> but tried it anyway. I can confirm that the >>>>>> regression is still present and my network still fails when, after a >>>>>> resume from suspend (to ram or disk), I open my >>>>>> browser or my mail client. In both those cases the failure is almost >>>>>> immediate - e.g. my home page doesn't get displayed >>>>>> in the browser. Pinging one of my ISPs name servers doesn't fail quite >>>>>> so quickly but the reported time increases from >>>>>> 14-15ms to more than 1000ms. >>>>> >>>>> You can try comparing chip registers (ethtool -d eth0) in the working >>>>> state (before a suspend) and in the broken state (after a resume). >>>>> Maybe there will be some obvious in the difference. >>>>> >>>>> The same goes for the PCI configuration (lspci -d :8168 -vv). >>>>> >>>> Maciej suggested comparing the output from lspci -vv for the ethernet >>>> device. They are identical. >>>> >>>> Both Maciej and Heiner suggested comparing the output from "ethtool -d" >>>> pre and post suspend. Again, they are identical. >>>> Heiner specifically suggested looking at the RxConfig. The value of that >>>> is 0x0002870e both pre and post suspend. >>>> >>> Hmm, this is very weird, especially taking into account that in your >>> original >>> report you state that removing the call to rtl_init_rxcfg() from >>> rtl_hw_start() >>> fixes the issue. rtl_init_rxcfg() deals with the RxConfig register only and >>> register values seem to be the same before and after resume. So how can the >>> chip behave differently? >>> So far my best guess is that some chip quirk causes it to accept writes to >>> register RxConfig, but to misinterpret or ignore the written value. >>> So far your report is the only one (affecting RTL8411), but we don't know >>> whether other chip versions are affected too. >> >> Also, it is interesting that even if one removes a call to >> rtl_init_rxcfg() from rtl_hw_start() the RxConfig register will still get >> written to moments later by rtl_set_rx_mode(). >> >> The only chip accesses in the meantime seems to be a write to TxConfig by >> rtl_set_tx_config_registers() and then a read of RxConfig plus two writes >> to MAR0 earlier in rtl_set_rx_mode(). >> >> My proposals are: >> 1) Try swapping "rtl_init_rxcfg(tp);" and "rtl_set_tx_config_registers(tp);" >> in rtl_hw_start(). >> Maybe the chip does not like sometimes that RxConfig is written before >> TxConfig. >> > After testing your first proposal, which made no difference, I founf the > following in dmesg in the output from dmesg: > > [ 761.999468] ------------[ cut here ]------------ > [ 761.999471] NETDEV WATCHDOG: eth0 (r8169): transmit queue 0 timed out > [ 761.999483] WARNING: CPU: 0 PID: 8938 at net/sched/sch_generic.c:461 > dev_watchdog+0x1e9/0x1f0 > [ 761.999484] Modules linked in: btusb btintel r8169 rfcomm bnep > iptable_filter xt_conntrack iptable_nat ipt_MASQUERADE > nf_nat_ipv4 nf_nat nf_conntrack nf_defrag_ipv4 uvcvideo videobuf2_vmalloc > videobuf2_memops snd_hda_codec_via > videobuf2_v4l2 snd_hda_codec_hdmi snd_hda_codec_generic videobuf2_common > usbhid realtek coretemp snd_hda_intel hwmon > snd_hda_codec x86_pkg_temp_thermal snd_hwdep libphy snd_hda_core [last > unloaded: btintel] > [ 761.999503] CPU: 0 PID: 8938 Comm: kworker/0:0 Not tainted 4.19.0-rc7 #328 > [ 761.999504] Hardware name: Notebook W65_67SZ > /W65_67SZ > , BIOS 1.03.05 02/26/2014 > [ 761.999508] Workqueue: events rtl_task [r8169] > [ 761.999510] RIP: 0010:dev_watchdog+0x1e9/0x1f0 > [ 761.999512] Code: 00 48 63 4d e8 eb 99 4c 89 ef c6 05 b6 13 a6 00 01 e8 1b > c7 fd ff 89 d9 4c 89 ee 48 c7 c7 40 53 e1 > 81 48 89 c2 e8 ae f4 a3 ff <0f> 0b eb c0 0f 1f 00 48 c7 47 08 00 00 00 00 48 > c7 07 00 00 00 00 > [ 761.999513] RSP: 0018:ffff88040f803e98 EFLAGS: 00010282 > [ 761.999514] RAX: 0000000000000000 RBX: 0000000000000000 RCX: > 0000000000000006 > [ 761.999516] RDX: 0000000000000007 RSI: 0000000000000096 RDI: > ffff88040f8153d0 > [ 761.999517] RBP: ffff88040ca9a3b8 R08: ffffffff813565f0 R09: > 000000000000034e > [ 761.999517] R10: 0000000000000007 R11: 0000000000000000 R12: > ffff88040ca9a39c > [ 761.999518] R13: ffff88040ca9a000 R14: 0000000000000001 R15: > ffff8803ea17cc80 > [ 761.999520] FS: 0000000000000000(0000) GS:ffff88040f800000(0000) > knlGS:0000000000000000 > [ 761.999521] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 761.999522] CR2: 00007f67280206b8 CR3: 000000000200a002 CR4: > 00000000001606f0 > [ 761.999523] Call Trace: > [ 761.999525] <IRQ> > [ 761.999527] ? qdisc_reset+0xe0/0xe0 > [ 761.999529] ? qdisc_reset+0xe0/0xe0 > [ 761.999532] call_timer_fn+0x11/0x70 > [ 761.999534] expire_timers+0x8e/0xa0 > [ 761.999535] run_timer_softirq+0x7e/0x150 > [ 761.999538] ? __hrtimer_run_queues+0x12b/0x1a0 > [ 761.999541] ? recalibrate_cpu_khz+0x10/0x10 > [ 761.999543] ? ktime_get+0x32/0x90 > [ 761.999546] ? lapic_next_event+0x20/0x20 > [ 761.999549] __do_softirq+0xcc/0x1fc > [ 761.999552] irq_exit+0x82/0xb0 > [ 761.999554] smp_apic_timer_interrupt+0x61/0x90 > [ 761.999556] apic_timer_interrupt+0xf/0x20 > [ 761.999557] </IRQ> > [ 761.999560] RIP: 0010:rtl_slow_event_work+0x2a/0x1f0 [r8169] > [ 761.999562] Code: 41 56 41 55 41 54 55 53 48 89 fb 48 83 ec 10 4c 8b 67 10 > 65 48 8b 04 25 28 00 00 00 48 89 44 24 08 > 31 c0 48 8b 07 66 8b 68 3e <66> 23 af da 0d 00 00 48 8b 07 66 89 68 3e 40 f6 > c5 40 0f 85 3b 01 > [ 761.999563] RSP: 0018:ffffc900014d7e40 EFLAGS: 00000246 ORIG_RAX: > ffffffffffffff13 > [ 761.999564] RAX: ffffc900000b9000 RBX: ffff88040ca9a7c0 RCX: > ffff88040f81f160 > [ 761.999565] RDX: ffff8803ea21b300 RSI: 0000000000000000 RDI: > ffff88040ca9a7c0 > [ 761.999566] RBP: ffff88040ca90050 R08: 0000000000000000 R09: > 000073746e657665 > [ 761.999567] R10: 8080808080808080 R11: ffff88040f81ea68 R12: > ffff88040ca9a000 > [ 761.999568] R13: ffff88040ca9a000 R14: ffff88040f81f140 R15: > 0000000000000000 > [ 761.999571] ? __switch_to_asm+0x34/0x70 > [ 761.999573] rtl_task+0x4f/0x70 [r8169] > [ 761.999576] process_one_work+0x1bc/0x2f0 > [ 761.999577] worker_thread+0x28/0x3c0 > [ 761.999579] ? process_one_work+0x2f0/0x2f0 > [ 761.999581] kthread+0x109/0x120 > [ 761.999583] ? kthread_park+0x80/0x80 > [ 761.999585] ret_from_fork+0x35/0x40 > [ 761.999586] ---[ end trace fd5800440feffc06 ]--- > > I haven't seen this before, but maybe it's a consequence of swapping the > order of the two functions calls. > > I'll work on the second proposal later today. > > Chris >> 2) Check the original value of RxConfig (after a resume) before >> rtl_init_rxcfg() overwrites it (compile tested only): >> --- r8169.c.ori >> +++ r8169.c >> @@ -5155,6 +5155,9 @@ >> /* Initially a 10 us delay. Turned it into a PCI commit. - FR */ >> RTL_R8(tp, IntrMask); >> RTL_W8(tp, ChipCmd, CmdTxEnb | CmdRxEnb); >> + >> + pr_notice("RxConfig before init was %.8x\n", >> + (unsigned int)RTL_R32(tp, RxConfig)); >> rtl_init_rxcfg(tp); >> rtl_set_tx_config_registers(tp); >> >> >> This should be the value that you got when you removed the call to >> rtl_init_rxcfg() for testing. >> Now, knowing the "right" value you can experiment with what rtl_init_rxcfg() >> writes (under the "default:" label for your NIC model). >> >> Hope this helps, >> Maciej >>