Bug#1060706: linux-image-6.1.0-17-amd64: intel i225 NIC loses PCIe link, network becomes unusable

2024-02-12 Thread Arno Lehmann

Reported upstream, see

https://lore.kernel.org/netdev/3179622f-7090-4a57-98ba-9042809a0...@its-lehmann.de/T/#u

Keep your fingers crossed that this gets someone interested ;-)

Cheers,

Arno

--
Arno Lehmann

IT-Service Lehmann
Sandstr. 6, 49080 Osnabrück



Bug#1060706: linux-image-6.1.0-17-amd64: intel i225 NIC loses PCIe link, network becomes unusable

2024-02-09 Thread Arno Lehmann
river_probe_device+0x78/0x160
[Fr Feb  9 13:27:17 2024]  driver_probe_device+0x1f/0x90
[Fr Feb  9 13:27:17 2024]  __driver_attach+0xd2/0x1c0
[Fr Feb  9 13:27:17 2024]  bus_for_each_dev+0x85/0xd0
[Fr Feb  9 13:27:17 2024]  bus_add_driver+0x116/0x220
[Fr Feb  9 13:27:17 2024]  driver_register+0x59/0x100
[Fr Feb  9 13:27:17 2024]  ? __pfx_igc_init_module+0x10/0x10 [igc]
[Fr Feb  9 13:27:17 2024]  do_one_initcall+0x5a/0x320
[Fr Feb  9 13:27:17 2024]  do_init_module+0x60/0x240
[Fr Feb  9 13:27:17 2024]  init_module_from_file+0x86/0xc0
[Fr Feb  9 13:27:17 2024]  idempotent_init_module+0x120/0x2b0
[Fr Feb  9 13:27:17 2024]  __x64_sys_finit_module+0x5e/0xb0
[Fr Feb  9 13:27:17 2024]  do_syscall_64+0x5c/0xc0
[Fr Feb  9 13:27:17 2024]  ? srso_alias_return_thunk+0x5/0x7f
[Fr Feb  9 13:27:17 2024]  ? ksys_mmap_pgoff+0xec/0x1f0
[Fr Feb  9 13:27:17 2024]  ? srso_alias_return_thunk+0x5/0x7f
[Fr Feb  9 13:27:17 2024]  ? exit_to_user_mode_prepare+0x40/0x1e0
[Fr Feb  9 13:27:17 2024]  ? srso_alias_return_thunk+0x5/0x7f
[Fr Feb  9 13:27:17 2024]  ? syscall_exit_to_user_mode+0x2b/0x40
[Fr Feb  9 13:27:17 2024]  ? srso_alias_return_thunk+0x5/0x7f
[Fr Feb  9 13:27:17 2024]  ? do_syscall_64+0x6b/0xc0
[Fr Feb  9 13:27:17 2024]  ? do_syscall_64+0x6b/0xc0
[Fr Feb  9 13:27:17 2024]  ? srso_alias_return_thunk+0x5/0x7f
[Fr Feb  9 13:27:17 2024]  ? exit_to_user_mode_prepare+0x40/0x1e0
[Fr Feb  9 13:27:17 2024]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
[Fr Feb  9 13:27:17 2024] RIP: 0033:0x7f709399e719
[Fr Feb  9 13:27:17 2024] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 
00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 
4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d b7 06 0d 00 f7 d8 
64 89 01 48
[Fr Feb  9 13:27:17 2024] RSP: 002b:7ffd5ebd1f78 EFLAGS: 0246 
ORIG_RAX: 0139
[Fr Feb  9 13:27:17 2024] RAX: ffda RBX: 563800fbbc30 
RCX: 7f709399e719
[Fr Feb  9 13:27:17 2024] RDX:  RSI: 5637fff544a0 
RDI: 0003
[Fr Feb  9 13:27:17 2024] RBP: 5637fff544a0 R08:  
R09: 563800fbe650
[Fr Feb  9 13:27:17 2024] R10: 0003 R11: 0246 
R12: 0004
[Fr Feb  9 13:27:17 2024] R13:  R14: 563800fbbdc0 
R15: 

[Fr Feb  9 13:27:17 2024]  
[Fr Feb  9 13:27:17 2024] ---[ end trace  ]---
[Fr Feb  9 13:27:57 2024] igc: probe of :0b:00.0 failed with error -13


Can anybody suggest what information I can provide to tackle this?

Thanks,

Arno




Bug#1060706: linux-image-6.1.0-17-amd64: intel i225 NIC loses PCIe link, network becomes unusable

2024-02-08 Thread Arno Lehmann

Hi all,

So, the latest news.

The system lost access to the NVMe again and recovered only after
power cycling. Pings kept working until that power cycle, so I assume
the NIC and the software above it were still functional.


I rebooted into the backported 6.5 kernel, downloaded the newest BIOS,
noticed the NIC getting lost, wrote the BIOS image to a USB key,
rebooted into the UEFI/BIOS setup tool, flashed the newest firmware,
set all defaults and conservative power-saving settings, and booted
into Debian again.


Kernel is
# uname -a
Linux Zwerg 6.5.0-0.deb12.4-amd64 #1 SMP PREEMPT_DYNAMIC Debian 
6.5.10-1~bpo12+1 (2023-11-23) x86_64 GNU/Linux


These are the latest such events:
Jan 27 09:44:53 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached
Jan 27 09:48:05 Zwerg kernel: igc :0a:00.0 (unnamed net_device) (uninitialized): PCIe link lost, device now detached
Jan 27 09:52:16 Zwerg kernel: igc :0a:00.0 (unnamed net_device) (uninitialized): PCIe link lost, device now detached
Feb 01 04:19:17 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached
Feb 01 14:43:03 Zwerg kernel: igc :0a:00.0 (unnamed net_device) (uninitialized): PCIe link lost, device now detached
Feb 08 18:33:38 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached
Feb 08 19:00:32 Zwerg kernel: igc :0b:00.0 eno1: PCIe link lost, device now detached
Feb 08 19:02:38 Zwerg kernel: igc :0b:00.0 (unnamed net_device) (uninitialized): PCIe link lost, device now detached


I think it's safe to say that the actual kernel version does not have an 
effect on those events.
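
To compare such events across kernels, they can be tallied by PCI address and by whether the interface was already named. This is only a minimal sketch: it assumes every journal message sits on a single line, and the names `LINK_LOST` and `summarize` are made up for illustration. The PCI address is matched loosely because it shows up in this thread in a shortened form.

```python
import re
from collections import Counter

# Matches the igc "PCIe link lost" message as quoted in this thread.
# The address pattern is deliberately loose (hex digits, ':' and '.').
LINK_LOST = re.compile(
    r"igc\s+(?P<addr>[0-9a-fA-F:.]+)\s+(?P<dev>.+?):\s+PCIe link lost"
)

def summarize(lines):
    """Count link-loss events per (PCI address, interface state)."""
    counts = Counter()
    for line in lines:
        match = LINK_LOST.search(line)
        if not match:
            continue  # not a link-loss message
        state = "uninitialized" if "(uninitialized)" in match.group("dev") else "named"
        counts[(match.group("addr"), state)] += 1
    return counts

sample = [
    "Feb 08 18:33:38 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached",
    "Feb 08 19:00:32 Zwerg kernel: igc :0b:00.0 eno1: PCIe link lost, device now detached",
    "Feb 08 19:02:38 Zwerg kernel: igc :0b:00.0 (unnamed net_device) (uninitialized): PCIe link lost, device now detached",
]
for (addr, state), count in sorted(summarize(sample).items()):
    print(addr, state, count)
```

Splitting by state keeps the first loss on a working interface apart from the follow-up failures that occur while the driver is being probed again.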


Naturally, the NVMe connectivity losses are not logged, but it would be
interesting to see whether I can capture them. Sending system logs to
USB storage might work. However, I think it is important to understand
whether this ticket's topic is a matter of the igc module or rather of
the power or PCIe management functionality (of which I know even less).


The big question: What can I do to help pinpoint this problem further?

Thanks,

Arno




Bug#1060706: linux-image-6.1.0-17-amd64: intel i225 NIC loses PCIe link, network becomes unusable

2024-02-01 Thread Arno Lehmann

Another one:

[Do Feb 1 04:19:21 2024] igc :0a:00.0 eno1: PCIe link lost, device 
now detached

[Do Feb 1 04:19:21 2024] [ cut here ]
[Do Feb 1 04:19:21 2024] igc: Failed to read reg 0xc030!
[Do Feb 1 04:19:21 2024] WARNING: CPU: 6 PID: 90291 at 
drivers/net/ethernet/intel/igc/igc_main.c:6384 igc_rd32+0x91/0xa0 [igc]
[Do Feb 1 04:19:21 2024] Modules linked in: rfcomm cpufreq_userspace 
cpufreq_powersave cpufreq_ondemand cpufreq_conservative nfsv3 nfs_acl 
rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache 
netfs qrtr overlay cmac algif_hash algif_skcipher af_alg bnep sunrpc 
binfmt_misc nls_ascii nls_cp437 vfat fat ext4 mbcache jbd2 
intel_rapl_msr intel_rapl_common btusb btrtl btbcm btintel btmtk 
bluetooth edac_mce_amd jitterentropy_rng kvm_amd snd_hda_codec_hdmi 
uvcvideo drbg snd_hda_intel kvm eeepc_wmi videobuf2_vmalloc ansi_cprng 
videobuf2_memops snd_intel_dspcfg snd_intel_sdw_acpi snd_usb_audio 
snd_hda_codec videobuf2_v4l2 asus_wmi platform_profile videobuf2_common 
battery snd_usbmidi_lib ccp sparse_keymap irqbypass ecdh_generic 
sp5100_tco snd_hda_core snd_rawmidi ledtrig_audio ecc crc16 rapl rfkill 
videodev pcspkr wmi_bmof snd_seq_device rng_core watchdog k10temp 
snd_hwdep mc snd_pcm snd_timer joydev snd soundcore sg acpi_cpufreq 
evdev msr parport_pc ppdev lp parport fuse loop efi_pstore
[Do Feb 1 04:19:21 2024] configfs efivarfs ip_tables x_tables autofs4 
xfs libcrc32c crc32c_generic dm_crypt dm_mod hid_generic amdgpu 
gpu_sched drm_buddy i2c_algo_bit drm_display_helper usbhid hid cec 
sr_mod rc_core cdrom drm_ttm_helper ttm crc32_pclmul crc32c_intel 
drm_kms_helper ahci ghash_clmulni_intel sha512_ssse3 libahci xhci_pci 
sha512_generic libata xhci_hcd nvme drm nvme_core aesni_intel usbcore 
igc scsi_mod t10_pi crypto_simd crc64_rocksoft_generic cryptd 
crc64_rocksoft i2c_piix4 crc_t10dif ptp crct10dif_generic 
crct10dif_pclmul crc64 pps_core crct10dif_common usb_common scsi_common 
video wmi gpio_amdpt gpio_generic button
[Do Feb 1 04:19:21 2024] CPU: 6 PID: 90291 Comm: kworker/6:2 Not tainted 
6.1.0-1-amd64 #1 Debian 6.1.4-1
[Do Feb 1 04:19:21 2024] Hardware name: ASUS System Product Name/ROG 
STRIX X670E-A GAMING WIFI, BIOS 1410 04/28/2023

[Do Feb 1 04:19:21 2024] Workqueue: events igc_watchdog_task [igc]
[Do Feb 1 04:19:21 2024] RIP: 0010:igc_rd32+0x91/0xa0 [igc]
[Do Feb 1 04:19:21 2024] Code: 48 c7 c6 f8 b4 71 c0 e8 78 08 90 f0 48 8b 
bd 28 ff ff ff e8 d1 50 48 f0 84 c0 74 b4 89 de 48 c7 c7 20 b5 71 c0 e8 
b3 34 8c f0 <0f> 0b eb a2 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 
00 41 56

[Do Feb 1 04:19:21 2024] RSP: 0018:afa297007df0 EFLAGS: 00010282
[Do Feb 1 04:19:21 2024] RAX:  RBX: c030 
RCX: 
[Do Feb 1 04:19:21 2024] RDX: 0002 RSI: b193289e 
RDI: 
[Do Feb 1 04:19:21 2024] RBP: 988bd7b88c20 R08:  
R09: afa297007c78
[Do Feb 1 04:19:21 2024] R10: 0003 R11: 989b17f7ffe8 
R12: 988bd7b88000
[Do Feb 1 04:19:21 2024] R13:  R14: 989345341d40 
R15: c030
[Do Feb 1 04:19:21 2024] FS: () 
GS:989af840() knlGS:

[Do Feb 1 04:19:21 2024] CS: 0010 DS:  ES:  CR0: 80050033
[Do Feb 1 04:19:21 2024] CR2: 7f6ce7d94000 CR3: 0008850cc000 
CR4: 00750ee0

[Do Feb 1 04:19:21 2024] PKRU: 5554
[Do Feb 1 04:19:21 2024] Call Trace:
[Do Feb 1 04:19:21 2024] 
[Do Feb 1 04:19:21 2024] igc_update_stats+0x86/0x6c0 [igc]
[Do Feb 1 04:19:21 2024] igc_watchdog_task+0xa3/0x2c0 [igc]
[Do Feb 1 04:19:21 2024] process_one_work+0x1c7/0x380
[Do Feb 1 04:19:21 2024] worker_thread+0x4d/0x380
[Do Feb 1 04:19:21 2024] ? _raw_spin_lock_irqsave+0x23/0x50
[Do Feb 1 04:19:21 2024] ? rescuer_thread+0x3a0/0x3a0
[Do Feb 1 04:19:21 2024] kthread+0xe9/0x110
[Do Feb 1 04:19:21 2024] ? kthread_complete_and_exit+0x20/0x20
[Do Feb 1 04:19:21 2024] ret_from_fork+0x22/0x30
[Do Feb 1 04:19:21 2024] 
[Do Feb 1 04:19:21 2024] ---[ end trace  ]---


Next round: trying a more bleeding-edge kernel from backports...

# uname -a
Linux Zwerg 6.5.0-0.deb12.4-amd64 #1 SMP PREEMPT_DYNAMIC Debian 
6.5.10-1~bpo12+1 (2023-11-23) x86_64 GNU/Linux


is what I just booted into.

Now -- we wait.

Cheers,

Arno



Bug#1060706:

2024-01-27 Thread Arno Lehmann
5 2024]  ? asm_exc_invalid_op+0x16/0x20
[Sa Jan 27 09:52:15 2024]  ? igc_rd32+0x91/0xa0 [igc]
[Sa Jan 27 09:52:15 2024]  ? igc_rd32+0x91/0xa0 [igc]
[Sa Jan 27 09:52:15 2024]  igc_get_invariants_base+0xb5/0x260 [igc]
[Sa Jan 27 09:52:15 2024]  igc_probe+0x2b9/0x8d0 [igc]
[Sa Jan 27 09:52:15 2024]  local_pci_probe+0x41/0x80
[Sa Jan 27 09:52:15 2024]  pci_device_probe+0xc3/0x240
[Sa Jan 27 09:52:15 2024]  really_probe+0xde/0x380
[Sa Jan 27 09:52:15 2024]  ? pm_runtime_barrier+0x50/0x90
[Sa Jan 27 09:52:15 2024]  __driver_probe_device+0x78/0x120
[Sa Jan 27 09:52:15 2024]  driver_probe_device+0x1f/0x90
[Sa Jan 27 09:52:15 2024]  __driver_attach+0xce/0x1c0
[Sa Jan 27 09:52:15 2024]  ? __device_attach_driver+0x110/0x110
[Sa Jan 27 09:52:15 2024]  bus_for_each_dev+0x87/0xd0
[Sa Jan 27 09:52:15 2024]  bus_add_driver+0x1ae/0x200
[Sa Jan 27 09:52:15 2024]  driver_register+0x89/0xe0
[Sa Jan 27 09:52:15 2024]  ? 0xc1174000
[Sa Jan 27 09:52:15 2024]  do_one_initcall+0x59/0x220
[Sa Jan 27 09:52:15 2024]  do_init_module+0x4a/0x1f0
[Sa Jan 27 09:52:15 2024]  __do_sys_finit_module+0xac/0x120
[Sa Jan 27 09:52:15 2024]  do_syscall_64+0x5b/0xc0
[Sa Jan 27 09:52:15 2024]  ? do_syscall_64+0x67/0xc0
[Sa Jan 27 09:52:15 2024]  entry_SYSCALL_64_after_hwframe+0x64/0xce
[Sa Jan 27 09:52:15 2024] RIP: 0033:0x7fe52d720559
[Sa Jan 27 09:52:15 2024] Code: 08 89 e8 5b 5d c3 66 2e 0f 1f 84 00 00 
00 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 
4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 77 08 0d 00 f7 d8 
64 89 01 48
[Sa Jan 27 09:52:15 2024] RSP: 002b:7ffd885b6ab8 EFLAGS: 0246 
ORIG_RAX: 0139
[Sa Jan 27 09:52:15 2024] RAX: ffda RBX: 559d3674ec30 
RCX: 7fe52d720559
[Sa Jan 27 09:52:15 2024] RDX:  RSI: 559d35c644a0 
RDI: 0003
[Sa Jan 27 09:52:15 2024] RBP: 559d35c644a0 R08:  
R09: 559d367513f0
[Sa Jan 27 09:52:15 2024] R10: 0003 R11: 0246 
R12: 0004
[Sa Jan 27 09:52:15 2024] R13:  R14: 559d3674edc0 
R15: 

[Sa Jan 27 09:52:15 2024]  
[Sa Jan 27 09:52:15 2024] ---[ end trace  ]---


I'll now reboot into the old kernel and see if I can send this message 
then :-)


...

And:

# uname -a
Linux Zwerg 6.1.0-1-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.4-1 
(2023-01-07) x86_64 GNU/Linux


So far... OK, I'll give this kernel another try, but the next round will
then be a backported near-bleeding-edge one, I guess.


Cheers,

Arno




Bug#1060706: Info received (Bug#1060706: linux-image-6.1.0-17-amd64: intel i225 NIC loses PCIe link, network becomes unusable)

2024-01-24 Thread Arno Lehmann

Some news, but unfortunately not helping me to understand what we see :-)

Network link was lost during the day.

dmesg shows this:
[Tue Jan 23 06:54:24 2024] igc :0a:00.0 eno1: NIC Link is Up 1000 
Mbps Full Duplex, Flow Control: RX
[Tue Jan 23 16:24:13 2024] [drm:retrieve_link_cap [amdgpu]] *ERROR* 
retrieve_link_cap: Read receiver caps dpcd data failed.

[Tue Jan 23 23:09:16 2024] igc :0a:00.0 eno1: NIC Link is Down
[Tue Jan 23 23:09:19 2024] igc :0a:00.0 eno1: NIC Link is Up 1000 
Mbps Full Duplex, Flow Control: RX

[Wed Jan 24 12:00:23 2024] systemd-journald[750]:
[Wed Jan 24 14:46:17 2024] nfs: server  not responding, timed out
[Wed Jan 24 14:46:17 2024] nfs: server  not responding, timed out
[Wed Jan 24 17:00:09 2024] nfs: server  not responding, timed out

Here, I rmmod'ed the igc module and modprobe'd it immediately.

[Wed Jan 24 17:00:36 2024] igc :0a:00.0 eno1: PHC removed
[Wed Jan 24 17:00:42 2024] Intel(R) 2.5G Ethernet Linux Driver
[Wed Jan 24 17:00:42 2024] Copyright(c) 2018 Intel Corporation.
[Wed Jan 24 17:00:42 2024] igc :0a:00.0: PCIe PTM not supported by 
PCIe bus/controller

[Wed Jan 24 17:00:42 2024] pps pps0: new PPS source ptp0
[Wed Jan 24 17:00:42 2024] igc :0a:00.0 (unnamed net_device) 
(uninitialized): PHC added
[Wed Jan 24 17:00:42 2024] igc :0a:00.0: 4.000 Gb/s available PCIe 
bandwidth (5.0 GT/s PCIe x1 link)

[Wed Jan 24 17:00:42 2024] igc :0a:00.0 eth0: MAC: c8:7f:54:67:6d:cc
[Wed Jan 24 17:00:42 2024] igc :0a:00.0 eno1: renamed from eth0
[Wed Jan 24 17:00:45 2024] igc :0a:00.0 eno1: NIC Link is Up 1000 
Mbps Full Duplex, Flow Control: RX
[Wed Jan 24 17:00:45 2024] IPv6: ADDRCONF(NETDEV_CHANGE): eno1: link 
becomes ready



So, we have a case of the NIC becoming unresponsive for some reason, but
I cannot see or even guess the reason. I'll leave the system as it is
for a few more days, I think, and then try a much newer kernel.


Or -- any better suggestions?

Cheers,

Arno




Bug#1060706: linux-image-6.1.0-17-amd64: intel i225 NIC loses PCIe link, network becomes unusable

2024-01-19 Thread Arno Lehmann

Hi all,

I have now installed an early 6.1 kernel:

$ uname -a
Linux Zwerg 6.1.0-1-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.4-1 
(2023-01-07) x86_64 GNU/Linux


and have not updated anything else. I am also still running with PCIe
power management in a non-default setting:


# cat /proc/cmdline
BOOT_IMAGE=/vmlinuz-6.1.0-1-amd64 root=/dev/mapper/Zwerg--vg-root ro 
pcie_aspm=off quiet
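
Whether a given parameter is really on the command line can be checked by splitting the string into tokens. A minimal sketch, assuming plain whitespace-separated tokens (quoted values containing spaces are not handled); `cmdline_params` is a hypothetical helper name, not an existing tool.

```python
def cmdline_params(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params

# The command line quoted above, joined onto one line.
cmdline = (
    "BOOT_IMAGE=/vmlinuz-6.1.0-1-amd64 root=/dev/mapper/Zwerg--vg-root ro "
    "pcie_aspm=off quiet"
)
params = cmdline_params(cmdline)
print("pcie_aspm =", params.get("pcie_aspm"))
```

On a running system the same function could be fed the contents of /proc/cmdline instead of the literal string.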



Let's see how long this works :-) Or, rather, how much patience I have.
Failures were between a few hours and up to four weeks apart...


Cheers,

Arno

--
Arno Lehmann

IT-Service Lehmann
Sandstr. 6, 49080 Osnabrück



Bug#1060706: linux-image-6.1.0-17-amd64: intel i225 NIC loses PCIe link, network becomes unusable

2024-01-18 Thread Arno Lehmann

Hello,

On 18.01.2024 at 22:12, Salvatore Bonaccorso wrote:

Hi,

On Sat, Jan 13, 2024 at 04:39:51PM +0100, Arno Lehmann wrote:

Hi Salvatore,

On 13.01.2024 at 13:47, Salvatore Bonaccorso wrote:


Just to be clear, can you confirm this is or is not a regression from
a previous running 6.1.y kernel?


On this hardware, the network issues appeared right from the start.

First time I encountered it was with the Debian installation some time last
year, and that's where my research led me to turn off PCIe power management.

Actually I don't even know which was the first kernel version I had on this
host, but it's been on Bookworm for all its lifetime.


This "feels" like it's probably not really a regression, thus the
similarity (though not the identical case as the referenced thread).

What about newer kernels? Do 6.6.11-1 or 6.7-1~exp1 taken from
unstable (resp. experimental) show the same problem?

If yes, then it might be an idea to bring it upstream.


Well, tricky... at this stage, we're guessing what will tell us more --
a newer kernel or an older one. And then we'll need to wait for a while
to see what happens.


Well, tomorrow morning I'll be on site and can then install another 
kernel and reboot.


Cheers,

Arno





Bug#1060706: linux-image-6.1.0-17-amd64: intel i225 NIC loses PCIe link, network becomes unusable

2024-01-13 Thread Arno Lehmann

Hi Diederik,

On 13.01.2024 at 17:13, Diederik de Haas wrote:
...

Via https://snapshot.debian.org/package/linux-signed-amd64/ you have easy
access to previous (6.1) kernels uploaded to Debian with which you can check
if the problem was present in early 6.1 kernels.


The oldest record of this issue is with Linux version
6.1.0-11-amd64.


As I usually keep this box updated, and the problem happens only
randomly, I think the best way forward might be to try a kernel
that did *not* show this problem.


Does that look reasonable?

So, I have:
# journalctl --grep PCIe\ link\ lost
-- Boot 86e1a04baba04a409c34796c0fb079ff --
Sep 20 14:21:17 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached
-- Boot da6a00d9278a422686ca46d80e2f3ca6 --
-- Boot 28fcdfe079c446c6b184bb5b6407da73 --
Okt 06 05:44:20 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached
Okt 07 16:39:10 Zwerg kernel: igc :0a:00.0 (unnamed net_device) (uninitialized): PCIe link lost, device now detached
-- Boot 51e3605887764b60b6d0130d4f6356c0 --
-- Boot ce944a4bbffc45b38c1357d3e822cd46 --
Okt 23 18:31:25 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached
-- Boot e6d80407cab74d0b9e28b74642b544c0 --
Okt 30 11:16:06 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached
Okt 31 13:50:06 Zwerg kernel: igc :0a:00.0 (unnamed net_device) (uninitialized): PCIe link lost, device now detached
-- Boot 452f25ce23fe4d569490fbc42683ecd6 --
Nov 22 18:59:11 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached
-- Boot f1add031e2fa495aba569ab9c374ce65 --
Nov 23 15:45:49 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached
-- Boot f766dabb981e4aa49f0922d7794dea76 --
-- Boot 6d7c91a86ab44da1973f5ca716dad105 --
-- Boot 3ba3df042e0648a1aebfa4fcea5499bf --
Dez 19 07:33:02 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached
-- Boot a4aea30bb33747e7853abec194a2a395 --
Jan 01 09:57:40 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached
-- Boot 377a326561dc4909b45c55cffcd1a94d --
Jan 10 16:15:20 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached
-- Boot 50c5a6a9cc34496984fe3cde6ba8b72a --
Jan 13 11:16:31 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached

-- Boot a3c69838cab4426992a2f518a72a5e2b --


So I conclude I should look at something earlier than what was used with 
boot 86e1a04baba04a409c34796c0fb079ff, i.e.


journalctl --boot 86e1a04baba04a409c34796c0fb079ff  | head -n 1
Aug 30 18:16:18 Zwerg kernel: Linux version 6.1.0-11-amd64 
(debian-kernel@lists.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU 
ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian 
6.1.38-4 (2023-08-08)


correct?
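
Pulling the kernel ABI name and the Debian package version out of such a first journal line can be scripted when many boots need checking. This is a rough sketch, assuming the banner has the shape shown above and has been joined onto one line; `kernel_versions` is an illustrative name only.

```python
import re

def kernel_versions(banner):
    """Return (ABI name, Debian package version) from a 'Linux version' line."""
    abi = re.search(r"Linux version (\S+)", banner)
    # The package version is the token after "Debian" that is followed by a
    # build date in parentheses; the compiler's "(Debian 12.2.0-14)" is not.
    pkg = re.search(r"Debian (\S+) \(\d{4}-\d{2}-\d{2}\)", banner)
    return (abi.group(1) if abi else None, pkg.group(1) if pkg else None)

banner = (
    "Aug 30 18:16:18 Zwerg kernel: Linux version 6.1.0-11-amd64 "
    "(debian-kernel@lists.debian.org) (gcc-12 (Debian 12.2.0-14) 12.2.0, GNU "
    "ld (GNU Binutils for Debian) 2.40) #1 SMP PREEMPT_DYNAMIC Debian "
    "6.1.38-4 (2023-08-08)"
)
print(kernel_versions(banner))
```

The package version is the one to look up on snapshot.debian.org when choosing an older kernel to test.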

Via the page you reference, I find a kernel package 
linux-image-6.1.0-1-amd64 6.1.4-1 which might be worth a try.


I'll need some time to sort out how to install such a package...

Thanks for your suggestion,

Arno





Bug#1060706: linux-image-6.1.0-17-amd64: intel i225 NIC loses PCIe link, network becomes unusable

2024-01-13 Thread Arno Lehmann

Hi Salvatore,

On 13.01.2024 at 13:47, Salvatore Bonaccorso wrote:


Just to be clear, can you confirm this is or is not a regression from
a previous running 6.1.y kernel?


On this hardware, the network issues appeared right from the start.

First time I encountered it was with the Debian installation some time
last year, and that's where my research led me to turn off PCIe power
management.


Actually I don't even know which was the first kernel version I had on 
this host, but it's been on Bookworm for all its lifetime.



I'm asking because I suspect that
this is similar to
https://lore.kernel.org/intel-wired-lan/20221031170535.77be0...@kernel.org/
and that it never worked reliably with your hardware?


The symptoms sound quite different to me. But I can't claim to know 
anything interesting about the different functionalities of PCIe or the 
Linux way to use them...


Cheers,

Arno





Bug#1060706: linux-image-6.1.0-17-amd64: intel i225 NIC loses PCIe link, network becomes unusable

2024-01-13 Thread Arno Lehmann

On 13.01.2024 at 12:48, Diederik de Haas wrote:

On Saturday, 13 January 2024 11:45:29 CET Arno Lehmann wrote:

Hardware name: ASUS System Product Name/ROG STRIX X670E-A GAMING WIFI,
BIOS 1410 04/28/2023


Possibly not related, but there's BIOS 1807 available.


I'll definitely give that a try -- when I'm physically close to the box! 
Thanks for reminding me!


Arno




Bug#1060706: linux-image-6.1.0-17-amd64: intel i225 NIC loses PCIe link, network becomes unusable

2024-01-13 Thread Arno Lehmann
Package: src:linux
Version: 6.1.69-1
Severity: normal
Tags: upstream

Dear Maintainer,


After the computer has been running for a while, the network loses its
connection because the NIC detaches from PCIe. I suspect this is related to
power management, but I am not really sure.

As this seemed to be a known problem, I added pcie_aspm=off to the kernel
command line.

The problem happens more or less randomly; the computer is usually running 24/7:

# journalctl --grep 'PCIe link lost' --quiet | cat
Sep 20 14:21:17 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached
Okt 06 05:44:20 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached
Okt 07 16:39:10 Zwerg kernel: igc :0a:00.0 (unnamed net_device) (uninitialized): PCIe link lost, device now detached
Okt 23 18:31:25 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached
Okt 30 11:16:06 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached
Okt 31 13:50:06 Zwerg kernel: igc :0a:00.0 (unnamed net_device) (uninitialized): PCIe link lost, device now detached
Nov 22 18:59:11 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached
Nov 23 15:45:49 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached
Dez 19 07:33:02 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached
Jan 01 09:57:40 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached
Jan 10 16:15:20 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached
Jan 13 11:16:31 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now detached
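
How irregular the failures are can be shown by computing the time between consecutive events. The following is only a sketch under stated assumptions: the journal's short timestamps carry no year, so the years below (2023 for Sep to Dez, 2024 for Jan) are assumed from the date of this report, and `MONTHS`, `parse` and `gaps` are illustrative names.

```python
from datetime import datetime

# Month abbreviations as they appear in the journal output above
# (German locale); extend the table for other months.
MONTHS = {"Sep": 9, "Okt": 10, "Nov": 11, "Dez": 12, "Jan": 1, "Feb": 2}

def parse(stamp, year):
    """Turn 'Okt 06 05:44:20' plus an assumed year into a datetime."""
    mon, day, clock = stamp.split()
    hour, minute, second = (int(part) for part in clock.split(":"))
    return datetime(year, MONTHS[mon], int(day), hour, minute, second)

def gaps(events):
    """Return the time differences between consecutive events, in order."""
    times = sorted(parse(stamp, year) for stamp, year in events)
    return [later - earlier for earlier, later in zip(times, times[1:])]

# A few of the events listed above; the years are an assumption.
events = [
    ("Sep 20 14:21:17", 2023),
    ("Nov 22 18:59:11", 2023),
    ("Nov 23 15:45:49", 2023),
    ("Dez 19 07:33:02", 2023),
    ("Jan 01 09:57:40", 2024),
]
deltas = gaps(events)
print("shortest:", min(deltas), "longest:", max(deltas))
```

Events logged with "(uninitialized)" follow an earlier loss within the same boot, so they may be better left out when measuring how long the NIC stays up.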


This is what I find in the kernel or system log:

Jan 13 11:16:31 Zwerg kernel: igc :0a:00.0 eno1: PCIe link lost, device now 
detached
Jan 13 11:16:31 Zwerg kernel: [ cut here ]
Jan 13 11:16:31 Zwerg kernel: igc: Failed to read reg 0xc030!
Jan 13 11:16:31 Zwerg kernel: WARNING: CPU: 18 PID: 6389 at 
drivers/net/ethernet/intel/igc/igc_main.c:6482 igc_rd32+0x91/0xa0 [igc]
Jan 13 11:16:31 Zwerg kernel: Modules linked in: rfcomm cpufreq_userspace 
cpufreq_powersave cpufreq_ondemand cpufreq_conservative nfsv3 nfs_acl rpcs>
Jan 13 11:16:31 Zwerg kernel:  configfs efivarfs ip_tables x_tables autofs4 xfs 
libcrc32c crc32c_generic dm_crypt dm_mod hid_generic amdgpu crc32_pc>
Jan 13 11:16:31 Zwerg kernel: CPU: 18 PID: 6389 Comm: kworker/18:1 Not tainted 
6.1.0-17-amd64 #1  Debian 6.1.69-1
Jan 13 11:16:31 Zwerg kernel: Hardware name: ASUS System Product Name/ROG STRIX 
X670E-A GAMING WIFI, BIOS 1410 04/28/2023
Jan 13 11:16:31 Zwerg kernel: Workqueue: events igc_watchdog_task [igc]
Jan 13 11:16:31 Zwerg kernel: RIP: 0010:igc_rd32+0x91/0xa0 [igc]
Jan 13 11:16:31 Zwerg kernel: Code: 48 c7 c6 d0 55 56 c0 e8 0b 7d 6c f8 48 8b 
bd 28 ff ff ff e8 31 c7 23 f8 84 c0 74 b4 89 de 48 c7 c7 f8 55 56 c0 e>
Jan 13 11:16:31 Zwerg kernel: RSP: 0018:ac56d5f13df0 EFLAGS: 00010286
Jan 13 11:16:31 Zwerg kernel: RAX:  RBX: c030 RCX: 
0027
Jan 13 11:16:31 Zwerg kernel: RDX: a046f85a03a8 RSI: 0001 RDI: 
a046f85a03a0
Jan 13 11:16:31 Zwerg kernel: RBP: a03f45710c28 R08:  R09: 
ac56d5f13c68
Jan 13 11:16:31 Zwerg kernel: R10: 0003 R11: a04717f7ffe8 R12: 
a03f4571
Jan 13 11:16:31 Zwerg kernel: R13:  R14: a03f456efd40 R15: 
c030
Jan 13 11:16:31 Zwerg kernel: FS:  () 
GS:a046f858() knlGS:
Jan 13 11:16:31 Zwerg kernel: CS:  0010 DS:  ES:  CR0: 80050033
Jan 13 11:16:31 Zwerg kernel: CR2: 7f1fc894f000 CR3: 0008a8538000 CR4: 
00750ee0
Jan 13 11:16:31 Zwerg kernel: PKRU: 5554
Jan 13 11:16:31 Zwerg kernel: Call Trace:
Jan 13 11:16:31 Zwerg kernel:  


Evidently, the kernel parameter to disable PCIe power management did not
solve this problem.

The way to recover is to restart the computer.


-- Package-specific info:
** Version:
Linux version 6.1.0-17-amd64 (debian-kernel@lists.debian.org) (gcc-12 (Debian 
12.2.0-14) 12.2.0, GNU ld (GNU Binutils for Debian) 2.40) #1 SMP 
PREEMPT_DYNAMIC Debian 6.1.69-1 (2023-12-30)

** Command line:
BOOT_IMAGE=/vmlinuz-6.1.0-17-amd64 root=/dev/mapper/Zwerg--vg-root ro 
pcie_aspm=off quiet

** Not tainted

** Kernel log:
Unable to read kernel log; any relevant messages should be attached

** Model information
sys_vendor: ASUS
product_name: System Product Name
product_version: System Version
chassis_vendor: Default string
chassis_version: Default string
bios_vendor: American Megatrends Inc.
bios_version: 1410
board_vendor: ASUSTeK COMPUTER INC.
board_name: ROG STRIX X670E-A GAMING WIFI
board_version: Rev 1.xx

** Loaded modules:
rfcomm
cpufreq_userspace
cpufreq_powersave
cpufreq_ondemand
cpufreq_conservative
nfsv3
nfs_acl
rpcsec_gss_krb5
auth_rpcgss
nfsv4
dns_resolver
nfs
lockd
grace
fscache
netfs
qrtr
overlay
cmac
algif_hash
algif_skcipher
af_alg
bnep
sunrpc