On 5/20/23 14:46, Christian wrote:
Hi there,

I am having trouble with a newly built system. It runs normally and
stably until I put extreme stress on it, e.g. loading all 12 cores with
the stress tool.

The system will suddenly lose its network connection and become
unresponsive. Only a reset helps. I am not sure what is going on, but it
is reproducible: put stress on the system and it fails. It seems that
something is getting out of step.

I found the messages below in the logs. I have tried quite a bit, even
upgrading to bookworm to see whether the newer kernel helps.

If anyone knows how to analyze this issue, it would be very helpful.

Kind regards
   Christian


2023-05-20T20:12:17.054224+02:00 diskstation kernel: [ 1303.236428] ------------[ cut here ]------------
2023-05-20T20:12:17.054234+02:00 diskstation kernel: [ 1303.236430]
NETDEV WATCHDOG: enp3s0 (r8169): transmit queue 0 timed out
2023-05-20T20:12:17.054235+02:00 diskstation kernel: [ 1303.236437]
WARNING: CPU: 5 PID: 2411 at net/sched/sch_generic.c:525
dev_watchdog+0x207/0x210
2023-05-20T20:12:17.054236+02:00 diskstation kernel: [ 1303.236442]
Modules linked in: eq3_char_loop(OE) rpi_rf_mod_led(OE) ledtrig_timer
ledtrig_default_on xt_MASQUERADE nf_conntrack_netlink xfrm_user
xfrm_algo xt_addrtype br_netfilter bridge stp llc overlay ip6t_rt
nft_chain_nat nf_nat xt_set xt_tcpmss xt_tcpudp xt_conntrack
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nft_compat nf_tables
ip_set_hash_ip ip_set binfmt_misc nfnetlink nls_ascii nls_cp437 vfat
fat amdgpu iwlmvm btusb intel_rapl_msr btrtl intel_rapl_common btbcm
btintel edac_mce_amd btmtk mac80211 snd_hda_codec_realtek bluetooth
snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi gpu_sched
kvm_amd drm_buddy libarc4 snd_hda_intel drm_display_helper
snd_intel_dspcfg snd_intel_sdw_acpi iwlwifi kvm cec snd_hda_codec
jitterentropy_rng irqbypass rc_core snd_hda_core cfg80211 snd_hwdep
drm_ttm_helper snd_pcm ttm drbg wmi_bmof rapl ccp snd_timer ansi_cprng
drm_kms_helper sp5100_tco snd pcspkr ecdh_generic rng_core i2c_algo_bit
watchdog soundcore k10temp rfkill hb_rf_usb_2(OE) ecc
2023-05-20T20:12:17.054240+02:00 diskstation kernel: [ 1303.236494]
generic_raw_uart(OE) acpi_cpufreq button joydev evdev sg nct6775
nct6775_core drm hwmon_vid fuse loop efi_pstore configfs efivarfs
ip_tables x_tables autofs4 ext4 crc16 mbcache jbd2 btrfs
blake2b_generic xor raid6_pq zstd_compress libcrc32c crc32c_generic
dm_crypt dm_mod hid_generic usbhid hid sd_mod crc32_pclmul crc32c_intel
ahci ghash_clmulni_intel sha512_ssse3 libahci xhci_pci sha512_generic
xhci_hcd r8169 nvme realtek libata aesni_intel nvme_core t10_pi
crypto_simd mdio_devres usbcore scsi_mod crc64_rocksoft_generic cryptd
libphy crc64_rocksoft crc_t10dif i2c_piix4 crct10dif_generic
crct10dif_pclmul crc64 crct10dif_common usb_common scsi_common video
wmi gpio_amdpt gpio_generic
2023-05-20T20:12:17.054241+02:00 diskstation kernel: [ 1303.236534]
CPU: 5 PID: 2411 Comm: stress Tainted: G           OE      6.1.0-9-amd64 #1  Debian 6.1.27-1
2023-05-20T20:12:17.054241+02:00 diskstation kernel: [ 1303.236536]
Hardware name: To Be Filled By O.E.M. B550M-ITX/ac/B550M-ITX/ac, BIOS
L2.62 01/31/2023
2023-05-20T20:12:17.054242+02:00 diskstation kernel: [ 1303.236537]
RIP: 0010:dev_watchdog+0x207/0x210
2023-05-20T20:12:17.054242+02:00 diskstation kernel: [ 1303.236540]
Code: 00 e9 40 ff ff ff 48 89 df c6 05 ff 5f 3d 01 01 e8 be 79 f9 ff 44
89 e9 48 89 de 48 c7 c7 c8 16 9b a8 48 89 c2 e8 09 d2 86 ff <0f> 0b e9
22 ff ff ff 66 90 0f 1f 44 00 00 55 53 48 89 fb 48 8b 6f
2023-05-20T20:12:17.054243+02:00 diskstation kernel: [ 1303.236541]
RSP: 0000:ffffa831c345fdc8 EFLAGS: 00010286
2023-05-20T20:12:17.054243+02:00 diskstation kernel: [ 1303.236543]
RAX: 0000000000000000 RBX: ffff91a3c1410000 RCX: 0000000000000000
2023-05-20T20:12:17.054243+02:00 diskstation kernel: [ 1303.236544]
RDX: 0000000000000103 RSI: ffffffffa893fa66 RDI: 00000000ffffffff
2023-05-20T20:12:17.054244+02:00 diskstation kernel: [ 1303.236545]
RBP: ffff91a3c1410488 R08: 0000000000000000 R09: ffffa831c345fc38
2023-05-20T20:12:17.054244+02:00 diskstation kernel: [ 1303.236546]
R10: 0000000000000003 R11: ffff91aafe27afe8 R12: ffff91a3c14103dc
2023-05-20T20:12:17.054245+02:00 diskstation kernel: [ 1303.236547]
R13: 0000000000000000 R14: ffffffffa7e2e7a0 R15: ffff91a3c1410488
2023-05-20T20:12:17.054245+02:00 diskstation kernel: [ 1303.236548] FS:
00007f169849d740(0000) GS:ffff91aade340000(0000) knlGS:0000000000000000
2023-05-20T20:12:17.054246+02:00 diskstation kernel: [ 1303.236550] CS:
0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2023-05-20T20:12:17.054246+02:00 diskstation kernel: [ 1303.236551]
CR2: 000055d05c3f4000 CR3: 0000000103cf2000 CR4: 0000000000750ee0
2023-05-20T20:12:17.054246+02:00 diskstation kernel: [ 1303.236552]
PKRU: 55555554
2023-05-20T20:12:17.054247+02:00 diskstation kernel: [ 1303.236553]
Call Trace:
2023-05-20T20:12:17.054247+02:00 diskstation kernel: [ 1303.236554]
<TASK>
2023-05-20T20:12:17.054248+02:00 diskstation kernel: [ 1303.236557]  ?
pfifo_fast_reset+0x140/0x140
2023-05-20T20:12:17.054248+02:00 diskstation kernel: [ 1303.236559]
call_timer_fn+0x27/0x130
2023-05-20T20:12:17.054248+02:00 diskstation kernel: [ 1303.236562]
__run_timers+0x21c/0x2a0
2023-05-20T20:12:17.054249+02:00 diskstation kernel: [ 1303.236565]
run_timer_softirq+0x2b/0x50
2023-05-20T20:12:17.054249+02:00 diskstation kernel: [ 1303.236567]
__do_softirq+0xf0/0x2fe
2023-05-20T20:12:17.054249+02:00 diskstation kernel: [ 1303.236570]
__irq_exit_rcu+0xc7/0x130
2023-05-20T20:12:17.054250+02:00 diskstation kernel: [ 1303.236573]
sysvec_apic_timer_interrupt+0x52/0xc0
2023-05-20T20:12:17.054250+02:00 diskstation kernel: [ 1303.236576]
asm_sysvec_apic_timer_interrupt+0x16/0x20
2023-05-20T20:12:17.054251+02:00 diskstation kernel: [ 1303.236578]
RIP: 0033:0x7f16984e085c
2023-05-20T20:12:17.054251+02:00 diskstation kernel: [ 1303.236579]
Code: 48 89 44 24 08 31 c0 f0 0f b1 15 fb 3e 19 00 75 3d 48 8d 74 24 04
48 8d 3d f1 1f 19 00 e8 1c 04 00 00 31 c0 87 05 e0 3e 19 00 <83> f8 01
7f 2f 48 63 44 24 04 48 8b 54 24 08 64 48 2b 14 25 28 00
2023-05-20T20:12:17.054252+02:00 diskstation kernel: [ 1303.236581]
RSP: 002b:00007fffb2c4cca0 EFLAGS: 00000246
2023-05-20T20:12:17.054252+02:00 diskstation kernel: [ 1303.236582]
RAX: 0000000000000001 RBX: 0000000000000000 RCX: 00007f169867221c
2023-05-20T20:12:17.054253+02:00 diskstation kernel: [ 1303.236583]
RDX: 00007f1698672228 RSI: 00007fffb2c4cca4 RDI: 00007f1698672840
2023-05-20T20:12:17.054253+02:00 diskstation kernel: [ 1303.236584]
RBP: 00000000000080e8 R08: 00007f1698672228 R09: 00007f1698672260
2023-05-20T20:12:17.054253+02:00 diskstation kernel: [ 1303.236585]
R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000000
2023-05-20T20:12:17.054254+02:00 diskstation kernel: [ 1303.236586]
R13: 0000565167761004 R14: 0000565167761a78 R15: 000000000000000b
2023-05-20T20:12:17.054254+02:00 diskstation kernel: [ 1303.236588]
</TASK>
2023-05-20T20:12:17.054255+02:00 diskstation kernel: [ 1303.236589] ---[ end trace 0000000000000000 ]---
2023-05-20T20:12:17.086199+02:00 diskstation kernel: [ 1303.270878]
r8169 0000:03:00.0 enp3s0: rtl_rxtx_empty_cond == 0 (loop: 42, delay: 100).



Have you verified that your PSU has sufficient capacity for the load on each and every rail?


Have you cleaned the system interior, filters, fans, heatsinks, ducts, etc., recently?


Have you tested the thermal solution(s) recently?
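
One way to watch temperatures while the stress run is in progress is to poll the kernel's hwmon sysfs files. Below is a minimal Python sketch, assuming the sensors are exposed under /sys/class/hwmon (the k10temp and nct6775 modules in the trace suggest they should be). The function name read_hwmon_temps is only illustrative, and which sensors appear depends on the board.

```python
from pathlib import Path


def read_hwmon_temps(base="/sys/class/hwmon"):
    """Return {"chip/tempN_input": degrees Celsius} for each readable sensor.

    Assumes the standard hwmon layout: one hwmonN directory per chip, a
    "name" file with the chip name, and tempN_input files holding
    millidegrees Celsius.
    """
    temps = {}
    for chip_dir in sorted(Path(base).glob("hwmon*")):
        name_file = chip_dir / "name"
        # Fall back to the directory name if the chip does not report one.
        chip = name_file.read_text().strip() if name_file.exists() else chip_dir.name
        for temp_file in sorted(chip_dir.glob("temp*_input")):
            try:
                millidegrees = int(temp_file.read_text().strip())
            except (OSError, ValueError):
                # Unconnected or faulty sensors can fail to read; skip them.
                continue
            temps[f"{chip}/{temp_file.name}"] = millidegrees / 1000.0
    return temps
```

Calling this once a second from a second terminal and printing the result while stress is running should show whether any reading climbs sharply just before the network drops.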


Have you tested the power supply recently?


Have you tested the memory recently?


Are you running Debian stable?


Are you running Debian stable packages only? Were they all installed with the same package manager?


If all of the above are okay and the system is still locking up, I would disable or remove all disks in the system, install a zeroed SSD, install Debian stable choosing only "SSH server" and "standard system utilities", install only the stable packages required for your workload, put the workload on it, and see what happens.
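
When repeating the stress test on the minimal install, it may help to check whether the same watchdog warning shows up again. Below is a rough Python sketch that scans saved kernel log text for the "NETDEV WATCHDOG" message seen in the trace. The function name find_tx_timeouts is hypothetical, and the pattern is based only on the message as it appears in this log.

```python
import re

# Matches e.g. "NETDEV WATCHDOG: enp3s0 (r8169): transmit queue 0 timed out".
# \s+ is used between words so mail-wrapped log text still matches.
WATCHDOG_RE = re.compile(
    r"NETDEV\s+WATCHDOG:\s+(?P<iface>\S+)\s+\((?P<driver>[^)]+)\):\s+"
    r"transmit\s+queue\s+(?P<queue>\d+)\s+timed\s+out"
)


def find_tx_timeouts(log_text):
    """Return a list of (interface, driver, queue) for each watchdog hit."""
    return [
        (m["iface"], m["driver"], int(m["queue"]))
        for m in WATCHDOG_RE.finditer(log_text)
    ]
```

Feeding it the contents of a saved kernel log after each run gives a quick yes/no on whether the transmit timeout recurred, and on which interface and driver.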


David
