> -------- Original Message --------
> From: David Christensen <dpchr...@holgerdanske.com>
> To: debian-user@lists.debian.org
> Subject: Re: Weird behaviour on System under high load
> Date: Sun, 21 May 2023 03:11:43 -0700
>
> On 5/21/23 01:14, Christian wrote:
> > > -------- Original Message --------
> > > From: David Christensen <dpchr...@holgerdanske.com>
> > > To: debian-user@lists.debian.org
> > > Subject: Re: Weird behaviour on System under high load
> > > Date: Sat, 20 May 2023 18:00:48 -0700
> > >
> > > On 5/20/23 14:46, Christian wrote:
> > > > Hi there,
> > > >
> > > > I am having trouble with a new build system. It works normally
> > > > and is stable until I put extreme stress on it, e.g. using all
> > > > 12 cores with the stress tool.
> > > >
> > > > The system will suddenly lose its network connection and become
> > > > unresponsive. Only a reset works. I am not sure what is going
> > > > on, but it is reproducible: put stress on the system and it
> > > > fails. It seems that something is getting out of step.
> > > >
> > > > Stuff below I found in the logs. I tried quite a bit, even
> > > > upgraded to bookworm, to see if the newer kernel works.
> > > >
> > > > If anyone knows how to analyze this issue, it would be very
> > > > helpful.
> > >
> > > Please use inline posting style and proper indentation.

Phew... that will be quite hard to read. But here you go.

> > > Have you verified that your PSU has sufficient capacity for the
> > > load on each and every rail?
>
> > Hi there,
> >
> > Let's go through the different topics:
> > - Setup: It is an AMD 5600G
> >   https://www.amd.com/en/products/apu/amd-ryzen-5-5600g
> >   65 W
> >
> >   on an ASRock B550M-ITX/ac,
> >   https://www.asrock.com/mb/AMD/B550M-ITXac/index.asp
> >
> >   powered by a BeQuiet SP7 300W
> >
> > - Power: From the specifications it should fit. As it takes 5-20
> >   minutes for the error to occur, I would take that as an
> >   indication that the power supply is OK. Otherwise I would expect
> >   it to fail right away? Is there a way to measure/test whether
> >   there is any issue with it?
> >   I also tried limiting PPT to 45W, which also makes no difference.
>
> If all you have is a motherboard, a 65W CPU, and an SSD, that looks
> like a good quality 300W PSU and I would think it should support
> long-term full loading of the CPU. But, there is no substitute for
> doing the engineering.
>
> I do PSU calculations using a spreadsheet. This requires finding
> power specifications (or making estimates) for everything in the
> system, which can be tough.
>
> BeQuiet has a PSU calculator. I suggest using it:
>
> https://www.bequiet.com/en/psucalculator
>
> Measuring actual power supply output and system usage would involve
> building or buying suitable test equipment. The cost would be
> non-trivial.
>
> An easy A/B test would be to connect a known-good, high-quality PSU
> with a higher power rating (say, 500-1000W). I use:
>
> https://www.fractal-design.com/products/power-supplies/ion/ion-2-platinum-660w/black/

I used the calculator; however, it might be that the onboard graphics
are not accounted for properly. I will see that I get a 500W PSU for
testing.

> > > Have you cleaned the system interior, filters, fans, heatsinks,
> > > ducts, etc., recently?

As written in the OP, the system is new. Only the PSU is used. So it
is clean.

> > > Have you tested the thermal solution(s) recently?
>
> > - Thermal: I am observing the temperatures during the stress test.
> >   If I am reading Smbusmaster0 correctly, temps haven't been above
> >   71°C, but the error also occurs earlier, way below 70.
>
> Okay.
>
> What is your CPU thermal solution?

What is a thermal solution?

> What stresstest are you using?

stress running in s-tui

> > > Have you tested the power supply recently?

It was working before without issues, so not explicitly tested.

> I suffered a rash of bad PSUs recently. I was able to figure it out
> because I bought an inexpensive PSU tester years ago. It has saved my
> sanity more than once. I suggest that you buy something like it:
>
> https://www.ebay.com/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw=antec+atx12+tester&_sacat=0

I am not building regularly, so I would need to borrow such equipment
somewhere.

> > > Have you tested the memory recently?
>
> > - Memory: Yes, it was tested right after the build with no errors.
>
> Okay.
>
> Did you do multi-threaded/ stress tests?

Yes, stress is running multiple threads. So far it has been stable
only with 2 threads. However, it takes longer for the errors to come
up when using fewer threads, so it might be that I did not test long
enough.

> > > Are you running Debian stable?
> > >
> > > Are you running Debian stable packages only? Were they all
> > > installed with the same package manager?

I have docker and log2ram as additional sources, and now debmatic.

> > - OS: I was running Debian stable in quite a minimal configuration
> >   (fresh install, as most services are dockerized) when I first
> >   observed the error. I have now moved to Debian 12/Bookworm to see
> >   if a newer kernel makes any difference (it does not). I also
> >   exchanged r8169 for r8168. It changes the error messages, however
> >   the system instability stays.
>
> Did you see the problems when running Debian stable OOTB, before
> adding anything?

I would need to do this with a live USB, to have it run OOTB.

> Did you stress test the system before adding anything (other than the
> stress test)?

No, I did the basic setup of my system first, then encountered the
error. I will try with a live USB.

> > > If all of the above are okay and the system is still locking up,
> > > I would disable or remove all disks in the system, install a
> > > zeroed SSD, install Debian stable choosing only "SSH server" and
> > > "standard system utilities", install only the stable packages
> > > required for your workload, put the workload on it, and see what
> > > happens.
>
> > I could disconnect the disks and see if it makes any difference.
> > However, when reproducing this error, disks other than the system
> > disk were unmounted. So I would guess this would be a test to see
> > if it is about power?
>
> Stripping the system down to minimum hardware and software is a good
> starting point. You will need a tool to load the system and some
> means to watch what happens. Assuming the base configuration passes
> all tests, then add something, test, and repeat until testing fails.
>
> Here is a Perl script I wrote for loading the CPU.
> It should run on a base install of Debian OOTB:
>
> 2023-05-21 02:24:44 dpchrist@taz ~/home
> $ cat exercise-cpu
> #!/usr/bin/env perl
> # $Id: exercise-cpu,v 1.1 2023/04/10 02:05:22 dpchrist Exp $
> # by David Paul Christensen dpchr...@holgerdanske.com
> # Public Domain
> #
> # Exercise central processing unit
>
> use threads;
> use strict;
> use warnings;
>
> use File::Basename;
> use Time::HiRes qw( sleep time );
>
> die sprintf "Usage: %s PERCENT DURATION\n", basename($0)
>     unless @ARGV == 2;
>
> my $a = 0.01 * shift;          # periodic exercise duration
> my $b = 1 - $a;                # periodic sleep duration
>
> $_ = qx/lscpu/;                # Debian GNU/Linux
> my ($c) = /CPU.s.:\s+(\d+)/;   # number of virtual cores
>
> my $e = time + shift;          # time to end
>
> my @thr;                       # threads
>
> push @thr, async {
>     while (time < $e) {
>         my $d = time + $a / 10;
>         1 while time < $d;
>         sleep $b/10;
>     }
> } for 1..$c;
>
> $_->join for @thr;
>
>
> Run it like this:
>
> 2023-05-21 02:50:06 dpchrist@taz ~/home
> $ ./exercise-cpu
> Usage: exercise-cpu PERCENT DURATION
>
> 2023-05-21 02:50:52 dpchrist@taz ~/home
> $ ./exercise-cpu 25 10
>
> 2023-05-21 02:51:33 dpchrist@taz ~/home
> $ ./exercise-cpu 50 10
>
> 2023-05-21 02:51:48 dpchrist@taz ~/home
> $ ./exercise-cpu 75 10
>
> 2023-05-21 02:52:01 dpchrist@taz ~/home
> $ ./exercise-cpu 100 10
>
>
> I install Xfce when installing Debian and use the Xfce plugins to
> watch CPU loading and CPU temperature. The above tests loaded all
> virtual cores at the specified percentage for the specified duration.
> CPU temperature peaked at 32 C, 38 C, 66 C and 72 C, respectively.
>
>
> Having a Debian install on a USB 3.0 flash drive is very useful for
> troubleshooting and for imaging, backup/restore, archiving, integrity
> checking, migration, validation, etc.

As said above, I will try with a live USB.

> David
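
The thread asks for "some means to watch what happens" while the
system is loaded. Because the machine locks up hard, readings shown in
a desktop plugin or in s-tui are lost at the moment of failure. A
minimal sketch of one alternative is to append the temperatures to a
file once per second, so the last reading before the lock-up survives
the reset. It assumes the kernel exposes the sensors through the
standard hwmon sysfs files (/sys/class/hwmon/hwmon*/temp*_input, values
in millidegrees Celsius). The function names millideg_to_c and
log_temps are made up for illustration and do not come from the thread.

```shell
#!/bin/sh
# Sketch only: helper functions for logging hwmon temperatures.
# Assumption: sensors appear as /sys/class/hwmon/hwmon*/temp*_input.

# Convert a millidegree reading (hwmon convention) to whole degrees C.
millideg_to_c() {
    echo $(( $1 / 1000 ))
}

# Print one line: epoch seconds followed by every readable temperature.
log_temps() {
    line=$(date +%s)
    for f in /sys/class/hwmon/hwmon*/temp*_input; do
        # Skip the unexpanded pattern (no sensors) and unreadable files.
        [ -r "$f" ] || continue
        # Some sensors return an error on read; skip those as well.
        v=$(cat "$f" 2>/dev/null) || continue
        [ -n "$v" ] || continue
        line="$line $(millideg_to_c "$v")"
    done
    echo "$line"
}
```

One way to use it, assuming the functions are saved in a hypothetical
file temp-log.sh and the log goes to a hypothetical temps.log: source
the file, start "while :; do log_temps >> temps.log; sync; sleep 1;
done &" in the background, then run "./exercise-cpu 100 600". The sync
after each line is there so that the data is on disk before the next
sample. After a lock-up and reset, the last lines of temps.log show
the temperatures just before the failure.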