> -------- Original Message --------
> From: David Christensen <dpchr...@holgerdanske.com>
> To: debian-user@lists.debian.org
> Subject: Re: Weird behaviour on System under high load
> Date: Sun, 21 May 2023 03:11:43 -0700
> 
> On 5/21/23 01:14, Christian wrote:
> 
> > > -------- Original Message --------
> > > From: David Christensen <dpchr...@holgerdanske.com>
> > > To: debian-user@lists.debian.org
> > > Subject: Re: Weird behaviour on System under high load
> > > Date: Sat, 20 May 2023 18:00:48 -0700
> > > 
> > > On 5/20/23 14:46, Christian wrote:
> > > > Hi there,
> > > > 
> > > > I am having trouble with a newly built system. It runs normally
> > > > and stably until I put extreme stress on it, e.g. loading all
> > > > 12 logical cores with the stress tool.
> > > > 
> > > > The system suddenly loses its network connection and becomes
> > > > unresponsive. Only a reset helps. I am not sure what is going
> > > > on, but it is reproducible: put stress on the system and it
> > > > fails. It seems that something is getting out of step.
> > > > 
> > > > I found the stuff below in the logs. I have tried quite a bit
> > > > and even upgraded to bookworm to see if the newer kernel helps.
> > > > 
> > > > If anyone knows how to analyze this issue, it would be very
> > > > helpful.
> 
> 
> Please use inline posting style and proper indentation.

Phew... this will be quite hard to read, but here you go.

> 
> 
> > > Have you verified that your PSU has sufficient capacity for the
> > > load on each and every rail?
> 
>  > Hi there,
>  >
>  > Let's go through the different topics:
>  > - Setup: It is an AMD 5600G
> 
> https://www.amd.com/en/products/apu/amd-ryzen-5-5600g
> 
> 65 W
> 
> 
>  > on an ASRock B550M-ITX/ac,
> 
> 
> https://www.asrock.com/mb/AMD/B550M-ITXac/index.asp
> 
> 
>  > powered by a BeQuiet SP7 300W
>  >
>  > - Power: From the specifications it should fit. As it takes 5-20
>  > minutes for the error to occur, I would take that as an indication
>  > that the power supply is OK. Otherwise I would expect it to fail
>  > right away? Is there a way to measure/test if there is any issue
>  > with it? I also tried limiting PPT to 45 W, which makes no
>  > difference either.
> 
> 
> If all you have is a motherboard, a 65 W CPU, and an SSD, that looks
> like a good-quality 300 W PSU, and I would think it should support
> long-term full loading of the CPU.  But there is no substitute for
> doing the engineering.
> 
> 
> I do PSU calculations using a spreadsheet.  This requires finding
> power specifications (or making estimates) for everything in the
> system, which can be tough.
> 
> 
> BeQuiet has a PSU calculator.  I suggest using it:
> 
> https://www.bequiet.com/en/psucalculator
> 
> 
> Measuring actual power supply output and system usage would involve
> building or buying suitable test equipment.  The cost would be
> non-trivial.
> 
> 
> An easy A/B test would be to connect a known-good, high-quality PSU
> with a higher power rating (say, 500-1000 W).  I use:
> 
> https://www.fractal-design.com/products/power-supplies/ion/ion-2-platinum-660w/black/
> 
I used the calculator; however, it might be that the onboard graphics
are not properly accounted for. I will try to get a 500 W PSU for
testing.
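
For what it is worth, the spreadsheet-style PSU calculation can be
sketched in a few lines. This is only a rough sketch: the 300 W rating
comes from this thread, but every per-component wattage and the 80 %
planning limit are placeholder assumptions (the names are hypothetical
too) and need to be replaced with datasheet or measured values. It also
only checks the total, not each individual rail.

```python
# Rough PSU headroom check. Every wattage except the PSU rating is a
# placeholder ASSUMPTION; replace with datasheet or measured values.

PSU_RATED_W = 300       # from the thread: 300 W PSU
DERATE_PERCENT = 80     # assumption: plan to use at most 80 % of the rating

# Hypothetical per-component peak estimates in watts.
components = {
    "cpu_package_incl_igpu": 88,  # assumption: peak package power above the 65 W TDP
    "motherboard": 30,            # assumption
    "ram": 10,                    # assumption
    "ssd": 8,                     # assumption
    "fans": 6,                    # assumption
}


def headroom(parts, rated_w, derate_percent):
    """Return (total draw, usable capacity, remaining margin) in watts."""
    total = sum(parts.values())
    usable = rated_w * derate_percent / 100
    return total, usable, usable - total


total, usable, margin = headroom(components, PSU_RATED_W, DERATE_PERCENT)
print(f"estimated peak draw: {total} W")
print(f"usable capacity:     {usable:.0f} W")
print(f"margin:              {margin:.0f} W")
```

A negative or small margin would point at the PSU; a large one (as with
these made-up numbers) only says the total looks fine on paper, not that
the actual unit is healthy.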
> 
> > > Have you cleaned the system interior, filters, fans, heatsinks,
> > > ducts, etc., recently?
> 
> 
> ?
As written in the original post, the system is new. Only the PSU is a
used part, so it is clean.
> 
> 
> > > Have you tested the thermal solution(s) recently?
> 
>  > - Thermal: I am observing the temperatures during the stress test.
>  > If I am reading Smbusmaster0 correctly, temps haven't been above
>  > 71°C, but the error also occurs earlier, way below 70.
> 
> 
> Okay.
> 
> 
> What is your CPU thermal solution?
> 
What is a thermal solution?
> 
> What stress test are you using?
> 
The stress tool, run from within s-tui.
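
To keep a record of the temperatures up to the moment of a lock-up, the
sensor values can also be read straight from the kernel's hwmon sysfs
files, which report temperatures in millidegrees Celsius. A minimal
sketch, assuming the sensors are exposed under /sys/class/hwmon (the
function name and the key format are made up for illustration):

```python
import glob
import os


def read_temps(root="/sys/class/hwmon"):
    """Return {"hwmonN:chip/tempM": degrees_C} for every temp*_input under root."""
    temps = {}
    for path in sorted(glob.glob(os.path.join(root, "hwmon*", "temp*_input"))):
        chip_dir = os.path.dirname(path)
        hwmon = os.path.basename(chip_dir)
        try:
            with open(os.path.join(chip_dir, "name")) as f:
                chip = f.read().strip()
        except OSError:
            chip = "unknown"  # no readable name file for this chip
        sensor = os.path.basename(path)[: -len("_input")]
        try:
            with open(path) as f:
                millideg = int(f.read().strip())
        except (OSError, ValueError):
            continue  # sensor not readable right now; skip it
        temps[f"{hwmon}:{chip}/{sensor}"] = millideg / 1000.0
    return temps


# One snapshot; prints an empty dict if no sensors are found.
print(read_temps())
```

Calling read_temps() every few seconds with a timestamp and appending
the result to a file on disk would show whether the temperature is
climbing just before the system becomes unresponsive.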
> 
> > > Have you tested the power supply recently?
> 
It was working before without issues, so I have not tested it explicitly.
> 
> I suffered a rash of bad PSUs recently.  I was able to figure it out
> because I bought an inexpensive PSU tester years ago.  It has saved
> my sanity more than once.  I suggest that you buy something like it:
> 
> https://www.ebay.com/sch/i.html?_from=R40&_trksid=m570.l1313&_nkw=antec+atx12+tester&_sacat=0
> 
I do not build systems regularly, so I would need to borrow such
equipment somewhere.

> > > Have you tested the memory recently?
> 
>  > - Memory: Yes, it was tested right after the build, with no errors
> 
> 
> Okay.
> 
> 
> Did you do multi-threaded/ stress tests?
> 
Yes, stress is running multiple threads. So far it has only been stable
with 2 threads. However, it takes longer for the errors to come up when
using fewer threads, so it might be that I did not test long enough.
> 
> > > Are you running Debian stable?
> > > 
> > > 
> > > Are you running Debian stable packages only?  Were they all
> > > installed with the same package manager?
I have Docker and log2ram as additional package sources, and now
debmatic as well.
> 
>  > - OS: I was running Debian stable in quite a minimal configuration
>  > (a fresh install, as most services are dockerized) when I first
>  > observed the error. I have now moved to Debian 12/Bookworm to see
>  > if a newer kernel makes any difference (it does not). I also
>  > swapped the r8169 driver for r8168. That changes the error
>  > messages, but the system instability remains.
> 
> 
> Did you see the problems when running Debian stable OOTB, before
> adding anything?
I would need to do this with a live USB to have it run OOTB.
> 
> 
> Did you stress test the system before adding anything (other than the
> stress test)?
No, I did the basic setup of my system first and then encountered the
error. I will try with a live USB.
> 
> 
> > > If all of the above are okay and the system is still locking up, I
> > > would disable or remove all disks in the system, install a zeroed
> > > SSD, install Debian stable choosing only "SSH server" and "standard
> > > system utilities", install only the stable packages required for
> > > your workload, put the workload on it, and see what happens.
> 
>  > I could disconnect the disks and see if it makes any difference.
>  > However, when reproducing this error, all disks other than the
>  > system disk were unmounted. So I would guess this would be a test
>  > to see if it is about power?
> 
> 
> Stripping the system down to minimum hardware and software is a good
> starting point.  You will need a tool to load the system and some
> means to watch what happens.  Assuming the base configuration passes
> all tests, then add something, test, and repeat until testing fails.
> 
> 
> Here is a Perl script I wrote for loading the CPU.  It should run on
> a base install of Debian OOTB:
> 
> 2023-05-21 02:24:44 dpchrist@taz ~/home
> $ cat exercise-cpu
> #!/usr/bin/env perl
> # $Id: exercise-cpu,v 1.1 2023/04/10 02:05:22 dpchrist Exp $
> # by David Paul Christensen dpchr...@holgerdanske.com
> # Public Domain
> #
> # Exercise central processing unit
> 
> use threads;
> use strict;
> use warnings;
> 
> use File::Basename;
> use Time::HiRes qw( sleep time );
> 
> die sprintf "Usage: %s PERCENT DURATION\n", basename($0)
>      unless @ARGV == 2;
> 
> my  $a  = 0.01 * shift;         # periodic exercise duration
> my  $b  = 1 - $a;               # periodic sleep duration
> 
> $_      = qx/lscpu/;            # Debian GNU/Linux
> my ($c) = /CPU.s.:\s+(\d+)/;    # number of virtual cores
> 
> my  $e  = time + shift;         # time to end
> 
> my @thr;                        # threads
> 
> push @thr, async {
>      while (time < $e) {
>         my $d = time + $a / 10;
>         1 while time < $d;
>         sleep $b/10;
>      }
> } for 1..$c;
> 
> $_->join for @thr;
> 
> 
> Run it like this:
> 
> 2023-05-21 02:50:06 dpchrist@taz ~/home
> $ ./exercise-cpu
> Usage: exercise-cpu PERCENT DURATION
> 
> 2023-05-21 02:50:52 dpchrist@taz ~/home
> $ ./exercise-cpu 25 10
> 
> 2023-05-21 02:51:33 dpchrist@taz ~/home
> $ ./exercise-cpu 50 10
> 
> 2023-05-21 02:51:48 dpchrist@taz ~/home
> $ ./exercise-cpu 75 10
> 
> 2023-05-21 02:52:01 dpchrist@taz ~/home
> $ ./exercise-cpu 100 10
> 
> 
> I install Xfce when installing Debian and use the Xfce plugins to
> watch CPU loading and CPU temperature.  The above tests loaded all
> virtual cores at the specified percentage for the specified duration.
> CPU temperature peaked at 32 C, 38 C, 66 C and 72 C, respectively.
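
As far as the script can be read, PERCENT acts as a duty cycle: each
thread spins for PERCENT of a 0.1 second period (`$a / 10` plus
`$b / 10`) and sleeps for the rest. The arithmetic can be sketched like
this; the function name is hypothetical, and the 0.1 s default period is
taken from the divisions by 10 in the Perl code:

```python
def duty_cycle_slices(percent, period=0.1):
    """Split one period (seconds) into (busy, idle) for a given load percentage."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    busy = period * percent / 100
    return busy, period - busy


# The four load levels used in the runs above.
for pct in (25, 50, 75, 100):
    busy, idle = duty_cycle_slices(pct)
    print(f"{pct:3d} %: spin {busy * 1000:.0f} ms, sleep {idle * 1000:.0f} ms")
```

So a run at 25 means roughly 25 ms of busy-waiting followed by 75 ms of
sleep, repeated on every virtual core until DURATION seconds have passed.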
> 
> 
> Having a Debian install on a USB 3.0 flash drive is very useful for
> troubleshooting and for imaging, backup/restore, archiving, integrity
> checking, migration, validation, etc.
> 
As said above, I will try with a live USB.
> 
> David
> 
> 
