On 2023-07-25 09:02, Niels Möller wrote:
> Curious. I would suspect some environmental factors (e.g., high processor
> temperature sometimes leading to some frequency throttling, or something
> else on the machine periodically competing for cpu).
Thanks for responding.
That machine runs completely alone.
It just completed 908 rounds of the 11121 rounds.
The pattern changed a bit, once there was an upward phase before going
back:
https://stamm-wilbrandt.de/en/forum/908.png
Each of the 908 rounds on the x-axis takes nearly 10 minutes, so the
plateaus seen last a long time. I am on that machine only briefly, to
fetch the latest data or so.
Because of the Ryzen 7000 series burnout risk I applied the latest stable
BIOS (1.24) before that run. The 7600X CPU's maximum operating temperature
is 95°C, and I operated it with PBO set to 85°C before the BIOS upgrade.
After the upgrade I cannot set it above 75°C anymore, and the maximum
temperature measured in a stress test on all threads was 76°C. So high
temperature cannot be the cause, and neither can throttling.
I did "perf stat process" before starting the run, which has now lasted
more than 6 days, and saw that the CPU frequency was always in the
5.2..5.4GHz range (5.3GHz is the nominal burst CPU frequency).
Here is the current state of the PC:
hermann@7600x:~/9383761-digit-prime$ uptime
21:46:24 up 6 days, 4:12, 1 user, load average: 1.00, 1.00, 1.00
hermann@7600x:~/9383761-digit-prime$
hermann@7600x:~/9383761-digit-prime$ ps -ef | grep job
root 629465 629454 99 Jul22 ? 4-04:00:15 ./job 297
hermann 1947666 1942377 0 21:46 pts/0 00:00:00 grep --color=auto job
hermann@7600x:~/9383761-digit-prime$
I had to do a hotfix after 2 days of computation to avoid a bug that was
guaranteed to happen after 75 days; details here:
https://github.com/Hermann-SW/9383761-digit-prime#hotfix
The hotfix tool cost less than 10 seconds of non-computing time in total.
> Maybe it would be helpful to monitor performance related measurements
> such as cpu frequency (or cycle counter), temperature, and count of
> cache misses as computation progresses?
AMD's temperature measurement support on Linux is not good, but I found
an article describing a kernel driver change that should fix it. I do not
want to risk the current >70 day computation though, so on the Linux side
I am blind temperature-wise at the moment (all tools report wrong values
because of the bad kernel support).
But I just learned that I can attach perf to a running process, and did
that: twice for 5 seconds and once for 15 seconds. It always showed
5.23GHz, near the nominal 5.3GHz burst frequency.
How do you want me to count cache misses, with perf?
Over a longer timeframe I think, a multiple of the 10min/round?
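A minimal sketch of what such a longer measurement could look like,
assuming the generic cache-references and cache-misses events are
available on this CPU ("perf list" shows what is). The function names
sample_cache and miss_rate are made up for illustration, not existing
tools:

```shell
# Hypothetical helper: count cache events of a running process.
# Usage: sample_cache PID SECONDS
sample_cache() {
    # -x, selects CSV output; perf stat writes its counts to stderr.
    sudo perf stat -x, -e cache-references,cache-misses,cycles,task-clock \
        -p "$1" sleep "$2" 2>&1
}

# Hypothetical helper: miss rate in percent from two raw counts.
# Usage: miss_rate MISSES REFERENCES
miss_rate() {
    awk -v m="$1" -v r="$2" 'BEGIN { printf "%.2f\n", 100 * m / r }'
}
```

For example, "sample_cache 629465 600" would cover roughly one 10 minute
round, and miss_rate turns the two reported counts into a percentage
that can be compared between fast and slow rounds.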
hermann@7600x:~/9383761-digit-prime$ ps -ef | grep job
root 629465 629454 99 Jul22 ? 4-04:08:14 ./job 297
hermann 1949425 1942377 0 21:54 pts/0 00:00:00 grep --color=auto job
hermann@7600x:~/9383761-digit-prime$ echo 1 | sudo tee /proc/sys/kernel/perf_event_paranoid
[sudo] password for hermann:
1
hermann@7600x:~/9383761-digit-prime$ sudo perf stat -e cycles,task-clock -p 629465 sleep 5
Performance counter stats for process id '629465':
26,173,724,427 cycles # 5.234 GHz
5,000.98 msec task-clock # 1.000 CPUs utilized
5.001249395 seconds time elapsed
hermann@7600x:~/9383761-digit-prime$ sudo perf stat -e cycles,task-clock -p 629465 sleep 5
Performance counter stats for process id '629465':
26,156,801,154 cycles # 5.231 GHz
5,000.58 msec task-clock # 1.000 CPUs utilized
5.000851191 seconds time elapsed
hermann@7600x:~/9383761-digit-prime$ sudo perf stat -e cycles,task-clock -p 629465 sleep 15
Performance counter stats for process id '629465':
78,483,835,106 cycles # 5.232 GHz
14,999.67 msec task-clock # 1.000 CPUs utilized
15.000559563 seconds time elapsed
hermann@7600x:~/9383761-digit-prime$
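As far as I understand, the GHz figure perf prints is simply the cycle
count divided by the task-clock time, so it can be cross-checked by
hand. A small sketch, where ghz is a made-up helper name:

```shell
# Hypothetical helper: average frequency in GHz.
# Usage: ghz CYCLES TASK_CLOCK_MSEC
ghz() {
    # cycles / (msec * 1e6) = cycles per nanosecond = GHz
    awk -v c="$1" -v ms="$2" 'BEGIN { printf "%.3f\n", c / (ms * 1e6) }'
}

ghz 26173724427 5000.98    # first 5 second sample
ghz 78483835106 14999.67   # 15 second sample
```

Both calls should come out at the same roughly 5.23GHz that perf
reported for those samples.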
Regards,
Hermann.
_______________________________________________
gmp-devel mailing list
gmp-devel@gmplib.org
https://gmplib.org/mailman/listinfo/gmp-devel