Taylor R Campbell <riastr...@netbsd.org> writes:

>> Date: Mon, 31 Jul 2023 12:47:20 -0400
>>
>> # dtrace -x nolibs -n 'sdt:xen:hardclock:jump { @ = quantize(arg1 - arg0) }
>> sdt:xen:hardclock:jump /arg2 >= 430/ { printf("hardclock jump violated
>> timecounter contract") }'
>> dtrace: description 'sdt:xen:hardclock:jump ' matched 2 probes
>> dtrace: processing aborted: Abort due to systemic unresponsiveness
>
> Well! dtrace might be unhappy if the timecounter is broken too, heh.
> So I just added a printf to the kernel in case this jump happens. Can
> you update to xen_clock.c 1.15 (and sys/arch/x86/include/cpu.h 1.135)
> and try again?

Sure...

>> The system is fine just after a reboot, it certainly seems to be a
>> requirement that a fair bit of work must be done before it gets into a
>> bad state.
>>
>> If the dtrace does continue to run, sometimes, it is impossible to exit
>> with CTRL-C. The process seems stuck in this:
>>
>> [ 4261.7158728] load: 2.64 cmd: dtrace 3295 [xclocv] 0.01u 0.02s 0% 7340k
>
> Interesting. If this is reproducible, can you enter crash or ddb and
> get a stack trace for the dtrace process, as well as output from ps,
> ps/w, and `show all tstiles'?

It appears to be reproducible, in the sense that I encountered it a
couple of times doing exactly the same workload test. I am more or less
completely unsure as to what the trigger is, however.

I probably should have mentioned that when this happened the last time,
I did have other newly created processes hang in tstile. The one in
particular that I noticed was 'fortune' from an ssh attempt: it got
stuck on login, and when I did a CTRL-T, tstile was shown.

I also probably should have mentioned that the DOM0 (NOT the DOMU) that
the target system is running under has HZ set to 1000. This is mostly
to help keep ntpd and chronyd happy on the Xen guests. If the DOM0 is
left at 100, the drift can be too much on the DOMU systems. Been
running like this for a long time...

-- 
Brad Spencer - b...@anduin.eldar.org - KC8VKS - http://anduin.eldar.org