Another reply to multiple messages in one, but starting from the last one this time, as it is the most important I think.

    Date:        Thu, 29 Jun 2023 13:52:03 +0200
    From:        Michael van Elst <mlel...@serpens.de>
    Message-ID:  <zj1wy2boykkee...@serpens.de>

  | One possibility would be that the 3401 mode didn't enable turbo frequencies
  | but actually throttled the CPU (e.g. due to a faulty BIOS). Then the low
  | temperature readings would have been only a logical consequence.

If that was the problem, then yes, that would have been a possibility, but that's backwards. At 3401 the temperature readings look OK (there's still the other problem I was initially seeking, but this one needs to be solved first).

At slower cpu frequencies, the temperatures are lower. That, when simply stated, with just that much info, looks like it deserves a "That is how it should be, run slower, generate less heat" response - as indicated by the data you gave in your previous message:

    Date:        Thu, 29 Jun 2023 13:45:11 +0200
    From:        Michael van Elst <mlel...@serpens.de>
    Message-ID:  <ZJ1ux/x3xastc...@serpens.de>

  | The Haswell CPU here (room temperature about 27C) runs idle at about 40C
  | when clocked at minimum 800, but heats up to 47C when idling at 3300 and
  | there is no difference to 3301.

That kind of thing is what I'd expect to see.

But that isn't what I am seeing. If I set 3401 as the target frequency (turbo mode - which is what Intel still calls it) the temperatures of the cores (when idling) range from the low 30's to the low 40's (as reported, how that relates to real heat in the chip is anyone's guess - but the BIOS also reports those kinds of values).

When I set 3400, which is what I have right now, if the temperatures dropped to the low 30's, or high 20's, or just stayed the same while idling as your processor does, then that would all make sense. But that isn't what happens.

I am running at 3400 now, and the coretemp readings are:

                       Current  CritMax  WarnMax  WarnMin  CritMin  Unit
[coretemp0]
  cpu0 temperature:     13.000                                      degC
[coretemp1]
  cpu1 temperature:     13.000                                      degC
[coretemp10]
  cpu10 temperature:    15.000                                      degC
[coretemp11]
  cpu11 temperature:    15.000                                      degC
[coretemp12]
  cpu12 temperature:    14.000                                      degC
[coretemp13]
  cpu13 temperature:    14.000                                      degC
[coretemp14]
  cpu14 temperature:    14.000                                      degC
[coretemp15]
  cpu15 temperature:    14.000                                      degC
[coretemp2]
  cpu2 temperature:     14.000                                      degC
[coretemp3]
  cpu3 temperature:     13.000                                      degC
[coretemp4]
  cpu4 temperature:     13.000                                      degC
[coretemp5]
  cpu5 temperature:     15.000                                      degC
[coretemp6]
  cpu6 temperature:     12.000                                      degC
[coretemp7]
  cpu7 temperature:     13.000                                      degC
[coretemp8]
  cpu8 temperature:     15.000                                      degC
[coretemp9]
  cpu9 temperature:     15.000                                      degC

Room temperature is about 21C at the minute (A/C maintained).

Even remaining at 3400, if the workload drops even more (I am replying to this mail, which means keyboard and mouse activity, some disc I/O as well, and the X server needs to be processing everything - so we can go much more idle than that, with the screens all off, so the X server has little to do, no keyboard or mouse activity, ...) then the reported temps will sometimes drop into single digits (8 or 9 ... I haven't seen less than 8).

Those values are absurdly low.

There doesn't seem to be much (if any) difference between the temps being reported whether the frequency is 3400, or 800 (highest and lowest available fixed frequencies). Maybe just one or two degrees less at 800 than at 3400 (when mostly idling ... not fully idle right now, there's also some network traffic at the minute - has been throughout this reply).

  | The xx01 frequency sets the maximum base clock and enables turbo mode...
  | on systems that support such a setting.
  |
  | On "modern CPUs" however, it is often sufficient to stay on that setting

That's what I used to do, before I started getting the original problem (not yet really reported, as I don't yet have much of an idea what is happening) - but as a first guess, the cores seemed to be overheating, or at least powerd thought they were (powerd gets the same info as envstat, which also showed rising temperatures) - this usually happening when the system was idle (or mostly idle, there are all the usual low cpu usage background noise processes running - clocks, cron, inetd, nothing that normally causes even a blip in apparent cpu used). In fact, if I made the system really busy (like doing a full release build) I never saw a problem (things get hot, then cool down again).

That actually suggests another possibility for this original problem to be investigated later - perhaps when the CPU goes into idle mode, something is happening to the (at least reported) core temperatures, and the more time it spends idling, the more those appear to increase.

That's for later. For now, unless we can trust what the CPU is telling us about the temperature, worrying about probable nonsense numbers varying would be a waste of time.

    Date:        Thu, 29 Jun 2023 13:24:23 +0200
    From:        Michael van Elst <mlel...@serpens.de>
    Message-ID:  <ZJ1p5pZJGNOYXJ/g...@serpens.de>

  | Then it gets really strange what the temperature sensor would see.

Yes, that's why I sent the original message. It is indeed really strange.

  | One possibility would be that the Tjmax value is actually changed
  | dynamically (maybe some SMM code) and that the patch isn't complete
  | to handle this.

The possibility is certainly there. The patch certainly doesn't handle it. The code has been rearranged in a way that would make it much simpler to do, but as it is now, it is really just doing the same as before - calculate the Tjmax value to be used at sensor attach time, and never touch that again.
That is, I am not surprised that it didn't change anything.

However, if we can work out when it would be reasonable to look for a new Tjmax (and on which processor versions) now it will be trivial to make that happen, where it wouldn't have been before.

However, to explain what is being observed, the Tjmax value would need to be increasing as the cpu frequency decreases, since at the minute we're using Tjmax==100, and getting 12 as the reported temp, so the value read and subtracted from Tjmax must be 88. To make the reported temp be somewhere around 30-35 (which is what I am seeing now ... I just went back to 3401 mode temporarily) Tjmax would need to be in the 118-123 range (and so perhaps 120). That seems a bit unlikely to me (it is unlikely that things are that simple). Perhaps reading a new Tjmax recalibrates the internal temp monitoring though? Clutching at straws...

  | The scheduler did use first cores first, with performance cores
  | using low cpu numbers, they should be utilized first but not
  | necessarily for the important workloads.

That depends upon what it really means: "use first" (use the first cores *first*) wrt what? System boot? If it is doing that, and just rotating through the cores, that might (kind of) match what I see. But if you mean "when a cpu is needed, the lowest numbered one which is currently unallocated is used" then I don't see that happening at all.

What's even more peculiar is that we seem to be moving processes from core to core for no apparent reason: if I am running a single cpu bound process, I can observe it move from cpu to cpu. If all cpus were equal, then aside from the L1 cache losses suffered doing that, it would make no real difference. If it was moving processes to a more suitable cpu for the workload, that would also make sense, but it isn't doing that either.
If there were lots of other processes demanding CPU time, then bumping the busy one (which will have its run priority dropping (increasing numerically)) to run others, and then restarting the busy one on the next free CPU would also make sense - but I doubt it is that either, there just aren't enough processes (even kernel threads that might have a reason to run) to use all the cores - and the chances of all the ones that might have something to do all wanting to use their few required ns of processor time at the same instant are remote indeed.

If this seemingly random movement happened only rarely, maybe I could believe that, but it doesn't; it seems to be happening all the time, almost as if any system call being run results in the possibility that the resumption might be on a different cpu (the movement isn't frequent enough to be every system call though - and it happens to processes that make none, just infinite loop cpu wasters).

  | It now handles big.little configurations independent of cpu numbers,
  | but probably only on arm.

This processor needs more than that, though it would be a start.

It has been quite a while since I looked at the specs for it in detail, but as I remember it (and assuming we have no hyperthreading to occupy all the odd numbered cpu numbers, so cpuN means coreN here) cpu0 (and maybe cpu1) can run fastest (up to 5500 MHz). Then I think perhaps cpu1 (maybe cpu2) can run up to 5300 MHz, then the rest of the performance cores (...7) run up to 5200 MHz. The economy cores (8..15) all run at a slower base freq (2500 rather than 3400 for (all) the performance cores - despite cpuctl on NetBSD claiming that their base is also 3400 .. I suspect there's just one kernel "base" frequency, reported for all cpus) and up to 4000 MHz in turbo mode.

So there are certainly 3, perhaps 4, different processor classes, though all the fast ones are reasonably close to each other. Even so, when I was running that openssl test, using turbo mode, I could see in the results the variations that having a different cpu assigned made - which is why I just kept repeating it until cpu0 happened to be chosen (for at least most of it). That one certainly runs faster than anything else (except maybe cpu1, which might be the same).

kre
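
PS: for anyone wanting to check the Tjmax arithmetic from earlier in this message, here is a minimal sketch in Python. It assumes only what is stated above, that the reported temperature is Tjmax minus the value read from the sensor. The function names are made up for illustration and are not taken from the coretemp driver.

```python
# Sketch only: models reported_temp = Tjmax - readout, as described in the
# message. Function names are hypothetical, not the coretemp driver's.

def reported_temp(tjmax: int, readout: int) -> int:
    """Temperature reported for a given Tjmax and raw sensor readout."""
    return tjmax - readout


def readout_for(tjmax: int, temp: int) -> int:
    """Raw readout implied by a reported temperature and the Tjmax in use."""
    return tjmax - temp


def tjmax_needed(readout: int, wanted_temp: int) -> int:
    """Tjmax that would turn the same readout into the wanted temperature."""
    return readout + wanted_temp


# With Tjmax == 100 and 12 degC reported, the value read must be 88.
readout = readout_for(100, 12)

# For that same readout to mean 30-35 degC, Tjmax would need to be 118-123.
needed = [tjmax_needed(readout, t) for t in (30, 35)]

print(readout, needed)
```

With a Tjmax of 120 the same readout would give reported_temp(120, 88) == 32, which is in the range seen in 3401 mode. That says nothing about whether Tjmax really changes; it only shows the size of the change that would be required.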