On Fri, Apr 23, 2021 at 11:59:30PM +0530, Vaidyanathan Srinivasan wrote:
> * Michal Suchánek <msucha...@suse.de> [2021-04-23 19:45:05]:
>
> > On Fri, Apr 23, 2021 at 09:29:39PM +0530, Vaidyanathan Srinivasan wrote:
> > > * Michal Suchánek <msucha...@suse.de> [2021-04-23 09:35:51]:
> > >
> > > > On Thu, Apr 22, 2021 at 08:37:29PM +0530, Gautham R. Shenoy wrote:
> > > > > From: "Gautham R. Shenoy" <e...@linux.vnet.ibm.com>
> > > > >
> > > > > Commit d947fb4c965c ("cpuidle: pseries: Fixup exit latency for
> > > > > CEDE(0)") sets the exit latency of CEDE(0) based on the latency values
> > > > > of the Extended CEDE states advertised by the platform
> > > > >
> > > > > On some of the POWER9 LPARs, the older firmwares advertise a very low
> > > > > value of 2us for CEDE1 exit latency on a Dedicated LPAR. However the
> > > > Can you be more specific about 'older firmwares'?
> > >
> > > Hi Michal,
> > >
> > > This is POWER9 vs POWER10 difference, not really an obsolete FW. The
> > > key idea behind the original patch was to make the H_CEDE latency and
> > > hence target residency come from firmware instead of being decided by
> > > the kernel. The advantage is such that, different type of systems in
> > > POWER10 generation can adjust this value and have an optimal H_CEDE
> > > entry criteria which balances good single thread performance and
> > > wakeup latency. Further we can have additional H_CEDE state to feed
> > > into the cpuidle.
> >
> > So all POWER9 machines are affected by the firmware bug where firmware
> > reports CEDE1 exit latency of 2us and the real latency is 5us which
> > causes the kernel to prefer CEDE1 too much when relying on the values
> > supplied by the firmware. It is not about 'older firmware'.
>
> Correct. All POWER9 systems running Linux as guest LPARs will see
> extra usage of CEDE idle state, but not baremetal (PowerNV).
>
> The correct definition of the bug or mismatch in expectation is that
> firmware reports wakeup latency from a core/thread wakeup timing, but
> not end-to-end time from sending a wakeup event like an IPI using
> H_calls and receiving the events on the target. Practically there are
> few extra microseconds needed after deciding to wakeup a target
> core/thread to getting the target to start executing instructions
> within the LPAR instance.

Thanks for the detailed explanation. Maybe just adding a few
microseconds to the reported time would be a more reasonable workaround
than using a blanket fixed value then.

>
> > I still think it would be preferable to adjust the latency value
> > reported by the firmware to match reality over a kernel workaround.
>
> Right, practically we can fix for future releases and as such we
> targeted this scheme from POWER10 but expected no harm on POWER9 which
> proved to be wrong.
>
> We can possibly change this FW value for POWER9, but it is too
> expensive and not practical because many release streams exist for
> different platforms and further customers are at different streams as
> well. We cannot force all of them to update because that blows up
> co-dependency matrix.

From the user's point of view only a few firmware release streams exist,
but what is packaged in such binaries might be another story.

Thanks

Michal
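
To make the "add a few microseconds" idea concrete, here is a rough
sketch only, not kernel code. Every name in it is made up for
illustration. The 3us offset is an assumption derived from the 2us
reported vs 5us real CEDE1 figures quoted above, and the 5us blanket
value is likewise just a stand-in for whatever fixed number a workaround
would pick.

```python
# Hypothetical illustration, not the pseries cpuidle driver.
# WAKEUP_OVERHEAD_US is an assumed value: 5 us observed minus 2 us reported.
WAKEUP_OVERHEAD_US = 3
# Stand-in for a fixed workaround value; assumed, not taken from any patch.
BLANKET_LATENCY_US = 5


def adjusted_exit_latency_us(fw_reported_us: int,
                             overhead_us: int = WAKEUP_OVERHEAD_US) -> int:
    """Keep the firmware value and add the end-to-end wakeup overhead."""
    return fw_reported_us + overhead_us


def blanket_exit_latency_us(fw_reported_us: int) -> int:
    """Ignore the firmware value and always return one fixed number."""
    return BLANKET_LATENCY_US


if __name__ == "__main__":
    # Compare both approaches for two hypothetical firmware-reported values.
    for fw_us in (2, 10):
        print(fw_us,
              adjusted_exit_latency_us(fw_us),
              blanket_exit_latency_us(fw_us))
```

With an additive offset, platforms that report different values still
end up with different latencies, whereas the blanket value flattens them
all to one number.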