On Tue, 13 Sep 2016, Pan, Harry wrote:
> This things is because of the Baytrail/Braswell quirk breaks original
> assumption of perf RAPL polling timer rate calculation regarding of
> counter overflow case based on 200W;

ESU are the 'Energy Status Units' bits in the MSR_RAPL_POWER_UNIT MSR.

    ESU = (rdmsr(MSR_RAPL_POWER_UNIT) >> 8) & 0x1f;

So we have 5 bits of information and therefore:

    0 <= ESU <= 31

The standard readout is:

    joules = counter_value * mult;
    mult   = 1 / (2 ^ ESU)

The resulting multiplier is:

              31 >= ESU  >= 0
    4.65661e-10J <= mult <= 1J

The scale function does:

    val = counter << (32 - ESU);

which is converting the readout into units of

    4.65661e-10J / 2 == 2.32830e-10J

because the shift is actually: (1 + (31 - ESU)).

The math for Baytrail/Braswell is:

    microjoules = counter_value * mult
    mult        = 2 ^ ESU

The resulting multiplier is:

         0 <= ESU  <= 31
      1 uJ <= mult <= 2.14748e+09 uJ
    1e-6J  <= mult <= 2147J

So now your Baytrail/Braswell quirk does:

    ESU = 32 - ESU

so the scale function becomes:

    val = counter << (32 - (32 - ESU))
==> val = counter << ESU

which is converting the readout to units of 1e-6J.

So now you are concerned about the rapl_timer interval, which is
calculated so that the counter does not overflow for a total dissipation
of 200W, which is equivalent to 200J/s.

The maximum counter width is 32 bit. So depending on ESU the code scales
the timeout (in seconds) to:

    t[ESU] = (1 << (31 - ESU)) / 200

So for the normal case we get:

    t[0]  = 10.737e6 s
    ...
    t[30] = 0.010 s
    t[31] = 0.005 s

The counter capacity for ESU=31 is

    cap = (1 << 32) * 4.65661e-10J = 2J

So:

    toverfl = 2J / 200W = 0.01s

which we cut in half to avoid running the timer and the counter in
lockstep, which can cause overflows to go undetected. So this looks
correct.

But for your Baytrail/Braswell that results in:

    t[ESU] = (1 << (31 - (32 - ESU))) / 200

    t[0]  = TOTAL CRAP because the shift value becomes -1

But what saves you here is the check for

    if (hwunit < 32)

which catches the hwunit = 32 - ESU[0] case and sets the timer to 2ms.

So for the remaining ones we have:

    t[1]  = 0.005s
    ...
    t[31] = 5.3687e+06s

So let's look at the counter capacity for ESU=1:

    cap = (1 << 32) * 2 uJ == 8589.93J

The resulting overflow is:

    toverfl = 8589.93J / 200W = 42.9497 s

So if we divide this by two, we end up with: 21.4748 s

So your timeout is actually off by a factor of ~4k, which is not
surprising given that the capacity has a ratio of 1 : 2147.48 and you
have an additional off by one due to the (32 - ESU) quirk.....

So the overflow prevention timer fires 4k times for no good reason. Indeed
a very power friendly design.

The timer calculation magically works for the original standard
conversion, but in this case it is utter crap.

You really want to have a proper scale factor for the timer calculation so
we end up with:

    toverfl = capacity / 200

i.e. you need a way to calculate capacity from the hw_units[] mess and some
factor which is dependent on the base unit. That all can be done with plain
integer math.

> in short, it leads every 80ms system triggers an event to read counters,

I have no idea where these 80ms come from and I can't make any sense of
the rest of your response either.

Fact is that you did not do the math and just tinkered the
Baytrail/Braswell support into the existing code and declared it done when
it did not blow up in your face. Really excellent engineering work - NOT!

Thanks,

	tglx
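
For illustration only, the capacity-based calculation asked for above could
look roughly like the following in plain integer math. This is a minimal
sketch, not the kernel code: the function names rapl_capacity_uj() and
rapl_timer_ms(), the 'baytrail' flag and the two macros are hypothetical
and exist only for this example. It takes the raw 5-bit ESU value (before
any 32 - ESU quirk) and assumes the 32-bit counter and the 200W reference
from the text.

```c
#include <assert.h>
#include <stdint.h>

#define RAPL_REF_POWER_W  200ULL  /* reference dissipation used in the mail */
#define RAPL_CNTR_BITS    32      /* counter width used in the mail */

/*
 * Hypothetical helper: energy the counter can hold before it wraps,
 * in microjoules. 'esu' is the raw 5-bit field from the MSR.
 *
 *   standard:          unit = 1J / 2^ESU
 *   Baytrail/Braswell: unit = 2^ESU uJ
 */
static uint64_t rapl_capacity_uj(unsigned int esu, int baytrail)
{
    assert(esu <= 31);  /* ESU is a 5-bit field */

    if (baytrail)
        return (1ULL << RAPL_CNTR_BITS) << esu;

    /* 1J == 1000000 uJ; the shifted value still fits in 64 bits */
    return (1000000ULL << RAPL_CNTR_BITS) >> esu;
}

/*
 * Hypothetical helper: timer interval in milliseconds, chosen as half
 * of the overflow time at the reference power so that timer and
 * counter do not run in lockstep.
 */
static uint64_t rapl_timer_ms(unsigned int esu, int baytrail)
{
    /* 200W == 200J/s == 200000 uJ per millisecond */
    uint64_t uj_per_ms = RAPL_REF_POWER_W * 1000ULL;

    return rapl_capacity_uj(esu, baytrail) / (2 * uj_per_ms);
}
```

Under these assumptions rapl_timer_ms(31, 0) yields 5 (ms) for the
standard case and rapl_timer_ms(1, 1) yields 21474 (ms) for
Baytrail/Braswell, which matches the 0.005 s and 21.4748 s worked out
above.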