On Wed, Apr 6, 2016 at 7:02 AM, Waiman Long <[email protected]> wrote:
> On a large system with many CPUs, using HPET as the clock source can
> have a significant impact on the overall system performance because
> of the following reasons:
>  1) There is a single HPET counter shared by all the CPUs.
>  2) HPET counter reading is a very slow operation.
>
> Using HPET as the default clock source may happen when, for example,
> the TSC clock calibration exceeds the allowable tolerance. Sometimes
> the performance slowdown can be so severe that the system may crash
> because of an NMI watchdog soft lockup, for example.
>
> This patch attempts to reduce HPET read contention by using the fact
> that if more than one task is trying to access HPET at the same time,
> it will be more efficient if one task in the group reads the HPET
> counter and shares it with the rest of the group instead of each
> group member reading the HPET counter individually.
>
> This is done by using a combination word with a sequence number and
> a bit lock. The task that gets the bit lock will be responsible for
> reading the HPET counter and updating the sequence number. The others
> will monitor the change in sequence number and grab the HPET counter
> value accordingly.
>
> On a 4-socket Haswell-EX box with 72 cores (HT off), running the
> AIM7 compute workload (1500 users) on a 4.6-rc1 kernel (HZ=1000)
> with and without the patch has the following performance numbers
> (with HPET or TSC as clock source):
>
>  TSC             = 646515 jobs/min
>  HPET w/o patch  = 566708 jobs/min
>  HPET with patch = 638791 jobs/min
>
> The perf profile showed a reduction of the %CPU time consumed by
> read_hpet from 4.99% without the patch to 1.41% with it.
>
> On a 16-socket IvyBridge-EX system with 240 cores (HT on), on the
> other hand, the performance numbers of the same benchmark were:
>
>  TSC             = 3145329 jobs/min
>  HPET w/o patch  = 1108537 jobs/min
>  HPET with patch = 3019934 jobs/min
>
> The corresponding perf profile showed a drop in CPU consumption of
> the read_hpet function from more than 34% to just 2.96%.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
>  arch/x86/kernel/hpet.c |  110 +++++++++++++++++++++++++++++++++++++++++++++++-
>  1 files changed, 109 insertions(+), 1 deletions(-)
>
> diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c
> index a1f0e4a..9e3de73 100644
> --- a/arch/x86/kernel/hpet.c
> +++ b/arch/x86/kernel/hpet.c
> @@ -759,11 +759,112 @@ static int hpet_cpuhp_notify(struct notifier_block *n,
>  #endif
>
>  /*
> + * Reading the HPET counter is a very slow operation. If a large number of
> + * CPUs are trying to access the HPET counter simultaneously, it can cause
> + * massive delays and slow down system performance dramatically. This may
> + * happen when HPET is the default clock source instead of TSC. For a
> + * really large system with hundreds of CPUs, the slowdown may be so
> + * severe that it may actually crash the system because of an NMI watchdog
> + * soft lockup, for example.
> + *
> + * If multiple CPUs are trying to access the HPET counter at the same time,
> + * we don't actually need to read the counter multiple times. Instead, the
> + * other CPUs can use the counter value read by the first CPU in the group.
> + *
> + * A sequence number whose lsb is a lock bit is used to control which CPU
> + * has the right to read the HPET counter directly and which CPUs are going
> + * to get the indirect value read by the lock holder. For the latter group,
> + * if the sequence number differs from the expected locked value, they
> + * can assume that the saved HPET value is up-to-date and return it.
> + *
> + * This mechanism is only activated on systems with a large number of CPUs.
> + * Currently, it is enabled when nr_cpus > 64.
> + */
Reading the HPET is so slow that all the atomic ops in the world won't
make a dent. Why not just turn this optimization on unconditionally?

--Andy

