> Date: Thu, 10 Aug 2023 16:01:10 -0500
> From: Scott Cheloha <[email protected]>
>
> On Tue, Aug 08, 2023 at 08:00:47PM +0200, Mark Kettenis wrote:
> > > From: Dale Rahn <[email protected]>
> > > Date: Tue, 8 Aug 2023 12:36:45 -0400
> > >
> > > Switching the computation of cycles/delaycnt to a proper 64 value math
> > > instead of the '32 bit safe' complex math is likely a good idea.
> > > However I am not completely convinced that switching to 'yield' (current
> > > CPU_BUSY_CYCLE implementation) for every loop of a 'short wait' in the
> > > wait
> > > loop makes sense. In a hypervisor environment, this could cause a very
> > > short wait between register writes to become very long, In a
> > > non-hypervisor
> > > environment there is essentially no improvement because the yield doesn't
> > > really have any benefits on non-hypervisor.
> >
> > Dale, I think you're confused here. There is no arcitectural way to
> > trap a YIELD instruction. The docs explicitly state that SMT and SMP
> > systems. I suspect that this instruction was primarily introduced for
> > SMT systems that never materialized; I completely believe you that on
> > current hardware it does nothing.
> >
> > Even without a YIELD instruction a hypervisor might interrupt us and
> > schedule a different vCPU onto the core. And on real hardware
> > external interrupts may happen. So you really can't count on delay(9)
> > being accurate.
> >
> > Linux does use YIELD in it delay loop.
>
> Okay good.
>
> > > To my current understanding, there is no useful 'wait short period' on arm
> > > cores.
> >
> > There is WFET (and WFIT), but that is Armv8.7 material, so not
> > availble on actual hardware that we run on. Linux uses that
> > instruction in its delay loop when available.
>
> These look great. Can't wait for them to be implemented outside of a
> functional simulator.
>
> > On machines with a working Generic Timer event stream, Linux uses WFE
> > if the delay is long enough.
>
> I have a prototype for this. Obviously it's a separate patch.
>
> Anyway, can I go ahead with the patch below to start?
>
> - Use 64-bit arithmetic to simplify agtimer_delay().
>
> - Use CPU_BUSY_CYCLE() ("yield" on arm64) in the loop to match other
> delay(9) implementations.
ok kettenis@
> Index: agtimer.c
> ===================================================================
> RCS file: /cvs/src/sys/arch/arm64/dev/agtimer.c,v
> retrieving revision 1.23
> diff -u -p -r1.23 agtimer.c
> --- agtimer.c 25 Jul 2023 18:16:19 -0000 1.23
> +++ agtimer.c 10 Aug 2023 20:59:27 -0000
> @@ -323,32 +323,12 @@ agtimer_cpu_initclocks(void)
> void
> agtimer_delay(u_int usecs)
> {
> - uint64_t clock, oclock, delta, delaycnt;
> - uint64_t csec, usec;
> - volatile int j;
> + uint64_t cycles, start;
>
> - if (usecs > (0x80000000 / agtimer_frequency)) {
> - csec = usecs / 10000;
> - usec = usecs % 10000;
> -
> - delaycnt = (agtimer_frequency / 100) * csec +
> - (agtimer_frequency / 100) * usec / 10000;
> - } else {
> - delaycnt = agtimer_frequency * usecs / 1000000;
> - }
> - if (delaycnt <= 1)
> - for (j = 100; j > 0; j--)
> - ;
> -
> - oclock = agtimer_readcnt64();
> - while (1) {
> - for (j = 100; j > 0; j--)
> - ;
> - clock = agtimer_readcnt64();
> - delta = clock - oclock;
> - if (delta > delaycnt)
> - break;
> - }
> + start = agtimer_readcnt64();
> + cycles = (uint64_t)usecs * agtimer_frequency / 1000000;
> + while (agtimer_readcnt64() - start < cycles)
> + CPU_BUSY_CYCLE();
> }
>
> void
>
>