Re: [Intel-gfx] [PATCH 4/8] drm/i915: Use preempt_disable/enable_rt() where recommended

2022-02-14 Thread Mario Kleiner
On Fri, Feb 11, 2022 at 9:44 AM Sebastian Andrzej Siewior
 wrote:
>
> On 2022-01-27 00:29:37 [+0100], Mario Kleiner wrote:
> > Hi, first thank you for implementing these preempt disables according to
> Hi Mario,
>
> > the markers i left long ago. And sorry for the rather late reply.
> >
> > I had a look at the code, as of Linux 5.16, and did also a little test run
> > (of a standard kernel, not with PREEMPT_RT, only
> > CONFIG_PREEMPT_VOLUNTARY=y) on my Intel Kabylake GT2, so some thoughts:
> >
> > The area covers only register reads and writes. The part that worries me
> > > is:
> > > - __intel_get_crtc_scanline() the worst case is 100us if no match is
> > >   found.
> > >
> >
> > This one can be a problem indeed on (maybe all?) modern Intel gpu's since
> > Haswell, ie. the last ~10 years. I was able to reproduce it on my Kabylake
> > Intel gpu.
> >
> > Most of the time that for-loop with up to 100 repetitions (~ 100
> > udelay(1) + one mmio register read) (cfe.
> > https://elixir.bootlin.com/linux/v5.17-rc1/source/drivers/gpu/drm/i915/i915_irq.c#L856)
> > will not execute, because most of the time that function gets called from
> > the vblank irq handler and then that trigger condition (if
> > (HAS_DDI(dev_priv) && !position)) is not true. However, it also gets called
> > as part of power-saving on behalf of userspace context, whenever the
> > desktop graphics goes idle for two video refresh cycles. If the desktop
> > shows graphics activity again, and vblank interrupts need to get reenabled,
> > the probability of hitting that case is then ~1-4% depending on video mode.
> > How many loops it runs also varies.
> >
> > On my little Intel(R) Core(TM) i5-8250U CPU machine with a mostly idle
> > desktop, I observed about one hit every couple of seconds of regular use,
> > and each hit took between 125 usecs and almost 250 usecs. I guess udelay(1)
> > can take a bit longer than 1 usec?
>
> it should get very close to this. Maybe something else extended the time
> depending on what you observe.
>

Probably all the other stuff in that for-loop adds a microsecond. I
don't have a good feeling how long a typical mmio register read is
expected to take, except for quite a bit less than 1 usec from my
experience.

> > So that's too much for preempt-rt. What one could do is the following:
> >
> > 1. In the for-loop in __intel_get_crtc_scanline(), add a preempt_enable()
> > before the udelay(1); and a preempt_disable() again after it. Or
> > potentially around the whole for-loop if the overhead of
> > preempt_en/disable() is significant?
>
> It is very optimized on x86 ;)

Good! So adding a disable/enable pair into each of those loop
iterations won't hurt.

>
> > 2. In intel_get_crtc_scanline() also wrap the call to
> > __intel_get_crtc_scanline() into a preempt_disable() and preempt_enable(),
> > so we can be sure that __intel_get_crtc_scanline() always gets called with
> > preemption disabled.
> >
> > Why should this work ok'ish? The point of the original preempt disable
> > inside i915_get_crtc_scanoutpos
> > 
> > is that those two *stime = ktime_get() and *etime = ktime_get() clock
> > queries happen as close to the scanout position query as possible to get a
> > small confidence interval for when exactly the scanoutpos was
> > read/determined from the display hardware. error = (etime - stime) is the
> > error margin. If that margin becomes greater than 20 usecs, then the
> > higher-level code will consider the measurement invalid and repeat the
> > whole procedure up to 3 times before giving up.
>
> The preempt-disable is needed then? The task is preemptible here on
> PREEMPT_RT but it _may_ not come to this. The difference vs !RT is that
> an interrupt will preempt this code without it.
>

Yes, it is needed, as that chunk of code between the two ktime_get()
requires should ideally not get interrupted by anything.
The "try up to three times" higher level logic in calling code is just
to cover the hopefully rare cases where something still preempts,
e.g., a NMI or such.

I have not ever tested this on a PREEMPT_RT kernel in at least a
decade, but on regular kernels, e.g., Ubuntu generic or Ubuntu
low-latency kernels I haven't observed more than one retry when it
mattered, and usually the code executes in 0-2 usecs on my test
machines, way below the limit of 20 usecs at which a measurement is
considered failed and then retried. So the retries are sufficient as
long as all preventable preemption is prevented. Hence the
preempt_disable() annotations to make sure it continues to work on
PREEMPT_RT.

The exception, as I learned after your mail, is when that for-loop is
hit. Then it can take hundreds of microseconds, and even spoil all
three retries, resulting in a rather inaccurate measurement. But the
"good news" is that that for-loop will very likely only be hit in
cases where the code is not called from (Vblank/Pageflip-completion)
hardware 

Re: [Intel-gfx] [PATCH 4/8] drm/i915: Use preempt_disable/enable_rt() where recommended

2022-02-11 Thread Sebastian Andrzej Siewior
On 2022-01-27 00:29:37 [+0100], Mario Kleiner wrote:
> Hi, first thank you for implementing these preempt disables according to
Hi Mario,

> the markers i left long ago. And sorry for the rather late reply.
> 
> I had a look at the code, as of Linux 5.16, and did also a little test run
> (of a standard kernel, not with PREEMPT_RT, only
> CONFIG_PREEMPT_VOLUNTARY=y) on my Intel Kabylake GT2, so some thoughts:
> 
> The area covers only register reads and writes. The part that worries me
> > is:
> > - __intel_get_crtc_scanline() the worst case is 100us if no match is
> >   found.
> >
> 
> This one can be a problem indeed on (maybe all?) modern Intel gpu's since
> Haswell, ie. the last ~10 years. I was able to reproduce it on my Kabylake
> Intel gpu.
> 
> Most of the time that for-loop with up to 100 repetitions (~ 100
> udelay(1) + one mmio register read) (cfe.
> https://elixir.bootlin.com/linux/v5.17-rc1/source/drivers/gpu/drm/i915/i915_irq.c#L856)
> will not execute, because most of the time that function gets called from
> the vblank irq handler and then that trigger condition (if
> (HAS_DDI(dev_priv) && !position)) is not true. However, it also gets called
> as part of power-saving on behalf of userspace context, whenever the
> desktop graphics goes idle for two video refresh cycles. If the desktop
> shows graphics activity again, and vblank interrupts need to get reenabled,
> the probability of hitting that case is then ~1-4% depending on video mode.
> How many loops it runs also varies.
> 
> On my little Intel(R) Core(TM) i5-8250U CPU machine with a mostly idle
> desktop, I observed about one hit every couple of seconds of regular use,
> and each hit took between 125 usecs and almost 250 usecs. I guess udelay(1)
> can take a bit longer than 1 usec?

it should get very close to this. Maybe something else extended the time
depending on what you observe.

> So that's too much for preempt-rt. What one could do is the following:
> 
> 1. In the for-loop in __intel_get_crtc_scanline(), add a preempt_enable()
> before the udelay(1); and a preempt_disable() again after it. Or
> potentially around the whole for-loop if the overhead of
> preempt_en/disable() is significant?

It is very optimized on x86 ;)

> 2. In intel_get_crtc_scanline() also wrap the call to
> __intel_get_crtc_scanline() into a preempt_disable() and preempt_enable(),
> so we can be sure that __intel_get_crtc_scanline() always gets called with
> preemption disabled.
> 
> Why should this work ok'ish? The point of the original preempt disable
> inside i915_get_crtc_scanoutpos
> 
> is that those two *stime = ktime_get() and *etime = ktime_get() clock
> queries happen as close to the scanout position query as possible to get a
> small confidence interval for when exactly the scanoutpos was
> read/determined from the display hardware. error = (etime - stime) is the
> error margin. If that margin becomes greater than 20 usecs, then the
> higher-level code will consider the measurement invalid and repeat the
> whole procedure up to 3 times before giving up.

The preempt-disable is needed then? The task is preemptible here on
PREEMPT_RT but it _may_ not come to this. The difference vs !RT is that
an interrupt will preempt this code without it.

> Normally, in my experience with different graphics chips, one would observe
> error < 3 usecs, so the measurement almost always succeeds at first try,
> only very rarely takes two attempts. The preempt disable is meant to make
> sure that this stays the case on a PREEMPT_RT kernel.

Was it needed?

> The problem here are the relatively rare cases where we hit that up to 100
> iterations for-loop. Here even on a regular kernel, due to hardware quirks,
> we already exceed the 20 usecs tolerance by a huge amount of more than 100
> usecs, leading to a retry of the measurement. And my tests showed that
> often the two succeeding retries also fail, because of hardware quirks can
> apparently create a blackout situation approaching 1 msec, so we lose
> anyway, regardless if we get preempted on a RT kernel or not. That's why
> enabling preemption on RT again during that for-loop should not make the
> situation worse and at least keep RT as real-time as intended.
> 
> In practice I would also expect that this failure case is the one least
> likely to impair userspace applications greatly in practice. The cases that
> mostly matter are the ones executed during vblank hardware irq, where the
> for-loop never executes and error margin and preempt off time is only about
> 1 usec. My own software which depends on very precise timestamps from the
> mechanism never reported >> 20 usecs errors during startup tests or runtime
> tests.

That is without RT I assume?

> > - intel_crtc_scanlines_since_frame_timestamp() not sure how long this
> >   may take in the worst case.
> >
> >
> intel_crtc_scanlines_since_frame_timestamp() should be harmless. That
> 

Re: [Intel-gfx] [PATCH 4/8] drm/i915: Use preempt_disable/enable_rt() where recommended

2022-01-26 Thread Mario Kleiner
On Tue, Dec 14, 2021 at 3:03 PM Sebastian Andrzej Siewior <
bige...@linutronix.de> wrote:

> From: Mike Galbraith 
>
> Mario Kleiner suggest in commit
>   ad3543ede630f ("drm/intel: Push get_scanout_position() timestamping into
> kms driver.")
>
> a spots where preemption should be disabled on PREEMPT_RT. The
> difference is that on PREEMPT_RT the intel_uncore::lock disables neither
> preemption nor interrupts and so region remains preemptible.
>
>
Hi, first thank you for implementing these preempt disables according to
the markers i left long ago. And sorry for the rather late reply.

I had a look at the code, as of Linux 5.16, and did also a little test run
(of a standard kernel, not with PREEMPT_RT, only
CONFIG_PREEMPT_VOLUNTARY=y) on my Intel Kabylake GT2, so some thoughts:

The area covers only register reads and writes. The part that worries me
> is:
> - __intel_get_crtc_scanline() the worst case is 100us if no match is
>   found.
>

This one can be a problem indeed on (maybe all?) modern Intel gpu's since
Haswell, ie. the last ~10 years. I was able to reproduce it on my Kabylake
Intel gpu.

Most of the time that for-loop with up to 100 repetitions (~ 100
udelay(1) + one mmio register read) (cfe.
https://elixir.bootlin.com/linux/v5.17-rc1/source/drivers/gpu/drm/i915/i915_irq.c#L856)
will not execute, because most of the time that function gets called from
the vblank irq handler and then that trigger condition (if
(HAS_DDI(dev_priv) && !position)) is not true. However, it also gets called
as part of power-saving on behalf of userspace context, whenever the
desktop graphics goes idle for two video refresh cycles. If the desktop
shows graphics activity again, and vblank interrupts need to get reenabled,
the probability of hitting that case is then ~1-4% depending on video mode.
How many loops it runs also varies.

On my little Intel(R) Core(TM) i5-8250U CPU machine with a mostly idle
desktop, I observed about one hit every couple of seconds of regular use,
and each hit took between 125 usecs and almost 250 usecs. I guess udelay(1)
can take a bit longer than 1 usec?

So that's too much for preempt-rt. What one could do is the following:

1. In the for-loop in __intel_get_crtc_scanline(), add a preempt_enable()
before the udelay(1); and a preempt_disable() again after it. Or
potentially around the whole for-loop if the overhead of
preempt_en/disable() is significant?

2. In intel_get_crtc_scanline() also wrap the call to
__intel_get_crtc_scanline() into a preempt_disable() and preempt_enable(),
so we can be sure that __intel_get_crtc_scanline() always gets called with
preemption disabled.

Why should this work ok'ish? The point of the original preempt disable
inside i915_get_crtc_scanoutpos

is that those two *stime = ktime_get() and *etime = ktime_get() clock
queries happen as close to the scanout position query as possible to get a
small confidence interval for when exactly the scanoutpos was
read/determined from the display hardware. error = (etime - stime) is the
error margin. If that margin becomes greater than 20 usecs, then the
higher-level code will consider the measurement invalid and repeat the
whole procedure up to 3 times before giving up.

Normally, in my experience with different graphics chips, one would observe
error < 3 usecs, so the measurement almost always succeeds at first try,
only very rarely takes two attempts. The preempt disable is meant to make
sure that this stays the case on a PREEMPT_RT kernel.

The problem here are the relatively rare cases where we hit that up to 100
iterations for-loop. Here even on a regular kernel, due to hardware quirks,
we already exceed the 20 usecs tolerance by a huge amount of more than 100
usecs, leading to a retry of the measurement. And my tests showed that
often the two succeeding retries also fail, because of hardware quirks can
apparently create a blackout situation approaching 1 msec, so we lose
anyway, regardless if we get preempted on a RT kernel or not. That's why
enabling preemption on RT again during that for-loop should not make the
situation worse and at least keep RT as real-time as intended.

In practice I would also expect that this failure case is the one least
likely to impair userspace applications greatly in practice. The cases that
mostly matter are the ones executed during vblank hardware irq, where the
for-loop never executes and error margin and preempt off time is only about
1 usec. My own software which depends on very precise timestamps from the
mechanism never reported >> 20 usecs errors during startup tests or runtime
tests.


> - intel_crtc_scanlines_since_frame_timestamp() not sure how long this
>   may take in the worst case.
>
>
intel_crtc_scanlines_since_frame_timestamp() should be harmless. That
do-while loop just tries to make sure that two register reads that should
happen within the same video refresh cycle are happening in the same