[Intel-gfx] [RFCv2 00/12] TDR/watchdog timeout support for gen8 (edit: fixed coverletter)

2015-07-21 Thread Tomas Elf
This patch series introduces the following features:

* Feature 1: TDR (Timeout Detection and Recovery) for gen8 execlist mode.

TDR is an umbrella term for everything involved in detecting and recovering
from GPU hangs; the term is more widely used outside of the upstream driver.
This feature introduces an extensible framework that currently supports gen8
but can easily be extended to support gen7 as well (which is already the case
in GMIN, although in a form that is unfortunately not quite upstreamable). The
code contained in this submission represents the essentials of what is
currently in GMIN, merged with what is currently upstream (as of the time this
work commenced a few months back).

This feature adds a new hang recovery path, which takes care of per-engine
recovery only, alongside the legacy full GPU reset path. Aside from adding
support for per-engine recovery, this feature also introduces rules for when
to promote a potential per-engine reset to a legacy, full GPU reset.
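
As a loose sketch of what such a promotion rule could look like (all names and
thresholds below are made up for illustration and are not taken from the
patches), one option is to count per-engine resets within a time window and
escalate once the count gets too high:

    #include <linux/jiffies.h>

    #define ENGINE_RESET_LIMIT   3          /* per-engine resets before escalating */
    #define ENGINE_RESET_WINDOW  (10 * HZ)  /* ...counted within this window */

    struct engine_reset_state {
            unsigned int count;      /* per-engine resets in the current window */
            unsigned long window;    /* jiffies at which the current window expires */
    };

    /* Returns true if the next reset should be promoted to a full GPU reset. */
    static bool engine_reset_should_promote(struct engine_reset_state *st)
    {
            if (time_after(jiffies, st->window)) {
                    /* Window expired: start a fresh window and counter. */
                    st->count = 0;
                    st->window = jiffies + ENGINE_RESET_WINDOW;
            }

            /* Too many per-engine resets in a short time: escalate. */
            return ++st->count > ENGINE_RESET_LIMIT;
    }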

The hang checker now integrates with the error handler in a slightly different
way: it allows hang recovery on multiple engines at the same time by passing
an engine flag mask to the error handler, with the flags of all hung engines
set. This lets us schedule hang recovery once for all currently hung engines
instead of once per detected engine hang. Previously, when only full GPU reset
was supported, this made no difference: whether one engine or four were hung,
the outcome was the same - the whole GPU got reset. Now the behaviour differs
depending on which engines are hung, since each engine is reset separately
from the others, so we have to think about this in terms of scheduling cost
and recovery latency. (see open question below)
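
As a rough illustration of the mask-based scheduling (the detection helper and
the error-handler signature below are assumptions for this sketch, not the
actual patch code):

    static void hangcheck_elapsed(struct drm_i915_private *dev_priv)
    {
            u32 hung_engines = 0;   /* bit i set => engine i appears hung */
            int i;

            for (i = 0; i < I915_NUM_RINGS; i++) {
                    struct intel_engine_cs *engine = &dev_priv->ring[i];

                    if (engine_seems_hung(engine))  /* assumed detection helper */
                            hung_engines |= 1 << i;
            }

            if (hung_engines)
                    /* One error-handler call covers all currently hung engines. */
                    i915_handle_error(dev_priv, hung_engines,
                                      "hang detected on engines 0x%02x",
                                      hung_engines);
    }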

OPEN QUESTIONS:

1. Do we want to investigate the possibility of per-engine hang
detection? In the current upstream driver there is only one work queue
that handles the hang checker and everything from initial hang
detection to final hang recovery runs in this thread. This makes sense
if you're only supporting one form of hang recovery - using full GPU
reset and nothing tied to any particular engine. However, as part
of this patch series we're changing that by introducing per-engine
hang recovery. It could make sense to introduce multiple work
queues - one per engine - to run multiple hang checking threads in
parallel.

This would potentially reduce recovery latency: the hang checker would not
have to scan all engines every time it is scheduled, and neither would the
error handler. Instead, each engine would have its own work queue that invokes
a hang checker for that engine only, followed by the error handler for that
engine only. If a single engine hangs, the latency to reach its hang recovery
path would then be (time to hang-check one engine) + (time to error-handle one
engine), rather than the time it takes to hang-check all engines plus the time
it takes to error-handle all engines detected as hung (in the worst case, all
of them).

There could be as many hang checking and error handling threads running
concurrently as there are engines in the hardware, but they would run in
parallel without any significant locking; the first point at which any of them
needs exclusive access to the driver is the actual hang recovery. The time to
get there would theoretically be lower, and per-engine hang recovery itself is
considerably faster than reliably detecting a hang in the first place.

How much such a change would actually save still needs to be analysed and
compared against the current single-threaded model, but it makes sense from a
theoretical design point of view; a sketch of the idea follows after this
question.
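
A loose sketch of per-engine hang checking (hypothetical types and helpers,
reusing the existing delayed-work and hangcheck-period machinery; not part of
this series):

    #include <linux/workqueue.h>

    struct engine_hangcheck {
            struct drm_i915_private *dev_priv;
            struct intel_engine_cs *engine;
            struct delayed_work work;
    };

    static void engine_hangcheck_fn(struct work_struct *work)
    {
            struct engine_hangcheck *hc =
                    container_of(work, struct engine_hangcheck, work.work);

            if (engine_seems_hung(hc->engine))      /* assumed detection helper */
                    /* Recover only this engine; no other engine is scanned. */
                    i915_handle_error(hc->dev_priv, 1 << hc->engine->id,
                                      "hang on %s", hc->engine->name);

            /* Re-arm the check for this engine only. */
            schedule_delayed_work(&hc->work, DRM_I915_HANGCHECK_JIFFIES);
    }

    static void engine_hangcheck_start(struct engine_hangcheck *hc,
                                       struct drm_i915_private *dev_priv,
                                       struct intel_engine_cs *engine)
    {
            hc->dev_priv = dev_priv;
            hc->engine = engine;
            INIT_DELAYED_WORK(&hc->work, engine_hangcheck_fn);
            schedule_delayed_work(&hc->work, DRM_I915_HANGCHECK_JIFFIES);
    }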

* Feature 2: Watchdog Timeout (a.k.a. "media engine reset") for gen8.

This feature allows userland applications to control whether or not individual
batch buffers should have a first-level, fine-grained, hardware-based hang
detection mechanism, on top of the ordinary, software-based periodic hang
checker that is already in the driver. The advantage over relying solely on
the current software-based hang checker is that the watchdog timeout mechanism
is about 1000x quicker and more precise. Since it is not a full driver-level
hang detection mechanism but only targets one individual batch buffer at a
time, it can afford to be that quick without risking an increase in false
positive hang detections.
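
For illustration, per-batch opt-in from userspace could look roughly like the
following; the flag name and bit position are assumptions for this sketch
only, the real interface is whatever the series defines:

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <drm/i915_drm.h>

    #define LOCAL_EXEC_ENABLE_WATCHDOG (1 << 18)    /* assumed flag, illustration only */

    static int submit_batch_with_watchdog(int drm_fd,
                                          struct drm_i915_gem_exec_object2 *objects,
                                          unsigned int count, unsigned int batch_len)
    {
            struct drm_i915_gem_execbuffer2 execbuf;

            memset(&execbuf, 0, sizeof(execbuf));
            execbuf.buffers_ptr = (uintptr_t)objects;
            execbuf.buffer_count = count;
            execbuf.batch_len = batch_len;
            /* Per-batch opt-in: only this submission gets the HW watchdog. */
            execbuf.flags = I915_EXEC_RENDER | LOCAL_EXEC_ENABLE_WATCHDOG;

            return ioctl(drm_fd, DRM_IOCTL_I915_GEM_EXECBUFFER2, &execbuf);
    }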
