Re: [PATCH v5 5/5] drm/i915: Add short HPD IRQ storm detection for non-MST systems

2018-11-07 Thread Ville Syrjälä
On Tue, Nov 06, 2018 at 04:30:16PM -0500, Lyude Paul wrote:
> Unfortunately, it seems that the HPD IRQ storm problem from the early
> days of Intel GPUs was never entirely solved, only mostly. Within the
> last couple of days, I got a bug report from one of our customers who
> had been having issues with their machine suddenly booting up very
> slowly after having updated. The amount of time it took to boot went
> from around 30 seconds to over 6 minutes, consistently.
> 
> After some investigation, I discovered that i915 was reporting massive
> amounts of short HPD IRQ spam on this system from the DisplayPort port,
> despite there not being anything actually connected. The symptoms would
> start with one "long" HPD IRQ being detected at boot:
> 
> [1.891398] [drm:intel_get_hpd_pins [i915]] hotplug event received, stat 0x0044, dig 0x0044, pins 0x00a0
> [1.891436] [drm:intel_hpd_irq_handler [i915]] digital hpd port B - long
> [1.891472] [drm:intel_hpd_irq_handler [i915]] Received HPD interrupt on PIN 5 - cnt: 0
> [1.891508] [drm:intel_hpd_irq_handler [i915]] digital hpd port D - long
> [1.891544] [drm:intel_hpd_irq_handler [i915]] Received HPD interrupt on PIN 7 - cnt: 0
> [1.891592] [drm:intel_dp_hpd_pulse [i915]] got hpd irq on port B - long
> [1.891628] [drm:intel_dp_hpd_pulse [i915]] got hpd irq on port D - long
> …
> 
> followed by constant short IRQs afterwards:
> 
> [1.895091] [drm:intel_encoder_hotplug [i915]] [CONNECTOR:66:DP-1] status updated from unknown to disconnected
> [1.895129] [drm:i915_hotplug_work_func [i915]] Connector DP-3 (pin 7) received hotplug event.
> [1.895165] [drm:intel_dp_detect [i915]] [CONNECTOR:72:DP-3]
> [1.895275] [drm:intel_get_hpd_pins [i915]] hotplug event received, stat 0x0020, dig 0x0020, pins 0x0080
> [1.895312] [drm:intel_hpd_irq_handler [i915]] digital hpd port D - short
> [1.895762] [drm:intel_get_hpd_pins [i915]] hotplug event received, stat 0x0020, dig 0x0020, pins 0x0080
> [1.895799] [drm:intel_hpd_irq_handler [i915]] digital hpd port D - short
> [1.896239] [drm:intel_dp_aux_xfer [i915]] dp_aux_ch timeout status 0x71450085
> [1.896293] [drm:intel_get_hpd_pins [i915]] hotplug event received, stat 0x0020, dig 0x0020, pins 0x0080
> [1.896330] [drm:intel_hpd_irq_handler [i915]] digital hpd port D - short
> [1.896781] [drm:intel_get_hpd_pins [i915]] hotplug event received, stat 0x0020, dig 0x0020, pins 0x0080
> [1.896817] [drm:intel_hpd_irq_handler [i915]] digital hpd port D - short
> [1.897275] [drm:intel_get_hpd_pins [i915]] hotplug event received, stat 0x0020, dig 0x0020, pins 0x0080
> 
> The customer's system in question has a GM45 GPU, which is apparently
> well known for hotplugging storms.
> 
> So, work around this impressively broken hardware by changing the default
> HPD storm threshold from 5 to 50. Then, make long IRQs count for 10, and
> short IRQs count for 1. This makes it so that 5 long IRQs will trigger
> an HPD storm, and on systems with short HPD storm detection 50 short
> IRQs will trigger an HPD storm. 50 short IRQs amount to 100ms of
> constant pulsing, which seems like a good middle ground between being too
> sensitive and not being sensitive enough (which would cause visible
> stutters in userspace every time a storm occurs).
> 
> And just to be extra safe: we don't enable this by default on systems
> with MST support. There's too high of a chance of MST support triggering
> storm detection, and systems that are new enough to support MST are a
> lot less likely to have issues with IRQ storms anyway.
> 
> As a note: this patch was tested using a ThinkPad T450s and a Chamelium
> to simulate the short IRQ storms.
> 
> Changes since v1:
> - Don't use two separate thresholds, just make long IRQs count for 10
>   each and short IRQs count for 1. This simplifies the code a bit
>   - Ville Syrjälä
> Changes since v2:
> - Document @long_hpd in intel_hpd_irq_storm_detect, no functional
>   changes
> Changes since v4:
> - Remove !! in long_hpd assignment - Ville Syrjälä
> - queue_hp = true - Ville Syrjälä
> 
> Signed-off-by: Lyude Paul 
> Cc: Ville Syrjälä 
> Cc: Rodrigo Vivi 

lgtm

Reviewed-by: Ville Syrjälä 

> ---
>  drivers/gpu/drm/i915/i915_debugfs.c  | 74 
>  drivers/gpu/drm/i915/i915_drv.h  |  5 +-
>  drivers/gpu/drm/i915/i915_irq.c  |  7 +++
>  drivers/gpu/drm/i915/intel_hotplug.c | 50 +++
>  4 files changed, 115 insertions(+), 21 deletions(-)
> 
> diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
> index f60485906f7e..670db5073d70 100644
> --- a/drivers/gpu/drm/i915/i915_debugfs.c
> +++ b/drivers/gpu/drm/i915/i915_debugfs.c
> @@ -4658,6 +4658,79 @@ static const struct file_operations i915_hpd_storm_ctl_fops = {
>   .write = i915_hpd_storm_ctl_write
>  };
>  
> 

[PATCH v5 5/5] drm/i915: Add short HPD IRQ storm detection for non-MST systems

2018-11-06 Thread Lyude Paul
Unfortunately, it seems that the HPD IRQ storm problem from the early
days of Intel GPUs was never entirely solved, only mostly. Within the
last couple of days, I got a bug report from one of our customers who
had been having issues with their machine suddenly booting up very
slowly after having updated. The amount of time it took to boot went
from around 30 seconds to over 6 minutes, consistently.

After some investigation, I discovered that i915 was reporting massive
amounts of short HPD IRQ spam on this system from the DisplayPort port,
despite there not being anything actually connected. The symptoms would
start with one "long" HPD IRQ being detected at boot:

[1.891398] [drm:intel_get_hpd_pins [i915]] hotplug event received, stat 0x0044, dig 0x0044, pins 0x00a0
[1.891436] [drm:intel_hpd_irq_handler [i915]] digital hpd port B - long
[1.891472] [drm:intel_hpd_irq_handler [i915]] Received HPD interrupt on PIN 5 - cnt: 0
[1.891508] [drm:intel_hpd_irq_handler [i915]] digital hpd port D - long
[1.891544] [drm:intel_hpd_irq_handler [i915]] Received HPD interrupt on PIN 7 - cnt: 0
[1.891592] [drm:intel_dp_hpd_pulse [i915]] got hpd irq on port B - long
[1.891628] [drm:intel_dp_hpd_pulse [i915]] got hpd irq on port D - long
…

followed by constant short IRQs afterwards:

[1.895091] [drm:intel_encoder_hotplug [i915]] [CONNECTOR:66:DP-1] status updated from unknown to disconnected
[1.895129] [drm:i915_hotplug_work_func [i915]] Connector DP-3 (pin 7) received hotplug event.
[1.895165] [drm:intel_dp_detect [i915]] [CONNECTOR:72:DP-3]
[1.895275] [drm:intel_get_hpd_pins [i915]] hotplug event received, stat 0x0020, dig 0x0020, pins 0x0080
[1.895312] [drm:intel_hpd_irq_handler [i915]] digital hpd port D - short
[1.895762] [drm:intel_get_hpd_pins [i915]] hotplug event received, stat 0x0020, dig 0x0020, pins 0x0080
[1.895799] [drm:intel_hpd_irq_handler [i915]] digital hpd port D - short
[1.896239] [drm:intel_dp_aux_xfer [i915]] dp_aux_ch timeout status 0x71450085
[1.896293] [drm:intel_get_hpd_pins [i915]] hotplug event received, stat 0x0020, dig 0x0020, pins 0x0080
[1.896330] [drm:intel_hpd_irq_handler [i915]] digital hpd port D - short
[1.896781] [drm:intel_get_hpd_pins [i915]] hotplug event received, stat 0x0020, dig 0x0020, pins 0x0080
[1.896817] [drm:intel_hpd_irq_handler [i915]] digital hpd port D - short
[1.897275] [drm:intel_get_hpd_pins [i915]] hotplug event received, stat 0x0020, dig 0x0020, pins 0x0080

The customer's system in question has a GM45 GPU, which is apparently
well known for hotplugging storms.

So, work around this impressively broken hardware by changing the default
HPD storm threshold from 5 to 50. Then, make long IRQs count for 10, and
short IRQs count for 1. This makes it so that 5 long IRQs will trigger
an HPD storm, and on systems with short HPD storm detection 50 short
IRQs will trigger an HPD storm. 50 short IRQs amount to 100ms of
constant pulsing, which seems like a good middle ground between being too
sensitive and not being sensitive enough (which would cause visible
stutters in userspace every time a storm occurs).

And just to be extra safe: we don't enable this by default on systems
with MST support. There's too high of a chance of MST support triggering
storm detection, and systems that are new enough to support MST are a
lot less likely to have issues with IRQ storms anyway.
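
For readers skimming the thread, the weighting scheme and the MST default
described in the two paragraphs above can be sketched roughly as follows.
This is a standalone illustration, not the code from the patch: the macro
names, struct hpd_pin_stats, hpd_storm_detect() and
hpd_short_storm_default() are all hypothetical, the time window that
resets the counter is left out, and the ">=" comparison is assumed so that
the "5 long / 50 short" figures quoted above hold (the real driver may
compare differently).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names; the values are the ones given in the commit message. */
#define HPD_STORM_THRESHOLD   50  /* new default storm threshold */
#define HPD_LONG_IRQ_WEIGHT   10  /* each long pulse counts for 10 */
#define HPD_SHORT_IRQ_WEIGHT   1  /* each short pulse counts for 1 */

/* Per-pin bookkeeping (hypothetical; window/timestamp handling omitted). */
struct hpd_pin_stats {
	int count; /* weighted pulses accumulated so far */
};

/*
 * Assumed default policy: short storm detection is off where MST is
 * supported and on everywhere else.
 */
static bool hpd_short_storm_default(bool has_dp_mst)
{
	return !has_dp_mst;
}

/*
 * Account for one HPD pulse on a pin and report whether the weighted
 * count has reached the storm threshold.
 */
static bool hpd_storm_detect(struct hpd_pin_stats *pin, bool long_hpd,
			     bool short_storm_enabled)
{
	assert(pin != NULL);

	/* Short pulses only count when short storm detection is enabled. */
	if (!long_hpd && !short_storm_enabled)
		return false;

	pin->count += long_hpd ? HPD_LONG_IRQ_WEIGHT : HPD_SHORT_IRQ_WEIGHT;

	return pin->count >= HPD_STORM_THRESHOLD;
}
```

With these assumptions a single counter covers both cases: the fifth long
pulse or the fiftieth short pulse reaches the threshold, and short pulses
are ignored entirely when hpd_short_storm_default() returned false and
nothing has overridden it.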

As a note: this patch was tested using a ThinkPad T450s and a Chamelium
to simulate the short IRQ storms.

Changes since v1:
- Don't use two separate thresholds, just make long IRQs count for 10
  each and short IRQs count for 1. This simplifies the code a bit
  - Ville Syrjälä
Changes since v2:
- Document @long_hpd in intel_hpd_irq_storm_detect, no functional
  changes
Changes since v4:
- Remove !! in long_hpd assignment - Ville Syrjälä
- queue_hp = true - Ville Syrjälä

Signed-off-by: Lyude Paul 
Cc: Ville Syrjälä 
Cc: Rodrigo Vivi 
---
 drivers/gpu/drm/i915/i915_debugfs.c  | 74 
 drivers/gpu/drm/i915/i915_drv.h  |  5 +-
 drivers/gpu/drm/i915/i915_irq.c  |  7 +++
 drivers/gpu/drm/i915/intel_hotplug.c | 50 +++
 4 files changed, 115 insertions(+), 21 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index f60485906f7e..670db5073d70 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -4658,6 +4658,79 @@ static const struct file_operations i915_hpd_storm_ctl_fops = {
    .write = i915_hpd_storm_ctl_write
 };
 
+static int i915_hpd_short_storm_ctl_show(struct seq_file *m, void *data)
+{
+   struct drm_i915_private *dev_priv = m->private;
+
+   seq_printf(m, "Enabled: %s\n",
+  yesno(dev_priv->hotplug.hpd_short_storm_enabled));
+
+   return 0;
+}
+
+static int