Re: [PATCH v2 3/3] drm/i915: Add short HPD IRQ storm detection for non-MST systems
On Fri, Oct 26, 2018 at 06:49:09PM -0400, Lyude Paul wrote:
Re: [Intel-gfx] [PATCH v2 3/3] drm/i915: Add short HPD IRQ storm detection for non-MST systems
Hi Lyude,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on drm-intel/for-linux-next]
[also build test WARNING on v4.19 next-20181019]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Lyude-Paul/drm-i915-HPD-IRQ-storm-detection-fixes/20181027-085424
base:   git://anongit.freedesktop.org/drm-intel for-linux-next
reproduce: make htmldocs

All warnings (new ones prefixed by >>):

   include/net/mac80211.h:977: warning: Function parameter or member 'status.rates' not described in 'ieee80211_tx_info'
   include/net/mac80211.h:977: warning: Function parameter or member 'status.ack_signal' not described in 'ieee80211_tx_info'
   include/net/mac80211.h:977: warning: Function parameter or member 'status.ampdu_ack_len' not described in 'ieee80211_tx_info'
   include/net/mac80211.h:977: warning: Function parameter or member 'status.ampdu_len' not described in 'ieee80211_tx_info'
   include/net/mac80211.h:977: warning: Function parameter or member 'status.antenna' not described in 'ieee80211_tx_info'
   include/net/mac80211.h:977: warning: Function parameter or member 'status.tx_time' not described in 'ieee80211_tx_info'
   include/net/mac80211.h:977: warning: Function parameter or member 'status.is_valid_ack_signal' not described in 'ieee80211_tx_info'
   include/net/mac80211.h:977: warning: Function parameter or member 'status.status_driver_data' not described in 'ieee80211_tx_info'
   include/net/mac80211.h:977: warning: Function parameter or member 'driver_rates' not described in 'ieee80211_tx_info'
   include/net/mac80211.h:977: warning: Function parameter or member 'pad' not described in 'ieee80211_tx_info'
   include/net/mac80211.h:977: warning: Function parameter or member 'rate_driver_data' not described in 'ieee80211_tx_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 'rx_stats_avg' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 'rx_stats_avg.signal' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 'rx_stats_avg.chain_signal' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 'status_stats.filtered' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 'status_stats.retry_failed' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 'status_stats.retry_count' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 'status_stats.lost_packets' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 'status_stats.last_tdls_pkt_time' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 'status_stats.msdu_retries' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 'status_stats.msdu_failed' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 'status_stats.last_ack' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 'status_stats.last_ack_signal' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 'status_stats.ack_signal_filled' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 'status_stats.avg_ack_signal' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 'tx_stats.packets' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 'tx_stats.bytes' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 'tx_stats.last_rate' not described in 'sta_info'
   net/mac80211/sta_info.h:588: warning: Function parameter or member 'tx_stats.msdu' not described in 'sta_info'
   include/linux/dma-buf.h:304: warning: Function parameter or member 'cb_excl.cb' not described in 'dma_buf'
   include/linux/dma-buf.h:304: warning: Function parameter or member 'cb_excl.poll' not described in 'dma_buf'
   include/linux/dma-buf.h:304: warning: Function parameter or member 'cb_excl.active' not described in 'dma_buf'
   include/linux/dma-buf.h:304: warning: Function parameter or member 'cb_shared.cb' not described in 'dma_buf'
   include/linux/dma-buf.h:304: warning: Function parameter or member 'cb_shared.poll' not described in 'dma_buf'
   include/linux/dma-buf.h:304: warning: Function parameter or member 'cb_shared.active' not described in 'dma_buf'
   include/linux/dma-fence-array.h:54: warning: Function parameter or member 'work' not described in 'dma_fence_array'
   include/linux/gpio/driver.h:142: warning: Function param
[PATCH v2 3/3] drm/i915: Add short HPD IRQ storm detection for non-MST systems
Unfortunately, it seems that the HPD IRQ storm problem from the early
days of Intel GPUs was never entirely solved, only mostly. Within the
last couple of days, I got a bug report from one of our customers who
had been having issues with their machine suddenly booting up very
slowly after having updated. The amount of time it took to boot went
from around 30 seconds, to over 6 minutes consistently.

After some investigation, I discovered that i915 was reporting massive
amounts of short HPD IRQ spam on this system from the DisplayPort port,
despite there not being anything actually connected. The symptoms would
start with one "long" HPD IRQ being detected at boot:

[1.891398] [drm:intel_get_hpd_pins [i915]] hotplug event received, stat 0x0044, dig 0x0044, pins 0x00a0
[1.891436] [drm:intel_hpd_irq_handler [i915]] digital hpd port B - long
[1.891472] [drm:intel_hpd_irq_handler [i915]] Received HPD interrupt on PIN 5 - cnt: 0
[1.891508] [drm:intel_hpd_irq_handler [i915]] digital hpd port D - long
[1.891544] [drm:intel_hpd_irq_handler [i915]] Received HPD interrupt on PIN 7 - cnt: 0
[1.891592] [drm:intel_dp_hpd_pulse [i915]] got hpd irq on port B - long
[1.891628] [drm:intel_dp_hpd_pulse [i915]] got hpd irq on port D - long
…

followed by constant short IRQs afterwards:

[1.895091] [drm:intel_encoder_hotplug [i915]] [CONNECTOR:66:DP-1] status updated from unknown to disconnected
[1.895129] [drm:i915_hotplug_work_func [i915]] Connector DP-3 (pin 7) received hotplug event.
[1.895165] [drm:intel_dp_detect [i915]] [CONNECTOR:72:DP-3]
[1.895275] [drm:intel_get_hpd_pins [i915]] hotplug event received, stat 0x0020, dig 0x0020, pins 0x0080
[1.895312] [drm:intel_hpd_irq_handler [i915]] digital hpd port D - short
[1.895762] [drm:intel_get_hpd_pins [i915]] hotplug event received, stat 0x0020, dig 0x0020, pins 0x0080
[1.895799] [drm:intel_hpd_irq_handler [i915]] digital hpd port D - short
[1.896239] [drm:intel_dp_aux_xfer [i915]] dp_aux_ch timeout status 0x71450085
[1.896293] [drm:intel_get_hpd_pins [i915]] hotplug event received, stat 0x0020, dig 0x0020, pins 0x0080
[1.896330] [drm:intel_hpd_irq_handler [i915]] digital hpd port D - short
[1.896781] [drm:intel_get_hpd_pins [i915]] hotplug event received, stat 0x0020, dig 0x0020, pins 0x0080
[1.896817] [drm:intel_hpd_irq_handler [i915]] digital hpd port D - short
[1.897275] [drm:intel_get_hpd_pins [i915]] hotplug event received, stat 0x0020, dig 0x0020, pins 0x0080

The customer's system in question has a GM45 GPU, which is apparently
well known for hotplugging storms.

So, work around this impressively broken hardware by changing the default
HPD storm threshold from 5 to 50. Then, make long IRQs count for 10, and
short IRQs count for 1. This makes it so that 5 long IRQs will trigger
an HPD storm, and on systems with short HPD storm detection 50 short
IRQs will trigger an HPD storm. 50 short IRQs amounts to 100ms of
constant pulsing, which seems like a good middle ground between being too
sensitive and not being sensitive enough (which would cause visible
stutters in userspace every time a storm occurs).

And just to be extra safe: we don't enable this by default on systems
with MST support. There's too high of a chance of MST support triggering
storm detection, and systems that are new enough to support MST are a
lot less likely to have issues with IRQ storms anyway.

As a note: this patch was tested using a ThinkPad T450s and a Chamelium
to simulate the short IRQ storms.
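
The weighting described above can be sketched in plain C as follows. This is a minimal, self-contained illustration and not the driver code: the struct, the function names, the 1000 ms counting window, the 2 ms pulse spacing (derived from 50 short IRQs over 100 ms) and the `>=` comparison are assumptions made for the example. Only the threshold of 50, the weights of 10 and 1, and the MST default come from the description.

```c
#include <stdbool.h>

/* Values taken from the commit message. */
#define HPD_STORM_DEFAULT_THRESHOLD 50
#define HPD_LONG_IRQ_WEIGHT         10
#define HPD_SHORT_IRQ_WEIGHT        1

/* Assumptions for this sketch, not taken from the patch. */
#define HPD_STORM_WINDOW_MS 1000 /* length of the counting window */
#define HPD_SHORT_PULSE_MS  2    /* 100 ms / 50 short IRQs */

/* Hypothetical per-pin bookkeeping. */
struct hpd_pin_stats {
	unsigned long window_start_ms;
	unsigned int count;
};

/* Short storm detection defaults to off when the system supports MST. */
static bool hpd_short_storm_default(bool has_dp_mst)
{
	return !has_dp_mst;
}

/*
 * Account for one HPD IRQ on a pin and report whether the weighted count
 * has reached the threshold within the current window.
 */
static bool hpd_storm_detect(struct hpd_pin_stats *stats, unsigned long now_ms,
			     bool long_hpd, bool short_storm_enabled)
{
	/* Short IRQs are ignored unless short storm detection is enabled. */
	if (!long_hpd && !short_storm_enabled)
		return false;

	/* Start a new window once the previous one has expired. */
	if (now_ms - stats->window_start_ms >= HPD_STORM_WINDOW_MS) {
		stats->window_start_ms = now_ms;
		stats->count = 0;
	}

	stats->count += long_hpd ? HPD_LONG_IRQ_WEIGHT : HPD_SHORT_IRQ_WEIGHT;

	/* ">=" so that 5 long or 50 short IRQs trip it, as described. */
	return stats->count >= HPD_STORM_DEFAULT_THRESHOLD;
}

/*
 * Feed up to max_irqs identical IRQs, HPD_SHORT_PULSE_MS apart, and return
 * how many were needed to trigger a storm (0 if none was triggered).
 */
static unsigned int hpd_irqs_until_storm(bool long_hpd,
					 bool short_storm_enabled,
					 unsigned int max_irqs)
{
	struct hpd_pin_stats stats = { 0, 0 };
	unsigned int i;

	for (i = 0; i < max_irqs; i++) {
		unsigned long now_ms = (unsigned long)i * HPD_SHORT_PULSE_MS;

		if (hpd_storm_detect(&stats, now_ms, long_hpd,
				     short_storm_enabled))
			return i + 1;
	}
	return 0;
}
```

With these values, `hpd_irqs_until_storm(true, false, 100)` returns 5 and `hpd_irqs_until_storm(false, true, 100)` returns 50, while short IRQs never trigger a storm when short storm detection is disabled. A single weighted counter keeps one threshold for both IRQ types, which is the simplification noted in the v1 changelog.
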
Changes since v1:
- Don't use two separate thresholds, just make long IRQs count for 10
  each and short IRQs count for 1. This simplifies the code a bit
  - Ville Syrjälä

Signed-off-by: Lyude Paul
Cc: Ville Syrjälä
---
 drivers/gpu/drm/i915/i915_debugfs.c  | 74
 drivers/gpu/drm/i915/i915_drv.h      | 5 +-
 drivers/gpu/drm/i915/i915_irq.c      | 7 +++
 drivers/gpu/drm/i915/intel_hotplug.c | 47 +++---
 4 files changed, 113 insertions(+), 20 deletions(-)

diff --git a/drivers/gpu/drm/i915/i915_debugfs.c b/drivers/gpu/drm/i915/i915_debugfs.c
index b4744a68cd88..1595b8565875 100644
--- a/drivers/gpu/drm/i915/i915_debugfs.c
+++ b/drivers/gpu/drm/i915/i915_debugfs.c
@@ -4641,6 +4641,79 @@ static const struct file_operations i915_hpd_storm_ctl_fops = {
 	.write = i915_hpd_storm_ctl_write
 };
 
+static int i915_hpd_short_storm_ctl_show(struct seq_file *m, void *data)
+{
+	struct drm_i915_private *dev_priv = m->private;
+
+	seq_printf(m, "Enabled: %s\n",
+		   yesno(dev_priv->hotplug.hpd_short_storm_enabled));
+
+	return 0;
+}
+
+static int
+i915_hpd_short_storm_ctl_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, i915_hpd_short_storm_ctl_show,
+			   inode->i_private);
+}
+
+static ssize_t i915_hpd_short_storm_ctl_write