Re: [PATCH] phy: qcom: qmp-combo: Fix VCO div offset on v3
On Thu, Apr 04, 2024 at 04:43:44PM -0700, Stephen Boyd wrote: > Commit ec17373aebd0 ("phy: qcom: qmp-combo: extract common function to > setup clocks") changed the offset that is used to write to > DP_PHY_VCO_DIV from QSERDES_V3_DP_PHY_VCO_DIV to > QSERDES_V4_DP_PHY_VCO_DIV. Unfortunately, this offset is different > between v3 and v4 phys: > > #define QSERDES_V3_DP_PHY_VCO_DIV 0x064 > #define QSERDES_V4_DP_PHY_VCO_DIV 0x070 > > meaning that we write the wrong register on v3 phys now. Add another > generic register to 'regs' and use it here instead of a version specific > define to fix this. > > This was discovered after Abhinav looked over register dumps with me > from sc7180 Trogdor devices that started failing to light up the > external display with v6.6 based kernels. It turns out that some > monitors are very specific about their link clk frequency and if the > default power on reset value is still there the monitor will show a > blank screen or a garbled display. Other monitors are perfectly happy to > get a bad clock signal. > Fixes: ec17373aebd0 ("phy: qcom: qmp-combo: extract common function to setup > clocks") > Signed-off-by: Stephen Boyd > --- > drivers/phy/qualcomm/phy-qcom-qmp-combo.c | 6 +- > 1 file changed, 5 insertions(+), 1 deletion(-) > > diff --git a/drivers/phy/qualcomm/phy-qcom-qmp-combo.c > b/drivers/phy/qualcomm/phy-qcom-qmp-combo.c > index 7d585a4a..3b19d8ebf467 100644 > --- a/drivers/phy/qualcomm/phy-qcom-qmp-combo.c > +++ b/drivers/phy/qualcomm/phy-qcom-qmp-combo.c > @@ -77,6 +77,7 @@ enum qphy_reg_layout { > QPHY_COM_BIAS_EN_CLKBUFLR_EN, > > QPHY_DP_PHY_STATUS, > + QPHY_DP_PHY_VCO_DIV, > > QPHY_TX_TX_POL_INV, > QPHY_TX_TX_DRV_LVL, > @@ -102,6 +103,7 @@ static const unsigned int > qmp_v3_usb3phy_regs_layout[QPHY_LAYOUT_SIZE] = { > + [QPHY_DP_PHY_VCO_DIV] = QSERDES_V3_DP_PHY_VCO_DIV, > @@ -126,6 +128,7 @@ static const unsigned int > qmp_v45_usb3phy_regs_layout[QPHY_LAYOUT_SIZE] = { > + [QPHY_DP_PHY_VCO_DIV] = QSERDES_V4_DP_PHY_VCO_DIV, I happened to skim this patch on the list and noticed that you added a new register abstraction but only updated two tables. A quick look at the driver reveals that there are currently four such tables, which means that the v5_5nm (e.g. the Lenovo ThinkPad X13s) and v6 hardware would now be broken instead as they would write to offset 0. Clearly the hardware abstraction in this driver leaves a lot to wish for when it's this fragile, but how can three people including the maintainer review this change without this being noticed? I just sent a follow-up fix here: https://lore.kernel.org/lkml/20240408093023.506-1-johan+lin...@kernel.org/ > @@ -2184,7 +2188,7 @@ static int qmp_combo_configure_dp_clocks(struct > qmp_combo *qmp) > /* Other link rates aren't supported */ > return -EINVAL; > } > - writel(phy_vco_div, qmp->dp_dp_phy + QSERDES_V4_DP_PHY_VCO_DIV); > + writel(phy_vco_div, qmp->dp_dp_phy + cfg->regs[QPHY_DP_PHY_VCO_DIV]); Johan
Re: [PATCH] phy: qcom: qmp-combo: Fix register base for QSERDES_DP_PHY_MODE
On Thu, Apr 04, 2024 at 05:56:32PM -0700, Bjorn Andersson wrote: > On Thu, Apr 04, 2024 at 05:01:03PM -0700, Stephen Boyd wrote: > > The register base that was used to write to the QSERDES_DP_PHY_MODE > > register was 'dp_dp_phy' before commit 815891eee668 ("phy: > > qcom-qmp-combo: Introduce orientation variable"). There isn't any > > explanation in the commit why this is changed, so I suspect it was an > > oversight or happened while being extracted from some other series. > > Thanks for catching that, I wrote that patch long before Johan did the > rename of "pcs" to "dp_dp_phy", and must have missed that while later > rebasing the patch. > > Reviewed-by: Bjorn Andersson > > > Oddly the value being 0x4c or 0x5c doesn't seem to matter for me, so I > > suspect this is dead code, but that can be fixed in another patch. It's > > not good to write to the wrong register space, and maybe some other > > version of this phy relies on this. This code is still reached on sc8280xp, but I guess only Qualcomm can tell us what these bits are for (and they should). The write to qmp->pcs + QSERDES_DP_PHY_MODE does not seem to have any effect on sc8280xp and that register still reads back as 0x2020202 after the incorrect write. qmp->dp_dp_phy + QSERDES_DP_PHY_MODE reads back as 0x4c4c4c4c before the fixed write and either 0x4c4c4c4c or 0x5c5c5c5c after depending on the orientation. Can someone please replace the magic constants in this driver, and at least explain what the impact of bit 0x10 not reflecting the orientation is? > > Fixes: 815891eee668 ("phy: qcom-qmp-combo: Introduce orientation variable") > > Signed-off-by: Stephen Boyd Either way, good catch, this was clearly unintentional: Reviewed-by: Johan Hovold I think this should go to stable as well even if the impact is currently not fully understood: Cc: sta...@vger.kernel.org # 6.5 > > @@ -2150,9 +2150,9 @@ static bool qmp_combo_configure_dp_mode(struct > > qmp_combo *qmp) > > writel(val, qmp->dp_dp_phy + QSERDES_DP_PHY_PD_CTL); > > > > if (reverse) > > - writel(0x4c, qmp->pcs + QSERDES_DP_PHY_MODE); > > + writel(0x4c, qmp->dp_dp_phy + QSERDES_DP_PHY_MODE); > > else > > - writel(0x5c, qmp->pcs + QSERDES_DP_PHY_MODE); > > + writel(0x5c, qmp->dp_dp_phy + QSERDES_DP_PHY_MODE); Johan
Re: [PATCH v3 1/2] drm/msm/dp: Add support for determining the eDP/DP mode from DT
On Fri, Mar 22, 2024 at 04:15:23PM +0200, Abel Vesa wrote: > On 24-03-22 15:38:03, Dmitry Baryshkov wrote: > > On Fri, 22 Mar 2024 at 15:36, Abel Vesa wrote: > > > On 24-03-22 15:30:54, Dmitry Baryshkov wrote: > > > > On Fri, 22 Mar 2024 at 15:22, Abel Vesa wrote: > > > > > +static int dp_display_get_connector_type(struct platform_device > > > > > *pdev, > > > > > +const struct msm_dp_desc > > > > > *desc) > > > > > +{ > > > > > + struct device *dev = &pdev->dev; > > > > > + struct device_node *aux_bus; > > > > > + struct device_node *panel; > > > > > + int ret = DRM_MODE_CONNECTOR_DisplayPort; > > > > > + > > > > > + /* legacy platforms specify connector type in match data */ > > > > > + if (desc->connector_type == DRM_MODE_CONNECTOR_eDP || > > > > > + desc->connector_type == > > > > > DRM_MODE_CONNECTOR_DisplayPort) > > > > > > > > misaligned > > > > > > > > > > Sure, will fix. > > > > > > > > + return desc->connector_type; > > > > > > > > Can we drop this part completely? > > > > > > > > > > You mean the whole if clause? How should we handle the legacy approach > > > then? > > > > Legacy platforms still have the aux-bus/panel. so they should be > > handled by the check below. > > > > Oh, in that case we can drop the connector_type from every desc for all > platforms. Guys, please trim your replies! Johan
Re: [PATCH] drm/msm/dp: move link_ready out of HPD event thread
On Mon, Mar 18, 2024 at 11:01:25AM -0700, Abhinav Kumar wrote: > On 3/15/2024 8:57 AM, Johan Hovold wrote: > > On Thu, Mar 14, 2024 at 09:30:57AM -0700, Abhinav Kumar wrote: > >> The race condition is between the time we get disconnect event and set > >> link_ready to false, a commit can come in. Because setting link_ready to > >> false happens in the event thread so it could be slightly delayed. > > > > I get this part, just not why, or rather when, that becomes a problem. > > > > Once the disconnect event is processed, dp_hpd_unplug_handle() will > > update the state to ST_DISCONNECT_PENDING, and queue a notification > > event. link_ready is (before this patch) still set to 1. > This is the case I am thinking of: > > 1) Disconnect event happens which will call dp_hpd_unplug_handle() but > link_ready is not false yet. > > 2) There is a commit with a modeset, which shall trigger > atomic_disable() followed by an atomic_enable() > > atomic_disable() will go through disable clocks and set hpd_state to > ST_DISCONNECTED. > > 3) atomic_enable() will not go through because we will bail out because > state was ST_DISCONNECTED. > > if (state != ST_DISPLAY_OFF && state != ST_MAINLINK_READY) { > mutex_unlock(&dp_display->event_mutex); > return; > } > > 4) Now, if there is another commit with a modeset, it will go and crash > at atomic_disable() Right, that's what I described in the mail you replied to but that still doesn't answer what triggers those mode sets. > > Here a commit comes in; what exactly are you suggesting would trigger > > that? And in such a way that it breaks the state machine? > Like we have seen, the commit can either come directly from userspace as > one last frame (the original bug I had given the link to) or from the > __drm_fb_helper_restore_fbdev_mode_unlocked() which happened in > sc8280xp's case. This is totally independent of the hpd_thread() with no > mutual exclusion. Right . Still not sure about the details about "that last frame" issue, that you saw in the past, and if that's still an issue or not. You claimed that you had fixed that, right? > This commit() can come before the link_ready was set to false. If it had > come after link_ready was set to false, atomic_check() would have failed > and no issue would have been seen. > > My change is making the link_ready false sooner in the disconnect case. Yes, but again, and as you have confirmed, you're only papering over the issue at such a mode set can still come in before you set link_state to false. > > One way this could cause trouble is if you end up with a call to > > dp_bridge_atomic_post_disable() which updates the state to > > ST_DISCONNECTED. (1) > > > > This would then need to be followed by another call to > > dp_bridge_atomic_enable() which bails out early with the link clock > > disabled. (2) (And if link_ready were to be set to 0 sooner, the > > likelihood of this is reduced.) > > > > This in turn, would trigger a reset when dp_bridge_atomic_disable() is > > later called. > Yes, this is exactly what I have written above. Thanks for confirming. > > This is the kind of description of the race I expect to see in the > > commit message, and I'm still not sure what would trigger the call to > > dp_bridge_atomic_post_disable() and dp_bridge_atomic_enable() (i.e. (1) > > and (2) above) and whether this is a real issue or not. > > I have explained what triggers the disable/enable call below. > > > Also note that the above scenario is quite different from the one I've > > hit and described earlier. > Why is that so? Eventually it will also translate to the same scenario. > I would like to understand why this is different. I think in your case, > probably we do not know what triggers the modeset, but its a minor > detail like I have written before. The state transitions are different and the enable event comes in before the bridge has been fully tore down unlike in the scenario we outlined above. And it's certainly not a minor detail, as in the sc8280xp VT case, those spurious hotplug events that trigger the atomic_enable would not have caused any trouble if it wasn't for the case that the bridge was stuck in the ST_MAINLINK_READY state. That explains why the hotplug notification revert in rc7 made a difference on sc8280xp. You're talking about an entirely different and, as far as I can tell, hypothetical scenario where are user executes a modeset while pulling the plug. This is certainly not why we had a number of user suddenly starting to hit thi
Re: [PATCH] drm/msm/dp: move link_ready out of HPD event thread
On Thu, Mar 14, 2024 at 09:30:57AM -0700, Abhinav Kumar wrote: > On 3/14/2024 8:38 AM, Johan Hovold wrote: > > On Wed, Mar 13, 2024 at 10:24:08AM -0700, Abhinav Kumar wrote: > > Perhaps I'm missing something in the race that you are trying to > > describe (and which I've asked you to describe in more detail so that I > > don't have to spend more time trying to come up with a reproducer > > myself). > The race condition is between the time we get disconnect event and set > link_ready to false, a commit can come in. Because setting link_ready to > false happens in the event thread so it could be slightly delayed. I get this part, just not why, or rather when, that becomes a problem. Once the disconnect event is processed, dp_hpd_unplug_handle() will update the state to ST_DISCONNECT_PENDING, and queue a notification event. link_ready is (before this patch) still set to 1. Here a commit comes in; what exactly are you suggesting would trigger that? And in such a way that it breaks the state machine? One way this could cause trouble is if you end up with a call to dp_bridge_atomic_post_disable() which updates the state to ST_DISCONNECTED. (1) This would then need to be followed by another call to dp_bridge_atomic_enable() which bails out early with the link clock disabled. (2) (And if link_ready were to be set to 0 sooner, the likelihood of this is reduced.) This in turn, would trigger a reset when dp_bridge_atomic_disable() is later called. This is the kind of description of the race I expect to see in the commit message, and I'm still not sure what would trigger the call to dp_bridge_atomic_post_disable() and dp_bridge_atomic_enable() (i.e. (1) and (2) above) and whether this is a real issue or not. Also note that the above scenario is quite different from the one I've hit and described earlier. > It will be hard to reproduce this. Only way I can think of is to delay > the EV_NOTIFICATION for sometime and see in dp_bridge_hpd_notify() > > else if (dp_display->link_ready && status == > connector_status_disconnected) > dp_add_event(dp, EV_HPD_UNPLUG_INT, 0, 0); > > as dp_add_event() will add the event, then wakeup the event_q. Sure that would increase the race window with the current code, but that alone isn't enough to trigger the bug AFAICT. > Before the event thread wakes up and processes this unplug event, the > commit can come in. This is the race condition i was thinking of. Yes, but what triggers the commit? And why would it lead to a mode set that disables the bridge? Johan
Re: [PATCH] drm/msm/dp: move link_ready out of HPD event thread
On Wed, Mar 13, 2024 at 10:24:08AM -0700, Abhinav Kumar wrote: > On 3/13/2024 1:18 AM, Johan Hovold wrote: > > Right, but your proposed fix would not actually fix anything and judging > > from the sparse commit message and diff itself it is clearly only meant > > to mitigate the case where user space is involved, which is *not* the > > case here. > There can be a race condition between the time the DP driver gets the > hpd disconnect event and when the hpd thread processes that event > allowing the commit to sneak in. This is something which has always been > there even without pm_runtime series and remains even today. > > In this race condition, the setting of "link_ready" to false can be a > bit delayed if we go through the HPD event processing increasing the > race condition window. > > If link_ready is false, atomic_check() fails, thereby failing any > commits and hence not allowing the atomic_disable() / atomic_enable() > cycle and hence avoiding this reset. > > The patch is moving the setting of link_ready to false earlier by not > putting it through the HPD event thread and hence trying to reduce the > window of the issue. Perhaps I'm missing something in the race that you are trying to describe (and which I've asked you to describe in more detail so that I don't have to spend more time trying to come up with a reproducer myself). I do understand how your patch works, but my point is that it does not fix the race that we are hitting on sc8280xp and, unless I'm missing something, it is not even sufficient to fix the race you are talking about as user space can still trigger that ioctl() before you clear the link_ready flag. That's why I said that it is only papering over the issue by making the race window smaller (and this should also be highlighted in the commit message). For some reason it also made things worse on sc8280xp, but I did not spend time on tracking down exactly why. Johan
[PATCH 0/2] drm/msm/dp: fix runtime PM leaks on hotplug
As I've reported elsewhere, I've been hitting runtime PM usage count leaks when investigated a DisplayPort hotplug regression on the Lenovo ThinkPad X13s. [1] This series addresses two obvious leaks on disconnect and on connect failures, respectively. Johan [1] https://lore.kernel.org/lkml/ze8ke_m2xhypy...@hovoldconsulting.com/ Johan Hovold (2): drm/msm/dp: fix runtime PM leak on disconnect drm/msm/dp: fix runtime PM leak on connect failure drivers/gpu/drm/msm/dp/dp_display.c | 2 ++ 1 file changed, 2 insertions(+) -- 2.43.2
[PATCH 2/2] drm/msm/dp: fix runtime PM leak on connect failure
Make sure to balance the runtime PM usage counter (and suspend) before returning on connect failures (e.g. DPCD read failures after a spurious connect event or if link training fails). Fixes: 5814b8bf086a ("drm/msm/dp: incorporate pm_runtime framework into DP driver") Cc: sta...@vger.kernel.org # 6.8 Cc: Kuogee Hsieh Signed-off-by: Johan Hovold --- drivers/gpu/drm/msm/dp/dp_display.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/msm/dp/dp_display.c b/drivers/gpu/drm/msm/dp/dp_display.c index 8e8cf531da45..78464c395c3d 100644 --- a/drivers/gpu/drm/msm/dp/dp_display.c +++ b/drivers/gpu/drm/msm/dp/dp_display.c @@ -598,6 +598,7 @@ static int dp_hpd_plug_handle(struct dp_display_private *dp, u32 data) ret = dp_display_usbpd_configure_cb(&pdev->dev); if (ret) { /* link train failed */ dp->hpd_state = ST_DISCONNECTED; + pm_runtime_put_sync(&pdev->dev); } else { dp->hpd_state = ST_MAINLINK_READY; } -- 2.43.2
[PATCH 1/2] drm/msm/dp: fix runtime PM leak on disconnect
Make sure to put the runtime PM usage count (and suspend) also when receiving a disconnect event while in the ST_MAINLINK_READY state. This specifically avoids leaking a runtime PM usage count on every disconnect with display servers that do not automatically enable external displays when receiving a hotplug notification. Fixes: 5814b8bf086a ("drm/msm/dp: incorporate pm_runtime framework into DP driver") Cc: sta...@vger.kernel.org # 6.8 Cc: Kuogee Hsieh Signed-off-by: Johan Hovold --- drivers/gpu/drm/msm/dp/dp_display.c | 1 + 1 file changed, 1 insertion(+) diff --git a/drivers/gpu/drm/msm/dp/dp_display.c b/drivers/gpu/drm/msm/dp/dp_display.c index 4c72124ffb5d..8e8cf531da45 100644 --- a/drivers/gpu/drm/msm/dp/dp_display.c +++ b/drivers/gpu/drm/msm/dp/dp_display.c @@ -655,6 +655,7 @@ static int dp_hpd_unplug_handle(struct dp_display_private *dp, u32 data) dp_display_host_phy_exit(dp); dp->hpd_state = ST_DISCONNECTED; dp_display_notify_disconnect(&dp->dp_display.pdev->dev); + pm_runtime_put_sync(&pdev->dev); mutex_unlock(&dp->event_mutex); return 0; } -- 2.43.2
Re: [PATCH] drm/msm/dp: move link_ready out of HPD event thread
On Tue, Mar 12, 2024 at 10:39:46AM -0700, Abhinav Kumar wrote: > On 3/12/2024 9:59 AM, Johan Hovold wrote: > >> Heh. This is getting ridiculous. I just tried running with this patch > >> and it again breaks hotplug detect in a VT console and in X (where I > >> could enable a reconnected external display by running xrandr twice > >> before). > >> > >> So, please, do not apply this one. > > > > To make things worse, I indeed also hit the reset when disconnecting > > after such a failed hotplug. > Ack, I will hold off till I analyze your issues more which you have > listed in separate replies. Especially about the spurious connect, I > believe you are trying to mention that, by adding logs, you are able to > delay the processing of a connect event to *make* it like a spurious > one? In case, I got this part wrong, can you pls explain the spurious > connect scenario again? No, I only mentioned the debug printks in passing as instrumentation like that may affect race conditions (but I'm also hitting the resets also with no printks in place). The spurious connect event comes directly from the pmic firmware, and even if we may optimise things by implementing some kind of debounce, the hotplug implementation needs to be robust enough to not kill the machine if such an event gets through. Basically what I see is that during physical disconnect there can be multiple hpd notify events (e.g. connect, disconnect, connect): [ 146.910195] usb 5-1: USB disconnect, device number 4 [ 146.931026] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 2 [ 146.934785] msm-dp-display ae98000.displayport-controller: dp_hpd_unplug_handle [ 146.938114] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 1, status = 1 [ 146.940245] [CONNECTOR:35:DP-2] status updated from disconnected to connected [ 146.955193] msm-dp-display ae98000.displayport-controller: dp_bridge_hpd_notify - link_ready = 0, status = 2 And it is the spurious connect event while the link is being tore down that triggers the hotplug processing that leads to the reset. Similarly, I've seen spurious disconnect events while the plug in being inserted. > A short response on why this change was made is that commit can be > issued by userspace or the fbdev client. So userspace involvement only > makes commit happen from a different path. It would be incorrect to > assume the issues from the earlier bug and the current one are different > only because there was userspace involvement in that one and not this. > > Because in the end, it manifests itself in the same way that > atomic_enable() did not go through after an atomic_disable() and the > next atomic_disable() crashes. Right, but your proposed fix would not actually fix anything and judging from the sparse commit message and diff itself it is clearly only meant to mitigate the case where user space is involved, which is *not* the case here. Johan
Re: [PATCH] drm/msm/dp: move link_ready out of HPD event thread
On Tue, Mar 12, 2024 at 05:41:23PM +0100, Johan Hovold wrote: > On Tue, Mar 12, 2024 at 11:09:11AM +0100, Johan Hovold wrote: > > On Fri, Mar 08, 2024 at 01:45:32PM -0800, Abhinav Kumar wrote: > > > > @@ -466,6 +466,8 @@ static int dp_display_notify_disconnect(struct device > > > *dev) > > > { > > > struct dp_display_private *dp = dev_get_dp_display_private(dev); > > > > > > + dp->dp_display.link_ready = false; > > > > As I also pointed out in the other thread, setting link_ready to false > > here means that any spurious connect event (during physical disconnect) > > will always be processed, something which can currently lead to a leaked > > runtime pm reference. > > > > Wasting some power is of course preferred over crashing the machine, but > > please take it into consideration anyway. > > > > Especially if your intention with this patch was to address the resets > > we saw with sc8280xp which are gone since the HPD notify revert (which > > fixed the hotplug detect issue that left the bridge in a > > half-initialised state). > > Heh. This is getting ridiculous. I just tried running with this patch > and it again breaks hotplug detect in a VT console and in X (where I > could enable a reconnected external display by running xrandr twice > before). > > So, please, do not apply this one. To make things worse, I indeed also hit the reset when disconnecting after such a failed hotplug. Johan
Re: [PATCH] drm/msm/dp: move link_ready out of HPD event thread
On Tue, Mar 12, 2024 at 11:09:11AM +0100, Johan Hovold wrote: > On Fri, Mar 08, 2024 at 01:45:32PM -0800, Abhinav Kumar wrote: > > @@ -466,6 +466,8 @@ static int dp_display_notify_disconnect(struct device > > *dev) > > { > > struct dp_display_private *dp = dev_get_dp_display_private(dev); > > > > + dp->dp_display.link_ready = false; > > As I also pointed out in the other thread, setting link_ready to false > here means that any spurious connect event (during physical disconnect) > will always be processed, something which can currently lead to a leaked > runtime pm reference. > > Wasting some power is of course preferred over crashing the machine, but > please take it into consideration anyway. > > Especially if your intention with this patch was to address the resets > we saw with sc8280xp which are gone since the HPD notify revert (which > fixed the hotplug detect issue that left the bridge in a > half-initialised state). Heh. This is getting ridiculous. I just tried running with this patch and it again breaks hotplug detect in a VT console and in X (where I could enable a reconnected external display by running xrandr twice before). So, please, do not apply this one. Johan
Re: [PATCH] drm/msm/dp: move link_ready out of HPD event thread
On Fri, Mar 08, 2024 at 01:45:32PM -0800, Abhinav Kumar wrote: > There are cases where the userspace might still send another > frame after the HPD disconnect causing a modeset cycle after > a disconnect. This messes the internal state machine of MSM DP driver > and can lead to a crash as there can be an imbalance between > bridge_disable() and bridge_enable(). Can you be more specific here? What steps would lead to this issue and how exactly does is mess with the state machine? Is there an easy way to reproduce it (e.g. by instrumenting the code with some sleep)? The hotplug code is really convoluted and having a clear description of the problem is needed to evaluate the patch (including when revisiting it some time from now when I've forgotten about how this state machine works). As you know, we ran into a related issue on sc8280xp (X13s) since 6.8-rc1, but that did not involve any user space interaction at all. For reference, there are some more details in this thread: https://lore.kernel.org/all/ze8ke_m2xhypy...@hovoldconsulting.com/ > This was also previously reported on [1] for which [2] was posted > and helped resolve the issue by rejecting commits if the DP is not > in connected state. > > The change resolved the bug but there can also be another race condition. > If hpd_event_thread does not pick up the EV_USER_NOTIFICATION and process it > link_ready will also not be set to false allowing the frame to sneak in. How could the event thread fail to pick up the notification event? Or do you mean there's a race window before it has been processed? > Lets move setting link_ready outside of hpd_event_thread() processing to > eliminate a window of race condition. As we discussed in thread above, this patch does not eliminate the race, even if it may reduce the race window. > [1] : https://gitlab.freedesktop.org/drm/msm/-/issues/17 > [2] : > https://lore.kernel.org/all/1664408211-25314-1-git-send-email-quic_khs...@quicinc.com/ > > Fixes: 8a3b4c17f863 ("drm/msm/dp: employ bridge mechanism for display enable > and disable") > Signed-off-by: Abhinav Kumar > @@ -466,6 +466,8 @@ static int dp_display_notify_disconnect(struct device > *dev) > { > struct dp_display_private *dp = dev_get_dp_display_private(dev); > > + dp->dp_display.link_ready = false; As I also pointed out in the other thread, setting link_ready to false here means that any spurious connect event (during physical disconnect) will always be processed, something which can currently lead to a leaked runtime pm reference. Wasting some power is of course preferred over crashing the machine, but please take it into consideration anyway. Especially if your intention with this patch was to address the resets we saw with sc8280xp which are gone since the HPD notify revert (which fixed the hotplug detect issue that left the bridge in a half-initialised state). > + > dp_add_event(dp, EV_USER_NOTIFICATION, false, 0); > > return 0; Johan
Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1
On Mon, Mar 11, 2024 at 09:51:29AM -0700, Abhinav Kumar wrote: > On 3/11/2024 6:43 AM, Johan Hovold wrote: > > On Sat, Mar 09, 2024 at 05:30:17PM +0100, Johan Hovold wrote: > >> On Fri, Mar 08, 2024 at 09:50:17AM -0800, Abhinav Kumar wrote: > >>> I have posted my analysis with the patch here as a RFC y'day: > >>> > >>> https://patchwork.freedesktop.org/patch/581758/ > > I was able to reproduce the reset with some of my debug printks in place > > after reapplying the reverted hpd notify change so I have an explanation > > for (one of) the ways we can up in this state now. > > > > This does not match description of the problem in the fix linked to > > above and I don't think that patch solves the underlying issue even if > > it may make the race window somewhat smaller. More details below. > Its the same condition you described below that enable does not go > through and we bail out as we are in ST_DISCONNECTED. It's closely related but clearly not the same as user space is not involved in the reset I see. > > Turns out there are paths like that and we hit those more often before > > reverting e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify(). > > > > Specifically, when a previous connect attempt did not enable the bridge > > fully so that it is still in the ST_MAINLINK_READY when we receive a > > disconnect event, dp_hpd_unplug_handle() will turn of the link clock. > > > > [ 204.527625] msm-dp-display ae98000.displayport-controller: > > dp_bridge_hpd_notify - link_ready = 1, status = 2 > > [ 204.531553] msm-dp-display ae98000.displayport-controller: > > dp_hpd_unplug_handle > > [ 204.533261] msm-dp-display ae98000.displayport-controller: > > dp_ctrl_off_link > > > > A racing connect event, such as the one I described earlier, can then > > try to enable the bridge again but dp_bridge_atomic_enable() just bails > > out early (and leaks a rpm reference) because we're now in > > ST_DISCONNECTED: > > > > [ 204.535773] msm-dp-display ae98000.displayport-controller: > > dp_bridge_hpd_notify - link_ready = 1, status = 1 > > [ 204.536187] [CONNECTOR:35:DP-2] status updated from disconnected to > > connected > > [ 204.536905] msm-dp-display ae98000.displayport-controller: > > dp_display_notify_disconnect - would clear link ready (1), state = 0 > > [ 204.537821] msm-dp-display ae98000.displayport-controller: > > dp_bridge_atomic_check - link_ready = 1 > > [ 204.538063] msm-dp-display ae98000.displayport-controller: > > dp_display_send_hpd_notification - hpd = 0, link_ready = 1 > > [ 204.542778] msm-dp-display ae98000.displayport-controller: > > dp_bridge_atomic_enable > > [ 204.586547] msm-dp-display ae98000.displayport-controller: > > dp_bridge_atomic_enable - state = 0 (rpm leak?) > > > > Clearing link_ready already in dp_display_notify_disconnect() would make > > the race window slightly smaller, but it would essentially just paper > > over the bug as the events are still not serialised. Notably, there is > > no user space interaction involved here and it's the spurious connect > > event that triggers the bridge enable. > Yes, it only narrows down the race condition window. The issue can still > happen if the commit / modeset was issued before we marked link_ready as > false. > > And yes, I was only targetting a short term fix till we rework the HPD. > That will happen only incrementally as its a delicate piece of code. Ok, thanks for confirming. Please also make that clear in the commit message of any patch. I am however not sure that your patch (RFC) is needed at this point as the HPD revert fixes the 6.8-rc1 regression, and moving the clearing of link_ready can actually make things worse as it makes any spurious hotplug event always be processed (not just if they happen after dp_display_send_hpd_notification()). I'll reply to you patch as well. > > So, while it may still be theoretically possible to hit the resets after > > the revert, the HPD notify revert effectively "fixed" the regression in > > 6.8-rc1 by removing the preconditions that now made us hit it (i.e. the > > half-initialised bridge). > Not entirely. In the bug which was reported before the patch > e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify() got landed, its a > classic example of how this issue can happen with userspace involvement > and not just fbdev client which was your case. Sure, but you already added some kind of band-aid for that issue, right? https://lore.kernel.org/all/1664408211-25314-1-git-send-email-quic_khs...@quicinc.com/ Ho
Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1
On Mon, Mar 11, 2024 at 02:43:24PM +0100, Johan Hovold wrote: > So, while it may still be theoretically possible to hit the resets after > the revert, the HPD notify revert effectively "fixed" the regression in > 6.8-rc1 by removing the preconditions that now made us hit it (i.e. the > half-initialised bridge). > > It seems the hotplug state machine needs to be reworked completely, but > at least we're roughly back where we were with 6.7 (including that the > bus clocks will never be turned of because of the rpm leaks on > disconnect). #regzbot introduced: e467e0bde881 #regzbot fix: 664bad6af3cb
Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1
On Sat, Mar 09, 2024 at 05:30:17PM +0100, Johan Hovold wrote: > On Fri, Mar 08, 2024 at 09:50:17AM -0800, Abhinav Kumar wrote: > > On 3/8/2024 4:43 AM, Johan Hovold wrote: > > > For this last remaining reset with the stacktrace you have mentioned > > below, I do not think this was introduced due to PM runtime series. We > > have had this report earlier with the same stacktrace as well in earlier > > kernels which didnt have PM runtime support: > > > But, it seems like there is another race condition in this code > > resulting in this issue again. > > > > I have posted my analysis with the patch here as a RFC y'day: > > > > https://patchwork.freedesktop.org/patch/581758/ > > > > I missed CCing you and Bjorn on the RFC but when I post it as a patch > > either today/tomm, will CC you both. > > Ok, thanks. I'll take a closer look at that next week. It's not clear to > me what that race looks like after reading the commit message. It would > be good to have some more details in there (e.g. the sequence of events > that breaks the state machine) and some way to reproduce the issue > reliably. I was able to reproduce the reset with some of my debug printks in place after reapplying the reverted hpd notify change so I have an explanation for (one of) the ways we can up in this state now. This does not match description of the problem in the fix linked to above and I don't think that patch solves the underlying issue even if it may make the race window somewhat smaller. More details below. > > > We now also have Bjorn's call trace from a different Qualcomm platform > > > triggering an unclocked access on disconnect, something which would > > > trigger a reset by the hypervisor on sc8280xp platforms like the X13s: > > > > [ 2030.379583] Kernel panic - not syncing: Asynchronous SError Interrupt > > > [ 2030.379586] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted > > > 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219 > > > [ 2030.379590] Hardware name: Qualcomm Technologies, Inc. Robotics > > > RB3gen2 (DT) > > > [ 2030.379592] Workqueue: events output_poll_execute [drm_kms_helper] > > > [ 2030.379642] Call trace: > > > > [ 2030.379722] wait_for_completion_timeout+0x14/0x20 > > > [ 2030.379727] dp_ctrl_push_idle+0x34/0x8c [msm] > > > [ 2030.379844] dp_bridge_atomic_disable+0x18/0x24 [msm] > > > [ 2030.379959] drm_atomic_bridge_chain_disable+0x6c/0xb4 [drm] > > > [ 2030.380150] drm_atomic_helper_commit_modeset_disables+0x174/0x57c > > > [drm_kms_helper] > > > [ 2030.380200] msm_atomic_commit_tail+0x1b4/0x474 [msm] > > > [ 2030.380316] commit_tail+0xa4/0x158 [drm_kms_helper] > > > [ 2030.380369] drm_atomic_helper_commit+0x24c/0x26c [drm_kms_helper] > > > [ 2030.380418] drm_atomic_commit+0xa8/0xd4 [drm] > > > [ 2030.380529] drm_client_modeset_commit_atomic+0x16c/0x244 [drm] > > > [ 2030.380641] drm_client_modeset_commit_locked+0x50/0x168 [drm] > > > [ 2030.380753] drm_client_modeset_commit+0x2c/0x54 [drm] > > > [ 2030.380865] __drm_fb_helper_restore_fbdev_mode_unlocked+0x60/0xa4 > > > [drm_kms_helper] > > > [ 2030.380915] drm_fb_helper_hotplug_event+0xe0/0xf4 [drm_kms_helper] > > > [ 2030.380965] msm_fbdev_client_hotplug+0x28/0xc8 [msm] > > > [ 2030.381081] drm_client_dev_hotplug+0x94/0x118 [drm] > > > [ 2030.381192] output_poll_execute+0x214/0x26c [drm_kms_helper] > > > [ 2030.381241] process_scheduled_works+0x19c/0x2cc > > > [ 2030.381249] worker_thread+0x290/0x3cc > > > [ 2030.381255] kthread+0xfc/0x184 > > > [ 2030.381260] ret_from_fork+0x10/0x20 > > > > > > The above could happen if the convoluted hotplug state machine breaks > > > down so that the device is runtime suspended before > > > dp_bridge_atomic_disable() is called. > > > Yes, state machine got broken and I have explained how in the commit > > text of the RFC. But its not necessarily due to PM runtime but a > > sequence of events happening from userspace exposing this breakage. > > After looking at this some more today, I agree with you. The > observations I've reported are consistent with what would happen if the > link clock is disabled when dp_bridge_atomic_disable() is called. > > That clock is disabled in dp_bridge_atomic_post_disable() before runtime > suspending, but perhaps there are some further paths that can end up > disabling it. Turns out there are paths like that and we hit those more often before reverting e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify().
Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1
On Fri, Mar 08, 2024 at 09:50:17AM -0800, Abhinav Kumar wrote: > On 3/8/2024 4:43 AM, Johan Hovold wrote: > For this last remaining reset with the stacktrace you have mentioned > below, I do not think this was introduced due to PM runtime series. We > have had this report earlier with the same stacktrace as well in earlier > kernels which didnt have PM runtime support: > But, it seems like there is another race condition in this code > resulting in this issue again. > > I have posted my analysis with the patch here as a RFC y'day: > > https://patchwork.freedesktop.org/patch/581758/ > > I missed CCing you and Bjorn on the RFC but when I post it as a patch > either today/tomm, will CC you both. Ok, thanks. I'll take a closer look at that next week. It's not clear to me what that race looks like after reading the commit message. It would be good to have some more details in there (e.g. the sequence of events that breaks the state machine) and some way to reproduce the issue reliably. > > We now also have Bjorn's call trace from a different Qualcomm platform > > triggering an unclocked access on disconnect, something which would > > trigger a reset by the hypervisor on sc8280xp platforms like the X13s: > > [ 2030.379583] Kernel panic - not syncing: Asynchronous SError Interrupt > > [ 2030.379586] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted > > 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219 > > [ 2030.379590] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 > > (DT) > > [ 2030.379592] Workqueue: events output_poll_execute [drm_kms_helper] > > [ 2030.379642] Call trace: > > [ 2030.379722] wait_for_completion_timeout+0x14/0x20 > > [ 2030.379727] dp_ctrl_push_idle+0x34/0x8c [msm] > > [ 2030.379844] dp_bridge_atomic_disable+0x18/0x24 [msm] > > [ 2030.379959] drm_atomic_bridge_chain_disable+0x6c/0xb4 [drm] > > [ 2030.380150] drm_atomic_helper_commit_modeset_disables+0x174/0x57c > > [drm_kms_helper] > > [ 2030.380200] msm_atomic_commit_tail+0x1b4/0x474 [msm] > > [ 2030.380316] commit_tail+0xa4/0x158 [drm_kms_helper] > > [ 2030.380369] drm_atomic_helper_commit+0x24c/0x26c [drm_kms_helper] > > [ 2030.380418] drm_atomic_commit+0xa8/0xd4 [drm] > > [ 2030.380529] drm_client_modeset_commit_atomic+0x16c/0x244 [drm] > > [ 2030.380641] drm_client_modeset_commit_locked+0x50/0x168 [drm] > > [ 2030.380753] drm_client_modeset_commit+0x2c/0x54 [drm] > > [ 2030.380865] __drm_fb_helper_restore_fbdev_mode_unlocked+0x60/0xa4 > > [drm_kms_helper] > > [ 2030.380915] drm_fb_helper_hotplug_event+0xe0/0xf4 [drm_kms_helper] > > [ 2030.380965] msm_fbdev_client_hotplug+0x28/0xc8 [msm] > > [ 2030.381081] drm_client_dev_hotplug+0x94/0x118 [drm] > > [ 2030.381192] output_poll_execute+0x214/0x26c [drm_kms_helper] > > [ 2030.381241] process_scheduled_works+0x19c/0x2cc > > [ 2030.381249] worker_thread+0x290/0x3cc > > [ 2030.381255] kthread+0xfc/0x184 > > [ 2030.381260] ret_from_fork+0x10/0x20 > > > > The above could happen if the convoluted hotplug state machine breaks > > down so that the device is runtime suspended before > > dp_bridge_atomic_disable() is called. > Yes, state machine got broken and I have explained how in the commit > text of the RFC. But its not necessarily due to PM runtime but a > sequence of events happening from userspace exposing this breakage. After looking at this some more today, I agree with you. The observations I've reported are consistent with what would happen if the link clock is disabled when dp_bridge_atomic_disable() is called. That clock is disabled in dp_bridge_atomic_post_disable() before runtime suspending, but perhaps there are some further paths that can end up disabling it. > > For some reason, possibly due to unrelated changes in timing, possibly > > after the hotplug revert, I am no longer able to reproduce the reset > > with 6.8-rc7 on the X13s. > I do not know how the hotplug revert fixed this stack because I think > this can still happen. Thanks for confirming. > > I am however able to get the hotplug state machine to leak > > runtime PM reference counts which prevents it from ever being suspended > > (e.g. by disconnecting slowly so that we get multiple connect and > > disconnect events). This can manifest itself as a hotplug event which is > > processed after disconnecting the display: > > > > [drm:dp_panel_read_sink_caps [msm]] *ERROR* read dpcd failed -110 > hmm ... this is another new report. I was not aware of this till this > instance. We would like to debug this issue too as we have also not seen > this one before. > > But, I am suspicious
Re: drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1
Hi Abhinav, Rob, Dmitry and Kuogee, On Tue, Feb 27, 2024 at 02:33:48PM +0100, Johan Hovold wrote: > Since 6.8-rc1 I have seen (and received reports) of hard resets of the > Lenovo ThinkPad X13s after connecting and disconnecting an external > display. > > I have triggered this on a simple disconnect while in a VT console, but > also when stopping Xorg after having repeatedly connected and > disconnected an external display and tried to enable it using xrandr. > > In the former case, the last (custom debug) messages printed over an SSH > session were once: > > [ 948.416358] usb 5-1: USB disconnect, device number 3 > [ 948.443496] msm_dpu ae01000.display-controller: > msm_fbdev_client_hotplug > [ 948.443723] msm-dp-display ae98000.displayport-controller: > dp_power_clk_enable - type = 1, enable = 0 > [ 948.443872] msm-dp-display ae98000.displayport-controller: > dp_ctrl_phy_exit > [ 948.445117] msm-dp-display ae98000.displayport-controller: > dp_ctrl_phy_exit - done > > and then the hypervisor resets the machine. Has there been any progress on tracking down the reset on disconnect issue? I was expecting you to share your findings here so that we can determine if the rest of the runtime PM series needs to be reverted or not before 6.8 is released (possibly on Sunday). I really did not want to spend more on this driver than I already have this cycle, but the lack of (visible) progress has again forced me to do so. It's quite likely that the resets are indeed a regression caused by the runtime PM series as the bus clocks were not disabled on disconnect before that one was merged in 6.8-rc1. In a VT console, the device is now runtime suspended immediately on disconnect, while in X, it currently remains active until X is killed, which is consistent with what I reported above. We now also have Bjorn's call trace from a different Qualcomm platform triggering an unclocked access on disconnect, something which would trigger a reset by the hypervisor on sc8280xp platforms like the X13s: [ 2030.379417] SError Interrupt on CPU0, code 0xbe00 -- SError [ 2030.379425] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219 [ 2030.379430] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT) [ 2030.379435] Workqueue: events output_poll_execute [drm_kms_helper] [ 2030.379495] pstate: 604000c5 (nZCv daIF +PAN -UAO -TCO -DIT -SSBS BTYPE=--) [ 2030.379501] pc : el1_interrupt+0x28/0xc0 [ 2030.379514] lr : el1h_64_irq_handler+0x18/0x24 [ 2030.379520] sp : 80008129b700 [ 2030.379523] x29: 80008129b700 x28: 7a0a031aa200 x27: 7a0af768c3c0 [ 2030.379530] x26: x25: 7a0a024bb568 x24: 7a0a031aa200 [ 2030.379537] x23: 6045 x22: d2a4a10c871c x21: 80008129b8a0 [ 2030.379544] x20: d2a4a2f0 x19: 80008129b750 x18: [ 2030.379549] x17: x16: d2a4a10c21d4 x15: [ 2030.379555] x14: x13: 0001acf6 x12: [ 2030.379560] x11: 0001 x10: 7a0af623f680 x9 : 00010001 [ 2030.379565] x8 : 00c0 x7 : x6 : 003f [ 2030.379570] x5 : 7a0a003c6a70 x4 : 001f x3 : d2a48d6722dc [ 2030.379576] x2 : 0002 x1 : d2a4a2f0 x0 : 80008129b750 [ 2030.379583] Kernel panic - not syncing: Asynchronous SError Interrupt [ 2030.379586] CPU: 0 PID: 239 Comm: kworker/0:2 Not tainted 6.8.0-rc4-next-20240216-00015-gc937d3c43ffe-dirty #219 [ 2030.379590] Hardware name: Qualcomm Technologies, Inc. Robotics RB3gen2 (DT) [ 2030.379592] Workqueue: events output_poll_execute [drm_kms_helper] [ 2030.379642] Call trace: [ 2030.379644] dump_backtrace+0xec/0x108 [ 2030.379654] show_stack+0x18/0x24 [ 2030.379659] dump_stack_lvl+0x40/0x84 [ 2030.379664] dump_stack+0x18/0x24 [ 2030.379668] panic+0x130/0x34c [ 2030.379673] nmi_panic+0x44/0x90 [ 2030.379679] arm64_serror_panic+0x68/0x74 [ 2030.379683] do_serror+0xc4/0xcc [ 2030.379686] el1h_64_error_handler+0x34/0x4c [ 2030.379692] el1h_64_error+0x64/0x68 [ 2030.379696] el1_interrupt+0x28/0xc0 [ 2030.379700] el1h_64_irq_handler+0x18/0x24 [ 2030.379706] el1h_64_irq+0x64/0x68 [ 2030.379710] _raw_spin_unlock_irq+0x20/0x48 [ 2030.379718] wait_for_common+0xb4/0x16c [ 2030.379722] wait_for_completion_timeout+0x14/0x20 [ 2030.379727] dp_ctrl_push_idle+0x34/0x8c [msm] [ 2030.379844] dp_bridge_atomic_disable+0x18/0x24 [msm] [ 2030.379959] drm_atomic_bridge_chain_disable+0x6c/0xb4 [drm] [ 2030.380150] drm_atomic_helper_commit_modeset_disables+0x174/0x57c [drm_kms_helper] [ 2030.380200] msm_atomic_commit_tail+0x1b4/0x474 [msm] [ 2030.380316] commit_tail+0xa4/0x158 [drm_kms_helper] [ 2030.380369] drm_atomic_helper_commit+0x24c/0x26c [drm_kms_helper] [ 2030.380418] drm_atomic_c
Re: [PATCH] Revert "drm/msm/dp: use drm_bridge_hpd_notify() to report HPD status changes"
On Wed, Feb 28, 2024 at 10:10:10AM -0800, Abhinav Kumar wrote: > On 2/28/2024 5:28 AM, Johan Hovold wrote: > > This is a fix for a user-visible regression that was reported formally > > two weeks ago and informally (e.g. to you) soon after rc1 came out, and > > which now also has an identified cause and an analysis of the problem. > > And we're at rc6 so there should be no reason to delay fixing this (e.g. > > even if you want to run some more tests for a couple of days). > > Yup, we landed it in msm-fixes now, so this will go as a late -fixes PR > for 6.8. Perfect, thanks! Johan
Re: [PATCH] Revert "drm/msm/dp: use drm_bridge_hpd_notify() to report HPD status changes"
On Wed, Feb 28, 2024 at 01:08:04PM +0200, Dmitry Baryshkov wrote: > On Wed, 28 Feb 2024 at 11:50, Johan Hovold wrote: > > > > On Tue, Feb 27, 2024 at 02:11:56PM -0800, Abhinav Kumar wrote: > > > On 2/27/2024 2:08 PM, Dmitry Baryshkov wrote: > > > > This reverts commit e467e0bde881 ("drm/msm/dp: use > > > > drm_bridge_hpd_notify() to report HPD status changes"). > > > > > > > > The commit changed the way how the MSM DP driver communicates > > > > HPD-related events to the userspace. The mentioned commit made some of > > > > the HPD events being reported earlier. This way userspace starts poking > > > > around. It interacts in a bad way with the dp_bridge_detect and the > > > > driver's state machine, ending up either with the very long delays > > > > during hotplug detection or even inability of the DP driver to report > > > > the display as connected. > > > > > > > > A proper fix will involve redesigning of the HPD handling in the MSM DP > > > > driver. It is underway, but it will be intrusive and can not be thought > > > > about as a simple fix for the issue. Thus, revert the offending commit. > > > > > > Yes, for fixing this on 6.9 I am fine with this. > > > > Since this is a regression in 6.8-rc1, I hope you meant to say 6.8 here? > > In the worst case it will land to 6.8.x via the stable tree process. This is a fix for a user-visible regression that was reported formally two weeks ago and informally (e.g. to you) soon after rc1 came out, and which now also has an identified cause and an analysis of the problem. And we're at rc6 so there should be no reason to delay fixing this (e.g. even if you want to run some more tests for a couple of days). Johan
Re: [PATCH] Revert "drm/msm/dp: use drm_bridge_hpd_notify() to report HPD status changes"
On Tue, Feb 27, 2024 at 02:11:56PM -0800, Abhinav Kumar wrote: > On 2/27/2024 2:08 PM, Dmitry Baryshkov wrote: > > This reverts commit e467e0bde881 ("drm/msm/dp: use > > drm_bridge_hpd_notify() to report HPD status changes"). > > > > The commit changed the way how the MSM DP driver communicates > > HPD-related events to the userspace. The mentioned commit made some of > > the HPD events being reported earlier. This way userspace starts poking > > around. It interacts in a bad way with the dp_bridge_detect and the > > driver's state machine, ending up either with the very long delays > > during hotplug detection or even inability of the DP driver to report > > the display as connected. > > > > A proper fix will involve redesigning of the HPD handling in the MSM DP > > driver. It is underway, but it will be intrusive and can not be thought > > about as a simple fix for the issue. Thus, revert the offending commit. > > Yes, for fixing this on 6.9 I am fine with this. Since this is a regression in 6.8-rc1, I hope you meant to say 6.8 here? > I hope there were not other changes which were built on top of this. So > it will be better if we retest internal HPD case as well with this. > > We will do that in a day or two and give Tested-by. Johan
Re: [PATCH] Revert "drm/msm/dp: use drm_bridge_hpd_notify() to report HPD status changes"
On Wed, Feb 28, 2024 at 12:08:08AM +0200, Dmitry Baryshkov wrote: > This reverts commit e467e0bde881 ("drm/msm/dp: use > drm_bridge_hpd_notify() to report HPD status changes"). > > The commit changed the way how the MSM DP driver communicates > HPD-related events to the userspace. The mentioned commit made some of > the HPD events being reported earlier. This way userspace starts poking > around. It interacts in a bad way with the dp_bridge_detect and the > driver's state machine, ending up either with the very long delays > during hotplug detection or even inability of the DP driver to report > the display as connected. > > A proper fix will involve redesigning of the HPD handling in the MSM DP > driver. It is underway, but it will be intrusive and can not be thought > about as a simple fix for the issue. Thus, revert the offending commit. > > Fixes: e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify() to report HPD > status changes") > Link: https://gitlab.freedesktop.org/drm/msm/-/issues/50 > Reported-by: Johan Hovold > Link: https://lore.kernel.org/r/zd3ypgmrprxv-...@hovoldconsulting.com/ > Signed-off-by: Dmitry Baryshkov Tested-by: Johan Hovold
drm/msm: DisplayPort hard-reset on hotplug regression in 6.8-rc1
Hi, Since 6.8-rc1 I have seen (and received reports) of hard resets of the Lenovo ThinkPad X13s after connecting and disconnecting an external display. I have triggered this on a simple disconnect while in a VT console, but also when stopping Xorg after having repeatedly connected and disconnected an external display and tried to enable it using xrandr. In the former case, the last (custom debug) messages printed over an SSH session were once: [ 948.416358] usb 5-1: USB disconnect, device number 3 [ 948.443496] msm_dpu ae01000.display-controller: msm_fbdev_client_hotplug [ 948.443723] msm-dp-display ae98000.displayport-controller: dp_power_clk_enable - type = 1, enable = 0 [ 948.443872] msm-dp-display ae98000.displayport-controller: dp_ctrl_phy_exit [ 948.445117] msm-dp-display ae98000.displayport-controller: dp_ctrl_phy_exit - done and then the hypervisor resets the machine. Hotplug in Xorg seems to work worse than it did with 6.7, which also had some issues. Connecting a display once seems to work fine, but trying to re-enable a reconnected display using xrandr sometimes does not work at all, while with 6.7 it usually worked on the second xrandr execution. xrandr reports the reconnected display as disconnected: Screen 0: minimum 320 x 200, current 1920 x 1200, maximum 5120 x 4096 eDP-1 connected primary 1920x1200+0+0 (normal left inverted right x axis y axis) 286mm x 178mm 1920x1200 60.03*+ 1600x1200 60.00 DP-1 disconnected (normal left inverted right x axis y axis) DP-2 disconnected 1920x1200+0+0 (normal left inverted right x axis y axis) 0mm x 0mm 1920x1200 (0x40c) 154.000MHz +HSync -VSync h: width 1920 start 1968 end 2000 total 2080 skew0 clock 74.04KHz v: height 1200 start 1203 end 1209 total 1235 clock 59.95Hz Running 'xrandr --output DP-2 --auto' 2-3 times makes xrandr report the display as connected, but the display is still blank (unlike with 6.7). A few times after having exercised hotplug this way, the machine hard resets when Xorg is later stopped. Once I saw the following log messages on an SSH session but they may not have been printed directly before the hard reset: [ 214.555781] [drm:dpu_encoder_phys_vid_wait_for_commit_done:492] [dpu error]vblank timeout [ 214.555843] [drm:dpu_kms_wait_for_commit_done:483] [dpu error]wait for commit done returned -110 Note that this appears to be unrelated to the recently fixed Qualcomm power domain driver bug which can trigger similar resets when initialising the display subsystem on boot. Specifically, I have triggered the hotplug resets described above also with the fix applied. [1] Reverting commit e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify() to report HPD status changes") which fixes the related VT console regression does not seem to make any difference. [2] Daniel Thompson reports that reverting the whole runtime PM series appears to make the hard resets he has seen with DisplayPort hotplug go away however: https://lore.kernel.org/lkml/1701472789-25951-1-git-send-email-quic_khs...@quicinc.com/ So for now, let's assume that these regressions were also introduced (or triggered) by commit 5814b8bf086a ("drm/msm/dp: incorporate pm_runtime framework into DP driver"). Johan [1] https://lore.kernel.org/lkml/20240226-rpmhpd-enable-corner-fix-v1-1-68c004cec...@quicinc.com/ [2] https://lore.kernel.org/lkml/zd3ypgmrprxv-...@hovoldconsulting.com/ #regzbot introduced: 5814b8bf086a
drm/msm: VT console DisplayPort regression in 6.8-rc1
Hi, Since 6.8-rc1 the VT console is no longer mirrored on an external display on coldplug or hotplug on the Lenovo ThinkPad X13s. The hotplug notification appears to be generated immediately but it is no longer forwarded or processed correctly: [ 22.578434] msm_dpu ae01000.display-controller: msm_fbdev_client_hotplug [ 22.953161] msm_dpu ae01000.display-controller: msm_fbdev_client_hotplug [ 23.095122] msm_dpu ae01000.display-controller: msm_fbdev_client_hotplug [ 24.185683] msm_dpu ae01000.display-controller: msm_fbdev_client_hotplug [ 24.186197] msm-dp-display ae98000.displayport-controller: dp_pm_runtime_resume [ 24.188098] msm-dp-display ae98000.displayport-controller: dp_ctrl_phy_init [ 24.191805] msm_dpu ae01000.display-controller: msm_fbdev_client_hotplug [ 24.275904] [drm-dp] dp_ctrl_on_link: dp_ctrl_on_link - drm_dev = [ 24.276474] [drm-dp] dp_ctrl_enable_mainlink_clocks: dp_ctrl_enable_mainlink_clocks - drm_dev = as the external display remains off/blank. I've verified that reverting commit e467e0bde881 ("drm/msm/dp: use drm_bridge_hpd_notify() to report HPD status changes") fixes "both" issues. I've previously reported this here: https://gitlab.freedesktop.org/drm/msm/-/issues/50 Note that a couple of times the display was eventually mirrored after a very long timeout, but this doesn't seem to always be the case. Johan #regzbot introduced: e467e0bde881
Re: [PATCH 0/6] soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free
On Fri, Feb 23, 2024 at 03:38:13PM +0100, Neil Armstrong wrote: > On 23/02/2024 15:21, Johan Hovold wrote: > > But it is *not* standalone as I tried to explain above. > > > > So you have to drop it again as the later patches depend on it and > > cannot be merged (through a different tree) without it. > > drm-misc branches cannot be rebased, it must be reverted, but it can still be > applied > on drm-misc-next and I'll send a revert patch for drm-misc-fixes if needed, > not a big deal. > > > I thought you had all the acks you needed to take this through drm-misc, > > but we can wait a bit more if necessary (and there's no rush to get the > > first one in). > > If you want it to be in v6.9, it's too late since the last drm-misc-next PR > has been sent > yesterday > (https://cgit.freedesktop.org/drm/drm-misc/tag/?h=drm-misc-next-2024-02-22) > > Please ping Thomas or Maxime, perhaps it's not too late since the > drm-misc-next tree > really closes on sunday. I don't want this in 6.9, this is needed for *6.8* as this fixes a DRM regression in 6.8-rc1 that breaks the display on machines like the X13s. If you guys can't sort this out in time, then perhaps Bjorn can take this through the Qualcomm tree instead (with DRM acks). But again, this is fixing a severe *regression* in 6.8-rc1. It can not wait for 6.9. Johan
Re: [PATCH 0/6] soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free
On Fri, Feb 23, 2024 at 04:18:08PM +0200, Dmitry Baryshkov wrote: > On Fri, 23 Feb 2024 at 15:52, Neil Armstrong > wrote: > > On 23/02/2024 13:51, Johan Hovold wrote: > > > On Fri, Feb 23, 2024 at 12:03:10PM +0100, Neil Armstrong wrote: > > >> On 23/02/2024 12:02, Neil Armstrong wrote: > > >>> Thanks, Applied to https://anongit.freedesktop.org/git/drm/drm-misc.git > > >>> (drm-misc-fixes) > > >>> > > >>> [1/6] drm/bridge: aux-hpd: fix OF node leaks > > >>> > > >>> https://cgit.freedesktop.org/drm/drm-misc/commit/?id=9ee485bdda68d6d3f5728cbe3150eb9013d7d22b > > >>> [2/6] drm/bridge: aux-hpd: separate allocation and registration > > >>> (no commit info) > > >>> [3/6] soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free > > >>> (no commit info) > > >>> [4/6] soc: qcom: pmic_glink: Fix boot when QRTR=m > > >>> (no commit info) > > >>> [5/6] phy: qcom-qmp-combo: fix drm bridge registration > > >>> (no commit info) > > >>> [6/6] phy: qcom-qmp-combo: fix type-c switch registration > > >>> (no commit info) > > >>> > > >> > > >> To clarify, I only applied patch 1 to drm-misc-fixes > > > > > > Ok, but can you please not do that? :) > > > > > > These patches should go in through the same tree to avoid conflicts. > > > > > > I discussed this with Bjorn and Dmitry the other day and the conclusion > > > was that it was easiest to take all of these through DRM. > > > > I only applied patch 1, which is a standalone fix and goes into a separate > > tree, > > for the next patches it would be indeed simpler for them to go via drm-misc > > when > > they are properly acked. > > I think PHY patches can go through a usual route (phy/next or > phy/fixes). They can, but I've explicitly asked Vinod to ack them so that they can go in with the rest of the series through DRM. They also fix a regression that came in through DRM in 6.8-rc1 (the bridge rework which started registering child devices) so it makes sense to also route the fix the same way. And to do it for this cycle. > For patches 3 and 4 I'd need an ack from Bjorn to merge > them through drm-misc-next or drm-misc-fixes. You have Bjorn's ack. He's reviewed all the patches for this purpose and we discussed this in person two days ago. And, again, this has to go in for *this* cycle. You broke the display on the X13s and other machines so this cannot wait for 6.9. Johan
Re: [PATCH 0/6] soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free
On Fri, Feb 23, 2024 at 02:52:28PM +0100, Neil Armstrong wrote: > On 23/02/2024 13:51, Johan Hovold wrote: > > On Fri, Feb 23, 2024 at 12:03:10PM +0100, Neil Armstrong wrote: > >> On 23/02/2024 12:02, Neil Armstrong wrote: > >>> Thanks, Applied to https://anongit.freedesktop.org/git/drm/drm-misc.git > >>> (drm-misc-fixes) > >>> > >>> [1/6] drm/bridge: aux-hpd: fix OF node leaks > >>> > >>> https://cgit.freedesktop.org/drm/drm-misc/commit/?id=9ee485bdda68d6d3f5728cbe3150eb9013d7d22b > >>> [2/6] drm/bridge: aux-hpd: separate allocation and registration > >>> (no commit info) > >>> [3/6] soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free > >>> (no commit info) > >>> [4/6] soc: qcom: pmic_glink: Fix boot when QRTR=m > >>> (no commit info) > >>> [5/6] phy: qcom-qmp-combo: fix drm bridge registration > >>> (no commit info) > >>> [6/6] phy: qcom-qmp-combo: fix type-c switch registration > >>> (no commit info) > >>> > >> > >> To clarify, I only applied patch 1 to drm-misc-fixes > > > > Ok, but can you please not do that? :) > > > > These patches should go in through the same tree to avoid conflicts. > > > > I discussed this with Bjorn and Dmitry the other day and the conclusion > > was that it was easiest to take all of these through DRM. > > I only applied patch 1, which is a standalone fix and goes into a separate > tree, > for the next patches it would be indeed simpler for them to go via drm-misc > when > they are properly acked. But it is *not* standalone as I tried to explain above. So you have to drop it again as the later patches depend on it and cannot be merged (through a different tree) without it. I thought you had all the acks you needed to take this through drm-misc, but we can wait a bit more if necessary (and there's no rush to get the first one in). Johan
Re: [PATCH 0/6] soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free
On Fri, Feb 23, 2024 at 12:03:10PM +0100, Neil Armstrong wrote: > On 23/02/2024 12:02, Neil Armstrong wrote: > > Hi, > > > > On Sat, 17 Feb 2024 16:02:22 +0100, Johan Hovold wrote: > >> Starting with 6.8-rc1 the internal display sometimes fails to come up on > >> machines like the Lenovo ThinkPad X13s and the logs indicate that this > >> is due to a regression in the DRM subsystem [1]. > >> > >> This series fixes a race in the pmic_glink_altmode driver which was > >> exposed / triggered by the transparent DRM bridges rework that went into > >> 6.8-rc1 and that manifested itself as a bridge failing to attach and > >> sometimes triggering a NULL-pointer dereference. > >> > >> [...] > > > > Thanks, Applied to https://anongit.freedesktop.org/git/drm/drm-misc.git > > (drm-misc-fixes) > > > > [1/6] drm/bridge: aux-hpd: fix OF node leaks > > > > https://cgit.freedesktop.org/drm/drm-misc/commit/?id=9ee485bdda68d6d3f5728cbe3150eb9013d7d22b > > [2/6] drm/bridge: aux-hpd: separate allocation and registration > >(no commit info) > > [3/6] soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free > >(no commit info) > > [4/6] soc: qcom: pmic_glink: Fix boot when QRTR=m > >(no commit info) > > [5/6] phy: qcom-qmp-combo: fix drm bridge registration > >(no commit info) > > [6/6] phy: qcom-qmp-combo: fix type-c switch registration > >(no commit info) > > > > To clarify, I only applied patch 1 to drm-misc-fixes Ok, but can you please not do that? :) These patches should go in through the same tree to avoid conflicts. I discussed this with Bjorn and Dmitry the other day and the conclusion was that it was easiest to take all of these through DRM. With Vinod acking the PHY patches, I believe you have what you need to merge the whole series now? Johan
Re: [PATCH 2/6] drm/bridge: aux-hpd: separate allocation and registration
On Thu, Feb 22, 2024 at 10:57:07PM +0200, Dmitry Baryshkov wrote: > On Sat, 17 Feb 2024 at 17:03, Johan Hovold wrote: > > > > Combining allocation and registration is an anti-pattern that should be > > avoided. Add two new functions for allocating and registering an dp-hpd > > bridge with a proper 'devm' prefix so that it is clear that these are > > device managed interfaces. > > > > devm_drm_dp_hpd_bridge_alloc() > > devm_drm_dp_hpd_bridge_add() > > > > The new interface will be used to fix a use-after-free bug in the > > Qualcomm PMIC GLINK driver and may prevent similar issues from being > > introduced elsewhere. > > > > The existing drm_dp_hpd_bridge_register() is reimplemented using the > > above and left in place for now. > > > > Signed-off-by: Johan Hovold > > Reviewed-by: Dmitry Baryshkov Thanks for reviewing. > Minor nit below. > > diff --git a/include/drm/bridge/aux-bridge.h > > b/include/drm/bridge/aux-bridge.h > > index c4c423e97f06..4453906105ca 100644 > > --- a/include/drm/bridge/aux-bridge.h > > +++ b/include/drm/bridge/aux-bridge.h > > @@ -9,6 +9,8 @@ > > > > #include > > > > +struct auxiliary_device; > > + > > #if IS_ENABLED(CONFIG_DRM_AUX_BRIDGE) > > int drm_aux_bridge_register(struct device *parent); > > #else > > @@ -19,10 +21,23 @@ static inline int drm_aux_bridge_register(struct device > > *parent) > > #endif > > > > #if IS_ENABLED(CONFIG_DRM_AUX_HPD_BRIDGE) > > +struct auxiliary_device *devm_drm_dp_hpd_bridge_alloc(struct device > > *parent, struct device_node *np); > > +int devm_drm_dp_hpd_bridge_add(struct device *dev, struct auxiliary_device > > *adev); > > I had a pretty close idea during prototyping, but I ended up doing it > as a single function for the following reasons: > > First, this exports the implementation detail that internally the code > uses an aux device. That's not an issue. The opposite, with interfaces trying to do too much and hide details from the developers so that they can no longer reason about what is going on, is a real problem though. > Also, by exporting the aux device the code becomes less type-safe. By > mistake one can call devm_drm_dp_hpd_bridge_add() on any aux device, > which is not necessarily the HPD bridge. No. First, that is currently not even an issue either as the registration interface is safe to use with any aux device. Second, if you cared about about type-safety you wouldn't have used a struct device pointer for drm_aux_hpd_bridge_notify() which you back cast to an aux device. > I'd prefer to see an opaque device-specific structure instead. WDYT? That should have been there from the start. But I'm not interested in cleaning up this mess beyond what is minimally required to fix the regressions it caused. This can be reworked for 6.9 or later. > > struct device *drm_dp_hpd_bridge_register(struct device *parent, > > struct device_node *np); > > void drm_aux_hpd_bridge_notify(struct device *dev, enum > > drm_connector_status status); Johan
Re: [PATCH 2/6] drm/bridge: aux-hpd: separate allocation and registration
On Wed, Feb 21, 2024 at 08:06:41PM -0600, Bjorn Andersson wrote: > On Sat, Feb 17, 2024 at 04:02:24PM +0100, Johan Hovold wrote: > > diff --git a/drivers/gpu/drm/bridge/aux-hpd-bridge.c > > b/drivers/gpu/drm/bridge/aux-hpd-bridge.c > [..] > > +/** > > + * devm_drm_dp_hpd_bridge_add - register a HDP DisplayPort bridge > > kernel-doc wants () after function names. I don't think that's required for the symbol name here even if some subsystems (drivers) use it. > > + * @dev: struct device to tie registration lifetime to > > + * @adev: bridge auxiliary device to be registered > > + * > > + * Returns: zero on success or a negative errno > > and "Return:" without the 's'. This was a mistake however. Perhaps whoever applies this can drop it, or I can send a v2. Side note: Looks like there are more instances with an 's' than without under driver/gpu/drm... > This could however be done in a separate patch, as the file is already > wrong in this regard. > > Reviewed-by: Bjorn Andersson Thanks for reviewing. Johan
Re: drm/msm: Second DisplayPort regression in 6.8-rc1
On Tue, Feb 20, 2024 at 01:19:54PM -0800, Abhinav Kumar wrote: > On 2/19/2024 2:41 AM, Johan Hovold wrote: > > It seems my initial suspicion that at least some of these regressions > > were related to the runtime PM work was correct. The hard resets happens > > when the DP controller is runtime suspended after being probed: > > [ 17.074925] bus: 'platform': __driver_probe_device: matched device > > aea.displayport-controller with driver msm-dp-display > > [Starting Network Time Synchronization... > > [ 17.112000] msm-dp-display aea.displayport-controller: > > dp_display_probe - populate aux bus > > [ 17.125208] msm-dp-display aea.displayport-controller: > > dp_pm_runtime_resume > > Starting Record System Boot/Shutdown in UTMP... > > Starting Virtual Console Setup... > > [ OK ] Finished Load/Save Screen Backlight Brightness of > > backlight:backlight. > > [ 17.197909] msm-dp-display aea.displayport-controller: > > dp_pm_runtime_suspend > > [ 17.198079] probe of aea0Format: Log Type - Time(microsec) - Message - > > Optional Info > > Log Type: B - Since Boot(Power On Reset), D - Delta, S - Statistic > > S - QC_IMAGE_VERSION_STRING=BOOT.MXF.1.1-00470-MAKENA-1 > > S - IMAGE_VARIANT_STRING=SocMakenaWP > > S - OEM_IMAGE_VERSION_STRING=crm-ubuntu92 > > > >< machine is reset by hypervisor > > > > > Presumably the reset happens when controller is being shut down while > > still being used by the EFI framebuffer. > > I am not sure if we can conclude like that. Even if we shut off the > controller when the framebuffer was still being fetched that should only > cause a blank screen and not a reset because we really don't trigger a > new register write / read while its fetching so as such there is no new > hardware access. It specifically looks like the reset happens when shutting down the PHY, that is, the call to dp_display_host_phy_exit(dp) in dp_pm_runtime_suspend() never returns. That seems like more than a coincidence to me. > One thing I must accept is that there are two differences between > sc8280xp where we are hitting these resets and sc7180/sc7280 chromebooks > where we tested it more thoroughly without any such issues: > > 1) with the chromebooks we have depthcharge and not the QC UEFI. > > If we are suspecting a hand-off issue here, will it help if we try to > disable the display in EFI by using "fastboot oem select-display-panel > none" (assuming this is a fastboot enabled device) and see if you still > hit the reset issue? No, we don't have fastboot. But as I mentioned I still do see resets when I instrument the code to not shut down the display, which could indicate more than one issue here. > 2) chromebooks used "internal_hpd" whereas the pmic_glink method used in > the sc8280xp. > > I am still checking if there are any code paths in the eDP/DP driver > left exposed due to this difference with pm_runtime which can cause > this. I am wondering if some sort of drm tracing will help to narrow > down the reset point. > > > In the cases where the machines survives boot, the controller is never > > suspended. > > > > When investigating this I've also seen intermittent: > > > > [drm:dp_display_probe [msm]] *ERROR* device tree parsing failed > > So this error I think is because in dp_parser_parse() ---> > dp_parser_ctrl_res(), we also have a devm_phy_get(). > > This can return -EDEFER if the phy driver has not yet probed. > > I checked the other things inside dp_parser_parse(), others calls seem > to be purely DT parsing except this one. I think to avoid the confusion, > we should move devm_phy_get() outside of DT parsing into a separate call > or atleast add an error log inside devm_phy_get() failure below to > indicate that it deferred > > io->phy = devm_phy_get(&pdev->dev, "dp"); > if (IS_ERR(io->phy)) > return PTR_ERR(io->phy); > > If my hypothesis is correct on this, then this error log (even though > misleading) should be harmless for this issue because if we hit > DRM_ERROR("device tree parsing failed\n"); we will skip the > devm_pm_runtime_enable(). Yeah, this seems to be the case as boot appears to recover from this, so this may indeed be a probe deferral. Probe deferrals should not be logged as errors however, so the fix is not to add another error message but rather to suppress the current one (e.g. using dev_err_probe()). > > Has anyone given some thought to how the framebuffer handover is > > supposed to work? It seems we're currently just relying on luck with > > timing. Any comments to this? It seems we should not be shutting down (runtime suspend) the display during boot as can currently happen. Johan
Re: [PATCH 3/6] soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free
On Tue, Feb 20, 2024 at 11:55:57AM +0100, Markus Elfring wrote: > … > > Specifically, the dp-hpd bridge is currently registered before all > > resources have been acquired which means that it can also be > > deregistered on probe deferrals. > > > > In the meantime there is a race window where the new aux bridge driver > > (or PHY driver previously) may have looked up the dp-hpd bridge and > > stored a (non-reference-counted) pointer to the bridge which is about to > > be deallocated. > … > > +++ b/drivers/soc/qcom/pmic_glink_altmode.c > … > > @@ -454,7 +454,7 @@ static int pmic_glink_altmode_probe(struct > > auxiliary_device *adev, > > alt_port->index = port; > > INIT_WORK(&alt_port->work, pmic_glink_altmode_worker); > > > > - alt_port->bridge = drm_dp_hpd_bridge_register(dev, > > to_of_node(fwnode)); > > + alt_port->bridge = devm_drm_dp_hpd_bridge_alloc(dev, > > to_of_node(fwnode)); > > if (IS_ERR(alt_port->bridge)) { > > fwnode_handle_put(fwnode); > > return PTR_ERR(alt_port->bridge); > … > > The function call “fwnode_handle_put(fwnode)” is used in multiple if branches. > https://elixir.bootlin.com/linux/v6.8-rc5/source/drivers/soc/qcom/pmic_glink_altmode.c#L435 > > I suggest to add a jump target so that a bit of exception handling > can be better reused at the end of this function implementation. Markus, as people have told you repeatedly, just stop with these comments. You're not helping, in fact, you are actively harmful to the kernel community as you are wasting people's time. Johan
Re: [PATCH] drm/msm: Wire up tlb ops
On Thu, Feb 15, 2024 at 07:28:53AM -0800, Rob Clark wrote: > On Wed, Feb 14, 2024 at 11:34 PM Johan Hovold wrote: > > > > On Tue, Feb 13, 2024 at 09:23:40AM -0800, Rob Clark wrote: > > > From: Rob Clark > > > > > > The brute force iommu_flush_iotlb_all() was good enough for unmap, but > > > in some cases a map operation could require removing a table pte entry > > > to replace with a block entry. This also requires tlb invalidation. > > > Missing this was resulting an obscure iova fault on what should be a > > > valid buffer address. > > > > > > Thanks to Robin Murphy for helping me understand the cause of the fault. > > > > > > Cc: Robin Murphy > > > Fixes: b145c6e65eb0 ("drm/msm: Add support to create a local pagetable") > > > > Sounds like you're missing a > > > > Cc: sta...@vger.kernel.org > > > > here? Or is there some reason not to backport this fix (to 5.9 and later > > kernels)? > > No reason, I just expected the Fixes tag was sufficient No, you should still mark patches intended for stable with an explicit CC-stable tag (as per the documentation). The fact that Sasha and his AI tries to catch fixes which the author and maintainer failed to tag as a fallback should not be relied upon. It also makes it harder for the stable team and others to determine what the intention with a fix was. Johan
Re: [PATCH 1/6] drm/bridge: aux-hpd: fix OF node leaks
On Mon, Feb 19, 2024 at 06:48:30PM +0100, Markus Elfring wrote: > > The two device node references taken during allocation need to be > > dropped when the auxiliary device is freed. > … > > +++ b/drivers/gpu/drm/bridge/aux-hpd-bridge.c > … > > @@ -74,6 +75,8 @@ struct device *drm_dp_hpd_bridge_register(struct device > > *parent, > > > > ret = auxiliary_device_init(adev); > > if (ret) { > > + of_node_put(adev->dev.platform_data); > > + of_node_put(adev->dev.of_node); > > ida_free(&drm_aux_hpd_bridge_ida, adev->id); > > kfree(adev); > > return ERR_PTR(ret); > > The last two statements are also used in a previous if branch. > https://elixir.bootlin.com/linux/v6.8-rc5/source/drivers/gpu/drm/bridge/aux-hpd-bridge.c#L63 > > How do you think about to avoid such a bit of duplicate source code > by adding a label here? No, the current code is fine and what you are suggesting is in any case unrelated to this fix. If this function ever grows a third error path like that, I too would consider it however. Johan
Re: drm/msm: Second DisplayPort regression in 6.8-rc1
On Mon, Feb 19, 2024 at 11:41:41AM +0100, Johan Hovold wrote: > It seems my initial suspicion that at least some of these regressions > were related to the runtime PM work was correct. The hard resets happens > when the DP controller is runtime suspended after being probed: > [ 17.074925] bus: 'platform': __driver_probe_device: matched device > aea.displayport-controller with driver msm-dp-display > [ 17.112000] msm-dp-display aea.displayport-controller: > dp_display_probe - populate aux bus > [ 17.125208] msm-dp-display aea.displayport-controller: > dp_pm_runtime_resume > [ 17.197909] msm-dp-display aea.displayport-controller: > dp_pm_runtime_suspend > [ 17.198079] probe of aea0Format: Log Type - Time(microsec) - Message - > Optional Info > Log Type: B - Since Boot(Power On Reset), D - Delta, S - Statistic > S - QC_IMAGE_VERSION_STRING=BOOT.MXF.1.1-00470-MAKENA-1 > S - IMAGE_VARIANT_STRING=SocMakenaWP > S - OEM_IMAGE_VERSION_STRING=crm-ubuntu92 > > < machine is reset by hypervisor > > > Presumably the reset happens when controller is being shut down while > still being used by the EFI framebuffer. > > In the cases where the machines survives boot, the controller is never > suspended. > > When investigating this I've also seen intermittent: > > [drm:dp_display_probe [msm]] *ERROR* device tree parsing failed Note that there are further indications there may be more than one bug here too. I definitely see hard resets when dp_pm_runtime_suspend() is shutting down the eDP PHY, but there are occasional resets also if I instrument DP controller probe() to resume and then prevent the controller from suspending until after a timeout (e.g. to be used as a temporary workaround): [ 15.676495] bus: 'platform': __driver_probe_device: matched device aea.displayport-controller with driver msm-dp-display [ 15.769392] msm-dp-display aea.displayport-controller: dp_display_probe - populate aux bus [ 15.778808] msm-dp-display aea.displayport-controller: dp_display_probe - scheduling handover [ 15.789931] probe of aea.displayport-controller returned 0 after 91121 usecs [ 15.790460] bus: 'dp-aux': __driver_probe_device: matched device aux-aea.displayport-controller with driver panel-simple-dp-aux Format: Log Type - Time(microsec) - Message - Optional Info Log Type: B - Since Boot(Power On Reset), D - Delta, S - Statistic S - QC_IMAGE_VERSION_STRING=BOOT.MXF.1.1-00470-MAKENA-1 I'll wait for the maintainers and authors of this code to comment, but it seems the runtime PM work is broken in multiple ways. Johan
drm/msm: Second DisplayPort regression in 6.8-rc1
On Sat, Feb 17, 2024 at 04:14:58PM +0100, Johan Hovold wrote: > On Wed, Feb 14, 2024 at 02:52:06PM +0100, Johan Hovold wrote: > > On Tue, Feb 13, 2024 at 10:00:13AM -0800, Abhinav Kumar wrote: > Since Dmitry had trouble reproducing this issue I took a closer look at > the DRM aux bridge series that Abhinav pointed and was able to track > down the bridge regressions and come up with a reproducer. I just posted > a series fixing this here: > > > https://lore.kernel.org/lkml/20240217150228.5788-1-johan+lin...@kernel.org/ > > As I mentioned in the cover letter, I am still seeing intermittent hard > resets around the time that the DRM subsystem is initialising, which > suggests that we may be dealing with two separate DRM regressions here > however. > > If the hard resets are triggered by something like unclocked hardware, > perhaps that bit could this be related to the runtime PM rework? It seems my initial suspicion that at least some of these regressions were related to the runtime PM work was correct. The hard resets happens when the DP controller is runtime suspended after being probed: [ 16.748475] bus: 'platform': __driver_probe_device: matched device ae0.display-subsystem with driver msm-mdss [ 16.759444] msm-mdss ae0.display-subsystem: Adding to iommu group 21 [ 16.795226] bus: 'platform': __driver_probe_device: matched device ae01000.display-controller with driver msm_dpu [ 16.807542] probe of ae01000.display-controller returned -517 after 3 usecs [ 16.821552] bus: 'platform': __driver_probe_device: matched device ae9.displayport-controller with driver msm-dp-display [ 16.837749] probe of ae9.displayport-controller returned -517 after 1 usecs [ OK ] Listening on Load/Save RF Kill Swit[ 16.854659] bus: 'platform': __dch Status /dev/rfkill Watch. [ 16.868458] probe of ae98000.displayport-controller returned -517 after 2 usecs [ 16.880012] bus: 'platform': __driver_probe_device: matched device aea.displayport-controller with driver msm-dp-display [ 16.891856] probe of aea.displayport-controller returned -517 after 2 usecs [ 16.903825] probe of ae0.display-subsystem returned 0 after 144497 usecs [ 16.911636] bus: 'platform': __driver_probe_device: matched device ae01000.display-controller with driver msm_dpu [ 16.942092] probe of ae01000.display-controller returned 0 after 19593 usecs Starting Load/Save Screen Backligh…rightness[ 16.959146] bus: 'platform': _ of backlight:backlight... [ 16.995355] msm-dp-display ae9.displayport-controller: dp_display_probe - probe tail [ 17.004032] probe of ae9.displayport-controller returned 0 after 30225 usecs [ 17.012308] bus: 'platform': __driver_probe_device: matched device ae98000.displayport-controller with driver msm-dp-display [ 17.050193] msm-dp-display ae98000.displayport-controller: dp_display_probe - probe tail Starting Network Name Resolution... [ 17.058925] probe of ae98000.displayport-controller returned 0 after 34774 usecs [ 17.074925] bus: 'platform': __driver_probe_device: matched device aea.displayport-controller with driver msm-dp-display [Starting Network Time Synchronization... [ 17.112000] msm-dp-display aea.displayport-controller: dp_display_probe - populate aux bus [ 17.125208] msm-dp-display aea.displayport-controller: dp_pm_runtime_resume Starting Record System Boot/Shutdown in UTMP... Starting Virtual Console Setup... [ OK ] Finished Load/Save Screen Backlight Brightness of backlight:backlight. [ 17.197909] msm-dp-display aea.displayport-controller: dp_pm_runtime_suspend [ 17.198079] probe of aea0Format: Log Type - Time(microsec) - Message - Optional Info Log Type: B - Since Boot(Power On Reset), D - Delta, S - Statistic S - QC_IMAGE_VERSION_STRING=BOOT.MXF.1.1-00470-MAKENA-1 S - IMAGE_VARIANT_STRING=SocMakenaWP S - OEM_IMAGE_VERSION_STRING=crm-ubuntu92 < machine is reset by hypervisor > Presumably the reset happens when controller is being shut down while still being used by the EFI framebuffer. In the cases where the machines survives boot, the controller is never suspended. When investigating this I've also seen intermittent: [drm:dp_display_probe [msm]] *ERROR* device tree parsing failed which also appears to be related to the runtime PM rework: https://lore.kernel.org/lkml/1701472789-25951-1-git-send-email-quic_khs...@quicinc.com/ I believe this is enough evidence to conclude that this second regression is introduced by commit 5814b8bf086a ("drm/msm/dp: incorporate pm_runtime framework into DP driver"): #regzbot introduced: 5814b8bf086a Has anyone given some thought to how the framebuffer handover is supposed to work? It seems we're currently just relying on luck with timing. Johan
Re: drm/msm: DisplayPort regressions in 6.8-rc1
On Tue, Feb 13, 2024 at 12:42:17PM +0100, Johan Hovold wrote: > Since 6.8-rc1 the internal eDP display on the Lenovo ThinkPad X13s does > not always show up on boot. > > The logs indicate problems with the runtime PM and eDP rework that went > into 6.8-rc1: > > [6.007872] [drm:drm_bridge_attach [drm]] *ERROR* failed to attach > bridge /soc@0/phy@88eb000 to encoder TMDS-31: -16 > and this can also manifest itself as a NULL-pointer dereference: > > [7.339447] Unable to handle kernel NULL pointer dereference at > virtual address > > [7.643705] pc : drm_bridge_attach+0x70/0x1a8 [drm] #regzbot ^introduced: 2bcca96abfbf It looks like it may have been possible to hit this also before commit 2bcca96abfbf ("soc: qcom: pmic-glink: switch to DRM_AUX_HPD_BRIDGE") and the transparent bridge rework in 6.8-rc1 even if that has not yet been confirmed. The above is what made this trigger since 6.8-rc1 however. Johan
Re: drm/msm: DisplayPort regressions in 6.8-rc1
On Wed, Feb 14, 2024 at 02:52:06PM +0100, Johan Hovold wrote: > On Tue, Feb 13, 2024 at 10:00:13AM -0800, Abhinav Kumar wrote: > > > I do agree that pm runtime eDP driver got merged that time but I think > > the issue is either a combination of that along with DRM aux bridge > > https://patchwork.freedesktop.org/series/122584/ OR just the latter as > > even that went in around the same time. > > Yes, indeed there was a lot of changes that went into the MSM drm driver > in 6.8-rc1 and since I have not tried to debug this myself I can't say > for sure which change or changes that triggered this regression (or > possibly regressions). > > The fact that the USB-C/DP PHY appears to be involved > (/soc@0/phy@88eb000) could indeed point to the series you mentioned. > > > Thats why perhaps this issue was not seen with the chromebooks we tested > > on as they do not use pmic_glink (aux bridge). > > > > So we will need to debug this on sc8280xp specifically or an equivalent > > device which uses aux bridge. > > I've hit the NULL-pointer deference three times now in the last few days > on the sc8280xp CRD. But since it doesn't trigger on every boot it seems > you need to go back to the series that could potentially have caused > this regression and review them again. There's clearly something quite > broken here. Since Dmitry had trouble reproducing this issue I took a closer look at the DRM aux bridge series that Abhinav pointed and was able to track down the bridge regressions and come up with a reproducer. I just posted a series fixing this here: https://lore.kernel.org/lkml/20240217150228.5788-1-johan+lin...@kernel.org/ As I mentioned in the cover letter, I am still seeing intermittent hard resets around the time that the DRM subsystem is initialising, which suggests that we may be dealing with two separate DRM regressions here however. If the hard resets are triggered by something like unclocked hardware, perhaps that bit could this be related to the runtime PM rework? Johan
[PATCH 3/6] soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free
A recent DRM series purporting to simplify support for "transparent bridges" and handling of probe deferrals ironically exposed a use-after-free issue on pmic_glink_altmode probe deferral. This has manifested itself as the display subsystem occasionally failing to initialise and NULL-pointer dereferences during boot of machines like the Lenovo ThinkPad X13s. Specifically, the dp-hpd bridge is currently registered before all resources have been acquired which means that it can also be deregistered on probe deferrals. In the meantime there is a race window where the new aux bridge driver (or PHY driver previously) may have looked up the dp-hpd bridge and stored a (non-reference-counted) pointer to the bridge which is about to be deallocated. When the display controller is later initialised, this triggers a use-after-free when attaching the bridges: dp -> aux -> dp-hpd (freed) which may, for example, result in the freed bridge failing to attach: [drm:drm_bridge_attach [drm]] *ERROR* failed to attach bridge /soc@0/phy@88eb000 to encoder TMDS-31: -16 or a NULL-pointer dereference: Unable to handle kernel NULL pointer dereference at virtual address ... Call trace: drm_bridge_attach+0x70/0x1a8 [drm] drm_aux_bridge_attach+0x24/0x38 [aux_bridge] drm_bridge_attach+0x80/0x1a8 [drm] dp_bridge_init+0xa8/0x15c [msm] msm_dp_modeset_init+0x28/0xc4 [msm] The DRM bridge implementation is clearly fragile and implicitly built on the assumption that bridges may never go away. In this case, the fix is to move the bridge registration in the pmic_glink_altmode driver to after all resources have been looked up. Incidentally, with the new dp-hpd bridge implementation, which registers child devices, this is also a requirement due to a long-standing issue in driver core that can otherwise lead to a probe deferral loop (see fbc35b45f9f6 ("Add documentation on meaning of -EPROBE_DEFER")). Fixes: 080b4e24852b ("soc: qcom: pmic_glink: Introduce altmode support") Fixes: 2bcca96abfbf ("soc: qcom: pmic-glink: switch to DRM_AUX_HPD_BRIDGE") Cc: sta...@vger.kernel.org # 6.3 Cc: Bjorn Andersson Cc: Dmitry Baryshkov Signed-off-by: Johan Hovold --- drivers/soc/qcom/pmic_glink_altmode.c | 16 +--- 1 file changed, 13 insertions(+), 3 deletions(-) diff --git a/drivers/soc/qcom/pmic_glink_altmode.c b/drivers/soc/qcom/pmic_glink_altmode.c index 5fcd0fdd2faa..b3808fc24c69 100644 --- a/drivers/soc/qcom/pmic_glink_altmode.c +++ b/drivers/soc/qcom/pmic_glink_altmode.c @@ -76,7 +76,7 @@ struct pmic_glink_altmode_port { struct work_struct work; - struct device *bridge; + struct auxiliary_device *bridge; enum typec_orientation orientation; u16 svid; @@ -230,7 +230,7 @@ static void pmic_glink_altmode_worker(struct work_struct *work) else pmic_glink_altmode_enable_usb(altmode, alt_port); - drm_aux_hpd_bridge_notify(alt_port->bridge, + drm_aux_hpd_bridge_notify(&alt_port->bridge->dev, alt_port->hpd_state ? connector_status_connected : connector_status_disconnected); @@ -454,7 +454,7 @@ static int pmic_glink_altmode_probe(struct auxiliary_device *adev, alt_port->index = port; INIT_WORK(&alt_port->work, pmic_glink_altmode_worker); - alt_port->bridge = drm_dp_hpd_bridge_register(dev, to_of_node(fwnode)); + alt_port->bridge = devm_drm_dp_hpd_bridge_alloc(dev, to_of_node(fwnode)); if (IS_ERR(alt_port->bridge)) { fwnode_handle_put(fwnode); return PTR_ERR(alt_port->bridge); @@ -510,6 +510,16 @@ static int pmic_glink_altmode_probe(struct auxiliary_device *adev, } } + for (port = 0; port < ARRAY_SIZE(altmode->ports); port++) { + alt_port = &altmode->ports[port]; + if (!alt_port->bridge) + continue; + + ret = devm_drm_dp_hpd_bridge_add(dev, alt_port->bridge); + if (ret) + return ret; + } + altmode->client = devm_pmic_glink_register_client(dev, altmode->owner_id, pmic_glink_altmode_callback, -- 2.43.0
[PATCH 5/6] phy: qcom-qmp-combo: fix drm bridge registration
Due to a long-standing issue in driver core, drivers may not probe defer after having registered child devices to avoid triggering a probe deferral loop (see fbc35b45f9f6 ("Add documentation on meaning of -EPROBE_DEFER")). This could potentially also trigger a bug in the DRM bridge implementation which does not expect bridges to go away even if device links may avoid triggering this (when enabled). Move registration of the DRM aux bridge to after looking up clocks and other resources. Note that PHY creation can in theory also trigger a probe deferral when a 'phy' supply is used. This does not seem to affect the QMP PHY driver but the PHY subsystem should be reworked to address this (i.e. by separating initialisation and registration of the PHY). Fixes: 35921910bbd0 ("phy: qcom: qmp-combo: switch to DRM_AUX_BRIDGE") Fixes: 1904c3f578dc ("phy: qcom-qmp-combo: Introduce drm_bridge") Cc: sta...@vger.kernel.org # 6.5 Cc: Bjorn Andersson Cc: Dmitry Baryshkov Signed-off-by: Johan Hovold --- drivers/phy/qualcomm/phy-qcom-qmp-combo.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/phy/qualcomm/phy-qcom-qmp-combo.c b/drivers/phy/qualcomm/phy-qcom-qmp-combo.c index 1ad10110dd25..e19d6a084f10 100644 --- a/drivers/phy/qualcomm/phy-qcom-qmp-combo.c +++ b/drivers/phy/qualcomm/phy-qcom-qmp-combo.c @@ -3566,10 +3566,6 @@ static int qmp_combo_probe(struct platform_device *pdev) if (ret) return ret; - ret = drm_aux_bridge_register(dev); - if (ret) - return ret; - /* Check for legacy binding with child nodes. */ usb_np = of_get_child_by_name(dev->of_node, "usb3-phy"); if (usb_np) { @@ -3589,6 +3585,10 @@ static int qmp_combo_probe(struct platform_device *pdev) if (ret) goto err_node_put; + ret = drm_aux_bridge_register(dev); + if (ret) + goto err_node_put; + pm_runtime_set_active(dev); ret = devm_pm_runtime_enable(dev); if (ret) -- 2.43.0
[PATCH 0/6] soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free
Starting with 6.8-rc1 the internal display sometimes fails to come up on machines like the Lenovo ThinkPad X13s and the logs indicate that this is due to a regression in the DRM subsystem [1]. This series fixes a race in the pmic_glink_altmode driver which was exposed / triggered by the transparent DRM bridges rework that went into 6.8-rc1 and that manifested itself as a bridge failing to attach and sometimes triggering a NULL-pointer dereference. The intermittent hard resets that have also been reported since 6.8-rc1 unfortunately still remains and suggests that we are dealing with two separate regressions. There is some indication that also the hard resets (e.g. due to register accesses to unclocked hardware) are also due to changes in the DRM subsystem as it happens around the time that the eDP panel and display controller would be initialised during boot (the runtime PM rework?). This remains to be verified, however. Included is also a fix for a related OF node reference leak in the aux-hpd driver found through inspection when reworking the driver. The use-after-free bug is triggered by a probe deferral and highlighted some further bugs in the involved drivers, which were registering child devices before deferring probe. This behaviour is not correct and can both trigger probe deferral loops and potentially also further issues with the DRM bridge implementation. This series can either go through the Qualcomm SoC tree (pmic_glink) or the DRM tree. The PHY patches do not depend on the rest of the series and could possibly be merged separately through the PHY tree. Whichever gets this to mainline the fastest. Johan [1] https://lore.kernel.org/lkml/zctvmlk4ztwcp...@hovoldconsulting.com/ Johan Hovold (5): drm/bridge: aux-hpd: fix OF node leaks drm/bridge: aux-hpd: separate allocation and registration soc: qcom: pmic_glink_altmode: fix drm bridge use-after-free phy: qcom-qmp-combo: fix drm bridge registration phy: qcom-qmp-combo: fix type-c switch registration Rob Clark (1): soc: qcom: pmic_glink: Fix boot when QRTR=m drivers/gpu/drm/bridge/aux-hpd-bridge.c | 70 ++- drivers/phy/qualcomm/phy-qcom-qmp-combo.c | 16 +++--- drivers/soc/qcom/pmic_glink.c | 21 +++ drivers/soc/qcom/pmic_glink_altmode.c | 16 +- include/drm/bridge/aux-bridge.h | 15 + 5 files changed, 102 insertions(+), 36 deletions(-) -- 2.43.0
[PATCH 6/6] phy: qcom-qmp-combo: fix type-c switch registration
Due to a long-standing issue in driver core, drivers may not probe defer after having registered child devices to avoid triggering a probe deferral loop (see fbc35b45f9f6 ("Add documentation on meaning of -EPROBE_DEFER")). Move registration of the typec switch to after looking up clocks and other resources. Note that PHY creation can in theory also trigger a probe deferral when a 'phy' supply is used. This does not seem to affect the QMP PHY driver but the PHY subsystem should be reworked to address this (i.e. by separating initialisation and registration of the PHY). Fixes: 2851117f8f42 ("phy: qcom-qmp-combo: Introduce orientation switching") Cc: sta...@vger.kernel.org # 6.5 Cc: Bjorn Andersson Signed-off-by: Johan Hovold --- drivers/phy/qualcomm/phy-qcom-qmp-combo.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/phy/qualcomm/phy-qcom-qmp-combo.c b/drivers/phy/qualcomm/phy-qcom-qmp-combo.c index e19d6a084f10..17c4ad7553a5 100644 --- a/drivers/phy/qualcomm/phy-qcom-qmp-combo.c +++ b/drivers/phy/qualcomm/phy-qcom-qmp-combo.c @@ -3562,10 +3562,6 @@ static int qmp_combo_probe(struct platform_device *pdev) if (ret) return ret; - ret = qmp_combo_typec_switch_register(qmp); - if (ret) - return ret; - /* Check for legacy binding with child nodes. */ usb_np = of_get_child_by_name(dev->of_node, "usb3-phy"); if (usb_np) { @@ -3585,6 +3581,10 @@ static int qmp_combo_probe(struct platform_device *pdev) if (ret) goto err_node_put; + ret = qmp_combo_typec_switch_register(qmp); + if (ret) + goto err_node_put; + ret = drm_aux_bridge_register(dev); if (ret) goto err_node_put; -- 2.43.0
[PATCH 2/6] drm/bridge: aux-hpd: separate allocation and registration
Combining allocation and registration is an anti-pattern that should be avoided. Add two new functions for allocating and registering an dp-hpd bridge with a proper 'devm' prefix so that it is clear that these are device managed interfaces. devm_drm_dp_hpd_bridge_alloc() devm_drm_dp_hpd_bridge_add() The new interface will be used to fix a use-after-free bug in the Qualcomm PMIC GLINK driver and may prevent similar issues from being introduced elsewhere. The existing drm_dp_hpd_bridge_register() is reimplemented using the above and left in place for now. Signed-off-by: Johan Hovold --- drivers/gpu/drm/bridge/aux-hpd-bridge.c | 67 +++-- include/drm/bridge/aux-bridge.h | 15 ++ 2 files changed, 67 insertions(+), 15 deletions(-) diff --git a/drivers/gpu/drm/bridge/aux-hpd-bridge.c b/drivers/gpu/drm/bridge/aux-hpd-bridge.c index 9e71daf95bde..6886db2d9e00 100644 --- a/drivers/gpu/drm/bridge/aux-hpd-bridge.c +++ b/drivers/gpu/drm/bridge/aux-hpd-bridge.c @@ -30,16 +30,13 @@ static void drm_aux_hpd_bridge_release(struct device *dev) kfree(adev); } -static void drm_aux_hpd_bridge_unregister_adev(void *_adev) +static void drm_aux_hpd_bridge_free_adev(void *_adev) { - struct auxiliary_device *adev = _adev; - - auxiliary_device_delete(adev); - auxiliary_device_uninit(adev); + auxiliary_device_uninit(_adev); } /** - * drm_dp_hpd_bridge_register - Create a simple HPD DisplayPort bridge + * devm_drm_dp_hpd_bridge_alloc - allocate a HPD DisplayPort bridge * @parent: device instance providing this bridge * @np: device node pointer corresponding to this bridge instance * @@ -47,11 +44,9 @@ static void drm_aux_hpd_bridge_unregister_adev(void *_adev) * DRM_MODE_CONNECTOR_DisplayPort, which terminates the bridge chain and is * able to send the HPD events. * - * Return: device instance that will handle created bridge or an error code - * encoded into the pointer. + * Return: bridge auxiliary device pointer or an error pointer */ -struct device *drm_dp_hpd_bridge_register(struct device *parent, - struct device_node *np) +struct auxiliary_device *devm_drm_dp_hpd_bridge_alloc(struct device *parent, struct device_node *np) { struct auxiliary_device *adev; int ret; @@ -82,13 +77,55 @@ struct device *drm_dp_hpd_bridge_register(struct device *parent, return ERR_PTR(ret); } - ret = auxiliary_device_add(adev); - if (ret) { - auxiliary_device_uninit(adev); + ret = devm_add_action_or_reset(parent, drm_aux_hpd_bridge_free_adev, adev); + if (ret) return ERR_PTR(ret); - } - ret = devm_add_action_or_reset(parent, drm_aux_hpd_bridge_unregister_adev, adev); + return adev; +} +EXPORT_SYMBOL_GPL(devm_drm_dp_hpd_bridge_alloc); + +static void drm_aux_hpd_bridge_del_adev(void *_adev) +{ + auxiliary_device_delete(_adev); +} + +/** + * devm_drm_dp_hpd_bridge_add - register a HDP DisplayPort bridge + * @dev: struct device to tie registration lifetime to + * @adev: bridge auxiliary device to be registered + * + * Returns: zero on success or a negative errno + */ +int devm_drm_dp_hpd_bridge_add(struct device *dev, struct auxiliary_device *adev) +{ + int ret; + + ret = auxiliary_device_add(adev); + if (ret) + return ret; + + return devm_add_action_or_reset(dev, drm_aux_hpd_bridge_del_adev, adev); +} +EXPORT_SYMBOL_GPL(devm_drm_dp_hpd_bridge_add); + +/** + * drm_dp_hpd_bridge_register - allocate and register a HDP DisplayPort bridge + * @parent: device instance providing this bridge + * @np: device node pointer corresponding to this bridge instance + * + * Return: device instance that will handle created bridge or an error pointer + */ +struct device *drm_dp_hpd_bridge_register(struct device *parent, struct device_node *np) +{ + struct auxiliary_device *adev; + int ret; + + adev = devm_drm_dp_hpd_bridge_alloc(parent, np); + if (IS_ERR(adev)) + return ERR_CAST(adev); + + ret = devm_drm_dp_hpd_bridge_add(parent, adev); if (ret) return ERR_PTR(ret); diff --git a/include/drm/bridge/aux-bridge.h b/include/drm/bridge/aux-bridge.h index c4c423e97f06..4453906105ca 100644 --- a/include/drm/bridge/aux-bridge.h +++ b/include/drm/bridge/aux-bridge.h @@ -9,6 +9,8 @@ #include +struct auxiliary_device; + #if IS_ENABLED(CONFIG_DRM_AUX_BRIDGE) int drm_aux_bridge_register(struct device *parent); #else @@ -19,10 +21,23 @@ static inline int drm_aux_bridge_register(struct device *parent) #endif #if IS_ENABLED(CONFIG_DRM_AUX_HPD_BRIDGE) +struct auxiliary_device *devm_drm_dp_hpd_bridge_alloc(struct device *parent, struct device_node *np); +int devm_drm_dp_hpd_bridge_add(struct device *dev, struct auxiliary_device *adev); struct device *drm_dp_hpd_bridge_register(str
[PATCH 4/6] soc: qcom: pmic_glink: Fix boot when QRTR=m
From: Rob Clark We need to bail out before adding/removing devices if we are going to -EPROBE_DEFER. Otherwise boot can get stuck in a probe deferral loop due to a long-standing issue in driver core (see fbc35b45f9f6 ("Add documentation on meaning of -EPROBE_DEFER")). Deregistering the altmode child device can potentially also trigger bugs in the DRM bridge implementation, which does not expect bridges to go away. Suggested-by: Dmitry Baryshkov Signed-off-by: Rob Clark Link: https://lore.kernel.org/r/20231213210644.8702-1-robdcl...@gmail.com [ johan: rebase on 6.8-rc4, amend commit message and mention DRM ] Fixes: 58ef4ece1e41 ("soc: qcom: pmic_glink: Introduce base PMIC GLINK driver") Cc: sta...@vger.kernel.org # 6.3 Cc: Bjorn Andersson Signed-off-by: Johan Hovold --- drivers/soc/qcom/pmic_glink.c | 21 +++-- 1 file changed, 11 insertions(+), 10 deletions(-) diff --git a/drivers/soc/qcom/pmic_glink.c b/drivers/soc/qcom/pmic_glink.c index f4bfd24386f1..f913e9bd57ed 100644 --- a/drivers/soc/qcom/pmic_glink.c +++ b/drivers/soc/qcom/pmic_glink.c @@ -265,10 +265,17 @@ static int pmic_glink_probe(struct platform_device *pdev) pg->client_mask = *match_data; + pg->pdr = pdr_handle_alloc(pmic_glink_pdr_callback, pg); + if (IS_ERR(pg->pdr)) { + ret = dev_err_probe(&pdev->dev, PTR_ERR(pg->pdr), + "failed to initialize pdr\n"); + return ret; + } + if (pg->client_mask & BIT(PMIC_GLINK_CLIENT_UCSI)) { ret = pmic_glink_add_aux_device(pg, &pg->ucsi_aux, "ucsi"); if (ret) - return ret; + goto out_release_pdr_handle; } if (pg->client_mask & BIT(PMIC_GLINK_CLIENT_ALTMODE)) { ret = pmic_glink_add_aux_device(pg, &pg->altmode_aux, "altmode"); @@ -281,17 +288,11 @@ static int pmic_glink_probe(struct platform_device *pdev) goto out_release_altmode_aux; } - pg->pdr = pdr_handle_alloc(pmic_glink_pdr_callback, pg); - if (IS_ERR(pg->pdr)) { - ret = dev_err_probe(&pdev->dev, PTR_ERR(pg->pdr), "failed to initialize pdr\n"); - goto out_release_aux_devices; - } - service = pdr_add_lookup(pg->pdr, "tms/servreg", "msm/adsp/charger_pd"); if (IS_ERR(service)) { ret = dev_err_probe(&pdev->dev, PTR_ERR(service), "failed adding pdr lookup for charger_pd\n"); - goto out_release_pdr_handle; + goto out_release_aux_devices; } mutex_lock(&__pmic_glink_lock); @@ -300,8 +301,6 @@ static int pmic_glink_probe(struct platform_device *pdev) return 0; -out_release_pdr_handle: - pdr_handle_release(pg->pdr); out_release_aux_devices: if (pg->client_mask & BIT(PMIC_GLINK_CLIENT_BATT)) pmic_glink_del_aux_device(pg, &pg->ps_aux); @@ -311,6 +310,8 @@ static int pmic_glink_probe(struct platform_device *pdev) out_release_ucsi_aux: if (pg->client_mask & BIT(PMIC_GLINK_CLIENT_UCSI)) pmic_glink_del_aux_device(pg, &pg->ucsi_aux); +out_release_pdr_handle: + pdr_handle_release(pg->pdr); return ret; } -- 2.43.0
[PATCH 1/6] drm/bridge: aux-hpd: fix OF node leaks
The two device node references taken during allocation need to be dropped when the auxiliary device is freed. Fixes: 6914968a0b52 ("drm/bridge: properly refcount DT nodes in aux bridge drivers") Cc: Dmitry Baryshkov Cc: Neil Armstrong Signed-off-by: Johan Hovold --- drivers/gpu/drm/bridge/aux-hpd-bridge.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/gpu/drm/bridge/aux-hpd-bridge.c b/drivers/gpu/drm/bridge/aux-hpd-bridge.c index bb55f697a181..9e71daf95bde 100644 --- a/drivers/gpu/drm/bridge/aux-hpd-bridge.c +++ b/drivers/gpu/drm/bridge/aux-hpd-bridge.c @@ -25,6 +25,7 @@ static void drm_aux_hpd_bridge_release(struct device *dev) ida_free(&drm_aux_hpd_bridge_ida, adev->id); of_node_put(adev->dev.platform_data); + of_node_put(adev->dev.of_node); kfree(adev); } @@ -74,6 +75,8 @@ struct device *drm_dp_hpd_bridge_register(struct device *parent, ret = auxiliary_device_init(adev); if (ret) { + of_node_put(adev->dev.platform_data); + of_node_put(adev->dev.of_node); ida_free(&drm_aux_hpd_bridge_ida, adev->id); kfree(adev); return ERR_PTR(ret); -- 2.43.0
Re: [PATCH] drm/msm: Wire up tlb ops
On Tue, Feb 13, 2024 at 09:23:40AM -0800, Rob Clark wrote: > From: Rob Clark > > The brute force iommu_flush_iotlb_all() was good enough for unmap, but > in some cases a map operation could require removing a table pte entry > to replace with a block entry. This also requires tlb invalidation. > Missing this was resulting an obscure iova fault on what should be a > valid buffer address. > > Thanks to Robin Murphy for helping me understand the cause of the fault. > > Cc: Robin Murphy > Fixes: b145c6e65eb0 ("drm/msm: Add support to create a local pagetable") Sounds like you're missing a Cc: sta...@vger.kernel.org here? Or is there some reason not to backport this fix (to 5.9 and later kernels)? > Signed-off-by: Rob Clark Johan
Re: drm/msm: DisplayPort regressions in 6.8-rc1
On Tue, Feb 13, 2024 at 10:00:13AM -0800, Abhinav Kumar wrote: > I do agree that pm runtime eDP driver got merged that time but I think > the issue is either a combination of that along with DRM aux bridge > https://patchwork.freedesktop.org/series/122584/ OR just the latter as > even that went in around the same time. Yes, indeed there was a lot of changes that went into the MSM drm driver in 6.8-rc1 and since I have not tried to debug this myself I can't say for sure which change or changes that triggered this regression (or possibly regressions). The fact that the USB-C/DP PHY appears to be involved (/soc@0/phy@88eb000) could indeed point to the series you mentioned. > Thats why perhaps this issue was not seen with the chromebooks we tested > on as they do not use pmic_glink (aux bridge). > > So we will need to debug this on sc8280xp specifically or an equivalent > device which uses aux bridge. I've hit the NULL-pointer deference three times now in the last few days on the sc8280xp CRD. But since it doesn't trigger on every boot it seems you need to go back to the series that could potentially have caused this regression and review them again. There's clearly something quite broken here. > On 2/13/2024 3:42 AM, Johan Hovold wrote: > > Since 6.8-rc1 the internal eDP display on the Lenovo ThinkPad X13s does > > not always show up on boot. > > [6.007872] [drm:drm_bridge_attach [drm]] *ERROR* failed to attach > > bridge /soc@0/phy@88eb000 to encoder TMDS-31: -16 > > and this can also manifest itself as a NULL-pointer dereference: > > > > [7.339447] Unable to handle kernel NULL pointer dereference at > > virtual address > > > > [7.643705] pc : drm_bridge_attach+0x70/0x1a8 [drm] > > [7.686415] lr : drm_aux_bridge_attach+0x24/0x38 [aux_bridge] > > > > [7.769039] Call trace: > > [7.771564] drm_bridge_attach+0x70/0x1a8 [drm] > > [7.776234] drm_aux_bridge_attach+0x24/0x38 [aux_bridge] > > [7.781782] drm_bridge_attach+0x80/0x1a8 [drm] > > [7.786454] dp_bridge_init+0xa8/0x15c [msm] > > [7.790856] msm_dp_modeset_init+0x28/0xc4 [msm] > > [7.795617] _dpu_kms_drm_obj_init+0x19c/0x680 [msm] > > [7.800731] dpu_kms_hw_init+0x348/0x4c4 [msm] > > [7.805306] msm_drm_kms_init+0x84/0x324 [msm] > > [7.809891] msm_drm_bind+0x1d8/0x3a8 [msm] > > [7.814196] try_to_bring_up_aggregate_device+0x1f0/0x2f8 > > [7.819747] __component_add+0xa4/0x18c > > [7.823703] component_add+0x14/0x20 > > [7.827389] dp_display_probe+0x47c/0x568 [msm] > > [7.832052] platform_probe+0x68/0xd8 > > > > Users have also reported random crashes at boot since 6.8-rc1, and I've > > been able to trigger hard crashes twice when testing an external display > > (USB-C/DP), which may also be related to the DP regressions. Johan
drm/msm: DisplayPort regressions in 6.8-rc1
Hi, Since 6.8-rc1 the internal eDP display on the Lenovo ThinkPad X13s does not always show up on boot. The logs indicate problems with the runtime PM and eDP rework that went into 6.8-rc1: [6.006236] Console: switching to colour dummy device 80x25 [6.007542] [drm:dpu_kms_hw_init:1048] dpu hardware revision:0x8000 [6.007872] [drm:drm_bridge_attach [drm]] *ERROR* failed to attach bridge /soc@0/phy@88eb000 to encoder TMDS-31: -16 [6.007934] [drm:dp_bridge_init [msm]] *ERROR* failed to attach panel bridge: -16 [6.007983] msm_dpu ae01000.display-controller: [drm:msm_dp_modeset_init [msm]] *ERROR* failed to create dp bridge: -16 [6.008030] [drm:_dpu_kms_initialize_displayport:588] [dpu error]modeset_init failed for DP, rc = -16 [6.008050] [drm:_dpu_kms_setup_displays:681] [dpu error]initialize_DP failed, rc = -16 [6.008068] [drm:dpu_kms_hw_init:1153] [dpu error]modeset init failed: -16 [6.008388] msm_dpu ae01000.display-controller: [drm:msm_drm_kms_init [msm]] *ERROR* kms hw init failed: -16 and this can also manifest itself as a NULL-pointer dereference: [7.339447] Unable to handle kernel NULL pointer dereference at virtual address [7.643705] pc : drm_bridge_attach+0x70/0x1a8 [drm] [7.686415] lr : drm_aux_bridge_attach+0x24/0x38 [aux_bridge] [7.769039] Call trace: [7.771564] drm_bridge_attach+0x70/0x1a8 [drm] [7.776234] drm_aux_bridge_attach+0x24/0x38 [aux_bridge] [7.781782] drm_bridge_attach+0x80/0x1a8 [drm] [7.786454] dp_bridge_init+0xa8/0x15c [msm] [7.790856] msm_dp_modeset_init+0x28/0xc4 [msm] [7.795617] _dpu_kms_drm_obj_init+0x19c/0x680 [msm] [7.800731] dpu_kms_hw_init+0x348/0x4c4 [msm] [7.805306] msm_drm_kms_init+0x84/0x324 [msm] [7.809891] msm_drm_bind+0x1d8/0x3a8 [msm] [7.814196] try_to_bring_up_aggregate_device+0x1f0/0x2f8 [7.819747] __component_add+0xa4/0x18c [7.823703] component_add+0x14/0x20 [7.827389] dp_display_probe+0x47c/0x568 [msm] [7.832052] platform_probe+0x68/0xd8 Users have also reported random crashes at boot since 6.8-rc1, and I've been able to trigger hard crashes twice when testing an external display (USB-C/DP), which may also be related to the DP regressions. I've opened an issue here: https://gitlab.freedesktop.org/drm/msm/-/issues/51 but I also want Thorsten's help to track this so that it gets fixed before 6.8 is released. #regzbot introduced: v6.7..v6.8-rc1 The following series is likely the culprit: https://lore.kernel.org/all/1701472789-25951-1-git-send-email-quic_khs...@quicinc.com/ Johan
Re: [PATCH v2] iommu/arm-smmu-qcom: Add missing GMU entry to match table
On Sun, Dec 10, 2023 at 10:06:53AM -0800, Rob Clark wrote: > From: Rob Clark > > In some cases the firmware expects cbndx 1 to be assigned to the GMU, > so we also want the default domain for the GMU to be an identy domain. > This way it does not get a context bank assigned. Without this, both > of_dma_configure() and drm/msm's iommu_domain_attach() will trigger > allocating and configuring a context bank. So GMU ends up attached to > both cbndx 1 and later cbndx 2. This arrangement seemingly confounds > and surprises the firmware if the GPU later triggers a translation > fault, resulting (on sc8280xp / lenovo x13s, at least) in the SMMU > getting wedged and the GPU stuck without memory access. > > Cc: sta...@vger.kernel.org > Signed-off-by: Rob Clark Tested-by: Johan Hovold
Re: [PATCH] iommu/arm-smmu-qcom: Add missing GMU entry to match table
On Thu, Dec 07, 2023 at 01:24:39PM -0800, Rob Clark wrote: > From: Rob Clark > > We also want the default domain for the GMU to be an identy domain, > so it does not get a context bank assigned. Without this, both > of_dma_configure() and drm/msm's iommu_domain_attach() will trigger > allocating and configuring a context bank. So GMU ends up attached > to both cbndx 1 and cbndx 2. This arrangement seemingly confounds > and surprises the firmware if the GPU later triggers a translation > fault, resulting (on sc8280xp / lenovo x13s, at least) in the SMMU > getting wedged and the GPU stuck without memory access. This sounds like something that should be backported. Should you add a Fixes and CC-stable tag? > Signed-off-by: Rob Clark Johan
Re: [Freedreno] [PATCH] drm/msm/dpu: Add missing safe_lut_tbl in sc8180x catalog
On Thu, Nov 30, 2023 at 04:35:01PM -0800, Bjorn Andersson wrote: > Similar to SC8280XP, the misconfigured SAFE logic causes rather > significant delays in __arm_smmu_tlb_sync(), resulting in poor > performance for things such as USB. > > Introduce appropriate SAFE values for SC8180X to correct this. > > Fixes: f3af2d6ee9ab ("drm/msm/dpu: Add SC8180x to hw catalog") Missing CC stable tag? > Signed-off-by: Bjorn Andersson Johan
Re: [Freedreno] [PATCH v2 1/2] drm/msm/dp: don't touch DP subconnector property in eDP case
On Wed, Oct 25, 2023 at 12:23:09PM +0300, Dmitry Baryshkov wrote: > From: Abel Vesa > > In case of the eDP connection there is no subconnetor and as such no > subconnector property. Put drm_dp_set_subconnector_property() calls > under the !is_edp condition. > > Fixes: bfcc3d8f94f4 ("drm/msm/dp: support setting the DP subconnector type") > Signed-off-by: Abel Vesa > Signed-off-by: Dmitry Baryshkov Reviewed-by: Johan Hovold Tested-by: Johan Hovold
Re: [Freedreno] [PATCH v2 2/2] drm/msm/dp: attach the DP subconnector property
On Wed, Oct 25, 2023 at 12:23:10PM +0300, Dmitry Baryshkov wrote: > While developing and testing the commit bfcc3d8f94f4 ("drm/msm/dp: > support setting the DP subconnector type") I had the patch [1] in my > tree. I haven't noticed that it was a dependency for the commit in > question. Mea culpa. This also broke boot on the Lenovo ThinkPad X13s. Would be nice to get this fixed ASAP so that further people don't have to debug this known regression. > Since the patch has not landed yet (and even was not reviewed) > and since one of the bridges erroneously uses USB connector type instead > of DP, attach the property directly from the MSM DP driver. > > This fixes the following oops on DP HPD event: > > drm_object_property_set_value (drivers/gpu/drm/drm_mode_object.c:288) > dp_display_process_hpd_high (drivers/gpu/drm/msm/dp/dp_display.c:402) > dp_hpd_plug_handle.isra.0 (drivers/gpu/drm/msm/dp/dp_display.c:604) > hpd_event_thread (drivers/gpu/drm/msm/dp/dp_display.c:1110) > kthread (kernel/kthread.c:388) > ret_from_fork (arch/arm64/kernel/entry.S:858) This only says where the oops happened, it doesn't necessarily in itself indicate an oops at all or that in this case it's a NULL pointer dereference. On the X13s I'm seeing the NULL deref in a different path during boot, and when this happens after a deferred probe (due to the panel lookup mess) it hangs the machine, which makes it a bit of a pain to debug: Unable to handle kernel NULL pointer dereference at virtual address 0060 ... CPU: 4 PID: 57 Comm: kworker/u16:1 Not tainted 6.7.0-rc1 #4 Hardware name: Qualcomm QRD, BIOS 6.0.220110.BOOT.MXF.1.1-00470-MAKENA-1 01/10/2022 ... Call trace: drm_object_property_set_value+0x0/0x88 [drm] dp_display_process_hpd_high+0xa0/0x14c [msm] dp_hpd_plug_handle.constprop.0.isra.0+0x90/0x110 [msm] dp_bridge_atomic_enable+0x184/0x21c [msm] edp_bridge_atomic_enable+0x60/0x94 [msm] drm_atomic_bridge_chain_enable+0x54/0xc8 [drm] drm_atomic_helper_commit_modeset_enables+0x194/0x26c [drm_kms_helper] msm_atomic_commit_tail+0x204/0x804 [msm] commit_tail+0xa4/0x18c [drm_kms_helper] drm_atomic_helper_commit+0x19c/0x1b0 [drm_kms_helper] drm_atomic_commit+0xa4/0x104 [drm] drm_client_modeset_commit_atomic+0x22c/0x298 [drm] drm_client_modeset_commit_locked+0x60/0x1c0 [drm] drm_client_modeset_commit+0x30/0x58 [drm] __drm_fb_helper_restore_fbdev_mode_unlocked+0xbc/0xfc [drm_kms_helper] drm_fb_helper_set_par+0x30/0x4c [drm_kms_helper] fbcon_init+0x224/0x49c visual_init+0xb0/0x108 do_bind_con_driver.isra.0+0x19c/0x38c do_take_over_console+0x140/0x1ec do_fbcon_takeover+0x6c/0xe4 fbcon_fb_registered+0x180/0x1f0 register_framebuffer+0x19c/0x228 __drm_fb_helper_initial_config_and_unlock+0x2e8/0x4e8 [drm_kms_helper] drm_fb_helper_initial_config+0x3c/0x4c [drm_kms_helper] msm_fbdev_client_hotplug+0x84/0xcc [msm] drm_client_register+0x5c/0xa0 [drm] msm_fbdev_setup+0x94/0x148 [msm] msm_drm_bind+0x3d0/0x42c [msm] try_to_bring_up_aggregate_device+0x1ec/0x2f4 __component_add+0xa8/0x194 component_add+0x14/0x20 dp_display_probe+0x278/0x41c [msm] > [1] https://patchwork.freedesktop.org/patch/30/ > > Fixes: bfcc3d8f94f4 ("drm/msm/dp: support setting the DP subconnector type") > Reviewed-by: Abhinav Kumar > Signed-off-by: Dmitry Baryshkov Reviewed-by: Johan Hovold Tested-by: Johan Hovold Johan
Re: [Freedreno] [PATCH] drm/msm/dpu: Add missing safe_lut_tbl in sc8280xp catalog
On Mon, Oct 30, 2023 at 04:23:20PM -0700, Bjorn Andersson wrote: > During USB transfers on the SC8280XP __arm_smmu_tlb_sync() is seen to > typically take 1-2ms to complete. As expected this results in poor > performance, something that has been mitigated by proposing running the > iommu in non-strict mode (boot with iommu.strict=0). > > This turns out to be related to the SAFE logic, and programming the QOS > SAFE values in the DPU (per suggestion from Rob and Doug) reduces the > TLB sync time to below 10us, which means significant less time spent > with interrupts disabled and a significant boost in throughput. I ran some tests with a gigabit ethernet adapter to get an idea of how this performs in comparison to using lazy iommu mode ("non-strict"): 6.6 6.6-lazy6.6-dpu 6.6-dpu-lazy iperf3 recv 114 941 941 941 MBit/s iperf3 send 124 891 703 940 MBit/s scp recv14.6110 110 111 MB/s scp send12.598.991.5110 MB/s This patch in itself indeed improves things quite a bit, but there is still some performance that can be gained by using lazy iommu mode. Notably, lazy mode with this patch applied appears to saturate the link in both directions. Tested-by: Johan Hovold Johan
Re: [Freedreno] [PATCH] soc: qcom: pmic_glink: fix connector type to be DisplayPort
On Wed, Oct 25, 2023 at 12:29:26PM +, Simon Ser wrote: > On Wednesday, October 25th, 2023 at 14:22, Johan Hovold > wrote: > > > I was just going to post a patch fixing this after finally investigating > > why the DisplayPort outputs on the X13s were annoyingly identified as > > "Unknown20-1" and "Unknown20-2" instead of the expected "DP-1" and > > "DP-2". > > Note, ideally userspace should use drmModeGetConnectorTypeName() from > libdrm to figure out the proper name for a connector type. That way we > only need to update a single spot when adding a new connector type, > instead of patching a whole bunch of programs. Yeah, I only skimmed the xrandr code and these strings appear to originate from some X library. So hopefully not that many places to update. Scripts and other tools may still be using these strings directly however (e.g. as did my custom script to enable external displays). Johan
Re: [Freedreno] [PATCH] soc: qcom: pmic_glink: fix connector type to be DisplayPort
On Wed, Oct 11, 2023 at 01:52:29AM +0300, Dmitry Baryshkov wrote: > As it was pointed out by Simon Ser, the DRM_MODE_CONNECTOR_USB connector > is reserved for the GUD devices. Other drivers (i915, amdgpu) use > DRM_MODE_CONNECTOR_DisplayPort even if the DP stream is handled by the > USB-C altmode. While we are still working on implementing the proper way > to let userspace know that the DP is wrapped into USB-C, change > connector type to be DRM_MODE_CONNECTOR_DisplayPort. > > Fixes: 080b4e24852b ("soc: qcom: pmic_glink: Introduce altmode support") > Cc: Simon Ser > Signed-off-by: Dmitry Baryshkov > --- > drivers/soc/qcom/pmic_glink_altmode.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/soc/qcom/pmic_glink_altmode.c > b/drivers/soc/qcom/pmic_glink_altmode.c > index 9569d999391d..6f8b2f7ae3cc 100644 > --- a/drivers/soc/qcom/pmic_glink_altmode.c > +++ b/drivers/soc/qcom/pmic_glink_altmode.c > @@ -467,7 +467,7 @@ static int pmic_glink_altmode_probe(struct > auxiliary_device *adev, > alt_port->bridge.funcs = &pmic_glink_altmode_bridge_funcs; > alt_port->bridge.of_node = to_of_node(fwnode); > alt_port->bridge.ops = DRM_BRIDGE_OP_HPD; > - alt_port->bridge.type = DRM_MODE_CONNECTOR_USB; > + alt_port->bridge.type = DRM_MODE_CONNECTOR_DisplayPort; > > ret = devm_drm_bridge_add(dev, &alt_port->bridge); > if (ret) { I was just going to post a patch fixing this after finally investigating why the DisplayPort outputs on the X13s were annoyingly identified as "Unknown20-1" and "Unknown20-2" instead of the expected "DP-1" and "DP-2". A lore search just before posting led me to this fix from two weeks ago. I think the commit message should have mentioned something about the how this change affects user space. My patch also had a CC stable, but I guess we can ping the stable team once it hits mainline: commit e5f55bf5ad4effdd59d4d06c839a0ac553a73c7d (HEAD -> work) Author: Johan Hovold Date: Wed Oct 25 11:54:09 2023 +0200 soc: qcom: pmic_glink_altmode: fix DP alt mode connector type The PMIC glink altmode bridge connector type should be "DisplayPort" rather than "USB", which is intended for custom USB display protocols (e.g. see 40e1a70b4aed ("drm: Add GUD USB Display driver")). This specifically makes the DisplayPort outputs on the Lenovo ThinkPad X13s show up as "DP-1" and "DP-2" rather than "Unknown20-1" and "Unknown20-2" with xrandr as expected (by users and tools): Screen 0: minimum 320 x 200, current 1920 x 1200, maximum 5120 x 4096 eDP-1 connected primary 1920x1200+0+0 (normal left inverted right x axis y axis) 286mm x 178mm 1920x1200 60.03*+ 1600x1200 60.00 DP-1 disconnected (normal left inverted right x axis y axis) DP-2 connected (normal left inverted right x axis y axis) 1920x1200 59.95 + ... Fixes: 080b4e24852b ("soc: qcom: pmic_glink: Introduce altmode support") Cc: sta...@vger.kernel.org # 6.3 Signed-off-by: Johan Hovold Johan
Re: [Freedreno] [PATCH v1] drm/msm/dp: do not reinitialize phy unless retry during link training
On Tue, Oct 03, 2023 at 11:10:59AM +0200, Johan Hovold wrote: > On Tue, Aug 08, 2023 at 03:19:50PM -0700, Kuogee Hsieh wrote: > > DP PHY re-initialization done using dp_ctrl_reinitialize_mainlink() will > > cause PLL unlocked initially and then PLL gets locked at the end of > > initialization. PLL_UNLOCKED interrupt will fire during this time if the > > interrupt mask is enabled. > > However currently DP driver link training implementation incorrectly > > re-initializes PHY unconditionally during link training as the PHY was > > already configured in dp_ctrl_enable_mainlink_clocks(). > > > > Fix this by re-initializing the PHY only if the previous link training > > failed. > > > > [drm:dp_aux_isr] *ERROR* Unexpected DP AUX IRQ 0x0100 when not busy > > > > Fixes: c943b4948b58 ("drm/msm/dp: add displayPort driver support") > > Closes: https://gitlab.freedesktop.org/drm/msm/-/issues/30 > > Signed-off-by: Kuogee Hsieh > > This fixes the above warning and avoids the unnecessary PHY power-off > and power-on during boot of the ThinkPad X13s: > > Reviewed-by: Johan Hovold > Tested-by: Johan Hovold > > I guess this one should go to stable as well: > > Cc: sta...@vger.kernel.org# 5.10 > > Is anyone planning on getting this fixed in 6.6-rc? I noticed that this > one still hasn't shown up linux-next. For the record, this one showed up in a PR from Rob and has now been forwarded to Linus. No stable tag included, but I guess we can ping the stable team unless AUTOSEL picks this up. Johan
Re: [Freedreno] [PATCH v1] drm/msm/dp: do not reinitialize phy unless retry during link training
On Tue, Aug 08, 2023 at 03:19:50PM -0700, Kuogee Hsieh wrote: > DP PHY re-initialization done using dp_ctrl_reinitialize_mainlink() will > cause PLL unlocked initially and then PLL gets locked at the end of > initialization. PLL_UNLOCKED interrupt will fire during this time if the > interrupt mask is enabled. > However currently DP driver link training implementation incorrectly > re-initializes PHY unconditionally during link training as the PHY was > already configured in dp_ctrl_enable_mainlink_clocks(). > > Fix this by re-initializing the PHY only if the previous link training > failed. > > [drm:dp_aux_isr] *ERROR* Unexpected DP AUX IRQ 0x0100 when not busy > > Fixes: c943b4948b58 ("drm/msm/dp: add displayPort driver support") > Closes: https://gitlab.freedesktop.org/drm/msm/-/issues/30 > Signed-off-by: Kuogee Hsieh This fixes the above warning and avoids the unnecessary PHY power-off and power-on during boot of the ThinkPad X13s: Reviewed-by: Johan Hovold Tested-by: Johan Hovold I guess this one should go to stable as well: Cc: sta...@vger.kernel.org # 5.10 Is anyone planning on getting this fixed in 6.6-rc? I noticed that this one still hasn't shown up linux-next. > --- > drivers/gpu/drm/msm/dp/dp_ctrl.c | 13 ++--- > 1 file changed, 6 insertions(+), 7 deletions(-) > > diff --git a/drivers/gpu/drm/msm/dp/dp_ctrl.c > b/drivers/gpu/drm/msm/dp/dp_ctrl.c > index a7a5c7e..77a8d93 100644 > --- a/drivers/gpu/drm/msm/dp/dp_ctrl.c > +++ b/drivers/gpu/drm/msm/dp/dp_ctrl.c > @@ -1774,13 +1774,6 @@ int dp_ctrl_on_link(struct dp_ctrl *dp_ctrl) > return rc; > > while (--link_train_max_retries) { > - rc = dp_ctrl_reinitialize_mainlink(ctrl); > - if (rc) { > - DRM_ERROR("Failed to reinitialize mainlink. rc=%d\n", > - rc); > - break; > - } > - > training_step = DP_TRAINING_NONE; > rc = dp_ctrl_setup_main_link(ctrl, &training_step); > if (rc == 0) { > @@ -1832,6 +1825,12 @@ int dp_ctrl_on_link(struct dp_ctrl *dp_ctrl) > /* stop link training before start re training */ > dp_ctrl_clear_training_pattern(ctrl); > } > + > + rc = dp_ctrl_reinitialize_mainlink(ctrl); > + if (rc) { > + DRM_ERROR("Failed to reinitialize mainlink. rc=%d\n", > rc); > + break; > + } > } > > if (ctrl->link->sink_request & DP_TEST_LINK_PHY_TEST_PATTERN) Johan
Re: [Freedreno] [PATCH] drm/msm/dp: Drop aux devices together with DP controller
On Mon, Jun 12, 2023 at 03:01:06PM -0700, Bjorn Andersson wrote: > Using devres to depopulate the aux bus made sure that upon a probe > deferral the EDP panel device would be destroyed and recreated upon next > attempt. > > But the struct device which the devres is tied to is the DPUs > (drm_dev->dev), which may be happen after the DP controller is torn > down. There appears to be some words missing in this sentence. > Indications of this can be seen in the commonly seen EDID-hexdump full > of zeros in the log, This could happen also when the aux bus lifetime was tied to DP controller and is mostly benign as dp_aux_deinit() set the "initted" flag to false. > or the occasional/rare KASAN fault where the > panel's attempt to read the EDID information causes a use after free on > DP resources. But this is clearly a bug as there's a small window where the aux bus struct holding the above flag may also have been released... > It's tempting to move the devres to the DP controller's struct device, > but the resources used by the device(s) on the aux bus are explicitly > torn down in the error path. The KASAN-reported use-after-free also > remains, as the DP aux "module" explicitly frees its devres-allocated > memory in this code path. Right, and this would also not work as the aux bus could remain populated for the next bind attempt which would then fail (as described in the commit message of the offending commit). > As such, explicitly depopulate the aux bus in the error path, and in the > component unbind path, to avoid these issues. Sounds good. > Fixes: 2b57f726611e ("drm/msm/dp: fix aux-bus EP lifetime") This one should also have a stable tag: Cc: sta...@vger.kernel.org # 5.19 > Signed-off-by: Bjorn Andersson > --- > drivers/gpu/drm/msm/dp/dp_display.c | 14 +++--- > 1 file changed, 3 insertions(+), 11 deletions(-) > > diff --git a/drivers/gpu/drm/msm/dp/dp_display.c > b/drivers/gpu/drm/msm/dp/dp_display.c > index 3d8fa2e73583..bbb0550a022b 100644 > --- a/drivers/gpu/drm/msm/dp/dp_display.c > +++ b/drivers/gpu/drm/msm/dp/dp_display.c > @@ -322,6 +322,8 @@ static void dp_display_unbind(struct device *dev, struct > device *master, > > kthread_stop(dp->ev_tsk); > > + of_dp_aux_depopulate_bus(dp->aux); This may now be called without first having populated the bus, but looks like that still works. > + > dp_power_client_deinit(dp->power); > dp_unregister_audio_driver(dev, dp->audio); > dp_aux_unregister(dp->aux); I know this one was merged while I was out-of-office last week, but for the record: Reviewed-by: Johan Hovold Tested-by: Johan Hovold Johan
Re: [Freedreno] Adreno devfreq lockdep splat with 6.3-rc2
On Thu, Jun 08, 2023 at 02:17:45PM -0700, Rob Clark wrote: > On Thu, Jun 8, 2023 at 7:12 AM Johan Hovold wrote: > > Have you had a chance to look at this regression yet? It prevents us > > from using lockdep on the X13s as it is disabled as soon as we start > > the GPU. > > Hmm, curious what is different between x13s and sc7180/sc7280 things? It seems like lockdep needs to hit the tear down path in order to detect the circular lock dependency. Perhaps you don't hit that on your sc7180/sc7280? It is due to the fact that the panel is looked up way too late so that bind fails unless the panel driver is already loaded when the msm drm driver probes. Manually loading the panel driver before msm makes the splat go away. > Or did lockdep recently get more clever (or more annotation)? I think this is indeed a new problem related to some of the devfreq work you did in 6.3-rc1 (e.g. fadcc3ab1302 ("drm/msm/gpu: Bypass PM QoS constraint for idle clamp")). > I did spend some time a while back trying to bring some sense to > devfreq/pm-qos/icc locking: > https://patchwork.freedesktop.org/series/115028/ > > but haven't had time to revisit that for a while That's the series I link to below, but IIRC it did not look directly applicable to the splat I see on X13s (e.g. does not involve fs_reclaim). > > On Wed, Mar 15, 2023 at 10:19:21AM +0100, Johan Hovold wrote: > > > > > > Since 6.3-rc2 (or possibly -rc1), I'm now seeing the below > > > devfreq-related lockdep splat. > > > > > > I noticed that you posted a fix for something similar here: > > > > > > > > > https://lore.kernel.org/r/20230312204150.1353517-9-robdcl...@gmail.com > > > > > > but that particular patch makes no difference. > > > > > > From skimming the calltraces below and qos/devfreq related changes in > > > 6.3-rc1 it seems like this could be related to: > > > > > > fadcc3ab1302 ("drm/msm/gpu: Bypass PM QoS constraint for idle > > > clamp") Johan
Re: [Freedreno] Adreno devfreq lockdep splat with 6.3-rc2
Hi Rob, Have you had a chance to look at this regression yet? It prevents us from using lockdep on the X13s as it is disabled as soon as we start the GPU. On Wed, Mar 15, 2023 at 10:19:21AM +0100, Johan Hovold wrote: > > Since 6.3-rc2 (or possibly -rc1), I'm now seeing the below > devfreq-related lockdep splat. > > I noticed that you posted a fix for something similar here: > > https://lore.kernel.org/r/20230312204150.1353517-9-robdcl...@gmail.com > > but that particular patch makes no difference. > > From skimming the calltraces below and qos/devfreq related changes in > 6.3-rc1 it seems like this could be related to: > > fadcc3ab1302 ("drm/msm/gpu: Bypass PM QoS constraint for idle clamp") Below is an updated splat from 6.4-rc5. Johan [ 2941.931507] == [ 2941.931509] WARNING: possible circular locking dependency detected [ 2941.931513] 6.4.0-rc5 #64 Not tainted [ 2941.931516] -- [ 2941.931518] ring0/359 is trying to acquire lock: [ 2941.931520] 63310e35c078 (&devfreq->lock){+.+.}-{3:3}, at: qos_min_notifier_call+0x28/0x88 [ 2941.931541] but task is already holding lock: [ 2941.931543] 63310e3cace8 (&(c->notifiers)->rwsem){}-{3:3}, at: blocking_notifier_call_chain+0x30/0x70 [ 2941.931553] which lock already depends on the new lock. [ 2941.931555] the existing dependency chain (in reverse order) is: [ 2941.931556] -> #4 (&(c->notifiers)->rwsem){}-{3:3}: [ 2941.931562]down_write+0x50/0x198 [ 2941.931567]blocking_notifier_chain_register+0x30/0x8c [ 2941.931570]freq_qos_add_notifier+0x68/0x7c [ 2941.931574]dev_pm_qos_add_notifier+0xa0/0xf8 [ 2941.931579]devfreq_add_device.part.0+0x360/0x5a4 [ 2941.931583]devm_devfreq_add_device+0x74/0xe0 [ 2941.931587]msm_devfreq_init+0xa0/0x154 [msm] [ 2941.931624]msm_gpu_init+0x2fc/0x588 [msm] [ 2941.931649]adreno_gpu_init+0x160/0x2d0 [msm] [ 2941.931675]a6xx_gpu_init+0x2c0/0x74c [msm] [ 2941.931699]adreno_bind+0x180/0x290 [msm] [ 2941.931723]component_bind_all+0x124/0x288 [ 2941.931728]msm_drm_bind+0x1d8/0x6cc [msm] [ 2941.931752]try_to_bring_up_aggregate_device+0x1ec/0x2f4 [ 2941.931755]__component_add+0xa8/0x194 [ 2941.931758]component_add+0x14/0x20 [ 2941.931761]dp_display_probe+0x2b4/0x474 [msm] [ 2941.931785]platform_probe+0x68/0xd8 [ 2941.931789]really_probe+0x184/0x3c8 [ 2941.931792]__driver_probe_device+0x7c/0x16c [ 2941.931794]driver_probe_device+0x3c/0x110 [ 2941.931797]__device_attach_driver+0xbc/0x158 [ 2941.931800]bus_for_each_drv+0x84/0xe0 [ 2941.931802]__device_attach+0xa8/0x1d4 [ 2941.931805]device_initial_probe+0x14/0x20 [ 2941.931807]bus_probe_device+0xb0/0xb4 [ 2941.931810]deferred_probe_work_func+0xa0/0xf4 [ 2941.931812]process_one_work+0x288/0x5bc [ 2941.931816]worker_thread+0x74/0x450 [ 2941.931818]kthread+0x124/0x128 [ 2941.931822]ret_from_fork+0x10/0x20 [ 2941.931826] -> #3 (dev_pm_qos_mtx){+.+.}-{3:3}: [ 2941.931831]__mutex_lock+0xa0/0x840 [ 2941.931833]mutex_lock_nested+0x24/0x30 [ 2941.931836]dev_pm_qos_remove_notifier+0x34/0x140 [ 2941.931838]genpd_remove_device+0x3c/0x174 [ 2941.931841]genpd_dev_pm_detach+0x78/0x1b4 [ 2941.931844]dev_pm_domain_detach+0x24/0x34 [ 2941.931846]a6xx_gmu_remove+0x34/0xc4 [msm] [ 2941.931869]a6xx_destroy+0xd0/0x160 [msm] [ 2941.931892]adreno_unbind+0x40/0x64 [msm] [ 2941.931916]component_unbind+0x38/0x6c [ 2941.931919]component_unbind_all+0xc8/0xd4 [ 2941.931921]msm_drm_uninit.isra.0+0x150/0x1c4 [msm] [ 2941.931945]msm_drm_bind+0x310/0x6cc [msm] [ 2941.931967]try_to_bring_up_aggregate_device+0x1ec/0x2f4 [ 2941.931970]__component_add+0xa8/0x194 [ 2941.931973]component_add+0x14/0x20 [ 2941.931976]dp_display_probe+0x2b4/0x474 [msm] [ 2941.932000]platform_probe+0x68/0xd8 [ 2941.932003]really_probe+0x184/0x3c8 [ 2941.932005]__driver_probe_device+0x7c/0x16c [ 2941.932008]driver_probe_device+0x3c/0x110 [ 2941.932011]__device_attach_driver+0xbc/0x158 [ 2941.932014]bus_for_each_drv+0x84/0xe0 [ 2941.932016]__device_attach+0xa8/0x1d4 [ 2941.932018]device_initial_probe+0x14/0x20 [ 2941.932021]bus_probe_device+0xb0/0xb4 [ 2941.932023]deferred_probe_work_func+0xa0/0xf4 [ 2941.932026]process_one_work+0x288/0x5bc [ 2941.932028]worker_thread+0x74/0x450 [ 2941.932031]kthread+0x124/0x128 [ 2941.932035]ret_from_fork+0x10/0x20 [ 2941.93203
Re: [Freedreno] [PATCH] Revert "drm/msm/dp: set self refresh aware based on PSR support"
On Mon, Jun 05, 2023 at 01:05:36PM +0300, Dmitry Baryshkov wrote: > On Mon, 5 Jun 2023 at 13:02, Johan Hovold wrote: > > Virtual terminals are still broken with 6.4-rc5 on the Lenovo ThinkPad > > X13s two weeks after I reported this, and there has been no indication > > of any progress in the other related thread: > > > > https://lore.kernel.org/lkml/zhyphnwodbxb-...@hovoldconsulting.com > > > > Seems like it is time to merge this revert to get this sorted. > > > > Rob, Abhinav, Dmitry, can either of you merge this one and get it into > > 6.4-rc6? > > Rob sent the pull request few hours ago, see > https://lore.kernel.org/dri-devel/caf6aeguhujkfjra6ys36uyh0kur4hd16u1emqjo8toz3ifv...@mail.gmail.com/ Ok, so you guys went with the module parameter hack. Whatever. As long as the regression is finally fixed. Next time, some visibility into your process would be appreciated to avoid unnecessary work. Johan
Re: [Freedreno] [PATCH] Revert "drm/msm/dp: set self refresh aware based on PSR support"
[ +CC: Thorsten and regzbot so they can help with tracking this regression ] #regzbot introduced: v6.3..v6.4-rc1 On Tue, May 23, 2023 at 05:16:46PM +0200, Johan Hovold wrote: > This reverts commit 1844e680d56bb0c4e0489138f2b7ba2dc1c988e3. > > PSR support clearly is not ready for mainline and specifically breaks > virtual terminals which are no longer updated when PSR is enabled (e.g. > no keyboard input is echoed, no cursor blink). > > Disable PSR support for now by reverting commit 1844e680d56b > ("drm/msm/dp: set self refresh aware based on PSR support"). > > Cc: Vinod Polimera > Cc: Dmitry Baryshkov > Signed-off-by: Johan Hovold > --- > > Bjorn reported that PSR support broke virtual terminals two months ago, > but this is still broken in 6.4-rc3: > > https://lore.kernel.org/lkml/20230326162723.3lo6pnsfdwzsvbhj@ripper/ > > despite the following series that claimed to address this: > > > https://lore.kernel.org/lkml/1680271114-1534-1-git-send-email-quic_vpoli...@quicinc.com > > Let's revert until this has been fixed properly. Virtual terminals are still broken with 6.4-rc5 on the Lenovo ThinkPad X13s two weeks after I reported this, and there has been no indication of any progress in the other related thread: https://lore.kernel.org/lkml/zhyphnwodbxb-...@hovoldconsulting.com Seems like it is time to merge this revert to get this sorted. Rob, Abhinav, Dmitry, can either of you merge this one and get it into 6.4-rc6? Johan
Re: [Freedreno] [PATCH] drm/msm/a6xx: fix uninitialised lock in init error path
On Wed, May 31, 2023 at 07:22:49AM -0700, Doug Anderson wrote: > Hi, > > On Wed, May 31, 2023 at 1:00 AM Johan Hovold wrote: > > > > A recent commit started taking the GMU lock in the GPU destroy path, > > which on GPU initialisation failure is called before the GMU and its > > lock have been initialised. > > > > Make sure that the GMU has been initialised before taking the lock in > > a6xx_destroy() and drop the now redundant check from a6xx_gmu_remove(). > > > > Fixes: 4cd15a3e8b36 ("drm/msm/a6xx: Make GPU destroy a bit safer") > > Cc: sta...@vger.kernel.org # 6.3 > > Cc: Douglas Anderson > > Signed-off-by: Johan Hovold > > --- > > drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 3 --- > > drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 9 ++--- > > 2 files changed, 6 insertions(+), 6 deletions(-) > > I think Dmitry already posted a patch 1.5 months ago to fix this. > > https://lore.kernel.org/r/20230410165908.3094626-1-dmitry.barysh...@linaro.org Bah, I checked if Bjorn had hit this with his recent A690 v3 series and posted a fix, but did not look further than that. > Can you confirm that works for you? That looks like it would work too, but I think I prefer my version which keeps the initialisation of the GMU struct in a6xx_gmu_init(). Dmitry or Rob, could you see to that either version gets merged soon so that we don't end up with even more people having to debug and fix the same issue? Johan
[Freedreno] [PATCH] drm/msm/a6xx: fix uninitialised lock in init error path
A recent commit started taking the GMU lock in the GPU destroy path, which on GPU initialisation failure is called before the GMU and its lock have been initialised. Make sure that the GMU has been initialised before taking the lock in a6xx_destroy() and drop the now redundant check from a6xx_gmu_remove(). Fixes: 4cd15a3e8b36 ("drm/msm/a6xx: Make GPU destroy a bit safer") Cc: sta...@vger.kernel.org # 6.3 Cc: Douglas Anderson Signed-off-by: Johan Hovold --- drivers/gpu/drm/msm/adreno/a6xx_gmu.c | 3 --- drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 9 ++--- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c index e16b4b3f8535..105ccf17041f 100644 --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c @@ -1472,9 +1472,6 @@ void a6xx_gmu_remove(struct a6xx_gpu *a6xx_gpu) struct a6xx_gmu *gmu = &a6xx_gpu->gmu; struct platform_device *pdev = to_platform_device(gmu->dev); - if (!gmu->initialized) - return; - pm_runtime_force_suspend(gmu->dev); /* diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c index 9fb214f150dd..ee47b95a0205 100644 --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c @@ -1684,6 +1684,7 @@ static void a6xx_destroy(struct msm_gpu *gpu) { struct adreno_gpu *adreno_gpu = to_adreno_gpu(gpu); struct a6xx_gpu *a6xx_gpu = to_a6xx_gpu(adreno_gpu); + struct a6xx_gmu *gmu = &a6xx_gpu->gmu; if (a6xx_gpu->sqe_bo) { msm_gem_unpin_iova(a6xx_gpu->sqe_bo, gpu->aspace); @@ -1697,9 +1698,11 @@ static void a6xx_destroy(struct msm_gpu *gpu) a6xx_llc_slices_destroy(a6xx_gpu); - mutex_lock(&a6xx_gpu->gmu.lock); - a6xx_gmu_remove(a6xx_gpu); - mutex_unlock(&a6xx_gpu->gmu.lock); + if (gmu->initialized) { + mutex_lock(&gmu->lock); + a6xx_gmu_remove(a6xx_gpu); + mutex_unlock(&gmu->lock); + } adreno_gpu_cleanup(adreno_gpu); -- 2.39.3
Re: [Freedreno] [PATCH v3 0/3] drm/msm/adreno: GPU support on SC8280XP
On Tue, May 30, 2023 at 08:09:42PM -0700, Bjorn Andersson wrote: > This series introduces support for A690 in the DRM/MSM driver and > enables it for the two SC8280XP laptops. > > Bjorn Andersson (3): > drm/msm/adreno: Add Adreno A690 support > arm64: dts: qcom: sc8280xp: Add GPU related nodes > arm64: dts: qcom: sc8280xp: Enable GPU related nodes Seems to work well (after applying the dependency mentioned in the dtsi patch): Tested-by: Johan Hovold Johan
Re: [Freedreno] [PATCH v3 2/3] arm64: dts: qcom: sc8280xp: Add GPU related nodes
On Tue, May 30, 2023 at 08:09:44PM -0700, Bjorn Andersson wrote: > From: Bjorn Andersson > > Add Adreno SMMU, GPU clock controller, GMU and GPU nodes for the > SC8280XP. > > Tested-by: Steev Klimaszewski > Signed-off-by: Bjorn Andersson > Signed-off-by: Bjorn Andersson > --- > > Changes since v2: > - Added missing opp level (both gpu and gmu) > - Corrected opp-level for highest gpu opp > - Added dma-coherent to gpu smmu > > Note that in order for the GPU driver to probe, the last change > requires: > https://lore.kernel.org/linux-arm-msm/20230410185226.3240336-1-dmitry.barysh...@linaro.org/ That's a pretty well-hidden notice about a critical dependency. I just spent the morning debugging why this series broke the probe of the GPU and only saw this when I was going to report my findings. Please consider putting information like this in the cover letter in the future. > Changes since v1: > - Dropped gmu_pdc_seq region from &gmu, as it shouldn't have been used. > - Added missing compatible to &adreno_smmu. > - Dropped aoss_qmp clock in &gmu and &adreno_smmu. Changelogs are also preferably placed in the cover letter so that you don't have to read through N patches to determine what changed from one revision of a series to the next. Johan
Re: [Freedreno] [PATCH] drm/msm/dp: add module parameter for PSR
On Wed, May 24, 2023 at 10:13:33AM -0700, Doug Anderson wrote: > On Wed, May 24, 2023 at 1:06 AM Dmitry Baryshkov > wrote: > > Originally this issue was reported by Doug, and at [1] he reported that > > an issue is fixed for him. So, for me it looks like we have hardware > > where VT works and hardware where it doesn't. > > As I understand it, the problem was originally reported by Bjorn over > IRC. When he reported the problem on VT2, Stephen Boyd confirmed that > he could see the same problem on our hardware when using VT2. I > confirmed I could reproduce and also tested the fix. I don't remember > if Bjorn ever tested the fix. I think many of us assumed that it was > the same issue so a fix for one person would also fix the other. Yeah, that sounds reasonable even if there apparently are some differences between mainline and what you run in ChromeOS here. > > Doug, can you please confirm whether you can reproduce the PSR+VT issue > > on 6.4-rc (without extra patches) or if the issue is fixed for you? > > > > [1] > > https://lore.kernel.org/dri-devel/CAD=FV=vshmqptsqfwjviezeerms-veotmfozejasuc9zsmj...@mail.gmail.com/ > > Ugh. Unfortunately, it's not easy for me to test this particular > feature directly on upstream Linux. :( I typically run with a ChromeOS > root filesystem, which works pretty well with mainline. One place > where it doesn't work with mainline is VT2. The VT2 feature for > Chromebooks is implemented with "frecon" and takes advantage of a > downstream patch that we've carried in the ChromeOS kernel trees for > years allowing handoff of who the DRM "master" is. > > For testing the fix previously, I confirmed that I could reproduce the > problem on our downstream kernel (which had the PSR patches picked) > and that the fixes worked for me in that context. > > Ah, but it shouldn't be too hard to pick that one patch. Let's see if > that works. I'll pick ("CHROMIUM: drm: Add drm_master_relax debugfs > file (non-root set/drop master ioctls)"). Indeed, it works! > > OK, so with the top of Linus's tree (v6.4-rc3-17-g9d646009f65d) plus > the CHROMIUM patch to enable my VT2, I can confirm that I don't see > the issue. I can switch to VT2 and typing works fine. I will say that > on herobrine the backlight is all messed up, but I assume that's an > unrelated issue. Interesting, as VTs are still broken in with rc4 on the X13s (and sc8280xp CRD). Does anyone have any ideas of why things break on mainline with these machines? Any patches or tests I can try? Dmitry, do you have an X13s now that you can reproduce this on? Johan
Re: [Freedreno] [PATCH] drm/msm/dp: add module parameter for PSR
On Wed, May 24, 2023 at 11:06:03AM +0300, Dmitry Baryshkov wrote: > On 24/05/2023 09:59, Johan Hovold wrote: > > Regressions happen and sometimes there are corner cases that are harder > > to find, but this is a breakage of a fundamental feature that was > > reported before the code was even merged into mainline. > > > >> We should have ideally gone with the modparam with the feature patches > >> itself knowing that it gets enabled for all sinks if PSR is supported. > > > > Modparams are things of the past should not be used to enable broken > > features so that some vendor can tick of their internal lists of > > features that have been "mainlined". > > We have had a history of using modparam with i915 and IIRC amdgpu / > radeon drivers to allow users to easily check whether new feature works > for their hardware. My current understanding is that PSR+VT works for on > some laptops and doesn't on some other laptops, which makes it a valid case. But here it does not seem to be the hardware that's the issue, but rather that the implementation is incorrect or incomplete. Johan
Re: [Freedreno] [PATCH] drm/msm/dp: add module parameter for PSR
On Tue, May 23, 2023 at 12:23:04PM -0700, Abhinav Kumar wrote: > On 5/23/2023 8:24 AM, Johan Hovold wrote: > > On Fri, May 12, 2023 at 09:13:04PM +0300, Dmitry Baryshkov wrote: > >> On 28/04/2023 02:28, Abhinav Kumar wrote: > >>> On sc7280 where eDP is the primary display, PSR is causing > >>> IGT breakage even for basic test cases like kms_atomic and > >>> kms_atomic_transition. Most often the issue starts with below > >>> stack so providing that as reference > >>> > >>> Call trace: > >>> ---[ end trace ]--- > >>> [drm-dp] dp_ctrl_push_idle: PUSH_IDLE pattern timedout > >>> > >>> Other basic use-cases still seem to work fine hence add a > >>> a module parameter to allow toggling psr enable/disable till > >>> PSR related issues are hashed out with IGT. > >> > >> For the reference: Bjorn reported that he has issues with VT on a > >> PSR-enabled laptops. This patch fixes the issue for him > > > > Module parameters are almost never warranted, and it is definitely not > > the right way to handle a broken implementation. > > > > I've just sent a revert that unconditionally disables PSR support until > > the implementation has been fixed: > > > > > > https://lore.kernel.org/lkml/20230523151646.28366-1-johan+lin...@kernel.org/ > > I dont completely agree with this. Even the virtual terminal case was > reported to be fixed by one user but not the other. So it was probably > something missed out either in validation or reproduction steps of the > user who reported it to be fixed OR the user who reported it not fixed. > That needs to be investigated now. Yes, there may still be some time left to fix it, but it's pretty damn annoying to find that an issue reported two months ago still is not fixed at 6.4-rc3. (I even waited to make the switch to 6.4 so that I would not have to spend time on this.) I didn't see any mail from Bjorn saying that the series that claimed to fix the VT issue actually did fix the VT issue. There's only the comment above from Dmitry suggesting that disabling this feature is the only way to get a working terminal back. Regressions happen and sometimes there are corner cases that are harder to find, but this is a breakage of a fundamental feature that was reported before the code was even merged into mainline. > We should have ideally gone with the modparam with the feature patches > itself knowing that it gets enabled for all sinks if PSR is supported. Modparams are things of the past should not be used to enable broken features so that some vendor can tick of their internal lists of features that have been "mainlined". You can carry that single patch out-of-tree to enable this if you need it for some particular use case where you don't care about VTs. But hopefully you can just get this sorted quickly. If not, the revert I posted is the way to go rather than adding random module parameters. Johan
Re: [Freedreno] [PATCH] drm/msm/dp: add module parameter for PSR
On Fri, May 12, 2023 at 09:13:04PM +0300, Dmitry Baryshkov wrote: > On 28/04/2023 02:28, Abhinav Kumar wrote: > > On sc7280 where eDP is the primary display, PSR is causing > > IGT breakage even for basic test cases like kms_atomic and > > kms_atomic_transition. Most often the issue starts with below > > stack so providing that as reference > > > > Call trace: > > dpu_encoder_assign_crtc+0x64/0x6c > > dpu_crtc_enable+0x188/0x204 > > drm_atomic_helper_commit_modeset_enables+0xc0/0x274 > > msm_atomic_commit_tail+0x1a8/0x68c > > commit_tail+0xb0/0x160 > > drm_atomic_helper_commit+0x11c/0x124 > > drm_atomic_commit+0xb0/0xdc > > drm_atomic_connector_commit_dpms+0xf4/0x110 > > drm_mode_obj_set_property_ioctl+0x16c/0x3b0 > > drm_connector_property_set_ioctl+0x4c/0x74 > > drm_ioctl_kernel+0xec/0x15c > > drm_ioctl+0x264/0x408 > > __arm64_sys_ioctl+0x9c/0xd4 > > invoke_syscall+0x4c/0x110 > > el0_svc_common+0x94/0xfc > > do_el0_svc+0x3c/0xb0 > > el0_svc+0x2c/0x7c > > el0t_64_sync_handler+0x48/0x114 > > el0t_64_sync+0x190/0x194 > > ---[ end trace ]--- > > [drm-dp] dp_ctrl_push_idle: PUSH_IDLE pattern timedout > > > > Other basic use-cases still seem to work fine hence add a > > a module parameter to allow toggling psr enable/disable till > > PSR related issues are hashed out with IGT. > > For the reference: Bjorn reported that he has issues with VT on a > PSR-enabled laptops. This patch fixes the issue for him Module parameters are almost never warranted, and it is definitely not the right way to handle a broken implementation. I've just sent a revert that unconditionally disables PSR support until the implementation has been fixed: https://lore.kernel.org/lkml/20230523151646.28366-1-johan+lin...@kernel.org/ Johan
[Freedreno] [PATCH] Revert "drm/msm/dp: set self refresh aware based on PSR support"
This reverts commit 1844e680d56bb0c4e0489138f2b7ba2dc1c988e3. PSR support clearly is not ready for mainline and specifically breaks virtual terminals which are no longer updated when PSR is enabled (e.g. no keyboard input is echoed, no cursor blink). Disable PSR support for now by reverting commit 1844e680d56b ("drm/msm/dp: set self refresh aware based on PSR support"). Cc: Vinod Polimera Cc: Dmitry Baryshkov Signed-off-by: Johan Hovold --- Bjorn reported that PSR support broke virtual terminals two months ago, but this is still broken in 6.4-rc3: https://lore.kernel.org/lkml/20230326162723.3lo6pnsfdwzsvbhj@ripper/ despite the following series that claimed to address this: https://lore.kernel.org/lkml/1680271114-1534-1-git-send-email-quic_vpoli...@quicinc.com Let's revert until this has been fixed properly. Johan drivers/gpu/drm/msm/dp/dp_drm.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/gpu/drm/msm/dp/dp_drm.c b/drivers/gpu/drm/msm/dp/dp_drm.c index 785d76639497..029e08c5bb06 100644 --- a/drivers/gpu/drm/msm/dp/dp_drm.c +++ b/drivers/gpu/drm/msm/dp/dp_drm.c @@ -117,8 +117,6 @@ static int edp_bridge_atomic_check(struct drm_bridge *drm_bridge, if (WARN_ON(!conn_state)) return -ENODEV; - conn_state->self_refresh_aware = dp->psr_supported; - if (!conn_state->crtc || !crtc_state) return 0; -- 2.39.3
Re: [Freedreno] [PATCH v2] drm/msm/adreno: adreno_gpu: Use suspend() instead of idle() on load error
On Fri, Mar 31, 2023 at 01:15:16AM +0200, Konrad Dybcio wrote: > The adreno_load_gpu() path is guarded by an error check on > adreno_load_fw(). This function is responsible for loading > Qualcomm-only-signed binaries (e.g. SQE and GMU FW for A6XX), but it > does not take the vendor-signed ZAP blob into account. > > By embedding the SQE (and GMU, if necessary) firmware into the > initrd/kernel, we can trigger and unfortunate path that would not bail > out early and proceed with gpu->hw_init(). That will fail, as the ZAP > loader path will not find the firmware and return back to > adreno_load_gpu(). > > This error path involves pm_runtime_put_sync() which then calls idle() > instead of suspend(). This is suboptimal, as it means that we're not > going through the clean shutdown sequence. With at least A619_holi, this > makes the GPU not wake up until it goes through at least one more > start-fail-stop cycle. The pm_runtime_put_sync that appears in the error > path actually does not guarantee that because of the earlier enabling of > runtime autosuspend. > > Fix that by using pm_runtime_put_sync_suspend to force a clean shutdown. > > Test cases: > 1. All firmware baked into kernel > 2. error loading ZAP fw in initrd -> load from rootfs at DE start > > Both succeed on A619_holi (SM6375) and A630 (SDM845). > > Fixes: 0d997f95b70f ("drm/msm/adreno: fix runtime PM imbalance at gpu load") As this one is marked for stable, you also need: Cc: sta...@vger.kernel.org # 6.0 > Signed-off-by: Konrad Dybcio Reviewed-by: Johan Hovold > --- > v1 -> v2: > - Improve the commit message and the reasoning within > > drivers/gpu/drm/msm/adreno/adreno_device.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c > b/drivers/gpu/drm/msm/adreno/adreno_device.c > index f61896629be6..59f3302e8167 100644 > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c > @@ -477,7 +477,7 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev) > return gpu; > > err_put_rpm: > - pm_runtime_put_sync(&pdev->dev); > + pm_runtime_put_sync_suspend(&pdev->dev); > err_disable_rpm: > pm_runtime_disable(&pdev->dev); Johan
Re: [Freedreno] [PATCH] drm/msm/adreno: adreno_gpu: Use suspend() instead of idle() on load error
On Wed, Mar 29, 2023 at 10:45:52PM +0300, Dmitry Baryshkov wrote: > On Wed, 29 Mar 2023 at 18:48, Konrad Dybcio wrote: > > On 29.03.2023 16:37, Johan Hovold wrote: > > > On Wed, Mar 29, 2023 at 04:04:44PM +0200, Konrad Dybcio wrote: > > >> If we fail to initialize the GPU for whatever reason (say we don't > > >> embed the GPU firmware files in the initrd), the error path involves > > >> pm_runtime_put_sync() which then calls idle() instead of suspend(). > > >> > > >> This is suboptimal, as it means that we're not going through the > > >> clean shutdown sequence. With at least A619_holi, this makes the GPU > > >> not wake up until it goes through at least one more start-fail-stop > > >> cycle. Fix that by using pm_runtime_put_sync_suspend to force a clean > > >> shutdown. > > > > > > This does not sound right. If pm_runtime_put_sync() fails to suspend the > > > device when the usage count drops to zero, then you have a bug somewhere > > > else. > > I was surprised to see that it was not called as well, but I wasn't able > > to track it down before.. > > Could you please check that it's autosuspend who kicks in? In other > words, if we disable autosuspend, the pm_runtime_put_sync is enough()? Yes, that's it. The runtime PM implementation changed at one point and since you need to disable autosuspend first to actually get synchronous behaviour. My bad. > That would probably mean that we lack some kind of reset in the hw_init path. > > On the other hand, I do not know how the device will react to the > error-in-the-middle state. Modems for example, can enter the state > where you can not properly turn it off once it starts the boot > process. > > And if we remember the efforts that Akhil has put into making sure > that the GPU is properly reset in case of an _error_, it might be > nearly impossible to shut it down in a proper way. > > Thus said, I think that unless there is an obvious way to restart the > init process, Korad's pm_runtime_put_sync_suspend() looks like a > correct fix to me. I'd prefer to fix this by disabling autosuspend, but as that would involve also moving the call to enable autosuspend to this function (and add the missing disable on unbind), Konrad's patch using pm_runtime_put_sync_suspend() is probably the best option for now. I can send a patch to move the autosuspend handling on top. Perhaps you can just amend the commit message to clarify that not all fw is apparently preloaded and also mention that you need to use pm_runtime_put_sync_suspend() due to autosuspend being enabled. Johan
Re: [Freedreno] [PATCH] drm/msm/adreno: adreno_gpu: Use suspend() instead of idle() on load error
On Wed, Mar 29, 2023 at 04:04:44PM +0200, Konrad Dybcio wrote: > If we fail to initialize the GPU for whatever reason (say we don't > embed the GPU firmware files in the initrd), the error path involves > pm_runtime_put_sync() which then calls idle() instead of suspend(). > > This is suboptimal, as it means that we're not going through the > clean shutdown sequence. With at least A619_holi, this makes the GPU > not wake up until it goes through at least one more start-fail-stop > cycle. Fix that by using pm_runtime_put_sync_suspend to force a clean > shutdown. This does not sound right. If pm_runtime_put_sync() fails to suspend the device when the usage count drops to zero, then you have a bug somewhere else. Also since commit 2c087a336676 ("drm/msm/adreno: Load the firmware before bringing up the hardware") the firmware is loaded before even hitting these paths so the above description does not sound right in that respect either (or is missing some details). > Test cases: > 1. firmware baked into kernel > 2. error loading fw in initrd -> load from rootfs at DE start > > Both succeed on A619_holi (SM6375) and A630 (SDM845). > > Fixes: 0d997f95b70f ("drm/msm/adreno: fix runtime PM imbalance at gpu load") > Signed-off-by: Konrad Dybcio > --- > drivers/gpu/drm/msm/adreno/adreno_device.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c > b/drivers/gpu/drm/msm/adreno/adreno_device.c > index f61896629be6..59f3302e8167 100644 > --- a/drivers/gpu/drm/msm/adreno/adreno_device.c > +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c > @@ -477,7 +477,7 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev) > return gpu; > > err_put_rpm: > - pm_runtime_put_sync(&pdev->dev); > + pm_runtime_put_sync_suspend(&pdev->dev); > err_disable_rpm: > pm_runtime_disable(&pdev->dev); Johan
Re: [Freedreno] [PATCH 00/10] drm/msm: fix bind error handling
On Tue, Mar 21, 2023 at 05:21:56PM +0200, Dmitry Baryshkov wrote: > On 21/03/2023 15:02, Johan Hovold wrote: > > On Mon, Mar 06, 2023 at 11:07:12AM +0100, Johan Hovold wrote: > >> I had reasons to look closer at the MSM DRM driver error handling and > >> realised that it had suffered from a fair amount of bit rot over the > >> years. > >> > >> Unfortunately, I started fixing this in my 6.2 branch and failed to > >> notice two partial and, as it turned out, broken attempts to address > >> this that are now in 6.3-rc1. > >> > >> Instead of trying to salvage this incrementally, I'm reverting the two > >> broken commits so that clean and backportable fixes can be added in > >> their place. > >> > >> Included are also two related cleanups. > > > > Any further comments to these patches (except for 9/10, which should be > > dropped)? > > > > As the patches being reverted here were first added in 6.3-rc1 there is > > still time to get this into 6.3-rc (e.g. before AUTOSEL starts trying to > > backport them). > > I will take a look at the patches. Additional question, as you have been > looking into this area. We have plenty of code which is only called > under the `if (kms)` condition. Could you hopefully move these parts to > separate functions, so that the error handling is also simpler? If not, > I'll put this to my todo list, but it might take some time before I have > time for that. There's definitely room for cleaning up the bind/unbind paths further, but for this series I focus on correctness while maintaining symmetry (e.g. if an allocation was done under if (kms), then the release should be done under the same). I don't think I will have time to look at this further for a few weeks either, but I'll add it to my list of future work as well and I'll check in with you before actually working on it. Johan
Re: [Freedreno] [PATCH 05/10] drm/msm: fix drm device leak on bind errors
On Tue, Mar 21, 2023 at 04:54:51PM +0200, Dmitry Baryshkov wrote: > On 06/03/2023 12:07, Johan Hovold wrote: > > Make sure to free the DRM device also in case of early errors during > > bind(). > > > > Fixes: 2027e5b3413d ("drm/msm: Initialize MDSS irq domain at probe time") > > Cc: sta...@vger.kernel.org # 5.17 > > Cc: Dmitry Baryshkov > > Signed-off-by: Johan Hovold > > Can we migrate to devm_drm_dev_alloc instead() ? Will it make code > simpler and/or easier to handle? I'm just fixing the bugs here. Cleanups/rework like that can be done on top but should not be backported as it risks introducing new issues. Johan
Re: [Freedreno] [PATCH 00/10] drm/msm: fix bind error handling
On Mon, Mar 06, 2023 at 11:07:12AM +0100, Johan Hovold wrote: > I had reasons to look closer at the MSM DRM driver error handling and > realised that it had suffered from a fair amount of bit rot over the > years. > > Unfortunately, I started fixing this in my 6.2 branch and failed to > notice two partial and, as it turned out, broken attempts to address > this that are now in 6.3-rc1. > > Instead of trying to salvage this incrementally, I'm reverting the two > broken commits so that clean and backportable fixes can be added in > their place. > > Included are also two related cleanups. Any further comments to these patches (except for 9/10, which should be dropped)? As the patches being reverted here were first added in 6.3-rc1 there is still time to get this into 6.3-rc (e.g. before AUTOSEL starts trying to backport them). Johan > Johan Hovold (10): > Revert "drm/msm: Add missing check and destroy for > alloc_ordered_workqueue" > Revert "drm/msm: Fix failure paths in msm_drm_init()" > drm/msm: fix NULL-deref on snapshot tear down > drm/msm: fix NULL-deref on irq uninstall > drm/msm: fix drm device leak on bind errors > drm/msm: fix vram leak on bind errors > drm/msm: fix missing wq allocation error handling > drm/msm: fix workqueue leak on bind errors > drm/msm: use drmm_mode_config_init() > drm/msm: move include directive > > drivers/gpu/drm/msm/disp/msm_disp_snapshot.c | 3 - > drivers/gpu/drm/msm/msm_drv.c| 67 +--- > 2 files changed, 44 insertions(+), 26 deletions(-)
Re: [Freedreno] [PATCH v2 0/4] drm/msm/adreno: fix runtime PM imbalance at unbind
On Fri, Mar 03, 2023 at 05:48:03PM +0100, Johan Hovold wrote: > As reported by Bjorn, we can end up with an unbalanced runtime PM > disable count if unbind() is called before the DRM device is opened > (e.g. if component bind fails due to the panel driver not having been > loaded yet). > > As runtime PM must currently stay disabled until the firmware has been > loaded, fix this by making the runtime PM disable call at unbind() > conditional. > > The rest of the series fixes further imbalances in the load_gpu() error > paths and removes a bogus pm_runtime_set_active() call. Included is also > a related indentation cleanup. I noticed that Rob picked up the first patch below from v1 of this series. Any comments to the remaining three? Johan > Changes in v2 > - fix the runtime PM imbalance in the gpu load error paths (new) > > - drop the patch removing the pm_runtime_disable() from >adreno_gpu_cleanup() as this function can currently still be called >with runtime PM enabled if suspending the scheduler in > adreno_system_suspend() at unbind fails > > > Johan Hovold (4): > drm/msm/adreno: fix runtime PM imbalance at unbind > drm/msm/adreno: fix runtime PM imbalance at gpu load > drm/msm/adreno: drop bogus pm_runtime_set_active() > drm/msm/adreno: clean up component ops indentation > > drivers/gpu/drm/msm/adreno/adreno_device.c | 26 +- > 1 file changed, 16 insertions(+), 10 deletions(-)
[Freedreno] Adreno devfreq lockdep splat with 6.3-rc2
Hi Rob, Since 6.3-rc2 (or possibly -rc1), I'm now seeing the below devfreq-related lockdep splat. I noticed that you posted a fix for something similar here: https://lore.kernel.org/r/20230312204150.1353517-9-robdcl...@gmail.com but that particular patch makes no difference. >From skimming the calltraces below and qos/devfreq related changes in 6.3-rc1 it seems like this could be related to: fadcc3ab1302 ("drm/msm/gpu: Bypass PM QoS constraint for idle clamp") Johan [ 35.389822] == [ 35.389824] WARNING: possible circular locking dependency detected [ 35.389826] 6.3.0-rc2 #208 Not tainted [ 35.389828] -- [ 35.389829] ring0/348 is trying to acquire lock: [ 35.389830] 43424cfa2078 (&devfreq->lock){+.+.}-{3:3}, at: qos_min_notifier_call+0x28/0x88 [ 35.389845] but task is already holding lock: [ 35.389846] 4342432b78e8 (&(c->notifiers)->rwsem){}-{3:3}, at: blocking_notifier_call_chain+0x34/0xa0 [ 35.389855] which lock already depends on the new lock. [ 35.389856] the existing dependency chain (in reverse order) is: [ 35.389857] -> #4 (&(c->notifiers)->rwsem){}-{3:3}: [ 35.389861]lock_acquire+0x68/0x84 [ 35.389865]down_write+0x58/0xfc [ 35.389869]blocking_notifier_chain_register+0x30/0x8c [ 35.389872]freq_qos_add_notifier+0x68/0x7c [ 35.389876]dev_pm_qos_add_notifier+0xe8/0x114 [ 35.389881]devfreq_add_device.part.0+0x360/0x5a4 [ 35.389884]devm_devfreq_add_device+0x74/0xe0 [ 35.389886]msm_devfreq_init+0xa0/0x154 [msm] [ 35.389915]msm_gpu_init+0x320/0x5b0 [msm] [ 35.389933]adreno_gpu_init+0x164/0x2d8 [msm] [ 35.389951]a6xx_gpu_init+0x270/0x608 [msm] [ 35.389968]adreno_bind+0x184/0x284 [msm] [ 35.389983]component_bind_all+0x124/0x288 [ 35.389989]msm_drm_bind+0x1d8/0x6a8 [msm] [ 35.390004]try_to_bring_up_aggregate_device+0x1ec/0x2f4 [ 35.390007]__component_add+0xa8/0x194 [ 35.390010]component_add+0x14/0x20 [ 35.390012]dp_display_probe+0x2b4/0x474 [msm] [ 35.390029]platform_probe+0x68/0xd8 [ 35.390031]really_probe+0x184/0x3c8 [ 35.390034]__driver_probe_device+0x7c/0x188 [ 35.390036]driver_probe_device+0x3c/0x110 [ 35.390039]__device_attach_driver+0xbc/0x158 [ 35.390041]bus_for_each_drv+0x84/0xe0 [ 35.390044]__device_attach+0xa8/0x1d4 [ 35.390046]device_initial_probe+0x14/0x20 [ 35.390049]bus_probe_device+0xac/0xb0 [ 35.390051]deferred_probe_work_func+0xa0/0xf4 [ 35.390053]process_one_work+0x288/0x6c4 [ 35.390056]worker_thread+0x74/0x450 [ 35.390058]kthread+0x118/0x11c [ 35.390060]ret_from_fork+0x10/0x20 [ 35.390063] -> #3 (dev_pm_qos_mtx){+.+.}-{3:3}: [ 35.390066]lock_acquire+0x68/0x84 [ 35.390068]__mutex_lock+0x98/0x428 [ 35.390072]mutex_lock_nested+0x2c/0x38 [ 35.390074]dev_pm_qos_remove_notifier+0x34/0x140 [ 35.390077]genpd_remove_device+0x3c/0x174 [ 35.390081]genpd_dev_pm_detach+0x78/0x1b4 [ 35.390083]dev_pm_domain_detach+0x24/0x34 [ 35.390085]a6xx_gmu_remove+0x64/0xd0 [msm] [ 35.390101]a6xx_destroy+0xa8/0x138 [msm] [ 35.390116]adreno_unbind+0x40/0x64 [msm] [ 35.390131]component_unbind+0x38/0x6c [ 35.390134]component_unbind_all+0xc8/0xd4 [ 35.390136]msm_drm_uninit.isra.0+0x168/0x1dc [msm] [ 35.390152]msm_drm_bind+0x2f4/0x6a8 [msm] [ 35.390167]try_to_bring_up_aggregate_device+0x1ec/0x2f4 [ 35.390170]__component_add+0xa8/0x194 [ 35.390172]component_add+0x14/0x20 [ 35.390175]dp_display_probe+0x2b4/0x474 [msm] [ 35.390190]platform_probe+0x68/0xd8 [ 35.390192]really_probe+0x184/0x3c8 [ 35.390194]__driver_probe_device+0x7c/0x188 [ 35.390197]driver_probe_device+0x3c/0x110 [ 35.390199]__device_attach_driver+0xbc/0x158 [ 35.390201]bus_for_each_drv+0x84/0xe0 [ 35.390203]__device_attach+0xa8/0x1d4 [ 35.390206]device_initial_probe+0x14/0x20 [ 35.390208]bus_probe_device+0xac/0xb0 [ 35.390210]deferred_probe_work_func+0xa0/0xf4 [ 35.390212]process_one_work+0x288/0x6c4 [ 35.390214]worker_thread+0x74/0x450 [ 35.390216]kthread+0x118/0x11c [ 35.390217]ret_from_fork+0x10/0x20 [ 35.390219] -> #2 (&gmu->lock){+.+.}-{3:3}: [ 35.390222]lock_acquire+0x68/0x84 [ 35.390224]__mutex_lock+0x98/0x428 [ 35.390226]mutex_lock_nested+0x2c/0x38 [ 35.390229]a6xx_gpu_set_freq+0x30/0x5c [msm] [ 35.390245]msm_d
Re: [Freedreno] [PATCH v3 5/7] drm/msm/hdmi: stop using drm_bridge_connector_en/disable_hpd()
On Wed, Nov 02, 2022 at 09:07:03PM +0300, Dmitry Baryshkov wrote: > The functionality of drm_bridge_connector_enable_hpd() and > drm_bridge_connector_disable_hpd() is provided automatically by the > drm_kms_poll helpers. Stop calling these functions manually. I stumbled over this one when investigating a hotplug-related crash in the MSM DRM driver which this series prevents by moving hotplug notification enable to drm_kms_helper_poll_enable(). > Signed-off-by: Dmitry Baryshkov > --- > drivers/gpu/drm/msm/hdmi/hdmi.c | 2 -- > 1 file changed, 2 deletions(-) > > diff --git a/drivers/gpu/drm/msm/hdmi/hdmi.c b/drivers/gpu/drm/msm/hdmi/hdmi.c > index 93fe61b86967..a540c45d4fd3 100644 > --- a/drivers/gpu/drm/msm/hdmi/hdmi.c > +++ b/drivers/gpu/drm/msm/hdmi/hdmi.c > @@ -348,8 +348,6 @@ int msm_hdmi_modeset_init(struct hdmi *hdmi, > goto fail; > } > > - drm_bridge_connector_enable_hpd(hdmi->connector); > - > ret = msm_hdmi_hpd_enable(hdmi->bridge); > if (ret < 0) { > DRM_DEV_ERROR(&hdmi->pdev->dev, "failed to enable HPD: %d\n", > ret); It looks like you are now enabling hotplug events before the DRM driver is ready to receive them (i.e. msm_hdmi_hpd_enable() is called before drm_bridge_connector_enable_hpd()). Could this not lead to missed events or is the state being setup correctly somewhere else? Shouldn't the call to msm_hdmi_hpd_enable() be moved to when HPD is enabled either way (e.g. by being converted to a hpd_enable callback)? Johan
Re: [Freedreno] [PATCH] drm/msm: Initialize mode_config earlier
On Thu, Mar 02, 2023 at 03:17:04PM -0800, Bjorn Andersson wrote: > On Wed, Mar 01, 2023 at 02:58:50PM +0100, Johan Hovold wrote: > > So after debugging this issue a third time, I can conclude that it is > > still very much present in 6.2. > > > > It appears you looked at the linux-next tree when you concluded that > > this patch was not needed. In 6.2 the bridge->hpd_cb callback is set > > before mode_config.funcs is initialised as part of > > kms->funcs->hw_init(kms). > > > > The hpd DRM changes heading into 6.3 do appear to avoid the NULL-pointer > > dereference by moving the bridge->hpd_cb initialisation to > > drm_kms_helper_poll_init() as you mention above. I can confirm that as expected my reproducer no longer triggers with 6.3-rc1. > > The PMIC GLINK altmode driver still happily forwards notifications > > regardless of the DRM driver state though, which can lead to missed > > hotplug events. It seems you need to implement the > > hpd_enable()/disable() callbacks and either cache or not enable events > > in fw until the DRM driver is ready. > > > > It's not clear to me what the expectation from the DRM framework is on > this point. We register a drm_bridge which is only capable of signaling > HPD events (DRM_BRIDGE_OP_HPD), not querying HPD state (DRM_BRIDGE_OP_DETECT). I think the assumption is that any bridge that can generate hotplug events also has a way of detecting whether it is connected (i.e. DRM_BRIDGE_OP_HPD => DRM_BRIDGE_OP_DETECT). The pmic_glink_altmode driver appears to be the only driver that sets DRM_BRIDGE_OP_HPD but not DRM_BRIDGE_OP_DETECT. > Does this imply that any such bridge must ensure that hpd events are > re-delivered once hpd_enable() has been invoked (we can't invoke it from > hpd_enable...)? > > Is it reasonable to do this retriggering in the altmode driver? Or is it > the job of the TCPM (it seems reasonable to not send the PAN_EN message > until we get hpd_enable()...)? Are you sure there is no way to query the firmware about the connected state? Otherwise, enabling the notification messages when hpd_enable() is called looks like it should work as the fw currently appears to always send a disconnected event followed by a connect event if connected. But that's not going to be enough unless you can also disable events in fw on hpd_disable() so that the state can again be updated on the next hpd_enable(). If that's not possible, it seems you need to cache the state in the driver and hope you get a notification after a suspend cycle if the state has changed. But in any case, the DRM documentation is pretty clear on that a bridge driver should not be calling drm_bridge_hpd_notify() until hpd_enable() is called (and also not after hpd_disable()) as the pmic_glink_altmode driver currently do. hpd_enable Enable hot plug detection. From now on the bridge shall call drm_bridge_hpd_notify() each time a change is detected in the output connection status, until hot plug detection gets disabled with hpd_disable. This callback is optional and shall only be implemented by bridges that support hot-plug notification without polling. Bridges that implement it shall also implement the hpd_disable callback and set the DRM_BRIDGE_OP_HPD flag in their drm_bridge->ops. https://dri.freedesktop.org/docs/drm/gpu/drm-kms-helpers.html#c.drm_bridge_funcs Johan
Re: [Freedreno] [PATCH 01/10] Revert "drm/msm: Add missing check and destroy for alloc_ordered_workqueue"
On Wed, Mar 08, 2023 at 10:10:24AM +0800, Jiasheng Jiang wrote: > On Mon, 06 Mar 2023 18:07:13 +0800, Johan Hovold wrote: > > This reverts commit 643b7d0869cc7f1f7a5ac7ca6bd25d88f54e31d0. > > The commit not only adds the allocation sanity check, but also adds the > destroy_workqueue to release the allocated priv->wq. > Therefore, revert the commit will cause memory leak. No, reverting this commit does not cause any memory leaks (look at at diff again). The original patch called msm_drm_uninit() in some early error paths, but that was just completely broken as that function must not be called before the subcomponents have been bound and also triggered a bunch of other NULL-pointer dereferences. That bit was however removed when the change was merged with a second branch that also touched these error paths. In the end, the leaked wq is still there and this commit only added broken error handling for the wq allocation failing (as it does not free the drm device). > > A recent patch that tried to fix up the msm_drm_init() paths with > > respect to the workqueue but only ended up making things worse: > > > > First, the newly added calls to msm_drm_uninit() on early errors would > > trigger NULL-pointer dereferences, for example, as the kms pointer would > > not have been initialised. (Note that these paths were also modified by > > a second broken error handling patch which in effect cancelled out this > > part when merged.) > > There is a check for the kms pointer to avoid NULL-pointer dereference in > the msm_drm_uninit(). No, there were further places in msm_drm_uninit() which did not have any such checks when you submitted your patch. Some of the missing checks were added by a separate patch, but that would not in itself have been sufficient as with your patch you'd still end up trying to unbind the subcomponents before they are bound, which will lead to further crashes. > > Second, the newly added allocation sanity check would still leak the > > previously allocated drm device. > > The ddev is allocated by drm_dev_alloc which support automatic cleanup. We don't have automatic garbage collection in the kernel. You still need to release the reference to the device for it to be freed. Johan
Re: [Freedreno] [PATCH 09/10] drm/msm: use drmm_mode_config_init()
On Mon, Mar 06, 2023 at 02:38:37PM +0200, Dmitry Baryshkov wrote: > On 06/03/2023 12:07, Johan Hovold wrote: > > Switch to using drmm_mode_config_init() so that the mode config is > > released when the last reference to the DRM device is dropped rather > > than unconditionally at unbind() (which may be too soon). > > This also means that drm_bridge_detach() might be called at some point > after unbind(), which might be too late. Indeed. Please disregard this patch. It's not needed to fix the bind error paths anyway. Johan
[Freedreno] [PATCH 07/10] drm/msm: fix missing wq allocation error handling
Add the missing sanity check to handle workqueue allocation failures. Fixes: c8afe684c95c ("drm/msm: basic KMS driver for snapdragon") Cc: sta...@vger.kernel.org # 3.12 Cc: Rob Clark Signed-off-by: Johan Hovold --- drivers/gpu/drm/msm/msm_drv.c | 4 1 file changed, 4 insertions(+) diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c index 41cc6cd690cd..ac3b77dbfacc 100644 --- a/drivers/gpu/drm/msm/msm_drv.c +++ b/drivers/gpu/drm/msm/msm_drv.c @@ -432,6 +432,10 @@ static int msm_drm_init(struct device *dev, const struct drm_driver *drv) priv->dev = ddev; priv->wq = alloc_ordered_workqueue("msm", 0); + if (!priv->wq) { + ret = -ENOMEM; + goto err_put_dev; + } INIT_LIST_HEAD(&priv->objects); mutex_init(&priv->obj_lock); -- 2.39.2
[Freedreno] [PATCH 10/10] drm/msm: move include directive
Move the include of of_address.h to the top of the file where it belongs. Signed-off-by: Johan Hovold --- drivers/gpu/drm/msm/msm_drv.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c index ade17947d1e5..42ae7575622b 100644 --- a/drivers/gpu/drm/msm/msm_drv.c +++ b/drivers/gpu/drm/msm/msm_drv.c @@ -8,6 +8,7 @@ #include #include #include +#include #include #include #include @@ -272,8 +273,6 @@ static int msm_drm_uninit(struct device *dev) return 0; } -#include - struct msm_gem_address_space *msm_kms_init_aspace(struct drm_device *dev) { struct msm_gem_address_space *aspace; -- 2.39.2
[Freedreno] [PATCH 01/10] Revert "drm/msm: Add missing check and destroy for alloc_ordered_workqueue"
This reverts commit 643b7d0869cc7f1f7a5ac7ca6bd25d88f54e31d0. A recent patch that tried to fix up the msm_drm_init() paths with respect to the workqueue but only ended up making things worse: First, the newly added calls to msm_drm_uninit() on early errors would trigger NULL-pointer dereferences, for example, as the kms pointer would not have been initialised. (Note that these paths were also modified by a second broken error handling patch which in effect cancelled out this part when merged.) Second, the newly added allocation sanity check would still leak the previously allocated drm device. Instead of trying to salvage what was badly broken (and clearly not tested), let's revert the bad commit so that clean and backportable fixes can be added in its place. Fixes: 643b7d0869cc ("drm/msm: Add missing check and destroy for alloc_ordered_workqueue") Cc: Jiasheng Jiang Signed-off-by: Johan Hovold --- drivers/gpu/drm/msm/msm_drv.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c index aca48c868c14..b7f5a78eadd4 100644 --- a/drivers/gpu/drm/msm/msm_drv.c +++ b/drivers/gpu/drm/msm/msm_drv.c @@ -420,8 +420,6 @@ static int msm_drm_init(struct device *dev, const struct drm_driver *drv) priv->dev = ddev; priv->wq = alloc_ordered_workqueue("msm", 0); - if (!priv->wq) - return -ENOMEM; INIT_LIST_HEAD(&priv->objects); mutex_init(&priv->obj_lock); -- 2.39.2
[Freedreno] [PATCH 04/10] drm/msm: fix NULL-deref on irq uninstall
In case of early initialisation errors and on platforms that do not use the DPU controller, the deinitilisation code can be called with the kms pointer set to NULL. Fixes: f026e431cf86 ("drm/msm: Convert to Linux IRQ interfaces") Cc: sta...@vger.kernel.org # 5.14 Cc: Thomas Zimmermann Signed-off-by: Johan Hovold --- drivers/gpu/drm/msm/msm_drv.c | 8 +--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c index 17a59d73fe01..2f2bcdb671d2 100644 --- a/drivers/gpu/drm/msm/msm_drv.c +++ b/drivers/gpu/drm/msm/msm_drv.c @@ -251,9 +251,11 @@ static int msm_drm_uninit(struct device *dev) drm_bridge_remove(priv->bridges[i]); priv->num_bridges = 0; - pm_runtime_get_sync(dev); - msm_irq_uninstall(ddev); - pm_runtime_put_sync(dev); + if (kms) { + pm_runtime_get_sync(dev); + msm_irq_uninstall(ddev); + pm_runtime_put_sync(dev); + } if (kms && kms->funcs) kms->funcs->destroy(kms); -- 2.39.2
[Freedreno] [PATCH 00/10] drm/msm: fix bind error handling
I had reasons to look closer at the MSM DRM driver error handling and realised that it had suffered from a fair amount of bit rot over the years. Unfortunately, I started fixing this in my 6.2 branch and failed to notice two partial and, as it turned out, broken attempts to address this that are now in 6.3-rc1. Instead of trying to salvage this incrementally, I'm reverting the two broken commits so that clean and backportable fixes can be added in their place. Included are also two related cleanups. Johan Johan Hovold (10): Revert "drm/msm: Add missing check and destroy for alloc_ordered_workqueue" Revert "drm/msm: Fix failure paths in msm_drm_init()" drm/msm: fix NULL-deref on snapshot tear down drm/msm: fix NULL-deref on irq uninstall drm/msm: fix drm device leak on bind errors drm/msm: fix vram leak on bind errors drm/msm: fix missing wq allocation error handling drm/msm: fix workqueue leak on bind errors drm/msm: use drmm_mode_config_init() drm/msm: move include directive drivers/gpu/drm/msm/disp/msm_disp_snapshot.c | 3 - drivers/gpu/drm/msm/msm_drv.c| 67 +--- 2 files changed, 44 insertions(+), 26 deletions(-) -- 2.39.2
[Freedreno] [PATCH 03/10] drm/msm: fix NULL-deref on snapshot tear down
In case of early initialisation errors and on platforms that do not use the DPU controller, the deinitilisation code can be called with the kms pointer set to NULL. Fixes: 98659487b845 ("drm/msm: add support to take dpu snapshot") Cc: sta...@vger.kernel.org # 5.14 Cc: Abhinav Kumar Signed-off-by: Johan Hovold --- drivers/gpu/drm/msm/msm_drv.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c index 9ded384acba4..17a59d73fe01 100644 --- a/drivers/gpu/drm/msm/msm_drv.c +++ b/drivers/gpu/drm/msm/msm_drv.c @@ -242,7 +242,8 @@ static int msm_drm_uninit(struct device *dev) msm_fbdev_free(ddev); #endif - msm_disp_snapshot_destroy(ddev); + if (kms) + msm_disp_snapshot_destroy(ddev); drm_mode_config_cleanup(ddev); -- 2.39.2
[Freedreno] [PATCH 05/10] drm/msm: fix drm device leak on bind errors
Make sure to free the DRM device also in case of early errors during bind(). Fixes: 2027e5b3413d ("drm/msm: Initialize MDSS irq domain at probe time") Cc: sta...@vger.kernel.org # 5.17 Cc: Dmitry Baryshkov Signed-off-by: Johan Hovold --- drivers/gpu/drm/msm/msm_drv.c | 10 -- 1 file changed, 8 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c index 2f2bcdb671d2..89634159ad75 100644 --- a/drivers/gpu/drm/msm/msm_drv.c +++ b/drivers/gpu/drm/msm/msm_drv.c @@ -444,12 +444,12 @@ static int msm_drm_init(struct device *dev, const struct drm_driver *drv) ret = msm_init_vram(ddev); if (ret) - return ret; + goto err_put_dev; /* Bind all our sub-components: */ ret = component_bind_all(dev, ddev); if (ret) - return ret; + goto err_put_dev; dma_set_max_seg_size(dev, UINT_MAX); @@ -544,6 +544,12 @@ static int msm_drm_init(struct device *dev, const struct drm_driver *drv) err_msm_uninit: msm_drm_uninit(dev); + + return ret; + +err_put_dev: + drm_dev_put(ddev); + return ret; } -- 2.39.2
[Freedreno] [PATCH 09/10] drm/msm: use drmm_mode_config_init()
Switch to using drmm_mode_config_init() so that the mode config is released when the last reference to the DRM device is dropped rather than unconditionally at unbind() (which may be too soon). Signed-off-by: Johan Hovold --- drivers/gpu/drm/msm/msm_drv.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c index 73c597565f99..ade17947d1e5 100644 --- a/drivers/gpu/drm/msm/msm_drv.c +++ b/drivers/gpu/drm/msm/msm_drv.c @@ -247,8 +247,6 @@ static int msm_drm_uninit(struct device *dev) if (kms) msm_disp_snapshot_destroy(ddev); - drm_mode_config_cleanup(ddev); - for (i = 0; i < priv->num_bridges; i++) drm_bridge_remove(priv->bridges[i]); priv->num_bridges = 0; @@ -454,11 +452,13 @@ static int msm_drm_init(struct device *dev, const struct drm_driver *drv) might_lock(&priv->lru.lock); fs_reclaim_release(GFP_KERNEL); - drm_mode_config_init(ddev); + ret = drmm_mode_config_init(ddev); + if (ret) + goto err_destroy_wq; ret = msm_init_vram(ddev); if (ret) - goto err_cleanup_mode_config; + goto err_destroy_wq; /* Bind all our sub-components: */ ret = component_bind_all(dev, ddev); @@ -563,8 +563,7 @@ static int msm_drm_init(struct device *dev, const struct drm_driver *drv) err_deinit_vram: msm_deinit_vram(ddev); -err_cleanup_mode_config: - drm_mode_config_cleanup(ddev); +err_destroy_wq: destroy_workqueue(priv->wq); err_put_dev: drm_dev_put(ddev); -- 2.39.2
[Freedreno] [PATCH 06/10] drm/msm: fix vram leak on bind errors
Make sure to release the VRAM buffer also in a case a subcomponent fails to bind. Fixes: d863f0c7b536 ("drm/msm: Call msm_init_vram before binding the gpu") Cc: sta...@vger.kernel.org # 5.11 Cc: Craig Tatlor Signed-off-by: Johan Hovold --- drivers/gpu/drm/msm/msm_drv.c | 26 +++--- 1 file changed, 19 insertions(+), 7 deletions(-) diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c index 89634159ad75..41cc6cd690cd 100644 --- a/drivers/gpu/drm/msm/msm_drv.c +++ b/drivers/gpu/drm/msm/msm_drv.c @@ -51,6 +51,8 @@ #define MSM_VERSION_MINOR 10 #define MSM_VERSION_PATCHLEVEL 0 +static void msm_deinit_vram(struct drm_device *ddev); + static const struct drm_mode_config_funcs mode_config_funcs = { .fb_create = msm_framebuffer_create, .output_poll_changed = drm_fb_helper_output_poll_changed, @@ -260,12 +262,7 @@ static int msm_drm_uninit(struct device *dev) if (kms && kms->funcs) kms->funcs->destroy(kms); - if (priv->vram.paddr) { - unsigned long attrs = DMA_ATTR_NO_KERNEL_MAPPING; - drm_mm_takedown(&priv->vram.mm); - dma_free_attrs(dev, priv->vram.size, NULL, - priv->vram.paddr, attrs); - } + msm_deinit_vram(ddev); component_unbind_all(dev, ddev); @@ -403,6 +400,19 @@ static int msm_init_vram(struct drm_device *dev) return ret; } +static void msm_deinit_vram(struct drm_device *ddev) +{ + struct msm_drm_private *priv = ddev->dev_private; + unsigned long attrs = DMA_ATTR_NO_KERNEL_MAPPING; + + if (!priv->vram.paddr) + return; + + drm_mm_takedown(&priv->vram.mm); + dma_free_attrs(ddev->dev, priv->vram.size, NULL, priv->vram.paddr, + attrs); +} + static int msm_drm_init(struct device *dev, const struct drm_driver *drv) { struct msm_drm_private *priv = dev_get_drvdata(dev); @@ -449,7 +459,7 @@ static int msm_drm_init(struct device *dev, const struct drm_driver *drv) /* Bind all our sub-components: */ ret = component_bind_all(dev, ddev); if (ret) - goto err_put_dev; + goto err_deinit_vram; dma_set_max_seg_size(dev, UINT_MAX); @@ -547,6 +557,8 @@ static int msm_drm_init(struct device *dev, const struct drm_driver *drv) return ret; +err_deinit_vram: + msm_deinit_vram(ddev); err_put_dev: drm_dev_put(ddev); -- 2.39.2
[Freedreno] [PATCH 02/10] Revert "drm/msm: Fix failure paths in msm_drm_init()"
This reverts commit 8636500300a01740d92b345c680b036b94555b1b. A recent commit tried to address a drm device leak in the early msm_drm_uninit() error paths but ended up making things worse. Specifically, it moved the drm device reference put in msm_drm_uninit() to msm_drm_init() which means that the drm would now be leaked on normal unbind. For reasons that were never spelled out, it also added kms NULL pointer checks to a couple of helper functions that had nothing to do with the paths modified by the patch. Instead of trying to salvage this incrementally, let's revert the bad commit so that clean and backportable fixes can be added in its place. Fixes: 8636500300a0 ("drm/msm: Fix failure paths in msm_drm_init()") Cc: Akhil P Oommen Signed-off-by: Johan Hovold --- drivers/gpu/drm/msm/disp/msm_disp_snapshot.c | 3 --- drivers/gpu/drm/msm/msm_drv.c| 11 --- 2 files changed, 4 insertions(+), 10 deletions(-) diff --git a/drivers/gpu/drm/msm/disp/msm_disp_snapshot.c b/drivers/gpu/drm/msm/disp/msm_disp_snapshot.c index b73031cd48e4..e75b97127c0d 100644 --- a/drivers/gpu/drm/msm/disp/msm_disp_snapshot.c +++ b/drivers/gpu/drm/msm/disp/msm_disp_snapshot.c @@ -129,9 +129,6 @@ void msm_disp_snapshot_destroy(struct drm_device *drm_dev) } priv = drm_dev->dev_private; - if (!priv->kms) - return; - kms = priv->kms; if (kms->dump_worker) diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c index b7f5a78eadd4..9ded384acba4 100644 --- a/drivers/gpu/drm/msm/msm_drv.c +++ b/drivers/gpu/drm/msm/msm_drv.c @@ -150,9 +150,6 @@ static void msm_irq_uninstall(struct drm_device *dev) struct msm_drm_private *priv = dev->dev_private; struct msm_kms *kms = priv->kms; - if (!priv->kms) - return; - kms->funcs->irq_uninstall(kms); if (kms->irq_requested) free_irq(kms->irq, dev); @@ -270,6 +267,8 @@ static int msm_drm_uninit(struct device *dev) component_unbind_all(dev, ddev); ddev->dev_private = NULL; + drm_dev_put(ddev); + destroy_workqueue(priv->wq); return 0; @@ -442,12 +441,12 @@ static int msm_drm_init(struct device *dev, const struct drm_driver *drv) ret = msm_init_vram(ddev); if (ret) - goto err_drm_dev_put; + return ret; /* Bind all our sub-components: */ ret = component_bind_all(dev, ddev); if (ret) - goto err_drm_dev_put; + return ret; dma_set_max_seg_size(dev, UINT_MAX); @@ -542,8 +541,6 @@ static int msm_drm_init(struct device *dev, const struct drm_driver *drv) err_msm_uninit: msm_drm_uninit(dev); -err_drm_dev_put: - drm_dev_put(ddev); return ret; } -- 2.39.2
[Freedreno] [PATCH 08/10] drm/msm: fix workqueue leak on bind errors
Make sure to destroy the workqueue also in case of early errors during bind (e.g. a subcomponent failing to bind). Since commit c3b790ea07a1 ("drm: Manage drm_mode_config_init with drmm_") the mode config will be freed when the drm device is released also when using the legacy interface, but add an explicit cleanup for consistency and to facilitate backporting. Fixes: 060530f1ea67 ("drm/msm: use componentised device support") Cc: sta...@vger.kernel.org # 3.15 Cc: Rob Clark Signed-off-by: Johan Hovold --- drivers/gpu/drm/msm/msm_drv.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/msm/msm_drv.c b/drivers/gpu/drm/msm/msm_drv.c index ac3b77dbfacc..73c597565f99 100644 --- a/drivers/gpu/drm/msm/msm_drv.c +++ b/drivers/gpu/drm/msm/msm_drv.c @@ -458,7 +458,7 @@ static int msm_drm_init(struct device *dev, const struct drm_driver *drv) ret = msm_init_vram(ddev); if (ret) - goto err_put_dev; + goto err_cleanup_mode_config; /* Bind all our sub-components: */ ret = component_bind_all(dev, ddev); @@ -563,6 +563,9 @@ static int msm_drm_init(struct device *dev, const struct drm_driver *drv) err_deinit_vram: msm_deinit_vram(ddev); +err_cleanup_mode_config: + drm_mode_config_cleanup(ddev); + destroy_workqueue(priv->wq); err_put_dev: drm_dev_put(ddev); -- 2.39.2
[Freedreno] [PATCH v2 4/4] drm/msm/adreno: clean up component ops indentation
Clean up the component ops initialisers which were indented one level too far. Signed-off-by: Johan Hovold --- drivers/gpu/drm/msm/adreno/adreno_device.c | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c index d9100e3870bc..f2cdc5ad7ce7 100644 --- a/drivers/gpu/drm/msm/adreno/adreno_device.c +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c @@ -571,8 +571,8 @@ static void adreno_unbind(struct device *dev, struct device *master, } static const struct component_ops a3xx_ops = { - .bind = adreno_bind, - .unbind = adreno_unbind, + .bind = adreno_bind, + .unbind = adreno_unbind, }; static void adreno_device_register_headless(void) -- 2.39.2
[Freedreno] [PATCH v2 3/4] drm/msm/adreno: drop bogus pm_runtime_set_active()
The runtime PM status can only be updated while runtime PM is disabled. Drop the bogus pm_runtime_set_active() call that was made after enabling runtime PM and which (incidentally but correctly) left the runtime PM status set to 'suspended'. Fixes: 2c087a336676 ("drm/msm/adreno: Load the firmware before bringing up the hardware") Signed-off-by: Johan Hovold --- drivers/gpu/drm/msm/adreno/adreno_device.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c index f9a0b11c2e43..d9100e3870bc 100644 --- a/drivers/gpu/drm/msm/adreno/adreno_device.c +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c @@ -438,9 +438,6 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev) */ pm_runtime_enable(&pdev->dev); - /* Make sure pm runtime is active and reset any previous errors */ - pm_runtime_set_active(&pdev->dev); - ret = pm_runtime_get_sync(&pdev->dev); if (ret < 0) { pm_runtime_put_noidle(&pdev->dev); -- 2.39.2
[Freedreno] [PATCH v2 2/4] drm/msm/adreno: fix runtime PM imbalance at gpu load
A recent commit moved enabling of runtime PM to GPU load time (first open()) but failed to update the error paths so that runtime PM is disabled if initialisation of the GPU fails. This would trigger a warning about the unbalanced disable count on the next open() attempt. Note that pm_runtime_put_noidle() is sufficient to balance the usage count when pm_runtime_put_sync() fails (and is chosen over pm_runtime_resume_and_get() for consistency reasons). Fixes: 4b18299b3365 ("drm/msm/adreno: Defer enabling runpm until hw_init()") Cc: sta...@vger.kernel.org # 6.0 Signed-off-by: Johan Hovold --- drivers/gpu/drm/msm/adreno/adreno_device.c | 16 1 file changed, 12 insertions(+), 4 deletions(-) diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c index c5c4c93b3689..f9a0b11c2e43 100644 --- a/drivers/gpu/drm/msm/adreno/adreno_device.c +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c @@ -443,20 +443,21 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev) ret = pm_runtime_get_sync(&pdev->dev); if (ret < 0) { - pm_runtime_put_sync(&pdev->dev); + pm_runtime_put_noidle(&pdev->dev); DRM_DEV_ERROR(dev->dev, "Couldn't power up the GPU: %d\n", ret); - return NULL; + goto err_disable_rpm; } mutex_lock(&gpu->lock); ret = msm_gpu_hw_init(gpu); mutex_unlock(&gpu->lock); - pm_runtime_put_autosuspend(&pdev->dev); if (ret) { DRM_DEV_ERROR(dev->dev, "gpu hw init failed: %d\n", ret); - return NULL; + goto err_put_rpm; } + pm_runtime_put_autosuspend(&pdev->dev); + #ifdef CONFIG_DEBUG_FS if (gpu->funcs->debugfs_init) { gpu->funcs->debugfs_init(gpu, dev->primary); @@ -465,6 +466,13 @@ struct msm_gpu *adreno_load_gpu(struct drm_device *dev) #endif return gpu; + +err_put_rpm: + pm_runtime_put_sync(&pdev->dev); +err_disable_rpm: + pm_runtime_disable(&pdev->dev); + + return NULL; } static int find_chipid(struct device *dev, struct adreno_rev *rev) -- 2.39.2
[Freedreno] [PATCH v2 0/4] drm/msm/adreno: fix runtime PM imbalance at unbind
As reported by Bjorn, we can end up with an unbalanced runtime PM disable count if unbind() is called before the DRM device is opened (e.g. if component bind fails due to the panel driver not having been loaded yet). As runtime PM must currently stay disabled until the firmware has been loaded, fix this by making the runtime PM disable call at unbind() conditional. The rest of the series fixes further imbalances in the load_gpu() error paths and removes a bogus pm_runtime_set_active() call. Included is also a related indentation cleanup. Johan Changes in v2 - fix the runtime PM imbalance in the gpu load error paths (new) - drop the patch removing the pm_runtime_disable() from adreno_gpu_cleanup() as this function can currently still be called with runtime PM enabled if suspending the scheduler in adreno_system_suspend() at unbind fails Johan Hovold (4): drm/msm/adreno: fix runtime PM imbalance at unbind drm/msm/adreno: fix runtime PM imbalance at gpu load drm/msm/adreno: drop bogus pm_runtime_set_active() drm/msm/adreno: clean up component ops indentation drivers/gpu/drm/msm/adreno/adreno_device.c | 26 +- 1 file changed, 16 insertions(+), 10 deletions(-) -- 2.39.2
[Freedreno] [PATCH v2 1/4] drm/msm/adreno: fix runtime PM imbalance at unbind
A recent commit moved enabling of runtime PM from adreno_gpu_init() to adreno_load_gpu() (called on first open()), which means that unbind() may now be called with runtime PM disabled in case the device was never opened in between. Make sure to only forcibly suspend and disable runtime PM at unbind() in case runtime PM has been enabled to prevent a disable count imbalance. This specifically avoids leaving runtime PM disabled when the device is later opened after a successful bind: msm_dpu ae01000.display-controller: [drm:adreno_load_gpu [msm]] *ERROR* Couldn't power up the GPU: -13 Fixes: 4b18299b3365 ("drm/msm/adreno: Defer enabling runpm until hw_init()") Reported-by: Bjorn Andersson Link: https://lore.kernel.org/lkml/20230203181245.3523937-1-quic_bjora...@quicinc.com Cc: sta...@vger.kernel.org # 6.0 Signed-off-by: Johan Hovold --- drivers/gpu/drm/msm/adreno/adreno_device.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/gpu/drm/msm/adreno/adreno_device.c b/drivers/gpu/drm/msm/adreno/adreno_device.c index 36f062c7582f..c5c4c93b3689 100644 --- a/drivers/gpu/drm/msm/adreno/adreno_device.c +++ b/drivers/gpu/drm/msm/adreno/adreno_device.c @@ -558,7 +558,8 @@ static void adreno_unbind(struct device *dev, struct device *master, struct msm_drm_private *priv = dev_get_drvdata(master); struct msm_gpu *gpu = dev_to_gpu(dev); - WARN_ON_ONCE(adreno_system_suspend(dev)); + if (pm_runtime_enabled(dev)) + WARN_ON_ONCE(adreno_system_suspend(dev)); gpu->funcs->destroy(gpu); priv->gpu_pdev = NULL; -- 2.39.2