Re: [Intel-gfx] [char-misc-next 3/4] mei: pxp: re-enable client on errors
On Mon, Nov 27, 2023 at 03:22:29PM +0200, Ville Syrjälä wrote: > On Wed, Nov 15, 2023 at 10:35:16PM +0200, Ville Syrjälä wrote: > > On Tue, Nov 14, 2023 at 06:40:26PM +, Winkler, Tomas wrote: > > > > > > > > > > -Original Message- > > > > From: Teres Alexis, Alan Previn > > > > Sent: Tuesday, November 14, 2023 5:32 PM > > > > To: ville.syrj...@linux.intel.com; Winkler, Tomas > > > > > > > > Cc: gre...@linuxfoundation.org; Usyskin, Alexander > > > > ; linux-ker...@vger.kernel.org; intel- > > > > g...@lists.freedesktop.org; Lubart, Vitaly > > > > Subject: Re: [char-misc-next 3/4] mei: pxp: re-enable client on errors > > > > > > > > On Tue, 2023-11-14 at 16:00 +0200, Ville Syrjälä wrote: > > > > > On Wed, Oct 11, 2023 at 02:01:56PM +0300, Tomas Winkler wrote: > > > > > > From: Alexander Usyskin > > > > > > > > > > > > Disable and enable mei-pxp client on errors to clean the internal > > > > > > state. > > > > > > > > > > This broke i915 on my Alderlake-P laptop. > > > > > > This fix was already posted, just missed the merging window > > > https://lkml.org/lkml/2023/10/31/636 > > > > Gave this a spin and it fixes the issue for me. Thanks. > > > > > > > > Greg can you please take this fix into v6.7-rc2 run, or I can repost it > > > with the correct subject. > > We're at -rc3 already and this fix is still not in! Ah, good point, I'll take it now, sorry, catching up on things...
Re: [Intel-gfx] [char-misc-next 3/4] mei: pxp: re-enable client on errors
On Wed, Nov 15, 2023 at 10:35:16PM +0200, Ville Syrjälä wrote: > On Tue, Nov 14, 2023 at 06:40:26PM +, Winkler, Tomas wrote: > > > > > > > -Original Message- > > > From: Teres Alexis, Alan Previn > > > Sent: Tuesday, November 14, 2023 5:32 PM > > > To: ville.syrj...@linux.intel.com; Winkler, Tomas > > > > > > Cc: gre...@linuxfoundation.org; Usyskin, Alexander > > > ; linux-ker...@vger.kernel.org; intel- > > > g...@lists.freedesktop.org; Lubart, Vitaly > > > Subject: Re: [char-misc-next 3/4] mei: pxp: re-enable client on errors > > > > > > On Tue, 2023-11-14 at 16:00 +0200, Ville Syrjälä wrote: > > > > On Wed, Oct 11, 2023 at 02:01:56PM +0300, Tomas Winkler wrote: > > > > > From: Alexander Usyskin > > > > > > > > > > Disable and enable mei-pxp client on errors to clean the internal > > > > > state. > > > > > > > > This broke i915 on my Alderlake-P laptop. > > > > This fix was already posted, just missed the merging window > > https://lkml.org/lkml/2023/10/31/636 > > Gave this a spin and it fixes the issue for me. Thanks. > > > > > Greg can you please take this fix into v6.7-rc2 run, or I can repost it > > with the correct subject. We're at -rc3 already and this fix is still not in! > > Thanks > > Tomas > > > > > > > > > > > > > > > > > Hi Alex, i just relooked at the series that got merged, and i noticed > > > that in patch > > > #3 of the series, you had changed mei_pxp_send_message to return bytes > > > sent > > > instead of zero on success. IIRC, we had agreed to not effect the > > > behavior of > > > this component interface (other than adding the timeout) - this was the > > > intention of Patch #4 that i was pushing for in order to spec the > > > interface > > > (which continues to say zero on success). We should fix this to stay with > > > the > > > original behavior - where mei-pxp should NOT send partial packets and will > > > only return zero in success case where success is sending of the complete > > > packets - so we don't need to get back the "bytes sent" > > > from mei_pxp_send_message. So i think this might be causing the problem. > > > > > > > > > Side note to Ville:, are you enabling PXP kernel config by default in > > > all MESA > > > contexts? I recall that MESA folks were running some CI testing with > > > enable > > > pxp contexts, but didn't realize this is being enabled by default in all > > > contexts. > > > Please be aware that enabling pxp-contexts would temporarily disabled > > > runtime-pm during that contexts lifetime. > > > Also pxp contexts will be forced to be irrecoverable if it ever hangs. > > > The former is a hardware architecture requirement but doesn't do anything > > > if > > > you're enabling display (which I beleive also blocks in ADL). The latter > > > was a > > > requirement to comply with Vulkan. > > > > > > ...alan > > > > > > > -- > Ville Syrjälä > Intel -- Ville Syrjälä Intel
Re: [Intel-gfx] [char-misc-next 3/4] mei: pxp: re-enable client on errors
On Tue, Nov 14, 2023 at 06:40:26PM +, Winkler, Tomas wrote: > > > > -Original Message- > > From: Teres Alexis, Alan Previn > > Sent: Tuesday, November 14, 2023 5:32 PM > > To: ville.syrj...@linux.intel.com; Winkler, Tomas > > Cc: gre...@linuxfoundation.org; Usyskin, Alexander > > ; linux-ker...@vger.kernel.org; intel- > > g...@lists.freedesktop.org; Lubart, Vitaly > > Subject: Re: [char-misc-next 3/4] mei: pxp: re-enable client on errors > > > > On Tue, 2023-11-14 at 16:00 +0200, Ville Syrjälä wrote: > > > On Wed, Oct 11, 2023 at 02:01:56PM +0300, Tomas Winkler wrote: > > > > From: Alexander Usyskin > > > > > > > > Disable and enable mei-pxp client on errors to clean the internal state. > > > > > > This broke i915 on my Alderlake-P laptop. > > This fix was already posted, just missed the merging window > https://lkml.org/lkml/2023/10/31/636 Gave this a spin and it fixes the issue for me. Thanks. > > Greg can you please take this fix into v6.7-rc2 run, or I can repost it with > the correct subject. > Thanks > Tomas > > > > > > > > > > > Hi Alex, i just relooked at the series that got merged, and i noticed that > > in patch > > #3 of the series, you had changed mei_pxp_send_message to return bytes sent > > instead of zero on success. IIRC, we had agreed to not effect the behavior > > of > > this component interface (other than adding the timeout) - this was the > > intention of Patch #4 that i was pushing for in order to spec the interface > > (which continues to say zero on success). We should fix this to stay with > > the > > original behavior - where mei-pxp should NOT send partial packets and will > > only return zero in success case where success is sending of the complete > > packets - so we don't need to get back the "bytes sent" > > from mei_pxp_send_message. So i think this might be causing the problem. > > > > > > Side note to Ville:, are you enabling PXP kernel config by default in all > > MESA > > contexts? I recall that MESA folks were running some CI testing with enable > > pxp contexts, but didn't realize this is being enabled by default in all > > contexts. > > Please be aware that enabling pxp-contexts would temporarily disabled > > runtime-pm during that contexts lifetime. > > Also pxp contexts will be forced to be irrecoverable if it ever hangs. > > The former is a hardware architecture requirement but doesn't do anything if > > you're enabling display (which I beleive also blocks in ADL). The latter > > was a > > requirement to comply with Vulkan. > > > > ...alan > > > -- Ville Syrjälä Intel
Re: [Intel-gfx] [char-misc-next 3/4] mei: pxp: re-enable client on errors
On Wed, 2023-11-15 at 13:31 +, Tvrtko Ursulin wrote: > On 14/11/2023 15:31, Teres Alexis, Alan Previn wrote: > > On Tue, 2023-11-14 at 16:00 +0200, Ville Syrjälä wrote: > > > On Wed, Oct 11, 2023 at 02:01:56PM +0300, Tomas Winkler wrote: > > > > > Regardless of the mei_pxp_send_message being temporarily broken, doesn't > Ville's logs suggest the PXP detection is altogether messed up? AFAIR > the plan was exactly to avoid stalls during Mesa init and new uapi was > added to achieve that. But it doesn't seem to be working?! > > commit 3b918f4f0c8b5344af4058f1a12e2023363d0097 > Author: Alan Previn > Date: Wed Aug 2 11:25:50 2023 -0700 > > drm/i915/pxp: Optimize GET_PARAM:PXP_STATUS > > After recent discussions with Mesa folks, it was requested > that we optimize i915's GET_PARAM for the PXP_STATUS without > changing the UAPI spec. > > Add these additional optimizations: > - If any PXP initializatoin flow failed, then ensure that > we catch it so that we can change the returned PXP_STATUS > from "2" (i.e. 'PXP is supported but not yet ready') > to "-ENODEV". This typically should not happen and if it > does, we have a platform configuration issue. > - If a PXP arbitration session creation event failed > due to incorrect firmware version or blocking SOC fusing > or blocking BIOS configuration (platform reasons that won't > change if we retry), then reflect that blockage by also > returning -ENODEV in the GET_PARAM:PXP_STATUS. > - GET_PARAM:PXP_STATUS should not wait at all if PXP is > supported but non-i915 dependencies (component-driver / > firmware) we are still pending to complete the init flows. > In this case, just return "2" immediately (i.e. 'PXP is > supported but not yet ready'). > > AFAIU is things failed there shouldn't be long waits, repeated/constant > ones even less so. > alan: I agree - but I don't think MESA is detecting for PXP for above case... which is designed to be quick if using the GET_PARAM IOCTL - i believe MESA may actually be opting in to enforce PXP. When this happens we are required to be stringent when managing the protected-hw-sessions and for certain PXP operations we retry when a failure occurs - one in particular is the PXP hw session teardown or reset. We are expected to retry up to 3 times to ensure that the session is properly torn down (requirements from architecture) unless the error returned from FW indicated that we dont have proper fusing or security assets in the platform (i.e. lower level aspects of the platform won't permit PXP support). All of this is already coded. So we have two problems here: 1. The code for enforcing PXP when PXP is explicitly requested by UMD is expecting the mei-component driver to follow the mei-component-interface spec but a change was introduced by Alex that caused a bug in this lowest layer spec (see: https://elixir.bootlin.com/linux/latest/source/drivers/misc/mei/pxp/mei_pxp.c#L25) "Return: 0 on Success, <0 on Failure" for send and "Return: bytes sent on Success, <0 on Failure" for receive - the change they made was returning a positive number on send's success and i915 was checking the according to spec: ret = pxp_component->ops->send(pxp_component->tee_dev, msg_in, msg_in_size, PXP_TRANSPORT_TIMEOUT_MS); -> if (ret) { drm_err(>drm, "Failed to send PXP TEE message\n"); goto unlock; } ret = pxp_component->ops->recv(pxp_component->tee_dev, msg_out, msg_out_max_size, PXP_TRANSPORT_TIMEOUT_MS); if (ret < 0) { drm_err(>drm, "Failed to receive PXP TEE message\n"); goto unlock; } So i really cannot guarantee anything if the mei code itself is broken and not complying to the spec we agreed on - i can change the above check for send from "if (ret)" to "if (ret < 0)" but that would only be "working aroung" the mei bug. As a side note, when we bail on a successful "send" thinking its a failure, and try to "send" again, it triggers an link reset in the mei-hw link because the receive was never picked up. IF i understand this correctly, this is hw-fw design requirement, i.e. that the send-recv be an atomic FSM sequence. That said, if the the send-failure was a real failure, the timeout would have not been hit and the retry would have been much faster. As a side note, Alex from mei team did reply to me offline with indication that the fix was supposed to have been sent via "urgent queue" and he will follow up on that. 2. I dont believe MESA is using the GET_PARAM ioctl for detecting PXP availibility, I think MESA is explicitly opting into creating PXP contexts. That will of course trigger the the backend PXP code that actually executes any PXP operation.
Re: [Intel-gfx] [char-misc-next 3/4] mei: pxp: re-enable client on errors
On 14/11/2023 15:31, Teres Alexis, Alan Previn wrote: On Tue, 2023-11-14 at 16:00 +0200, Ville Syrjälä wrote: On Wed, Oct 11, 2023 at 02:01:56PM +0300, Tomas Winkler wrote: From: Alexander Usyskin Disable and enable mei-pxp client on errors to clean the internal state. This broke i915 on my Alderlake-P laptop. Hi Alex, i just relooked at the series that got merged, and i noticed that in patch #3 of the series, you had changed mei_pxp_send_message to return bytes sent instead of zero on success. IIRC, we had agreed to not effect the behavior of this component interface (other than adding the timeout) - this was the intention of Patch #4 that i was pushing for in order to spec the interface (which continues to say zero on success). We should fix this to stay with the original behavior - where mei-pxp should NOT send partial packets and will only return zero in success case where success is sending of the complete packets - so we don't need to get back the "bytes sent" from mei_pxp_send_message. So i think this might be causing the problem. Side note to Ville:, are you enabling PXP kernel config by default in all MESA contexts? I recall that MESA folks were running some CI testing with enable pxp contexts, but didn't realize this is being enabled by default in all contexts. Please be aware that enabling pxp-contexts would temporarily disabled runtime-pm during that contexts lifetime. Also pxp contexts will be forced to be irrecoverable if it ever hangs. The former is a hardware architecture requirement but doesn't do anything if you're enabling display (which I beleive also blocks in ADL). The latter was a requirement to comply with Vulkan. Regardless of the mei_pxp_send_message being temporarily broken, doesn't Ville's logs suggest the PXP detection is altogether messed up? AFAIR the plan was exactly to avoid stalls during Mesa init and new uapi was added to achieve that. But it doesn't seem to be working?! commit 3b918f4f0c8b5344af4058f1a12e2023363d0097 Author: Alan Previn Date: Wed Aug 2 11:25:50 2023 -0700 drm/i915/pxp: Optimize GET_PARAM:PXP_STATUS After recent discussions with Mesa folks, it was requested that we optimize i915's GET_PARAM for the PXP_STATUS without changing the UAPI spec. Add these additional optimizations: - If any PXP initializatoin flow failed, then ensure that we catch it so that we can change the returned PXP_STATUS from "2" (i.e. 'PXP is supported but not yet ready') to "-ENODEV". This typically should not happen and if it does, we have a platform configuration issue. - If a PXP arbitration session creation event failed due to incorrect firmware version or blocking SOC fusing or blocking BIOS configuration (platform reasons that won't change if we retry), then reflect that blockage by also returning -ENODEV in the GET_PARAM:PXP_STATUS. - GET_PARAM:PXP_STATUS should not wait at all if PXP is supported but non-i915 dependencies (component-driver / firmware) we are still pending to complete the init flows. In this case, just return "2" immediately (i.e. 'PXP is supported but not yet ready'). AFAIU is things failed there shouldn't be long waits, repeated/constant ones even less so. Regards, Tvrtko
Re: [Intel-gfx] [char-misc-next 3/4] mei: pxp: re-enable client on errors
> -Original Message- > From: Teres Alexis, Alan Previn > Sent: Tuesday, November 14, 2023 5:32 PM > To: ville.syrj...@linux.intel.com; Winkler, Tomas > Cc: gre...@linuxfoundation.org; Usyskin, Alexander > ; linux-ker...@vger.kernel.org; intel- > g...@lists.freedesktop.org; Lubart, Vitaly > Subject: Re: [char-misc-next 3/4] mei: pxp: re-enable client on errors > > On Tue, 2023-11-14 at 16:00 +0200, Ville Syrjälä wrote: > > On Wed, Oct 11, 2023 at 02:01:56PM +0300, Tomas Winkler wrote: > > > From: Alexander Usyskin > > > > > > Disable and enable mei-pxp client on errors to clean the internal state. > > > > This broke i915 on my Alderlake-P laptop. This fix was already posted, just missed the merging window https://lkml.org/lkml/2023/10/31/636 Greg can you please take this fix into v6.7-rc2 run, or I can repost it with the correct subject. Thanks Tomas > > > > > Hi Alex, i just relooked at the series that got merged, and i noticed that in > patch > #3 of the series, you had changed mei_pxp_send_message to return bytes sent > instead of zero on success. IIRC, we had agreed to not effect the behavior of > this component interface (other than adding the timeout) - this was the > intention of Patch #4 that i was pushing for in order to spec the interface > (which continues to say zero on success). We should fix this to stay with the > original behavior - where mei-pxp should NOT send partial packets and will > only return zero in success case where success is sending of the complete > packets - so we don't need to get back the "bytes sent" > from mei_pxp_send_message. So i think this might be causing the problem. > > > Side note to Ville:, are you enabling PXP kernel config by default in all > MESA > contexts? I recall that MESA folks were running some CI testing with enable > pxp contexts, but didn't realize this is being enabled by default in all > contexts. > Please be aware that enabling pxp-contexts would temporarily disabled > runtime-pm during that contexts lifetime. > Also pxp contexts will be forced to be irrecoverable if it ever hangs. > The former is a hardware architecture requirement but doesn't do anything if > you're enabling display (which I beleive also blocks in ADL). The latter was a > requirement to comply with Vulkan. > > ...alan >
Re: [Intel-gfx] [char-misc-next 3/4] mei: pxp: re-enable client on errors
On Tue, 2023-11-14 at 16:00 +0200, Ville Syrjälä wrote: > On Wed, Oct 11, 2023 at 02:01:56PM +0300, Tomas Winkler wrote: > > From: Alexander Usyskin > > > > Disable and enable mei-pxp client on errors to clean the internal state. > > This broke i915 on my Alderlake-P laptop. > Hi Alex, i just relooked at the series that got merged, and i noticed that in patch #3 of the series, you had changed mei_pxp_send_message to return bytes sent instead of zero on success. IIRC, we had agreed to not effect the behavior of this component interface (other than adding the timeout) - this was the intention of Patch #4 that i was pushing for in order to spec the interface (which continues to say zero on success). We should fix this to stay with the original behavior - where mei-pxp should NOT send partial packets and will only return zero in success case where success is sending of the complete packets - so we don't need to get back the "bytes sent" from mei_pxp_send_message. So i think this might be causing the problem. Side note to Ville:, are you enabling PXP kernel config by default in all MESA contexts? I recall that MESA folks were running some CI testing with enable pxp contexts, but didn't realize this is being enabled by default in all contexts. Please be aware that enabling pxp-contexts would temporarily disabled runtime-pm during that contexts lifetime. Also pxp contexts will be forced to be irrecoverable if it ever hangs. The former is a hardware architecture requirement but doesn't do anything if you're enabling display (which I beleive also blocks in ADL). The latter was a requirement to comply with Vulkan. ...alan
Re: [Intel-gfx] [char-misc-next 3/4] mei: pxp: re-enable client on errors
On Wed, Oct 11, 2023 at 02:01:56PM +0300, Tomas Winkler wrote: > From: Alexander Usyskin > > Disable and enable mei-pxp client on errors to clean the internal state. This broke i915 on my Alderlake-P laptop. Trying to start Xorg just hangs and I eventually have to power off the laptop to get things back into shape. The behaviour gets a bit better after commit fb99e79ee62a ("mei: update mei-pxp's component interface with timeouts") as Xorg "only" gets blocked for ~10 seconds, after which it manages to start, and I get a bunch of spew in dmesg: [ 25.431535] i915 :00:02.0: [drm] *ERROR* Failed to send PXP TEE message [ 30.435241] mei_pxp :00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: Trying to reset the channel... [ 30.435965] i915 :00:02.0: [drm] *ERROR* Failed to send PXP TEE message [ 30.437341] i915 :00:02.0: [drm] *ERROR* Failed to send PXP TEE message [ 30.437356] i915 :00:02.0: [drm] *ERROR* Failed to send tee msg for inv-stream-key-15, ret=[28] [ 35.555210] mei_pxp :00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: Trying to reset the channel... [ 35.555919] i915 :00:02.0: [drm] *ERROR* Failed to send PXP TEE message [ 35.555937] i915 :00:02.0: [drm] *ERROR* Failed to send tee msg init arb session, ret=[-62] [ 35.555941] i915 :00:02.0: [drm] *ERROR* tee cmd for arb session creation failed [ 35.556765] i915 :00:02.0: [drm] *ERROR* Failed to send PXP TEE message [ 36.021808] fuse: init (API version 7.39) [ 40.675183] mei_pxp :00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: Trying to reset the channel... [ 40.676045] i915 :00:02.0: [drm] *ERROR* Failed to send PXP TEE message [ 40.676591] i915 :00:02.0: [drm] *ERROR* Failed to send PXP TEE message [ 40.676602] i915 :00:02.0: [drm] *ERROR* Failed to send tee msg for inv-stream-key-15, ret=[28] [ 40.960209] mate-session-ch[5936]: memfd_create() called without MFD_EXEC or MFD_NOEXEC_SEAL set [ 45.795172] mei_pxp :00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: Trying to reset the channel... [ 45.795872] i915 :00:02.0: [drm] *ERROR* Failed to send PXP TEE message [ 45.796520] i915 :00:02.0: [drm] *ERROR* Failed to send PXP TEE message [ 50.915183] mei_pxp :00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: Trying to reset the channel... [ 50.916005] i915 :00:02.0: [drm] *ERROR* Failed to send PXP TEE message [ 50.916012] i915 :00:02.0: [drm] *ERROR* Failed to send tee msg for inv-stream-key-15, ret=[-62] [ 50.916846] i915 :00:02.0: [drm] *ERROR* Failed to send PXP TEE message [ 56.035149] mei_pxp :00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: Trying to reset the channel... [ 56.035956] i915 :00:02.0: [drm] *ERROR* Failed to send PXP TEE message [ 56.036585] i915 :00:02.0: [drm] *ERROR* Failed to send PXP TEE message [ 56.036592] i915 :00:02.0: [drm] *ERROR* Failed to send tee msg for inv-stream-key-15, ret=[28] [ 61.155137] mei_pxp :00:16.0-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: Trying to reset the channel... The same spew repeats every time I run any application that uses the GPU, and the application also gets blocked for a long time (eg. firefox takes over 15 seconds to start now). > > Signed-off-by: Alexander Usyskin > Signed-off-by: Tomas Winkler > --- > drivers/misc/mei/pxp/mei_pxp.c | 70 +++--- > 1 file changed, 48 insertions(+), 22 deletions(-) > > diff --git a/drivers/misc/mei/pxp/mei_pxp.c b/drivers/misc/mei/pxp/mei_pxp.c > index c6cdd6a47308ebcc72f34c38..9875d16445bb03efcfb31cd9 100644 > --- a/drivers/misc/mei/pxp/mei_pxp.c > +++ b/drivers/misc/mei/pxp/mei_pxp.c > @@ -23,6 +23,24 @@ > > #include "mei_pxp.h" > > +static inline int mei_pxp_reenable(const struct device *dev, struct > mei_cl_device *cldev) > +{ > + int ret; > + > + dev_warn(dev, "Trying to reset the channel...\n"); > + ret = mei_cldev_disable(cldev); > + if (ret < 0) > + dev_warn(dev, "mei_cldev_disable failed. %d\n", ret); > + /* > + * Explicitly ignoring disable failure, > + * enable may fix the states and succeed > + */ > + ret = mei_cldev_enable(cldev); > + if (ret < 0) > + dev_err(dev, "mei_cldev_enable failed. %d\n", ret); > + return ret; > +} > + > /** > * mei_pxp_send_message() - Sends a PXP message to ME FW. > * @dev: device corresponding to the mei_cl_device > @@ -35,6 +53,7 @@ mei_pxp_send_message(struct device *dev, const void > *message, size_t size) > { > struct mei_cl_device *cldev; > ssize_t byte; > + int ret; > > if (!dev || !message) > return -EINVAL; > @@ -44,10 +63,20 @@ mei_pxp_send_message(struct device *dev, const void > *message, size_t size) > byte = mei_cldev_send(cldev, message, size); > if (byte < 0) { > dev_dbg(dev, "mei_cldev_send failed. %zd\n", byte); > - return byte; > + switch (byte)