Re: [PATCH] drm/msm/adreno: De-spaghettify the use of memory barriers

2024-06-06 Thread Konrad Dybcio
On 4.06.2024 4:40 PM, Will Deacon wrote:
> On Thu, May 16, 2024 at 01:55:26PM -0500, Andrew Halaney wrote:
>> On Thu, May 16, 2024 at 08:20:05PM GMT, Akhil P Oommen wrote:
>>> On Thu, May 16, 2024 at 08:15:34AM -0500, Andrew Halaney wrote:
 If I understand correctly, you don't need any memory barrier.
 writel()/readl()'s are ordered to the same endpoint. That goes for all
 the reordering/barrier comments mentioned below too.

 device-io.rst:

 The read and write functions are defined to be ordered. That is the
 compiler is not permitted to reorder the I/O sequence. When the 
 ordering
 can be compiler optimised, you can use __readb() and friends to
 indicate the relaxed ordering. Use this with care.

 memory-barriers.txt:

  (*) readX(), writeX():

The readX() and writeX() MMIO accessors take a pointer to the
peripheral being accessed as an __iomem * parameter. For pointers
mapped with the default I/O attributes (e.g. those returned by
ioremap()), the ordering guarantees are as follows:

1. All readX() and writeX() accesses to the same peripheral are 
 ordered
   with respect to each other. This ensures that MMIO register 
 accesses
   by the same CPU thread to a particular device will arrive in 
 program
   order.

>>>
>>> In arm64, a writel followed by readl translates to roughly the following
>>> sequence: dmb_wmb(), __raw_writel(), __raw_readl(), dmb_rmb(). I am not
>>> sure what is stopping compiler from reordering  __raw_writel() and 
>>> __raw_readl()
>>> above? I am assuming iomem cookie is ignored during compilation.
>>
>> It seems to me that is due to some usage of volatile there in
>> __raw_writel() etc, but to be honest after reading about volatile and
>> some threads from gcc mailing lists, I don't have a confident answer :)
>>
>>>
>>> Added Will to this thread if he can throw some light on this.
>>
>> Hopefully Will can school us.
> 
> The ordering in this case is ensured by the memory attributes used for
> ioremap(). When an MMIO region is mapped using Device-nGnRE attributes
> (as it the case for ioremap()), the "nR" part means "no reordering", so
> readX() and writeX() to that region are ordered wrt each other.
> 
> Note that guarantee _doesn't_ apply to other flavours of ioremap(), so
> e.g. ioremap_wc() won't give you the ordering.
> 
> Hope that helps,

Just to make sure I'm following, would mapping things as nGnRnE effectively
get rid of write buffering, perhaps being a way of debugging whether that
in particular is causing issues (at the cost of speed)?

Konrad


Re: [PATCH] drm/msm/adreno: De-spaghettify the use of memory barriers

2024-06-04 Thread Konrad Dybcio




On 5/14/24 20:38, Akhil P Oommen wrote:

On Wed, May 08, 2024 at 07:46:31PM +0200, Konrad Dybcio wrote:

Memory barriers help ensure instruction ordering, NOT time and order
of actual write arrival at other observers (e.g. memory-mapped IP).
On architectures employing weak memory ordering, the latter can be a
giant pain point, and it has been as part of this driver.

Moreover, the gpu_/gmu_ accessors already use non-relaxed versions of
readl/writel, which include r/w (respectively) barriers.

Replace the barriers with a readback that ensures the previous writes
have exited the write buffer (as the CPU must flush the write to the
register it's trying to read back) and subsequently remove the hack
introduced in commit b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt
status in hw_init").

Fixes: b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt status in hw_init")
Signed-off-by: Konrad Dybcio 
---
  drivers/gpu/drm/msm/adreno/a6xx_gmu.c |  5 ++---
  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 --
  2 files changed, 6 insertions(+), 13 deletions(-)


I prefer this version compared to the v2. A helper routine is
unnecessary here because:
1. there are very few scenarios where we have to read back the same
register.
2. we may accidently readback a write only register.


Which would still trigger an address dependency on the CPU, no?





diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c 
b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
index 0e3dfd4c2bc8..4135a53b55a7 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
@@ -466,9 +466,8 @@ static int a6xx_rpmh_start(struct a6xx_gmu *gmu)
int ret;
u32 val;
  
-	gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, 1 << 1);

-   /* Wait for the register to finish posting */
-   wmb();
+   gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, BIT(1));
+   gmu_read(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ);


This is unnecessary because we are polling on a register on the same port 
below. But I think we
can replace "wmb()" above with "mb()" to avoid reordering between read
and write IO instructions.


Ok on the dropping readback part

+ AFAIU from Will's response, we can drop the barrier as well



  
  	ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_RSCC_CONTROL_ACK, val,

val & (1 << 1), 100, 1);
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c 
b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 973872ad0474..0acbc38b8e70 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1713,22 +1713,16 @@ static int hw_init(struct msm_gpu *gpu)
}
  
  	/* Clear GBIF halt in case GX domain was not collapsed */

+   gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);


We need a full barrier here to avoid reordering. Also, lets add a
comment about why we are doing this odd looking sequence.


+   gpu_read(gpu, REG_A6XX_GBIF_HALT);
if (adreno_is_a619_holi(adreno_gpu)) {
-   gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
gpu_write(gpu, REG_A6XX_RBBM_GPR0_CNTL, 0);
-   /* Let's make extra sure that the GPU can access the memory.. */
-   mb();


We need a full barrier here.


+   gpu_read(gpu, REG_A6XX_RBBM_GPR0_CNTL);
} else if (a6xx_has_gbif(adreno_gpu)) {
-   gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 0);
-   /* Let's make extra sure that the GPU can access the memory.. */
-   mb();


We need a full barrier here.


Not sure we do between REG_A6XX_GBIF_HALT & REG_A6XX_RBBM_(GBIF_HALT/GPR0_CNTL),
but I suppose keeping the one after REG_A6XX_RBBM_(GBIF_HALT/GPR0_CNTL) makes
sense to avoid the possibility of configuring the GPU before it can access 
DRAM..




+   gpu_read(gpu, REG_A6XX_RBBM_GBIF_HALT);
}
  
-	/* Some GPUs are stubborn and take their sweet time to unhalt GBIF! */

-   if (adreno_is_a7xx(adreno_gpu) && a6xx_has_gbif(adreno_gpu))
-   spin_until(!gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK));
-


Why is this removed?


Because it was a hack in the first place and the enforcement of GBIF
unhalt requests coming through before proceeding further removes the
necessity to check this (unless there's some hw-mandated delay we should
keep in mind, but kgsl doesn't have that and there doesn't seem to be
any from testing on 8[456]50).

Konrad


Re: [PATCH] drm/msm/adreno: De-spaghettify the use of memory barriers

2024-06-04 Thread Will Deacon
On Thu, May 16, 2024 at 01:55:26PM -0500, Andrew Halaney wrote:
> On Thu, May 16, 2024 at 08:20:05PM GMT, Akhil P Oommen wrote:
> > On Thu, May 16, 2024 at 08:15:34AM -0500, Andrew Halaney wrote:
> > > If I understand correctly, you don't need any memory barrier.
> > > writel()/readl()'s are ordered to the same endpoint. That goes for all
> > > the reordering/barrier comments mentioned below too.
> > > 
> > > device-io.rst:
> > > 
> > > The read and write functions are defined to be ordered. That is the
> > > compiler is not permitted to reorder the I/O sequence. When the 
> > > ordering
> > > can be compiler optimised, you can use __readb() and friends to
> > > indicate the relaxed ordering. Use this with care.
> > > 
> > > memory-barriers.txt:
> > > 
> > >  (*) readX(), writeX():
> > > 
> > >   The readX() and writeX() MMIO accessors take a pointer to the
> > >   peripheral being accessed as an __iomem * parameter. For pointers
> > >   mapped with the default I/O attributes (e.g. those returned by
> > >   ioremap()), the ordering guarantees are as follows:
> > > 
> > >   1. All readX() and writeX() accesses to the same peripheral are 
> > > ordered
> > >  with respect to each other. This ensures that MMIO register 
> > > accesses
> > >  by the same CPU thread to a particular device will arrive in 
> > > program
> > >  order.
> > > 
> > 
> > In arm64, a writel followed by readl translates to roughly the following
> > sequence: dmb_wmb(), __raw_writel(), __raw_readl(), dmb_rmb(). I am not
> > sure what is stopping compiler from reordering  __raw_writel() and 
> > __raw_readl()
> > above? I am assuming iomem cookie is ignored during compilation.
> 
> It seems to me that is due to some usage of volatile there in
> __raw_writel() etc, but to be honest after reading about volatile and
> some threads from gcc mailing lists, I don't have a confident answer :)
> 
> > 
> > Added Will to this thread if he can throw some light on this.
> 
> Hopefully Will can school us.

The ordering in this case is ensured by the memory attributes used for
ioremap(). When an MMIO region is mapped using Device-nGnRE attributes
(as it the case for ioremap()), the "nR" part means "no reordering", so
readX() and writeX() to that region are ordered wrt each other.

Note that guarantee _doesn't_ apply to other flavours of ioremap(), so
e.g. ioremap_wc() won't give you the ordering.

Hope that helps,

Will


Re: [PATCH] drm/msm/adreno: De-spaghettify the use of memory barriers

2024-06-03 Thread Andrew Halaney
On Thu, May 16, 2024 at 08:20:05PM GMT, Akhil P Oommen wrote:
> On Thu, May 16, 2024 at 08:15:34AM -0500, Andrew Halaney wrote:
> > On Wed, May 15, 2024 at 12:08:49AM GMT, Akhil P Oommen wrote:
> > > On Wed, May 08, 2024 at 07:46:31PM +0200, Konrad Dybcio wrote:
> > > > Memory barriers help ensure instruction ordering, NOT time and order
> > > > of actual write arrival at other observers (e.g. memory-mapped IP).
> > > > On architectures employing weak memory ordering, the latter can be a
> > > > giant pain point, and it has been as part of this driver.
> > > > 
> > > > Moreover, the gpu_/gmu_ accessors already use non-relaxed versions of
> > > > readl/writel, which include r/w (respectively) barriers.
> > > > 
> > > > Replace the barriers with a readback that ensures the previous writes
> > > > have exited the write buffer (as the CPU must flush the write to the
> > > > register it's trying to read back) and subsequently remove the hack
> > > > introduced in commit b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt
> > > > status in hw_init").
> > 
> > For what its worth, I've been eyeing (but haven't tested) sending some
> > patches to clean up dsi_phy_write_udelay/ndelay(). There's no ordering
> > guarantee between a writel() and a delay(), so the expected "write then
> > delay" sequence might not be happening.. you need to write, read, delay.
> > 
> > memory-barriers.txt:
> > 
> > 5. A readX() by a CPU thread from the peripheral will complete before
> >any subsequent delay() loop can begin execution on the same thread.
> >This ensures that two MMIO register writes by the CPU to a peripheral
> >will arrive at least 1us apart if the first write is immediately read
> >back with readX() and udelay(1) is called prior to the second
> >writeX():
> > 
> > writel(42, DEVICE_REGISTER_0); // Arrives at the device...
> > readl(DEVICE_REGISTER_0);
> > udelay(1);
> > writel(42, DEVICE_REGISTER_1); // ...at least 1us before this.
> 
> Yes, udelay orders only with readl(). I saw a patch from Will Deacon
> which fixes this for arm64 few years back:
> https://lore.kernel.org/all/1543251228-30001-1-git-send-email-will.dea...@arm.com/T/
> 
> But this is needed only when you write io and do cpuside wait , not when
> you poll io to check status.

Sure, I'm just highlighting the bug in dsi_phy_write_*delay() functions
here, which match the udelay case you mention.

> 
> > 
> > > > 
> > > > Fixes: b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt status in 
> > > > hw_init")
> > > > Signed-off-by: Konrad Dybcio 
> > > > ---
> > > >  drivers/gpu/drm/msm/adreno/a6xx_gmu.c |  5 ++---
> > > >  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 --
> > > >  2 files changed, 6 insertions(+), 13 deletions(-)
> > > 
> > > I prefer this version compared to the v2. A helper routine is
> > > unnecessary here because:
> > > 1. there are very few scenarios where we have to read back the same
> > > register.
> > > 2. we may accidently readback a write only register.
> > > 
> > > > 
> > > > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c 
> > > > b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > > index 0e3dfd4c2bc8..4135a53b55a7 100644
> > > > --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > > @@ -466,9 +466,8 @@ static int a6xx_rpmh_start(struct a6xx_gmu *gmu)
> > > > int ret;
> > > > u32 val;
> > > >  
> > > > -   gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, 1 << 1);
> > > > -   /* Wait for the register to finish posting */
> > > > -   wmb();
> > > > +   gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, BIT(1));
> > > > +   gmu_read(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ);
> > > 
> > > This is unnecessary because we are polling on a register on the same port 
> > > below. But I think we
> > > can replace "wmb()" above with "mb()" to avoid reordering between read
> > > and write IO instructions.
> > 
> > If I understand correctly, you don't need any memory barrier.
> > writel()/readl()'s are ordered to the same endpoint. That goes for all
> > the reordering/barrier comments mentioned below too.
> > 
> > device-io.rst:
> > 
> > The read and write functions are defined to be ordered. That is the
> > compiler is not permitted to reorder the I/O sequence. When the ordering
> > can be compiler optimised, you can use __readb() and friends to
> > indicate the relaxed ordering. Use this with care.
> > 
> > memory-barriers.txt:
> > 
> >  (*) readX(), writeX():
> > 
> > The readX() and writeX() MMIO accessors take a pointer to the
> > peripheral being accessed as an __iomem * parameter. For pointers
> > mapped with the default I/O attributes (e.g. those returned by
> > ioremap()), the ordering guarantees are as follows:
> > 
> > 1. All readX() and writeX() accesses to the same peripheral are 
> > ordered
> >   

Re: [PATCH] drm/msm/adreno: De-spaghettify the use of memory barriers

2024-06-03 Thread Andrew Halaney
On Wed, May 15, 2024 at 12:08:49AM GMT, Akhil P Oommen wrote:
> On Wed, May 08, 2024 at 07:46:31PM +0200, Konrad Dybcio wrote:
> > Memory barriers help ensure instruction ordering, NOT time and order
> > of actual write arrival at other observers (e.g. memory-mapped IP).
> > On architectures employing weak memory ordering, the latter can be a
> > giant pain point, and it has been as part of this driver.
> > 
> > Moreover, the gpu_/gmu_ accessors already use non-relaxed versions of
> > readl/writel, which include r/w (respectively) barriers.
> > 
> > Replace the barriers with a readback that ensures the previous writes
> > have exited the write buffer (as the CPU must flush the write to the
> > register it's trying to read back) and subsequently remove the hack
> > introduced in commit b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt
> > status in hw_init").

For what its worth, I've been eyeing (but haven't tested) sending some
patches to clean up dsi_phy_write_udelay/ndelay(). There's no ordering
guarantee between a writel() and a delay(), so the expected "write then
delay" sequence might not be happening.. you need to write, read, delay.

memory-barriers.txt:

5. A readX() by a CPU thread from the peripheral will complete before
   any subsequent delay() loop can begin execution on the same thread.
   This ensures that two MMIO register writes by the CPU to a peripheral
   will arrive at least 1us apart if the first write is immediately read
   back with readX() and udelay(1) is called prior to the second
   writeX():

writel(42, DEVICE_REGISTER_0); // Arrives at the device...
readl(DEVICE_REGISTER_0);
udelay(1);
writel(42, DEVICE_REGISTER_1); // ...at least 1us before this.

> > 
> > Fixes: b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt status in hw_init")
> > Signed-off-by: Konrad Dybcio 
> > ---
> >  drivers/gpu/drm/msm/adreno/a6xx_gmu.c |  5 ++---
> >  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 --
> >  2 files changed, 6 insertions(+), 13 deletions(-)
> 
> I prefer this version compared to the v2. A helper routine is
> unnecessary here because:
> 1. there are very few scenarios where we have to read back the same
> register.
> 2. we may accidently readback a write only register.
> 
> > 
> > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c 
> > b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > index 0e3dfd4c2bc8..4135a53b55a7 100644
> > --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > @@ -466,9 +466,8 @@ static int a6xx_rpmh_start(struct a6xx_gmu *gmu)
> > int ret;
> > u32 val;
> >  
> > -   gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, 1 << 1);
> > -   /* Wait for the register to finish posting */
> > -   wmb();
> > +   gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, BIT(1));
> > +   gmu_read(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ);
> 
> This is unnecessary because we are polling on a register on the same port 
> below. But I think we
> can replace "wmb()" above with "mb()" to avoid reordering between read
> and write IO instructions.

If I understand correctly, you don't need any memory barrier.
writel()/readl()'s are ordered to the same endpoint. That goes for all
the reordering/barrier comments mentioned below too.

device-io.rst:

The read and write functions are defined to be ordered. That is the
compiler is not permitted to reorder the I/O sequence. When the ordering
can be compiler optimised, you can use __readb() and friends to
indicate the relaxed ordering. Use this with care.

memory-barriers.txt:

 (*) readX(), writeX():

The readX() and writeX() MMIO accessors take a pointer to the
peripheral being accessed as an __iomem * parameter. For pointers
mapped with the default I/O attributes (e.g. those returned by
ioremap()), the ordering guarantees are as follows:

1. All readX() and writeX() accesses to the same peripheral are 
ordered
   with respect to each other. This ensures that MMIO register 
accesses
   by the same CPU thread to a particular device will arrive in 
program
   order.


> 
> >  
> > ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_RSCC_CONTROL_ACK, val,
> > val & (1 << 1), 100, 1);
> > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c 
> > b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > index 973872ad0474..0acbc38b8e70 100644
> > --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> > @@ -1713,22 +1713,16 @@ static int hw_init(struct msm_gpu *gpu)
> > }
> >  
> > /* Clear GBIF halt in case GX domain was not collapsed */
> > +   gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
> 
> We need a full barrier here to avoid reordering. Also, lets add a
> comment about why we are doing this odd looking sequence.
> 
> > +   gpu_read(gpu, 

Re: [PATCH] drm/msm/adreno: De-spaghettify the use of memory barriers

2024-05-16 Thread Akhil P Oommen
On Thu, May 16, 2024 at 08:15:34AM -0500, Andrew Halaney wrote:
> On Wed, May 15, 2024 at 12:08:49AM GMT, Akhil P Oommen wrote:
> > On Wed, May 08, 2024 at 07:46:31PM +0200, Konrad Dybcio wrote:
> > > Memory barriers help ensure instruction ordering, NOT time and order
> > > of actual write arrival at other observers (e.g. memory-mapped IP).
> > > On architectures employing weak memory ordering, the latter can be a
> > > giant pain point, and it has been as part of this driver.
> > > 
> > > Moreover, the gpu_/gmu_ accessors already use non-relaxed versions of
> > > readl/writel, which include r/w (respectively) barriers.
> > > 
> > > Replace the barriers with a readback that ensures the previous writes
> > > have exited the write buffer (as the CPU must flush the write to the
> > > register it's trying to read back) and subsequently remove the hack
> > > introduced in commit b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt
> > > status in hw_init").
> 
> For what its worth, I've been eyeing (but haven't tested) sending some
> patches to clean up dsi_phy_write_udelay/ndelay(). There's no ordering
> guarantee between a writel() and a delay(), so the expected "write then
> delay" sequence might not be happening.. you need to write, read, delay.
> 
> memory-barriers.txt:
> 
>   5. A readX() by a CPU thread from the peripheral will complete before
>  any subsequent delay() loop can begin execution on the same thread.
>  This ensures that two MMIO register writes by the CPU to a peripheral
>  will arrive at least 1us apart if the first write is immediately read
>  back with readX() and udelay(1) is called prior to the second
>  writeX():
> 
>   writel(42, DEVICE_REGISTER_0); // Arrives at the device...
>   readl(DEVICE_REGISTER_0);
>   udelay(1);
>   writel(42, DEVICE_REGISTER_1); // ...at least 1us before this.

Yes, udelay orders only with readl(). I saw a patch from Will Deacon
which fixes this for arm64 few years back:
https://lore.kernel.org/all/1543251228-30001-1-git-send-email-will.dea...@arm.com/T/

But this is needed only when you write io and do cpuside wait , not when
you poll io to check status.

> 
> > > 
> > > Fixes: b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt status in 
> > > hw_init")
> > > Signed-off-by: Konrad Dybcio 
> > > ---
> > >  drivers/gpu/drm/msm/adreno/a6xx_gmu.c |  5 ++---
> > >  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 --
> > >  2 files changed, 6 insertions(+), 13 deletions(-)
> > 
> > I prefer this version compared to the v2. A helper routine is
> > unnecessary here because:
> > 1. there are very few scenarios where we have to read back the same
> > register.
> > 2. we may accidently readback a write only register.
> > 
> > > 
> > > diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c 
> > > b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > index 0e3dfd4c2bc8..4135a53b55a7 100644
> > > --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> > > @@ -466,9 +466,8 @@ static int a6xx_rpmh_start(struct a6xx_gmu *gmu)
> > >   int ret;
> > >   u32 val;
> > >  
> > > - gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, 1 << 1);
> > > - /* Wait for the register to finish posting */
> > > - wmb();
> > > + gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, BIT(1));
> > > + gmu_read(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ);
> > 
> > This is unnecessary because we are polling on a register on the same port 
> > below. But I think we
> > can replace "wmb()" above with "mb()" to avoid reordering between read
> > and write IO instructions.
> 
> If I understand correctly, you don't need any memory barrier.
> writel()/readl()'s are ordered to the same endpoint. That goes for all
> the reordering/barrier comments mentioned below too.
> 
> device-io.rst:
> 
> The read and write functions are defined to be ordered. That is the
> compiler is not permitted to reorder the I/O sequence. When the ordering
> can be compiler optimised, you can use __readb() and friends to
> indicate the relaxed ordering. Use this with care.
> 
> memory-barriers.txt:
> 
>  (*) readX(), writeX():
> 
>   The readX() and writeX() MMIO accessors take a pointer to the
>   peripheral being accessed as an __iomem * parameter. For pointers
>   mapped with the default I/O attributes (e.g. those returned by
>   ioremap()), the ordering guarantees are as follows:
> 
>   1. All readX() and writeX() accesses to the same peripheral are 
> ordered
>  with respect to each other. This ensures that MMIO register 
> accesses
>  by the same CPU thread to a particular device will arrive in 
> program
>  order.
> 

In arm64, a writel followed by readl translates to roughly the following
sequence: dmb_wmb(), __raw_writel(), __raw_readl(), dmb_rmb(). I am not
sure what is stopping compiler from reordering  __raw_writel() and 

Re: [PATCH] drm/msm/adreno: De-spaghettify the use of memory barriers

2024-05-14 Thread Akhil P Oommen
On Wed, May 08, 2024 at 07:46:31PM +0200, Konrad Dybcio wrote:
> Memory barriers help ensure instruction ordering, NOT time and order
> of actual write arrival at other observers (e.g. memory-mapped IP).
> On architectures employing weak memory ordering, the latter can be a
> giant pain point, and it has been as part of this driver.
> 
> Moreover, the gpu_/gmu_ accessors already use non-relaxed versions of
> readl/writel, which include r/w (respectively) barriers.
> 
> Replace the barriers with a readback that ensures the previous writes
> have exited the write buffer (as the CPU must flush the write to the
> register it's trying to read back) and subsequently remove the hack
> introduced in commit b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt
> status in hw_init").
> 
> Fixes: b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt status in hw_init")
> Signed-off-by: Konrad Dybcio 
> ---
>  drivers/gpu/drm/msm/adreno/a6xx_gmu.c |  5 ++---
>  drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 --
>  2 files changed, 6 insertions(+), 13 deletions(-)

I prefer this version compared to the v2. A helper routine is
unnecessary here because:
1. there are very few scenarios where we have to read back the same
register.
2. we may accidently readback a write only register.

> 
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c 
> b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> index 0e3dfd4c2bc8..4135a53b55a7 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
> @@ -466,9 +466,8 @@ static int a6xx_rpmh_start(struct a6xx_gmu *gmu)
>   int ret;
>   u32 val;
>  
> - gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, 1 << 1);
> - /* Wait for the register to finish posting */
> - wmb();
> + gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, BIT(1));
> + gmu_read(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ);

This is unnecessary because we are polling on a register on the same port 
below. But I think we
can replace "wmb()" above with "mb()" to avoid reordering between read
and write IO instructions.

>  
>   ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_RSCC_CONTROL_ACK, val,
>   val & (1 << 1), 100, 1);
> diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c 
> b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> index 973872ad0474..0acbc38b8e70 100644
> --- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> +++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
> @@ -1713,22 +1713,16 @@ static int hw_init(struct msm_gpu *gpu)
>   }
>  
>   /* Clear GBIF halt in case GX domain was not collapsed */
> + gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);

We need a full barrier here to avoid reordering. Also, lets add a
comment about why we are doing this odd looking sequence.

> + gpu_read(gpu, REG_A6XX_GBIF_HALT);
>   if (adreno_is_a619_holi(adreno_gpu)) {
> - gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
>   gpu_write(gpu, REG_A6XX_RBBM_GPR0_CNTL, 0);
> - /* Let's make extra sure that the GPU can access the memory.. */
> - mb();

We need a full barrier here.

> + gpu_read(gpu, REG_A6XX_RBBM_GPR0_CNTL);
>   } else if (a6xx_has_gbif(adreno_gpu)) {
> - gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
>   gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 0);
> - /* Let's make extra sure that the GPU can access the memory.. */
> - mb();

We need a full barrier here.

> + gpu_read(gpu, REG_A6XX_RBBM_GBIF_HALT);
>   }
>  
> - /* Some GPUs are stubborn and take their sweet time to unhalt GBIF! */
> - if (adreno_is_a7xx(adreno_gpu) && a6xx_has_gbif(adreno_gpu))
> - spin_until(!gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK));
> -

Why is this removed?

-Akhil

>   gpu_write(gpu, REG_A6XX_RBBM_SECVID_TSB_CNTL, 0);
>  
>   if (adreno_is_a619_holi(adreno_gpu))
> 
> ---
> base-commit: 93a39e4766083050ca0ecd6a3548093a3b9eb60c
> change-id: 20240508-topic-adreno-a2d199cd4152
> 
> Best regards,
> -- 
> Konrad Dybcio 
> 


[PATCH] drm/msm/adreno: De-spaghettify the use of memory barriers

2024-05-08 Thread Konrad Dybcio
Memory barriers help ensure instruction ordering, NOT time and order
of actual write arrival at other observers (e.g. memory-mapped IP).
On architectures employing weak memory ordering, the latter can be a
giant pain point, and it has been as part of this driver.

Moreover, the gpu_/gmu_ accessors already use non-relaxed versions of
readl/writel, which include r/w (respectively) barriers.

Replace the barriers with a readback that ensures the previous writes
have exited the write buffer (as the CPU must flush the write to the
register it's trying to read back) and subsequently remove the hack
introduced in commit b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt
status in hw_init").

Fixes: b77532803d11 ("drm/msm/a6xx: Poll for GBIF unhalt status in hw_init")
Signed-off-by: Konrad Dybcio 
---
 drivers/gpu/drm/msm/adreno/a6xx_gmu.c |  5 ++---
 drivers/gpu/drm/msm/adreno/a6xx_gpu.c | 14 --
 2 files changed, 6 insertions(+), 13 deletions(-)

diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c 
b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
index 0e3dfd4c2bc8..4135a53b55a7 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gmu.c
@@ -466,9 +466,8 @@ static int a6xx_rpmh_start(struct a6xx_gmu *gmu)
int ret;
u32 val;
 
-   gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, 1 << 1);
-   /* Wait for the register to finish posting */
-   wmb();
+   gmu_write(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ, BIT(1));
+   gmu_read(gmu, REG_A6XX_GMU_RSCC_CONTROL_REQ);
 
ret = gmu_poll_timeout(gmu, REG_A6XX_GMU_RSCC_CONTROL_ACK, val,
val & (1 << 1), 100, 1);
diff --git a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c 
b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
index 973872ad0474..0acbc38b8e70 100644
--- a/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
+++ b/drivers/gpu/drm/msm/adreno/a6xx_gpu.c
@@ -1713,22 +1713,16 @@ static int hw_init(struct msm_gpu *gpu)
}
 
/* Clear GBIF halt in case GX domain was not collapsed */
+   gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
+   gpu_read(gpu, REG_A6XX_GBIF_HALT);
if (adreno_is_a619_holi(adreno_gpu)) {
-   gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
gpu_write(gpu, REG_A6XX_RBBM_GPR0_CNTL, 0);
-   /* Let's make extra sure that the GPU can access the memory.. */
-   mb();
+   gpu_read(gpu, REG_A6XX_RBBM_GPR0_CNTL);
} else if (a6xx_has_gbif(adreno_gpu)) {
-   gpu_write(gpu, REG_A6XX_GBIF_HALT, 0);
gpu_write(gpu, REG_A6XX_RBBM_GBIF_HALT, 0);
-   /* Let's make extra sure that the GPU can access the memory.. */
-   mb();
+   gpu_read(gpu, REG_A6XX_RBBM_GBIF_HALT);
}
 
-   /* Some GPUs are stubborn and take their sweet time to unhalt GBIF! */
-   if (adreno_is_a7xx(adreno_gpu) && a6xx_has_gbif(adreno_gpu))
-   spin_until(!gpu_read(gpu, REG_A6XX_GBIF_HALT_ACK));
-
gpu_write(gpu, REG_A6XX_RBBM_SECVID_TSB_CNTL, 0);
 
if (adreno_is_a619_holi(adreno_gpu))

---
base-commit: 93a39e4766083050ca0ecd6a3548093a3b9eb60c
change-id: 20240508-topic-adreno-a2d199cd4152

Best regards,
-- 
Konrad Dybcio