Re: [Mesa-dev] [PATCH 5/6] radeonsi: use slot indexes for bindless handles

2017-07-24 Thread Marek Olšák
Some additional notes:

1) si_release_bindless_descriptors can be inlined.

2) Sampler slots have size = 16*4 and image slots have size = 8*4 in
your patch. The addressing is also done as a multiple of each slot's own
size, which allows a sampler and an image created in different slots to
resolve to the same address, so that one of them overwrites the other.

create_bindless_sampler(): slot = 1, address = 1*16 = 16
create_bindless_image(): slot = 2, address = 2*8 = 16
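
To make the overlap concrete, here is a small standalone C sketch of the
arithmetic above (the constants and helper names are made up for
illustration; they are not the radeonsi API):

#include <assert.h>

/* Slot sizes in dwords, mirroring the patch. */
enum { SAMPLER_SLOT_DWORDS = 16, IMAGE_SLOT_DWORDS = 8 };

static unsigned sampler_addr(unsigned slot) { return slot * SAMPLER_SLOT_DWORDS; }
static unsigned image_addr(unsigned slot)   { return slot * IMAGE_SLOT_DWORDS; }

int main(void)
{
	/* Sampler slot 1 and image slot 2 resolve to the same dword offset,
	 * so whichever descriptor is written last clobbers the other. */
	assert(sampler_addr(1) == 16);
	assert(image_addr(2) == 16);
	assert(sampler_addr(1) == image_addr(2));

	/* One possible fix (an assumption, not necessarily the chosen one):
	 * allocate both handle types from a single slot namespace with one
	 * common slot size, so distinct slot indexes can never alias. */
	enum { COMMON_SLOT_DWORDS = 16 };
	assert(1 * COMMON_SLOT_DWORDS != 2 * COMMON_SLOT_DWORDS);
	return 0;
}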

The rest looks OK.

Marek

On Wed, Jul 19, 2017 at 9:14 AM, Samuel Pitoiset
 wrote:
>
>
> On 07/19/2017 12:01 AM, Marek Olšák wrote:
>>
>> On Mon, Jul 17, 2017 at 4:01 PM, Nicolai Hähnle 
>> wrote:
>>>
>>> Hi Samuel,
>>>
>>> On 07.07.2017 03:45, Samuel Pitoiset wrote:


 On 07/05/2017 01:42 PM, Nicolai Hähnle wrote:
>
>
> On 04.07.2017 15:05, Samuel Pitoiset wrote:
>>
>>
>> Using VRAM addresses as bindless handles is not a good idea because
>> we have to use LLVMIntToPtr and the LLVM CSE pass can't optimize the
>> loads because it has no information about the pointer.
>>
>> Instead, use slot indexes like the existing descriptors.
>>
>> This improves performance with DOW3 by +7%.
>
>
>
> Wow.
>
> The thing is, burning a pair of user SGPRs for this seems a bit
> overkill,
> especially since it also hurts apps that don't use bindless at all.
>
> Do you have some examples of how LLVM fails here? Could we perhaps
> avoid
> most of the performance issues by casting 0 to an appropriate pointer
> type
> once, and then using the bindless handle as an index relative to that
> pointer?



 Here are two shaders: 1) is with master, 2) is with this series:

 1) https://hastebin.com/uvamarelig
 2) https://hastebin.com/voguqihilu

 The first shader contains a bunch of s_buffer_load_dword instructions that
 the second one doesn't need because CSE does its job there. This is because
 of IntToPtr; if we use noalias instead, the pass is able to eliminate the
 redundant descriptor load operations.
>>>
>>>
>>>
>>> So I looked into your example again in more detail, compiling both
>>> shaders
>>> with
>>>
>>>llc -march=amdgcn -mcpu=tonga
>>>
>>> and also just extracting the assembly, and I think your analysis is
>>> actually
>>> flawed. It's true that the shader in (2) has fewer buffer loads, but the
>>> only buffer loads that are actually removed rather than just shuffled
>>> around
>>> are ones that load .y components. So basically, the reason you get fewer
>>> buffer loads is that you're effectively using 32-bit pointers. It's a bit
>>> shocking that that gives you a 7% performance improvement...
>>>
>>> If there are really examples where it makes a difference for CSE, then
>>> perhaps the same result could be achieved with alias/noalias metadata on
>>> the
>>> load/store instructions.
>>>
>>> In the meantime, perhaps a good comparison would be to use the original,
>>> inttoptr-based code, but only load the lower 32 bits for a handle and use
>>> bit shifts to get a full 64-bit pointer. That way, at least it should be
>>> possible to more easily find shaders where there is a genuine difference.
>>
>>
>> He probably gave you the wrong shaders. If you imagine an NxN filter doing
>> NxN texture fetches, you should get NxN loads for the same descriptor
>> with inttoptr. Also, LICM won't move the descriptor loads out of
>> loops. Those are the biggest issues. If you can't do CSE and LICM, it
>> can easily make the 7% difference.
>
>
> There are many DOW3 shaders that are different with this series. The one I
> pasted is maybe not the most interesting one, though. I can try to find
> one with descriptor loads inside a loop.
>
>
>>
>> Marek
>>
>
___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev


Re: [Mesa-dev] [PATCH 5/6] radeonsi: use slot indexes for bindless handles

2017-07-19 Thread Samuel Pitoiset



On 07/19/2017 12:01 AM, Marek Olšák wrote:

On Mon, Jul 17, 2017 at 4:01 PM, Nicolai Hähnle  wrote:

Hi Samuel,

On 07.07.2017 03:45, Samuel Pitoiset wrote:


On 07/05/2017 01:42 PM, Nicolai Hähnle wrote:


On 04.07.2017 15:05, Samuel Pitoiset wrote:


Using VRAM addresses as bindless handles is not a good idea because
we have to use LLVMIntToPtr and the LLVM CSE pass can't optimize the
loads because it has no information about the pointer.

Instead, use slot indexes like the existing descriptors.

This improves performance with DOW3 by +7%.



Wow.

The thing is, burning a pair of user SGPRs for this seems a bit overkill,
especially since it also hurts apps that don't use bindless at all.

Do you have some examples of how LLVM fails here? Could we perhaps avoid
most of the performance issues by casting 0 to an appropriate pointer type
once, and then using the bindless handle as an index relative to that
pointer?



Here are two shaders: 1) is with master, 2) is with this series:

1) https://hastebin.com/uvamarelig
2) https://hastebin.com/voguqihilu

The first shader contains a bunch of s_buffer_load_dword instructions that
the second one doesn't need because CSE does its job there. This is because
of IntToPtr; if we use noalias instead, the pass is able to eliminate the
redundant descriptor load operations.



So I looked into your example again in more detail, compiling both shaders
with

   llc -march=amdgcn -mcpu=tonga

and also just extracting the assembly, and I think your analysis is actually
flawed. It's true that the shader in (2) has fewer buffer loads, but the
only buffer loads that are actually removed rather than just shuffled around
are ones that load .y components. So basically, the reason you get fewer
buffer loads is that you're effectively using 32-bit pointers. It's a bit
shocking that that gives you a 7% performance improvement...

If there are really examples where it makes a difference for CSE, then
perhaps the same result could be achieved with alias/noalias metadata on the
load/store instructions.

In the meantime, perhaps a good comparison would be to use the original,
inttoptr-based code, but only load the lower 32 bits for a handle and use
bit shifts to get a full 64-bit pointer. That way, at least it should be
possible to more easily find shaders where there is a genuine difference.


He probably gave you the wrong shaders. If you imagine an NxN filter doing
NxN texture fetches, you should get NxN loads for the same descriptor
with inttoptr. Also, LICM won't move the descriptor loads out of
loops. Those are the biggest issues. If you can't do CSE and LICM, it
can easily make the 7% difference.


There are many DOW3 shaders that are different with this series. The one 
I pasted is maybe not the most interesting one, though. I can try to 
find one with descriptor loads inside a loop.





Marek




Re: [Mesa-dev] [PATCH 5/6] radeonsi: use slot indexes for bindless handles

2017-07-18 Thread Marek Olšák
On Mon, Jul 17, 2017 at 4:01 PM, Nicolai Hähnle  wrote:
> Hi Samuel,
>
> On 07.07.2017 03:45, Samuel Pitoiset wrote:
>>
>> On 07/05/2017 01:42 PM, Nicolai Hähnle wrote:
>>>
>>> On 04.07.2017 15:05, Samuel Pitoiset wrote:

 Using VRAM addresses as bindless handles is not a good idea because
 we have to use LLVMIntToPtr and the LLVM CSE pass can't optimize the
 loads because it has no information about the pointer.

 Instead, use slot indexes like the existing descriptors.

 This improves performance with DOW3 by +7%.
>>>
>>>
>>> Wow.
>>>
>>> The thing is, burning a pair of user SGPRs for this seems a bit overkill,
>>> especially since it also hurts apps that don't use bindless at all.
>>>
>>> Do you have some examples of how LLVM fails here? Could we perhaps avoid
>>> most of the performance issues by casting 0 to an appropriate pointer type
>>> once, and then using the bindless handle as an index relative to that
>>> pointer?
>>
>>
>> Here are two shaders: 1) is with master, 2) is with this series:
>>
>> 1) https://hastebin.com/uvamarelig
>> 2) https://hastebin.com/voguqihilu
>>
>> The first shader contains a bunch of s_buffer_load_dword instructions that
>> the second one doesn't need because CSE does its job there. This is because
>> of IntToPtr; if we use noalias instead, the pass is able to eliminate the
>> redundant descriptor load operations.
>
>
> So I looked into your example again in more detail, compiling both shaders
> with
>
>   llc -march=amdgcn -mcpu=tonga
>
> and also just extracting the assembly, and I think your analysis is actually
> flawed. It's true that the shader in (2) has fewer buffer loads, but the
> only buffer loads that are actually removed rather than just shuffled around
> are ones that load .y components. So basically, the reason you get fewer
> buffer loads is that you're effectively using 32-bit pointers. It's a bit
> shocking that that gives you a 7% performance improvement...
>
> If there are really examples where it makes a difference for CSE, then
> perhaps the same result could be achieved with alias/noalias metadata on the
> load/store instructions.
>
> In the meantime, perhaps a good comparison would be to use the original,
> inttoptr-based code, but only load the lower 32 bits for a handle and use
> bit shifts to get a full 64-bit pointer. That way, at least it should be
> possible to more easily find shaders where there is a genuine difference.

He probably gave you the wrong shaders. If you imagine an NxN filter doing
NxN texture fetches, you should get NxN loads for the same descriptor
with inttoptr. Also, LICM won't move the descriptor loads out of
loops. Those are the biggest issues. If you can't do CSE and LICM, it
can easily make the 7% difference.
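
A rough C analogy of that situation (illustrative only, not driver or
shader code): when the descriptor pointer comes from inttoptr, the
compiler has to assume the stores in the loop may alias it, so the
descriptor is reloaded on every iteration; with a noalias (restrict)
slot array, the load is loop-invariant and LICM/CSE can keep a single
copy outside the loop.

typedef struct { float words[8]; } descriptor;

/* Pointer of unknown provenance (think inttoptr): out[] may alias it,
 * so the compiler must reload the descriptor n*n times. */
float filter_inttoptr(const descriptor *desc, float *out, int n)
{
	float sum = 0.0f;
	for (int i = 0; i < n * n; i++) {
		float w = desc->words[0];	/* reloaded every iteration */
		out[i] = w;
		sum += w;
	}
	return sum;
}

/* Slot-indexed descriptor array passed as a noalias argument: the load
 * can be hoisted once in front of the loop. */
float filter_noalias(const descriptor *restrict descs, unsigned slot,
		     float *restrict out, int n)
{
	float sum = 0.0f;
	float w = descs[slot].words[0];		/* hoisted out of the loop */
	for (int i = 0; i < n * n; i++) {
		out[i] = w;
		sum += w;
	}
	return sum;
}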

Marek


Re: [Mesa-dev] [PATCH 5/6] radeonsi: use slot indexes for bindless handles

2017-07-17 Thread Nicolai Hähnle

Hi Samuel,

On 07.07.2017 03:45, Samuel Pitoiset wrote:

On 07/05/2017 01:42 PM, Nicolai Hähnle wrote:

On 04.07.2017 15:05, Samuel Pitoiset wrote:

Using VRAM addresses as bindless handles is not a good idea because
we have to use LLVMIntToPtr and the LLVM CSE pass can't optimize the
loads because it has no information about the pointer.

Instead, use slot indexes like the existing descriptors.

This improves performance with DOW3 by +7%.


Wow.

The thing is, burning a pair of user SGPRs for this seems a bit 
overkill, especially since it also hurts apps that don't use bindless 
at all.


Do you have some examples of how LLVM fails here? Could we perhaps 
avoid most of the performance issues by casting 0 to an appropriate 
pointer type once, and then using the bindless handle as an index 
relative to that pointer?


Here are two shaders: 1) is with master, 2) is with this series:

1) https://hastebin.com/uvamarelig
2) https://hastebin.com/voguqihilu

The first shader contains a bunch of s_buffer_load_dword instructions that
the second one doesn't need because CSE does its job there. This is because
of IntToPtr; if we use noalias instead, the pass is able to eliminate the
redundant descriptor load operations.


So I looked into your example again in more detail, compiling both 
shaders with


  llc -march=amdgcn -mcpu=tonga

and also just extracting the assembly, and I think your analysis is 
actually flawed. It's true that the shader in (2) has fewer buffer 
loads, but the only buffer loads that are actually removed rather than 
just shuffled around are ones that load .y components. So basically, the 
reason you get fewer buffer loads is that you're effectively using 32-bit 
pointers. It's a bit shocking that that gives you a 7% performance 
improvement...


If there are really examples where it makes a difference for CSE, then 
perhaps the same result could be achieved with alias/noalias metadata on 
the load/store instructions.


In the meantime, perhaps a good comparison would be to use the original, 
inttoptr-based code, but only load the lower 32 bits for a handle and 
use bit shifts to get a full 64-bit pointer. That way, at least it 
should be possible to more easily find shaders where there is a genuine 
difference.
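
A minimal sketch of that comparison, assuming the upper 32 bits of the
descriptor VA are a known constant (the helper name and the HIGH_DWORD
value are purely illustrative):

#include <stdint.h>

/* Rebuild a full 64-bit descriptor address from only the low dword of
 * the handle; HIGH_DWORD stands for whatever constant upper half the
 * descriptor pool is assumed to live in. */
static inline uint64_t handle_to_va(uint32_t handle_lo)
{
	const uint32_t HIGH_DWORD = 0x1;	/* assumed, for illustration */
	return ((uint64_t)HIGH_DWORD << 32) | handle_lo;
}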


Cheers,
Nicolai


About the number of SGPRs loaded by the SPI for non-bindless applications, 
it should be possible to avoid that by scanning the shader.




Cheers,
Nicolai




Re: [Mesa-dev] [PATCH 5/6] radeonsi: use slot indexes for bindless handles

2017-07-17 Thread Marek Olšák
On Mon, Jul 17, 2017 at 4:35 AM, Samuel Pitoiset
 wrote:
>
>
> On 07/15/2017 02:54 AM, Marek Olšák wrote:
>>
>> On Wed, Jul 5, 2017 at 1:42 PM, Nicolai Hähnle  wrote:
>>>
>>> On 04.07.2017 15:05, Samuel Pitoiset wrote:


 Using VRAM addresses as bindless handles is not a good idea because
 we have to use LLVMIntToPtr and the LLVM CSE pass can't optimize the
 loads because it has no information about the pointer.

 Instead, use slot indexes like the existing descriptors.

 This improves performance with DOW3 by +7%.
>>>
>>>
>>>
>>> Wow.
>>>
>>> The thing is, burning a pair of user SGPRs for this seems a bit overkill,
>>> especially since it also hurts apps that don't use bindless at all.
>>>
>>> Do you have some examples of how LLVM fails here? Could we perhaps avoid
>>> most of the performance issues by casting 0 to an appropriate pointer
>>> type
>>> once, and then using the bindless handle as an index relative to that
>>> pointer?
>>
>>
>> The problem is inttoptr doesn't support noalias and LLVM passes assume
>> it's a generic pointer and therefore don't optimize it. radeonsi loads
>> descriptors before each use and relies on CSE to unify all equivalent
>> loads that are close to each other. Without CSE, the resulting code is
>> very bad.
>>
>> Another interesting aspect of having the bindless descriptor array in
>> user SGPRs is that we can do buffer invalidations easily by
>> reuploading the whole array. That, however, adds a lot of overhead,
>> because the array is usually huge (64 bytes * 1000 slots), so it's
>> usually worse than the current solution (partial flushes +
>> WRITE_DATA). The bindless array could be packed better though.
>> Textures need 12 dwords, images need 8 dwords, and buffers need 4
>> dwords. Right now, all slots have 16 dwords.
>>
>> Samuel, sorry I haven't had time to look at these patches yet.
>
>
> No worries, but are you fine with this solution? If yes, I will fix up patch
> 1.

Yes, I'm OK with the solution. The user SGPR usage should decrease
when we add support for 32-bit pointers (actually we'll just need one
opcode to work with 32-bit pointers: the 32-bit -> 64-bit address space
cast).

Marek


Re: [Mesa-dev] [PATCH 5/6] radeonsi: use slot indexes for bindless handles

2017-07-17 Thread Samuel Pitoiset



On 07/15/2017 02:54 AM, Marek Olšák wrote:

On Wed, Jul 5, 2017 at 1:42 PM, Nicolai Hähnle  wrote:

On 04.07.2017 15:05, Samuel Pitoiset wrote:


Using VRAM addresses as bindless handles is not a good idea because
we have to use LLVMIntToPtr and the LLVM CSE pass can't optimize the
loads because it has no information about the pointer.

Instead, use slot indexes like the existing descriptors.

This improves performance with DOW3 by +7%.



Wow.

The thing is, burning a pair of user SGPRs for this seems a bit overkill,
especially since it also hurts apps that don't use bindless at all.

Do you have some examples of how LLVM fails here? Could we perhaps avoid
most of the performance issues by casting 0 to an appropriate pointer type
once, and then using the bindless handle as an index relative to that
pointer?


The problem is inttoptr doesn't support noalias and LLVM passes assume
it's a generic pointer and therefore don't optimize it. radeonsi loads
descriptors before each use and relies on CSE to unify all equivalent
loads that are close to each other. Without CSE, the resulting code is very bad.

Another interesting aspect of having the bindless descriptor array in
user SGPRs is that we can do buffer invalidations easily by
reuploading the whole array. That, however, adds a lot of overhead,
because the array is usually huge (64 bytes * 1000 slots), so it's
usually worse than the current solution (partial flushes +
WRITE_DATA). The bindless array could be packed better though.
Textures need 12 dwords, images need 8 dwords, and buffers need 4
dwords. Right now, all slots have 16 dwords.

Samuel, sorry I haven't had time to look at these patches yet.


No worries, but are you fine with this solution? If yes, I will fix up 
patch 1.


Samuel.



Marek




Re: [Mesa-dev] [PATCH 5/6] radeonsi: use slot indexes for bindless handles

2017-07-15 Thread Nicolai Hähnle

On 15.07.2017 02:54, Marek Olšák wrote:

On Wed, Jul 5, 2017 at 1:42 PM, Nicolai Hähnle  wrote:

On 04.07.2017 15:05, Samuel Pitoiset wrote:


Using VRAM addresses as bindless handles is not a good idea because
we have to use LLVMIntToPtr and the LLVM CSE pass can't optimize the
loads because it has no information about the pointer.

Instead, use slot indexes like the existing descriptors.

This improves performance with DOW3 by +7%.



Wow.

The thing is, burning a pair of user SGPRs for this seems a bit overkill,
especially since it also hurts apps that don't use bindless at all.

Do you have some examples of how LLVM fails here? Could we perhaps avoid
most of the performance issues by casting 0 to an appropriate pointer type
once, and then using the bindless handle as an index relative to that
pointer?


The problem is inttoptr doesn't support noalias and LLVM passes assume
it's a generic pointer and therefore don't optimize it. radeonsi loads
descriptors before each use and relies on CSE to unify all equivalent
loads that are close to each other. Without CSE, the resulting code is very bad.


I was thinking that it should be possible to achieve the same effect 
with noalias metadata, but it seems there is some magic about the 
noalias attribute which is stronger than anything achievable with 
metadata. I haven't dug deeper into what that is, exactly.


Cheers,
Nicolai



Another interesting aspect of having the bindless descriptor array in
user SGPRs is that we can do buffer invalidations easily by
reuploading the whole array. That, however, adds a lot of overhead,
because the array is usually huge (64 bytes * 1000 slots), so it's
usually worse than the current solution (partial flushes +
WRITE_DATA). The bindless array could be packed better though.
Textures need 12 dwords, images need 8 dwords, and buffers need 4
dwords. Right now, all slots have 16 dwords.

Samuel, sorry I haven't had time to look at these patches yet.

Marek




--
Learn how the world really is,
but never forget how it ought to be.


Re: [Mesa-dev] [PATCH 5/6] radeonsi: use slot indexes for bindless handles

2017-07-14 Thread Marek Olšák
On Wed, Jul 5, 2017 at 1:42 PM, Nicolai Hähnle  wrote:
> On 04.07.2017 15:05, Samuel Pitoiset wrote:
>>
>> Using VRAM addresses as bindless handles is not a good idea because
>> we have to use LLVMIntToPtr and the LLVM CSE pass can't optimize the
>> loads because it has no information about the pointer.
>>
>> Instead, use slot indexes like the existing descriptors.
>>
>> This improves performance with DOW3 by +7%.
>
>
> Wow.
>
> The thing is, burning a pair of user SGPRs for this seems a bit overkill,
> especially since it also hurts apps that don't use bindless at all.
>
> Do you have some examples of how LLVM fails here? Could we perhaps avoid
> most of the performance issues by casting 0 to an appropriate pointer type
> once, and then using the bindless handle as an index relative to that
> pointer?

The problem is inttoptr doesn't support noalias and LLVM passes assume
it's a generic pointer and therefore don't optimize it. radeonsi loads
descriptors before each use and relies on CSE to unify all equivalent
loads that are close to each other. Without CSE, the resulting code is very bad.

Another interesting aspect of having the bindless descriptor array in
user SGPRs is that we can do buffer invalidations easily by
reuploading the whole array. That, however, adds a lot of overhead,
because the array is usually huge (64 bytes * 1000 slots), so it's
usually worse than the current solution (partial flushes +
WRITE_DATA). The bindless array could be packed better though.
Textures need 12 dwords, images need 8 dwords, and buffers need 4
dwords. Right now, all slots have 16 dwords.
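
For a sense of scale, a quick back-of-the-envelope calculation of those
numbers (the 1000-slot count is the rough figure above, not a hard
driver limit):

#include <stdio.h>

int main(void)
{
	enum {
		NUM_SLOTS      = 1000,
		SLOT_DWORDS    = 16,	/* every slot today */
		TEXTURE_DWORDS = 12,	/* what each type actually needs */
		IMAGE_DWORDS   = 8,
		BUFFER_DWORDS  = 4
	};

	/* Reuploading the whole array rewrites all of this per invalidation: */
	printf("current array: %d bytes\n", NUM_SLOTS * SLOT_DWORDS * 4);	/* 64000 */
	/* Even the worst case of a packed layout is noticeably smaller: */
	printf("all textures:  %d bytes\n", NUM_SLOTS * TEXTURE_DWORDS * 4);	/* 48000 */
	printf("all buffers:   %d bytes\n", NUM_SLOTS * BUFFER_DWORDS * 4);	/* 16000 */
	return 0;
}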

Samuel, sorry I haven't had time to look at these patches yet.

Marek


Re: [Mesa-dev] [PATCH 5/6] radeonsi: use slot indexes for bindless handles

2017-07-07 Thread Samuel Pitoiset



On 07/05/2017 01:42 PM, Nicolai Hähnle wrote:

On 04.07.2017 15:05, Samuel Pitoiset wrote:

Using VRAM addresses as bindless handles is not a good idea because
we have to use LLVMIntToPtr and the LLVM CSE pass can't optimize the
loads because it has no information about the pointer.

Instead, use slot indexes like the existing descriptors.

This improves performance with DOW3 by +7%.


Wow.

The thing is, burning a pair of user SGPRs for this seems a bit 
overkill, especially since it also hurts apps that don't use bindless at 
all.


Do you have some examples of how LLVM fails here? Could we perhaps avoid 
most of the performance issues by casting 0 to an appropriate pointer 
type once, and then using the bindless handle as an index relative to 
that pointer?


Here are two shaders: 1) is with master, 2) is with this series:

1) https://hastebin.com/uvamarelig
2) https://hastebin.com/voguqihilu

The first shader contains a bunch of s_buffer_load_dword instructions that
the second one doesn't need because CSE does its job there. This is because
of IntToPtr; if we use noalias instead, the pass is able to eliminate the
redundant descriptor load operations.


About the number of SGPRs loaded by the SPI for non-bindless applications, 
it should be possible to avoid that by scanning the shader.




Cheers,
Nicolai





Re: [Mesa-dev] [PATCH 5/6] radeonsi: use slot indexes for bindless handles

2017-07-05 Thread Nicolai Hähnle

On 04.07.2017 15:05, Samuel Pitoiset wrote:

Using VRAM addresses as bindless handles is not a good idea because
we have to use LLVMIntToPtr and the LLVM CSE pass can't optimize the
loads because it has no information about the pointer.

Instead, use slot indexes like the existing descriptors.

This improves performance with DOW3 by +7%.


Wow.

The thing is, burning a pair of user SGPRs for this seems a bit 
overkill, especially since it also hurts apps that don't use bindless at 
all.


Do you have some examples of how LLVM fails here? Could we perhaps avoid 
most of the performance issues by casting 0 to an appropriate pointer 
type once, and then using the bindless handle as an index relative to 
that pointer?
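
Roughly, the suggestion is to do the pointer cast once and then express
every access as an index from that single base, so all descriptor loads
visibly share one pointer. A loose C analogy (illustrative only; in the
actual IR the base would be a pointer cast from 0 in the descriptor
address space):

#include <stdint.h>

typedef struct { uint32_t dwords[16]; } slot;

/* One base pointer, set up once per shader... */
static const slot *descriptor_base;

/* ...and every bindless handle used as an index relative to it, instead
 * of being turned into an unrelated pointer with inttoptr at each use. */
static inline const uint32_t *handle_to_desc(uint64_t handle)
{
	return descriptor_base[handle].dwords;
}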


Cheers,
Nicolai





[Mesa-dev] [PATCH 5/6] radeonsi: use slot indexes for bindless handles

2017-07-04 Thread Samuel Pitoiset
Using VRAM addresses as bindless handles is not a good idea because
we have to use LLVMIntToPtr and the LLVM CSE pass can't optimize the
loads because it has no information about the pointer.

Instead, use slot indexes like the existing descriptors.

This improves performance with DOW3 by +7%.

Signed-off-by: Samuel Pitoiset 
---
 src/gallium/drivers/radeonsi/si_descriptors.c | 342 +++---
 src/gallium/drivers/radeonsi/si_pipe.c|  12 -
 src/gallium/drivers/radeonsi/si_pipe.h|  23 +-
 src/gallium/drivers/radeonsi/si_shader_tgsi_mem.c |  32 +-
 4 files changed, 185 insertions(+), 224 deletions(-)

diff --git a/src/gallium/drivers/radeonsi/si_descriptors.c b/src/gallium/drivers/radeonsi/si_descriptors.c
index 06a171ff9e..601b18069d 100644
--- a/src/gallium/drivers/radeonsi/si_descriptors.c
+++ b/src/gallium/drivers/radeonsi/si_descriptors.c
@@ -1852,16 +1852,20 @@ static void si_rebind_buffer(struct pipe_context *ctx, struct pipe_resource *buf
 
 	/* Bindless texture handles */
 	if (rbuffer->texture_handle_allocated) {
+		struct si_descriptors *descs = &sctx->bindless_descriptors;
+
 		util_dynarray_foreach(&sctx->resident_tex_handles,
 				      struct si_texture_handle *, tex_handle) {
 			struct pipe_sampler_view *view = (*tex_handle)->view;
-			struct si_bindless_descriptor *desc = (*tex_handle)->desc;
+			unsigned desc_slot = (*tex_handle)->desc_slot;
 
 			if (view->texture == buf) {
 				si_set_buf_desc_address(rbuffer,
 							view->u.buf.offset,
-							&desc->desc_list[4]);
-				desc->dirty = true;
+							descs->list +
+							desc_slot * 16 + 4);
+
+				(*tex_handle)->desc_dirty = true;
 				sctx->bindless_descriptors_dirty = true;
 
 				radeon_add_to_buffer_list_check_mem(
@@ -1874,10 +1878,12 @@ static void si_rebind_buffer(struct pipe_context *ctx, struct pipe_resource *buf
 
 	/* Bindless image handles */
 	if (rbuffer->image_handle_allocated) {
+		struct si_descriptors *descs = &sctx->bindless_descriptors;
+
 		util_dynarray_foreach(&sctx->resident_img_handles,
 				      struct si_image_handle *, img_handle) {
 			struct pipe_image_view *view = &(*img_handle)->view;
-			struct si_bindless_descriptor *desc = (*img_handle)->desc;
+			unsigned desc_slot = (*img_handle)->desc_slot;
 
 			if (view->resource == buf) {
 				if (view->access & PIPE_IMAGE_ACCESS_WRITE)
@@ -1885,8 +1891,10 @@ static void si_rebind_buffer(struct pipe_context *ctx, struct pipe_resource *buf
 
 				si_set_buf_desc_address(rbuffer,
 							view->u.buf.offset,
-							&desc->desc_list[4]);
-				desc->dirty = true;
+							descs->list +
+							desc_slot * 8 + 4);
+
+				(*img_handle)->desc_dirty = true;
 				sctx->bindless_descriptors_dirty = true;
 
 				radeon_add_to_buffer_list_check_mem(
@@ -1918,11 +1926,18 @@ static void si_invalidate_buffer(struct pipe_context *ctx, struct pipe_resource
 }
 
 static void si_upload_bindless_descriptor(struct si_context *sctx,
-					  struct si_bindless_descriptor *desc)
+					  unsigned desc_slot,
+					  unsigned num_dwords)
 {
+	struct si_descriptors *desc = &sctx->bindless_descriptors;
 	struct radeon_winsys_cs *cs = sctx->b.gfx.cs;
-	uint64_t va = desc->buffer->gpu_address + desc->offset;
-	unsigned num_dwords = sizeof(desc->desc_list) / 4;
+	uint32_t *data;
+	uint64_t va;
+
+	data = desc->list + desc_slot * num_dwords;
+
+	va = desc->buffer->gpu_address + desc->buffer_offset +
+	     desc_slot * num_dwords * 4;
 
 	radeon_emit(cs, PKT3(PKT3_WRITE_DATA, 2 + num_dwords, 0));
 	radeon_emit(cs, S_370_DST_SEL(V_370_TC_L2) |
@@ -1930,7 +1945,7 @@ static void si_upload_bindless_descriptor(struct si_context *sctx,
 		    S_370_ENGINE_SEL(V_370_ME));
 	radeon_emit(cs, va);
 	radeon_emit(cs, va >> 32);
-	radeon_emit_array(cs, desc->desc_list, num_dwords);
+	radeon_emit_array(cs, data, num_dwords);
 }