Some additional notes: 1) si_release_bindless_descriptors can be inlined.
2) Sampler slots have size = 16*4 and image slots have size = 8*4 in your patch. The addressing is also done with a multiple of the size, which allows creating a sampler and an image in the same slot such that one of them overwrites the other. create_bindless_sampler(): slot = 1, address = 1*16 = 16 create_bindless_image(): slot = 2, address = 2*8 = 16 The rest looks OK. Marek On Wed, Jul 19, 2017 at 9:14 AM, Samuel Pitoiset <samuel.pitoi...@gmail.com> wrote: > > > On 07/19/2017 12:01 AM, Marek Olšák wrote: >> >> On Mon, Jul 17, 2017 at 4:01 PM, Nicolai Hähnle <nhaeh...@gmail.com> >> wrote: >>> >>> Hi Samuel, >>> >>> On 07.07.2017 03:45, Samuel Pitoiset wrote: >>>> >>>> >>>> On 07/05/2017 01:42 PM, Nicolai Hähnle wrote: >>>>> >>>>> >>>>> On 04.07.2017 15:05, Samuel Pitoiset wrote: >>>>>> >>>>>> >>>>>> Using VRAM address as bindless handles is not a good idea because >>>>>> we have to use LLVMIntToPTr and the LLVM CSE pass can't optimize >>>>>> because it has no information about the pointer. >>>>>> >>>>>> Instead, use slots indexes like the existing descriptors. >>>>>> >>>>>> This improves performance with DOW3 by +7%. >>>>> >>>>> >>>>> >>>>> Wow. >>>>> >>>>> The thing is, burning a pair of user SGPRs for this seems a bit >>>>> overkill, >>>>> especially since it also hurts apps that don't use bindless at all. >>>>> >>>>> Do you have some examples of how LLVM fails here? Could we perhaps >>>>> avoid >>>>> most of the performance issues by casting 0 to an appropriate pointer >>>>> type >>>>> once, and then using the bindless handle as an index relative to that >>>>> pointer? >>>> >>>> >>>> >>>> Here's two shaders, 1) is with master, 2) is with this series: >>>> >>>> 1) https://hastebin.com/uvamarelig >>>> 2) https://hastebin.com/voguqihilu >>>> >>>> The first shader contains a bunch of s_buffer_load_dword that the second >>>> one doesn't need because CSE do its job there. This is because of >>>> IntToPtr >>>> but if we use noalias, the pass is able to eliminate the redundant >>>> descriptor load operations. >>> >>> >>> >>> So I looked into your example again in more detail, compiling both >>> shaders >>> with >>> >>> llc -march=amdgcn -mcpu=tonga >>> >>> and also just extracting the assembly, and I think your analysis is >>> actually >>> flawed. It's true that the shader in (2) has fewer buffer loads, but the >>> only buffer loads that are actually removed rather than just shuffled >>> around >>> are ones that load .y components. So basically, the reason you get fewer >>> buffer loads is that you're effectly using 32 bit pointers. It's a bit >>> shocking that that gives you a 7% performance improvement... >>> >>> If there are really examples where it makes a difference for CSE, then >>> perhaps the same result could be achieved with alias/noalias metadata on >>> the >>> load/store instructions. >>> >>> In the meantime, perhaps a good comparison would be to use the original, >>> inttoptr-based code, but only load the lower 32 bits for a handle and use >>> bit shifts to get a full 64 bit pointers. That way, at least it should be >>> possible to more easily find shaders where there is a genuine difference. >> >> >> He probably gave you wrong shaders. If you imagine an NxN filter doing >> NxN texture fetches, you should get NxN loads for the same descriptor >> with inttoptr. Also, LICM won't move the descriptor loads out of >> loops. Those are the biggest issues. If you can't do CSE and LICM, it >> can easily make the 7% difference. > > > There are many DOW3 shaders that are different with this series. The one I > pasted you is maybe not the most interesting one though. I can try to find > one with descriptor loads inside a loop. > > >> >> Marek >> > _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev