Re: [Mesa-dev] [PATCH 2/2] winsys/amdgpu: explicitly declare whether buffer_map is permanent or not

2018-11-23 Thread Marek Olšák
On Thu, Nov 22, 2018 at 6:24 AM Nicolai Hähnle  wrote:

> On 21.11.18 21:27, Marek Olšák wrote:
> > On Wed, Nov 21, 2018 at 12:57 PM Nicolai Hähnle  > > wrote:
> >
> > From: Nicolai Hähnle  > >
> >
> > Introduce a new driver-private transfer flag
> RADEON_TRANSFER_TEMPORARY
> > that specifies whether the caller will use buffer_unmap or not. The
> > default behavior is set to permanent maps, because that's what
> drivers
> > do for Gallium buffer maps.
> >
> > This should eliminate the need for hacks in libdrm. Assertions are
> added
> > to catch when the buffer_unmap calls don't match the (temporary)
> > buffer_map calls.
> >
> > I did my best to update r600 and r300 as well for completeness (yes,
> > it's a no-op for r300 because it never calls buffer_unmap), even
> though
> > the radeon winsys ignores the new flag.
> >
> >
> > You didn't make any changes to r300.
>
> Yeah, that's what I wrote :)
>
>
> > You can also drop all r600 changes, because the radeon winsys doesn't
> care.
>
> I don't think it's a good idea, though. The interface of the two winsys
> is different, yes, but it's largely the same and it makes sense to keep
> it that way conceptually. Not that it matters much for the code itself.
>
>
> [snip]
> > +enum radeon_transfer_flags {
> > +   /* Indicates that the caller will unmap the buffer.
> > +*
> > +* Not unmapping buffers is an important performance
> > optimization for
> > +* OpenGL (avoids kernel overhead for frequently mapped
> > buffers). However,
> > +* if you only map a buffer once and then use it indefinitely
> > from the GPU,
> > +* it is much better to unmap it so that the kernel is free to
> > move it to
> > +* non-visible VRAM.
> >
> >
> > The second half of the comment is misleading. The kernel will move
> > buffers to invisible VRAM regardless of whether they're mapped, so CPU
> > mappings have no effect on the placement. Buffers are only moved back to
> > CPU-accessible memory on a CPU page fault. If a buffer is mapped and
> > there no CPU access, it will stay in invisible VRAM forever. The general
> > recommendation is to keep those buffers mapped for CPU access just like
> > GTT buffers.
>
> Yeah, I'll change that.
>
>
> > +*/
> > +   RADEON_TRANSFER_TEMPORARY = (PIPE_TRANSFER_DRV_PRV << 0),
> > +};
> > +
> >   #define RADEON_SPARSE_PAGE_SIZE (64 * 1024)
> >
> >   enum ring_type {
> >   RING_GFX = 0,
> >   RING_COMPUTE,
> >   RING_DMA,
> >   RING_UVD,
> >   RING_VCE,
> >   RING_UVD_ENC,
> >   RING_VCN_DEC,
> > @@ -287,23 +299,26 @@ struct radeon_winsys {
> >   struct pb_buffer *(*buffer_create)(struct radeon_winsys *ws,
> >  uint64_t size,
> >  unsigned alignment,
> >  enum radeon_bo_domain
> domain,
> >  enum radeon_bo_flag flags);
> >
> >   /**
> >* Map the entire data store of a buffer object into the
> > client's address
> >* space.
> >*
> > + * Callers are expected to unmap buffers again if and only if
> the
> > + * RADEON_TRANSFER_TEMPORARY flag is set in \p usage.
> > + *
> >* \param buf   A winsys buffer object to map.
> >* \param csA command stream to flush if the buffer is
> > referenced by it.
> > - * \param usage A bitmask of the PIPE_TRANSFER_* flags.
> > + * \param usage A bitmask of the PIPE_TRANSFER_* and
> > RADEON_TRANSFER_* flags.
> >* \return  The pointer at the beginning of the buffer.
> >*/
> >   void *(*buffer_map)(struct pb_buffer *buf,
> >   struct radeon_cmdbuf *cs,
> >   enum pipe_transfer_usage usage);
> >
> >   /**
> >* Unmap a buffer object from the client's address space.
> >*
> >* \param buf   A winsys buffer object to unmap.
> > diff --git a/src/gallium/drivers/radeonsi/si_shader.c
> > b/src/gallium/drivers/radeonsi/si_shader.c
> > index 19522cc97b1..d455fb5db6a 100644
> > --- a/src/gallium/drivers/radeonsi/si_shader.c
> > +++ b/src/gallium/drivers/radeonsi/si_shader.c
> > @@ -5286,21 +5286,22 @@ int si_shader_binary_upload(struct si_screen
> > *sscreen, struct si_shader *shader)
> >  0 :
> > SI_RESOURCE_FLAG_READ_ONLY,
> > PIPE_USAGE_IMMUTABLE,
> > align(bo_size,
> > SI_CPDMA_ALIGNMENT),
> >  

Re: [Mesa-dev] [PATCH 2/2] winsys/amdgpu: explicitly declare whether buffer_map is permanent or not

2018-11-22 Thread Nicolai Hähnle

On 21.11.18 21:27, Marek Olšák wrote:
On Wed, Nov 21, 2018 at 12:57 PM Nicolai Hähnle > wrote:


From: Nicolai Hähnle mailto:nicolai.haeh...@amd.com>>

Introduce a new driver-private transfer flag RADEON_TRANSFER_TEMPORARY
that specifies whether the caller will use buffer_unmap or not. The
default behavior is set to permanent maps, because that's what drivers
do for Gallium buffer maps.

This should eliminate the need for hacks in libdrm. Assertions are added
to catch when the buffer_unmap calls don't match the (temporary)
buffer_map calls.

I did my best to update r600 and r300 as well for completeness (yes,
it's a no-op for r300 because it never calls buffer_unmap), even though
the radeon winsys ignores the new flag.


You didn't make any changes to r300.


Yeah, that's what I wrote :)



You can also drop all r600 changes, because the radeon winsys doesn't care.


I don't think it's a good idea, though. The interface of the two winsys 
is different, yes, but it's largely the same and it makes sense to keep 
it that way conceptually. Not that it matters much for the code itself.



[snip]

+enum radeon_transfer_flags {
+   /* Indicates that the caller will unmap the buffer.
+    *
+    * Not unmapping buffers is an important performance
optimization for
+    * OpenGL (avoids kernel overhead for frequently mapped
buffers). However,
+    * if you only map a buffer once and then use it indefinitely
from the GPU,
+    * it is much better to unmap it so that the kernel is free to
move it to
+    * non-visible VRAM.


The second half of the comment is misleading. The kernel will move 
buffers to invisible VRAM regardless of whether they're mapped, so CPU 
mappings have no effect on the placement. Buffers are only moved back to 
CPU-accessible memory on a CPU page fault. If a buffer is mapped and 
there no CPU access, it will stay in invisible VRAM forever. The general 
recommendation is to keep those buffers mapped for CPU access just like 
GTT buffers.


Yeah, I'll change that.



+    */
+   RADEON_TRANSFER_TEMPORARY = (PIPE_TRANSFER_DRV_PRV << 0),
+};
+
  #define RADEON_SPARSE_PAGE_SIZE (64 * 1024)

  enum ring_type {
      RING_GFX = 0,
      RING_COMPUTE,
      RING_DMA,
      RING_UVD,
      RING_VCE,
      RING_UVD_ENC,
      RING_VCN_DEC,
@@ -287,23 +299,26 @@ struct radeon_winsys {
      struct pb_buffer *(*buffer_create)(struct radeon_winsys *ws,
                                         uint64_t size,
                                         unsigned alignment,
                                         enum radeon_bo_domain domain,
                                         enum radeon_bo_flag flags);

      /**
       * Map the entire data store of a buffer object into the
client's address
       * space.
       *
+     * Callers are expected to unmap buffers again if and only if the
+     * RADEON_TRANSFER_TEMPORARY flag is set in \p usage.
+     *
       * \param buf       A winsys buffer object to map.
       * \param cs        A command stream to flush if the buffer is
referenced by it.
-     * \param usage     A bitmask of the PIPE_TRANSFER_* flags.
+     * \param usage     A bitmask of the PIPE_TRANSFER_* and
RADEON_TRANSFER_* flags.
       * \return          The pointer at the beginning of the buffer.
       */
      void *(*buffer_map)(struct pb_buffer *buf,
                          struct radeon_cmdbuf *cs,
                          enum pipe_transfer_usage usage);

      /**
       * Unmap a buffer object from the client's address space.
       *
       * \param buf       A winsys buffer object to unmap.
diff --git a/src/gallium/drivers/radeonsi/si_shader.c
b/src/gallium/drivers/radeonsi/si_shader.c
index 19522cc97b1..d455fb5db6a 100644
--- a/src/gallium/drivers/radeonsi/si_shader.c
+++ b/src/gallium/drivers/radeonsi/si_shader.c
@@ -5286,21 +5286,22 @@ int si_shader_binary_upload(struct si_screen
*sscreen, struct si_shader *shader)
                                                 0 :
SI_RESOURCE_FLAG_READ_ONLY,
                                                PIPE_USAGE_IMMUTABLE,
                                                align(bo_size,
SI_CPDMA_ALIGNMENT),
                                                256);
         if (!shader->bo)
                 return -ENOMEM;

         /* Upload. */
         ptr = sscreen->ws->buffer_map(shader->bo->buf, NULL,
                                         PIPE_TRANSFER_READ_WRITE |
-                                       PIPE_TRANSFER_UNSYNCHRONIZED);
+                                       PIPE_TRANSFER_UNSYNCHRONIZED |
+                                       RADEON_TRANSFER_TEMPORAR

Re: [Mesa-dev] [PATCH 2/2] winsys/amdgpu: explicitly declare whether buffer_map is permanent or not

2018-11-21 Thread Marek Olšák
On Wed, Nov 21, 2018 at 12:57 PM Nicolai Hähnle  wrote:

> From: Nicolai Hähnle 
>
> Introduce a new driver-private transfer flag RADEON_TRANSFER_TEMPORARY
> that specifies whether the caller will use buffer_unmap or not. The
> default behavior is set to permanent maps, because that's what drivers
> do for Gallium buffer maps.
>
> This should eliminate the need for hacks in libdrm. Assertions are added
> to catch when the buffer_unmap calls don't match the (temporary)
> buffer_map calls.
>
> I did my best to update r600 and r300 as well for completeness (yes,
> it's a no-op for r300 because it never calls buffer_unmap), even though
> the radeon winsys ignores the new flag.
>

You didn't make any changes to r300.

You can also drop all r600 changes, because the radeon winsys doesn't care.



>
> As an added bonus, this should actually improve the performance of
> the normal fast path, because we no longer call into libdrm at all
> after the first map, and there's one less atomic in the winsys itself
> (there are now no atomics left in the UNSYNCHRONIZED fast path).
>
> Cc: Leo Liu 
> --
> Leo, it'd be nice if you could confirm that all video buffer mappings
> are temporary in this sense.
> ---
>  src/gallium/drivers/r600/evergreen_compute.c |  4 +-
>  src/gallium/drivers/r600/r600_asm.c  |  4 +-
>  src/gallium/drivers/r600/r600_shader.c   |  4 +-
>  src/gallium/drivers/r600/radeon_uvd.c|  8 +-
>  src/gallium/drivers/r600/radeon_vce.c|  4 +-
>  src/gallium/drivers/r600/radeon_video.c  |  6 +-
>  src/gallium/drivers/radeon/radeon_uvd.c  | 10 ++-
>  src/gallium/drivers/radeon/radeon_uvd_enc.c  |  6 +-
>  src/gallium/drivers/radeon/radeon_vce.c  |  4 +-
>  src/gallium/drivers/radeon/radeon_vcn_dec.c  | 18 ++--
>  src/gallium/drivers/radeon/radeon_vcn_enc.c  |  4 +-
>  src/gallium/drivers/radeon/radeon_video.c|  6 +-
>  src/gallium/drivers/radeon/radeon_winsys.h   | 17 +++-
>  src/gallium/drivers/radeonsi/si_shader.c |  3 +-
>  src/gallium/include/pipe/p_defines.h |  8 +-
>  src/gallium/winsys/amdgpu/drm/amdgpu_bo.c| 95 +---
>  src/gallium/winsys/amdgpu/drm/amdgpu_bo.h|  3 +-
>  17 files changed, 142 insertions(+), 62 deletions(-)
>
> diff --git a/src/gallium/drivers/r600/evergreen_compute.c
> b/src/gallium/drivers/r600/evergreen_compute.c
> index a77f58242e3..9085be4e2f3 100644
> --- a/src/gallium/drivers/r600/evergreen_compute.c
> +++ b/src/gallium/drivers/r600/evergreen_compute.c
> @@ -431,21 +431,23 @@ static void *evergreen_create_compute_state(struct
> pipe_context *ctx,
> COMPUTE_DBG(rctx->screen, "*** evergreen_create_compute_state\n");
> header = cso->prog;
> code = cso->prog + sizeof(struct pipe_llvm_program_header);
> radeon_shader_binary_init(&shader->binary);
> r600_elf_read(code, header->num_bytes, &shader->binary);
> r600_create_shader(&shader->bc, &shader->binary, &use_kill);
>
> /* Upload code + ROdata */
> shader->code_bo = r600_compute_buffer_alloc_vram(rctx->screen,
> shader->bc.ndw *
> 4);
> -   p = r600_buffer_map_sync_with_rings(&rctx->b, shader->code_bo,
> PIPE_TRANSFER_WRITE);
> +   p = r600_buffer_map_sync_with_rings(
> +   &rctx->b, shader->code_bo,
> +   PIPE_TRANSFER_WRITE | RADEON_TRANSFER_TEMPORARY);
> //TODO: use util_memcpy_cpu_to_le32 ?
> memcpy(p, shader->bc.bytecode, shader->bc.ndw * 4);
> rctx->b.ws->buffer_unmap(shader->code_bo->buf);
>  #endif
>
> return shader;
>  }
>
>  static void evergreen_delete_compute_state(struct pipe_context *ctx, void
> *state)
>  {
> diff --git a/src/gallium/drivers/r600/r600_asm.c
> b/src/gallium/drivers/r600/r600_asm.c
> index 7029be24f4b..4ba77c535f9 100644
> --- a/src/gallium/drivers/r600/r600_asm.c
> +++ b/src/gallium/drivers/r600/r600_asm.c
> @@ -2765,21 +2765,23 @@ void *r600_create_vertex_fetch_shader(struct
> pipe_context *ctx,
>
> u_suballocator_alloc(rctx->allocator_fetch_shader, fs_size, 256,
>  &shader->offset,
>  (struct pipe_resource**)&shader->buffer);
> if (!shader->buffer) {
> r600_bytecode_clear(&bc);
> FREE(shader);
> return NULL;
> }
>
> -   bytecode = r600_buffer_map_sync_with_rings(&rctx->b,
> shader->buffer, PIPE_TRANSFER_WRITE | PIPE_TRANSFER_UNSYNCHRONIZED);
> +   bytecode = r600_buffer_map_sync_with_rings
> +   (&rctx->b, shader->buffer,
> +   PIPE_TRANSFER_WRITE | PIPE_TRANSFER_UNSYNCHRONIZED |
> RADEON_TRANSFER_TEMPORARY);
> bytecode += shader->offset / 4;
>
> if (R600_BIG_ENDIAN) {
> for (i = 0; i < fs_size / 4; ++i) {
> bytecode[i] = util_cpu_to_le32(bc.bytecode[i]);
> }
> } else {
> memcpy(bytecode, bc.by

[Mesa-dev] [PATCH 2/2] winsys/amdgpu: explicitly declare whether buffer_map is permanent or not

2018-11-21 Thread Nicolai Hähnle
From: Nicolai Hähnle 

Introduce a new driver-private transfer flag RADEON_TRANSFER_TEMPORARY
that specifies whether the caller will use buffer_unmap or not. The
default behavior is set to permanent maps, because that's what drivers
do for Gallium buffer maps.

This should eliminate the need for hacks in libdrm. Assertions are added
to catch when the buffer_unmap calls don't match the (temporary)
buffer_map calls.

I did my best to update r600 and r300 as well for completeness (yes,
it's a no-op for r300 because it never calls buffer_unmap), even though
the radeon winsys ignores the new flag.

As an added bonus, this should actually improve the performance of
the normal fast path, because we no longer call into libdrm at all
after the first map, and there's one less atomic in the winsys itself
(there are now no atomics left in the UNSYNCHRONIZED fast path).

Cc: Leo Liu 
--
Leo, it'd be nice if you could confirm that all video buffer mappings
are temporary in this sense.
---
 src/gallium/drivers/r600/evergreen_compute.c |  4 +-
 src/gallium/drivers/r600/r600_asm.c  |  4 +-
 src/gallium/drivers/r600/r600_shader.c   |  4 +-
 src/gallium/drivers/r600/radeon_uvd.c|  8 +-
 src/gallium/drivers/r600/radeon_vce.c|  4 +-
 src/gallium/drivers/r600/radeon_video.c  |  6 +-
 src/gallium/drivers/radeon/radeon_uvd.c  | 10 ++-
 src/gallium/drivers/radeon/radeon_uvd_enc.c  |  6 +-
 src/gallium/drivers/radeon/radeon_vce.c  |  4 +-
 src/gallium/drivers/radeon/radeon_vcn_dec.c  | 18 ++--
 src/gallium/drivers/radeon/radeon_vcn_enc.c  |  4 +-
 src/gallium/drivers/radeon/radeon_video.c|  6 +-
 src/gallium/drivers/radeon/radeon_winsys.h   | 17 +++-
 src/gallium/drivers/radeonsi/si_shader.c |  3 +-
 src/gallium/include/pipe/p_defines.h |  8 +-
 src/gallium/winsys/amdgpu/drm/amdgpu_bo.c| 95 +---
 src/gallium/winsys/amdgpu/drm/amdgpu_bo.h|  3 +-
 17 files changed, 142 insertions(+), 62 deletions(-)

diff --git a/src/gallium/drivers/r600/evergreen_compute.c 
b/src/gallium/drivers/r600/evergreen_compute.c
index a77f58242e3..9085be4e2f3 100644
--- a/src/gallium/drivers/r600/evergreen_compute.c
+++ b/src/gallium/drivers/r600/evergreen_compute.c
@@ -431,21 +431,23 @@ static void *evergreen_create_compute_state(struct 
pipe_context *ctx,
COMPUTE_DBG(rctx->screen, "*** evergreen_create_compute_state\n");
header = cso->prog;
code = cso->prog + sizeof(struct pipe_llvm_program_header);
radeon_shader_binary_init(&shader->binary);
r600_elf_read(code, header->num_bytes, &shader->binary);
r600_create_shader(&shader->bc, &shader->binary, &use_kill);
 
/* Upload code + ROdata */
shader->code_bo = r600_compute_buffer_alloc_vram(rctx->screen,
shader->bc.ndw * 4);
-   p = r600_buffer_map_sync_with_rings(&rctx->b, shader->code_bo, 
PIPE_TRANSFER_WRITE);
+   p = r600_buffer_map_sync_with_rings(
+   &rctx->b, shader->code_bo,
+   PIPE_TRANSFER_WRITE | RADEON_TRANSFER_TEMPORARY);
//TODO: use util_memcpy_cpu_to_le32 ?
memcpy(p, shader->bc.bytecode, shader->bc.ndw * 4);
rctx->b.ws->buffer_unmap(shader->code_bo->buf);
 #endif
 
return shader;
 }
 
 static void evergreen_delete_compute_state(struct pipe_context *ctx, void 
*state)
 {
diff --git a/src/gallium/drivers/r600/r600_asm.c 
b/src/gallium/drivers/r600/r600_asm.c
index 7029be24f4b..4ba77c535f9 100644
--- a/src/gallium/drivers/r600/r600_asm.c
+++ b/src/gallium/drivers/r600/r600_asm.c
@@ -2765,21 +2765,23 @@ void *r600_create_vertex_fetch_shader(struct 
pipe_context *ctx,
 
u_suballocator_alloc(rctx->allocator_fetch_shader, fs_size, 256,
 &shader->offset,
 (struct pipe_resource**)&shader->buffer);
if (!shader->buffer) {
r600_bytecode_clear(&bc);
FREE(shader);
return NULL;
}
 
-   bytecode = r600_buffer_map_sync_with_rings(&rctx->b, shader->buffer, 
PIPE_TRANSFER_WRITE | PIPE_TRANSFER_UNSYNCHRONIZED);
+   bytecode = r600_buffer_map_sync_with_rings
+   (&rctx->b, shader->buffer,
+   PIPE_TRANSFER_WRITE | PIPE_TRANSFER_UNSYNCHRONIZED | 
RADEON_TRANSFER_TEMPORARY);
bytecode += shader->offset / 4;
 
if (R600_BIG_ENDIAN) {
for (i = 0; i < fs_size / 4; ++i) {
bytecode[i] = util_cpu_to_le32(bc.bytecode[i]);
}
} else {
memcpy(bytecode, bc.bytecode, fs_size);
}
rctx->b.ws->buffer_unmap(shader->buffer->buf);
diff --git a/src/gallium/drivers/r600/r600_shader.c 
b/src/gallium/drivers/r600/r600_shader.c
index 408939d1105..fc826470d69 100644
--- a/src/gallium/drivers/r600/r600_shader.c
+++ b/src/gallium/drivers/r600/r600_shader.c
@@ -134,21 +134,23 @@ static int store_shader(struct pipe_contex