Re: [Mesa-dev] [PATCH 15/18] radeonsi: upload constants into VRAM instead of GTT

2017-02-17 Thread Nicolai Hähnle

On 16.02.2017 23:36, Marek Olšák wrote:

On Thu, Feb 16, 2017 at 4:21 PM, Nicolai Hähnle  wrote:

On 16.02.2017 13:53, Marek Olšák wrote:


From: Marek Olšák 

This lowers lgkm wait cycles by 30% on VI under normal conditions.
There might be a measurable improvement when CE is disabled (radeon)
or under L2 thrashing.



Good idea. I'm just wondering whether all the users of const upload end
up as streaming writes. I hope we don't accidentally hit some place
where writes from the CPU end up extremely slow, e.g. where st/mesa
uploads some structures.


I think constant buffers always benefit from being in VRAM. If every
CU loads a value from a constant buffer, you'll get at least 16 TC L2
read requests on Fiji (each group of 4 CUs submits one), any of which
can miss under thrashing. This is very different from "streaming",
where you expect exactly 1 read request for each piece of data.


Good point.



The one small problem with VRAM uploads may be write combining. I don't
know the alignment at which it operates or how exactly it works. E.g.
if we get 2 16-byte uploads aligned to 32, there is an untouched hole
of 16 bytes. Does the hole have any effect on upload performance?
u_upload_mgr could fill all holes if that turned out to be a problem.


So some quick googling found this:
https://fgiesen.wordpress.com/2013/01/29/write-combining-is-not-your-friend/
with the three main rules in its conclusion:


- Never read from write-combined memory.

- Try to keep writes sequential. This is good style even when it's not
strictly necessary. On processors with picky write-combining logic, you
might also need to use volatile or some other way to cause the compiler
not to reorder instructions.

- Don't leave holes. Always write large, contiguous ranges.

All uses of u_upload_data should be fine (it's just a memcpy). Your 
example of 2 16-byte uploads aligned to 32 isn't ideal, but it's 
probably not terrible.
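
To picture that example concretely, here's an illustration-only sketch
(hypothetical function, not driver code) of the write pattern in question:

  /* Two 16-byte uploads at 32-byte alignment leave a 16-byte untouched
   * hole between them, which is exactly what rule 3 warns about. */
  #include <string.h>

  void upload_two_constants(char *wc_map, const char *a, const char *b)
  {
          memcpy(wc_map + 0,  a, 16);  /* bytes  0..15 written       */
                                       /* bytes 16..31 never touched */
          memcpy(wc_map + 32, b, 16);  /* bytes 32..47 written       */
  }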


Scanning st/mesa, I see the following potentially questionable pieces of 
code:


1) st_DrawAtlasBitmaps: If the compiler reorders the verts->XYZ writes,
that could be bad. But this is obsolete functionality, so we don't care.

2) st_DrawTex: similar.

3) st_draw_quad: Used for scissored/windowed clears. Reordering by the
compiler could potentially hurt; we may want to look into that (see the
sketch after this list).

4) st_pbo_draw: same as st_draw_quad.
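
A minimal sketch of the kind of fix that would rule out such reordering
(hypothetical helper, not actual st/mesa code):

  /* Storing through a volatile pointer keeps the stores in program
   * order, so the WC mapping is written sequentially, without holes. */
  void write_vertex_xyz(float *wc_verts, float x, float y, float z)
  {
          volatile float *v = wc_verts;
          v[0] = x;  /* the compiler may not reorder */
          v[1] = y;  /* or elide these three stores  */
          v[2] = z;
  }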

The blitter only uses u_upload_data, which is fine.

Also, none of the issues affect large uploads, so I think we're good.

The patch is: Reviewed-by: Nicolai Hähnle 



Also, Feral's games upload directly to VRAM all the time. This patch
is nothing compared to what they're doing.

Marek





Re: [Mesa-dev] [PATCH 15/18] radeonsi: upload constants into VRAM instead of GTT

2017-02-16 Thread Marek Olšák
On Thu, Feb 16, 2017 at 4:21 PM, Nicolai Hähnle  wrote:
> On 16.02.2017 13:53, Marek Olšák wrote:
>>
>> From: Marek Olšák 
>>
>> This lowers lgkm wait cycles by 30% on VI under normal conditions.
>> There might be a measurable improvement when CE is disabled (radeon)
>> or under L2 thrashing.
>
>
> Good idea. I'm just wondering whether all the users of const upload end
> up as streaming writes. I hope we don't accidentally hit some place
> where writes from the CPU end up extremely slow, e.g. where st/mesa
> uploads some structures.

I think constant buffers always benefit from being in VRAM. If every
CU loads a value from a constant buffer, you'll get at least 16 TC L2
read requests on Fiji (each group of 4 CUs submits one), any of which
can miss under thrashing. This is very different from "streaming",
where you expect exactly 1 read request for each piece of data.
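
For concreteness, assuming Fiji's 64 CUs: 64 CUs / 4 CUs per group = 16
L2 read requests for a single uniform load across the whole chip, versus
the 1 request you'd expect in the streaming case.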

The one small problem with VRAM uploads may be write combining. I don't
know the alignment at which it operates or how exactly it works. E.g.
if we get 2 16-byte uploads aligned to 32, there is an untouched hole
of 16 bytes. Does the hole have any effect on upload performance?
u_upload_mgr could fill all holes if that turned out to be a problem.
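
If holes did turn out to matter, the fix could look roughly like this
minimal sketch (hypothetical helper, not actual u_upload_mgr code):

  #include <string.h>

  /* Before writing at a new aligned offset, pad the gap left by the
   * previous upload so the WC buffer sees one contiguous write range. */
  static void fill_wc_hole(char *map, unsigned prev_end, unsigned new_start)
  {
          if (new_start > prev_end)
                  memset(map + prev_end, 0, new_start - prev_end);
  }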

Also, Feral's games upload directly to VRAM all the time. This patch
is nothing compared to what they're doing.

Marek


Re: [Mesa-dev] [PATCH 15/18] radeonsi: upload constants into VRAM instead of GTT

2017-02-16 Thread Nicolai Hähnle

On 16.02.2017 13:53, Marek Olšák wrote:

From: Marek Olšák 

This lowers lgkm wait cycles by 30% on VI under normal conditions.
There might be a measurable improvement when CE is disabled (radeon)
or under L2 thrashing.


Good idea. I'm just wondering whether all the users of const upload end
up as streaming writes. I hope we don't accidentally hit some place
where writes from the CPU end up extremely slow, e.g. where st/mesa
uploads some structures.


Nicolai



---
 src/gallium/drivers/radeon/r600_pipe_common.c | 11 ---
 src/gallium/drivers/radeonsi/si_compute.c |  4 ++--
 src/gallium/drivers/radeonsi/si_descriptors.c |  6 +++---
 src/gallium/drivers/radeonsi/si_state.c   |  7 +--
 4 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/src/gallium/drivers/radeon/r600_pipe_common.c 
b/src/gallium/drivers/radeon/r600_pipe_common.c
index d573b39..1781584 100644
--- a/src/gallium/drivers/radeon/r600_pipe_common.c
+++ b/src/gallium/drivers/radeon/r600_pipe_common.c
@@ -600,21 +600,25 @@ bool r600_common_context_init(struct r600_common_context 
*rctx,
rctx->allocator_zeroed_memory =
u_suballocator_create(&rctx->b, rscreen->info.gart_page_size,
  0, PIPE_USAGE_DEFAULT, 0, true);
if (!rctx->allocator_zeroed_memory)
return false;

rctx->b.stream_uploader = u_upload_create(&rctx->b, 1024 * 1024,
  0, PIPE_USAGE_STREAM);
if (!rctx->b.stream_uploader)
return false;
-   rctx->b.const_uploader = rctx->b.stream_uploader;
+
+   rctx->b.const_uploader = u_upload_create(&rctx->b, 128 * 1024,
+0, PIPE_USAGE_DEFAULT);
+   if (!rctx->b.const_uploader)
+   return false;

rctx->ctx = rctx->ws->ctx_create(rctx->ws);
if (!rctx->ctx)
return false;

if (rscreen->info.has_sdma && !(rscreen->debug_flags & 
DBG_NO_ASYNC_DMA)) {
rctx->dma.cs = rctx->ws->cs_create(rctx->ctx, RING_DMA,
   r600_flush_dma_ring,
   rctx);
rctx->dma.flush = r600_flush_dma_ring;
@@ -642,23 +646,24 @@ void r600_common_context_cleanup(struct 
r600_common_context *rctx)
if (rctx->query_result_shader)
rctx->b.delete_compute_state(&rctx->b,
rctx->query_result_shader);

if (rctx->gfx.cs)
rctx->ws->cs_destroy(rctx->gfx.cs);
if (rctx->dma.cs)
rctx->ws->cs_destroy(rctx->dma.cs);
if (rctx->ctx)
rctx->ws->ctx_destroy(rctx->ctx);

-   if (rctx->b.stream_uploader) {
+   if (rctx->b.stream_uploader)
u_upload_destroy(rctx->b.stream_uploader);
-   }
+   if (rctx->b.const_uploader)
+   u_upload_destroy(rctx->b.const_uploader);

slab_destroy_child(&rctx->pool_transfers);

if (rctx->allocator_zeroed_memory) {
u_suballocator_destroy(rctx->allocator_zeroed_memory);
}
rctx->ws->fence_reference(&rctx->last_gfx_fence, NULL);
rctx->ws->fence_reference(&rctx->last_sdma_fence, NULL);
 }

diff --git a/src/gallium/drivers/radeonsi/si_compute.c 
b/src/gallium/drivers/radeonsi/si_compute.c
index 381837c..88d72c1 100644
--- a/src/gallium/drivers/radeonsi/si_compute.c
+++ b/src/gallium/drivers/radeonsi/si_compute.c
@@ -496,21 +496,21 @@ static void si_setup_user_sgprs_co_v2(struct si_context 
*sctx,

dispatch.grid_size_x = info->grid[0] * info->block[0];
dispatch.grid_size_y = info->grid[1] * info->block[1];
dispatch.grid_size_z = info->grid[2] * info->block[2];

dispatch.private_segment_size = program->private_size;
dispatch.group_segment_size = program->local_size;

dispatch.kernarg_address = kernel_args_va;

-   u_upload_data(sctx->b.b.stream_uploader, 0, sizeof(dispatch),
+   u_upload_data(sctx->b.b.const_uploader, 0, sizeof(dispatch),
   256, &dispatch, &dispatch_offset,
   (struct pipe_resource**)&dispatch_buf);

if (!dispatch_buf) {
fprintf(stderr, "Error: Failed to allocate dispatch "
"packet.");
}
radeon_add_to_buffer_list(&sctx->b, &sctx->b.gfx, dispatch_buf,
  RADEON_USAGE_READ, RADEON_PRIO_CONST_BUFFER);

@@ -558,21 +558,21 @@ static void si_upload_compute_input(struct si_context 
*sctx,
unsigned num_work_size_bytes = program->use_code_object_v2 ? 0 : 36;
uint32_t kernel_args_offset = 0;
uint32_t *kernel_args;
void *kernel_args_ptr;
uint64_t kernel_args_va;
unsigned i;

/* The extra num_work_size_bytes are for work group / work item size 
information */
 

[Mesa-dev] [PATCH 15/18] radeonsi: upload constants into VRAM instead of GTT

2017-02-16 Thread Marek Olšák
From: Marek Olšák 

This lowers lgkm wait cycles by 30% on VI under normal conditions.
There might be a measurable improvement when CE is disabled (radeon)
or under L2 thrashing.
---
 src/gallium/drivers/radeon/r600_pipe_common.c | 11 ---
 src/gallium/drivers/radeonsi/si_compute.c |  4 ++--
 src/gallium/drivers/radeonsi/si_descriptors.c |  6 +++---
 src/gallium/drivers/radeonsi/si_state.c   |  7 +--
 4 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/src/gallium/drivers/radeon/r600_pipe_common.c 
b/src/gallium/drivers/radeon/r600_pipe_common.c
index d573b39..1781584 100644
--- a/src/gallium/drivers/radeon/r600_pipe_common.c
+++ b/src/gallium/drivers/radeon/r600_pipe_common.c
@@ -600,21 +600,25 @@ bool r600_common_context_init(struct r600_common_context 
*rctx,
rctx->allocator_zeroed_memory =
u_suballocator_create(&rctx->b, rscreen->info.gart_page_size,
  0, PIPE_USAGE_DEFAULT, 0, true);
if (!rctx->allocator_zeroed_memory)
return false;
 
rctx->b.stream_uploader = u_upload_create(&rctx->b, 1024 * 1024,
  0, PIPE_USAGE_STREAM);
if (!rctx->b.stream_uploader)
return false;
-   rctx->b.const_uploader = rctx->b.stream_uploader;
+
+   rctx->b.const_uploader = u_upload_create(&rctx->b, 128 * 1024,
+0, PIPE_USAGE_DEFAULT);
+   if (!rctx->b.const_uploader)
+   return false;
 
rctx->ctx = rctx->ws->ctx_create(rctx->ws);
if (!rctx->ctx)
return false;
 
if (rscreen->info.has_sdma && !(rscreen->debug_flags & 
DBG_NO_ASYNC_DMA)) {
rctx->dma.cs = rctx->ws->cs_create(rctx->ctx, RING_DMA,
   r600_flush_dma_ring,
   rctx);
rctx->dma.flush = r600_flush_dma_ring;
@@ -642,23 +646,24 @@ void r600_common_context_cleanup(struct 
r600_common_context *rctx)
if (rctx->query_result_shader)
rctx->b.delete_compute_state(&rctx->b,
rctx->query_result_shader);
 
if (rctx->gfx.cs)
rctx->ws->cs_destroy(rctx->gfx.cs);
if (rctx->dma.cs)
rctx->ws->cs_destroy(rctx->dma.cs);
if (rctx->ctx)
rctx->ws->ctx_destroy(rctx->ctx);
 
-   if (rctx->b.stream_uploader) {
+   if (rctx->b.stream_uploader)
u_upload_destroy(rctx->b.stream_uploader);
-   }
+   if (rctx->b.const_uploader)
+   u_upload_destroy(rctx->b.const_uploader);
 
slab_destroy_child(&rctx->pool_transfers);
 
if (rctx->allocator_zeroed_memory) {
u_suballocator_destroy(rctx->allocator_zeroed_memory);
}
rctx->ws->fence_reference(&rctx->last_gfx_fence, NULL);
rctx->ws->fence_reference(&rctx->last_sdma_fence, NULL);
 }
 
diff --git a/src/gallium/drivers/radeonsi/si_compute.c 
b/src/gallium/drivers/radeonsi/si_compute.c
index 381837c..88d72c1 100644
--- a/src/gallium/drivers/radeonsi/si_compute.c
+++ b/src/gallium/drivers/radeonsi/si_compute.c
@@ -496,21 +496,21 @@ static void si_setup_user_sgprs_co_v2(struct si_context 
*sctx,
 
dispatch.grid_size_x = info->grid[0] * info->block[0];
dispatch.grid_size_y = info->grid[1] * info->block[1];
dispatch.grid_size_z = info->grid[2] * info->block[2];
 
dispatch.private_segment_size = program->private_size;
dispatch.group_segment_size = program->local_size;
 
dispatch.kernarg_address = kernel_args_va;
 
-   u_upload_data(sctx->b.b.stream_uploader, 0, sizeof(dispatch),
+   u_upload_data(sctx->b.b.const_uploader, 0, sizeof(dispatch),
   256, &dispatch, &dispatch_offset,
   (struct pipe_resource**)&dispatch_buf);
 
if (!dispatch_buf) {
fprintf(stderr, "Error: Failed to allocate dispatch "
"packet.");
}
radeon_add_to_buffer_list(&sctx->b, &sctx->b.gfx, dispatch_buf,
  RADEON_USAGE_READ, RADEON_PRIO_CONST_BUFFER);
 
@@ -558,21 +558,21 @@ static void si_upload_compute_input(struct si_context 
*sctx,
unsigned num_work_size_bytes = program->use_code_object_v2 ? 0 : 36;
uint32_t kernel_args_offset = 0;
uint32_t *kernel_args;
void *kernel_args_ptr;
uint64_t kernel_args_va;
unsigned i;
 
/* The extra num_work_size_bytes are for work group / work item size 
information */
kernel_args_size = program->input_size + num_work_size_bytes;
 
-   u_upload_alloc(sctx->b.b.stream_uploader, 0, kernel_args_size,
+   u_upload_alloc(sctx->b.b.const_uploader, 0, kernel_args_size,