Re: [Mesa-dev] [PATCH 15/18] radeonsi: upload constants into VRAM instead of GTT
On 16.02.2017 23:36, Marek Olšák wrote:
> On Thu, Feb 16, 2017 at 4:21 PM, Nicolai Hähnle wrote:
>> On 16.02.2017 13:53, Marek Olšák wrote:
>>> From: Marek Olšák
>>>
>>> This lowers lgkm wait cycles by 30% on VI under normal conditions.
>>> There might be a measurable improvement when CE is disabled (radeon)
>>> or under L2 thrashing.
>>
>> Good idea. I'm just wondering if all the users of const upload end up
>> as streaming writes? I hope we don't accidentally hit some place where
>> writes from the CPU end up extremely slow, e.g. where st/mesa uploads
>> some structures.
>
> I think constant buffers always benefit from being in VRAM. If every
> CU loads a value from a constant buffer, you'll get at least 16 TC L2
> read requests on Fiji (each group of 4 CUs submits one), which can be
> misses under thrashing. This is very different from "streaming", where
> you expect to get exactly 1 read request for each piece of data.

Good point.

> The small problem with VRAM uploads may be write combining. I don't
> know the alignment at which it operates and how exactly it works. E.g.
> if we get 2 16-byte uploads aligned to 32, there is an untouched hole
> of 16 bytes. Does the hole have any effect on upload performance?
> u_upload_mgr could fill all holes if it was a problem.

So some quick googling found this:
https://fgiesen.wordpress.com/2013/01/29/write-combining-is-not-your-friend/
with the main three rules in the conclusion:

- Never read from write-combined memory.
- Try to keep writes sequential. This is good style even when it's not
  strictly necessary. On processors with picky write-combining logic,
  you might also need to use volatile or some other way to cause the
  compiler not to reorder instructions.
- Don't leave holes. Always write large, contiguous ranges.

All uses of u_upload_data should be fine (it's just a memcpy). Your
example of 2 16-byte uploads aligned to 32 isn't ideal, but it's
probably not terrible.
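[Editor's note: the three rules above can be condensed into a tiny upload helper. The sketch below is illustrative only — the names `toy_uploader`/`toy_upload` are made up, it is not u_upload_mgr's actual code, and ordinary heap memory stands in for a real write-combined mapping so it stays runnable.]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for a write-combined upload buffer. Real WC memory would
 * come from the kernel driver's mapping of VRAM/GTT; a plain byte array
 * is used here only so the sketch is runnable. */
struct toy_uploader {
	uint8_t *map;     /* pretend-WC mapping, written front to back */
	unsigned offset;  /* next free byte */
	unsigned size;
};

/* Copy 'len' bytes into the buffer at the next offset aligned to
 * 'alignment' (a power of two), zero-filling any alignment hole first
 * so the CPU never leaves an untouched gap between WC bursts. */
static unsigned toy_upload(struct toy_uploader *u, const void *data,
			   unsigned len, unsigned alignment)
{
	unsigned aligned = (u->offset + alignment - 1) & ~(alignment - 1);

	assert(aligned + len <= u->size);

	/* Rule "don't leave holes": write the pad bytes instead of
	 * skipping them. */
	memset(u->map + u->offset, 0, aligned - u->offset);

	/* Rule "keep writes sequential": one forward memcpy, and we
	 * never read the mapping back. */
	memcpy(u->map + aligned, data, len);

	u->offset = aligned + len;
	return aligned;  /* offset handed to the GPU */
}
```

With this helper, the 2-16-byte-uploads-aligned-to-32 case lands at offsets 0 and 32, and the 16-byte hole in between is zero-filled rather than left untouched.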
Scanning st/mesa, I see the following potentially questionable pieces
of code:

1) st_DrawAtlasBitmaps: If the compiler reorders the verts->XYZ
   writes, that could be bad. But this is obsolete functionality, so
   we don't care.
2) st_DrawTex: similar.
3) st_draw_quad: Used for scissored/windowed clears. Reordering by
   the compiler could potentially hurt; we may want to look into that.
4) st_pbo_draw: same as st_draw_quad.

The blitter only uses u_upload_data, which is fine. Also, none of the
issues affect large uploads, so I think we're good.

The patch is:

Reviewed-by: Nicolai Hähnle

> Also, Feral's games upload directly to VRAM all the time. This patch
> is nothing compared to what they're doing.
>
> Marek

___
mesa-dev mailing list
mesa-dev@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/mesa-dev
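[Editor's note: the reordering concern in items 1-4 is that the compiler may reorder plain stores into a WC mapping. A conventional fix is to write through a volatile-qualified pointer, as the quoted rules suggest. The sketch below is illustrative; `struct vert` and `write_vert_seq` are invented names, not the real st/mesa code.]

```c
/* Hypothetical vertex layout, standing in for the verts->XYZ stores in
 * st_DrawAtlasBitmaps and friends. */
struct vert {
	float x, y, z;
};

/* Stores through a volatile-qualified pointer must be emitted in
 * source order, which keeps the traffic to a write-combined mapping
 * sequential instead of letting the compiler shuffle it. */
static void write_vert_seq(volatile struct vert *v,
			   float x, float y, float z)
{
	v->x = x;  /* these three stores issue in exactly this order */
	v->y = y;
	v->z = z;
}
```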
Re: [Mesa-dev] [PATCH 15/18] radeonsi: upload constants into VRAM instead of GTT
On Thu, Feb 16, 2017 at 4:21 PM, Nicolai Hähnle wrote:
> On 16.02.2017 13:53, Marek Olšák wrote:
>>
>> From: Marek Olšák
>>
>> This lowers lgkm wait cycles by 30% on VI under normal conditions.
>> There might be a measurable improvement when CE is disabled (radeon)
>> or under L2 thrashing.
>
>
> Good idea. I'm just wondering if all the users of const upload end up as
> streaming writes? I hope we don't accidentally hit some place where writes
> from the CPU end up extremely slow, e.g. where st/mesa uploads some
> structures.

I think constant buffers always benefit from being in VRAM. If every
CU loads a value from a constant buffer, you'll get at least 16 TC L2
read requests on Fiji (each group of 4 CUs submits one), which can be
misses under thrashing. This is very different from "streaming", where
you expect to get exactly 1 read request for each piece of data.

The small problem with VRAM uploads may be write combining. I don't
know the alignment at which it operates and how exactly it works. E.g.
if we get 2 16-byte uploads aligned to 32, there is an untouched hole
of 16 bytes. Does the hole have any effect on upload performance?
u_upload_mgr could fill all holes if it was a problem.

Also, Feral's games upload directly to VRAM all the time. This patch
is nothing compared to what they're doing.

Marek
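[Editor's note: the arithmetic behind the "16 TC L2 read requests" figure above, assuming Fiji's 64 CUs and the mail's statement that each group of 4 CUs submits one request:]

```c
/* If every CU loads the same constant and each group of 'cus_per_group'
 * CUs shares one request path to the TC L2, the load fans out into
 * num_cus / cus_per_group separate L2 requests -- each a potential miss
 * under thrashing -- versus a single request for truly streamed data. */
static unsigned l2_read_requests(unsigned num_cus, unsigned cus_per_group)
{
	return num_cus / cus_per_group;
}
```

For Fiji: 64 CUs in groups of 4 gives 16 requests, matching the figure in the mail.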
Re: [Mesa-dev] [PATCH 15/18] radeonsi: upload constants into VRAM instead of GTT
On 16.02.2017 13:53, Marek Olšák wrote:
> From: Marek Olšák
>
> This lowers lgkm wait cycles by 30% on VI under normal conditions.
> There might be a measurable improvement when CE is disabled (radeon)
> or under L2 thrashing.

Good idea. I'm just wondering if all the users of const upload end up
as streaming writes? I hope we don't accidentally hit some place where
writes from the CPU end up extremely slow, e.g. where st/mesa uploads
some structures.

Nicolai

> ---
>  src/gallium/drivers/radeon/r600_pipe_common.c | 11 ---
>  src/gallium/drivers/radeonsi/si_compute.c | 4 ++--
>  src/gallium/drivers/radeonsi/si_descriptors.c | 6 +++---
>  src/gallium/drivers/radeonsi/si_state.c | 7 +--
>  4 files changed, 18 insertions(+), 10 deletions(-)
>
> diff --git a/src/gallium/drivers/radeon/r600_pipe_common.c b/src/gallium/drivers/radeon/r600_pipe_common.c
> index d573b39..1781584 100644
> --- a/src/gallium/drivers/radeon/r600_pipe_common.c
> +++ b/src/gallium/drivers/radeon/r600_pipe_common.c
> @@ -600,21 +600,25 @@ bool r600_common_context_init(struct r600_common_context *rctx,
>  	rctx->allocator_zeroed_memory =
>  		u_suballocator_create(&rctx->b, rscreen->info.gart_page_size,
>  				      0, PIPE_USAGE_DEFAULT, 0, true);
>  	if (!rctx->allocator_zeroed_memory)
>  		return false;
>
>  	rctx->b.stream_uploader = u_upload_create(&rctx->b, 1024 * 1024,
>  						  0, PIPE_USAGE_STREAM);
>  	if (!rctx->b.stream_uploader)
>  		return false;
> -	rctx->b.const_uploader = rctx->b.stream_uploader;
> +
> +	rctx->b.const_uploader = u_upload_create(&rctx->b, 128 * 1024,
> +						 0, PIPE_USAGE_DEFAULT);
> +	if (!rctx->b.const_uploader)
> +		return false;
>
>  	rctx->ctx = rctx->ws->ctx_create(rctx->ws);
>  	if (!rctx->ctx)
>  		return false;
>
>  	if (rscreen->info.has_sdma && !(rscreen->debug_flags & DBG_NO_ASYNC_DMA)) {
>  		rctx->dma.cs = rctx->ws->cs_create(rctx->ctx, RING_DMA,
>  						   r600_flush_dma_ring,
>  						   rctx);
>  		rctx->dma.flush = r600_flush_dma_ring;
> @@ -642,23 +646,24 @@ void r600_common_context_cleanup(struct r600_common_context *rctx)
>  	if (rctx->query_result_shader)
>  		rctx->b.delete_compute_state(&rctx->b, rctx->query_result_shader);
>
>  	if (rctx->gfx.cs)
>  		rctx->ws->cs_destroy(rctx->gfx.cs);
>  	if (rctx->dma.cs)
>  		rctx->ws->cs_destroy(rctx->dma.cs);
>  	if (rctx->ctx)
>  		rctx->ws->ctx_destroy(rctx->ctx);
>
> -	if (rctx->b.stream_uploader) {
> +	if (rctx->b.stream_uploader)
>  		u_upload_destroy(rctx->b.stream_uploader);
> -	}
> +	if (rctx->b.const_uploader)
> +		u_upload_destroy(rctx->b.const_uploader);
>
>  	slab_destroy_child(&rctx->pool_transfers);
>
>  	if (rctx->allocator_zeroed_memory) {
>  		u_suballocator_destroy(rctx->allocator_zeroed_memory);
>  	}
>  	rctx->ws->fence_reference(&rctx->last_gfx_fence, NULL);
>  	rctx->ws->fence_reference(&rctx->last_sdma_fence, NULL);
>  }
> diff --git a/src/gallium/drivers/radeonsi/si_compute.c b/src/gallium/drivers/radeonsi/si_compute.c
> index 381837c..88d72c1 100644
> --- a/src/gallium/drivers/radeonsi/si_compute.c
> +++ b/src/gallium/drivers/radeonsi/si_compute.c
> @@ -496,21 +496,21 @@ static void si_setup_user_sgprs_co_v2(struct si_context *sctx,
>  	dispatch.grid_size_x = info->grid[0] * info->block[0];
>  	dispatch.grid_size_y = info->grid[1] * info->block[1];
>  	dispatch.grid_size_z = info->grid[2] * info->block[2];
>
>  	dispatch.private_segment_size = program->private_size;
>  	dispatch.group_segment_size = program->local_size;
>
>  	dispatch.kernarg_address = kernel_args_va;
>
> -	u_upload_data(sctx->b.b.stream_uploader, 0, sizeof(dispatch),
> +	u_upload_data(sctx->b.b.const_uploader, 0, sizeof(dispatch),
>  		      256, &dispatch, &dispatch_offset,
>  		      (struct pipe_resource**)&dispatch_buf);
>
>  	if (!dispatch_buf) {
>  		fprintf(stderr, "Error: Failed to allocate dispatch "
>  				"packet.");
>  	}
>  	radeon_add_to_buffer_list(&sctx->b, &sctx->b.gfx, dispatch_buf,
>  				  RADEON_USAGE_READ, RADEON_PRIO_CONST_BUFFER);
> @@ -558,21 +558,21 @@ static void si_upload_compute_input(struct si_context *sctx,
>  	unsigned num_work_size_bytes = program->use_code_object_v2 ? 0 : 36;
>  	uint32_t kernel_args_offset = 0;
>  	uint32_t *kernel_args;
>  	void *kernel_args_ptr;
>  	uint64_t kernel_args_va;
>  	unsigned i;
>
>  	/* The extra num_work_size_bytes are for work group / work item size information */
[Mesa-dev] [PATCH 15/18] radeonsi: upload constants into VRAM instead of GTT
From: Marek Olšák

This lowers lgkm wait cycles by 30% on VI under normal conditions.
There might be a measurable improvement when CE is disabled (radeon)
or under L2 thrashing.
---
 src/gallium/drivers/radeon/r600_pipe_common.c | 11 ---
 src/gallium/drivers/radeonsi/si_compute.c | 4 ++--
 src/gallium/drivers/radeonsi/si_descriptors.c | 6 +++---
 src/gallium/drivers/radeonsi/si_state.c | 7 +--
 4 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/src/gallium/drivers/radeon/r600_pipe_common.c b/src/gallium/drivers/radeon/r600_pipe_common.c
index d573b39..1781584 100644
--- a/src/gallium/drivers/radeon/r600_pipe_common.c
+++ b/src/gallium/drivers/radeon/r600_pipe_common.c
@@ -600,21 +600,25 @@ bool r600_common_context_init(struct r600_common_context *rctx,
 	rctx->allocator_zeroed_memory =
 		u_suballocator_create(&rctx->b, rscreen->info.gart_page_size,
 				      0, PIPE_USAGE_DEFAULT, 0, true);
 	if (!rctx->allocator_zeroed_memory)
 		return false;

 	rctx->b.stream_uploader = u_upload_create(&rctx->b, 1024 * 1024,
 						  0, PIPE_USAGE_STREAM);
 	if (!rctx->b.stream_uploader)
 		return false;
-	rctx->b.const_uploader = rctx->b.stream_uploader;
+
+	rctx->b.const_uploader = u_upload_create(&rctx->b, 128 * 1024,
+						 0, PIPE_USAGE_DEFAULT);
+	if (!rctx->b.const_uploader)
+		return false;

 	rctx->ctx = rctx->ws->ctx_create(rctx->ws);
 	if (!rctx->ctx)
 		return false;

 	if (rscreen->info.has_sdma && !(rscreen->debug_flags & DBG_NO_ASYNC_DMA)) {
 		rctx->dma.cs = rctx->ws->cs_create(rctx->ctx, RING_DMA,
 						   r600_flush_dma_ring,
 						   rctx);
 		rctx->dma.flush = r600_flush_dma_ring;
@@ -642,23 +646,24 @@ void r600_common_context_cleanup(struct r600_common_context *rctx)
 	if (rctx->query_result_shader)
 		rctx->b.delete_compute_state(&rctx->b, rctx->query_result_shader);

 	if (rctx->gfx.cs)
 		rctx->ws->cs_destroy(rctx->gfx.cs);
 	if (rctx->dma.cs)
 		rctx->ws->cs_destroy(rctx->dma.cs);
 	if (rctx->ctx)
 		rctx->ws->ctx_destroy(rctx->ctx);

-	if (rctx->b.stream_uploader) {
+	if (rctx->b.stream_uploader)
 		u_upload_destroy(rctx->b.stream_uploader);
-	}
+	if (rctx->b.const_uploader)
+		u_upload_destroy(rctx->b.const_uploader);

 	slab_destroy_child(&rctx->pool_transfers);

 	if (rctx->allocator_zeroed_memory) {
 		u_suballocator_destroy(rctx->allocator_zeroed_memory);
 	}
 	rctx->ws->fence_reference(&rctx->last_gfx_fence, NULL);
 	rctx->ws->fence_reference(&rctx->last_sdma_fence, NULL);
 }
diff --git a/src/gallium/drivers/radeonsi/si_compute.c b/src/gallium/drivers/radeonsi/si_compute.c
index 381837c..88d72c1 100644
--- a/src/gallium/drivers/radeonsi/si_compute.c
+++ b/src/gallium/drivers/radeonsi/si_compute.c
@@ -496,21 +496,21 @@ static void si_setup_user_sgprs_co_v2(struct si_context *sctx,
 	dispatch.grid_size_x = info->grid[0] * info->block[0];
 	dispatch.grid_size_y = info->grid[1] * info->block[1];
 	dispatch.grid_size_z = info->grid[2] * info->block[2];

 	dispatch.private_segment_size = program->private_size;
 	dispatch.group_segment_size = program->local_size;

 	dispatch.kernarg_address = kernel_args_va;

-	u_upload_data(sctx->b.b.stream_uploader, 0, sizeof(dispatch),
+	u_upload_data(sctx->b.b.const_uploader, 0, sizeof(dispatch),
 		      256, &dispatch, &dispatch_offset,
 		      (struct pipe_resource**)&dispatch_buf);

 	if (!dispatch_buf) {
 		fprintf(stderr, "Error: Failed to allocate dispatch "
 				"packet.");
 	}
 	radeon_add_to_buffer_list(&sctx->b, &sctx->b.gfx, dispatch_buf,
 				  RADEON_USAGE_READ, RADEON_PRIO_CONST_BUFFER);
@@ -558,21 +558,21 @@ static void si_upload_compute_input(struct si_context *sctx,
 	unsigned num_work_size_bytes = program->use_code_object_v2 ? 0 : 36;
 	uint32_t kernel_args_offset = 0;
 	uint32_t *kernel_args;
 	void *kernel_args_ptr;
 	uint64_t kernel_args_va;
 	unsigned i;

 	/* The extra num_work_size_bytes are for work group / work item size information */
 	kernel_args_size = program->input_size + num_work_size_bytes;

-	u_upload_alloc(sctx->b.b.stream_uploader, 0, kernel_args_size,
+	u_upload_alloc(sctx->b.b.const_uploader, 0, kernel_args_size,