Am 10.01.19 um 12:41 schrieb Marek Olšák:
On Thu, Jan 10, 2019, 4:15 AM Koenig, Christian
<christian.koe...@amd.com <mailto:christian.koe...@amd.com> wrote:
Am 10.01.19 um 00:39 schrieb Marek Olšák:
On Wed, Jan 9, 2019 at 1:41 PM Christian König
<ckoenig.leichtzumer...@gmail.com
<mailto:ckoenig.leichtzumer...@gmail.com>> wrote:
Am 09.01.19 um 17:14 schrieb Marek Olšák:
On Wed, Jan 9, 2019 at 8:09 AM Christian König
<ckoenig.leichtzumer...@gmail.com
<mailto:ckoenig.leichtzumer...@gmail.com>> wrote:
Am 09.01.19 um 13:36 schrieb Marek Olšák:
On Wed, Jan 9, 2019, 5:28 AM Christian König
<ckoenig.leichtzumer...@gmail.com
<mailto:ckoenig.leichtzumer...@gmail.com> wrote:
Looks good, but I'm wondering what's the actual
improvement?
No malloc calls and 1 less for loop copying the bo list.
Yeah, but didn't we want to get completely rid of the bo
list?
If we have multiple IBs (e.g. gfx + compute) that share a BO
list, I think it's faster to send the BO list to the kernel
only once.
That's not really faster.
The only thing we safe us is a single loop over all BOs to
lockup the handle into a pointer and that is only a tiny
fraction of the overhead.
The majority of the overhead is locking the BOs and reserving
space for the submission.
What could really help here is to submit gfx+comput together
in just one CS IOCTL. This way we would need the locking and
space reservation only once.
It's a bit of work in the kernel side, but certainly doable.
OK. Any objections to this patch?
In general I'm wondering if we couldn't avoid adding so much new
interface.
There are Vulkan drivers that still use the bo_list interface.
For example we can avoid the malloc() when we just cache the last
freed bo_list structure in the device. We would just need an
atomic pointer exchange operation for that.
This way we even don't need to change mesa at all.
There is still the for loop that we need to get rid of.
Yeah, but that I'm fine to handle with a amdgpu_bo_list_create_raw which
only takes the handles and still returns the amdgpu_bo_list structure we
are used to.
See what I'm mostly concerned about is having another CS function to
maintain.
Regarding optimization, this chunk can be replaced by a cast on 64bit:
+ chunk_array = alloca(sizeof(uint64_t) * num_chunks);
+ for (i = 0; i < num_chunks; i++)
+ chunk_array[i] = (uint64_t)(uintptr_t)&chunks[i];
It can't. The input is an array of structures. The ioctl takes an
array of pointers.
Ah! Haven't seen this, sorry for the noise.
Christian.
Marek
Regards,
Christian.
Thanks,
Marek
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx
_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx