On 10.01.19 at 12:41, Marek Olšák wrote:


On Thu, Jan 10, 2019, 4:15 AM Koenig, Christian <christian.koe...@amd.com> wrote:

    On 10.01.19 at 00:39, Marek Olšák wrote:
    On Wed, Jan 9, 2019 at 1:41 PM Christian König
    <ckoenig.leichtzumer...@gmail.com> wrote:

        On 09.01.19 at 17:14, Marek Olšák wrote:
        On Wed, Jan 9, 2019 at 8:09 AM Christian König
        <ckoenig.leichtzumer...@gmail.com> wrote:

            On 09.01.19 at 13:36, Marek Olšák wrote:


            On Wed, Jan 9, 2019, 5:28 AM Christian König
            <ckoenig.leichtzumer...@gmail.com> wrote:

                Looks good, but I'm wondering what's the actual
                improvement?


            No malloc calls and one less for loop copying the bo list.

            Yeah, but didn't we want to get rid of the bo list
            completely?


        If we have multiple IBs (e.g. gfx + compute) that share a BO
        list, I think it's faster to send the BO list to the kernel
        only once.

        That's not really faster.

        The only thing we save is a single loop over all BOs to
        look up the pointer for each handle, and that is only a tiny
        fraction of the overhead.

        The majority of the overhead is locking the BOs and reserving
        space for the submission.

        What could really help here is to submit gfx+compute together
        in just one CS IOCTL. This way we would need the locking and
        space reservation only once.

        It's a bit of work on the kernel side, but certainly doable.


    OK. Any objections to this patch?

    In general I'm wondering if we couldn't avoid adding so much new
    interface.


There are Vulkan drivers that still use the bo_list interface.


    For example we can avoid the malloc() when we just cache the last
    freed bo_list structure in the device. We would just need an
    atomic pointer exchange operation for that.


    This way we don't even need to change mesa at all.
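
For illustration, the caching idea above could look roughly like the sketch below. It is not the libdrm code: `struct bo_list`, `struct device`, `bo_list_alloc` and `bo_list_release` are hypothetical stand-ins, and C11 `atomic_exchange` is assumed as the "atomic pointer exchange operation".

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-ins, not the real libdrm structures. */
struct bo_list {
    unsigned handle;
};

struct device {
    _Atomic(struct bo_list *) cached_list; /* last freed list, or NULL */
};

/* Take the cached structure if there is one, otherwise fall back to malloc(). */
static struct bo_list *bo_list_alloc(struct device *dev)
{
    struct bo_list *list = atomic_exchange(&dev->cached_list, NULL);

    if (!list)
        list = malloc(sizeof(*list));
    return list;
}

/* Park the structure in the cache and free whatever was cached before. */
static void bo_list_release(struct device *dev, struct bo_list *list)
{
    struct bo_list *old;

    assert(list != NULL);
    old = atomic_exchange(&dev->cached_list, list);
    free(old); /* free(NULL) is a no-op */
}
```

With a single cached slot, a create/destroy/create sequence on one device reuses the same allocation, so the steady-state path makes no malloc() call. Whatever is still cached has to be freed when the device is torn down.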


There is still the for loop that we need to get rid of.

Yeah, but I'm fine with handling that through an amdgpu_bo_list_create_raw, which only takes the handles and still returns the amdgpu_bo_list structure we are used to.

See, what I'm mostly concerned about is having another CS function to maintain.



    Regarding optimization, this chunk can be replaced by a cast on 64-bit:
    +   chunk_array = alloca(sizeof(uint64_t) * num_chunks);
    +   for (i = 0; i < num_chunks; i++)
    +           chunk_array[i] = (uint64_t)(uintptr_t)&chunks[i];

It can't. The input is an array of structures. The ioctl takes an array of pointers.

Ah! I hadn't seen this, sorry for the noise.
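
To make the array-of-structures versus array-of-pointers point concrete, here is a minimal sketch. `struct chunk` and `build_chunk_pointers` are hypothetical names chosen for illustration, and the field layout is an assumption, not the kernel's definition. Casting `chunks` itself to `uint64_t *` would make the consumer read structure contents as if they were addresses, so each element's address has to be stored explicitly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical chunk layout, for illustration only. */
struct chunk {
    uint32_t chunk_id;
    uint32_t length_dw;
    uint64_t chunk_data;
};

/*
 * Fill out[i] with the address of chunks[i], widened to 64 bits.
 * The result is an array of pointers, which is a different object
 * from the array of structures it points into.
 */
static void build_chunk_pointers(const struct chunk *chunks,
                                 uint64_t *out, size_t n)
{
    assert(n == 0 || (chunks != NULL && out != NULL));

    for (size_t i = 0; i < n; i++)
        out[i] = (uint64_t)(uintptr_t)&chunks[i];
}
```

The double cast through `uintptr_t` keeps the conversion well-defined on both 32-bit and 64-bit builds, which is why the loop cannot collapse into a single cast even when pointers are already 64 bits wide.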

Christian.


Marek


    Regards,
    Christian.


    Thanks,
    Marek


_______________________________________________
amd-gfx mailing list
amd-gfx@lists.freedesktop.org
https://lists.freedesktop.org/mailman/listinfo/amd-gfx