On 14.02.2017 09:58, Jacob Lifshay wrote:


On Feb 14, 2017 12:18 AM, "Nicolai Hähnle" <nhaeh...@gmail.com> wrote:

    On 13.02.2017 17:54, Jacob Lifshay wrote:

        The algorithm I was going to use would take the union, over all
        barriers, of the sets of variables live at each barrier and create an
        array of structs that holds them all. Then, for each barrier, it would
        insert the code to store all live variables, end the for loop over
        tid_in_workgroup, run the memory barrier, start another for loop over
        tid_in_workgroup, and load all the live variables back.
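
        For a single barrier with two live variables, a minimal sketch of that
        transformation (with hypothetical names like WORKGROUP_SIZE and
        memory_fence(), not actual Mesa code) could look like:

        // sketch only: one barrier, live variables x and y
        struct live_vars { float x; ivec4 y; };
        struct live_vars saved[WORKGROUP_SIZE]; // one slot per invocation

        for (int tid = 0; tid < WORKGROUP_SIZE; tid++) {
            // ... code before the barrier, computing x and y for this
            // invocation ...
            saved[tid].x = x; // store the live variables
            saved[tid].y = y;
        }
        memory_fence(); // the barrier's memory ordering
        for (int tid = 0; tid < WORKGROUP_SIZE; tid++) {
            x = saved[tid].x; // reload the live variables
            y = saved[tid].y;
            // ... code after the barrier ...
        }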


    Okay, sounds reasonable in theory.

    There are some issues, like: how do you actually determine live
    variables? If you're working off TGSI like llvmpipe does today,
    you'd need to write your own analysis for that, but in a structured
    control flow graph like TGSI has, that shouldn't be too difficult.


I was planning on using the SPIR-V to LLVM translator and never using TGSI.

Cool, it would be interesting to see how that goes. Mind you, I don't think that code is being maintained very well.


I could implement the pass using LLVM coroutines; however, I'd need several
additional passes to convert the output, and it might not optimize all the
way because the switch on the suspend-point index would still be left. Also,
according to the docs from LLVM trunk, LLVM doesn't reduce the coroutine
frame to the minimum size needed for the suspend point with the largest
space requirements; instead, it allocates separate space for each variable
at each suspend point:
http://llvm.org/docs/Coroutines.html#areas-requiring-attention
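
To illustrate the leftover switch, here is a conceptual C sketch of
switch-based coroutine lowering (made-up names, not LLVM's actual output):
each resume re-enters the function through a dispatch on the saved
suspend-point index.

// conceptual sketch only; the names and frame layout are made up
struct coro_frame {
    int suspend_index; // which barrier this invocation stopped at
    // per the LLVM docs linked above, separate space is currently reserved
    // for each live variable at each suspend point, rather than one slot
    // sized for the largest suspend point
    float live_at_barrier_0;
    float live_at_barrier_1;
};

void shader_resume(struct coro_frame *frame)
{
    switch (frame->suspend_index) { // this dispatch is what remains
    case 0: goto after_barrier_0;
    case 1: goto after_barrier_1;
    }
after_barrier_0:
    // ... code between barrier 0 and barrier 1 ...
    frame->suspend_index = 1;
    return; // suspend at barrier 1
after_barrier_1:
    // ... code after the last barrier ...
    return;
}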

Yes, that actually makes sense. About the switches, though, I'm not so sure how you can really avoid those. Consider kernel code like this:

void main()
{
   if (cond) {
      ...
      barrier();
      ...
   } else {
      ...
      barrier();
      ...
   }
}

This kernel is perfectly valid and will work as expected if (and only if) cond is uniform across the threads of a workgroup.

Consider what you'd want the control flow in the LLVM implementation to look like, and how you'd handle the fact that the set of live values would be different across the different barriers.

As a bonus, perhaps you could set things up so that the user gets a nice error message when the kernel is incorrect (i.e., when cond is _not_ uniform across a workgroup).
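
One possible way to do that (a sketch only, using hypothetical names rather
than an existing Mesa API): have the lowered code record which barrier call
site each invocation suspended at, and check after the per-invocation loop
that they all agree.

#include <stdbool.h>
#include <stdio.h>

// Returns false and reports an error if the invocations of a workgroup did
// not all reach the same barrier call site (i.e. cond was not uniform).
static bool check_uniform_barrier(const int *barrier_id_per_invocation,
                                  int workgroup_size)
{
    for (int i = 1; i < workgroup_size; i++) {
        if (barrier_id_per_invocation[i] != barrier_id_per_invocation[0]) {
            fprintf(stderr, "compute shader error: barrier() reached with "
                            "non-uniform control flow in a workgroup\n");
            return false;
        }
    }
    return true;
}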

Cheers,
Nicolai



    I'd still recommend that you at least seriously read through the LLVM
    coroutine stuff.

    Cheers,
    Nicolai

        Jacob Lifshay

        On Feb 13, 2017 08:45, "Nicolai Hähnle" <nhaeh...@gmail.com> wrote:

            [ re-adding mesa-dev on the assumption that it got dropped by accident ]

            On 13.02.2017 17:27, Jacob Lifshay wrote:

                        I would start a thread for each CPU, then have each
                        thread run the compute shader a number of times
                        instead of having a thread per shader invocation.


                    This will not work.

                    Please, read again what the barrier() instruction does:
                    When the barrier() call is reached, _all_ threads within
                    the workgroup are supposed to be run until they reach
                    that barrier() call.


                To clarify, I had meant that each OS thread would run the
                sections of the shader between the barriers for all the
                invocations in a workgroup; then, when it finished the
                workgroup, it would go to the next workgroup assigned to
                that OS thread.

                So, if our shader is:

                a = b + tid;
                barrier();
                d = e + f;

                and our SIMD width is 4, our workgroup size is 128, and we
                have 16 OS threads, then each OS thread will run:

                for (workgroup = os_thread_index; workgroup < workgroup_count;
                     workgroup += 16 /* number of OS threads */)
                {
                    for (tid_in_workgroup = 0; tid_in_workgroup < 128;
                         tid_in_workgroup += 4)
                    {
                        ivec4 tid = ivec4(0, 1, 2, 3)
                                    + ivec4(tid_in_workgroup + workgroup * 128);
                        a[tid_in_workgroup / 4] =
                            ivec_add(b[tid_in_workgroup / 4], tid);
                    }
                    memory_fence(); // if needed
                    for (tid_in_workgroup = 0; tid_in_workgroup < 128;
                         tid_in_workgroup += 4)
                    {
                        d[tid_in_workgroup / 4] =
                            vec_add(e[tid_in_workgroup / 4],
                                    f[tid_in_workgroup / 4]);
                    }
                }
                // after this, we run the next rendering or compute job


            Okay good, that's the right concept.

            Actually doing that is not at all straightforward though:
            consider that the barrier() might occur inside a loop in the
            shader.
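
            For example (sketch only, assuming the loop bound n is uniform
            across the workgroup), a shader loop like

                for (i = 0; i < n; i++) {
                    // ... part A ...
                    barrier();
                    // ... part B ...
                }

            has to be turned inside out, so that the shader loop ends up
            outside the two per-invocation loops:

                for (i = 0; i < n; i++) {
                    for (tid_in_workgroup = 0; tid_in_workgroup < 128;
                         tid_in_workgroup += 4)
                    {
                        // ... part A for these invocations ...
                    }
                    memory_fence(); // if needed
                    for (tid_in_workgroup = 0; tid_in_workgroup < 128;
                         tid_in_workgroup += 4)
                    {
                        // ... part B for these invocations ...
                    }
                }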

            So if you implemented that within the framework of llvmpipe,
            you'd make a lot of people very happy: it would allow finally
            adding compute shader support to llvmpipe. Mind you, that in
            itself would already be a pretty decent-sized project for GSoC!

            Cheers,
            Nicolai



Jacob Lifshay

