On 09/17/2013 05:13 AM, Rogovin, Kevin wrote:
> Hello,
> 
>  Thank you for the very fast answers, some more questions:
> 
> 
>> It's not a preference question.  The registers are 8 floats wide.
>> Vertex shaders get invoked 2 vertices at a time, with a register containing 
>> these values:
>>
>> .   +------+------+------+------+------+------+------+------+
>> .   | v0.x | v0.y | v0.z | v0.w | v1.x | v1.y | v1.z | v1.w |
>> .   +------+------+------+------+------+------+------+------+
> 
> This seems best to me: run two vertices in each invocation in the hope
> that the shader compiler will merge (multiple) float, vec2 and maybe even
> vec3 operations into vec4 operations. Does it?

Not as well as it should.  There's a lot of room for improvement in our
SIMD4x2/vector backend.  We haven't spent a ton of effort optimizing it
since vertex shaders have rarely been the bottleneck in application
performance.
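
To make the cost concrete, here is a minimal C sketch of the SIMD4x2 layout
from the diagram above. The names reg8, simd4x2_pack, simd4x2_mul and
simd4x2_useful_lanes are hypothetical and are not Mesa or hardware
identifiers. It only models one 8-float register holding xyzw for two
vertices, and counts how many lanes do useful work for an n-component
operation.

```c
#include <stddef.h>

/* Hypothetical model of one 8-float register; not a Mesa type. */
typedef struct { float f[8]; } reg8;

/* Pack two vertices as | v0.xyzw | v1.xyzw |, matching the diagram. */
static reg8 simd4x2_pack(const float v0[4], const float v1[4])
{
    reg8 r;
    for (size_t c = 0; c < 4; c++) {
        r.f[c]     = v0[c];   /* lanes 0..3: first vertex  */
        r.f[4 + c] = v1[c];   /* lanes 4..7: second vertex */
    }
    return r;
}

/* One 8-wide multiply stands in for a single MUL covering both vertices. */
static reg8 simd4x2_mul(reg8 a, reg8 b)
{
    reg8 r;
    for (size_t lane = 0; lane < 8; lane++)
        r.f[lane] = a.f[lane] * b.f[lane];
    return r;
}

/* Lanes carrying useful data for an n-component op on two vertices
 * (n = 1 for float, 2 for vec2, 3 for vec3, 4 for vec4). */
static unsigned simd4x2_useful_lanes(unsigned components)
{
    return 2u * components;
}
```

Under this model a lone float operation uses 2 of the 8 lanes, which is why
merging scalar and vec2 operations into vec4 operations matters in this mode.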

>> while these 8 pixels in screen space:
>>
>> .   +----+----+----+----+
>> .   | p0 | p1 | p2 | p3 |
>> .   +----+----+----+----+
>> .   | p4 | p5 | p6 | p7 |
>> .   +----+----+----+----+
>>
>> are loaded in fragment shader registers as:
>>
>> .   +------+------+------+------+------+------+------+------+
>> .   | p0.x | p1.x | p4.x | p5.x | p2.x | p3.x | p6.x | p7.x |
>> .   +------+------+------+------+------+------+------+------+
>>
>> Note how one register just holds a single channel ('.x' here) of a vector.  
>> A vec4 would take up 4 registers, and to do value0.xyzw * value1.xyzw, you'd 
>> emit 4 MULs.
> 
> This is exactly what I was trying to ask/say about how the fragment shader
> runs, i.e. n fragments are processed with one n-wide SIMD instruction (for
> i965, n=8). Sigh, my e-mail communications leave something to be desired.
> Some questions:
>  1) Do the fragments need to be in a 4x2 block, or can it be two separate
> 2x2 blocks?

The GPU processes two separate 2x2 blocks of pixels, which may actually
not be anywhere near each other.
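
As a rough illustration, the lane order in the fragment register diagram
above (p0, p1, p4, p5, then p2, p3, p6, p7) can be read as two independent
2x2 blocks. The C sketch below assumes each block is described by the screen
position of its top-left pixel. The types pixel and simd8_dispatch and the
function simd8_lane_to_pixel are made up for this example and do not
correspond to Mesa or hardware names.

```c
/* Hypothetical types for illustration; not Mesa or hardware structures. */
typedef struct { int x, y; } pixel;

/* Top-left pixel of each of the two 2x2 blocks in one SIMD8 dispatch. */
typedef struct { pixel origin[2]; } simd8_dispatch;

/* Map a lane (0..7) to the screen pixel it holds.
 * Lanes 0..3 belong to the first block, lanes 4..7 to the second.
 * Within a block the order is top-left, top-right, bottom-left,
 * bottom-right, matching p0, p1, p4, p5 in the diagram. */
static pixel simd8_lane_to_pixel(const simd8_dispatch *d, unsigned lane)
{
    unsigned block = lane >> 2;          /* which 2x2 block       */
    pixel p;
    p.x = d->origin[block].x + (int)(lane & 1u);         /* column in block */
    p.y = d->origin[block].y + (int)((lane >> 1) & 1u);  /* row in block    */
    return p;
}
```

With origins (0,0) and (2,0) this reproduces the 4x2 picture from the quoted
text; with unrelated origins the same eight lanes cover two blocks that are
nowhere near each other.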

>  2) For tiny triangles, with fragment shaders that do not require dFdx, dFdy
> or fwidth, can the fragments be totally scattered?

Nope, the pixel shader always works on 2x2 blocks.
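
For context, a 2x2 block is what makes screen-space derivatives cheap: each
pixel has a horizontal and a vertical neighbour in the same register. The C
sketch below shows one generic way to take such differences from a single
channel laid out as in the fragment register diagram. It is an assumption
for illustration only, not a description of what the i965 hardware or the
Mesa compiler actually emits, and the function names are hypothetical.

```c
/* v holds one channel for 8 lanes: two 2x2 blocks, each ordered
 * top-left, top-right, bottom-left, bottom-right. */

/* Horizontal difference within the lane's own row: right minus left. */
static float subspan_dfdx(const float v[8], unsigned lane)
{
    unsigned base = lane & ~3u;         /* first lane of this 2x2 block */
    unsigned row  = (lane >> 1) & 1u;   /* 0 = top row, 1 = bottom row  */
    return v[base + 2u * row + 1u] - v[base + 2u * row];
}

/* Vertical difference within the lane's own column: bottom minus top. */
static float subspan_dfdy(const float v[8], unsigned lane)
{
    unsigned base = lane & ~3u;         /* first lane of this 2x2 block */
    unsigned col  = lane & 1u;          /* 0 = left, 1 = right column   */
    return v[base + 2u + col] - v[base + col];
}
```

Neither function ever reads across the boundary between lanes 3 and 4, which
is consistent with the two blocks being unrelated on screen.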

> Along the same lines, for non-dependent texture lookups, are there code
> paths where the derivatives are computed analytically, so that selecting
> the correct LOD does not require processing fragments in 2x2 (or larger)
> blocks? Or does the i965 hardware sampler interface not allow this kind of
> madness?
> 
>>> On a related note, where are the beans about the dispatch table?
>> I don't know this one (or particularly what you're asking, I guess).
> 
> Viewing docs/index.html, on the side panel under "Developer Topics --> GL
> Dispatch" there is text (broken into sections "1. Complexity of GL
> Dispatch", "2. Overview of Mesa's Implementation" and "3. Optimizations")
> describing how different GL contexts for the same hardware can do
> different things for the same GL function, and that Mesa has stubs which
> in turn call the "real" function. The documents go on to talk about
> various ways the function tables are filled and accessed across separate
> threads. My questions are:
>  0) Is that text still accurate? In particular, the directory src/glapi is
> gone from Mesa (at least from what I git cloned) and I thought that was
> where it lived.
>  1) Where/how does the i965 driver fill that table, if it exists?
>  
> Along similar lines, I see that some of the code in src/mesa/main performs
> various checks on API calls and at times has conditions that depend on the
> context type, which kind of contradicts the idea of different contexts
> having different dispatch tables [sort of, since the functions might just
> be the driver magick, whereas the stub is validate and then call driver
> magick].
> 
> -Kevin
