Re: [Mesa-dev] i965: Kicking off fp16 glsl support

2017-11-27 Thread Chema Casanova
On 27/11/17 at 21:11, Matt Turner wrote:
> 1-14, except 4 are
>
> Reviewed-by: Matt Turner 
>
> I started getting to things that made me realize I needed to review
> Igalia's work before I continued here.

I'm submitting the v4 of our VK_KHR_16bit_storage series tomorrow, so
it would be better to have a look at the new one.

Chema Casanova


Re: [Mesa-dev] i965: Kicking off fp16 glsl support

2017-11-27 Thread Matt Turner

1-14, except 4 are

Reviewed-by: Matt Turner 

I started getting to things that made me realize I needed to review
Igalia's work before I continued here.




[Mesa-dev] i965: Kicking off fp16 glsl support

2017-11-24 Thread Topi Pohjolainen
After Igalia's work on SPIR-V 16-bit storage, the question arose of
how much more is needed on top in order to optimize GLES lowp/mediump
with 16-bit floats. I took GLB 2.7 T-Rex as a target and started
drafting a GLSL lowering pass re-typing mediump floats as float16. In
parallel, I added bit by bit equivalent support to the GLSL -> NIR
pass and to the Intel compiler backend.

This series enables the lowering for fragment shaders only. That was
sufficient for T-Rex, which doesn't use mediump precision for vertex
shaders.

First of all, this is not complete work. I'd like to think of it more
as an attempt to give an idea of what is currently missing, and, by
offering concrete (if not ideal) solutions, to make each case a
little clearer.

On SKL this runs T-Rex pretty much on par with 32-bit. Intel hardware
doesn't have native support for linear interpolation using 16-bit
floats, and therefore pln() and lrp() incur additional moves from
16 bits to 32 bits (and vice versa). Both can be replaced relatively
efficiently using mad() later on.
Comparing shader dumps between 16-bit and 32-bit indicates that all
optimization passes kick in nicely (sampler-eot, mad(), etc.). The
only additions are the aforementioned conversion instructions.
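
To spell out the identity behind the mad() replacement, here is a
scalar C sketch (not code from the series; the eventual lowering
would emit an ADD followed by a MAD on HF registers):

   /* lrp(x, y, a) = x + a * (y - x) */
   static inline float
   lrp_via_mad(float x, float y, float a)
   {
      float tmp = y - x;   /* add(8) tmp, y, -x     */
      return x + a * tmp;  /* mad(8) dst, x, a, tmp */
   }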

The series starts with miscellaneous bits needed in GLSL and NIR.
These are followed by equivalent bits in the Intel compiler backend,
and then by changes that are subject to more debate:

1) Support for SIMD8 fp16 in liveness analysis, copy propagation,
   dead code elimination, etc.

   In order to tell whether one instruction fully overwrites the
   results of another, one needs to examine how much of a register
   is written. Until now this has been done at the granularity of
   full or partial registers, i.e., there is no concept of a "full
   sub-region write". And until now there was no need, as all data
   types took 4 bytes per element, resulting in a full 32-byte
   register even in the SIMD8 case. Partial writes were all special
   and could be safely ignored in the various analysis passes.
   Half precision floats, however, break this assumption: on SIMD8
   a full write with 16-bit elements fills only half a register.

   I tried patching the passes examining partial writes one by one,
   but that started to get out of hand. Moreover, just by looking
   at a register's type size it is not safe to say whether a write
   really is a full write or not.
   The solution here is to store this information explicitly in the
   registers via a new member, fs_reg::pad_per_component.
   Subsequently patching fs_reg::component_size() to take the
   padding into account propagates the information to all users
   (see the sketch at the end of this item). Patch 28 updates a few
   users to call component_size() instead of open-coding it, patch
   29 adds the actual support, and patches 30-35 update NIR -> FS
   to signal the padding (these are separated just for review).

   It should be noted that this only deals with virtual registers.
   The final hardware register allocator is separate, and using
   full registers in virtual space shouldn't prevent it from using
   tighter packing.

   Chema, this overlaps with your work, I hope you don't mind.
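
   To make the idea concrete, here is a minimal sketch using
   simplified stand-ins for the real fs_reg (the actual fields and
   signatures in the patches may differ):

      struct fs_reg_sketch {
         unsigned type_size;         /* bytes per element (type_sz) */
         unsigned pad_per_component; /* pad bytes after each element */

         unsigned component_size(unsigned width) const
         {
            /* SIMD8 HF kept in 32-bit slots: type_size == 2 and
             * pad_per_component == 2 give 4 * 8 == 32 bytes, so
             * the analysis passes still see a full register write.
             */
            return (type_size + pad_per_component) * width;
         }
      };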

2) Booleans produced from 16-bit sources. Whereas for GLSL and for
   NIR booleans are just booleans, on Intel hardware they are
   integers, and their representation depends on how they are
   produced. Comparisons (flt, fge, feq and fne) with 32-bit
   sources produce 32-bit results (0x00000000/0xffffffff), while
   with 16-bit sources one gets 16-bit results (0x0000/0xffff).

   I thought about introducing a 16-bit boolean into NIR, but that
   felt like too hardware-specific a thing to do. Instead I patched
   NIR -> FS to take the type of the producing instruction into
   account when setting up the SSA values. See patch 39 for the
   setup and patches 36-38 for consulting the backend SSA store
   instead of relying on NIR.

   Another approach left to try is emitting additional moves to
   32 bits (the same way we do for fp64). One could then add an
   optimization pass that removes the unnecessary moves and uses
   strided sources instead (see the sketch below).
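
   For illustration, the resolve could be a single sign-extending
   move, e.g. mov(8) dst:D tmp:W (assumed syntax). A scalar C
   sketch of what that move computes:

      #include <stdint.h>

      static inline int32_t
      resolve_hf_bool(int16_t b16)  /* 0x0000 or 0xffff (== -1) */
      {
         /* Sign extension turns 0xffff into 0xffffffff and keeps
          * 0x0000 as 0x00000000, i.e. the 32-bit convention. */
         return (int32_t)b16;
      }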

3) Following up on 2), GLSL -> NIR decides to emit integer-typed
   and/or/xor even for originally boolean-typed logic ops. Patch 40
   tries to cope with the case where the booleans are produced with
   non-matching precision (illustrated below).
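
   A small C sketch of the hazard (a made-up helper, just to show
   why the two representations can't be mixed blindly):

      #include <stdint.h>

      static inline uint32_t
      mismatched_and(uint16_t b16, /* 0x0000 or 0xffff */
                     uint32_t b32) /* 0x00000000 or 0xffffffff */
      {
         /* true & true yields 0x0000ffff: neither canonical true
          * (~0) nor false (0), so the 0/~0 invariant is broken. */
         return b16 & b32;
      }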

4) In the Intel compiler backend the push/pull constant setup
   relies on values being packed in 32-bit slots. Moreover, these
   slots are typeless, and the loader doesn't know whether it is
   dealing with floats or integers, let alone their precision.
   Patch 42 takes the first step and simply adds type information
   to the backend. This is not particularly pretty, but I had to
   start somewhere. It allows the loader to convert float values
   from the 32-bit store in the core to 16 bits on the fly (see the
   sketch below). Patch 43 adjusts the compiler to use 32-bit
   slots; using 16-bit slots would require substantially more work.
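
   A rough sketch of the on-the-fly conversion (everything here
   except Mesa's existing _mesa_float_to_half() helper is made up
   for the example):

      #include <stdbool.h>
      #include <stdint.h>
      #include <string.h>
      #include "util/half_float.h" /* _mesa_float_to_half() */

      static void
      upload_push_constant_slot(void *dst, const void *src32,
                                bool is_float16 /* new type info */)
      {
         if (is_float16) {
            /* The core keeps a 32-bit float; convert on upload. */
            float f;
            memcpy(&f, src32, sizeof(f));
            *(uint16_t *)dst = _mesa_float_to_half(f);
         } else {
            memcpy(dst, src32, 4); /* plain typeless 32-bit slot */
         }
      }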

   I think there is no question about the core using 32-bit values.
   And even if the values there were 16-bit,