After Igalia's work on SPIR-V 16-bit storage, the question arose of how
much more is needed on top in order to optimize GLES lowp/mediump with
16-bit floats. I took GLB 2.7 T-Rex as a target and started drafting a
GLSL lowering pass re-typing mediump floats into float16. In parallel,
I added bit by bit the equivalent support into the GLSL -> NIR pass and
into the Intel compiler backend.
This series enables lowering for fragment shaders only. This was
sufficient for T-Rex, which doesn't use mediump precision in vertex
shaders.
First of all, this is not complete work. I'd like to think of it more
as an attempt to give an idea of what is currently missing, and, by
giving concrete (if not ideal) solutions, to make each case a little
clearer.
On SKL this runs T-Rex pretty much on par with 32-bit. Intel hardware
doesn't have native support for linear interpolation using 16-bit
floats, and therefore pln() and lrp() incur additional moves from
16 bits to 32 bits (and vice versa). Both can be replaced relatively
efficiently with mad() later on.
Comparing shader dumps between 16-bit and 32-bit indicates that all
optimization passes kick in nicely (sampler-eot, mad(), etc). The only
additions are the aforementioned conversion instructions.
The series starts with miscellaneous bits needed in GLSL and NIR.
This is followed by the equivalent bits in the Intel compiler backend.
These are followed by changes that are subject to more debate:
1) Support for SIMD8 fp16 in liveness analysis, copy propagation,
dead code elimination, etc.
In order to tell whether one instruction fully overwrites the results
of another, one needs to examine how much of a register is written.
Until now this has been done at the granularity of a full or partial
register, i.e., there is no concept of a "full sub-region write".
And until now there was no need, as all data types took 4 bytes
per element, resulting in a full 32-byte register even in the
SIMD8 case. Partial writes were all special and could be safely
ignored in the various analysis passes.
Half precision floats, however, break this assumption. In SIMD8 a
full write with 16-bit elements covers only half a register.
I tried patching the different passes examining partial writes one
by one, but that started to get out of hand. Moreover, just by
looking at the register type size it is not safe to say whether a
write really is a full write or not.
The solution here is to store this information explicitly in the
registers: a new member, fs_reg::pad_per_component. Subsequently,
patching fs_reg::component_size() to take the padding into account
propagates the information to all users.
Patch 28 updates a few users to use component_size() instead
of open-coded logic, patch 29 adds the actual support, and
patches 30-35 update NIR -> FS to signal the padding (these are
separated just for review).
It should be noted that we are dealing with virtual registers here.
The final hardware register allocator is separate, and using full
registers in virtual space shouldn't prevent it from using tighter
packing.
Chema, this overlaps with your work, I hope you don't mind.
2) Booleans produced from 16-bit sources. Whereas for GLSL and for
NIR booleans are just booleans, on Intel hardware they are integers,
and their representation depends on how they are produced. Comparisons
(flt, fge, feq and fne) with 32-bit sources produce 32-bit results
(0x00000000/0xffffffff), while with 16-bit sources one gets 16-bit
results (0x0000/0xffff).
I thought about introducing a 16-bit boolean into NIR, but that
felt like too hardware-specific a thing to do. Instead I patched
NIR -> FS to take the type of the producing instruction into account
when setting up the SSA values. See patch 39 for the setup and
patches 36-38 for consulting the backend SSA store instead of
relying on NIR.
Another approach left to try is emitting additional moves to
32 bits (the same way we do for fp64). One could then add an
optimization pass that removes unnecessary moves and uses
strided sources instead.
3) Following up on 2), GLSL -> NIR decides to emit integer-typed
and/or/xor even for originally boolean-typed logic ops. Patch 40
tries to cope with the case where the booleans are produced with
non-matching precision.
4) In the Intel compiler backend and in the push/pull constant setup,
things rely on values being packed in 32-bit slots. Moreover, these
slots are typeless, and the loader doesn't know whether it is dealing
with floats or integers, let alone about precision. Patch 42
takes the first step and simply adds type information to the
backend. This is not particularly pretty, but I had to start
somewhere. This allows the loader to convert float values from
the 32-bit store in the core to 16 bits on the fly. Patch 43
adjusts the compiler to use 32-bit slots.
Using 16-bit slots would require substantially more work.
I think there is no question about core using 32-bit values. And
even if the values there were 16-bit,