On 13/06/17 15:43, Ilia Mirkin wrote:
On Tue, Jun 13, 2017 at 8:18 AM, Roland Scheidegger <srol...@vmware.com> wrote:
Am 13.06.2017 um 08:57 schrieb Karol Herbst:
On Tue, Jun 13, 2017 at 2:17 AM, Roland Scheidegger <srol...@vmware.com> wrote:
I am actually also thinking this should be different.

E.g. imho MAD means the operation can be either fused or unfused.
This is the "traditional" definition of MAD - OpenCL for instance
follows this too, although it isn't mentioned in the gallium docs (it
probably should be).
(OpenCL says: "Whether or how the product of a * b is rounded and how
supernormal or subnormal intermediate products are handled is not
defined. mad is intended to be used where speed is preferred over
accuracy.")
I think doing something different here in gallium can only lead to
madness long term - glsl doesn't have mad in the first place, and as far
as I can tell d3d10 is ok with fused/unfused mad too (the docs stating
"Fused operations (such as mad, dp3) produce results that are no less
accurate than the worst possible serial ordering of evaluation of the
unfused expansion of the operation.")
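
To make the difference concrete, here's a small standalone C sketch (not
Mesa code, just an illustration with hand-picked values) where a fused and
an unfused multiply-add give different results for the same inputs:

  /* fmaf() rounds a*b+c once; the separate mul+add rounds the product
   * first, so the two may legitimately differ.  Build with
   * -ffp-contract=off so the compiler doesn't fuse the "unfused"
   * version on its own. */
  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
     float a = 1.0f + 0x1.0p-12f;
     float b = 1.0f + 0x1.0p-12f;
     float c = -1.0f;

     float prod    = a * b;          /* product rounded to float here */
     float unfused = prod + c;
     float fused   = fmaf(a, b, c);  /* single rounding of the exact a*b+c */

     printf("unfused: %a\nfused:   %a\n", unfused, fused);
     return 0;
  }

Both results are within the accuracy the APIs above require, which is the
whole point.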

This means that mul+add must not be fused into a mad anywhere if precise is
specified, and therefore you never have to worry about doing a
fused or unfused mul/add in the driver when you see a mad - it's enough if you
just don't fuse mul+add in the driver itself (if you can't do unfused mad).
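
In other words the peephole only needs a guard along these lines (a
hypothetical sketch with a made-up instruction struct, not the actual Mesa
code - the real pass works on TGSI/driver IR):

  #include <stdbool.h>

  /* Simplified instruction representation, for illustration only. */
  struct ir_instr {
     bool precise;   /* set when the shader requires exact ordering */
     /* ...opcode, destination, sources... */
  };

  /* Only fuse MUL+ADD into MAD when neither instruction is precise,
   * since MAD leaves the intermediate rounding unspecified. */
  static bool
  can_fuse_mul_add(const struct ir_instr *mul, const struct ir_instr *add)
  {
     if (mul->precise || add->precise)
        return false;
     /* ...remaining legality checks (single use of the product, etc.)... */
     return true;
  }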

Roland


Well, there is a TGSI peephole doing this mul+add => mad optimisation,
and it isn't wrong, because mad != fma and mul+add == mad. But on
Fermi+ Nvidia hardware there is no mad, only fma, and because mad != fma
we need to split it up again.

So either TGSI doesn't merge it when the instruction is flagged as precise
(which it is in the tests mentioned), even though the merge would be correct,
or we lower it in the driver, because the instruction isn't supported by the
hardware anyway.

Yes, I think the TGSI peephole shouldn't merge mul+add to mad with
precise. You say this isn't wrong, but imho it clearly is, because no one
ever said MAD can't be fused - it is multiply + add, yes, but whether
there's intermediate rounding or not isn't specified. FWIW gallivm code
also assumes this, and uses llvm.fmuladd for the implementation (which
is exactly the same "mul+add" story as OpenCL mad, and will use fma on
cpus supporting it and separate mul+add otherwise, save some bugs in
older llvm versions apparently).
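
(For reference, emitting it through the plain LLVM C API looks roughly like
the sketch below; the actual gallivm helper goes through its own
lp_build_intrinsic machinery instead, and the intrinsic-lookup entry points
used here only exist in newer LLVM releases.)

  #include <string.h>
  #include <llvm-c/Core.h>

  /* Sketch: build a*b+c as llvm.fmuladd, which LLVM may lower to a
   * fused fma or to a separate fmul+fadd depending on the target -
   * exactly the MAD contract described above. */
  static LLVMValueRef
  build_mad(LLVMModuleRef mod, LLVMBuilderRef builder,
            LLVMValueRef a, LLVMValueRef b, LLVMValueRef c)
  {
     LLVMTypeRef ty = LLVMTypeOf(a);
     unsigned id = LLVMLookupIntrinsicID("llvm.fmuladd",
                                         strlen("llvm.fmuladd"));
     LLVMValueRef fn = LLVMGetIntrinsicDeclaration(mod, id, &ty, 1);
     LLVMTypeRef fnty = LLVMIntrinsicGetType(LLVMGetModuleContext(mod),
                                             id, &ty, 1);
     LLVMValueRef args[] = { a, b, c };
     return LLVMBuildCall2(builder, fnty, fn, args, 3, "mad");
  }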
So we should just clarify that in the tgsi docs - mad is multiply + add,
with undefined intermediate rounding; it can be a fused mul+add or an
unfused one (technically it could also be something in between, I suppose,
since the APIs just specify the accuracy isn't worse than an unfused
multiply + add). Every driver gets to use whatever it can do fastest,
and because there's no specified intermediate rounding, precise
doesn't change anything there.

That's at least my opinion of what TGSI_OPCODE_MAD should be (of course,
older GPUs always used unfused mad, but this was never a requirement).

BTW, irrespective of how this conversation turns out, I think it's a
good idea to split MAD into mul + add in the nv50 backend on input,
unconditionally.
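
Something along these lines (a hypothetical sketch on a toy IR, purely to
illustrate the idea - the real code would operate on nv50_ir instructions):

  #include <stdlib.h>

  /* Toy IR, for illustration only. */
  enum opcode { OP_MUL, OP_ADD, OP_MAD };

  struct instr {
     enum opcode op;
     int dst;
     int src[3];
     struct instr *next;
  };

  /* Split "MAD dst, a, b, c" into "MUL tmp, a, b" + "ADD dst, tmp, c"
   * unconditionally on input; a later pass can re-fuse the pair into
   * FMA, but only when the instructions aren't marked precise. */
  static void
  split_mad(struct instr *mad, int tmp)
  {
     struct instr *add = calloc(1, sizeof(*add));

     add->op = OP_ADD;
     add->dst = mad->dst;
     add->src[0] = tmp;
     add->src[1] = mad->src[2];
     add->next = mad->next;

     mad->op = OP_MUL;
     mad->dst = tmp;
     mad->src[2] = -1;   /* third source no longer used */
     mad->next = add;
  }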

I seem to remember that using MAD introduced a performance regression on my nv86 for some benchmarks. I will need to get the setup working again for mesa testing.

Martin