Re: [Mesa-dev] [PATCH 00/31] nir: Use a 1-bit data type for booleans

2018-11-08 Thread Iago Toral
I just noticed I never replied to this:
On Tue, 2018-10-30 at 10:18 -0500, Jason Ekstrand wrote:
> On Tue, Oct 30, 2018 at 2:50 AM Iago Toral  wrote:
> > Some quick comments on this after experimenting with it:
> > 
> > Intel hardware produces booleans of the same bit-size as the operations
> > that generate them, so I am working on expanding this to add another
> > lowering pass that can lower them to smaller bit-sizes.
> > 
> > I have a work-in-progress prototype of this and it is significantly
> > easier, at least for us, to do this as a lowering pass in NIR SSA form
> > than in the backend. Specifically, doing it in the backend means that
> > we have to deal with propagating the bit-sizes of the booleans across
> > operation chains and across CFG blocks, which isn't fun.
> > 
> > There are still a few things that I am figuring out:
> > 
> > Right now I lower instructions as I find them, deciding on the
> > bit-size of the destination based on the bit-size of the sources. This
> > usually works fine because by the time we reach an instruction that
> > reads boolean sources we have already lowered the instructions that
> > produced them. However, this also means that we need to make decisions
> > about the bit-sizes of constant booleans before we know which bit-size
> > is required for them; it could be that we need them as 32-bit booleans
> > only, 16-bit booleans only, or maybe both. Right now I am assuming that
> > constants are always 32-bit, which means that at the moment I have to
> > inject a MOV when they are used in 16-bit instructions. We can probably
> > do better, at least in some cases, if we inspect all uses of the
> > constants before we lower them.
> 
> If we did a step of constant folding, wouldn't the constant booleans
> get nicely folded to the right bit-size?

The lowering pass runs late, so we have already run constant folding by
the time we get here. Anyway, I have double-checked, and the load_const
instructions seem to come with bit-size > 1 in all the cases I have
available from CTS, so maybe this just works already and we never hit
this. With that said, I am not 100% sure whether there are any cases
where we could still end up having to lower a 1-bit bool load_const
instruction and figure out a bit-size for it, but at the very least
this does not look like it would be a common case, so maybe we can
deal with that when we find an actual one.
> > Another thing we need to deal with is canonicalization. Some
> > instructions may have boolean sources of different native bit-sizes and
> > it looks like we need to choose a canonical bit-size for them in this
> > case (at least in my experiments this seems to be required for Intel).
> > For now I am just taking the bit-size of the first source as the
> > canonical form, but we could try to be smarter about this. In the case
> > of Intel I think this isn't too bad, because constant operands are
> > usually second, so for instructions that use a boolean immediate, if we
> > choose the first source as canonical, we can then adjust the immediate
> > source without having to emit additional instructions. Unfortunately,
> > the 16-bit integer tests in the CTS are not a good test set for
> > canonicalization, since they focus very specifically on 16-bit paths
> > and I don't think we have many cases (if any) of boolean instructions
> > mixing booleans of different bit-sizes, so we will probably have to
> > write more tests of our own to exercise these cases properly. Also, it
> > might be that some hardware that supports multiple bit-sized booleans
> > doesn't require canonicalization, in which case we might want to make
> > this configurable.
> 
> Assuming a back-end that can properly handle 8 and 16-bit booleans, I
> suspect "always go down" would be reasonable, for two reasons:
> 
>  1) It always uses no more bits than the biggest source, so it helps
> with register pressure
>  2) We can do the conversion "for free" either with a mixed-type
> operation (an OR of D and W) or, if that doesn't work, by using a
> register region to just reinterpret the D as a scattered out W.
> 
> For that matter, there's a part of me that's tempted to always spill
> to 8-bit for everything other than the comparison operation itself. 
> Again, this is all dependent on having a competent back-end compiler
> which we currently don't have. :-(

Right, doing 8-bit is probably not a good idea for now. Byte
instructions have this annoying restriction by which destinations need
to have a stride of 2, which I think doesn't play well with the
optimizer in general right now. But beyond that, the hardware doesn't
support byte immediates, which means that we lose algebraic and cmod

Re: [Mesa-dev] [PATCH 00/31] nir: Use a 1-bit data type for booleans

2018-11-03 Thread Jason Ekstrand
Thanks!  I don't mean to be a pest.  However, not everyone is as good about
keeping track of their backlog as you are so I thought it might be worth a
reminder.

--Jason

On Sat, Nov 3, 2018 at 7:59 PM Ian Romanick  wrote:

> I haven't forgotten... I'm planning to dig into this next week.
>
> On 11/02/2018 06:42 AM, Jason Ekstrand wrote:
> > Bump
> >
> > On Mon, Oct 22, 2018 at 5:14 PM Jason Ekstrand wrote:
> >
> > [cover letter snipped; quoted in full in the 2018-11-02 message below]

Re: [Mesa-dev] [PATCH 00/31] nir: Use a 1-bit data type for booleans

2018-11-03 Thread Ian Romanick
I haven't forgotten... I'm planning to dig into this next week.

On 11/02/2018 06:42 AM, Jason Ekstrand wrote:
> Bump
> 
> On Mon, Oct 22, 2018 at 5:14 PM Jason Ekstrand wrote:
>
> [cover letter snipped; quoted in full in the 2018-11-02 message below]

Re: [Mesa-dev] [PATCH 00/31] nir: Use a 1-bit data type for booleans

2018-11-02 Thread Jason Ekstrand
Bump

On Mon, Oct 22, 2018 at 5:14 PM Jason Ekstrand  wrote:

> This is something that Connor and I have talked about quite a bit over the
> last couple of months.  The core idea is to replace NIR's current 32-bit
> 0/-1 D3D10-style booleans with a 1-bit data type.  All in all, I think it
> worked out pretty well though I really don't like the proliferation of
> 32-bit comparison opcodes we now have kicking around for i965.
>
> Why?  No hardware really has a 1-bit type, right?  Well, sort of...  AMD
> actually uses 64-bit scalars for booleans with one bit per invocation.
> However, most hardware such as Intel uses some other larger value for
> booleans.  The real benefit of 1-bit booleans and requiring a lowering pass
> is that you can do somewhat custom lowering (like AMD wants) and your
> lowering pass can always tell in an instant if a value is a boolean based
> on the bit size.  As can be seen in the last patch, this makes it really
> easy to implement a bool -> float lowering pass for hardware that doesn't
> have real integers where NIR's current booleans are actually rather
> painful.
>
> On Intel, the situation is a bit more complicated.  It's tempting to say
> that we have 32-bit D3D10 booleans.  However, they're not really D3D10
> booleans on gen4-5 because the top 31 bits are undefined garbage and, while
> iand, ior, ixor, and inot operations work, you have to iand with 1 at the
> last minute to clear off all that garbage.  Also, on all generations, a
> comparison of two N-bit values results in an N-bit boolean, not a 32-bit
> bool.  This has caused the Igalia folks no end of trouble as they've been
> working on native 8 and 16-bit support.  If, instead, we have a 1-bit bool
> with a lowering pass and we can lower to whatever we want, then we could
> lower to a set of comparison opcodes that return the same bit-size as they
> compare and it would match GEN hardware much better.
>
> But what about performance?  Aren't there all sorts of neat tricks we can
> do with D3D10 booleans like b & 1.0f for b2f?  As it turns out, not really;
> that's about the only one.  There is some small advantage when optimizing
> shaders that come from D3D if your native representation of booleans
> matches that of D3D.  However, the penultimate patch in this series adds
> a few small optimizations that get us to actually better than we were before.
> With the entire series, shader-db on Kaby Lake looks like this:
>
> total instructions in shared programs: 15084098 -> 14988578 (-0.63%)
> instructions in affected programs: 1321114 -> 1225594 (-7.23%)
> helped: 2340
> HURT: 23
>
> total cycles in shared programs: 369790134 -> 359798399 (-2.70%)
> cycles in affected programs: 134085452 -> 124093717 (-7.45%)
> helped: 2149
> HURT: 720
>
> total loops in shared programs: 4393 -> 4393 (0.00%)
> loops in affected programs: 0 -> 0
> helped: 0
> HURT: 0
>
> total spills in shared programs: 10158 -> 10051 (-1.05%)
> spills in affected programs: 1429 -> 1322 (-7.49%)
> helped: 8
> HURT: 15
>
> total fills in shared programs: 22105 -> 21720 (-1.74%)
> fills in affected programs: 2853 -> 2468 (-13.49%)
> helped: 12
> HURT: 15
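The b & 1.0f trick for b2f mentioned above can be shown bit-for-bit on the CPU. This is a standalone illustration of the identity (not Mesa code), assuming D3D10-style 0/~0 booleans:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* D3D10-style boolean: false is 0, true is ~0 (all bits set). */
static float b2f_via_and(uint32_t b)
{
    /* 0x3f800000 is the IEEE-754 bit pattern of 1.0f.  ANDing it with an
     * all-ones boolean leaves 1.0f intact; ANDing with zero gives 0.0f. */
    uint32_t bits = b & 0x3f800000u;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

With a 0/-1 representation the b2f conversion is thus a single AND with an immediate, which is the optimization that goes away once booleans stop being 0/-1 at 32 bits.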
>
> How about ease of use?  Are they a pain to deal with?  Yes, adding support
> for 1-bit types was a bit awkward in a few places but most of it was
> dealing with all the places where we have 32-bit booleans baked into
> assumptions.  Getting rid of that baking in solves the problem and also
> makes the whole IR more future-proof.
>
> All in all, I'd say I'm pretty happy with it.  However, I'd like other
> people (particularly the AMD folks) to play with it a bit and verify that
> it solves their problems as well.  Also, I added a lowering pass and tried
> to turn it on in everyone's driver but may not have put it in the right
> spot.  Please double-check my work.  For those wishing to take a look, you
> can also find the entire series on my gitlab here:
>
> https://gitlab.freedesktop.org/jekstrand/mesa/commits/review/nir-1-bit-bool
>
> Please review!
>
> --Jason
>
> Cc: Connor Abbott 
> Cc: Timothy Arceri 
> Cc: Eric Anholt 
> Cc: Rob Clark 
> Cc: Karol Herbst 
> Cc: Bas Nieuwenhuizen 
> Cc: Alyssa Rosenzweig 
>
>
> Jason Ekstrand (31):
>   nir/validate: Print when the validation failed
>   nir/constant_folding: Add an unreachable to a switch
>   nir/constant_folding: Use nir_src_as_bool for discard_if
>   nir/builder: Add a nir_imm_true/false helpers
>   nir/builder: Handle 16-bit floats in nir_imm_floatN_t
>   nir/search: Use nir_builder
>   nir/opt_if: Rework condition propagation
>   nir/system_values: Use the bit size from the load_deref
>   glsl/nir: Use i2b instead of ine for fixing UBO/SSBO booleans
>   nir/prog: Use nir_bany in kill handling
>   spirv: Use the right bit-size for spec constant ops
>   spirv: Initialize subgroup destinations with the destination type
>   intel/nir: Use the OPT macro for more passes
>   

Re: [Mesa-dev] [PATCH 00/31] nir: Use a 1-bit data type for booleans

2018-10-30 Thread Jason Ekstrand
On Tue, Oct 30, 2018 at 2:50 AM Iago Toral  wrote:

> Some quick comments on this after experimenting with it:
>
> Intel hardware produces booleans of the same bit-size as the operations
> that generate them, so I am working on expanding this to add another
> lowering pass that can lower them to smaller bit-sizes.
>
> I have a work-in-progress prototype of this and it is significantly
> easier, at least for us, to do this as a lowering pass in NIR SSA form
> than in the backend. Specifically, doing it in the backend means that
> we have to deal with propagating the bit-sizes of the booleans across
> operation chains and across CFG blocks, which isn't fun.
>
> There are still a few things that I am figuring out:
>
> Right now I lower instructions as I find them, deciding about the bit-
> size of the destination based on the bit-size of the sources. This
> usually works fine because by the time we reach an instruction that
> reads boolean sources we have already lowered the instructions that
> produced them. However, this also means that we need to make decisions
> about the bit-sizes of constant booleans before we know which bit-size
> is required for them; it could be that we need them as 32-bit booleans
> only, 16-bit booleans only, or maybe both. Right now I am assuming that
> constants are 32-bit always, which means that at the moment I have to
> inject a MOV when they are used in 16-bit instructions. We can probably
> do better, at least in some cases, if we inspect all uses of the
> constants before we lower them.
>

If we did a step of constant folding, wouldn't the constant booleans get
nicely folded to the right bit-size?


> Another thing we need to deal with is canonicalization. Some
> instructions may have boolean sources of different native bit-sizes and
> it looks like we need to choose a canonical bit-size for them in this
> case (at least in my experiments this seems to be required for Intel).
> For now I am just taking the bit-size of the first source as the
> canonical form, but we could try to be smarter about this. In the case
> of Intel I think this isn't too bad, because constant operands are
> usually second, so for instructions that use a boolean immediate, if we
> choose the first source as canonical, we can then adjust the immediate
> source without having to emit additional instructions. Unfortunately,
> the 16-bit integer tests in the CTS are not a good test set for
> canonicalization, since they focus very specifically on 16-bit paths
> and I don't think we have many cases (if any) of boolean instructions
> mixing booleans of different bit-sizes, so we will probably have to
> write more tests of our own to exercise these cases properly. Also, it
> might be that some hardware that supports multiple bit-sized booleans
> doesn't require canonicalization, in which case we might want to make
> this configurable.
>

Assuming a back-end that can properly handle 8 and 16-bit booleans, I
suspect "always go down" would be reasonable, for two reasons:

 1) It always uses no more bits than the biggest source, so it helps with
register pressure
 2) We can do the conversion "for free" either with a mixed-type operation
(an OR of D and W) or, if that doesn't work, by using a register region to
just reinterpret the D as a scattered out W.

For that matter, there's a part of me that's tempted to always spill to
8-bit for everything other than the comparison operation itself.  Again,
this is all dependent on having a competent back-end compiler which we
currently don't have. :-(
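The reason narrowing is the cheap direction can be illustrated on the CPU (a standalone sketch of the bit-level property, not GEN register-region code): with 0/-1 booleans, the low word of a 32-bit boolean is already a valid 16-bit boolean, so reading it through a narrower region needs no extra instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Truncating a 0/-1 boolean preserves it: the low 16 bits of 0x00000000
 * or 0xffffffff are again an all-zeros or all-ones boolean. */
static uint16_t bool32_low_word(uint32_t b)
{
    return (uint16_t)b;
}
```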


> Another thing I noticed is that with phi sources the canonicalization
> aspect is a bit trickier, since we can't just insert conversions before
> the phi instructions (nir_validate complains that this is not allowed),
> so instead we need to get to the instruction that generated the SSA
> source and inject the conversion after it.
>

There's a nifty nir_before_src cursor builder function that correctly
handles phis.


> Last thing I noticed is that our backend optimizer is much better at
> optimizing 32-bit booleans than it is at optimizing 16-bit, so there
> are some cases where we end up producing worse code. Obviously this is
> not a problem of the NIR lowering pass though, it just means that we
> need to do better at optimizing 16-bit code in the backend.
>
> So all in all, I am quite happy with this. It is certainly easier than
> lowering in the backend and it gives us a more reasonable framework to
> work with. We still need to do some extra work if we want to do the
> best lowering for things like boolean constants or canonicalization,
> but since neither SPIR-V nor GLSL defines multiple bit-sizes for
> booleans, that would always have been something that drivers might need
> to address.
>
> Iago
>
>
> On Mon, 2018-10-22 at 17:13 -0500, Jason Ekstrand wrote:
> > [cover letter snipped]

Re: [Mesa-dev] [PATCH 00/31] nir: Use a 1-bit data type for booleans

2018-10-30 Thread Iago Toral
Some quick comments on this after experimenting with it:

Intel hardware produces booleans of the same bit-size as the operations
that generate them, so I am working on expanding this to add another
lowering pass that can lower them to smaller bit-sizes.

I have a work-in-progress prototype of this and it is significantly
easier, at least for us, to do this as a lowering pass in NIR SSA form
than in the backend. Specifically, doing it in the backend means that
we have to deal with propagating the bit-sizes of the booleans across
operation chains and across CFG blocks, which isn't fun.

There are still a few things that I am figuring out:

Right now I lower instructions as I find them, deciding about the bit-
size of the destination based on the bit-size of the sources. This
usually works fine because by the time we reach an instruction that
reads boolean sources we have already lowered the instructions that
produced them. However, this also means that we need to make decisions
about the bit-sizes of constant booleans before we know which bit-size
is required for them; it could be that we need them as 32-bit booleans
only, 16-bit booleans only, or maybe both. Right now I am assuming that
constants are 32-bit always, which means that at the moment I have to
inject a MOV when they are used in 16-bit instructions. We can probably
do better, at least in some cases, if we inspect all uses of the
constants before we lower them.
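The "inspect all uses" idea can be sketched over a toy representation. This is plain C with assumed names and policy, not the actual NIR pass: a constant whose uses all agree on one bit-size can be emitted at that size directly, while mixed or unknown uses fall back to 32 bits plus a conversion.

```c
#include <stddef.h>

/* Toy model, not real NIR: use_sizes[] holds the bit-size of each
 * instruction that reads a given boolean constant. */
static unsigned const_bool_bit_size(const unsigned *use_sizes, size_t n)
{
    if (n == 0)
        return 32;           /* no uses: any size works, default to 32 */
    for (size_t i = 1; i < n; i++) {
        if (use_sizes[i] != use_sizes[0])
            return 32;       /* mixed uses: emit 32-bit and convert */
    }
    return use_sizes[0];     /* all uses agree: no extra MOV needed */
}
```

Under this policy a constant used only by 16-bit instructions is emitted at 16 bits, so the injected MOV described above only remains in the genuinely mixed case.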

Another thing we need to deal with is canonicalization. Some
instructions may have boolean sources of different native bit-sizes and
it looks like we need to choose a canonical bit-size for them in this
case (at least in my experiments this seems to be required for Intel).
For now I am just taking the bit-size of the first source as the
canonical form, but we could try to be smarter about this. In the case
of Intel I think this isn't too bad, because constant operands are
usually second, so for instructions that use a boolean immediate, if we
choose the first source as canonical, we can then adjust the immediate
source without having to emit additional instructions. Unfortunately,
the 16-bit integer tests in the CTS are not a good test set for
canonicalization, since they focus very specifically on 16-bit paths
and I don't think we have many cases (if any) of boolean instructions
mixing booleans of different bit-sizes, so we will probably have to
write more tests of our own to exercise these cases properly. Also, it
might be that some hardware that supports multiple bit-sized booleans
doesn't require canonicalization, in which case we might want to make
this configurable.
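The first-source policy described above can be sketched as follows (a toy C model with assumed names, not the actual lowering pass):

```c
#include <stddef.h>

/* Toy model: pick the first source's bit-size as the canonical boolean
 * bit-size and flag every source that would need a conversion.  For an
 * immediate operand the "conversion" is just re-emitting the constant at
 * the canonical size, which costs no extra instruction. */
static unsigned canonical_bool_bit_size(const unsigned *src_sizes, size_t n,
                                        int *needs_convert)
{
    unsigned canonical = src_sizes[0];
    for (size_t i = 0; i < n; i++)
        needs_convert[i] = (src_sizes[i] != canonical);
    return canonical;
}
```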

Another thing I noticed is that with phi sources the canonicalization
aspect is a bit trickier, since we can't just insert conversions before
the phi instructions (nir_validate complains that this is not allowed),
so instead we need to get to the instruction that generated the SSA
source and inject the conversion after it.

Last thing I noticed is that our backend optimizer is much better at
optimizing 32-bit booleans than it is at optimizing 16-bit, so there
are some cases where we end up producing worse code. Obviously this is
not a problem of the NIR lowering pass though, it just means that we
need to do better at optimizing 16-bit code in the backend.

So all in all, I am quite happy with this. It is certainly easier than
lowering in the backend and it gives us a more reasonable framework to
work with. We still need to do some extra work if we want to do the
best lowering for things like boolean constants or canonicalization,
but since neither SPIR-V nor GLSL defines multiple bit-sizes for
booleans, that would always have been something that drivers might need
to address.

Iago


On Mon, 2018-10-22 at 17:13 -0500, Jason Ekstrand wrote:
> [cover letter snipped]

Re: [Mesa-dev] [PATCH 00/31] nir: Use a 1-bit data type for booleans

2018-10-26 Thread Jason Ekstrand
Thanks!  I've pushed the first 13.

On Mon, Oct 22, 2018 at 6:06 PM Ian Romanick  wrote:

> Patches 1 through 13 are
>
> Reviewed-by: Ian Romanick 
>
> The rest are going to take a bit more time and deep thought.
>
> On 10/22/2018 03:13 PM, Jason Ekstrand wrote:
> > [cover letter snipped; quoted in full in the 2018-11-02 message above]
> >
> > Please review!
> >
> > --Jason
> >
> > Cc: Connor Abbott 
> > Cc: Timothy Arceri 
> > Cc: Eric Anholt 
> > Cc: Rob Clark 
> > Cc: Karol Herbst 
> > Cc: Bas Nieuwenhuizen 
> > Cc: Alyssa Rosenzweig 
> >
> >
> > Jason Ekstrand (31):
> >   nir/validate: Print when the validation failed
> >   nir/constant_folding: Add an unreachable to a switch
> >   nir/constant_folding: Use nir_src_as_bool for discard_if
> >   nir/builder: Add a nir_imm_true/false helpers
> >   nir/builder: Handle 16-bit floats in nir_imm_floatN_t
> >   nir/search: 

Re: [Mesa-dev] [PATCH 00/31] nir: Use a 1-bit data type for booleans

2018-10-23 Thread Iago Toral
On Mon, 2018-10-22 at 17:13 -0500, Jason Ekstrand wrote:
> This is something that Connor and I have talked about quite a bit over the
> last couple of months.  The core idea is to replace NIR's current 32-bit
> 0/-1 D3D10-style booleans with a 1-bit data type.  All in all, I think it
> worked out pretty well though I really don't like the proliferation of
> 32-bit comparison opcodes we now have kicking around for i965.
> 
> Why?  No hardware really has a 1-bit type, right?  Well, sort of...  AMD
> actually uses 64-bit scalars for booleans with one bit per invocation.
> However, most hardware such as Intel uses some other larger value for
> booleans.  The real benefit of 1-bit booleans and requiring a lowering pass
> is that you can do somewhat custom lowering (like AMD wants) and your
> lowering pass can always tell in an instant if a value is a boolean based
> on the bit size.  As can be seen in the last patch, this makes it really
> easy to implement a bool -> float lowering pass for hardware that doesn't
> have real integers where NIR's current booleans are actually rather
> painful.
> 
> On Intel, the situation is a bit more complicated.  It's tempting to say
> that we have 32-bit D3D10 booleans.  However, they're not really D3D10
> booleans on gen4-5 because the top 31 bits are undefined garbage and, while
> iand, ior, ixor, and inot operations work, you have to iand with 1 at the
> last minute to clear off all that garbage.  Also, on all generations, a
> comparison of two N-bit values results in an N-bit boolean, not a 32-bit
> bool.  This has caused the Igalia folks no end of trouble as they've been
> working on native 8 and 16-bit support.  If, instead, we have a 1-bit bool
> with a lowering pass and we can lower to whatever we want, then we could
> lower to a set of comparison opcodes that return the same bit-size as they
> compare and it would match GEN hardware much better.

This is great! I have been working on and off on a backend optimization
pass to revert the 32-bit conversions we had to inject to honor NIR's
32-bit booleans during the NIR->FS translation, and honestly, the more I
worked on it the less I liked it.

> But what about performance?  Aren't there all sorts of neat tricks we can
> do with D3D10 booleans like b & 1.0f for b2f?  As it turns out, not really;
> that's about the only one.  There is some small advantage when optimizing
> shaders that come from D3D if your native representation of booleans
> matches that of D3D.  However, the penultimate patch in this series adds a
> few small optimizations that actually get us to better than where we were
> before.  With the entire series, shader-db on Kaby Lake looks like this:
> 
> total instructions in shared programs: 15084098 -> 14988578 (-0.63%)
> instructions in affected programs: 1321114 -> 1225594 (-7.23%)
> helped: 2340
> HURT: 23
> 
> total cycles in shared programs: 369790134 -> 359798399 (-2.70%)
> cycles in affected programs: 134085452 -> 124093717 (-7.45%)
> helped: 2149
> HURT: 720
> 
> total loops in shared programs: 4393 -> 4393 (0.00%)
> loops in affected programs: 0 -> 0
> helped: 0
> HURT: 0
> 
> total spills in shared programs: 10158 -> 10051 (-1.05%)
> spills in affected programs: 1429 -> 1322 (-7.49%)
> helped: 8
> HURT: 15
> 
> total fills in shared programs: 22105 -> 21720 (-1.74%)
> fills in affected programs: 2853 -> 2468 (-13.49%)
> helped: 12
> HURT: 15
> 
> How about ease of use?  Are they a pain to deal with?  Yes, adding support
> for 1-bit types was a bit awkward in a few places but most of it was
> dealing with all the places where we have 32-bit booleans baked into
> assumptions.  Getting rid of that baking in solves the problem and also
> makes the whole IR more future-proof.
> 
> All in all, I'd say I'm pretty happy with it.  However, I'd like other
> people (particularly the AMD folks) to play with it a bit and verify that
> it solves their problems as well.  Also, I added a lowering pass and tried
> to turn it on in everyone's driver but may not have put it in the right
> spot.  Please double-check my work.  For those wishing to take a look, you
> can also find the entire series on my gitlab here:
> 
> https://gitlab.freedesktop.org/jekstrand/mesa/commits/review/nir-1-bit-bool

I'll try to rebase my work on top of this and see how things look as
well. Thanks again for doing this, Jason.

Iago


Re: [Mesa-dev] [PATCH 00/31] nir: Use a 1-bit data type for booleans

2018-10-22 Thread Ian Romanick
Patches 1 through 13 are

Reviewed-by: Ian Romanick 

The rest are going to take a bit more time and deep thought.

On 10/22/2018 03:13 PM, Jason Ekstrand wrote:

[Mesa-dev] [PATCH 00/31] nir: Use a 1-bit data type for booleans

2018-10-22 Thread Jason Ekstrand
This is something that Connor and I have talked about quite a bit over the
last couple of months.  The core idea is to replace NIR's current 32-bit
0/-1 D3D10-style booleans with a 1-bit data type.  All in all, I think it
worked out pretty well though I really don't like the proliferation of
32-bit comparison opcodes we now have kicking around for i965.

Why?  No hardware really has a 1-bit type, right?  Well, sort of...  AMD
actually uses 64-bit scalars for booleans with one bit per invocation.
However, most hardware such as Intel uses some other larger value for
booleans.  The real benefit of 1-bit booleans and requiring a lowering pass
is that you can do somewhat custom lowering (like AMD wants) and your
lowering pass can always tell in an instant if a value is a boolean based
on the bit size.  As can be seen in the last patch, this makes it really
easy to implement a bool -> float lowering pass for hardware that doesn't
have real integers where NIR's current booleans are actually rather
painful.
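[Editor's note: a rough semantic model of what such a bool -> float lowering
means may help here. This is an illustrative sketch only, not NIR's actual
pass; the function names are made up for the example.]

```python
# Illustrative model of lowering 1-bit booleans to floats for hardware
# without real integers: true becomes 1.0, false becomes 0.0, and the
# boolean ALU ops map onto float ops.  All names here are hypothetical.

def b2f(b):
    """bool -> float: the conversion itself becomes trivial."""
    return 1.0 if b else 0.0

def f_and(a, b):
    """iand on 0.0/1.0 floats can be lowered to a multiply."""
    return a * b

def f_or(a, b):
    """ior can be lowered to a max."""
    return max(a, b)

def f_not(a):
    """inot can be lowered to 1.0 - x."""
    return 1.0 - a
```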

On Intel, the situation is a bit more complicated.  It's tempting to say
that we have 32-bit D3D10 booleans.  However, they're not really D3D10
booleans on gen4-5 because the top 31 bits are undefined garbage and, while
iand, ior, ixor, and inot operations work, you have to iand with 1 at the
last minute to clear off all that garbage.  Also, on all generations, a
comparison of two N-bit values results in an N-bit boolean, not a 32-bit
bool.  This has caused the Igalia folks no end of trouble as they've been
working on native 8 and 16-bit support.  If, instead, we have a 1-bit bool
with a lowering pass and we can lower to whatever we want, then we could
lower to a set of comparison opcodes that return the same bit-size as they
compare and it would match GEN hardware much better.
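[Editor's note: the bit-size-matching idea can be sketched as follows. This
is a hypothetical model of the lowering, not the actual i965 pass.]

```python
# Hypothetical model: with 1-bit booleans in the IR, a driver lowering
# pass can make a comparison return a 0/-1 mask of the *same* bit size
# as its operands, matching what GEN comparison instructions produce.

def lower_ilt(a, b, bit_size):
    """Lowered integer less-than: N-bit all-ones (i.e. -1) or zero."""
    all_ones = (1 << bit_size) - 1  # N-bit ~0
    return all_ones if a < b else 0
```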

But what about performance?  Aren't there all sorts of neat tricks we can
do with D3D10 booleans like b & 1.0f for b2f?  As it turns out, not really;
that's about the only one.  There is some small advantage when optimizing
shaders that come from D3D if your native representation of booleans
matches that of D3D.  However, the penultimate patch in this series adds a
few small optimizations that actually get us to better than where we were
before.  With the entire series, shader-db on Kaby Lake looks like this:

total instructions in shared programs: 15084098 -> 14988578 (-0.63%)
instructions in affected programs: 1321114 -> 1225594 (-7.23%)
helped: 2340
HURT: 23

total cycles in shared programs: 369790134 -> 359798399 (-2.70%)
cycles in affected programs: 134085452 -> 124093717 (-7.45%)
helped: 2149
HURT: 720

total loops in shared programs: 4393 -> 4393 (0.00%)
loops in affected programs: 0 -> 0
helped: 0
HURT: 0

total spills in shared programs: 10158 -> 10051 (-1.05%)
spills in affected programs: 1429 -> 1322 (-7.49%)
helped: 8
HURT: 15

total fills in shared programs: 22105 -> 21720 (-1.74%)
fills in affected programs: 2853 -> 2468 (-13.49%)
helped: 12
HURT: 15
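
[Editor's note: for reference, the "b & 1.0f" b2f trick mentioned above
relies on the 0/-1 mask representation: ANDing an all-ones boolean with the
bit pattern of 1.0f yields 1.0f, and ANDing zero yields 0.0f. A quick sketch:]

```python
import struct

# D3D10-style b2f via bitwise AND: booleans are 0 or 0xffffffff, and
# 0xffffffff & bits(1.0f) == bits(1.0f), while 0 & bits(1.0f) == bits(0.0f).
ONE_BITS = struct.unpack('<I', struct.pack('<f', 1.0))[0]  # 0x3f800000

def b2f_d3d10(b):
    """b is 0 or 0xffffffff; returns 0.0 or 1.0."""
    return struct.unpack('<f', struct.pack('<I', b & ONE_BITS))[0]
```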

How about ease of use?  Are they a pain to deal with?  Yes, adding support
for 1-bit types was a bit awkward in a few places but most of it was
dealing with all the places where we have 32-bit booleans baked into
assumptions.  Getting rid of that baking in solves the problem and also
makes the whole IR more future-proof.

All in all, I'd say I'm pretty happy with it.  However, I'd like other
people (particularly the AMD folks) to play with it a bit and verify that
it solves their problems as well.  Also, I added a lowering pass and tried
to turn it on in everyone's driver but may not have put it in the right
spot.  Please double-check my work.  For those wishing to take a look, you
can also find the entire series on my gitlab here:

https://gitlab.freedesktop.org/jekstrand/mesa/commits/review/nir-1-bit-bool

Please review!

--Jason

Cc: Connor Abbott 
Cc: Timothy Arceri 
Cc: Eric Anholt 
Cc: Rob Clark 
Cc: Karol Herbst 
Cc: Bas Nieuwenhuizen 
Cc: Alyssa Rosenzweig 


Jason Ekstrand (31):
  nir/validate: Print when the validation failed
  nir/constant_folding: Add an unreachable to a switch
  nir/constant_folding: Use nir_src_as_bool for discard_if
  nir/builder: Add a nir_imm_true/false helpers
  nir/builder: Handle 16-bit floats in nir_imm_floatN_t
  nir/search: Use nir_builder
  nir/opt_if: Rework condition propagation
  nir/system_values: Use the bit size from the load_deref
  glsl/nir: Use i2b instead of ine for fixing UBO/SSBO booleans
  nir/prog: Use nir_bany in kill handling
  spirv: Use the right bit-size for spec constant ops
  spirv: Initialize subgroup destinations with the destination type
  intel/nir: Use the OPT macro for more passes
  nir/algebraic: Disable b2f lowering and two optimizations
  nir: Rename boolean-related opcodes to include 32 in the name
  FIXUP: nir/builder: Generate 32-bit bool opcodes transparently
  FIXUP: nir/algebraic: Remap boolean opcodes to the 32-bit variant