Re: [Mesa-dev] [PATCH 3/3] intel/compiler: add an optimization pass for booleans
On Thu, 2018-07-05 at 15:47 -0700, Caio Marcelo de Oliveira Filho wrote: > (I had to stop reading to go home last Tuesday, so here are the > remaining comments.) > > > On Tue, May 15, 2018 at 01:05:21PM +0200, Iago Toral Quiroga wrote: > > NIR assumes that all booleans are 32-bit but Intel hardware > > produces > > booleans of the same size as the operands to the CMP instruction, > > so we > > can actually have 8-bit and 16-bit booleans. To work around this > > mismatch between NIR and the hardware, we emit boolean conversions > > to > > 32-bit right after emitting the CMP instruction during the NIR->FS > > pass, which makes interfacing with NIR a lot easier, but can leave > > unnecessary boolean conversions in the shader code. > > Question: have you explored handling this at the NIR->FS conversion? > I.e. instead of generate the cmp + mov and then look for the cmp + > mov > to fix up; when generating a cmp, perform those checks (at nir level) > and then pick the right bitsize. It is not that easy. The problem is that NIR will continue to come at us with 32-bit boolean instructions after the CMP+MOV, so instead of prpagating forward for every conversion, now, for every bool we find in IR we'd need to go back in the FS program to check if it is a real 32- bit boolean or not to decide what to emit. I don't see any benefit, plus we would be coupling all this with the translation implementation, which I think is less nice than having it being a completely separate thing. Anyway, there is a major issue with the current patch that I have found this week thanks to some new CTS tests: when we propagate the bitsize of a logical instruction to its destination, that affects all its consumers even outside the current block, so we need to handle propagation across blocks, which adds a few more problems, so I still need to figure out how to deal with that properly and whether that is something we want to do (there is a reason why no other opt/lowering passes do cross-block changes...). Iago > > > +/** > > + * Propagates the bit-size of the destination of a boolean > > instruction to > > + * all its consumers. If propagate_from_source is True, then the > > producer > > + * is a conversion MOV from a low bit-size boolean to 32-bit, and > > in that > > + * case the propagation happens from the source of the instruction > > instead > > + * of its destination. > > + */ > > +static bool > > +propagate_bool_bit_size(fs_inst *inst, bool propagate_from_source) > > +{ > > + assert(!propagate_from_source || inst->opcode == > > BRW_OPCODE_MOV); > > + > > + bool progress = false; > > + > > + const unsigned bit_size = 8 * (propagate_from_source ? > > + type_sz(inst->src[0].type) : type_sz(inst->dst.type)); > > + > > + /* Look for any follow-up instructions that sources from the > > boolean > > +* result of the producer instruction and rewrite them to use > > the correct > > +* bit-size. > > +*/ > > + foreach_inst_in_block_starting_from(fs_inst, fixup_inst, inst) > > { > > + if (!inst_supports_boolean(fixup_inst)) > > + continue; > > Should we care about other instruction clobbering the contents of > inst->dst, or at this point of the optimization we can count on it > not > being? > > > > + /* If it is a plain boolean conversion to 32-bit, then > > look for any > > + * follow-up instructions that source from the 32-bit > > boolean and > > + * rewrite them to source from the output of the CMP > > (which is the > > + * source of the conversion instruction) directly if > > possible. > > + */ > > + progress = propagate_bool_bit_size(conv_inst, true) || > > progress; > > + } > > +#if 0 > > + else if (inst_supports_boolean(inst) && inst->sources > 1) > > { > > If you end up enabling this section, I suggest move the > inst_supports_boolean() check to the beginning of the for-loop, as an > early return. Makes the condition for the cases we are handling > cleaner. > > > > > + /* For all logical instructions that can take more than > > one operand > > + * we need to ensure that all of them have matching bit- > > sizes. If they > > + * don't, it means that the original shader code is > > operating boolean > > + * expressions with different native bit-sizes and we > > need to choose > > + * a canonical boolean form for all the operands, which > > requires to > > + * inject conversions to temporaries. We choose the bit- > > size of the > > + * destination as the canonical form (which must be a 32- > > bit boolean > > + * since we only propagate smaller bit-sizes to the > > destination if we > > + * managed to convert all the operands to the same bit- > > size) because > > + * that way we don't need to worry about propagating the > > destination > > + * bit-size down the line. > > + */ > > To make this comment less heavy, I'd
Re: [Mesa-dev] [PATCH 3/3] intel/compiler: add an optimization pass for booleans
(I had to stop reading to go home last Tuesday, so here are the remaining comments.) On Tue, May 15, 2018 at 01:05:21PM +0200, Iago Toral Quiroga wrote: > NIR assumes that all booleans are 32-bit but Intel hardware produces > booleans of the same size as the operands to the CMP instruction, so we > can actually have 8-bit and 16-bit booleans. To work around this > mismatch between NIR and the hardware, we emit boolean conversions to > 32-bit right after emitting the CMP instruction during the NIR->FS > pass, which makes interfacing with NIR a lot easier, but can leave > unnecessary boolean conversions in the shader code. Question: have you explored handling this at the NIR->FS conversion? I.e. instead of generate the cmp + mov and then look for the cmp + mov to fix up; when generating a cmp, perform those checks (at nir level) and then pick the right bitsize. > +/** > + * Propagates the bit-size of the destination of a boolean instruction to > + * all its consumers. If propagate_from_source is True, then the producer > + * is a conversion MOV from a low bit-size boolean to 32-bit, and in that > + * case the propagation happens from the source of the instruction instead > + * of its destination. > + */ > +static bool > +propagate_bool_bit_size(fs_inst *inst, bool propagate_from_source) > +{ > + assert(!propagate_from_source || inst->opcode == BRW_OPCODE_MOV); > + > + bool progress = false; > + > + const unsigned bit_size = 8 * (propagate_from_source ? > + type_sz(inst->src[0].type) : type_sz(inst->dst.type)); > + > + /* Look for any follow-up instructions that sources from the boolean > +* result of the producer instruction and rewrite them to use the correct > +* bit-size. > +*/ > + foreach_inst_in_block_starting_from(fs_inst, fixup_inst, inst) { > + if (!inst_supports_boolean(fixup_inst)) > + continue; Should we care about other instruction clobbering the contents of inst->dst, or at this point of the optimization we can count on it not being? > + /* If it is a plain boolean conversion to 32-bit, then look for any > + * follow-up instructions that source from the 32-bit boolean and > + * rewrite them to source from the output of the CMP (which is the > + * source of the conversion instruction) directly if possible. > + */ > + progress = propagate_bool_bit_size(conv_inst, true) || progress; > + } > +#if 0 > + else if (inst_supports_boolean(inst) && inst->sources > 1) { If you end up enabling this section, I suggest move the inst_supports_boolean() check to the beginning of the for-loop, as an early return. Makes the condition for the cases we are handling cleaner. > + /* For all logical instructions that can take more than one operand > + * we need to ensure that all of them have matching bit-sizes. If > they > + * don't, it means that the original shader code is operating > boolean > + * expressions with different native bit-sizes and we need to choose > + * a canonical boolean form for all the operands, which requires to > + * inject conversions to temporaries. We choose the bit-size of the > + * destination as the canonical form (which must be a 32-bit boolean > + * since we only propagate smaller bit-sizes to the destination if > we > + * managed to convert all the operands to the same bit-size) because > + * that way we don't need to worry about propagating the destination > + * bit-size down the line. > + */ To make this comment less heavy, I'd move the assumption about the destination being 32-bit right above the assert, which is kind of an explanation of the assertion. > @@ -6240,6 +6471,7 @@ fs_visitor::optimize() > > OPT(opt_drop_redundant_mov_to_flags); > OPT(remove_extra_rounding_modes); > + OPT(opt_bool_bit_size); It could be useful to have a short comment here about the importance of calling opt_bool_bit_size at this point. Thanks, Caio ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH 3/3] intel/compiler: add an optimization pass for booleans
On Tue, 2018-07-03 at 18:45 -0700, Caio Marcelo de Oliveira Filho wrote: > Hi, > > > > + /* Look for any follow-up instructions that sources from the > > boolean > > +* result of the producer instruction and rewrite them to use > > the correct > > +* bit-size. > > +*/ > > + foreach_inst_in_block_starting_from(fs_inst, fixup_inst, inst) > > { > > + if (!inst_supports_boolean(fixup_inst)) > > + continue; > > + > > + /* For MOV instructions we can always rewrite the boolean > > source > > + * if the instrucion reads the same region we produced in > > the > > + * 32-bit conversion. > > + */ > > + if (fixup_inst->opcode == BRW_OPCODE_MOV && > > + region_match(inst->dst, inst->size_written, > > + fixup_inst->src[0], fixup_inst- > > >size_read(0))) { > > + if (propagate_from_source) { > > +fixup_inst->src[0].file = inst->src[0].file; > > +fixup_inst->src[0].nr = inst->src[0].nr; > > + } > > + fixup_inst->src[0] = > > +fix_bool_reg_bit_size(fixup_inst->src[0], bit_size); > > + progress = true; > > + continue; > > + } > > It seems the rest of the code assumes that instruction is not MOV, so > you would need to ensure continue is called regardless the region > match. Right, although if the region doesn't match the rest of the code won't do anything anyway. > Idea: it seems we could just remove this section above (handling > MOV), > and slightly change the section below so that MOV can be dealt with > it > too. > > - Drop the section above; > - Rename progress_logical to local_progress; > - Add a "fixup_inst->opcode == BRW_OPCODE_MOV" to the The recursive call executes for logical instructions, not for MOV, so this should be !=. > condition that controls the recursive call; > - Update comments accordingly. Sounds like a good idea, thanks for the feedback. Iago > > > + > > + /* For logical instructions we have the same restriction as > > for MOVs, > > + * and we also need to: > > + * > > + * 1. Propagate the bit-size to the boolean destination of > > the > > + *instruction. > > + * 2. Rewrite any instruction that reads the destination to > > use > > + *the new bit-size. > > + * > > + * However, we can only do these if we can rewrite all the > > operands > > + * to use the same bit-size. > > + */ > > + bool progress_logical = false; > > + bool same_bit_size = true; > > + for (unsigned i = 0; i < fixup_inst->sources; i++) { > > + if (region_match(inst->dst, inst->size_written, > > + fixup_inst->src[i], fixup_inst- > > >size_read(i))) { > > +if (propagate_from_source) { > > + fixup_inst->src[i].file = inst->src[0].file; > > + fixup_inst->src[i].nr = inst->src[0].nr; > > +} > > +fixup_inst->src[i] = > > + fix_bool_reg_bit_size(fixup_inst->src[i], > > bit_size); > > +progress_logical = true; > > +progress = true; > > + } > > + > > + if (i > 0 && > > + type_sz(fixup_inst->src[i].type) != > > + type_sz(fixup_inst->src[i - 1].type)) { > > +same_bit_size = false; > > + } > > + } > > + > > + /* If we have successfully rewritten a logical instruction > > operand > > + * to use a smaller bit-size boolean and all the operands in > > the > > + * instruction have the same small bit-size, then propagate > > the > > + * new bit-size to the destination boolean and do the same > > for all > > + * follow-up instructions that read from it. > > + */ > > + if (progress_logical && same_bit_size) { > > + fixup_inst->dst = retype(fixup_inst->dst, fixup_inst- > > >src[0].type); > > + propagate_bool_bit_size(fixup_inst, false); > > + } > > + } > > + > > + return progress; > > +} > > > > > Thanks, > Caio > > ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev
Re: [Mesa-dev] [PATCH 3/3] intel/compiler: add an optimization pass for booleans
Hi, > + /* Look for any follow-up instructions that sources from the boolean > +* result of the producer instruction and rewrite them to use the correct > +* bit-size. > +*/ > + foreach_inst_in_block_starting_from(fs_inst, fixup_inst, inst) { > + if (!inst_supports_boolean(fixup_inst)) > + continue; > + > + /* For MOV instructions we can always rewrite the boolean source > + * if the instrucion reads the same region we produced in the > + * 32-bit conversion. > + */ > + if (fixup_inst->opcode == BRW_OPCODE_MOV && > + region_match(inst->dst, inst->size_written, > + fixup_inst->src[0], fixup_inst->size_read(0))) { > + if (propagate_from_source) { > +fixup_inst->src[0].file = inst->src[0].file; > +fixup_inst->src[0].nr = inst->src[0].nr; > + } > + fixup_inst->src[0] = > +fix_bool_reg_bit_size(fixup_inst->src[0], bit_size); > + progress = true; > + continue; > + } It seems the rest of the code assumes that instruction is not MOV, so you would need to ensure continue is called regardless the region match. Idea: it seems we could just remove this section above (handling MOV), and slightly change the section below so that MOV can be dealt with it too. - Drop the section above; - Rename progress_logical to local_progress; - Add a "fixup_inst->opcode == BRW_OPCODE_MOV" to the condition that controls the recursive call; - Update comments accordingly. > + > + /* For logical instructions we have the same restriction as for MOVs, > + * and we also need to: > + * > + * 1. Propagate the bit-size to the boolean destination of the > + *instruction. > + * 2. Rewrite any instruction that reads the destination to use > + *the new bit-size. > + * > + * However, we can only do these if we can rewrite all the operands > + * to use the same bit-size. > + */ > + bool progress_logical = false; > + bool same_bit_size = true; > + for (unsigned i = 0; i < fixup_inst->sources; i++) { > + if (region_match(inst->dst, inst->size_written, > + fixup_inst->src[i], fixup_inst->size_read(i))) { > +if (propagate_from_source) { > + fixup_inst->src[i].file = inst->src[0].file; > + fixup_inst->src[i].nr = inst->src[0].nr; > +} > +fixup_inst->src[i] = > + fix_bool_reg_bit_size(fixup_inst->src[i], bit_size); > +progress_logical = true; > +progress = true; > + } > + > + if (i > 0 && > + type_sz(fixup_inst->src[i].type) != > + type_sz(fixup_inst->src[i - 1].type)) { > +same_bit_size = false; > + } > + } > + > + /* If we have successfully rewritten a logical instruction operand > + * to use a smaller bit-size boolean and all the operands in the > + * instruction have the same small bit-size, then propagate the > + * new bit-size to the destination boolean and do the same for all > + * follow-up instructions that read from it. > + */ > + if (progress_logical && same_bit_size) { > + fixup_inst->dst = retype(fixup_inst->dst, fixup_inst->src[0].type); > + propagate_bool_bit_size(fixup_inst, false); > + } > + } > + > + return progress; > +} Thanks, Caio ___ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev