On Wed, Sep 6, 2017 at 3:58 PM, Chema Casanova <jmcasan...@igalia.com> wrote:
> Hi Connor and Curro, > > On 28/08/17 12:24, Alejandro Piñeiro wrote: > > On 27/08/17 20:24, Connor Abbott wrote: > >> Hi, > >> > >> On Aug 25, 2017 9:28 AM, "Alejandro Piñeiro" <apinhe...@igalia.com > >> <mailto:apinhe...@igalia.com>> wrote: > >> > >> On 24/08/17 21:07, Connor Abbott wrote: > >> > > >> > Hi Alejandro, > >> > >> Hi Connor, > >> > >> > > >> > This seems really suspicious. If the live ranges are really > >> > independent, then the register allocator should be able to > >> assign the > >> > two virtual registers to the same physical register if it needs > to. > >> > >> Yes, it is true, the register allocator should be able to assign two > >> virtual registers to the same physical register. But that is done > >> at the > >> end (or really near the end), so late for the problem this > >> optimization > >> is trying to fix. > >> > >> > >> Well, my understanding is that the problem is long compilation times > >> due to spilling and our not-so-great implementation of it. So no, > >> register allocation is not late for the problem. As both Curro and I > >> explained, the change by itself can only pessimise register > >> allocation, so if it helps then it must be due to a bug in the > >> register allocator or a problem in a subsequent pass that's getting > >> hidden by this one. > > > > Ok. > > > >> > >> We are also reducing the amount of instructions used. > >> > >> > >> The comments in the source code say otherwise. Any instructions > >> eliminated were from spilling, which this pass only accidentally > reduces. > > > > Yes, sorry, I explained myself poorly. The optimization itself doesn't > > remove any instructions. But using it reduces the final number of > > instructions, although as you say, they are likely due reducing the > > spilling. > > > >> > >> > >> > >> Probably not really clear on the commit message. When I say > >> "reduce the > >> pressure of the register allocator" I mean having a code that the > >> register allocator would be able to handle without using too much > >> time. > >> The problem this optimization tries to solve is that for some 16 > >> bit CTS > >> tests (some with matrices and geometry shaders), the amount of > virtual > >> registers used and instructions was really big. For the record, > >> initially, some tests needed 24 min just to compile. Right now, > thanks > >> to other optimizations, the slower test without this optimization > >> needs > >> 1min 30 seconds. Adding some hacky timestamps, the time used at > >> fs_visitor::allocate_registers (brw_fs.cpp:6096) is: > >> > >> * While trying to schedule using the three available pre mode > >> heuristics: 7 seconds > >> * Allocation with spilling: 63 seconds > >> * Final schedule using SCHEDULE_POST: 19 seconds > >> > >> With this optimization, the total time goes down to 14 seconds (10 > >> + 0 + > >> 3 on the previous bullet point list). > >> > >> One could argue that 1min 30 seconds is okish. But taking into > account > >> that it goes down to 14 seconds, even with some caveats (see > below), I > >> still think that it is worth to use the optimization. > >> > >> And a final comment. For that same test, this is the final stats > >> (using > >> INTEL_DEBUG): > >> > >> * With the optimization: SIMD8 shader: 4610 instructions. 0 loops. > >> 130320 cycles. 15:9 spills:fills. > >> * Without the optimization: SIMD8 shader: 12312 instructions. 0 > >> loops. > >> 174816 cycles. 751:1851 spills:fills. > >> > >> > >> So, the fact that it helps at all with SIMD8 shows that my theory is > >> wrong, but since your pass reduces spilling, it clearly must be > >> avoiding a bug somewhere else. You need to compare the IR for a shader > >> with the problem with and without this pass right before register > >> allocation. Maybe the sources and destinations of the conversion > >> instructions interfere without the change due to some other pass > >> that's increasing register pressure, in which case that's the problem, > >> but I doubt it. > > > > Ok, thanks for the hints. > > After some research we found that we need to adapt the live_variables > algorithm to support 32 to 16-bit conversions. Because of the HW > alignment restrictions these conversions need that the result register > uses stride=2, so it is not continuous (stride!=1) so by definition > is_partial_write returns true. Any of the next last 3 conditions could > be true when we use 16-bit types. > > bool > fs_inst::is_partial_write() const > { > return ((this->predicate && this->opcode != BRW_OPCODE_SEL) || > (this->exec_size * type_sz(this->dst.type)) < 32 || > !this->dst.is_contiguous() || > this->dst.offset % REG_SIZE != 0); > } > > So at the check on the setup_one_write function at > brw_fs_live_variables.cpp the variable isn't marked as defined > completely in the block. > > if (inst->dst.file == VGRF && !inst->is_partial_write()) { > if (!BITSET_TEST(bd->use, var)) > BITSET_SET(bd->def, var); > } > > That makes that the live start of the variable is expected to defined > before the beginning of the block. In the commented test by Alejandro we > have a big unrolled loop that increases the pressure of the register > artificially making all result registers of a conversion to start life > at instruction 0. > Quick drive-by comment. Curro and I talked about this quite a bit in the office yesterday as I've been hitting similar issues with the vec2->double conversion that we do when loading 64-bit values from UBOs and SSBOs. He's got a patch to liveness which makes registers which are only ever partially written not extned their live ranges infinitely upward. That's probably easier than trying to track things per-component. --Jason > A simpler approach to the submitted optimization would be to consider as > a fully defined variable in a block the result of a 32-16 bit > conversion. This way the register allocator has more freedom to assign > the register because the live_interval is correct for this variable. > Something like the following code: > > diff --git a/src/intel/compiler/brw_fs_live_variables.cpp > b/src/intel/compiler/brw_fs_live_variables.cpp > index c449672a51..98d8583eaa 100644 > --- a/src/intel/compiler/brw_fs_live_variables.cpp > +++ b/src/intel/compiler/brw_fs_live_variables.cpp > @@ -83,7 +83,10 @@ fs_live_variables::setup_one_write(struct block_data > *bd, fs_inst *inst, > /* The def[] bitset marks when an initialization in a block completely > * screens off previous updates of that variable (VGRF channel). > */ > - if (inst->dst.file == VGRF && !inst->is_partial_write()) { > + if (inst->dst.file == VGRF && (!inst->is_partial_write() || > + ((type_sz(inst->dst.type) == 2) && > + (inst->opcode == BRW_OPCODE_MOV) && > + (type_sz(inst->src[0].type) == 4)))) { > if (!BITSET_TEST(bd->use, var)) > BITSET_SET(bd->def, var); > } > > This new code reduces the spills from 1217 to 1089 in our worse tests > compared to using the submitted optimization and without regressions. > > Although this is a simple fix that address the problem with the > conversions (that is the main use that 16bit arithmetic of the > VK_KHR_16bit_storage extension), there are other cases that we didn't > addressed with the reuse_16bit_conversions_register that need a more > complex approach. > > I'm thinking about the cases when 16-bit types are packed/unpacked or > shuffled/unshuffled in the storage operations. They are also > partial_writes but the aggregation of several operations make that the > register could be considered as completely defined in a block. > > - This would need doing a "subregister" tracking for stride=2 and > offset=[0,2] for shuffled components that are written using 32-bit > storage operations. > - And the same for packet 16-bit values on SIMD8 that use half register > continuously. > > So if only half of the register is defined for a given variable and only > that half is read by the shader we can consider consistent that the > variable is completely defined. But in the case that the different > halves are defined in different blocks things get complicated but could > stay as partial write. > > Does it makes sense to go deeper with an approach like this one? Maybe > is not strictly needed for this series but it could be an interesting > follow up for adjusting the liveness of the payloads of the 16-bit store > operations and any future use of 16-bit types. > > Thanks in advance for your feedback. > > Chema > > >> (IIRC there's a debug option to print out the register pressure for > >> each instruction, which will be helpful here). > > > > I tried to use that option back when I was working on this patch, > > without too much luck. Will try again. > > > >> If that's not the case, you should check to see if the register > >> allocator thinks the sources and destinations of the conversion > >> instructions interfere, and if so figure out why - that will likely be > >> the real bug. > > > > Ok. > > > > Thanks Connor and Curro for the comments. I will work on a different > > solution. > > _______________________________________________ > mesa-dev mailing list > mesa-dev@lists.freedesktop.org > https://lists.freedesktop.org/mailman/listinfo/mesa-dev >
_______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev