El 13/12/18 a las 11:49, Pohjolainen, Topi escribió: > On Thu, Dec 13, 2018 at 09:10:24AM +0100, Iago Toral wrote: >> On Wed, 2018-12-12 at 14:15 +0200, Pohjolainen, Topi wrote: >>> On Wed, Dec 12, 2018 at 09:48:20AM +0100, Iago Toral wrote: >>>> On Tue, 2018-12-11 at 18:59 +0200, Pohjolainen, Topi wrote: >>>>> On Fri, Dec 07, 2018 at 03:30:11PM +0200, Pohjolainen, Topi >>>>> wrote: >>>>>> On Tue, Dec 04, 2018 at 08:17:05AM +0100, Iago Toral Quiroga >>>>>> wrote: >>>>>>> This function is used in two different scenarios that for 32- >>>>>>> bit >>>>>>> instructions are the same, but for 16-bit instructions are >>>>>>> not. >>>>>>> >>>>>>> One scenario is that in which we are working at a SIMD8 >>>>>>> register >>>>>>> level and we need to know if a register is fully defined or >>>>>>> written. >>>>>>> This is useful, for example, in the context of liveness >>>>>>> analysis >>>>>>> or >>>>>>> register allocation, where we work with units of registers. >>>>>>> >>>>>>> The other scenario is that in which we want to know if an >>>>>>> instruction >>>>>>> is writing a full scalar component or just some subset of it. >>>>>>> This is >>>>>>> useful, for example, in the context of some optimization >>>>>>> passes >>>>>>> like copy propagation. >>>>>>> >>>>>>> For 32-bit instructions (or larger), a SIMD8 dispatch will >>>>>>> always >>>>>>> write >>>>>>> at least a full SIMD8 register (32B) if the write is not >>>>>>> partial. >>>>>>> The >>>>>>> function is_partial_write() checks this to determine if we >>>>>>> have a >>>>>>> partial >>>>>>> write. However, when we deal with 16-bit instructions, that >>>>>>> logic >>>>>>> disables >>>>>>> some optimizations that should be safe. For example, a SIMD8 >>>>>>> 16- >>>>>>> bit MOV will >>>>>>> only update half of a SIMD register, but it is still a >>>>>>> complete >>>>>>> write of the >>>>>>> variable for a SIMD8 dispatch, so we should not prevent copy >>>>>>> propagation in >>>>>>> this scenario because we don't write all 32 bytes in the SIMD >>>>>>> register >>>>>>> or because the write starts at offset 16B (wehere we pack >>>>>>> components Y or >>>>>>> W of 16-bit vectors). >>>>>>> >>>>>>> This is a problem for SIMD8 executions (VS, TCS, TES, GS) of >>>>>>> 16- >>>>>>> bit >>>>>>> instructions, which lose a number of optimizations because of >>>>>>> this, most >>>>>>> important of which is copy-propagation. >>>>>>> >>>>>>> This patch splits is_partial_write() into >>>>>>> is_partial_reg_write(), >>>>>>> which >>>>>>> represents the current is_partial_write(), useful for things >>>>>>> like >>>>>>> liveness analysis, and is_partial_var_write(), which >>>>>>> considers >>>>>>> the dispatch size to check if we are writing a full variable >>>>>>> (rather >>>>>>> than a full register) to decide if the write is partial or >>>>>>> not, >>>>>>> which >>>>>>> is what we really want in many optimization passes. >>>>> >>>>> I actually started wondering why would liveness analysis and >>>>> register >>>>> coalescing need to treat the 16-bit SIMD8 case differently than >>>>> optimizations. >>>>> In virtual register space nothing would read or write the unused >>>>> second half >>>>> of the register in case of 16-bit type and SIMD8. >>>> >>>> True, we might be able to use the "variable" version in more cases. >>>> I >>>> was trying to be conservative when I implemented this because I >>>> don't >>>> think the half-float CTS tests provides a good testing ground for >>>> all >>>> aspects of the compiler. I can try that and see if it breaks >>>> anything >>>> though. >>>> >>>>> Real register allocation in turn should be orthogonal to how >>>>> things >>>>> are >>>>> allocated in virtual space. And I guess even there we wouldn't be >>>>> interested >>>>> of packing two 16-bit typed SIMD8 variables into one and same >>>>> hardware >>>>> register. It is SIMD16 where we get more pressure into register >>>>> space >>>>> anyway >>>>> and there the 16-bit typed variables occupy full registers. In >>>>> other >>>>> words, >>>>> if things fit in SIMD16, would we bother packing things more >>>>> tightly >>>>> in >>>>> SIMD8? Or even if SIMD8 was the only option, would we be >>>>> interested >>>>> packing >>>>> channels for two variables in one hw reg even then? >>>>> >>>>> Jason, we discussed this a little in the spring time. >>>>> >>>>> As a recap my approach shortly. Instead of ignoring the second >>>>> half >>>>> of >>>>> registers case by case I addressed it more generally: >>>>> >>>>> - changed all the open coded checks to use helpers, >>>>> - added a padding bit into fs_reg telling about the unused space, >>>>> - change nir -> fs step to set that bit for 16-bit typed regs >>>>> - and finally changed the helpers to consider the padding bit. >>>> >>>> So if I understand how this works, you mostly make the vgrf >>>> infrastructure think that half-float registers actually use twice >>>> the >>>> space they require by including the padding into the >>>> component_size() >>>> helper, right? (We also need to cover 8-bit by the way, but I >>>> suppose >>>> it just means increasing the padding we use for 8-bit variables). >>>> And >>>> then you modify is_partial_write() to use that helper so it >>>> accounts >>>> for that padding and just check if the (padded) size is less than a >>>> full SIMD register. Basically, we are making all smaller types look >>>> like they use 32-bit per channel in the vgrf space. >>> >>> Actually not 32-bit per channel but that there is unused data towards >>> the >>> end of the register, cut-paste from the diff: >>> >>> diff --git a/src/intel/compiler/brw_ir_fs.h >>> b/src/intel/compiler/brw_ir_fs.h >>> index 02c91b246a3..a1add69a319 100644 >>> --- a/src/intel/compiler/brw_ir_fs.h >>> +++ b/src/intel/compiler/brw_ir_fs.h >>> @@ -52,6 +52,9 @@ public: >>> >>> /** Register region horizontal stride */ >>> uint8_t stride; >>> + >>> + /* Needed, for example, for SIMD8 half float payloads. */ >>> + uint8_t pad_per_component; >>> }; >>> >>> static inline fs_reg >>> >>> >>> I should start by the history. Payloads for sampler messages, sampler >>> return >>> data and data port writes operate with full registers even in case of >>> SIMD8. >>> For example, in case of data port write, each component has all the >>> channels >>> packed in the beginning of the reg. The second half is simply ignored >>> by the >>> hardware. The hardware just expects each component to be aligned on >>> reg >>> boundary. The same applies for 16-bit sampler message coordinates, >>> each >>> component takes full reg and only the first half contains meaningful >>> data. >>> And when the hardware returns 16-bit typed data from the sampler in >>> SIMD8 the >>> data is once again padded. >>> This is where the notion of padding originally comes from. >>> >>> But then I started thinking why stop at payloads only - we could >>> extend the >>> same for all 16-bit typed variables in SIMD8. I thought that if the >>> hardware >>> wants to use padding for things such as sampler data coming in and >>> pixel data >>> going out then why should we use tighter packing for anything else - >>> at least >>> not in the virtual register space. >>> >>> And as answer to below, the padding is only used in SIMD8 mode, >>> SIMD16 doesn't >>> need any of that, for 16-bit types, I mean. >> >> Ok, so you just add the padding to the total size and only for SIMD8 >> dispatches of 16-bit. >> >> I guess we also need to account for horizontal strides right? For >> example, a SIMD8 half-float register with a stride of 2 uses a full >> register and should not have any padding in that case. We will have >> this for half-float when there are alignment restrictions (for >> conversions, etc). We also have this for pretty much all 8-bit >> instructions, since there is a 16-bit alignment requierement for pretty >> much everything in that case. >> >> Also, what would happen if someone changes the stride of the register >> (or retypes it), would the size calculations adapt the padding >> correspondingly and take it into account automatically? > > Well, this is just my way of seeing things. I thought that marking padding > explicitly in the fs_reg makes all sorts of asserts and debugging clearer > as the padding tells explicitly that at least the allocator of the reg meant > that only the first half is ever written (and hence only the first half should > be read). Now, the logic that adds workarounds for special cases, such as the > stride you mentioned, needs to modify the register (at least the stride) for > producer and every consumer. Now that logic would additionally need to reset > the padding as it doesn't apply anymore. Adding an assert allowing only > padding or stride != 1 but not both sounds like a thing to add on top. > >> >> I guess that in general I have two types of concerns; one is whether >> this is robust enough: on the one hand people can still try to open >> code size calculations in future work and they are very likely to not >> going get it right for SIMD8 with non-32 bit types if they do this. I >> guess this can be fixed with very careful reviews about these things in >> the future but I think this is going to be easy to miss. The other one >> is that as I mention above, we would be returning fake (padded) sizes >> and then using these sizes for things where we are expecting the real >> size, such as when checking for hardware restrictions, etc I am less >> worried about this issue though, since as long as we keep this for >> SIMD8 only I don't think that using padded sizes instead of real sizes >> would usually lead to different behaviors, as we usually only run into >> limitations when things exceed the size of a SIMD register, but I am >> not sure if this is always true. >> >>> Now, I haven't though about 8-bit types. At least there aren't >>> hardware >>> supported payloads for those, right? >> >> We need to do UBO load and SSBO load/store with 8-bit types. Chema >> implemented that so maybe he can share more details about what is >> needed here and if there are any particular restrictions or not. > > I had a quick look and initially it seems to me that those payloads are quite > different from sampler/fb write. They are rather fixed to 32-bit slots and > instead of any padding one interleaves things. I'll try to study this a little > more...
In the case of 8-bit types, we use the byte_scattered read/write messages that use a full register for the return/payload registers. But as you comment for every value read/written this message we use the fist 8/16 bits of each Double Word of the register. So we have undefined values interleaved that we shouldn't care about. So we need to read/write with stride=4 for 8-bit and stride=2 for 16-bit types. That is different for Samplers and Render Target Write for 16-bits that have an specific flag for 16-bits. BTW, In order to improve the register allocation I have a series related to "intel/fs: Liveness range improvements with partial writes" that I have some pending feedback to address at https://patchwork.freedesktop.org/series/46476/ Those patches solve for in the same block operations at the register allocator how to deal with partial read/write. They care for cases of padding and interleaved values. That are problematic because of the shuffle/unshuffle needed for 8/16/64 load/store operations. Chema >> >>>> >>>> One thing that worries me a bit is that we are faking the real size >>>> of >>>> things and that can lead to parts of the compiler that rely on >>>> these >>>> helpers to provide the real size to do things we don't want. For >>>> example: are you limiting this to SIMD8 only? I ask because if we >>>> do >>>> this for SIMD16 16-bit dispatches then we'd computing a size of two >>>> registers for things that only really use one, and I don't think we >>>> want that since that would have actual implications like SIMD-width >>>> lowering kicking in for things it should not need to. There are a >>>> lot >>>> of uses of the component_size helper in other parts of the >>>> compiler, >>>> including the generator, where we want to know the real size of >>>> things >>>> because they use these sizes to check hardware restrictions so we >>>> want >>>> to know the real size without padding... >>>> >>>> Also, I presume that all this is orthogonal to actual register >>>> allocation and that you are still storing the Y and W components of >>>> SIMD8 16-bit vectors in the second halves of registers, right? >>>> >>>>> Now, I'm fine doing this case by case the way it is done here. >>>>> I'm >>>>> just >>>>> wondering if the split is needed, i.e., considering in some cases >>>>> 16- >>>>> bit SIMD8 >>>>> virtual registers as half written and in some cases just ignoring >>>>> the >>>>> fact. >>>>>>> Then the patch goes on and rewrites all uses of >>>>>>> is_partial_write() to use >>>>>>> one or the other version. Specifically, we use >>>>>>> is_partial_var_write() >>>>>>> in the following places: copy propagation, cmod propagation, >>>>>>> common >>>>>>> subexpression elimination, saturate propagation and sel >>>>>>> peephole. >>>>>>> >>>>>>> Notice that the semantics of is_partial_var_write() exactly >>>>>>> match >>>>>>> the >>>>>>> current implementation of is_partial_write() for anything >>>>>>> that is >>>>>>> 32-bit or larger, so no changes are expected for 32-bit >>>>>>> instructions. >>>>>>> >>>>>>> Tested against ~5000 tests involving 16-bit instructions in >>>>>>> CTS >>>>>>> produced >>>>>>> the following changes in instruction counts: >>>>>>> >>>>>>> Patched | Master | % | >>>>>>> ================================================ >>>>>>> SIMD8 | 621,900 | 706,721 | -12.00% | >>>>>>> ================================================ >>>>>>> SIMD16 | 93,252 | 93,252 | 0.00% | >>>>>>> ================================================ >>>>>>> >>>>>>> As expected, the change only affects SIMD8 dispatches. >>>>>> >>>>>> I like this. But I think I want to try and rebase my fp16 work >>>>>> on >>>>>> top to see >>>>>> if there are any differences in the final assembly between this >>>>>> and >>>>>> my >>>>>> "register padding" scheme. >>>>>> >>>>>>> --- >>>>>>> src/intel/compiler/brw_fs.cpp | 31 >>>>>>> +++++++++++++++---- >>>>>>> .../compiler/brw_fs_cmod_propagation.cpp | 20 ++++++--- >>>>>>> --- >>>>>>> .../compiler/brw_fs_copy_propagation.cpp | 8 ++--- >>>>>>> src/intel/compiler/brw_fs_cse.cpp | 3 +- >>>>>>> .../compiler/brw_fs_dead_code_eliminate.cpp | 2 +- >>>>>>> src/intel/compiler/brw_fs_live_variables.cpp | 2 +- >>>>>>> src/intel/compiler/brw_fs_reg_allocate.cpp | 2 +- >>>>>>> .../compiler/brw_fs_register_coalesce.cpp | 2 +- >>>>>>> .../compiler/brw_fs_saturate_propagation.cpp | 7 +++-- >>>>>>> src/intel/compiler/brw_fs_sel_peephole.cpp | 4 +-- >>>>>>> src/intel/compiler/brw_ir_fs.h | 3 +- >>>>>>> 11 files changed, 54 insertions(+), 30 deletions(-) >>>>>>> >>>>>>> diff --git a/src/intel/compiler/brw_fs.cpp >>>>>>> b/src/intel/compiler/brw_fs.cpp >>>>>>> index 1d5d1dd0d22..9ea67975e1e 100644 >>>>>>> --- a/src/intel/compiler/brw_fs.cpp >>>>>>> +++ b/src/intel/compiler/brw_fs.cpp >>>>>>> @@ -698,14 +698,33 @@ >>>>>>> fs_visitor::limit_dispatch_width(unsigned >>>>>>> n, const char *msg) >>>>>>> * it. >>>>>>> */ >>>>>>> bool >>>>>>> -fs_inst::is_partial_write() const >>>>>>> +fs_inst::is_partial_reg_write() const >>>>>>> { >>>>>>> return ((this->predicate && this->opcode != >>>>>>> BRW_OPCODE_SEL) >>>>>>>>> >>>>>>> >>>>>>> - (this->exec_size * type_sz(this->dst.type)) < 32 >>>>>>> || >>>>>>> !this->dst.is_contiguous() || >>>>>>> + (this->exec_size * type_sz(this->dst.type)) < >>>>>>> REG_SIZE || >>>>>>> this->dst.offset % REG_SIZE != 0); >>>>>>> } >>>>>>> >>>>>>> +/** >>>>>>> + * Returns true if the instruction has a flag that means it >>>>>>> won't >>>>>>> + * update an entire variable for the given dispatch width. >>>>>>> + * >>>>>>> + * This is only different from is_partial_reg_write() for >>>>>>> SIMD8 >>>>>>> + * dispatches of 16-bit (or smaller) instructions. >>>>>>> + */ >>>>>>> +bool >>>>>>> +fs_inst::is_partial_var_write(uint32_t dispatch_width) const >>>>>>> +{ >>>>>>> + const uint32_t type_size = type_sz(this->dst.type); >>>>>>> + uint32_t var_size = MIN2(REG_SIZE, dispatch_width * >>>>>>> type_size); >>>>>>> + >>>>>>> + return ((this->predicate && this->opcode != >>>>>>> BRW_OPCODE_SEL) >>>>>>>>> >>>>>>> >>>>>>> + !this->dst.is_contiguous() || >>>>>>> + (this->exec_size * type_sz(this->dst.type)) < >>>>>>> var_size || >>>>>>> + this->dst.offset % var_size != 0); >>>>>>> +} >>>>>>> + >>>>>>> unsigned >>>>>>> fs_inst::components_read(unsigned i) const >>>>>>> { >>>>>>> @@ -2847,7 +2866,7 @@ fs_visitor::opt_register_renaming() >>>>>>> if (depth == 0 && >>>>>>> inst->dst.file == VGRF && >>>>>>> alloc.sizes[inst->dst.nr] * REG_SIZE == inst- >>>>>>>> size_written && >>>>>>> >>>>>>> - !inst->is_partial_write()) { >>>>>>> + !inst->is_partial_reg_write()) { >>>>>>> if (remap[dst] == -1) { >>>>>>> remap[dst] = dst; >>>>>>> } else { >>>>>>> @@ -3050,7 +3069,7 @@ fs_visitor::compute_to_mrf() >>>>>>> next_ip++; >>>>>>> >>>>>>> if (inst->opcode != BRW_OPCODE_MOV || >>>>>>> - inst->is_partial_write() || >>>>>>> + inst->is_partial_reg_write() || >>>>>>> inst->dst.file != MRF || inst->src[0].file != VGRF || >>>>>>> inst->dst.type != inst->src[0].type || >>>>>>> inst->src[0].abs || inst->src[0].negate || >>>>>>> @@ -3083,7 +3102,7 @@ fs_visitor::compute_to_mrf() >>>>>>> * that writes that reg, but it would require >>>>>>> smarter >>>>>>> * tracking. >>>>>>> */ >>>>>>> - if (scan_inst->is_partial_write()) >>>>>>> + if (scan_inst->is_partial_reg_write()) >>>>>>> break; >>>>>>> >>>>>>> /* Handling things not fully contained in the >>>>>>> source >>>>>>> of the copy >>>>>>> @@ -3395,7 +3414,7 @@ >>>>>>> fs_visitor::remove_duplicate_mrf_writes() >>>>>>> if (inst->opcode == BRW_OPCODE_MOV && >>>>>>> inst->dst.file == MRF && >>>>>>> inst->src[0].file != ARF && >>>>>>> - !inst->is_partial_write()) { >>>>>>> + !inst->is_partial_reg_write()) { >>>>>>> last_mrf_move[inst->dst.nr] = inst; >>>>>>> } >>>>>>> } >>>>>>> diff --git a/src/intel/compiler/brw_fs_cmod_propagation.cpp >>>>>>> b/src/intel/compiler/brw_fs_cmod_propagation.cpp >>>>>>> index 5fb522f810f..7bb5c9afbc9 100644 >>>>>>> --- a/src/intel/compiler/brw_fs_cmod_propagation.cpp >>>>>>> +++ b/src/intel/compiler/brw_fs_cmod_propagation.cpp >>>>>>> @@ -50,13 +50,13 @@ >>>>>>> >>>>>>> static bool >>>>>>> cmod_propagate_cmp_to_add(const gen_device_info *devinfo, >>>>>>> bblock_t *block, >>>>>>> - fs_inst *inst) >>>>>>> + fs_inst *inst, unsigned >>>>>>> dispatch_width) >>>>>>> { >>>>>>> bool read_flag = false; >>>>>>> >>>>>>> foreach_inst_in_block_reverse_starting_from(fs_inst, >>>>>>> scan_inst, inst) { >>>>>>> if (scan_inst->opcode == BRW_OPCODE_ADD && >>>>>>> - !scan_inst->is_partial_write() && >>>>>>> + !scan_inst->is_partial_var_write(dispatch_width) >>>>>>> && >>>>>>> scan_inst->exec_size == inst->exec_size) { >>>>>>> bool negate; >>>>>>> >>>>>>> @@ -126,7 +126,7 @@ cmod_propagate_cmp_to_add(const >>>>>>> gen_device_info *devinfo, bblock_t *block, >>>>>>> */ >>>>>>> static bool >>>>>>> cmod_propagate_not(const gen_device_info *devinfo, bblock_t >>>>>>> *block, >>>>>>> - fs_inst *inst) >>>>>>> + fs_inst *inst, unsigned dispatch_width) >>>>>>> { >>>>>>> const enum brw_conditional_mod cond = >>>>>>> brw_negate_cmod(inst- >>>>>>>> conditional_mod); >>>>>>> >>>>>>> bool read_flag = false; >>>>>>> @@ -141,7 +141,7 @@ cmod_propagate_not(const gen_device_info >>>>>>> *devinfo, bblock_t *block, >>>>>>> scan_inst->opcode != BRW_OPCODE_AND) >>>>>>> break; >>>>>>> >>>>>>> - if (scan_inst->is_partial_write() || >>>>>>> + if (scan_inst->is_partial_var_write(dispatch_width) >>>>>>> || >>>>>>> scan_inst->dst.offset != inst->src[0].offset || >>>>>>> scan_inst->exec_size != inst->exec_size) >>>>>>> break; >>>>>>> @@ -166,7 +166,9 @@ cmod_propagate_not(const gen_device_info >>>>>>> *devinfo, bblock_t *block, >>>>>>> } >>>>>>> >>>>>>> static bool >>>>>>> -opt_cmod_propagation_local(const gen_device_info *devinfo, >>>>>>> bblock_t *block) >>>>>>> +opt_cmod_propagation_local(const gen_device_info *devinfo, >>>>>>> + bblock_t *block, >>>>>>> + unsigned dispatch_width) >>>>>>> { >>>>>>> bool progress = false; >>>>>>> int ip = block->end_ip + 1; >>>>>>> @@ -219,14 +221,14 @@ opt_cmod_propagation_local(const >>>>>>> gen_device_info *devinfo, bblock_t *block) >>>>>>> */ >>>>>>> if (inst->opcode == BRW_OPCODE_CMP && !inst- >>>>>>>> src[1].is_zero()) { >>>>>>> >>>>>>> if (brw_reg_type_is_floating_point(inst- >>>>>>>> src[0].type) >>>>>>> && >>>>>>> - cmod_propagate_cmp_to_add(devinfo, block, >>>>>>> inst)) >>>>>>> + cmod_propagate_cmp_to_add(devinfo, block, inst, >>>>>>> dispatch_width)) >>>>>>> progress = true; >>>>>>> >>>>>>> continue; >>>>>>> } >>>>>>> >>>>>>> if (inst->opcode == BRW_OPCODE_NOT) { >>>>>>> - progress = cmod_propagate_not(devinfo, block, inst) >>>>>>> || >>>>>>> progress; >>>>>>> + progress = cmod_propagate_not(devinfo, block, inst, >>>>>>> dispatch_width) || progress; >>>>>>> continue; >>>>>>> } >>>>>>> >>>>>>> @@ -234,7 +236,7 @@ opt_cmod_propagation_local(const >>>>>>> gen_device_info *devinfo, bblock_t *block) >>>>>>> foreach_inst_in_block_reverse_starting_from(fs_inst, >>>>>>> scan_inst, inst) { >>>>>>> if (regions_overlap(scan_inst->dst, scan_inst- >>>>>>>> size_written, >>>>>>> >>>>>>> inst->src[0], inst- >>>>>>>> size_read(0))) >>>>>>> { >>>>>>> - if (scan_inst->is_partial_write() || >>>>>>> + if (scan_inst- >>>>>>>> is_partial_var_write(dispatch_width) >>>>>>>>> >>>>>>> >>>>>>> scan_inst->dst.offset != inst->src[0].offset >>>>>>> || >>>>>>> scan_inst->exec_size != inst->exec_size) >>>>>>> break; >>>>>>> @@ -342,7 +344,7 @@ fs_visitor::opt_cmod_propagation() >>>>>>> bool progress = false; >>>>>>> >>>>>>> foreach_block_reverse(block, cfg) { >>>>>>> - progress = opt_cmod_propagation_local(devinfo, block) >>>>>>> || >>>>>>> progress; >>>>>>> + progress = opt_cmod_propagation_local(devinfo, block, >>>>>>> dispatch_width) || progress; >>>>>>> } >>>>>>> >>>>>>> if (progress) >>>>>>> diff --git a/src/intel/compiler/brw_fs_copy_propagation.cpp >>>>>>> b/src/intel/compiler/brw_fs_copy_propagation.cpp >>>>>>> index c01d4ec4a4f..0f0284115fb 100644 >>>>>>> --- a/src/intel/compiler/brw_fs_copy_propagation.cpp >>>>>>> +++ b/src/intel/compiler/brw_fs_copy_propagation.cpp >>>>>>> @@ -505,7 +505,7 @@ fs_visitor::try_copy_propagate(fs_inst >>>>>>> *inst, >>>>>>> int arg, acp_entry *entry) >>>>>>> /* Compute the first component of the copy that the >>>>>>> instruction is >>>>>>> * reading, and the base byte offset within that >>>>>>> component. >>>>>>> */ >>>>>>> - assert(entry->dst.offset % REG_SIZE == 0 && entry- >>>>>>>> dst.stride >>>>>>> == 1); >>>>>>> + assert(entry->dst.stride == 1); >>>>>>> const unsigned component = rel_offset / type_sz(entry- >>>>>>>> dst.type); >>>>>>> >>>>>>> const unsigned suboffset = rel_offset % type_sz(entry- >>>>>>>> dst.type); >>>>>>> >>>>>>> >>>>>>> @@ -783,7 +783,7 @@ >>>>>>> fs_visitor::try_constant_propagate(fs_inst >>>>>>> *inst, acp_entry *entry) >>>>>>> } >>>>>>> >>>>>>> static bool >>>>>>> -can_propagate_from(fs_inst *inst) >>>>>>> +can_propagate_from(fs_inst *inst, unsigned dispatch_width) >>>>>>> { >>>>>>> return (inst->opcode == BRW_OPCODE_MOV && >>>>>>> inst->dst.file == VGRF && >>>>>>> @@ -794,7 +794,7 @@ can_propagate_from(fs_inst *inst) >>>>>>> inst->src[0].file == UNIFORM || >>>>>>> inst->src[0].file == IMM) && >>>>>>> inst->src[0].type == inst->dst.type && >>>>>>> - !inst->is_partial_write()); >>>>>>> + !inst->is_partial_var_write(dispatch_width)); >>>>>>> } >>>>>>> >>>>>>> /* Walks a basic block and does copy propagation on it using >>>>>>> the >>>>>>> acp >>>>>>> @@ -846,7 +846,7 @@ >>>>>>> fs_visitor::opt_copy_propagation_local(void >>>>>>> *copy_prop_ctx, bblock_t *block, >>>>>>> /* If this instruction's source could potentially be >>>>>>> folded into the >>>>>>> * operand of another instruction, add it to the ACP. >>>>>>> */ >>>>>>> - if (can_propagate_from(inst)) { >>>>>>> + if (can_propagate_from(inst, dispatch_width)) { >>>>>>> acp_entry *entry = ralloc(copy_prop_ctx, >>>>>>> acp_entry); >>>>>>> entry->dst = inst->dst; >>>>>>> entry->src = inst->src[0]; >>>>>>> diff --git a/src/intel/compiler/brw_fs_cse.cpp >>>>>>> b/src/intel/compiler/brw_fs_cse.cpp >>>>>>> index 6859733d58c..56813df2d2a 100644 >>>>>>> --- a/src/intel/compiler/brw_fs_cse.cpp >>>>>>> +++ b/src/intel/compiler/brw_fs_cse.cpp >>>>>>> @@ -247,7 +247,8 @@ fs_visitor::opt_cse_local(bblock_t >>>>>>> *block) >>>>>>> int ip = block->start_ip; >>>>>>> foreach_inst_in_block(fs_inst, inst, block) { >>>>>>> /* Skip some cases. */ >>>>>>> - if (is_expression(this, inst) && !inst- >>>>>>>> is_partial_write() >>>>>>> && >>>>>>> + if (is_expression(this, inst) && >>>>>>> + !inst->is_partial_var_write(dispatch_width) && >>>>>>> ((inst->dst.file != ARF && inst->dst.file != >>>>>>> FIXED_GRF) || >>>>>>> inst->dst.is_null())) >>>>>>> { >>>>>>> diff --git >>>>>>> a/src/intel/compiler/brw_fs_dead_code_eliminate.cpp >>>>>>> b/src/intel/compiler/brw_fs_dead_code_eliminate.cpp >>>>>>> index eeb71dd2b92..f24fd0643b8 100644 >>>>>>> --- a/src/intel/compiler/brw_fs_dead_code_eliminate.cpp >>>>>>> +++ b/src/intel/compiler/brw_fs_dead_code_eliminate.cpp >>>>>>> @@ -110,7 +110,7 @@ fs_visitor::dead_code_eliminate() >>>>>>> } >>>>>>> >>>>>>> if (inst->dst.file == VGRF) { >>>>>>> - if (!inst->is_partial_write()) { >>>>>>> + if (!inst->is_partial_reg_write()) { >>>>>>> int var = live_intervals->var_from_reg(inst- >>>>>>>> dst); >>>>>>> >>>>>>> for (unsigned i = 0; i < regs_written(inst); >>>>>>> i++) >>>>>>> { >>>>>>> BITSET_CLEAR(live, var + i); >>>>>>> diff --git a/src/intel/compiler/brw_fs_live_variables.cpp >>>>>>> b/src/intel/compiler/brw_fs_live_variables.cpp >>>>>>> index 059f076fa51..30625aa586a 100644 >>>>>>> --- a/src/intel/compiler/brw_fs_live_variables.cpp >>>>>>> +++ b/src/intel/compiler/brw_fs_live_variables.cpp >>>>>>> @@ -84,7 +84,7 @@ fs_live_variables::setup_one_write(struct >>>>>>> block_data *bd, fs_inst *inst, >>>>>>> * screens off previous updates of that variable (VGRF >>>>>>> channel). >>>>>>> */ >>>>>>> if (inst->dst.file == VGRF) { >>>>>>> - if (!inst->is_partial_write() && !BITSET_TEST(bd->use, >>>>>>> var)) >>>>>>> + if (!inst->is_partial_reg_write() && !BITSET_TEST(bd- >>>>>>>> use, >>>>>>> var)) >>>>>>> BITSET_SET(bd->def, var); >>>>>>> >>>>>>> BITSET_SET(bd->defout, var); >>>>>>> diff --git a/src/intel/compiler/brw_fs_reg_allocate.cpp >>>>>>> b/src/intel/compiler/brw_fs_reg_allocate.cpp >>>>>>> index 73b8b7841f5..5147e8b4426 100644 >>>>>>> --- a/src/intel/compiler/brw_fs_reg_allocate.cpp >>>>>>> +++ b/src/intel/compiler/brw_fs_reg_allocate.cpp >>>>>>> @@ -1026,7 +1026,7 @@ fs_visitor::spill_reg(int spill_reg) >>>>>>> * write, there should be no need for the unspill >>>>>>> since >>>>>>> the >>>>>>> * instruction will be overwriting the whole >>>>>>> destination in any case. >>>>>>> */ >>>>>>> - if (inst->is_partial_write() || >>>>>>> + if (inst->is_partial_reg_write() || >>>>>>> (!inst->force_writemask_all && !per_channel)) >>>>>>> emit_unspill(ubld, spill_src, >>>>>>> subset_spill_offset, >>>>>>> regs_written(inst)); >>>>>>> diff --git a/src/intel/compiler/brw_fs_register_coalesce.cpp >>>>>>> b/src/intel/compiler/brw_fs_register_coalesce.cpp >>>>>>> index 952276faed8..b27956023c6 100644 >>>>>>> --- a/src/intel/compiler/brw_fs_register_coalesce.cpp >>>>>>> +++ b/src/intel/compiler/brw_fs_register_coalesce.cpp >>>>>>> @@ -70,7 +70,7 @@ is_coalesce_candidate(const fs_visitor *v, >>>>>>> const fs_inst *inst) >>>>>>> { >>>>>>> if ((inst->opcode != BRW_OPCODE_MOV && >>>>>>> inst->opcode != SHADER_OPCODE_LOAD_PAYLOAD) || >>>>>>> - inst->is_partial_write() || >>>>>>> + inst->is_partial_reg_write() || >>>>>>> inst->saturate || >>>>>>> inst->src[0].file != VGRF || >>>>>>> inst->src[0].negate || >>>>>>> diff --git >>>>>>> a/src/intel/compiler/brw_fs_saturate_propagation.cpp >>>>>>> b/src/intel/compiler/brw_fs_saturate_propagation.cpp >>>>>>> index d6cfa79a618..1e1461063ae 100644 >>>>>>> --- a/src/intel/compiler/brw_fs_saturate_propagation.cpp >>>>>>> +++ b/src/intel/compiler/brw_fs_saturate_propagation.cpp >>>>>>> @@ -43,7 +43,8 @@ >>>>>>> */ >>>>>>> >>>>>>> static bool >>>>>>> -opt_saturate_propagation_local(fs_visitor *v, bblock_t >>>>>>> *block) >>>>>>> +opt_saturate_propagation_local(fs_visitor *v, bblock_t >>>>>>> *block, >>>>>>> + unsigned dispatch_width) >>>>>>> { >>>>>>> bool progress = false; >>>>>>> int ip = block->end_ip + 1; >>>>>>> @@ -66,7 +67,7 @@ opt_saturate_propagation_local(fs_visitor >>>>>>> *v, >>>>>>> bblock_t *block) >>>>>>> foreach_inst_in_block_reverse_starting_from(fs_inst, >>>>>>> scan_inst, inst) { >>>>>>> if (regions_overlap(scan_inst->dst, scan_inst- >>>>>>>> size_written, >>>>>>> >>>>>>> inst->src[0], inst- >>>>>>>> size_read(0))) >>>>>>> { >>>>>>> - if (scan_inst->is_partial_write() || >>>>>>> + if (scan_inst- >>>>>>>> is_partial_var_write(dispatch_width) >>>>>>>>> >>>>>>> >>>>>>> (scan_inst->dst.type != inst->dst.type && >>>>>>> !scan_inst->can_change_types())) >>>>>>> break; >>>>>>> @@ -153,7 +154,7 @@ fs_visitor::opt_saturate_propagation() >>>>>>> calculate_live_intervals(); >>>>>>> >>>>>>> foreach_block (block, cfg) { >>>>>>> - progress = opt_saturate_propagation_local(this, block) >>>>>>> || >>>>>>> progress; >>>>>>> + progress = opt_saturate_propagation_local(this, block, >>>>>>> dispatch_width) || progress; >>>>>>> } >>>>>>> >>>>>>> /* Live intervals are still valid. */ >>>>>>> diff --git a/src/intel/compiler/brw_fs_sel_peephole.cpp >>>>>>> b/src/intel/compiler/brw_fs_sel_peephole.cpp >>>>>>> index 6395b409b7c..98d640a3bfe 100644 >>>>>>> --- a/src/intel/compiler/brw_fs_sel_peephole.cpp >>>>>>> +++ b/src/intel/compiler/brw_fs_sel_peephole.cpp >>>>>>> @@ -167,8 +167,8 @@ fs_visitor::opt_peephole_sel() >>>>>>> then_mov[i]->exec_size != else_mov[i]- >>>>>>>> exec_size || >>>>>>> then_mov[i]->group != else_mov[i]->group || >>>>>>> then_mov[i]->force_writemask_all != >>>>>>> else_mov[i]- >>>>>>>> force_writemask_all || >>>>>>> >>>>>>> - then_mov[i]->is_partial_write() || >>>>>>> - else_mov[i]->is_partial_write() || >>>>>>> + then_mov[i]- >>>>>>>> is_partial_var_write(dispatch_width) >>>>>>>>> >>>>>>> >>>>>>> + else_mov[i]- >>>>>>>> is_partial_var_write(dispatch_width) >>>>>>>>> >>>>>>> >>>>>>> then_mov[i]->conditional_mod != >>>>>>> BRW_CONDITIONAL_NONE || >>>>>>> else_mov[i]->conditional_mod != >>>>>>> BRW_CONDITIONAL_NONE) { >>>>>>> movs = i; >>>>>>> diff --git a/src/intel/compiler/brw_ir_fs.h >>>>>>> b/src/intel/compiler/brw_ir_fs.h >>>>>>> index 07e7224e0f8..02c91b246a3 100644 >>>>>>> --- a/src/intel/compiler/brw_ir_fs.h >>>>>>> +++ b/src/intel/compiler/brw_ir_fs.h >>>>>>> @@ -349,7 +349,8 @@ public: >>>>>>> >>>>>>> bool equals(fs_inst *inst) const; >>>>>>> bool is_send_from_grf() const; >>>>>>> - bool is_partial_write() const; >>>>>>> + bool is_partial_reg_write() const; >>>>>>> + bool is_partial_var_write(unsigned dispatch_width) const; >>>>>>> bool is_copy_payload(const brw::simple_allocator >>>>>>> &grf_alloc) >>>>>>> const; >>>>>>> unsigned components_read(unsigned i) const; >>>>>>> unsigned size_read(int arg) const; >>>>>>> -- >>>>>>> 2.17.1 >>>>>>> >>>>>>> _______________________________________________ >>>>>>> mesa-dev mailing list >>>>>>> mesa-dev@lists.freedesktop.org >>>>>>> https://lists.freedesktop.org/mailman/listinfo/mesa-dev >>>>> >>>>> >>> >>> >> > _______________________________________________ mesa-dev mailing list mesa-dev@lists.freedesktop.org https://lists.freedesktop.org/mailman/listinfo/mesa-dev