> -----Original Message-----
> From: Beignet [mailto:beignet-boun...@lists.freedesktop.org] On Behalf Of
> Guo, Yejun
> Sent: Monday, September 7, 2015 8:27 PM
> To: Zhigang Gong; beignet@lists.freedesktop.org
> Subject: Re: [Beignet] [PATCH 3/3] add optimization for local copy propagation
>
> Yes, there will be a penalty for the case in your example. I read several
> documents on local copy propagation, and none mentioned this case. :(
>
> As for the method of iterating only the new instructions/registers, it would
> require adding a 'new' flag during the GenIR-to-SelectionIR period. Since the
> current implementation is large, I think that might not be so good; there is
> a high possibility of missing something.
>
> I prefer the liveness method; it provides much information that could
> possibly be used in other, later optimizations. I'll check whether
> ir::Liveness can be reused or a new liveness analysis needs to be designed
> for the selection IR.

We already have the liveness information in the gen backend stage. Please
check the following code in backend/context.cpp:

  Context::Context(const ir::Unit &unit, const std::string &name) :
    unit(unit), fn(*unit.getFunction(name)), name(name),
    liveness(NULL), dag(NULL), useDWLabel(false)
  {
    GBE_ASSERT(unit.getPointerSize() == ir::POINTER_32_BITS);
    this->liveness = GBE_NEW(ir::Liveness, const_cast<ir::Function&>(fn), true);
    this->dag = GBE_NEW(ir::FunctionDAG, *this->liveness);
    // r0 (GEN_REG_SIZE) is always set by the HW and used at the end by EOT
    this->registerAllocator = NULL; //GBE_NEW(RegisterAllocator, GEN_REG_SIZE, 4*KB - GEN_REG_SIZE);
    this->scratchAllocator = NULL; //GBE_NEW(ScratchAllocator, 12*KB);
  }
With ctx.getLiveIn(bb) and ctx.getLiveOut(bb), you can easily get the live-in
or live-out set for a specified basic block. That should be good enough to
implement this optimization.

Thanks,
Zhigang Gong.

> -----Original Message-----
> From: Zhigang Gong [mailto:zhigang.g...@linux.intel.com]
> Sent: Monday, September 07, 2015 2:47 PM
> To: Guo, Yejun; beignet@lists.freedesktop.org
> Subject: RE: [Beignet] [PATCH 3/3] add optimization for local copy propagation
>
> Right, only the instructions created in the instruction selection stage can
> be optimized here. No need to iterate all the instructions, and no need to
> do multiple rounds of checking. Just one round, and only over those
> temporary registers that live only in this BB. If a register is in the
> live-out set, then we do not need to touch it.
>
> Please take a look at the following example
> (%r1 is in the live-in set, %r0 is in the live-out set, and the
> MOV %r0, %r1 is the last use of %r1):
>
>   MOV %r0, %r1
>   ...
>   ADD %r5, %r0, %r2
>
> With this optimization, it becomes:
>
>   MOV %r0, %r1
>   ...
>   ADD %r5, %r1, %r2
>
> Because %r0 is in the live-out set, the MOV instruction is not dead and
> will not be eliminated. Worse, %r1's liveness interval is incorrectly
> extended to the ADD instruction.
>
> In the original code, %r1 is not alive at the ADD instruction; now both
> %r1 and %r0 are alive after the MOV instruction. Consider a worst case:
> if there are many such MOVs, this optimization will not bring any
> improvement, but will increase register pressure dramatically.
>
> So my two major suggestions are:
>
> 1. Try to iterate the newly created instructions/registers only; this
>    could reduce most of the overhead.
> 2. Use the liveness information and don't touch any registers in the
>    live-out set. Actually, if you only iterate those newly created
>    instructions, you may avoid live-out registers naturally.
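To make the live-out guard concrete, here is a minimal, self-contained sketch of block-local copy propagation over a toy instruction form. All names here (Insn, propagateCopies) are ours for illustration, not beignet's selection IR types; the point is only the guard: a copy "MOV dst, src" is forwarded into later uses only when neither register is in the block's live-out set, which avoids the live-range extension in the MOV %r0, %r1 / ADD example above.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy three-address instruction: op, one destination, n sources.
struct Insn {
  std::string op;
  std::string dst;
  std::vector<std::string> srcs;
};

// Forward copies within one basic block, guarded by the live-out set.
// Returns true if any source operand was rewritten.
static bool propagateCopies(std::vector<Insn> &bb,
                            const std::set<std::string> &liveOut) {
  std::map<std::string, std::string> copies;  // dst -> src of eligible MOVs
  bool changed = false;
  for (Insn &insn : bb) {
    // Rewrite sources through the copy table.
    for (std::string &s : insn.srcs) {
      std::map<std::string, std::string>::iterator it = copies.find(s);
      if (it != copies.end()) { s = it->second; changed = true; }
    }
    // A redefinition invalidates every entry that mentions the register.
    for (std::map<std::string, std::string>::iterator it = copies.begin();
         it != copies.end();) {
      if (it->first == insn.dst || it->second == insn.dst)
        copies.erase(it++);
      else
        ++it;
    }
    // Record only MOVs whose registers stay local to this block.
    if (insn.op == "MOV" && !liveOut.count(insn.dst) &&
        !liveOut.count(insn.srcs[0]))
      copies[insn.dst] = insn.srcs[0];
  }
  return changed;
}
```

With t0 local to the block, "ADD r, t0, b" becomes "ADD r, a, b"; with t0 in live-out, the table entry is never created and nothing is touched.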
> > -----Original Message-----
> > From: Beignet [mailto:beignet-boun...@lists.freedesktop.org] On Behalf
> > Of Guo, Yejun
> > Sent: Monday, September 7, 2015 2:11 PM
> > To: Zhigang Gong; beignet@lists.freedesktop.org
> > Subject: Re: [Beignet] [PATCH 3/3] add optimization for local copy
> > propagation
> >
> > It is expected that there will be an improvement with the optimization,
> > since some instructions are removed.
> >
> > As mentioned in the commit log, this patch itself does not remove any
> > instruction; it modifies some instructions to make the removal possible.
> >
> > GenWriter::removeMOVs() did this work inside each basic block at the Gen
> > IR level. The same idea can be introduced at the Selection IR level, and
> > a global removeMOVs might also be needed for the Selection IR.
> >
> > Even though GenWriter::removeMOVs() has done this optimization at the
> > Gen IR level, it is also necessary to apply the same idea again at the
> > Selection IR level, since we introduce some extra instructions during
> > the GenIR-to-SelectionIR conversion and/or in other functions.
> >
> > Take the selection IR of the utest compiler_saturate_sub_uint8_t as an
> > example (see the instruction fragment in the commit log): we can see
> > that such an optimization at the selection IR level is also necessary.
> >
> > -----Original Message-----
> > From: Zhigang Gong [mailto:zhigang.g...@linux.intel.com]
> > Sent: Monday, September 07, 2015 1:43 PM
> > To: Guo, Yejun; beignet@lists.freedesktop.org
> > Subject: RE: [Beignet] [PATCH 3/3] add optimization for local copy
> > propagation
> >
> > Is there any evidence that this optimization brings an actual improvement?
> > I doubt it, because it doesn't remove any instruction.
> >
> > Actually, if %42 is not in the live-out set of the current BB, then the
> > MOV could be removed. Exactly the same optimization logic has been
> > implemented at the GEN IR stage; the function is
> > GenWriter::removeMOVs(), you can check it out.
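The removal step the thread keeps coming back to (a MOV becomes dead once its destination is neither read later in the block nor in the live-out set) can be sketched in a self-contained toy form. The names (Insn, removeDeadMovs) are ours, not beignet's; removeMOVs in GenWriter and the proposed Selection IR pass work on real IR, but the backward-liveness idea is the same.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Toy three-address instruction: op, one destination, n sources.
struct Insn {
  std::string op;
  std::string dst;
  std::vector<std::string> srcs;
};

// Remove MOVs that are dead within one basic block. Walks the block
// backwards keeping a "still needed" set seeded with live-out registers:
// a MOV whose destination is not yet needed can be dropped.
static size_t removeDeadMovs(std::vector<Insn> &bb,
                             const std::set<std::string> &liveOut) {
  std::set<std::string> needed(liveOut.begin(), liveOut.end());
  std::vector<Insn> kept;
  for (std::vector<Insn>::reverse_iterator it = bb.rbegin();
       it != bb.rend(); ++it) {
    if (it->op == "MOV" && !needed.count(it->dst))
      continue;  // dead copy: nothing downstream reads its destination
    // live = (live \ def) U uses
    needed.erase(it->dst);
    for (const std::string &s : it->srcs) needed.insert(s);
    kept.push_back(*it);
  }
  size_t removed = bb.size() - kept.size();
  bb.assign(kept.rbegin(), kept.rend());
  return removed;
}
```

After copy propagation has rewritten uses of the copy's destination back to its source, this pass is what actually shrinks the block.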
> > > > > -----Original Message----- > > > From: Beignet [mailto:beignet-boun...@lists.freedesktop.org] On > > > Behalf Of Guo Yejun > > > Sent: Monday, September 7, 2015 5:28 AM > > > To: beignet@lists.freedesktop.org > > > Cc: Guo Yejun > > > Subject: [Beignet] [PATCH 3/3] add optimization for local copy > > > propagation > > > > > > it is done at selection ir level, for instructions like: > > > MOV(8) %42<2>:UB : %53<32,8,4>:UB > > > ADD(8) %43<2>:B : %40<16,8,2>:B > > > -%42<16,8,2>:B > > > can be optimized as: > > > MOV(8) %42<2>:UB : %53<32,8,4>:UB > > > ADD(8) %43<2>:UB : %56<32,8,4>:UB > > > -%53<32,8,4>:UB > > > > > > the optimization is done for each basic block, we here can not > > > remove instruction "MOV %42 ..." since it is possible that %42 could > > > be used at other place in the same block or even in other blocks. We > > > need to add another path such as dead code elimination at global level to > remove it. > > > > > > Signed-off-by: Guo Yejun <yejun....@intel.com> > > > --- > > > .../src/backend/gen_insn_selection_optimize.cpp | 145 > > > +++++++++++++++++++++ > > > 1 file changed, 145 insertions(+) > > > > > > diff --git a/backend/src/backend/gen_insn_selection_optimize.cpp > > > b/backend/src/backend/gen_insn_selection_optimize.cpp > > > index c82fbe5..b196a4d 100644 > > > --- a/backend/src/backend/gen_insn_selection_optimize.cpp > > > +++ b/backend/src/backend/gen_insn_selection_optimize.cpp > > > @@ -20,6 +20,40 @@ namespace gbe > > > virtual void run() = 0; > > > virtual ~SelOptimizer() {} > > > protected: > > > + //we need info derived from execWdith, but it is not contained > > > + inside > > > GenRegister > > > + class ExpandedRegister > > > + { > > > + public: > > > + ExpandedRegister(const GenRegister& reg, uint32_t execWidth) : > > > genreg(reg) > > > + { > > > + elements = CalculateElements(reg, execWidth); > > > + } > > > + ~ExpandedRegister() {} > > > + static uint32_t CalculateElements(const GenRegister& reg, > > > + uint32_t > > 
> execWidth) > > > + { > > > + uint32_t elements = 0; > > > + uint32_t elementSize = typeSize(reg.type); > > > + uint32_t width = GenRegister::width_size(reg); > > > + assert(execWidth >= width); > > > + uint32_t height = execWidth / width; > > > + uint32_t vstride = GenRegister::vstride_size(reg); > > > + uint32_t hstride = GenRegister::hstride_size(reg); > > > + uint32_t base = reg.subnr; > > > + for (uint32_t i = 0; i < height; ++i) { > > > + uint32_t offsetInByte = base; > > > + for (uint32_t j = 0; j < width; ++j) { > > > + uint32_t offsetInType = offsetInByte / elementSize; > > > + elements |= (1 << offsetInType); //right for > > dest > > > register? > > > + offsetInByte += hstride * elementSize; > > > + } > > > + offsetInByte += vstride * elementSize; > > > + } > > > + return elements; > > > + } > > > + const GenRegister& genreg; > > > + uint32_t elements; > > > + }; > > > + > > > uint32_t features; > > > }; > > > > > > @@ -31,13 +65,124 @@ namespace gbe > > > virtual void run(); > > > > > > private: > > > + // local copy propagation > > > + typedef std::map<const ExpandedRegister*, const GenRegister*> > > > RegisterMap; > > > + static bool replaceWithCopytable(const RegisterMap& copytable, > > > uint32_t execWidth, GenRegister& var); > > > + static void addToCopytable(RegisterMap& copytable, uint32_t > > > execWidth, const GenRegister& src, const GenRegister& dst); > > > + static void removeFromCopytable(RegisterMap& copytable, const > > > GenRegister& var); > > > + static void cleanCopytable(RegisterMap& copytable); > > > + static void propagateRegister(GenRegister& dst, const > > > + GenRegister& > > > src); > > > + bool doLocalCopyPropagation(); > > > + > > > SelectionBlock &bb; > > > static const size_t MaxTries = 1; //the times for optimization > > > }; > > > > > > + void SelBasicBlockOptimizer::propagateRegister(GenRegister& dst, > > > + const GenRegister& src) { > > > + dst.type = src.type; > > > + dst.file = src.file; > > > + dst.physical = 
src.physical; > > > + dst.subphysical = src.subphysical; > > > + dst.value.reg = src.value.reg; > > > + dst.vstride = src.vstride; > > > + dst.width = src.width; > > > + dst.hstride = src.hstride; > > > + dst.quarter = src.quarter; > > > + dst.nr = src.nr; > > > + dst.subnr = src.subnr; > > > + dst.address_mode = src.address_mode; > > > + dst.a0_subnr = src.a0_subnr; > > > + dst.addr_imm = src.addr_imm; > > > + > > > + dst.negation = dst.negation ^ src.negation; > > > + dst.absolute = dst.absolute | src.absolute; } > > > + > > > + void SelBasicBlockOptimizer::cleanCopytable(RegisterMap& > > > + copytable) { > > > + for (RegisterMap::const_iterator pos = copytable.begin(); pos > > > + != > > > copytable.end(); ++pos) { > > > + const ExpandedRegister* key = pos->first; > > > + delete key; > > > + } > > > + copytable.clear(); > > > + } > > > + > > > + void SelBasicBlockOptimizer::removeFromCopytable(RegisterMap& > > > +copytable, const GenRegister& var) > > > + { > > > + for (RegisterMap::const_iterator pos = copytable.begin(); pos > > > +!= > > > copytable.end(); ) { > > > + if (pos->first->genreg.reg() == var.reg() || > > > + pos->second->reg() == > > > var.reg()) { > > > + const ExpandedRegister* key = pos->first; > > > + copytable.erase(pos++); > > > + delete key; > > > + } > > > + else > > > + ++pos; > > > + } > > > + } > > > + > > > + void SelBasicBlockOptimizer::addToCopytable(RegisterMap& > > > + copytable, uint32_t execWidth, const GenRegister& src, const > GenRegister& dst) { > > > + if (src.type != dst.type || src.file != dst.file) > > > + return; > > > + > > > + ExpandedRegister* expanded = new ExpandedRegister(dst, > > execWidth); > > > + copytable[expanded] = &src; > > > + } > > > + > > > + bool SelBasicBlockOptimizer::replaceWithCopytable(const > > > + RegisterMap& copytable, uint32_t execWidth, GenRegister& var) { > > > + bool replaced = false; > > > + for (RegisterMap::const_iterator pos = copytable.begin(); pos > > > + != > > > copytable.end(); 
++pos) { > > > + const ExpandedRegister* key = pos->first; > > > + if (key->genreg.reg() == var.reg()) { > > > + if (key->genreg.type == var.type) { > > > + uint32_t elements = > > > + ExpandedRegister::CalculateElements(var, > > > execWidth); > > > + if (key->elements == elements) { > > > + propagateRegister(var, *(pos->second)); > > > + replaced = true; > > > + } > > > + } > > > + break; > > > + } > > > + } > > > + return replaced; > > > + } > > > + > > > + bool SelBasicBlockOptimizer::doLocalCopyPropagation() > > > + { > > > + bool optimized = false; > > > + RegisterMap copytable; > > > + for (SelectionInstruction &insn : bb.insnList) { > > > + uint32_t execWidth = insn.state.execWidth; > > > + if (insn.opcode != SEL_OP_BSWAP && !insn.isWrite()) { > > > + for (uint8_t i = 0; i < insn.srcNum; ++i) { > > > + bool replaced = replaceWithCopytable(copytable, > > > + execWidth, > > > insn.src(i)); > > > + optimized = optimized || replaced; > > > + } > > > + } > > > + > > > + for (uint8_t i = 0; i < insn.dstNum; ++i) > > > + removeFromCopytable(copytable, insn.dst(i)); > > > + > > > + if (insn.opcode == SEL_OP_MOV) > > > + addToCopytable(copytable, execWidth, insn.src(0), insn.dst(0)); > > > + } > > > + cleanCopytable(copytable); > > > + return optimized; > > > + } > > > + > > > void SelBasicBlockOptimizer::run() > > > { > > > + for (size_t i = 0; i < MaxTries; ++i) { > > > + bool again = false; > > > + again = again || doLocalCopyPropagation(); > > > + > > > + //again = again || doOtherLocalOptimization(); > > > > > > + if (!again) > > > + break; //break since no optimization found at this round > > > + } > > > } > > > > > > class SelGlobalOptimizer : public SelOptimizer > > > -- > > > 1.9.1 > > > > > > _______________________________________________ > > > Beignet mailing list > > > Beignet@lists.freedesktop.org > > > http://lists.freedesktop.org/mailman/listinfo/beignet > > > > _______________________________________________ > > Beignet mailing list > > 
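The element-mask idea in the patch's CalculateElements can be exercised standalone. Below is our own sketch (Region and calculateElements are our names, not beignet's): for a Gen-style region <vstride; width; hstride>, it marks which elements of the register the region touches as a bitmask. Note one detail worth checking in the posted patch: the vstride increment there is applied to offsetInByte, which is reset to base at the top of each row, so rows after the first appear to start at the wrong offset; the sketch below accumulates the row base instead.

```cpp
#include <cassert>
#include <cstdint>

// A Gen-style register region, in the spirit of the patch (our own names).
struct Region {
  uint32_t base;      // starting offset in bytes (cf. reg.subnr)
  uint32_t vstride;   // vertical stride between rows, in elements
  uint32_t width;     // elements read per row
  uint32_t hstride;   // horizontal stride within a row, in elements
  uint32_t elemSize;  // size of the element type, in bytes
};

// Return a bitmask with bit i set iff element i of the register is
// touched when reading execWidth channels through the region.
static uint32_t calculateElements(const Region &r, uint32_t execWidth) {
  assert(execWidth >= r.width);
  uint32_t elements = 0;
  const uint32_t height = execWidth / r.width;
  uint32_t rowBase = r.base;  // advances by vstride per row
  for (uint32_t i = 0; i < height; ++i) {
    uint32_t offsetInByte = rowBase;
    for (uint32_t j = 0; j < r.width; ++j) {
      elements |= 1u << (offsetInByte / r.elemSize);
      offsetInByte += r.hstride * r.elemSize;
    }
    rowBase += r.vstride * r.elemSize;  // next row starts vstride further on
  }
  return elements;
}
```

For a packed byte region <8;8,1> at SIMD8 the mask is 0xFF; for <8;4,2> at SIMD8 it is the alternating mask 0x5555, matching how the copy-table lookup compares two operands' footprints before propagating.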