Re: [Beignet] [PATCH] GBE: implement pre-register-allocation instruction scheduling.
I carefully read the implementation and it looks good. However, according to Zhigang's experiments, some cases suffer a performance regression, and he thinks the main problem still lies in the inaccurate latency model. The pre-schedule may also leave the post-schedule unable to re-schedule because of physical register dependencies. We need to tune the instruction scheduler further to achieve a good balance between ILP and register pressure. I think we can merge the patch but keep the pre-schedule disabled, and tune the instruction scheduling later.

Thanks!
Ruiling

> -----Original Message-----
> From: Beignet [mailto:beignet-boun...@lists.freedesktop.org] On Behalf Of Zhigang Gong
> Sent: Wednesday, September 16, 2015 10:46 AM
> To: beignet@lists.freedesktop.org
> Cc: Gong, Zhigang
> Subject: [Beignet] [PATCH] GBE: implement pre-register-allocation instruction scheduling.
> [...]
Re: [Beignet] [PATCH] GBE: implement pre-register-allocation instruction scheduling.
Ping for review.

On Wed, Sep 16, 2015 at 10:46:12AM +0800, Zhigang Gong wrote:
> [...]
[Beignet] [PATCH] GBE: implement pre-register-allocation instruction scheduling.
Finding an instruction scheduling policy that achieves the theoretical minimum number of registers required in a basic block is an NP-complete problem, so we have to use heuristics to simplify the algorithm. Much research indicates that bottom-up list scheduling is much better than the top-down method in terms of register pressure. I chose one such research paper as our reference:

"Register-Sensitive Selection, Duplication, and Sequencing of Instructions"

It uses bottom-up list scheduling with a Sethi-Ullman label as the heuristic number. As we will do cycle-aware scheduling after register allocation, we don't need to bother with a cycle-related heuristic here, so I just skipped the EST (earliest start time) computation and its usage in the algorithm.

It turns out this algorithm works well. It reduces the register spills in clBlas's sgemmBlock kernel from 83+ to only 20.

Although this scheduling method may lower ILP (instruction-level parallelism), that is not a big issue: we will allocate as many different registers as possible in the following register allocation stage, and the after-allocation instruction scheduling will then try to recover as much ILP as possible.

Signed-off-by: Zhigang Gong
---
 backend/src/backend/gen_insn_scheduling.cpp | 137 +++-
 1 file changed, 116 insertions(+), 21 deletions(-)

diff --git a/backend/src/backend/gen_insn_scheduling.cpp b/backend/src/backend/gen_insn_scheduling.cpp
index 358a2ce..f4f1e70 100644
--- a/backend/src/backend/gen_insn_scheduling.cpp
+++ b/backend/src/backend/gen_insn_scheduling.cpp
@@ -41,26 +41,29 @@
  * ==
  *
  * We try to limit the register pressure.
- * Well, this is a hard problem and we have a decent strategy now that we called
- * "zero cycled LIFO scheduling".
- * We use a local forward list scheduling and we schedule the instructions in a
- * LIFO order i.e. as a stack. Basically, we take the most recent instruction
- * and schedule it right away. Obviously we ignore completely the real latencies
- * and throuputs and just simulate instructions that are issued and completed in
- * zero cycle. For the complex kernels we already have (like menger sponge),
- * this provides a pretty good strategy enabling SIMD16 code generation where
- * when scheduling is deactivated, even SIMD8 fails
  *
- * One may argue that this strategy is bad, latency wise. This is not true since
- * the register allocator will anyway try to burn as many registers as possible.
- * So, there is still opportunities to schedule after register allocation.
+ * To find out an instruction scheduling policy to achieve the theoretical minimum
+ * registers required in a basic block is a NP problem. We have to use some heuristic
+ * factor to simplify the algorithm. There are many researchs which indicate a
+ * bottom-up list scheduling is much better than the top-down method in turns of
+ * register pressure. I choose one of such research paper as our target. The paper
+ * is as below:
  *
- * Our idea seems to work decently. There is however a strong research article
- * that is able to near-optimally reschudle the instructions to minimize
- * register use. This is:
+ * "Register-Sensitive Selection, Duplication, and Sequencing of Instructions"
+ * It use the bottom-up list scheduling with a Sethi-Ullman label as an
+ * heuristic number. As we will do cycle awareness scheduling after the register
+ * allocation, we don't need to bother with cycle related heuristic number here.
+ * I just skipped the EST computing and usage part in the algorithm.
  *
- * "Minimum Register Instruction Sequence Problem: Revisiting Optimal Code
- * Generation for DAGs"
+ * It turns out this algorithm works well. It could reduce the register spilling
+ * in clBlas's sgemmBlock kernel from 83+ to only 20.
+ *
+ * Although this scheduling method seems to be lowering the ILP(instruction level parallism).
+ * It's not a big issue, because we will allocate as much as possible different registers
+ * in the following register allocation stage, and we will do a after allocation
+ * instruction scheduling which will try to get as much ILP as possible.
+ *
+ * FIXME: we only need to do this scheduling when a BB is really under high register pressure.
  *
  * After the register allocation
  * ==
@@ -114,7 +117,7 @@ namespace gbe
    struct ScheduleDAGNode
    {
      INLINE ScheduleDAGNode(SelectionInstruction &insn) :
-       insn(insn), refNum(0), retiredCycle(0), preRetired(false), readDistance(0x7fff) {}
+       insn(insn), refNum(0), depNum(0), retiredCycle(0), preRetired(false), readDistance(0x7fff) {}
      bool dependsOn(ScheduleDAGNode *node) const {
        GBE_ASSERT(node != NULL);
        for (auto child : node->children)
@@ -128,6 +131,10 @@ namespace gbe
      SelectionInstruction &insn;
      /*! Number of nodes that point to us (i.e. nodes we dep
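For readers not familiar with the paper, the Sethi-Ullman label the patch relies on can be sketched on a plain expression tree. This is a minimal, hypothetical illustration, not Beignet's code: the real scheduler labels a DAG of SelectionInstructions, and Node/suLabel are invented names.

```cpp
#include <algorithm>

// Hypothetical expression-tree node; only the labeling idea is shown here.
struct Node {
  Node *left = nullptr, *right = nullptr;
  int label = 0; // Sethi-Ullman number: registers needed for this subtree
};

// Classic Sethi-Ullman labeling: a leaf needs one register; an inner node
// needs the larger child's count when the children differ (evaluate the
// hungrier side first), and one extra register when they are equal.
int suLabel(Node *n) {
  if (n->left == nullptr && n->right == nullptr)
    return n->label = 1;
  int l = n->left ? suLabel(n->left) : 0;
  int r = n->right ? suLabel(n->right) : 0;
  n->label = (l == r) ? l + 1 : std::max(l, r);
  return n->label;
}
```

For (a+b)+(c+d) the root label is 3, while the right-leaning chain a+(b+(c+d)) only needs 2; that difference is exactly the register-pressure signal the bottom-up scheduler uses as its priority.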
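The scheduling loop the commit message describes, bottom-up list scheduling driven by those labels, can be sketched as follows. This is only an illustration under assumed names (DagNode, ops, label), not Beignet's actual ScheduleDAGNode machinery, and it ignores memory and flag dependencies that a real dependence DAG must also carry.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical DAG node; "ops" lists the nodes producing our operands.
struct DagNode {
  std::vector<int> ops; // indices of producer nodes
  int label = 1;        // Sethi-Ullman-style priority for this node
};

// Walk the basic block from the bottom up: a node becomes ready once every
// consumer of its result has been placed; among ready nodes, pick the one
// with the highest label. The reversed pick order is the program order.
std::vector<int> scheduleBottomUp(const std::vector<DagNode> &nodes) {
  std::vector<int> consumers(nodes.size(), 0);
  for (const DagNode &n : nodes)
    for (int p : n.ops)
      ++consumers[p];

  std::vector<int> ready; // nodes whose results are not needed further down
  for (int i = 0; i < (int)nodes.size(); ++i)
    if (consumers[i] == 0)
      ready.push_back(i);

  std::vector<int> picked; // built bottom-up, i.e. last instruction first
  while (!ready.empty()) {
    auto it = std::max_element(ready.begin(), ready.end(),
        [&](int a, int b) { return nodes[a].label < nodes[b].label; });
    int n = *it;
    ready.erase(it);
    picked.push_back(n);
    for (int p : nodes[n].ops)
      if (--consumers[p] == 0) // last consumer placed: producer is ready
        ready.push_back(p);
  }
  std::reverse(picked.begin(), picked.end());
  return picked; // top-down program order
}
```

Because the heuristic is purely label-driven, no cycle or latency estimate (the EST the patch skips) is needed here; latency is left to the after-allocation pass.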