Re: [Beignet] [PATCH] GBE: implement pre-register-allocation instruction scheduling.

2015-11-10 Thread Song, Ruiling
I carefully read the implementation, and it looks good.
But according to Zhigang's experiments, some cases suffer a performance
downgrade. He thought the main problem still lies in the inaccurate latency
model, and that the pre-schedule pass may prevent the post-schedule pass from
re-scheduling because of physical register dependencies.
We need to tune the instruction scheduler further to achieve a good balance of
ILP and register pressure.
I think we can merge the patch but keep the pre-schedule pass disabled, and
tune the instruction scheduler later.

Thanks!
Ruiling
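[Editor's note] A rough illustration of the bottom-up list scheduling being discussed, as a simplified, hypothetical sketch — not Beignet's actual implementation; the `Insn` type, the `label` field, and the smallest-label tie-breaking rule are assumptions for illustration only:

```cpp
#include <algorithm>
#include <vector>

// Hypothetical DAG node: 'succs' lists the instructions that consume this
// one's result; 'label' is a precomputed Sethi-Ullman-style priority.
struct Insn {
  int id;
  int label;
  std::vector<int> succs;
  int unscheduledSuccs;  // successors not yet scheduled (filled in below)
};

// Bottom-up list scheduling: start from the DAG roots (instructions with no
// consumers), repeatedly emit a ready instruction, then reverse the emission
// order to obtain the final top-down schedule. Picking the ready node with
// the smallest label here means that, after reversal, the register-hungry
// subtree is evaluated first -- the classic Sethi-Ullman rule.
std::vector<int> scheduleBottomUp(std::vector<Insn> &dag) {
  for (auto &n : dag) n.unscheduledSuccs = (int)n.succs.size();
  std::vector<int> ready, order;
  for (auto &n : dag)
    if (n.unscheduledSuccs == 0) ready.push_back(n.id);  // DAG roots
  while (!ready.empty()) {
    auto it = std::min_element(ready.begin(), ready.end(),
        [&](int a, int b) { return dag[a].label < dag[b].label; });
    const int pick = *it;
    ready.erase(it);
    order.push_back(pick);
    // A producer becomes ready once every one of its consumers is scheduled.
    for (auto &n : dag)
      for (int s : n.succs)
        if (s == pick && --n.unscheduledSuccs == 0)
          ready.push_back(n.id);
  }
  std::reverse(order.begin(), order.end());
  return order;
}
```

The real selection heuristic in the paper (and in the patch) is more involved; this only shows the bottom-up direction and the label-driven choice, which is why a post-allocation pass is still needed for latency hiding.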

> -----Original Message-----
> From: Beignet [mailto:beignet-boun...@lists.freedesktop.org] On Behalf Of
> Zhigang Gong
> Sent: Wednesday, September 16, 2015 10:46 AM
> To: beignet@lists.freedesktop.org
> Cc: Gong, Zhigang
> Subject: [Beignet] [PATCH] GBE: implement pre-register-allocation instruction
> scheduling.

Re: [Beignet] [PATCH] GBE: implement pre-register-allocation instruction scheduling.

2015-09-22 Thread Zhigang Gong
Ping for review.

On Wed, Sep 16, 2015 at 10:46:12AM +0800, Zhigang Gong wrote:

[Beignet] [PATCH] GBE: implement pre-register-allocation instruction scheduling.

2015-09-15 Thread Zhigang Gong
Finding an instruction scheduling policy that achieves the theoretical minimum
number of registers required in a basic block is an NP-hard problem, so we have
to use heuristic factors to simplify the algorithm. Much research indicates
that bottom-up list scheduling is much better than the top-down method in terms
of register pressure. I chose one such research paper as our reference:

"Register-Sensitive Selection, Duplication, and Sequencing of Instructions"

It uses bottom-up list scheduling with a Sethi-Ullman label as the heuristic
number. As we will do cycle-aware scheduling after register allocation, we
don't need to bother with a cycle-related heuristic number here, so I simply
skipped the EST computation and its usage in the algorithm.

It turns out this algorithm works well: it reduces the register spills in
clBlas's sgemmBlock kernel from 83+ to only 20.

Although this scheduling method may lower the ILP (instruction-level
parallelism), that is not a big issue, because we will allocate as many
different registers as possible in the following register allocation stage,
and we will do an after-allocation instruction scheduling pass that tries to
recover as much ILP as possible.
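[Editor's note] The Sethi-Ullman label mentioned above can be computed as in the minimal sketch below, for a pure expression tree; the paper generalizes this to DAGs, and the `Node` type here is a hypothetical stand-in, not a Beignet type:

```cpp
#include <algorithm>

// Hypothetical expression-tree node; 'label' is the Sethi-Ullman number:
// the minimum count of registers needed to evaluate this subtree.
struct Node {
  Node *left = nullptr, *right = nullptr;
  int label = 0;
};

// Classic Sethi-Ullman labeling: a leaf needs one register; for an inner
// node, if the children tie we need one extra register to hold the first
// result, otherwise evaluating the bigger side first lets us get away
// with max(l, r) registers.
int suLabel(Node *n) {
  if (n == nullptr) return 0;
  if (n->left == nullptr && n->right == nullptr) { n->label = 1; return 1; }
  const int l = suLabel(n->left);
  const int r = suLabel(n->right);
  n->label = (l == r) ? l + 1 : std::max(l, r);
  return n->label;
}
```

Using this label as the scheduling priority is what biases the bottom-up list scheduler toward low register pressure.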

Signed-off-by: Zhigang Gong 
---
 backend/src/backend/gen_insn_scheduling.cpp | 137 +++-
 1 file changed, 116 insertions(+), 21 deletions(-)

diff --git a/backend/src/backend/gen_insn_scheduling.cpp b/backend/src/backend/gen_insn_scheduling.cpp
index 358a2ce..f4f1e70 100644
--- a/backend/src/backend/gen_insn_scheduling.cpp
+++ b/backend/src/backend/gen_insn_scheduling.cpp
@@ -41,26 +41,29 @@
  * ==
  *
  * We try to limit the register pressure.
- * Well, this is a hard problem and we have a decent strategy now that we called
- * "zero cycled LIFO scheduling".
- * We use a local forward list scheduling and we schedule the instructions in a
- * LIFO order i.e. as a stack. Basically, we take the most recent instruction
- * and schedule it right away. Obviously we ignore completely the real latencies
- * and throuputs and just simulate instructions that are issued and completed in
- * zero cycle. For the complex kernels we already have (like menger sponge),
- * this provides a pretty good strategy enabling SIMD16 code generation where
- * when scheduling is deactivated, even SIMD8 fails
  *
- * One may argue that this strategy is bad, latency wise. This is not true since
- * the register allocator will anyway try to burn as many registers as possible.
- * So, there is still opportunities to schedule after register allocation.
+ * To find out an instruction scheduling policy to achieve the theoretical minimum
+ * registers required in a basic block is a NP problem. We have to use some heuristic
+ * factor to simplify the algorithm. There are many researchs which indicate a
+ * bottom-up list scheduling is much better than the top-down method in turns of
+ * register pressure.  I choose one of such research paper as our target. The paper
+ * is as below:
  *
- * Our idea seems to work decently. There is however a strong research article
- * that is able to near-optimally reschudle the instructions to minimize
- * register use. This is:
+ * "Register-Sensitive Selection, Duplication, and Sequencing of Instructions"
+ * It use the bottom-up list scheduling with a Sethi-Ullman label as an
+ * heuristic number. As we will do cycle awareness scheduling after the register
+ * allocation, we don't need to bother with cycle related heuristic number here.
+ * I just skipped the EST computing and usage part in the algorithm.
  *
- * "Minimum Register Instruction Sequence Problem: Revisiting Optimal Code
- *  Generation for DAGs"
+ * It turns out this algorithm works well. It could reduce the register spilling
+ * in clBlas's sgemmBlock kernel from 83+ to only 20.
+ *
+ * Although this scheduling method seems to be lowering the ILP(instruction level parallism).
+ * It's not a big issue, because we will allocate as much as possible different registers
+ * in the following register allocation stage, and we will do a after allocation
+ * instruction scheduling which will try to get as much ILP as possible.
+ *
+ * FIXME: we only need to do this scheduling when a BB is really under high register pressure.
  *
  * After the register allocation
  * ==
@@ -114,7 +117,7 @@ namespace gbe
   struct ScheduleDAGNode
   {
 INLINE ScheduleDAGNode(SelectionInstruction &insn) :
-  insn(insn), refNum(0), retiredCycle(0), preRetired(false), readDistance(0x7fff) {}
+  insn(insn), refNum(0), depNum(0), retiredCycle(0), preRetired(false), readDistance(0x7fff) {}
 bool dependsOn(ScheduleDAGNode *node) const {
   GBE_ASSERT(node != NULL);
   for (auto child : node->children)
@@ -128,6 +131,10 @@ namespace gbe
 SelectionInstruction &insn;
 /*! Number of nodes that point to us (i.e. nodes we dep