Re: [Beignet] [PATCH] CMAKE: Use DRM_INTEL_LIBDIR for CHECK_LIBRARY_EXISTS path

2016-06-30 Thread Zhigang Gong
This patch LGTM, thanks.

On Thu, Jun 30, 2016 at 2:48 PM, Xiuli Pan  wrote:

> From: Pan Xiuli 
>
> We check libdrm-intel with pkg-config, but CHECK_LIBRARY_EXISTS may
> search for the library in a different path, so pass it the path we will
> actually use.
>
> Signed-off-by: Pan Xiuli 
> ---
>  CMakeLists.txt | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/CMakeLists.txt b/CMakeLists.txt
> index fae3e88..569d109 100644
> --- a/CMakeLists.txt
> +++ b/CMakeLists.txt
> @@ -135,19 +135,19 @@ pkg_check_modules(DRM_INTEL libdrm_intel>=2.4.52)
>  IF(DRM_INTEL_FOUND)
>INCLUDE_DIRECTORIES(${DRM_INTEL_INCLUDE_DIRS})
>MESSAGE(STATUS "Looking for DRM Intel - found at ${DRM_INTEL_PREFIX}
> ${DRM_INTEL_VERSION}")
> -  CHECK_LIBRARY_EXISTS(drm_intel "drm_intel_bo_alloc_userptr" ""
> HAVE_DRM_INTEL_USERPTR)
> +  CHECK_LIBRARY_EXISTS(drm_intel "drm_intel_bo_alloc_userptr"
> ${DRM_INTEL_LIBDIR} HAVE_DRM_INTEL_USERPTR)
>IF(HAVE_DRM_INTEL_USERPTR)
>  MESSAGE(STATUS "Enable userptr support")
>ELSE(HAVE_DRM_INTEL_USERPTR)
>  MESSAGE(STATUS "Disable userptr support")
>ENDIF(HAVE_DRM_INTEL_USERPTR)
> -  CHECK_LIBRARY_EXISTS(drm_intel "drm_intel_get_eu_total" ""
> HAVE_DRM_INTEL_EU_TOTAL)
> +  CHECK_LIBRARY_EXISTS(drm_intel "drm_intel_get_eu_total"
> ${DRM_INTEL_LIBDIR} HAVE_DRM_INTEL_EU_TOTAL)
>IF(HAVE_DRM_INTEL_EU_TOTAL)
>  MESSAGE(STATUS "Enable EU total query support")
>ELSE(HAVE_DRM_INTEL_EU_TOTAL)
>  MESSAGE(STATUS "Disable EU total query support")
>ENDIF(HAVE_DRM_INTEL_EU_TOTAL)
> -  CHECK_LIBRARY_EXISTS(drm_intel "drm_intel_get_subslice_total" ""
> HAVE_DRM_INTEL_SUBSLICE_TOTAL)
> +  CHECK_LIBRARY_EXISTS(drm_intel "drm_intel_get_subslice_total"
> ${DRM_INTEL_LIBDIR} HAVE_DRM_INTEL_SUBSLICE_TOTAL)
>IF(HAVE_DRM_INTEL_SUBSLICE_TOTAL)
>  MESSAGE(STATUS "Enable subslice total query support")
>ELSE(HAVE_DRM_INTEL_SUBSLICE_TOTAL)
> --
> 2.5.0
>
> ___
> Beignet mailing list
> Beignet@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/beignet
>


[Beignet] [PATCH 2/2] update android version.

2016-06-23 Thread Zhigang Gong
Signed-off-by: Zhigang Gong 
---
 src/Android.mk | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/src/Android.mk b/src/Android.mk
index 3d9102e..9b63f7e 100644
--- a/src/Android.mk
+++ b/src/Android.mk
@@ -7,7 +7,7 @@ include $(LOCAL_PATH)/../Android.common.mk
 ocl_config_file = $(LOCAL_PATH)/OCLConfig.h
 $(shell echo "// the configured options and settings for LIBCL" > 
$(ocl_config_file))
 $(shell echo "#define LIBCL_DRIVER_VERSION_MAJOR 1" >> $(ocl_config_file))
-$(shell echo "#define LIBCL_DRIVER_VERSION_MINOR 1" >> $(ocl_config_file))
+$(shell echo "#define LIBCL_DRIVER_VERSION_MINOR 2" >> $(ocl_config_file))
 $(shell echo "#define LIBCL_C_VERSION_MAJOR 1" >> $(ocl_config_file))
 $(shell echo "#define LIBCL_C_VERSION_MINOR 2" >> $(ocl_config_file))
 
-- 
2.7.4



[Beignet] [PATCH 1/2] Remove nonexisting unit test cases in Android.mk.

2016-06-23 Thread Zhigang Gong
Signed-off-by: Zhigang Gong 
---
 utests/Android.mk | 2 --
 1 file changed, 2 deletions(-)

diff --git a/utests/Android.mk b/utests/Android.mk
index 963b698..63dba2a 100644
--- a/utests/Android.mk
+++ b/utests/Android.mk
@@ -175,8 +175,6 @@ LOCAL_SRC_FILES:= \
   compiler_private_const.cpp \
   compiler_private_data_overflow.cpp \
   compiler_getelementptr_bitcast.cpp \
-  compiler_sub_group_any.cpp \
-  compiler_sub_group_all.cpp \
   compiler_time_stamp.cpp \
   compiler_double_precision.cpp \
   load_program_from_gen_bin.cpp \
-- 
2.7.4



Re: [Beignet] [PATCH] GBE: Optimize extraLiveOut register info.

2016-06-05 Thread Zhigang Gong
This patch LGTM, thx.

On Thu, Jun 2, 2016 at 10:21 AM, Ruiling Song 
wrote:

> extraLiveOut analysis is used to detect registers defined in a loop and
> used outside the loop. The previous logic could also mark registers defined
> BEFORE the loop and live through it as extraLiveOut, which is not accurate.
> Since extraLiveOut registers are forced to be non-uniform, this may waste
> some register space. Excluding registers that are defined before the loop
> while used outside it lets such registers keep the correct
> uniform/non-uniform information. This benefits some gemm CL kernels.
>
> Also fix a small issue: do the intersection when the exit block has more
> than one predecessor.
>
> Signed-off-by: Ruiling Song 
> ---
>  backend/src/ir/function.cpp   |  4 ++--
>  backend/src/ir/function.hpp   |  7 ---
>  backend/src/ir/liveness.cpp   | 25 +---
>  backend/src/llvm/llvm_gen_backend.cpp | 36
> ++-
>  4 files changed, 34 insertions(+), 38 deletions(-)
>
> diff --git a/backend/src/ir/function.cpp b/backend/src/ir/function.cpp
> index 3b7891b..4112f06 100644
> --- a/backend/src/ir/function.cpp
> +++ b/backend/src/ir/function.cpp
> @@ -62,8 +62,8 @@ namespace ir {
>  return unit.getPointerFamily();
>}
>
> -  void Function::addLoop(const vector<LabelIndex> &bbs, const
> vector<std::pair<LabelIndex, LabelIndex>> &exits) {
> -loops.push_back(GBE_NEW(Loop, bbs, exits));
> +  void Function::addLoop(LabelIndex preheader, const vector<LabelIndex>
> &bbs, const vector<std::pair<LabelIndex, LabelIndex>> &exits) {
> +loops.push_back(GBE_NEW(Loop, preheader, bbs, exits));
>}
>
>void Function::checkEmptyLabels(void) {
> diff --git a/backend/src/ir/function.hpp b/backend/src/ir/function.hpp
> index 5785bee..6a90767 100644
> --- a/backend/src/ir/function.hpp
> +++ b/backend/src/ir/function.hpp
> @@ -273,8 +273,9 @@ namespace ir {
>struct Loop : public NonCopyable
>{
>public:
> -Loop(const vector<LabelIndex> &in, const vector<std::pair<LabelIndex,
> LabelIndex>> &exit) :
> -bbs(in), exits(exit) {}
> +Loop(LabelIndex pre, const vector<LabelIndex> &in, const
> vector<std::pair<LabelIndex, LabelIndex>> &exit) :
> +preheader(pre), bbs(in), exits(exit) {}
> +LabelIndex preheader;
>  vector<LabelIndex> bbs;
>  vector<std::pair<LabelIndex, LabelIndex>> exits;
>  GBE_STRUCT(Loop);
> @@ -522,7 +523,7 @@ namespace ir {
>  /*! Push stack size. */
>  INLINE void pushStackSize(uint32_t step) { this->stackSize += step; }
>  /*! add the loop info for later liveness analysis */
> -void addLoop(const vector<LabelIndex> &bbs, const
> vector<std::pair<LabelIndex, LabelIndex>> &exits);
> +void addLoop(LabelIndex preheader, const vector<LabelIndex> &bbs,
> const vector<std::pair<LabelIndex, LabelIndex>> &exits);
>  INLINE const vector<Loop *> &getLoops() { return loops; }
>  vector<BasicBlock *> &getBlocks() { return blocks; }
>  /*! Get surface starting address register from bti */
> diff --git a/backend/src/ir/liveness.cpp b/backend/src/ir/liveness.cpp
> index d48f067..4a89a73 100644
> --- a/backend/src/ir/liveness.cpp
> +++ b/backend/src/ir/liveness.cpp
> @@ -217,19 +217,38 @@ namespace ir {
>  if(loops.size() == 0) return;
>
>  for (auto l : loops) {
> +  const BasicBlock &preheader = fn.getBlock(l->preheader);
> +  BlockInfo *preheaderInfo = liveness[&preheader];
>for (auto x : l->exits) {
>  const BasicBlock &a = fn.getBlock(x.first);
>  const BasicBlock &b = fn.getBlock(x.second);
>  BlockInfo * exiting = liveness[&a];
>  BlockInfo * exit = liveness[&b];
>  std::vector<Register> toExtend;
> +std::vector<Register> toExtendCand;
>
> -if(b.getPredecessorSet().size() > 1) {
> +if(b.getPredecessorSet().size() <= 1) {
> +  // the exits only have one predecessor
>for (auto p : exit->upwardUsed)
> -toExtend.push_back(p);
> +toExtendCand.push_back(p);
>  } else {
> -  std::set_intersection(exiting->liveOut.begin(),
> exiting->liveOut.end(), exit->upwardUsed.begin(), exit->upwardUsed.end(),
> std::back_inserter(toExtend));
> +  // the exits have more than one predecessors
> +  std::set_intersection(exiting->liveOut.begin(),
> +exiting->liveOut.end(),
> +exit->upwardUsed.begin(),
> +exit->upwardUsed.end(),
> +std::back_inserter(toExtendCand));
>  }
> +// toExtendCand may contain some virtual register defined before
> loop,
> +// which need to be excluded. Because what we need is registers
> defined
> +// in the loop. Such kind of registers must be in live-out of the
> loop's
> +// preheader. So we do the subtraction here.
> +std::set_difference(toExtendCand.begin(),
> +toExtendCand.end(),
> +preheaderInfo->liveOut.begin(),
> +preheaderInfo->liveOut.end(),
> +std::back_inserter(toExtend));
> +
>  if (toExtend.size() == 0) continue;
>  for(auto r : toExtend)
>

Re: [Beignet] [Patch V2 07/10] GBE: disable the read byte as DW.

2016-05-26 Thread Zhigang Gong
The whole patchset LGTM, thanks.

On Thu, May 26, 2016 at 12:24 PM, Yang Rong  wrote:

> From: Zhigang Gong 
>
> Signed-off-by: Zhigang Gong 
> ---
>  backend/src/backend/gen_insn_selection.cpp | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/backend/src/backend/gen_insn_selection.cpp
> b/backend/src/backend/gen_insn_selection.cpp
> index 07901a6..fb45b4f 100644
> --- a/backend/src/backend/gen_insn_selection.cpp
> +++ b/backend/src/backend/gen_insn_selection.cpp
> @@ -2349,7 +2349,7 @@ extern bool OCL_DEBUGINFO; // first defined by
> calling BVAR in program.cpp
>  this->opaque->setHasLongType(true);
>  this->opaque->setHasDoubleType(true);
>  this->opaque->setLongRegRestrict(true);
> -this->opaque->setSlowByteGather(true);
> +this->opaque->setSlowByteGather(false);
>  this->opaque->setHasHalfType(true);
>  opt_features = SIOF_LOGICAL_SRCMOD | SIOF_OP_MOV_LONG_REG_RESTRICT;
>}
> --
> 2.1.4
>


Re: [Beignet] [PATCH 07/10] Android: disable the read byte as DW.

2016-05-25 Thread Zhigang Gong
This patch should not be an Android-specific patch; based on my testing
results we should set slow byte gathering to false for the CHV platform.

Thanks,
Zhigang Gong.


On Thu, May 19, 2016 at 4:37 PM, Yang Rong  wrote:

> From: Zhigang Gong 
>
> Signed-off-by: Zhigang Gong 
> ---
>  backend/src/backend/gen_insn_selection.cpp | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/backend/src/backend/gen_insn_selection.cpp
> b/backend/src/backend/gen_insn_selection.cpp
> index 07901a6..a48f95f 100644
> --- a/backend/src/backend/gen_insn_selection.cpp
> +++ b/backend/src/backend/gen_insn_selection.cpp
> @@ -2349,7 +2349,11 @@ extern bool OCL_DEBUGINFO; // first defined by
> calling BVAR in program.cpp
>  this->opaque->setHasLongType(true);
>  this->opaque->setHasDoubleType(true);
>  this->opaque->setLongRegRestrict(true);
> +#if defined(__ANDROID__)
> +this->opaque->setSlowByteGather(false);
> +#else
>  this->opaque->setSlowByteGather(true);
> +#endif
>  this->opaque->setHasHalfType(true);
>  opt_features = SIOF_LOGICAL_SRCMOD | SIOF_OP_MOV_LONG_REG_RESTRICT;
>}
> --
> 2.1.4
>


[Beignet] [PATCH] Refine custom unrolling policy.

2016-03-03 Thread Zhigang Gong
We should use the product of the current trip count and the parent trip
count to determine whether we should unroll the parent loop.

Signed-off-by: Zhigang Gong 
---
 backend/src/llvm/llvm_unroll.cpp | 21 -
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/backend/src/llvm/llvm_unroll.cpp b/backend/src/llvm/llvm_unroll.cpp
index 0f62bdc..a289c11 100644
--- a/backend/src/llvm/llvm_unroll.cpp
+++ b/backend/src/llvm/llvm_unroll.cpp
@@ -176,6 +176,12 @@ namespace gbe {
 if (ExitBlock)
   currTripCount = SE->getSmallConstantTripCount(L, ExitBlock);
 
+if (currTripCount > 32) {
+  shouldUnroll = false;
+  setUnrollID(currL, false);
+  return shouldUnroll;
+}
+
 while(currL) {
   Loop *parentL = currL->getParentLoop();
   unsigned parentTripCount = 0;
@@ -187,20 +193,17 @@ namespace gbe {
 if (parentExitBlock)
   parentTripCount = SE->getSmallConstantTripCount(parentL, 
parentExitBlock);
   }
-  if ((parentTripCount != 0 && currTripCount / parentTripCount > 16) ||
-  (currTripCount > 32)) {
-if (currL == L)
-  shouldUnroll = false;
-setUnrollID(currL, false);
-if (currL != L)
+  if (parentTripCount != 0 && currTripCount * parentTripCount > 32) {
+setUnrollID(parentL, false);
 #if LLVM_VERSION_MAJOR == 3 &&  LLVM_VERSION_MINOR >= 8
-  loopInfo.markAsRemoved(currL);
+loopInfo.markAsRemoved(parentL);
 #else
-  LPM.deleteLoopFromQueue(currL);
+LPM.deleteLoopFromQueue(parentL);
 #endif
+return shouldUnroll;
   }
   currL = parentL;
-  currTripCount = parentTripCount;
+  currTripCount = parentTripCount * currTripCount;
 }
 return shouldUnroll;
   }
-- 
2.1.4



[Beignet] [PATCH] Revert "GBE: disable mad for some cases."

2016-02-18 Thread Zhigang Gong
This reverts commit d73170df3508d18e250d0af118e3b7955401194f.
Actually, MAD should always be faster when we can use it to replace the
original multiply + add. So let's revert this patch.

Signed-off-by: Zhigang Gong 
---
 backend/src/backend/gen_insn_selection.cpp | 14 +-
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/backend/src/backend/gen_insn_selection.cpp 
b/backend/src/backend/gen_insn_selection.cpp
index 001a3c5..9225294 100644
--- a/backend/src/backend/gen_insn_selection.cpp
+++ b/backend/src/backend/gen_insn_selection.cpp
@@ -3111,7 +3111,7 @@ extern bool OCL_DEBUGINFO; // first defined by calling 
BVAR in program.cpp
 
   // XXX TODO: we need a clean support of FP_CONTRACT to remove below line 
'return false'
   // if 'pragma FP_CONTRACT OFF' is used in cl kernel, we should not do 
mad optimization.
-  if (!sel.ctx.relaxMath || sel.ctx.getSimdWidth() == 16)
+  if (!sel.ctx.relaxMath)
 return false;
   // MAD tend to increase liveness of the sources (since there are three of
   // them). TODO refine this strategy. Well, we should be able at least to
@@ -3129,12 +3129,6 @@ extern bool OCL_DEBUGINFO; // first defined by calling 
BVAR in program.cpp
   const GenRegister dst = sel.selReg(insn.getDst(0), TYPE_FLOAT);
   if (child0 && child0->insn.getOpcode() == OP_MUL) {
GBE_ASSERT(cast<ir::BinaryInstruction>(child0->insn).getType() == 
TYPE_FLOAT);
-SelectionDAG *child00 = child0->child[0];
-SelectionDAG *child01 = child0->child[1];
-if ((child00 && child00->insn.getOpcode() == OP_LOADI) ||
-(child01 && child01->insn.getOpcode() == OP_LOADI) ||
-(child1 && child1->insn.getOpcode() == OP_LOADI))
-  return false;
 const GenRegister src0 = sel.selReg(child0->insn.getSrc(0), 
TYPE_FLOAT);
 const GenRegister src1 = sel.selReg(child0->insn.getSrc(1), 
TYPE_FLOAT);
 GenRegister src2 = sel.selReg(insn.getSrc(1), TYPE_FLOAT);
@@ -3147,12 +3141,6 @@ extern bool OCL_DEBUGINFO; // first defined by calling 
BVAR in program.cpp
   }
   if (child1 && child1->insn.getOpcode() == OP_MUL) {
GBE_ASSERT(cast<ir::BinaryInstruction>(child1->insn).getType() == 
TYPE_FLOAT);
-SelectionDAG *child10 = child1->child[0];
-SelectionDAG *child11 = child1->child[1];
-if ((child10 && child10->insn.getOpcode() == OP_LOADI) ||
-(child11 && child11->insn.getOpcode() == OP_LOADI) ||
-(child0 && child0->insn.getOpcode() == OP_LOADI))
-  return false;
 GenRegister src0 = sel.selReg(child1->insn.getSrc(0), TYPE_FLOAT);
 const GenRegister src1 = sel.selReg(child1->insn.getSrc(1), 
TYPE_FLOAT);
 const GenRegister src2 = sel.selReg(insn.getSrc(0), TYPE_FLOAT);
-- 
2.5.0



Re: [Beignet] [PATCH] Runtime: double built-ins are not completely supported, so disable fp64 by default.

2015-12-22 Thread Zhigang Gong
This patch LGTM.

Thanks,
Zhigang Gong.

> -Original Message-
> From: Beignet [mailto:beignet-boun...@lists.freedesktop.org] On Behalf Of
> Yang Rong
> Sent: Tuesday, December 22, 2015 4:37 PM
> To: beignet@lists.freedesktop.org
> Cc: Yang Rong 
> Subject: [Beignet] [PATCH] Runtime: double built-ins are not completely
> supported, so disable fp64 by default.
> 
> Add a cmake option for it; configuring with -DEXPERIMENTAL_DOUBLE=true
> enables it.
> ---
>  CMakeLists.txt |  6 ++
>  src/cl_device_id.c | 21 +
>  2 files changed, 27 insertions(+)
> 
> diff --git a/CMakeLists.txt b/CMakeLists.txt index 8762f7c..97725ca 100644
> --- a/CMakeLists.txt
> +++ b/CMakeLists.txt
> @@ -219,11 +219,17 @@ ENDIF(OCLIcd_FOUND)
> 
>  Find_Package(PythonInterp)
> 
> +OPTION(EXPERIMENTAL_DOUBLE "Enable experimental double support" OFF)
> +IF(EXPERIMENTAL_DOUBLE)
> +  ADD_DEFINITIONS(-DENABLE_FP64)
> +ENDIF(EXPERIMENTAL_DOUBLE)
> +
>  OPTION(BUILD_EXAMPLES "Build examples" OFF)
>  IF(BUILD_EXAMPLES)
>  IF(NOT X11_FOUND)
>MESSAGE(FATAL_ERROR "XLib is necessary for examples - not found")
> ENDIF(NOT X11_FOUND)
> +
>  # libva & libva-x11
>  #pkg_check_modules(LIBVA REQUIRED libva>=0.36.0)
> pkg_check_modules(LIBVA REQUIRED libva) diff --git a/src/cl_device_id.c
> b/src/cl_device_id.c index a98523f..c01e3d4 100644
> --- a/src/cl_device_id.c
> +++ b/src/cl_device_id.c
> @@ -418,7 +418,9 @@ brw_gt1_break:
>intel_brw_gt1_device.platform = cl_get_platform_default();
>ret = &intel_brw_gt1_device;
>cl_intel_platform_get_default_extension(ret);
> +#ifdef ENABLE_FP64
>cl_intel_platform_enable_extension(ret, cl_khr_fp64_ext_id);
> +#endif
>cl_intel_platform_enable_extension(ret, cl_khr_fp16_ext_id);
>break;
> 
> @@ -437,7 +439,9 @@ brw_gt2_break:
>intel_brw_gt2_device.platform = cl_get_platform_default();
>ret = &intel_brw_gt2_device;
>cl_intel_platform_get_default_extension(ret);
> +#ifdef ENABLE_FP64
>cl_intel_platform_enable_extension(ret, cl_khr_fp64_ext_id);
> +#endif
>cl_intel_platform_enable_extension(ret, cl_khr_fp16_ext_id);
>break;
> 
> @@ -458,7 +462,9 @@ brw_gt3_break:
>intel_brw_gt3_device.platform = cl_get_platform_default();
>ret = &intel_brw_gt3_device;
>cl_intel_platform_get_default_extension(ret);
> +#ifdef ENABLE_FP64
>cl_intel_platform_enable_extension(ret, cl_khr_fp64_ext_id);
> +#endif
>cl_intel_platform_enable_extension(ret, cl_khr_fp16_ext_id);
>break;
> 
> @@ -472,6 +478,9 @@ chv_break:
>intel_chv_device.platform = cl_get_platform_default();
>ret = &intel_chv_device;
>cl_intel_platform_get_default_extension(ret);
> +#ifdef ENABLE_FP64
> +  cl_intel_platform_enable_extension(ret, cl_khr_fp64_ext_id);
> +#endif
>cl_intel_platform_enable_extension(ret, cl_khr_fp16_ext_id);
>break;
> 
> @@ -490,6 +499,9 @@ skl_gt1_break:
>intel_skl_gt1_device.device_id = device_id;
>intel_skl_gt1_device.platform = cl_get_platform_default();
>ret = &intel_skl_gt1_device;
> +#ifdef ENABLE_FP64
> +  cl_intel_platform_enable_extension(ret, cl_khr_fp64_ext_id);
> +#endif
>cl_intel_platform_get_default_extension(ret);
>cl_intel_platform_enable_extension(ret, cl_khr_fp16_ext_id);
>break;
> @@ -510,6 +522,9 @@ skl_gt2_break:
>intel_skl_gt2_device.device_id = device_id;
>intel_skl_gt2_device.platform = cl_get_platform_default();
>ret = &intel_skl_gt2_device;
> +#ifdef ENABLE_FP64
> +  cl_intel_platform_enable_extension(ret, cl_khr_fp64_ext_id);
> +#endif
>cl_intel_platform_get_default_extension(ret);
>cl_intel_platform_enable_extension(ret, cl_khr_fp16_ext_id);
>break;
> @@ -525,6 +540,9 @@ skl_gt3_break:
>intel_skl_gt3_device.platform = cl_get_platform_default();
>ret = &intel_skl_gt3_device;
>cl_intel_platform_get_default_extension(ret);
> +#ifdef ENABLE_FP64
> +  cl_intel_platform_enable_extension(ret, cl_khr_fp64_ext_id);
> +#endif
>cl_intel_platform_enable_extension(ret, cl_khr_fp16_ext_id);
>break;
> 
> @@ -536,6 +554,9 @@ skl_gt4_break:
>intel_skl_gt4_device.device_id = device_id;
>intel_skl_gt4_device.platform = cl_get_platform_default();
>ret = &intel_skl_gt4_device;
> +#ifdef ENABLE_FP64
> +  cl_intel_platform_enable_extension(ret, cl_khr_fp64_ext_id);
> +#endif
>cl_intel_platform_get_default_extension(ret);
>cl_intel_platform_enable_extension(ret, cl_khr_fp16_ext_id);
>break;
> --
> 2.1.4
> 



Re: [Beignet] [PATCH] runtime: silent some error messages.

2015-11-18 Thread Zhigang Gong
Using something like DEBUG_PRINT is better. We need to clean up the whole
library to fix this type of thing.

Thanks,
Zhigang Gong

On Tue, Nov 17, 2015 at 10:11 AM, Song, Ruiling 
wrote:

> Hi Zhigang,
>
> Directly removing the output messages may not be proper.
> The output messages are very useful for debugging when some application
> runs into this kind of issue.
> Although we return the error code to the application, it is still hard to
> know exactly what's wrong if the application just receives a
> CL_OUT_OF_RESOURCES.
> Maybe we can add a simple DEBUG_PRINT() macro to print some messages under
> debug mode.
> What do you think?
>
> Thanks!
> Ruiling
>
> > -Original Message-
> > From: Beignet [mailto:beignet-boun...@lists.freedesktop.org] On Behalf
> Of
> > Zhigang Gong
> > Sent: Friday, November 13, 2015 7:32 AM
> > To: beignet@lists.freedesktop.org
> > Cc: Gong, Zhigang 
> > Subject: [Beignet] [PATCH] runtime: silent some error messages.
> >
> > We already set the corresponding error code and return it to the caller.
> > Don't bother to print the error messages inside beignet.
> >
> > Signed-off-by: Zhigang Gong 
> > ---
> >  src/cl_command_queue_gen7.c | 3 ---
> >  1 file changed, 3 deletions(-)
> >
> > diff --git a/src/cl_command_queue_gen7.c
> > b/src/cl_command_queue_gen7.c
> > index f0ee20a..96b23fb 100644
> > --- a/src/cl_command_queue_gen7.c
> > +++ b/src/cl_command_queue_gen7.c
> > @@ -329,21 +329,18 @@
> > cl_command_queue_ND_range_gen7(cl_command_queue queue,
> >
> >/* Compute the number of HW threads we need */
> >if(UNLIKELY(err = cl_kernel_work_group_sz(ker, local_wk_sz, 3,
> > &local_sz) != CL_SUCCESS)) {
> > -fprintf(stderr, "Beignet: Work group size exceed Kernel's work group
> > size.\n");
> >  return err;
> >}
> >kernel.thread_n = thread_n = (local_sz + simd_sz - 1) / simd_sz;
> >kernel.curbe_sz = cst_sz;
> >
> >if (scratch_sz > ker->program->ctx->device->scratch_mem_size) {
> > -fprintf(stderr, "Beignet: Out of scratch memory %d.\n", scratch_sz);
> >  return CL_OUT_OF_RESOURCES;
> >}
> >/* Curbe step 1: fill the constant urb buffer data shared by all
> threads */
> >if (ker->curbe) {
> >  kernel.slm_sz = cl_curbe_fill(ker, work_dim, global_wk_off,
> global_wk_sz,
> > local_wk_sz, thread_n);
> >  if (kernel.slm_sz > ker->program->ctx->device->local_mem_size) {
> > -  fprintf(stderr, "Beignet: Out of shared local memory %d.\n",
> > kernel.slm_sz);
> >return CL_OUT_OF_RESOURCES;
> >  }
> >}
> > --
> > 1.9.1
> >


[Beignet] [PATCH] runtime: silent some error messages.

2015-11-12 Thread Zhigang Gong
We already set the corresponding error code and return it to the caller.
Don't bother to print the error messages inside beignet.

Signed-off-by: Zhigang Gong 
---
 src/cl_command_queue_gen7.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/src/cl_command_queue_gen7.c b/src/cl_command_queue_gen7.c
index f0ee20a..96b23fb 100644
--- a/src/cl_command_queue_gen7.c
+++ b/src/cl_command_queue_gen7.c
@@ -329,21 +329,18 @@ cl_command_queue_ND_range_gen7(cl_command_queue queue,
 
   /* Compute the number of HW threads we need */
   if(UNLIKELY(err = cl_kernel_work_group_sz(ker, local_wk_sz, 3, &local_sz) != 
CL_SUCCESS)) {
-fprintf(stderr, "Beignet: Work group size exceed Kernel's work group 
size.\n");
 return err;
   }
   kernel.thread_n = thread_n = (local_sz + simd_sz - 1) / simd_sz;
   kernel.curbe_sz = cst_sz;
 
   if (scratch_sz > ker->program->ctx->device->scratch_mem_size) {
-fprintf(stderr, "Beignet: Out of scratch memory %d.\n", scratch_sz);
 return CL_OUT_OF_RESOURCES;
   }
   /* Curbe step 1: fill the constant urb buffer data shared by all threads */
   if (ker->curbe) {
 kernel.slm_sz = cl_curbe_fill(ker, work_dim, global_wk_off, global_wk_sz, 
local_wk_sz, thread_n);
 if (kernel.slm_sz > ker->program->ctx->device->local_mem_size) {
-  fprintf(stderr, "Beignet: Out of shared local memory %d.\n", 
kernel.slm_sz);
   return CL_OUT_OF_RESOURCES;
 }
   }
-- 
1.9.1



Re: [Beignet] [PATCH 4/5] runtime: set CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE to kernel's SIMD_WIDTH.

2015-11-12 Thread Zhigang Gong
On Thu, Nov 12, 2015 at 04:47:04PM +0800, Zhigang Gong wrote:
> It makes sense to set CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE to the
> corresponding SIMD size. That provides a way for Intel OCL applications
> to get the SIMD width at runtime and makes SIMD-width-dependent
> optimizations possible.
> 
> Signed-off-by: Zhigang Gong 
> ---
>  src/cl_api.c|  3 ++-
>  src/cl_command_queue_gen7.c |  2 +-
>  src/cl_device_id.c  | 11 ++-
>  src/cl_device_id.h  |  2 --
>  src/cl_gt_device.h  |  1 -
>  5 files changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/src/cl_api.c b/src/cl_api.c
> index a18bc99..64206eb 100644
> --- a/src/cl_api.c
> +++ b/src/cl_api.c
> @@ -3001,6 +3001,7 @@ clEnqueueNDRangeKernel(cl_command_queue  command_queue,
>  err = cl_command_queue_flush(command_queue);
>}
>  
> +error:
>if(b_output_kernel_perf)
>{
>  if(kernel->program->build_opts != NULL)
> @@ -3008,7 +3009,7 @@ clEnqueueNDRangeKernel(cl_command_queue  command_queue,
>  else
>time_end(command_queue->ctx, cl_kernel_get_name(kernel), "", 
> command_queue);
>}
> -error:
> +

The above change fixes a deadlock that occurs when kernel performance
measurement is enabled and cl_command_queue_ND_range() runs into an error.
Forgot to mention it in the commit log.

Thanks,
Zhigang Gong.

>return err;
>  }
>  
> diff --git a/src/cl_command_queue_gen7.c b/src/cl_command_queue_gen7.c
> index 2edc3be..f0ee20a 100644
> --- a/src/cl_command_queue_gen7.c
> +++ b/src/cl_command_queue_gen7.c
> @@ -329,7 +329,7 @@ cl_command_queue_ND_range_gen7(cl_command_queue queue,
>  
>/* Compute the number of HW threads we need */
>if(UNLIKELY(err = cl_kernel_work_group_sz(ker, local_wk_sz, 3, &local_sz) 
> != CL_SUCCESS)) {
> -fprintf(stderr, "Beignet: Work group size exceed Kerne's work group 
> size.\n");
> +fprintf(stderr, "Beignet: Work group size exceed Kernel's work group 
> size.\n");
>  return err;
>}
>kernel.thread_n = thread_n = (local_sz + simd_sz - 1) / simd_sz;
> diff --git a/src/cl_device_id.c b/src/cl_device_id.c
> index 4551aa8..8186ac8 100644
> --- a/src/cl_device_id.c
> +++ b/src/cl_device_id.c
> @@ -966,7 +966,16 @@ cl_get_kernel_workgroup_info(cl_kernel kernel,
>  return CL_SUCCESS;
>}
>  }
> -DECL_FIELD(PREFERRED_WORK_GROUP_SIZE_MULTIPLE, 
> device->preferred_wg_sz_mul)
> +case CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE:
> +{
> +  if (param_value && param_value_size < sizeof(size_t))
> +return CL_INVALID_VALUE;
> +  if (param_value_size_ret != NULL)
> +*param_value_size_ret = sizeof(size_t);
> +  if (param_value)
> +*(size_t*)param_value = interp_kernel_get_simd_width(kernel->opaque);
> +  return CL_SUCCESS;
> +}
>  case CL_KERNEL_LOCAL_MEM_SIZE:
>  {
>size_t local_mem_sz =  interp_kernel_get_slm_size(kernel->opaque) + 
> kernel->local_mem_sz;
> diff --git a/src/cl_device_id.h b/src/cl_device_id.h
> index 4a923ef..c5f9e57 100644
> --- a/src/cl_device_id.h
> +++ b/src/cl_device_id.h
> @@ -108,8 +108,6 @@ struct _cl_device_id {
>size_t driver_version_sz;
>size_t spir_versions_sz;
>size_t built_in_kernels_sz;
> -  /* Kernel specific info that we're assigning statically */
> -  size_t preferred_wg_sz_mul;
>/* SubDevice specific info */
>cl_device_id parent_device;
>cl_uint  partition_max_sub_device;
> diff --git a/src/cl_gt_device.h b/src/cl_gt_device.h
> index de7a636..12987b7 100644
> --- a/src/cl_gt_device.h
> +++ b/src/cl_gt_device.h
> @@ -39,7 +39,6 @@
>  .native_vector_width_float = 4,
>  .native_vector_width_double = 2,
>  .native_vector_width_half = 8,
> -.preferred_wg_sz_mul = 16,
>  .address_bits = 32,
>  .max_mem_alloc_size = 512 * 1024 * 1024,
>  .image_support = CL_TRUE,
> -- 
> 1.9.1
> 


[Beignet] [PATCH 1/5] GBE: extend register allocator size/offset to 32bit.

2015-11-12 Thread Zhigang Gong
Because the range of scratch sizes exceeds int16_t's maximum,
we have to extend these elements to 32 bits.

Signed-off-by: Zhigang Gong 
---
 backend/src/backend/context.cpp | 52 -
 backend/src/backend/context.hpp |  6 ++---
 2 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/backend/src/backend/context.cpp b/backend/src/backend/context.cpp
index a02771a..51d643e 100644
--- a/backend/src/backend/context.cpp
+++ b/backend/src/backend/context.cpp
@@ -38,27 +38,27 @@ namespace gbe
   class SimpleAllocator
   {
   public:
-SimpleAllocator(int16_t startOffset, int16_t size, bool _assertFail);
+SimpleAllocator(int32_t startOffset, int32_t size, bool _assertFail);
 ~SimpleAllocator(void);
 
 /*! Allocate some memory from the pool.
  */
-int16_t allocate(int16_t size, int16_t alignment, bool bFwd=false);
+int32_t allocate(int32_t size, int32_t alignment, bool bFwd=false);
 
 /*! Free the given register file piece */
-void deallocate(int16_t offset);
+void deallocate(int32_t offset);
 
 /*! Spilt a block into 2 blocks */
-void splitBlock(int16_t offset, int16_t subOffset);
+void splitBlock(int32_t offset, int32_t subOffset);
 
   protected:
 /*! Double chained list of free spaces */
 struct Block {
-  Block(int16_t offset, int16_t size) :
+  Block(int32_t offset, int32_t size) :
 prev(NULL), next(NULL), offset(offset), size(size) {}
   Block *prev, *next; //!< Previous and next free blocks
-  int16_t offset;//!< Where the free block starts
-  int16_t size;  //!< Size of the free block
+  int32_t offset;//!< Where the free block starts
+  int32_t size;  //!< Size of the free block
 };
 
 /*! Try to coalesce two blocks (left and right). They must be in that 
order.
@@ -66,7 +66,7 @@ namespace gbe
  */
 void coalesce(Block *left, Block *right);
 /*! the maximum offset */
-int16_t maxOffset;
+int32_t maxOffset;
 /*! whether trigger an assertion on allocation failure */
 bool assertFail;
 /*! Head and tail of the free list */
@@ -75,7 +75,7 @@ namespace gbe
 /*! Handle free list element allocation */
 DECL_POOL(Block, blockPool);
 /*! Track allocated memory blocks  */
-map<int16_t, int16_t> allocatedBlocks;
+map<int32_t, int32_t> allocatedBlocks;
 /*! Use custom allocators */
 GBE_CLASS(SimpleAllocator);
   };
@@ -90,7 +90,7 @@ namespace gbe
 
   class RegisterAllocator: public SimpleAllocator {
   public:
-RegisterAllocator(int16_t offset, int16_t size): SimpleAllocator(offset, 
size, false) {}
+RegisterAllocator(int32_t offset, int32_t size): SimpleAllocator(offset, 
size, false) {}
 
 GBE_CLASS(RegisterAllocator);
   };
@@ -102,14 +102,14 @@ namespace gbe
 
   class ScratchAllocator: public SimpleAllocator {
   public:
-ScratchAllocator(int16_t size): SimpleAllocator(0, size, true) {}
-int16_t getMaxScatchMemUsed() { return maxOffset; }
+ScratchAllocator(int32_t size): SimpleAllocator(0, size, true) {}
+int32_t getMaxScatchMemUsed() { return maxOffset; }
 
 GBE_CLASS(ScratchAllocator);
   };
 
-  SimpleAllocator::SimpleAllocator(int16_t startOffset,
-   int16_t size,
+  SimpleAllocator::SimpleAllocator(int32_t startOffset,
+   int32_t size,
bool _assertFail)
   : maxOffset(0),
   assertFail(_assertFail){
@@ -124,14 +124,14 @@ namespace gbe
 }
   }
 
-  int16_t SimpleAllocator::allocate(int16_t size, int16_t alignment, bool bFwd)
+  int32_t SimpleAllocator::allocate(int32_t size, int32_t alignment, bool bFwd)
   {
 // Make it simple and just use the first block we find
 Block *list = bFwd ? head : tail;
 while (list) {
-  int16_t aligned;
-  int16_t spaceOnLeft;
-  int16_t spaceOnRight;
+  int32_t aligned;
+  int32_t spaceOnLeft;
+  int32_t spaceOnRight;
   if(bFwd) {
 aligned = ALIGN(list->offset, alignment);
 spaceOnLeft = aligned - list->offset;
@@ -143,7 +143,7 @@ namespace gbe
   continue;
 }
   } else {
-int16_t unaligned = list->offset + list->size - size - (alignment-1);
+int32_t unaligned = list->offset + list->size - size - (alignment-1);
 if(unaligned < 0) {
   list = list->prev;
   continue;
@@ -233,12 +233,12 @@ namespace gbe
 return 0;
   }
 
-  void SimpleAllocator::deallocate(int16_t offset)
+  void SimpleAllocator::deallocate(int32_t offset)
   {
 // Retrieve the size in the allocation map
 auto it = allocatedBlocks.find(offset);
 GBE_ASSERT(it != allocatedBlocks.end());
-const int16_t size = it->second;
+const int32_t size = it->second;
 
 // Find the two blocks where to insert the new block
 
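For context, the first-fit policy that these widened offsets feed can be sketched as a standalone fragment. This is a hypothetical, simplified version for illustration, not the actual gbe::SimpleAllocator; the real allocator also splits the chosen block, records the allocation, and supports backward scans.

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* ALIGN as used by the allocator: round x up to a multiple of a
 * (a must be a power of two). */
#define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

/* Doubly linked free block, mirroring the struct after the
 * int16_t -> int32_t widening in the patch. */
typedef struct Block {
  struct Block *prev, *next; /* previous and next free blocks */
  int32_t offset;            /* where the free block starts */
  int32_t size;              /* size of the free block */
} Block;

/* Forward first-fit scan: return an aligned offset inside the first
 * block that can hold `size` bytes, or -1 when no block fits. */
static int32_t first_fit(Block *head, int32_t size, int32_t alignment) {
  for (Block *b = head; b != NULL; b = b->next) {
    int32_t aligned = ALIGN(b->offset, alignment);
    int32_t space_on_left = aligned - b->offset;
    if (b->size - space_on_left >= size)
      return aligned; /* the real code splits the block here */
  }
  return -1; /* out of space */
}
```

For example, with a single free block at offset 3 of size 100, a 16-byte request with 8-byte alignment lands at offset 8 (5 bytes are wasted on the left for alignment).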

[Beignet] [PATCH 5/5] GBE: decrease the loop unrolling threshold to 640.

2015-11-12 Thread Zhigang Gong
1024 is somewhat too large for some kernels and may cause
some kernels to fail to build due to a lack of scratch
space.

Signed-off-by: Zhigang Gong 
---
 backend/src/llvm/llvm_to_gen.cpp | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/backend/src/llvm/llvm_to_gen.cpp b/backend/src/llvm/llvm_to_gen.cpp
index 24d4be7..551fd31 100644
--- a/backend/src/llvm/llvm_to_gen.cpp
+++ b/backend/src/llvm/llvm_to_gen.cpp
@@ -144,7 +144,7 @@ namespace gbe
 MPM.add(createIndVarSimplifyPass());// Canonicalize indvars
 MPM.add(createLoopIdiomPass()); // Recognize idioms like memset.
 MPM.add(createLoopDeletionPass());  // Delete dead loops
-MPM.add(createLoopUnrollPass(1024)); //1024, 32, 1024, 512)); //Unroll loops
+MPM.add(createLoopUnrollPass(640)); //1024, 32, 1024, 512)); //Unroll loops
 if(optLevel > 0) {
   MPM.add(createSROAPass(/*RequiresDomTree*/ false));
   MPM.add(createGVNPass()); // Remove redundancies
-- 
1.9.1

___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


[Beignet] [PATCH 3/5] GBE: remove useless assertions code.

2015-11-12 Thread Zhigang Gong
Signed-off-by: Zhigang Gong 
---
 backend/src/backend/context.cpp | 14 +-
 1 file changed, 5 insertions(+), 9 deletions(-)

diff --git a/backend/src/backend/context.cpp b/backend/src/backend/context.cpp
index 47d8a45..5f5a858 100644
--- a/backend/src/backend/context.cpp
+++ b/backend/src/backend/context.cpp
@@ -38,7 +38,7 @@ namespace gbe
   class SimpleAllocator
   {
   public:
-SimpleAllocator(int32_t startOffset, int32_t size, bool _assertFail);
+SimpleAllocator(int32_t startOffset, int32_t size);
 ~SimpleAllocator(void);
 
 /*! Allocate some memory from the pool.
@@ -67,8 +67,6 @@ namespace gbe
 void coalesce(Block *left, Block *right);
 /*! the maximum offset */
 int32_t maxOffset;
-/*! whether trigger an assertion on allocation failure */
-bool assertFail;
 /*! Head and tail of the free list */
 Block *head;
 Block *tail;
@@ -90,7 +88,7 @@ namespace gbe
 
   class RegisterAllocator: public SimpleAllocator {
   public:
-RegisterAllocator(int32_t offset, int32_t size): SimpleAllocator(offset, size, false) {}
+RegisterAllocator(int32_t offset, int32_t size): SimpleAllocator(offset, size) {}
 
 GBE_CLASS(RegisterAllocator);
   };
@@ -102,17 +100,15 @@ namespace gbe
 
   class ScratchAllocator: public SimpleAllocator {
   public:
-ScratchAllocator(int32_t size): SimpleAllocator(0, size, true) {}
+ScratchAllocator(int32_t size): SimpleAllocator(0, size) {}
 int32_t getMaxScatchMemUsed() { return maxOffset; }
 
 GBE_CLASS(ScratchAllocator);
   };
 
   SimpleAllocator::SimpleAllocator(int32_t startOffset,
-   int32_t size,
-   bool _assertFail)
-  : maxOffset(0),
-  assertFail(_assertFail){
+   int32_t size)
+  : maxOffset(0) {
 tail = head = this->newBlock(startOffset, size);
   }
 
-- 
1.9.1

___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


[Beignet] [PATCH 4/5] runtime: set CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE to kernel's SIMD_WIDTH.

2015-11-12 Thread Zhigang Gong
It makes sense to set CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE to the
corresponding SIMD size. This gives Intel OCL applications a way
to get the SIMD width at runtime and makes SIMD-width-dependent optimizations
possible.

Signed-off-by: Zhigang Gong 
---
 src/cl_api.c|  3 ++-
 src/cl_command_queue_gen7.c |  2 +-
 src/cl_device_id.c  | 11 ++-
 src/cl_device_id.h  |  2 --
 src/cl_gt_device.h  |  1 -
 5 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/src/cl_api.c b/src/cl_api.c
index a18bc99..64206eb 100644
--- a/src/cl_api.c
+++ b/src/cl_api.c
@@ -3001,6 +3001,7 @@ clEnqueueNDRangeKernel(cl_command_queue  command_queue,
 err = cl_command_queue_flush(command_queue);
   }
 
+error:
   if(b_output_kernel_perf)
   {
 if(kernel->program->build_opts != NULL)
@@ -3008,7 +3009,7 @@ clEnqueueNDRangeKernel(cl_command_queue  command_queue,
 else
   time_end(command_queue->ctx, cl_kernel_get_name(kernel), "", command_queue);
   }
-error:
+
   return err;
 }
 
diff --git a/src/cl_command_queue_gen7.c b/src/cl_command_queue_gen7.c
index 2edc3be..f0ee20a 100644
--- a/src/cl_command_queue_gen7.c
+++ b/src/cl_command_queue_gen7.c
@@ -329,7 +329,7 @@ cl_command_queue_ND_range_gen7(cl_command_queue queue,
 
   /* Compute the number of HW threads we need */
   if(UNLIKELY(err = cl_kernel_work_group_sz(ker, local_wk_sz, 3, &local_sz) != CL_SUCCESS)) {
-fprintf(stderr, "Beignet: Work group size exceed Kerne's work group size.\n");
+fprintf(stderr, "Beignet: Work group size exceed Kernel's work group size.\n");
 return err;
   }
   kernel.thread_n = thread_n = (local_sz + simd_sz - 1) / simd_sz;
diff --git a/src/cl_device_id.c b/src/cl_device_id.c
index 4551aa8..8186ac8 100644
--- a/src/cl_device_id.c
+++ b/src/cl_device_id.c
@@ -966,7 +966,16 @@ cl_get_kernel_workgroup_info(cl_kernel kernel,
 return CL_SUCCESS;
   }
 }
-DECL_FIELD(PREFERRED_WORK_GROUP_SIZE_MULTIPLE, device->preferred_wg_sz_mul)
+case CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE:
+{
+  if (param_value && param_value_size < sizeof(size_t))
+return CL_INVALID_VALUE;
+  if (param_value_size_ret != NULL)
+*param_value_size_ret = sizeof(size_t);
+  if (param_value)
+*(size_t*)param_value = interp_kernel_get_simd_width(kernel->opaque);
+  return CL_SUCCESS;
+}
 case CL_KERNEL_LOCAL_MEM_SIZE:
 {
   size_t local_mem_sz =  interp_kernel_get_slm_size(kernel->opaque) + kernel->local_mem_sz;
diff --git a/src/cl_device_id.h b/src/cl_device_id.h
index 4a923ef..c5f9e57 100644
--- a/src/cl_device_id.h
+++ b/src/cl_device_id.h
@@ -108,8 +108,6 @@ struct _cl_device_id {
   size_t driver_version_sz;
   size_t spir_versions_sz;
   size_t built_in_kernels_sz;
-  /* Kernel specific info that we're assigning statically */
-  size_t preferred_wg_sz_mul;
   /* SubDevice specific info */
   cl_device_id parent_device;
   cl_uint  partition_max_sub_device;
diff --git a/src/cl_gt_device.h b/src/cl_gt_device.h
index de7a636..12987b7 100644
--- a/src/cl_gt_device.h
+++ b/src/cl_gt_device.h
@@ -39,7 +39,6 @@
 .native_vector_width_float = 4,
 .native_vector_width_double = 2,
 .native_vector_width_half = 8,
-.preferred_wg_sz_mul = 16,
 .address_bits = 32,
 .max_mem_alloc_size = 512 * 1024 * 1024,
 .image_support = CL_TRUE,
-- 
1.9.1
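The open-coded case above follows the standard clGetKernelWorkGroupInfo parameter-validation pattern. A minimal standalone sketch of that pattern (hypothetical helper name and simplified error-code defines, not the Beignet source):

```c
#include <stddef.h>
#include <assert.h>

#define CL_SUCCESS        0
#define CL_INVALID_VALUE -30

/* Validate the caller's output buffer, report the required size, then
 * store the value -- the same three steps the new case statement
 * performs for CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE. */
static int return_size_t(size_t value, void *param_value,
                         size_t param_value_size,
                         size_t *param_value_size_ret) {
  if (param_value && param_value_size < sizeof(size_t))
    return CL_INVALID_VALUE;
  if (param_value_size_ret != NULL)
    *param_value_size_ret = sizeof(size_t);
  if (param_value)
    *(size_t *)param_value = value;
  return CL_SUCCESS;
}
```

A caller passing the kernel's SIMD width (8, 16, or 32) as `value` reproduces the behavior introduced by the patch.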

___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


[Beignet] [PATCH 2/5] GBE: don't assert even if we fail to compile kernel at the backend stage.

2015-11-12 Thread Zhigang Gong
We should not assert even if the application triggers an internal limitation
such as a lack of scratch space. We should return an error to the application
and let it make further decisions.

Signed-off-by: Zhigang Gong 
---
 backend/src/backend/context.cpp|  3 +--
 backend/src/backend/gen_context.hpp|  1 +
 backend/src/backend/gen_program.cpp|  2 +-
 backend/src/backend/gen_reg_allocation.cpp | 18 --
 backend/src/backend/program.cpp| 22 +++---
 5 files changed, 30 insertions(+), 16 deletions(-)

diff --git a/backend/src/backend/context.cpp b/backend/src/backend/context.cpp
index 51d643e..47d8a45 100644
--- a/backend/src/backend/context.cpp
+++ b/backend/src/backend/context.cpp
@@ -229,8 +229,7 @@ namespace gbe
   // We have a valid offset now
   return aligned;
 }
-GBE_ASSERT( !assertFail );
-return 0;
+return -1;
   }
 
   void SimpleAllocator::deallocate(int32_t offset)
diff --git a/backend/src/backend/gen_context.hpp b/backend/src/backend/gen_context.hpp
index 155b68e..c622236 100644
--- a/backend/src/backend/gen_context.hpp
+++ b/backend/src/backend/gen_context.hpp
@@ -49,6 +49,7 @@ namespace gbe
 REGISTER_ALLOCATION_FAIL,
 REGISTER_SPILL_EXCEED_THRESHOLD,
 REGISTER_SPILL_FAIL,
+REGISTER_SPILL_NO_SPACE,
 OUT_OF_RANGE_IF_ENDIF,
   } CompileErrorCode;
 
diff --git a/backend/src/backend/gen_program.cpp b/backend/src/backend/gen_program.cpp
index bb22542..a12ab39 100644
--- a/backend/src/backend/gen_program.cpp
+++ b/backend/src/backend/gen_program.cpp
@@ -197,7 +197,7 @@ namespace gbe {
 GBE_ASSERT(!(ctx->getErrCode() == OUT_OF_RANGE_IF_ENDIF && ctx->getIFENDIFFix()));
 }
 
-GBE_ASSERTM(kernel != NULL, "Fail to compile kernel, may need to increase reserved registers for spilling.");
+//GBE_ASSERTM(kernel != NULL, "Fail to compile kernel, may need to increase reserved registers for spilling.");
 return kernel;
 #else
 return NULL;
diff --git a/backend/src/backend/gen_reg_allocation.cpp b/backend/src/backend/gen_reg_allocation.cpp
index a9338c5..13c856e 100644
--- a/backend/src/backend/gen_reg_allocation.cpp
+++ b/backend/src/backend/gen_reg_allocation.cpp
@@ -192,7 +192,7 @@ namespace gbe
 INLINE bool spillReg(GenRegInterval interval, bool isAllocated = false);
 INLINE bool spillReg(ir::Register reg, bool isAllocated = false);
 INLINE bool vectorCanSpill(SelectionVector *vector);
-INLINE void allocateScratchForSpilled();
+INLINE bool allocateScratchForSpilled();
 void allocateCurbePayload(void);
 
 /*! replace specified source/dst register with temporary register and update interval */
@@ -788,7 +788,10 @@ namespace gbe
   return false;
 }
   }
-  allocateScratchForSpilled();
+  if (!allocateScratchForSpilled()) {
+ctx.errCode = REGISTER_SPILL_NO_SPACE;
+return false;
+  }
   bool success = selection.spillRegs(spilledRegs, reservedReg);
   if (!success) {
 ctx.errCode = REGISTER_SPILL_FAIL;
@@ -799,7 +802,7 @@ namespace gbe
 return true;
   }
 
-  INLINE void GenRegAllocator::Opaque::allocateScratchForSpilled()
+  INLINE bool GenRegAllocator::Opaque::allocateScratchForSpilled()
   {
 const uint32_t regNum = spilledRegs.size();
 this->starting.resize(regNum);
@@ -833,7 +836,10 @@ namespace gbe
   ir::RegisterFamily family = ctx.sel->getRegisterFamily(cur->reg);
   it->second.addr = ctx.allocateScratchMem(getFamilySize(family)
  * ctx.getSimdWidth());
-  }
+  if (it->second.addr == -1)
+return false;
+}
+return true;
   }
 
   INLINE bool GenRegAllocator::Opaque::expireReg(ir::Register reg)
@@ -1019,7 +1025,7 @@ namespace gbe
   INLINE uint32_t GenRegAllocator::Opaque::allocateReg(GenRegInterval interval,
uint32_t size,
uint32_t alignment) {
-uint32_t grfOffset;
+int32_t grfOffset;
 // Doing expireGRF too freqently will cause the post register allocation
 // scheduling very hard. As it will cause a very high register conflict rate.
 // The tradeoff here is to reduce the freqency here. And if we are under spilling
@@ -1032,7 +1038,7 @@ namespace gbe
 // and the source is a scalar Dword. If that is the case, the byte register
 // must get 4byte alignment register offset.
 alignment = (alignment + 3) & ~3;
-while ((grfOffset = ctx.allocate(size, alignment)) == 0) {
+while ((grfOffset = ctx.allocate(size, alignment)) == -1) {
   const bool success = this->expireGRF(interval);
   if (success == false) {
 if (spillAtInterval(interval, size, alignment) == false)
diff --git a/backend/src/backend/program.cpp b/backend/src/backend/program.cpp
index 472734b..15
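The intent of this patch — propagate an error code instead of asserting — can be sketched as follows. This is a simplified, hypothetical fragment; the real code threads the code through GenContext::errCode and a chain of bool returns.

```c
#include <stdint.h>
#include <assert.h>

enum { COMPILE_OK, REGISTER_SPILL_NO_SPACE };

/* Scratch allocation that reports failure with -1 instead of asserting,
 * matching the change to SimpleAllocator::allocate above. */
static int32_t alloc_scratch(int32_t *used, int32_t limit, int32_t size) {
  if (*used + size > limit)
    return -1; /* out of scratch space: no assert fires */
  int32_t addr = *used;
  *used += size;
  return addr;
}

/* The caller converts -1 into an error code the runtime can hand back
 * to the application, as allocateScratchForSpilled now does. */
static int spill_one(int32_t *used, int32_t limit, int32_t size, int *err) {
  if (alloc_scratch(used, limit, size) == -1) {
    *err = REGISTER_SPILL_NO_SPACE; /* report, let the app decide */
    return 0;
  }
  return 1;
}
```

With this shape, a kernel that exhausts scratch space fails its build cleanly instead of aborting the whole process.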

Re: [Beignet] [PATCH V2 1/4] drivers: change the buf size to size_t

2015-10-26 Thread Zhigang Gong
LGTM.

Just a reminder: please add something to the commit log
to indicate the version of this patch and what has been
changed since the previous version.

Thanks,
Zhigang Gong.

On Fri, Oct 23, 2015 at 01:22:56PM +0800, Pan Xiuli wrote:
> The uint32_t size is not enough for coming bigger
> gpu memory, now GEN9 support 4G buffer. Also add
> assertion for invalid size.
> 
> Signed-off-by: Pan Xiuli 
> ---
>  src/cl_driver.h |  2 +-
>  src/intel/intel_gpgpu.c | 19 +++
>  2 files changed, 12 insertions(+), 9 deletions(-)
> 
> diff --git a/src/cl_driver.h b/src/cl_driver.h
> index 1ab4dff..4ffca09 100644
> --- a/src/cl_driver.h
> +++ b/src/cl_driver.h
> @@ -138,7 +138,7 @@ typedef void (cl_gpgpu_sync_cb)(void*);
>  extern cl_gpgpu_sync_cb *cl_gpgpu_sync;
>  
>  /* Bind a regular unformatted buffer */
> -typedef void (cl_gpgpu_bind_buf_cb)(cl_gpgpu, cl_buffer, uint32_t offset, 
> uint32_t internal_offset, uint32_t size, uint8_t bti);
> +typedef void (cl_gpgpu_bind_buf_cb)(cl_gpgpu, cl_buffer, uint32_t offset, 
> uint32_t internal_offset, size_t size, uint8_t bti);
>  extern cl_gpgpu_bind_buf_cb *cl_gpgpu_bind_buf;
>  
>  /* bind samplers defined in both kernel and kernel args. */
> diff --git a/src/intel/intel_gpgpu.c b/src/intel/intel_gpgpu.c
> index 60d318a..e96bb95 100644
> --- a/src/intel/intel_gpgpu.c
> +++ b/src/intel/intel_gpgpu.c
> @@ -86,7 +86,7 @@ typedef void (intel_gpgpu_set_base_address_t)(intel_gpgpu_t 
> *gpgpu);
>  intel_gpgpu_set_base_address_t *intel_gpgpu_set_base_address = NULL;
>  
>  typedef void (intel_gpgpu_setup_bti_t)(intel_gpgpu_t *gpgpu, drm_intel_bo 
> *buf, uint32_t internal_offset,
> -   uint32_t size, unsigned char index, 
> uint32_t format);
> +   size_t size, unsigned char index, 
> uint32_t format);
>  intel_gpgpu_setup_bti_t *intel_gpgpu_setup_bti = NULL;
>  
>  
> @@ -1000,9 +1000,10 @@ intel_gpgpu_alloc_constant_buffer(intel_gpgpu_t 
> *gpgpu, uint32_t size, uint8_t b
>  
>  static void
>  intel_gpgpu_setup_bti_gen7(intel_gpgpu_t *gpgpu, drm_intel_bo *buf, uint32_t 
> internal_offset,
> -   uint32_t size, unsigned char index, 
> uint32_t format)
> +   size_t size, unsigned char index, 
> uint32_t format)
>  {
> -  uint32_t s = size - 1;
> +  assert(size <= (2ul<<30));
> +  size_t s = size - 1;
>surface_heap_t *heap = gpgpu->aux_buf.bo->virtual + 
> gpgpu->aux_offset.surface_heap_offset;
>gen7_surface_state_t *ss0 = (gen7_surface_state_t *) &heap->surface[index 
> * sizeof(gen7_surface_state_t)];
>memset(ss0, 0, sizeof(gen7_surface_state_t));
> @@ -1030,9 +1031,10 @@ intel_gpgpu_setup_bti_gen7(intel_gpgpu_t *gpgpu, 
> drm_intel_bo *buf, uint32_t int
>  
>  static void
>  intel_gpgpu_setup_bti_gen75(intel_gpgpu_t *gpgpu, drm_intel_bo *buf, 
> uint32_t internal_offset,
> -   uint32_t size, unsigned char index, 
> uint32_t format)
> +   size_t size, unsigned char index, 
> uint32_t format)
>  {
> -  uint32_t s = size - 1;
> +  assert(size <= (2ul<<30));
> +  size_t s = size - 1;
>surface_heap_t *heap = gpgpu->aux_buf.bo->virtual + 
> gpgpu->aux_offset.surface_heap_offset;
>gen7_surface_state_t *ss0 = (gen7_surface_state_t *) &heap->surface[index 
> * sizeof(gen7_surface_state_t)];
>memset(ss0, 0, sizeof(gen7_surface_state_t));
> @@ -1066,9 +1068,10 @@ intel_gpgpu_setup_bti_gen75(intel_gpgpu_t *gpgpu, 
> drm_intel_bo *buf, uint32_t in
>  
>  static void
>  intel_gpgpu_setup_bti_gen8(intel_gpgpu_t *gpgpu, drm_intel_bo *buf, uint32_t 
> internal_offset,
> -   uint32_t size, unsigned char index, 
> uint32_t format)
> +   size_t size, unsigned char index, 
> uint32_t format)
>  {
> -  uint32_t s = size - 1;
> +  assert(size <= (2ul<<30));
> +  size_t s = size - 1;
>surface_heap_t *heap = gpgpu->aux_buf.bo->virtual + 
> gpgpu->aux_offset.surface_heap_offset;
>gen8_surface_state_t *ss0 = (gen8_surface_state_t *) &heap->surface[index 
> * sizeof(gen8_surface_state_t)];
>memset(ss0, 0, sizeof(gen8_surface_state_t));
> @@ -1395,7 +1398,7 @@ intel_gpgpu_bind_image_gen9(intel_gpgpu_t *gpgpu,
>  
>  static void
>  intel_gpgpu_bind_buf(intel_gpgpu_t *gpgpu, drm_intel_bo *buf, uint32_t 
> offset,
> - uint32_t internal_offset, uint32_t size, uint8_t bti)
> + uint32_t internal_offset, size_t size, uint8_t bti)
>  {
>assert(gpgpu->binded_n < max_buf_n);
>gpgpu->binded_buf[gpgpu->binded_n] = buf;
> -- 
> 2.1.4
> 
___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


Re: [Beignet] [PATCH 4/4] runtime: dynamically get global memory size and max alloc size

2015-10-22 Thread Zhigang Gong
LGTM, thx.

On Wed, Oct 14, 2015 at 04:34:07PM +0800, Pan Xiuli wrote:
> Now device and driver can support bigger memory, we need to abandon
> our old 2G hard code. We get global memory by considering device
> limitation, drm driver and kernel support and raw, this will ensure
> a bigger global memory and a more stable system. We get max mem alloc
> size from global memory size and the device limition.
> 
> Signed-off-by: Pan Xiuli 
> ---
>  src/cl_device_id.c   | 20 +++-
>  src/intel/intel_driver.c |  5 +
>  2 files changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/src/cl_device_id.c b/src/cl_device_id.c
> index d92ce95..1c626f8 100644
> --- a/src/cl_device_id.c
> +++ b/src/cl_device_id.c
> @@ -547,14 +547,24 @@ skl_gt4_break:
>  
>/* Apply any driver-dependent updates to the device info */
>cl_driver_update_device_info(ret);
> -
> +  #define toMB(size) (size)&(0xfff<<20)
> +  /* Get the global_mem_size and max_mem_alloc size from
> +   * driver, system ram and hardware*/
>struct sysinfo info;
>if (sysinfo(&info) == 0) {
> -uint64_t two_gb = 2 * 1024 * 1024 * 1024ul; 
> +uint64_t totalgpumem = ret->global_mem_size;
> + uint64_t maxallocmem = ret->max_mem_alloc_size;
>  uint64_t totalram = info.totalram * info.mem_unit;
> -ret->global_mem_size = (totalram > two_gb) ? 
> -two_gb : info.totalram;
> -ret->max_mem_alloc_size = ret->global_mem_size / 2;
> + /* In case to keep system stable we just use half
> +  * of the raw as global mem */
> +ret->global_mem_size = toMB((totalram / 2 > totalgpumem) ?
> +totalgpumem: totalram / 2);
> + /* The hardware has some limit about the alloc size
> +  * and the excution of kernel need some global mem
> +  * so we now make sure single mem does not use much
> +  * than 3/4 global mem*/
> +ret->max_mem_alloc_size = toMB((ret->global_mem_size * 3 / 4 > 
> maxallocmem) ?
> +  maxallocmem: ret->global_mem_size * 3 / 4);
>}
>  
>return ret;
> diff --git a/src/intel/intel_driver.c b/src/intel/intel_driver.c
> index 035a103..782a2de 100644
> --- a/src/intel/intel_driver.c
> +++ b/src/intel/intel_driver.c
> @@ -829,6 +829,11 @@ intel_update_device_info(cl_device_id device)
>if (IS_CHERRYVIEW(device->device_id))
>  printf(CHV_CONFIG_WARNING);
>  #endif
> +  //We should get the device memory dynamically, but the
> +  //mapablce mem size usage is unknown. Just ignore it.
> +  size_t total_mem,map_mem;
> +  if(drm_intel_get_aperture_sizes(driver->fd,&map_mem,&total_mem) == 0)
> +device->global_mem_size = (cl_ulong)total_mem;
>  
>intel_driver_context_destroy(driver);
>intel_driver_close(driver);
> -- 
> 2.1.4
> 
___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet
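The sizing policy of the patch above can be sketched as below. One caveat: the patch's `toMB` mask (`0xfff<<20`) covers only bits 20–31, so the sketch widens it so that values of 4 GiB and above survive — an assumption about the intent, not the literal patch code.

```c
#include <stdint.h>
#include <assert.h>

/* Truncate to a whole number of megabytes (widened mask, see lead-in). */
#define TO_MB(size) ((size) & ~(((uint64_t)1 << 20) - 1))

/* Report at most half of system RAM as global memory, capped by what
 * the hardware generation supports, to keep the system stable. */
static uint64_t pick_global_mem(uint64_t totalram, uint64_t hw_global) {
  uint64_t half_ram = totalram / 2;
  return TO_MB(half_ram > hw_global ? hw_global : half_ram);
}

/* Cap a single allocation at 3/4 of global memory and at the hardware
 * per-buffer limit, so kernel execution still has global memory left. */
static uint64_t pick_max_alloc(uint64_t global_mem, uint64_t hw_max_alloc) {
  uint64_t cap = global_mem * 3 / 4;
  return TO_MB(cap > hw_max_alloc ? hw_max_alloc : cap);
}
```

For example, on an 8 GiB system with a 4 GiB-capable device, the reported global memory is 4 GiB and the per-allocation cap is 3 GiB.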


Re: [Beignet] [PATCH 3/4] driver: add setup_bti_gen9 for bigger buffer up to 4G

2015-10-22 Thread Zhigang Gong
On Wed, Oct 14, 2015 at 04:34:06PM +0800, Pan Xiuli wrote:
> Now gen9 can support bigger buffer size, and it can also support
> 4G global memory. We add new function to support it.
> 
> Signed-off-by: Pan Xiuli 
> ---
>  src/intel/intel_gpgpu.c | 41 +++--
>  1 file changed, 39 insertions(+), 2 deletions(-)
> 
> diff --git a/src/intel/intel_gpgpu.c b/src/intel/intel_gpgpu.c
> index 4650325..fbc5566 100644
> --- a/src/intel/intel_gpgpu.c
> +++ b/src/intel/intel_gpgpu.c
> @@ -1103,6 +1103,43 @@ intel_gpgpu_setup_bti_gen8(intel_gpgpu_t *gpgpu, 
> drm_intel_bo *buf, uint32_t int
>  buf);
>  }
>  
> +static void
> +intel_gpgpu_setup_bti_gen9(intel_gpgpu_t *gpgpu, drm_intel_bo *buf, uint32_t 
> internal_offset,
> +   size_t size, unsigned char index, 
> uint32_t format)
> +{
> +  assert(size <= (4<<30));
4<<30 overflows. Use 4ul<<30.

> +  uint32_t s = size - 1;
Please use size_t for s.

The other part LGTM, thx.

> +  surface_heap_t *heap = gpgpu->aux_buf.bo->virtual + 
> gpgpu->aux_offset.surface_heap_offset;
> +  gen8_surface_state_t *ss0 = (gen8_surface_state_t *) &heap->surface[index 
> * sizeof(gen8_surface_state_t)];
> +  memset(ss0, 0, sizeof(gen8_surface_state_t));
> +  ss0->ss0.surface_type = I965_SURFACE_BUFFER;
> +  ss0->ss0.surface_format = format;
> +  if(format != I965_SURFACEFORMAT_RAW) {
> +ss0->ss7.shader_channel_select_red = I965_SURCHAN_SELECT_RED;
> +ss0->ss7.shader_channel_select_green = I965_SURCHAN_SELECT_GREEN;
> +ss0->ss7.shader_channel_select_blue = I965_SURCHAN_SELECT_BLUE;
> +ss0->ss7.shader_channel_select_alpha = I965_SURCHAN_SELECT_ALPHA;
> +  }
> +  ss0->ss2.width  = s & 0x7f;   /* bits 6:0 of sz */
> +  // Per bspec, I965_SURFACE_BUFFER and RAW format, size must be a multiple 
> of 4 byte.
> +  if(format == I965_SURFACEFORMAT_RAW)
> +assert((ss0->ss2.width & 0x03) == 3);
> +  ss0->ss2.height = (s >> 7) & 0x3fff; /* bits 20:7 of sz */
> +  ss0->ss3.depth  = (s >> 21) & 0x7ff; /* bits 31:21 of sz, from bespec only 
> gen 9 support that*/
> +  ss0->ss1.mem_obj_ctrl_state = cl_gpgpu_get_cache_ctrl();
> +  heap->binding_table[index] = offsetof(surface_heap_t, surface) + index * 
> sizeof(gen8_surface_state_t);
> +  ss0->ss8.surface_base_addr_lo = (buf->offset64 + internal_offset) & 
> 0x;
> +  ss0->ss9.surface_base_addr_hi = ((buf->offset64 + internal_offset) >> 32) 
> & 0x;
> +  dri_bo_emit_reloc(gpgpu->aux_buf.bo,
> +I915_GEM_DOMAIN_RENDER,
> +I915_GEM_DOMAIN_RENDER,
> +internal_offset,
> +gpgpu->aux_offset.surface_heap_offset +
> +heap->binding_table[index] +
> +offsetof(gen8_surface_state_t, ss8),
> +buf);
> +}
> +
>  static int
>  intel_is_surface_array(cl_mem_object_type type)
>  {
> @@ -2186,10 +2223,10 @@ intel_set_gpgpu_callbacks(int device_id)
>  intel_gpgpu_set_L3 = intel_gpgpu_set_L3_gen8;
>  cl_gpgpu_get_cache_ctrl = (cl_gpgpu_get_cache_ctrl_cb 
> *)intel_gpgpu_get_cache_ctrl_gen9;
>  intel_gpgpu_get_scratch_index = intel_gpgpu_get_scratch_index_gen8;
> -intel_gpgpu_post_action = intel_gpgpu_post_action_gen7; //BDW need not 
> restore SLM, same as gen7
> +intel_gpgpu_post_action = intel_gpgpu_post_action_gen7; //SKL need not 
> restore SLM, same as gen7
>  intel_gpgpu_read_ts_reg = intel_gpgpu_read_ts_reg_gen7;
>  intel_gpgpu_set_base_address = intel_gpgpu_set_base_address_gen9;
> -intel_gpgpu_setup_bti = intel_gpgpu_setup_bti_gen8;
> +intel_gpgpu_setup_bti = intel_gpgpu_setup_bti_gen9;
>  intel_gpgpu_load_vfe_state = intel_gpgpu_load_vfe_state_gen8;
>  cl_gpgpu_walker = (cl_gpgpu_walker_cb *)intel_gpgpu_walker_gen8;
>  intel_gpgpu_build_idrt = intel_gpgpu_build_idrt_gen9;
> -- 
> 2.1.4
> 
___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet
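The gen9 surface state quoted above packs (size - 1) into three bitfields — width bits 6:0, height bits 20:7, depth bits 31:21 — which is what raises the describable buffer size to 4 GiB. A sketch of just that encoding:

```c
#include <stdint.h>
#include <assert.h>

/* Split (size - 1) into the gen9 SURFACE_STATE buffer-size fields. */
static void encode_buffer_size(uint64_t size,
                               uint32_t *width, uint32_t *height,
                               uint32_t *depth) {
  uint64_t s = size - 1;
  *width  = (uint32_t)(s & 0x7f);          /* bits 6:0  */
  *height = (uint32_t)((s >> 7) & 0x3fff); /* bits 20:7 */
  *depth  = (uint32_t)((s >> 21) & 0x7ff); /* bits 31:21 */
}
```

For a full 4 GiB buffer all three fields saturate; the width field's low two bits being set is exactly what the RAW-format multiple-of-4 assertion in the patch checks.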


Re: [Beignet] [PATCH 2/4] runtime: refine the cl_device_id to support bigger memory

2015-10-22 Thread Zhigang Gong
This patch LGTM, thx.

On Wed, Oct 14, 2015 at 04:34:05PM +0800, Pan Xiuli wrote:
> Now gen8 and gen9 support 4G global memory, and gen9 support
> 4G single buffer. Need to move the global_mem and max_mem_alloc
> size into each define header.
> 
> Signed-off-by: Pan Xiuli 
> ---
>  src/cl_device_id.c| 14 +++---
>  src/cl_gen75_device.h |  5 +++--
>  src/cl_gen7_device.h  |  2 ++
>  src/cl_gen8_device.h  | 30 ++
>  src/cl_gen9_device.h  | 31 +++
>  src/cl_gt_device.h|  2 --
>  6 files changed, 73 insertions(+), 11 deletions(-)
>  create mode 100644 src/cl_gen8_device.h
>  create mode 100644 src/cl_gen9_device.h
> 
> diff --git a/src/cl_device_id.c b/src/cl_device_id.c
> index 78d2cf4..d92ce95 100644
> --- a/src/cl_device_id.c
> +++ b/src/cl_device_id.c
> @@ -116,7 +116,7 @@ static struct _cl_device_id intel_brw_gt1_device = {
>.max_work_item_sizes = {512, 512, 512},
>.max_work_group_size = 512,
>.max_clock_frequency = 1000,
> -#include "cl_gen75_device.h"
> +#include "cl_gen8_device.h"
>  };
>  
>  static struct _cl_device_id intel_brw_gt2_device = {
> @@ -127,7 +127,7 @@ static struct _cl_device_id intel_brw_gt2_device = {
>.max_work_item_sizes = {512, 512, 512},
>.max_work_group_size = 512,
>.max_clock_frequency = 1000,
> -#include "cl_gen75_device.h"
> +#include "cl_gen8_device.h"
>  };
>  
>  static struct _cl_device_id intel_brw_gt3_device = {
> @@ -138,7 +138,7 @@ static struct _cl_device_id intel_brw_gt3_device = {
>.max_work_item_sizes = {512, 512, 512},
>.max_work_group_size = 512,
>.max_clock_frequency = 1000,
> -#include "cl_gen75_device.h"
> +#include "cl_gen8_device.h"
>  };
>  
>  //Cherryview has the same pciid, must get the max_compute_unit and 
> max_thread_per_unit from drm
> @@ -162,7 +162,7 @@ static struct _cl_device_id intel_skl_gt1_device = {
>.max_work_item_sizes = {512, 512, 512},
>.max_work_group_size = 512,
>.max_clock_frequency = 1000,
> -#include "cl_gen75_device.h"
> +#include "cl_gen9_device.h"
>  };
>  
>  static struct _cl_device_id intel_skl_gt2_device = {
> @@ -173,7 +173,7 @@ static struct _cl_device_id intel_skl_gt2_device = {
>.max_work_item_sizes = {512, 512, 512},
>.max_work_group_size = 512,
>.max_clock_frequency = 1000,
> -#include "cl_gen75_device.h"
> +#include "cl_gen9_device.h"
>  };
>  
>  static struct _cl_device_id intel_skl_gt3_device = {
> @@ -184,7 +184,7 @@ static struct _cl_device_id intel_skl_gt3_device = {
>.max_work_item_sizes = {512, 512, 512},
>.max_work_group_size = 512,
>.max_clock_frequency = 1000,
> -#include "cl_gen75_device.h"
> +#include "cl_gen9_device.h"
>  };
>  
>  static struct _cl_device_id intel_skl_gt4_device = {
> @@ -195,7 +195,7 @@ static struct _cl_device_id intel_skl_gt4_device = {
>.max_work_item_sizes = {512, 512, 512},
>.max_work_group_size = 512,
>.max_clock_frequency = 1000,
> -#include "cl_gen75_device.h"
> +#include "cl_gen9_device.h"
>  };
>  
>  LOCAL cl_device_id
> diff --git a/src/cl_gen75_device.h b/src/cl_gen75_device.h
> index 43f6e8f..7ef2b82 100644
> --- a/src/cl_gen75_device.h
> +++ b/src/cl_gen75_device.h
> @@ -17,14 +17,15 @@
>   * Author: Benjamin Segovia 
>   */
>  
> -/* Common fields for both SNB devices (either GT1 or GT2)
> - */
> +/* Common fields for both CHV,VLV and HSW devices */
>  .max_parameter_size = 1024,
>  .global_mem_cache_line_size = 64, /* XXX */
>  .global_mem_cache_size = 8 << 10, /* XXX */
>  .local_mem_type = CL_GLOBAL,
>  .local_mem_size = 64 << 10,
>  .scratch_mem_size = 2 << 20,
> +.max_mem_alloc_size = 2 * 1024 * 1024 * 1024ul,
> +.global_mem_size = 2 * 1024 * 1024 * 1024ul,
>  
>  #include "cl_gt_device.h"
>  
> diff --git a/src/cl_gen7_device.h b/src/cl_gen7_device.h
> index 4ad5d96..104e929 100644
> --- a/src/cl_gen7_device.h
> +++ b/src/cl_gen7_device.h
> @@ -24,6 +24,8 @@
>  .local_mem_type = CL_GLOBAL,
>  .local_mem_size = 64 << 10,
>  .scratch_mem_size = 12 << 10,
> +.max_mem_alloc_size = 2 * 1024 * 1024 * 1024ul,
> +.global_mem_size = 2 * 1024 * 1024 * 1024ul,
>  
>  #include "cl_gt_device.h"
>  
> diff --git a/src/cl_gen8_device.h b/src/cl_gen8_device.h
> new file mode 100644
> index 000..08fde48
> --- /dev/null
> +++ b/src/cl_gen8_device.h
> @@ -0,0 +1,30 @@
> +/*
> + * Copyright © 2012 Intel Corporation
> + *
> + * This library is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU Lesser General Public
> + * License as published by the Free Software Foundation; either
> + * version 2.1 of the License, or (at your option) any later version.
> + *
> + * This library is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * Lesser General Public License for more details.
> + *
> + * You should have received a copy of the GNU Lesser General Publ

Re: [Beignet] [PATCH 1/4] drivers: change the buf size to size_t

2015-10-22 Thread Zhigang Gong
On Wed, Oct 14, 2015 at 04:34:04PM +0800, Pan Xiuli wrote:
> The uint32_t size is not enough for coming bigger
> gpu memory, now GEN9 support 4G buffer. Also add
> assertion for invalid size.
> 
> Signed-off-by: Pan Xiuli 
> ---
>  src/cl_driver.h |  2 +-
>  src/intel/intel_gpgpu.c | 13 -
>  2 files changed, 9 insertions(+), 6 deletions(-)
> 
> diff --git a/src/cl_driver.h b/src/cl_driver.h
> index 1ab4dff..4ffca09 100644
> --- a/src/cl_driver.h
> +++ b/src/cl_driver.h
> @@ -138,7 +138,7 @@ typedef void (cl_gpgpu_sync_cb)(void*);
>  extern cl_gpgpu_sync_cb *cl_gpgpu_sync;
>  
>  /* Bind a regular unformatted buffer */
> -typedef void (cl_gpgpu_bind_buf_cb)(cl_gpgpu, cl_buffer, uint32_t offset, 
> uint32_t internal_offset, uint32_t size, uint8_t bti);
> +typedef void (cl_gpgpu_bind_buf_cb)(cl_gpgpu, cl_buffer, uint32_t offset, 
> uint32_t internal_offset, size_t size, uint8_t bti);
>  extern cl_gpgpu_bind_buf_cb *cl_gpgpu_bind_buf;
>  
>  /* bind samplers defined in both kernel and kernel args. */
> diff --git a/src/intel/intel_gpgpu.c b/src/intel/intel_gpgpu.c
> index 901bd98..4650325 100644
> --- a/src/intel/intel_gpgpu.c
> +++ b/src/intel/intel_gpgpu.c
> @@ -86,7 +86,7 @@ typedef void (intel_gpgpu_set_base_address_t)(intel_gpgpu_t 
> *gpgpu);
>  intel_gpgpu_set_base_address_t *intel_gpgpu_set_base_address = NULL;
>  
>  typedef void (intel_gpgpu_setup_bti_t)(intel_gpgpu_t *gpgpu, drm_intel_bo 
> *buf, uint32_t internal_offset,
> -   uint32_t size, unsigned char index, 
> uint32_t format);
> +   size_t size, unsigned char index, 
> uint32_t format);
>  intel_gpgpu_setup_bti_t *intel_gpgpu_setup_bti = NULL;
>  
>  
> @@ -1000,8 +1000,9 @@ intel_gpgpu_alloc_constant_buffer(intel_gpgpu_t *gpgpu, 
> uint32_t size, uint8_t b
>  
>  static void
>  intel_gpgpu_setup_bti_gen7(intel_gpgpu_t *gpgpu, drm_intel_bo *buf, uint32_t 
> internal_offset,
> -   uint32_t size, unsigned char index, 
> uint32_t format)
> +   size_t size, unsigned char index, 
> uint32_t format)
>  {
> +  assert(size <= (2<<30));
should use 2ul<<30
>uint32_t s = size - 1;
Please use size_t for s as well.

There are some similar problems in the following functions.
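The point about the literal suffix can be checked in isolation. An unsuffixed `2 << 30` or `4 << 30` is evaluated in 32-bit `int`, where 2^31 and 2^32 do not fit; the `ul` suffix moves the shift into `unsigned long`. (This sketch assumes an LP64 target such as 64-bit Linux, where `unsigned long` is 64 bits.)

```c
#include <stdint.h>
#include <assert.h>

/* Well defined on LP64: the shifts happen in 64-bit unsigned long.
 * The unsuffixed forms would overflow 32-bit int (2 << 30 shifts into
 * the sign bit; 4 << 30 loses the value entirely). */
static const uint64_t TWO_GB  = 2ul << 30; /* 2147483648 */
static const uint64_t FOUR_GB = 4ul << 30; /* 4294967296 */

/* The assertion pattern the review asks for: reject sizes beyond what
 * the surface state of a given hardware generation can describe. */
static int size_fits(uint64_t size, uint64_t limit) {
  return size <= limit;
}
```

So `assert(size <= (2ul<<30))` really tests against 2 GiB, whereas the unsuffixed form would compare against an overflowed constant.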

>surface_heap_t *heap = gpgpu->aux_buf.bo->virtual + 
> gpgpu->aux_offset.surface_heap_offset;
>gen7_surface_state_t *ss0 = (gen7_surface_state_t *) &heap->surface[index 
> * sizeof(gen7_surface_state_t)];
> @@ -1030,8 +1031,9 @@ intel_gpgpu_setup_bti_gen7(intel_gpgpu_t *gpgpu, 
> drm_intel_bo *buf, uint32_t int
>  
>  static void
>  intel_gpgpu_setup_bti_gen75(intel_gpgpu_t *gpgpu, drm_intel_bo *buf, 
> uint32_t internal_offset,
> -   uint32_t size, unsigned char index, 
> uint32_t format)
> +   size_t size, unsigned char index, 
> uint32_t format)
>  {
> +  assert(size <= (2<<30));
>uint32_t s = size - 1;
>surface_heap_t *heap = gpgpu->aux_buf.bo->virtual + 
> gpgpu->aux_offset.surface_heap_offset;
>gen7_surface_state_t *ss0 = (gen7_surface_state_t *) &heap->surface[index 
> * sizeof(gen7_surface_state_t)];
> @@ -1066,8 +1068,9 @@ intel_gpgpu_setup_bti_gen75(intel_gpgpu_t *gpgpu, 
> drm_intel_bo *buf, uint32_t in
>  
>  static void
>  intel_gpgpu_setup_bti_gen8(intel_gpgpu_t *gpgpu, drm_intel_bo *buf, uint32_t 
> internal_offset,
> -   uint32_t size, unsigned char index, 
> uint32_t format)
> +   size_t size, unsigned char index, 
> uint32_t format)
>  {
> +  assert(size <= (2<<30));
>uint32_t s = size - 1;
>surface_heap_t *heap = gpgpu->aux_buf.bo->virtual + 
> gpgpu->aux_offset.surface_heap_offset;
>gen8_surface_state_t *ss0 = (gen8_surface_state_t *) &heap->surface[index 
> * sizeof(gen8_surface_state_t)];
> @@ -1395,7 +1398,7 @@ intel_gpgpu_bind_image_gen9(intel_gpgpu_t *gpgpu,
>  
>  static void
>  intel_gpgpu_bind_buf(intel_gpgpu_t *gpgpu, drm_intel_bo *buf, uint32_t 
> offset,
> - uint32_t internal_offset, uint32_t size, uint8_t bti)
> + uint32_t internal_offset, size_t size, uint8_t bti)
>  {
>assert(gpgpu->binded_n < max_buf_n);
>gpgpu->binded_buf[gpgpu->binded_n] = buf;
> -- 
> 2.1.4
> 
> ___
> Beignet mailing list
> Beignet@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet
___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


[Beignet] [PATCH] GBE: fix a regression bug at post phi copy optimization.

2015-10-20 Thread Zhigang Gong
Forgot to handle the undefined phi value set of BBs when
we replace registers. This information will be used in the next
round of DAG generation.

Signed-off-by: Zhigang Gong 
---
 backend/src/ir/liveness.cpp | 6 ++
 1 file changed, 6 insertions(+)

diff --git a/backend/src/ir/liveness.cpp b/backend/src/ir/liveness.cpp
index 414bf42..d48f067 100644
--- a/backend/src/ir/liveness.cpp
+++ b/backend/src/ir/liveness.cpp
@@ -82,6 +82,8 @@ namespace ir {
 if (info.liveOut.contains(from)) {
   info.liveOut.erase(from);
   info.liveOut.insert(to);
+  // FIXME, a hack method to avoid the "to" register be treated as
+  // uniform value.
   bb->definedPhiRegs.insert(to);
 }
 if (info.upwardUsed.contains(from)) {
@@ -92,6 +94,10 @@ namespace ir {
   info.varKill.erase(from);
   info.varKill.insert(to);
 }
+if (bb->undefPhiRegs.contains(from)) {
+  bb->undefPhiRegs.erase(from);
+  bb->undefPhiRegs.insert(to);
+}
   }
 }
   }
-- 
1.9.1
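
The bookkeeping this patch fixes can be sketched standalone (hypothetical simplified types, not the real GBE classes): when renaming register `from` to `to`, every per-block set that the next DAG round consults must be updated, and undefPhiRegs was the one previously missed.

```cpp
#include <cassert>
#include <set>

using Register = int;

// Hypothetical simplified stand-ins for the GBE liveness structures.
struct BlockInfo { std::set<Register> liveOut, upwardUsed, varKill; };
struct BasicBlock { std::set<Register> definedPhiRegs, undefPhiRegs; };

static void replaceInSet(std::set<Register> &s, Register from, Register to) {
  if (s.erase(from))
    s.insert(to);
}

// Rename `from` to `to` across all sets the next DAG-generation round will
// consult; forgetting undefPhiRegs (the bug fixed above) leaves a stale
// register behind.
void replaceRegister(BlockInfo &info, BasicBlock &bb, Register from, Register to) {
  replaceInSet(info.liveOut, from, to);
  replaceInSet(info.upwardUsed, from, to);
  replaceInSet(info.varKill, from, to);
  replaceInSet(bb.undefPhiRegs, from, to);
}
```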

___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


Re: [Beignet] [PATCH] fix uniform case for ByteGather

2015-10-15 Thread Zhigang Gong
LGTM, thx.

On Wed, Oct 14, 2015 at 05:28:51AM +0800, Guo Yejun wrote:
> Currently, the ByteGather generates IR as:
> BYTE_GATHER(16) %109<0>:UD:   %96<0,1,0>:UD   0x4:UD
> MOV(1)  %75<0>:UB :   %109<32,8,4>:UB
> 
> Fix it to generate IR as:
> BYTE_GATHER(16) %109<0>:UD  :   %96<0,1,0>:UD   0x4:UD
> MOV(1)  %75<0>:UB   :   %109<0,1,0>:UB
> 
> Otherwise, there is a regression in the local copy propagation optimization,
> which uses %109<32,8,4>:UB
> 
> Signed-off-by: Guo Yejun 
> ---
>  backend/src/backend/gen_insn_selection.cpp | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/backend/src/backend/gen_insn_selection.cpp 
> b/backend/src/backend/gen_insn_selection.cpp
> index da437d1..44cc473 100644
> --- a/backend/src/backend/gen_insn_selection.cpp
> +++ b/backend/src/backend/gen_insn_selection.cpp
> @@ -3657,9 +3657,9 @@ namespace gbe
>sel.curr.execWidth = 1;
>  }
>  if (elemSize == GEN_BYTE_SCATTER_WORD)
> -  sel.MOV(GenRegister::retype(value, GEN_TYPE_UW), 
> GenRegister::unpacked_uw(dst));
> +  sel.MOV(GenRegister::retype(value, GEN_TYPE_UW), 
> GenRegister::unpacked_uw(dst, isUniform));
>  else if (elemSize == GEN_BYTE_SCATTER_BYTE)
> -  sel.MOV(GenRegister::retype(value, GEN_TYPE_UB), 
> GenRegister::unpacked_ub(dst));
> +  sel.MOV(GenRegister::retype(value, GEN_TYPE_UB), 
> GenRegister::unpacked_ub(dst, isUniform));
>sel.pop();
>  }
>}
> -- 
> 1.9.1
> 
> ___
> Beignet mailing list
> Beignet@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet
___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


Re: [Beignet] [PATCH] driver/runtime: get global mem size dynamically

2015-10-09 Thread Zhigang Gong
You are right, your code is equivalent to my suggestion.
But I think it may be too conservative now. For example,
if the system has 8GB of memory, max_mem_alloc_size will be
less than or equal to 2GB. Maybe we can just set max_mem_alloc_size
equal to global_mem_size.

What do you think?

Even so, please be aware that if the test machine has only 4GB of
memory, it will never exercise the GPU memory space above 2GB.

Thanks,
Zhigang Gong.

On Fri, Oct 09, 2015 at 08:56:38AM +, Pan, Xiuli wrote:
> So I have rewritten the code here, and I think we should not use totalgpumem
> for max_mem_alloc_size; ret->global_mem_size / 2 will always be less than or
> equal to totalgpumem.
> Or I may not have gotten your point about the threshold.
> 
> ret->global_mem_size = min(totalram/2 , totalgpumem) 
> ret->max_mem_alloc_size =  ret->global_mem_size / 2;
> 
> -----Original Message-
> From: Zhigang Gong [mailto:zhigang.g...@linux.intel.com] 
> Sent: Friday, October 9, 2015 3:05 PM
> To: Pan, Xiuli 
> Cc: beignet@lists.freedesktop.org
> Subject: Re: [Beignet] [PATCH] driver/runtime: get global mem size dynamically
> 
> On Fri, Oct 09, 2015 at 03:56:11PM +0800, Pan Xiuli wrote:
> > The gen8 and higher gpu can have more than 2G mem, so we dynamically 
> > get the global mem size.
> > 
> > Signed-off-by: Pan Xiuli 
> > ---
> >  src/cl_device_id.c   | 6 +++---
> >  src/intel/intel_driver.c | 6 ++
> >  2 files changed, 9 insertions(+), 3 deletions(-)
> > 
> > diff --git a/src/cl_device_id.c b/src/cl_device_id.c index 
> > 78d2cf4..c3bd35f 100644
> > --- a/src/cl_device_id.c
> > +++ b/src/cl_device_id.c
> > @@ -550,10 +550,10 @@ skl_gt4_break:
> >  
> >struct sysinfo info;
> >if (sysinfo(&info) == 0) {
> > -uint64_t two_gb = 2 * 1024 * 1024 * 1024ul; 
> > +uint64_t totalgpumem = ret->global_mem_size;
> >  uint64_t totalram = info.totalram * info.mem_unit;
> > -ret->global_mem_size = (totalram > two_gb) ? 
> > -two_gb : info.totalram;
> > +ret->global_mem_size = (totalram > totalgpumem) ?
> > +totalgpumem: totalram;
> >  ret->max_mem_alloc_size = ret->global_mem_size / 2;
> 
> The "/ 2" is a hard-coded way to avoid allocating more memory than the
> platform supports. Now this patch gets the aperture size from libdrm, so we
> don't need it any more.
> 
> But we may still want to make sure that we don't allocate too much system 
> memory. I think totalram/2 should be a reasonable threshold value.
> 
> So the following code may be better:
> 
> reg->global_mem_size = min(totalram/2, totalgpumem);
> reg->max_mem_alloc_size = min(global_mem_size/2, totalgpumem);
> 
> Furthermore we still need to check carefully whether it's safe to use 
> totalgpumem as the global_mem_size and/or max_mem_alloc_size.
> 
> If the test case tests the maximum memory size, and some aperture space is
> already allocated by the system, then the case may fail. Before submitting
> the patch, it's better to test it with all the related conformance test
> cases on all platforms. Or maybe we need to add some cases that test edge
> conditions to the internal unit tests.
> 
> Thanks,
> Zhigang Gong.
> 
> >}
> >  
> > diff --git a/src/intel/intel_driver.c b/src/intel/intel_driver.c index 
> > 035a103..1f286f7 100644
> > --- a/src/intel/intel_driver.c
> > +++ b/src/intel/intel_driver.c
> > @@ -829,6 +829,12 @@ intel_update_device_info(cl_device_id device)
> >if (IS_CHERRYVIEW(device->device_id))
> >  printf(CHV_CONFIG_WARNING);
> >  #endif
> > +  //We should get the device memory dynamically, also the  //mapablce 
> > + mem size usage is unknown. Still use global_mem_size/2  //as 
> > + max_mem_alloc_size in cl_get_gt_device.
> > +  size_t total_mem,map_mem;
> > +  drm_intel_get_aperture_sizes(driver->fd,&map_mem,&total_mem);
> > +  device->global_mem_size = (cl_ulong)total_mem;
> >  
> >intel_driver_context_destroy(driver);
> >intel_driver_close(driver);
> > --
> > 2.1.4
> > 
> > ___
> > Beignet mailing list
> > Beignet@lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/beignet
> ___
> Beignet mailing list
> Beignet@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet
___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


[Beignet] [PATCH] GBE: fix kernel arguments uploading bug.

2015-10-09 Thread Zhigang Gong
After the curbe allocation refactor, not all kernel arguments
are allocated unconditionally. If some kernel arguments are never
used, the corresponding arguments are ignored by the backend and
we may get a -1 offset. On the runtime driver side, we need to
check for this situation.

Signed-off-by: Zhigang Gong 
---
 src/cl_command_queue_gen7.c | 6 --
 src/cl_kernel.c | 8 +---
 2 files changed, 9 insertions(+), 5 deletions(-)

diff --git a/src/cl_command_queue_gen7.c b/src/cl_command_queue_gen7.c
index 8c09615..2edc3be 100644
--- a/src/cl_command_queue_gen7.c
+++ b/src/cl_command_queue_gen7.c
@@ -173,7 +173,8 @@ cl_upload_constant_buffer(cl_command_queue queue, cl_kernel 
ker)
   uint32_t alignment = interp_kernel_get_arg_align(ker->opaque, arg);
   offset = ALIGN(offset, alignment);
   curbe_offset = interp_kernel_get_curbe_offset(ker->opaque, 
GBE_CURBE_KERNEL_ARGUMENT, arg);
-  assert(curbe_offset >= 0);
+  if (curbe_offset < 0)
+continue;
   *(uint32_t *) (ker->curbe + curbe_offset) = offset;
 
   cl_buffer_map(mem->bo, 1);
@@ -228,7 +229,8 @@ cl_curbe_fill(cl_kernel ker,
 assert(align != 0);
 slm_offset = ALIGN(slm_offset, align);
 offset = interp_kernel_get_curbe_offset(ker->opaque, 
GBE_CURBE_KERNEL_ARGUMENT, arg);
-assert(offset >= 0);
+if (offset < 0)
+  continue;
 uint32_t *slmptr = (uint32_t *) (ker->curbe + offset);
 *slmptr = slm_offset;
 slm_offset += ker->args[arg].local_sz;
diff --git a/src/cl_kernel.c b/src/cl_kernel.c
index 5d170c6..58a1224 100644
--- a/src/cl_kernel.c
+++ b/src/cl_kernel.c
@@ -153,9 +153,10 @@ cl_kernel_set_arg(cl_kernel k, cl_uint index, size_t sz, 
const void *value)
   /* Copy the structure or the value directly into the curbe */
   if (arg_type == GBE_ARG_VALUE) {
 offset = interp_kernel_get_curbe_offset(k->opaque, 
GBE_CURBE_KERNEL_ARGUMENT, index);
-assert(offset + sz <= k->curbe_sz);
-if (offset >= 0)
+if (offset >= 0) {
+  assert(offset + sz <= k->curbe_sz);
   memcpy(k->curbe + offset, value, sz);
+}
 k->args[index].local_sz = 0;
 k->args[index].is_set = 1;
 k->args[index].mem = NULL;
@@ -193,7 +194,8 @@ cl_kernel_set_arg(cl_kernel k, cl_uint index, size_t sz, 
const void *value)
   if(value == NULL || mem == NULL) {
 /* for buffer object GLOBAL_PTR CONSTANT_PTR, it maybe NULL */
 int32_t offset = interp_kernel_get_curbe_offset(k->opaque, 
GBE_CURBE_KERNEL_ARGUMENT, index);
-*((uint32_t *)(k->curbe + offset)) = 0;
+if (offset >= 0)
+  *((uint32_t *)(k->curbe + offset)) = 0;
 assert(arg_type == GBE_ARG_GLOBAL_PTR || arg_type == GBE_ARG_CONSTANT_PTR);
 
 if (k->args[index].mem)
-- 
1.9.1
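
The guarded curbe write this patch introduces can be sketched standalone (a hypothetical simplified signature, not the real cl_kernel API): a negative offset means the backend eliminated an unused argument, so the runtime skips the upload instead of asserting.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Copy a by-value kernel argument into the curbe, but only when the backend
// actually allocated a slot for it (offset >= 0). Returns whether a write
// happened.
bool setValueArg(std::vector<char> &curbe, int32_t offset,
                 const void *value, size_t sz) {
  if (offset < 0)                      // argument unused by the kernel
    return false;
  assert((size_t)offset + sz <= curbe.size());
  std::memcpy(curbe.data() + offset, value, sz);
  return true;
}
```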

___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


Re: [Beignet] [PATCH] driver/runtime: get global mem size dynamically

2015-10-09 Thread Zhigang Gong
On Fri, Oct 09, 2015 at 03:56:11PM +0800, Pan Xiuli wrote:
> The gen8 and higher gpu can have more than 2G mem, so we dynamically
> get the global mem size.
> 
> Signed-off-by: Pan Xiuli 
> ---
>  src/cl_device_id.c   | 6 +++---
>  src/intel/intel_driver.c | 6 ++
>  2 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/src/cl_device_id.c b/src/cl_device_id.c
> index 78d2cf4..c3bd35f 100644
> --- a/src/cl_device_id.c
> +++ b/src/cl_device_id.c
> @@ -550,10 +550,10 @@ skl_gt4_break:
>  
>struct sysinfo info;
>if (sysinfo(&info) == 0) {
> -uint64_t two_gb = 2 * 1024 * 1024 * 1024ul; 
> +uint64_t totalgpumem = ret->global_mem_size;
>  uint64_t totalram = info.totalram * info.mem_unit;
> -ret->global_mem_size = (totalram > two_gb) ? 
> -two_gb : info.totalram;
> +ret->global_mem_size = (totalram > totalgpumem) ?
> +totalgpumem: totalram;
>  ret->max_mem_alloc_size = ret->global_mem_size / 2;

The "/ 2" is a hard-coded way to avoid allocating more memory than the
platform supports. Now this patch gets the aperture size from libdrm, so we
don't need it any more.

But we may still want to make sure that we don't allocate too much
system memory. I think totalram/2 should be a reasonable threshold value.

So the following code may be better:

reg->global_mem_size = min(totalram/2, totalgpumem);
reg->max_mem_alloc_size = min(global_mem_size/2, totalgpumem);
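
As a rough standalone sketch of the suggested clamping policy (illustrative names, with std::min standing in for the min above):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

struct MemInfo { uint64_t global_mem_size, max_mem_alloc_size; };

// Hypothetical standalone version of the suggested policy: never claim more
// than half of system RAM or more than the GPU aperture, and cap a single
// allocation at half of the resulting global size.
MemInfo computeMemSizes(uint64_t totalram, uint64_t totalgpumem) {
  MemInfo m;
  m.global_mem_size = std::min(totalram / 2, totalgpumem);
  m.max_mem_alloc_size = std::min(m.global_mem_size / 2, totalgpumem);
  return m;
}
```

For example, with 8GB of RAM and a 4GB aperture this yields a 4GB global size and a 2GB allocation cap.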

Furthermore we still need to check carefully whether it's safe to use
totalgpumem as the global_mem_size and/or max_mem_alloc_size.

If the test case tests the maximum memory size, and some aperture space is
already allocated by the system, then the case may fail. Before submitting
the patch, it's better to test it with all the related conformance test cases
on all platforms. Or maybe we need to add some cases that test edge conditions
to the internal unit tests.

Thanks,
Zhigang Gong.

>}
>  
> diff --git a/src/intel/intel_driver.c b/src/intel/intel_driver.c
> index 035a103..1f286f7 100644
> --- a/src/intel/intel_driver.c
> +++ b/src/intel/intel_driver.c
> @@ -829,6 +829,12 @@ intel_update_device_info(cl_device_id device)
>if (IS_CHERRYVIEW(device->device_id))
>  printf(CHV_CONFIG_WARNING);
>  #endif
> +  //We should get the device memory dynamically, also the
> +  //mapablce mem size usage is unknown. Still use global_mem_size/2
> +  //as max_mem_alloc_size in cl_get_gt_device.
> +  size_t total_mem,map_mem;
> +  drm_intel_get_aperture_sizes(driver->fd,&map_mem,&total_mem);
> +  device->global_mem_size = (cl_ulong)total_mem;
>  
>intel_driver_context_destroy(driver);
>intel_driver_close(driver);
> -- 
> 2.1.4
> 
> ___
> Beignet mailing list
> Beignet@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet
___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


Re: [Beignet] [PATCH] GBE: fix a zero/one's liveness bug.

2015-10-08 Thread Zhigang Gong
Ping for review.

On Tue, Sep 22, 2015 at 03:45:29PM +0800, Zhigang Gong wrote:
> Ping for review.
> Thanks.
> 
> On Mon, Sep 14, 2015 at 03:50:00PM +0800, Zhigang Gong wrote:
> > This is a long standing bug, and is exposed by my latest register
> > allocation refinement patchset. ir::ocl::zero and ir::ocl::one are
> > global registers; we have to compute their liveness information carefully,
> > not just get a local interval ID.
> > 
> > Signed-off-by: Zhigang Gong 
> > ---
> >  backend/src/backend/gen_reg_allocation.cpp | 29 
> > +
> >  1 file changed, 29 insertions(+)
> > 
> > diff --git a/backend/src/backend/gen_reg_allocation.cpp 
> > b/backend/src/backend/gen_reg_allocation.cpp
> > index bf2ac2b..f440747 100644
> > --- a/backend/src/backend/gen_reg_allocation.cpp
> > +++ b/backend/src/backend/gen_reg_allocation.cpp
> > @@ -179,6 +179,8 @@ namespace gbe
> >  SpilledRegs spilledRegs;
> >  /*! register which could be spilled.*/
> >  SpillCandidateSet spillCandidate;
> > +/*! BBs last instruction ID map */
> > +map bbLastInsnIDMap;
> >  /* reserved registers for register spill/reload */
> >  uint32_t reservedReg;
> >  /*! Current vector to expire */
> > @@ -505,6 +507,7 @@ namespace gbe
> >  // policy is to spill the allocate flag which live to the last time 
> > end point.
> >  
> >  // we have three flags we use for booleans f0.0 , f1.0 and f1.1
> > +set liveInSet01;
> >  for (auto &block : *selection.blockList) {
> >// Store the registers allocated in the map
> >map allocatedFlags;
> > @@ -674,6 +677,7 @@ namespace gbe
> >  sel0->src(0) = GenRegister::uw1grf(ir::ocl::one);
> >  sel0->src(1) = GenRegister::uw1grf(ir::ocl::zero);
> >  sel0->dst(0) = GET_FLAG_REG(insn);
> > +liveInSet01.insert(insn.parent->bb);
> >  insn.append(*sel0);
> >  // We use the zero one after the liveness analysis, we have to 
> > update
> >  // the liveness data manually here.
> > @@ -692,6 +696,30 @@ namespace gbe
> >  }
> >}
> >  }
> > +
> > +// As we introduce two global variables zero and one, we have to
> > +// recompute its liveness information here!
> > +if (liveInSet01.size()) {
> > +  set liveOutSet01;
> > +  set workSet(liveInSet01.begin(), 
> > liveInSet01.end());
> > +  while(workSet.size()) {
> > +for(auto bb : workSet) {
> > +  for(auto predBB : bb->getPredecessorSet()) {
> > +liveOutSet01.insert(predBB);
> > +if (liveInSet01.contains(predBB))
> > +  continue;
> > +liveInSet01.insert(predBB);
> > +workSet.insert(predBB);
> > +  }
> > +  workSet.erase(bb);
> > +}
> > +  }
> > +  int32_t maxID = 0;
> > +  for(auto bb : liveOutSet01)
> > +maxID = std::max(maxID, bbLastInsnIDMap.find(bb)->second);
> > +  intervals[ir::ocl::zero].maxID = 
> > std::max(intervals[ir::ocl::zero].maxID, maxID);
> > +  intervals[ir::ocl::one].maxID = 
> > std::max(intervals[ir::ocl::one].maxID, maxID);
> > +}
> >}
> >  
> >IVAR(OCL_SIMD16_SPILL_THRESHOLD, 0, 16, 256);
> > @@ -1127,6 +1155,7 @@ namespace gbe
> >  
> >// All registers alive at the begining of the block must update 
> > their intervals.
> >const ir::BasicBlock *bb = block.bb;
> > +  bbLastInsnIDMap.insert(std::make_pair(bb, lastID));
> >for (auto reg : ctx.getLiveIn(bb))
> >  this->intervals[reg].minID = std::min(this->intervals[reg].minID, 
> > firstID);
> >  
> > -- 
> > 1.9.1
> > 
> > ___
> > Beignet mailing list
> > Beignet@lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/beignet
> ___
> Beignet mailing list
> Beignet@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet
___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


Re: [Beignet] [PATCH 3/4] Fix a event leak in create context

2015-10-08 Thread Zhigang Gong
Nice catch, this patch LGTM.

On Thu, Sep 24, 2015 at 05:13:26PM +0800, Pan Xiuli wrote:
> We get an event out of NDRangeKernel, and we don't release it.
> As a gpgpu event it can also leak a drm buffer, so to avoid
> potential errors we just release it.
> 
> Signed-off-by: Pan Xiuli 
> ---
>  src/cl_device_id.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/src/cl_device_id.c b/src/cl_device_id.c
> index 78d2cf4..a3d3fc4 100644
> --- a/src/cl_device_id.c
> +++ b/src/cl_device_id.c
> @@ -622,6 +622,7 @@ cl_self_test(cl_device_id device, cl_self_test_res 
> atomic_in_l3_flag)
>// Atomic fail need to test SLM again with atomic in L3 
> feature disabled.
>tested = 0;
>  }
> +clReleaseEvent(kernel_finished);
>}
>  }
>  clReleaseMemObject(buffer);
> -- 
> 2.1.4
> 
> ___
> Beignet mailing list
> Beignet@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet
___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


Re: [Beignet] [PATCH V2 3/3] add local copy propagation optimization for each basic block

2015-09-28 Thread Zhigang Gong
LGTM.

Thanks.

On Mon, Sep 28, 2015 at 03:23:01AM +0800, Guo Yejun wrote:
> It is done at the selection IR level; it removes MOV instructions
> and helps to reduce register pressure.
> 
> For instructions like:
> MOV(8)  %42<2>:UB :   %53<32,8,4>:UB
> ADD(8)  %43<2>:B  :   %40<16,8,2>:B   -%42<16,8,2>:B
> can be optimized as:
> ADD(8)  %43<2>:UB :   %56<32,8,4>:UB  -%53<32,8,4>:UB
> 
> v2: make propagateRegister() a static method of GenRegister
> refine replaceInfo from set to map
> encapsulate function CanBeReplaced
> 
> Signed-off-by: Guo Yejun 
> ---
>  backend/src/backend/gen_insn_selection.cpp |   4 +
>  backend/src/backend/gen_insn_selection.hpp |   2 +
>  .../src/backend/gen_insn_selection_optimize.cpp| 203 
> -
>  backend/src/backend/gen_register.hpp   |  21 +++
>  4 files changed, 225 insertions(+), 5 deletions(-)
> 
> diff --git a/backend/src/backend/gen_insn_selection.cpp 
> b/backend/src/backend/gen_insn_selection.cpp
> index 28c7ed2..038b839 100644
> --- a/backend/src/backend/gen_insn_selection.cpp
> +++ b/backend/src/backend/gen_insn_selection.cpp
> @@ -2062,6 +2062,10 @@ namespace gbe
>///
>// Code selection public implementation
>///
> +  const GenContext& Selection::getCtx()
> +  {
> +return this->opaque->ctx;
> +  }
>  
>Selection::Selection(GenContext &ctx) {
>  this->blockList = NULL;
> diff --git a/backend/src/backend/gen_insn_selection.hpp 
> b/backend/src/backend/gen_insn_selection.hpp
> index 86542b0..275eb9c 100644
> --- a/backend/src/backend/gen_insn_selection.hpp
> +++ b/backend/src/backend/gen_insn_selection.hpp
> @@ -271,6 +271,8 @@ namespace gbe
>  void optimize(void);
>  uint32_t opt_features;
>  
> +const GenContext &getCtx();
> +
>  /*! Use custom allocators */
>  GBE_CLASS(Selection);
>};
> diff --git a/backend/src/backend/gen_insn_selection_optimize.cpp 
> b/backend/src/backend/gen_insn_selection_optimize.cpp
> index c82fbe5..3f2ae2f 100644
> --- a/backend/src/backend/gen_insn_selection_optimize.cpp
> +++ b/backend/src/backend/gen_insn_selection_optimize.cpp
> @@ -12,38 +12,231 @@
>  
>  namespace gbe
>  {
> +  //helper functions
> +  static uint32_t CalculateElements(const GenRegister& reg, uint32_t 
> execWidth)
> +  {
> +uint32_t elements = 0;
> +uint32_t elementSize = typeSize(reg.type);
> +uint32_t width = GenRegister::width_size(reg);
> +assert(execWidth >= width);
> +uint32_t height = execWidth / width;
> +uint32_t vstride = GenRegister::vstride_size(reg);
> +uint32_t hstride = GenRegister::hstride_size(reg);
> +uint32_t base = reg.subnr;
> +for (uint32_t i = 0; i < height; ++i) {
> +  uint32_t offsetInByte = base;
> +  for (uint32_t j = 0; j < width; ++j) {
> +uint32_t offsetInType = offsetInByte / elementSize;
> +elements |= (1 << offsetInType);
> +offsetInByte += hstride * elementSize;
> +  }
> +  offsetInByte += vstride * elementSize;
> +}
> +return elements;
> +  }
>  
>class SelOptimizer
>{
>public:
> -SelOptimizer(uint32_t features) : features(features) {}
> +SelOptimizer(const GenContext& ctx, uint32_t features) : ctx(ctx), 
> features(features) {}
>  virtual void run() = 0;
>  virtual ~SelOptimizer() {}
>protected:
> +const GenContext &ctx;  //in case that we need it
>  uint32_t features;
>};
>  
>class SelBasicBlockOptimizer : public SelOptimizer
>{
>public:
> -SelBasicBlockOptimizer(uint32_t features, SelectionBlock &bb) : 
> SelOptimizer(features), bb(bb) {}
> +SelBasicBlockOptimizer(const GenContext& ctx,
> +   const ir::Liveness::LiveOut& liveout,
> +   uint32_t features,
> +   SelectionBlock &bb) :
> +SelOptimizer(ctx, features), bb(bb), liveout(liveout), 
> optimized(false)
> +{
> +}
>  ~SelBasicBlockOptimizer() {}
>  virtual void run();
>  
>private:
> +// local copy propagation
> +class ReplaceInfo
> +{
> +public:
> +  ReplaceInfo(SelectionInstruction& insn,
> +  const GenRegister& intermedia,
> +  const GenRegister& replacement) :
> +  insn(insn), intermedia(intermedia), 
> replacement(replacement)
> +  {
> +assert(insn.opcode == SEL_OP_MOV);
> +assert(&(insn.src(0)) == &replacement);
> +assert(&(insn.dst(0)) == &intermedia);
> +this->elements = CalculateElements(intermedia, insn.state.execWidth);
> +replacementOverwritten = false;
> +  }
> +  ~ReplaceInfo()
> +  {
> +this->toBeReplaceds.clear();
> +  }
> +
> +  SelectionInstruction& insn;
> 

Re: [Beignet] [PATCH V2] GBE: Implement liveness dump.

2015-09-24 Thread Zhigang Gong
LGTM.

On Thu, Sep 24, 2015 at 03:47:30PM +0800, Ruiling Song wrote:
> v2:
> remove old printf debug code.
> Signed-off-by: Ruiling Song 
> ---
>  backend/src/ir/liveness.cpp | 58 
> -
>  1 file changed, 20 insertions(+), 38 deletions(-)
> 
> diff --git a/backend/src/ir/liveness.cpp b/backend/src/ir/liveness.cpp
> index c5a6374..9f456a3 100644
> --- a/backend/src/ir/liveness.cpp
> +++ b/backend/src/ir/liveness.cpp
> @@ -190,25 +190,6 @@ namespace ir {
>workSet.insert(prevInfo);
>}
>  };
> -#if 0
> -fn.foreachBlock([this](const BasicBlock &bb){
> -  printf("label %d:\n", bb.getLabelIndex());
> -  BlockInfo *info = liveness[&bb];
> -  auto &outVarSet = info->liveOut;
> -  auto &inVarSet = info->upwardUsed;
> -  printf("\n\tin Lives: ");
> -  for (auto inVar : inVarSet) {
> -printf("%d ", inVar);
> -  }
> -  printf("\n");
> -  printf("\tout Lives: ");
> -  for (auto outVar : outVarSet) {
> -printf("%d ", outVar);
> -  }
> -  printf("\n");
> -
> -});
> -#endif
> }
>  /*
>As we run in SIMD mode with prediction mask to indicate active lanes.
> @@ -252,27 +233,28 @@ namespace ir {
>  }
>}
>  }
> -#if 0
> -fn.foreachBlock([this](const BasicBlock &bb){
> -  printf("label %d:\n", bb.getLabelIndex());
> -  BlockInfo *info = liveness[&bb];
> -  auto &outVarSet = info->liveOut;
> -  auto &inVarSet = info->upwardUsed;
> -  printf("\n\tLive Ins: ");
> -  for (auto inVar : inVarSet) {
> -printf("%d ", inVar);
> -  }
> -  printf("\n");
> -  printf("\tLive outs: ");
> -  for (auto outVar : outVarSet) {
> -printf("%d ", outVar);
> -  }
> -  printf("\n");
> -
> -});
> -#endif
> }
>  
> +  std::ostream &operator<< (std::ostream &out, const Liveness &live) {
> +const Function &fn = live.getFunction();
> +fn.foreachBlock([&] (const BasicBlock &bb) {
> +  out << std::endl;
> +  out << "Label $" << bb.getLabelIndex() << std::endl;
> +  const Liveness::BlockInfo &bbInfo = live.getBlockInfo(&bb);
> +  out << "liveIn:" << std::endl;
> +  for (auto &x: bbInfo.upwardUsed) {
> +out << x << " ";
> +  }
> +  out << std::endl << "liveOut:" << std::endl;
> +  for (auto &x : bbInfo.liveOut)
> +out << x << " ";
> +  out << std::endl << "varKill:" << std::endl;
> +  for (auto &x : bbInfo.varKill)
> +out << x << " ";
> +  out << std::endl;
> +});
> +return out;
> +  }
>  
>/*! To pretty print the livfeness info */
>static const uint32_t prettyInsnStrSize = 48;
> -- 
> 2.3.1
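
A rough standalone illustration of the dump style added above (hypothetical simplified types; the real code streams an ir::Liveness over all blocks):

```cpp
#include <cassert>
#include <ostream>
#include <set>
#include <sstream>

// Minimal stand-in for Liveness::BlockInfo: stream the three per-block sets
// in the same order as the patch (liveIn = upwardUsed, then liveOut, varKill).
struct BlockInfo { std::set<int> upwardUsed, liveOut, varKill; };

std::ostream &operator<<(std::ostream &out, const BlockInfo &info) {
  out << "liveIn:";
  for (int x : info.upwardUsed) out << " " << x;
  out << "\nliveOut:";
  for (int x : info.liveOut) out << " " << x;
  out << "\nvarKill:";
  for (int x : info.varKill) out << " " << x;
  return out << "\n";
}
```

Routing the debug output through operator<< lets the same dump go to a file, a string, or stderr without duplicated printf code.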
> 
> ___
> Beignet mailing list
> Beignet@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet
___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


[Beignet] [Patch v2] GBE: refine longjmp checking.

2015-09-24 Thread Zhigang Gong
v2:
simplify the logic in function.hpp. Let the caller
prepare the correct start and end points. Fix the incorrect
start/end points for the forward jump and backward jump cases.

Signed-off-by: Zhigang Gong 
---
 backend/src/backend/gen_insn_selection.cpp | 17 +++--
 backend/src/ir/function.hpp| 11 +++
 2 files changed, 26 insertions(+), 2 deletions(-)

diff --git a/backend/src/backend/gen_insn_selection.cpp 
b/backend/src/backend/gen_insn_selection.cpp
index ab00269..0380d79 100644
--- a/backend/src/backend/gen_insn_selection.cpp
+++ b/backend/src/backend/gen_insn_selection.cpp
@@ -1154,7 +1154,19 @@ namespace gbe
 SelectionInstruction *insn = this->appendInsn(SEL_OP_JMPI, 0, 1);
 insn->src(0) = src;
 insn->index = index.value();
-insn->extra.longjmp = abs(index - origin) > 800;
+ir::LabelIndex start, end;
+if (origin.value() < index.value()) {
+// Forward Jump, need to exclude the target BB. Because we
+// need to jump to the beginning of it.
+  start = origin;
+  end = ir::LabelIndex(index.value() - 1);
+} else {
+  start = index;
+  end = origin;
+}
+// FIXME, this longjmp check is too hacky. We need to support instruction
+// insertion at code emission stage in the future.
+insn->extra.longjmp = ctx.getFunction().getDistance(start, end) > 8000;
 return insn->extra.longjmp ? 2 : 1;
   }
 
@@ -5150,7 +5162,8 @@ namespace gbe
   sel.curr.execWidth = 1;
   sel.curr.noMask = 1;
   sel.curr.predicate = GEN_PREDICATE_NONE;
-  sel.block->endifOffset -= sel.JMPI(GenRegister::immd(0), jip, 
curr->getLabelIndex());
+  // Actually, the origin of this JMPI should be the beginning of next 
BB.
+  sel.block->endifOffset -= sel.JMPI(GenRegister::immd(0), jip, 
ir::LabelIndex(curr->getLabelIndex().value() + 1));
 sel.pop();
   }
 }
diff --git a/backend/src/ir/function.hpp b/backend/src/ir/function.hpp
index b5f4ba2..265fdc3 100644
--- a/backend/src/ir/function.hpp
+++ b/backend/src/ir/function.hpp
@@ -486,6 +486,17 @@ namespace ir {
 /*! Get surface starting address register from bti */
 Register getSurfaceBaseReg(uint8_t bti) const;
 void appendSurface(uint8_t bti, Register reg);
+/*! Get instruction distance between two BBs include both b0 and b1,
+and b0 must be less than b1. */
+INLINE uint32_t getDistance(LabelIndex b0, LabelIndex b1) const {
+  uint32_t insnNum = 0;
+  GBE_ASSERT(b0.value() <= b1.value());
+  for(uint32_t i = b0.value(); i <= b1.value(); i++) {
+BasicBlock &bb = getBlock(LabelIndex(i));
+insnNum += bb.size();
+  }
+  return insnNum;
+}
 /*! Output the control flow graph to .dot file */
 void outputCFG();
   private:
-- 
1.9.1

___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


Re: [Beignet] [PATCH 2/5] GBE: refine longjmp checking.

2015-09-24 Thread Zhigang Gong
On Thu, Sep 24, 2015 at 07:05:38AM +, Yang, Rong R wrote:
> One comment.
> 
> > -Original Message-
> > From: Beignet [mailto:beignet-boun...@lists.freedesktop.org] On Behalf Of
> > Zhigang Gong
> > Sent: Monday, September 14, 2015 14:20
> > To: beignet@lists.freedesktop.org
> > Cc: Gong, Zhigang
> > Subject: [Beignet] [PATCH 2/5] GBE: refine longjmp checking.
> > 
> > Signed-off-by: Zhigang Gong 
> > ---
> >  backend/src/backend/gen_insn_selection.cpp |  2 +-
> >  backend/src/ir/function.hpp| 17 +
> >  2 files changed, 18 insertions(+), 1 deletion(-)
> > 
> > diff --git a/backend/src/backend/gen_insn_selection.cpp
> > b/backend/src/backend/gen_insn_selection.cpp
> > index ab00269..57dbec9 100644
> > --- a/backend/src/backend/gen_insn_selection.cpp
> > +++ b/backend/src/backend/gen_insn_selection.cpp
> > @@ -1154,7 +1154,7 @@ namespace gbe
> >  SelectionInstruction *insn = this->appendInsn(SEL_OP_JMPI, 0, 1);
> >  insn->src(0) = src;
> >  insn->index = index.value();
> > -insn->extra.longjmp = abs(index - origin) > 800;
> > +insn->extra.longjmp = ctx.getFunction().getDistance(origin, index)
> > + > 8000;
> >  return insn->extra.longjmp ? 2 : 1;
> >}
> > 
> > diff --git a/backend/src/ir/function.hpp b/backend/src/ir/function.hpp index
> > b5f4ba2..b924332 100644
> > --- a/backend/src/ir/function.hpp
> > +++ b/backend/src/ir/function.hpp
> > @@ -487,6 +487,23 @@ namespace ir {
> >  Register getSurfaceBaseReg(uint8_t bti) const;
> >  void appendSurface(uint8_t bti, Register reg);
> >  /*! Output the control flow graph to .dot file */
> > +/*! Get instruction distance between two BBs */
> > +INLINE uint32_t getDistance(LabelIndex b0, LabelIndex b1) const {
> > +  int start, end;
> > +  if (b0.value() < b1.value()) {
> > +start = b0.value();
> > +end = b1.value() - 1;
> > +  } else {
> > +start = b1.value();
> > +end = b0.value() - 1;
> > +  }
> > +  uint32_t insnNum = 0;
> > +  for(int i = start; i <= end; i++) {
> > +BasicBlock &bb = getBlock(LabelIndex(i));
> > +insnNum += bb.size();
> > +  }
If it is a forward jump, we need not include the start and end blocks' sizes.
Nice catch, there are more cases than I thought:

Two forward jump cases:

  JMPI created by label instruction:

  b0:
(  85)  cmp.le(16)   null:UW g1<8,8,1>:UW0x1UW   { 
align1 WE_all 1H switch };
(  87)  (-f0.any16h) jmpi(1) offset_to_b1   
{ align1 WE_all };
(  89)  (+f0) if(16) 45 
{ align1 WE_normal 1H };
...


  b1:
 ...

  JMPI created by forward JMPI.

  b0:
...
jmpi(1) offset_to_b1   { align1 WE_all 
};

  b1:
   ...


One backward JMPI only has one case:

b1:


b0:
   ...
   (   11172)  endif(16) 2 null
{ align1 WE_normal 1H };
   (   11174)  (+f0.any16h) jmpi(1) offset_to_b0
  { align1 WE_all };

It jumps from b0's end to b1's beginning, so we need to accumulate the
instructions from b1 through b0.
I did make another mistake in the backward jump case: we should not exclude b0.
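
The rule can be sketched standalone (a hypothetical simplified form of Function::getDistance, taking per-block instruction counts directly):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// For a forward jump, exclude the target block: we jump to its beginning.
// For a backward jump, count both endpoint blocks, since control runs from
// the end of the origin block back to the start of the target block.
uint32_t jumpDistance(const std::vector<uint32_t> &bbInsnCounts,
                      uint32_t origin, uint32_t target) {
  uint32_t start, end;
  if (origin < target) {        // forward jump
    start = origin;
    end = target - 1;
  } else {                      // backward jump
    start = target;
    end = origin;
  }
  uint32_t n = 0;
  for (uint32_t i = start; i <= end; ++i)
    n += bbInsnCounts[i];
  return n;
}
```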

I will send a new version later.

BTW, using this solution to determine a long jump is still a hacky workaround.
I hope someone can refine the backend to support inserting an instruction
easily at the code emission stage.

Thanks,
Zhigang Gong.
 
> 
> > +  return insnNum;
> > +}
> >  void outputCFG();
> >private:
> >  friend class Context;   //!< Can freely modify a function
> > --
> > 1.9.1
> > 
> > ___
> > Beignet mailing list
> > Beignet@lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/beignet


Re: [Beignet] [PATCH V2 3/3] add local copy propagation optimization for each basic block

2015-09-24 Thread Zhigang Gong
On Thu, Sep 24, 2015 at 06:51:12AM +, Guo, Yejun wrote:
> 
> 
> -Original Message-
> From: Zhigang Gong [mailto:zhigang.g...@linux.intel.com] 
> Sent: Thursday, September 24, 2015 1:12 PM
> To: Guo, Yejun
> Cc: beignet@lists.freedesktop.org
> Subject: Re: [Beignet] [PATCH V2 3/3] add local copy propagation optimization 
> for each basic block
> 
> On Thu, Sep 24, 2015 at 06:05:31AM +, Guo, Yejun wrote:
> > 
> > 
> > -Original Message-
> > From: Zhigang Gong [mailto:zhigang.g...@linux.intel.com]
> > Sent: Thursday, September 24, 2015 12:32 PM
> > To: Guo, Yejun
> > Cc: beignet@lists.freedesktop.org
> > Subject: Re: [Beignet] [PATCH V2 3/3] add local copy propagation 
> > optimization for each basic block
> > 
> > On Thu, Sep 24, 2015 at 02:58:22AM +, Guo, Yejun wrote:
> > > 
> > > > +
> > > > +  void SelBasicBlockOptimizer::changeInsideReplaceInfos(const
> > > > + SelectionInstruction& insn, GenRegister& var)  {
> > > > +for (ReplaceInfo* info : replaceInfos) {
> > > > +  if (info->intermedia.reg() == var.reg()) {
> > A new comment here is that the above loop is a little bit too heavy.
> > For a large BB which has many MOV instructions, it will iterate too many 
> > times for the instructions after those MOV instructions. A better way is to 
> > change replaceInfos to a new map type:
> > map >
> > 
> > This will be much faster than iterating over all infos for each instruction.
> > 
> > [Yejun] nice, I'll change to map.
> > 
> > > > +bool replaceable = false;
> > > It's better to add a comment here to describe the cases which can't be 
> > > replaced and why.
> > > 
> > [yejun] actually, I think the code itself explains something; it is too 
> > complex to explain every detail in words. I consider it a 
> > nice-to-have since the basic logic is simple.
> > I still think it is not that simple. A case just came to my mind where we 
> > can't do replacement:
> > 
> > MOV %r0, %r1
> > ADD %r1, %r1, 1
> > ...
> > ADD %r10, %r0, 1
> > ...
> > 
> > I'm afraid that your current code can't deal with it correctly, right?
> > 
> > [yejun] current code does nothing for these instructions. It also relates 
> > to constants; maybe add another optimization pass to keep every pass clear 
> > and simple. Or we can consider the current code as a base, and extend it 
> > when we find such optimization is necessary.
> If my understanding is correct, the current code will replace %r0 with %r1 in 
> the second ADD instruction, which breaks the code. This is not an optimization 
> opportunity but a bug, and it must be fixed. The root cause is that %r1 has 
> been modified after copying its value to another register; after that 
> modification we should not propagate it to the destination register. For 
> example, for all instructions after ADD %r1, %r1, 1, we cannot do the 
> 's/r0/r1'.
> 
> [Yejun] the current code does handle such a case.
> First, the terms:
> r0 = r1
> r3 = r0 + r2
> In the first MOV IR, r0 is named the intermedia, and r1 is named the 
> replacement.
> In the second IR, r0 is collected into toBeReplacements.
> 
> For the following IR:
> 1) MOV %r0, %r1
> 2) MOV %r2, %r0
> 3) ADD %r1, %r1, 1
>  ...
> 4) ADD %r10, %r0, 1
> 
> When the 3) IR is scanned, in SelBasicBlockOptimizer::removeFromReplaceInfos, 
> replacementChanged is recorded.
> 
> When the 4) IR is scanned, in 
> SelBasicBlockOptimizer::changeInsideReplaceInfos, we'll remove the info with 
> the following code because replacementChanged makes replaceable false.
> if (!replaceable) {
>   replaceInfos.erase(info);
>   delete info;
> }
> 
> If there is no 4) IR with %r0 as source, at the end of scan, in 
> SelBasicBlockOptimizer::cleanReplaceInfos, 1) and 2) IR will be optimized.

Oops, I didn't see that you remove registers when they become destination registers.
Then this logic is correct here.

Looking into the code again, I found the following check interesting and not 
simple at all :)

+if (insn.opcode != SEL_OP_BSWAP && !insn.isWrite()) {

1. Why is BSWAP special here? The root cause may be that the src of bswap will 
be modified.
   But shouldn't this be handled in the BSWAP instruction already (I mean, both 
src and
   dst of BSWAP should be put into the src array and dst array)?
   Unfortunately, BSWAP seems to have a bug here and may lead to incorrect 
scheduling later.
   That

Re: [Beignet] [PATCH V2 3/3] add local copy propagation optimization for each basic block

2015-09-23 Thread Zhigang Gong
On Thu, Sep 24, 2015 at 06:05:31AM +, Guo, Yejun wrote:
> 
> 
> -Original Message-
> From: Zhigang Gong [mailto:zhigang.g...@linux.intel.com] 
> Sent: Thursday, September 24, 2015 12:32 PM
> To: Guo, Yejun
> Cc: beignet@lists.freedesktop.org
> Subject: Re: [Beignet] [PATCH V2 3/3] add local copy propagation optimization 
> for each basic block
> 
> On Thu, Sep 24, 2015 at 02:58:22AM +, Guo, Yejun wrote:
> > 
> > > +
> > > +  void SelBasicBlockOptimizer::changeInsideReplaceInfos(const
> > > + SelectionInstruction& insn, GenRegister& var)  {
> > > +for (ReplaceInfo* info : replaceInfos) {
> > > +  if (info->intermedia.reg() == var.reg()) {
> A new comment here is that the above loop is a little bit too heavy.
> For a large BB which has many MOV instructions, it will iterate too many 
> times for the instructions after those MOV instructions. A better way is to 
> change replaceInfos to a new map type:
> map >
> 
> This will be much faster than iterating over all infos for each instruction.
> 
> [Yejun] nice, I'll change to map.
> 
> > > +bool replaceable = false;
> > It's better to add a comment here to describe the cases which can't be 
> > replaced and why.
> > 
> [yejun] actually, I think the code itself explains something; it is too 
> complex to explain every detail in words. I consider it a 
> nice-to-have since the basic logic is simple.
> I still think it is not that simple. A case just came to my mind where we 
> can't do replacement:
> 
> MOV %r0, %r1
> ADD %r1, %r1, 1
> ...
> ADD %r10, %r0, 1
> ...
> 
> I'm afraid that your current code can't deal with it correctly, right?
> 
> [yejun] current code does nothing for these instructions. It also relates 
> to constants; maybe add another optimization pass to keep every pass clear and 
> simple. Or we can consider the current code as a base, and extend it when 
> we find such optimization is necessary.
If my understanding is correct, the current code will replace %r0 with %r1 in
the second ADD instruction, which breaks the code. This is not an optimization
opportunity but a bug, and it must be fixed. The root cause is that %r1 has been
modified after copying its value to another register; after that modification we
should not propagate it to the destination register. For example,
for all instructions after
ADD %r1, %r1, 1,
we cannot do the 's/r0/r1'.

> 
> > 
> > > +uint32_t elements = CalculateElements(var, 
> > > insn.state.execWidth);
> > > +if (info->elements == elements) {
> > Why not compare each attribute value, such as 
> > vstride_size, hstride_size, ...?
> > 
> > [Yejun] the execWidth is not recorded in the GenRegister, and I once saw 
> > instructions such as:
> > mov(1)  dst (vstride_size or hstride_size is 1/2/... or something 
> > else), src
> > mov(16) anotherdst, dst (with stride as 0)
> > 
> > the dst here is the same, but the strides are different; to handle this 
> > case, I added the function CalculateElements.
> The example you gave here has two different GenRegisters, as GenRegister has 
> vstride and hstride members. Right? My previous comment suggests comparing the 
> two GenRegisters' attributes, and I can't understand your point here.
> 
> [Yejun] Let's use the following SEL IR as an example; %42 is the same 
> register but with two different strides in the two IRs. 
> MOV(1)  %42<2>:D  :   %41<0,1,0>:D
> MOV(16)   %43<1>:D:   %42<0,1,0>:D
This doesn't address my comment, which suggests comparing the GenRegisters'
attributes, not the virtual register. You can easily see that the above two
GenRegisters are different.

> 
> 
> Thanks,
> Zhigang Gong.
> 
> > 
> > 
> > 
> > 
> > The other parts LGTM,
> > 
> > Thanks,
> > Zhigang Gong


Re: [Beignet] [PATCH] GBE: Implement liveness dump.

2015-09-23 Thread Zhigang Gong
LGTM, and you can remove that useless printf code now.

On Thu, Sep 24, 2015 at 10:05:07AM +0800, Ruiling Song wrote:
> Signed-off-by: Ruiling Song 
> ---
>  backend/src/ir/liveness.cpp | 20 
>  1 file changed, 20 insertions(+)
> 
> diff --git a/backend/src/ir/liveness.cpp b/backend/src/ir/liveness.cpp
> index c5a6374..62a95ea 100644
> --- a/backend/src/ir/liveness.cpp
> +++ b/backend/src/ir/liveness.cpp
> @@ -273,6 +273,26 @@ namespace ir {
>  #endif
> }
>  
> +  std::ostream &operator<< (std::ostream &out, const Liveness &live) {
> +const Function &fn = live.getFunction();
> +fn.foreachBlock([&] (const BasicBlock &bb) {
> +  out << std::endl;
> +  out << "Label $" << bb.getLabelIndex() << std::endl;
> +  const Liveness::BlockInfo &bbInfo = live.getBlockInfo(&bb);
> +  out << "liveIn:" << std::endl;
> +  for (auto &x: bbInfo.upwardUsed) {
> +out << x << " ";
> +  }
> +  out << std::endl << "liveOut:" << std::endl;
> +  for (auto &x : bbInfo.liveOut)
> +out << x << " ";
> +  out << std::endl << "varKill:" << std::endl;
> +  for (auto &x : bbInfo.varKill)
> +out << x << " ";
> +  out << std::endl;
> +});
> +return out;
> +  }
>  
>/*! To pretty print the livfeness info */
>static const uint32_t prettyInsnStrSize = 48;
> -- 
> 2.3.1
> 


Re: [Beignet] [PATCH V2 3/3] add local copy propagation optimization for each basic block

2015-09-23 Thread Zhigang Gong
On Thu, Sep 24, 2015 at 02:58:22AM +, Guo, Yejun wrote:
> 
> 
> -Original Message-
> From: Zhigang Gong [mailto:zhigang.g...@linux.intel.com] 
> Sent: Thursday, September 24, 2015 9:14 AM
> To: Guo, Yejun
> Cc: beignet@lists.freedesktop.org
> Subject: Re: [Beignet] [PATCH V2 3/3] add local copy propagation optimization 
> for each basic block
> 
>   
> > +  void SelBasicBlockOptimizer::propagateRegister(GenRegister& dst, 
> > + const GenRegister& src)
> This function looks much like a GenRegister static method, because it doesn't 
> reference any external information.
> 
> [yejun] Yes, it is. I wrote it here because I'm not sure whether there will be 
> other changes due to regression issues or extended usage. I'm OK to change it 
> to a GenRegister static method.
> > +  {
> > +dst.type = src.type;
> > +dst.file = src.file;
> > +dst.physical = src.physical;
> > +dst.subphysical = src.subphysical;
> > +dst.value.reg = src.value.reg;
> > +dst.vstride = src.vstride;
> > +dst.width = src.width;
> > +dst.hstride = src.hstride;
> > +dst.quarter = src.quarter;
> > +dst.nr = src.nr;
> > +dst.subnr = src.subnr;
> > +dst.address_mode = src.address_mode;
> > +dst.a0_subnr = src.a0_subnr;
> > +dst.addr_imm = src.addr_imm;
> > +
> > +dst.negation = dst.negation ^ src.negation;
> > +dst.absolute = dst.absolute | src.absolute;  }
> > +
> 
> 
> 
> > +
> > +  void SelBasicBlockOptimizer::changeInsideReplaceInfos(const 
> > + SelectionInstruction& insn, GenRegister& var)  {
> > +for (ReplaceInfo* info : replaceInfos) {
> > +  if (info->intermedia.reg() == var.reg()) {
A new comment here is that the above loop is a little bit too heavy.
For a large BB which has many MOV instructions, it will iterate too
many times for the instructions after those MOV instructions. A better
way is to change replaceInfos to a new map type:
map >

This will be much faster than iterating over all infos for each instruction.

> > +bool replaceable = false;
> It's better to add a comment here to describe the cases which can't be 
> replaced and why.
> 
> [yejun] actually, I think the code itself explains something; it is too 
> complex to explain every detail in words. I consider it a 
> nice-to-have since the basic logic is simple.
I still think it is not that simple. A case just came to my mind where we 
can't do replacement:

MOV %r0, %r1
ADD %r1, %r1, 1
...
ADD %r10, %r0, 1
...

I'm afraid that your current code can't deal with it correctly, right?

> > +if (insn.opcode != SEL_OP_BSWAP && !insn.isWrite()) {
> > +  if (!info->replacementChanged && info->intermedia.type == 
> > var.type && info->intermedia.quarter == var.quarter && 
> > info->intermedia.subnr == var.subnr) {
> 
> 
> 
> 
> > +uint32_t elements = CalculateElements(var, 
> > insn.state.execWidth);
> > +if (info->elements == elements) {
> Why not compare each attribute value, such as vstride_size, hstride_size, ...?
> 
> [Yejun] the execWidth is not recorded in the GenRegister, and I once saw 
> instructions such as:
> mov(1)  dst (vstride_size or hstride_size is 1/2/... or something else), src
> mov(16) anotherdst, dst (with stride as 0)
> 
> the dst here is the same, but the strides are different; to handle this case, 
> I added the function CalculateElements.
The example you gave here has two different GenRegisters, as GenRegister
has vstride and hstride members. Right? My previous comment suggests
comparing the two GenRegisters' attributes, and I can't understand your point here.


Thanks,
Zhigang Gong.

> 
> 
> 
> 
> The other parts LGTM,
> 
> Thanks,
> Zhigang Gong
> > +  info->toBeReplacements.insert(&var);
> > +  replaceable = true;
> > +}
> > +  }
> > +}
> > +
> > +//if we found the same register, but could not be replaced for 
> > some reason,
> > +//that means we could not remove MOV instruction, and so no 
> > replacement,
> > +//so we'll remove the info for this case.
> > +if (!replaceable) {
> > +  replaceInfos.erase(info);
> > +  delete info;
> > +}
> > +break;
> > +  }
> > +}
> > +  }


[Beignet] [Patch v2] GBE: Don't try to remove instructions when liveness is in dynamic update phase.

2015-09-23 Thread Zhigang Gong
As we want to avoid updating liveness all the time, we maintain the liveness
information dynamically during the phi mov optimization. Removing instructions
(self-copies) brings unnecessary complexity here. Let's avoid doing that here,
and do the self-copy removal later in removeMOVs().

v2:
Forgot to remove the incorrect liveness checks for special registers.
Now remove them.

Signed-off-by: Zhigang Gong 
---
 backend/src/llvm/llvm_gen_backend.cpp | 21 +++--
 1 file changed, 7 insertions(+), 14 deletions(-)

diff --git a/backend/src/llvm/llvm_gen_backend.cpp 
b/backend/src/llvm/llvm_gen_backend.cpp
index b0b97e7..dc2e3e8 100644
--- a/backend/src/llvm/llvm_gen_backend.cpp
+++ b/backend/src/llvm/llvm_gen_backend.cpp
@@ -2149,6 +2149,11 @@ namespace gbe
 // destinations)
 uint32_t insnID = 2;
 bb.foreach([&](ir::Instruction &insn) {
+  if (insn.getOpcode() == ir::OP_MOV &&
+  insn.getDst(0) == insn.getSrc(0)) {
+insn.remove();
+return;
+  }
   const uint32_t dstNum = insn.getDstNum();
   const uint32_t srcNum = insn.getSrcNum();
   for (uint32_t srcID = 0; srcID < srcNum; ++srcID) {
@@ -2245,8 +2250,7 @@ namespace gbe
   ++iter;
 }
 if (!phiPhiCopySrcInterfere) {
-  // phiCopy source can be coaleased with phiCopy
-  const_cast(phiCopyDefInsn)->remove();
+  replaceSrc(const_cast(phiCopyDefInsn), 
phiCopySrc, phiCopy);
 
   for (auto &s : *phiCopySrcDef) {
 const Instruction *phiSrcDefInsn = s->getInstruction();
@@ -2300,7 +2304,7 @@ namespace gbe
   // coalease phi and phiCopy
   if (isOpt) {
 for (auto &x : *phiDef) {
-  const_cast(x->getInstruction())->remove();
+  replaceDst(const_cast(x->getInstruction()), phi, 
phiCopy);
 }
 for (auto &x : *phiUse) {
   const Instruction *phiUseInsn = x->getInstruction();
@@ -2361,21 +2365,11 @@ namespace gbe
   const ir::UseSet *phiCopySrcUse = dag->getRegUse(phiCopySrc);
   for (auto &s : *phiCopySrcDef) {
 const Instruction *phiSrcDefInsn = s->getInstruction();
-if (phiSrcDefInsn->getOpcode() == ir::OP_MOV &&
-phiSrcDefInsn->getSrc(0) == phiCopy) {
-   const_cast(phiSrcDefInsn)->remove();
-   continue;
-}
 replaceDst(const_cast(phiSrcDefInsn), phiCopySrc, 
phiCopy);
   }
 
   for (auto &s : *phiCopySrcUse) {
 const Instruction *phiSrcUseInsn = s->getInstruction();
-if (phiSrcUseInsn->getOpcode() == ir::OP_MOV &&
-phiSrcUseInsn->getDst(0) == phiCopy) {
-   const_cast(phiSrcUseInsn)->remove();
-   continue;
-}
 replaceSrc(const_cast(phiSrcUseInsn), phiCopySrc, 
phiCopy);
   }
 
@@ -2405,7 +2399,6 @@ namespace gbe
   } else
 break;
 
-  break;
   nextRedundant->clear();
   replacedRegs.clear();
   revReplacedRegs.clear();
-- 
1.9.1



Re: [Beignet] [PATCH 7/9] GBE: Don't try to remove instructions when liveness is in dynamic update phase.

2015-09-23 Thread Zhigang Gong
Forgot to remove the incorrect liveness check in value.cpp. Ignore this patch; I
will send a new version soon.

On Thu, Sep 24, 2015 at 08:47:31AM +0800, Zhigang Gong wrote:
> As we want to avoid updating liveness all the time, we maintain the liveness
> information dynamically during the phi mov optimization. Removing instructions
> (self-copies) brings unnecessary complexity here. Let's avoid doing that here,
> and do the self-copy removal later in removeMOVs().
> 
> Signed-off-by: Zhigang Gong 
> ---
>  backend/src/ir/value.cpp  |  6 +++---
>  backend/src/llvm/llvm_gen_backend.cpp | 21 +++--
>  2 files changed, 10 insertions(+), 17 deletions(-)
> 
> diff --git a/backend/src/ir/value.cpp b/backend/src/ir/value.cpp
> index d2f0c2e..b0ed9c2 100644
> --- a/backend/src/ir/value.cpp
> +++ b/backend/src/ir/value.cpp
> @@ -190,7 +190,7 @@ namespace ir {
>// Do not transfer dead values
>if (info.inLiveOut(reg) == false) continue;
>// If we overwrite it, do not transfer the initial value
> -  if (info.inVarKill(reg) == true) continue;
> +  if ((info.inVarKill(reg) == true) && (info.inUpwardUsed(reg))) 
> continue;
>ValueDef *def = const_cast(this->dag.getDefAddress(&arg));
>auto it = blockDefMap->find(reg);
>GBE_ASSERT(it != blockDefMap->end());
> @@ -205,7 +205,7 @@ namespace ir {
>// Do not transfer dead values
>if (info.inLiveOut(reg) == false) continue;
>// If we overwrite it, do not transfer the initial value
> -  if (info.inVarKill(reg) == true) continue;
> +  if ((info.inVarKill(reg) == true) && (info.inUpwardUsed(reg))) 
> continue;
>ValueDef *def = const_cast(this->dag.getDefAddress(reg));
>auto it = blockDefMap->find(reg);
>GBE_ASSERT(it != blockDefMap->end());
> @@ -219,7 +219,7 @@ namespace ir {
>// Do not transfer dead values
>if (info.inLiveOut(reg) == false) continue;
>// If we overwrite it, do not transfer the initial value
> -  if (info.inVarKill(reg) == true) continue;
> +  if ((info.inVarKill(reg) == true) && (info.inUpwardUsed(reg))) 
> continue;
>ValueDef *def = 
> const_cast(this->dag.getDefAddress(&pushed.second));
>auto it = blockDefMap->find(reg);
>GBE_ASSERT(it != blockDefMap->end());
> diff --git a/backend/src/llvm/llvm_gen_backend.cpp 
> b/backend/src/llvm/llvm_gen_backend.cpp
> index b0b97e7..dc2e3e8 100644
> --- a/backend/src/llvm/llvm_gen_backend.cpp
> +++ b/backend/src/llvm/llvm_gen_backend.cpp
> @@ -2149,6 +2149,11 @@ namespace gbe
>  // destinations)
>  uint32_t insnID = 2;
>  bb.foreach([&](ir::Instruction &insn) {
> +  if (insn.getOpcode() == ir::OP_MOV &&
> +  insn.getDst(0) == insn.getSrc(0)) {
> +insn.remove();
> +return;
> +  }
>const uint32_t dstNum = insn.getDstNum();
>const uint32_t srcNum = insn.getSrcNum();
>for (uint32_t srcID = 0; srcID < srcNum; ++srcID) {
> @@ -2245,8 +2250,7 @@ namespace gbe
>++iter;
>  }
>  if (!phiPhiCopySrcInterfere) {
> -  // phiCopy source can be coaleased with phiCopy
> -  const_cast(phiCopyDefInsn)->remove();
> +  replaceSrc(const_cast(phiCopyDefInsn), 
> phiCopySrc, phiCopy);
>  
>for (auto &s : *phiCopySrcDef) {
>  const Instruction *phiSrcDefInsn = s->getInstruction();
> @@ -2300,7 +2304,7 @@ namespace gbe
>// coalease phi and phiCopy
>if (isOpt) {
>  for (auto &x : *phiDef) {
> -  const_cast(x->getInstruction())->remove();
> +  replaceDst(const_cast(x->getInstruction()), phi, 
> phiCopy);
>  }
>  for (auto &x : *phiUse) {
>const Instruction *phiUseInsn = x->getInstruction();
> @@ -2361,21 +2365,11 @@ namespace gbe
>const ir::UseSet *phiCopySrcUse = dag->getRegUse(phiCopySrc);
>for (auto &s : *phiCopySrcDef) {
>  const Instruction *phiSrcDefInsn = s->getInstruction();
> -if (phiSrcDefInsn->getOpcode() == ir::OP_MOV &&
> -phiSrcDefInsn->getSrc(0) == phiCopy) {
> -   const_cast(phiSrcDefInsn)->remove();
> -   continue;
> -}
>  replaceDst(const_cast(phiSrcDefInsn), phiCopySrc, 
> phiCopy);
>}
>  
>for (auto &s : *phiCopySrcUse) {
>  const Instruction *phiSrcUseInsn = s->getInstruction();
> -if (p

Re: [Beignet] [PATCH V2 3/3] add local copy propagation optimization for each basic block

2015-09-23 Thread Zhigang Gong
cement);
> +assert(&(insn.dst(0)) == &intermedia);
> +this->elements = CalculateElements(intermedia, execWidth);
> +replacementChanged = false;
> +  }
> +  ~ReplaceInfo()
> +  {
> +this->toBeReplacements.clear();
> +  }
> +
> +  SelectionInstruction& insn;
> +  const GenRegister& intermedia;
> +  uint32_t elements;
> +  const GenRegister& replacement;
> +  set toBeReplacements;
> +  bool replacementChanged;
> +  GBE_CLASS(ReplaceInfo);
> +};
> +set replaceInfos;
> +void doLocalCopyPropagation();
> +void addToReplaceInfos(SelectionInstruction& insn);
> +void changeInsideReplaceInfos(const SelectionInstruction& insn, 
> GenRegister& var);
> +void removeFromReplaceInfos(const GenRegister& var);
> +void doReplacement(ReplaceInfo* info);
> +void cleanReplaceInfos();
> +static void propagateRegister(GenRegister& dst, const GenRegister& src);
> +
>  SelectionBlock &bb;
> -static const size_t MaxTries = 1;   //the times for optimization
> +const ir::Liveness::LiveOut& liveout;
> +bool optimized;
> +static const size_t MaxTries = 1;   //the max times of optimization try
>};
>  
> +  void SelBasicBlockOptimizer::propagateRegister(GenRegister& dst, const 
> GenRegister& src)
This function looks much like a GenRegister static method, because it doesn't 
reference any
external information.
> +  {
> +dst.type = src.type;
> +dst.file = src.file;
> +dst.physical = src.physical;
> +dst.subphysical = src.subphysical;
> +dst.value.reg = src.value.reg;
> +dst.vstride = src.vstride;
> +dst.width = src.width;
> +dst.hstride = src.hstride;
> +dst.quarter = src.quarter;
> +dst.nr = src.nr;
> +dst.subnr = src.subnr;
> +dst.address_mode = src.address_mode;
> +dst.a0_subnr = src.a0_subnr;
> +dst.addr_imm = src.addr_imm;
> +
> +dst.negation = dst.negation ^ src.negation;
> +dst.absolute = dst.absolute | src.absolute;
> +  }
> +
> +  void SelBasicBlockOptimizer::doReplacement(ReplaceInfo* info)
> +  {
> +for (GenRegister* reg : info->toBeReplacements) {
> +  propagateRegister(*reg, info->replacement);
> +}
> +bb.insnList.erase(&(info->insn));
> +  }
> +
> +  void SelBasicBlockOptimizer::cleanReplaceInfos()
> +  {
> +for (ReplaceInfo* info : replaceInfos) {
> +  doReplacement(info);
> +  delete info;
> +}
> +replaceInfos.clear();
> +  }
> +
> +  void SelBasicBlockOptimizer::removeFromReplaceInfos(const GenRegister& var)
> +  {
> +for (set::iterator pos = replaceInfos.begin(); pos != 
> replaceInfos.end(); ) {
> +  ReplaceInfo* info = *pos;
> +  if (info->intermedia.reg() == var.reg()) {
> +if (info->intermedia.quarter == var.quarter && 
> info->intermedia.subnr == var.subnr)
> +  doReplacement(info);
> +replaceInfos.erase(pos++);
> +delete info;
> +  }
> +  else if (info->replacement.reg() == var.reg()) {
> +info->replacementChanged = true;
> +++pos;
> +  }
> +  else
> +++pos;
> +}
> +  }
> +
> +  void SelBasicBlockOptimizer::addToReplaceInfos(SelectionInstruction& insn)
> +  {
> +assert(insn.opcode == SEL_OP_MOV);
> +const GenRegister& src = insn.src(0);
> +const GenRegister& dst = insn.dst(0);
> +if (src.type != dst.type || src.file != dst.file)
> +  return;
> +
> +if (liveout.find(dst.reg()) != liveout.end())
> +  return;
> +
> +ReplaceInfo* info = new ReplaceInfo(insn, dst, src, 
> insn.state.execWidth);
> +replaceInfos.insert(info);
> +  }
> +
> +  void SelBasicBlockOptimizer::changeInsideReplaceInfos(const 
> SelectionInstruction& insn, GenRegister& var)
> +  {
> +for (ReplaceInfo* info : replaceInfos) {
> +  if (info->intermedia.reg() == var.reg()) {
> +bool replaceable = false;
It's better to add a comment here to describe the cases which can't be replaced 
and why.

> +if (insn.opcode != SEL_OP_BSWAP && !insn.isWrite()) {
> +  if (!info->replacementChanged && info->intermedia.type == var.type 
> && info->intermedia.quarter == var.quarter && info->intermedia.subnr == 
> var.subnr) {
> +uint32_t elements = CalculateElements(var, insn.state.execWidth);
> +if (info->elements == elements) {
Why not compare each attribute value, such as vstride_size, hstride_size, ...?

The other p

[Beignet] [PATCH 2/9] GBE: refine liveness analysis.

2015-09-23 Thread Zhigang Gong
Only in the Gen backend stage do we need to take care of the
special extra liveout and uniform analysis. In the IR stage,
we don't need to handle them.

Signed-off-by: Zhigang Gong 
---
 backend/src/backend/context.cpp |  2 +-
 backend/src/ir/liveness.cpp | 17 ++---
 backend/src/ir/liveness.hpp |  2 +-
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/backend/src/backend/context.cpp b/backend/src/backend/context.cpp
index 33b2409..81b284d 100644
--- a/backend/src/backend/context.cpp
+++ b/backend/src/backend/context.cpp
@@ -322,7 +322,7 @@ namespace gbe
 unit(unit), fn(*unit.getFunction(name)), name(name), liveness(NULL), 
dag(NULL), useDWLabel(false)
   {
 GBE_ASSERT(unit.getPointerSize() == ir::POINTER_32_BITS);
-this->liveness = GBE_NEW(ir::Liveness, const_cast(fn));
+this->liveness = GBE_NEW(ir::Liveness, const_cast(fn), 
true);
 this->dag = GBE_NEW(ir::FunctionDAG, *this->liveness);
 // r0 (GEN_REG_SIZE) is always set by the HW and used at the end by EOT
 this->registerAllocator = NULL; //GBE_NEW(RegisterAllocator, GEN_REG_SIZE, 
4*KB - GEN_REG_SIZE);
diff --git a/backend/src/ir/liveness.cpp b/backend/src/ir/liveness.cpp
index 9fa7ac3..e2240c0 100644
--- a/backend/src/ir/liveness.cpp
+++ b/backend/src/ir/liveness.cpp
@@ -27,7 +27,7 @@
 namespace gbe {
 namespace ir {
 
-  Liveness::Liveness(Function &fn) : fn(fn) {
+  Liveness::Liveness(Function &fn, bool isInGenBackend) : fn(fn) {
 // Initialize UEVar and VarKill for each block
 fn.foreachBlock([this](const BasicBlock &bb) {
   this->initBlock(bb);
@@ -48,12 +48,15 @@ namespace ir {
 }
 // extend register (def in loop, use out-of-loop) liveness to the whole 
loop
 set extentRegs;
-this->computeExtraLiveInOut(extentRegs);
-// analyze uniform values. The extentRegs contains all the values which is
-// defined in a loop and use out-of-loop which could not be a uniform. The 
reason
-// is that when it reenter the second time, it may active different lanes. 
So
-// reenter many times may cause it has different values in different lanes.
-this->analyzeUniform(&extentRegs);
+// Only in Gen backend we need to take care of extra live out analysis.
+if (isInGenBackend) {
+  this->computeExtraLiveInOut(extentRegs);
+  // analyze uniform values. The extentRegs contains all the values which 
is
+  // defined in a loop and use out-of-loop which could not be a uniform. 
The reason
+  // is that when it reenter the second time, it may active different 
lanes. So
+  // reenter many times may cause it has different values in different 
lanes.
+  this->analyzeUniform(&extentRegs);
+}
   }
 
   Liveness::~Liveness(void) {
diff --git a/backend/src/ir/liveness.hpp b/backend/src/ir/liveness.hpp
index 4a7dc4e..d9fa2ed 100644
--- a/backend/src/ir/liveness.hpp
+++ b/backend/src/ir/liveness.hpp
@@ -48,7 +48,7 @@ namespace ir {
   class Liveness : public NonCopyable
   {
   public:
-Liveness(Function &fn);
+Liveness(Function &fn, bool isInGenBackend = false);
 ~Liveness(void);
 /*! Set of variables used upwards in the block (before a definition) */
 typedef set UEVar;
-- 
1.9.1



[Beignet] [PATCH 5/9] GBE: implement further phi mov optimization based on intra-BB interefering analysis.

2015-09-23 Thread Zhigang Gong
The previous phi mov optimization tries to reduce the phi copy source register
and the phi copy register when the phi copy source register is a normal SSA value.

But in some cases, many phi copy source registers are also phi copy values which
have multiple definitions. They could all be reduced to one phi copy register
if there is no interference in any BB. This patch, together with the previous
patches, reduces the number of spilled registers from 200+ to only 70 for a
SGEMM kernel, and the performance boosts about 10 times.

v2:
Add one FIXME tag to indicate one more optimization opportunity we missed in the
current implementation. Could be solved in the future.

v3:
Disable the postPhi mov optimization for now, as there is a liveness bug
that needs to be fixed.

Signed-off-by: Zhigang Gong 
---
 backend/src/llvm/llvm_gen_backend.cpp | 136 --
 1 file changed, 130 insertions(+), 6 deletions(-)

diff --git a/backend/src/llvm/llvm_gen_backend.cpp 
b/backend/src/llvm/llvm_gen_backend.cpp
index 38c63ce..b0b97e7 100644
--- a/backend/src/llvm/llvm_gen_backend.cpp
+++ b/backend/src/llvm/llvm_gen_backend.cpp
@@ -629,7 +629,15 @@ namespace gbe
 /*! Will try to remove MOVs due to PHI resolution */
 void removeMOVs(const ir::Liveness &liveness, ir::Function &fn);
 /*! Optimize phi move based on liveness information */
-void optimizePhiCopy(ir::Liveness &liveness, ir::Function &fn);
+void optimizePhiCopy(ir::Liveness &liveness, ir::Function &fn,
+ map  &replaceMap,
+ map  
&redundantPhiCopyMap);
+/*! further optimization after phi copy optimization.
+ *  Global liveness interefering checking based redundant phy value
+ *  elimination. */
+void postPhiCopyOptimization(ir::Liveness &liveness, ir::Function &fn,
+ map  &replaceMap,
+ map  
&redundantPhiCopyMap);
 /*! Will try to remove redundants LOADI in basic blocks */
 void removeLOADIs(const ir::Liveness &liveness, ir::Function &fn);
 /*! To avoid lost copy, we need two values for PHI. This function create a
@@ -2157,7 +2165,9 @@ namespace gbe
 });
   }
 
-  void GenWriter::optimizePhiCopy(ir::Liveness &liveness, ir::Function &fn)
+  void GenWriter::optimizePhiCopy(ir::Liveness &liveness, ir::Function &fn,
+  map &replaceMap,
+  map &redundantPhiCopyMap)
   {
 // The overall idea behind is we check whether there is any interference
 // between phi and phiCopy live range. If there is no point that
@@ -2168,7 +2178,6 @@ namespace gbe
 
 using namespace ir;
 ir::FunctionDAG *dag = new ir::FunctionDAG(liveness);
-
 for (auto &it : phiMap) {
   const Register phi = it.first;
   const Register phiCopy = it.second;
@@ -2248,8 +2257,15 @@ namespace gbe
 const Instruction *phiSrcUseInsn = s->getInstruction();
 replaceSrc(const_cast(phiSrcUseInsn), 
phiCopySrc, phiCopy);
   }
+  replaceMap.insert(std::make_pair(phiCopySrc, phiCopy));
 }
   }
+} else {
+  // FIXME: if the phiCopySrc is a phi value and has been used by more than one phiCopy,
+  // this 1:1 map will ignore the second one.
+  if (((*(phiCopySrcDef->begin()))->getType() == ValueDef::DEF_INSN_DST) &&
+  redundantPhiCopyMap.find(phiCopySrc) == redundantPhiCopyMap.end())
+redundantPhiCopyMap.insert(std::make_pair(phiCopySrc, phiCopy));
 }
 
 // If phi is used in the same BB that define the phiCopy,
@@ -2281,7 +2297,7 @@ namespace gbe
 }
   }
 
-  // coalease phi and phiCopy 
+  // coalease phi and phiCopy
   if (isOpt) {
 for (auto &x : *phiDef) {
   const_cast(x->getInstruction())->remove();
@@ -2289,8 +2305,112 @@ namespace gbe
 for (auto &x : *phiUse) {
   const Instruction *phiUseInsn = x->getInstruction();
   replaceSrc(const_cast(phiUseInsn), phi, phiCopy);
+  replaceMap.insert(std::make_pair(phi, phiCopy));
+}
+  }
+}
+delete dag;
+  }
+
+  void GenWriter::postPhiCopyOptimization(ir::Liveness &liveness,
+ ir::Function &fn, map <ir::Register, ir::Register> &replaceMap,
+ map <ir::Register, ir::Register> &redundantPhiCopyMap)
+  {
+// When doing the first-pass phi copy optimization, we skip all the phi src MOV cases
+// whose phiSrcDefs are also phi values. We handle them here, after all other phi copy
+// optimizations have been done, so we don't need to worry about reducible phi
+// copies remaining. We only need to check whether those possible redundant phi
+// copy pairs interfere with each other globally, by leveraging the DAG information.
+using namespace ir;
+
+// Firstly, validate all possible re

[Beignet] [PATCH 3/9] GBE: add two helper routines for liveness partially update.

2015-09-23 Thread Zhigang Gong
We don't need to recompute the entire liveness information for
all cases. This is a preparation patch for further phi copy
optimization.

v2:
also need to update varKill set.

Signed-off-by: Zhigang Gong 
---
 backend/src/ir/liveness.cpp | 37 +
 backend/src/ir/liveness.hpp |  7 +++
 2 files changed, 44 insertions(+)

diff --git a/backend/src/ir/liveness.cpp b/backend/src/ir/liveness.cpp
index e2240c0..1839029 100644
--- a/backend/src/ir/liveness.cpp
+++ b/backend/src/ir/liveness.cpp
@@ -59,6 +59,43 @@ namespace ir {
 }
   }
 
+  void Liveness::removeRegs(const set<Register> &removes) {
+for (auto &pair : liveness) {
+  BlockInfo &info = *(pair.second);
+  for (auto reg : removes) {
+if (info.liveOut.contains(reg))
+  info.liveOut.erase(reg);
+if (info.upwardUsed.contains(reg))
+  info.upwardUsed.erase(reg);
+  }
+}
+  }
+
+  void Liveness::replaceRegs(const map<Register, Register> &replaceMap) {
+
+for (auto &pair : liveness) {
+  BlockInfo &info = *pair.second;
+  BasicBlock *bb = const_cast(&info.bb);
+  for (auto &pair : replaceMap) {
+Register from = pair.first;
+Register to = pair.second;
+if (info.liveOut.contains(from)) {
+  info.liveOut.erase(from);
+  info.liveOut.insert(to);
+  bb->definedPhiRegs.insert(to);
+}
+if (info.upwardUsed.contains(from)) {
+  info.upwardUsed.erase(from);
+  info.upwardUsed.insert(to);
+}
+if (info.varKill.contains(from)) {
+  info.varKill.erase(from);
+  info.varKill.insert(to);
+}
+  }
+}
+  }
+
   Liveness::~Liveness(void) {
 for (auto &pair : liveness) GBE_SAFE_DELETE(pair.second);
   }
diff --git a/backend/src/ir/liveness.hpp b/backend/src/ir/liveness.hpp
index d9fa2ed..df889e6 100644
--- a/backend/src/ir/liveness.hpp
+++ b/backend/src/ir/liveness.hpp
@@ -116,6 +116,13 @@ namespace ir {
 }
   }
 }
+
+// remove some registers from the liveness information.
+void removeRegs(const set<Register> &removes);
+
+// replace some registers according to (from, to) register map.
+void replaceRegs(const map<Register, Register> &replaceMap);
+
   private:
 /*! Store the liveness of all blocks */
 Info liveness;
-- 
1.9.1

___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


[Beignet] [PATCH 6/9] GBE: continue to refine interfering check.

2015-09-23 Thread Zhigang Gong
A more aggressive interfering check: even if both registers are in
the livein set or liveout set of the same BB, it is still possible
that they do not interfere with each other.

v2:
The liveout interfering check needs to take care of those BBs in which
only one of the registers is defined.

For example:

BBn:
  ...
  MOV %r1, %src
  ...

Both %r1 and %r2 are in BBn's liveout set, but %r2 is not defined or used
in BBn. The previous implementation ignored such a BB, which is incorrect. As %r1
is modified to a different value, %r1 cannot be replaced with %r2
in this case.

v3:
Add comments and an assertion to restrict the usage of the interfering
check functions of the DAG class.

Signed-off-by: Zhigang Gong 
---
 backend/src/ir/value.cpp | 133 ---
 backend/src/ir/value.hpp |  13 +++--
 2 files changed, 123 insertions(+), 23 deletions(-)

diff --git a/backend/src/ir/value.cpp b/backend/src/ir/value.cpp
index 19ecabf..d2f0c2e 100644
--- a/backend/src/ir/value.cpp
+++ b/backend/src/ir/value.cpp
@@ -577,13 +577,102 @@ namespace ir {
 }
   }
 
+  static void getBlockDefInsns(const BasicBlock *bb, const DefSet *dSet, Register r, set <const Instruction *> &defInsns) {
+for (auto def : *dSet) {
+  auto defInsn = def->getInstruction();
+  if (defInsn->getParent() == bb)
+defInsns.insert(defInsn);
+}
+  }
+
+  static bool liveinInterfere(const BasicBlock *bb, const Instruction *defInsn, Register r1) {
+BasicBlock::const_iterator iter = BasicBlock::const_iterator(defInsn);
+BasicBlock::const_iterator iterE = bb->end();
+
+if (defInsn->getOpcode() == OP_MOV &&
+defInsn->getSrc(0) == r1)
+  return false;
+while (iter != iterE) {
+  const Instruction *insn = iter.node();
+  for (unsigned i = 0; i < insn->getDstNum(); i++) {
+Register dst = insn->getDst(i);
+if (dst == r1)
+  return false;
+  }
+  for (unsigned i = 0; i < insn->getSrcNum(); i++) {
+ir::Register src = insn->getSrc(i);
+if (src == r1)
+  return true;
+  }
+  ++iter;
+}
+
+return false;
+  }
+
+  // r0 and r1 both are in the Livein set.
+  // They interfere only if r0/r1 is used after r1/r0 has been modified.
+  bool FunctionDAG::interfereLivein(const BasicBlock *bb, Register r0, Register r1) const {
+set <const Instruction *> defInsns0, defInsns1;
+auto defSet0 = getRegDef(r0);
+auto defSet1 = getRegDef(r1);
+getBlockDefInsns(bb, defSet0, r0, defInsns0);
+getBlockDefInsns(bb, defSet1, r1, defInsns1);
+if (defInsns0.size() == 0 && defInsns1.size() == 0)
+  return false;
+
+for (auto insn : defInsns0) {
+  if (liveinInterfere(bb, insn, r1))
+return true;
+}
+
+for (auto insn : defInsns1) {
+  if (liveinInterfere(bb, insn, r0))
+return true;
+}
+return false;
+  }
+
+  // r0 and r1 both are in the Liveout set.
+  // Only if the last definition of r0/r1 is a MOV r0, r1 or a MOV r1, r0
+  // will it not introduce interference in this BB.
+  bool FunctionDAG::interfereLiveout(const BasicBlock *bb, Register r0, Register r1) const {
+set <const Instruction *> defInsns0, defInsns1;
+auto defSet0 = getRegDef(r0);
+auto defSet1 = getRegDef(r1);
+getBlockDefInsns(bb, defSet0, r0, defInsns0);
+getBlockDefInsns(bb, defSet1, r1, defInsns1);
+if (defInsns0.size() == 0 && defInsns1.size() == 0)
+  return false;
+
+BasicBlock::const_iterator iter = --bb->end();
+BasicBlock::const_iterator iterE = bb->begin();
+do {
+  const Instruction *insn = iter.node();
+  for (unsigned i = 0; i < insn->getDstNum(); i++) {
+Register dst = insn->getDst(i);
+if (dst == r0 || dst == r1) {
+  if (insn->getOpcode() != OP_MOV)
+return true;
+  if (dst == r0 && insn->getSrc(0) != r1)
+return true;
+  if (dst == r1 && insn->getSrc(0) != r0)
+return true;
+  return false;
+}
+  }
+  --iter;
+} while (iter != iterE);
+return false;
+  }
+
+  // Check instructions after the def of r0: if there is any def of r1, there is
+  // no interference in this range. Otherwise, if there is any use of r1, return true.
  bool FunctionDAG::interfere(const BasicBlock *bb, Register inReg, Register outReg) const {
 auto dSet = getRegDef(outReg);
-bool visited = false;
 for (auto &def : *dSet) {
   auto defInsn = def->getInstruction();
   if (defInsn->getParent() == bb) {
-visited = true;
 if (defInsn->getOpcode() == OP_MOV && defInsn->getSrc(0) == inReg)
   continue;
 BasicBlock::const_iterator iter = BasicBlock::const_iterator(defInsn);
@@ -602,19 +691,17 @@ namespace ir {
 }
   }
 }
-// We must visit the outReg at least once. Otherwise, something going wrong.
-GBE_ASSERT(visited);
 return false;
   }
 
   bool FunctionDAG::interfer

[Beignet] [PATCH 7/9] GBE: Don't try to remove instructions when liveness is in dynamic update phase.

2015-09-23 Thread Zhigang Gong
As we want to avoid recomputing liveness all the time, we maintain the liveness
information dynamically during the phi mov optimization. Removing (self-copy)
instructions brings unnecessary complexity here. Let's avoid doing that here, and
do the self-copy removal later in removeMOVs().

Signed-off-by: Zhigang Gong 
---
 backend/src/ir/value.cpp  |  6 +++---
 backend/src/llvm/llvm_gen_backend.cpp | 21 +++--
 2 files changed, 10 insertions(+), 17 deletions(-)

diff --git a/backend/src/ir/value.cpp b/backend/src/ir/value.cpp
index d2f0c2e..b0ed9c2 100644
--- a/backend/src/ir/value.cpp
+++ b/backend/src/ir/value.cpp
@@ -190,7 +190,7 @@ namespace ir {
   // Do not transfer dead values
   if (info.inLiveOut(reg) == false) continue;
   // If we overwrite it, do not transfer the initial value
-  if (info.inVarKill(reg) == true) continue;
+  if ((info.inVarKill(reg) == true) && (info.inUpwardUsed(reg))) continue;
   ValueDef *def = const_cast<ValueDef*>(this->dag.getDefAddress(&arg));
   auto it = blockDefMap->find(reg);
   GBE_ASSERT(it != blockDefMap->end());
@@ -205,7 +205,7 @@ namespace ir {
   // Do not transfer dead values
   if (info.inLiveOut(reg) == false) continue;
   // If we overwrite it, do not transfer the initial value
-  if (info.inVarKill(reg) == true) continue;
+  if ((info.inVarKill(reg) == true) && (info.inUpwardUsed(reg))) continue;
   ValueDef *def = const_cast<ValueDef*>(this->dag.getDefAddress(reg));
   auto it = blockDefMap->find(reg);
   GBE_ASSERT(it != blockDefMap->end());
@@ -219,7 +219,7 @@ namespace ir {
   // Do not transfer dead values
   if (info.inLiveOut(reg) == false) continue;
   // If we overwrite it, do not transfer the initial value
-  if (info.inVarKill(reg) == true) continue;
+  if ((info.inVarKill(reg) == true) && (info.inUpwardUsed(reg))) continue;
   ValueDef *def = const_cast<ValueDef*>(this->dag.getDefAddress(&pushed.second));
   auto it = blockDefMap->find(reg);
   GBE_ASSERT(it != blockDefMap->end());
diff --git a/backend/src/llvm/llvm_gen_backend.cpp b/backend/src/llvm/llvm_gen_backend.cpp
index b0b97e7..dc2e3e8 100644
--- a/backend/src/llvm/llvm_gen_backend.cpp
+++ b/backend/src/llvm/llvm_gen_backend.cpp
@@ -2149,6 +2149,11 @@ namespace gbe
 // destinations)
 uint32_t insnID = 2;
 bb.foreach([&](ir::Instruction &insn) {
+  if (insn.getOpcode() == ir::OP_MOV &&
+  insn.getDst(0) == insn.getSrc(0)) {
+insn.remove();
+return;
+  }
   const uint32_t dstNum = insn.getDstNum();
   const uint32_t srcNum = insn.getSrcNum();
   for (uint32_t srcID = 0; srcID < srcNum; ++srcID) {
@@ -2245,8 +2250,7 @@ namespace gbe
   ++iter;
 }
 if (!phiPhiCopySrcInterfere) {
-  // phiCopy source can be coaleased with phiCopy
-  const_cast<Instruction *>(phiCopyDefInsn)->remove();
+  replaceSrc(const_cast<Instruction *>(phiCopyDefInsn), phiCopySrc, phiCopy);
 
   for (auto &s : *phiCopySrcDef) {
 const Instruction *phiSrcDefInsn = s->getInstruction();
@@ -2300,7 +2304,7 @@ namespace gbe
   // coalease phi and phiCopy
   if (isOpt) {
 for (auto &x : *phiDef) {
-  const_cast<Instruction *>(x->getInstruction())->remove();
+  replaceDst(const_cast<Instruction *>(x->getInstruction()), phi, phiCopy);
 }
 for (auto &x : *phiUse) {
   const Instruction *phiUseInsn = x->getInstruction();
@@ -2361,21 +2365,11 @@ namespace gbe
   const ir::UseSet *phiCopySrcUse = dag->getRegUse(phiCopySrc);
   for (auto &s : *phiCopySrcDef) {
 const Instruction *phiSrcDefInsn = s->getInstruction();
-if (phiSrcDefInsn->getOpcode() == ir::OP_MOV &&
-phiSrcDefInsn->getSrc(0) == phiCopy) {
-   const_cast<Instruction *>(phiSrcDefInsn)->remove();
-   continue;
-}
 replaceDst(const_cast<Instruction *>(phiSrcDefInsn), phiCopySrc, phiCopy);
   }
 
   for (auto &s : *phiCopySrcUse) {
 const Instruction *phiSrcUseInsn = s->getInstruction();
-if (phiSrcUseInsn->getOpcode() == ir::OP_MOV &&
-phiSrcUseInsn->getDst(0) == phiCopy) {
-   const_cast<Instruction *>(phiSrcUseInsn)->remove();
-   continue;
-}
 replaceSrc(const_cast<Instruction *>(phiSrcUseInsn), phiCopySrc, phiCopy);
   }
 
@@ -2405,7 +2399,6 @@ namespace gbe
   } else
 break;
 
-  break;
   nextRedundant->clear();
   replacedRegs.clear();
   revReplacedRegs.clear();
-- 
1.9.1



[Beignet] [PATCH 8/9] GBE: enable post phi copy optimization function.

2015-09-23 Thread Zhigang Gong
Signed-off-by: Zhigang Gong 
---
 backend/src/llvm/llvm_gen_backend.cpp | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/backend/src/llvm/llvm_gen_backend.cpp b/backend/src/llvm/llvm_gen_backend.cpp
index dc2e3e8..cc28053 100644
--- a/backend/src/llvm/llvm_gen_backend.cpp
+++ b/backend/src/llvm/llvm_gen_backend.cpp
@@ -2870,7 +2870,7 @@ namespace gbe
 if (OCL_OPTIMIZE_PHI_MOVES) {
  map <ir::Register, ir::Register> replaceMap, redundantPhiCopyMap;
   this->optimizePhiCopy(liveness, fn, replaceMap, redundantPhiCopyMap);
-  //this->postPhiCopyOptimization(liveness, fn, replaceMap, redundantPhiCopyMap);
+  this->postPhiCopyOptimization(liveness, fn, replaceMap, redundantPhiCopyMap);
   this->removeMOVs(liveness, fn);
 }
   }
-- 
1.9.1



[Beignet] [PATCH 9/9] GBE: avoid vector registers when there is high register pressure.

2015-09-23 Thread Zhigang Gong
If the reservedSpillRegs is not zero, it indicates we are under
very high register pressure. Using a register vector would likely
increase that pressure and cause a significant performance
problem, which is much worse than using a short-lived temporary
vector register with several additional MOVs.

So let's simply avoid using vector registers and just use a
temporary short-live-interval vector.

v2:
remove out-of-date comments.

Signed-off-by: Zhigang Gong 
---
 backend/src/backend/gen_reg_allocation.cpp | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/backend/src/backend/gen_reg_allocation.cpp b/backend/src/backend/gen_reg_allocation.cpp
index 39f1934..3f6abf3 100644
--- a/backend/src/backend/gen_reg_allocation.cpp
+++ b/backend/src/backend/gen_reg_allocation.cpp
@@ -313,12 +313,10 @@ namespace gbe
  // case 1: the register is not already in a vector, so it can stay in this
  // vector. Note that local IDs are *non-scalar* special registers but will
  // require a MOV anyway since pre-allocated in the CURBE
-  // If an element has very long interval, we don't want to put it into a
-  // vector as it will add more pressure to the register allocation.
   if (it == vectorMap.end() &&
   ctx.sel->isScalarReg(reg) == false &&
   ctx.isSpecialReg(reg) == false &&
-  (intervals[reg].maxID - intervals[reg].minID) < 2048)
+  ctx.reservedSpillRegs == 0 )
   {
 const VectorLocation location = std::make_pair(vector, regID);
 this->vectorMap.insert(std::make_pair(reg, location));
-- 
1.9.1



[Beignet] [PATCH 4/9] GBE: add some dag helper routines to check registers' interfering.

2015-09-23 Thread Zhigang Gong
These helper functions will be used in the further phi mov optimization.

v2:
remove the useless debug message code.

Signed-off-by: Zhigang Gong 
---
 backend/src/ir/value.cpp | 100 +++
 backend/src/ir/value.hpp |  13 ++
 2 files changed, 113 insertions(+)

diff --git a/backend/src/ir/value.cpp b/backend/src/ir/value.cpp
index 840fb5c..19ecabf 100644
--- a/backend/src/ir/value.cpp
+++ b/backend/src/ir/value.cpp
@@ -558,6 +558,106 @@ namespace ir {
 return it->second;
   }
 
+  void FunctionDAG::getRegUDBBs(Register r, set <const BasicBlock *> &BBs) const {
+auto dSet = getRegDef(r);
+for (auto &def : *dSet)
+  BBs.insert(def->getInstruction()->getParent());
+auto uSet = getRegUse(r);
+for (auto &use : *uSet)
+  BBs.insert(use->getInstruction()->getParent());
+  }
+
+  static void getLivenessBBs(const Liveness &liveness, Register r, const set <const BasicBlock *> &useDefSet,
+ set <const BasicBlock *> &liveInSet, set <const BasicBlock *> &liveOutSet){
+for (auto bb : useDefSet) {
+  if (liveness.getLiveOut(bb).contains(r))
+liveOutSet.insert(bb);
+  if (liveness.getLiveIn(bb).contains(r))
+liveInSet.insert(bb);
+}
+  }
+
+  bool FunctionDAG::interfere(const BasicBlock *bb, Register inReg, Register outReg) const {
+auto dSet = getRegDef(outReg);
+bool visited = false;
+for (auto &def : *dSet) {
+  auto defInsn = def->getInstruction();
+  if (defInsn->getParent() == bb) {
+visited = true;
+if (defInsn->getOpcode() == OP_MOV && defInsn->getSrc(0) == inReg)
+  continue;
+BasicBlock::const_iterator iter = BasicBlock::const_iterator(defInsn);
+BasicBlock::const_iterator iterE = bb->end();
+iter++;
+// check no use of phi in this basicblock between [phiCopySrc def, bb end]
+while (iter != iterE) {
+  const ir::Instruction *insn = iter.node();
+  // check phiUse
+  for (unsigned i = 0; i < insn->getSrcNum(); i++) {
+ir::Register src = insn->getSrc(i);
+if (src == inReg)
+  return true;
+  }
+  ++iter;
+}
+  }
+}
+// We must visit the outReg at least once. Otherwise, something going wrong.
+GBE_ASSERT(visited);
+return false;
+  }
+
+  bool FunctionDAG::interfere(const Liveness &liveness, Register r0, Register r1) const {
+// There are two interfering cases:
+//   1. Two registers are in the Livein set of the same BB.
+//   2. Two registers are in the Liveout set of the same BB.
+// If there are no any intersection BB, they are not interfering to each other.
+// If they are some intersection BBs, but one is only in the LiveIn and the other is
+// only in the Liveout, then we need to check whether they interefere each other in
+// that BB.
+set <const BasicBlock *> bbSet0;
+set <const BasicBlock *> bbSet1;
+getRegUDBBs(r0, bbSet0);
+getRegUDBBs(r1, bbSet1);
+
+set <const BasicBlock *> liveInBBSet0, liveInBBSet1;
+set <const BasicBlock *> liveOutBBSet0, liveOutBBSet1;
+getLivenessBBs(liveness, r0, bbSet0, liveInBBSet0, liveOutBBSet0);
+getLivenessBBs(liveness, r1, bbSet1, liveInBBSet1, liveOutBBSet1);
+
+set <const BasicBlock *> intersect;
+set_intersection(liveInBBSet0.begin(), liveInBBSet0.end(),
+ liveInBBSet1.begin(), liveInBBSet1.end(),
+ std::inserter(intersect, intersect.begin()));
+if (intersect.size() != 0)
+  return true;
+intersect.clear();
+set_intersection(liveOutBBSet0.begin(), liveOutBBSet0.end(),
+ liveOutBBSet1.begin(), liveOutBBSet1.end(),
+ std::inserter(intersect, intersect.begin()));
+if (intersect.size() != 0)
+  return true;
+
+set <const BasicBlock *> OIIntersect, IOIntersect;
+set_intersection(liveOutBBSet0.begin(), liveOutBBSet0.end(),
+ liveInBBSet1.begin(), liveInBBSet1.end(),
+ std::inserter(OIIntersect, OIIntersect.begin()));
+
+for (auto bb : OIIntersect) {
+  if (interfere(bb, r1, r0))
+return true;
+}
+
+set_intersection(liveInBBSet0.begin(), liveInBBSet0.end(),
+ liveOutBBSet1.begin(), liveOutBBSet1.end(),
+ std::inserter(IOIntersect, IOIntersect.begin()));
+for (auto bb : IOIntersect) {
+  if (interfere(bb, r0, r1))
+return true;
+}
+return false;
+  }
+
   std::ostream &operator<< (std::ostream &out, const FunctionDAG &dag) {
 const Function &fn = dag.getFunction();
 
diff --git a/backend/src/ir/value.hpp b/backend/src/ir/value.hpp
index a9e5108..ba3ba01 100644
--- a/backend/src/ir/value.hpp
+++ b/backend/src/ir/value.hpp
@@ -238,6 +238,19 @@ namespace ir {
typedef map<ValueUse, DefSet*> UDGraph;
/*! The UseSet for each definition */
typedef map<ValueDef, UseSet*> DUGraph;
+/*! get register's use and define BB set */
+void getRegUDBBs(Register r, set &BB

[Beignet] [PATCH 1/9] GBE: refine Phi copy interfering check.

2015-09-23 Thread Zhigang Gong
If the PHI source register's definition instruction uses the
phi register, they do not interfere. For example:

MOV %phi, %phicopy
...
ADD %phiSrcDef, %phi, tmp
...
MOV %phicopy, %phiSrcDef
...

The %phi and the %phiSrcDef are not interfering with each other.
Simply advancing the start of the check to the next instruction is
enough to get a better result. For some special cases, this patch
can bring a significant performance boost.

Signed-off-by: Zhigang Gong 
---
 backend/src/llvm/llvm_gen_backend.cpp | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/backend/src/llvm/llvm_gen_backend.cpp b/backend/src/llvm/llvm_gen_backend.cpp
index 4905415..38c63ce 100644
--- a/backend/src/llvm/llvm_gen_backend.cpp
+++ b/backend/src/llvm/llvm_gen_backend.cpp
@@ -2220,6 +2220,8 @@ namespace gbe
 
 ir::BasicBlock::const_iterator iter = ir::BasicBlock::const_iterator(phiCopySrcDefInsn);
 ir::BasicBlock::const_iterator iterE = bb->end();
+
+iter++;
 // check no use of phi in this basicblock between [phiCopySrc def, bb end]
 bool phiPhiCopySrcInterfere = false;
 while (iter != iterE) {
-- 
1.9.1

___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


[Beignet] [PATCH v3 0/9] phi out-of-SSA optimization patchset

2015-09-23 Thread Zhigang Gong
The major change in this version is in the liveness helper routines.
It also adds a new patch and removes the incorrect DAG fix:

  GBE: Don't try to remove instructions when liveness is in dynamic update phase.

Zhigang Gong (9):
  GBE: refine Phi copy interfering check.
  GBE: refine liveness analysis.
  GBE: add two helper routines for liveness partially update.
  GBE: add some dag helper routines to check registers' interfering.
  GBE: implement further phi mov optimization based on intra-BB
interefering analysis.
  GBE: continue to refine interfering check.
  GBE: Don't try to remove instructions when liveness is in dynamic
update phase.
  GBE: enable post phi copy optimization function.
  GBE: avoid vector registers when there is high register pressure.

 backend/src/backend/context.cpp|   2 +-
 backend/src/backend/gen_reg_allocation.cpp |   4 +-
 backend/src/ir/liveness.cpp|  54 +++-
 backend/src/ir/liveness.hpp|   9 +-
 backend/src/ir/value.cpp   | 203 -
 backend/src/ir/value.hpp   |  16 +++
 backend/src/llvm/llvm_gen_backend.cpp  | 140 ++--
 7 files changed, 403 insertions(+), 25 deletions(-)

-- 
1.9.1



[Beignet] [PATCH 7/8] GBE: Fix one DAG analysis issue and enable multiple round phi copy elimination.

2015-09-22 Thread Zhigang Gong
Even if a value is killed in the current BB, we still need to
pass the predecessor's definition into this BB. Otherwise, we will
miss one definition.

BB0:
  MOV %foo, %src0

BB1:
  MUL %foo, %src1, %foo
  ...
  BR BB1

In the above case, both BB1 and BB0 are predecessors of BB1.
When passing the definition of %foo in BB0 to BB1, the previous implementation
would ignore it because %foo is killed in BB1; this is a bug.
This patch fixes it, and thus we can enable multiple rounds of
phi copy elimination safely.

v2:
also need to fix the same issue for special registers and kernel
arguments.

Signed-off-by: Zhigang Gong 
---
 backend/src/ir/value.cpp  | 8 
 backend/src/llvm/llvm_gen_backend.cpp | 1 -
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/backend/src/ir/value.cpp b/backend/src/ir/value.cpp
index d2f0c2e..d5215cb 100644
--- a/backend/src/ir/value.cpp
+++ b/backend/src/ir/value.cpp
@@ -190,7 +190,7 @@ namespace ir {
   // Do not transfer dead values
   if (info.inLiveOut(reg) == false) continue;
   // If we overwrite it, do not transfer the initial value
-  if (info.inVarKill(reg) == true) continue;
+  if ((info.inVarKill(reg) == true) && (info.inUpwardUsed(reg))) continue;
   ValueDef *def = const_cast<ValueDef*>(this->dag.getDefAddress(&arg));
   auto it = blockDefMap->find(reg);
   GBE_ASSERT(it != blockDefMap->end());
@@ -205,7 +205,7 @@ namespace ir {
   // Do not transfer dead values
   if (info.inLiveOut(reg) == false) continue;
   // If we overwrite it, do not transfer the initial value
-  if (info.inVarKill(reg) == true) continue;
+  if ((info.inVarKill(reg) == true) && (info.inUpwardUsed(reg))) continue;
   ValueDef *def = const_cast<ValueDef*>(this->dag.getDefAddress(reg));
   auto it = blockDefMap->find(reg);
   GBE_ASSERT(it != blockDefMap->end());
@@ -219,7 +219,7 @@ namespace ir {
   // Do not transfer dead values
   if (info.inLiveOut(reg) == false) continue;
   // If we overwrite it, do not transfer the initial value
-  if (info.inVarKill(reg) == true) continue;
+  if ((info.inVarKill(reg) == true) && (info.inUpwardUsed(reg))) continue;
   ValueDef *def = const_cast<ValueDef*>(this->dag.getDefAddress(&pushed.second));
   auto it = blockDefMap->find(reg);
   GBE_ASSERT(it != blockDefMap->end());
@@ -242,7 +242,7 @@ namespace ir {
 const BasicBlock &pbb = pred.bb;
 for (auto reg : curr.liveOut) {
   if (pred.inLiveOut(reg) == false) continue;
-  if (curr.inVarKill(reg) == true) continue;
+  if (curr.inVarKill(reg) == true && curr.inUpwardUsed(reg) == false) continue;
   RegDefSet &currSet = this->getDefSet(&bb, reg);
   RegDefSet &predSet = this->getDefSet(&pbb, reg);
 
diff --git a/backend/src/llvm/llvm_gen_backend.cpp b/backend/src/llvm/llvm_gen_backend.cpp
index e964eb3..2cdfb0f 100644
--- a/backend/src/llvm/llvm_gen_backend.cpp
+++ b/backend/src/llvm/llvm_gen_backend.cpp
@@ -2405,7 +2405,6 @@ namespace gbe
   } else
 break;
 
-  break;
   nextRedundant->clear();
   replacedRegs.clear();
   revReplacedRegs.clear();
-- 
1.9.1



[Beignet] [PATCH 2/8] GBE: refine liveness analysis.

2015-09-22 Thread Zhigang Gong
Only in the gen backend stage do we need to take care of the
special extra liveout and uniform analysis. In the IR stage,
we don't need to handle them.

Signed-off-by: Zhigang Gong 
---
 backend/src/backend/context.cpp |  2 +-
 backend/src/ir/liveness.cpp | 17 ++---
 backend/src/ir/liveness.hpp |  2 +-
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/backend/src/backend/context.cpp b/backend/src/backend/context.cpp
index 33b2409..81b284d 100644
--- a/backend/src/backend/context.cpp
+++ b/backend/src/backend/context.cpp
@@ -322,7 +322,7 @@ namespace gbe
unit(unit), fn(*unit.getFunction(name)), name(name), liveness(NULL), dag(NULL), useDWLabel(false)
   {
 GBE_ASSERT(unit.getPointerSize() == ir::POINTER_32_BITS);
-this->liveness = GBE_NEW(ir::Liveness, const_cast<ir::Function&>(fn));
+this->liveness = GBE_NEW(ir::Liveness, const_cast<ir::Function&>(fn), true);
 this->dag = GBE_NEW(ir::FunctionDAG, *this->liveness);
 // r0 (GEN_REG_SIZE) is always set by the HW and used at the end by EOT
this->registerAllocator = NULL; //GBE_NEW(RegisterAllocator, GEN_REG_SIZE, 4*KB - GEN_REG_SIZE);
diff --git a/backend/src/ir/liveness.cpp b/backend/src/ir/liveness.cpp
index 9fa7ac3..e2240c0 100644
--- a/backend/src/ir/liveness.cpp
+++ b/backend/src/ir/liveness.cpp
@@ -27,7 +27,7 @@
 namespace gbe {
 namespace ir {
 
-  Liveness::Liveness(Function &fn) : fn(fn) {
+  Liveness::Liveness(Function &fn, bool isInGenBackend) : fn(fn) {
 // Initialize UEVar and VarKill for each block
 fn.foreachBlock([this](const BasicBlock &bb) {
   this->initBlock(bb);
@@ -48,12 +48,15 @@ namespace ir {
 }
// extend register (def in loop, use out-of-loop) liveness to the whole loop
 set extentRegs;
-this->computeExtraLiveInOut(extentRegs);
-// analyze uniform values. The extentRegs contains all the values which is
-// defined in a loop and use out-of-loop which could not be a uniform. The reason
-// is that when it reenter the second time, it may active different lanes. So
-// reenter many times may cause it has different values in different lanes.
-this->analyzeUniform(&extentRegs);
+// Only in Gen backend we need to take care of extra live out analysis.
+if (isInGenBackend) {
+  this->computeExtraLiveInOut(extentRegs);
+  // analyze uniform values. The extentRegs contains all the values which is
+  // defined in a loop and use out-of-loop which could not be a uniform. The reason
+  // is that when it reenter the second time, it may active different lanes. So
+  // reenter many times may cause it has different values in different lanes.
+  this->analyzeUniform(&extentRegs);
+}
   }
 
   Liveness::~Liveness(void) {
diff --git a/backend/src/ir/liveness.hpp b/backend/src/ir/liveness.hpp
index 4a7dc4e..d9fa2ed 100644
--- a/backend/src/ir/liveness.hpp
+++ b/backend/src/ir/liveness.hpp
@@ -48,7 +48,7 @@ namespace ir {
   class Liveness : public NonCopyable
   {
   public:
-Liveness(Function &fn);
+Liveness(Function &fn, bool isInGenBackend = false);
 ~Liveness(void);
 /*! Set of variables used upwards in the block (before a definition) */
 typedef set UEVar;
-- 
1.9.1



[Beignet] [PATCH 6/8] GBE: continue to refine interfering check.

2015-09-22 Thread Zhigang Gong
A more aggressive interfering check: even if both registers are in
the livein set or liveout set of the same BB, it is still possible
that they do not interfere with each other.

v2:
The liveout interfering check needs to take care of those BBs in which
only one of the registers is defined.

For example:

BBn:
  ...
  MOV %r1, %src
  ...

Both %r1 and %r2 are in BBn's liveout set, but %r2 is not defined or used
in BBn. The previous implementation ignored such a BB, which is incorrect. As %r1
is modified to a different value, %r1 cannot be replaced with %r2
in this case.

v3:
Add comments and an assertion to restrict the usage of the interfering
check functions of the DAG class.

Signed-off-by: Zhigang Gong 
---
 backend/src/ir/value.cpp | 133 ---
 backend/src/ir/value.hpp |  13 +++--
 2 files changed, 123 insertions(+), 23 deletions(-)

diff --git a/backend/src/ir/value.cpp b/backend/src/ir/value.cpp
index 19ecabf..d2f0c2e 100644
--- a/backend/src/ir/value.cpp
+++ b/backend/src/ir/value.cpp
@@ -577,13 +577,102 @@ namespace ir {
 }
   }
 
+  static void getBlockDefInsns(const BasicBlock *bb, const DefSet *dSet, Register r, set <const Instruction *> &defInsns) {
+for (auto def : *dSet) {
+  auto defInsn = def->getInstruction();
+  if (defInsn->getParent() == bb)
+defInsns.insert(defInsn);
+}
+  }
+
+  static bool liveinInterfere(const BasicBlock *bb, const Instruction *defInsn, Register r1) {
+BasicBlock::const_iterator iter = BasicBlock::const_iterator(defInsn);
+BasicBlock::const_iterator iterE = bb->end();
+
+if (defInsn->getOpcode() == OP_MOV &&
+defInsn->getSrc(0) == r1)
+  return false;
+while (iter != iterE) {
+  const Instruction *insn = iter.node();
+  for (unsigned i = 0; i < insn->getDstNum(); i++) {
+Register dst = insn->getDst(i);
+if (dst == r1)
+  return false;
+  }
+  for (unsigned i = 0; i < insn->getSrcNum(); i++) {
+ir::Register src = insn->getSrc(i);
+if (src == r1)
+  return true;
+  }
+  ++iter;
+}
+
+return false;
+  }
+
+  // r0 and r1 both are in the Livein set.
+  // They interfere only if r0/r1 is used after r1/r0 has been modified.
+  bool FunctionDAG::interfereLivein(const BasicBlock *bb, Register r0, Register r1) const {
+set <const Instruction *> defInsns0, defInsns1;
+auto defSet0 = getRegDef(r0);
+auto defSet1 = getRegDef(r1);
+getBlockDefInsns(bb, defSet0, r0, defInsns0);
+getBlockDefInsns(bb, defSet1, r1, defInsns1);
+if (defInsns0.size() == 0 && defInsns1.size() == 0)
+  return false;
+
+for (auto insn : defInsns0) {
+  if (liveinInterfere(bb, insn, r1))
+return true;
+}
+
+for (auto insn : defInsns1) {
+  if (liveinInterfere(bb, insn, r0))
+return true;
+}
+return false;
+  }
+
+  // r0 and r1 both are in the Liveout set.
+  // Only if the last definition of r0/r1 is a MOV r0, r1 or a MOV r1, r0
+  // will it not introduce interference in this BB.
+  bool FunctionDAG::interfereLiveout(const BasicBlock *bb, Register r0, Register r1) const {
+set <const Instruction *> defInsns0, defInsns1;
+auto defSet0 = getRegDef(r0);
+auto defSet1 = getRegDef(r1);
+getBlockDefInsns(bb, defSet0, r0, defInsns0);
+getBlockDefInsns(bb, defSet1, r1, defInsns1);
+if (defInsns0.size() == 0 && defInsns1.size() == 0)
+  return false;
+
+BasicBlock::const_iterator iter = --bb->end();
+BasicBlock::const_iterator iterE = bb->begin();
+do {
+  const Instruction *insn = iter.node();
+  for (unsigned i = 0; i < insn->getDstNum(); i++) {
+Register dst = insn->getDst(i);
+if (dst == r0 || dst == r1) {
+  if (insn->getOpcode() != OP_MOV)
+return true;
+  if (dst == r0 && insn->getSrc(0) != r1)
+return true;
+  if (dst == r1 && insn->getSrc(0) != r0)
+return true;
+  return false;
+}
+  }
+  --iter;
+} while (iter != iterE);
+return false;
+  }
+
+  // Check instructions after the def of r0: if there is any def of r1, then there is no
+  // interference for this range. Otherwise, if there is any use of r1, then return true.
   bool FunctionDAG::interfere(const BasicBlock *bb, Register inReg, Register outReg) const {
 auto dSet = getRegDef(outReg);
-bool visited = false;
 for (auto &def : *dSet) {
   auto defInsn = def->getInstruction();
   if (defInsn->getParent() == bb) {
-visited = true;
 if (defInsn->getOpcode() == OP_MOV && defInsn->getSrc(0) == inReg)
   continue;
 BasicBlock::const_iterator iter = BasicBlock::const_iterator(defInsn);
@@ -602,19 +691,17 @@ namespace ir {
 }
   }
 }
-// We must visit the outReg at least once. Otherwise, something going wrong.
-GBE_ASSERT(visited);
 return false;
   }
 
   bool FunctionDAG::interfer

[Beignet] [PATCH 5/8] GBE: implement further phi mov optimization based on intra-BB interfering analysis.

2015-09-22 Thread Zhigang Gong
The previous phi mov optimization tries to coalesce the phi copy source register
with the phi copy register when the phi copy source register is a normal SSA value.

But in some cases, many phi copy source registers are themselves phi values which
have multiple definitions. They could still all be coalesced into one phi copy register
if there is no interference in any BB. This patch, together with the previous patches,
reduces the number of spilled registers from 200+ to only 70 for a SGEMM kernel, and
the performance improves by about 10 times.

v2:
Add a FIXME tag to indicate one more optimization opportunity missed in the current
implementation. It could be addressed in the future.

Signed-off-by: Zhigang Gong 
---
 backend/src/llvm/llvm_gen_backend.cpp | 136 --
 1 file changed, 130 insertions(+), 6 deletions(-)

diff --git a/backend/src/llvm/llvm_gen_backend.cpp 
b/backend/src/llvm/llvm_gen_backend.cpp
index 38c63ce..e964eb3 100644
--- a/backend/src/llvm/llvm_gen_backend.cpp
+++ b/backend/src/llvm/llvm_gen_backend.cpp
@@ -629,7 +629,15 @@ namespace gbe
 /*! Will try to remove MOVs due to PHI resolution */
 void removeMOVs(const ir::Liveness &liveness, ir::Function &fn);
 /*! Optimize phi move based on liveness information */
-void optimizePhiCopy(ir::Liveness &liveness, ir::Function &fn);
+void optimizePhiCopy(ir::Liveness &liveness, ir::Function &fn,
+ map<ir::Register, ir::Register> &replaceMap,
+ map<ir::Register, ir::Register> &redundantPhiCopyMap);
+/*! Further optimization after phi copy optimization.
+ *  Eliminates redundant phi values based on global liveness
+ *  interference checking. */
+void postPhiCopyOptimization(ir::Liveness &liveness, ir::Function &fn,
+ map<ir::Register, ir::Register> &replaceMap,
+ map<ir::Register, ir::Register> &redundantPhiCopyMap);
 /*! Will try to remove redundants LOADI in basic blocks */
 void removeLOADIs(const ir::Liveness &liveness, ir::Function &fn);
 /*! To avoid lost copy, we need two values for PHI. This function create a
@@ -2157,7 +2165,9 @@ namespace gbe
 });
   }
 
-  void GenWriter::optimizePhiCopy(ir::Liveness &liveness, ir::Function &fn)
+  void GenWriter::optimizePhiCopy(ir::Liveness &liveness, ir::Function &fn,
+  map<ir::Register, ir::Register> &replaceMap,
+  map<ir::Register, ir::Register> &redundantPhiCopyMap)
   {
 // The overall idea behind is we check whether there is any interference
 // between phi and phiCopy live range. If there is no point that
@@ -2168,7 +2178,6 @@ namespace gbe
 
 using namespace ir;
 ir::FunctionDAG *dag = new ir::FunctionDAG(liveness);
-
 for (auto &it : phiMap) {
   const Register phi = it.first;
   const Register phiCopy = it.second;
@@ -2248,8 +2257,15 @@ namespace gbe
 const Instruction *phiSrcUseInsn = s->getInstruction();
+replaceSrc(const_cast<Instruction *>(phiSrcUseInsn), phiCopySrc, phiCopy);
   }
+  replaceMap.insert(std::make_pair(phiCopySrc, phiCopy));
 }
   }
+} else {
+  // FIXME: if the phiCopySrc is a phi value and has been used as phiCopySrc for
+  // more than one phi, this 1:1 map will ignore the second one.
+  if (((*(phiCopySrcDef->begin()))->getType() == ValueDef::DEF_INSN_DST) &&
+  redundantPhiCopyMap.find(phiCopySrc) == redundantPhiCopyMap.end())
+redundantPhiCopyMap.insert(std::make_pair(phiCopySrc, phiCopy));
 }
 
 // If phi is used in the same BB that define the phiCopy,
@@ -2281,7 +2297,7 @@ namespace gbe
 }
   }
 
-  // coalease phi and phiCopy 
+  // coalease phi and phiCopy
   if (isOpt) {
 for (auto &x : *phiDef) {
  const_cast<Instruction *>(x->getInstruction())->remove();
@@ -2289,8 +2305,112 @@ namespace gbe
 for (auto &x : *phiUse) {
   const Instruction *phiUseInsn = x->getInstruction();
  replaceSrc(const_cast<Instruction *>(phiUseInsn), phi, phiCopy);
+  replaceMap.insert(std::make_pair(phi, phiCopy));
+}
+  }
+}
+delete dag;
+  }
+
+  void GenWriter::postPhiCopyOptimization(ir::Liveness &liveness,
+ ir::Function &fn, map<ir::Register, ir::Register> &replaceMap,
+ map<ir::Register, ir::Register> &redundantPhiCopyMap)
+  {
+// When doing the first pass of phi copy optimization, we skip all the phi src MOV cases
+// whose phiCopySrcDefs are also phi values. We handle them here, after all phi copy
+// optimizations have been done, so we don't need to worry about reducible phi copies
+// remaining. We only need to check whether those possible redundant phi copy pairs
+// interfere with each other globally, by leveraging the DAG information.
+using namespace ir;
+
+// First, validate all possible redundant phi copy pairs and update the liveness
+// information accordingly.
+if (replaceMap.size() 

[Beignet] [PATCH 4/8] GBE: add some DAG helper routines to check registers' interference.

2015-09-22 Thread Zhigang Gong
These helper functions will be used in the further phi mov optimization.

v2:
remove the useless debug message code.

Signed-off-by: Zhigang Gong 
---
 backend/src/ir/value.cpp | 100 +++
 backend/src/ir/value.hpp |  13 ++
 2 files changed, 113 insertions(+)

diff --git a/backend/src/ir/value.cpp b/backend/src/ir/value.cpp
index 840fb5c..19ecabf 100644
--- a/backend/src/ir/value.cpp
+++ b/backend/src/ir/value.cpp
@@ -558,6 +558,106 @@ namespace ir {
 return it->second;
   }
 
+  void FunctionDAG::getRegUDBBs(Register r, set<const BasicBlock *> &BBs) const {
+auto dSet = getRegDef(r);
+for (auto &def : *dSet)
+  BBs.insert(def->getInstruction()->getParent());
+auto uSet = getRegUse(r);
+for (auto &use : *uSet)
+  BBs.insert(use->getInstruction()->getParent());
+  }
+
+  static void getLivenessBBs(const Liveness &liveness, Register r,
+ const set<const BasicBlock *> &useDefSet,
+ set<const BasicBlock *> &liveInSet, set<const BasicBlock *> &liveOutSet) {
+for (auto bb : useDefSet) {
+  if (liveness.getLiveOut(bb).contains(r))
+liveOutSet.insert(bb);
+  if (liveness.getLiveIn(bb).contains(r))
+liveInSet.insert(bb);
+}
+  }
+
+  bool FunctionDAG::interfere(const BasicBlock *bb, Register inReg, Register outReg) const {
+auto dSet = getRegDef(outReg);
+bool visited = false;
+for (auto &def : *dSet) {
+  auto defInsn = def->getInstruction();
+  if (defInsn->getParent() == bb) {
+visited = true;
+if (defInsn->getOpcode() == OP_MOV && defInsn->getSrc(0) == inReg)
+  continue;
+BasicBlock::const_iterator iter = BasicBlock::const_iterator(defInsn);
+BasicBlock::const_iterator iterE = bb->end();
+iter++;
+// check no use of phi in this basic block between [phiCopySrc def, bb end]
+while (iter != iterE) {
+  const ir::Instruction *insn = iter.node();
+  // check phiUse
+  for (unsigned i = 0; i < insn->getSrcNum(); i++) {
+ir::Register src = insn->getSrc(i);
+if (src == inReg)
+  return true;
+  }
+  ++iter;
+}
+  }
+}
+// We must visit the outReg at least once. Otherwise, something going wrong.
+GBE_ASSERT(visited);
+return false;
+  }
+
+  bool FunctionDAG::interfere(const Liveness &liveness, Register r0, Register r1) const {
+// There are two interfering cases:
+//   1. Two registers are in the Livein set of the same BB.
+//   2. Two registers are in the Liveout set of the same BB.
+// If there is no intersecting BB, they do not interfere with each other.
+// If there are some intersecting BBs, but one register is only in the LiveIn set and
+// the other is only in the Liveout set, then we need to check whether they interfere
+// with each other in that BB.
+set<const BasicBlock *> bbSet0;
+set<const BasicBlock *> bbSet1;
+getRegUDBBs(r0, bbSet0);
+getRegUDBBs(r1, bbSet1);
+
+set<const BasicBlock *> liveInBBSet0, liveInBBSet1;
+set<const BasicBlock *> liveOutBBSet0, liveOutBBSet1;
+getLivenessBBs(liveness, r0, bbSet0, liveInBBSet0, liveOutBBSet0);
+getLivenessBBs(liveness, r1, bbSet1, liveInBBSet1, liveOutBBSet1);
+
+set<const BasicBlock *> intersect;
+set_intersection(liveInBBSet0.begin(), liveInBBSet0.end(),
+ liveInBBSet1.begin(), liveInBBSet1.end(),
+ std::inserter(intersect, intersect.begin()));
+if (intersect.size() != 0)
+  return true;
+intersect.clear();
+set_intersection(liveOutBBSet0.begin(), liveOutBBSet0.end(),
+ liveOutBBSet1.begin(), liveOutBBSet1.end(),
+ std::inserter(intersect, intersect.begin()));
+if (intersect.size() != 0)
+  return true;
+
+set<const BasicBlock *> OIIntersect, IOIntersect;
+set_intersection(liveOutBBSet0.begin(), liveOutBBSet0.end(),
+ liveInBBSet1.begin(), liveInBBSet1.end(),
+ std::inserter(OIIntersect, OIIntersect.begin()));
+
+for (auto bb : OIIntersect) {
+  if (interfere(bb, r1, r0))
+return true;
+}
+
+set_intersection(liveInBBSet0.begin(), liveInBBSet0.end(),
+ liveOutBBSet1.begin(), liveOutBBSet1.end(),
+ std::inserter(IOIntersect, IOIntersect.begin()));
+for (auto bb : IOIntersect) {
+  if (interfere(bb, r0, r1))
+return true;
+}
+return false;
+  }
+
   std::ostream &operator<< (std::ostream &out, const FunctionDAG &dag) {
 const Function &fn = dag.getFunction();
 
diff --git a/backend/src/ir/value.hpp b/backend/src/ir/value.hpp
index a9e5108..ba3ba01 100644
--- a/backend/src/ir/value.hpp
+++ b/backend/src/ir/value.hpp
@@ -238,6 +238,19 @@ namespace ir {
 typedef map UDGraph;
 /*! The UseSet for each definition */
 typedef map DUGraph;
+/*! get register's use and define BB set */
+void getRegUDBBs(Register r, set &BB

[Beignet] [PATCH 3/8] GBE: add two helper routines for liveness partially update.

2015-09-22 Thread Zhigang Gong
We don't need to recompute the entire liveness information for
all cases. This is a preparation patch for further phi copy
optimization.

Signed-off-by: Zhigang Gong 
---
 backend/src/ir/liveness.cpp | 33 +
 backend/src/ir/liveness.hpp |  7 +++
 2 files changed, 40 insertions(+)

diff --git a/backend/src/ir/liveness.cpp b/backend/src/ir/liveness.cpp
index e2240c0..c5a6374 100644
--- a/backend/src/ir/liveness.cpp
+++ b/backend/src/ir/liveness.cpp
@@ -59,6 +59,39 @@ namespace ir {
 }
   }
 
+  void Liveness::removeRegs(const set<Register> &removes) {
+for (auto &pair : liveness) {
+  BlockInfo &info = *(pair.second);
+  for (auto reg : removes) {
+if (info.liveOut.contains(reg))
+  info.liveOut.erase(reg);
+if (info.upwardUsed.contains(reg))
+  info.upwardUsed.erase(reg);
+  }
+}
+  }
+
+  void Liveness::replaceRegs(const map<Register, Register> &replaceMap) {
+
+for (auto &pair : liveness) {
+  BlockInfo &info = *pair.second;
+  BasicBlock *bb = const_cast<BasicBlock *>(&info.bb);
+  for (auto &pair : replaceMap) {
+Register from = pair.first;
+Register to = pair.second;
+if (info.liveOut.contains(from)) {
+  info.liveOut.erase(from);
+  info.liveOut.insert(to);
+  bb->definedPhiRegs.insert(to);
+}
+if (info.upwardUsed.contains(from)) {
+  info.upwardUsed.erase(from);
+  info.upwardUsed.insert(to);
+}
+  }
+}
+  }
+
   Liveness::~Liveness(void) {
 for (auto &pair : liveness) GBE_SAFE_DELETE(pair.second);
   }
diff --git a/backend/src/ir/liveness.hpp b/backend/src/ir/liveness.hpp
index d9fa2ed..df889e6 100644
--- a/backend/src/ir/liveness.hpp
+++ b/backend/src/ir/liveness.hpp
@@ -116,6 +116,13 @@ namespace ir {
 }
   }
 }
+
+// remove some registers from the liveness information.
+void removeRegs(const set<Register> &removes);
+
+// replace some registers according to (from, to) register map.
+void replaceRegs(const map<Register, Register> &replaceMap);
+
   private:
 /*! Store the liveness of all blocks */
 Info liveness;
-- 
1.9.1

___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


[Beignet] [PATCH 8/8] GBE: avoid vector registers when there is high register pressure.

2015-09-22 Thread Zhigang Gong
If reservedSpillRegs is not zero, it indicates we are under
very high register pressure. Using register vectors would likely
increase that pressure and cause a significant performance
problem, which is much worse than using a short-lived temporary
vector register with several additional MOVs.

So let's simply avoid using vector registers and just use a
temporary vector with a short live interval.

v2:
remove out-of-date comments.

Signed-off-by: Zhigang Gong 
---
 backend/src/backend/gen_reg_allocation.cpp | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/backend/src/backend/gen_reg_allocation.cpp 
b/backend/src/backend/gen_reg_allocation.cpp
index 39f1934..3f6abf3 100644
--- a/backend/src/backend/gen_reg_allocation.cpp
+++ b/backend/src/backend/gen_reg_allocation.cpp
@@ -313,12 +313,10 @@ namespace gbe
  // case 1: the register is not already in a vector, so it can stay in this
  // vector. Note that local IDs are *non-scalar* special registers but will
  // require a MOV anyway since pre-allocated in the CURBE
-  // If an element has very long interval, we don't want to put it into a
-  // vector as it will add more pressure to the register allocation.
   if (it == vectorMap.end() &&
   ctx.sel->isScalarReg(reg) == false &&
   ctx.isSpecialReg(reg) == false &&
-  (intervals[reg].maxID - intervals[reg].minID) < 2048)
+  ctx.reservedSpillRegs == 0 )
   {
 const VectorLocation location = std::make_pair(vector, regID);
 this->vectorMap.insert(std::make_pair(reg, location));
-- 
1.9.1

___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


[Beignet] [PATCH v2 0/8] phi out-of-SSA optimization patchset.

2015-09-22 Thread Zhigang Gong
This is a new version with fixes to address the review comments.
Only the 7th one,
"GBE: Fix one DAG analysis issue and enable multiple round phi copy
elimination.",
has not been reviewed yet. But it is required for this patchset, and I also made
some changes to it, so I am sending it out together with the other reviewed/modified
patches.

Zhigang Gong (8):
  GBE: refine Phi copy interfering check.
  GBE: refine liveness analysis.
  GBE: add two helper routines for liveness partially update.
  GBE: add some DAG helper routines to check registers' interference.
  GBE: implement further phi mov optimization based on intra-BB
interfering analysis.
  GBE: continue to refine interfering check.
  GBE: Fix one DAG analysis issue and enable multiple round phi copy
elimination.
  GBE: avoid vector registers when there is high register pressure.

 backend/src/backend/context.cpp|   2 +-
 backend/src/backend/gen_reg_allocation.cpp |   4 +-
 backend/src/ir/liveness.cpp|  50 ++-
 backend/src/ir/liveness.hpp|   9 +-
 backend/src/ir/value.cpp   | 205 -
 backend/src/ir/value.hpp   |  16 +++
 backend/src/llvm/llvm_gen_backend.cpp  | 137 ++-
 7 files changed, 401 insertions(+), 22 deletions(-)

-- 
1.9.1

___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


[Beignet] [PATCH 1/8] GBE: refine Phi copy interfering check.

2015-09-22 Thread Zhigang Gong
If the PHI source register's definition instruction uses the
phi register, the two do not interfere. For example:

MOV %phi, %phicopy
...
ADD %phiSrcDef, %phi, tmp
...
MOV %phicopy, %phiSrcDef
...

%phi and %phiSrcDef are not interfering with each other.
Simply advancing the start of the check to the next instruction is
enough to get a better result. For some special cases, this patch
can yield a significant performance boost.

Signed-off-by: Zhigang Gong 
---
 backend/src/llvm/llvm_gen_backend.cpp | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/backend/src/llvm/llvm_gen_backend.cpp 
b/backend/src/llvm/llvm_gen_backend.cpp
index 4905415..38c63ce 100644
--- a/backend/src/llvm/llvm_gen_backend.cpp
+++ b/backend/src/llvm/llvm_gen_backend.cpp
@@ -2220,6 +2220,8 @@ namespace gbe
 
 ir::BasicBlock::const_iterator iter = ir::BasicBlock::const_iterator(phiCopySrcDefInsn);
 ir::BasicBlock::const_iterator iterE = bb->end();
+
+iter++;
 // check no use of phi in this basicblock between [phiCopySrc def, bb end]
 bool phiPhiCopySrcInterfere = false;
 while (iter != iterE) {
-- 
1.9.1

___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


Re: [Beignet] [PATCH 5/5] GBE: implement further phi mov optimization based on intra-BB interfering analysis.

2015-09-22 Thread Zhigang Gong
Right, if a phiCopySrc is a phi value and has been used as phiCopySrc for more than
one phi value, the current implementation will only try to optimize one of them.
I will add a FIXME tag there, and will fix it in the future. As there are too many
patches pending, to avoid unnecessary conflicts I prefer to fix only those problems
related to correctness in this round, and defer the refinement to future patches.

Thanks,
Zhigang Gong.

On Tue, Sep 22, 2015 at 03:54:50AM +, Song, Ruiling wrote:
> I just think of another optimization opportunity that may be missed in your 
> algorithm.
> As you use a map to record the possible to-be-coalesced pair,
> the phiCopySrc may be used in another phiNode in the same way,
> which the algorithm would not record. We may do it later.
> Could you inline related comment into the patch?
> Then others could easily understand the code.
> Anyway, the patchset looks good.
> 
> Thanks!
> Ruiling
> ___
> Beignet mailing list
> Beignet@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet
___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


Re: [Beignet] [PATCH v2 1/2] GBE: continue to refine interfering check.

2015-09-22 Thread Zhigang Gong
After discussion with Ruiling, I understand his point now.
These helper routines are only used by the phi optimization for
interference between phi values, which must not be local values.
But the function names look too generic, and
Ruiling is afraid these routines may be used by others
in the future to check interference between local
values.

This is a good point; I will add comments or assertions to
restrict the usage scenarios of these helper routines.

Thanks,
Zhigang Gong.

On Wed, Sep 23, 2015 at 09:58:29AM +0800, Zhigang Gong wrote:
> On Wed, Sep 23, 2015 at 03:05:18AM +, Song, Ruiling wrote:
> > 
> > 
> > > -Original Message-----
> > > From: Zhigang Gong [mailto:zhigang.g...@linux.intel.com]
> > > Sent: Wednesday, September 23, 2015 9:44 AM
> > > To: Song, Ruiling
> > > Cc: Gong, Zhigang; beignet@lists.freedesktop.org
> > > Subject: Re: [Beignet] [PATCH v2 1/2] GBE: continue to refine interfering 
> > > check.
> > > 
> > > On Wed, Sep 23, 2015 at 02:33:43AM +, Song, Ruiling wrote:
> > > >
> > > >
> > > > > -Original Message-
> > > > > From: Beignet [mailto:beignet-boun...@lists.freedesktop.org] On
> > > > > Behalf Of Zhigang Gong
> > > > > Sent: Monday, September 7, 2015 8:20 AM
> > > > > To: beignet@lists.freedesktop.org
> > > > > Cc: Gong, Zhigang
> > > > > Subject: [Beignet] [PATCH v2 1/2] GBE: continue to refine interfering 
> > > > > check.
> > > > >
> > > > > More aggresive interfering check, even if both registers are in
> > > > > Livein set or Liveout set, they are still possible not interfering to 
> > > > > each other.
> > > > >
> > > > > v2:
> > > > > Liveout interfering check need to take care those BBs which has only
> > > > > one register defined.
> > > > >
> > > > > For example:
> > > > >
> > > > > BBn:
> > > > >   ...
> > > > >   MOV %r1, %src
> > > > >   ...
> > > > >
> > > > > Both %r1 and %r2 are in the BBn's liveout set, but %r2 is not
> > > > > defined or used in BBn. The previous implementation ignore this BB
> > > > > which is incorrect. As %r1 was modified to a different value, it
> > > > > means %r1 could not be replaced with %r2 in this case.
> > > > I thought of another one: (r0, r1 contain different values)
> > > > BB0:
> > > >   def r0
> > > >
> > > > BB1:
> > > >   def r1
> > > >   use r1
> > > >   use r0
> > > >
> > > > How could the algorithm deal with it?
> > > If r1 is a local value in BB1, then it is just ignored in these analysis 
> > > including DAG
> > > and liveness. Don't worry about the correctness of this case, as in
> > > gen_reg_allocation we will calculate correct interval for it.
> > 
> > r1 was only defined and used in BB1. 
> > Ignored? the interfere() would return false if I call interfere() with r0 
> > and r1 as arguments. But it should return true, as their live-ranges 
> > interfere.
> 
> All of the local values have been ignored in DAG and liveness analysis. 
> You can check DAG analysis and liveness analysis function or try to dump
> liveness information to check. This is not a real problem, as local registers
> will be handled correctly in register allocation stage.
> 
> > 
> > > If r1 is not a local value of BB1, then it's either in live in of liveout 
> > > set of BB1.
> > > Either case, this algorithm could deal it correctly.
> > > 
> > > > And looks like the algorithm is converting the "live-range
> > > > interference problem" into "checking the interference in live-In, 
> > > > live-Out set".
> > > > Any paper talking on this method?
> > > Actually, most of the paper just talk about to use liveness and DAg 
> > > information
> > > to check interference between the phi and phi copy values to determin 
> > > whether
> > > to coalesce. Not too much details. I just use our liveness information 
> > > and DAG to
> > > do the job.
> > > 
> > > >
> > > > And for same BB interference (r0, r1 defined and used in same BB), 
> > > > although
> > > we don't need to support it yet, I think if we can put an assert, it 
> > > would be better.
> > > Could you describe more detail of this case? local variable or not local 
> > > variable?
> > > Are they in live in set or live out set?
> > Not in liveout or livein set. They are only defined and used in same Basic 
> > Block. But there live-range interfere.
> Local variables are irrelevant to this optimization, please check my comment 
> above.
> 
> Thanks,
> Zhigang Gong.
> 
> > 
> > > 
> > > Thanks for the comments,
> > > Zhigang Gong.
> > > 
> > > >
> > > > Thanks!
> > > > Ruiling
> > > >
> > > > ___
> > > > Beignet mailing list
> > > > Beignet@lists.freedesktop.org
> > > > http://lists.freedesktop.org/mailman/listinfo/beignet
> ___
> Beignet mailing list
> Beignet@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet
___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


Re: [Beignet] [PATCH v2 1/2] GBE: continue to refine interfering check.

2015-09-22 Thread Zhigang Gong
On Wed, Sep 23, 2015 at 03:05:18AM +, Song, Ruiling wrote:
> 
> 
> > -Original Message-
> > From: Zhigang Gong [mailto:zhigang.g...@linux.intel.com]
> > Sent: Wednesday, September 23, 2015 9:44 AM
> > To: Song, Ruiling
> > Cc: Gong, Zhigang; beignet@lists.freedesktop.org
> > Subject: Re: [Beignet] [PATCH v2 1/2] GBE: continue to refine interfering 
> > check.
> > 
> > On Wed, Sep 23, 2015 at 02:33:43AM +, Song, Ruiling wrote:
> > >
> > >
> > > > -Original Message-
> > > > From: Beignet [mailto:beignet-boun...@lists.freedesktop.org] On
> > > > Behalf Of Zhigang Gong
> > > > Sent: Monday, September 7, 2015 8:20 AM
> > > > To: beignet@lists.freedesktop.org
> > > > Cc: Gong, Zhigang
> > > > Subject: [Beignet] [PATCH v2 1/2] GBE: continue to refine interfering 
> > > > check.
> > > >
> > > > More aggresive interfering check, even if both registers are in
> > > > Livein set or Liveout set, they are still possible not interfering to 
> > > > each other.
> > > >
> > > > v2:
> > > > Liveout interfering check need to take care those BBs which has only
> > > > one register defined.
> > > >
> > > > For example:
> > > >
> > > > BBn:
> > > >   ...
> > > >   MOV %r1, %src
> > > >   ...
> > > >
> > > > Both %r1 and %r2 are in the BBn's liveout set, but %r2 is not
> > > > defined or used in BBn. The previous implementation ignore this BB
> > > > which is incorrect. As %r1 was modified to a different value, it
> > > > means %r1 could not be replaced with %r2 in this case.
> > > I thought of another one: (r0, r1 contain different values)
> > > BB0:
> > >   def r0
> > >
> > > BB1:
> > >   def r1
> > >   use r1
> > >   use r0
> > >
> > > How could the algorithm deal with it?
> > If r1 is a local value in BB1, then it is just ignored in these analysis 
> > including DAG
> > and liveness. Don't worry about the correctness of this case, as in
> > gen_reg_allocation we will calculate correct interval for it.
> 
> r1 was only defined and used in BB1. 
> Ignored? the interfere() would return false if I call interfere() with r0 and 
> r1 as arguments. But it should return true, as their live-ranges interfere.

All of the local values have been ignored in the DAG and liveness analysis.
You can check the DAG and liveness analysis functions or try to dump the
liveness information to verify. This is not a real problem, as local registers
will be handled correctly in the register allocation stage.

> 
> > If r1 is not a local value of BB1, then it's either in live in of liveout 
> > set of BB1.
> > Either case, this algorithm could deal it correctly.
> > 
> > > And looks like the algorithm is converting the "live-range
> > > interference problem" into "checking the interference in live-In, 
> > > live-Out set".
> > > Any paper talking on this method?
> > Actually, most of the paper just talk about to use liveness and DAg 
> > information
> > to check interference between the phi and phi copy values to determin 
> > whether
> > to coalesce. Not too much details. I just use our liveness information and 
> > DAG to
> > do the job.
> > 
> > >
> > > And for same BB interference (r0, r1 defined and used in same BB), 
> > > although
> > we don't need to support it yet, I think if we can put an assert, it would 
> > be better.
> > Could you describe more detail of this case? local variable or not local 
> > variable?
> > Are they in live in set or live out set?
> Not in liveout or livein set. They are only defined and used in same Basic 
> Block. But there live-range interfere.
Local variables are irrelevant to this optimization, please check my comment above.

Thanks,
Zhigang Gong.

> 
> > 
> > Thanks for the comments,
> > Zhigang Gong.
> > 
> > >
> > > Thanks!
> > > Ruiling
> > >
> > > ___
> > > Beignet mailing list
> > > Beignet@lists.freedesktop.org
> > > http://lists.freedesktop.org/mailman/listinfo/beignet
___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


Re: [Beignet] [PATCH v2 1/2] GBE: continue to refine interfering check.

2015-09-22 Thread Zhigang Gong
On Wed, Sep 23, 2015 at 02:33:43AM +, Song, Ruiling wrote:
> 
> 
> > -Original Message-
> > From: Beignet [mailto:beignet-boun...@lists.freedesktop.org] On Behalf Of
> > Zhigang Gong
> > Sent: Monday, September 7, 2015 8:20 AM
> > To: beignet@lists.freedesktop.org
> > Cc: Gong, Zhigang
> > Subject: [Beignet] [PATCH v2 1/2] GBE: continue to refine interfering check.
> > 
> > More aggresive interfering check, even if both registers are in Livein set 
> > or
> > Liveout set, they are still possible not interfering to each other.
> > 
> > v2:
> > Liveout interfering check need to take care those BBs which has only one
> > register defined.
> > 
> > For example:
> > 
> > BBn:
> >   ...
> >   MOV %r1, %src
> >   ...
> > 
> > Both %r1 and %r2 are in the BBn's liveout set, but %r2 is not defined or 
> > used in
> > BBn. The previous implementation ignore this BB which is incorrect. As %r1 
> > was
> > modified to a different value, it means %r1 could not be replaced with %r2 
> > in
> > this case.
> I thought of another one: (r0, r1 contain different values)
> BB0:
>   def r0
> 
> BB1:
>   def r1
>   use r1
>   use r0
> 
> How could the algorithm deal with it?
If r1 is a local value in BB1, then it is just ignored in these analyses, including
DAG and liveness. Don't worry about the correctness of this case, as in gen_reg_allocation
we will calculate the correct interval for it.

If r1 is not a local value of BB1, then it's either in the livein or
liveout set of BB1. Either way, this algorithm can deal with it correctly.

> And looks like the algorithm is converting the "live-range interference 
> problem" into
> "checking the interference in live-In, live-Out set".
> Any paper talking on this method?
Actually, most of the papers just talk about using liveness and DAG information to check
interference between the phi and phi copy values to determine whether to coalesce. Not
too much detail. I just use our liveness information and DAG to do the job.

> 
> And for same BB interference (r0, r1 defined and used in same BB), although 
> we don't need to support it yet, I think if we can put an assert, it would be 
> better.
Could you describe this case in more detail? Local variable or not? Are they in the
livein set or the liveout set?

Thanks for the comments,
Zhigang Gong.

> 
> Thanks!
> Ruiling
> 
> ___
> Beignet mailing list
> Beignet@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet
___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


Re: [Beignet] [PATCH] GBE: implement pre-register-allocation instruction scheduling.

2015-09-22 Thread Zhigang Gong
Ping for review.

On Wed, Sep 16, 2015 at 10:46:12AM +0800, Zhigang Gong wrote:
> To find an instruction scheduling policy that achieves the theoretical minimum
> number of registers required in a basic block is an NP problem. We have to use some
> heuristic factors to simplify the algorithm. Much research indicates that
> bottom-up list scheduling is much better than the top-down method in terms of
> register pressure. I chose one such research paper as our target. The paper
> is as below:
> 
> "Register-Sensitive Selection, Duplication, and Sequencing of Instructions"
> It uses bottom-up list scheduling with a Sethi-Ullman label as a
> heuristic number. As we will do cycle-aware scheduling after register
> allocation, we don't need to bother with cycle-related heuristic numbers here.
> I just skipped the EST computing and usage parts of the algorithm.
> 
> It turns out this algorithm works well. It reduces the register spills
> in clBlas's sgemmBlock kernel from 83+ to only 20.
> 
> Although this scheduling method seems to lower ILP (instruction-level parallelism),
> it's not a big issue, because we will allocate as many different registers as possible
> in the following register allocation stage, and we will do a post-allocation
> instruction scheduling pass which will try to extract as much ILP as possible.
> 
> Signed-off-by: Zhigang Gong 
> Signed-off-by: Zhigang Gong 
> ---
> backend/src/backend/gen_insn_scheduling.cpp | 137 +++-
>  1 file changed, 116 insertions(+), 21 deletions(-)
> 
> diff --git a/backend/src/backend/gen_insn_scheduling.cpp 
> b/backend/src/backend/gen_insn_scheduling.cpp
> index 358a2ce..f4f1e70 100644
> --- a/backend/src/backend/gen_insn_scheduling.cpp
> +++ b/backend/src/backend/gen_insn_scheduling.cpp
> @@ -41,26 +41,29 @@
>   * ==
>   *
>   * We try to limit the register pressure.
> - * Well, this is a hard problem and we have a decent strategy now that we called
> - * "zero cycled LIFO scheduling".
> - * We use a local forward list scheduling and we schedule the instructions in a
> - * LIFO order i.e. as a stack. Basically, we take the most recent instruction
> - * and schedule it right away. Obviously we ignore completely the real latencies
> - * and throuputs and just simulate instructions that are issued and completed in
> - * zero cycle. For the complex kernels we already have (like menger sponge),
> - * this provides a pretty good strategy enabling SIMD16 code generation where
> - * when scheduling is deactivated, even SIMD8 fails
>   *
> - * One may argue that this strategy is bad, latency wise. This is not true since
> - * the register allocator will anyway try to burn as many registers as possible.
> - * So, there is still opportunities to schedule after register allocation.
> + * Finding an instruction schedule that achieves the theoretical minimum
> + * number of registers required in a basic block is an NP-hard problem. We have
> + * to use some heuristic to simplify the algorithm. Much research indicates that
> + * bottom-up list scheduling is much better than the top-down method in terms of
> + * register pressure. I chose one such research paper as our target:
>   *
> - * Our idea seems to work decently. There is however a strong research article
> - * that is able to near-optimally reschudle the instructions to minimize
> - * register use. This is:
> + * "Register-Sensitive Selection, Duplication, and Sequencing of Instructions"
> + * It uses bottom-up list scheduling with a Sethi-Ullman label as the
> + * heuristic number. As we will do cycle-aware scheduling after register
> + * allocation, we don't need to bother with a cycle-related heuristic here;
> + * I just skipped the EST computation and usage part of the algorithm.
>   *
> - * "Minimum Register Instruction Sequence Problem: Revisiting Optimal Code
> - *  Generation for DAGs"
> + * It turns out this algorithm works well. It could reduce the register spilling
> + * in clBlas's sgemmBlock kernel from 83+ to only 20.
> + *
> + * Although this scheduling method seems to lower ILP (instruction-level
> + * parallelism), that is not a big issue, because we will allocate as many
> + * different registers as possible in the following register allocation stage,
> + * and we will do an after-allocation instruction scheduling pass which tries
> + * to recover as much ILP as possible.
> + *
> + * FIXME: we only need to do this scheduling when a BB is really under high 
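
The Sethi-Ullman label used as the priority above can be illustrated with a toy sketch. This is Python for brevity, not Beignet's C++ (the real pass in gen_insn_scheduling.cpp works on a selection DAG, not this made-up dict-based expression tree): the label of a node estimates how many registers are needed to evaluate its subtree without spilling, and the scheduler evaluates the costlier child first.

```python
# Hypothetical illustration of Sethi-Ullman labeling (not Beignet code).
# A node is {"name": str, "kids": [nodes]}; leaves have no "kids".
def su_label(node, labels):
    """Return the Sethi-Ullman label of `node`, filling `labels` by name."""
    kids = node.get("kids", [])
    if not kids:
        labels[node["name"]] = 1  # a leaf needs one register to hold its value
        return 1
    # Label children, then pretend we evaluate them from costliest to
    # cheapest: the i-th child evaluated holds i already-computed values live.
    ls = sorted((su_label(k, labels) for k in kids), reverse=True)
    lab = max(l + i for i, l in enumerate(ls))
    labels[node["name"]] = lab
    return lab
```

For the classic example ((a+b)+(c+d)), both inner adds get label 2 and the root gets label 3, matching the textbook rule that equal-label children cost one extra register.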

Re: [Beignet] [PATCH 0/5] curbe register allocation refactor and optimization

2015-09-22 Thread Zhigang Gong
Ping for review.
Thanks.

On Mon, Sep 14, 2015 at 02:19:31PM +0800, Zhigang Gong wrote:
> This patch series fixes the hacky curbe register allocation.
> Before, we treated these registers in a totally different way from the
> other normal registers. We then did a lot of patch work in the backend
> stage to handle curbe registers first, even before interval computing,
> and thus had to allocate some unnecessary registers. This also
> introduced further overhead when preparing the payload values on the
> host side: for example, for a 1D kernel we may not need to prepare
> LOCAL_IDY and LOCAL_IDZ at all, but the previous implementation
> prepared them anyway.
> 
> This patchset normalizes those curbe registers with the normal
> registers and gathers information in the Gen IR stage as much as
> possible. Then we only need very tiny patch work at the backend stage,
> namely inserting the image information offset, and even that part
> could be eliminated in the future. We can also use complete liveness
> information when doing curbe payload register allocation, putting
> registers whose intervals end close together next to each other to
> reduce possible fragmentation, and we eliminate all of the unnecessary
> payload registers as much as possible.
> 
> This patchset changes btiUtil and zero/one into normal registers with
> correct liveness information. In most cases, this saves one or two
> registers.
> 
> This patchset also fixes one longjmp issue. The previous method, which
> counted basic block numbers, was too inaccurate.
> 
> This patchset is a preparation for the next patch set, which further
> optimizes register allocation.
> 
> 
> Zhigang Gong (5):
>   GBE: refactor curbe register allocation.
>   GBE: refine longjmp checking.
>   GBE: don't treat btiUtil as a curbe payload register.
>   GBE: don't always allocate ir::ocl::one/zero
>   GBE: we no longer need to allocate register from two directions.
> 
>  backend/src/backend/context.cpp|  14 ---
>  backend/src/backend/context.hpp|  20 +++-
>  backend/src/backend/gen8_context.cpp   |  10 +-
>  backend/src/backend/gen_context.cpp| 175 
> +
>  backend/src/backend/gen_context.hpp|   6 +-
>  backend/src/backend/gen_insn_selection.cpp | 158 +++---
>  backend/src/backend/gen_reg_allocation.cpp | 127 ++---
>  backend/src/backend/gen_reg_allocation.hpp |   2 +
>  backend/src/backend/program.h  |   6 +-
>  backend/src/ir/context.cpp |   7 +-
>  backend/src/ir/context.hpp |   3 +-
>  backend/src/ir/function.hpp|  36 +-
>  backend/src/ir/image.cpp   |   2 +-
>  backend/src/ir/instruction.hpp |   1 +
>  backend/src/ir/profile.cpp |  62 +-
>  backend/src/ir/profile.hpp |  11 +-
>  backend/src/ir/register.hpp|  58 --
>  src/cl_command_queue.c |   4 +-
>  src/cl_command_queue_gen7.c|  34 +++---
>  src/cl_kernel.c|  12 +-
>  20 files changed, 419 insertions(+), 329 deletions(-)
> 
> -- 
> 1.9.1
> 
> ___
> Beignet mailing list
> Beignet@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet
___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


Re: [Beignet] [PATCH] GBE: fix a zero/one's liveness bug.

2015-09-22 Thread Zhigang Gong
Ping for review.
Thanks.

On Mon, Sep 14, 2015 at 03:50:00PM +0800, Zhigang Gong wrote:
> This is a long-standing bug, exposed by my latest register
> allocation refinement patchset. ir::ocl::zero and ir::ocl::one are
> global registers, so we have to compute their liveness information
> carefully, not just take a local interval ID.
> 
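The liveness recomputation in the diff below is a standard backward worklist propagation: starting from the blocks where zero/one are used (and thus live-in), every predecessor is marked live-out, and live-in transitively, until a fixed point. A Python sketch under assumed data structures (block names and a `preds` map are made up for illustration; the real code walks `getPredecessorSet()` on `ir::BasicBlock`):

```python
# Sketch of the patch's liveInSet01/liveOutSet01 fixed-point loop.
def propagate_livein(seed_blocks, preds):
    """seed_blocks: blocks where the register is live-in.
    preds: block -> iterable of predecessor blocks."""
    live_in = set(seed_blocks)
    live_out = set()
    work = set(seed_blocks)
    while work:
        bb = work.pop()
        for pred in preds.get(bb, ()):
            live_out.add(pred)          # value must survive to the end of pred
            if pred in live_in:
                continue                # already processed: fixed point
            live_in.add(pred)
            work.add(pred)
    return live_in, live_out
```

The patch then extends the zero/one intervals' maxID to the last instruction ID of any block in the computed live-out set, which is what the bbLastInsnIDMap addition is for.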
> Signed-off-by: Zhigang Gong 
> ---
>  backend/src/backend/gen_reg_allocation.cpp | 29 +
>  1 file changed, 29 insertions(+)
> 
> diff --git a/backend/src/backend/gen_reg_allocation.cpp 
> b/backend/src/backend/gen_reg_allocation.cpp
> index bf2ac2b..f440747 100644
> --- a/backend/src/backend/gen_reg_allocation.cpp
> +++ b/backend/src/backend/gen_reg_allocation.cpp
> @@ -179,6 +179,8 @@ namespace gbe
>  SpilledRegs spilledRegs;
>  /*! register which could be spilled.*/
>  SpillCandidateSet spillCandidate;
> +/*! BBs last instruction ID map */
> +map bbLastInsnIDMap;
>  /* reserved registers for register spill/reload */
>  uint32_t reservedReg;
>  /*! Current vector to expire */
> @@ -505,6 +507,7 @@ namespace gbe
>  // policy is to spill the allocate flag which live to the last time end 
> point.
>  
>  // we have three flags we use for booleans f0.0 , f1.0 and f1.1
> +set liveInSet01;
>  for (auto &block : *selection.blockList) {
>// Store the registers allocated in the map
>map allocatedFlags;
> @@ -674,6 +677,7 @@ namespace gbe
>  sel0->src(0) = GenRegister::uw1grf(ir::ocl::one);
>  sel0->src(1) = GenRegister::uw1grf(ir::ocl::zero);
>  sel0->dst(0) = GET_FLAG_REG(insn);
> +liveInSet01.insert(insn.parent->bb);
>  insn.append(*sel0);
>  // We use the zero one after the liveness analysis, we have to 
> update
>  // the liveness data manually here.
> @@ -692,6 +696,30 @@ namespace gbe
>  }
>}
>  }
> +
> +// As we introduce two global variables, zero and one, we have to
> +// recompute their liveness information here!
> +if (liveInSet01.size()) {
> +  set liveOutSet01;
> +  set workSet(liveInSet01.begin(), 
> liveInSet01.end());
> +  while(workSet.size()) {
> +for(auto bb : workSet) {
> +  for(auto predBB : bb->getPredecessorSet()) {
> +liveOutSet01.insert(predBB);
> +if (liveInSet01.contains(predBB))
> +  continue;
> +liveInSet01.insert(predBB);
> +workSet.insert(predBB);
> +  }
> +  workSet.erase(bb);
> +}
> +  }
> +  int32_t maxID = 0;
> +  for(auto bb : liveOutSet01)
> +maxID = std::max(maxID, bbLastInsnIDMap.find(bb)->second);
> +  intervals[ir::ocl::zero].maxID = 
> std::max(intervals[ir::ocl::zero].maxID, maxID);
> +  intervals[ir::ocl::one].maxID = 
> std::max(intervals[ir::ocl::one].maxID, maxID);
> +}
>}
>  
>IVAR(OCL_SIMD16_SPILL_THRESHOLD, 0, 16, 256);
> @@ -1127,6 +1155,7 @@ namespace gbe
>  
>// All registers alive at the begining of the block must update their 
> intervals.
>const ir::BasicBlock *bb = block.bb;
> +  bbLastInsnIDMap.insert(std::make_pair(bb, lastID));
>for (auto reg : ctx.getLiveIn(bb))
>  this->intervals[reg].minID = std::min(this->intervals[reg].minID, 
> firstID);
>  
> -- 
> 1.9.1
> 


Re: [Beignet] [PATCH 2/2] Fix DRM Memory leak BUG

2015-09-22 Thread Zhigang Gong
On Tue, Sep 22, 2015 at 07:47:45AM +, Yang, Rong R wrote:
> The scenario of this memory leak doesn't involve the event's callback.
> The scenario is:
> 
> clEnqueueNDRangeKernel(., event1);
> clReleaseEvent(event1);
> clEnqueueNDRangeKernel(., event2);
> clReleaseEvent(event2);
> clEnqueueNDRangeKernel(., event3);
> clReleaseEvent(event3);
> 
> 
> The application creates events but doesn't use them.
> After the first clEnqueueNDRangeKernel, event1's ref count is 2, and the
> last event is event1.

> In the first clReleaseEvent, because the event hasn't completed, the
> event's ref count drops to 1, so the event is not deleted.
> After the 2nd clEnqueueNDRangeKernel, the last event points to event2.
> So neither the driver nor the application keeps track of event1, and
> event1 leaks.

Sigh, another bug caused by the weird event handling mechanism.

Now I know the root cause of this specific issue. Beignet doesn't have a
dedicated thread to maintain events. Because we want to avoid a busy wait
on each event release, we increase the event's reference counter when we
create it; when the user releases it, the counter just decreases to 1,
not zero. The event thus waits (without busy waiting) for the real
completion before it is released. But this implicitly requires an event
status update after the event completes.

If we add a busy wait to solve this issue, then everything we did to
avoid busy waiting in clReleaseEvent() becomes meaningless. If we want to
avoid busy waiting anyway, we need a new mechanism to track all events
and make sure every event gets a chance to be updated. So we need an
event list to track the obsolete events which have no users, and we need
to determine a proper point in time to update their status.

Be careful that these event lists are thread specific; each thread should
have its own.

I still suggest someone on the Beignet team rewrite this over-complicated
and error-prone event handling mechanism.

Thanks,
Zhigang Gong.
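The refcounting scheme described above can be modeled in a few lines. This is a toy model of the assumed semantics, not Beignet's actual C code: creation takes an extra driver-side reference that is only dropped on completion, so a user's clReleaseEvent() on a still-running event leaves the count at 1, and the event leaks if nothing ever delivers the completion update.

```python
# Toy model (assumed semantics) of Beignet's event refcounting.
class Event:
    destroyed = []  # record of freed events, for demonstration

    def __init__(self, name):
        self.name = name
        self.refs = 2          # one user reference + one driver reference
        self.complete = False

    def release(self):         # models the user's clReleaseEvent()
        self._unref()

    def mark_complete(self):   # models the driver-side status update
        if not self.complete:
            self.complete = True
            self._unref()      # drop the driver reference held since creation

    def _unref(self):
        self.refs -= 1
        if self.refs == 0:
            Event.destroyed.append(self.name)
```

With this model, `Event("event1").release()` leaves refs == 1: the leak scenario above is exactly the case where `mark_complete()` is never invoked because the queue has already moved on to a newer last event.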

> 
> I think one solution is: before updating the last event, check whether
> the current last event is waited on by any other event. If not, add the
> last event to a list, and update / delete the events in that list when
> the queue is flushed or deleted. This solution does not need the busy
> wait.

> 
> > -Original Message-
> > From: Beignet [mailto:beignet-boun...@lists.freedesktop.org] On Behalf Of
> > Zhigang Gong
> > Sent: Tuesday, September 22, 2015 13:21
> > To: Pan, Xiuli
> > Cc: beignet@lists.freedesktop.org
> > Subject: Re: [Beignet] [PATCH 2/2] Fix DRM Memory leak BUG
> > 
> > On Tue, Sep 22, 2015 at 04:51:41AM +, Pan, Xiuli wrote:
> > > I agree about the complex event handling, and maybe we should do that
> > > update somewhere else, but the leaked event is newed from
> > > clEnqueueNDRangeKernel and passed to the user, which is a very rare
> > > usage. As it is not a user event, the only chance for us to update the
> > > event status in last_event is here. If last_event is completed it will
> > > be deleted from the event update function; otherwise it will be lost
> > > and cause a leak, so we need to force an update here. Also, if the
> > > event completed before that, last_event should be NULL. I think if we
> > > kept events in a linked list like gpgpu buffers, maybe we could avoid
> > > the blocking update, but for now we may do a blocking update to make
> > > other things easier in these cases. We should have more tests for
> > > events, but the memory leak caused by this rare usage of events is now
> > > fixed.
> > 
> > One misunderstanding in the above analysis is that the event update
> > function itself never deletes any event. It just updates the event
> > status and checks all events in the wait lists; if any event's status
> > becomes complete, it will check the wait list recursively, and if any
> > completed event has a user callback function, it will call those
> > callback functions.
> > 
> > The reason why we will leak an event if we don't force an update here
> > is that an application may put the clReleaseEvent() into the event's
> > callback function. Otherwise, we will not leak any event, because the
> > user will call clReleaseEvent() explicitly. If the user doesn't do
> > that, then it's an application-level bug.
> > 
> > You could continue to track down the specific application to find out,
> > when you put such a force update there, how it helps release the
> > missing event.
> > 
> > Is the event released within Beignet internally? If so, what's the
> > code path? Is the event released in a user-registered callback
> > function? If so, how does that call back 

Re: [Beignet] [PATCH 2/2] Fix DRM Memory leak BUG

2015-09-21 Thread Zhigang Gong
One suggestion, you can use the conformance test suite's event
test cases to verify your modification.

On Tue, Sep 22, 2015 at 05:10:45AM +, Pan, Xiuli wrote:
> I have looked into the clWaitForEvents function and read the OCL spec;
> maybe the uncompleted event should not be in last_event at all. It
> should be finished in clWaitForEvents, then deleted and freed in
> clReleaseEvent. My patch may cause unexpected user callback behavior,
> since the event's actual finish time is random. I will look into
> clWaitForEvents and make some patches there. Thank you!
> 
> -Original Message-
> From: Zhigang Gong [mailto:zhigang.g...@linux.intel.com] 
> Sent: Tuesday, September 22, 2015 10:30 AM
> To: Pan, Xiuli 
> Cc: beignet@lists.freedesktop.org
> Subject: Re: [Beignet] [PATCH 2/2] Fix DRM Memory leak BUG
> 
> Nice catch! But this may not be the correct fix.
> We don't need to do the blocking event update all the time; we only need
> to do it when there is a potential possibility of leaking an event. An
> event that has a user callback function registered is such a case, and
> my best guess here is: one event in the wait list of the last event has
> a user callback function registered and has been missed.
> 
> We may need to check the whole wait list of the last event before we do
> a blocking event update here.
> 
> Thanks,
> Zhigang Gong.
> 
> On Mon, Sep 21, 2015 at 04:41:52PM +0800, Pan Xiuli wrote:
> > This bug is caused by event flush: we should not only run user events
> > but also events made by enqueue functions.
> > If the event hasn't completed before it is overwritten in last_event,
> > the related gpgpu buffer will not be unreferenced, which keeps all the
> > related drm buffers referenced and then leaks them.
> > 
> > Signed-off-by: Pan Xiuli 
> > ---
> >  src/cl_command_queue.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/src/cl_command_queue.c b/src/cl_command_queue.c index 
> > 4b92311..fd1d613 100644
> > --- a/src/cl_command_queue.c
> > +++ b/src/cl_command_queue.c
> > @@ -261,7 +261,7 @@ cl_command_queue_flush(cl_command_queue queue)
> >// the event any more. If we don't do this here, we will leak that event
> >// and all the corresponding buffers which is really bad.
> >cl_event last_event = get_last_event(queue);
> > -  if (last_event && last_event->user_cb)
> > +  if (last_event)
> >  cl_event_update_status(last_event, 1);
> >cl_event current_event = get_current_event(queue);
> >if (current_event && err == CL_SUCCESS) {
> > --
> > 2.1.4
> > 


Re: [Beignet] [PATCH 2/2] Fix DRM Memory leak BUG

2015-09-21 Thread Zhigang Gong
On Tue, Sep 22, 2015 at 04:51:41AM +, Pan, Xiuli wrote:
> I agree about the complex event handlings, and maybe we should do that update 
> somewhere else, but the leaked event is newed from clEnqueueNDRangeKernel and 
> passed to user and it is a very rare usage. As it is not a user event, and 
> the only chance for us to update the event status  in the last_event is here. 
> If the last_event is completed it will be deleted from the event update 
> function, otherwise it will be lost and cause leak, so we need to force it 
> updating here. Also if the event is completed before that, the last_event 
> should be NULL. I think if we did it like gpgpu in a linked list, maybe we 
> could not do blocking update, but now we may do a block update to make other 
> things easier in these cases. We should have more tests about the events, but 
> now the memory leak caused by rare usage of event is now be fixed.

One misunderstanding in the above analysis is that the event update
function itself never deletes any event. It just updates the event status
and checks all events in the wait lists; if any event's status becomes
complete, it will check the wait list recursively, and if any completed
event has a user callback function, it will call those callback
functions.

The reason why we will leak an event if we don't force an update here is
that an application may put the clReleaseEvent() into the event's
callback function. Otherwise, we will not leak any event, because the
user will call clReleaseEvent() explicitly. If the user doesn't do that,
then it's an application-level bug.

You could continue to track down the specific application to find out,
when you put such a force update there, how it helps release the missing
event.

Is the event released within Beignet internally? If so, what's the code
path? Is the event released in a user-registered callback function? If
so, how does that callback function get missed?

cl_command_queue_flush() is called from almost all the enqueue functions.
Adding an almost unconditional (it only checks the last event) blocking
event wait here is really not a good idea.

Thanks,
Zhigang Gong.

> 
> The rare usage of event from the PSieve-CUDA case:
>   checkCUDAErr(clEnqueueReadBuffer(commandQueue,
> d_factor_found,
> CL_TRUE,
> 0,
> cthread_count*sizeof(cl_uint),
> factor_found,
> 0,
> NULL,
> &dev_read_event), "Retrieving results");
> // It gets the ReadBuffer event here, as well as the NDRangeKernel event.
> 
>   checkCUDAErr(clWaitForEvents(1, &dev_read_event), "Waiting for results 
> read. (clWaitForEvents)");
>   checkCUDAErr(clReleaseEvent(dev_read_event), "Release event object 3. 
> (clReleaseEvent)");
> 
> // Then it waits for and releases the event, which is very different from our usage.
> 
> I will take a deeper look at this usage path. Thank you for your advice.
> 
> 
> 
> -Original Message-
> From: Zhigang Gong [mailto:zhigang.g...@linux.intel.com] 
> Sent: Tuesday, September 22, 2015 10:30 AM
> To: Pan, Xiuli 
> Cc: beignet@lists.freedesktop.org
> Subject: Re: [Beignet] [PATCH 2/2] Fix DRM Memory leak BUG
> 
> Nice catch! But this may not be the correct fix.
> We don't need to do the blocking event update all the time.
> We only need to do it when there is a potential possibility of leaking
> an event. An event that has a user callback function registered is such
> a case, and my best guess here is: one event in the wait list of the
> last event has a user callback function registered and has been missed.
> 
> We may need to check the whole wait list of the last event before we do
> a blocking event update here.
> 
> Thanks,
> Zhigang Gong.
> 
> On Mon, Sep 21, 2015 at 04:41:52PM +0800, Pan Xiuli wrote:
> > This bug is caused by event flush: we should not only run user events
> > but also events made by enqueue functions.
> > If the event hasn't completed before it is overwritten in last_event,
> > the related gpgpu buffer will not be unreferenced, which keeps all the
> > related drm buffers referenced and then leaks them.
> > 
> > Signed-off-by: Pan Xiuli 
> > ---
> >  src/cl_command_queue.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/src/cl_command_queue.c b/src/cl_command_queue.c index 
> > 4b92311..fd1d613 100644
> > --- a/src/cl_command_queue.c
> > +++ b/src/cl_command_queue.c
> > @@ -261,7 +261,7 @@ cl_command_queue_flush(cl_command_queue queue)
> >// the event any more. If we don't do this here, we will leak that event
> >// and all the corresponding buffers which is really bad.
> >cl_event last_ev

Re: [Beignet] [PATCH 2/2] Fix DRM Memory leak BUG

2015-09-21 Thread Zhigang Gong
Nice catch! But this may not be the correct fix.
We don't need to do the blocking event update all the time.
We only need to do it when there is a potential possibility
of leaking an event. An event that has a user callback function
registered is such a case, and my best guess here is:
one event in the wait list of the last event has a
user callback function registered and has been missed.

We may need to check the whole wait list of the last event
before we do a blocking event update here.

Thanks,
Zhigang Gong.

On Mon, Sep 21, 2015 at 04:41:52PM +0800, Pan Xiuli wrote:
> This bug is caused by event flush: we should not only run user events
> but also events made by enqueue functions.
> If the event hasn't completed before it is overwritten in last_event,
> the related gpgpu buffer will not be unreferenced, which keeps all the
> related drm buffers referenced and then leaks them.
> 
> Signed-off-by: Pan Xiuli 
> ---
>  src/cl_command_queue.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/src/cl_command_queue.c b/src/cl_command_queue.c
> index 4b92311..fd1d613 100644
> --- a/src/cl_command_queue.c
> +++ b/src/cl_command_queue.c
> @@ -261,7 +261,7 @@ cl_command_queue_flush(cl_command_queue queue)
>// the event any more. If we don't do this here, we will leak that event
>// and all the corresponding buffers which is really bad.
>cl_event last_event = get_last_event(queue);
> -  if (last_event && last_event->user_cb)
> +  if (last_event)
>  cl_event_update_status(last_event, 1);
>cl_event current_event = get_current_event(queue);
>if (current_event && err == CL_SUCCESS) {
> -- 
> 2.1.4
> 


Re: [Beignet] [PATCH] GBE: avoid vector registers when there is high register pressure.

2015-09-20 Thread Zhigang Gong
On Mon, Sep 21, 2015 at 11:11:48AM +0800, Ruiling Song wrote:
> It makes sense to use a short-lived vector register under register pressure.
> But please also remove the comment
> "// If an element has very long interval, we don't want to put it into a
>  // vector as it will add more pressure to the register allocation"

Right, the comments are out-of-date now and need to be removed.
Thanks,
Zhigang Gong.

> 
> Thanks!
> Ruiling
> 2015-09-17 8:39 GMT+08:00 Zhigang Gong :
> 
> > Ping for review.
> > Thanks.
> >
> > On Sun, Sep 06, 2015 at 05:21:29PM +0800, Zhigang Gong wrote:
> > > If reservedSpillRegs is not zero, it indicates we are under very
> > > high register pressure. Using a register vector will likely
> > > increase that pressure and cause a significant performance
> > > problem, which is much worse than using a short-lived temporary
> > > vector register with several additional MOVs.
> > >
> > > So let's simply avoid using vector registers and just use a
> > > temporary short-live-interval vector.
> > >
> > > Signed-off-by: Zhigang Gong 
> > > ---
> > >  backend/src/backend/gen_reg_allocation.cpp | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/backend/src/backend/gen_reg_allocation.cpp
> > b/backend/src/backend/gen_reg_allocation.cpp
> > > index 39f1934..36ad914 100644
> > > --- a/backend/src/backend/gen_reg_allocation.cpp
> > > +++ b/backend/src/backend/gen_reg_allocation.cpp
> > > @@ -318,7 +318,7 @@ namespace gbe
> > >if (it == vectorMap.end() &&
> > >ctx.sel->isScalarReg(reg) == false &&
> > >ctx.isSpecialReg(reg) == false &&
> > > -  (intervals[reg].maxID - intervals[reg].minID) < 2048)
> > > +  ctx.reservedSpillRegs == 0 )
> > >{
> > >  const VectorLocation location = std::make_pair(vector, regID);
> > >  this->vectorMap.insert(std::make_pair(reg, location));
> > > --
> > > 1.9.1
> > >


Re: [Beignet] [PATCH 5/5] GBE: implement further phi mov optimization based on intra-BB interefering analysis.

2015-09-20 Thread Zhigang Gong
ySrcDef = dag->getRegDef(phiCopySrc);
> > +  const ir::UseSet *phiCopySrcUse = dag->getRegUse(phiCopySrc);
> > +  for (auto &s : *phiCopySrcDef) {
> > +const Instruction *phiSrcDefInsn = s->getInstruction();
> > +if (phiSrcDefInsn->getOpcode() == ir::OP_MOV &&
> > +phiSrcDefInsn->getSrc(0) == phiCopy) {
> > +   const_cast(phiSrcDefInsn)->remove();
> > +   continue;
> > +}
> > +replaceDst(const_cast(phiSrcDefInsn), 
> > phiCopySrc,
> > phiCopy);
> > +  }
> > +
> > +  for (auto &s : *phiCopySrcUse) {
> > +const Instruction *phiSrcUseInsn = s->getInstruction();
> > +if (phiSrcUseInsn->getOpcode() == ir::OP_MOV &&
> > +phiSrcUseInsn->getDst(0) == phiCopy) {
> > +   const_cast(phiSrcUseInsn)->remove();
> > +   continue;
> > +}
> > +replaceSrc(const_cast(phiSrcUseInsn), 
> > phiCopySrc,
> > phiCopy);
> > +  }
> > +
> > +  replacedRegs.insert(std::make_pair(phiCopySrc, phiCopy));
> > +  revReplacedRegs.insert(std::make_pair(phiCopy, phiCopySrc));
> > +  curRedundant->erase(phiCopySrc);
> >  }
> >}
> > +
> 
> 
> And again, the code block below is a little hard for me. I don't know
> what it is used for.
Also refer to my comments above. If we got some registers replaced in
this round of optimization, then we need to update the remaining
curRedundant pairs and prepare for the next round of optimization.

This complexity could be resolved later when I implement dynamic DAG
updating, just like the liveness information update method. Let's just
do it step by step.

Thanks for the careful review comments.
Zhigang Gong.

> > +  if (replacedRegs.size() != 0) {
> > +liveness.replaceRegs(replacedRegs);
> > +for (auto &pair : *curRedundant) {
> > +  auto from = pair.first;
> > +  auto to = pair.second;
> > +  bool revisit = false;
> > +  if (replacedRegs.find(pair.second) != replacedRegs.end()) {
> > +to = replacedRegs.find(to)->second;
> > +revisit = true;
> > +  }
> > +  if (revReplacedRegs.find(from) != revReplacedRegs.end() ||
> > +  revReplacedRegs.find(to) != revReplacedRegs.end())
> > +revisit = true;
> > +  if (revisit)
> > +nextRedundant->insert(std::make_pair(from, to));
> > +}
> > +std::swap(curRedundant, nextRedundant);
> > +  } else
> > +break;
> > +
> > +  break;
> > +  nextRedundant->clear();
> > +  replacedRegs.clear();
> > +  revReplacedRegs.clear();
> > +  delete dag;
> > +  dag = new ir::FunctionDAG(liveness);
> >  }
> >  delete dag;
> >}
> > @@ -2754,8 +2872,12 @@ namespace gbe
> >  ir::Liveness liveness(fn);
> > 
> 


Re: [Beignet] [PATCH 3/5] GBE: add two helper routines for liveness partially update.

2015-09-20 Thread Zhigang Gong
On Mon, Sep 21, 2015 at 02:51:26AM +, Song, Ruiling wrote:
> > +
> > +  void Liveness::replaceRegs(const map &replaceMap)
> > + {
> > +
> > +for (auto &pair : liveness) {
> > +  BlockInfo &info = *pair.second;
> > +  BasicBlock *bb = const_cast(&info.bb);
> > +  for (auto &pair : replaceMap) {
> > +Register from = pair.first;
> > +Register to = pair.second;
> > +if (info.liveOut.contains(from)) {
> > +  info.liveOut.erase(from);
> > +  info.liveOut.insert(to);
> Why do we need to insert into definedPhiRegs? The other parts LGTM.
The replacement "from --> to" indicates that "to" has multiple
definitions. We use definedPhiRegs to track this type of value and avoid
treating it as uniform later. It's a little bit hacky and we may need to
rewrite the uniform analysis completely in the future. But for now,
I prefer to keep it as is.

Thanks,
Zhigang Gong. 

> 
> Thanks!
> Ruiling
> 
> > +  bb->definedPhiRegs.insert(to);
> > +}
> > +if (info.upwardUsed.contains(from)) {
> > +  info.upwardUsed.erase(from);
> > +  info.upwardUsed.insert(to);
> > +}
> > +  }
> > +}
> > +  }
> > +
> 


Re: [Beignet] [PATCH] GBE: avoid vector registers when there is high register pressure.

2015-09-16 Thread Zhigang Gong
Ping for review.
Thanks.

On Sun, Sep 06, 2015 at 05:21:29PM +0800, Zhigang Gong wrote:
> If reservedSpillRegs is not zero, it indicates we are under very
> high register pressure. Using a register vector will likely
> increase that pressure and cause a significant performance
> problem, which is much worse than using a short-lived temporary
> vector register with several additional MOVs.
> 
> So let's simply avoid using vector registers and just use a
> temporary short-live-interval vector.
> 
> Signed-off-by: Zhigang Gong 
> ---
>  backend/src/backend/gen_reg_allocation.cpp | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/backend/src/backend/gen_reg_allocation.cpp 
> b/backend/src/backend/gen_reg_allocation.cpp
> index 39f1934..36ad914 100644
> --- a/backend/src/backend/gen_reg_allocation.cpp
> +++ b/backend/src/backend/gen_reg_allocation.cpp
> @@ -318,7 +318,7 @@ namespace gbe
>if (it == vectorMap.end() &&
>ctx.sel->isScalarReg(reg) == false &&
>ctx.isSpecialReg(reg) == false &&
> -  (intervals[reg].maxID - intervals[reg].minID) < 2048)
> +  ctx.reservedSpillRegs == 0 )
>{
>  const VectorLocation location = std::make_pair(vector, regID);
>  this->vectorMap.insert(std::make_pair(reg, location));
> -- 
> 1.9.1
> 


Re: [Beignet] [PATCH 1/2] GBE: continue to refine interfering check.

2015-09-16 Thread Zhigang Gong
Ping for review.
Thanks.

On Sun, Sep 06, 2015 at 03:05:00PM +0800, Zhigang Gong wrote:
> A more aggressive interfering check: even if both registers are in the
> livein set or the liveout set, it is still possible that they do not
> interfere with each other.
> 
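The refined live-in rule implemented by interfereLivein below can be sketched compactly. This is an illustrative Python toy, not the Beignet C++ (the instruction tuples and the single-block scan are simplifications): starting from each redefinition of one register, scan forward; a use of the other register before it is itself redefined means the two values overlap, while a plain copy `MOV r0, r1` does not.

```python
# Sketch of the refined interference rule for two live-in registers.
def livein_interferes(block, r0, r1):
    """block: list of (op, dsts, srcs) tuples in program order."""
    def interferes_after_def(redef, other):
        for i, (op, dsts, srcs) in enumerate(block):
            if redef not in dsts:
                continue
            if op == "MOV" and srcs == [other]:
                continue  # copy `redef = other`: the values coincide, no conflict
            for _, d2, s2 in block[i:]:
                if other in d2:
                    break          # `other` redefined: its old value is dead
                if other in s2:
                    return True    # `other` read after `redef` was clobbered
        return False
    return interferes_after_def(r0, r1) or interferes_after_def(r1, r0)
```

The copy special case is what lets the phi-mov coalescing in the related patches merge registers that a naive "both in the live-in set" test would keep apart.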
> Signed-off-by: Zhigang Gong 
> ---
>  backend/src/ir/value.cpp | 117 
> ++-
>  backend/src/ir/value.hpp |   5 +-
>  2 files changed, 109 insertions(+), 13 deletions(-)
> 
> diff --git a/backend/src/ir/value.cpp b/backend/src/ir/value.cpp
> index 19ecabf..75a100f 100644
> --- a/backend/src/ir/value.cpp
> +++ b/backend/src/ir/value.cpp
> @@ -577,6 +577,97 @@ namespace ir {
>  }
>}
>  
> +  static void getBlockDefInsns(const BasicBlock *bb, const DefSet *dSet, 
> Register r, set  &defInsns) {
> +for (auto def : *dSet) {
> +  auto defInsn = def->getInstruction();
> +  if (defInsn->getParent() == bb)
> +defInsns.insert(defInsn);
> +}
> +  }
> +
> +  static bool liveinInterfere(const BasicBlock *bb, const Instruction 
> *defInsn, Register r1) {
> +BasicBlock::const_iterator iter = BasicBlock::const_iterator(defInsn);
> +BasicBlock::const_iterator iterE = bb->end();
> +
> +if (defInsn->getOpcode() == OP_MOV &&
> +defInsn->getSrc(0) == r1)
> +  return false;
> +while (iter != iterE) {
> +  const Instruction *insn = iter.node();
> +  for (unsigned i = 0; i < insn->getDstNum(); i++) {
> +Register dst = insn->getDst(i);
> +if (dst == r1)
> +  return false;
> +  }
> +  for (unsigned i = 0; i < insn->getSrcNum(); i++) {
> +ir::Register src = insn->getSrc(i);
> +if (src == r1)
> +  return true;
> +  }
> +  ++iter;
> +}
> +
> +return false;
> +  }
> +
> +  // r0 and r1 both are in Livein set.
> +  // Only if r0/r1 is used after r1/r0 has been modified.
> +  bool FunctionDAG::interfereLivein(const BasicBlock *bb, Register r0, 
> Register r1) const {
> +set  defInsns0, defInsns1;
> +auto defSet0 = getRegDef(r0);
> +auto defSet1 = getRegDef(r1);
> +getBlockDefInsns(bb, defSet0, r0, defInsns0);
> +getBlockDefInsns(bb, defSet1, r1, defInsns1);
> +if (defInsns0.size() == 0 && defInsns1.size() == 0)
> +  return false;
> +
> +for (auto insn : defInsns0) {
> +  if (liveinInterfere(bb, insn, r1))
> +return true;
> +}
> +
> +for (auto insn : defInsns1) {
> +  if (liveinInterfere(bb, insn, r0))
> +return true;
> +}
> +return false;
> +  }
> +
> +  // r0 and r1 both are in Liveout set.
> +  // Only if the last definition of r0/r1 is a MOV r0, r1 or MOV r1, r0,
> +  // it will not introduce interfering in this BB.
> +  bool FunctionDAG::interfereLiveout(const BasicBlock *bb, Register r0, 
> Register r1) const {
> +set  defInsns0, defInsns1;
> +auto defSet0 = getRegDef(r0);
> +auto defSet1 = getRegDef(r1);
> +getBlockDefInsns(bb, defSet0, r0, defInsns0);
> +getBlockDefInsns(bb, defSet1, r1, defInsns1);
> +if (defInsns0.size() == 0 && defInsns1.size() == 0)
> +  return false;
> +
> +BasicBlock::const_iterator iter = --bb->end();
> +BasicBlock::const_iterator iterE = bb->begin();
> +do {
> +  const Instruction *insn = iter.node();
> +  for (unsigned i = 0; i < insn->getDstNum(); i++) {
> +Register dst = insn->getDst(i);
> +if (dst == r0 || dst == r1) {
> +  if (insn->getOpcode() != OP_MOV)
> +return true;
> +  if (dst == r0 && insn->getSrc(0) != r1)
> +return true;
> +  if (dst == r1 && insn->getSrc(0) != r0)
> +return true;
> +  return false;
> +}
> +  }
> +  --iter;
> +} while (iter != iterE);
> +return false;
> +  }
> +
> +  // Check instructions after the def of r0: if there is any def of r1, then
> +  // there is no interference for this range. Otherwise, if there is any use
> +  // of r1, return true.
>bool FunctionDAG::interfere(const BasicBlock *bb, Register inReg, Register 
> outReg) const {
>  auto dSet = getRegDef(outReg);
>  bool visited = false;
> @@ -608,13 +699,13 @@ namespace ir {
>}
>  
>bool FunctionDAG::interfere(const Liveness &liveness, Register r0, 
> Register r1) const {
> -// There are two interfering cases:
> -//   1. Two registers are in the Livein set of the same BB.
> -//   2. Two registers are in the Liveout set of the same BB.
>  // If there are no any int

Re: [Beignet] [PATCH 1/5] GBE: refine Phi copy interfering check.

2015-09-16 Thread Zhigang Gong
Ping for review.
Thanks.

On Tue, Sep 01, 2015 at 12:04:59PM +0800, Zhigang Gong wrote:
> If the PHI source register's definition instruction uses the
> phi register, it does not interfere. For example:
> 
> MOV %phi, %phicopy
> ...
> ADD %phiSrcDef, %phi, tmp
> ...
> MOV %phicopy, %phiSrcDef
> ...
> 
> %phi and %phiSrcDef do not interfere with each other.
> Simply advancing the start of the check to the next instruction
> is enough to get a better result. For some special cases, this
> patch can yield a significant performance boost.
> 
> Signed-off-by: Zhigang Gong 
> ---
>  backend/src/llvm/llvm_gen_backend.cpp | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/backend/src/llvm/llvm_gen_backend.cpp 
> b/backend/src/llvm/llvm_gen_backend.cpp
> index 4905415..38c63ce 100644
> --- a/backend/src/llvm/llvm_gen_backend.cpp
> +++ b/backend/src/llvm/llvm_gen_backend.cpp
> @@ -2220,6 +2220,8 @@ namespace gbe
>  
>  ir::BasicBlock::const_iterator iter = 
> ir::BasicBlock::const_iterator(phiCopySrcDefInsn);
>  ir::BasicBlock::const_iterator iterE = bb->end();
> +
> +iter++;
>  // check no use of phi in this basicblock between [phiCopySrc 
> def, bb end]
>  bool phiPhiCopySrcInterfere = false;
>  while (iter != iterE) {
> -- 
> 1.9.1
> 
> ___
> Beignet mailing list
> Beignet@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet
___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


[Beignet] [PATCH] GBE: implement pre-register-allocation instruction scheduling.

2015-09-15 Thread Zhigang Gong
Finding an instruction scheduling policy that achieves the theoretical
minimum number of registers required in a basic block is an NP-hard
problem, so we have to use heuristics to simplify the algorithm. Much
research indicates that bottom-up list scheduling is much better than
the top-down method in terms of register pressure. I chose one such
research paper as our reference:

"Register-Sensitive Selection, Duplication, and Sequencing of Instructions"

It uses bottom-up list scheduling with a Sethi-Ullman label as the
heuristic number. As we do cycle-aware scheduling after register
allocation, we don't need a cycle-related heuristic here, so I skipped
the EST (earliest start time) computation and its use in the algorithm.

It turns out this algorithm works well: it reduces register spilling in
clBlas's sgemmBlock kernel from 83+ to only 20.

Although this scheduling method seems to lower ILP (instruction-level
parallelism), that is not a big issue: we allocate as many different
registers as possible in the following register allocation stage, and a
post-allocation instruction scheduling pass tries to recover as much
ILP as possible.

Signed-off-by: Zhigang Gong 
---
 backend/src/backend/gen_insn_scheduling.cpp | 137 +++-
 1 file changed, 116 insertions(+), 21 deletions(-)

diff --git a/backend/src/backend/gen_insn_scheduling.cpp 
b/backend/src/backend/gen_insn_scheduling.cpp
index 358a2ce..f4f1e70 100644
--- a/backend/src/backend/gen_insn_scheduling.cpp
+++ b/backend/src/backend/gen_insn_scheduling.cpp
@@ -41,26 +41,29 @@
  * ==
  *
  * We try to limit the register pressure.
- * Well, this is a hard problem and we have a decent strategy now that we 
called
- * "zero cycled LIFO scheduling".
- * We use a local forward list scheduling and we schedule the instructions in a
- * LIFO order i.e. as a stack. Basically, we take the most recent instruction
- * and schedule it right away. Obviously we ignore completely the real 
latencies
- * and throuputs and just simulate instructions that are issued and completed 
in
- * zero cycle. For the complex kernels we already have (like menger sponge),
- * this provides a pretty good strategy enabling SIMD16 code generation where
- * when scheduling is deactivated, even SIMD8 fails
  *
- * One may argue that this strategy is bad, latency wise. This is not true 
since
- * the register allocator will anyway try to burn as many registers as 
possible.
- * So, there is still opportunities to schedule after register allocation.
+ * To find out an instruction scheduling policy to achieve the theoretical 
minimum
+ * registers required in a basic block is a NP problem. We have to use some 
heuristic
+ * factor to simplify the algorithm. There are many researchs which indicate a
+ * bottom-up list scheduling is much better than the top-down method in turns 
of
+ * register pressure.  I choose one of such research paper as our target. The 
paper
+ * is as below:
  *
- * Our idea seems to work decently. There is however a strong research article
- * that is able to near-optimally reschudle the instructions to minimize
- * register use. This is:
+ * "Register-Sensitive Selection, Duplication, and Sequencing of Instructions"
+ * It use the bottom-up list scheduling with a Sethi-Ullman label as an
+ * heuristic number. As we will do cycle awareness scheduling after the 
register
+ * allocation, we don't need to bother with cycle related heuristic number 
here.
+ * I just skipped the EST computing and usage part in the algorithm.
  *
- * "Minimum Register Instruction Sequence Problem: Revisiting Optimal Code
- *  Generation for DAGs"
+ * It turns out this algorithm works well. It could reduce the register 
spilling
+ * in clBlas's sgemmBlock kernel from 83+ to only 20.
+ *
+ * Although this scheduling method seems to be lowering the ILP(instruction 
level parallism).
+ * It's not a big issue, because we will allocate as much as possible 
different registers
+ * in the following register allocation stage, and we will do a after 
allocation
+ * instruction scheduling which will try to get as much ILP as possible.
+ *
+ * FIXME: we only need to do this scheduling when a BB is really under high 
register pressure.
  *
  * After the register allocation
  * ==
@@ -114,7 +117,7 @@ namespace gbe
   struct ScheduleDAGNode
   {
 INLINE ScheduleDAGNode(SelectionInstruction &insn) :
-  insn(insn), refNum(0), retiredCycle(0), preRetired(false), 
readDistance(0x7fff) {}
+  insn(insn), refNum(0), depNum(0), retiredCycle(0), preRetired(false), 
readDistance(0x7fff) {}
 bool dependsOn(ScheduleDAGNode *node) const {
   GBE_ASSERT(node != NULL);
   for (auto child : node->children)
@@ -128,6 +131,10 @@ namespace gbe
 Selecti

Re: [Beignet] [PATCH 5/5] GBE: we no longer need to allocate register from two directions.

2015-09-14 Thread Zhigang Gong
It turns out that the issue was not caused by this patch, so this patch is good 
to go.
I already submitted another patch to fix that liveness bug.

Thanks,
Zhigang Gong.

> -Original Message-
> From: Beignet [mailto:beignet-boun...@lists.freedesktop.org] On Behalf Of
> Zhigang Gong
> Sent: Monday, September 14, 2015 2:28 PM
> To: Zhigang Gong
> Cc: beignet@lists.freedesktop.org
> Subject: Re: [Beignet] [PATCH 5/5] GBE: we no longer need to allocate register
> from two directions.
> 
> Please ignore this patch, it seems there are some issues after this change.
> I will look into it and send it again when things got fixed.
> 
> Thanks,
> Zhigang Gong.
> 
> On Mon, Sep 14, 2015 at 02:19:36PM +0800, Zhigang Gong wrote:
> > Signed-off-by: Zhigang Gong 
> > ---
> >  backend/src/backend/context.hpp| 2 +-
> >  backend/src/backend/gen_reg_allocation.cpp | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/backend/src/backend/context.hpp
> > b/backend/src/backend/context.hpp index e1f5a71..04bcf43 100644
> > --- a/backend/src/backend/context.hpp
> > +++ b/backend/src/backend/context.hpp
> > @@ -85,7 +85,7 @@ namespace gbe
> >return JIPs.find(insn) != JIPs.end();
> >  }
> >  /*! Allocate some memory in the register file */
> > -int16_t allocate(int16_t size, int16_t alignment, bool bFwd=0);
> > +int16_t allocate(int16_t size, int16_t alignment, bool bFwd =
> > + true);
> >  /*! Deallocate previously allocated memory */
> >  void deallocate(int16_t offset);
> >  /*! Spilt a block into 2 blocks, for some registers allocate
> > together but  deallocate seperate */ diff --git
> > a/backend/src/backend/gen_reg_allocation.cpp
> > b/backend/src/backend/gen_reg_allocation.cpp
> > index 06f6cc7..bf2ac2b 100644
> > --- a/backend/src/backend/gen_reg_allocation.cpp
> > +++ b/backend/src/backend/gen_reg_allocation.cpp
> > @@ -1020,7 +1020,7 @@ namespace gbe
> >  using namespace ir;
> >
> >  if (ctx.reservedSpillRegs != 0) {
> > -  reservedReg = ctx.allocate(ctx.reservedSpillRegs * GEN_REG_SIZE,
> GEN_REG_SIZE);
> > +  reservedReg = ctx.allocate(ctx.reservedSpillRegs *
> > + GEN_REG_SIZE, GEN_REG_SIZE, false);
> >reservedReg /= GEN_REG_SIZE;
> >  } else {
> >reservedReg = 0;
> > --
> > 1.9.1
> >
> > ___
> > Beignet mailing list
> > Beignet@lists.freedesktop.org
> > http://lists.freedesktop.org/mailman/listinfo/beignet
> ___
> Beignet mailing list
> Beignet@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet

___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


[Beignet] [PATCH] GBE: fix a zero/one's liveness bug.

2015-09-14 Thread Zhigang Gong
This is a long-standing bug, exposed by my latest register
allocation refinement patchset. ir::ocl::zero and ir::ocl::one are
global registers; we have to compute their liveness information
carefully, not just take a local interval ID.

Signed-off-by: Zhigang Gong 
---
 backend/src/backend/gen_reg_allocation.cpp | 29 +
 1 file changed, 29 insertions(+)

diff --git a/backend/src/backend/gen_reg_allocation.cpp 
b/backend/src/backend/gen_reg_allocation.cpp
index bf2ac2b..f440747 100644
--- a/backend/src/backend/gen_reg_allocation.cpp
+++ b/backend/src/backend/gen_reg_allocation.cpp
@@ -179,6 +179,8 @@ namespace gbe
 SpilledRegs spilledRegs;
 /*! register which could be spilled.*/
 SpillCandidateSet spillCandidate;
+/*! BBs last instruction ID map */
+map bbLastInsnIDMap;
 /* reserved registers for register spill/reload */
 uint32_t reservedReg;
 /*! Current vector to expire */
@@ -505,6 +507,7 @@ namespace gbe
 // policy is to spill the allocate flag which live to the last time end 
point.
 
 // we have three flags we use for booleans f0.0 , f1.0 and f1.1
+set liveInSet01;
 for (auto &block : *selection.blockList) {
   // Store the registers allocated in the map
   map allocatedFlags;
@@ -674,6 +677,7 @@ namespace gbe
 sel0->src(0) = GenRegister::uw1grf(ir::ocl::one);
 sel0->src(1) = GenRegister::uw1grf(ir::ocl::zero);
 sel0->dst(0) = GET_FLAG_REG(insn);
+liveInSet01.insert(insn.parent->bb);
 insn.append(*sel0);
 // We use the zero one after the liveness analysis, we have to 
update
 // the liveness data manually here.
@@ -692,6 +696,30 @@ namespace gbe
 }
   }
 }
+
+// As we introduce two global variables zero and one, we have to
+// recompute its liveness information here!
+if (liveInSet01.size()) {
+  set liveOutSet01;
+  set workSet(liveInSet01.begin(), 
liveInSet01.end());
+  while(workSet.size()) {
+for(auto bb : workSet) {
+  for(auto predBB : bb->getPredecessorSet()) {
+liveOutSet01.insert(predBB);
+if (liveInSet01.contains(predBB))
+  continue;
+liveInSet01.insert(predBB);
+workSet.insert(predBB);
+  }
+  workSet.erase(bb);
+}
+  }
+  int32_t maxID = 0;
+  for(auto bb : liveOutSet01)
+maxID = std::max(maxID, bbLastInsnIDMap.find(bb)->second);
+  intervals[ir::ocl::zero].maxID = 
std::max(intervals[ir::ocl::zero].maxID, maxID);
+  intervals[ir::ocl::one].maxID = std::max(intervals[ir::ocl::one].maxID, 
maxID);
+}
   }
 
   IVAR(OCL_SIMD16_SPILL_THRESHOLD, 0, 16, 256);
@@ -1127,6 +1155,7 @@ namespace gbe
 
   // All registers alive at the begining of the block must update their 
intervals.
   const ir::BasicBlock *bb = block.bb;
+  bbLastInsnIDMap.insert(std::make_pair(bb, lastID));
   for (auto reg : ctx.getLiveIn(bb))
 this->intervals[reg].minID = std::min(this->intervals[reg].minID, 
firstID);
 
-- 
1.9.1

___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


Re: [Beignet] [PATCH 5/5] GBE: we no longer need to allocate register from two directions.

2015-09-14 Thread Zhigang Gong
Please ignore this patch, it seems there are some issues after this change.
I will look into it and send it again when things got fixed.

Thanks,
Zhigang Gong.

On Mon, Sep 14, 2015 at 02:19:36PM +0800, Zhigang Gong wrote:
> Signed-off-by: Zhigang Gong 
> ---
>  backend/src/backend/context.hpp| 2 +-
>  backend/src/backend/gen_reg_allocation.cpp | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/backend/src/backend/context.hpp b/backend/src/backend/context.hpp
> index e1f5a71..04bcf43 100644
> --- a/backend/src/backend/context.hpp
> +++ b/backend/src/backend/context.hpp
> @@ -85,7 +85,7 @@ namespace gbe
>return JIPs.find(insn) != JIPs.end();
>  }
>  /*! Allocate some memory in the register file */
> -int16_t allocate(int16_t size, int16_t alignment, bool bFwd=0);
> +int16_t allocate(int16_t size, int16_t alignment, bool bFwd = true);
>  /*! Deallocate previously allocated memory */
>  void deallocate(int16_t offset);
>  /*! Spilt a block into 2 blocks, for some registers allocate together 
> but  deallocate seperate */
> diff --git a/backend/src/backend/gen_reg_allocation.cpp 
> b/backend/src/backend/gen_reg_allocation.cpp
> index 06f6cc7..bf2ac2b 100644
> --- a/backend/src/backend/gen_reg_allocation.cpp
> +++ b/backend/src/backend/gen_reg_allocation.cpp
> @@ -1020,7 +1020,7 @@ namespace gbe
>  using namespace ir;
>  
>  if (ctx.reservedSpillRegs != 0) {
> -  reservedReg = ctx.allocate(ctx.reservedSpillRegs * GEN_REG_SIZE, 
> GEN_REG_SIZE);
> +  reservedReg = ctx.allocate(ctx.reservedSpillRegs * GEN_REG_SIZE, 
> GEN_REG_SIZE, false);
>reservedReg /= GEN_REG_SIZE;
>  } else {
>reservedReg = 0;
> -- 
> 1.9.1
> 
> ___
> Beignet mailing list
> Beignet@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet
___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


Re: [Beignet] [PATCH] remove register name which is no longer there

2015-09-14 Thread Zhigang Gong
This patch LGTM, and my patchset includes this change.
I will rebase after both of them got reviewed.

Thanks.

On Mon, Sep 14, 2015 at 06:56:37AM +0800, Guo Yejun wrote:
> 8b9672ae40 removed the register laneid and should have removed the
> name in the same patch, but missed it.
> 
> Signed-off-by: Guo Yejun 
> ---
>  backend/src/ir/profile.cpp | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/backend/src/ir/profile.cpp b/backend/src/ir/profile.cpp
> index 37f2d3d..eed7e81 100644
> --- a/backend/src/ir/profile.cpp
> +++ b/backend/src/ir/profile.cpp
> @@ -44,7 +44,6 @@ namespace ir {
>  "retVal", "slm_offset",
>  "printf_buffer_pointer", "printf_index_buffer_pointer",
>  "dwblockip",
> -"lane_id",
>  "invalid",
>  "bti_utility"
>  };
> -- 
> 1.9.1
> 
> ___
> Beignet mailing list
> Beignet@lists.freedesktop.org
> http://lists.freedesktop.org/mailman/listinfo/beignet
___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


[Beignet] [PATCH 2/5] GBE: refine longjmp checking.

2015-09-14 Thread Zhigang Gong
Signed-off-by: Zhigang Gong 
---
 backend/src/backend/gen_insn_selection.cpp |  2 +-
 backend/src/ir/function.hpp| 17 +
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/backend/src/backend/gen_insn_selection.cpp 
b/backend/src/backend/gen_insn_selection.cpp
index ab00269..57dbec9 100644
--- a/backend/src/backend/gen_insn_selection.cpp
+++ b/backend/src/backend/gen_insn_selection.cpp
@@ -1154,7 +1154,7 @@ namespace gbe
 SelectionInstruction *insn = this->appendInsn(SEL_OP_JMPI, 0, 1);
 insn->src(0) = src;
 insn->index = index.value();
-insn->extra.longjmp = abs(index - origin) > 800;
+insn->extra.longjmp = ctx.getFunction().getDistance(origin, index) > 8000;
 return insn->extra.longjmp ? 2 : 1;
   }
 
diff --git a/backend/src/ir/function.hpp b/backend/src/ir/function.hpp
index b5f4ba2..b924332 100644
--- a/backend/src/ir/function.hpp
+++ b/backend/src/ir/function.hpp
@@ -487,6 +487,23 @@ namespace ir {
 Register getSurfaceBaseReg(uint8_t bti) const;
 void appendSurface(uint8_t bti, Register reg);
 /*! Output the control flow graph to .dot file */
+/*! Get instruction distance between two BBs */
+INLINE uint32_t getDistance(LabelIndex b0, LabelIndex b1) const {
+  int start, end;
+  if (b0.value() < b1.value()) {
+start = b0.value();
+end = b1.value() - 1;
+  } else {
+start = b1.value();
+end = b0.value() - 1;
+  }
+  uint32_t insnNum = 0;
+  for(int i = start; i <= end; i++) {
+BasicBlock &bb = getBlock(LabelIndex(i));
+insnNum += bb.size();
+  }
+  return insnNum;
+}
 void outputCFG();
   private:
 friend class Context;   //!< Can freely modify a function
-- 
1.9.1

___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


[Beignet] [PATCH 4/5] GBE: don't always allocate ir::ocl::one/zero

2015-09-14 Thread Zhigang Gong
Using liveness information, we can allocate them only on
demand, and they can be treated as non-curbe-payload
registers.

Signed-off-by: Zhigang Gong 
---
 backend/src/backend/gen_context.cpp| 10 --
 backend/src/backend/gen_reg_allocation.cpp | 12 +---
 backend/src/backend/gen_reg_allocation.hpp |  2 ++
 backend/src/backend/program.h  |  2 --
 backend/src/ir/profile.cpp |  4 ++--
 5 files changed, 17 insertions(+), 13 deletions(-)

diff --git a/backend/src/backend/gen_context.cpp 
b/backend/src/backend/gen_context.cpp
index 5980db2..3dbd957 100644
--- a/backend/src/backend/gen_context.cpp
+++ b/backend/src/backend/gen_context.cpp
@@ -160,8 +160,6 @@ namespace gbe
 // when group size not aligned to simdWidth, flag register need clear to
 // make prediction(any8/16h) work correctly
 const GenRegister blockip = getBlockIP(*this);
-const GenRegister zero = ra->genReg(GenRegister::uw1grf(ir::ocl::zero));
-const GenRegister one = ra->genReg(GenRegister::uw1grf(ir::ocl::one));
 p->push();
   p->curr.noMask = 1;
   p->curr.predicate = GEN_PREDICATE_NONE;
@@ -169,10 +167,10 @@ namespace gbe
   p->curr.noMask = 0;
   setBlockIP(*this, blockip, 0);
   p->curr.execWidth = 1;
-  // FIXME, need to get the final use set of zero/one, if there is no user,
-  // no need to generate the following two instructions.
-  p->MOV(zero, GenRegister::immuw(0));
-  p->MOV(one, GenRegister::immw(-1));
+  if (ra->isAllocated(ir::ocl::zero))
+p->MOV(ra->genReg(GenRegister::uw1grf(ir::ocl::zero)), 
GenRegister::immuw(0));
+  if (ra->isAllocated(ir::ocl::one))
+p->MOV(ra->genReg(GenRegister::uw1grf(ir::ocl::one)), 
GenRegister::immw(-1));
 p->pop();
   }
 
diff --git a/backend/src/backend/gen_reg_allocation.cpp 
b/backend/src/backend/gen_reg_allocation.cpp
index 4430ca5..06f6cc7 100644
--- a/backend/src/backend/gen_reg_allocation.cpp
+++ b/backend/src/backend/gen_reg_allocation.cpp
@@ -102,6 +102,9 @@ namespace gbe
 bool allocate(Selection &selection);
 /*! Return the Gen register from the selection register */
 GenRegister genReg(const GenRegister ®);
+INLINE bool isAllocated(const ir::Register ®) {
+  return RA.contains(reg);
+}
 /*! Output the register allocation */
 void outputAllocation(void);
 INLINE void getRegAttrib(ir::Register reg, uint32_t ®Size, 
ir::RegisterFamily *regFamily = NULL) const {
@@ -1033,13 +1036,12 @@ namespace gbe
   if (curbeType != GBE_GEN_REG) {
 intervals[regID].minID = 0;
 
-// zero and one have implicitly usage in the initial block.
-if (curbeType == GBE_CURBE_ONE || curbeType == GBE_CURBE_ZERO)
-  intervals[regID].maxID = 10;
 // FIXME stack buffer is not used, we may need to remove it in the 
furture.
 if (curbeType == GBE_CURBE_EXTRA_ARGUMENT && subType == 
GBE_STACK_BUFFER)
   intervals[regID].maxID = 1;
   }
+  if (regID == ir::ocl::zero.value() || regID ==  ir::ocl::one.value())
+intervals[regID].minID = 0;
 }
 
 // Compute the intervals
@@ -1262,6 +1264,10 @@ namespace gbe
 return this->opaque->genReg(reg);
   }
 
+  bool GenRegAllocator::isAllocated(const ir::Register ®) {
+return this->opaque->isAllocated(reg);
+  }
+
   void GenRegAllocator::outputAllocation(void) {
 this->opaque->outputAllocation();
   }
diff --git a/backend/src/backend/gen_reg_allocation.hpp 
b/backend/src/backend/gen_reg_allocation.hpp
index 89dba64..8d5e797 100644
--- a/backend/src/backend/gen_reg_allocation.hpp
+++ b/backend/src/backend/gen_reg_allocation.hpp
@@ -54,6 +54,8 @@ namespace gbe
 bool allocate(Selection &selection);
 /*! Virtual to physical translation */
 GenRegister genReg(const GenRegister ®);
+/*! Check whether a register is allocated. */
+bool isAllocated(const ir::Register ®);
 /*! Output the register allocation */
 void outputAllocation(void);
 /*! Get register actual size in byte. */
diff --git a/backend/src/backend/program.h b/backend/src/backend/program.h
index 0758820..d364605 100644
--- a/backend/src/backend/program.h
+++ b/backend/src/backend/program.h
@@ -98,8 +98,6 @@ enum gbe_curbe_type {
   GBE_CURBE_BLOCK_IP,
   GBE_CURBE_DW_BLOCK_IP,
   GBE_CURBE_THREAD_NUM,
-  GBE_CURBE_ZERO,
-  GBE_CURBE_ONE,
   GBE_GEN_REG,
 };
 
diff --git a/backend/src/ir/profile.cpp b/backend/src/ir/profile.cpp
index 484e82d..4486863 100644
--- a/backend/src/ir/profile.cpp
+++ b/backend/src/ir/profile.cpp
@@ -80,8 +80,8 @@ namespace ir {
   DECL_NEW_REG(FAMILY_DWORD, barrierid, 1);
   DECL_NEW_REG(FAMILY_DWORD, threadn, 1, GBE_CURBE_THREAD_NUM);
   DECL_NEW_REG(FAMILY_DWORD, workdim, 1, GBE_CURBE_WORK_DIM);
-  DECL_NEW_REG(FAMILY_DWORD, zero, 1, GBE_CURBE_ZERO);
-  DECL_NEW_REG(FAMILY_DWORD, one, 1, GBE_CURBE_ONE);
+ 

[Beignet] [PATCH 3/5] GBE: don't treat btiUtil as a curbe payload register.

2015-09-14 Thread Zhigang Gong
btiUtil should be just a normal temporary register, alive only
for those specific load/store instructions that use mixed
BTIs.

Although btiUtil only takes one DW of register space, in
practice it may waste an entire 32-byte register, as it has a
very long live range.

This patch fixes this issue completely.

Signed-off-by: Zhigang Gong 
---
 backend/src/backend/gen8_context.cpp   |  10 +-
 backend/src/backend/gen_context.cpp|  47 +
 backend/src/backend/gen_context.hpp|   4 +-
 backend/src/backend/gen_insn_selection.cpp | 156 +
 backend/src/backend/gen_reg_allocation.cpp |   2 -
 backend/src/backend/program.h  |   1 -
 backend/src/ir/profile.cpp |   4 +-
 backend/src/ir/profile.hpp |   3 +-
 8 files changed, 128 insertions(+), 99 deletions(-)

diff --git a/backend/src/backend/gen8_context.cpp 
b/backend/src/backend/gen8_context.cpp
index b497ee5..7e51963 100644
--- a/backend/src/backend/gen8_context.cpp
+++ b/backend/src/backend/gen8_context.cpp
@@ -854,9 +854,10 @@ namespace gbe
   p->UNTYPED_READ(dst, src, bti, 2*elemNum);
 } else {
   const GenRegister tmp = ra->genReg(insn.dst(2*elemNum));
+  const GenRegister btiTmp = ra->genReg(insn.dst(2*elemNum + 1));
   unsigned desc = p->generateUntypedReadMessageDesc(0, 2*elemNum);
 
-  unsigned jip0 = beforeMessage(insn, bti, tmp, desc);
+  unsigned jip0 = beforeMessage(insn, bti, tmp, btiTmp, desc);
 
   //predicated load
   p->push();
@@ -864,7 +865,7 @@ namespace gbe
 p->curr.useFlag(insn.state.flag, insn.state.subFlag);
 p->UNTYPED_READ(dst, src, GenRegister::retype(GenRegister::addr1(0), 
GEN_TYPE_UD), 2*elemNum);
   p->pop();
-  afterMessage(insn, bti, tmp, jip0);
+  afterMessage(insn, bti, tmp, btiTmp, jip0);
 }
 
 for (uint32_t elemID = 0; elemID < elemNum; elemID++) {
@@ -893,9 +894,10 @@ namespace gbe
   p->UNTYPED_WRITE(addr, bti, elemNum*2);
 } else {
   const GenRegister tmp = ra->genReg(insn.dst(elemNum));
+  const GenRegister btiTmp = ra->genReg(insn.dst(elemNum + 1));
   unsigned desc = p->generateUntypedWriteMessageDesc(0, elemNum*2);
 
-  unsigned jip0 = beforeMessage(insn, bti, tmp, desc);
+  unsigned jip0 = beforeMessage(insn, bti, tmp, btiTmp, desc);
 
   //predicated load
   p->push();
@@ -903,7 +905,7 @@ namespace gbe
 p->curr.useFlag(insn.state.flag, insn.state.subFlag);
 p->UNTYPED_WRITE(addr, GenRegister::addr1(0), elemNum*2);
   p->pop();
-  afterMessage(insn, bti, tmp, jip0);
+  afterMessage(insn, bti, tmp, btiTmp, jip0);
 }
   }
   void Gen8Context::emitPackLongInstruction(const SelectionInstruction &insn) {
diff --git a/backend/src/backend/gen_context.cpp 
b/backend/src/backend/gen_context.cpp
index ae02fbe..5980db2 100644
--- a/backend/src/backend/gen_context.cpp
+++ b/backend/src/backend/gen_context.cpp
@@ -1769,16 +1769,17 @@ namespace gbe
   p->ATOMIC(dst, function, src, bti, srcNum);
 } else {
   GenRegister flagTemp = ra->genReg(insn.dst(1));
+  GenRegister btiTmp = ra->genReg(insn.dst(2));
 
   unsigned desc = p->generateAtomicMessageDesc(function, 0, srcNum);
 
-  unsigned jip0 = beforeMessage(insn, bti, flagTemp, desc);
+  unsigned jip0 = beforeMessage(insn, bti, flagTemp, btiTmp, desc);
   p->push();
 p->curr.predicate = GEN_PREDICATE_NORMAL;
 p->curr.useFlag(insn.state.flag, insn.state.subFlag);
 p->ATOMIC(dst, function, src, GenRegister::addr1(0), srcNum);
   p->pop();
-  afterMessage(insn, bti, flagTemp, jip0);
+  afterMessage(insn, bti, flagTemp, btiTmp, jip0);
 }
   }
 
@@ -1920,9 +1921,10 @@ namespace gbe
   p->UNTYPED_READ(dst, src, bti, elemNum);
 } else {
   const GenRegister tmp = ra->genReg(insn.dst(elemNum));
+  const GenRegister btiTmp = ra->genReg(insn.dst(elemNum + 1));
   unsigned desc = p->generateUntypedReadMessageDesc(0, elemNum);
 
-  unsigned jip0 = beforeMessage(insn, bti, tmp, desc);
+  unsigned jip0 = beforeMessage(insn, bti, tmp, btiTmp, desc);
 
   //predicated load
   p->push();
@@ -1930,17 +1932,17 @@ namespace gbe
 p->curr.useFlag(insn.state.flag, insn.state.subFlag);
 p->UNTYPED_READ(dst, src, GenRegister::retype(GenRegister::addr1(0), 
GEN_TYPE_UD), elemNum);
   p->pop();
-  afterMessage(insn, bti, tmp, jip0);
+  afterMessage(insn, bti, tmp, btiTmp, jip0);
 }
   }
-  unsigned GenContext::beforeMessage(const SelectionInstruction &insn, 
GenRegister bti, GenRegister tmp, unsigned desc) {
+  unsigned GenContext::beforeMessage(const SelectionInstruction &insn, 
GenRegister bti, GenRegister tmp, GenRegister btiTmp, unsigned desc) {
   const GenRegister flagReg = GenRegister::flag(insn.state.fla

[Beignet] [PATCH 5/5] GBE: we no longer need to allocate register from two directions.

2015-09-14 Thread Zhigang Gong
Signed-off-by: Zhigang Gong 
---
 backend/src/backend/context.hpp| 2 +-
 backend/src/backend/gen_reg_allocation.cpp | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/backend/src/backend/context.hpp b/backend/src/backend/context.hpp
index e1f5a71..04bcf43 100644
--- a/backend/src/backend/context.hpp
+++ b/backend/src/backend/context.hpp
@@ -85,7 +85,7 @@ namespace gbe
   return JIPs.find(insn) != JIPs.end();
 }
 /*! Allocate some memory in the register file */
-int16_t allocate(int16_t size, int16_t alignment, bool bFwd=0);
+int16_t allocate(int16_t size, int16_t alignment, bool bFwd = true);
 /*! Deallocate previously allocated memory */
 void deallocate(int16_t offset);
 /*! Spilt a block into 2 blocks, for some registers allocate together but  
deallocate seperate */
diff --git a/backend/src/backend/gen_reg_allocation.cpp 
b/backend/src/backend/gen_reg_allocation.cpp
index 06f6cc7..bf2ac2b 100644
--- a/backend/src/backend/gen_reg_allocation.cpp
+++ b/backend/src/backend/gen_reg_allocation.cpp
@@ -1020,7 +1020,7 @@ namespace gbe
 using namespace ir;
 
 if (ctx.reservedSpillRegs != 0) {
-  reservedReg = ctx.allocate(ctx.reservedSpillRegs * GEN_REG_SIZE, 
GEN_REG_SIZE);
+  reservedReg = ctx.allocate(ctx.reservedSpillRegs * GEN_REG_SIZE, 
GEN_REG_SIZE, false);
   reservedReg /= GEN_REG_SIZE;
 } else {
   reservedReg = 0;
-- 
1.9.1

___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


[Beignet] [PATCH 1/5] GBE: refactor curbe register allocation.

2015-09-14 Thread Zhigang Gong
The major motivation is to normalize curbe payload
allocation and prepare to use liveness information to avoid
unnecessary payload register allocation and fragmentation
when allocating curbe registers. For example, many
one-dimensional kernels don't need GBE_CURBE_LOCAL_ID_Y/Z,
but the previous curbe allocation occurred before the
liveness interval computation, so it allocated those curbes
anyway. Although they expired soon, they still required us
to prepare the payload on the host side. After this patch,
this type of overhead is eliminated.

Another purpose is to eliminate the ugly curbe patch list
handling in the backend. After this patch, curbe register
handling is much cleaner than before.

Signed-off-by: Zhigang Gong 
---
 backend/src/backend/context.cpp|  14 
 backend/src/backend/context.hpp|  18 -
 backend/src/backend/gen_context.cpp| 118 ++--
 backend/src/backend/gen_context.hpp|   2 +-
 backend/src/backend/gen_reg_allocation.cpp | 121 -
 backend/src/backend/program.h  |   3 +-
 backend/src/ir/context.cpp |   7 +-
 backend/src/ir/context.hpp |   3 +-
 backend/src/ir/function.hpp|  19 -
 backend/src/ir/image.cpp   |   2 +-
 backend/src/ir/instruction.hpp |   1 +
 backend/src/ir/profile.cpp |  64 +++
 backend/src/ir/profile.hpp |  12 ++-
 backend/src/ir/register.hpp|  58 --
 src/cl_command_queue.c |   4 +-
 src/cl_command_queue_gen7.c|  34 
 src/cl_kernel.c|  12 +--
 17 files changed, 266 insertions(+), 226 deletions(-)

diff --git a/backend/src/backend/context.cpp b/backend/src/backend/context.cpp
index 81b284d..a02771a 100644
--- a/backend/src/backend/context.cpp
+++ b/backend/src/backend/context.cpp
@@ -421,20 +421,6 @@ namespace gbe
 return offset;
   }
 
-  uint32_t Context::getImageInfoCurbeOffset(ir::ImageInfoKey key, size_t size)
-  {
-int32_t offset = fn.getImageSet()->getInfoOffset(key);
-if (offset >= 0)
-  return offset + GEN_REG_SIZE;
-newCurbeEntry(GBE_CURBE_IMAGE_INFO, key.data, size, 4);
-std::sort(kernel->patches.begin(), kernel->patches.end());
-
-offset = kernel->getCurbeOffset(GBE_CURBE_IMAGE_INFO, key.data);
-GBE_ASSERT(offset >= 0); // XXX do we need to spill it out to bo?
-fn.getImageSet()->appendInfo(key, offset);
-return offset + GEN_REG_SIZE;
-  }
-
   void Context::insertCurbeReg(ir::Register reg, uint32_t offset) {
 curbeRegs.insert(std::make_pair(reg, offset));
   }
diff --git a/backend/src/backend/context.hpp b/backend/src/backend/context.hpp
index 079967d..e1f5a71 100644
--- a/backend/src/backend/context.hpp
+++ b/backend/src/backend/context.hpp
@@ -90,9 +90,6 @@ namespace gbe
 void deallocate(int16_t offset);
 /*! Spilt a block into 2 blocks, for some registers allocate together but  
deallocate seperate */
 void splitBlock(int16_t offset, int16_t subOffset);
-/* allocate a new entry for a specific image's information */
-/*! Get (search or allocate if fail to find one) image info curbeOffset.*/
-uint32_t getImageInfoCurbeOffset(ir::ImageInfoKey key, size_t size);
 /*! allocate size scratch memory and return start address */
 int32_t allocateScratchMem(uint32_t size);
 /*! deallocate scratch memory at offset */
@@ -107,6 +104,21 @@ namespace gbe
 uint32_t getMaxLabel(void) const {
      return this->isDWLabel() ? 0xffffffff : 0xffff;
 }
+/*! get register's payload type. */
+INLINE void getRegPayloadType(ir::Register reg, gbe_curbe_type &curbeType, 
int &subType) const {
+  if (reg.value() >= fn.getRegisterFile().regNum()) {
+curbeType = GBE_GEN_REG;
+subType = 0;
+return;
+  }
+  fn.getRegPayloadType(reg, curbeType, subType);
+}
+/*! check whether a register is a payload register */
+INLINE bool isPayloadReg(ir::Register reg) const{
+  if (reg.value() >= fn.getRegisterFile().regNum())
+return false;
+  return fn.isPayloadReg(reg);
+}
   protected:
 /*! Build the instruction stream. Return false if failed */
 virtual bool emitCode(void) = 0;
diff --git a/backend/src/backend/gen_context.cpp 
b/backend/src/backend/gen_context.cpp
index 25fdf08..ae02fbe 100644
--- a/backend/src/backend/gen_context.cpp
+++ b/backend/src/backend/gen_context.cpp
@@ -181,9 +181,8 @@ namespace gbe
 GenRegister dst_;
 if (dst.type == GEN_TYPE_UW)
   dst_ = dst;
-else
-  dst_ = GenRegister::uw16grf(126,0);
-
+else if (dst.type == GEN_TYPE_UD)
+  dst_ = GenRegister::retype(dst, GEN_TYPE_UW);
 p->push();
   uint32_t execWidth = p->curr.execWidth;
   p->curr.

[Beignet] [PATCH 0/5] curbe register allocation refactor and optimization

2015-09-14 Thread Zhigang Gong
This patch series fixes the hacky curbe register allocation.
Previously, we treated these registers in a totally different way
from other normal registers, and then did a lot of patch work in the
backend stage to handle curbe registers first, even before interval
computation; thus we had to allocate some unnecessary registers.
This also introduced extra overhead when preparing the payload
values on the host side: for a 1D kernel, for example, we may not
need to prepare LOCAL_IDY and LOCAL_IDZ at all, but the previous
implementation prepared them anyway.

This patchset normalizes curbe registers as normal registers and
gathers information at the Gen IR stage as much as possible. We then
need only a tiny amount of patch work at the backend stage, namely
inserting the image information offset, and even that part could be
eliminated in the future. We can also use complete liveness
information when allocating curbe payload registers, putting
registers with closer end points together to reduce fragmentation,
and we eliminate unnecessary payload registers as much as possible.

This patchset changes btiUtil and zero/one into normal registers
with correct liveness information. In most cases, it saves one or
two registers.

This patchset also fixes one longjmp issue: the previous method,
based on basic block numbers, was too inaccurate.

This patchset is a preparation for the next patch set, which further
optimizes register allocation.


Zhigang Gong (5):
  GBE: refactor curbe register allocation.
  GBE: refine longjmp checking.
  GBE: don't treat btiUtil as a curbe payload register.
  GBE: don't always allocate ir::ocl::one/zero
  GBE: we no longer need to allocate register from two directions.

 backend/src/backend/context.cpp|  14 ---
 backend/src/backend/context.hpp|  20 +++-
 backend/src/backend/gen8_context.cpp   |  10 +-
 backend/src/backend/gen_context.cpp| 175 +
 backend/src/backend/gen_context.hpp|   6 +-
 backend/src/backend/gen_insn_selection.cpp | 158 +++---
 backend/src/backend/gen_reg_allocation.cpp | 127 ++---
 backend/src/backend/gen_reg_allocation.hpp |   2 +
 backend/src/backend/program.h  |   6 +-
 backend/src/ir/context.cpp |   7 +-
 backend/src/ir/context.hpp |   3 +-
 backend/src/ir/function.hpp|  36 +-
 backend/src/ir/image.cpp   |   2 +-
 backend/src/ir/instruction.hpp |   1 +
 backend/src/ir/profile.cpp |  62 +-
 backend/src/ir/profile.hpp |  11 +-
 backend/src/ir/register.hpp|  58 --
 src/cl_command_queue.c |   4 +-
 src/cl_command_queue_gen7.c|  34 +++---
 src/cl_kernel.c|  12 +-
 20 files changed, 419 insertions(+), 329 deletions(-)

-- 
1.9.1



[Beignet] [PATCH] GBE: fix build error with LLVM 3.5 and previous version.

2015-09-08 Thread Zhigang Gong
Signed-off-by: Zhigang Gong 
---
 backend/src/backend/program.cpp | 7 ++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/backend/src/backend/program.cpp b/backend/src/backend/program.cpp
index 330bead..57a5037 100644
--- a/backend/src/backend/program.cpp
+++ b/backend/src/backend/program.cpp
@@ -575,7 +575,12 @@ namespace gbe {
   Diags);
 llvm::StringRef srcString(source);
 (*CI).getPreprocessorOpts().addRemappedFile("stringInput.cl",
-llvm::MemoryBuffer::getMemBuffer(srcString).release());
+#if LLVM_VERSION_MAJOR == 3 && LLVM_VERSION_MINOR <= 5
+llvm::MemoryBuffer::getMemBuffer(srcString)
+#else
+llvm::MemoryBuffer::getMemBuffer(srcString).release()
+#endif
+);
 
 // Create the compiler instance
 clang::CompilerInstance Clang;
-- 
1.9.1



Re: [Beignet] [PATCH 3/3] add optimization for local copy propagation

2015-09-07 Thread Zhigang Gong
> -Original Message-
> From: Beignet [mailto:beignet-boun...@lists.freedesktop.org] On Behalf Of
> Guo, Yejun
> Sent: Monday, September 7, 2015 8:27 PM
> To: Zhigang Gong; beignet@lists.freedesktop.org
> Subject: Re: [Beignet] [PATCH 3/3] add optimization for local copy propagation
> 
> Yes, there will be penalty for the case in your example. I read several
> documents for local copy propagation, and none mentioned this case. :(
> 
> For the method to iterate new instructions/registers, it requires to add the
> 'new' flag during GenIR to SelectionIR period, since the current 
> implementation
> is large, I think it might not be so good, there is high possibility to miss
> something.
> 
> I prefer for the liveness method, it provides much information which could be
> possibly used in other later optimizations. I'll check if ir::liveness could 
> be
> reused or need to design a new liveness for selection ir.
We already have the liveness information in the gen backend stage; please check
the following code in backend/context.cpp:
  Context::Context(const ir::Unit &unit, const std::string &name) :
unit(unit), fn(*unit.getFunction(name)), name(name), liveness(NULL), 
dag(NULL), useDWLabel(false)
  {
GBE_ASSERT(unit.getPointerSize() == ir::POINTER_32_BITS);
this->liveness = GBE_NEW(ir::Liveness, const_cast<ir::Function&>(fn), true);
this->dag = GBE_NEW(ir::FunctionDAG, *this->liveness);
// r0 (GEN_REG_SIZE) is always set by the HW and used at the end by EOT
this->registerAllocator = NULL; //GBE_NEW(RegisterAllocator, GEN_REG_SIZE, 
4*KB - GEN_REG_SIZE);
this->scratchAllocator = NULL; //GBE_NEW(ScratchAllocator, 12*KB);
  }

Using ctx.getLiveIn(bb) and ctx.getLiveOut(bb), you can easily get the live-in
set or live-out set for a specified basic block.
That should be good enough to implement this optimization.


Thanks,
Zhigang Gong.

> 
> -Original Message-
> From: Zhigang Gong [mailto:zhigang.g...@linux.intel.com]
> Sent: Monday, September 07, 2015 2:47 PM
> To: Guo, Yejun; beignet@lists.freedesktop.org
> Subject: RE: [Beignet] [PATCH 3/3] add optimization for local copy propagation
> 
> Right, only the instructions created in instruction selection stage could be
> optimized here.
> No need to iterate all the instructions. And no need to do multiple round 
> check.
> Just one round check, and only need to check those temporary registers(only
> live in this BB). If any register is in live out set, then we do not need to 
> touch it.
> 
> Please take a look at the following example
> (%r1 is in live in set, and %r0 is in liveout set, and the MOV %r0, %r1 is 
> the last
> use of the %r1):
> 
> MOV %r0, %r1
> ...
> ADD %r5, %r0, %r2
> 
> If use this optimization, it will become:
> 
> MOV %r0, %r1
> ...
> ADD %r5, %r1, %r2
> 
> Because %r0 is in liveout set, the MOV instruction is not dead instruction and
> will not be eliminated.
> And the worse situation is the %r1's liveness interval incorrectly extent to 
> the
> ADD instruction.
> 
> In the original code, %r1 is not alive at the ADD instruction, now both %r1
> and %r0 are alive after the MOV instruction. Considering a worst case:
> If there are many such type of MOV, then this optimization will not bring any
> optimization, but will increase register pressure dramatically.
> 
> So my two major suggestions are as below:
> 
> 1. Try to iterate newly created instructions/registers only, this could reduce
> most of the overhead.
> 2. Use liveness information, don't touch any registers in the liveout set, 
> actually,
> if you only iterate those newly created instructions, you may avoid any 
> liveout
> registers naturally.
> 
> > -Original Message-
> > From: Beignet [mailto:beignet-boun...@lists.freedesktop.org] On Behalf
> > Of Guo, Yejun
> > Sent: Monday, September 7, 2015 2:11 PM
> > To: Zhigang Gong; beignet@lists.freedesktop.org
> > Subject: Re: [Beignet] [PATCH 3/3] add optimization for local copy
> > propagation
> >
> > It is expected that there will be improvement with the optimization
> > since some instructions are removed.
> >
> > As mentioned in the commit log, this patch itself does not remove any
> > instruction, it modifies some instruction to make the removal possible.
> >
> > GenWriter::removeMOVs() did the work inside each basic block at Gen IR
> > level, the idea can be introduced for Selection IR level, and
> > removeMovs globally might also needed for Selection IR.
> >
> > Even GenWriter::removeMOVs() has done the optimization at Gen IR
> > level, it is also necessary to do the same idea again at Selection IR
> > level, since we intro

Re: [Beignet] [PATCH 3/3] add optimization for local copy propagation

2015-09-06 Thread Zhigang Gong
Right: only the instructions created in the instruction selection stage can be
optimized here. There is no need to iterate over all the instructions, and no
need to do multiple rounds of checking. One round of checking is enough, and
only the temporary registers (those live only in this BB) need to be checked.
If a register is in the live-out set, we do not need to touch it.

Please take a look at the following example
(%r1 is in the live-in set, %r0 is in the live-out set, and the MOV %r0, %r1 is
the last use of %r1):

MOV %r0, %r1
...
ADD %r5, %r0, %r2

With this optimization, it will become:

MOV %r0, %r1
...
ADD %r5, %r1, %r2

Because %r0 is in the live-out set, the MOV instruction is not dead and will
not be eliminated.
Worse, %r1's liveness interval is incorrectly extended to the ADD instruction.

In the original code, %r1 is not alive at the ADD instruction; now both %r1 and
%r0 are alive after the MOV instruction. Consider the worst case:
if there are many such MOVs, this optimization will not bring any improvement,
but will increase register pressure dramatically.

So my two major suggestions are as follows:

1. Try to iterate over newly created instructions/registers only; this removes
most of the overhead.
2. Use liveness information and do not touch any registers in the live-out set.
Actually, if you only iterate over the newly created instructions, you may
avoid the live-out registers naturally.

> -Original Message-
> From: Beignet [mailto:beignet-boun...@lists.freedesktop.org] On Behalf Of
> Guo, Yejun
> Sent: Monday, September 7, 2015 2:11 PM
> To: Zhigang Gong; beignet@lists.freedesktop.org
> Subject: Re: [Beignet] [PATCH 3/3] add optimization for local copy propagation
> 
> It is expected that there will be improvement with the optimization since some
> instructions are removed.
> 
> As mentioned in the commit log, this patch itself does not remove any
> instruction, it modifies some instruction to make the removal possible.
> 
> GenWriter::removeMOVs() did the work inside each basic block at Gen IR level,
> the idea can be introduced for Selection IR level, and removeMovs globally
> might also needed for Selection IR.
> 
> Even GenWriter::removeMOVs() has done the optimization at Gen IR level, it is
> also necessary to do the same idea again at Selection IR level, since we
> introduced some extra instructions during GenIR to SelectionIR and/or other
> functions.
> 
> Take the selection IR of utest compiler_saturate_sub_uint8_t as an example
> (see the instruction fragment in commit log), we can see such optimization at
> selection IR level is also necessary.
> 
> -Original Message-
> From: Zhigang Gong [mailto:zhigang.g...@linux.intel.com]
> Sent: Monday, September 07, 2015 1:43 PM
> To: Guo, Yejun; beignet@lists.freedesktop.org
> Subject: RE: [Beignet] [PATCH 3/3] add optimization for local copy propagation
> 
> Is there any evidence that this optimization could bring actual improvement?
> I doubt it because it doesn't reduce any instruction.
> 
> Actually, if the %42 is not in the liveout set of current BB, then the MOV 
> could
> be removed, the exactly same optimization logic has been implemented in the
> GEN IR stage, the function is GenWriter::removeMOVs(), you can check it out.
> 
> > -Original Message-
> > From: Beignet [mailto:beignet-boun...@lists.freedesktop.org] On Behalf
> > Of Guo Yejun
> > Sent: Monday, September 7, 2015 5:28 AM
> > To: beignet@lists.freedesktop.org
> > Cc: Guo Yejun
> > Subject: [Beignet] [PATCH 3/3] add optimization for local copy
> > propagation
> >
> > it is done at selection ir level, for instructions like:
> > MOV(8)  %42<2>:UB   :   %53<32,8,4>:UB
> > ADD(8)  %43<2>:B:   %40<16,8,2>:B
> > -%42<16,8,2>:B
> > can be optimized as:
> > MOV(8)  %42<2>:UB   :   %53<32,8,4>:UB
> > ADD(8)  %43<2>:UB   :   %56<32,8,4>:UB
> > -%53<32,8,4>:UB
> >
> > the optimization is done for each basic block, we here can not remove
> > instruction "MOV %42 ..." since it is possible that %42 could be used
> > at other place in the same block or even in other blocks. We need to
> > add another path such as dead code elimination at global level to remove it.
> >
> > Signed-off-by: Guo Yejun 
> > ---
> >  .../src/backend/gen_insn_selection_optimize.cpp| 145
> > +
> >  1 file changed, 145 insertions(+)
> >
> > diff --git a/backend/src/backend/gen_insn_selection_optimize.cpp
> > b/backend/src/backend/gen_insn_selection_optimize.cpp
> > index c82fbe5

Re: [Beignet] [PATCH 3/3] add optimization for local copy propagation

2015-09-06 Thread Zhigang Gong
Is there any evidence that this optimization brings an actual improvement?
I doubt it, because it doesn't reduce any instructions.

Actually, if %42 is not in the live-out set of the current BB, then the MOV
could be removed; exactly the same optimization logic has already been
implemented at the GEN IR stage, in GenWriter::removeMOVs(). You can check it
out.

> -Original Message-
> From: Beignet [mailto:beignet-boun...@lists.freedesktop.org] On Behalf Of
> Guo Yejun
> Sent: Monday, September 7, 2015 5:28 AM
> To: beignet@lists.freedesktop.org
> Cc: Guo Yejun
> Subject: [Beignet] [PATCH 3/3] add optimization for local copy propagation
> 
> it is done at selection ir level, for instructions like:
> MOV(8)  %42<2>:UB :   %53<32,8,4>:UB
> ADD(8)  %43<2>:B  :   %40<16,8,2>:B
>   -%42<16,8,2>:B
> can be optimized as:
> MOV(8)  %42<2>:UB :   %53<32,8,4>:UB
> ADD(8)  %43<2>:UB :   %56<32,8,4>:UB
>   -%53<32,8,4>:UB
> 
> the optimization is done for each basic block, we here can not remove
> instruction "MOV %42 ..." since it is possible that %42 could be used at other
> place in the same block or even in other blocks. We need to add another path
> such as dead code elimination at global level to remove it.
> 
> Signed-off-by: Guo Yejun 
> ---
>  .../src/backend/gen_insn_selection_optimize.cpp| 145
> +
>  1 file changed, 145 insertions(+)
> 
> diff --git a/backend/src/backend/gen_insn_selection_optimize.cpp
> b/backend/src/backend/gen_insn_selection_optimize.cpp
> index c82fbe5..b196a4d 100644
> --- a/backend/src/backend/gen_insn_selection_optimize.cpp
> +++ b/backend/src/backend/gen_insn_selection_optimize.cpp
> @@ -20,6 +20,40 @@ namespace gbe
>  virtual void run() = 0;
>  virtual ~SelOptimizer() {}
>protected:
> +//we need info derived from execWdith, but it is not contained inside
> GenRegister
> +class ExpandedRegister
> +{
> +public:
> +  ExpandedRegister(const GenRegister& reg, uint32_t execWidth) :
> genreg(reg)
> +  {
> +elements = CalculateElements(reg, execWidth);
> +  }
> +  ~ExpandedRegister() {}
> +  static uint32_t CalculateElements(const GenRegister& reg, uint32_t
> execWidth)
> +  {
> +uint32_t elements = 0;
> +uint32_t elementSize = typeSize(reg.type);
> +uint32_t width = GenRegister::width_size(reg);
> +assert(execWidth >= width);
> +uint32_t height = execWidth / width;
> +uint32_t vstride = GenRegister::vstride_size(reg);
> +uint32_t hstride = GenRegister::hstride_size(reg);
> +uint32_t base = reg.subnr;
> +for (uint32_t i = 0; i < height; ++i) {
> +  uint32_t offsetInByte = base;
> +  for (uint32_t j = 0; j < width; ++j) {
> +uint32_t offsetInType = offsetInByte / elementSize;
> +elements |= (1 << offsetInType);//right for dest
> register?
> +offsetInByte += hstride * elementSize;
> +  }
> +  offsetInByte += vstride * elementSize;
> +}
> +return elements;
> +  }
> +  const GenRegister& genreg;
> +  uint32_t elements;
> +};
> +
>  uint32_t features;
>};
> 
> @@ -31,13 +65,124 @@ namespace gbe
>  virtual void run();
> 
>private:
> +// local copy propagation
> +typedef std::map
> RegisterMap;
> +static bool replaceWithCopytable(const RegisterMap& copytable,
> uint32_t execWidth, GenRegister& var);
> +static void addToCopytable(RegisterMap& copytable, uint32_t
> execWidth, const GenRegister& src, const GenRegister& dst);
> +static void removeFromCopytable(RegisterMap& copytable, const
> GenRegister& var);
> +static void cleanCopytable(RegisterMap& copytable);
> +static void propagateRegister(GenRegister& dst, const GenRegister&
> src);
> +bool doLocalCopyPropagation();
> +
>  SelectionBlock &bb;
>  static const size_t MaxTries = 1;   //the times for optimization
>};
> 
> +  void SelBasicBlockOptimizer::propagateRegister(GenRegister& dst,
> + const GenRegister& src)  {
> +dst.type = src.type;
> +dst.file = src.file;
> +dst.physical = src.physical;
> +dst.subphysical = src.subphysical;
> +dst.value.reg = src.value.reg;
> +dst.vstride = src.vstride;
> +dst.width = src.width;
> +dst.hstride = src.hstride;
> +dst.quarter = src.quarter;
> +dst.nr = src.nr;
> +dst.subnr = src.subnr;
> +dst.address_mode = src.address_mode;
> +dst.a0_subnr = src.a0_subnr;
> +dst.addr_imm = src.addr_imm;
> +
> +dst.negation = dst.negation ^ src.negation;
> +dst.absolute = dst.absolute | src.absolute;  }
> +
> +  void SelBasicBlockOptimizer::cleanCopytable(RegisterMap& copytable)
> + {
> +for (RegisterMap::const_iterator pos = copytable.begin(); pos !=
> copytable.end(); ++pos) {
> +  const ExpandedRegister* key = po

[Beignet] [PATCH v2 1/2] GBE: continue to refine interfering check.

2015-09-06 Thread Zhigang Gong
A more aggressive interfering check: even if both registers are in the
live-in set or the live-out set, it is still possible that they do not
interfere with each other.

v2:
The live-out interfering check needs to handle those BBs in which only one of
the two registers is defined.

For example:

BBn:
  ...
  MOV %r1, %src
  ...

Both %r1 and %r2 are in BBn's live-out set, but %r2 is neither defined nor
used in BBn. The previous implementation ignored this BB, which is incorrect:
as %r1 is modified to a different value, %r1 cannot be replaced with %r2 in
this case.

Signed-off-by: Zhigang Gong 
---
 backend/src/ir/value.cpp | 131 ---
 backend/src/ir/value.hpp |   5 +-
 2 files changed, 117 insertions(+), 19 deletions(-)

diff --git a/backend/src/ir/value.cpp b/backend/src/ir/value.cpp
index 19ecabf..72caa13 100644
--- a/backend/src/ir/value.cpp
+++ b/backend/src/ir/value.cpp
@@ -577,13 +577,102 @@ namespace ir {
 }
   }
 
+  static void getBlockDefInsns(const BasicBlock *bb, const DefSet *dSet, 
Register r, set<const Instruction *> &defInsns) {
+for (auto def : *dSet) {
+  auto defInsn = def->getInstruction();
+  if (defInsn->getParent() == bb)
+defInsns.insert(defInsn);
+}
+  }
+
+  static bool liveinInterfere(const BasicBlock *bb, const Instruction 
*defInsn, Register r1) {
+BasicBlock::const_iterator iter = BasicBlock::const_iterator(defInsn);
+BasicBlock::const_iterator iterE = bb->end();
+
+if (defInsn->getOpcode() == OP_MOV &&
+defInsn->getSrc(0) == r1)
+  return false;
+while (iter != iterE) {
+  const Instruction *insn = iter.node();
+  for (unsigned i = 0; i < insn->getDstNum(); i++) {
+Register dst = insn->getDst(i);
+if (dst == r1)
+  return false;
+  }
+  for (unsigned i = 0; i < insn->getSrcNum(); i++) {
+ir::Register src = insn->getSrc(i);
+if (src == r1)
+  return true;
+  }
+  ++iter;
+}
+
+return false;
+  }
+
+  // r0 and r1 both are in Livein set.
+  // Only if r0/r1 is used after r1/r0 has been modified.
+  bool FunctionDAG::interfereLivein(const BasicBlock *bb, Register r0, 
Register r1) const {
+    set<const Instruction *> defInsns0, defInsns1;
+auto defSet0 = getRegDef(r0);
+auto defSet1 = getRegDef(r1);
+getBlockDefInsns(bb, defSet0, r0, defInsns0);
+getBlockDefInsns(bb, defSet1, r1, defInsns1);
+if (defInsns0.size() == 0 && defInsns1.size() == 0)
+  return false;
+
+for (auto insn : defInsns0) {
+  if (liveinInterfere(bb, insn, r1))
+return true;
+}
+
+for (auto insn : defInsns1) {
+  if (liveinInterfere(bb, insn, r0))
+return true;
+}
+return false;
+  }
+
+  // r0 and r1 both are in Liveout set.
+  // Only if the last definition of r0/r1 is a MOV r0, r1 or MOV r1, r0,
+  // it will not introduce interfering in this BB.
+  bool FunctionDAG::interfereLiveout(const BasicBlock *bb, Register r0, 
Register r1) const {
+    set<const Instruction *> defInsns0, defInsns1;
+auto defSet0 = getRegDef(r0);
+auto defSet1 = getRegDef(r1);
+getBlockDefInsns(bb, defSet0, r0, defInsns0);
+getBlockDefInsns(bb, defSet1, r1, defInsns1);
+if (defInsns0.size() == 0 && defInsns1.size() == 0)
+  return false;
+
+BasicBlock::const_iterator iter = --bb->end();
+BasicBlock::const_iterator iterE = bb->begin();
+do {
+  const Instruction *insn = iter.node();
+  for (unsigned i = 0; i < insn->getDstNum(); i++) {
+Register dst = insn->getDst(i);
+if (dst == r0 || dst == r1) {
+  if (insn->getOpcode() != OP_MOV)
+return true;
+  if (dst == r0 && insn->getSrc(0) != r1)
+return true;
+  if (dst == r1 && insn->getSrc(0) != r0)
+return true;
+  return false;
+}
+  }
+  --iter;
+} while (iter != iterE);
+return false;
+  }
+
+  // check instructions after the def of r0, if there is any def of r1, then 
no interefere for this
+  // range. Otherwise, if there is any use of r1, then return true.
   bool FunctionDAG::interfere(const BasicBlock *bb, Register inReg, Register 
outReg) const {
 auto dSet = getRegDef(outReg);
-bool visited = false;
 for (auto &def : *dSet) {
   auto defInsn = def->getInstruction();
   if (defInsn->getParent() == bb) {
-visited = true;
 if (defInsn->getOpcode() == OP_MOV && defInsn->getSrc(0) == inReg)
   continue;
 BasicBlock::const_iterator iter = BasicBlock::const_iterator(defInsn);
@@ -602,19 +691,17 @@ namespace ir {
 }
   }
 }
-// We must visit the outReg at least once. Otherwise, something going 
wrong.
-GBE_ASSERT(visited);
 return false;
   }
 
   bool FunctionDAG::interfere(const Liveness &liveness, Register r0, Register 
r1) const {
-// There are two interfering ca

[Beignet] [PATCH] GBE: avoid vector registers when there is high register pressure.

2015-09-06 Thread Zhigang Gong
If the reservedSpillRegs is not zero, it indicates that we are under
very high register pressure. Using register vectors would likely
increase that pressure and cause a significant performance problem,
which is much worse than using a short-lived temporary vector
register with several additional MOVs.

So let's simply avoid using vector registers and just use a
temporary, short-live-interval vector.

Signed-off-by: Zhigang Gong 
---
 backend/src/backend/gen_reg_allocation.cpp | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/backend/src/backend/gen_reg_allocation.cpp 
b/backend/src/backend/gen_reg_allocation.cpp
index 39f1934..36ad914 100644
--- a/backend/src/backend/gen_reg_allocation.cpp
+++ b/backend/src/backend/gen_reg_allocation.cpp
@@ -318,7 +318,7 @@ namespace gbe
   if (it == vectorMap.end() &&
   ctx.sel->isScalarReg(reg) == false &&
   ctx.isSpecialReg(reg) == false &&
-  (intervals[reg].maxID - intervals[reg].minID) < 2048)
+  ctx.reservedSpillRegs == 0 )
   {
 const VectorLocation location = std::make_pair(vector, regID);
 this->vectorMap.insert(std::make_pair(reg, location));
-- 
1.9.1



[Beignet] [PATCH 1/2] GBE: continue to refine interfering check.

2015-09-06 Thread Zhigang Gong
A more aggressive interfering check: even if both registers are in the
live-in set or the live-out set, it is still possible that they do not
interfere with each other.

Signed-off-by: Zhigang Gong 
---
 backend/src/ir/value.cpp | 117 ++-
 backend/src/ir/value.hpp |   5 +-
 2 files changed, 109 insertions(+), 13 deletions(-)

diff --git a/backend/src/ir/value.cpp b/backend/src/ir/value.cpp
index 19ecabf..75a100f 100644
--- a/backend/src/ir/value.cpp
+++ b/backend/src/ir/value.cpp
@@ -577,6 +577,97 @@ namespace ir {
 }
   }
 
+  static void getBlockDefInsns(const BasicBlock *bb, const DefSet *dSet, 
Register r, set<const Instruction *> &defInsns) {
+for (auto def : *dSet) {
+  auto defInsn = def->getInstruction();
+  if (defInsn->getParent() == bb)
+defInsns.insert(defInsn);
+}
+  }
+
+  static bool liveinInterfere(const BasicBlock *bb, const Instruction 
*defInsn, Register r1) {
+BasicBlock::const_iterator iter = BasicBlock::const_iterator(defInsn);
+BasicBlock::const_iterator iterE = bb->end();
+
+if (defInsn->getOpcode() == OP_MOV &&
+defInsn->getSrc(0) == r1)
+  return false;
+while (iter != iterE) {
+  const Instruction *insn = iter.node();
+  for (unsigned i = 0; i < insn->getDstNum(); i++) {
+Register dst = insn->getDst(i);
+if (dst == r1)
+  return false;
+  }
+  for (unsigned i = 0; i < insn->getSrcNum(); i++) {
+ir::Register src = insn->getSrc(i);
+if (src == r1)
+  return true;
+  }
+  ++iter;
+}
+
+return false;
+  }
+
+  // r0 and r1 both are in Livein set.
+  // Only if r0/r1 is used after r1/r0 has been modified.
+  bool FunctionDAG::interfereLivein(const BasicBlock *bb, Register r0, 
Register r1) const {
+    set<const Instruction *> defInsns0, defInsns1;
+auto defSet0 = getRegDef(r0);
+auto defSet1 = getRegDef(r1);
+getBlockDefInsns(bb, defSet0, r0, defInsns0);
+getBlockDefInsns(bb, defSet1, r1, defInsns1);
+if (defInsns0.size() == 0 && defInsns1.size() == 0)
+  return false;
+
+for (auto insn : defInsns0) {
+  if (liveinInterfere(bb, insn, r1))
+return true;
+}
+
+for (auto insn : defInsns1) {
+  if (liveinInterfere(bb, insn, r0))
+return true;
+}
+return false;
+  }
+
+  // r0 and r1 both are in Liveout set.
+  // Only if the last definition of r0/r1 is a MOV r0, r1 or MOV r1, r0,
+  // it will not introduce interfering in this BB.
+  bool FunctionDAG::interfereLiveout(const BasicBlock *bb, Register r0, 
Register r1) const {
+    set<const Instruction *> defInsns0, defInsns1;
+auto defSet0 = getRegDef(r0);
+auto defSet1 = getRegDef(r1);
+getBlockDefInsns(bb, defSet0, r0, defInsns0);
+getBlockDefInsns(bb, defSet1, r1, defInsns1);
+if (defInsns0.size() == 0 && defInsns1.size() == 0)
+  return false;
+
+BasicBlock::const_iterator iter = --bb->end();
+BasicBlock::const_iterator iterE = bb->begin();
+do {
+  const Instruction *insn = iter.node();
+  for (unsigned i = 0; i < insn->getDstNum(); i++) {
+Register dst = insn->getDst(i);
+if (dst == r0 || dst == r1) {
+  if (insn->getOpcode() != OP_MOV)
+return true;
+  if (dst == r0 && insn->getSrc(0) != r1)
+return true;
+  if (dst == r1 && insn->getSrc(0) != r0)
+return true;
+  return false;
+}
+  }
+  --iter;
+} while (iter != iterE);
+return false;
+  }
+
+  // check instructions after the def of r0, if there is any def of r1, then 
no interefere for this
+  // range. Otherwise, if there is any use of r1, then return true.
   bool FunctionDAG::interfere(const BasicBlock *bb, Register inReg, Register 
outReg) const {
 auto dSet = getRegDef(outReg);
 bool visited = false;
@@ -608,13 +699,13 @@ namespace ir {
   }
 
   bool FunctionDAG::interfere(const Liveness &liveness, Register r0, Register 
r1) const {
-// There are two interfering cases:
-//   1. Two registers are in the Livein set of the same BB.
-//   2. Two registers are in the Liveout set of the same BB.
 // If there are no any intersection BB, they are not interfering to each 
other.
-// If they are some intersection BBs, but one is only in the LiveIn and 
the other is
-// only in the Liveout, then we need to check whether they interefere each 
other in
-// that BB.
+// There are three different interfering cases which need further checking:
+//   1. Both registers are in the LiveIn register set.
+//   2. Both registers are in the LiveOut register set.
+//   3. One is in LiveIn set and the Other is in LiveOut set.
+// For the above 3 cases, we need 3 different ways to check whether they 
really
+// interfering to each other.
set<const BasicBlock *> bbSet0;
set<const BasicBlock *> bbSet1;
 getRegUDBBs(r0, b

[Beignet] [PATCH 2/2] GBE: Fix one DAG analysis issue and enable multiple round phi copy elimination.

2015-09-06 Thread Zhigang Gong
Even if a value is killed in the current BB, we still need to pass the
predecessor's definition into this BB. Otherwise, we will miss one
definition.

BB0:
  MOV %foo, %src0

BB1:
  MUL %foo, %src1, %foo
  ...
  BR BB1

In the above case, both BB1 and BB0 are predecessors of BB1.
When passing the definition of %foo in BB0 to BB1, the previous implementation
ignored it because %foo is killed in BB1; this is a bug.
This patch fixes it, and thus we can enable multiple-round
phi copy elimination safely.

Signed-off-by: Zhigang Gong 
---
 backend/src/ir/value.cpp  | 2 +-
 backend/src/llvm/llvm_gen_backend.cpp | 1 -
 2 files changed, 1 insertion(+), 2 deletions(-)

diff --git a/backend/src/ir/value.cpp b/backend/src/ir/value.cpp
index 75a100f..7b54763 100644
--- a/backend/src/ir/value.cpp
+++ b/backend/src/ir/value.cpp
@@ -242,7 +242,7 @@ namespace ir {
 const BasicBlock &pbb = pred.bb;
 for (auto reg : curr.liveOut) {
   if (pred.inLiveOut(reg) == false) continue;
-  if (curr.inVarKill(reg) == true) continue;
+  if (curr.inVarKill(reg) == true && curr.inUpwardUsed(reg) == false) continue;
   RegDefSet &currSet = this->getDefSet(&bb, reg);
   RegDefSet &predSet = this->getDefSet(&pbb, reg);
 
diff --git a/backend/src/llvm/llvm_gen_backend.cpp b/backend/src/llvm/llvm_gen_backend.cpp
index 1d09727..4ac2c53 100644
--- a/backend/src/llvm/llvm_gen_backend.cpp
+++ b/backend/src/llvm/llvm_gen_backend.cpp
@@ -2403,7 +2403,6 @@ namespace gbe
   } else
 break;
 
-  break;
   nextRedundant->clear();
   replacedRegs.clear();
   revReplacedRegs.clear();
-- 
1.9.1

___
Beignet mailing list
Beignet@lists.freedesktop.org
http://lists.freedesktop.org/mailman/listinfo/beignet


[Beignet] [PATCH v2 4/5] GBE: add some dag helper routines to check registers' interfering.

2015-09-06 Thread Zhigang Gong
These helper functions will be used in a further phi mov optimization.

v2:
remove the useless debug message code.

Signed-off-by: Zhigang Gong 
---
 backend/src/ir/value.cpp | 100 +++
 backend/src/ir/value.hpp |  13 ++
 2 files changed, 113 insertions(+)

diff --git a/backend/src/ir/value.cpp b/backend/src/ir/value.cpp
index 840fb5c..19ecabf 100644
--- a/backend/src/ir/value.cpp
+++ b/backend/src/ir/value.cpp
@@ -558,6 +558,106 @@ namespace ir {
 return it->second;
   }
 
+  void FunctionDAG::getRegUDBBs(Register r, set<const BasicBlock *> &BBs) const {
+auto dSet = getRegDef(r);
+for (auto &def : *dSet)
+  BBs.insert(def->getInstruction()->getParent());
+auto uSet = getRegUse(r);
+for (auto &use : *uSet)
+  BBs.insert(use->getInstruction()->getParent());
+  }
+
+  static void getLivenessBBs(const Liveness &liveness, Register r, const set<const BasicBlock *> &useDefSet,
+                             set<const BasicBlock *> &liveInSet, set<const BasicBlock *> &liveOutSet) {
+for (auto bb : useDefSet) {
+  if (liveness.getLiveOut(bb).contains(r))
+liveOutSet.insert(bb);
+  if (liveness.getLiveIn(bb).contains(r))
+liveInSet.insert(bb);
+}
+  }
+
+  bool FunctionDAG::interfere(const BasicBlock *bb, Register inReg, Register outReg) const {
+auto dSet = getRegDef(outReg);
+bool visited = false;
+for (auto &def : *dSet) {
+  auto defInsn = def->getInstruction();
+  if (defInsn->getParent() == bb) {
+visited = true;
+if (defInsn->getOpcode() == OP_MOV && defInsn->getSrc(0) == inReg)
+  continue;
+BasicBlock::const_iterator iter = BasicBlock::const_iterator(defInsn);
+BasicBlock::const_iterator iterE = bb->end();
+iter++;
+// check no use of phi in this basicblock between [phiCopySrc def, bb end]
+while (iter != iterE) {
+  const ir::Instruction *insn = iter.node();
+  // check phiUse
+  for (unsigned i = 0; i < insn->getSrcNum(); i++) {
+ir::Register src = insn->getSrc(i);
+if (src == inReg)
+  return true;
+  }
+  ++iter;
+}
+  }
+}
+// We must visit the outReg at least once. Otherwise, something is going wrong.
+GBE_ASSERT(visited);
+return false;
+  }
+
+  bool FunctionDAG::interfere(const Liveness &liveness, Register r0, Register r1) const {
+// There are two interfering cases:
+//   1. Two registers are in the Livein set of the same BB.
+//   2. Two registers are in the Liveout set of the same BB.
+// If there are no intersection BBs, they are not interfering with each other.
+// If there are some intersection BBs, but one is only in the LiveIn and the other is
+// only in the LiveOut, then we need to check whether they interfere with each other in
+// that BB.
+set<const BasicBlock *> bbSet0;
+set<const BasicBlock *> bbSet1;
+getRegUDBBs(r0, bbSet0);
+getRegUDBBs(r1, bbSet1);
+
+set<const BasicBlock *> liveInBBSet0, liveInBBSet1;
+set<const BasicBlock *> liveOutBBSet0, liveOutBBSet1;
+getLivenessBBs(liveness, r0, bbSet0, liveInBBSet0, liveOutBBSet0);
+getLivenessBBs(liveness, r1, bbSet1, liveInBBSet1, liveOutBBSet1);
+
+set<const BasicBlock *> intersect;
+set_intersection(liveInBBSet0.begin(), liveInBBSet0.end(),
+ liveInBBSet1.begin(), liveInBBSet1.end(),
+ std::inserter(intersect, intersect.begin()));
+if (intersect.size() != 0)
+  return true;
+intersect.clear();
+set_intersection(liveOutBBSet0.begin(), liveOutBBSet0.end(),
+ liveOutBBSet1.begin(), liveOutBBSet1.end(),
+ std::inserter(intersect, intersect.begin()));
+if (intersect.size() != 0)
+  return true;
+
+set<const BasicBlock *> OIIntersect, IOIntersect;
+set_intersection(liveOutBBSet0.begin(), liveOutBBSet0.end(),
+ liveInBBSet1.begin(), liveInBBSet1.end(),
+ std::inserter(OIIntersect, OIIntersect.begin()));
+
+for (auto bb : OIIntersect) {
+  if (interfere(bb, r1, r0))
+return true;
+}
+
+set_intersection(liveInBBSet0.begin(), liveInBBSet0.end(),
+ liveOutBBSet1.begin(), liveOutBBSet1.end(),
+ std::inserter(IOIntersect, IOIntersect.begin()));
+for (auto bb : IOIntersect) {
+  if (interfere(bb, r0, r1))
+return true;
+}
+return false;
+  }
+
   std::ostream &operator<< (std::ostream &out, const FunctionDAG &dag) {
 const Function &fn = dag.getFunction();
 
diff --git a/backend/src/ir/value.hpp b/backend/src/ir/value.hpp
index a9e5108..ba3ba01 100644
--- a/backend/src/ir/value.hpp
+++ b/backend/src/ir/value.hpp
@@ -238,6 +238,19 @@ namespace ir {
 typedef map UDGraph;
 /*! The UseSet for each definition */
 typedef map DUGraph;
+/*! get register's use and define BB set */
+void getRegUDBBs(Register r, set &BB

[Beignet] [PATCH 4/5] GBE: add some dag helper routines to check registers' interfering.

2015-08-31 Thread Zhigang Gong
These helper functions will be used in a further phi mov optimization.

Signed-off-by: Zhigang Gong 
---
 backend/src/ir/value.cpp | 102 +++
 backend/src/ir/value.hpp |  13 ++
 2 files changed, 115 insertions(+)

diff --git a/backend/src/ir/value.cpp b/backend/src/ir/value.cpp
index 840fb5c..1c131e5 100644
--- a/backend/src/ir/value.cpp
+++ b/backend/src/ir/value.cpp
@@ -558,6 +558,108 @@ namespace ir {
 return it->second;
   }
 
+  void FunctionDAG::getRegUDBBs(Register r, set<const BasicBlock *> &BBs) const {
+auto dSet = getRegDef(r);
+for (auto &def : *dSet)
+  BBs.insert(def->getInstruction()->getParent());
+auto uSet = getRegUse(r);
+for (auto &use : *uSet)
+  BBs.insert(use->getInstruction()->getParent());
+  }
+
+  static void getLivenessBBs(const Liveness &liveness, Register r, const set<const BasicBlock *> &useDefSet,
+                             set<const BasicBlock *> &liveInSet, set<const BasicBlock *> &liveOutSet) {
+for (auto bb : useDefSet) {
+  if (liveness.getLiveOut(bb).contains(r))
+liveOutSet.insert(bb);
+  if (liveness.getLiveIn(bb).contains(r))
+liveInSet.insert(bb);
+}
+  }
+
+  bool FunctionDAG::interfere(const BasicBlock *bb, Register inReg, Register outReg) const {
+auto dSet = getRegDef(outReg);
+bool visited = false;
+for (auto &def : *dSet) {
+  auto defInsn = def->getInstruction();
+  if (defInsn->getParent() == bb) {
+visited = true;
+if (defInsn->getOpcode() == OP_MOV && defInsn->getSrc(0) == inReg)
+  continue;
+BasicBlock::const_iterator iter = BasicBlock::const_iterator(defInsn);
+BasicBlock::const_iterator iterE = bb->end();
+iter++;
+// check no use of phi in this basicblock between [phiCopySrc def, bb end]
+while (iter != iterE) {
+  const ir::Instruction *insn = iter.node();
+  // check phiUse
+  for (unsigned i = 0; i < insn->getSrcNum(); i++) {
+ir::Register src = insn->getSrc(i);
+if (src == inReg) {
+  std::cout << *insn << std::endl;
+  return true;
+}
+  }
+  ++iter;
+}
+  }
+}
+// We must visit the outReg at least once. Otherwise, something is going wrong.
+GBE_ASSERT(visited);
+return false;
+  }
+
+  bool FunctionDAG::interfere(const Liveness &liveness, Register r0, Register r1) const {
+// There are two interfering cases:
+//   1. Two registers are in the Livein set of the same BB.
+//   2. Two registers are in the Liveout set of the same BB.
+// If there are no intersection BBs, they are not interfering with each other.
+// If there are some intersection BBs, but one is only in the LiveIn and the other is
+// only in the LiveOut, then we need to check whether they interfere with each other in
+// that BB.
+set<const BasicBlock *> bbSet0;
+set<const BasicBlock *> bbSet1;
+getRegUDBBs(r0, bbSet0);
+getRegUDBBs(r1, bbSet1);
+
+set<const BasicBlock *> liveInBBSet0, liveInBBSet1;
+set<const BasicBlock *> liveOutBBSet0, liveOutBBSet1;
+getLivenessBBs(liveness, r0, bbSet0, liveInBBSet0, liveOutBBSet0);
+getLivenessBBs(liveness, r1, bbSet1, liveInBBSet1, liveOutBBSet1);
+
+set<const BasicBlock *> intersect;
+set_intersection(liveInBBSet0.begin(), liveInBBSet0.end(),
+ liveInBBSet1.begin(), liveInBBSet1.end(),
+ std::inserter(intersect, intersect.begin()));
+if (intersect.size() != 0)
+  return true;
+intersect.clear();
+set_intersection(liveOutBBSet0.begin(), liveOutBBSet0.end(),
+ liveOutBBSet1.begin(), liveOutBBSet1.end(),
+ std::inserter(intersect, intersect.begin()));
+if (intersect.size() != 0)
+  return true;
+
+set<const BasicBlock *> OIIntersect, IOIntersect;
+set_intersection(liveOutBBSet0.begin(), liveOutBBSet0.end(),
+ liveInBBSet1.begin(), liveInBBSet1.end(),
+ std::inserter(OIIntersect, OIIntersect.begin()));
+
+for (auto bb : OIIntersect) {
+  if (interfere(bb, r1, r0))
+return true;
+}
+
+set_intersection(liveInBBSet0.begin(), liveInBBSet0.end(),
+ liveOutBBSet1.begin(), liveOutBBSet1.end(),
+ std::inserter(IOIntersect, IOIntersect.begin()));
+for (auto bb : IOIntersect) {
+  if (interfere(bb, r0, r1))
+return true;
+}
+return false;
+  }
+
   std::ostream &operator<< (std::ostream &out, const FunctionDAG &dag) {
 const Function &fn = dag.getFunction();
 
diff --git a/backend/src/ir/value.hpp b/backend/src/ir/value.hpp
index a9e5108..ba3ba01 100644
--- a/backend/src/ir/value.hpp
+++ b/backend/src/ir/value.hpp
@@ -238,6 +238,19 @@ namespace ir {
 typedef map UDGraph;
 /*! The UseSet for each definition */
 typedef map DUGraph;
+/*! get register's use and define BB set */
+void getR

[Beignet] [PATCH 2/5] GBE: refine liveness analysis.

2015-08-31 Thread Zhigang Gong
Only in the Gen backend stage do we need to take care of the
special extra liveout and uniform analysis. In the IR stage,
we don't need to handle them.

Signed-off-by: Zhigang Gong 
---
 backend/src/backend/context.cpp |  2 +-
 backend/src/ir/liveness.cpp | 17 ++---
 backend/src/ir/liveness.hpp |  2 +-
 3 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/backend/src/backend/context.cpp b/backend/src/backend/context.cpp
index 33b2409..81b284d 100644
--- a/backend/src/backend/context.cpp
+++ b/backend/src/backend/context.cpp
@@ -322,7 +322,7 @@ namespace gbe
 unit(unit), fn(*unit.getFunction(name)), name(name), liveness(NULL), dag(NULL), useDWLabel(false)
   {
 GBE_ASSERT(unit.getPointerSize() == ir::POINTER_32_BITS);
-this->liveness = GBE_NEW(ir::Liveness, const_cast<ir::Function&>(fn));
+this->liveness = GBE_NEW(ir::Liveness, const_cast<ir::Function&>(fn), true);
 this->dag = GBE_NEW(ir::FunctionDAG, *this->liveness);
 // r0 (GEN_REG_SIZE) is always set by the HW and used at the end by EOT
 this->registerAllocator = NULL; //GBE_NEW(RegisterAllocator, GEN_REG_SIZE, 4*KB - GEN_REG_SIZE);
diff --git a/backend/src/ir/liveness.cpp b/backend/src/ir/liveness.cpp
index 9fa7ac3..e2240c0 100644
--- a/backend/src/ir/liveness.cpp
+++ b/backend/src/ir/liveness.cpp
@@ -27,7 +27,7 @@
 namespace gbe {
 namespace ir {
 
-  Liveness::Liveness(Function &fn) : fn(fn) {
+  Liveness::Liveness(Function &fn, bool isInGenBackend) : fn(fn) {
 // Initialize UEVar and VarKill for each block
 fn.foreachBlock([this](const BasicBlock &bb) {
   this->initBlock(bb);
@@ -48,12 +48,15 @@ namespace ir {
 }
 // extend register (def in loop, use out-of-loop) liveness to the whole loop
 set<Register> extentRegs;
-this->computeExtraLiveInOut(extentRegs);
-// analyze uniform values. The extentRegs contains all the values which is
-// defined in a loop and use out-of-loop which could not be a uniform. The reason
-// is that when it reenter the second time, it may active different lanes. So
-// reenter many times may cause it has different values in different lanes.
-this->analyzeUniform(&extentRegs);
+// Only in Gen backend we need to take care of extra live out analysis.
+if (isInGenBackend) {
+  this->computeExtraLiveInOut(extentRegs);
+  // analyze uniform values. The extentRegs contains all the values which is
+  // defined in a loop and use out-of-loop which could not be a uniform. The reason
+  // is that when it reenter the second time, it may active different lanes. So
+  // reenter many times may cause it has different values in different lanes.
+  this->analyzeUniform(&extentRegs);
+}
   }
 
   Liveness::~Liveness(void) {
diff --git a/backend/src/ir/liveness.hpp b/backend/src/ir/liveness.hpp
index 4a7dc4e..d9fa2ed 100644
--- a/backend/src/ir/liveness.hpp
+++ b/backend/src/ir/liveness.hpp
@@ -48,7 +48,7 @@ namespace ir {
   class Liveness : public NonCopyable
   {
   public:
-Liveness(Function &fn);
+Liveness(Function &fn, bool isInGenBackend = false);
 ~Liveness(void);
 /*! Set of variables used upwards in the block (before a definition) */
 typedef set<Register> UEVar;
-- 
1.9.1



[Beignet] [PATCH 5/5] GBE: implement further phi mov optimization based on intra-BB interefering analysis.

2015-08-31 Thread Zhigang Gong
The previous phi mov optimization tries to coalesce the phi copy source register
and the phi copy register if the phi copy source register is a normal SSA value.

But in some cases, many phi copy source registers are themselves phi copy values
which have multiple definitions. They could all be reduced to one phi copy register
if there is no interference in any BB. This patch, together with the previous patches,
reduces the number of spilled registers from 200+ to only 70 for a SGEMM kernel, and
the performance boosts by about 10 times.

Signed-off-by: Zhigang Gong 
---
 backend/src/llvm/llvm_gen_backend.cpp | 134 --
 1 file changed, 128 insertions(+), 6 deletions(-)

diff --git a/backend/src/llvm/llvm_gen_backend.cpp b/backend/src/llvm/llvm_gen_backend.cpp
index 38c63ce..1d09727 100644
--- a/backend/src/llvm/llvm_gen_backend.cpp
+++ b/backend/src/llvm/llvm_gen_backend.cpp
@@ -629,7 +629,15 @@ namespace gbe
 /*! Will try to remove MOVs due to PHI resolution */
 void removeMOVs(const ir::Liveness &liveness, ir::Function &fn);
 /*! Optimize phi move based on liveness information */
-void optimizePhiCopy(ir::Liveness &liveness, ir::Function &fn);
+void optimizePhiCopy(ir::Liveness &liveness, ir::Function &fn,
+ map<Register, Register> &replaceMap,
+ map<Register, Register> &redundantPhiCopyMap);
+/*! Further optimization after phi copy optimization:
+ *  redundant phi value elimination based on global liveness
+ *  interference checking. */
+void postPhiCopyOptimization(ir::Liveness &liveness, ir::Function &fn,
+ map<Register, Register> &replaceMap,
+ map<Register, Register> &redundantPhiCopyMap);
 /*! Will try to remove redundants LOADI in basic blocks */
 void removeLOADIs(const ir::Liveness &liveness, ir::Function &fn);
 /*! To avoid lost copy, we need two values for PHI. This function create a
@@ -2157,7 +2165,9 @@ namespace gbe
 });
   }
 
-  void GenWriter::optimizePhiCopy(ir::Liveness &liveness, ir::Function &fn)
+  void GenWriter::optimizePhiCopy(ir::Liveness &liveness, ir::Function &fn,
+  map<Register, Register> &replaceMap,
+  map<Register, Register> &redundantPhiCopyMap)
   {
 // The overall idea behind is we check whether there is any interference
 // between phi and phiCopy live range. If there is no point that
@@ -2168,7 +2178,6 @@ namespace gbe
 
 using namespace ir;
 ir::FunctionDAG *dag = new ir::FunctionDAG(liveness);
-
 for (auto &it : phiMap) {
   const Register phi = it.first;
   const Register phiCopy = it.second;
@@ -2248,8 +2257,13 @@ namespace gbe
 const Instruction *phiSrcUseInsn = s->getInstruction();
 replaceSrc(const_cast<Instruction *>(phiSrcUseInsn), phiCopySrc, phiCopy);
   }
+  replaceMap.insert(std::make_pair(phiCopySrc, phiCopy));
 }
   }
+} else {
+  if (((*(phiCopySrcDef->begin()))->getType() == 
ValueDef::DEF_INSN_DST) &&
+  redundantPhiCopyMap.find(phiCopySrc) == 
redundantPhiCopyMap.end())
+redundantPhiCopyMap.insert(std::make_pair(phiCopySrc, phiCopy));
 }
 
 // If phi is used in the same BB that define the phiCopy,
@@ -2281,7 +2295,7 @@ namespace gbe
 }
   }
 
-  // coalease phi and phiCopy 
+  // coalease phi and phiCopy
   if (isOpt) {
 for (auto &x : *phiDef) {
   const_cast(x->getInstruction())->remove();
@@ -2289,8 +2303,112 @@ namespace gbe
 for (auto &x : *phiUse) {
   const Instruction *phiUseInsn = x->getInstruction();
   replaceSrc(const_cast(phiUseInsn), phi, phiCopy);
+  replaceMap.insert(std::make_pair(phi, phiCopy));
+}
+  }
+}
+delete dag;
+  }
+
+  void GenWriter::postPhiCopyOptimization(ir::Liveness &liveness,
+ ir::Function &fn, map<Register, Register> &replaceMap,
+ map<Register, Register> &redundantPhiCopyMap)
+  {
+// When doing the first-pass phi copy optimization, we skip all the phi src MOV cases
+// whose phiSrcDefs are also phi values. We leave them until all phi copy optimizations
+// have been done; then we don't need to worry about reducible phi copies still remaining.
+// We only need to check whether those possibly redundant phi copy pairs interfere with
+// each other globally, by leveraging the DAG information.
+using namespace ir;
+
+// Firstly, validate all possible redundant phi copy map and update liveness
+// information accordingly.
+if (replaceMap.size() != 0) {
+  for (auto pair : replaceMap) {
+if (redundantPhiCopyMap.find(pair.first) != redundantPhiCopyMap.end()) {
+  auto it = redundantPhiCopyMap.find(pair.first);
+  Register phiCopy = it->second;
+  Register

[Beignet] [PATCH 1/5] GBE: refine Phi copy interfering check.

2015-08-31 Thread Zhigang Gong
If the PHI source register's definition instruction uses the
phi register, it is not an interference. For example:

MOV %phi, %phicopy
...
ADD %phiSrcDef, %phi, tmp
...
MOV %phicopy, %phiSrcDef
...

%phi and %phiSrcDef do not interfere with each other.
Simply advancing the start of the check to the next instruction is
enough to get a better result. For some special cases, this patch
brings a significant performance boost.

Signed-off-by: Zhigang Gong 
---
 backend/src/llvm/llvm_gen_backend.cpp | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/backend/src/llvm/llvm_gen_backend.cpp b/backend/src/llvm/llvm_gen_backend.cpp
index 4905415..38c63ce 100644
--- a/backend/src/llvm/llvm_gen_backend.cpp
+++ b/backend/src/llvm/llvm_gen_backend.cpp
@@ -2220,6 +2220,8 @@ namespace gbe
 
 ir::BasicBlock::const_iterator iter = ir::BasicBlock::const_iterator(phiCopySrcDefInsn);
 ir::BasicBlock::const_iterator iterE = bb->end();
+
+iter++;
 // check no use of phi in this basicblock between [phiCopySrc def, bb end]
 bool phiPhiCopySrcInterfere = false;
 while (iter != iterE) {
-- 
1.9.1


