Re: [TCWG CI] 433.milc:[.] mult_su3_mat_vec slowed down by 11% after llvm: [AMDGPU] Enable load clustering in the post-RA scheduler

Maxim Kuvyrkov Tue, 26 Oct 2021 12:58:38 -0700

Hi Jay,

This is a false positive.  We’ll look into why this report was sent out.


Regards,

--
Maxim Kuvyrkov
https://www.linaro.org

> On 26 Oct 2021, at 22:19, ci_not...@linaro.org wrote:
> 
> After llvm commit 66e13c7f439cf162d7ed1d25883e71a5755ac7ec
> Author: Jay Foad <jay.f...@amd.com>
> 
>    [AMDGPU] Enable load clustering in the post-RA scheduler
> 
> the following hot functions slowed down by more than 10% (but their 
> benchmarks slowed down by less than 2%):
> - 433.milc:[.] mult_su3_mat_vec slowed down by 11% from 2163 to 2391 perf 
> samples
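
The 11% figure quoted above can be re-derived from the two sample counts. The snippet below is only a sketch of that arithmetic: the function name is hypothetical, and rounding to the nearest whole percent is an assumption about how the CI formats the number, not something the report states.

```python
def slowdown_percent(before: int, after: int) -> float:
    """Relative increase in perf samples, as a percentage of the 'before' count."""
    return (after - before) / before * 100.0


# Sample counts taken from the report for mult_su3_mat_vec.
pct = slowdown_percent(2163, 2391)

# Assumption: the report rounds to the nearest whole percent.
print(f"raw: {pct:.2f}%, rounded: {round(pct)}%")
```

The raw value sits only slightly above the 10% reporting threshold, so a small amount of sampling noise is enough to move it across the line.
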
> 
> The reproducer instructions below can be used to rebuild both the "first_bad" 
> and "last_good" cross-toolchains used in this bisection.  Naturally, the 
> scripts will fail when triggering benchmarking jobs if you don't have access 
> to Linaro TCWG CI.
> 
> For your convenience, we have uploaded tarballs with pre-processed source and 
> assembly files at:
> - First_bad save-temps: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2/27/artifact/artifacts/build-66e13c7f439cf162d7ed1d25883e71a5755ac7ec/save-temps/
> - Last_good save-temps: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2/27/artifact/artifacts/build-838b4a533e6853d44e0c6d1977bcf0b06557d4ab/save-temps/
> - Baseline save-temps: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2/27/artifact/artifacts/build-baseline/save-temps/
> 
> Configuration:
> - Benchmark: SPEC CPU2006
> - Toolchain: Clang + Glibc + LLVM Linker
> - Version: all components were built from their tip of trunk
> - Target: aarch64-linux-gnu
> - Compiler flags: -O2
> - Hardware: NVidia TX1 4x Cortex-A57
> 
> This benchmarking CI is a work in progress, and we welcome feedback and 
> suggestions at linaro-toolchain@lists.linaro.org .  Our improvement plans 
> include adding support for SPEC CPU2017 benchmarks and providing "perf 
> report/annotate" data behind these reports.
> 
> THIS IS THE END OF INTERESTING STUFF.  BELOW ARE LINKS TO BUILDS, 
> REPRODUCTION INSTRUCTIONS, AND THE RAW COMMIT.
> 
> This commit has regressed these CI configurations:
> - tcwg_bmk_llvm_tx1/llvm-master-aarch64-spec2k6-O2
> 
> First_bad build: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2/27/artifact/artifacts/build-66e13c7f439cf162d7ed1d25883e71a5755ac7ec/
> Last_good build: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2/27/artifact/artifacts/build-838b4a533e6853d44e0c6d1977bcf0b06557d4ab/
> Baseline build: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2/27/artifact/artifacts/build-baseline/
> Even more details: 
> https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2/27/artifact/artifacts/
> 
> Reproduce builds:
> <cut>
> mkdir investigate-llvm-66e13c7f439cf162d7ed1d25883e71a5755ac7ec
> cd investigate-llvm-66e13c7f439cf162d7ed1d25883e71a5755ac7ec
> 
> # Fetch scripts
> git clone https://git.linaro.org/toolchain/jenkins-scripts
> 
> # Fetch manifests and test.sh script
> mkdir -p artifacts/manifests
> curl -o artifacts/manifests/build-baseline.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2/27/artifact/artifacts/manifests/build-baseline.sh --fail
> curl -o artifacts/manifests/build-parameters.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2/27/artifact/artifacts/manifests/build-parameters.sh --fail
> curl -o artifacts/test.sh https://ci.linaro.org/job/tcwg_bmk_ci_llvm-bisect-tcwg_bmk_tx1-llvm-master-aarch64-spec2k6-O2/27/artifact/artifacts/test.sh --fail
> chmod +x artifacts/test.sh
> 
> # Reproduce the baseline build (build all pre-requisites)
> ./jenkins-scripts/tcwg_bmk-build.sh @@ artifacts/manifests/build-baseline.sh
> 
> # Save baseline build state (which is then restored in artifacts/test.sh)
> mkdir -p ./bisect
> rsync -a --del --delete-excluded --exclude /bisect/ --exclude /artifacts/ --exclude /llvm/ ./ ./bisect/baseline/
> 
> cd llvm
> 
> # Reproduce first_bad build
> git checkout --detach 66e13c7f439cf162d7ed1d25883e71a5755ac7ec
> ../artifacts/test.sh
> 
> # Reproduce last_good build
> git checkout --detach 838b4a533e6853d44e0c6d1977bcf0b06557d4ab
> ../artifacts/test.sh
> 
> cd ..
> </cut>
> 
> Full commit (up to 1000 lines):
> <cut>
> commit 66e13c7f439cf162d7ed1d25883e71a5755ac7ec
> Author: Jay Foad <jay.f...@amd.com>
> Date:   Tue Oct 12 15:39:43 2021 +0100
> 
>    [AMDGPU] Enable load clustering in the post-RA scheduler
> 
>    This has a couple of benefits:
>    1. It can sometimes fix clusters that got broken apart when the register
>       allocator inserted a copy.
>    2. Post-RA scheduling does not have to worry about increasing register
>       pressure, which in some cases gives it more freedom to reorder
>       instructions.
> 
>    Testing on a collection of 10,000 graphics shaders compiled for gfx1010
>    showed:
>    - The average length of each run of one or more load instructions
>      increased by about 1%.
>    - The number of runs of two or more load instructions increased by
>      about 4%.
> ---
> llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp             | 1 +
> llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i128.ll | 5 ++---
> llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll             | 5 +++--
> llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll     | 4 ++--
> llvm/test/CodeGen/AMDGPU/idiv-licm.ll                      | 2 +-
> llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll     | 6 +++---
> llvm/test/CodeGen/AMDGPU/sdiv64.ll                         | 2 +-
> llvm/test/CodeGen/AMDGPU/srem64.ll                         | 2 +-
> llvm/test/CodeGen/AMDGPU/udiv64.ll                         | 2 +-
> llvm/test/CodeGen/AMDGPU/urem64.ll                         | 2 +-
> 10 files changed, 16 insertions(+), 15 deletions(-)
> 
> diff --git a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp 
> b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
> index b0902465c592..7b2d56e88b5f 100644
> --- a/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
> +++ b/llvm/lib/Target/AMDGPU/AMDGPUTargetMachine.cpp
> @@ -825,6 +825,7 @@ public:
>   createPostMachineScheduler(MachineSchedContext *C) const override {
>     ScheduleDAGMI *DAG = createGenericSchedPostRA(C);
>     const GCNSubtarget &ST = C->MF->getSubtarget<GCNSubtarget>();
> +    DAG->addMutation(createLoadClusterDAGMutation(DAG->TII, DAG->TRI));
>     DAG->addMutation(ST.createFillMFMAShadowMutation(DAG->TII));
>     return DAG;
>   }
> diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i128.ll 
> b/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i128.ll
> index fa500054e058..804dea705011 100644
> --- a/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i128.ll
> +++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/extractelement.i128.ll
> @@ -185,21 +185,20 @@ define i128 @extractelement_vgpr_v4i128_vgpr_idx(<4 x 
> i128> addrspace(1)* %ptr,
> ; GFX8-NEXT:    s_waitcnt vmcnt(0) expcnt(0) lgkmcnt(0)
> ; GFX8-NEXT:    v_add_u32_e32 v3, vcc, 16, v0
> ; GFX8-NEXT:    v_addc_u32_e32 v4, vcc, 0, v1, vcc
> -; GFX8-NEXT:    flat_load_dwordx4 v[8:11], v[0:1]
> ; GFX8-NEXT:    flat_load_dwordx4 v[4:7], v[3:4]
> +; GFX8-NEXT:    flat_load_dwordx4 v[8:11], v[0:1]
> ; GFX8-NEXT:    v_lshlrev_b32_e32 v16, 1, v2
> ; GFX8-NEXT:    v_add_u32_e32 v17, vcc, 1, v16
> ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 1, v17
> ; GFX8-NEXT:    v_cmp_eq_u32_e64 s[4:5], 1, v16
> ; GFX8-NEXT:    v_cmp_eq_u32_e64 s[6:7], 6, v16
> ; GFX8-NEXT:    v_cmp_eq_u32_e64 s[8:9], 7, v16
> -; GFX8-NEXT:    s_waitcnt vmcnt(1)
> +; GFX8-NEXT:    s_waitcnt vmcnt(0)
> ; GFX8-NEXT:    v_cndmask_b32_e64 v2, v8, v10, s[4:5]
> ; GFX8-NEXT:    v_cndmask_b32_e64 v3, v9, v11, s[4:5]
> ; GFX8-NEXT:    v_cndmask_b32_e32 v8, v8, v10, vcc
> ; GFX8-NEXT:    v_cndmask_b32_e32 v9, v9, v11, vcc
> ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 2, v16
> -; GFX8-NEXT:    s_waitcnt vmcnt(0)
> ; GFX8-NEXT:    v_cndmask_b32_e32 v2, v2, v4, vcc
> ; GFX8-NEXT:    v_cndmask_b32_e32 v3, v3, v5, vcc
> ; GFX8-NEXT:    v_cmp_eq_u32_e32 vcc, 2, v17
> diff --git a/llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll 
> b/llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll
> index 133a224b7437..bd4ecd3a17e5 100644
> --- a/llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll
> +++ b/llvm/test/CodeGen/AMDGPU/GlobalISel/udivrem.ll
> @@ -830,8 +830,8 @@ define amdgpu_kernel void @udivrem_v4i32(<4 x i32> 
> addrspace(1)* %out0, <4 x i32
> ; GFX9-LABEL: udivrem_v4i32:
> ; GFX9:       ; %bb.0:
> ; GFX9-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x20
> -; GFX9-NEXT:    v_mov_b32_e32 v2, 0x4f7ffffe
> ; GFX9-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x10
> +; GFX9-NEXT:    v_mov_b32_e32 v2, 0x4f7ffffe
> ; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
> ; GFX9-NEXT:    v_cvt_f32_u32_e32 v0, s0
> ; GFX9-NEXT:    v_cvt_f32_u32_e32 v1, s1
> @@ -926,9 +926,10 @@ define amdgpu_kernel void @udivrem_v4i32(<4 x i32> 
> addrspace(1)* %out0, <4 x i32
> ;
> ; GFX10-LABEL: udivrem_v4i32:
> ; GFX10:       ; %bb.0:
> +; GFX10-NEXT:    s_clause 0x1
> ; GFX10-NEXT:    s_load_dwordx4 s[8:11], s[4:5], 0x20
> -; GFX10-NEXT:    v_mov_b32_e32 v4, 0x4f7ffffe
> ; GFX10-NEXT:    s_load_dwordx4 s[0:3], s[4:5], 0x10
> +; GFX10-NEXT:    v_mov_b32_e32 v4, 0x4f7ffffe
> ; GFX10-NEXT:    v_mov_b32_e32 v8, 0
> ; GFX10-NEXT:    s_waitcnt lgkmcnt(0)
> ; GFX10-NEXT:    v_cvt_f32_u32_e32 v0, s8
> diff --git a/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll 
> b/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll
> index b033497d3aed..81b055166dd2 100644
> --- a/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll
> +++ b/llvm/test/CodeGen/AMDGPU/amdgpu-codegenprepare-idiv.ll
> @@ -11236,8 +11236,8 @@ define amdgpu_kernel void 
> @sdiv_i64_pow2_shl_denom(i64 addrspace(1)* %out, i64 %
> ; GFX6-LABEL: sdiv_i64_pow2_shl_denom:
> ; GFX6:       ; %bb.0:
> ; GFX6-NEXT:    s_load_dword s4, s[0:1], 0xd
> -; GFX6-NEXT:    s_mov_b64 s[2:3], 0x1000
> ; GFX6-NEXT:    s_load_dwordx4 s[8:11], s[0:1], 0x9
> +; GFX6-NEXT:    s_mov_b64 s[2:3], 0x1000
> ; GFX6-NEXT:    s_mov_b32 s7, 0xf000
> ; GFX6-NEXT:    s_mov_b32 s6, -1
> ; GFX6-NEXT:    s_waitcnt lgkmcnt(0)
> @@ -13358,8 +13358,8 @@ define amdgpu_kernel void 
> @srem_i64_pow2_shl_denom(i64 addrspace(1)* %out, i64 %
> ; GFX6-LABEL: srem_i64_pow2_shl_denom:
> ; GFX6:       ; %bb.0:
> ; GFX6-NEXT:    s_load_dword s4, s[0:1], 0xd
> -; GFX6-NEXT:    s_mov_b64 s[2:3], 0x1000
> ; GFX6-NEXT:    s_load_dwordx4 s[8:11], s[0:1], 0x9
> +; GFX6-NEXT:    s_mov_b64 s[2:3], 0x1000
> ; GFX6-NEXT:    s_mov_b32 s7, 0xf000
> ; GFX6-NEXT:    s_mov_b32 s6, -1
> ; GFX6-NEXT:    s_waitcnt lgkmcnt(0)
> diff --git a/llvm/test/CodeGen/AMDGPU/idiv-licm.ll 
> b/llvm/test/CodeGen/AMDGPU/idiv-licm.ll
> index fb9348bae000..9ea8f101b5e9 100644
> --- a/llvm/test/CodeGen/AMDGPU/idiv-licm.ll
> +++ b/llvm/test/CodeGen/AMDGPU/idiv-licm.ll
> @@ -491,8 +491,8 @@ define amdgpu_kernel void @urem16_invariant_denom(i16 
> addrspace(1)* nocapture %a
> ; GFX9-LABEL: urem16_invariant_denom:
> ; GFX9:       ; %bb.0: ; %bb
> ; GFX9-NEXT:    s_load_dword s2, s[0:1], 0x2c
> -; GFX9-NEXT:    s_mov_b32 s6, 0xffff
> ; GFX9-NEXT:    s_load_dwordx2 s[4:5], s[0:1], 0x24
> +; GFX9-NEXT:    s_mov_b32 s6, 0xffff
> ; GFX9-NEXT:    v_mov_b32_e32 v1, 0
> ; GFX9-NEXT:    s_movk_i32 s8, 0x400
> ; GFX9-NEXT:    s_waitcnt lgkmcnt(0)
> diff --git a/llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll 
> b/llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll
> index e2fbc0bc4af9..ba093ad3771d 100644
> --- a/llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll
> +++ b/llvm/test/CodeGen/AMDGPU/promote-constOffset-to-imm.ll
> @@ -100,14 +100,14 @@ define hidden amdgpu_kernel void @clmem_read(i8 
> addrspace(1)*  %buffer) {
> ; GFX900:    global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], 
> off{{$}}
> ;
> ; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], 
> off offset:-2048
> -; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], 
> off{{$}}
> ; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], 
> off offset:-2048
> -; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], 
> off{{$}}
> ; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], 
> off offset:-2048
> -; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], 
> off{{$}}
> ; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], 
> off offset:-2048
> ; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], 
> off{{$}}
> ; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], 
> off{{$}}
> +; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], 
> off{{$}}
> +; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], 
> off{{$}}
> +; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], 
> off{{$}}
> ; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], 
> off offset:-2048
> ; GFX10:   global_load_dwordx2 v[{{[0-9]+:[0-9]+}}], v[{{[0-9]+:[0-9]+}}], 
> off{{$}}
> 
> diff --git a/llvm/test/CodeGen/AMDGPU/sdiv64.ll 
> b/llvm/test/CodeGen/AMDGPU/sdiv64.ll
> index 0b80b4170316..dbb6d4805495 100644
> --- a/llvm/test/CodeGen/AMDGPU/sdiv64.ll
> +++ b/llvm/test/CodeGen/AMDGPU/sdiv64.ll
> @@ -6,8 +6,8 @@ define amdgpu_kernel void @s_test_sdiv(i64 addrspace(1)* 
> %out, i64 %x, i64 %y) {
> ; GCN-LABEL: s_test_sdiv:
> ; GCN:       ; %bb.0:
> ; GCN-NEXT:    s_load_dwordx2 s[4:5], s[0:1], 0xd
> -; GCN-NEXT:    v_mov_b32_e32 v7, 0
> ; GCN-NEXT:    s_load_dwordx4 s[8:11], s[0:1], 0x9
> +; GCN-NEXT:    v_mov_b32_e32 v7, 0
> ; GCN-NEXT:    s_mov_b32 s7, 0xf000
> ; GCN-NEXT:    s_mov_b32 s6, -1
> ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
> diff --git a/llvm/test/CodeGen/AMDGPU/srem64.ll 
> b/llvm/test/CodeGen/AMDGPU/srem64.ll
> index fac510e8dbda..04f8ea10545e 100644
> --- a/llvm/test/CodeGen/AMDGPU/srem64.ll
> +++ b/llvm/test/CodeGen/AMDGPU/srem64.ll
> @@ -6,8 +6,8 @@ define amdgpu_kernel void @s_test_srem(i64 addrspace(1)* 
> %out, i64 %x, i64 %y) {
> ; GCN-LABEL: s_test_srem:
> ; GCN:       ; %bb.0:
> ; GCN-NEXT:    s_load_dwordx2 s[12:13], s[0:1], 0xd
> -; GCN-NEXT:    v_mov_b32_e32 v2, 0
> ; GCN-NEXT:    s_load_dwordx4 s[8:11], s[0:1], 0x9
> +; GCN-NEXT:    v_mov_b32_e32 v2, 0
> ; GCN-NEXT:    s_mov_b32 s7, 0xf000
> ; GCN-NEXT:    s_mov_b32 s6, -1
> ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
> diff --git a/llvm/test/CodeGen/AMDGPU/udiv64.ll 
> b/llvm/test/CodeGen/AMDGPU/udiv64.ll
> index cc829b8e7eb3..48a86eec9832 100644
> --- a/llvm/test/CodeGen/AMDGPU/udiv64.ll
> +++ b/llvm/test/CodeGen/AMDGPU/udiv64.ll
> @@ -6,8 +6,8 @@ define amdgpu_kernel void @s_test_udiv_i64(i64 addrspace(1)* 
> %out, i64 %x, i64 %
> ; GCN-LABEL: s_test_udiv_i64:
> ; GCN:       ; %bb.0:
> ; GCN-NEXT:    s_load_dwordx2 s[2:3], s[0:1], 0xd
> -; GCN-NEXT:    v_mov_b32_e32 v2, 0
> ; GCN-NEXT:    s_load_dwordx4 s[8:11], s[0:1], 0x9
> +; GCN-NEXT:    v_mov_b32_e32 v2, 0
> ; GCN-NEXT:    s_mov_b32 s7, 0xf000
> ; GCN-NEXT:    s_mov_b32 s6, -1
> ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
> diff --git a/llvm/test/CodeGen/AMDGPU/urem64.ll 
> b/llvm/test/CodeGen/AMDGPU/urem64.ll
> index a0a4b73262a7..296aaf2ed1c6 100644
> --- a/llvm/test/CodeGen/AMDGPU/urem64.ll
> +++ b/llvm/test/CodeGen/AMDGPU/urem64.ll
> @@ -6,8 +6,8 @@ define amdgpu_kernel void @s_test_urem_i64(i64 addrspace(1)* 
> %out, i64 %x, i64 %
> ; GCN-LABEL: s_test_urem_i64:
> ; GCN:       ; %bb.0:
> ; GCN-NEXT:    s_load_dwordx2 s[12:13], s[0:1], 0xd
> -; GCN-NEXT:    v_mov_b32_e32 v2, 0
> ; GCN-NEXT:    s_load_dwordx4 s[8:11], s[0:1], 0x9
> +; GCN-NEXT:    v_mov_b32_e32 v2, 0
> ; GCN-NEXT:    s_mov_b32 s7, 0xf000
> ; GCN-NEXT:    s_mov_b32 s6, -1
> ; GCN-NEXT:    s_waitcnt lgkmcnt(0)
> </cut>
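
The commit message quoted above reports two metrics: the average length of each run of one or more load instructions, and the number of runs of two or more loads. The sketch below shows one way such metrics could be computed for a single instruction sequence. How the commit author actually measured them is not stated, so the helper names and the crude "mnemonic contains load" test are assumptions made for illustration. The two sequences are abbreviated from the GFX9 udivrem_v4i32 hunk in the diff.

```python
from itertools import groupby


def load_runs(mnemonics):
    """Lengths of maximal runs of consecutive load instructions."""
    # Assumption: any mnemonic containing "load" counts as a load.
    def is_load(mnemonic):
        return "load" in mnemonic

    return [
        sum(1 for _ in group)
        for key, group in groupby(mnemonics, key=is_load)
        if key
    ]


def summarize(mnemonics):
    """Return (average run length, number of runs with two or more loads)."""
    runs = load_runs(mnemonics)
    average = sum(runs) / len(runs) if runs else 0.0
    clustered = sum(1 for length in runs if length >= 2)
    return average, clustered


# Instruction order before and after the commit (GFX9 udivrem_v4i32 hunk).
before = ["s_load_dwordx4", "v_mov_b32_e32", "s_load_dwordx4", "s_waitcnt"]
after = ["s_load_dwordx4", "s_load_dwordx4", "v_mov_b32_e32", "s_waitcnt"]

print("before:", summarize(before))
print("after: ", summarize(after))
```

In this small example the two isolated loads become a single run of two once the v_mov is moved below them, which is the kind of change both metrics are meant to capture.
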

_______________________________________________
linaro-toolchain mailing list
linaro-toolchain@lists.linaro.org
https://lists.linaro.org/mailman/listinfo/linaro-toolchain
