Issue 163757
Summary [RISC-V] Possible performance improvement in matrix multiply function in Coremark
Labels
Assignees
Reporter christian-herber-nxp
    In the following function from CoreMark:

```
#define bit_extract(x,from,to) (((x)>>(from)) & (~(0xffffffff << (to))))

void matrix_mul_matrix_bitextract(ee_u32 N, MATRES *C, MATDAT *A, MATDAT *B) {
	ee_u32 i,j,k;
	for (i=0; i<N; i++) {
		for (j=0; j<N; j++) {
			C[i*N+j]=0;
			for(k=0;k<N;k++)
			{
				MATRES tmp=(MATRES)A[i*N+k] * (MATRES)B[k*N+j];
				C[i*N+j]+=bit_extract(tmp,2,4)*bit_extract(tmp,5,7);
			}
		}
	}
}
```

With -O3 -march=rv32imbc -mabi=ilp32, LLVM currently generates worse code than GCC for this function.
The inner loop for clang is:

```
.LBB0_4:
        lhu     t6, 0(t5)
        lhu     s0, 0(a4)
        addi    a5, a5, -1
        add     a4, a4, t3
        mul     s0, s0, t6
        slli    t6, s0, 26
        slli    s0, s0, 20
        srli    t6, t6, 28
        srli    s0, s0, 25
        mul     s0, t6, s0
        add     t4, t4, s0
        addi    t5, t5, 2
        bnez    a5, .LBB0_4
```

while GCC generates one fewer instruction (12 instead of 13):

```
.L4:
        lh      a4,0(a6)
        lh      a5,0(a2)
        addi    a2,a2,2
        sh1add  a6,a0,a6
        mul     a5,a5,a4
        srai    a4,a5,2
        srai    a5,a5,5
        andi    a4,a4,15
        andi    a5,a5,127
        mul     a5,a4,a5
        add     a7,a7,a5
        bne     t1,a2,.L4
```
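Both listings spend four instructions on the two `bit_extract` calls, so the extraction itself does not look like the source of the extra instruction. Clang uses a left shift followed by a logical right shift (`slli`/`srli`), while GCC uses a right shift followed by a mask (`srai`/`andi`). The following is a minimal sketch of why the two forms should give the same result. The helper names are hypothetical, and it works on an unsigned 32-bit copy of `tmp`, which is an assumption made to keep the shifts well defined in C.

```c
#include <stdint.h>

/* GCC-style form: shift right, then mask off `width` low bits.
   Assumes 0 < width < 32. */
static uint32_t field_shift_mask(uint32_t x, unsigned from, unsigned width) {
    return (x >> from) & ((1u << width) - 1u);
}

/* clang-style form: shift left to discard the bits above the field,
   then shift right (logically) to discard the bits below it.
   Assumes 0 < width and from + width <= 32. */
static uint32_t field_shift_pair(uint32_t x, unsigned from, unsigned width) {
    return (x << (32u - from - width)) >> (32u - width);
}

/* Returns 1 when both forms agree on the two fields the kernel uses:
   bit_extract(tmp,2,4) and bit_extract(tmp,5,7). */
int extraction_forms_agree(uint32_t x) {
    return field_shift_mask(x, 2, 4) == field_shift_pair(x, 2, 4)
        && field_shift_mask(x, 5, 7) == field_shift_pair(x, 5, 7);
}
```

For `from = 2, width = 4` the shift pair is 26 then 28, and for `from = 5, width = 7` it is 20 then 25, which matches the immediates in the clang listing.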

I could not pinpoint exactly what differs, but I believe that GCC uses the memory address itself as the loop termination condition (`bne t1,a2`), while clang maintains a separate counter (a5) that it decrements on every iteration.
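To illustrate the suspected difference at source level, here is a rough sketch of the inner loop written both ways. It is not the CoreMark source: the typedefs, the `extract` helper, and the function names are stand-ins I made up, and a compiler is free to turn either form into the other. It only mirrors what the two listings appear to do.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-ins for CoreMark's MATDAT / MATRES typedefs. */
typedef int16_t matdat_t;
typedef int32_t matres_t;

/* Same field as bit_extract(x, from, width), on an unsigned value. */
static uint32_t extract(uint32_t x, unsigned from, unsigned width) {
    return (x >> from) & ~(0xffffffffu << width);
}

/* One term of the inner loop: product of the two extracted fields. */
static uint32_t term(matdat_t a, matdat_t b) {
    uint32_t tmp = (uint32_t)((matres_t)a * (matres_t)b);
    return extract(tmp, 2, 4) * extract(tmp, 5, 7);
}

/* Counter form (what the clang listing appears to do): a separate trip
   count is decremented next to the two address updates. */
uint32_t dot_counter(const matdat_t *row, const matdat_t *col, size_t n) {
    uint32_t acc = 0;
    size_t col_idx = 0;
    for (size_t remaining = n; remaining != 0; remaining--) {
        acc += term(*row, col[col_idx]);
        row += 1;      /* next element of the row of A */
        col_idx += n;  /* next element of the column of B */
    }
    return acc;
}

/* Pointer form (what the GCC listing appears to do): the row pointer is
   compared against a precomputed end address, so no counter is needed. */
uint32_t dot_pointer(const matdat_t *row, const matdat_t *col, size_t n) {
    uint32_t acc = 0;
    size_t col_idx = 0;
    const matdat_t *row_end = row + n;
    while (row != row_end) {
        acc += term(*row, col[col_idx]);
        row += 1;
        col_idx += n;
    }
    return acc;
}
```

The pointer form has one fewer loop-carried update per iteration, which would account for the one-instruction difference if this is indeed what is happening.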

