Hello Everyone, I have been studying the machine code generated by V8 for WebAssembly. I took the following function *kernel_gemm* as an example:
    #define NI 1000
    #define NJ 1100
    #define NK 1200

    void kernel_gemm(int C[NI][NJ], int A[NI][NK], int B[NK][NJ])
    {
      int i, j, k;
      for (i = 0; i < NI; i++) {
        for (k = 0; k < NK; k++) {
          for (j = 0; j < NJ; j++) {
            C[i][j] += A[i][k] * B[k][j];
          }
        }
      }
    }

The above file is compiled to WASM using the latest emsdk (based on clang/LLVM 6.0) and is executed by V8. After studying the machine code V8 generates for this function, I found that there are extra stack loads:

        movq %rax, -0x10(%rsp)
        movq %rdx, -0x18(%rsp)
        xorq %rsi,%rsi
        movq $0, %rdi
        nop
    L1:
        imull $0x12c0, %esi, %r8d
        addq %rdx, %r8
        imull $0x1130,%esi,%r9d
        addq %rax,%r9
        xorq %r11,%r11
        nop
    L2:
        imull $0x1130,%r11d,%r12d
        leaq (%r8,%r11,4),%r14
        addq %rcx,%r12
        xorq %rbx,%rbx
        movl %ebx,%r15d
        nop
        nop
    L3:
        leaq 0x1(%r15),%rax
        leaq (%r12,%r15,4),%rbx
        movl (%rdi,%r14,1),%edx
        leaq (%r9,%r15,4),%r15
        movl (%rdi,%rbx,1),%ebx
        imull %ebx,%edx
        movl (%rdi,%r15,1),%ebx
        addl %ebx,%edx
        movl %ebx,(%rdi,%r15,1)
        cmpl $0x44c,%eax
        jz L3END
        movl %eax,%r15d
        jmp L3
    L3END:
        addl $0x1,%r11d
        cmpl $0x4b0,%r11d
        jnz L2
        addl $0x1,%esi
        cmpl $0x3e8,%esi
        jz L1END
        movq -0x18(%rsp),%rdx
        movq -0x10(%rsp),%rax
        jmp L1
    L1END:
        addq $0x20, %rsp

As you can see, there are extra stack loads of the *rdx* and *rax* registers in every iteration of the outermost loop (between *L3END* and *L1END*). In contrast, clang generates code that performs around 1.3x better than the V8 code and has no stack loads of operands.

According to the calling convention of V8-generated code, the arguments are passed in the registers *rax*, *rcx*, *rdx*. Hence, *rdx* and *rax* are for the variables B and C respectively.

I have been trying to find out why there are extra loads. One reason could be that V8's register allocator is not as good as clang's (which I guess is fine, because V8 is a JIT and JITs are supposed to generate code faster than AOT compilers). But I think there might be another reason, such as support for On-Stack Replacement or preemption of code.
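The immediates in the V8 listing above can be matched against the array dimensions, which makes the address arithmetic easier to follow. The following is a minimal sketch, assuming a 4-byte int (as on wasm32); the helpers offset_A, offset_B, offset_C and check_listing_immediates are made-up names for illustration and do not come from V8 or emsdk.

```c
#include <assert.h>
#include <stdint.h>

#define NI 1000
#define NJ 1100
#define NK 1200

/* Assumption: int is 4 bytes, as on wasm32 and x86-64. */
enum { ELEM_SIZE = 4 };

/* Hypothetical helpers: byte offset of an element from the start of its array. */
static uint32_t offset_A(uint32_t i, uint32_t k) { return (i * NK + k) * ELEM_SIZE; }
static uint32_t offset_B(uint32_t k, uint32_t j) { return (k * NJ + j) * ELEM_SIZE; }
static uint32_t offset_C(uint32_t i, uint32_t j) { return (i * NJ + j) * ELEM_SIZE; }

/* Checks that the immediates seen in the listing match the array dimensions. */
static void check_listing_immediates(void)
{
    assert(offset_A(1, 0) == 0x12c0); /* row stride of A: NK * 4 = 4800 */
    assert(offset_B(1, 0) == 0x1130); /* row stride of B: NJ * 4 = 4400 */
    assert(offset_C(1, 0) == 0x1130); /* row stride of C: NJ * 4 = 4400 */
    assert(NJ == 0x44c);              /* bound of the j loop (L3) */
    assert(NK == 0x4b0);              /* bound of the k loop (L2) */
    assert(NI == 0x3e8);              /* bound of the i loop (L1) */
}
```

Calling check_listing_immediates() from a main function should run without tripping an assertion. Read this way, the pointer in *rdx* is scaled by the row stride of A (0x12c0) and the pointer in *rcx* by the row stride of B (0x1130), so the listing suggests that *rdx* holds A and *rcx* holds B. This is only an inference from the strides, not something confirmed from V8's calling convention.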
It would be really great if anyone could point me in the right direction in the V8 source code. I have looked at wasm-compiler.cc but couldn't find anything.

NOTE: The V8 code was generated using Node.js v8.11.2 and has been converted to a simpler format by replacing the absolute addresses in the code with labels. The above V8 listing, when assembled using clang (after taking care of clang's calling convention), performs exactly the same as the code V8 generates.

As a reference, the clang-generated assembly code is:

        xorl    %r8d, %r8d
        .p2align 4, 0x90
    .LBB0_1:                            # %for.body
                                        # =>This Loop Header: Depth=1
                                        #     Child Loop BB0_2 Depth 2
                                        #       Child Loop BB0_3 Depth 3
        movq    %rdx, %r10
        xorl    %r9d, %r9d
        .p2align 4, 0x90
    .LBB0_2:                            # %for.body3
                                        #   Parent Loop BB0_1 Depth=1
                                        # =>  This Loop Header: Depth=2
                                        #       Child Loop BB0_3 Depth 3
        imulq   $4800, %r8, %rax        # imm = 0x12C0
        addq    %rsi, %rax
        leaq    (%rax,%r9,4), %r11
        movq    $-1100, %rcx            # imm = 0xFBB4
        .p2align 4, 0x90
    .LBB0_3:                            # %for.body6
                                        #   Parent Loop BB0_1 Depth=1
                                        #     Parent Loop BB0_2 Depth=2
                                        # =>    This Inner Loop Header: Depth=3
        movl    (%r11), %eax
        movl    4400(%r10,%rcx,4), %ebx
        imull   %eax, %ebx
        movl    4400(%rdi,%rcx,4), %eax
        addl    %ebx, %eax
        movl    %eax, 4400(%rdi,%rcx,4)
        addq    $1, %rcx
        jne     .LBB0_3
    # %bb.4:                            # %for.inc17
                                        #   in Loop: Header=BB0_2 Depth=2
        addq    $1, %r9
        addq    $4400, %r10             # imm = 0x1130
        cmpq    $1200, %r9              # imm = 0x4B0
        jne     .LBB0_2
    # %bb.5:                            # %for.inc20
                                        #   in Loop: Header=BB0_1 Depth=1
        addq    $1, %r8
        addq    $4400, %rdi             # imm = 0x1130
        cmpq    $1000, %r8              # imm = 0x3E8
        jne     .LBB0_1
    # %bb.6:                            # %for.end22
        popq    %rbx
        retq

Thank You,