https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111376
--- Comment #13 from YunQiang Su <syq at gcc dot gnu.org> ---
I tried inserting

    li $3, 500
    li $5, 500

between SLL/BGEZ and between LUI+AND/BNE. The latter is still somewhat faster on Loongson 3A4000.

I noticed the following in the 74K software manual:

    The 74K core’s ALU is pipelined. Some ALU instructions complete the operation and bypass the results in this cycle. These instructions are referred to as single-cycle ops and they include all logical instructions (AND, ANDI, OR, ORI, XOR, XORI, LUI), some shift instructions (SLL sa<=8, SRL 31<=sa<=25), and some arithmetic instructions (ADD rt=0, ADDU rt=0, SLT, SLTI, SLTU, SLTIU, SEH, SEB, ZEH, ZEB). In addition, add instructions (ADD, ADDU, ADDI, ADDIU) complete the operation and bypass results to the ALU pipe in this cycle.

I guess this means that SLL may be somewhat slower when sa>8. On Loongson 3A4000, the corresponding threshold seems to be 20/21. This may mean that we need to be careful about the 64-bit case as well.

Could you run a test on XBurst 1?
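
For readers following along, both sequences compared above implement the same single-bit test. The C sketch below shows the two forms side by side. It is only an illustration: the bit index `BIT = 20` and the function names are hypothetical and do not come from the test case in this bug. The index was picked so that the mask does not fit a 16-bit immediate (hence LUI+AND) and the shift amount is greater than 8.

```c
#include <stdint.h>

/* Hypothetical bit index, chosen only for illustration. */
enum { BIT = 20 };

/* Shift form (SLL + BGEZ style): move the tested bit into the sign
 * position, then branch on the sign. The conversion to int32_t is
 * implementation-defined in ISO C; GCC defines it as wrapping. */
static int bit_set_shift(uint32_t x)
{
    return (int32_t)(x << (31 - BIT)) < 0;
}

/* Mask form (LUI + AND + BNE style): AND with a single-bit constant
 * and compare the result against zero. */
static int bit_set_mask(uint32_t x)
{
    return (x & (UINT32_C(1) << BIT)) != 0;
}
```

Both functions return the same result for every input, so which instruction sequence the compiler should emit is purely a question of per-core latency, which is what the measurements above are probing.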