https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79012
Bug ID: 79012
Summary: basic block reordering causes suboptimal code
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: middle-end
Assignee: unassigned at gcc dot gnu.org
Reporter: saaadhu at gcc dot gnu.org
CC: segher at kernel dot crashing.org
Target Milestone: ---

For this C code (slightly modified from PR 30908):

void wait(int i)
{
  while (i-- > 0)
    asm volatile("nop" ::: "memory");
}

gcc 4.8 at -Os produces

        jmp     .L2
.L3:
        nop
        decl    %edi
.L2:
        testl   %edi, %edi
        jg      .L3
        ret

whereas gcc trunk (and 4.9 onwards, from a quick check) produces

.L2:
        testl   %edi, %edi
        jle     .L5
        nop
        decl    %edi
        jmp     .L2
.L5:
        ret

The code size is identical, but the trunk version executes one more instruction on every iteration of the loop. Trunk closes the loop with an unconditional jmp .L2 and exits through the taken jle .L5. In contrast, 4.8 loops back with the conditional jg .L3 and simply falls through to ret on exit. The trunk version is faster only if the loop body never runs.

This happens irrespective of the "memory" clobber in the inline assembler statement.

Digging into the dump files, I found that the transformation occurs in the bb reorder pass. That pass calls cfg_layout_initialize, which eventually calls try_redirect_by_replacing_jump with in_cfglayout set to true. That function then removes the jump, which triggers the RTL transformation that ultimately results in slower code.

RTL before and after bbro:
Before:

(jump_insn 24 6 25 2 (set (pc)
        (label_ref 15)) "pr30908.c":3 678 {jump}
     (nil)
 -> 15)
(barrier 25 24 17)
(code_label 17 25 12 3 3 "" [1 uses])
(note 12 17 13 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
(insn 13 12 14 3 (parallel [
            (asm_operands/v ("nop") ("") 0 []
                 []
                 [] pr30908.c:4)
            (clobber (mem:BLK (scratch) [0 A8]))
            (clobber (reg:CCFP 18 fpsr))
            (clobber (reg:CC 17 flags))
        ]) "pr30908.c":4 -1
     (expr_list:REG_UNUSED (reg:CCFP 18 fpsr)
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
(insn 14 13 15 3 (parallel [
            (set (reg:SI 5 di [orig:90 ivtmp.9 ] [90])
                (plus:SI (reg:SI 5 di [orig:90 ivtmp.9 ] [90])
                    (const_int -1 [0xffffffffffffffff])))
            (clobber (reg:CC 17 flags))
        ]) 210 {*addsi_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))
(code_label 15 14 16 4 2 "" [1 uses])
(note 16 15 18 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
(insn 18 16 19 4 (set (reg:CCNO 17 flags)
        (compare:CCNO (reg:SI 5 di [orig:90 ivtmp.9 ] [90])
            (const_int 0 [0]))) "pr30908.c":3 3 {*cmpsi_ccno_1}
     (nil))
(jump_insn 19 18 30 4 (set (pc)
        (if_then_else (gt (reg:CCNO 17 flags)
                (const_int 0 [0]))
            (label_ref 17)
            (pc))) "pr30908.c":3 646 {*jcc_1}
     (expr_list:REG_DEAD (reg:CCNO 17 flags)
        (int_list:REG_BR_PROB 8500 (nil)))
 -> 17)
(note 30 19 28 5 [bb 5] NOTE_INSN_BASIC_BLOCK)
(note 28 30 29 5 NOTE_INSN_EPILOGUE_BEG)
(jump_insn 29 28 31 5 (simple_return) "pr30908.c":5 708 {simple_return_internal}
     (nil)
 -> simple_return)

After:

<snip>
(code_label 15 6 16 3 2 "" [1 uses])
(note 16 15 18 3 [bb 3] NOTE_INSN_BASIC_BLOCK)
(insn 18 16 19 3 (set (reg:CCNO 17 flags)
        (compare:CCNO (reg:SI 5 di [orig:90 ivtmp.9 ] [90])
            (const_int 0 [0]))) "pr30908.c":3 3 {*cmpsi_ccno_1}
     (nil))
(jump_insn 19 18 12 3 (set (pc)
        (if_then_else (le (reg:CCNO 17 flags)
                (const_int 0 [0]))
            (label_ref:DI 34)
            (pc))) "pr30908.c":3 646 {*jcc_1}
     (expr_list:REG_DEAD (reg:CCNO 17 flags)
        (int_list:REG_BR_PROB 1500 (nil)))
 -> 34)
(note 12 19 13 4 [bb 4] NOTE_INSN_BASIC_BLOCK)
(insn 13 12 14 4 (parallel [
            (asm_operands/v ("nop") ("") 0 []
                 []
                 [] pr30908.c:4)
            (clobber (mem:BLK (scratch) [0 A8]))
            (clobber (reg:CCFP 18 fpsr))
            (clobber (reg:CC 17 flags))
        ]) "pr30908.c":4 -1
     (expr_list:REG_UNUSED (reg:CCFP 18 fpsr)
        (expr_list:REG_UNUSED (reg:CC 17 flags)
            (nil))))
(insn 14 13 35 4 (parallel [
            (set (reg:SI 5 di [orig:90 ivtmp.9 ] [90])
                (plus:SI (reg:SI 5 di [orig:90 ivtmp.9 ] [90])
                    (const_int -1 [0xffffffffffffffff])))
            (clobber (reg:CC 17 flags))
        ]) 210 {*addsi_1}
     (expr_list:REG_UNUSED (reg:CC 17 flags)
        (nil)))
(jump_insn 35 14 36 4 (set (pc)
        (label_ref 15)) -1
     (nil)
 -> 15)
(barrier 36 35 34)
(code_label 34 36 30 5 5 "" [1 uses])
(note 30 34 28 5 [bb 5] NOTE_INSN_BASIC_BLOCK)
(note 28 30 29 5 NOTE_INSN_EPILOGUE_BEG)
(jump_insn 29 28 31 5 (simple_return) "pr30908.c":5 708 {simple_return_internal}
     (nil)
 -> simple_return)