> Of course, somebody really should do timings on modern CPU's (in cpl0, comparing native_fl() that enables interrupts with a popf)

I didn't do CPL0 tests yet. I realized that cli/sti can be tested in
userspace if we set iopl(3) first.

Surprisingly, STI is slower than CLI. A loop with 27 CLIs and one STI
converges to about 0.5 insn/cycle:

        # compile with: gcc -nostartfiles -nostdlib
        _start: .globl _start
                mov     $172, %eax      #iopl
                mov     $3, %edi
                syscall
                mov     $200*1000*1000, %eax
                .balign 64
        loop:
                cli;cli;cli;cli
                cli;cli;cli;cli
                cli;cli;cli;cli
                cli;cli;cli;cli
                cli;cli;cli;cli
                cli;cli;cli;cli
                cli;cli;cli;sti
                dec     %eax
                jnz     loop
                mov     $231, %eax      #exit_group
                syscall

perf stat:

     6,015,787,968      instructions    #    0.52  insn per cycle
       3.355474199 seconds time elapsed

With all CLIs replaced by STIs, it's ~0.25 insn/cycle:

     6,030,530,328      instructions    #    0.27  insn per cycle
       6.547200322 seconds time elapsed

A POPF which needs to enable interrupts is not measurably faster than
one which does not change EFLAGS.IF.

Loop with:

        400158: fa      cli
        400159: 53      push   %rbx     #saved eflags with if=1
        40015a: 9d      popfq

shows:

     8,908,857,324      instructions    #    0.11  insn per cycle    ( +-  0.00% )

Loop with:

        400140: fb      sti
        400141: 53      push   %rbx
        400142: 9d      popfq

shows:

     8,920,243,701      instructions    #    0.10  insn per cycle    ( +-  0.01% )

Even a loop with neither CLI nor STI, only with POPF:

        400140: 53      push   %rbx
        400141: 9d      popfq

shows:

     6,079,936,714      instructions    #    0.10  insn per cycle    ( +-  0.00% )

This is on a Skylake CPU.

The gist of it: CLI is 2 cycles, STI is 4 cycles, POPF is 10 cycles,
seemingly regardless of the prior value of EFLAGS.IF.
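
For anyone who wants to check the arithmetic behind the CLI and STI figures, here is a rough sketch of how they can be recovered from the two insn/cycle numbers in the perf output. It is only a back-of-the-envelope model, not the method used to produce the measurements. It assumes that the cli/sti instructions execute strictly one after another and that the cost of the trailing dec/jnz is hidden behind them. The function names are made up for illustration.

```python
# Back-of-the-envelope model for the cli/sti loop measurements.
# Inputs are the insn-per-cycle values reported by perf stat.
# Assumption (not from the measurements): cli/sti serialize, and the
# dec/jnz pair overlaps with them, so it adds no cycles of its own.

ITERATIONS = 200 * 1000 * 1000   # loop count loaded into %eax
INSNS_PER_ITER = 28 + 2          # 28 cli/sti, plus dec and jnz


def cycles_per_iteration(insns_per_iter, ipc):
    """Cycles spent in one loop iteration at the given insn/cycle rate."""
    return insns_per_iter / ipc


def estimate_costs(ipc_mixed, ipc_all_sti, n_cli=27, n_sti=1):
    """Estimate per-instruction cycle costs of CLI and STI.

    ipc_mixed:   insn/cycle of the loop with n_cli CLIs and n_sti STIs
    ipc_all_sti: insn/cycle of the same loop with every CLI replaced by STI
    """
    total = n_cli + n_sti
    # All-STI loop: every one of the 28 slots costs one STI.
    sti = cycles_per_iteration(total + 2, ipc_all_sti) / total
    # Mixed loop: subtract the STI share, split the rest over the CLIs.
    mixed = cycles_per_iteration(total + 2, ipc_mixed)
    cli = (mixed - n_sti * sti) / n_cli
    return cli, sti


cli, sti = estimate_costs(ipc_mixed=0.52, ipc_all_sti=0.27)
print(f"expected instructions: {ITERATIONS * INSNS_PER_ITER:,}")
print(f"CLI ~{cli:.1f} cycles, STI ~{sti:.1f} cycles")
# prints:
#   expected instructions: 6,000,000,000
#   CLI ~2.0 cycles, STI ~4.0 cycles
```

The expected instruction count of 6.0e9 is close to the roughly 6.02e9 that perf reports for both runs, which suggests the loop body dominates the measurement.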