On Wed, Aug 17, 2016 at 12:35 PM, Denys Vlasenko <dvlas...@redhat.com> wrote:
>
> Experimentally, POPF is stupidly slow _always_. 6 cycles
> even if none of the "scary" flags are changed.

6 cycles is nothing.

That's basically the overhead of "oops, I need to use the microcode sequencer".

One issue is that the Intel decoders (AMD too, for that matter) can only generate a fairly small set of uops for any instruction. Some instructions are really trivial to decode (popf definitely falls under that heading), but are more than just a couple of uops, so you end up having to use the uop sequencer logic.

According to Agner Fog's tables, there's one or two micro-architectures that actually do the simple "popf" case with a single cycle throughput, but that's the very unusual case.

You can't even fit the "pop a value, see if only the arithmetic flags changed, trap to microcode otherwise" into the three or four uops that the "complex decoder" can generate directly. And that "fall back to the uop sequencer engine" tends to just always cause several cycles regardless.

So yes, microcode tends to be slow even for what would otherwise be trivial operations. You'd think Intel could do as well as they do for the L0 uop cache, but afaik they don't.

Anyway, six cycles is fast. I'd *love* for popf to actually be just 6 cycles when IF changes. It's much much worse iirc (although honestly, I haven't timed it in years - it's much easier to time just the arithmetic flag changes). It used to be more like a hundred cycles on Prescott.

Linus
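
For anyone who wants to time the easy case themselves (popf where no flags actually change), here is a minimal user-space sketch, assuming x86-64 and GCC or Clang. The helper names `pushf_popf` and `time_pushf_popf` are made up for illustration. It counts TSC ticks rather than core cycles, and the number includes the pushf and the loop overhead, so treat the result as a rough upper bound rather than a real popf latency. Timing the IF-changing case is not possible this way, since user space normally can't toggle IF.

```c
#include <assert.h>
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc(), provided by GCC and Clang on x86 */

/* Hypothetical helper: push RFLAGS and pop the same value back, so no
 * flag changes.  The lea pair steps over the 128-byte red zone so the
 * push can't scribble on the caller's locals; lea leaves flags alone. */
static inline void pushf_popf(void)
{
    __asm__ volatile(
        "lea -128(%%rsp), %%rsp\n\t"
        "pushfq\n\t"
        "popfq\n\t"
        "lea 128(%%rsp), %%rsp"
        ::: "cc", "memory");
}

/* Hypothetical helper: TSC ticks per pushf+popf pair, taking the best
 * of a few runs to filter out interrupts and other noise. */
double time_pushf_popf(unsigned iterations)
{
    assert(iterations > 0);

    uint64_t best = UINT64_MAX;
    for (int run = 0; run < 5; run++) {
        uint64_t start = __rdtsc();
        for (unsigned i = 0; i < iterations; i++)
            pushf_popf();
        uint64_t elapsed = __rdtsc() - start;
        if (elapsed < best)
            best = elapsed;
    }
    return (double)best / iterations;
}
```

Calling `time_pushf_popf(1000000)` from a small `main()` and printing the result gives a ballpark figure. Subtracting the time of the same loop with an empty asm body would remove most of the loop overhead, though the pushf cost stays in.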