On Tue, Apr 28, 2015 at 9:58 AM, Borislav Petkov <b...@alien8.de> wrote:
>
> Well, AFAIK, NOPs do require resources for tracking in the machine. I
> was hoping that hw would be smarter and discard at decode time but there
> probably are reasons that it can't be done (...yet).

I suspect it might be related to things like getting performance
counters and instruction debug traps etc right. There are quite
possibly also simply constraints where the front end has to generate
*something* just to keep the back end happy.

The front end can generally not just totally remove things without any
tracking, since the front end doesn't know if things are speculative
etc. So you can't do instruction debug traps in the front end afaik.
Or rather, I'm sure you *could*, but in general I suspect the best way
to handle nops without making them *too* special is to bunch up
several to make them look like one big instruction, and then associate
that bunch with some minimal tracking uop that uses minimal resources
in the back end without losing sight of the original nop entirely, so
that you can still do checks at retirement time.

So I think the "you can do ~5 nops per cycle" is not unreasonable.
Even in the uop cache, the nops have to take some space, and have to
do things like update eip, so I don't think they'll ever be entirely
free, the best you can do is minimize their impact.

> $ taskset -c 3 ./t
> Running 60 times, 1000000 loops per run.
> nop_0x90 average: 0.390625
> nop_3_byte average: 0.390625
>
> and those exact numbers are actually reproducible pretty reliably.

Yeah. That looks somewhat reasonable. I think the 16h architecture
technically decodes just two instructions per cycle, but I wouldn't be
surprised if there's some simple nop special casing going on so that
it can decode three nops in one go when things line up right. So you
might get 0.33 cycles for the best case, but then 0.5 cycles when it
crosses a 16-byte boundary or something.

So you might have some pattern where it decodes 32 bytes worth of nops
as 12/8/12 bytes (3/2/3 instructions), which would come out to 0.38
cycles. Add some random overhead for the loop, and I could see the
0.39 cycles.

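To make that arithmetic concrete, here is a small sketch of the guess above. It is not a model of the real decoder: it just assumes one decode group per cycle, and the helper name cycles_per_nop is made up for illustration. The 3/2/3 grouping and the 0.390625 figure come from the text above; 12/8/12 bytes for 3/2/3 instructions implies 4-byte nops in that hypothetical pattern.

```python
from fractions import Fraction


def cycles_per_nop(group_sizes):
    """Average cycles per nop, assuming each decode group takes one cycle.

    group_sizes lists how many nops are decoded in each successive cycle.
    This is only the back-of-the-envelope model from the text, not a
    description of what the hardware actually does.
    """
    return Fraction(len(group_sizes), sum(group_sizes))


best = cycles_per_nop([3])         # three nops every cycle
boundary = cycles_per_nop([2])     # only two nops per cycle
mixed = cycles_per_nop([3, 2, 3])  # the guessed 12/8/12-byte pattern

measured = Fraction("0.390625")    # average reported by the benchmark
leftover = measured - mixed        # what loop overhead would have to explain

print(f"best case:     {float(best):.3f} cycles/nop")
print(f"boundary case: {float(boundary):.3f} cycles/nop")
print(f"3/2/3 pattern: {float(mixed):.3f} cycles/nop")
print(f"left over:     {leftover} cycles/nop")
```

The 3/2/3 pattern gives exactly 3/8 = 0.375 cycles per nop, which rounds to the 0.38 above, and the gap to the measured 0.390625 is 1/64 of a cycle per nop. Whether that gap really is loop overhead is still a guess.
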
That was wild handwaving with no data to back it up, but I'm trying to
explain to myself why you could get some odd number like that. It
seems _possible_ at least.

                  Linus