On Fri, Jun 13, 2014 at 10:10:14PM -0400, George Spelvin wrote:
> > Unrolling doesn't make much difference; which isn't surprising given
> > that almost all of the differences go away when I commented out the
> > udelay().  Basically, at this point what we're primarily measuring is
> > how good various CPU's caches work, especially across context switches
> > where other code gets to run in between.
> 
> Huh.  As I was referring to when I talked about the branch
> predictor, I was hoping that removing *conditional* branches would
> help.
At least for Intel, between its branch predictor and speculative
execution engine, it doesn't make a difference.

> Are you trying for an XOR to memory, or is the idea to remain in
> registers for the entire operation?
> 
> I'm not sure an XOR to memory is that much better; it's 2 pool loads
> and 1 pool store either way.  Currently, the store is first (to
> input[]) and then both it and the fast_pool are fetched in fast_mix.
> 
> With an XOR to memory, it's load-store-load, but is that really better?

The second load can be optimized away.  If the compiler isn't smart
enough, the store means that the data is almost certainly still in the
D-cache.  But with a smart compiler (and gcc should be smart enough),
if fast_mix is a static function, gcc will inline fast_mix, and then
it should be able to optimize out the load.

In fact, it might be smart enough to optimize out the first store as
well, since it should be able to realize that the first store to the
pool[] array will get overwritten by the final store to the pool[]
array.  So hopefully the data will remain in registers for the entire
operation, and the compiler will hopefully be smart enough to do the
right thing without the code having to be really ugly.

> In case it's useful, below is a small patch I made to
> add_interrupt_randomness to take advantage of 64-bit processors and make
> it a bit clearer what it's doing.  Not submitted officially because:
> 1) I haven't examined the consequences on 32-bit processors carefully yet.

When I did a quick comparison, your 64-bit fast_mix2 variant was much
slower than either the 32-bit fast_mix2 or the original fast_mix
algorithm.  So given that 32-bit processors tend to be slower, I'm
pretty sure that if we want to add a 64-bit optimization, we'll have
to conditionalize it on BITS_PER_LONG == 64 and include both the
original code and the 64-bit optimized code.
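To make the register-resident idea concrete, here's a rough userspace
sketch.  This is *not* the kernel's actual fast_mix; the round
structure and rotate constants are made up for illustration.  The
point is only the shape: load the pool words into locals once, mix
entirely in those locals, and store once at the end, so that once the
function is inlined gcc can keep everything in registers and
dead-store-eliminate any intermediate writes to pool[].

```c
#include <stdint.h>

#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* Hypothetical pool layout for illustration only. */
struct fast_pool {
	uint32_t pool[4];
};

static void fast_mix_sketch(struct fast_pool *f, const uint32_t input[4])
{
	/*
	 * Load once into locals, folding the input in with XOR.
	 * With this function inlined, the compiler can keep
	 * a, b, c, d in registers for the whole operation.
	 */
	uint32_t a = f->pool[0] ^ input[0];
	uint32_t b = f->pool[1] ^ input[1];
	uint32_t c = f->pool[2] ^ input[2];
	uint32_t d = f->pool[3] ^ input[3];
	int i;

	/* A couple of add/rotate/XOR rounds (constants are arbitrary). */
	for (i = 0; i < 2; i++) {
		a += b; d ^= a; d = ROTL32(d, 16);
		c += d; b ^= c; b = ROTL32(b, 12);
		a += b; d ^= a; d = ROTL32(d, 8);
		c += d; b ^= c; b = ROTL32(b, 7);
	}

	/*
	 * Single store at the end; any earlier store to pool[]
	 * would be dead and the optimizer could remove it.
	 */
	f->pool[0] = a; f->pool[1] = b;
	f->pool[2] = c; f->pool[3] = d;
}
```

A 64-bit variant would have the same shape but operate on two
uint64_t locals, selected at build time with something like
`#if BITS_PER_LONG == 64`, keeping the 32-bit version for everyone
else.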
					- Ted