Hi!

On Fri, Jul 24, 2020 at 09:01:33AM +0200, Andrea Corallo wrote:
> Segher Boessenkool <seg...@kernel.crashing.org> writes:
> >> Correct, it's a sliding window only because the real load address is not
> >> known to the compiler and the algorithm is conservative. I believe we
> >> could use ASM_OUTPUT_ALIGN_WITH_NOP if we align each function to (al
> >> least) the granule size, then we should be able to insert 'nop aligned
> >> labels' precisely.
> >
> > Yeah, we have similar issues on Power... Our "granule" (fetch group
> > size, in our terminology) is 32 typically, but we align functions to
> > just 16. This is causing some problems, but aligning to bigger
> > boundaries isn't a very happy alternative either. WIP...
>
> Interesting, I was expecting other CPUs to have a similar mechanism.

On old cpus (like the 970) there were at most two branch predictions per
cycle. Nowadays, all branches are predicted; not sure when this changed,
it is pretty long ago already.

> > (We don't have this exact same problem, because our non-ancient cores
> > can just predict *all* branches in the same cycle).
> >
> >> My main fear is that given new cores tend to have big granules code size
> >> would blow. One advantage of the implemented algorithm is that even if
> >> slightly conservative it's impacting code size only where an high branch
> >> density shows up.
> >
> > What is "big granules" for you?
>
> N1 is 8 instructions so 32 bytes as well, I guess this may grow further
> (my speculation).

It has to sooner rather than later, yeah. Or the mechanism has to change
more radically. Interesting times ahead, I guess :-)

About your patch itself. The basic idea seems fine (I didn't look too
closely), but do you really need a new RTX class for this? That is not
very appetising...


Segher