Some architectures (I’m thinking of the latest Intel CPUs) have a small loop cache whose aim is to keep a loop entirely within that cache. That cache operates at the full speed of the instruction fetch/execute cycle (actually I think it holds the already-decoded uOps), so you can’t go any faster. It avoids both the L1 cache access penalty and, of course, the instruction decode time.
TTFN - Guy

> On Jan 8, 2019, at 2:43 PM, Chuck Guzis via cctalk <cctalk@classiccmp.org> wrote:
>
> On 1/8/19 1:23 PM, Tapley, Mark via cctalk wrote:
>
>> Why so (why surprising, I mean)? Understood an unrolled loop executes
>> faster...
>
> That can't always be true, can it?
>
> I'm thinking of an architecture where the instruction cache is slow to
> fill and multiple overlapping operations are involved and branch
> prediction assumes a branch taken. I'd say it was very close in that case.
>
> --Chuck