As long as instructions stay in the pipeline, decode and execution appear to take one cycle each; pipeline flushes are the penalty. That is where speculative execution pays off (and is also food for Meltdown- and Spectre-type security holes). Such loops are quite fast when the prediction is right. Unrolling small loops gives only a small advantage if the predictor is already working well for your code, and large unrolled loops gain only a small amount percentage-wise, as always. If one unrolls too much, one may end up taking cache misses, which can easily eat up any benefit of the unrolling. Before speculative execution, unrolling had a clear advantage.
Dwight
________________________________
From: cctalk <cctalk-boun...@classiccmp.org> on behalf of Eric Korpela via cctalk <cctalk@classiccmp.org>
Sent: Wednesday, January 9, 2019 11:06 AM
To: ben; General Discussion: On-Topic and Off-Topic Posts
Subject: Re: OT? Upper limits of FSB

On Tue, Jan 8, 2019 at 3:01 PM ben via cctalk <cctalk@classiccmp.org> wrote:
> I bet I/O loops throw every thing off.

Even worse than you might think. For user mode code you've got at least two context switches, which are typically thousands of CPU cycles. On the plus side, when you start waiting for I/O the CPU will execute another context switch to resume running something else while the I/O completes. By the time you get back to your process, its memory may well have been evicted to L3 or back to main memory. Depending upon what else is going on, that might add 1 to 50 microseconds per I/O just for context switching and reloading caches. Of course, in an embedded processor you can run in kernel mode and busy wait if you want. Even fast memory-mapped I/O (e.g. a PCIe graphics card) that doesn't trigger a page fault is going to have variable latency and will probably have special cache handling.

--
Eric Korpela
korp...@ssl.berkeley.edu
AST:7731^29u18e3