As long as things stay in the pipe, instruction decode and execution appear to 
take one cycle each. Pipe flushes are the penalty, and that is where speculative 
execution pays off (it is also food for Meltdown- and Spectre-type security holes). 
Such loops are quite fast if the prediction was right.
Small unrolled loops only give you a small advantage if the predictor is already 
working well for your code, and large unrolled loops only gain a small amount 
percentage-wise, as always.
If one unrolls by a large amount, one may end up with instruction cache misses. 
Those can easily eat up any benefit of unrolling the loops. Before speculative 
execution, unrolling had a clear advantage.
Dwight

________________________________
From: cctalk <cctalk-boun...@classiccmp.org> on behalf of Eric Korpela via 
cctalk <cctalk@classiccmp.org>
Sent: Wednesday, January 9, 2019 11:06 AM
To: ben; General Discussion: On-Topic and Off-Topic Posts
Subject: Re: OT? Upper limits of FSB

On Tue, Jan 8, 2019 at 3:01 PM ben via cctalk <cctalk@classiccmp.org> wrote:

> I bet I/O loops throw every thing off.
>

Even worse than you might think.  For user mode code you've got at least
two context switches which are typically thousands of CPU cycles.  On the
plus side when you start waiting for I/O the CPU will execute another
context switch to resume running something else while waiting for the I/O
to complete.  By the time you get back to your process, it's likely your
process's memory has been evicted to L3 or back to main memory.  Depending upon what
else is going on it might add 1 to 50 microseconds per I/O just for context
switching and reloading caches.

Of course in an embedded processor you can run in kernel mode and busy wait
if you want.

Even fast memory mapped I/O (e.g. a PCIe graphics card) that doesn't trigger
a page fault is going to have variable latency and will probably have
special cache handling.


--
Eric Korpela
korp...@ssl.berkeley.edu
AST:7731^29u18e3
