> On Dec 11, 2020, at 9:54 AM, Maciej W. Rozycki <ma...@linux-mips.org> wrote:
> 
> On Wed, 9 Dec 2020, Paul Koning wrote:
> 
>>> This all sounds great.  Do you happen to know if it is cycle-accurate 
>>> with respect to individual hardware microarchitectures simulated?  That 
>>> would be required for performance evaluation of compiler-generated code.
>> 
>> No, it isn't.  I believe it just charges one time unit per instruction, 
>> with the possible exception of CIS instructions.
> 
> Fair enough; in my experience most CPU emulators are instruction-accurate 
> only.  Of all the generally available emulators I came across (and looked 
> into closely enough; maybe I missed something), only ones for the Z80 were 
> cycle-accurate, and I believe the MAME project has had cycle-accurate 
> emulation as well, in both cases down to the system level and out of 
> necessity, as the software they were written for was often unforgiving 
> about any discrepancy with respect to the original hardware.

I know of a cycle-accurate CDC 6000 simulator, but I think that was a one-man 
project that was never released.
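
To make the distinction concrete, the difference comes down to what the 
emulator's main loop charges per instruction.  A rough sketch in C (purely 
illustrative; the table, the opcode width, and the function names are made 
up, and none of this is taken from SIMH, MAME, or any simulator mentioned 
here):

#include <stdint.h>

/* Hypothetical per-opcode cycle counts for the cycle-accurate case; a
   real table would also have to account for addressing modes.  All
   zeros here -- purely a placeholder.  */
static const uint32_t cycle_table[256] = { 0 };

/* Run NINSNS instructions and return the simulated time charged.  */
uint64_t
run (int cycle_accurate, uint64_t ninsns,
     uint8_t (*fetch) (void), void (*execute) (uint8_t))
{
  uint64_t t = 0;
  while (ninsns--)
    {
      uint8_t op = fetch ();
      execute (op);
      /* Instruction-accurate: one time unit per instruction.
         Cycle-accurate: charge this opcode's own cost.  */
      t += cycle_accurate ? cycle_table[op] : 1;
    }
  return t;
}

A genuinely cycle-accurate model also has to track machine state beyond the 
opcode (caches, memory wait states, and so on), which is where the slowdown 
Maciej mentions below comes from.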

> Commercially, MIPS Technologies used to have cycle-accurate MIPSsim, 
> actually used for hardware verification, and taking into account all the 
> implementation details such as the TLB and caches of individual CPU cores 
> supported.

There was also a simulator with similar capabilities for the SB-1 CPU core of 
the SiByte SB-1250 SoC.

> ...
>> I don't know of any cycle accurate PDP-11 emulators.  It's not even 
>> clear if it is possible to build one, given the asynchronous operation 
>> of the UNIBUS.  It certainly would be extremely difficult since even the 
>> documented timing is amazingly complex, never mind the possibility that 
>> the reality is different from what is documented.
> 
> For the purpose of compiler performance evaluation, however, I don't think 
> we need to go down as far as the external bus, so however the UNIBUS 
> performs should not really matter.  Even with modern systems, all the 
> pipeline descriptions and operation timings we have recorded within GCC 
> reflect perfect operating conditions such as hot caches, no TLB misses, no 
> branch mispredictions, to say nothing of the disruption to all that caused 
> by hardware interrupts and context switches.

True, but I was thinking of models where the UNIBUS is used for memory.  The 
real issue is that the documented timings are full of strange numbers.  There 
isn't a single timing for a given instruction, but rather a whole pile of 
numbers depending on the addressing modes, with occasional exceptions to the 
pattern (for example, some register-to-register operations are faster than 
the general operation and addressing-mode costs would suggest).  And it's 
hard to find a number that can be used as the "cycle time", where each time 
value is a small multiple of that basic number.  That's an issue both for a 
timing simulation and for the GCC instruction scheduler and instruction cost 
models -- I ended up rounding things rather drastically and trimming out some 
detail in order to keep the cost values small integers and not blow up the 
size of the scheduler state machine.
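
To illustrate the rounding problem: pick any candidate base unit and the 
documented times refuse to come out as small integer multiples of it.  A toy 
C example (the instruction times below are invented for illustration, not 
taken from any PDP-11 processor handbook):

#include <stdio.h>

/* Invented instruction times in microseconds, standing in for the kind
   of per-instruction, per-addressing-mode values the handbooks list.  */
struct insn_time { const char *name; double usec; };

static const struct insn_time times[] = {
  { "MOV R,R",    0.9 },
  { "MOV R,(R)+", 2.3 },
  { "ADD (R),R",  2.5 },
  { "MUL",        8.8 },
  { "DIV",       11.3 },
};

int
main (void)
{
  /* Round each time to a small integer multiple of an assumed base
     unit -- the drastic rounding described above.  The error column
     shows how poorly a single base fits.  */
  const double base = 1.0;      /* assumed base unit, in usec */
  for (unsigned i = 0; i < sizeof times / sizeof times[0]; i++)
    {
      int cost = (int) (times[i].usec / base + 0.5);
      printf ("%-12s %5.1f us -> cost %2d  (error %+.2f us)\n",
              times[i].name, times[i].usec, cost,
              cost * base - times[i].usec);
    }
  return 0;
}

Whatever base is chosen, some instructions round badly, and larger cost 
integers multiply the number of states the scheduler automaton has to carry.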

> So I guess with cycle-accurate PDP-11 emulation it would be sufficient if 
> relative CPU instruction execution timings were correctly reflected, such 
> as the latency of, say, MOV vs. DIV, as I am fairly sure they are not even 
> close to being equivalent.  But that does come at a cost: cycle-accurate 
> MIPSsim was much slower than its instruction-accurate counterpart, which 
> also existed.
> 
> ...
>> More interesting would be to tweak the optimizing machinery to improve 
>> parts that either have bitrotted or never actually worked.  The code 
>> generation for auto-increment etc. isn't particularly effective, and I 
>> think that's a known limitation.  Ditto indirect addressing, since few 
>> other machines have that.  (VAX does, of course; it might benefit too.)  
>> And with LRA things are more limited still; again, this seems to be known 
>> and is caused by the focus on modern machine architectures.
> 
> Correctness absolutely has to take precedence over performance, but that 
> does not mean the latter has to be completely ignored either.  And the 
> presence of tools may only help with that.  We may not have the resources 
> that commercially significant ports have, but that does not mean we should 
> decide upfront to abandon any kind of performance QA.  I think we can 
> still act professionally and try to make the quality of the code produced 
> as good as possible within our available resources.

Definitely.  For me, one complication is that the key bits of the common code 
are things I don't really understand, and there isn't much documentation, 
especially for the LRA case.  Some of what is documented apparently hasn't 
been correct for many years, and possibly was never correct.  I think some of 
the auto-increment facilities fall into that category.
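
For reference, here is the kind of code generation in question: a word-copy 
loop like the one below maps naturally onto the auto-increment modes.  The 
assembly in the comment is a hand-written illustration of the desired 
output, not actual GCC output:

/* Ideally the loop body compiles down to something like

       loop:  MOV  (R0)+,(R1)+   ; copy a word, bump both pointers
              SOB  R2,loop       ; subtract one and branch

   using the PDP-11 auto-increment addressing modes.  */
void
copy_words (short *dst, const short *src, unsigned n)
{
  while (n--)
    *dst++ = *src++;
}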

        paul
