Re: [gem5-users] NEON ARM instruction latency / execution-unit datawidth

Raul Garcia Tue, 27 Jun 2017 11:19:14 -0700

Hello Gokul,

I believe that gem5 uses the VFP instruction timing by default, hence the 
operation latency for VFP VADD is '4'. I think that you should change this 
value to '6' to reflect the latency of the SIMD VADD instruction as presented 
in the ARM manual.


Table 3-2 VFP instruction timing
Name  Format        Cycles  Source  Result  Writeback
VADD  .F Sd,Sn,Sm   1       -,1,1   4       4


Table 3-4 Advanced SIMD integer arithmetic instruction timing
Name  Format        Cycles  Source  Result  Writeback
VADD  Dd,Dn,Dm      1       -,2,2   3       6

The Writeback value is the number of cycles that it takes to commit a result 
to the register file. The Result value is the number of cycles that it takes 
to compute the result of the operation; this is the value to use if operand 
forwarding is available. It is sometimes, but not always, the same as the 
Writeback value.



Cheers,
Raul.


________________________________
From: gem5-users [[email protected]] on behalf of Gokul Subramanian 
Ravi [[email protected]]
Sent: Tuesday, June 27, 2017 5:57 PM
To: [email protected]
Subject: Re: [gem5-users] NEON ARM instruction latency / execution-unit 
datawidth


Hi Raul,


Thank you for the reference to the ARM manual. It gives me a better 
understanding of the execution latency. But I'm now more confused as to how 
gem5 interprets the latency.


Name  Format        Cycles  Source  Result  Writeback
VADD  Dd,Dn,Dm      1       -,2,2   3       6



The definitions are below:

Cycles:
This is the number of issue cycles the particular instruction consumes, and is 
the absolute minimum number of cycles per instruction if no operand interlocks 
are present.

Source:
The Source field indicates the execution cycle where the source operands must 
be available if the instruction is to be allowed to issue. The comma separated 
list matches that of the Format field, indicating the register that each value 
applies to.

Result:
The Result field indicates the execution cycle when the result of the operation 
is ready. At this point, the result might be ready as source operands for 
consumption by other instructions using forwarding paths. However, some pairs 
of instructions might have to wait until the value is written back to the 
register file.

Writeback:
The Writeback field indicates the execution cycle that the result is committed 
to the register file. From this cycle, the result of the instruction is 
available to all other instructions as a source operand.



So should the execution latency for VADD be 1 cycle (since the actual addition 
is 1 cycle: source @ 2, Result @ 3) OR should it be 4 cycles (source @ 2, Wb @ 
6) OR 3 cycles (Result @ 3) or 6 cycles (Wb @ 6)? The .py file uses 4 cycles.
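
The four candidate values in the question above all follow from the Table 3-4 
row by simple arithmetic. A minimal Python sketch that reproduces them is 
given here; the class and field names are illustrative and are not taken from 
gem5 or the ARM manual, and the sketch does not say which candidate is the 
right one to use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimingRow:
    """One row of an instruction timing table (field names are illustrative)."""
    name: str
    cycles: int     # issue cycles consumed by the instruction
    source: int     # cycle in which the source operands must be available
    result: int     # cycle in which the result is ready (forwarding)
    writeback: int  # cycle in which the result is committed to the register file


def candidate_latencies(row):
    """Return the four candidate execution latencies raised in the question."""
    return {
        "result - source": row.result - row.source,
        "writeback - source": row.writeback - row.source,
        "result": row.result,
        "writeback": row.writeback,
    }


# VADD Dd,Dn,Dm from Table 3-4: Cycles 1, Source -,2,2, Result 3, Writeback 6
vadd_simd = TimingRow("VADD Dd,Dn,Dm", cycles=1, source=2, result=3, writeback=6)
candidates = candidate_latencies(vadd_simd)
for label, value in candidates.items():
    print(f"{label}: {value}")
```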


To my understanding, gem5 simply uses this execution latency directly, just as 
it uses 3 cycles for an IntMult instruction. It treats the instruction as a 
multi-cycle operation and schedules the wake-up after N cycles, so all of the 
above information seems to be lost.
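
The single-latency behaviour described above can be illustrated with a toy 
model. This is only a sketch of "wake dependents N cycles after issue" under 
the assumption that nothing else stalls the chain; it is not gem5 code, and 
the function names are hypothetical.

```python
def wakeup_cycle(issue_cycle, op_lat):
    """Cycle at which dependents of an instruction are woken up."""
    return issue_cycle + op_lat


def schedule_chain(op_lats, start_cycle=0):
    """Issue cycles for a chain in which each instruction depends on the previous one."""
    issue_cycles = []
    cycle = start_cycle
    for lat in op_lats:
        issue_cycles.append(cycle)
        # The next instruction in the chain can issue only once this one wakes it.
        cycle = wakeup_cycle(cycle, lat)
    return issue_cycles


# Three dependent VADDs with the 4-cycle latency from the .py file
chain_4 = schedule_chain([4, 4, 4])
# The same chain with the 6-cycle Writeback value from Table 3-4
chain_6 = schedule_chain([6, 6, 6])
print(chain_4, chain_6)
```

In this model the Source, Result and Writeback columns collapse into one 
number per operation, which is why the choice of that number matters.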


I noticed a similar question from you earlier. Any thoughts?


Thanks,

Gokul Subramanian Ravi,
Graduate Student,
ECE Dept.,
University of Wisconsin-Madison

-------------------------------------------------------------------------------------

Message: 4
Date: Tue, 27 Jun 2017 13:54:06 +0000
From: Raul Garcia <[email protected]>
To: gem5 users mailing list <[email protected]>
Subject: Re: [gem5-users] NEON ARM instruction latency /
        execution-unit  datawidth
Message-ID:
        <[email protected]>
Content-Type: text/plain; charset="iso-8859-1"

Hello Gokul,

Have you read this ARM manual?: Cortex-A9 NEON Media Processing Engine

In particular Section 3.4 Instruction-specific scheduling provides a detailed 
explanation of NEON instruction timing.

E.g.:


Name  Format        Cycles  Source  Result  Writeback
VADD  Dd,Dn,Dm      1       -,2,2   3       6


I hope this helps,

Raul.

________________________________
From: gem5-users [[email protected]] on behalf of Gokul Subramanian 
Ravi [[email protected]]
Sent: Tuesday, June 27, 2017 1:21 AM
To: [email protected]
Subject: [gem5-users] NEON ARM instruction latency / execution-unit datawidth


Hello all,


This is not strictly a gem5 question and has more to do with micro-architecture 
design itself, but I thought it might be appropriate to ask here since others 
might have looked into the same thing before.


I am trying to understand the execution latency of ARM NEON instructions. For 
example, the VADD instruction (as per the gem5 config file) has a 4 cycle 
execution latency. How is this 4 cycle latency divided in terms of actual 
execution? As in, how many of these 4 cycles go into pre-processing the data 
before the addition actually occurs? If there is no pre-processing, why is this 
a 4-cycle operation when normal Adds are single cycle?


Related to this, NEON can handle 8-bit to 64-bit data-width operations. Are 
wider operations (e.g., 64-bit) performed in some form of 8-bit atomics?

If anyone can point to open-source links that describe the NEON 
pipeline/execution unit in detail, that would be great as well. 
Thank you!


Best,

Gokul Subramanian Ravi,
Graduate Student,
ECE Dept.,
University of Wisconsin-Madison