Hello Gokul,

Have you read this ARM manual: Cortex-A9 NEON Media Processing Engine
Technical Reference Manual?

In particular, Section 3.4, "Instruction-specific scheduling", provides a
detailed explanation of NEON instruction timing.

E.g.:


Name    Format      Cycles    Source    Result    Writeback
VADD    Dd,Dn,Dm    1         -,2,2     3         6
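
As a rough sketch of how the VADD row above could be read, the snippet below
parses it into a small record and prints the differences between the columns.
It assumes that the Source, Result and Writeback columns are pipeline stage
numbers and that "-" means the operand is not read as a source; check the
manual for the exact definitions. NeonTiming and parse_row are hypothetical
names used only for illustration and are not part of gem5 or any ARM tool.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class NeonTiming:
    """One row of the timing table (field names are illustrative)."""
    name: str
    operands: str
    cycles: int
    source: Tuple[Optional[int], ...]  # one entry per operand; None for "-"
    result: int
    writeback: int


def parse_row(row: str) -> NeonTiming:
    """Split a whitespace-separated table row into its six columns."""
    name, operands, cycles, source, result, writeback = row.split()
    stages = tuple(None if s == "-" else int(s) for s in source.split(","))
    return NeonTiming(name, operands, int(cycles), stages,
                      int(result), int(writeback))


vadd = parse_row("VADD    Dd,Dn,Dm    1    -,2,2    3    6")

# Latest stage number listed for any source operand (assumed meaning).
last_source = max(s for s in vadd.source if s is not None)

print(vadd)
print("source -> result gap:", vadd.result - last_source)
print("result -> writeback gap:", vadd.writeback - vadd.result)
```

The two gaps are plain differences between the table columns, not an
authoritative latency figure.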


I hope this helps,

Raul.

________________________________
From: gem5-users [gem5-users-boun...@gem5.org] on behalf of Gokul Subramanian 
Ravi [gr...@wisc.edu]
Sent: Tuesday, June 27, 2017 1:21 AM
To: gem5-users@gem5.org
Subject: [gem5-users] NEON ARM instruction latency / execution-unit datawidth


Hello all,


This is not strictly a gem5 question; it has more to do with micro-architecture
design itself. I thought it might be appropriate to ask here since others
might have looked into the same thing before.


I am trying to understand the execution latency of ARM NEON instructions. For 
example, the VADD instruction (as per the gem5 config file) has a 4-cycle 
execution latency. How is this 4-cycle latency divided in terms of actual 
execution? That is, how many of these 4 cycles go into pre-processing the data 
before the addition actually occurs? If there is no pre-processing, why is this 
a 4-cycle operation when normal adds are single-cycle?


Related to this, NEON can handle 8-bit to 64-bit data-width operations. Are 
wider operations (e.g., 64-bit) performed as some form of 8-bit atomic operations?

If anyone can point to open-source links that describe the NEON 
pipeline/execution unit in detail, that would be great as well. 
Thank you!


Best,

Gokul Subramanian Ravi,
Graduate Student,
ECE Dept.,
University of Wisconsin-Madison
_______________________________________________
gem5-users mailing list
gem5-users@gem5.org
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users
