Hello Gokul,

Have you read this ARM manual: Cortex-A9 NEON Media Processing Engine?
In particular, Section 3.4, "Instruction-specific scheduling", provides a detailed explanation of NEON instruction timing. E.g.:

Name  Format    Cycles  Source  Result  Writeback
VADD  Dd,Dn,Dm  1       -,2,2   3       6

I hope this helps you,
Raul.

________________________________
From: gem5-users [gem5-users-boun...@gem5.org] on behalf of Gokul Subramanian Ravi [gr...@wisc.edu]
Sent: Tuesday, June 27, 2017 1:21 AM
To: gem5-users@gem5.org
Subject: [gem5-users] NEON ARM instruction latency / execution-unit datawidth

Hello all,

This is not strictly a gem5 question and has more to do with micro-architecture design itself, but I thought it might be appropriate to ask here since others might have looked into the same thing before.

I am trying to understand the execution latency of ARM NEON instructions. For example, the VADD instruction (as per the gem5 config file) has a 4-cycle execution latency. How is this 4-cycle latency divided in terms of actual execution? That is, how many of these 4 cycles go into pre-processing the data before the addition actually occurs? If there is no pre-processing, why is this a 4-cycle operation when normal adds are single-cycle?

Related to this, NEON can handle 8-bit to 64-bit data-width operations. Are higher-width operations (e.g. 64-bit) performed as some form of 8-bit atomic operations?

If anyone can point to open-source links that describe the NEON pipeline/execution unit in detail, that would be great as well.

Thank you!

Best,
Gokul Subramanian Ravi,
Graduate Student, ECE Dept.,
University of Wisconsin-Madison
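As a rough aid for reading the VADD timing row quoted above, here is a minimal Python sketch. It assumes the columns mean what I understand the manual's convention to be: Cycles is the number of issue cycles, Source is the pipeline stage in which each operand is required ("-" for an operand that is not read), Result is the stage in which the result is available for forwarding, and Writeback is the stage in which it reaches the register file. Please check that reading against Section 3.4 before relying on it. The names NeonTiming, parse_timing_row, forwarding_gap and writeback_gap are hypothetical helpers made up for this example, and the dependency model is deliberately simplified.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class NeonTiming:
    """One row of a timing table (column meanings are an assumption)."""
    name: str
    fmt: str
    cycles: int                         # issue cycles consumed
    source: Tuple[Optional[int], ...]   # stage each operand is required; None for "-"
    result: int                         # stage the result can be forwarded
    writeback: int                      # stage the result reaches the register file


def parse_timing_row(row: str) -> NeonTiming:
    """Parse a whitespace-separated row such as 'VADD Dd,Dn,Dm 1 -,2,2 3 6'."""
    name, fmt, cycles, source, result, writeback = row.split()
    stages = tuple(None if s == "-" else int(s) for s in source.split(","))
    return NeonTiming(name, fmt, int(cycles), stages, int(result), int(writeback))


def _earliest_source(t: NeonTiming) -> int:
    # Worst case: the dependent operand is the one required earliest.
    return min(s for s in t.source if s is not None)


def forwarding_gap(producer: NeonTiming, consumer: NeonTiming) -> int:
    """Issue-to-issue distance for a dependent pair, with forwarding.

    Simplifying assumption: a value forwarded in the producer's Result stage
    can be consumed in the same cycle by the consumer's Source stage. If the
    hardware needs it one cycle earlier, add one to this number.
    """
    return max(producer.cycles, producer.result - _earliest_source(consumer))


def writeback_gap(producer: NeonTiming, consumer: NeonTiming) -> int:
    """Same distance if the value could only be read after writeback."""
    return max(producer.cycles, producer.writeback - _earliest_source(consumer))


if __name__ == "__main__":
    vadd = parse_timing_row("VADD Dd,Dn,Dm 1 -,2,2 3 6")
    print(vadd)
    print("dependent VADD -> VADD, with forwarding:", forwarding_gap(vadd, vadd))
    print("dependent VADD -> VADD, via register file:", writeback_gap(vadd, vadd))
```

Under these assumptions the issue cost (Cycles) and the stage numbers are separate quantities, so a single "latency" figure in a simulator config may fold several of them together. How gem5's value maps onto these columns is not something this sketch tries to establish.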
_______________________________________________
gem5-users mailing list
gem5-users@gem5.org
http://m5sim.org/cgi-bin/mailman/listinfo/gem5-users