Hello Gokul,

I believe that gem5 uses the VFP instruction timing by default, hence the operation latency for VFP VADD is '4'. I think that you should change this value to '6' to reflect the latency of the SIMD VADD instruction as presented in the ARM manual.
Table 3-2 VFP instruction timing

Name  Format       Cycles  Source  Result  Writeback
VADD  .F Sd,Sn,Sm  1       -,1,1   4       4

Table 3-4 Advanced SIMD integer arithmetic instruction timing

Name  Format    Cycles  Source  Result  Writeback
VADD  Dd,Dn,Dm  1       -,2,2   3       6

The Writeback value is the number of cycles that it takes to commit a result. The Result value is the number of cycles that it takes to compute the operation result; it can be used if operand forwarding is available. It is sometimes, but not always, the same as the Writeback value.

Cheers,
Raul.

________________________________
From: gem5-users [[email protected]] on behalf of Gokul Subramanian Ravi [[email protected]]
Sent: Tuesday, June 27, 2017 5:57 PM
To: [email protected]
Subject: Re: [gem5-users] NEON ARM instruction latency / execution-unit datawidth

Hi Raul,

Thank you for the reference to the ARM manual. It gives me a better understanding of the execution latency. But I'm now more confused as to how gem5 interprets the latency.

Name  Format    Cycles  Source  Result  Writeback
VADD  Dd,Dn,Dm  1       -,2,2   3       6

The definitions are below:

Cycles: This is the number of issue cycles the particular instruction consumes, and is the absolute minimum number of cycles per instruction if no operand interlocks are present.

Source: The Source field indicates the execution cycle where the source operands must be available if the instruction is to be allowed to issue. The comma separated list matches that of the Format field, indicating the register that each value applies to.

Result: The Result field indicates the execution cycle when the result of the operation is ready. At this point, the result might be ready as source operands for consumption by other instructions using forwarding paths. However, some pairs of instructions might have to wait until the value is written back to the register file.

Writeback: The Writeback field indicates the execution cycle that the result is committed to the register file.
From this cycle, the result of the instruction is available to all other instructions as a source operand.

So should the execution latency for VADD be 1 cycle (since the actual addition is 1 cycle: Source @ 2, Result @ 3), or 4 cycles (Source @ 2, Wb @ 6), or 3 cycles (Result @ 3), or 6 cycles (Wb @ 6)? The .py file uses 4 cycles.

To my understanding, gem5 simply uses this execution latency directly, just as it takes 3 cycles for an IntMult instruction. It just considers it as a multi-cycle operation and schedules wake-up after N cycles. So all of the above information seems to be lost.

I noticed a similar question from you earlier. Any thoughts?

Thanks,
Gokul Subramanian Ravi,
Graduate Student, ECE Dept.,
University of Wisconsin-Madison

-------------------------------------------------------------------------------------
Message: 4
Date: Tue, 27 Jun 2017 13:54:06 +0000
From: Raul Garcia <[email protected]>
To: gem5 users mailing list <[email protected]>
Subject: Re: [gem5-users] NEON ARM instruction latency / execution-unit datawidth
Message-ID: <[email protected]>
Content-Type: text/plain; charset="iso-8859-1"

Hello Gokul,

Have you read this ARM manual?: Cortex-A9 NEON Media Processing Engine

In particular, Section 3.4 "Instruction-specific scheduling" provides a detailed explanation of NEON instruction timing. E.g.:

Name  Format    Cycles  Source  Result  Writeback
VADD  Dd,Dn,Dm  1       -,2,2   3       6

I hope this helps you,
Raul.

________________________________
From: gem5-users [[email protected]] on behalf of Gokul Subramanian Ravi [[email protected]]
Sent: Tuesday, June 27, 2017 1:21 AM
To: [email protected]
Subject: [gem5-users] NEON ARM instruction latency / execution-unit datawidth

Hello all,

This is not strictly a gem5 question, and is more to do with micro-architecture design itself... but I thought it might be appropriate to ask here since others might have looked into the same before.

I am trying to understand the execution latency of ARM NEON instructions.
For example, the VADD instruction (as per the gem5 config file) has a 4-cycle execution latency. How is this 4-cycle latency divided in terms of actual execution? As in, how many of these 4 cycles go into pre-processing the data before the addition actually occurs? If there is no pre-processing, why is this a 4-cycle operation when normal Adds are single cycle?

Related to this, NEON can handle 8-bit to 64-bit data-width operations. Are higher-width operations (e.g. 64-bit) performed in some form of 8-bit atomics?

If anyone can point to open-source links that describe the NEON pipeline/execution-unit in a detailed manner, that would be great as well.

Thank you!

Best,
Gokul Subramanian Ravi,
Graduate Student, ECE Dept.,
University of Wisconsin-Madison
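
As a rough illustration of the four candidate latencies discussed in this thread, the sketch below derives them from a timing-table row. It is only a sketch: the names TimingRow and candidate_latencies are hypothetical helpers made up for this example (they are not part of gem5 or of the ARM manual), and the "source" field holds the single execution cycle taken from the Source column of the two VADD rows quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimingRow:
    """One row of a timing table; field names mirror the table columns."""
    name: str
    cycles: int     # issue cycles consumed by the instruction
    source: int     # execution cycle by which the source operands are needed
    result: int     # execution cycle when the result is ready (forwarding)
    writeback: int  # execution cycle when the result is committed


def candidate_latencies(row):
    """Return the four readings of 'execution latency' raised in the thread."""
    return {
        "result_minus_source": row.result - row.source,        # the add itself
        "writeback_minus_source": row.writeback - row.source,  # source to commit
        "result": row.result,                                  # with forwarding
        "writeback": row.writeback,                            # without forwarding
    }


# Values copied from Table 3-2 and Table 3-4 as quoted in this thread.
VFP_VADD = TimingRow("VADD .F Sd,Sn,Sm", cycles=1, source=1, result=4, writeback=4)
SIMD_VADD = TimingRow("VADD Dd,Dn,Dm", cycles=1, source=2, result=3, writeback=6)

for row in (VFP_VADD, SIMD_VADD):
    print(row.name, candidate_latencies(row))
```

For the SIMD row this gives 1, 4, 3 and 6 cycles, i.e. the four options listed in the question. Since the .py config mentioned above carries only one latency number per operation, picking among them is a modelling decision: the Writeback value (6) is the conservative choice suggested in the reply, while the Result value (3) would assume that forwarding is always available.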
