On Tue, Jan 12, 2016 at 05:53:21AM +0000, Kumar, Venkataramanan wrote:
> Hi James,
>
> > -----Original Message-----
> > From: James Greenhalgh [mailto:james.greenha...@arm.com]
> > Sent: Monday, January 11, 2016 5:24 PM
> > To: gcc-patches@gcc.gnu.org
> > Cc: n...@arm.com; marcus.shawcr...@arm.com;
> > richard.earns...@arm.com; Kumar, Venkataramanan;
> > philipp.toms...@theobroma-systems.com; pins...@gmail.com;
> > kyrylo.tkac...@arm.com; e.mene...@samsung.com
> > Subject: [Patch AArch64] Use software sqrt expansion always for
> > -mlow-precision-recip-sqrt
> >
> >
> > Hi,
> >
> > I'd like to switch the logic around in aarch64.c such that
> > -mlow-precision-recip-sqrt causes us to always emit the low-precision
> > software expansion for reciprocal square root. I have two reasons to do
> > this; first is consistency across -mcpu targets, second is enabling more
> > -mcpu targets to use the flag for peak tuning.
> >
> > I don't much like that the precision we use for
> > -mlow-precision-recip-sqrt differs between cores (and possibly compiler
> > revisions). Yes, we're under -ffast-math but I take this flag to mean
> > the user explicitly wants the low-precision expansion, and we should not
> > diverge from that based on an internal decision as to what is optimal
> > for performance in the high-precision case. I'd prefer to keep things as
> > predictable as possible, and here that means always emitting the
> > low-precision expansion when asked.
> >
> > Judging by the comments in the thread proposing the reciprocal square
> > root optimisation, this will benefit all cores currently supported by
> > GCC. To be clear, we would still not expand in the high-precision case
> > for any cores which do not explicitly ask for it. Currently that is
> > Cortex-A57 and xgene, though I will be proposing a patch to remove
> > Cortex-A57 from that list shortly.
> >
> > Which gives my second motivation for this patch.
> > -mlow-precision-recip-sqrt is intended as a tuning flag for situations
> > where performance is more important than precision, but the current
> > logic requires setting an internal flag which also changes the
> > performance characteristics where high-precision is needed. This
> > conflates two decisions the target might want to make, and reduces the
> > applicability of an option targets might want to enable for performance.
> > In particular, I'd still like to see -mlow-precision-recip-sqrt continue
> > to emit the cheaper, low-precision sequence for floats under Cortex-A57.
> >
> > Based on that reasoning, this patch makes the appropriate change to the
> > logic. I've checked with the current -mcpu values to ensure that
> > behaviour without -mlow-precision-recip-sqrt does not change, and that
> > behaviour with -mlow-precision-recip-sqrt is to emit the low precision
> > sequences.
> >
> > I've also put this through bootstrap and test on aarch64-none-linux-gnu
> > with no issues.
> >
> > OK?
> >
> > Thanks,
> > James
> >
>
> Yes I like enabling this optimization for all cpus target via
> -mlow-precision-recip-sqrt .
>
> If my understanding is correct for cortex-a57 we now need to use only
> -mlow-precision-recip-sqrt to emit software sqrt expansion?
>
> In the below code
> ---snip---
> void
> aarch64_emit_swrsqrt (rtx dst, rtx src)
> {
>   ............
>   ............
>   int iterations = double_mode ? 3 : 2;
>
>   if (flag_mrecip_low_precision_sqrt)
>     iterations--;
> ---snip---
>
> Now cortex-a57 case we will always do 2 and 1 steps for double and float
> and 3 and 2 will never be used. Should we make it 2 and 1 as default? Or
> any target still needs to use 3 and 2.

The code here should handle two cases:

  1) Normal -Ofast case -> Some targets use the estimate expansion with
     3 iterations for double, 2 for float. Other targets use the hardware
     fsqrt/fdiv instructions.

  2) -mlow-precision-recip-sqrt -> All targets use the estimate expansion
     with 2 iterations for double, 1 for float.

-mlow-precision-recip-sqrt is a specialisation to be used only when the
programmer knows the lower precision is acceptable. It should not be on by
default...

> Ps: I remember reducing iterations benefited gromacs but caused some VE in
> other FP benchmarks.

... For exactly this reason :-)

Thanks,
James
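
P.S. For anyone skimming the thread, the two cases can be modelled with a
small standalone C sketch. This is not the aarch64.c code: it only mirrors
the iteration count from the quoted snippet and pairs it with the textbook
Newton-Raphson step for 1/sqrt(x). The names rsqrt_iterations, rsqrt_step
and rsqrt_refine are hypothetical, and the starting estimate is supplied by
the caller rather than taken from any hardware estimate instruction.

```c
#include <stdbool.h>

/* Number of refinement steps applied to the initial estimate.
   Mirrors the quoted snippet: 3 for double, 2 for float, and one
   fewer in each case when the low-precision flag is set.  */
static int
rsqrt_iterations (bool double_mode, bool low_precision)
{
  int iterations = double_mode ? 3 : 2;

  if (low_precision)
    iterations--;

  return iterations;
}

/* One Newton-Raphson step for y ~= 1/sqrt(x):
   y' = y * (3 - x * y * y) / 2.  (Textbook form, assumed here.)  */
static double
rsqrt_step (double x, double y)
{
  return y * (3.0 - x * y * y) * 0.5;
}

/* Refine a caller-supplied ESTIMATE of 1/sqrt(X) using the step
   count chosen above.  Hypothetical helper for illustration only.  */
static double
rsqrt_refine (double x, double estimate, bool double_mode,
              bool low_precision)
{
  int steps = rsqrt_iterations (double_mode, low_precision);
  double y = estimate;

  for (int i = 0; i < steps; i++)
    y = rsqrt_step (x, y);

  return y;
}
```

Each step roughly squares the relative error of the estimate, so dropping
one step trades a large chunk of accuracy for a shorter sequence. That is
why the reduced count is tied to an explicit flag and not made the default.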