On 01/03/2014 10:11 PM, Jakub Jelinek wrote:
> Hi!
> On Fri, Jan 03, 2014 at 08:58:30PM +0100, Toon Moene wrote:
> > I don't doubt that would work, what I'm interested in, is (cat verintlin.f):
> Well, you need gather loads for that and there you hit PR target/59617.
I tried your patch, and the effect on the most heavily used loop in the
full routine (not the part that I quoted before):
160 DO JY = KLAT1,KLAT2
161 DO JX = KLON1,KLON2
162 IDX = KP(JX,JY)
163 IDY = KQ(JX,JY)
164 ILEV = KR(JX,JY)
...
237 + + PBETA(JX,JY,4)*( PALFA(JX,JY,1)*PARG(IDX-2,IDY+1,ILEV+1)
238 + + PALFA(JX,JY,2)*PARG(IDX-1,IDY+1,ILEV+1)
239 + + PALFA(JX,JY,3)*PARG(IDX ,IDY+1,ILEV+1)
240 + + PALFA(JX,JY,4)*PARG(IDX+1,IDY+1,ILEV+1) ) )
241 ENDDO
242 ENDDO
is (just counting assembler lines, i.e., instructions):
-Ofast -mavx2 -mfma: 627 lines in the .s file.
-Ofast -mavx2 -mfma -mavx512f: 588 lines in the .s file.
However, this routine is clearly memory bound (as vectorizing it with
the gather instruction, which is needed for the indirect addressing via
IDX = KP(JX,JY), etc., didn't bring any speed improvement).
The number of instructions accessing memory:
-Ofast -mavx2 -mfma: 364 lines in the .s file.
-Ofast -mavx2 -mfma -mavx512f: 221 lines in the .s file.
So there might be a clear improvement here ...
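
For anyone without the full routine at hand, the indirect addressing
pattern described above can be reduced to a sketch like the one below.
It is not taken from verintlin.f: the names gather_scale, gather_demo,
kidx, src and dst, the array sizes and the factor 2.0 are all made up
for illustration. The loop is kept in a separate subroutine so that the
compiler cannot simply fold it away at compile time.

```fortran
! Hypothetical reduced example of an indirectly addressed load.
! Nothing here comes from verintlin.f; all names and sizes are invented.
subroutine gather_scale(n, m, kidx, src, dst)
  implicit none
  integer, intent(in)  :: n, m
  integer, intent(in)  :: kidx(n)   ! indices into src, plays the role of KP
  real,    intent(in)  :: src(m)    ! indirectly addressed array, like PARG
  real,    intent(out) :: dst(n)
  integer :: j

  ! The address of src(kidx(j)) depends on a value that is itself read
  ! from memory, so vectorizing this loop requires a gather load.
  do j = 1, n
     dst(j) = 2.0 * src(kidx(j))
  end do
end subroutine gather_scale

program gather_demo
  implicit none
  integer, parameter :: n = 8, m = 16
  integer :: kidx(n)
  real    :: src(m), dst(n)
  integer :: j

  ! Fill the source array with recognizable values.
  do j = 1, m
     src(j) = real(j)
  end do

  ! Build a non-contiguous index pattern that stays within 1..m.
  do j = 1, n
     kidx(j) = m + 1 - 2*j
  end do

  call gather_scale(n, m, kidx, src, dst)
  print *, dst
end program gather_demo
```

Saved as free-form source (for example gather_demo.f90, a name chosen
here only for illustration), this can be compiled with the same options
as above plus -S to look at the generated assembler. Whether the
vectorizer actually emits gather instructions for it will depend on the
compiler version and its cost model.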
Thanks !
--
Toon Moene - e-mail: t...@moene.org - phone: +31 346 214290
Saturnushof 14, 3738 XG Maartensdijk, The Netherlands
At home: http://moene.org/~toon/; weather: http://moene.org/~hirlam/
Progress of GNU Fortran: http://gcc.gnu.org/wiki/GFortran#news