On 2/7/2014 11:09 AM, Tim Prince wrote:
On 02/07/2014 10:22 AM, Jakub Jelinek wrote:
On Thu, Feb 06, 2014 at 05:21:00PM -0500, Tim Prince wrote:
I'm seeing vectorization, but no output from
-ftree-vectorizer-verbose, and no dot product vectorization inside
omp parallel regions, with gcc, g++, or gfortran 4.9. Primary targets
are cygwin64 and linux x86_64.
I've been unable to use -O3 vectorization with gcc, although it
works with gfortran and g++, so I use gcc -O2 -ftree-vectorize
together with additional optimization flags which don't break.
Can you file a GCC bugzilla PR with minimal testcases for this (or
point us at already filed bugreports)?
The problems with gcc -O3 (when called from gfortran) have eluded
my attempts to find a minimal test case. When I run under debug, it
appears that, somewhere before the crash, some gfortran code is
overwritten with data by the gcc code, which overwhelms my debugging
skill. I can get full performance with -O2 plus a handful of
intermediate flags.
As to the non-vectorization of a dot product in an omp parallel region,
-fopt-info (which I didn't know about) does report vectorization, but
there are no packed simd instructions in the generated code for the
omp_fn. I'll file a PR on that if it still reproduces in a minimal
case.
I've made source code changes to take advantage of the new
vectorization of merge() and ?: operators; while it's useful for
-march=core-avx2, it's sometimes a loss for -msse4.1.
gcc vectorization with #pragma omp parallel for simd is reasonably
effective in my tests only on 12 or more cores.
Likewise.
Those are cases of two-level loop nests from the netlib "vector"
benchmark where only one level is vectorizable and parallelizable.
Putting the vectorizable loop on the outside lets the parallelization
scale to a large number of cores. I don't expect it to out-perform
single-thread optimized AVX vectorization until 8 or more cores are in
use, but it needs more threads than expected even relative to SSE
vectorization.
#pragma omp simd reduction(max: ) is giving correct results but poor
performance in my tests.
Likewise.
I'll file a PR on this; I didn't know whether there would be interest.
I have an Intel compiler issue closed as "will not be fixed," so the
simd reduction(max: ) isn't viable for icc in the near term.
Thanks,
With further investigation, my case with reverse_copy outside and
inner_product inside an omp parallel region works very well with
-O3 -ffast-math for the double data type. There seems to be a
performance problem with reverse_copy for the float data type, so much
so that gfortran does better with the loop reversal pushed down into
the parallel dot products. I have seen at least two cases where the new
gcc vectorization of stride -1 with vpermd is superior to other
compilers, even for the float data type.
For the cases where omp parallel for simd is set in the expectation of
gaining outer-loop parallel simd, gcc is ignoring the simd clause. So it
is understandable that a large number of cores is needed to overcome the
lack of parallel simd (other than by coding with simd intrinsics).
I'll choose an example of omp simd reduction(max: ) for a PR.
Thanks.
--
Tim Prince