------- Comment #31 from victork at gcc dot gnu dot org  2008-02-12 10:51 -------
> I would appreciate, however, a further explanation of this issue.

The explanation concerns CPU architecture and is not related to the compiler.
On a cache miss, a memory load or store takes tens of CPU cycles, versus a few
cycles on a cache hit.
When we run:
time ./mvec 400000 1 29720 1000
the program performs 400000 iterations of the outer loop and 29720 iterations
of the inner loop. The inner loop performs three loads and one store per
iteration. From the second iteration of the outer loop onward, all 29720
elements of the arrays pSum, pSum1 and pVec1 are already resident in the cache,
so from that point on every access is a cache hit. (I assume the data cache is
big enough to hold all 29720*3 elements.)

Let's look at the slow run:
% time ./TestVec 92200 8 89720 1000
Here the program performs (89720-8) iterations of the inner loop, so to get
cache hits most of the time the cache must hold at least 89712*3 elements.
Consider what happens if the cache is only half that size. When the first
iteration of the outer loop completes, the cache holds the second half of the
arrays' data. At the start of the second iteration of the outer loop, every
element from the first half has already been evicted, since most caches use an
LRU policy to choose eviction victims, so the accesses miss again on every
pass. Given that the PPC970 is an out-of-order, multiple-issue architecture,
we can guess why the CPU has enough time to perform the arithmetic even in
scalar form without adding any overhead relative to the vectorized version of
the inner loop: memory latency, not arithmetic, dominates the run time.



-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35117