------- Comment #31 from victork at gcc dot gnu dot org 2008-02-12 10:51 -------
> I would appreciate, however, a further explanation about this issue.
The explanation has to do with the CPU architecture and is not related to the compiler. On a cache miss, a memory load or store takes tens of CPU cycles instead of the few cycles of a cache hit.

When we run:

  time ./mvec 400000 1 29720 1000

the program performs 400000 iterations of the outer loop and 29720 iterations of the inner loop. The inner loop performs three loads and one store per iteration. Starting from the second iteration of the outer loop, all 29720 elements of the arrays pSum, pSum1 and pVec1 are resident in the cache, and from that point on every access is a cache hit. (I assume the data cache is big enough to hold all 29720*3 elements.)

Now let's look at the slow run:

  % time ./TestVec 92200 8 89720 1000

Here the program performs (89720-8) iterations of the inner loop, so for most accesses to be cache hits the cache must hold at least 89712*3 elements. Consider what happens if the cache is only half that size. After the first iteration of the outer loop completes, the cache is filled with the second half of the arrays' data. At the start of the second iteration of the outer loop, all elements of the first half have already been evicted, since most caches use an LRU policy to choose victims. The sequential scan then thrashes: each miss brings in data that evicts exactly the data the loop will need soonest, so effectively every access misses.

Considering that the PPC970 is an out-of-order, multiple-issue architecture, we can guess why the CPU has enough time to perform the arithmetic even in scalar form without adding any overhead relative to the vectorized version of the inner loop.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35117