I haven't checked what the OpenMP spec says about the default data scope, but the examples at http://kallipolis.com/openmp/2.html seem to assume default(private), while GCC's GOMP defaults to shared. In your case,

    #pragma omp parallel for shared(A, row, col)
    for (i = k+1; i<SIZE; i++) {
        for (j = k+1; j<SIZE; j++) {
            A[i][j] = A[i][j] - row[i] * col[j];
        }
    }

'#pragma omp for' makes 'i' private implicitly (it couldn't be otherwise), but 'j' is still shared, so all threads race on the same inner loop counter. I just tried your original case: not only is it slow, it also produces different results with and without OpenMP (try printing any element of 'A'). Adding 'private(j)' (or defining 'j' inside the outer loop) fixes this.

It would be nice if someone posted measurements for the fixed case. My machine has only HT, and I still see a slowdown for this example (though it runs much faster than before the fix).

-- Tomash Brechko
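
For reference, here is a minimal sketch of the fixed loop, using the "define 'j' inside the outer loop" variant. The element type (double), the value of SIZE, and the helper names update_parallel, update_serial and results_match are assumptions made for illustration; none of them come from the original program. The serial copy is there only so the parallel result can be compared against it.

```c
#include <assert.h>

#define SIZE 64  /* assumed value; the original SIZE is not shown */

/* Fixed version: 'j' is declared inside the outer loop, so every thread
   has its own copy. 'i' is private because it is the iteration variable
   of the parallelized loop. Built without -fopenmp, the pragma is
   ignored and the loop simply runs serially. */
void update_parallel(double A[SIZE][SIZE], double row[SIZE],
                     double col[SIZE], int k)
{
    int i;
    assert(k >= 0 && k < SIZE);
#pragma omp parallel for shared(A, row, col)
    for (i = k + 1; i < SIZE; i++) {
        for (int j = k + 1; j < SIZE; j++) {
            A[i][j] = A[i][j] - row[i] * col[j];
        }
    }
}

/* Plain serial reference with the same arithmetic. */
void update_serial(double A[SIZE][SIZE], double row[SIZE],
                   double col[SIZE], int k)
{
    assert(k >= 0 && k < SIZE);
    for (int i = k + 1; i < SIZE; i++) {
        for (int j = k + 1; j < SIZE; j++) {
            A[i][j] = A[i][j] - row[i] * col[j];
        }
    }
}

/* Hypothetical helper: fills two identical matrices with made-up data,
   runs both versions, and returns 1 if every element agrees. */
int results_match(int k)
{
    static double P[SIZE][SIZE], S[SIZE][SIZE];
    double row[SIZE], col[SIZE];

    for (int i = 0; i < SIZE; i++) {
        row[i] = i + 1.0;
        col[i] = 0.5 * i;
        for (int j = 0; j < SIZE; j++) {
            P[i][j] = S[i][j] = (double)(i * SIZE + j);
        }
    }

    update_parallel(P, row, col, k);
    update_serial(S, row, col, k);

    for (int i = 0; i < SIZE; i++) {
        for (int j = 0; j < SIZE; j++) {
            if (P[i][j] != S[i][j]) {
                return 0;
            }
        }
    }
    return 1;
}
```

Keeping 'j' declared at function scope and adding 'private(j)' to the pragma should be equivalent. With GCC, build with -fopenmp to actually get threads.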