Hi Ian, Erik, Eloisa,

> I attach a very brief report of some results I obtained in 2015 after
> attending a KNC workshop.
>
>> Conclusions: By using 244 threads, with the domain split into tiles of size
>> 8 × 4 × 4 points, and OpenMP threads assigned one per tile as they become
>> available, the MIC was able to outperform the single CPU by a factor of 1.5.
>> The same tiling strategy was used on the CPU, as it has been found to give
>> good performance there in the past. Since we have not yet optimised the code
>> for the MIC architecture, we believe that further speed improvements will be
>> possible, and that solving the Einstein equations on the MIC architecture
>> should be feasible.
>
> Eloisa, are you using LoopControl? There are tiling parameters which can
> also help with performance on these devices.

How does tiling work with LoopControl? Is it documented somewhere? I naively
thought that the point of tiling was to have chunks of data stored
contiguously in memory...

BTW, at the moment I am using this macro for all of my loop needs:

    #define UTILS_LOOP3(NAME,I,SI,EI,J,SJ,EJ,K,SK,EK) \
        _Pragma("omp for collapse(3)") \
        for(int I = SI; I < EI; ++I) \
        for(int J = SJ; J < EJ; ++J) \
        for(int K = SK; K < EK; ++K)

How would I convert it to something equivalent using LoopControl?

Thanks,

David

PS. Seeing that Eloisa was able to compile bbox.cc with intel-17.0.0 using
-no-vec, I made a patch that disables vectorization with pragmas inside
bbox.cc (to avoid having to compile it manually):
https://bitbucket.org/eschnett/carpet/pull-requests/16/carpetlib-fix-compilation-with-intel-1700/diff
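
To make the tiling question concrete, here is a minimal sketch in plain C with OpenMP of the strategy described in the quoted report: the index space is cut into tiles of 8 × 4 × 4 points, and tiles are handed to threads as they become available. It is only an illustration under stated assumptions and is not LoopControl's API or implementation: the function name `tiled_increment`, the helper `min_int`, the loop body, and the choice of `i` as the fastest-varying index are all hypothetical. Note that in this form of tiling only the iteration order changes; the array keeps its ordinary layout in memory.

```c
#include <assert.h>
#include <stddef.h>

/* Tile extents taken from the quoted report: 8 x 4 x 4 points. */
enum { TILE_I = 8, TILE_J = 4, TILE_K = 4 };

/* Hypothetical helper used to clip the last, possibly partial, tile. */
static int min_int(int a, int b) { return a < b ? a : b; }

/* Hypothetical example kernel: add 1.0 to every point of an
   ni x nj x nk array (assumed layout: i varies fastest),
   visiting the points tile by tile. */
void tiled_increment(double *u, int ni, int nj, int nk)
{
    assert(u != NULL && ni >= 0 && nj >= 0 && nk >= 0);

    /* Each iteration of the collapsed outer loops is one tile.
       schedule(dynamic) gives the next tile to whichever thread is free.
       Without OpenMP enabled the pragma is ignored and the loop runs serially. */
    #pragma omp parallel for collapse(3) schedule(dynamic)
    for (int k0 = 0; k0 < nk; k0 += TILE_K)
        for (int j0 = 0; j0 < nj; j0 += TILE_J)
            for (int i0 = 0; i0 < ni; i0 += TILE_I) {
                /* Clip the tile at the upper edge of the domain. */
                const int k1 = min_int(k0 + TILE_K, nk);
                const int j1 = min_int(j0 + TILE_J, nj);
                const int i1 = min_int(i0 + TILE_I, ni);

                /* Point loops inside one tile; the data are not rearranged. */
                for (int k = k0; k < k1; ++k)
                    for (int j = j0; j < j1; ++j)
                        for (int i = i0; i < i1; ++i)
                            u[i + (size_t)ni * (j + (size_t)nj * k)] += 1.0;
            }
}
```

Compared with a flat `collapse(3)` loop over points, each thread here works on a small block of neighbouring points at a time, which is the kind of access pattern the tile size is meant to control. Whether LoopControl schedules tiles in exactly this way is not something this sketch claims.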
_______________________________________________
Users mailing list
Users@einsteintoolkit.org
http://lists.einsteintoolkit.org/mailman/listinfo/users