On Tue, 18 Oct 2016, Bernd Schmidt wrote:
> The performance I saw was lower by a factor of 80 or so compared to their
> CUDA version, and even lower than OpenMP on the host.
The currently published OpenMP version of LULESH simply doesn't use
openmp-simd anywhere.  That alone should make it obvious that it won't be
anywhere near any reasonable CUDA implementation, and that it is also bound
to stay below host performance (see the sketches appended after my
signature).  Besides, it's common for such benchmark suites to have very
different levels of hand tuning for the native-CUDA implementation vs. the
OpenMP implementation, sometimes to the point of significant algorithmic
differences.  So you're making an invalid comparison here.

Internally at ISP RAS we used a small set of microbenchmarks implemented in
CUDA/OpenACC/OpenMP specifically to evaluate the exact same computations
expressed in terms of different APIs.  We got comparable performance from
all three.

The biggest issue shows up on short-running OpenMP target regions: the
startup cost (going through libgomp) is non-trivial.  That can be improved
with further changes in the libgomp port, notably avoiding malloc, shaving
off more code, and perhaps inlining more code (e.g. via LTO eventually).
There's also an avoidable cuMemAlloc/cuMemFree pair on the libgomp plugin
side.  For example, there's this patch on the branch:

    libgomp: avoid malloc calls in gomp_nvptx_main

    Avoid calling malloc where it's easy to use stack storage instead:
    device malloc is very slow in CUDA.  This cuts about 60-80 microseconds
    from target region entry/exit time, slimming down empty target regions
    from ~95 to ~17 microseconds (as measured on a GTX Titan).

    (An empty CUDA kernel is ~5 microseconds; all figures taken via nvprof.)

> To me this kind of performance doesn't look like something that will be
> fixed by fine-tuning; it leaves me undecided whether the chosen approach
> (what you call the fundamentals) is viable at all.

If you draw conclusions just from the performance you got on LULESH,
without looking at the benchmark's source (otherwise you would have
acknowledged the lack of openmp-simd and the significant source-level
differences between the CUDA and OpenMP implementations, such as the use of
__shared__ memory in the CUDA kernels), then I'm sorry to say, but that is
just ridiculous.  The implementation on the branch is far from ideal, but
your method of evaluation is nonsensical.

> Performance is still better than the OpenACC version of the benchmark,
> but then I think we shouldn't repeat the mistakes we made with OpenACC
> and avoid merging something until we're sure it's ready and of benefit
> to users.

Would you kindly try to keep your commentary constructive.  It's
frustrating to have to put up with hostilities such as ad hominem attacks,
ignored nvptx-backend-related questions, and so on.  How can the work
become ready if all you do is passively push back?  Please trust me, I do
have experience with GPUs and GCC.

There should be a process for getting this reviewed gradually, with
fundamental design decisions acked and patches reviewed before all the
tweaks and optimizations are in place.  If you are suggesting that the work
has to proceed on the branch without any kind of interim review, and then
be reviewed in one go once it satisfies some unspecified criterion of being
"ready and of benefit", that doesn't sound right to me.

Alexander
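
P.S. To illustrate the openmp-simd point, here is a minimal sketch (not
taken from LULESH; the array names and sizes are made up) of the kind of
offloaded loop I have in mind.  Roughly speaking, on the nvptx offloading
path the 'simd' level is what the compiler can map onto the finest-grained
GPU parallelism; the published OpenMP LULESH uses no 'simd' at all, so a
large part of the hardware sits idle no matter how much the rest is tuned.

#include <stdio.h>

#define N (1 << 20)

static double a[N], b[N], c[N];

int main (void)
{
  for (int i = 0; i < N; i++)
    {
      a[i] = i;
      b[i] = 0.5 * i;
    }

  /* With the 'simd' level present, all levels of GPU parallelism are
     available to the compiler; drop 'simd' (as the published OpenMP
     LULESH effectively does) and only the coarser levels remain.  */
#pragma omp target teams distribute parallel for simd \
    map(to: a, b) map(from: c)
  for (int i = 0; i < N; i++)
    c[i] = a[i] + b[i];

  printf ("c[12345] = %g\n", c[12345]);
  return 0;
}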
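
As for the startup-cost figures: the ~95/~17 microsecond numbers quoted
above were taken with nvprof, but a rough host-side approximation can be
had with a loop like the one below, timing an empty target region with
omp_get_wtime.  This is only a sketch, not the harness we used, and it
also counts host-side libgomp and CUDA driver overhead, so the numbers
won't match nvprof's exactly.

#include <omp.h>
#include <stdio.h>

int main (void)
{
  /* The first target region pays a one-time device initialization cost;
     run it once outside the timed loop.  */
#pragma omp target
  { }

  const int iters = 10000;
  double t0 = omp_get_wtime ();
  for (int i = 0; i < iters; i++)
    {
      /* Empty region: what remains is pure entry/exit overhead
	 (libgomp, plugin, kernel launch).  */
#pragma omp target
      { }
    }
  double t1 = omp_get_wtime ();

  printf ("average target region round trip: %.1f us\n",
	  (t1 - t0) * 1e6 / iters);
  return 0;
}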