https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110813
Bug ID: 110813
Summary: [OpenMP] omp_target_memcpy_rect (+ strided 'target update'): Improve GCN performance and contiguous subranges
Product: gcc
Version: 14.0
Status: UNCONFIRMED
Keywords: missed-optimization, openmp
Severity: normal
Priority: P3
Component: libgomp
Assignee: unassigned at gcc dot gnu.org
Reporter: burnus at gcc dot gnu.org
CC: jakub at gcc dot gnu.org, jules at gcc dot gnu.org
Target Milestone: ---

omp_target_memcpy_rect_worker is used by omp_target_memcpy_rect and omp_target_memcpy_rect_async.

It is also used when passing strided memory to 'target update', either on OG13 or when applying the patch
https://gcc.gnu.org/pipermail/gcc-patches/2023-July/623502.html
as can be seen on OG13:
https://github.com/gcc-mirror/gcc/blob/devel/omp/gcc-13/libgomp/target.c#L5689-L5843
(The link points to omp_target_memcpy_rect_worker; the line numbers may be off if the file has changed since the link was created.)

ISSUES:

* The current algorithm always loops until dim == 1, even if the referenced memory is contiguous.

  For _rect, that is the case if src_dim == dst_dim == volume for the inner dimensions, such as:
    volume=[V1,N2,N3], ..., dst_dimension=[D1,N2,N3], ... src_dimension=[S1,N2,N3]
  Here the inner two dimensions are contiguous, only the outermost isn't.

  Likewise for '!$omp target update to(cont_array(:,:,::2))'.

* For nvptx, a patch exists (see below) that handles _rect copying for dim=2 and dim=3 more efficiently by calling CUDA functions; for GCN, such a feature is currently missing.

EXPECTED:

* Improve performance if partially contiguous
* Improve performance on GCN

Cross ref:
- "[patch] OpenMP: Call cuMemcpy2D/cuMemcpy3D for nvptx for omp_target_memcpy_rect"
  https://gcc.gnu.org/pipermail/gcc-patches/2023-July/625465.html
  (as mentioned in that patch, cross ref to:
  - PR101581 - [OpenMP] omp_target_memcpy – support inter-device memcpy)
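
The first issue in this report (looping down to dim == 1 even when inner dimensions are contiguous) could be approached by merging fully covered trailing dimensions before the copy loop starts. The following is only a minimal host-side sketch of that idea, not the libgomp implementation: collapse_contiguous_dims and rect_copy are hypothetical names, rect_copy uses plain memcpy as a stand-in for the device copy, and overflow of the multiplied sizes is not checked. Arrays are assumed to be in C (row-major) order, as for omp_target_memcpy_rect.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not part of libgomp): merge fully covered trailing
   dimensions into the next-outer dimension.  The arrays are updated in
   place; the return value is the reduced number of dimensions.
   Assumption: the products below do not overflow size_t.  */
static int
collapse_contiguous_dims (int num_dims, size_t *volume,
                          size_t *dst_offsets, size_t *src_offsets,
                          size_t *dst_dims, size_t *src_dims)
{
  assert (num_dims >= 1);
  while (num_dims > 1)
    {
      int in = num_dims - 1;	/* Current innermost dimension.  */
      /* Fully covered: the copied extent equals both array extents and
         the copy starts at the beginning of the dimension.  */
      if (volume[in] != dst_dims[in] || volume[in] != src_dims[in]
          || dst_offsets[in] != 0 || src_offsets[in] != 0)
        break;
      size_t n = volume[in];
      /* Express the next-outer dimension in units of single elements.  */
      volume[in - 1] *= n;
      dst_offsets[in - 1] *= n;
      src_offsets[in - 1] *= n;
      dst_dims[in - 1] *= n;
      src_dims[in - 1] *= n;
      num_dims--;
    }
  return num_dims;
}

/* Hypothetical host-only reference copy that recurses until one dimension
   is left and returns the number of memcpy calls it issued.  */
static size_t
rect_copy (char *dst, const char *src, size_t elem_size, int num_dims,
           const size_t *volume, const size_t *dst_offsets,
           const size_t *src_offsets, const size_t *dst_dims,
           const size_t *src_dims)
{
  if (num_dims == 1)
    {
      memcpy (dst + dst_offsets[0] * elem_size,
              src + src_offsets[0] * elem_size, volume[0] * elem_size);
      return 1;
    }
  /* Size in bytes of one slice of the outermost dimension.  */
  size_t dst_slice = elem_size, src_slice = elem_size, calls = 0;
  for (int i = 1; i < num_dims; i++)
    {
      dst_slice *= dst_dims[i];
      src_slice *= src_dims[i];
    }
  for (size_t j = 0; j < volume[0]; j++)
    calls += rect_copy (dst + (dst_offsets[0] + j) * dst_slice,
                        src + (src_offsets[0] + j) * src_slice,
                        elem_size, num_dims - 1, volume + 1,
                        dst_offsets + 1, src_offsets + 1,
                        dst_dims + 1, src_dims + 1);
  return calls;
}
```

With volume=[2,3,4], dst_dimension=[4,3,4] and src_dimension=[2,3,4], the plain recursion issues six copies, while collapsing first leaves one dimension and a single copy of 24 elements. If only the innermost dimension is fully covered, for example volume=[2,2,4] with the same dimensions, the helper stops at two dimensions. Whether fewer, larger transfers actually help on GCN would still need to be measured.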