https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120753
--- Comment #16 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
Hm, I want to thank Andrew Pinski for pointing me to a post by Mat Colgrove from NVIDIA, https://forums.developer.nvidia.com/t/using-deviceptr-with-structs/136334 and Tobias Burnus for his comment on map.

By now, nvc++ from NVIDIA accepts member fields in constructs like deviceptr(...) without problems. But for creating a struct with data on the device, deviceptr is apparently not needed. As Tobias and Mat write, the strategy that works is apparently:

1) define the struct and declare a variable of its type, e.g. named t,
2) allocate the member fields on the device with omp_target_alloc/acc_malloc,
3) make a shallow copy of the struct with #pragma omp target enter data map(to:t) or #pragma acc data copyin(t).

Then just use the member fields in the #pragma omp target or #pragma acc parallel loop. Because the fields are only access prescriptions, it suffices to have them allocated, with the members being pointers to the device locations. The shallow copy from copyin/map of the struct then simply uploads these device pointers to the device, where they can be used in the loops. For a temporary data structure on the device, no separate map/copyin of the member arrays is necessary if we allocated them with omp_target_alloc/acc_malloc, and no associating/linking to host pointers is needed (as it would be in the usual case). For temporary data on the GPU, we do not want the association to host members anyway.

It would, of course, be convenient if the entire struct could be created only on the GPU. OpenACC has a clause called device_resident for this; OpenMP does not seem to have an equivalent. Unfortunately, neither nvc++ nor gcc then seems to accept it when I allocate memory for the member fields with acc_malloc.

But with that trick of doing the shallow copy of the struct with map and using omp_target_alloc, I guess I can now go forward with creating more difficult algorithms. So thank you for your help. But it is still a bit strange...
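For reference, here is a minimal sketch of steps 1)-3) in OpenMP form. The struct name mytensor and the array sizes are just placeholders taken from my test case; the explicit target exit data cleanup at the end is an addition of mine and not part of the recipe from the forum post. The point is that map(to:t) shallow-copies only the struct itself, i.e. the device pointers stored in its members, not the arrays they point to:

#include <omp.h>

struct mytensor {
    int *strides;
    int *extents;
    double *data;
};

int main() {
    const int dev = omp_get_default_device();

    // 1) declare a host instance of the struct
    mytensor t;

    // 2) allocate the member fields directly on the device;
    //    the host-side members now hold device pointers
    t.data    = (double*)omp_target_alloc(sizeof(double) * 20, dev);
    t.strides = (int*)omp_target_alloc(sizeof(int) * 2, dev);
    t.extents = (int*)omp_target_alloc(sizeof(int) * 2, dev);

    // 3) shallow-copy the struct; this uploads the device pointers,
    //    no member arrays are transferred and no host association is set up
    #pragma omp target enter data map(to: t)

    // the member fields can now be dereferenced inside target regions
    #pragma omp target teams distribute
    for (int i = 0; i < 20; i++)
        t.data[i] = 20.0;

    // remove the struct mapping again (added here for cleanup)
    #pragma omp target exit data map(delete: t)

    omp_target_free(t.data, dev);
    omp_target_free(t.strides, dev);
    omp_target_free(t.extents, dev);
}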
For code like this:

#include <omp.h>

struct mytensor {
    int *strides;
    int *extents;
    double *data;
};

int main() {
    mytensor t;
    t.data=(double*)omp_target_alloc(sizeof(double)*20,omp_get_default_device());
    t.strides=(int*)omp_target_alloc(sizeof(int)*2,omp_get_default_device());
    t.extents=(int*)omp_target_alloc(sizeof(int)*2,omp_get_default_device());

    #pragma omp target teams distribute
    for(int i=1; i<20; i++) {
        t.data[i]=20;
    }

    omp_target_free(t.data,omp_get_default_device());
    omp_target_free(t.strides,omp_get_default_device());
    omp_target_free(t.extents,omp_get_default_device());
}

I get the following CUDA API call trace:

0,333277s   1,010 μs   cuInit            0   2   1400 1400 0 OpenMP Initial Thread
0,333302s 116,313 ms   cuCtxCreate_v2    0   8   1400 1400 0 OpenMP Initial Thread
0,450258s   2,018 ms   cuLinkCreate_v2   0  24   1400 1400 0 OpenMP Initial Thread
0,461808s   1,915 ms   cuLinkComplete    0  68   1400 1400 0 OpenMP Initial Thread
0,463724s   3,028 ms   cuModuleLoadData  0  69   1400 1400 0 OpenMP Initial Thread
0,466753s   1,890 μs   cuLinkDestroy     0  70   1400 1400 0 OpenMP Initial Thread
0,46712s   12,920 μs   cuMemcpyHtoD_v2   0  82   1400 1400 0 OpenMP Initial Thread
0,467135s  99,311 μs   cuMemAlloc_v2     0  84   1400 1400 0 OpenMP Initial Thread
0,467237s   3,800 μs   cuMemAlloc_v2     0  86   1400 1400 0 OpenMP Initial Thread
0,467242s   3,170 μs   cuMemAlloc_v2     0  88   1400 1400 0 OpenMP Initial Thread
0,467248s   2,360 μs   cuMemAlloc_v2     0  90   1400 1400 0 OpenMP Initial Thread
0,467253s   7,250 μs   cuMemcpyHtoD_v2   0  93   1400 1400 0 OpenMP Initial Thread
0,467262s  78,361 μs   cuMemAlloc_v2     0  94   1400 1400 0 OpenMP Initial Thread
0,467341s 111,611 μs   cuLaunchKernel    0  95   1400 1400 0 OpenMP Initial Thread
0,467453s   8,570 μs   cuCtxSynchronize  0  96   1400 1400 0 OpenMP Initial Thread
0,467463s  19,620 μs   cuMemcpyDtoH_v2   0  99   1400 1400 0 OpenMP Initial Thread
0,467484s   4,530 μs   cuMemFree_v2      0 102   1400 1400 0 OpenMP Initial Thread
0,46749s    2,970 μs   cuMemFree_v2      0 105   1400 1400 0 OpenMP Initial Thread
0,467494s   2,650 μs   cuMemFree_v2      0 108   1400 1400 0 OpenMP Initial Thread
0,467497s  57,601 μs   cuMemFree_v2      0 111   1400 1400 0 OpenMP Initial Thread
0,467569s  89,401 μs   cuMemFree_v2      0 117   1400 1400 0 OpenMP Initial Thread
0,46766s   69,307 ms   cuCtxDestroy_v2   0 118   1400 1400 0 OpenMP Initial Thread

I issued 3 allocs and 3 frees, and in particular no device-to-host copy. I have not even allocated anything on the host. Yet this code, compiled with gcc, still calls cuMemcpyDtoH, which I never requested, and issues cuMemcpyHtoD more often than expected...