https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120753

--- Comment #16 from Benjamin Schulz <schulz.benjamin at googlemail dot com> ---
Hm, I want to thank Andrew Pinski for pointing me to a post by Mat Colgrove
from NVIDIA,

https://forums.developer.nvidia.com/t/using-deviceptr-with-structs/136334

And to Tobias Burnus for his comment on map.


By now nvc++ from NVIDIA accepts member fields in constructs like
deviceptr(...) without problems...
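
If I read the forum post right, something along these lines is what nvc++ now
accepts (a sketch following that post, not verified here; the names and sizes
are just illustrative):

#include <openacc.h>

struct mytensor
{
    int *strides;
    int *extents;
    double *data;
};

int main()
{
    mytensor t;
    // acc_malloc returns a device address, so t.data is a device pointer
    t.data = (double*)acc_malloc(sizeof(double)*20);

    // member field used directly in deviceptr(...)
    #pragma acc parallel loop deviceptr(t.data)
    for(int i=0; i<20; i++)
    {
        t.data[i] = 0.0;
    }

    acc_free(t.data);
}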

But for creating a struct with data on the device, deviceptr is apparently not
needed. As Tobias and Mat write:

The strategy that works is apparently:

1) define the struct and declare a variable of its type, e.g. named t,
2) allocate the member fields on the device with omp_target_alloc/acc_malloc,
3) make a shallow copy of the struct with #pragma omp target enter data
map(to:t) or #pragma acc data copyin(t).

Then just use the member fields in the #pragma omp target or the #pragma acc
parallel loop ...

Because the fields are only ever accessed through pointers, it suffices to
have them allocated on the device, with the members holding pointers to that
location.

The shallow copy from copyin/map of the struct then simply uploads these
device pointers to the device, where they can be used in the loops.
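
Put together, a minimal OpenMP sketch of this recipe (using the same mytensor
struct as in the code further below) would be:

#include <omp.h>

struct mytensor
{
    int *strides;
    int *extents;
    double *data;
};

int main()
{
    int dev = omp_get_default_device();
    mytensor t;

    // 2) allocate the member fields directly on the device
    t.data    = (double*)omp_target_alloc(sizeof(double)*20, dev);
    t.strides = (int*)omp_target_alloc(sizeof(int)*2, dev);
    t.extents = (int*)omp_target_alloc(sizeof(int)*2, dev);

    // 3) shallow copy: uploads only the struct itself, i.e. the
    //    device pointers stored in its members
    #pragma omp target enter data map(to: t)

    // the device copy of t already holds device addresses,
    // so the members can be used directly
    #pragma omp target teams distribute parallel for
    for(int i=0; i<20; i++)
    {
        t.data[i] = 20.0;
    }

    #pragma omp target exit data map(delete: t)
    omp_target_free(t.data, dev);
    omp_target_free(t.strides, dev);
    omp_target_free(t.extents, dev);
}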

For a temporary data structure on the device, no separate map/copyin of the
member arrays is then necessary if we allocated them with
omp_target_alloc/acc_malloc, and no associating/linking to the host pointers
is needed (as it would be in the usual case). For temporary data on the GPU we
do not want the association with host members anyway....


It would, of course, be convenient if the entire struct were created only on
the GPU.

OpenACC has a declare clause called device_resident for this.

OpenMP does not seem to have an equivalent.

Unfortunately, neither nvc++ nor gcc then seems to accept allocating memory
for the member fields with acc_malloc...
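
What I would like to write there looks roughly like this (a sketch of the
rejected combination, not working code):

#include <openacc.h>

struct mytensor
{
    int *strides;
    int *extents;
    double *data;
};

int main()
{
    // the struct itself should live only on the device...
    mytensor t;
    #pragma acc declare device_resident(t)

    // ...but filling its members with acc_malloc'ed device memory
    // from host code is what neither nvc++ nor gcc accepts here
    t.data = (double*)acc_malloc(sizeof(double)*20);
}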

But with that trick of making the shallow copy of the struct with map and
using omp_target_alloc, I guess I can now go forward with creating more
difficult algorithms...


So thank you for your help...


But it is still a bit strange...

For code like this:

#include <omp.h>

struct mytensor
{
    int *strides;
    int *extents;
    double *data;
};

int main()
{
    mytensor t;

    // allocate all member fields directly on the device
    t.data=(double*)omp_target_alloc(sizeof(double)*20,omp_get_default_device());
    t.strides=(int*)omp_target_alloc(sizeof(int)*2,omp_get_default_device());
    t.extents=(int*)omp_target_alloc(sizeof(int)*2,omp_get_default_device());

    // t is used here without any prior map of the struct itself
    #pragma omp target teams distribute
    for(int i=1; i<20; i++)
    {
        t.data[i]=20;
    }

    omp_target_free(t.data,omp_get_default_device());
    omp_target_free(t.strides,omp_get_default_device());
    omp_target_free(t.extents,omp_get_default_device());
}

I get the following API trace:

0,333277s  1,010 μs    cuInit           0  2    1400  1400  0  OpenMP Initial Thread
0,333302s  116,313 ms  cuCtxCreate_v2   0  8    1400  1400  0  OpenMP Initial Thread
0,450258s  2,018 ms    cuLinkCreate_v2  0  24   1400  1400  0  OpenMP Initial Thread
0,461808s  1,915 ms    cuLinkComplete   0  68   1400  1400  0  OpenMP Initial Thread
0,463724s  3,028 ms    cuModuleLoadData 0  69   1400  1400  0  OpenMP Initial Thread
0,466753s  1,890 μs    cuLinkDestroy    0  70   1400  1400  0  OpenMP Initial Thread
0,46712s   12,920 μs   cuMemcpyHtoD_v2  0  82   1400  1400  0  OpenMP Initial Thread
0,467135s  99,311 μs   cuMemAlloc_v2    0  84   1400  1400  0  OpenMP Initial Thread
0,467237s  3,800 μs    cuMemAlloc_v2    0  86   1400  1400  0  OpenMP Initial Thread
0,467242s  3,170 μs    cuMemAlloc_v2    0  88   1400  1400  0  OpenMP Initial Thread
0,467248s  2,360 μs    cuMemAlloc_v2    0  90   1400  1400  0  OpenMP Initial Thread
0,467253s  7,250 μs    cuMemcpyHtoD_v2  0  93   1400  1400  0  OpenMP Initial Thread
0,467262s  78,361 μs   cuMemAlloc_v2    0  94   1400  1400  0  OpenMP Initial Thread
0,467341s  111,611 μs  cuLaunchKernel   0  95   1400  1400  0  OpenMP Initial Thread
0,467453s  8,570 μs    cuCtxSynchronize 0  96   1400  1400  0  OpenMP Initial Thread
0,467463s  19,620 μs   cuMemcpyDtoH_v2  0  99   1400  1400  0  OpenMP Initial Thread
0,467484s  4,530 μs    cuMemFree_v2     0  102  1400  1400  0  OpenMP Initial Thread
0,46749s   2,970 μs    cuMemFree_v2     0  105  1400  1400  0  OpenMP Initial Thread
0,467494s  2,650 μs    cuMemFree_v2     0  108  1400  1400  0  OpenMP Initial Thread
0,467497s  57,601 μs   cuMemFree_v2     0  111  1400  1400  0  OpenMP Initial Thread
0,467569s  89,401 μs   cuMemFree_v2     0  117  1400  1400  0  OpenMP Initial Thread
0,46766s   69,307 ms   cuCtxDestroy_v2  0  118  1400  1400  0  OpenMP Initial Thread


I issued 3 allocs and 3 frees, and in particular no device-to-host copy; I
have not even allocated anything on the host. Yet that code, compiled with
gcc, still calls cuMemcpyDtoH, which I never requested, and cuMemcpyHtoD more
than once...
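
My guess (an assumption on my part, not verified against the implementation):
since t is referenced in the target region without being mapped beforehand,
it falls under OpenMP's implicit map(tofrom: ...) rule for aggregates, which
would account for one of the host-to-device copies and for the device-to-host
copy at the end of the region. Mapping the struct explicitly, as in the
recipe above, should avoid the copy back:

    // drop-in change to the reproducer: map t one-way so the target
    // region does not fall back to an implicit map(tofrom: t)
    #pragma omp target enter data map(to: t)

    #pragma omp target teams distribute
    for(int i=1; i<20; i++)
    {
        t.data[i]=20;
    }

    // releases the device copy of t without copying it back
    #pragma omp target exit data map(delete: t)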
