Hi Kirill,

Thread hierarchy management and creation policy is a very interesting topic for me as well. I came across that paper a couple of weeks ago. Creating more threads up front and applying a busy-waiting or if-master scheme generally works better than dynamic parallelism, because of the overhead of device-side kernel launches. Moreover, the compiler may disable some optimizations when dynamic parallelism is enabled. The CUDA-NP paper [1] is also interesting with respect to thread management, and its idea is very close: create more threads in advance instead of using dynamic parallelism. On the other hand, dynamic parallelism sometimes performs better, since it lets us create a new thread hierarchy.
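To make the if-master idea concrete, here is a minimal hand-written sketch (the kernel and variable names are mine, not from the papers): the maximum number of threads is launched from the host, the master thread of each block runs the sequential part alone while the others wait at a barrier, and then every thread joins in for the parallel inner loop.

```cuda
// Sketch of the "if-master" scheme: no child kernel is launched from the
// device; instead all threads already exist and the non-master threads
// simply wait at the barrier during the sequential region.
__global__ void if_master_kernel(const float *bias, float *out, int inner_n)
{
    __shared__ float b;

    // Sequential region: only the master thread of the block executes it.
    if (threadIdx.x == 0)
        b = bias[blockIdx.x];
    __syncthreads();              // the remaining threads wait here

    // Parallel region: all threads participate, strided over the iterations.
    for (int i = threadIdx.x; i < inner_n; i += blockDim.x)
        out[blockIdx.x * inner_n + i] += b;
}
```

The host launches this once with the maximum block size estimated up front, instead of launching a child grid per outer iteration.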
To clarify, I prepared two examples comparing dynamic parallelism with creating more threads in advance:
* (1st example) dynamic parallelism gives the better result.
* (2nd example) creating more threads in advance gives the better result.

1st example: https://github.com/grypp/gcc-proposal-omp4/tree/master/prop0
* (prop0.c) has 4 nested loops.
* (prop0.c:10) puts a small array into shared memory.
* The iteration counts of the first two loops are expressed explicitly; even if they only become known at runtime, the PTX/SPIR can be changed.
* The sizes of the last two loops are dynamic and depend on the induction variables of the first two loops.
* (prop0.c:24-28) the arrays are accessed in a very inefficient (non-coalesced) way.
- If we put "#pragma omp parallel for" at (prop0.c:21):
-* it creates another kernel (prop0_dynamic.cu:34);
-* the array access pattern changes (prop0_dynamic.cu:48-52).
Basically, the advantages of dynamic parallelism at this point are:
1- the array accesses become coalesced;
2- we can get rid of the 3rd and 4th for loops, since we can create as many threads as there are iterations (a small advantage in terms of thread divergence).

2nd example: https://github.com/grypp/gcc-proposal-omp4/tree/master/prop1
* Has 2 nested loops.
* The innermost loop has a reduction.
* I put up 3 possible generated CUDA code variants:
* 1 - prop1_baseline.cu: only CUDA-izes prop1.c:8 and does not take prop1.c:12 into account.
* 2 - prop1_createMoreThread.cu: creates extra threads for the innermost loop, does the reduction with those threads, and communicates through shared memory.
* 3 - prop1_dynamic.cu: creates a child kernel and communicates through global memory, but the global memory is allocated in advance at prop1_dynamic.cu:5.
The full version of prop1 computes n-body. I benchmarked it with my research compiler [2] and put the results here: https://github.com/grypp/gcc-proposal-omp4/blob/master/prop1/prop1-bench.pdf . As the figure shows, the 2nd kernel has the best performance.
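The extra-threads reduction scheme of the 2nd example can be sketched roughly like this (a hand-written approximation of the idea, not the actual prop1_createMoreThread.cu):

```cuda
// Sketch of the "create more threads" reduction: one block per outer
// iteration, the extra threads cover the innermost loop, and the partial
// sums are combined through shared memory instead of a child kernel.
__global__ void outer_with_inner_reduction(const float *in, float *out,
                                           int inner_n)
{
    extern __shared__ float part[];   // blockDim.x partial sums
    int tid = threadIdx.x;

    // Each extra thread accumulates a strided slice of the innermost loop.
    float sum = 0.0f;
    for (int j = tid; j < inner_n; j += blockDim.x)
        sum += in[blockIdx.x * inner_n + j];
    part[tid] = sum;
    __syncthreads();

    // Tree reduction in shared memory (blockDim.x assumed a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            part[tid] += part[tid + s];
        __syncthreads();
    }
    if (tid == 0)
        out[blockIdx.x] = part[0];    // result of this outer iteration
}
```

The point is the communication path: shared memory here, versus pre-allocated global memory between parent and child kernels in the dynamic-parallelism variant.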
Comparing these two examples, my rough idea is that it might be worthwhile to implement an inspector, based on compiler analysis, that decides whether dynamic parallelism should be used or not. That way it would also be possible to avoid the extra slowdown caused by the compiler disabling optimizations when dynamic parallelism is enabled. Besides, there are other cases where we can take advantage of dynamic parallelism, such as recursive algorithms. Moreover, streams can be used, although they do not guarantee concurrency (and they also add overhead). In addition, I can work on the if-master or busy-waiting logic. I am really willing to work on thread hierarchy management and creation policy; if this is interesting for GCC, how can I make progress on this topic?

By the way, I haven't worked on #pragma omp simd. It could be mapped to warps (if there are no dependences among loop iterations). On the NVIDIA side, since threads in the same warp can read each other's data with __shfl, the data clauses could be used to enhance performance (not sure).

[1] - http://people.engr.ncsu.edu/hzhou/ppopp_14_1.pdf
[2] - http://link.springer.com/chapter/10.1007%2F978-3-319-11454-5_16

Güray Özen
~grypp

2015-03-20 15:47 GMT+01:00 Kirill Yukhin <kirill.yuk...@gmail.com>:
> Hello Güray,
>
> On 20 Mar 12:14, guray ozen wrote:
>> I've started to prepare my gsoc proposal for gcc's openmp for gpus.
> I think that here is wide range for exploration. As you know, OpenMP 4
> contains vectorization pragmas (`pragma omp simd') which not perfectly
> suites for GPGPU.
> Another problem is how to create threads dynamically on GPGPU. As far as
> we understand it there're two possible solutions:
> 1. Use dynamic parallelism available in recent API (launch new kernel from
> target)
> 2. Estimate maximum thread number on host and start them all from host,
> making unused threads busy-waiting
> There's a paper which investigates both approaches [1], [2].
>
>> However i'm little bit confused about which ideas, i mentioned last my
>> mail, should i propose or which one of them is interesting for gcc.
>> I'm willing to work on data clauses to enhance performance of shared
>> memory. Or maybe it might be interesting to work on OpenMP 4.1 draft
>> version. How do you think i should propose idea?
> We're going to work on OpenMP 4.1 offloading features.
>
> [1] - http://openmp.org/sc14/Booth-Sam-IBM.pdf
> [2] - http://dl.acm.org/citation.cfm?id=2688364
>
> --
> Thanks, K