Hi Kirill,

Thread hierarchy management and creation policy is a very interesting
topic for me as well. I came across that paper a couple of weeks ago.
Creating more threads at the beginning and applying a busy-waiting or
if-master scheme generally works better than dynamic parallelism
because of the overhead of DP. Moreover, the compiler may disable some
optimizations when DP is enabled. The CUDA-NP paper [1] is also
interesting with respect to managing threads, and its idea is very
close: create more threads in advance instead of using dynamic
parallelism. On the other hand, DP sometimes gives better performance,
since it allows creating a new thread hierarchy.
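
Just to illustrate what I mean by if-master/busy-waiting, here is a
minimal sketch of my own (not taken from the papers): all threads are
launched up front, only thread 0 runs the sequential region, and the
others wait at a barrier instead of being created later via DP.

// if-master sketch: pre-created threads instead of dynamic parallelism
__global__ void if_master_sketch(float *data, int n)
{
    __shared__ float scale;

    if (threadIdx.x == 0)      // "if-master": only the master runs the serial part
        scale = 2.0f;
    __syncthreads();           // the other pre-created threads wait here

    // parallel region: every pre-created thread picks up iterations
    for (int i = threadIdx.x; i < n; i += blockDim.x)
        data[i] *= scale;
}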

To clarify, I prepared 2 examples, one using dynamic parallelism and
one creating more threads in advance.
*(1st example) Dynamic parallelism gives the better result.
*(2nd example) Creating more threads in advance gives the better result.

1st example: https://github.com/grypp/gcc-proposal-omp4/tree/master/prop0
*(prop0.c) Has 4 nested loops
*(prop0.c:10) Puts a small array into shared memory
*The iteration counts of the first two loops are expressed explicitly.
Even if they only become known at run time, the PTX/SPIR can be adapted
*The sizes of the last two loops are dynamic and depend on the
induction variables of the first two loops
*(prop0.c:24 - 28) The array is accessed in a very inefficient
(non-coalesced) way
-If we put #pragma omp parallel for at (prop0.c:21)
-*It will create another kernel (prop0_dynamic.cu:34)
-*The array access pattern will change (prop0_dynamic.cu:48 - 52)

Basically, the advantages of using dynamic parallelism here are:
1- The array access pattern becomes coalesced
2- We can get rid of the 3rd and 4th for loops, since we can create as
many threads as there are iterations (a small advantage in terms of
thread divergence)
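
To show the shape of the transformation, here is a hypothetical sketch
(the kernel names and block sizes are mine, not the actual
prop0_dynamic.cu): the parent kernel covers the outer loops and
launches a child grid for the inner iterations, so consecutive child
threads touch consecutive elements and the accesses become coalesced.
It requires compute capability >= 3.5 and compiling with -rdc=true.

__global__ void child(float *a, int base, int m)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < m)
        a[base + j] *= 2.0f;   // thread j -> element base + j (coalesced)
}

__global__ void parent(float *a, int n, int m)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                 // dynamic parallelism: one child grid per outer iteration
        child<<<(m + 255) / 256, 256>>>(a, i * m, m);
}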

2nd example: https://github.com/grypp/gcc-proposal-omp4/tree/master/prop1
*Has 2 nested loops
*The innermost loop has a reduction
*I put 3 possible generated CUDA code examples:
*1 - prop1_baseline.cu : only CUDA-izes prop1.c:8 and does not take
prop1.c:12 into account
*2 - prop1_createMoreThread.cu : creates more threads for the innermost
loop, does the reduction with the extra threads, and communicates
through shared memory (see the sketch right after this list)
*3 - prop1_dynamic.cu : creates a child kernel and communicates through
global memory, but the global memory is allocated in advance at
prop1_dynamic.cu:5
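
A rough sketch of variant 2 (the names are mine, not the actual
prop1_createMoreThread.cu; it assumes blockDim.x is a power of two and
at least the inner trip count): the extra threads created for the
innermost loop do a tree reduction in shared memory, so no child kernel
and no global-memory round trip are needed.

__global__ void inner_reduce(const float *in, float *out, int m)
{
    extern __shared__ float buf[];
    int tid = threadIdx.x;

    // each extra thread loads one innermost-loop element (or 0 if out of range)
    buf[tid] = (tid < m) ? in[blockIdx.x * m + tid] : 0.0f;
    __syncthreads();

    // tree reduction through shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            buf[tid] += buf[tid + s];
        __syncthreads();
    }

    if (tid == 0)              // one result per outer iteration
        out[blockIdx.x] = buf[0];
}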

The full version of prop1 computes an n-body simulation. I benchmarked
it with my research compiler [2] and put the results here:
https://github.com/grypp/gcc-proposal-omp4/blob/master/prop1/prop1-bench.pdf
. As the figure shows, the 2nd kernel has the best performance.


When we compare these 2 examples, my rough idea is that it might be
worthwhile to implement an inspector based on compiler analysis in
order to decide whether dynamic parallelism should be used or not. That
way we could also avoid the extra slowdown caused by the compiler
disabling optimizations when DP is enabled. Besides, there are other
cases where we can take advantage of DP, such as recursive algorithms.
Moreover, streams are available even though concurrency is not
guaranteed (and they also add overhead). In addition to this, I can
work on the if-master or busy-waiting logic.

I am really willing to work on thread hierarchy management and
creation policy. If it is interesting for GCC, how can I make progress
on this topic?


By the way, I haven't worked on #pragma omp simd. It could be mapped to
warps (if there are no dependencies among loop iterations). On the
NVIDIA side, since threads in the same warp can read each other's data
with __shfl, data clauses could be used to enhance performance (not
sure).
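
A speculative sketch of what I mean (my own naming): if the simd lanes
map to a warp, each lane holds a partial value and the warp can reduce
it with __shfl_down without touching shared or global memory.

__device__ float warp_sum(float val)
{
    // each step halves the number of lanes still holding a partial sum
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);   // read the value from a higher lane
    return val;                            // lane 0 ends up with the warp total
}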

[1] - http://people.engr.ncsu.edu/hzhou/ppopp_14_1.pdf
[2] - http://link.springer.com/chapter/10.1007%2F978-3-319-11454-5_16
Güray Özen
~grypp



2015-03-20 15:47 GMT+01:00 Kirill Yukhin <kirill.yuk...@gmail.com>:
> Hello Güray,
>
> On 20 Mar 12:14, guray ozen wrote:
>> I've started to prepare my gsoc proposal for gcc's openmp for gpus.
> I think that here is wide range for exploration. As you know, OpenMP 4
> contains vectorization pragmas (`pragma omp simd') which not perfectly
> suites for GPGPU.
> Another problem is how to create threads dynamically on GPGPU. As far as
> we understand it there're two possible solutions:
>   1. Use dynamic parallelism available in recent API (launch new kernel from
>   target)
>   2. Estimate maximum thread number on host and start them all from host,
>   making unused threads busy-waiting
> There's a paper which investigates both approaches [1], [2].
>
>> However i'm little bit confused about which ideas, i mentioned last my
>> mail, should i propose or which one of them is interesting for gcc.
>> I'm willing to work on data clauses to enhance performance of shared
>> memory. Or maybe it might be interesting to work on OpenMP 4.1 draft
>> version. How do you think i should propose idea?
> We're going to work on OpenMP 4.1 offloading features.
>
> [1] - http://openmp.org/sc14/Booth-Sam-IBM.pdf
> [2] - http://dl.acm.org/citation.cfm?id=2688364
>
> --
> Thanks, K
