@yzhliu No. What MXNet currently does is a scheme where, yes, each thread is 
statically assigned some number of elements, but there is a separate while loop 
for each of them. The scheme I proposed has a single while loop that processes 
all elements assigned to a given thread. There is a big difference between these 
approaches due to the SIMT architecture of the GPU. Basically, you can treat a 
group of threads (called a warp, 32 threads on NVIDIA GPUs) like the lanes of a 
SIMD vector instruction on a CPU. This means that if one thread needs to perform 
some computation, all threads in the warp need to execute the same instruction 
(and possibly discard the result).
So in MXNet's current implementation, for every output element each group of 
32 threads does as many loop iterations as its slowest thread needs, because no 
thread in a warp can exit the while loop while at least one thread is still not 
finished.
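
A minimal CUDA sketch of that pattern, assuming a hypothetical rejection sampler with a placeholder `accept` test and per-thread Philox states (names are illustrative, not MXNet's actual kernel):

```cuda
#include <curand_kernel.h>

// Hypothetical placeholder for the real acceptance test of the rejection
// sampler (e.g. the accept check of a gamma sampler); not MXNet code.
__device__ bool accept(float x) { return x > 0.5f; }

// Current per-element-loop pattern: each thread owns `per_thread` output
// elements and runs a separate rejection loop for every one of them. Within a
// warp, each of those loops runs until the slowest lane accepts.
__global__ void sample_per_element_loops(float* out, int n, int per_thread,
                                         curandStatePhilox4_32_10_t* states) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  curandStatePhilox4_32_10_t state = states[tid];  // load this thread's RNG state
  for (int i = 0; i < per_thread; ++i) {
    int idx = tid * per_thread + i;
    if (idx >= n) break;
    bool accepted = false;
    float x = 0.f;
    while (!accepted) {               // one while loop per element: the whole warp
      x = curand_uniform(&state);     // keeps iterating until its slowest lane
      accepted = accept(x);           // accepts, and that happens for EVERY element
    }
    out[idx] = x;
  }
  states[tid] = state;                // persist the advanced RNG state
}
```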
In the proposed implementation there is only one while loop, and the only 
divergence between threads lies inside the `if (accepted)` part, which is cheap 
compared to generating a random number. Here every warp does a number of loop 
iterations equal to the total number of steps taken by its slowest thread across 
all of its elements (which should be fairly uniform across threads, especially 
since we are talking about RNG and not some adversarially crafted input), which 
is definitely much better than the previous "for each element, take the slowest 
thread and sum over elements".

@xidulu Which RNG is used for the host-side and the device-side API? The cuRAND 
generators should not really differ much in performance between the device-side 
and the host-side API.
There are a few advantages to the device-side (in-kernel) approach (a rough 
sketch follows the list):
 - you don't need to store and then load back the random numbers you generated 
(and in a fully optimized implementation, generating random numbers should 
actually be a bandwidth-limited operation)
 - you don't need additional storage (besides the RNG generator state, which you 
need anyway)
 - you compute only as many random numbers as you actually need
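
For illustration, a rough sketch of the two cuRAND paths; the function and buffer names are made up and error checking is omitted:

```cuda
#include <curand.h>          // host-side generator API
#include <curand_kernel.h>   // device-side (in-kernel) API

// Host-side path: generate n uniforms into a global-memory buffer up front;
// a separate kernel then has to load them back (extra storage and bandwidth).
void host_side_path(float* d_buffer, size_t n) {
  curandGenerator_t gen;
  curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_PHILOX4_32_10);
  curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
  curandGenerateUniform(gen, d_buffer, n);   // n floats written to DRAM
  curandDestroyGenerator(gen);
}

// Device-side path: each thread draws numbers on the fly from its own state,
// so nothing but the generator state touches global memory, and only as many
// numbers are generated as the computation actually consumes.
__global__ void device_side_path(float* out, int n,
                                 curandStatePhilox4_32_10_t* states) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid >= n) return;
  curandStatePhilox4_32_10_t state = states[tid];
  float x = curand_uniform(&state);          // drawn in registers, never stored
  out[tid] = x;                              // use it immediately
  states[tid] = state;
}
```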
