### Description

Sampling from the Gamma distribution requires rejection sampling, which is how `mxnet.ndarray.random.gamma()` is implemented. However, the current implementation (https://github.com/apache/incubator-mxnet/blob/fb4f9d55382538fe688638b741830d84ae0d783e/src/operator/random/sampler.h#L182) has two main drawbacks:

1. The random numbers used in rejection sampling ( N(0,1) and U(0,1) ) are 
generated inside the kernel via the CUDA device API. In addition, although every 
batch of threads has its own RNG, samples are actually generated serially within 
each batch of threads.

2. Rejection sampling is implemented with an unbounded while loop inside the 
kernel, which can hurt GPU performance: threads that have already accepted a 
sample must wait for neighbors in the same warp that are still looping.
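
Schematically, the current pattern looks like this (a Python illustration of the loop structure, not the actual MXNet kernel code):

```python
import numpy as np

def sample_one_gamma(alpha, rng):
    """Illustrative per-element rejection loop (Marsaglia-Tsang, alpha >= 1).
    In the current kernel, a loop with this structure runs for every
    output element, drawing its randomness on the fly."""
    d = alpha - 1.0 / 3.0
    c = 1.0 / np.sqrt(9.0 * d)
    while True:                    # unbounded, data-dependent loop
        x = rng.standard_normal()  # N(0,1) drawn inside the loop
        u = rng.random()           # U(0,1) drawn inside the loop
        v = (1.0 + c * x) ** 3
        if v > 0 and np.log(u) < 0.5 * x * x + d * (1.0 - v + np.log(v)):
            return d * v           # accepted
```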

To address the problems above, I wrote a new version of Gamma sampling on GPU, 
inspired by this blog post: 
https://lips.cs.princeton.edu/a-parallel-gamma-sampling-implementation/


### Implementation details

My implementation differs from the current version in the following aspects: 

1. Instead of generating random numbers inside the kernel, we generate them in 
advance using the host API, which lets us fill a buffer with random samples directly.

2. Redundant samples are generated to replace the while loop. Suppose we are 
going to generate a Gamma tensor of size **(N,)**: N x (M + 1) standard normal 
N(0,1) samples and N x (M + 1) uniform U(0,1) samples are generated before 
entering the kernel, where M is a predefined constant. For each entry, we 
compute M proposed Gamma r.v.s and select the first accepted one as the 
output. The one extra sample per entry is needed when \alpha is less than one, 
where a Gamma(\alpha + 1) sample is boosted by U^(1/\alpha) (see the sketch 
after this list).

3. In case all M proposals are rejected for some entries (which are marked as 
-1), we simply refill the random buffer and perform another round of rejection 
sampling, **but only for the entries that failed the previous round**.
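
For illustration, here is a minimal NumPy sketch of the buffered scheme described in the three points above (the function name, the `M` argument, and the exact loop structure are my own; the actual implementation runs this logic in a CUDA kernel, with the buffers filled by the host API):

```python
import numpy as np

def gamma_buffered(alpha, size, M=1, rng=None):
    """Sketch of buffered rejection sampling (Marsaglia-Tsang).
    All randomness is generated up front; the 'kernel' is pure
    elementwise/masked arithmetic with no data-dependent loop bound."""
    rng = rng or np.random.default_rng()
    alpha = np.broadcast_to(np.asarray(alpha, dtype=np.float64), size)
    # Boosting trick: for alpha < 1, sample Gamma(alpha + 1) and scale
    # by U^(1/alpha) at the end; this uses the one extra uniform sample.
    boost = alpha < 1.0
    a = np.where(boost, alpha + 1.0, alpha)
    d = a - 1.0 / 3.0
    c = 1.0 / np.sqrt(9.0 * d)

    out = np.full(size, -1.0)              # -1 marks "not yet accepted"
    while np.any(out < 0):
        todo = out < 0                     # only resample failed entries
        n = int(todo.sum())
        # M normal and M uniform proposals per pending entry,
        # generated in advance (host API in the real implementation).
        x = rng.standard_normal((M, n))
        u = rng.random((M, n))
        v = (1.0 + c[todo] * x) ** 3
        v_safe = np.where(v > 0, v, 1.0)   # avoid log of non-positive v
        ok = (v > 0) & (np.log(u) <
                        0.5 * x * x + d[todo] * (1.0 - v + np.log(v_safe)))
        first = ok.argmax(axis=0)          # first accepted proposal per entry
        accepted = ok.any(axis=0)
        cand = d[todo] * v_safe[first, np.arange(n)]
        pending = out[todo]
        pending[accepted] = cand[accepted]
        out[todo] = pending
    u_extra = rng.random(size)             # the extra sample for alpha < 1
    return np.where(boost, out * u_extra ** (1.0 / alpha), out)
```

With M = 1 this matches the configuration profiled below; a larger M trades extra random-number generation for fewer resampling rounds.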

Here's part of the implementation:
https://gist.github.com/xidulu/cd9da21f2ecbccd9b784cadd67844e23

In my experiment, I set M to 1 (i.e. no redundant samples are generated), since 
the adopted method (Marsaglia and Tsang's) has a rather high acceptance rate of 
around 98%.
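
With acceptance probability p ≈ 0.98, the chance that an entry exhausts all M proposals and needs another round is (1 - p)^M, i.e. about 2% for M = 1, so the expected number of resampling rounds stays very close to one.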

The profiling results are listed below:

| Size | native NumPy (CPU) | current `ndarray` on GPU | my implementation (GPU) |
|------|--------------------|--------------------------|-------------------------|
| 10e2 | <0.1ms       | 3~5ms          | 0.5~0.7ms         |
| 10e4 | 0.76ms       | 7.6~7.8ms      | 0.72~0.76ms       |
| 10e6 | 70ms         | 12~13ms        | 3.1ms             |
| 10e8 | 7200ms       | 1600~1700ms    | 150~160ms         |


----------------------------

The new version is currently under development on the `numpy` branch. It is 
also designed to support broadcastable parameters.
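
For reference, "broadcastable parameters" means behavior like NumPy's own `gamma`, where the shape parameter broadcasts against the requested output size (NumPy shown here; the exact `numpy`-branch API may differ):

```python
import numpy as np

alpha = np.array([[0.5], [2.0], [9.0]])        # shape parameter, shape (3, 1)
s = np.random.gamma(shape=alpha, size=(3, 4))  # alpha broadcasts over columns
print(s.shape)                                 # (3, 4)
```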

