ptrendx commented on issue #15928: [RFC] A faster version of Gamma sampling on GPU.
URL: https://github.com/apache/incubator-mxnet/issues/15928#issuecomment-522055756
 
 
   Hi @xidulu. I have not looked at how the host-side and device-side RNG 
APIs differ in their implementation in MXNet, but if their performance is 
comparable, a possibly better approach would be something like this:
    - launch only as many blocks and threads as necessary to fill the GPU, each 
thread having its own RNG
    - use the following pseudocode
   ```
   while(my_sample_id < N_samples) {
     float rng = generate_next_rng();
     bool accepted = ... // compute whether this rng value is accepted
     if (accepted) {
       // write the result
       my_sample_id = next_sample();
     }
   }
   ```
   There are two ways to implement `next_sample` here: either with `atomicInc` 
on a global counter, or by adding the total number of threads to 
`my_sample_id` (so every thread produces the same number of samples). The 
atomic approach is potentially faster, because with static assignment you could 
hit a corner case where one thread sees many more rejections and ends up doing 
a lot more work than the other threads. However, the atomic approach is 
nondeterministic (which thread fills which sample changes from run to run), so 
I think static assignment is preferable here.
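
To make the static assignment concrete, here is a minimal host-side C++ sketch that emulates the loop above, with the GPU threads simulated one after another on the CPU. Several things in it are assumptions and not part of the proposal: the acceptance test is the Marsaglia-Tsang rejection step for Gamma(alpha, 1) with alpha >= 1, the per-thread generator is `std::mt19937` seeded with `seed + tid`, and the names `try_gamma` and `sample_gamma_static` are hypothetical. In a real kernel `tid` would be `blockIdx.x * blockDim.x + threadIdx.x` and the generator would be a device-side RNG state.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// One Marsaglia-Tsang proposal for Gamma(alpha, 1), alpha >= 1 (assumed
// acceptance test; the pseudocode leaves this step open).
// Returns true and writes the sample to *out if the proposal is accepted.
bool try_gamma(std::mt19937& rng, float alpha, float* out) {
  std::normal_distribution<float> normal(0.0f, 1.0f);
  std::uniform_real_distribution<float> uniform(0.0f, 1.0f);
  const float d = alpha - 1.0f / 3.0f;
  const float c = 1.0f / std::sqrt(9.0f * d);
  const float x = normal(rng);
  const float t = 1.0f + c * x;
  if (t <= 0.0f) return false;  // proposal outside the support: reject
  const float v = t * t * t;
  const float u = uniform(rng);
  if (std::log(u) < 0.5f * x * x + d - d * v + d * std::log(v)) {
    *out = d * v;
    return true;
  }
  return false;
}

// Emulates n_threads GPU threads, each with its own RNG. Thread tid fills
// samples tid, tid + n_threads, tid + 2 * n_threads, ... (static assignment).
std::vector<float> sample_gamma_static(std::size_t n_samples,
                                       std::size_t n_threads,
                                       float alpha, std::uint32_t seed) {
  assert(alpha >= 1.0f && n_threads > 0);
  std::vector<float> out(n_samples);
  for (std::size_t tid = 0; tid < n_threads; ++tid) {
    // Hypothetical seeding scheme: one generator per emulated thread.
    std::mt19937 rng(seed + static_cast<std::uint32_t>(tid));
    std::size_t my_sample_id = tid;
    while (my_sample_id < n_samples) {
      float value;
      if (try_gamma(rng, alpha, &value)) {
        out[my_sample_id] = value;   // write the result
        my_sample_id += n_threads;   // next_sample(): stride by thread count
      }
    }
  }
  return out;
}
```

Calling `sample_gamma_static(20000, 64, 2.0f, 1234u)` twice should return identical vectors, since each sample slot is always filled by the same emulated thread from the same RNG stream. With a shared atomic counter, the slot a thread claims would depend on scheduling, so that property would be lost.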
