Hi @xidulu. I have not looked at how the host-side and device-side RNG APIs 
are implemented in MXNet, but if they are comparable in terms of performance, 
a possibly better approach would be something like this:
 - launch only as many blocks and threads as necessary to fill the GPU, with 
each thread having its own RNG
 - use the following pseudocode:
```
while(my_sample_id < N_samples) {
  float rng = generate_next_rng();
  bool accepted = ... // compute whether this rng value is accepted
  if (accepted) {
    // write the result
    my_sample_id = next_sample();
  }
}
```
There are two ways of implementing `next_sample` here: either by `atomicInc` 
on a global counter, or by adding the total number of threads to 
`my_sample_id` (so every thread produces the same number of samples, up to 
rounding). The atomic approach is potentially faster, because with static 
assignment you could hit a corner case where one thread needs many more 
rejections, and so does a lot more work, than the other threads. However, it 
is nondeterministic: which thread (and therefore which RNG stream) fills 
which sample depends on scheduling. So I think static assignment is 
preferable here.
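
To make the static assignment concrete, here is a minimal CPU-only C++ sketch 
that simulates it; it is not MXNet code and not a CUDA kernel. The names 
`fill_thread` and `sample_static`, the `seed + tid` seeding, and the 
`x < 0.5f` acceptance test are all assumptions made for illustration. On a 
GPU each simulated thread would be a real thread with its own device-side 
RNG state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Work done by one simulated "thread": fill samples tid, tid + n_threads,
// tid + 2 * n_threads, ... using an RNG owned by this thread only.
void fill_thread(std::size_t tid, std::size_t n_threads, std::uint32_t seed,
                 std::vector<float>& out) {
  assert(tid < n_threads);
  // Assumption: seeding with seed + tid is only a stand-in for proper
  // independent per-thread RNG streams.
  std::mt19937 rng(static_cast<std::uint32_t>(seed + tid));
  std::uniform_real_distribution<float> uniform(0.0f, 1.0f);

  std::size_t my_sample_id = tid;
  while (my_sample_id < out.size()) {
    float x = uniform(rng);
    bool accepted = x < 0.5f;  // hypothetical acceptance test
    if (accepted) {
      out[my_sample_id] = x;      // write the result
      my_sample_id += n_threads;  // static version of next_sample()
    }
  }
}

// Run every simulated thread in turn. Because each thread touches only its
// own RNG and its own output slots, the order of execution does not matter.
std::vector<float> sample_static(std::size_t n_samples, std::size_t n_threads,
                                 std::uint32_t seed) {
  std::vector<float> out(n_samples);
  for (std::size_t tid = 0; tid < n_threads; ++tid) {
    fill_thread(tid, n_threads, seed, out);
  }
  return out;
}
```

Calling `sample_static(1000, 8, 42)` twice returns the same vector, and so 
does running the `fill_thread` calls in any other order. That is the 
reproducibility the atomic counter would give up.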

-- 
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/15928#issuecomment-522055756
