ptrendx commented on issue #15928: [RFC] A faster version of Gamma sampling on GPU.
URL: https://github.com/apache/incubator-mxnet/issues/15928#issuecomment-522055756

Hi @xidulu. I did not look at the differences between the implementations of the host-side and device-side RNG APIs in MXNet, but if they are comparable in terms of performance, a possibly better approach would be something like this:

- launch only as many blocks and threads as necessary to fill the GPU, each with its own RNG
- use the following pseudocode

```
while (my_sample_id < N_samples) {
  float rng = generate_next_rng();
  bool accepted = ... // compute whether this rng value is accepted
  if (accepted) {
    // write the result
    my_sample_id = next_sample();
  }
}
```

There are 2 ways of implementing `next_sample` here: either by `atomicInc` on some global counter, or just by adding the total number of threads (so every thread processes the same number of samples). The atomic approach is potentially faster (with static assignment you could end up hitting a corner case where 1 thread still does a lot more work than the other threads), but it is nondeterministic, so I think static assignment is preferable here.
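To make the static-assignment variant concrete, here is a minimal host-side sketch in C++ that simulates the loop above with ordinary loops instead of GPU threads. Everything in it is an assumption for illustration: the name `sample_static`, the per-worker seeding `seed + tid`, the use of `std::mt19937` as a stand-in for a per-thread device RNG, and the toy acceptance test (rejection sampling of a Beta(2,2)-shaped density), which is only a placeholder for the actual Gamma acceptance test.

```cpp
#include <cassert>
#include <cstdint>
#include <random>
#include <vector>

// Host-side simulation of static assignment: worker `tid` fills output
// slots tid, tid + n_workers, tid + 2 * n_workers, ...
// Each worker owns a private generator, standing in for a per-thread GPU RNG.
std::vector<float> sample_static(int n_samples, int n_workers, std::uint32_t seed) {
    assert(n_samples >= 0 && n_workers > 0);
    std::vector<float> out(n_samples, -1.0f);
    for (int tid = 0; tid < n_workers; ++tid) {
        // Hypothetical seeding scheme: one independent stream per worker.
        std::mt19937 gen(seed + static_cast<std::uint32_t>(tid));
        std::uniform_real_distribution<float> uni(0.0f, 1.0f);
        int my_sample_id = tid;
        while (my_sample_id < n_samples) {
            float x = uni(gen);  // candidate value
            float u = uni(gen);  // uniform draw for the accept/reject decision
            // Toy acceptance test (placeholder for the Gamma test):
            // accept x with probability 4 * x * (1 - x), which is at most 1.
            bool accepted = u < 4.0f * x * (1.0f - x);
            if (accepted) {
                out[my_sample_id] = x;      // write the result
                my_sample_id += n_workers;  // next_sample(): static stride
            }
        }
    }
    return out;
}
```

Because each slot is owned by exactly one worker and each worker consumes its own RNG stream in a fixed order, calling `sample_static` twice with the same arguments returns identical vectors. With a shared atomic counter instead of the fixed stride, which worker fills which slot would depend on scheduling, which is the nondeterminism mentioned above.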