ptrendx opened a new pull request #8294: NCCL integration
URL: https://github.com/apache/incubator-mxnet/pull/8294
 
 
   ## Description ##
   This PR adds a new KVStore type that integrates the NCCL communication library.
   
   ## Checklist ##
   ### Essentials ###
   - [x] Passed code style checking (`make lint`)
   - [x] Changes are complete (i.e. I finished coding on this PR)
   - [ ] All changes have test coverage
   - [x] For user-facing API changes, API doc string has been updated.
   - [x] To the best of my knowledge, examples are either not affected by this change or have been fixed to be compatible with it
   
   ### Changes ###
   - [x] New `nccl` KVStore type, implemented with `ncclReduce` and `ncclBcast`
   - [x] `test_nccl.py` added to `tests/python/gpu` but not enabled, since NCCL is not available in CI
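
As a rough illustration of the reduce-then-broadcast pattern that `ncclReduce` and `ncclBcast` provide, the pure-Python sketch below models a push as a reduction onto a root device and a pull as a copy of the root's value back to every device. It does not call NCCL or MXNet. `reduce_to_root` and `broadcast_from_root` are hypothetical names used only here, plain lists stand in for per-GPU buffers, and the choice of sum as the reduction operator is an assumption (the description does not say which operator is used).

```python
from typing import List


def reduce_to_root(per_device: List[List[float]]) -> List[float]:
    """Element-wise sum of all device buffers (models a sum reduction onto a root)."""
    if not per_device:
        raise ValueError("need at least one device buffer")
    length = len(per_device[0])
    if any(len(buf) != length for buf in per_device):
        raise ValueError("all device buffers must have the same length")
    return [sum(vals) for vals in zip(*per_device)]


def broadcast_from_root(root: List[float], num_devices: int) -> List[List[float]]:
    """Give every device its own copy of the root buffer (models a broadcast)."""
    return [list(root) for _ in range(num_devices)]


# Hypothetical gradients held by three devices.
grads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
reduced = reduce_to_root(grads)                   # "push": combine on the root
copies = broadcast_from_root(reduced, len(grads))  # "pull": send back to all devices
```

After the two calls, every entry of `copies` holds the same combined values, which is the state a data-parallel training step needs before the next iteration.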
   
   ## Comments ##
   - Interesting edge cases to note here:
     - The NCCL KVStore requires the same set of devices to be used for all communications (as is the case in typical data-parallel training).
     - In the NCCL KVStore, push and pull are each implemented in 2 steps: the first step launches the NCCL kernels and the second synchronizes. This enables seamless aggregation support, since several reductions can be scheduled before a single synchronization.
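
The two points above can be sketched with a toy model. This is not the code from this PR and it uses no NCCL or MXNet calls: `TwoStepStore`, `launch_push`, and `synchronize` are hypothetical names, and the "launch" step only queues work, where the real implementation would launch GPU kernels. The sketch assumes a sum reduction and shows two things: every call must use the device set fixed at creation, and several reductions can be queued before one synchronization.

```python
from typing import Dict, List, Tuple


class TwoStepStore:
    """Toy model: step 1 queues reductions, step 2 completes all of them."""

    def __init__(self, devices: Tuple[int, ...]):
        self.devices = tuple(devices)  # fixed for the lifetime of the store
        self.pending: List[Tuple[str, List[List[float]]]] = []
        self.values: Dict[str, List[float]] = {}
        self.sync_count = 0

    def launch_push(self, key: str, devices: Tuple[int, ...],
                    per_device: List[List[float]]) -> None:
        """Step 1: schedule a reduction without waiting for it."""
        if tuple(devices) != self.devices:
            raise ValueError("all communications must use the same device set")
        self.pending.append((key, per_device))

    def synchronize(self) -> None:
        """Step 2: finish every scheduled reduction in one pass."""
        for key, per_device in self.pending:
            self.values[key] = [sum(vals) for vals in zip(*per_device)]
        self.pending.clear()
        self.sync_count += 1


store = TwoStepStore(devices=(0, 1))
store.launch_push("weight", (0, 1), [[1.0, 2.0], [3.0, 4.0]])
store.launch_push("bias", (0, 1), [[0.5], [0.5]])
queued_before_sync = len(store.pending)  # both reductions are still pending
store.synchronize()                      # one synchronization covers both
```

Separating the two steps is what lets a caller batch many keys between synchronizations instead of paying a synchronization cost per key.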
   
 