chrishkchris opened a new pull request #532: SINGA-487 Add the support of Python Multiprocess Module
URL: https://github.com/apache/incubator-singa/pull/532

I have added support for the Python multiprocessing module in the single-node multi-GPU scenario.

For the old MPI-based NCCL, I have also simplified and cleaned up the code, removing the nDev variable (i.e. the number of GPUs controlled by each process), which is always 1 in our case.

So, among the following autograd examples:
(i) mnist_multiprocess.py is the example using the Python multiprocessing module
(ii) mnist_dist.py is the example using MPI for multiprocessing

The results for both examples are as follows:

ubuntu@ip-172-31-38-62:~/incubator-singa/examples/autograd$ python3 mnist_multiprocess.py
Starting Epoch 0:
Training loss = 801.480042, training accuracy = 0.709101
Evaluation accuracy = 0.920436, Elapsed Time = 1.248269s
Starting Epoch 1:
Training loss = 249.743988, training accuracy = 0.916817
Evaluation accuracy = 0.956620, Elapsed Time = 1.226179s
Starting Epoch 2:
Training loss = 175.276443, training accuracy = 0.942258
Evaluation accuracy = 0.970498, Elapsed Time = 1.181269s
Starting Epoch 3:
Training loss = 144.092194, training accuracy = 0.951289
Evaluation accuracy = 0.968236, Elapsed Time = 1.168137s
Starting Epoch 4:
Training loss = 116.727524, training accuracy = 0.961221
Evaluation accuracy = 0.977282, Elapsed Time = 1.169854s
Starting Epoch 5:
Training loss = 105.698898, training accuracy = 0.964577
Evaluation accuracy = 0.979132, Elapsed Time = 1.174284s
Starting Epoch 6:
Training loss = 94.009590, training accuracy = 0.968616
Evaluation accuracy = 0.976460, Elapsed Time = 1.172847s
Starting Epoch 7:
Training loss = 87.892418, training accuracy = 0.970419
Evaluation accuracy = 0.979852, Elapsed Time = 1.172124s
Starting Epoch 8:
Training loss = 82.783676, training accuracy = 0.972306
Evaluation accuracy = 0.983141, Elapsed Time = 1.163122s
Starting Epoch 9:
Training loss = 76.629707, training accuracy = 0.974576
Evaluation accuracy = 0.978927, Elapsed Time = 1.160587s

ubuntu@ip-172-31-38-62:~/incubator-singa/examples/autograd$ /home/ubuntu/mpich-3.3/build/bin/mpiexec --hostfile host_file python3 mnist_dist.py
Starting Epoch 0:
Training loss = 792.865723, training accuracy = 0.713041
Evaluation accuracy = 0.929174, Elapsed Time = 1.262597s
Starting Epoch 1:
Training loss = 250.669327, training accuracy = 0.914931
Evaluation accuracy = 0.960218, Elapsed Time = 1.198090s
Starting Epoch 2:
Training loss = 174.226135, training accuracy = 0.941273
Evaluation accuracy = 0.966283, Elapsed Time = 1.189961s
Starting Epoch 3:
Training loss = 142.276245, training accuracy = 0.952541
Evaluation accuracy = 0.970806, Elapsed Time = 1.189858s
Starting Epoch 4:
Training loss = 121.220009, training accuracy = 0.959769
Evaluation accuracy = 0.972759, Elapsed Time = 1.190380s
Starting Epoch 5:
Training loss = 111.639114, training accuracy = 0.962423
Evaluation accuracy = 0.975946, Elapsed Time = 1.186215s
Starting Epoch 6:
Training loss = 96.729469, training accuracy = 0.967448
Evaluation accuracy = 0.982216, Elapsed Time = 1.177556s
Starting Epoch 7:
Training loss = 89.441696, training accuracy = 0.970169
Evaluation accuracy = 0.978824, Elapsed Time = 1.183380s
Starting Epoch 8:
Training loss = 79.853104, training accuracy = 0.973057
Evaluation accuracy = 0.982113, Elapsed Time = 1.181502s
Starting Epoch 9:
Training loss = 77.974480, training accuracy = 0.974259
Evaluation accuracy = 0.978516, Elapsed Time = 1.183578s
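
For readers unfamiliar with the one-process-per-GPU layout described above (each process controls exactly one GPU, so a per-process device count such as nDev is always 1), the following is a minimal sketch using only the Python standard library. It is not the code in mnist_multiprocess.py and does not call any SINGA API. The names `worker` and `launch`, the use of `CUDA_VISIBLE_DEVICES` to pin a device, and the 1000-sample sharding are all assumptions made for illustration.

```python
import multiprocessing as mp
import os


def worker(rank, world_size, result_queue):
    """Hypothetical stand-in for one training process driving one GPU."""
    # Assumption: the device is pinned through an environment variable.
    # A real script would instead pass the rank to the framework's own
    # device and communicator setup.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(rank)

    # Placeholder for the real work: take every world_size-th sample,
    # as a data-parallel trainer might, and report the shard size.
    shard = range(rank, 1000, world_size)
    result_queue.put((rank, len(shard)))


def launch(world_size, start_method=None):
    """Start one process per GPU and collect one result from each."""
    # start_method=None uses the platform default start method.
    ctx = mp.get_context(start_method)
    result_queue = ctx.Queue()
    procs = [
        ctx.Process(target=worker, args=(rank, world_size, result_queue))
        for rank in range(world_size)
    ]
    for p in procs:
        p.start()
    # Drain the queue before joining so no child blocks on a full pipe.
    results = dict(result_queue.get(timeout=60) for _ in procs)
    for p in procs:
        p.join()
    return results


if __name__ == "__main__":
    # Two processes, i.e. two GPUs on a single node in this sketch.
    print(launch(world_size=2))
```

Because every process owns a single device, the rank alone identifies the GPU, which is why a separate per-process device count carries no information in this layout.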
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

With regards,
Apache Git Services