[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module
URL: https://github.com/apache/incubator-singa/pull/468#discussion_r320544707

## File path: src/dist/communicator.cc
## @@ -0,0 +1,143 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one

Review comment:
Sorry, I will also need to modify the CMakeLists.txt.
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module
URL: https://github.com/apache/incubator-singa/pull/468#discussion_r320542607

## File path: src/dist/communicator.cc
## @@ -0,0 +1,143 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one

Review comment:
I have moved both communicator.cc and communicator.h to the io folders.
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module
URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317937233

## File path: examples/autograd/mnist_dist.py
## @@ -0,0 +1,251 @@
+#

Review comment:
I have modified mnist_cnn.py and mnist_dist.py:
1. The model construction, data preprocessing and training code are in mnist_cnn.py.
2. mnist_dist.py imports mnist_cnn's functions and passes the distributed optimizer into train_mnist_cnn() to conduct distributed training (requires MPI); a sketch of this split appears after the log below.
3. download_mnist.py is added in the same directory and downloads the dataset before training. It is separated out from the training code so that different processes do not download the data at the same time.

Here is the log of running the code:
```
ubuntu@ip-172-31-21-218:~/incubator-singa/examples/autograd$ python3 download_mnist.py
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
ubuntu@ip-172-31-21-218:~/incubator-singa/examples/autograd$ python3 mnist_cnn.py
Starting Epoch 0: Training loss = 586.417175, training accuracy = 0.792840
Evaluation accuracy = 0.940104, Elapsed Time = 5.638494s
Starting Epoch 1: Training loss = 235.360107, training accuracy = 0.922292
Evaluation accuracy = 0.955429, Elapsed Time = 5.563161s
Starting Epoch 2: Training loss = 170.056442, training accuracy = 0.943270
Evaluation accuracy = 0.963942, Elapsed Time = 5.579273s
Starting Epoch 3: Training loss = 135.514252, training accuracy = 0.954476
Evaluation accuracy = 0.967248, Elapsed Time = 5.562721s
Starting Epoch 4: Training loss = 116.975700, training accuracy = 0.960812
Evaluation accuracy = 0.978265, Elapsed Time = 5.583826s
Starting Epoch 5: Training loss = 103.893723, training accuracy = 0.965065
Evaluation accuracy = 0.982372, Elapsed Time = 5.585272s
Starting Epoch 6: Training loss = 95.044586, training accuracy = 0.967266
Evaluation accuracy = 0.981671, Elapsed Time = 5.580424s
Starting Epoch 7: Training loss = 89.102654, training accuracy = 0.971118
Evaluation accuracy = 0.980268, Elapsed Time = 5.583646s
Starting Epoch 8: Training loss = 80.395744, training accuracy = 0.972969
Evaluation accuracy = 0.983273, Elapsed Time = 5.600029s
Starting Epoch 9: Training loss = 78.355209, training accuracy = 0.973119
Evaluation accuracy = 0.979267, Elapsed Time = 5.587740s
ubuntu@ip-172-31-21-218:~/incubator-singa/examples/autograd$ /home/ubuntu/mpich-3.3/build/bin/mpiexec --hostfile host_file python3 mnist_dist.py
Starting Epoch 0: Training loss = 781.167480, training accuracy = 0.719017
Evaluation accuracy = 0.918586, Elapsed Time = 1.255623s
Starting Epoch 1: Training loss = 259.223297, training accuracy = 0.912276
Evaluation accuracy = 0.950863, Elapsed Time = 1.216926s
Starting Epoch 2: Training loss = 179.333084, training accuracy = 0.940605
Evaluation accuracy = 0.968030, Elapsed Time = 1.206751s
Starting Epoch 3: Training loss = 137.840988, training accuracy = 0.954243
Evaluation accuracy = 0.975946, Elapsed Time = 1.202503s
Starting Epoch 4: Training loss = 119.743629, training accuracy = 0.959836
Evaluation accuracy = 0.973581, Elapsed Time = 1.208274s
Starting Epoch 5: Training loss = 102.545876, training accuracy = 0.965595
Evaluation accuracy = 0.980572, Elapsed Time = 1.205539s
Starting Epoch 6: Training loss = 93.249054, training accuracy = 0.969401
Evaluation accuracy = 0.978207, Elapsed Time = 1.203708s
Starting Epoch 7: Training loss = 84.66, training accuracy = 0.971104
Evaluation accuracy = 0.980777, Elapsed Time = 1.206410s
Starting Epoch 8: Training loss = 77.996643, training accuracy = 0.973691
Evaluation accuracy = 0.985609, Elapsed Time = 1.207295s
Starting Epoch 9: Training loss = 75.888077, training accuracy = 0.974442
Evaluation accuracy = 0.982319, Elapsed Time = 1.203693s
```
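The sketch below is a minimal illustration of the split described above: mnist_cnn.py owns the model, data and training loop, while mnist_dist.py only constructs a distributed optimizer and hands it to train_mnist_cnn(). The optimizer module and class names (singa.opt.SGD, singa.opt.DistOpt) and the keyword argument are assumptions for illustration, not the confirmed API of this PR.
```
# mnist_dist.py -- illustrative sketch only; opt.SGD / opt.DistOpt and the
# train_mnist_cnn() signature are assumed, not taken from the actual PR code.
from singa import opt

from mnist_cnn import train_mnist_cnn  # model, data and training loop live in mnist_cnn.py

if __name__ == "__main__":
    sgd = opt.SGD(lr=0.005, momentum=0.9)  # plain single-process optimizer
    dist_sgd = opt.DistOpt(sgd)            # assumed MPI-backed wrapper (one process per device)
    # train_mnist_cnn() is assumed to accept an optimizer and, when it is a
    # distributed one, use its rank/world size to partition the data.
    train_mnist_cnn(sgd=dist_sgd)
```
Run under mpiexec (as in the log above), each MPI process would build the same model and let the distributed optimizer average gradients across processes.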
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module
URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317938576

## File path: examples/autograd/mnist_dist.py
## @@ -0,0 +1,251 @@
+#

Review comment:
I have already followed up on all four comments recently added in the dist_new PR, so I will continue the work on the distributed training code without MPI.
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module
URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317889630

## File path: python/singa/autograd.py
## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters):
 class _BatchNorm2d(Operation):
-    def __init__(self, handle, name=None):
+    def __init__(self, handle, running_mean, running_var, name=None):
         super(_BatchNorm2d, self).__init__(name)
         self.handle = handle
+        self.running_mean = running_mean.data
+        self.running_var = running_var.data
-    def forward(self, x, scale, bias, running_mean, running_var):
-        self.running_mean = running_mean
-        self.running_var = running_var
+    def forward(self, x, scale, bias):
         if training:
             if isinstance(self.handle, singa.CudnnBatchNormHandle):
                 y, mean, var = singa.GpuBatchNormForwardTraining(
-                    self.handle, x, scale, bias, running_mean, running_var
+                    self.handle, x, scale, bias, self.running_mean, self.running_var

Review comment:
The following is the first few epochs of ResNet-18 training on CIFAR10 using the CPU. The CPU is slow, so I trained only a few epochs. The training loss fell from 2233.4 to 782.5:
```
ubuntu@ip-172-31-16-147:~/incubator-singa/examples/autograd$ python3 resnetcifarcpu.py
Loading data file cifar-10-batches-py/data_batch_1
Loading data file cifar-10-batches-py/data_batch_2
Loading data file cifar-10-batches-py/data_batch_3
Loading data file cifar-10-batches-py/data_batch_4
Loading data file cifar-10-batches-py/data_batch_5
Loading data file cifar-10-batches-py/test_batch
Start intialization
Epoch=0: 100%|| 1562/1562 [2:09:57<00:00, 5.03s/it]
Training loss = 2233.394769, training accuracy = 0.490297
Test accuracy = 0.636218
Epoch=1: 100%|███| 1562/1562 [2:10:00<00:00, 4.98s/it]
Training loss = 1474.432049, training accuracy = 0.33
Test accuracy = 0.678986
Epoch=2: 100%|███| 1562/1562 [2:10:11<00:00, 5.00s/it]
Training loss = 1163.035850, training accuracy = 0.741717
Test accuracy = 0.738181
Epoch=3: 100%|███| 1562/1562 [2:10:31<00:00, 5.03s/it]
Training loss = 979.977119, training accuracy = 0.782570
Test accuracy = 0.800581
Epoch=4: 100%|███| 1562/1562 [2:10:10<00:00, 4.98s/it]
Training loss = 872.811802, training accuracy = 0.806098
Test accuracy = 0.813902
Epoch=5: 100%|███| 1562/1562 [2:10:05<00:00, 4.99s/it]
Training loss = 782.525783, training accuracy = 0.826144
Test accuracy = 0.832232
```
The training loss decreases normally, so the CPU batch norm appears to be working.
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module
URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317527518

## File path: python/singa/autograd.py
## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters):
 class _BatchNorm2d(Operation):
-    def __init__(self, handle, name=None):
+    def __init__(self, handle, running_mean, running_var, name=None):
         super(_BatchNorm2d, self).__init__(name)
         self.handle = handle
+        self.running_mean = running_mean.data
+        self.running_var = running_var.data
-    def forward(self, x, scale, bias, running_mean, running_var):
-        self.running_mean = running_mean
-        self.running_var = running_var
+    def forward(self, x, scale, bias):
         if training:
             if isinstance(self.handle, singa.CudnnBatchNormHandle):
                 y, mean, var = singa.GpuBatchNormForwardTraining(
-                    self.handle, x, scale, bias, running_mean, running_var
+                    self.handle, x, scale, bias, self.running_mean, self.running_var

Review comment:
After adding two batchnorm layers to the CNN for MNIST, the CPU training looks okay; the loss fell from 608.7 to 91.5:
```
ubuntu@ip-172-31-36-94:~/incubator-singa/examples/autograd$ python3 mnist_dist_bn.py
Starting Epoch 0: Training loss = 608.656677, training accuracy = 0.785035
Evaluation accuracy = 0.900541, Elapsed Time = 130.748269s
Starting Epoch 1: Training loss = 259.606445, training accuracy = 0.911720
Evaluation accuracy = 0.951222, Elapsed Time = 129.687239s
Starting Epoch 2: Training loss = 180.270645, training accuracy = 0.938917
Evaluation accuracy = 0.965745, Elapsed Time = 129.715867s
Starting Epoch 3: Training loss = 146.975281, training accuracy = 0.950607
Evaluation accuracy = 0.961138, Elapsed Time = 129.695524s
Starting Epoch 4: Training loss = 130.942749, training accuracy = 0.955576
Evaluation accuracy = 0.966446, Elapsed Time = 129.814190s
Starting Epoch 5: Training loss = 116.057938, training accuracy = 0.960846
Evaluation accuracy = 0.964844, Elapsed Time = 129.697126s
Starting Epoch 6: Training loss = 105.867195, training accuracy = 0.963914
Evaluation accuracy = 0.973758, Elapsed Time = 129.782990s
Starting Epoch 7: Training loss = 102.414818, training accuracy = 0.965498
Evaluation accuracy = 0.973357, Elapsed Time = 129.847014s
Starting Epoch 8: Training loss = 95.194695, training accuracy = 0.968433
Evaluation accuracy = 0.973658, Elapsed Time = 129.762709s
Starting Epoch 9: Training loss = 91.524719, training accuracy = 0.969717
Evaluation accuracy = 0.975160, Elapsed Time = 129.581387s
```
I am still waiting for the ResNet-18 result on CIFAR10.
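As a rough illustration of the change described above (adding two batchnorm layers to the MNIST CNN), the sketch below shows where BatchNorm2d layers would sit between the convolutions and their activations. The layer constructors, channel sizes and helper functions are assumptions based on the autograd module, not the exact contents of mnist_dist_bn.py.
```
# Illustrative only: layer signatures and sizes are assumed, not taken from
# the actual mnist_dist_bn.py example.
from singa import autograd

class CNNWithBN:
    def __init__(self):
        self.conv1 = autograd.Conv2d(1, 32, 3, padding=1)
        self.bn1 = autograd.BatchNorm2d(32)      # first added batchnorm layer
        self.conv2 = autograd.Conv2d(32, 32, 3, padding=1)
        self.bn2 = autograd.BatchNorm2d(32)      # second added batchnorm layer
        self.pooling = autograd.MaxPool2d(2, 2, padding=0)
        self.linear = autograd.Linear(32 * 14 * 14, 10)

    def forward(self, x):
        # conv -> batchnorm -> relu for both blocks
        y = autograd.relu(self.bn1(self.conv1(x)))
        y = autograd.relu(self.bn2(self.conv2(y)))
        y = self.pooling(y)                      # 28x28 -> 14x14 feature maps
        y = autograd.flatten(y)
        return self.linear(y)
```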
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317488906 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: There is no CpuBatchNormHandle; the handle used by the CPU is BatchNormHandle, so I used BatchNormHandle instead of CudnnBatchNormHandle. Now both CPU and GPU can run, but I used "type" instead of "isinstance" because CudnnBatchNormHandle is a subclass of BatchNormHandle, so a CudnnBatchNormHandle is considered an instance of BatchNormHandle by isinstance(). Moreover, I have debugged the CPU batchnorm in two aspects: (i) the forward function of the CPU batchnorm needs the running mean and var to be initialized, otherwise it raises an error about accessing a non-initialized block when it reads them; I fixed this by initializing the mean to 0 and the var to 1. (ii) the backward function CpuBatchNormBackward does not exist, but there is another function called CpuBatchNormBackwardx (in src/model/operation/batchnorm.cc), so I use that function and provide all the necessary arguments. The program can run now, and I am doing a brief real-dataset training test on AWS (using c5.4xlarge with 16 CPU cores). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
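To make the dispatch choice above concrete, here is a minimal, self-contained Python sketch (with hypothetical stand-in classes, not SINGA's SWIG bindings) of why an exact type() check is used: because the cuDNN handle subclasses the generic handle, an isinstance() check against the base class cannot tell the CPU handle apart from the GPU one.
```
# Hypothetical stand-ins for the SWIG-generated handle classes.
class BatchNormHandle:                         # plays the role of the CPU handle
    pass

class CudnnBatchNormHandle(BatchNormHandle):   # plays the role of the GPU handle
    pass

def pick_kernel(handle):
    # Exact-type dispatch: the base-class branch is only taken for the CPU handle.
    if type(handle) is CudnnBatchNormHandle:
        return "GpuBatchNormForwardTraining"
    if type(handle) is BatchNormHandle:
        return "CpuBatchNormForwardTraining"
    raise TypeError("unknown handle")

print(pick_kernel(CudnnBatchNormHandle()))   # -> GpuBatchNormForwardTraining
print(pick_kernel(BatchNormHandle()))        # -> CpuBatchNormForwardTraining

# With isinstance(), the GPU handle also matches the base class, so a branch
# testing isinstance(handle, BatchNormHandle) first would swallow it:
assert isinstance(CudnnBatchNormHandle(), BatchNormHandle)
```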
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317527518 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Adding two batchnorm layers in the CNN for the MNIST, the cpu training is looking okay, loss reduced from 608.7 to 91.5 ``` ubuntu@ip-172-31-36-94:~/incubator-singa/examples/autograd$ python3 mnist_dist_bn.py Starting Epoch 0: Training loss = 608.656677, training accuracy = 0.785035 Evaluation accuracy = 0.900541, Elapsed Time = 130.748269s Starting Epoch 1: Training loss = 259.606445, training accuracy = 0.911720 Evaluation accuracy = 0.951222, Elapsed Time = 129.687239s Starting Epoch 2: Training loss = 180.270645, training accuracy = 0.938917 Evaluation accuracy = 0.965745, Elapsed Time = 129.715867s Starting Epoch 3: Training loss = 146.975281, training accuracy = 0.950607 Evaluation accuracy = 0.961138, Elapsed Time = 129.695524s Starting Epoch 4: Training loss = 130.942749, training accuracy = 0.955576 Evaluation accuracy = 0.966446, Elapsed Time = 129.814190s Starting Epoch 5: Training loss = 116.057938, training accuracy = 0.960846 Evaluation accuracy = 0.964844, Elapsed Time = 129.697126s Starting Epoch 6: Training loss = 105.867195, training accuracy = 0.963914 Evaluation accuracy = 0.973758, Elapsed Time = 129.782990s Starting Epoch 7: Training loss = 102.414818, training accuracy = 0.965498 Evaluation accuracy = 0.973357, Elapsed Time = 129.847014s Starting Epoch 8: Training loss = 95.194695, training accuracy = 0.968433 Evaluation accuracy = 0.973658, Elapsed Time = 129.762709s Starting Epoch 9: Training loss = 91.524719, training accuracy = 0.969717 Evaluation accuracy = 0.975160, Elapsed Time = 129.581387s ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317527518 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Adding two batchnorm layers in the CNN for the MNIST, the cpu training is looking okay, loss reduce from 608.7 to 91.5 ``` ubuntu@ip-172-31-36-94:~/incubator-singa/examples/autograd$ python3 mnist_dist_bn.py Starting Epoch 0: Training loss = 608.656677, training accuracy = 0.785035 Evaluation accuracy = 0.900541, Elapsed Time = 130.748269s Starting Epoch 1: Training loss = 259.606445, training accuracy = 0.911720 Evaluation accuracy = 0.951222, Elapsed Time = 129.687239s Starting Epoch 2: Training loss = 180.270645, training accuracy = 0.938917 Evaluation accuracy = 0.965745, Elapsed Time = 129.715867s Starting Epoch 3: Training loss = 146.975281, training accuracy = 0.950607 Evaluation accuracy = 0.961138, Elapsed Time = 129.695524s Starting Epoch 4: Training loss = 130.942749, training accuracy = 0.955576 Evaluation accuracy = 0.966446, Elapsed Time = 129.814190s Starting Epoch 5: Training loss = 116.057938, training accuracy = 0.960846 Evaluation accuracy = 0.964844, Elapsed Time = 129.697126s Starting Epoch 6: Training loss = 105.867195, training accuracy = 0.963914 Evaluation accuracy = 0.973758, Elapsed Time = 129.782990s Starting Epoch 7: Training loss = 102.414818, training accuracy = 0.965498 Evaluation accuracy = 0.973357, Elapsed Time = 129.847014s Starting Epoch 8: Training loss = 95.194695, training accuracy = 0.968433 Evaluation accuracy = 0.973658, Elapsed Time = 129.762709s Starting Epoch 9: Training loss = 91.524719, training accuracy = 0.969717 Evaluation accuracy = 0.975160, Elapsed Time = 129.581387s ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317527518 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Adding two batchnorm layers in the CNN for the MNIST, the cpu training is looking okay, ``` ubuntu@ip-172-31-36-94:~/incubator-singa/examples/autograd$ python3 mnist_dist_bn.py Starting Epoch 0: Training loss = 608.656677, training accuracy = 0.785035 Evaluation accuracy = 0.900541, Elapsed Time = 130.748269s Starting Epoch 1: Training loss = 259.606445, training accuracy = 0.911720 Evaluation accuracy = 0.951222, Elapsed Time = 129.687239s Starting Epoch 2: Training loss = 180.270645, training accuracy = 0.938917 Evaluation accuracy = 0.965745, Elapsed Time = 129.715867s Starting Epoch 3: Training loss = 146.975281, training accuracy = 0.950607 Evaluation accuracy = 0.961138, Elapsed Time = 129.695524s Starting Epoch 4: Training loss = 130.942749, training accuracy = 0.955576 Evaluation accuracy = 0.966446, Elapsed Time = 129.814190s Starting Epoch 5: Training loss = 116.057938, training accuracy = 0.960846 Evaluation accuracy = 0.964844, Elapsed Time = 129.697126s Starting Epoch 6: Training loss = 105.867195, training accuracy = 0.963914 Evaluation accuracy = 0.973758, Elapsed Time = 129.782990s Starting Epoch 7: Training loss = 102.414818, training accuracy = 0.965498 Evaluation accuracy = 0.973357, Elapsed Time = 129.847014s Starting Epoch 8: Training loss = 95.194695, training accuracy = 0.968433 Evaluation accuracy = 0.973658, Elapsed Time = 129.762709s Starting Epoch 9: Training loss = 91.524719, training accuracy = 0.969717 Evaluation accuracy = 0.975160, Elapsed Time = 129.581387s ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317527518 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Still testing with real dataset using cpu Adding two batchnorm layers in the CNN for the MNIST, the cpu training is looking okay, ``` ubuntu@ip-172-31-36-94:~/incubator-singa/examples/autograd$ python3 mnist_dist_bn.py Starting Epoch 0: Training loss = 608.656677, training accuracy = 0.785035 Evaluation accuracy = 0.900541, Elapsed Time = 130.748269s Starting Epoch 1: Training loss = 259.606445, training accuracy = 0.911720 Evaluation accuracy = 0.951222, Elapsed Time = 129.687239s Starting Epoch 2: Training loss = 180.270645, training accuracy = 0.938917 Evaluation accuracy = 0.965745, Elapsed Time = 129.715867s Starting Epoch 3: Training loss = 146.975281, training accuracy = 0.950607 Evaluation accuracy = 0.961138, Elapsed Time = 129.695524s Starting Epoch 4: Training loss = 130.942749, training accuracy = 0.955576 Evaluation accuracy = 0.966446, Elapsed Time = 129.814190s Starting Epoch 5: Training loss = 116.057938, training accuracy = 0.960846 Evaluation accuracy = 0.964844, Elapsed Time = 129.697126s Starting Epoch 6: Training loss = 105.867195, training accuracy = 0.963914 Evaluation accuracy = 0.973758, Elapsed Time = 129.782990s Starting Epoch 7: Training loss = 102.414818, training accuracy = 0.965498 Evaluation accuracy = 0.973357, Elapsed Time = 129.847014s Starting Epoch 8: Training loss = 95.194695, training accuracy = 0.968433 Evaluation accuracy = 0.973658, Elapsed Time = 129.762709s Starting Epoch 9: Training loss = 91.524719, training accuracy = 0.969717 Evaluation accuracy = 0.975160, Elapsed Time = 129.581387s ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317527518 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Still testing with real dataset using cpu Adding two batchnorm layers in the CNN for the MNIST, the cpu training is looking okay, ``` ubuntu@ip-172-31-31-187:~/incubator-singa/examples/autograd$ python3 mnist_dist_bn.py Starting Epoch 0: Training loss = 1115.447021, training accuracy = 0.592116 Evaluation accuracy = 0.649239, Elapsed Time = 91.413475s Starting Epoch 1: Training loss = 615.564209, training accuracy = 0.783968 Evaluation accuracy = 0.878105, Elapsed Time = 91.053461s Starting Epoch 2: Training loss = 444.550018, training accuracy = 0.848286 Evaluation accuracy = 0.900240, Elapsed Time = 91.064094s Starting Epoch 3: Training loss = 333.629150, training accuracy = 0.886690 Evaluation accuracy = 0.857772, Elapsed Time = 90.935190s Starting Epoch 4: Training loss = 289.389832, training accuracy = 0.902648 Evaluation accuracy = 0.913462, Elapsed Time = 91.152710s Starting Epoch 5: Training loss = 263.009583, training accuracy = 0.910836 Evaluation accuracy = 0.922877, Elapsed Time = 91.171680s Starting Epoch 6: Training loss = 238.859818, training accuracy = 0.918957 Evaluation accuracy = 0.933794, Elapsed Time = 91.016456s Starting Epoch 7: Training loss = 215.822647, training accuracy = 0.927428 Evaluation accuracy = 0.946615, Elapsed Time = 90.870825s Starting Epoch 8: Training loss = 202.828430, training accuracy = 0.932080 Evaluation accuracy = 0.948017, Elapsed Time = 91.014656s Starting Epoch 9: Training loss = 190.810226, training accuracy = 0.935899 Evaluation accuracy = 0.949820, Elapsed Time = 91.270044s ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317527518 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Still testing with real dataset using cpu Adding a batchnorm layer in the CNN for the MNIST, the cpu training is looking okay, ``` ubuntu@ip-172-31-31-187:~/incubator-singa/examples/autograd$ python3 mnist_dist_bn.py Starting Epoch 0: Training loss = 1115.447021, training accuracy = 0.592116 Evaluation accuracy = 0.649239, Elapsed Time = 91.413475s Starting Epoch 1: Training loss = 615.564209, training accuracy = 0.783968 Evaluation accuracy = 0.878105, Elapsed Time = 91.053461s Starting Epoch 2: Training loss = 444.550018, training accuracy = 0.848286 Evaluation accuracy = 0.900240, Elapsed Time = 91.064094s Starting Epoch 3: Training loss = 333.629150, training accuracy = 0.886690 Evaluation accuracy = 0.857772, Elapsed Time = 90.935190s Starting Epoch 4: Training loss = 289.389832, training accuracy = 0.902648 Evaluation accuracy = 0.913462, Elapsed Time = 91.152710s Starting Epoch 5: Training loss = 263.009583, training accuracy = 0.910836 Evaluation accuracy = 0.922877, Elapsed Time = 91.171680s Starting Epoch 6: Training loss = 238.859818, training accuracy = 0.918957 Evaluation accuracy = 0.933794, Elapsed Time = 91.016456s Starting Epoch 7: Training loss = 215.822647, training accuracy = 0.927428 Evaluation accuracy = 0.946615, Elapsed Time = 90.870825s Starting Epoch 8: Training loss = 202.828430, training accuracy = 0.932080 Evaluation accuracy = 0.948017, Elapsed Time = 91.014656s Starting Epoch 9: Training loss = 190.810226, training accuracy = 0.935899 Evaluation accuracy = 0.949820, Elapsed Time = 91.270044s ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317527518 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Still testing with real dataset using cpu Adding a batchnorm layer in the CNN for the MNIST, the training is okay, ``` ubuntu@ip-172-31-31-187:~/incubator-singa/examples/autograd$ python3 mnist_dist_bn.py Starting Epoch 0: Training loss = 1115.447021, training accuracy = 0.592116 Evaluation accuracy = 0.649239, Elapsed Time = 91.413475s Starting Epoch 1: Training loss = 615.564209, training accuracy = 0.783968 Evaluation accuracy = 0.878105, Elapsed Time = 91.053461s Starting Epoch 2: Training loss = 444.550018, training accuracy = 0.848286 Evaluation accuracy = 0.900240, Elapsed Time = 91.064094s Starting Epoch 3: Training loss = 333.629150, training accuracy = 0.886690 Evaluation accuracy = 0.857772, Elapsed Time = 90.935190s Starting Epoch 4: Training loss = 289.389832, training accuracy = 0.902648 Evaluation accuracy = 0.913462, Elapsed Time = 91.152710s Starting Epoch 5: Training loss = 263.009583, training accuracy = 0.910836 Evaluation accuracy = 0.922877, Elapsed Time = 91.171680s Starting Epoch 6: Training loss = 238.859818, training accuracy = 0.918957 Evaluation accuracy = 0.933794, Elapsed Time = 91.016456s Starting Epoch 7: Training loss = 215.822647, training accuracy = 0.927428 Evaluation accuracy = 0.946615, Elapsed Time = 90.870825s Starting Epoch 8: Training loss = 202.828430, training accuracy = 0.932080 Evaluation accuracy = 0.948017, Elapsed Time = 91.014656s Starting Epoch 9: Training loss = 190.810226, training accuracy = 0.935899 Evaluation accuracy = 0.949820, Elapsed Time = 91.270044s ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317527518 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Still testing with real dataset using cpu This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317527518 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: However, when using the CPU the cifar10 training loss does not decrease; I will need to debug this further. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317488906 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Cannot find CpuBatchNormHandle while the handle used by cpu is BatchNormHandle. So I have used BatchNormHandle instead of CudnnBatchNormHandle. Now Both CPU and GPU can run, but I used "type" instead of "isinstance" because CudnnBatchNormHandle is a subclass of BatchNormHandle, so CudnnBatchNormHandle is considered as an instance of BatchNormHandle in isinstance(). Moreover, I have debugged the cpu batchnorm in the following two aspects: (i) the forward function of cpu batchnorm needed the initialization of running mean and var, otherwise when it access the block it returns error as accessing an non-initialized block. I fixed this by initializating the mean by 0 and the var by 1. (ii) the backward function CpuBatchNormBackward does not exist, but there is another function called CpuBatchNormBackwardx (in the directory src/model/operation/batchnorm.cc). So I use this function by providing all the necessary arguments. The program can run now, but I am doing a brief cifar10 training test on AWS (using c5.x4large with 16 cpu cores). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317488906 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Cannot find CpuBatchNormHandle while the handle used by cpu is BatchNormHandle. So I have used BatchNormHandle instead of CudnnBatchNormHandle. Now Both CPU and GPU can run, but I used "type" instead of "isinstance" because CudnnBatchNormHandle is a subclass of BatchNormHandle, so CudnnBatchNormHandle is considered as an instance of BatchNormHandle in isinstance(). For more explanation, See http://www.runoob.com/python/python-func-isinstance.html Moreover, I have debugged the cpu batchnorm in the following two aspects: (i) the forward function of cpu batchnorm needed the initialization of running mean and var, otherwise when it access the block it returns error as accessing an non-initialized block. I fixed this by initializating the mean by 0 and the var by 1. (ii) the backward function CpuBatchNormBackward does not exist, but there is another function called CpuBatchNormBackwardx (in the directory src/model/operation/batchnorm.cc). So I use this function by providing all the necessary arguments. The program can run now, but I am doing a brief cifar10 training test on AWS (using c5.x4large with 16 cpu cores). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317527518 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: added the cpu test of conv2d and batchnorm2d in test_operation.py and passed, but still waiting for the cifar10 training accuracy test This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317488906 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Now Both CPU and GPU can run, but I used "type" instead of "isinstance" because CudnnBatchNormHandle is a subclass of BatchNormHandle, so CudnnBatchNormHandle is considered as an instance of BatchNormHandle in isinstance(). For more explanation, See http://www.runoob.com/python/python-func-isinstance.html Moreover, I have debugged the cpu batchnorm in the following two aspects: (i) the forward function of cpu batchnorm needed the initialization of running mean and var, otherwise when it access the block it returns error as accessing an non-initialized block. I fixed this by initializating the mean by 0 and the var by 1. (ii) the backward function CpuBatchNormBackward does not exist, but there is another function called CpuBatchNormBackwardx (in the directory src/model/operation/batchnorm.cc). So I use this function by providing all the necessary arguments. The program can run now, but I am doing a brief cifar10 training test on AWS (using c5.x4large with 16 cpu cores). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317488906 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Now both CPU and GPU can run, but I used "type" instead of "isinstance" because CudnnBatchNormHandle is a subclass of BatchNormHandle, so a CudnnBatchNormHandle is considered an instance of BatchNormHandle by isinstance(). For more explanation, see http://www.runoob.com/python/python-func-isinstance.html Moreover, I have debugged the CPU batchnorm in two aspects: (i) the forward function of the CPU batchnorm needs the running mean and var to be initialized, otherwise it raises an error about accessing a non-initialized block when it reads them; I fixed this by initializing the mean to 0 and the var to 1. (ii) the backward function CpuBatchNormBackward does not exist, but there is another function called CpuBatchNormBackwardx (in src/model/operation/batchnorm.cc), so I use that function and provide all the necessary arguments. The program can run now, and I am doing a brief cifar10 training test on AWS (using c5.4xlarge with 16 CPU cores). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
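For point (i) above, a hedged numpy sketch of training-mode batch norm (not SINGA's kernel; the momentum convention here is an assumption for illustration) shows why the running statistics must exist before the first forward call: they are read and updated in the same training step, so an uninitialized buffer would be accessed immediately. Starting the mean at 0 and the variance at 1 corresponds to identity statistics.
```
import numpy as np

def batchnorm_train_step(x, running_mean, running_var, momentum=0.9, eps=1e-5):
    """x: (N, C) activations; running_mean/running_var: per-channel arrays, updated in place."""
    batch_mean = x.mean(axis=0)
    batch_var = x.var(axis=0)
    # The running buffers are read and written on every training step.
    running_mean[:] = momentum * running_mean + (1.0 - momentum) * batch_mean
    running_var[:] = momentum * running_var + (1.0 - momentum) * batch_var
    return (x - batch_mean) / np.sqrt(batch_var + eps)

channels = 4
running_mean = np.zeros(channels, dtype=np.float32)  # initialized to 0
running_var = np.ones(channels, dtype=np.float32)    # initialized to 1
y = batchnorm_train_step(np.random.randn(8, channels).astype(np.float32),
                         running_mean, running_var)
```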
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317382891 ## File path: examples/autograd/mnist_dist.py ## @@ -0,0 +1,251 @@ +# Review comment: I see. I will modify the code. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317382900 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: I see. I will modify the codes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317382034 ## File path: CMakeLists.txt ## @@ -30,7 +30,7 @@ LIST(APPEND CMAKE_MODULE_PATH ${PROJECT_SOURCE_DIR}/cmake/Thirdparty) # Flags IF(UNIX) -SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -g -O2 -fPIC -Wall -pthread") +SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -O3 -fPIC -Wall -pthread") Review comment: ok, changed This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317381887 ## File path: src/api/config.i ## @@ -0,0 +1,34 @@ +// Licensed to the Apache Software Foundation (ASF) under one Review comment: Yes, this is generated directly from "config.i.in" by cmake. I have deleted this file "config.i" This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r311056639 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: Updated on 6th August: I removed a bug in the commit 0616000 which concerns the number of parameters (i.e. Size() of the tensor) taken part in the all-reduce process. Then I did a 8 * K80 multi-GPUs training and evaluation test with a simple MNIST dataset on simple CNN. It reduces the training loss from 802.7 to 42.2 in about 30 Epochs: ``` Epoch=0: 100%|¦¦| 117/117 [00:01<00:00, 92.86it/s]Training loss = 802.659485, training accuracy = 0.713825 Test accuracy = 0.920025 Epoch=1: 100%|¦¦| 117/117 [00:01<00:00, 93.42it/s]Training loss = 246.589371, training accuracy = 0.916767 Test accuracy = 0.956106 Epoch=2: 100%|¦¦| 117/117 [00:01<00:00, 94.04it/s]Training loss = 175.012894, training accuracy = 0.941106 Test accuracy = 0.967208 Epoch=3: 100%|¦¦| 117/117 [00:01<00:00, 95.66it/s] Training loss = 144.684052, training accuracy = 0.951539 Test accuracy = 0.970806 Epoch=4: 100%|¦¦| 117/117 [00:01<00:00, 102.59it/s]Training loss = 120.399704, training accuracy = 0.959402 Test accuracy = 0.976049 Epoch=5: 100%|¦¦| 117/117 [00:01<00:00, 102.79it/s]Training loss = 107.832191, training accuracy = 0.963709 Test accuracy = 0.975946 Epoch=6: 100%|¦¦| 117/117 [00:01<00:00, 102.70it/s]Training loss = 96.289490, training accuracy = 0.967014 Test accuracy = 0.979441 Epoch=7: 100%|¦¦| 117/117 [00:01<00:00, 102.34it/s]Training loss = 88.031815, training accuracy = 0.970436 Test accuracy = 0.980983 Epoch=8: 100%|¦¦| 117/117 [00:01<00:00, 101.81it/s]Training loss = 79.349884, training accuracy = 0.973090 Test accuracy = 0.980058 Epoch=9: 100%|¦¦| 117/117 [00:01<00:00, 101.82it/s]Training loss = 77.825607, training accuracy = 0.974342 Test accuracy = 0.977282 Epoch=10: 100%|¦¦| 117/117 [00:01<00:00, 101.97it/s]Training loss = 74.710297, training accuracy = 0.974576 Test accuracy = 0.983861 Epoch=11: 100%|¦¦| 117/117 [00:01<00:00, 101.98it/s]Training loss = 69.400230, training accuracy = 0.976162 Test accuracy = 0.982936 Epoch=12: 100%|¦¦| 117/117 [00:01<00:00, 102.03it/s]Training loss = 65.100449, training accuracy = 0.978148 Test accuracy = 0.983553 Epoch=13: 100%|¦¦| 117/117 [00:01<00:00, 102.17it/s]Training loss = 65.113991, training accuracy = 0.978249 Test accuracy = 0.986534 Epoch=14: 100%|¦¦| 117/117 [00:01<00:00, 101.83it/s]Training loss = 
63.065636, training accuracy = 0.978566 Test accuracy = 0.984683 Epoch=15: 100%|¦¦| 117/117 [00:01<00:00, 102.11it/s]Training loss = 58.334709, training accuracy = 0.980018 Test accuracy = 0.983758 Epoch=16: 100%|¦¦| 117/117 [00:01<00:00, 102.16it/s]Training loss = 58.280094, training accuracy = 0.980285 Test accuracy = 0.983655 Epoch=17: 100%|¦¦| 117/117 [00:01<00:00, 102.15it/s]Training loss = 53.226196, training accuracy = 0.981420 Test accuracy = 0.985197 Epoch=18: 100%|¦¦| 117/117 [00:01<00:00, 102.15it/s]Training loss = 55.968140, training accuracy = 0.980786 Test accuracy = 0.982422 Epoch=19: 100%|¦¦| 117/117 [00:01<00:00, 102.14it/s]Training loss = 52.761921, training accuracy = 0.982489 Test accuracy = 0.985814 Epoch=20: 100%|¦¦| 117/117 [00:01<00:00, 101.86it/s]Training loss = 51.989666, training accuracy = 0.982973 Test accuracy = 0.983758 Epoch=21: 100%|¦¦| 117/117 [00:01<00:00, 101.91it/s]Training loss = 52.571381, training accuracy = 0.982455 Test accuracy = 0.987973 Epoch=22: 100%|¦¦| 117/117 [00:01<00:00, 101.99it/s]Training loss
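The Size() bug mentioned at the start of this comment is not spelled out, so the following is only a hedged mpi4py/numpy sketch of the general idea behind such an all-reduce (not SINGA's NCCL communicator, and not necessarily the actual fix): when several gradient tensors are reduced together, the element count of every tensor has to be respected when packing and unpacking the buffer, and the summed result is divided by the number of workers to obtain the averaged gradient.
```
import numpy as np
from mpi4py import MPI

def fused_all_reduce(grads):
    """Average a list of numpy gradient arrays across all MPI ranks in one call."""
    comm = MPI.COMM_WORLD
    flat = np.concatenate([g.ravel() for g in grads])  # pack: length = sum of each tensor's size
    comm.Allreduce(MPI.IN_PLACE, flat, op=MPI.SUM)      # sum over workers
    flat /= comm.Get_size()                             # average
    out, offset = [], 0
    for g in grads:                                     # unpack using each tensor's element count
        out.append(flat[offset:offset + g.size].reshape(g.shape))
        offset += g.size
    return out
```
Run under mpiexec, each rank would pass its local gradients and receive the averaged ones back.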
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r310328731 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: I have trained the dist_new branch resnet (because resnet has batch norm) with cifar10 dataset using 1 GPU, and obtained 92.5% test accuracy in 100 Epochs with data augmentation. This suggest that the batch norm is in good condition (while the onnx interface of batchnorm may need to be considered but I am not sure) ``` ubuntu@ip-172-31-27-25:~/incubator-singa/examples/autograd$ python3 resnet_realdata.py Loading data file cifar-10-batches-py/data_batch_1 Loading data file cifar-10-batches-py/data_batch_2 Loading data file cifar-10-batches-py/data_batch_3 Loading data file cifar-10-batches-py/data_batch_4 Loading data file cifar-10-batches-py/data_batch_5 Loading data file cifar-10-batches-py/test_batch Start intialization Epoch=0: 100%|███| 1562/1562 [03:56<00:00, 6.61it/s] Training loss = 2927.551146, training accuracy = 0.338068 Test accuracy = 0.441306 Epoch=1: 100%|███| 1562/1562 [03:56<00:00, 6.59it/s] Training loss = 2110.360374, training accuracy = 0.511984 Test accuracy = 0.606571 Epoch=2: 100%|███| 1562/1562 [03:56<00:00, 6.61it/s] Training loss = 1658.897868, training accuracy = 0.623199 Test accuracy = 0.645232 Epoch=3: 100%|███| 1562/1562 [03:56<00:00, 6.64it/s] Training loss = 1354.082412, training accuracy = 0.694442 Test accuracy = 0.731170 Epoch=4: 100%|███| 1562/1562 [03:56<00:00, 6.63it/s] Training loss = 1155.785529, training accuracy = 0.743478 Test accuracy = 0.761318 Epoch=5: 100%|███| 1562/1562 [03:56<00:00, 6.59it/s] Training loss = 1022.750388, training accuracy = 0.773668 Test accuracy = 0.741286 Epoch=6: 100%|███| 1562/1562 [03:56<00:00, 6.62it/s] Training loss = 945.400214, training accuracy = 0.790373 Test accuracy = 0.795072 Epoch=7: 100%|███| 1562/1562 [03:56<00:00, 6.61it/s] Training loss = 840.933215, training accuracy = 0.814441 Test accuracy = 0.810096 Epoch=8: 100%|███| 1562/1562 [03:56<00:00, 6.62it/s] Training loss = 765.215148, training accuracy = 0.830566 Test accuracy = 0.807091 Epoch=9: 100%|███| 1562/1562 [03:56<00:00, 6.61it/s] Training loss = 701.153867, training accuracy = 0.845951 Test accuracy = 0.822316 Epoch=10: 100%|██| 1562/1562 [03:56<00:00, 6.63it/s] Training loss = 666.267428, training accuracy = 0.853073 Test accuracy = 0.851162 Epoch=11: 100%|██| 1562/1562 [03:56<00:00, 
6.62it/s] Training loss = 606.699607, training accuracy = 0.866817 Test accuracy = 0.770232 Epoch=12: 100%|██| 1562/1562
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r311068821 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: In additional to the above, I also did a 8 * K80 multi-GPUs training and evaluation test with a CIFAR-10 dataset on resnet 50. It reduces the training loss from 3983.8 to 35.56 in 100 Epochs, and evaluation accuracy to 90.6% (maximum at epoch 90). However, this does not include the synchronization of running mean and variance before the evaluation phase: ``` Epoch=0: 100%|██| 195/195 [06:06<00:00, 1.91s/it]Training loss = 3983.820557, training accuracy = 0.225260 Test accuracy = 0.347556 Epoch=1: 100%|██| 195/195 [06:17<00:00, 1.94s/it]Training loss = 2628.622070, training accuracy = 0.379768 Test accuracy = 0.437700 Epoch=2: 100%|██| 195/195 [06:12<00:00, 1.89s/it]Training loss = 2347.072266, training accuracy = 0.448558 Test accuracy = 0.459936 Epoch=3: 100%|██| 195/195 [06:13<00:00, 1.88s/it]Training loss = 2075.987305, training accuracy = 0.517348 Test accuracy = 0.548978 Epoch=4: 100%|██| 195/195 [06:19<00:00, 1.97s/it]Training loss = 1890.109985, training accuracy = 0.566847 Test accuracy = 0.594451 Epoch=5: 100%|██| 195/195 [06:13<00:00, 1.92s/it]Training loss = 1720.395142, training accuracy = 0.606911 Test accuracy = 0.633413 Epoch=6: 100%|██| 195/195 [06:10<00:00, 1.92s/it]Training loss = 1555.737549, training accuracy = 0.645753 Test accuracy = 0.659054 Epoch=7: 100%|██| 195/195 [06:14<00:00, 1.91s/it]Training loss = 1385.688477, training accuracy = 0.687220 Test accuracy = 0.709836 Epoch=8: 100%|██| 195/195 [06:20<00:00, 1.97s/it]Training loss = 1269.426270, training accuracy = 0.714523 Test accuracy = 0.735477 Epoch=9: 100%|██| 195/195 [06:15<00:00, 1.91s/it]Training loss = 1137.953979, training accuracy = 0.746054 Test accuracy = 0.745393 Epoch=10: 100%|██| 195/195 [06:11<00:00, 1.88s/it]Training loss = 1031.773071, training accuracy = 0.770353 Test accuracy = 0.750501 Epoch=11: 100%|██| 195/195 [06:10<00:00, 1.89s/it]Training loss = 956.600037, training accuracy = 0.788261 Test accuracy = 0.44 Epoch=12: 100%|██| 195/195 [06:16<00:00, 1.92s/it]Training loss = 881.050171, training accuracy = 0.804167 Test accuracy = 0.793369 Epoch=13: 100%|██| 195/195 [06:16<00:00, 1.92s/it]Training loss = 828.298828, training accuracy = 0.818309 Test accuracy = 0.807692 Epoch=14: 100%|██| 195/195 [06:11<00:00, 1.90s/it]Training 
loss = 790.558838, training accuracy = 0.823918 Test accuracy = 0.795373 Epoch=15: 100%|██| 195/195 [06:13<00:00, 1.90s/it]Training loss = 740.679871, training accuracy = 0.833734 Test accuracy = 0.816707 Epoch=16: 100%|██| 195/195 [06:20<00:00, 1.95s/it]Training loss = 691.391479, training accuracy = 0.846855 Test accuracy = 0.818510 Epoch=17: 100%|██| 195/195 [06:16<00:00, 1.89s/it]Training loss = 657.708130, training accuracy = 0.853986 Test accuracy = 0.826122 Epoch=18: 100%|██| 195/195 [06:10<00:00, 1.88s/it]Training loss = 627.918579, training accuracy = 0.860216 Test accuracy = 0.844752 Epoch=19: 100%|██| 195/195 [06:13<00:00, 1.91s/it]Training loss = 592.768982, training accuracy = 0.869551 Test accuracy = 0.845653 Epoch=20: 100%|██| 195/195 [06:19<00:00, 1.97s/it]Training loss = 561.560608, training accuracy = 0.875060 Test accuracy = 0.835938 Epoch=21: 100%|██| 195/195 [06:15<00:00, 1.97s/it]Training loss = 533.083740, training accuracy = 0.881370 Test accuracy = 0.849860 Epoch=22: 100%|██| 195/195 [06:12<00:00,
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r311074176 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: From the above, we can now train simple CNN without batchnorm (MNIST dataset) and resnet with batchnorm (CIFAR-10 dataset). The remaining task is the synchronization of the running mean and variance if we use batchnorm. I tried to put the running mean and var in the _BatchNorm2D return list of backward ```python def backward(self, dy): assert training is True and hasattr( self, "cache" ), "Please set training as True before do BP. " x, scale, mean, var = self.cache if isinstance(self.handle, singa.CudnnBatchNormHandle): dx, ds, db = singa.GpuBatchNormBackward( self.handle, dy, x, scale, mean, var ) else: dx, ds, db = singa.CpuBatchNormBackward( self.handle, dy, x, scale, mean, var ) #return dx, ds, db return dx, ds, db, self.running_mean, self.running_var ``` and wish to synchronize it with ```python #all reduce running mean and var for p, g in autograd.backward(loss): if((p.requires_grad==False) and (p.stores_grad==False)): all_reduce(p) ``` However, this is the error in return ``` Traceback (most recent call last): File "resnet_multigpu.py", line 163, in for p, g in autograd.backward(loss): File "/usr/local/lib/python3.5/dist-packages/singa/autograd.py", line 136, in backward % (len(op.src), len(dxs)) AssertionError: the number of src ops (=3) and dx (=5) not match ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
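The assertion above fires because the autograd engine pairs each value returned by backward() with one input of forward(); forward(x, scale, bias) takes three inputs, so backward() must return exactly three gradients, and the running statistics, which are not differentiable inputs, cannot ride along in that list. One possible alternative, sketched below with mpi4py and numpy arrays purely as an assumption (this is not SINGA's API), is to all-reduce each batch-norm layer's running mean and variance directly, for example once before the evaluation phase.
```
import numpy as np
from mpi4py import MPI

def sync_running_stats(running_mean, running_var):
    """Average batch-norm running statistics (contiguous numpy arrays) over all ranks, in place."""
    comm = MPI.COMM_WORLD
    comm.Allreduce(MPI.IN_PLACE, running_mean, op=MPI.SUM)
    comm.Allreduce(MPI.IN_PLACE, running_var, op=MPI.SUM)
    running_mean /= comm.Get_size()   # average over workers
    running_var /= comm.Get_size()
```
Calling something like this for every BatchNorm2d layer after training would make the evaluation-phase statistics consistent across workers without touching the gradient bookkeeping.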
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r311056639 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: Updated on 6th August: I removed a bug in the commit 0616000 which concerns the number of parameters (i.e. the Size() of the tensor) taking part in the all-reduce process. Then I did an 8 * K80 multi-GPU training and evaluation test on a simple CNN with the MNIST dataset. It reduces the training loss from 802.7 to 42.2 in about 30 Epochs: ``` Epoch=0: 100%|¦¦| 117/117 [00:01<00:00, 92.86it/s]Training loss = 802.659485, training accuracy = 0.713825 Test accuracy = 0.920025 Epoch=1: 100%|¦¦| 117/117 [00:01<00:00, 93.42it/s]Training loss = 246.589371, training accuracy = 0.916767 Test accuracy = 0.956106 Epoch=2: 100%|¦¦| 117/117 [00:01<00:00, 94.04it/s]Training loss = 175.012894, training accuracy = 0.941106 Test accuracy = 0.967208 Epoch=3: 100%|¦¦| 117/117 [00:01<00:00, 95.66it/s] Training loss = 144.684052, training accuracy = 0.951539 Test accuracy = 0.970806 Epoch=4: 100%|¦¦| 117/117 [00:01<00:00, 102.59it/s]Training loss = 120.399704, training accuracy = 0.959402 Test accuracy = 0.976049 Epoch=5: 100%|¦¦| 117/117 [00:01<00:00, 102.79it/s]Training loss = 107.832191, training accuracy = 0.963709 Test accuracy = 0.975946 Epoch=6: 100%|¦¦| 117/117 [00:01<00:00, 102.70it/s]Training loss = 96.289490, training accuracy = 0.967014 Test accuracy = 0.979441 Epoch=7: 100%|¦¦| 117/117 [00:01<00:00, 102.34it/s]Training loss = 88.031815, training accuracy = 0.970436 Test accuracy = 0.980983 Epoch=8: 100%|¦¦| 117/117 [00:01<00:00, 101.81it/s]Training loss = 79.349884, training accuracy = 0.973090 Test accuracy = 0.980058 Epoch=9: 100%|¦¦| 117/117 [00:01<00:00, 101.82it/s]Training loss = 77.825607, training accuracy = 0.974342 Test accuracy = 0.977282 Epoch=10: 100%|¦¦| 117/117 [00:01<00:00, 101.97it/s]Training loss = 74.710297, training accuracy = 0.974576 Test accuracy = 0.983861 Epoch=11: 100%|¦¦| 117/117 [00:01<00:00, 101.98it/s]Training loss = 69.400230, training accuracy = 0.976162 Test accuracy = 0.982936 Epoch=12: 100%|¦¦| 117/117 [00:01<00:00, 102.03it/s]Training loss = 65.100449, training accuracy = 0.978148 Test accuracy = 0.983553 Epoch=13: 100%|¦¦| 117/117 [00:01<00:00, 102.17it/s]Training loss = 65.113991, training accuracy = 0.978249 Test accuracy = 0.986534 Epoch=14: 100%|¦¦| 117/117 [00:01<00:00, 101.83it/s]Training loss =
63.065636, training accuracy = 0.978566 Test accuracy = 0.984683 Epoch=15: 100%|¦¦| 117/117 [00:01<00:00, 102.11it/s]Training loss = 58.334709, training accuracy = 0.980018 Test accuracy = 0.983758 Epoch=16: 100%|¦¦| 117/117 [00:01<00:00, 102.16it/s]Training loss = 58.280094, training accuracy = 0.980285 Test accuracy = 0.983655 Epoch=17: 100%|¦¦| 117/117 [00:01<00:00, 102.15it/s]Training loss = 53.226196, training accuracy = 0.981420 Test accuracy = 0.985197 Epoch=18: 100%|¦¦| 117/117 [00:01<00:00, 102.15it/s]Training loss = 55.968140, training accuracy = 0.980786 Test accuracy = 0.982422 Epoch=19: 100%|¦¦| 117/117 [00:01<00:00, 102.14it/s]Training loss = 52.761921, training accuracy = 0.982489 Test accuracy = 0.985814 Epoch=20: 100%|¦¦| 117/117 [00:01<00:00, 101.86it/s]Training loss = 51.989666, training accuracy = 0.982973 Test accuracy = 0.983758 Epoch=21: 100%|¦¦| 117/117 [00:01<00:00, 101.91it/s]Training loss = 52.571381, training accuracy = 0.982455 Test accuracy = 0.987973 Epoch=22: 100%|¦¦| 117/117 [00:01<00:00, 101.99it/s]Training loss
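The bug fix above concerns the element count handed to the all-reduce: each call must cover the Size() of the parameter tensor (its number of elements), not the number of parameter tensors. A minimal sketch of that idea using mpi4py and NumPy, which is illustrative only and is not the communicator code of this PR:
```python
# Illustrative only: average one parameter tensor across ranks; the buffer
# length passed to Allreduce is param.size, i.e. the tensor's Size().
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def all_reduce_average(param):
    flat = np.ascontiguousarray(param, dtype=np.float32).ravel()
    out = np.empty_like(flat)                # length == param.size
    comm.Allreduce(flat, out, op=MPI.SUM)    # sum over all ranks
    return (out / comm.Get_size()).reshape(param.shape)
```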
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r311068821 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: In addition to the above, I also did an 8 * K80 multi-GPU training and evaluation test with the CIFAR-10 dataset on resnet 50. It reduces the training loss from 3983.8 to 35.56 in 100 Epochs, and raises the evaluation accuracy to 90.6% (maximum at epoch 90). However, this does not include the synchronization of running mean and variance before the evaluation phase: ``` Epoch=0: 100%|██| 195/195 [06:06<00:00, 1.91s/it]Training loss = 3983.820557, training accuracy = 0.225260 Test accuracy = 0.347556 Epoch=1: 100%|██| 195/195 [06:17<00:00, 1.94s/it]Training loss = 2628.622070, training accuracy = 0.379768 Test accuracy = 0.437700 Epoch=2: 100%|██| 195/195 [06:12<00:00, 1.89s/it]Training loss = 2347.072266, training accuracy = 0.448558 Test accuracy = 0.459936 Epoch=3: 100%|██| 195/195 [06:13<00:00, 1.88s/it]Training loss = 2075.987305, training accuracy = 0.517348 Test accuracy = 0.548978 Epoch=4: 100%|██| 195/195 [06:19<00:00, 1.97s/it]Training loss = 1890.109985, training accuracy = 0.566847 Test accuracy = 0.594451 Epoch=5: 100%|██| 195/195 [06:13<00:00, 1.92s/it]Training loss = 1720.395142, training accuracy = 0.606911 Test accuracy = 0.633413 Epoch=6: 100%|██| 195/195 [06:10<00:00, 1.92s/it]Training loss = 1555.737549, training accuracy = 0.645753 Test accuracy = 0.659054 Epoch=7: 100%|██| 195/195 [06:14<00:00, 1.91s/it]Training loss = 1385.688477, training accuracy = 0.687220 Test accuracy = 0.709836 Epoch=8: 100%|██| 195/195 [06:20<00:00, 1.97s/it]Training loss = 1269.426270, training accuracy = 0.714523 Test accuracy = 0.735477 Epoch=9: 100%|██| 195/195 [06:15<00:00, 1.91s/it]Training loss = 1137.953979, training accuracy = 0.746054 Test accuracy = 0.745393 Epoch=10: 100%|██| 195/195 [06:11<00:00, 1.88s/it]Training loss = 1031.773071, training accuracy = 0.770353 Test accuracy = 0.750501 Epoch=11: 100%|██| 195/195 [06:10<00:00, 1.89s/it]Training loss = 956.600037, training accuracy = 0.788261 Test accuracy = 0.44 Epoch=12: 100%|██| 195/195 [06:16<00:00, 1.92s/it]Training loss = 881.050171, training accuracy = 0.804167 Test accuracy = 0.793369 Epoch=13: 100%|██| 195/195 [06:16<00:00, 1.92s/it]Training loss = 828.298828, training accuracy = 0.818309 Test accuracy = 0.807692 Epoch=14: 100%|██| 195/195 [06:11<00:00, 1.90s/it]Training
loss = 790.558838, training accuracy = 0.823918 Test accuracy = 0.795373 Epoch=15: 100%|██| 195/195 [06:13<00:00, 1.90s/it]Training loss = 740.679871, training accuracy = 0.833734 Test accuracy = 0.816707 Epoch=16: 100%|██| 195/195 [06:20<00:00, 1.95s/it]Training loss = 691.391479, training accuracy = 0.846855 Test accuracy = 0.818510 Epoch=17: 100%|██| 195/195 [06:16<00:00, 1.89s/it]Training loss = 657.708130, training accuracy = 0.853986 Test accuracy = 0.826122 Epoch=18: 100%|██| 195/195 [06:10<00:00, 1.88s/it]Training loss = 627.918579, training accuracy = 0.860216 Test accuracy = 0.844752 Epoch=19: 100%|██| 195/195 [06:13<00:00, 1.91s/it]Training loss = 592.768982, training accuracy = 0.869551 Test accuracy = 0.845653 Epoch=20: 100%|██| 195/195 [06:19<00:00, 1.97s/it]Training loss = 561.560608, training accuracy = 0.875060 Test accuracy = 0.835938 Epoch=21: 100%|██| 195/195 [06:15<00:00, 1.97s/it]Training loss = 533.083740, training accuracy = 0.881370 Test accuracy = 0.849860 Epoch=22: 100%|██| 195/195 [06:12<00:00,
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r311068821 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: In addition to the above, I also did an 8 * K80 multi-GPU training and evaluation test with the CIFAR-10 dataset on resnet 50. It reduces the training loss from 3983.8 to 345.7 in about 30 Epochs, and raises the evaluation accuracy to 86.8%. However, this does not include the synchronization of running mean and variance before the evaluation phase: ``` Epoch=0: 100%|██| 195/195 [06:06<00:00, 1.91s/it]Training loss = 3983.820557, training accuracy = 0.225260 Test accuracy = 0.347556 Epoch=1: 100%|██| 195/195 [06:17<00:00, 1.94s/it]Training loss = 2628.622070, training accuracy = 0.379768 Test accuracy = 0.437700 Epoch=2: 100%|██| 195/195 [06:12<00:00, 1.89s/it]Training loss = 2347.072266, training accuracy = 0.448558 Test accuracy = 0.459936 Epoch=3: 100%|██| 195/195 [06:13<00:00, 1.88s/it]Training loss = 2075.987305, training accuracy = 0.517348 Test accuracy = 0.548978 Epoch=4: 100%|██| 195/195 [06:19<00:00, 1.97s/it]Training loss = 1890.109985, training accuracy = 0.566847 Test accuracy = 0.594451 Epoch=5: 100%|██| 195/195 [06:13<00:00, 1.92s/it]Training loss = 1720.395142, training accuracy = 0.606911 Test accuracy = 0.633413 Epoch=6: 100%|██| 195/195 [06:10<00:00, 1.92s/it]Training loss = 1555.737549, training accuracy = 0.645753 Test accuracy = 0.659054 Epoch=7: 100%|██| 195/195 [06:14<00:00, 1.91s/it]Training loss = 1385.688477, training accuracy = 0.687220 Test accuracy = 0.709836 Epoch=8: 100%|██| 195/195 [06:20<00:00, 1.97s/it]Training loss = 1269.426270, training accuracy = 0.714523 Test accuracy = 0.735477 Epoch=9: 100%|██| 195/195 [06:15<00:00, 1.91s/it]Training loss = 1137.953979, training accuracy = 0.746054 Test accuracy = 0.745393 Epoch=10: 100%|██| 195/195 [06:11<00:00, 1.88s/it]Training loss = 1031.773071, training accuracy = 0.770353 Test accuracy = 0.750501 Epoch=11: 100%|██| 195/195 [06:10<00:00, 1.89s/it]Training loss = 956.600037, training accuracy = 0.788261 Test accuracy = 0.44 Epoch=12: 100%|██| 195/195 [06:16<00:00, 1.92s/it]Training loss = 881.050171, training accuracy = 0.804167 Test accuracy = 0.793369 Epoch=13: 100%|██| 195/195 [06:16<00:00, 1.92s/it]Training loss = 828.298828, training accuracy = 0.818309 Test accuracy = 0.807692 Epoch=14: 100%|██| 195/195 [06:11<00:00, 1.90s/it]Training loss = 790.558838,
training accuracy = 0.823918 Test accuracy = 0.795373 Epoch=15: 100%|██| 195/195 [06:13<00:00, 1.90s/it]Training loss = 740.679871, training accuracy = 0.833734 Test accuracy = 0.816707 Epoch=16: 100%|██| 195/195 [06:20<00:00, 1.95s/it]Training loss = 691.391479, training accuracy = 0.846855 Test accuracy = 0.818510 Epoch=17: 100%|██| 195/195 [06:16<00:00, 1.89s/it]Training loss = 657.708130, training accuracy = 0.853986 Test accuracy = 0.826122 Epoch=18: 100%|██| 195/195 [06:10<00:00, 1.88s/it]Training loss = 627.918579, training accuracy = 0.860216 Test accuracy = 0.844752 Epoch=19: 100%|██| 195/195 [06:13<00:00, 1.91s/it]Training loss = 592.768982, training accuracy = 0.869551 Test accuracy = 0.845653 Epoch=20: 100%|██| 195/195 [06:19<00:00, 1.97s/it]Training loss = 561.560608, training accuracy = 0.875060 Test accuracy = 0.835938 Epoch=21: 100%|██| 195/195 [06:15<00:00, 1.97s/it]Training loss = 533.083740, training accuracy = 0.881370 Test accuracy = 0.849860 Epoch=22: 100%|██| 195/195 [06:12<00:00, 1.91s/it]Traini
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r310330469 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: Next, I will try to train the resnet with cifar 10 using 8 gpu. This will take time to modify because I need to collect the accuracy from the other processes (may use MPI to reduce), and sync the running mean and var of the different processes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
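The accuracy collection mentioned above can be done with a plain MPI all-reduce of the per-rank counts. A minimal sketch assuming mpi4py; local_correct and local_total are hypothetical per-rank counters, not names from this PR:
```python
# Illustrative sketch: aggregate per-rank evaluation counts into a global accuracy.
from mpi4py import MPI

def global_accuracy(local_correct, local_total):
    comm = MPI.COMM_WORLD
    correct = comm.allreduce(local_correct, op=MPI.SUM)   # total correct predictions
    total = comm.allreduce(local_total, op=MPI.SUM)       # total evaluated samples
    return correct / total
```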
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r310328731 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: I have trained the dist_new branch resnet (because resnet has batch norm) with the cifar10 dataset using 1 GPU, and obtained 92.5% test accuracy in 100 Epochs with data augmentation. This suggests that the batch norm is in good condition (though the onnx interface of batchnorm may still need to be considered; I am not sure) ``` ubuntu@ip-172-31-27-25:~/incubator-singa/examples/autograd$ python3 resnet_realdata.py Loading data file cifar-10-batches-py/data_batch_1 Loading data file cifar-10-batches-py/data_batch_2 Loading data file cifar-10-batches-py/data_batch_3 Loading data file cifar-10-batches-py/data_batch_4 Loading data file cifar-10-batches-py/data_batch_5 Loading data file cifar-10-batches-py/test_batch Start intialization Epoch=0: 100%|███| 1562/1562 [03:56<00:00, 6.61it/s] Training loss = 2927.551146, training accuracy = 0.338068 Test accuracy = 0.441306 Epoch=1: 100%|███| 1562/1562 [03:56<00:00, 6.59it/s] Training loss = 2110.360374, training accuracy = 0.511984 Test accuracy = 0.606571 Epoch=2: 100%|███| 1562/1562 [03:56<00:00, 6.61it/s] Training loss = 1658.897868, training accuracy = 0.623199 Test accuracy = 0.645232 Epoch=3: 100%|███| 1562/1562 [03:56<00:00, 6.64it/s] Training loss = 1354.082412, training accuracy = 0.694442 Test accuracy = 0.731170 Epoch=4: 100%|███| 1562/1562 [03:56<00:00, 6.63it/s] Training loss = 1155.785529, training accuracy = 0.743478 Test accuracy = 0.761318 Epoch=5: 100%|███| 1562/1562 [03:56<00:00, 6.59it/s] Training loss = 1022.750388, training accuracy = 0.773668 Test accuracy = 0.741286 Epoch=6: 100%|███| 1562/1562 [03:56<00:00, 6.62it/s] Training loss = 945.400214, training accuracy = 0.790373 Test accuracy = 0.795072 Epoch=7: 100%|███| 1562/1562 [03:56<00:00, 6.61it/s] Training loss = 840.933215, training accuracy = 0.814441 Test accuracy = 0.810096 Epoch=8: 100%|███| 1562/1562 [03:56<00:00, 6.62it/s] Training loss = 765.215148, training accuracy = 0.830566 Test accuracy = 0.807091 Epoch=9: 100%|███| 1562/1562 [03:56<00:00, 6.61it/s] Training loss = 701.153867, training accuracy = 0.845951 Test accuracy = 0.822316 Epoch=10: 100%|██| 1562/1562 [03:56<00:00, 6.63it/s] Training loss = 666.267428, training accuracy = 0.853073 Test accuracy = 0.851162 Epoch=11: 100%|██| 1562/1562 [03:56<00:00,
6.62it/s] Training loss = 606.699607, training accuracy = 0.866817 Test accuracy = 0.770232 Epoch=12: 100%|██| 1562/1562
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r310330469 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: Next, I will try to train the resnet with cifar 10 using 8 gpu. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r309709702 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: Updated on 1 August 2019: Concerning the above error, I found that there is a difference between the implementations of `class _BatchNorm2d(Operation):` in the master branch and the dist_new branch. In autograd.py, both the master branch and the dist_new branch have modified (or debugged) the conv2d and batchnorm operators, but they modified them differently. Meanwhile, the conv2d in both the master branch and the dist_new branch can train and reduce the loss of the simple mnist CNN, so there is no big problem there. However, batch normalization is a much more complex case, because it includes non-training variables, namely the running means and running variances. In the master branch, the running means and running variances (non-training variables) are in the forward function: `def forward(self, x, scale, bias, running_mean, running_var):` https://github.com/apache/incubator-singa/blob/master/python/singa/autograd.py#L1099 When I run the code using the master branch dockerfile, the error is as follows: ``` root@26c9db193eb0:~/incubator-singa/examples/autograd# python3 resnet.py Start intialization 0%| | 0/200 [00:00 for p, g in autograd.backward(loss): File "/root/incubator-singa/build/python/singa/autograd.py", line 135, in backward % (len(op.src), len(dxs)) AssertionError: the number of src ops (=5) and dx (=3) not match ``` I think the error occurs because running_mean and running_var are input arguments of the forward function but are not training variables, so only three src ops are expected while five are found. Meanwhile, the dist_new branch has modified the batchnorm function (commit 2b3a857 by user ubuntu on Apr 14) by moving the input arguments running_mean and running_var into the initialization function: `def __init__(self, handle, running_mean, running_var, name=None):` `def forward(self, x, scale, bias):` https://github.com/xuewanqi/incubator-singa/blob/dist_new/python/singa/autograd.py#L1096 This one can run successfully, but I am not sure if it can train and reduce the loss. Next, I will try training the resnet with a real dataset to see if it can reduce the loss. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
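To make the src-op/gradient mismatch described above concrete, here is a simplified, hypothetical sketch (not SINGA's actual autograd code) of the two ways of wiring the running statistics into the operation; passing them through forward() makes the engine record five inputs while backward() returns only three gradients.
```python
# Simplified, hypothetical autograd skeleton to illustrate the mismatch; not SINGA's real code.

class Operation:
    def __call__(self, *inputs):
        self.src = inputs              # the engine records every forward argument as a source
        return self.forward(*inputs)

class BatchNorm2dMasterStyle(Operation):
    # master-branch style: running stats are passed through forward()
    def forward(self, x, scale, bias, running_mean, running_var):
        return x                       # placeholder for the cudnn batch-norm forward

    def backward(self, dy):
        # only x, scale and bias get gradients -> 3 values returned,
        # but self.src recorded 5 inputs -> "src ops (=5) and dx (=3) not match"
        return dy, dy, dy              # placeholders

class BatchNorm2dDistNewStyle(Operation):
    # dist_new-branch style: running stats are kept as state set in __init__()
    def __init__(self, handle, running_mean, running_var):
        self.handle = handle
        self.running_mean = running_mean   # non-trainable buffers, not autograd inputs
        self.running_var = running_var

    def forward(self, x, scale, bias):
        return x                       # placeholder; would also update the running stats in place

    def backward(self, dy):
        return dy, dy, dy              # 3 gradients for 3 recorded inputs -> counts match
```
Keeping the non-trainable buffers out of the recorded inputs is what lets the dist_new version pass the src-op/gradient count check.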
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r309496476 ## File path: src/CMakeLists.txt ## @@ -36,6 +36,9 @@ AUX_SOURCE_DIRECTORY(core/scheduler core_source) AUX_SOURCE_DIRECTORY(core/tensor core_source) LIST(APPEND singa_sources ${core_source}) Review comment: ``` The build log is here: ubuntu@ip-172-31-18-113:~/incubator-singa/build$ rm -rf * ubuntu@ip-172-31-18-113:~/incubator-singa/build$ cmake -D CMAKE_PREFIX_PATH="/usr/local/cuda/lib64;/usr/local/cuda/" -DENABLE_TEST=OFF -DUSE_CUDA=ON -DUSE_PYTHON3=ON -DUSE_MKLDNN=ON -DUSE_MODULES=OFF -DUSE_DIST=ON .. -- The C compiler identification is GNU 5.4.0 -- The CXX compiler identification is GNU 5.4.0 -- Check for working C compiler: /usr/bin/cc -- Check for working C compiler: /usr/bin/cc -- works -- Detecting C compiler ABI info -- Detecting C compiler ABI info - done -- Detecting C compile features -- Detecting C compile features - done -- Check for working CXX compiler: /usr/bin/c++ -- Check for working CXX compiler: /usr/bin/c++ -- works -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - done -- Detecting CXX compile features -- Detecting CXX compile features - done -- Looking for pthread.h -- Looking for pthread.h - found -- Looking for pthread_create -- Looking for pthread_create - not found -- Looking for pthread_create in pthreads -- Looking for pthread_create in pthreads - not found -- Looking for pthread_create in pthread -- Looking for pthread_create in pthread - found -- Found Threads: TRUE -- Found Protobuf: /usr/local/lib/libprotobuf.so;-lpthread (found suitable version "3.0.0", minimum required is "3.0") -- Found CBLAS: /usr/local/include -- Found GLOG: /usr/include -- Found cuda_v10.0 -- Found CUDNN: /usr/local/cuda/include -- Found Cudnn_7401 at /usr/local/cuda/include /usr/local/cuda/lib64/libcudnn.so -- Found PythonInterp: /usr/bin/python3 (found suitable version "3.5.2", minimum required is "3") -- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython3.5m.so (found suitable version "3.5.2", minimum required is "3") -- Found SWIG: /usr/local/bin/swig (found suitable version "3.0.12", minimum required is "3.0.10") -- Found MKLDNN at /usr/local/include -- Found MPI at /home/ubuntu/mpich-3.3/build/include -- Found MPI lib at /home/ubuntu/mpich-3.3/build/lib/libmpi.so -- Found all lib at /usr/local/lib/libprotobuf.so;/usr/local/lib/libopenblas.so;/usr/lib/x86_64-linux-gnu/libglog.so;/usr/local/cuda/lib64/libcudnn.so;/usr/local/cuda/lib64/libcudart.so;/usr/local/cuda/lib64/libcurand.so;/usr/local/cuda/lib64/libcublas.so;/home/ubuntu/incubator-singa/build/lib/libcnmem.a;/usr/local/lib/libmkldnn.so;/home/ubuntu/mpich-3.3/build/lib/libmpi.so;/home/ubuntu/mpich-3.3/build/lib/libmpicxx.so -- Found NCCL at /usr/local/cuda/include -- Found NCCL lib at /usr/local/cuda/lib/libnccl.so -- Configuring done -- Generating done -- Build files have been written to: /home/ubuntu/incubator-singa/build ubuntu@ip-172-31-18-113:~/incubator-singa/build$ make -j2 Scanning dependencies of target cnmem Scanning dependencies of target copy_protobuf [ 1%] Creating directories for 'cnmem' [ 2%] Running C++ protocol buffer compiler on /home/ubuntu/incubator-singa/src/proto/model.proto [libprotobuf WARNING google/protobuf/compiler/parser.cc:547] No syntax specified for the proto file: model.proto. Please use 'syntax = "proto2";' or 'syntax = "proto3";' to specify a syntax version. (Defaulted to proto2 syntax.) 
[ 3%] Performing download step (git clone) for 'cnmem' Cloning into 'cnmem'... [ 4%] Running C++ protocol buffer compiler on /home/ubuntu/incubator-singa/src/proto/caffe.proto [ 5%] Running C++ protocol buffer compiler on /home/ubuntu/incubator-singa/src/proto/core.proto [libprotobuf WARNING google/protobuf/compiler/parser.cc:547] No syntax specified for the proto file: core.proto. Please use 'syntax = "proto2";' or 'syntax = "proto3";' to specify a syntax version. (Defaulted to proto2 syntax.) [ 6%] Running C++ protocol buffer compiler on /home/ubuntu/incubator-singa/src/proto/io.proto [libprotobuf WARNING google/protobuf/compiler/parser.cc:547] No syntax specified for the proto file: io.proto. Please use 'syntax = "proto2";' or 'syntax = "proto3";' to specify a syntax version. (Defaulted to proto2 syntax.) [ 7%] Copying Protobuf headers [ 7%] Built target copy_protobuf [ 8%] Building NVCC (Device) object src/CMakeFiles/cuda_compile_1.dir/core/tensor/cuda_compile_1_generated_math_kernel.cu.o Scanning dependencies of target singa_objects [ 9%] Building CXX object src/CMakeFiles/singa_objects.dir/caffe.pb.cc.o Already on 'master' Your branch is up-to-date with 'origin/master'. [ 10%] No pat
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r309329911 ## File path: src/CMakeLists.txt ## @@ -36,6 +36,9 @@ AUX_SOURCE_DIRECTORY(core/scheduler core_source) AUX_SOURCE_DIRECTORY(core/tensor core_source) LIST(APPEND singa_sources ${core_source}) Review comment: I also updated some files to include USE_DIST; see the following grep result on USE_DIST: ubuntu@ip-172-31-18-113:~/incubator-singa$ git grep USE_DIST CMakeLists.txt:OPTION(USE_DIST "Use nccl distributed module" OFF) cmake/Dependencies.cmake:IF(USE_DIST) cmake/Templates/singa_config.h.in:#cmakedefine USE_DIST include/singa/dist/communicator.h:#ifdef USE_DIST include/singa/dist/communicator.h:#endif // USE_DIST src/CMakeLists.txt:IF (USE_DIST) src/CMakeLists.txt:ENDIF (USE_DIST) src/api/config.i:#define USE_DIST 0 src/api/config.i.in:#cmakedefine01 USE_DIST src/api/dist_communicator.i:#if USE_DIST src/api/dist_communicator.i:#endif // USE_DIST src/dist/communicator.cc:#ifdef USE_DIST src/dist/communicator.cc:#endif // USE_DIST Note that the default is OFF if we do not set -DUSE_DIST=ON. The test was on version 1.2, although I set the displayed value in CMakeLists to be version 2.0. I will still need to test the dist module on singa version 2.0 and add partitioning of the dataset according to MPI rank, etc. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
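A rough sketch of the planned dataset partitioning by MPI rank is below, assuming each process receives a contiguous, equal-sized shard; the function name is hypothetical, and the rank and world_size values would come from MPI or the distributed communicator (accessor names are assumptions, not the actual API).
```python
import numpy as np

def partition_by_rank(x, y, rank, world_size):
    """Give each MPI process a contiguous, equally sized shard of the training data."""
    samples_per_rank = x.shape[0] // world_size   # remainder samples are dropped for simplicity
    start = rank * samples_per_rank
    return x[start:start + samples_per_rank], y[start:start + samples_per_rank]

# Example with dummy data; in real training rank/world_size would come from MPI or the communicator.
train_x = np.zeros((50000, 3, 32, 32), dtype=np.float32)
train_y = np.zeros(50000, dtype=np.int32)
x_shard, y_shard = partition_by_rank(train_x, train_y, rank=0, world_size=8)
print(x_shard.shape)   # (6250, 3, 32, 32)
```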
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r309144703 ## File path: src/CMakeLists.txt ## @@ -36,6 +36,9 @@ AUX_SOURCE_DIRECTORY(core/scheduler core_source) AUX_SOURCE_DIRECTORY(core/tensor core_source) LIST(APPEND singa_sources ${core_source}) Review comment: Changed the files cmake/Dependencies.cmake and src/CMakeLists.txt. You can use cmake -DUSE_DIST=ON to turn on the distributed module. However, there are some bugs (mainly segmentation faults) if I add #ifdef USE_DIST in the files communicator.h and communicator.cc. I will update the other files as well (e.g. #cmakedefine and #if USE_DIST etc. in many files) once I have removed the bug. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r305247134 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: On our server (at ncrg), I created a new Anaconda Python 3.6 environment and installed singa 2.0 using "conda install -c nusdbsystem -c conda-forge singa=2.0.0=cudnn7.3.1_cuda10.0_py36". It passed the test: python -c "from singa import tensor". Also, it passed the old optimizer example: incubator-singa/example/cifar10/train.py can run and train successfully. However, incubator-singa/examples/autograd/resnet.py cannot run; the output is: Start intialization 0%| | 0/200 [00:00 x = model(tx) File "examples/autograd/resnet.py", line 155, in __call__ x = self.conv1(x) File "/home/dcsysh/anaconda3/envs/singa2/lib/python3.6/site-packages/singa/autograd.py", line 939, in __call__ self.device_check(x, self.W, self.b) File "/home/dcsysh/anaconda3/envs/singa2/lib/python3.6/site-packages/singa/autograd.py", line 656, in device_check if var.device.id() != x_dev_id: AttributeError: 'NoneType' object has no attribute 'device' This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services