[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module
URL: https://github.com/apache/incubator-singa/pull/468#discussion_r320544707

## File path: src/dist/communicator.cc
## @@ -0,0 +1,143 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one

Review comment:
Sorry, I will also need to modify the CMakeLists.txt.
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module
URL: https://github.com/apache/incubator-singa/pull/468#discussion_r320542607

## File path: src/dist/communicator.cc
## @@ -0,0 +1,143 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one

Review comment:
I have moved both communicator.cc and communicator.h to the io folders.
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module
URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317937233

## File path: examples/autograd/mnist_dist.py
## @@ -0,0 +1,251 @@
+#

Review comment:
I have modified mnist_cnn.py and mnist_dist.py:
1. The model construction, data preprocessing and training code are in mnist_cnn.py.
2. mnist_dist.py imports mnist_cnn's functions and passes the distributed optimizer into train_mnist_cnn() to conduct distributed training (requires MPI); a sketch of this split appears after the log below.
3. download_mnist.py is added in the same directory and downloads the dataset before training. It is separated out from the training code so that different processes do not download the data at the same time.

Here is the log of running the code:
```
ubuntu@ip-172-31-21-218:~/incubator-singa/examples/autograd$ python3 download_mnist.py
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
ubuntu@ip-172-31-21-218:~/incubator-singa/examples/autograd$ python3 mnist_cnn.py
Starting Epoch 0: Training loss = 586.417175, training accuracy = 0.792840
Evaluation accuracy = 0.940104, Elapsed Time = 5.638494s
Starting Epoch 1: Training loss = 235.360107, training accuracy = 0.922292
Evaluation accuracy = 0.955429, Elapsed Time = 5.563161s
Starting Epoch 2: Training loss = 170.056442, training accuracy = 0.943270
Evaluation accuracy = 0.963942, Elapsed Time = 5.579273s
Starting Epoch 3: Training loss = 135.514252, training accuracy = 0.954476
Evaluation accuracy = 0.967248, Elapsed Time = 5.562721s
Starting Epoch 4: Training loss = 116.975700, training accuracy = 0.960812
Evaluation accuracy = 0.978265, Elapsed Time = 5.583826s
Starting Epoch 5: Training loss = 103.893723, training accuracy = 0.965065
Evaluation accuracy = 0.982372, Elapsed Time = 5.585272s
Starting Epoch 6: Training loss = 95.044586, training accuracy = 0.967266
Evaluation accuracy = 0.981671, Elapsed Time = 5.580424s
Starting Epoch 7: Training loss = 89.102654, training accuracy = 0.971118
Evaluation accuracy = 0.980268, Elapsed Time = 5.583646s
Starting Epoch 8: Training loss = 80.395744, training accuracy = 0.972969
Evaluation accuracy = 0.983273, Elapsed Time = 5.600029s
Starting Epoch 9: Training loss = 78.355209, training accuracy = 0.973119
Evaluation accuracy = 0.979267, Elapsed Time = 5.587740s
ubuntu@ip-172-31-21-218:~/incubator-singa/examples/autograd$ /home/ubuntu/mpich-3.3/build/bin/mpiexec --hostfile host_file python3 mnist_dist.py
Starting Epoch 0: Training loss = 781.167480, training accuracy = 0.719017
Evaluation accuracy = 0.918586, Elapsed Time = 1.255623s
Starting Epoch 1: Training loss = 259.223297, training accuracy = 0.912276
Evaluation accuracy = 0.950863, Elapsed Time = 1.216926s
Starting Epoch 2: Training loss = 179.333084, training accuracy = 0.940605
Evaluation accuracy = 0.968030, Elapsed Time = 1.206751s
Starting Epoch 3: Training loss = 137.840988, training accuracy = 0.954243
Evaluation accuracy = 0.975946, Elapsed Time = 1.202503s
Starting Epoch 4: Training loss = 119.743629, training accuracy = 0.959836
Evaluation accuracy = 0.973581, Elapsed Time = 1.208274s
Starting Epoch 5: Training loss = 102.545876, training accuracy = 0.965595
Evaluation accuracy = 0.980572, Elapsed Time = 1.205539s
Starting Epoch 6: Training loss = 93.249054, training accuracy = 0.969401
Evaluation accuracy = 0.978207, Elapsed Time = 1.203708s
Starting Epoch 7: Training loss = 84.66, training accuracy = 0.971104
Evaluation accuracy = 0.980777, Elapsed Time = 1.206410s
Starting Epoch 8: Training loss = 77.996643, training accuracy = 0.973691
Evaluation accuracy = 0.985609, Elapsed Time = 1.207295s
Starting Epoch 9: Training loss = 75.888077, training accuracy = 0.974442
Evaluation accuracy = 0.982319, Elapsed Time = 1.203693s
```
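The sketch below is a minimal illustration of the split described above: mnist_cnn.py owns the model, data and training loop, while mnist_dist.py only constructs a distributed optimizer and hands it to train_mnist_cnn(). The optimizer module and class names (singa.opt.SGD, singa.opt.DistOpt) and the keyword argument are assumptions for illustration, not the confirmed API of this PR.
```
# mnist_dist.py -- illustrative sketch only; opt.SGD / opt.DistOpt and the
# train_mnist_cnn() signature are assumed, not taken from the actual PR code.
from singa import opt

from mnist_cnn import train_mnist_cnn  # model, data and training loop live in mnist_cnn.py

if __name__ == "__main__":
    sgd = opt.SGD(lr=0.005, momentum=0.9)  # plain single-process optimizer
    dist_sgd = opt.DistOpt(sgd)            # assumed MPI-backed wrapper (one process per device)
    # train_mnist_cnn() is assumed to accept an optimizer and, when it is a
    # distributed one, use its rank/world size to partition the data.
    train_mnist_cnn(sgd=dist_sgd)
```
Run under mpiexec (as in the log above), each MPI process would build the same model and let the distributed optimizer average gradients across processes.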
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module
URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317938576

## File path: examples/autograd/mnist_dist.py
## @@ -0,0 +1,251 @@
+#

Review comment:
I have already followed up on all four comments recently added in the dist_new PR, so I will continue the work on the distributed training code without MPI.
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module
URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317889630

## File path: python/singa/autograd.py
## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters):
 class _BatchNorm2d(Operation):
-    def __init__(self, handle, name=None):
+    def __init__(self, handle, running_mean, running_var, name=None):
         super(_BatchNorm2d, self).__init__(name)
         self.handle = handle
+        self.running_mean = running_mean.data
+        self.running_var = running_var.data
-    def forward(self, x, scale, bias, running_mean, running_var):
-        self.running_mean = running_mean
-        self.running_var = running_var
+    def forward(self, x, scale, bias):
         if training:
             if isinstance(self.handle, singa.CudnnBatchNormHandle):
                 y, mean, var = singa.GpuBatchNormForwardTraining(
-                    self.handle, x, scale, bias, running_mean, running_var
+                    self.handle, x, scale, bias, self.running_mean, self.running_var

Review comment:
The following is the first few epochs of ResNet-18 training on CIFAR10 using the CPU. The CPU is slow, so I trained only a few epochs. The training loss fell from 2233.4 to 782.5:
```
ubuntu@ip-172-31-16-147:~/incubator-singa/examples/autograd$ python3 resnetcifarcpu.py
Loading data file cifar-10-batches-py/data_batch_1
Loading data file cifar-10-batches-py/data_batch_2
Loading data file cifar-10-batches-py/data_batch_3
Loading data file cifar-10-batches-py/data_batch_4
Loading data file cifar-10-batches-py/data_batch_5
Loading data file cifar-10-batches-py/test_batch
Start intialization
Epoch=0: 100%|| 1562/1562 [2:09:57<00:00, 5.03s/it]
Training loss = 2233.394769, training accuracy = 0.490297
Test accuracy = 0.636218
Epoch=1: 100%|███| 1562/1562 [2:10:00<00:00, 4.98s/it]
Training loss = 1474.432049, training accuracy = 0.33
Test accuracy = 0.678986
Epoch=2: 100%|███| 1562/1562 [2:10:11<00:00, 5.00s/it]
Training loss = 1163.035850, training accuracy = 0.741717
Test accuracy = 0.738181
Epoch=3: 100%|███| 1562/1562 [2:10:31<00:00, 5.03s/it]
Training loss = 979.977119, training accuracy = 0.782570
Test accuracy = 0.800581
Epoch=4: 100%|███| 1562/1562 [2:10:10<00:00, 4.98s/it]
Training loss = 872.811802, training accuracy = 0.806098
Test accuracy = 0.813902
Epoch=5: 100%|███| 1562/1562 [2:10:05<00:00, 4.99s/it]
Training loss = 782.525783, training accuracy = 0.826144
Test accuracy = 0.832232
```
The training loss decreases normally, so the CPU batch norm appears to be working.
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module
URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317527518

## File path: python/singa/autograd.py
## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters):
 class _BatchNorm2d(Operation):
-    def __init__(self, handle, name=None):
+    def __init__(self, handle, running_mean, running_var, name=None):
         super(_BatchNorm2d, self).__init__(name)
         self.handle = handle
+        self.running_mean = running_mean.data
+        self.running_var = running_var.data
-    def forward(self, x, scale, bias, running_mean, running_var):
-        self.running_mean = running_mean
-        self.running_var = running_var
+    def forward(self, x, scale, bias):
         if training:
             if isinstance(self.handle, singa.CudnnBatchNormHandle):
                 y, mean, var = singa.GpuBatchNormForwardTraining(
-                    self.handle, x, scale, bias, running_mean, running_var
+                    self.handle, x, scale, bias, self.running_mean, self.running_var

Review comment:
After adding two batchnorm layers to the CNN for MNIST, the CPU training looks okay; the loss fell from 608.7 to 91.5:
```
ubuntu@ip-172-31-36-94:~/incubator-singa/examples/autograd$ python3 mnist_dist_bn.py
Starting Epoch 0: Training loss = 608.656677, training accuracy = 0.785035
Evaluation accuracy = 0.900541, Elapsed Time = 130.748269s
Starting Epoch 1: Training loss = 259.606445, training accuracy = 0.911720
Evaluation accuracy = 0.951222, Elapsed Time = 129.687239s
Starting Epoch 2: Training loss = 180.270645, training accuracy = 0.938917
Evaluation accuracy = 0.965745, Elapsed Time = 129.715867s
Starting Epoch 3: Training loss = 146.975281, training accuracy = 0.950607
Evaluation accuracy = 0.961138, Elapsed Time = 129.695524s
Starting Epoch 4: Training loss = 130.942749, training accuracy = 0.955576
Evaluation accuracy = 0.966446, Elapsed Time = 129.814190s
Starting Epoch 5: Training loss = 116.057938, training accuracy = 0.960846
Evaluation accuracy = 0.964844, Elapsed Time = 129.697126s
Starting Epoch 6: Training loss = 105.867195, training accuracy = 0.963914
Evaluation accuracy = 0.973758, Elapsed Time = 129.782990s
Starting Epoch 7: Training loss = 102.414818, training accuracy = 0.965498
Evaluation accuracy = 0.973357, Elapsed Time = 129.847014s
Starting Epoch 8: Training loss = 95.194695, training accuracy = 0.968433
Evaluation accuracy = 0.973658, Elapsed Time = 129.762709s
Starting Epoch 9: Training loss = 91.524719, training accuracy = 0.969717
Evaluation accuracy = 0.975160, Elapsed Time = 129.581387s
```
I am still waiting for the ResNet-18 result on CIFAR10.
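As a rough illustration of the change described above (adding two batchnorm layers to the MNIST CNN), the sketch below shows where BatchNorm2d layers would sit between the convolutions and their activations. The layer constructors, channel sizes and helper functions are assumptions based on the autograd module, not the exact contents of mnist_dist_bn.py.
```
# Illustrative only: layer signatures and sizes are assumed, not taken from
# the actual mnist_dist_bn.py example.
from singa import autograd

class CNNWithBN:
    def __init__(self):
        self.conv1 = autograd.Conv2d(1, 32, 3, padding=1)
        self.bn1 = autograd.BatchNorm2d(32)      # first added batchnorm layer
        self.conv2 = autograd.Conv2d(32, 32, 3, padding=1)
        self.bn2 = autograd.BatchNorm2d(32)      # second added batchnorm layer
        self.pooling = autograd.MaxPool2d(2, 2, padding=0)
        self.linear = autograd.Linear(32 * 14 * 14, 10)

    def forward(self, x):
        # conv -> batchnorm -> relu for both blocks
        y = autograd.relu(self.bn1(self.conv1(x)))
        y = autograd.relu(self.bn2(self.conv2(y)))
        y = self.pooling(y)                      # 28x28 -> 14x14 feature maps
        y = autograd.flatten(y)
        return self.linear(y)
```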
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317488906 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: There is no CpuBatchNormHandle; the handle used by the CPU is BatchNormHandle, so I used BatchNormHandle instead of CudnnBatchNormHandle. Now both CPU and GPU can run, but I used "type" instead of "isinstance" because CudnnBatchNormHandle is a subclass of BatchNormHandle, so a CudnnBatchNormHandle is considered an instance of BatchNormHandle by isinstance(). Moreover, I have debugged the CPU batchnorm in two aspects: (i) the forward function of the CPU batchnorm needs the running mean and var to be initialized, otherwise it raises an error about accessing a non-initialized block when it reads them; I fixed this by initializing the mean to 0 and the var to 1. (ii) the backward function CpuBatchNormBackward does not exist, but there is another function called CpuBatchNormBackwardx (in src/model/operation/batchnorm.cc), so I use that function and provide all the necessary arguments. The program can run now, and I am doing a brief real-dataset training test on AWS (using c5.4xlarge with 16 CPU cores). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
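To make the dispatch choice above concrete, here is a minimal, self-contained Python sketch (with hypothetical stand-in classes, not SINGA's SWIG bindings) of why an exact type() check is used: because the cuDNN handle subclasses the generic handle, an isinstance() check against the base class cannot tell the CPU handle apart from the GPU one.
```
# Hypothetical stand-ins for the SWIG-generated handle classes.
class BatchNormHandle:                         # plays the role of the CPU handle
    pass

class CudnnBatchNormHandle(BatchNormHandle):   # plays the role of the GPU handle
    pass

def pick_kernel(handle):
    # Exact-type dispatch: the base-class branch is only taken for the CPU handle.
    if type(handle) is CudnnBatchNormHandle:
        return "GpuBatchNormForwardTraining"
    if type(handle) is BatchNormHandle:
        return "CpuBatchNormForwardTraining"
    raise TypeError("unknown handle")

print(pick_kernel(CudnnBatchNormHandle()))   # -> GpuBatchNormForwardTraining
print(pick_kernel(BatchNormHandle()))        # -> CpuBatchNormForwardTraining

# With isinstance(), the GPU handle also matches the base class, so a branch
# testing isinstance(handle, BatchNormHandle) first would swallow it:
assert isinstance(CudnnBatchNormHandle(), BatchNormHandle)
```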
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317527518 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Adding two batchnorm layers in the CNN for the MNIST, the cpu training is looking okay, loss reduced from 608.7 to 91.5 ``` ubuntu@ip-172-31-36-94:~/incubator-singa/examples/autograd$ python3 mnist_dist_bn.py Starting Epoch 0: Training loss = 608.656677, training accuracy = 0.785035 Evaluation accuracy = 0.900541, Elapsed Time = 130.748269s Starting Epoch 1: Training loss = 259.606445, training accuracy = 0.911720 Evaluation accuracy = 0.951222, Elapsed Time = 129.687239s Starting Epoch 2: Training loss = 180.270645, training accuracy = 0.938917 Evaluation accuracy = 0.965745, Elapsed Time = 129.715867s Starting Epoch 3: Training loss = 146.975281, training accuracy = 0.950607 Evaluation accuracy = 0.961138, Elapsed Time = 129.695524s Starting Epoch 4: Training loss = 130.942749, training accuracy = 0.955576 Evaluation accuracy = 0.966446, Elapsed Time = 129.814190s Starting Epoch 5: Training loss = 116.057938, training accuracy = 0.960846 Evaluation accuracy = 0.964844, Elapsed Time = 129.697126s Starting Epoch 6: Training loss = 105.867195, training accuracy = 0.963914 Evaluation accuracy = 0.973758, Elapsed Time = 129.782990s Starting Epoch 7: Training loss = 102.414818, training accuracy = 0.965498 Evaluation accuracy = 0.973357, Elapsed Time = 129.847014s Starting Epoch 8: Training loss = 95.194695, training accuracy = 0.968433 Evaluation accuracy = 0.973658, Elapsed Time = 129.762709s Starting Epoch 9: Training loss = 91.524719, training accuracy = 0.969717 Evaluation accuracy = 0.975160, Elapsed Time = 129.581387s ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317527518 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Adding two batchnorm layers in the CNN for the MNIST, the cpu training is looking okay, loss reduce from 608.7 to 91.5 ``` ubuntu@ip-172-31-36-94:~/incubator-singa/examples/autograd$ python3 mnist_dist_bn.py Starting Epoch 0: Training loss = 608.656677, training accuracy = 0.785035 Evaluation accuracy = 0.900541, Elapsed Time = 130.748269s Starting Epoch 1: Training loss = 259.606445, training accuracy = 0.911720 Evaluation accuracy = 0.951222, Elapsed Time = 129.687239s Starting Epoch 2: Training loss = 180.270645, training accuracy = 0.938917 Evaluation accuracy = 0.965745, Elapsed Time = 129.715867s Starting Epoch 3: Training loss = 146.975281, training accuracy = 0.950607 Evaluation accuracy = 0.961138, Elapsed Time = 129.695524s Starting Epoch 4: Training loss = 130.942749, training accuracy = 0.955576 Evaluation accuracy = 0.966446, Elapsed Time = 129.814190s Starting Epoch 5: Training loss = 116.057938, training accuracy = 0.960846 Evaluation accuracy = 0.964844, Elapsed Time = 129.697126s Starting Epoch 6: Training loss = 105.867195, training accuracy = 0.963914 Evaluation accuracy = 0.973758, Elapsed Time = 129.782990s Starting Epoch 7: Training loss = 102.414818, training accuracy = 0.965498 Evaluation accuracy = 0.973357, Elapsed Time = 129.847014s Starting Epoch 8: Training loss = 95.194695, training accuracy = 0.968433 Evaluation accuracy = 0.973658, Elapsed Time = 129.762709s Starting Epoch 9: Training loss = 91.524719, training accuracy = 0.969717 Evaluation accuracy = 0.975160, Elapsed Time = 129.581387s ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317527518 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Adding two batchnorm layers in the CNN for the MNIST, the cpu training is looking okay, ``` ubuntu@ip-172-31-36-94:~/incubator-singa/examples/autograd$ python3 mnist_dist_bn.py Starting Epoch 0: Training loss = 608.656677, training accuracy = 0.785035 Evaluation accuracy = 0.900541, Elapsed Time = 130.748269s Starting Epoch 1: Training loss = 259.606445, training accuracy = 0.911720 Evaluation accuracy = 0.951222, Elapsed Time = 129.687239s Starting Epoch 2: Training loss = 180.270645, training accuracy = 0.938917 Evaluation accuracy = 0.965745, Elapsed Time = 129.715867s Starting Epoch 3: Training loss = 146.975281, training accuracy = 0.950607 Evaluation accuracy = 0.961138, Elapsed Time = 129.695524s Starting Epoch 4: Training loss = 130.942749, training accuracy = 0.955576 Evaluation accuracy = 0.966446, Elapsed Time = 129.814190s Starting Epoch 5: Training loss = 116.057938, training accuracy = 0.960846 Evaluation accuracy = 0.964844, Elapsed Time = 129.697126s Starting Epoch 6: Training loss = 105.867195, training accuracy = 0.963914 Evaluation accuracy = 0.973758, Elapsed Time = 129.782990s Starting Epoch 7: Training loss = 102.414818, training accuracy = 0.965498 Evaluation accuracy = 0.973357, Elapsed Time = 129.847014s Starting Epoch 8: Training loss = 95.194695, training accuracy = 0.968433 Evaluation accuracy = 0.973658, Elapsed Time = 129.762709s Starting Epoch 9: Training loss = 91.524719, training accuracy = 0.969717 Evaluation accuracy = 0.975160, Elapsed Time = 129.581387s ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317527518 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Still testing with real dataset using cpu Adding two batchnorm layers in the CNN for the MNIST, the cpu training is looking okay, ``` ubuntu@ip-172-31-36-94:~/incubator-singa/examples/autograd$ python3 mnist_dist_bn.py Starting Epoch 0: Training loss = 608.656677, training accuracy = 0.785035 Evaluation accuracy = 0.900541, Elapsed Time = 130.748269s Starting Epoch 1: Training loss = 259.606445, training accuracy = 0.911720 Evaluation accuracy = 0.951222, Elapsed Time = 129.687239s Starting Epoch 2: Training loss = 180.270645, training accuracy = 0.938917 Evaluation accuracy = 0.965745, Elapsed Time = 129.715867s Starting Epoch 3: Training loss = 146.975281, training accuracy = 0.950607 Evaluation accuracy = 0.961138, Elapsed Time = 129.695524s Starting Epoch 4: Training loss = 130.942749, training accuracy = 0.955576 Evaluation accuracy = 0.966446, Elapsed Time = 129.814190s Starting Epoch 5: Training loss = 116.057938, training accuracy = 0.960846 Evaluation accuracy = 0.964844, Elapsed Time = 129.697126s Starting Epoch 6: Training loss = 105.867195, training accuracy = 0.963914 Evaluation accuracy = 0.973758, Elapsed Time = 129.782990s Starting Epoch 7: Training loss = 102.414818, training accuracy = 0.965498 Evaluation accuracy = 0.973357, Elapsed Time = 129.847014s Starting Epoch 8: Training loss = 95.194695, training accuracy = 0.968433 Evaluation accuracy = 0.973658, Elapsed Time = 129.762709s Starting Epoch 9: Training loss = 91.524719, training accuracy = 0.969717 Evaluation accuracy = 0.975160, Elapsed Time = 129.581387s ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317527518 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Still testing with real dataset using cpu Adding two batchnorm layers in the CNN for the MNIST, the cpu training is looking okay, ``` ubuntu@ip-172-31-31-187:~/incubator-singa/examples/autograd$ python3 mnist_dist_bn.py Starting Epoch 0: Training loss = 1115.447021, training accuracy = 0.592116 Evaluation accuracy = 0.649239, Elapsed Time = 91.413475s Starting Epoch 1: Training loss = 615.564209, training accuracy = 0.783968 Evaluation accuracy = 0.878105, Elapsed Time = 91.053461s Starting Epoch 2: Training loss = 444.550018, training accuracy = 0.848286 Evaluation accuracy = 0.900240, Elapsed Time = 91.064094s Starting Epoch 3: Training loss = 333.629150, training accuracy = 0.886690 Evaluation accuracy = 0.857772, Elapsed Time = 90.935190s Starting Epoch 4: Training loss = 289.389832, training accuracy = 0.902648 Evaluation accuracy = 0.913462, Elapsed Time = 91.152710s Starting Epoch 5: Training loss = 263.009583, training accuracy = 0.910836 Evaluation accuracy = 0.922877, Elapsed Time = 91.171680s Starting Epoch 6: Training loss = 238.859818, training accuracy = 0.918957 Evaluation accuracy = 0.933794, Elapsed Time = 91.016456s Starting Epoch 7: Training loss = 215.822647, training accuracy = 0.927428 Evaluation accuracy = 0.946615, Elapsed Time = 90.870825s Starting Epoch 8: Training loss = 202.828430, training accuracy = 0.932080 Evaluation accuracy = 0.948017, Elapsed Time = 91.014656s Starting Epoch 9: Training loss = 190.810226, training accuracy = 0.935899 Evaluation accuracy = 0.949820, Elapsed Time = 91.270044s ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317527518 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Still testing with real dataset using cpu Adding a batchnorm layer in the CNN for the MNIST, the cpu training is looking okay, ``` ubuntu@ip-172-31-31-187:~/incubator-singa/examples/autograd$ python3 mnist_dist_bn.py Starting Epoch 0: Training loss = 1115.447021, training accuracy = 0.592116 Evaluation accuracy = 0.649239, Elapsed Time = 91.413475s Starting Epoch 1: Training loss = 615.564209, training accuracy = 0.783968 Evaluation accuracy = 0.878105, Elapsed Time = 91.053461s Starting Epoch 2: Training loss = 444.550018, training accuracy = 0.848286 Evaluation accuracy = 0.900240, Elapsed Time = 91.064094s Starting Epoch 3: Training loss = 333.629150, training accuracy = 0.886690 Evaluation accuracy = 0.857772, Elapsed Time = 90.935190s Starting Epoch 4: Training loss = 289.389832, training accuracy = 0.902648 Evaluation accuracy = 0.913462, Elapsed Time = 91.152710s Starting Epoch 5: Training loss = 263.009583, training accuracy = 0.910836 Evaluation accuracy = 0.922877, Elapsed Time = 91.171680s Starting Epoch 6: Training loss = 238.859818, training accuracy = 0.918957 Evaluation accuracy = 0.933794, Elapsed Time = 91.016456s Starting Epoch 7: Training loss = 215.822647, training accuracy = 0.927428 Evaluation accuracy = 0.946615, Elapsed Time = 90.870825s Starting Epoch 8: Training loss = 202.828430, training accuracy = 0.932080 Evaluation accuracy = 0.948017, Elapsed Time = 91.014656s Starting Epoch 9: Training loss = 190.810226, training accuracy = 0.935899 Evaluation accuracy = 0.949820, Elapsed Time = 91.270044s ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317527518 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Still testing with real dataset using cpu Adding a batchnorm layer in the CNN for the MNIST, the training is okay, ``` ubuntu@ip-172-31-31-187:~/incubator-singa/examples/autograd$ python3 mnist_dist_bn.py Starting Epoch 0: Training loss = 1115.447021, training accuracy = 0.592116 Evaluation accuracy = 0.649239, Elapsed Time = 91.413475s Starting Epoch 1: Training loss = 615.564209, training accuracy = 0.783968 Evaluation accuracy = 0.878105, Elapsed Time = 91.053461s Starting Epoch 2: Training loss = 444.550018, training accuracy = 0.848286 Evaluation accuracy = 0.900240, Elapsed Time = 91.064094s Starting Epoch 3: Training loss = 333.629150, training accuracy = 0.886690 Evaluation accuracy = 0.857772, Elapsed Time = 90.935190s Starting Epoch 4: Training loss = 289.389832, training accuracy = 0.902648 Evaluation accuracy = 0.913462, Elapsed Time = 91.152710s Starting Epoch 5: Training loss = 263.009583, training accuracy = 0.910836 Evaluation accuracy = 0.922877, Elapsed Time = 91.171680s Starting Epoch 6: Training loss = 238.859818, training accuracy = 0.918957 Evaluation accuracy = 0.933794, Elapsed Time = 91.016456s Starting Epoch 7: Training loss = 215.822647, training accuracy = 0.927428 Evaluation accuracy = 0.946615, Elapsed Time = 90.870825s Starting Epoch 8: Training loss = 202.828430, training accuracy = 0.932080 Evaluation accuracy = 0.948017, Elapsed Time = 91.014656s Starting Epoch 9: Training loss = 190.810226, training accuracy = 0.935899 Evaluation accuracy = 0.949820, Elapsed Time = 91.270044s ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317527518 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Still testing with real dataset using cpu This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317527518 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: However, when using the CPU the cifar10 training loss does not decrease; I will need to debug this further. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317488906 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Cannot find CpuBatchNormHandle while the handle used by cpu is BatchNormHandle. So I have used BatchNormHandle instead of CudnnBatchNormHandle. Now Both CPU and GPU can run, but I used "type" instead of "isinstance" because CudnnBatchNormHandle is a subclass of BatchNormHandle, so CudnnBatchNormHandle is considered as an instance of BatchNormHandle in isinstance(). Moreover, I have debugged the cpu batchnorm in the following two aspects: (i) the forward function of cpu batchnorm needed the initialization of running mean and var, otherwise when it access the block it returns error as accessing an non-initialized block. I fixed this by initializating the mean by 0 and the var by 1. (ii) the backward function CpuBatchNormBackward does not exist, but there is another function called CpuBatchNormBackwardx (in the directory src/model/operation/batchnorm.cc). So I use this function by providing all the necessary arguments. The program can run now, but I am doing a brief cifar10 training test on AWS (using c5.x4large with 16 cpu cores). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317488906 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Cannot find CpuBatchNormHandle while the handle used by cpu is BatchNormHandle. So I have used BatchNormHandle instead of CudnnBatchNormHandle. Now Both CPU and GPU can run, but I used "type" instead of "isinstance" because CudnnBatchNormHandle is a subclass of BatchNormHandle, so CudnnBatchNormHandle is considered as an instance of BatchNormHandle in isinstance(). For more explanation, See http://www.runoob.com/python/python-func-isinstance.html Moreover, I have debugged the cpu batchnorm in the following two aspects: (i) the forward function of cpu batchnorm needed the initialization of running mean and var, otherwise when it access the block it returns error as accessing an non-initialized block. I fixed this by initializating the mean by 0 and the var by 1. (ii) the backward function CpuBatchNormBackward does not exist, but there is another function called CpuBatchNormBackwardx (in the directory src/model/operation/batchnorm.cc). So I use this function by providing all the necessary arguments. The program can run now, but I am doing a brief cifar10 training test on AWS (using c5.x4large with 16 cpu cores). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317527518 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: added the cpu test of conv2d and batchnorm2d in test_operation.py and passed, but still waiting for the cifar10 training accuracy test This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317488906 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Now Both CPU and GPU can run, but I used "type" instead of "isinstance" because CudnnBatchNormHandle is a subclass of BatchNormHandle, so CudnnBatchNormHandle is considered as an instance of BatchNormHandle in isinstance(). For more explanation, See http://www.runoob.com/python/python-func-isinstance.html Moreover, I have debugged the cpu batchnorm in the following two aspects: (i) the forward function of cpu batchnorm needed the initialization of running mean and var, otherwise when it access the block it returns error as accessing an non-initialized block. I fixed this by initializating the mean by 0 and the var by 1. (ii) the backward function CpuBatchNormBackward does not exist, but there is another function called CpuBatchNormBackwardx (in the directory src/model/operation/batchnorm.cc). So I use this function by providing all the necessary arguments. The program can run now, but I am doing a brief cifar10 training test on AWS (using c5.x4large with 16 cpu cores). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317488906 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: Now both CPU and GPU can run, but I used "type" instead of "isinstance" because CudnnBatchNormHandle is a subclass of BatchNormHandle, so a CudnnBatchNormHandle is considered an instance of BatchNormHandle by isinstance(). For more explanation, see http://www.runoob.com/python/python-func-isinstance.html Moreover, I have debugged the CPU batchnorm in two aspects: (i) the forward function of the CPU batchnorm needs the running mean and var to be initialized, otherwise it raises an error about accessing a non-initialized block when it reads them; I fixed this by initializing the mean to 0 and the var to 1. (ii) the backward function CpuBatchNormBackward does not exist, but there is another function called CpuBatchNormBackwardx (in src/model/operation/batchnorm.cc), so I use that function and provide all the necessary arguments. The program can run now, and I am doing a brief cifar10 training test on AWS (using c5.4xlarge with 16 CPU cores). This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
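For point (i) above, a hedged numpy sketch of training-mode batch norm (not SINGA's kernel; the momentum convention here is an assumption for illustration) shows why the running statistics must exist before the first forward call: they are read and updated in the same training step, so an uninitialized buffer would be accessed immediately. Starting the mean at 0 and the variance at 1 corresponds to identity statistics.
```
import numpy as np

def batchnorm_train_step(x, running_mean, running_var, momentum=0.9, eps=1e-5):
    """x: (N, C) activations; running_mean/running_var: per-channel arrays, updated in place."""
    batch_mean = x.mean(axis=0)
    batch_var = x.var(axis=0)
    # The running buffers are read and written on every training step.
    running_mean[:] = momentum * running_mean + (1.0 - momentum) * batch_mean
    running_var[:] = momentum * running_var + (1.0 - momentum) * batch_var
    return (x - batch_mean) / np.sqrt(batch_var + eps)

channels = 4
running_mean = np.zeros(channels, dtype=np.float32)  # initialized to 0
running_var = np.ones(channels, dtype=np.float32)    # initialized to 1
y = batchnorm_train_step(np.random.randn(8, channels).astype(np.float32),
                         running_mean, running_var)
```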
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317382891 ## File path: examples/autograd/mnist_dist.py ## @@ -0,0 +1,251 @@ +# Review comment: I see. I will modify the code. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317382900 ## File path: python/singa/autograd.py ## @@ -1286,25 +1287,26 @@ def set_params(self, **parameters): class _BatchNorm2d(Operation): -def __init__(self, handle, name=None): +def __init__(self, handle, running_mean, running_var, name=None): super(_BatchNorm2d, self).__init__(name) self.handle = handle +self.running_mean = running_mean.data +self.running_var = running_var.data -def forward(self, x, scale, bias, running_mean, running_var): -self.running_mean = running_mean -self.running_var = running_var +def forward(self, x, scale, bias): if training: if isinstance(self.handle, singa.CudnnBatchNormHandle): y, mean, var = singa.GpuBatchNormForwardTraining( -self.handle, x, scale, bias, running_mean, running_var +self.handle, x, scale, bias, self.running_mean, self.running_var Review comment: I see. I will modify the codes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317382034 ## File path: CMakeLists.txt ## @@ -30,7 +30,7 @@ LIST(APPEND CMAKE_MODULE_PATH ${PROJECT_SOURCE_DIR}/cmake/Thirdparty) # Flags IF(UNIX) -SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -g -O2 -fPIC -Wall -pthread") +SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -std=c++11 -O3 -fPIC -Wall -pthread") Review comment: ok, changed This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r317381887 ## File path: src/api/config.i ## @@ -0,0 +1,34 @@ +// Licensed to the Apache Software Foundation (ASF) under one Review comment: Yes, this is generated directly from "config.i.in" by cmake. I have deleted this file "config.i" This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r311056639 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: Updated on 6th August: I removed a bug in the commit 0616000 which concerns the number of parameters (i.e. Size() of the tensor) taken part in the all-reduce process. Then I did a 8 * K80 multi-GPUs training and evaluation test with a simple MNIST dataset on simple CNN. It reduces the training loss from 802.7 to 42.2 in about 30 Epochs: ``` Epoch=0: 100%|¦¦| 117/117 [00:01<00:00, 92.86it/s]Training loss = 802.659485, training accuracy = 0.713825 Test accuracy = 0.920025 Epoch=1: 100%|¦¦| 117/117 [00:01<00:00, 93.42it/s]Training loss = 246.589371, training accuracy = 0.916767 Test accuracy = 0.956106 Epoch=2: 100%|¦¦| 117/117 [00:01<00:00, 94.04it/s]Training loss = 175.012894, training accuracy = 0.941106 Test accuracy = 0.967208 Epoch=3: 100%|¦¦| 117/117 [00:01<00:00, 95.66it/s] Training loss = 144.684052, training accuracy = 0.951539 Test accuracy = 0.970806 Epoch=4: 100%|¦¦| 117/117 [00:01<00:00, 102.59it/s]Training loss = 120.399704, training accuracy = 0.959402 Test accuracy = 0.976049 Epoch=5: 100%|¦¦| 117/117 [00:01<00:00, 102.79it/s]Training loss = 107.832191, training accuracy = 0.963709 Test accuracy = 0.975946 Epoch=6: 100%|¦¦| 117/117 [00:01<00:00, 102.70it/s]Training loss = 96.289490, training accuracy = 0.967014 Test accuracy = 0.979441 Epoch=7: 100%|¦¦| 117/117 [00:01<00:00, 102.34it/s]Training loss = 88.031815, training accuracy = 0.970436 Test accuracy = 0.980983 Epoch=8: 100%|¦¦| 117/117 [00:01<00:00, 101.81it/s]Training loss = 79.349884, training accuracy = 0.973090 Test accuracy = 0.980058 Epoch=9: 100%|¦¦| 117/117 [00:01<00:00, 101.82it/s]Training loss = 77.825607, training accuracy = 0.974342 Test accuracy = 0.977282 Epoch=10: 100%|¦¦| 117/117 [00:01<00:00, 101.97it/s]Training loss = 74.710297, training accuracy = 0.974576 Test accuracy = 0.983861 Epoch=11: 100%|¦¦| 117/117 [00:01<00:00, 101.98it/s]Training loss = 69.400230, training accuracy = 0.976162 Test accuracy = 0.982936 Epoch=12: 100%|¦¦| 117/117 [00:01<00:00, 102.03it/s]Training loss = 65.100449, training accuracy = 0.978148 Test accuracy = 0.983553 Epoch=13: 100%|¦¦| 117/117 [00:01<00:00, 102.17it/s]Training loss = 65.113991, training accuracy = 0.978249 Test accuracy = 0.986534 Epoch=14: 100%|¦¦| 117/117 [00:01<00:00, 101.83it/s]Training loss = 
63.065636, training accuracy = 0.978566 Test accuracy = 0.984683 Epoch=15: 100%|¦¦| 117/117 [00:01<00:00, 102.11it/s]Training loss = 58.334709, training accuracy = 0.980018 Test accuracy = 0.983758 Epoch=16: 100%|¦¦| 117/117 [00:01<00:00, 102.16it/s]Training loss = 58.280094, training accuracy = 0.980285 Test accuracy = 0.983655 Epoch=17: 100%|¦¦| 117/117 [00:01<00:00, 102.15it/s]Training loss = 53.226196, training accuracy = 0.981420 Test accuracy = 0.985197 Epoch=18: 100%|¦¦| 117/117 [00:01<00:00, 102.15it/s]Training loss = 55.968140, training accuracy = 0.980786 Test accuracy = 0.982422 Epoch=19: 100%|¦¦| 117/117 [00:01<00:00, 102.14it/s]Training loss = 52.761921, training accuracy = 0.982489 Test accuracy = 0.985814 Epoch=20: 100%|¦¦| 117/117 [00:01<00:00, 101.86it/s]Training loss = 51.989666, training accuracy = 0.982973 Test accuracy = 0.983758 Epoch=21: 100%|¦¦| 117/117 [00:01<00:00, 101.91it/s]Training loss = 52.571381, training accuracy = 0.982455 Test accuracy = 0.987973 Epoch=22: 100%|¦¦| 117/117 [00:01<00:00, 101.99it/s]Training loss
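The Size() bug mentioned at the start of this comment is not spelled out, so the following is only a hedged mpi4py/numpy sketch of the general idea behind such an all-reduce (not SINGA's NCCL communicator, and not necessarily the actual fix): when several gradient tensors are reduced together, the element count of every tensor has to be respected when packing and unpacking the buffer, and the summed result is divided by the number of workers to obtain the averaged gradient.
```
import numpy as np
from mpi4py import MPI

def fused_all_reduce(grads):
    """Average a list of numpy gradient arrays across all MPI ranks in one call."""
    comm = MPI.COMM_WORLD
    flat = np.concatenate([g.ravel() for g in grads])  # pack: length = sum of each tensor's size
    comm.Allreduce(MPI.IN_PLACE, flat, op=MPI.SUM)      # sum over workers
    flat /= comm.Get_size()                             # average
    out, offset = [], 0
    for g in grads:                                     # unpack using each tensor's element count
        out.append(flat[offset:offset + g.size].reshape(g.shape))
        offset += g.size
    return out
```
Run under mpiexec, each rank would pass its local gradients and receive the averaged ones back.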
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r310328731 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: I have trained the dist_new branch resnet (because resnet has batch norm) with cifar10 dataset using 1 GPU, and obtained 92.5% test accuracy in 100 Epochs with data augmentation. This suggest that the batch norm is in good condition (while the onnx interface of batchnorm may need to be considered but I am not sure) ``` ubuntu@ip-172-31-27-25:~/incubator-singa/examples/autograd$ python3 resnet_realdata.py Loading data file cifar-10-batches-py/data_batch_1 Loading data file cifar-10-batches-py/data_batch_2 Loading data file cifar-10-batches-py/data_batch_3 Loading data file cifar-10-batches-py/data_batch_4 Loading data file cifar-10-batches-py/data_batch_5 Loading data file cifar-10-batches-py/test_batch Start intialization Epoch=0: 100%|███| 1562/1562 [03:56<00:00, 6.61it/s] Training loss = 2927.551146, training accuracy = 0.338068 Test accuracy = 0.441306 Epoch=1: 100%|███| 1562/1562 [03:56<00:00, 6.59it/s] Training loss = 2110.360374, training accuracy = 0.511984 Test accuracy = 0.606571 Epoch=2: 100%|███| 1562/1562 [03:56<00:00, 6.61it/s] Training loss = 1658.897868, training accuracy = 0.623199 Test accuracy = 0.645232 Epoch=3: 100%|███| 1562/1562 [03:56<00:00, 6.64it/s] Training loss = 1354.082412, training accuracy = 0.694442 Test accuracy = 0.731170 Epoch=4: 100%|███| 1562/1562 [03:56<00:00, 6.63it/s] Training loss = 1155.785529, training accuracy = 0.743478 Test accuracy = 0.761318 Epoch=5: 100%|███| 1562/1562 [03:56<00:00, 6.59it/s] Training loss = 1022.750388, training accuracy = 0.773668 Test accuracy = 0.741286 Epoch=6: 100%|███| 1562/1562 [03:56<00:00, 6.62it/s] Training loss = 945.400214, training accuracy = 0.790373 Test accuracy = 0.795072 Epoch=7: 100%|███| 1562/1562 [03:56<00:00, 6.61it/s] Training loss = 840.933215, training accuracy = 0.814441 Test accuracy = 0.810096 Epoch=8: 100%|███| 1562/1562 [03:56<00:00, 6.62it/s] Training loss = 765.215148, training accuracy = 0.830566 Test accuracy = 0.807091 Epoch=9: 100%|███| 1562/1562 [03:56<00:00, 6.61it/s] Training loss = 701.153867, training accuracy = 0.845951 Test accuracy = 0.822316 Epoch=10: 100%|██| 1562/1562 [03:56<00:00, 6.63it/s] Training loss = 666.267428, training accuracy = 0.853073 Test accuracy = 0.851162 Epoch=11: 100%|██| 1562/1562 [03:56<00:00, 
6.62it/s] Training loss = 606.699607, training accuracy = 0.866817 Test accuracy = 0.770232 Epoch=12: 100%|██| 1562/1562
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r311068821 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: In additional to the above, I also did a 8 * K80 multi-GPUs training and evaluation test with a CIFAR-10 dataset on resnet 50. It reduces the training loss from 3983.8 to 35.56 in 100 Epochs, and evaluation accuracy to 90.6% (maximum at epoch 90). However, this does not include the synchronization of running mean and variance before the evaluation phase: ``` Epoch=0: 100%|██| 195/195 [06:06<00:00, 1.91s/it]Training loss = 3983.820557, training accuracy = 0.225260 Test accuracy = 0.347556 Epoch=1: 100%|██| 195/195 [06:17<00:00, 1.94s/it]Training loss = 2628.622070, training accuracy = 0.379768 Test accuracy = 0.437700 Epoch=2: 100%|██| 195/195 [06:12<00:00, 1.89s/it]Training loss = 2347.072266, training accuracy = 0.448558 Test accuracy = 0.459936 Epoch=3: 100%|██| 195/195 [06:13<00:00, 1.88s/it]Training loss = 2075.987305, training accuracy = 0.517348 Test accuracy = 0.548978 Epoch=4: 100%|██| 195/195 [06:19<00:00, 1.97s/it]Training loss = 1890.109985, training accuracy = 0.566847 Test accuracy = 0.594451 Epoch=5: 100%|██| 195/195 [06:13<00:00, 1.92s/it]Training loss = 1720.395142, training accuracy = 0.606911 Test accuracy = 0.633413 Epoch=6: 100%|██| 195/195 [06:10<00:00, 1.92s/it]Training loss = 1555.737549, training accuracy = 0.645753 Test accuracy = 0.659054 Epoch=7: 100%|██| 195/195 [06:14<00:00, 1.91s/it]Training loss = 1385.688477, training accuracy = 0.687220 Test accuracy = 0.709836 Epoch=8: 100%|██| 195/195 [06:20<00:00, 1.97s/it]Training loss = 1269.426270, training accuracy = 0.714523 Test accuracy = 0.735477 Epoch=9: 100%|██| 195/195 [06:15<00:00, 1.91s/it]Training loss = 1137.953979, training accuracy = 0.746054 Test accuracy = 0.745393 Epoch=10: 100%|██| 195/195 [06:11<00:00, 1.88s/it]Training loss = 1031.773071, training accuracy = 0.770353 Test accuracy = 0.750501 Epoch=11: 100%|██| 195/195 [06:10<00:00, 1.89s/it]Training loss = 956.600037, training accuracy = 0.788261 Test accuracy = 0.44 Epoch=12: 100%|██| 195/195 [06:16<00:00, 1.92s/it]Training loss = 881.050171, training accuracy = 0.804167 Test accuracy = 0.793369 Epoch=13: 100%|██| 195/195 [06:16<00:00, 1.92s/it]Training loss = 828.298828, training accuracy = 0.818309 Test accuracy = 0.807692 Epoch=14: 100%|██| 195/195 [06:11<00:00, 1.90s/it]Training 
loss = 790.558838, training accuracy = 0.823918 Test accuracy = 0.795373 Epoch=15: 100%|██| 195/195 [06:13<00:00, 1.90s/it]Training loss = 740.679871, training accuracy = 0.833734 Test accuracy = 0.816707 Epoch=16: 100%|██| 195/195 [06:20<00:00, 1.95s/it]Training loss = 691.391479, training accuracy = 0.846855 Test accuracy = 0.818510 Epoch=17: 100%|██| 195/195 [06:16<00:00, 1.89s/it]Training loss = 657.708130, training accuracy = 0.853986 Test accuracy = 0.826122 Epoch=18: 100%|██| 195/195 [06:10<00:00, 1.88s/it]Training loss = 627.918579, training accuracy = 0.860216 Test accuracy = 0.844752 Epoch=19: 100%|██| 195/195 [06:13<00:00, 1.91s/it]Training loss = 592.768982, training accuracy = 0.869551 Test accuracy = 0.845653 Epoch=20: 100%|██| 195/195 [06:19<00:00, 1.97s/it]Training loss = 561.560608, training accuracy = 0.875060 Test accuracy = 0.835938 Epoch=21: 100%|██| 195/195 [06:15<00:00, 1.97s/it]Training loss = 533.083740, training accuracy = 0.881370 Test accuracy = 0.849860 Epoch=22: 100%|██| 195/195 [06:12<00:00,
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r311074176 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: From the above, we can now train simple CNN without batchnorm (MNIST dataset) and resnet with batchnorm (CIFAR-10 dataset). The remaining task is the synchronization of the running mean and variance if we use batchnorm. I tried to put the running mean and var in the _BatchNorm2D return list of backward ```python def backward(self, dy): assert training is True and hasattr( self, "cache" ), "Please set training as True before do BP. " x, scale, mean, var = self.cache if isinstance(self.handle, singa.CudnnBatchNormHandle): dx, ds, db = singa.GpuBatchNormBackward( self.handle, dy, x, scale, mean, var ) else: dx, ds, db = singa.CpuBatchNormBackward( self.handle, dy, x, scale, mean, var ) #return dx, ds, db return dx, ds, db, self.running_mean, self.running_var ``` and wish to synchronize it with ```python #all reduce running mean and var for p, g in autograd.backward(loss): if((p.requires_grad==False) and (p.stores_grad==False)): all_reduce(p) ``` However, this is the error in return ``` Traceback (most recent call last): File "resnet_multigpu.py", line 163, in for p, g in autograd.backward(loss): File "/usr/local/lib/python3.5/dist-packages/singa/autograd.py", line 136, in backward % (len(op.src), len(dxs)) AssertionError: the number of src ops (=3) and dx (=5) not match ``` This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
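The assertion above fires because the autograd engine pairs each value returned by backward() with one input of forward(); forward(x, scale, bias) takes three inputs, so backward() must return exactly three gradients, and the running statistics, which are not differentiable inputs, cannot ride along in that list. One possible alternative, sketched below with mpi4py and numpy arrays purely as an assumption (this is not SINGA's API), is to all-reduce each batch-norm layer's running mean and variance directly, for example once before the evaluation phase.
```
import numpy as np
from mpi4py import MPI

def sync_running_stats(running_mean, running_var):
    """Average batch-norm running statistics (contiguous numpy arrays) over all ranks, in place."""
    comm = MPI.COMM_WORLD
    comm.Allreduce(MPI.IN_PLACE, running_mean, op=MPI.SUM)
    comm.Allreduce(MPI.IN_PLACE, running_var, op=MPI.SUM)
    running_mean /= comm.Get_size()   # average over workers
    running_var /= comm.Get_size()
```
Calling something like this for every BatchNorm2d layer after training would make the evaluation-phase statistics consistent across workers without touching the gradient bookkeeping.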
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r311056639 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: Updated on 6th August: I removed a bug in the commit 0616000 which concerns the number of parameters (i.e. the Size() of the tensor) taking part in the all-reduce process. Then I did an 8 * K80 multi-GPU training and evaluation test on a simple CNN with the MNIST dataset. It reduces the training loss from 802.7 to 42.2 in about 30 Epochs: ``` Epoch=0: 100%|¦¦| 117/117 [00:01<00:00, 92.86it/s]Training loss = 802.659485, training accuracy = 0.713825 Test accuracy = 0.920025 Epoch=1: 100%|¦¦| 117/117 [00:01<00:00, 93.42it/s]Training loss = 246.589371, training accuracy = 0.916767 Test accuracy = 0.956106 Epoch=2: 100%|¦¦| 117/117 [00:01<00:00, 94.04it/s]Training loss = 175.012894, training accuracy = 0.941106 Test accuracy = 0.967208 Epoch=3: 100%|¦¦| 117/117 [00:01<00:00, 95.66it/s] Training loss = 144.684052, training accuracy = 0.951539 Test accuracy = 0.970806 Epoch=4: 100%|¦¦| 117/117 [00:01<00:00, 102.59it/s]Training loss = 120.399704, training accuracy = 0.959402 Test accuracy = 0.976049 Epoch=5: 100%|¦¦| 117/117 [00:01<00:00, 102.79it/s]Training loss = 107.832191, training accuracy = 0.963709 Test accuracy = 0.975946 Epoch=6: 100%|¦¦| 117/117 [00:01<00:00, 102.70it/s]Training loss = 96.289490, training accuracy = 0.967014 Test accuracy = 0.979441 Epoch=7: 100%|¦¦| 117/117 [00:01<00:00, 102.34it/s]Training loss = 88.031815, training accuracy = 0.970436 Test accuracy = 0.980983 Epoch=8: 100%|¦¦| 117/117 [00:01<00:00, 101.81it/s]Training loss = 79.349884, training accuracy = 0.973090 Test accuracy = 0.980058 Epoch=9: 100%|¦¦| 117/117 [00:01<00:00, 101.82it/s]Training loss = 77.825607, training accuracy = 0.974342 Test accuracy = 0.977282 Epoch=10: 100%|¦¦| 117/117 [00:01<00:00, 101.97it/s]Training loss = 74.710297, training accuracy = 0.974576 Test accuracy = 0.983861 Epoch=11: 100%|¦¦| 117/117 [00:01<00:00, 101.98it/s]Training loss = 69.400230, training accuracy = 0.976162 Test accuracy = 0.982936 Epoch=12: 100%|¦¦| 117/117 [00:01<00:00, 102.03it/s]Training loss = 65.100449, training accuracy = 0.978148 Test accuracy = 0.983553 Epoch=13: 100%|¦¦| 117/117 [00:01<00:00, 102.17it/s]Training loss = 65.113991, training accuracy = 0.978249 Test accuracy = 0.986534 Epoch=14: 100%|¦¦| 117/117 [00:01<00:00, 101.83it/s]Training loss =
63.065636, training accuracy = 0.978566 Test accuracy = 0.984683 Epoch=15: 100%|¦¦| 117/117 [00:01<00:00, 102.11it/s]Training loss = 58.334709, training accuracy = 0.980018 Test accuracy = 0.983758 Epoch=16: 100%|¦¦| 117/117 [00:01<00:00, 102.16it/s]Training loss = 58.280094, training accuracy = 0.980285 Test accuracy = 0.983655 Epoch=17: 100%|¦¦| 117/117 [00:01<00:00, 102.15it/s]Training loss = 53.226196, training accuracy = 0.981420 Test accuracy = 0.985197 Epoch=18: 100%|¦¦| 117/117 [00:01<00:00, 102.15it/s]Training loss = 55.968140, training accuracy = 0.980786 Test accuracy = 0.982422 Epoch=19: 100%|¦¦| 117/117 [00:01<00:00, 102.14it/s]Training loss = 52.761921, training accuracy = 0.982489 Test accuracy = 0.985814 Epoch=20: 100%|¦¦| 117/117 [00:01<00:00, 101.86it/s]Training loss = 51.989666, training accuracy = 0.982973 Test accuracy = 0.983758 Epoch=21: 100%|¦¦| 117/117 [00:01<00:00, 101.91it/s]Training loss = 52.571381, training accuracy = 0.982455 Test accuracy = 0.987973 Epoch=22: 100%|¦¦| 117/117 [00:01<00:00, 101.99it/s]Training loss
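The bug fix above concerns the element count handed to the all-reduce: each call must cover the Size() of the parameter tensor (its number of elements), not the number of parameter tensors. A minimal sketch of that idea using mpi4py and NumPy, which is illustrative only and is not the communicator code of this PR:
```python
# Illustrative only: average one parameter tensor across ranks; the buffer
# length passed to Allreduce is param.size, i.e. the tensor's Size().
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def all_reduce_average(param):
    flat = np.ascontiguousarray(param, dtype=np.float32).ravel()
    out = np.empty_like(flat)                # length == param.size
    comm.Allreduce(flat, out, op=MPI.SUM)    # sum over all ranks
    return (out / comm.Get_size()).reshape(param.shape)
```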
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r311068821 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: In addition to the above, I also did an 8 * K80 multi-GPU training and evaluation test with the CIFAR-10 dataset on resnet 50. It reduces the training loss from 3983.8 to 35.56 in 100 Epochs, and raises the evaluation accuracy to 90.6% (maximum at epoch 90). However, this does not include the synchronization of running mean and variance before the evaluation phase: ``` Epoch=0: 100%|██| 195/195 [06:06<00:00, 1.91s/it]Training loss = 3983.820557, training accuracy = 0.225260 Test accuracy = 0.347556 Epoch=1: 100%|██| 195/195 [06:17<00:00, 1.94s/it]Training loss = 2628.622070, training accuracy = 0.379768 Test accuracy = 0.437700 Epoch=2: 100%|██| 195/195 [06:12<00:00, 1.89s/it]Training loss = 2347.072266, training accuracy = 0.448558 Test accuracy = 0.459936 Epoch=3: 100%|██| 195/195 [06:13<00:00, 1.88s/it]Training loss = 2075.987305, training accuracy = 0.517348 Test accuracy = 0.548978 Epoch=4: 100%|██| 195/195 [06:19<00:00, 1.97s/it]Training loss = 1890.109985, training accuracy = 0.566847 Test accuracy = 0.594451 Epoch=5: 100%|██| 195/195 [06:13<00:00, 1.92s/it]Training loss = 1720.395142, training accuracy = 0.606911 Test accuracy = 0.633413 Epoch=6: 100%|██| 195/195 [06:10<00:00, 1.92s/it]Training loss = 1555.737549, training accuracy = 0.645753 Test accuracy = 0.659054 Epoch=7: 100%|██| 195/195 [06:14<00:00, 1.91s/it]Training loss = 1385.688477, training accuracy = 0.687220 Test accuracy = 0.709836 Epoch=8: 100%|██| 195/195 [06:20<00:00, 1.97s/it]Training loss = 1269.426270, training accuracy = 0.714523 Test accuracy = 0.735477 Epoch=9: 100%|██| 195/195 [06:15<00:00, 1.91s/it]Training loss = 1137.953979, training accuracy = 0.746054 Test accuracy = 0.745393 Epoch=10: 100%|██| 195/195 [06:11<00:00, 1.88s/it]Training loss = 1031.773071, training accuracy = 0.770353 Test accuracy = 0.750501 Epoch=11: 100%|██| 195/195 [06:10<00:00, 1.89s/it]Training loss = 956.600037, training accuracy = 0.788261 Test accuracy = 0.44 Epoch=12: 100%|██| 195/195 [06:16<00:00, 1.92s/it]Training loss = 881.050171, training accuracy = 0.804167 Test accuracy = 0.793369 Epoch=13: 100%|██| 195/195 [06:16<00:00, 1.92s/it]Training loss = 828.298828, training accuracy = 0.818309 Test accuracy = 0.807692 Epoch=14: 100%|██| 195/195 [06:11<00:00, 1.90s/it]Training
loss = 790.558838, training accuracy = 0.823918 Test accuracy = 0.795373 Epoch=15: 100%|██| 195/195 [06:13<00:00, 1.90s/it]Training loss = 740.679871, training accuracy = 0.833734 Test accuracy = 0.816707 Epoch=16: 100%|██| 195/195 [06:20<00:00, 1.95s/it]Training loss = 691.391479, training accuracy = 0.846855 Test accuracy = 0.818510 Epoch=17: 100%|██| 195/195 [06:16<00:00, 1.89s/it]Training loss = 657.708130, training accuracy = 0.853986 Test accuracy = 0.826122 Epoch=18: 100%|██| 195/195 [06:10<00:00, 1.88s/it]Training loss = 627.918579, training accuracy = 0.860216 Test accuracy = 0.844752 Epoch=19: 100%|██| 195/195 [06:13<00:00, 1.91s/it]Training loss = 592.768982, training accuracy = 0.869551 Test accuracy = 0.845653 Epoch=20: 100%|██| 195/195 [06:19<00:00, 1.97s/it]Training loss = 561.560608, training accuracy = 0.875060 Test accuracy = 0.835938 Epoch=21: 100%|██| 195/195 [06:15<00:00, 1.97s/it]Training loss = 533.083740, training accuracy = 0.881370 Test accuracy = 0.849860 Epoch=22: 100%|██| 195/195 [06:12<00:00,
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r311068821 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: In addition to the above, I also did an 8 * K80 multi-GPU training and evaluation test with the CIFAR-10 dataset on resnet 50. It reduces the training loss from 3983.8 to 345.7 in about 30 Epochs, and raises the evaluation accuracy to 86.8%. However, this does not include the synchronization of running mean and variance before the evaluation phase: ``` Epoch=0: 100%|██| 195/195 [06:06<00:00, 1.91s/it]Training loss = 3983.820557, training accuracy = 0.225260 Test accuracy = 0.347556 Epoch=1: 100%|██| 195/195 [06:17<00:00, 1.94s/it]Training loss = 2628.622070, training accuracy = 0.379768 Test accuracy = 0.437700 Epoch=2: 100%|██| 195/195 [06:12<00:00, 1.89s/it]Training loss = 2347.072266, training accuracy = 0.448558 Test accuracy = 0.459936 Epoch=3: 100%|██| 195/195 [06:13<00:00, 1.88s/it]Training loss = 2075.987305, training accuracy = 0.517348 Test accuracy = 0.548978 Epoch=4: 100%|██| 195/195 [06:19<00:00, 1.97s/it]Training loss = 1890.109985, training accuracy = 0.566847 Test accuracy = 0.594451 Epoch=5: 100%|██| 195/195 [06:13<00:00, 1.92s/it]Training loss = 1720.395142, training accuracy = 0.606911 Test accuracy = 0.633413 Epoch=6: 100%|██| 195/195 [06:10<00:00, 1.92s/it]Training loss = 1555.737549, training accuracy = 0.645753 Test accuracy = 0.659054 Epoch=7: 100%|██| 195/195 [06:14<00:00, 1.91s/it]Training loss = 1385.688477, training accuracy = 0.687220 Test accuracy = 0.709836 Epoch=8: 100%|██| 195/195 [06:20<00:00, 1.97s/it]Training loss = 1269.426270, training accuracy = 0.714523 Test accuracy = 0.735477 Epoch=9: 100%|██| 195/195 [06:15<00:00, 1.91s/it]Training loss = 1137.953979, training accuracy = 0.746054 Test accuracy = 0.745393 Epoch=10: 100%|██| 195/195 [06:11<00:00, 1.88s/it]Training loss = 1031.773071, training accuracy = 0.770353 Test accuracy = 0.750501 Epoch=11: 100%|██| 195/195 [06:10<00:00, 1.89s/it]Training loss = 956.600037, training accuracy = 0.788261 Test accuracy = 0.44 Epoch=12: 100%|██| 195/195 [06:16<00:00, 1.92s/it]Training loss = 881.050171, training accuracy = 0.804167 Test accuracy = 0.793369 Epoch=13: 100%|██| 195/195 [06:16<00:00, 1.92s/it]Training loss = 828.298828, training accuracy = 0.818309 Test accuracy = 0.807692 Epoch=14: 100%|██| 195/195 [06:11<00:00, 1.90s/it]Training loss = 790.558838,
training accuracy = 0.823918 Test accuracy = 0.795373 Epoch=15: 100%|██| 195/195 [06:13<00:00, 1.90s/it]Training loss = 740.679871, training accuracy = 0.833734 Test accuracy = 0.816707 Epoch=16: 100%|██| 195/195 [06:20<00:00, 1.95s/it]Training loss = 691.391479, training accuracy = 0.846855 Test accuracy = 0.818510 Epoch=17: 100%|██| 195/195 [06:16<00:00, 1.89s/it]Training loss = 657.708130, training accuracy = 0.853986 Test accuracy = 0.826122 Epoch=18: 100%|██| 195/195 [06:10<00:00, 1.88s/it]Training loss = 627.918579, training accuracy = 0.860216 Test accuracy = 0.844752 Epoch=19: 100%|██| 195/195 [06:13<00:00, 1.91s/it]Training loss = 592.768982, training accuracy = 0.869551 Test accuracy = 0.845653 Epoch=20: 100%|██| 195/195 [06:19<00:00, 1.97s/it]Training loss = 561.560608, training accuracy = 0.875060 Test accuracy = 0.835938 Epoch=21: 100%|██| 195/195 [06:15<00:00, 1.97s/it]Training loss = 533.083740, training accuracy = 0.881370 Test accuracy = 0.849860 Epoch=22: 100%|██| 195/195 [06:12<00:00, 1.91s/it]Traini
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r310330469 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: Next, I will try to train the resnet with cifar 10 using 8 gpu. This will take time to modify because I need to collect the accuracy from the other processes (may use MPI to reduce), and sync the running mean and var of the different processes. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
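The accuracy collection mentioned above can be done with a plain MPI all-reduce of the per-rank counts. A minimal sketch assuming mpi4py; local_correct and local_total are hypothetical per-rank counters, not names from this PR:
```python
# Illustrative sketch: aggregate per-rank evaluation counts into a global accuracy.
from mpi4py import MPI

def global_accuracy(local_correct, local_total):
    comm = MPI.COMM_WORLD
    correct = comm.allreduce(local_correct, op=MPI.SUM)   # total correct predictions
    total = comm.allreduce(local_total, op=MPI.SUM)       # total evaluated samples
    return correct / total
```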
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r310328731 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: I have trained the dist_new branch resnet (because resnet has batch norm) with the cifar10 dataset using 1 GPU, and obtained 92.5% test accuracy in 100 Epochs with data augmentation. This suggests that the batch norm is in good condition (though the onnx interface of batchnorm may still need to be considered; I am not sure) ``` ubuntu@ip-172-31-27-25:~/incubator-singa/examples/autograd$ python3 resnet_realdata.py Loading data file cifar-10-batches-py/data_batch_1 Loading data file cifar-10-batches-py/data_batch_2 Loading data file cifar-10-batches-py/data_batch_3 Loading data file cifar-10-batches-py/data_batch_4 Loading data file cifar-10-batches-py/data_batch_5 Loading data file cifar-10-batches-py/test_batch Start intialization Epoch=0: 100%|███| 1562/1562 [03:56<00:00, 6.61it/s] Training loss = 2927.551146, training accuracy = 0.338068 Test accuracy = 0.441306 Epoch=1: 100%|███| 1562/1562 [03:56<00:00, 6.59it/s] Training loss = 2110.360374, training accuracy = 0.511984 Test accuracy = 0.606571 Epoch=2: 100%|███| 1562/1562 [03:56<00:00, 6.61it/s] Training loss = 1658.897868, training accuracy = 0.623199 Test accuracy = 0.645232 Epoch=3: 100%|███| 1562/1562 [03:56<00:00, 6.64it/s] Training loss = 1354.082412, training accuracy = 0.694442 Test accuracy = 0.731170 Epoch=4: 100%|███| 1562/1562 [03:56<00:00, 6.63it/s] Training loss = 1155.785529, training accuracy = 0.743478 Test accuracy = 0.761318 Epoch=5: 100%|███| 1562/1562 [03:56<00:00, 6.59it/s] Training loss = 1022.750388, training accuracy = 0.773668 Test accuracy = 0.741286 Epoch=6: 100%|███| 1562/1562 [03:56<00:00, 6.62it/s] Training loss = 945.400214, training accuracy = 0.790373 Test accuracy = 0.795072 Epoch=7: 100%|███| 1562/1562 [03:56<00:00, 6.61it/s] Training loss = 840.933215, training accuracy = 0.814441 Test accuracy = 0.810096 Epoch=8: 100%|███| 1562/1562 [03:56<00:00, 6.62it/s] Training loss = 765.215148, training accuracy = 0.830566 Test accuracy = 0.807091 Epoch=9: 100%|███| 1562/1562 [03:56<00:00, 6.61it/s] Training loss = 701.153867, training accuracy = 0.845951 Test accuracy = 0.822316 Epoch=10: 100%|██| 1562/1562 [03:56<00:00, 6.63it/s] Training loss = 666.267428, training accuracy = 0.853073 Test accuracy = 0.851162 Epoch=11: 100%|██| 1562/1562 [03:56<00:00,
6.62it/s] Training loss = 606.699607, training accuracy = 0.866817 Test accuracy = 0.770232 Epoch=12: 100%|██| 1562/1562
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r310330469 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: Next, I will try to train the resnet with cifar 10 using 8 gpu. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r309709702 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: Updated on 1 August 2019: Concerning the above error, I found that there is a difference between the implementations of `class _BatchNorm2d(Operation):` in the master branch and the dist_new branch. In autograd.py, both the master branch and the dist_new branch have modified (or debugged) the conv2d and batchnorm operators, but they modified them differently. Meanwhile, the conv2d in both the master branch and the dist_new branch can train and reduce the loss of the simple mnist CNN, so there is no big problem there. However, batch normalization is a much more complex case, because it includes non-training variables, namely the running means and running variances. In the master branch, the running means and running variances (non-training variables) are in the forward function: `def forward(self, x, scale, bias, running_mean, running_var):` https://github.com/apache/incubator-singa/blob/master/python/singa/autograd.py#L1099 When I run the code using the master branch dockerfile, the error is as follows: ``` root@26c9db193eb0:~/incubator-singa/examples/autograd# python3 resnet.py Start intialization 0%| | 0/200 [00:00 for p, g in autograd.backward(loss): File "/root/incubator-singa/build/python/singa/autograd.py", line 135, in backward % (len(op.src), len(dxs)) AssertionError: the number of src ops (=5) and dx (=3) not match ``` I think the error occurs because running_mean and running_var are input arguments of the forward function but are not training variables, so only three src ops are expected while five are found. Meanwhile, the dist_new branch has modified the batchnorm function (commit 2b3a857 by user ubuntu on Apr 14) by moving the input arguments running_mean and running_var into the initialization function: `def __init__(self, handle, running_mean, running_var, name=None):` `def forward(self, x, scale, bias):` https://github.com/xuewanqi/incubator-singa/blob/dist_new/python/singa/autograd.py#L1096 This one can run successfully, but I am not sure if it can train and reduce the loss. Next, I will try training the resnet with a real dataset to see if it can reduce the loss. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
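To make the src-op/gradient mismatch described above concrete, here is a simplified, hypothetical sketch (not SINGA's actual autograd code) of the two ways of wiring the running statistics into the operation; passing them through forward() makes the engine record five inputs while backward() returns only three gradients.
```python
# Simplified, hypothetical autograd skeleton to illustrate the mismatch; not SINGA's real code.

class Operation:
    def __call__(self, *inputs):
        self.src = inputs              # the engine records every forward argument as a source
        return self.forward(*inputs)

class BatchNorm2dMasterStyle(Operation):
    # master-branch style: running stats are passed through forward()
    def forward(self, x, scale, bias, running_mean, running_var):
        return x                       # placeholder for the cudnn batch-norm forward

    def backward(self, dy):
        # only x, scale and bias get gradients -> 3 values returned,
        # but self.src recorded 5 inputs -> "src ops (=5) and dx (=3) not match"
        return dy, dy, dy              # placeholders

class BatchNorm2dDistNewStyle(Operation):
    # dist_new-branch style: running stats are kept as state set in __init__()
    def __init__(self, handle, running_mean, running_var):
        self.handle = handle
        self.running_mean = running_mean   # non-trainable buffers, not autograd inputs
        self.running_var = running_var

    def forward(self, x, scale, bias):
        return x                       # placeholder; would also update the running stats in place

    def backward(self, dy):
        return dy, dy, dy              # 3 gradients for 3 recorded inputs -> counts match
```
Keeping the non-trainable buffers out of the recorded inputs is what lets the dist_new version pass the src-op/gradient count check.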
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r309496476 ## File path: src/CMakeLists.txt ## @@ -36,6 +36,9 @@ AUX_SOURCE_DIRECTORY(core/scheduler core_source) AUX_SOURCE_DIRECTORY(core/tensor core_source) LIST(APPEND singa_sources ${core_source}) Review comment: ``` The build log is here: ubuntu@ip-172-31-18-113:~/incubator-singa/build$ rm -rf * ubuntu@ip-172-31-18-113:~/incubator-singa/build$ cmake -D CMAKE_PREFIX_PATH="/usr/local/cuda/lib64;/usr/local/cuda/" -DENABLE_TEST=OFF -DUSE_CUDA=ON -DUSE_PYTHON3=ON -DUSE_MKLDNN=ON -DUSE_MODULES=OFF -DUSE_DIST=ON .. -- The C compiler identification is GNU 5.4.0 -- The CXX compiler identification is GNU 5.4.0 -- Check for working C compiler: /usr/bin/cc -- Check for working C compiler: /usr/bin/cc -- works -- Detecting C compiler ABI info -- Detecting C compiler ABI info - done -- Detecting C compile features -- Detecting C compile features - done -- Check for working CXX compiler: /usr/bin/c++ -- Check for working CXX compiler: /usr/bin/c++ -- works -- Detecting CXX compiler ABI info -- Detecting CXX compiler ABI info - done -- Detecting CXX compile features -- Detecting CXX compile features - done -- Looking for pthread.h -- Looking for pthread.h - found -- Looking for pthread_create -- Looking for pthread_create - not found -- Looking for pthread_create in pthreads -- Looking for pthread_create in pthreads - not found -- Looking for pthread_create in pthread -- Looking for pthread_create in pthread - found -- Found Threads: TRUE -- Found Protobuf: /usr/local/lib/libprotobuf.so;-lpthread (found suitable version "3.0.0", minimum required is "3.0") -- Found CBLAS: /usr/local/include -- Found GLOG: /usr/include -- Found cuda_v10.0 -- Found CUDNN: /usr/local/cuda/include -- Found Cudnn_7401 at /usr/local/cuda/include /usr/local/cuda/lib64/libcudnn.so -- Found PythonInterp: /usr/bin/python3 (found suitable version "3.5.2", minimum required is "3") -- Found PythonLibs: /usr/lib/x86_64-linux-gnu/libpython3.5m.so (found suitable version "3.5.2", minimum required is "3") -- Found SWIG: /usr/local/bin/swig (found suitable version "3.0.12", minimum required is "3.0.10") -- Found MKLDNN at /usr/local/include -- Found MPI at /home/ubuntu/mpich-3.3/build/include -- Found MPI lib at /home/ubuntu/mpich-3.3/build/lib/libmpi.so -- Found all lib at /usr/local/lib/libprotobuf.so;/usr/local/lib/libopenblas.so;/usr/lib/x86_64-linux-gnu/libglog.so;/usr/local/cuda/lib64/libcudnn.so;/usr/local/cuda/lib64/libcudart.so;/usr/local/cuda/lib64/libcurand.so;/usr/local/cuda/lib64/libcublas.so;/home/ubuntu/incubator-singa/build/lib/libcnmem.a;/usr/local/lib/libmkldnn.so;/home/ubuntu/mpich-3.3/build/lib/libmpi.so;/home/ubuntu/mpich-3.3/build/lib/libmpicxx.so -- Found NCCL at /usr/local/cuda/include -- Found NCCL lib at /usr/local/cuda/lib/libnccl.so -- Configuring done -- Generating done -- Build files have been written to: /home/ubuntu/incubator-singa/build ubuntu@ip-172-31-18-113:~/incubator-singa/build$ make -j2 Scanning dependencies of target cnmem Scanning dependencies of target copy_protobuf [ 1%] Creating directories for 'cnmem' [ 2%] Running C++ protocol buffer compiler on /home/ubuntu/incubator-singa/src/proto/model.proto [libprotobuf WARNING google/protobuf/compiler/parser.cc:547] No syntax specified for the proto file: model.proto. Please use 'syntax = "proto2";' or 'syntax = "proto3";' to specify a syntax version. (Defaulted to proto2 syntax.) 
[ 3%] Performing download step (git clone) for 'cnmem' Cloning into 'cnmem'... [ 4%] Running C++ protocol buffer compiler on /home/ubuntu/incubator-singa/src/proto/caffe.proto [ 5%] Running C++ protocol buffer compiler on /home/ubuntu/incubator-singa/src/proto/core.proto [libprotobuf WARNING google/protobuf/compiler/parser.cc:547] No syntax specified for the proto file: core.proto. Please use 'syntax = "proto2";' or 'syntax = "proto3";' to specify a syntax version. (Defaulted to proto2 syntax.) [ 6%] Running C++ protocol buffer compiler on /home/ubuntu/incubator-singa/src/proto/io.proto [libprotobuf WARNING google/protobuf/compiler/parser.cc:547] No syntax specified for the proto file: io.proto. Please use 'syntax = "proto2";' or 'syntax = "proto3";' to specify a syntax version. (Defaulted to proto2 syntax.) [ 7%] Copying Protobuf headers [ 7%] Built target copy_protobuf [ 8%] Building NVCC (Device) object src/CMakeFiles/cuda_compile_1.dir/core/tensor/cuda_compile_1_generated_math_kernel.cu.o Scanning dependencies of target singa_objects [ 9%] Building CXX object src/CMakeFiles/singa_objects.dir/caffe.pb.cc.o Already on 'master' Your branch is up-to-date with 'origin/master'. [ 10%] No pat
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r309329911 ## File path: src/CMakeLists.txt ## @@ -36,6 +36,9 @@ AUX_SOURCE_DIRECTORY(core/scheduler core_source) AUX_SOURCE_DIRECTORY(core/tensor core_source) LIST(APPEND singa_sources ${core_source}) Review comment: I also updated some files to include USE_DIST; see the following grep result on USE_DIST: ubuntu@ip-172-31-18-113:~/incubator-singa$ git grep USE_DIST CMakeLists.txt:OPTION(USE_DIST "Use nccl distributed module" OFF) cmake/Dependencies.cmake:IF(USE_DIST) cmake/Templates/singa_config.h.in:#cmakedefine USE_DIST include/singa/dist/communicator.h:#ifdef USE_DIST include/singa/dist/communicator.h:#endif // USE_DIST src/CMakeLists.txt:IF (USE_DIST) src/CMakeLists.txt:ENDIF (USE_DIST) src/api/config.i:#define USE_DIST 0 src/api/config.i.in:#cmakedefine01 USE_DIST src/api/dist_communicator.i:#if USE_DIST src/api/dist_communicator.i:#endif // USE_DIST src/dist/communicator.cc:#ifdef USE_DIST src/dist/communicator.cc:#endif // USE_DIST Note that the default is OFF if we do not set -DUSE_DIST=ON. The test was on version 1.2, although I set the displayed value in CMakeLists to be version 2.0. I will still need to test the dist module on singa version 2.0 and add partitioning of the dataset according to MPI rank, etc. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
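A rough sketch of the planned dataset partitioning by MPI rank is below, assuming each process receives a contiguous, equal-sized shard; the function name is hypothetical, and the rank and world_size values would come from MPI or the distributed communicator (accessor names are assumptions, not the actual API).
```python
import numpy as np

def partition_by_rank(x, y, rank, world_size):
    """Give each MPI process a contiguous, equally sized shard of the training data."""
    samples_per_rank = x.shape[0] // world_size   # remainder samples are dropped for simplicity
    start = rank * samples_per_rank
    return x[start:start + samples_per_rank], y[start:start + samples_per_rank]

# Example with dummy data; in real training rank/world_size would come from MPI or the communicator.
train_x = np.zeros((50000, 3, 32, 32), dtype=np.float32)
train_y = np.zeros(50000, dtype=np.int32)
x_shard, y_shard = partition_by_rank(train_x, train_y, rank=0, world_size=8)
print(x_shard.shape)   # (6250, 3, 32, 32)
```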
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r309144703 ## File path: src/CMakeLists.txt ## @@ -36,6 +36,9 @@ AUX_SOURCE_DIRECTORY(core/scheduler core_source) AUX_SOURCE_DIRECTORY(core/tensor core_source) LIST(APPEND singa_sources ${core_source}) Review comment: Changed the files cmake/Dependencies.cmake and src/CMakeLists.txt. You can use cmake -DUSE_DIST=ON to turn on the distributed module. However, there are some bugs (mainly segmentation faults) if I add #ifdef USE_DIST in the files communicator.h and communicator.cc. I will update the other files as well (e.g. #cmakedefine and #if USE_DIST etc. in many files) once I have removed the bug. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services
[GitHub] [incubator-singa] chrishkchris commented on a change in pull request #468: Distributted module
chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r305247134 ## File path: src/api/config.i ## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: On our server (at ncrg), I created a new Anaconda Python 3.6 environment and installed singa 2.0 using "conda install -c nusdbsystem -c conda-forge singa=2.0.0=cudnn7.3.1_cuda10.0_py36". It passed the test: python -c "from singa import tensor". Also, it passed the old optimizer example: incubator-singa/example/cifar10/train.py can run and train successfully. However, incubator-singa/examples/autograd/resnet.py cannot run; the output is: Start intialization 0%| | 0/200 [00:00 x = model(tx) File "examples/autograd/resnet.py", line 155, in __call__ x = self.conv1(x) File "/home/dcsysh/anaconda3/envs/singa2/lib/python3.6/site-packages/singa/autograd.py", line 939, in __call__ self.device_check(x, self.W, self.b) File "/home/dcsysh/anaconda3/envs/singa2/lib/python3.6/site-packages/singa/autograd.py", line 656, in device_check if var.device.id() != x_dev_id: AttributeError: 'NoneType' object has no attribute 'device' This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services