chrishkchris commented on a change in pull request #468: Distributted module
URL: https://github.com/apache/incubator-singa/pull/468#discussion_r309709702
########## File path: src/api/config.i ##########

```diff
@@ -0,0 +1,33 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+
+
+// Pass in cmake configurations to swig
+#define USE_CUDA 1
+#define USE_CUDNN 1
+#define USE_OPENCL 0
+#define USE_PYTHON 1
+#define USE_MKLDNN 1
+#define USE_JAVA 0
+#define CUDNN_VERSION 7401
+
+// SINGA version
+#define SINGA_MAJOR_VERSION 1
```

Review comment:

Updated on 1 August 2019: Concerning the above error, I found that there is a difference between the implementations of `class _BatchNorm2d(Operation):` in the master branch and the dist_new branch. In autograd.py, both branches have modified (or debugged) the conv2d and batchnorm operators, but they modified them differently. The conv2d operator in both the master branch and the dist_new branch can train and reduce the loss of the MNIST simple CNN, so there is no big problem there. However, batch normalization is a much more complex case, because it involves non-training variables: the running means and running variances.
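A minimal sketch of why non-training variables in the forward arguments cause trouble (hypothetical `ToyOp` class, not SINGA's actual autograd; it only assumes the engine records one `src` entry per forward input and expects one gradient per entry from backward):

```python
# Hypothetical sketch of the src/gradient bookkeeping mismatch.
# An autograd-style op records one "src" entry per forward input,
# but backward only returns gradients for the trainable inputs.

class ToyOp:
    def __init__(self):
        self.src = []

    def forward(self, *inputs):
        # every forward input is recorded as a src entry,
        # including non-trainable ones like running_mean/running_var
        self.src = list(inputs)
        return sum(inputs)

    def backward(self, dy):
        # gradients are produced only for x, scale, and bias
        return (dy, dy, dy)

op = ToyOp()
op.forward(1.0, 2.0, 3.0, 4.0, 5.0)  # x, scale, bias, running_mean, running_var
dxs = op.backward(1.0)
print("src inputs recorded: %d, gradients returned: %d"
      % (len(op.src), len(dxs)))
```

With five recorded inputs but only three gradients, a consistency check of the kind SINGA performs in `autograd.backward` would fail.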
In the master branch, the running means and running variances (non-training variables) are arguments of the forward function:

`def forward(self, x, scale, bias, running_mean, running_var):`

https://github.com/apache/incubator-singa/blob/master/python/singa/autograd.py#L1099

When I run the code using the master branch dockerfile, the error is as follows:

```
root@26c9db193eb0:~/incubator-singa/examples/autograd# python3 resnet.py
Start intialization............
  0%|          | 0/200 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "resnet.py", line 249, in <module>
    for p, g in autograd.backward(loss):
  File "/root/incubator-singa/build/python/singa/autograd.py", line 135, in backward
    % (len(op.src), len(dxs))
AssertionError: the number of src ops (=5) and dx (=3) not match
```

I think the error occurs because running_mean and running_var appear among the forward function's input arguments but are not training variables, so the backward pass expects three src ops but finds five.

Meanwhile, the dist_new branch has modified the batchnorm function (commit 2b3a857 by user ubuntu on Apr 14) by moving the input arguments running_mean and running_var into the initialization function:

`def __init__(self, handle, running_mean, running_var, name=None):`
`def forward(self, x, scale, bias):`

https://github.com/xuewanqi/incubator-singa/blob/dist_new/python/singa/autograd.py#L1096

This version runs successfully, but I am not sure yet whether it can train and reduce the loss. Next, I will try training the ResNet with a real dataset to see whether the loss decreases.
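The dist_new fix can be illustrated with a toy 1-D batch norm (a simplified, hypothetical sketch, not SINGA's actual `_BatchNorm2d`; the class name, `momentum` default, and plain-list math are all illustrative assumptions):

```python
# Hypothetical sketch of the dist_new style: running statistics are
# constructor state, so forward only receives differentiable inputs
# (x, scale, bias) and backward needs exactly three gradients.

class BatchNormDistStyle:
    def __init__(self, running_mean, running_var, momentum=0.9):
        self.running_mean = running_mean
        self.running_var = running_var
        self.momentum = momentum

    def forward(self, x, scale, bias):
        # toy 1-D batch norm over a list of floats
        n = len(x)
        mean = sum(x) / n
        var = sum((v - mean) ** 2 for v in x) / n
        # running stats are updated as side effects, never exposed
        # to the autograd engine as forward inputs
        self.running_mean = (self.momentum * self.running_mean
                             + (1 - self.momentum) * mean)
        self.running_var = (self.momentum * self.running_var
                            + (1 - self.momentum) * var)
        eps = 1e-5
        return [scale * (v - mean) / (var + eps) ** 0.5 + bias
                for v in x]

bn = BatchNormDistStyle(running_mean=0.0, running_var=1.0)
y = bn.forward([1.0, 2.0, 3.0], scale=1.0, bias=0.0)
```

Because only three inputs pass through forward, the src/gradient counts agree; whether the running statistics are still updated correctly during training is exactly the open question above.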