chrishkchris commented on a change in pull request #468: Distributted module
URL: https://github.com/apache/incubator-singa/pull/468#discussion_r309709702
########## File path: src/api/config.i ##########

```diff
@@ -0,0 +1,33 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements. See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership. The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License. You may obtain a copy of the License at
+//
+//     http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied. See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+
+
+// Pass in cmake configurations to swig
+#define USE_CUDA 1
+#define USE_CUDNN 1
+#define USE_OPENCL 0
+#define USE_PYTHON 1
+#define USE_MKLDNN 1
+#define USE_JAVA 0
+#define CUDNN_VERSION 7401
+
+// SINGA version
+#define SINGA_MAJOR_VERSION 1
```

Review comment:

Updated on 1 August 2019: Concerning the above error, I found that there is a difference between the implementations of `class _BatchNorm2d(Operation):` in the master branch and the dist_new branch. In autograd.py, both branches have modified (or debugged) the conv2d and batchnorm operators, but they modified them differently. The conv2d operator in both the master branch and the dist_new branch can train and reduce the loss of the MNIST simple CNN, so there is no big problem there. However, batch normalization is a much more complex case, because it involves non-training variables: the running means and running variances.
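A minimal sketch of why non-training variables in the forward arguments cause trouble (hypothetical `ToyOp` class, not SINGA's actual autograd; it only assumes the engine records one `src` entry per forward input and expects one gradient per entry from backward):

```python
# Hypothetical sketch of the src/gradient bookkeeping mismatch.
# An autograd-style op records one "src" entry per forward input,
# but backward only returns gradients for the trainable inputs.

class ToyOp:
    def __init__(self):
        self.src = []

    def forward(self, *inputs):
        # every forward input is recorded as a src entry,
        # including non-trainable ones like running_mean/running_var
        self.src = list(inputs)
        return sum(inputs)

    def backward(self, dy):
        # gradients are produced only for x, scale, and bias
        return (dy, dy, dy)

op = ToyOp()
op.forward(1.0, 2.0, 3.0, 4.0, 5.0)  # x, scale, bias, running_mean, running_var
dxs = op.backward(1.0)
print("src inputs recorded: %d, gradients returned: %d"
      % (len(op.src), len(dxs)))
```

With five recorded inputs but only three gradients, a consistency check of the kind SINGA performs in `autograd.backward` would fail.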
In the master branch, the running means and running variances (non-training variables) are arguments of the forward function:

`def forward(self, x, scale, bias, running_mean, running_var):`

https://github.com/apache/incubator-singa/blob/master/python/singa/autograd.py#L1099

When I run the code using the master branch dockerfile, the error is as follows:

```
root@26c9db193eb0:~/incubator-singa/examples/autograd# python3 resnet.py
Start intialization............
  0%|          | 0/200 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "resnet.py", line 249, in <module>
    for p, g in autograd.backward(loss):
  File "/root/incubator-singa/build/python/singa/autograd.py", line 135, in backward
    % (len(op.src), len(dxs))
AssertionError: the number of src ops (=5) and dx (=3) not match
```

I think the error occurs because running_mean and running_var appear among the forward function's input arguments but are not training variables, so the backward pass expects three src ops but finds five.

Meanwhile, the dist_new branch has modified the batchnorm function (commit 2b3a857 by user ubuntu on Apr 14) by moving the input arguments running_mean and running_var into the initialization function:

`def __init__(self, handle, running_mean, running_var, name=None):`
`def forward(self, x, scale, bias):`

https://github.com/xuewanqi/incubator-singa/blob/dist_new/python/singa/autograd.py#L1096

This version runs successfully, but I am not sure yet whether it can train and reduce the loss. Next, I will try training the ResNet with a real dataset to see whether the loss decreases.
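The dist_new fix can be illustrated with a toy 1-D batch norm (a simplified, hypothetical sketch, not SINGA's actual `_BatchNorm2d`; the class name, `momentum` default, and plain-list math are all illustrative assumptions):

```python
# Hypothetical sketch of the dist_new style: running statistics are
# constructor state, so forward only receives differentiable inputs
# (x, scale, bias) and backward needs exactly three gradients.

class BatchNormDistStyle:
    def __init__(self, running_mean, running_var, momentum=0.9):
        self.running_mean = running_mean
        self.running_var = running_var
        self.momentum = momentum

    def forward(self, x, scale, bias):
        # toy 1-D batch norm over a list of floats
        n = len(x)
        mean = sum(x) / n
        var = sum((v - mean) ** 2 for v in x) / n
        # running stats are updated as side effects, never exposed
        # to the autograd engine as forward inputs
        self.running_mean = (self.momentum * self.running_mean
                             + (1 - self.momentum) * mean)
        self.running_var = (self.momentum * self.running_var
                            + (1 - self.momentum) * var)
        eps = 1e-5
        return [scale * (v - mean) / (var + eps) ** 0.5 + bias
                for v in x]

bn = BatchNormDistStyle(running_mean=0.0, running_var=1.0)
y = bn.forward([1.0, 2.0, 3.0], scale=1.0, bias=0.0)
```

Because only three inputs pass through forward, the src/gradient counts agree; whether the running statistics are still updated correctly during training is exactly the open question above.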