chrishkchris commented on a change in pull request #468: Distributted module URL: https://github.com/apache/incubator-singa/pull/468#discussion_r311056639
########## File path: src/api/config.i ########## @@ -0,0 +1,33 @@ +// Licensed to the Apache Software Foundation (ASF) under one +// or more contributor license agreements. See the NOTICE file +// distributed with this work for additional information +// regarding copyright ownership. The ASF licenses this file +// to you under the Apache License, Version 2.0 (the +// "License"); you may not use this file except in compliance +// with the License. You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, +// software distributed under the License is distributed on an +// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY +// KIND, either express or implied. See the License for the +// specific language governing permissions and limitations +// under the License. + + + +// Pass in cmake configurations to swig +#define USE_CUDA 1 +#define USE_CUDNN 1 +#define USE_OPENCL 0 +#define USE_PYTHON 1 +#define USE_MKLDNN 1 +#define USE_JAVA 0 +#define CUDNN_VERSION 7401 + +// SINGA version +#define SINGA_MAJOR_VERSION 1 Review comment: Updated on 6th August: I removed a bug in the commit 0616000 which concerns the number of parameters in the all-reduce. Then I did a 8 * K80 multi-GPUs training and evaluation test with a simple MNIST dataset on simple CNN. It reduces the training loss from 802.7 to 42.2 in about 30 Epochs: ``` Epoch=0: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 92.86it/s]Training loss = 802.659485, training accuracy = 0.713825 Test accuracy = 0.920025 Epoch=1: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 93.42it/s]Training loss = 246.589371, training accuracy = 0.916767 Test accuracy = 0.956106 Epoch=2: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 94.04it/s]Training loss = 175.012894, training accuracy = 0.941106 Test accuracy = 0.967208 Epoch=3: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 95.66it/s] Training loss = 144.684052, training accuracy = 0.951539 Test accuracy = 0.970806 Epoch=4: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.59it/s]Training loss = 120.399704, training accuracy = 0.959402 Test accuracy = 0.976049 Epoch=5: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.79it/s]Training loss = 107.832191, training accuracy = 0.963709 Test accuracy = 0.975946 Epoch=6: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.70it/s]Training loss = 96.289490, training accuracy = 0.967014 Test accuracy = 0.979441 Epoch=7: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.34it/s]Training loss = 88.031815, training accuracy = 0.970436 Test accuracy = 0.980983 Epoch=8: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 101.81it/s]Training loss = 79.349884, training accuracy = 0.973090 Test accuracy = 0.980058 Epoch=9: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 101.82it/s]Training loss = 77.825607, training accuracy = 0.974342 Test accuracy = 0.977282 Epoch=10: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 101.97it/s]Training loss = 74.710297, training accuracy = 0.974576 Test accuracy = 0.983861 Epoch=11: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 101.98it/s]Training loss = 69.400230, training accuracy = 0.976162 Test accuracy = 0.982936 Epoch=12: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.03it/s]Training loss = 65.100449, training accuracy = 0.978148 Test accuracy = 0.983553 Epoch=13: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.17it/s]Training loss = 65.113991, training accuracy = 0.978249 Test accuracy = 0.986534 Epoch=14: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 101.83it/s]Training loss = 63.065636, training accuracy = 0.978566 Test accuracy = 0.984683 Epoch=15: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.11it/s]Training loss = 58.334709, training accuracy = 0.980018 Test accuracy = 0.983758 Epoch=16: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.16it/s]Training loss = 58.280094, training accuracy = 0.980285 Test accuracy = 0.983655 Epoch=17: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.15it/s]Training loss = 53.226196, training accuracy = 0.981420 Test accuracy = 0.985197 Epoch=18: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.15it/s]Training loss = 55.968140, training accuracy = 0.980786 Test accuracy = 0.982422 Epoch=19: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.14it/s]Training loss = 52.761921, training accuracy = 0.982489 Test accuracy = 0.985814 Epoch=20: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 101.86it/s]Training loss = 51.989666, training accuracy = 0.982973 Test accuracy = 0.983758 Epoch=21: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 101.91it/s]Training loss = 52.571381, training accuracy = 0.982455 Test accuracy = 0.987973 Epoch=22: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 101.99it/s]Training loss = 49.347313, training accuracy = 0.983140 Test accuracy = 0.986637 Epoch=23: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 101.93it/s]Training loss = 49.053402, training accuracy = 0.983674 Test accuracy = 0.985814 Epoch=24: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 99.28it/s] Training loss = 46.263908, training accuracy = 0.984442 Test accuracy = 0.986431 Epoch=25: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 104.22it/s]Training loss = 46.021286, training accuracy = 0.984275 Test accuracy = 0.987664 Epoch=26: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 103.67it/s]Training loss = 45.950298, training accuracy = 0.984091 Test accuracy = 0.986534 Epoch=27: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.87it/s]Training loss = 43.926952, training accuracy = 0.984675 Test accuracy = 0.987150 Epoch=28: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.89it/s]Training loss = 44.020412, training accuracy = 0.985110 Test accuracy = 0.983450 Epoch=29: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 103.06it/s]Training loss = 41.906254, training accuracy = 0.985744 Test accuracy = 0.984375 Epoch=30: 100%|¦¦¦¦¦¦¦¦¦¦| 117/117 [00:01<00:00, 102.93it/s]Training loss = 42.237778, training accuracy = 0.985527 Test accuracy = 0.987664 ``` The following is the code used: ``` # # Licensed to the Apache Software Foundation (ASF) under one # or more contributor license agreements. See the NOTICE file # distributed with this work for additional information # regarding copyright ownership. The ASF licenses this file # to you under the Apache License, Version 2.0 (the # "License"); you may not use this file except in compliance # with the License. You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, # software distributed under the License is distributed on an # "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY # KIND, either express or implied. See the License for the # specific language governing permissions and limitations # under the License. # # the code is modified from # https://github.com/pytorch/vision/blob/master/torchvision/models/resnet.py try: import pickle except ImportError: import cPickle as pickle from singa import singa_wrap as singa from singa import autograd from singa import tensor from singa import device from singa import opt import numpy as np from tqdm import trange import os import urllib.request import gzip import codecs def load_dataset(): train_x_url = 'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz' train_y_url = 'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz' valid_x_url = 'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz' valid_y_url = 'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz' train_x = read_image_file(check_exist_or_download(train_x_url)).astype( np.float32) train_y = read_label_file(check_exist_or_download(train_y_url)).astype( np.float32) valid_x = read_image_file(check_exist_or_download(valid_x_url)).astype( np.float32) valid_y = read_label_file(check_exist_or_download(valid_y_url)).astype( np.float32) return train_x, train_y, valid_x, valid_y def check_exist_or_download(url): download_dir = '/tmp/' name = url.rsplit('/', 1)[-1] filename = os.path.join(download_dir, name) if not os.path.isfile(filename): print("Downloading %s" % url) urllib.request.urlretrieve(url, filename) return filename def read_label_file(path): with gzip.open(path, 'rb') as f: data = f.read() assert get_int(data[:4]) == 2049 length = get_int(data[4:8]) parsed = np.frombuffer(data, dtype=np.uint8, offset=8).reshape( (length)) return parsed def get_int(b): return int(codecs.encode(b, 'hex'), 16) def read_image_file(path): with gzip.open(path, 'rb') as f: data = f.read() assert get_int(data[:4]) == 2051 length = get_int(data[4:8]) num_rows = get_int(data[8:12]) num_cols = get_int(data[12:16]) parsed = np.frombuffer(data, dtype=np.uint8, offset=16).reshape( (length, 1, num_rows, num_cols)) return parsed def normalize_for_resnet(train_x, test_x): mean=[0.4914, 0.4822, 0.4465] std=[0.2023, 0.1994, 0.2010] train_x /= 255 test_x /= 255 for ch in range(0,2): train_x[:, ch, :, :] -= mean[ch] train_x[:, ch, :, :] /= std[ch] test_x[:, ch, :, :] -= mean[ch] test_x[:, ch, :, :] /= std[ch] return train_x, test_x def augmentation(x, batch_size): xpad = np.pad(x, [[0, 0], [0, 0], [4, 4], [4, 4]], 'symmetric') for data_num in range(0, batch_size): offset = np.random.randint(8, size=2) x[data_num,:,:,:] = xpad[data_num, :, offset[0]: offset[0] + 28, offset[1]: offset[1] + 28] if_flip = np.random.randint(2) if (if_flip): x[data_num, :, :, :] = x[data_num, :, :, ::-1] return x def accuracy(pred, target): y = np.argmax(pred, axis=1) t = np.argmax(target, axis=1) a = y == t return np.array(a, "int").sum() def to_categorical(y, num_classes): """ Converts a class vector (integers) to binary class matrix. Args y: class vector to be converted into a matrix (integers from 0 to num_classes). num_classes: total number of classes. Return A binary matrix representation of the input. """ y = np.array(y, dtype="int") n = y.shape[0] categorical = np.zeros((n, num_classes)) categorical[np.arange(n), y] = 1 categorical = categorical.astype(np.float32) return categorical def accuracy(pred, target): y = np.argmax(pred, axis=1) t = np.argmax(target, axis=1) a = y == t return np.array(a, "int").sum() class CNN: def __init__(self): self.conv1 = autograd.Conv2d(1, 20, 5, padding=0) self.conv2 = autograd.Conv2d(20, 50, 5, padding=0) self.linear1 = autograd.Linear(4 * 4 * 50, 500) self.linear2 = autograd.Linear(500, 10) self.pooling1 = autograd.MaxPool2d(2, 2, padding=0) self.pooling2 = autograd.MaxPool2d(2, 2, padding=0) def forward(self, x): y = self.conv1(x) y = autograd.relu(y) y = self.pooling1(y) y = self.conv2(y) y = autograd.relu(y) y = self.pooling2(y) y = autograd.flatten(y) y = self.linear1(y) y = autograd.relu(y) y = self.linear2(y) return y def data_partition(dataset_x, dataset_y, rank_in_global, world_size): data_per_rank = dataset_x.shape[0] // world_size idx_start = rank_in_global * data_per_rank idx_end = (rank_in_global + 1) * data_per_rank return dataset_x[idx_start: idx_end], dataset_y[idx_start: idx_end] def sychronize(tensor, dist_opt): singa.synch(tensor.data, dist_opt.communicator) # cannot use tensor/=dist_opt.world_size because "/=" not in place, but "-=" is in place tensor -= (dist_opt.world_size - 1) * tensor / dist_opt.world_size if __name__ == '__main__': sgd = opt.SGD(lr=0.04, momentum=0.9, weight_decay=1e-5) sgd = opt.DistOpt(sgd) # load data train_x, train_y, test_x, test_y = load_dataset() # normalization train_x = train_x / 255 test_x = test_x / 255 num_classes=10 train_y = to_categorical(train_y, num_classes) test_y = to_categorical(test_y, num_classes) train_x, train_y = data_partition(train_x, train_y, sgd.rank_in_global, sgd.world_size) test_x, test_y = data_partition(test_x, test_y, sgd.rank_in_global, sgd.world_size) #print(train_y[0]) print(np.shape(train_x)) print(np.shape(train_y)) # create model model = CNN() print('Start intialization............') dev = device.create_cuda_gpu_on(sgd.rank_in_local) max_epoch = 100 batch_size = 64 IMG_SIZE = 28 tx = tensor.Tensor((batch_size, 1, IMG_SIZE, IMG_SIZE), dev, tensor.float32) ty = tensor.Tensor((batch_size, num_classes), dev, tensor.int32) num_train_batch = train_x.shape[0] // batch_size num_test_batch = test_x.shape[0] // batch_size idx = np.arange(train_x.shape[0], dtype=np.int32) reducer = tensor.Tensor((1,), dev, tensor.float32) #allreduce the initialize parameter autograd.training = True #x = np.zeros(shape=[batch_size, 1, IMG_SIZE, IMG_SIZE], dtype=np.float32) #y = np.zeros(shape=[batch_size], dtype=np.int32) x = np.random.randn(batch_size, 1, IMG_SIZE, IMG_SIZE).astype(np.float32) y = np.zeros( shape=(batch_size, num_classes), dtype=np.int32) tx.copy_from_numpy(x) ty.copy_from_numpy(y) out = model.forward(tx) loss = autograd.softmax_cross_entropy(out, ty) for p, g in autograd.backward(loss): #p=sgd.all_reduce(p) sychronize(p, sgd) for epoch in range(max_epoch): np.random.shuffle(idx) #Training Phase autograd.training = True train_correct = np.zeros(shape=[1],dtype=np.float32) test_correct = np.zeros(shape=[1],dtype=np.float32) train_loss = np.zeros(shape=[1],dtype=np.float32) with trange(num_train_batch) as t: t.set_description('Epoch={}'.format(epoch)) for b in t: x = train_x[idx[b * batch_size: (b + 1) * batch_size]] x = augmentation(x, batch_size) y = train_y[idx[b * batch_size: (b + 1) * batch_size]] tx.copy_from_numpy(x) ty.copy_from_numpy(y) out = model.forward(tx) loss = autograd.softmax_cross_entropy(out, ty) train_correct += accuracy(tensor.to_numpy(out), y) train_loss += tensor.to_numpy(loss)[0] for p, g in autograd.backward(loss): sgd.update(p, g) #reduce the accuracy from multiple device reducer.copy_from_numpy(train_correct) reducer=sgd.all_reduce(reducer) train_correct = tensor.to_numpy(reducer) #reduce the loss from multiple device reducer.copy_from_numpy(train_loss) reducer=sgd.all_reduce(reducer) train_loss = tensor.to_numpy(reducer) * sgd.world_size if(sgd.rank_in_global==0): print('Training loss = %f, training accuracy = %f' % (train_loss, train_correct / (num_train_batch*batch_size)), flush=True) #Evaulation Phase autograd.training = False for b in range(num_test_batch): x = test_x[b * batch_size: (b + 1) * batch_size] y = test_y[b * batch_size: (b + 1) * batch_size] tx.copy_from_numpy(x) ty.copy_from_numpy(y) out_test = model.forward(tx) test_correct += accuracy(tensor.to_numpy(out_test), y) reducer.copy_from_numpy(test_correct) reducer=sgd.all_reduce(reducer) test_correct = tensor.to_numpy(reducer) if(sgd.rank_in_global==0): print('Test accuracy = %f' % (test_correct / (num_test_batch*(batch_size))), flush=True) ``` ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services