Hi Ce,

Thanks for the clarification! I think I have grasped the high-level
abstractions of CcT. We will come back for your help (e.g., to transfer
some code) after finishing the GPU version.
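To double-check my reading, here is a minimal sketch of the two Layer
types as I understand them; the class names and signatures are my own
guesses, not CcT's actual API:

// Sketch only: a shared Layer interface, a single-device
// implementation, and a multi-device wrapper that fans out to threads.
#include <memory>
#include <thread>
#include <vector>

struct Pointer { /* device-agnostic handle to a data blob */ };

class Layer {                              // shared interface
 public:
  virtual ~Layer() = default;
  virtual void Forward(const Pointer& in, Pointer* out) = 0;
};

class SingleDeviceLayer : public Layer {   // one worker on one device
 public:
  void Forward(const Pointer& in, Pointer* out) override {
    // dereference `in` via this device's Driver, compute, fill `out`
  }
};

class MultiDeviceLayer : public Layer {    // fans out to sub-layers
 public:
  void Forward(const Pointer& in, Pointer* out) override {
    std::vector<std::thread> threads;
    for (auto& sub : sublayers_) {
      Layer* s = sub.get();
      // in CcT each sub-layer would get its own slice of the batch;
      // they share in/out here only to keep the sketch short
      threads.emplace_back([s, &in, out] { s->Forward(in, out); });
    }
    for (auto& t : threads) t.join();      // wait, then aggregate
  }
 private:
  std::vector<std::unique_ptr<Layer>> sublayers_;
};

If this matches, wrapping each Layer as a library call (Pointer in,
Pointer out) should be straightforward on our side.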
Regards,
Wei

On Sat, Jul 25, 2015 at 12:13 PM, Ce Zhang <czh...@cs.wisc.edu> wrote:

>> How do you parallelize the training on CPU and GPU, by creating two
>> threads each with one (Caffe) Solver?
>> Or does each Layer partition the mini-batch and dispatch the parts
>> onto two threads (one using the GPU Driver and one using the CPU
>> Driver) after receiving the Pointer?
>
> The abstraction is like this. A Layer is an abstraction that takes as
> input a Pointer and fills an output Pointer. This allows us to have
> two types of Layers:
> 1. A Single-Device Layer, which corresponds to one worker that only
> works on a single device.
> 2. A Multi-Device Layer, which creates multiple single-device layers,
> starts one thread per single-device layer, calls their
> forward/backward functions via pthread, and joins these threads.
>
> Both types of Layers are subclasses of the same Layer class, so they
> share the same interface.
>
> With this abstraction, I think this corresponds to your second
> strategy.
>
>> Do I miss any other features of CcT?
>
> I think these 4 steps are exactly right, and they should be enough to
> integrate all features of CcT!
>
> One very minor thing to note: the parallelization we talked about
> maintains a replica of the model on each device and partitions the
> data. It is also possible to partition the model instead, which might
> work better for layers like FC--I think a similar tradeoff appears in
> SINGA's technical report, so I am pretty sure this case will be
> handled efficiently! :)
>
> Thanks again for integrating CcT into SINGA! Let us know if you have
> any questions!
>
> Ce
>
> On Fri, Jul 24, 2015 at 10:40 PM, Wang Wei <wang...@comp.nus.edu.sg> wrote:
>
>> Hi Ce,
>>
>> Thanks for your explanation.
>> I read the example training configuration of CcT on GitHub, and have
>> some further questions on the parallelization.
>>
>> How do you parallelize the training on CPU and GPU, by creating two
>> threads each with one (Caffe) Solver?
>> Or does each Layer partition the mini-batch and dispatch the parts
>> onto two threads (one using the GPU Driver and one using the CPU
>> Driver) after receiving the Pointer?
>>
>> According to the configuration file
>> (https://github.com/nudles/CaffeConTroll/blob/master/tests/imagenet_train/train_val/alexnet_train_val_1GPU_CPU.prototxt),
>> it seems the parallelization is at the Layer level (i.e., the second
>> option). Specifically, consider two connected layers A->B (e.g., conv
>> and relu). If the batch partitioning for A is 0.6 on GPU and 0.4 on
>> CPU, while the partitioning for B is 1.0 (i.e., all) on GPU, then the
>> computation for B is blocked until the computation for A finishes.
>> Consequently, synchronization happens at every layer.
>>
>> For SINGA, we use a worker-server architecture (similar to a
>> parameter server). Currently, we support both synchronous and
>> asynchronous training on CPU. We refer to the training framework
>> implemented by CcT as synchronous training.
>> SINGA implements this framework by partitioning the neural network
>> within one worker group. Each worker runs in its own thread.
>> The partitioning is currently done at the Layer level. Users can
>> configure it to be on dimension 0 or 1 of the feature blob.
>> For dimension 0, one mini-batch is partitioned across workers as in
>> CcT (but we use equal partitioning).
>> For dimension 1, one layer (with, say, 4096 neurons) is partitioned
>> into sub-layers (with 1024 neurons each if there are 4 workers).
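For concreteness, a minimal sketch of the two partitioning dimensions
described above; the Partition struct and Split helper are illustrative,
not SINGA's actual API:

// Equal partitioning of one dimension of a feature blob across workers.
#include <vector>

struct Partition { int dim, offset, count; };  // slice along one dimension

std::vector<Partition> Split(int dim, int total, int nworkers) {
  std::vector<Partition> parts;
  int step = total / nworkers;                 // equal partitioning
  for (int k = 0; k < nworkers; ++k)
    parts.push_back({dim, k * step, step});
  return parts;
}

// Split(0, 256, 4)  -> four 64-sample slices of the mini-batch (like CcT)
// Split(1, 4096, 4) -> four 1024-neuron sub-layers of one FC layer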
>> After partitioning, each (sub-)layer is assigned a location ID,
>> i.e., the ID of the worker to which the (sub-)layer will be
>> dispatched.
>> We will also support partitioning at the neural network level, i.e.,
>> letting users configure the location ID of each layer.
>> During training, each worker has the full neural network structure,
>> but it only visits (e.g., in the forward pass) the layers that are
>> dispatched to it (based on location ID and worker ID).
>>
>> To support CcT in SINGA:
>> 1. We first need to support GPU training (should be done in August).
>> 2. Update the neural network partitioning function to integrate
>> CcT's scheduling strategy.
>> 3. Make the Worker class a template, and create GPU workers and CPU
>> workers after partitioning the neural network.
>> 4. I think the Lowering and Shifting techniques are easy to
>> integrate as a library if they are independent of the devices.
>>
>> Do I miss any other features of CcT?
>>
>> Regards,
>> Wei
>>
>>> A Layer takes as input a Pointer, and calls the `dereference`
>>> function of the Pointer to obtain a local copy (w.r.t. the driver)
>>> of the data. Therefore, the Layer object does not know where the
>>> input data comes from.
>>>
>>> To run operations like GEMM and Lowering, a Layer calls the Driver,
>>> which provides a unified interface across devices.
>>
>> The Layer, Pointer and Driver abstractions are clear and easy to
>> understand.
>>
>>> I think it is possible for you to compile each layer as a library
>>> that you can call, which takes as input a Pointer object and fills
>>> in another Pointer object as output.
>>>
>>>> How do you synchronize the parameters trained on CPU and GPU,
>>>> using the implementation of Hogwild from Caffe?
>>>
>>> Currently, we parallelize inside a batch--if the batch size is 256,
>>> we might put 200 samples on the GPU and 56 on the CPU. After their
>>> results come back, we aggregate them. This means that our current
>>> result is exactly the same as a single-thread run.
>>>
>>> For AlexNet with batch size 256 (the one most papers use), we
>>> observe that this strategy gives an almost linear speedup even with
>>> 4 Amazon EC2 GPUs.
>>>
>>> Of course, Hogwild! or parameter servers are a natural direction to
>>> further scale up the current system when the number of
>>> computational devices increases and the aggregation time starts to
>>> dominate...
>>>
>>>> Which parts of Caffe have you changed (we also borrow some code
>>>> from Caffe, so we know its structure)?
>>>
>>> We borrow the parser code, loader code, and protobuf code. The main
>>> reason is to make sure CcT is compatible with Caffe. I think we
>>> rewrote most other layers, especially CONV. For faster layers like
>>> ReLU, our code is very similar to Caffe's.
>>>
>>> Let us know if you have any questions!
>>>
>>> Ce
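To make the intra-batch split above concrete, a tiny sketch (the
200/56 split is taken from Ce's description; the Aggregate function
itself is illustrative, not CcT's code):

// Gradients are sums over samples, so adding the GPU and CPU partial
// sums and dividing by the batch size reproduces the single-thread
// result for the full 256-sample batch.
#include <cstddef>
#include <vector>

std::vector<float> Aggregate(const std::vector<float>& grad_gpu,  // 200 samples
                             const std::vector<float>& grad_cpu)  // 56 samples
{
  std::vector<float> grad(grad_gpu.size());
  for (std::size_t i = 0; i < grad.size(); ++i)
    grad[i] = (grad_gpu[i] + grad_cpu[i]) / 256.0f;  // mean over full batch
  return grad;
}

The split only changes where the partial sums are computed, which is
why the hybrid run matches a single-thread run.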
>>> On Wed, Jul 22, 2015 at 10:49 PM, Wang Wei <wang...@comp.nus.edu.sg> wrote:
>>>
>>>> On Thu, Jul 23, 2015 at 11:47 AM, Wang Wei <wang...@comp.nus.edu.sg> wrote:
>>>>
>>>>> Hi Ce,
>>>>>
>>>>> Thanks for starting the discussion.
>>>>>
>>>>> We are preparing the documentation and tests for our first Apache
>>>>> Incubator release. I planned to contact Stefan after the first
>>>>> release, because the first release does not support GPU.
>>>>> There are some developers at NetEase working on the GPU
>>>>> implementation, which will be integrated for the second release.
>>>>>
>>>>> It is a good time to discuss the integration now. We can consider
>>>>> this new feature while implementing the GPU version.
>>>>> Currently, we use Blob (from Caffe) to manage memory and Mshadow
>>>>> for computation.
>>>>> To integrate Caffe con Troll, the ideal case is to make Caffe con
>>>>> Troll a library like Mshadow.
>>>>> I think at least the convolution optimization techniques
>>>>> (lowering, multiply, lifting) could be compiled as a library
>>>>> (correct?).
>>>>>
>>>>> How do you manage the memory across CPU and GPU?
>>>>> How do you synchronize the parameters trained on CPU and GPU,
>>>>> using the implementation of Hogwild from Caffe?
>>>>> Which parts of Caffe have you changed (we also borrow some code
>>>>> from Caffe, so we know its structure)?
>>>>>
>>>>> Thank you.
>>>>> (I cc'ed our dev mailing list to notify the developers working on
>>>>> GPU.)
>>>>>
>>>>> Regards,
>>>>> Wei Wang
>>>>>
>>>>> On Thu, Jul 23, 2015 at 9:47 AM, Ce Zhang <czh...@cs.wisc.edu> wrote:
>>>>>
>>>>>> Hi Wei,
>>>>>>
>>>>>> I am Ce from Wisconsin, in Chris Re's group.
>>>>>> I am one of the developers of Caffe con Troll
>>>>>> (the CNN system that is faster than Caffe on
>>>>>> CPU and can run hybrid between CPU and GPU).
>>>>>>
>>>>>> I think you and Stefan (CC'ed) chatted at SIGMOD
>>>>>> about the possibility of integrating Caffe con Troll into
>>>>>> Apache SINGA. We are very excited about this!
>>>>>>
>>>>>> We are very curious about how to make this happen,
>>>>>> e.g., what information you need from us to do
>>>>>> such an integration. This email aims at starting the
>>>>>> discussion, and we'd love to hear your opinions.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Ce
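For reference, a minimal im2col-style sketch of the lowering step
discussed in this thread (illustrative only: no stride or padding, and
not CcT's actual implementation):

// Lower a (C x H x W) input into a (C*K*K) x (out_h*out_w) matrix
// whose columns are K x K patches, so that convolution with F filters
// becomes one (F x C*K*K) * (C*K*K x out_h*out_w) GEMM ("multiply");
// "lifting"/shifting then maps the result back to tensor layout.
#include <cstddef>
#include <vector>

std::vector<float> Lower(const std::vector<float>& in,
                         int C, int H, int W, int K) {
  int out_h = H - K + 1, out_w = W - K + 1;
  std::vector<float> lowered(
      static_cast<std::size_t>(C) * K * K * out_h * out_w);
  std::size_t col = 0;
  for (int y = 0; y < out_h; ++y)
    for (int x = 0; x < out_w; ++x, ++col)      // one column per patch
      for (int c = 0; c < C; ++c)
        for (int ky = 0; ky < K; ++ky)
          for (int kx = 0; kx < K; ++kx) {
            std::size_t row =
                (static_cast<std::size_t>(c) * K + ky) * K + kx;
            lowered[row * (out_h * out_w) + col] =
                in[(static_cast<std::size_t>(c) * H + y + ky) * W + (x + kx)];
          }
  return lowered;
}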