Hi Ce,

Thanks for the clarification!
I think I now understand the high-level abstractions of CcT.
We will come back for your help (e.g., to port over some code) after
finishing the GPU version.

Regards,
Wei

On Sat, Jul 25, 2015 at 12:13 PM, Ce Zhang <czh...@cs.wisc.edu> wrote:

> >> How do you parallelize the training on CPU and GPU, by creating two
> threads each with one (Caffe) Solver?
> >> Or does each Layer partition the mini-batch and dispatch them onto two
> threads (one using the GPU Driver and one using the CPU Driver) after
> receiving the Pointer?
>
> The abstraction is like this. A Layer is an abstraction that takes as
> input a Pointer and fills an output Pointer. This allows us to have
> two types of Layers:
>    1. Single-Device Layer, which corresponds to one worker that works
> on only a single device.
>    2. Multi-Device Layer, which creates multiple single-device layers,
> starts one thread per single-device layer, calls their forward/backward
> functions via pthreads, and joins these threads.
>
> Both types of Layers are subclasses of the same Layer class, so they share
> the same interface.
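>
> In simplified C++ the idea looks roughly like this (illustrative class and
> member names only, not our exact code):
>
> #include <cstddef>
> #include <pthread.h>
> #include <vector>
>
> struct Pointer { void * data; /* plus device id, size, ... */ };
>
> class Layer {                              // shared interface
>  public:
>   virtual void forward(const Pointer & in, Pointer & out) = 0;
>   virtual void backward(const Pointer & in, Pointer & out) = 0;
>   virtual ~Layer() {}
> };
>
> class SingleDeviceLayer : public Layer {   // one worker, one device
>  public:
>   void forward(const Pointer & in, Pointer & out) override {
>     /* run the computation on this layer's device */
>   }
>   void backward(const Pointer & in, Pointer & out) override { /* ... */ }
> };
>
> struct ForwardArg { Layer * layer; const Pointer * in; Pointer * out; };
>
> static void * forward_thread(void * a) {
>   ForwardArg * arg = static_cast<ForwardArg *>(a);
>   arg->layer->forward(*arg->in, *arg->out);
>   return nullptr;
> }
>
> class MultiDeviceLayer : public Layer {    // fans out to single-device layers
>  public:
>   void forward(const Pointer & in, Pointer & out) override {
>     std::vector<pthread_t> threads(children_.size());
>     std::vector<ForwardArg> args(children_.size());
>     for (std::size_t i = 0; i < children_.size(); ++i) {
>       args[i] = {children_[i], &in, &out};  // in practice: each child gets its slice
>       pthread_create(&threads[i], nullptr, forward_thread, &args[i]);
>     }
>     for (pthread_t & t : threads) pthread_join(t, nullptr);
>   }
>   void backward(const Pointer & in, Pointer & out) override { /* analogous */ }
>  private:
>   std::vector<SingleDeviceLayer *> children_;
> };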
>
> With this abstraction, I think this corresponds to your second strategy.
>
> >> Am I missing any other features of CcT?
>
> I think these four steps are exactly right, and they should be enough
> to integrate all the features of CcT!
>
> One very minor thing to note is that the parallelization we talked about
> maintains a replica of the model on each device and partitions the data.
> It is also possible to partition the model instead, which might work better
> for layers like FC--I think a similar tradeoff is discussed in SINGA's
> technical report, so I am pretty sure this case will be handled
> efficiently! :)
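>
> As a toy illustration of the model-partitioning alternative (made-up code,
> not CcT or SINGA): each device would hold a column block of the FC weight
> matrix and compute its slice of the output for the whole batch, so the data
> is replicated and the model is split.
>
> #include <cstddef>
> #include <vector>
>
> // W is in_dim x out_dim; each device owns a contiguous block of columns
> // (slice_dim of them) and produces its slice of the output for every example.
> void FcSlice(const std::vector<float> & x,       // batch x in_dim, row-major
>              const std::vector<float> & w_cols,  // in_dim x slice_dim, row-major
>              std::vector<float> & y_slice,       // batch x slice_dim (output)
>              int batch, int in_dim, int slice_dim) {
>   y_slice.assign(static_cast<std::size_t>(batch) * slice_dim, 0.f);
>   for (int b = 0; b < batch; ++b)
>     for (int i = 0; i < in_dim; ++i)
>       for (int j = 0; j < slice_dim; ++j)
>         y_slice[b * slice_dim + j] += x[b * in_dim + i] * w_cols[i * slice_dim + j];
> }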
>
> Thanks again for integrating CcT into SINGA! Let us know if you have any
> questions!
>
> Ce
>
> On Fri, Jul 24, 2015 at 10:40 PM, Wang Wei <wang...@comp.nus.edu.sg>
> wrote:
>
>> Hi Ce,
>>
>> Thanks for your explanation.
>> I read CcT's example training configuration on GitHub, and have some
>> further questions on the parallelization.
>>
>> How do you parallelize the training on CPU and GPU, by creating two
>> threads each with one (Caffe) Solver?
>> Or does each Layer partition the mini-batch and dispatch them onto two
>> threads (one using GPU Driver and one using CPU Driver) after receiving the
>> Pointer?
>>
>> According to the configuration file (
>> https://github.com/nudles/CaffeConTroll/blob/master/tests/imagenet_train/train_val/alexnet_train_val_1GPU_CPU.prototxt),
>>
>> it seems the parallelization is at the Layer level (i.e., the second
>> strategy).
>> Specifically, consider two connected layers A->B (e.g., conv and relu).
>> If the batch partitioning for A is 0.6 on GPU and 0.4 on CPU, while the
>> partitioning for B is 1.0 (i.e., all) on GPU, then the computation for B
>> is blocked until the computation for A finishes.
>> Consequently, synchronization happens at every layer.
>>
>> For SINGA, we use the worker-server architecture (similar to parameter
>> server).
>> Currently, we support both synchronous and asynchronous training on CPU.
>> We refer to the training framework implemented by CcT as synchronous
>> training.
>> SINGA implements this framework by partitioning the neural network within
>> one worker group. Each worker runs in its own thread.
>> The partitioning is currently done at the Layer level. Users can configure
>> it to be on dimension 0 or 1 of the feature blob.
>> Dimension 0 means partitioning one mini-batch across workers as CcT does
>> (but we use equal partitioning).
>> Dimension 1 means partitioning one layer (e.g., with 4096 neurons) into
>> sub-layers (each with 1024 neurons if there are 4 workers).
>> After partitioning, each (sub-)layer is assigned a location ID,
>> i.e., the ID of the worker to which the (sub-)layer will be dispatched.
>> We will also support partitioning at the neural network level, i.e.,
>> letting users configure the location ID of each layer.
>> During training, each worker holds the full neural network structure,
>> but it only visits (e.g., in the forward pass) the layers that are
>> dispatched to it (based on location ID and worker ID).
>>
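>> For example, the two partitioning schemes roughly correspond to the
>> following toy sketch (made-up names, only to illustrate the description
>> above; not the actual SINGA code):
>>
>> #include <vector>
>>
>> struct SubLayer {
>>   int batch;     // dimension 0 of the feature blob
>>   int neurons;   // dimension 1 of the feature blob
>>   int location;  // ID of the worker this (sub-)layer is dispatched to
>> };
>>
>> // dim == 0: split the mini-batch equally across workers (like CcT's data split).
>> // dim == 1: split the neurons across workers (one layer into sub-layers).
>> std::vector<SubLayer> Partition(int batch, int neurons, int num_workers, int dim) {
>>   std::vector<SubLayer> subs;
>>   for (int w = 0; w < num_workers; ++w) {
>>     SubLayer s;
>>     s.batch    = (dim == 0) ? batch / num_workers   : batch;
>>     s.neurons  = (dim == 1) ? neurons / num_workers : neurons;
>>     s.location = w;  // worker w only visits (sub-)layers whose location == its ID
>>     subs.push_back(s);
>>   }
>>   return subs;
>> }
>>
>> // Partition(256, 4096, 4, 1) gives four sub-layers of 1024 neurons each,
>> // matching the example above.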
>>
>> To support CcT in SINGA, we need to:
>> 1. Support GPU training first (should be done in August).
>> 2. Update the neural network partitioning function to integrate CcT's
>> scheduling strategy.
>> 3. Make the Worker class a template, and create GPU workers and CPU
>> workers after partitioning the neural network (a rough sketch follows
>> below).
>> 4. Integrate the Lowering and Shifting techniques; I think they are easy
>> to compile as a library if they are independent of the devices.
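>>
>> A rough sketch of what I mean by the templated Worker in step 3
>> (hypothetical names, not the current SINGA classes):
>>
>> struct CpuDevice { /* wraps CPU kernels, e.g., CcT's CPU Driver */ };
>> struct GpuDevice { /* wraps GPU kernels, e.g., CcT's GPU Driver */ };
>>
>> template <typename Device>
>> class Worker {
>>  public:
>>   explicit Worker(int id) : id_(id) {}
>>   void RunOneBatch() {
>>     // Forward/backward only over the layers whose location ID equals id_,
>>     // calling into device_ for GEMM, lowering, etc.
>>   }
>>  private:
>>   int id_;
>>   Device device_;
>> };
>>
>> // After partitioning the neural network, we would create Worker<GpuDevice>
>> // and Worker<CpuDevice> instances and launch one thread per worker.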
>>
>> Am I missing any other features of CcT?
>>
>> Regards,
>> Wei
>>
>>> A Layer takes as input a Pointer, and calls the `dereference` function of
>>> Pointer to obtain a local copy (w.r.t. the driver) of the data.
>>> Therefore, the Layer object does not know where the input data comes from.
>>>
>>> To run operations like GEMM and Lowering, a Layer calls the Driver, which
>>> provides a unified interface across devices.
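>>>
>>> Schematically (simplified, illustrative declarations only, not our exact
>>> classes):
>>>
>>> class Pointer {
>>>  public:
>>>   // Returns data local to the calling driver's device, copying across
>>>   // devices first if necessary.
>>>   void * dereference(/* calling driver */);
>>> };
>>>
>>> class Driver {
>>>  public:
>>>   // One interface, device-specific implementations (CPU BLAS vs. cuBLAS, etc.).
>>>   virtual void gemm(/* A, B, C, dims, ... */) = 0;
>>>   virtual void lower(/* input blob -> lowered matrix */) = 0;
>>>   virtual ~Driver() {}
>>> };
>>>
>>> // A Layer's forward pass then only sees local data and a Driver:
>>> //   float * local_in = static_cast<float *>(input.dereference(...));
>>> //   driver.lower(...); driver.gemm(...);  // fill the output Pointer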
>>>
>>> The Layer, Pointer and Driver abstractions are clear and easy to
>> understand.
>>
>>> I think it is possible for you to compile each layer as a callable
>>> library that takes as input a Pointer object and fills in another
>>> Pointer object as output.
>>>
>>> >> How do you synchronize the parameters trained on CPU and GPU, using
>>> the implementation of Hogwild from Caffe?
>>>
>>> Currently, we parallelize inside a batch--if the batch size is 256, we
>>> might put 200 of the examples on the GPU and 56 of them on the CPU.
>>> After their results come back, we aggregate them. This means that our
>>> current result is exactly the same as a single-threaded run.
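>>>
>>> Schematically, the aggregation just sums the partial gradients from the
>>> two devices before the weight update (simplified illustration, not the
>>> real code):
>>>
>>> #include <cstddef>
>>>
>>> // Split 256 as 200 on the GPU + 56 on the CPU, run both partial backward
>>> // passes, then sum the partial weight gradients before the update, so the
>>> // result matches a single-device run over the full batch.
>>> void AggregateGradients(const float * grad_gpu, const float * grad_cpu,
>>>                         float * grad_total, std::size_t n) {
>>>   for (std::size_t i = 0; i < n; ++i)
>>>     grad_total[i] = grad_gpu[i] + grad_cpu[i];  // gradients are sums over examples
>>> }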
>>>
>>> For AlexNet with a batch size of 256 (the one most papers use), we
>>> observe that this strategy gives almost linear speedup even with 4
>>> Amazon EC2 GPUs.
>>>
>>> Of course, Hogwild! or parameter servers are a natural direction to
>>> further scale up the current system when the number of computational
>>> devices increases further and the aggregation time starts to dominate...
>>>
>>> >> Which parts of Caffe have you changed (we also borrow some code from
>>> Caffe, so we know its structure)?
>>>
>>> We borrow the parser code, loader code, and protobuf code, mainly to make
>>> sure CcT is compatible with Caffe. I think we rewrote most of the other
>>> layers, especially CONV. For faster layers like ReLU, our code is very
>>> similar to Caffe's.
>>>
>>> Let us know if you have any questions!
>>>
>>> Ce
>>>
>>> On Wed, Jul 22, 2015 at 10:49 PM, Wang Wei <wang...@comp.nus.edu.sg>
>>> wrote:
>>>
>>>>
>>>>
>>>> On Thu, Jul 23, 2015 at 11:47 AM, Wang Wei <wang...@comp.nus.edu.sg>
>>>> wrote:
>>>>
>>>>> Hi Ce,
>>>>>
>>>>> Thanks for starting the discussion.
>>>>>
>>>>> We are preparing documentation and tests for our first Apache
>>>>> Incubator release.
>>>>> I planned to contact Stefan after the first release, because the first
>>>>> release does not support GPU.
>>>>> Some developers at NetEase are working on the GPU implementation,
>>>>> which will be integrated in the second release.
>>>>>
>>>>> It is a good time to discuss the integration now. We can consider this
>>>>> new feature while implementing the GPU version.
>>>>> Currently, we use Blob (from Caffe) to manage memory and Mshadow for
>>>>> computation.
>>>>> To integrate Caffe con Troll, the ideal case is making Caffe con Troll
>>>>> a library like Mshadow.
>>>>> I think at least the convolution optimization techniques (lowering,
>>>>> multiply, lifting) could be compiled as a library (correct?).
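>>>>>
>>>>> For example, I imagine the lowering step could be exposed as a plain
>>>>> function like the following (a made-up signature just to illustrate
>>>>> what I mean by a library; padding omitted for simplicity):
>>>>>
>>>>> #include <cstddef>
>>>>>
>>>>> // Lowering (im2col): turn a C x H x W input into a
>>>>> // (C*K*K) x (out_h*out_w) matrix so that convolution becomes one GEMM,
>>>>> // independent of which device runs the GEMM afterwards.
>>>>> void Lower(const float * input, int channels, int height, int width,
>>>>>            int kernel, int stride, float * lowered) {
>>>>>   const int out_h = (height - kernel) / stride + 1;
>>>>>   const int out_w = (width  - kernel) / stride + 1;
>>>>>   std::size_t idx = 0;
>>>>>   for (int c = 0; c < channels; ++c)
>>>>>     for (int kh = 0; kh < kernel; ++kh)
>>>>>       for (int kw = 0; kw < kernel; ++kw)
>>>>>         for (int oh = 0; oh < out_h; ++oh)
>>>>>           for (int ow = 0; ow < out_w; ++ow)
>>>>>             lowered[idx++] =
>>>>>                 input[(c * height + oh * stride + kh) * width
>>>>>                       + ow * stride + kw];
>>>>> }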
>>>>>
>>>>> How do you manage memory across the CPU and GPU?
>>>>> How do you synchronize the parameters trained on CPU and GPU, using
>>>>> the implementation of Hogwild from Caffe?
>>>>> Which parts of Caffe have you changed (we also borrow some code from
>>>>> Caffe, so we know its structure)?
>>>>>
>>>>> Thank you.
>>>>> (I cc'ed our dev mailing list to notify the developers working on GPU.)
>>>>>
>>>>> Regards,
>>>>> Wei Wang
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jul 23, 2015 at 9:47 AM, Ce Zhang <czh...@cs.wisc.edu> wrote:
>>>>>
>>>>>> Hi Wei,
>>>>>>
>>>>>> I am Ce from Wisconsin, in Chris Re's group.
>>>>>> I am one of the developers of Caffe con Troll
>>>>>> (the CNN system that is faster than Caffe on
>>>>>> the CPU and can run in hybrid mode across the CPU and GPU).
>>>>>>
>>>>>> I think you and Stefan (CC'ed) chatted at SIGMOD
>>>>>> about the possibility of integrating Caffe con Troll into
>>>>>> Apache SINGA. We are very excited about this!
>>>>>>
>>>>>> We are very curious about how to make this happen,
>>>>>> e.g., what information you need from us to do
>>>>>> such an integration. This email aims to start this
>>>>>> discussion, and we'd love to hear your opinions.
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> Ce
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
