XJDKC opened a new pull request #626: [WIP] SINGA-505 Computational graph with 
memory optimization
URL: https://github.com/apache/singa/pull/626
 
 
   # Overview
   This PR adds the computational graph with memory optimization. It is based 
on the code developed by @chrishkchris and @XJDKC and some discussions with 
@nudles.
   
   # Features
   This PR adds three main features: computational graph construction, lazy allocation, and automatic recycling. Details follow:
   * `Computational graph construction`: Construct a computational graph from the user-defined neural network or expressions, then run the graph to carry out the training task.
   * `Lazy allocation`: When a block is created, the device does not allocate memory for it immediately; memory is allocated only when an operation accesses the block for the first time.
   * `Automatic recycling`: While running the graph in an iteration, automatically deallocate intermediate tensors that will not be used by any later operation.
   
   # Design
   1. Computational graph construction
       * Use the technique of delayed execution to run the forward and backward propagation once without actually performing the computation. Buffer all the operations and the tensors read or written by each operation.
       * Calculate the dependencies between all the operations to decide the order of execution. (Directed graphs containing cycles are also supported.)
       * Execute all the operations in the calculated order to update all the parameters.
   2. Lazy allocation
       * When a device creates a new block, it only passes itself to the block instead of allocating a piece of memory from the mempool and passing the pointer to the block.
       * When the block is accessed for the first time, the device corresponding to the block allocates the memory and then performs the access.
   3. Automatic recycling
       * While calculating the dependencies between operations during graph construction, the reference count of each tensor is calculated as well.
       * When an operation completes, the reference count of every tensor it used is decreased.
       * If a tensor's reference count reaches zero, the tensor will not be accessed by any later operation, so its memory can be recycled. (A runnable sketch of these three mechanisms follows this list.)
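   
   The three mechanisms above can be pictured with a short, self-contained Python sketch. This is an illustration only, not SINGA's actual C++ scheduler: the names (`Op`, `run_graph`, the `keep` set) are hypothetical, and for brevity it schedules an acyclic graph, whereas the real implementation also supports graphs with cycles.
    ```Python
    from collections import defaultdict, deque

    class Op:
        """One buffered operation and the tensors it reads/writes."""
        def __init__(self, name, reads, writes):
            self.name, self.reads, self.writes = name, reads, writes

    def run_graph(ops, keep):
        # Dependency calculation: op B depends on op A if B reads a
        # tensor that A wrote earlier in the buffered order.
        writer = {}                  # tensor -> op that last wrote it
        edges = defaultdict(list)    # producer op -> consumer ops
        indeg = {op.name: 0 for op in ops}
        for op in ops:
            for t in op.reads:
                if t in writer:
                    edges[writer[t]].append(op.name)
                    indeg[op.name] += 1
            for t in op.writes:
                writer[t] = op.name

        # Reference counting: how many ops still read each tensor.
        ref = defaultdict(int)
        for op in ops:
            for t in op.reads:
                ref[t] += 1

        # Execute ops whose dependencies are satisfied.
        by_name = {op.name: op for op in ops}
        ready = deque(n for n, d in indeg.items() if d == 0)
        while ready:
            op = by_name[ready.popleft()]
            print("run", op.name)    # the real scheduler launches the kernel here
            for t in op.reads:       # automatic recycling
                ref[t] -= 1
                if ref[t] == 0 and t not in keep:
                    print("  recycle", t)
            for nxt in edges[op.name]:
                indeg[nxt] -= 1
                if indeg[nxt] == 0:
                    ready.append(nxt)

    # A tiny two-layer forward pass; parameters and I/O stay alive.
    ops = [Op("matmul1", ["x", "w0"], ["h"]),
           Op("relu",    ["h"],       ["a"]),
           Op("matmul2", ["a", "w1"], ["y"])]
    run_graph(ops, keep={"x", "w0", "w1", "y"})
    ```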
   
   # Changes
   * `Tensor` & `Operation`
       * Change the capture type of tensors in lambda expressions to achieve delayed execution.
       * Change the types of the input and output parameters to ensure that the inputs and outputs of every operation are tensors.
   * `Device`: Add code for
       * buffering operations
       * constructing the graph
       * calculating dependencies
       * executing the graph
   * `Block`: Add a member variable of type device to support lazy allocation, and a function to support automatic recycling (a sketch follows this list).
   * `Swig`: Add some interfaces.
   * `Examples`: Add some examples with operation buffering.
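   
   To make the `Block` change concrete, here is a rough Python analogue of the modified C++ class. It is a sketch only: `mutable_data` and `free_data` are simplified method names, and `device.malloc`/`device.free` stand in for the device mempool calls.
    ```Python
    class Block:
        """Sketch of the modified Block: it keeps a reference to its
        device and defers allocation until the data is first used."""

        def __init__(self, size, device):
            self.size = size
            self.device = device  # new member: enables lazy allocation
            self.data = None      # no memory is allocated at creation

        def mutable_data(self):
            # Lazy allocation: take memory from the device's mempool
            # only on first access.
            if self.data is None:
                self.data = self.device.malloc(self.size)
            return self.data

        def free_data(self):
            # Automatic recycling: called once the scheduler sees the
            # block's reference count drop to zero.
            if self.data is not None:
                self.device.free(self.data)
                self.data = None
    ```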
   
   # Evaluation
   * Experiment settings
       * Model: ResNet50 in [resnet.py](../tree/dev/examples/autograd/resnet.py)
       * GPU: Nvidia RTX 2080Ti
   * Result (`s` = second, `b` = batch):
    <table>
        <tr>
            <th style="text-align: center">Batch size</th>
            <th style="text-align: center">Case</th>
            <th style="text-align: center">Memory Usage (peak)</th>
            <th style="text-align: center">Throughput</th>
            <th style="text-align: center">Time</th>
            <th style="text-align: center">Memory Reduction Rate</th>
            <th style="text-align: center">Speedup</th>
        </tr>
        <tr>
            <td rowspan="3">16</td>
            <td>dev branch</td>
            <td>4961MB</td>
            <td>176.9182/s</td>
            <td>0.0903s/b</td>
            <td>0.00%</td>
            <td>1.0000</td>
        </tr>
        <tr>
            <td>PR (no graph)</td>
            <td>4961MB</td>
            <td>173.6726/s</td>
            <td>0.0921s/b</td>
            <td>0.00%</td>
            <td>0.9796</td>
        </tr>
        <tr>
            <td>PR (with graph)</td>
            <td>3105MB</td>
            <td>218.4999/s</td>
            <td>0.0732s/b</td>
            <td>37.41%</td>
            <td>1.2325</td>
        </tr>
        <tr>
            <td rowspan="3">32</td>
            <td>dev branch</td>
            <td>10113MB</td>
            <td>203.3363/s</td>
            <td>0.1574s/b</td>
            <td>0.00%</td>
            <td>1.0000</td>
        </tr>
        <tr>
            <td>PR (no graph)</td>
            <td>10123MB</td>
            <td>202.3836/s</td>
            <td>0.1581s/b</td>
            <td>-0.10%</td>
            <td>0.9953</td>
        </tr>
        <tr>
            <td>PR (with graph)</td>
            <td>6517MB</td>
            <td>234.1376/s</td>
            <td>0.1367s/b</td>
            <td>35.56%</td>
            <td>1.1515</td>
        </tr>
    </table>
   
   From the table above, we can see that:
   * With the graph disabled, this PR leaves training time and memory usage essentially unchanged, so it is backward compatible.
   * With the graph enabled, this PR significantly reduces both memory usage and training time.
   
   # How to use
   ```Python
   # Initialize the input tensors
   # ...
   
   # Buffer the operations
   dev.SetBufferFlag(True)
   x = autograd.matmul(inputs, w0)
   x = autograd.add_bias(x, b0)
   x = autograd.relu(x)
   x = autograd.matmul(x, w1)
   x = autograd.add_bias(x, b1)
   # x = autograd.softmax(x)
   loss = autograd.softmax_cross_entropy(x, target)
   for p, gp in autograd.backward(loss):
        sgd.apply(0, gp, p, "")
   dev.SetBufferFlag(False)
   
   # Run Graph
   print("start executing buffered functions")
   for i in range(1001):
       dev.ExecBuffOps()
   ```
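   
   Everything between `SetBufferFlag(True)` and `SetBufferFlag(False)` is only recorded, not executed; each subsequent call to `ExecBuffOps` then replays the buffered forward pass, backward pass, and parameter updates as one training iteration.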
   
   # Plan
   - [ ] Computational graph optimization: replace a subgraph of the input computational graph with another subgraph that is functionally equivalent to the original one.
   - [ ] Execute independent operations in parallel.
