This is an automated email from the ASF dual-hosted git repository.

niketanpansare pushed a commit to branch gh-pages
in repository https://gitbox.apache.org/repos/asf/systemml.git

commit 9deb19ca8092b20a4cebcb9bdbc91fb444b1918b
Author: Niketan Pansare <npan...@us.ibm.com>
AuthorDate: Mon Mar 25 12:33:50 2019 -0700

    [SYSTEMML-540] Added looped_minibatch training algorithm in Keras2DML

    - This algorithm performs multiple forward-backward passes (as many as the
      `parallel_batches` parameter specifies) with the given batch size,
      aggregates the gradients, and finally updates the model.
    - Updated the documentation.
---
 beginners-guide-caffe2dml.md |  2 +-
 beginners-guide-keras2dml.md | 35 ++++++++++++++++++++++++++++++++++-
 2 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/beginners-guide-caffe2dml.md b/beginners-guide-caffe2dml.md
index 8814283..db74feb 100644
--- a/beginners-guide-caffe2dml.md
+++ b/beginners-guide-caffe2dml.md
@@ -161,7 +161,7 @@ Iter:2000, validation loss:173.66147359346, validation accuracy:97.4897540983606
 Unlike Caffe, where the default train and test algorithm is `minibatch`, you can specify the
 algorithm using the parameters `train_algo` and `test_algo` (valid values are: `minibatch`, `allreduce_parallel_batches`,
-and `allreduce`). Here are some common settings:
+`looped_minibatch`, and `allreduce`). Here are some common settings:
 
 | | PySpark script | Changes to Network/Solver |
 |---|---|---|
diff --git a/beginners-guide-keras2dml.md b/beginners-guide-keras2dml.md
index 4517be5..2259397 100644
--- a/beginners-guide-keras2dml.md
+++ b/beginners-guide-keras2dml.md
@@ -208,4 +208,37 @@ For example: for the expression `Keras2DML(..., display=100, test_iter=10, test_
 To verify that Keras2DML produces the same results as the other Keras backends, we have
 [Python unit tests](https://github.com/apache/systemml/blob/master/src/main/python/tests/test_nn_numpy.py)
 that compare the results of Keras2DML with those of TensorFlow. We assume that the Keras team ensures that all their backends are consistent with their TensorFlow backend.
-
+#### How can I train very deep models on GPU?
+
+Unlike Keras, where the default train and test algorithm is `minibatch`, you can specify the
+algorithm using the parameters `train_algo` and `test_algo` (valid values are: `minibatch`, `allreduce_parallel_batches`,
+`looped_minibatch`, and `allreduce`). Here are some common settings:
+
+| | PySpark script | Changes to Network/Solver |
+|---|---|---|
+| Single-node CPU execution (similar to Caffe with solver_mode: CPU) | `lenet.set(train_algo="minibatch", test_algo="minibatch")` | Ensure that `batch_size` is set to an appropriate value (for example: 64) |
+| Single-node single-GPU execution | `lenet.set(train_algo="minibatch", test_algo="minibatch").setGPU(True).setForceGPU(True)` | Ensure that `batch_size` is set to an appropriate value (for example: 64) |
+| Single-node multi-GPU execution (similar to Caffe with solver_mode: GPU) | `lenet.set(train_algo="allreduce_parallel_batches", test_algo="minibatch", parallel_batches=num_gpu).setGPU(True).setForceGPU(True)` | Ensure that `batch_size` is set to an appropriate value (for example: 64) |
+| Distributed prediction | `lenet.set(test_algo="allreduce")` | |
+| Distributed synchronous training | `lenet.set(train_algo="allreduce_parallel_batches", parallel_batches=num_cluster_cores)` | Ensure that `batch_size` is set to an appropriate value (for example: 64) |
+
+Here are high-level guidelines to train very deep models on GPU with Keras2DML (and Caffe2DML):
+
+1. If at least one layer/operator does not fit on the device, allow SystemML's optimizer to perform operator placement based on its memory estimates: `sysml_model.setGPU(True)`.
+2. If each individual layer/operator fits on the device, but the entire network does not fit even with a batch size of 1, then
+- Rely on SystemML's GPU Memory Manager to perform automatic eviction (recommended): `sysml_model.setGPU(True) # Optional: .setForceGPU(True)`
+- Or enable NVIDIA's Unified Memory: `sysml_model.setConfigProperty('sysml.gpu.memory.allocator', 'unified_memory')`
+3. If the entire neural network does not fit in the GPU memory with the user-specified `batch_size`, but fits with a smaller `local_batch_size` (where `1 < local_batch_size < batch_size`), then
+- Use either of the two options above.
+- Or use a `train_algo` that performs multiple forward-backward passes with batch size `local_batch_size`, aggregates the gradients, and finally updates the model:
+```python
+sysml_model = Keras2DML(spark, keras_model, batch_size=local_batch_size)
+sysml_model.set(train_algo="looped_minibatch", parallel_batches=int(batch_size/local_batch_size))
+sysml_model.setGPU(True).setForceGPU(True)
+```
+- Or add `int(batch_size/local_batch_size)` GPUs and perform single-node multi-GPU training with batch size `local_batch_size`:
+```python
+sysml_model = Keras2DML(spark, keras_model, batch_size=local_batch_size)
+sysml_model.set(train_algo="allreduce_parallel_batches", parallel_batches=int(batch_size/local_batch_size))
+sysml_model.setGPU(True).setForceGPU(True)
+```
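
For readers who want to try the `looped_minibatch` algorithm this commit introduces, here is a minimal end-to-end sketch. It is not part of the commit: the LeNet-style network, the random MNIST-shaped data, the optimizer settings, and the batch sizes are all illustrative assumptions; only the `Keras2DML` construction, the `set(train_algo=..., parallel_batches=...)` call, and the GPU calls mirror the snippets in the diff above.

```python
from pyspark.sql import SparkSession
import numpy as np
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.optimizers import SGD
from systemml.mllearn import Keras2DML

spark = SparkSession.builder.getOrCreate()

# Hypothetical LeNet-style network on MNIST-shaped inputs.
keras_model = Sequential([
    Conv2D(32, (5, 5), activation='relu', input_shape=(28, 28, 1)),
    MaxPooling2D(pool_size=(2, 2)),
    Flatten(),
    Dense(10, activation='softmax')])
keras_model.compile(loss='categorical_crossentropy',
                    optimizer=SGD(lr=0.01, momentum=0.9))

batch_size = 512       # effective batch size (assumed too large for GPU memory)
local_batch_size = 64  # per-pass batch size that fits on the device

# looped_minibatch: run batch_size/local_batch_size forward-backward passes,
# aggregate the gradients, then apply a single model update.
sysml_model = Keras2DML(spark, keras_model, batch_size=local_batch_size)
sysml_model.set(train_algo="looped_minibatch",
                parallel_batches=int(batch_size / local_batch_size))
sysml_model.setGPU(True).setForceGPU(True)

# Random stand-in data: 1024 flattened 28x28 images with labels 0-9.
X_train = np.random.rand(1024, 784)
y_train = np.random.randint(0, 10, size=1024)
sysml_model.fit(X_train, y_train)
```

Swapping `train_algo="allreduce_parallel_batches"` into the same sketch gives the single-node multi-GPU variant shown in the diff.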