This is an automated email from the ASF dual-hosted git repository.

jrmccluskey pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git


The following commit(s) were added to refs/heads/master by this push:
     new 8e7e8fee387 Add info on large_model flag in LLM page (#30585)
8e7e8fee387 is described below

commit 8e7e8fee387276d2ada806652bda3906c18f1757
Author: Danny McCormick <dannymccorm...@google.com>
AuthorDate: Fri Mar 8 16:53:14 2024 -0500

    Add info on large_model flag in LLM page (#30585)
    
    * Add info on large_model flag in LLM page
    
    * wording
    
    * Simplify
    
    * Wording
---
 .../en/documentation/ml/large-language-modeling.md     | 18 ++++++++++++++----
 1 file changed, 14 insertions(+), 4 deletions(-)

diff --git 
a/website/www/site/content/en/documentation/ml/large-language-modeling.md 
b/website/www/site/content/en/documentation/ml/large-language-modeling.md
index 9148f5c28b9..79ef58e6de3 100644
--- a/website/www/site/content/en/documentation/ml/large-language-modeling.md
+++ b/website/www/site/content/en/documentation/ml/large-language-modeling.md
@@ -18,14 +18,22 @@ limitations under the License.
 # Large Language Model Inference in Beam
 In Apache Beam 2.40.0, Beam introduced the RunInference API, which lets you 
deploy a machine learning model in a Beam pipeline. A `RunInference` transform 
performs inference on a `PCollection` of examples using a machine learning (ML) 
model. The transform outputs a PCollection that contains the input examples and 
output predictions. For more information, see RunInference 
[here](/documentation/transforms/python/elementwise/runinference/). You can 
also find [inference examples on GitHub](h [...]
 
-
 ## Using RunInference with very large models
 RunInference works well on arbitrarily large models as long as they can fit on 
your hardware.
 
+### Memory Management
+
+RunInference has several mechanisms for reducing memory utilization. For 
example, by default RunInference loads at most a single copy of each model per 
process (rather than one per thread).
+
+Many Beam runners, however, run multiple Beam processes per machine at once. 
This can cause problems because the memory footprint of loading large models 
like LLMs multiple times can be too large to fit on a single machine.
+For memory-intensive models, RunInference provides a mechanism for sharing 
memory more intelligently across multiple processes to reduce the overall 
memory footprint. To enable this mode, users just need
+to set the parameter `large_model` to `True` in their model configuration (see 
below for an example), and Beam will take care of the memory management.
+
+### Running an Example Pipeline with T5
+
 This example demonstrates running inference with a `T5` language model using 
`RunInference` in a pipeline. `T5` is an encoder-decoder model pre-trained on a 
multi-task mixture of unsupervised and supervised tasks. Each task is converted 
into a text-to-text format. The example uses `T5-11B`, which contains 11 
billion parameters and is 45 GB in size. In order to work well on a variety of 
tasks, `T5` prepends a different prefix to the input corresponding to each 
task. For example, for tran [...]
 
-### Run the Pipeline ?
-First, install `apache-beam` 2.40 or greater:
+To run inference with this model, first install `apache-beam` 2.40 or greater:
 
 ```
 pip install apache-beam -U
@@ -103,7 +111,8 @@ In order to use it, you must first define a `ModelHandler`. 
RunInference provide
       model_class=T5ForConditionalGeneration,
       model_params={"config": AutoConfig.from_pretrained(args.model_name)},
       device="cpu",
-      inference_fn=gen_fn)
+      inference_fn=gen_fn,
+      large_model=True)
 {{< /highlight >}}
 
 A `ModelHandler` requires parameters like:
@@ -112,3 +121,4 @@ A `ModelHandler` requires parameters like:
 * `model_params` – A dictionary of arguments required to instantiate the model 
class.
 * `device` – The device on which you wish to run the model. If device = GPU 
then a GPU device will be used if it is available. Otherwise, it will be CPU.
 * `inference_fn` -  The inference function to use during RunInference.
+* `large_model` - Whether to use memory minimization techniques to lower the 
memory footprint of your model (see `Memory Management` above).
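
Taken together, the new flag slots into a RunInference pipeline roughly as in 
the minimal sketch below. The checkpoint path and the `t5-small` model name 
are placeholders rather than values from the page, and 
`make_tensor_model_fn("generate")` stands in for the `gen_fn` defined earlier 
in the doc:

```python
# Minimal sketch: enabling large_model on a PyTorch model handler.
# STATE_DICT_PATH and MODEL_NAME are placeholders; substitute your own
# T5 checkpoint (the page's example uses T5-11B).
import apache_beam as beam
from apache_beam.ml.inference.base import RunInference
from apache_beam.ml.inference.pytorch_inference import (
    PytorchModelHandlerTensor,
    make_tensor_model_fn,
)
from transformers import AutoConfig, AutoTokenizer, T5ForConditionalGeneration

MODEL_NAME = "t5-small"                                    # placeholder model
STATE_DICT_PATH = "gs://your-bucket/t5/pytorch_model.bin"  # placeholder path

model_handler = PytorchModelHandlerTensor(
    state_dict_path=STATE_DICT_PATH,
    model_class=T5ForConditionalGeneration,
    model_params={"config": AutoConfig.from_pretrained(MODEL_NAME)},
    device="cpu",
    # Route inference through the model's generate() method.
    inference_fn=make_tensor_model_fn("generate"),
    # Share a single copy of the model across worker processes on a machine.
    large_model=True,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
example = tokenizer(
    "translate English to German: The house is wonderful.",
    return_tensors="pt",
).input_ids[0]

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | "CreateExamples" >> beam.Create([example])
        | "RunInference" >> RunInference(model_handler)
        | "PrintPredictions" >> beam.Map(print)
    )
```

With `large_model=True`, worker processes on the same machine share one copy 
of the model weights instead of loading one copy each, which is what keeps a 
45 GB model like T5-11B within a single machine's memory.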
