MasterJH5574 commented on PR #16111:
URL: https://github.com/apache/tvm/pull/16111#issuecomment-1808922411

   Thanks for the inputs! Your example is clear and demonstrates the weakness 
of upper-bound allocation. So far we have discussed three possible 
allocation strategies for the output tensor:
   
   * S1. exact-size allocation,
   * S2. upper-bound allocation,
   * S3. other alternatives such as bucketing.
   
   And we discussed two runtime use cases of the VM:
   
   * C1. Running repeatedly and holding every output array:
     ```c++
     // Case 1.
     std::vector<NDArray> outputs;
     for (int i = 0; i < 1024; ++i) {
        NDArray logits = mod["main"](1k, ...); // output size = 1k, but storage = 128k
       outputs.push_back(logits);
     }
     this->outputs = outputs;
     ```
   * C2. Running repeatedly with a different output tensor size each time, 
without holding the output arrays:
     ```c++
     // Case 2.
     for (int i = 1; i <= 128; ++i) {
       NDArray logits = mod["main"](i * 1k + 1, ...);
       // post-processing of logits and then release.
     }
     ```
   
   General cases may use the output tensors in more complicated ways that mix 
the two cases above.
   
   Based on our discussion, we agree that
   
   * S1 avoids over-allocation in both C1 and C2, but incurs fragmentation in 
C2: when using a pool allocator, the VM may end up holding as much as 
`((1 + 128) * 128 / 2) * 1k = 8256k` of memory, since every iteration requests 
a size the pool has not yet cached.
   * S2 avoids fragmentation in both cases, but incurs over-allocation in C1, 
where the outputs are held: each array uses only 1k of its 128k storage, so up 
to `1024 * 127k` of memory may be wasted.
   * S3 may over-allocate in both cases. In the worst case of bucketing, each 
iteration in C1 can waste nearly as much memory as the output array itself. 
Similarly, in C2 there will be up to `128k` left unused in the pool at the 
end, though this is much less than under S2.
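   To make the fragmentation argument for S1 under C2 concrete, below is a toy 
size-keyed caching allocator standing in for the VM's memory pool (an 
assumption for illustration only; the real pool may differ, and the sizes are 
simplified to `i * 1k`):

   ```c++
   #include <cstddef>
   #include <iostream>
   #include <map>
   #include <vector>

   // Toy pool: freed blocks are cached keyed by their exact size, so a
   // block is only reused by a later request of the same size.
   class Pool {
    public:
     void* Alloc(std::size_t size) {
       auto& blocks = cache_[size];
       if (!blocks.empty()) {  // reuse a cached block of the exact size
         void* p = blocks.back();
         blocks.pop_back();
         return p;
       }
       held_ += size;  // pool grows: no block of this size is cached
       return ::operator new(size);
     }
     void Free(void* p, std::size_t size) { cache_[size].push_back(p); }
     std::size_t held_bytes() const { return held_; }

    private:
     std::map<std::size_t, std::vector<void*>> cache_;
     std::size_t held_ = 0;  // total bytes ever taken from the system
   };

   int main() {
     constexpr std::size_t k = 1024;
     Pool pool;
     // C2 under S1 (exact-size): every iteration requests a size the pool
     // has never cached, so nothing is reused and the pool keeps growing.
     for (std::size_t i = 1; i <= 128; ++i) {
       void* logits = pool.Alloc(i * k);
       pool.Free(logits, i * k);  // released by the user, retained by the pool
     }
     std::cout << pool.held_bytes() / k << "k held\n";  // prints "8256k held"
     return 0;
   }
   ```

   Because each iteration's size is new to the pool, all 128 freed blocks are 
retained, totaling `(1 + 2 + ... + 128) * 1k = 8256k`.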
   
   Though a general compiler pass is not supposed to assume that the execution 
runtime follows a particular behavior, we believe the runtime behavior, when 
it is clearly known, can be helpful information for the compiler. For example, 
in the MLC LLM use case, we are certain that the outputs of VM functions 
(e.g., logits) are released before the next invocation of the function. In 
this case, compiling the model with upper-bound allocation for the output 
tensors is beneficial.
   
   With this in mind, one approach is to introduce a compilation flag, in the 
form of a compile-time function attribute, that suggests whether to statically 
allocate the output tensor with an upper-bound estimate. For cases where we 
know how the output tensors are used at runtime (as in MLC LLM), we can enable 
this flag during model compilation to obtain a completely static runtime 
memory footprint. The flag is disabled by default, so exact-size allocation 
remains the behavior for general cases where the output tensors may be used 
arbitrarily.
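   As a minimal sketch of the runtime effect the flag aims for (the buffer 
name and sizes below are illustrative, not actual TVM internals): with 
upper-bound allocation enabled, a single statically sized buffer can back the 
output tensor across invocations of any size up to the bound.

   ```c++
   #include <cstddef>
   #include <iostream>
   #include <vector>

   constexpr std::size_t kK = 1024;
   constexpr std::size_t kUpperBound = 128 * kK;  // compile-time output bound

   // Hypothetical static output storage: allocated exactly once at the
   // upper bound, then reused by every invocation.
   std::vector<char>& OutputStorage() {
     static std::vector<char> storage(kUpperBound);
     return storage;
   }

   int main() {
     for (std::size_t i = 1; i <= 128; ++i) {
       std::size_t size = i * kK;                 // actual output size this call
       char* logits = OutputStorage().data();    // no allocation after first call
       logits[size - 1] = 0;                     // stand-in for writing the output
     }
     // Peak memory is fixed regardless of the sizes requested.
     std::cout << "peak storage = " << kUpperBound / kK << "k\n";
     return 0;
   }
   ```

   This is exactly the situation where the output is released (or overwritten) 
before the next call, so the `128k` bound is the total cost rather than a 
per-invocation waste.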
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
