manupa-arm commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-920577170


   @tqchen @mbs-octoml ,
   
   This is not specific to Arm(R) Ethos(TM)-U codegen and its generally 
applicable for any micro controller where we would want to avoid creating 
allocation of memories in the stack but rather service them via platform 
abstraction that is handled via TVMBackendWorkspaceAlloc --> 
TVMPlatformAllocate.
   
   This only showed up in Arm(R) Ethos(TM)-U codegen because we use 
TVMPlatformAllocate allocate memory from a buffer placed in a memory that is 
both accessible by CPU and NPU. Thus, it makes this a functional bug.
   However, with this change current main produces code that have much higher 
stack allocation for micros -- that is not desired.
   
   cc : @u99127  @areusch 
   
   > Stack allocation is important for the performance of the CPU code. In the 
case of TVM, we do not have explicit concept of registers in most cases. 
Instead we need to rely on LLVM's mem2reg pass to transform a set of constant 
indexing into stack allocation and turn them into registers, so the code can 
run effectively. So removing this code path can complicate the code generator 
side optimization by quite a bit and slow down the CPU code.
   
   The correct way represent this seems to be using tir.allocates with 
storage_scope="local" for device=CPU to go into the stack. For targets that 
needs this behavior, there should be an explicit pass to convert them to local 
to make them placed to the stack rather than assuming this default behaviour.
   
   > Of course this can be a target specific thing. LowerTVMBuiltin right now 
has the assumption to only run on host(CPU) code.
   
   >    Allocate always prefers (native) stack allocation when possible, but 
also allows other means of opaque allocation(as long as the allocation is 
fulfilled)
       There are however, cases when stack allocation is not possible
           When the size of memory requested is too big, stack alloca will 
explode the stack space(That is why there is a size check in the CPU case and 
the use of global opaque was meant as a fallback to avoid stackoverflow in 
models with big intermediate temp space)
           LowerTVMBuiltin was originally designed to run on the host side, 
which means as soon as the allocation is about device side memory, it will need 
to call onto a (host side) device API to allocate the memory instead
   
   Clearly, this definition of CPU leaves out micros.
   It feels wrong to print out allocates with "global" storage_scope directly 
into CPU PrimFunc that gets printed as a stack allocation rather it should be 
serviced via TVMBAW call which moves the responsibility for the 
runtime/application layer.
   
   > So rationales for the specific CPU side logic:
   
    >   We want to have stack alloca on host when possible(to gain mem2reg 
optimization)
       When the requested size is too large, we fallback to opaque workspace 
allocation on heap to allow the code to safely handle code with big temp memory 
requests as well as dynamic size allocation requests.
   
   This certainly sounds like we could use an optimization pass to convert the 
tir.allocate's storage_scope for targets that requires this rather than making 
that the default behaviour for tir.allocates with "global" storage scope.
   
   cc : @tom-gall @mbaret @Mousius 
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@tvm.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Reply via email to