manupa-arm commented on issue #9022: URL: https://github.com/apache/tvm/issues/9022#issuecomment-920577170
@tqchen @mbs-octoml , This is not specific to Arm(R) Ethos(TM)-U codegen and its generally applicable for any micro controller where we would want to avoid creating allocation of memories in the stack but rather service them via platform abstraction that is handled via TVMBackendWorkspaceAlloc --> TVMPlatformAllocate. This only showed up in Arm(R) Ethos(TM)-U codegen because we use TVMPlatformAllocate allocate memory from a buffer placed in a memory that is both accessible by CPU and NPU. Thus, it makes this a functional bug. However, with this change current main produces code that have much higher stack allocation for micros -- that is not desired. cc : @u99127 @areusch > Stack allocation is important for the performance of the CPU code. In the case of TVM, we do not have explicit concept of registers in most cases. Instead we need to rely on LLVM's mem2reg pass to transform a set of constant indexing into stack allocation and turn them into registers, so the code can run effectively. So removing this code path can complicate the code generator side optimization by quite a bit and slow down the CPU code. The correct way represent this seems to be using tir.allocates with storage_scope="local" for device=CPU to go into the stack. For targets that needs this behavior, there should be an explicit pass to convert them to local to make them placed to the stack rather than assuming this default behaviour. > Of course this can be a target specific thing. LowerTVMBuiltin right now has the assumption to only run on host(CPU) code. > Allocate always prefers (native) stack allocation when possible, but also allows other means of opaque allocation(as long as the allocation is fulfilled) There are however, cases when stack allocation is not possible When the size of memory requested is too big, stack alloca will explode the stack space(That is why there is a size check in the CPU case and the use of global opaque was meant as a fallback to avoid stackoverflow in models with big intermediate temp space) LowerTVMBuiltin was originally designed to run on the host side, which means as soon as the allocation is about device side memory, it will need to call onto a (host side) device API to allocate the memory instead Clearly, this definition of CPU leaves out micros. It feels wrong to print out allocates with "global" storage_scope directly into CPU PrimFunc that gets printed as a stack allocation rather it should be serviced via TVMBAW call which moves the responsibility for the runtime/application layer. > So rationales for the specific CPU side logic: > We want to have stack alloca on host when possible(to gain mem2reg optimization) When the requested size is too large, we fallback to opaque workspace allocation on heap to allow the code to safely handle code with big temp memory requests as well as dynamic size allocation requests. This certainly sounds like we could use an optimization pass to convert the tir.allocate's storage_scope for targets that requires this rather than making that the default behaviour for tir.allocates with "global" storage scope. cc : @tom-gall @mbaret @Mousius -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@tvm.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org