manupa-arm commented on issue #9022: URL: https://github.com/apache/tvm/issues/9022#issuecomment-921044001
Thanks @tqchen for summarizing the ideas and presenting possible resolutions.

The two needs seem very valid. For N0, the operators should really be tagged with the 'local' storage scope, since these allocations are local to the operator PrimFunc and benefit from further optimizations within and beyond TVM, i.e. by the follow-up C compiler / LLVM. For N1, we could use the 'global' tag to give the application/runtime layer the responsibility of servicing the allocation.

Therefore, the actual fix should have been to tag the allocates that are expected to be optimized as 'local' to the PrimFunc, rather than treating 'global' allocates on CPU as local.

> N0 is needed to get best performing kernel, since a native way of allocation that can be understood by the compiler would give the best chance for followup optimizations. This is the case for CPU related optimizations. Note that such optimization is also needed for micro setting, when we use TIR to generate kernel code that requested a temp scratch memory for output tiling.

I feel we are incorrectly tagging the storage_scopes here. They should really be 'local' for this specific use case.

> First of all, R0 and R1 are not that different in nature. Both tries to introduce two separate scopes that brings different behavior. The main questions boils down to how can we name the "global" scope.

I still see solution R1 as a workaround for the incorrect treatment of 'global' scoped memories, where we override what is actually 'global' with something we declare as 'global.workspace'. In shared-memory SoCs, tagging each memory by the devices that can access it would lead to an unscalable explosion of tags. I would think we'd want to treat memories and devices as having a many-to-many relationship.
The lowering we had until a few days ago was general enough that these allocations were serviced by a runtime/application layer routine, which was more aligned with what we call 'global' scoped storage (with respect to codegen).

> Per allocate semantics, we treats "global" as normal CPU memory which can come from stack or platform specific allocation

Can you explain what you define as 'normal CPU memory'? A CPU can technically have access to many memories.

> However, the need N0 would favor stack allocation when possible. Note that we will likely need a related behavior for micro devices as well when generating operator kernels.

It would have been nice to have an RFC (apologies in advance if there already was one and I missed it) to discuss this before we moved away from TVMBAW-style allocation, which I find more generic than just stack allocations. It almost feels like the schedules should have tagged them 'local' if this was the expectation, rather than relying on the combined condition of 'global' scope and a CPU target.

> One possible approach is to try to ask user to differentiate these two kinds of allocations.

Wouldn't it be simpler if we tag allocations for N0 as 'local' and those for N1 as 'global'?

> N2: Allocating memory with special semantics. For example, an unified device pinned memory that is accessible from both NPU and CPU. A specialized texture memory or shared memory. The request in N2 is quite different and brings additional requirement to how we allocate the memory and how the memory can be used and represented in codegen.

It's a memory that multiple Targets have access to, which the runtime/application could provide via TVMBAW with a specialized global.<pool_name>.

> It is important to do this because the compiler makes no assumption that "global" can be accessed by other types of devices.

Hmmmm, this argument seems counter-intuitive to me. I think we should assume 'global' memory to be accessible unless it is explicitly specified to be restricted, i.e. global.<pool_name>.
Otherwise, the terminology is confusing.

> R0: Separate out a "local" scope that carries the stack allocation behavior. (proposed by @manupa-arm )
>
> R1: Keep "global" scope as it is, introduce a special tagged scope "global.workspace" that represents a global memory request specifically for workspace memory. And introduce lowerings for them. For specialized memory(e.g. NPU unified), introduce separate memory scopes.
>
> R3: Introduce target specific attribute that marks the possible stack alloca size for lowering the R0("local") or R1("global"). Note R3 can be done in addition to R0 or R1.

Ideally, the allocates destined to end up on the stack should have been 'local'. Moreover, at that point we can decide not to tag allocates that exceed the target-specific attribute, rather than dealing with this in the codegen or the lower_builtin_tvm pass.

I would propose:

R4: R0 + an optional pass to tag memories with 'local', which I think is cleaner. In that pass, we should only tag them according to the target-specific attribute for the maximum stack alloca size. That could work until we fix the schedules that want these allocations on the stack so that they use the 'local' storage scope for need N0, which is a conscious decision the schedule writer / autoscheduler takes.
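
To make R4 a bit more concrete, here is a minimal toy sketch in plain Python. It does not use the TVM API: `Alloc`, `tag_local`, `lowering_for`, the `max_stack_alloca_bytes` parameter and the `npu_pool` name are all hypothetical, and allocations are modeled as simple records rather than TIR nodes. It only illustrates the intended policy, assuming an optional pass retags small 'global' allocates as 'local' and a later lowering step dispatches purely on the storage scope.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Alloc:
    """Toy stand-in for an allocate node (hypothetical, not a TVM class)."""
    name: str
    scope: str   # 'local', 'global', or 'global.<pool_name>'
    nbytes: int


def tag_local(allocs: List[Alloc], max_stack_alloca_bytes: int) -> List[Alloc]:
    """Optional pass (R4): retag plain 'global' allocates as 'local' when
    they fit under the target-specific stack alloca limit."""
    tagged = []
    for alloc in allocs:
        if alloc.scope == "global" and alloc.nbytes <= max_stack_alloca_bytes:
            tagged.append(replace(alloc, scope="local"))
        else:
            # Too large, or already carrying an explicit scope: leave as-is.
            tagged.append(alloc)
    return tagged


def lowering_for(alloc: Alloc) -> str:
    """Pick a lowering from the storage scope alone, with no extra
    'global and CPU' special case."""
    if alloc.scope == "local":
        return "stack"
    if alloc.scope == "global":
        return "TVMBAW"  # serviced by the runtime/application layer
    if alloc.scope.startswith("global."):
        # Restricted pool, e.g. memory shared between specific Targets.
        return "pool:" + alloc.scope.split(".", 1)[1]
    raise ValueError(f"unknown storage scope: {alloc.scope}")


if __name__ == "__main__":
    allocs = [
        Alloc("tile_scratch", "global", 256),
        Alloc("big_workspace", "global", 1 << 20),
        Alloc("shared_buf", "global.npu_pool", 64),
    ]
    for alloc in tag_local(allocs, max_stack_alloca_bytes=1024):
        print(alloc.name, alloc.scope, lowering_for(alloc))
```

The point of the sketch is that the size threshold is applied once, when tagging, so codegen never has to reinterpret 'global' based on the target.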