manupa-arm commented on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921044001


   Thanks @tqchen for summarizing the ideas and presenting possible resolutions.
   
   The two needs seem very valid.
   
   For N0, the allocations should really be tagged with the 'local' storage scope, as they are quite local to the operator PrimFunc and they benefit from further optimizations within and beyond TVM, i.e. in the follow-up C compiler / LLVM.
   
   For N1, we could use the 'global' tag to give the application/runtime layer the responsibility of servicing the allocation.
   
   Therefore, the actual fix should have been to tag the allocates that are expected to be optimized as 'local' to the PrimFunc, rather than treating 'global' allocates on the CPU as local.
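   To illustrate the distinction, here is a minimal sketch using TIR's Python construction API (buffer names, dtypes and extents are made up); the storage scope is carried on the pointer type of the Allocate's buffer var:

```python
import tvm
from tvm import tir

# A toy body; the point is only where the storage scope lives.
body = tir.Evaluate(0)

# N0-style scratch: local to the PrimFunc, free for TVM / LLVM / the C compiler to optimize.
local_var = tir.Var("scratch", tvm.ir.PointerType(tvm.ir.PrimType("float32"), "local"))
local_alloc = tir.Allocate(local_var, "float32", [16], tir.const(1, "bool"), body)

# N1-style workspace: left 'global' so the runtime/application layer services it.
global_var = tir.Var("workspace", tvm.ir.PointerType(tvm.ir.PrimType("float32"), "global"))
global_alloc = tir.Allocate(global_var, "float32", [16], tir.const(1, "bool"), body)

print(local_var.type_annotation.storage_scope)   # "local"
print(global_var.type_annotation.storage_scope)  # "global"
```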
   
   > N0 is needed to get best performing kernel, since a native way of 
allocation
   that can be understood by the compiler would give the best chance for 
followup
   optimizations. This is the case for CPU related optimizations. Note that such
   optimization is also needed for micro setting, when we use TIR to generate 
kernel
code that requested a temp scratch memory for output tiling.
   
   I feel we are tagging the storage_scopes incorrectly here. They should really be 'local' for this specific use case.
   
   > First of all, R0 and R1 are not that different in nature. Both tries to 
introduce two separate scopes that brings different behavior. The main 
questions boils down to how can we name the "global" scope.
   
   I still see the solution R1 as a workaround for the incorrect treatment of 'global' scoped memories, where we create 'global.workspace' as an override of what is actually 'global'. In shared-memory SoCs, it would be an unscalable explosion of tags if we want to keep tagging memories with the devices that have access to them. I would think we'd want to treat the relationship between memories and devices as many-to-many.
   
   The lowering we had until a few days ago was general enough that the allocations were serviced by a runtime/application-layer routine, and that was more aligned with what we call 'global' (with respect to codegen) scoped storage.
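   For concreteness, this is roughly the shape of the runtime request that lowering produced, sketched here with TIR's Python API; the device and size values are made up, while the extern names and argument order follow the backend C API (`TVMBackendAllocWorkspace` / `TVMBackendFreeWorkspace`):

```python
import tvm
from tvm import tir

# Illustrative only: the real lowering pass constructs these calls itself;
# this just sketches the workspace request it hands to the runtime layer.
device_type, device_id = 1, 0        # kDLCPU, device 0 (made-up example)
nbytes = tir.const(1024, "uint64")   # made-up workspace size

alloc_ptr = tir.call_extern(
    "handle", "TVMBackendAllocWorkspace",
    device_type, device_id, nbytes,
    2,   # dtype_code_hint (kDLFloat)
    32,  # dtype_bits_hint
)
free_ret = tir.call_extern(
    "int32", "TVMBackendFreeWorkspace",
    device_type, device_id, tir.Var("workspace_ptr", "handle"),
)
```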
   
   > Per allocate semantics, we treats "global" as normal CPU memory which can 
come from stack or platform specific allocation
   
   Can you explain what you define as 'normal CPU memory'? A CPU can technically have access to many memories.
   
   > However, the need N0 would favor stack allocation when possible. Note that 
we will likely need a related behavior for micro devices as well when 
generating operator kernels.
   
   It would have been nice to have an RFC (apologies in advance if I missed it, if there was one already) to discuss this before we moved away from TVMBAW-style allocation, which I find more generic than just stack allocation. It almost feels like the schedules should have tagged the allocations 'local' if this was the expectation, rather than assuming a combined logic of 'global' and CPU.
   
   > One possible approach is to try to ask user to differentiate these two 
kinds of allocations.
   
   Wouldn't it be simpler to tag allocations for N0 as 'local' and those for N1 as 'global'?
   
   > N2: Allocating memory with special semantics. For example, an unified 
device pinned memory that is accessible from both NPU and CPU. A specialized 
texture memory or shared memory. The request in N2 is quite different and 
brings additional requirement to how we allocate the memory and how the memory 
can be used and represented in codegen.
   
   It is a memory that multiple Targets have access to, which the runtime/application could provide via TVMBAW with a specialized 'global.<pool_name>' scope.
   
   > It is important to do this because the compiler makes no assumption that 
"global" can be accessed by other types of devices.
   
   Hmmm, this argument seems counter-intuitive to me. I think we should assume 'global' memory to be accessible unless it is explicitly specified to be restricted, i.e. 'global.<pool_name>'. Otherwise, the terminology is confusing.
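   To make that concrete, a small sketch of how a tagged 'global.<pool_name>' scope could look at the TIR level; 'sram' is a made-up pool name, and the buffer extents are illustrative:

```python
import tvm
from tvm import tir

# The base scope says who may access the memory; the tag only says which
# pool the runtime/application layer should service it from.
scope = "global.sram"            # "sram" is a hypothetical pool name
base, _, pool = scope.partition(".")
assert base == "global" and pool == "sram"

# The tag rides on the storage scope of the Allocate's buffer var.
buf_var = tir.Var("shared_buf", tvm.ir.PointerType(tvm.ir.PrimType("int8"), scope))
alloc = tir.Allocate(buf_var, "int8", [2048], tir.const(1, "bool"), tir.Evaluate(0))
```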
   
   >
       R0: Separate out a "local" scope that carries the stack allocation behavior. (proposed by @manupa-arm)
       R1: Keep "global" scope as it is, introduce a special tagged scope "global.workspace" that represents a global memory request specifically for workspace memory. And introduce lowerings for them. For specialized memory (e.g. NPU unified), introduce separate memory scopes.
       R3: Introduce target specific attribute that marks the possible stack alloca size for lowering the R0("local") or R1("global"). Note R3 can be done in addition to R0 or R1.
   
   Ideally, the allocates destined to end up on the stack should have been 'local'. Moreover, at that point we can decide not to tag allocates that exceed the target-specific attribute, rather than dealing with this in codegen or in the LowerTVMBuiltin pass.
   
   I would propose:
   
   R4: R0 + I think it is cleaner if we just introduce an optional pass to tag memories with 'local'. At that point, we should only tag them according to the target-specific attribute for the maximum stack alloca size -- that could work until we fix the schedules that want the allocations on the stack to have the 'local' storage scope for the need N0 -- that is a conscious decision the schedule writer / auto-scheduler takes.
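   A rough, non-authoritative sketch of what such an optional pass could look like; `MAX_STACK_ALLOCA_BYTES` stands in for the hypothetical target attribute, and the actual re-tagging of the buffer var is deliberately elided:

```python
import tvm
from tvm import tir

# Hypothetical limit standing in for a target-specific "max stack alloca size"
# attribute; the name and value are made up for illustration.
MAX_STACK_ALLOCA_BYTES = 256

@tvm.tir.transform.prim_func_pass(opt_level=0)
def tag_small_allocates_local(func, mod, ctx):
    """Illustrative analysis only: find 'global' allocates small enough for the
    stack. A real pass would also rebuild each Allocate with a 'local'-scoped
    buffer var and substitute that var throughout the body."""
    candidates = []

    def visit(stmt):
        if isinstance(stmt, tir.Allocate):
            size = 1
            for extent in stmt.extents:
                if not isinstance(extent, tir.IntImm):
                    return  # dynamic extent: leave it to the TVMBAW path
                size *= extent.value
            nbytes = size * tvm.DataType(stmt.dtype).bits // 8
            scope = getattr(stmt.buffer_var.type_annotation, "storage_scope", "")
            if scope in ("", "global") and nbytes <= MAX_STACK_ALLOCA_BYTES:
                candidates.append(stmt.buffer_var.name)

    tir.stmt_functor.post_order_visit(func.body, visit)
    print("would re-tag as 'local':", candidates)
    return func
```

   A real implementation would then rewrite each candidate Allocate with a 'local'-scoped buffer var and its uses, so that codegen and the builtin-lowering pass no longer need CPU-specific special casing.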
   

