tqchen edited a comment on issue #9022:
URL: https://github.com/apache/tvm/issues/9022#issuecomment-921820526


   Please allow me to explain the overall rationale here, in particular around the term "constraint".
   
   - C0: On one hand, we want a "default" memory to be generically accessible (per @manupa-arm 's comment) in all cases, so that runtime libraries can be built to rely on that generic property (e.g. access from an NPU).
   - C1: On the other hand, from the compiler's point of view, we want to leave flexibility to the optimization and codegen phases, and only constrain the properties we actually need (e.g. accessible from the CPU).
   
   ## The two ways to view the constraint
   
   The way C0 sees the constraint is about the possible accessors of the memory:
   - "global" => memory can be accessed from {cpu, npu, other devices}
   - "global.stack" => memory can be accessed from {cpu}
   
   The way C1 sees the constraint is about the possible memories to choose from:
   - "global" (memory that is accessible from the CPU) => can choose from {stack, TVMBAW, other allocators}
   - "global.workspace" => can choose from {TVMBAW}
   - "global.stack" => can choose from {stack}
   
   ## Discussions
   
   When we say one compiler IR is more constrained than another, we usually mean that fewer optimizations can be performed, because there is less flexibility for rewriting. For example, the `volatile` keyword puts additional constraints on memory accesses.
   
   This makes C1 more aligned with common compiler IR design. Note that "memory that is accessible from all devices" is a notion that depends on the specific runtime platform and is not well defined in a generic IR.
   The more constraints we put on the memory itself, the smaller the set of valid choices becomes. As a result, there are fewer opportunities for code transformations and optimizations.
   
   Under the current semantics, "CPU" && "global" can result in stack allocation. Note that this is exactly the kind of flexibility we want to offer to later stages so that specializations can be made (the two possible lowerings are sketched after the list below).
   
   - So yes, it is indeed OK for a pass to map "global" to TVMBAW; the resulting program will run more slowly, but still correctly, on the CPU.
   - It also does not preclude TVMLowerBuiltin from taking advantage of the semantics to choose stack allocation, which usually benefits performance.
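   
   To make the two outcomes above concrete, here is a rough hand-written sketch (not literal codegen output) of the shape of code the two choices lead to. The `TVMBackendAllocWorkspace`/`TVMBackendFreeWorkspace` declarations follow the C backend API in `include/tvm/runtime/c_backend_api.h`; everything else is illustrative.
   
   ```cpp
   #include <cstdint>
   
   // Declarations matching TVM's C backend API so the sketch is self-contained.
   extern "C" void* TVMBackendAllocWorkspace(int device_type, int device_id,
                                             uint64_t nbytes, int dtype_code_hint,
                                             int dtype_bits_hint);
   extern "C" int TVMBackendFreeWorkspace(int device_type, int device_id, void* ptr);
   
   // Choice 1: a small "global" allocation specialized onto the stack frame.
   float run_with_stack_alloc() {
     float temp[64];  // stack allocation, no runtime call needed
     for (int i = 0; i < 64; ++i) temp[i] = static_cast<float>(i);
     return temp[63];
   }
   
   // Choice 2: the same allocation routed through TVMBAW. Still correct on the
   // CPU (device_type 1 == kDLCPU), just slower than the stack version.
   float run_with_tvmbaw() {
     float* temp = static_cast<float*>(TVMBackendAllocWorkspace(
         /*device_type=*/1, /*device_id=*/0, /*nbytes=*/64 * sizeof(float),
         /*dtype_code_hint=*/2 /* kDLFloat */, /*dtype_bits_hint=*/32));
     if (temp == nullptr) return 0.0f;
     for (int i = 0; i < 64; ++i) temp[i] = static_cast<float>(i);
     float result = temp[63];
     TVMBackendFreeWorkspace(/*device_type=*/1, /*device_id=*/0, temp);
     return result;
   }
   ```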
   
   One thing we should keep in mind is that the codegen and the AOT compiler should not rely on the behavior of TVMLowerBuiltin to ensure correctness (since it can choose to do anything in the case of "global", including dispatching to another custom allocator). If a special kind of memory is needed, we should declare such a constraint through the IR. Attaching a special storage scope is the best way to do so under the current semantics, regardless of the implementation of TVMLowerBuiltin.
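   
   As a minimal sketch of what "declare such a constraint through the IR" can look like on the C++ side, assuming the post-unification TIR where the storage scope lives on the buffer variable's PointerType (the scope string "global.workspace" is only illustrative; the exact scope names are precisely what is being discussed):
   
   ```cpp
   #include <tvm/ir/type.h>
   #include <tvm/tir/op.h>
   #include <tvm/tir/stmt.h>
   #include <tvm/tir/var.h>
   
   using namespace tvm;
   using namespace tvm::tir;
   
   // Build an allocation whose pointer type carries a special storage scope.
   // Downstream passes (including TVMLowerBuiltin) then have to honor this
   // constraint instead of freely specializing as they may for plain "global".
   Stmt MakeScopedAllocation() {
     Var buffer_var("temp",
                    PointerType(PrimType(DataType::Float(32)), "global.workspace"));
     return Allocate(buffer_var, DataType::Float(32), {Integer(256)},
                     const_true(), Evaluate(0));
   }
   ```
   
   Whether the scope is attached by the scheduling layer or by an earlier AOT pass, the point is that the requirement becomes visible in the IR itself rather than being implied by TVMLowerBuiltin's current behavior.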
   
   
   TVMLowerBuiltin picks `kMaxStackAllocaSize` as a heuristic threshold that maximizes the benefit of stack allocation without blowing up the stack. Of course a better heuristic can be used; setting the default to 0 would hurt the performance of a lot of CPU code, so that is not desirable. We could certainly introduce a target-dependent property for micro targets and set it to 0 there. It should be pretty easy to do as well; see https://github.com/apache/tvm/blob/main/src/tir/transforms/lower_warp_memory.cc#L392, which obtains a target-dependent warp size.
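   
   A minimal sketch of what that could look like, following the same pattern as the warp-size lookup linked above (note: "max_stack_alloca_size" is not an existing, registered target attribute; it is only assumed here for illustration):
   
   ```cpp
   #include <tvm/target/target.h>
   #include <tvm/tir/function.h>
   
   using namespace tvm;
   
   // Read a hypothetical per-target stack-allocation threshold, falling back to
   // the existing compile-time constant when the target does not provide one.
   int64_t GetStackAllocaThreshold(const tir::PrimFunc& func, int fallback) {
     Optional<Target> target = func->GetAttr<Target>(tvm::attr::kTarget);
     if (!target.defined()) return fallback;
     // A micro target could register this attribute with value 0 to disable
     // stack allocation entirely.
     Integer limit = target.value()
                         ->GetAttr<Integer>("max_stack_alloca_size", Integer(fallback))
                         .value();
     return limit->value;
   }
   ```
   
   TVMLowerBuiltin could then compare allocation sizes against this per-target threshold instead of the hard-coded `kMaxStackAllocaSize`.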
   
   
   
   
   

