tqchen edited a comment on issue #9022: URL: https://github.com/apache/tvm/issues/9022#issuecomment-921820526
Please allow me to explain the overall rationale here, in particular around the term "constraint":

- C0: On one hand, we want a "default" memory to be generically accessible (per @manupa-arm 's comment) in all cases, so runtime libraries can be built to leverage the generic property (e.g. access from an NPU).
- C1: On the other hand, from the compiler's POV, we want to leave flexibility to the code optimization and codegen phases, and only constrain on the property we want (e.g. accessible from the CPU).

## The two ways to see the constraint

The way C0 sees the constraint is about the possible accessors of the memory:
- "global" => memory can be accessed from {cpu, npu, other devices}
- "global.stack" => memory can be accessed from {cpu}

The way C1 sees the constraint is about the possible memories to choose from:
- "global" (memory that is accessible from the CPU) => can choose from {stack, TVMBAW, other allocators}
- "global.workspace" => can choose from {TVMBAW}
- "global.stack" => can choose from {stack}

## Discussions

When we say a compiler IR is more constrained than another, we usually mean that fewer optimizations can be performed, because there is less flexibility in terms of rewriting. For example, the `volatile` keyword puts additional constraints on memory access. This makes C1 more aligned with common compiler IR design. Note that "memory that is accessible from all devices" is a term that depends on the specific runtime platform, and is not very well defined in a generic IR.

The more constraints we put on the memory itself, the smaller the set of candidate memories becomes. As a result, there are fewer opportunities for code transformations and optimizations. Under the current semantics, "CPU" && "global" can result in stack allocation. Note that this is one kind of flexibility we want to offer to later stages so that specializations can be made.

- So yes, it is indeed OK for a pass to map "global" to TVMBAW; the resulting program will run slower, but still correctly, on the CPU.
- It also does not preclude TVMLowerBuiltin from taking advantage of the semantics to choose stack allocation, which usually benefits performance.

One thing we should keep in mind is that the codegen and AOT compiler should not rely on the behavior of TVMLowerBuiltin to ensure correctness (since it can choose to do anything in the case of "global", including dispatching to another custom allocator). If a special kind of memory is needed, we should declare that constraint through the IR. Attaching a special scope is the best way to do so under the current semantics, regardless of the implementation of TVMLowerBuiltin.

TVMLowerBuiltin picks `kMaxStackAllocaSize` as a heuristic number that maximizes the benefit of stack allocation without blowing up the stack. Of course a better heuristic could be used; setting the default to 0 would bring down the performance of a lot of CPU code, so that is not as desirable. We could certainly have a target-dependent property for micro targets and set it to 0 there. It should be pretty easy to do as well; see https://github.com/apache/tvm/blob/main/src/tir/transforms/lower_warp_memory.cc#L392, which obtains a target-dependent warp size.
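The two readings of the constraint above can be sketched with plain sets. This is a hypothetical model, not actual TVM code: the scope names ("global", "global.workspace", "global.stack") and allocator names (stack, TVMBAW) come from the discussion, while the dictionaries themselves are illustrative.

```python
# Hypothetical model of the two readings of a storage-scope constraint.

# C0: a scope constrains who may *access* the memory.
ACCESSORS = {
    "global": {"cpu", "npu", "other-devices"},
    "global.stack": {"cpu"},
}

# C1: a scope constrains which *allocators* the compiler may choose from.
ALLOCATORS = {
    "global": {"stack", "TVMBAW", "other-allocators"},
    "global.workspace": {"TVMBAW"},
    "global.stack": {"stack"},
}

# In the C1 reading, a more specific scope is a subset of "global":
# the compiler has fewer rewriting choices, hence it is "more constrained".
assert ALLOCATORS["global.stack"] <= ALLOCATORS["global"]
assert ALLOCATORS["global.workspace"] <= ALLOCATORS["global"]

# Mapping "global" to TVMBAW is always legal (TVMBAW is in the choice set),
# even though stack allocation might be faster.
assert "TVMBAW" in ALLOCATORS["global"]
```

The subset relation is what makes C1 the compiler-friendly reading: adding a suffix to the scope only ever shrinks the choice set, so every program valid under the specific scope is also valid under "global".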
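A sketch of the kind of decision TVMLowerBuiltin makes, per the description above. This is assumed, simplified logic, not the actual pass: `max_stack_alloca_size` stands in for `kMaxStackAllocaSize`, and the target-attribute lookup merely mirrors the warp-size example rather than the real TVM target API.

```python
def choose_allocation(scope: str, size_bytes: int, target_attrs: dict) -> str:
    """Pick an allocation strategy for a buffer in the given scope.

    Hypothetical sketch: "global" leaves the choice open, so the pass may
    specialize to stack allocation when the buffer is small enough; a
    target-dependent attribute (e.g. 0 for micro targets) can disable that.
    """
    # Target-dependent threshold, defaulting to a heuristic constant
    # (analogous to kMaxStackAllocaSize in TVMLowerBuiltin).
    max_stack = target_attrs.get("max_stack_alloca_size", 1024)
    if scope == "global.stack":
        return "stack"   # constraint declared in the IR: must be on the stack
    if scope == "global.workspace":
        return "TVMBAW"  # must go through TVMBackendAllocWorkspace
    if scope == "global":
        # Free choice: specialize small buffers to the stack for performance.
        return "stack" if size_bytes <= max_stack else "TVMBAW"
    raise ValueError(f"unknown scope {scope!r}")

# A normal CPU target benefits from stack-allocating small buffers...
assert choose_allocation("global", 256, {}) == "stack"
# ...while a micro target can set the threshold to 0 to force TVMBAW.
assert choose_allocation("global", 256, {"max_stack_alloca_size": 0}) == "TVMBAW"
```

Note that both branches of the "global" case are correct by construction; the threshold only affects performance, which is exactly why downstream codegen must not depend on which one the pass picks.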