[quote="Mousius, post:1, topic:9849"]
* this all as the default AOT behaviour for now rather than providing a
compiler flag
* Maintain current packed function signature and instead just change the
unwrapping from DLTensor to pointers - this is problematic as to which level
the user is informed of an error, with a changed signature you’d get a link
error rather than a segfault if you tried to use this for dynamic linking
[/quote]
Thanks @Mousius for bringing this up. I think it would still worth to think a
bit more to formalize the wants. In particular, there are a few parts of items
that are worth considering:
## Formalize the Transformation of Buffer to the Pointer(Change of Fn Signature)
The particular transformation we are looking for is actually to transform a
function with buffer map by directly passing in its data pointer. Note that
most of TVM's lowering transformations preserves
Consider the constructing code
```python
import tvm
import tvm.script
from tvm import te
def unpacked_example():
A = te.placeholder((4 , 5), name="A")
B = te.compute(A.shape, lambda *i: A(*i) + 1, name="B")
s = te.create_schedule(B.op)
mod = tvm.lower(s, [A, B])
print(tvm.script.asscript(mod))
unpacked_example()
```
This will give us the following script
```python
@tvm.script.tir
class Module:
def main(A: ty.handle, B: ty.handle) -> None:
# function attr dict
tir.func_attr({"global_symbol": "main", "tir.noalias": True})
A_1 = tir.match_buffer(A, [4, 5], elem_offset=0, align=128,
offset_factor=1)
B_1 = tir.match_buffer(B, [4, 5], elem_offset=0, align=128,
offset_factor=1)
# body
for i0, i1 in tir.grid(4, 5):
B_1.data[((i0*5) + i1)] = (tir.load("float32", A_1.data, ((i0*5) +
i1)) + tir.float32(1))
```
>From the data structure's PoV, the above code only refers to the pointer
>`B_1.data` and `A_1.data`. We can create a pass (say replace signature) that
>tries to changes the signature of the function from list
of buffers(requires DLTensor*) to just the data pointers themselves. This
transformation
can hold as long as all the referenced variables are covered, and the desired
code is lile
```python
@tvm.script.tir
class Module:
def main(Adata: ty.handle, Bdata: ty.handle) -> None:
# body
for i0, i1 in tir.grid(4, 5):
tir.store(Bdata, tir.load("float32", Adata, ((i0*5) + i1)) +
tir.float32(1)), ((i0*5) + i1))
```
Note then the function can be directly passed to the code generator, which
generates the
function with signature
```c
int main_func(void* Adata, void* Bdata);
```
The main point is that we do not have to try to twist the MakePackedAPI to
generate another kind of type erased API here. If what we want is the normal C
calling convention that passes in the field separately, we should add this pass
after lowering to change the expected function signature, then the followup
calls would follow naturally (via the normal C function calling convention).
The MakePackedAPI as it is is supposed to preserve the function signature(of
taking buffers) regardless of unpacked choices. So transforming the signature
should go to another pass.
### The Choice of Type-Erased API
My main concern about the current proposal is the introduction of another type
erased interface, namely
```
typedef int32_t(tvm_function_t)(void** inputs, void** outputs, void*
resource_handle);`
```
Given most of the internals can already be readibly handled by the raw C typed
version. The only usage of the type-erased function is when it comes to
interfaces. In that case, I still beleive that PackedC function is the right
choice, as it brings the benefit of standarization and consistency with the
overall TVM ecosystem.
### Impact of Compiler Optimizations
Finally, it is still useful to think about compiler optimizations and how can
they impact the choices in the table. Modern compilers can do a lot of things,
making it possibly to get as optimized code as long as we can inline the
function correctly.
Let us consider an example code below
```
#include <cstdio>
#include <tvm/runtime/c_runtime_api.h>
inline int PackedCFunc(void* args, int* type_codes, int num_args,
void* out_ret_value, int* out_ret_tcode,
void* resource_handle) {
DLTensor* dlx = (DLTensor*)(((TVMValue*)args)[0].v_handle);
DLTensor* dly = (DLTensor*)(((TVMValue*)args)[1].v_handle);
// error check that can be dead-code eliminated
if (type_codes[0] != kTVMDLTensorHandle) {
return -1;
}
if (type_codes[1] != kTVMDLTensorHandle) {
return -1;
}
if (dlx->shape[0] != 3) {
return -1;
}
if (dlx->shape[1] != 3) {
return -1;
}
if (dly->shape[0] != 3) {
return -1;
}
if (dly->shape[1] != 3) {
return -1;
}
if (dly->dtype.code != kDLFloat) {
return -1;
}
((float*)dlx->data)[0] = ((float*)dly->data)[0] + 1;
return 0;
}
// return y[i] = x[i] +1
extern "C" int AddViaPackedCFunc(float *x, float* y) {
TVMValue args[2];
int type_codes[2];
TVMValue out_ret_value;
int out_ret_tcode;
int64_t shape[2] = {3, 3};
DLTensor dlx, dly;
dlx.data = x;
dlx.ndim = 2;
dlx.shape = shape;
dlx.dtype.code = kDLFloat;
dlx.dtype.bits = 32;
dlx.dtype.lanes = 1;
dlx.device.device_type = kDLCPU;
dlx.device.device_id = 0;
dlx.strides = nullptr;
dlx.byte_offset = 0;
dly = dlx;
dly.data = y;
args[0].v_handle = &dlx;
args[1].v_handle = &dly;
type_codes[0] = kTVMDLTensorHandle;
type_codes[1] = kTVMDLTensorHandle;
// note: check can be dead-code eliminated
if (PackedCFunc(args, type_codes, 2, &out_ret_value, &out_ret_tcode, nullptr)
!= 0) {
printf("error\n");
}
return 0;
}
```
Run clang
```bash
clang-10 -O2 -emit-llvm -S -I ../../tvm/3rdparty/dlpack/include -I
../../tvm/include -o test.ll test.cc
```
The result is
```ll
; Function Attrs: nounwind uwtable
define dso_local i32 @AddViaPackedCFunc(float* %0, float* %1)
local_unnamed_addr #0 {
%3 = load float, float* %1, align 4, !tbaa !2
%4 = fadd float %3, 1.000000e+00
store float %4, float* %0, align 4, !tbaa !2
ret i32 0
}
```
Run gcc
```
gcc -O2 -S -I ../../tvm/3rdparty/dlpack/include -I ../../tvm/include -o test.s
test.cc
```
Gives the following asm code
```
.file "test.cc"
.text
.p2align 4,,15
.globl AddViaPackedCFunc
.type AddViaPackedCFunc, @function
AddViaPackedCFunc:
.LFB31:
.cfi_startproc
movss .LC0(%rip), %xmm0
xorl %eax, %eax
addss (%rsi), %xmm0
movss %xmm0, (%rdi)
ret
.cfi_endproc
.LFE31:
.size AddViaPackedCFunc, .-AddViaPackedCFunc
.section .rodata.cst4,"aM",@progbits,4
.align 4
.LC0:
.long 1065353216
.ident "GCC: (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0"
.section .note.GNU-stack,"",@progbits
```
As we can see that even with the same PackedFunc API, as long as we can do
proper inlining, allocating DLTensor and other items on stack, the resulting
function call can be reduced to the same function as the minimum non-packed
version.
### Discussions
Considering the importance of a minimum internal, I agree that we could explore
an un-packed interface(essentially generating something that is related to C).
We should do that in a proper way, by introducing a function signature
transformation utility that transforms the function signature from the original
DLTensor* to the destructed fields.
However, we should also note that generating the DLTensor on stack and setting
up constant correctly might also bring similar effect in a modern compiler.
When it comes to type-erased interface, assuming we only need them at the
interface level(not the internals). I think it is useful to keep the
CPackedFunc convention, so that we still retain the benefit of additional
wraping to expose to the externals and standardization. Again in this case
carefully allocating the DLTensor on stack then pass it in plus strong
inlining/constant folding could remove the overhead of DLTensor even at the
interface level.
---
[Visit
Topic](https://discuss.tvm.apache.org/t/rfc-utvm-aot-optimisations-for-embedded-targets/9849/2)
to respond.
You are receiving this because you enabled mailing list mode.
To unsubscribe from these emails, [click
here](https://discuss.tvm.apache.org/email/unsubscribe/86c400f3873babb8f3c4c42b921d491a97ce8acfe63247967f110a28f7d71f82).