Re: [apache/incubator-mxnet] [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead (#17097)

2020-02-01 Thread Haozheng Fan
@tqchen Thanks for sharing this. I don't know if I understand correctly: for 
now, arguments other than primitives pass through the FFI via Object (like 
ADTObj). They are then converted to TShape in the backend, so TShape is not 
involved in the FFI directly.

As you said, Object lets me conveniently put various kinds of things into a 
container (ADTObj) without losing their types. For example, a tuple of tuples 
like ((2, 2), (2, 2)) is now allowed.
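
To illustrate, a minimal pure-Python sketch of that recursive conversion, with 
`ADT` as an illustrative stand-in for the backend ADTObj (not the real class):

```python
class ADT:
    """Illustrative stand-in for the backend container (not the real ADTObj)."""
    def __init__(self, fields):
        self.fields = tuple(fields)

    def __repr__(self):
        return "ADT(%s)" % ", ".join(map(repr, self.fields))


def convert(value):
    """Recursively convert nested tuples; primitives pass through unchanged."""
    if isinstance(value, tuple):
        return ADT(convert(v) for v in value)
    return value


print(convert(((2, 2), (2, 2))))  # ADT(ADT(2, 2), ADT(2, 2))
```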

Also sorry for the late reply. I have been on a vacation this week :)

-- 
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-581108588

Re: [apache/incubator-mxnet] [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead (#17097)

2020-01-25 Thread Tianqi Chen
Thanks @hzfan. I would also highly recommend taking a close look at TVM's 
object protocol, and try to push most things through Object eventually 
(creating temporary support for legacy cases like TShape is fine, but 
eventually passing things as objects gives greater uniformity and brings 
benefits such as being able to put everything into a container).

-- 
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-578451034

Re: [apache/incubator-mxnet] [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead (#17097)

2020-01-23 Thread Haozheng Fan
I created a follow-up design proposal on 
[cwiki](https://cwiki.apache.org/confluence/display/MXNET/MXNet+FFI+for+Operator+Imperative+Invocation). 
TVM FFI works well with MXNet, and the overhead for `np.zeros` is greatly 
reduced. Any feedback is appreciated.

-- 
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-578011639

Re: [apache/incubator-mxnet] [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead (#17097)

2019-12-27 Thread Tianqi Chen
Re the need for explicit type-checking code in the TVM FFI:

Actually, there is no explicit code for type checking, as it is generated 
automatically via template expansion (on the receiving end). We also have a 
"strongly typed" signature that wraps the packed function interface, which 
gives you compile-time type checking:
https://github.com/apache/incubator-tvm/blob/master/include/tvm/runtime/packed_func.h#L191
 

For the dynamic language side (python), the exposed function is still 
type-erased (as python is a dynamic language).

Note that the dynamic vs statically typed language debate does not really 
apply to this case, because the main goal (exposing to python) implies type 
erasure (python is dynamically typed). The main question is how to reduce the 
number of abstraction layers.

> Also these microbenchmarks are nice, but we also need to consider the
> overhead in typical workloads and see if it's still significant.

If we reason it through, most API cost is going to be FFI cost + exec cost, 
and I think the conclusion so far is that we want the FFI cost to be around 
1e-7s to 1e-6s, which is about the practical floor for any FFI.
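
For context, a rough sense of that floor can be taken with timeit: even a 
pure-Python no-op call costs on the order of 1e-7s, and any FFI adds on top of 
it (a sketch for illustration, not one of the benchmarks from this thread):

```python
import timeit

def nop(*args):
    pass

# Measure the bare cost of one dynamic Python call with a tuple argument.
timer = timeit.Timer("nop((None, 1, 2))", globals={"nop": nop})
timer.timeit(1)  # warm up
n = 100000
print("python nop call:", timer.timeit(n) / n)  # typically ~1e-7s
```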


-- 
You are receiving this because you were mentioned.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-569382845

Re: [apache/incubator-mxnet] [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead (#17097)

2019-12-27 Thread Sheng Zha
Thanks for the explanation. I'm not so concerned about the complexity of
dispatching. If I understood you correctly, the main benefit you describe for
the TVM project was not having to change the C API, but you still need to do
type checking on both ends, or at least on the receiving end of the API,
correct? I think we have discussed similar things in the past, and we might
have different views on strongly typed vs dynamically typed. A priori I prefer
to see an API which can be evolved and changed; I find it more explicit and
clearer than what I think you do with PackedFunc, which I have looked at
briefly but not used extensively. If one is going to call into the C API using
pybind, does it make sense to layer a C++ API on top of the C API for this?

Also these microbenchmarks are nice, but we also need to consider the
overhead in typical workloads and see if it's still significant.

CFFI is another alternative.

I couldn't access your pointers like:

https://github.com/tqchen/tvm/tree/pyffi

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-569335211

Re: [apache/incubator-mxnet] [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead (#17097)

2019-12-26 Thread Tianqi Chen
@larroy indeed every solution has trade-offs, and these tradeoffs are
discussed in the posts above where we compare solutions, backed by benchmarks
:) It would be great if you could also suggest potential tradeoffs here.

When you expose an API from a typed language (c++) to a dynamic language
(python), you have to type-erase it, given that the python functions don't
carry the type, and you have to pass that information along.

The only difference is where you do the type checking (that the python type
corresponds to the right c++ type) and the translation (translating to the
c++ type).

For example, in the case of pybind, the erasure is done implicitly when you
call the python function, then checking and translation happen when you call
into the c++ function.

In the case of creating a C API for each feature and wrapping things on the
python side, the type checking is done on the python side, and the
translation as well.

In the case of the tvm ffi, the type translation is done on the python/cython
side, while the type checking is done in the c++.

To dive deeper into the tradeoffs of the PackedFunc calling convention: the
convention erases the type by storing a type code alongside each argument.
This brings the additional cost of passing arguments on the heap, as opposed
to in registers. So it may not be suited for inline functions that need to
run on the order of 1e-9s; however, for API functions that need to run at
around the 1e-7 or even 1e-8 level, this convention is pretty good.

In terms of the calling cost, it really depends on whether the caller and
callee are strongly typed:
- If the caller is strongly typed, then assigning the type code is O(1).
- If the caller is dynamically typed (like python), then we need a dispatcher
to select the right type code.
- If the callee is strongly typed, then the cost of checking is O(1): just
check that the code is the correct one.
- If the callee is dynamically typed, then dispatching needs to happen, which
adds another level of hashtable lookup, still O(1).

As we can see, the only place where dispatching is necessary is the
dynamically typed case. Even there, if there is a strong need for
specialization, we can force the type by running the check on the caller and
passing in the right type code (the engineering burden is the same as
wrapping the C API). However, the benchmark suggests that the dynamic
dispatching cost is reasonable and satisfies the API speed requirement.
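
To make the caller-side dispatch concrete, here is a minimal python sketch
(the type codes and helper names are hypothetical, not TVM's actual tables):

```python
# Hypothetical type-code table for a dynamically typed caller; a single
# dict lookup selects the code, so dispatch stays O(1).
TYPE_CODE = {int: 0, float: 1, str: 2, tuple: 3, type(None): 4}

def pack_arg(value):
    code = TYPE_CODE.get(type(value))
    if code is None:
        raise TypeError("unsupported argument type: %s" % type(value).__name__)
    return code, value  # (type code, payload) travels to the callee

def check_arg(expected_code, packed):
    # A strongly typed callee only verifies the code: also O(1).
    code, value = packed
    assert code == expected_code, "type mismatch"
    return value

print(check_arg(3, pack_arg((2, 2))))  # (2, 2)
```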

Coming back to the tradeoff: the main tradeoff here is the engineering burden
of keeping an hourglass design (with a fixed set of APIs) vs efficiency.
While my post did not suggest that TVM's ffi is a silver bullet, it does work
pretty well for our use cases. Hope it helps.


-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-569139957

Re: [apache/incubator-mxnet] [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead (#17097)

2019-12-26 Thread Pedro Larroy
Pybind is nice; I used Boost.Python many years ago, which I think Pybind is
based on. The problem with this is the hourglass C bindings: you have to go
from Python to C++/Pybind, down to C, and then to the engine, which seems
like a lot of boilerplate.

On Mon, Dec 16, 2019 at 10:02 PM reminisce wrote:

> MXNet imperative operator invocation overhead is as large as 30-60us,
> which is significant compared to the official NumPy operators with ~600ns
> overhead. This has negatively impacted the performance of applying MXNet to
> models where many operators' kernel runtime is short, especially in the
> area of classic machine learning. We plan to address the problem in two
> steps:
>
> 1. Short term: Use pybind11 to replace the Python op API and ctypes/C API.
> Preliminary experiments show that the pure Python-C++ turnaround time using
> Pybind is between 400-600ns, while the current Python op API using the
> ctypes/C API costs more than 10us. We believe that with the correct
> implementation we can reduce the op invocation overhead to 2us, including
> the time spent on the FFI and the engine.
>
> 2. Long term: Adopt Python's C extension interface. NumPy did this by
> developing its own C API. This provides considerably less overhead than
> other solutions. However, it would cost much more engineering effort to
> integrate this with our existing operator workflow in C++.
>
> @hzfan @hgt312

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-569135990

Re: [apache/incubator-mxnet] [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead (#17097)

2019-12-26 Thread Pedro Larroy
What's the point of having an API if you type-erase it? Then you might as
well have a single-function API with a type-erased callback name to select
the function to call. In the end you move the burden away from the API, onto
the callers and onto the dispatchers inside the API. If we are going the
route of uber-clever template tricks to generate code, I think it's better to
put proper code generation in place for maintainability. Could you provide a
bit more detail about the tradeoffs? Everything has tradeoffs; I don't
believe in any solution sold as a panacea, and there's no silver bullet.

On Thu, Dec 19, 2019 at 10:21 AM Tianqi Chen wrote:

> I have another candidate that I would highly recommend: adopt TVM's FFI
> convention.
>
> The historical problem of the MXNet FFI was the ballooning number of C API
> bindings as we added new features. This creates a huge maintenance burden.
>
> The real problem was not really which FFI system to adopt (cython and
> pybind are fine on that front, except for the cost of compilation), but
> the cost of maintaining the FFI. MXNet used to have a fast cython binding,
> but it was abandoned because, as we kept adding new APIs, we could not keep
> both the ctypes and cython bindings up to date.
>
> When developing TVM we learned from that lesson and restricted the API to
> a limited set of runtime APIs that do not change, with stable cython and
> ctypes bindings for them. The runtime supports a type-erased function
> (PackedFunc), which can be called efficiently from any frontend language,
> and all APIs are exposed through the PackedFunc. On the python side an
> additional wrapper is created for better documentation and to call into
> the PackedFunc. See more in https://docs.tvm.ai/dev/runtime.html The
> system has worked great for a few years now.
>
> Of course I understand there are legacy issues in MXNet; that is why I did
> not bring this proposal up earlier. But given this is a proposal for 2.0,
> I would encourage everyone to give serious thought to this possibility.
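To make the quoted convention concrete, here is a minimal pure-Python sketch
of a fixed entry point plus a function registry (the names are illustrative,
not TVM's actual API):

```python
_REGISTRY = {}

def register_func(name):
    """Register a function under a string name, mimicking a global registry."""
    def deco(f):
        _REGISTRY[name] = f
        return f
    return deco

def call_packed(name, *args):
    """The single stable entry point: look up by name, call type-erased."""
    return _REGISTRY[name](*args)

@register_func("demo.zeros")
def _zeros(shape):
    # Toy stand-in for an operator: zeros for a 1-D shape tuple.
    return [0] * shape[0]

print(call_packed("demo.zeros", (3,)))  # [0, 0, 0]
```

Adding a new feature then means registering another function, not growing a
fixed C surface.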


-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-569135511

Re: [apache/incubator-mxnet] [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead (#17097)

2019-12-24 Thread Tianqi Chen
@hzfan thanks for implementing a poc :) However, there is a subtle but 
important difference that is worth discussing here :) I will use cython-ffi 
to refer to the above approach, and tvm-ffi to refer to tvm's approach.

- In cython-ffi, both data structures and functions are exposed. This means 
that in order to grow the set of functions, we need to expand the set of C 
APIs. In other words, the FFI surface grows as we add more functions.
- In tvm-ffi, the set of C APIs is fixed, and only data structure 
constructors are exposed to the cython side, given that the set of supported 
data structures is also fixed. In this way, we do not have to grow the FFI 
surface as we add functions.
- Another subtle point is that we are passing data structures across dll 
boundaries. In the case of cython-ffi, it could be a c++ container (Tuple). 
TVM's object structure is designed to be C ABI compatible, which allows us to 
construct an object in one dll and pass it to another; this is not 
necessarily true for all c++ classes. There is a potential danger when 
passing a c++ container across DLL boundaries (when two dlls have different 
allocators, calling push_back in the other dll can cause an error).

The difference again boils down to the design question of what makes a clean 
cut for an FFI convention. Ideally, it would be: a stable set of C APIs and 
container structures that does not change over time.
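
A small ctypes sketch of the C-ABI point above (the field names are made up
for illustration, not TVM's actual object header):

```python
import ctypes

class ObjectHeader(ctypes.Structure):
    # A plain-old-data layout is fixed and allocator-free, so any dll can
    # read it; a c++ container carries hidden allocator state and cannot
    # make the same guarantee across DLL boundaries.
    _fields_ = [("type_index", ctypes.c_uint32),
                ("ref_count", ctypes.c_int32)]

hdr = ObjectHeader(type_index=7, ref_count=1)
print(ctypes.sizeof(hdr), hdr.type_index, hdr.ref_count)  # 8 7 1
```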



-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-568777082

Re: [apache/incubator-mxnet] [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead (#17097)

2019-12-24 Thread Haozheng Fan
Following [this 
branch](https://github.com/tqchen/tvm/commit/ddd9323d9f8713b77591260f32529078f585a2ac), 
I made a simple POC on the MXNet side (code 
[here](https://github.com/hzfan/incubator-mxnet/commit/35385cde85267dc848b551a326d3c30c9d369b8e)). 
It turns out that passing a python `tuple` and receiving it as a `TShape` 
takes around 200ns.
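
For reference, the shape of such a micro-benchmark with timeit (here
`poc_ffi.make_shape` is a hypothetical name standing in for the POC's
tuple-to-TShape entry point in the linked commit):

```python
import timeit

setup = "from poc_ffi import make_shape"  # hypothetical module/function
timer = timeit.Timer(setup=setup, stmt="make_shape((2, 2))")
timer.timeit(1)  # warm up
n = 1000
print("tuple -> TShape:", timer.timeit(n) / n)  # ~2e-07s reported above
```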

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-568730399

Re: [apache/incubator-mxnet] [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead (#17097)

2019-12-23 Thread reminisce
Thanks @tqchen for sharing the PoC code within such a short timeframe. :) The 
numbers look promising even with Python native objects deeply copied. Pybind 
performs a deep copy by default unless the receiving object on the C++ end is 
marked as `opaque`, in which case the Python object is passed by reference. 
That is often used for propagating large object changes from C++ to Python. 
In our op invocation use cases there has been no urgency to introduce this 
level of complexity so far, since the Python objects are small and parameter 
passing is a one-way trip. The 300ns overhead should give us a good start on 
squeezing the total overhead into the 2us range. If there is really a need to 
pass `PyObject`s in the future, we can always add that behind a compile flag. 
I think it's worth following [this 
branch](https://github.com/tqchen/tvm/tree/poc-pyffi) to integrate the TVM 
FFI with the MXNet op invocation flow and get more comprehensive benchmark 
results.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-568567917

Re: [apache/incubator-mxnet] [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead (#17097)

2019-12-22 Thread Tianqi Chen
After some thought along this direction, I found a better and more fun answer 
to the above question: support tuple/ellipsis/slice in the tvm ffi 
efficiently.

I quickly hacked up a POC in https://github.com/tqchen/tvm/tree/pyffi that 
supports the following benchmark script (disclaimer: it is only a POC, not 
intended for use or fully optimized, but it demonstrates all the technical 
flows necessary to make a fully functioning FFI).

```python
import timeit
import tvm
nop = tvm._api_internal._nop

setup = """
import tvm
nop = tvm._api_internal._nop
"""
# Mixed tuple/Ellipsis/slice argument through the TVM FFI.
timer = timeit.Timer(setup=setup,
                     stmt='nop((None, ..., slice(0, 100, 2)))')
timer.timeit(1)  # warm up
num_repeat = 1000
print("tvm.tuple_slice_ellipsis_combo:", timer.timeit(num_repeat) / num_repeat)

# Baseline: numpy array creation with a tuple shape.
setup = """
import numpy as np
"""
timer = timeit.Timer(setup=setup,
                     stmt='np.empty((1, 2, 1))')
timer.timeit(1)  # warm up
print("numpy.empty:", timer.timeit(num_repeat) / num_repeat)

# Plain string argument through the TVM FFI.
setup = """
import tvm
nop = tvm._api_internal._nop
"""
timer = timeit.Timer(setup=setup,
                     stmt='nop("mystr")')
timer.timeit(1)  # warm up
print("tvm.str_arg:", timer.timeit(num_repeat) / num_repeat)
```

On my laptop (MacBook 13-inch), the results are as follows:
```
$ TVM_FFI=cython python benchmark_ffi.py
tvm.tuple_slice_ellipsis_combo: 4.6157324e-07
numpy.empty: 2.701659998834e-07
tvm.str_arg: 2.339079997714e-07
```

## What is Implemented in the POC

In the POC, we introduce specific objects for Ellipsis, Slice, and Tuple 
(already supported via the ADT). During a PackedFunc call, a python 
tuple/ellipsis/slice is converted into the corresponding object supported by 
the backend. We implemented a cython version (the previous recursive 
conversion was in python) to back it up.

The reason we are able to create Objects on the cython side is that all TVM 
objects have recently been made C-ABI compatible, so an object can be created 
on the cython side without crossing the DLL boundary and then passed to the 
c++ backend.

We can see from the benchmark that the cost of such a deep copy is at a 
reasonable level. We also only used the default memory allocator, so there 
could be room for further improvement.

## Discussions

Please also see the tradeoff discussion in the last post. As we can see, the 
main differences are where the conversion happens and whether we do a lazy or 
a deep copy:

- In the case of pybind: conversion happens on the c++ side, and data 
structures are created lazily.
- In the case of the POC: conversion happens in cython, and data structures 
are deeply translated into another in-memory format.

The laziness avoids a copy in cases where we do not need to book-keep the 
created argument. On the other hand, supporting a common data structure on 
the c++ side means the binding can potentially be reused by other language 
frontends.

-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-568325041

Re: [apache/incubator-mxnet] [RFC][mxnet 2.0][item 10.1] MXNet Imperative Op Invocation Overhead (#17097)

2019-12-21 Thread Tianqi Chen
The following fast paths can be addressed in the TVM FFI:
- `tuple`, `list`: via translation on the python/cython side (see the 
benchmark above)
- `str`: already fast (see the benchmark above)
- `Context`: can be quite fast if the object is a tvm object, around the same 
magnitude as passing an NDArray
- `np.dtype`: can be handled via str conversion (see the sketch after this 
list), or by introducing a type structure; the TVM FFI supports DLDataType 
natively
- `None`: natively supported by the FFI
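
As a sketch of the `np.dtype` fast path above (illustrative only, not the
actual MXNet/TVM binding): the dtype is lowered to its canonical string name
on the caller side, which the FFI already handles as an ordinary `str`.

```python
import numpy as np

def lower_dtype(dtype):
    """Lower np.dtype (or anything np.dtype accepts) to its string name."""
    return np.dtype(dtype).name

print(lower_dtype(np.float32))  # 'float32'
print(lower_dtype("int64"))     # 'int64'
```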

The following item needs to be discussed:
- py_slice, Ellipsis: can be supported by adding PyObject support; however, 
that introduces dispatching into the FFI layer (making the function 
inaccessible to other language frontends). It would be interesting to discuss 
alternatives.



-- 
You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub:
https://github.com/apache/incubator-mxnet/issues/17097#issuecomment-568234198