Liya Fan created ARROW-5200:
-------------------------------

             Summary: Provide light-weight arrow APIs
                 Key: ARROW-5200
                 URL: https://issues.apache.org/jira/browse/ARROW-5200
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Java
            Reporter: Liya Fan
            Assignee: Liya Fan
         Attachments: image-2019-04-23-15-19-34-187.png

We are trying to incorporate Apache Arrow into the Apache Flink runtime. We find 
Arrow an amazing library, which greatly simplifies support for the columnar data 
format.

However, for many scenarios, we find the performance unacceptable. Our 
investigation shows that the reason is too many redundant checks and 
computations in the Arrow API.

For example, the following figure shows that a single call to the 
Float8Vector.get(int) method (one of the most frequently used APIs in 
Flink computation) involves 20+ method invocations.

!image-2019-04-23-15-19-34-187.png!

 

There are many other APIs with similar problems. We believe that these checks 
help ensure the integrity of the program. However, they also impact 
performance severely. In our evaluation, performance may degrade by two or 
three orders of magnitude compared to accessing data in heap memory.

We think that, at least for some scenarios, we can give the responsibility for 
integrity checks to application owners. If they can be sure that all the checks 
would pass, we can provide them with some light-weight APIs and the inherent 
high performance.

In the light-weight APIs, we provide only minimal checks, or no checks at 
all. Application owners can still develop and debug their code using the 
original heavy-weight APIs. Once all bugs have been fixed, they can switch to 
the light-weight APIs in their products and enjoy the resulting high performance.
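
To make the idea concrete, here is a minimal sketch of how a checked accessor 
and a light-weight accessor could sit side by side. Everything in it is 
hypothetical: SketchFloat8Vector, getUnchecked and the validity-bitmap layout 
are illustrative stand-ins, not the existing Float8Vector code or a proposed 
final API. The sketch also stores values in a direct ByteBuffer, which still 
performs its own bounds check, so it only shows the API shape, not the full 
performance gain.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical stand-in for a float8 vector; NOT Arrow's Float8Vector. */
final class SketchFloat8Vector {
    private static final int TYPE_WIDTH = 8;   // bytes per double

    private final ByteBuffer values;           // off-heap value buffer
    private final byte[] validity;             // one validity bit per slot
    private final int valueCount;

    SketchFloat8Vector(int valueCount) {
        this.valueCount = valueCount;
        this.values = ByteBuffer.allocateDirect(valueCount * TYPE_WIDTH)
                                .order(ByteOrder.LITTLE_ENDIAN);
        this.validity = new byte[(valueCount + 7) / 8];
    }

    private void checkIndex(int index) {
        if (index < 0 || index >= valueCount) {
            throw new IndexOutOfBoundsException(
                "index " + index + ", valueCount " + valueCount);
        }
    }

    /** Stores a value and marks the slot as non-null. */
    void set(int index, double value) {
        checkIndex(index);
        validity[index >> 3] |= (byte) (1 << (index & 7));
        values.putDouble(index * TYPE_WIDTH, value);
    }

    boolean isNull(int index) {
        checkIndex(index);
        return (validity[index >> 3] & (1 << (index & 7))) == 0;
    }

    /** Heavy-weight accessor: bounds check plus null check on every call. */
    double get(int index) {
        if (isNull(index)) {
            throw new IllegalStateException("value at index " + index + " is null");
        }
        return values.getDouble(index * TYPE_WIDTH);
    }

    /**
     * Light-weight accessor (hypothetical name): skips this class's own
     * index and null checks. The caller must guarantee that the index is
     * in range and the slot is non-null. ByteBuffer still bounds-checks
     * internally; a real implementation would likely read raw memory.
     */
    double getUnchecked(int index) {
        return values.getDouble(index * TYPE_WIDTH);
    }

    public static void main(String[] args) {
        SketchFloat8Vector vector = new SketchFloat8Vector(4);
        for (int i = 0; i < 4; i++) {
            vector.set(i, i + 1.5);
        }
        // During development: checked access catches bad indices and nulls.
        double checkedSum = 0;
        for (int i = 0; i < 4; i++) {
            checkedSum += vector.get(i);
        }
        // In production, once the loop is known to be correct: unchecked access.
        double uncheckedSum = 0;
        for (int i = 0; i < 4; i++) {
            uncheckedSum += vector.getUnchecked(i);
        }
        System.out.println(checkedSum + " " + uncheckedSum);
    }
}
```

Keeping both accessors on the same type means that switching from the 
heavy-weight to the light-weight path is a local change at the call site, 
which matches the develop-then-switch workflow described above.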

 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
