Yurui Zhou created ARROW-4714:
---------------------------------

             Summary: Providing JNI interface to Read ORC file via Arrow C++
                 Key: ARROW-4714
                 URL: https://issues.apache.org/jira/browse/ARROW-4714
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++, Java
            Reporter: Yurui Zhou


Currently, if we want to read ORC data into an Arrow RecordBatch in the Java 
runtime, we need to first use the ORC Java reader to load the data into memory 
and then convert it to an Arrow RecordBatch.

However, since the ORC Java reader only reads ORC data into on-heap memory, 
while Arrow RecordBatch only supports off-heap memory in Java, a memory copy is 
unavoidable in the conversion process.  In our internal benchmark, the 
conversion can take up to 25% of the E2E latency when running TPC-H Q1.
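
As a rough illustration of where that copy comes from, here is a minimal 
sketch that uses only java.nio, not the real ORC or Arrow classes. The class 
and method names (HeapCopySketch, copyToOffHeap) are made up for this example; 
the long[] stands in for an on-heap ORC column and the direct ByteBuffer 
stands in for an off-heap Arrow buffer.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustration only: plain java.nio stands in for the ORC and Arrow classes.
final class HeapCopySketch {

    // Copies an on-heap long[] (a stand-in for a column read by the ORC Java
    // reader) into a newly allocated off-heap (direct) buffer.
    static ByteBuffer copyToOffHeap(long[] onHeap) {
        ByteBuffer offHeap = ByteBuffer
                .allocateDirect(onHeap.length * Long.BYTES)
                .order(ByteOrder.LITTLE_ENDIAN);
        // Every value is moved from the Java heap into native memory here;
        // this per-batch copy is the overhead described above.
        offHeap.asLongBuffer().put(onHeap);
        return offHeap;
    }

    public static void main(String[] args) {
        long[] column = {10L, 20L, 30L};
        ByteBuffer buf = copyToOffHeap(column);
        System.out.println("direct buffer: " + buf.isDirect());
        System.out.println("second value:  " + buf.getLong(Long.BYTES));
    }
}
```

The cost of this step grows with the amount of data in each batch, since every 
value has to be written a second time.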

To work around this overhead, one approach is to let the Java runtime read 
data directly from the native ORC C++ reader, which loads data directly into 
off-heap memory, so that only pointer manipulation and schema ser/de are 
involved in the conversion process.
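
The "pointer manipulation only" idea can be sketched in the same java.nio 
terms. This is not the proposed JNI interface: the names (ZeroCopyViewSketch, 
view) are hypothetical, and the direct buffer below is allocated in Java 
purely as a stand-in for memory that the C++ reader would have filled (JNI can 
expose such memory to Java, for example with NewDirectByteBuffer).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustration only: shows a no-copy view over already-filled native memory.
final class ZeroCopyViewSketch {

    // Returns a view over [offset, offset + length) of the same native
    // memory. Only offsets and limits change; no bytes are copied.
    static ByteBuffer view(ByteBuffer nativeMemory, int offset, int length) {
        ByteBuffer dup = nativeMemory.duplicate();
        dup.limit(offset + length);
        dup.position(offset);
        // slice() resets the byte order, so carry it over explicitly.
        return dup.slice().order(nativeMemory.order());
    }

    public static void main(String[] args) {
        // Stand-in for memory written by the native ORC reader.
        ByteBuffer nativeMemory = ByteBuffer.allocateDirect(32)
                .order(ByteOrder.LITTLE_ENDIAN);
        nativeMemory.putLong(8, 42L);

        ByteBuffer column = view(nativeMemory, 8, 16);
        System.out.println("value seen through view: " + column.getLong(0));

        // A later write to the underlying memory is visible through the
        // view, which shows that both refer to the same bytes.
        nativeMemory.putLong(8, 7L);
        System.out.println("after write:             " + column.getLong(0));
    }
}
```

Compared with the copy-based path, the work here is independent of the batch 
size, which is the property the proposal relies on.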



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
