Hi All:

Currently, Arrow provides a naive implementation for converting ORC’s
ColumnVectorBatch to Arrow’s RecordBatch, which incurs a lot of overhead from
memory copying and transcoding.

We would like to add a native API set that lets users read data directly from
an ORC file into Arrow’s RecordBatch. The new API set will be separate from the
current ColumnVectorBatch API, so it won’t raise any backward-compatibility
issues.

Creating a new API set is not an elegant solution, and it requires more
maintenance effort. But given Arrow’s current momentum and its benefits for
sharing columnar data across various platforms and data formats, we believe it
is worthwhile to enable Arrow support in ORC.

Any advice would be appreciated.
Thanks
Yurui 

 ------------------Original Mail ------------------
Sender:Xiening Dai <xndai....@live.com>
Send Date:Fri Jul 6 01:25:34 2018
Recipients:dev@orc.apache.org <dev@orc.apache.org>
Subject:Re: Arrow Support of Orc
I haven’t done profiling. The major overhead I can see is the conversion from
ColumnVectorBatch to Arrow’s RecordBatch, which involves memory copying and
some transcoding. Also, the current adapter only supports reading an entire
stripe as a batch, which in a lot of cases is not ideal. I agree that we should
maintain backward compatibility. I am wondering whether we could expose another
set of interfaces for Arrow, built on top of the same ColumnReader/ColumnWriter
classes.



> On Jul 5, 2018, at 8:01 AM, Owen O'Malley <owen.omal...@gmail.com> wrote:
> 
> I think improved Arrow C++ integration would be great. I haven't looked at
> the current state of the work to see what could be better. I'd be against
> making Arrow the default C++ API, but changes to the API to make things
> faster for Arrow make sense. (Although as always, we need to worry about
> backwards compatibility.)
> 
> Have you tried benchmarking and profiling the current adapters to see where
> the bottlenecks are?
> 
> .. Owen
> 
> On Wed, Jul 4, 2018 at 1:41 AM, Xiening Dai <xndai....@live.com> wrote:
> 
>> Hi all,
>> 
>> Not sure if this has been brought up before - do we have a plan to support
>> Apache Arrow? Given its popularity and momentum recently, we might consider
>> supporting the Arrow format for the Orc reader and writer. There’s an
>> adapter for the Orc C++ reader -
>> https://github.com/apache/arrow/tree/master/cpp/src/arrow/adapters/orc
>> but the implementation is inefficient. If we want to integrate better with
>> Arrow, we should avoid conversions between ColumnVectorBatch and the Arrow
>> format.
>> 
