Re: Arrow Support of Orc

Xiening Dai Thu, 05 Jul 2018 10:25:49 -0700

I haven’t done any profiling yet. The major overhead I can see is the conversion 
from ColumnVectorBatch to Arrow’s RecordBatch, which involves a memory copy and 
some transcoding. Also, the current adapter only supports reading an entire 
stripe as a single batch, which in a lot of cases is not ideal. I agree that we 
should maintain backward compatibility. I am wondering whether we could expose 
another set of interfaces for Arrow, built on top of the same 
ColumnReader/ColumnWriter classes.
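
To make the copy and transcoding cost concrete, here is a minimal sketch of the 
kind of work such a conversion has to do for a single integer column. It assumes 
the ORC side flags nulls with one byte per value and the Arrow side uses a packed 
validity bitmap (one bit per value, least-significant bit first). The types 
ByteFlaggedColumn and BitPackedColumn and the function convert are hypothetical 
stand-ins for illustration only; they are not the real ORC or Arrow classes, and 
this is not the adapter's actual code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an ORC-style integer column batch:
// nulls are flagged with one byte per value.
struct ByteFlaggedColumn {
  std::vector<int64_t> values;
  std::vector<char> notNull;  // 1 = value present, 0 = null
};

// Hypothetical stand-in for an Arrow-style integer array:
// nulls are flagged with a packed bitmap, least-significant bit first.
struct BitPackedColumn {
  std::vector<int64_t> values;
  std::vector<uint8_t> validity;  // bit i set = row i is valid
  std::size_t nullCount = 0;
};

// Converts one column. Both steps touch every row of the batch.
BitPackedColumn convert(const ByteFlaggedColumn& in) {
  assert(in.values.size() == in.notNull.size());

  BitPackedColumn out;
  out.values = in.values;  // the memory copy

  // The transcoding: one byte per value becomes one bit per value.
  out.validity.assign((in.values.size() + 7) / 8, 0);
  for (std::size_t i = 0; i < in.notNull.size(); ++i) {
    if (in.notNull[i]) {
      out.validity[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    } else {
      ++out.nullCount;
    }
  }
  return out;
}
```

Calling convert once per column per batch gives a rough feel for the per-row 
work; a reader that decoded straight into Arrow-shaped buffers would skip both 
steps.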




> On Jul 5, 2018, at 8:01 AM, Owen O'Malley <owen.omal...@gmail.com> wrote:
> 
> I think improved Arrow C++ integration would be great. I haven't looked at
> the current state of the work to see what could be better. I'd be against
> making Arrow the default C++ API, but changes to the API to make things
> faster for Arrow make sense. (Although as always, we need to worry about
> backwards compatibility.)
> 
> Have you tried benchmarking and profiling the current adapters to see where
> the bottlenecks are?
> 
> .. Owen
> 
> On Wed, Jul 4, 2018 at 1:41 AM, Xiening Dai <xndai....@live.com> wrote:
> 
>> Hi all,
>> 
>> Not sure if this has been brought up before - do we have a plan to support
>> Apache Arrow? Given its recent popularity and momentum, we might consider
>> supporting the Arrow format in the Orc reader and writer. There’s an adapter
>> for the Orc C++ reader -
>> https://github.com/apache/arrow/tree/master/cpp/src/arrow/adapters/orc
>> but the implementation is inefficient. If we want to integrate better with
>> Arrow, we should avoid conversions between ColumnVectorBatch and the Arrow
>> format.
>>
