I haven’t done any profiling yet. The major overhead I can see is the conversion from ColumnVectorBatch to Arrow’s RecordBatch, which involves a memory copy and some transcoding. Also, the current adapter only supports reading an entire stripe as a single batch, which in a lot of cases is not ideal.

I agree that we should maintain backward compatibility. I am wondering whether we could expose another set of interfaces for Arrow, built on top of the same ColumnReader/ColumnWriter classes.
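
As a rough illustration of the copy and transcoding cost mentioned above, here is a minimal sketch. The types `OrcLikeLongBatch` and `ArrowLikeInt64Array` and the function `convert` are hypothetical stand-ins, not the real orc:: or arrow:: classes, and this is not the adapter's actual code. It assumes only that the ORC side keeps one not-null byte per value while the Arrow side uses a bit-packed validity bitmap, so every value is copied and every null flag is re-encoded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an ORC long column batch: one byte per value
// records whether that value is present.
struct OrcLikeLongBatch {
  std::vector<std::int64_t> data;
  std::vector<char> notNull;  // 1 = value present, 0 = null
};

// Hypothetical stand-in for an Arrow int64 array: values plus a bit-packed
// validity bitmap (least-significant bit first within each byte).
struct ArrowLikeInt64Array {
  std::vector<std::int64_t> values;
  std::vector<std::uint8_t> validity;
  std::size_t nullCount = 0;
};

// Converts one batch. Both steps touch every row, which is the overhead
// a direct Arrow interface would try to avoid.
ArrowLikeInt64Array convert(const OrcLikeLongBatch& batch) {
  assert(batch.data.size() == batch.notNull.size());
  ArrowLikeInt64Array out;
  out.values = batch.data;  // memory copy of the value buffer
  out.validity.assign((batch.data.size() + 7) / 8, 0);
  for (std::size_t i = 0; i < batch.notNull.size(); ++i) {
    if (batch.notNull[i]) {
      // Transcoding: byte-per-value flag becomes a single bit.
      out.validity[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
    } else {
      ++out.nullCount;
    }
  }
  return out;
}
```

If the reader decoded directly into Arrow-style buffers, both the value copy and the null-flag loop could in principle be skipped.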

> On Jul 5, 2018, at 8:01 AM, Owen O'Malley <owen.omal...@gmail.com> wrote:
>
> I think improved Arrow C++ integration would be great. I haven't looked at
> the current state of the work to see what could be better. I'd be against
> making Arrow the default C++ API, but changes to the API to make things
> faster for Arrow make sense. (Although as always, we need to worry about
> backwards compatibility.)
>
> Have you tried benchmarking and profiling the current adapters to see where
> the bottlenecks are?
>
> .. Owen
>
> On Wed, Jul 4, 2018 at 1:41 AM, Xiening Dai <xndai....@live.com> wrote:
>
>> Hi all,
>>
>> Not sure if this has been brought up before - do we have plan to support
>> Apache Arrow? Given its popularity and momentum recently, we might consider
>> supporting Arrow format for Orc reader and writer. There’s an adapter for
>> Orc C++ reader -
>> https://github.com/apache/arrow/tree/master/cpp/src/arrow/adapters/orc
>> but the implementation is inefficient. If we want to
>> better integrate with arrow, we should avoid conversions between
>> ColumnVectorBatch and arrow format.
>>