Following up on Owen's question: do you have an estimate of the performance gains from implementing native support?
Creating a new API for supporting Arrow is a good starting point. Can you come up with a design document first?

On Mon, Jul 16, 2018 at 4:24 AM 周宇睿(闻拙) <yurui....@alibaba-inc.com> wrote:

> Hi All:
>
> Currently Arrow provides a naive implementation for converting
> ColumnVectorBatch to Arrow’s RecordBatch, which involves a lot of overhead
> from memory copying and transcoding.
>
> We would like to add a native API set that lets users read data directly
> from an ORC file into Arrow’s RecordBatch. The new API set will be separate
> from the current ColumnVectorBatch API, so we won’t raise any backward
> compatibility issues.
>
> Creating a new API set is not an elegant solution, and it requires more
> maintenance effort. But given Arrow’s current momentum and its benefits
> for sharing columnar data across various platforms and data formats, we
> believe it is worth enabling Arrow support in ORC.
>
> Any advice would be appreciated.
> Thanks
> Yurui
>
> from Alimail macOS
> ------------------Original Mail ------------------
> Sender: Xiening Dai <xndai....@live.com>
> Send Date: Fri Jul 6 01:25:34 2018
> Recipients: dev@orc.apache.org <dev@orc.apache.org>
> Subject: Re: Arrow Support of Orc
>
> I haven’t done profiling. The major overhead I can see is the conversion
> from ColumnVectorBatch to Arrow’s RecordBatch, which involves memory copy
> and some transcoding. Also, the current adapter only supports reading an
> entire stripe as a batch, which in a lot of cases is not ideal. I agree
> that we should maintain backward compatibility. I am wondering whether we
> could expose another set of interfaces for Arrow, built on top of the same
> ColumnReader/ColumnWriter classes.
>
> > On Jul 5, 2018, at 8:01 AM, Owen O'Malley <owen.omal...@gmail.com> wrote:
> >
> > I think improved Arrow C++ integration would be great. I haven't looked
> > at the current state of the work to see what could be better. I'd be
> > against making Arrow the default C++ API, but changes to the API to make
> > things faster for Arrow make sense. (Although as always, we need to
> > worry about backwards compatibility.)
> >
> > Have you tried benchmarking and profiling the current adapters to see
> > where the bottlenecks are?
> >
> > .. Owen
> >
> > On Wed, Jul 4, 2018 at 1:41 AM, Xiening Dai <xndai....@live.com> wrote:
> >
> >> Hi all,
> >>
> >> Not sure if this has been brought up before - do we have a plan to
> >> support Apache Arrow? Given its recent popularity and momentum, we
> >> might consider supporting the Arrow format for the Orc reader and
> >> writer. There’s an adapter for the Orc C++ reader -
> >> https://github.com/apache/arrow/tree/master/cpp/src/arrow/adapters/orc
> >> but the implementation is inefficient. If we want to better integrate
> >> with Arrow, we should avoid conversions between ColumnVectorBatch and
> >> the Arrow format.
> >>

--
regards,
Deepak Majeti