Could you please simply describe the layout of DuckDB and Velox so we can know what kind of conversion is required from the raw pointer variant? If any engine simply represents string array in the form of something like std::vector<std::string_view>, should we provide a similar variant in C++ to minimize the conversion cost?
Best, Gang On Wed, Sep 27, 2023 at 7:09 AM Raphael Taylor-Davies <r.taylordav...@googlemail.com.invalid> wrote: > I'm confused why this would need to copy string data, assuming the > pointers are into defined memory regions, something necessary for the C > data interface's ownership semantics regardless, why can't these memory > regions just be used as buffers as is? This would therefore require just > rewriting the views buffer to subtract the base pointer of the given > buffer, which should be extremely fast? > > On 26 September 2023 23:34:54 BST, Matt Topol <zotthewiz...@gmail.com> > wrote: > >I believe the motivation is to avoid the cost of the data copy that would > >have to happen to convert from a pointer based to offset based scenario. > >Allowing the pointer-based implementation will ensure that we can maintain > >zero-copy communication with both DuckDB and Velox in a common workflow > >scenario. > > > >Converting to the offset-based version would have a cost of having to copy > >strings from their locations to contiguous buffers which could end up > being > >very significant depending on the shape and size of the data. The pointer > >-based solution wouldn't be allowed in IPC though, only across the C Data > >interface (correct me if I'm wrong). > > > >--Matt > > > >On Tue, Sep 26, 2023, 6:09 PM Raphael Taylor-Davies > ><r.taylordav...@googlemail.com.invalid> wrote: > > > >> Hi, > >> > >> Is the motivation here to avoid DuckDB and Velox having to duplicate the > >> conversion logic from pointer-based to offset-based, or to allow > >> arrow-cpp to operate directly on pointer-based arrays? > >> > >> If it is the former, I personally wouldn't have thought the conversion > >> logic sufficiently complex to really warrant this? > >> > >> If it is the latter, I wonder if you have some benchmark numbers for > >> converting between and operating on the differing representations? In > >> the absence of a strong performance case, it's hard in my opinion to > >> justify adding what will be an arrow-cpp specific extension that isn't > >> part of the standard, with all the potential for confusion and > >> interoperability challenges that entails. > >> > >> Kind Regards, > >> > >> Raphael > >> > >> On 26/09/2023 21:34, Benjamin Kietzman wrote: > >> > Hello all, > >> > > >> > In the PR to add support for Utf8View to the c++ implementation, > >> > I've taken the approach of allowing raw pointer views [1] alongside > the > >> > index/offset views described in the spec [2]. This was done to ease > >> > communication with other engines such as DuckDB and Velox whose native > >> > string representation is the raw pointer view. In order to be usable > >> > as a utility for writing IPC files and other operations on arrow > >> > formatted data, it is useful for the library to be able to directly > >> > import raw pointer arrays even when immediately converting these to > >> > the index/offset representation. > >> > > >> > However there has been objection in review [3] since the raw pointer > >> > representation is not part of the official format. Since data > visitation > >> > utilities are generic, IMHO this hybrid approach does not add > >> > significantly to the complexity of the C++ library, and I feel the > >> > aforementioned interoperability is a high priority when adding this > >> > feature to the C++ library. It's worth noting that this > interoperability > >> > has been a stated goal of the Utf8Type since its original proposal [4] > >> > and throughout the discussion of its adoption [5]. > >> > > >> > Sincerely, > >> > Ben Kietzman > >> > > >> > [1]: > >> > > >> > https://github.com/apache/arrow/pull/37792/files#diff-814ac6f43345f7d2f33e9249a1abf092c8078c62ec44cd782c49b676b94ec302R731-R752 > >> > [2]: > >> > > >> > https://github.com/apache/arrow/blob/9d6d501/docs/source/format/Columnar.rst#L369-L379 > >> > [3]: > https://github.com/apache/arrow/pull/37792#discussion_r1336010665 > >> > [4]: https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq > >> > [5]: https://lists.apache.org/thread/8mofy7khfvy3g1m9pmjshbty3cmvb4w4 > >> > > >> >