Hello everyone,

I think keeping interoperability with the large ecosystem is a very important goal for Arrow, so I am overall in favor of this proposal!
You mention benchmarks multiple times, are these results published somewhere?

Thanks

On Tue, May 16, 2023 at 11:39 PM Benjamin Kietzman <bengil...@gmail.com> wrote:

> Hello all,
>
> As previously discussed on this list [1], an UmbraDB/DuckDB/Velox
> compatible "string view" type could bring several performance benefits
> to access and authoring of string data in the arrow format [2].
> Additionally better interoperability with engines already using this
> format could be established.
>
> PR #0 [3] adds Utf8View and BinaryView types to the C++ implementation
> and to the IPC format. For the purposes of IPC raw pointers are not
> used. Instead, each view contains a pair of 32 bit unsigned integers
> which encode the index of a character buffer (string view arrays may
> consist of a variable number of such buffers) and the offset of a
> view's data within that buffer respectively.
> Benefits of this substitution include:
> - This makes explicit the guarantee that lifetime of all character data
>   is equal to that of the array which views it, which is critical for
>   confident consumption across an interface boundary.
> - As with other types in the arrow format, such arrays are serializable
>   and venue agnostic; directly usable in shared memory without
>   modification.
> - Indices and offsets are easily validated.
>
> Accessing the data requires some trivial pointer arithmetic, but in
> benchmarking this had negligible impact on sequential access and only
> minor impact on random access.
>
> In the C++ implementation, raw pointer string views are supported as an
> extended case of the Utf8View type:
> `utf8_view(/*has_raw_pointers=*/true)`. Branching on this access
> pattern bit at the data type level has negligible impact on performance
> since the branch resides outside any hot loops. Utility functions are
> provided for efficient (potentially in-place) conversion between raw
> pointer and index offset views. For example, the C++ implementation
> could zero copy a raw pointer array from Velox, filter it, then convert
> to index/offset for serialization. Other implementations may choose to
> accommodate or eschew raw pointer views as their communities direct.
>
> Where desirous in a rigorously controlled context this still enables
> construction and safe consumption of string view arrays which reference
> memory not directly bound to the lifetime of the array. I'm not sure
> when or if we would find it useful to have arrays like this; I do not
> introduce any in [3]. I mention this possibility to highlight that if
> benchmarking demonstrates that such an approach brings a significant
> performance benefit to some operation, the only barrier to its adoption
> would be code review. Likewise if more intensive benchmarking
> determines that raw pointer views critically outperform index/offset
> views for real-world analytics tasks, prioritizing raw pointer string
> views for usage within the C++ implementation will be straightforward.
>
> See also the proposal to Velox that their string view vector be
> refactored in a similar vein [4].
>
> Sincerely,
> Ben Kietzman
>
> [1] https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq
> [2] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf
> [3] https://github.com/apache/arrow/pull/35628
> [4] https://github.com/facebookincubator/velox/discussions/4362
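
To make the index/offset scheme in the quoted message a bit more concrete, here is a minimal C++ sketch of how such a view might be validated and resolved. It is not taken from the PR: the struct `IndexOffsetView`, its field names, the separate `length` field, and the use of `std::string` as a stand-in for character buffers are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical index/offset view (names are illustrative, not Arrow's API).
struct IndexOffsetView {
  std::uint32_t length;        // assumed: size of the string in bytes
  std::uint32_t buffer_index;  // index of the character buffer holding the data
  std::uint32_t offset;        // start of the view's data within that buffer
};

// Validation needs only bounds checks against the known buffers;
// no pointer has to be trusted.
inline bool IsValid(const IndexOffsetView& v,
                    const std::vector<std::string>& buffers) {
  if (v.buffer_index >= buffers.size()) return false;
  const std::size_t size = buffers[v.buffer_index].size();
  return v.offset <= size && v.length <= size - v.offset;
}

// Access is one buffer lookup plus pointer arithmetic. The returned
// string_view is only valid while `buffers` is alive and unmodified.
inline std::string_view Resolve(const IndexOffsetView& v,
                                const std::vector<std::string>& buffers) {
  assert(IsValid(v, buffers));
  return std::string_view(buffers[v.buffer_index].data() + v.offset, v.length);
}
```

With buffers `{"helloworld", "arrow"}`, a view of length 5 into buffer 0 at offset 5 resolves to "world", while a view naming buffer 2 is rejected by `IsValid` because only two buffers exist.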