Hello again all, The PR [1] to add string view to the format and the C++ implementation is hovering around passing CI and has been undrafted. Furthermore, there is now also a PR [2] to add string view to the Go implementation. Code review is underway for each PR and I'd like to move toward a vote for acceptance- are there any other preliminaries which I've neglected?
To reiterate the answers to some past questions: - Benchmarks are added in the C++ PR [1] to demonstrate the performance of conversion between the various string formats. In addition, there are some benchmarks which demonstrate the performance gains available with the new format [3]. - Adding string view to the C ABI is a natural follow up, but should be handled independently. An issue has been added to track that enhancement [4]. Sincerely, Ben Kietzman [1] https://github.com/apache/arrow/pull/35628 [2] https://github.com/apache/arrow/pull/35769 [3] https://github.com/apache/arrow/pull/35628#issuecomment-1583218617 [4] https://github.com/apache/arrow/issues/36099 On Wed, May 17, 2023 at 12:53 PM Benjamin Kietzman <bengil...@gmail.com> wrote: > @Jacob > > You mention benchmarks multiple times, are these results published > somewhere? > > I benchmarked the performance of raw pointer vs index offset views in my > PR to velox, > I do intend to port them to my arrow PR but I haven't gotten there yet. > Furthermore, it > seemed less urgent to me since coexistence of the two types in the c++ > implementation > defers the question of how aggressively one should be preferred over the > other. > > @Dewey > > I don't see the C Data interface in the PR > > I have not addressed the C ABI in this PR. As you mention, it may be > useful to transmit > arrays with raw pointer views between implementations which allow them. I > can address > this in a follow up PR. > > @Will > > If I understand correctly, multiple arrays can reference the same buffers > > in memory, but once they are written to IPC their data buffers will be > > duplicated. Is that right? > The buffers will be duplicated. If buffer duplication is becomes a > concern, I'd prefer to handle > that in the ipc writer. Then buffers which are duplicated could be > detected by checking > pointer identity and written only once. > > > On Wed, May 17, 2023 at 12:07 AM Will Jones <will.jones...@gmail.com> > wrote: > >> Hello Ben, >> >> Thanks for your work on this. I think this will be an excellent addition >> to >> the format. >> >> If I understand correctly, multiple arrays can reference the same buffers >> in memory, but once they are written to IPC their data buffers will be >> duplicated. Is that right? >> >> Dictionary types have a special message so they can be reused across >> batches and even fields. Did we consider adding a similar message for >> string view buffers? >> >> One relevant use case I'm curious about is substring extraction. For >> example, if I have a column of URIs and I create columns where I've >> extracted substrings like the hostname, path, and a list column of query >> parameters, I'd like for those latter columns to be views into the URI >> buffers, rather than full copies. >> >> However, I've never touched the IPC read code paths, so it's quite >> possible >> I'm overlooking something obvious. >> >> Best, >> >> Will Jones >> >> >> On Tue, May 16, 2023 at 6:29 PM Dewey Dunnington >> <de...@voltrondata.com.invalid> wrote: >> >> > Very cool! >> > >> > In addition to performance mentioned above, I could see this being >> > useful for the R bindings - we already have a global string pool and a >> > mechanism for keeping a vector of them alive. >> > >> > I don't see the C Data interface in the PR although I may have missed >> > it - is that a part of the proposal? It seems like it would be >> > possible to use raw pointers as long as they can be guaranteed to be >> > valid until the release callback is called? >> > >> > On Tue, May 16, 2023 at 8:43 PM Jacob Wujciak >> > <ja...@voltrondata.com.invalid> wrote: >> > > >> > > Hello Everyone, >> > > I think keeping interoperability with the large ecosystem is a very >> > > important goal for arrow so I am overall in favor of this proposal! >> > > >> > > You mention benchmarks multiple times, are these results published >> > > somewhere? >> > > >> > > Thanks >> > > >> > > On Tue, May 16, 2023 at 11:39 PM Benjamin Kietzman < >> bengil...@gmail.com> >> > > wrote: >> > > >> > > > Hello all, >> > > > >> > > > As previously discussed on this list [1], an UmbraDB/DuckDB/Velox >> > > > compatible >> > > > "string view" type could bring several performance benefits to >> access >> > and >> > > > authoring of string data in the arrow format [2]. Additionally >> better >> > > > interoperability with engines already using this format could be >> > > > established. >> > > > >> > > > PR #0 [3] adds Utf8View and BinaryView types to the C++ >> implementation >> > and >> > > > to >> > > > the IPC format. For the purposes of IPC raw pointers are not used. >> > Instead, >> > > > each view contains a pair of 32 bit unsigned integers which encode >> the >> > > > index of >> > > > a character buffer (string view arrays may consist of a variable >> > number of >> > > > such buffers) and the offset of a view's data within that buffer >> > > > respectively. >> > > > Benefits of this substitution include: >> > > > - This makes explicit the guarantee that lifetime of all character >> > data is >> > > > equal >> > > > to that of the array which views it, which is critical for >> confident >> > > > consumption across an interface boundary. >> > > > - As with other types in the arrow format, such arrays are >> > serializable and >> > > > venue agnostic; directly usable in shared memory without >> > modification. >> > > > - Indices and offsets are easily validated. >> > > > >> > > > Accessing the data requires some trivial pointer arithmetic, but in >> > > > benchmarking >> > > > this had negligible impact on sequential access and only minor >> impact >> > on >> > > > random >> > > > access. >> > > > >> > > > In the C++ implementation, raw pointer string views are supported >> as an >> > > > extended >> > > > case of the Utf8View type: `utf8_view(/*has_raw_pointers=*/true)`. >> > > > Branching on >> > > > this access pattern bit at the data type level has negligible >> impact on >> > > > performance since the branch resides outside any hot loops. Utility >> > > > functions >> > > > are provided for efficient (potentially in-place) conversion between >> > raw >> > > > pointer >> > > > and index offset views. For example, the C++ implementation could >> zero >> > copy >> > > > a raw pointer array from Velox, filter it, then convert to >> > index/offset for >> > > > serialization. Other implementations may choose to accommodate or >> > eschew >> > > > raw >> > > > pointer views as their communities direct. >> > > > >> > > > Where desirous in a rigorously controlled context this still enables >> > > > construction >> > > > and safe consumption of string view arrays which reference memory >> not >> > > > directly bound to the lifetime of the array. I'm not sure when or >> if we >> > > > would >> > > > find it useful to have arrays like this; I do not introduce any in >> > [3]. I >> > > > mention >> > > > this possibility to highlight that if benchmarking demonstrates that >> > such >> > > > an >> > > > approach brings a significant performance benefit to some operation, >> > the >> > > > only >> > > > barrier to its adoption would be code review. Likewise if more >> > intensive >> > > > benchmarking determines that raw pointer views critically outperform >> > > > index/offset >> > > > views for real-world analytics tasks, prioritizing raw pointer >> string >> > views >> > > > for usage within the C++ implementation will be straightforward. >> > > > >> > > > See also the proposal to Velox that their string view vector be >> > refactored >> > > > in a similar vein [4]. >> > > > >> > > > Sincerely, >> > > > Ben Kietzman >> > > > >> > > > [1] >> https://lists.apache.org/thread/49qzofswg1r5z7zh39pjvd1m2ggz2kdq >> > > > [2] http://cidrdb.org/cidr2020/papers/p29-neumann-cidr20.pdf >> > > > [3] https://github.com/apache/arrow/pull/35628 >> > > > [4] https://github.com/facebookincubator/velox/discussions/4362 >> > > > >> > >> >