Re: [DISCUSS][C++] Raw pointer string views

Andrew Lamb Thu, 28 Sep 2023 09:09:58 -0700

> What this PR is creating is an "unofficial" Arrow format, with data
types exposed in Arrow C++ that are not part of the Arrow standard, but
are exposed as if they were.


I agree with Antoine here. It seems a pretty clear cut story of the C++
implementation doesn't follow the spec and thus we should either
1.  Update the standard to allow raw pointers
2.  fix the C++ implementation to not have them / treat them as though they
were

If the core usecase is "arrow has the same in memory format used by DuckDB
and Velox, and those systems can't/won't change their implementations" it
seems like the only path forward for that usecase is to adopt their model
(raw pointers) directly. Maybe I am missing something


Andrew






On Thu, Sep 28, 2023 at 11:11 AM Raphael Taylor-Davies
<r.taylordav...@googlemail.com.invalid> wrote:

> FWIW Rust wouldn't have issues using raw pointers, I can't speak for other
> languages though. They would be more expensive to validate, but validation
> is not going to be cheap regardless.
>
> I could definitely see a world where view types use pointers and IPC
> coerces to/from the large non-view types. IPC has to copy the string data
> regardless and re-encoding would avoid encoding masked data.
>
> The notion of supporting both is less of an exciting prospect... I'm also
> not sure if it is too late to make changes at this stage.
>
> On 28 September 2023 15:26:57 BST, Wes McKinney <wesmck...@gmail.com>
> wrote:
> >hi all,
> >
> >I'm just catching up on this thread after having taken a look at the
> format
> >PRs, the C++ implementation PR, and this e-mail thread. So only my $0.02
> >from having spent a great deal less time on this project than others.
> >
> >The original motivation I had for bringing up the idea of adding the
> >StringView concept from DuckDB / Velox / UmbraDB to the Arrow in-memory
> >format (though not necessarily the IPC format) was to provide a path for
> >zero-copy interoperability in some cases with these systems when dealing
> >with strings, and to enhance performance within Arrow-applications
> (setting
> >aside the external interop goal) in scenarios where being able to point to
> >external memory spaces could avoid a copy-and-repack step. I think it's
> >useful to have an zero-copy IPC-compatible string format (i.e. what was
> >proposed and merged into Columnar.rst) for that allows for out-of-order
> >construction or arrays, reuse of memory (e.g. consider the case of
> decoding
> >dictionary encoding Parquet data — not having to copy strings many times
> >when rehydrating string arrays), and chunked allocation — all good things
> >that the existing Arrow VarBinary layout does not provide for.
> >
> >For the in-memory side of things, I am somewhat more of Antoine's
> >perspective that trying to have both in-memory (index+offset and raw
> >pointers) creates a kind of uncanny valley situation that may confuse
> users
> >and cause other problems (especially if the raw pointer version is only
> >found in the C++ library). The raw pointer version also cannot be
> >validated, but I see validation as less of a requirement and more of a
> >"nice to have" (I realize others see validation as more of a requirement).
> >
> >* I see the raw-pointer type has having more net utility (going back to
> the
> >original motivation), but I also see how it is problematic for some
> non-C++
> >implementations.
> >* The index-offset version is intrinsic value over the existing "dense"
> >varbinary layout (per some of the benefits above) but does not satisfy the
> >external interoperability goal with systems that are becoming more popular
> >month over month
> >* Incoming data from external systems that use the raw pointer model have
> >to be serialized (and perhaps repacked) to the index-offset model. This
> >isn't ideal — going the other way (from index-offset to raw pointer) is
> >just a pointer swizzle, comparatively inexpensive.
> >
> >So it seems like we have several paths available, none of them wholly
> >satisfactory:
> >
> >1. Essentially what's in the existing PR — the raw pointer variant which
> is
> >"non-standard"
> >2. Pick one and only one for in memory — I think the raw pointer version
> is
> >more useful given that swizzling from index-offset is pretty cheap. But
> the
> >raw pointer version can't be validated safely and is problematic for e.g.
> >Rust. Picking the index-offset version means that the external ecosystem
> of
> >columnar engines won't be that much closer aligned to Arrow than they are
> >now.
> >3. Implement the raw pointer variant as an extension type in C++ / C ABI.
> >This seems potentially useful but given that it would likely be disfavored
> >for data originating from Arrow-land, there would be fewer scenarios where
> >zero-copy interop for strings is achieved
> >
> >This is difficult and I don't know what the best answer is, but personally
> >my inclination has been toward choices that are utilitarian and help with
> >alignment and cohesion in the open source ecosystem.
> >
> >- Wes
> >
> >On Thu, Sep 28, 2023 at 5:20 AM Antoine Pitrou <anto...@python.org>
> wrote:
> >
> >>
> >> To make things clear, any of the factory functions listed below create a
> >> type that maps exactly onto an Arrow columnar layout:
> >>
> https://arrow.apache.org/docs/dev/cpp/api/datatype.html#factory-functions
> >>
> >> For example, calling `arrow::dictionary` creates a dictionary type that
> >> exactly represents the dictionary layout specified in
> >>
> >>
> https://arrow.apache.org/docs/dev/format/Columnar.html#dictionary-encoded-layout
> >>
> >> Similarly, if you use any of the builders listed below, what you will
> >> get at the end is data that complies with the Arrow columnar
> specification:
> >> https://arrow.apache.org/docs/dev/cpp/api/builder.html
> >>
> >> All the core Arrow C++ APIs create and process data which complies with
> >> the Arrow specification, and which is interoperable with other Arrow
> >> implementations.
> >>
> >> Conversely, non-Arrow data such as CSV or Parquet (or Python lists,
> >> etc.) goes through dedicated converters. There is no ambiguity.
> >>
> >>
> >> Creating top-level utilities that create non-Arrow data introduces
> >> confusion and ambiguity as to what Arrow is. Users who haven't studied
> >> the spec in detail - which is probably most users of Arrow
> >> implementations - will call `arrow::string_view(raw_pointers=true)` and
> >> might later discover that their data cannot be shared with other
> >> implementations (or, if it can, there will be an unsuspected conversion
> >> cost at the edge).
> >>
> >> It also creates a risk of introducing a parallel Arrow-like ecosystem
> >> based on the superset of data layouts understood by Arrow C++. People
> >> may feel encouraged to code for that ecosystem, pessimizing
> >> interoperability with non-C++ runtimes.
> >>
> >> Which is why I think those APIs, however convenient, also go against the
> >> overarching goals of the Arrow project.
> >>
> >>
> >> If we want to keep such convenience APIs as part of Arrow C++, they
> >> should be clearly flagged as being non-Arrow compliant.
> >>
> >> It could be by naming (e.g. `arrow::non_arrow_string_view()`) or by
> >> specific namespacing (e.g. `non_arrow::raw_pointers_string_view()`).
> >>
> >> But, they could be also be provided by a distinct library.
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >>
> >> Le 28/09/2023 à 09:01, Antoine Pitrou a écrit :
> >> >
> >> > Hi Ben,
> >> >
> >> > Le 27/09/2023 à 23:25, Benjamin Kietzman a écrit :
> >> >>
> >> >> @Antoine
> >> >>> What this PR is creating is an "unofficial" Arrow format, with data
> >> >> types exposed in Arrow C++ that are not part of the Arrow standard,
> but
> >> >> are exposed as if they were.
> >> >>
> >> >> We already do this in every implementation of the arrow format I'm
> >> >> aware of: it's more convenient to consider dictionary as a data type
> >> >> even though the spec says that it is a field property.
> >> >
> >> > I'm not sure I understand your point. Dictionary encoding is part of
> the
> >> > Arrow spec, and considering it as a data type is an API choice that
> does
> >> > not violate the spec.
> >> >
> >> > Raw pointers in string views is just not an Arrow format.
> >> >
> >> > Regards
> >> >
> >> > Antoine.
> >>
>

Re: [DISCUSS][C++] Raw pointer string views

Reply via email to