Re: [DISCUSS][Format] Starting to do some concrete work on the new "StringView" columnar data type

Jorge Cardoso Leitão Mon, 01 Aug 2022 10:58:16 -0700

I am +1 on either - imo:

* it is important to have either available
* both provide a non-trivial improvement over what we have
* the trade-off is difficult to decide upon - I trust whoever is
implementing it to experiment and decide which better fits Arrow and the
ecosystem.


Thank you so much for driving this, Wes.

Best,
Jorge


On Mon, Aug 1, 2022 at 7:14 PM Wes McKinney <wesmck...@gmail.com> wrote:

> On Sun, Jul 31, 2022 at 8:05 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> >
> > Hi Wes,
> >
> > Le 31/07/2022 à 00:02, Wes McKinney a écrit :
> > >
> > > I understand there are still some aspects of this project that cause
> > > some squeamishness (like having arbitrary memory addresses embedded
> > > within array values whose lifetime a C ABI consumer may not know about
> > > -- we already export memory addresses in the C ABI but fewer of them
> > > because they are only the buffers at the array level). We discussed
> > > some alternative approaches that address some of these questions, but
> > > each comes with associated trade-offs.
> >
> > Are any of these trade-offs blocking?
> >
>
> They aren't blocking implementation work at least.
>
> I think the alternative designs / requirements that were discussed were
>
> * Attaching all referenced memory buffers by pointers in the C ABI or
> * Using offsets into an attached buffer instead of pointers
>
> I think that either of these poses conflicts with pooled allocators or
> tiered buffer management, since a single Arrow vector may reference
> many buffers within a memory pool, and different vectors may
> reference different memory chunks in the pool. Externalizing all
> referenced buffers is burdensome in the first case, and the second
> case would require an expensive "repack" operation, defeating the
> goal of zero copy.
>
> You can see a discussion of how Umbra has three different storage
> tiers (persistent, transient, temporary) for out-of-line strings here:
>
> https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf
>
> It might be a good idea to look more carefully at how DuckDB and Velox
> do memory management for the out-of-line strings.
>
> If we start placing restrictions on how the out-of-line string buffers
> are managed and externalized, it risks undermining the zero-copy
> interoperability benefits that we're trying to achieve with this.
>
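
For readers following the thread, the second alternative listed above
(offsets into an attached buffer instead of pointers) can be sketched
roughly as below. This is only an illustration, not a proposed format:
the 16-byte view, the 12-byte inline threshold (borrowed from the Umbra
paper linked above), the field order, and the names encode_view and
decode_view are all assumptions made for the example.

```python
import struct

# Assumption: Umbra-style 16-byte view = 4-byte length + 12 bytes of payload.
INLINE_MAX = 12


def encode_view(value: bytes, buffers: list) -> bytes:
    """Encode one string as a 16-byte view (hypothetical layout).

    Short strings are stored inline. Long strings are appended to the
    last attached buffer and referenced by (buffer index, offset)
    rather than by a raw memory address.
    """
    n = len(value)
    if n <= INLINE_MAX:
        # length, then the bytes themselves, zero-padded to 12 bytes
        return struct.pack("<i12s", n, value)
    buf_index = len(buffers) - 1
    offset = len(buffers[buf_index])
    buffers[buf_index] += value
    # length, 4-byte prefix, buffer index, offset into that buffer
    return struct.pack("<i4sii", n, value[:4], buf_index, offset)


def decode_view(view: bytes, buffers: list) -> bytes:
    """Recover the string bytes from a 16-byte view."""
    (n,) = struct.unpack_from("<i", view, 0)
    if n <= INLINE_MAX:
        return view[4:4 + n]
    buf_index, offset = struct.unpack_from("<ii", view, 8)
    return bytes(buffers[buf_index][offset:offset + n])


buffers = [bytearray()]
short = encode_view(b"arrow", buffers)
long_ = encode_view(b"out-of-line string", buffers)
print(decode_view(short, buffers), decode_view(long_, buffers))
```

In this sketch every long string has to live in one of the attached
buffers, which is where the "repack" cost mentioned above comes from: a
producer whose strings are scattered across many pool chunks would have
to copy them into such a buffer before handing the array over.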
