Hi Wes,
I'm also in favor of most of this.  I need to think more about the new list
layout, and I think the RLE encoding as proposed contains redundancies with
dictionary-encoded data that we might not want.

A further question on this: do you expect all of this to be packaged up as
a RecordBatch for IPC, or do you expect to define a new message container?
I think one source of friction for some applications has been the coupling
of encoding with the schema (there are trade-offs here for overall message
sizes): having to decide up front and stick with a certain encoding
(e.g. dictionary vs. non-dictionary, RLE vs. non-RLE) for an entire stream
can be non-ergonomic for data producers.
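
As a rough sketch of what I mean (using the C++ API; the column name and
function names here are just illustrative), the producer has to bake the
encoding choice into the schema handed to the IPC stream writer:

  #include <arrow/api.h>

  // Hypothetical producer: the encoding decision is fixed by the schema
  // that will be handed to arrow::ipc::MakeStreamWriter().
  std::shared_ptr<arrow::Schema> PlainSchema() {
    return arrow::schema({arrow::field("region", arrow::utf8())});
  }
  std::shared_ptr<arrow::Schema> DictSchema() {
    return arrow::schema({arrow::field(
        "region", arrow::dictionary(arrow::int32(), arrow::utf8()))});
  }
  // Every RecordBatch written to the stream must then match whichever of
  // these was chosen, for the lifetime of the stream.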

As an example, given a sequence of Parquet files in a dataset, one needs to
decide up front which encodings the stream should produce for them.  That
means either:
1.  Inspecting the encodings used for a column across all files (or possibly
a sample of them).  This adds latency due to IO.
2.  Choosing the schema up front (maybe based on the first file opened) and
then converting all data to use that encoding (this can waste CPU if the
chosen encoding isn't representative of the dataset; see the sketch after
this list).
3.  Changing the schema (i.e. pushing schema resolution higher up the stack)
as encodings change across files.
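
As a rough sketch of option (2) above (again using the C++ API; the
function name is made up), each incoming column gets normalized to the
dictionary encoding that the stream schema committed to:

  #include <arrow/api.h>
  #include <arrow/compute/api.h>

  // Option 2 sketch: the stream schema says dictionary<int32, utf8>, so a
  // column that arrives as plain utf8 is re-encoded before being written.
  arrow::Result<std::shared_ptr<arrow::Array>> NormalizeToDict(
      const std::shared_ptr<arrow::Array>& column) {
    if (column->type_id() == arrow::Type::DICTIONARY) {
      return column;  // already matches the stream schema
    }
    ARROW_ASSIGN_OR_RAISE(arrow::Datum encoded,
                          arrow::compute::DictionaryEncode(column));
    // This is the CPU that may be wasted when dictionary encoding buys
    // little for this particular file.
    return encoded.make_array();
  }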

Thanks,
Micah

On Mon, Dec 13, 2021 at 12:52 PM Andrew Lamb <al...@influxdata.com> wrote:

> Thank you for writing this down Wes
>
> I think my project is very interested in the RLE encoding and constant
> view.
>
> The StringView, as written, seems fairly tightly tied to C/C++, though I
> may be mistaken. I think allowing Rust to consume such StringViews would
> be possible, but it seems very unlikely that the Rust implementation would
> be able to generate a layout built on `char*`-style pointers with any
> reasonable degree of safety.
>
> > With dictionary and string view, I feel like rle is less important.
>
> While dictionaries certainly help, for sorted low-cardinality data (e.g. 1
> million values drawn from 4 distinct strings) the benefits of RLE for
> compression and processing performance are arbitrarily large, because one
> can encode an arbitrary number of rows in a constant number of RLE runs.
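>
> As a toy illustration (the region names and run lengths here are made up),
> the whole column collapses to a handful of runs regardless of row count:
>
>   // 1,000,000 sorted values drawn from 4 distinct strings need only 4 runs:
>   const char* run_values[] = {"us-east-1", "us-east-2", "us-west-1",
>                               "us-west-2"};
>   int64_t run_lengths[] = {400000, 250000, 250000, 100000};  // sums to 1M
>   // A dictionary encoding would still store ~1M int32 indices for this data.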
>
> Low-cardinality string datasets appear commonly in timeseries data (for
> example, the "AWS region name" field on monitoring data).
>
> Andrew
>
> On Fri, Dec 10, 2021 at 3:18 PM Jacques Nadeau <jacq...@apache.org> wrote:
>
> > I'm strongly in support of much of this. Thanks for bringing this up. It
> > is long overdue.
> >
> > On initial read, my thoughts would be:
> >
> > Strongly inclined:
> > - String view
> > - constant view
> >
> > Weakly inclined:
> > - All null
> > - rle
> >
> > Somewhat disinclined:
> > - Sequence change
> >
> >
> > With dictionary and string view, I feel like rle is less important.
> >
> > I'm not yet seeing huge benefit for sequence change.
> >
> > On Fri, Dec 10, 2021, 11:29 AM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > hello all,
> > >
> > > This topic may provoke some debate, but, given that Arrow is
> > > approaching its 6-year anniversary, I think this is an important
> > > discussion about how we can thoughtfully expand the Arrow
> > > specifications to support next-generation columnar data processing.
> > > I have recently been motivated by interactions with CWI's DuckDB and
> > > Meta's Velox open source projects and the innovations they've made
> > > around data representation, providing beneficial features above and
> > > beyond what we already have in Arrow. For example, they have a
> > > 16-byte "string view" data type that enables buffer memory reuse,
> > > faster "false" comparisons on strings unequal in the first 4 bytes,
> > > and inlining of small strings.
> > > Both the Rust and C++ query engine efforts could potentially benefit
> > > from this (not sure about the memory safety implications in Rust,
> > > comments around this would be helpful).
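> > >
> > > (A rough sketch of one possible 16-byte view layout, along the lines
> > > of what Velox does; the exact field layout is up for discussion, not
> > > a settled proposal:)
> > >
> > >   #include <cstdint>
> > >
> > >   // 16-byte view: the first 4 bytes of the string are always at hand
> > >   // for cheap "not equal" checks; strings of <= 12 bytes live
> > >   // entirely inline, longer ones point into a shared, reusable buffer.
> > >   struct StringView {
> > >     std::uint32_t size;
> > >     char prefix[4];
> > >     union {
> > >       char inlined[8];    // size <= 12: remainder of the string
> > >       const char* data;   // size > 12: pointer to the full string
> > >     };
> > >   };
> > >   static_assert(sizeof(StringView) == 16, "views must stay 16 bytes");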
> > >
> > > I wrote a document to start a discussion about a few new ways to
> > > represent data that may help with building
> > > Arrow-native/Arrow-compatible query engines:
> > >
> > > https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQpCsHICNy9_EUxj4ILeE/edit#
> > >
> > > Each of these potential additions would need to be eventually split
> > > off into independent efforts with associated additions to the columnar
> > > specification, IPC format, C ABI, integration tests, and so on.
> > >
> > > The document is open for anyone to comment on, but if anyone would
> > > like edit access, please feel free to request it. I look forward to
> > > the discussion.
> > >
> > > Thanks,
> > > Wes
> > >
> >
>
