Hello all,

This topic may provoke debate, but, given that Arrow is approaching its
6-year anniversary, I think this is an important discussion about how
we can thoughtfully expand the Arrow specifications to support
next-generation columnar data processing. Lately, I have been
motivated by interactions with CWI's DuckDB and Meta's Velox
open source projects and the innovations they've made around data
representation providing beneficial features above and beyond what we
have already in Arrow. For example, they have a 16-byte "string view"
data type that enables buffer memory reuse, faster "false" comparisons
on strings unequal in the first 4 bytes, and inline small strings.
Both the Rust and C++ query engine efforts could potentially benefit
from this (not sure about the memory safety implications in Rust,
comments around this would be helpful).
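
To make the "string view" idea concrete, here is a rough Python sketch
of one possible 16-byte layout. The field order, the 12-byte inline
limit, the buffer index plus offset (instead of a raw pointer), and the
helper names make_view, view_bytes, and views_equal are all assumptions
for illustration, not the exact DuckDB or Velox definitions.

```python
import struct

INLINE_MAX = 12  # assumed: bytes left in a 16-byte view after a 4-byte length


def make_view(data: bytes, buffers: list) -> bytes:
    """Pack `data` into a 16-byte view; long strings are appended to `buffers`."""
    n = len(data)
    if n <= INLINE_MAX:
        # Short string: 4-byte length, then the bytes inline, zero-padded to 12.
        return struct.pack("<I12s", n, data)
    # Long string: 4-byte length, 4-byte prefix, buffer index, offset.
    if not buffers:
        buffers.append(bytearray())
    buf_index = len(buffers) - 1
    offset = len(buffers[buf_index])
    buffers[buf_index] += data
    return struct.pack("<I4sII", n, data[:4], buf_index, offset)


def view_bytes(view: bytes, buffers: list) -> bytes:
    """Recover the string bytes that a view refers to."""
    n = struct.unpack_from("<I", view)[0]
    if n <= INLINE_MAX:
        return view[4:4 + n]
    buf_index, offset = struct.unpack_from("<II", view, 8)
    return bytes(buffers[buf_index][offset:offset + n])


def views_equal(a: bytes, b: bytes, buffers: list) -> bool:
    """Compare two views, rejecting early on length or first-4-byte mismatch."""
    # In both the inline and the out-of-line case, the first 8 bytes hold
    # the length and the first 4 bytes of the string, so a mismatch here
    # answers "false" without touching any data buffer.
    if a[:8] != b[:8]:
        return False
    return view_bytes(a, buffers) == view_bytes(b, buffers)
```

In this sketch, several views can point into the same data buffer
(for example, a substring would only need a different offset and
length), which is where the buffer memory reuse comes from.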

I wrote a document to start a discussion about a few new ways to
represent data that may help with building
Arrow-native/Arrow-compatible query engines:

https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQpCsHICNy9_EUxj4ILeE/edit#

Each of these potential additions would eventually need to be split
off into independent efforts with associated additions to the columnar
specification, IPC format, C ABI, integration tests, and so on.

The document is open for anyone to comment on. If you would like
edit access, please feel free to request it. I look forward to the
discussion.

Thanks,
Wes
