Hello all,

This topic may provoke debate, but given that Arrow is approaching its 6-year anniversary, I think this is an important discussion about how we can thoughtfully expand the Arrow specifications to support next-generation columnar data processing. I have been motivated by recent interactions with CWI's DuckDB and Meta's Velox open source projects and the innovations they've made around data representation, which provide beneficial features above and beyond what we already have in Arrow. For example, they have a 16-byte "string view" data type that enables buffer memory reuse, faster "false" comparisons on strings that differ in the first 4 bytes, and inlining of small strings. Both the Rust and C++ query engine efforts could potentially benefit from this (I'm not sure about the memory safety implications in Rust; comments on this would be helpful).
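To make the string view idea concrete, here is a rough Python sketch of one way a 16-byte view could be laid out. The exact layout is an assumption for illustration only: a 4-byte length followed either by up to 12 inline bytes, or by a 4-byte prefix plus a buffer index and offset (standing in for the raw pointer a native engine might store). The names `make_view`, `view_bytes`, and `views_equal` are hypothetical and are not taken from DuckDB, Velox, or Arrow.

```python
import struct

# Assumption: strings of up to 12 bytes are stored inline in the view.
INLINE_MAX = 12


def make_view(data: bytes, buffers: list) -> bytes:
    """Pack `data` into a 16-byte view; long strings go into a shared buffer."""
    n = len(data)
    if n <= INLINE_MAX:
        # 4-byte length + string bytes, zero-padded to 12 bytes.
        return struct.pack("<i12s", n, data)
    buf = buffers[-1]
    offset = len(buf)
    buf.extend(data)
    # 4-byte length + 4-byte prefix + buffer index + offset into that buffer.
    return struct.pack("<i4sii", n, data[:4], len(buffers) - 1, offset)


def view_bytes(view: bytes, buffers: list) -> bytes:
    """Recover the full string that a view refers to."""
    n = struct.unpack_from("<i", view)[0]
    if n <= INLINE_MAX:
        return view[4:4 + n]
    _, _, index, offset = struct.unpack("<i4sii", view)
    return bytes(buffers[index][offset:offset + n])


def views_equal(a: bytes, b: bytes, buffers: list) -> bool:
    """Compare two views, rejecting early on length or 4-byte prefix mismatch."""
    # In both layouts the first 8 bytes hold the length and the first 4 data bytes,
    # so a mismatch here answers "false" without touching any data buffer.
    if a[:8] != b[:8]:
        return False
    return view_bytes(a, buffers) == view_bytes(b, buffers)


buffers = [bytearray()]
short = make_view(b"arrow", buffers)                      # stored inline
long_a = make_view(b"a considerably longer string", buffers)
long_b = make_view(b"b considerably longer string", buffers)
same_a = make_view(b"a considerably longer string", buffers)
```

In this sketch `views_equal(long_a, long_b, buffers)` returns early on the prefix check, while `views_equal(long_a, same_a, buffers)` falls through to the full comparison. Because views are fixed-size and only reference buffer data, several views can share one buffer, which is the memory reuse property mentioned above.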
I wrote a document to start a discussion about a few new ways to represent data that may help with building Arrow-native/Arrow-compatible query engines:

https://docs.google.com/document/d/12aZi8Inez9L_JCtZ6gi2XDbQpCsHICNy9_EUxj4ILeE/edit#

Each of these potential additions would eventually need to be split off into independent efforts with associated additions to the columnar specification, IPC format, C ABI, integration tests, and so on. The document is open to anyone to comment, but if anyone would like edit access, please feel free to request it. I look forward to the discussion.

Thanks,
Wes