We had an e-mail thread about this in 2018 https://lists.apache.org/thread/35pn7s8yzxozqmgx53ympxg63vjvggvm
I still think having a canonical in-memory row format (and libraries to
transform to and from the Arrow columnar format) is a good idea, but there
is the risk of ending up in the tar pit of reinventing Avro.

On Wed, Jul 27, 2022 at 5:11 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Are there more details on what exactly an "Arrow Intermediate
> Representation (AIR)" is? We've talked in the past about maybe having a
> memory layout specification for row-based data as well as column-based
> data. There was also a recent attempt, at least in C++, to build
> utilities to do these pivots, but it was decided that it didn't add much
> utility (it was added as a comprehensive example).
>
> Thanks,
> Micah
>
> On Tue, Jul 26, 2022 at 2:26 PM Laurent Quérel <laurent.que...@gmail.com>
> wrote:
>
> > In the context of this OTEP
> > <https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md>
> > (OpenTelemetry Enhancement Proposal) I developed an integration layer
> > on top of Apache Arrow (Go and Rust) to *facilitate the translation of
> > row-oriented data streams into an Arrow-based columnar representation*.
> > In this particular case the goal was to translate all OpenTelemetry
> > entities (metrics, logs, or traces) into Apache Arrow records. These
> > entities can be quite complex, and their corresponding Arrow schemas
> > must be defined on the fly. IMO, this approach is not specific to my
> > needs but could be used in many other contexts where there is a need
> > to simplify the integration between a row-oriented data source and
> > Apache Arrow. The trade-off is having to perform the additional step
> > of conversion to the intermediate representation, but this
> > transformation does not require understanding the arcana of the Arrow
> > format and makes it possible to benefit from functionality such as
> > dictionary encoding "for free", automatic generation of Arrow schemas,
> > batching, multi-column sorting, etc.
> >
> > I know that JSON can be used as a kind of intermediate representation
> > in the context of Arrow, with some language-specific implementations.
> > Current JSON integrations are insufficient to cover the most complex
> > scenarios and are not standardized; e.g. support for most of the Arrow
> > data types, various optimizations (string|binary dictionaries,
> > multi-column sorting), batching, integration with Arrow IPC,
> > compression ratio optimization, ... The objective of this proposal is
> > to progressively cover these gaps.
> >
> > I am looking to see if the community would be interested in such a
> > contribution. Below are some additional details on the current
> > implementation. All feedback is welcome.
> >
> > 10K ft overview of the current implementation (a rough end-to-end
> > sketch follows the list):
> >
> >    1. Developers convert their row-oriented stream into records based
> >    on the Arrow Intermediate Representation (AIR). At this stage the
> >    translation can be quite mechanical, but if needed developers can
> >    decide, for example, to translate a map into a struct if that makes
> >    sense for them. The current implementation supports the following
> >    Arrow data types: bool, all uints, all ints, all floats, string,
> >    binary, list of any supported type, and struct of any supported
> >    types. Additional Arrow types could be added progressively.
> >    2. The row-oriented record (i.e. AIR record) is then added to a
> >    RecordRepository. This repository will first compute a schema
> >    signature and will route the record to a RecordBatcher based on
> >    this signature.
> >    3. The RecordBatcher is responsible for collecting all the
> >    compatible AIR records and, upon request, the "batcher" is able to
> >    build an Arrow Record representing a batch of compatible inputs. In
> >    the current implementation, the batcher is able to convert string
> >    columns to dictionaries based on a configuration. Another
> >    configuration makes it possible to evaluate which columns should be
> >    sorted to optimize the compression ratio (a toy example of this
> >    sorting appears at the end of this message). The same optimization
> >    process could be applied to binary columns.
> >    4. Steps 1 through 3 can be repeated on the same RecordRepository
> >    instance to build new sets of Arrow record batches. Subsequent
> >    iterations will be slightly faster due to different techniques used
> >    (e.g. object reuse, dictionary reuse and sorting, ...).
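> >
> > To make this concrete, here is a minimal, self-contained Go sketch of
> > the flow described in steps 1-3. All type and method names below are
> > illustrative only; they are not the actual pkg/air API, and the final
> > pivot into an Arrow Record is only indicated by a comment.
> >
> >   // Hypothetical sketch (not the pkg/air API): row-oriented AIR
> >   // records are routed, by schema signature, to a per-schema batcher.
> >   package main
> >
> >   import (
> >       "fmt"
> >       "sort"
> >       "strings"
> >   )
> >
> >   // Field is one named value of a row-oriented record. Type stands
> >   // in for an Arrow data type ("string", "u64", ...).
> >   type Field struct {
> >       Name, Type string
> >       Value      interface{}
> >   }
> >
> >   // Record is a row-oriented AIR record: an ordered list of fields.
> >   type Record struct{ Fields []Field }
> >
> >   // Signature builds a canonical schema signature from field names
> >   // and types, sorted so that field order does not matter.
> >   func (r Record) Signature() string {
> >       parts := make([]string, 0, len(r.Fields))
> >       for _, f := range r.Fields {
> >           parts = append(parts, f.Name+":"+f.Type)
> >       }
> >       sort.Strings(parts)
> >       return strings.Join(parts, ",")
> >   }
> >
> >   // Batcher accumulates compatible records; the real implementation
> >   // would pivot them into an Arrow Record on demand.
> >   type Batcher struct{ rows []Record }
> >
> >   // Repository routes each record to a Batcher keyed by signature.
> >   type Repository struct{ batchers map[string]*Batcher }
> >
> >   func (repo *Repository) Add(r Record) {
> >       sig := r.Signature()
> >       if repo.batchers[sig] == nil {
> >           repo.batchers[sig] = &Batcher{}
> >       }
> >       b := repo.batchers[sig]
> >       b.rows = append(b.rows, r)
> >   }
> >
> >   func main() {
> >       repo := &Repository{batchers: map[string]*Batcher{}}
> >       repo.Add(Record{Fields: []Field{
> >           {Name: "name", Type: "string", Value: "spanA"},
> >           {Name: "start_ns", Type: "u64", Value: uint64(1)},
> >       }})
> >       repo.Add(Record{Fields: []Field{
> >           {Name: "start_ns", Type: "u64", Value: uint64(2)},
> >           {Name: "name", Type: "string", Value: "spanB"},
> >       }})
> >       for sig, b := range repo.batchers {
> >           // Both records share a signature despite their different
> >           // field order, so they land in the same batcher and would
> >           // form a single Arrow record batch.
> >           fmt.Printf("%s -> %d row(s)\n", sig, len(b.rows))
> >       }
> >   }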
> >
> > The current Go implementation
> > <https://github.com/lquerel/otel-arrow-adapter> (WIP) is part of this
> > repo (see the pkg/air package). If the community is interested, I
> > could do a PR in the Arrow Go and Rust sub-projects.
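> >
> > And as a toy illustration of the sorting optimization mentioned in
> > step 3 (again, illustrative code rather than the actual
> > implementation): sorting a batch on its low-cardinality columns first
> > turns them into long runs of repeated values, which compress far
> > better once the batch is serialized (e.g. via Arrow IPC with a
> > compression codec).
> >
> >   // Rows sorted on low-cardinality columns produce runs that
> >   // compress well; high-cardinality columns are left unsorted.
> >   package main
> >
> >   import (
> >       "fmt"
> >       "sort"
> >   )
> >
> >   type row struct {
> >       service string // low cardinality: few distinct values
> >       name    string // medium cardinality
> >       id      int    // high cardinality: not worth sorting on
> >   }
> >
> >   func main() {
> >       rows := []row{
> >           {"frontend", "GET /b", 3},
> >           {"backend", "GET /a", 1},
> >           {"frontend", "GET /a", 2},
> >           {"backend", "GET /b", 4},
> >       }
> >       // Multi-column sort: lowest-cardinality column first.
> >       sort.Slice(rows, func(i, j int) bool {
> >           if rows[i].service != rows[j].service {
> >               return rows[i].service < rows[j].service
> >           }
> >           return rows[i].name < rows[j].name
> >       })
> >       // The service column is now backend, backend, frontend,
> >       // frontend: a run-friendly layout for the encoder/compressor.
> >       for _, r := range rows {
> >           fmt.Println(r.service, r.name, r.id)
> >       }
> >   }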