In the context of this OTEP <https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md> (OpenTelemetry Enhancement Proposal) I developed an integration layer on top of Apache Arrow (Go and Rust) to *facilitate the translation of row-oriented data streams into an Arrow-based columnar representation*. In this particular case the goal was to translate all OpenTelemetry entities (metrics, logs, or traces) into Apache Arrow records. These entities can be quite complex and their corresponding Arrow schemas must be defined on the fly. IMO, this approach is not specific to my use case and could be applied in many other contexts where there is a need to simplify the integration between a row-oriented data source and Apache Arrow. The trade-off is the additional step of converting to the intermediate representation, but this transformation does not require understanding the arcana of the Arrow format and potentially provides functionality such as dictionary encoding "for free", automatic Arrow schema generation, batching, multi-column sorting, etc.
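To illustrate why the schemas have to be built on the fly, here is a minimal sketch of the kind of nested schema that even a simple log record with attributes requires. It uses the standard Arrow Go API; the module version in the import path and the field names are assumptions for illustration, not the schema actually produced by the adapter:

```go
package main

import (
	"fmt"

	"github.com/apache/arrow/go/v10/arrow" // module version is an assumption; adjust to your Arrow Go version
)

func main() {
	// Illustrative only: field names are assumptions, not the adapter's actual output.
	schema := arrow.NewSchema([]arrow.Field{
		{Name: "time_unix_nano", Type: arrow.PrimitiveTypes.Uint64},
		{Name: "severity_text", Type: arrow.BinaryTypes.String},
		{Name: "body", Type: arrow.BinaryTypes.String},
		// Attributes vary from one entity to the next, so this struct (and therefore
		// the whole schema) can only be determined once the data has been observed.
		{Name: "attributes", Type: arrow.StructOf(
			arrow.Field{Name: "http.method", Type: arrow.BinaryTypes.String},
			arrow.Field{Name: "http.status_code", Type: arrow.PrimitiveTypes.Int64},
		)},
	}, nil)
	fmt.Println(schema)
}
```

Hand-writing such schemas for every entity variant quickly becomes impractical, which is what the intermediate representation is meant to avoid.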
I know that JSON can be used as a kind of intermediate representation in the context of Arrow in some language-specific implementations. However, current JSON integrations do not cover the most complex scenarios and are not standardized; e.g. support for most of the Arrow data types, various optimizations (string/binary dictionaries, multi-column sorting), batching, integration with Arrow IPC, compression ratio optimization, ... The goal of this proposal is to progressively close these gaps. I am looking to see if the community would be interested in such a contribution. Below are some additional details on the current implementation. All feedback is welcome.

10K ft overview of the current implementation:

1. Developers convert their row-oriented stream into records based on the Arrow Intermediate Representation (AIR). At this stage the translation can be quite mechanical, but if needed developers can decide, for example, to translate a map into a struct if that makes sense for them. The current implementation supports the following Arrow data types: bool, all uints, all ints, all floats, string, binary, lists of any supported type, and structs of any supported types. Additional Arrow types could be added progressively.
2. The row-oriented record (i.e. the AIR record) is then added to a RecordRepository. This repository first computes a schema signature and routes the record to a RecordBatcher based on that signature.
3. The RecordBatcher is responsible for collecting all compatible AIR records and, upon request, building an Arrow Record representing a batch of compatible inputs. In the current implementation, the batcher can convert string columns to dictionaries based on a configuration. Another configuration option makes it possible to evaluate which columns should be sorted to optimize the compression ratio. The same optimization process could be applied to binary columns.
4. Steps 1 through 3 can be repeated on the same RecordRepository instance to build new sets of Arrow record batches. Subsequent iterations will be slightly faster due to the techniques used (e.g. object reuse, dictionary reuse and sorting, ...). A minimal code sketch of this flow is included at the end of this message.

The current Go implementation (WIP) is part of this repo <https://github.com/lquerel/otel-arrow-adapter> (see the pkg/air package). If the community is interested, I could open a PR in the Arrow Go and Rust sub-projects.
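As promised above, here is a minimal Go sketch of how steps 1 through 3 fit together. All identifiers from the `air` package below (NewRecordRepository, DefaultConfig, NewRecord, the field builders, AddRecord, BuildRecords) are hypothetical stand-ins meant only to show the shape of the flow, not the actual names or signatures of the pkg/air API:

```go
package main

// Illustrative only: the air identifiers used here are hypothetical stand-ins
// for the pkg/air API, not its exact names or signatures.

import (
	"fmt"

	"github.com/lquerel/otel-arrow-adapter/pkg/air" // see the repo for the actual API
)

func main() {
	// Steps 2-3 setup: the repository routes AIR records to batchers keyed by schema signature.
	// The config is assumed to control dictionary encoding and column sorting.
	repo := air.NewRecordRepository(air.DefaultConfig())

	// Step 1: mechanically translate each row-oriented event into an AIR record.
	for i := 0; i < 3; i++ {
		rec := air.NewRecord()
		rec.StringField("severity_text", "INFO")
		rec.U64Field("time_unix_nano", uint64(1_000_000*i))
		rec.StructField("attributes",
			air.StringField("http.method", "GET"),
			air.I64Field("http.status_code", 200),
		)
		repo.AddRecord(rec) // routed to a RecordBatcher based on the record's schema signature
	}

	// Step 3: build Arrow Records (batches) from all compatible AIR records collected so far.
	batches, err := repo.BuildRecords()
	if err != nil {
		panic(err)
	}
	for _, batch := range batches {
		fmt.Println(batch.Schema()) // the Arrow schema was derived automatically from the AIR records
	}
}
```

The point of the design is that the developer only describes values field by field; schema inference, batching by schema signature, dictionary encoding, and column sorting happen behind the AddRecord/BuildRecords boundary.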