In the context of this OTEP
<https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md>
(OpenTelemetry
Enhancement Proposal), I developed an integration layer on top of Apache
Arrow (Go and Rust) to *facilitate the translation of row-oriented data
streams into an Arrow-based columnar representation*. In this particular
case the goal was to translate all OpenTelemetry entities (metrics, logs,
or traces) into Apache Arrow records. These entities can be quite complex,
and their corresponding Arrow schemas must be defined on the fly. IMO, this
approach is not specific to my use case and could be applied in many
other contexts where there is a need to simplify the integration between a
row-oriented data source and Apache Arrow. The trade-off is the additional
step of converting to the intermediate representation, but this
transformation does not require understanding the arcana of the Arrow
format, and it makes it possible to benefit from features such as
dictionary encoding "for free", automatic generation of Arrow schemas,
batching, multi-column sorting, etc.
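
To make the idea more concrete, here is a minimal Go sketch of the kind
of intermediate, row-oriented representation I have in mind. The types
below (Value, Struct, List, ...) are purely illustrative and are not the
actual API of the current implementation; the point is simply that a
nested, dynamically typed value tree is enough to infer an Arrow schema
on the fly.

    // Illustrative only -- not the pkg/air API.
    package main

    import "fmt"

    // Value is any field value a row-oriented entity may carry.
    type Value interface{ isValue() }

    type I64 int64
    type Str string
    type Struct map[string]Value
    type List []Value

    func (I64) isValue()    {}
    func (Str) isValue()    {}
    func (Struct) isValue() {}
    func (List) isValue()   {}

    func main() {
        // A simplified "log record" expressed as a row-oriented record.
        // The corresponding Arrow schema (struct of attributes, list of
        // events, ...) can be inferred from this value tree on the fly.
        logRecord := Struct{
            "timestamp": I64(1650000000),
            "severity":  Str("INFO"),
            "attributes": Struct{
                "http.method": Str("GET"),
                "http.status": I64(200),
            },
            "events": List{Str("start"), Str("end")},
        }
        fmt.Println(logRecord)
    }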


I know that JSON can be used as a kind of intermediate representation in
the context of Arrow, with some language-specific implementations. However,
the current JSON integrations are insufficient to cover the most complex
scenarios and are not standardized; e.g. support for most of the Arrow data
types, various optimizations (string/binary dictionaries, multi-column
sorting), batching, integration with Arrow IPC, compression ratio
optimization, ... The objective of this proposal is to progressively close
these gaps.

I am looking to see if the community would be interested in such a
contribution. Below are some additional details on the current
implementation. All feedback is welcome.

10K ft overview of the current implementation:

   1. Developers convert their row-oriented stream into records based on
   the Arrow Intermediate Representation (AIR). At this stage the translation
   can be quite mechanical, but if needed developers can decide, for example,
   to translate a map into a struct if that makes sense for them. The current
   implementation supports the following Arrow data types: bool, all uints,
   all ints, all floats, string, binary, list of any supported type, and
   struct of any supported types. Additional Arrow types could be added
   progressively.
   2. The row-oriented record (i.e. AIR record) is then added to a
   RecordRepository. This repository first computes a schema signature and
   routes the record to a RecordBatcher based on this signature (see the
   sketch after this list).
   3. The RecordBatcher is responsible for collecting all the compatible
   AIR records and, upon request, building an Arrow Record representing a
   batch of compatible inputs. In the current implementation, the batcher
   can convert string columns to dictionaries based on a configuration.
   Another configuration makes it possible to evaluate which columns should
   be sorted to optimize the compression ratio. The same optimization
   process could be applied to binary columns.
   4. Steps 1 through 3 can be repeated on the same RecordRepository
   instance to build new sets of Arrow record batches. Subsequent iterations
   will be slightly faster due to different techniques used (e.g. object
   reuse, dictionary reuse and sorting, ...).
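
To illustrate steps 2 and 3, here is a self-contained Go sketch of the
routing logic. Again, this is not the actual pkg/air code; the Record and
repository types and the signature function are simplified stand-ins. The
idea is that each AIR record's schema is reduced to a canonical signature,
and records sharing a signature are accumulated by the same batcher, which
can then build a single Arrow Record (dictionary encoding and sorting
would be applied at that stage).

    // Illustrative only -- not the pkg/air API.
    package main

    import (
        "fmt"
        "sort"
        "strings"
    )

    // A record is sketched here as a flat map from field name to type label.
    type Record map[string]string

    // signature builds a canonical, order-independent schema signature.
    func signature(r Record) string {
        fields := make([]string, 0, len(r))
        for name, typ := range r {
            fields = append(fields, name+":"+typ)
        }
        sort.Strings(fields)
        return strings.Join(fields, ",")
    }

    // repository routes records to a per-signature batch; in the real
    // design each batch would be handled by a RecordBatcher that
    // eventually builds an Arrow Record.
    type repository struct {
        batches map[string][]Record
    }

    func (repo *repository) add(r Record) {
        sig := signature(r)
        repo.batches[sig] = append(repo.batches[sig], r)
    }

    func main() {
        repo := &repository{batches: map[string][]Record{}}
        repo.add(Record{"timestamp": "int64", "severity": "string"})
        repo.add(Record{"severity": "string", "timestamp": "int64"}) // same signature
        repo.add(Record{"timestamp": "int64", "name": "string"})     // different signature

        for sig, batch := range repo.batches {
            fmt.Printf("signature=%q -> %d record(s)\n", sig, len(batch))
        }
    }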


The Go implementation
<https://github.com/lquerel/otel-arrow-adapter> (WIP) is currently part of
this repo (see the pkg/air package). If the community is interested, I
could submit a PR to the Arrow Go and Rust sub-projects.
