We had an e-mail thread about this in 2018 https://lists.apache.org/thread/35pn7s8yzxozqmgx53ympxg63vjvggvm
I still think having a canonical in-memory row format (and libraries to
transform to and from the Arrow columnar format) is a good idea, but there
is the risk of ending up in the tar pit of reinventing Avro.

On Wed, Jul 27, 2022 at 5:11 PM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> Are there more details on what exactly an "Arrow Intermediate
> Representation (AIR)" is? We've talked in the past about maybe having a
> memory layout specification for row-based data as well as column-based
> data. There was also a recent attempt, at least in C++, to build
> utilities to do these pivots, but it was decided that it didn't add much
> utility (it was added as a comprehensive example).
>
> Thanks,
> Micah
>
> On Tue, Jul 26, 2022 at 2:26 PM Laurent Quérel <laurent.que...@gmail.com>
> wrote:
>
> > In the context of this OTEP
> > <https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md>
> > (OpenTelemetry Enhancement Proposal) I developed an integration layer
> > on top of Apache Arrow (Go and Rust) to *facilitate the translation of
> > row-oriented data streams into an Arrow-based columnar representation*.
> > In this particular case the goal was to translate all OpenTelemetry
> > entities (metrics, logs, or traces) into Apache Arrow records. These
> > entities can be quite complex, and their corresponding Arrow schemas
> > must be defined on the fly. IMO, this approach is not specific to my
> > needs but could be used in many other contexts where there is a need
> > to simplify the integration between a row-oriented data source and
> > Apache Arrow. The trade-off is having to perform the additional step
> > of conversion to the intermediate representation, but this
> > transformation does not require understanding the arcana of the Arrow
> > format and makes it possible to benefit from functionality such as
> > dictionary encoding "for free", automatic generation of Arrow schemas,
> > batching, multi-column sorting, etc.
> >
> > I know that JSON can be used as a kind of intermediate representation
> > in the context of Arrow, with some language-specific implementations.
> > Current JSON integrations are insufficient to cover the most complex
> > scenarios and are not standardized; e.g. support for most of the Arrow
> > data types, various optimizations (string|binary dictionaries,
> > multi-column sorting), batching, integration with Arrow IPC,
> > compression ratio optimization, ... The objective of this proposal is
> > to progressively cover these gaps.
> >
> > I am looking to see if the community would be interested in such a
> > contribution. Below are some additional details on the current
> > implementation. All feedback is welcome.
> >
> > 10K ft overview of the current implementation (a rough end-to-end
> > sketch follows the list):
> >
> >    1. Developers convert their row-oriented stream into records based
> >    on the Arrow Intermediate Representation (AIR). At this stage the
> >    translation can be quite mechanical, but if needed developers can
> >    decide, for example, to translate a map into a struct if that makes
> >    sense for them. The current implementation supports the following
> >    Arrow data types: bool, all uints, all ints, all floats, string,
> >    binary, list of any supported type, and struct of any supported
> >    types. Additional Arrow types could be added progressively.
> >    2. The row-oriented record (i.e. AIR record) is then added to a
> >    RecordRepository. This repository will first compute a schema
> >    signature and will route the record to a RecordBatcher based on
> >    this signature.
> >    3. The RecordBatcher is responsible for collecting all the
> >    compatible AIR records and, upon request, the "batcher" is able to
> >    build an Arrow Record representing a batch of compatible inputs. In
> >    the current implementation, the batcher is able to convert string
> >    columns to dictionaries based on a configuration. Another
> >    configuration makes it possible to evaluate which columns should be
> >    sorted to optimize the compression ratio (a toy example of this
> >    sorting appears at the end of this message). The same optimization
> >    process could be applied to binary columns.
> >    4. Steps 1 through 3 can be repeated on the same RecordRepository
> >    instance to build new sets of Arrow record batches. Subsequent
> >    iterations will be slightly faster due to different techniques used
> >    (e.g. object reuse, dictionary reuse and sorting, ...).
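> >
> > To make this concrete, here is a minimal, self-contained Go sketch of
> > the flow described in steps 1-3. All type and method names below are
> > illustrative only; they are not the actual pkg/air API, and the final
> > pivot into an Arrow Record is only indicated by a comment.
> >
> >   // Hypothetical sketch (not the pkg/air API): row-oriented AIR
> >   // records are routed, by schema signature, to a per-schema batcher.
> >   package main
> >
> >   import (
> >       "fmt"
> >       "sort"
> >       "strings"
> >   )
> >
> >   // Field is one named value of a row-oriented record. Type stands
> >   // in for an Arrow data type ("string", "u64", ...).
> >   type Field struct {
> >       Name, Type string
> >       Value      interface{}
> >   }
> >
> >   // Record is a row-oriented AIR record: an ordered list of fields.
> >   type Record struct{ Fields []Field }
> >
> >   // Signature builds a canonical schema signature from field names
> >   // and types, sorted so that field order does not matter.
> >   func (r Record) Signature() string {
> >       parts := make([]string, 0, len(r.Fields))
> >       for _, f := range r.Fields {
> >           parts = append(parts, f.Name+":"+f.Type)
> >       }
> >       sort.Strings(parts)
> >       return strings.Join(parts, ",")
> >   }
> >
> >   // Batcher accumulates compatible records; the real implementation
> >   // would pivot them into an Arrow Record on demand.
> >   type Batcher struct{ rows []Record }
> >
> >   // Repository routes each record to a Batcher keyed by signature.
> >   type Repository struct{ batchers map[string]*Batcher }
> >
> >   func (repo *Repository) Add(r Record) {
> >       sig := r.Signature()
> >       if repo.batchers[sig] == nil {
> >           repo.batchers[sig] = &Batcher{}
> >       }
> >       b := repo.batchers[sig]
> >       b.rows = append(b.rows, r)
> >   }
> >
> >   func main() {
> >       repo := &Repository{batchers: map[string]*Batcher{}}
> >       repo.Add(Record{Fields: []Field{
> >           {Name: "name", Type: "string", Value: "spanA"},
> >           {Name: "start_ns", Type: "u64", Value: uint64(1)},
> >       }})
> >       repo.Add(Record{Fields: []Field{
> >           {Name: "start_ns", Type: "u64", Value: uint64(2)},
> >           {Name: "name", Type: "string", Value: "spanB"},
> >       }})
> >       for sig, b := range repo.batchers {
> >           // Both records share a signature despite their different
> >           // field order, so they land in the same batcher and would
> >           // form a single Arrow record batch.
> >           fmt.Printf("%s -> %d row(s)\n", sig, len(b.rows))
> >       }
> >   }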
> >
> > The current Go implementation
> > <https://github.com/lquerel/otel-arrow-adapter> (WIP) is part of this
> > repo (see the pkg/air package). If the community is interested, I
> > could do a PR in the Arrow Go and Rust sub-projects.
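> >
> > And as a toy illustration of the sorting optimization mentioned in
> > step 3 (again, illustrative code rather than the actual
> > implementation): sorting a batch on its low-cardinality columns first
> > turns them into long runs of repeated values, which compress far
> > better once the batch is serialized (e.g. via Arrow IPC with a
> > compression codec).
> >
> >   // Rows sorted on low-cardinality columns produce runs that
> >   // compress well; high-cardinality columns are left unsorted.
> >   package main
> >
> >   import (
> >       "fmt"
> >       "sort"
> >   )
> >
> >   type row struct {
> >       service string // low cardinality: few distinct values
> >       name    string // medium cardinality
> >       id      int    // high cardinality: not worth sorting on
> >   }
> >
> >   func main() {
> >       rows := []row{
> >           {"frontend", "GET /b", 3},
> >           {"backend", "GET /a", 1},
> >           {"frontend", "GET /a", 2},
> >           {"backend", "GET /b", 4},
> >       }
> >       // Multi-column sort: lowest-cardinality column first.
> >       sort.Slice(rows, func(i, j int) bool {
> >           if rows[i].service != rows[j].service {
> >               return rows[i].service < rows[j].service
> >           }
> >           return rows[i].name < rows[j].name
> >       })
> >       // The service column is now backend, backend, frontend,
> >       // frontend: a run-friendly layout for the encoder/compressor.
> >       for _, r := range rows {
> >           fmt.Println(r.service, r.name, r.id)
> >       }
> >   }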