There has been a substantial amount of effort put into the arrow-rs Rust
Parquet implementation to handle the corner cases of nested structs and
lists, and all the fun of the various levels of nullability.
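
For anyone who wants to kick the tires, here is a minimal sketch of
writing a batch with a nullable struct column straight to Parquet with
arrow-rs (exact signatures vary a bit across crate versions):

use std::sync::Arc;

use arrow::array::{Array, ArrayRef, Int64Array, StringArray, StructArray};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A struct<id: int64, name: nullable utf8> column with two rows.
    let ids: ArrayRef = Arc::new(Int64Array::from(vec![1, 2]));
    let names: ArrayRef = Arc::new(StringArray::from(vec![Some("a"), None]));
    let person = StructArray::from(vec![
        (Arc::new(Field::new("id", DataType::Int64, false)), ids),
        (Arc::new(Field::new("name", DataType::Utf8, true)), names),
    ]);

    let schema = Arc::new(Schema::new(vec![Field::new(
        "person",
        person.data_type().clone(),
        true,
    )]));
    let batch =
        RecordBatch::try_new(schema.clone(), vec![Arc::new(person) as ArrayRef])?;

    // ArrowWriter computes the definition/repetition levels for the
    // nesting, including the nullable struct and its nullable child.
    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}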

Do let us know if you happen to try writing nested structures directly to
Parquet and run into issues.

Andrew

On Wed, Jul 27, 2022 at 6:56 PM Lee, David <david....@blackrock.com.invalid>
wrote:

> I think this has been addressed for both Parquet and Python to handle
> records including nested structures. Not sure about Rust and Go.
>
> [C++][Parquet] Read and write nested Parquet data with a mix of struct and
> list nesting levels
>
> https://issues.apache.org/jira/browse/ARROW-1644
>
> [Python] Add from_pylist() and to_pylist() to pyarrow.Table to convert
> list of records
>
>
> https://issues.apache.org/jira/browse/ARROW-6001?focusedCommentId=16891152&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16891152
>
>
> -----Original Message-----
> From: Laurent Quérel <laurent.que...@gmail.com>
> Sent: Tuesday, July 26, 2022 2:25 PM
> To: dev@arrow.apache.org
> Subject: [proposal] Arrow Intermediate Representation to facilitate the
> transformation of row-oriented data sources into Arrow columnar
> representation
>
> In the context of this OTEP
> <https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md>
> (OpenTelemetry Enhancement Proposal) I developed an integration layer on
> top of Apache Arrow (Go and Rust) to *facilitate the translation of
> row-oriented data streams into an Arrow-based columnar representation*.
> In this particular case the goal was to translate all OpenTelemetry
> entities (metrics, logs, or traces) into Apache Arrow records. These
> entities can be quite complex and their corresponding Arrow schemas must
> be defined on the fly. IMO, this approach is not specific to my use case
> but could be used in many other contexts where the integration between a
> row-oriented data source and Apache Arrow needs to be simplified. The
> trade-off is the additional step of converting to the intermediate
> representation, but this conversion does not require understanding the
> arcana of the Arrow format and potentially provides features such as
> dictionary encoding "for free", automatic generation of Arrow schemas,
> batching, multi-column sorting, etc.
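>
> To make the "dictionary encoding for free" point concrete, this is the
> kind of transformation the layer can apply on behalf of the user. A
> minimal arrow-rs sketch (the standard cast kernel is one way to
> dictionary-encode an existing string column):
>
> use arrow::array::{Array, StringArray};
> use arrow::compute::cast;
> use arrow::datatypes::DataType;
>
> fn main() {
>     // A plain utf8 column with heavily repeated values...
>     let city = StringArray::from(vec!["paris", "paris", "nyc", "paris"]);
>     // ...re-encoded as Dictionary<Int32, Utf8>: each distinct string is
>     // stored once and the rows become small integer keys, which also
>     // tends to compress much better.
>     let dict = cast(
>         &city,
>         &DataType::Dictionary(Box::new(DataType::Int32), Box::new(DataType::Utf8)),
>     )
>     .unwrap();
>     assert_eq!(dict.len(), 4);
> }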
>
>
> I know that JSON can be used as a kind of intermediate representation in
> the context of Arrow with some language-specific implementations.
> Current JSON integrations are insufficient to cover the most complex
> scenarios and are not standardized; e.g. support for most of the Arrow
> data types, various optimizations (string/binary dictionaries,
> multi-column sorting), batching, integration with Arrow IPC, compression
> ratio optimization, ... The object of this proposal is to progressively
> cover these gaps.
>
> I am looking to see if the community would be interested in such a
> contribution. Below are some additional details on the current
> implementation. All feedback is welcome.
>
> 10K ft overview of the current implementation (a toy sketch of steps 1
> and 2 follows the list):
>
>    1. Developers convert their row-oriented stream into records based on
>    the Arrow Intermediate Representation (AIR). At this stage the
>    translation can be quite mechanical, but if needed developers can
>    decide, for example, to translate a map into a struct if that makes
>    sense for them. The current implementation supports the following
>    Arrow data types: bool, all uints, all ints, all floats, string,
>    binary, list of any supported type, and struct of any supported
>    types. Additional Arrow types could be added progressively.
>    2. The row-oriented record (i.e. AIR record) is then added to a
>    RecordRepository. This repository first computes a schema signature
>    and routes the record to a RecordBatcher based on this signature.
>    3. The RecordBatcher is responsible for collecting all the compatible
>    AIR records and, upon request, building an Arrow Record representing
>    a batch of compatible inputs. In the current implementation, the
>    batcher can convert string columns to dictionaries based on a
>    configuration. Another configuration option determines which columns
>    should be sorted to optimize the compression ratio. The same
>    optimization process could be applied to binary columns.
>    4. Steps 1 through 3 can be repeated on the same RecordRepository
>    instance to build new sets of Arrow record batches. Subsequent
>    iterations will be slightly faster due to different techniques used
>    (e.g. object reuse, dictionary reuse and sorting, ...).
>
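> To make the flow above concrete, here is a hypothetical, self-contained
> Rust sketch of steps 1 and 2 (toy types; none of these names are the
> actual pkg/air API): rows become AIR records, and a repository groups
> them by schema signature so that each group can later be batched into a
> single Arrow record.
>
> use std::collections::BTreeMap;
>
> enum AirValue {
>     I64(i64),
>     Str(String),
>     Struct(BTreeMap<String, AirValue>),
> }
>
> type AirRecord = BTreeMap<String, AirValue>;
>
> // A deterministic signature derived from field names and types; records
> // sharing a signature share an Arrow schema and can be batched together.
> fn signature(rec: &AirRecord) -> String {
>     rec.iter()
>         .map(|(name, v)| match v {
>             AirValue::I64(_) => format!("{name}:i64"),
>             AirValue::Str(_) => format!("{name}:str"),
>             AirValue::Struct(s) => format!("{name}:{{{}}}", signature(s)),
>         })
>         .collect::<Vec<_>>()
>         .join(",")
> }
>
> fn main() {
>     let mut repository: BTreeMap<String, Vec<AirRecord>> = BTreeMap::new();
>
>     // Step 1: mechanical translation of a source row into an AIR record.
>     let mut rec = AirRecord::new();
>     rec.insert("name".into(), AirValue::Str("span-a".into()));
>     rec.insert("count".into(), AirValue::I64(1));
>     rec.insert(
>         "attrs".into(),
>         AirValue::Struct(BTreeMap::from([(
>             "host".into(),
>             AirValue::Str("h1".into()),
>         )])),
>     );
>
>     // Step 2: route the record to the group of compatible records.
>     repository.entry(signature(&rec)).or_default().push(rec);
>
>     // Step 3 (not shown): a batcher builds the Arrow schema on the fly
>     // for each group and emits one record batch per group, optionally
>     // dictionary-encoding and sorting string columns.
>     for (sig, records) in &repository {
>         println!("{sig} -> {} record(s)", records.len());
>     }
> }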
>
> The current Go implementation
> <https://github.com/lquerel/otel-arrow-adapter> (WIP) is part of this
> repo (see the pkg/air package). If the community is interested, I could
> do a PR in the Arrow Go and Rust sub-projects.
