I think this has already been addressed for both Parquet and Python, including
records with nested structures. Not sure about Rust and Go.

[C++][Parquet] Read and write nested Parquet data with a mix of struct and list 
nesting levels

https://issues.apache.org/jira/browse/ARROW-1644

[Python] Add from_pylist() and to_pylist() to pyarrow.Table to convert list of 
records

https://issues.apache.org/jira/browse/ARROW-6001?focusedCommentId=16891152&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16891152
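
For reference, a minimal sketch of building nested struct/list data with the Go
library's in-memory builders (assuming the arrow/go v9 API); this only covers
the in-memory side, not Parquet read/write or a from_pylist-style conversion
helper, so take it only as a partial data point for Go.

package main

import (
    "fmt"

    "github.com/apache/arrow/go/v9/arrow"
    "github.com/apache/arrow/go/v9/arrow/array"
    "github.com/apache/arrow/go/v9/arrow/memory"
)

func main() {
    // A single column "attrs" of type struct<key: string, values: list<int64>>.
    schema := arrow.NewSchema([]arrow.Field{
        {Name: "attrs", Type: arrow.StructOf(
            arrow.Field{Name: "key", Type: arrow.BinaryTypes.String},
            arrow.Field{Name: "values", Type: arrow.ListOf(arrow.PrimitiveTypes.Int64)},
        )},
    }, nil)

    b := array.NewRecordBuilder(memory.DefaultAllocator, schema)
    defer b.Release()

    sb := b.Field(0).(*array.StructBuilder)
    kb := sb.FieldBuilder(0).(*array.StringBuilder)
    lb := sb.FieldBuilder(1).(*array.ListBuilder)
    vb := lb.ValueBuilder().(*array.Int64Builder)

    // Append one row: {key: "a", values: [1, 2]}.
    sb.Append(true)
    kb.Append("a")
    lb.Append(true)
    vb.Append(1)
    vb.Append(2)

    rec := b.NewRecord()
    defer rec.Release()
    fmt.Println(rec.NumRows(), "row(s) built")
}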


-----Original Message-----
From: Laurent Quérel <laurent.que...@gmail.com> 
Sent: Tuesday, July 26, 2022 2:25 PM
To: dev@arrow.apache.org
Subject: [proposal] Arrow Intermediate Representation to facilitate the 
transformation of row-oriented data sources into Arrow columnar representation

In the context of this OTEP
<https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md>
(OpenTelemetry Enhancement Proposal) I developed an integration layer on top of
Apache Arrow (Go and Rust) to *facilitate the translation of row-oriented data
streams into an Arrow-based columnar representation*. In this particular case
the goal was to translate all OpenTelemetry entities (metrics, logs, or traces)
into Apache Arrow records. These entities can be quite complex and their
corresponding Arrow schemas must be defined on the fly. IMO, this approach is
not specific to my use case and could be used in many other contexts where
there is a need to simplify the integration between a row-oriented data source
and Apache Arrow. The trade-off is the additional step of converting to the
intermediate representation, but this transformation does not require
understanding the arcana of the Arrow format and makes it possible to benefit
from functionality such as dictionary encoding "for free", automatic generation
of Arrow schemas, batching, multi-column sorting, etc.
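
For illustration, here is a minimal sketch (not part of the proposal) of what
converting row-oriented data into an Arrow record looks like when using the Go
Arrow library directly (assuming the arrow/go v9 API); the hand-written schema
and per-column builder plumbing below is roughly the boilerplate that the
proposed layer derives automatically from the intermediate representation.

package main

import (
    "fmt"

    "github.com/apache/arrow/go/v9/arrow"
    "github.com/apache/arrow/go/v9/arrow/array"
    "github.com/apache/arrow/go/v9/arrow/memory"
)

// A row-oriented event as it might arrive from a stream.
type Event struct {
    Name  string
    Value float64
}

func main() {
    rows := []Event{{"cpu", 0.42}, {"mem", 0.87}}

    // The schema has to be declared by hand for every record shape.
    schema := arrow.NewSchema([]arrow.Field{
        {Name: "name", Type: arrow.BinaryTypes.String},
        {Name: "value", Type: arrow.PrimitiveTypes.Float64},
    }, nil)

    b := array.NewRecordBuilder(memory.DefaultAllocator, schema)
    defer b.Release()

    // Each column is filled through a type-specific builder.
    for _, r := range rows {
        b.Field(0).(*array.StringBuilder).Append(r.Name)
        b.Field(1).(*array.Float64Builder).Append(r.Value)
    }

    rec := b.NewRecord()
    defer rec.Release()
    fmt.Println(rec.NumRows(), "rows")
}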


I know that JSON can be used as a kind of intermediate representation in the
context of Arrow, with some language-specific implementations. However, the
current JSON integrations are insufficient to cover the most complex scenarios
and are not standardized; e.g. support for most of the Arrow data types,
various optimizations (string/binary dictionaries, multi-column sorting),
batching, integration with Arrow IPC, compression ratio optimization, ... The
goal of this proposal is to progressively cover these gaps.

I am looking to see if the community would be interested in such a
contribution. Below are some additional details on the current implementation.
All feedback is welcome.

10K ft overview of the current implementation (a usage sketch in Go follows the
list):

   1. Developers convert their row-oriented stream into records based on
   the Arrow Intermediate Representation (AIR). At this stage the translation
   can be quite mechanical, but if needed developers can decide, for example,
   to translate a map into a struct if that makes sense for them. The current
   implementation supports the following Arrow data types: bool, all uints,
   all ints, all floats, string, binary, lists of any supported type, and
   structs of any supported type. Additional Arrow types could be added
   progressively.
   2. The row-oriented record (i.e. the AIR record) is then added to a
   RecordRepository. This repository first computes a schema signature and
   routes the record to a RecordBatcher based on this signature.
   3. The RecordBatcher is responsible for collecting all the compatible
   AIR records and, upon request, building an Arrow Record representing a
   batch of compatible inputs. In the current implementation, the batcher is
   able to convert string columns to dictionaries based on a configuration.
   Another configuration allows evaluating which columns should be sorted to
   optimize the compression ratio. The same optimization process could be
   applied to binary columns.
   4. Steps 1 through 3 can be repeated on the same RecordRepository
   instance to build new sets of Arrow record batches. Subsequent iterations
   will be slightly faster due to various techniques (e.g. object reuse,
   dictionary reuse and sorting, ...).
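
To make the workflow above concrete, here is a hypothetical usage sketch in Go.
All identifiers from the air package below (NewRecordRepository, DefaultConfig,
NewRecord, StringField, F64Field, AddRecord, BuildRecordBatches) are
illustrative only and may not match the actual pkg/air API; the intent is just
to show steps 1 through 3 end to end.

// Hypothetical usage of the AIR layer described above. All identifiers from
// the air package are illustrative; the actual pkg/air API may differ.
package main

import (
    "fmt"

    "github.com/lquerel/otel-arrow-adapter/pkg/air" // path/API assumed, see above
)

func main() {
    // Step 2: the repository routes records to batchers by schema signature.
    repo := air.NewRecordRepository(air.DefaultConfig())

    // Step 1: translate each row-oriented entity into an AIR record.
    for _, row := range []struct {
        Name  string
        Value float64
    }{{"cpu", 0.42}, {"mem", 0.87}} {
        rec := air.NewRecord()
        rec.StringField("name", row.Name)
        rec.F64Field("value", row.Value)
        repo.AddRecord(rec)
    }

    // Step 3: ask the batcher(s) for Arrow record batches; schema generation,
    // dictionary encoding and multi-column sorting are handled by the layer.
    batches, err := repo.BuildRecordBatches()
    if err != nil {
        panic(err)
    }
    for _, b := range batches {
        fmt.Println(b.Schema())
    }
}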


The current Go implementation
<https://github.com/lquerel/otel-arrow-adapter> (WIP) is part of this repo (see
the pkg/air package). If the community is interested, I could do a PR in the
Arrow Go and Rust sub-projects.

