Hi Sasha,
Thank you very much for this informative comment. It's interesting to see
another use of a row-based API in the context of a query engine. It's worth
thinking about whether these two use cases could converge into a single
public row-based API.

My first reaction is that they are not necessarily easy to reconcile,
because the constraints and the goals being optimized are relatively
disjoint. If you see a way to do it, I'm extremely interested.

If I understand correctly, in your case you want to optimize the conversion
from the column representation to the row representation and vice versa (a
kind of bidirectional projection). Having a SIMD implementation of these
conversions is fantastic. However, it seems that nested types are not
supported yet, and there does not appear to be a public API for building
rows in a simple and ergonomic way outside of this bridge with the
column-based representation.

In the use case I'm trying to solve, the criteria to optimize are: 1) expose
a row-based API that offers the least amount of friction when converting any
row-based source to Arrow, which implies an easy-to-use API and support for
nested types, and 2) make it easy to create an efficient Arrow schema by
automating dictionary creation and multi-column sorting, so that Arrow
becomes easy to use for the casual user.
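
To make the first point more concrete, here is roughly what it takes today
to build a single record containing one nested (list) column with the Go
builder API. This is only a simplified sketch to illustrate the friction I
am referring to (the import paths assume the v9 Go module and the field
names are invented for the example):

    package main

    import (
        "fmt"

        "github.com/apache/arrow/go/v9/arrow"
        "github.com/apache/arrow/go/v9/arrow/array"
        "github.com/apache/arrow/go/v9/arrow/memory"
    )

    func main() {
        pool := memory.NewGoAllocator()

        // The schema must be declared up front, even for a single nested column.
        schema := arrow.NewSchema([]arrow.Field{
            {Name: "name", Type: arrow.BinaryTypes.String},
            {Name: "tags", Type: arrow.ListOf(arrow.BinaryTypes.String)},
        }, nil)

        b := array.NewRecordBuilder(pool, schema)
        defer b.Release()

        // Every column requires a type assertion to its concrete builder.
        b.Field(0).(*array.StringBuilder).Append("span-1")

        // Nested types mean driving the child builders by hand.
        tags := b.Field(1).(*array.ListBuilder)
        tags.Append(true)
        values := tags.ValueBuilder().(*array.StringBuilder)
        values.Append("http")
        values.Append("client")

        rec := b.NewRecord()
        defer rec.Release()

        fmt.Println(rec.NumRows(), "row(s) built")
    }

With the row-based API I have in mind, the same record would be expressed as
a plain row value (fields, lists, nested structs) handed to a repository, and
the schema inference, batching, dictionary encoding and multi-column sorting
would happen behind the scenes.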

The criteria to be optimized seem relatively disjoint to me, but again, I
would be happy to dig into a solution with you that offers a good compromise
for these two use cases.

Best,
Laurent



On Thu, Jul 28, 2022 at 1:46 PM Sasha Krassovsky <krassovskysa...@gmail.com>
wrote:

> Hi everyone,
> I just wanted to chime in that we already do have a form of row-oriented
> storage inside of `arrow/compute/row/row_internal.h`. It is used to store
> rows inside of GroupBy and Join within Acero. We also have utilities for
> converting to/from columnar storage (and AVX2 implementations of these
> conversions) inside of `arrow/compute/row/encode_internal.h`. Would it be
> useful to standardize this row-oriented format?
>
> As far as I understand fixed-width rows would be trivially convertible
> into this representation (just a pointer to your array of structs), while
> variable-width rows would need a little bit of massaging (though not too
> much) to be put into this representation.
>
> Sasha Krassovsky
>
> > On Jul 28, 2022, at 1:10 PM, Laurent Quérel <laurent.que...@gmail.com>
> wrote:
> >
> > Thank you Micah for a very clear summary of the intent behind this
> > proposal. Indeed, I think that clarifying from the beginning that this
> > approach aims at facilitating experimentation more than efficiency in
> terms
> > of performance of the transformation phase would have helped to better
> > understand my objective.
> >
> > Regarding your question, I don't think there is a specific technical
> reason
> > for such an integration in the core library. I was just thinking that it
> > would make this infrastructure easier to find for the users and that this
> > topic was general enough to find its place in the standard library.
> >
> > Best,
> > Laurent
> >
> > On Thu, Jul 28, 2022 at 12:50 PM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> >> Hi Laurent,
> >> I'm retitling this thread to include the specific languages you seem to
> be
> >> targeting in the subject line to hopefully get more eyes from
> maintainers
> >> in those languages.
> >>
> >> Thanks for clarifying the goals.  If I can restate my understanding, the
> >> intended use case here is to provide easy (from the developer's point of
> >> view) adaptation of row-based formats to Arrow.  The means of achieving
> >> this is creating an API for a row-based structure, and having utility
> >> classes that can manipulate the interface to build up batches (there
> is no
> >> serialization or in-memory spec associated with this API).  People
> wishing
> >> to integrate a specific row-based format can extend that API at
> whatever
> >> level makes sense for the format.
> >>
> >> I think this would be useful infrastructure as long as it was made clear
> >> that in many cases this wouldn't be the most efficient way to convert to
> >> Arrow from other formats.
> >>
> >> I don't work much with either the Rust or Go implementation, so I can't
> >> speak to if there is maintainer support for incorporating the changes
> >> directly in Arrow.  Are there any technical reasons for preferring to
> have
> >> this included directly in Arrow vs a separate library?
> >>
> >> Cheers,
> >> Micah
> >>
> >> On Thu, Jul 28, 2022 at 12:34 PM Laurent Quérel <
> laurent.que...@gmail.com>
> >> wrote:
> >>
> >>> Far be it from me to think that I know more than Jorge or Wes on this
> >>> subject. Sorry if my post gives that perception, that is clearly not my
> >>> intention. I'm just trying to defend the idea that when designing this
> >> kind
> >>> of transformation, it might be interesting to have a library to test
> >>> several mappings and evaluate them before doing a more direct
> >>> implementation if the performance is not there.
> >>>
> >>> On Thu, Jul 28, 2022 at 12:15 PM Benjamin Blodgett <
> >>> benjaminblodg...@gmail.com> wrote:
> >>>
> >>>> He was trying to nicely say he knows way more than you, and your ideas
> >>>> will result in a low performance scheme no one will use in production
> >>>> ai/machine learning.
> >>>>
> >>>> Sent from my iPhone
> >>>>
> >>>>> On Jul 28, 2022, at 12:14 PM, Benjamin Blodgett <
> >>>> benjaminblodg...@gmail.com> wrote:
> >>>>>
> >>>>> I think Jorge’s opinion is that of an expert and him being
> >> humble
> >>>> is just being tactful.  Probably listen to Jorge on performance and
> >>>> architecture, even over Wes as he’s contributed more than anyone else
> >> and
> >>>> knows the bleeding edge of low-level performance stuff more than
> anyone.
> >>>>>
> >>>>> Sent from my iPhone
> >>>>>
> >>>>>> On Jul 28, 2022, at 12:03 PM, Laurent Quérel <
> >>> laurent.que...@gmail.com>
> >>>> wrote:
> >>>>>>
> >>>>>> Hi Jorge
> >>>>>>
> >>>>>> I don't think that the level of in-depth knowledge needed is the
> >> same
> >>>>>> between using a row-oriented internal representation and "Arrow"
> >> which
> >>>> not
> >>>>>> only changes the organization of the data but also introduces a set
> >> of
> >>>>>> additional mapping choices and concepts.
> >>>>>>
> >>>>>> For example, assuming that the initial row-oriented data source is a
> >>>> stream
> >>>>>> of nested assemblies of structures, lists and maps. The mapping of
> >> such
> >>> a
> >>>>>> stream to Protobuf, JSON, YAML, ... is straightforward because on
> >> both
> >>>>>> sides the logical representation is exactly the same, the schema is
> >>>>>> sometimes optional, the interest of building batches is optional,
> >> ...
> >>> In
> >>>>>> the case of "Arrow" things are different - the schema and the
> >> batching
> >>>> are
> >>>>>> mandatory. The mapping is not necessarily direct and will generally
> >> be
> >>>> the
> >>>>>> result of the combination of several trade-offs (normalized vs
> >>>>>> denormalized representation, mapping influencing the compression
> >>>> rate,
> >>>>>> queryability with Arrow processors like DataFusion, ...). Note that
> >>>> some of
> >>>>>> these complexities are not intrinsically linked to the fact that the
> >>>> target
> >>>>>> format is column oriented. The ZST format (
> >>>>>> https://zed.brimdata.io/docs/formats/zst/) for example does not
> >>>> require an
> >>>>>> explicit schema definition.
> >>>>>>
> >>>>>> IMHO, having a library that allows you to easily experiment with
> >>>> different
> >>>>>> types of mapping (without having to worry about batching,
> >>> dictionaries,
> >>>>>> schema definition, understanding how lists of structs are
> >> represented,
> >>>> ...)
> >>>>>> and to evaluate the results according to your specific goals has a
> >>> value
> >>>>>> (especially if your criteria are compression ratio and
> >> queryability).
> >>> Of
> >>>>>> course there is an overhead to such an approach. In some cases, at
> >> the
> >>>> end
> >>>>>> of the process, it will be necessary to manually perform this direct
> >>>>>> transformation between a row-oriented XYZ format and "Arrow".
> >> However,
> >>>> this
> >>>>>> effort will be done after a simple experimentation phase to avoid
> >>>> changes
> >>>>>> in the implementation of the converter which in my opinion is not so
> >>>> simple
> >>>>>> to implement with the current Arrow API.
> >>>>>>
> >>>>>> If the Arrow developer community is not interested in integrating
> >> this
> >>>>>> proposal, I plan to release two independent libraries (Go and Rust)
> >>> that
> >>>>>> can be used on top of the standard "Arrow" libraries. This will have
> >>> the
> >>>>>> advantage of helping evaluate whether such an approach can raise interest
> >>>> among
> >>>>>> Arrow users.
> >>>>>>
> >>>>>> Best,
> >>>>>>
> >>>>>> Laurent
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On Wed, Jul 27, 2022 at 9:53 PM Jorge Cardoso Leitão <
> >>>>>>> jorgecarlei...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> Hi Laurent,
> >>>>>>>
> >>>>>>> I agree that there is a common pattern in converting row-based
> >>> formats
> >>>> to
> >>>>>>> Arrow.
> >>>>>>>
> >>>>>>> Imho the difficult part is not to map the storage format to Arrow
> >>>>>>> specifically - it is to map the storage format to any in-memory
> >> (row-
> >>>> or
> >>>>>>> columnar- based) format, since it requires in-depth knowledge about
> >>>> the 2
> >>>>>>> formats (the source format and the target format).
> >>>>>>>
> >>>>>>> - Understanding the Arrow API which can be challenging for complex
> >>>> cases of
> >>>>>>>> rows representing complex objects (list of struct, struct of
> >> struct,
> >>>>>>> ...).
> >>>>>>>>
> >>>>>>>
> >>>>>>> the developer would have the same problem - just shifted around -
> >>> they
> >>>> now
> >>>>>>> need to convert their complex objects to the intermediary
> >>>> representation.
> >>>>>>> Whether it is more "difficult" or "complex" to learn than Arrow is
> >> an
> >>>> open
> >>>>>>> question, but we would essentially be shifting the problem from
> >>>> "learning
> >>>>>>> Arrow" to "learning the intermediate in-memory representation".
> >>>>>>>
> >>>>>>> @Micah Kornfield, as described before my goal is not to define a
> >>> memory
> >>>>>>>> layout specification but more to define an API and a translation
> >>>>>>> mechanism
> >>>>>>>> able to take this intermediate representation (list of generic
> >>> objects
> >>>>>>>> representing the entities to translate) and to convert it into one
> >>> or
> >>>>>>> more
> >>>>>>>> Arrow records.
> >>>>>>>>
> >>>>>>>
> >>>>>>> imho a spec of "list of generic objects representing the entities"
> >> is
> >>>>>>> specified by an in-memory format (not by an API spec).
> >>>>>>>
> >>>>>>> A second challenge I anticipate is that in-memory formats
> >> inherently
> >>>> "own"
> >>>>>>> the memory they outline (since by definition they outline how this
> >>>> memory
> >>>>>>> is outlined). An Intermediate in-memory representation would be no
> >>>>>>> different. Since row-based formats usually require at least one
> >>>> allocation
> >>>>>>> per row (and often more for variable-length types), the
> >>> transformation
> >>>>>>> (storage format -> row-based in-memory format -> Arrow) incurs a
> >>>>>>> significant cost (~2x slower last time I played with this problem
> >> in
> >>>> JSON
> >>>>>>> [1]).
> >>>>>>>
> >>>>>>> A third challenge I anticipate is that given that we have 10+
> >>>> languages, we
> >>>>>>> would eventually need to convert the intermediary representation
> >>> across
> >>>>>>> languages, which imo just hints that we would need to formalize an
> >>>> agnostic
> >>>>>>> spec for such representation (so languages agree on its
> >>>> representation),
> >>>>>>> and thus essentially declare a new (row-based) format.
> >>>>>>>
> >>>>>>> (none of this precludes efforts to invent an in-memory row format
> >> for
> >>>>>>> analytics workloads)
> >>>>>>>
> >>>>>>> @Wes McKinney <wesmck...@gmail.com>
> >>>>>>>
> >>>>>>> I still think having a canonical in-memory row format (and
> >> libraries
> >>>>>>>> to transform to and from Arrow columnar format) is a good idea —
> >> but
> >>>>>>>> there is the risk of ending up in the tar pit of reinventing Avro.
> >>>>>>>>
> >>>>>>>
> >>>>>>> afaik Avro does not have O(1) random access to either its rows or its
> >>>> columns
> >>>>>>> - records are concatenated back to back, every record's column is
> >>>>>>> concatenated back to back within a record, and there is no indexing
> >>>>>>> information on how to access a particular row or column. There are
> >>>> blocks
> >>>>>>> of rows that reduce the cost of accessing large offsets, but imo it
> >>> is
> >>>> far
> >>>>>>> from the O(1) offered by Arrow (and expected by analytics
> >> workloads).
> >>>>>>>
> >>>>>>> [1] https://github.com/jorgecarleitao/arrow2/pull/1024
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Jorge
> >>>>>>>
> >>>>>>> On Thu, Jul 28, 2022 at 5:38 AM Laurent Quérel <
> >>>> laurent.que...@gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Let me clarify the proposal a bit before replying to the various
> >>>> previous
> >>>>>>>> feedbacks.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> It seems to me that the process of converting a row-oriented data
> >>>> source
> >>>>>>>> (row = set of fields or something more hierarchical) into an Arrow
> >>>> record
> >>>>>>>> repeatedly raises the same challenges. A developer who must
> >> perform
> >>>> this
> >>>>>>>> kind of transformation is confronted with the following questions
> >>> and
> >>>>>>>> problems:
> >>>>>>>>
> >>>>>>>> - Understanding the Arrow API which can be challenging for complex
> >>>> cases
> >>>>>>> of
> >>>>>>>> rows representing complex objects (list of struct, struct of
> >> struct,
> >>>>>>> ...).
> >>>>>>>>
> >>>>>>>> - Decide which Arrow schema(s) will correspond to your data
> >> source.
> >>> In
> >>>>>>> some
> >>>>>>>> complex cases it can be advantageous to translate the same
> >>>> row-oriented
> >>>>>>>> data source into several Arrow schemas (e.g. OpenTelemetry data
> >>>>>>> sources).
> >>>>>>>>
> >>>>>>>> - Decide on the encoding of the columns to make the most of the
> >>>>>>>> column-oriented format and thus increase the compression rate
> >> (e.g.
> >>>>>>> define
> >>>>>>>> the columns that should be represented as dictionaries).
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> From experience, I can attest that this process is usually
> >> iterative.
> >>>> For
> >>>>>>>> non-trivial data sources, arriving at the Arrow representation
> >> that
> >>>>>>> offers
> >>>>>>>> the best compression ratio and is still perfectly usable and
> >>> queryable
> >>>>>>> is a
> >>>>>>>> long and tedious process.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I see two approaches to ease this process and consequently
> >> increase
> >>>> the
> >>>>>>>> adoption of Apache Arrow:
> >>>>>>>>
> >>>>>>>> - Definition of a canonical in-memory row format specification
> >> that
> >>>> every
> >>>>>>>> row-oriented data source provider can progressively adopt to get
> >> an
> >>>>>>>> automatic translation into the Arrow format.
> >>>>>>>>
> >>>>>>>> - Definition of an integration library that can map any
> >>>> row-oriented
> >>>>>>>> source into a generic row-oriented source understood by the
> >>>> converter. It
> >>>>>>>> is not about defining a unique in-memory format but more about
> >>>> defining a
> >>>>>>>> standard API to represent row-oriented data.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> In my opinion these two approaches are complementary. The first
> >>> option
> >>>>>>> is a
> >>>>>>>> long-term approach directly targeting the data providers, which
> >> will
> >>>>>>>> require agreement on this generic row-oriented format and whose
> >>>> adoption
> >>>>>>>> may take a long time. The second approach does not directly
> >>>> require
> >>>>>>>> the collaboration of data source providers but allows an
> >>> "integrator"
> >>>> to
> >>>>>>>> perform this transformation painlessly with potentially several
> >>>>>>>> representation trials to achieve the best results in their context.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> The current proposal is an implementation of the second approach,
> >>>> i.e. an
> >>>>>>>> API that maps a row-oriented source XYZ into an intermediate
> >>>> row-oriented
> >>>>>>>> representation understood mechanically by the translator. This
> >>>> translator
> >>>>>>>> also adds a series of optimizations to make the most of the Arrow
> >>>> format.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> You can find multiple examples of such a transformation in the
> >>>> following
> >>>>>>>> examples:
> >>>>>>>>
> >>>>>>>> -
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>
> >>>
> >>
> https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/trace/otlp_to_arrow.go
> >>>>>>>> this example converts OTEL trace entities into their
> >> corresponding
> >>>>>>> Arrow
> >>>>>>>> IR. At the end of this conversion the method returns a collection
> >>> of
> >>>>>>>> Arrow
> >>>>>>>> Records.
> >>>>>>>> - A more complex example can be found here
> >>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>
> >>>
> >>
> https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/metrics/otlp_to_arrow.go
> >>>>>>>> .
> >>>>>>>> In this example a stream of OTEL univariate row-oriented metrics
> >>> is
> >>>>>>>> translated into multivariate row-oriented metrics and then
> >>>>>>> automatically
> >>>>>>>> converted into Arrow Records.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> In these two examples, the creation of dictionaries and
> >> multi-column
> >>>>>>>> sorting is automatically done by the framework and the developer
> >>>> doesn’t
> >>>>>>>> have to worry about the definition of Arrow schemas.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Now let's get to the answers.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> @David Lee, I don't think Parquet and from_pylist() solve this
> >>> problem
> >>>>>>>> particularly well. Parquet is a column-oriented data file format
> >> and
> >>>>>>>> doesn't really help to perform this transformation. The Python
> >>> method
> >>>> is
> >>>>>>>> relatively limited and language specific.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> @Micah Kornfield, as described before my goal is not to define a
> >>>> memory
> >>>>>>>> layout specification but more to define an API and a translation
> >>>>>>> mechanism
> >>>>>>>> able to take this intermediate representation (list of generic
> >>> objects
> >>>>>>>> representing the entities to translate) and to convert it into one
> >>> or
> >>>>>>> more
> >>>>>>>> Arrow records.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> @Wes McKinney, if I interpret your answer correctly, I think you
> >> are
> >>>>>>>> describing option 1 mentioned above. Like you, I think it is an
> >>>>>>>> interesting approach although complementary to the one I propose.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Looking forward to your feedback.
> >>>>>>>>
> >>>>>>>> On Wed, Jul 27, 2022 at 4:19 PM Wes McKinney <wesmck...@gmail.com
> >>>
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> We had an e-mail thread about this in 2018
> >>>>>>>>>
> >>>>>>>>> https://lists.apache.org/thread/35pn7s8yzxozqmgx53ympxg63vjvggvm
> >>>>>>>>>
> >>>>>>>>> I still think having a canonical in-memory row format (and
> >>> libraries
> >>>>>>>>> to transform to and from Arrow columnar format) is a good idea —
> >>> but
> >>>>>>>>> there is the risk of ending up in the tar pit of reinventing
> >> Avro.
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Wed, Jul 27, 2022 at 5:11 PM Micah Kornfield <
> >>>> emkornfi...@gmail.com
> >>>>>>>>
> >>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>> Are there more details on what exactly an "Arrow Intermediate
> >>>>>>>>>> Representation (AIR)" is?  We've talked about in the past maybe
> >>>>>>> having
> >>>>>>>> a
> >>>>>>>>>> memory layout specification for row-based data as well as column
> >>>>>>> based
> >>>>>>>>>> data.  There was also a recent attempt at least in C++ to try to
> >>>>>>> build
> >>>>>>>>>> utilities to do these pivots but it was decided that it didn't
> >> add
> >>>>>>> much
> >>>>>>>>>> utility (it was added as a comprehensive example).
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Micah
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Jul 26, 2022 at 2:26 PM Laurent Quérel <
> >>>>>>>> laurent.que...@gmail.com
> >>>>>>>>>>
> >>>>>>>>>> wrote:
> >>>>>>>>>>
> >>>>>>>>>>> In the context of this OTEP
> >>>>>>>>>>> <
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>
> >>>
> >>
> https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md
> >>>>>>>>>>>>
> >>>>>>>>>>> (OpenTelemetry
> >>>>>>>>>>> Enhancement Proposal) I developed an integration layer on top
> >> of
> >>>>>>>> Apache
> >>>>>>>>>>> Arrow (Go and Rust) to *facilitate the translation of
> >> row-oriented
> >>>>>>>> data
> >>>>>>>>>>> stream into an arrow-based columnar representation*. In this
> >>>>>>>> particular
> >>>>>>>>>>> case the goal was to translate all OpenTelemetry entities
> >>> (metrics,
> >>>>>>>>> logs,
> >>>>>>>>>>> or traces) into Apache Arrow records. These entities can be
> >> quite
> >>>>>>>>> complex
> >>>>>>>>>>> and their corresponding Arrow schema must be defined on the
> >> fly.
> >>>>>>> IMO,
> >>>>>>>>> this
> >>>>>>>>>>> approach is not tied to my specific needs but could be used
> >>> in
> >>>>>>>> many
> >>>>>>>>>>> other contexts where there is a need to simplify the
> >> integration
> >>>>>>>>> between a
> >>>>>>>>>>> row-oriented source of data and Apache Arrow. The trade-off is
> >> having
> >>>> to
> >>>>>>>>>>> perform the additional step of conversion to the intermediate
> >>>>>>>>>>> representation, but this transformation does not require one to
> >>>>>>>> understand
> >>>>>>>>> the
> >>>>>>>>>>> arcana of the Arrow format and allows one to potentially benefit
> >> from
> >>>>>>>>>>> functionalities such as dictionary encoding "for
> >>> free",
> >>>>>>>> the
> >>>>>>>>>>> automatic generation of Arrow schemas, the batching, the
> >>>>>>> multi-column
> >>>>>>>>>>> sorting, etc.
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> I know that JSON can be used as a kind of intermediate
> >>>>>>> representation
> >>>>>>>>> in
> >>>>>>>>>>> the context of Arrow with some language-specific
> >> implementations.
> >>>>>>>>> Current
> >>>>>>>>>>> JSON integrations are insufficient to cover the most complex
> >>>>>>>> scenarios
> >>>>>>>>> and
> >>>>>>>>>>> are not standardized; e.g. support for most of the Arrow data
> >>> types,
> >>>>>>>>> various
> >>>>>>>>>>> optimizations (string|binary dictionaries, multi-column
> >> sorting),
> >>>>>>>>> batching,
> >>>>>>>>>>> integration with Arrow IPC, compression ratio optimization, ...
> >>> The
> >>>>>>>>> object
> >>>>>>>>>>> of this proposal is to progressively cover these gaps.
> >>>>>>>>>>>
> >>>>>>>>>>> I am looking to see if the community would be interested in
> >> such
> >>> a
> >>>>>>>>>>> contribution. Below are some additional details on the current
> >>>>>>>>>>> implementation. All feedback is welcome.
> >>>>>>>>>>>
> >>>>>>>>>>> 10K ft overview of the current implementation:
> >>>>>>>>>>>
> >>>>>>>>>>> 1. Developers convert their row-oriented stream into records
> >>>>>>> based
> >>>>>>>>> on
> >>>>>>>>>>> the Arrow Intermediate Representation (AIR). At this stage the
> >>>>>>>>>>> translation
> >>>>>>>>>>> can be quite mechanical but if needed developers can decide
> >> for
> >>>>>>>>> example
> >>>>>>>>>>> to
> >>>>>>>>>>> translate a map into a struct if that makes sense for them.
> >> The
> >>>>>>>>> current
> >>>>>>>>>>> implementation supports the following Arrow data types: bool,
> >> all
> >>>>>>>>> uints,
> >>>>>>>>>>> all
> >>>>>>>>>>> ints, all floats, string, binary, list of any supported types,
> >>>>>>> and
> >>>>>>>>>>> struct
> >>>>>>>>>>> of any supported types. Additional Arrow types could be added
> >>>>>>>>>>> progressively.
> >>>>>>>>>>> 2. The row-oriented record (i.e. AIR record) is then added to
> >> a
> >>>>>>>>>>> RecordRepository. This repository will first compute a schema
> >>>>>>>>> signature
> >>>>>>>>>>> and
> >>>>>>>>>>> will route the record to a RecordBatcher based on this
> >>>>>>> signature.
> >>>>>>>>>>> 3. The RecordBatcher is responsible for collecting all the
> >>>>>>>>> compatible
> >>>>>>>>>>> AIR records and, upon request, the "batcher" is able to build
> >> an
> >>>>>>>>> Arrow
> >>>>>>>>>>> Record representing a batch of compatible inputs. In the
> >> current
> >>>>>>>>>>> implementation, the batcher is able to convert string columns
> >> to
> >>>>>>>>>>> dictionaries
> >>>>>>>>>>> based on a configuration. Another configuration allows to
> >>>>>>> evaluate
> >>>>>>>>> which
> >>>>>>>>>>> columns should be sorted to optimize the compression ratio.
> >> The
> >>>>>>>> same
> >>>>>>>>>>> optimization process could be applied to binary columns.
> >>>>>>>>>>> 4. Steps 1 through 3 can be repeated on the same
> >>>>>>> RecordRepository
> >>>>>>>>>>> instance to build new sets of arrow record batches. Subsequent
> >>>>>>>>>>> iterations
> >>>>>>>>>>> will be slightly faster due to different techniques used (e.g.
> >>>>>>>>> object
> >>>>>>>>>>> reuse, dictionary reuse and sorting, ...)
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> The current Go implementation
> >>>>>>>>>>> <https://github.com/lquerel/otel-arrow-adapter> (WIP) is
> >>> currently
> >>>>>>>>> part of
> >>>>>>>>>>> this repo (see pkg/air package). If the community is
> >> interested,
> >>> I
> >>>>>>>>> could do
> >>>>>>>>>>> a PR in the Arrow Go and Rust sub-projects.
> >>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Laurent Quérel
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Laurent Quérel
> >>>>
> >>>
> >>>
> >>> --
> >>> Laurent Quérel
> >>>
> >>
> >
> >
> > --
> > Laurent Quérel
>
>

-- 
Laurent Quérel
