In pyarrow.compute, which is built on top of the C++ implementation, there are 
scalar API functions that can logically be used to process rows of data, but 
they are executed on columnar batches of data.

As mentioned previously, it is better to have an API that applies row-level 
transformations than to have an intermediary row-level memory format.
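
For illustration, here is a rough sketch of that idea in Go using the arrow/go 
builder API (import paths and some names differ across major versions, and the 
column names are made up): a transformation that is logically per-row is applied 
by walking the columnar arrays directly, without ever materializing a row object.

package main

import (
    "fmt"

    "github.com/apache/arrow/go/v9/arrow/array"
    "github.com/apache/arrow/go/v9/arrow/memory"
)

func main() {
    pool := memory.NewGoAllocator()

    // Two columns of the same batch.
    qty := buildF64(pool, []float64{1, 2, 3})
    price := buildF64(pool, []float64{9.5, 3.25, 7})
    defer qty.Release()
    defer price.Release()

    // Logically row-level: total[i] = qty[i] * price[i].
    // Physically columnar: only the two arrays are touched, no row object.
    b := array.NewFloat64Builder(pool)
    defer b.Release()
    for i := 0; i < qty.Len(); i++ {
        b.Append(qty.Value(i) * price.Value(i))
    }
    total := b.NewFloat64Array()
    defer total.Release()
    fmt.Println(total)
}

// buildF64 builds an Arrow Float64 array from a plain Go slice.
func buildF64(pool memory.Allocator, vs []float64) *array.Float64 {
    b := array.NewFloat64Builder(pool)
    defer b.Release()
    b.AppendValues(vs, nil)
    return b.NewFloat64Array()
}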


> On Jul 29, 2022, at 3:43 AM, Andrew Lamb <al...@influxdata.com> wrote:
> 
> I am +0 on a standard API -- in the Rust arrow-rs implementation we tend to
> borrow inspiration from the C++ / Java interfaces and then create
> appropriate Rust APIs.
> 
> There is also a row based format in DataFusion [1] (Rust) and it is used to
> implement certain GroupBy and Sorts (similarly to what Sasha Krassovsky
> describes for Acero).
> 
> I think row based formats are common in vectorized query engines for
> operations that can't be easily vectorized (sorts, groups and joins),
> though I am not sure how reusable those formats would be
> 
> There are at least three uses that require slightly different layouts
> 1. Comparing row formatted data for equality (where space efficiency is
> important)
> 2. Comparing row formatted data for comparisons (where collation is
> important)
> 3. Using row formatted data to hold intermediate aggregates (where word
> alignment is important)
> 
> So in other words, I am not sure how easy it would be to define a common
> in-memory layout for rows.
> 
> Andrew
> 
> [1]
> https://github.com/apache/arrow-datafusion/blob/3cd62e9/datafusion/row/src/layout.rs#L29-L75
> 
> 
> 
>> On Fri, Jul 29, 2022 at 2:06 AM Laurent Quérel <laurent.que...@gmail.com>
>> wrote:
>> 
>> Hi Sasha,
>> Thank you very much for this informative comment. It's interesting to see
>> another use of a row-based API in the context of a query engine. I think
>> that there is some thought to be given to whether or not it is possible to
>> converge these two use cases into a single public row-based API.
>> 
>> As a first reaction I would say that it is not necessarily easy to
>> reconcile because the constraints and the goals to be optimized are
>> relatively disjoint. If you see a way to do it I'm extremely interested.
>> 
>> If I understand correctly, in your case, you want to optimize the
>> conversion from column to row representation and vice versa (a kind of
>> bidirectional projection). Having a SIMD implementation of these
>> conversions is just fantastic. However it seems that in your case there is
>> no support for nested types yet and I feel like there is no public API to
>> build rows in a simple and ergonomic way outside this bridge with the
>> column-based representation.
>> 
>> In the use case I'm trying to solve, the criteria to optimize are 1) expose
>> a row-based API that offers the least amount of friction in the process of
>> converting any row-based source to Arrow, which implies an easy-to-use API
>> and support for nested types, 2) make it easy to create an efficient Arrow
>> schema by automating dictionary creation and multi-column sorting in a way
>> that makes Arrow easy to use for the casual user.
>> 
>> The criteria to be optimized seem relatively disjoint to me, but again I
>> would be willing to dig into a solution with you that offers a good compromise
>> for these two use cases.
>> 
>> Best,
>> Laurent
>> 
>> 
>> 
>> On Thu, Jul 28, 2022 at 1:46 PM Sasha Krassovsky <
>> krassovskysa...@gmail.com>
>> wrote:
>> 
>>> Hi everyone,
>>> I just wanted to chime in that we already do have a form of row-oriented
>>> storage inside of `arrow/compute/row/row_internal.h`. It is used to store
>>> rows inside of GroupBy and Join within Acero. We also have utilities for
>>> converting to/from columnar storage (and AVX2 implementations of these
>>> conversions) inside of `arrow/compute/row/encode_internal.h`. Would it be
>>> useful to standardize this row-oriented format?
>>> 
>>> As far as I understand fixed-width rows would be trivially convertible
>>> into this representation (just a pointer to your array of structs), while
>>> variable-width rows would need a little bit of massaging (though not too
>>> much) to be put into this representation.
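>>> 
>>> As a toy illustration of that distinction (these are invented Go types, not
>>> the actual layout in row_internal.h):
>>> 
>>>   // Fixed-width row: every field has a constant size, so a slice of these
>>>   // structs is already one contiguous, row-major buffer that a row format
>>>   // can point at directly.
>>>   type fixedRow struct {
>>>       id    int64
>>>       price float64
>>>       valid bool
>>>   }
>>> 
>>>   // Variable-width row: the string's bytes live in a separate allocation,
>>>   // so they have to be repacked into offsets plus a contiguous byte buffer
>>>   // before the row fits a flat row (or columnar) encoding.
>>>   type varRow struct {
>>>       id   int64
>>>       name string
>>>   }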
>>> 
>>> Sasha Krassovsky
>>> 
>>>> On Jul 28, 2022, at 1:10 PM, Laurent Quérel <laurent.que...@gmail.com>
>>> wrote:
>>>> 
>>>> Thank you Micah for a very clear summary of the intent behind this
>>>> proposal. Indeed, I think that clarifying from the beginning that this
>>>> approach aims at facilitating experimentation more than efficiency in
>>> terms
>>>> of performance of the transformation phase would have helped to better
>>>> understand my objective.
>>>> 
>>>> Regarding your question, I don't think there is a specific technical
>>> reason
>>>> for such an integration in the core library. I was just thinking that
>> it
>>>> would make this infrastructure easier to find for the users and that
>> this
>>>> topic was general enough to find its place in the standard library.
>>>> 
>>>> Best,
>>>> Laurent
>>>> 
>>>> On Thu, Jul 28, 2022 at 12:50 PM Micah Kornfield <
>> emkornfi...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Hi Laurent,
>>>>> I'm retitling this thread to include the specific languages you seem
>> to
>>> be
>>>>> targeting in the subject line to hopefully get more eyes from
>>> maintainers
>>>>> in those languages.
>>>>> 
>>>>> Thanks for clarifying the goals.  If I can restate my understanding,
>> the
>>>>> intended use-case here is to provide easy (from the developer point of
>>>>> view) adaptation of row based formats to Arrow.  The means of
>> achieving
>>>>> this is creating an API for a row-based structure, and having utility
>>>>> classes that can manipulate the interface to build up batches (there is
>>>>> no serialization or in-memory spec associated with this API).  People
>>> wishing
>>>>> to integrate a specific row-based format can extend that API at
>>> whatever
>>>>> level makes sense for the format.
>>>>> 
>>>>> I think this would be useful infrastructure as long as it was made
>> clear
>>>>> that in many cases this wouldn't be the most efficient way to convert
>> to
>>>>> Arrow from other formats.
>>>>> 
>>>>> I don't work much with either the Rust or Go implementation, so I
>> can't
>>>>> speak to whether there is maintainer support for incorporating the changes
>>>>> directly in Arrow.  Are there any technical reasons for preferring to
>>> have
>>>>> this included directly in Arrow vs a separate library?
>>>>> 
>>>>> Cheers,
>>>>> Micah
>>>>> 
>>>>> On Thu, Jul 28, 2022 at 12:34 PM Laurent Quérel <
>>> laurent.que...@gmail.com>
>>>>> wrote:
>>>>> 
>>>>>> Far be it from me to think that I know more than Jorge or Wes on this
>>>>>> subject. Sorry if my post gives that perception, that is clearly not
>> my
>>>>>> intention. I'm just trying to defend the idea that when designing
>> this
>>>>> kind
>>>>>> of transformation, it might be interesting to have a library to test
>>>>>> several mappings and evaluate them before doing a more direct
>>>>>> implementation if the performance is not there.
>>>>>> 
>>>>>> On Thu, Jul 28, 2022 at 12:15 PM Benjamin Blodgett <
>>>>>> benjaminblodg...@gmail.com> wrote:
>>>>>> 
>>>>>>> He was trying to nicely say he knows way more than you, and your ideas
>>>>>>> will result in a low-performance scheme no one will use in production
>>>>>>> AI/machine learning.
>>>>>>> 
>>>>>>> 
>>>>>>>> On Jul 28, 2022, at 12:14 PM, Benjamin Blodgett <
>>>>>>> benjaminblodg...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> I think Jorge’s opinion is that of an expert and him being humble
>>>>>>>> is just being tactful.  Probably listen to Jorge on performance and
>>>>>>>> architecture, even over Wes, as he’s contributed more than anyone else
>>>>>>>> and knows the bleeding edge of low-level performance stuff more than
>>>>>>>> anyone.
>>>>>>>> 
>>>>>>>> 
>>>>>>>>> On Jul 28, 2022, at 12:03 PM, Laurent Quérel <
>>>>>> laurent.que...@gmail.com>
>>>>>>> wrote:
>>>>>>>>> 
>>>>>>>>> Hi Jorge
>>>>>>>>> 
>>>>>>>>> I don't think that the level of in-depth knowledge needed is the
>>>>> same
>>>>>>>>> between using a row-oriented internal representation and "Arrow"
>>>>> which
>>>>>>> not
>>>>>>>>> only changes the organization of the data but also introduces a
>> set
>>>>> of
>>>>>>>>> additional mapping choices and concepts.
>>>>>>>>> 
>>>>>>>>> For example, assume that the initial row-oriented data source is a
>>>>>>>>> stream of nested assemblies of structures, lists and maps. The mapping of
>>>>> such
>>>>>> a
>>>>>>>>> stream to Protobuf, JSON, YAML, ... is straightforward because on
>>>>> both
>>>>>>>>> sides the logical representation is exactly the same, the schema
>> is
>>>>>>>>> sometimes optional, the interest of building batches is optional,
>>>>> ...
>>>>>> In
>>>>>>>>> the case of "Arrow" things are different - the schema and the
>>>>> batching
>>>>>>> are
>>>>>>>>> mandatory. The mapping is not necessarily direct and will
>> generally
>>>>> be
>>>>>>> the
>>>>>>>>> result of the combination of several trade-offs (normalized vs.
>>>>>>>>> denormalized representation, mapping influencing the
>> compression
>>>>>>> rate,
>>>>>>>>> queryability with Arrow processors like DataFusion, ...). Note
>> that
>>>>>>> some of
>>>>>>>>> these complexities are not intrinsically linked to the fact that
>> the
>>>>>>> target
>>>>>>>>> format is column oriented. The ZST format (
>>>>>>>>> https://zed.brimdata.io/docs/formats/zst/
>>>>>>>>>  ) for example does not
>>>>>>> require an
>>>>>>>>> explicit schema definition.
>>>>>>>>> 
>>>>>>>>> IMHO, having a library that allows you to easily experiment with
>>>>>>> different
>>>>>>>>> types of mapping (without having to worry about batching,
>>>>>> dictionaries,
>>>>>>>>> schema definition, understanding how lists of structs are
>>>>> represented,
>>>>>>> ...)
>>>>>>>>> and to evaluate the results according to your specific goals has a
>>>>>> value
>>>>>>>>> (especially if your criteria are compression ratio and
>>>>> queryability).
>>>>>> Of
>>>>>>>>> course there is an overhead to such an approach. In some cases, at
>>>>> the
>>>>>>> end
>>>>>>>>> of the process, it will be necessary to manually perform this
>> direct
>>>>>>>>> transformation between a row-oriented XYZ format and "Arrow".
>>>>> However,
>>>>>>> this
>>>>>>>>> effort will be done after a simple experimentation phase to avoid
>>>>>>> changes
>>>>>>>>> in the implementation of the converter which in my opinion is not
>> so
>>>>>>> simple
>>>>>>>>> to implement with the current Arrow API.
>>>>>>>>> 
>>>>>>>>> If the Arrow developer community is not interested in integrating
>>>>> this
>>>>>>>>> proposal, I plan to release two independent libraries (Go and
>> Rust)
>>>>>> that
>>>>>>>>> can be used on top of the standard "Arrow" libraries. This will
>> have
>>>>>> the
>>>>>>>>> advantage of evaluating whether such an approach is able to raise
>> interest
>>>>>>> among
>>>>>>>>> Arrow users.
>>>>>>>>> 
>>>>>>>>> Best,
>>>>>>>>> 
>>>>>>>>> Laurent
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Wed, Jul 27, 2022 at 9:53 PM Jorge Cardoso Leitão <
>>>>>>>>>> jorgecarlei...@gmail.com> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi Laurent,
>>>>>>>>>> 
>>>>>>>>>> I agree that there is a common pattern in converting row-based
>>>>>> formats
>>>>>>> to
>>>>>>>>>> Arrow.
>>>>>>>>>> 
>>>>>>>>>> Imho the difficult part is not to map the storage format to Arrow
>>>>>>>>>> specifically - it is to map the storage format to any in-memory
>>>>> (row-
>>>>>>> or
>>>>>>>>>> columnar- based) format, since it requires in-depth knowledge
>> about
>>>>>>> the 2
>>>>>>>>>> formats (the source format and the target format).
>>>>>>>>>> 
>>>>>>>>>> - Understanding the Arrow API which can be challenging for
>> complex
>>>>>>> cases of
>>>>>>>>>>> rows representing complex objects (list of struct, struct of
>>>>> struct,
>>>>>>>>>> ...).
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> the developer would have the same problem - just shifted around -
>>>>>> they
>>>>>>> now
>>>>>>>>>> need to convert their complex objects to the intermediary
>>>>>>> representation.
>>>>>>>>>> Whether it is more "difficult" or "complex" to learn than Arrow
>> is
>>>>> an
>>>>>>> open
>>>>>>>>>> question, but we would essentially be shifting the problem from
>>>>>>> "learning
>>>>>>>>>> Arrow" to "learning the Intermediate in-memory".
>>>>>>>>>> 
>>>>>>>>>> @Micah Kornfield, as described before my goal is not to define a
>>>>>> memory
>>>>>>>>>>> layout specification but more to define an API and a translation
>>>>>>>>>> mechanism
>>>>>>>>>>> able to take this intermediate representation (list of generic
>>>>>> objects
>>>>>>>>>>> representing the entities to translate) and to convert it into
>> one
>>>>>> or
>>>>>>>>>> more
>>>>>>>>>>> Arrow records.
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> imho a spec of "list of generic objects representing the
>> entities"
>>>>> is
>>>>>>>>>> specified by an in-memory format (not by an API spec).
>>>>>>>>>> 
>>>>>>>>>> A second challenge I anticipate is that in-memory formats
>>>>> inherently
>>>>>>> "own"
>>>>>>>>>> the memory they outline (since by definition they define how this
>>>>>>>>>> memory is laid out). An Intermediate in-memory representation would be
>> no
>>>>>>>>>> different. Since row-based formats usually require at least one
>>>>>>> allocation
>>>>>>>>>> per row (and often more for variable-length types), the
>>>>>> transformation
>>>>>>>>>> (storage format -> row-based in-memory format -> Arrow) incurs a
>>>>>>>>>> significant cost (~2x slower last time I played with this problem
>>>>> in
>>>>>>> JSON
>>>>>>>>>> [1]).
>>>>>>>>>> 
>>>>>>>>>> A third challenge I anticipate is that given that we have 10+
>>>>>>> languages, we
>>>>>>>>>> would eventually need to convert the intermediary representation
>>>>>> across
>>>>>>>>>> languages, which imo just hints that we would need to formalize
>> an
>>>>>>> agnostic
>>>>>>>>>> spec for such representation (so languages agree on its
>>>>>>> representation),
>>>>>>>>>> and thus essentially declare a new (row-based) format.
>>>>>>>>>> 
>>>>>>>>>> (none of this precludes efforts to invent an in-memory row format
>>>>> for
>>>>>>>>>> analytics workloads)
>>>>>>>>>> 
>>>>>>>>>> @Wes McKinney <wesmck...@gmail.com>
>>>>>>>>>> 
>>>>>>>>>> I still think having a canonical in-memory row format (and
>>>>> libraries
>>>>>>>>>>> to transform to and from Arrow columnar format) is a good idea —
>>>>> but
>>>>>>>>>>> there is the risk of ending up in the tar pit of reinventing
>> Avro.
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> afaik Avro does not have O(1) random access to either its rows or
>>>>>>> columns
>>>>>>>>>> - records are concatenated back to back, every record's column is
>>>>>>>>>> concatenated back to back within a record, and there is no
>> indexing
>>>>>>>>>> information on how to access a particular row or column. There
>> are
>>>>>>> blocks
>>>>>>>>>> of rows that reduce the cost of accessing large offsets, but imo
>> it
>>>>>> is
>>>>>>> far
>>>>>>>>>> from the O(1) offered by Arrow (and expected by analytics
>>>>> workloads).
>>>>>>>>>> 
>>>>>>>>>> [1] 
>>>>>>>>>> https://github.com/jorgecarleitao/arrow2/pull/1024
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Jorge
>>>>>>>>>> 
>>>>>>>>>> On Thu, Jul 28, 2022 at 5:38 AM Laurent Quérel <
>>>>>>> laurent.que...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Let me clarify the proposal a bit before replying to the previous
>>>>>>>>>>> feedback.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> It seems to me that the process of converting a row-oriented
>> data
>>>>>>> source
>>>>>>>>>>> (row = set of fields or something more hierarchical) into an
>> Arrow
>>>>>>> record
>>>>>>>>>>> repeatedly raises the same challenges. A developer who must
>>>>> perform
>>>>>>> this
>>>>>>>>>>> kind of transformation is confronted with the following
>> questions
>>>>>> and
>>>>>>>>>>> problems:
>>>>>>>>>>> 
>>>>>>>>>>> - Understanding the Arrow API which can be challenging for
>> complex
>>>>>>> cases
>>>>>>>>>> of
>>>>>>>>>>> rows representing complex objects (list of struct, struct of
>>>>> struct,
>>>>>>>>>> ...).
>>>>>>>>>>> 
>>>>>>>>>>> - Decide which Arrow schema(s) will correspond to your data
>>>>> source.
>>>>>> In
>>>>>>>>>> some
>>>>>>>>>>> complex cases it can be advantageous to translate the same
>>>>>>> row-oriented
>>>>>>>>>>> data source into several Arrow schemas (e.g. OpenTelemetry data
>>>>>>>>>> sources).
>>>>>>>>>>> 
>>>>>>>>>>> - Decide on the encoding of the columns to make the most of the
>>>>>>>>>>> column-oriented format and thus increase the compression rate
>>>>> (e.g.
>>>>>>>>>> define
>>>>>>>>>>> the columns that should be represented as dictionaries; see the sketch below).
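>>>>>>>>>>> 
>>>>>>>>>>> For that last point, a minimal Go sketch of what declaring a
>>>>>>>>>>> dictionary-encoded column looks like, assuming the arrow/go
>>>>>>>>>>> DictionaryType (arrow imported from github.com/apache/arrow/go/vN/arrow;
>>>>>>>>>>> exact names vary across versions, and the field names are invented):
>>>>>>>>>>> 
>>>>>>>>>>>   schema := arrow.NewSchema([]arrow.Field{
>>>>>>>>>>>       {Name: "timestamp", Type: arrow.PrimitiveTypes.Int64},
>>>>>>>>>>>       // Low-cardinality string column: dictionary-encode it so the
>>>>>>>>>>>       // batch stores small integer indices plus a small value table.
>>>>>>>>>>>       {Name: "severity", Type: &arrow.DictionaryType{
>>>>>>>>>>>           IndexType: arrow.PrimitiveTypes.Uint16,
>>>>>>>>>>>           ValueType: arrow.BinaryTypes.String,
>>>>>>>>>>>       }},
>>>>>>>>>>>   }, nil)
>>>>>>>>>>> 
>>>>>>>>>>> Part of what is proposed here is automating this choice (and the
>>>>>>>>>>> multi-column sort order) from the observed data instead of by hand.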
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> From experience, I can attest that this process is usually
>>>>> iterative.
>>>>>>> For
>>>>>>>>>>> non-trivial data sources, arriving at the Arrow representation
>>>>> that
>>>>>>>>>> offers
>>>>>>>>>>> the best compression ratio and is still perfectly usable and
>>>>>> queryable
>>>>>>>>>> is a
>>>>>>>>>>> long and tedious process.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> I see two approaches to ease this process and consequently
>>>>> increase
>>>>>>> the
>>>>>>>>>>> adoption of Apache Arrow:
>>>>>>>>>>> 
>>>>>>>>>>> - Definition of a canonical in-memory row format specification
>>>>> that
>>>>>>> every
>>>>>>>>>>> row-oriented data source provider can progressively adopt to get
>>>>> an
>>>>>>>>>>> automatic translation into the Arrow format.
>>>>>>>>>>> 
>>>>>>>>>>> - Definition of an integration library allowing to map any
>>>>>>> row-oriented
>>>>>>>>>>> source into a generic row-oriented source understood by the
>>>>>>> converter. It
>>>>>>>>>>> is not about defining a unique in-memory format but more about
>>>>>>> defining a
>>>>>>>>>>> standard API to represent row-oriented data.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> In my opinion these two approaches are complementary. The first
>>>>>> option
>>>>>>>>>> is a
>>>>>>>>>>> long-term approach targeting the data providers directly, which will
>>>>>>>>>>> require agreeing on this generic row-oriented format and whose
>>>>>>>>>>> adoption will take more or less time. The second approach does not directly
>>>>>>> require
>>>>>>>>>>> the collaboration of data source providers but allows an
>>>>>> "integrator"
>>>>>>> to
>>>>>>>>>>> perform this transformation painlessly with potentially several
>>>>>>>>>>> representation trials to achieve the best results in his
>> context.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> The current proposal is an implementation of the second approach,
>>>>>>>>>>> i.e. an API that maps a row-oriented source XYZ into an intermediate
>>>>>>>>>>> row-oriented representation understood mechanically by the translator.
>>>>>>>>>>> This translator also adds a series of optimizations to make the most
>>>>>>>>>>> of the Arrow format.
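>>>>>>>>>>> 
>>>>>>>>>>> To make that concrete, here is a purely hypothetical Go sketch of what
>>>>>>>>>>> the surface of such an intermediate representation could look like (all
>>>>>>>>>>> names are invented for illustration and are not the actual pkg/air types):
>>>>>>>>>>> 
>>>>>>>>>>>   // A generic, row-oriented value tree mirroring the source's nesting.
>>>>>>>>>>>   type Value interface{ isValue() }
>>>>>>>>>>> 
>>>>>>>>>>>   type I64 int64
>>>>>>>>>>>   type Str string
>>>>>>>>>>>   type List []Value
>>>>>>>>>>>   type Struct map[string]Value
>>>>>>>>>>> 
>>>>>>>>>>>   func (I64) isValue()    {}
>>>>>>>>>>>   func (Str) isValue()    {}
>>>>>>>>>>>   func (List) isValue()   {}
>>>>>>>>>>>   func (Struct) isValue() {}
>>>>>>>>>>> 
>>>>>>>>>>>   // The translator would take these generic rows, derive the Arrow
>>>>>>>>>>>   // schema(s), apply dictionary and sorting optimizations, batch
>>>>>>>>>>>   // compatible rows, and emit Arrow records, e.g.:
>>>>>>>>>>>   //   func Translate(rows []Struct) ([]arrow.Record, error)
>>>>>>>>>>> 
>>>>>>>>>>> A developer integrating a format XYZ only has to map XYZ rows onto this
>>>>>>>>>>> value tree; schema derivation, batching and encoding decisions stay
>>>>>>>>>>> inside the translator.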
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> You can find multiple examples of such a transformation in the
>>>>>>> following
>>>>>>>>>>> examples:
>>>>>>>>>>> 
>>>>>>>>>>> -
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>> 
>> https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/trace/otlp_to_arrow.go
>>>>>>>>>>> this example converts OTEL trace entities into their
>>>>> corresponding
>>>>>>>>>> Arrow
>>>>>>>>>>> IR. At the end of this conversion the method returns a
>> collection
>>>>>> of
>>>>>>>>>>> Arrow
>>>>>>>>>>> Records.
>>>>>>>>>>> - A more complex example can be found here
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>> 
>> https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/metrics/otlp_to_arrow.go
>>>>>>>>>>> .
>>>>>>>>>>> In this example a stream of OTEL univariate row-oriented metrics is
>>>>>>>>>>> translated into multivariate row-oriented metrics and then
>>>>>>>>>>> automatically translated into Arrow Records.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> In these two examples, the creation of dictionaries and
>>>>> multi-column
>>>>>>>>>>> sorting is automatically done by the framework and the developer
>>>>>>> doesn’t
>>>>>>>>>>> have to worry about the definition of Arrow schemas.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Now let's get to the answers.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> @David Lee, I don't think Parquet and from_pylist() solve this
>>>>>> problem
>>>>>>>>>>> particularly well. Parquet is a column-oriented data file format
>>>>> and
>>>>>>>>>>> doesn't really help to perform this transformation. The Python
>>>>>> method
>>>>>>> is
>>>>>>>>>>> relatively limited and language specific.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> @Micah Kornfield, as described before my goal is not to define a
>>>>>>> memory
>>>>>>>>>>> layout specification but more to define an API and a translation
>>>>>>>>>> mechanism
>>>>>>>>>>> able to take this intermediate representation (list of generic
>>>>>> objects
>>>>>>>>>>> representing the entities to translate) and to convert it into
>> one
>>>>>> or
>>>>>>>>>> more
>>>>>>>>>>> Arrow records.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> @Wes McKinney, If I interpret your answer correctly, I think you
>>>>> are
>>>>>>>>>>> describing the option 1 mentioned above. Like you I think it is
>> an
>>>>>>>>>>> interesting approach although complementary to the one I
>> propose.
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Looking forward to your feedback.
>>>>>>>>>>> 
>>>>>>>>>>> On Wed, Jul 27, 2022 at 4:19 PM Wes McKinney <
>> wesmck...@gmail.com
>>>>>> 
>>>>>>>>>> wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> We had an e-mail thread about this in 2018
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>> https://lists.apache.org/thread/35pn7s8yzxozqmgx53ympxg63vjvggvm
>>>>>>>>>>>> 
>>>>>>>>>>>> I still think having a canonical in-memory row format (and
>>>>>> libraries
>>>>>>>>>>>> to transform to and from Arrow columnar format) is a good idea
>> —
>>>>>> but
>>>>>>>>>>>> there is the risk of ending up in the tar pit of reinventing
>>>>> Avro.
>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>>> On Wed, Jul 27, 2022 at 5:11 PM Micah Kornfield <
>>>>>>> emkornfi...@gmail.com
>>>>>>>>>>> 
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Are there more details on what exactly an "Arrow Intermediate
>>>>>>>>>>>>> Representation (AIR)" is?  We've talked in the past about
>> maybe
>>>>>>>>>> having
>>>>>>>>>>> a
>>>>>>>>>>>>> memory layout specification for row-based data as well as
>> column
>>>>>>>>>> based
>>>>>>>>>>>>> data.  There was also a recent attempt at least in C++ to try
>> to
>>>>>>>>>> build
>>>>>>>>>>>>> utilities to do these pivots but it was decided that it didn't
>>>>> add
>>>>>>>>>> much
>>>>>>>>>>>>> utility (it was added as a comprehensive example).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Micah
>>>>>>>>>>>>> 
>>>>>>>>>>>>> On Tue, Jul 26, 2022 at 2:26 PM Laurent Quérel <
>>>>>>>>>>> laurent.que...@gmail.com
>>>>>>>>>>>>> 
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>> 
>>>>>>>>>>>>>> In the context of this OTEP
>>>>>>>>>>>>>> <
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>> 
>> https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md
>>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> (OpenTelemetry
>>>>>>>>>>>>>> Enhancement Proposal) I developed an integration layer on top
>>>>> of
>>>>>>>>>>> Apache
>>>>>>>>>>>>>> Arrow (Go and Rust) to *facilitate the translation of
>>>>> row-oriented
>>>>>>>>>>> data
>>>>>>>>>>>>>> streams into an Arrow-based columnar representation*. In this
>>>>>>>>>>> particular
>>>>>>>>>>>>>> case the goal was to translate all OpenTelemetry entities
>>>>>> (metrics,
>>>>>>>>>>>> logs,
>>>>>>>>>>>>>> or traces) into Apache Arrow records. These entities can be
>>>>> quite
>>>>>>>>>>>> complex
>>>>>>>>>>>>>> and their corresponding Arrow schema must be defined on the
>>>>> fly.
>>>>>>>>>> IMO,
>>>>>>>>>>>> this
>>>>>>>>>>>>>> approach is not specific to my specific needs but could be
>> used
>>>>>> in
>>>>>>>>>>> many
>>>>>>>>>>>>>> other contexts where there is a need to simplify the
>>>>> integration
>>>>>>>>>>>> between a
>>>>>>>>>>>>>> row-oriented source of data and Apache Arrow. The trade-off
>> is
>>>>> to
>>>>>>>>>>> have
>>>>>>>>>>>> to
>>>>>>>>>>>>>> perform the additional step of conversion to the intermediate
>>>>>>>>>>>>>> representation, but this transformation does not require to
>>>>>>>>>>> understand
>>>>>>>>>>>> the
>>>>>>>>>>>>>> arcana of the Arrow format and allows to potentially benefit
>>>>> from
>>>>>>>>>>>>>> functionalities such as the encoding of the dictionary "for
>>>>>> free",
>>>>>>>>>>> the
>>>>>>>>>>>>>> automatic generation of Arrow schemas, the batching, the
>>>>>>>>>> multi-column
>>>>>>>>>>>>>> sorting, etc.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I know that JSON can be used as a kind of intermediate
>>>>>>>>>> representation
>>>>>>>>>>>> in
>>>>>>>>>>>>>> the context of Arrow with some language specific
>>>>> implementation.
>>>>>>>>>>>> Current
>>>>>>>>>>>>>> JSON integrations are insufficient to cover the most complex
>>>>>>>>>>> scenarios
>>>>>>>>>>>> and
>>>>>>>>>>>>>> are not standardized; e.g. support for most of the Arrow data
>>>>>> types,
>>>>>>>>>>>> various
>>>>>>>>>>>>>> optimizations (string|binary dictionaries, multi-column
>>>>> sorting),
>>>>>>>>>>>> batching,
>>>>>>>>>>>>>> integration with Arrow IPC, compression ratio optimization,
>> ...
>>>>>> The
>>>>>>>>>>>> object
>>>>>>>>>>>>>> of this proposal is to progressively cover these gaps.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> I am looking to see if the community would be interested in
>>>>> such
>>>>>> a
>>>>>>>>>>>>>> contribution. Above are some additional details on the
>> current
>>>>>>>>>>>>>> implementation. All feedback is welcome.
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 10K ft overview of the current implementation (a rough usage sketch follows the steps below):
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> 1. Developers convert their row-oriented stream into records
>>>>>>>>>> based
>>>>>>>>>>>> on
>>>>>>>>>>>>>> the Arrow Intermediate Representation (AIR). At this stage
>> the
>>>>>>>>>>>>>> translation
>>>>>>>>>>>>>> can be quite mechanical but if needed developers can decide
>>>>> for
>>>>>>>>>>>> example
>>>>>>>>>>>>>> to
>>>>>>>>>>>>>> translate a map into a struct if that makes sense for them.
>>>>> The
>>>>>>>>>>>> current
>>>>>>>>>>>>>> implementation supports the following Arrow data types: bool,
>>>>> all
>>>>>>>>>>>> uints,
>>>>>>>>>>>>>> all
>>>>>>>>>>>>>> ints, all floats, string, binary, list of any supported
>> types,
>>>>>>>>>> and
>>>>>>>>>>>>>> struct
>>>>>>>>>>>>>> of any supported types. Additional Arrow types could be added
>>>>>>>>>>>>>> progressively.
>>>>>>>>>>>>>> 2. The row-oriented record (i.e. AIR record) is then added to
>>>>> a
>>>>>>>>>>>>>> RecordRepository. This repository will first compute a schema
>>>>>>>>>>>> signature
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>> will route the record to a RecordBatcher based on this
>>>>>>>>>> signature.
>>>>>>>>>>>>>> 3. The RecordBatcher is responsible for collecting all the
>>>>>>>>>>>> compatible
>>>>>>>>>>>>>> AIR records and, upon request, the "batcher" is able to build
>>>>> an
>>>>>>>>>>>> Arrow
>>>>>>>>>>>>>> Record representing a batch of compatible inputs. In the
>>>>> current
>>>>>>>>>>>>>> implementation, the batcher is able to convert string columns
>>>>> to
>>>>>>>>>>>>>> dictionary
>>>>>>>>>>>>>> based on a configuration. Another configuration allows to
>>>>>>>>>> evaluate
>>>>>>>>>>>> which
>>>>>>>>>>>>>> columns should be sorted to optimize the compression ratio.
>>>>> The
>>>>>>>>>>> same
>>>>>>>>>>>>>> optimization process could be applied to binary columns.
>>>>>>>>>>>>>> 4. Steps 1 through 3 can be repeated on the same
>>>>>>>>>> RecordRepository
>>>>>>>>>>>>>> instance to build new sets of arrow record batches.
>> Subsequent
>>>>>>>>>>>>>> iterations
>>>>>>>>>>>>>> will be slightly faster due to different techniques used
>> (e.g.
>>>>>>>>>>>> object
>>>>>>>>>>>>>> reuse, dictionary reuse and sorting, ...)
>>>>>>>>>>>>>> 
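>>>>>>>>>>>>>> To make the flow concrete, here is a rough usage sketch in Go. All
>>>>>>>>>>>>>> names below (air.NewRecord, RecordRepository, BuildRecords, ...)
>>>>>>>>>>>>>> are invented for illustration and are not necessarily the actual
>>>>>>>>>>>>>> pkg/air API:
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>   // Step 1: map one source row onto the intermediate representation.
>>>>>>>>>>>>>>   rec := air.NewRecord()                    // hypothetical constructor
>>>>>>>>>>>>>>   rec.StringField("trace_id", "abc123")
>>>>>>>>>>>>>>   rec.I64Field("duration_ms", 42)
>>>>>>>>>>>>>>   rec.StructField("resource", air.Struct{   // nested types supported
>>>>>>>>>>>>>>       "service.name": air.String("checkout"),
>>>>>>>>>>>>>>   })
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>   // Step 2: the repository computes a schema signature and routes
>>>>>>>>>>>>>>   // the record to the batcher owning that signature.
>>>>>>>>>>>>>>   repo := air.NewRecordRepository(cfg)      // cfg: dictionary/sorting options
>>>>>>>>>>>>>>   repo.AddRecord(rec)
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>   // Step 3: on request, each batcher emits one Arrow record batch
>>>>>>>>>>>>>>   // per signature, with dictionary encoding and multi-column sorting
>>>>>>>>>>>>>>   // applied according to the configuration.
>>>>>>>>>>>>>>   records, err := repo.BuildRecords()       // one arrow.Record per signature
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>>   // Step 4: keep feeding the same repository; later iterations can
>>>>>>>>>>>>>>   // reuse objects, dictionaries and sort decisions.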
>>>>>>>>>>>>>> 
>>>>>>>>>>>>>> The current Go implementation
>>>>>>>>>>>>>> <https://github.com/lquerel/otel-arrow-adapter> (WIP) is
>>>>>> currently
>>>>>>>>>>>> part of
>>>>>>>>>>>>>> this repo (see pkg/air package). If the community is
>>>>> interested,
>>>>>> I
>>>>>>>>>>>> could do
>>>>>>>>>>>>>> a PR in the Arrow Go and Rust sub-projects.
>>>>>>>>>>>>>> 
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> --
>>>>>>>>>>> Laurent Quérel
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> --
>>>>>>>>> Laurent Quérel
>>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Laurent Quérel
>>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Laurent Quérel
>>> 
>>> 
>> 
>> --
>> Laurent Quérel
>> 

