Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-29 Thread Micah Kornfield
Please note that this message and the previous one from the author violate our
Code of Conduct [1], specifically "Do not insult or put down other
participants."  Please try to be professional in communications and focus on
the technical issues at hand.

[1] https://www.apache.org/foundation/policies/conduct.html



Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-29 Thread Andrew Lamb
There has been a substantial amount of effort put into the arrow-rs Rust
Parquet implementation to handle the corner cases of nested structs and
lists, and all the fun of various levels of nullability.

Do let us know if you happen to try writing nested structures directly to
Parquet and run into issues.

Andrew

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Laurent Quérel
Far be it from me to think that I know more than Jorge or Wes on this
subject. Sorry if my post gives that perception; that is clearly not my
intention. I'm just trying to defend the idea that, when designing this kind
of transformation, it is useful to have a library for testing several mappings
and evaluating them before writing a more direct implementation if the
performance is not there.

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Benjamin Blodgett
He was trying to nicely say he knows way more than you, and your ideas will
result in a low-performance scheme no one will use in production AI/machine
learning.

Sent from my iPhone

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Benjamin Blodgett
I think Jorge's opinion is that of an expert, and him being humble is just
being tactful.  Probably listen to Jorge on performance and architecture, even
over Wes, as he's contributed more than anyone else and knows the bleeding edge
of low-level performance work more than anyone.

Sent from my iPhone

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Laurent Quérel
Hi Jorge

I don't think the level of in-depth knowledge needed is the same between
using a row-oriented internal representation and "Arrow", which not only
changes the organization of the data but also introduces a set of additional
mapping choices and concepts.

For example, assume that the initial row-oriented data source is a stream of
nested assemblies of structures, lists, and maps. Mapping such a stream to
Protobuf, JSON, YAML, ... is straightforward because the logical representation
is exactly the same on both sides, the schema is sometimes optional, and
batching is optional. In the case of "Arrow" things are different: the schema
and the batching are mandatory. The mapping is not necessarily direct and will
generally be the result of a combination of several trade-offs (normalized vs.
denormalized representation, mapping choices that influence the compression
ratio, queryability with Arrow processors like DataFusion, ...). Note that
some of these complexities are not intrinsically linked to the fact that the
target format is column-oriented. The ZST format (
https://zed.brimdata.io/docs/formats/zst/), for example, does not require an
explicit schema definition.
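
As a rough illustration of these mapping choices, here is a minimal sketch
(assuming the Go Arrow module github.com/apache/arrow/go/v9; the field names
are invented for this example) of two very different, equally valid Arrow
schemas for the same nested row-oriented record:

    package main

    import (
        "fmt"

        "github.com/apache/arrow/go/v9/arrow"
    )

    // Two candidate mappings for the same row-oriented record:
    // {resource: {attrs: [{key, value}]}, spans: [{name, start}]}.
    func main() {
        // Option 1: denormalize - one Arrow row per span, resource
        // attributes repeated on every row.
        flat := arrow.NewSchema([]arrow.Field{
            {Name: "resource_attr_key", Type: arrow.BinaryTypes.String},
            {Name: "resource_attr_value", Type: arrow.BinaryTypes.String},
            {Name: "span_name", Type: arrow.BinaryTypes.String},
            {Name: "span_start", Type: arrow.PrimitiveTypes.Uint64},
        }, nil)

        // Option 2: keep the nesting - one Arrow row per trace, spans as
        // list<struct>.
        nested := arrow.NewSchema([]arrow.Field{
            {Name: "resource", Type: arrow.StructOf(
                arrow.Field{Name: "attrs", Type: arrow.ListOf(arrow.StructOf(
                    arrow.Field{Name: "key", Type: arrow.BinaryTypes.String},
                    arrow.Field{Name: "value", Type: arrow.BinaryTypes.String},
                ))},
            )},
            {Name: "spans", Type: arrow.ListOf(arrow.StructOf(
                arrow.Field{Name: "name", Type: arrow.BinaryTypes.String},
                arrow.Field{Name: "start", Type: arrow.PrimitiveTypes.Uint64},
            ))},
        }, nil)

        fmt.Println(flat)
        fmt.Println(nested)
    }

Neither choice is imposed by Arrow itself; which one compresses better or is
easier to query only becomes clear after trying both.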

IMHO, having a library that lets you easily experiment with different types of
mapping (without having to worry about batching, dictionaries, schema
definition, understanding how lists of structs are represented, ...) and
evaluate the results against your specific goals has value (especially if your
criteria are compression ratio and queryability). Of course there is an
overhead to such an approach. In some cases, at the end of the process, it
will be necessary to implement the direct transformation between a
row-oriented XYZ format and "Arrow" by hand. However, that effort comes after
a cheap experimentation phase, and it avoids repeated rewrites of a converter
that, in my opinion, is not so simple to implement with the current Arrow API.

If the Arrow developer community is not interested in integrating this
proposal, I plan to release two independent libraries (Go and Rust) that
can be used on top of the standard "Arrow" libraries. This will at least make
it possible to evaluate whether such an approach raises interest among Arrow
users.

Best,

Laurent



Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-27 Thread Jorge Cardoso Leitão
Hi Laurent,

I agree that there is a common pattern in converting row-based formats to
Arrow.

Imho the difficult part is not mapping the storage format to Arrow
specifically - it is mapping the storage format to any in-memory (row- or
columnar-based) format, since it requires in-depth knowledge of the two
formats (the source format and the target format).

- Understanding the Arrow API which can be challenging for complex cases of
> rows representing complex objects (list of struct, struct of struct, ...).
>

the developer would have the same problem - just shifted around - they now
need to convert their complex objects to the intermediate representation.
Whether it is more "difficult" or "complex" to learn than Arrow is an open
question, but we would essentially be shifting the problem from "learning
Arrow" to "learning the intermediate in-memory representation".

@Micah Kornfield, as described before my goal is not to define a memory
> layout specification but more to define an API and a translation mechanism
> able to take this intermediate representation (list of generic objects
> representing the entities to translate) and to convert it into one or more
> Arrow records.
>

imho a spec of "a list of generic objects representing the entities" is an
in-memory format specification (not an API spec).

A second challenge I anticipate is that in-memory formats inherently "own"
the memory they describe (since by definition they define how this memory
is laid out). An intermediate in-memory representation would be no
different. Since row-based formats usually require at least one allocation
per row (and often more for variable-length types), the transformation
(storage format -> row-based in-memory format -> Arrow) incurs a
significant cost (~2x slower the last time I played with this problem in JSON
[1]).
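
A minimal sketch of the two paths being contrasted here (Go, assuming the
github.com/apache/arrow/go/v9 module; the generic row type below is only an
illustration, not the proposed AIR):

    package main

    import (
        "fmt"

        "github.com/apache/arrow/go/v9/arrow"
        "github.com/apache/arrow/go/v9/arrow/array"
        "github.com/apache/arrow/go/v9/arrow/memory"
    )

    var schema = arrow.NewSchema([]arrow.Field{
        {Name: "name", Type: arrow.BinaryTypes.String},
        {Name: "value", Type: arrow.PrimitiveTypes.Float64},
    }, nil)

    // Path 1: source values are appended straight into the Arrow builders.
    func direct(names []string, values []float64) int64 {
        b := array.NewRecordBuilder(memory.NewGoAllocator(), schema)
        defer b.Release()
        for i := range names {
            b.Field(0).(*array.StringBuilder).Append(names[i])
            b.Field(1).(*array.Float64Builder).Append(values[i])
        }
        rec := b.NewRecord()
        defer rec.Release()
        return rec.NumRows()
    }

    // Path 2: source values are first materialized as generic rows, then
    // appended. Each row is a separate allocation (plus boxing of its
    // values), which is the per-row cost mentioned above.
    func viaRows(names []string, values []float64) int64 {
        rows := make([]map[string]interface{}, len(names))
        for i := range names {
            rows[i] = map[string]interface{}{"name": names[i], "value": values[i]}
        }
        b := array.NewRecordBuilder(memory.NewGoAllocator(), schema)
        defer b.Release()
        for _, r := range rows {
            b.Field(0).(*array.StringBuilder).Append(r["name"].(string))
            b.Field(1).(*array.Float64Builder).Append(r["value"].(float64))
        }
        rec := b.NewRecord()
        defer rec.Release()
        return rec.NumRows()
    }

    func main() {
        fmt.Println(direct([]string{"a", "b"}, []float64{1, 2}))
        fmt.Println(viaRows([]string{"a", "b"}, []float64{1, 2}))
    }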

A third challenge I anticipate is that, given that we have 10+ languages, we
would eventually need to convert the intermediate representation across
languages, which imo just hints that we would need to formalize a
language-agnostic spec for such a representation (so languages agree on it),
and thus essentially declare a new (row-based) format.

(none of this precludes efforts to invent an in-memory row format for
analytics workloads)

@Wes McKinney 

I still think having a canonical in-memory row format (and libraries
> to transform to and from Arrow columnar format) is a good idea — but
> there is the risk of ending up in the tar pit of reinventing Avro.
>

afaik Avro does not have O(1) random access to either its rows or its columns
- records are concatenated back to back, each record's columns are
concatenated back to back within a record, and there is no indexing
information for accessing a particular row or column. There are blocks
of rows that reduce the cost of seeking to large offsets, but imo it is far
from the O(1) offered by Arrow (and expected by analytics workloads).

[1] https://github.com/jorgecarleitao/arrow2/pull/1024

Best,
Jorge

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-27 Thread Laurent Quérel
Let me clarify the proposal a bit before replying to the previous feedback.



It seems to me that the process of converting a row-oriented data source
(row = set of fields or something more hierarchical) into an Arrow record
repeatedly raises the same challenges. A developer who must perform this
kind of transformation is confronted with the following questions and
problems:

- Understanding the Arrow API, which can be challenging for complex cases of
rows representing complex objects (list of struct, struct of struct, ...);
see the sketch after this list.

- Deciding which Arrow schema(s) will correspond to your data source. In some
complex cases it can be advantageous to translate the same row-oriented
data source into several Arrow schemas (e.g. OpenTelemetry data sources).

- Deciding on the encoding of the columns to make the most of the
column-oriented format and thus increase the compression ratio (e.g. define
the columns that should be represented as dictionaries).
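
To make the first point concrete, here is a minimal sketch (assuming the Go
Arrow module github.com/apache/arrow/go/v9; column and field names are
invented) of appending a single row with one list<struct> column using the
plain Arrow builder API:

    package main

    import (
        "fmt"

        "github.com/apache/arrow/go/v9/arrow"
        "github.com/apache/arrow/go/v9/arrow/array"
        "github.com/apache/arrow/go/v9/arrow/memory"
    )

    func main() {
        // One column: attrs list<struct<key: string, value: int64>>.
        schema := arrow.NewSchema([]arrow.Field{
            {Name: "attrs", Type: arrow.ListOf(arrow.StructOf(
                arrow.Field{Name: "key", Type: arrow.BinaryTypes.String},
                arrow.Field{Name: "value", Type: arrow.PrimitiveTypes.Int64},
            ))},
        }, nil)

        b := array.NewRecordBuilder(memory.NewGoAllocator(), schema)
        defer b.Release()

        // Appending the row {attrs: [{"a", 1}, {"b", 2}]} means walking
        // three nested builders in the right order.
        lb := b.Field(0).(*array.ListBuilder)
        sb := lb.ValueBuilder().(*array.StructBuilder)
        lb.Append(true) // start the list for this row
        for _, kv := range []struct {
            k string
            v int64
        }{{"a", 1}, {"b", 2}} {
            sb.Append(true) // start one struct element
            sb.FieldBuilder(0).(*array.StringBuilder).Append(kv.k)
            sb.FieldBuilder(1).(*array.Int64Builder).Append(kv.v)
        }

        rec := b.NewRecord()
        defer rec.Release()
        fmt.Println(rec.NumRows(), rec.Column(0))
    }

Each additional level of nesting adds another builder to keep in sync, which
is exactly the part the intermediate representation is meant to hide.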



From experience, I can attest that this process is usually iterative. For
non-trivial data sources, arriving at the Arrow representation that offers
the best compression ratio while remaining usable and queryable is a long and
tedious process.



I see two approaches to ease this process and consequently increase the
adoption of Apache Arrow:

- Definition of a canonical in-memory row format specification that every
row-oriented data source provider can progressively adopt to get an
automatic translation into the Arrow format.

- Definition of an integration library that maps any row-oriented source into
a generic row-oriented source understood by the converter. It is not about
defining a unique in-memory format but about defining a standard API to
represent row-oriented data (a rough sketch follows this list).
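
For the second approach, a rough sketch of what such a standard API could look
like (Go; all names here are invented for illustration and are not part of any
existing library):

    // Package rowapi sketches a visitor-style contract that a row-oriented
    // source could implement so a generic converter can walk its fields
    // without knowing its concrete in-memory layout.
    package rowapi

    // Value is any scalar, a []Value (list), or a map[string]Value (struct).
    type Value interface{}

    // Row is implemented by any row-oriented data source the converter
    // can consume.
    type Row interface {
        // FieldNames lists the top-level fields of this row.
        FieldNames() []string
        // Value returns the value of one field.
        Value(name string) Value
    }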



In my opinion these two approaches are complementary. The first option is a
long-term approach targeting the data providers directly; it requires
agreeing on this generic row-oriented format, and its adoption will take more
or less time. The second approach does not directly require the collaboration
of data source providers, but allows an "integrator" to perform this
transformation painlessly, with potentially several representation trials to
achieve the best results in their context.



The current proposal is an implementation of the second approach, i.e. an
API that maps a row-oriented source XYZ into an intermediate row-oriented
representation understood mechanically by the translator. This translator
also adds a series of optimizations to make the most of the Arrow format.



You can find multiple examples of such a transformation here:

   -
https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/trace/otlp_to_arrow.go
   converts OTEL trace entities into their corresponding Arrow intermediate
   representation. At the end of this conversion the method returns a
   collection of Arrow Records.
   - A more complex example can be found at
https://github.com/lquerel/otel-arrow-adapter/blob/main/pkg/otel/metrics/otlp_to_arrow.go.
   In this example a stream of OTEL univariate row-oriented metrics is
   translated into multivariate row-oriented metrics and then automatically
   converted into Arrow Records.



In these two examples, the creation of dictionaries and multi-column
sorting is automatically done by the framework and the developer doesn’t
have to worry about the definition of Arrow schemas.
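
To give an idea of the intended workflow, here is a hypothetical usage sketch
(Go; the type and method names below are invented for illustration and may not
match the actual pkg/air API):

    package main

    import "fmt"

    // The types below are stand-ins that only sketch the intended call shape.
    type Record struct{ fields map[string]interface{} }

    func NewRecord() *Record { return &Record{fields: map[string]interface{}{}} }

    func (r *Record) StringField(name, v string) *Record {
        r.fields[name] = v
        return r
    }

    func (r *Record) U64Field(name string, v uint64) *Record {
        r.fields[name] = v
        return r
    }

    type RecordRepository struct{ pending []*Record }

    // AddRecord routes the record to a batcher keyed by its schema signature
    // (simplified here to a single bucket).
    func (rr *RecordRepository) AddRecord(r *Record) {
        rr.pending = append(rr.pending, r)
    }

    // BuildRecords would return one Arrow Record per signature; here it only
    // reports how many rows the batch would contain.
    func (rr *RecordRepository) BuildRecords() int { return len(rr.pending) }

    func main() {
        repo := &RecordRepository{}
        for _, span := range []struct {
            name  string
            start uint64
        }{{"GET /", 1}, {"GET /items", 2}} {
            repo.AddRecord(NewRecord().
                StringField("name", span.name).
                U64Field("start", span.start))
        }
        fmt.Println("rows batched:", repo.BuildRecords())
    }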



Now let's get to the answers.



@David Lee, I don't think Parquet and from_pylist() solve this problem
particularly well. Parquet is a column-oriented data file format and doesn't
really help with this transformation. The Python method is relatively limited
and language-specific.



@Micah Kornfield, as described before, my goal is not to define a memory
layout specification but rather to define an API and a translation mechanism
able to take this intermediate representation (a list of generic objects
representing the entities to translate) and convert it into one or more
Arrow records.



@Wes McKinney, if I interpret your answer correctly, I think you are
describing option 1 mentioned above. Like you, I think it is an interesting
approach, although complementary to the one I propose.



Looking forward to your feedback.

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-27 Thread Wes McKinney
We had an e-mail thread about this in 2018

https://lists.apache.org/thread/35pn7s8yzxozqmgx53ympxg63vjvggvm

I still think having a canonical in-memory row format (and libraries
to transform to and from Arrow columnar format) is a good idea — but
there is the risk of ending up in the tar pit of reinventing Avro.


Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-27 Thread Micah Kornfield
Are there more details on what exactly an "Arrow Intermediate
Representation (AIR)" is?  We've talked in the past about maybe having a
memory layout specification for row-based data as well as column-based
data.  There was also a recent attempt, at least in C++, to build
utilities to do these pivots, but it was decided that it didn't add much
utility (it was added as a comprehensive example).

Thanks,
Micah

RE: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-27 Thread Lee, David
I think this has been addressed for both Parquet and Python to handle records
including nested structures. Not sure about Rust and Go.

[C++][Parquet] Read and write nested Parquet data with a mix of struct and list 
nesting levels

https://issues.apache.org/jira/browse/ARROW-1644

[Python] Add from_pylist() and to_pylist() to pyarrow.Table to convert list of 
records

https://issues.apache.org/jira/browse/ARROW-6001?focusedCommentId=16891152&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16891152


-----Original Message-----
From: Laurent Quérel  
Sent: Tuesday, July 26, 2022 2:25 PM
To: dev@arrow.apache.org
Subject: [proposal] Arrow Intermediate Representation to facilitate the 
transformation of row-oriented data sources into Arrow columnar representation


In the context of this OTEP
(https://github.com/lquerel/oteps/blob/main/text/0156-columnar-encoding.md, an
OpenTelemetry Enhancement Proposal) I developed an integration layer on top of
Apache Arrow (Go and Rust) to *facilitate the translation of row-oriented data
streams into an Arrow-based columnar representation*. In this particular case
the goal was to translate all OpenTelemetry entities (metrics, logs, or traces)
into Apache Arrow records. These entities can be quite complex and their
corresponding Arrow schema must be defined on the fly. IMO, this approach is
not specific to my needs but could be used in many other contexts where there
is a need to simplify the integration between a row-oriented source of data and
Apache Arrow. The trade-off is having to perform the additional step of
conversion to the intermediate representation, but this transformation does not
require understanding the arcana of the Arrow format and potentially provides
functionality such as dictionary encoding "for free", automatic generation of
Arrow schemas, batching, multi-column sorting, etc.


I know that JSON can be used as a kind of intermediate representation in the
context of Arrow, with some language-specific implementations. Current JSON
integrations are insufficient to cover the most complex scenarios and are not
standardized; e.g. support for most of the Arrow data types, various
optimizations (string|binary dictionaries, multi-column sorting), batching,
integration with Arrow IPC, compression ratio optimization, ... The objective
of this proposal is to progressively cover these gaps.

I am looking to see if the community would be interested in such a
contribution. Below are some additional details on the current implementation.
All feedback is welcome.

10K ft overview of the current implementation:

   1. Developers convert their row-oriented stream into records based on
   the Arrow Intermediate Representation (AIR). At this stage the translation
   can be quite mechanical, but if needed developers can decide, for example,
   to translate a map into a struct if that makes sense for them. The current
   implementation supports the following Arrow data types: bool, all uints,
   all ints, all floats, string, binary, list of any supported type, and
   struct of any supported types. Additional Arrow types could be added
   progressively.
   2. The row-oriented record (i.e. AIR record) is then added to a
   RecordRepository. This repository first computes a schema signature and
   routes the record to a RecordBatcher based on this signature (a sketch of
   this signature computation follows the list).
   3. The RecordBatcher is responsible for collecting all the compatible
   AIR records and, upon request, the "batcher" is able to build an Arrow
   Record representing a batch of compatible inputs. In the current
   implementation, the batcher can convert string columns to dictionaries
   based on a configuration. Another configuration option evaluates which
   columns should be sorted to optimize the compression ratio. The same
   optimization process could be applied to binary columns.
   4. Steps 1 through 3 can be repeated on the same RecordRepository
   instance to build new sets of Arrow record batches. Subsequent iterations
   will be slightly faster thanks to several techniques (e.g. object reuse,
   dictionary reuse and sorting, ...).
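
As an illustration of the schema signature mentioned in step 2 (a sketch only;
the actual implementation may differ), a signature can be derived by
canonicalizing the field name/type pairs of an AIR record:

    package main

    import (
        "fmt"
        "sort"
        "strings"
    )

    // signature builds a canonical string from field name/type pairs so that
    // records with the same shape are routed to the same RecordBatcher,
    // regardless of the order in which their fields were appended.
    func signature(fields map[string]string) string {
        parts := make([]string, 0, len(fields))
        for name, typ := range fields {
            parts = append(parts, name+":"+typ)
        }
        sort.Strings(parts)
        return strings.Join(parts, ",")
    }

    func main() {
        a := map[string]string{"name": "string", "start": "uint64"}
        b := map[string]string{"start": "uint64", "name": "string"}
        fmt.Println(signature(a) == signature(b)) // true -> same batcher
    }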


The current Go implementation (WIP) is currently part of this repo (see the
pkg/air package). If the community is interested, I could do a PR in the Arrow
Go and Rust sub-projects.

