Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-29 Thread Micah Kornfield
Please note this message and the previous one from the author violate our Code of Conduct [1]. Specifically "Do not insult or put down other participants." Please try to be professional in communications and focus on the technical issues at hand. [1]

Re: [RUST][Go][proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-29 Thread Gavin Ray
> there are scalar api functions that can be logically used to process rows of data, but they are executed on columnar batches of data.
> As mentioned previously it is better to have an API that applies row level transformations than to have an intermediary row level memory format.
Another way of

Re: [RUST][Go][proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-29 Thread Lee, David
In pyarrow.compute, which is an extension of the C++ implementation, there are scalar API functions that can be logically used to process rows of data, but they are executed on columnar batches of data. As mentioned previously, it is better to have an API that applies row-level transformations

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-29 Thread Andrew Lamb
There has been a substantial amount of effort put into the arrow-rs Rust Parquet implementation to handle the corner cases of nested structs and lists, and all the fun of various levels of nullability. Do let us know if you happen to try writing nested structures directly to Parquet and have

Re: [RUST][Go][proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-29 Thread Andrew Lamb
I am +0 on a standard API -- in the Rust arrow-rs implementation we tend to borrow inspiration from the C++ / Java interfaces and then create appropriate Rust APIs. There is also a row based format in DataFusion [1] (Rust) and it is used to implement certain GroupBy and Sorts (similarly to what

Re: [RUST][Go][proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-29 Thread Laurent Quérel
Hi Julian, My intermediate representation is indeed an API and does not define a specific physical format (which could be different from one language to another, or even not exist at all in some cases). That being said, I didn't understand your feedback and I'm sure there's something to dig into

Re: [RUST][Go][proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-29 Thread Laurent Quérel
Hi Gavin, I was not aware of this initiative but indeed, these two proposals have much in common. The implementation I am working on is available here https://github.com/lquerel/otel-arrow-adapter (directory pkg/air). I would be happy to get your feedback and identify with you the possible gaps to

Re: [RUST][Go][proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-29 Thread Laurent Quérel
Hi Sasha, Thank you very much for this informative comment. It's interesting to see another use of a row-based API in the context of a query engine. I think that there is some thought to be given to whether or not it is possible to converge these two use cases into a single public row-based API.

Re: [RUST][Go][proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Julian Hyde
If the 'row-oriented format' is an API rather than a physical data representation then it can be implemented via coroutines and could therefore have less scattered patterns of read/write access. By 'coroutines' I'm being rather imprecise, but I hope you get the general idea. An asynchronous API
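
One possible reading of the coroutine idea above, as a minimal sketch in plain Python: the caller pushes one row at a time through a row-level API, while the coroutine appends each field straight into per-column buffers, so no intermediate row format is ever stored. The names `column_batcher`, `fields`, and `batch_size` are hypothetical and do not come from the thread or from any Arrow library, and plain Python lists stand in for real Arrow builders.

```python
from typing import Any, Dict, Generator, List, Optional

Row = Dict[str, Any]              # hypothetical row type: field name -> value
Batch = Dict[str, List[Any]]      # hypothetical columnar batch: field name -> values


def column_batcher(fields: List[str], batch_size: int) -> Generator[Optional[Batch], Row, None]:
    """Receive rows via send() and hand back a columnar batch every `batch_size` rows."""
    columns: Batch = {name: [] for name in fields}
    pending = 0
    finished: Optional[Batch] = None
    while True:
        # Hand back a finished batch (or None) and wait for the next row.
        row = yield finished
        finished = None
        for name in fields:
            # Missing fields become None, standing in for an Arrow null.
            columns[name].append(row.get(name))
        pending += 1
        if pending == batch_size:
            finished = columns
            columns = {name: [] for name in fields}
            pending = 0


writer = column_batcher(["id", "name"], batch_size=2)
next(writer)  # prime the coroutine so it is waiting at the first yield

batches: List[Batch] = []
for row in [{"id": 1, "name": "x"}, {"id": 2}, {"id": 3, "name": "z"}, {"id": 4, "name": "w"}]:
    out = writer.send(row)
    if out is not None:
        batches.append(out)
```

This sketch never flushes a trailing partial batch; a real implementation would need an explicit end-of-stream step.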

Re: [RUST][Go][proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Gavin Ray
This is essentially the same idea as the proposal here I think -- row/map-based representation & conversion functions for ease of use: [RFC] [Java] Higher-level "DataFrame"-like API. Lower barrier to entry, increase adoption/audience and productivity. · Issue #12618 · apache/arrow (github.com)

Re: [RUST][Go][proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Sasha Krassovsky
Hi everyone, I just wanted to chime in that we already do have a form of row-oriented storage inside of `arrow/compute/row/row_internal.h`. It is used to store rows inside of GroupBy and Join within Acero. We also have utilities for converting to/from columnar storage (and AVX2 implementations

Re: [RUST][Go][proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Laurent Quérel
Thank you Micah for a very clear summary of the intent behind this proposal. Indeed, I think that clarifying from the beginning that this approach aims at facilitating experimentation more than efficiency in terms of performance of the transformation phase would have helped to better understand my

[RUST][Go][proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Micah Kornfield
Hi Laurent, I'm retitling this thread to include the specific languages you seem to be targeting in the subject line to hopefully get more eyes from maintainers in those languages. Thanks for clarifying the goals. If I can restate my understanding, the intended use-case here is to provide easy

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Laurent Quérel
Far be it from me to think that I know more than Jorge or Wes on this subject. Sorry if my post gives that perception, that is clearly not my intention. I'm just trying to defend the idea that when designing this kind of transformation, it might be interesting to have a library to test several

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Benjamin Blodgett
He was trying to nicely say he knows way more than you, and your ideas will result in a low performance scheme no one will use in production ai/machine learning.

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Benjamin Blodgett
I think Jorge’s opinion is that of an expert and him being humble is just being tactful. Probably listen to Jorge on performance and architecture, even over Wes, as he’s contributed more than anyone else and knows the bleeding edge of low-level performance stuff more than anyone.

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Laurent Quérel
Hi Jorge, I don't think that the level of in-depth knowledge needed is the same between using a row-oriented internal representation and "Arrow" which not only changes the organization of the data but also introduces a set of additional mapping choices and concepts. For example, assuming that the

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-27 Thread Jorge Cardoso Leitão
Hi Laurent, I agree that there is a common pattern in converting row-based formats to Arrow. Imho the difficult part is not to map the storage format to Arrow specifically - it is to map the storage format to any in-memory (row- or columnar- based) format, since it requires in-depth knowledge

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-27 Thread Laurent Quérel
Let me clarify the proposal a bit before replying to the previous feedback. It seems to me that the process of converting a row-oriented data source (row = set of fields or something more hierarchical) into an Arrow record repeatedly raises the same challenges. A developer who must

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-27 Thread Wes McKinney
We had an e-mail thread about this in 2018 https://lists.apache.org/thread/35pn7s8yzxozqmgx53ympxg63vjvggvm I still think having a canonical in-memory row format (and libraries to transform to and from Arrow columnar format) is a good idea — but there is the risk of ending up in the tar pit of

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-27 Thread Micah Kornfield
Are there more details on what exactly an "Arrow Intermediate Representation (AIR)" is? We've talked in the past about maybe having a memory layout specification for row-based data as well as column-based data. There was also a recent attempt at least in C++ to try to build utilities to do these

RE: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-27 Thread Lee, David
I think this has been addressed for both Parquet and Python to handle records including nested structures. Not sure about Rust and Go. [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels https://issues.apache.org/jira/browse/ARROW-1644 [Python] Add

[proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-26 Thread Laurent Quérel
In the context of this OTEP (OpenTelemetry Enhancement Proposal) I developed an integration layer on top of Apache Arrow (Go and Rust) to *facilitate the translation of a row-oriented data stream into an Arrow-based columnar
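
To make the row-to-columnar translation under discussion concrete, here is a minimal sketch in plain Python. It is not the AIR API from the proposal: `rows_to_columns` and the `Row` alias are hypothetical names, dictionaries stand in for rows, and lists with `None` stand in for Arrow arrays with nulls. It only illustrates one recurring challenge, which is that fields can be missing from a row or first appear late in the stream.

```python
from typing import Any, Dict, Iterable, List, Optional, Tuple

Row = Dict[str, Any]  # hypothetical row type: field name -> value


def rows_to_columns(rows: Iterable[Row]) -> Tuple[Dict[str, List[Optional[Any]]], int]:
    """Transpose row dicts into per-field columns, padding missing values with None."""
    columns: Dict[str, List[Optional[Any]]] = {}
    count = 0
    for row in rows:
        for name in row:
            if name not in columns:
                # Field seen for the first time: backfill None for earlier rows.
                columns[name] = [None] * count
        for name, values in columns.items():
            # Fields absent from this row are recorded as None.
            values.append(row.get(name))
        count += 1
    return columns, count


rows = [
    {"host": "a", "cpu": 0.5},
    {"host": "b"},
    {"host": "c", "cpu": 0.9, "region": "eu"},
]
columns, num_rows = rows_to_columns(rows)
```

The schema here is inferred from the data as it arrives. Nested or hierarchical rows, which the thread also raises, are outside the scope of this sketch.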