Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-29 Thread Micah Kornfield
Please note this message and the previous one from the author violate our Code of Conduct [1]. Specifically "Do not insult or put down other participants." Please try to be professional in communications and focus on the technical issues at hand. [1] https://www.apache.org/foundation/policies/co

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-29 Thread Andrew Lamb
There has been a substantial amount of effort put into the arrow-rs Rust Parquet implementation to handle the corner cases of nested structs and lists, and all the fun of various levels of nullability. Do let us know if you happen to try writing nested structures directly to Parquet and have issues
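
To illustrate why nested lists with several levels of nullability produce corner cases, here is a minimal Python sketch that flattens a nullable list column of nullable integers into offsets, a list-level validity mask, and child values with their own validity mask. This mirrors the general shape of Arrow's variable-size list layout; it is not the arrow-rs implementation, and the function name `flatten_list_column` is hypothetical.

```python
def flatten_list_column(column):
    """Flatten a column of nullable lists of nullable ints.

    Hypothetical helper for illustration only; plain Python lists
    stand in for Arrow buffers.
    """
    offsets = [0]        # offsets[i]..offsets[i+1] delimits list i in child_values
    list_valid = []      # False where the list itself is null
    child_values = []    # all elements, concatenated
    child_valid = []     # False where an element is null
    for item in column:
        if item is None:
            # A null list occupies no child slots.
            list_valid.append(False)
        else:
            list_valid.append(True)
            for element in item:
                child_valid.append(element is not None)
                child_values.append(element)
        offsets.append(len(child_values))
    return offsets, list_valid, child_values, child_valid


# A null list, an empty list, and a list holding a null element are
# three distinct cases that must stay distinguishable.
column = [[1, 2], None, [], [None, 3]]
offsets, list_valid, child_values, child_valid = flatten_list_column(column)
```

Note that the null list and the empty list produce identical offsets; only the list-level validity mask tells them apart.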

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Laurent Quérel
Far be it from me to think that I know more than Jorge or Wes on this subject. Sorry if my post gives that impression; that is clearly not my intention. I'm just trying to defend the idea that when designing this kind of transformation, it might be interesting to have a library to test several mapp

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Benjamin Blodgett
He was trying to nicely say he knows way more than you, and your ideas will result in a low-performance scheme no one will use in production AI/machine learning.

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Benjamin Blodgett
I think Jorge’s opinion is that of an expert, and his being humble is just being tactful. Probably listen to Jorge on performance and architecture, even over Wes, as he’s contributed more than anyone else and knows the bleeding edge of low-level performance stuff more than anyone.

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-28 Thread Laurent Quérel
Hi Jorge, I don't think that the level of in-depth knowledge needed is the same between using a row-oriented internal representation and "Arrow", which not only changes the organization of the data but also introduces a set of additional mapping choices and concepts. For example, assuming that the

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-27 Thread Jorge Cardoso Leitão
Hi Laurent, I agree that there is a common pattern in converting row-based formats to Arrow. IMHO the difficult part is not to map the storage format to Arrow specifically; it is to map the storage format to any in-memory (row- or column-based) format, since it requires in-depth knowledge abo

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-27 Thread Laurent Quérel
Let me clarify the proposal a bit before replying to the previous feedback. It seems to me that the process of converting a row-oriented data source (row = set of fields or something more hierarchical) into an Arrow record repeatedly raises the same challenges. A developer who must perf
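
As a rough illustration of the row-to-column conversion being discussed, the following Python sketch transposes a list of flat row dictionaries into per-field value lists with validity masks. It is a simplified stand-in using only the standard library, not the proposed AIR design and not an Arrow API; the function name `rows_to_columns` and the sample rows are hypothetical.

```python
from typing import Any


def rows_to_columns(rows: list[dict[str, Any]]) -> dict[str, dict[str, list]]:
    """Transpose row dicts into per-field values plus validity masks.

    Hypothetical helper for illustration; missing fields and explicit
    None values are both recorded as null.
    """
    # Collect field names in first-seen order so the schema is stable.
    fields: list[str] = []
    for row in rows:
        for name in row:
            if name not in fields:
                fields.append(name)

    columns = {name: {"values": [], "valid": []} for name in fields}
    for row in rows:
        for name in fields:
            value = row.get(name)  # None when the field is absent
            columns[name]["values"].append(value)
            columns[name]["valid"].append(value is not None)
    return columns


# Rows with a missing field, an explicit null, and a late-appearing field.
rows = [
    {"id": 1, "name": "a"},
    {"id": 2},
    {"id": 3, "name": None, "score": 0.5},
]
columns = rows_to_columns(rows)
```

Even this flat case forces choices (field ordering, absent versus null, fields that appear late), which is the kind of recurring decision the message describes.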

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-27 Thread Wes McKinney
We had an e-mail thread about this in 2018: https://lists.apache.org/thread/35pn7s8yzxozqmgx53ympxg63vjvggvm I still think having a canonical in-memory row format (and libraries to transform to and from Arrow columnar format) is a good idea, but there is the risk of ending up in the tar pit of re

Re: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-27 Thread Micah Kornfield
Are there more details on what exactly an "Arrow Intermediate Representation (AIR)" is? We've talked in the past about maybe having a memory layout specification for row-based data as well as column-based data. There was also a recent attempt, at least in C++, to try to build utilities to do these

RE: [proposal] Arrow Intermediate Representation to facilitate the transformation of row-oriented data sources into Arrow columnar representation

2022-07-27 Thread Lee, David
I think this has been addressed for both Parquet and Python to handle records including nested structures. Not sure about Rust and Go. [C++][Parquet] Read and write nested Parquet data with a mix of struct and list nesting levels https://issues.apache.org/jira/browse/ARROW-1644 [Python] Add