Re: [I] Generated nested parquet files use ITEM instead of ELEMENT for list nodes [arrow-rs]

via GitHub Fri, 15 Nov 2024 09:21:22 -0800


tustvold commented on issue #6733:
URL: https://github.com/apache/arrow-rs/issues/6733#issuecomment-2479487010


   I don't think this is a bug per se: the parquet writer converts the arrow 
schema faithfully into parquet, preserving the field name of the list elements.
   
   The problem arises because the default within the arrow ecosystem is to call 
this "item" and not "element".
   ```
   >>> import pyarrow as pa
   >>> pa.list_(pa.string())
   ListType(list<item: string>)
   >>> pa.list_(pa.string()).field(0).name
   'item'
   ```
   
   This matters because the parquet schema is authoritative: when reading back 
a parquet file with a field name of "element", the arrow schema should reflect 
this. Therefore, if we coerced to "item", the schema would not roundtrip as 
people might expect.
   
   I think the way to handle this is probably #1938, where we add an option to 
coerce arrow types to be more compatible with parquet's type system, with the 
understanding that things may not always roundtrip completely faithfully.
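
   The trade-off can be illustrated with a toy model in plain Python. This is 
not the arrow-rs API, nor the design proposed in #1938; the `Field` class and 
the `coerce_list_names` function are hypothetical names introduced only for 
this sketch.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field; not an Arrow or Parquet type."""
    name: str
    kind: str  # e.g. "string", "list", "struct"
    children: Tuple["Field", ...] = ()


def coerce_list_names(field: Field, element_name: str = "element") -> Field:
    """Rename the child of every list field, recursing into nested types."""
    children = tuple(coerce_list_names(c, element_name) for c in field.children)
    if field.kind == "list" and children:
        children = (replace(children[0], name=element_name),) + children[1:]
    return replace(field, children=children)


# An arrow-style schema using the default child name "item".
arrow_schema = Field("tags", "list", (Field("item", "string"),))

# Opt-in coercion at write time renames the child to "element".
written = coerce_list_names(arrow_schema)

print(written.children[0].name)
# The schema read back would no longer equal the one that was written.
print(written == arrow_schema)
```

   In this model the coerced schema differs from the original, which is the 
roundtrip cost that an opt-in coercion setting would accept.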
   

