ggreco opened a new issue, #6733:
URL: https://github.com/apache/arrow-rs/issues/6733

   **Describe the bug**
   arrow-rs generates .parquet files in which the item of a nested list is not 
named `element`, as the Parquet specification requires:
   https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists
   
   Instead, the generated files use `item`, probably a leftover from some 
legacy code.
   
   A similar issue has been recently fixed in polars-rs:
   https://github.com/pola-rs/polars/pull/17803
   
   PyArrow lets you use `item` instead of `element` (the default) to support 
legacy files, but IMHO arrow-rs should not generate legacy Parquet files!
   
   The code in arrow-rs that implements this is:
   https://github.com/apache/arrow-rs/blob/master/arrow-schema/src/field.rs#L147
   
   IMHO the fix is a single-line change. I can create a PR, but I want to be 
sure I'm not misreading the spec and that there is no reason for hardcoding 
`item`, since it seems too simple...
   
   **To Reproduce**
   Generate a nested Parquet file, or use the one attached to this issue, and 
verify (with a hex editor, `parquet-schema` from this repo, or a GUI tool 
that shows the Parquet schema, like "parquet floor") that the type name 
associated with the list item is always `item` instead of `element`.
   
   For the file attached to this ticket, 
[example_parquet.zip](https://github.com/user-attachments/files/17776513/example_parquet.zip), 
`parquet-schema` reports the following schema:
   
   ```
   {
     REQUIRED BYTE_ARRAY school (STRING);
     REQUIRED group students (LIST) {
       REPEATED group list {
         OPTIONAL group item {
           REQUIRED BYTE_ARRAY name (STRING);
           REQUIRED INT32 age;
         }
       }
     }
     REQUIRED group teachers (LIST) {
       REPEATED group list {
         OPTIONAL group item {
           REQUIRED BYTE_ARRAY name (STRING);
           REQUIRED INT32 age;
         }
       }
     }
   }
   ```
   
   The expected output is:
   ```
   {
     REQUIRED BYTE_ARRAY school (STRING);
     REQUIRED group students (LIST) {
       REPEATED group list {
         OPTIONAL group element {
           REQUIRED BYTE_ARRAY name (STRING);
           REQUIRED INT32 age;
         }
       }
     }
     REQUIRED group teachers (LIST) {
       REPEATED group list {
         OPTIONAL group element {
           REQUIRED BYTE_ARRAY name (STRING);
           REQUIRED INT32 age;
         }
       }
     }
   }
   ```
   
   I can get `parquet-schema` to output `element` instead of `item` when 
generating the Parquet file from Python or .NET.
   
   In the hex editor you will see `students.list.item.name` instead of the 
expected `students.list.element.name`.
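
To check a schema dump without reading it by eye, here is a minimal sketch that scans text in the format shown above and reports any list item not named `element`. It assumes each field line is laid out as `REPETITION TYPE name ...` and that the item is the first field after a `REPEATED group list {` line. The function name `non_compliant_list_items` is hypothetical and not part of arrow-rs; it uses only the Rust standard library.

```rust
/// Scans a textual schema dump (format assumed to match the dumps in this
/// issue) and returns `(line_number, name)` for every list item whose name
/// is not `element`. Line numbers are 1-based within the dump.
fn non_compliant_list_items(schema_dump: &str) -> Vec<(usize, String)> {
    let mut found = Vec::new();
    // Set when the previous non-empty line opened a `REPEATED group list`.
    let mut expect_item = false;

    for (idx, line) in schema_dump.lines().enumerate() {
        let tokens: Vec<&str> = line.split_whitespace().collect();
        if tokens.is_empty() {
            continue;
        }

        if expect_item {
            expect_item = false;
            // Third token is the field name, possibly followed by `;` or `{`.
            if let Some(raw) = tokens.get(2) {
                let name = raw.trim_end_matches(|c: char| c == ';' || c == '{');
                if name != "element" {
                    found.push((idx + 1, name.to_string()));
                }
            }
        }

        // The spec's middle level: a repeated group named `list`.
        if tokens.len() >= 3
            && tokens[0] == "REPEATED"
            && tokens[1] == "group"
            && tokens[2].trim_end_matches('{') == "list"
        {
            expect_item = true;
        }
    }
    found
}
```

Run on the first dump above, this should flag `item` for both the `students` and `teachers` lists, and return an empty vector for the expected dump.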
   
   

