setop opened a new issue, #5804:
URL: https://github.com/apache/arrow-rs/issues/5804

   ## Describe the bug
   
   I have tabular data split across two Parquet files with the same schema: one produced with parquet-cpp-arrow, the other with parquet-rs.
   Concatenating them with `parquet-concat` fails with `Error: General("inputs must have the same schema, [...]")`.
   
   
   ## To Reproduce
   
   * `pip install pyarrow`
   * create csv
   
   ```bash
   cat << EOF > in.csv
   x,y
   1,2
   EOF
   ```
   
   * convert csv to parquet using the Python wrapper over [parquet-cpp](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html)
   
   ```python
   # csv_to_parquet.py
   import sys
   from pyarrow import csv, parquet as pq
   table = csv.read_csv(sys.argv[1])
   pq.write_table(table, sys.argv[2])
   ```
   
   `python csv_to_parquet.py in.csv a.parquet`
   
   * create schema file
   
   ```bash
   cat << EOF > schema.txt
   message schema {
     OPTIONAL INT64 x;
     OPTIONAL INT64 y;
   }
   EOF
   ```
   
   * convert csv to parquet using parquet-rs
   
   ```bash
   parquet-fromcsv --schema schema.txt --input-file in.csv -h --output-file b.parquet -w 2
   ```
   
   * concatenate these two files `parquet-concat c.parquet {a,b}.parquet`
   
   
   
   ## Expected behavior
   
   The files are concatenated into `c.parquet`, which contains two rows.
   
   ## Additional context
   
   The only difference between the two schemas is the root message name: `schema` in `a.parquet` versus `arrow_schema` in `b.parquet`.
   
   ```diff
    GroupType {
        basic_info: BasicTypeInfo {
   -        name: \"schema\",
   +        name: \"arrow_schema\",
            repetition: None,
            converted_type: NONE,
            logical_type: None,
   ```
   
   ### schemas
   
   ```
   Metadata for file: a.parquet
   
   version: 2
   num of rows: 1
   created by: parquet-cpp-arrow version 16.1.0
   metadata:
     ARROW:schema: 
/////6gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABEAAAABAAAANT///8AAAECEAAAABQAAAAEAAAAAAAAAAEAAAB5AAAAxP///wAAAAFAAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAECEAAAABwAAAAEAAAAAAAAAAEAAAB4AAAACAAMAAgABwAIAAAAAAAAAUAAAAA=
   message schema {
     OPTIONAL INT64 x;
     OPTIONAL INT64 y;
   }
   ```
   
   ```
   Metadata for file: b.parquet
   
   version: 2
   num of rows: 1
   created by: parquet-rs version 51.0.0
   metadata:
     ARROW:schema: 
/////6gAAAAQAAAAAAAKAAwACgAJAAQACgAAABAAAAAAAQQACAAIAAAABAAIAAAABAAAAAIAAABEAAAABAAAANT///8QAAAAGAAAAAAAAQIUAAAAxP///0AAAAAAAAABAAAAAAEAAAB5AAAAEAAUABAADgAPAAQAAAAIABAAAAAYAAAAIAAAAAAAAQIcAAAACAAMAAQACwAIAAAAQAAAAAAAAAEAAAAAAQAAAHgAAAA=
   message arrow_schema {
     OPTIONAL INT64 x;
     OPTIONAL INT64 y;
   }
   ```
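
To make the comparison explicit, here is a minimal Python sketch that takes the two printed schemas above as plain strings and compares them with and without the root message name. `parse_message` is a hypothetical helper written for this illustration; it is not part of pyarrow or parquet-rs, and it does not reflect how `parquet-concat` compares schemas internally.

```python
import re


def parse_message(text):
    """Split a printed Parquet schema into (root_name, [field lines])."""
    match = re.search(r"message\s+(\w+)\s*\{(.*)\}", text, re.DOTALL)
    if match is None:
        raise ValueError("no 'message <name> { ... }' block found")
    root_name = match.group(1)
    # Keep each non-empty line of the body, stripped of indentation.
    fields = [line.strip() for line in match.group(2).splitlines() if line.strip()]
    return root_name, fields


# Schema text copied from the a.parquet dump (parquet-cpp-arrow).
SCHEMA_A = """
message schema {
  OPTIONAL INT64 x;
  OPTIONAL INT64 y;
}
"""

# Schema text copied from the b.parquet dump (parquet-rs).
SCHEMA_B = """
message arrow_schema {
  OPTIONAL INT64 x;
  OPTIONAL INT64 y;
}
"""

name_a, fields_a = parse_message(SCHEMA_A)
name_b, fields_b = parse_message(SCHEMA_B)

print(name_a == name_b)      # False: root names differ
print(fields_a == fields_b)  # True: columns are identical
```

Under this (assumed) notion of equality, a comparison that ignores the root name would treat the two files as having the same schema.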
   
   I also tested with both files written with writer version 1, and with a mix of versions 1 and 2. The error is the same in every case.
   
   ---
   
   I started commenting on #4799, but since it is closed, not sure it'll get 
noticed.

