setop opened a new issue, #5804:
URL: https://github.com/apache/arrow-rs/issues/5804
## Describe the bug
I have tabular data split across two Parquet files with the same schema: one
produced with parquet-cpp-arrow, the other with parquet-rs.
Concatenating them fails with `Error: General("inputs must have the same schema,
[...]")`
## To Reproduce
* `pip install pyarrow`
* create csv
```bash
cat << EOF > in.csv
x,y
1,2
EOF
```
* convert csv to parquet using python wrapper over
[parquet-cpp](https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html)
```python
# csv_to_parquet.py
import sys
from pyarrow import csv, parquet as pq
table = csv.read_csv(sys.argv[1])
pq.write_table(table, sys.argv[2])
```
`python csv_to_parquet.py in.csv a.parquet`
* create schema file
```bash
cat << EOF > schema.txt
message schema {
OPTIONAL INT64 x;
OPTIONAL INT64 y;
}
EOF
```
* convert csv to parquet using parquet-rs
```bash
parquet-fromcsv --schema schema.txt --input-file in.csv -h --output-file b.parquet -w 2
```
* concatenate these two files `parquet-concat c.parquet {a,b}.parquet`
## Expected behavior
The files are concatenated into `c.parquet`, which contains two rows.
## Additional context
The only difference between the two schemas is the root message name.
```diff
GroupType {
basic_info: BasicTypeInfo {
- name: \"schema\",
+ name: \"arrow_schema\",
repetition: None,
converted_type: NONE,
logical_type: None,
```
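To make the comparison concrete, here is a minimal sketch in plain Python. It is not how `parquet-concat` compares schemas, and `split_message` is a hypothetical helper written only for this illustration. It parses the two message types printed in the "schemas" section and shows that the field lists are identical while the root names differ.

```python
import re


def split_message(schema_text):
    """Split a flat Parquet message type into (root_name, fields).

    Hypothetical helper for illustration only: it handles the flat
    schemas shown in this issue, not nested groups or annotations.
    """
    match = re.match(r"\s*message\s+(\w+)\s*\{(.*)\}\s*$", schema_text, re.DOTALL)
    if match is None:
        raise ValueError("not a Parquet message type")
    root_name, body = match.groups()
    # Each field is "<repetition> <physical type> <name>;"
    fields = [part.split() for part in body.split(";") if part.strip()]
    return root_name, fields


# Schema text copied from the metadata dump of a.parquet (parquet-cpp-arrow)
SCHEMA_A = """message schema {
OPTIONAL INT64 x;
OPTIONAL INT64 y;
}"""

# Schema text copied from the metadata dump of b.parquet (parquet-rs)
SCHEMA_B = """message arrow_schema {
OPTIONAL INT64 x;
OPTIONAL INT64 y;
}"""

name_a, fields_a = split_message(SCHEMA_A)
name_b, fields_b = split_message(SCHEMA_B)

print("root names equal:", name_a == name_b)
print("fields equal:", fields_a == fields_b)
```

A comparison that ignored the root name and looked only at the fields would treat these two files as compatible.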
### schemas
```
Metadata for file: a.parquet
version: 2
num of rows: 1
created by: parquet-cpp-arrow version 16.1.0
metadata:
ARROW:schema:
/////6gAAAAQAAAAAAAKAAwABgAFAAgACgAAAAABBAAMAAAACAAIAAAABAAIAAAABAAAAAIAAABEAAAABAAAANT///8AAAECEAAAABQAAAAEAAAAAAAAAAEAAAB5AAAAxP///wAAAAFAAAAAEAAUAAgABgAHAAwAAAAQABAAAAAAAAECEAAAABwAAAAEAAAAAAAAAAEAAAB4AAAACAAMAAgABwAIAAAAAAAAAUAAAAA=
message schema {
OPTIONAL INT64 x;
OPTIONAL INT64 y;
}
```
```
Metadata for file: b.parquet
version: 2
num of rows: 1
created by: parquet-rs version 51.0.0
metadata:
ARROW:schema:
/////6gAAAAQAAAAAAAKAAwACgAJAAQACgAAABAAAAAAAQQACAAIAAAABAAIAAAABAAAAAIAAABEAAAABAAAANT///8QAAAAGAAAAAAAAQIUAAAAxP///0AAAAAAAAABAAAAAAEAAAB5AAAAEAAUABAADgAPAAQAAAAIABAAAAAYAAAAIAAAAAAAAQIcAAAACAAMAAQACwAIAAAAQAAAAAAAAAEAAAAAAQAAAHgAAAA=
message arrow_schema {
OPTIONAL INT64 x;
OPTIONAL INT64 y;
}
```
I also tested with both files written with writer version 1, and with a mix of
versions 1 and 2. Same error in every case.
---
I started commenting on #4799, but since it is closed, I'm not sure it'll get
noticed.