ion-elgreco opened a new issue, #6213:
URL: https://github.com/apache/arrow-rs/issues/6213
**Describe the bug**
Creating a record batch with Arrow map types produces field names that differ
from what the Parquet spec requires. When you write a Parquet file with
DataFusion, the spec is ignored and the data is written as-is, with the Arrow
field names carried over into the Parquet schema. This violates the Parquet spec.
The parquet file has this schema:
```
<pyarrow._parquet.ParquetSchema object at 0x7f5393f683c0>
required group field_id=-1 arrow_schema {
  optional group field_id=-1 map_type (Map) {
    repeated group field_id=-1 entries {
      required binary field_id=-1 key (String);
      optional binary field_id=-1 value;
    }
  }
}
```
instead of
```
<pyarrow._parquet.ParquetSchema object at 0x7f5393f9cd40>
required group field_id=-1 arrow_schema {
  optional group field_id=-1 map_type (Map) {
    repeated group field_id=-1 key_value {
      required binary field_id=-1 key (String);
      optional binary field_id=-1 value;
    }
  }
}
```
The PyArrow Parquet writer doesn't do this; it follows the Parquet spec when
writing. See here:
```python
import pyarrow as pa
import pyarrow.parquet as pq
pylist = [{"map_type":{'1':b"M"}}]
schema = pa.schema(
    [
        pa.field("map_type", pa.map_(pa.large_string(), pa.large_binary())),
    ]
)
table = pa.Table.from_pylist(pylist, schema=schema)
# table.schema
#
# map_type: map<large_string, large_binary>
#   child 0, entries: struct<key: large_string not null, value: large_binary> not null
#       child 0, key: large_string not null
#       child 1, value: large_binary
pq.write_table(table, "test.parquet")
print(pq.read_metadata("test.parquet").schema)
# <pyarrow._parquet.ParquetSchema object at 0x7f53cc5116c0>
# required group field_id=-1 schema {
#   optional group field_id=-1 map_type (Map) {
#     repeated group field_id=-1 key_value {
#       required binary field_id=-1 key (String);
#       optional binary field_id=-1 value;
#     }
#   }
# }
```
You can see that entries was correctly written as key_value. It is also
interesting to note that PyArrow uses "key" and "value", while arrow-rs uses
"keys" and "values".
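
To spot this mismatch quickly in a printed schema like the ones in this issue, one option is to check the name of the repeated group directly under each `(Map)` group. The following is only a rough text-level sketch: `nonconforming_map_groups` and its regex are hypothetical helpers, not part of pyarrow or arrow-rs, and they assume the dump format shown here (one node per line, with the repeated group on the line right after the `(Map)` group).

```python
import re

# Matches group lines such as "optional group field_id=-1 map_type (Map) {".
# Captures: repetition, group name, optional logical-type annotation.
GROUP_RE = re.compile(
    r"^(required|optional|repeated)\s+group\s+"
    r"(?:field_id=-?\d+\s+)?(\w+)(?:\s+\((\w+)\))?\s*\{$"
)


def nonconforming_map_groups(schema_text):
    """Return (map_name, child_name) pairs where the repeated group directly
    under a (Map) group is not named "key_value"."""
    problems = []
    pending_map = None  # name of the (Map) group seen on the previous line
    for raw_line in schema_text.splitlines():
        match = GROUP_RE.match(raw_line.strip())
        if match is None:
            pending_map = None
            continue
        repetition, name, annotation = match.groups()
        if pending_map is not None and repetition == "repeated" and name != "key_value":
            problems.append((pending_map, name))
        pending_map = name if annotation == "Map" else None
    return problems


# Schema dump copied from the file written through DataFusion (see above).
written_by_datafusion = """\
required group field_id=-1 arrow_schema {
  optional group field_id=-1 map_type (Map) {
    repeated group field_id=-1 entries {
      required binary field_id=-1 key (String);
      optional binary field_id=-1 value;
    }
  }
}
"""

print(nonconforming_map_groups(written_by_datafusion))
# [('map_type', 'entries')]
```

Running the same check on the PyArrow-written schema returns an empty list, since its repeated group is already named key_value.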