wgtmac commented on PR #67:
URL: https://github.com/apache/parquet-testing/pull/67#issuecomment-2576780835
```
File path: bad_data/ARROW-GH-45185.parquet
Created by: parquet-cpp-arrow version 19.0.0-SNAPSHOT
Properties: (none)
Schema:
message schema {
  repeated int64 int64_field;
}
Row group 0:  count: 50  19.10 B records  start: 4  total(compressed): 955 B total(uncompressed):955 B
--------------------------------------------------------------------------------
             type   encodings  count  avg size  nulls  min / max
int64_field  INT64  _ _ R      100    9.55 B    0      "0" / "99000000000000"
Column: int64_field
--------------------------------------------------------------------------------
page  type  enc  count  avg size  size   rows  nulls  min / max
0-D   dict  _ _  100    8.00 B    800 B
0-1   data  _ R  100    1.18 B    118 B
"columnIndexReference" : {
"offset" : 959,
"length" : 31
},
"offsetIndexReference" : {
"offset" : 990,
"length" : 12
},
```
The file size is 1.2K. Could we reduce it as much as possible? For example:
- leverage compression like zstd
- disable dictionary encoding
- disable page index
- reduce row count
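
To put rough numbers on the options above, here is a minimal sketch that only does arithmetic on the figures from the dump. The constant and function names are made up for illustration, and two readings of the dump are assumptions: that the 37 B not accounted for by the two pages is page-header overhead, and that a dictionary with as many entries as there are values means every value is distinct.

```python
# Byte counts copied from the parquet-cli dump above.
ROW_GROUP_START = 4     # "start: 4"
ROW_GROUP_TOTAL = 955   # total(uncompressed) for row group 0
DICT_PAGE = 800         # page 0-D: 100 entries at 8.00 B each
DATA_PAGE = 118         # page 0-1: 100 values, encoding "R"
COLUMN_INDEX = 31       # columnIndexReference.length
OFFSET_INDEX = 12       # offsetIndexReference.length
DICT_ENTRIES = 100      # count on the dictionary page
VALUE_COUNT = 100       # count on the data page
INT64_WIDTH = 8         # bytes per plain-encoded int64


def size_breakdown():
    """Summarize where the bytes go, using only the numbers in the dump."""
    # Assumption: whatever the two pages do not cover is page-header overhead.
    page_header_bytes = ROW_GROUP_TOTAL - DICT_PAGE - DATA_PAGE
    page_index_bytes = COLUMN_INDEX + OFFSET_INDEX
    # The page index sits right after the row group (offset 959 = 4 + 955).
    page_index_end = ROW_GROUP_START + ROW_GROUP_TOTAL + page_index_bytes
    # Assumption: a dictionary only pays off when it has fewer entries than values.
    dictionary_helps = DICT_ENTRIES < VALUE_COUNT
    plain_payload_bytes = VALUE_COUNT * INT64_WIDTH
    return {
        "page_header_bytes": page_header_bytes,
        "page_index_bytes": page_index_bytes,
        "page_index_end": page_index_end,
        "dictionary_helps": dictionary_helps,
        "plain_payload_bytes": plain_payload_bytes,
    }


if __name__ == "__main__":
    for key, value in size_breakdown().items():
        print(f"{key}: {value}")
```

Under those assumptions the dictionary buys nothing here: plain encoding would carry the same 800 B of values without the extra 118 B index page, and dropping the page index would save another 43 B. Since the file lives under `bad_data/`, any regenerated version would still need to be checked against the original issue.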
BTW, `repeated int64 int64_field` is a special case of an unannotated list type,
which we should avoid:
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md?plain=1#L607-L624.
Should we replace it with a LIST-annotated type? cc @pitrou @mapleFU