jorisvandenbossche commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1562806313
You can also use the Python APIs to read the Parquet FileMetaData, which should
be a bit easier (assuming this includes all the relevant information).
Small example:
```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

# Same random values stored once as int64 (the numpy default) and once as int32
arr = np.random.randint(0, 200_000, size=10_000)
table1 = pa.table({"col": arr})
table2 = pa.table({"col": arr.astype("int32")})
pq.write_table(table1, "data_int64.parquet")
pq.write_table(table2, "data_int32.parquet")
```
gives:
```
In [25]: pq.read_metadata("data_int64.parquet").row_group(0).column(0)
Out[25]:
<pyarrow._parquet.ColumnChunkMetaData object at 0x7f91d8d309a0>
  file_offset: 63767
  file_path:
  physical_type: INT64
  num_values: 10000
  path_in_schema: col
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7f91d8406de0>
      has_min_max: True
      min: 10
      max: 199969
      null_count: 0
      distinct_count: 0
      num_values: 10000
      physical_type: INT64
      logical_type: None
      converted_type (legacy): NONE
  compression: SNAPPY
  encodings: ('RLE_DICTIONARY', 'PLAIN', 'RLE')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 46165
  total_compressed_size: 63763
  total_uncompressed_size: 95728

In [26]: pq.read_metadata("data_int32.parquet").row_group(0).column(0)
Out[26]:
<pyarrow._parquet.ColumnChunkMetaData object at 0x7f91d265cc70>
  file_offset: 56672
  file_path:
  physical_type: INT32
  num_values: 10000
  path_in_schema: col
  is_stats_set: True
  statistics:
    <pyarrow._parquet.Statistics object at 0x7f91d265d710>
      has_min_max: True
      min: 10
      max: 199969
      null_count: 0
      distinct_count: 0
      num_values: 10000
      physical_type: INT32
      logical_type: None
      converted_type (legacy): NONE
  compression: SNAPPY
  encodings: ('RLE_DICTIONARY', 'PLAIN', 'RLE')
  has_dictionary_page: True
  dictionary_page_offset: 4
  data_page_offset: 39086
  total_compressed_size: 56668
  total_uncompressed_size: 56656
```
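If you want to compare the files programmatically instead of eyeballing the repr, the same fields are available as attributes on the `ColumnChunkMetaData` object. A minimal sketch, reusing the two files written above:
```python
import pyarrow.parquet as pq

# Compare the column-chunk sizes of the two files written above
for fname in ["data_int64.parquet", "data_int32.parquet"]:
    col = pq.read_metadata(fname).row_group(0).column(0)
    print(fname, col.physical_type,
          col.total_compressed_size, col.total_uncompressed_size)
```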
For this toy example, it's also peculiar that the total compressed size is
actually a tiny bit larger than the total uncompressed size in the int32 case
(I used the default of snappy, though; with zstd it actually does compress a
bit). And while int64 compresses better, the int32 compressed size is still a
bit smaller in this case.
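To check the zstd point yourself, you can pass `compression="zstd"` to `pq.write_table`; a quick sketch, reusing `table1`/`table2` from above:
```python
# Rewrite the same tables with zstd instead of the snappy default
for name, table in [("int64", table1), ("int32", table2)]:
    fname = f"data_{name}_zstd.parquet"
    pq.write_table(table, fname, compression="zstd")
    col = pq.read_metadata(fname).row_group(0).column(0)
    print(name, col.total_compressed_size, col.total_uncompressed_size)
```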