jorisvandenbossche commented on issue #35726:
URL: https://github.com/apache/arrow/issues/35726#issuecomment-1562777542

   You can also use the Python APIs to read the Parquet FileMetaData, which should 
be a bit easier (assuming this includes all the relevant information).
   
   Small example:
   
   ```python
   import numpy as np
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   arr = np.random.randint(0, 200_000, size=10_000)
   table1 = pa.table({"col": arr})
   table2 = pa.table({"col": arr.astype("int32")})
   
   pq.write_table(table1, "data_int64.parquet")
   pq.write_table(table2, "data_int32.parquet")
   ```
   
   gives:
   
   ```
   In [25]: pq.read_metadata("data_int64.parquet").row_group(0).column(0)
   Out[25]: 
   <pyarrow._parquet.ColumnChunkMetaData object at 0x7f91d8d309a0>
     file_offset: 63767
     file_path: 
     physical_type: INT64
     num_values: 10000
     path_in_schema: col
     is_stats_set: True
     statistics:
       <pyarrow._parquet.Statistics object at 0x7f91d8406de0>
         has_min_max: True
         min: 10
         max: 199969
         null_count: 0
         distinct_count: 0
         num_values: 10000
         physical_type: INT64
         logical_type: None
         converted_type (legacy): NONE
     compression: SNAPPY
     encodings: ('RLE_DICTIONARY', 'PLAIN', 'RLE')
     has_dictionary_page: True
     dictionary_page_offset: 4
     data_page_offset: 46165
     total_compressed_size: 63763
     total_uncompressed_size: 95728
   
   In [26]: pq.read_metadata("data_int32.parquet").row_group(0).column(0)
   Out[26]: 
   <pyarrow._parquet.ColumnChunkMetaData object at 0x7f91d265cc70>
     file_offset: 56672
     file_path: 
     physical_type: INT32
     num_values: 10000
     path_in_schema: col
     is_stats_set: True
     statistics:
       <pyarrow._parquet.Statistics object at 0x7f91d265d710>
         has_min_max: True
         min: 10
         max: 199969
         null_count: 0
         distinct_count: 0
         num_values: 10000
         physical_type: INT32
         logical_type: None
         converted_type (legacy): NONE
     compression: SNAPPY
     encodings: ('RLE_DICTIONARY', 'PLAIN', 'RLE')
     has_dictionary_page: True
     dictionary_page_offset: 4
     data_page_offset: 39086
     total_compressed_size: 56668
     total_uncompressed_size: 56656
   ```
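   
   If you want those values programmatically rather than via the repr, the metadata objects expose them as plain attributes. A minimal sketch, using the files written above:
   
   ```python
   import pyarrow.parquet as pq
   
   # read_metadata only parses the file footer; the data pages are not loaded
   col = pq.read_metadata("data_int64.parquet").row_group(0).column(0)
   
   stats = col.statistics
   print(col.physical_type)          # INT64
   print(stats.min, stats.max)       # 10 199969
   print(col.total_compressed_size)  # on-disk size of this column chunk
   ```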
   
   For this toy example, it's also peculiar that the total compressed size is 
actually a tiny bit larger than the total uncompressed size in the int32 case 
(I used the default snappy codec; with zstd it does compress a bit). The int64 
data also compresses better in relative terms, but the int32 compressed size is 
still a bit smaller in this case.
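   
   To double-check the zstd observation, a quick self-contained sketch (the file name here is just illustrative):
   
   ```python
   import numpy as np
   import pyarrow as pa
   import pyarrow.parquet as pq
   
   arr = np.random.randint(0, 200_000, size=10_000)
   table2 = pa.table({"col": arr.astype("int32")})
   
   # same int32 table as above, but written with zstd instead of the default snappy
   pq.write_table(table2, "data_int32_zstd.parquet", compression="zstd")
   
   col = pq.read_metadata("data_int32_zstd.parquet").row_group(0).column(0)
   # with zstd, the compressed size should come out below the uncompressed size
   print(col.total_compressed_size, col.total_uncompressed_size)
   ```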

