[I] Improve string statistics display in `datafusion-cli` `parquet_metadata` [arrow-datafusion]

via GitHub Thu, 07 Dec 2023 14:15:06 -0800


alamb opened a new issue, #8464:
URL: https://github.com/apache/arrow-datafusion/issues/8464


   ### Is your feature request related to a problem or challenge?
   
   @Veeupup implemented the great `parquet_metadata` feature in 
https://github.com/apache/arrow-datafusion/pull/8413 ❤️ 
   
   While playing around with it, however, I noticed that string statistics are not easy to interpret, because they are displayed as raw byte values, something like `[72, 101, 108, 108, 111]`.
   
   For example:
   ```shell
   andrewlamb@Andrews-MBP:~/Software/arrow-datafusion$ datafusion-cli -c "select * from parquet_metadata('parquet-testing/data/data_index_bloom_encoding_stats.parquet')";
   DataFusion CLI v33.0.0
   +--------------------------------------------------------------+--------------+--------------------+-----------------------+-----------------+-----------+-------------+------------+----------------+------------+--------------------------+--------------------------+------------------+----------------------+--------------------------+--------------------------+--------------------+--------------------------+-------------------+------------------------+------------------+-----------------------+-------------------------+
   | filename                                                     | row_group_id | row_group_num_rows | row_group_num_columns | row_group_bytes | column_id | file_offset | num_values | path_in_schema | type       | stats_min                | stats_max                | stats_null_count | stats_distinct_count | stats_min_value          | stats_max_value          | compression        | encodings                | index_page_offset | dictionary_page_offset | data_page_offset | total_compressed_size | total_uncompressed_size |
   +--------------------------------------------------------------+--------------+--------------------+-----------------------+-----------------+-----------+-------------+------------+----------------+------------+--------------------------+--------------------------+------------------+----------------------+--------------------------+--------------------------+--------------------+--------------------------+-------------------+------------------------+------------------+-----------------------+-------------------------+
   | parquet-testing/data/data_index_bloom_encoding_stats.parquet | 0            | 14                 | 1                     | 163             | 0         | 4           | 14         | "String"       | BYTE_ARRAY | [72, 101, 108, 108, 111] | [116, 111, 100, 97, 121] | 0                |                      | [72, 101, 108, 108, 111] | [116, 111, 100, 97, 121] | GZIP(GzipLevel(6)) | [BIT_PACKED, RLE, PLAIN] |                   |                        | 4                | 152                   | 163                     |
   +--------------------------------------------------------------+--------------+--------------------+-----------------------+-----------------+-----------+-------------+------------+----------------+------------+--------------------------+--------------------------+------------------+----------------------+--------------------------+--------------------------+--------------------+--------------------------+-------------------+------------------------+------------------+-----------------------+-------------------------+
   1 row in set. Query took 0.024 seconds.
   ```
   
   ### Describe the solution you'd like
   
   It would be nice if the output was formatted as an actual string for string arrays. For example, duckdb shows `[72, 101, 108, 108, 111]` as `Hello`:
   
   ```shell
   andrewlamb@Andrews-MBP:~/Software/arrow-datafusion$ duckdb -c "select * from parquet_metadata('parquet-testing/data/data_index_bloom_encoding_stats.parquet')";
   
┌──────────────────────┬──────────────┬────────────────────┬──────────────────────┬─────────────────┬───────────┬─────────────┬────────────┬────────────────┬───┬─────────────────┬─────────────────┬─────────────┬──────────────────────┬───────────────────┬──────────────────────┬──────────────────┬──────────────────────┬────────────�
 �─────────┐
   │      file_name       │ row_group_id │ row_group_num_rows │ 
row_group_num_colu…  │ row_group_bytes │ column_id │ file_offset │ num_values │ 
path_in_schema │ … │ stats_min_value │ stats_max_value │ compression │      
encodings       │ index_page_offset │ dictionary_page_of…  │ data_page_offset │ 
total_compressed_s…  │ total_uncompressed…  │
   │       varchar        │    int64     │       int64        │        int64    
     │      int64      │   int64   │    int64    │   int64    │    varchar     
│   │     varchar     │     varchar     │   varchar   │       varchar        │  
     int64       │        int64         │      int64       │        int64       
  │        int64         │
   
├──────────────────────┼──────────────┼────────────────────┼──────────────────────┼─────────────────┼───────────┼─────────────┼────────────┼────────────────┼───┼─────────────────┼─────────────────┼─────────────┼──────────────────────┼───────────────────┼──────────────────────┼──────────────────┼──────────────────────┼────────────�
 �─────────┤
   │ parquet-testing/da…  │            0 │                 14 │                 
   1 │             163 │         0 │           4 │         14 │ String         
│ … │ Hello           │ today           │ GZIP        │ BIT_PACKED, RLE, P…  │  
                 │                      │                4 │                  
152 │                  163 │
   
├──────────────────────┴──────────────┴────────────────────┴──────────────────────┴─────────────────┴───────────┴─────────────┴────────────┴────────────────┴───┴─────────────────┴─────────────────┴─────────────┴──────────────────────┴───────────────────┴──────────────────────┴──────────────────┴──────────────────────┴────────────�
 �─────────┤
   │ 1 rows                                                                     
                                                                                
                                                                                
                                                                                
  23 columns (18 shown) │
   
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────�
 �─────────┘
   andrewlamb@Andrews-MBP:~/Software/arrow-datafusion$
   ```
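
The conversion itself is small. Below is a minimal sketch of the idea in Rust, not the actual `datafusion-cli` code: `format_byte_array_stat` and its `is_string_column` flag are hypothetical names invented for illustration, and the sketch assumes the caller already knows from the Parquet schema whether the column holds strings. It only uses the standard library, and falls back to the current byte-list display when the bytes are not valid UTF-8.

```rust
/// Render the raw bytes of a BYTE_ARRAY min/max statistic for display.
///
/// Hypothetical helper (not an existing DataFusion or parquet API).
/// `is_string_column` is an assumed flag that the caller derives from
/// the column's type information in the Parquet schema.
fn format_byte_array_stat(bytes: &[u8], is_string_column: bool) -> String {
    if is_string_column {
        // Show the value as text only when the bytes are valid UTF-8.
        if let Ok(text) = std::str::from_utf8(bytes) {
            return text.to_string();
        }
    }
    // Fallback: keep the existing byte-list style, e.g. "[255, 254]".
    format!("{:?}", bytes)
}
```

With the statistics from the example file, `format_byte_array_stat(&[72, 101, 108, 108, 111], true)` gives `Hello` and `format_byte_array_stat(&[116, 111, 100, 97, 121], true)` gives `today`, matching the duckdb output above. Binary (non-string) columns keep the byte-list form.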
   
   ### Describe alternatives you've considered
   
   _No response_
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
