alamb commented on issue #1433:
URL: 
https://github.com/apache/arrow-datafusion/issues/1433#issuecomment-995187994


   This sounds similar to something we hit in IOx 
(https://github.com/influxdata/influxdb_iox/issues/2153) which I ultimately 
tracked down to a bug in the parquet statistics generation: 
https://github.com/apache/arrow-rs/issues/641
   
   So in this case, the statistics embedded in the parquet file for the 
`direction` column are `ST:[min: Merged, max: Outgoing, num_nulls not defined]`, 
that is, the minimum value is `"Merged"` and the maximum value is 
`"Outgoing"`, which I do not think is correct.
   
   ```shell
   $ parquet-tools meta test.parquet 
   file:        file:/Users/alamb/Downloads/test.parquet 
   creator:     UrbanLogiq 
   extra:       ARROW:schema = 
/////+gAAAAQAAAAAAAKAA4ADAALAAQACgAAABQAAAAAAAABBAAKAAwAAAAIAAQACgAAAAgAAAAIAAAAAAAAAAMAAAB8AAAAPAAAAAQAAACg////GAAAACAAAAAAAAACHAAAAAgADAAEAAsACAAAACAAAAAAAAABAAAAAAMAAABhZHQA1P///xQAAAAMAAAAAAAABQwAAAAAAAAAxP///wkAAABkaXJlY3Rpb24AAAAQABQAEAAAAA8ABAAAAAgAEAAAABgAAAAMAAAAAAAABRAAAAAAAAAABAAEAAQAAAAKAAAAdWxfbm9kZV9pZAAA
 
   
   file schema: arrow_schema 
   
--------------------------------------------------------------------------------
   ul_node_id:  REQUIRED BINARY L:STRING R:0 D:0
   direction:   REQUIRED BINARY L:STRING R:0 D:0
   adt:         REQUIRED INT32 R:0 D:0
   
   row group 1: RC:301 TS:3384 OFFSET:4 
   
--------------------------------------------------------------------------------
   ul_node_id:   BINARY ZSTD DO:4 FPO:1796 SZ:2143/3187/1.49 VC:301 ENC:RLE_DICTIONARY,PLAIN,RLE ST:[min: /ehIvdei+UGfkQ4Gy5fr1w==, max: zThqpswvY6fa3VHF4BKWfw==, num_nulls not defined]
   direction:    BINARY ZSTD DO:2243 FPO:2311 SZ:195/177/0.91 VC:301 ENC:RLE_DICTIONARY,PLAIN,RLE ST:[min: Merged, max: Outgoing, num_nulls not defined]
   adt:          INT32 ZSTD DO:2500 FPO:3159 SZ:1046/1503/1.44 VC:301 ENC:RLE_DICTIONARY,PLAIN,RLE ST:[min: 15, max: 23116, num_nulls not defined]
   ```
   
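   These statistics appear to be incorrect for the data in test.parquet, which 
also contains `Incoming` (sorts before `Merged`) and `Two Way` (sorts after 
`Outgoing`):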
   
   ```
   ❯ select distinct direction from t order by direction;
   +-----------+
   | direction |
   +-----------+
   | Incoming  |
   | Merged    |
   | Outgoing  |
   | Two Way   |
   +-----------+
   ```
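
As a rough illustration of why the recorded bounds look wrong, here is a minimal Python sketch that compares the four distinct values above against the recorded min/max. It assumes string statistics are ordered by unsigned byte-wise comparison of the UTF-8 bytes; the helper names (`utf8_key`, `stats_are_valid_bounds`) are made up for this example and are not part of any parquet library.

```python
# Distinct values of `direction` returned by the query above.
values = ["Incoming", "Merged", "Outgoing", "Two Way"]

# Min/max recorded in the row group statistics for `direction`.
recorded_min, recorded_max = "Merged", "Outgoing"


def utf8_key(s: str) -> bytes:
    # Assumption: strings are compared as raw UTF-8 bytes
    # (Python compares bytes objects as unsigned values).
    return s.encode("utf-8")


def stats_are_valid_bounds(vals, lo, hi) -> bool:
    # Valid statistics must bracket every value in the column chunk.
    return all(utf8_key(lo) <= utf8_key(v) <= utf8_key(hi) for v in vals)


expected_min = min(values, key=utf8_key)
expected_max = max(values, key=utf8_key)

print("expected:", expected_min, expected_max)
print("recorded:", recorded_min, recorded_max)
print("recorded stats bracket the data:",
      stats_are_valid_bounds(values, recorded_min, recorded_max))
```

Under that ordering the expected bounds would be `Incoming` and `Two Way`, so a reader that prunes row groups using the recorded statistics could wrongly skip rows with those two values.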

