alamb commented on issue #1433: URL: https://github.com/apache/arrow-datafusion/issues/1433#issuecomment-995187994

This sounds similar to something we hit in IOx (https://github.com/influxdata/influxdb_iox/issues/2153), which I ultimately tracked down to a bug in the parquet statistics generation: https://github.com/apache/arrow-rs/issues/641

In this case, the statistics embedded in the parquet file for the `direction` column are `ST:[min: Merged, max: Outgoing, num_nulls not defined]`. That is, the minimum value is `"Merged"` and the maximum value is `"Outgoing"`, which I do not think is correct:

```shell
$ parquet-tools meta test.parquet
file:        file:/Users/alamb/Downloads/test.parquet
creator:     UrbanLogiq
extra:       ARROW:schema = /////+gAAAAQAAAAAAAKAA4ADAALAAQACgAAABQAAAAAAAABBAAKAAwAAAAIAAQACgAAAAgAAAAIAAAAAAAAAAMAAAB8AAAAPAAAAAQAAACg////GAAAACAAAAAAAAACHAAAAAgADAAEAAsACAAAACAAAAAAAAABAAAAAAMAAABhZHQA1P///xQAAAAMAAAAAAAABQwAAAAAAAAAxP///wkAAABkaXJlY3Rpb24AAAAQABQAEAAAAA8ABAAAAAgAEAAAABgAAAAMAAAAAAAABRAAAAAAAAAABAAEAAQAAAAKAAAAdWxfbm9kZV9pZAAA

file schema: arrow_schema
--------------------------------------------------------------------------------
ul_node_id:  REQUIRED BINARY L:STRING R:0 D:0
direction:   REQUIRED BINARY L:STRING R:0 D:0
adt:         REQUIRED INT32 R:0 D:0

row group 1: RC:301 TS:3384 OFFSET:4
--------------------------------------------------------------------------------
ul_node_id:   BINARY ZSTD DO:4 FPO:1796 SZ:2143/3187/1.49 VC:301 ENC:RLE_DICTIONARY,PLAIN,RLE ST:[min: /ehIvdei+UGfkQ4Gy5fr1w==, max: zThqpswvY6fa3VHF4BKWfw==, num_nulls not defined]
direction:    BINARY ZSTD DO:2243 FPO:2311 SZ:195/177/0.91 VC:301 ENC:RLE_DICTIONARY,PLAIN,RLE ST:[min: Merged, max: Outgoing, num_nulls not defined]
adt:          INT32 ZSTD DO:2500 FPO:3159 SZ:1046/1503/1.44 VC:301 ENC:RLE_DICTIONARY,PLAIN,RLE ST:[min: 15, max: 23116, num_nulls not defined]
```

Those statistics appear to be incorrect for the data in test.parquet, which also contains `Incoming` (sorts before `Merged`) and `Two Way` (sorts after `Outgoing`):

```
❯ select distinct direction from t order by direction;
+-----------+
| direction |
+-----------+
| Incoming  |
| Merged    |
| Outgoing  |
| Two Way   |
+-----------+
```
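
To make the mismatch concrete, here is a minimal Python sketch that compares the reported `direction` statistics against the distinct values returned by the query. It assumes string statistics are compared by their UTF-8 bytes and that a reader skips a row group when a predicate value falls outside `[min, max]`. The helper names `values_outside_stats` and `may_contain` are hypothetical and are not part of DataFusion, arrow-rs, or parquet-tools.

```python
def values_outside_stats(values, stat_min, stat_max):
    """Return the values that fall outside [stat_min, stat_max] in UTF-8 byte order."""
    lo, hi = stat_min.encode("utf-8"), stat_max.encode("utf-8")
    return [v for v in values if not (lo <= v.encode("utf-8") <= hi)]


def may_contain(value, stat_min, stat_max):
    """Min/max pruning check (assumed behaviour): False means the row group is skipped."""
    return stat_min.encode("utf-8") <= value.encode("utf-8") <= stat_max.encode("utf-8")


# Distinct values from the query, and the statistics printed by parquet-tools.
distinct_values = ["Incoming", "Merged", "Outgoing", "Two Way"]
reported_min, reported_max = "Merged", "Outgoing"

# Values the reported statistics fail to cover.
outside = values_outside_stats(distinct_values, reported_min, reported_max)

# The bounds the statistics would need to have to cover every value.
true_min = min(distinct_values, key=lambda v: v.encode("utf-8"))
true_max = max(distinct_values, key=lambda v: v.encode("utf-8"))

print("not covered by stats:", outside)
print("expected bounds:", true_min, true_max)
print("row group kept for 'Incoming'?", may_contain("Incoming", reported_min, reported_max))
```

Under those assumptions, a filter such as `direction = 'Incoming'` or `direction = 'Two Way'` would prune the only row group even though matching rows exist, which would be consistent with the kind of wrong result described in the linked issues.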
