xhwhis opened a new issue, #18241:
URL: https://github.com/apache/datafusion/issues/18241

   ### Describe the bug
   
   When `datafusion.execution.collect_statistics = true` is enabled, and 
Parquet statistics are only partially collected (i.e., some columns have 
statistics disabled), filters using `IS NULL` or `IS NOT NULL` on the 
non-statistics columns produce incorrect results.
   In this case, queries that should return rows instead return no results, 
even though the data clearly exists in the Parquet file.
   
   ### To Reproduce
   
   use DataFusion CLI v50.2.0
   ```
   > set datafusion.execution.collect_statistics = true;
   0 row(s) fetched.
   Elapsed 0.001 seconds.
   
   > COPY (
     SELECT
       column1 as id,
       column2 as value,
     FROM (VALUES (1, 10), (2, 20), (3, NULL))
   )
   TO 'test/data.parquet'
   OPTIONS (
     'statistics_enabled::value' 'none',
   );
   +-------+
   | count |
   +-------+
   | 3     |
   +-------+
   1 row(s) fetched.
   Elapsed 0.008 seconds.
   
   > CREATE EXTERNAL TABLE t (
     id INT,
     value INT,
   )
   STORED AS PARQUET
   LOCATION 'test';
   0 row(s) fetched.
   Elapsed 0.001 seconds.
   
   > select * from t where value is null;
   +----+-------+
   | id | value |
   +----+-------+
   +----+-------+
   0 row(s) fetched.
   Elapsed 0.007 seconds.
   
   > select * from 'test/data.parquet' where value is not null;
   +----+-------+
   | id | value |
   +----+-------+
   +----+-------+
   0 row(s) fetched.
   Elapsed 0.002 seconds.
   
   > select * from 'test/data.parquet';
   +----+-------+
   | id | value |
   +----+-------+
   | 1  | 10    |
   | 2  | 20    |
   | 3  | NULL  |
   +----+-------+
   3 row(s) fetched.
   Elapsed 0.002 seconds.
   
   > select * from test;
   +----+-------+
   | id | value |
   +----+-------+
   | 1  | 10    |
   | 2  | 20    |
   | 3  | NULL  |
   +----+-------+
   3 row(s) fetched.
   Elapsed 0.002 seconds.
   
   select 
path_in_schema,stats_min,stats_max,stats_null_count,stats_min_value,stats_max_value
 from parquet_metadata('test/data.parquet');
   
+----------------+-----------+-----------+------------------+-----------------+-----------------+
   | path_in_schema | stats_min | stats_max | stats_null_count | 
stats_min_value | stats_max_value |
   
+----------------+-----------+-----------+------------------+-----------------+-----------------+
   | "id"           | 1         | 3         | 0                | 1              
 | 3               |
   | "value"        | NULL      | NULL      | NULL             | NULL           
 | NULL            |
   
+----------------+-----------+-----------+------------------+-----------------+-----------------+
   2 row(s) fetched.
   Elapsed 0.002 seconds.
   ```
   
   ### Expected behavior
   
   Filtering with `IS NULL` / `IS NOT NULL` should still correctly evaluate 
against the file data even if statistics for a column are missing.
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to