gengliangwang opened a new pull request #23648: [BRANCH-2.3][SPARK-26709][SQL] 
OptimizeMetadataOnlyQuery does not handle empty records correctly
URL: https://github.com/apache/spark/pull/23648
 
 
   ## What changes were proposed in this pull request?
   
   When reading from empty tables, the optimization `OptimizeMetadataOnlyQuery` 
may return wrong results:
   ```
   sql("CREATE TABLE t (col1 INT, p1 INT) USING PARQUET PARTITIONED BY (p1)")
   sql("INSERT INTO TABLE t PARTITION (p1 = 5) SELECT ID FROM range(1, 1)")
   sql("SELECT MAX(p1) FROM t")
   ```
   The result is supposed to be `null`. However, with the optimization the 
result is `5`.
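   
   To illustrate why the two answers differ, here is a minimal plain-Scala sketch that needs no Spark. It is a toy model, not the optimizer's actual code: the names `Partition`, `maxByScan`, and `maxByMetadata` are hypothetical and exist only for this illustration.
   ```scala
   // Hypothetical model: a partition value plus the rows stored under it.
   final case class Partition(p1: Int, rows: Seq[Int])

   // Expected semantics: MAX(p1) aggregates over rows, so a partition
   // with no rows contributes nothing to the result.
   def maxByScan(parts: Seq[Partition]): Option[Int] =
     parts.flatMap(p => p.rows.map(_ => p.p1)).reduceOption(_ max _)

   // Metadata-only shortcut: looks only at the partition values and
   // never checks whether any rows exist under them.
   def maxByMetadata(parts: Seq[Partition]): Option[Int] =
     parts.map(_.p1).reduceOption(_ max _)

   // Mirrors the repro above: partition p1 = 5 exists, but range(1, 1) is empty.
   val t = Seq(Partition(p1 = 5, rows = Seq.empty))
   ```
   In this model `maxByScan(t)` is `None` (the analogue of `null`), while `maxByMetadata(t)` is `Some(5)`.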
   
   The rule was originally ported from 
https://issues.apache.org/jira/browse/HIVE-1003 in #13494. Hive disabled the 
rule by default in a later release 
(https://issues.apache.org/jira/browse/HIVE-15397) because of the same 
problem.
   
   It is hard to avoid the correctness issue completely, because data sources 
like Parquet can be metadata-only, and Spark can't tell whether the data is 
empty without actually reading it. This PR disables the optimization by default.
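   
   For users who still want the optimization, a hedged sketch of opting back in from spark-shell follows. The description above does not name the configuration key; `spark.sql.optimizer.metadataOnly` is assumed here to be the flag that controls this rule, and `spark` is assumed to be an existing `SparkSession`.
   ```scala
   // Assumption: this key gates OptimizeMetadataOnlyQuery.
   // Enabling it reintroduces the risk of wrong results on empty partitions.
   spark.conf.set("spark.sql.optimizer.metadataOnly", "true")

   // Re-run the query from the repro to compare the results of both settings.
   spark.sql("SELECT MAX(p1) FROM t").show()
   ```
   Enabling the flag only makes sense when every partition is known to contain data.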
   
   ## How was this patch tested?
   Unit test
   
   
