majian1998 commented on code in PR #10389: URL: https://github.com/apache/hudi/pull/10389#discussion_r1439117535
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala:
##########

```diff
@@ -272,11 +272,13 @@ class ColumnStatsIndexSupport(spark: SparkSession,
         // NOTE: This could occur in either of the following cases:
         //          1. Particular file does not have this particular column (which is indexed by Column Stats Index):
         //             in this case we're assuming missing column to essentially contain exclusively
-        //             null values, we set min/max values as null and null-count to be equal to value-count (this
+        //             null values, we set min/max and null-count values as null (this
         //             behavior is consistent with reading non-existent columns from Parquet)
+        //          2. When evaluating non-null index conditions, a check has been added for null-count being null;
+        //             a null null-count means we cannot tell whether the column is empty, so we return True.
         //
         //          This is a way to determine current column's index without explicit iteration (we're adding 3 stats / column)
-        acc ++= Seq(null, null, valueCount)
+        acc ++= Seq(null, null, null)
```

Review Comment:
   I believe this ensures schema consistency by populating missing columns with default null values, making sure that reading this file does not result in errors. Are there any other reasons for this behavior?
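To make the behavior under discussion concrete, here is a minimal standalone sketch (the names `ColStats`, `statsRow`, and `mayContainNonNull` are illustrative, not Hudi's actual API) of the two points the diff touches: missing columns are padded with three nulls so every stats row keeps the same width, and a non-null predicate then has to treat a null null-count as "unknown" and keep the file rather than prune it:

```scala
// Hypothetical per-column stats triple (min, max, null-count); `Any` is used
// so that a missing column can be represented by nulls, as in the diff.
case class ColStats(min: Any, max: Any, nullCount: Any)

// Build one stats row: 3 entries per queried column, padding missing
// columns with (null, null, null) to keep the schema consistent.
def statsRow(indexed: Map[String, ColStats], queried: Seq[String]): Seq[Any] =
  queried.foldLeft(Seq.empty[Any]) { (acc, col) =>
    indexed.get(col) match {
      case Some(s) => acc ++ Seq(s.min, s.max, s.nullCount)
      case None    => acc ++ Seq(null, null, null) // missing column: all nulls
    }
  }

// A data-skipping check for "col IS NOT NULL": with a null null-count we
// cannot tell whether the column is entirely null, so we must return true
// (i.e. the file may contain non-null values and cannot be pruned).
def mayContainNonNull(nullCount: Any, valueCount: Long): Boolean =
  nullCount == null || nullCount.asInstanceOf[Long] < valueCount

val row = statsRow(Map("a" -> ColStats(1, 9, 0L)), Seq("a", "b"))
println(row) // stats for "a" followed by three nulls for "b"
println(mayContainNonNull(null, 10)) // true: unknown, keep the file
println(mayContainNonNull(10L, 10))  // false: every value is null, prune
```

Note the trade-off this sketch illustrates: padding with nulls keeps every row the same width (3 stats per column, so a column's offset can be computed without iteration), at the cost of the predicate evaluator having to handle the null null-count case explicitly.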