Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
danny0405 merged PR #10226:
URL: https://github.com/apache/hudi/pull/10226

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1838828006

## CI report:

* 22f5d8a5c8f2719aa9602958913fef1e2ee969b9 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21289)

Bot commands
@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1838582086

## CI report:

* 937b268dfcad35e0c9a77733a9d920f5e9577e4d Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21288)
* 22f5d8a5c8f2719aa9602958913fef1e2ee969b9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21289)
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1838517438

## CI report:

* 4cc49f2a068603f35b7b4391a0d3d40af3397d43 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21287)
* 937b268dfcad35e0c9a77733a9d920f5e9577e4d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21288)
* 22f5d8a5c8f2719aa9602958913fef1e2ee969b9 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21289)
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1838504902

## CI report:

* 4cc49f2a068603f35b7b4391a0d3d40af3397d43 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21287)
* 937b268dfcad35e0c9a77733a9d920f5e9577e4d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21288)
* 22f5d8a5c8f2719aa9602958913fef1e2ee969b9 UNKNOWN
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1838231534

## CI report:

* 4cc49f2a068603f35b7b4391a0d3d40af3397d43 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21287)
* 937b268dfcad35e0c9a77733a9d920f5e9577e4d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21288)
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1838133733

## CI report:

* ce7c47699387e9c0e629179440ed82c08bafecfa Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21270)
* 4cc49f2a068603f35b7b4391a0d3d40af3397d43 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21287)
* 937b268dfcad35e0c9a77733a9d920f5e9577e4d Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21288)
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1838118862

## CI report:

* ce7c47699387e9c0e629179440ed82c08bafecfa Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21270)
* 4cc49f2a068603f35b7b4391a0d3d40af3397d43 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21287)
* 937b268dfcad35e0c9a77733a9d920f5e9577e4d UNKNOWN
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1837964188

## CI report:

* ce7c47699387e9c0e629179440ed82c08bafecfa Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21270)
* 4cc49f2a068603f35b7b4391a0d3d40af3397d43 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21287)
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
danny0405 commented on code in PR #10226:
URL: https://github.com/apache/hudi/pull/10226#discussion_r1413451686

## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowColumnStatsOverlapProcedure.scala: ##
@@ -0,0 +1,336 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.avro.model.HoodieMetadataColumnStats
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.data.HoodieData
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieRecord}
+import org.apache.hudi.common.table.timeline.{HoodieDefaultTimeline, HoodieInstant}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}
+import org.apache.hudi.common.util.{Option => HOption}
+import org.apache.hudi.metadata.{HoodieTableMetadata, HoodieTableMetadataUtil}
+import org.apache.hudi.{AvroConversionUtils, ColumnStatsIndexSupport}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
+
+import java.util
+import java.util.function.{Function, Supplier}
+import scala.collection.JavaConversions.asScalaBuffer
+import scala.collection.{JavaConversions, mutable}
+import scala.jdk.CollectionConverters.{asScalaBufferConverter, asScalaIteratorConverter, seqAsJavaListConverter}
+
+/**
+ * Calculate the degree of overlap between column stats.
+ *
+ * The overlap represents the extent to which the min-max ranges cover each other.

Review Comment:
The suggested doc format for a new paragraph is:
```java
 * <p>The overlap represents t
```
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
danny0405 commented on code in PR #10226:
URL: https://github.com/apache/hudi/pull/10226#discussion_r1413446855

## hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java: ##
@@ -278,6 +293,19 @@ public static HoodieColumnRangeMetadata convertColumnStatsRecordToCo
         columnStats.getTotalUncompressedSize());
   }

+  public static Option getColumnStatsValueAsString(Object statsValue) {
+    if (statsValue == null) {
+      System.out.println("Invalid value: " + statsValue);

Review Comment:
Log instead of print in production code.
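For illustration only, here is a minimal sketch of what "log instead of print" could look like. It is not the code from the PR: the class name `StatsValueFormatter`, the method `asString`, and the use of `java.util.logging` and `java.util.Optional` are assumptions made to keep the example self-contained. The project itself may use a different logging facade and its own `Option` type.

```java
import java.util.Optional;
import java.util.logging.Logger;

// Hypothetical helper, not part of Hudi: shows a logger call replacing System.out.println.
public class StatsValueFormatter {
    // One static logger per class; java.util.logging is used only to avoid external dependencies.
    private static final Logger LOG = Logger.getLogger(StatsValueFormatter.class.getName());

    // Returns the string form of a stats value, or an empty Optional when the value is null.
    static Optional<String> asString(Object statsValue) {
        if (statsValue == null) {
            // The warning goes through the logging framework, so it can be filtered or routed.
            LOG.warning("Invalid column stats value: null");
            return Optional.empty();
        }
        return Optional.of(String.valueOf(statsValue));
    }

    public static void main(String[] args) {
        System.out.println(asString(42));   // demo output only
        System.out.println(asString(null));
    }
}
```

The practical difference is that a logger message carries a level and a source class, so operators can silence or redirect it, which plain standard output does not allow.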
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1837955494

## CI report:

* ce7c47699387e9c0e629179440ed82c08bafecfa Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21270)
* 4cc49f2a068603f35b7b4391a0d3d40af3397d43 UNKNOWN
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
danny0405 commented on code in PR #10226:
URL: https://github.com/apache/hudi/pull/10226#discussion_r1412716844

## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowColumnStatsOverlapProcedure.scala: ##
@@ -0,0 +1,355 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.avro.generic.IndexedRecord
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.avro.model._
+import org.apache.hudi.client.common.HoodieSparkEngineContext

Review Comment:
Please fix the import order; you can refer to this checkstyle rule:
https://github.com/apache/hudi/blob/cd4f0de57522a681fbe5b62fd774c1943254ec2d/style/checkstyle.xml#L289
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
majian1998 commented on code in PR #10226:
URL: https://github.com/apache/hudi/pull/10226#discussion_r1413424885

## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowColumnStatsOverlapProcedure.scala: ##
@@ -0,0 +1,355 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.avro.generic.IndexedRecord
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.avro.model._
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.data.HoodieData
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieRecord}
+import org.apache.hudi.common.table.timeline.{HoodieDefaultTimeline, HoodieInstant}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}
+import org.apache.hudi.common.util.{Option => HOption}
+import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.metadata.HoodieTableMetadata
+import org.apache.hudi.{AvroConversionUtils, ColumnStatsIndexSupport}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
+
+import java.util
+import java.util.function.{Function, Supplier}
+import scala.collection.JavaConversions.asScalaBuffer
+import scala.collection.{JavaConversions, mutable}
+import scala.jdk.CollectionConverters.{asScalaBufferConverter, asScalaIteratorConverter, seqAsJavaListConverter}
+
+/**
+ * Calculate the degree of overlap between column stats.
+ * The overlap represents the extent to which the min-max ranges cover each other.
+ * By referring to the overlap, we can visually demonstrate the degree of data skipping
+ * for different columns under the current table's data layout.
+ * The calculation is performed at the partition level (assuming that data skipping is based on partition pruning).
+ *
+ * For example, consider three files: a.parquet, b.parquet, and c.parquet.
+ * Taking an integer-type column 'id' as an example, the range (min-max) for 'a' is 1–5,
+ * for 'b' is 3–7, and for 'c' is 7–8. This results in their values overlapping on the coordinate axis as follows:
+ * Value Range: 1 2 3 4 5 6 7 8
+ * a.parquet: [---]
+ * b.parquet: []
+ * c.parquet: [-]
+ * Thus, there will be overlap within the ranges 3–5 and 7.
+ * If the filter conditions for 'id' during data skipping include these values,
+ * multiple files will be filtered out. For a simpler case, if it's an equality query,
+ * 2 files will be filtered within these ranges, and no more than one file will be filtered in other cases (possibly outside of the range).
+ *
+ * Additionally, calculating the degree of overlap based solely on the maximum values
+ * may not provide sufficient information. Therefore, we sample and calculate the overlap degree
+ * for all values involved in the min-max range. We also compute the degree of overlap
+ * at different percentiles and tally the count of these values.An example of a result is as follows:
+ * |Partition path |Field name |Average overlap |Maximum file overlap |Total file number |50% overlap|75% overlap|95% overlap|99% overlap|Total value number |
+ * --
+ * |path |c8 |1.33 |2 |2 |1 |1 |1 |1 |3 |
+
+ */
+class ShowColumnStatsOverlapProcedure extends BaseProcedure with ProcedureBuilder with Logging {
+  private val PARAMETERS = Array[ProcedureParameter](
+    ProcedureParameter.required(0, "table", DataTypes.StringType),
+    ProcedureParameter.optional(1, "partition", DataTypes.StringType),
+    ProcedureParameter.optional(2, "targetColumns", DataTypes.StringType)
+  )
+
+  private val OUTPUT_TYPE = new StructType(Array[StructField](
+    StructField(
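The doc comment quoted in this message describes overlap with three file ranges (1–5, 3–7, 7–8). As a rough sketch of that idea, and not the procedure's actual implementation, the Java snippet below counts how many file ranges cover each sampled value and then takes the maximum and the average. Sampling only the distinct min/max endpoints is an assumption of this sketch, as are the class and method names; the procedure in the PR may sample and aggregate differently.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Hypothetical sketch, not Hudi code: overlap degree of min-max ranges at sampled values.
public class OverlapSketch {
    // Assumption: use the distinct min/max endpoints of all ranges as sample points.
    static int[] samplePoints(int[][] ranges) {
        TreeSet<Integer> points = new TreeSet<>();
        for (int[] r : ranges) {
            points.add(r[0]);
            points.add(r[1]);
        }
        return points.stream().mapToInt(Integer::intValue).toArray();
    }

    // Number of ranges whose inclusive [min, max] interval contains the value.
    static int overlapAt(int[][] ranges, int value) {
        int count = 0;
        for (int[] r : ranges) {
            if (r[0] <= value && value <= r[1]) {
                count++;
            }
        }
        return count;
    }

    // Largest number of files that any sampled value falls into.
    static int maxOverlap(int[][] ranges) {
        return Arrays.stream(samplePoints(ranges)).map(v -> overlapAt(ranges, v)).max().orElse(0);
    }

    // Mean number of files per sampled value.
    static double averageOverlap(int[][] ranges) {
        return Arrays.stream(samplePoints(ranges)).map(v -> overlapAt(ranges, v)).average().orElse(0.0);
    }

    public static void main(String[] args) {
        // a.parquet: 1-5, b.parquet: 3-7, c.parquet: 7-8 (the example from the doc comment)
        int[][] ranges = {{1, 5}, {3, 7}, {7, 8}};
        for (int v : samplePoints(ranges)) {
            System.out.println(v + " -> " + overlapAt(ranges, v));
        }
        System.out.println("max=" + maxOverlap(ranges) + " avg=" + averageOverlap(ranges));
    }
}
```

With these inputs the sampled values 3, 5 and 7 are each covered by two files while 1 and 8 are covered by one, which matches the doc comment's statement that overlap occurs within 3–5 and at 7. An equality filter on one of those values would have to read two files instead of one.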
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
danny0405 commented on code in PR #10226:
URL: https://github.com/apache/hudi/pull/10226#discussion_r1412716767

## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowColumnStatsOverlapProcedure.scala: ##
@@ -0,0 +1,355 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.avro.generic.IndexedRecord
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.avro.model._
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.data.HoodieData
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieRecord}
+import org.apache.hudi.common.table.timeline.{HoodieDefaultTimeline, HoodieInstant}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}
+import org.apache.hudi.common.util.{Option => HOption}
+import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.metadata.HoodieTableMetadata
+import org.apache.hudi.{AvroConversionUtils, ColumnStatsIndexSupport}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
+
+import java.util
+import java.util.function.{Function, Supplier}
+import scala.collection.JavaConversions.asScalaBuffer
+import scala.collection.{JavaConversions, mutable}
+import scala.jdk.CollectionConverters.{asScalaBufferConverter, asScalaIteratorConverter, seqAsJavaListConverter}
+
+/**
+ * Calculate the degree of overlap between column stats.
+ * The overlap represents the extent to which the min-max ranges cover each other.
+ * By referring to the overlap, we can visually demonstrate the degree of data skipping
+ * for different columns under the current table's data layout.
+ * The calculation is performed at the partition level (assuming that data skipping is based on partition pruning).
+ *
+ * For example, consider three files: a.parquet, b.parquet, and c.parquet.
+ * Taking an integer-type column 'id' as an example, the range (min-max) for 'a' is 1–5,
+ * for 'b' is 3–7, and for 'c' is 7–8. This results in their values overlapping on the coordinate axis as follows:
+ * Value Range: 1 2 3 4 5 6 7 8
+ * a.parquet: [---]
+ * b.parquet: []
+ * c.parquet: [-]
+ * Thus, there will be overlap within the ranges 3–5 and 7.
+ * If the filter conditions for 'id' during data skipping include these values,
+ * multiple files will be filtered out. For a simpler case, if it's an equality query,
+ * 2 files will be filtered within these ranges, and no more than one file will be filtered in other cases (possibly outside of the range).
+ *
+ * Additionally, calculating the degree of overlap based solely on the maximum values

Review Comment:
Each new paragraph should start with `<p>`.
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
danny0405 commented on code in PR #10226:
URL: https://github.com/apache/hudi/pull/10226#discussion_r1412716697

## hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowColumnStatsOverlapProcedure.scala: ##
@@ -0,0 +1,355 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.avro.generic.IndexedRecord
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.avro.model._
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.data.HoodieData
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieRecord}
+import org.apache.hudi.common.table.timeline.{HoodieDefaultTimeline, HoodieInstant}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.common.table.{HoodieTableMetaClient, TableSchemaResolver}
+import org.apache.hudi.common.util.{Option => HOption}
+import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.metadata.HoodieTableMetadata
+import org.apache.hudi.{AvroConversionUtils, ColumnStatsIndexSupport}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, StructType}
+
+import java.util
+import java.util.function.{Function, Supplier}
+import scala.collection.JavaConversions.asScalaBuffer
+import scala.collection.{JavaConversions, mutable}
+import scala.jdk.CollectionConverters.{asScalaBufferConverter, asScalaIteratorConverter, seqAsJavaListConverter}
+
+/**
+ * Calculate the degree of overlap between column stats.
+ * The overlap represents the extent to which the min-max ranges cover each other.
+ * By referring to the overlap, we can visually demonstrate the degree of data skipping
+ * for different columns under the current table's data layout.
+ * The calculation is performed at the partition level (assuming that data skipping is based on partition pruning).
+ *
+ * For example, consider three files: a.parquet, b.parquet, and c.parquet.
+ * Taking an integer-type column 'id' as an example, the range (min-max) for 'a' is 1–5,
+ * for 'b' is 3–7, and for 'c' is 7–8. This results in their values overlapping on the coordinate axis as follows:
+ * Value Range: 1 2 3 4 5 6 7 8
+ * a.parquet: [---]
+ * b.parquet: []
+ * c.parquet: [-]
+ * Thus, there will be overlap within the ranges 3–5 and 7.
+ * If the filter conditions for 'id' during data skipping include these values,
+ * multiple files will be filtered out. For a simpler case, if it's an equality query,
+ * 2 files will be filtered within these ranges, and no more than one file will be filtered in other cases (possibly outside of the range).
+ *
+ * Additionally, calculating the degree of overlap based solely on the maximum values
+ * may not provide sufficient information. Therefore, we sample and calculate the overlap degree
+ * for all values involved in the min-max range. We also compute the degree of overlap
+ * at different percentiles and tally the count of these values.An example of a result is as follows:
+ * |Partition path |Field name |Average overlap |Maximum file overlap |Total file number |50% overlap|75% overlap|95% overlap|99% overlap|Total value number |
+ * --
+ * |path |c8 |1.33 |2 |2 |1 |1 |1 |1 |3 |
+
+ */
+class ShowColumnStatsOverlapProcedure extends BaseProcedure with ProcedureBuilder with Logging {
+  private val PARAMETERS = Array[ProcedureParameter](
+    ProcedureParameter.required(0, "table", DataTypes.StringType),
+    ProcedureParameter.optional(1, "partition", DataTypes.StringType),
+    ProcedureParameter.optional(2, "targetColumns", DataTypes.StringType)
+  )
+
+  private val OUTPUT_TYPE = new StructType(Array[StructField](
+    StructField("
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1836063557

## CI report:

* ce7c47699387e9c0e629179440ed82c08bafecfa Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21270)
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1835990188

## CI report:

* 27aca8584fb17e3fe9da6ef4be101941686ecf41 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21268)
* ce7c47699387e9c0e629179440ed82c08bafecfa Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21270)
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
hudi-bot commented on PR #10226: URL: https://github.com/apache/hudi/pull/10226#issuecomment-1835821221 ## CI report: * 2b2d5d9fe7468965a21ade6c292a942abf087ad3 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21267) * 27aca8584fb17e3fe9da6ef4be101941686ecf41 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21268) * ce7c47699387e9c0e629179440ed82c08bafecfa Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21270)
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
hudi-bot commented on PR #10226: URL: https://github.com/apache/hudi/pull/10226#issuecomment-1835809646 ## CI report: * 2b2d5d9fe7468965a21ade6c292a942abf087ad3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21267) * 27aca8584fb17e3fe9da6ef4be101941686ecf41 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21268) * ce7c47699387e9c0e629179440ed82c08bafecfa Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21270)
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
hudi-bot commented on PR #10226: URL: https://github.com/apache/hudi/pull/10226#issuecomment-1835796381 ## CI report: * 2b2d5d9fe7468965a21ade6c292a942abf087ad3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21267) * 27aca8584fb17e3fe9da6ef4be101941686ecf41 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21268) * ce7c47699387e9c0e629179440ed82c08bafecfa UNKNOWN
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
hudi-bot commented on PR #10226: URL: https://github.com/apache/hudi/pull/10226#issuecomment-1835649479 ## CI report: * 2b2d5d9fe7468965a21ade6c292a942abf087ad3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21267) * 27aca8584fb17e3fe9da6ef4be101941686ecf41 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21268)
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
hudi-bot commented on PR #10226: URL: https://github.com/apache/hudi/pull/10226#issuecomment-1835640643 ## CI report: * 2b2d5d9fe7468965a21ade6c292a942abf087ad3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21267) * 27aca8584fb17e3fe9da6ef4be101941686ecf41 UNKNOWN
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
hudi-bot commented on PR #10226: URL: https://github.com/apache/hudi/pull/10226#issuecomment-1835571995 ## CI report: * 2b2d5d9fe7468965a21ade6c292a942abf087ad3 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21267)
Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
hudi-bot commented on PR #10226: URL: https://github.com/apache/hudi/pull/10226#issuecomment-1835564976 ## CI report: * 2b2d5d9fe7468965a21ade6c292a942abf087ad3 UNKNOWN
[PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]
majian1998 opened a new pull request, #10226: URL: https://github.com/apache/hudi/pull/10226

### Change Logs

In HUDI-7110, a tool was made available to display column stats. However, its output is hard to inspect by hand when the data volume is large. For instance, with tens of thousands of parquet files, the column stats can run to hundreds of thousands of rows. At that size the data is effectively unreadable to humans and needs further processing by code. If an administrator simply wants to observe the data layout directly from column stats, a more intuitive display tool is required.

Here, we offer a tool that calculates the overlap degree of column stats by partition and column name. Overlap degree refers to the extent to which the min-max ranges of different files intersect with each other. This directly affects the effectiveness of data skipping. A similar concept is also provided by Snowflake to aid their clustering process: https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions

Our implementation here is not overly complex. It yields output similar to the following:

|Partition path |Field name |Average overlap |Maximum file overlap |Total file number |50% overlap |75% overlap |95% overlap |99% overlap |Total value number |
|---|---|---|---|---|---|---|---|---|---|
|path |c8 |1.33 |2 |2 |1 |1 |1 |1 |3 |

This output gives a straightforward summary of the relevant statistics.

For example, consider three files: a.parquet, b.parquet, and c.parquet. Taking an integer-type column 'id' as an example, the range (min-max) for 'a' is 1–5, for 'b' is 3–7, and for 'c' is 7–8. Thus, there will be overlap within the range 3–5 and at the value 7. If a filter condition on 'id' includes these values, data skipping cannot narrow the scan to a single file, because multiple files match the filter. In the simplest case, an equality query, 2 files match for a value within these ranges, and at most one file matches for any other value (none if the value lies outside every range).

TODO: In the future, we hope this foundation can be expanded to use overlap as a guide for clustering data layout.

### Impact

None

### Risk level (write none, low medium or high below)

None

### Documentation Update

This procedure calculates and displays the overlap degree of column statistics across the files of a table, which is a key factor in evaluating how well data skipping will perform.

Parameters and Output Schema

The procedure accepts the following parameters:

- table (StringType, required): The name of the table for which column statistics overlap will be calculated.
- partition (StringType, optional): A specific partition or comma-separated list of partitions to limit the scope of the calculation.
- targetColumns (StringType, optional): A specific column or comma-separated list of columns for which to calculate the statistics.

The output of the procedure is a structured type (StructType) comprising the following fields, which describe the column statistics overlap for each field within the specified partitions or table:

- Partition path
- Field name
- Average overlap
- Maximum file overlap
- Total file number
- 50% overlap
- 75% overlap
- 95% overlap
- 99% overlap
- Total value number

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
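
The overlap degree described in the change log can be illustrated with a small standalone sketch. It is not the code from this PR: `FileRange`, `OverlapStats`, and `OverlapSketch` are hypothetical names, the sample points are assumed to be the distinct min/max boundaries of the files, and the percentile rule (element at index floor(p * (n - 1)) of the sorted overlap counts) is likewise an assumption. The `main` method runs it on the a/b/c example.

```scala
// Illustrative sketch only; all names here are hypothetical, not Hudi APIs.
final case class FileRange(file: String, min: Int, max: Int)

final case class OverlapStats(
  avgOverlap: Double,         // mean number of files covering a sampled value
  maxOverlap: Int,            // largest number of files covering one sampled value
  totalFiles: Int,            // number of files considered
  percentiles: Map[Int, Int], // percentile -> overlap count at that percentile
  totalValues: Int)           // number of sampled values

object OverlapSketch {
  // Number of files whose [min, max] range contains the given value.
  def overlapAt(value: Int, ranges: Seq[FileRange]): Int =
    ranges.count(r => r.min <= value && value <= r.max)

  def compute(ranges: Seq[FileRange]): OverlapStats = {
    require(ranges.nonEmpty, "at least one file range is required")
    // Assumption: sample every distinct min/max boundary across the files.
    val points = ranges.flatMap(r => Seq(r.min, r.max)).distinct.sorted
    val overlaps = points.map(p => overlapAt(p, ranges)).sorted
    // Assumption: percentile = element at index floor(p * (n - 1)).
    def percentile(p: Int): Int =
      overlaps((p / 100.0 * (overlaps.size - 1)).toInt)
    OverlapStats(
      avgOverlap = overlaps.sum.toDouble / overlaps.size,
      maxOverlap = overlaps.max,
      totalFiles = ranges.size,
      percentiles = Seq(50, 75, 95, 99).map(p => p -> percentile(p)).toMap,
      totalValues = points.size)
  }

  def main(args: Array[String]): Unit = {
    // The three files from the example above.
    val ranges = Seq(
      FileRange("a.parquet", 1, 5),
      FileRange("b.parquet", 3, 7),
      FileRange("c.parquet", 7, 8))
    println(compute(ranges))
  }
}
```

Under these assumptions the sampled values are 1, 3, 5, 7, and 8, covered by 1, 2, 2, 2, and 1 files respectively, so the average overlap is 1.6 and the maximum is 2. A lower average means that an equality filter on the column can, on average, be narrowed to fewer files.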