Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-04 Thread via GitHub


danny0405 merged PR #10226:
URL: https://github.com/apache/hudi/pull/10226


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1838828006

   
   ## CI report:
   
   * 22f5d8a5c8f2719aa9602958913fef1e2ee969b9 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21289)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1838582086

   
   ## CI report:
   
   * 937b268dfcad35e0c9a77733a9d920f5e9577e4d Azure: 
[FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21288)
 
   * 22f5d8a5c8f2719aa9602958913fef1e2ee969b9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21289)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1838517438

   
   ## CI report:
   
   * 4cc49f2a068603f35b7b4391a0d3d40af3397d43 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21287)
 
   * 937b268dfcad35e0c9a77733a9d920f5e9577e4d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21288)
 
   * 22f5d8a5c8f2719aa9602958913fef1e2ee969b9 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21289)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1838504902

   
   ## CI report:
   
   * 4cc49f2a068603f35b7b4391a0d3d40af3397d43 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21287)
 
   * 937b268dfcad35e0c9a77733a9d920f5e9577e4d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21288)
 
   * 22f5d8a5c8f2719aa9602958913fef1e2ee969b9 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1838231534

   
   ## CI report:
   
   * 4cc49f2a068603f35b7b4391a0d3d40af3397d43 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21287)
 
   * 937b268dfcad35e0c9a77733a9d920f5e9577e4d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21288)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1838133733

   
   ## CI report:
   
   * ce7c47699387e9c0e629179440ed82c08bafecfa Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21270)
 
   * 4cc49f2a068603f35b7b4391a0d3d40af3397d43 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21287)
 
   * 937b268dfcad35e0c9a77733a9d920f5e9577e4d Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21288)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-04 Thread via GitHub


hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1838118862

   
   ## CI report:
   
   * ce7c47699387e9c0e629179440ed82c08bafecfa Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21270)
 
   * 4cc49f2a068603f35b7b4391a0d3d40af3397d43 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21287)
 
   * 937b268dfcad35e0c9a77733a9d920f5e9577e4d UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-03 Thread via GitHub


hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1837964188

   
   ## CI report:
   
   * ce7c47699387e9c0e629179440ed82c08bafecfa Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21270)
 
   * 4cc49f2a068603f35b7b4391a0d3d40af3397d43 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21287)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-03 Thread via GitHub


danny0405 commented on code in PR #10226:
URL: https://github.com/apache/hudi/pull/10226#discussion_r1413451686


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowColumnStatsOverlapProcedure.scala:
##
@@ -0,0 +1,336 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.avro.model.HoodieMetadataColumnStats
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.data.HoodieData
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieRecord}
+import org.apache.hudi.common.table.timeline.{HoodieDefaultTimeline, 
HoodieInstant}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.common.util.{Option => HOption}
+import org.apache.hudi.metadata.{HoodieTableMetadata, HoodieTableMetadataUtil}
+import org.apache.hudi.{AvroConversionUtils, ColumnStatsIndexSupport}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, 
StructType}
+
+import java.util
+import java.util.function.{Function, Supplier}
+import scala.collection.JavaConversions.asScalaBuffer
+import scala.collection.{JavaConversions, mutable}
+import scala.jdk.CollectionConverters.{asScalaBufferConverter, 
asScalaIteratorConverter, seqAsJavaListConverter}
+
+/**
+ * Calculate the degree of overlap between column stats.
+ * 
+ * The overlap represents the extent to which the min-max ranges cover each 
other.

Review Comment:
   The suggested doc format for new paragraph is:
   
   ```java
   
The overlap represents t 
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-03 Thread via GitHub


danny0405 commented on code in PR #10226:
URL: https://github.com/apache/hudi/pull/10226#discussion_r1413446855


##
hudi-common/src/main/java/org/apache/hudi/metadata/HoodieTableMetadataUtil.java:
##
@@ -278,6 +293,19 @@ public static HoodieColumnRangeMetadata 
convertColumnStatsRecordToCo
 columnStats.getTotalUncompressedSize());
   }
 
+  public static Option getColumnStatsValueAsString(Object statsValue) {
+if (statsValue == null) {
+  System.out.println("Invalid value: " + statsValue);

Review Comment:
   Log instead of print in production code.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-03 Thread via GitHub


hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1837955494

   
   ## CI report:
   
   * ce7c47699387e9c0e629179440ed82c08bafecfa Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21270)
 
   * 4cc49f2a068603f35b7b4391a0d3d40af3397d43 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-03 Thread via GitHub


danny0405 commented on code in PR #10226:
URL: https://github.com/apache/hudi/pull/10226#discussion_r1412716844


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowColumnStatsOverlapProcedure.scala:
##
@@ -0,0 +1,355 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.avro.generic.IndexedRecord
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.avro.model._
+import org.apache.hudi.client.common.HoodieSparkEngineContext

Review Comment:
   should fix the import sequence, you can reference this checkstyle: 
https://github.com/apache/hudi/blob/cd4f0de57522a681fbe5b62fd774c1943254ec2d/style/checkstyle.xml#L289



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-03 Thread via GitHub


majian1998 commented on code in PR #10226:
URL: https://github.com/apache/hudi/pull/10226#discussion_r1413424885


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowColumnStatsOverlapProcedure.scala:
##
@@ -0,0 +1,355 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.avro.generic.IndexedRecord
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.avro.model._
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.data.HoodieData
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieRecord}
+import org.apache.hudi.common.table.timeline.{HoodieDefaultTimeline, 
HoodieInstant}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.common.util.{Option => HOption}
+import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.metadata.HoodieTableMetadata
+import org.apache.hudi.{AvroConversionUtils, ColumnStatsIndexSupport}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, 
StructType}
+
+import java.util
+import java.util.function.{Function, Supplier}
+import scala.collection.JavaConversions.asScalaBuffer
+import scala.collection.{JavaConversions, mutable}
+import scala.jdk.CollectionConverters.{asScalaBufferConverter, 
asScalaIteratorConverter, seqAsJavaListConverter}
+
+/**
+ * Calculate the degree of overlap between column stats.
+ * The overlap represents the extent to which the min-max ranges cover each 
other.
+ * By referring to the overlap, we can visually demonstrate the degree of data 
skipping
+ * for different columns under the current table's data layout.
+ * The calculation is performed at the partition level (assuming that data 
skipping is based on partition pruning).
+ *
+ * For example, consider three files: a.parquet, b.parquet, and c.parquet.
+ * Taking an integer-type column 'id' as an example, the range (min-max) for 
'a' is 1–5,
+ * for 'b' is 3–7, and for 'c' is 7–8. This results in their values 
overlapping on the coordinate axis as follows:
+ * Value Range: 1 2 3 4 5 6 7 8
+ * a.parquet:   [---]
+ * b.parquet:  []
+ * c.parquet:   [-]
+ * Thus, there will be overlap within the ranges 3–5 and 7.
+ * If the filter conditions for 'id' during data skipping include these values,
+ * multiple files will be filtered out. For a simpler case, if it's an 
equality query,
+ * 2 files will be filtered within these ranges, and no more than one file 
will be filtered in other cases (possibly outside of the range).
+ *
+ * Additionally, calculating the degree of overlap based solely on the maximum 
values
+ * may not provide sufficient information. Therefore, we sample and calculate 
the overlap degree
+ * for all values involved in the min-max range. We also compute the degree of 
overlap
+ * at different percentiles and tally the count of these values.An example of 
a result is as follows:
+ * |Partition path |Field name |Average overlap  |Maximum file overlap |Total 
file number |50% overlap|75% overlap|95% overlap|99% 
overlap|Total value number |
+ * --
+ * |path   |c8 |1.33 |2   |2   
 |1 |1 |1 |1
 |3  |
+
+ */
+class ShowColumnStatsOverlapProcedure extends BaseProcedure with 
ProcedureBuilder with Logging {
+  private val PARAMETERS = Array[ProcedureParameter](
+ProcedureParameter.required(0, "table", DataTypes.StringType),
+ProcedureParameter.optional(1, "partition", DataTypes.StringType),
+ProcedureParameter.optional(2, "targetColumns", DataTypes.StringType)
+  )
+
+  private val OUTPUT_TYPE = new StructType(Array[StructField](
+StructField(

Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-01 Thread via GitHub


danny0405 commented on code in PR #10226:
URL: https://github.com/apache/hudi/pull/10226#discussion_r1412716844


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowColumnStatsOverlapProcedure.scala:
##
@@ -0,0 +1,355 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.avro.generic.IndexedRecord
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.avro.model._
+import org.apache.hudi.client.common.HoodieSparkEngineContext

Review Comment:
   should fix the important sequence, you can reference this checkstyle: 
https://github.com/apache/hudi/blob/cd4f0de57522a681fbe5b62fd774c1943254ec2d/style/checkstyle.xml#L289



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-01 Thread via GitHub


danny0405 commented on code in PR #10226:
URL: https://github.com/apache/hudi/pull/10226#discussion_r1412716767


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowColumnStatsOverlapProcedure.scala:
##
@@ -0,0 +1,355 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.avro.generic.IndexedRecord
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.avro.model._
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.data.HoodieData
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieRecord}
+import org.apache.hudi.common.table.timeline.{HoodieDefaultTimeline, 
HoodieInstant}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.common.util.{Option => HOption}
+import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.metadata.HoodieTableMetadata
+import org.apache.hudi.{AvroConversionUtils, ColumnStatsIndexSupport}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, 
StructType}
+
+import java.util
+import java.util.function.{Function, Supplier}
+import scala.collection.JavaConversions.asScalaBuffer
+import scala.collection.{JavaConversions, mutable}
+import scala.jdk.CollectionConverters.{asScalaBufferConverter, 
asScalaIteratorConverter, seqAsJavaListConverter}
+
+/**
+ * Calculate the degree of overlap between column stats.
+ * The overlap represents the extent to which the min-max ranges cover each 
other.
+ * By referring to the overlap, we can visually demonstrate the degree of data 
skipping
+ * for different columns under the current table's data layout.
+ * The calculation is performed at the partition level (assuming that data 
skipping is based on partition pruning).
+ *
+ * For example, consider three files: a.parquet, b.parquet, and c.parquet.
+ * Taking an integer-type column 'id' as an example, the range (min-max) for 
'a' is 1–5,
+ * for 'b' is 3–7, and for 'c' is 7–8. This results in their values 
overlapping on the coordinate axis as follows:
+ * Value Range: 1 2 3 4 5 6 7 8
+ * a.parquet:   [---]
+ * b.parquet:  []
+ * c.parquet:   [-]
+ * Thus, there will be overlap within the ranges 3–5 and 7.
+ * If the filter conditions for 'id' during data skipping include these values,
+ * multiple files will be filtered out. For a simpler case, if it's an 
equality query,
+ * 2 files will be filtered within these ranges, and no more than one file 
will be filtered in other cases (possibly outside of the range).
+ *
+ * Additionally, calculating the degree of overlap based solely on the maximum 
values

Review Comment:
   Each new paragraph should start with ``.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-01 Thread via GitHub


danny0405 commented on code in PR #10226:
URL: https://github.com/apache/hudi/pull/10226#discussion_r1412716697


##
hudi-spark-datasource/hudi-spark/src/main/scala/org/apache/spark/sql/hudi/command/procedures/ShowColumnStatsOverlapProcedure.scala:
##
@@ -0,0 +1,355 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.hudi.command.procedures
+
+import org.apache.avro.generic.IndexedRecord
+import org.apache.hadoop.fs.{FileStatus, Path}
+import org.apache.hudi.avro.model._
+import org.apache.hudi.client.common.HoodieSparkEngineContext
+import org.apache.hudi.common.config.HoodieMetadataConfig
+import org.apache.hudi.common.data.HoodieData
+import org.apache.hudi.common.fs.FSUtils
+import org.apache.hudi.common.model.{FileSlice, HoodieRecord}
+import org.apache.hudi.common.table.timeline.{HoodieDefaultTimeline, 
HoodieInstant}
+import org.apache.hudi.common.table.view.HoodieTableFileSystemView
+import org.apache.hudi.common.table.{HoodieTableMetaClient, 
TableSchemaResolver}
+import org.apache.hudi.common.util.{Option => HOption}
+import org.apache.hudi.exception.HoodieException
+import org.apache.hudi.metadata.HoodieTableMetadata
+import org.apache.hudi.{AvroConversionUtils, ColumnStatsIndexSupport}
+import org.apache.spark.internal.Logging
+import org.apache.spark.sql.Row
+import org.apache.spark.sql.types.{DataTypes, Metadata, StructField, 
StructType}
+
+import java.util
+import java.util.function.{Function, Supplier}
+import scala.collection.JavaConversions.asScalaBuffer
+import scala.collection.{JavaConversions, mutable}
+import scala.jdk.CollectionConverters.{asScalaBufferConverter, 
asScalaIteratorConverter, seqAsJavaListConverter}
+
+/**
+ * Calculate the degree of overlap between column stats.
+ * The overlap represents the extent to which the min-max ranges cover each 
other.
+ * By referring to the overlap, we can visually demonstrate the degree of data 
skipping
+ * for different columns under the current table's data layout.
+ * The calculation is performed at the partition level (assuming that data 
skipping is based on partition pruning).
+ *
+ * For example, consider three files: a.parquet, b.parquet, and c.parquet.
+ * Taking an integer-type column 'id' as an example, the range (min-max) for 
'a' is 1–5,
+ * for 'b' is 3–7, and for 'c' is 7–8. This results in their values 
overlapping on the coordinate axis as follows:
+ * Value Range: 1 2 3 4 5 6 7 8
+ * a.parquet:   [---]
+ * b.parquet:  []
+ * c.parquet:   [-]
+ * Thus, there will be overlap within the ranges 3–5 and 7.
+ * If the filter conditions for 'id' during data skipping include these values,
+ * multiple files will be filtered out. For a simpler case, if it's an 
equality query,
+ * 2 files will be filtered within these ranges, and no more than one file 
will be filtered in other cases (possibly outside of the range).
+ *
+ * Additionally, calculating the degree of overlap based solely on the maximum 
values
+ * may not provide sufficient information. Therefore, we sample and calculate 
the overlap degree
+ * for all values involved in the min-max range. We also compute the degree of 
overlap
+ * at different percentiles and tally the count of these values.An example of 
a result is as follows:
+ * |Partition path |Field name |Average overlap  |Maximum file overlap |Total 
file number |50% overlap|75% overlap|95% overlap|99% 
overlap|Total value number |
+ * --
+ * |path   |c8 |1.33 |2   |2   
 |1 |1 |1 |1
 |3  |
+
+ */
+class ShowColumnStatsOverlapProcedure extends BaseProcedure with 
ProcedureBuilder with Logging {
+  private val PARAMETERS = Array[ProcedureParameter](
+ProcedureParameter.required(0, "table", DataTypes.StringType),
+ProcedureParameter.optional(1, "partition", DataTypes.StringType),
+ProcedureParameter.optional(2, "targetColumns", DataTypes.StringType)
+  )
+
+  private val OUTPUT_TYPE = new StructType(Array[StructField](
+StructField("

Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-01 Thread via GitHub


hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1836063557

   
   ## CI report:
   
   * ce7c47699387e9c0e629179440ed82c08bafecfa Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21270)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-01 Thread via GitHub


hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1835990188

   
   ## CI report:
   
   * 27aca8584fb17e3fe9da6ef4be101941686ecf41 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21268)
 
   * ce7c47699387e9c0e629179440ed82c08bafecfa Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21270)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-01 Thread via GitHub


hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1835821221

   
   ## CI report:
   
   * 2b2d5d9fe7468965a21ade6c292a942abf087ad3 Azure: 
[SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21267)
 
   * 27aca8584fb17e3fe9da6ef4be101941686ecf41 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21268)
 
   * ce7c47699387e9c0e629179440ed82c08bafecfa Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21270)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-01 Thread via GitHub


hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1835809646

   
   ## CI report:
   
   * 2b2d5d9fe7468965a21ade6c292a942abf087ad3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21267)
 
   * 27aca8584fb17e3fe9da6ef4be101941686ecf41 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21268)
 
   * ce7c47699387e9c0e629179440ed82c08bafecfa Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21270)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-01 Thread via GitHub


hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1835796381

   
   ## CI report:
   
   * 2b2d5d9fe7468965a21ade6c292a942abf087ad3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21267)
 
   * 27aca8584fb17e3fe9da6ef4be101941686ecf41 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21268)
 
   * ce7c47699387e9c0e629179440ed82c08bafecfa UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-01 Thread via GitHub


hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1835649479

   
   ## CI report:
   
   * 2b2d5d9fe7468965a21ade6c292a942abf087ad3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21267)
 
   * 27aca8584fb17e3fe9da6ef4be101941686ecf41 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21268)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-12-01 Thread via GitHub


hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1835640643

   
   ## CI report:
   
   * 2b2d5d9fe7468965a21ade6c292a942abf087ad3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21267)
 
   * 27aca8584fb17e3fe9da6ef4be101941686ecf41 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-11-30 Thread via GitHub


hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1835571995

   
   ## CI report:
   
   * 2b2d5d9fe7468965a21ade6c292a942abf087ad3 Azure: 
[PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21267)
 
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-11-30 Thread via GitHub


hudi-bot commented on PR #10226:
URL: https://github.com/apache/hudi/pull/10226#issuecomment-1835564976

   
   ## CI report:
   
   * 2b2d5d9fe7468965a21ade6c292a942abf087ad3 UNKNOWN
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[PR] [HUDI-7166] Provide a Procedure to Calculate Column Stats Overlap Degree [hudi]

2023-11-30 Thread via GitHub


majian1998 opened a new pull request, #10226:
URL: https://github.com/apache/hudi/pull/10226

   ### Change Logs
   
   In HUDI-7110 , a tool has been made available to display column stats. 
However, this tool is not very user-friendly for manual observation when 
dealing with large data volumes. For instance, with tens of thousands of 
parquet files, the number of rows in column stats could be of the order of 
hundreds of thousands. This renders the data virtually unreadable to humans, 
necessitating further processing by code. Yet, if an administrator simply 
wishes to directly observe the data layout based on column stats under such 
circumstances, a more intuitive display tool is required. Here, we offer a tool 
that calculates the overlap degree of column stats based on partition and 
column name.
   
   Overlap degree refers to the extent to which the min-max ranges of different 
files intersect with each other. This directly affects the effectiveness of 
data skipping.
   
   In fact, a similar concept is also provided by Snowflake to aid their 
clustering process. 
https://docs.snowflake.com/en/user-guide/tables-clustering-micropartitions Our 
implementation here is not overly complex.
   
   It yields output similar to the following:
   
   |Partition path |Field name |Average overlap  |Maximum file overlap |Total 
file number |50% overlap|75% overlap|95% overlap|99% 
overlap|Total value number |
   |path   |c8 |1.33 |2   |2
|1 |1 |1 |1 
|3  |
   
   
   
   This content provides a straightforward representation of the relevant 
statistics.
   
   For example, consider three files: a.parquet, b.parquet, and c.parquet. 
Taking an integer-type column 'id' as an example, the range (min-max) for 'a' 
is 1–5, for 'b' is 3–7, and for 'c' is 7–8. Thus, there will be overlap within 
the ranges 3–5 and 7.
   
   If the filter conditions for 'id' during data skipping include these values, 
multiple files will be filtered out. For a simpler case, if it's an equality 
query, 2 files will be filtered within these ranges, and no more than one file 
will be filtered in other cases (possibly outside of the range).
   
   
   TODO: In the future, we hope that this foundation can inspire and be 
expanded upon to use overlap as a guide for clustering data layout.
   
   
   ### Impact
   
   None
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   This procedure is designed to calculate and display the overlap degree of 
column statistics for different files within a table, which is a key factor in 
evaluating the performance of data skipping strategies.
   
   Parameters and Output Schema
   The procedure accepts the following parameters:
   
   table (StringType, required): The name of the table for which column 
statistics overlap will be calculated.
   partition (StringType, optional): A specific partition or comma-separated 
list of partitions to limit the scope of the calculation.
   targetColumns (StringType, optional): A specific column or comma-separated 
list of columns for which to calculate the statistics.
   The output of the procedure is a structured type (StructType) comprising the 
following fields, which describe various aspects of column statistics overlap 
for each field within the specified partitions or table:
   
   Partition path
   Field name
   Average overlap
   Maximum file overlap
   Total file number
   50% overlap
   75% overlap
   95% overlap
   99% overlap
   Total value number
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org