Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-16 Thread via GitHub


stream2000 merged PR #10493:
URL: https://github.com/apache/hudi/pull/10493


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-16 Thread via GitHub


hudi-bot commented on PR #10493:
URL: https://github.com/apache/hudi/pull/10493#issuecomment-1893616949

   
   ## CI report:
   
   * 42636082fbebb2491eadcd669fd2cb98e0f25ace Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21976)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-16 Thread via GitHub


hudi-bot commented on PR #10493:
URL: https://github.com/apache/hudi/pull/10493#issuecomment-1893505802

   
   ## CI report:
   
   * 334b3915326a19347e253edd5e22791b78c7c384 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21973)
   * 42636082fbebb2491eadcd669fd2cb98e0f25ace Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21976)
   
   



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-16 Thread via GitHub


hudi-bot commented on PR #10493:
URL: https://github.com/apache/hudi/pull/10493#issuecomment-1893407684

   
   ## CI report:
   
   * 2023e581f29ae0f1b96b57180118df7b6ebd1907 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21944)
   * 334b3915326a19347e253edd5e22791b78c7c384 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21973)
   * 42636082fbebb2491eadcd669fd2cb98e0f25ace Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21976)
   
   



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-16 Thread via GitHub


hudi-bot commented on PR #10493:
URL: https://github.com/apache/hudi/pull/10493#issuecomment-1893337286

   
   ## CI report:
   
   * 2023e581f29ae0f1b96b57180118df7b6ebd1907 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21944)
   * 334b3915326a19347e253edd5e22791b78c7c384 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21973)
   * 42636082fbebb2491eadcd669fd2cb98e0f25ace UNKNOWN
   
   



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-16 Thread via GitHub


boneanxs commented on code in PR #10493:
URL: https://github.com/apache/hudi/pull/10493#discussion_r1453101821


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -233,8 +233,23 @@ case class HoodieFileIndex(spark: SparkSession,
     //- Col-Stats Index is present
     //- Record-level Index is present
     //- List of predicates (filters) is present
+    val prunedPartitionFileNames: Set[String] = {
+      prunedPartitionsAndFileSlices
+        .flatMap {
+          case (_, fileSlices) => fileSlices
+        }
+        .flatMap { fileSlice =>
+          val baseFileOption = Option(fileSlice.getBaseFile.orElse(null))
+          val logFiles = if (includeLogFiles) {
+            fileSlice.getLogFiles.iterator().asScala.map(_.getFileName).toList
+          } else Nil
+          baseFileOption.map(_.getFileName).toList ++ logFiles
+        }
+        .toSet
+    }
+
     val candidateFilesNamesOpt: Option[Set[String]] =
-      lookupCandidateFilesInMetadataTable(dataFilters) match {
+      lookupCandidateFilesInMetadataTable(dataFilters, prunedPartitionFileNames) match {

Review Comment:
   I mean we can move this transformation inside `lookupCandidateFilesInMetadataTable`, after the `!isMetadataTableEnabled || !isDataSkippingEnabled` check.
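
   A minimal sketch of the suggestion above, assuming simplified stand-ins rather than Hudi's real types: `Slice` and `lookupCandidateFiles` are hypothetical names, and the flags are passed as plain booleans for illustration. The point is only that the file-name set is built after the guard, so queries with the metadata table or data skipping disabled never pay for it.

```scala
object LazyPrunedNames {
  // Hypothetical simplified file slice: an optional base file plus its log files.
  final case class Slice(baseFile: Option[String], logFiles: List[String])

  // Hypothetical stand-in for the lookup; not Hudi's actual signature.
  def lookupCandidateFiles(
      slices: Seq[Slice],
      includeLogFiles: Boolean,
      isMetadataTableEnabled: Boolean,
      isDataSkippingEnabled: Boolean): Option[Set[String]] =
    if (!isMetadataTableEnabled || !isDataSkippingEnabled) {
      // Guard first: on this path the file-name set is never built.
      None
    } else {
      // Only now collect base-file names and, optionally, log-file names.
      val prunedFileNames = slices.flatMap { s =>
        s.baseFile.toList ++ (if (includeLogFiles) s.logFiles else Nil)
      }.toSet
      // A real lookup would go on to consult the index; the sketch returns the set.
      Some(prunedFileNames)
    }
}
```

   The alternative of computing the set in the caller and passing it in is simpler to read, but it does the work even when the lookup returns early.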






Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-15 Thread via GitHub


hudi-bot commented on PR #10493:
URL: https://github.com/apache/hudi/pull/10493#issuecomment-1893175418

   
   ## CI report:
   
   * 2023e581f29ae0f1b96b57180118df7b6ebd1907 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21944)
   * 334b3915326a19347e253edd5e22791b78c7c384 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21973)
   
   



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-15 Thread via GitHub


hudi-bot commented on PR #10493:
URL: https://github.com/apache/hudi/pull/10493#issuecomment-1893167992

   
   ## CI report:
   
   * 2023e581f29ae0f1b96b57180118df7b6ebd1907 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21944)
   * 334b3915326a19347e253edd5e22791b78c7c384 UNKNOWN
   
   



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-15 Thread via GitHub


stream2000 commented on code in PR #10493:
URL: https://github.com/apache/hudi/pull/10493#discussion_r1452990261


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala:
##
@@ -106,14 +107,23 @@ class ColumnStatsIndexSupport(spark: SparkSession,
    *
    * Please check out scala-doc of the [[transpose]] method explaining this view in more details
    */
-  def loadTransposed[T](targetColumns: Seq[String], shouldReadInMemory: Boolean)(block: DataFrame => T): T = {
+  def loadTransposed[T](targetColumns: Seq[String], shouldReadInMemory: Boolean, prunedPartitionFileNames: Set[String] = Set.empty)(block: DataFrame => T): T = {

Review Comment:
   Rename `prunedPartitionFileNames` to `prunedFileNames`.






Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-15 Thread via GitHub


majian1998 commented on code in PR #10493:
URL: https://github.com/apache/hudi/pull/10493#discussion_r1452971478


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala:
##
@@ -106,14 +107,23 @@ class ColumnStatsIndexSupport(spark: SparkSession,
    *
    * Please check out scala-doc of the [[transpose]] method explaining this view in more details
    */
-  def loadTransposed[T](targetColumns: Seq[String], shouldReadInMemory: Boolean)(block: DataFrame => T): T = {
+  def loadTransposed[T](targetColumns: Seq[String], shouldReadInMemory: Boolean, prunedPartitionFileNames: Set[String] = Set.empty)(block: DataFrame => T): T = {
     cachedColumnStatsIndexViews.get(targetColumns) match {
       case Some(cachedDF) =>
         block(cachedDF)

       case None =>
-        val colStatsRecords: HoodieData[HoodieMetadataColumnStats] =
+        val colStatsRecords: HoodieData[HoodieMetadataColumnStats] = if (prunedPartitionFileNames.isEmpty) {
+          // NOTE: In order to ensure that testing and unexpected logic are normal, judgment logic is added.

Review Comment:
   Yes, I've modified this part of the description to make it clearer.






Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-15 Thread via GitHub


majian1998 commented on code in PR #10493:
URL: https://github.com/apache/hudi/pull/10493#discussion_r1452971185


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -223,6 +223,21 @@ case class HoodieFileIndex(spark: SparkSession,

     val prunedPartitionsAndFileSlices = getFileSlicesForPrunedPartitions(partitionFilters)

+    val prunedPartitionFileNames: Set[String] = {

Review Comment:
   done






Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-15 Thread via GitHub


majian1998 commented on code in PR #10493:
URL: https://github.com/apache/hudi/pull/10493#discussion_r1452969248


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala:
##
@@ -106,14 +107,23 @@ class ColumnStatsIndexSupport(spark: SparkSession,
    *
    * Please check out scala-doc of the [[transpose]] method explaining this view in more details
    */
-  def loadTransposed[T](targetColumns: Seq[String], shouldReadInMemory: Boolean)(block: DataFrame => T): T = {
+  def loadTransposed[T](targetColumns: Seq[String], shouldReadInMemory: Boolean, prunedPartitionFileNames: Set[String] = Set.empty)(block: DataFrame => T): T = {
     cachedColumnStatsIndexViews.get(targetColumns) match {
       case Some(cachedDF) =>
         block(cachedDF)

       case None =>
-        val colStatsRecords: HoodieData[HoodieMetadataColumnStats] =
+        val colStatsRecords: HoodieData[HoodieMetadataColumnStats] = if (prunedPartitionFileNames.isEmpty) {
+          // NOTE: In order to ensure that testing and unexpected logic are normal, judgment logic is added.
           loadColumnStatsIndexRecords(targetColumns, shouldReadInMemory)
+        } else {
+          val filterFunction = new SerializableFunction[HoodieMetadataColumnStats, java.lang.Boolean] {

Review Comment:
   I refrained from introducing new tests because the current data skipping tests are already comprehensive enough to cover the modifications made here. I think ensuring that the existing tests still pass should suffice.






Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-15 Thread via GitHub


boneanxs commented on code in PR #10493:
URL: https://github.com/apache/hudi/pull/10493#discussion_r1452296953


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala:
##
@@ -106,14 +107,23 @@ class ColumnStatsIndexSupport(spark: SparkSession,
    *
    * Please check out scala-doc of the [[transpose]] method explaining this view in more details
    */
-  def loadTransposed[T](targetColumns: Seq[String], shouldReadInMemory: Boolean)(block: DataFrame => T): T = {
+  def loadTransposed[T](targetColumns: Seq[String], shouldReadInMemory: Boolean, prunedPartitionFileNames: Set[String] = Set.empty)(block: DataFrame => T): T = {
     cachedColumnStatsIndexViews.get(targetColumns) match {
       case Some(cachedDF) =>
         block(cachedDF)

       case None =>
-        val colStatsRecords: HoodieData[HoodieMetadataColumnStats] =
+        val colStatsRecords: HoodieData[HoodieMetadataColumnStats] = if (prunedPartitionFileNames.isEmpty) {
+          // NOTE: In order to ensure that testing and unexpected logic are normal, judgment logic is added.

Review Comment:
   Can you explain more here?



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -223,6 +223,21 @@ case class HoodieFileIndex(spark: SparkSession,

     val prunedPartitionsAndFileSlices = getFileSlicesForPrunedPartitions(partitionFilters)

+    val prunedPartitionFileNames: Set[String] = {

Review Comment:
   Can we do this transformation inside `lookupCandidateFilesInMetadataTable`? That way, queries where `!isMetadataTableEnabled || !isDataSkippingEnabled` holds can avoid this work.






Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-12 Thread via GitHub


danny0405 commented on code in PR #10493:
URL: https://github.com/apache/hudi/pull/10493#discussion_r1450156685


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala:
##
@@ -106,14 +107,23 @@ class ColumnStatsIndexSupport(spark: SparkSession,
    *
    * Please check out scala-doc of the [[transpose]] method explaining this view in more details
    */
-  def loadTransposed[T](targetColumns: Seq[String], shouldReadInMemory: Boolean)(block: DataFrame => T): T = {
+  def loadTransposed[T](targetColumns: Seq[String], shouldReadInMemory: Boolean, prunedPartitionFileNames: Set[String] = Set.empty)(block: DataFrame => T): T = {
     cachedColumnStatsIndexViews.get(targetColumns) match {
       case Some(cachedDF) =>
         block(cachedDF)

       case None =>
-        val colStatsRecords: HoodieData[HoodieMetadataColumnStats] =
+        val colStatsRecords: HoodieData[HoodieMetadataColumnStats] = if (prunedPartitionFileNames.isEmpty) {
+          // NOTE: In order to ensure that testing and unexpected logic are normal, judgment logic is added.
           loadColumnStatsIndexRecords(targetColumns, shouldReadInMemory)
+        } else {
+          val filterFunction = new SerializableFunction[HoodieMetadataColumnStats, java.lang.Boolean] {

Review Comment:
   Is there any test for it? @boneanxs, can you help with the review?






Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-12 Thread via GitHub


hudi-bot commented on PR #10493:
URL: https://github.com/apache/hudi/pull/10493#issuecomment-1888654739

   
   ## CI report:
   
   * 2023e581f29ae0f1b96b57180118df7b6ebd1907 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21944)
   
   



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10493:
URL: https://github.com/apache/hudi/pull/10493#issuecomment-1888525702

   
   ## CI report:
   
   * 9658fb87f6054486a54b4c83036cbf2f3f8efa2f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21943)
   * 2023e581f29ae0f1b96b57180118df7b6ebd1907 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21944)
   
   



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10493:
URL: https://github.com/apache/hudi/pull/10493#issuecomment-1888518774

   
   ## CI report:
   
   * 9658fb87f6054486a54b4c83036cbf2f3f8efa2f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21943)
   * 2023e581f29ae0f1b96b57180118df7b6ebd1907 UNKNOWN
   
   



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10493:
URL: https://github.com/apache/hudi/pull/10493#issuecomment-1888466041

   
   ## CI report:
   
   * 9658fb87f6054486a54b4c83036cbf2f3f8efa2f Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21943)
   
   



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10493:
URL: https://github.com/apache/hudi/pull/10493#issuecomment-1888390957

   
   ## CI report:
   
   * 9658fb87f6054486a54b4c83036cbf2f3f8efa2f Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21943)
   
   



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10493:
URL: https://github.com/apache/hudi/pull/10493#issuecomment-1888386370

   
   ## CI report:
   
   * 9658fb87f6054486a54b4c83036cbf2f3f8efa2f UNKNOWN
   
   



[PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


majian1998 opened a new pull request, #10493:
URL: https://github.com/apache/hudi/pull/10493

   In the current implementation of data skipping, column statistics for the entire table are read first, and the data skipping filters are then applied to all of them. When the table holds a large volume of data across many partitions, this makes data skipping less efficient, because the partition pruning conditions are never used to narrow the column stats.
   
   This change applies the partition pruning result right after the column statistics are read, so the column stats that take part in the rest of data skipping are significantly smaller. This saves time in later computation and also conserves memory.
   
   In a test on a 25TB table distributed across 60 subpartitions, a query was run against one subpartition of 1.4TB. This simple test showed that data skipping can be several seconds faster. So in scenarios that involve partition pruning, time savings are achievable, and the memory footprint of the candidate file list needed for further computation is substantially reduced.
   
   In scenarios where partition pruning is not applied, this change adds only a minimal cost. That cost is inconsequential in either case: when the data volume is large, a seconds-level overhead is negligible, and when the data volume is small enough that partitioning is not needed at all, the filter operation is not time-consuming.
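   
   The idea can be sketched as follows. This is a simplified illustration, not the code in this PR: `ColStat`, `restrictToPrunedFiles`, and `candidateFiles` are hypothetical names, and the stats record is reduced to a single numeric min/max range. As in the diff, an empty file-name set is treated as "no pruning information" and leaves the stats untouched.

```scala
// Hypothetical stand-in for a column-stats record; not Hudi's HoodieMetadataColumnStats.
final case class ColStat(fileName: String, columnName: String, minValue: Long, maxValue: Long)

object PrunedColStats {
  // Keep only the stats of files that survived partition pruning.
  // An empty set means no pruning information is available, so all stats are kept.
  def restrictToPrunedFiles(stats: Seq[ColStat], prunedFileNames: Set[String]): Seq[ColStat] =
    if (prunedFileNames.isEmpty) stats
    else stats.filter(s => prunedFileNames.contains(s.fileName))

  // Data skipping on the (already reduced) stats: keep files whose
  // [min, max] range for `column` may contain `value`.
  def candidateFiles(stats: Seq[ColStat], column: String, value: Long): Set[String] =
    stats.collect {
      case s if s.columnName == column && s.minValue <= value && value <= s.maxValue => s.fileName
    }.toSet
}
```

   Filtering by file name first means the range check only runs over stats for the pruned partitions, which is where the time and memory savings described above would come from.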
   
   ### Change Logs
   
   Pushing Down Partition Pruning Conditions to Column Stats During Data 
Skipping
   
   ### Impact
   
   None
   
   ### Risk level (write none, low medium or high below)
   
   low
   
   ### Documentation Update
   
   None
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's 
guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   





Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


majian1998 closed pull request #10485: [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping
URL: https://github.com/apache/hudi/pull/10485





Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10485:
URL: https://github.com/apache/hudi/pull/10485#issuecomment-1887171861

   
   ## CI report:
   
   * 55a5918fb3706f76a41b9fba793c777566e09363 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21929)
   
   



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10485:
URL: https://github.com/apache/hudi/pull/10485#issuecomment-1887067482

   
   ## CI report:
   
   * 02cf31984f93fe6d147a400a55d7039dc87a38cf Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21928)
   * 55a5918fb3706f76a41b9fba793c777566e09363 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21929)
   
   



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10485:
URL: https://github.com/apache/hudi/pull/10485#issuecomment-1886985630

   
   ## CI report:
   
   * 02cf31984f93fe6d147a400a55d7039dc87a38cf Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21928)
   * 55a5918fb3706f76a41b9fba793c777566e09363 UNKNOWN
   
   



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10485:
URL: https://github.com/apache/hudi/pull/10485#issuecomment-1886948753

   
   ## CI report:
   
   * 02cf31984f93fe6d147a400a55d7039dc87a38cf Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21928)
   
   



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10485:
URL: https://github.com/apache/hudi/pull/10485#issuecomment-1886750286

   
   ## CI report:
   
   * 92cd13d2bac87ff68dce1ec60bf06427d4b88d94 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21926)
   * 02cf31984f93fe6d147a400a55d7039dc87a38cf Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21928)
   
   



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10485:
URL: https://github.com/apache/hudi/pull/10485#issuecomment-1886735750

   
   ## CI report:
   
   * 92cd13d2bac87ff68dce1ec60bf06427d4b88d94 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21926)
   * 02cf31984f93fe6d147a400a55d7039dc87a38cf UNKNOWN
   
   



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10485:
URL: https://github.com/apache/hudi/pull/10485#issuecomment-1886709580

   
   ## CI report:
   
   * 92cd13d2bac87ff68dce1ec60bf06427d4b88d94 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21926)
   
   



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-11 Thread via GitHub


hudi-bot commented on PR #10485:
URL: https://github.com/apache/hudi/pull/10485#issuecomment-1886612148

   
   ## CI report:
   
   * d7ef9f4481a1bef75266e204bcc02cbc74bd225e Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21924)
   * 92cd13d2bac87ff68dce1ec60bf06427d4b88d94 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21926)
   
   



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-10 Thread via GitHub


hudi-bot commented on PR #10485:
URL: https://github.com/apache/hudi/pull/10485#issuecomment-1886526246

   
   ## CI report:
   
   * c58ddb3ade3d8d54c0610991a0ad141330061b49 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21922)
   * d7ef9f4481a1bef75266e204bcc02cbc74bd225e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21924)
   * 92cd13d2bac87ff68dce1ec60bf06427d4b88d94 UNKNOWN
   
   



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-10 Thread via GitHub


hudi-bot commented on PR #10485:
URL: https://github.com/apache/hudi/pull/10485#issuecomment-1886388255

   
   ## CI report:
   
   * c58ddb3ade3d8d54c0610991a0ad141330061b49 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21922)
   * d7ef9f4481a1bef75266e204bcc02cbc74bd225e Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21924)
   
   



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-10 Thread via GitHub


hudi-bot commented on PR #10485:
URL: https://github.com/apache/hudi/pull/10485#issuecomment-1886332817

   
   ## CI report:
   
   * c58ddb3ade3d8d54c0610991a0ad141330061b49 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21922)
   * d7ef9f4481a1bef75266e204bcc02cbc74bd225e UNKNOWN
   
   



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-10 Thread via GitHub


hudi-bot commented on PR #10485:
URL: https://github.com/apache/hudi/pull/10485#issuecomment-1886288752

   
   ## CI report:
   
   * c58ddb3ade3d8d54c0610991a0ad141330061b49 Azure: [FAILURE](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=21922)
   
   



Re: [PR] [HUDI-7291] Pushing Down Partition Pruning Conditions to Column Stats Earlier During Data Skipping [hudi]

2024-01-10 Thread via GitHub


hudi-bot commented on PR #10485:
URL: https://github.com/apache/hudi/pull/10485#issuecomment-1886240515

   
   ## CI report:
   
   * c58ddb3ade3d8d54c0610991a0ad141330061b49 UNKNOWN
   
   