Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]

2024-01-22 Thread via GitHub


stream2000 merged PR #10528:
URL: https://github.com/apache/hudi/pull/10528


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]

2024-01-22 Thread via GitHub


hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1903566951

   
   ## CI report:
   
   * fe70696a5c4f16b0367470f54fe4814f510cb3b0 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22088)
   
   
   Bot commands
 @hudi-bot supports the following commands:
   
- `@hudi-bot run azure` re-run the last Azure build
   





Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]

2024-01-21 Thread via GitHub


hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1903388167

   
   ## CI report:
   
   * 0c268f8c3ff082913d8522edfa1674fa88b19ab6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22042)
   * fe70696a5c4f16b0367470f54fe4814f510cb3b0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22088)
   
   



Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]

2024-01-21 Thread via GitHub


hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1903345377

   
   ## CI report:
   
   * 0c268f8c3ff082913d8522edfa1674fa88b19ab6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22042)
   * fe70696a5c4f16b0367470f54fe4814f510cb3b0 UNKNOWN
   
   



Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]

2024-01-18 Thread via GitHub


stream2000 commented on code in PR #10528:
URL: https://github.com/apache/hudi/pull/10528#discussion_r1458286332


##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -361,9 +364,16 @@ case class HoodieFileIndex(spark: SparkSession,
   //   For that we use a simple-heuristic to determine whether we should read and process CSI in-memory or
   //   on-cluster: total number of rows of the expected projected portion of the index has to be below the
   //   threshold (of 100k records)
-  val prunedFileNames = getPrunedFileNames(prunedPartitionsAndFileSlices)
   val shouldReadInMemory = columnStatsIndex.shouldReadInMemory(this, queryReferencedColumns)
-  columnStatsIndex.loadTransposed(queryReferencedColumns, shouldReadInMemory, prunedFileNames) { transposedColStatsDF =>
+  val prunedFileNames = getPrunedFileNames(prunedPartitionsAndFileSlices)
+  // NOTE: This judgment has two purposes:

Review Comment:
   nit: We can simplify the comment to: 
   
   // If partition pruning doesn't prune any files, then there's no need to apply file filters when loading the Column Statistics Index



##
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -233,8 +233,9 @@ case class HoodieFileIndex(spark: SparkSession,
   //- Col-Stats Index is present
   //- Record-level Index is present
   //- List of predicates (filters) is present
+  val shouldPushDownFilesFilter = !partitionFilters.isEmpty
   val candidateFilesNamesOpt: Option[Set[String]] =
-  lookupCandidateFilesInMetadataTable(dataFilters, prunedPartitionsAndFileSlices) match {
+  lookupCandidateFilesInMetadataTable(dataFilters, shouldPushDownFilesFilter, prunedPartitionsAndFileSlices) match {

Review Comment:
   We can move the `shouldPushDownFilesFilter` to the end of the parameter list.






Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]

2024-01-18 Thread via GitHub


hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1898535611

   
   ## CI report:
   
   * 0c268f8c3ff082913d8522edfa1674fa88b19ab6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22042)
   
   



Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]

2024-01-18 Thread via GitHub


hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1898350527

   
   ## CI report:
   
   * c5a5fc4a8d3dddbc4c4fb01a1d168136abeb4863 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22036)
   * 0c268f8c3ff082913d8522edfa1674fa88b19ab6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22042)
   
   



Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]

2024-01-18 Thread via GitHub


hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1898169129

   
   ## CI report:
   
   * 70c69652005098181a64aa8037480f766259f711 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22035)
   * c5a5fc4a8d3dddbc4c4fb01a1d168136abeb4863 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22036)
   * 0c268f8c3ff082913d8522edfa1674fa88b19ab6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22042)
   
   



Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]

2024-01-18 Thread via GitHub


hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1898084525

   
   ## CI report:
   
   * 70c69652005098181a64aa8037480f766259f711 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22035)
   * c5a5fc4a8d3dddbc4c4fb01a1d168136abeb4863 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22036)
   * 0c268f8c3ff082913d8522edfa1674fa88b19ab6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22042)
   
   



Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]

2024-01-18 Thread via GitHub


hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1898071446

   
   ## CI report:
   
   * 70c69652005098181a64aa8037480f766259f711 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22035)
   * c5a5fc4a8d3dddbc4c4fb01a1d168136abeb4863 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22036)
   * 0c268f8c3ff082913d8522edfa1674fa88b19ab6 UNKNOWN
   
   



Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]

2024-01-18 Thread via GitHub


hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1897998387

   
   ## CI report:
   
   * 70c69652005098181a64aa8037480f766259f711 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22035)
   * c5a5fc4a8d3dddbc4c4fb01a1d168136abeb4863 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22036)
   
   



Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]

2024-01-18 Thread via GitHub


hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1897987875

   
   ## CI report:
   
   * 70c69652005098181a64aa8037480f766259f711 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22035)
   * c5a5fc4a8d3dddbc4c4fb01a1d168136abeb4863 UNKNOWN
   
   



Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]

2024-01-17 Thread via GitHub


hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1897929430

   
   ## CI report:
   
   * 70c69652005098181a64aa8037480f766259f711 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22035)
   
   



Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]

2024-01-17 Thread via GitHub


hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1897921531

   
   ## CI report:
   
   * 70c69652005098181a64aa8037480f766259f711 UNKNOWN
   
   



[PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]

2024-01-17 Thread via GitHub


majian1998 opened a new pull request, #10528:
URL: https://github.com/apache/hudi/pull/10528

   In HUDI-7291, I applied the partition pruning conditions to the column stats earlier in the lookup, which reduces the amount of data processed during data skipping. However, for non-partitioned tables, and for partitioned tables where the query has no partition conditions, that optimization only adds unnecessary serialization overhead. This PR therefore adds a check that skips the file filtering in those cases.
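
A minimal sketch of the check described above, assuming a simplified model rather than Hudi's real classes. `ColumnStatsPruningSketch`, `FileSlice`, and `fileFilterForColumnStats` are hypothetical names introduced only for illustration; the one detail taken from the PR is that the flag is derived from whether any partition filters are present.

```scala
// Hypothetical, simplified model; these are not Hudi's actual types or methods.
object ColumnStatsPruningSketch {
  final case class FileSlice(partition: String, fileName: String)

  // Returns Some(fileNames) when the column stats lookup should be restricted
  // to the files that survived partition pruning, or None when the index
  // should be loaded without any file filter.
  def fileFilterForColumnStats(partitionFilters: Seq[String],
                               prunedSlices: Seq[FileSlice]): Option[Set[String]] = {
    // Without partition filters, partition pruning removed nothing, so building
    // and shipping a file-name filter would only add overhead.
    val shouldPushDownFilesFilter = partitionFilters.nonEmpty
    if (shouldPushDownFilesFilter) Some(prunedSlices.map(_.fileName).toSet)
    else None
  }
}
```

In this sketch, a query with no partition predicate yields `None`, so the caller would read the column stats without a file filter; a query with at least one partition predicate yields the set of surviving file names.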
   
   ### Change Logs
   
   None
   
   ### Impact
   
   None
   
   ### Risk level (write none, low medium or high below)
   
   None
   
   ### Documentation Update
   
   None
   
   ### Contributor's checklist
   
   - [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
   - [ ] Change Logs and Impact were stated clearly
   - [ ] Adequate tests were added if applicable
   - [ ] CI passed
   

