Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]
stream2000 merged PR #10528: URL: https://github.com/apache/hudi/pull/10528 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: commits-unsubscr...@hudi.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]
hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1903566951

## CI report:

* fe70696a5c4f16b0367470f54fe4814f510cb3b0 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22088)

Bot commands
@hudi-bot supports the following commands:

- `@hudi-bot run azure` re-run the last Azure build
Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]
hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1903388167

## CI report:

* 0c268f8c3ff082913d8522edfa1674fa88b19ab6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22042)
* fe70696a5c4f16b0367470f54fe4814f510cb3b0 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22088)
Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]
hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1903345377

## CI report:

* 0c268f8c3ff082913d8522edfa1674fa88b19ab6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22042)
* fe70696a5c4f16b0367470f54fe4814f510cb3b0 UNKNOWN
Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]
stream2000 commented on code in PR #10528:
URL: https://github.com/apache/hudi/pull/10528#discussion_r1458286332

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -361,9 +364,16 @@ case class HoodieFileIndex(spark: SparkSession,
       // For that we use a simple-heuristic to determine whether we should read and process CSI in-memory or
       // on-cluster: total number of rows of the expected projected portion of the index has to be below the
       // threshold (of 100k records)
-      val prunedFileNames = getPrunedFileNames(prunedPartitionsAndFileSlices)
       val shouldReadInMemory = columnStatsIndex.shouldReadInMemory(this, queryReferencedColumns)
-      columnStatsIndex.loadTransposed(queryReferencedColumns, shouldReadInMemory, prunedFileNames) { transposedColStatsDF =>
+      val prunedFileNames = getPrunedFileNames(prunedPartitionsAndFileSlices)
+      // NOTE: This judgment has two purposes:

Review Comment:
   nit: We can simplify the comment to:
   // If partition pruning doesn't prune any files, then there's no need to apply file filters when loading the Column Statistics Index

## hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/HoodieFileIndex.scala:
##
@@ -233,8 +233,9 @@ case class HoodieFileIndex(spark: SparkSession,
     //- Col-Stats Index is present
     //- Record-level Index is present
     //- List of predicates (filters) is present
+    val shouldPushDownFilesFilter = !partitionFilters.isEmpty
     val candidateFilesNamesOpt: Option[Set[String]] =
-      lookupCandidateFilesInMetadataTable(dataFilters, prunedPartitionsAndFileSlices) match {
+      lookupCandidateFilesInMetadataTable(dataFilters, shouldPushDownFilesFilter, prunedPartitionsAndFileSlices) match {

Review Comment:
   We can move the `shouldPushDownFilesFilter` to the end of the parameter list.
Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]
hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1898535611

## CI report:

* 0c268f8c3ff082913d8522edfa1674fa88b19ab6 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22042)
Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]
hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1898350527

## CI report:

* c5a5fc4a8d3dddbc4c4fb01a1d168136abeb4863 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22036)
* 0c268f8c3ff082913d8522edfa1674fa88b19ab6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22042)
Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]
hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1898169129

## CI report:

* 70c69652005098181a64aa8037480f766259f711 Azure: [SUCCESS](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22035)
* c5a5fc4a8d3dddbc4c4fb01a1d168136abeb4863 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22036)
* 0c268f8c3ff082913d8522edfa1674fa88b19ab6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22042)
Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]
hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1898084525

## CI report:

* 70c69652005098181a64aa8037480f766259f711 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22035)
* c5a5fc4a8d3dddbc4c4fb01a1d168136abeb4863 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22036)
* 0c268f8c3ff082913d8522edfa1674fa88b19ab6 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22042)
Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]
hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1898071446

## CI report:

* 70c69652005098181a64aa8037480f766259f711 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22035)
* c5a5fc4a8d3dddbc4c4fb01a1d168136abeb4863 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22036)
* 0c268f8c3ff082913d8522edfa1674fa88b19ab6 UNKNOWN
Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]
hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1897998387

## CI report:

* 70c69652005098181a64aa8037480f766259f711 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22035)
* c5a5fc4a8d3dddbc4c4fb01a1d168136abeb4863 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22036)
Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]
hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1897987875

## CI report:

* 70c69652005098181a64aa8037480f766259f711 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22035)
* c5a5fc4a8d3dddbc4c4fb01a1d168136abeb4863 UNKNOWN
Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]
hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1897929430

## CI report:

* 70c69652005098181a64aa8037480f766259f711 Azure: [PENDING](https://dev.azure.com/apache-hudi-ci-org/785b6ef4-2f42-4a89-8f0e-5f0d7039a0cc/_build/results?buildId=22035)
Re: [PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]
hudi-bot commented on PR #10528:
URL: https://github.com/apache/hudi/pull/10528#issuecomment-1897921531

## CI report:

* 70c69652005098181a64aa8037480f766259f711 UNKNOWN
[PR] [HUDI-7310] Optimize Column Stats Partition Pruning for Non-Partition Pruning Queries [hudi]
majian1998 opened a new pull request, #10528:
URL: https://github.com/apache/hudi/pull/10528

In HUDI-7291, I applied the partition pruning conditions to the column stats earlier, reducing the amount of data processed during data skipping. However, for non-partitioned tables, or for partitioned tables where the query has no partition conditions, that optimization introduces unnecessary serialization overhead. This PR therefore adds a check that skips the file filtering in those cases.

### Change Logs

None

### Impact

None

### Risk level (write none, low medium or high below)

None

### Documentation Update

None

### Contributor's checklist

- [ ] Read through [contributor's guide](https://hudi.apache.org/contribute/how-to-contribute)
- [ ] Change Logs and Impact were stated clearly
- [ ] Adequate tests were added if applicable
- [ ] CI passed
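For readers skimming the thread, the check described above can be illustrated with a minimal, self-contained sketch. It is not the actual `HoodieFileIndex` code: `ColumnStat`, `loadStats`, and `candidateFiles` are hypothetical stand-ins, and the real index works on Spark DataFrames rather than plain collections. Only the `shouldPushDownFilesFilter` name and its derivation from the partition filters come from the review diff in this thread.

```scala
// Hypothetical, simplified model of the check; not the real Hudi implementation.
object ColumnStatsPruningSketch {
  // Stand-in for one column stats row: a file's min/max for a single column.
  final case class ColumnStat(fileName: String, minValue: Int, maxValue: Int)

  // Loads stats, restricting them to the pruned file names only when requested.
  // Skipping the filter avoids shipping a file list that would exclude nothing.
  def loadStats(allStats: Seq[ColumnStat],
                prunedFileNames: Set[String],
                shouldPushDownFilesFilter: Boolean): Seq[ColumnStat] =
    if (shouldPushDownFilesFilter) allStats.filter(s => prunedFileNames.contains(s.fileName))
    else allStats

  // Data skipping: keep files whose [min, max] range may contain `value`.
  def candidateFiles(partitionFilters: Seq[String],
                     allStats: Seq[ColumnStat],
                     prunedFileNames: Set[String],
                     value: Int): Set[String] = {
    // Mirrors the diff: push the file filter down only if partition filters exist.
    val shouldPushDownFilesFilter = partitionFilters.nonEmpty
    loadStats(allStats, prunedFileNames, shouldPushDownFilesFilter)
      .filter(s => s.minValue <= value && value <= s.maxValue)
      .map(_.fileName)
      .toSet
  }
}
```

When `partitionFilters` is empty, partition pruning cannot have removed any files, so the sketch returns the same candidates with or without the file-name filter; the only difference is the work saved by not applying it.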