hudi-bot opened a new issue, #15584:
URL: https://github.com/apache/hudi/issues/15584
When looking up col stats for data skipping, we pass in only the list of
columns in the predicate. We don't leverage the pruned list of partitions
in this call.
For example, if there are 1000 partitions and 5 columns in the predicate,
and only 10 partitions are matched after pruning, the existing call will
fetch 5 cols * 1000 partitions = 5k entries from the col_stats partition
in MDT to do file skipping, whereas if we wire in the pruned list of
partitions, we only need to do file skipping from 5 cols * 10 partitions =
50 entries.
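The arithmetic above can be checked with a small self-contained model. This is only a sketch: the `Entry` case class, the column and partition names, and the assumption of exactly one col_stats entry per (column, partition) pair are illustrative simplifications, not Hudi's actual data model.

```scala
object ColStatsLookupCount {
  // Hypothetical model: one col_stats entry per (column, partition) pair,
  // mirroring the simplification used in the example above.
  final case class Entry(column: String, partition: String)

  val allPartitions: Seq[String] = (1 to 1000).map(i => f"p$i%04d")
  val tableColumns: Seq[String] = (1 to 8).map(i => s"c$i")

  // 5 columns appear in the predicate; 10 partitions survive pruning.
  val predicateColumns: Set[String] = tableColumns.take(5).toSet
  val prunedPartitions: Set[String] = allPartitions.take(10).toSet

  val index: Seq[Entry] = for {
    c <- tableColumns
    p <- allPartitions
  } yield Entry(c, p)

  // Existing behaviour: entries are selected by column only.
  val byColumnOnly: Seq[Entry] =
    index.filter(e => predicateColumns.contains(e.column))

  // Proposed behaviour: entries are selected by column and pruned partition.
  val byColumnAndPartition: Seq[Entry] =
    byColumnOnly.filter(e => prunedPartitions.contains(e.partition))

  def main(args: Array[String]): Unit = {
    println(s"column-only lookup: ${byColumnOnly.size} entries")
    println(s"column + pruned partitions: ${byColumnAndPartition.size} entries")
  }
}
```

Under these assumptions the column-only lookup touches 5000 entries, while the lookup restricted to the pruned partitions touches 50.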
{code:java}
private def loadColumnStatsIndexRecords(targetColumns: Seq[String],
                                        shouldReadInMemory: Boolean): HoodieData[HoodieMetadataColumnStats] = {
  // Read Metadata Table's Column Stats Index records into [[HoodieData]] container by
  //    - Fetching the records from CSI by key-prefixes (encoded column names)
  //    - Extracting [[HoodieMetadataColumnStats]] records
  //    - Filtering out nulls
  checkState(targetColumns.nonEmpty)

  // TODO encoding should be done internally w/in HoodieBackedTableMetadata
  val encodedTargetColumnNames = targetColumns.map(colName =>
    new ColumnIndexID(colName).asBase64EncodedString())
  val metadataRecords: HoodieData[HoodieRecord[HoodieMetadataPayload]] =
    metadataTable.getRecordsByKeyPrefixes(encodedTargetColumnNames.asJava,
      HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS, shouldReadInMemory)
.
. {code}
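
One possible way to wire in the pruned partitions is to build one key prefix per (column, pruned partition) pair instead of one per column. The sketch below assumes that col_stats keys start with the encoded column identifier immediately followed by an encoded partition identifier; that layout, the `PrunedKeyPrefixes` object, and the `encode` helper are assumptions for illustration. In particular, `encode` is a plain Base64 stand-in and does not reproduce what `ColumnIndexID` does.

```scala
import java.nio.charset.StandardCharsets
import java.util.Base64

object PrunedKeyPrefixes {
  // Hypothetical stand-in for Hudi's index-ID encoding; the real classes
  // are not used here so that the sketch stays self-contained.
  def encode(name: String): String =
    Base64.getEncoder.encodeToString(name.getBytes(StandardCharsets.UTF_8))

  // Returns one prefix per (column, pruned partition) pair. When no pruned
  // partition list is available (None), falls back to column-only prefixes,
  // which corresponds to the existing behaviour.
  def keyPrefixes(targetColumns: Seq[String],
                  prunedPartitions: Option[Seq[String]]): Seq[String] = {
    require(targetColumns.nonEmpty, "targetColumns must not be empty")
    val encodedColumns = targetColumns.map(encode)
    prunedPartitions match {
      case None => encodedColumns
      case Some(partitions) =>
        val encodedPartitions = partitions.map(encode)
        for {
          col <- encodedColumns
          part <- encodedPartitions
        } yield col + part
    }
  }
}
```

If the key layout assumption holds, the resulting prefixes could be passed to `getRecordsByKeyPrefixes` in place of `encodedTargetColumnNames`, so that only entries for the pruned partitions are fetched.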
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-5245
- Type: Improvement
- Epic: https://issues.apache.org/jira/browse/HUDI-1292
- Fix version(s):
- 1.1.0
---
## Comments
21/Nov/22 20:58;alexey.kudinkin;I'd actually clarify the requirement a
little bit to avoid using both at the same time:
# If CSI is enabled, we should look up just the CSI
# If it's not enabled, we should do the partition pruning