[ https://issues.apache.org/jira/browse/HUDI-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
sivabalan narayanan updated HUDI-5245:
--------------------------------------
    Priority: Critical  (was: Major)

> Honor pruned partitions while looking up in col stats partition in MDT
> ----------------------------------------------------------------------
>
>                 Key: HUDI-5245
>                 URL: https://issues.apache.org/jira/browse/HUDI-5245
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: metadata
>            Reporter: sivabalan narayanan
>            Priority: Critical
>
> When looking up col stats for data skipping, we pass in only the list of
> columns in the predicate. We don't leverage the pruned list of partitions
> in this call.
>
> For example, if there are 1000 partitions and 5 columns in the predicate,
> and only 10 partitions match after pruning, the existing call will fetch
> 5 cols * 1000 partitions = 5k entries from the col_stats partition in MDT
> to do file skipping, whereas if we wire in the pruned list of partitions,
> we only need to do file skipping from 5 cols * 10 partitions = 50 entries.
>
> {code:java}
> private def loadColumnStatsIndexRecords(targetColumns: Seq[String], shouldReadInMemory: Boolean): HoodieData[HoodieMetadataColumnStats] = {
>   // Read Metadata Table's Column Stats Index records into [[HoodieData]] container by
>   //    - Fetching the records from CSI by key-prefixes (encoded column names)
>   //    - Extracting [[HoodieMetadataColumnStats]] records
>   //    - Filtering out nulls
>   checkState(targetColumns.nonEmpty)
>   // TODO encoding should be done internally w/in HoodieBackedTableMetadata
>   val encodedTargetColumnNames = targetColumns.map(colName => new ColumnIndexID(colName).asBase64EncodedString())
>   val metadataRecords: HoodieData[HoodieRecord[HoodieMetadataPayload]] =
>     metadataTable.getRecordsByKeyPrefixes(encodedTargetColumnNames.asJava, HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS, shouldReadInMemory)
>   .
>   .
> {code}
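
The arithmetic in the description (5k entries versus 50) can be reproduced with a small standalone sketch. It does not call any Hudi API. The `encode` helper is a hypothetical stand-in for the real index-ID encoding, and the key layout (encoded column followed by encoded partition) is an assumption made for illustration, as are the object and method names.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object PrunedColStatsPrefixes {
  // Hypothetical stand-in for the index-ID encoding: a fixed-width hex digest,
  // so that one encoded name can never be a prefix of another.
  def encode(name: String): String =
    MessageDigest.getInstance("MD5")
      .digest(name.getBytes(StandardCharsets.UTF_8))
      .map(b => f"${b & 0xff}%02x")
      .mkString

  // Behaviour described in the issue: one key prefix per predicate column.
  def columnPrefixes(columns: Seq[String]): Seq[String] =
    columns.map(encode)

  // Proposed behaviour: one key prefix per (column, pruned partition) pair.
  // Assumes keys are laid out as encoded column followed by encoded partition.
  def columnPartitionPrefixes(columns: Seq[String], prunedPartitions: Seq[String]): Seq[String] =
    for {
      col  <- columns
      part <- prunedPartitions
    } yield encode(col) + encode(part)

  // Simulated index with one entry per (column, partition) pair.
  def simulatedIndexKeys(columns: Seq[String], partitions: Seq[String]): Seq[String] =
    for {
      col  <- columns
      part <- partitions
    } yield encode(col) + encode(part)

  // Number of index entries a prefix lookup would have to fetch.
  def countMatches(indexKeys: Seq[String], prefixes: Seq[String]): Int =
    indexKeys.count(key => prefixes.exists(p => key.startsWith(p)))

  def main(args: Array[String]): Unit = {
    val columns          = (1 to 5).map(i => s"col$i")
    val allPartitions    = (1 to 1000).map(i => s"partition=$i")
    val prunedPartitions = allPartitions.take(10)
    val index            = simulatedIndexKeys(columns, allPartitions)

    println(s"column-only prefixes fetch:      ${countMatches(index, columnPrefixes(columns))}")
    println(s"column+partition prefixes fetch: ${countMatches(index, columnPartitionPrefixes(columns, prunedPartitions))}")
  }
}
```

With one simulated entry per column and partition, the column-only lookup touches all 5000 entries while the pruned lookup touches 50. In the quoted method, the equivalent change would presumably be to build the prefixes passed to `getRecordsByKeyPrefixes` from both the target columns and the pruned partitions, provided the real key layout supports that.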