[ 
https://issues.apache.org/jira/browse/HUDI-5245?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

sivabalan narayanan updated HUDI-5245:
--------------------------------------
    Priority: Critical  (was: Major)

> Honor pruned partitions while looking up in col stats partition in MDT
> ----------------------------------------------------------------------
>
>                 Key: HUDI-5245
>                 URL: https://issues.apache.org/jira/browse/HUDI-5245
>             Project: Apache Hudi
>          Issue Type: Improvement
>          Components: metadata
>            Reporter: sivabalan narayanan
>            Priority: Critical
>
> When looking up in col stats for data skipping, we are passing in only the 
> list of columns in the predicate. We don't leverage the pruned list of 
> partitions in this call.
>  
> For eg, if there are 1000 partitions and 5 cols w/ predicate, and only 10 
> partitions are matched after pruning,
> exiting call will fetch 5 cols * 1000 partitions = 5k entries from col_stats 
> partition in MDT to do file skipping.
> where as if we wire in pruned list of partitions, then we only need to do 
> file skipping from 50 entries. 
>  
> {code:java}
> private def loadColumnStatsIndexRecords(targetColumns: Seq[String], 
> shouldReadInMemory: Boolean): HoodieData[HoodieMetadataColumnStats] = {
>   // Read Metadata Table's Column Stats Index records into [[HoodieData]] 
> container by
>   //    - Fetching the records from CSI by key-prefixes (encoded column names)
>   //    - Extracting [[HoodieMetadataColumnStats]] records
>   //    - Filtering out nulls
>   checkState(targetColumns.nonEmpty)
>   // TODO encoding should be done internally w/in HoodieBackedTableMetadata
>   val encodedTargetColumnNames = targetColumns.map(colName => new 
> ColumnIndexID(colName).asBase64EncodedString())
>   val metadataRecords: HoodieData[HoodieRecord[HoodieMetadataPayload]] =
>     metadataTable.getRecordsByKeyPrefixes(encodedTargetColumnNames.asJava, 
> HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS, shouldReadInMemory)
> .
> . {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to