hudi-bot opened a new issue, #15584:
URL: https://github.com/apache/hudi/issues/15584

   When looking up column stats for data skipping, we pass in only the list 
of columns referenced in the predicate. We don't leverage the pruned list of 
partitions in this call.
   
    
   
   For example, if there are 1000 partitions and 5 columns with predicates, 
and only 10 partitions are matched after pruning, the existing call will 
fetch 5 cols * 1000 partitions = 5k entries from the col_stats partition in 
the MDT to do file skipping, whereas if we wire in the pruned list of 
partitions, we only need to do file skipping from 5 cols * 10 partitions = 50 
entries.
   
    
   {code:java}
   private def loadColumnStatsIndexRecords(targetColumns: Seq[String], shouldReadInMemory: Boolean): HoodieData[HoodieMetadataColumnStats] = {
     // Read Metadata Table's Column Stats Index records into [[HoodieData]] container by
     //    - Fetching the records from CSI by key-prefixes (encoded column names)
     //    - Extracting [[HoodieMetadataColumnStats]] records
     //    - Filtering out nulls
     checkState(targetColumns.nonEmpty)
   
     // TODO encoding should be done internally w/in HoodieBackedTableMetadata
     val encodedTargetColumnNames = targetColumns.map(colName => new ColumnIndexID(colName).asBase64EncodedString())
   
     val metadataRecords: HoodieData[HoodieRecord[HoodieMetadataPayload]] =
       metadataTable.getRecordsByKeyPrefixes(encodedTargetColumnNames.asJava, HoodieTableMetadataUtil.PARTITION_NAME_COLUMN_STATS, shouldReadInMemory)
   .
   . {code}
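The proposed change can be sketched roughly as follows. This is only an illustration, not Hudi's implementation: `ColStatsPrefixSketch`, `encode`, and `keyPrefixes` are hypothetical names, `encode` is a fixed-width stand-in for the real ID encoding (such as `ColumnIndexID#asBase64EncodedString`), and the sketch assumes that a col_stats key starts with the encoded column name followed by the encoded partition name, so that a column-plus-partition prefix narrows the lookup.

```scala
object ColStatsPrefixSketch {
  // Hypothetical stand-in for Hudi's ID encoding: a fixed-width hex string,
  // so that concatenated IDs still behave as key prefixes.
  def encode(name: String): String = f"${name.hashCode}%08x"

  // Hypothetical helper that builds the key prefixes to look up.
  //   - None: no pruned partition list is available, so fall back to
  //     column-only prefixes (the behavior described in this issue).
  //   - Some(partitions): one prefix per (column, partition) pair, assuming
  //     keys are laid out as <column ID><partition ID>...
  def keyPrefixes(targetColumns: Seq[String],
                  prunedPartitions: Option[Seq[String]]): Seq[String] = {
    require(targetColumns.nonEmpty, "at least one target column is required")
    prunedPartitions match {
      case None => targetColumns.map(encode)
      case Some(partitions) =>
        for {
          col <- targetColumns
          partition <- partitions
        } yield encode(col) + encode(partition)
    }
  }
}
```

With 5 target columns and 10 pruned partitions this yields 50 prefixes, each matching only the entries of one column in one partition, instead of 5 column-only prefixes that match entries across all 1000 partitions. An empty pruned list yields no prefixes at all, which a caller would presumably treat as "nothing to read".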
   
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-5245
   - Type: Improvement
   - Epic: https://issues.apache.org/jira/browse/HUDI-1292
   - Fix version(s):
     - 1.1.0
   
   
   ---
   
   
   ## Comments
   
   21/Nov/22 20:58;alexey.kudinkin;I'd actually clarify the requirement a 
little bit to avoid using both at the same time:
    # If CSI is enabled, we should look up just the CSI
    # If it's not enabled, we should do the partition pruning;;;


