yihua commented on issue #18558:
URL: https://github.com/apache/hudi/issues/18558#issuecomment-4467649349

   The root cause is a chain of four things:

   1. During the `DELETE` command, reading the records to delete with the
      filter `id = 3` returns no rows, even though the MOR table with Lance file
      format has the record in one Lance base file.
   2. During that read, partition stats-based data skipping is enabled, because
      partition stats is enabled in the MDT.
   3. Lance files do not populate column stats or partition stats, but the
      previous deltacommit generated log files, which populated column stats
      and partition stats for the log files only (see the logic below). For
      Lance files the stats are not populated and no error is thrown, so only a
      partial set of the files in the snapshot has stats in the MDT, which
      leaves the index corrupted.
   4. When `HoodieFileIndex` builds the list of files, it uses the partition
      stats with the wrong min/max values and incorrectly prunes the partition
      containing the record.
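
To make steps 3 and 4 concrete, here is a minimal, self-contained sketch of how partition-level min/max built from only some of the files can prune a partition that still holds the record. This is not Hudi code: the class, the file names, and the `id` ranges are all made up for illustration, and the aggregation is a simplified stand-in for what the MDT does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

public class PartialStatsPruningSketch {

  /** Min/max range of the `id` column, for one file or for a whole partition. */
  static final class Range {
    final int min;
    final int max;

    Range(int min, int max) {
      this.min = min;
      this.max = max;
    }

    boolean mayContain(int value) {
      return value >= min && value <= max;
    }
  }

  /** A file in the snapshot; stats are empty when the writer never populated them. */
  static final class FileEntry {
    final String name;
    final Optional<Range> stats;

    FileEntry(String name, Optional<Range> stats) {
      this.name = name;
      this.stats = stats;
    }
  }

  /** Folds per-file ranges into one partition-level range, silently skipping files without stats. */
  static Optional<Range> aggregate(List<FileEntry> files) {
    Optional<Range> result = Optional.empty();
    for (FileEntry file : files) {
      if (!file.stats.isPresent()) {
        continue; // mirrors the "empty list, no error" fall-through for unsupported formats
      }
      Range r = file.stats.get();
      result = Optional.of(result
          .map(acc -> new Range(Math.min(acc.min, r.min), Math.max(acc.max, r.max)))
          .orElse(r));
    }
    return result;
  }

  /** A partition is pruned when its aggregated range exists and excludes the filter value. */
  static boolean partitionIsPruned(List<FileEntry> files, int id) {
    return aggregate(files).map(range -> !range.mayContain(id)).orElse(false);
  }

  public static void main(String[] args) {
    List<FileEntry> partition = Arrays.asList(
        // Hypothetical Lance base file holding ids 1..3, with no stats recorded.
        new FileEntry("base-file.lance", Optional.<Range>empty()),
        // Hypothetical log file holding ids 4..5, with stats recorded.
        new FileEntry(".log-file.log.1", Optional.of(new Range(4, 5))));

    // The partition-level range is [4, 5], so the filter id = 3 prunes the partition
    // even though the base file contains id = 3.
    System.out.println(partitionIsPruned(partition, 3)); // prints "true"
  }
}
```

In this sketch a partition with no stats at all is never pruned, so the wrong result only appears once some file in the partition, here the log file, contributes stats while the base file does not.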
   
   For now, column stats and partition stats must be explicitly turned off for
   the Lance file format, rather than relying on the implicit assumption that
   Lance base files do not populate `HoodieColumnRangeMetadata`.
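
As a rough sketch of what "explicitly turned off" could look like on the writer side, the helper below forces both index flags to `false` in a map of write options. The two keys are the metadata index configs as I understand them (`hoodie.metadata.index.column.stats.enable` and `hoodie.metadata.index.partition.stats.enable`); please verify them against the Hudi version in use. The class, the method name, and the table name are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LanceStatsOptionsSketch {

  // Assumed Hudi metadata config keys; check them against your Hudi version.
  static final String COLUMN_STATS_ENABLE = "hoodie.metadata.index.column.stats.enable";
  static final String PARTITION_STATS_ENABLE = "hoodie.metadata.index.partition.stats.enable";

  /** Returns a copy of the given write options with both stats indexes disabled. */
  static Map<String, String> withStatsIndexesDisabled(Map<String, String> writeOptions) {
    Map<String, String> result = new LinkedHashMap<>(writeOptions);
    result.put(COLUMN_STATS_ENABLE, "false");
    result.put(PARTITION_STATS_ENABLE, "false");
    return result;
  }

  public static void main(String[] args) {
    Map<String, String> base = new LinkedHashMap<>();
    base.put("hoodie.table.name", "lance_mor_table"); // hypothetical table name

    // Print the effective options that would be handed to the writer.
    withStatsIndexesDisabled(base).forEach((k, v) -> System.out.println(k + "=" + v));
  }
}
```

The resulting map is meant to be passed wherever the write options are already supplied, for example through the `options(...)` call on a Spark `DataFrameWriter`.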
   
   ```
   private static List<HoodieColumnRangeMetadata<Comparable>> readColumnRangeMetadataFrom(
       String partitionPath,
       String fileName,
       HoodieTableMetaClient datasetMetaClient,
       List<String> columnsToIndex,
       int maxBufferSize,
       HoodieIndexVersion indexVersion) {
     String partitionPathFileName =
         (partitionPath.equals(EMPTY_PARTITION_NAME) || partitionPath.equals(NON_PARTITIONED_NAME))
             ? fileName
             : partitionPath + "/" + fileName;
     try {
       StoragePath fullFilePath = new StoragePath(datasetMetaClient.getBasePath(), partitionPathFileName);
       if (partitionPathFileName.endsWith(HoodieFileFormat.PARQUET.getFileExtension())) {
         return HoodieIOFactory.getIOFactory(datasetMetaClient.getStorage())
             .getFileFormatUtils(HoodieFileFormat.PARQUET)
             .readColumnStatsFromMetadata(datasetMetaClient.getStorage(), fullFilePath, columnsToIndex, indexVersion);
       } else if (FSUtils.isLogFile(fileName)) {
         Option<HoodieSchema> writerSchemaOpt = tryResolveSchemaForTable(datasetMetaClient);
         log.info("Reading log file: {}, to build column range metadata.", partitionPathFileName);
         return getLogFileColumnRangeMetadata(
             fullFilePath.toString(), partitionPath, datasetMetaClient, columnsToIndex, writerSchemaOpt, maxBufferSize);
       }
       log.warn("Column range index not supported for: {}", partitionPathFileName);
       return Collections.emptyList();
     } catch (Exception e) {
       // NOTE: In case reading column range metadata from individual file failed,
       //       we simply fall back, in lieu of failing the whole task
       log.error("Failed to fetch column range metadata for: {}", partitionPathFileName);
       return Collections.emptyList();
     }
   }
   ```
   
   I filed #18758 for column stats and partition stats support for Lance file 
format.

