yihua commented on issue #18558:
URL: https://github.com/apache/hudi/issues/18558#issuecomment-4467649349
The root cause of the problem is a chain of four conditions:
(1) During the `DELETE` command, reading the records to delete with the filter
`id = 3` returns no rows, even though the MOR table with the Lance file format
has the record in one Lance base file.
(2) When the table is read, partition stats-based data skipping is enabled,
because partition stats is enabled in the MDT.
(3) Lance files do not populate column stats or partition stats, but the
previous deltacommit generated log files, and column stats and partition stats
were populated for those log files only (see the logic below). For Lance files,
stats are not populated and no error is thrown, so only a partial set of the
files in the snapshot has stats in the MDT, which leaves the stats corrupted.
(4) When `HoodieFileIndex` gets the list of files, it uses the partition stats
with the wrong min/max values and incorrectly prunes the partition containing
the record.
For now, column stats and partition stats must be explicitly turned off for
the Lance file format, rather than relying on the implicit assumption that
Lance base files do not populate `HoodieColumnRangeMetadata`.
```
private static List<HoodieColumnRangeMetadata<Comparable>> readColumnRangeMetadataFrom(
    String partitionPath,
    String fileName,
    HoodieTableMetaClient datasetMetaClient,
    List<String> columnsToIndex,
    int maxBufferSize,
    HoodieIndexVersion indexVersion) {
  String partitionPathFileName =
      (partitionPath.equals(EMPTY_PARTITION_NAME) || partitionPath.equals(NON_PARTITIONED_NAME))
          ? fileName
          : partitionPath + "/" + fileName;
  try {
    StoragePath fullFilePath =
        new StoragePath(datasetMetaClient.getBasePath(), partitionPathFileName);
    if (partitionPathFileName.endsWith(HoodieFileFormat.PARQUET.getFileExtension())) {
      return HoodieIOFactory.getIOFactory(datasetMetaClient.getStorage())
          .getFileFormatUtils(HoodieFileFormat.PARQUET)
          .readColumnStatsFromMetadata(
              datasetMetaClient.getStorage(), fullFilePath, columnsToIndex, indexVersion);
    } else if (FSUtils.isLogFile(fileName)) {
      Option<HoodieSchema> writerSchemaOpt = tryResolveSchemaForTable(datasetMetaClient);
      log.info("Reading log file: {}, to build column range metadata.", partitionPathFileName);
      return getLogFileColumnRangeMetadata(
          fullFilePath.toString(), partitionPath, datasetMetaClient, columnsToIndex,
          writerSchemaOpt, maxBufferSize);
    }
    log.warn("Column range index not supported for: {}", partitionPathFileName);
    return Collections.emptyList();
  } catch (Exception e) {
    // NOTE: In case reading column range metadata from individual file failed,
    //       we simply fall back, in lieu of failing the whole task
    log.error("Failed to fetch column range metadata for: {}", partitionPathFileName);
    return Collections.emptyList();
  }
}
```
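To make the failure mode concrete, here is a minimal standalone sketch (not
Hudi code) of how partition stats aggregated from only a subset of files can
prune a partition that still contains the record. The names `Range`,
`FileEntry`, `naivePrune`, and `safePrune` are hypothetical and exist only for
illustration; `safePrune` shows one possible conservative rule, not the fix
that Hudi implements.

```java
import java.util.List;
import java.util.Optional;

public class PartialStatsPruning {
  /** Min/max of the `id` column; a hypothetical stand-in for HoodieColumnRangeMetadata. */
  record Range(int min, int max) {
    Range merge(Range other) {
      return new Range(Math.min(min, other.min), Math.max(max, other.max));
    }

    boolean contains(int value) {
      return min <= value && value <= max;
    }
  }

  /** A file in the snapshot; stats are empty when the file format did not populate them. */
  record FileEntry(String name, Optional<Range> idStats) {}

  /** Aggregates partition stats from whichever files happen to have stats. */
  static Optional<Range> partitionStats(List<FileEntry> files) {
    return files.stream()
        .map(FileEntry::idStats)
        .flatMap(Optional::stream)
        .reduce(Range::merge);
  }

  /** Naive pruning: trusts the partition stats even if they cover only some files. */
  static boolean naivePrune(List<FileEntry> files, int id) {
    return partitionStats(files).map(range -> !range.contains(id)).orElse(false);
  }

  /** Conservative pruning: prunes only when every file contributed stats. */
  static boolean safePrune(List<FileEntry> files, int id) {
    boolean complete = files.stream().allMatch(f -> f.idStats().isPresent());
    return complete && naivePrune(files, id);
  }

  public static void main(String[] args) {
    List<FileEntry> files = List.of(
        // Base file that holds id = 3 but has no stats (like the Lance file here).
        new FileEntry("base-file.lance", Optional.empty()),
        // Log file from the later deltacommit, with stats covering ids 1..2 only.
        new FileEntry("log-file-1", Optional.of(new Range(1, 2))));

    System.out.println("naive prunes partition: " + naivePrune(files, 3));
    System.out.println("safe prunes partition: " + safePrune(files, 3));
  }
}
```

With the naive rule, the filter `id = 3` falls outside the log-file-only range,
so the partition is skipped even though the base file contains the record.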
I filed #18758 for column stats and partition stats support for Lance file
format.
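
Until that support lands, the workaround above could be sketched as a set of
writer options. This assumes the relevant keys are
`hoodie.metadata.index.column.stats.enable` and
`hoodie.metadata.index.partition.stats.enable`; please verify them against the
Hudi version in use. The class and method names are hypothetical, and the
sketch does not address stats already written to the MDT.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LanceStatsWorkaround {
  /**
   * Builds write options that explicitly disable column stats and partition
   * stats. The key names are assumptions to be checked against your Hudi version.
   */
  static Map<String, String> disableStatsOptions() {
    Map<String, String> opts = new LinkedHashMap<>();
    opts.put("hoodie.metadata.index.column.stats.enable", "false");
    opts.put("hoodie.metadata.index.partition.stats.enable", "false");
    return opts;
  }

  public static void main(String[] args) {
    // Print the options so they can be copied into a writer configuration.
    disableStatsOptions().forEach((key, value) -> System.out.println(key + "=" + value));
  }
}
```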