github-actions[bot] commented on code in PR #64134:
URL: https://github.com/apache/doris/pull/64134#discussion_r3360776696
##########
fe/fe-core/src/main/java/org/apache/doris/datasource/iceberg/IcebergUtils.java:
##########
@@ -1098,6 +1095,44 @@ public static FileFormat getFileFormat(Table icebergTable) {
return fileFormat;
}
+ private static String resolveFileFormatName(Table icebergTable, Map<String, String> properties) {
+ // 1. Check "write-format" (nickname in Flink and Spark)
+ if (properties.containsKey(WRITE_FORMAT)) {
+ return properties.get(WRITE_FORMAT);
+ }
+ // 2. Check "write.format.default" (standard Iceberg property)
+ if (properties.containsKey(TableProperties.DEFAULT_FILE_FORMAT)) {
+ return properties.get(TableProperties.DEFAULT_FILE_FORMAT);
+ }
+ // 3. Check "format" property (e.g., "iceberg/parquet", "iceberg/orc")
+ // This is commonly set on migrated Iceberg tables.
+ if (properties.containsKey(FORMAT)) {
+ return properties.get(FORMAT);
+ }
+ // 4. Last resort: infer from the actual data files in the current snapshot.
+ // This handles migrated tables where none of the above properties are set.
+ return inferFileFormatFromDataFiles(icebergTable);
+ }
+
+ private static String inferFileFormatFromDataFiles(Table icebergTable) {
+ if (icebergTable.currentSnapshot() == null) {
+ LOG.info("Iceberg table {} has no snapshot, defaulting to {}", icebergTable.name(), PARQUET_NAME);
+ return PARQUET_NAME;
+ }
+ try (CloseableIterable<FileScanTask> files = icebergTable.newScan().planFiles()) {
+ java.util.Iterator<FileScanTask> it = files.iterator();
+ if (it.hasNext()) {
+ String format = it.next().file().format().name().toLowerCase();
+ LOG.info("Iceberg table {} inferred file format {} from data files", icebergTable.name(), format);
+ return format;
+ }
+ } catch (Exception e) {
+ LOG.warn("Failed to infer file format from data files for table {}, defaulting to {}",
+ icebergTable.name(), PARQUET_NAME, e);
+ }
+ return PARQUET_NAME;
+ }
+
Review Comment:
This catch-all fallback reintroduces the same wrong-format behavior whenever
inference fails. For a migrated ORC table that lacks
`write-format`/`write.format.default`/`format`, any `planFiles()` failure here
(for example, manifest IO, auth, or catalog errors) is logged and converted to
`PARQUET_NAME`. Scans can then still plan ORC files as Parquet, and the
write/delete/merge paths that call `getFileFormat()` can choose the wrong
file format. Per Doris error-handling rules, this should fail with table
context instead of silently defaulting; only the no-snapshot/no-files case
should use the explicit default.
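
A minimal sketch of the behavior suggested above, not the actual Doris change. To stay self-contained it does not use the Iceberg API: `PlannedFiles` and `FilePlanner` are hypothetical stand-ins for `CloseableIterable<FileScanTask>` and `icebergTable.newScan().planFiles()`, the `hasSnapshot` flag stands in for the `currentSnapshot() == null` check, and `RuntimeException` is an assumed placeholder for whichever exception type Doris prefers here.

```java
import java.io.Closeable;
import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;

final class FormatInferenceSketch {
    static final String PARQUET_NAME = "parquet";

    /** Hypothetical stand-in for CloseableIterable<FileScanTask>; yields format names. */
    interface PlannedFiles extends Iterable<String>, Closeable {}

    /** Hypothetical stand-in for icebergTable.newScan().planFiles(). */
    interface FilePlanner {
        PlannedFiles planFiles() throws IOException;
    }

    static String inferFileFormat(String tableName, boolean hasSnapshot, FilePlanner planner) {
        // No snapshot: nothing to inspect, so the explicit default is acceptable.
        if (!hasSnapshot) {
            return PARQUET_NAME;
        }
        try (PlannedFiles files = planner.planFiles()) {
            Iterator<String> it = files.iterator();
            if (it.hasNext()) {
                return it.next().toLowerCase(Locale.ROOT);
            }
        } catch (Exception e) {
            // Planning failed: surface the error with table context instead of
            // silently treating the table as Parquet.
            throw new RuntimeException(
                    "Failed to infer file format from data files for Iceberg table " + tableName, e);
        }
        // Snapshot exists but contains no data files: explicit default.
        return PARQUET_NAME;
    }

    /** Test helper: wraps a list of format names as PlannedFiles. */
    static PlannedFiles of(List<String> formats) {
        return new PlannedFiles() {
            @Override
            public Iterator<String> iterator() {
                return formats.iterator();
            }

            @Override
            public void close() {
                // Nothing to release in this in-memory stand-in.
            }
        };
    }
}
```

With this shape, only the no-snapshot and no-files paths return `PARQUET_NAME`; a planning failure propagates to the caller with the table name attached.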
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]