Re: [PR] Spark 3.5: Fix Migrate procedure renaming issue for custom catalog [iceberg]

via GitHub Wed, 08 Nov 2023 07:34:00 -0800


tomtongue commented on code in PR #8931:
URL: https://github.com/apache/iceberg/pull/8931#discussion_r1386791377



##########
spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/actions/MigrateTableSparkAction.java:
##########
@@ -108,6 +109,23 @@ public MigrateTableSparkAction backupTableName(String tableName) {
     return this;
   }
 
+  @Override
+  public MigrateTableSparkAction destCatalogName(String catalogName) {
+    CatalogManager catalogManager = spark().sessionState().catalogManager();
+
+    CatalogPlugin catalogPlugin;
+    if (catalogManager.isCatalogRegistered(catalogName)) {
+      catalogPlugin = catalogManager.catalog(catalogName);
+    } else {
+      LOG.warn(
+          "{} doesn't exist in SparkSession. " + "Fallback to current SparkSession catalog.",
+          catalogName);
+      catalogPlugin = catalogManager.currentCatalog();
+    }
+    this.destCatalog = checkDestinationCatalog(catalogPlugin);

Review Comment:
   Thanks for the review, @singhpk234. Yes, as you say, the Iceberg 
GlueCatalogImpl copies only "partial" metadata during the rename. So if the 
source Spark/Hive table is partitioned, the restore process fails as 
follows:
   
   ```
   23/11/06 09:54:03 INFO MigrateTableSparkAction: Generating Iceberg metadata for db.tbl in s3://bucket/path/tbl/metadata
   23/11/06 09:54:03 WARN BaseCatalogToHiveConverter: Hive Exception type not found for AccessDeniedException
   23/11/06 09:54:05 INFO ClientConfigurationFactory: Set initial getObject socket timeout to 2000 ms.
   23/11/06 09:54:06 INFO CodeGenerator: Code generated in 230.388332 ms
   23/11/06 09:54:06 INFO CodeGenerator: Code generated in 17.169875 ms
   23/11/06 09:54:06 INFO CodeGenerator: Code generated in 18.598328 ms
   23/11/06 09:54:07 ERROR MigrateTableSparkAction: Failed to perform the migration, aborting table creation and restoring the original table
   23/11/06 09:54:07 INFO MigrateTableSparkAction: Restoring db.tbl from db.tbl_backup
   23/11/06 09:54:08 INFO GlueCatalog: created rename destination table db.tbl
   23/11/06 09:54:08 INFO GlueCatalog: Successfully dropped table db.tbl_backup from Glue
   23/11/06 09:54:08 INFO GlueCatalog: Dropped table: db.tbl_backup
   23/11/06 09:54:08 INFO GlueCatalog: Successfully renamed table from db.tbl_backup to garbagedb.iceberg_migrate_w_year_partition
   Exception in thread "main" org.apache.iceberg.exceptions.ValidationException: Unable to get partition spec for table: `db`.`tbl_backup`
        at org.apache.iceberg.spark.SparkExceptionUtil.toUncheckedException(SparkExceptionUtil.java:55)
        at org.apache.iceberg.spark.SparkTableUtil.importSparkTable(SparkTableUtil.java:415)
        at org.apache.iceberg.spark.SparkTableUtil.importSparkTable(SparkTableUtil.java:460)
   ...
   Caused by: org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Table tbl_backup is not a partitioned table
        at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:133)
        at org.apache.spark.sql.hive.HiveExternalCatalog.doListPartitions(HiveExternalCatalog.scala:1308)
        at org.apache.spark.sql.hive.HiveExternalCatalog.listPartitions(HiveExternalCatalog.scala:1302)
   ...
   Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Table tbl_backup is not a partitioned table
        at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2676)
        at org.apache.hadoop.hive.ql.metadata.Hive.getPartitions(Hive.java:2709)
   ...
   ```
   
   This error occurs because the partition information is lost in the renamed table. 
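   
   To make the failure mode concrete, here is a toy sketch in plain Java. It 
does not use the Glue, Hive, or Iceberg APIs; `Table`, `partialRename`, 
`fullRename`, and `listPartitionKeys` are hypothetical names invented for 
illustration. It only models the assumption that a rename copying the columns 
but not the partition keys leaves a backup table that a later partition 
listing rejects:

```java
import java.util.List;

public class RenameLosesPartitions {

    // Toy stand-in for a catalog table entry; not a Glue or Hive class.
    public static final class Table {
        final String name;
        final List<String> columns;
        final List<String> partitionKeys;

        public Table(String name, List<String> columns, List<String> partitionKeys) {
            this.name = name;
            this.columns = columns;
            this.partitionKeys = partitionKeys;
        }
    }

    // Hypothetical "partial" rename: the columns are copied, the partition keys are not.
    public static Table partialRename(Table source, String newName) {
        return new Table(newName, source.columns, List.of());
    }

    // Hypothetical full rename: all metadata, including partition keys, is carried over.
    public static Table fullRename(Table source, String newName) {
        return new Table(newName, source.columns, source.partitionKeys);
    }

    // Mimics the kind of check that produces "is not a partitioned table".
    public static List<String> listPartitionKeys(Table table) {
        if (table.partitionKeys.isEmpty()) {
            throw new IllegalStateException(
                "Table " + table.name + " is not a partitioned table");
        }
        return table.partitionKeys;
    }

    public static void main(String[] args) {
        Table source = new Table("tbl", List.of("id", "year"), List.of("year"));

        // The partial rename drops the partition keys, so listing them fails.
        Table backup = partialRename(source, "tbl_backup");
        try {
            listPartitionKeys(backup);
        } catch (IllegalStateException e) {
            System.out.println("partial rename: " + e.getMessage());
        }

        // A rename that keeps the partition keys does not hit the error.
        System.out.println("full rename: " + listPartitionKeys(fullRename(source, "tbl_backup")));
    }
}
```

   In this model the import step only fails because the backup table no 
longer reports any partition keys, which matches the `Table tbl_backup is not a 
partitioned table` error in the log above.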
   
   So, as you know, the best way to lift this migrate restriction would be for 
GlueHiveMetastoreClient to support the rename operation. 
   
   There are users who have tried to migrate their tables to Iceberg on a 
custom catalog such as the Glue Catalog, but the migrate procedure cannot be 
used because of the rename restriction. Let me look for a better way to 
resolve this issue. If there is none, I think we need to ask for 
GlueHiveMetastoreClient to support the rename operation.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


