Re: [PR] Initial integration for hudi tables within Polaris [polaris]

via GitHub Fri, 25 Jul 2025 17:14:04 -0700


rahil-c commented on code in PR #1862:
URL: https://github.com/apache/polaris/pull/1862#discussion_r2232267318



##########
plugins/spark/v3.5/spark/src/main/java/org/apache/polaris/spark/utils/PolarisCatalogUtils.java:
##########
@@ -91,6 +109,85 @@ public static Table loadSparkTable(GenericTable 
genericTable) {
         provider, new CaseInsensitiveStringMap(tableProperties), 
scala.Option.empty());
   }
 
+  /**
+   * Extract catalog name from Spark session configuration. Looks for 
configuration like:
+   * spark.sql.catalog.<CATALOG_NAME>=org.apache.polaris.spark.SparkCatalog
+   */
+  private static String getCatalogName() {
+    SparkSession spark = SparkSession.active();
+    String catalogPrefix = "spark.sql.catalog.";
+    String polarisSparkCatalog = "org.apache.polaris.spark.SparkCatalog";
+
+    scala.collection.Iterator<scala.Tuple2<String, String>> configIterator =
+        spark.conf().getAll().iterator();
+    while (configIterator.hasNext()) {
+      scala.Tuple2<String, String> config = configIterator.next();
+      String key = config._1();
+      String value = config._2();
+
+      if (key.startsWith(catalogPrefix) && polarisSparkCatalog.equals(value)) {
+        return key.substring(catalogPrefix.length());
+      }
+    }
+
+    throw new IllegalStateException(
+        "Could not obtain Polaris catalog identifier."
+            + "Expected following configuration to be set in session: 
spark.sql.catalog.<CATALOG_NAME>=org.apache.polaris.spark.SparkCatalog");
+  }
+
+  public static Table loadV1SparkHudiTable(GenericTable genericTable, 
Identifier identifier) {
+    Map<String, String> tableProperties = Maps.newHashMap();
+    tableProperties.putAll(genericTable.getProperties());
+    tableProperties.put(
+        TABLE_PATH_KEY, 
genericTable.getProperties().get(TableCatalog.PROP_LOCATION));
+
+    // Need full identifier in order to construct CatalogTable correctly for 
Hudi
+    String namespacePath = String.join(".", identifier.namespace());
+    TableIdentifier tableIdentifier =
+        new TableIdentifier(
+            identifier.name(), Option.apply(namespacePath), 
Option.apply(getCatalogName()));

Review Comment:
   Hudi is expecting the Spark `V1Table` which will returned by Polaris to 
contain this catalog Identifier. This is because Hudi is currently using Spark 
DSV1 it will invoke the following Spark `Rule`, `FindDataSourceTable` which 
expects this catalog identifier `table.identifier.catalog.get`. If the catalog 
does not return this catalog identifier, then during `.get` it will hit a NPE 
on spark side and error out.
   
   Therefore we will need this in order for the integration to work.
   
   Class is `DataSourceStrategy.scala` in spark code base
   <img width="1359" height="251" alt="Screenshot 2025-07-25 at 4 57 12 PM" 
src="https://github.com/user-attachments/assets/3d586858-2696-434f-b751-158de506a8af";
 />
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Initial integration for hudi tables within Polaris [polaris]

Reply via email to