Copilot commented on code in PR #10539:
URL: https://github.com/apache/gravitino/pull/10539#discussion_r2986553754
##########
design/aws-glue-catalog-connector.md:
##########
@@ -0,0 +1,471 @@
+# Design: AWS Glue Data Catalog Support for Apache Gravitino
+
+## 1. Problem Statement and Goals
+
+### 1.1 Problem
+
+**Gravitino currently cannot federate AWS Glue Data Catalog.** This is a
significant gap because:
+
+1. **Large user base on AWS**: The majority of cloud-native data lakes run on
AWS with Glue Data Catalog as the central metadata service (default for Athena,
Redshift Spectrum, EMR, Lake Formation). These organizations cannot bring their
Glue metadata into Gravitino's unified management layer.
+2. **No native integration path**: The only workaround is pointing Gravitino's
Hive catalog at Glue's HMS-compatible Thrift endpoint (`metastore.uris =
thrift://...`), which is undocumented, region-limited, and cannot leverage
Glue-native features (catalog ID, cross-account access, VPC endpoints).
+3. **Competitive landscape**: Trino, Spark, and other engines all have
first-class Glue support with dedicated configuration. Users expect the same
from Gravitino.
+
+### 1.2 Goals
+
+After this feature is implemented:
+
+1. **Register AWS Glue Data Catalog in Gravitino**:
+ ```bash
+ # Hive-format tables
+ gcli catalog create --name hive_on_glue --provider hive \
+ --properties metastore-type=glue,s3-region=us-east-1
+
+ # Iceberg-format tables
+ gcli catalog create --name iceberg_on_glue --provider lakehouse-iceberg \
+     --properties catalog-backend=glue,warehouse=s3://bucket/iceberg,s3-region=us-east-1
+ ```
+
+2. **Standard Gravitino API works against Glue catalogs**:
+ ```bash
+ gcli schema list --catalog hive_on_glue
+ gcli table list --catalog hive_on_glue --schema my_database
+   gcli table details --catalog iceberg_on_glue --schema analytics --table events
+ ```
+
+3. **Trino and Spark connect transparently** — Trino uses
`hive.metastore=glue` / `iceberg.catalog.type=glue`; Spark uses
`AWSGlueDataCatalogHiveClientFactory` / `GlueCatalog`. Users query Glue tables
through Gravitino without knowing the underlying mechanism.
+
+4. **AWS-native authentication** (reuses existing S3 properties): static
credentials, STS AssumeRole, or default credential chain (environment
variables, instance profile).
+
+## 2. Background
+
+### 2.1 AWS Glue Data Catalog
+
+AWS Glue Data Catalog is a managed metadata repository storing:
+- **Databases** — logical groupings, equivalent to Gravitino schemas.
+- **Tables** — metadata records containing column definitions, storage
descriptors, partition keys, and user-defined parameters.
+
+Tables come in two formats:
+
+| Format | How Glue Stores It |
+|---|---|
+| **Hive** | Full metadata in `StorageDescriptor` (columns, SerDe, InputFormat, OutputFormat, location). The majority of tables in most Glue catalogs (legacy ETL, Athena CTAS, Redshift Spectrum). |
+| **Iceberg** | `Parameters["table_type"] = "ICEBERG"` and `Parameters["metadata_location"]` pointing to Iceberg metadata JSON on S3. `StorageDescriptor.Columns` is typically empty. Growing rapidly. |
+
+A complete Glue integration must handle both table formats.
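The format-detection convention above can be sketched as a tiny classifier. This is illustrative only: the class and method names are invented for the example and are not part of any proposed Gravitino API.

```java
import java.util.Map;

public class GlueTableFormat {

  /**
   * Classifies a Glue table using the convention described above: a table is
   * Iceberg when Parameters["table_type"] equals "ICEBERG" (case-insensitive);
   * anything else is treated as a Hive-format table.
   */
  public static String formatOf(Map<String, String> tableParameters) {
    String tableType = tableParameters.get("table_type");
    if ("ICEBERG".equalsIgnoreCase(tableType)) {
      return "ICEBERG";
    }
    return "HIVE";
  }
}
```

A standalone `catalog-glue` module would need routing logic like this on every table operation; the chosen design (Section 3) avoids it by splitting the two formats across two catalogs.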
+
+### 2.2 How Query Engines Use Glue
+
+Trino and Spark both have native Glue support — they call the AWS Glue SDK
directly, not via HMS Thrift:
+
+| Engine | Hive Tables on Glue | Iceberg Tables on Glue |
+|---|---|---|
+| **Trino** | Hive connector with `hive.metastore=glue` | Iceberg connector with `iceberg.catalog.type=glue` |
+| **Spark** | Hive catalog with `AWSGlueDataCatalogHiveClientFactory` | Iceberg catalog with `catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog` |
+
+Both engines use a **one-catalog-to-one-connector** model — a single catalog
handles either Hive-format or Iceberg-format tables, not both. This is
consistent with Gravitino's existing catalog model.
+
+### 2.3 Gravitino's Current Architecture
+
+Gravitino's catalog plugin system provides:
+- **Hive catalog** (`provider=hive`): Connects to HMS via Thrift. Client
chain: `HiveCatalogOperations` → `CachedClientPool` → `HiveClientImpl` →
`HiveShimV2/V3` → `IMetaStoreClient`.
+- **Iceberg catalog** (`provider=lakehouse-iceberg`): Supports pluggable
backends (`catalog-backend=hive|jdbc|rest|memory|custom`). Each backend maps to
a different Iceberg `Catalog` implementation.
+- **Trino/Spark connectors**: Property converters translate Gravitino catalog
properties into engine-specific properties.
+
+## 3. Design Alternatives
+
+### Alternative A: New `catalog-glue` Module
+
+Create a standalone `catalogs/catalog-glue/` with its own
`GlueCatalogOperations`, type converters, and entity classes. Directly call the
AWS Glue SDK for both Hive and Iceberg tables.
+
+**Pros**: Full control over Glue-specific behavior. Single catalog for mixed
table formats.
+**Cons**:
+- Duplicates logic already in Hive catalog (type conversion, partition
handling, SerDe parsing) and Iceberg catalog (schema conversion, metadata
loading).
+- Trino/Spark integration requires a "Composite Connector" that routes queries
based on table type — a significant architectural change.
+- Larger implementation surface area and maintenance burden.
+
+### Alternative B: Glue as a Metastore Type (Chosen)
+
+Extend the existing Hive and Iceberg catalogs with Glue as a backend option.
+
+**Pros**:
+- Reuses all existing catalog logic, type conversion, property handling, and
entity models.
+- Trino/Spark integration works almost for free — both engines already have
native Glue support.
+- Much smaller change set (~15 files modified, 1 new file vs. ~15 new files).
+- Consistent with how Trino and Spark model Glue (as a metastore variant, not
a separate catalog type).
+
+**Cons**:
+- Users must create two Gravitino catalogs to cover both Hive and Iceberg
tables from the same Glue Data Catalog.
+- Cannot add Glue-only features (e.g., Glue crawlers) without extending the
generic interfaces.
+
+**Decision**: Alternative B — the reuse benefits and Trino/Spark alignment
outweigh the minor UX cost of two catalogs.
+
+## 4. Detailed Design
+
+### 4.1 Configuration Properties
+
+Gravitino already defines standardized AWS/S3 properties in
`S3Properties.java`:
+
+| Existing Property | Used By |
+|---|---|
+| `s3-access-key-id` / `s3-secret-access-key` | Iceberg, Hive (S3 storage + Glue auth) |
+| `s3-region` | Iceberg, Hive (S3 storage + Glue region) |
+| `s3-role-arn` / `s3-external-id` | Iceberg, Hive (STS AssumeRole) |
+| `s3-endpoint` | Iceberg, Hive (custom S3 endpoint) |
+
+We **reuse `s3-region` as the AWS region** (Glue and S3 are always co-located)
and **reuse `s3-access-key-id` / `s3-secret-access-key` for authentication**.
Only two new Glue-specific properties:
Review Comment:
The statement that "Glue and S3 are always co-located" is not accurate: Glue
Data Catalog is regional, and S3 buckets (or endpoints) can be in a different
region than the Glue catalog. If the design intends to reuse `s3-region` as the
default Glue region, consider rewording this to avoid the absolute claim and/or
introduce an explicit Glue/AWS region property that can override `s3-region`
when needed.
```suggestion
We **reuse `s3-region` as the AWS region** (used as the default region for
both Glue and S3 in this connector) and **reuse `s3-access-key-id` /
`s3-secret-access-key` for authentication**. Only two new Glue-specific
properties:
```
+
+| New Property | Required | Description |
+|---|---|---|
+| `aws-glue-catalog-id` | No | Glue catalog ID (defaults to caller's AWS account). For cross-account access. |
+| `aws-glue-endpoint` | No | Custom Glue endpoint (for VPC endpoints or testing). |
+
+**Authentication priority**: Static credentials → STS AssumeRole
(`s3-role-arn`) → Default credential chain (environment variables, instance
profile).
+
+### 4.2 Iceberg Catalog + Glue Backend
+
+Add `GLUE` as a new `IcebergCatalogBackend` enum value. Use Iceberg's built-in
`org.apache.iceberg.aws.glue.GlueCatalog`.
+
+#### Data Flow
+
+```
+User: catalog-backend=glue, warehouse=s3://..., s3-region=us-east-1
+ → IcebergCatalogOperations.initialize()
+ → IcebergCatalogUtil.loadCatalogBackend(GLUE, config)
+ → loadGlueCatalog(config)
+ → new GlueCatalog().initialize("glue", {
+ "warehouse": "s3://...",
+ "client.region": "us-east-1",
+ "glue.catalog-id": "..." })
+ → All existing IcebergCatalogOperations methods work unchanged
+```
+
+`GlueCatalog` is an official Iceberg implementation with full Schema CRUD +
Table CRUD support — this is the lowest-risk part of the design.
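The initialization step in the flow above amounts to a property translation. The sketch below uses only the key names shown in the flow (`warehouse`, `client.region`, `glue.catalog-id`); the helper class and method themselves are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class GlueBackendProps {

  /**
   * Builds the property map that a loadGlueCatalog(config) step would hand
   * to GlueCatalog.initialize(), per the data flow above.
   */
  public static Map<String, String> toIcebergProps(Map<String, String> gravitinoProps) {
    Map<String, String> props = new HashMap<>();
    props.put("warehouse", gravitinoProps.get("warehouse"));
    props.put("client.region", gravitinoProps.get("s3-region"));
    String catalogId = gravitinoProps.get("aws-glue-catalog-id");
    if (catalogId != null) {
      // Optional: only set for cross-account access (see Section 4.1).
      props.put("glue.catalog-id", catalogId);
    }
    return props;
  }
}
```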
+
+#### Engine Integration
+
+**Trino** — `IcebergCatalogPropertyConverter.java`: Add `case "glue":` →
`iceberg.catalog.type=glue` + AWS region/catalog-id.
+
+**Spark** — No code change needed. The existing generic
`all.put(ICEBERG_CATALOG_TYPE, catalogBackend)` already handles `"glue"`.
+
+### 4.3 Hive Catalog + Glue Backend
+
+Add a `metastore-type=glue` property. Use AWS's `aws-glue-datacatalog-hive3-client` library, which provides an `IMetaStoreClient` implementation backed by the Glue SDK.
+
+#### Data Flow
+
+```
+User: metastore-type=glue, s3-region=us-east-1
+  → HiveCatalogOperations.initialize()
+    → mergeProperties(conf) — maps Glue properties
+    → CachedClientPool(properties)
+      → HiveClientPool.newClient()
+        → HiveClientFactory.createHiveClient()             ← MODIFIED: skip hive2/3 detection
+          → HiveClientClassLoader.createLoader(HIVE3, ...) ← always Hive3 for Glue
+          → HiveClientImpl(HIVE3, properties)
+            → detects metastore.type=glue
+              → new GlueShim(properties)                   ← NEW (replaces HiveShimV3)
+                → createMetaStoreClient()
+                  → AWSGlueDataCatalogHiveClientFactory.create(hiveConf)
+                  → returns AWSCatalogMetastoreClient (implements IMetaStoreClient)
+  → All existing HiveCatalogOperations methods work unchanged
+```
+
+#### GlueShim and Hive2/Hive3 Compatibility
+
+**Problem**: `HiveClientFactory.createHiveClient()` probes the remote HMS to
detect Hive2 vs Hive3 (tries `getCatalogs()`, falls back on error). This
detection is irrelevant for Glue — there is no remote HMS to probe.
+
+**Solution**: When `metastore.type=glue`, skip version detection and always
use Hive3 classloader:
+
+```java
+// In HiveClientFactory.createHiveClient():
+public HiveClient createHiveClient() {
+ String metastoreType = properties.getProperty("metastore.type", "hive");
+ if ("glue".equalsIgnoreCase(metastoreType)) {
+ return createGlueClient(); // Always Hive3, no probe
+ }
+ // ... existing hive2/hive3 detection logic unchanged ...
+}
+
+private HiveClient createGlueClient() {
+  // NOTE: for this double-checked locking pattern to be safe,
+  // backendClassLoader must be declared volatile.
+  if (backendClassLoader == null) {
+    synchronized (classLoaderLock) {
+      if (backendClassLoader == null) {
+        backendClassLoader = HiveClientClassLoader.createLoader(
+            HIVE3, Thread.currentThread().getContextClassLoader());
+      }
+    }
+  }
+  return createHiveClientInternal(backendClassLoader);
+}
+```
+
+**Why Hive3 classloader?**
+
+1. **JAR loading path**: `HiveClientClassLoader.getJarDirectory()` maps
`HIVE3` → `hive-metastore3-libs/`. The Glue client JAR is placed in this
directory (see Section 4.4).
+2. **API compatibility**: AWS provides `aws-glue-datacatalog-hive2-client` and
`aws-glue-datacatalog-hive3-client`. The `IMetaStoreClient` interfaces differ
between versions (Hive3 adds catalog-aware methods). The JAR must match the
Hive version in the same directory. We choose Hive3 as the actively maintained
variant.
+
+**Future extension**: For Hive2 environments, add
`aws-glue-datacatalog-hive2-client` to `hive-metastore2-libs` and select
classloader by configuration.
+
+#### GlueShim Design
+
+`GlueShim` extends `HiveShimV2` and overrides only `createMetaStoreClient()`:
+
+| Shim | `createMetaStoreClient()` Implementation |
+|---|---|
+| `HiveShimV2` | `RetryingMetaStoreClient.getProxy(hiveConf)` → Thrift to HMS |
+| `HiveShimV3` | Same as V2 (V3 only adds catalog-aware method overrides) |
+| `GlueShim` | `AWSGlueDataCatalogHiveClientFactory.create(hiveConf)` → Glue SDK |
+
+All three return `IMetaStoreClient`. `HiveClientImpl` selects the shim:
+
+```java
+// In HiveClientImpl constructor:
+String metastoreType = properties.getProperty("metastore.type", "hive");
+if ("glue".equalsIgnoreCase(metastoreType)) {
+ shim = new GlueShim(properties);
+} else {
+ switch (hiveVersion) {
+ case HIVE2: shim = new HiveShimV2(properties); break;
+ case HIVE3: shim = new HiveShimV3(properties); break;
+ }
+}
+```
+
+All upstream code (`HiveClientPool`, `CachedClientPool`,
`HiveCatalogOperations`) is unchanged — it programs against the `HiveClient`
interface.
+
+#### IMetaStoreClient Relationship
+
+```
+org.apache.hadoop.hive.metastore.IMetaStoreClient ← Hive standard interface
+ ├── HiveMetaStoreClient (Thrift impl, connects to HMS)
+ └── AWSCatalogMetastoreClient (Glue impl, via AWS Glue SDK)
+ └── Created by AWSGlueDataCatalogHiveClientFactory.create(hiveConf)
+```
+
+`AWSCatalogMetastoreClient` is a drop-in replacement for
`HiveMetaStoreClient`. All upstream code is completely unaware of the
difference.
+
+#### Engine Integration
+
+**Trino** — `HiveConnectorAdapter.java`:
+- When `metastore-type=glue`: set `hive.metastore=glue` +
`hive.metastore.glue.region` + `hive.metastore.glue.catalogid`.
+- When `hive` (default): existing `hive.metastore.uri` path unchanged.
+
+**Spark** — `HivePropertiesConverter.java`:
+- When `metastore-type=glue`: set
`spark.hadoop.hive.metastore.client.factory.class =
com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory`.
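The Trino branch described above is essentially a property rewrite. A sketch under stated assumptions: the Trino property names are taken from the design text, while the class, method, and the `metastore.uris` fallback key are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

public class GlueEnginePropsSketch {

  /**
   * Sketch of the HiveConnectorAdapter branch described above: when the
   * catalog uses metastore-type=glue, emit Glue-specific Trino properties;
   * otherwise keep the existing HMS Thrift URI wiring unchanged.
   */
  public static Map<String, String> toTrinoHiveProps(Map<String, String> catalogProps) {
    Map<String, String> trino = new HashMap<>();
    if ("glue".equalsIgnoreCase(catalogProps.getOrDefault("metastore-type", "hive"))) {
      trino.put("hive.metastore", "glue");
      trino.put("hive.metastore.glue.region", catalogProps.get("s3-region"));
      String catalogId = catalogProps.get("aws-glue-catalog-id");
      if (catalogId != null) {
        trino.put("hive.metastore.glue.catalogid", catalogId);
      }
    } else {
      // Default path: pass the Thrift metastore URI through unchanged.
      trino.put("hive.metastore.uri", catalogProps.get("metastore.uris"));
    }
    return trino;
  }
}
```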
+
+### 4.4 Dependency Management
+
+#### Iceberg + Glue
+
+| Dependency | Target Module | Scope |
+|---|---|---|
+| `org.apache.iceberg:iceberg-aws` — Contains `GlueCatalog` implementation. Transitively depends on `software.amazon.awssdk:glue`. Already in version catalog as `libs.iceberg.aws`. | `iceberg/iceberg-common/build.gradle.kts` | `compileOnly` (provided at runtime by `bundles/bundle-aws`) |
Review Comment:
The dependency table says `iceberg-aws` is "provided at runtime by
`bundles/bundle-aws`", but this repo uses `bundles/iceberg-aws-bundle` (and
also has `bundles/aws-bundle`). To avoid confusion when implementing/packaging,
please update this to reference the actual bundle module/directory that
supplies Iceberg AWS classes at runtime (e.g., `iceberg-aws-bundle`).
```suggestion
| `org.apache.iceberg:iceberg-aws` — Contains `GlueCatalog` implementation. Transitively depends on `software.amazon.awssdk:glue`. Already in version catalog as `libs.iceberg.aws`. | `iceberg/iceberg-common/build.gradle.kts` | `compileOnly` (provided at runtime by `bundles/iceberg-aws-bundle`) |
```
##########
design/aws-glue-catalog-connector.md:
##########
@@ -0,0 +1,471 @@
+# Design: AWS Glue Data Catalog Support for Apache Gravitino
+
+## 1. Problem Statement and Goals
+
+### 1.1 Problem
+
+**Gravitino currently cannot federate AWS Glue Data Catalog.** This is a
significant gap because:
+
+1. **Large user base on AWS**: The majority of cloud-native data lakes run on
AWS with Glue Data Catalog as the central metadata service (default for Athena,
Redshift Spectrum, EMR, Lake Formation). These organizations cannot bring their
Glue metadata into Gravitino's unified management layer.
+2. **No native integration path**: The only workaround is pointing Gravitino's
Hive catalog at Glue's HMS-compatible Thrift endpoint (`metastore.uris =
thrift://...`), which is undocumented, region-limited, and cannot leverage
Glue-native features (catalog ID, cross-account access, VPC endpoints).
+3. **Competitive landscape**: Trino, Spark, and other engines all have
first-class Glue support with dedicated configuration. Users expect the same
from Gravitino.
+
+### 1.2 Goals
+
+After this feature is implemented:
+
+1. **Register AWS Glue Data Catalog in Gravitino**:
+ ```bash
+ # Hive-format tables
+ gcli catalog create --name hive_on_glue --provider hive \
+ --properties metastore-type=glue,s3-region=us-east-1
+
+ # Iceberg-format tables
+ gcli catalog create --name iceberg_on_glue --provider lakehouse-iceberg \
+ --properties
catalog-backend=glue,warehouse=s3://bucket/iceberg,s3-region=us-east-1
+ ```
+
+2. **Standard Gravitino API works against Glue catalogs**:
+ ```bash
+ gcli schema list --catalog hive_on_glue
+ gcli table list --catalog hive_on_glue --schema my_database
+ gcli table details --catalog iceberg_on_glue --schema analytics --table
events
+ ```
+
+3. **Trino and Spark connect transparently** — Trino uses
`hive.metastore=glue` / `iceberg.catalog.type=glue`; Spark uses
`AWSGlueDataCatalogHiveClientFactory` / `GlueCatalog`. Users query Glue tables
through Gravitino without knowing the underlying mechanism.
+
+4. **AWS-native authentication** (reuses existing S3 properties): static
credentials, STS AssumeRole, or default credential chain (environment
variables, instance profile).
+
+## 2. Background
+
+### 2.1 AWS Glue Data Catalog
+
+AWS Glue Data Catalog is a managed metadata repository storing:
+- **Databases** — logical groupings, equivalent to Gravitino schemas.
+- **Tables** — metadata records containing column definitions, storage
descriptors, partition keys, and user-defined parameters.
+
+Tables come in two formats:
+
+| Format | How Glue Stores It |
+|---|---|
+| **Hive** | Full metadata in `StorageDescriptor` (columns, SerDe,
InputFormat, OutputFormat, location). The majority of tables in most Glue
catalogs (legacy ETL, Athena CTAS, Redshift Spectrum). |
+| **Iceberg** | `Parameters["table_type"] = "ICEBERG"` and
`Parameters["metadata_location"]` pointing to Iceberg metadata JSON on S3.
`StorageDescriptor.Columns` is typically empty. Growing rapidly. |
+
+A complete Glue integration must handle both table formats.
+
+### 2.2 How Query Engines Use Glue
+
+Trino and Spark both have native Glue support — they call the AWS Glue SDK
directly, not via HMS Thrift:
+
+| Engine | Hive Tables on Glue | Iceberg Tables on Glue |
+|---|---|---|
+| **Trino** | Hive connector with `hive.metastore=glue` | Iceberg connector
with `iceberg.catalog.type=glue` |
+| **Spark** | Hive catalog with `AWSGlueDataCatalogHiveClientFactory` |
Iceberg catalog with `catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog` |
+
+Both engines use a **one-catalog-to-one-connector** model — a single catalog
handles either Hive-format or Iceberg-format tables, not both. This is
consistent with Gravitino's existing catalog model.
+
+### 2.3 Gravitino's Current Architecture
+
+Gravitino's catalog plugin system provides:
+- **Hive catalog** (`provider=hive`): Connects to HMS via Thrift. Client
chain: `HiveCatalogOperations` → `CachedClientPool` → `HiveClientImpl` →
`HiveShimV2/V3` → `IMetaStoreClient`.
+- **Iceberg catalog** (`provider=lakehouse-iceberg`): Supports pluggable
backends (`catalog-backend=hive|jdbc|rest|memory|custom`). Each backend maps to
a different Iceberg `Catalog` implementation.
+- **Trino/Spark connectors**: Property converters translate Gravitino catalog
properties into engine-specific properties.
+
+## 3. Design Alternatives
+
+### Alternative A: New `catalog-glue` Module
+
+Create a standalone `catalogs/catalog-glue/` with its own
`GlueCatalogOperations`, type converters, and entity classes. Directly call the
AWS Glue SDK for both Hive and Iceberg tables.
+
+**Pros**: Full control over Glue-specific behavior. Single catalog for mixed
table formats.
+**Cons**:
+- Duplicates logic already in Hive catalog (type conversion, partition
handling, SerDe parsing) and Iceberg catalog (schema conversion, metadata
loading).
+- Trino/Spark integration requires a "Composite Connector" that routes queries
based on table type — a significant architectural change.
+- Larger implementation surface area and maintenance burden.
+
+### Alternative B: Glue as a Metastore Type (Chosen)
+
+Extend the existing Hive and Iceberg catalogs with Glue as a backend option.
+
+**Pros**:
+- Reuses all existing catalog logic, type conversion, property handling, and
entity models.
+- Trino/Spark integration works almost for free — both engines already have
native Glue support.
+- Much smaller change set (~15 files modified, 1 new file vs. ~15 new files).
+- Consistent with how Trino and Spark model Glue (as a metastore variant, not
a separate catalog type).
+
+**Cons**:
+- Users must create two Gravitino catalogs to cover both Hive and Iceberg
tables from the same Glue Data Catalog.
+- Cannot add Glue-only features (e.g., Glue crawlers) without extending the
generic interfaces.
+
+**Decision**: Alternative B — the reuse benefits and Trino/Spark alignment
outweigh the minor UX cost of two catalogs.
+
+## 4. Detailed Design
+
+### 4.1 Configuration Properties
+
+Gravitino already defines standardized AWS/S3 properties in
`S3Properties.java`:
+
+| Existing Property | Used By |
+|---|---|
+| `s3-access-key-id` / `s3-secret-access-key` | Iceberg, Hive (S3 storage +
Glue auth) |
+| `s3-region` | Iceberg, Hive (S3 storage + Glue region) |
+| `s3-role-arn` / `s3-external-id` | Iceberg, Hive (STS AssumeRole) |
+| `s3-endpoint` | Iceberg, Hive (custom S3 endpoint) |
+
+We **reuse `s3-region` as the AWS region** (Glue and S3 are always co-located)
and **reuse `s3-access-key-id` / `s3-secret-access-key` for authentication**.
Only two new Glue-specific properties:
+
+| New Property | Required | Description |
+|---|---|---|
+| `aws-glue-catalog-id` | No | Glue catalog ID (defaults to caller's AWS
account). For cross-account access. |
+| `aws-glue-endpoint` | No | Custom Glue endpoint (for VPC endpoints or
testing). |
+
+**Authentication priority**: Static credentials → STS AssumeRole
(`s3-role-arn`) → Default credential chain (environment variables, instance
profile).
+
+### 4.2 Iceberg Catalog + Glue Backend
+
+Add `GLUE` as a new `IcebergCatalogBackend` enum value. Use Iceberg's built-in
`org.apache.iceberg.aws.glue.GlueCatalog`.
+
+#### Data Flow
+
+```
+User: catalog-backend=glue, warehouse=s3://..., s3-region=us-east-1
+ → IcebergCatalogOperations.initialize()
+ → IcebergCatalogUtil.loadCatalogBackend(GLUE, config)
+ → loadGlueCatalog(config)
+ → new GlueCatalog().initialize("glue", {
+ "warehouse": "s3://...",
+ "client.region": "us-east-1",
+ "glue.catalog-id": "..." })
+ → All existing IcebergCatalogOperations methods work unchanged
+```
+
+`GlueCatalog` is an official Iceberg implementation with full Schema CRUD +
Table CRUD support — this is the lowest-risk part of the design.
+
+#### Engine Integration
+
+**Trino** — `IcebergCatalogPropertyConverter.java`: Add `case "glue":` →
`iceberg.catalog.type=glue` + AWS region/catalog-id.
+
+**Spark** — No code change needed. The existing generic
`all.put(ICEBERG_CATALOG_TYPE, catalogBackend)` already handles `"glue"`.
+
+### 4.3 Hive Catalog + Glue Backend
+
+Add `metastore-type=glue` property. Use AWS's
`aws-glue-datacatalog-hive3-client` library which provides an
`IMetaStoreClient` implementation backed by the Glue SDK.
+
+#### Data Flow
+
+```
+User: metastore-type=glue, s3-region=us-east-1
+ → HiveCatalogOperations.initialize()
+ → mergeProperties(conf) — maps Glue properties
+ → CachedClientPool(properties)
+ → HiveClientPool.newClient()
+ → HiveClientFactory.createHiveClient() ← MODIFIED: skip hive2/3
detection
+ → HiveClientClassLoader.createLoader(HIVE3, ...) ← always Hive3 for
Glue
+ → HiveClientImpl(HIVE3, properties)
+ → detects metastore.type=glue
+ → new GlueShim(properties) ← NEW (replaces
HiveShimV3)
+ → createMetaStoreClient()
+ → AWSGlueDataCatalogHiveClientFactory.create(hiveConf)
+ → returns AWSCatalogMetastoreClient (implements
IMetaStoreClient)
+ → All existing HiveCatalogOperations methods work unchanged
+```
+
+#### GlueShim and Hive2/Hive3 Compatibility
+
+**Problem**: `HiveClientFactory.createHiveClient()` probes the remote HMS to
detect Hive2 vs Hive3 (tries `getCatalogs()`, falls back on error). This
detection is irrelevant for Glue — there is no remote HMS to probe.
+
+**Solution**: When `metastore.type=glue`, skip version detection and always
use Hive3 classloader:
+
+```java
+// In HiveClientFactory.createHiveClient():
+public HiveClient createHiveClient() {
+ String metastoreType = properties.getProperty("metastore.type", "hive");
+ if ("glue".equalsIgnoreCase(metastoreType)) {
+ return createGlueClient(); // Always Hive3, no probe
+ }
+ // ... existing hive2/hive3 detection logic unchanged ...
+}
+
+private HiveClient createGlueClient() {
+ if (backendClassLoader == null) {
+ synchronized (classLoaderLock) {
+ if (backendClassLoader == null) {
+ backendClassLoader = HiveClientClassLoader.createLoader(
+ HIVE3, Thread.currentThread().getContextClassLoader());
+ }
+ }
+ }
+ return createHiveClientInternal(backendClassLoader);
+}
+```
+
+**Why Hive3 classloader?**
+
+1. **JAR loading path**: `HiveClientClassLoader.getJarDirectory()` maps
`HIVE3` → `hive-metastore3-libs/`. The Glue client JAR is placed in this
directory (see Section 4.4).
+2. **API compatibility**: AWS provides `aws-glue-datacatalog-hive2-client` and
`aws-glue-datacatalog-hive3-client`. The `IMetaStoreClient` interfaces differ
between versions (Hive3 adds catalog-aware methods). The JAR must match the
Hive version in the same directory. We choose Hive3 as the actively maintained
variant.
+
+**Future extension**: For Hive2 environments, add
`aws-glue-datacatalog-hive2-client` to `hive-metastore2-libs` and select
classloader by configuration.
+
+#### GlueShim Design
+
+`GlueShim` extends `HiveShimV2` and overrides only `createMetaStoreClient()`:
+
+| Shim | `createMetaStoreClient()` Implementation |
+|---|---|
+| `HiveShimV2` | `RetryingMetaStoreClient.getProxy(hiveConf)` → Thrift to HMS |
+| `HiveShimV3` | Same as V2 (V3 only adds catalog-aware method overrides) |
+| `GlueShim` | `AWSGlueDataCatalogHiveClientFactory.create(hiveConf)` → Glue SDK |
+
+All three return `IMetaStoreClient`. `HiveClientImpl` selects the shim:
+
+```java
+// In HiveClientImpl constructor:
+String metastoreType = properties.getProperty("metastore.type", "hive");
+if ("glue".equalsIgnoreCase(metastoreType)) {
+ shim = new GlueShim(properties);
+} else {
+ switch (hiveVersion) {
+ case HIVE2: shim = new HiveShimV2(properties); break;
+ case HIVE3: shim = new HiveShimV3(properties); break;
+ }
+}
+```
+
+All upstream code (`HiveClientPool`, `CachedClientPool`,
`HiveCatalogOperations`) is unchanged — it programs against the `HiveClient`
interface.
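
The shim relationship described above can be sketched as a small, self-contained Java program. This is an illustrative model only — the real `HiveShim`/`GlueShim` classes live in Gravitino's `hive-metastore-common` module and return `IMetaStoreClient`; the `MetaStoreClient` stub and the string "backends" here are stand-ins:

```java
import java.util.Properties;

// Stub for IMetaStoreClient; the real code returns the Hive interface.
interface MetaStoreClient {
  String backend();
}

// Base shim: default metastore client creation goes over Thrift to HMS.
class HiveShim {
  protected final Properties properties;

  HiveShim(Properties properties) {
    this.properties = properties;
  }

  MetaStoreClient createMetaStoreClient() {
    return () -> "thrift-hms"; // stands in for RetryingMetaStoreClient.getProxy(...)
  }
}

// GlueShim overrides only client creation, inheriting everything else.
class GlueShim extends HiveShim {
  GlueShim(Properties properties) {
    super(properties);
  }

  @Override
  MetaStoreClient createMetaStoreClient() {
    return () -> "aws-glue-sdk"; // stands in for AWSGlueDataCatalogHiveClientFactory.create(...)
  }
}

public class ShimSketch {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("metastore.type", "glue");
    HiveShim shim =
        "glue".equalsIgnoreCase(props.getProperty("metastore.type", "hive"))
            ? new GlueShim(props)
            : new HiveShim(props);
    System.out.println(shim.createMetaStoreClient().backend()); // prints "aws-glue-sdk"
  }
}
```

The key point the sketch demonstrates: callers hold a `HiveShim` reference and never observe which concrete client creation ran.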
+
+#### IMetaStoreClient Relationship
+
+```
+org.apache.hadoop.hive.metastore.IMetaStoreClient ← Hive standard interface
+ ├── HiveMetaStoreClient (Thrift impl, connects to HMS)
+ └── AWSCatalogMetastoreClient (Glue impl, via AWS Glue SDK)
+ └── Created by AWSGlueDataCatalogHiveClientFactory.create(hiveConf)
+```
+
+`AWSCatalogMetastoreClient` is a drop-in replacement for
`HiveMetaStoreClient`. All upstream code is completely unaware of the
difference.
+
+#### Engine Integration
+
+**Trino** — `HiveConnectorAdapter.java`:
+- When `metastore-type=glue`: set `hive.metastore=glue` + `hive.metastore.glue.region` + `hive.metastore.glue.catalogid`.
+- When `hive` (default): existing `hive.metastore.uri` path unchanged.
+
+**Spark** — `HivePropertiesConverter.java`:
+- When `metastore-type=glue`: set `spark.hadoop.hive.metastore.client.factory.class = com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory`.
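
A hedged sketch of this Spark-side mapping — only the two property keys and the factory class name come from the design above; the class and method names (`SparkGluePropertySketch`, `toSparkProperties`) are illustrative stand-ins for the converter logic:

```java
import java.util.HashMap;
import java.util.Map;

public class SparkGluePropertySketch {
  static final String GLUE_FACTORY =
      "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory";

  // Translate Gravitino catalog properties into Spark session configs.
  static Map<String, String> toSparkProperties(Map<String, String> catalogProps) {
    Map<String, String> spark = new HashMap<>();
    if ("glue".equalsIgnoreCase(catalogProps.getOrDefault("metastore-type", "hive"))) {
      // Glue path: swap Spark's embedded Hive client factory for the Glue one.
      spark.put("spark.hadoop.hive.metastore.client.factory.class", GLUE_FACTORY);
    } else if (catalogProps.containsKey("metastore.uris")) {
      // Default HMS path: pass the Thrift URIs through unchanged.
      spark.put("spark.hadoop.hive.metastore.uris", catalogProps.get("metastore.uris"));
    }
    return spark;
  }

  public static void main(String[] args) {
    Map<String, String> props = new HashMap<>();
    props.put("metastore-type", "glue");
    System.out.println(toSparkProperties(props));
  }
}
```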
+
+### 4.4 Dependency Management
+
+#### Iceberg + Glue
+
+| Dependency | Target Module | Scope |
+|---|---|---|
+| `org.apache.iceberg:iceberg-aws` — Contains `GlueCatalog` implementation. Transitively depends on `software.amazon.awssdk:glue`. Already in version catalog as `libs.iceberg.aws`. | `iceberg/iceberg-common/build.gradle.kts` | `compileOnly` (provided at runtime by `bundles/bundle-aws`) |
+
+No changes to `gradle/libs.versions.toml` required.
+
+#### Hive + Glue
+
+| Dependency | Target Module | Scope |
+|---|---|---|
+| `com.amazonaws:aws-glue-datacatalog-hive3-client` — Implements `IMetaStoreClient` via Glue SDK. Provides `AWSGlueDataCatalogHiveClientFactory`. | `catalogs/hive-metastore3-libs/build.gradle.kts` | `implementation` (packaged into `hive-metastore3-libs/`) |
+
+**Why `hive-metastore3-libs`?** The Hive catalog uses `HiveClientClassLoader`
for class isolation — it loads JARs from `hive-metastore2-libs/` or
`hive-metastore3-libs/`. GlueShim uses the Hive3 classloader (see Section 4.3),
so the Glue client JAR must be in `hive-metastore3-libs`.
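
The version-to-directory constraint can be stated as a tiny lookup; the enum and method names below are illustrative stand-ins for what `HiveClientClassLoader.getJarDirectory()` is described as doing:

```java
public class JarDirectorySketch {
  enum HiveVersion { HIVE2, HIVE3 }

  // Mirrors the HIVE2/HIVE3 → libs-directory mapping described above.
  static String jarDirectory(HiveVersion version) {
    switch (version) {
      case HIVE2:
        return "hive-metastore2-libs/";
      case HIVE3:
        return "hive-metastore3-libs/"; // where the Glue client JAR must live
      default:
        throw new IllegalArgumentException("Unknown Hive version: " + version);
    }
  }

  public static void main(String[] args) {
    System.out.println(jarDirectory(HiveVersion.HIVE3)); // prints "hive-metastore3-libs/"
  }
}
```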
+
+### 4.5 End-to-End Architecture
+
+```
+                      Gravitino Server
+                             |
+        +--------------------+--------------------+
+        |                                         |
+  provider=hive                      provider=lakehouse-iceberg
+  metastore-type=glue                catalog-backend=glue
+        |                                         |
+  HiveCatalogOperations              IcebergCatalogOperations
+        |                                         |
+  HiveClientImpl                     IcebergCatalogUtil
+    -> GlueShim                        -> loadGlueCatalog()
+    -> AWSCatalogMetastoreClient       -> org.apache.iceberg.aws.glue.GlueCatalog
+       (impl IMetaStoreClient)            (impl org.apache.iceberg.catalog.Catalog)
+        |                                         |
+        +------------- AWS Glue SDK --------------+
+                             |
+                   AWS Glue Data Catalog
+                             |
+                  +----------+----------+
+                  |                     |
+             Hive Tables          Iceberg Tables
+          (StorageDescriptor)    (metadata_location)
+
+
+                        Query Engines
+                             |
+        +---- Trino ----+         +---- Spark ----+
+        |               |         |               |
+  Hive Connector  Iceberg Connector  HiveCatalog  SparkCatalog
+  metastore=glue  catalog.type=glue  factory=AWS  catalog-impl=GlueCatalog
+```
+
+## 5. Testing Strategy
+
+### 5.1 Unit Tests
+
+**Property conversion** — extend existing test classes:
+
+| Test Class | New Test Cases |
+|---|---|
+| `TestIcebergCatalogPropertyConverter` | `testGlueBackendProperty()`, `testGlueBackendMissingWarehouse()` |
+| `TestHiveConnectorAdapter` | `testBuildGlueConfig()`, `testBuildGlueConfigWithCatalogId()` |
+
+**Hive client routing** — new `TestHiveClientImpl` (in
`hive-metastore-common/src/test/`):
+- `testGlueShimSelection()`: verify `GlueShim` created when
`metastore.type=glue`
+- `testDefaultHiveShimSelection()`: verify `HiveShimV2/V3` when
`metastore.type=hive` or unset
+- `testHiveClientFactorySkipsProbeForGlue()`: verify Hive3 classloader used
directly
+
+**GlueShim** — new `TestGlueShim` (in `hive-metastore-common/src/test/`):
+- `testCreateMetaStoreClient()`: mock factory, verify correct invocation
+- `testGlueShimExtendsHiveShimV2()`: verify inheritance
+- `testGluePropertiesPassedToHiveConf()`: verify property propagation
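
The intent of `testGlueShimSelection()` and `testDefaultHiveShimSelection()` can be sketched with the routing rule inlined. Real tests would construct `HiveClientImpl`; the `selectShim` helper here is a hypothetical stand-in so the sketch is self-contained:

```java
import java.util.Properties;

public class ShimSelectionTestSketch {
  // Inlined version of the shim-selection logic from Section 4.3.
  static String selectShim(Properties props, String hiveVersion) {
    if ("glue".equalsIgnoreCase(props.getProperty("metastore.type", "hive"))) {
      return "GlueShim"; // glue wins regardless of detected Hive version
    }
    return "HIVE2".equals(hiveVersion) ? "HiveShimV2" : "HiveShimV3";
  }

  public static void main(String[] args) {
    Properties glue = new Properties();
    glue.setProperty("metastore.type", "glue");

    // testGlueShimSelection: glue overrides the probed version.
    System.out.println(selectShim(glue, "HIVE2")); // prints "GlueShim"

    // testDefaultHiveShimSelection: unset metastore.type keeps HMS routing.
    System.out.println(selectShim(new Properties(), "HIVE2")); // prints "HiveShimV2"
    System.out.println(selectShim(new Properties(), "HIVE3")); // prints "HiveShimV3"
  }
}
```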
+
+### 5.2 Integration Tests — Reuse Existing Test Framework
+
+The project has well-established integration test inheritance hierarchies.
Glue tests inherit from existing parent classes — only override environment
initialization.
+
+**Catalog operations**:
+
+| New Test Class | Extends | Override |
+|---|---|---|
+| `CatalogHiveGlueIT` | `CatalogHive2IT` (23 tests) | `startNecessaryContainer()` → LocalStack; `createCatalogProperties()` → `metastore-type=glue` |
+| `CatalogIcebergGlueIT` | `CatalogIcebergBaseIT` (15 tests) | `initIcebergCatalogProperties()` → `catalog-backend=glue` |
+
+Example:
+```java
+@Tag("gravitino-docker-test")
+public class CatalogHiveGlueIT extends CatalogHive2IT {
+ @Override
+ protected void startNecessaryContainer() {
+ containerSuite.startLocalStackContainer();
+ }
+
+ @Override
+ protected Map<String, String> createCatalogProperties() {
+ return ImmutableMap.of(
+ "metastore-type", "glue",
+ "s3-region", "us-east-1",
+ "aws-glue-endpoint", localStackEndpoint);
+ }
+}
+```
+
+All parent tests (`testCreateHiveTable`, `testAlterTable`,
`testListPartitions`, etc.) automatically run against the Glue backend — no
rewriting needed.
+
+**Hive + Glue supported operations** (all covered by inherited
`CatalogHive2IT` tests):
+
+- Schema: `createDatabase`, `getDatabase`, `getAllDatabases`, `alterDatabase`,
`dropDatabase(cascade)`
+- Table: `createTable`, `getTable`, `getAllTables`, `alterTable`,
`dropTable(deleteData)`, `purgeTable`, `getTableObjectsByName`
+- Partition: `listPartitionNames`, `listPartitions`, `listPartitions(filter)`,
`getPartition`, `addPartition`, `dropPartition(deleteData)`
+
+These are all directly supported by `AWSCatalogMetastoreClient`
(`IMetaStoreClient`). GlueShim only creates the client instance — upstream
`HiveShim` methods work automatically.
+
+**Trino E2E**: Add Glue catalog configuration in `TrinoQueryITBase`, reuse
existing SQL test scripts.
+
+**Spark E2E**:
+
+| New Test Class | Extends | Override |
+|---|---|---|
+| `SparkHiveGlueCatalogIT` | `SparkHiveCatalogIT` | `getCatalogConfigs()` → Glue config |
+| `SparkIcebergGlueCatalogIT` | `SparkIcebergCatalogIT` | Catalog properties → `catalog-backend=glue` |
+
+All `SparkCommonIT` tests (31 DDL/DML/query tests) are automatically inherited.
+
+### 5.3 Build Verification
+
+```bash
+./gradlew build -x test # Compile
+./build.sh sp # Spotless formatting
+./gradlew test -PskipITs # Unit tests
+./gradlew test -PskipTests -PskipDockerTests=false # Integration tests (Docker + LocalStack)
Review Comment:
The build verification commands reference `./build.sh sp` and `-PskipITs`,
but this repo doesn’t have a top-level `build.sh`, and Gradle doesn’t define a
`skipITs` property (tests are controlled via `-PskipTests` / `skipDockerTests`
and JUnit tags). Please update these commands to the repo’s actual workflow,
e.g., use `./gradlew spotlessApply`/`spotlessCheck` for formatting and
`./gradlew test` for unit tests.
```suggestion
./gradlew spotlessApply # Spotless formatting
./gradlew test # Unit tests
./gradlew test -PskipDockerTests=false # Integration tests (Docker + LocalStack)
```
##########
design/aws-glue-catalog-connector.md:
##########
@@ -0,0 +1,471 @@
+#### GlueShim Design
+
+`GlueShim` extends `HiveShimV2` and overrides only `createMetaStoreClient()`:
+
+| Shim | `createMetaStoreClient()` Implementation |
+|---|---|
+| `HiveShimV2` | `RetryingMetaStoreClient.getProxy(hiveConf)` → Thrift to HMS |
+| `HiveShimV3` | Same as V2 (V3 only adds catalog-aware method overrides) |
+| `GlueShim` | `AWSGlueDataCatalogHiveClientFactory.create(hiveConf)` → Glue SDK |
Review Comment:
Designing `GlueShim` to extend `HiveShimV2` conflicts with the stated
decision to always run Glue through the Hive3 classloader/API. In the current
codebase, `HiveShimV2#getCatalogs()` returns an empty list and
`createCatalog()` is unsupported, so inheriting V2 would drop Hive3
catalog-aware behavior. Consider having `GlueShim` extend `HiveShimV3` (or
otherwise reuse Hive3 overrides) and only override metastore client creation.
```suggestion
`GlueShim` extends `HiveShimV3` and overrides only `createMetaStoreClient()`:

| Shim | `createMetaStoreClient()` Implementation |
|---|---|
| `HiveShimV2` | `RetryingMetaStoreClient.getProxy(hiveConf)` → Thrift to HMS (no catalog-aware overrides) |
| `HiveShimV3` | `RetryingMetaStoreClient.getProxy(hiveConf)` → Thrift to HMS, plus catalog-aware method overrides (e.g., `getCatalogs()`, `createCatalog()`) |
| `GlueShim` | Extends `HiveShimV3`; overrides `createMetaStoreClient()` with `AWSGlueDataCatalogHiveClientFactory.create(hiveConf)` → Glue SDK, reusing all Hive3 catalog-aware behavior |
```
##########
design/aws-glue-catalog-connector.md:
##########
@@ -0,0 +1,471 @@
+# Design: AWS Glue Data Catalog Support for Apache Gravitino
+
Review Comment:
This new Markdown file will be checked by the root Apache RAT task, but
`design/**/*.md` is not excluded (only `docs/**/*.md` is). Without an ASF
license header comment, `./gradlew check` is likely to fail RAT. Add the
standard Apache license header (typically as an HTML comment) at the top of
this document, or explicitly add `design/**/*.md` to the RAT excludes (header
preferred).
##########
design/aws-glue-catalog-connector.md:
##########
@@ -0,0 +1,471 @@
+#### Engine Integration
+
+**Trino** — `HiveConnectorAdapter.java`:
+- When `metastore-type=glue`: set `hive.metastore=glue` +
`hive.metastore.glue.region` + `hive.metastore.glue.catalogid`.
+- When `hive` (default): existing `hive.metastore.uri` path unchanged.
Review Comment:
The Trino integration notes refer to `hive.metastore.uri`, but Trino’s Hive
connector property (and this repo’s existing usage) is `hive.metastore.uris`
(plural). Please correct the key here to avoid configuration copy/paste errors.
```suggestion
- When `hive` (default): existing `hive.metastore.uris` path unchanged.
```
##########
design/aws-glue-catalog-connector.md:
##########
@@ -0,0 +1,471 @@
+# Design: AWS Glue Data Catalog Support for Apache Gravitino
+
+## 1. Problem Statement and Goals
+
+### 1.1 Problem
+
+**Gravitino currently cannot federate AWS Glue Data Catalog.** This is a
significant gap because:
+
+1. **Large user base on AWS**: The majority of cloud-native data lakes run on
AWS with Glue Data Catalog as the central metadata service (default for Athena,
Redshift Spectrum, EMR, Lake Formation). These organizations cannot bring their
Glue metadata into Gravitino's unified management layer.
+2. **No native integration path**: The only workaround is pointing Gravitino's
Hive catalog at Glue's HMS-compatible Thrift endpoint (`metastore.uris =
thrift://...`), which is undocumented, region-limited, and cannot leverage
Glue-native features (catalog ID, cross-account access, VPC endpoints).
+3. **Competitive landscape**: Trino, Spark, and other engines all have
first-class Glue support with dedicated configuration. Users expect the same
from Gravitino.
+
+### 1.2 Goals
+
+After this feature is implemented:
+
+1. **Register AWS Glue Data Catalog in Gravitino**:
+ ```bash
+ # Hive-format tables
+ gcli catalog create --name hive_on_glue --provider hive \
+ --properties metastore-type=glue,s3-region=us-east-1
+
+ # Iceberg-format tables
+ gcli catalog create --name iceberg_on_glue --provider lakehouse-iceberg \
+     --properties catalog-backend=glue,warehouse=s3://bucket/iceberg,s3-region=us-east-1
+ ```
+
+2. **Standard Gravitino API works against Glue catalogs**:
+ ```bash
+ gcli schema list --catalog hive_on_glue
+ gcli table list --catalog hive_on_glue --schema my_database
+   gcli table details --catalog iceberg_on_glue --schema analytics --table events
+ ```
+
+3. **Trino and Spark connect transparently** — Trino uses
`hive.metastore=glue` / `iceberg.catalog.type=glue`; Spark uses
`AWSGlueDataCatalogHiveClientFactory` / `GlueCatalog`. Users query Glue tables
through Gravitino without knowing the underlying mechanism.
+
+4. **AWS-native authentication** (reuses existing S3 properties): static
credentials, STS AssumeRole, or default credential chain (environment
variables, instance profile).
+
+## 2. Background
+
+### 2.1 AWS Glue Data Catalog
+
+AWS Glue Data Catalog is a managed metadata repository storing:
+- **Databases** — logical groupings, equivalent to Gravitino schemas.
+- **Tables** — metadata records containing column definitions, storage
descriptors, partition keys, and user-defined parameters.
+
+Tables come in two formats:
+
+| Format | How Glue Stores It |
+|---|---|
+| **Hive** | Full metadata in `StorageDescriptor` (columns, SerDe, InputFormat, OutputFormat, location). The majority of tables in most Glue catalogs (legacy ETL, Athena CTAS, Redshift Spectrum). |
+| **Iceberg** | `Parameters["table_type"] = "ICEBERG"` and `Parameters["metadata_location"]` pointing to Iceberg metadata JSON on S3. `StorageDescriptor.Columns` is typically empty. Growing rapidly. |
+
+A complete Glue integration must handle both table formats.
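
The two formats can be distinguished from a table's `Parameters` map alone. A minimal sketch of that check (the helper class is hypothetical, not part of the design):

```java
import java.util.Map;

// Illustrative helper: a Glue table is treated as Iceberg when its Parameters
// map carries table_type=ICEBERG; anything else falls through to the
// Hive-format path. Class and method names are ours, not from the design.
public class GlueTableFormatSketch {
    static boolean isIcebergTable(Map<String, String> parameters) {
        // Writers use upper case, but compare case-insensitively to be safe.
        // equalsIgnoreCase on the constant also handles a missing (null) key.
        return "ICEBERG".equalsIgnoreCase(parameters.get("table_type"));
    }
}
```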
+
+### 2.2 How Query Engines Use Glue
+
+Trino and Spark both have native Glue support — they call the AWS Glue SDK
directly, not via HMS Thrift:
+
+| Engine | Hive Tables on Glue | Iceberg Tables on Glue |
+|---|---|---|
+| **Trino** | Hive connector with `hive.metastore=glue` | Iceberg connector with `iceberg.catalog.type=glue` |
+| **Spark** | Hive catalog with `AWSGlueDataCatalogHiveClientFactory` | Iceberg catalog with `catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog` |
+
+Both engines use a **one-catalog-to-one-connector** model — a single catalog
handles either Hive-format or Iceberg-format tables, not both. This is
consistent with Gravitino's existing catalog model.
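
For reference, the engine-side settings the table summarizes look roughly like the following Trino catalog files (property names are Trino's documented keys; catalog file names, region, and values are placeholders):

```properties
# etc/catalog/hive_glue.properties — Hive-format tables via Glue
connector.name=hive
hive.metastore=glue
hive.metastore.glue.region=us-east-1

# etc/catalog/iceberg_glue.properties — Iceberg-format tables via Glue
connector.name=iceberg
iceberg.catalog.type=glue
```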
+
+### 2.3 Gravitino's Current Architecture
+
+Gravitino's catalog plugin system provides:
+- **Hive catalog** (`provider=hive`): Connects to HMS via Thrift. Client
chain: `HiveCatalogOperations` → `CachedClientPool` → `HiveClientImpl` →
`HiveShimV2/V3` → `IMetaStoreClient`.
+- **Iceberg catalog** (`provider=lakehouse-iceberg`): Supports pluggable
backends (`catalog-backend=hive|jdbc|rest|memory|custom`). Each backend maps to
a different Iceberg `Catalog` implementation.
+- **Trino/Spark connectors**: Property converters translate Gravitino catalog
properties into engine-specific properties.
+
+## 3. Design Alternatives
+
+### Alternative A: New `catalog-glue` Module
+
+Create a standalone `catalogs/catalog-glue/` with its own
`GlueCatalogOperations`, type converters, and entity classes. Directly call the
AWS Glue SDK for both Hive and Iceberg tables.
+
+**Pros**: Full control over Glue-specific behavior. Single catalog for mixed
table formats.
+**Cons**:
+- Duplicates logic already in Hive catalog (type conversion, partition
handling, SerDe parsing) and Iceberg catalog (schema conversion, metadata
loading).
+- Trino/Spark integration requires a "Composite Connector" that routes queries
based on table type — a significant architectural change.
+- Larger implementation surface area and maintenance burden.
+
+### Alternative B: Glue as a Metastore Type (Chosen)
+
+Extend the existing Hive and Iceberg catalogs with Glue as a backend option.
+
+**Pros**:
+- Reuses all existing catalog logic, type conversion, property handling, and
entity models.
+- Trino/Spark integration works almost for free — both engines already have
native Glue support.
+- Much smaller change set: roughly 15 modified files plus 1 new file, versus ~15 entirely new files for Alternative A.
+- Consistent with how Trino and Spark model Glue (as a metastore variant, not
a separate catalog type).
+
+**Cons**:
+- Users must create two Gravitino catalogs to cover both Hive and Iceberg
tables from the same Glue Data Catalog.
+- Cannot add Glue-only features (e.g., Glue crawlers) without extending the
generic interfaces.
+
+**Decision**: Alternative B — the reuse benefits and Trino/Spark alignment
outweigh the minor UX cost of two catalogs.
+
+## 4. Detailed Design
+
+### 4.1 Configuration Properties
+
+Gravitino already defines standardized AWS/S3 properties in
`S3Properties.java`:
+
+| Existing Property | Used By |
+|---|---|
+| `s3-access-key-id` / `s3-secret-access-key` | Iceberg, Hive (S3 storage + Glue auth) |
+| `s3-region` | Iceberg, Hive (S3 storage + Glue region) |
+| `s3-role-arn` / `s3-external-id` | Iceberg, Hive (STS AssumeRole) |
+| `s3-endpoint` | Iceberg, Hive (custom S3 endpoint) |
+
+We **reuse `s3-region` as the AWS region** (the Glue catalog and the S3 data it
describes typically live in the same region) and **reuse `s3-access-key-id` /
`s3-secret-access-key` for authentication**. Only two new Glue-specific
properties are added:
+
+| New Property | Required | Description |
+|---|---|---|
+| `aws-glue-catalog-id` | No | Glue catalog ID (defaults to the caller's AWS account). For cross-account access. |
+| `aws-glue-endpoint` | No | Custom Glue endpoint (for VPC endpoints or testing). |
+
+**Authentication priority**: Static credentials → STS AssumeRole
(`s3-role-arn`) → Default credential chain (environment variables, instance
profile).
+
+### 4.2 Iceberg Catalog + Glue Backend
+
+Add `GLUE` as a new `IcebergCatalogBackend` enum value. Use Iceberg's built-in
`org.apache.iceberg.aws.glue.GlueCatalog`.
+
+#### Data Flow
+
+```
+User: catalog-backend=glue, warehouse=s3://..., s3-region=us-east-1
+ → IcebergCatalogOperations.initialize()
+ → IcebergCatalogUtil.loadCatalogBackend(GLUE, config)
+ → loadGlueCatalog(config)
+ → new GlueCatalog().initialize("glue", {
+ "warehouse": "s3://...",
+ "client.region": "us-east-1",
+          "glue.id": "..." })
+ → All existing IcebergCatalogOperations methods work unchanged
+```
+
+`GlueCatalog` is an official Iceberg implementation with full Schema CRUD +
Table CRUD support — this is the lowest-risk part of the design.
+
+#### Engine Integration
+
+**Trino** — `IcebergCatalogPropertyConverter.java`: Add `case "glue":` →
`iceberg.catalog.type=glue` + AWS region/catalog-id.
+
+**Spark** — No code change needed. The existing generic
`all.put(ICEBERG_CATALOG_TYPE, catalogBackend)` already handles `"glue"`.
+
+### 4.3 Hive Catalog + Glue Backend
+
+Add `metastore-type=glue` property. Use AWS's
`aws-glue-datacatalog-hive3-client` library which provides an
`IMetaStoreClient` implementation backed by the Glue SDK.
+
+#### Data Flow
+
+```
+User: metastore-type=glue, s3-region=us-east-1
+ → HiveCatalogOperations.initialize()
+ → mergeProperties(conf) — maps Glue properties
+ → CachedClientPool(properties)
+ → HiveClientPool.newClient()
+      → HiveClientFactory.createHiveClient()        ← MODIFIED: skip hive2/3 detection
+        → HiveClientClassLoader.createLoader(HIVE3, ...)  ← always Hive3 for Glue
+          → HiveClientImpl(HIVE3, properties)
+            → detects metastore.type=glue
+              → new GlueShim(properties)             ← NEW (replaces HiveShimV3)
Review Comment:
The doc introduces the user-facing property as `metastore-type=glue`, but
the data flow uses `metastore.type=glue` when selecting the shim. Unless the
implementation will explicitly translate one into the other, this inconsistency
is likely to cause misconfiguration. Please pick one canonical property key and
use it consistently throughout the doc (examples + Java snippets), or document
the exact mapping between the two keys.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]