danhuawang commented on code in PR #9173:
URL: https://github.com/apache/gravitino/pull/9173#discussion_r2596972590
##########
docs/lakehouse-generic-catalog.md:
##########
@@ -0,0 +1,588 @@
+---
+title: "Generic Lakehouse Catalog"
+slug: /lakehouse-generic-catalog
+keywords:
+ - lakehouse
+ - lance
+ - metadata
+ - generic catalog
+ - file system
+license: "This software is licensed under the Apache License version 2."
+---
+
+import Tabs from '@theme/Tabs';
+import TabItem from '@theme/TabItem';
+
+## Overview
+
+The Generic Lakehouse Catalog is a Gravitino catalog implementation designed
to seamlessly integrate with lakehouse storage systems built on file
system-based architectures. This catalog enables unified metadata management
for lakehouse tables stored on various storage backends, providing a consistent
interface for data discovery, governance, and access control.
+
+### What is a Lakehouse?
+
+A lakehouse combines the best features of data lakes and data warehouses:
+
+- **Data Lake Benefits**:
+ - Low-cost storage for massive volumes of raw data
+ - Support for diverse data formats (structured, semi-structured,
unstructured)
+ - Decoupled storage and compute for flexible scaling
+
+- **Data Warehouse Benefits**:
+ - ACID transactions for data consistency
+ - Schema enforcement and evolution
+ - High-performance analytical queries
+ - Time travel and versioning
+
+### Supported Storage Systems
+
+The catalog works with lakehouse systems built on top of:
+
+**Storage Backends:**
+- **Object Stores:** Amazon S3, Azure Blob Storage, Google Cloud Storage, MinIO
+- **Distributed File Systems:** HDFS, Apache Ozone
+- **Local File Systems:** For development and testing
+
+**Lakehouse Formats:**
+- **Lance** ✅ (currently the only fully supported format)
+
+:::info Current Support Status
+While the architecture is designed to support various lakehouse formats,
Gravitino currently provides **native production support only for Lance-based
lakehouse systems** with comprehensive testing and optimization.
+:::
+
+### Why Use Generic Lakehouse Catalog?
+
+1. **Unified Metadata Management**: Single source of truth for table metadata
across multiple storage backends
+2. **Multi-Format Support**: Extensible architecture to support various
lakehouse table formats
+3. **Storage Flexibility**: Work with any file system, from local disks and HDFS to cloud object stores
+4. **Gravitino Integration**: Leverage Gravitino's access control, lineage
tracking, and data discovery
+5. **Easy Migration**: Register existing lakehouse tables without data movement
+
+### System Requirements
+
+**Storage Requirements:**
+- Lakehouse storage system must support standard file system operations:
+ - Directory listing and navigation
+ - File reading and writing with atomic operations
+ - File deletion and renaming
+ - Path-based access control (optional but recommended)
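
The "atomic operations" requirement above typically means metadata files must become visible all at once. A minimal sketch of the common write-then-rename pattern (class, method, and file names here are illustrative, not Gravitino internals):

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicWriteExample {
    // Write to a temporary sibling file, then atomically rename it into place,
    // so readers never observe a partially written metadata file.
    public static void atomicWrite(Path target, byte[] data) throws IOException {
        Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
        Files.write(tmp, data);
        Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("lakehouse-demo");
        Path metadata = dir.resolve("table-metadata.json");
        atomicWrite(metadata, "{\"version\": 1}".getBytes(StandardCharsets.UTF_8));
        System.out.println(Files.readString(metadata));
    }
}
```

Object stores without atomic rename (for example, plain S3) usually emulate this guarantee at the table-format layer instead.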
+
+**Gravitino Requirements:**
+- Gravitino server version 1.1.0 or later
+- Configured metalake for catalog creation
+- Appropriate permissions for catalog management
+
+**Network Requirements:**
+- Network connectivity between Gravitino server and storage backend
+- For cloud storage: Internet access and valid credentials
+- For HDFS: Proper Hadoop configuration and network access
+
+## Catalog Management
+
+### Capabilities
+
+The Generic Lakehouse Catalog provides full relational metadata management, on par with Gravitino's standard relational catalogs:
+
+**Supported Operations:**
+- ✅ Create, read, update, and delete catalogs
+- ✅ List all catalogs in a metalake
+- ✅ Manage catalog properties and metadata
+- ✅ Set and modify catalog locations
+- ✅ Configure storage backend credentials
+
+For detailed information on available operations, see [Manage Relational
Metadata Using Gravitino](./manage-relational-metadata-using-gravitino.md).
+
+### Properties
+
+#### Catalog Properties
+
+| Property   | Description                                  | Example                          | Required |
+|------------|----------------------------------------------|----------------------------------|----------|
+| `provider` | Catalog provider type                        | `lakehouse-generic`              | Yes      |
+| `location` | Root storage path for all schemas and tables | `hdfs://namenode:9000/lakehouse` | No       |
+
+#### Key Property: `location`
+
+The `location` property specifies the root directory for the lakehouse storage
system. All schemas and tables are stored under this location unless explicitly
overridden at the schema or table level.
+
+**Location Resolution Hierarchy:**
+1. Table-level `location` (highest priority)
+2. Schema-level `location`
+3. Catalog-level `location` (fallback)
+
+**Example Location Hierarchy:**
+```
+Catalog location: hdfs://namenode:9000/lakehouse
+└── Schema: sales (hdfs://namenode:9000/lakehouse/sales)
+ ├── Table: orders (hdfs://namenode:9000/lakehouse/sales/orders)
+ └── Table: customers (custom: s3://analytics-bucket/customers)
+```
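
The resolution order above can be sketched as a small helper; the class and method names are purely illustrative and do not reflect Gravitino's actual internals:

```java
import java.util.Optional;

public class LocationResolver {
    // Resolve a table's storage path following the documented hierarchy:
    // table-level location > schema-level location > catalog-level fallback.
    public static String resolveTableLocation(
            Optional<String> tableLocation,
            Optional<String> schemaLocation,
            String catalogLocation,
            String schemaName,
            String tableName) {
        // 1. A table-level location wins outright.
        if (tableLocation.isPresent()) {
            return tableLocation.get();
        }
        // 2. Otherwise the table name is appended to the schema location.
        if (schemaLocation.isPresent()) {
            return schemaLocation.get() + "/" + tableName;
        }
        // 3. Fall back to catalog location + schema name + table name.
        return catalogLocation + "/" + schemaName + "/" + tableName;
    }

    public static void main(String[] args) {
        // Matches the "orders" line in the hierarchy example above.
        System.out.println(resolveTableLocation(
            Optional.empty(), Optional.empty(),
            "hdfs://namenode:9000/lakehouse", "sales", "orders"));
    }
}
```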
+
+### Creating a Catalog
+
+Use `provider: "lakehouse-generic"` when creating a generic lakehouse catalog.
+
+<Tabs groupId='language' queryString>
+<TabItem value="shell" label="Shell">
+
+```shell
+curl -X POST -H "Accept: application/vnd.gravitino.v1+json" \
+ -H "Content-Type: application/json" -d '{
+ "name": "generic_lakehouse_catalog",
+ "type": "RELATIONAL",
+ "comment": "Generic lakehouse catalog for Lance datasets",
+ "provider": "lakehouse-generic",
+ "properties": {
+ "location": "hdfs://localhost:9000/user/lakehouse"
+ }
+}' http://localhost:8090/api/metalakes/metalake/catalogs
+```
+
+</TabItem>
+<TabItem value="java" label="Java">
+
+```java
+GravitinoClient gravitinoClient = GravitinoClient
+ .builder("http://127.0.0.1:8090")
+ .withMetalake("metalake")
+ .build();
+
+Map<String, String> catalogProperties = ImmutableMap.<String, String>builder()
+ .put("location", "hdfs://localhost:9000/user/lakehouse")
+ .build();
+
+Catalog catalog = gravitinoClient.createCatalog(
+ "generic_lakehouse_catalog",
+ Type.RELATIONAL,
+ "lakehouse-generic",
+ "Generic lakehouse catalog for Lance datasets",
+ catalogProperties
+);
+```
+
+</TabItem>
+</Tabs>
+
+
+## Schema Management
+
+### Capabilities
+
+Schema operations follow the same patterns as relational catalogs:
+
+**Supported Operations:**
+- ✅ Create schemas with custom properties
+- ✅ List all schemas in a catalog
+- ✅ Load schema metadata and properties
+- ✅ Update schema properties
+- ✅ Delete schemas
+- ✅ Check schema existence
+
+See [Schema
Operations](./manage-relational-metadata-using-gravitino.md#schema-operations)
for detailed documentation.
+
+### Properties
+
+Schemas inherit catalog properties and can override specific settings:
+
+| Property   | Description                           | Inherited from Catalog | Required |
+|------------|---------------------------------------|------------------------|----------|
+| `location` | Custom storage path for schema tables | Yes                    | No       |
+
+#### Location Inheritance
+
+When a schema doesn't specify a `location`, it inherits from the catalog:
+
+**Without Schema Location:**
+```
+Catalog: hdfs://namenode:9000/lakehouse
+Schema: sales
+→ Schema location: hdfs://namenode:9000/lakehouse/sales
+→ Table location: hdfs://namenode:9000/lakehouse/sales/orders
+```
+
+**With Schema Location:**
+```
+Catalog: hdfs://namenode:9000/lakehouse
+Schema: sales (location: s3://sales-data/prod)
+→ Schema location: s3://sales-data/prod
+→ Table location: s3://sales-data/prod/orders
+```
Review Comment:
The catalog location uses the `hdfs` scheme, but the schema and table
examples use `s3`. Can we align the locations in the examples so that users
can easily follow the example commands to try out the new feature?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]