This is an automated email from the ASF dual-hosted git repository.

vinish pushed a commit to branch 590-RFC-CatalogSync
in repository https://gitbox.apache.org/repos/asf/incubator-xtable.git

commit b693c4342e4b7c31f19a59ba841daf65d70a7497
Author: Vinish Reddy <vinishreddygunne...@gmail.com>
AuthorDate: Wed Dec 18 16:48:42 2024 -0800

    [590] Add RFC for XCatalogSync - Synchronize tables across catalogs
---
 assets/images/catalog_sync_flow.jpg | Bin 0 -> 130344 bytes
 rfc/rfc-1/rfc-1.md                  | 142 ++++++++++++++++++++++++++++++++++++
 2 files changed, 142 insertions(+)

diff --git a/assets/images/catalog_sync_flow.jpg 
b/assets/images/catalog_sync_flow.jpg
new file mode 100644
index 00000000..237dc909
Binary files /dev/null and b/assets/images/catalog_sync_flow.jpg differ
diff --git a/rfc/rfc-1/rfc-1.md b/rfc/rfc-1/rfc-1.md
new file mode 100644
index 00000000..1f6a8310
--- /dev/null
+++ b/rfc/rfc-1/rfc-1.md
@@ -0,0 +1,142 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+-->
+# RFC-[1]: XCatalogSync - Synchronize tables across catalogs
+
+## Proposers
+
+- @vinishjail97
+
+## Approvers
+
+- Anyone from the XTable community can approve or add feedback.
+
+## Status
+
+GH Feature Request: https://github.com/apache/incubator-xtable/issues/590
+
+> Please keep the status updated in `rfc/README.md`.
+
+## Abstract
+
+Users of Apache XTable (Incubating) can today translate metadata across table 
formats (Iceberg, Hudi, and Delta) and use the tables on the platforms of 
their choice. 
+Some usability friction remains, however, because users must explicitly 
[register](https://xtable.apache.org/docs/catalogs-index) the tables in the 
catalog of their choice (Glue, HMS, Unity, BigLake, etc.) 
+and then use that catalog in their platform to run DDL and DML queries.
+
+## Background
+XTable is built on the principle of omnidirectional interoperability, and I'm 
proposing an interface that allows syncing the metadata of table formats to 
multiple catalogs in a continuous and incremental manner. With this new 
functionality we will be able to:
+1. Reduce friction for XTable users - XTable sync will register the tables in 
the catalogs of their choice after metadata generation. Users who work with a 
single format can still use XTable to sync the metadata across multiple 
catalogs.
+2. Avoid catalog lock-in - There's no reason why data/metadata in storage 
should be registered in only a single catalog. Users can register the table in 
multiple catalogs depending on the use case, ecosystem, and features provided 
by each catalog.
+
+## Implementation
+
+This RFC introduces two new interfaces, `CatalogSyncClient` and `CatalogSync`. 
[[PR]](https://github.com/apache/incubator-xtable/pull/603)
+1. `CatalogSyncClient`: This interface contains the methods responsible for 
creating a table, refreshing table metadata, dropping a table, etc. in the 
target catalog. Consider this interface a translation layer between 
InternalTable and the catalog's table object.
+2. `CatalogSync`: Synchronizes the internal XTable object (InternalTable) to 
multiple target catalogs using the methods available in the 
`CatalogSyncClient` interface.
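
The division of labor between the two interfaces can be sketched as follows. This is a minimal illustration in Python, not the Java API from the linked PR: the method names (`has_table`, `create_table`, `refresh_table`, `drop_table`), the `InMemoryCatalogClient` class, the `sync_table` function, and the plain dict standing in for InternalTable are all assumptions made for illustration.

```python
from abc import ABC, abstractmethod


class CatalogSyncClient(ABC):
    """Translation layer between an internal table and one target catalog.

    The method names are hypothetical; they mirror the operations the RFC
    lists (create, refresh, drop).
    """

    @abstractmethod
    def has_table(self, identifier: str) -> bool: ...

    @abstractmethod
    def create_table(self, identifier: str, table: dict) -> None: ...

    @abstractmethod
    def refresh_table(self, identifier: str, table: dict) -> None: ...

    @abstractmethod
    def drop_table(self, identifier: str) -> None: ...


class InMemoryCatalogClient(CatalogSyncClient):
    """Toy catalog backed by a dict, standing in for HMS, Glue, etc."""

    def __init__(self) -> None:
        self.tables = {}

    def has_table(self, identifier: str) -> bool:
        return identifier in self.tables

    def create_table(self, identifier: str, table: dict) -> None:
        self.tables[identifier] = dict(table)

    def refresh_table(self, identifier: str, table: dict) -> None:
        self.tables[identifier].update(table)

    def drop_table(self, identifier: str) -> None:
        self.tables.pop(identifier, None)


def sync_table(table: dict, targets: list) -> dict:
    """Play the CatalogSync role: push one table to every target catalog.

    `targets` is a list of (catalog_name, client, identifier) tuples.
    Returns the action taken in each catalog.
    """
    results = {}
    for catalog_name, client, identifier in targets:
        if client.has_table(identifier):
            client.refresh_table(identifier, table)
            results[catalog_name] = "refreshed"
        else:
            client.create_table(identifier, table)
            results[catalog_name] = "created"
    return results
```

In this sketch the first call for a table creates it in each catalog, and later calls refresh the existing entries, so the fan-out logic never needs to know which catalog it is talking to.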
+
+XTable users will define their source/target catalog configurations and 
synchronize tables through the `RunCatalogSync` class. 
+This will be a utility class that parses the user's YAML configuration, 
synchronizes table format metadata if needed, and then uses the interfaces 
defined above to synchronize the table in the catalogs.
+[[PR]](https://github.com/apache/incubator-xtable/pull/591)
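
The order of work described for `RunCatalogSync` (table format metadata first, catalogs second) could look like the sketch below. It is an illustration only: `plan_sync`, the step tuples, and the rule that format conversion is needed only when a target format differs from the source format are assumptions, not behavior confirmed by this RFC or the linked PR.

```python
def plan_sync(source_format: str, targets: list) -> list:
    """List the steps for one dataset, in execution order.

    Assumption: table format metadata is generated only for target formats
    that differ from the source format, and each distinct format is
    converted once, before any catalog is touched.
    """
    steps = []
    converted = set()
    for target in targets:
        table_format = target["tableFormat"]
        if table_format != source_format and table_format not in converted:
            steps.append(("convert_format", source_format, table_format))
            converted.add(table_format)
    for target in targets:
        steps.append(("sync_catalog", target["catalogName"], target["tableFormat"]))
    return steps


# Hypothetical dataset: a HUDI source written to two catalogs.
example_targets = [
    {"catalogName": "target-1", "tableFormat": "DELTA"},
    {"catalogName": "target-2", "tableFormat": "HUDI"},
    {"catalogName": "target-2", "tableFormat": "DELTA"},
]
for step in plan_sync("HUDI", example_targets):
    print(step)
```

Here DELTA metadata is planned once even though two targets request it, and the HUDI target needs no conversion because it matches the source format.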
+
+The user's YAML configuration has the following fields:
+1. `sourceCatalog`: Configuration of the source catalog from which XTable will 
read. It must contain all the necessary connection and access details for 
describing and listing tables.
+    1. `catalogName`: A unique name for the source catalog (e.g., "source-1").
+    2. `catalogType`: The type of the source catalog. This might be a specific 
type understood by XTable, such as Hive or Glue.
+    3. `catalogImpl` (optional): A fully qualified class name that implements 
the interfaces for `CatalogSyncClient`. It can be used if an implementation 
for the catalogType doesn't exist in XTable.
+    4. `catalogProperties`: A collection of configs used to configure access 
or connection properties for the catalog.
+2. `targetCatalogs`: Defines the configuration of one or more target catalogs 
to which XTable will write or update tables. Unlike the source, these catalogs 
must be writable.
+3. `datasets`: A list of datasets that specify how a source table maps to one 
or more target tables.
+   1. `sourceCatalogTableIdentifier`: Identifies the source table in 
sourceCatalog. This can be done in two ways:
+      1. `catalogTableIdentifier`: Specifies a source table by its database 
and table name. 
+      2. `storageIdentifier` (optional): Provides direct storage details such 
as a table’s base path (like an S3 location) and the partition specification. 
This allows reading from a source even if it is not strictly registered in a 
catalog, as long as the format and location are known.
+   2. `targetCatalogTableIdentifiers`: A list of one or more targets that this 
source table should be written to.
+      1. `catalogName`: The name of the target catalog where the table will be 
created or updated.
+      2. `tableFormat`: The target table format (e.g., DELTA, HUDI, ICEBERG), 
specifying how the data will be stored at the target.
+      3. `catalogTableIdentifier`: Specifies the database and table name in 
the target catalog.
+```
+sourceCatalog:
+  catalogName: "source-1"
+  catalogType: "catalog-type-1"
+  catalogProperties:
+    key01: "value01"
+    key02: "value02"
+    key03: "value03"
+targetCatalogs:
+  - catalogName: "target-1"
+    catalogType: "catalog-type-2"
+    catalogProperties:
+      key11: "value11"
+      key12: "value12"
+      key13: "value13"
+  - catalogName: "target-2"
+    catalogImpl: "org.apache.xtable.utilities.CustomCatalogImpl"
+    catalogProperties:
+      key21: "value21"
+      key22: "value22"
+      key23: "value23"
+datasets:
+  - sourceCatalogTableIdentifier:
+      catalogTableIdentifier:
+        databaseName: "source-database-1"
+        tableName: "source-1"
+    targetCatalogTableIdentifiers:
+      - catalogName: "target-1"
+        tableFormat: "DELTA"
+        catalogTableIdentifier:
+          databaseName: "target-database-1"
+          tableName: "target-tableName-1"
+      - catalogName: "target-1"
+        tableFormat: "ICEBERG"
+        catalogTableIdentifier:
+          databaseName: "target-database-2"
+          tableName: "target-tableName-2-iceberg"
+      - catalogName: "target-2"
+        tableFormat: "HUDI"
+        catalogTableIdentifier:
+          databaseName: "target-database-2"
+          tableName: "target-tableName-2-hudi"
+  - sourceCatalogTableIdentifier:
+      storageIdentifier:
+        tableBasePath: s3://tpc-ds-datasets/1GB/hudi/catalog_sales
+        tableName: catalog_sales
+        partitionSpec: cs_sold_date_sk:VALUE
+        tableFormat: "HUDI"
+    targetCatalogTableIdentifiers:
+      - catalogName: "target-2"
+        tableFormat: "ICEBERG"
+        catalogTableIdentifier:
+          databaseName: "target-database-2"
+          tableName: "target-tableName-2"
+```
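
A few consistency checks on a configuration of this shape can be sketched as below. The sketch assumes the YAML has already been loaded into nested dicts and lists, and every rule in it (a target needs `catalogType` or `catalogImpl`, a source needs one of the two identifier kinds, each target `catalogName` must be declared under `targetCatalogs`, and `tableFormat` must be DELTA, HUDI, or ICEBERG) is inferred from the field descriptions above rather than taken from the actual `RunCatalogSync` implementation. `validate_config` is a hypothetical helper.

```python
SUPPORTED_FORMATS = {"DELTA", "HUDI", "ICEBERG"}


def validate_config(config: dict) -> list:
    """Return a list of problems found in a catalog sync configuration."""
    problems = []
    declared = set()

    # Every target catalog needs a unique name and a way to pick a client.
    for catalog in config.get("targetCatalogs", []):
        name = catalog.get("catalogName")
        if name in declared:
            problems.append(f"duplicate target catalog {name!r}")
        declared.add(name)
        if "catalogType" not in catalog and "catalogImpl" not in catalog:
            problems.append(f"target catalog {name!r}: set catalogType or catalogImpl")

    for index, dataset in enumerate(config.get("datasets", [])):
        # The source is identified either through the catalog or by storage.
        source = dataset.get("sourceCatalogTableIdentifier", {})
        if "catalogTableIdentifier" not in source and "storageIdentifier" not in source:
            problems.append(f"datasets[{index}]: source table is not identified")

        # Each target must point at a declared catalog and a known format.
        for target in dataset.get("targetCatalogTableIdentifiers", []):
            name = target.get("catalogName")
            if name not in declared:
                problems.append(f"datasets[{index}]: unknown target catalog {name!r}")
            table_format = target.get("tableFormat")
            if table_format not in SUPPORTED_FORMATS:
                problems.append(f"datasets[{index}]: unsupported tableFormat {table_format!r}")
    return problems
```

Collecting all problems in a list, instead of failing on the first one, lets a user fix a long configuration in a single pass.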
+
+## Overview of the CatalogSync process
+![CatalogSync flow](../../assets/images/catalog_sync_flow.jpg)
+
+
+## Rollout/Adoption Plan
+
+- What impact (if any) will there be on existing users? 
+  - None. This is new functionality being added to synchronize tables across 
catalogs. Existing XTable users can still use table format sync through 
RunSync without any problems.
+- If we are changing behavior how will we phase out the older behavior? 
+  - N/A
+- If we need special migration tools, describe them here.
+  - N/A
+- When will we remove the existing behavior?
+  - N/A
+
+## Test Plan
+
+We plan to add HMS and Glue implementations of the `CatalogSyncClient` 
interface. Conversion in both directions across all table formats will be 
tested.
\ No newline at end of file
