manisin commented on code in PR #3417:
URL: https://github.com/apache/polaris/pull/3417#discussion_r2692993343
##########
site/content/blog/2026/01/12/external-catalog-legacy-datalakes.md:
##########
@@ -0,0 +1,79 @@
---
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
title: "Mapping Legacy and Heterogeneous Datalakes in Apache Polaris"
date: 2026-01-12
author: Maninderjit Parmar
---

## Introduction

The data lake community and major engines such as Spark, Snowflake, and Trino are standardizing on the Iceberg REST catalog protocol for data discovery and access. While this shift provides the foundation for a modern lakehouse with transactions and centralized governance, many organizations still maintain significant volumes of data managed by legacy Hadoop/Hive catalogs or heterogeneous data lakes, resulting in siloed data environments with specialized query stacks.

For these organizations, migrating a legacy data lake to a modern REST-based catalog is costly, risky, and complex due to security and compliance requirements as well as potential operational downtime. This post explores how to use Apache Polaris's read-only External Catalog feature to create a zero-copy metadata bridge across these data lakes. This capability allows organizations to project legacy and heterogeneous data lakes into the Iceberg REST ecosystem for access by any Iceberg-compatible engine, without moving or copying the underlying data.

## Challenges of Existing Data Lake Systems

Integrating existing data lakes into a modern ecosystem involves several engineering hurdles. These challenges stem primarily from format heterogeneity, security and compliance requirements, and the operational complexity of keeping multiple systems in sync.

### Fragmented Formats and the Discovery Problem

A primary hurdle in this environment is that Iceberg is often not the native format for the entire pipeline. For instance, streaming platforms like Confluent or WarpStream use tools such as TableFlow to materialize Kafka topics as Iceberg tables, creating a continuous stream of new metadata and Parquet files. Similarly, teams often use utilities like Apache XTable or Delta Lake UniForm to generate Iceberg-compatible metadata for their existing Delta or Hudi tables without duplicating the underlying data.

This creates a discovery gap: the full set of data assets is not discoverable through a single Iceberg catalog. Even after the metadata is generated, it must be registered with a central catalog to be useful to query engines. Relying on a traditional, pull-based catalog to monitor these diverse sources is operationally fragile. It often requires the catalog to maintain complex, long-lived credentials for every storage bucket and source system in order to perform crawling or polling operations.

### The Security and Reliability Burden

Traditional "pull" models also face security challenges. Granting a central catalog the permissions required to reach into multiple sub-engines and catalogs, or across cloud provider accounts, increases the risk of credential sprawl. Furthermore, in distributed systems where network issues are common, ensuring that a catalog correctly discovers every update without missing a commit is technically difficult.

## The Solution: A Push-Based, Stateless External Catalog

To address these challenges, Apache Polaris introduces the concept of an External Catalog. An External Catalog can be configured either as a read-only "static facade" or as a read/write "passthrough facade" that federates to a remote catalog. For the problem of mapping a legacy data lake, we use the External Catalog as a "static facade" and focus on that mode for the remainder of this post. Its architecture is designed around four core principles that simplify the integration of legacy data:

- **Push-Based Architecture**: Instead of the catalog polling for changes, external sync agents push metadata updates to Polaris via the [Notification API](https://github.com/apache/polaris/blob/main/spec/polaris-catalog-apis/notifications-api.yaml). This makes the catalog a passive recipient of state (a minimal sketch of such a notification request follows this list).

- **Stateless Credential Management**: By acting as a "static facade," Polaris does not need to store long-lived credentials for the source catalogs or databases. It only requires a definition of the allowed storage boundaries to vend short-lived, read-only scoped credentials to query engines.

- **Unidirectional Trust**: The security model relies on the producer environment pushing to Polaris. Polaris never initiates outbound connections to the source data lake, maintaining a strict security perimeter.

- **Idempotency and Weak Delivery Guarantees**: The [Notification API](https://github.com/apache/polaris/blob/main/spec/polaris-catalog-apis/notifications-api.yaml) is designed to tolerate at-least-once delivery. By using monotonically increasing timestamps within the notification payload, Polaris can safely reject older or duplicate updates, ensuring the catalog always reflects the most recent valid state of the data lake.
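To make the push flow concrete, here is a minimal sketch of a sync agent registering a new metadata snapshot with Polaris. The endpoint path and payload field names follow the Notification API spec linked above but should be verified against the version you deploy; the host, catalog name, namespace, table, token, table UUID, and metadata location are purely illustrative.

```python
import time
import requests

# Illustrative values -- replace with your Polaris endpoint, external catalog
# name, namespace/table, agent credentials, and the metadata.json produced by
# your conversion tool (XTable, UniForm, TableFlow, etc.).
POLARIS_URI = "https://polaris.example.com/api/catalog"
CATALOG = "legacy_lake"
NAMESPACE = "sales"
TABLE = "orders"
TOKEN = "<short-lived-oauth-token-for-the-sync-agent>"
TABLE_UUID = "8f7a6c1e-1234-4abc-9def-000000000000"  # stable UUID the agent reuses for this table

notification = {
    # e.g. CREATE / UPDATE / DROP -- see the linked spec for the full set.
    "notification-type": "UPDATE",
    "payload": {
        "table-name": TABLE,
        "table-uuid": TABLE_UUID,
        # Logical time of the change in epoch millis; it must only move forward.
        # Wall-clock send time is used here as a stand-in; prefer the source
        # commit time (see the best practices section below).
        "timestamp": int(time.time() * 1000),
        "metadata-location": "s3://legacy-bucket/warehouse/sales/orders/metadata/v42.metadata.json",
    },
}

resp = requests.post(
    f"{POLARIS_URI}/v1/{CATALOG}/namespaces/{NAMESPACE}/tables/{TABLE}/notifications",
    json=notification,
    headers={"Authorization": f"Bearer {TOKEN}"},
    timeout=30,
)
resp.raise_for_status()  # 2xx means Polaris now serves the new metadata location
```

Note that the agent pushes only a pointer to the new metadata file; Polaris never reaches back into the source environment, which is what keeps the trust relationship unidirectional and the facade stateless.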
## Best Practices for Implementing a Sync Agent

In the push-based model of Apache Polaris, the catalog is passive. The responsibility for maintaining data freshness rests with the sync agent: the producer-side process that monitors your data lake and notifies Polaris of changes. To build a resilient and scalable sync agent, engineers should follow these three core principles.

### Monotonically Increasing Timestamps

The Polaris [Notification API](https://github.com/apache/polaris/blob/main/spec/polaris-catalog-apis/notifications-api.yaml) uses a timestamp-based ordering system to handle concurrent or out-of-order updates. When the sync agent sends a notification, it must include a timestamp that represents the "logical time" of the change. If a sync agent sends a notification with a timestamp that is older than one Polaris has already processed for that table, Polaris returns a `409 Conflict` error. This mechanism ensures that a stale update from a delayed network retry cannot overwrite the current state of your catalog.
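As an illustrative sketch rather than prescribed behavior, an agent might classify that response as follows: the timestamp carried in the notification is the source-side commit time, a `409 Conflict` is treated as a benign "already superseded" signal, and only transient failures are retried. The `send_notification` helper below is a hypothetical wrapper around the POST shown earlier.

```python
def push_update(send_notification, table, commit_timestamp_ms, metadata_location):
    """Push one table change to Polaris and classify the outcome.

    `send_notification` is a hypothetical helper wrapping the POST from the
    earlier sketch; it is expected to return a requests.Response-like object.
    """
    resp = send_notification(
        table=table,
        # Logical time = the source table's commit time, not the send time,
        # so replays and delayed retries carry the same ordering information.
        timestamp=commit_timestamp_ms,
        metadata_location=metadata_location,
    )
    if resp.status_code == 409:
        # Polaris has already processed a newer (or identical) state for this
        # table -- a late retry or duplicate delivery lands here. Safe to drop.
        return "skipped-stale"
    if resp.status_code >= 500:
        # Transient failure: re-queue and retry with backoff, reusing the SAME
        # timestamp so the ordering guarantee is preserved.
        return "retry"
    resp.raise_for_status()  # any other 4xx indicates misconfiguration
    return "applied"
```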
Review Comment:
   Ack, I will follow up in a separate PR

-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]