flyrain commented on code in PR #2753:
URL: https://github.com/apache/polaris/pull/2753#discussion_r2411993308

##########
site/content/blog/2025/10/02/puppygraph-polaris-integration.md:
##########
@@ -0,0 +1,389 @@
+---
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+title: "Integrating Apache Polaris with PuppyGraph for Real-time Graph Analysis"
+date: 2025-10-02
+author: Danfeng Xu
+---
+
+Unified data governance has become a hot topic over the last few years. As AI and other data-hungry use cases infiltrate the market, the need for a comprehensive data catalog solution with governance in mind has become critical. [Apache Polaris](https://github.com/apache/polaris) has found its calling as an open-source solution, specifically built to handle data governed by [Apache Iceberg](https://iceberg.apache.org/), that is changing the way we manage and access data across various clouds, formats, and platforms. With a foundation rooted in Apache Iceberg, Apache Polaris ensures compatibility with various compute engines and data formats, making it an ideal choice for organizations focused on scalable, open data architectures.
+
+The beauty of such catalog technologies is their interoperability with other technologies that can leverage their data. [**PuppyGraph**](https://www.puppygraph.com/), the first graph compute engine to integrate with Apache Polaris natively, is part of this revolution of making data (and graph analytics) more accessible \- all without a separate specialized graph database. By working with the Apache Polaris team, PuppyGraph’s integration with the Apache Polaris is a significant leap forward in graph compute technology, offering a unique and powerful approach to exploring and analyzing data within Apache Polaris.
+
+As the first graph query engine to natively integrate with Apache Polaris, PuppyGraph offers a unique approach to querying the data within an Apache Polaris instance: **through graph**. Although SQL querying will remain a staple for many developers, graph queries offer organizations a way to explore their interconnected data in unique and new ways that SQL-based querying cannot handle efficiently. This blog will explore the power of pairing Apache Polaris with graph analytics capabilities using PuppyGraph’s zero-ETL graph query engine. Let’s start by looking a bit closer at the inner workings of the Apache Polaris.
+
+## What is Apache Polaris?
+
+Apache Polaris is an open-source, interoperable catalog for Apache Iceberg. It offers a centralized governance solution for data across various cloud platforms, formats, and compute engines. For users, it provides fine-grained access controls to secure data handling, simplifies data discovery, and fosters collaboration by managing structured and unstructured data, machine learning models, and files.
+
+
+
+A significant component of Apache Polaris is its commitment to open accessibility and regulatory compliance. By supporting major data protection and privacy frameworks like GDPR, CCPA, and HIPAA, Apache Polaris helps organizations meet critical regulatory standards. This focus on compliance and secure data governance reduces risk while fostering greater confidence in how data is stored, accessed, and analyzed.
+
+### **Key Features & Benefits**
+
+Apache Polaris offers several key features and benefits that users should know. Diving a bit deeper, based on the image above, here are some noteworthy benefits:
+
+#### Cross-Engine Read and Write
+
+Apache Polaris leverages Apache Iceberg's open-source REST protocol, enabling multiple engines to read and write data seamlessly. This interoperability extends to popular engines like PuppyGraph, [Apache Flink](https://flink.apache.org/), [Apache Spark](https://spark.apache.org/), [Trino](https://trino.io/), and many others, ensuring flexibility and choice for users.
+
+
+
+#### Centralized Security and Access
+
+With Apache Polaris, you can define principals/users and roles, and manage RBAC (Role-Based Access Controls) on Iceberg tables for these users or roles. This centralized security management approach streamlines access control and simplifies data governance.
+
+#### Run Anywhere, No Lock-In
+
+Apache Polaris offers deployment flexibility, allowing you to run it in your own infrastructure within a container (e.g., Docker, Kubernetes) or as a managed service on Snowflake. This adaptability ensures you can retain RBAC, namespaces, and table definitions even if you switch infrastructure, providing long-term flexibility and cost optimization.
+
+The Apache Polaris offers various ways to query, analyze, and integrate data, one of the most flexible and scalable options for organizations to store and govern data effectively.
+
+## Why Add Graph Capabilities to Apache Polaris?
+
+While SQL querying is a mainstay for most developers dealing with data and traditional SQL queries are highly effective for many data operations, they can fall short when working with highly interconnected data. Specific use cases lend themselves to graph querying, such as:
+
+* **Social Network Analysis:** Understanding relationships between people, groups, and organizations.
+* **Fraud Detection:** Identifying patterns and anomalies in financial transactions or online activities.
+* **Knowledge Graphs:** Representing and querying complex networks of interconnected concepts and entities.
+* **Recommendation Engines:** Suggesting products, services, or content based on user preferences and relationships.
+* **Network and IT Infrastructure Analysis:** Modeling and analyzing network topologies, dependencies, and performance.
+
+Enhancing Apache Polaris with a graph query engine introduces advanced graph analytics, making it easier and more intuitive to handle complex, relationship-based queries like the ones mentioned above. Here's why integrating graph capabilities benefits querying in Apache Polaris:
+
+* **Enhanced Data Relationships**: Graph queries are designed to uncover complex patterns within data, making them particularly useful for exploring multi-level relationships or hierarchies that can be cumbersome to analyze with SQL.
+* **Performance**: When traversing extensive relationships, graph queries are often faster than SQL, especially for deep link analysis, as graph databases are optimized for this type of network traversal.
+* **Flexibility**: Graph databases allow for a more intuitive approach to modeling interconnected data, avoiding the need for complex `JOIN` operations common in SQL queries. Nodes and edges in graph models naturally represent connections, simplifying queries for relationship-based data.
+* **Advanced Analytics**: Graph platforms support advanced analytics, such as community detection, shortest path calculations, and centrality measures. Many of these algorithms are built into graph platforms, making them more accessible and efficient than implementing such analytics manually in SQL.
+
+Users gain deeper insights, faster query performance, and simpler ways to handle complex data structures by adding the capability to perform graph-based querying and analytics within Apache Polaris. When adding these capabilities, PuppyGraph’s zero-ETL graph query engine integrates seamlessly with Apache Polaris, making it easy and fast to unlock these advantages. Let’s look at how seamlessly the two platforms fit together architecturally.
+
+## Apache Polaris \+ PuppyGraph Architecture
+
+Traditionally, enabling graph querying and analytics on organizational data required replicating data into a separate graph database before running queries. This complex process involved multiple technologies, teams, and a significant timeline. Generally, the most cumbersome part of the equation was the struggling of ETL to get the data transformed into a graph-compatible format and actually loaded into the database. Because of this, implementing graph analytics on data stored in SQL-based systems has historically been challenging, so graph analysis was often viewed as a niche technology—valuable but costly to implement.
+
+PuppyGraph overcomes these limitations by offering a novel approach: adding graph capabilities without needing a dedicated graph database. Removing the graph database from the equation shortens the implementation timeline, reducing both time-to-market and overall costs. With PuppyGraph’s Zero-ETL graph query engine, users can connect directly to the data source, enabling graph queries directly on an Apache Polaris instance while maintaining fine-grained governance and lineage.
+
+
+
+This approach allows for performant graph querying, such as supporting 10-hop neighbor queries across half a billion edges in 2.26 seconds through scalable and performant zero-ETL. PuppyGraph achieves this by leveraging the column-based data file format coupled with massively parallel processing and vectorized evaluation technology built into the PuppyGraph engine. This distributed compute engine design ensures fast query execution even without efficient indexing and caching, delivering a performant and efficient graph querying and analytics experience without the hassles of the traditional graph infrastructure.
+
+To prove just how easy it is, let's look at how you can connect PuppyGraph to the data you have stored in Apache Polaris.
+
+## Connecting PuppyGraph to Apache Polaris
+
+Enabling graph capabilities on your underlying data is extremely simple with PuppyGraph. We like to summarize it into three steps: deploy, connect, and query. Many users can be up and running in a matter of minutes. We’ll walk through the steps below to show how easy it is.
+
+### Deploy Apache Polaris
+
+Check out the code from the Apache Polaris repository.
+```shell
+git clone https://github.com/apache/polaris.git
+```
+
+Build and run an Apache Polaris server. Note that JDK 21 is required to build and run the Apache Polaris.
+```shell
+cd polaris
+./gradlew runApp
+```
+
+The Apache Polaris server will start. Please note the credentials for the Apache Polaris server's output. The credentials are required to connect to the Apache Polaris server later. The line contains the credentials will look like this:
+```shell
+realm: default-realm root principal credentials: f6973789e5270e5d:dce8e8e53d8f770eb9804f22de923645

Review Comment:
   Alternatively, we can just remove lines 107 to 109. Adding the right credential here (https://github.com/apache/polaris/pull/2753/files#r2411972405) would be good enough.
