Re: [PR] Add puppygraph integration [polaris]

via GitHub Sat, 18 Oct 2025 10:52:39 -0700


flyrain commented on code in PR #2753:
URL: https://github.com/apache/polaris/pull/2753#discussion_r2411993308



##########
site/content/blog/2025/10/02/puppygraph-polaris-integration.md:
##########
@@ -0,0 +1,389 @@
+---
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+title: "Integrating Apache Polaris with PuppyGraph for Real-time Graph 
Analysis"
+date: 2025-10-02
+author: Danfeng Xu
+---
+
+Unified data governance has become a hot topic over the last few years. As AI 
and other data-hungry use cases infiltrate the market, the need for a 
comprehensive data catalog solution with governance in mind has become 
critical. [Apache Polaris](https://github.com/apache/polaris) has found its 
calling as an open-source solution, specifically built to handle data governed 
by [Apache Iceberg](https://iceberg.apache.org/), that is changing the way we 
manage and access data across various clouds, formats, and platforms. With a 
foundation rooted in Apache Iceberg, Apache Polaris ensures compatibility with 
various compute engines and data formats, making it an ideal choice for 
organizations focused on scalable, open data architectures. 
+
+The beauty of such catalog technologies is their interoperability with other 
technologies that can leverage their data. 
[**PuppyGraph**](https://www.puppygraph.com/), the first graph compute engine 
to integrate with Apache Polaris natively, is part of this revolution of making 
data (and graph analytics) more accessible \- all without a separate 
specialized graph database. Built in collaboration with the Apache Polaris 
team, PuppyGraph’s integration with Apache Polaris is a significant step 
forward in graph compute technology, offering a powerful way to explore and 
analyze data within Apache Polaris.
+
+As the first graph query engine to natively integrate with Apache Polaris, 
PuppyGraph offers a unique approach to querying the data within an Apache 
Polaris instance: **through graph**. Although SQL querying will remain a staple 
for many developers, graph queries offer organizations a way to explore their 
interconnected data in unique and new ways that SQL-based querying cannot 
handle efficiently. This blog will explore the power of pairing Apache Polaris 
with graph analytics capabilities using PuppyGraph’s zero-ETL graph query 
engine. Let’s start by looking a bit closer at the inner workings of Apache 
Polaris.
+
+## What is Apache Polaris?
+
+Apache Polaris is an open-source, interoperable catalog for Apache Iceberg. It 
offers a centralized governance solution for data across various cloud 
platforms, formats, and compute engines. For users, it provides fine-grained 
access controls to secure data handling, simplifies data discovery, and fosters 
collaboration by managing structured and unstructured data, machine learning 
models, and files. 
+
+![](/img/blog/2025/10/02/fig1-what-is-apache-polaris.png)
+
+A significant component of Apache Polaris is its commitment to open 
accessibility and regulatory compliance. By supporting major data protection 
and privacy frameworks like GDPR, CCPA, and HIPAA, Apache Polaris helps 
organizations meet critical regulatory standards. This focus on compliance and 
secure data governance reduces risk while fostering greater confidence in how 
data is stored, accessed, and analyzed.
+
+### **Key Features & Benefits**
+
+Apache Polaris offers several key features and benefits that users should 
know. Diving a bit deeper, based on the image above, here are some noteworthy 
benefits:
+
+#### Cross-Engine Read and Write
+
+Apache Polaris leverages Apache Iceberg's open-source REST protocol, enabling 
multiple engines to read and write data seamlessly. This interoperability 
extends to popular engines like PuppyGraph, [Apache 
Flink](https://flink.apache.org/), [Apache Spark](https://spark.apache.org/), 
[Trino](https://trino.io/), and many others, ensuring flexibility and choice 
for users. 
+
+![](/img/blog/2025/10/02/fig2-cross-engine-rw.png)
+
+#### Centralized Security and Access
+
+With Apache Polaris, you can define principals/users and roles, and manage 
RBAC (Role-Based Access Controls) on Iceberg tables for these users or roles. 
This centralized security management approach streamlines access control and 
simplifies data governance.  
+
+#### Run Anywhere, No Lock-In
+
+Apache Polaris offers deployment flexibility, allowing you to run it in your 
own infrastructure within a container (e.g., Docker, Kubernetes) or as a 
managed service on Snowflake. This adaptability ensures you can retain RBAC, 
namespaces, and table definitions even if you switch infrastructure, providing 
long-term flexibility and cost optimization.
+
+Apache Polaris offers various ways to query, analyze, and integrate data, 
making it one of the most flexible and scalable options for organizations to 
store and govern data effectively.
+
+## Why Add Graph Capabilities to Apache Polaris?
+
+While SQL is a mainstay for most developers and is highly effective for many 
data operations, it can fall short when working with highly interconnected 
data. Specific use cases lend themselves to graph querying, such as:
+
+* **Social Network Analysis:** Understanding relationships between people, 
groups, and organizations.  
+* **Fraud Detection:** Identifying patterns and anomalies in financial 
transactions or online activities.  
+* **Knowledge Graphs:** Representing and querying complex networks of 
interconnected concepts and entities.  
+* **Recommendation Engines:** Suggesting products, services, or content based 
on user preferences and relationships.  
+* **Network and IT Infrastructure Analysis:** Modeling and analyzing network 
topologies, dependencies, and performance. 
+
+Enhancing Apache Polaris with a graph query engine introduces advanced graph 
analytics, making it easier and more intuitive to handle complex, 
relationship-based queries like the ones mentioned above. Here's why 
integrating graph capabilities benefits querying in Apache Polaris:
+
+* **Enhanced Data Relationships**: Graph queries are designed to uncover 
complex patterns within data, making them particularly useful for exploring 
multi-level relationships or hierarchies that can be cumbersome to analyze with 
SQL.  
+* **Performance**: When traversing extensive relationships, graph queries are 
often faster than SQL, especially for deep link analysis, as graph databases 
are optimized for this type of network traversal.  
+* **Flexibility**: Graph databases allow for a more intuitive approach to 
modeling interconnected data, avoiding the need for complex `JOIN` operations 
common in SQL queries. Nodes and edges in graph models naturally represent 
connections, simplifying queries for relationship-based data.  
+* **Advanced Analytics**: Graph platforms support advanced analytics, such as 
community detection, shortest path calculations, and centrality measures. Many 
of these algorithms are built into graph platforms, making them more accessible 
and efficient than implementing such analytics manually in SQL.
+
+Users gain deeper insights, faster query performance, and simpler ways to 
handle complex data structures by adding the capability to perform graph-based 
querying and analytics within Apache Polaris. When adding these capabilities, 
PuppyGraph’s zero-ETL graph query engine integrates seamlessly with Apache 
Polaris, making it easy and fast to unlock these advantages. Let’s look at how 
seamlessly the two platforms fit together architecturally.
+
+## Apache Polaris \+ PuppyGraph Architecture
+
+Traditionally, enabling graph querying and analytics on organizational data 
required replicating data into a separate graph database before running 
queries. This complex process involved multiple technologies, teams, and a 
significant timeline. Generally, the most cumbersome part of the equation was 
the ETL work needed to transform the data into a graph-compatible format and 
load it into the database. Because of this, implementing 
graph analytics on data stored in SQL-based systems has historically been 
challenging, so graph analysis was often viewed as a niche technology—valuable 
but costly to implement.
+
+PuppyGraph overcomes these limitations by offering a novel approach: adding 
graph capabilities without needing a dedicated graph database. Removing the 
graph database from the equation shortens the implementation timeline, reducing 
both time-to-market and overall costs. With PuppyGraph’s Zero-ETL graph query 
engine, users can connect directly to the data source, enabling graph queries 
directly on an Apache Polaris instance while maintaining fine-grained 
governance and lineage.
+
+![](/img/blog/2025/10/02/fig3-apache-polaris-puppygraph-architecture.png)
+
+This approach allows for performant graph querying, such as supporting 10-hop 
neighbor queries across half a billion edges in 2.26 seconds through scalable 
and performant zero-ETL. PuppyGraph achieves this by leveraging the 
column-based data file format coupled with massively parallel processing and 
vectorized evaluation technology built into the PuppyGraph engine. This 
distributed compute engine design ensures fast query execution even without 
efficient indexing and caching, delivering a performant and efficient graph 
querying and analytics experience without the hassles of the traditional graph 
infrastructure.
+
+To prove just how easy it is, let's look at how you can connect PuppyGraph to 
the data you have stored in Apache Polaris.
+
+## Connecting PuppyGraph to Apache Polaris 
+
+Enabling graph capabilities on your underlying data is extremely simple with 
PuppyGraph. We like to summarize it into three steps: deploy, connect, and 
query. Many users can be up and running in a matter of minutes. We’ll walk 
through the steps below to show how easy it is.
+
+### Deploy Apache Polaris
+
+Check out the code from the Apache Polaris repository.  
+```shell
+git clone https://github.com/apache/polaris.git
+```
+
+Build and run an Apache Polaris server. Note that JDK 21 is required to build 
and run Apache Polaris.
+```shell   
+cd polaris  
+./gradlew runApp
+```
+
+The Apache Polaris server will start. Note the root principal credentials 
printed in the server's output; they are required to connect to the Apache 
Polaris server later. The line that contains the credentials will look like 
this: 
+```shell 
+realm: default-realm root principal credentials: 
f6973789e5270e5d:dce8e8e53d8f770eb9804f22de923645

Review Comment:
   Or we can just remove lines 107 to 109; adding the right credential here, 
https://github.com/apache/polaris/pull/2753/files#r2411972405, would be good 
enough.
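
For context on the credential line under discussion, here is a minimal sketch of splitting it into its two parts. It assumes the value after `credentials:` has the form `<client-id>:<client-secret>` and that the log line arrives unwrapped. The function name and the `POLARIS_CLIENT_ID` / `POLARIS_CLIENT_SECRET` variables are hypothetical names chosen for illustration; they are not part of Polaris or of this PR.

```shell
# Sketch only: split a root-credential log line into client ID and secret.
# Assumption: the value after "credentials: " is "<client-id>:<client-secret>".
# The function and variable names below are hypothetical.

parse_polaris_credentials() {
  # $1: a log line such as
  #   "realm: default-realm root principal credentials: <id>:<secret>"
  line="$1"
  creds="${line##*credentials: }"      # drop everything up to and including "credentials: "
  POLARIS_CLIENT_ID="${creds%%:*}"     # text before the first colon
  POLARIS_CLIENT_SECRET="${creds#*:}"  # text after the first colon
}

# Example using the sample line quoted in the diff above.
parse_polaris_credentials "realm: default-realm root principal credentials: f6973789e5270e5d:dce8e8e53d8f770eb9804f22de923645"
echo "client id:     ${POLARIS_CLIENT_ID}"
echo "client secret: ${POLARIS_CLIENT_SECRET}"
```

The sketch relies only on POSIX parameter expansion, so it needs no external tools. Whichever credential ends up in the post, the later connection step would need the same two values.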



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

