nssalian commented on code in PR #3645:
URL: https://github.com/apache/polaris/pull/3645#discussion_r2756865725


##########
site/content/blog/2026/02/04/floe-polaris-integration.md:
##########
@@ -0,0 +1,231 @@
+---
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+title: "Floe and Apache Polaris: Policy-Driven Table Maintenance for Apache Iceberg"
+date: 2026-02-04
+author: Neelesh Salian
+---
+
+## Introduction
+
+Iceberg tables accumulate technical debt over time. Small files multiply as 
streaming jobs append data in micro-batches. Delete files pile up from CDC 
workloads. Snapshots grow unbounded, bloating metadata. Without regular 
maintenance, query performance degrades, storage costs rise, and planning times 
stretch from milliseconds to seconds.
+
+Apache Polaris provides a vendor-neutral Iceberg catalog with governance and 
access control, but it does not execute maintenance operations. The catalog 
manages metadata and enforces permissions. Compaction, snapshot expiration, 
orphan cleanup, and manifest optimization remain the user's responsibility.
+
+[Floe](https://github.com/nssalian/floe) fills that gap. It connects to 
Polaris, discovers tables, evaluates their health, and orchestrates maintenance 
through policy-driven automation. Instead of writing custom scripts or manually 
running Spark jobs, you define policies that specify what maintenance to 
perform, which tables to target, and under what conditions to trigger 
execution. Floe handles the rest: scheduling, execution via Spark or Trino, and 
tracking outcomes.
+
+## Architecture
+
+Polaris remains the source of truth for metadata and access control. Floe 
reads the catalog, evaluates policies, triggers maintenance on your chosen 
engine, and records outcomes.
+
+![Polaris + Floe Architecture](/img/blog/2026/02/04/high_level_architecture.png)
+
+### Data Flow
+
+1. **Policy discovery**: Floe loads enabled policies and matches them to tables.
+2. **Health assessment**: Floe evaluates table health based on scan mode and thresholds.
+3. **Planning & gating**: The planner selects operations; trigger conditions decide if they run.
+4. **Execution**: The orchestrator dispatches operations to Spark or Trino.
+5. **Persistence**: Results and health history are stored for tracking and recommendations.
+
+## Quick Start
+
+```bash
+make example-polaris
+```
+
+This starts Polaris, MinIO, and Floe with a `demo` catalog, creates sample 
Iceberg tables, and configures demo policies.
+
+* Floe UI: http://localhost:9091/ui
+* Floe API: http://localhost:9091/api/
+
+For Trino instead of Spark, run `make clean` first, then `make example-polaris-trino`.
+
+## Configuration
+
+```bash
+FLOE_CATALOG_TYPE=POLARIS
+FLOE_CATALOG_NAME=demo
+FLOE_CATALOG_POLARIS_URI=http://polaris:8181/api/catalog
+FLOE_CATALOG_POLARIS_CLIENT_ID=root
+FLOE_CATALOG_POLARIS_CLIENT_SECRET=secret
+FLOE_CATALOG_WAREHOUSE=demo
+```
+
+Note: For Polaris, `FLOE_CATALOG_WAREHOUSE` is the catalog name, not an S3 
path.

Review Comment:
   Good question. `FLOE_CATALOG_NAME` is Floe's internal identifier for the 
catalog connection, while `FLOE_CATALOG_WAREHOUSE` is passed to the Iceberg 
REST client as the `warehouse` parameter.
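
   A rough sketch of the distinction, assuming the standard Iceberg REST config endpoint (`GET /v1/config` with a `warehouse` query parameter) and the demo values from the post. `CONFIG_URL` is only an illustrative variable, not a Floe setting, and this is not how Floe itself builds the request:

```bash
#!/usr/bin/env bash
# Demo values copied from the Configuration section of the post.
FLOE_CATALOG_NAME=demo                                    # Floe's internal label for the connection
FLOE_CATALOG_POLARIS_URI=http://polaris:8181/api/catalog  # Polaris REST base URI
FLOE_CATALOG_WAREHOUSE=demo                               # Polaris catalog name, sent as `warehouse`

# Hypothetical illustration: an Iceberg REST client resolves catalog
# configuration by sending the warehouse value to the config endpoint.
CONFIG_URL="${FLOE_CATALOG_POLARIS_URI}/v1/config?warehouse=${FLOE_CATALOG_WAREHOUSE}"

echo "floe catalog id: ${FLOE_CATALOG_NAME}"
echo "rest config url: ${CONFIG_URL}"
```

The two values happen to be identical in the demo, but only `FLOE_CATALOG_WAREHOUSE` has to match the catalog name defined in Polaris.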



##########
site/content/blog/2026/02/04/floe-polaris-integration.md:
##########
@@ -0,0 +1,231 @@
+## Defining Policies
+
+Policies define maintenance operations and target tables via patterns:
+
+```bash
+curl -s -X POST "http://localhost:9091/api/v1/policies" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "name": "orders-maintenance",
+    "tablePattern": "demo.test.*",
+    "priority": 50,
+    "rewriteDataFiles": {
+      "strategy": "BINPACK",
+      "targetFileSizeBytes": 134217728
+    },
+    "expireSnapshots": {
+      "retainLast": 10,
+      "maxSnapshotAge": "P7D"
+    },
+    "orphanCleanup": {
+      "retentionPeriodInDays": 3
+    },
+    "rewriteManifests": {}
+  }'
+```
+
+Operations: `rewriteDataFiles`, `expireSnapshots`, `orphanCleanup`, 
`rewriteManifests`.
+
+## Triggering Maintenance
+
+```bash
+curl -X POST http://localhost:9091/api/v1/maintenance/trigger \
+  -H "Content-Type: application/json" \
+  -d '{
+    "catalog": "demo",
+    "namespace": "test",
+    "table": "orders"
+}'
+```
+
+Monitor progress via UI at `/ui/operations` or API at `/api/v1/operations`.
+
+## Floe UI
+
+Floe includes a web UI for managing policies and monitoring table health. The 
table view shows metadata alongside health indicators (snapshot count, small 
file percentage, delete file ratio) so you can see at a glance which tables 
need attention:
+
+![Table Metadata View](/img/blog/2026/02/04/table_metadata.png)
+
+## Health Reporting
+
+```bash
+curl http://localhost:9091/api/v1/tables/test/orders/health
+```
+
+Reports include: snapshot count/age, small file percentage, delete file count, 
partition skew, manifest size.

Review Comment:
   Floe connects to the Iceberg catalog, loads table metadata, and computes 
health metrics (snapshot count, file sizes, delete file ratio, etc.) based on 
the scan mode. 
   The latest release has all of these.
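
   To make one of those metrics concrete, here is a minimal sketch of a small-file percentage computed from a list of file sizes. The sizes and the 32 MiB cutoff are made-up assumptions for illustration, and this is not Floe's actual implementation:

```bash
#!/usr/bin/env bash
# Hypothetical data file sizes in bytes (1 MiB, 2 MiB, 128 MiB, 256 MiB).
sizes="1048576 2097152 134217728 268435456"

# Assumed cutoff below which a file counts as "small" (32 MiB).
threshold=$((32 * 1024 * 1024))

total=0
small=0
for s in $sizes; do
  total=$((total + 1))
  if [ "$s" -lt "$threshold" ]; then
    small=$((small + 1))
  fi
done

# Integer percentage of files that fall below the cutoff.
pct=$((small * 100 / total))
echo "small file percentage: ${pct}%"
```

A value like this is what a trigger condition such as `smallFilePercentageAbove` would be compared against.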



##########
site/content/blog/2026/02/04/floe-polaris-integration.md:
##########
@@ -0,0 +1,231 @@
+Scan modes: `metadata` (default), `scan`, `sample`.
+
+```properties
+floe.health.scan-mode=metadata
+floe.health.sample-limit=10000
+floe.health.persistence-enabled=true
+floe.health.max-reports-per-table=100
+```
+
+The `metadata` mode is fast but only sees file-level statistics. Use `scan` or 
`sample` when you need accurate small-file detection based on actual file sizes.
+
+## Automated Scheduling
+
+The scheduler computes a debt score per table based on health issues, time 
since last maintenance, and failure rate. Higher scores are prioritized.
+
+```properties
+floe.scheduler.enabled=true
+floe.scheduler.max-tables-per-poll=10
+floe.scheduler.max-bytes-per-hour=10737418240
+floe.scheduler.failure-backoff-threshold=3
+floe.scheduler.failure-backoff-hours=6
+floe.scheduler.zero-change-threshold=5
+floe.scheduler.condition-based-triggering-enabled=true
+```
+
+Key tuning parameters:
+- `max-bytes-per-hour`: Caps total bytes rewritten to avoid overwhelming storage I/O
+- `failure-backoff-threshold` / `failure-backoff-hours`: Prevents repeatedly retrying failing tables
+- `zero-change-threshold`: Reduces frequency for tables that consistently have no work to do
+
+## Signal-Based Triggering
+
+Gate execution based on table health instead of pure cron:
+
+```bash
+curl -s -X POST "http://localhost:9091/api/v1/policies" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "name": "smart-compaction",
+    "tablePattern": "demo.test.*",
+    "priority": 100,
+    "rewriteDataFiles": {
+      "strategy": "BINPACK",
+      "targetFileSizeBytes": 134217728
+    },
+    "triggerConditions": {
+      "smallFilePercentageAbove": 20,
+      "deleteFileCountAbove": 50,
+      "minIntervalMinutes": 60
+    }
+  }'
+```
+
+The policy triggers once the minimum interval has elapsed and either any condition is met (the default, OR logic) or, if `triggerLogic` is set to `AND`, all conditions are met.
+
+For critical tables, force execution when max delay is exceeded:
+
+```json
+{
+  "triggerConditions": {
+    "smallFilePercentageAbove": 30,
+    "criticalPipeline": true,
+    "criticalPipelineMaxDelayMinutes": 360
+  }
+}
+```
+
+Policies without `triggerConditions` run whenever the scheduler picks them up, 
preserving the original behavior.
+
+## Execution Engines
+
+Floe supports Spark (via Livy) and Trino as execution engines.
+
+Spark configuration:
+```bash
+FLOE_ENGINE_TYPE=SPARK
+FLOE_LIVY_URL=http://livy:8998
+```
+
+Trino configuration:
+```bash
+FLOE_ENGINE_TYPE=TRINO
+FLOE_TRINO_JDBC_URL=jdbc:trino://trino:8080
+FLOE_TRINO_CATALOG=demo
+```
+
+## Security
+
+* Enable authentication: `FLOE_AUTH_ENABLED=true`
+* Floe uses its own storage credentials; Polaris credentials are only used for catalog access

Review Comment:
   I haven't added support for credential vending yet, but I agree that it's a good long-term strategy, especially for multi-tenant setups.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
