This is an automated email from the ASF dual-hosted git repository.
yuxia pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/fluss.git
The following commit(s) were added to refs/heads/main by this push:
new 06738812e [doc] Add streaming union read part for paimon document (#1747)
06738812e is described below
commit 06738812ebbbc0cca2011525499cfac2478790d0
Author: yuxia Luo <[email protected]>
AuthorDate: Wed Sep 24 12:26:51 2025 +0800
[doc] Add streaming union read part for paimon document (#1747)
---
.../integrate-data-lakes/paimon.md | 24 +++++++++++++++++++++-
1 file changed, 23 insertions(+), 1 deletion(-)
diff --git a/website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md b/website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
index e1dbb3c59..5656a3288 100644
--- a/website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
+++ b/website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
@@ -72,6 +72,10 @@ You can choose between two views of the table:
#### Read Data Only in Paimon
+##### Prerequisites
+Download the [paimon-flink.jar](https://paimon.apache.org/docs/1.2/) that matches your Flink version, and place it in the `FLINK_HOME/lib` directory
+
+##### Read Paimon Data
To read only data stored in Paimon, use the `$lake` suffix in the table name. The following example demonstrates this:
```sql title="Flink SQL"
@@ -92,14 +96,32 @@ For further information, refer to Paimon’s [SQL Query documentation](https://p
#### Union Read of Data in Fluss and Paimon
+##### Prerequisites
+Download the [fluss-lake-paimon-$FLUSS_VERSION$.jar](https://repo1.maven.org/maven2/org/apache/fluss/fluss-lake-paimon/$FLUSS_VERSION$/fluss-lake-paimon-$FLUSS_VERSION$.jar), and place it into `${FLINK_HOME}/lib`.
+
+##### Union Read
To read the full dataset, which includes both Fluss (fresh) and Paimon (historical) data, simply query the table without any suffix. The following example illustrates this:
```sql title="Flink SQL"
+-- Set execution mode to streaming or batch, here just take batch as an example
+SET 'execution.runtime-mode' = 'batch';
+
-- Query will union data from Fluss and Paimon
SELECT SUM(order_count) AS total_orders FROM ads_nation_purchase_power;
```
+It supports both batch and streaming modes, using Paimon for historical data and Fluss for fresh data:
+- In batch mode
+
+ The query may run slower than reading only from Paimon because it needs to merge rows from both Paimon and Fluss. However, it returns the most up-to-date results. Multiple executions of the query may produce different outputs due to continuous data ingestion.
+
+- In streaming mode
+
+ Flink first reads the latest Paimon snapshot (tiered via tiering service), then switches to Fluss starting from the log offset aligned with that snapshot, ensuring exactly-once semantics.
+ This design enables Fluss to store only a small portion of the dataset in the Fluss cluster, reducing costs, while Paimon serves as the source of complete historical data when needed.
+
+ More precisely, if Fluss log data is removed due to TTL expiration—controlled by the `table.log.ttl` configuration—it can still be read by Flink through its Union Read capability, as long as the data has already been tiered to Paimon.
+ For partitioned tables, if a partition is cleaned up—controlled by the `table.auto-partition.num-retention` configuration—the data in that partition remains accessible from Paimon, provided it has been tiered there beforehand.
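
For illustration only, and not part of the committed change: the patch shows the batch variant of the union read, so a streaming-mode counterpart might look like the sketch below. It reuses the `ads_nation_purchase_power` table and the `execution.runtime-mode` setting from the example above. Running the same aggregate in streaming mode is an assumption made here for demonstration.

```sql title="Flink SQL"
-- Sketch (assumption): switch the session to streaming mode
SET 'execution.runtime-mode' = 'streaming';

-- Same union read as the batch example: no `$lake` suffix on the table name.
-- As described above, the job first reads the latest tiered Paimon snapshot,
-- then continues from the aligned Fluss log offset.
SELECT SUM(order_count) AS total_orders FROM ads_nation_purchase_power;
```

In streaming mode the job keeps running and the aggregate result is updated as new rows arrive, rather than returning once as in batch mode.
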
-This query may run slower than reading only from Paimon, but it returns the most up-to-date data. If you execute the query multiple times, you may observe different results due to continuous data ingestion.
### Reading with other Engines