This is an automated email from the ASF dual-hosted git repository.
ipolyzos pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/fluss.git
The following commit(s) were added to refs/heads/main by this push:
new 5c4225e2f [blog]: hands on fluss lakehouse (#1279)
5c4225e2f is described below
commit 5c4225e2f2ddf2191b928ae558fd25ac6b76f2d3
Author: Yang Guo <[email protected]>
AuthorDate: Wed Jul 23 15:58:32 2025 +0800
[blog]: hands on fluss lakehouse (#1279)
* [blog]: hands on fluss lakehouse
* feat: hands on fluss lakehouse with paimon s3
* make a few adjustments
---------
Co-authored-by: ipolyzos <[email protected]>
---
.../blog/2025-07-23-hands-on-fluss-lakehouse.md | 369 +++++++++++++++++++++
.../hands_on_fluss_lakehouse/fluss-bucket-data.png | Bin 0 -> 44654 bytes
.../hands_on_fluss_lakehouse/fluss-bucket.png | Bin 0 -> 419306 bytes
.../hands_on_fluss_lakehouse/streamhouse.png | Bin 0 -> 243122 bytes
.../tiering-serivce-job.png | Bin 0 -> 65796 bytes
website/blog/authors.yml | 2 +-
6 files changed, 370 insertions(+), 1 deletion(-)
diff --git a/website/blog/2025-07-23-hands-on-fluss-lakehouse.md
b/website/blog/2025-07-23-hands-on-fluss-lakehouse.md
new file mode 100644
index 000000000..e35b8acf9
--- /dev/null
+++ b/website/blog/2025-07-23-hands-on-fluss-lakehouse.md
@@ -0,0 +1,369 @@
+---
+slug: hands-on-fluss-lakehouse
+title: "Hands-on Fluss Lakehouse with Paimon S3"
+authors: [gyang94]
+toc_max_heading_level: 5
+---
+
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+
+# Hands-on Fluss Lakehouse with Paimon S3
+
+Fluss stores historical data in a lakehouse storage layer while keeping
real-time data in the Fluss server. Its built-in tiering service continuously
moves fresh events into the lakehouse, allowing various query engines to
analyze both hot and cold data. The real magic happens with Fluss's union-read
capability, which lets Flink jobs seamlessly query both the Fluss cluster and
the lakehouse for truly integrated real-time processing.
+
+
+
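Conceptually, a union read stitches the tiered lake snapshot together with the tail of the Fluss log. A minimal Python sketch of that merge, with illustrative field names (the real logic lives inside the Fluss Flink connector, not in user code):

```python
# Conceptual sketch of a union read: rows already tiered to the lake are
# combined with rows that only exist in the Fluss server's log. Field
# names are illustrative, not Fluss internals.

def union_read(lake_rows, log_rows):
    """lake_rows: rows tiered to Paimon, each carrying its log offset in
    '__offset'. log_rows: (offset, row) pairs held by the Fluss server."""
    tiered_up_to = max((r["__offset"] for r in lake_rows), default=-1)
    # Serve history from the lake, and only the not-yet-tiered tail from Fluss.
    fresh = [row for offset, row in log_rows if offset > tiered_up_to]
    history = [{k: v for k, v in r.items() if k != "__offset"} for r in lake_rows]
    return history + fresh

lake = [{"id": 1, "name": "Alice", "__offset": 0},
        {"id": 2, "name": "Bob", "__offset": 1}]
log = [(0, {"id": 1, "name": "Alice"}),
       (1, {"id": 2, "name": "Bob"}),
       (2, {"id": 3, "name": "Catlin"})]

# Alice and Bob come from the lake; only Catlin is read from the Fluss log.
print(union_read(lake, log))
```

This is the idea behind a plain `SELECT` on a datalake-enabled table returning fresh rows even before the tiering service has written them to Paimon.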
+In this hands-on tutorial, we'll walk you through setting up a local Fluss
lakehouse environment, running some practical data operations, and getting
first-hand experience with the complete Fluss lakehouse architecture. By the
end, you'll have a working environment for experimenting with Fluss's powerful
data processing capabilities.
+
+## Integrate Paimon S3 Lakehouse
+
+For this tutorial, we'll use **Fluss 0.7** and **Flink 1.20** to run the
tiering service on a local cluster. We'll configure **Paimon** as our lake
format and **S3** as the storage backend. Let's get started:
+
+### Minio Setup
+
+1. Install Minio object storage locally.
+
+ Check out the official [guide](https://min.io/docs/minio/macos/index.html)
for detailed instructions.
+
+2. Start the Minio server
+
+ Run this command, specifying a local path to store your Minio data:
+ ```
+ minio server /tmp/minio-data
+ ```
+
+3. Verify the Minio WebUI.
+
+ When your Minio server is up and running, you'll see endpoint information
and login credentials:
+
+ ```
+ API: http://192.168.2.236:9000 http://127.0.0.1:9000
+ RootUser: minioadmin
+ RootPass: minioadmin
+
+ WebUI: http://192.168.2.236:61832 http://127.0.0.1:61832
+ RootUser: minioadmin
+ RootPass: minioadmin
+ ```
+ Open the WebUI link and log in with these credentials.
+
+4. Create a `fluss` bucket through the WebUI.
+
+ 
+
+
+### Fluss Cluster Setup
+
+1. Download Fluss
+
+ Grab the Fluss 0.7 binary release from the [Fluss official
site](https://fluss.apache.org/downloads/).
+
+2. Add Dependencies
+
+ Download the `fluss-fs-s3-0.7.0.jar` from the [Fluss official
site](https://fluss.apache.org/downloads/) and place it in your
`<FLUSS_HOME>/lib` directory.
+
+ Next, download the `paimon-s3-1.0.1.jar` from the [Paimon official
site](https://paimon.apache.org/docs/1.0/project/download/) and add it to
`<FLUSS_HOME>/plugins/paimon`.
+
+3. Configure the Data Lake
+
+ Edit your `<FLUSS_HOME>/conf/server.yaml` file and add these settings:
+
+ ```yaml
+ data.dir: /tmp/fluss-data
+ remote.data.dir: /tmp/fluss-remote-data
+
+ datalake.format: paimon
+ datalake.paimon.metastore: filesystem
+ datalake.paimon.warehouse: s3://fluss/data
+ datalake.paimon.s3.endpoint: http://localhost:9000
+ datalake.paimon.s3.access-key: minioadmin
+ datalake.paimon.s3.secret-key: minioadmin
+ datalake.paimon.s3.path.style.access: true
+ ```
+
+ This configures Paimon as the datalake format with S3 as the warehouse.
+
+4. Start Fluss
+
+ ```bash
+ <FLUSS_HOME>/bin/local-cluster.sh start
+ ```
+
+### Flink Cluster Setup
+
+1. Download Flink
+
+ Download the Flink 1.20 binary package from the [Flink downloads
page](https://flink.apache.org/downloads/).
+
+2. Add the Fluss Connector
+
+ Download `fluss-flink-1.20-0.7.0.jar` from the [Fluss official
site](https://fluss.apache.org/downloads/) and copy it to:
+
+ ```
+ <FLINK_HOME>/lib
+ ```
+
+3. Add Paimon Dependencies
+
+ - Download `paimon-flink-1.20-1.0.1.jar` and `paimon-s3-1.0.1.jar` from the
[Paimon official site](https://paimon.apache.org/docs/1.0/project/download/)
and place them in `<FLINK_HOME>/lib`.
+ - Copy these Paimon plugin jars from Fluss into `<FLINK_HOME>/lib`:
+
+ ```
+ <FLUSS_HOME>/plugins/paimon/fluss-lake-paimon-0.7.0.jar
+ <FLUSS_HOME>/plugins/paimon/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar
+ ```
+
+4. Increase Task Slots
+
+ Edit `<FLINK_HOME>/conf/config.yaml` to increase available task slots:
+
+    ```yaml
+    taskmanager:
+      numberOfTaskSlots: 5
+    ```
+
+5. Start Flink
+
+ ```bash
+ <FLINK_HOME>/bin/start-cluster.sh
+ ```
+
+6. Verify
+
+ Open your browser to `http://localhost:8081/` and make sure the cluster is
running.
+
+### Launching the Tiering Service
+
+1. Get the Tiering Job Jar
+
+    Download `fluss-flink-tiering-0.7.0.jar` from the [Fluss official site](https://fluss.apache.org/downloads/).
+
+2. Submit the Job
+
+ ```bash
+ <FLINK_HOME>/bin/flink run \
+ <path_to_jar>/fluss-flink-tiering-0.7.0.jar \
+ --fluss.bootstrap.servers localhost:9123 \
+ --datalake.format paimon \
+ --datalake.paimon.metastore filesystem \
+ --datalake.paimon.warehouse s3://fluss/data \
+ --datalake.paimon.s3.endpoint http://localhost:9000 \
+ --datalake.paimon.s3.access-key minioadmin \
+ --datalake.paimon.s3.secret-key minioadmin \
+ --datalake.paimon.s3.path.style.access true
+ ```
+
+3. Confirm Deployment
+
+ Check the Flink UI for the **Fluss Lake Tiering Service** job. Once it's
running, your local tiering pipeline is good to go.
+
+ 
+
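In spirit, the tiering job tails each table's log from the last tiered offset and periodically commits the new records to Paimon as a snapshot. A toy model of one tiering round (a sketch only, not the actual implementation):

```python
def tier_once(log, tiered_up_to):
    """Tier one batch: everything in the log past the last tiered offset
    becomes a new lake 'snapshot'. Toy model of the tiering service."""
    batch = [(off, row) for off, row in log if off > tiered_up_to]
    if not batch:
        # Nothing new since the last round: no snapshot is committed.
        return tiered_up_to, None
    new_offset = max(off for off, _ in batch)
    snapshot = [row for _, row in batch]
    return new_offset, snapshot

log = [(0, "Alice"), (1, "Bob"), (2, "Catlin"), (3, "Dylan")]
offset, snap = tier_once(log, -1)      # first round tiers everything
print(offset, snap)                    # 3 ['Alice', 'Bob', 'Catlin', 'Dylan']
offset, snap = tier_once(log, offset)  # no new records -> no new snapshot
print(offset, snap)                    # 3 None
```

Each non-empty round corresponds to one Paimon snapshot, which is what you will see later in the `$lake$snapshots` system table.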
+## Data Processing
+
+Now let's dive into some actual data processing. We'll use the Flink SQL
Client to interact with our Fluss lakehouse and run both batch and streaming
queries.
+
+1. Launch the SQL Client
+
+ ```bash
+ <FLINK_HOME>/bin/sql-client.sh
+ ```
+
+2. Create the Catalog and Table
+
+ ```sql
+    CREATE CATALOG fluss_catalog WITH (
+        'type' = 'fluss',
+        'bootstrap.servers' = 'localhost:9123'
+    );
+
+    USE CATALOG fluss_catalog;
+
+    CREATE TABLE t_user (
+        `id` BIGINT,
+        `name` STRING NOT NULL,
+        `age` INT,
+        `birth` DATE,
+        PRIMARY KEY (`id`) NOT ENFORCED
+    ) WITH (
+        'table.datalake.enabled' = 'true',
+        'table.datalake.freshness' = '30s'
+    );
+ ```
+
+    The `'table.datalake.enabled' = 'true'` option turns on lake tiering for this table, and `'table.datalake.freshness' = '30s'` asks the tiering service to keep the lake copy no more than about 30 seconds behind the Fluss table.
+
+3. Write Some Data
+
+ Let's insert a couple of records:
+
+ ```sql
+ SET 'execution.runtime-mode' = 'batch';
+ SET 'sql-client.execution.result-mode' = 'tableau';
+
+ INSERT INTO t_user(id,name,age,birth) VALUES
+ (1,'Alice',18,DATE '2000-06-10'),
+ (2,'Bob',20,DATE '2001-06-20');
+ ```
+
+4. Union Read
+
+ Now run a simple query to retrieve data from the table. By default, Flink
will automatically combine data from both the Fluss cluster and the lakehouse:
+
+ ```sql
+ Flink SQL> select * from t_user;
+ +----+-------+-----+------------+
+ | id | name | age | birth |
+ +----+-------+-----+------------+
+ | 1 | Alice | 18 | 2000-06-10 |
+ | 2 | Bob | 20 | 2001-06-20 |
+ +----+-------+-----+------------+
+ ```
+
+ If you want to read data only from the lake table, simply append `$lake`
after the table name:
+
+    ```sql
+    Flink SQL> select * from t_user$lake;
+    +----+-------+-----+------------+----------+----------+----------------------------+
+    | id |  name | age |      birth | __bucket | __offset |                __timestamp |
+    +----+-------+-----+------------+----------+----------+----------------------------+
+    |  1 | Alice |  18 | 2000-06-10 |        0 |       -1 | 1970-01-01 07:59:59.999000 |
+    |  2 |   Bob |  20 | 2001-06-20 |        0 |       -1 | 1970-01-01 07:59:59.999000 |
+    +----+-------+-----+------------+----------+----------+----------------------------+
+    ```
+
+ Great! Our records have been successfully synced to the data lake by the
tiering service.
+
+    Notice the three system columns in the Paimon lake table: `__bucket`, `__offset`, and `__timestamp`. The `__bucket` column records which bucket a row belongs to, while `__offset` and `__timestamp` record the row's position in the Fluss log; streaming reads use them to switch from the lake snapshot to the log at the correct point.
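Bucket assignment for a primary-key table is, in essence, a hash of the key modulo the bucket count, which is why every row above shows `__bucket = 0` for this single-bucket table. A simplified illustration, using `zlib.crc32` as a stand-in (Fluss's actual hash function differs):

```python
import zlib

def assign_bucket(primary_key, num_buckets):
    """Hash-modulo bucketing sketch: rows with the same key always land
    in the same bucket. zlib.crc32 stands in for Fluss's real hash."""
    return zlib.crc32(str(primary_key).encode()) % num_buckets

# With a single bucket, every key maps to bucket 0 -- matching the
# __bucket column in the query result above.
print([assign_bucket(k, 1) for k in (1, 2, 3, 4)])  # [0, 0, 0, 0]
```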
+
+5. Streaming Inserts
+
+ Let's switch to streaming mode and add two more records:
+
+ ```sql
+ Flink SQL> SET 'execution.runtime-mode' = 'streaming';
+
+ Flink SQL> INSERT INTO t_user(id,name,age,birth) VALUES
+ (3,'Catlin',25,DATE '2002-06-10'),
+ (4,'Dylan',28,DATE '2003-06-20');
+ ```
+
+ Now query the lake again:
+
+    ```sql
+    Flink SQL> select * from t_user$lake;
+    +----+----+--------+-----+------------+----------+----------+----------------------------+
+    | op | id |   name | age |      birth | __bucket | __offset |                __timestamp |
+    +----+----+--------+-----+------------+----------+----------+----------------------------+
+    | +I |  1 |  Alice |  18 | 2000-06-10 |        0 |       -1 | 1970-01-01 07:59:59.999000 |
+    | +I |  2 |    Bob |  20 | 2001-06-20 |        0 |       -1 | 1970-01-01 07:59:59.999000 |
+    +----+----+--------+-----+------------+----------+----------+----------------------------+
+
+    Flink SQL> select * from t_user$lake;
+    +----+----+--------+-----+------------+----------+----------+----------------------------+
+    | op | id |   name | age |      birth | __bucket | __offset |                __timestamp |
+    +----+----+--------+-----+------------+----------+----------+----------------------------+
+    | +I |  1 |  Alice |  18 | 2000-06-10 |        0 |       -1 | 1970-01-01 07:59:59.999000 |
+    | +I |  2 |    Bob |  20 | 2001-06-20 |        0 |       -1 | 1970-01-01 07:59:59.999000 |
+    | +I |  3 | Catlin |  25 | 2002-06-10 |        0 |        2 | 2025-07-19 19:03:54.150000 |
+    | +I |  4 |  Dylan |  28 | 2003-06-20 |        0 |        3 | 2025-07-19 19:03:54.150000 |
+    +----+----+--------+-----+------------+----------+----------+----------------------------+
+    ```
+
+ The first time we queried, our new records hadn't been synced to the lake
table yet. After waiting a moment, they appeared.
+
+ Notice that the `__offset` and `__timestamp` values for these new records
are no longer the default values. They now show the actual offset and timestamp
when the records were added to the table.
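Incidentally, the default `__timestamp` of `1970-01-01 07:59:59.999000` seen earlier is not garbage: it is the sentinel value of epoch millisecond `-1` (matching the `-1` default in `__offset`) rendered in a UTC+8 session time zone. A quick check:

```python
from datetime import datetime, timedelta, timezone

# The default __timestamp is "epoch minus one millisecond"; in a UTC+8
# session it renders as one millisecond before 08:00 on 1970-01-01.
utc8 = timezone(timedelta(hours=8))
epoch = datetime.fromtimestamp(0, tz=utc8)
sentinel = epoch - timedelta(milliseconds=1)
print(sentinel.strftime("%Y-%m-%d %H:%M:%S.%f"))  # 1970-01-01 07:59:59.999000
```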
+
+6. Inspect the Paimon Files
+
+ Open the Minio WebUI, and you'll see the Paimon files in your bucket:
+
+ 
+
+ You can also check the Parquet files and manifest in your local filesystem
under `/tmp/minio-data`:
+
+    ```
+    /tmp/minio-data ❯ tree .
+    .
+    └── fluss
+        └── data
+            ├── default.db__XLDIR__
+            │   └── xl.meta
+            └── fluss.db
+                └── t_user
+                    ├── bucket-0
+                    │   ├── changelog-1bafcc32-f88a-42a6-bc92-d3ccf4f62d4c-0.parquet
+                    │   │   └── xl.meta
+                    │   ├── changelog-f1853f1c-2588-4035-8233-e4804b1d8344-0.parquet
+                    │   │   └── xl.meta
+                    │   ├── data-1bafcc32-f88a-42a6-bc92-d3ccf4f62d4c-1.parquet
+                    │   │   └── xl.meta
+                    │   └── data-f1853f1c-2588-4035-8233-e4804b1d8344-1.parquet
+                    │       └── xl.meta
+                    ├── manifest
+                    │   ├── manifest-d554f475-ad8f-47e0-a83b-22bce4b233d6-0
+                    │   │   └── xl.meta
+                    │   ├── manifest-d554f475-ad8f-47e0-a83b-22bce4b233d6-1
+                    │   │   └── xl.meta
+                    │   ├── manifest-e7fbe5b1-a9e4-4647-a07a-5cc71950a5be-0
+                    │   │   └── xl.meta
+                    │   ├── manifest-e7fbe5b1-a9e4-4647-a07a-5cc71950a5be-1
+                    │   │   └── xl.meta
+                    │   ├── manifest-list-8975f7d7-9fec-4ac9-bb31-12be03d297d0-0
+                    │   │   └── xl.meta
+                    │   ├── manifest-list-8975f7d7-9fec-4ac9-bb31-12be03d297d0-1
+                    │   │   └── xl.meta
+                    │   ├── manifest-list-8975f7d7-9fec-4ac9-bb31-12be03d297d0-2
+                    │   │   └── xl.meta
+                    │   ├── manifest-list-bba1f130-e7ab-4f5e-8ce3-928a53524136-0
+                    │   │   └── xl.meta
+                    │   ├── manifest-list-bba1f130-e7ab-4f5e-8ce3-928a53524136-1
+                    │   │   └── xl.meta
+                    │   └── manifest-list-bba1f130-e7ab-4f5e-8ce3-928a53524136-2
+                    │       └── xl.meta
+                    ├── schema
+                    │   └── schema-0
+                    │       └── xl.meta
+                    └── snapshot
+                        ├── LATEST
+                        │   └── xl.meta
+                        ├── snapshot-1
+                        │   └── xl.meta
+                        └── snapshot-2
+                            └── xl.meta
+
+    28 directories, 19 files
+    ```
+
+7. View Snapshots
+
+ You can also check the snapshots from the system table by appending
`$lake$snapshots` after the Fluss table name:
+
+    ```sql
+    Flink SQL> select * from t_user$lake$snapshots;
+    +-------------+-----------+----------------------+-------------------------+-------------+-----+
+    | snapshot_id | schema_id |          commit_user |             commit_time | commit_kind | ... |
+    +-------------+-----------+----------------------+-------------------------+-------------+-----+
+    |           1 |         0 | __fluss_lake_tiering | 2025-07-19 19:00:41.286 |      APPEND | ... |
+    |           2 |         0 | __fluss_lake_tiering | 2025-07-19 19:04:38.964 |      APPEND | ... |
+    +-------------+-----------+----------------------+-------------------------+-------------+-----+
+    2 rows in set (0.33 seconds)
+    ```
+
+## Summary
+
+In this guide, we've explored the Fluss lakehouse architecture and set up a
complete local environment with Fluss, Flink, Paimon, and S3. We've walked
through practical examples of data processing that showcase how Fluss
seamlessly integrates real-time and historical data. With this setup, you now
have a solid foundation for experimenting with Fluss's powerful lakehouse
capabilities on your own machine.
\ No newline at end of file
diff --git a/website/blog/assets/hands_on_fluss_lakehouse/fluss-bucket-data.png
b/website/blog/assets/hands_on_fluss_lakehouse/fluss-bucket-data.png
new file mode 100644
index 000000000..8cc29fa87
Binary files /dev/null and
b/website/blog/assets/hands_on_fluss_lakehouse/fluss-bucket-data.png differ
diff --git a/website/blog/assets/hands_on_fluss_lakehouse/fluss-bucket.png
b/website/blog/assets/hands_on_fluss_lakehouse/fluss-bucket.png
new file mode 100644
index 000000000..f7d005ddb
Binary files /dev/null and
b/website/blog/assets/hands_on_fluss_lakehouse/fluss-bucket.png differ
diff --git a/website/blog/assets/hands_on_fluss_lakehouse/streamhouse.png
b/website/blog/assets/hands_on_fluss_lakehouse/streamhouse.png
new file mode 100644
index 000000000..7092c126b
Binary files /dev/null and
b/website/blog/assets/hands_on_fluss_lakehouse/streamhouse.png differ
diff --git
a/website/blog/assets/hands_on_fluss_lakehouse/tiering-serivce-job.png
b/website/blog/assets/hands_on_fluss_lakehouse/tiering-serivce-job.png
new file mode 100644
index 000000000..f55ea8039
Binary files /dev/null and
b/website/blog/assets/hands_on_fluss_lakehouse/tiering-serivce-job.png differ
diff --git a/website/blog/authors.yml b/website/blog/authors.yml
index 3099b89a0..5a2faaa65 100644
--- a/website/blog/authors.yml
+++ b/website/blog/authors.yml
@@ -35,7 +35,7 @@ yuxia:
image_url: https://github.com/luoyuxia.png
gyang94:
- name: GUO Yang
+ name: Yang Guo
title: Fluss Contributor
url: https://github.com/gyang94
image_url: https://github.com/gyang94.png
\ No newline at end of file