This is an automated email from the ASF dual-hosted git repository.
yuxia pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/fluss.git
The following commit(s) were added to refs/heads/main by this push:
new 02a8f55df [docs] Update datalake related doc for support iceberg and
lance (#1686)
02a8f55df is described below
commit 02a8f55dfd730d9a8385314bf0d9f5eae10e3ed3
Author: CaoZhen <[email protected]>
AuthorDate: Mon Sep 15 11:10:07 2025 +0800
[docs] Update datalake related doc for support iceberg and lance (#1686)
---
.../org/apache/fluss/config/ConfigOptions.java | 7 ++--
website/docs/engine-flink/options.md | 46 +++++++++++-----------
website/docs/install-deploy/overview.md | 5 ++-
website/docs/maintenance/configuration.md | 6 +--
.../docs/maintenance/tiered-storage/overview.md | 2 +-
.../integrate-data-lakes/paimon.md | 6 ++-
website/docs/streaming-lakehouse/overview.md | 2 +-
7 files changed, 40 insertions(+), 34 deletions(-)
diff --git
a/fluss-common/src/main/java/org/apache/fluss/config/ConfigOptions.java
b/fluss-common/src/main/java/org/apache/fluss/config/ConfigOptions.java
index e6261c692..343434d38 100644
--- a/fluss-common/src/main/java/org/apache/fluss/config/ConfigOptions.java
+++ b/fluss-common/src/main/java/org/apache/fluss/config/ConfigOptions.java
@@ -1299,7 +1299,8 @@ public class ConfigOptions {
.enumType(DataLakeFormat.class)
.noDefaultValue()
.withDescription(
- "The data lake format of the table specifies the
tiered Lakehouse storage format, such as Paimon, Iceberg, DeltaLake, or Hudi.
Currently, only `paimon` is supported. "
+ "The data lake format of the table specifies the
tiered Lakehouse storage format. Currently, supported formats are `paimon`,
`iceberg`, and `lance`. "
+ + "In the future, more kinds of data lake
format will be supported, such as DeltaLake or Hudi. "
+ "Once the `table.datalake.format`
property is configured, Fluss adopts the key encoding and bucketing strategy
used by the corresponding data lake format. "
+ "This ensures consistency in key
encoding and bucketing, enabling seamless **Union Read** functionality across
Fluss and Lakehouse. "
+ "The `table.datalake.format` can be
pre-defined before enabling `table.datalake.enabled`. This allows the data lake
feature to be dynamically enabled on the table without requiring table
recreation. "
@@ -1646,8 +1647,8 @@ public class ConfigOptions {
.enumType(DataLakeFormat.class)
.noDefaultValue()
.withDescription(
- "The datalake format used by Fluss to be as lake
storage, such as Paimon, Iceberg, Hudi. "
- + "Now, only support Paimon.");
+ "The datalake format used by of Fluss to be as
lakehouse storage. Currently, supported formats are Paimon, Iceberg, and Lance.
"
+ + "In the future, more kinds of data lake
format will be supported, such as DeltaLake or Hudi.");
// ------------------------------------------------------------------------
// ConfigOptions for fluss kafka
diff --git a/website/docs/engine-flink/options.md
b/website/docs/engine-flink/options.md
index 159352c6c..8eacd0cfd 100644
--- a/website/docs/engine-flink/options.md
+++ b/website/docs/engine-flink/options.md
@@ -60,29 +60,29 @@ ALTER TABLE log_table SET ('table.log.ttl' = '7d');
## Storage Options
-| Option | Type | Default
| Description
[...]
-|-----------------------------------------|----------|-------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
-| bucket.num | int | The bucket number of
Fluss cluster. | The number of buckets of a Fluss table.
[...]
-| bucket.key | String | (None)
| Specific the distribution policy of the Fluss table. Data will be
distributed to each bucket according to the hash value of bucket-key (It must
be a subset of the primary keys excluding partition keys of the primary key
table). If you specify multiple fields, delimiter is `,`. If the table has a
primary key and a bucket key is not specified, the bucket key will be used as
primary key(excluding th [...]
-| table.log.ttl | Duration | 7 days
| The time to live for log segments. The configuration controls the
maximum time we will retain a log before we will delete old segments to free up
space. If set to -1, the log will not be deleted.
[...]
-| table.auto-partition.enabled | Boolean | false
| Whether enable auto partition for the table. Disable by default.
When auto partition is enabled, the partitions of the table will be created
automatically.
[...]
-| table.auto-partition.key | String | (None)
| This configuration defines the time-based partition key to be
used for auto-partitioning when a table is partitioned with multiple keys.
Auto-partitioning utilizes a time-based partition key to handle partitions
automatically, including creating new ones and removing outdated ones, by
comparing the time value of the partition with the current system time. In the
case of a table using multiple par [...]
-| table.auto-partition.time-unit | ENUM | DAY
| The time granularity for auto created partitions. The default
value is `DAY`. Valid values are `HOUR`, `DAY`, `MONTH`, `QUARTER`, `YEAR`. If
the value is `HOUR`, the partition format for auto created is yyyyMMddHH. If
the value is `DAY`, the partition format for auto created is yyyyMMdd. If the
value is `MONTH`, the partition format for auto created is yyyyMM. If the value
is `QUARTER`, the parti [...]
-| table.auto-partition.num-precreate | Integer | 2
| The number of partitions to pre-create for auto created
partitions in each check for auto partition. For example, if the current check
time is 2024-11-11 and the value is configured as 3, then partitions 20241111,
20241112, 20241113 will be pre-created. If any one partition exists, it'll skip
creating the partition. The default value is 2, which means 2 partitions will
be pre-created. If the `tab [...]
-| table.auto-partition.num-retention | Integer | 7
| The number of history partitions to retain for auto created
partitions in each check for auto partition. For example, if the current check
time is 2024-11-11, time-unit is DAY, and the value is configured as 3, then
the history partitions 20241108, 20241109, 20241110 will be retained. The
partitions earlier than 20241108 will be deleted. The default value is 7, which
means that 7 partitions will [...]
-| table.auto-partition.time-zone | String | the system time zone
| The time zone for auto partitions, which is by default the same
as the system time zone.
[...]
-| table.replication.factor | Integer | (None)
| The replication factor for the log of the new table. When it's
not set, Fluss will use the cluster's default replication factor configured by
default.replication.factor. It should be a positive number and not larger than
the number of tablet servers in the Fluss cluster. A value larger than the
number of tablet servers in Fluss cluster will result in an error when the new
table is created. [...]
-| table.log.format | Enum | ARROW
| The format of the log records in log store. The default value is
`ARROW`. The supported formats are `ARROW` and `INDEXED`.
[...]
-| table.log.arrow.compression.type | Enum | ZSTD
| The compression type of the log records if the log format is set
to `ARROW`. The candidate compression type is `NONE`, `LZ4_FRAME`, `ZSTD`. The
default value is `ZSTD`.
[...]
-| table.log.arrow.compression.zstd.level | Integer | 3
| The compression level of the log records if the log format is set
to `ARROW` and the compression type is set to `ZSTD`. The valid range is 1 to
22. The default value is 3.
[...]
-| table.kv.format | Enum | COMPACTED
| The format of the kv records in kv store. The default value is
`COMPACTED`. The supported formats are `COMPACTED` and `INDEXED`.
[...]
-| table.log.tiered.local-segments | Integer | 2
| The number of log segments to retain in local for each table when
log tiered storage is enabled. It must be greater that 0. The default is 2.
[...]
-| table.datalake.enabled | Boolean | false
| Whether enable lakehouse storage for the table. Disabled by
default. When this option is set to ture and the datalake tiering service is
up, the table will be tiered and compacted into datalake format stored on
lakehouse storage.
[...]
-| table.datalake.format | Enum | (None)
| The data lake format of the table specifies the tiered Lakehouse
storage format, such as Paimon, Iceberg, DeltaLake, or Hudi. Currently, only
`paimon` is supported. Once the `table.datalake.format` property is configured,
Fluss adopts the key encoding and bucketing strategy used by the corresponding
data lake format. This ensures consistency in key encoding and bucketing,
enabling seamless **Unio [...]
-| table.datalake.freshness | Duration | 3min
| It defines the maximum amount of time that the datalake table's
content should lag behind updates to the Fluss table. Based on this target
freshness, the Fluss service automatically moves data from the Fluss table and
updates to the datalake table, so that the data in the datalake table is kept
up to date within this target. If the data does not need to be as fresh, you
can specify a longer targe [...]
-| table.datalake.auto-compaction | Boolean | false
| If true, compaction will be triggered automatically when tiering
service writes to the datalake. It is disabled by default.
[...]
-| table.merge-engine | Enum | (None)
| Defines the merge engine for the primary key table. By default,
primary key table uses the [default merge
engine(last_row)](table-design/table-types/pk-table/merge-engines/default.md).
It also supports two merge engines are `first_row` and `versioned`. The
[first_row merge
engine](table-design/table-types/pk-table/merge-engines/first-row.md) will keep
the first row of the same primary key. The [v [...]
-| table.merge-engine.versioned.ver-column | String | (None)
| The column name of the version column for the `versioned` merge
engine. If the merge engine is set to `versioned`, the version column must be
set.
[...]
+| Option | Type | Default
| Description
[...]
+|-----------------------------------------|----------|-------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
+| bucket.num | int | The default bucket number
of the Fluss cluster. | The number of buckets of a Fluss table.
[...]
+| bucket.key | String | (None)
| Specifies the distribution policy of the Fluss table. Data is
distributed to each bucket according to the hash value of the bucket key (it
must be a subset of the primary keys excluding partition keys of the primary
key table). If you specify multiple fields, the delimiter is `,`. If the table
has a primary key and no bucket key is specified, the bucket key defaults to
the primary key(excluding th [...]
+| table.log.ttl | Duration | 7 days
| The time to live for log segments. The configuration controls the
maximum time we will retain a log before we will delete old segments to free up
space. If set to -1, the log will not be deleted.
[...]
+| table.auto-partition.enabled | Boolean | false
| Whether to enable auto partitioning for the table. Disabled by
default. When auto partitioning is enabled, the partitions of the table will be
created automatically.
[...]
+| table.auto-partition.key | String | (None)
| This configuration defines the time-based partition key to be
used for auto-partitioning when a table is partitioned with multiple keys.
Auto-partitioning utilizes a time-based partition key to handle partitions
automatically, including creating new ones and removing outdated ones, by
comparing the time value of the partition with the current system time. In the
case of a table using multiple par [...]
+| table.auto-partition.time-unit | ENUM | DAY
| The time granularity for auto created partitions. The default
value is `DAY`. Valid values are `HOUR`, `DAY`, `MONTH`, `QUARTER`, `YEAR`. If
the value is `HOUR`, the partition format for auto created is yyyyMMddHH. If
the value is `DAY`, the partition format for auto created is yyyyMMdd. If the
value is `MONTH`, the partition format for auto created is yyyyMM. If the value
is `QUARTER`, the parti [...]
+| table.auto-partition.num-precreate | Integer | 2
| The number of partitions to pre-create for auto created
partitions in each check for auto partition. For example, if the current check
time is 2024-11-11 and the value is configured as 3, then partitions 20241111,
20241112, 20241113 will be pre-created. If any one partition exists, it'll skip
creating the partition. The default value is 2, which means 2 partitions will
be pre-created. If the `tab [...]
+| table.auto-partition.num-retention | Integer | 7
| The number of history partitions to retain for auto created
partitions in each check for auto partition. For example, if the current check
time is 2024-11-11, time-unit is DAY, and the value is configured as 3, then
the history partitions 20241108, 20241109, 20241110 will be retained. The
partitions earlier than 20241108 will be deleted. The default value is 7, which
means that 7 partitions will [...]
+| table.auto-partition.time-zone | String | the system time zone
| The time zone for auto partitions, which is by default the same
as the system time zone.
[...]
+| table.replication.factor | Integer | (None)
| The replication factor for the log of the new table. When it's
not set, Fluss will use the cluster's default replication factor configured by
default.replication.factor. It should be a positive number and not larger than
the number of tablet servers in the Fluss cluster. A value larger than the
number of tablet servers in Fluss cluster will result in an error when the new
table is created. [...]
+| table.log.format | Enum | ARROW
| The format of the log records in log store. The default value is
`ARROW`. The supported formats are `ARROW` and `INDEXED`.
[...]
+| table.log.arrow.compression.type | Enum | ZSTD
| The compression type of the log records if the log format is set
to `ARROW`. The candidate compression type is `NONE`, `LZ4_FRAME`, `ZSTD`. The
default value is `ZSTD`.
[...]
+| table.log.arrow.compression.zstd.level | Integer | 3
| The compression level of the log records if the log format is set
to `ARROW` and the compression type is set to `ZSTD`. The valid range is 1 to
22. The default value is 3.
[...]
+| table.kv.format | Enum | COMPACTED
| The format of the kv records in kv store. The default value is
`COMPACTED`. The supported formats are `COMPACTED` and `INDEXED`.
[...]
+| table.log.tiered.local-segments | Integer | 2
| The number of log segments to retain locally for each table when
tiered log storage is enabled. It must be greater than 0. The default is 2.
[...]
+| table.datalake.enabled | Boolean | false
| Whether to enable lakehouse storage for the table. Disabled by
default. When this option is set to true and the datalake tiering service is
up, the table will be tiered and compacted into the datalake format stored on
lakehouse storage.
[...]
+| table.datalake.format | Enum | (None)
| The data lake format of the table specifies the tiered lakehouse
storage format. Currently, the supported formats are `paimon`, `iceberg`, and
`lance`. In the future, more data lake formats will be supported, such as
Delta Lake or Hudi. Once the `table.datalake.format` property is configured,
Fluss adopts the key encoding and bucketing strategy used by the corresponding
data lake format. This [...]
+| table.datalake.freshness | Duration | 3min
| It defines the maximum amount of time that the datalake table's
content should lag behind updates to the Fluss table. Based on this target
freshness, the Fluss service automatically moves data from the Fluss table and
updates to the datalake table, so that the data in the datalake table is kept
up to date within this target. If the data does not need to be as fresh, you
can specify a longer targe [...]
+| table.datalake.auto-compaction | Boolean | false
| If true, compaction will be triggered automatically when tiering
service writes to the datalake. It is disabled by default.
[...]
+| table.merge-engine | Enum | (None)
| Defines the merge engine for the primary key table. By default,
a primary key table uses the [default merge
engine (last_row)](table-design/table-types/pk-table/merge-engines/default.md).
It also supports two other merge engines: `first_row` and `versioned`. The
[first_row merge
engine](table-design/table-types/pk-table/merge-engines/first-row.md) will keep
the first row of the same primary key. The [v [...]
+| table.merge-engine.versioned.ver-column | String | (None)
| The column name of the version column for the `versioned` merge
engine. If the merge engine is set to `versioned`, the version column must be
set.
[...]
## Read Options
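Taken together, the datalake table options replaced in the hunk above are set
per table at creation time. A minimal sketch in Flink SQL of how they fit
together (the table name and schema are illustrative, not part of this change):

```sql
-- Illustrative only: a primary key table with lake tiering enabled.
-- Per the option descriptions, 'iceberg' could equally be 'paimon' or 'lance'.
CREATE TABLE orders (
    order_id BIGINT,
    amount DECIMAL(10, 2),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'table.datalake.enabled' = 'true',
    'table.datalake.format' = 'iceberg',
    'table.datalake.freshness' = '3min'
);
```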
diff --git a/website/docs/install-deploy/overview.md
b/website/docs/install-deploy/overview.md
index 4878f16ee..9704b8326 100644
--- a/website/docs/install-deploy/overview.md
+++ b/website/docs/install-deploy/overview.md
@@ -116,8 +116,9 @@ We have listed them in the table below the figure.
by query engines such as Flink, Spark, StarRocks, Trino.
</td>
<td>
- <li>[Paimon](maintenance/tiered-storage/lakehouse-storage.md)</li>
- <li>[Iceberg (Roadmap)](/roadmap/)</li>
+
<li>[Paimon](streaming-lakehouse/integrate-data-lakes/paimon.md)</li>
+
<li>[Iceberg](streaming-lakehouse/integrate-data-lakes/iceberg.md)</li>
+
<li>[Lance](streaming-lakehouse/integrate-data-lakes/lance.md)</li>
</td>
</tr>
<tr>
diff --git a/website/docs/maintenance/configuration.md
b/website/docs/maintenance/configuration.md
index 82c34e013..c5dc0ed7e 100644
--- a/website/docs/maintenance/configuration.md
+++ b/website/docs/maintenance/configuration.md
@@ -164,9 +164,9 @@ during the Fluss cluster working.
## Lakehouse
-| Option | Type | Default | Description
|
-|-----------------|------|---------|---------------------------------------------------------------------------------------------------------------------------|
-| datalake.format | Enum | (None) | The datalake format used by of Fluss to
be as lakehouse storage, such as Paimon, Iceberg, Hudi. Now, only support
Paimon. |
+| Option | Type | Default | Description
|
+|-----------------|------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| datalake.format | Enum | (None) | The datalake format used by Fluss as
lakehouse storage. Currently, the supported formats are Paimon, Iceberg, and
Lance. In the future, more data lake formats will be supported, such as
Delta Lake or Hudi. |
## Kafka
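For context on the cluster-level `datalake.format` option documented above, a
minimal server configuration sketch (the YAML layout is an assumption for
illustration; only the option name and values come from this change):

```yaml
# Illustrative fragment: choose the lake format used as lakehouse storage.
# Supported values per this change: paimon, iceberg, lance.
datalake.format: paimon
```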
diff --git a/website/docs/maintenance/tiered-storage/overview.md
b/website/docs/maintenance/tiered-storage/overview.md
index ad438d427..898af077d 100644
--- a/website/docs/maintenance/tiered-storage/overview.md
+++ b/website/docs/maintenance/tiered-storage/overview.md
@@ -14,7 +14,7 @@ Fluss organizes data into different storage layers based on
its access patterns,
Fluss ensures the recent data is stored in local for higher write/read
performance and the historical data is stored in [remote
storage](remote-storage.md) for lower cost.
What's more, since the native format of Fluss's data is optimized for
real-time write/read which is inevitable unfriendly to batch analytics, Fluss
also introduces a [lakehouse storage](lakehouse-storage.md) which stores the
data
-in the well-known open data lake format for better analytics performance.
Currently, only Paimon is supported, but more kinds of data lake support are on
the way. Keep eyes on us!
+in the well-known open data lake format for better analytics performance.
Currently, the supported formats are Paimon, Iceberg, and Lance. More data lake
formats are on the way. Keep an eye on us!
The overall tiered storage architecture is shown in the following diagram:
diff --git a/website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
b/website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
index c42b04d46..e1dbb3c59 100644
--- a/website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
+++ b/website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
@@ -5,10 +5,14 @@ sidebar_position: 1
# Paimon
+## Introduction
+
[Apache Paimon](https://paimon.apache.org/) innovatively combines a lake
format with an LSM (Log-Structured Merge-tree) structure, bringing efficient
updates into the lake architecture.
To integrate Fluss with Paimon, you must enable lakehouse storage and
configure Paimon as the lakehouse storage. For more details, see [Enable
Lakehouse
Storage](maintenance/tiered-storage/lakehouse-storage.md#enable-lakehouse-storage).
-## Introduction
+## Configure Paimon as LakeHouse Storage
+
+For general guidance on configuring Paimon as the lakehouse storage, refer to
the [Lakehouse Storage](maintenance/tiered-storage/lakehouse-storage.md)
documentation. When starting the tiering service, make sure to use
Paimon-specific configurations as parameters.
When a table is created or altered with the option `'table.datalake.enabled' =
'true'`, Fluss will automatically create a corresponding Paimon table with the
same table path.
The schema of the Paimon table matches that of the Fluss table, except for the
addition of three system columns at the end: `__bucket`, `__offset`, and
`__timestamp`.
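The behavior described above can be sketched in the document's own ALTER TABLE
style (the table name is hypothetical):

```sql
-- Hypothetical table: turn on lake tiering for an existing Fluss table.
ALTER TABLE my_table SET ('table.datalake.enabled' = 'true');
-- Fluss then creates a Paimon table with the same schema plus the
-- system columns __bucket, __offset, __timestamp appended at the end.
```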
diff --git a/website/docs/streaming-lakehouse/overview.md
b/website/docs/streaming-lakehouse/overview.md
index 1ea85fb44..c1d75f082 100644
--- a/website/docs/streaming-lakehouse/overview.md
+++ b/website/docs/streaming-lakehouse/overview.md
@@ -44,4 +44,4 @@ Some powerful features it provided are:
- **Analytical Streams**: The union reads help data streams to have the
powerful analytics capabilities. This reduces complexity when developing
streaming applications, simplifies debugging, and allows for immediate access
to live data insights.
- **Connect to Lakehouse Ecosystem**: Fluss keeps the table metadata in sync
with data lake catalogs while compacting data into Lakehouse. This allows
external engines like Spark, StarRocks, Flink, Trino to read the data directly
by connecting to the data lake catalog.
-Currently, Fluss supports [Paimon](integrate-data-lakes/paimon.md) and
[Lance](integrate-data-lakes/lance.md) as Lakehouse Storage, more kinds of data
lake formats are on the roadmap.
\ No newline at end of file
+Currently, Fluss supports [Paimon](integrate-data-lakes/paimon.md),
[Iceberg](integrate-data-lakes/iceberg.md), and
[Lance](integrate-data-lakes/lance.md) as Lakehouse Storage; more data lake
formats are on the roadmap.