This is an automated email from the ASF dual-hosted git repository.
yuxia pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/fluss.git
The following commit(s) were added to refs/heads/main by this push:
new 02a8f55df [docs] Update datalake related doc for support iceberg and
lance (#1686)
02a8f55df is described below
commit 02a8f55dfd730d9a8385314bf0d9f5eae10e3ed3
Author: CaoZhen <[email protected]>
AuthorDate: Mon Sep 15 11:10:07 2025 +0800
[docs] Update datalake related doc for support iceberg and lance (#1686)
---
.../org/apache/fluss/config/ConfigOptions.java | 7 ++--
website/docs/engine-flink/options.md | 46 +++++++++++-----------
website/docs/install-deploy/overview.md | 5 ++-
website/docs/maintenance/configuration.md | 6 +--
.../docs/maintenance/tiered-storage/overview.md | 2 +-
.../integrate-data-lakes/paimon.md | 6 ++-
website/docs/streaming-lakehouse/overview.md | 2 +-
7 files changed, 40 insertions(+), 34 deletions(-)
diff --git
a/fluss-common/src/main/java/org/apache/fluss/config/ConfigOptions.java
b/fluss-common/src/main/java/org/apache/fluss/config/ConfigOptions.java
index e6261c692..343434d38 100644
--- a/fluss-common/src/main/java/org/apache/fluss/config/ConfigOptions.java
+++ b/fluss-common/src/main/java/org/apache/fluss/config/ConfigOptions.java
@@ -1299,7 +1299,8 @@ public class ConfigOptions {
.enumType(DataLakeFormat.class)
.noDefaultValue()
.withDescription(
- "The data lake format of the table specifies the
tiered Lakehouse storage format, such as Paimon, Iceberg, DeltaLake, or Hudi.
Currently, only `paimon` is supported. "
+ "The data lake format of the table specifies the
tiered Lakehouse storage format. Currently, supported formats are `paimon`,
`iceberg`, and `lance`. "
+ + "In the future, more kinds of data lake
format will be supported, such as DeltaLake or Hudi. "
+ "Once the `table.datalake.format`
property is configured, Fluss adopts the key encoding and bucketing strategy
used by the corresponding data lake format. "
+ "This ensures consistency in key
encoding and bucketing, enabling seamless **Union Read** functionality across
Fluss and Lakehouse. "
+ "The `table.datalake.format` can be
pre-defined before enabling `table.datalake.enabled`. This allows the data lake
feature to be dynamically enabled on the table without requiring table
recreation. "
@@ -1646,8 +1647,8 @@ public class ConfigOptions {
.enumType(DataLakeFormat.class)
.noDefaultValue()
.withDescription(
- "The datalake format used by Fluss to be as lake
storage, such as Paimon, Iceberg, Hudi. "
- + "Now, only support Paimon.");
+ "The datalake format used by of Fluss to be as
lakehouse storage. Currently, supported formats are Paimon, Iceberg, and Lance.
"
+ + "In the future, more kinds of data lake
format will be supported, such as DeltaLake or Hudi.");
// ------------------------------------------------------------------------
// ConfigOptions for fluss kafka
diff --git a/website/docs/engine-flink/options.md
b/website/docs/engine-flink/options.md
index 159352c6c..8eacd0cfd 100644
--- a/website/docs/engine-flink/options.md
+++ b/website/docs/engine-flink/options.md
@@ -60,29 +60,29 @@ ALTER TABLE log_table SET ('table.log.ttl' = '7d');
## Storage Options
-| Option | Type | Default
| Description
[...]
-|-----------------------------------------|----------|-------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
-| bucket.num | int | The bucket number of
Fluss cluster. | The number of buckets of a Fluss table.
[...]
-| bucket.key | String | (None)
| Specific the distribution policy of the Fluss table. Data will be
distributed to each bucket according to the hash value of bucket-key (It must
be a subset of the primary keys excluding partition keys of the primary key
table). If you specify multiple fields, delimiter is `,`. If the table has a
primary key and a bucket key is not specified, the bucket key will be used as
primary key(excluding th [...]
-| table.log.ttl | Duration | 7 days
| The time to live for log segments. The configuration controls the
maximum time we will retain a log before we will delete old segments to free up
space. If set to -1, the log will not be deleted.
[...]
-| table.auto-partition.enabled | Boolean | false
| Whether enable auto partition for the table. Disable by default.
When auto partition is enabled, the partitions of the table will be created
automatically.
[...]
-| table.auto-partition.key | String | (None)
| This configuration defines the time-based partition key to be
used for auto-partitioning when a table is partitioned with multiple keys.
Auto-partitioning utilizes a time-based partition key to handle partitions
automatically, including creating new ones and removing outdated ones, by
comparing the time value of the partition with the current system time. In the
case of a table using multiple par [...]
-| table.auto-partition.time-unit | ENUM | DAY
| The time granularity for auto created partitions. The default
value is `DAY`. Valid values are `HOUR`, `DAY`, `MONTH`, `QUARTER`, `YEAR`. If
the value is `HOUR`, the partition format for auto created is yyyyMMddHH. If
the value is `DAY`, the partition format for auto created is yyyyMMdd. If the
value is `MONTH`, the partition format for auto created is yyyyMM. If the value
is `QUARTER`, the parti [...]
-| table.auto-partition.num-precreate | Integer | 2
| The number of partitions to pre-create for auto created
partitions in each check for auto partition. For example, if the current check
time is 2024-11-11 and the value is configured as 3, then partitions 20241111,
20241112, 20241113 will be pre-created. If any one partition exists, it'll skip
creating the partition. The default value is 2, which means 2 partitions will
be pre-created. If the `tab [...]
-| table.auto-partition.num-retention | Integer | 7
| The number of history partitions to retain for auto created
partitions in each check for auto partition. For example, if the current check
time is 2024-11-11, time-unit is DAY, and the value is configured as 3, then
the history partitions 20241108, 20241109, 20241110 will be retained. The
partitions earlier than 20241108 will be deleted. The default value is 7, which
means that 7 partitions will [...]
-| table.auto-partition.time-zone | String | the system time zone
| The time zone for auto partitions, which is by default the same
as the system time zone.
[...]
-| table.replication.factor | Integer | (None)
| The replication factor for the log of the new table. When it's
not set, Fluss will use the cluster's default replication factor configured by
default.replication.factor. It should be a positive number and not larger than
the number of tablet servers in the Fluss cluster. A value larger than the
number of tablet servers in Fluss cluster will result in an error when the new
table is created. [...]
-| table.log.format | Enum | ARROW
| The format of the log records in log store. The default value is
`ARROW`. The supported formats are `ARROW` and `INDEXED`.
[...]
-| table.log.arrow.compression.type | Enum | ZSTD
| The compression type of the log records if the log format is set
to `ARROW`. The candidate compression type is `NONE`, `LZ4_FRAME`, `ZSTD`. The
default value is `ZSTD`.
[...]
-| table.log.arrow.compression.zstd.level | Integer | 3
| The compression level of the log records if the log format is set
to `ARROW` and the compression type is set to `ZSTD`. The valid range is 1 to
22. The default value is 3.
[...]
-| table.kv.format | Enum | COMPACTED
| The format of the kv records in kv store. The default value is
`COMPACTED`. The supported formats are `COMPACTED` and `INDEXED`.
[...]
-| table.log.tiered.local-segments | Integer | 2
| The number of log segments to retain in local for each table when
log tiered storage is enabled. It must be greater that 0. The default is 2.
[...]
-| table.datalake.enabled | Boolean | false
| Whether enable lakehouse storage for the table. Disabled by
default. When this option is set to ture and the datalake tiering service is
up, the table will be tiered and compacted into datalake format stored on
lakehouse storage.
[...]
-| table.datalake.format | Enum | (None)
| The data lake format of the table specifies the tiered Lakehouse
storage format, such as Paimon, Iceberg, DeltaLake, or Hudi. Currently, only
`paimon` is supported. Once the `table.datalake.format` property is configured,
Fluss adopts the key encoding and bucketing strategy used by the corresponding
data lake format. This ensures consistency in key encoding and bucketing,
enabling seamless **Unio [...]
-| table.datalake.freshness | Duration | 3min
| It defines the maximum amount of time that the datalake table's
content should lag behind updates to the Fluss table. Based on this target
freshness, the Fluss service automatically moves data from the Fluss table and
updates to the datalake table, so that the data in the datalake table is kept
up to date within this target. If the data does not need to be as fresh, you
can specify a longer targe [...]
-| table.datalake.auto-compaction | Boolean | false
| If true, compaction will be triggered automatically when tiering
service writes to the datalake. It is disabled by default.
[...]
-| table.merge-engine | Enum | (None)
| Defines the merge engine for the primary key table. By default,
primary key table uses the [default merge
engine(last_row)](table-design/table-types/pk-table/merge-engines/default.md).
It also supports two merge engines are `first_row` and `versioned`. The
[first_row merge
engine](table-design/table-types/pk-table/merge-engines/first-row.md) will keep
the first row of the same primary key. The [v [...]
-| table.merge-engine.versioned.ver-column | String | (None)
| The column name of the version column for the `versioned` merge
engine. If the merge engine is set to `versioned`, the version column must be
set.
[...]
+| Option | Type | Default
| Description
[...]
+|-----------------------------------------|----------|-------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
[...]
+| bucket.num | int | The default bucket number
of the Fluss cluster. | The number of buckets of a Fluss table.
[...]
+| bucket.key | String | (None)
| Specifies the distribution policy of the Fluss table. Data is
distributed to each bucket according to the hash value of the bucket key (it
must be a subset of the primary keys excluding partition keys of the primary
key table). If you specify multiple fields, the delimiter is `,`. If the table
has a primary key and no bucket key is specified, the bucket key defaults to
the primary key(excluding th [...]
+| table.log.ttl | Duration | 7 days
| The time to live for log segments. The configuration controls the
maximum time we will retain a log before we will delete old segments to free up
space. If set to -1, the log will not be deleted.
[...]
+| table.auto-partition.enabled | Boolean | false
| Whether to enable auto partitioning for the table. Disabled by
default. When auto partitioning is enabled, the partitions of the table will be
created automatically.
[...]
+| table.auto-partition.key | String | (None)
| This configuration defines the time-based partition key to be
used for auto-partitioning when a table is partitioned with multiple keys.
Auto-partitioning utilizes a time-based partition key to handle partitions
automatically, including creating new ones and removing outdated ones, by
comparing the time value of the partition with the current system time. In the
case of a table using multiple par [...]
+| table.auto-partition.time-unit | ENUM | DAY
| The time granularity for auto created partitions. The default
value is `DAY`. Valid values are `HOUR`, `DAY`, `MONTH`, `QUARTER`, `YEAR`. If
the value is `HOUR`, the partition format for auto created is yyyyMMddHH. If
the value is `DAY`, the partition format for auto created is yyyyMMdd. If the
value is `MONTH`, the partition format for auto created is yyyyMM. If the value
is `QUARTER`, the parti [...]
+| table.auto-partition.num-precreate | Integer | 2
| The number of partitions to pre-create for auto created
partitions in each check for auto partition. For example, if the current check
time is 2024-11-11 and the value is configured as 3, then partitions 20241111,
20241112, 20241113 will be pre-created. If any one partition exists, it'll skip
creating the partition. The default value is 2, which means 2 partitions will
be pre-created. If the `tab [...]
+| table.auto-partition.num-retention | Integer | 7
| The number of history partitions to retain for auto created
partitions in each check for auto partition. For example, if the current check
time is 2024-11-11, time-unit is DAY, and the value is configured as 3, then
the history partitions 20241108, 20241109, 20241110 will be retained. The
partitions earlier than 20241108 will be deleted. The default value is 7, which
means that 7 partitions will [...]
+| table.auto-partition.time-zone | String | the system time zone
| The time zone for auto partitions, which is by default the same
as the system time zone.
[...]
+| table.replication.factor | Integer | (None)
| The replication factor for the log of the new table. When it's
not set, Fluss will use the cluster's default replication factor configured by
default.replication.factor. It should be a positive number and not larger than
the number of tablet servers in the Fluss cluster. A value larger than the
number of tablet servers in Fluss cluster will result in an error when the new
table is created. [...]
+| table.log.format | Enum | ARROW
| The format of the log records in log store. The default value is
`ARROW`. The supported formats are `ARROW` and `INDEXED`.
[...]
+| table.log.arrow.compression.type | Enum | ZSTD
| The compression type of the log records if the log format is set
to `ARROW`. The candidate compression type is `NONE`, `LZ4_FRAME`, `ZSTD`. The
default value is `ZSTD`.
[...]
+| table.log.arrow.compression.zstd.level | Integer | 3
| The compression level of the log records if the log format is set
to `ARROW` and the compression type is set to `ZSTD`. The valid range is 1 to
22. The default value is 3.
[...]
+| table.kv.format | Enum | COMPACTED
| The format of the kv records in kv store. The default value is
`COMPACTED`. The supported formats are `COMPACTED` and `INDEXED`.
[...]
+| table.log.tiered.local-segments | Integer | 2
| The number of log segments to retain locally for each table when
tiered log storage is enabled. It must be greater than 0. The default is 2.
[...]
+| table.datalake.enabled | Boolean | false
| Whether to enable lakehouse storage for the table. Disabled by
default. When this option is set to true and the datalake tiering service is
up, the table will be tiered and compacted into the datalake format stored on
lakehouse storage.
[...]
+| table.datalake.format | Enum | (None)
| The data lake format of the table specifies the tiered lakehouse
storage format. Currently, the supported formats are `paimon`, `iceberg`, and
`lance`. In the future, more data lake formats will be supported, such as
Delta Lake or Hudi. Once the `table.datalake.format` property is configured,
Fluss adopts the key encoding and bucketing strategy used by the corresponding
data lake format. This [...]
+| table.datalake.freshness | Duration | 3min
| It defines the maximum amount of time that the datalake table's
content should lag behind updates to the Fluss table. Based on this target
freshness, the Fluss service automatically moves data from the Fluss table and
updates to the datalake table, so that the data in the datalake table is kept
up to date within this target. If the data does not need to be as fresh, you
can specify a longer targe [...]
+| table.datalake.auto-compaction | Boolean | false
| If true, compaction will be triggered automatically when tiering
service writes to the datalake. It is disabled by default.
[...]
+| table.merge-engine | Enum | (None)
| Defines the merge engine for the primary key table. By default,
a primary key table uses the [default merge
engine (last_row)](table-design/table-types/pk-table/merge-engines/default.md).
It also supports two other merge engines: `first_row` and `versioned`. The
[first_row merge
engine](table-design/table-types/pk-table/merge-engines/first-row.md) will keep
the first row of the same primary key. The [v [...]
+| table.merge-engine.versioned.ver-column | String | (None)
| The column name of the version column for the `versioned` merge
engine. If the merge engine is set to `versioned`, the version column must be
set.
[...]
## Read Options
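Taken together, the datalake table options replaced in the hunk above are set
per table at creation time. A minimal sketch in Flink SQL of how they fit
together (the table name and schema are illustrative, not part of this change):

```sql
-- Illustrative only: a primary key table with lake tiering enabled.
-- Per the option descriptions, 'iceberg' could equally be 'paimon' or 'lance'.
CREATE TABLE orders (
    order_id BIGINT,
    amount DECIMAL(10, 2),
    PRIMARY KEY (order_id) NOT ENFORCED
) WITH (
    'table.datalake.enabled' = 'true',
    'table.datalake.format' = 'iceberg',
    'table.datalake.freshness' = '3min'
);
```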
diff --git a/website/docs/install-deploy/overview.md
b/website/docs/install-deploy/overview.md
index 4878f16ee..9704b8326 100644
--- a/website/docs/install-deploy/overview.md
+++ b/website/docs/install-deploy/overview.md
@@ -116,8 +116,9 @@ We have listed them in the table below the figure.
by query engines such as Flink, Spark, StarRocks, Trino.
</td>
<td>
- <li>[Paimon](maintenance/tiered-storage/lakehouse-storage.md)</li>
- <li>[Iceberg (Roadmap)](/roadmap/)</li>
+
<li>[Paimon](streaming-lakehouse/integrate-data-lakes/paimon.md)</li>
+
<li>[Iceberg](streaming-lakehouse/integrate-data-lakes/iceberg.md)</li>
+
<li>[Lance](streaming-lakehouse/integrate-data-lakes/lance.md)</li>
</td>
</tr>
<tr>
diff --git a/website/docs/maintenance/configuration.md
b/website/docs/maintenance/configuration.md
index 82c34e013..c5dc0ed7e 100644
--- a/website/docs/maintenance/configuration.md
+++ b/website/docs/maintenance/configuration.md
@@ -164,9 +164,9 @@ during the Fluss cluster working.
## Lakehouse
-| Option | Type | Default | Description
|
-|-----------------|------|---------|---------------------------------------------------------------------------------------------------------------------------|
-| datalake.format | Enum | (None) | The datalake format used by of Fluss to
be as lakehouse storage, such as Paimon, Iceberg, Hudi. Now, only support
Paimon. |
+| Option | Type | Default | Description
|
+|-----------------|------|---------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| datalake.format | Enum | (None) | The datalake format used by Fluss as
lakehouse storage. Currently, the supported formats are Paimon, Iceberg, and
Lance. In the future, more data lake formats will be supported, such as
Delta Lake or Hudi. |
## Kafka
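For context on the cluster-level `datalake.format` option documented above, a
minimal server configuration sketch (the YAML layout is an assumption for
illustration; only the option name and values come from this change):

```yaml
# Illustrative fragment: choose the lake format used as lakehouse storage.
# Supported values per this change: paimon, iceberg, lance.
datalake.format: paimon
```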
diff --git a/website/docs/maintenance/tiered-storage/overview.md
b/website/docs/maintenance/tiered-storage/overview.md
index ad438d427..898af077d 100644
--- a/website/docs/maintenance/tiered-storage/overview.md
+++ b/website/docs/maintenance/tiered-storage/overview.md
@@ -14,7 +14,7 @@ Fluss organizes data into different storage layers based on
its access patterns,
Fluss ensures the recent data is stored in local for higher write/read
performance and the historical data is stored in [remote
storage](remote-storage.md) for lower cost.
What's more, since the native format of Fluss's data is optimized for
real-time write/read which is inevitable unfriendly to batch analytics, Fluss
also introduces a [lakehouse storage](lakehouse-storage.md) which stores the
data
-in the well-known open data lake format for better analytics performance.
Currently, only Paimon is supported, but more kinds of data lake support are on
the way. Keep eyes on us!
+in the well-known open data lake format for better analytics performance.
Currently, the supported formats are Paimon, Iceberg, and Lance. More data lake
formats are on the way. Keep an eye on us!
The overall tiered storage architecture is shown in the following diagram:
diff --git a/website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
b/website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
index c42b04d46..e1dbb3c59 100644
--- a/website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
+++ b/website/docs/streaming-lakehouse/integrate-data-lakes/paimon.md
@@ -5,10 +5,14 @@ sidebar_position: 1
# Paimon
+## Introduction
+
[Apache Paimon](https://paimon.apache.org/) innovatively combines a lake
format with an LSM (Log-Structured Merge-tree) structure, bringing efficient
updates into the lake architecture.
To integrate Fluss with Paimon, you must enable lakehouse storage and
configure Paimon as the lakehouse storage. For more details, see [Enable
Lakehouse
Storage](maintenance/tiered-storage/lakehouse-storage.md#enable-lakehouse-storage).
-## Introduction
+## Configure Paimon as LakeHouse Storage
+
+For general guidance on configuring Paimon as the lakehouse storage, refer to
the [Lakehouse Storage](maintenance/tiered-storage/lakehouse-storage.md)
documentation. When starting the tiering service, make sure to use
Paimon-specific configurations as parameters.
When a table is created or altered with the option `'table.datalake.enabled' =
'true'`, Fluss will automatically create a corresponding Paimon table with the
same table path.
The schema of the Paimon table matches that of the Fluss table, except for the
addition of three system columns at the end: `__bucket`, `__offset`, and
`__timestamp`.
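The behavior described above can be sketched in the document's own ALTER TABLE
style (the table name is hypothetical):

```sql
-- Hypothetical table: turn on lake tiering for an existing Fluss table.
ALTER TABLE my_table SET ('table.datalake.enabled' = 'true');
-- Fluss then creates a Paimon table with the same schema plus the
-- system columns __bucket, __offset, __timestamp appended at the end.
```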
diff --git a/website/docs/streaming-lakehouse/overview.md
b/website/docs/streaming-lakehouse/overview.md
index 1ea85fb44..c1d75f082 100644
--- a/website/docs/streaming-lakehouse/overview.md
+++ b/website/docs/streaming-lakehouse/overview.md
@@ -44,4 +44,4 @@ Some powerful features it provided are:
- **Analytical Streams**: The union reads help data streams to have the
powerful analytics capabilities. This reduces complexity when developing
streaming applications, simplifies debugging, and allows for immediate access
to live data insights.
- **Connect to Lakehouse Ecosystem**: Fluss keeps the table metadata in sync
with data lake catalogs while compacting data into Lakehouse. This allows
external engines like Spark, StarRocks, Flink, Trino to read the data directly
by connecting to the data lake catalog.
-Currently, Fluss supports [Paimon](integrate-data-lakes/paimon.md) and
[Lance](integrate-data-lakes/lance.md) as Lakehouse Storage, more kinds of data
lake formats are on the roadmap.
\ No newline at end of file
+Currently, Fluss supports [Paimon](integrate-data-lakes/paimon.md),
[Iceberg](integrate-data-lakes/iceberg.md), and
[Lance](integrate-data-lakes/lance.md) as Lakehouse Storage; more data lake
formats are on the roadmap.