This is an automated email from the ASF dual-hosted git repository.
yuxia pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/fluss.git
The following commit(s) were added to refs/heads/main by this push:
new 49eeec15a [lake/lance] add documentation for Lance connector (#1587)
49eeec15a is described below
commit 49eeec15adcf7252bf099c62034060329bacf4f2
Author: xx789 <[email protected]>
AuthorDate: Mon Sep 8 20:04:48 2025 +0800
[lake/lance] add documentation for Lance connector (#1587)
---------
Co-authored-by: maxcwang <[email protected]>
Co-authored-by: luoyuxia <[email protected]>
Co-authored-by: Leonard Xu <[email protected]>
---
.../tiered-storage/lakehouse-storage.md | 7 +-
.../integrate-data-lakes/lance.md | 124 +++++++++++++++++++++
website/docs/streaming-lakehouse/overview.md | 2 +-
3 files changed, 130 insertions(+), 3 deletions(-)
diff --git a/website/docs/maintenance/tiered-storage/lakehouse-storage.md
b/website/docs/maintenance/tiered-storage/lakehouse-storage.md
index 00654a50d..d4a24845e 100644
--- a/website/docs/maintenance/tiered-storage/lakehouse-storage.md
+++ b/website/docs/maintenance/tiered-storage/lakehouse-storage.md
@@ -18,7 +18,10 @@ can gain much storage cost reduction and analytics
performance improvement.
## Enable Lakehouse Storage
-Lakehouse Storage is disabled by default, you must enable it manually.
+Lakehouse Storage is disabled by default; you must enable it manually.
+
+The following example uses Paimon for demonstration; other data lake formats follow similar steps but require different configuration settings and JAR files.
+You can refer to the documentation of the corresponding data lake format integration for the required configurations and JAR files.
### Lakehouse Storage Cluster Configurations
#### Modify `server.yaml`
@@ -55,7 +58,7 @@ Then, you must start the datalake tiering service to tier
Fluss's data to the la
- Put [fluss-flink connector jar](/downloads) into `${FLINK_HOME}/lib`, you
should choose a connector version matching your Flink version. If you're using
Flink 1.20, please use
[fluss-flink-1.20-$FLUSS_VERSION$.jar](https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-1.20/$FLUSS_VERSION$/fluss-flink-1.20-$FLUSS_VERSION$.jar)
- If you are using [Amazon S3](http://aws.amazon.com/s3/), [Aliyun
OSS](https://www.aliyun.com/product/oss) or [HDFS(Hadoop Distributed File
System)](https://hadoop.apache.org/docs/stable/) as Fluss's [remote
storage](maintenance/tiered-storage/remote-storage.md),
you should download the corresponding [Fluss filesystem
jar](/downloads#filesystem-jars) and also put it into `${FLINK_HOME}/lib`
-- Put [fluss-lake-paimon
jar](https://repo1.maven.org/maven2/org/apache/fluss/fluss-lake-paimon/$FLUSS_VERSION$/fluss-lake-paimon-$FLUSS_VERSION$.jar)
into `${FLINK_HOME}/lib`, currently only paimon is supported, so you can only
choose `fluss-lake-paimon`
+- Put [fluss-lake-paimon
jar](https://repo1.maven.org/maven2/org/apache/fluss/fluss-lake-paimon/$FLUSS_VERSION$/fluss-lake-paimon-$FLUSS_VERSION$.jar)
into `${FLINK_HOME}/lib`
- [Download](https://flink.apache.org/downloads/) pre-bundled Hadoop jar
`flink-shaded-hadoop-2-uber-*.jar` and put into `${FLINK_HOME}/lib`
- Put Paimon's [filesystem
jar](https://paimon.apache.org/docs/1.1/project/download/) into
`${FLINK_HOME}/lib`, if you use s3 to store paimon data, please put `paimon-s3`
jar into `${FLINK_HOME}/lib`
- The other jars that Paimon may require, for example, if you use HiveCatalog,
you will need to put hive related jars
diff --git a/website/docs/streaming-lakehouse/integrate-data-lakes/lance.md
b/website/docs/streaming-lakehouse/integrate-data-lakes/lance.md
new file mode 100644
index 000000000..2361e8a25
--- /dev/null
+++ b/website/docs/streaming-lakehouse/integrate-data-lakes/lance.md
@@ -0,0 +1,124 @@
+---
+title: Lance
+sidebar_position: 3
+---
+
+# Lance
+
+[Lance](https://lancedb.github.io/lance/) is a modern table format optimized
for machine learning and AI applications.
+To integrate Fluss with Lance, you must enable lakehouse storage and configure
Lance as the lakehouse storage. For more details, see [Enable Lakehouse
Storage](maintenance/tiered-storage/lakehouse-storage.md#enable-lakehouse-storage).
+
+## Configure Lance as LakeHouse Storage
+
+### Configure Lance in Cluster Configurations
+
+To configure Lance as the lakehouse storage, you must set the following options in `server.yaml`:
+```yaml
+# Lance configuration
+datalake.format: lance
+
+# Currently, only the local file system and object stores such as AWS S3 (and S3-compatible stores) are supported as storage backends for Lance.
+# To use S3 as the Lance storage backend, specify the following properties:
+datalake.lance.warehouse: s3://<bucket>
+datalake.lance.endpoint: <endpoint>
+datalake.lance.allow_http: true
+datalake.lance.access_key_id: <access_key_id>
+datalake.lance.secret_access_key: <secret_access_key>
+
+# To use the local file system as the Lance storage backend, you only need to specify the following property:
+# datalake.lance.warehouse: /tmp/lance
+```
+
+When a table is created or altered with the option `'table.datalake.enabled' =
'true'`, Fluss will automatically create a corresponding Lance table with path
`<warehouse_path>/<database_name>/<table_name>.lance`.
+The schema of the Lance table matches that of the Fluss table.
+
+```sql title="Flink SQL"
+USE CATALOG fluss_catalog;
+
+CREATE TABLE fluss_order_with_lake (
+ `order_id` BIGINT,
+ `item_id` BIGINT,
+ `amount` INT,
+ `address` STRING
+) WITH (
+ 'table.datalake.enabled' = 'true',
+ 'table.datalake.freshness' = '30s'
+);
+```
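Given the path convention above, the Lance dataset location for a datalake-enabled table can be derived mechanically from the warehouse path, database name, and table name. A minimal sketch (the warehouse path `/tmp/lance` and database name `fluss` are illustrative assumptions, not values prescribed by Fluss):

```python
# Sketch: derive the Lance dataset path that Fluss uses for a tiered table.
# Convention: <warehouse_path>/<database_name>/<table_name>.lance
# The warehouse and database values below are illustrative assumptions.

def lance_dataset_path(warehouse: str, database: str, table: str) -> str:
    """Build the Lance dataset path for a datalake-enabled Fluss table."""
    return f"{warehouse.rstrip('/')}/{database}/{table}.lance"

# e.g. for the table created above, assuming warehouse /tmp/lance and database fluss:
path = lance_dataset_path("/tmp/lance", "fluss", "fluss_order_with_lake")
print(path)  # /tmp/lance/fluss/fluss_order_with_lake.lance
```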
+
+### Start Tiering Service to Lance
Then, you must start the datalake tiering service to tier Fluss's data to Lance. For guidance, refer to [Start The Datalake Tiering Service
+](maintenance/tiered-storage/lakehouse-storage.md#start-the-datalake-tiering-service). Although that example uses Paimon, the process also applies to Lance.
+
+However, in the [Prepare required jars](maintenance/tiered-storage/lakehouse-storage.md#prepare-required-jars) step, you should follow this guidance instead:
+- Put the [fluss-flink connector jar](/downloads) into `${FLINK_HOME}/lib`; choose a connector version matching your Flink version. If you're using Flink 1.20, use [fluss-flink-1.20-$FLUSS_VERSION$.jar](https://repo1.maven.org/maven2/org/apache/fluss/fluss-flink-1.20/$FLUSS_VERSION$/fluss-flink-1.20-$FLUSS_VERSION$.jar)
+- If you are using [Amazon S3](http://aws.amazon.com/s3/), [Aliyun OSS](https://www.aliyun.com/product/oss) or [HDFS (Hadoop Distributed File System)](https://hadoop.apache.org/docs/stable/) as Fluss's [remote storage](maintenance/tiered-storage/remote-storage.md),
+  you should download the corresponding [Fluss filesystem jar](/downloads#filesystem-jars) and also put it into `${FLINK_HOME}/lib`
+- Put [fluss-lake-lance
jar](https://repo1.maven.org/maven2/org/apache/fluss/fluss-lake-lance/$FLUSS_VERSION$/fluss-lake-lance-$FLUSS_VERSION$.jar)
into `${FLINK_HOME}/lib`
+
+Additionally, when following the [Start Datalake Tiering
Service](maintenance/tiered-storage/lakehouse-storage.md#start-datalake-tiering-service)
guide, make sure to use Lance-specific configurations as parameters when
starting the Flink tiering job:
+```shell
+<FLINK_HOME>/bin/flink run /path/to/fluss-flink-tiering-$FLUSS_VERSION$.jar \
+ --fluss.bootstrap.servers localhost:9123 \
+ --datalake.format lance \
+ --datalake.lance.warehouse s3://<bucket> \
+ --datalake.lance.endpoint <endpoint> \
+ --datalake.lance.allow_http true \
+ --datalake.lance.secret_access_key <secret_access_key> \
+ --datalake.lance.access_key_id <access_key_id>
+```
+
+> **NOTE**: Fluss v0.8 only supports tiering log tables to Lance.
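The flags passed to the tiering job mirror the `datalake.*` keys configured in `server.yaml`, passed as `--key value` pairs. The sketch below illustrates that key-to-flag mapping; the helper function and config values are illustrative only, not part of Fluss:

```python
# Sketch: flatten datalake.* config entries into CLI arguments for the
# Flink tiering job. Illustrative only; not a Fluss API.

def tiering_args(config: dict) -> list:
    """Flatten a config dict into --key value CLI argument pairs."""
    args = []
    for key, value in config.items():
        args.extend([f"--{key}", str(value)])
    return args

config = {
    "fluss.bootstrap.servers": "localhost:9123",
    "datalake.format": "lance",
    "datalake.lance.warehouse": "/tmp/lance",  # placeholder warehouse path
}
print(" ".join(tiering_args(config)))
```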
+
+Then, the datalake tiering service continuously tiers data from Fluss to Lance. The parameter `table.datalake.freshness` controls how frequently Fluss writes data to Lance tables. By default, the data freshness is 3 minutes.
+
+You can also specify Lance table properties when creating a datalake-enabled
Fluss table by using the `lance.` prefix within the Fluss table properties
clause.
+
+```sql title="Flink SQL"
+CREATE TABLE fluss_order_with_lake (
+ `order_id` BIGINT,
+ `item_id` BIGINT,
+ `amount` INT,
+ `address` STRING
+ ) WITH (
+ 'table.datalake.enabled' = 'true',
+ 'table.datalake.freshness' = '30s',
+ 'lance.max_row_per_file' = '512'
+);
+```
+
+For example, you can specify the property `max_row_per_file` to control the
writing behavior when Fluss tiers data to Lance.
+
+## Reading with Lance ecosystem tools
+
+Since the data tiered to Lance from Fluss is stored as a standard Lance table,
you can use any tool that supports Lance to read it. Below is an example using
[pylance](https://pypi.org/project/pylance/):
+
+```python title="Lance Python"
+import lance
+
+# Open the tiered table as a Lance dataset, then read it into an Arrow table
+ds = lance.dataset("<warehouse_path>/<database_name>/<table_name>.lance")
+table = ds.to_table()
+```
+
+## Data Type Mapping
+
+Lance internally stores data in Arrow format.
+When integrating with Lance, Fluss automatically converts between Fluss data
types and Lance data types.
+The following table shows the mapping between [Fluss data
types](table-design/data-types.md) and Lance data types:
+
+| Fluss Data Type | Lance Data Type |
+|-------------------------------|-----------------|
+| BOOLEAN | Bool |
+| TINYINT | Int8 |
+| SMALLINT | Int16 |
+| INT | Int32 |
+| BIGINT | Int64 |
+| FLOAT | Float32 |
+| DOUBLE | Float64 |
+| DECIMAL | Decimal128 |
+| STRING | Utf8 |
+| CHAR | Utf8 |
+| DATE | Date |
+| TIME | Time |
+| TIMESTAMP | Timestamp |
+| TIMESTAMP WITH LOCAL TIMEZONE | Timestamp |
+| BINARY | FixedSizeBinary |
+| BYTES | Binary |
\ No newline at end of file
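For reference, the mapping above can be encoded as a simple lookup table; the dict below mirrors the documentation table and is illustrative only, not a Fluss API:

```python
# Sketch: the Fluss -> Lance type mapping from the table above,
# encoded as a plain lookup dict (for reference only).
FLUSS_TO_LANCE = {
    "BOOLEAN": "Bool",
    "TINYINT": "Int8",
    "SMALLINT": "Int16",
    "INT": "Int32",
    "BIGINT": "Int64",
    "FLOAT": "Float32",
    "DOUBLE": "Float64",
    "DECIMAL": "Decimal128",
    "STRING": "Utf8",
    "CHAR": "Utf8",
    "DATE": "Date",
    "TIME": "Time",
    "TIMESTAMP": "Timestamp",
    "TIMESTAMP WITH LOCAL TIMEZONE": "Timestamp",
    "BINARY": "FixedSizeBinary",
    "BYTES": "Binary",
}

# Note: STRING and CHAR both map to Utf8, and both TIMESTAMP variants map
# to Timestamp, so the mapping is not reversible.
print(FLUSS_TO_LANCE["BIGINT"])  # Int64
```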
diff --git a/website/docs/streaming-lakehouse/overview.md
b/website/docs/streaming-lakehouse/overview.md
index f44eb2fcd..1ea85fb44 100644
--- a/website/docs/streaming-lakehouse/overview.md
+++ b/website/docs/streaming-lakehouse/overview.md
@@ -44,4 +44,4 @@ Some powerful features it provided are:
- **Analytical Streams**: The union reads help data streams to have the
powerful analytics capabilities. This reduces complexity when developing
streaming applications, simplifies debugging, and allows for immediate access
to live data insights.
- **Connect to Lakehouse Ecosystem**: Fluss keeps the table metadata in sync
with data lake catalogs while compacting data into Lakehouse. This allows
external engines like Spark, StarRocks, Flink, Trino to read the data directly
by connecting to the data lake catalog.
-Currently, Fluss supports [Paimon as Lakehouse
Storage](integrate-data-lakes/paimon.md), more kinds of data lake formats are
on the roadmap.
\ No newline at end of file
+Currently, Fluss supports [Paimon](integrate-data-lakes/paimon.md) and [Lance](integrate-data-lakes/lance.md) as Lakehouse Storage; more data lake formats are on the roadmap.
\ No newline at end of file