This is an automated email from the ASF dual-hosted git repository.
morningman pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new 1134149027e [feat] add juicefs and insert into tvf (#3479)
1134149027e is described below
commit 1134149027eb5ae6655b94eb6a9efd63bf1b182e
Author: Mingyu Chen (Rayner) <[email protected]>
AuthorDate: Thu Mar 19 23:31:39 2026 -0700
[feat] add juicefs and insert into tvf (#3479)
## Versions
- [x] dev
- [x] 4.x
- [x] 3.x
- [ ] 2.1
## Languages
- [x] Chinese
- [x] English
## Docs Checklist
- [ ] Checked by AI
- [ ] Test Cases Built
---
docs/lakehouse/file-analysis.md | 111 +++++++++++++++++++-
docs/lakehouse/storages/juicefs.md | 97 ++++++++++++++++++
.../current/lakehouse/file-analysis.md | 111 +++++++++++++++++++-
.../current/lakehouse/storages/juicefs.md | 97 ++++++++++++++++++
.../version-4.x/lakehouse/file-analysis.md | 111 +++++++++++++++++++-
.../version-4.x/lakehouse/storages/juicefs.md | 97 ++++++++++++++++++
sidebars.ts | 1 +
.../version-4.x/lakehouse/file-analysis.md | 113 ++++++++++++++++++++-
.../version-4.x/lakehouse/storages/juicefs.md | 97 ++++++++++++++++++
versioned_sidebars/version-4.x-sidebars.json | 3 +-
10 files changed, 827 insertions(+), 11 deletions(-)
diff --git a/docs/lakehouse/file-analysis.md b/docs/lakehouse/file-analysis.md
index e3204c35f34..df850cef676 100644
--- a/docs/lakehouse/file-analysis.md
+++ b/docs/lakehouse/file-analysis.md
@@ -2,11 +2,11 @@
{
"title": "Analyzing Files on S3/HDFS",
"language": "en",
-    "description": "Learn how to use Apache Doris Table Value Function (TVF) to directly query and analyze Parquet, ORC, CSV, and JSON files on storage systems like S3 and HDFS, with support for automatic schema inference, multi-file matching, and data import."
+    "description": "Learn how to use Apache Doris Table Value Function (TVF) to directly query and analyze Parquet, ORC, CSV, and JSON files on storage systems like S3 and HDFS, with support for automatic schema inference, multi-file matching, data import, and exporting query results to files."
}
---
-Through the Table Value Function (TVF) feature, Doris can directly query and analyze files on object storage or HDFS as tables without importing data in advance, and supports automatic column type inference.
+Through the Table Value Function (TVF) feature, Doris can directly query and analyze files on object storage or HDFS as tables without importing data in advance, and supports automatic column type inference. Starting from version 4.1.0, it also supports exporting query results to file systems via the `INSERT INTO` TVF syntax.
## Supported Storage Systems
@@ -113,6 +113,113 @@ FROM s3(
);
```
+### Scenario 4: Exporting Query Results to Files (Since Version 4.1.0) {#export-query-results}
+
+:::tip
+This feature is supported since Apache Doris 4.1.0 and is currently experimental.
+:::
+
+Using the `INSERT INTO` TVF syntax, you can export query results directly as files to the local file system, HDFS, or S3-compatible object storage, in CSV, Parquet, or ORC format.
+
+**Syntax:**
+
+```sql
+INSERT INTO tvf_name(
+ "file_path" = "<file_path_prefix>",
+ "format" = "<file_format>",
+ ... -- other connection properties and format options
+)
+[WITH LABEL label_name]
+SELECT ... ;
+```
+
+Where `tvf_name` can be:
+
+| TVF Name | Target Storage | Description |
+|----------|---------------|-------------|
+| `local` | Local file system | Export to the local disk of a BE node; requires specifying `backend_id` |
+| `hdfs` | HDFS | Export to the Hadoop Distributed File System |
+| `s3` | S3-compatible object storage | Export to AWS S3, Alibaba Cloud OSS, Tencent Cloud COS, etc. |
+
+**Common Properties:**
+
+| Property | Required | Description |
+|----------|----------|-------------|
+| `file_path` | Yes | Output file path prefix. The actual generated file name follows the format `{prefix}{query_id}_{idx}.{ext}` |
+| `format` | No | Output file format; supports `csv` (default), `parquet`, and `orc` |
+| `max_file_size` | No | Maximum size per file, in bytes. A new file is created automatically when the limit is exceeded |
+| `delete_existing_files` | No | Whether to delete existing files in the target directory before writing; default `false` |
+
+**Additional Properties for CSV Format:**
+
+| Property | Description |
+|----------|-------------|
+| `column_separator` | Column separator, default `,` |
+| `line_delimiter` | Line delimiter, default `\n` |
+| `compress_type` | Compression format; supports `gz`, `zstd`, `lz4`, and `snappy` |
+
+**Example 1: Export CSV to HDFS**
+
+```sql
+INSERT INTO hdfs(
+ "file_path" = "/tmp/export/csv_data_",
+ "format" = "csv",
+ "column_separator" = ",",
+ "hadoop.username" = "doris",
+ "fs.defaultFS" = "hdfs://namenode:8020",
+ "delete_existing_files" = "true"
+)
+SELECT * FROM my_table ORDER BY id;
+```
+
+**Example 2: Export Parquet to S3**
+
+```sql
+INSERT INTO s3(
+ "uri" = "s3://bucket/export/parquet_data_",
+ "s3.access_key" = "ak",
+ "s3.secret_key" = "sk",
+ "s3.endpoint" = "https://s3.us-east-1.amazonaws.com",
+ "s3.region" = "us-east-1",
+ "format" = "parquet"
+)
+SELECT * FROM my_table WHERE dt = '2024-01-01';
+```
+
+**Example 3: Export ORC to S3**
+
+```sql
+INSERT INTO s3(
+ "uri" = "s3://bucket/export/orc_data_",
+ "s3.access_key" = "ak",
+ "s3.secret_key" = "sk",
+ "s3.endpoint" = "https://s3.us-east-1.amazonaws.com",
+ "s3.region" = "us-east-1",
+ "format" = "orc",
+ "delete_existing_files" = "true"
+)
+SELECT c_int, c_varchar, c_string FROM my_table WHERE c_int IS NOT NULL ORDER BY c_int;
+```
+
+**Example 4: Export CSV to Local BE Node**
+
+```sql
+INSERT INTO local(
+ "file_path" = "/tmp/export/local_csv_",
+ "backend_id" = "10001",
+ "format" = "csv"
+)
+SELECT * FROM my_table ORDER BY id;
+```
+
+:::note
+- `file_path` is a file name prefix. The actual generated file name follows `{prefix}{query_id}_{idx}.{ext}`, where `idx` starts from 0 and increments.
+- When using the `local` TVF, specify the BE node to write to via `backend_id`.
+- When `delete_existing_files` is enabled, the system deletes all files in the directory containing `file_path` before writing. Use with caution.
+- Executing `INSERT INTO TVF` requires the ADMIN or LOAD privilege.
+- For the S3 TVF, the `uri` property takes the place of `file_path`.
+:::
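To make the naming rule concrete, here is a small Python sketch of how an export file name could be assembled from the pieces named above (`export_file_name` is a hypothetical helper for illustration; Doris generates these names server-side):

```python
# Illustration of the {prefix}{query_id}_{idx}.{ext} naming rule described
# in the note above. Hypothetical helper, not part of Doris itself.

def export_file_name(prefix: str, query_id: str, idx: int, fmt: str) -> str:
    # The extension matches the chosen "format" property (csv/parquet/orc).
    return f"{prefix}{query_id}_{idx}.{fmt}"

# A query whose results roll over into two files (e.g. after exceeding
# max_file_size) would produce names with idx 0 and 1:
names = [export_file_name("/tmp/export/csv_data_", "a1b2c3", i, "csv") for i in range(2)]
print(names)  # ['/tmp/export/csv_data_a1b2c3_0.csv', '/tmp/export/csv_data_a1b2c3_1.csv']
```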
+
## Core Features
### Multi-File Matching
diff --git a/docs/lakehouse/storages/juicefs.md b/docs/lakehouse/storages/juicefs.md
new file mode 100644
index 00000000000..b5708b4a3c9
--- /dev/null
+++ b/docs/lakehouse/storages/juicefs.md
@@ -0,0 +1,97 @@
+---
+{
+ "title": "JuiceFS | Storages",
+ "language": "en",
+  "description": "This document describes the parameters required for accessing JuiceFS and the contexts in which they apply.",
+ "sidebar_label": "JuiceFS"
+}
+---
+
+# JuiceFS
+
+:::tip Supported since
+Doris 4.0.2
+:::
+
+[JuiceFS](https://juicefs.com) is an open-source, high-performance distributed file system designed for the cloud. It is fully compatible with the HDFS API. Doris treats the `jfs://` scheme as HDFS-compatible, so you can access data stored in JuiceFS using the same approach as HDFS.
+
+This document describes the parameters required for accessing JuiceFS. These parameters apply to:
+
+* Catalog properties
+* Table Valued Function properties
+* Broker Load properties
+* Export properties
+* Outfile properties
+
+## Prerequisites
+
+JuiceFS access relies on the `juicefs-hadoop` client jar. Starting from Doris 4.0.2, the build system automatically downloads and packages this jar. The jar is placed under:
+
+- FE: `fe/lib/juicefs/`
+- BE: `be/lib/java_extensions/juicefs/`
+
+If you are deploying manually, download `juicefs-hadoop-<version>.jar` from [Maven Central](https://repo1.maven.org/maven2/io/juicefs/juicefs-hadoop/) and place it in the directories listed above.
+
+## Parameter Overview
+
+Since JuiceFS is HDFS-compatible, it shares the same authentication parameters as HDFS. In addition, the following JuiceFS-specific parameters are required:
+
+| Property Name | Description | Required |
+| --- | --- | --- |
+| fs.defaultFS | The default filesystem URI, for example `jfs://cluster`. | Yes |
+| fs.jfs.impl | The Hadoop FileSystem implementation class. Must be set to `io.juicefs.JuiceFileSystem`. | Yes |
+| juicefs.\<cluster\>.meta | The JuiceFS metadata engine endpoint, for example `redis://127.0.0.1:6379/1` or `mysql://user:pwd@(host:port)/db`. Replace `<cluster>` with the cluster name in your `fs.defaultFS` URI. | Yes |
+
+For HDFS authentication parameters (Simple or Kerberos), refer to the [HDFS](./hdfs.md) documentation.
+
+All properties prefixed with `juicefs.` are passed through to the underlying JuiceFS Hadoop client.
+
+## Example Configurations
+
+### Catalog with Hive Metastore
+
+```sql
+CREATE CATALOG jfs_hive PROPERTIES (
+ 'type' = 'hms',
+ 'hive.metastore.uris' = 'thrift://<hms_host>:9083',
+ 'fs.defaultFS' = 'jfs://cluster',
+ 'fs.jfs.impl' = 'io.juicefs.JuiceFileSystem',
+ 'juicefs.cluster.meta' = 'redis://127.0.0.1:6379/1',
+ 'hadoop.username' = 'doris'
+);
+```
+
+### Broker Load
+
+```sql
+LOAD LABEL example_db.label1
+(
+ DATA INFILE("jfs://cluster/path/to/data/*")
+ INTO TABLE `my_table`
+)
+WITH BROKER
+(
+ "fs.defaultFS" = "jfs://cluster",
+ "fs.jfs.impl" = "io.juicefs.JuiceFileSystem",
+ "juicefs.cluster.meta" = "redis://127.0.0.1:6379/1",
+ "hadoop.username" = "doris"
+);
+```
+
+### Table Valued Function
+
+```sql
+SELECT * FROM hdfs(
+ "uri" = "jfs://cluster/path/to/file.parquet",
+ "format" = "parquet",
+ "fs.jfs.impl" = "io.juicefs.JuiceFileSystem",
+ "juicefs.cluster.meta" = "redis://127.0.0.1:6379/1"
+);
+```
+
+## Best Practices
+
+* Ensure the `juicefs-hadoop` jar is deployed on **all** FE and BE nodes.
+* The cluster name in `juicefs.<cluster>.meta` must match the cluster name in the `jfs://` URI. For example, if `fs.defaultFS = jfs://mycluster`, the metadata property should be `juicefs.mycluster.meta`.
+* JuiceFS supports multiple metadata engines (Redis, MySQL, TiKV, SQLite, etc.). Choose one based on your scale and availability requirements.
+* HDFS configurations such as Kerberos authentication, HA nameservice settings, and Hadoop config files are all compatible with JuiceFS.
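The cluster-name coupling between `fs.defaultFS` and `juicefs.<cluster>.meta` can be sanity-checked mechanically. A minimal sketch (the `expected_meta_key` helper is hypothetical, not a Doris API):

```python
from urllib.parse import urlparse

def expected_meta_key(default_fs: str) -> str:
    # jfs://mycluster -> the metadata property must be juicefs.mycluster.meta
    parsed = urlparse(default_fs)
    if parsed.scheme != "jfs":
        raise ValueError("fs.defaultFS must use the jfs:// scheme")
    return f"juicefs.{parsed.netloc}.meta"

props = {
    "fs.defaultFS": "jfs://mycluster",
    "juicefs.mycluster.meta": "redis://127.0.0.1:6379/1",
}
# Verify the metadata property named by fs.defaultFS is actually present.
assert expected_meta_key(props["fs.defaultFS"]) in props
```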
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/file-analysis.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/file-analysis.md
index 7ca8d06013c..78b870e4132 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/file-analysis.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/file-analysis.md
@@ -2,11 +2,11 @@
{
    "title": "Analyzing Files on S3/HDFS",
    "language": "zh-CN",
-    "description": "Learn how to use the Apache Doris Table Value Function (TVF) to directly query and analyze Parquet, ORC, CSV, and JSON files on storage systems such as S3 and HDFS, with support for automatic schema inference, multi-file matching, and data import."
+    "description": "Learn how to use the Apache Doris Table Value Function (TVF) to directly query and analyze Parquet, ORC, CSV, and JSON files on storage systems such as S3 and HDFS, with support for automatic schema inference, multi-file matching, data import, and exporting query results to files."
}
---
-Through the Table Value Function (TVF) feature, Doris can directly query and analyze files on object storage or HDFS as tables without importing data in advance, and supports automatic column type inference.
+Through the Table Value Function (TVF) feature, Doris can directly query and analyze files on object storage or HDFS as tables without importing data in advance, and supports automatic column type inference. Starting from version 4.1.0, it also supports exporting query results to file systems via the `INSERT INTO` TVF syntax.
## Supported Storage Systems
@@ -113,6 +113,113 @@ FROM s3(
);
```
+### Scenario 4: Exporting Query Results to Files
+
+:::tip
+This feature is supported since Apache Doris 4.1.0 and is currently experimental.
+:::
+
+Using the `INSERT INTO` TVF syntax, you can export query results directly as files to the local file system, HDFS, or S3-compatible object storage, in CSV, Parquet, or ORC format.
+
+**Syntax:**
+
+```sql
+INSERT INTO tvf_name(
+    "file_path" = "<file_path_prefix>",
+    "format" = "<file_format>",
+    ... -- other connection properties and format options
+)
+[WITH LABEL label_name]
+SELECT ... ;
+```
+
+Where `tvf_name` can be:
+
+| TVF Name | Target Storage | Description |
+|----------|---------------|-------------|
+| `local` | Local file system | Export to the local disk of a BE node; requires specifying `backend_id` |
+| `hdfs` | HDFS | Export to the Hadoop Distributed File System |
+| `s3` | S3-compatible object storage | Export to AWS S3, Alibaba Cloud OSS, Tencent Cloud COS, etc. |
+
+**Common Properties:**
+
+| Property | Required | Description |
+|----------|----------|-------------|
+| `file_path` | Yes | Output file path prefix. The actual generated file name follows the format `{prefix}{query_id}_{idx}.{ext}` |
+| `format` | No | Output file format; supports `csv` (default), `parquet`, and `orc` |
+| `max_file_size` | No | Maximum size per file, in bytes. A new file is created automatically when the limit is exceeded |
+| `delete_existing_files` | No | Whether to delete existing files in the target directory before writing; default `false` |
+
+**Additional Properties for CSV Format:**
+
+| Property | Description |
+|----------|-------------|
+| `column_separator` | Column separator, default `,` |
+| `line_delimiter` | Line delimiter, default `\n` |
+| `compress_type` | Compression format; supports `gz`, `zstd`, `lz4`, and `snappy` |
+
+**Example 1: Export CSV to HDFS**
+
+```sql
+INSERT INTO hdfs(
+    "file_path" = "/tmp/export/csv_data_",
+    "format" = "csv",
+    "column_separator" = ",",
+    "hadoop.username" = "doris",
+    "fs.defaultFS" = "hdfs://namenode:8020",
+    "delete_existing_files" = "true"
+)
+SELECT * FROM my_table ORDER BY id;
+```
+
+**Example 2: Export Parquet to S3**
+
+```sql
+INSERT INTO s3(
+    "uri" = "s3://bucket/export/parquet_data_",
+    "s3.access_key" = "ak",
+    "s3.secret_key" = "sk",
+    "s3.endpoint" = "https://s3.us-east-1.amazonaws.com",
+    "s3.region" = "us-east-1",
+    "format" = "parquet"
+)
+SELECT * FROM my_table WHERE dt = '2024-01-01';
+```
+
+**Example 3: Export ORC to S3**
+
+```sql
+INSERT INTO s3(
+    "uri" = "s3://bucket/export/orc_data_",
+    "s3.access_key" = "ak",
+    "s3.secret_key" = "sk",
+    "s3.endpoint" = "https://s3.us-east-1.amazonaws.com",
+    "s3.region" = "us-east-1",
+    "format" = "orc",
+    "delete_existing_files" = "true"
+)
+SELECT c_int, c_varchar, c_string FROM my_table WHERE c_int IS NOT NULL ORDER BY c_int;
+```
+
+**Example 4: Export CSV to a Local BE Node**
+
+```sql
+INSERT INTO local(
+    "file_path" = "/tmp/export/local_csv_",
+    "backend_id" = "10001",
+    "format" = "csv"
+)
+SELECT * FROM my_table ORDER BY id;
+```
+
+:::note
+- `file_path` is a file name prefix. The actual generated file name follows `{prefix}{query_id}_{idx}.{ext}`, where `idx` starts from 0 and increments.
+- When using the `local` TVF, specify the BE node to write to via `backend_id`.
+- When `delete_existing_files` is enabled, the system deletes all files in the directory containing `file_path` before writing. Use with caution.
+- Executing `INSERT INTO TVF` requires the ADMIN or LOAD privilege.
+- For the S3 TVF, the `uri` property takes the place of `file_path`.
+:::
+
## Core Features
### Multi-File Matching
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/storages/juicefs.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/storages/juicefs.md
new file mode 100644
index 00000000000..92bcc477dcc
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/storages/juicefs.md
@@ -0,0 +1,97 @@
+---
+{
+  "title": "JuiceFS | Storages",
+  "language": "zh-CN",
+  "description": "This document describes the parameters required for accessing JuiceFS and the contexts in which they apply.",
+  "sidebar_label": "JuiceFS"
+}
+---
+
+# JuiceFS
+
+:::tip Supported since
+Doris 4.0.2
+:::
+
+[JuiceFS](https://juicefs.com) is an open-source, high-performance, cloud-native distributed file system that is fully compatible with the HDFS API. Doris treats the `jfs://` scheme as HDFS-compatible, so you can access data stored in JuiceFS the same way as HDFS.
+
+This document describes the parameters required for accessing JuiceFS. These parameters apply to:
+
+* Catalog properties
+* Table Valued Function properties
+* Broker Load properties
+* Export properties
+* Outfile properties
+
+## Prerequisites
+
+JuiceFS access relies on the `juicefs-hadoop` client jar. Starting from Doris 4.0.2, the build system automatically downloads and packages this jar. The jar is placed under:
+
+- FE: `fe/lib/juicefs/`
+- BE: `be/lib/java_extensions/juicefs/`
+
+If you are deploying manually, download `juicefs-hadoop-<version>.jar` from [Maven Central](https://repo1.maven.org/maven2/io/juicefs/juicefs-hadoop/) and place it in the directories listed above.
+
+## Parameter Overview
+
+Since JuiceFS is HDFS-compatible, it shares the same authentication parameters as HDFS. In addition, the following JuiceFS-specific parameters are required:
+
+| Property Name | Description | Required |
+| --- | --- | --- |
+| fs.defaultFS | The default filesystem URI, for example `jfs://cluster`. | Yes |
+| fs.jfs.impl | The Hadoop FileSystem implementation class. Must be set to `io.juicefs.JuiceFileSystem`. | Yes |
+| juicefs.\<cluster\>.meta | The JuiceFS metadata engine endpoint, for example `redis://127.0.0.1:6379/1` or `mysql://user:pwd@(host:port)/db`. Replace `<cluster>` with the cluster name in your `fs.defaultFS` URI. | Yes |
+
+For HDFS authentication parameters (Simple or Kerberos), refer to the [HDFS](./hdfs.md) documentation.
+
+All properties prefixed with `juicefs.` are passed through to the underlying JuiceFS Hadoop client.
+
+## Example Configurations
+
+### Catalog with Hive Metastore
+
+```sql
+CREATE CATALOG jfs_hive PROPERTIES (
+    'type' = 'hms',
+    'hive.metastore.uris' = 'thrift://<hms_host>:9083',
+    'fs.defaultFS' = 'jfs://cluster',
+    'fs.jfs.impl' = 'io.juicefs.JuiceFileSystem',
+    'juicefs.cluster.meta' = 'redis://127.0.0.1:6379/1',
+    'hadoop.username' = 'doris'
+);
+```
+
+### Broker Load
+
+```sql
+LOAD LABEL example_db.label1
+(
+    DATA INFILE("jfs://cluster/path/to/data/*")
+    INTO TABLE `my_table`
+)
+WITH BROKER
+(
+    "fs.defaultFS" = "jfs://cluster",
+    "fs.jfs.impl" = "io.juicefs.JuiceFileSystem",
+    "juicefs.cluster.meta" = "redis://127.0.0.1:6379/1",
+    "hadoop.username" = "doris"
+);
+```
+
+### Table Valued Function
+
+```sql
+SELECT * FROM hdfs(
+    "uri" = "jfs://cluster/path/to/file.parquet",
+    "format" = "parquet",
+    "fs.jfs.impl" = "io.juicefs.JuiceFileSystem",
+    "juicefs.cluster.meta" = "redis://127.0.0.1:6379/1"
+);
+```
+
+## Best Practices
+
+* Ensure the `juicefs-hadoop` jar is deployed on **all** FE and BE nodes.
+* The cluster name in `juicefs.<cluster>.meta` must match the cluster name in the `jfs://` URI. For example, if `fs.defaultFS = jfs://mycluster`, the metadata property should be `juicefs.mycluster.meta`.
+* JuiceFS supports multiple metadata engines (Redis, MySQL, TiKV, SQLite, etc.). Choose one based on your scale and availability requirements.
+* HDFS configurations such as Kerberos authentication, HA nameservice settings, and Hadoop config files are all compatible with JuiceFS.
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/lakehouse/file-analysis.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/lakehouse/file-analysis.md
index 7ca8d06013c..78b870e4132 100644
--- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/lakehouse/file-analysis.md
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/lakehouse/file-analysis.md
@@ -2,11 +2,11 @@
{
    "title": "Analyzing Files on S3/HDFS",
    "language": "zh-CN",
-    "description": "Learn how to use the Apache Doris Table Value Function (TVF) to directly query and analyze Parquet, ORC, CSV, and JSON files on storage systems such as S3 and HDFS, with support for automatic schema inference, multi-file matching, and data import."
+    "description": "Learn how to use the Apache Doris Table Value Function (TVF) to directly query and analyze Parquet, ORC, CSV, and JSON files on storage systems such as S3 and HDFS, with support for automatic schema inference, multi-file matching, data import, and exporting query results to files."
}
---
-Through the Table Value Function (TVF) feature, Doris can directly query and analyze files on object storage or HDFS as tables without importing data in advance, and supports automatic column type inference.
+Through the Table Value Function (TVF) feature, Doris can directly query and analyze files on object storage or HDFS as tables without importing data in advance, and supports automatic column type inference. Starting from version 4.1.0, it also supports exporting query results to file systems via the `INSERT INTO` TVF syntax.
## Supported Storage Systems
@@ -113,6 +113,113 @@ FROM s3(
);
```
+### Scenario 4: Exporting Query Results to Files
+
+:::tip
+This feature is supported since Apache Doris 4.1.0 and is currently experimental.
+:::
+
+Using the `INSERT INTO` TVF syntax, you can export query results directly as files to the local file system, HDFS, or S3-compatible object storage, in CSV, Parquet, or ORC format.
+
+**Syntax:**
+
+```sql
+INSERT INTO tvf_name(
+    "file_path" = "<file_path_prefix>",
+    "format" = "<file_format>",
+    ... -- other connection properties and format options
+)
+[WITH LABEL label_name]
+SELECT ... ;
+```
+
+Where `tvf_name` can be:
+
+| TVF Name | Target Storage | Description |
+|----------|---------------|-------------|
+| `local` | Local file system | Export to the local disk of a BE node; requires specifying `backend_id` |
+| `hdfs` | HDFS | Export to the Hadoop Distributed File System |
+| `s3` | S3-compatible object storage | Export to AWS S3, Alibaba Cloud OSS, Tencent Cloud COS, etc. |
+
+**Common Properties:**
+
+| Property | Required | Description |
+|----------|----------|-------------|
+| `file_path` | Yes | Output file path prefix. The actual generated file name follows the format `{prefix}{query_id}_{idx}.{ext}` |
+| `format` | No | Output file format; supports `csv` (default), `parquet`, and `orc` |
+| `max_file_size` | No | Maximum size per file, in bytes. A new file is created automatically when the limit is exceeded |
+| `delete_existing_files` | No | Whether to delete existing files in the target directory before writing; default `false` |
+
+**Additional Properties for CSV Format:**
+
+| Property | Description |
+|----------|-------------|
+| `column_separator` | Column separator, default `,` |
+| `line_delimiter` | Line delimiter, default `\n` |
+| `compress_type` | Compression format; supports `gz`, `zstd`, `lz4`, and `snappy` |
+
+**Example 1: Export CSV to HDFS**
+
+```sql
+INSERT INTO hdfs(
+    "file_path" = "/tmp/export/csv_data_",
+    "format" = "csv",
+    "column_separator" = ",",
+    "hadoop.username" = "doris",
+    "fs.defaultFS" = "hdfs://namenode:8020",
+    "delete_existing_files" = "true"
+)
+SELECT * FROM my_table ORDER BY id;
+```
+
+**Example 2: Export Parquet to S3**
+
+```sql
+INSERT INTO s3(
+    "uri" = "s3://bucket/export/parquet_data_",
+    "s3.access_key" = "ak",
+    "s3.secret_key" = "sk",
+    "s3.endpoint" = "https://s3.us-east-1.amazonaws.com",
+    "s3.region" = "us-east-1",
+    "format" = "parquet"
+)
+SELECT * FROM my_table WHERE dt = '2024-01-01';
+```
+
+**Example 3: Export ORC to S3**
+
+```sql
+INSERT INTO s3(
+    "uri" = "s3://bucket/export/orc_data_",
+    "s3.access_key" = "ak",
+    "s3.secret_key" = "sk",
+    "s3.endpoint" = "https://s3.us-east-1.amazonaws.com",
+    "s3.region" = "us-east-1",
+    "format" = "orc",
+    "delete_existing_files" = "true"
+)
+SELECT c_int, c_varchar, c_string FROM my_table WHERE c_int IS NOT NULL ORDER BY c_int;
+```
+
+**Example 4: Export CSV to a Local BE Node**
+
+```sql
+INSERT INTO local(
+    "file_path" = "/tmp/export/local_csv_",
+    "backend_id" = "10001",
+    "format" = "csv"
+)
+SELECT * FROM my_table ORDER BY id;
+```
+
+:::note
+- `file_path` is a file name prefix. The actual generated file name follows `{prefix}{query_id}_{idx}.{ext}`, where `idx` starts from 0 and increments.
+- When using the `local` TVF, specify the BE node to write to via `backend_id`.
+- When `delete_existing_files` is enabled, the system deletes all files in the directory containing `file_path` before writing. Use with caution.
+- Executing `INSERT INTO TVF` requires the ADMIN or LOAD privilege.
+- For the S3 TVF, the `uri` property takes the place of `file_path`.
+:::
+
## Core Features
### Multi-File Matching
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/lakehouse/storages/juicefs.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/lakehouse/storages/juicefs.md
new file mode 100644
index 00000000000..92bcc477dcc
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/lakehouse/storages/juicefs.md
@@ -0,0 +1,97 @@
+---
+{
+  "title": "JuiceFS | Storages",
+  "language": "zh-CN",
+  "description": "This document describes the parameters required for accessing JuiceFS and the contexts in which they apply.",
+  "sidebar_label": "JuiceFS"
+}
+---
+
+# JuiceFS
+
+:::tip Supported since
+Doris 4.0.2
+:::
+
+[JuiceFS](https://juicefs.com) is an open-source, high-performance, cloud-native distributed file system that is fully compatible with the HDFS API. Doris treats the `jfs://` scheme as HDFS-compatible, so you can access data stored in JuiceFS the same way as HDFS.
+
+This document describes the parameters required for accessing JuiceFS. These parameters apply to:
+
+* Catalog properties
+* Table Valued Function properties
+* Broker Load properties
+* Export properties
+* Outfile properties
+
+## Prerequisites
+
+JuiceFS access relies on the `juicefs-hadoop` client jar. Starting from Doris 4.0.2, the build system automatically downloads and packages this jar. The jar is placed under:
+
+- FE: `fe/lib/juicefs/`
+- BE: `be/lib/java_extensions/juicefs/`
+
+If you are deploying manually, download `juicefs-hadoop-<version>.jar` from [Maven Central](https://repo1.maven.org/maven2/io/juicefs/juicefs-hadoop/) and place it in the directories listed above.
+
+## Parameter Overview
+
+Since JuiceFS is HDFS-compatible, it shares the same authentication parameters as HDFS. In addition, the following JuiceFS-specific parameters are required:
+
+| Property Name | Description | Required |
+| --- | --- | --- |
+| fs.defaultFS | The default filesystem URI, for example `jfs://cluster`. | Yes |
+| fs.jfs.impl | The Hadoop FileSystem implementation class. Must be set to `io.juicefs.JuiceFileSystem`. | Yes |
+| juicefs.\<cluster\>.meta | The JuiceFS metadata engine endpoint, for example `redis://127.0.0.1:6379/1` or `mysql://user:pwd@(host:port)/db`. Replace `<cluster>` with the cluster name in your `fs.defaultFS` URI. | Yes |
+
+For HDFS authentication parameters (Simple or Kerberos), refer to the [HDFS](./hdfs.md) documentation.
+
+All properties prefixed with `juicefs.` are passed through to the underlying JuiceFS Hadoop client.
+
+## Example Configurations
+
+### Catalog with Hive Metastore
+
+```sql
+CREATE CATALOG jfs_hive PROPERTIES (
+    'type' = 'hms',
+    'hive.metastore.uris' = 'thrift://<hms_host>:9083',
+    'fs.defaultFS' = 'jfs://cluster',
+    'fs.jfs.impl' = 'io.juicefs.JuiceFileSystem',
+    'juicefs.cluster.meta' = 'redis://127.0.0.1:6379/1',
+    'hadoop.username' = 'doris'
+);
+```
+
+### Broker Load
+
+```sql
+LOAD LABEL example_db.label1
+(
+    DATA INFILE("jfs://cluster/path/to/data/*")
+    INTO TABLE `my_table`
+)
+WITH BROKER
+(
+    "fs.defaultFS" = "jfs://cluster",
+    "fs.jfs.impl" = "io.juicefs.JuiceFileSystem",
+    "juicefs.cluster.meta" = "redis://127.0.0.1:6379/1",
+    "hadoop.username" = "doris"
+);
+```
+
+### Table Valued Function
+
+```sql
+SELECT * FROM hdfs(
+    "uri" = "jfs://cluster/path/to/file.parquet",
+    "format" = "parquet",
+    "fs.jfs.impl" = "io.juicefs.JuiceFileSystem",
+    "juicefs.cluster.meta" = "redis://127.0.0.1:6379/1"
+);
+```
+
+## Best Practices
+
+* Ensure the `juicefs-hadoop` jar is deployed on **all** FE and BE nodes.
+* The cluster name in `juicefs.<cluster>.meta` must match the cluster name in the `jfs://` URI. For example, if `fs.defaultFS = jfs://mycluster`, the metadata property should be `juicefs.mycluster.meta`.
+* JuiceFS supports multiple metadata engines (Redis, MySQL, TiKV, SQLite, etc.). Choose one based on your scale and availability requirements.
+* HDFS configurations such as Kerberos authentication, HA nameservice settings, and Hadoop config files are all compatible with JuiceFS.
diff --git a/sidebars.ts b/sidebars.ts
index a9cb45b5ffa..f50f5ab7794 100644
--- a/sidebars.ts
+++ b/sidebars.ts
@@ -439,6 +439,7 @@ const sidebars: SidebarsConfig = {
'lakehouse/storages/huawei-obs',
'lakehouse/storages/baidu-bos',
'lakehouse/storages/minio',
+ 'lakehouse/storages/juicefs',
],
},
{
diff --git a/versioned_docs/version-4.x/lakehouse/file-analysis.md b/versioned_docs/version-4.x/lakehouse/file-analysis.md
index d7ac1ea8266..df850cef676 100644
--- a/versioned_docs/version-4.x/lakehouse/file-analysis.md
+++ b/versioned_docs/version-4.x/lakehouse/file-analysis.md
@@ -2,11 +2,11 @@
{
"title": "Analyzing Files on S3/HDFS",
"language": "en",
-    "description": "Learn how to use Apache Doris Table Value Function (TVF) to directly query and analyze Parquet, ORC, CSV, and JSON files on storage systems like S3 and HDFS, with support for automatic schema inference, multi-file matching, and data import."
+    "description": "Learn how to use Apache Doris Table Value Function (TVF) to directly query and analyze Parquet, ORC, CSV, and JSON files on storage systems like S3 and HDFS, with support for automatic schema inference, multi-file matching, data import, and exporting query results to files."
}
---
-Through the Table Value Function (TVF) feature, Doris can directly query and analyze files on object storage or HDFS as tables without importing data in advance, and supports automatic column type inference.
+Through the Table Value Function (TVF) feature, Doris can directly query and analyze files on object storage or HDFS as tables without importing data in advance, and supports automatic column type inference. Starting from version 4.1.0, it also supports exporting query results to file systems via the `INSERT INTO` TVF syntax.
## Supported Storage Systems
@@ -113,6 +113,113 @@ FROM s3(
);
```
+### Scenario 4: Exporting Query Results to Files (Since Version 4.1.0) {#export-query-results}
+
+:::tip
+This feature is supported since Apache Doris 4.1.0 and is currently experimental.
+:::
+
+Using the `INSERT INTO` TVF syntax, you can export query results directly as files to the local file system, HDFS, or S3-compatible object storage, in CSV, Parquet, or ORC format.
+**Syntax:**
+
+```sql
+INSERT INTO tvf_name(
+ "file_path" = "<file_path_prefix>",
+ "format" = "<file_format>",
+ ... -- other connection properties and format options
+)
+[WITH LABEL label_name]
+SELECT ... ;
+```
+
+Where `tvf_name` can be:
+
+| TVF Name | Target Storage | Description |
+|----------|---------------|-------------|
+| `local` | Local file system | Export to the local disk of a BE node; requires specifying `backend_id` |
+| `hdfs` | HDFS | Export to the Hadoop Distributed File System |
+| `s3` | S3-compatible object storage | Export to AWS S3, Alibaba Cloud OSS, Tencent Cloud COS, etc. |
+
+**Common Properties:**
+
+| Property | Required | Description |
+|----------|----------|-------------|
+| `file_path` | Yes | Output file path prefix. The actual generated file name follows the format `{prefix}{query_id}_{idx}.{ext}` |
+| `format` | No | Output file format; supports `csv` (default), `parquet`, and `orc` |
+| `max_file_size` | No | Maximum size per file, in bytes. A new file is created automatically when the limit is exceeded |
+| `delete_existing_files` | No | Whether to delete existing files in the target directory before writing; default `false` |
+
+**Additional Properties for CSV Format:**
+
+| Property | Description |
+|----------|-------------|
+| `column_separator` | Column separator, default `,` |
+| `line_delimiter` | Line delimiter, default `\n` |
+| `compress_type` | Compression format; supports `gz`, `zstd`, `lz4`, and `snappy` |
+
+**Example 1: Export CSV to HDFS**
+
+```sql
+INSERT INTO hdfs(
+ "file_path" = "/tmp/export/csv_data_",
+ "format" = "csv",
+ "column_separator" = ",",
+ "hadoop.username" = "doris",
+ "fs.defaultFS" = "hdfs://namenode:8020",
+ "delete_existing_files" = "true"
+)
+SELECT * FROM my_table ORDER BY id;
+```
+
+**Example 2: Export Parquet to S3**
+
+```sql
+INSERT INTO s3(
+ "uri" = "s3://bucket/export/parquet_data_",
+ "s3.access_key" = "ak",
+ "s3.secret_key" = "sk",
+ "s3.endpoint" = "https://s3.us-east-1.amazonaws.com",
+ "s3.region" = "us-east-1",
+ "format" = "parquet"
+)
+SELECT * FROM my_table WHERE dt = '2024-01-01';
+```
+
+**Example 3: Export ORC to S3**
+
+```sql
+INSERT INTO s3(
+ "uri" = "s3://bucket/export/orc_data_",
+ "s3.access_key" = "ak",
+ "s3.secret_key" = "sk",
+ "s3.endpoint" = "https://s3.us-east-1.amazonaws.com",
+ "s3.region" = "us-east-1",
+ "format" = "orc",
+ "delete_existing_files" = "true"
+)
+SELECT c_int, c_varchar, c_string FROM my_table WHERE c_int IS NOT NULL ORDER BY c_int;
+```
+
+**Example 4: Export CSV to Local BE Node**
+
+```sql
+INSERT INTO local(
+ "file_path" = "/tmp/export/local_csv_",
+ "backend_id" = "10001",
+ "format" = "csv"
+)
+SELECT * FROM my_table ORDER BY id;
+```
+
+:::note
+- `file_path` is a file name prefix. The actual generated file name follows `{prefix}{query_id}_{idx}.{ext}`, where `idx` starts from 0 and increments.
+- When using the `local` TVF, specify the BE node to write to via `backend_id`.
+- When `delete_existing_files` is enabled, the system deletes all files in the directory containing `file_path` before writing. Use with caution.
+- Executing `INSERT INTO TVF` requires the ADMIN or LOAD privilege.
+- For the S3 TVF, the `uri` property takes the place of `file_path`.
+:::
+
## Core Features
### Multi-File Matching
@@ -125,8 +232,6 @@ The file path (URI) supports using wildcards and range patterns to match multipl
| `{n..m}` | `file_{1..3}` | `file_1`, `file_2`, `file_3` |
| `{a,b,c}` | `file_{a,b}` | `file_a`, `file_b` |
-For complete syntax, please refer to [File Path Pattern](../sql-manual/basic-element/file-path-pattern).
-
### Using Resource to Simplify Configuration
TVF supports referencing pre-created S3 or HDFS Resources through the `resource` property, avoiding the need to repeatedly fill in connection information for each query.
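The two brace forms in the wildcard table above behave like shell-style expansion. A rough Python sketch of that matching logic (illustration only; Doris implements this internally):

```python
import re

def expand_braces(pattern: str) -> list[str]:
    # {n..m}: numeric range, e.g. file_{1..3} -> file_1, file_2, file_3
    m = re.search(r"\{(\d+)\.\.(\d+)\}", pattern)
    if m:
        lo, hi = int(m.group(1)), int(m.group(2))
        return [pattern[:m.start()] + str(i) + pattern[m.end():] for i in range(lo, hi + 1)]
    # {a,b,c}: alternatives, e.g. file_{a,b} -> file_a, file_b
    m = re.search(r"\{([^{}]+)\}", pattern)
    if m:
        return [pattern[:m.start()] + alt + pattern[m.end():] for alt in m.group(1).split(",")]
    return [pattern]

print(expand_braces("file_{1..3}"))  # ['file_1', 'file_2', 'file_3']
print(expand_braces("file_{a,b}"))   # ['file_a', 'file_b']
```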
diff --git a/versioned_docs/version-4.x/lakehouse/storages/juicefs.md b/versioned_docs/version-4.x/lakehouse/storages/juicefs.md
new file mode 100644
index 00000000000..b5708b4a3c9
--- /dev/null
+++ b/versioned_docs/version-4.x/lakehouse/storages/juicefs.md
@@ -0,0 +1,97 @@
+---
+{
+ "title": "JuiceFS | Storages",
+ "language": "en",
+  "description": "This document describes the parameters required for accessing JuiceFS and the contexts in which they apply.",
+ "sidebar_label": "JuiceFS"
+}
+---
+
+# JuiceFS
+
+:::tip Supported since
+Doris 4.0.2
+:::
+
+[JuiceFS](https://juicefs.com) is an open-source, high-performance distributed file system designed for the cloud. It is fully compatible with the HDFS API. Doris treats the `jfs://` scheme as HDFS-compatible, so you can access data stored in JuiceFS using the same approach as HDFS.
+
+This document describes the parameters required for accessing JuiceFS. These parameters apply to:
+
+* Catalog properties
+* Table Valued Function properties
+* Broker Load properties
+* Export properties
+* Outfile properties
+
+## Prerequisites
+
+JuiceFS access relies on the `juicefs-hadoop` client jar. Starting from Doris 4.0.2, the build system automatically downloads and packages this jar. The jar is placed under:
+
+- FE: `fe/lib/juicefs/`
+- BE: `be/lib/java_extensions/juicefs/`
+
+If you are deploying manually, download `juicefs-hadoop-<version>.jar` from [Maven Central](https://repo1.maven.org/maven2/io/juicefs/juicefs-hadoop/) and place it in the directories listed above.
+
+## Parameter Overview
+
+Since JuiceFS is HDFS-compatible, it shares the same authentication parameters as HDFS. In addition, the following JuiceFS-specific parameters are required:
+
+| Property Name | Description | Required |
+| --- | --- | --- |
+| fs.defaultFS | The default filesystem URI, for example `jfs://cluster`. | Yes |
+| fs.jfs.impl | The Hadoop FileSystem implementation class. Must be set to `io.juicefs.JuiceFileSystem`. | Yes |
+| juicefs.\<cluster\>.meta | The JuiceFS metadata engine endpoint, for example `redis://127.0.0.1:6379/1` or `mysql://user:pwd@(host:port)/db`. Replace `<cluster>` with the cluster name in your `fs.defaultFS` URI. | Yes |
+
+For HDFS authentication parameters (Simple or Kerberos), refer to the [HDFS](./hdfs.md) documentation.
+
+All properties prefixed with `juicefs.` are passed through to the underlying JuiceFS Hadoop client.
+
+## Example Configurations
+
+### Catalog with Hive Metastore
+
+```sql
+CREATE CATALOG jfs_hive PROPERTIES (
+ 'type' = 'hms',
+ 'hive.metastore.uris' = 'thrift://<hms_host>:9083',
+ 'fs.defaultFS' = 'jfs://cluster',
+ 'fs.jfs.impl' = 'io.juicefs.JuiceFileSystem',
+ 'juicefs.cluster.meta' = 'redis://127.0.0.1:6379/1',
+ 'hadoop.username' = 'doris'
+);
+```
+
+### Broker Load
+
+```sql
+LOAD LABEL example_db.label1
+(
+ DATA INFILE("jfs://cluster/path/to/data/*")
+ INTO TABLE `my_table`
+)
+WITH BROKER
+(
+ "fs.defaultFS" = "jfs://cluster",
+ "fs.jfs.impl" = "io.juicefs.JuiceFileSystem",
+ "juicefs.cluster.meta" = "redis://127.0.0.1:6379/1",
+ "hadoop.username" = "doris"
+);
+```
+
+### Table Valued Function
+
+```sql
+SELECT * FROM hdfs(
+ "uri" = "jfs://cluster/path/to/file.parquet",
+ "format" = "parquet",
+ "fs.jfs.impl" = "io.juicefs.JuiceFileSystem",
+ "juicefs.cluster.meta" = "redis://127.0.0.1:6379/1"
+);
+```
+
+## Best Practices
+
+* Ensure the `juicefs-hadoop` jar is deployed on **all** FE and BE nodes.
+* The cluster name in `juicefs.<cluster>.meta` must match the cluster name in the `jfs://` URI. For example, if `fs.defaultFS = jfs://mycluster`, the metadata property should be `juicefs.mycluster.meta`.
+* JuiceFS supports multiple metadata engines (Redis, MySQL, TiKV, SQLite, etc.). Choose one based on your scale and availability requirements.
+* HDFS configurations such as Kerberos authentication, HA nameservice settings, and Hadoop config files are all compatible with JuiceFS.
diff --git a/versioned_sidebars/version-4.x-sidebars.json b/versioned_sidebars/version-4.x-sidebars.json
index c5c2578df70..49c6cadff6f 100644
--- a/versioned_sidebars/version-4.x-sidebars.json
+++ b/versioned_sidebars/version-4.x-sidebars.json
@@ -439,7 +439,8 @@
"lakehouse/storages/tencent-cos",
"lakehouse/storages/huawei-obs",
"lakehouse/storages/baidu-bos",
- "lakehouse/storages/minio"
+ "lakehouse/storages/minio",
+ "lakehouse/storages/juicefs"
]
},
{