This is an automated email from the ASF dual-hosted git repository.
morningman pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/doris-website.git
The following commit(s) were added to refs/heads/master by this push:
new 1b13fe5263c [catalog](benchmark) add trino connector doc (#656)
1b13fe5263c is described below
commit 1b13fe5263c13a7bb460023915ffb806e0b82539
Author: Mingyu Chen <[email protected]>
AuthorDate: Thu May 16 16:42:31 2024 +0800
[catalog](benchmark) add trino connector doc (#656)
add tpch and tpcds catalog doc
---
docs/lakehouse/datalake-analytics/tpcds.md | 179 ++++++++++++++++++++
docs/lakehouse/datalake-analytics/tpch.md | 147 ++++++++++++++++
.../current/lakehouse/datalake-analytics/tpcds.md | 187 +++++++++++++++++++++
.../current/lakehouse/datalake-analytics/tpch.md | 156 +++++++++++++++++
sidebars.json | 6 +-
5 files changed, 673 insertions(+), 2 deletions(-)
diff --git a/docs/lakehouse/datalake-analytics/tpcds.md b/docs/lakehouse/datalake-analytics/tpcds.md
new file mode 100644
index 00000000000..4b3144d9bbf
--- /dev/null
+++ b/docs/lakehouse/datalake-analytics/tpcds.md
@@ -0,0 +1,179 @@
+---
+{
+"title": "TPCDS",
+"language": "en"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Usage Notes
+
+TPCDS Catalog uses the [Trino Connector](https://doris.apache.org/community/how-to-contribute/trino-connector-developer-guide) compatibility framework and the [TPCDS Connector](https://trino.io/docs/current/connector/tpcds.html) to quickly build TPCDS test sets.
+
+:::tip
+This feature is supported starting from Doris version 3.0.0.
+:::
+
+## Compiling the TPCDS Connector
+
+> JDK 17 is required.
+
+```shell
+git clone https://github.com/trinodb/trino.git
+git checkout 435
+cd trino/plugin/trino-tpcds
+mvn clean install -DskipTests
+```
+
+After compiling, you will find the `trino-tpcds-435/` directory under `trino/plugin/trino-tpcds/target/`.
+
+You can also directly download the precompiled [trino-tpcds-435.tar.gz](https://github.com/morningman/trino-connectors/releases/download/trino-connectors/trino-tpcds-435.tar.gz) and extract it.
+
+## Deploying the TPCDS Connector
+
+Place the `trino-tpcds-435/` directory under the `connectors/` directory in the deployment paths of all FE and BE nodes (create the directory manually if it does not exist).
+
+```
+├── bin
+├── conf
+├── connectors
+│ ├── trino-tpcds-435
+...
+```
+
+After deployment, it is recommended to restart the FE and BE nodes to ensure the Connector is loaded correctly.
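
The deployment step above can be sketched as a shell sequence. This is a runnable illustration only: `DORIS_HOME` and the tarball are temporary stand-ins created on the fly, so substitute your real FE/BE deployment path and the downloaded `trino-tpcds-435.tar.gz`:

```shell
# Stand-ins for a real deployment; replace with your actual paths.
DORIS_HOME="$(mktemp -d)"      # e.g. /opt/doris/fe or /opt/doris/be
TARBALL_DIR="$(mktemp -d)"

# Fabricate a trino-tpcds-435.tar.gz so the example runs without a download.
mkdir -p "$TARBALL_DIR/trino-tpcds-435"
touch "$TARBALL_DIR/trino-tpcds-435/trino-tpcds-435.jar"
tar -czf "$TARBALL_DIR/trino-tpcds-435.tar.gz" -C "$TARBALL_DIR" trino-tpcds-435

# The actual deployment: create connectors/ if missing and unpack into it.
mkdir -p "$DORIS_HOME/connectors"
tar -xzf "$TARBALL_DIR/trino-tpcds-435.tar.gz" -C "$DORIS_HOME/connectors"
ls "$DORIS_HOME/connectors"
```

Repeat this on every FE and BE node, then restart them.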
+
+## Creating the TPCDS Catalog
+
+```sql
+CREATE CATALOG `tpcds` PROPERTIES (
+ "type" = "trino-connector",
+ "connector.name" = "tpcds",
+ "tpcds.split-count" = "32"
+);
+```
+
+The `tpcds.split-count` property sets the level of concurrency. It is recommended to set it to twice the number of cores per BE node to achieve optimal concurrency and improve data generation efficiency.
+
+## Using the TPCDS Catalog
+
+The TPCDS Catalog includes pre-configured TPCDS datasets of different scale factors, which can be viewed using the `SHOW DATABASES` and `SHOW TABLES` commands.
+
+```
+mysql> SWITCH tpcds;
+Query OK, 0 rows affected (0.00 sec)
+
+mysql> SHOW DATABASES;
++--------------------+
+| Database |
++--------------------+
+| information_schema |
+| mysql |
+| sf1 |
+| sf100 |
+| sf1000 |
+| sf10000 |
+| sf100000 |
+| sf300 |
+| sf3000 |
+| sf30000 |
+| tiny |
++--------------------+
+11 rows in set (0.00 sec)
+
+mysql> USE sf1;
+mysql> SHOW TABLES;
++------------------------+
+| Tables_in_sf1 |
++------------------------+
+| call_center |
+| catalog_page |
+| catalog_returns |
+| catalog_sales |
+| customer |
+| customer_address |
+| customer_demographics |
+| date_dim |
+| dbgen_version |
+| household_demographics |
+| income_band |
+| inventory |
+| item |
+| promotion |
+| reason |
+| ship_mode |
+| store |
+| store_returns |
+| store_sales |
+| time_dim |
+| warehouse |
+| web_page |
+| web_returns |
+| web_sales |
+| web_site |
++------------------------+
+25 rows in set (0.00 sec)
+```
+
+You can directly query these tables using the SELECT statement.
+
+:::tip
+The data in these pre-configured datasets is not actually stored but generated in real time during queries. Therefore, these datasets are not suitable for direct benchmarking. They are more appropriate for writing to other target tables (such as Doris internal tables, Hive, Iceberg, and other data sources supported by Doris) via `INSERT INTO SELECT`, after which performance tests can be conducted on the target tables.
+:::
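
As a sketch of that workflow (the target table `internal.db1.store_sales_bench` is hypothetical and assumed to be pre-created in a Doris internal catalog with a matching schema):

```sql
-- Materialize one generated table into a (hypothetical) Doris internal table.
INSERT INTO internal.db1.store_sales_bench
SELECT * FROM tpcds.sf1.store_sales;

-- Benchmarks then run against the materialized copy, not the generator.
SELECT COUNT(*) FROM internal.db1.store_sales_bench;
```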
+
+### Best Practices
+
+#### Quickly Build TPCDS Test Dataset
+
+You can quickly build a TPCDS test dataset using the CTAS (Create Table As Select) statement:
+
+```
+CREATE TABLE hive.tpcds100.call_center PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.call_center;
+CREATE TABLE hive.tpcds100.catalog_page PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.catalog_page;
+CREATE TABLE hive.tpcds100.catalog_returns PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.catalog_returns;
+CREATE TABLE hive.tpcds100.catalog_sales PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.catalog_sales;
+CREATE TABLE hive.tpcds100.customer PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.customer;
+CREATE TABLE hive.tpcds100.customer_address PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.customer_address;
+CREATE TABLE hive.tpcds100.customer_demographics PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.customer_demographics;
+CREATE TABLE hive.tpcds100.date_dim PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.date_dim;
+CREATE TABLE hive.tpcds100.dbgen_version PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.dbgen_version;
+CREATE TABLE hive.tpcds100.household_demographics PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.household_demographics;
+CREATE TABLE hive.tpcds100.income_band PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.income_band;
+CREATE TABLE hive.tpcds100.inventory PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.inventory;
+CREATE TABLE hive.tpcds100.item PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.item;
+CREATE TABLE hive.tpcds100.promotion PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.promotion;
+CREATE TABLE hive.tpcds100.reason PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.reason;
+CREATE TABLE hive.tpcds100.ship_mode PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.ship_mode;
+CREATE TABLE hive.tpcds100.store PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.store;
+CREATE TABLE hive.tpcds100.store_returns PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.store_returns;
+CREATE TABLE hive.tpcds100.store_sales PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.store_sales;
+CREATE TABLE hive.tpcds100.time_dim PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.time_dim;
+CREATE TABLE hive.tpcds100.warehouse PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.warehouse;
+CREATE TABLE hive.tpcds100.web_page PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_page;
+CREATE TABLE hive.tpcds100.web_returns PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_returns;
+CREATE TABLE hive.tpcds100.web_sales PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_sales;
+CREATE TABLE hive.tpcds100.web_site PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_site;
+```
+
+:::tip
+On a Doris cluster with 3 BE nodes, each with 16 cores, creating a TPCDS 1000 dataset in Hive takes approximately 3 to 4 hours.
+:::
+
diff --git a/docs/lakehouse/datalake-analytics/tpch.md b/docs/lakehouse/datalake-analytics/tpch.md
new file mode 100644
index 00000000000..252193779ff
--- /dev/null
+++ b/docs/lakehouse/datalake-analytics/tpch.md
@@ -0,0 +1,147 @@
+---
+{
+"title": "TPCH",
+"language": "en"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Usage Notes
+
+TPCH Catalog uses the [Trino Connector](https://doris.apache.org/community/how-to-contribute/trino-connector-developer-guide) compatibility framework and the [TPCH Connector](https://trino.io/docs/current/connector/tpch.html) to quickly build TPCH test sets.
+
+:::tip
+This feature is supported starting from Doris version 3.0.0.
+:::
+
+## Compiling the TPCH Connector
+
+> JDK 17 is required.
+
+```shell
+git clone https://github.com/trinodb/trino.git
+git checkout 435
+cd trino/plugin/trino-tpch
+mvn clean install -DskipTests
+```
+
+After compiling, you will find the `trino-tpch-435/` directory under `trino/plugin/trino-tpch/target/`.
+
+You can also directly download the precompiled [trino-tpch-435.tar.gz](https://github.com/morningman/trino-connectors/releases/download/trino-connectors/trino-tpch-435.tar.gz) and extract it.
+
+## Deploying the TPCH Connector
+
+Place the `trino-tpch-435/` directory under the `connectors/` directory in the deployment paths of all FE and BE nodes (create the directory manually if it does not exist).
+
+```
+├── bin
+├── conf
+├── connectors
+│ ├── trino-tpch-435
+...
+```
+
+After deployment, it is recommended to restart the FE and BE nodes to ensure the Connector is loaded correctly.
+
+## Creating the TPCH Catalog
+
+```sql
+CREATE CATALOG `tpch` PROPERTIES (
+ "type" = "trino-connector",
+ "connector.name" = "tpch",
+ "tpch.column-naming" = "STANDARD",
+ "tpch.splits-per-node" = "32"
+);
+```
+
+The `tpch.splits-per-node` property sets the level of concurrency. It is recommended to set it to twice the number of cores per BE node to achieve optimal concurrency and improve data generation efficiency.
+
+When `"tpch.column-naming" = "STANDARD"`, column names in the TPCH tables are prefixed with the table-name abbreviation, such as `l_orderkey`; otherwise the name is `orderkey`.
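
The effect of the naming mode can be seen directly when querying; the sketch below assumes the catalog created above, and the commented variant shows the unprefixed naming used when the property is omitted:

```sql
-- With "tpch.column-naming" = "STANDARD", lineitem columns carry the l_ prefix:
SELECT l_orderkey, l_quantity FROM tpch.sf1.lineitem LIMIT 5;

-- Without it, the same columns are unprefixed:
-- SELECT orderkey, quantity FROM tpch.sf1.lineitem LIMIT 5;
```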
+
+## Using the TPCH Catalog
+
+The TPCH Catalog includes pre-configured TPCH datasets of different scale factors, which can be viewed using the `SHOW DATABASES` and `SHOW TABLES` commands.
+
+```
+mysql> SWITCH tpch;
+Query OK, 0 rows affected (0.00 sec)
+
+mysql> SHOW DATABASES;
++--------------------+
+| Database |
++--------------------+
+| information_schema |
+| mysql |
+| sf1 |
+| sf100 |
+| sf1000 |
+| sf10000 |
+| sf100000 |
+| sf300 |
+| sf3000 |
+| sf30000 |
+| tiny |
++--------------------+
+11 rows in set (0.00 sec)
+
+mysql> USE sf1;
+mysql> SHOW TABLES;
++---------------+
+| Tables_in_sf1 |
++---------------+
+| customer |
+| lineitem |
+| nation |
+| orders |
+| part |
+| partsupp |
+| region |
+| supplier |
++---------------+
+8 rows in set (0.00 sec)
+```
+
+You can directly query these tables using the SELECT statement.
+
+:::tip
+The data in these pre-configured datasets is not actually stored but generated in real time during queries. Therefore, these datasets are not suitable for direct benchmarking. They are more appropriate for writing to other target tables (such as Doris internal tables, Hive, Iceberg, and other data sources supported by Doris) via `INSERT INTO SELECT`, after which performance tests can be conducted on the target tables.
+:::
+
+### Best Practices
+
+#### Quickly Build TPCH Test Dataset
+
+You can quickly build a TPCH test dataset using the CTAS (Create Table As Select) statement:
+
+```
+CREATE TABLE hive.tpch100.customer PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.customer;
+CREATE TABLE hive.tpch100.lineitem PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.lineitem;
+CREATE TABLE hive.tpch100.nation PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.nation;
+CREATE TABLE hive.tpch100.orders PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.orders;
+CREATE TABLE hive.tpch100.part PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.part;
+CREATE TABLE hive.tpch100.partsupp PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.partsupp;
+CREATE TABLE hive.tpch100.region PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.region;
+CREATE TABLE hive.tpch100.supplier PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.supplier;
+```
+
+:::tip
+On a Doris cluster with 3 BE nodes, each with 16 cores, creating a TPCH 1000 dataset in Hive takes approximately 25 minutes, and TPCH 10000 takes about 4 to 5 hours.
+:::
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-analytics/tpcds.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-analytics/tpcds.md
new file mode 100644
index 00000000000..69a0853ef5a
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-analytics/tpcds.md
@@ -0,0 +1,187 @@
+---
+{
+"title": "TPCDS",
+"language": "zh-CN"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Usage Notes
+
+TPCDS Catalog uses the [Trino Connector](https://doris.apache.org/community/how-to-contribute/trino-connector-developer-guide) compatibility framework and the [TPCDS Connector](https://trino.io/docs/current/connector/tpcds.html) to quickly build TPCDS test sets.
+
+:::tip
+This feature is supported starting from Doris version 3.0.0.
+:::
+
+## Compiling the TPCDS Connector
+
+> JDK 17 is required.
+
+```shell
+git clone https://github.com/trinodb/trino.git
+git checkout 435
+cd trino/plugin/trino-tpcds
+mvn clean install -DskipTests
+```
+
+After compiling, you will find the `trino-tpcds-435/` directory under `trino/plugin/trino-tpcds/target/`.
+
+You can also directly download the precompiled [trino-tpcds-435.tar.gz](https://github.com/morningman/trino-connectors/releases/download/trino-connectors/trino-tpcds-435.tar.gz) and extract it.
+
+## Deploying the TPCDS Connector
+
+Place the `trino-tpcds-435/` directory under the `connectors/` directory in the deployment paths of all FE and BE nodes (create the directory manually if it does not exist).
+
+```
+├── bin
+├── conf
+├── connectors
+│ ├── trino-tpcds-435
+...
+```
+
+After deployment, it is recommended to restart the FE and BE nodes to ensure the Connector is loaded correctly.
+
+## Creating the TPCDS Catalog
+
+```sql
+CREATE CATALOG `tpcds` PROPERTIES (
+ "type" = "trino-connector",
+ "connector.name" = "tpcds",
+ "tpcds.split-count" = "32"
+);
+```
+
+The `tpcds.split-count` property sets the level of concurrency. It is recommended to set it to twice the number of cores per BE node to achieve optimal concurrency and improve data generation efficiency.
+
+## Using the TPCDS Catalog
+
+The TPCDS Catalog includes pre-configured TPCDS datasets of different scale factors, which can be viewed using the `SHOW DATABASES` and `SHOW TABLES` commands.
+
+```
+mysql> SWITCH tpcds;
+Query OK, 0 rows affected (0.00 sec)
+
+mysql> SHOW DATABASES;
++--------------------+
+| Database |
++--------------------+
+| information_schema |
+| mysql |
+| sf1 |
+| sf100 |
+| sf1000 |
+| sf10000 |
+| sf100000 |
+| sf300 |
+| sf3000 |
+| sf30000 |
+| tiny |
++--------------------+
+11 rows in set (0.00 sec)
+
+mysql> USE sf1;
+mysql> SHOW TABLES;
++------------------------+
+| Tables_in_sf1 |
++------------------------+
+| call_center |
+| catalog_page |
+| catalog_returns |
+| catalog_sales |
+| customer |
+| customer_address |
+| customer_demographics |
+| date_dim |
+| dbgen_version |
+| household_demographics |
+| income_band |
+| inventory |
+| item |
+| promotion |
+| reason |
+| ship_mode |
+| store |
+| store_returns |
+| store_sales |
+| time_dim |
+| warehouse |
+| web_page |
+| web_returns |
+| web_sales |
+| web_site |
++------------------------+
+25 rows in set (0.00 sec)
+```
+
+You can directly query these tables using the SELECT statement.
+
+:::tip
+The data in these pre-configured datasets is not actually stored but generated in real time during queries. Therefore, these datasets are not suitable for direct benchmarking. They are more appropriate for writing to other target tables (such as Doris internal tables, Hive, Iceberg, and other data sources supported by Doris) via `INSERT INTO SELECT`, after which performance tests can be conducted on the target tables.
+:::
+
+### Best Practices
+
+#### Quickly Build TPCDS Test Dataset
+
+You can quickly build a TPCDS test dataset using the CTAS (Create Table As Select) statement:
+
+```
+CREATE TABLE hive.tpcds100.call_center PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.call_center;
+CREATE TABLE hive.tpcds100.catalog_page PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.catalog_page;
+CREATE TABLE hive.tpcds100.catalog_returns PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.catalog_returns;
+CREATE TABLE hive.tpcds100.catalog_sales PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.catalog_sales;
+CREATE TABLE hive.tpcds100.customer PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.customer;
+CREATE TABLE hive.tpcds100.customer_address PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.customer_address;
+CREATE TABLE hive.tpcds100.customer_demographics PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.customer_demographics;
+CREATE TABLE hive.tpcds100.date_dim PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.date_dim;
+CREATE TABLE hive.tpcds100.dbgen_version PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.dbgen_version;
+CREATE TABLE hive.tpcds100.household_demographics PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.household_demographics;
+CREATE TABLE hive.tpcds100.income_band PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.income_band;
+CREATE TABLE hive.tpcds100.inventory PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.inventory;
+CREATE TABLE hive.tpcds100.item PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.item;
+CREATE TABLE hive.tpcds100.promotion PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.promotion;
+CREATE TABLE hive.tpcds100.reason PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.reason;
+CREATE TABLE hive.tpcds100.ship_mode PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.ship_mode;
+CREATE TABLE hive.tpcds100.store PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.store;
+CREATE TABLE hive.tpcds100.store_returns PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.store_returns;
+CREATE TABLE hive.tpcds100.store_sales PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.store_sales;
+CREATE TABLE hive.tpcds100.time_dim PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.time_dim;
+CREATE TABLE hive.tpcds100.warehouse PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.warehouse;
+CREATE TABLE hive.tpcds100.web_page PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_page;
+CREATE TABLE hive.tpcds100.web_returns PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_returns;
+CREATE TABLE hive.tpcds100.web_sales PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_sales;
+CREATE TABLE hive.tpcds100.web_site PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpcds.sf100.web_site;
+```
+
+:::tip
+On a Doris cluster with 3 BE nodes, each with 16 cores, creating a TPCDS 1000 dataset in Hive takes approximately 3 to 4 hours.
+:::
+
diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-analytics/tpch.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-analytics/tpch.md
new file mode 100644
index 00000000000..8dc31766fc0
--- /dev/null
+++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/lakehouse/datalake-analytics/tpch.md
@@ -0,0 +1,156 @@
+---
+{
+"title": "TPCH",
+"language": "zh-CN"
+}
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+## Usage Notes
+
+TPCH Catalog uses the [Trino Connector](https://doris.apache.org/community/how-to-contribute/trino-connector-developer-guide) compatibility framework and the [TPCH Connector](https://trino.io/docs/current/connector/tpch.html) to quickly build TPCH test sets.
+
+:::tip
+This feature is supported starting from Doris version 3.0.0.
+:::
+
+## Compiling the TPCH Connector
+
+> JDK 17 is required.
+
+```shell
+git clone https://github.com/trinodb/trino.git
+git checkout 435
+cd trino/plugin/trino-tpch
+mvn clean install -DskipTests
+```
+
+After compiling, you will find the `trino-tpch-435/` directory under `trino/plugin/trino-tpch/target/`.
+
+You can also directly download the precompiled [trino-tpch-435.tar.gz](https://github.com/morningman/trino-connectors/releases/download/trino-connectors/trino-tpch-435.tar.gz) and extract it.
+
+## Deploying the TPCH Connector
+
+Place the `trino-tpch-435/` directory under the `connectors/` directory in the deployment paths of all FE and BE nodes (create the directory manually if it does not exist).
+
+```
+├── bin
+├── conf
+├── connectors
+│ ├── trino-tpch-435
+...
+```
+
+After deployment, it is recommended to restart the FE and BE nodes to ensure the Connector is loaded correctly.
+
+## Creating the TPCH Catalog
+
+```sql
+CREATE CATALOG `tpch` PROPERTIES (
+ "type" = "trino-connector",
+ "connector.name" = "tpch",
+ "tpch.column-naming" = "STANDARD",
+ "tpch.splits-per-node" = "32"
+);
+```
+
+The `tpch.splits-per-node` property sets the level of concurrency. It is recommended to set it to twice the number of cores per BE node to achieve optimal concurrency and improve data generation efficiency.
+
+When `"tpch.column-naming" = "STANDARD"`, column names in the TPCH tables are prefixed with the table-name abbreviation, such as `l_orderkey`; otherwise the name is `orderkey`.
+
+## Using the TPCH Catalog
+
+The TPCH Catalog includes pre-configured TPCH datasets of different scale factors, which can be viewed using the `SHOW DATABASES` and `SHOW TABLES` commands.
+
+```
+mysql> SWITCH tpch;
+Query OK, 0 rows affected (0.00 sec)
+
+mysql> SHOW DATABASES;
++--------------------+
+| Database |
++--------------------+
+| information_schema |
+| mysql |
+| sf1 |
+| sf100 |
+| sf1000 |
+| sf10000 |
+| sf100000 |
+| sf300 |
+| sf3000 |
+| sf30000 |
+| tiny |
++--------------------+
+11 rows in set (0.00 sec)
+
+mysql> USE sf1;
+mysql> SHOW TABLES;
++---------------+
+| Tables_in_sf1 |
++---------------+
+| customer |
+| lineitem |
+| nation |
+| orders |
+| part |
+| partsupp |
+| region |
+| supplier |
++---------------+
+8 rows in set (0.00 sec)
+```
+
+You can directly query these tables using the SELECT statement.
+
+:::tip
+The data in these pre-configured datasets is not actually stored but generated in real time during queries. Therefore, these datasets are not suitable for direct benchmarking. They are more appropriate for writing to other target tables (such as Doris internal tables, Hive, Iceberg, and other data sources supported by Doris) via `INSERT INTO SELECT`, after which performance tests can be conducted on the target tables.
+:::
+
+### Best Practices
+
+#### Quickly Build TPCH Test Dataset
+
+You can quickly build a TPCH test dataset using the CTAS (Create Table As Select) statement:
+
+```
+CREATE TABLE hive.tpch100.customer PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.customer;
+CREATE TABLE hive.tpch100.lineitem PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.lineitem;
+CREATE TABLE hive.tpch100.nation PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.nation;
+CREATE TABLE hive.tpch100.orders PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.orders;
+CREATE TABLE hive.tpch100.part PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.part;
+CREATE TABLE hive.tpch100.partsupp PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.partsupp;
+CREATE TABLE hive.tpch100.region PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.region;
+CREATE TABLE hive.tpch100.supplier PROPERTIES("file_format" = "parquet") AS SELECT * FROM tpch.sf100.supplier;
+```
+
+:::tip
+On a Doris cluster with 3 BE nodes, each with 16 cores, creating a TPCH 1000 dataset in Hive takes approximately 25 minutes, and TPCH 10000 takes about 4 to 5 hours.
+:::
+:::
+
diff --git a/sidebars.json b/sidebars.json
index 48ca0ebd687..18e5621608c 100644
--- a/sidebars.json
+++ b/sidebars.json
@@ -265,7 +265,9 @@
"lakehouse/datalake-analytics/hudi",
"lakehouse/datalake-analytics/iceberg",
"lakehouse/datalake-analytics/paimon",
- "lakehouse/datalake-analytics/dlf"
+ "lakehouse/datalake-analytics/dlf",
+ "lakehouse/datalake-analytics/tpch",
+ "lakehouse/datalake-analytics/tpcds"
]
},
{
@@ -1522,4 +1524,4 @@
]
}
]
-}
\ No newline at end of file
+}
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]