[GitHub] [incubator-doris] wyb commented on a change in pull request #3418: [Spark load] Add spark etl cluster and cluster manager

GitBox Thu, 21 May 2020 08:36:11 -0700


wyb commented on a change in pull request #3418:
URL: https://github.com/apache/incubator-doris/pull/3418#discussion_r428732335




##########
File path: docs/zh-CN/administrator-guide/load-data/spark-load-manual.md
##########
@@ -0,0 +1,397 @@
+---
+{
+    "title": "Spark Load",
+    "language": "zh-CN"
+}
+---  
+
+<!-- 
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+  http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Spark Load
+
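+Spark load uses Spark to preprocess the data being imported, which improves Doris's import performance for large data volumes and saves compute resources on the Doris cluster. It is mainly intended for initial migrations and other scenarios where large volumes of data are imported into Doris.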
+
+Spark load is an asynchronous import method. Users create a Spark-type load job over the MySQL protocol and check the import result with `SHOW LOAD`.
+
+
+
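+## Applicable scenarios
+
+* The source data resides in a storage system that Spark can access, such as HDFS.
+* The data volume ranges from tens of GB to the TB level.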
+
+
+
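+## Glossary
+
+1. Frontend (FE): the metadata and scheduling node of the Doris system. In the import flow it is mainly responsible for scheduling load jobs.
+2. Backend (BE): the compute and storage node of the Doris system. In the import flow it is mainly responsible for writing and storing data.
+3. Spark ETL: mainly responsible for the ETL work in the import flow, including global dictionary construction (for the BITMAP type), partitioning, sorting, and aggregation.
+4. Broker: an independent, stateless process. It wraps the file system interface and gives Doris the ability to read files from remote storage systems.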
+
+
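+## Basic principles
+
+### Basic flow
+
+The user submits a Spark-type load job through a MySQL client. The FE records the metadata and returns a submission-success response to the user.
+
+Execution of a Spark load job is divided into the following 5 stages.
+
+1. The FE schedules and submits the ETL job to the Spark cluster for execution.
+2. The Spark cluster runs the ETL job to preprocess the data being imported, including global dictionary construction (for the BITMAP type), partitioning, sorting, and aggregation.
+3. After the ETL job finishes, the FE obtains the data path of each preprocessed shard and schedules the relevant BEs to execute Push tasks.
+4. The BEs read the data through the Broker and convert it into Doris's underlying storage format.
+5. The FE schedules the version to take effect, completing the load job.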
+
+```
+                 +
+                 | 0. User create spark load job
+            +----v----+
+            |   FE    |---------------------------------+
+            +----+----+                                 |
+                 | 3. FE send push tasks                |
+                 | 5. FE publish version                |
+    +------------+------------+                         |
+    |            |            |                         |
++---v---+    +---v---+    +---v---+                     |
+|  BE   |    |  BE   |    |  BE   |                     |1. FE submit Spark ETL job
++---^---+    +---^---+    +---^---+                     |
+    |4. BE push with broker   |                         |
++---+---+    +---+---+    +---+---+                     |
+|Broker |    |Broker |    |Broker |                     |
++---^---+    +---^---+    +---^---+                     |
+    |            |            |                         |
++---+------------+------------+---+ 2.ETL +-------------v---------------+
+|               HDFS              +------->       Spark cluster         |
+|                                 <-------+                             |
++---------------------------------+       +-----------------------------+
+
+```
+
+
+
+### Global dictionary
+
+TBD
+
+
+
+### Data preprocessing (DPP)
+
+TBD
+
+
+
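+## Basic operations
+
+### Configuring the ETL cluster
+
+Spark is used in Doris as an external compute resource to perform ETL work. Other external resources may be added to Doris in the future, such as Spark/GPU for queries, HDFS/S3 for external storage, and MapReduce for ETL, so we introduce resource management to manage the external resources that Doris uses.
+
+Before submitting a Spark load job, you need to configure the Spark cluster that runs the ETL job.
+
+Syntax: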
+
+```sql
+-- create spark resource
+CREATE EXTERNAL RESOURCE resource_name
+PROPERTIES 
+(                 
+  type = spark,
+  spark_conf_key = spark_conf_value,
+  working_dir = path,
+  broker = broker_name,
+  broker.property_key = property_value
+)
+
+-- drop spark resource
+DROP RESOURCE resource_name
+
+-- show resources
+SHOW RESOURCES
+SHOW PROC "/resources"
+
+-- privileges
+GRANT USAGE_PRIV ON RESOURCE resource_name TO user_identity
+GRANT USAGE_PRIV ON RESOURCE resource_name TO ROLE role_name
+
+REVOKE USAGE_PRIV ON RESOURCE resource_name FROM user_identity
+REVOKE USAGE_PRIV ON RESOURCE resource_name FROM ROLE role_name
+```
+
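+#### Creating a resource
+
+`resource_name` is the name of the Spark resource configured in Doris.
+
+`PROPERTIES` holds the parameters of the Spark resource, as follows:
+
+- `type`: the resource type. Required. Currently only spark is supported.
+
+- Spark-related parameters:
+  - `spark.master`: required. Currently yarn and spark://host:port are supported.
+  - `spark.submit.deployMode`: the deployment mode of the Spark program. Required. Two modes are supported: cluster and client.
+  - `spark.hadoop.yarn.resourcemanager.address`: required when master is yarn.
+  - `spark.hadoop.fs.defaultFS`: required when master is yarn.
+  - Other parameters are optional; see http://spark.apache.org/docs/latest/configuration.html 
+- `working_dir`: the directory used by ETL. Required when Spark is used as an ETL resource. For example: hdfs://host:port/tmp/doris.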

Review comment:
       spark.xxx is the standard format for Spark configuration, so I think it is better to use working_dir to distinguish it.
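
One way to picture the distinction is the Python sketch below. It is illustrative only and is not the FE implementation: the helper name `split_resource_properties`, the three-way grouping, and the stripping of the `broker.` prefix are all assumptions made for this example. The property keys and values are the placeholders from the syntax in the diff above.

```python
def split_resource_properties(props):
    """Group resource properties by the component that consumes them.

    Hypothetical helper for illustration; not part of Doris.
    """
    spark_conf = {}    # keys in the standard spark.* format, passed to Spark as-is
    broker_props = {}  # broker.* keys, stored here with the prefix stripped (assumption)
    doris_props = {}   # everything else, e.g. type, working_dir, broker
    for key, value in props.items():
        if key.startswith("spark."):
            spark_conf[key] = value
        elif key.startswith("broker."):
            broker_props[key[len("broker."):]] = value
        else:
            doris_props[key] = value
    return spark_conf, broker_props, doris_props


# Placeholder values taken from the syntax and parameter list in the diff.
example = {
    "type": "spark",
    "spark.master": "yarn",
    "spark.submit.deployMode": "cluster",
    "working_dir": "hdfs://host:port/tmp/doris",
    "broker": "broker_name",
    "broker.property_key": "property_value",
}
spark_conf, broker_props, doris_props = split_resource_properties(example)
```

Because `working_dir` has no `spark.` prefix, it lands in the Doris-side group and cannot be mistaken for a key that Spark itself would interpret.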




