Re: [DISCUSS] Table lineage design

wu shaoj Wed, 28 Oct 2020 18:14:57 -0700

Not a good idea.
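This feels too complicated; let's wait until the big JSON has been split up before revisiting it. Besides, we currently only have the SQL node, which is not well suited to parsing dependencies; a SQLScript node would be the right fit!
Configuring dependencies is fairly troublesome, and creating them automatically is not a high priority at this stage.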


From: Hemin Wen <[email protected]>
Date: Wednesday, October 28, 2020 at 11:49
To: dev <[email protected]>
Subject: [DISCUSS] Table lineage design
Hi!

I would like to propose automatic dependency configuration based on table
lineage. Everyone is welcome to discuss the ideas below.

## 1. Requirements background

   Currently, DS can only set up workflow/node dependencies by drawing the 
DAG, or by calling the API to create workflows and their dependencies from the 
workflow data structure. Data warehouses are generally designed in layers: 
data is produced along a chain, there are complex dependencies between layers, 
and there are many SQL scripts. Creating dependencies manually makes large 
numbers of workflows hard to maintain, and dependency configuration errors are 
hard to troubleshoot.

   Table lineage can be extracted by parsing the SQL statements in SQL-related 
nodes, and dependencies can then be established automatically from that 
lineage. The Master Server executes the workflow according to the supplemented 
dependencies, ensuring that nodes run in dependency order.
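
   As a rough illustration of the extraction step described above, the sketch 
below pulls input and output tables out of simple SQL statements and derives 
node dependencies from them. It is not part of the proposal: the regular 
expressions, the function names `extract_lineage` and `derive_dependencies`, 
and the sample node names are all assumptions, and real SQL would need a proper 
parser rather than regexes.

```python
import re

# Hypothetical, deliberately simplified patterns (assumption: plain
# INSERT ... SELECT statements without subqueries, CTEs or comments).
_OUTPUT_RE = re.compile(
    r"\binsert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", re.IGNORECASE
)
_INPUT_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def extract_lineage(sql):
    """Return (input_tables, output_tables) found in one SQL statement."""
    outputs = {t.lower() for t in _OUTPUT_RE.findall(sql)}
    inputs = {t.lower() for t in _INPUT_RE.findall(sql)} - outputs
    return inputs, outputs


def derive_dependencies(node_sql):
    """Map each node to the set of nodes whose output tables it reads."""
    lineage = {node: extract_lineage(sql) for node, sql in node_sql.items()}

    # Index: table name -> nodes that write it.
    producers = {}
    for node, (_, outputs) in lineage.items():
        for table in outputs:
            producers.setdefault(table, set()).add(node)

    deps = {}
    for node, (inputs, _) in lineage.items():
        upstream = set()
        for table in inputs:
            upstream |= producers.get(table, set())
        upstream.discard(node)  # a node never depends on itself
        deps[node] = upstream
    return deps


# Illustrative nodes for a layered warehouse (names are made up).
nodes = {
    "load_ods": "INSERT INTO ods.orders SELECT * FROM src.orders",
    "build_dwd": "INSERT OVERWRITE TABLE dwd.orders "
                 "SELECT o.id FROM ods.orders o JOIN ods.users u ON o.uid = u.id",
    "build_ads": "INSERT INTO ads.daily SELECT count(*) FROM dwd.orders",
}
deps = derive_dependencies(nodes)
```

   Here `build_dwd` ends up depending on `load_ods`, and `build_ads` on 
`build_dwd`, because each reads a table the other writes.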

## 2. Design Ideas

   - Parse the table lineage of the SQL when a workflow is saved, and 
automatically generate dependency configuration data (SQL-related nodes only)
   - The Master Server automatically resolves dependencies for each node, 
generates the dependent nodes, and executes all node tasks
   - The front-end node configuration page gains an "Automatically resolve 
dependencies" switch that controls whether dependency detection is enabled 
when the node executes
   - A dependency graph page is added to the front end so that node 
dependencies can be viewed after automatic resolution
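
One way to picture the second and third points together is to merge the 
manually drawn edges with the lineage-derived ones, honouring the per-node 
switch, and then order the nodes topologically. The sketch below is only an 
assumption about how that could look: `execution_order` and its parameters are 
hypothetical names, it models dependencies as plain edges rather than generated 
dependent nodes, and it relies on `graphlib` from the Python standard library 
(3.9+).

```python
from graphlib import TopologicalSorter


def execution_order(manual_deps, lineage_deps, auto_resolve):
    """Merge manual DAG edges with lineage-derived ones, then order the nodes.

    manual_deps / lineage_deps: node -> set of upstream nodes.
    auto_resolve: node -> bool, standing in for the per-node
    "Automatically resolve dependencies" switch (defaults to off).
    """
    merged = {}
    for node in set(manual_deps) | set(lineage_deps):
        upstream = set(manual_deps.get(node, ()))
        if auto_resolve.get(node, False):
            upstream |= lineage_deps.get(node, set())
        merged[node] = upstream
    # static_order() raises graphlib.CycleError if the merged graph has a cycle.
    return list(TopologicalSorter(merged).static_order())


manual = {"a": set(), "b": set(), "c": {"b"}}
lineage = {"b": {"a"}, "c": {"a"}}
switches = {"b": True, "c": False}  # lineage edges apply to b only
order = execution_order(manual, lineage, switches)
```

With the switch on for `b` and off for `c`, the merged graph is the chain 
`a`, `b`, `c`; the lineage edge from `a` to `c` is ignored.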

Limitations:

   - In the current design, the default rule for automatically generated 
dependent nodes only supports checking whether the node's task for the current 
day succeeded. The check is fixed to run every N minutes, M times in total; 
once that count is exceeded, the dependency is treated as failed.
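
The fixed rule above amounts to a bounded polling loop. A minimal sketch, 
assuming a caller-supplied `check_success` callable that reports whether the 
upstream task for the day has succeeded (the function name and parameters are 
hypothetical, not taken from DS):

```python
import time


def wait_for_upstream(check_success, interval_minutes, max_checks, sleep=time.sleep):
    """Poll check_success() up to max_checks times, interval_minutes apart.

    Returns True as soon as a check succeeds, and False once all checks
    have been used up (the caller then treats the dependency as failed).
    """
    for attempt in range(1, max_checks + 1):
        if check_success():
            return True
        if attempt < max_checks:  # no need to wait after the final check
            sleep(interval_minutes * 60)
    return False
```

The `sleep` parameter is injectable only so the loop can be exercised without 
actually waiting; with N = 5 and M = 4, a task that succeeds on the third check 
returns `True` after two 300-second waits.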

## 3. Sequence diagram

    Please refer to the attached diagrams.

## 4. Table Design

Add a node lineage table: t_ds_node_lineage

| Column Name | Description |
| --------------------- | ------------------------|
| id | Auto-incrementing ID |
| process_definition_id | Workflow definition ID |
| process_node_id | Workflow node ID |
| lineage_type | Lineage type (1 = input, 2 = output) |
| lineage_union_key | Unique lineage key |
| create_time | Creation time |
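
To make the table concrete, the sketch below creates it in an in-memory SQLite 
database and derives producer/consumer edges by joining output rows to input 
rows on `lineage_union_key`. The column types, the choice of SQLite, and the 
use of a `db.table` string as the key are assumptions for illustration; the 
proposal only lists column names and descriptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE t_ds_node_lineage (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        process_definition_id INTEGER NOT NULL,
        process_node_id TEXT NOT NULL,
        lineage_type INTEGER NOT NULL,      -- 1 = input, 2 = output
        lineage_union_key TEXT NOT NULL,    -- assumed here to be "db.table"
        create_time TEXT DEFAULT CURRENT_TIMESTAMP
    )
""")

# Made-up sample rows: node_a writes ods.orders, node_b reads it and
# writes dwd.orders, node_c (in another workflow) reads dwd.orders.
rows = [
    (1, "node_a", 2, "ods.orders"),
    (1, "node_b", 1, "ods.orders"),
    (1, "node_b", 2, "dwd.orders"),
    (2, "node_c", 1, "dwd.orders"),
]
conn.executemany(
    "INSERT INTO t_ds_node_lineage "
    "(process_definition_id, process_node_id, lineage_type, lineage_union_key) "
    "VALUES (?, ?, ?, ?)",
    rows,
)

# A consumer depends on every node that outputs a key it takes as input.
edges = conn.execute("""
    SELECT producer.process_node_id, consumer.process_node_id
    FROM t_ds_node_lineage AS consumer
    JOIN t_ds_node_lineage AS producer
      ON producer.lineage_union_key = consumer.lineage_union_key
    WHERE consumer.lineage_type = 1 AND producer.lineage_type = 2
    ORDER BY producer.process_node_id, consumer.process_node_id
""").fetchall()
```

The resulting `edges` list pairs each producing node with the nodes that 
consume its output, including across workflow definitions.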


--------------------
DolphinScheduler (Incubator) Committer
Hemin Wen  温合民
[email protected]
--------------------
