Re: [DISCUSS] Table lineage design

Hemin Wen Wed, 28 Oct 2020 19:18:58 -0700

This feature does not conflict with splitting the workflow JSON, because
the dependencies are maintained separately in a single table.

The feature targets SQL-related nodes, e.g. the SQL node, ETL node, and
Sqoop node. The t_ds_node_lineage table is designed for extension: it is not
specific to SQL, and SQL is only one source of dependencies.

https://github.com/apache/incubator-dolphinscheduler/issues/249
This issue shows that the demand is real and that many people need this
feature. Maintaining dependencies across large batches of nodes is currently
a pain point for DS, and SQL-related dependencies account for a
comparatively large share of them.

--------------------
DolphinScheduler(Incubator) Committer
Hemin Wen  温合民
[email protected]
--------------------


On Thu, Oct 29, 2020 at 9:14 AM, wu shaoj <[email protected]> wrote:

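> Not a good idea.
> It feels too complicated; let's wait until the big JSON has been split.
> Also, we currently only have the SQL node, which is not suitable for
> parsing dependencies; only SQLScript would be!
> Configuring dependencies is fairly troublesome, and creating them
> automatically is not a high priority at this stage.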
>
> From: Hemin Wen <[email protected]>
> Date: Wednesday, October 28, 2020 at 11:49
> To: dev <[email protected]>
> Subject: [DISCUSS] Table lineage design
> Hi!
>
> I would like to propose automatic dependency configuration based on table
> lineage. Everyone is welcome to discuss my ideas.
>
> ## 1. Demand background
>
>    Currently, DS can only define workflow/node dependencies by drawing the
> DAG, or by calling the API to create the workflow and its dependencies from
> the workflow data structure. A data warehouse is generally designed in
> layers: data production is a chained process, there are complex
> dependencies between layers, and there are many SQL scripts. Creating
> dependencies by hand makes large numbers of workflows hard to maintain, and
> dependency configuration errors are hard to troubleshoot.
>
>    The table lineage can be extracted by parsing the SQL statements in the
> SQL-related nodes, and dependencies can then be established automatically
> from that lineage. The Master Server executes the workflow according to the
> supplemented dependencies, ensuring that nodes run in dependency order.
>
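To make the extraction step above concrete, here is a minimal sketch in Python. It is not the proposed implementation: the function name `extract_lineage` and the regular expressions are assumptions made for illustration, and a real implementation would need a proper SQL parser to handle CTEs, subqueries, comments, and quoted identifiers.

```python
import re
from typing import Set, Tuple

# Hypothetical, deliberately naive patterns: they only recognise
# "INSERT INTO|OVERWRITE [TABLE] <name>" and "FROM|JOIN <name>".
_OUTPUT_RE = re.compile(
    r"\binsert\s+(?:into|overwrite)\s+(?:table\s+)?([\w.]+)", re.IGNORECASE
)
_INPUT_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def extract_lineage(sql: str) -> Tuple[Set[str], Set[str]]:
    """Return (input_tables, output_tables) found in one SQL statement."""
    inputs = {name.lower() for name in _INPUT_RE.findall(sql)}
    outputs = {name.lower() for name in _OUTPUT_RE.findall(sql)}
    return inputs, outputs


sql = (
    "INSERT OVERWRITE TABLE dw.orders_daily "
    "SELECT o.id, u.name FROM ods.orders o JOIN ods.users u ON o.uid = u.id"
)
inputs, outputs = extract_lineage(sql)
```

For this statement the sketch reports `ods.orders` and `ods.users` as inputs and `dw.orders_daily` as the output; those table names are made up for the example.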
> ## 2. Design Ideas
>
>    - Parse the table lineage of the SQL when a workflow is saved, and
> automatically generate the dependency configuration data (SQL-related
> nodes only)
>    - The Master Server automatically resolves dependencies for each node,
> generates the dependent nodes, and executes all node tasks
>    - The front-end node configuration page adds an "Automatically resolve
> dependencies" switch that controls whether dependency detection is enabled
> when the node executes
>    - A dependency graph page is added to the front end so that node
> dependencies can be viewed after automatic resolution
>
> Limitations:
>
>    - In the current design, the default rule for automatically generated
> dependent nodes only supports checking whether the node's task for the
> current day succeeded. The check uses a fixed configuration: it runs every
> N minutes, up to M times in total, and the dependency is treated as failed
> once that count is exceeded.
>
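The fixed "every N minutes, M times" rule described above could look roughly like the following Python sketch. The function name `wait_for_upstream` and its parameters are hypothetical, and the status check is passed in as a callable because the proposal does not say how the task status is queried.

```python
import time
from typing import Callable


def wait_for_upstream(
    is_success: Callable[[], bool],
    interval_minutes: float,
    max_checks: int,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Check the upstream status up to max_checks times, interval_minutes apart.

    Returns True as soon as a check succeeds, and False once all checks
    are used up (the dependency is then treated as failed).
    """
    for attempt in range(1, max_checks + 1):
        if is_success():
            return True
        if attempt < max_checks:  # no need to wait after the final check
            sleep(interval_minutes * 60)
    return False
```

Injecting `sleep` keeps the sketch testable without real waiting; a scheduler would more likely re-queue the check than block a thread.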
> ## 3. Sequence diagram
>
>     Please refer to the diagrams below
>
> ## 4. Table Design
>
> Add a node lineage table: t_ds_node_lineage
>
> | Column Name           | Description                          |
> | --------------------- | ------------------------------------ |
> | id                    | Auto-increment ID                    |
> | process_definition_id | Workflow definition ID               |
> | process_node_id       | Workflow node ID                     |
> | lineage_type          | Lineage type (1 = input, 2 = output) |
> | lineage_union_key     | Unique lineage key                   |
> | create_time           | Creation time                        |
>
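As an illustration of how rows in t_ds_node_lineage could be turned into dependencies, here is a small Python sketch. It rests on assumptions the proposal does not spell out: that `lineage_union_key` holds something like a qualified table name, that a node depends on every other node that outputs one of its inputs, and that `process_node_id` is a string. The class `NodeLineage` and the function `derive_dependencies` are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass

INPUT, OUTPUT = 1, 2  # lineage_type values from the table design


@dataclass(frozen=True)
class NodeLineage:
    process_definition_id: int
    process_node_id: str    # assumed type; the proposal does not specify it
    lineage_type: int       # INPUT or OUTPUT
    lineage_union_key: str  # assumed to be a qualified table name


def derive_dependencies(rows):
    """Map each consumer (definition, node) to the producers of its inputs."""
    producers = defaultdict(set)
    for row in rows:
        if row.lineage_type == OUTPUT:
            node = (row.process_definition_id, row.process_node_id)
            producers[row.lineage_union_key].add(node)

    deps = defaultdict(set)
    for row in rows:
        if row.lineage_type == INPUT:
            consumer = (row.process_definition_id, row.process_node_id)
            # A node that reads its own output does not depend on itself.
            upstream = producers.get(row.lineage_union_key, set()) - {consumer}
            if upstream:
                deps[consumer] |= upstream
    return dict(deps)


rows = [
    NodeLineage(1, "load_orders", OUTPUT, "ods.orders"),
    NodeLineage(2, "build_daily", INPUT, "ods.orders"),
    NodeLineage(2, "build_daily", OUTPUT, "dw.orders_daily"),
]
deps = derive_dependencies(rows)
```

Because matching is done on the key alone, the same lookup would work for dependencies that do not come from SQL, as long as the producer and consumer write the same key.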
>
> [Inline diagram images for section 3: cid:ii_kgsus5mg0, cid:ii_kgsusdlj1]
>
> --------------------
> DolphinScheduler(Incubator) Committer
> Hemin Wen  温合民
> [email protected]
> --------------------
>
