Gallardot commented on PR #15270: URL: https://github.com/apache/dolphinscheduler/pull/15270#issuecomment-1888523291
### Analysis of a bug in the serial wait strategy that leaves a workflow instance waiting indefinitely

When a workflow's scheduling strategy is `SERIAL_WAIT`, a workflow instance whose status is WAITING can remain in that state forever, even after the previous workflow instance has finished executing. **The problem occurs only with some probability.**

The cause is as follows.

The `MasterSchedulerBootstrap` thread processes commands through the `handleCommand` method, which runs inside a transaction. Within that transaction, the `saveSerialProcess` method modifies the status of the workflow instance. At the same time, in a separate thread pool, `WorkflowExecuteRunnable` uses the `checkSerialProcess` method to check workflow instance statuses and wake up an instance that is waiting.

Normally this works. The failure requires a **specific situation**: one workflow instance is about to complete while another is being created. Because of transaction isolation, `saveSerialProcess` in `handleCommand` may already have executed but **not yet been committed**. At that moment `checkSerialProcess` cannot see that the new instance's status is WAITING, so it does not wake it up, and the new instance stays in the waiting state and can no longer be awakened.

My solution is to update the workflow instance's status in a new transaction of its own, rather than as part of the `handleCommand` transaction. This avoids the problem above.
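The interleaving described above can be replayed with a small in-memory model. This is only a sketch under stated assumptions: `SerialWaitRace`, `FakeDb`, and `replay` are hypothetical names that do not exist in DolphinScheduler, the "database" is two maps that imitate read-committed visibility, and `RUNNING` is just an illustrative status for an instance that has been woken up.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the race; not DolphinScheduler code.
class SerialWaitRace {

    /** Stand-in for a database where readers only see committed rows. */
    static final class FakeDb {
        private final Map<Integer, String> committed = new HashMap<>();
        private final Map<Integer, String> pending = new HashMap<>();

        // Write inside the open (outer) transaction; invisible to other readers.
        void write(int id, String status) {
            pending.put(id, status);
        }

        // Commit everything written so far in the open transaction.
        void commit() {
            committed.putAll(pending);
            pending.clear();
        }

        // What another thread sees: committed data only.
        String readCommitted(int id) {
            return committed.get(id);
        }

        // A write from another thread that commits immediately.
        void writeAndCommit(int id, String status) {
            committed.put(id, status);
        }
    }

    /**
     * Replays the problematic interleaving and returns the final status
     * of the newly created workflow instance.
     */
    static String replay(boolean saveInNewTransaction) {
        FakeDb db = new FakeDb();
        int newInstance = 2;

        // Step 1 (handleCommand): save the new instance as WAITING.
        db.write(newInstance, "WAITING");
        if (saveInNewTransaction) {
            db.commit(); // the status update commits on its own, right away
        }

        // Step 2 (checkSerialProcess, other thread): the previous instance
        // has finished, so look for a waiting instance to wake up.
        if ("WAITING".equals(db.readCommitted(newInstance))) {
            db.writeAndCommit(newInstance, "RUNNING");
        }

        // Step 3 (handleCommand): the outer transaction finally commits.
        db.commit();
        return db.readCommitted(newInstance);
    }

    public static void main(String[] args) {
        System.out.println("save inside outer transaction: " + replay(false));
        System.out.println("save in its own transaction:   " + replay(true));
    }
}
```

With the save inside the outer transaction, the check in step 2 sees nothing and the instance ends up `WAITING`; with the save committed on its own, the check sees it and the instance ends up `RUNNING`. In a Spring-managed service, committing the update separately would typically be expressed with something like `Propagation.REQUIRES_NEW`, but that is an assumption about how the fix is wired, not something stated in this comment.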
I have been running this in my environment for two months, and the problem has not reoccurred.

https://github.com/apache/dolphinscheduler/blob/0f7081be10b657184d2eef316c8a2cafcf2ce343/dolphinscheduler-service/src/main/java/org/apache/dolphinscheduler/service/process/ProcessServiceImpl.java#L291-L316

https://github.com/apache/dolphinscheduler/blob/bd48c991783b2e0ea0c602f6ef6c9a09c92e7b42/dolphinscheduler-service/src/main/java/org/apache/dolphinscheduler/service/process/ProcessServiceImpl.java#L326-L342

https://github.com/apache/dolphinscheduler/blob/bd48c991783b2e0ea0c602f6ef6c9a09c92e7b42/dolphinscheduler-master/src/main/java/org/apache/dolphinscheduler/server/master/runner/WorkflowExecuteRunnable.java#L790-L832

@ruanwenjun @Radeity @EricGao888 @SbloodyS @fuchanghai @qingwli @caishunfeng PTAL.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
