aglinxinyuan commented on code in PR #5781:
URL: https://github.com/apache/texera/pull/5781#discussion_r3449140851
##########
amber/src/main/scala/org/apache/texera/web/service/WorkflowService.scala:
##########
@@ -292,11 +292,23 @@ class WorkflowService(
executionService.onNext(execution)
execution.executeWorkflow()
} catch {
- case e: Throwable => errorHandler(e)
+ case e: Throwable =>
+ errorHandler(e)
+ reportFatalErrorsToSubscribers(executionStateStore)
Review Comment:
Good question — let me split it into the two parts.
**Why the existing state-store path can't cover this:** it *does* work for a
running execution, but not for the case this PR targets — an exception thrown
inside `new WorkflowExecutionService(...)`. The `fatalErrors →
WorkflowErrorEvent` diff handler is registered *inside* that constructor
(`WorkflowExecutionService.scala:72-86`), and the only thing that forwards an
execution's store to the socket is `connectToExecution`, which subscribes after
`executionService.onNext(execution)` (`WorkflowService.scala:292`). If the
constructor throws we jump straight to the `catch` and never reach line 292 —
so `errorHandler` writes FAILED + fatalErrors into a metadata store that has
*neither an emitter nor a subscriber*, and it's silently dropped.
(`connect()`'s own state-store subscription is over `WorkflowStateStore`, which
only holds the result store, so it can't carry a `WorkflowErrorEvent` either.)
**On refresh:** you're right it isn't durable across a cold refresh — but
that's already true of the state-store path, not something this change
introduces. `fatalErrors` are never persisted (`updateWorkflowState` writes
only a status byte), and `getWebsocketEventObservable` doesn't replay past
diffs (it's `buffer(2,1)` over the live state subject), so a cold refresh loses
the error on *either* path today. `errorSubject` is a `BehaviorSubject`, so a
*within-instance* refresh does replay the last error via `connect()`; a refresh
after the cleanup timeout (which evicts the instance) loses it — same as the
state store would.
**What I changed in response:** I agree it should go through the state store
wherever it can, so I gated the `errorSubject` push to the **pre-publish window
only**. Once the execution is published, the state-store diff handler already
surfaces the error — pushing here too was actually double-emitting on the
`executeWorkflow()` failure path. So `errorSubject` is now just the narrow
fallback for the constructor-failure window, where the state store genuinely
can't reach a subscriber yet. (pushed in `547e558`)
For real cross-refresh durability we'd need to persist `fatalErrors` to
`workflow_executions` and rehydrate in `getOrCreate` — happy to file that as a
follow-up, since it helps both paths and is bigger than this PR. WDYT?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]