lostluck commented on PR #35263: URL: https://github.com/apache/beam/pull/35263#issuecomment-2977201890
OK, so that path, should trigger the Element manager to fail when stage.Execute returns here: https://github.com/apache/beam/blob/2a499b6b24180adb022d810cd9bccf2f271fe514/sdks/go/pkg/beam/runners/prism/internal/execute.go#L366 Which should cause the bundle channel to ultimately close, and then wait on the error group to receive the failure here: https://github.com/apache/beam/blob/2a499b6b24180adb022d810cd9bccf2f271fe514/sdks/go/pkg/beam/runners/prism/internal/execute.go#L359 But that's not what fail bundle does..., it clears out the pending execution state from the manager so that the job can fail naturally. (This is intentional FYI, since the ElementManager is about tracking and dispatching pending work, and in principal we might want to add retries in the future.) So I think there's probably race condition here, which sometimes swallows the failures, as much as I don't see how that's happening. I'd put an [atomic value](https://pkg.go.dev/sync/atomic#Value) before the execute value, that we also set in that bundle failure case, but only for the first one received. Then if and only if the error cause from the context, or the error from the errgroup is nil, do we check that value and return any non-nil error. ----- The context Cancel path is likely only happening in the event the SDK side environment is wholesale crashing. Which should produce an error. It's possible, that the "race" here is: 1. SDK side bundle fails 2. SDK environment crashes (AKA the loopback FnAPI connection), causing the Context to be canceled Prism side. 3. The Bundle Execute loop responds to the canceled environment -> Job failure messages doesn't have the failed bundle. While the intended/desired path is 1. SDK side bundle fails. 2. Prism registers the failure reason. 3. Prism returns that error to the SDK pipeline host. I don't think it's contentious to assert we should prioritize the first SDK side bundle failure error vs any environmental / context cancel bundle errors. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@beam.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org