lostluck commented on PR #35263:
URL: https://github.com/apache/beam/pull/35263#issuecomment-2977201890

   OK, so that path, should trigger the Element manager to fail when 
stage.Execute returns here:
   
   
https://github.com/apache/beam/blob/2a499b6b24180adb022d810cd9bccf2f271fe514/sdks/go/pkg/beam/runners/prism/internal/execute.go#L366
   
   Which should cause the bundle channel to ultimately close, and then wait on 
the error group to receive the failure here:
   
   
https://github.com/apache/beam/blob/2a499b6b24180adb022d810cd9bccf2f271fe514/sdks/go/pkg/beam/runners/prism/internal/execute.go#L359
   
   But that's not what fail bundle does..., it clears out the pending execution 
state from the manager so that the job can fail naturally. (This is intentional 
FYI, since the ElementManager is about tracking and dispatching pending work, 
and in principal we might want to add retries in the future.)
   
   So I think there's probably race condition here, which sometimes swallows 
the failures, as much as I don't see how that's happening.
   
   I'd put an [atomic value](https://pkg.go.dev/sync/atomic#Value) before the 
execute value, that we also set in that bundle failure case, but only for the 
first one received. Then if and only if the error cause from the context, or 
the error from the errgroup is nil, do we check that value and return any 
non-nil error.
   
   -----
   
   The context Cancel path is likely only happening in the event the SDK side 
environment is wholesale crashing. Which should produce an error. It's 
possible, that the "race" here is:
   
    1. SDK side bundle fails
    2. SDK environment crashes (AKA the loopback FnAPI connection), causing the 
Context to be canceled Prism side.
    3. The Bundle Execute loop responds to the canceled environment -> Job 
failure messages doesn't have the failed bundle.
    
    While the intended/desired path is
    
    1. SDK side bundle fails.
    2. Prism registers the failure reason.
    3. Prism returns that error to the SDK pipeline host.
    
   I don't think it's contentious to assert we should prioritize the first SDK 
side bundle failure error vs any environmental / context cancel bundle errors.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@beam.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Reply via email to