Daren Wong created FLINK-28411:
----------------------------------

             Summary: OperatorCoordinator exception may fail Session Cluster
                 Key: FLINK-28411
                 URL: https://issues.apache.org/jira/browse/FLINK-28411
             Project: Flink
          Issue Type: Bug
          Components: Connectors / Common
            Reporter: Daren Wong
             Fix For: 1.15.2


Part of the Scheduler's startScheduling procedure involves starting every 
OperatorCoordinatorHolder. When one of the OperatorCoordinators fails to 
start, the exception is forwarded up the stack, triggering a JobMaster failover. 
However, JobMaster failover only works if HA is enabled[1]. If HA is not 
enabled, the fatal error handler simply exits the JM process, killing the 
entire cluster. This is problematic for a session cluster, where multiple 
jobs may be running. It also does not play well with external tooling that 
does not expect a job failure to cause a full cluster failure. 

It would be preferable if failure to start an OperatorCoordinator did not take 
down the entire cluster, but instead failed that particular job. 
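
The preferred behavior can be illustrated with a minimal sketch. It does not 
use Flink's actual classes: Coordinator, JobStatus and JobStartSketch are 
hypothetical stand-ins introduced here, and the real scheduler code is more 
involved. The point is only that a start failure is caught per job and turned 
into a job-level failure instead of being rethrown to a process-wide fatal 
error handler.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins for illustration only; NOT Flink's real interfaces.
interface Coordinator {
    void start() throws Exception;
}

enum JobStatus { RUNNING, FAILED }

final class JobStartSketch {
    /**
     * Starts every coordinator of a single job. If one of them fails to start,
     * the failure is recorded and only this job is reported as FAILED; nothing
     * is rethrown, so other jobs in the same process are not affected.
     */
    static JobStatus startCoordinators(List<Coordinator> coordinators,
                                       List<Exception> failures) {
        for (Coordinator coordinator : coordinators) {
            try {
                coordinator.start();
            } catch (Exception e) {
                // Assumption: failing the job is signalled by the return value.
                failures.add(e);
                return JobStatus.FAILED;
            }
        }
        return JobStatus.RUNNING;
    }
}
```

In this sketch a session cluster running two jobs would call 
startCoordinators once per job, so a coordinator that throws in one job leaves 
the other job's status untouched.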

This issue is similar to https://issues.apache.org/jira/browse/FLINK-24303, 
which fixed this problem for the SourceCoordinator specifically.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)