[jira] [Created] (FLINK-31509) REST Service missing sessionAffinity causes job run failure with HA cluster

Emmanuel Leroy (Jira) Fri, 17 Mar 2023 15:37:06 -0700

Emmanuel Leroy created FLINK-31509:
--------------------------------------

             Summary: REST Service missing sessionAffinity causes job run 
failure with HA cluster
                 Key: FLINK-31509
                 URL: https://issues.apache.org/jira/browse/FLINK-31509
             Project: Flink
          Issue Type: Bug
          Components: Kubernetes Operator
         Environment: Flink 1.15 on Flink Operator 1.4.0 on Kubernetes 1.25.4, 
(optionally with Beam 2.46.0)


but the issue was observed on Flink 1.14, 1.15 and 1.16 and on Flink Operator 
1.2, 1.3, 1.3.1, 1.4.0

 
            Reporter: Emmanuel Leroy


When a Session Cluster runs multiple JobManagers, the -rest service load-balances 
API requests across all JobManagers, not just the leader (master).

When submitting a FlinkSessionJob, I often see errors like: `jar <jar_id>.jar 
was not found`, because the submission is done in 2 steps: 
 * upload the jar with `v1/jars/upload` which returns the `jar_id`
 * run the job with `v1/jars/<jar_id>/run`

Unfortunately, because the Service load-balances between nodes, the jar is 
often uploaded to one JM while the run request lands on another, where the jar 
doesn't exist.
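
The failure mode can be reproduced with a toy model. This is only a sketch: the 
`JobManager` class and `submit_round_robin` function are hypothetical stand-ins, 
not Flink code, and strict round-robin is a simplification of how a Service 
actually picks a backend. It assumes each JobManager keeps uploaded jars in its 
own local store.

```python
import itertools
import uuid


class JobManager:
    """Toy stand-in for one JobManager REST endpoint with a local jar store."""

    def __init__(self, name):
        self.name = name
        self.jars = set()

    def upload(self):
        # Mimics `v1/jars/upload`: store the jar locally and return its id.
        jar_id = f"{uuid.uuid4()}_job.jar"
        self.jars.add(jar_id)
        return jar_id

    def run(self, jar_id):
        # Mimics `v1/jars/<jar_id>/run`: only works if this node has the jar.
        if jar_id not in self.jars:
            return f"jar {jar_id} was not found"
        return "ok"


def submit_round_robin(job_managers):
    """Upload, then run, sending each request to the next backend in turn."""
    backends = itertools.cycle(job_managers)
    jar_id = next(backends).upload()
    return next(backends).run(jar_id)


result = submit_round_robin([JobManager("jm-0"), JobManager("jm-1")])
print(result)
```

With two JobManagers the upload reaches `jm-0` and the run reaches `jm-1`, so 
the run step reports the jar as missing.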

A simple fix is to set `sessionAffinity: ClientIP` on the -rest service, so 
that API calls from a given originating IP are always routed to the same 
node.
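
As a manual workaround, the field can be patched onto an existing Service. The 
sketch below only builds and prints the `kubectl patch` command; the Service 
name `my-session-cluster-rest` is a hypothetical placeholder, and whether the 
setting survives if the Service is later recreated is not covered in this report.

```python
import json

# Hypothetical Service name; substitute the actual -rest service of the cluster.
SERVICE_NAME = "my-session-cluster-rest"

# Merge patch that sets the Service's spec.sessionAffinity field to ClientIP.
patch = {"spec": {"sessionAffinity": "ClientIP"}}
patch_json = json.dumps(patch)

# Assemble the kubectl invocation; it is printed here, not executed.
command = [
    "kubectl", "patch", "service", SERVICE_NAME,
    "--type", "merge",
    "-p", patch_json,
]
print(" ".join(command[:-1]), f"'{patch_json}'")
```

Run the printed command in the namespace where the session cluster lives.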

This issue is especially problematic with Beam: the Beam job submission does 
not retry the run request with the same jar_id. It fails instead, then 
re-uploads a new jar and tries again, until it is lucky enough to have the two 
calls in a row routed to the same node.

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
