[ 
https://issues.apache.org/jira/browse/FLINK-31509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17702659#comment-17702659
 ] 

Biao Geng commented on FLINK-31509:
-----------------------------------

I believe that routing API requests to all JMs, instead of only to the 
master, is a bug. 
Besides, it should be fine to use K8s HA with a single JM: when that JM 
crashes, a new one will be created from the HA data stored in the ConfigMap. 
Multiple JMs may reduce the recovery time, but they are not strictly necessary.

> REST Service missing sessionAffinity causes job run failure with HA cluster
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-31509
>                 URL: https://issues.apache.org/jira/browse/FLINK-31509
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>         Environment: Flink 1.15 on Flink Operator 1.4.0 on Kubernetes 1.25.4, 
> (optionally with Beam 2.46.0)
> but the issue was observed on Flink 1.14, 1.15 and 1.16 and on Flink Operator 
> 1.2, 1.3, 1.3.1, 1.4.0
>  
>            Reporter: Emmanuel Leroy
>            Priority: Major
>
> When using a Session Cluster with multiple Job Managers, the -rest service 
> load balances the API requests to all job managers, not just the master.
> When submitting a FlinkSessionJob, I often see errors like: `jar <jar_id>.jar 
> was not found`, because the submission is done in 2 steps: 
>  * upload the jar with `v1/jars/upload` which returns the `jar_id`
>  * run the job with `v1/jars/<jar_id>/run`
> Unfortunately, with the Service load balancing between nodes, it is often 
> the case that the jar is uploaded to one JM and the run request lands on 
> another, where the jar doesn't exist.
> A simple fix is to set `sessionAffinity: ClientIP` on the -rest service, so 
> that the API calls from a given originating IP are always routed to the 
> same node.
> This issue is especially problematic with Beam, where the Beam job submission 
> does not retry the run request with the same jar_id. Instead it fails, 
> re-uploads a new jar and retries, until it is lucky enough to get the 2 calls 
> in a row routed to the same node.
>  
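
The failure mode in the quoted report can be illustrated with a toy 
simulation. This is only a sketch under stated assumptions, not Flink or 
Kubernetes code: `JobManager`, `RoundRobinService`, `ClientIPService` and 
`submit` are hypothetical names, each JM is assumed to keep uploaded jars in 
its own local storage, and the Service without affinity is modelled as strict 
round-robin so that the failure is deterministic (a real Service may pick 
backends differently).

```python
import hashlib
import itertools
import uuid


class JobManager:
    """Toy stand-in for one JM replica with its own local jar storage."""

    def __init__(self, name):
        self.name = name
        self.jars = set()

    def upload(self):
        # Stands in for `v1/jars/upload`: store the jar locally, return its id.
        jar_id = f"{uuid.uuid4()}.jar"
        self.jars.add(jar_id)
        return jar_id

    def run(self, jar_id):
        # Stands in for `v1/jars/<jar_id>/run`: only works where the jar lives.
        if jar_id not in self.jars:
            raise FileNotFoundError(f"jar {jar_id} was not found on {self.name}")
        return f"job started on {self.name}"


class RoundRobinService:
    """Assumed model of a Service without session affinity."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(backends)

    def route(self, client_ip):
        # The client IP is ignored: consecutive requests hit different JMs.
        return next(self._cycle)


class ClientIPService:
    """Assumed model of `sessionAffinity: ClientIP`: same IP, same backend."""

    def __init__(self, backends):
        self._backends = list(backends)

    def route(self, client_ip):
        digest = hashlib.sha256(client_ip.encode()).digest()
        return self._backends[digest[0] % len(self._backends)]


def submit(service, client_ip):
    """Two-step submission: upload the jar, then run it by jar_id."""
    jar_id = service.route(client_ip).upload()
    return service.route(client_ip).run(jar_id)


if __name__ == "__main__":
    jms = [JobManager("jm-0"), JobManager("jm-1")]
    try:
        submit(RoundRobinService(jms), "10.0.0.7")
    except FileNotFoundError as err:
        print("without affinity:", err)
    print("with ClientIP affinity:", submit(ClientIPService(jms), "10.0.0.7"))
```

With two JMs and round-robin routing, the upload and the run request always 
land on different replicas, so `submit` raises. With the affinity model both 
requests from the same IP reach the same replica and the run succeeds.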



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
