[jira] [Created] (FLINK-27612) Generate warning events when deleting the session cluster
Aitozi created FLINK-27612: -- Summary: Generate warning events when deleting the session cluster Key: FLINK-27612 URL: https://issues.apache.org/jira/browse/FLINK-27612 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Aitozi Fix For: kubernetes-operator-1.0.0 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Comment Edited] (FLINK-27576) Flink requests a new pod when the JM pod is deleted, but removes it when the TaskExecutor exceeds the idle timeout
[ https://issues.apache.org/jira/browse/FLINK-27576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534934#comment-17534934 ] Aitozi edited comment on FLINK-27576 at 5/11/22 2:37 PM: Hi [~zhisheng], there is a ticket tracking this: https://issues.apache.org/jira/browse/FLINK-24713. I will open a PR for it soon.
> Flink requests a new pod when the JM pod is deleted, but removes it when the TaskExecutor exceeds the idle timeout
>
> Key: FLINK-27576
> URL: https://issues.apache.org/jira/browse/FLINK-27576
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.12.0
> Reporter: zhisheng
> Priority: Major
> Attachments: image-2022-05-11-20-06-58-955.png, image-2022-05-11-20-08-01-739.png, jobmanager_log.txt
>
> With Flink 1.12.0, HA (ZooKeeper) and checkpointing enabled, when I use kubectl to delete the JM pod, the job requests a new JM pod and fails over from the last checkpoint; that part is fine. But it also requests a new TM pod that is never actually used: the new TM pod is closed once the TaskExecutor exceeds the idle timeout, and the job actually reuses the old TM. Why does it need to request a new TM pod? Would the job fail if the cluster has no resources for the new TM? Can we optimize this and reuse the old TM directly?
>
> [^jobmanager_log.txt]
> ^!image-2022-05-11-20-06-58-955.png!^
> ^!image-2022-05-11-20-08-01-739.png!^
[jira] [Commented] (FLINK-27576) Flink requests a new pod when the JM pod is deleted, but removes it when the TaskExecutor exceeds the idle timeout
[ https://issues.apache.org/jira/browse/FLINK-27576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534934#comment-17534934 ] Aitozi commented on FLINK-27576: Hi [~zhisheng], there is a ticket tracking this: [https://issues.apache.org/jira/projects/FLINK/issues/FLINK-24713]. I will open a PR for it soon.
[jira] [Commented] (FLINK-27483) Support adding custom HTTP header for HTTP based Jar fetch
[ https://issues.apache.org/jira/browse/FLINK-27483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534696#comment-17534696 ] Aitozi commented on FLINK-27483: Yes, I had not noticed just now that you already have an initial PR; you can continue your work [~Fuyao Li] (y)
> Support adding custom HTTP header for HTTP based Jar fetch
>
> Key: FLINK-27483
> URL: https://issues.apache.org/jira/browse/FLINK-27483
> Project: Flink
> Issue Type: New Feature
> Components: Kubernetes Operator
> Reporter: Fuyao Li
> Priority: Major
> Labels: pull-request-available
> Fix For: kubernetes-operator-1.0.0
>
> Hello Team,
> I noticed https://issues.apache.org/jira/browse/FLINK-27161, which could enable users to specify a URL to fetch jars.
> In many cases, users might want to add some custom headers to fetch jars from a remote private artifactory (like OAuth tokens, API keys, etc.).
> Adding a field called artifactoryJarHeader would be very helpful. This field could be of array type: [key1, value1, key2, value2].
> It seems the current ArtifactManager doesn't support specifying a custom header. Please correct me if I am wrong.
[jira] [Commented] (FLINK-27483) Support adding custom HTTP header for HTTP based Jar fetch
[ https://issues.apache.org/jira/browse/FLINK-27483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17534645#comment-17534645 ] Aitozi commented on FLINK-27483: +1 for this solution, I will open a PR for it this week.
[jira] [Closed] (FLINK-26915) Extend the Reconciler and Observer interface
[ https://issues.apache.org/jira/browse/FLINK-26915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aitozi closed FLINK-26915. -- Resolution: Won't Fix
> Extend the Reconciler and Observer interface
>
> Key: FLINK-26915
> URL: https://issues.apache.org/jira/browse/FLINK-26915
> Project: Flink
> Issue Type: Sub-task
> Components: Kubernetes Operator
> Reporter: Aitozi
> Assignee: Aitozi
> Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
> As discussed in [comments|https://github.com/apache/flink-kubernetes-operator/pull/112#discussion_r835762111], I proposed two changes to the Reconciler and Observer:
> # Directly return the UpdateControl from the reconciler, so that the reconciler is in charge of the update behavior. This way, we don't have to infer the update control in the controller.
> # Make the params generic, extending ReconcilerContext and ObserverContext, which makes it easy for different controllers to ship their own objects to the reconciler and observer. For example, in the FlinkSessionJob case, we need to get the effective config from the FlinkDeployment first and also pass the FlinkDeployment to the reconciler.
> After the change, the reconciler will look like this:
> {code:java}
> public interface Reconciler<CR, CTX extends ReconcilerContext> {
>     UpdateControl<CR> reconcile(CR cr, CTX context) throws Exception;
>     DeleteControl cleanup(CR cr, CTX ctx);
> }{code}
[jira] [Commented] (FLINK-26915) Extend the Reconciler and Observer interface
[ https://issues.apache.org/jira/browse/FLINK-26915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533425#comment-17533425 ] Aitozi commented on FLINK-26915: I have the same sense that we should not do this now. The main reason is that we have registered the event source for the {{FlinkDeployment}}, so the related CR can be obtained easily and does not have to be passed through the reconcile interface. Closing it now.
[jira] [Commented] (FLINK-27483) Support adding custom HTTP header for HTTP based Jar fetch
[ https://issues.apache.org/jira/browse/FLINK-27483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533420#comment-17533420 ] Aitozi commented on FLINK-27483: [~wangyang0918] I checked the code again and found I mixed them up; sorry for misleading [~Fuyao Li] :(. All the deployments share the same (dynamic) {{operatorConfiguration}}. Coming back to how to solve this problem: I propose to configure the HTTP headers in the {{ConfigManager#defaultConfig}} and use that in the {{HttpArtifactFetcher}}. Is this OK? [~Fuyao Li] [~wangyang0918]
[jira] [Commented] (FLINK-27497) Track terminal job states in the observer
[ https://issues.apache.org/jira/browse/FLINK-27497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533416#comment-17533416 ] Aitozi commented on FLINK-27497: I think this will be a very useful improvement; I would like to give it a try.
> Track terminal job states in the observer
>
> Key: FLINK-27497
> URL: https://issues.apache.org/jira/browse/FLINK-27497
> Project: Flink
> Issue Type: New Feature
> Components: Kubernetes Operator
> Affects Versions: kubernetes-operator-1.0.0
> Reporter: Gyula Fora
> Priority: Critical
>
> With the improvements in FLINK-27468, Flink 1.15 app clusters will not be shut down in case of terminal job states (failed, finished, etc.).
> It is important to properly handle these states and let the user know about them.
> We should always trigger events, and for terminally failed jobs record the error information in the FlinkDeployment status.
[jira] [Comment Edited] (FLINK-27497) Track terminal job states in the observer
[ https://issues.apache.org/jira/browse/FLINK-27497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533416#comment-17533416 ] Aitozi edited comment on FLINK-27497 at 5/8/22 6:39 AM: I think it will be a very good improvement; I would like to give it a try.
[jira] [Commented] (FLINK-24633) JobManager pod may stuck in containerCreating status during failover
[ https://issues.apache.org/jira/browse/FLINK-24633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533411#comment-17533411 ] Aitozi commented on FLINK-24633: I figured out that it's caused by our internal k8s cluster behavior. Closing it as Not a Problem.
> JobManager pod may stuck in containerCreating status during failover
>
> Key: FLINK-24633
> URL: https://issues.apache.org/jira/browse/FLINK-24633
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.14.0
> Reporter: Aitozi
> Priority: Major
[jira] [Closed] (FLINK-24633) JobManager pod may stuck in containerCreating status during failover
[ https://issues.apache.org/jira/browse/FLINK-24633?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aitozi closed FLINK-24633. -- Resolution: Not A Problem
[jira] [Comment Edited] (FLINK-17232) Rethink the implicit behavior to use the Service externalIP as the address of the Endpoint
[ https://issues.apache.org/jira/browse/FLINK-17232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17495844#comment-17495844 ] Aitozi edited comment on FLINK-17232 at 5/8/22 4:00 AM: [~wangyang0918] Could you help review the [PR|https://github.com/apache/flink/pull/18762]? I want to move forward to finish the other part of this work, which relies on the current PR.
> Rethink the implicit behavior to use the Service externalIP as the address of the Endpoint
>
> Key: FLINK-17232
> URL: https://issues.apache.org/jira/browse/FLINK-17232
> Project: Flink
> Issue Type: Sub-task
> Components: Deployment / Kubernetes
> Affects Versions: 1.10.0, 1.10.1
> Reporter: Canbin Zheng
> Assignee: Aitozi
> Priority: Major
> Labels: auto-unassigned, stale-assigned
>
> Currently, for a LB/NodePort type Service, if we find that the {{LoadBalancer}} in the {{Service}} is null, we use the externalIPs configured in the external Service as the address of the Endpoint. Again, this is another implicit toleration and may confuse users.
> This ticket proposes to rethink the implicit toleration behaviour.
[jira] [Commented] (FLINK-27483) Support adding custom HTTP header for HTTP based Jar fetch
[ https://issues.apache.org/jira/browse/FLINK-27483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17532585#comment-17532585 ] Aitozi commented on FLINK-27483: The difference is whether we need to support the authentication configuration at the per-job level.
[jira] [Commented] (FLINK-27483) Support adding custom HTTP header for HTTP based Jar fetch
[ https://issues.apache.org/jira/browse/FLINK-27483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17532234#comment-17532234 ] Aitozi commented on FLINK-27483: I think the reconciliation interval and timeout can be set per FlinkDeployment now; there is a discussion here: [https://lists.apache.org/thread/pnf2gk9dgqv3qrtszqbfcdxf32t2gr3x] and https://issues.apache.org/jira/browse/FLINK-27023. So I think the flinkConfiguration in the FlinkDeployment can be used to control the reconcile options. The filesystem's configuration is initialized at the entrypoint of the operator. To keep it simple, I think we could do the same thing for the HttpArtifactFetcher by directly using the config from the configManager and extracting the HTTP headers from it. What do you think [~wangyang0918]?
[jira] [Commented] (FLINK-27483) Support adding custom HTTP header for HTTP based Jar fetch
[ https://issues.apache.org/jira/browse/FLINK-27483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531528#comment-17531528 ] Aitozi commented on FLINK-27483: The config types can be listed as below:
* The session cluster/application's config is decided by the FlinkDeployment's flinkConfiguration merged with the default config.
* The operator's reconcile configs (like the reconcile interval, client timeout, and cancel-job timeout) are built from the operator config and the related FlinkDeployment config.
I think for the session job we can build the reconcile configs from the session job's {{configuration + flink deployment config + default config}}.
If the config field is added, I think we could add an option like {{kubernetes.operator.user.artifacts.http.header: k1:v1,k2:v2}}, and the HttpArtifactFetcher would decorate the HTTP request based on the header config.
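A minimal sketch of how the option value proposed above ({{k1:v1,k2:v2}}) could be parsed into headers for the HTTP jar fetch. The option name comes from the comment; the class and parsing logic here are assumptions, not the operator's actual implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical parser for a "kubernetes.operator.user.artifacts.http.header"
 * style option whose value looks like "k1:v1,k2:v2".
 */
public class HeaderOptionParser {

    /** Parses "k1:v1,k2:v2" into an ordered header map; malformed entries are skipped. */
    public static Map<String, String> parseHeaders(String value) {
        Map<String, String> headers = new LinkedHashMap<>();
        if (value == null || value.isEmpty()) {
            return headers;
        }
        for (String entry : value.split(",")) {
            // Split on the first ':' only, so header values may themselves contain ':'.
            int sep = entry.indexOf(':');
            if (sep > 0) {
                headers.put(entry.substring(0, sep).trim(), entry.substring(sep + 1).trim());
            }
        }
        return headers;
    }

    public static void main(String[] args) {
        Map<String, String> headers =
                parseHeaders("Authorization:Bearer abc123,X-Api-Key:secret");
        System.out.println(headers.get("Authorization")); // Bearer abc123
        System.out.println(headers.size()); // 2
    }
}
```

Splitting only on the first colon is a deliberate choice here, since realistic header values (e.g. {{Authorization:Bearer ...}}) can contain colons themselves.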
[jira] [Comment Edited] (FLINK-27483) Support adding custom HTTP header for HTTP based Jar fetch
[ https://issues.apache.org/jira/browse/FLINK-27483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531506#comment-17531506 ] Aitozi edited comment on FLINK-27483 at 5/4/22 6:29 AM: I think supporting custom headers is useful, but I don't tend to add a dedicated field {{artifactoryJarHeader}} for it, because it only works when the jarURI is HTTP-based; besides, it's not very extensible. I'd like to add a {{configuration}} field in {{FlinkSessionJobSpec}}. The field would be used to control the reconcile logic (like the reconcile interval and delay) and could also pass the headers for the HTTP URL. WDYT [~wangyang0918] [~Fuyao Li]?
[jira] [Commented] (FLINK-27483) Support adding custom HTTP header for HTTP based Jar fetch
[ https://issues.apache.org/jira/browse/FLINK-27483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531506#comment-17531506 ] Aitozi commented on FLINK-27483: I think supporting custom headers is reasonable, but I don't tend to add a dedicated field {{artifactoryJarHeader}} for it, because it only works when the jarURI is HTTP-based; besides, it's not very extensible. I'd like to add a {{configuration}} field in {{FlinkSessionJobSpec}}. The field would be used to control the reconcile logic (like the reconcile interval and delay) and could also pass the headers for the HTTP URL. WDYT [~wangyang0918]?
[jira] [Comment Edited] (FLINK-27483) Support adding custom HTTP header for HTTP based Jar fetch
[ https://issues.apache.org/jira/browse/FLINK-27483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531506#comment-17531506 ] Aitozi edited comment on FLINK-27483 at 5/4/22 6:28 AM: I think supporting custom headers is reasonable, but I don't tend to add a dedicated field {{artifactoryJarHeader}} for it, because it only works when the jarURI is HTTP-based; besides, it's not very extensible. I'd like to add a {{configuration}} field in {{FlinkSessionJobSpec}}. The field would be used to control the reconcile logic (like the reconcile interval and delay) and could also pass the headers for the HTTP URL. WDYT [~wangyang0918] [~Fuyao Li]?
[jira] [Commented] (FLINK-25865) Support to set restart policy of TaskManager pod for native K8s integration
[ https://issues.apache.org/jira/browse/FLINK-25865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531499#comment-17531499 ] Aitozi commented on FLINK-25865: Hi [~wangyang0918], are you working on this now? If not, I would like to work on it.
> Support to set restart policy of TaskManager pod for native K8s integration
>
> Key: FLINK-25865
> URL: https://issues.apache.org/jira/browse/FLINK-25865
> Project: Flink
> Issue Type: Improvement
> Components: Deployment / Kubernetes
> Reporter: Yang Wang
> Priority: Major
>
> After FLIP-201, Flink's TaskManagers will be able to be restarted without losing their local state, so it is reasonable to make the restart policy[1] of the TaskManager pod configurable.
> The current restart policy is {{Never}}: Flink always deletes the failed TaskManager pod directly and creates a new one instead. This ticket could help decrease the recovery time after a TaskManager failure.
>
> Please note that the working directory needs to be located in an emptyDir[2], which is retained across restarts.
>
> [1]. https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy
> [2]. https://kubernetes.io/docs/concepts/storage/volumes/#emptydir
[jira] [Commented] (FLINK-26647) Can not add extra config files on native Kubernetes
[ https://issues.apache.org/jira/browse/FLINK-26647?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17531498#comment-17531498 ] Aitozi commented on FLINK-26647: I think it would be convenient to add a way to ship extra config files to the ConfigMap, because we cannot predict which config files users may need. Meanwhile, users are responsible for the files they ship and should take care not to add too many files to the ConfigMap. If it's necessary to guard that the shipped config files are not too large, we can check the size before creating the ConfigMap, right? But I have not come up with a suitable size threshold. I lean toward making it easy to ship config files first, and adding the safeguard if necessary; what do you think [~wangyang0918]?
> Can not add extra config files on native Kubernetes
>
> Key: FLINK-26647
> URL: https://issues.apache.org/jira/browse/FLINK-26647
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.13.5
> Reporter: Zhe Wang
> Priority: Critical
>
> When using native Kubernetes mode (both session and application), I predefine the FLINK_CONF_DIR environment variable with config files in it. Only two files (*flink-conf.yaml and log4j-console.properties*) are populated to the ConfigMap, which means other config files (like sql-client-defaults.yaml, zoo.cfg, etc.) are missing.
> I tried these; neither worked out:
> 1) After native Kubernetes startup, change both the ConfigMap and the Deployment:
> 1. add all my config files to the ConfigMap.
> 2. add the config files to deployment.spec.template.spec.volumes[]
> 3. Flink job pods fail to start (log: lost leadership)
>
> 2) Using a *pod-template-file.taskmanager* file:
> 1. add config files to the created ConfigMap.
> 2. add my config files to the template (others can be merged by Flink as the guide says)
> 3. Flink task pods fail to start; log: Duplicated volume name
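A minimal sketch of the size guard discussed above, run before the ConfigMap is created. Kubernetes caps ConfigMap data at roughly 1 MiB, so that is used as the threshold here; the class and method names are illustrative, not Flink's actual code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;

/** Hypothetical guard that the shipped config files fit into a ConfigMap. */
public class ConfigMapSizeGuard {

    // Kubernetes limits ConfigMap data to about 1 MiB (backed by etcd).
    static final long MAX_BYTES = 1024 * 1024;

    /** Sums the UTF-8 payload size of all files, keyed by file name. */
    public static long totalSize(Map<String, String> files) {
        return files.values().stream()
                .mapToLong(content -> content.getBytes(StandardCharsets.UTF_8).length)
                .sum();
    }

    /** Fails fast before the ConfigMap create call if the files cannot fit. */
    public static void check(Map<String, String> files) {
        long size = totalSize(files);
        if (size > MAX_BYTES) {
            throw new IllegalArgumentException(
                    "Shipped config files (" + size + " bytes) exceed the 1 MiB ConfigMap limit.");
        }
    }
}
```

Checking up front gives the user a clear error instead of a rejected API call once the ConfigMap is submitted to the apiserver.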
[jira] [Commented] (FLINK-27461) Add a convenient way to set userAgent for the kubeclient
[ https://issues.apache.org/jira/browse/FLINK-27461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530390#comment-17530390 ] Aitozi commented on FLINK-27461: cc [~wangyang0918] WDYT?
> Add a convenient way to set userAgent for the kubeclient
>
> Key: FLINK-27461
> URL: https://issues.apache.org/jira/browse/FLINK-27461
> Project: Flink
> Issue Type: Improvement
> Components: Deployment / Kubernetes
> Reporter: Aitozi
> Priority: Major
>
> Currently, we construct the kubeclient from the kubeconfig file or from the context. However, if we have to set the user agent for the OkHttp client, it is a little hard.
> We share the Kubernetes cluster with different teams, and we need to distinguish the requests so the k8s folks can monitor the apiserver requests; we usually set a user agent per group. The fabric8 client exposes a way to extract config from system properties or environment variables, but the key for the user agent is bundled with the client version, which is not so convenient. Can we support setting the user agent via a dedicated Flink option?
[jira] [Created] (FLINK-27461) Add a convenient way to set userAgent for the kubeclient
Aitozi created FLINK-27461: -- Summary: Add a convenient way to set userAgent for the kubeclient Key: FLINK-27461 URL: https://issues.apache.org/jira/browse/FLINK-27461 Project: Flink Issue Type: Improvement Components: Deployment / Kubernetes Reporter: Aitozi
[jira] [Commented] (FLINK-27337) Prevent session cluster to be deleted when there are running jobs
[ https://issues.apache.org/jira/browse/FLINK-27337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530384#comment-17530384 ] Aitozi commented on FLINK-27337: OK, I will implement the deletion logic first, and will think about the update part again. If necessary, I will open a discussion on the mailing list. > Prevent session cluster to be deleted when there are running jobs > - > > Key: FLINK-27337 > URL: https://issues.apache.org/jira/browse/FLINK-27337 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Yang Wang >Priority: Major > Fix For: kubernetes-operator-1.0.0 > > > We should prevent the session cluster from being deleted when there are > running jobs. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27458) Expose allowNonRestoredState flag in JobSpec
[ https://issues.apache.org/jira/browse/FLINK-27458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530382#comment-17530382 ] Aitozi commented on FLINK-27458: I am willing to work on this > Expose allowNonRestoredState flag in JobSpec > > > Key: FLINK-27458 > URL: https://issues.apache.org/jira/browse/FLINK-27458 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Gyula Fora >Priority: Major > Fix For: kubernetes-operator-1.0.0 > > > We should probably expose this option as a top level spec field otherwise it > is impossible to set this on a per job level for SessionJobs. > What do you think [~aitozi] [~wangyang0918] -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27458) Expose allowNonRestoredState flag in JobSpec
[ https://issues.apache.org/jira/browse/FLINK-27458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17530381#comment-17530381 ] Aitozi commented on FLINK-27458: The application job can set {{execution.savepoint.ignore-unclaimed-state: true}} to skip the unclaimed state, but the session job cannot touch the cluster's Flink conf, so I think exposing the allowNonRestoredState flag in JobSpec is reasonable. BTW, if this value is set to true, I think we should also set {{execution.savepoint.ignore-unclaimed-state}} to true for application mode, WDYT? > Expose allowNonRestoredState flag in JobSpec > > > Key: FLINK-27458 > URL: https://issues.apache.org/jira/browse/FLINK-27458 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Gyula Fora >Priority: Major > Fix For: kubernetes-operator-1.0.0 > > > We should probably expose this option as a top-level spec field, otherwise it > is impossible to set it on a per-job level for SessionJobs. > What do you think [~aitozi] [~wangyang0918] -- This message was sent by Atlassian Jira (v8.20.7#820007)
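The relationship between the proposed spec field and the existing config key could be sketched as follows; the placement of `allowNonRestoredState` and the `deploymentName` field name are assumptions about the eventual CRD shape, not confirmed API:

```yaml
# Sketch only: a FlinkSessionJob carrying the proposed per-job flag.
apiVersion: flink.apache.org/v1beta1
kind: FlinkSessionJob
metadata:
  name: example-job
spec:
  deploymentName: example-session-cluster   # assumed field name
  job:
    jarURI: https://example.com/example-job.jar
    upgradeMode: savepoint
    allowNonRestoredState: true             # proposed flag
```

In application mode the same effect can be achieved by putting `execution.savepoint.ignore-unclaimed-state: "true"` into the deployment's `flinkConfiguration`, which is exactly what a session job cannot do on its own.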
[jira] [Comment Edited] (FLINK-27337) Prevent session cluster to be deleted when there are running jobs
[ https://issues.apache.org/jira/browse/FLINK-27337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529879#comment-17529879 ] Aitozi edited comment on FLINK-27337 at 4/29/22 9:13 AM: - I want to solve this in the following way # In the webhook we can prevent the deletion/upgrade if there are session jobs in the session cluster. # In the SessionReconciler cleanup and reconcile we should also check whether there are session jobs, and decide whether to upgrade/delete the session cluster or postpone that until there are no session jobs. (Because the user may not have enabled the webhook.) # Make the check configurable: users can choose to manually suspend the jobs before upgrading/deleting the session cluster, or have the operator do it directly (propagating the delete or upgrade). What do you think [~wangyang0918] [~gyfora] was (Author: aitozi): I want to solve this in the following way # In webhook we can prevent the deletion/upgrade if there is session jobs in the session cluster. # In SessionReconciler cleanup and reconcile we should also check whether there is session job and decide to upgrade/delete session cluster or postpone to do that until there is no session jobs. (Because user may not enable the webhook) # Make the check configurable, users can choose to manually suspend the job before upgrade/delete the session cluster or directly done by the operator (propagating the delete or upgrade) what do you think [~wangyang0918] [~gyfora] > Prevent session cluster to be deleted when there are running jobs > - > > Key: FLINK-27337 > URL: https://issues.apache.org/jira/browse/FLINK-27337 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Yang Wang >Priority: Major > Fix For: kubernetes-operator-1.0.0 > > > We should prevent the session cluster from being deleted when there are > running jobs. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27337) Prevent session cluster to be deleted when there are running jobs
[ https://issues.apache.org/jira/browse/FLINK-27337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529879#comment-17529879 ] Aitozi commented on FLINK-27337: I want to solve this in the following way # In the webhook we can prevent the deletion/upgrade if there are session jobs in the session cluster. # In the SessionReconciler cleanup and reconcile we should also check whether there are session jobs, and decide whether to upgrade/delete the session cluster or postpone that until there are no session jobs. (Because the user may not have enabled the webhook.) # Make the check configurable: users can choose to manually suspend the jobs before upgrading/deleting the session cluster, or have the operator do it directly (propagating the delete or upgrade). What do you think [~wangyang0918] [~gyfora] > Prevent session cluster to be deleted when there are running jobs > - > > Key: FLINK-27337 > URL: https://issues.apache.org/jira/browse/FLINK-27337 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Yang Wang >Priority: Major > Fix For: kubernetes-operator-1.0.0 > > > We should prevent the session cluster from being deleted when there are > running jobs. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27451) Enable the validator plugin in webhook
[ https://issues.apache.org/jira/browse/FLINK-27451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17529852#comment-17529852 ] Aitozi commented on FLINK-27451: I will give a quick fix for this. > Enable the validator plugin in webhook > -- > > Key: FLINK-27451 > URL: https://issues.apache.org/jira/browse/FLINK-27451 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.0.0 >Reporter: Aitozi >Priority: Major > > Currently the validator plugin is only enabled in the operator. I think it > should also be enabled in the webhook. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-27451) Enable the validator plugin in webhook
Aitozi created FLINK-27451: -- Summary: Enable the validator plugin in webhook Key: FLINK-27451 URL: https://issues.apache.org/jira/browse/FLINK-27451 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Affects Versions: kubernetes-operator-1.0.0 Reporter: Aitozi Currently the validator plugin is only enabled in the operator. I think it should also be enabled in the webhook. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (FLINK-27262) Enrich validator for FlinkSessionJob
[ https://issues.apache.org/jira/browse/FLINK-27262?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aitozi updated FLINK-27262: --- Summary: Enrich validator for FlinkSessionJob (was: Enrich validator for FlinkSesionJob) > Enrich validator for FlinkSessionJob > > > Key: FLINK-27262 > URL: https://issues.apache.org/jira/browse/FLINK-27262 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Yang Wang >Assignee: Aitozi >Priority: Major > > We need to enrich the validator to cover FlinkSesionJob. > At least we could have the following rules. > * Upgrade mode is savepoint, then {{state.savepoints.dir}} should be > configured in session FlinkDeployment -- This message was sent by Atlassian Jira (v8.20.7#820007)
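The validation rule quoted above ({{state.savepoints.dir}} must be configured when the upgrade mode is savepoint) can be illustrated with a session FlinkDeployment; the spec fields below beyond `flinkConfiguration` are illustrative assumptions, not a complete or authoritative CR:

```yaml
# Sketch only: a session cluster that hosts savepoint-upgrade session jobs
# should have a savepoint directory configured, per the proposed rule.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: example-session-cluster
spec:
  image: flink:1.15        # illustrative
  flinkVersion: v1_15      # illustrative
  flinkConfiguration:
    state.savepoints.dir: s3://example-bucket/savepoints   # required by the rule
```

Under this rule, a FlinkSessionJob with `upgradeMode: savepoint` pointing at a session deployment without this key would be rejected by the enriched validator.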
[jira] [Commented] (FLINK-27370) Add a new SessionJobState - Failed
[ https://issues.apache.org/jira/browse/FLINK-27370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528217#comment-17528217 ] Aitozi commented on FLINK-27370: Got it. I think we can add the {{success}} field here as a direct indication of whether the reconcile succeeded. cc [~gyfora] do you have any input on this? > Add a new SessionJobState - Failed > -- > > Key: FLINK-27370 > URL: https://issues.apache.org/jira/browse/FLINK-27370 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Xin Hao >Priority: Minor > > It will be nice if we can add a new SessionJobState Failed to indicate there > is an error for the session job. > {code:java} > status: > error: 'The error message' > jobStatus: > savepointInfo: {} > reconciliationStatus: > reconciliationTimestamp: 0 > state: DEPLOYED {code} > Reason: > 1. It will be easier for monitoring > 2. I have a personal controller to submit session jobs, and it will be cleaner > to check the state by a single field and get the details from the error field. > -- This message was sent by Atlassian Jira (v8.20.7#820007)
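Building on the status snippet quoted in the issue, the proposed indicator might look like the following; the `success` field name and its placement under `reconciliationStatus` are assumptions still under discussion, not a shipped schema:

```yaml
# Sketch only: reconciliationStatus extended with an explicit success flag,
# so an external controller can check one field and read details from error.
status:
  error: 'The error message'
  reconciliationStatus:
    success: false          # assumed field: did the last reconcile succeed?
    reconciliationTimestamp: 0
    state: DEPLOYED
```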
[jira] [Commented] (FLINK-27370) Add a new SessionJobState - Failed
[ https://issues.apache.org/jira/browse/FLINK-27370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17528104#comment-17528104 ] Aitozi commented on FLINK-27370: This field would only reflect whether the last reconcile succeeded, not the session job state. If you want to know the session job state, you can refer to {{status.jobStatus.state}}. > Add a new SessionJobState - Failed > -- > > Key: FLINK-27370 > URL: https://issues.apache.org/jira/browse/FLINK-27370 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Xin Hao >Priority: Minor > > It will be nice if we can add a new SessionJobState Failed to indicate there > is an error for the session job. > {code:java} > status: > error: 'The error message' > jobStatus: > savepointInfo: {} > reconciliationStatus: > reconciliationTimestamp: 0 > state: DEPLOYED {code} > Reason: > 1. It will be easier for monitoring > 2. I have a personal controller to submit session jobs, and it will be cleaner > to check the state by a single field and get the details from the error field. > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27262) Enrich validator for FlinkSesionJob
[ https://issues.apache.org/jira/browse/FLINK-27262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527928#comment-17527928 ] Aitozi commented on FLINK-27262: I have one thing to discuss here: do we need to support users creating a session job CR before the FlinkDeployment CR has been created? If not, we can always validate both together, which I think will make the validation work easier. cc [~wangyang0918] [~gyfora] > Enrich validator for FlinkSesionJob > --- > > Key: FLINK-27262 > URL: https://issues.apache.org/jira/browse/FLINK-27262 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Yang Wang >Assignee: Aitozi >Priority: Major > > We need to enrich the validator to cover FlinkSessionJob. > At least we could have the following rules. > * Upgrade mode is savepoint, then {{state.savepoints.dir}} should be > configured in session FlinkDeployment -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27370) Add a new SessionJobState - Failed
[ https://issues.apache.org/jira/browse/FLINK-27370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527908#comment-17527908 ] Aitozi commented on FLINK-27370: It previously had a {{success=true/false}} field; do you mean that? > Add a new SessionJobState - Failed > -- > > Key: FLINK-27370 > URL: https://issues.apache.org/jira/browse/FLINK-27370 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Xin Hao >Priority: Minor > > It will be nice if we can add a new SessionJobState Failed to indicate there > is an error for the session job. > {code:java} > status: > error: 'The error message' > jobStatus: > savepointInfo: {} > reconciliationStatus: > reconciliationTimestamp: 0 > state: DEPLOYED {code} > Reason: > 1. It will be easier for monitoring > 2. I have a personal controller to submit session jobs, and it will be cleaner > to check the state by a single field and get the details from the error field. > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-27397) Improve the CrdReferenceDoclet generator to handle the abstract class
Aitozi created FLINK-27397: -- Summary: Improve the CrdReferenceDoclet generator to handle the abstract class Key: FLINK-27397 URL: https://issues.apache.org/jira/browse/FLINK-27397 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Affects Versions: kubernetes-operator-1.0.0 Reporter: Aitozi As discussed [here|https://github.com/apache/flink-kubernetes-operator/pull/176#discussion_r856945232], we should improve the handling of abstract classes in the {{CrdReferenceDoclet}}. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27358) Kubernetes operator throws NPE when testing with Flink 1.15
[ https://issues.apache.org/jira/browse/FLINK-27358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17527147#comment-17527147 ] Aitozi commented on FLINK-27358: Do we need to link the related upstream ticket here? > Kubernetes operator throws NPE when testing with Flink 1.15 > --- > > Key: FLINK-27358 > URL: https://issues.apache.org/jira/browse/FLINK-27358 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Reporter: Yang Wang >Assignee: Yang Wang >Priority: Major > Labels: pull-request-available > Fix For: kubernetes-operator-1.0.0 > > > {code:java} > 2022-04-22 10:19:18,307 o.a.f.k.o.c.FlinkDeploymentController [WARN > ][default/flink-example-statemachine] Attempt count: 5, last attempt: true > 2022-04-22 10:19:18,329 i.j.o.p.e.ReconciliationDispatcher > [ERROR][default/flink-example-statemachine] Error during event processing > ExecutionScope{ resource id: > CustomResourceID{name='flink-example-statemachine', namespace='default'}, > version: 4979543} failed.
> org.apache.flink.kubernetes.operator.exception.ReconciliationException: > java.lang.NullPointerException > at > org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:110) > at > org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:53) > at > io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:101) > at > io.javaoperatorsdk.operator.processing.Controller$2.execute(Controller.java:76) > at > io.javaoperatorsdk.operator.api.monitoring.Metrics.timeControllerExecution(Metrics.java:34) > at > io.javaoperatorsdk.operator.processing.Controller.reconcile(Controller.java:75) > at > io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.reconcileExecution(ReconciliationDispatcher.java:143) > at > io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleReconcile(ReconciliationDispatcher.java:109) > at > io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleDispatch(ReconciliationDispatcher.java:74) > at > io.javaoperatorsdk.operator.processing.event.ReconciliationDispatcher.handleExecution(ReconciliationDispatcher.java:50) > at > io.javaoperatorsdk.operator.processing.event.EventProcessor$ControllerExecution.run(EventProcessor.java:349) > at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown > Source) > at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown > Source) > at java.base/java.lang.Thread.run(Unknown Source) > Caused by: java.lang.NullPointerException > at > org.apache.flink.kubernetes.operator.utils.FlinkUtils.lambda$deleteJobGraphInKubernetesHA$0(FlinkUtils.java:253) > at java.base/java.util.ArrayList.forEach(Unknown Source) > at > org.apache.flink.kubernetes.operator.utils.FlinkUtils.deleteJobGraphInKubernetesHA(FlinkUtils.java:248) > at > 
org.apache.flink.kubernetes.operator.service.FlinkService.submitApplicationCluster(FlinkService.java:130) > at > org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.deployFlinkJob(ApplicationReconciler.java:205) > at > org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.restoreFromLastSavepoint(ApplicationReconciler.java:218) > at > org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.reconcile(ApplicationReconciler.java:117) > at > org.apache.flink.kubernetes.operator.reconciler.deployment.ApplicationReconciler.reconcile(ApplicationReconciler.java:56) > at > org.apache.flink.kubernetes.operator.controller.FlinkDeploymentController.reconcile(FlinkDeploymentController.java:106) > ... 13 more {code} > The root cause is that the Kubernetes HA implementation has changed from > 1.15. When the job is cancelled, the data of leader ConfigMap will be > cleared. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-27362) Support restartNonce semantics in session job
Aitozi created FLINK-27362: -- Summary: Support restartNonce semantics in session job Key: FLINK-27362 URL: https://issues.apache.org/jira/browse/FLINK-27362 Project: Flink Issue Type: Sub-task Reporter: Aitozi -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27262) Enrich validator for FlinkSesionJob
[ https://issues.apache.org/jira/browse/FLINK-27262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526985#comment-17526985 ] Aitozi commented on FLINK-27262: FYI, I'm working on this now. > Enrich validator for FlinkSesionJob > --- > > Key: FLINK-27262 > URL: https://issues.apache.org/jira/browse/FLINK-27262 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Yang Wang >Priority: Major > > We need to enrich the validator to cover FlinkSesionJob. > At least we could have the following rules. > * Upgrade mode is savepoint, then {{state.savepoints.dir}} should be > configured in session FlinkDeployment -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27360) Rename clusterId field of FlinkSessionJobSpec to deploymentName
[ https://issues.apache.org/jira/browse/FLINK-27360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526951#comment-17526951 ] Aitozi commented on FLINK-27360: Makes sense to me. It's a minor change; I will open a PR for it. > Rename clusterId field of FlinkSessionJobSpec to deploymentName > --- > > Key: FLINK-27360 > URL: https://issues.apache.org/jira/browse/FLINK-27360 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Gyula Fora >Priority: Major > Fix For: kubernetes-operator-1.0.0 > > > Since the sessionjob logic only works together with a FlinkDeployment, and > the clusterId always has to match the name, it would be better to explicitly > call it deploymentName so that it is more intuitive for the users. > What do you think? > cc [~aitozi] -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27337) Prevent session cluster to be deleted when there are running jobs
[ https://issues.apache.org/jira/browse/FLINK-27337?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525579#comment-17525579 ] Aitozi commented on FLINK-27337: As discussed in the [mailing list|https://lists.apache.org/thread/9xvdwtf5tbr28lonj21p58tn5gdndns5], the deletion of all the session jobs may be a critical operation. So if there is an upgrade or cleanup event for the session cluster, we can postpone the real cleanup/stop until all the session jobs have been suspended by the user: we avoid removing the finalizer and re-schedule a check for whether all the jobs are stopped. This way is safer but may involve more steps when the user wants to delete or upgrade the session cluster, so what about supporting both? > Prevent session cluster to be deleted when there are running jobs > - > > Key: FLINK-27337 > URL: https://issues.apache.org/jira/browse/FLINK-27337 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Yang Wang >Priority: Major > Fix For: kubernetes-operator-1.0.0 > > > We should prevent the session cluster from being deleted when there are > running jobs. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-27334) Support auto generate the doc for the KubernetesOperatorConfigOptions
Aitozi created FLINK-27334: -- Summary: Support auto generate the doc for the KubernetesOperatorConfigOptions Key: FLINK-27334 URL: https://issues.apache.org/jira/browse/FLINK-27334 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Aitozi Fix For: kubernetes-operator-1.0.0 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (FLINK-27270) Add document of session job operations
[ https://issues.apache.org/jira/browse/FLINK-27270?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aitozi updated FLINK-27270: --- Description: # Basic operations # How to support different source jars > Add document of session job operations > -- > > Key: FLINK-27270 > URL: https://issues.apache.org/jira/browse/FLINK-27270 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator > Reporter: Aitozi >Priority: Major > Fix For: kubernetes-operator-1.0.0 > > > # Basic operations > # How to support different source jars -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27309) Allow to load default flink configs in the k8s operator dynamically
[ https://issues.apache.org/jira/browse/FLINK-27309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524250#comment-17524250 ] Aitozi commented on FLINK-27309: I think it's a very useful feature in production. But do we need to outline which configs can take effect dynamically? Currently, most options control the reconcile interval or check behavior and can take effect on the next reconcile turn, but an option like {{operator.reconciler.max.parallelism}} may not take effect directly, right? > Allow to load default flink configs in the k8s operator dynamically > --- > > Key: FLINK-27309 > URL: https://issues.apache.org/jira/browse/FLINK-27309 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Major > > The current default configs used by the k8s operator are saved in the > /opt/flink/conf dir in the k8s operator pod and are loaded only once when > the operator is created. > Since the flink k8s operator can be a long-running service and users may > want to modify the default configs (e.g. the metric reporter sampling interval) > for newly created deployments, it may be better to load the default configs > dynamically (i.e. parse the latest /opt/flink/conf/flink-conf.yaml) in the > {{ReconcilerFactory}} and {{ObserverFactory}}, instead of redeploying the > operator. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-27279) Extract common status interfaces
[ https://issues.apache.org/jira/browse/FLINK-27279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524248#comment-17524248 ] Aitozi commented on FLINK-27279: [~gyfora] FYI, I'm working on this now. > Extract common status interfaces > > > Key: FLINK-27279 > URL: https://issues.apache.org/jira/browse/FLINK-27279 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Gyula Fora >Priority: Major > Fix For: kubernetes-operator-1.0.0 > > > FlinkDeploymentStatus - FlinkSessionJobStatus > and > ReconciliationStatus - FlinkSessionJobReconciiationStatus > share most of their content and extracting the shared parts into interfaces > would allow us to unify status update logic and remove some code duplicaiton -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-27279) Extract common status interfaces
[ https://issues.apache.org/jira/browse/FLINK-27279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523590#comment-17523590 ] Aitozi commented on FLINK-27279: Got it. > Extract common status interfaces > > > Key: FLINK-27279 > URL: https://issues.apache.org/jira/browse/FLINK-27279 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Gyula Fora >Priority: Major > Fix For: kubernetes-operator-1.0.0 > > > FlinkDeploymentStatus - FlinkSessionJobStatus > and > ReconciliationStatus - FlinkSessionJobReconciliationStatus > share most of their content, and extracting the shared parts into interfaces > would allow us to unify status update logic and remove some code duplication -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-27279) Extract common status interfaces
[ https://issues.apache.org/jira/browse/FLINK-27279?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523573#comment-17523573 ] Aitozi commented on FLINK-27279: hah, I'm willing to take this ticket. Does this also mean that we can offload some methods in the {{ReconciliationUtils}} to the related common status objects, as discussed [here|https://github.com/apache/flink-kubernetes-operator/pull/165#discussion_r847997670]? > Extract common status interfaces > > > Key: FLINK-27279 > URL: https://issues.apache.org/jira/browse/FLINK-27279 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Gyula Fora >Priority: Major > Fix For: kubernetes-operator-1.0.0 > > > FlinkDeploymentStatus - FlinkSessionJobStatus > and > ReconciliationStatus - FlinkSessionJobReconciliationStatus > share most of their content, and extracting the shared parts into interfaces > would allow us to unify status update logic and remove some code duplication -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-27261) Disable web.cancel.enable for session cluster
[ https://issues.apache.org/jira/browse/FLINK-27261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523273#comment-17523273 ] Aitozi commented on FLINK-27261: There is a bit of a difference between the session job and the application cluster: if a user cancels the session job from the web UI, I think the session job will sync its state to cancelled. But I still think we'd better disable it by default so that job operations are all managed by the CR. > Disable web.cancel.enable for session cluster > - > > Key: FLINK-27261 > URL: https://issues.apache.org/jira/browse/FLINK-27261 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Yang Wang >Priority: Major > Labels: starter > > In FLINK-27154, we disabled {{web.cancel.enable}} for the application cluster. > We should also do this for the session cluster. -- This message was sent by Atlassian Jira (v8.20.1#820001)
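The default discussed above amounts to the operator injecting one config key when it deploys a session cluster. Until that lands, a user could set it manually on the CR; a minimal sketch, assuming the standard `flinkConfiguration` pass-through on the session FlinkDeployment:

```yaml
# Sketch only: manually disabling the web UI cancel button for a session
# cluster, so jobs can only be stopped through their CRs.
spec:
  flinkConfiguration:
    web.cancel.enable: "false"
```

This mirrors what FLINK-27154 did for application clusters, where the operator sets the key itself.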
[jira] [Comment Edited] (FLINK-27261) Disable web.cancel.enable for session cluster
[ https://issues.apache.org/jira/browse/FLINK-27261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523273#comment-17523273 ] Aitozi edited comment on FLINK-27261 at 4/17/22 4:03 AM: - There is a bit of a difference between the session job and the application cluster: if a user cancels the session job from the web UI, I think the session job will sync its state to cancelled. But I still think we'd better disable it by default so that job operations are all managed by the CR. was (Author: aitozi): There is a bit different between the session job and application cluster. If user cancel the session job from web ui, I think the session job will sync the state to cancelled. But I still think we'd better disable it by default to make the job operation are all managed by the CR > Disable web.cancel.enable for session cluster > - > > Key: FLINK-27261 > URL: https://issues.apache.org/jira/browse/FLINK-27261 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Yang Wang >Priority: Major > Labels: starter > > In FLINK-27154, we disabled {{web.cancel.enable}} for the application cluster. > We should also do this for the session cluster. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-27273) Support both configMap and zookeeper based HA data clean up
Aitozi created FLINK-27273: -- Summary: Support both configMap and zookeeper based HA data clean up Key: FLINK-27273 URL: https://issues.apache.org/jira/browse/FLINK-27273 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Aitozi Fix For: kubernetes-operator-1.0.0 As discussed in [comments|https://github.com/apache/flink-kubernetes-operator/pull/28#discussion_r815695041], we only support cleaning up ConfigMap-based HA data. Considering that ZooKeeper is still widely used as the HA service when deploying on Kubernetes, I think we should also take it into account; otherwise, there will be unexpected behavior with ZK HA jobs. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-27160) Add e2e tests for FlinkSessionJob type
[ https://issues.apache.org/jira/browse/FLINK-27160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523110#comment-17523110 ] Aitozi commented on FLINK-27160: I'm working on this now > Add e2e tests for FlinkSessionJob type > -- > > Key: FLINK-27160 > URL: https://issues.apache.org/jira/browse/FLINK-27160 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Aitozi >Priority: Major > Fix For: kubernetes-operator-1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-27262) Enrich validator for FlinkSesionJob
[ https://issues.apache.org/jira/browse/FLINK-27262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523109#comment-17523109 ] Aitozi commented on FLINK-27262: BTW, I think the webhook does not handle validation of the session job yet. I think this functionality should also be included in this ticket. > Enrich validator for FlinkSesionJob > --- > > Key: FLINK-27262 > URL: https://issues.apache.org/jira/browse/FLINK-27262 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Yang Wang >Priority: Major > > We need to enrich the validator to cover FlinkSessionJob. > At least we could have the following rules. > * Upgrade mode is savepoint, then {{state.savepoints.dir}} should be > configured in session FlinkDeployment -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-27257) Flink kubernetes operator triggers savepoint failed because of not all tasks running
[ https://issues.apache.org/jira/browse/FLINK-27257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523105#comment-17523105 ] Aitozi commented on FLINK-27257: I also encountered this problem. I think it is caused by the {{ApplicationReconciler#isJobRunning}} result not being exact > Flink kubernetes operator triggers savepoint failed because of not all tasks > running > > > Key: FLINK-27257 > URL: https://issues.apache.org/jira/browse/FLINK-27257 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Reporter: Yang Wang >Priority: Major > > {code:java} > 2022-04-15 02:38:56,551 o.a.f.k.o.s.FlinkService [INFO > ][default/flink-example-statemachine] Fetching savepoint result with > triggerId: 182d7f176496856d7b33fe2f3767da18 > 2022-04-15 02:38:56,690 o.a.f.k.o.s.FlinkService > [ERROR][default/flink-example-statemachine] Savepoint error > org.apache.flink.runtime.checkpoint.CheckpointException: Checkpoint > triggering task Source: Custom Source (1/2) of job > is not being executed at the moment. > Aborting checkpoint. Failure reason: Not all required tasks are currently > running.
> at > org.apache.flink.runtime.checkpoint.DefaultCheckpointPlanCalculator.checkTasksStarted(DefaultCheckpointPlanCalculator.java:143) > at > org.apache.flink.runtime.checkpoint.DefaultCheckpointPlanCalculator.lambda$calculateCheckpointPlan$1(DefaultCheckpointPlanCalculator.java:105) > at > java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1604) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRunAsync$4(AkkaRpcActor.java:455) > at > org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:455) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:213) > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:78) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163) > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) > at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) > at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) > at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) > at akka.actor.Actor.aroundReceive(Actor.scala:537) > at akka.actor.Actor.aroundReceive$(Actor.scala:535) > at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580) > at akka.actor.ActorCell.invoke(ActorCell.scala:548) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270) > at akka.dispatch.Mailbox.run(Mailbox.scala:231) > at akka.dispatch.Mailbox.exec(Mailbox.scala:243) > at 
java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) > at > java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) > at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) > at > java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:175) > 2022-04-15 02:38:56,693 o.a.f.k.o.o.SavepointObserver > [ERROR][default/flink-example-statemachine] Checkpoint triggering task > Source: Custom Source (1/2) of job is not > being executed at the moment. Aborting checkpoint. Failure reason: Not all > required tasks are currently running. {code} > How to reproduce? > Update arbitrary fields (e.g. parallelism) along with > {{savepointTriggerNonce}}. > > The root cause might be that the running state returned by > {{ClusterClient#listJobs()}} does not mean all the tasks are running. -- This message was sent by Atlassian Jira (v8.20.1#820001)
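The fix direction suggested by the thread above can be sketched as follows. This is an illustration only, not the actual {{ApplicationReconciler}} code; the class and method names are hypothetical. The idea is that before triggering a savepoint, the operator should check that every task vertex is RUNNING, rather than trusting the job-level status from {{ClusterClient#listJobs()}}:

```java
import java.util.Map;

// Hypothetical sketch: a job reported as RUNNING by listJobs() may still have
// individual task vertices in CREATED/SCHEDULED/DEPLOYING state, which makes
// savepoint triggering fail with "Not all required tasks are currently running".
public class SavepointReadinessCheck {

    /**
     * Returns true only when the job has at least one vertex and every vertex
     * reports the RUNNING state (vertex name -> state string, as a REST poll
     * of the job detail might return it).
     */
    public static boolean allTasksRunning(Map<String, String> vertexStates) {
        return !vertexStates.isEmpty()
                && vertexStates.values().stream().allMatch("RUNNING"::equals);
    }
}
```

With this check, the scenario from the log above (one source subtask still deploying) would defer the savepoint trigger instead of aborting the checkpoint.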
[jira] [Commented] (FLINK-26870) Implement session job observer
[ https://issues.apache.org/jira/browse/FLINK-26870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523099#comment-17523099 ] Aitozi commented on FLINK-26870: I think this is already included in [https://github.com/apache/flink-kubernetes-operator/pull/164]. Closing this ticket now. > Implement session job observer > -- > > Key: FLINK-26870 > URL: https://issues.apache.org/jira/browse/FLINK-26870 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Aitozi >Priority: Major > Fix For: kubernetes-operator-1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (FLINK-26870) Implement session job observer
[ https://issues.apache.org/jira/browse/FLINK-26870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aitozi closed FLINK-26870. -- Resolution: Fixed > Implement session job observer > -- > > Key: FLINK-26870 > URL: https://issues.apache.org/jira/browse/FLINK-26870 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator > Reporter: Aitozi >Priority: Major > Fix For: kubernetes-operator-1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-27270) Add document of session job operations
Aitozi created FLINK-27270: -- Summary: Add document of session job operations Key: FLINK-27270 URL: https://issues.apache.org/jira/browse/FLINK-27270 Project: Flink Issue Type: Sub-task Components: Kubernetes Operator Reporter: Aitozi Fix For: kubernetes-operator-1.0.0 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-27269) Clean up the jar file after submitting the job
Aitozi created FLINK-27269: -- Summary: Clean up the jar file after submitting the job Key: FLINK-27269 URL: https://issues.apache.org/jira/browse/FLINK-27269 Project: Flink Issue Type: Sub-task Reporter: Aitozi When testing, I found that the jar files accumulate after several submissions; I think we should clean up the jars after submitting. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-27262) Enrich validator for FlinkSesionJob
[ https://issues.apache.org/jira/browse/FLINK-27262?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17523078#comment-17523078 ] Aitozi commented on FLINK-27262: Agreed, the validator for the session job is not complete yet. I will work on this; please help assign the ticket. > Enrich validator for FlinkSesionJob > --- > > Key: FLINK-27262 > URL: https://issues.apache.org/jira/browse/FLINK-27262 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Yang Wang >Priority: Major > > We need to enrich the validator to cover FlinkSesionJob. > At least we could have the following rules. > * Upgrade mode is savepoint, then {{state.savepoints.dir}} should be > configured in session FlinkDeployment -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-27203) Add supports for uploading jars from object storage (s3, gcs, oss)
[ https://issues.apache.org/jira/browse/FLINK-27203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522186#comment-17522186 ] Aitozi commented on FLINK-27203: Hi [~haoxin], do you mean uploading the session job jars? There is a ticket tracking this. > Add supports for uploading jars from object storage (s3, gcs, oss) > -- > > Key: FLINK-27203 > URL: https://issues.apache.org/jira/browse/FLINK-27203 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Xin Hao >Priority: Major > > I think it will be efficient if we can read jars from object storage. > > * We can detect the object storage provider by the path scheme. (Such as: > `gs://` for GCS) > * Because most cloud providers' SDKs support native permission > integration with K8s, for example on GCP, the user can bind permissions > via service accounts, and the code running in the K8s cluster can be as simple as > `{{storage = StorageOptions.getDefaultInstance().service}}` > * In this way, users only need to bind the service account permissions > and don't need to handle volume mounts and jar initialization (downloading jars > from object storage into the volumes themselves). > I think we can define the interface first and let the community help > contribute the implementations for the different cloud providers. -- This message was sent by Atlassian Jira (v8.20.1#820001)
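The scheme-based provider detection proposed in the description above could be sketched like this. The enum and method names are illustrative only and not part of the operator's API:

```java
import java.net.URI;

// Sketch of detecting an object-storage provider from the jar URI's scheme,
// as suggested in FLINK-27203 (e.g. `gs://` for GCS). Names are hypothetical.
public class JarStorageResolver {

    enum Provider { GCS, S3, OSS, LOCAL, UNKNOWN }

    /** Maps a jar URI to a storage provider based on its scheme. */
    public static Provider detect(String jarUri) {
        String scheme = URI.create(jarUri).getScheme();
        if (scheme == null) {
            return Provider.LOCAL; // plain path, no scheme
        }
        switch (scheme) {
            case "gs":   return Provider.GCS;
            case "s3":   return Provider.S3;
            case "oss":  return Provider.OSS;
            case "file": return Provider.LOCAL;
            default:     return Provider.UNKNOWN;
        }
    }
}
```

An interface keyed on such a resolver would let each cloud provider's download logic be contributed independently, as the description suggests.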
[jira] [Commented] (FLINK-27161) Support user jar from different type of filesystems
[ https://issues.apache.org/jira/browse/FLINK-27161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17520316#comment-17520316 ] Aitozi commented on FLINK-27161: I'm working on this now. > Support user jar from different type of filesystems > --- > > Key: FLINK-27161 > URL: https://issues.apache.org/jira/browse/FLINK-27161 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Aitozi >Priority: Major > Fix For: kubernetes-operator-1.0.0 > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-27161) Support user jar from different type of filesystems
Aitozi created FLINK-27161: -- Summary: Support user jar from different type of filesystems Key: FLINK-27161 URL: https://issues.apache.org/jira/browse/FLINK-27161 Project: Flink Issue Type: Sub-task Components: Kubernetes Operator Reporter: Aitozi Fix For: kubernetes-operator-1.0.0 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-27160) Add e2e tests for FlinkSessionJob type
Aitozi created FLINK-27160: -- Summary: Add e2e tests for FlinkSessionJob type Key: FLINK-27160 URL: https://issues.apache.org/jira/browse/FLINK-27160 Project: Flink Issue Type: Sub-task Components: Kubernetes Operator Reporter: Aitozi Fix For: kubernetes-operator-1.0.0 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (FLINK-27001) Support to specify the resource of the operator
[ https://issues.apache.org/jira/browse/FLINK-27001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aitozi closed FLINK-27001. -- Resolution: Fixed > Support to specify the resource of the operator > > > Key: FLINK-27001 > URL: https://issues.apache.org/jira/browse/FLINK-27001 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator > Reporter: Aitozi >Priority: Major > > Supporting to specify the operator resource requirements and limits -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-27001) Support to specify the resource of the operator
[ https://issues.apache.org/jira/browse/FLINK-27001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17519298#comment-17519298 ] Aitozi commented on FLINK-27001: Yes, closing it. > Support to specify the resource of the operator > > > Key: FLINK-27001 > URL: https://issues.apache.org/jira/browse/FLINK-27001 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Aitozi >Priority: Major > > Supporting to specify the operator resource requirements and limits -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-27028) Support to upload jar and run jar in RestClusterClient
[ https://issues.apache.org/jira/browse/FLINK-27028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516448#comment-17516448 ] Aitozi commented on FLINK-27028: cc [~wangyang0918] [~chesnay] If there is no objection, I'm willing to open a pull request for this. > Support to upload jar and run jar in RestClusterClient > -- > > Key: FLINK-27028 > URL: https://issues.apache.org/jira/browse/FLINK-27028 > Project: Flink > Issue Type: Improvement > Components: Client / Job Submission >Reporter: Aitozi >Priority: Major > > The {{flink-kubernetes-operator}} is using the JarUpload + JarRun to support > the session job submission. However, currently the RestClusterClient does not > expose a way to upload the user jar to the session cluster and trigger the jar > run API. So a naked RestClient is used to achieve this, but it lacks the > common retry logic. > Can we expose these two APIs in the RestClusterClient to make it more > convenient to use in the operator? -- This message was sent by Atlassian Jira (v8.20.1#820001)
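The gap described above is that a naked RestClient lacks the common retry logic that RestClusterClient provides. As an illustration only (not the operator's actual code; all names here are hypothetical), a bounded retry wrapper around such a REST call could look like:

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of the retry behavior the ticket asks for: retry a
// REST call (e.g. jar upload or jar run) a bounded number of times with a
// linearly growing backoff, rethrowing the last failure if all attempts fail.
public class RetryingCall {

    public static <T> T withRetry(Callable<T> call, int maxAttempts, long backoffMs)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e; // remember the failure and retry if attempts remain
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMs * attempt);
                }
            }
        }
        throw last;
    }
}
```

Exposing the jar-upload and jar-run endpoints directly on RestClusterClient would give the operator this behavior for free instead of reimplementing it.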
[jira] [Updated] (FLINK-27028) Support to upload jar and run jar in RestClusterClient
[ https://issues.apache.org/jira/browse/FLINK-27028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aitozi updated FLINK-27028: --- Description: The {{flink-kubernetes-operator}} is using the JarUpload + JarRun to support the session job submission. However, currently the RestClusterClient do not expose a way to upload the user jar to session cluster and trigger the jar run api. So a naked RestClient is used to achieve this, but it lacks the common retry logic. Can we expose these two api the the rest cluster client to make it more convenient to use in the operator was: The flink-kubernetes-operator is using the JarUpload + JarRun to support the session job management. However, currently the RestClusterClient do not expose a way to upload the user jar to session cluster and trigger the jar run api. So I used to naked RestClient to achieve this. Can we expose these two api the the rest cluster client to make it more convenient to use in the operator > Support to upload jar and run jar in RestClusterClient > -- > > Key: FLINK-27028 > URL: https://issues.apache.org/jira/browse/FLINK-27028 > Project: Flink > Issue Type: Improvement > Components: Client / Job Submission > Reporter: Aitozi >Priority: Major > > The {{flink-kubernetes-operator}} is using the JarUpload + JarRun to support > the session job submission. However, currently the RestClusterClient do not > expose a way to upload the user jar to session cluster and trigger the jar > run api. So a naked RestClient is used to achieve this, but it lacks the > common retry logic. > Can we expose these two api the the rest cluster client to make it more > convenient to use in the operator -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-27028) Support to upload jar and run jar in RestClusterClient
Aitozi created FLINK-27028: -- Summary: Support to upload jar and run jar in RestClusterClient Key: FLINK-27028 URL: https://issues.apache.org/jira/browse/FLINK-27028 Project: Flink Issue Type: Improvement Components: Client / Job Submission Reporter: Aitozi The flink-kubernetes-operator is using JarUpload + JarRun to support session job management. However, the RestClusterClient currently does not expose a way to upload the user jar to the session cluster and trigger the jar-run API, so I used a naked RestClient to achieve this. Can we expose these two APIs in the RestClusterClient to make it more convenient to use in the operator? -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-27001) Support to specify the resource of the operator
[ https://issues.apache.org/jira/browse/FLINK-27001?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516445#comment-17516445 ] Aitozi commented on FLINK-27001: It seems this can also be implemented by https://issues.apache.org/jira/browse/FLINK-26663, so I will not work on it right now; I will keep an eye on it. > Support to specify the resource of the operator > > > Key: FLINK-27001 > URL: https://issues.apache.org/jira/browse/FLINK-27001 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Aitozi >Priority: Major > > Supporting to specify the operator resource requirements and limits -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-27001) Support to specify the resource of the operator
Aitozi created FLINK-27001: -- Summary: Support to specify the resource of the operator Key: FLINK-27001 URL: https://issues.apache.org/jira/browse/FLINK-27001 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Aitozi Support specifying the operator's resource requirements and limits -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-27000) Support to set JVM args for operator
Aitozi created FLINK-27000: -- Summary: Support to set JVM args for operator Key: FLINK-27000 URL: https://issues.apache.org/jira/browse/FLINK-27000 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Aitozi In production we often need to set JVM options for the operator. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-20808) Remove redundant checkstyle rules
[ https://issues.apache.org/jira/browse/FLINK-20808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aitozi updated FLINK-20808: --- Attachment: (was: image-2022-04-02-12-46-11-005.png) > Remove redundant checkstyle rules > - > > Key: FLINK-20808 > URL: https://issues.apache.org/jira/browse/FLINK-20808 > Project: Flink > Issue Type: Technical Debt > Components: Build System >Reporter: Chesnay Schepler >Priority: Not a Priority > Labels: auto-deprioritized-major, auto-deprioritized-minor > Attachments: image-2022-04-02-12-46-28-065.png > > > There are probably a few checkstyle rules that are now enforced by spotless, > and we could remove these to clarify the responsibilities of each tool. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-20808) Remove redundant checkstyle rules
[ https://issues.apache.org/jira/browse/FLINK-20808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516216#comment-17516216 ] Aitozi commented on FLINK-20808: No need to answer; after some searching, I found it can be solved by setting the IDE line separator to {{LF}} :) !image-2022-04-02-12-46-28-065.png|width=354,height=85! > Remove redundant checkstyle rules > - > > Key: FLINK-20808 > URL: https://issues.apache.org/jira/browse/FLINK-20808 > Project: Flink > Issue Type: Technical Debt > Components: Build System >Reporter: Chesnay Schepler >Priority: Not a Priority > Labels: auto-deprioritized-major, auto-deprioritized-minor > Attachments: image-2022-04-02-12-46-11-005.png, > image-2022-04-02-12-46-28-065.png > > > There are probably a few checkstyle rules that are now enforced by spotless, > and we could remove these to clarify the responsibilities of each tool. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-20808) Remove redundant checkstyle rules
[ https://issues.apache.org/jira/browse/FLINK-20808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aitozi updated FLINK-20808: --- Attachment: image-2022-04-02-12-46-28-065.png > Remove redundant checkstyle rules > - > > Key: FLINK-20808 > URL: https://issues.apache.org/jira/browse/FLINK-20808 > Project: Flink > Issue Type: Technical Debt > Components: Build System >Reporter: Chesnay Schepler >Priority: Not a Priority > Labels: auto-deprioritized-major, auto-deprioritized-minor > Attachments: image-2022-04-02-12-46-11-005.png, > image-2022-04-02-12-46-28-065.png > > > There are probably a few checkstyle rules that are now enforced by spotless, > and we could remove these to clarify the responsibilities of each tool. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-20808) Remove redundant checkstyle rules
[ https://issues.apache.org/jira/browse/FLINK-20808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aitozi updated FLINK-20808: --- Attachment: image-2022-04-02-12-46-11-005.png > Remove redundant checkstyle rules > - > > Key: FLINK-20808 > URL: https://issues.apache.org/jira/browse/FLINK-20808 > Project: Flink > Issue Type: Technical Debt > Components: Build System >Reporter: Chesnay Schepler >Priority: Not a Priority > Labels: auto-deprioritized-major, auto-deprioritized-minor > Attachments: image-2022-04-02-12-46-11-005.png, > image-2022-04-02-12-46-28-065.png > > > There are probably a few checkstyle rules that are now enforced by spotless, > and we could remove these to clarify the responsibilities of each tool. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-20808) Remove redundant checkstyle rules
[ https://issues.apache.org/jira/browse/FLINK-20808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516211#comment-17516211 ] Aitozi commented on FLINK-20808: Hi [~chesnay] sorry to bother you here. I ran into a case: I followed the development doc to use the save action and google-java-format to automatically format the code, but the formatted code cannot pass the checkstyle rule [NewlineAtEndOfFile]. When I check the failing code, it actually has the newline at the end. Is there a bug in checkstyle, or am I missing some configuration for development? > Remove redundant checkstyle rules > - > > Key: FLINK-20808 > URL: https://issues.apache.org/jira/browse/FLINK-20808 > Project: Flink > Issue Type: Technical Debt > Components: Build System >Reporter: Chesnay Schepler >Priority: Not a Priority > Labels: auto-deprioritized-major, auto-deprioritized-minor > > There are probably a few checkstyle rules that are now enforced by spotless, > and we could remove these to clarify the responsibilities of each tool. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26996) Break the reconcile after first create session cluster
Aitozi created FLINK-26996: -- Summary: Break the reconcile after first create session cluster Key: FLINK-26996 URL: https://issues.apache.org/jira/browse/FLINK-26996 Project: Flink Issue Type: Bug Components: Kubernetes Operator Reporter: Aitozi When testing the session cluster, I found that it always starts twice. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (FLINK-26989) Fix the potential lost of secondary resource when watching a namespace
[ https://issues.apache.org/jira/browse/FLINK-26989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515886#comment-17515886 ] Aitozi edited comment on FLINK-26989 at 4/1/22 12:03 PM: - If there's one namespace, the event source is not duplicated for the same type, so this is not a bug; closing~ was (Author: aitozi): If there's a namespace, the eventsource do not duplicated for the same type, so not a bug, closing~ > Fix the potential lost of secondary resource when watching a namespace > -- > > Key: FLINK-26989 > URL: https://issues.apache.org/jira/browse/FLINK-26989 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Reporter: Aitozi >Priority: Major > Labels: pull-request-available > > The event source is register under the name of namespace when watching > namespace is not empty, But when we get from the context, we use the > different condition, It may make it miss the target event source -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (FLINK-26989) Fix the potential lost of secondary resource when watching a namespace
[ https://issues.apache.org/jira/browse/FLINK-26989?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aitozi closed FLINK-26989. -- Resolution: Not A Problem If there's one namespace, the event source is not duplicated for the same type, so this is not a bug; closing~ > Fix the potential lost of secondary resource when watching a namespace > -- > > Key: FLINK-26989 > URL: https://issues.apache.org/jira/browse/FLINK-26989 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Reporter: Aitozi >Priority: Major > Labels: pull-request-available > > The event source is register under the name of namespace when watching > namespace is not empty, But when we get from the context, we use the > different condition, It may make it miss the target event source -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26989) Fix the potential lost of secondary resource when watching a namespace
Aitozi created FLINK-26989: -- Summary: Fix the potential lost of secondary resource when watching a namespace Key: FLINK-26989 URL: https://issues.apache.org/jira/browse/FLINK-26989 Project: Flink Issue Type: Bug Components: Kubernetes Operator Reporter: Aitozi The event source is registered under the namespace name when the watched namespace is not empty, but when we get it from the context we use a different condition, which may make it miss the target event source. -- This message was sent by Atlassian Jira (v8.20.1#820001)
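The bug pattern described in FLINK-26989 (register under one key, look up under a different condition) can be illustrated with a minimal sketch. This is not the operator's actual event-source code; the class and method names are hypothetical:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: an event source registered under the namespace name is
// missed when the lookup uses a different key (e.g. the resource type name).
public class EventSourceRegistry {
    private final Map<String, Object> sources = new HashMap<>();

    /** Registration keys the source by namespace, per the ticket description. */
    public void register(String namespace, Object source) {
        sources.put(namespace, source);
    }

    /** Buggy variant: looks up by a different condition, so it can return null. */
    public Object lookupByType(String typeName) {
        return sources.get(typeName);
    }

    /** Fixed variant: uses the same key that registration used. */
    public Object lookupByNamespace(String namespace) {
        return sources.get(namespace);
    }
}
```

The ticket was later closed as Not A Problem because with a single namespace the event source is not duplicated per type, but the sketch shows why the mismatched-key pattern looked dangerous.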
[jira] [Commented] (FLINK-26915) Extend the Reconciler and Observer interface
[ https://issues.apache.org/jira/browse/FLINK-26915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17514154#comment-17514154 ] Aitozi commented on FLINK-26915: cc [~gyfora] [~wangyang0918] > Extend the Reconciler and Observer interface > > > Key: FLINK-26915 > URL: https://issues.apache.org/jira/browse/FLINK-26915 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Aitozi >Priority: Major > Fix For: kubernetes-operator-1.0.0 > > > As discussed in > [comments|https://github.com/apache/flink-kubernetes-operator/pull/112#discussion_r835762111], > I proposed make two changes to the Reconciler and Observer > # directly return the UpdateControl from the reconciler, because the > reconciler can in charge of the Update behavior, By this, we dont have to > infer the update control in the controller > # Make the params generic and extends from the ReconcilerContext and > ObserverContext. which will be easy for different controller to ship their > own objects for reconcile and observer. For example, in the FlinkSessionJob > case, we need to get the effective config from the FlinkDeployment first and > also pass the FlinkDeployment to the reconciler. > After the change, the reconciler will look like this: > {code:java} > public interface Reconciler> { > UpdateControl reconcile(CR cr, CTX context) throws Exception; > DeleteControl cleanup(CR cr, CTX ctx); > }{code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-26915) Extend the Reconciler and Observer interface
[ https://issues.apache.org/jira/browse/FLINK-26915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aitozi updated FLINK-26915: --- Parent: FLINK-26784 Issue Type: Sub-task (was: Improvement) > Extend the Reconciler and Observer interface > > > Key: FLINK-26915 > URL: https://issues.apache.org/jira/browse/FLINK-26915 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator > Reporter: Aitozi >Priority: Major > Fix For: kubernetes-operator-1.0.0 > > > As discussed in > [comments|https://github.com/apache/flink-kubernetes-operator/pull/112#discussion_r835762111], > I proposed make two changes to the Reconciler and Observer > # directly return the UpdateControl from the reconciler, because the > reconciler can in charge of the Update behavior, By this, we dont have to > infer the update control in the controller > # Make the params generic and extends from the ReconcilerContext and > ObserverContext. which will be easy for different controller to ship their > own objects for reconcile and observer. For example, in the FlinkSessionJob > case, we need to get the effective config from the FlinkDeployment first and > also pass the FlinkDeployment to the reconciler. > After the change, the reconciler will look like this: > {code:java} > public interface Reconciler> { > UpdateControl reconcile(CR cr, CTX context) throws Exception; > DeleteControl cleanup(CR cr, CTX ctx); > }{code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-26915) Extend the Reconciler and Observer interface
[ https://issues.apache.org/jira/browse/FLINK-26915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aitozi updated FLINK-26915: --- Description: As discussed in [comments|https://github.com/apache/flink-kubernetes-operator/pull/112#discussion_r835762111], I proposed make two changes to the Reconciler and Observer # directly return the UpdateControl from the reconciler, because the reconciler can in charge of the Update behavior, By this, we dont have to infer the update control in the controller # Make the params generic and extends from the ReconcilerContext and ObserverContext. which will be easy for different controller to ship their own objects for reconcile and observer. For example, in the FlinkSessionJob case, we need to get the effective config from the FlinkDeployment first and also pass the FlinkDeployment to the reconciler. After the change, the reconciler will look like this: {code:java} public interface Reconciler> { UpdateControl reconcile(CR cr, CTX context) throws Exception; DeleteControl cleanup(CR cr, CTX ctx); }{code} was: As discussed in [comments|https://github.com/apache/flink-kubernetes-operator/pull/112#discussion_r835762111], I proposed make two changes to the Reconciler and Observer # directly return the UpdateControl from the reconciler, because the reconciler can in charge of the Update behavior, By this, we dont have to infer the update control in the controller # Make the params generic and extends from the ReconcilerContext and ObserverContext. which will be easy for different controller to ship their own objects for reconcile and observer. For example, in the FlinkSessionJob case, we need to get the effective config from the FlinkDeployment first and also pass the FlinkDeployment to the reconciler. 
After the change, the reconciler will look like this: {{}} {code:java} {code} {{public interface Reconciler> \{ UpdateControl reconcile(CR cr, CTX context) throws Exception; DeleteControl cleanup(CR cr, CTX ctx); } }} {{}} > Extend the Reconciler and Observer interface > > > Key: FLINK-26915 > URL: https://issues.apache.org/jira/browse/FLINK-26915 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Aitozi >Priority: Major > Fix For: kubernetes-operator-1.0.0 > > > As discussed in > [comments|https://github.com/apache/flink-kubernetes-operator/pull/112#discussion_r835762111], > I proposed make two changes to the Reconciler and Observer > # directly return the UpdateControl from the reconciler, because the > reconciler can in charge of the Update behavior, By this, we dont have to > infer the update control in the controller > # Make the params generic and extends from the ReconcilerContext and > ObserverContext. which will be easy for different controller to ship their > own objects for reconcile and observer. For example, in the FlinkSessionJob > case, we need to get the effective config from the FlinkDeployment first and > also pass the FlinkDeployment to the reconciler. > After the change, the reconciler will look like this: > {code:java} > public interface Reconciler> { > UpdateControl reconcile(CR cr, CTX context) throws Exception; > DeleteControl cleanup(CR cr, CTX ctx); > }{code} > > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26915) Extend the Reconciler and Observer interface
Aitozi created FLINK-26915: -- Summary: Extend the Reconciler and Observer interface Key: FLINK-26915 URL: https://issues.apache.org/jira/browse/FLINK-26915 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Aitozi Fix For: kubernetes-operator-1.0.0 As discussed in [comments|https://github.com/apache/flink-kubernetes-operator/pull/112#discussion_r835762111], I propose making two changes to the Reconciler and Observer: # directly return the UpdateControl from the reconciler, because the reconciler can be in charge of the update behavior; this way, we don't have to infer the update control in the controller # make the params generic, extending from the ReconcilerContext and ObserverContext, which will make it easy for different controllers to ship their own objects for reconcile and observe. For example, in the FlinkSessionJob case, we need to get the effective config from the FlinkDeployment first and also pass the FlinkDeployment to the reconciler. After the change, the reconciler will look like this: {code:java} public interface Reconciler<CR, CTX> { UpdateControl reconcile(CR cr, CTX context) throws Exception; DeleteControl cleanup(CR cr, CTX ctx); }{code} -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-18356) flink-table-planner Exit code 137 returned from process
[ https://issues.apache.org/jira/browse/FLINK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513799#comment-17513799 ] Aitozi commented on FLINK-18356: [~martijnvisser] Thanks for the information. I have addressed the OOM error by applying the latest patches from release-1.14; I'm not sure which commit solved it. Anyway, I can run the pipeline now. > flink-table-planner Exit code 137 returned from process > --- > > Key: FLINK-18356 > URL: https://issues.apache.org/jira/browse/FLINK-18356 > Project: Flink > Issue Type: Bug > Components: Build System / Azure Pipelines, Tests >Affects Versions: 1.12.0, 1.13.0, 1.14.0, 1.15.0 >Reporter: Piotr Nowojski >Assignee: Martijn Visser >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.15.0 > > Attachments: 1234.jpg, app-profiling_4.gif > > > {noformat} > = test session starts > == > platform linux -- Python 3.7.3, pytest-5.4.3, py-1.8.2, pluggy-0.13.1 > cachedir: .tox/py37-cython/.pytest_cache > rootdir: /__w/3/s/flink-python > collected 568 items > pyflink/common/tests/test_configuration.py ..[ > 1%] > pyflink/common/tests/test_execution_config.py ...[ > 5%] > pyflink/dataset/tests/test_execution_environment.py . > ##[error]Exit code 137 returned from process: file name '/bin/docker', > arguments 'exec -i -u 1002 > 97fc4e22522d2ced1f4d23096b8929045d083dd0a99a4233a8b20d0489e9bddb > /__a/externals/node/bin/node /__w/_temp/containerHandlerInvoker.js'. > Finishing: Test - python > {noformat} > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=3729&view=logs&j=9cada3cb-c1d3-5621-16da-0f718fb86602&t=8d78fe4f-d658-5c70-12f8-4921589024c3 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26889) Eliminate the duplicated construct for FlinkOperatorConfiguration in test
Aitozi created FLINK-26889: -- Summary: Eliminate the duplicated construct for FlinkOperatorConfiguration in test Key: FLINK-26889 URL: https://issues.apache.org/jira/browse/FLINK-26889 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Aitozi A minor improvement to reduce the boilerplate code -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-18356) flink-table-planner Exit code 137 returned from process
[ https://issues.apache.org/jira/browse/FLINK-18356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17513286#comment-17513286 ] Aitozi commented on FLINK-18356: Hi, I'm running the CI tests against release-1.14 locally and still hit the problem with exit code 137. Is release-1.14 missing some fix for it? > flink-table-planner Exit code 137 returned from process > --- > > Key: FLINK-18356 > URL: https://issues.apache.org/jira/browse/FLINK-18356 > Project: Flink > Issue Type: Bug > Components: Build System / Azure Pipelines, Tests >Affects Versions: 1.12.0, 1.13.0, 1.14.0, 1.15.0 >Reporter: Piotr Nowojski >Assignee: Martijn Visser >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.15.0 > > Attachments: 1234.jpg, app-profiling_4.gif > > > {noformat} > = test session starts > == > platform linux -- Python 3.7.3, pytest-5.4.3, py-1.8.2, pluggy-0.13.1 > cachedir: .tox/py37-cython/.pytest_cache > rootdir: /__w/3/s/flink-python > collected 568 items > pyflink/common/tests/test_configuration.py ..[ > 1%] > pyflink/common/tests/test_execution_config.py ...[ > 5%] > pyflink/dataset/tests/test_execution_environment.py . > ##[error]Exit code 137 returned from process: file name '/bin/docker', > arguments 'exec -i -u 1002 > 97fc4e22522d2ced1f4d23096b8929045d083dd0a99a4233a8b20d0489e9bddb > /__a/externals/node/bin/node /__w/_temp/containerHandlerInvoker.js'. > Finishing: Test - python > {noformat} > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=3729&view=logs&j=9cada3cb-c1d3-5621-16da-0f718fb86602&t=8d78fe4f-d658-5c70-12f8-4921589024c3 -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-26676) Set ClusterIP service type when watching specific namespaces
[ https://issues.apache.org/jira/browse/FLINK-26676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512832#comment-17512832 ] Aitozi commented on FLINK-26676: Hi, what error will we get if we don't set it to ClusterIP? > Set ClusterIP service type when watching specific namespaces > > > Key: FLINK-26676 > URL: https://issues.apache.org/jira/browse/FLINK-26676 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Gyula Fora >Assignee: Gyula Fora >Priority: Major > Labels: pull-request-available > > As noted in this PR > [https://github.com/apache/flink-kubernetes-operator/pull/42#issue-1159776739] > Users get service account related error messages unless we set: > {noformat} > kubernetes.rest-service.exposed.type: ClusterIP{noformat} > In cases where we are watching specific namespaces. > We should configure this automatically unless overridden by the user in the > flinkConfiguration for these cases. -- This message was sent by Atlassian Jira (v8.20.1#820001)
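The defaulting logic described in the issue can be sketched as follows. This is a minimal, hypothetical sketch using a plain Map in place of Flink's Configuration class; the config key is the `kubernetes.rest-service.exposed.type` option quoted in the issue, but the class and method names are illustrative only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RestServiceDefaulter {

    static final String REST_SERVICE_EXPOSED_TYPE = "kubernetes.rest-service.exposed.type";

    // When the operator watches specific namespaces (rather than all of them),
    // default the REST service exposure to ClusterIP unless the user already
    // set the option in their flinkConfiguration.
    static Map<String, String> withDefaults(Map<String, String> flinkConf, List<String> watchedNamespaces) {
        Map<String, String> effective = new HashMap<>(flinkConf);
        boolean watchingSpecificNamespaces = watchedNamespaces != null && !watchedNamespaces.isEmpty();
        if (watchingSpecificNamespaces) {
            // putIfAbsent keeps any user-provided override intact.
            effective.putIfAbsent(REST_SERVICE_EXPOSED_TYPE, "ClusterIP");
        }
        return effective;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(withDefaults(conf, Arrays.asList("ns1")).get(REST_SERVICE_EXPOSED_TYPE));
    }
}
```

The key design choice is `putIfAbsent`: the automatic ClusterIP default only kicks in when the user has not explicitly chosen an exposure type themselves.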
[jira] [Updated] (FLINK-26873) Align the helm chart version with the flink operator
[ https://issues.apache.org/jira/browse/FLINK-26873?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aitozi updated FLINK-26873: --- Component/s: Kubernetes Operator > Align the helm chart version with the flink operator > > > Key: FLINK-26873 > URL: https://issues.apache.org/jira/browse/FLINK-26873 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator > Reporter: Aitozi >Priority: Major > > Now the flink-operator helm chart version is 1.0.13. I think it should be > aligned to the flink-operator version during release -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26873) Align the helm chart version with the flink operator
Aitozi created FLINK-26873: -- Summary: Align the helm chart version with the flink operator Key: FLINK-26873 URL: https://issues.apache.org/jira/browse/FLINK-26873 Project: Flink Issue Type: Sub-task Reporter: Aitozi Now the flink-operator helm chart version is 1.0.13. I think it should be aligned to the flink-operator version during release -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-26873) Align the helm chart version with the flink operator
[ https://issues.apache.org/jira/browse/FLINK-26873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512748#comment-17512748 ] Aitozi commented on FLINK-26873: cc [~gyfora] > Align the helm chart version with the flink operator > > > Key: FLINK-26873 > URL: https://issues.apache.org/jira/browse/FLINK-26873 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Aitozi >Priority: Major > > Now the flink-operator helm chart version is 1.0.13. I think it should be > aligned to the flink-operator version during release -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-26871) Handle Session job spec change
[ https://issues.apache.org/jira/browse/FLINK-26871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512747#comment-17512747 ] Aitozi commented on FLINK-26871: I will work on this > Handle Session job spec change > --- > > Key: FLINK-26871 > URL: https://issues.apache.org/jira/browse/FLINK-26871 > Project: Flink > Issue Type: Sub-task >Reporter: Aitozi >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26871) Handle Session job spec change
Aitozi created FLINK-26871: -- Summary: Handle Session job spec change Key: FLINK-26871 URL: https://issues.apache.org/jira/browse/FLINK-26871 Project: Flink Issue Type: Sub-task Reporter: Aitozi -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26870) Implement session job observer
Aitozi created FLINK-26870: -- Summary: Implement session job observer Key: FLINK-26870 URL: https://issues.apache.org/jira/browse/FLINK-26870 Project: Flink Issue Type: Sub-task Reporter: Aitozi -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Closed] (FLINK-26807) The batch job not work well with Operator
[ https://issues.apache.org/jira/browse/FLINK-26807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aitozi closed FLINK-26807. -- Resolution: Duplicate > The batch job not work well with Operator > - > > Key: FLINK-26807 > URL: https://issues.apache.org/jira/browse/FLINK-26807 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator > Reporter: Aitozi >Priority: Major > > When I test the batch job or finite streaming job, the flinkdep will be an > orphaned resource and keep listing job after job finished. Because the > JobManagerDeploymentStatus will not be sync again. > I think we should sync the global terminated status from the application job, > and do the clean up work for the flinkdep resource -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-26807) The batch job not work well with Operator
[ https://issues.apache.org/jira/browse/FLINK-26807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510991#comment-17510991 ] Aitozi commented on FLINK-26807: Got it, I will look over that discussion. Closing this one as a duplicate. > The batch job not work well with Operator > - > > Key: FLINK-26807 > URL: https://issues.apache.org/jira/browse/FLINK-26807 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator >Reporter: Aitozi >Priority: Major > > When I test a batch job or a finite streaming job, the flinkdep will become an > orphaned resource and keep listing the job after the job finishes, because the > JobManagerDeploymentStatus will not be synced again. > I think we should sync the globally terminated status from the application job > and do the cleanup work for the flinkdep resource -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-26807) The batch job not work well with Operator
[ https://issues.apache.org/jira/browse/FLINK-26807?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Aitozi updated FLINK-26807: --- Description: When I test the batch job or finite streaming job, the flinkdep will be an orphaned resource and keep listing job after job finished. Because the JobManagerDeploymentStatus will not be sync again. I think we should sync the global terminated status from the application job, and do the clean up work for the flinkdep resource was: When I test the batch job or finite streaming job, the flinkdep will be an orphaned resource and keep listing job. Because the JobManagerDeploymentStatus will not be sync again. I think we should sync the global terminated status from the application job, and do the clean up work for the flinkdep resource > The batch job not work well with Operator > - > > Key: FLINK-26807 > URL: https://issues.apache.org/jira/browse/FLINK-26807 > Project: Flink > Issue Type: Sub-task > Components: Kubernetes Operator > Reporter: Aitozi >Priority: Major > > When I test the batch job or finite streaming job, the flinkdep will be an > orphaned resource and keep listing job after job finished. Because the > JobManagerDeploymentStatus will not be sync again. > I think we should sync the global terminated status from the application job, > and do the clean up work for the flinkdep resource -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-26807) The batch job not work well with Operator
Aitozi created FLINK-26807: -- Summary: The batch job not work well with Operator Key: FLINK-26807 URL: https://issues.apache.org/jira/browse/FLINK-26807 Project: Flink Issue Type: Sub-task Components: Kubernetes Operator Reporter: Aitozi When I test a batch job or a finite streaming job, the flinkdep will become an orphaned resource and keep listing the job, because the JobManagerDeploymentStatus will not be synced again. I think we should sync the globally terminated status from the application job and do the cleanup work for the flinkdep resource -- This message was sent by Atlassian Jira (v8.20.1#820001)
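The cleanup idea in the description can be sketched as a small state check. This is a hypothetical sketch: the JobStatus enum and the shouldCleanup helper are illustrative stand-ins, not the operator's actual types.

```java
public class TerminalStateSync {

    // Illustrative subset of Flink job states; FINISHED, FAILED and CANCELED
    // are globally terminal, so the cluster will not produce further updates.
    enum JobStatus {
        RUNNING, FINISHED, FAILED, CANCELED;

        boolean isGloballyTerminal() {
            return this != RUNNING;
        }
    }

    // Once the application job reaches a globally terminal state, the operator
    // should stop polling the JobManager (which is about to go away) and clean
    // up the flinkdep resource instead of leaving it orphaned.
    static boolean shouldCleanup(JobStatus observed) {
        return observed.isGloballyTerminal();
    }

    public static void main(String[] args) {
        System.out.println(shouldCleanup(JobStatus.FINISHED)); // true for a completed batch job
        System.out.println(shouldCleanup(JobStatus.RUNNING));  // false while still running
    }
}
```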
[jira] [Comment Edited] (FLINK-25480) Create dashboard/monitoring to see resource usage per E2E test
[ https://issues.apache.org/jira/browse/FLINK-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510448#comment-17510448 ] Aitozi edited comment on FLINK-25480 at 3/22/22, 12:16 PM: --- FYI, I encountered the same problem with 1.14.4 when running the tests in a container. I tested in a 16C 32G container, and the {{mvn -Dflink.forkCount=2 -Dflink.forkCountTestPackage=2 -Dmaven.test.failure.ignore=true verify}} command finally exited with 137. In the meantime, I opened another screen and ran {{vsar --cpu --mem -l}} to monitor the memory usage, but I still could not catch the memory spike. Hope this is helpful to you guys. I'm curious about the root cause, because it stopped me from building our stable CI pipeline. was (Author: aitozi): FYI, I encounter the same problem with 1.14.4 when running test in container. I test in 16C 32G container, and {{mvn -Dflink.forkCount=2 -Dflink.forkCountTestPackage=2 -Dmaven.test.failure.ignore=true verify}} command exit 137 finally. At the meantime, I opened another screen to run {{vsar --cpu --mem -l}} to monitor the memory usage. But I still not catch the memory stroke. Hope to be helpful to your guys. I'm curious about it, because it stop me from building our stable CI pipeline. > Create dashboard/monitoring to see resource usage per E2E test > -- > > Key: FLINK-25480 > URL: https://issues.apache.org/jira/browse/FLINK-25480 > Project: Flink > Issue Type: Improvement > Components: Test Infrastructure >Affects Versions: 1.15.0, 1.13.6, 1.14.3 >Reporter: Martijn Visser >Priority: Critical > Labels: test-stability > > Over the past couple of weeks, we've encountered multiple problems with tests > failing due to out-of-memory errors and/or exit code 137 happening. These are > happening both on Alibaba CI machines, as well as Azure hosted agents. For > the Alibaba CI machines, we've mitigated the problem by reducing the number > of workers per CI machine from 7 to 5. 
These workers can spin up multiple > Docker containers, especially with Testcontainers getting used more and more. > If we can get insights in the resource usage per end-to-end test, it will > also help in debugging test infrastructure problems more quickly. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (FLINK-25480) Create dashboard/monitoring to see resource usage per E2E test
[ https://issues.apache.org/jira/browse/FLINK-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510448#comment-17510448 ] Aitozi edited comment on FLINK-25480 at 3/22/22, 12:15 PM: --- FYI, I encounter the same problem with 1.14.4 when running test in container. I test in 16C 32G container, and {{mvn -Dflink.forkCount=2 -Dflink.forkCountTestPackage=2 -Dmaven.test.failure.ignore=true verify}} command exit 137 finally. At the meantime, I opened another screen to run {{vsar --cpu --mem -l}} to monitor the memory usage. But I still not catch the memory stroke. Hope to be helpful to your guys. I'm curious about it, because it stop me from building our stable CI pipeline. was (Author: aitozi): FYI, I encounter the same problem with 1.14.4 when running test in container. I test in 16C 32G container, and {{mvn verify}} command exit 137 finally. At the meantime, I opened another screen to run {{vsar --cpu --mem -l}} to monitor the memory usage. But I still not catch the memory stroke. Hope to be helpful to your guys. I'm curious about it, because it stop me from building our stable CI pipeline. > Create dashboard/monitoring to see resource usage per E2E test > -- > > Key: FLINK-25480 > URL: https://issues.apache.org/jira/browse/FLINK-25480 > Project: Flink > Issue Type: Improvement > Components: Test Infrastructure >Affects Versions: 1.15.0, 1.13.6, 1.14.3 >Reporter: Martijn Visser >Priority: Critical > Labels: test-stability > > Over the past couple of weeks, we've encountered multiple problems with tests > failing due to out-of-memory errors and/or exit code 137 happening. These are > happening both on Alibaba CI machines, as well as Azure hosted agents. For > the Alibaba CI machines, we've mitigated the problem by reducing the number > of workers per CI machine from 7 to 5. These workers can spin up multiple > Docker containers, especially with Testcontainers getting used more and more. 
> If we can get insights in the resource usage per end-to-end test, it will > also help in debugging test infrastructure problems more quickly. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Comment Edited] (FLINK-25480) Create dashboard/monitoring to see resource usage per E2E test
[ https://issues.apache.org/jira/browse/FLINK-25480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17510448#comment-17510448 ] Aitozi edited comment on FLINK-25480 at 3/22/22, 12:14 PM: --- FYI, I encounter the same problem with 1.14.4 when running test in container. I test in 16C 32G container, and {{mvn verify}} command exit 137 finally. At the meantime, I opened another screen to run {{vsar --cpu --mem -l}} to monitor the memory usage. But I still not catch the memory stroke. Hope to be helpful to your guys. I'm curious about it, because it stop me from building our stable CI pipeline. was (Author: aitozi): FYI, I encounter the same problem with 1.14.4 when running test in container. I test in 16C 32G container, and {{mvn verify}} command exit 137 finally. Then I open another screen to run {{vsar --cpu --mem -l}} . But I still not catch the memory stroke. I'm curious about it. > Create dashboard/monitoring to see resource usage per E2E test > -- > > Key: FLINK-25480 > URL: https://issues.apache.org/jira/browse/FLINK-25480 > Project: Flink > Issue Type: Improvement > Components: Test Infrastructure >Affects Versions: 1.15.0, 1.13.6, 1.14.3 >Reporter: Martijn Visser >Priority: Critical > Labels: test-stability > > Over the past couple of weeks, we've encountered multiple problems with tests > failing due to out-of-memory errors and/or exit code 137 happening. These are > happening both on Alibaba CI machines, as well as Azure hosted agents. For > the Alibaba CI machines, we've mitigated the problem by reducing the number > of workers per CI machine from 7 to 5. These workers can spin up multiple > Docker containers, especially with Testcontainers getting used more and more. > If we can get insights in the resource usage per end-to-end test, it will > also help in debugging test infrastructure problems more quickly. -- This message was sent by Atlassian Jira (v8.20.1#820001)