[jira] [Commented] (FLINK-35376) When flink submits the job by calling the rest api, the dependent jar package generated to the tmp is not removed
[ https://issues.apache.org/jira/browse/FLINK-35376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847168#comment-17847168 ] Biao Geng commented on FLINK-35376: --- hi [~18380428...@163.com], in the implementation of `extractContainedLibraries()`, `tempFile.deleteOnExit();` is called, so when the job exits, these temp files should be cleared. (See https://github.com/apache/flink/blob/f1ecb9e4701d612050da54589a8f561857debf34/flink-clients/src/main/java/org/apache/flink/client/program/PackagedProgram.java#L585 for details.) Have you encountered any residual jars when the job exits? > When flink submits the job by calling the rest api, the dependent jar package > generated to the tmp is not removed > - > > Key: FLINK-35376 > URL: https://issues.apache.org/jira/browse/FLINK-35376 > Project: Flink > Issue Type: Bug > Components: Runtime / REST >Affects Versions: 1.14.4 >Reporter: 中国无锡周良 >Priority: Major > > When org.apache.flink.runtime.webmonitor.handlers.JarRunHandler#handleRequest > receives a job submission request, > {code:java} > final JarHandlerContext context = JarHandlerContext.fromRequest(request, > jarDir, log); > context.applyToConfiguration(effectiveConfiguration); {code} > the toPackagedProgram(configuration) method extracts the dependency jars into > the tmp directory. > Then, > final PackagedProgram program = > context.toPackagedProgram(effectiveConfiguration); > extracts the dependent jars again, and inside the > org.apache.flink.client.program.PackagedProgram#PackagedProgram constructor, > {code:java} > this.extractedTempLibraries = > this.jarFile == null > ? Collections.emptyList() > : extractContainedLibraries(this.jarFile); {code} > the first extraction is overwritten by the second one. > > As a result, the dependent jar packages extracted the first time cannot be > deleted on close.
If job status detection automatically invokes the rest api to > resubmit the job, jar files will keep accumulating in > tmp. > > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
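The reported leak can be modeled outside Flink: if a second extraction is tracked while the first batch of extracted temp files is never closed, close() only ever deletes the second batch. A minimal Python sketch of that pattern (the class and field names below are hypothetical analogues of PackagedProgram, not PyFlink API):

```python
import os
import tempfile


class PackagedProgramLike:
    """Toy analogue of PackagedProgram: each instance extracts its nested
    jars into tmp and only deletes its own batch on close()."""

    def __init__(self):
        # analogous to extractContainedLibraries(): unpack deps into tmp
        self.extracted_temp_libraries = []
        for _ in range(2):
            fd, path = tempfile.mkstemp(suffix=".jar")
            os.close(fd)
            self.extracted_temp_libraries.append(path)

    def close(self):
        # analogous to deleting the extracted libraries on program close
        for path in self.extracted_temp_libraries:
            os.remove(path)


first = PackagedProgramLike()   # e.g. created inside applyToConfiguration()
second = PackagedProgramLike()  # e.g. the explicit toPackagedProgram() call

second.close()
# only the second batch is gone; the first batch lingers in tmp
leftover = [p for p in first.extracted_temp_libraries if os.path.exists(p)]
print(len(leftover))  # 2

first.close()  # closing the first program too is what fully cleans up
```

If only one of the two programs is ever closed (or the process is resubmitted in a loop), the untracked batch accumulates, which matches the reporter's observation.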
[jira] [Commented] (FLINK-35288) Flink Restart Strategy does not work as documented
[ https://issues.apache.org/jira/browse/FLINK-35288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844106#comment-17844106 ] Biao Geng commented on FLINK-35288: --- https://cwiki.apache.org/confluence/display/FLINK/FLIP-364%3A+Improve+the+exponential-delay+restart-strategy#FLIP364:Improvetheexponentialdelayrestartstrategy-1.2Differentsemanticsofrestartattemptscauseregionfailovernotasexpected In the above FLIP, there is some relevant discussion of the 'restart-strategy.fixed-delay.attempts' problem. When 'region-failover' (the default value of *jobmanager.execution.failover-strategy*) is enabled, the org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler#handleFailure method is called once a subtask in a region fails, which consumes the job-level 'restart-strategy.fixed-delay.attempts'. As a result, the restart strategy may not work as the documentation describes. We have also met such a case in the production environment. > Flink Restart Strategy does not work as documented > -- > > Key: FLINK-35288 > URL: https://issues.apache.org/jira/browse/FLINK-35288 > Project: Flink > Issue Type: Bug >Reporter: Keshav Kansal >Priority: Minor > > As per the documentation when using the Fixed Delay Restart Strategy, the > *restart-strategy.fixed-delay.attempts* defines the "The number of times that > Flink retries the execution before the job is declared as failed if has been > set to fixed-delay". > However in reality it is the *maximum-total-task-failures*, i.e. it is > possible that the job does not even attempt to restart. 
> This is as documented in > https://cwiki.apache.org/confluence/display/FLINK/FLIP-1%3A+Fine+Grained+Recovery+from+Task+Failures > If there is an outage at the Sink level, for example an Elasticsearch outage, all > the independent tasks might fail and the job will immediately fail without > restart (if restart-strategy.fixed-delay.attempts is set lower than or equal to > the parallelism of the sink). -- This message was sent by Atlassian Jira (v8.20.10#820010)
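The interaction described in this thread can be sketched as a toy model (plain Python, not Flink code): under region failover, every failed sink region consumes one attempt from the job-level fixed-delay budget, so a single sink outage can exhaust the budget at once when the attempt count is at or below the sink parallelism.

```python
def job_survives_sink_outage(sink_parallelism: int, fixed_delay_attempts: int) -> bool:
    """Toy model of the reported semantics: a full sink outage fails every
    independent sink region simultaneously, and each region failure consumes
    one job-level restart attempt. The job only restarts if budget remains."""
    attempts_consumed = sink_parallelism
    return attempts_consumed < fixed_delay_attempts


# attempts equal to (or below) the sink parallelism: the job fails outright
print(job_survives_sink_outage(sink_parallelism=8, fixed_delay_attempts=8))   # False
# attempts well above the parallelism: at least one restart actually happens
print(job_survives_sink_outage(sink_parallelism=8, fixed_delay_attempts=20))  # True
```

This is only an illustration of the counting semantics discussed above; the real accounting lives in ExecutionFailureHandler and the configured restart backoff.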
[jira] [Updated] (FLINK-35273) PyFlink's LocalZonedTimestampType should respect timezone set by set_local_timezone
[ https://issues.apache.org/jira/browse/FLINK-35273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-35273: -- Description: The issue is from https://apache-flink.slack.com/archives/C065944F9M2/p1714134880878399 When using TIMESTAMP_LTZ in PyFlink while setting a different time zone, it turns out that the output does not show the expected result. Here is my test code: {code:python} from pyflink.datastream import StreamExecutionEnvironment from pyflink.common import Types, Configuration from pyflink.table import DataTypes, StreamTableEnvironment from datetime import datetime import pytz config = Configuration() config.set_string("python.client.executable", "/usr/local/Caskroom/miniconda/base/envs/myenv/bin/python") config.set_string("python.executable", "/usr/local/Caskroom/miniconda/base/envs/myenv/bin/python") env = StreamExecutionEnvironment.get_execution_environment(config) t_env = StreamTableEnvironment.create(env) t_env.get_config().set_local_timezone("UTC") # t_env.get_config().set_local_timezone("GMT-08:00") input_table = t_env.from_elements( [ ( "elementA", datetime(year=2024, month=4, day=12, hour=8, minute=35), ), ( "elementB", datetime(year=2024, month=4, day=12, hour=8, minute=35, tzinfo=pytz.utc), # datetime(year=2024, month=4, day=12, hour=8, minute=35, tzinfo=pytz.timezone('America/New_York')), ), ], DataTypes.ROW( [ DataTypes.FIELD("name", DataTypes.STRING()), DataTypes.FIELD("timestamp", DataTypes.TIMESTAMP_LTZ(3)), ] ), ) input_table.execute().print() # SQL sql_result = t_env.execute_sql("CREATE VIEW MyView1 AS SELECT TO_TIMESTAMP_LTZ(171291090, 3);") t_env.execute_sql("CREATE TABLE Sink (`t` TIMESTAMP_LTZ) WITH ('connector'='print');") t_env.execute_sql("INSERT INTO Sink SELECT * FROM MyView1;") {code} The output is: {code:java} +----+----------+-------------------------+ | op | name | timestamp | +----+----------+-------------------------+ | +I | elementA | 2024-04-12 08:35:00.000 | | +I | elementB | 2024-04-12 16:35:00.000 | +----+----------+-------------------------+ 2 rows in set +I[2024-04-12T08:35:00Z] 
{code} In pyflink/table/types.py, the `LocalZonedTimestampType` class uses the following logic to convert a Python object to the SQL type: {code:python} EPOCH_ORDINAL = calendar.timegm(time.localtime(0)) * 10 ** 6 ... def to_sql_type(self, dt): if dt is not None: seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo else time.mktime(dt.timetuple())) return int(seconds) * 10 ** 6 + dt.microsecond + self.EPOCH_ORDINAL {code} It shows that EPOCH_ORDINAL is calculated when the Python VM starts and is not affected by the timezone set by `set_local_timezone`. > PyFlink's LocalZonedTimestampType should respect timezone set by > set_local_timezone > --- > > Key: FLINK-35273 > URL: https://issues.apache.org/jira/browse/FLINK-35273 > Project: Flink > Issue Type: Bug > Components: API / Python >Reporter: Biao Geng >Priority: Major > > The issue is from > https://apache-flink.slack.com/archives/C065944F9M2/p1714134880878399 > When using TIMESTAMP_LTZ in PyFlink while setting a different time zone, it > turns out that the output result does not show the expected result. 
> Here is my test codes: > {code:python} > from pyflink.datastream import StreamExecutionEnvironment > from pyflink.common import Types, Configuration > from pyflink.table import DataTypes, StreamTableEnvironment > from datetime import datetime > import pytz > config = Configuration() > config.set_string("python.client.executable", > "/usr/local/Caskroom/miniconda/base/envs/myenv/bin/python") > config.set_string("python.executable", > "/usr/local/Caskroom/miniconda/base/envs/myenv/bin/python") > env = StreamExecutionEnvironment.get_execution_environment(config) > t_env = StreamTableEnvironment.create(env) > t_env.get_config().set_local_timezone("UTC") > # t_env.get_config().set_local_timezone("GMT-08:00") > input_table = t_env.from_elements( > [ > ( > "elementA", > datetime(year=2024, month=4, day=12, hour=8, minute=35), > ), > ( > "elementB", > datetime(year=2024, month=4, day=12, hour=8, minute=35, > tzinfo=pytz.utc), > # datetime(year=2024, month=4, day=12, hour=8, minute=35, > tzinfo=pytz.timezone('America/New_York')), > ), > ], > DataTypes.ROW( > [ > DataTypes.FIELD("name",
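The root cause described above can be reproduced without Flink: the module-level EPOCH_ORDINAL constant is derived from the process timezone at import time, which `set_local_timezone` cannot influence. A standalone sketch (Unix-only, since it uses `time.tzset()` to switch the process timezone):

```python
import calendar
import os
import time


def epoch_ordinal() -> int:
    # mirrors the module-level constant in pyflink/table/types.py:
    # the offset of the local epoch from the UTC epoch, in microseconds
    return calendar.timegm(time.localtime(0)) * 10 ** 6


os.environ["TZ"] = "UTC"
time.tzset()
ordinal_utc = epoch_ordinal()

os.environ["TZ"] = "Etc/GMT+8"  # POSIX sign convention: this is UTC-08:00
time.tzset()
ordinal_gmt_minus_8 = epoch_ordinal()

# The constant depends on the *process* timezone, not on any Table API
# setting such as set_local_timezone() -- hence the wrong LTZ output.
print(ordinal_utc, ordinal_gmt_minus_8)
```

Under UTC the ordinal is 0; under UTC-08:00 it is -28800 seconds in microseconds, which is exactly the kind of fixed offset that leaks into `to_sql_type` regardless of the table config.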
[jira] [Updated] (FLINK-35273) PyFlink's LocalZonedTimestampType should respect timezone set by set_local_timezone
[ https://issues.apache.org/jira/browse/FLINK-35273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-35273: -- Component/s: API / Python > PyFlink's LocalZonedTimestampType should respect timezone set by > set_local_timezone > --- > > Key: FLINK-35273 > URL: https://issues.apache.org/jira/browse/FLINK-35273 > Project: Flink > Issue Type: Bug > Components: API / Python >Reporter: Biao Geng >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-35273) PyFlink's LocalZonedTimestampType should respect timezone set by set_local_timezone
Biao Geng created FLINK-35273: - Summary: PyFlink's LocalZonedTimestampType should respect timezone set by set_local_timezone Key: FLINK-35273 URL: https://issues.apache.org/jira/browse/FLINK-35273 Project: Flink Issue Type: Bug Reporter: Biao Geng -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-35192) operator oom
[ https://issues.apache.org/jira/browse/FLINK-35192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842322#comment-17842322 ] Biao Geng commented on FLINK-35192: --- [~stupid_pig], it looks like the pictures of the pmap are broken. But the glibc issue is well known; if that's the case, it makes sense to me to introduce jemalloc. > operator oom > > > Key: FLINK-35192 > URL: https://issues.apache.org/jira/browse/FLINK-35192 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.6.1 > Environment: jdk: openjdk11 > operator version: 1.6.1 >Reporter: chenyuzhi >Priority: Major > Labels: pull-request-available > Attachments: image-2024-04-22-15-47-49-455.png, > image-2024-04-22-15-52-51-600.png, image-2024-04-22-15-58-23-269.png, > image-2024-04-22-15-58-42-850.png, image-2024-04-30-16-47-07-289.png, > image-2024-04-30-17-11-24-974.png, screenshot-1.png, screenshot-2.png, > screenshot-3.png > > > The kubernetest operator docker process was killed by kernel cause out of > memory(the time is 2024.04.03: 18:16) > !image-2024-04-22-15-47-49-455.png! > Metrics: > the pod memory (RSS) is increasing slowly in the past 7 days: > !screenshot-1.png! > However the jvm memory metrics of operator not shown obvious anomaly: > !image-2024-04-22-15-58-23-269.png! > !image-2024-04-22-15-58-42-850.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-35192) operator oom
[ https://issues.apache.org/jira/browse/FLINK-35192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842294#comment-17842294 ] Biao Geng commented on FLINK-35192: --- Sure, created a [pr|https://github.com/apache/flink-kubernetes-operator/pull/822] for this. > operator oom > > > Key: FLINK-35192 > URL: https://issues.apache.org/jira/browse/FLINK-35192 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.6.1 > Environment: jdk: openjdk11 > operator version: 1.6.1 >Reporter: chenyuzhi >Priority: Major > Labels: pull-request-available > Attachments: image-2024-04-22-15-47-49-455.png, > image-2024-04-22-15-52-51-600.png, image-2024-04-22-15-58-23-269.png, > image-2024-04-22-15-58-42-850.png, screenshot-1.png, screenshot-2.png, > screenshot-3.png > > > The kubernetest operator docker process was killed by kernel cause out of > memory(the time is 2024.04.03: 18:16) > !image-2024-04-22-15-47-49-455.png! > Metrics: > the pod memory (RSS) is increasing slowly in the past 7 days: > !screenshot-1.png! > However the jvm memory metrics of operator not shown obvious anomaly: > !image-2024-04-22-15-58-23-269.png! > !image-2024-04-22-15-58-42-850.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Comment Edited] (FLINK-35212) PyFlink thread mode process just can run once in standalonesession mode
[ https://issues.apache.org/jira/browse/FLINK-35212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841934#comment-17841934 ] Biao Geng edited comment on FLINK-35212 at 4/29/24 10:22 AM: - Hi [~vonesec], thanks for creating the detailed bug report! I created a brand new env on my local computer and followed the instructions, but I cannot reproduce the exception you met. I took a look at the [pemja code|https://github.com/alibaba/pemja/blob/release-0.4-1-rc1/src/main/java/pemja/core/PythonInterpreter.java#L336]; as we can see, the MainInterpreter is a singleton, which means System.load('pemja_core.xxx.so') should happen only once in the JVM. That should rule out the exception in this jira, whose cause is loading the same .so twice, so it is somewhat strange that you hit it. Have you changed anything under the pyflink directory of your python site-packages? But when running your code, I do meet another exception when trying `flink run` a second time: {quote}2024-04-29 17:41:03,054 WARN org.apache.flink.runtime.taskmanager.Task [] - Source: *anonymous_python-input-format$1*[1] -> Calc[2] -> ConstraintEnforcer[3] -> TableToDataSteam -> Map -> Sink: Print to Std. 
Out (1/1)#0 (4b45f9997b18155682ed0218dcf0afbb_cbc357ccb763df2852fee8c4fc7d55f2_0_0) switched from RUNNING to FAILED with failure cause: java.lang.ClassCastException: pemja.core.object.PyIterator cannot be cast to pemja.core.object.PyIterator at org.apache.flink.streaming.api.operators.python.embedded.AbstractOneInputEmbeddedPythonFunctionOperator.processElement(AbstractOneInputEmbeddedPythonFunctionOperator.java:156) ~[blob_p-ce9fc762fcfc43a2cf28c0d69a738da7efd36b10-323be2d12056985907101ebb52aff326:1.18.1] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.table.runtime.operators.sink.OutputConversionOperator.processElement(OutputConversionOperator.java:105) ~[flink-table-runtime-1.18.1.jar:1.18.1] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.table.runtime.util.StreamRecordCollector.collect(StreamRecordCollector.java:44) ~[flink-table-runtime-1.18.1.jar:1.18.1] at org.apache.flink.table.runtime.operators.sink.ConstraintEnforcer.processElement(ConstraintEnforcer.java:247) ~[flink-table-runtime-1.18.1.jar:1.18.1] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75) ~[flink-dist-1.18.1.jar:1.18.1] at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29) ~[flink-dist-1.18.1.jar:1.18.1] at StreamExecCalc$6.processElement(Unknown Source) ~[?:?] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:425) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:520) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.api.operators.StreamSourceContexts$SwitchingOnClose.collect(StreamSourceContexts.java:110) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:99) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:114) ~[flink-dist-1.18.1.jar:1.18.1] at
[jira] [Commented] (FLINK-35212) PyFlink thread mode process just can run once in standalonesession mode
[ https://issues.apache.org/jira/browse/FLINK-35212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841934#comment-17841934 ] Biao Geng commented on FLINK-35212: --- Hi [~vonesec], thanks for creating the detailed bug report! I created a brand new env on my local computer and followed the instructions, but I cannot reproduce the exception you met. I took a look at the [pemja code|https://github.com/alibaba/pemja/blob/release-0.4-1-rc1/src/main/java/pemja/core/PythonInterpreter.java#L336]; as we can see, the MainInterpreter is a singleton, which means System.load('pemja_core.xxx.so') should happen only once in the JVM. That should rule out the exception in this jira, whose cause is loading the same .so twice, so it is somewhat strange that you hit it. Have you changed anything under the pyflink directory of your python site-packages? But when running your code, I do meet another exception when trying `flink run` a second time: {quote}2024-04-29 17:41:03,054 WARN org.apache.flink.runtime.taskmanager.Task [] - Source: *anonymous_python-input-format$1*[1] -> Calc[2] -> ConstraintEnforcer[3] -> TableToDataSteam -> Map -> Sink: Print to Std. 
Out (1/1)#0 (4b45f9997b18155682ed0218dcf0afbb_cbc357ccb763df2852fee8c4fc7d55f2_0_0) switched from RUNNING to FAILED with failure cause: java.lang.ClassCastException: pemja.core.object.PyIterator cannot be cast to pemja.core.object.PyIterator at org.apache.flink.streaming.api.operators.python.embedded.AbstractOneInputEmbeddedPythonFunctionOperator.processElement(AbstractOneInputEmbeddedPythonFunctionOperator.java:156) ~[blob_p-ce9fc762fcfc43a2cf28c0d69a738da7efd36b10-323be2d12056985907101ebb52aff326:1.18.1] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.table.runtime.operators.sink.OutputConversionOperator.processElement(OutputConversionOperator.java:105) ~[flink-table-runtime-1.18.1.jar:1.18.1] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.table.runtime.util.StreamRecordCollector.collect(StreamRecordCollector.java:44) ~[flink-table-runtime-1.18.1.jar:1.18.1] at org.apache.flink.table.runtime.operators.sink.ConstraintEnforcer.processElement(ConstraintEnforcer.java:247) ~[flink-table-runtime-1.18.1.jar:1.18.1] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75) ~[flink-dist-1.18.1.jar:1.18.1] at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29) ~[flink-dist-1.18.1.jar:1.18.1] at StreamExecCalc$6.processElement(Unknown Source) ~[?:?] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:425) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:520) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.api.operators.StreamSourceContexts$SwitchingOnClose.collect(StreamSourceContexts.java:110) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:99) ~[flink-dist-1.18.1.jar:1.18.1] at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:114) ~[flink-dist-1.18.1.jar:1.18.1] at
[jira] [Commented] (FLINK-35192) operator oom
[ https://issues.apache.org/jira/browse/FLINK-35192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841129#comment-17841129 ] Biao Geng commented on FLINK-35192: --- !screenshot-3.png! According to the flink k8s operator's code, deleteOnExit() is called when creating config files or pod template files. It looks like this could lead to a memory leak if the operator pod runs for a long time. In the operator's FlinkConfigManager implementation, we already clean up these temp files/dirs, so maybe we can safely remove the deleteOnExit() usage? cc [~gyfora] Also, from the attached yaml, it looks like a custom flink k8s op image (gdc-flink-kubernetes-operator:1.6.1-GDC1.0.2) is used. [~stupid_pig] would you mind checking whether your code calls methods like deleteOnExit, if you have customized changes to the operator? > operator oom > > > Key: FLINK-35192 > URL: https://issues.apache.org/jira/browse/FLINK-35192 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.6.1 > Environment: jdk: openjdk11 > operator version: 1.6.1 >Reporter: chenyuzhi >Priority: Major > Attachments: image-2024-04-22-15-47-49-455.png, > image-2024-04-22-15-52-51-600.png, image-2024-04-22-15-58-23-269.png, > image-2024-04-22-15-58-42-850.png, screenshot-1.png, screenshot-2.png, > screenshot-3.png > > > The kubernetest operator docker process was killed by kernel cause out of > memory(the time is 2024.04.03: 18:16) > !image-2024-04-22-15-47-49-455.png! > Metrics: > the pod memory (RSS) is increasing slowly in the past 7 days: > !screenshot-1.png! > However the jvm memory metrics of operator not shown obvious anomaly: > !image-2024-04-22-15-58-23-269.png! > !image-2024-04-22-15-58-42-850.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
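The concern about deleteOnExit() in a long-running JVM comes from how java.io.DeleteOnExitHook works: paths are only ever added to a static set that is processed at shutdown, so registrations survive even after the file itself has been removed eagerly. A small Python model of that growth pattern (illustrative only, not operator code):

```python
import os
import tempfile


class DeleteOnExitHookModel:
    """Models java.io.DeleteOnExitHook: registrations accumulate for the
    whole process lifetime and are never removed before shutdown, even if
    the underlying file is already gone."""

    def __init__(self):
        self.registered = []

    def delete_on_exit(self, path):
        self.registered.append(path)


hook = DeleteOnExitHookModel()
for _ in range(1000):  # e.g. one temp config file per reconciliation loop
    fd, path = tempfile.mkstemp(suffix=".yaml")
    os.close(fd)
    hook.delete_on_exit(path)
    os.remove(path)  # eager cleanup, as FlinkConfigManager does

# the files are gone, but the exit-hook set has still grown by 1000 entries
print(len(hook.registered))  # 1000
```

Each entry is small, but in an operator pod that reconciles thousands of deployments for months the set only grows, which is why dropping the deleteOnExit() call (given the eager cleanup already exists) is attractive.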
[jira] [Updated] (FLINK-35192) operator oom
[ https://issues.apache.org/jira/browse/FLINK-35192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-35192: -- Attachment: screenshot-3.png > operator oom > > > Key: FLINK-35192 > URL: https://issues.apache.org/jira/browse/FLINK-35192 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.6.1 > Environment: jdk: openjdk11 > operator version: 1.6.1 >Reporter: chenyuzhi >Priority: Major > Attachments: image-2024-04-22-15-47-49-455.png, > image-2024-04-22-15-52-51-600.png, image-2024-04-22-15-58-23-269.png, > image-2024-04-22-15-58-42-850.png, screenshot-1.png, screenshot-2.png, > screenshot-3.png > > > The kubernetest operator docker process was killed by kernel cause out of > memory(the time is 2024.04.03: 18:16) > !image-2024-04-22-15-47-49-455.png! > Metrics: > the pod memory (RSS) is increasing slowly in the past 7 days: > !screenshot-1.png! > However the jvm memory metrics of operator not shown obvious anomaly: > !image-2024-04-22-15-58-23-269.png! > !image-2024-04-22-15-58-42-850.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-35035) Reduce job pause time when cluster resources are expanded in adaptive mode
[ https://issues.apache.org/jira/browse/FLINK-35035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834752#comment-17834752 ] Biao Geng commented on FLINK-35035: --- I am not very familiar with the adaptive scheduler; maybe others can share more insights. I just want to ask a question to make sure we are on the same page: do you mean that instead of triggering onNewResourcesAvailable repeatedly once a new slot is found (in your example, 5 new slots, so 5 reschedulings), you expect the JM to discover all 5 new slots together after a configurable period of time and trigger rescheduling only once? > Reduce job pause time when cluster resources are expanded in adaptive mode > -- > > Key: FLINK-35035 > URL: https://issues.apache.org/jira/browse/FLINK-35035 > Project: Flink > Issue Type: Improvement > Components: Runtime / Task >Affects Versions: 1.19.0 >Reporter: yuanfenghu >Priority: Minor > > When 'jobmanager.scheduler = adaptive' , job graph changes triggered by > cluster expansion will cause long-term task stagnation. We should reduce this > impact. > As an example: > I have jobgraph for : [v1 (maxp=10 minp = 1)] -> [v2 (maxp=10, minp=1)] > When my cluster has 5 slots, the job will be executed as [v1 p5]->[v2 p5] > When I add slots the task will trigger jobgraph changes,by > org.apache.flink.runtime.scheduler.adaptive.ResourceListener#onNewResourcesAvailable, > However, the five new slots I added were not discovered at the same time (for > convenience, I assume that a taskmanager has one slot), because no matter > what environment we add, we cannot guarantee that the new slots will be added > at once, so this will cause onNewResourcesAvailable triggers repeatedly > ,If each new slot action has a certain interval, then the jobgraph will > continue to change during this period. 
What I hope is that there will be a > stable time to configure the cluster resources and then go to it after the > number of cluster slots has been stable for a certain period of time. Trigger > jobgraph changes to avoid this situation -- This message was sent by Atlassian Jira (v8.20.10#820010)
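The behavior requested above, rescale only after the slot count has been stable for a while, is essentially a debounce. A sketch of that idea under stated assumptions (plain Python, not the adaptive scheduler's actual API):

```python
class StabilizedRescaleTrigger:
    """Trigger a single rescale only after no new resources have arrived
    for `stabilization_secs`, instead of once per newly discovered slot."""

    def __init__(self, stabilization_secs: float):
        self.stabilization_secs = stabilization_secs
        self.last_resource_event = None

    def on_new_resources(self, now: float) -> None:
        # each newly discovered slot just resets the quiet-period timer
        self.last_resource_event = now

    def should_rescale(self, now: float) -> bool:
        if self.last_resource_event is None:
            return False
        return now - self.last_resource_event >= self.stabilization_secs


trigger = StabilizedRescaleTrigger(stabilization_secs=10.0)
for t in (0.0, 2.0, 4.0, 6.0, 8.0):  # 5 slots trickle in over 8 seconds
    trigger.on_new_resources(t)

print(trigger.should_rescale(9.0))   # still inside the quiet period
print(trigger.should_rescale(18.0))  # 10 s of stability -> rescale once
```

For reference, Flink's adaptive scheduler exposes a knob in this spirit, `jobmanager.adaptive-scheduler.resource-stabilization-timeout`, which delays rescaling until the resource set has settled.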
[jira] [Commented] (FLINK-35036) Flink CDC Job cancel with savepoint failed
[ https://issues.apache.org/jira/browse/FLINK-35036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834749#comment-17834749 ] Biao Geng commented on FLINK-35036: --- Hi [~fly365], according to the attached screenshot, the failure is caused by a timeout on the flink client side. IIUC, in the full snapshot phase of a flink cdc job, it needs to process lots of data, and, typically due to back pressure, the state may be much larger than in the incremental phase (you can check the state size in flink's web ui). As a result, it takes longer for flink to complete the savepoint. The client's [default timeout is 60s|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#client-timeout], so maybe you can increase the value to see if the savepoint can succeed. > Flink CDC Job cancel with savepoint failed > -- > > Key: FLINK-35036 > URL: https://issues.apache.org/jira/browse/FLINK-35036 > Project: Flink > Issue Type: Bug > Components: Flink CDC > Environment: Flink 1.15.2 > Flink CDC 2.4.2 > Oracle 19C > Doris 2.0.3 >Reporter: Fly365 >Priority: Major > Attachments: image-2024-04-07-17-35-23-136.png > > > With the Flink CDC job, I want to sync oracle data to doris; in the snapshot phase, cancelling > the Flink CDC Job with a savepoint fails. > Using Flink CDC to sync Oracle 19C tables into Doris: in the initial snapshot phase, after part of > the data has been synced but before the incremental phase, cancelling the CDC task while saving a Flink > savepoint fails; once the task enters the incremental phase, cancel with savepoint works. Why does the > savepoint fail during the full data sync phase? > !image-2024-04-07-17-35-23-136.png! > -- This message was sent by Atlassian Jira (v8.20.10#820010)
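If the client-side timeout is indeed the limiter, raising it is a one-line configuration change; a minimal sketch (the 600s value is an arbitrary example, tune it to your observed savepoint duration):

```yaml
# flink-conf.yaml (read by the client issuing cancel-with-savepoint)
# default is 60 s; give the savepoint more time during the full snapshot phase
client.timeout: 600s
```

This only relaxes how long the client waits; if the savepoint itself is slow because of back pressure, the checkpoint/savepoint timeouts on the job side may need attention too.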
[jira] [Commented] (FLINK-31850) Fix args in JobSpec not being passed through to Flink in Standalone mode - 1.4.0
[ https://issues.apache.org/jira/browse/FLINK-31850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717434#comment-17717434 ] Biao Geng commented on FLINK-31850: --- hi [~gil_shmaya], I checked the relevant code in the [PR|https://github.com/apache/flink-kubernetes-operator/pull/379], and it does not look broken. Could you please share an args example that worked in 1.2.0 but fails in 1.4.0? > Fix args in JobSpec not being passed through to Flink in Standalone mode - > 1.4.0 > > > Key: FLINK-31850 > URL: https://issues.apache.org/jira/browse/FLINK-31850 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.4.0 >Reporter: Gil Shmaya >Priority: Major > > This issue is related to a previously fixed bug in version 1.2.0 - > [FLINK-29388] Fix args in JobSpec not being passed through to Flink in > Standalone mode - ASF JIRA (apache.org) > I have noticed that while the args are successfully being passed when using > version 1.2.0, this is not the case with version 1.4.0. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-31964) Improve the document of Autoscaler as 1.17.0 is released
Biao Geng created FLINK-31964: - Summary: Improve the document of Autoscaler as 1.17.0 is released Key: FLINK-31964 URL: https://issues.apache.org/jira/browse/FLINK-31964 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Biao Geng Attachments: image-2023-04-28-10-21-09-935.png Since 1.17.0 is released and the official image is [available|https://hub.docker.com/layers/library/flink/1.17.0-scala_2.12-java8/images/sha256-a8bbef97ec3f7ce4fa6541d48dfe16261ee7f93f93b164c0e84644605f9ea0a3?context=explore], we can update the image link in the Autoscaler section. !image-2023-04-28-10-21-09-935.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-31509) REST Service missing sessionAffinity causes job run failure with HA cluster
[ https://issues.apache.org/jira/browse/FLINK-31509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702659#comment-17702659 ] Biao Geng commented on FLINK-31509: --- I believe it should be considered a bug to route API requests to all JMs instead of only the leader. Besides, it should be OK to use K8s HA with a single JM: when this JM crashes, a new one will be created from the HA data stored in the config map. Multiple JMs may reduce the recovery time but are not strictly necessary. > REST Service missing sessionAffinity causes job run failure with HA cluster > --- > > Key: FLINK-31509 > URL: https://issues.apache.org/jira/browse/FLINK-31509 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator > Environment: Flink 1.15 on Flink Operator 1.4.0 on Kubernetes 1.25.4, > (optionally with Beam 2.46.0) > but the issue was observed on Flink 1.14, 1.15 and 1.16 and on Flink Operator > 1.2, 1.3, 1.3.1, 1.4.0 > >Reporter: Emmanuel Leroy >Priority: Major > > When using a Session Cluster with multiple Job Managers, the -rest service > load balances the API requests to all job managers, not just the master. > When submitting a FlinkSessionJob, I often see errors like: `jar .jar > was not found`, because the submission is done in 2 steps: > * upload the jar with `v1/jars/upload` which returns the `jar_id` > * run the job with `v1/jars//run` > Unfortunately, with the Service load balancing between nodes, it is often the > case that the jar is uploaded on a JM, and the run request happens on > another, where the jar doesn't exist. > A simple fix is to append the `sessionAffinity: ClientIP` on the -rest > service, where the API calls from a given originating IP will always be > routed to the same node. 
> This issue is especially problematic with Beam, where the Beam job submission > does not retry running the job with the jar_id and will fail, causing it to > re-upload a new jar and retry, until it is lucky enough to get the 2 calls > in a row routed to the same node. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
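The fix described above can be sketched as a Kubernetes Service fragment (a minimal sketch; the service name and selector are illustrative, while `sessionAffinity` and `sessionAffinityConfig` are standard Kubernetes Service spec fields):

```yaml
# Hypothetical -rest Service for a session cluster; only the affinity-related
# fields matter here. ClientIP affinity pins all REST calls from one client IP
# (jar upload, then jar run) to the same JobManager pod.
apiVersion: v1
kind: Service
metadata:
  name: my-session-cluster-rest   # illustrative name
spec:
  type: ClusterIP
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800       # Kubernetes default affinity timeout
  ports:
    - name: rest
      port: 8081
      targetPort: 8081
  selector:
    app: my-session-cluster       # illustrative selector
```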
[jira] [Commented] (FLINK-31529) Let yarn client exit early before JobManager running
[ https://issues.apache.org/jira/browse/FLINK-31529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702634#comment-17702634 ] Biao Geng commented on FLINK-31529: --- Hi there, I want to share some thoughts/questions about this jira: 1. According to the [doc|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/cli/], the --detached option is used to control whether the client should wait for the job to finish. I have seen some users' platforms rely on this option to get the returned YARN app info to manage their Flink jobs (e.g. whether the job was submitted successfully). Maybe introducing a new option is better than changing the behavior of the --detached option. 2. The description says "In batch mode, queue resources are insufficient in most cases." IIUC, a lack of resources should not be the normal case. One possible use case I can come up with is that, to reduce costs, people may run Flink batch jobs at night and utilize workflow frameworks like Airflow to retry the submission. Is that the case? > Let yarn client exit early before JobManager running > > > Key: FLINK-31529 > URL: https://issues.apache.org/jira/browse/FLINK-31529 > Project: Flink > Issue Type: Improvement > Components: Deployment / YARN >Reporter: Weihua Hu >Priority: Major > > Currently the YarnClusterDescriptor always waits for the YARN application status to become > RUNNING even if we use detached mode. > In batch mode, queue resources are insufficient in most cases. So the JobManager > may take a long time waiting for resources, and the client keeps waiting > too. If the Flink client is killed (for some other reason), the cluster will be > shut down too. > We need an option to let the Flink client exit early. Using the detached option or > introducing a new option are both OK. > Looking forward to other suggestions -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-23890) CepOperator may create a large number of timers and cause performance problems
[ https://issues.apache.org/jira/browse/FLINK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686299#comment-17686299 ] Biao Geng commented on FLINK-23890: --- This improvement is of great value for CEP users when they use Event Time and have lots of keys in their input stream. Thanks a lot for your work! > CepOperator may create a large number of timers and cause performance problems > -- > > Key: FLINK-23890 > URL: https://issues.apache.org/jira/browse/FLINK-23890 > Project: Flink > Issue Type: Improvement > Components: Library / CEP >Affects Versions: 1.12.1 >Reporter: Yue Ma >Assignee: Yue Ma >Priority: Major > Labels: pull-request-available > Fix For: 1.16.0 > > Attachments: image-2021-08-20-13-59-05-977.png, > image-2022-06-07-21-27-03-814.png, image-2022-06-07-21-40-58-781.png > > > There are two situations in which the CepOperator may register timers when > dealing with EventTime. > One is in processElement, which buffers the data first and then registers a timer > with a timestamp of watermark+1. > {code:java} > if (timestamp > timerService.currentWatermark()) { > // we have an event with a valid timestamp, so > // we buffer it until we receive the proper watermark. > saveRegisterWatermarkTimer(); > bufferEvent(value, timestamp); > }{code} > The other is when the EventTimer is triggered: if sortedTimestamps or > partialMatches are not empty, a timer will also be registered. > {code:java} > if (!sortedTimestamps.isEmpty() || !partialMatches.isEmpty()) { > saveRegisterWatermarkTimer(); > }{code} > > The problem is, if the partialMatches corresponding to each of my keys is > not empty, then every time the watermark advances, the timers of all keys > will be triggered, and a new EventTimer is re-registered under each key. > When the number of task keys is very large, this operation greatly affects > performance. > !https://code.byted.org/inf/flink/uploads/91aee639553df07fa376cf2865e91fd2/image.png! 
> I think it is unnecessary to register EventTimers this frequently; can > we make the following changes? > When an event comes, the timestamp of the EventTimer we register is equal > to the EventTime of this event instead of watermark + 1. > When a new ComputationState with a window is created (like the *withIn* pattern), > we use the timeout of this window to create the EventTimer (EventTime + > WindowTime). > After making such an attempt in our test environment, the number of > registered timers has been greatly reduced, and the performance has been > greatly improved. > !https://code.byted.org/inf/flink/uploads/24b85492c6a34a35c4445a4fd46c8363/image.png! > > -- This message was sent by Atlassian Jira (v8.20.10#820010)
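The scale of the problem can be illustrated with a small back-of-the-envelope model (plain Java, not Flink code; the function names and the example counts are hypothetical):

```java
// Hypothetical model: counts how many event timers get (re)registered under
// the two strategies discussed above.
public class TimerCountSketch {
    // Strategy A (current behavior): every watermark advance fires the timer of
    // every key with a partial match, and each firing re-registers a timer at
    // watermark + 1, so registrations grow as keys * watermark advances.
    static long timersWithWatermarkPlusOne(int keysWithPartialMatches, int watermarkAdvances) {
        return (long) keysWithPartialMatches * watermarkAdvances;
    }

    // Strategy B (proposed): register one timer per buffered event (at its event
    // time) plus one per windowed match (at eventTime + windowTime); no
    // per-watermark churn, so registrations grow only with the data itself.
    static long timersWithEventTime(int bufferedEvents, int windowedMatches) {
        return (long) bufferedEvents + windowedMatches;
    }

    public static void main(String[] args) {
        // 1M keys with partial matches and 1000 watermark advances,
        // vs 10k buffered events and 1k windowed matches.
        System.out.println(timersWithWatermarkPlusOne(1_000_000, 1_000)); // 1000000000
        System.out.println(timersWithEventTime(10_000, 1_000));          // 11000
    }
}
```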
[jira] [Commented] (FLINK-30562) Patterns are not emitted with parallelism >1 since 1.15.x+
[ https://issues.apache.org/jira/browse/FLINK-30562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654507#comment-17654507 ] Biao Geng commented on FLINK-30562: --- hi [~Jamalarm], thanks for the report. I tried to reproduce your problem with flink 1.16.0 using a standalone cluster on my computer (my demo using event time can be found [here|https://github.com/bgeng777/ververica-cep-demo/tree/FLINK-30562]). I agree there are some differences in stdout when setting parallelism to 2 compared with setting it to 1, but it seems that the result is correct. In my demo, when p is 2, the stdout of the matches is: {quote}1> 3,3,3 1> 2,2,2{quote} when p is 1: {quote}3,3,3 2,2,2{quote} Is the above result the same as in your experiments? > Patterns are not emitted with parallelism >1 since 1.15.x+ > -- > > Key: FLINK-30562 > URL: https://issues.apache.org/jira/browse/FLINK-30562 > Project: Flink > Issue Type: Bug > Components: Library / CEP >Affects Versions: 1.16.0, 1.15.3 > Environment: Problem observed in: > Production: > Dockerised Flink cluster running in AWS Fargate, sourced from AWS Kinesis and > sink to AWS SQS > Local: > Completely local MiniCluster based test with no external sinks or sources >Reporter: Thomas Wozniakowski >Priority: Major > > (Apologies for the speculative and somewhat vague ticket, but I wanted to > raise this while I am investigating to see if anyone has suggestions to help > me narrow down the problem.) > We are encountering an issue where our streaming Flink job has stopped > working correctly since Flink 1.15.3. This problem is also present on Flink > 1.16.0. The Keyed CEP operators that our job uses are no longer emitting > Patterns reliably, but critically *this is only happening when parallelism is > set to a value greater than 1*. 
> Our local build tests were previously set up using in-JVM `MiniCluster` > instances, or dockerised Flink clusters all set with a parallelism of 1, so > this problem was not caught and it caused an outage when we upgraded the > cluster version in production. > Observing the job using the Flink console in production, I can see that > events are *arriving* into the Keyed CEP operators, but no Pattern events are > being emitted out of any of the operators. Furthermore, all the reported > Watermark values are zero, though I don't know if that is a red herring, as > Watermark reporting seems to have changed since 1.14.x. > I am currently attempting to create a stripped down version of our streaming > job to demonstrate the problem, but this is quite tricky to set up. In the > meantime I would appreciate any hints that could point me in the right > direction. > I have isolated the problem to the Keyed CEP operator by removing our real > sinks and sources from the failing test. I am still seeing the erroneous > behaviour when setting up a job as: > # Events are read from a list using `env.fromCollection( ... )` > # CEP operator processes events > # Output is captured in another list for assertions > My best guess at the moment is something to do with Watermark emission? There > seem to have been changes related to watermark alignment; perhaps this has > caused some kind of regression in the CEP library? To reiterate, *this > problem only occurs with parallelism of 2 or more. Setting the parallelism to > 1 immediately fixes the issue* -- This message was sent by Atlassian Jira (v8.20.10#820010)
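One way to picture the watermark suspicion above: a downstream operator's event-time clock is the minimum of its input channels' watermarks, so with parallelism > 1 an idle subtask (e.g. a small `fromCollection` source whose elements all land on one subtask) can pin the combined watermark at its initial `Long.MIN_VALUE` and keep event-time CEP timers from ever firing. A minimal model of that min-combination (plain Java, not Flink internals):

```java
// Model of watermark combination across parallel input channels: the combined
// watermark is the minimum over all channels, so one channel that never emits
// a watermark (stays at Long.MIN_VALUE) blocks event-time progress entirely.
import java.util.Arrays;

public class CombinedWatermark {
    static long combined(long[] channelWatermarks) {
        return Arrays.stream(channelWatermarks).min().orElse(Long.MIN_VALUE);
    }

    public static void main(String[] args) {
        long idle = Long.MIN_VALUE;                  // subtask that saw no events
        System.out.println(combined(new long[]{42L}));        // p=1: advances to 42
        System.out.println(combined(new long[]{42L, idle}));  // p=2: stuck at Long.MIN_VALUE
    }
}
```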
[jira] [Commented] (FLINK-30518) [flink-operator] Kubernetes HA not working due to wrong jobmanager.rpc.address
[ https://issues.apache.org/jira/browse/FLINK-30518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17652520#comment-17652520 ] Biao Geng commented on FLINK-30518: --- [~gyfora] I see. Thanks for the information. I misunderstood the problem somehow :( But I just tried the 1.3.0 operator with Flink 1.16 to run basic-checkpoint-ha-example with JM replicas set to 3, and it works fine as well. [~tbnguyen1407] would you mind sharing the full deployment yaml for this problem? > [flink-operator] Kubernetes HA not working due to wrong jobmanager.rpc.address > -- > > Key: FLINK-30518 > URL: https://issues.apache.org/jira/browse/FLINK-30518 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.3.0 >Reporter: Binh-Nguyen Tran >Priority: Major > Attachments: flink-configmap.png, screenshot-1.png > > > Since flink-conf.yaml is mounted as read-only configmap, the > /docker-entrypoint.sh script is not able to inject correct Pod IP to > `jobmanager.rpc.address`. This leads to same address (e.g flink.ns-ext) being > set for all Job Manager pods. This causes: > (1) flink-cluster-config-map always contains wrong address for all 3 > component leaders (see screenshot, should be pod IP instead of clusterIP > service name) > (2) Accessing Web UI when jobmanager.replicas > 1 is not possible with error > {code:java} > {"errors":["Service temporarily unavailable due to an ongoing leader > election. Please refresh."]} {code} > > ~ flinkdeployment.yaml ~ > {code:java} > spec: > flinkConfiguration: > high-availability: kubernetes > high-availability.storageDir: "file:///opt/flink/storage" > ... > jobManager: > replicas: 3 > ... {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-30518) [flink-operator] Kubernetes HA not working due to wrong jobmanager.rpc.address
[ https://issues.apache.org/jira/browse/FLINK-30518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17652511#comment-17652511 ] Biao Geng commented on FLINK-30518: --- I think [~gyfora] is right that this was not a problem before. I just tried an older version (1.1.0) of the operator with jobManager replicas set to 2, and it works fine for running jobs and accessing the web UI. I checked my configmap (`basic-checkpoint-ha-example-cluster-config-map`) and it seems that leader.dispatcher is set correctly. !screenshot-1.png! I will try to reproduce this problem with the image of the operator based on the latest main branch, as the [PR|https://github.com/apache/flink-kubernetes-operator/pull/494] to remove the read-only limit is merged. > [flink-operator] Kubernetes HA not working due to wrong jobmanager.rpc.address > -- > > Key: FLINK-30518 > URL: https://issues.apache.org/jira/browse/FLINK-30518 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.3.0 >Reporter: Binh-Nguyen Tran >Priority: Major > Attachments: flink-configmap.png, screenshot-1.png > > > Since flink-conf.yaml is mounted as read-only configmap, the > /docker-entrypoint.sh script is not able to inject correct Pod IP to > `jobmanager.rpc.address`. This leads to same address (e.g flink.ns-ext) being > set for all Job Manager pods. This causes: > (1) flink-cluster-config-map always contains wrong address for all 3 > component leaders (see screenshot, should be pod IP instead of clusterIP > service name) > (2) Accessing Web UI when jobmanager.replicas > 1 is not possible with error > {code:java} > {"errors":["Service temporarily unavailable due to an ongoing leader > election. Please refresh."]} {code} > > ~ flinkdeployment.yaml ~ > {code:java} > spec: > flinkConfiguration: > high-availability: kubernetes > high-availability.storageDir: "file:///opt/flink/storage" > ... > jobManager: > replicas: 3 > ... 
{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-30518) [flink-operator] Kubernetes HA not working due to wrong jobmanager.rpc.address
[ https://issues.apache.org/jira/browse/FLINK-30518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-30518: -- Attachment: screenshot-1.png > [flink-operator] Kubernetes HA not working due to wrong jobmanager.rpc.address > -- > > Key: FLINK-30518 > URL: https://issues.apache.org/jira/browse/FLINK-30518 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.3.0 >Reporter: Binh-Nguyen Tran >Priority: Major > Attachments: flink-configmap.png, screenshot-1.png > > > Since flink-conf.yaml is mounted as read-only configmap, the > /docker-entrypoint.sh script is not able to inject correct Pod IP to > `jobmanager.rpc.address`. This leads to same address (e.g flink.ns-ext) being > set for all Job Manager pods. This causes: > (1) flink-cluster-config-map always contains wrong address for all 3 > component leaders (see screenshot, should be pod IP instead of clusterIP > service name) > (2) Accessing Web UI when jobmanager.replicas > 1 is not possible with error > {code:java} > {"errors":["Service temporarily unavailable due to an ongoing leader > election. Please refresh."]} {code} > > ~ flinkdeployment.yaml ~ > {code:java} > spec: > flinkConfiguration: > high-availability: kubernetes > high-availability.storageDir: "file:///opt/flink/storage" > ... > jobManager: > replicas: 3 > ... {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-28554) Kubernetes-Operator allow readOnlyRootFilesystem via operatorSecurityContext
[ https://issues.apache.org/jira/browse/FLINK-28554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17649728#comment-17649728 ] Biao Geng commented on FLINK-28554: --- I want to attach more information: according to the k8s [doc|https://kubernetes.io/docs/concepts/storage/volumes/#configmap], "A container using a ConfigMap as a subPath volume mount will not receive ConfigMap updates". It may also be worthwhile to mention that the {{dynamic.config.check.interval}} config may not work as expected in some cases. For example, we rely on {{dynamic.config.check.interval}} to apply some default config updates by updating the {{flink-operator-config}} ConfigMap without restarting the operator Pod. After this jira, the above dynamic update is broken. Also, the {{CONF_OVERRIDE_DIR}} introduced in FLINK-28445 may not work either. > Kubernetes-Operator allow readOnlyRootFilesystem via operatorSecurityContext > > > Key: FLINK-28554 > URL: https://issues.apache.org/jira/browse/FLINK-28554 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.0.1 >Reporter: Tim >Assignee: Tim >Priority: Minor > Labels: operator, pull-request-available > Fix For: kubernetes-operator-1.2.0 > > > It would be nice if the operator would support using the > "readOnlyRootFilesystem" setting via the operatorSecurityContext. When using > the default operator template the operator won't be able to start when using > this setting because the config files mounted in `/opt/flink/conf` are now > (of course) also read-only. > It would be nice if the default template would be written in such a way that > it allows adding emptyDir volumes to /opt/flink/conf via the values.yaml, > which is not possible right now. Then the config files can remain editable by > the operator while keeping the root filesystem read-only. 
> I have successfully tried that in my branch (see: > https://github.com/apache/flink-kubernetes-operator/compare/main...timsn:flink-kubernetes-operator:mount-single-flink-conf-files) > which prepares the operator template. > After this small change to the template it is possible to add emptyDir volumes > for the conf and tmp dirs and, in a second step, to enable the > readOnlyRootFilesystem setting via the values.yaml > values.yaml > {code:java} > [...] > operatorVolumeMounts: > create: true > data: > - name: flink-conf > mountPath: /opt/flink/conf > subPath: conf > - name: flink-tmp > mountPath: /tmp > operatorVolumes: > create: true > data: > - name: flink-conf > emptyDir: {} > - name: flink-tmp > emptyDir: {} > operatorSecurityContext: > readOnlyRootFilesystem: true > [...]{code} > I think this could be a viable way to allow this security setting and I could > turn this into a pull request if desired. What do you think about it? Or is > there even a better way to achieve this that I didn't think of yet? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-29529) Update flink version in flink-python-example of flink k8s operator
[ https://issues.apache.org/jira/browse/FLINK-29529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-29529: -- Description: Currently, we hardcoded the version of both flink image and pyflink pip package as 1.15.0 in the example's Dockerfile. It is not the best practice as Flink has new 1.15.x releases. We had better make the following improvements: {{FROM flink:1.15.0 -> FROM flink:1.15}} {{RUN pip3 install apache-flink==1.15.0 -> RUN pip3 install "apache-flink>=1.15.0,<1.16.0"}} was: Currently, we hardcoded the version of both flink image and pyflink pip package as 1.15.0 in the example's Dockerfile. It is not the best practice as the flink has new 1.15.x releases. We had better do following improvements: {{FROM flink:1.15.0 -> FROM flink:1.15}} {{RUN pip3 install apache-flink==1.15.0 -> RUN pip install "apache-flink>=1.15.0,<1.16.0"}} > Update flink version in flink-python-example of flink k8s operator > -- > > Key: FLINK-29529 > URL: https://issues.apache.org/jira/browse/FLINK-29529 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Minor > > Currently, we hardcoded the version of both flink image and pyflink pip > package as 1.15.0 in the example's Dockerfile. It is not the best practice as > Flink has new 1.15.x releases. > We had better make the following improvements: > {{FROM flink:1.15.0 -> FROM flink:1.15}} > {{RUN pip3 install apache-flink==1.15.0 -> RUN pip3 install > "apache-flink>=1.15.0,<1.16.0"}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-29529) Update flink version in flink-python-example of flink k8s operator
Biao Geng created FLINK-29529: - Summary: Update flink version in flink-python-example of flink k8s operator Key: FLINK-29529 URL: https://issues.apache.org/jira/browse/FLINK-29529 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Biao Geng Currently, we hardcoded the version of both flink image and pyflink pip package as 1.15.0 in the example's Dockerfile. It is not the best practice as Flink has new 1.15.x releases. We had better make the following improvements: {{FROM flink:1.15.0 -> FROM flink:1.15}} {{RUN pip3 install apache-flink==1.15.0 -> RUN pip install "apache-flink>=1.15.0,<1.16.0"}} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-29362) Allow loading dynamic config for kerberos authentication in CliFrontend
Biao Geng created FLINK-29362: - Summary: Allow loading dynamic config for kerberos authentication in CliFrontend Key: FLINK-29362 URL: https://issues.apache.org/jira/browse/FLINK-29362 Project: Flink Issue Type: Improvement Components: Command Line Client Reporter: Biao Geng In the [code|https://github.com/apache/flink/blob/97f5a45cd035fbae37a7468c6f771451ddb4a0a4/flink-clients/src/main/java/org/apache/flink/client/cli/CliFrontend.java#L1167], Flink's client will try to {{SecurityUtils.install(new SecurityConfiguration(cli.configuration));}} with configs (e.g. {{security.kerberos.login.principal}} and {{security.kerberos.login.keytab}}) from only flink-conf.yaml. If users specify the above 2 configs via the -D option, it will not work, as {{cli.parseAndRun(args)}} is executed after installing the security configs from flink-conf.yaml. However, if a user specifies principal A in the client's flink-conf.yaml and uses the -D option to specify principal B, the launched YARN container will use principal B even though the job is submitted on the client side with principal A. Such behavior can be misleading, as Flink provides 2 ways to set a config but does not keep consistency between client and cluster. It also affects users who want to use Flink with Kerberos, as they must modify flink-conf.yaml whenever they want to use another Kerberos user. -- This message was sent by Atlassian Jira (v8.20.10#820010)
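The ordering problem can be sketched in plain Java (an illustrative model with hypothetical helper names, not the real CliFrontend API): the client installs the security context from flink-conf.yaml before the -D overrides are parsed, so the client authenticates as principal A while the effective configuration shipped to the cluster carries principal B.

```java
import java.util.HashMap;
import java.util.Map;

public class KerberosConfigOrder {
    // Step 1 (client side): the principal used for authentication is read from
    // flink-conf.yaml only, before any -D override has been parsed.
    static String clientPrincipal(Map<String, String> flinkConf) {
        return flinkConf.get("security.kerberos.login.principal");
    }

    // Step 2 (shipped to cluster): flink-conf.yaml merged with the -D dynamic
    // properties, which are parsed only after the security context is installed.
    static Map<String, String> effectiveConfig(Map<String, String> flinkConf,
                                               Map<String, String> dynamicProps) {
        Map<String, String> merged = new HashMap<>(flinkConf);
        merged.putAll(dynamicProps);
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("security.kerberos.login.principal", "A");
        Map<String, String> dynamic = Map.of("security.kerberos.login.principal", "B");

        System.out.println(clientPrincipal(conf));                                  // A
        System.out.println(effectiveConfig(conf, dynamic)
                .get("security.kerberos.login.principal"));                         // B
    }
}
```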
[jira] [Updated] (FLINK-29277) Flink submits tasks to yarn Federation and throws an exception 'org.apache.commons.lang3.NotImplementedException: Code is not implemented'
[ https://issues.apache.org/jira/browse/FLINK-29277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-29277: -- Attachment: screenshot-1.png > Flink submits tasks to yarn Federation and throws an exception > 'org.apache.commons.lang3.NotImplementedException: Code is not implemented' > -- > > Key: FLINK-29277 > URL: https://issues.apache.org/jira/browse/FLINK-29277 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.14.3 > Environment: Flink 1.14.3、JDK8、hadoop-3.2.1 >Reporter: Jiankun Feng >Priority: Blocker > Attachments: error.log, image-2022-09-13-15-56-47-631.png, > screenshot-1.png > > > 2022-09-13 11:02:35,488 INFO > org.apache.flink.runtime.util.config.memory.ProcessMemoryUtils [] - The > derived from fraction jvm overhead memory (102.400mb (107374184 bytes)) is > less than its min value 192.000mb (201326592 bytes), min value will be used > instead > 2022-09-13 11:02:35,751 WARN org.apache.flink.table.client.cli.CliClient > [] - Could not execute SQL statement. > org.apache.flink.table.client.gateway.SqlExecutionException: Could not > execute SQL statement. 
> at > org.apache.flink.table.client.gateway.local.LocalExecutor.executeModifyOperations(LocalExecutor.java:225) > ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.client.cli.CliClient.callInserts(CliClient.java:617) > ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.client.cli.CliClient.callInsert(CliClient.java:606) > ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.client.cli.CliClient.callOperation(CliClient.java:466) > ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.client.cli.CliClient.lambda$executeStatement$1(CliClient.java:346) > [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at java.util.Optional.ifPresent(Optional.java:159) ~[?:1.8.0_141] > at > org.apache.flink.table.client.cli.CliClient.executeStatement(CliClient.java:339) > [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.client.cli.CliClient.executeFile(CliClient.java:318) > [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.client.cli.CliClient.executeInNonInteractiveMode(CliClient.java:234) > [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.client.SqlClient.openCli(SqlClient.java:153) > [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at org.apache.flink.table.client.SqlClient.start(SqlClient.java:95) > [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.client.SqlClient.startClient(SqlClient.java:187) > [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at org.apache.flink.table.client.SqlClient.main(SqlClient.java:161) > [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > Caused by: org.apache.flink.table.api.TableException: Failed to execute sql > at > 
org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:791) > ~[flink-table_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:754) > ~[flink-table_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.client.gateway.local.LocalExecutor.lambda$executeModifyOperations$4(LocalExecutor.java:223) > ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.client.gateway.context.ExecutionContext.wrapClassLoader(ExecutionContext.java:88) > ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.client.gateway.local.LocalExecutor.executeModifyOperations(LocalExecutor.java:223) > ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > ... 12 more > Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: > Could not deploy Yarn job cluster. > at > org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:489) > ~[flink-dist_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.client.deployment.executors.AbstractJobClusterExecutor.execute(AbstractJobClusterExecutor.java:81) > ~[flink-dist_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at >
[jira] [Commented] (FLINK-29277) Flink submits tasks to yarn Federation and throws an exception 'org.apache.commons.lang3.NotImplementedException: Code is not implemented'
[ https://issues.apache.org/jira/browse/FLINK-29277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607268#comment-17607268 ] Biao Geng commented on FLINK-29277: --- In hadoop3.2.1, org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor#getClusterNodes is not implemented. !screenshot-1.png! > Flink submits tasks to yarn Federation and throws an exception > 'org.apache.commons.lang3.NotImplementedException: Code is not implemented' > -- > > Key: FLINK-29277 > URL: https://issues.apache.org/jira/browse/FLINK-29277 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.14.3 > Environment: Flink 1.14.3、JDK8、hadoop-3.2.1 >Reporter: Jiankun Feng >Priority: Blocker > Attachments: error.log, image-2022-09-13-15-56-47-631.png, > screenshot-1.png > > > 2022-09-13 11:02:35,488 INFO > org.apache.flink.runtime.util.config.memory.ProcessMemoryUtils [] - The > derived from fraction jvm overhead memory (102.400mb (107374184 bytes)) is > less than its min value 192.000mb (201326592 bytes), min value will be used > instead > 2022-09-13 11:02:35,751 WARN org.apache.flink.table.client.cli.CliClient > [] - Could not execute SQL statement. > org.apache.flink.table.client.gateway.SqlExecutionException: Could not > execute SQL statement. 
> at > org.apache.flink.table.client.gateway.local.LocalExecutor.executeModifyOperations(LocalExecutor.java:225) > ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.client.cli.CliClient.callInserts(CliClient.java:617) > ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.client.cli.CliClient.callInsert(CliClient.java:606) > ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.client.cli.CliClient.callOperation(CliClient.java:466) > ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.client.cli.CliClient.lambda$executeStatement$1(CliClient.java:346) > [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at java.util.Optional.ifPresent(Optional.java:159) ~[?:1.8.0_141] > at > org.apache.flink.table.client.cli.CliClient.executeStatement(CliClient.java:339) > [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.client.cli.CliClient.executeFile(CliClient.java:318) > [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.client.cli.CliClient.executeInNonInteractiveMode(CliClient.java:234) > [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.client.SqlClient.openCli(SqlClient.java:153) > [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at org.apache.flink.table.client.SqlClient.start(SqlClient.java:95) > [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.client.SqlClient.startClient(SqlClient.java:187) > [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at org.apache.flink.table.client.SqlClient.main(SqlClient.java:161) > [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > Caused by: org.apache.flink.table.api.TableException: Failed to execute sql > at > 
org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:791) > ~[flink-table_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:754) > ~[flink-table_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.client.gateway.local.LocalExecutor.lambda$executeModifyOperations$4(LocalExecutor.java:223) > ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.client.gateway.context.ExecutionContext.wrapClassLoader(ExecutionContext.java:88) > ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at > org.apache.flink.table.client.gateway.local.LocalExecutor.executeModifyOperations(LocalExecutor.java:223) > ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > ... 12 more > Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: > Could not deploy Yarn job cluster. > at > org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:489) > ~[flink-dist_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5] > at >
[jira] [Created] (FLINK-28769) Flink History Server show wrong name of batch jobs
Biao Geng created FLINK-28769: - Summary: Flink History Server show wrong name of batch jobs Key: FLINK-28769 URL: https://issues.apache.org/jira/browse/FLINK-28769 Project: Flink Issue Type: Bug Components: Runtime / Web Frontend Reporter: Biao Geng Attachments: image-2022-08-02-00-41-51-815.png When running {{examples/batch/WordCount.jar}} using flink 1.15 and 1.16 together with the history server started, the history server shows the default name (e.g. "Flink Java Job at Tue Aug 02..") of the batch job instead of the name ("WordCount Example") specified in the Java code. But for {{examples/streaming/WordCount.jar}}, the job name in the history server is correct. It looks like {{org.apache.flink.api.java.ExecutionEnvironment#executeAsync(java.lang.String)}} does not set the job name the way {{org.apache.flink.streaming.api.environment.StreamExecutionEnvironment#execute(java.lang.String)}} does (e.g. streamGraph.setJobName(jobName);). !image-2022-08-02-00-41-51-815.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-28643) Job archive file contains too much Job json file on Flink HistoryServer
[ https://issues.apache.org/jira/browse/FLINK-28643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573852#comment-17573852 ] Biao Geng commented on FLINK-28643: --- I ran into this issue as well when running jobs with large parallelism. One possible remedy is to limit the number of archived jobs, but a better way may be to add a limit on the Flink side so that it does not bring down the whole HDFS system. > Job archive file contains too much Job json file on Flink HistoryServer > > > Key: FLINK-28643 > URL: https://issues.apache.org/jira/browse/FLINK-28643 > Project: Flink > Issue Type: Improvement > Components: Runtime / Web Frontend >Affects Versions: 1.15.1 >Reporter: LI Mingkun >Priority: Major > Attachments: image-2022-07-22-17-04-55-046.png > > > In the History Server, 'HistoryServerArchiveFetcher' fetches the job archive file and > unpacks it into a job dir. > There can be more than 10,000 JSON files for a big-parallelism job, which runs out > of inodes > !image-2022-07-22-17-04-55-046.png! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-28725) flink-kubernetes-operator taskManager: replicas: 2 error
[ https://issues.apache.org/jira/browse/FLINK-28725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572257#comment-17572257 ] Biao Geng commented on FLINK-28725: --- Hi [~lizu18xz], the "replicas" field of the task manager was introduced into the CRD in 1.1.0. If your kube cluster already has an earlier version of flink-kubernetes-operator installed, you may have to remove the outdated CRD manually and reinstall the newest operator. The details of the upgrade process are [here|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.1/docs/operations/upgrade/] Besides, in most cases users do not need to set the task manager replicas manually, as the value can be calculated from the configs (i.e. parallelism / taskSlots). > flink-kubernetes-operator taskManager: replicas: 2 error > > > Key: FLINK-28725 > URL: https://issues.apache.org/jira/browse/FLINK-28725 > Project: Flink > Issue Type: Bug >Reporter: lizu18xz >Priority: Major > > version:v1.1.0 > > taskManager: > replicas: 2 > resource: > memory: "1024m" > cpu: 1 > > error validating data: ValidationError(FlinkDeployment.spec.taskManager): > unknown field "replicas" in > org.apache.flink.v1beta1.FlinkDeployment.spec.taskManager; if you choose to > ignore these errors, turn validation off with --validate=false
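The "calculated from configs (i.e. parallelism / taskSlots)" remark above amounts to a ceiling division; an illustrative sketch of that calculation (not the operator's actual code):

```java
// Illustrative sketch of deriving the task manager count from parallelism
// and taskSlots; not the flink-kubernetes-operator's actual implementation.
public class ReplicaSketch {
    // Smallest number of task managers whose slots cover the parallelism:
    // ceil(parallelism / taskSlots), written with integer arithmetic.
    public static int replicas(int parallelism, int taskSlots) {
        return (parallelism + taskSlots - 1) / taskSlots;
    }

    public static void main(String[] args) {
        // e.g. parallelism 10 with 4 slots per task manager needs 3 task managers
        System.out.println(replicas(10, 4));
    }
}
```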
[jira] (FLINK-28725) flink-kubernetes-operator taskManager: replicas: 2 error
[ https://issues.apache.org/jira/browse/FLINK-28725 ] Biao Geng deleted comment on FLINK-28725: --- was (Author: bgeng777): Hi [~lizu18xz], the field "replicas" of task manager is introduced into CRD since 1.1.0. If your kube cluster has already installed earlier version of flink-kubernetes-operator, you may have to remove the outdated CRD manually and reinstall the newest operator. The detail of upgrade process is [here|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.1/docs/operations/upgrade/] Besides, in most cases, users do not need to set replica of task managers manually as it can be calculated from configs(i.e. parallelism / taskSlots). > flink-kubernetes-operator taskManager: replicas: 2 error > > > Key: FLINK-28725 > URL: https://issues.apache.org/jira/browse/FLINK-28725 > Project: Flink > Issue Type: Bug >Reporter: lizu18xz >Priority: Major > > version:v1.1.0 > > taskManager: > replicas: 2 > resource: > memory: "1024m" > cpu: 1 > > error validating data: ValidationError(FlinkDeployment.spec.taskManager): > unknown field "replicas" in > org.apache.flink.v1beta1.FlinkDeployment.spec.taskManager; if you choose to > ignore these errors, turn validation off with --validate=false -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-28495) Fix typos or mistakes of Flink CEP Document in the official website
[ https://issues.apache.org/jira/browse/FLINK-28495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566586#comment-17566586 ] Biao Geng commented on FLINK-28495: --- [~martijnvisser] Sure. I can take this ticket. > Fix typos or mistakes of Flink CEP Document in the official website > --- > > Key: FLINK-28495 > URL: https://issues.apache.org/jira/browse/FLINK-28495 > Project: Flink > Issue Type: Improvement > Components: Library / CEP >Reporter: Biao Geng >Priority: Minor > > 1. "how you can migrate your job from an older Flink version to Flink-1.3." > -> "how you can migrate your job from an older Flink version to Flink-1.13." > 2. "Will generate the following matches for an input sequence: C D A1 A2 A3 D > A4 B. with combinations enabled: {quote}\\{ C A1 B\}, \{C A1 A2 B\}, \{C A1 > A3 B\}, \{C A1 A4 B\}, \{C A1 A2 A3 B\}, \{C A1 A2 A4 B\}, \{C A1 A3 A4 B\}, > \{C A1 A2 A3 A4 B\}{quote}" -> "Will generate the following matches for an > input sequence: C D A1 A2 A3 D A4 B. with combinations enabled: {quote}\{C A1 > B\}, \{C A1 A2 B\}, \{C A1 A3 B\}, \{C A1 A4 B\}, \{C A1 A2 A3 B\}, \{C A1 A2 > A4 B\}, \{C A1 A3 A4 B\}, \{C A1 A2 A3 A4 B\}, \{C A2 B\}, \{C A2 A3 B\}, \{C > A2 A4 B\}, \{C A2 A3 A4 B\}, \{C A3 B\}, \{C A3 A4 B\}, \{C A4 B\}{quote}" > 3. "For SKIP_TO_FIRST/LAST there are two options how to handle cases when > there are no elements mapped to the specified variable." -> "For > SKIP_TO_FIRST/LAST there are two options how to handle cases when there are > no events mapped to the *PatternName*." -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-28503) Fix invalid link in Python FAQ Document
Biao Geng created FLINK-28503: - Summary: Fix invalid link in Python FAQ Document Key: FLINK-28503 URL: https://issues.apache.org/jira/browse/FLINK-28503 Project: Flink Issue Type: Improvement Components: Documentation, Project Website Reporter: Biao Geng Attachments: image-2022-07-12-14-51-01-434.png [https://nightlies.apache.org/flink/flink-docs-master/docs/dev/python/faq/#preparing-python-virtual-environment] The script for setting up the PyFlink virtual environment is invalid now. A candidate replacement is [https://nightlies.apache.org/flink/flink-docs-release-1.12/downloads/setup-pyflink-virtual-env.sh] or we could add this short script to the doc website directly. !image-2022-07-12-14-51-01-434.png!
[jira] [Updated] (FLINK-28495) Fix typos or mistakes of Flink CEP Document in the official website
[ https://issues.apache.org/jira/browse/FLINK-28495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-28495: -- Description: 1. "how you can migrate your job from an older Flink version to Flink-1.3." -> "how you can migrate your job from an older Flink version to Flink-1.13." 2. "Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 B. with combinations enabled: {quote}\\{ C A1 B\}, \{C A1 A2 B\}, \{C A1 A3 B\}, \{C A1 A4 B\}, \{C A1 A2 A3 B\}, \{C A1 A2 A4 B\}, \{C A1 A3 A4 B\}, \{C A1 A2 A3 A4 B\}{quote}" -> "Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 B. with combinations enabled: {quote}\{C A1 B\}, \{C A1 A2 B\}, \{C A1 A3 B\}, \{C A1 A4 B\}, \{C A1 A2 A3 B\}, \{C A1 A2 A4 B\}, \{C A1 A3 A4 B\}, \{C A1 A2 A3 A4 B\}, \{C A2 B\}, \{C A2 A3 B\}, \{C A2 A4 B\}, \{C A2 A3 A4 B\}, \{C A3 B\}, \{C A3 A4 B\}, \{C A4 B\}{quote}" 3. "For SKIP_TO_FIRST/LAST there are two options how to handle cases when there are no elements mapped to the specified variable." -> "For SKIP_TO_FIRST/LAST there are two options how to handle cases when there are no events mapped to the *PatternName*." was: 1. "how you can migrate your job from an older Flink version to Flink-1.3." -> "how you can migrate your job from an older Flink version to Flink-1.13." 2. "Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 B. with combinations enabled: {quote}{C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}{quote}" -> "Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 B. with combinations enabled: {quote}{C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}, {C A2 B}, {C A2 A3 B}, {C A2 A4 B}, {C A2 A3 A4 B}, {C A3 B}, {C A3 A4 B}, {C A4 B}{quote}" 3. 
"For SKIP_TO_FIRST/LAST there are two options how to handle cases when there are no elements mapped to the specified variable." -> "For SKIP_TO_FIRST/LAST there are two options how to handle cases when there are no events mapped to the *PatternName*." > Fix typos or mistakes of Flink CEP Document in the official website > --- > > Key: FLINK-28495 > URL: https://issues.apache.org/jira/browse/FLINK-28495 > Project: Flink > Issue Type: Improvement > Components: Library / CEP >Reporter: Biao Geng >Priority: Minor > > 1. "how you can migrate your job from an older Flink version to Flink-1.3." > -> "how you can migrate your job from an older Flink version to Flink-1.13." > 2. "Will generate the following matches for an input sequence: C D A1 A2 A3 D > A4 B. with combinations enabled: {quote}\\{ C A1 B\}, \{C A1 A2 B\}, \{C A1 > A3 B\}, \{C A1 A4 B\}, \{C A1 A2 A3 B\}, \{C A1 A2 A4 B\}, \{C A1 A3 A4 B\}, > \{C A1 A2 A3 A4 B\}{quote}" -> "Will generate the following matches for an > input sequence: C D A1 A2 A3 D A4 B. with combinations enabled: {quote}\{C A1 > B\}, \{C A1 A2 B\}, \{C A1 A3 B\}, \{C A1 A4 B\}, \{C A1 A2 A3 B\}, \{C A1 A2 > A4 B\}, \{C A1 A3 A4 B\}, \{C A1 A2 A3 A4 B\}, \{C A2 B\}, \{C A2 A3 B\}, \{C > A2 A4 B\}, \{C A2 A3 A4 B\}, \{C A3 B\}, \{C A3 A4 B\}, \{C A4 B\}{quote}" > 3. "For SKIP_TO_FIRST/LAST there are two options how to handle cases when > there are no elements mapped to the specified variable." -> "For > SKIP_TO_FIRST/LAST there are two options how to handle cases when there are > no events mapped to the *PatternName*." -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-28495) Fix typos or mistakes of Flink CEP Document in the official website
[ https://issues.apache.org/jira/browse/FLINK-28495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-28495: -- Description: 1. "how you can migrate your job from an older Flink version to Flink-1.3." -> "how you can migrate your job from an older Flink version to Flink-1.13." 2. "Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 B. with combinations enabled: {quote}{C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}{quote}" -> "Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 B. with combinations enabled: {quote}{C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}, {C A2 B}, {C A2 A3 B}, {C A2 A4 B}, {C A2 A3 A4 B}, {C A3 B}, {C A3 A4 B}, {C A4 B}{quote}" 3. "For SKIP_TO_FIRST/LAST there are two options how to handle cases when there are no elements mapped to the specified variable." -> "For SKIP_TO_FIRST/LAST there are two options how to handle cases when there are no events mapped to the *PatternName*." was: "how you can migrate your job from an older Flink version to Flink-1.3." -> "how you can migrate your job from an older Flink version to Flink-1.13." "Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 B. with combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}" -> "Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 B. with combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}, {C A2 B}, {C A2 A3 B}, {C A2 A4 B}, {C A2 A3 A4 B}, {C A3 B}, {C A3 A4 B}, {C A4 B}" "For SKIP_TO_FIRST/LAST there are two options how to handle cases when there are no elements mapped to the specified variable." 
-> "For SKIP_TO_FIRST/LAST there are two options how to handle cases when there are no events mapped to the *PatternName*." > Fix typos or mistakes of Flink CEP Document in the official website > --- > > Key: FLINK-28495 > URL: https://issues.apache.org/jira/browse/FLINK-28495 > Project: Flink > Issue Type: Improvement > Components: Library / CEP >Reporter: Biao Geng >Priority: Minor > > 1. "how you can migrate your job from an older Flink version to Flink-1.3." > -> "how you can migrate your job from an older Flink version to Flink-1.13." > 2. "Will generate the following matches for an input sequence: C D A1 A2 A3 D > A4 B. with combinations enabled: {quote}{C A1 B}, {C A1 A2 B}, {C A1 A3 B}, > {C A1 A4 B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 > B}{quote}" -> "Will generate the following matches for an input sequence: C D > A1 A2 A3 D A4 B. with combinations enabled: {quote}{C A1 B}, {C A1 A2 B}, {C > A1 A3 B}, {C A1 A4 B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 > A2 A3 A4 B}, {C A2 B}, {C A2 A3 B}, {C A2 A4 B}, {C A2 A3 A4 B}, {C A3 B}, {C > A3 A4 B}, {C A4 B}{quote}" > 3. "For SKIP_TO_FIRST/LAST there are two options how to handle cases when > there are no elements mapped to the specified variable." -> "For > SKIP_TO_FIRST/LAST there are two options how to handle cases when there are > no events mapped to the *PatternName*." -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-28495) Fix typos or mistakes of Flink CEP Document in the official website
[ https://issues.apache.org/jira/browse/FLINK-28495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-28495: -- Description: "how you can migrate your job from an older Flink version to Flink-1.3." -> "how you can migrate your job from an older Flink version to Flink-1.13." "Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 B. with combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}" -> "Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 B. with combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}, {C A2 B}, {C A2 A3 B}, {C A2 A4 B}, {C A2 A3 A4 B}, {C A3 B}, {C A3 A4 B}, {C A4 B}" "For SKIP_TO_FIRST/LAST there are two options how to handle cases when there are no elements mapped to the specified variable." -> "For SKIP_TO_FIRST/LAST there are two options how to handle cases when there are no events mapped to the *PatternName*." was: "Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 B. with combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}" -> "Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 B. with combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}, {C A2 B}, {C A2 A3 B}, {C A2 A4 B}, {C A2 A3 A4 B}, {C A3 B}, {C A3 A4 B}, {C A4 B}" "For SKIP_TO_FIRST/LAST there are two options how to handle cases when there are no elements mapped to the specified variable." -> "For SKIP_TO_FIRST/LAST there are two options how to handle cases when there are no events mapped to the *PatternName*." 
> Fix typos or mistakes of Flink CEP Document in the official website > --- > > Key: FLINK-28495 > URL: https://issues.apache.org/jira/browse/FLINK-28495 > Project: Flink > Issue Type: Improvement > Components: Library / CEP >Reporter: Biao Geng >Priority: Minor > > "how you can migrate your job from an older Flink version to Flink-1.3." -> > "how you can migrate your job from an older Flink version to Flink-1.13." > "Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 > B. with combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 > B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}" -> > "Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 > B. with combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 > B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}, {C A2 > B}, {C A2 A3 B}, {C A2 A4 B}, {C A2 A3 A4 B}, {C A3 B}, {C A3 A4 B}, {C A4 B}" > "For SKIP_TO_FIRST/LAST there are two options how to handle cases when there > are no elements mapped to the specified variable." -> "For SKIP_TO_FIRST/LAST > there are two options how to handle cases when there are no events mapped to > the *PatternName*." -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (FLINK-28495) Fix typos or mistakes of Flink CEP Document in the official website
[ https://issues.apache.org/jira/browse/FLINK-28495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-28495: -- Component/s: Library / CEP > Fix typos or mistakes of Flink CEP Document in the official website > --- > > Key: FLINK-28495 > URL: https://issues.apache.org/jira/browse/FLINK-28495 > Project: Flink > Issue Type: Improvement > Components: Library / CEP >Reporter: Biao Geng >Priority: Minor > > "Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 > B. with combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 > B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}" -> > "Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 > B. with combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 > B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}, {C A2 > B}, {C A2 A3 B}, {C A2 A4 B}, {C A2 A3 A4 B}, {C A3 B}, {C A3 A4 B}, {C A4 B}" > "For SKIP_TO_FIRST/LAST there are two options how to handle cases when there > are no elements mapped to the specified variable." -> "For SKIP_TO_FIRST/LAST > there are two options how to handle cases when there are no events mapped to > the *PatternName*." -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (FLINK-28495) Fix typos or mistakes of Flink CEP Document in the official website
Biao Geng created FLINK-28495: - Summary: Fix typos or mistakes of Flink CEP Document in the official website Key: FLINK-28495 URL: https://issues.apache.org/jira/browse/FLINK-28495 Project: Flink Issue Type: Improvement Reporter: Biao Geng "Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 B. with combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}" -> "Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 B. with combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}, {C A2 B}, {C A2 A3 B}, {C A2 A4 B}, {C A2 A3 A4 B}, {C A3 B}, {C A3 A4 B}, {C A4 B}" "For SKIP_TO_FIRST/LAST there are two options how to handle cases when there are no elements mapped to the specified variable." -> "For SKIP_TO_FIRST/LAST there are two options how to handle cases when there are no events mapped to the *PatternName*." -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (FLINK-28443) Provide official Flink image for PyFlink
[ https://issues.apache.org/jira/browse/FLINK-28443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564520#comment-17564520 ] Biao Geng commented on FLINK-28443: --- This improvement is good for the user experience, but one challenge may be how to manage the version matrix: considering 3 popular Python versions (3.7, 3.8, 3.9) and 3 Flink versions (1.14, 1.15, 1.16), there would already be 3 * 3 = 9 PyFlink images. If we also consider the Java version, the problem gets worse. > Provide official Flink image for PyFlink > > > Key: FLINK-28443 > URL: https://issues.apache.org/jira/browse/FLINK-28443 > Project: Flink > Issue Type: Improvement > Components: API / Python >Affects Versions: 1.15.0 >Reporter: Xuannan Su >Priority: Major > > Users must build a custom Flink image to include python and pyFlink to run a > pyFlink job with Native Kubernetes application mode. > I think we can improve the user experience by providing an official image > that pre-installs python and pyFlink.
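To make the combinatorics above concrete, a small sketch that enumerates such an image matrix (the tag naming scheme is hypothetical, purely for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Enumerates a hypothetical PyFlink image tag matrix to illustrate
// how quickly the number of images grows with each version axis.
public class ImageMatrixSketch {
    public static List<String> tags(String[] flinkVersions, String[] pythonVersions) {
        List<String> tags = new ArrayList<>();
        for (String flink : flinkVersions) {
            for (String python : pythonVersions) {
                tags.add("flink:" + flink + "-python" + python); // hypothetical tag scheme
            }
        }
        return tags;
    }

    public static void main(String[] args) {
        List<String> all = tags(new String[] {"1.14", "1.15", "1.16"},
                new String[] {"3.7", "3.8", "3.9"});
        System.out.println(all.size()); // 3 * 3 = 9; adding a Java axis multiplies it further
    }
}
```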
[jira] [Commented] (FLINK-28364) Python Job support for Kubernetes Operator
[ https://issues.apache.org/jira/browse/FLINK-28364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564277#comment-17564277 ] Biao Geng commented on FLINK-28364: --- Following the pattern we adopted for the SQL job example, I have created a simple [example|https://github.com/bgeng777/flink-kubernetes-operator/blob/python_example/examples/flink-python-example/python-example.yaml] to run PyFlink jobs. As [~nicholasjiang] said, the Python demo is trivial, but we may need more documentation about building a PyFlink image in the operator repo. [~gyfora] [~dianfu] I can take this ticket to add the PyFlink example. > Python Job support for Kubernetes Operator > -- > > Key: FLINK-28364 > URL: https://issues.apache.org/jira/browse/FLINK-28364 > Project: Flink > Issue Type: New Feature > Components: Kubernetes Operator >Reporter: wu3396 >Priority: Major > > *Describe the solution* > Job types that I want to support pyflink for > *Describe alternatives* > like [here|https://github.com/spotify/flink-on-k8s-operator/pull/165]
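The linked example essentially points the operator's generic jar-job entry point at Flink's Python driver. A hedged sketch of such a FlinkDeployment (the image name, jar path and script path are assumptions; check the linked example for the exact spec):

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: python-example
spec:
  image: my-pyflink-image:latest   # assumed: custom image with python + pyflink installed
  flinkVersion: v1_15
  serviceAccount: flink
  jobManager:
    resource: {memory: "2048m", cpu: 1}
  taskManager:
    resource: {memory: "2048m", cpu: 1}
  job:
    # assumed path of the flink-python jar inside the image
    jarURI: local:///opt/flink/opt/flink-python_2.12-1.15.0.jar
    entryClass: "org.apache.flink.client.python.PythonDriver"
    args: ["-py", "/opt/flink/examples/python/python_demo.py"]  # assumed script path
    parallelism: 1
```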
[jira] [Commented] (FLINK-27009) Support SQL job submission in flink kubernetes opeartor
[ https://issues.apache.org/jira/browse/FLINK-27009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562729#comment-17562729 ] Biao Geng commented on FLINK-27009: --- hi [~gyfora], sorry for the late reply. As Yang said, I had previously created a [PR|https://github.com/bgeng777/flink-kubernetes-operator/tree/sql-runner-demo/flink-kubernetes-sql-runner] that adds a SQL runner jar on the k8s operator side. The implementation is pretty basic: the SQL runner jar just parses the SQL statements in the yaml and then calls tableEnvironment.executeSql(statement) for each of them. In my PR, I planned to bundle it in the flink k8s operator image so that users can use it directly, and of course we can add some utils to make submitting SQL scripts easier. But after some tests, I met a problem: our flink kubernetes operator is based on Java 11, so the `sql-runner.jar` must be compiled for Java 11, while a flink image can be built on Java 8, which leads to incompatibility issues. I am not sure if your implementation will meet the same problem. So after some investigation, I would prefer to first do something in the upstream flink project to support running SQL scripts in Application Mode, so that we can use the flink SQL client as the entry class directly. Then, in our k8s operator, we can support SQL scripts more easily. It is somewhat similar to what we have done for YARN Application Mode support of Python jobs. I have created a [draft of Support submitting SQL scripts using Application Mode|https://docs.google.com/document/d/1C-o3wAgSE3TxxqaMxwIH7_Pt5w-v2PrSb6KU1vGNuE0/edit#heading=h.8rcq4dx7kcjy] for the upstream change, but I do agree that discussing it can take an uncertain amount of time. If we do not want this jira to be blocked by that, we can discuss how to improve the SQL runner solution instead. Hope my previous experience can be helpful.
> Support SQL job submission in flink kubernetes opeartor > --- > > Key: FLINK-27009 > URL: https://issues.apache.org/jira/browse/FLINK-27009 > Project: Flink > Issue Type: New Feature > Components: Kubernetes Operator >Reporter: Biao Geng >Assignee: Gyula Fora >Priority: Major > > Currently, the flink kubernetes opeartor is for jar job using application or > session cluster. For SQL job, there is no out of box solution in the > operator. > One simple and short-term solution is to wrap the SQL script into a jar job > using table API with limitation. > The long-term solution may work with > [FLINK-26541|https://issues.apache.org/jira/browse/FLINK-26541] to achieve > the full support. -- This message was sent by Atlassian Jira (v8.20.10#820010)
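The runner described above boils down to "split the script into statements, execute each one". A minimal, Flink-free sketch of the splitting step (naive on purpose: it does not handle semicolons inside string literals or comments; in the real runner each statement would be handed to tableEnvironment.executeSql(...)):

```java
import java.util.ArrayList;
import java.util.List;

// Naive SQL script splitter in the spirit of the sql-runner discussed above.
// Assumption: ';' appears only as a statement terminator.
public class SqlScriptSplitter {
    public static List<String> split(String script) {
        List<String> statements = new ArrayList<>();
        for (String part : script.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                statements.add(trimmed);
            }
        }
        return statements;
    }

    public static void main(String[] args) {
        String script = "CREATE TABLE src (x INT) WITH ('connector' = 'datagen');\n"
                + "INSERT INTO snk SELECT x FROM src;";
        // Each entry would be passed to tableEnvironment.executeSql(...) in the runner.
        System.out.println(split(script).size()); // 2 statements
    }
}
```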
[jira] [Comment Edited] (FLINK-24346) Flink on yarn application mode,LinkageError
[ https://issues.apache.org/jira/browse/FLINK-24346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561986#comment-17561986 ] Biao Geng edited comment on FLINK-24346 at 7/4/22 6:47 AM: --- Hi [~soberchi...@gamil.com], have you ever tried to run the program with {{flink run-application -t yarn-application -Dyarn.per-job-cluster.include-user-jar=DISABLED flink-demo-1.13.3.jar}}? (The option name `yarn.per-job-cluster.include-user-jar` is somewhat misleading, as it works for application mode as well; the naming has been improved in Flink 1.15.) I tested your program locally and the above command can be a workaround for the class loading problem. Besides, I believe your question is truly valuable and we may need to spend more time investigating what happens during the class loading process. was (Author: bgeng777): Hi [~soberchi...@gamil.com] have you ever tried to run the program with {{flink run-application -t yarn-application -Dyarn.per-job-cluster.include-user-jar=DISABLED flink-demo-1.13.3.jar}} ? (the option name `yarn.per-job-cluster.include-user-jar` is somehow misleading as it can work for application mode as well. This naming has been improved in flink1.15) I test your program locally and above command can be a workaround for the class loading problem. > Flink on yarn application mode,LinkageError > --- > > Key: FLINK-24346 > URL: https://issues.apache.org/jira/browse/FLINK-24346 > Project: Flink > Issue Type: Bug > Components: API / DataStream >Affects Versions: 1.13.1 > Environment: hadoop version 2.6.x >Reporter: 李伟高 >Priority: Major > > Hello, I'm changing from per job mode to application mode to submit tasks to > yarn.All jars that my task depends on are typed into my task jar.I submit the > task as perjob and work normally, but change to application mode and report > an error. 
> {code:java} > [0;39mjava.util.concurrent.CompletionException: > org.apache.flink.client.deployment.application.ApplicationExecutionException: > Could not execute application. at > java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) > ~[na:1.8.0_271] at > java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) > ~[na:1.8.0_271] at > java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:957) > ~[na:1.8.0_271] at > java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:940) > ~[na:1.8.0_271] at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > ~[na:1.8.0_271] at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > ~[na:1.8.0_271] at > org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.runApplicationEntryPoint(ApplicationDispatcherBootstrap.java:257) > ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at > org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$runApplicationAsync$1(ApplicationDispatcherBootstrap.java:212) > ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > ~[na:1.8.0_271] at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ~[na:1.8.0_271] at > org.apache.flink.runtime.concurrent.akka.ActorSystemScheduledExecutorAdapter$ScheduledFutureTask.run(ActorSystemScheduledExecutorAdapter.java:159) > ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at > akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) > ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44) > ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at > akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at > 
akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at > akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at > akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] Caused by: > org.apache.flink.client.deployment.application.ApplicationExecutionException: > Could not execute application. ... 11 common frames omitted Caused by: > java.lang.LinkageError: loader constraint violation: loader (instance of > org/apache/flink/util/ChildFirstClassLoader) previously initiated loading for > a different type with name "org/elasticsearch/client/RestClientBuilder" at > java.lang.ClassLoader.defineClass1(Native Method) ~[na:1.8.0_271] at >
[jira] [Commented] (FLINK-24346) Flink on yarn application mode,LinkageError
[ https://issues.apache.org/jira/browse/FLINK-24346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561986#comment-17561986 ] Biao Geng commented on FLINK-24346: --- Hi [~soberchi...@gamil.com], have you ever tried to run the program with {{flink run-application -t yarn-application -Dyarn.per-job-cluster.include-user-jar=DISABLED flink-demo-1.13.3.jar}}? (The option name `yarn.per-job-cluster.include-user-jar` is somewhat misleading, as it works for application mode as well; the naming has been improved in Flink 1.15.) I tested your program locally and the above command can be a workaround for the class loading problem. > Flink on yarn application mode,LinkageError > --- > > Key: FLINK-24346 > URL: https://issues.apache.org/jira/browse/FLINK-24346 > Project: Flink > Issue Type: Bug > Components: API / DataStream >Affects Versions: 1.13.1 > Environment: hadoop version 2.6.x >Reporter: 李伟高 >Priority: Major > > Hello, I'm changing from per job mode to application mode to submit tasks to > yarn.All jars that my task depends on are typed into my task jar.I submit the > task as perjob and work normally, but change to application mode and report > an error. > {code:java} > [0;39mjava.util.concurrent.CompletionException: > org.apache.flink.client.deployment.application.ApplicationExecutionException: > Could not execute application. 
at > java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292) > ~[na:1.8.0_271] at > java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308) > ~[na:1.8.0_271] at > java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:957) > ~[na:1.8.0_271] at > java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:940) > ~[na:1.8.0_271] at > java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) > ~[na:1.8.0_271] at > java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) > ~[na:1.8.0_271] at > org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.runApplicationEntryPoint(ApplicationDispatcherBootstrap.java:257) > ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at > org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$runApplicationAsync$1(ApplicationDispatcherBootstrap.java:212) > ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > ~[na:1.8.0_271] at java.util.concurrent.FutureTask.run(FutureTask.java:266) > ~[na:1.8.0_271] at > org.apache.flink.runtime.concurrent.akka.ActorSystemScheduledExecutorAdapter$ScheduledFutureTask.run(ActorSystemScheduledExecutorAdapter.java:159) > ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at > akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) > ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at > akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44) > ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at > akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) > ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at > akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) > ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at > 
akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) > ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at > akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107) > ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] Caused by: > org.apache.flink.client.deployment.application.ApplicationExecutionException: > Could not execute application. ... 11 common frames omitted Caused by: > java.lang.LinkageError: loader constraint violation: loader (instance of > org/apache/flink/util/ChildFirstClassLoader) previously initiated loading for > a different type with name "org/elasticsearch/client/RestClientBuilder" at > java.lang.ClassLoader.defineClass1(Native Method) ~[na:1.8.0_271] at > java.lang.ClassLoader.defineClass(ClassLoader.java:756) ~[na:1.8.0_271] at > java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) > ~[na:1.8.0_271] at > java.net.URLClassLoader.defineClass(URLClassLoader.java:468) ~[na:1.8.0_271] > at java.net.URLClassLoader.access$100(URLClassLoader.java:74) ~[na:1.8.0_271] > at java.net.URLClassLoader$1.run(URLClassLoader.java:369) ~[na:1.8.0_271] at > java.net.URLClassLoader$1.run(URLClassLoader.java:363) ~[na:1.8.0_271] at > java.security.AccessController.doPrivileged(Native Method) ~[na:1.8.0_271] at > java.net.URLClassLoader.findClass(URLClassLoader.java:362) ~[na:1.8.0_271] at >
[jira] [Comment Edited] (FLINK-28199) Failures on YARNHighAvailabilityITCase.testClusterClientRetrieval and YARNHighAvailabilityITCase.testKillYarnSessionClusterEntrypoint
[ https://issues.apache.org/jira/browse/FLINK-28199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557506#comment-17557506 ] Biao Geng edited comment on FLINK-28199 at 6/22/22 3:03 PM: According to the log, the exception is thrown because the yarn application for testKillYarnSessionClusterEntrypoint is not stopped as expected, which in turn also leads to the failure of testClusterClientRetrieval. Similarly to [~ferenc-csaky]'s analysis, IIUC, the PR for FLINK-27677 may not be relevant to this failure, as FLINK-27677's changes only touch the @AfterAll method, which should be executed after all tests have finished, while this failure happens after a single test. It looks like {{killApplicationAndWait()}} may return prematurely due to a side effect of the previous test. was (Author: bgeng777): According to the log, the exception is thrown because the yarn application for testKillYarnSessionClusterEntrypoint is not stopped as expected, which in turn also leads to the failure of testClusterClientRetrieval. IIUC, the PR for FLINK-27677 may not be relevant to this failure, as [~ferenc-csaky]'s changes only touch the @AfterAll method, which should be executed after all tests have finished, while this failure happens after a single test. It looks like {{killApplicationAndWait()}} may return prematurely due to a side effect of the previous test. > Failures on YARNHighAvailabilityITCase.testClusterClientRetrieval and > YARNHighAvailabilityITCase.testKillYarnSessionClusterEntrypoint > - > > Key: FLINK-28199 > URL: https://issues.apache.org/jira/browse/FLINK-28199 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.16.0 >Reporter: Martijn Visser >Priority: Major > Labels: test-stability > > {code:java} > Jun 22 08:57:50 [ERROR] Errors: > Jun 22 08:57:50 [ERROR] > YARNHighAvailabilityITCase.testClusterClientRetrieval » Timeout > testClusterCli...
> Jun 22 08:57:50 [ERROR] > YARNHighAvailabilityITCase.testKillYarnSessionClusterEntrypoint:156->YarnTestBase.runTest:288->lambda$testKillYarnSessionClusterEntrypoint$0:182->waitForJobTermination:325 > » Execution > Jun 22 08:57:50 [INFO] > Jun 22 08:57:50 [ERROR] Tests run: 27, Failures: 0, Errors: 2, Skipped: 0 > {code} > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=37037=logs=fc5181b0-e452-5c8f-68de-1097947f6483=995c650b-6573-581c-9ce6-7ad4cc038461=29523 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-28199) Failures on YARNHighAvailabilityITCase.testClusterClientRetrieval and YARNHighAvailabilityITCase.testKillYarnSessionClusterEntrypoint
[ https://issues.apache.org/jira/browse/FLINK-28199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557506#comment-17557506 ] Biao Geng commented on FLINK-28199: --- According to the log, the exception is thrown as the yarn application for testKillYarnSessionClusterEntrypoint is not stopped as expected, which also leads to the failure of testClusterClientRetrieval furthermore. IIUC, the PR for FLINK-27677 may not be relevant to this failure as [~ferenc-csaky]'s codes only change the @AfterAll method, which should be executed after all tests finished while this failure happens after a single test. It looks that {{killApplicationAndWait()}} may wrongly return due to the side effect of the previous test. > Failures on YARNHighAvailabilityITCase.testClusterClientRetrieval and > YARNHighAvailabilityITCase.testKillYarnSessionClusterEntrypoint > - > > Key: FLINK-28199 > URL: https://issues.apache.org/jira/browse/FLINK-28199 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.16.0 >Reporter: Martijn Visser >Priority: Major > Labels: test-stability > > {code:java} > Jun 22 08:57:50 [ERROR] Errors: > Jun 22 08:57:50 [ERROR] > YARNHighAvailabilityITCase.testClusterClientRetrieval » Timeout > testClusterCli... > Jun 22 08:57:50 [ERROR] > YARNHighAvailabilityITCase.testKillYarnSessionClusterEntrypoint:156->YarnTestBase.runTest:288->lambda$testKillYarnSessionClusterEntrypoint$0:182->waitForJobTermination:325 > » Execution > Jun 22 08:57:50 [INFO] > Jun 22 08:57:50 [ERROR] Tests run: 27, Failures: 0, Errors: 2, Skipped: 0 > {code} > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=37037=logs=fc5181b0-e452-5c8f-68de-1097947f6483=995c650b-6573-581c-9ce6-7ad4cc038461=29523 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27667) YARNHighAvailabilityITCase fails with "Failed to delete temp directory /tmp/junit1681"
[ https://issues.apache.org/jira/browse/FLINK-27667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557287#comment-17557287 ] Biao Geng commented on FLINK-27667: --- Hi [~martijnvisser] , I see [~ferenc-csaky] has created the PR, which LGTM. The change I had in mind is exactly the same as his PR, so I may not need to add other changes. > YARNHighAvailabilityITCase fails with "Failed to delete temp directory > /tmp/junit1681" > -- > > Key: FLINK-27667 > URL: https://issues.apache.org/jira/browse/FLINK-27667 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.16.0 >Reporter: Martijn Visser >Assignee: Ferenc Csaky >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.16.0 > > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=35733=logs=fc5181b0-e452-5c8f-68de-1097947f6483=995c650b-6573-581c-9ce6-7ad4cc038461=29208 > > {code:bash} > May 17 08:36:22 [INFO] Results: > May 17 08:36:22 [INFO] > May 17 08:36:22 [ERROR] Errors: > May 17 08:36:22 [ERROR] YARNHighAvailabilityITCase » IO Failed to delete temp > directory /tmp/junit1681... > May 17 08:36:22 [INFO] > May 17 08:36:22 [ERROR] Tests run: 28, Failures: 0, Errors: 1, Skipped: 0 > May 17 08:36:22 [INFO] > {code} > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Comment Edited] (FLINK-27667) YARNHighAvailabilityITCase fails with "Failed to delete temp directory /tmp/junit1681"
[ https://issues.apache.org/jira/browse/FLINK-27667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1732#comment-1732 ] Biao Geng edited comment on FLINK-27667 at 6/18/22 6:36 AM: Hi [~ferenc-csaky] [~chesnay], I did some investigation and hope it can help us enhance the stability: In {{{}YARNHighAvailabilityITCase{}}}, we currently override the {{teardown()}} method of the {{YarnTestBase}} (see [codes|https://github.com/apache/flink/blob/3ae4c6f5a48105d00807e8ce02e70d4c092cbf40/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNHighAvailabilityITCase.java#L126] here) and as a result, only {{{}YARNHighAvailabilityITCase{}}}'s {{teardown()}} will be executed after all HA tests finish. The above behavior may lead to a potential race condition: {{YARNHighAvailabilityITCase}} relies on the {{@TempDir protected static File tmp}} defined in {{YarnTestBase}} as YARN's parent dir for the staging dir of each YARN application to launch the mini YARN cluster using {{{}RawLocalFileSystem{}}}. According to JUnit5's [doc|https://junit.org/junit5/docs/5.4.1/api/org/junit/jupiter/api/io/TempDir.html], the TempDir will be deleted recursively when the test class has finished execution. But as the {{teardown()}} method of the base class is not executed, there is no guarantee when the mini YARN cluster will be cleaned up (e.g. deleting a staging dir like {{{}/tmp/junit1681458499635252469/.flink/application_1652775626514_0003{}}}). As a result, when JUnit wants to delete the TempDir and happens to see the staging dir, it is possible that the staging dir is deleted by YARN's cleanup method before being deleted by JUnit, which incurs the {{{}java.io.IOException: Failed to delete temp directory{}}} exception. I tried to add the call of the teardown method of the base class (see [codes|https://github.com/bgeng777/flink/commit/c4a4c8c8d4d1dafaa2875dd8d88133e38a78d438] here) and manually triggered the test 5 times (test [1|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=119=results], [2|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=120=results], [3|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=121=results], [4|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=122=results], [5|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=123=results]) in my own azure pipeline. All of them passed the misc tests. One further question is that I am not so sure why this test stays stable in JUnit 4. Maybe the @TempDir annotation behaves somehow differently. was (Author: bgeng777): Hi [~ferenc-csaky] [~chesnay], I did some investigation and hope it can help us enhance the stability: In {{{}YARNHighAvailabilityITCase{}}}, we currently override the {{teardown()}} method of the {{YarnTestBase}} (see [codes|https://github.com/apache/flink/blob/3ae4c6f5a48105d00807e8ce02e70d4c092cbf40/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNHighAvailabilityITCase.java#L126] here) and as a result, only {{{}YARNHighAvailabilityITCase{}}}'s {{teardown()}} will be executed after all HA tests finish. The above behavior may lead to a potential race condition: {{YARNHighAvailabilityITCase}} relies on the {{@TempDir protected static File tmp}} defined in {{YarnTestBase}} as YARN's parent dir for the staging dir of each YARN application to launch the mini YARN cluster using {{{}RawLocalFileSystem{}}}. According to JUnit5's [doc|https://junit.org/junit5/docs/5.4.1/api/org/junit/jupiter/api/io/TempDir.html], the TempDir will be deleted recursively when the test class has finished execution. But as the {{teardown()}} method of the base class is not executed, there is no guarantee when the mini YARN cluster will be cleaned up (e.g. deleting a staging dir like {{{}/tmp/junit1681458499635252469/.flink/application_1652775626514_0003{}}}). As a result, when JUnit wants to delete the TempDir and happens to see the staging dir, it is possible that the staging dir is deleted by YARN's cleanup method before being deleted by JUnit, which incurs the {{{}java.io.IOException: Failed to delete temp directory{}}} exception. I tried to add the call of the teardown method of the base class (see [codes|https://github.com/bgeng777/flink/commit/c4a4c8c8d4d1dafaa2875dd8d88133e38a78d438] here) and manually triggered the test 5 times (test [1|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=119=results], [2|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=120=results], [3|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=121=results], [4|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=122=results], [5|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=123=results]) in my own azure pipeline. All of them passed the misc tests. One further question is that I am not so sure why this test stays stable in JUnit 4. Maybe the @TempDir annotation behaves somehow differently.
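The pitfall described above can be sketched in plain Java (hypothetical class names, no JUnit dependency, so this is an analogy rather than the actual test code): when a subclass redefines a base class's cleanup method without delegating to the superclass, the base cleanup silently never runs.

```java
// Minimal sketch with hypothetical names: HaTest stands in for
// YARNHighAvailabilityITCase, BaseTest for YarnTestBase.
class BaseTest {
    static boolean clusterShutDown = false;

    void teardown() {
        clusterShutDown = true; // stands in for stopping the mini YARN cluster
    }
}

class HaTest extends BaseTest {
    @Override
    void teardown() {
        // Subclass-specific cleanup only; super.teardown() is never called,
        // so the base cleanup is silently skipped.
    }
}

class FixedHaTest extends BaseTest {
    @Override
    void teardown() {
        super.teardown(); // the proposed fix: delegate to the base class
    }
}

public class TeardownShadowingDemo {
    public static void main(String[] args) {
        new HaTest().teardown();
        System.out.println("after HaTest: " + BaseTest.clusterShutDown);      // false

        new FixedHaTest().teardown();
        System.out.println("after FixedHaTest: " + BaseTest.clusterShutDown); // true
    }
}
```

With JUnit 5 lifecycle methods the effect is the same in spirit: if only the subclass's cleanup runs, whatever the base cleanup was responsible for (here, tearing down the mini YARN cluster before JUnit deletes the TempDir) happens at an unpredictable later time.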
[jira] [Comment Edited] (FLINK-27667) YARNHighAvailabilityITCase fails with "Failed to delete temp directory /tmp/junit1681"
[ https://issues.apache.org/jira/browse/FLINK-27667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1732#comment-1732 ] Biao Geng edited comment on FLINK-27667 at 6/18/22 6:34 AM: Hi [~ferenc-csaky] [~chesnay], I did some investigation and hope it can help us enhance the stability: In {{{}YARNHighAvailabilityITCase{}}}, we currently override the {{teardown()}} method of the {{YarnTestBase}} (see [codes|https://github.com/apache/flink/blob/3ae4c6f5a48105d00807e8ce02e70d4c092cbf40/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNHighAvailabilityITCase.java#L126] here) and as a result, only {{{}YARNHighAvailabilityITCase{}}}'s {{teardown()}} will be executed after all HA tests finish. The above behavior may lead to a potential race condition: {{YARNHighAvailabilityITCase}} relies on the {{@TempDir protected static File tmp}} defined in {{YarnTestBase}} as YARN's parent dir for the staging dir of each YARN application to launch the mini YARN cluster using {{{}RawLocalFileSystem{}}}. According to JUnit5's [doc|https://junit.org/junit5/docs/5.4.1/api/org/junit/jupiter/api/io/TempDir.html], the TempDir will be deleted recursively when the test class has finished execution. But as the {{teardown()}} method of the base class is not executed, there is no guarantee when the mini YARN cluster will be cleaned up (e.g. deleting a staging dir like {{{}/tmp/junit1681458499635252469/.flink/application_1652775626514_0003{}}}). As a result, when JUnit wants to delete the TempDir and happens to see the staging dir, it is possible that the staging dir is deleted by YARN's cleanup method before being deleted by JUnit, which incurs the {{{}java.io.IOException: Failed to delete temp directory{}}} exception.
I tried to add the call of the teardown method of the base class (see [codes|https://github.com/bgeng777/flink/commit/c4a4c8c8d4d1dafaa2875dd8d88133e38a78d438] here) and manually triggered the test for 5 times(test[1|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=119=results], [2|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=120=results], [3|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=121=results], [4|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=122=results], [5|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=123=results]) in my own azure pipeline. All of them passed the misc tests. One further question is that I am not so sure why this test keeps stable in JUnit 4. Maybe the @TempDir annotation behaves somehow differently. was (Author: bgeng777): [~martijnvisser] I will do some investigation > YARNHighAvailabilityITCase fails with "Failed to delete temp directory > /tmp/junit1681" > -- > > Key: FLINK-27667 > URL: https://issues.apache.org/jira/browse/FLINK-27667 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.16.0 >Reporter: Martijn Visser >Assignee: Ferenc Csaky >Priority: Critical > Labels: pull-request-available, test-stability > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=35733=logs=fc5181b0-e452-5c8f-68de-1097947f6483=995c650b-6573-581c-9ce6-7ad4cc038461=29208 > > {code:bash} > May 17 08:36:22 [INFO] Results: > May 17 08:36:22 [INFO] > May 17 08:36:22 [ERROR] Errors: > May 17 08:36:22 [ERROR] YARNHighAvailabilityITCase » IO Failed to delete temp > directory /tmp/junit1681... > May 17 08:36:22 [INFO] > May 17 08:36:22 [ERROR] Tests run: 28, Failures: 0, Errors: 1, Skipped: 0 > May 17 08:36:22 [INFO] > {code} > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27667) YARNHighAvailabilityITCase fails with "Failed to delete temp directory /tmp/junit1681"
[ https://issues.apache.org/jira/browse/FLINK-27667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1732#comment-1732 ] Biao Geng commented on FLINK-27667: --- [~martijnvisser] I will do some investigation > YARNHighAvailabilityITCase fails with "Failed to delete temp directory > /tmp/junit1681" > -- > > Key: FLINK-27667 > URL: https://issues.apache.org/jira/browse/FLINK-27667 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.16.0 >Reporter: Martijn Visser >Assignee: Ferenc Csaky >Priority: Critical > Labels: test-stability > > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=35733=logs=fc5181b0-e452-5c8f-68de-1097947f6483=995c650b-6573-581c-9ce6-7ad4cc038461=29208 > > {code:bash} > May 17 08:36:22 [INFO] Results: > May 17 08:36:22 [INFO] > May 17 08:36:22 [ERROR] Errors: > May 17 08:36:22 [ERROR] YARNHighAvailabilityITCase » IO Failed to delete temp > directory /tmp/junit1681... > May 17 08:36:22 [INFO] > May 17 08:36:22 [ERROR] Tests run: 28, Failures: 0, Errors: 1, Skipped: 0 > May 17 08:36:22 [INFO] > {code} > -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-28025) Document the change of serviceAccount in upgrading doc of k8s opeator
[ https://issues.apache.org/jira/browse/FLINK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553605#comment-17553605 ] Biao Geng commented on FLINK-28025: --- cc [~wangyang0918] > Document the change of serviceAccount in upgrading doc of k8s opeator > - > > Key: FLINK-28025 > URL: https://issues.apache.org/jira/browse/FLINK-28025 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Minor > Labels: pull-request-available > > Since 1.0.0, we require users to specify serviceAccount and add corresponding > validation. > According to the experience of [~czchen] , we had better document such change > in the upgrading doc. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-28025) Document the change of serviceAccount in upgrading doc of k8s opeator
Biao Geng created FLINK-28025: - Summary: Document the change of serviceAccount in upgrading doc of k8s opeator Key: FLINK-28025 URL: https://issues.apache.org/jira/browse/FLINK-28025 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Biao Geng Since 1.0.0, we require users to specify serviceAccount and add corresponding validation. According to the experience of [~czchen] , we had better document such change in the upgrading doc. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27009) Support SQL job submission in flink kubernetes opeartor
[ https://issues.apache.org/jira/browse/FLINK-27009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550265#comment-17550265 ] Biao Geng commented on FLINK-27009: --- Hi [~mbalassi] , sorry for the late reply. I have finished the [poc|https://github.com/bgeng777/flink/commit/5b4338fe52ec343326927f0fc12f015dd22b1133] that directly utilizes the sql client. But it turns out that I need more time to finish the initial draft of the FLIP. I will notify you once I finish it. > Support SQL job submission in flink kubernetes opeartor > --- > > Key: FLINK-27009 > URL: https://issues.apache.org/jira/browse/FLINK-27009 > Project: Flink > Issue Type: New Feature > Components: Kubernetes Operator >Reporter: Biao Geng >Assignee: Biao Geng >Priority: Major > > Currently, the flink kubernetes opeartor is for jar job using application or > session cluster. For SQL job, there is no out of box solution in the > operator. > One simple and short-term solution is to wrap the SQL script into a jar job > using table API with limitation. > The long-term solution may work with > [FLINK-26541|https://issues.apache.org/jira/browse/FLINK-26541] to achieve > the full support. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27894) Build flink-connector-hive failed using Maven@3.8.5
[ https://issues.apache.org/jira/browse/FLINK-27894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17546186#comment-17546186 ] Biao Geng commented on FLINK-27894: --- cc [~luoyuxia] have you ever met such an error? > Build flink-connector-hive failed using Maven@3.8.5 > --- > > Key: FLINK-27894 > URL: https://issues.apache.org/jira/browse/FLINK-27894 > Project: Flink > Issue Type: Improvement >Reporter: Biao Geng >Priority: Major > > When I tried to build the flink project locally with Java8 and Maven3.8.5, I met > the following error: > {code:java} > [ERROR] Failed to execute goal on project flink-connector-hive_2.12: Could > not resolve dependencies for project > org.apache.flink:flink-connector-hive_2.12:jar:1.16-SNAPSHOT: Failed to > collect dependencies at org.apache.hive:hive-exec:jar:2.3.9 -> > org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde: Failed to read > artifact descriptor for > org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde: Could not transfer > artifact org.pentaho:pentaho-aggdesigner-algorithm:pom:5.1.5-jhyde from/to > maven-default-http-blocker (http://0.0.0.0/): Blocked mirror for > repositories: [repository.jboss.org > (http://repository.jboss.org/nexus/content/groups/public/, default, > disabled), conjars (http://conjars.org/repo, default, releases+snapshots), > apache.snapshots (http://repository.apache.org/snapshots, default, > snapshots)] -> [Help 1] > {code} > After some investigation, the reason may be that since 3.8.1, Maven blocks > repositories using the "http" protocol by default. Due to [NIFI-8398], one possible > solution is adding > {code:xml} > <repositories> > <repository> > <id>conjars</id> > <url>https://conjars.org/repo</url> > </repository> > </repositories> > {code} > in the pom.xml of the flink-connector-hive module. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-27894) Build flink-connector-hive failed using Maven@3.8.5
Biao Geng created FLINK-27894: - Summary: Build flink-connector-hive failed using Maven@3.8.5 Key: FLINK-27894 URL: https://issues.apache.org/jira/browse/FLINK-27894 Project: Flink Issue Type: Improvement Reporter: Biao Geng When I tried to build the flink project locally with Java8 and Maven3.8.5, I met the following error: {code:java} [ERROR] Failed to execute goal on project flink-connector-hive_2.12: Could not resolve dependencies for project org.apache.flink:flink-connector-hive_2.12:jar:1.16-SNAPSHOT: Failed to collect dependencies at org.apache.hive:hive-exec:jar:2.3.9 -> org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde: Failed to read artifact descriptor for org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde: Could not transfer artifact org.pentaho:pentaho-aggdesigner-algorithm:pom:5.1.5-jhyde from/to maven-default-http-blocker (http://0.0.0.0/): Blocked mirror for repositories: [repository.jboss.org (http://repository.jboss.org/nexus/content/groups/public/, default, disabled), conjars (http://conjars.org/repo, default, releases+snapshots), apache.snapshots (http://repository.apache.org/snapshots, default, snapshots)] -> [Help 1] {code} After some investigation, the reason may be that since 3.8.1, Maven blocks repositories using the "http" protocol by default. Due to [NIFI-8398], one possible solution is adding {code:xml} <repositories> <repository> <id>conjars</id> <url>https://conjars.org/repo</url> </repository> </repositories> {code} in the pom.xml of the flink-connector-hive module. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27009) Support SQL job submission in flink kubernetes opeartor
[ https://issues.apache.org/jira/browse/FLINK-27009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17543405#comment-17543405 ] Biao Geng commented on FLINK-27009: --- Hi [~mbalassi] thanks for paying attention. I have already started working on this. After some offline discussion with [~wangyang0918] and [~dianfu], we have some alternative solutions and now prefer to do some work on [FLINK-26541|https://issues.apache.org/jira/browse/FLINK-26541] first. I am doing some basic verification work to make sure the solution is feasible for our k8s operator and once it is ready, I will present the full FLIP for the community to discuss. Given my current progress, I hope it will be proposed by next Friday. > Support SQL job submission in flink kubernetes opeartor > --- > > Key: FLINK-27009 > URL: https://issues.apache.org/jira/browse/FLINK-27009 > Project: Flink > Issue Type: New Feature > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Major > > Currently, the flink kubernetes opeartor is for jar job using application or > session cluster. For SQL job, there is no out of box solution in the > operator. > One simple and short-term solution is to wrap the SQL script into a jar job > using table API with limitation. > The long-term solution may work with > [FLINK-26541|https://issues.apache.org/jira/browse/FLINK-26541] to achieve > the full support. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27615) Document how to define namespaceSelector for k8s operator's webhook for different k8s versions
[ https://issues.apache.org/jira/browse/FLINK-27615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538690#comment-17538690 ] Biao Geng commented on FLINK-27615: --- [~wangyang0918] I'd like to take this ticket to alert the k8s version requirement and add some examples of customized namespace selectors. Can you assign it to me? > Document how to define namespaceSelector for k8s operator's webhook for > different k8s versions > -- > > Key: FLINK-27615 > URL: https://issues.apache.org/jira/browse/FLINK-27615 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Major > > In our webhook, to support {{{}watchNamespaces{}}}, we rely on the > {{kubernetes.io/metadata.name}} to filter the validation requests. However, > this label will be automatically added to a namespace only since k8s 1.21.1 > due to k8s' [release > notes|https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md#v1211] > . If users run the flink k8s operator on older k8s versions, they have to > add such label by themselevs to support the feature of namespaceSelector in > our webhook. > As a result, if we want to support the feature defaultly, we may need to > emphasize that users should use k8s >= v1.21.1 to run the flink k8s operator. > > Due to aitozi's advice, it may be better to add document of how to support > the nameSelector based validation instead of limiting the k8s version to be > >= 1.21.1 to support more users. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (FLINK-27615) Document how to define namespaceSelector for k8s operator's webhook for different k8s versions
[ https://issues.apache.org/jira/browse/FLINK-27615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-27615: -- Summary: Document how to define namespaceSelector for k8s operator's webhook for different k8s versions (was: Document the how to define namespaceSelector for k8s operator's webhook) > Document how to define namespaceSelector for k8s operator's webhook for > different k8s versions > -- > > Key: FLINK-27615 > URL: https://issues.apache.org/jira/browse/FLINK-27615 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Major > > In our webhook, to support {{{}watchNamespaces{}}}, we rely on the > {{kubernetes.io/metadata.name}} to filter the validation requests. However, > this label will be automatically added to a namespace only since k8s 1.21.1 > due to k8s' [release > notes|https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md#v1211] > . If users run the flink k8s operator on older k8s versions, they have to > add such label by themselevs to support the feature of namespaceSelector in > our webhook. > As a result, if we want to support the feature defaultly, we may need to > emphasize that users should use k8s >= v1.21.1 to run the flink k8s operator. > > Due to aitozi's advice, it may be better to add document of how to support > the nameSelector based validation instead of limiting the k8s version to be > >= 1.21.1 to support more users. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Comment Edited] (FLINK-27615) Document the how to define namespaceSelector for k8s operator's webhook
[ https://issues.apache.org/jira/browse/FLINK-27615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17537012#comment-17537012 ] Biao Geng edited comment on FLINK-27615 at 5/14/22 10:29 AM: - [~aitozi] I think your proposal makes sense. I checked the release notes of k8s and just noticed that 1.21.1 was released just 1 year ago. It looks like k8s iterates much more quickly than flink. IIUC, users, especially those who use k8s in production, will not update k8s to new versions frequently. I was doing my tests on k8s 1.20.4. Currently, besides the nameSelector feature, I have not met other issues. To support more k8s versions, I agree with you on making this feature non-default with better documentation. was (Author: bgeng777): [~aitozi] I think your proposal makes a lot of sense. I checked the release notes of k8s and just noticed that 1.21.1 was released just 1 year ago. K8s iterates much more quickly than flink. IIUC, users, especially those who use k8s in production, will not update k8s to new versions frequently. I was doing my tests on k8s 1.20.4. Currently, besides the nameSelector feature, I have not met other issues. To support more k8s versions, I agree with you on making this feature non-default with better documentation. > Document the how to define namespaceSelector for k8s operator's webhook > --- > > Key: FLINK-27615 > URL: https://issues.apache.org/jira/browse/FLINK-27615 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Major > > In our webhook, to support {{{}watchNamespaces{}}}, we rely on the > {{kubernetes.io/metadata.name}} to filter the validation requests. However, > this label will be automatically added to a namespace only since k8s 1.21.1 > due to k8s' [release > notes|https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md#v1211] > .
If users run the flink k8s operator on older k8s versions, they have to > add such label by themselevs to support the feature of namespaceSelector in > our webhook. > As a result, if we want to support the feature defaultly, we may need to > emphasize that users should use k8s >= v1.21.1 to run the flink k8s operator. > > Due to aitozi's advice, it may be better to add document of how to support > the nameSelector based validation instead of limiting the k8s version to be > >= 1.21.1 to support more users. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (FLINK-27615) Document the how to define namespaceSelector for k8s operator's webhook
[ https://issues.apache.org/jira/browse/FLINK-27615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-27615: -- Summary: Document the how to define namespaceSelector for k8s operator's webhook (was: Document the how to define ) > Document the how to define namespaceSelector for k8s operator's webhook > --- > > Key: FLINK-27615 > URL: https://issues.apache.org/jira/browse/FLINK-27615 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Major > > In our webhook, to support {{{}watchNamespaces{}}}, we rely on the > {{kubernetes.io/metadata.name}} to filter the validation requests. However, > this label will be automatically added to a namespace only since k8s 1.21.1 > due to k8s' [release > notes|https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md#v1211] > . If users run the flink k8s operator on older k8s versions, they have to > add such label by themselevs to support the feature of namespaceSelector in > our webhook. > As a result, if we want to support the feature defaultly, we may need to > emphasize that users should use k8s >= v1.21.1 to run the flink k8s operator. > > Due to aitozi's advice, it may be better to add document of how to support > the nameSelector based validation instead of limiting the k8s version to be > >= 1.21.1 to support more users. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (FLINK-27615) Document the how to define
[ https://issues.apache.org/jira/browse/FLINK-27615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-27615: -- Summary: Document the how to define (was: Document the minimum supported version of k8s for flink k8s operator) > Document the how to define > --- > > Key: FLINK-27615 > URL: https://issues.apache.org/jira/browse/FLINK-27615 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Major > > In our webhook, to support {{{}watchNamespaces{}}}, we rely on the > {{kubernetes.io/metadata.name}} to filter the validation requests. However, > this label will be automatically added to a namespace only since k8s 1.21.1 > due to k8s' [release > notes|https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md#v1211] > . If users run the flink k8s operator on older k8s versions, they have to > add such label by themselevs to support the feature of namespaceSelector in > our webhook. > As a result, if we want to support the feature defaultly, we may need to > emphasize that users should use k8s >= v1.21.1 to run the flink k8s operator. > > Due to aitozi's advice, it may be better to add document of how to support > the nameSelector based validation instead of limiting the k8s version to be > >= 1.21.1 to support more users. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (FLINK-27615) Document the minimum supported version of k8s for flink k8s operator
[ https://issues.apache.org/jira/browse/FLINK-27615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-27615: -- Description: In our webhook, to support {{watchNamespaces}}, we rely on the {{kubernetes.io/metadata.name}} label to filter the validation requests. However, this label is automatically added to a namespace only since k8s 1.21.1, according to the k8s [release notes|https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md#v1211]. If users run the flink k8s operator on older k8s versions, they have to add the label themselves to support the namespaceSelector feature in our webhook. As a result, if we want to support the feature by default, we may need to emphasize that users should use k8s >= v1.21.1 to run the flink k8s operator. Per aitozi's advice, it may be better to document how to support the namespaceSelector-based validation instead of limiting the k8s version to >= 1.21.1, so that more users are supported. was: In our webhook, to support {{watchNamespaces}}, we rely on the {{kubernetes.io/metadata.name}} label to filter the validation requests. However, this label is automatically added to a namespace only since k8s 1.21.1, according to the k8s [release notes|https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md#v1211]. If users run the flink k8s operator on older k8s versions, they have to add the label themselves to support the namespaceSelector feature in our webhook. As a result, if we want to support the feature by default, we may need to emphasize that users should use k8s >= v1.21.1 to run the flink k8s operator.
[jira] [Commented] (FLINK-27615) Document the minimum supported version of k8s for flink k8s operator
[ https://issues.apache.org/jira/browse/FLINK-27615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17537012#comment-17537012 ] Biao Geng commented on FLINK-27615: --- [~aitozi] I think your proposal makes a lot of sense. I checked the k8s release notes and noticed that 1.21.1 was released just one year ago; k8s iterates much more quickly than flink. IIUC, users, especially those who use k8s in production, will not update k8s to new versions frequently. I was doing my tests on k8s 1.20.4, and besides the namespaceSelector feature I have not met other issues. To support more k8s versions, I agree with you on making this feature non-default, with better documentation.
[jira] [Updated] (FLINK-27615) Document the minimum supported version of k8s for flink k8s operator
[ https://issues.apache.org/jira/browse/FLINK-27615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-27615: -- Description: In our webhook, to support {{watchNamespaces}}, we rely on the {{kubernetes.io/metadata.name}} label to filter the validation requests. However, this label is automatically added to a namespace only since k8s 1.21.1, according to the k8s [release notes|https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md#v1211]. If users run the flink k8s operator on older k8s versions, they have to add the label themselves to support the namespaceSelector feature in our webhook. As a result, if we want to support the feature by default, we may need to emphasize that users should use k8s >= v1.21.1 to run the flink k8s operator. was: In our webhook, to support {{{}watchNamespaces{}}}, we rely on the {{kubernetes.io/metadata.name}} to filter the validation requests. However, this label will be automatically added to a namespace only since k8s 1.21.1 due to k8s' [release notes|[https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md#v1211] . If users run the flink k8s operator on older k8s versions, they have add such label by themselevs to support the feature of namespaceSelector in our webhook. As a result, if we want to support the feature defaultly, we may need to emphasize that users should use k8s >= v1.21.1 to run the flink k8s operator.
[jira] [Created] (FLINK-27615) Document the minimum supported version of k8s for flink k8s operator
Biao Geng created FLINK-27615: - Summary: Document the minimum supported version of k8s for flink k8s operator Key: FLINK-27615 URL: https://issues.apache.org/jira/browse/FLINK-27615 Project: Flink Issue Type: Improvement Reporter: Biao Geng In our webhook, to support {{watchNamespaces}}, we rely on the {{kubernetes.io/metadata.name}} label to filter the validation requests. However, this label is automatically added to a namespace only since k8s 1.21.1, according to the k8s [release notes|https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md#v1211]. If users run the flink k8s operator on older k8s versions, they have to add the label themselves to support the namespaceSelector feature in our webhook. As a result, if we want to support the feature by default, we may need to emphasize that users should use k8s >= v1.21.1 to run the flink k8s operator.
[jira] [Updated] (FLINK-27615) Document the minimum supported version of k8s for flink k8s operator
[ https://issues.apache.org/jira/browse/FLINK-27615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-27615: -- Component/s: Kubernetes Operator
[jira] [Comment Edited] (FLINK-27329) Add default value of replica of JM pod and not declare it in example yamls
[ https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17535845#comment-17535845 ] Biao Geng edited comment on FLINK-27329 at 5/12/22 3:35 AM: [~wangyang0918] I see your point that if we set a default value for some fields (e.g. parallelism) in the operator, we cannot adopt flink's default value, as our default value in the *Spec* will be used. For parallelism and cpu, I will not add default values in our k8s operator. Besides, it is worth mentioning that our FlinkConfigBuilder verifies that parallelism is larger than 0, so we do not need to worry about it. But in setResource, we do not handle the cases where cpu or memory is null; as a result, a zero cpu value may be set. I improved it a little in the [PR|https://github.com/apache/flink-kubernetes-operator/pull/204]. > Add default value of replica of JM pod and not declare it in example yamls > -- > > Key: FLINK-27329 > URL: https://issues.apache.org/jira/browse/FLINK-27329 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator > Reporter: Biao Geng > Assignee: Biao Geng > Priority: Critical > Fix For: kubernetes-operator-1.0.0 > > > Currently, we do not explicitly set the default value of `replica` in `JobManagerSpec`. As a result, Java sets the default value to zero. > Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` to be 1. > After a deeper look while debugging the exception thrown in FLINK-27310, we find it would be better to set the default value of the `replica` field to 1 and remove the declaration in examples, for the following reasons: > 1. A normal Session or Application cluster should have at least one JM. The current default value, zero, does not fit the common case. > 2. One JM works for k8s HA mode as well, and if users really want to launch a standby JM for faster recovery, they can declare the value of the `replica` field in the yaml file. In examples, we just use the new default value (i.e. 1), which should be fine.
[jira] [Commented] (FLINK-27329) Add default value of replica of JM pod and not declare it in example yamls
[ https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17535845#comment-17535845 ] Biao Geng commented on FLINK-27329: --- [~wangyang0918] I see your point that if we set a default value for some fields (e.g. parallelism) in the operator, we cannot adopt flink's default value, as our default value in the *Spec* will be used. For parallelism and cpu, I will not add default values in our k8s operator. Besides, it is worth mentioning that our FlinkConfigBuilder verifies that parallelism is larger than 0, so we do not need to worry about it. But in setResource, we do not handle the cases where cpu or memory is null. I improved it a little in the [PR|https://github.com/apache/flink-kubernetes-operator/pull/204].
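The setResource issue discussed above can be boiled down to a small sketch (hypothetical class and method names, not the operator's actual code): if a resource spec leaves cpu unset and the value is read without a null check, the primitive default of 0 leaks through, so a null-safe fallback is needed.

```java
// Hypothetical sketch of the null-cpu problem: a boxed Double represents
// "not specified", and a null-safe accessor supplies a fallback instead of 0.
public class ResourceDefaults {
    // Null-safe accessor: fall back to the given default when cpu is unset.
    static double cpuOrDefault(Double cpu, double fallback) {
        return cpu == null ? fallback : cpu;
    }

    public static void main(String[] args) {
        Double unsetCpu = null;                          // "cpu not specified in the Spec"
        System.out.println(cpuOrDefault(unsetCpu, 1.0)); // prints 1.0 (fallback)
        System.out.println(cpuOrDefault(2.0, 1.0));      // prints 2.0 (explicit value wins)
    }
}
```

The fallback of 1.0 here is only an illustrative choice, echoing the comment's point that upstream Flink uses 1.0 for the JM cpu when it is not specified.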
[jira] [Comment Edited] (FLINK-27329) Add default value of replica of JM pod and not declare it in example yamls
[ https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534972#comment-17534972 ] Biao Geng edited comment on FLINK-27329 at 5/11/22 3:42 PM: Hi [~wangyang0918] [~gyfora], I revisited the default values in our *Spec*s and summarize them in the table below. IMO, most of them work well with a `null` default value, but besides JobManagerSpec#replicas, there are some fields that I believe we can improve: # *JobManagerSpec#resource#cpu & TaskManagerSpec#resource#cpu:* the current default value is 0, which is not consistent with upstream flink. In my mind, changing the default value to 1.0 is better because: a) for JM, if users do not specify it explicitly, flink will use 1.0; b) for TM, if users do not specify it explicitly, flink will use NUM_TASK_SLOTS, whose default value is also 1 in flink's default flink-conf.yaml. # *JobSpec#parallelism:* the current default value is 0, which is illegal. But I am not sure whether it is possible/proper for us to read the value of parallelism.default in flink-conf.yaml in the *JobSpec* constructor. I tend to leave it as it is or use `1` as the default. # *FlinkDeploymentSpec#serviceAccount:* the current default value is null; as a result, if we do not specify it, flink will use the `default` service account. This can be problematic, as we use `flink` as the default in our helm chart's values.yaml. Maybe `flink` would be a good candidate default value.
||Field||Default Value in Upstream Flink||Current Default Value in k8s Operator||
|FlinkDeploymentSpec#imagePullPolicy|KubernetesConfigOptions.ImagePullPolicy.IfNotPresent|null|
|FlinkDeploymentSpec#image|KubernetesConfigOptions#getDefaultFlinkImage()|null|
|*FlinkDeploymentSpec#serviceAccount*|"default"|null|
|FlinkDeploymentSpec#flinkVersion|not exist|null|
|FlinkDeploymentSpec#IngressSpec|not exist|null|
|FlinkDeploymentSpec#podTemplate|no default value|null|
|*JobManagerSpec#replicas*|not exist|0|
|JobManagerSpec#resource#memory|1600M (defined in flink-conf.yaml)|null|
|*JobManagerSpec#resource#cpu*|1.0|0|
|JobManagerSpec#podTemplate|no default value|null|
|TaskManagerSpec#resource#memory|1728m (defined in flink-conf.yaml)|null|
|*TaskManagerSpec#resource#cpu*|NUM_TASK_SLOTS (whose default value is 1 in flink-conf.yaml)|0|
|TaskManagerSpec#podTemplate|no default value|null|
|JobSpec#jarURI|no default value|null|
|*JobSpec#parallelism*|parallelism.default in flink-conf.yaml|0|
|JobSpec#entryClass|no default value|null|
|JobSpec#args|no default value|String[0]|
|JobSpec#state|not exist|JobState.RUNNING|
|JobSpec#savepointTriggerNonce|not exist|null|
|JobSpec#initialSavepointPath|not exist|null|
|JobSpec#upgradeMode|not exist|UpgradeMode.STATELESS|
|JobSpec#allowNonRestoredState|not exist|null|
[jira] [Commented] (FLINK-27329) Add default value of replica of JM pod and not declare it in example yamls
[ https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534972#comment-17534972 ] Biao Geng commented on FLINK-27329: --- Hi [~wangyang0918] [~gyfora], I revisited the default values in our *Spec*s and summarize them in the table below. IMO, most of them work well with a `null` default value, but besides JobManagerSpec#replicas, there are some fields that I believe we can improve: # *JobManagerSpec#resource#cpu & TaskManagerSpec#resource#cpu:* the current default value is 0, which is not consistent with upstream flink. In my mind, changing the default value to 1.0 is better because: a) for JM, if users do not specify it explicitly, flink will use 1.0; b) for TM, if users do not specify it explicitly, flink will use NUM_TASK_SLOTS, whose default value is also 1 in flink's default flink-conf.yaml. # *JobSpec#parallelism:* the current default value is 0, which is illegal. But I am not sure whether it is possible/proper for us to read the value of parallelism.default in flink-conf.yaml in the *JobSpec* constructor. I tend to leave it as it is or use `1` as the default. # *FlinkDeploymentSpec#serviceAccount:* the current default value is null; as a result, if we do not specify it, flink will use the `default` service account. This can be problematic, as we use `flink` as the default in our helm chart's values.yaml. I am not so sure why we expose it as a first-class field, but maybe `flink` would be a good candidate default value.
||Field||Default Value in Upstream Flink Native K8s||Current Default Value in K8s Operator||
|FlinkDeploymentSpec#imagePullPolicy|KubernetesConfigOptions.ImagePullPolicy.IfNotPresent|null|
|FlinkDeploymentSpec#image|KubernetesConfigOptions#getDefaultFlinkImage()|null|
|*FlinkDeploymentSpec#serviceAccount*|"default"|null|
|FlinkDeploymentSpec#flinkVersion| |null|
|FlinkDeploymentSpec#IngressSpec| |null|
|FlinkDeploymentSpec#podTemplate|no default value|null|
|*JobManagerSpec#replicas*|1|0|
|JobManagerSpec#resource#memory|1600M (defined in flink-conf.yaml)|null|
|*JobManagerSpec#resource#cpu*|1.0|0|
|JobManagerSpec#podTemplate|no default value|null|
|TaskManagerSpec#resource#memory|1728m (defined in flink-conf.yaml)|null|
|*TaskManagerSpec#resource#cpu*|NUM_TASK_SLOTS (whose default value is 1 in flink-conf.yaml)|0|
|TaskManagerSpec#podTemplate|no default value|null|
|JobSpec#jarURI|no default value|null|
|*JobSpec#parallelism*|parallelism.default in flink-conf.yaml|0|
|JobSpec#entryClass|no default value|null|
|JobSpec#args|no default value|String[0]|
|JobSpec#state| |JobState.RUNNING|
|JobSpec#savepointTriggerNonce| |null|
|JobSpec#initialSavepointPath| |null|
|JobSpec#upgradeMode| |UpgradeMode.STATELESS|
|JobSpec#allowNonRestoredState| |null|
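The root cause described in the issue can be reproduced with a minimal sketch (the class name is a hypothetical stand-in for JobManagerSpec, not the operator's actual class): a Java int field with no initializer defaults to 0, so a `replicas` field left undeclared in the CR ends up as 0 unless an explicit default is set.

```java
// Minimal illustration of the implicit-zero problem and the proposed fix.
public class JobManagerSpecSketch {       // hypothetical stand-in for JobManagerSpec
    int replicas;                         // no initializer -> Java default 0
    int replicasWithDefault = 1;          // proposed fix: explicit default of 1

    public static void main(String[] args) {
        JobManagerSpecSketch spec = new JobManagerSpecSketch();
        System.out.println(spec.replicas);            // prints 0
        System.out.println(spec.replicasWithDefault); // prints 1
    }
}
```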
[jira] [Commented] (FLINK-27329) Add default value of replica of JM pod and not declare it in example yamls
[ https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534341#comment-17534341 ] Biao Geng commented on FLINK-27329: --- Yes, I will create a PR in the next 2 days. Sorry for the delay; I was occupied with other offline tasks.
[jira] [Commented] (FLINK-27412) Allow flinkVersion v1_13 in flink-kubernetes-operator
[ https://issues.apache.org/jira/browse/FLINK-27412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528115#comment-17528115 ] Biao Geng commented on FLINK-27412: --- Big +1 for this ticket. Most users, especially those who use flink in production, are not using the latest versions like 1.14 or 1.15. I have verified most examples in my own cluster using flink 1.13 (both the open source and ververica's impl), and as far as I can see, they work well. > Allow flinkVersion v1_13 in flink-kubernetes-operator > - > > Key: FLINK-27412 > URL: https://issues.apache.org/jira/browse/FLINK-27412 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator > Reporter: Yang Wang > Priority: Major > Labels: starter > Fix For: kubernetes-operator-1.0.0 > > > The core k8s related features: > * native k8s integration for session cluster, 1.10 > * native k8s integration for application cluster, 1.11 > * Flink K8s HA, 1.12 > * pod template, 1.13 > So we could set the required minimum version to 1.13. This will allow more users to have a try on flink-kubernetes-operator. > > BTW, we need to update the e2e tests to cover all the supported versions.
[jira] [Commented] (FLINK-27329) Add default value of replica of JM pod and not declare it in example yamls
[ https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525432#comment-17525432 ] Biao Geng commented on FLINK-27329: --- [~wangyang0918] Reviewing all default values makes sense to me. I will check the specs under the {{org.apache.flink.kubernetes.operator.crd.spec}} package to find whether there are other fields, like the JM replica, that do not have a proper default value.
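One possible way to carry out such a review (a rough sketch under assumed class names, not the operator's code) is to scan each spec class for primitive-typed fields, since those silently take Java's implicit defaults (0, false) instead of a detectable null:

```java
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

public class SpecAudit {
    static class ExampleSpec {   // hypothetical spec class for illustration
        int replicas;            // primitive: silently defaults to 0
        Integer parallelism;     // boxed: unset is a detectable null
    }

    // Collect names of primitive fields whose implicit default may be unintended.
    static List<String> primitiveFields(Class<?> clazz) {
        List<String> names = new ArrayList<>();
        for (Field f : clazz.getDeclaredFields()) {
            if (f.getType().isPrimitive()) {
                names.add(f.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(primitiveFields(ExampleSpec.class)); // prints [replicas]
    }
}
```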
[jira] [Updated] (FLINK-27329) Add default value of replica of JM pod and not declare it in example yamls
[ https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-27329: -- Description: Currently, we do not explicitly set the default value of `replica` in `JobManagerSpec`. As a result, Java sets the default value to be zero. Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` to be 1. After a deeper look when debugging the exception thrown in FLINK-27310, we find it would be better to set the default value to 1 for the `replica` field and remove the declaration in examples due to following reasons: 1. A normal Session or Application cluster should have at least one JM. The current default value, zero, does not follow the common case. 2. One JM can work for k8s HA mode as well and if users really want to launch a standby JM for faster recorvery, they can declare the value of `replica` field in the yaml file. In examples, we just use the new default value(i.e. 1), which should be fine. was: Currently, we do not explicitly set the default value of `replica` in `JobManagerSpec`. As a result, Java sets the default value to be zero. Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` to be 1. After a deeper look when debugging the exception thrown in FLINK-27310, we find it would be better to set the default value to 1 for `replica` fields and remove the declaration in examples due to following reasons: 1. A normal Session or Application cluster should have at least one JM. The current default value, zero, does not follow the common case. 2. One JM can work for k8s HA mode as well and if users really want to launch a standby JM for faster recorvery, they can declare the `replica` field in the yaml file. In examples, we just use the new default valu(i.e. 1) should be fine. 
> Add default value of replica of JM pod and not declare it in example yamls > -- > > Key: FLINK-27329 > URL: https://issues.apache.org/jira/browse/FLINK-27329 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Assignee: Biao Geng >Priority: Major > > Currently, we do not explicitly set the default value of `replica` in > `JobManagerSpec`. As a result, Java sets the default value to zero. > Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` > to be 1. > After a deeper look when debugging the exception thrown in FLINK-27310, we > find it would be better to set the default value to 1 for the `replica` field > and remove the declaration in examples for the following reasons: > 1. A normal Session or Application cluster should have at least one JM. The > current default value, zero, does not follow the common case. > 2. One JM can work for k8s HA mode as well, and if users really want to launch > a standby JM for faster recovery, they can declare the value of the `replica` > field in the yaml file. In examples, we just use the new default value (i.e. > 1), which should be fine. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Updated] (FLINK-27329) Add default value of replica of JM pod and not declare it in example yamls
[ https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-27329: -- Summary: Add default value of replica of JM pod and not declare it in example yamls (was: Add default value of replica of JM pod and remove declaring it in example yamls) > Add default value of replica of JM pod and not declare it in example yamls > -- > > Key: FLINK-27329 > URL: https://issues.apache.org/jira/browse/FLINK-27329 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Assignee: Biao Geng >Priority: Major > > Currently, we do not explicitly set the default value of `replica` in > `JobManagerSpec`. As a result, Java sets the default value to zero. > Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` > to be 1. > After a deeper look when debugging the exception thrown in FLINK-27310, we > find it would be better to set the default value to 1 for `replica` fields > and remove the declaration in examples for the following reasons: > 1. A normal Session or Application cluster should have at least one JM. The > current default value, zero, does not follow the common case. > 2. One JM can work for k8s HA mode as well, and if users really want to launch > a standby JM for faster recovery, they can declare the `replica` field in > the yaml file. In examples, we just use the new default value (i.e. 1), which should > be fine. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27329) Add default value of replica of JM pod and remove declaring it in example yamls
[ https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525020#comment-17525020 ] Biao Geng commented on FLINK-27329: --- cc [~wangyang0918] I can take this ticket~ > Add default value of replica of JM pod and remove declaring it in example > yamls > --- > > Key: FLINK-27329 > URL: https://issues.apache.org/jira/browse/FLINK-27329 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Major > > Currently, we do not explicitly set the default value of `replica` in > `JobManagerSpec`. As a result, Java sets the default value to zero. > Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` > to be 1. > After a deeper look when debugging the exception thrown in FLINK-27310, we > find it would be better to set the default value to 1 for `replica` fields > and remove the declaration in examples for the following reasons: > 1. A normal Session or Application cluster should have at least one JM. The > current default value, zero, does not follow the common case. > 2. One JM can work for k8s HA mode as well, and if users really want to launch > a standby JM for faster recovery, they can declare the `replica` field in > the yaml file. In examples, we just use the new default value (i.e. 1), which should > be fine. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (FLINK-27329) Add default value of replica of JM pod and remove declaring it in example yamls
Biao Geng created FLINK-27329: - Summary: Add default value of replica of JM pod and remove declaring it in example yamls Key: FLINK-27329 URL: https://issues.apache.org/jira/browse/FLINK-27329 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Biao Geng Currently, we do not explicitly set the default value of `replica` in `JobManagerSpec`. As a result, Java sets the default value to zero. Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` to be 1. After a deeper look when debugging the exception thrown in FLINK-27310, we find it would be better to set the default value to 1 for `replica` fields and remove the declaration in examples for the following reasons: 1. A normal Session or Application cluster should have at least one JM. The current default value, zero, does not follow the common case. 2. One JM can work for k8s HA mode as well, and if users really want to launch a standby JM for faster recovery, they can declare the `replica` field in the yaml file. In examples, we just use the new default value (i.e. 1), which should be fine. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27309) Allow to load default flink configs in the k8s operator dynamically
[ https://issues.apache.org/jira/browse/FLINK-27309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525012#comment-17525012 ] Biao Geng commented on FLINK-27309: --- Thanks [~gyfora]! Besides [~aitozi]'s valuable point, I am also thinking about whether changed default configs should take effect for a running job deployment. From the k8s side, it may make sense to treat such a change as `specChanged`, so that all running jobs are suspended and restarted. However, from Flink users' side, I am not sure it is proper to restart running jobs (i.e. to let changed default configs affect running jobs). > Allow to load default flink configs in the k8s operator dynamically > --- > > Key: FLINK-27309 > URL: https://issues.apache.org/jira/browse/FLINK-27309 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Assignee: Gyula Fora >Priority: Major > > Currently, the default configs used by the k8s operator are saved in the > /opt/flink/conf dir in the k8s operator pod and are loaded only once when > the operator is created. > Since the flink k8s operator could be a long-running service and users may > want to modify the default configs (e.g. the metric reporter sampling interval) > for newly created deployments, it may be better to load the default configs > dynamically (i.e. parsing the latest /opt/flink/conf/flink-conf.yaml) in the > {{ReconcilerFactory}} and {{ObserverFactory}}, instead of redeploying the > operator. -- This message was sent by Atlassian Jira (v8.20.7#820007)
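A minimal sketch of the dynamic-reload idea discussed above: re-parse the mounted flink-conf.yaml on every reconcile loop instead of caching it once at operator start-up. The flat "key: value" parsing and all class/method names below are illustrative assumptions, not the operator's (or Flink's) actual config loader.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch: reading the default config file fresh each time it
// is asked for, so edits to the mounted file take effect without
// redeploying the operator. Real flink-conf.yaml parsing is more involved.
public class DynamicConfigLoader {
    static Map<String, String> loadDefaults(Path confFile) throws IOException {
        Map<String, String> conf = new HashMap<>();
        for (String line : Files.readAllLines(confFile)) {
            int idx = line.indexOf(':');
            // Skip comments and lines without a "key: value" separator.
            if (idx > 0 && !line.trim().startsWith("#")) {
                conf.put(line.substring(0, idx).trim(),
                         line.substring(idx + 1).trim());
            }
        }
        return conf;
    }

    public static void main(String[] args) throws IOException {
        Path conf = Files.createTempFile("flink-conf", ".yaml");
        Files.write(conf, List.of("metrics.reporter.interval: 30s"));
        // Imagine this call sitting inside each reconcile loop of the
        // ReconcilerFactory, rather than in the operator's constructor.
        System.out.println(loadDefaults(conf).get("metrics.reporter.interval")); // prints 30s
        Files.delete(conf);
    }
}
```

The trade-off raised in the comment still applies: reloading per loop is cheap, but deciding whether a changed default should restart already-running jobs is a separate policy question.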
[jira] [Commented] (FLINK-24960) YARNSessionCapacitySchedulerITCase.testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots hangs on azure
[ https://issues.apache.org/jira/browse/FLINK-24960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524860#comment-17524860 ] Biao Geng commented on FLINK-24960: --- Hi [~mapohl], I tried to do some research, but in the latest [failure|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=34743=logs=298e20ef-7951-5965-0e79-ea664ddc435e=d4c90338-c843-57b0-3232-10ae74f00347=36026] posted by Yun, the {{Extracted hostname:port: }} message was not shown, though the description of this ticket shows {{Extracted hostname:port: 5718b812c7ab:38622}} in the old CI test, which seems to be correct. I plan to first verify that {{yarnSessionClusterRunner.sendStop();}} works fine (i.e. that the session cluster is stopped normally), but I have not found a way to run only the cron test's "test_cron_jdk11 misc" test on my own [azure pipeline|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=109=results], which makes the verification pretty slow and hard. Are there any guidelines about debugging the azure pipeline with specific tests? > YARNSessionCapacitySchedulerITCase.testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots > hangs on azure > --- > > Key: FLINK-24960 > URL: https://issues.apache.org/jira/browse/FLINK-24960 > Project: Flink > Issue Type: Bug > Components: Deployment / YARN >Affects Versions: 1.15.0, 1.14.3 >Reporter: Yun Gao >Assignee: Niklas Semmler >Priority: Critical > Labels: pull-request-available, test-stability > Fix For: 1.16.0 > > > {code:java} > Nov 18 22:37:08 > > Nov 18 22:37:08 Test > testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots(org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase) > is running. 
> Nov 18 22:37:08 > > Nov 18 22:37:25 22:37:25,470 [main] INFO > org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase [] - Extracted > hostname:port: 5718b812c7ab:38622 > Nov 18 22:52:36 > == > Nov 18 22:52:36 Process produced no output for 900 seconds. > Nov 18 22:52:36 > == > {code} > https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=26722=logs=f450c1a5-64b1-5955-e215-49cb1ad5ec88=cc452273-9efa-565d-9db8-ef62a38a0c10=36395 -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27310) FlinkOperatorITCase failure due to JobManager replicas less than one
[ https://issues.apache.org/jira/browse/FLINK-27310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524322#comment-17524322 ] Biao Geng commented on FLINK-27310: --- [~usamj] oh, I tried to reproduce the error on my local machine but overlooked that the IT case is only executed in the CI pipeline :( cc [~wangyang0918] I created a [PR|https://github.com/apache/flink-kubernetes-operator/pull/173] to fix this IT case. I think we should improve the ci.yml as well. > FlinkOperatorITCase failure due to JobManager replicas less than one > > > Key: FLINK-27310 > URL: https://issues.apache.org/jira/browse/FLINK-27310 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Reporter: Usamah Jassat >Priority: Minor > Labels: pull-request-available > > The FlinkOperatorITCase test is currently failing, even in the CI pipeline > > {code:java} > INFO] Running org.apache.flink.kubernetes.operator.FlinkOperatorITCase > > > > Error: Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: > 3.178 s <<< FAILURE! - in > org.apache.flink.kubernetes.operator.FlinkOperatorITCase > > > > Error: org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test > Time elapsed: 2.664 s <<< ERROR! > > > > io.fabric8.kubernetes.client.KubernetesClientException: Failure > executing: POST at: > https://192.168.49.2:8443/apis/flink.apache.org/v1beta1/namespaces/flink-operator-test/flinkdeployments. > Message: Forbidden! User minikube doesn't have permission. admission webhook > "vflinkdeployments.flink.apache.org" denied the request: JobManager replicas > should not be configured less than one.. > > > > at > flink.kubernetes.operator@1.0-SNAPSHOT/org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test(FlinkOperatorITCase.java:86){code} > > While the test is failing, the CI run still passes; the CI should also be > fixed so that it fails on this test failure. 
[jira] [Comment Edited] (FLINK-27310) FlinkOperatorITCase failure due to JobManager replicas less than one
[ https://issues.apache.org/jira/browse/FLINK-27310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524291#comment-17524291 ] Biao Geng edited comment on FLINK-27310 at 4/19/22 1:01 PM: It seems that this ITCase is not really executed due to its naming(not follow *Test* pattern). For the error in the posted log, I believe it is because we have not added {{jm.setReplicas(1);}} to set the the replica of JM. was (Author: bgeng777): It seems that this ITCase is not really executed due to its naming(not follow *Test{*}*{*} pattern). For the error in the posted log, I believe it is because we have not added {{jm.setReplicas(1);}} to set the the replica of JM. > FlinkOperatorITCase failure due to JobManager replicas less than one > > > Key: FLINK-27310 > URL: https://issues.apache.org/jira/browse/FLINK-27310 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Reporter: Usamah Jassat >Priority: Minor > > The FlinkOperatorITCase test is currently failing, even in the CI pipeline > > {code:java} > INFO] Running org.apache.flink.kubernetes.operator.FlinkOperatorITCase > > > > Error: Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: > 3.178 s <<< FAILURE! - in > org.apache.flink.kubernetes.operator.FlinkOperatorITCase > > > > Error: org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test > Time elapsed: 2.664 s <<< ERROR! > > > > io.fabric8.kubernetes.client.KubernetesClientException: Failure > executing: POST at: > https://192.168.49.2:8443/apis/flink.apache.org/v1beta1/namespaces/flink-operator-test/flinkdeployments. > Message: Forbidden! User minikube doesn't have permission. admission webhook > "vflinkdeployments.flink.apache.org" denied the request: JobManager replicas > should not be configured less than one.. 
> > > > at > flink.kubernetes.operator@1.0-SNAPSHOT/org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test(FlinkOperatorITCase.java:86){code} > > While the test is failing the CI test run is passing which also should be > fixed then to fail on the test failure. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Comment Edited] (FLINK-27310) FlinkOperatorITCase failure due to JobManager replicas less than one
[ https://issues.apache.org/jira/browse/FLINK-27310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524291#comment-17524291 ] Biao Geng edited comment on FLINK-27310 at 4/19/22 1:01 PM: It seems that this ITCase is not really executed due to its naming(not follow **Test * pattern). For the error in the posted log, I believe it is because we have not added {{jm.setReplicas(1);}} to set the the replica of JM. was (Author: bgeng777): It seems that this ITCase is not really executed due to its naming(not follow *Test* pattern). For the error in the posted log, I believe it is because we have not added {{jm.setReplicas(1);}} to set the the replica of JM. > FlinkOperatorITCase failure due to JobManager replicas less than one > > > Key: FLINK-27310 > URL: https://issues.apache.org/jira/browse/FLINK-27310 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Reporter: Usamah Jassat >Priority: Minor > > The FlinkOperatorITCase test is currently failing, even in the CI pipeline > > {code:java} > INFO] Running org.apache.flink.kubernetes.operator.FlinkOperatorITCase > > > > Error: Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: > 3.178 s <<< FAILURE! - in > org.apache.flink.kubernetes.operator.FlinkOperatorITCase > > > > Error: org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test > Time elapsed: 2.664 s <<< ERROR! > > > > io.fabric8.kubernetes.client.KubernetesClientException: Failure > executing: POST at: > https://192.168.49.2:8443/apis/flink.apache.org/v1beta1/namespaces/flink-operator-test/flinkdeployments. > Message: Forbidden! User minikube doesn't have permission. admission webhook > "vflinkdeployments.flink.apache.org" denied the request: JobManager replicas > should not be configured less than one.. 
> > > > at > flink.kubernetes.operator@1.0-SNAPSHOT/org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test(FlinkOperatorITCase.java:86){code} > > While the test is failing the CI test run is passing which also should be > fixed then to fail on the test failure. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Comment Edited] (FLINK-27310) FlinkOperatorITCase failure due to JobManager replicas less than one
[ https://issues.apache.org/jira/browse/FLINK-27310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524291#comment-17524291 ] Biao Geng edited comment on FLINK-27310 at 4/19/22 1:00 PM: It seems that this ITCase is not really executed due to its naming(not follow *Test{*}*{*} pattern). For the error in the posted log, I believe it is because we have not added {{jm.setReplicas(1);}} to set the the replica of JM. was (Author: bgeng777): It seems that this ITCase is not really executed due to its naming(not follow *Test* pattern). For the error in the posted log, I believe it is because we have not added {{jm.setReplicas(1);}} to set the the replica of JM. > FlinkOperatorITCase failure due to JobManager replicas less than one > > > Key: FLINK-27310 > URL: https://issues.apache.org/jira/browse/FLINK-27310 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Reporter: Usamah Jassat >Priority: Minor > > The FlinkOperatorITCase test is currently failing, even in the CI pipeline > > {code:java} > INFO] Running org.apache.flink.kubernetes.operator.FlinkOperatorITCase > > > > Error: Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: > 3.178 s <<< FAILURE! - in > org.apache.flink.kubernetes.operator.FlinkOperatorITCase > > > > Error: org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test > Time elapsed: 2.664 s <<< ERROR! > > > > io.fabric8.kubernetes.client.KubernetesClientException: Failure > executing: POST at: > https://192.168.49.2:8443/apis/flink.apache.org/v1beta1/namespaces/flink-operator-test/flinkdeployments. > Message: Forbidden! User minikube doesn't have permission. admission webhook > "vflinkdeployments.flink.apache.org" denied the request: JobManager replicas > should not be configured less than one.. 
> > > > at > flink.kubernetes.operator@1.0-SNAPSHOT/org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test(FlinkOperatorITCase.java:86){code} > > While the test is failing the CI test run is passing which also should be > fixed then to fail on the test failure. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Comment Edited] (FLINK-27310) FlinkOperatorITCase failure due to JobManager replicas less than one
[ https://issues.apache.org/jira/browse/FLINK-27310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524291#comment-17524291 ] Biao Geng edited comment on FLINK-27310 at 4/19/22 1:00 PM: It seems that this ITCase is not really executed due to its naming(not follow *Test* pattern). For the error in the posted log, I believe it is because we have not added {{jm.setReplicas(1);}} to set the the replica of JM. was (Author: bgeng777): It seems that this ITCase is not really executed due to its naming(not follow *Test* pattern). For the error in the posted log, I believe it is because we have not added {{jm.setReplicas(1);}} to set the the replica of JM. > FlinkOperatorITCase failure due to JobManager replicas less than one > > > Key: FLINK-27310 > URL: https://issues.apache.org/jira/browse/FLINK-27310 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Reporter: Usamah Jassat >Priority: Minor > > The FlinkOperatorITCase test is currently failing, even in the CI pipeline > > {code:java} > INFO] Running org.apache.flink.kubernetes.operator.FlinkOperatorITCase > > > > Error: Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: > 3.178 s <<< FAILURE! - in > org.apache.flink.kubernetes.operator.FlinkOperatorITCase > > > > Error: org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test > Time elapsed: 2.664 s <<< ERROR! > > > > io.fabric8.kubernetes.client.KubernetesClientException: Failure > executing: POST at: > https://192.168.49.2:8443/apis/flink.apache.org/v1beta1/namespaces/flink-operator-test/flinkdeployments. > Message: Forbidden! User minikube doesn't have permission. admission webhook > "vflinkdeployments.flink.apache.org" denied the request: JobManager replicas > should not be configured less than one.. 
> > > > at > flink.kubernetes.operator@1.0-SNAPSHOT/org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test(FlinkOperatorITCase.java:86){code} > > While the test is failing the CI test run is passing which also should be > fixed then to fail on the test failure. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Commented] (FLINK-27310) FlinkOperatorITCase failure due to JobManager replicas less than one
[ https://issues.apache.org/jira/browse/FLINK-27310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524291#comment-17524291 ] Biao Geng commented on FLINK-27310: --- It seems that this ITCase is not really executed due to its naming (it does not follow the *Test* pattern). As for the error in the posted log, I believe it is because we have not added {{jm.setReplicas(1);}} to set the replica of the JM. > FlinkOperatorITCase failure due to JobManager replicas less than one > > > Key: FLINK-27310 > URL: https://issues.apache.org/jira/browse/FLINK-27310 > Project: Flink > Issue Type: Bug > Components: Kubernetes Operator >Reporter: Usamah Jassat >Priority: Minor > > The FlinkOperatorITCase test is currently failing, even in the CI pipeline > > {code:java} > INFO] Running org.apache.flink.kubernetes.operator.FlinkOperatorITCase > > > > Error: Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: > 3.178 s <<< FAILURE! - in > org.apache.flink.kubernetes.operator.FlinkOperatorITCase > > > > Error: org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test > Time elapsed: 2.664 s <<< ERROR! > > > > io.fabric8.kubernetes.client.KubernetesClientException: Failure > executing: POST at: > https://192.168.49.2:8443/apis/flink.apache.org/v1beta1/namespaces/flink-operator-test/flinkdeployments. > Message: Forbidden! User minikube doesn't have permission. admission webhook > "vflinkdeployments.flink.apache.org" denied the request: JobManager replicas > should not be configured less than one.. > > > > at > flink.kubernetes.operator@1.0-SNAPSHOT/org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test(FlinkOperatorITCase.java:86){code} > > While the test is failing, the CI run still passes; the CI should also be > fixed so that it fails on this test failure. -- This message was sent by Atlassian Jira (v8.20.7#820007)
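The failure mode described in this comment can be illustrated with a small sketch: a hypothetical stand-in for the CRD's JobManagerSpec plus a check mirroring the admission webhook rule quoted in the error message. The class and method names are illustrative, not the operator's real API.

```java
// Illustrative sketch of why the IT case fails without jm.setReplicas(1).
public class ReplicaValidationSketch {
    // Simplified stand-in for the operator's JobManagerSpec CRD class.
    static class JobManagerSpec {
        int replicas; // Java defaults this to 0 unless explicitly set
        void setReplicas(int r) { replicas = r; }
    }

    // Mirrors the admission webhook rule from the quoted error message.
    static void validate(JobManagerSpec jm) {
        if (jm.replicas < 1) {
            throw new IllegalArgumentException(
                "JobManager replicas should not be configured less than one.");
        }
    }

    public static void main(String[] args) {
        JobManagerSpec jm = new JobManagerSpec();
        jm.setReplicas(1); // the missing line in the IT case
        validate(jm);      // passes; without setReplicas(1) it would throw
        System.out.println("validation passed");
    }
}
```

Leaving out the `setReplicas(1)` call reproduces the "replicas should not be configured less than one" rejection seen in the CI log.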
[jira] [Commented] (FLINK-27303) Flink Operator will create a large amount of temp log config files
[ https://issues.apache.org/jira/browse/FLINK-27303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524218#comment-17524218 ] Biao Geng commented on FLINK-27303: --- Yeah, it seems that most generated configs are for constructing the ClusterClient to interact with the cluster. Configs like templates, log configs or image configs that are only needed for submission will not be used by the operator's reconciliation, so we can build them only in the submission methods. Besides, it may also make sense to clean up those temporary files for logs and templates in {{FlinkDeploymentController#cleanup()}}. > Flink Operator will create a large amount of temp log config files > -- > > Key: FLINK-27303 > URL: https://issues.apache.org/jira/browse/FLINK-27303 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.0.0 >Reporter: Gyula Fora >Priority: Critical > Fix For: kubernetes-operator-1.0.0 > > > Now we use the configbuilder in multiple different places to generate the > effective config, including the observer, reconciler, validator etc. > The effective config generation logic also creates temporary log config > files (if spec logConfiguration is set), which leads to 3-4 files > generated in every reconcile loop for a given job. These files are not > cleaned up until the operator restarts, leading to a large amount of files. > I believe we should change the config generation logic and only apply the > log config generation logic right before flink cluster submission, as that is > the only thing affected by it. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (FLINK-27309) Allow to load default flink configs in the k8s operator dynamically
Biao Geng created FLINK-27309: - Summary: Allow to load default flink configs in the k8s operator dynamically Key: FLINK-27309 URL: https://issues.apache.org/jira/browse/FLINK-27309 Project: Flink Issue Type: Improvement Components: Kubernetes Operator Reporter: Biao Geng Currently, the default configs used by the k8s operator are saved in the /opt/flink/conf dir in the k8s operator pod and are loaded only once when the operator is created. Since the flink k8s operator could be a long-running service and users may want to modify the default configs (e.g. the metric reporter sampling interval) for newly created deployments, it may be better to load the default configs dynamically (i.e. parsing the latest /opt/flink/conf/flink-conf.yaml) in the {{ReconcilerFactory}} and {{ObserverFactory}}, instead of redeploying the operator. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-27303) Flink Operator will create a large amount of temp log config files
[ https://issues.apache.org/jira/browse/FLINK-27303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524189#comment-17524189 ] Biao Geng commented on FLINK-27303: --- Hi [~gyfora], big +1 for this jira. Besides log config files, there will also be a lot of pod template files if we define one in the yaml, because {{applyCommonPodTemplate}} creates temporary files for the pod template as well. I think your suggestion makes sense. One possible solution is to divide the generated configs into 2 types: type 1 is configs needed by the observer and reconciler; type 2 is configs that are not used by the observer or reconciler. Besides, I am wondering if it is a good idea to maintain a concurrent hashmap to save the effective config of each deployment so that we can reduce some repeated creation of effective configs. It may become a memory-usage bottleneck when there are plenty of deployments, and it makes the operator less `stateless`. Not sure if it is worthwhile. > Flink Operator will create a large amount of temp log config files > -- > > Key: FLINK-27303 > URL: https://issues.apache.org/jira/browse/FLINK-27303 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Affects Versions: kubernetes-operator-1.0.0 >Reporter: Gyula Fora >Priority: Critical > Fix For: kubernetes-operator-1.0.0 > > > Now we use the configbuilder in multiple different places to generate the > effective config, including the observer, reconciler, validator etc. > The effective config generation logic also creates temporary log config > files (if spec logConfiguration is set), which leads to 3-4 files > generated in every reconcile loop for a given job. These files are not > cleaned up until the operator restarts, leading to a large amount of files. > I believe we should change the config generation logic and only apply the > log config generation logic right before flink cluster submission, as that is > the only thing affected by it. 
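The cleanup idea discussed in the two comments above can be sketched as follows: track every temporary log-config (or pod-template) file created for a deployment and delete the batch in a cleanup hook, instead of leaking files until the operator restarts. Class and method names here are illustrative stand-ins, not the operator's real API; the real hook would live in something like `FlinkDeploymentController#cleanup()`.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch: remember each generated temp config file so it can
// be removed when the owning deployment is cleaned up.
public class TempConfigTracker {
    private final List<File> generatedFiles = new ArrayList<>();

    public File createTempLogConfig(String content) throws IOException {
        File f = Files.createTempFile("log-conf", ".properties").toFile();
        Files.write(f.toPath(), content.getBytes());
        generatedFiles.add(f); // track it for later cleanup
        return f;
    }

    // Analogous to a per-deployment cleanup() hook in the controller.
    public void cleanup() {
        for (File f : generatedFiles) {
            f.delete();
        }
        generatedFiles.clear();
    }

    public static void main(String[] args) throws IOException {
        TempConfigTracker tracker = new TempConfigTracker();
        File f = tracker.createTempLogConfig("rootLogger.level = INFO");
        System.out.println(f.exists()); // prints true while in use
        tracker.cleanup();
        System.out.println(f.exists()); // prints false after cleanup
    }
}
```

Generating these files only right before submission (as the issue proposes) shrinks how many exist at once; the tracker above additionally bounds their lifetime.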
[jira] [Closed] (FLINK-27109) Improve the creation of ClusterRole in Flink K8s operator
[ https://issues.apache.org/jira/browse/FLINK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng closed FLINK-27109. - Resolution: Won't Fix > Improve the creation of ClusterRole in Flink K8s operator > - > > Key: FLINK-27109 > URL: https://issues.apache.org/jira/browse/FLINK-27109 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Not a Priority > > As the k8s > [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example] > says, a ClusterRole is a non-namespaced resource. > In our helm chart, we currently define the ClusterRole with the name 'flink-operator', > and the namespace field in its metadata is omitted. As a result, if a user > wants to install multiple flink-kubernetes-operators in different namespaces, > the ClusterRole 'flink-operator' will be created multiple times. > Errors like > {quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that > already exists. Unable to continue with install: ClusterRole "flink-operator" > in namespace "" exists and cannot be imported into the current release: > invalid ownership metadata; annotation validation error: key > "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current > value is "default" > {quote} > will be thrown. > Solution 1 could be adding the namespace as a postfix to the name of the > ClusterRole. > Solution 2 is to add an if-else check like > [this|https://stackoverflow.com/questions/65110332/clusterrole-exists-and-cannot-be-imported-into-the-current-release] > to avoid creating an existing resource. One important drawback of solution 2 is > that when uninstalling one helm release, the created ClusterRole will be > removed as well. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (FLINK-27109) Improve the creation of ClusterRole in Flink K8s operator
[ https://issues.apache.org/jira/browse/FLINK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Biao Geng updated FLINK-27109: -- Priority: Not a Priority (was: Major) > Improve the creation of ClusterRole in Flink K8s operator > - > > Key: FLINK-27109 > URL: https://issues.apache.org/jira/browse/FLINK-27109 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Not a Priority > > As the k8s > [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example] > says, a ClusterRole is a non-namespaced resource. > In our helm chart, we currently define the ClusterRole with the name 'flink-operator', > and the namespace field in its metadata is omitted. As a result, if a user > wants to install multiple flink-kubernetes-operators in different namespaces, > the ClusterRole 'flink-operator' will be created multiple times. > Errors like > {quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that > already exists. Unable to continue with install: ClusterRole "flink-operator" > in namespace "" exists and cannot be imported into the current release: > invalid ownership metadata; annotation validation error: key > "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current > value is "default" > {quote} > will be thrown. > Solution 1 could be adding the namespace as a postfix to the name of the > ClusterRole. > Solution 2 is to add an if-else check like > [this|https://stackoverflow.com/questions/65110332/clusterrole-exists-and-cannot-be-imported-into-the-current-release] > to avoid creating an existing resource. One important drawback of solution 2 is > that when uninstalling one helm release, the created ClusterRole will be > removed as well. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-27109) Improve the creation of ClusterRole in Flink K8s operator
[ https://issues.apache.org/jira/browse/FLINK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17518699#comment-17518699 ] Biao Geng commented on FLINK-27109: --- [~wangyang0918] I see your point. It was my bad to overlook the usage of the `watchNamespaces` field; I will utilize it. > Improve the creation of ClusterRole in Flink K8s operator > - > > Key: FLINK-27109 > URL: https://issues.apache.org/jira/browse/FLINK-27109 > Project: Flink > Issue Type: Improvement > Components: Kubernetes Operator >Reporter: Biao Geng >Priority: Major > > As the k8s > [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example] > says, a ClusterRole is a non-namespaced resource. > In our helm chart, we currently define the ClusterRole with the name 'flink-operator', > and the namespace field in its metadata is omitted. As a result, if a user > wants to install multiple flink-kubernetes-operators in different namespaces, > the ClusterRole 'flink-operator' will be created multiple times. > Errors like > {quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that > already exists. Unable to continue with install: ClusterRole "flink-operator" > in namespace "" exists and cannot be imported into the current release: > invalid ownership metadata; annotation validation error: key > "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current > value is "default" > {quote} > will be thrown. > Solution 1 could be adding the namespace as a postfix to the name of the > ClusterRole. > Solution 2 is to add an if-else check like > [this|https://stackoverflow.com/questions/65110332/clusterrole-exists-and-cannot-be-imported-into-the-current-release] > to avoid creating an existing resource. One important drawback of solution 2 is > that when uninstalling one helm release, the created ClusterRole will be > removed as well. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (FLINK-27109) Improve the creation of ClusterRole in Flink K8s operator
[ https://issues.apache.org/jira/browse/FLINK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17518673#comment-17518673 ]

Biao Geng commented on FLINK-27109:
-----------------------------------

cc [~morhidi] [~mbalassi] Do you have any suggestions?
[jira] [Updated] (FLINK-27109) Improve the creation of ClusterRole in Flink K8s operator
[ https://issues.apache.org/jira/browse/FLINK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Biao Geng updated FLINK-27109:
------------------------------
    Description: Relabeled the two proposed fixes as "Solution 1" (add the namespace as a suffix to the ClusterRole name) and "Solution 2" (add an if/else existence check), and noted a drawback of Solution 2: when one Helm release is uninstalled, the created ClusterRole is removed as well. The description is otherwise unchanged from the previous revision.
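The if/else existence check (Solution 2) can be approximated with Helm 3's `lookup` template function, which queries the live cluster at install time. A hedged sketch under the same hypothetical template layout; note that `lookup` returns an empty result during `helm template` and `--dry-run`, so the resource would always render in those modes:

```yaml
# Hypothetical sketch: only render the ClusterRole when it does not exist yet.
# lookup takes (apiVersion, kind, namespace, name); the namespace argument is
# "" for cluster-scoped resources such as ClusterRole.
{{- if not (lookup "rbac.authorization.k8s.io/v1" "ClusterRole" "" "flink-operator") }}
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: flink-operator
rules:
  # Illustrative rules only; the real chart grants a different set.
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list", "watch"]
{{- end }}
```

This illustrates the drawback called out in the description: the one release that did create the ClusterRole still owns it, so uninstalling that release deletes the ClusterRole out from under any other releases relying on it.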
[jira] [Updated] (FLINK-27109) Improve the creation of ClusterRole in Flink K8s operator
[ https://issues.apache.org/jira/browse/FLINK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Biao Geng updated FLINK-27109:
------------------------------
    Description: Added a link to [this Stack Overflow answer|https://stackoverflow.com/questions/65110332/clusterrole-exists-and-cannot-be-imported-into-the-current-release] illustrating the if/else existence check. The description is otherwise unchanged from the previous revision.
[jira] [Updated] (FLINK-27109) Improve the creation of ClusterRole in Flink K8s operator
[ https://issues.apache.org/jira/browse/FLINK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Biao Geng updated FLINK-27109:
------------------------------
    Summary: Improve the creation of ClusterRole in Flink K8s operator  (was: The naming pattern of ClusterRole in Flink K8s operator should consider namespace)
[jira] [Created] (FLINK-27109) The naming pattern of ClusterRole in Flink K8s operator should consider namespace
Biao Geng created FLINK-27109:
---------------------------------

             Summary: The naming pattern of ClusterRole in Flink K8s operator should consider namespace
                 Key: FLINK-27109
                 URL: https://issues.apache.org/jira/browse/FLINK-27109
             Project: Flink
          Issue Type: Improvement
          Components: Kubernetes Operator
            Reporter: Biao Geng

As the [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example] of Kubernetes says, a ClusterRole is a non-namespaced resource. In our Helm chart, we currently define the ClusterRole with the name 'flink-operator', and the namespace field in its metadata is omitted. As a result, if a user wants to install multiple flink-kubernetes-operator releases in different namespaces, the ClusterRole 'flink-operator' will be created multiple times.

Errors like
{quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that already exists. Unable to continue with install: ClusterRole "flink-operator" in namespace "" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current value is "default"
{quote}
will be thrown.

One solution could be to add the namespace as a suffix to the name of the ClusterRole. Another possible solution is to add an if/else check to avoid creating an already existing resource.
[jira] [Commented] (FLINK-26611) Document operator config options
[ https://issues.apache.org/jira/browse/FLINK-26611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516254#comment-17516254 ]

Biao Geng commented on FLINK-26611:
-----------------------------------

[~gyfora] I see this Jira is open and planned for the 0.1.0 version. If no one has started on it, I can take it.

> Document operator config options
> --------------------------------
>
>                 Key: FLINK-26611
>                 URL: https://issues.apache.org/jira/browse/FLINK-26611
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Kubernetes Operator
>            Reporter: Gyula Fora
>            Priority: Major
>             Fix For: kubernetes-operator-0.1.0
>
> Similar to how other Flink configs are documented, we should also document the configs found in OperatorConfigOptions.
> Generating the documentation might be possible, but maybe it is easier to do it manually.
[jira] [Created] (FLINK-27009) Support SQL job submission in flink kubernetes operator
Biao Geng created FLINK-27009:
---------------------------------

             Summary: Support SQL job submission in flink kubernetes operator
                 Key: FLINK-27009
                 URL: https://issues.apache.org/jira/browse/FLINK-27009
             Project: Flink
          Issue Type: New Feature
          Components: Kubernetes Operator
            Reporter: Biao Geng

Currently, the flink-kubernetes-operator targets jar jobs running on an application or session cluster. For SQL jobs, there is no out-of-the-box solution in the operator. One simple, short-term solution is to wrap the SQL script into a jar job using the Table API, with some limitations. The long-term solution may work with [FLINK-26541|https://issues.apache.org/jira/browse/FLINK-26541] to achieve full support.
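The short-term workaround described above can be sketched as a FlinkDeployment CR that submits a user-built wrapper jar. In this sketch, `sql-runner.jar`, its bundled `job.sql` path, and the image tag are all hypothetical names, and the exact CR fields may differ across operator versions:

```yaml
# Hypothetical sketch: submit a Table API wrapper jar that executes a SQL script.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: sql-job-example
spec:
  # Assumed image: a Flink image with the wrapper jar and script baked in.
  image: flink:1.14
  serviceAccount: flink
  jobManager:
    resource:
      memory: "1024m"
      cpu: 1
  taskManager:
    resource:
      memory: "1024m"
      cpu: 1
  job:
    # sql-runner.jar is a user-built jar whose main() reads the script passed
    # in args and runs each statement via TableEnvironment.executeSql().
    jarURI: local:///opt/flink/usrlib/sql-runner.jar
    args: ["/opt/flink/usrlib/job.sql"]
    parallelism: 2
```

The limitation noted in the description shows up here: the SQL must be packaged into (or mounted next to) the jar job, rather than being submitted as first-class SQL to the operator.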
[jira] [Commented] (FLINK-26892) Observe current status before validating CR changes
[ https://issues.apache.org/jira/browse/FLINK-26892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513971#comment-17513971 ]

Biao Geng commented on FLINK-26892:
-----------------------------------

Let me summarize what we should do for this case:
1. Save the latest validated config ({{lastReconciledSpec}} should already be sufficient.)
2. Adjust the current logic to: observe with the latest validated config -> validate the new config -> reconcile with the new config.

Is there any drawback to the above idea? If it is fine, I can take this ticket and work on it.

> Observe current status before validating CR changes
> ---------------------------------------------------
>
>                 Key: FLINK-26892
>                 URL: https://issues.apache.org/jira/browse/FLINK-26892
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>            Reporter: Gyula Fora
>            Priority: Major
>             Fix For: kubernetes-operator-1.0.0
>
> Currently, validation is the first step in the controller loop, which means that when there is a validation error we fail to observe the status of currently running deployments.
> We should change the order of operations and observe before validating. Furthermore, observe should use the previous configuration, not the one after the CR change.