[jira] [Commented] (FLINK-35376) When flink submits the job by calling the rest api, the dependent jar package generated to the tmp is not removed

2024-05-17 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17847168#comment-17847168
 ] 

Biao Geng commented on FLINK-35376:
---

hi [~18380428...@163.com], in the implementation of 
`extractContainedLibraries()`, `tempFile.deleteOnExit();` is called, so these 
temp files should be cleared when the job exits. (See 
https://github.com/apache/flink/blob/f1ecb9e4701d612050da54589a8f561857debf34/flink-clients/src/main/java/org/apache/flink/client/program/PackagedProgram.java#L585
 for details.)
Have you encountered any residual jars after the job exits?

> When flink submits the job by calling the rest api, the dependent jar package 
> generated to the tmp is not removed
> -
>
> Key: FLINK-35376
> URL: https://issues.apache.org/jira/browse/FLINK-35376
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / REST
>Affects Versions: 1.14.4
>Reporter: 中国无锡周良
>Priority: Major
>
> When org.apache.flink.runtime.webmonitor.handlers.JarRunHandler#handleRequest 
> receives a job submission request,
> {code:java}
> final JarHandlerContext context = JarHandlerContext.fromRequest(request, 
> jarDir, log);
> context.applyToConfiguration(effectiveConfiguration); {code}
> Here the toPackagedProgram(configuration) method already extracts the 
> dependency jars into the tmp directory.
> Then,
> final PackagedProgram program = 
> context.toPackagedProgram(effectiveConfiguration);
> extracts a second set of dependency jars, and inside the 
> org.apache.flink.client.program.PackagedProgram#PackagedProgram constructor
> {code:java}
> this.extractedTempLibraries =
>                 this.jarFile == null
>                         ? Collections.emptyList()
>                         : extractContainedLibraries(this.jarFile); {code}
> the extractedTempLibraries from the first extraction are overwritten by the 
> second.
>  
> As a result, the dependency jars extracted the first time cannot be deleted 
> on close. If job status monitoring automatically resubmits the job through 
> the REST API, jar files keep accumulating in tmp.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-35288) Flink Restart Strategy does not work as documented

2024-05-06 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35288?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17844106#comment-17844106
 ] 

Biao Geng commented on FLINK-35288:
---

https://cwiki.apache.org/confluence/display/FLINK/FLIP-364%3A+Improve+the+exponential-delay+restart-strategy#FLIP364:Improvetheexponentialdelayrestartstrategy-1.2Differentsemanticsofrestartattemptscauseregionfailovernotasexpected
In the above FLIP, there is some relevant discussion of the 
'restart-strategy.fixed-delay.attempts' problem. When 'region-failover' (the 
default value of *jobmanager.execution.failover-strategy*) is enabled, the 
org.apache.flink.runtime.executiongraph.failover.ExecutionFailureHandler#handleFailure
 method is called once a subtask in a region fails, which consumes the 
job-level 'restart-strategy.fixed-delay.attempts'. As a result, the restart 
strategy may not work as the documentation described.
We have also encountered such a case in our production environment.
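For readers hitting this, the configuration combination under discussion looks like the following flink-conf.yaml fragment (option names per the Flink docs; the concrete values are just examples):

```yaml
# Job-level restart budget: with region failover enabled, each handled task
# failure in any region consumes one attempt, so the budget can be exhausted
# much faster than "number of full job restarts" would suggest.
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s
# Default failover strategy: restart only the failed region, not the whole job.
jobmanager.execution.failover-strategy: region
```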

> Flink Restart Strategy does not work as documented
> --
>
> Key: FLINK-35288
> URL: https://issues.apache.org/jira/browse/FLINK-35288
> Project: Flink
>  Issue Type: Bug
>Reporter: Keshav Kansal
>Priority: Minor
>
> As per the documentation for the Fixed Delay Restart Strategy, 
> *restart-strategy.fixed-delay.attempts* defines "The number of times that 
> Flink retries the execution before the job is declared as failed if 
> restart-strategy has been set to fixed-delay".
> However, in reality it is the *maximum number of total task failures*, i.e. 
> it is possible that the job does not even attempt to restart.
> This is as per documented in 
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-1%3A+Fine+Grained+Recovery+from+Task+Failures
> If there is an outage at a Sink level, for example Elasticsearch outage, all 
> the independent tasks might fail and the job will immediately fail without 
> restart (if restart-strategy.fixed-delay.attempts is set lower or equal to 
> the parallelism of the sink)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-35273) PyFlink's LocalZonedTimestampType should respect timezone set by set_local_timezone

2024-04-30 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng updated FLINK-35273:
--
Description: 
The issue is from 
https://apache-flink.slack.com/archives/C065944F9M2/p1714134880878399
When using TIMESTAMP_LTZ in PyFlink with a non-default local time zone, the 
output does not show the expected result.
Here is my test code:
{code:python}
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.common import Types, Configuration
from pyflink.table import DataTypes, StreamTableEnvironment
from datetime import datetime
import pytz


config = Configuration()
config.set_string("python.client.executable", 
"/usr/local/Caskroom/miniconda/base/envs/myenv/bin/python")
config.set_string("python.executable", 
"/usr/local/Caskroom/miniconda/base/envs/myenv/bin/python")
env = StreamExecutionEnvironment.get_execution_environment(config)
t_env = StreamTableEnvironment.create(env)
t_env.get_config().set_local_timezone("UTC")
# t_env.get_config().set_local_timezone("GMT-08:00")

input_table = t_env.from_elements(
[
(
"elementA",
datetime(year=2024, month=4, day=12, hour=8, minute=35),
),
(
"elementB",
datetime(year=2024, month=4, day=12, hour=8, minute=35, 
tzinfo=pytz.utc),
# datetime(year=2024, month=4, day=12, hour=8, minute=35,
#          tzinfo=pytz.timezone('America/New_York')),
),
],
DataTypes.ROW(
[
DataTypes.FIELD("name", DataTypes.STRING()),
DataTypes.FIELD("timestamp", DataTypes.TIMESTAMP_LTZ(3)),
]
),
)
input_table.execute().print()

# SQL
sql_result = t_env.execute_sql(
    "CREATE VIEW MyView1 AS SELECT TO_TIMESTAMP_LTZ(171291090, 3);")
t_env.execute_sql(
    "CREATE TABLE Sink (`t` TIMESTAMP_LTZ) WITH ('connector'='print');")
t_env.execute_sql("INSERT INTO Sink SELECT * FROM MyView1;")
{code}
The output is: 
{code:java}
+++-+
| op |   name |   timestamp |
+++-+
| +I |   elementA | 2024-04-12 08:35:00.000 |
| +I |   elementB | 2024-04-12 16:35:00.000 |
+++-+
2 rows in set
+I[2024-04-12T08:35:00Z]
{code}

In pyflink/table/types.py, the `LocalZonedTimestampType` class uses the 
following logic to convert a Python object to the SQL type:

{code:python}
EPOCH_ORDINAL = calendar.timegm(time.localtime(0)) * 10 ** 6
...
def to_sql_type(self, dt):
    if dt is not None:
        seconds = (calendar.timegm(dt.utctimetuple()) if dt.tzinfo
                   else time.mktime(dt.timetuple()))
        return int(seconds) * 10 ** 6 + dt.microsecond + self.EPOCH_ORDINAL
{code}
This shows that EPOCH_ORDINAL is computed once, when the Python VM starts, from 
the process-local timezone; it is not affected by the timezone set via 
`set_local_timezone`.
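A minimal standalone reproduction of that statement (Unix-only because of `time.tzset()`; this is not PyFlink code, just the same two stdlib calls the type uses):

```python
import calendar
import os
import time

def epoch_ordinal():
    # Same computation as pyflink/table/types.py: the process-local UTC offset
    # at the epoch, in microseconds. It only sees the process timezone at call
    # time -- nothing set via set_local_timezone() ever feeds into it.
    return calendar.timegm(time.localtime(0)) * 10 ** 6

os.environ["TZ"] = "UTC"
time.tzset()  # Unix only
utc_ordinal = epoch_ordinal()   # 0: local time equals UTC

os.environ["TZ"] = "Etc/GMT+8"  # POSIX sign convention: this zone is UTC-8
time.tzset()
gmt8_ordinal = epoch_ordinal()  # -28800 * 10**6: 8 hours behind UTC

print(utc_ordinal, gmt8_ordinal)
```

Since the real EPOCH_ORDINAL is a class attribute evaluated at import time, it is frozen to whatever timezone the Python VM started with.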


> PyFlink's LocalZonedTimestampType should respect timezone set by 
> set_local_timezone
> ---
>
> Key: FLINK-35273
> URL: https://issues.apache.org/jira/browse/FLINK-35273
> Project: Flink
>  Issue Type: Bug
>  Components: API / Python
>Reporter: Biao Geng
>Priority: Major
>
> The issue is from 
> https://apache-flink.slack.com/archives/C065944F9M2/p1714134880878399
> When using TIMESTAMP_LTZ in PyFlink with a non-default local time zone, the 
> output does not show the expected result.
> Here is my test code:
> {code:python}
> from pyflink.datastream import StreamExecutionEnvironment
> from pyflink.common import Types, Configuration
> from pyflink.table import DataTypes, StreamTableEnvironment
> from datetime import datetime
> import pytz
> config = Configuration()
> config.set_string("python.client.executable", 
> "/usr/local/Caskroom/miniconda/base/envs/myenv/bin/python")
> config.set_string("python.executable", 
> "/usr/local/Caskroom/miniconda/base/envs/myenv/bin/python")
> env = StreamExecutionEnvironment.get_execution_environment(config)
> t_env = StreamTableEnvironment.create(env)
> t_env.get_config().set_local_timezone("UTC")
> # t_env.get_config().set_local_timezone("GMT-08:00")
> input_table = t_env.from_elements(
> [
> (
> "elementA",
> datetime(year=2024, month=4, day=12, hour=8, minute=35),
> ),
> (
> "elementB",
> datetime(year=2024, month=4, day=12, hour=8, minute=35, 
> tzinfo=pytz.utc),
> # datetime(year=2024, month=4, day=12, hour=8, minute=35,
> #          tzinfo=pytz.timezone('America/New_York')),
> ),
> ],
> DataTypes.ROW(
> [
> DataTypes.FIELD("name", 

[jira] [Updated] (FLINK-35273) PyFlink's LocalZonedTimestampType should respect timezone set by set_local_timezone

2024-04-30 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng updated FLINK-35273:
--
Component/s: API / Python

> PyFlink's LocalZonedTimestampType should respect timezone set by 
> set_local_timezone
> ---
>
> Key: FLINK-35273
> URL: https://issues.apache.org/jira/browse/FLINK-35273
> Project: Flink
>  Issue Type: Bug
>  Components: API / Python
>Reporter: Biao Geng
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-35273) PyFlink's LocalZonedTimestampType should respect timezone set by set_local_timezone

2024-04-30 Thread Biao Geng (Jira)
Biao Geng created FLINK-35273:
-

 Summary: PyFlink's LocalZonedTimestampType should respect timezone 
set by set_local_timezone
 Key: FLINK-35273
 URL: https://issues.apache.org/jira/browse/FLINK-35273
 Project: Flink
  Issue Type: Bug
Reporter: Biao Geng






--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-35192) operator oom

2024-04-30 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842322#comment-17842322
 ] 

Biao Geng commented on FLINK-35192:
---

[~stupid_pig], it looks like the pmap pictures are broken. But the glibc malloc 
issue is well known; if that is indeed the cause, it makes sense to me to 
introduce jemalloc.

> operator oom
> 
>
> Key: FLINK-35192
> URL: https://issues.apache.org/jira/browse/FLINK-35192
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.6.1
> Environment: jdk: openjdk11
> operator version: 1.6.1
>Reporter: chenyuzhi
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-04-22-15-47-49-455.png, 
> image-2024-04-22-15-52-51-600.png, image-2024-04-22-15-58-23-269.png, 
> image-2024-04-22-15-58-42-850.png, image-2024-04-30-16-47-07-289.png, 
> image-2024-04-30-17-11-24-974.png, screenshot-1.png, screenshot-2.png, 
> screenshot-3.png
>
>
> The Kubernetes operator docker process was killed by the kernel due to 
> out-of-memory (at 2024.04.03 18:16)
>  !image-2024-04-22-15-47-49-455.png! 
> Metrics:
> the pod memory (RSS) has been increasing slowly over the past 7 days:
>  !screenshot-1.png! 
> However, the JVM memory metrics of the operator do not show an obvious 
> anomaly:
>  !image-2024-04-22-15-58-23-269.png! 
>  !image-2024-04-22-15-58-42-850.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-35192) operator oom

2024-04-30 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842294#comment-17842294
 ] 

Biao Geng commented on FLINK-35192:
---

Sure, created a 
[pr|https://github.com/apache/flink-kubernetes-operator/pull/822] for this.

> operator oom
> 
>
> Key: FLINK-35192
> URL: https://issues.apache.org/jira/browse/FLINK-35192
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.6.1
> Environment: jdk: openjdk11
> operator version: 1.6.1
>Reporter: chenyuzhi
>Priority: Major
>  Labels: pull-request-available
> Attachments: image-2024-04-22-15-47-49-455.png, 
> image-2024-04-22-15-52-51-600.png, image-2024-04-22-15-58-23-269.png, 
> image-2024-04-22-15-58-42-850.png, screenshot-1.png, screenshot-2.png, 
> screenshot-3.png
>
>
> The Kubernetes operator docker process was killed by the kernel due to 
> out-of-memory (at 2024.04.03 18:16)
>  !image-2024-04-22-15-47-49-455.png! 
> Metrics:
> the pod memory (RSS) has been increasing slowly over the past 7 days:
>  !screenshot-1.png! 
> However, the JVM memory metrics of the operator do not show an obvious 
> anomaly:
>  !image-2024-04-22-15-58-23-269.png! 
>  !image-2024-04-22-15-58-42-850.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-35212) PyFlink thread mode process just can run once in standalonesession mode

2024-04-29 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841934#comment-17841934
 ] 

Biao Geng edited comment on FLINK-35212 at 4/29/24 10:22 AM:
-

Hi [~vonesec], thanks for creating the detailed bug report! 
I created a brand new env on my local machine and followed the instructions, 
but I cannot reproduce the exception you have met.
I took a look at the [pemja 
code|https://github.com/alibaba/pemja/blob/release-0.4-1-rc1/src/main/java/pemja/core/PythonInterpreter.java#L336]; 
as we can see, the MainInterpreter is a singleton, which means 
System.load('pemja_core.xxx.so') should happen only once in the JVM. That 
should rule out the exception in this jira, whose cause is loading the same .so 
twice.
So it is somewhat strange that you met such an exception. Have you ever changed 
anything under your python site-packages of pyflink?
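For intuition, the "PyIterator cannot be cast to PyIterator" symptom in the stack trace below is the classic sign of the same class being loaded by two different classloaders. A rough pure-Python analogy (module and file names here are hypothetical, not pemja code):

```python
import importlib.util
import os
import tempfile

# One class definition on disk, loaded twice under different module names --
# each load plays the role of a separate Java classloader.
SRC = "class PyIterator:\n    pass\n"
path = os.path.join(tempfile.mkdtemp(), "mod.py")
with open(path, "w") as f:
    f.write(SRC)

def load(name):
    spec = importlib.util.spec_from_file_location(name, path)
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod

a, b = load("mod_a"), load("mod_b")
obj = a.PyIterator()
# Same source, two loads -> two distinct class objects that do not match,
# just like two classloaders each holding their own pemja.core.object.PyIterator.
print(isinstance(obj, a.PyIterator), isinstance(obj, b.PyIterator))  # True False
```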


But when running your code, I do meet another exception when trying to `flink 
run` a second time:

{quote}2024-04-29 17:41:03,054 WARN  org.apache.flink.runtime.taskmanager.Task  
  [] - Source: *anonymous_python-input-format$1*[1] -> Calc[2] 
-> ConstraintEnforcer[3] -> TableToDataSteam -> Map -> Sink: Print to Std. Out 
(1/1)#0 (4b45f9997b18155682ed0218dcf0afbb_cbc357ccb763df2852fee8c4fc7d55f2_0_0) 
switched from RUNNING to FAILED with failure cause:
java.lang.ClassCastException: pemja.core.object.PyIterator cannot be cast to 
pemja.core.object.PyIterator
at 
org.apache.flink.streaming.api.operators.python.embedded.AbstractOneInputEmbeddedPythonFunctionOperator.processElement(AbstractOneInputEmbeddedPythonFunctionOperator.java:156)
 
~[blob_p-ce9fc762fcfc43a2cf28c0d69a738da7efd36b10-323be2d12056985907101ebb52aff326:1.18.1]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.table.runtime.operators.sink.OutputConversionOperator.processElement(OutputConversionOperator.java:105)
 ~[flink-table-runtime-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.table.runtime.util.StreamRecordCollector.collect(StreamRecordCollector.java:44)
 ~[flink-table-runtime-1.18.1.jar:1.18.1]
at 
org.apache.flink.table.runtime.operators.sink.ConstraintEnforcer.processElement(ConstraintEnforcer.java:247)
 ~[flink-table-runtime-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
 ~[flink-dist-1.18.1.jar:1.18.1]
at StreamExecCalc$6.processElement(Unknown Source) ~[?:?]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:425)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:520)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.api.operators.StreamSourceContexts$SwitchingOnClose.collect(StreamSourceContexts.java:110)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:99)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:114)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 

[jira] [Commented] (FLINK-35212) PyFlink thread mode process just can run once in standalonesession mode

2024-04-29 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841934#comment-17841934
 ] 

Biao Geng commented on FLINK-35212:
---

Hi [~vonesec], thanks for creating the detailed bug report! 
I created a brand new env on my local machine and followed the instructions, 
but I cannot reproduce the exception you have met.
I took a look at the [pemja 
code|https://github.com/alibaba/pemja/blob/release-0.4-1-rc1/src/main/java/pemja/core/PythonInterpreter.java#L336]; 
as we can see, the MainInterpreter is a singleton, which means 
System.load('pemja_core.xxx.so') should happen only once in the JVM. That 
should rule out the exception in this jira, whose cause is loading the same .so 
twice.
So it is somewhat strange that you met such an exception. Have you ever changed 
anything under your python site-packages of pyflink?


But when running your code, I do meet another exception when trying to `flink 
run` a second time:

{quote}2024-04-29 17:41:03,054 WARN  org.apache.flink.runtime.taskmanager.Task  
  [] - Source: *anonymous_python-input-format$1*[1] -> Calc[2] 
-> ConstraintEnforcer[3] -> TableToDataSteam -> Map -> Sink: Print to Std. Out 
(1/1)#0 (4b45f9997b18155682ed0218dcf0afbb_cbc357ccb763df2852fee8c4fc7d55f2_0_0) 
switched from RUNNING to FAILED with failure cause:
java.lang.ClassCastException: pemja.core.object.PyIterator cannot be cast to 
pemja.core.object.PyIterator
at 
org.apache.flink.streaming.api.operators.python.embedded.AbstractOneInputEmbeddedPythonFunctionOperator.processElement(AbstractOneInputEmbeddedPythonFunctionOperator.java:156)
 
~[blob_p-ce9fc762fcfc43a2cf28c0d69a738da7efd36b10-323be2d12056985907101ebb52aff326:1.18.1]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.table.runtime.operators.sink.OutputConversionOperator.processElement(OutputConversionOperator.java:105)
 ~[flink-table-runtime-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.table.runtime.util.StreamRecordCollector.collect(StreamRecordCollector.java:44)
 ~[flink-table-runtime-1.18.1.jar:1.18.1]
at 
org.apache.flink.table.runtime.operators.sink.ConstraintEnforcer.processElement(ConstraintEnforcer.java:247)
 ~[flink-table-runtime-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
 ~[flink-dist-1.18.1.jar:1.18.1]
at StreamExecCalc$6.processElement(Unknown Source) ~[?:?]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.pushToOperator(CopyingChainingOutput.java:75)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:50)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.runtime.tasks.CopyingChainingOutput.collect(CopyingChainingOutput.java:29)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.api.operators.StreamSourceContexts$ManualWatermarkContext.processAndCollect(StreamSourceContexts.java:425)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.api.operators.StreamSourceContexts$WatermarkContext.collect(StreamSourceContexts.java:520)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.api.operators.StreamSourceContexts$SwitchingOnClose.collect(StreamSourceContexts.java:110)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.api.functions.source.InputFormatSourceFunction.run(InputFormatSourceFunction.java:99)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 
org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:114)
 ~[flink-dist-1.18.1.jar:1.18.1]
at 

[jira] [Commented] (FLINK-35192) operator oom

2024-04-26 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17841129#comment-17841129
 ] 

Biao Geng commented on FLINK-35192:
---

 !screenshot-3.png! 
According to the Flink k8s operator's code, deleteOnExit() is called when 
creating config files or pod template files. Since deleteOnExit() keeps every 
registered path until the JVM exits, this can leak memory if the operator pod 
runs for a long time. The operator's FlinkConfigManager implementation already 
cleans up these temp files/dirs itself, so maybe we can safely remove the 
deleteOnExit() usage? cc [~gyfora]

Also, from the attached yaml, it looks like a custom operator image 
(gdc-flink-kubernetes-operator:1.6.1-GDC1.0.2) is used. [~stupid_pig], would 
you mind checking whether your code calls methods like deleteOnExit, in case 
you have customized changes to the operator?

> operator oom
> 
>
> Key: FLINK-35192
> URL: https://issues.apache.org/jira/browse/FLINK-35192
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.6.1
> Environment: jdk: openjdk11
> operator version: 1.6.1
>Reporter: chenyuzhi
>Priority: Major
> Attachments: image-2024-04-22-15-47-49-455.png, 
> image-2024-04-22-15-52-51-600.png, image-2024-04-22-15-58-23-269.png, 
> image-2024-04-22-15-58-42-850.png, screenshot-1.png, screenshot-2.png, 
> screenshot-3.png
>
>
> The Kubernetes operator docker process was killed by the kernel due to 
> out-of-memory (at 2024.04.03 18:16)
>  !image-2024-04-22-15-47-49-455.png! 
> Metrics:
> the pod memory (RSS) has been increasing slowly over the past 7 days:
>  !screenshot-1.png! 
> However, the JVM memory metrics of the operator do not show an obvious 
> anomaly:
>  !image-2024-04-22-15-58-23-269.png! 
>  !image-2024-04-22-15-58-42-850.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-35192) operator oom

2024-04-26 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-35192?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng updated FLINK-35192:
--
Attachment: screenshot-3.png

> operator oom
> 
>
> Key: FLINK-35192
> URL: https://issues.apache.org/jira/browse/FLINK-35192
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.6.1
> Environment: jdk: openjdk11
> operator version: 1.6.1
>Reporter: chenyuzhi
>Priority: Major
> Attachments: image-2024-04-22-15-47-49-455.png, 
> image-2024-04-22-15-52-51-600.png, image-2024-04-22-15-58-23-269.png, 
> image-2024-04-22-15-58-42-850.png, screenshot-1.png, screenshot-2.png, 
> screenshot-3.png
>
>
> The Kubernetes operator docker process was killed by the kernel due to 
> out-of-memory (at 2024.04.03 18:16)
>  !image-2024-04-22-15-47-49-455.png! 
> Metrics:
> the pod memory (RSS) has been increasing slowly over the past 7 days:
>  !screenshot-1.png! 
> However, the JVM memory metrics of the operator do not show an obvious 
> anomaly:
>  !image-2024-04-22-15-58-23-269.png! 
>  !image-2024-04-22-15-58-42-850.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-35035) Reduce job pause time when cluster resources are expanded in adaptive mode

2024-04-07 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834752#comment-17834752
 ] 

Biao Geng commented on FLINK-35035:
---

I am not very familiar with the adaptive scheduler; maybe others can share 
more insights. I just want to ask a question to make sure we are on the same 
page.
Do you mean that instead of triggering onNewResourcesAvailable repeatedly, 
once per newly found slot (in your example, 5 new slots, so 5 reschedulings), 
you expect the JM to wait a configurable period of time, discover all 5 new 
slots together, and trigger only one rescheduling?

> Reduce job pause time when cluster resources are expanded in adaptive mode
> --
>
> Key: FLINK-35035
> URL: https://issues.apache.org/jira/browse/FLINK-35035
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Task
>Affects Versions: 1.19.0
>Reporter: yuanfenghu
>Priority: Minor
>
> When 'jobmanager.scheduler = adaptive', job graph changes triggered by 
> cluster expansion cause long-term task stagnation. We should reduce this 
> impact.
> As an example:
> I have a jobgraph: [v1 (maxp=10, minp=1)] -> [v2 (maxp=10, minp=1)]
> When my cluster has 5 slots, the job will be executed as [v1 p5]->[v2 p5]
> When I add slots, the job triggers jobgraph changes via
> org.apache.flink.runtime.scheduler.adaptive.ResourceListener#onNewResourcesAvailable.
> However, the five new slots I added were not discovered at the same time (for 
> convenience, assume one slot per taskmanager): no matter what environment we 
> add slots in, we cannot guarantee that they all arrive at once, so 
> onNewResourcesAvailable triggers repeatedly.
> If each new slot arrives after some interval, the jobgraph keeps changing 
> during this period. What I hope for is a configurable stabilization time: 
> trigger the jobgraph change only after the number of cluster slots has been 
> stable for a certain period, to avoid this situation.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-35036) Flink CDC Job cancel with savepoint failed

2024-04-07 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-35036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17834749#comment-17834749
 ] 

Biao Geng commented on FLINK-35036:
---

Hi [~fly365], according to the attached screenshot, the failure is caused by a 
timeout on the Flink client side.
IIUC, in the full snapshot phase of a Flink CDC job it needs to process lots 
of data, and typically, due to back pressure, the state may be much larger 
than in the incremental phase (you can check the state size in Flink's web UI).
As a result, it takes Flink longer to complete the savepoint. The client's 
[default timeout is 
60s|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#client-timeout],
 so maybe you can increase that value to see if the savepoint can succeed.
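For reference, that is a one-line change in flink-conf.yaml (the 600 s value is just an illustrative guess; size it to how long your snapshot-phase savepoint actually takes):

```yaml
# Client-side timeout for requests such as cancel-with-savepoint
# (default is 60 s).
client.timeout: 600 s
```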


> Flink CDC Job cancel with savepoint failed
> --
>
> Key: FLINK-35036
> URL: https://issues.apache.org/jira/browse/FLINK-35036
> Project: Flink
>  Issue Type: Bug
>  Components: Flink CDC
> Environment: Flink 1.15.2
> Flink CDC 2.4.2
> Oracle 19C
> Doris 2.0.3
>Reporter: Fly365
>Priority: Major
> Attachments: image-2024-04-07-17-35-23-136.png
>
>
> With this Flink CDC job, I want to sync Oracle data to Doris; in the 
> snapshot phase, cancelling the Flink CDC job with a savepoint failed.
> Using Flink CDC to sync Oracle 19C tables to Doris: in the initial snapshot 
> phase, after part of the data had been synchronized but before reaching the 
> incremental phase, cancelling the CDC job while taking a Flink savepoint 
> failed. After the job enters the incremental phase, cancelling the job with 
> a savepoint works. Why does the savepoint fail during the full data sync 
> phase?
> !image-2024-04-07-17-35-23-136.png!
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31850) Fix args in JobSpec not being passed through to Flink in Standalone mode - 1.4.0

2023-04-27 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31850?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17717434#comment-17717434
 ] 

Biao Geng commented on FLINK-31850:
---

hi [~gil_shmaya], I checked the relevant code in the 
[PR|https://github.com/apache/flink-kubernetes-operator/pull/379]; it does not 
look broken.
Could you please share an args example that worked in 1.2.0 but fails in 1.4.0?

> Fix args in JobSpec not being passed through to Flink in Standalone mode - 
> 1.4.0
> 
>
> Key: FLINK-31850
> URL: https://issues.apache.org/jira/browse/FLINK-31850
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.4.0
>Reporter: Gil Shmaya
>Priority: Major
>
> This issue is related to a previously fixed bug in version 1.2.0 -  
> [FLINK-29388] Fix args in JobSpec not being passed through to Flink in 
> Standalone mode - ASF JIRA (apache.org)
> I have noticed that while the args are successfully being passed when using 
> version 1.2.0, this is not the case with version 1.4.0.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-31964) Improve the document of Autoscaler as 1.17.0 is released

2023-04-27 Thread Biao Geng (Jira)
Biao Geng created FLINK-31964:
-

 Summary: Improve the document of Autoscaler as 1.17.0 is released
 Key: FLINK-31964
 URL: https://issues.apache.org/jira/browse/FLINK-31964
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Biao Geng
 Attachments: image-2023-04-28-10-21-09-935.png

Since 1.17.0 is released and the official image is 
[available|https://hub.docker.com/layers/library/flink/1.17.0-scala_2.12-java8/images/sha256-a8bbef97ec3f7ce4fa6541d48dfe16261ee7f93f93b164c0e84644605f9ea0a3?context=explore],
 we can update the image link in the Autoscaler section.
 !image-2023-04-28-10-21-09-935.png! 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31509) REST Service missing sessionAffinity causes job run failure with HA cluster

2023-03-20 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702659#comment-17702659
 ] 

Biao Geng commented on FLINK-31509:
---

I believe routing API requests to all JMs instead of only the leader is a bug. 
Besides, it should be OK to use K8s HA with a single JM: when that JM crashes, 
a new one is created from the HA data stored in the config map. Multiple JMs 
may reduce recovery time but are not strictly necessary.

> REST Service missing sessionAffinity causes job run failure with HA cluster
> ---
>
> Key: FLINK-31509
> URL: https://issues.apache.org/jira/browse/FLINK-31509
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
> Environment: Flink 1.15 on Flink Operator 1.4.0 on Kubernetes 1.25.4, 
> (optionally with Beam 2.46.0)
> but the issue was observed on Flink 1.14, 1.15 and 1.16 and on Flink Operator 
> 1.2, 1.3, 1.3.1, 1.4.0
>  
>Reporter: Emmanuel Leroy
>Priority: Major
>
> When using a Session Cluster with multiple Job Managers, the -rest service 
> load balances the API requests to all job managers, not just the master.
> When submitting a FlinkSessionJob, I often see errors like: `jar .jar 
> was not found`, because the submission is done in 2 steps: 
>  * upload the jar with `v1/jars/upload` which returns the `jar_id`
>  * run the job with `v1/jars//run`
> Unfortunately, with the Service load balancing between nodes, it is often the 
> case that the jar is uploaded to one JM while the run request lands on 
> another, where the jar doesn't exist.
> A simple fix is to append the `sessionAffinity: ClientIP` on the -rest 
> service, where the API calls from a given originating IP will always be 
> routed to the same node.
> This issue is especially problematic with Beam, where the Beam job submission 
> does not retry to run the job with the jar_id, and will fail, causing it to 
> re-upload a new jar and retrying, until it is lucky enough to get the 2 calls 
> in a row routed to the same node.
>  
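The proposed fix, sketched as a patch to the rest Service (the fields follow the standard Kubernetes Service API; the service name is a hypothetical example):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-session-cluster-rest    # hypothetical name of the -rest service
spec:
  # Pin each client IP to a single JM pod so that the upload and run
  # requests of one submission reach the same JobManager.
  sessionAffinity: ClientIP
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800        # Kubernetes default affinity timeout
```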



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-31529) Let yarn client exit early before JobManager running

2023-03-20 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-31529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17702634#comment-17702634
 ] 

Biao Geng commented on FLINK-31529:
---

Hi there, I want to share some thoughts/questions about this jira:
1. According to the 
[doc|https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/cli/],
 the --detached option indicates whether the client should wait for the job to 
finish. I have seen some users' platforms rely on this option to get the 
returned YARN app info to manage their flink jobs (e.g. whether the job was 
submitted successfully). Maybe introducing a new option is better than changing 
the behavior of the --detached option.
2. The description says "In batch mode, the queue resources is insufficient in 
most case." IIUC, lack of resources should not be the normal case. One possible 
use case I can come up with: to reduce costs, people may run flink batch jobs 
at night and use workflow frameworks like Airflow to retry the submission. Is 
that the case? 
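For context, detached submission looks like this today (the jar path is illustrative); even with -d, the client currently waits until the YARN application reaches RUNNING:

```shell
# -d / --detached: the client does not wait for the job to finish,
# but YarnClusterDescriptor still blocks until the app is RUNNING.
flink run -m yarn-cluster -d ./examples/streaming/TopSpeedWindowing.jar
```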

> Let yarn client exit early before JobManager running
> 
>
> Key: FLINK-31529
> URL: https://issues.apache.org/jira/browse/FLINK-31529
> Project: Flink
>  Issue Type: Improvement
>  Components: Deployment / YARN
>Reporter: Weihua Hu
>Priority: Major
>
> Currently the YarnClusterDescriptor always waits for the yarn application 
> status to become RUNNING, even in detached mode. 
> In batch mode, queue resources are often insufficient, so the job manager may 
> wait a long time for resources, and the client keeps waiting too. If the 
> flink client is killed (for some other reason), the cluster will be shut down 
> as well.
> We need an option to let the Flink client exit early. Reusing the detach 
> option or introducing a new option are both OK.
> Looking forward to other suggestions



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-23890) CepOperator may create a large number of timers and cause performance problems

2023-02-09 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-23890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17686299#comment-17686299
 ] 

Biao Geng commented on FLINK-23890:
---

This improvement is of great value for CEP users who use Event Time and have 
many keys in their input stream. 
Thanks a lot for your work!

> CepOperator may create a large number of timers and cause performance problems
> --
>
> Key: FLINK-23890
> URL: https://issues.apache.org/jira/browse/FLINK-23890
> Project: Flink
>  Issue Type: Improvement
>  Components: Library / CEP
>Affects Versions: 1.12.1
>Reporter: Yue Ma
>Assignee: Yue Ma
>Priority: Major
>  Labels: pull-request-available
> Fix For: 1.16.0
>
> Attachments: image-2021-08-20-13-59-05-977.png, 
> image-2022-06-07-21-27-03-814.png, image-2022-06-07-21-40-58-781.png
>
>
>  There are two situations in which the CepOperator may register timers when 
> dealing with EventTime. 
> One is in processElement, which buffers the data first and then registers a 
> timer with a timestamp of watermark+1.
> {code:java}
> if (timestamp > timerService.currentWatermark()) {
>  // we have an event with a valid timestamp, so
>  // we buffer it until we receive the proper watermark.
>  saveRegisterWatermarkTimer();
>  bufferEvent(value, timestamp);
> }{code}
> The other is when the EventTimer is triggered, if sortedTimestamps or 
> partialMatches are not empty, a timer will also be registered.
> {code:java}
> if (!sortedTimestamps.isEmpty() || !partialMatches.isEmpty()) {
>  saveRegisterWatermarkTimer();
> }{code}
>  
> The problem is: if the partialMatches of every key is non-empty, then every 
> time the watermark advances, the timers of all keys will be triggered, and a 
> new EventTimer is re-registered under each key. When the number of keys in 
> the task is very large, this operation greatly affects performance.
> !https://code.byted.org/inf/flink/uploads/91aee639553df07fa376cf2865e91fd2/image.png!
> I think it is unnecessary to register EventTimers this frequently. Can we 
> make the following changes?
> When an event comes, the timestamp of the EventTimer we register is equal to 
> the EventTime of this event instead of watermark + 1.
> When a new ComputionState with window is created (like *withIn* pattern ),  
> we use the timeout of this window to create EventTimer (EventTime + 
> WindowTime). 
> After making such an attempt in our test environment, the number of 
> registered timers has been greatly reduced, and the performance has been 
> greatly improved.
> !https://code.byted.org/inf/flink/uploads/24b85492c6a34a35c4445a4fd46c8363/image.png!
>  
>  
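The effect described above can be sketched with a back-of-envelope count of timer registrations. This is a plain-Java illustration, not Flink code; the key and watermark counts are made-up assumptions:

```java
/**
 * Contrasts the two timer strategies discussed above.
 * Strategy A (current): every watermark advance fires and re-registers one
 * timer at watermark+1 for each key with non-empty partialMatches.
 * Strategy B (proposed): one timer per buffered event at its event time,
 * independent of how often the watermark advances.
 */
public class CepTimerSimulation {

    static long timersWithWatermarkPlusOne(int keys, int watermarkAdvances) {
        // One re-registration per key per watermark advance.
        return (long) keys * watermarkAdvances;
    }

    static long timersWithEventTime(int keys, int eventsPerKey) {
        // One registration per buffered event, regardless of watermarks.
        return (long) keys * eventsPerKey;
    }

    public static void main(String[] args) {
        int keys = 100_000, watermarkAdvances = 1_000, eventsPerKey = 10;
        // 100_000 keys * 1_000 watermark advances = 100_000_000 registrations
        System.out.println("watermark+1 strategy: "
                + timersWithWatermarkPlusOne(keys, watermarkAdvances));
        // 100_000 keys * 10 events = 1_000_000 registrations
        System.out.println("event-time strategy:  "
                + timersWithEventTime(keys, eventsPerKey));
    }
}
```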



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-30562) Patterns are not emitted with parallelism >1 since 1.15.x+

2023-01-04 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17654507#comment-17654507
 ] 

Biao Geng commented on FLINK-30562:
---

hi [~Jamalarm], thanks for the report. I tried to reproduce your problem with 
flink 1.16.0 using a standalone cluster on my computer (my demo using event 
time can be found 
[here|https://github.com/bgeng777/ververica-cep-demo/tree/FLINK-30562]). I 
agree there are some differences in stdout when setting parallelism to 2 
compared with setting it to 1, but the result seems correct. 
In my demo, when p is 2, the stdout of the matches is:
{quote}1> 3,3,3
1> 2,2,2{quote}
when p is 1:
{quote}3,3,3
2,2,2{quote}
Is the above result the same as in your experiments?

> Patterns are not emitted with parallelism >1 since 1.15.x+
> --
>
> Key: FLINK-30562
> URL: https://issues.apache.org/jira/browse/FLINK-30562
> Project: Flink
>  Issue Type: Bug
>  Components: Library / CEP
>Affects Versions: 1.16.0, 1.15.3
> Environment: Problem observed in:
> Production:
> Dockerised Flink cluster running in AWS Fargate, sourced from AWS Kinesis and 
> sink to AWS SQS
> Local:
> Completely local MiniCluster based test with no external sinks or sources
>Reporter: Thomas Wozniakowski
>Priority: Major
>
> (Apologies for the speculative and somewhat vague ticket, but I wanted to 
> raise this while I am investigating to see if anyone has suggestions to help 
> me narrow down the problem.)
> We are encountering an issue where our streaming Flink job has stopped 
> working correctly since Flink 1.15.3. This problem is also present on Flink 
> 1.16.0. The Keyed CEP operators that our job uses are no longer emitting 
> Patterns reliably, but critically *this is only happening when parallelism is 
> set to a value greater than 1*. 
> Our local build tests were previously set up using in-JVM `MiniCluster` 
> instances, or dockerised Flink clusters all set with a parallelism of 1, so 
> this problem was not caught and it caused an outage when we upgraded the 
> cluster version in production.
> Observing the job using the Flink console in production, I can see that 
> events are *arriving* into the Keyed CEP operators, but no Pattern events are 
> being emitted out of any of the operators. Furthermore, all the reported 
> Watermark values are zero, though I don't know if that is a red herring as it 
> seems Watermark reporting seems to have changed since 1.14.x.
> I am currently attempting to create a stripped down version of our streaming 
> job to demonstrate the problem, but this is quite tricky to set up. In the 
> meantime I would appreciate any hints that could point me in the right 
> direction.
> I have isolated the problem to the Keyed CEP operator by removing our real 
> sinks and sources from the failing test. I am still seeing the erroneous 
> behaviour when setting up a job as:
> # Events are read from a list using `env.fromCollection( ... )`
> # CEP operator processes events
> # Output is captured in another list for assertions
> My best guess at the moment is something to do with Watermark emission? There 
> seems to have been changes related to watermark alignment, perhaps this has 
> caused some kind of regression in the CEP library? To reiterate, *this 
> problem only occurs with parallelism of 2 or more. Setting the parallelism to 
> 1 immediately fixes the issue*



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-30518) [flink-operator] Kubernetes HA not working due to wrong jobmanager.rpc.address

2022-12-28 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17652520#comment-17652520
 ] 

Biao Geng commented on FLINK-30518:
---

[~gyfora] I see. Thanks for the information. I misunderstood the problem 
somehow :(
But I just tried the 1.3.0 operator with flink 1.16 to run 
basic-checkpoint-ha-example, setting the JM replicas to 3. It works fine as 
well.
[~tbnguyen1407] would you mind sharing the full deployment yaml for this 
problem?   
 

> [flink-operator] Kubernetes HA not working due to wrong jobmanager.rpc.address
> --
>
> Key: FLINK-30518
> URL: https://issues.apache.org/jira/browse/FLINK-30518
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.3.0
>Reporter: Binh-Nguyen Tran
>Priority: Major
> Attachments: flink-configmap.png, screenshot-1.png
>
>
> Since flink-conf.yaml is mounted as read-only configmap, the 
> /docker-entrypoint.sh script is not able to inject correct Pod IP to 
> `jobmanager.rpc.address`. This leads to same address (e.g flink.ns-ext) being 
> set for all Job Manager pods. This causes:
> (1) flink-cluster-config-map always contains wrong address for all 3 
> component leaders (see screenshot, should be pod IP instead of clusterIP 
> service name)
> (2) Accessing Web UI when jobmanager.replicas > 1 is not possible with error
> {code:java}
> {"errors":["Service temporarily unavailable due to an ongoing leader 
> election. Please refresh."]} {code}
>  
> ~ flinkdeployment.yaml ~
> {code:java}
> spec:
>   flinkConfiguration:
>     high-availability: kubernetes
>     high-availability.storageDir: "file:///opt/flink/storage"
>     ...
>   jobManager:
>     replicas: 3
>   ... {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-30518) [flink-operator] Kubernetes HA not working due to wrong jobmanager.rpc.address

2022-12-28 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-30518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17652511#comment-17652511
 ] 

Biao Geng commented on FLINK-30518:
---

I think [~gyfora] is right that this was not a problem before.
I just tried an older version (1.1.0) of the operator, setting jobManager 
replicas to 2, and it works fine for running jobs and accessing the web UI. 
I checked my configmap (`basic-checkpoint-ha-example-cluster-config-map`) and 
leader.dispatcher seems to be set correctly.
 !screenshot-1.png! 

I will try to reproduce this problem with an operator image built from the 
latest main branch, as the 
[PR|https://github.com/apache/flink-kubernetes-operator/pull/494] to remove 
the read-only limit has been merged.

> [flink-operator] Kubernetes HA not working due to wrong jobmanager.rpc.address
> --
>
> Key: FLINK-30518
> URL: https://issues.apache.org/jira/browse/FLINK-30518
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.3.0
>Reporter: Binh-Nguyen Tran
>Priority: Major
> Attachments: flink-configmap.png, screenshot-1.png
>
>
> Since flink-conf.yaml is mounted as read-only configmap, the 
> /docker-entrypoint.sh script is not able to inject correct Pod IP to 
> `jobmanager.rpc.address`. This leads to same address (e.g flink.ns-ext) being 
> set for all Job Manager pods. This causes:
> (1) flink-cluster-config-map always contains wrong address for all 3 
> component leaders (see screenshot, should be pod IP instead of clusterIP 
> service name)
> (2) Accessing Web UI when jobmanager.replicas > 1 is not possible with error
> {code:java}
> {"errors":["Service temporarily unavailable due to an ongoing leader 
> election. Please refresh."]} {code}
>  
> ~ flinkdeployment.yaml ~
> {code:java}
> spec:
>   flinkConfiguration:
>     high-availability: kubernetes
>     high-availability.storageDir: "file:///opt/flink/storage"
>     ...
>   jobManager:
>     replicas: 3
>   ... {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-30518) [flink-operator] Kubernetes HA not working due to wrong jobmanager.rpc.address

2022-12-28 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-30518?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng updated FLINK-30518:
--
Attachment: screenshot-1.png

> [flink-operator] Kubernetes HA not working due to wrong jobmanager.rpc.address
> --
>
> Key: FLINK-30518
> URL: https://issues.apache.org/jira/browse/FLINK-30518
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.3.0
>Reporter: Binh-Nguyen Tran
>Priority: Major
> Attachments: flink-configmap.png, screenshot-1.png
>
>
> Since flink-conf.yaml is mounted as read-only configmap, the 
> /docker-entrypoint.sh script is not able to inject correct Pod IP to 
> `jobmanager.rpc.address`. This leads to same address (e.g flink.ns-ext) being 
> set for all Job Manager pods. This causes:
> (1) flink-cluster-config-map always contains wrong address for all 3 
> component leaders (see screenshot, should be pod IP instead of clusterIP 
> service name)
> (2) Accessing Web UI when jobmanager.replicas > 1 is not possible with error
> {code:java}
> {"errors":["Service temporarily unavailable due to an ongoing leader 
> election. Please refresh."]} {code}
>  
> ~ flinkdeployment.yaml ~
> {code:java}
> spec:
>   flinkConfiguration:
>     high-availability: kubernetes
>     high-availability.storageDir: "file:///opt/flink/storage"
>     ...
>   jobManager:
>     replicas: 3
>   ... {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-28554) Kubernetes-Operator allow readOnlyRootFilesystem via operatorSecurityContext

2022-12-20 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17649728#comment-17649728
 ] 

Biao Geng commented on FLINK-28554:
---

Want to attach more information:
According to the k8s 
[doc|https://kubernetes.io/docs/concepts/storage/volumes/#configmap]: "A 
container using a ConfigMap as a subPath volume mount will not receive 
ConfigMap updates". 
It may also be worthwhile to mention that the 
{{dynamic.config.check.interval}} config may not work as expected in some 
cases. For example, we rely on {{dynamic.config.check.interval}} to update 
some default configs by updating the {{flink-operator-config}} ConfigMap 
without restarting the operator Pod. After this jira, that dynamic update is 
broken. The {{CONF_OVERRIDE_DIR}} introduced in FLINK-28445 may not work 
either.
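For reference, this is the subPath mount pattern that blocks ConfigMap updates, per the Kubernetes doc cited above (volume and file names are illustrative):

```yaml
# A ConfigMap file mounted through subPath is copied once at Pod start;
# later updates to the ConfigMap are NOT propagated into the container,
# so interval-based config re-reads never see new values.
volumeMounts:
  - name: flink-operator-config     # hypothetical volume name
    mountPath: /opt/flink/conf/flink-conf.yaml
    subPath: flink-conf.yaml
```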

 

> Kubernetes-Operator allow readOnlyRootFilesystem via operatorSecurityContext
> 
>
> Key: FLINK-28554
> URL: https://issues.apache.org/jira/browse/FLINK-28554
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.0.1
>Reporter: Tim
>Assignee: Tim
>Priority: Minor
>  Labels: operator, pull-request-available
> Fix For: kubernetes-operator-1.2.0
>
>
> It would be nice if the operator would support using the 
> "readOnlyRootFilesystem" setting via the operatorSecurityContext. When using 
> the default operator template the operator won't be able to start when using 
> this setting because the config files mounted in `/opt/flink/conf` are now 
> (of course) also read-only.
> It would be nice if the default template would be written in such a way that 
> it allows adding emptyDir volumes to /opt/flink/conf via the values.yaml. 
> Which is not possible right now. Then the config files can remain editable by 
> the operator while keeping the root filesystem read-only.
> I have successfully tried that in my branch (see: 
> https://github.com/apache/flink-kubernetes-operator/compare/main...timsn:flink-kubernetes-operator:mount-single-flink-conf-files)
>  which prepares the operator template.
> After this small change to the template, it is possible to add emptyDir 
> volumes for the conf and tmp dirs and, in a second step, to enable the 
> readOnlyRootFilesystem setting via the values.yaml
> values.yaml
> {code:java}
> [...]
> operatorVolumeMounts:
>   create: true
>   data:
>     - name: flink-conf
>       mountPath: /opt/flink/conf
>       subPath: conf
>     - name: flink-tmp
>       mountPath: /tmp
> operatorVolumes:
>   create: true
>   data:
>     - name: flink-conf
>       emptyDir: {}
>     - name: flink-tmp
>       emptyDir: {}
> operatorSecurityContext:
>   readOnlyRootFilesystem: true
> [...]{code}
> I think this could be a viable way to allow this security setting and I 
> could turn this into a pull request if desired. What do you think about it? 
> Or is there even a better way to achieve this I didn't think about yet?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-29529) Update flink version in flink-python-example of flink k8s operator

2022-10-06 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng updated FLINK-29529:
--
Description: 
Currently, we hardcoded the version of both flink image and pyflink pip package 
as 1.15.0 in the example's Dockerfile. It is not the best practice as the flink 
has new 1.15.x releases.
We had better do following improvements:
{{FROM flink:1.15.0 -> FROM flink:1.15}}
{{RUN pip3 install apache-flink==1.15.0 -> RUN pip3 install 
"apache-flink>=1.15.0,<1.16.0"}}

  was:
Currently, we hardcoded the version of both flink image and pyflink pip package 
as 1.15.0 in the example's Dockerfile. It is not the best practice as the flink 
has new 1.15.x releases.
We had better do following improvements:
{{FROM flink:1.15.0 -> FROM flink:1.15}}
{{RUN pip3 install apache-flink==1.15.0 -> RUN pip install 
"apache-flink>=1.15.0,<1.16.0"}}


> Update flink version in flink-python-example of flink k8s operator
> --
>
> Key: FLINK-29529
> URL: https://issues.apache.org/jira/browse/FLINK-29529
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Priority: Minor
>
> Currently, we hardcode the version of both the flink image and the pyflink 
> pip package as 1.15.0 in the example's Dockerfile. This is not the best 
> practice, as flink has newer 1.15.x releases.
> We had better make the following improvements:
> {{FROM flink:1.15.0 -> FROM flink:1.15}}
> {{RUN pip3 install apache-flink==1.15.0 -> RUN pip3 install 
> "apache-flink>=1.15.0,<1.16.0"}}
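Put together, the suggested Dockerfile change is a two-line sketch:

```dockerfile
# Track the latest 1.15.x image and a matching pyflink range,
# instead of pinning both to exactly 1.15.0.
FROM flink:1.15
RUN pip3 install "apache-flink>=1.15.0,<1.16.0"
```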



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29529) Update flink version in flink-python-example of flink k8s operator

2022-10-06 Thread Biao Geng (Jira)
Biao Geng created FLINK-29529:
-

 Summary: Update flink version in flink-python-example of flink k8s 
operator
 Key: FLINK-29529
 URL: https://issues.apache.org/jira/browse/FLINK-29529
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Biao Geng


Currently, we hardcoded the version of both flink image and pyflink pip package 
as 1.15.0 in the example's Dockerfile. It is not the best practice as the flink 
has new 1.15.x releases.
We had better do following improvements:
{{FROM flink:1.15.0 -> FROM flink:1.15}}
{{RUN pip3 install apache-flink==1.15.0 -> RUN pip install 
"apache-flink>=1.15.0,<1.16.0"}}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-29362) Allow loading dynamic config for kerberos authentication in CliFrontend

2022-09-20 Thread Biao Geng (Jira)
Biao Geng created FLINK-29362:
-

 Summary: Allow loading dynamic config for kerberos authentication 
in CliFrontend
 Key: FLINK-29362
 URL: https://issues.apache.org/jira/browse/FLINK-29362
 Project: Flink
  Issue Type: Improvement
  Components: Command Line Client
Reporter: Biao Geng


In the 
[code|https://github.com/apache/flink/blob/97f5a45cd035fbae37a7468c6f771451ddb4a0a4/flink-clients/src/main/java/org/apache/flink/client/cli/CliFrontend.java#L1167],
 Flink's client calls {{SecurityUtils.install(new 
SecurityConfiguration(cli.configuration));}} with configs (e.g. 
{{security.kerberos.login.principal}} and {{security.kerberos.login.keytab}}) 
taken only from flink-conf.yaml.
If users specify the above 2 configs via the -D option, they will not take 
effect on the client, as {{cli.parseAndRun(args)}} is executed after the 
security configs from flink-conf.yaml have been installed.
However, if a user specifies principal A in the client's flink-conf.yaml and 
uses the -D option to specify principal B, the launched YARN container will 
use principal B, even though the job is submitted on the client side with 
principal A.

Such behavior can be misleading, as Flink provides 2 ways to set a config but 
does not keep consistency between client and cluster. It also affects users 
who want to use flink with kerberos, as they must modify flink-conf.yaml 
whenever they want to use another kerberos user.
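For illustration, this is the conflicting setup (the principal names, keytab path, and jar are hypothetical):

```shell
# flink-conf.yaml on the client sets principal A, but the CLI overrides it
# with principal B. The YARN containers end up using B, while the client-side
# SecurityUtils.install(...) still uses A from flink-conf.yaml.
flink run -m yarn-cluster \
  -Dsecurity.kerberos.login.principal=userB@EXAMPLE.COM \
  -Dsecurity.kerberos.login.keytab=/path/to/userB.keytab \
  WordCount.jar
```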




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-29277) Flink submits tasks to yarn Federation and throws an exception 'org.apache.commons.lang3.NotImplementedException: Code is not implemented'

2022-09-20 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-29277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng updated FLINK-29277:
--
Attachment: screenshot-1.png

> Flink submits tasks to yarn Federation and throws an exception 
> 'org.apache.commons.lang3.NotImplementedException: Code is not implemented'
> --
>
> Key: FLINK-29277
> URL: https://issues.apache.org/jira/browse/FLINK-29277
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.14.3
> Environment: Flink 1.14.3、JDK8、hadoop-3.2.1
>Reporter: Jiankun Feng
>Priority: Blocker
> Attachments: error.log, image-2022-09-13-15-56-47-631.png, 
> screenshot-1.png
>
>
> 2022-09-13 11:02:35,488 INFO  
> org.apache.flink.runtime.util.config.memory.ProcessMemoryUtils [] - The 
> derived from fraction jvm overhead memory (102.400mb (107374184 bytes)) is 
> less than its min value 192.000mb (201326592 bytes), min value will be used 
> instead
> 2022-09-13 11:02:35,751 WARN  org.apache.flink.table.client.cli.CliClient     
>              [] - Could not execute SQL statement.
> org.apache.flink.table.client.gateway.SqlExecutionException: Could not 
> execute SQL statement.
>         at 
> org.apache.flink.table.client.gateway.local.LocalExecutor.executeModifyOperations(LocalExecutor.java:225)
>  ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.client.cli.CliClient.callInserts(CliClient.java:617) 
> ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.client.cli.CliClient.callInsert(CliClient.java:606) 
> ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.client.cli.CliClient.callOperation(CliClient.java:466) 
> ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.client.cli.CliClient.lambda$executeStatement$1(CliClient.java:346)
>  [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at java.util.Optional.ifPresent(Optional.java:159) ~[?:1.8.0_141]
>         at 
> org.apache.flink.table.client.cli.CliClient.executeStatement(CliClient.java:339)
>  [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.client.cli.CliClient.executeFile(CliClient.java:318) 
> [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.client.cli.CliClient.executeInNonInteractiveMode(CliClient.java:234)
>  [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.client.SqlClient.openCli(SqlClient.java:153) 
> [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at org.apache.flink.table.client.SqlClient.start(SqlClient.java:95) 
> [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.client.SqlClient.startClient(SqlClient.java:187) 
> [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at org.apache.flink.table.client.SqlClient.main(SqlClient.java:161) 
> [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
> Caused by: org.apache.flink.table.api.TableException: Failed to execute sql
>         at 
> org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:791)
>  ~[flink-table_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:754)
>  ~[flink-table_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.client.gateway.local.LocalExecutor.lambda$executeModifyOperations$4(LocalExecutor.java:223)
>  ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.client.gateway.context.ExecutionContext.wrapClassLoader(ExecutionContext.java:88)
>  ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.client.gateway.local.LocalExecutor.executeModifyOperations(LocalExecutor.java:223)
>  ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         ... 12 more
> Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: 
> Could not deploy Yarn job cluster.
>         at 
> org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:489)
>  ~[flink-dist_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.client.deployment.executors.AbstractJobClusterExecutor.execute(AbstractJobClusterExecutor.java:81)
>  ~[flink-dist_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> 

[jira] [Commented] (FLINK-29277) Flink submits tasks to yarn Federation and throws an exception 'org.apache.commons.lang3.NotImplementedException: Code is not implemented'

2022-09-20 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-29277?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17607268#comment-17607268
 ] 

Biao Geng commented on FLINK-29277:
---

In hadoop 3.2.1, 
org.apache.hadoop.yarn.server.router.clientrm.FederationClientInterceptor#getClusterNodes
 is not implemented. 
 !screenshot-1.png! 

> Flink submits tasks to yarn Federation and throws an exception 
> 'org.apache.commons.lang3.NotImplementedException: Code is not implemented'
> --
>
> Key: FLINK-29277
> URL: https://issues.apache.org/jira/browse/FLINK-29277
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.14.3
> Environment: Flink 1.14.3、JDK8、hadoop-3.2.1
>Reporter: Jiankun Feng
>Priority: Blocker
> Attachments: error.log, image-2022-09-13-15-56-47-631.png, 
> screenshot-1.png
>
>
> 2022-09-13 11:02:35,488 INFO  
> org.apache.flink.runtime.util.config.memory.ProcessMemoryUtils [] - The 
> derived from fraction jvm overhead memory (102.400mb (107374184 bytes)) is 
> less than its min value 192.000mb (201326592 bytes), min value will be used 
> instead
> 2022-09-13 11:02:35,751 WARN  org.apache.flink.table.client.cli.CliClient     
>              [] - Could not execute SQL statement.
> org.apache.flink.table.client.gateway.SqlExecutionException: Could not 
> execute SQL statement.
>         at 
> org.apache.flink.table.client.gateway.local.LocalExecutor.executeModifyOperations(LocalExecutor.java:225)
>  ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.client.cli.CliClient.callInserts(CliClient.java:617) 
> ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.client.cli.CliClient.callInsert(CliClient.java:606) 
> ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.client.cli.CliClient.callOperation(CliClient.java:466) 
> ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.client.cli.CliClient.lambda$executeStatement$1(CliClient.java:346)
>  [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at java.util.Optional.ifPresent(Optional.java:159) ~[?:1.8.0_141]
>         at 
> org.apache.flink.table.client.cli.CliClient.executeStatement(CliClient.java:339)
>  [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.client.cli.CliClient.executeFile(CliClient.java:318) 
> [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.client.cli.CliClient.executeInNonInteractiveMode(CliClient.java:234)
>  [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.client.SqlClient.openCli(SqlClient.java:153) 
> [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at org.apache.flink.table.client.SqlClient.start(SqlClient.java:95) 
> [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.client.SqlClient.startClient(SqlClient.java:187) 
> [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at org.apache.flink.table.client.SqlClient.main(SqlClient.java:161) 
> [flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
> Caused by: org.apache.flink.table.api.TableException: Failed to execute sql
>         at 
> org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:791)
>  ~[flink-table_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.api.internal.TableEnvironmentImpl.executeInternal(TableEnvironmentImpl.java:754)
>  ~[flink-table_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.client.gateway.local.LocalExecutor.lambda$executeModifyOperations$4(LocalExecutor.java:223)
>  ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.client.gateway.context.ExecutionContext.wrapClassLoader(ExecutionContext.java:88)
>  ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> org.apache.flink.table.client.gateway.local.LocalExecutor.executeModifyOperations(LocalExecutor.java:223)
>  ~[flink-sql-client_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         ... 12 more
> Caused by: org.apache.flink.client.deployment.ClusterDeploymentException: 
> Could not deploy Yarn job cluster.
>         at 
> org.apache.flink.yarn.YarnClusterDescriptor.deployJobCluster(YarnClusterDescriptor.java:489)
>  ~[flink-dist_2.11-1.14.3-qihoo-f5.jar:1.14.3-qihoo-f5]
>         at 
> 

[jira] [Created] (FLINK-28769) Flink History Server show wrong name of batch jobs

2022-08-01 Thread Biao Geng (Jira)
Biao Geng created FLINK-28769:
-

 Summary: Flink History Server show wrong name of batch jobs
 Key: FLINK-28769
 URL: https://issues.apache.org/jira/browse/FLINK-28769
 Project: Flink
  Issue Type: Bug
  Components: Runtime / Web Frontend
Reporter: Biao Geng
 Attachments: image-2022-08-02-00-41-51-815.png

When running {{examples/batch/WordCount.jar}} with Flink 1.15 and 1.16 while the 
history server is running, the history server shows the default name (e.g. "Flink 
Java Job at Tue Aug 02...") of the batch job instead of the name ("WordCount 
Example") specified in the Java code.
But for {{examples/streaming/WordCount.jar}}, the job name in the history server 
is correct.

It looks like 
{{org.apache.flink.api.java.ExecutionEnvironment#executeAsync(java.lang.String)}}
 does not set the job name the way 
{{org.apache.flink.streaming.api.environment.StreamExecutionEnvironment#execute(java.lang.String)}}
 does (e.g. streamGraph.setJobName(jobName);).
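The difference described above can be modeled with a minimal, self-contained 
sketch (the classes and methods below are hypothetical stand-ins, not Flink 
source): the streaming path copies the user-supplied name onto the graph, while 
the batch path drops it and a generated default wins.

```java
public class JobNameModel {
    // Hypothetical stand-in for the graph object whose name the
    // history server later displays.
    static class Graph {
        final String jobName;
        Graph(String jobName) { this.jobName = jobName; }
    }

    // Streaming-style path: the user-supplied name is copied onto the
    // graph, mirroring streamGraph.setJobName(jobName).
    static Graph streamingExecute(String jobName) {
        return new Graph(jobName);
    }

    // Batch-style path as observed in this bug: the name argument is
    // silently dropped and a generated default wins.
    static Graph batchExecuteAsync(String jobName) {
        return new Graph("Flink Java Job at <submission time>");
    }

    public static void main(String[] args) {
        System.out.println(streamingExecute("WordCount Example").jobName);
        System.out.println(batchExecuteAsync("WordCount Example").jobName);
    }
}
```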


!image-2022-08-02-00-41-51-815.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-28643) Job archive file contains too much Job json file on Flink HistoryServer

2022-08-01 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17573852#comment-17573852
 ] 

Biao Geng commented on FLINK-28643:
---

I ran into this issue as well when running jobs with large parallelism. One 
possible remedy is to limit the number of archived jobs. But a better way may be 
to add a limit on the Flink side so that it does not break down the whole HDFS 
system.
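For reference, a sketch of the first remedy (limiting archived jobs) via the 
history server config; the exact key and its availability should be verified 
against the Flink docs for the version in use:

```yaml
# flink-conf.yaml on the HistoryServer host -- a sketch; check that the
# retention key below exists in your Flink version's documentation.
historyserver.archive.fs.dir: hdfs:///completed-jobs
# Cap how many archived jobs the history server keeps; older archives
# beyond the limit are cleaned up (-1 means unlimited).
historyserver.archive.retained-jobs: 1000
```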

> Job archive file contains too much Job json file on Flink HistoryServer 
> 
>
> Key: FLINK-28643
> URL: https://issues.apache.org/jira/browse/FLINK-28643
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Web Frontend
>Affects Versions: 1.15.1
>Reporter: LI Mingkun
>Priority: Major
> Attachments: image-2022-07-22-17-04-55-046.png
>
>
> In the History Server, 'HistoryServerArchiveFetcher' fetches the job archive 
> file and expands it into a job dir.
> There are more than 10,000 JSON files for a big-parallelism job, which runs 
> out of inodes.
> !image-2022-07-22-17-04-55-046.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-28725) flink-kubernetes-operator taskManager: replicas: 2 error

2022-07-28 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17572257#comment-17572257
 ] 

Biao Geng commented on FLINK-28725:
---

Hi [~lizu18xz], the "replicas" field of the task manager was introduced into the 
CRD in 1.1.0. If your kube cluster has an earlier version of 
flink-kubernetes-operator installed, you may have to remove the outdated CRD 
manually and reinstall the newest operator. The details of the upgrade process 
are 
[here|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.1/docs/operations/upgrade/].

Besides, in most cases, users do not need to set the replicas of task managers 
manually, as the value can be calculated from configs (i.e. parallelism / 
taskSlots).
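The calculation mentioned above is just a ceiling division; a hypothetical 
helper (not operator code) makes it concrete:

```java
public class ReplicaCalc {
    // Ceiling division: smallest replica count such that
    // replicas * taskSlots >= parallelism.
    static int replicas(int parallelism, int taskSlots) {
        return (parallelism + taskSlots - 1) / taskSlots;
    }

    public static void main(String[] args) {
        // e.g. parallelism 10 with 4 slots per TM needs 3 task managers
        System.out.println(replicas(10, 4));
    }
}
```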

> flink-kubernetes-operator taskManager: replicas: 2 error
> 
>
> Key: FLINK-28725
> URL: https://issues.apache.org/jira/browse/FLINK-28725
> Project: Flink
>  Issue Type: Bug
>Reporter: lizu18xz
>Priority: Major
>
> version:v1.1.0
>  
> taskManager:
>     replicas: 2
>     resource:
>       memory: "1024m"
>       cpu: 1
>  
> error validating data: ValidationError(FlinkDeployment.spec.taskManager): 
> unknown field "replicas" in 
> org.apache.flink.v1beta1.FlinkDeployment.spec.taskManager; if you choose to 
> ignore these errors, turn validation off with --validate=false



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] (FLINK-28725) flink-kubernetes-operator taskManager: replicas: 2 error

2022-07-28 Thread Biao Geng (Jira)


[ https://issues.apache.org/jira/browse/FLINK-28725 ]


Biao Geng deleted comment on FLINK-28725:
---

was (Author: bgeng777):
Hi [~lizu18xz], the field "replicas" of task manager is introduced into CRD 
since 1.1.0. If your kube cluster has already installed earlier version of 
flink-kubernetes-operator, you may have to remove the outdated CRD manually and 
reinstall the newest operator. The detail of upgrade process is 
[here|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.1/docs/operations/upgrade/]

Besides, in most cases, users do not need to set replica of task managers 
manually as it can be calculated from configs(i.e. parallelism / taskSlots).

> flink-kubernetes-operator taskManager: replicas: 2 error
> 
>
> Key: FLINK-28725
> URL: https://issues.apache.org/jira/browse/FLINK-28725
> Project: Flink
>  Issue Type: Bug
>Reporter: lizu18xz
>Priority: Major
>
> version:v1.1.0
>  
> taskManager:
>     replicas: 2
>     resource:
>       memory: "1024m"
>       cpu: 1
>  
> error validating data: ValidationError(FlinkDeployment.spec.taskManager): 
> unknown field "replicas" in 
> org.apache.flink.v1beta1.FlinkDeployment.spec.taskManager; if you choose to 
> ignore these errors, turn validation off with --validate=false



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-28495) Fix typos or mistakes of Flink CEP Document in the official website

2022-07-13 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17566586#comment-17566586
 ] 

Biao Geng commented on FLINK-28495:
---

[~martijnvisser] Sure. I can take this ticket.

> Fix typos or mistakes of Flink CEP Document in the official website
> ---
>
> Key: FLINK-28495
> URL: https://issues.apache.org/jira/browse/FLINK-28495
> Project: Flink
>  Issue Type: Improvement
>  Components: Library / CEP
>Reporter: Biao Geng
>Priority: Minor
>
> 1. "how you can migrate your job from an older Flink version to Flink-1.3." 
> -> "how you can migrate your job from an older Flink version to Flink-1.13."
> 2. "Will generate the following matches for an input sequence: C D A1 A2 A3 D 
> A4 B. with combinations enabled: {quote}\\{ C A1 B\}, \{C A1 A2 B\}, \{C A1 
> A3 B\}, \{C A1 A4 B\}, \{C A1 A2 A3 B\}, \{C A1 A2 A4 B\}, \{C A1 A3 A4 B\}, 
> \{C A1 A2 A3 A4 B\}{quote}" -> "Will generate the following matches for an 
> input sequence: C D A1 A2 A3 D A4 B. with combinations enabled: {quote}\{C A1 
> B\}, \{C A1 A2 B\}, \{C A1 A3 B\}, \{C A1 A4 B\}, \{C A1 A2 A3 B\}, \{C A1 A2 
> A4 B\}, \{C A1 A3 A4 B\}, \{C A1 A2 A3 A4 B\}, \{C A2 B\}, \{C A2 A3 B\}, \{C 
> A2 A4 B\}, \{C A2 A3 A4 B\}, \{C A3 B\}, \{C A3 A4 B\}, \{C A4 B\}{quote}"
> 3. "For SKIP_TO_FIRST/LAST there are two options how to handle cases when 
> there are no elements mapped to the specified variable." -> "For 
> SKIP_TO_FIRST/LAST there are two options how to handle cases when there are 
> no events mapped to the *PatternName*."



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28503) Fix invalid link in Python FAQ Document

2022-07-12 Thread Biao Geng (Jira)
Biao Geng created FLINK-28503:
-

 Summary: Fix invalid link in Python FAQ Document
 Key: FLINK-28503
 URL: https://issues.apache.org/jira/browse/FLINK-28503
 Project: Flink
  Issue Type: Improvement
  Components: Documentation, Project Website
Reporter: Biao Geng
 Attachments: image-2022-07-12-14-51-01-434.png

[https://nightlies.apache.org/flink/flink-docs-master/docs/dev/python/faq/#preparing-python-virtual-environment]

The script for setting up the PyFlink virtual environment is invalid now. A 
candidate replacement is 
[https://nightlies.apache.org/flink/flink-docs-release-1.12/downloads/setup-pyflink-virtual-env.sh],
 or we can add this short script to the doc website directly.

!image-2022-07-12-14-51-01-434.png!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-28495) Fix typos or mistakes of Flink CEP Document in the official website

2022-07-11 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-28495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng updated FLINK-28495:
--
Description: 
1. "how you can migrate your job from an older Flink version to Flink-1.3." -> 
"how you can migrate your job from an older Flink version to Flink-1.13."

2. "Will generate the following matches for an input sequence: C D A1 A2 A3 D 
A4 B. with combinations enabled: {quote}\\{ C A1 B\}, \{C A1 A2 B\}, \{C A1 A3 
B\}, \{C A1 A4 B\}, \{C A1 A2 A3 B\}, \{C A1 A2 A4 B\}, \{C A1 A3 A4 B\}, \{C 
A1 A2 A3 A4 B\}{quote}" -> "Will generate the following matches for an input 
sequence: C D A1 A2 A3 D A4 B. with combinations enabled: {quote}\{C A1 B\}, 
\{C A1 A2 B\}, \{C A1 A3 B\}, \{C A1 A4 B\}, \{C A1 A2 A3 B\}, \{C A1 A2 A4 
B\}, \{C A1 A3 A4 B\}, \{C A1 A2 A3 A4 B\}, \{C A2 B\}, \{C A2 A3 B\}, \{C A2 
A4 B\}, \{C A2 A3 A4 B\}, \{C A3 B\}, \{C A3 A4 B\}, \{C A4 B\}{quote}"


3. "For SKIP_TO_FIRST/LAST there are two options how to handle cases when there 
are no elements mapped to the specified variable." -> "For SKIP_TO_FIRST/LAST 
there are two options how to handle cases when there are no events mapped to 
the *PatternName*."

  was:
1. "how you can migrate your job from an older Flink version to Flink-1.3." -> 
"how you can migrate your job from an older Flink version to Flink-1.13."

2. "Will generate the following matches for an input sequence: C D A1 A2 A3 D 
A4 B. with combinations enabled: {quote}{C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C 
A1 A4 B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 
B}{quote}" -> "Will generate the following matches for an input sequence: C D 
A1 A2 A3 D A4 B. with combinations enabled: {quote}{C A1 B}, {C A1 A2 B}, {C A1 
A3 B}, {C A1 A4 B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 
A4 B}, {C A2 B}, {C A2 A3 B}, {C A2 A4 B}, {C A2 A3 A4 B}, {C A3 B}, {C A3 A4 
B}, {C A4 B}{quote}"


3. "For SKIP_TO_FIRST/LAST there are two options how to handle cases when there 
are no elements mapped to the specified variable." -> "For SKIP_TO_FIRST/LAST 
there are two options how to handle cases when there are no events mapped to 
the *PatternName*."


> Fix typos or mistakes of Flink CEP Document in the official website
> ---
>
> Key: FLINK-28495
> URL: https://issues.apache.org/jira/browse/FLINK-28495
> Project: Flink
>  Issue Type: Improvement
>  Components: Library / CEP
>Reporter: Biao Geng
>Priority: Minor
>
> 1. "how you can migrate your job from an older Flink version to Flink-1.3." 
> -> "how you can migrate your job from an older Flink version to Flink-1.13."
> 2. "Will generate the following matches for an input sequence: C D A1 A2 A3 D 
> A4 B. with combinations enabled: {quote}\\{ C A1 B\}, \{C A1 A2 B\}, \{C A1 
> A3 B\}, \{C A1 A4 B\}, \{C A1 A2 A3 B\}, \{C A1 A2 A4 B\}, \{C A1 A3 A4 B\}, 
> \{C A1 A2 A3 A4 B\}{quote}" -> "Will generate the following matches for an 
> input sequence: C D A1 A2 A3 D A4 B. with combinations enabled: {quote}\{C A1 
> B\}, \{C A1 A2 B\}, \{C A1 A3 B\}, \{C A1 A4 B\}, \{C A1 A2 A3 B\}, \{C A1 A2 
> A4 B\}, \{C A1 A3 A4 B\}, \{C A1 A2 A3 A4 B\}, \{C A2 B\}, \{C A2 A3 B\}, \{C 
> A2 A4 B\}, \{C A2 A3 A4 B\}, \{C A3 B\}, \{C A3 A4 B\}, \{C A4 B\}{quote}"
> 3. "For SKIP_TO_FIRST/LAST there are two options how to handle cases when 
> there are no elements mapped to the specified variable." -> "For 
> SKIP_TO_FIRST/LAST there are two options how to handle cases when there are 
> no events mapped to the *PatternName*."



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-28495) Fix typos or mistakes of Flink CEP Document in the official website

2022-07-11 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-28495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng updated FLINK-28495:
--
Description: 
1. "how you can migrate your job from an older Flink version to Flink-1.3." -> 
"how you can migrate your job from an older Flink version to Flink-1.13."

2. "Will generate the following matches for an input sequence: C D A1 A2 A3 D 
A4 B. with combinations enabled: {quote}{C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C 
A1 A4 B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 
B}{quote}" -> "Will generate the following matches for an input sequence: C D 
A1 A2 A3 D A4 B. with combinations enabled: {quote}{C A1 B}, {C A1 A2 B}, {C A1 
A3 B}, {C A1 A4 B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 
A4 B}, {C A2 B}, {C A2 A3 B}, {C A2 A4 B}, {C A2 A3 A4 B}, {C A3 B}, {C A3 A4 
B}, {C A4 B}{quote}"


3. "For SKIP_TO_FIRST/LAST there are two options how to handle cases when there 
are no elements mapped to the specified variable." -> "For SKIP_TO_FIRST/LAST 
there are two options how to handle cases when there are no events mapped to 
the *PatternName*."

  was:
"how you can migrate your job from an older Flink version to Flink-1.3." -> 
"how you can migrate your job from an older Flink version to Flink-1.13."

"Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 
B. with combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 B}, 
{C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}" -> "Will 
generate the following matches for an input sequence: C D A1 A2 A3 D A4 B. with 
combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 B}, {C A1 A2 
A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}, {C A2 B}, {C A2 A3 
B}, {C A2 A4 B}, {C A2 A3 A4 B}, {C A3 B}, {C A3 A4 B}, {C A4 B}"


"For SKIP_TO_FIRST/LAST there are two options how to handle cases when there 
are no elements mapped to the specified variable." -> "For SKIP_TO_FIRST/LAST 
there are two options how to handle cases when there are no events mapped to 
the *PatternName*."


> Fix typos or mistakes of Flink CEP Document in the official website
> ---
>
> Key: FLINK-28495
> URL: https://issues.apache.org/jira/browse/FLINK-28495
> Project: Flink
>  Issue Type: Improvement
>  Components: Library / CEP
>Reporter: Biao Geng
>Priority: Minor
>
> 1. "how you can migrate your job from an older Flink version to Flink-1.3." 
> -> "how you can migrate your job from an older Flink version to Flink-1.13."
> 2. "Will generate the following matches for an input sequence: C D A1 A2 A3 D 
> A4 B. with combinations enabled: {quote}{C A1 B}, {C A1 A2 B}, {C A1 A3 B}, 
> {C A1 A4 B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 
> B}{quote}" -> "Will generate the following matches for an input sequence: C D 
> A1 A2 A3 D A4 B. with combinations enabled: {quote}{C A1 B}, {C A1 A2 B}, {C 
> A1 A3 B}, {C A1 A4 B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 
> A2 A3 A4 B}, {C A2 B}, {C A2 A3 B}, {C A2 A4 B}, {C A2 A3 A4 B}, {C A3 B}, {C 
> A3 A4 B}, {C A4 B}{quote}"
> 3. "For SKIP_TO_FIRST/LAST there are two options how to handle cases when 
> there are no elements mapped to the specified variable." -> "For 
> SKIP_TO_FIRST/LAST there are two options how to handle cases when there are 
> no events mapped to the *PatternName*."



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-28495) Fix typos or mistakes of Flink CEP Document in the official website

2022-07-11 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-28495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng updated FLINK-28495:
--
Description: 
"how you can migrate your job from an older Flink version to Flink-1.3." -> 
"how you can migrate your job from an older Flink version to Flink-1.13."

"Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 
B. with combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 B}, 
{C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}" -> "Will 
generate the following matches for an input sequence: C D A1 A2 A3 D A4 B. with 
combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 B}, {C A1 A2 
A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}, {C A2 B}, {C A2 A3 
B}, {C A2 A4 B}, {C A2 A3 A4 B}, {C A3 B}, {C A3 A4 B}, {C A4 B}"


"For SKIP_TO_FIRST/LAST there are two options how to handle cases when there 
are no elements mapped to the specified variable." -> "For SKIP_TO_FIRST/LAST 
there are two options how to handle cases when there are no events mapped to 
the *PatternName*."

  was:
"Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 
B. with combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 B}, 
{C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}" -> "Will 
generate the following matches for an input sequence: C D A1 A2 A3 D A4 B. with 
combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 B}, {C A1 A2 
A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}, {C A2 B}, {C A2 A3 
B}, {C A2 A4 B}, {C A2 A3 A4 B}, {C A3 B}, {C A3 A4 B}, {C A4 B}"
"For SKIP_TO_FIRST/LAST there are two options how to handle cases when there 
are no elements mapped to the specified variable." -> "For SKIP_TO_FIRST/LAST 
there are two options how to handle cases when there are no events mapped to 
the *PatternName*."


> Fix typos or mistakes of Flink CEP Document in the official website
> ---
>
> Key: FLINK-28495
> URL: https://issues.apache.org/jira/browse/FLINK-28495
> Project: Flink
>  Issue Type: Improvement
>  Components: Library / CEP
>Reporter: Biao Geng
>Priority: Minor
>
> "how you can migrate your job from an older Flink version to Flink-1.3." -> 
> "how you can migrate your job from an older Flink version to Flink-1.13."
> "Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 
> B. with combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 
> B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}" -> 
> "Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 
> B. with combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 
> B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}, {C A2 
> B}, {C A2 A3 B}, {C A2 A4 B}, {C A2 A3 A4 B}, {C A3 B}, {C A3 A4 B}, {C A4 B}"
> "For SKIP_TO_FIRST/LAST there are two options how to handle cases when there 
> are no elements mapped to the specified variable." -> "For SKIP_TO_FIRST/LAST 
> there are two options how to handle cases when there are no events mapped to 
> the *PatternName*."



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Updated] (FLINK-28495) Fix typos or mistakes of Flink CEP Document in the official website

2022-07-11 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-28495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng updated FLINK-28495:
--
Component/s: Library / CEP

> Fix typos or mistakes of Flink CEP Document in the official website
> ---
>
> Key: FLINK-28495
> URL: https://issues.apache.org/jira/browse/FLINK-28495
> Project: Flink
>  Issue Type: Improvement
>  Components: Library / CEP
>Reporter: Biao Geng
>Priority: Minor
>
> "Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 
> B. with combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 
> B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}" -> 
> "Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 
> B. with combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 
> B}, {C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}, {C A2 
> B}, {C A2 A3 B}, {C A2 A4 B}, {C A2 A3 A4 B}, {C A3 B}, {C A3 A4 B}, {C A4 B}"
> "For SKIP_TO_FIRST/LAST there are two options how to handle cases when there 
> are no elements mapped to the specified variable." -> "For SKIP_TO_FIRST/LAST 
> there are two options how to handle cases when there are no events mapped to 
> the *PatternName*."



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (FLINK-28495) Fix typos or mistakes of Flink CEP Document in the official website

2022-07-11 Thread Biao Geng (Jira)
Biao Geng created FLINK-28495:
-

 Summary: Fix typos or mistakes of Flink CEP Document in the 
official website
 Key: FLINK-28495
 URL: https://issues.apache.org/jira/browse/FLINK-28495
 Project: Flink
  Issue Type: Improvement
Reporter: Biao Geng


"Will generate the following matches for an input sequence: C D A1 A2 A3 D A4 
B. with combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 B}, 
{C A1 A2 A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}" -> "Will 
generate the following matches for an input sequence: C D A1 A2 A3 D A4 B. with 
combinations enabled: {C A1 B}, {C A1 A2 B}, {C A1 A3 B}, {C A1 A4 B}, {C A1 A2 
A3 B}, {C A1 A2 A4 B}, {C A1 A3 A4 B}, {C A1 A2 A3 A4 B}, {C A2 B}, {C A2 A3 
B}, {C A2 A4 B}, {C A2 A3 A4 B}, {C A3 B}, {C A3 A4 B}, {C A4 B}"
"For SKIP_TO_FIRST/LAST there are two options how to handle cases when there 
are no elements mapped to the specified variable." -> "For SKIP_TO_FIRST/LAST 
there are two options how to handle cases when there are no events mapped to 
the *PatternName*."



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-28443) Provide official Flink image for PyFlink

2022-07-09 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564520#comment-17564520
 ] 

Biao Geng commented on FLINK-28443:
---

This improvement is good for user experience, but one challenge may be how to 
manage the version matrix: considering 3 popular Python versions (3.7, 3.8, 3.9) 
and 3 Flink versions (1.14, 1.15, 1.16), there would need to be 3 * 3 = 9 
PyFlink images. If we also consider the Java version, the problem gets worse.
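The growth of that matrix is a plain cross product; a throwaway sketch (the tag 
scheme below is made up) shows the count:

```java
import java.util.ArrayList;
import java.util.List;

public class ImageMatrix {
    // Enumerate one image tag per (flink version, python version) pair.
    static List<String> tags(String[] flinkVersions, String[] pythonVersions) {
        List<String> out = new ArrayList<>();
        for (String f : flinkVersions)
            for (String p : pythonVersions)
                out.add("flink:" + f + "-python" + p); // hypothetical tag scheme
        return out;
    }

    public static void main(String[] args) {
        List<String> t = tags(new String[]{"1.14", "1.15", "1.16"},
                              new String[]{"3.7", "3.8", "3.9"});
        // 9 images; supporting two Java versions would double this to 18
        System.out.println(t.size());
    }
}
```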

> Provide official Flink image for PyFlink
> 
>
> Key: FLINK-28443
> URL: https://issues.apache.org/jira/browse/FLINK-28443
> Project: Flink
>  Issue Type: Improvement
>  Components: API / Python
>Affects Versions: 1.15.0
>Reporter: Xuannan Su
>Priority: Major
>
> Users must build a custom Flink image to include python and pyFlink to run a 
> pyFlink job with Native Kubernetes application mode.
> I think we can improve the user experience by providing an official image 
> that pre-installs python and pyFlink.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-28364) Python Job support for Kubernetes Operator

2022-07-08 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28364?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17564277#comment-17564277
 ] 

Biao Geng commented on FLINK-28364:
---

Following the pattern we adopted for the SQL job example, I have created a 
simple 
[example|https://github.com/bgeng777/flink-kubernetes-operator/blob/python_example/examples/flink-python-example/python-example.yaml]
 to run PyFlink jobs. As [~nicholasjiang] said, the Python demo is trivial, but 
we may need more documentation about building a PyFlink image in the operator 
repo.
[~gyfora] [~dianfu] I can take this ticket to add the PyFlink example.

> Python Job support for Kubernetes Operator
> --
>
> Key: FLINK-28364
> URL: https://issues.apache.org/jira/browse/FLINK-28364
> Project: Flink
>  Issue Type: New Feature
>  Components: Kubernetes Operator
>Reporter: wu3396
>Priority: Major
>
> *Describe the solution*
> Job types that I want to support pyflink for
> *Describe alternatives*
> like [here|https://github.com/spotify/flink-on-k8s-operator/pull/165]



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (FLINK-27009) Support SQL job submission in flink kubernetes opeartor

2022-07-05 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17562729#comment-17562729
 ] 

Biao Geng commented on FLINK-27009:
---

Hi [~gyfora], sorry for the late reply. As Yang said, I had previously created a 
[PR|https://github.com/bgeng777/flink-kubernetes-operator/tree/sql-runner-demo/flink-kubernetes-sql-runner]
 to create a SQL runner jar on the k8s operator side. The implementation is 
pretty basic: the SQL runner jar just parses the SQL statements in the yaml and 
then calls tableEnvironment.executeSql(statement) for each of them. In my PR, I 
planned to bundle it in the flink-kubernetes-operator image, so users can use it 
directly, and of course we can add some utils to make submitting SQL scripts 
easier. But after some tests, I met a problem: our flink-kubernetes-operator is 
based on Java 11, so the `sql-runner.jar` has to be compiled with Java 11, but 
our Flink image may be built on Java 8, which leads to incompatibility issues. I 
am not sure if your implementation will meet such a problem.

So after some investigation, I prefer to first do something in the upstream 
Flink project to support running SQL scripts in Application Mode, so that we can 
use the Flink SQL client as the entry class directly. Then, in our k8s operator, 
we can support SQL scripts more easily. It is somewhat similar to what we have 
done for the YARN Application Mode support of Python jobs.
I have created a [draft of Support submitting SQL scripts using Application 
Mode|https://docs.google.com/document/d/1C-o3wAgSE3TxxqaMxwIH7_Pt5w-v2PrSb6KU1vGNuE0/edit#heading=h.8rcq4dx7kcjy]
 for the upstream change, but I do agree that discussing it upstream can take an 
uncertain amount of time. If we do not want this jira to be blocked by that, we 
can discuss how to improve the SQL runner solution. Hope my previous experience 
is helpful.
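The "parse statements, then execute each one" loop described above can be 
sketched in a self-contained way (hypothetical helper, not the actual runner; a 
real runner would hand each statement to TableEnvironment.executeSql, and the 
naive split below breaks on semicolons inside string literals or comments):

```java
import java.util.ArrayList;
import java.util.List;

public class SqlRunnerSketch {
    // Naive statement splitter: good enough for scripts without
    // semicolons embedded in string literals or comments.
    static List<String> splitStatements(String script) {
        List<String> out = new ArrayList<>();
        for (String part : script.split(";")) {
            String stmt = part.trim();
            if (!stmt.isEmpty()) out.add(stmt);
        }
        return out;
    }

    public static void main(String[] args) {
        String script = "CREATE TABLE src (x INT);\nINSERT INTO snk SELECT x FROM src;";
        for (String stmt : splitStatements(script)) {
            // A real runner would call tableEnv.executeSql(stmt) here.
            System.out.println(stmt);
        }
    }
}
```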


> Support SQL job submission in flink kubernetes opeartor
> ---
>
> Key: FLINK-27009
> URL: https://issues.apache.org/jira/browse/FLINK-27009
> Project: Flink
>  Issue Type: New Feature
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Assignee: Gyula Fora
>Priority: Major
>
> Currently, the flink kubernetes opeartor is for jar job using application or 
> session cluster. For SQL job, there is no out of box solution in the 
> operator.  
> One simple and short-term solution is to wrap the SQL script into a jar job 
> using table API with limitation.
> The long-term solution may work with 
> [FLINK-26541|https://issues.apache.org/jira/browse/FLINK-26541] to achieve 
> the full support.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (FLINK-24346) Flink on yarn application mode,LinkageError

2022-07-04 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561986#comment-17561986
 ] 

Biao Geng edited comment on FLINK-24346 at 7/4/22 6:47 AM:
---

Hi [~soberchi...@gamil.com], have you ever tried to run the program with {{flink 
run-application -t yarn-application 
-Dyarn.per-job-cluster.include-user-jar=DISABLED flink-demo-1.13.3.jar}}? (The 
option name `yarn.per-job-cluster.include-user-jar` is somewhat misleading, as 
it works for application mode as well. This naming has been improved in Flink 
1.15.)
I tested your program locally, and the above command can be a workaround for the 
class loading problem.
Besides, I believe your question is truly valuable, and we may need to spend 
more time looking into what happened during the class loading process.


was (Author: bgeng777):
Hi [~soberchi...@gamil.com] have you ever tried to run the program with {{flink 
run-application -t yarn-application  
-Dyarn.per-job-cluster.include-user-jar=DISABLED  flink-demo-1.13.3.jar}} ?  
(the option name `yarn.per-job-cluster.include-user-jar` is somehow misleading 
as it can work for application mode as well. This naming has been improved in 
flink1.15)
I test your program locally and above command can be a workaround for the class 
loading problem.

> Flink on yarn application mode,LinkageError
> ---
>
> Key: FLINK-24346
> URL: https://issues.apache.org/jira/browse/FLINK-24346
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream
>Affects Versions: 1.13.1
> Environment: hadoop version 2.6.x
>Reporter: 李伟高
>Priority: Major
>
> Hello, I'm changing from per-job mode to application mode to submit tasks to 
> YARN. All jars that my task depends on are packed into my task jar. I submit 
> the task in per-job mode and it works normally, but when I change to 
> application mode it reports an error.
> {code:java}
> [0;39mjava.util.concurrent.CompletionException: 
> org.apache.flink.client.deployment.application.ApplicationExecutionException: 
> Could not execute application. at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>  ~[na:1.8.0_271] at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>  ~[na:1.8.0_271] at 
> java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:957) 
> ~[na:1.8.0_271] at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:940)
>  ~[na:1.8.0_271] at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>  ~[na:1.8.0_271] at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>  ~[na:1.8.0_271] at 
> org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.runApplicationEntryPoint(ApplicationDispatcherBootstrap.java:257)
>  ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at 
> org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$runApplicationAsync$1(ApplicationDispatcherBootstrap.java:212)
>  ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_271] at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> ~[na:1.8.0_271] at 
> org.apache.flink.runtime.concurrent.akka.ActorSystemScheduledExecutorAdapter$ScheduledFutureTask.run(ActorSystemScheduledExecutorAdapter.java:159)
>  ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at 
> akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) 
> ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
>  ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at 
> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
> ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) 
> ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at 
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
> ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>  ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] Caused by: 
> org.apache.flink.client.deployment.application.ApplicationExecutionException: 
> Could not execute application. ... 11 common frames omitted Caused by: 
> java.lang.LinkageError: loader constraint violation: loader (instance of 
> org/apache/flink/util/ChildFirstClassLoader) previously initiated loading for 
> a different type with name "org/elasticsearch/client/RestClientBuilder" at 
> java.lang.ClassLoader.defineClass1(Native Method) ~[na:1.8.0_271] at 
> 

[jira] [Commented] (FLINK-24346) Flink on yarn application mode, LinkageError

2022-07-03 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17561986#comment-17561986
 ] 

Biao Geng commented on FLINK-24346:
---

Hi [~soberchi...@gamil.com], have you ever tried to run the program with {{flink 
run-application -t yarn-application 
-Dyarn.per-job-cluster.include-user-jar=DISABLED flink-demo-1.13.3.jar}}? 
(The option name `yarn.per-job-cluster.include-user-jar` is somewhat misleading, 
as it works for application mode as well. The naming has been improved in 
Flink 1.15.)
I tested your program locally, and the above command can be a workaround for the 
class loading problem.

> Flink on yarn application mode, LinkageError
> ---
>
> Key: FLINK-24346
> URL: https://issues.apache.org/jira/browse/FLINK-24346
> Project: Flink
>  Issue Type: Bug
>  Components: API / DataStream
>Affects Versions: 1.13.1
> Environment: hadoop version 2.6.x
>Reporter: 李伟高
>Priority: Major
>
> Hello, I'm switching from per-job mode to application mode to submit tasks to 
> YARN. All jars that my task depends on are bundled into my task jar. The task 
> runs normally when submitted in per-job mode, but fails with an error in 
> application mode.
> {code:java}
> [0;39mjava.util.concurrent.CompletionException: 
> org.apache.flink.client.deployment.application.ApplicationExecutionException: 
> Could not execute application. at 
> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>  ~[na:1.8.0_271] at 
> java.util.concurrent.CompletableFuture.completeThrowable(CompletableFuture.java:308)
>  ~[na:1.8.0_271] at 
> java.util.concurrent.CompletableFuture.uniCompose(CompletableFuture.java:957) 
> ~[na:1.8.0_271] at 
> java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:940)
>  ~[na:1.8.0_271] at 
> java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488)
>  ~[na:1.8.0_271] at 
> java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990)
>  ~[na:1.8.0_271] at 
> org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.runApplicationEntryPoint(ApplicationDispatcherBootstrap.java:257)
>  ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at 
> org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$runApplicationAsync$1(ApplicationDispatcherBootstrap.java:212)
>  ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> ~[na:1.8.0_271] at java.util.concurrent.FutureTask.run(FutureTask.java:266) 
> ~[na:1.8.0_271] at 
> org.apache.flink.runtime.concurrent.akka.ActorSystemScheduledExecutorAdapter$ScheduledFutureTask.run(ActorSystemScheduledExecutorAdapter.java:159)
>  ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at 
> akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40) 
> ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)
>  ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at 
> akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260) 
> ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at 
> akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339) 
> ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at 
> akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979) 
> ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] at 
> akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>  ~[hb-bigdata-wash-location-1.0-SNAPSHOT.jar:na] Caused by: 
> org.apache.flink.client.deployment.application.ApplicationExecutionException: 
> Could not execute application. ... 11 common frames omitted Caused by: 
> java.lang.LinkageError: loader constraint violation: loader (instance of 
> org/apache/flink/util/ChildFirstClassLoader) previously initiated loading for 
> a different type with name "org/elasticsearch/client/RestClientBuilder" at 
> java.lang.ClassLoader.defineClass1(Native Method) ~[na:1.8.0_271] at 
> java.lang.ClassLoader.defineClass(ClassLoader.java:756) ~[na:1.8.0_271] at 
> java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) 
> ~[na:1.8.0_271] at 
> java.net.URLClassLoader.defineClass(URLClassLoader.java:468) ~[na:1.8.0_271] 
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74) ~[na:1.8.0_271] 
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369) ~[na:1.8.0_271] at 
> java.net.URLClassLoader$1.run(URLClassLoader.java:363) ~[na:1.8.0_271] at 
> java.security.AccessController.doPrivileged(Native Method) ~[na:1.8.0_271] at 
> java.net.URLClassLoader.findClass(URLClassLoader.java:362) ~[na:1.8.0_271] at 
> 

[jira] [Comment Edited] (FLINK-28199) Failures on YARNHighAvailabilityITCase.testClusterClientRetrieval and YARNHighAvailabilityITCase.testKillYarnSessionClusterEntrypoint

2022-06-22 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557506#comment-17557506
 ] 

Biao Geng edited comment on FLINK-28199 at 6/22/22 3:03 PM:


According to the log, the exception is thrown because the YARN application for 
testKillYarnSessionClusterEntrypoint is not stopped as expected, which in turn 
also leads to the failure of testClusterClientRetrieval.

Similarly to [~ferenc-csaky]'s analysis, IIUC, the PR for FLINK-27677 may not be 
relevant to this failure, as FLINK-27677's changes only touch the @AfterAll 
method, which should be executed after all tests have finished, while this 
failure happens after a single test.

It looks like {{killApplicationAndWait()}} may return prematurely due to a side 
effect of the previous test.


was (Author: bgeng777):
According to the log, the exception is thrown because the YARN application for 
testKillYarnSessionClusterEntrypoint is not stopped as expected, which in turn 
also leads to the failure of testClusterClientRetrieval.

IIUC, the PR for FLINK-27677 may not be relevant to this failure, as 
[~ferenc-csaky]'s changes only touch the @AfterAll method, which should be 
executed after all tests have finished, while this failure happens after a 
single test.

It looks like {{killApplicationAndWait()}} may return prematurely due to a side 
effect of the previous test.

> Failures on YARNHighAvailabilityITCase.testClusterClientRetrieval and 
> YARNHighAvailabilityITCase.testKillYarnSessionClusterEntrypoint
> -
>
> Key: FLINK-28199
> URL: https://issues.apache.org/jira/browse/FLINK-28199
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.16.0
>Reporter: Martijn Visser
>Priority: Major
>  Labels: test-stability
>
> {code:java}
> Jun 22 08:57:50 [ERROR] Errors: 
> Jun 22 08:57:50 [ERROR]   
> YARNHighAvailabilityITCase.testClusterClientRetrieval » Timeout 
> testClusterCli...
> Jun 22 08:57:50 [ERROR]   
> YARNHighAvailabilityITCase.testKillYarnSessionClusterEntrypoint:156->YarnTestBase.runTest:288->lambda$testKillYarnSessionClusterEntrypoint$0:182->waitForJobTermination:325
>  » Execution
> Jun 22 08:57:50 [INFO] 
> Jun 22 08:57:50 [ERROR] Tests run: 27, Failures: 0, Errors: 2, Skipped: 0
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=37037=logs=fc5181b0-e452-5c8f-68de-1097947f6483=995c650b-6573-581c-9ce6-7ad4cc038461=29523



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-28199) Failures on YARNHighAvailabilityITCase.testClusterClientRetrieval and YARNHighAvailabilityITCase.testKillYarnSessionClusterEntrypoint

2022-06-22 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557506#comment-17557506
 ] 

Biao Geng commented on FLINK-28199:
---

According to the log, the exception is thrown because the YARN application for 
testKillYarnSessionClusterEntrypoint is not stopped as expected, which in turn 
also leads to the failure of testClusterClientRetrieval.

IIUC, the PR for FLINK-27677 may not be relevant to this failure, as 
[~ferenc-csaky]'s changes only touch the @AfterAll method, which should be 
executed after all tests have finished, while this failure happens after a 
single test.

It looks like {{killApplicationAndWait()}} may return prematurely due to a side 
effect of the previous test.

> Failures on YARNHighAvailabilityITCase.testClusterClientRetrieval and 
> YARNHighAvailabilityITCase.testKillYarnSessionClusterEntrypoint
> -
>
> Key: FLINK-28199
> URL: https://issues.apache.org/jira/browse/FLINK-28199
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.16.0
>Reporter: Martijn Visser
>Priority: Major
>  Labels: test-stability
>
> {code:java}
> Jun 22 08:57:50 [ERROR] Errors: 
> Jun 22 08:57:50 [ERROR]   
> YARNHighAvailabilityITCase.testClusterClientRetrieval » Timeout 
> testClusterCli...
> Jun 22 08:57:50 [ERROR]   
> YARNHighAvailabilityITCase.testKillYarnSessionClusterEntrypoint:156->YarnTestBase.runTest:288->lambda$testKillYarnSessionClusterEntrypoint$0:182->waitForJobTermination:325
>  » Execution
> Jun 22 08:57:50 [INFO] 
> Jun 22 08:57:50 [ERROR] Tests run: 27, Failures: 0, Errors: 2, Skipped: 0
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=37037=logs=fc5181b0-e452-5c8f-68de-1097947f6483=995c650b-6573-581c-9ce6-7ad4cc038461=29523





[jira] [Commented] (FLINK-27667) YARNHighAvailabilityITCase fails with "Failed to delete temp directory /tmp/junit1681"

2022-06-22 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17557287#comment-17557287
 ] 

Biao Geng commented on FLINK-27667:
---

Hi [~martijnvisser], I see [~ferenc-csaky] has created the PR, which LGTM. The 
change I had in mind is exactly the same as his, so I don't need to add any 
other changes.

> YARNHighAvailabilityITCase fails with "Failed to delete temp directory 
> /tmp/junit1681"
> --
>
> Key: FLINK-27667
> URL: https://issues.apache.org/jira/browse/FLINK-27667
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.16.0
>Reporter: Martijn Visser
>Assignee: Ferenc Csaky
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.16.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=35733=logs=fc5181b0-e452-5c8f-68de-1097947f6483=995c650b-6573-581c-9ce6-7ad4cc038461=29208
>  
> {code:bash}
> May 17 08:36:22 [INFO] Results: 
> May 17 08:36:22 [INFO] 
> May 17 08:36:22 [ERROR] Errors: 
> May 17 08:36:22 [ERROR] YARNHighAvailabilityITCase » IO Failed to delete temp 
> directory /tmp/junit1681... 
> May 17 08:36:22 [INFO] 
> May 17 08:36:22 [ERROR] Tests run: 28, Failures: 0, Errors: 1, Skipped: 0 
> May 17 08:36:22 [INFO] 
> {code}
>  





[jira] [Comment Edited] (FLINK-27667) YARNHighAvailabilityITCase fails with "Failed to delete temp directory /tmp/junit1681"

2022-06-18 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1732#comment-1732
 ] 

Biao Geng edited comment on FLINK-27667 at 6/18/22 6:36 AM:


Hi [~ferenc-csaky] [~chesnay], I did some investigation and hope it can help us 
enhance the stability:

In {{{}YARNHighAvailabilityITCase{}}}, we currently override the 
{{teardown()}} method of {{YarnTestBase}} (see 
[code|https://github.com/apache/flink/blob/3ae4c6f5a48105d00807e8ce02e70d4c092cbf40/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNHighAvailabilityITCase.java#L126]
 here) and, as a result, only {{{}YARNHighAvailabilityITCase{}}}'s 
{{teardown()}} will be executed after all HA tests finish.

The above behavior may lead to a potential race condition: 
{{YARNHighAvailabilityITCase}} relies on the {{@TempDir protected static File 
tmp}} defined in {{YarnTestBase}} as the parent dir of each YARN application's 
staging dir when launching the mini YARN cluster using 
{{{}RawLocalFileSystem{}}}. According to JUnit5's 
[doc|https://junit.org/junit5/docs/5.4.1/api/org/junit/jupiter/api/io/TempDir.html],
 the TempDir will be deleted recursively when the test class has finished 
execution. But as the {{teardown()}} method of the base class is not executed, 
there is no guarantee when the mini YARN cluster will be cleaned up (e.g. 
deleting a staging dir like 
{{{}/tmp/junit1681458499635252469/.flink/application_1652775626514_0003{}}}).
As a result, when JUnit wants to delete the TempDir and happens to see the 
staging dir, it is possible that the staging dir is deleted by YARN's cleanup 
method before being deleted by JUnit, which triggers the 
{{{}java.io.IOException: Failed to delete temp directory{}}} exception.
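The shadowing mechanism described above can be sketched in plain Java (hypothetical class and field names; JUnit 5 lifecycle methods such as @AfterAll are static, so a redefinition in a subclass hides the base version rather than overriding it, and the base cleanup never runs unless invoked explicitly):

```java
// Plain-Java sketch (hypothetical class names) of the shadowing problem:
// static methods are hidden, not overridden, so a subclass "teardown()"
// does not cause the base class cleanup to run.
class YarnTestBaseSketch {
    static boolean miniClusterStopped = false;

    static void teardown() {
        // Base cleanup, e.g. stopping the mini YARN cluster and deleting
        // its staging dirs under the JUnit @TempDir.
        miniClusterStopped = true;
    }
}

class HaTestCaseSketch extends YarnTestBaseSketch {
    static void teardown() {
        // Subclass-only cleanup; the base teardown() is NOT invoked
        // implicitly by calling this method.
    }
}

public class TeardownHidingDemo {
    public static void main(String[] args) {
        // Mirrors JUnit running only the subclass lifecycle method.
        HaTestCaseSketch.teardown();
        System.out.println("miniClusterStopped=" + YarnTestBaseSketch.miniClusterStopped);
        // Prints miniClusterStopped=false: the mini cluster may still own
        // files inside the @TempDir when JUnit tries to delete it.
    }
}
```

Adding an explicit `YarnTestBaseSketch.teardown()` call inside the subclass method restores the base cleanup.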

I tried adding a call to the base class's teardown method (see 
[code|https://github.com/bgeng777/flink/commit/c4a4c8c8d4d1dafaa2875dd8d88133e38a78d438]
 here) and manually triggered the test 5 times in my own Azure pipeline 
(test [1|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=119=results],
 
[2|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=120=results],
 
[3|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=121=results],
 
[4|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=122=results],
 
[5|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=123=results]).
 All of them passed the misc tests.

 

One further question is that I am not sure why this test stays stable under 
JUnit 4. Maybe the @TempDir annotation behaves somewhat differently. 

 


was (Author: bgeng777):
Hi [~ferenc-csaky] [~chesnay], I did some investigation and hope it can help us 
enhance the stability:

In {{{}YARNHighAvailabilityITCase{}}}, we currently override the {{teardown()}} 
method of the {{YarnTestBase}} (see 
[codes|https://github.com/apache/flink/blob/3ae4c6f5a48105d00807e8ce02e70d4c092cbf40/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNHighAvailabilityITCase.java#L126]
 here) and as a result, only {{{}YARNHighAvailabilityITCase{}}}'s 
{{teardown()}} will be executed after all HA tests finish.

The above behavior may lead to potential race condition: 
{{YARNHighAvailabilityITCase}} relies on the {{@TempDir protected static File 
tmp}} defined in {{YarnTestBase}} as YARN's parent dir of staging dir of each 
YARN application to launch the mini YARN cluster using 
{{{}RawLocalFileSystem{}}}. According to JUnit5's 
[doc|https://junit.org/junit5/docs/5.4.1/api/org/junit/jupiter/api/io/TempDir.html],
 the TempDir will be deleted recursively when the test class has finished 
execution. But as the {{teardown()}} method of the base class is not executed, 
there is no guarantee when the mini YARN cluster will be cleaned up (e.g. 
deleting staging dir like 
{{{}/tmp/junit1681458499635252469/.flink/application_1652775626514_0003{}}}).
As a result, when JUnit wants to delete TempDir and it happens to see the 
staging dir, it is possible that the staging dir is deleted by YARN's cleanup 
method before being deleted by JUnit, which incurs the exception of 
{{{}java.io.IOException: Failed to delete temp directory{}}}.

I tried to add the call of the teardown method of the base class (see 
[codes|https://github.com/bgeng777/flink/commit/c4a4c8c8d4d1dafaa2875dd8d88133e38a78d438]
 here) and manually triggered the test for 5 
times(test[1|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=119=results],
 
[2|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=120=results],
 
[3|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=121=results],
 
[4|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=122=results],
 
[5|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=123=results])
 in my own azure pipeline. All of them passed the misc tests.

 

One further question is that I am not so 

[jira] [Comment Edited] (FLINK-27667) YARNHighAvailabilityITCase fails with "Failed to delete temp directory /tmp/junit1681"

2022-06-18 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1732#comment-1732
 ] 

Biao Geng edited comment on FLINK-27667 at 6/18/22 6:34 AM:


Hi [~ferenc-csaky] [~chesnay], I did some investigation and hope it can help us 
enhance the stability:

In {{{}YARNHighAvailabilityITCase{}}}, we currently override the {{teardown()}} 
method of the {{YarnTestBase}} (see 
[codes|https://github.com/apache/flink/blob/3ae4c6f5a48105d00807e8ce02e70d4c092cbf40/flink-yarn-tests/src/test/java/org/apache/flink/yarn/YARNHighAvailabilityITCase.java#L126]
 here) and as a result, only {{{}YARNHighAvailabilityITCase{}}}'s 
{{teardown()}} will be executed after all HA tests finish.

The above behavior may lead to potential race condition: 
{{YARNHighAvailabilityITCase}} relies on the {{@TempDir protected static File 
tmp}} defined in {{YarnTestBase}} as YARN's parent dir of staging dir of each 
YARN application to launch the mini YARN cluster using 
{{{}RawLocalFileSystem{}}}. According to JUnit5's 
[doc|https://junit.org/junit5/docs/5.4.1/api/org/junit/jupiter/api/io/TempDir.html],
 the TempDir will be deleted recursively when the test class has finished 
execution. But as the {{teardown()}} method of the base class is not executed, 
there is no guarantee when the mini YARN cluster will be cleaned up (e.g. 
deleting staging dir like 
{{{}/tmp/junit1681458499635252469/.flink/application_1652775626514_0003{}}}).
As a result, when JUnit wants to delete TempDir and it happens to see the 
staging dir, it is possible that the staging dir is deleted by YARN's cleanup 
method before being deleted by JUnit, which incurs the exception of 
{{{}java.io.IOException: Failed to delete temp directory{}}}.

I tried to add the call of the teardown method of the base class (see 
[codes|https://github.com/bgeng777/flink/commit/c4a4c8c8d4d1dafaa2875dd8d88133e38a78d438]
 here) and manually triggered the test for 5 
times(test[1|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=119=results],
 
[2|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=120=results],
 
[3|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=121=results],
 
[4|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=122=results],
 
[5|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=123=results])
 in my own azure pipeline. All of them passed the misc tests.

 

One further question is that I am not sure why this test stays stable under 
JUnit 4. Maybe the @TempDir annotation behaves somewhat differently. 

 


was (Author: bgeng777):
[~martijnvisser] I will do some investigation

> YARNHighAvailabilityITCase fails with "Failed to delete temp directory 
> /tmp/junit1681"
> --
>
> Key: FLINK-27667
> URL: https://issues.apache.org/jira/browse/FLINK-27667
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.16.0
>Reporter: Martijn Visser
>Assignee: Ferenc Csaky
>Priority: Critical
>  Labels: pull-request-available, test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=35733=logs=fc5181b0-e452-5c8f-68de-1097947f6483=995c650b-6573-581c-9ce6-7ad4cc038461=29208
>  
> {code:bash}
> May 17 08:36:22 [INFO] Results: 
> May 17 08:36:22 [INFO] 
> May 17 08:36:22 [ERROR] Errors: 
> May 17 08:36:22 [ERROR] YARNHighAvailabilityITCase » IO Failed to delete temp 
> directory /tmp/junit1681... 
> May 17 08:36:22 [INFO] 
> May 17 08:36:22 [ERROR] Tests run: 28, Failures: 0, Errors: 1, Skipped: 0 
> May 17 08:36:22 [INFO] 
> {code}
>  





[jira] [Commented] (FLINK-27667) YARNHighAvailabilityITCase fails with "Failed to delete temp directory /tmp/junit1681"

2022-06-17 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1732#comment-1732
 ] 

Biao Geng commented on FLINK-27667:
---

[~martijnvisser] I will do some investigation

> YARNHighAvailabilityITCase fails with "Failed to delete temp directory 
> /tmp/junit1681"
> --
>
> Key: FLINK-27667
> URL: https://issues.apache.org/jira/browse/FLINK-27667
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.16.0
>Reporter: Martijn Visser
>Assignee: Ferenc Csaky
>Priority: Critical
>  Labels: test-stability
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=35733=logs=fc5181b0-e452-5c8f-68de-1097947f6483=995c650b-6573-581c-9ce6-7ad4cc038461=29208
>  
> {code:bash}
> May 17 08:36:22 [INFO] Results: 
> May 17 08:36:22 [INFO] 
> May 17 08:36:22 [ERROR] Errors: 
> May 17 08:36:22 [ERROR] YARNHighAvailabilityITCase » IO Failed to delete temp 
> directory /tmp/junit1681... 
> May 17 08:36:22 [INFO] 
> May 17 08:36:22 [ERROR] Tests run: 28, Failures: 0, Errors: 1, Skipped: 0 
> May 17 08:36:22 [INFO] 
> {code}
>  





[jira] [Commented] (FLINK-28025) Document the change of serviceAccount in upgrading doc of k8s operator

2022-06-13 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-28025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17553605#comment-17553605
 ] 

Biao Geng commented on FLINK-28025:
---

cc [~wangyang0918] 

> Document the change of serviceAccount in upgrading doc of k8s operator
> -
>
> Key: FLINK-28025
> URL: https://issues.apache.org/jira/browse/FLINK-28025
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Priority: Minor
>  Labels: pull-request-available
>
> Since 1.0.0, we require users to specify a serviceAccount and have added 
> corresponding validation.
> According to the experience of [~czchen], we had better document this change 
> in the upgrade doc.





[jira] [Created] (FLINK-28025) Document the change of serviceAccount in upgrading doc of k8s operator

2022-06-13 Thread Biao Geng (Jira)
Biao Geng created FLINK-28025:
-

 Summary: Document the change of serviceAccount in upgrading doc of 
k8s operator
 Key: FLINK-28025
 URL: https://issues.apache.org/jira/browse/FLINK-28025
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Biao Geng


Since 1.0.0, we require users to specify a serviceAccount and have added 
corresponding validation.

According to the experience of [~czchen], we had better document this change 
in the upgrade doc.





[jira] [Commented] (FLINK-27009) Support SQL job submission in flink kubernetes operator

2022-06-05 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550265#comment-17550265
 ] 

Biao Geng commented on FLINK-27009:
---

Hi [~mbalassi], sorry for the late reply. I have finished the 
[poc|https://github.com/bgeng777/flink/commit/5b4338fe52ec343326927f0fc12f015dd22b1133]
 that directly utilizes the SQL client, but it turns out I need more time to 
write the initial draft of the FLIP. I will notify you once I finish it.

> Support SQL job submission in flink kubernetes operator
> ---
>
> Key: FLINK-27009
> URL: https://issues.apache.org/jira/browse/FLINK-27009
> Project: Flink
>  Issue Type: New Feature
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Assignee: Biao Geng
>Priority: Major
>
> Currently, the Flink Kubernetes operator targets jar jobs using application 
> or session clusters. For SQL jobs, there is no out-of-the-box solution in the 
> operator.
> One simple, short-term solution is to wrap the SQL script into a jar job 
> using the Table API, with limitations.
> The long-term solution may work with 
> [FLINK-26541|https://issues.apache.org/jira/browse/FLINK-26541] to achieve 
> full support.
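The short-term wrapping idea can be sketched as a small entry-point class that splits a SQL script into statements and hands each one to an executor. The sketch below is hypothetical (class and method names are invented, and the executor is a stubbed Consumer so the snippet runs without Flink on the classpath); in a real jar job the executor would be a Flink TableEnvironment's executeSql:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of the "wrap a SQL script into a jar job" idea:
// split the script into statements and execute them one by one.
public class SqlScriptRunner {

    // Naive splitter on ';'. A real implementation must handle semicolons
    // inside string literals and comments, one of the limitations of this
    // simple approach.
    static List<String> splitStatements(String script) {
        List<String> statements = new ArrayList<>();
        for (String part : script.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                statements.add(trimmed);
            }
        }
        return statements;
    }

    static void run(String script, Consumer<String> executeSql) {
        for (String statement : splitStatements(script)) {
            executeSql.accept(statement);
        }
    }

    public static void main(String[] args) {
        String script = "CREATE TABLE t (a INT);\nINSERT INTO sink SELECT a FROM t;";
        // Stub executor; a real job would pass tableEnv::executeSql here.
        run(script, statement -> System.out.println("would execute: " + statement));
    }
}
```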





[jira] [Commented] (FLINK-27894) Build flink-connector-hive failed using Maven@3.8.5

2022-06-03 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27894?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17546186#comment-17546186
 ] 

Biao Geng commented on FLINK-27894:
---

cc [~luoyuxia], have you ever run into this error?

> Build flink-connector-hive failed using Maven@3.8.5
> ---
>
> Key: FLINK-27894
> URL: https://issues.apache.org/jira/browse/FLINK-27894
> Project: Flink
>  Issue Type: Improvement
>Reporter: Biao Geng
>Priority: Major
>
> When I tried to build the Flink project locally with Java 8 and Maven 3.8.5, 
> I hit the following error:
> {code:java}
> [ERROR] Failed to execute goal on project flink-connector-hive_2.12: Could 
> not resolve dependencies for project 
> org.apache.flink:flink-connector-hive_2.12:jar:1.16-SNAPSHOT: Failed to 
> collect dependencies at org.apache.hive:hive-exec:jar:2.3.9 -> 
> org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde: Failed to read 
> artifact descriptor for 
> org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde: Could not transfer 
> artifact org.pentaho:pentaho-aggdesigner-algorithm:pom:5.1.5-jhyde from/to 
> maven-default-http-blocker (http://0.0.0.0/): Blocked mirror for 
> repositories: [repository.jboss.org 
> (http://repository.jboss.org/nexus/content/groups/public/, default, 
> disabled), conjars (http://conjars.org/repo, default, releases+snapshots), 
> apache.snapshots (http://repository.apache.org/snapshots, default, 
> snapshots)] -> [Help 1]
> {code}
> After some investigation, the reason may be that Maven, since 3.8.1, blocks 
> repositories using the "http" protocol by default. Following [NIFI-8398], one 
> possible solution is to add 
> {code:xml}
> <repositories>
>     <repository>
>         <id>conjars</id>
>         <url>https://conjars.org/repo</url>
>     </repository>
> </repositories>
> {code}
>  to the pom.xml of the flink-connector-hive module.





[jira] [Created] (FLINK-27894) Build flink-connector-hive failed using Maven@3.8.5

2022-06-03 Thread Biao Geng (Jira)
Biao Geng created FLINK-27894:
-

 Summary: Build flink-connector-hive failed using Maven@3.8.5
 Key: FLINK-27894
 URL: https://issues.apache.org/jira/browse/FLINK-27894
 Project: Flink
  Issue Type: Improvement
Reporter: Biao Geng


When I tried to build the Flink project locally with Java 8 and Maven 3.8.5, I 
hit the following error:


{code:java}
[ERROR] Failed to execute goal on project flink-connector-hive_2.12: Could not 
resolve dependencies for project 
org.apache.flink:flink-connector-hive_2.12:jar:1.16-SNAPSHOT: Failed to collect 
dependencies at org.apache.hive:hive-exec:jar:2.3.9 -> 
org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde: Failed to read 
artifact descriptor for 
org.pentaho:pentaho-aggdesigner-algorithm:jar:5.1.5-jhyde: Could not transfer 
artifact org.pentaho:pentaho-aggdesigner-algorithm:pom:5.1.5-jhyde from/to 
maven-default-http-blocker (http://0.0.0.0/): Blocked mirror for repositories: 
[repository.jboss.org 
(http://repository.jboss.org/nexus/content/groups/public/, default, disabled), 
conjars (http://conjars.org/repo, default, releases+snapshots), 
apache.snapshots (http://repository.apache.org/snapshots, default, snapshots)] 
-> [Help 1]

{code}

After some investigation, the reason may be that Maven, since 3.8.1, blocks 
repositories using the "http" protocol by default. Following [NIFI-8398], one 
possible solution is to add 
{code:xml}
<repositories>
    <repository>
        <id>conjars</id>
        <url>https://conjars.org/repo</url>
    </repository>
</repositories>
{code}
 to the pom.xml of the flink-connector-hive module.






[jira] [Commented] (FLINK-27009) Support SQL job submission in flink kubernetes operator

2022-05-27 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17543405#comment-17543405
 ] 

Biao Geng commented on FLINK-27009:
---

Hi [~mbalassi], thanks for following this. I have already started working on 
it. After some offline discussion with [~wangyang0918] and [~dianfu], we have 
some alternative solutions and now prefer to do some work on 
[FLINK-26541|https://issues.apache.org/jira/browse/FLINK-26541] first. I am 
doing some basic verification work to make sure the solution is feasible for 
our k8s operator, and once it is ready, I will present the full FLIP for the 
community to discuss. Based on my current progress, I hope to propose it by 
next Friday.

> Support SQL job submission in flink kubernetes operator
> ---
>
> Key: FLINK-27009
> URL: https://issues.apache.org/jira/browse/FLINK-27009
> Project: Flink
>  Issue Type: New Feature
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Priority: Major
>
> Currently, the Flink Kubernetes operator targets jar jobs using application 
> or session clusters. For SQL jobs, there is no out-of-the-box solution in the 
> operator.
> One simple, short-term solution is to wrap the SQL script into a jar job 
> using the Table API, with limitations.
> The long-term solution may work with 
> [FLINK-26541|https://issues.apache.org/jira/browse/FLINK-26541] to achieve 
> full support.





[jira] [Commented] (FLINK-27615) Document how to define namespaceSelector for k8s operator's webhook for different k8s versions

2022-05-18 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17538690#comment-17538690
 ] 

Biao Geng commented on FLINK-27615:
---

[~wangyang0918] I'd like to take this ticket to call out the k8s version 
requirement and add some examples of customized namespace selectors. Could you 
assign it to me?

> Document how to define namespaceSelector for k8s operator's webhook for 
> different k8s versions
> --
>
> Key: FLINK-27615
> URL: https://issues.apache.org/jira/browse/FLINK-27615
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Priority: Major
>
> In our webhook, to support {{{}watchNamespaces{}}}, we rely on the 
> {{kubernetes.io/metadata.name}} label to filter the validation requests. 
> However, this label is automatically added to a namespace only since k8s 
> 1.21.1, per k8s' [release 
> notes|https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md#v1211].
>  If users run the flink k8s operator on older k8s versions, they have to 
> add such a label by themselves to support the namespaceSelector feature in 
> our webhook.
> As a result, if we want to support the feature by default, we may need to 
> emphasize that users should use k8s >= v1.21.1 to run the flink k8s operator.
>  
> Following aitozi's advice, it may be better to document how to support 
> namespaceSelector-based validation instead of limiting the k8s version to 
> >= 1.21.1, so as to support more users.





[jira] [Updated] (FLINK-27615) Document how to define namespaceSelector for k8s operator's webhook for different k8s versions

2022-05-14 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-27615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng updated FLINK-27615:
--
Summary: Document how to define namespaceSelector for k8s operator's 
webhook for different k8s versions  (was: Document the how to define 
namespaceSelector for k8s operator's webhook)

> Document how to define namespaceSelector for k8s operator's webhook for 
> different k8s versions
> --
>
> Key: FLINK-27615
> URL: https://issues.apache.org/jira/browse/FLINK-27615
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Priority: Major
>





[jira] [Comment Edited] (FLINK-27615) Document the how to define namespaceSelector for k8s operator's webhook

2022-05-14 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17537012#comment-17537012
 ] 

Biao Geng edited comment on FLINK-27615 at 5/14/22 10:29 AM:
-

[~aitozi] I think your proposal makes sense.

I checked the release notes of k8s and noticed that 1.21.1 was released only 
a year ago. It looks like k8s iterates much more quickly than flink. IIUC, 
users, especially those who use k8s in production, will not update k8s to new 
versions frequently. 
I was doing my tests on k8s 1.20.4. So far, besides the nameSelector 
feature, I have not met other issues. To support more k8s versions, I agree 
with you on making this feature non-default, with better documentation.


was (Author: bgeng777):
[~aitozi] I think your proposal makes a lot of sense.

I checked the release notes of k8s and just noticed that the 1.21.1 was just 
released 1 year ago. K8s iterates much more quickly than flink. IIUC, users, 
especially those use k8s in production, will not update k8s to new versions 
frequently. 
I was doing my tests on k8s 1.20.4. Currently, besides the nameSelector 
feature, I have not met other issues. To support more k8s versions, I agree 
with you on makeing this feature non-default with better documents 

> Document the how to define namespaceSelector for k8s operator's webhook
> ---
>
> Key: FLINK-27615
> URL: https://issues.apache.org/jira/browse/FLINK-27615
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Priority: Major
>





[jira] [Updated] (FLINK-27615) Document the how to define namespaceSelector for k8s operator's webhook

2022-05-14 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-27615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng updated FLINK-27615:
--
Summary: Document the how to define namespaceSelector for k8s operator's 
webhook  (was: Document the how to define )

> Document the how to define namespaceSelector for k8s operator's webhook
> ---
>
> Key: FLINK-27615
> URL: https://issues.apache.org/jira/browse/FLINK-27615
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Priority: Major
>





[jira] [Updated] (FLINK-27615) Document the how to define

2022-05-14 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-27615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng updated FLINK-27615:
--
Summary: Document the how to define   (was: Document the minimum supported 
version of k8s for flink k8s operator)

> Document the how to define 
> ---
>
> Key: FLINK-27615
> URL: https://issues.apache.org/jira/browse/FLINK-27615
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Priority: Major
>





[jira] [Updated] (FLINK-27615) Document the minimum supported version of k8s for flink k8s operator

2022-05-14 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-27615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng updated FLINK-27615:
--
Description: 
In our webhook, to support {{watchNamespaces}}, we rely on the 
{{kubernetes.io/metadata.name}} label to filter the validation requests. However, 
this label is automatically added to a namespace only since k8s 1.21.1, per 
k8s' [release 
notes|https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md#v1211].
If users run the flink k8s operator on older k8s versions, they have to add 
such a label themselves to support the namespaceSelector feature in our 
webhook.

As a result, if we want to support the feature by default, we may need to 
emphasize that users should use k8s >= v1.21.1 to run the flink k8s operator.

 

Per aitozi's advice, it may be better to document how to support 
namespaceSelector-based validation instead of limiting the k8s version to 
>= 1.21.1, so that more users are supported.

  was:
In our webhook, to support {{{}watchNamespaces{}}}, we rely on the 
{{kubernetes.io/metadata.name}} to filter the validation requests. However, 
this label will be automatically added to a namespace only since k8s 1.21.1 due 
to k8s' [release 
notes|https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md#v1211]
 .  If users run the flink k8s operator on older k8s versions, they have to add 
such label by themselevs to support the feature of namespaceSelector in our 
webhook.

As a result, if we want to support the feature defaultly, we may need to 
emphasize that users should use k8s >= v1.21.1  to run the flink k8s operator.


> Document the minimum supported version of k8s for flink k8s operator
> 
>
> Key: FLINK-27615
> URL: https://issues.apache.org/jira/browse/FLINK-27615
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Priority: Major
>





[jira] [Commented] (FLINK-27615) Document the minimum supported version of k8s for flink k8s operator

2022-05-14 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27615?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17537012#comment-17537012
 ] 

Biao Geng commented on FLINK-27615:
---

[~aitozi] I think your proposal makes a lot of sense.

I checked the release notes of k8s and noticed that 1.21.1 was released only 
a year ago. K8s iterates much more quickly than flink. IIUC, users, 
especially those who use k8s in production, will not update k8s to new 
versions frequently. 
I was doing my tests on k8s 1.20.4. So far, besides the nameSelector 
feature, I have not met other issues. To support more k8s versions, I agree 
with you on making this feature non-default, with better documentation.

> Document the minimum supported version of k8s for flink k8s operator
> 
>
> Key: FLINK-27615
> URL: https://issues.apache.org/jira/browse/FLINK-27615
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Priority: Major
>
> In our webhook, to support {{watchNamespaces}}, we rely on the 
> {{kubernetes.io/metadata.name}} label to filter the validation requests. However, 
> this label is automatically added to a namespace only since k8s 1.21.1, per 
> k8s' [release 
> notes|https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md#v1211].
> If users run the flink k8s operator on older k8s versions, they have to 
> add such a label themselves to support the namespaceSelector feature in 
> our webhook.
> As a result, if we want to support the feature by default, we may need to 
> emphasize that users should use k8s >= v1.21.1 to run the flink k8s operator.





[jira] [Updated] (FLINK-27615) Document the minimum supported version of k8s for flink k8s operator

2022-05-14 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-27615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng updated FLINK-27615:
--
Description: 
In our webhook, to support {{watchNamespaces}}, we rely on the 
{{kubernetes.io/metadata.name}} label to filter the validation requests. However, 
this label is automatically added to a namespace only since k8s 1.21.1, per 
k8s' [release 
notes|https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md#v1211].
If users run the flink k8s operator on older k8s versions, they have to add 
such a label themselves to support the namespaceSelector feature in our 
webhook.

As a result, if we want to support the feature by default, we may need to 
emphasize that users should use k8s >= v1.21.1 to run the flink k8s operator.

  was:
In our webhook, to support {{{}watchNamespaces{}}}, we rely on the 
{{kubernetes.io/metadata.name}} to filter the validation requests. However, 
this label will be automatically added to a namespace only since k8s 1.21.1 due 
to k8s' [release 
notes|[https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md#v1211]
 .  If users run the flink k8s operator on older k8s versions, they have add 
such label by themselevs to support the feature of namespaceSelector in our 
webhook.

As a result, if we want to support the feature defaultly, we may need to 
emphasize that users should use k8s >= v1.21.1  to run the flink k8s operator.


> Document the minimum supported version of k8s for flink k8s operator
> 
>
> Key: FLINK-27615
> URL: https://issues.apache.org/jira/browse/FLINK-27615
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Priority: Major
>
> In our webhook, to support {{watchNamespaces}}, we rely on the 
> {{kubernetes.io/metadata.name}} label to filter the validation requests. However, 
> this label is automatically added to a namespace only since k8s 1.21.1, per 
> k8s' [release 
> notes|https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md#v1211].
> If users run the flink k8s operator on older k8s versions, they have to 
> add such a label themselves to support the namespaceSelector feature in 
> our webhook.
> As a result, if we want to support the feature by default, we may need to 
> emphasize that users should use k8s >= v1.21.1 to run the flink k8s operator.





[jira] [Created] (FLINK-27615) Document the minimum supported version of k8s for flink k8s operator

2022-05-14 Thread Biao Geng (Jira)
Biao Geng created FLINK-27615:
-

 Summary: Document the minimum supported version of k8s for flink 
k8s operator
 Key: FLINK-27615
 URL: https://issues.apache.org/jira/browse/FLINK-27615
 Project: Flink
  Issue Type: Improvement
Reporter: Biao Geng


In our webhook, to support {{watchNamespaces}}, we rely on the 
{{kubernetes.io/metadata.name}} label to filter the validation requests. However, 
this label is automatically added to a namespace only since k8s 1.21.1, per 
k8s' [release 
notes|https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md#v1211].
If users run the flink k8s operator on older k8s versions, they have to add 
such a label themselves to support the namespaceSelector feature in our 
webhook.

As a result, if we want to support the feature by default, we may need to 
emphasize that users should use k8s >= v1.21.1 to run the flink k8s operator.





[jira] [Updated] (FLINK-27615) Document the minimum supported version of k8s for flink k8s operator

2022-05-14 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-27615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng updated FLINK-27615:
--
Component/s: Kubernetes Operator

> Document the minimum supported version of k8s for flink k8s operator
> 
>
> Key: FLINK-27615
> URL: https://issues.apache.org/jira/browse/FLINK-27615
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Priority: Major
>
> In our webhook, to support {{watchNamespaces}}, we rely on the 
> {{kubernetes.io/metadata.name}} label to filter the validation requests. However, 
> this label is automatically added to a namespace only since k8s 1.21.1, per 
> k8s' [release 
> notes|https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG/CHANGELOG-1.21.md#v1211].
> If users run the flink k8s operator on older k8s versions, they have to 
> add such a label themselves to support the namespaceSelector feature in 
> our webhook.
> As a result, if we want to support the feature by default, we may need to 
> emphasize that users should use k8s >= v1.21.1 to run the flink k8s operator.





[jira] [Comment Edited] (FLINK-27329) Add default value of replica of JM pod and not declare it in example yamls

2022-05-11 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17535845#comment-17535845
 ] 

Biao Geng edited comment on FLINK-27329 at 5/12/22 3:35 AM:


[~wangyang0918] I see your point that if we set default values for some fields 
(e.g. parallelism) in the operator, we cannot adopt flink's default values, as 
our default values in the *Spec* will be used instead. 

For parallelism and cpu, I will not add default values in our k8s operator. 

Besides, it is worth mentioning that in our FlinkConfigBuilder, we verify that 
parallelism is larger than 0, so we do not need to worry about it. But for 
setResource, we do not handle the cases where cpu or memory is null. As a 
result, a zero cpu value may be set. I improved it a little in the 
[PR|https://github.com/apache/flink-kubernetes-operator/pull/204].
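The null-handling concern above can be sketched with a small hypothetical helper (the class, field, and method names are illustrative, not the operator's actual API): distinguish "unset" from an explicit value via boxed types, and fall back to a documented default instead of silently passing 0 to flink.

```java
// Hypothetical sketch of null-safe resource defaulting; names are
// illustrative and do not mirror the real FlinkConfigBuilder/setResource code.
public class ResourceDefaults {
    static final double DEFAULT_CPU = 1.0;       // assumed to match upstream flink's JM default
    static final String DEFAULT_MEMORY = "1600m"; // assumed default from flink-conf.yaml

    // A boxed Double distinguishes "unset" (null) from an explicit 0.
    static double effectiveCpu(Double cpu) {
        if (cpu != null && cpu <= 0) {
            throw new IllegalArgumentException("cpu must be positive: " + cpu);
        }
        return cpu == null ? DEFAULT_CPU : cpu;
    }

    static String effectiveMemory(String memory) {
        return (memory == null || memory.isEmpty()) ? DEFAULT_MEMORY : memory;
    }

    public static void main(String[] args) {
        System.out.println(effectiveCpu(null));     // falls back to the default
        System.out.println(effectiveCpu(2.0));      // explicit value wins
        System.out.println(effectiveMemory(null));  // falls back to the default
    }
}
```

With this shape, an unset cpu never reaches flink as 0; an explicitly invalid value fails fast instead.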


was (Author: bgeng777):
[~wangyang0918] I see your point that if we set default value of some fields 
(e.g. parallism) in the operator, we cannot adopt flink's default value as our 
default value in the *{*}Spec{*} will be used. 

For parallism and cpu, l will not add default value in our k8s operator. 

Besides, it is worthwhile to mention that in our FlinkConfigBuilder, we will 
verify if parallism is larger than 0, so we do not need to worry about it. But 
for setResource, we do not handle cases when cpu or memory is null. I improve 
it a little in the 
[PR|https://github.com/apache/flink-kubernetes-operator/pull/204].

> Add default value of replica of JM pod and not declare it in example yamls
> --
>
> Key: FLINK-27329
> URL: https://issues.apache.org/jira/browse/FLINK-27329
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Assignee: Biao Geng
>Priority: Critical
> Fix For: kubernetes-operator-1.0.0
>
>
> Currently, we do not explicitly set the default value of `replica` in 
> `JobManagerSpec`. As a result, Java sets the default value to zero. 
> Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` 
> to be 1. 
> After a deeper look while debugging the exception thrown in FLINK-27310, we 
> find it would be better to set the default value to 1 for the `replica` field 
> and remove the declaration in the examples, for the following reasons:
> 1. A normal Session or Application cluster should have at least one JM. The 
> current default value, zero, does not match the common case.
> 2. One JM works for k8s HA mode as well, and if users really want to launch 
> a standby JM for faster recovery, they can declare the value of the `replica` 
> field in the yaml file. In the examples, we just use the new default value 
> (i.e. 1), which should be fine.
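The root cause described above — Java zero-initializing an unset primitive field — and the proposed fix can be illustrated with a minimal POJO (a sketch only; `SpecWithoutDefault`/`SpecWithDefault` are hypothetical stand-ins, not the operator's real `JobManagerSpec`):

```java
// Sketch: an int field that is never populated from yaml stays at Java's
// default (0); a field initializer supplies the intended default instead.
public class ReplicaDefaultDemo {
    static class SpecWithoutDefault {
        int replicas;            // never set during deserialization -> 0
    }

    static class SpecWithDefault {
        int replicas = 1;        // explicit default: at least one JobManager
    }

    public static void main(String[] args) {
        System.out.println(new SpecWithoutDefault().replicas); // prints 0
        System.out.println(new SpecWithDefault().replicas);    // prints 1
    }
}
```

This is why moving the default into the spec class lets the examples drop the explicit `replica: 1` declaration.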





[jira] [Commented] (FLINK-27329) Add default value of replica of JM pod and not declare it in example yamls

2022-05-11 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17535845#comment-17535845
 ] 

Biao Geng commented on FLINK-27329:
---

[~wangyang0918] I see your point that if we set default values for some fields 
(e.g. parallelism) in the operator, we cannot adopt flink's default values, as 
our default values in the *Spec* will be used instead. 

For parallelism and cpu, I will not add default values in our k8s operator. 

Besides, it is worth mentioning that in our FlinkConfigBuilder, we verify that 
parallelism is larger than 0, so we do not need to worry about it. But for 
setResource, we do not handle the cases where cpu or memory is null. I improved 
it a little in the 
[PR|https://github.com/apache/flink-kubernetes-operator/pull/204].

> Add default value of replica of JM pod and not declare it in example yamls
> --
>
> Key: FLINK-27329
> URL: https://issues.apache.org/jira/browse/FLINK-27329
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Assignee: Biao Geng
>Priority: Critical
> Fix For: kubernetes-operator-1.0.0
>
>





[jira] [Comment Edited] (FLINK-27329) Add default value of replica of JM pod and not declare it in example yamls

2022-05-11 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534972#comment-17534972
 ] 

Biao Geng edited comment on FLINK-27329 at 5/11/22 3:42 PM:


Hi [~wangyang0918] [~gyfora], I revisited the default values in our *Spec*s 
and summarized them in the table below. 
IMO, most of them work well with a `null` default value, but besides 
JobManagerSpec#replicas, there are some fields that I believe we can improve:
 # *JobManagerSpec#resource#cpu & TaskManagerSpec#resource#cpu:* the current 
default value is 0, which is not consistent with upstream flink. In my mind, 
changing the default value to 1.0 is better because: a) for JM, if users do not 
specify it explicitly, flink will use 1.0; b) for TM, if users do not specify it 
explicitly, flink will use NUM_TASK_SLOTS, whose default value is 1 as well in 
flink's default flink-conf.yaml.
 # *JobSpec#parallelism:* the current default value is 0, which is illegal. But 
I am not sure if it is possible/proper for us to read the value of 
parallelism.default in flink-conf.yaml in the *JobSpec* constructor. I tend to 
leave it as it is or use `1` as the default.
 # *FlinkDeploymentSpec#serviceAccount:* the current default value is null and, 
as a result, if we do not specify it, flink will use the `default` service 
account. It can be problematic as we use `flink` as the default in our helm 
chart's values.yaml. Maybe `flink` can be a good candidate default value.

| |Default Value in Upstream Flink|Current Default Value in k8s Operator|
|FlinkDeploymentSpec#imagePullPolicy|KubernetesConfigOptions.ImagePullPolicy.IfNotPresent|null|
|FlinkDeploymentSpec#image|KubernetesConfigOptions#getDefaultFlinkImage()|null|
|*FlinkDeploymentSpec#serviceAccount*|"default"|null|
|FlinkDeploymentSpec#flinkVersion|not exist|null|
|FlinkDeploymentSpec#IngressSpec|not exist|null|
|FlinkDeploymentSpec#podTemplate|no default value|null|
|*JobManagerSpec#replicas*|not exist|0|
|JobManagerSpec#resource#memory|1600M (defined in flink-conf.yaml)|null|
|*JobManagerSpec#resource#cpu*|1.0|0|
|JobManagerSpec#podTemplate|no default value|null|
|TaskManagerSpec#resource#memory|1728m (defined in flink-conf.yaml)|null|
|*TaskManagerSpec#resource#cpu*|NUM_TASK_SLOTS (whose default value is 1 in flink-conf.yaml)|0|
|TaskManagerSpec#podTemplate|no default value|null|
|JobSpec#jarURI|no default value|null|
|*JobSpec#parallelism*|parallelism.default in flink-conf.yaml|0|
|JobSpec#entryClass|no default value|null|
|JobSpec#args|no default value|String[0]|
|JobSpec#state|not exist|JobState.RUNNING|
|JobSpec#savepointTriggerNonce|not exist|null|
|JobSpec#initialSavepointPath|not exist|null|
|JobSpec#upgradeMode|not exist|UpgradeMode.STATELESS|
|JobSpec#allowNonRestoredState|not exist|null|


was (Author: bgeng777):
Hi [~wangyang0918] [~gyfora], I revisit the default values in our *{*}Spec{*}*s 
and summarize them in the bottom table. 
IMO, most of them work well with `null` default value, but besides 
JobManagerSpec#replicas,  there are some fields that I believe we can improve:
 # *JobManagerSpec#resource#cpu &* *TaskManagerSpec#resource#cpu:* current 
default value is 0 which is not consistent with upstream flink. In my mind, 
changing the default value to 1.0 is better for: a) for JM, if users not 
specify it explictly, flink will use 1.0; b) for TM, if users not specify it 
explictly, flink will use NUM_TASK_SLOTS, whose default value is 1 as well in 
flink's default flink-conf.yaml{*}{*}
 # *JobSpec#parallelism:* current default value is 0, which is illegal. But I 
am not sure if is possible/proper for us to read the value of 
parallelism.default in flink-conf.yaml in the *JobSpec* constructor. I tend to 
leave it as it is or use `1` as default.{*}{*}
 # *FlinkDeploymentSpec#serviceAccount:* current default value is null and as a 
result, if we do not specify it, flink will use `default` service account. It 
can be problematic as we use `flink` as default in our helm chart's 
values.yaml. I am not so sure why we expose it as the first class field, but 
maybe `flink` can be a good candidate default value.


| |Default Value in Upstream Flink|Current Default Value in k8s Operator|
|FlinkDeploymentSpec#imagePullPolicy|KubernetesConfigOptions.ImagePullPolicy.IfNotPresent|null|
|FlinkDeploymentSpec#image|KubernetesConfigOptions#getDefaultFlinkImage()|null|
|*FlinkDeploymentSpec#serviceAccount*|"default"|null|
|FlinkDeploymentSpec#flinkVersion|not exist|null|

[jira] [Comment Edited] (FLINK-27329) Add default value of replica of JM pod and not declare it in example yamls

2022-05-11 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534972#comment-17534972
 ] 

Biao Geng edited comment on FLINK-27329 at 5/11/22 3:41 PM:


Hi [~wangyang0918] [~gyfora], I revisited the default values in our *Spec*s 
and summarized them in the table below. 
IMO, most of them work well with a `null` default value, but besides 
JobManagerSpec#replicas, there are some fields that I believe we can improve:
 # *JobManagerSpec#resource#cpu & TaskManagerSpec#resource#cpu:* the current 
default value is 0, which is not consistent with upstream flink. In my mind, 
changing the default value to 1.0 is better because: a) for JM, if users do not 
specify it explicitly, flink will use 1.0; b) for TM, if users do not specify it 
explicitly, flink will use NUM_TASK_SLOTS, whose default value is 1 as well in 
flink's default flink-conf.yaml.
 # *JobSpec#parallelism:* the current default value is 0, which is illegal. But 
I am not sure if it is possible/proper for us to read the value of 
parallelism.default in flink-conf.yaml in the *JobSpec* constructor. I tend to 
leave it as it is or use `1` as the default.
 # *FlinkDeploymentSpec#serviceAccount:* the current default value is null and, 
as a result, if we do not specify it, flink will use the `default` service 
account. It can be problematic as we use `flink` as the default in our helm 
chart's values.yaml. I am not so sure why we expose it as a first-class field, 
but maybe `flink` can be a good candidate default value.


| |Default Value in Upstream Flink|Current Default Value in k8s Operator|
|FlinkDeploymentSpec#imagePullPolicy|KubernetesConfigOptions.ImagePullPolicy.IfNotPresent|null|
|FlinkDeploymentSpec#image|KubernetesConfigOptions#getDefaultFlinkImage()|null|
|*FlinkDeploymentSpec#serviceAccount*|"default"|null|
|FlinkDeploymentSpec#flinkVersion|not exist|null|
|FlinkDeploymentSpec#IngressSpec|not exist|null|
|FlinkDeploymentSpec#podTemplate|no default value|null|
|*JobManagerSpec#replicas*|not exist|0|
|JobManagerSpec#resource#memory|1600M (defined in flink-conf.yaml)|null|
|*JobManagerSpec#resource#cpu*|1.0|0|
|JobManagerSpec#podTemplate|no default value|null|
|TaskManagerSpec#resource#memory|1728m (defined in flink-conf.yaml)|null|
|*TaskManagerSpec#resource#cpu*|NUM_TASK_SLOTS (whose default value is 1 in flink-conf.yaml)|0|
|TaskManagerSpec#podTemplate|no default value|null|
|JobSpec#jarURI|no default value|null|
|*JobSpec#parallelism*|parallelism.default in flink-conf.yaml|0|
|JobSpec#entryClass|no default value|null|
|JobSpec#args|no default value|String[0]|
|JobSpec#state|not exist|JobState.RUNNING|
|JobSpec#savepointTriggerNonce|not exist|null|
|JobSpec#initialSavepointPath|not exist|null|
|JobSpec#upgradeMode|not exist|UpgradeMode.STATELESS|
|JobSpec#allowNonRestoredState|not exist|null|

[jira] [Commented] (FLINK-27329) Add default value of replica of JM pod and not declare it in example yamls

2022-05-11 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534972#comment-17534972
 ] 

Biao Geng commented on FLINK-27329:
---

Hi [~wangyang0918] [~gyfora], I revisited the default values in our *Spec*s 
and summarized them in the table below. 
IMO, most of them work well with a `null` default value, but besides 
JobManagerSpec#replicas, there are some fields that I believe we can improve:
 # *JobManagerSpec#resource#cpu & TaskManagerSpec#resource#cpu:* the current 
default value is 0, which is not consistent with upstream flink. In my mind, 
changing the default value to 1.0 is better because: a) for JM, if users do not 
specify it explicitly, flink will use 1.0; b) for TM, if users do not specify it 
explicitly, flink will use NUM_TASK_SLOTS, whose default value is 1 as well in 
flink's default flink-conf.yaml.
 # *JobSpec#parallelism:* the current default value is 0, which is illegal. But 
I am not sure if it is possible/proper for us to read the value of 
parallelism.default in flink-conf.yaml in the *JobSpec* constructor. I tend to 
leave it as it is or use `1` as the default.
 # *FlinkDeploymentSpec#serviceAccount:* the current default value is null and, 
as a result, if we do not specify it, flink will use the `default` service 
account. It can be problematic as we use `flink` as the default in our helm 
chart's values.yaml. I am not so sure why we expose it as a first-class field, 
but maybe `flink` can be a good candidate default value.

 
| |Default Value in Upstream Flink Native K8s|Current Default Value in K8s Operator|
|FlinkDeploymentSpec#imagePullPolicy|KubernetesConfigOptions.ImagePullPolicy.IfNotPresent|null|
|FlinkDeploymentSpec#image|KubernetesConfigOptions#getDefaultFlinkImage()|null|
|*FlinkDeploymentSpec#serviceAccount*|"default"|null|
|FlinkDeploymentSpec#flinkVersion|not exist|null|
|FlinkDeploymentSpec#IngressSpec|not exist|null|
|FlinkDeploymentSpec#podTemplate|no default value|null|
|*JobManagerSpec#replicas*|1|0|
|JobManagerSpec#resource#memory|1600M (defined in flink-conf.yaml)|null|
|*JobManagerSpec#resource#cpu*|1.0|0|
|JobManagerSpec#podTemplate|no default value|null|
|TaskManagerSpec#resource#memory|1728m (defined in flink-conf.yaml)|null|
|*TaskManagerSpec#resource#cpu*|NUM_TASK_SLOTS (whose default value is 1 in flink-conf.yaml)|0|
|TaskManagerSpec#podTemplate|no default value|null|
|JobSpec#jarURI|no default value|null|
|*JobSpec#parallelism*|parallelism.default in flink-conf.yaml|0|
|JobSpec#entryClass|no default value|null|
|JobSpec#args|no default value|String[0]|
|JobSpec#state|not exist|JobState.RUNNING|
|JobSpec#savepointTriggerNonce|not exist|null|
|JobSpec#initialSavepointPath|not exist|null|
|JobSpec#upgradeMode|not exist|UpgradeMode.STATELESS|
|JobSpec#allowNonRestoredState|not exist|null|
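One way to reconcile the `0` defaults called out above with upstream Flink's behaviour is to treat `0` as "unset" and fall back to the upstream default. A minimal sketch of that idea (the helper name `effectiveCpu` is hypothetical, not operator API):

```java
public class CpuDefaultDemo {
    // Hypothetical helper: a cpu of 0 (Java's primitive default for an
    // unset field) is interpreted as "not configured" and replaced by
    // upstream Flink's default of 1.0.
    static double effectiveCpu(double specCpu) {
        return specCpu > 0.0 ? specCpu : 1.0;
    }

    public static void main(String[] args) {
        System.out.println(effectiveCpu(0.0)); // unset -> falls back to 1.0
        System.out.println(effectiveCpu(2.5)); // user-provided value wins
    }
}
```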

> Add default value of replica of JM pod and not declare it in example yamls
> --
>
> Key: FLINK-27329
> URL: https://issues.apache.org/jira/browse/FLINK-27329
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Assignee: Biao Geng
>Priority: Critical
> Fix For: kubernetes-operator-1.0.0
>
>
> Currently, we do not explicitly set the default value of `replica` in 
> `JobManagerSpec`. As a result, Java sets the default value to be zero. 
> Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` 
> to be 1. 
> After a deeper look when debugging the exception thrown in FLINK-27310, we 
> find it would be better to set the default value to 1 for the `replica` field 
> and remove the declaration in examples due to following reasons:
> 1. A normal Session or Application cluster should have at least one JM. The 
> current default value, zero, does not follow the common case.
> 2. One JM can work for k8s HA mode as well and if users really want to launch 
> a standby JM for faster recovery, they can declare the value of `replica` 
> field in the yaml file. In examples, we just use the new default value(i.e. 
> 1), which should be fine.
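The root cause described above, an unset primitive field silently becoming 0, can be sketched in plain Java (the nested class names here are simplified stand-ins for the real `JobManagerSpec`):

```java
public class ReplicaDefaultDemo {
    // Stand-in for the operator's JobManagerSpec: a primitive int field
    // with no initializer is zero-initialized by Java.
    static class SpecWithoutDefault {
        int replicas;
    }

    // The proposed fix: give the field an explicit default of 1 so a
    // spec that omits `replicas` still describes one JobManager.
    static class SpecWithDefault {
        int replicas = 1;
    }

    public static void main(String[] args) {
        System.out.println(new SpecWithoutDefault().replicas); // prints 0
        System.out.println(new SpecWithDefault().replicas);    // prints 1
    }
}
```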



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27329) Add default value of replica of JM pod and not declare it in example yamls

2022-05-10 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17534341#comment-17534341
 ] 

Biao Geng commented on FLINK-27329:
---

Yes, I will create a PR in the next 2 days.

Sorry for the delay caused by other offline tasks.

> Add default value of replica of JM pod and not declare it in example yamls
> --
>
> Key: FLINK-27329
> URL: https://issues.apache.org/jira/browse/FLINK-27329
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Assignee: Biao Geng
>Priority: Critical
> Fix For: kubernetes-operator-1.0.0
>
>
> Currently, we do not explicitly set the default value of `replica` in 
> `JobManagerSpec`. As a result, Java sets the default value to be zero. 
> Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` 
> to be 1. 
> After a deeper look when debugging the exception thrown in FLINK-27310, we 
> find it would be better to set the default value to 1 for the `replica` field 
> and remove the declaration in examples due to following reasons:
> 1. A normal Session or Application cluster should have at least one JM. The 
> current default value, zero, does not follow the common case.
> 2. One JM can work for k8s HA mode as well and if users really want to launch 
> a standby JM for faster recovery, they can declare the value of `replica` 
> field in the yaml file. In examples, we just use the new default value(i.e. 
> 1), which should be fine.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27412) Allow flinkVersion v1_13 in flink-kubernetes-operator

2022-04-26 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27412?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528115#comment-17528115
 ] 

Biao Geng commented on FLINK-27412:
---

Big +1 for this ticket. Most users, especially those running Flink in 
production, are not using the latest versions like 1.14 or 1.15.
I have verified most examples in my own cluster using Flink 1.13 (both the open 
source and Ververica's implementations), and as far as I can see they work well. 

> Allow flinkVersion v1_13 in flink-kubernetes-operator
> -
>
> Key: FLINK-27412
> URL: https://issues.apache.org/jira/browse/FLINK-27412
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Yang Wang
>Priority: Major
>  Labels: starter
> Fix For: kubernetes-operator-1.0.0
>
>
> The core k8s related features:
>  * native k8s integration for session cluster, 1.10
>  * native k8s integration for application cluster, 1.11
>  * Flink K8s HA, 1.12
>  * pod template, 1.13
> So we could set required the minimum version to 1.13. This will allow more 
> users to have a try on flink-kubernetes-operator.
>  
> BTW, we need to update the e2e tests to cover all the supported versions.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27329) Add default value of replica of JM pod and not declare it in example yamls

2022-04-21 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525432#comment-17525432
 ] 

Biao Geng commented on FLINK-27329:
---

[~wangyang0918] Reviewing all default values makes sense to me. I will check 
the specs under {{org.apache.flink.kubernetes.operator.crd.spec}} to find 
whether there are fields, like the JM replica, that do not have a proper 
default value.

> Add default value of replica of JM pod and not declare it in example yamls
> --
>
> Key: FLINK-27329
> URL: https://issues.apache.org/jira/browse/FLINK-27329
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Assignee: Biao Geng
>Priority: Major
>
> Currently, we do not explicitly set the default value of `replica` in 
> `JobManagerSpec`. As a result, Java sets the default value to be zero. 
> Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` 
> to be 1. 
> After a deeper look when debugging the exception thrown in FLINK-27310, we 
> find it would be better to set the default value to 1 for the `replica` field 
> and remove the declaration in examples due to following reasons:
> 1. A normal Session or Application cluster should have at least one JM. The 
> current default value, zero, does not follow the common case.
> 2. One JM can work for k8s HA mode as well and if users really want to launch 
> a standby JM for faster recovery, they can declare the value of `replica` 
> field in the yaml file. In examples, we just use the new default value(i.e. 
> 1), which should be fine.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (FLINK-27329) Add default value of replica of JM pod and not declare it in example yamls

2022-04-20 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng updated FLINK-27329:
--
Description: 
Currently, we do not explicitly set the default value of `replica` in 
`JobManagerSpec`. As a result, Java sets the default value to be zero. 
Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` 
to be 1. 
After a deeper look when debugging the exception thrown in FLINK-27310, we find 
it would be better to set the default value to 1 for the `replica` field and 
remove the declaration in examples due to following reasons:
1. A normal Session or Application cluster should have at least one JM. The 
current default value, zero, does not follow the common case.
2. One JM can work for k8s HA mode as well and if users really want to launch a 
standby JM for faster recovery, they can declare the value of `replica` field 
in the yaml file. In examples, we just use the new default value(i.e. 1), which 
should be fine.

  was:
Currently, we do not explicitly set the default value of `replica` in 
`JobManagerSpec`. As a result, Java sets the default value to be zero. 
Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` 
to be 1. 
After a deeper look when debugging the exception thrown in FLINK-27310, we find 
it would be better to set the default value to 1 for `replica` fields and 
remove the declaration in examples due to following reasons:
1. A normal Session or Application cluster should have at least one JM. The 
current default value, zero, does not follow the common case.
2. One JM can work for k8s HA mode as well and if users really want to launch a 
standby JM for faster recorvery, they can declare the `replica` field in the 
yaml file. In examples, we just use the new default valu(i.e. 1) should be fine.


> Add default value of replica of JM pod and not declare it in example yamls
> --
>
> Key: FLINK-27329
> URL: https://issues.apache.org/jira/browse/FLINK-27329
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Assignee: Biao Geng
>Priority: Major
>
> Currently, we do not explicitly set the default value of `replica` in 
> `JobManagerSpec`. As a result, Java sets the default value to be zero. 
> Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` 
> to be 1. 
> After a deeper look when debugging the exception thrown in FLINK-27310, we 
> find it would be better to set the default value to 1 for the `replica` field 
> and remove the declaration in examples due to following reasons:
> 1. A normal Session or Application cluster should have at least one JM. The 
> current default value, zero, does not follow the common case.
> 2. One JM can work for k8s HA mode as well and if users really want to launch 
> a standby JM for faster recovery, they can declare the value of `replica` 
> field in the yaml file. In examples, we just use the new default value(i.e. 
> 1), which should be fine.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Updated] (FLINK-27329) Add default value of replica of JM pod and not declare it in example yamls

2022-04-20 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng updated FLINK-27329:
--
Summary: Add default value of replica of JM pod and not declare it in 
example yamls  (was: Add default value of replica of JM pod and remove 
declaring it in example yamls)

> Add default value of replica of JM pod and not declare it in example yamls
> --
>
> Key: FLINK-27329
> URL: https://issues.apache.org/jira/browse/FLINK-27329
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Assignee: Biao Geng
>Priority: Major
>
> Currently, we do not explicitly set the default value of `replica` in 
> `JobManagerSpec`. As a result, Java sets the default value to be zero. 
> Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` 
> to be 1. 
> After a deeper look when debugging the exception thrown in FLINK-27310, we 
> find it would be better to set the default value to 1 for `replica` fields 
> and remove the declaration in examples due to following reasons:
> 1. A normal Session or Application cluster should have at least one JM. The 
> current default value, zero, does not follow the common case.
> 2. One JM can work for k8s HA mode as well and if users really want to launch 
> a standby JM for faster recovery, they can declare the `replica` field in 
> the yaml file. In examples, just using the new default value (i.e. 1) should 
> be fine.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27329) Add default value of replica of JM pod and remove declaring it in example yamls

2022-04-20 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525020#comment-17525020
 ] 

Biao Geng commented on FLINK-27329:
---

cc [~wangyang0918] 

I can take this ticket~

> Add default value of replica of JM pod and remove declaring it in example 
> yamls
> ---
>
> Key: FLINK-27329
> URL: https://issues.apache.org/jira/browse/FLINK-27329
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Priority: Major
>
> Currently, we do not explicitly set the default value of `replica` in 
> `JobManagerSpec`. As a result, Java sets the default value to be zero. 
> Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` 
> to be 1. 
> After a deeper look when debugging the exception thrown in FLINK-27310, we 
> find it would be better to set the default value to 1 for `replica` fields 
> and remove the declaration in examples due to following reasons:
> 1. A normal Session or Application cluster should have at least one JM. The 
> current default value, zero, does not follow the common case.
> 2. One JM can work for k8s HA mode as well and if users really want to launch 
> a standby JM for faster recovery, they can declare the `replica` field in 
> the yaml file. In examples, just using the new default value (i.e. 1) should 
> be fine.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Created] (FLINK-27329) Add default value of replica of JM pod and remove declaring it in example yamls

2022-04-20 Thread Biao Geng (Jira)
Biao Geng created FLINK-27329:
-

 Summary: Add default value of replica of JM pod and remove 
declaring it in example yamls
 Key: FLINK-27329
 URL: https://issues.apache.org/jira/browse/FLINK-27329
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Biao Geng


Currently, we do not explicitly set the default value of `replica` in 
`JobManagerSpec`. As a result, Java sets the default value to be zero. 
Besides, in our examples, we explicitly declare `replica` in `JobManagerSpec` 
to be 1. 
After a deeper look when debugging the exception thrown in FLINK-27310, we find 
it would be better to set the default value to 1 for `replica` fields and 
remove the declaration in examples due to following reasons:
1. A normal Session or Application cluster should have at least one JM. The 
current default value, zero, does not follow the common case.
2. One JM can work for k8s HA mode as well and if users really want to launch a 
standby JM for faster recovery, they can declare the `replica` field in the 
yaml file. In examples, just using the new default value (i.e. 1) should be fine.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27309) Allow to load default flink configs in the k8s operator dynamically

2022-04-20 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17525012#comment-17525012
 ] 

Biao Geng commented on FLINK-27309:
---

Thanks [~gyfora]! Besides [~aitozi]'s valuable point, I also wonder whether 
changed default configs should take effect for a running job deployment. 
From the k8s side, it may make sense to treat such a change as `specChanged`, 
so that all running jobs are suspended and restarted.
From the Flink users' side, however, I am not sure it is proper to restart 
running jobs (i.e. to let changed default configs affect running jobs).



> Allow to load default flink configs in the k8s operator dynamically
> ---
>
> Key: FLINK-27309
> URL: https://issues.apache.org/jira/browse/FLINK-27309
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Assignee: Gyula Fora
>Priority: Major
>
> Current default configs used by the k8s operator will be saved in the 
> /opt/flink/conf dir in the k8s operator pod and will be loaded only once when 
> the operator is created.
> Since the flink k8s operator could be a long running service and users may 
> want to modify the default configs(e.g the metric reporter sampling interval) 
> for newly created deployments, it may better to load the default configs 
> dynamically(i.e. parsing the latest /opt/flink/conf/flink-conf.yaml) in the 
> {{ReconcilerFactory}} and {{ObserverFactory}}, instead of redeploying the 
> operator.
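The dynamic-loading idea described above can be sketched as re-parsing the mounted conf file on every reconcile instead of caching it once at operator start-up. The loader below is a simplified stand-in (flat "key: value" lines only, not real YAML or the operator's actual code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DynamicConfLoader {
    // Simplified stand-in for parsing flink-conf.yaml: split flat
    // "key: value" lines into a map, skipping comments.
    static Map<String, String> parseDefaults(List<String> lines) {
        Map<String, String> conf = new HashMap<>();
        for (String line : lines) {
            int idx = line.indexOf(':');
            if (idx > 0 && !line.trim().startsWith("#")) {
                conf.put(line.substring(0, idx).trim(), line.substring(idx + 1).trim());
            }
        }
        return conf;
    }

    // Re-read the mounted file on every reconcile, so edits to the
    // default configs take effect for newly created deployments
    // without redeploying the operator.
    static Map<String, String> loadDefaults(Path confFile) throws IOException {
        return parseDefaults(Files.readAllLines(confFile));
    }

    public static void main(String[] args) throws IOException {
        Path conf = Files.createTempFile("flink-conf", ".yaml");
        Files.write(conf, List.of("metrics.reporter.interval: 30 SECONDS"));
        // Each reconcile loop would call loadDefaults(conf) afresh here.
        System.out.println(loadDefaults(conf));
    }
}
```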



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-24960) YARNSessionCapacitySchedulerITCase.testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots ha

2022-04-20 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-24960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524860#comment-17524860
 ] 

Biao Geng commented on FLINK-24960:
---

Hi [~mapohl], I did some research, but in the latest 
[failure|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=34743=logs=298e20ef-7951-5965-0e79-ea664ddc435e=d4c90338-c843-57b0-3232-10ae74f00347=36026]
 posted by Yun, the {{Extracted hostname:port:}} log was not shown. In the 
description of this ticket, however, it shows {{Extracted hostname:port: 
5718b812c7ab:38622}} in the old CI run, which seems correct. 
I plan to first verify that {{yarnSessionClusterRunner.sendStop();}} works 
fine (i.e. that the session cluster is stopped normally), but I have not found 
a way to run only the cron build's "test_cron_jdk11 misc" stage on my own [azure 
pipeline|https://dev.azure.com/samuelgeng7/Flink/_build/results?buildId=109=results],
 which makes verification pretty slow and hard. Are there any guidelines for 
debugging the Azure pipeline with only specific tests?

> YARNSessionCapacitySchedulerITCase.testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots
>  hangs on azure
> ---
>
> Key: FLINK-24960
> URL: https://issues.apache.org/jira/browse/FLINK-24960
> Project: Flink
>  Issue Type: Bug
>  Components: Deployment / YARN
>Affects Versions: 1.15.0, 1.14.3
>Reporter: Yun Gao
>Assignee: Niklas Semmler
>Priority: Critical
>  Labels: pull-request-available, test-stability
> Fix For: 1.16.0
>
>
> {code:java}
> Nov 18 22:37:08 
> 
> Nov 18 22:37:08 Test 
> testVCoresAreSetCorrectlyAndJobManagerHostnameAreShownInWebInterfaceAndDynamicPropertiesAndYarnApplicationNameAndTaskManagerSlots(org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase)
>  is running.
> Nov 18 22:37:08 
> 
> Nov 18 22:37:25 22:37:25,470 [main] INFO  
> org.apache.flink.yarn.YARNSessionCapacitySchedulerITCase [] - Extracted 
> hostname:port: 5718b812c7ab:38622
> Nov 18 22:52:36 
> ==
> Nov 18 22:52:36 Process produced no output for 900 seconds.
> Nov 18 22:52:36 
> ==
>  {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=26722=logs=f450c1a5-64b1-5955-e215-49cb1ad5ec88=cc452273-9efa-565d-9db8-ef62a38a0c10=36395



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27310) FlinkOperatorITCase failure due to JobManager replicas less than one

2022-04-19 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524322#comment-17524322
 ] 

Biao Geng commented on FLINK-27310:
---

[~usamj] oh, I tried to reproduce the error on my local machine but missed that 
the IT case is only executed in the CI pipeline :(

cc [~wangyang0918] I created a 
[PR|https://github.com/apache/flink-kubernetes-operator/pull/173]
 for fixing this IT case.
I think we should improve the ci.yml as well.

> FlinkOperatorITCase failure due to JobManager replicas less than one
> 
>
> Key: FLINK-27310
> URL: https://issues.apache.org/jira/browse/FLINK-27310
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Reporter: Usamah Jassat
>Priority: Minor
>  Labels: pull-request-available
>
> The FlinkOperatorITCase test is currently failing, even in the CI pipeline 
>  
> {code:java}
> INFO] Running org.apache.flink.kubernetes.operator.FlinkOperatorITCase
>   
>   
> 
> Error:  Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 3.178 s <<< FAILURE! - in 
> org.apache.flink.kubernetes.operator.FlinkOperatorITCase
>   
>   
> 
> Error:  org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test  
> Time elapsed: 2.664 s  <<< ERROR!
>   
>   
> 
> io.fabric8.kubernetes.client.KubernetesClientException: Failure 
> executing: POST at: 
> https://192.168.49.2:8443/apis/flink.apache.org/v1beta1/namespaces/flink-operator-test/flinkdeployments.
>  Message: Forbidden! User minikube doesn't have permission. admission webhook 
> "vflinkdeployments.flink.apache.org" denied the request: JobManager replicas 
> should not be configured less than one..
>   
>   
> 
>   at 
> flink.kubernetes.operator@1.0-SNAPSHOT/org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test(FlinkOperatorITCase.java:86){code}
>  
> While the test is failing the CI test run is passing which also should be 
> fixed then to fail on the test failure.
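The fix discussed in the comments, adding {{jm.setReplicas(1);}} before submitting the deployment so the validating webhook accepts it, can be sketched with stand-in classes (the real `JobManagerSpec` and webhook live in the operator's code base):

```java
public class ITCaseFixDemo {
    // Stand-in for the operator's JobManagerSpec.
    static class JobManagerSpec {
        private int replicas; // primitive default is 0 -> rejected below
        void setReplicas(int replicas) { this.replicas = replicas; }
        int getReplicas() { return replicas; }
    }

    // Stand-in for the admission webhook check quoted in the error.
    static void validate(JobManagerSpec jm) {
        if (jm.getReplicas() < 1) {
            throw new IllegalArgumentException(
                    "JobManager replicas should not be configured less than one.");
        }
    }

    public static void main(String[] args) {
        JobManagerSpec jm = new JobManagerSpec();
        jm.setReplicas(1); // the missing line that made the IT case fail
        validate(jm);      // passes with replicas = 1
        System.out.println("validation passed");
    }
}
```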



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (FLINK-27310) FlinkOperatorITCase failure due to JobManager replicas less than one

2022-04-19 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524291#comment-17524291
 ] 

Biao Geng edited comment on FLINK-27310 at 4/19/22 1:01 PM:


It seems that this ITCase is not really executed due to its naming (it does 
not follow the *Test* pattern). 
For the error in the posted log, I believe it is because we have not added 
{{jm.setReplicas(1);}} to set the replica of JM.


was (Author: bgeng777):
It seems that this ITCase is not really executed due to its naming(not follow 
*Test{*}*{*}  pattern). 
For the error in the posted log, I believe it is because we have not added 
{{jm.setReplicas(1);}} to set the replica of JM.

> FlinkOperatorITCase failure due to JobManager replicas less than one
> 
>
> Key: FLINK-27310
> URL: https://issues.apache.org/jira/browse/FLINK-27310
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Reporter: Usamah Jassat
>Priority: Minor
>
> The FlinkOperatorITCase test is currently failing, even in the CI pipeline 
>  
> {code:java}
> INFO] Running org.apache.flink.kubernetes.operator.FlinkOperatorITCase
>   
>   
> 
> Error:  Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 3.178 s <<< FAILURE! - in 
> org.apache.flink.kubernetes.operator.FlinkOperatorITCase
>   
>   
> 
> Error:  org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test  
> Time elapsed: 2.664 s  <<< ERROR!
>   
>   
> 
> io.fabric8.kubernetes.client.KubernetesClientException: Failure 
> executing: POST at: 
> https://192.168.49.2:8443/apis/flink.apache.org/v1beta1/namespaces/flink-operator-test/flinkdeployments.
>  Message: Forbidden! User minikube doesn't have permission. admission webhook 
> "vflinkdeployments.flink.apache.org" denied the request: JobManager replicas 
> should not be configured less than one..
>   
>   
> 
>   at 
> flink.kubernetes.operator@1.0-SNAPSHOT/org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test(FlinkOperatorITCase.java:86){code}
>  
> While the test is failing the CI test run is passing which also should be 
> fixed then to fail on the test failure.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (FLINK-27310) FlinkOperatorITCase failure due to JobManager replicas less than one

2022-04-19 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524291#comment-17524291
 ] 

Biao Geng edited comment on FLINK-27310 at 4/19/22 1:01 PM:


It seems that this ITCase is not really executed due to its naming(not follow 
**Test * pattern). 
For the error in the posted log, I believe it is because we have not added 
{{jm.setReplicas(1);}} to set the replica of JM.


was (Author: bgeng777):
It seems that this ITCase is not really executed due to its naming(not follow 
*Test* pattern). 
For the error in the posted log, I believe it is because we have not added 
{{jm.setReplicas(1);}} to set the replica of JM.

> FlinkOperatorITCase failure due to JobManager replicas less than one
> 
>
> Key: FLINK-27310
> URL: https://issues.apache.org/jira/browse/FLINK-27310
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Reporter: Usamah Jassat
>Priority: Minor
>
> The FlinkOperatorITCase test is currently failing, even in the CI pipeline 
>  
> {code:java}
> INFO] Running org.apache.flink.kubernetes.operator.FlinkOperatorITCase
>   
>   
> 
> Error:  Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 3.178 s <<< FAILURE! - in 
> org.apache.flink.kubernetes.operator.FlinkOperatorITCase
>   
>   
> 
> Error:  org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test  
> Time elapsed: 2.664 s  <<< ERROR!
>   
>   
> 
> io.fabric8.kubernetes.client.KubernetesClientException: Failure 
> executing: POST at: 
> https://192.168.49.2:8443/apis/flink.apache.org/v1beta1/namespaces/flink-operator-test/flinkdeployments.
>  Message: Forbidden! User minikube doesn't have permission. admission webhook 
> "vflinkdeployments.flink.apache.org" denied the request: JobManager replicas 
> should not be configured less than one..
>   
>   
> 
>   at 
> flink.kubernetes.operator@1.0-SNAPSHOT/org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test(FlinkOperatorITCase.java:86){code}
>  
> While the test is failing the CI test run is passing which also should be 
> fixed then to fail on the test failure.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (FLINK-27310) FlinkOperatorITCase failure due to JobManager replicas less than one

2022-04-19 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524291#comment-17524291
 ] 

Biao Geng edited comment on FLINK-27310 at 4/19/22 1:00 PM:


It seems that this ITCase is not really executed due to its naming(not follow 
*Test{*}*{*}  pattern). 
For the error in the posted log, I believe it is because we have not added 
{{jm.setReplicas(1);}} to set the replica of JM.


was (Author: bgeng777):
It seems that this ITCase is not really executed due to its naming(not follow 
*Test*  pattern). 
For the error in the posted log, I believe it is because we have not added 
{{jm.setReplicas(1);}} to set the replica of JM.

> FlinkOperatorITCase failure due to JobManager replicas less than one
> 
>
> Key: FLINK-27310
> URL: https://issues.apache.org/jira/browse/FLINK-27310
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Reporter: Usamah Jassat
>Priority: Minor
>
> The FlinkOperatorITCase test is currently failing, even in the CI pipeline 
>  
> {code:java}
> INFO] Running org.apache.flink.kubernetes.operator.FlinkOperatorITCase
>   
>   
> 
> Error:  Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 3.178 s <<< FAILURE! - in 
> org.apache.flink.kubernetes.operator.FlinkOperatorITCase
>   
>   
> 
> Error:  org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test  
> Time elapsed: 2.664 s  <<< ERROR!
>   
>   
> 
> io.fabric8.kubernetes.client.KubernetesClientException: Failure 
> executing: POST at: 
> https://192.168.49.2:8443/apis/flink.apache.org/v1beta1/namespaces/flink-operator-test/flinkdeployments.
>  Message: Forbidden! User minikube doesn't have permission. admission webhook 
> "vflinkdeployments.flink.apache.org" denied the request: JobManager replicas 
> should not be configured less than one..
>   
>   
> 
>   at 
> flink.kubernetes.operator@1.0-SNAPSHOT/org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test(FlinkOperatorITCase.java:86){code}
>  
> While the test is failing the CI test run is passing which also should be 
> fixed then to fail on the test failure.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Comment Edited] (FLINK-27310) FlinkOperatorITCase failure due to JobManager replicas less than one

2022-04-19 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524291#comment-17524291
 ] 

Biao Geng edited comment on FLINK-27310 at 4/19/22 1:00 PM:


It seems that this ITCase is not really executed due to its naming(not follow 
*Test*  pattern). 
For the error in the posted log, I believe it is because we have not added 
{{jm.setReplicas(1);}} to set the replica of JM.


was (Author: bgeng777):
It seems that this ITCase is not really executed due to its naming(not follow 
*Test* pattern). 
For the error in the posted log, I believe it is because we have not added 
{{jm.setReplicas(1);}} to set the replica of JM.


> FlinkOperatorITCase failure due to JobManager replicas less than one
> 
>
> Key: FLINK-27310
> URL: https://issues.apache.org/jira/browse/FLINK-27310
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Reporter: Usamah Jassat
>Priority: Minor
>
> The FlinkOperatorITCase test is currently failing, even in the CI pipeline 
>  
> {code:java}
> INFO] Running org.apache.flink.kubernetes.operator.FlinkOperatorITCase
>   
>   
> 
> Error:  Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 3.178 s <<< FAILURE! - in 
> org.apache.flink.kubernetes.operator.FlinkOperatorITCase
>   
>   
> 
> Error:  org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test  
> Time elapsed: 2.664 s  <<< ERROR!
>   
>   
> 
> io.fabric8.kubernetes.client.KubernetesClientException: Failure 
> executing: POST at: 
> https://192.168.49.2:8443/apis/flink.apache.org/v1beta1/namespaces/flink-operator-test/flinkdeployments.
>  Message: Forbidden! User minikube doesn't have permission. admission webhook 
> "vflinkdeployments.flink.apache.org" denied the request: JobManager replicas 
> should not be configured less than one..
>   
>   
> 
>   at 
> flink.kubernetes.operator@1.0-SNAPSHOT/org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test(FlinkOperatorITCase.java:86){code}
>  
> While the test is failing, the CI run still passes; the pipeline should also 
> be fixed so that it fails on this test failure.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27310) FlinkOperatorITCase failure due to JobManager replicas less than one

2022-04-19 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524291#comment-17524291
 ] 

Biao Geng commented on FLINK-27310:
---

It seems that this ITCase is not really executed due to its naming (it does not 
follow the *Test* pattern). 
For the error in the posted log, I believe it is because we have not added 
{{jm.setReplicas(1);}} to set the replica count of the JM.
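To illustrate the failure and the fix, here is a minimal sketch of the webhook rule that rejected the request (the class and method names below are simplified stand-ins for the operator's CRD model, not its real API):

```java
// Hypothetical stand-in for the JobManager spec in the FlinkDeployment CR.
class JobManagerSpecSketch {
    private int replicas; // defaults to 0 when the test never sets it

    void setReplicas(int replicas) {
        this.replicas = replicas;
    }

    int getReplicas() {
        return replicas;
    }
}

class ReplicaValidator {
    // Mirrors the admission webhook message from the posted log.
    static String validate(JobManagerSpecSketch jm) {
        if (jm.getReplicas() < 1) {
            return "JobManager replicas should not be configured less than one";
        }
        return null; // spec is valid
    }
}
```

In the real ITCase the fix would simply be calling {{jm.setReplicas(1);}} on the JobManager spec before building the FlinkDeployment.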


> FlinkOperatorITCase failure due to JobManager replicas less than one
> 
>
> Key: FLINK-27310
> URL: https://issues.apache.org/jira/browse/FLINK-27310
> Project: Flink
>  Issue Type: Bug
>  Components: Kubernetes Operator
>Reporter: Usamah Jassat
>Priority: Minor
>
> The FlinkOperatorITCase test is currently failing, even in the CI pipeline 
>  
> {code:java}
> INFO] Running org.apache.flink.kubernetes.operator.FlinkOperatorITCase
>   
>   
> 
> Error:  Tests run: 1, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 
> 3.178 s <<< FAILURE! - in 
> org.apache.flink.kubernetes.operator.FlinkOperatorITCase
>   
>   
> 
> Error:  org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test  
> Time elapsed: 2.664 s  <<< ERROR!
>   
>   
> 
> io.fabric8.kubernetes.client.KubernetesClientException: Failure 
> executing: POST at: 
> https://192.168.49.2:8443/apis/flink.apache.org/v1beta1/namespaces/flink-operator-test/flinkdeployments.
>  Message: Forbidden! User minikube doesn't have permission. admission webhook 
> "vflinkdeployments.flink.apache.org" denied the request: JobManager replicas 
> should not be configured less than one..
>   
>   
> 
>   at 
> flink.kubernetes.operator@1.0-SNAPSHOT/org.apache.flink.kubernetes.operator.FlinkOperatorITCase.test(FlinkOperatorITCase.java:86){code}
>  
> While the test is failing, the CI run still passes; the pipeline should also 
> be fixed so that it fails on this test failure.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)


[jira] [Commented] (FLINK-27303) Flink Operator will create a large amount of temp log config files

2022-04-19 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524218#comment-17524218
 ] 

Biao Geng commented on FLINK-27303:
---

Yeah, it seems that most generated configs are for constructing a ClusterClient 
to interact with the cluster. 
Configs like templates, log configs, or image configs that are only needed for 
submission will not be used by the operator's reconciliation, so we can build 
them only in the submission methods.

Besides, it may also make sense to clear those temporary files for logs and 
templates in {{FlinkDeploymentController#cleanup()}}.
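A minimal sketch of that cleanup idea, assuming the temp files are written with the deployment name as a filename prefix (the directory layout and naming are assumptions for illustration, not the operator's actual behavior):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: delete the temporary log-config/pod-template files that
// were generated for one deployment, e.g. called from the cleanup hook.
class TempFileCleaner {
    static int cleanup(Path tmpDir, String deploymentName) throws IOException {
        if (!Files.isDirectory(tmpDir)) {
            return 0;
        }
        int removed = 0;
        // Match only this deployment's files, e.g. "myjob-log4j.properties".
        try (DirectoryStream<Path> files =
                Files.newDirectoryStream(tmpDir, deploymentName + "-*")) {
            for (Path file : files) {
                if (Files.deleteIfExists(file)) {
                    removed++;
                }
            }
        }
        return removed;
    }
}
```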

> Flink Operator will create a large amount of temp log config files
> --
>
> Key: FLINK-27303
> URL: https://issues.apache.org/jira/browse/FLINK-27303
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.0.0
>Reporter: Gyula Fora
>Priority: Critical
> Fix For: kubernetes-operator-1.0.0
>
>
> Now we use the config builder in multiple different places to generate the 
> effective config, including the observer, reconciler, validator, etc.
> The effective config generation logic also creates temporary log config 
> files (if spec logConfiguration is set), which leads to 3-4 files being 
> generated in every reconcile loop for a given job. These files are not 
> cleaned up until the operator restarts, leading to a large number of files.
> I believe we should change the config generation logic and only apply the 
> log config generation right before flink cluster submission, as that is the 
> only thing affected by it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-27309) Allow to load default flink configs in the k8s operator dynamically

2022-04-19 Thread Biao Geng (Jira)
Biao Geng created FLINK-27309:
-

 Summary: Allow to load default flink configs in the k8s operator 
dynamically
 Key: FLINK-27309
 URL: https://issues.apache.org/jira/browse/FLINK-27309
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Biao Geng


The default configs used by the k8s operator are currently saved in the 
/opt/flink/conf dir of the k8s operator pod and are loaded only once, when the 
operator is created.
Since the flink k8s operator can be a long-running service and users may want 
to modify the default configs (e.g. the metric reporter sampling interval) for 
newly created deployments, it may be better to load the default configs 
dynamically (i.e. parsing the latest /opt/flink/conf/flink-conf.yaml) in the 
{{ReconcilerFactory}} and {{ObserverFactory}}, instead of redeploying the 
operator.
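A rough sketch of that idea, reloading only when the file's modification time changes; the naive "key: value" line parsing here is an assumption for illustration, not Flink's real config handling:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.FileTime;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical loader that re-parses the default config file on change,
// instead of reading it once at operator startup.
class DynamicDefaultConfig {
    private final Path confFile;
    private FileTime lastSeen;
    private Map<String, String> cached = Collections.emptyMap();

    DynamicDefaultConfig(Path confFile) {
        this.confFile = confFile;
    }

    synchronized Map<String, String> load() throws IOException {
        FileTime mtime = Files.getLastModifiedTime(confFile);
        if (!mtime.equals(lastSeen)) { // first load, or the file changed
            Map<String, String> fresh = new HashMap<>();
            for (String line : Files.readAllLines(confFile)) {
                int sep = line.indexOf(':');
                if (sep > 0 && !line.trim().startsWith("#")) {
                    fresh.put(line.substring(0, sep).trim(),
                              line.substring(sep + 1).trim());
                }
            }
            cached = fresh;
            lastSeen = mtime;
        }
        return cached;
    }
}
```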




--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-27303) Flink Operator will create a large amount of temp log config files

2022-04-19 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17524189#comment-17524189
 ] 

Biao Geng commented on FLINK-27303:
---

Hi [~gyfora], big +1 for this jira. Besides log config files, there will also 
be a lot of pod template files if we define the template in the yaml, because 
{{applyCommonPodTemplate}} creates temporary files for the pod template as 
well.
I think your suggestion makes sense. One possible solution is to divide the 
generated configs into 2 types: type 1 is configs needed by the observer and 
reconciler; type 2 is configs not used by the observer or reconciler.

Besides, I am wondering if it is a good idea to maintain a concurrent hashmap 
that saves the effective config of each deployment so that we can avoid some 
repeated creation of effective configs. However, it may become a memory 
bottleneck when there are plenty of deployments in the operator, and it makes 
the operator less `stateless`. Not sure if it is worthwhile.
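The caching idea could look roughly like the following sketch (the names and the string-valued "effective config" are invented for illustration; a real implementation would cache the operator's Configuration objects):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical per-deployment cache of effective configs. The trade-offs
// discussed above apply: memory grows with the number of deployments and the
// operator becomes less "stateless", so entries must be evicted on spec
// change or deployment cleanup.
class EffectiveConfigCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> configBuilder;
    int builds = 0; // counts how often the expensive build actually runs

    EffectiveConfigCache(Function<String, String> configBuilder) {
        this.configBuilder = configBuilder;
    }

    String get(String deploymentName) {
        return cache.computeIfAbsent(deploymentName, name -> {
            builds++;
            return configBuilder.apply(name);
        });
    }

    void invalidate(String deploymentName) {
        cache.remove(deploymentName); // call on spec change or cleanup
    }
}
```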

> Flink Operator will create a large amount of temp log config files
> --
>
> Key: FLINK-27303
> URL: https://issues.apache.org/jira/browse/FLINK-27303
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Affects Versions: kubernetes-operator-1.0.0
>Reporter: Gyula Fora
>Priority: Critical
> Fix For: kubernetes-operator-1.0.0
>
>
> Now we use the config builder in multiple different places to generate the 
> effective config, including the observer, reconciler, validator, etc.
> The effective config generation logic also creates temporary log config 
> files (if spec logConfiguration is set), which leads to 3-4 files being 
> generated in every reconcile loop for a given job. These files are not 
> cleaned up until the operator restarts, leading to a large number of files.
> I believe we should change the config generation logic and only apply the 
> log config generation right before flink cluster submission, as that is the 
> only thing affected by it.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Closed] (FLINK-27109) Improve the creation of ClusterRole in Flink K8s operator

2022-04-07 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng closed FLINK-27109.
-
Resolution: Won't Fix

> Improve the creation of ClusterRole in Flink K8s operator
> -
>
> Key: FLINK-27109
> URL: https://issues.apache.org/jira/browse/FLINK-27109
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Priority: Not a Priority
>
> As the k8s 
> [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example]
>  says, a ClusterRole is a non-namespaced resource. 
> In our helm chart, we currently define the ClusterRole with the name 
> 'flink-operator', and the namespace field in its metadata is omitted. As a 
> result, if a user wants to install multiple flink-kubernetes-operators in 
> different namespaces, the ClusterRole 'flink-operator' will be created 
> multiple times. 
> Errors like
> {quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that 
> already exists. Unable to continue with install: ClusterRole "flink-operator" 
> in namespace "" exists and cannot be imported into the current release: 
> invalid ownership metadata; annotation validation error: key 
> "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current 
> value is "default"
> {quote}
> will be thrown.
> Solution 1 could be adding the namespace as a postfix to the name of the 
> ClusterRole. 
> Solution 2 is to add an if-else check like 
> [this|https://stackoverflow.com/questions/65110332/clusterrole-exists-and-cannot-be-imported-into-the-current-release]
>  to avoid creating a resource that already exists. One important drawback of 
> solution 2 is that when uninstalling a helm release, the created ClusterRole 
> will be removed as well.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-27109) Improve the creation of ClusterRole in Flink K8s operator

2022-04-07 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng updated FLINK-27109:
--
Priority: Not a Priority  (was: Major)

> Improve the creation of ClusterRole in Flink K8s operator
> -
>
> Key: FLINK-27109
> URL: https://issues.apache.org/jira/browse/FLINK-27109
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Priority: Not a Priority
>
> As the k8s 
> [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example]
>  says, a ClusterRole is a non-namespaced resource. 
> In our helm chart, we currently define the ClusterRole with the name 
> 'flink-operator', and the namespace field in its metadata is omitted. As a 
> result, if a user wants to install multiple flink-kubernetes-operators in 
> different namespaces, the ClusterRole 'flink-operator' will be created 
> multiple times. 
> Errors like
> {quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that 
> already exists. Unable to continue with install: ClusterRole "flink-operator" 
> in namespace "" exists and cannot be imported into the current release: 
> invalid ownership metadata; annotation validation error: key 
> "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current 
> value is "default"
> {quote}
> will be thrown.
> Solution 1 could be adding the namespace as a postfix to the name of the 
> ClusterRole. 
> Solution 2 is to add an if-else check like 
> [this|https://stackoverflow.com/questions/65110332/clusterrole-exists-and-cannot-be-imported-into-the-current-release]
>  to avoid creating a resource that already exists. One important drawback of 
> solution 2 is that when uninstalling a helm release, the created ClusterRole 
> will be removed as well.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-27109) Improve the creation of ClusterRole in Flink K8s operator

2022-04-07 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17518699#comment-17518699
 ] 

Biao Geng commented on FLINK-27109:
---

[~wangyang0918] I see your point. My bad for overlooking the usage of 
`watchNamespaces`. I will utilize this field.

> Improve the creation of ClusterRole in Flink K8s operator
> -
>
> Key: FLINK-27109
> URL: https://issues.apache.org/jira/browse/FLINK-27109
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Priority: Major
>
> As the k8s 
> [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example]
>  says, a ClusterRole is a non-namespaced resource. 
> In our helm chart, we currently define the ClusterRole with the name 
> 'flink-operator', and the namespace field in its metadata is omitted. As a 
> result, if a user wants to install multiple flink-kubernetes-operators in 
> different namespaces, the ClusterRole 'flink-operator' will be created 
> multiple times. 
> Errors like
> {quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that 
> already exists. Unable to continue with install: ClusterRole "flink-operator" 
> in namespace "" exists and cannot be imported into the current release: 
> invalid ownership metadata; annotation validation error: key 
> "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current 
> value is "default"
> {quote}
> will be thrown.
> Solution 1 could be adding the namespace as a postfix to the name of the 
> ClusterRole. 
> Solution 2 is to add an if-else check like 
> [this|https://stackoverflow.com/questions/65110332/clusterrole-exists-and-cannot-be-imported-into-the-current-release]
>  to avoid creating a resource that already exists. One important drawback of 
> solution 2 is that when uninstalling a helm release, the created ClusterRole 
> will be removed as well.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-27109) Improve the creation of ClusterRole in Flink K8s operator

2022-04-07 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17518673#comment-17518673
 ] 

Biao Geng commented on FLINK-27109:
---

cc [~morhidi] [~mbalassi] Do you have any suggestions?

> Improve the creation of ClusterRole in Flink K8s operator
> -
>
> Key: FLINK-27109
> URL: https://issues.apache.org/jira/browse/FLINK-27109
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Priority: Major
>
> As the k8s 
> [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example]
>  says, a ClusterRole is a non-namespaced resource. 
> In our helm chart, we currently define the ClusterRole with the name 
> 'flink-operator', and the namespace field in its metadata is omitted. As a 
> result, if a user wants to install multiple flink-kubernetes-operators in 
> different namespaces, the ClusterRole 'flink-operator' will be created 
> multiple times. 
> Errors like
> {quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that 
> already exists. Unable to continue with install: ClusterRole "flink-operator" 
> in namespace "" exists and cannot be imported into the current release: 
> invalid ownership metadata; annotation validation error: key 
> "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current 
> value is "default"
> {quote}
> will be thrown.
> Solution 1 could be adding the namespace as a postfix to the name of the 
> ClusterRole. 
> Solution 2 is to add an if-else check like 
> [this|https://stackoverflow.com/questions/65110332/clusterrole-exists-and-cannot-be-imported-into-the-current-release]
>  to avoid creating a resource that already exists. One important drawback of 
> solution 2 is that when uninstalling a helm release, the created ClusterRole 
> will be removed as well.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-27109) Improve the creation of ClusterRole in Flink K8s operator

2022-04-07 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng updated FLINK-27109:
--
Description: 
As the k8s 
[doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example]
 says, a ClusterRole is a non-namespaced resource. 
In our helm chart, we currently define the ClusterRole with the name 
'flink-operator', and the namespace field in its metadata is omitted. As a 
result, if a user wants to install multiple flink-kubernetes-operators in 
different namespaces, the ClusterRole 'flink-operator' will be created 
multiple times. 
Errors like
{quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that 
already exists. Unable to continue with install: ClusterRole "flink-operator" 
in namespace "" exists and cannot be imported into the current release: invalid 
ownership metadata; annotation validation error: key 
"meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current value 
is "default"
{quote}
will be thrown.

Solution 1 could be adding the namespace as a postfix to the name of the 
ClusterRole. 
Solution 2 is to add an if-else check like 
[this|https://stackoverflow.com/questions/65110332/clusterrole-exists-and-cannot-be-imported-into-the-current-release]
 to avoid creating a resource that already exists. One important drawback of 
solution 2 is that when uninstalling a helm release, the created ClusterRole 
will be removed as well.

  was:
As the k8s 
[doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example]
 says, a ClusterRole is a non-namespaced resource. 
In our helm chart, we currently define the ClusterRole with the name 
'flink-operator', and the namespace field in its metadata is omitted. As a 
result, if a user wants to install multiple flink-kubernetes-operators in 
different namespaces, the ClusterRole 'flink-operator' will be created 
multiple times. 
Errors like
{quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that 
already exists. Unable to continue with install: ClusterRole "flink-operator" 
in namespace "" exists and cannot be imported into the current release: invalid 
ownership metadata; annotation validation error: key 
"meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current value 
is "default"
{quote}
will be thrown.

One solution could be adding the namespace as a postfix to the name of the 
ClusterRole.
Another possible solution is to add an if-else check like 
[this|https://stackoverflow.com/questions/65110332/clusterrole-exists-and-cannot-be-imported-into-the-current-release]
 to avoid creating a resource that already exists.


> Improve the creation of ClusterRole in Flink K8s operator
> -
>
> Key: FLINK-27109
> URL: https://issues.apache.org/jira/browse/FLINK-27109
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Priority: Major
>
> As the k8s 
> [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example]
>  says, a ClusterRole is a non-namespaced resource. 
> In our helm chart, we currently define the ClusterRole with the name 
> 'flink-operator', and the namespace field in its metadata is omitted. As a 
> result, if a user wants to install multiple flink-kubernetes-operators in 
> different namespaces, the ClusterRole 'flink-operator' will be created 
> multiple times. 
> Errors like
> {quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that 
> already exists. Unable to continue with install: ClusterRole "flink-operator" 
> in namespace "" exists and cannot be imported into the current release: 
> invalid ownership metadata; annotation validation error: key 
> "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current 
> value is "default"
> {quote}
> will be thrown.
> Solution 1 could be adding the namespace as a postfix to the name of the 
> ClusterRole. 
> Solution 2 is to add an if-else check like 
> [this|https://stackoverflow.com/questions/65110332/clusterrole-exists-and-cannot-be-imported-into-the-current-release]
>  to avoid creating a resource that already exists. One important drawback of 
> solution 2 is that when uninstalling a helm release, the created ClusterRole 
> will be removed as well.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-27109) Improve the creation of ClusterRole in Flink K8s operator

2022-04-07 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng updated FLINK-27109:
--
Description: 
As the k8s 
[doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example]
 says, a ClusterRole is a non-namespaced resource. 
In our helm chart, we currently define the ClusterRole with the name 
'flink-operator', and the namespace field in its metadata is omitted. As a 
result, if a user wants to install multiple flink-kubernetes-operators in 
different namespaces, the ClusterRole 'flink-operator' will be created 
multiple times. 
Errors like
{quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that 
already exists. Unable to continue with install: ClusterRole "flink-operator" 
in namespace "" exists and cannot be imported into the current release: invalid 
ownership metadata; annotation validation error: key 
"meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current value 
is "default"
{quote}
will be thrown.

One solution could be adding the namespace as a postfix to the name of the 
ClusterRole.
Another possible solution is to add an if-else check like 
[this|https://stackoverflow.com/questions/65110332/clusterrole-exists-and-cannot-be-imported-into-the-current-release]
 to avoid creating a resource that already exists.

  was:
As the k8s 
[doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example]
 says, a ClusterRole is a non-namespaced resource. 
In our helm chart, we currently define the ClusterRole with the name 
'flink-operator', and the namespace field in its metadata is omitted. As a 
result, if a user wants to install multiple flink-kubernetes-operators in 
different namespaces, the ClusterRole 'flink-operator' will be created 
multiple times. 
Errors like
{quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that 
already exists. Unable to continue with install: ClusterRole "flink-operator" 
in namespace "" exists and cannot be imported into the current release: invalid 
ownership metadata; annotation validation error: key 
"meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current value 
is "default"
{quote}
will be thrown.

One solution could be adding the namespace as a postfix to the name of the 
ClusterRole.
Another possible solution is to add an if-else check to avoid creating a 
resource that already exists.


> Improve the creation of ClusterRole in Flink K8s operator
> -
>
> Key: FLINK-27109
> URL: https://issues.apache.org/jira/browse/FLINK-27109
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Priority: Major
>
> As the k8s 
> [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example]
>  says, a ClusterRole is a non-namespaced resource. 
> In our helm chart, we currently define the ClusterRole with the name 
> 'flink-operator', and the namespace field in its metadata is omitted. As a 
> result, if a user wants to install multiple flink-kubernetes-operators in 
> different namespaces, the ClusterRole 'flink-operator' will be created 
> multiple times. 
> Errors like
> {quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that 
> already exists. Unable to continue with install: ClusterRole "flink-operator" 
> in namespace "" exists and cannot be imported into the current release: 
> invalid ownership metadata; annotation validation error: key 
> "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current 
> value is "default"
> {quote}
> will be thrown.
> One solution could be adding the namespace as a postfix to the name of the 
> ClusterRole.
> Another possible solution is to add an if-else check like 
> [this|https://stackoverflow.com/questions/65110332/clusterrole-exists-and-cannot-be-imported-into-the-current-release]
>  to avoid creating a resource that already exists.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Updated] (FLINK-27109) Improve the creation of ClusterRole in Flink K8s operator

2022-04-07 Thread Biao Geng (Jira)


 [ 
https://issues.apache.org/jira/browse/FLINK-27109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Biao Geng updated FLINK-27109:
--
Summary: Improve the creation of ClusterRole in Flink K8s operator  (was: 
The naming pattern of ClusterRole in Flink K8s operator should consider 
namespace)

> Improve the creation of ClusterRole in Flink K8s operator
> -
>
> Key: FLINK-27109
> URL: https://issues.apache.org/jira/browse/FLINK-27109
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Biao Geng
>Priority: Major
>
> As the k8s 
> [doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example]
>  says, a ClusterRole is a non-namespaced resource. 
> In our helm chart, we currently define the ClusterRole with the name 
> 'flink-operator', and the namespace field in its metadata is omitted. As a 
> result, if a user wants to install multiple flink-kubernetes-operators in 
> different namespaces, the ClusterRole 'flink-operator' will be created 
> multiple times. 
> Errors like
> {quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that 
> already exists. Unable to continue with install: ClusterRole "flink-operator" 
> in namespace "" exists and cannot be imported into the current release: 
> invalid ownership metadata; annotation validation error: key 
> "meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current 
> value is "default"
> {quote}
> will be thrown.
> One solution could be adding the namespace as a postfix to the name of the 
> ClusterRole.
> Another possible solution is to add an if-else check to avoid creating a 
> resource that already exists.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-27109) The naming pattern of ClusterRole in Flink K8s operator should consider namespace

2022-04-07 Thread Biao Geng (Jira)
Biao Geng created FLINK-27109:
-

 Summary: The naming pattern of ClusterRole in Flink K8s operator 
should consider namespace
 Key: FLINK-27109
 URL: https://issues.apache.org/jira/browse/FLINK-27109
 Project: Flink
  Issue Type: Improvement
  Components: Kubernetes Operator
Reporter: Biao Geng


As the k8s 
[doc|https://kubernetes.io/docs/reference/access-authn-authz/rbac/#clusterrole-example]
 says, a ClusterRole is a non-namespaced resource. 
In our helm chart, we currently define the ClusterRole with the name 
'flink-operator', and the namespace field in its metadata is omitted. As a 
result, if a user wants to install multiple flink-kubernetes-operators in 
different namespaces, the ClusterRole 'flink-operator' will be created 
multiple times. 
Errors like
{quote}Error: INSTALLATION FAILED: rendered manifests contain a resource that 
already exists. Unable to continue with install: ClusterRole "flink-operator" 
in namespace "" exists and cannot be imported into the current release: invalid 
ownership metadata; annotation validation error: key 
"meta.helm.sh/release-namespace" must equal "c-8725bcef1dc84d6f": current value 
is "default"
{quote}
will be thrown.

One solution could be adding the namespace as a postfix to the name of the 
ClusterRole.
Another possible solution is to add an if-else check to avoid creating a 
resource that already exists.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-26611) Document operator config options

2022-04-02 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17516254#comment-17516254
 ] 

Biao Geng commented on FLINK-26611:
---

[~gyfora] I see this jira is open and in the plan for the 0.1.0 version. If no 
one has started on it, I can take it.


> Document operator config options
> 
>
> Key: FLINK-26611
> URL: https://issues.apache.org/jira/browse/FLINK-26611
> Project: Flink
>  Issue Type: Sub-task
>  Components: Kubernetes Operator
>Reporter: Gyula Fora
>Priority: Major
> Fix For: kubernetes-operator-0.1.0
>
>
> Similar to how other flink configs are documented, we should also document 
> the configs found in OperatorConfigOptions.
> Generating the documentation might be possible, but maybe it's easier to do 
> it manually.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Created] (FLINK-27009) Support SQL job submission in flink kubernetes operator

2022-04-02 Thread Biao Geng (Jira)
Biao Geng created FLINK-27009:
-

 Summary: Support SQL job submission in flink kubernetes operator
 Key: FLINK-27009
 URL: https://issues.apache.org/jira/browse/FLINK-27009
 Project: Flink
  Issue Type: New Feature
  Components: Kubernetes Operator
Reporter: Biao Geng


Currently, the flink kubernetes operator supports jar jobs using an application 
or session cluster. For SQL jobs, there is no out-of-the-box solution in the 
operator.  
One simple, short-term solution is to wrap the SQL script into a jar job using 
the Table API, with some limitations.
The long-term solution may work with 
[FLINK-26541|https://issues.apache.org/jira/browse/FLINK-26541] to achieve 
full support.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-26892) Observe current status before validating CR changes

2022-03-29 Thread Biao Geng (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-26892?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17513971#comment-17513971
 ] 

Biao Geng commented on FLINK-26892:
---

Let me try to summarize what we should do for this case:
1. Save the latest validated config ({{lastReconciledSpec}} should already be 
sufficient.)
2. Adjust the current logic to observe with the latest validated config -> 
validate the new config -> reconcile with the new config.

Are there any drawbacks to the above idea? If it is fine, I can take this 
ticket and work on it.
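The proposed ordering can be sketched as follows (method names here are placeholders for illustration, not the controller's real API):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed controller order: observe first with the
// last validated config, so status is still updated even when the new CR
// change fails validation.
class ControllerLoopSketch {
    final List<String> actions = new ArrayList<>();

    void observe(String config) {
        actions.add("observe:" + config);
    }

    boolean validate(String newSpec) {
        return newSpec != null && !newSpec.isEmpty(); // stand-in validation rule
    }

    void reconcile(String spec) {
        actions.add("reconcile:" + spec);
    }

    void runLoop(String lastReconciledSpec, String newSpec) {
        observe(lastReconciledSpec);   // 1. observe with the previous config
        if (!validate(newSpec)) {      // 2. validate the new CR change
            return;                    //    invalid: status was still observed
        }
        reconcile(newSpec);            // 3. reconcile with the new config
    }
}
```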

> Observe current status before validating CR changes
> ---
>
> Key: FLINK-26892
> URL: https://issues.apache.org/jira/browse/FLINK-26892
> Project: Flink
>  Issue Type: Improvement
>  Components: Kubernetes Operator
>Reporter: Gyula Fora
>Priority: Major
> Fix For: kubernetes-operator-1.0.0
>
>
> Currently, validation is the first step in the controller loop, which means 
> that when there is a validation error we fail to observe the status of 
> currently running deployments.
> We should change the order of operations and observe before validation. 
> Furthermore, observe should use the previous configuration, not the one 
> after the CR change.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

