Re: Understanding RocksDBStateBackend in Flink on Yarn on AWS EMR

2024-03-26 Thread Yang Wang
Usually, you should use the HDFS nameservice instead of the NameNode
hostname:port, to avoid problems during NN failover.
You can find the configured nameservice in hdfs-site.xml, under the
key *dfs.nameservices*.
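As a sketch of what that lookup could look like (the /etc/hadoop/conf path and the "mycluster" nameservice ID are assumptions; on a live EMR primary node `hdfs getconf -confKey dfs.nameservices` would print the real value), run here against a sample file so it is self-contained:

```shell
# Sample hdfs-site.xml snippet; on EMR the real file usually lives at
# /etc/hadoop/conf/hdfs-site.xml (assumed path).
cat > /tmp/hdfs-site.xml <<'EOF'
<configuration>
  <property>
    <name>dfs.nameservices</name>
    <value>mycluster</value>
  </property>
</configuration>
EOF

# Extract the value that follows the dfs.nameservices key.
ns=$(grep -A1 '<name>dfs.nameservices</name>' /tmp/hdfs-site.xml \
     | sed -n 's/.*<value>\(.*\)<\/value>.*/\1/p')
echo "state.checkpoints.dir: hdfs://${ns}/flink-checkpoints"
# -> state.checkpoints.dir: hdfs://mycluster/flink-checkpoints
```

With the nameservice in place, the checkpoint path no longer hard-codes a single NameNode host, so an NN failover does not break it.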


Best,
Yang

On Fri, Mar 22, 2024 at 8:33 PM Sachin Mittal  wrote:

> So, when we create an EMR cluster, the NN service runs on the primary node
> of the cluster.
> Now, at the time of creating the cluster, how can we specify the name of
> this NN in the format hdfs://*namenode-host*:8020/ ?
>
> Is there a standard name by which we can identify the NN server?
>
> Thanks
> Sachin
>
>
> On Fri, Mar 22, 2024 at 12:08 PM Asimansu Bera 
> wrote:
>
>> Hello Sachin,
>>
>> Typically, Cloud VMs are ephemeral, meaning that if the EMR cluster goes
>> down or VMs are required to be shut down for security updates or due to
>> faults, new VMs will be added to the cluster. As a result, any data stored
>> in the local file system, such as file:///tmp, would be lost. To ensure data
>> persistence and prevent loss of checkpoint or savepoint data for recovery,
>> it is advisable to store such data in a persistent storage solution like
>> HDFS or S3.
>>
>> Generally, the Hadoop NN on EMR runs on port 8020. You may find the NN IP
>> details from the EMR service.
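A hedged sketch of turning that into a Flink setting (the core-site.xml contents and host below are made up; on a real primary node `hdfs getconf -confKey fs.defaultFS` would return the actual endpoint):

```shell
# Sample core-site.xml; host/port are placeholders for an EMR primary node.
# On a live node you could instead run: hdfs getconf -confKey fs.defaultFS
cat > /tmp/core-site.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://ip-10-0-0-1.ec2.internal:8020</value>
  </property>
</configuration>
EOF

# Pull out the hdfs:// endpoint and reuse it for Flink checkpoints.
fs=$(sed -n 's/.*<value>\(hdfs:[^<]*\)<\/value>.*/\1/p' /tmp/core-site.xml)
echo "state.checkpoints.dir: ${fs}/flink-checkpoints"
# -> state.checkpoints.dir: hdfs://ip-10-0-0-1.ec2.internal:8020/flink-checkpoints
```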
>>
>> Hope this helps.
>>
>> -A
>>
>>
>> On Thu, Mar 21, 2024 at 10:54 PM Sachin Mittal 
>> wrote:
>>
>>> Hi,
>>> We are using AWS EMR, where we can submit our Flink jobs to a
>>> long-running Flink cluster on YARN.
>>>
>>> We wanted to configure RocksDBStateBackend as our state backend to store
>>> our checkpoints.
>>>
>>> So we have configured the following properties in our flink-conf.yaml:
>>>
>>>- state.backend.type: rocksdb
>>>- state.checkpoints.dir: file:///tmp
>>>- state.backend.incremental: true
>>>
>>>
>>> My question here is regarding the checkpoint location: what is the
>>> difference between using a local filesystem vs. a Hadoop
>>> distributed file system (HDFS)?
>>>
>>> What advantages do we get if we use:
>>>
>>> *state.checkpoints.dir*: hdfs://namenode-host:port/flink-checkpoints
>>> vs
>>> *state.checkpoints.dir*: file:///tmp
>>>
>>> Also, if we decide to use HDFS, then where can we get the value for
>>> *namenode-host:port*,
>>> given we are running Flink on EMR?
>>>
>>> Thanks
>>> Sachin
>>>
>>>
>>>


Re: Jobmanager restart after it has been requested to stop

2024-02-02 Thread Yang Wang
If you could find "Deregistering Flink Kubernetes cluster, clusterId"
in the JobManager log, then it is not the expected behavior.

Having the full logs of the JobManager pod before the restart will help a lot.
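Something like the following could be used to capture them (pod and namespace names are placeholders; the sample log line is illustrative, not copied from a real run):

```shell
# Real commands, commented out because they need cluster access:
#   kubectl logs <jobmanager-pod> -n <namespace> --previous > /tmp/jm-previous.log
#   kubectl get events -n <namespace> --sort-by=.lastTimestamp
# Demo of the check itself against a fabricated log line:
cat > /tmp/jm-previous.log <<'EOF'
2024-02-01 21:57:49,000 INFO  ... Deregistering Flink Kubernetes cluster, clusterId: my-cluster
EOF
if grep -q 'Deregistering Flink Kubernetes cluster' /tmp/jm-previous.log; then
  echo "deregistration logged -- a restart after this is unexpected"
fi
```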



Best,
Yang

On Fri, Feb 2, 2024 at 1:26 PM Liting Liu (litiliu) via user <
user@flink.apache.org> wrote:

> Hi, community:
> I'm running a Flink 1.14.3 job with flink-kubernetes-operator-1.6.0 on
> AWS. I found that my Flink jobmanager container's process restarted after this
> FlinkDeployment had been requested to stop; here is the log of the jobmanager:
>
> 2024-02-01 21:57:48,977 tn="flink-akka.actor.default-dispatcher-107478"
> INFO
>  org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap
> [] - Application CANCELED:
> java.util.concurrent.CompletionException:
> org.apache.flink.client.deployment.application.UnsuccessfulExecutionException:
> Application Status: CANCELED
> at
> org.apache.flink.client.deployment.application.ApplicationDispatcherBootstrap.lambda$unwrapJobResultException$6(ApplicationDispatcherBootstrap.java:353)
> ~[flink-dist_2.11-1.14.3.jar:1.14.3]
> at
> java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:616)
> ~[?:1.8.0_322]
> 2024-02-01 21:57:48,984 tn="flink-akka.actor.default-dispatcher-107484"
> INFO  org.apache.flink.runtime.jobmaster.MiniDispatcherRestEndpoint [] -
> Shutting down rest endpoint.
> 2024-02-01 21:57:49,103 tn="flink-akka.actor.default-dispatcher-107478"
> INFO
>  
> org.apache.flink.runtime.entrypoint.component.DispatcherResourceManagerComponent
> [] - Closing components.
> 2024-02-01 21:57:49,105 tn="flink-akka.actor.default-dispatcher-107484"
> INFO  org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] -
> Stopped dispatcher akka.tcp://flink@
> 2024-02-01 21:57:49,112
> tn="AkkaRpcService-Supervisor-Termination-Future-Executor-thread-1" INFO
>  org.apache.flink.runtime.rpc.akka.AkkaRpcService [] - Stopping
> Akka RPC service.
> 2024-02-01 21:57:49,286 tn="flink-metrics-15" INFO
>  akka.remote.RemoteActorRefProvider$RemotingTerminator[] - Remoting
> shut down.
> 2024-02-01 21:57:49,387 tn="main" INFO
>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint[] -
> Terminating cluster entrypoint process
> KubernetesApplicationClusterEntrypoint with exit code 0.
> 2024-02-01 21:57:53,828 tn="main" INFO
>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint[] -
> -Dlog4j.configurationFile=file:/opt/flink/conf/log4j-console.properties
> 2024-02-01 21:57:54,287 tn="main" INFO
>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint[] - Starting
> KubernetesApplicationClusterEntrypoint.
>
>
> I found that the JM main container's containerId remains the same after the
> JM auto-restart.
> Why did this process start to run after it had been requested to stop?
>
>


Re: [DISCUSS] Hadoop 2 vs Hadoop 3 usage

2024-01-15 Thread Yang Wang
I could share some metrics about Alibaba Cloud EMR clusters.
The ratio of Hadoop 2 vs. Hadoop 3 is 1:3.


Best,
Yang

On Thu, Dec 28, 2023 at 8:16 PM Martijn Visser 
wrote:

> Hi all,
>
> I want to get some insights on how many users are still using Hadoop 2
> vs. how many users are using Hadoop 3. Flink currently requires a
> minimum version of Hadoop 2.10.2 for certain features, but also
> extensively uses Hadoop 3 (for example, for the file system implementations).
>
> Hadoop 2 has a large number of direct and indirect vulnerabilities
> [1]. Most of them can only be resolved by dropping support for Hadoop
> 2 and upgrading to a Hadoop 3 version. This thread is primarily to get
> more insights if Hadoop 2 is still commonly used, or if we can
> actually discuss dropping support for Hadoop 2 in Flink.
>
> Best regards,
>
> Martijn
>
> [1]
> https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-common/2.10.2
>


Re: Flink HA with Zookeeper and Docker Compose: unable to startup a working setup.

2024-01-15 Thread Yang Wang
Could you please configure the same HA configurations for TaskManager as
well?
It seems that the TaskManager container does not use a correct URL when
contacting the ResourceManager.
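For illustration, the shared settings could look like this (the quorum host, storage path, and cluster id are placeholders); in Docker Compose the same snippet would typically go into the FLINK_PROPERTIES environment variable of both the jobmanager and taskmanager services:

```shell
# Write the HA settings once; JM and TM containers must see identical values,
# otherwise the TM registers against a stale leader and hits fencing-token errors.
cat > /tmp/flink-ha.properties <<'EOF'
high-availability: zookeeper
high-availability.zookeeper.quorum: zookeeper:2181
high-availability.storageDir: file:///flink/ha
high-availability.cluster-id: /my-cluster
EOF
grep -c '^high-availability' /tmp/flink-ha.properties
# -> 4
```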


Best,
Yang

On Fri, Dec 29, 2023 at 11:13 PM Alessio Bernesco Làvore <
alessio.berne...@gmail.com> wrote:

> Hello,
> I'm trying to set up a testing environment using:
>
> - Flink HA with Zookeeper
> - Docker Compose
>
> While starting, the TaskManager generates an exception and then, after some
> restarts, it fails.
>
> The exception is:
> "Caused by: org.apache.flink.runtime.rpc.exceptions.FencingTokenException:
> Fencing token mismatch: Ignoring message
> RemoteFencedMessage(,
> RemoteRpcInvocation(ResourceManagerGateway.registerTaskExecutor(TaskExecutorRegistration,
> Time))) because the fencing token  did not
> match the expected fencing token ad8c271c31d576247b6c93a5e4ac4da6."
>
> I'm unable to find any information. I've already posted a complete request
> on StackOverflow, with all the related information:
> https://stackoverflow.com/questions/77689872/flink-ha-with-zookeeper-and-docker-compose-fencingtokenexception-fencing-token
>
> Any help would be appreciated.
>
> Greetings,
> Alessio
>


Re: Deploying the K8S operator sample on GKE Autopilot : Association with remote system [akka.tcp://flink@basic-example.default:6123] has failed,

2024-01-15 Thread Yang Wang
Could you please directly use the JobManager Pod IP address instead of the K8s
service name (basic-example.default) and have a try with curl/wget?

It seems that the JobManager K8s service could not be accessed.
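A sketch of that probe (the label selector, namespace, and IP below are assumptions — the IP is just the one from the DNS resolution in the logs further down):

```shell
# Needs cluster access, so shown commented out:
#   JM_IP=$(kubectl get pod -n default -l component=jobmanager \
#           -o jsonpath='{.items[0].status.podIP}')
#   wget -O- --timeout=5 "http://${JM_IP}:6123/"   # 6123 is RPC, so expect a
#                                                  # TCP connect, not an HTTP reply
# Constructing the probe target from an assumed pod IP:
JM_IP="100.64.3.182"
echo "probe http://${JM_IP}:6123/ from inside the taskmanager pod"
```

If the connection succeeds against the pod IP but not against the service name, the problem is in the service/DNS layer rather than in the JobManager itself.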

Best,
Yang

On Sat, Jan 13, 2024 at 1:24 AM LINZ, Arnaud 
wrote:

> Hi,
>
> Some more tests results from the task/job managers pod :
>
>
>
> From task manager I cannot connect to job manager:
>
> root@basic-example-taskmanager-5d54f9f94-rbcr4:/opt/flink# wget
> basic-example.default:6123
>
> --2024-01-12 15:16:15--  http://basic-example.default:6123/
>
> Resolving basic-example.default (basic-example.default)... 100.64.3.182
>
> Connecting to basic-example.default
> (basic-example.default)|100.64.3.182|:6123... ^C
>
>
>
> From job manager I can (DNS is OK, same IP is given) :
>
> root@basic-example-57774f887d-6bht8:/opt/flink# wget
> basic-example.default:6123
>
> --2024-01-12 15:16:25--  http://basic-example.default:6123/
>
> Resolving basic-example.default (basic-example.default)... 100.64.3.182
>
> Connecting to basic-example.default
> (basic-example.default)|100.64.3.182|:6123... connected.
>
> HTTP request sent, awaiting response... No data received.
>
> Retrying.
>
>
>
> However, the services are created:
>
> basic-example        ClusterIP   None             6123/TCP,6124/TCP   2s
>
> basic-example-rest   ClusterIP   100.87.240.180   8081/TCP            2s
>
>
>
> Maybe the job manager only listens on localhost instead of 0.0.0.0 or its
> real IP? Is that something I have control over?
>
> Thanks,
>
> Arnaud
>
>
>
> *From:* LINZ, Arnaud
> *Sent:* Friday, January 12, 2024 2:07 PM
> *To:* user@flink.apache.org
> *Subject:* FW: Deploying the K8S operator sample on GKE Autopilot :
> Association with remote system [akka.tcp://flink@basic-example.default:6123]
> has failed,
>
>
>
> Hello,
>
>
>
> I am trying to follow the “quickstart” guide on a GKE Autopilot k8s
> cluster.
>
>
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/try-flink-kubernetes-operator/quick-start/
>
> I could install the operator (without webhook) without issue ; however,
> when running
>
> kubectl create -f
> https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.7/examples/basic.yaml
>
>
>
> The job does not work because the task manager does not reach the job
> manager (maybe a DNS issue?). Is there some special dns/network
> configuration to perform in GKE? Has anybody already made it work?
>
> Thanks,
>
> Arnaud
>
>
>
> Log in job manager is :
>
> 2024-01-12 11:01:56,878 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Source:
> Custom Source (1/2)
> (c2bf83a958eaf6701eb2eebbfadc8e2c_bc764cd8ddf7a0cff126f51c16239658_0_2)
> switched from CREATED to SCHEDULED.
>
> 2024-01-12 11:01:56,878 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Source:
> Custom Source (2/2)
> (c2bf83a958eaf6701eb2eebbfadc8e2c_bc764cd8ddf7a0cff126f51c16239658_1_2)
> switched from CREATED to SCHEDULED.
>
> 2024-01-12 11:01:56,878 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Flat Map
> -> Sink: Print to Std. Out (1/2)
> (c2bf83a958eaf6701eb2eebbfadc8e2c_20ba6b65f97481d5570070de90e4e791_0_2)
> switched from CREATED to SCHEDULED.
>
> 2024-01-12 11:01:56,878 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph   [] - Flat Map
> -> Sink: Print to Std. Out (2/2)
> (c2bf83a958eaf6701eb2eebbfadc8e2c_20ba6b65f97481d5570070de90e4e791_1_2)
> switched from CREATED to SCHEDULED.
>
> 2024-01-12 11:01:56,879 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Received resource requirements from job
> 096668d0039ed54215ae334b5d89aa82:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=1}]
>
> 2024-01-12 11:01:56,880 INFO
> org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager
> [] - Received resource requirements from job
> 096668d0039ed54215ae334b5d89aa82:
> [ResourceRequirement{resourceProfile=ResourceProfile{UNKNOWN},
> numberOfRequiredSlots=2}]
>
> 2024-01-12 11:01:56,902 INFO
> org.apache.flink.runtime.checkpoint.CheckpointFailureManager [] - Failed to
> trigger checkpoint for job 096668d0039ed54215ae334b5d89aa82 since
> Checkpoint triggering task Source: Custom Source (1/2) of job
> 096668d0039ed54215ae334b5d89aa82 is not being executed at the moment.
> Aborting checkpoint. Failure reason: Not all required tasks are currently
> running..
>
> 2024-01-12 11:01:57,014 INFO
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
> need request 1 new workers, current worker number 0, declared worker number
> 1
>
> 2024-01-12 11:01:57,015 INFO
> org.apache.flink.runtime.resourcemanager.active.ActiveResourceManager [] -
> Requesting new worker with resource spec WorkerResourceSpec {cpuCores=1.0,
> taskHeapSize=537.600mb (563714445 bytes), taskOffHeapSize=0 bytes,
> networkMemSize=158.7

Re: Flink Kubernetes HA

2024-01-15 Thread Yang Wang
The fabric8 K8s client uses PATCH to replace get-and-update as of v6.6.2.
That's why you also need to grant the PATCH permission to the K8s service
account.
This helps to decrease the pressure on the K8s APIServer. You can find
more information here[1].

[1]. https://issues.apache.org/jira/browse/FLINK-32678
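For illustration, the ConfigMap rules could look like this (the role name and namespace are placeholders; the verbs list is the point):

```shell
cat > /tmp/flink-ha-role.yaml <<'EOF'
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: flink-ha-role   # placeholder name
rules:
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
EOF
# Apply with (placeholder namespace): kubectl apply -n flink -f /tmp/flink-ha-role.yaml
grep -o '"patch"' /tmp/flink-ha-role.yaml
# -> "patch"
```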


Best,
Yang

On Wed, Dec 6, 2023 at 10:54 PM Ethan T Yang  wrote:

> Never mind. The issue was fixed: the service account permission was
> missing the “patch” verb, which led to the RPC service not being started.
>
> On Dec 5, 2023, at 1:40 PM, Ethan T Yang  wrote:
>
> Hi Flink users,
> After upgrading Flink (from 1.13.1 -> 1.18.0), I noticed an issue
> when HA is enabled (see exception below). I am using a k8s deployment and I
> cleaned the previous configmaps, like leader files etc. I know that Pekko is a
> recent thing. Can someone share a doc on how to use or set it? When I
> disabled HA, the deployment was successful. I also noticed a new configmap
> called “-cluster-config-map”; can someone provide a reference on what
> it is for? I don’t see it in the 1.13.1 version.
>
> Thanks a lot
> Ivan
>
>
> org.apache.flink.runtime.rpc.exceptions.RecipientUnreachableException:
> Could not send message
> [LocalRpcInvocation(RestfulGateway.requestMultipleJobDetails(Time))] from
> sender [unknown] to recipient [pe
> kko.tcp://flink@flink-secondary-jobmanager:6123/user/rpc/dispatcher_1],
> because the recipient is unreachable. This can either mean that the
> recipient has been terminated or that the remote RpcService i
> s currently not reachable.
> at com.sun.proxy.$Proxy55.requestMultipleJobDetails(Unknown Source) ~[?:?]
> at
> org.apache.flink.runtime.rest.handler.job.JobsOverviewHandler.handleRequest(JobsOverviewHandler.java:65)
> ~[flink-dist-1.18.0.jar:1.18.0]
> at
> org.apache.flink.runtime.rest.handler.AbstractRestHandler.respondToRequest(AbstractRestHandler.java:83)
> ~[flink-dist-1.18.0.jar:1.18.0]
> at
> org.apache.flink.runtime.rest.handler.AbstractHandler.respondAsLeader(AbstractHandler.java:196)
> ~[flink-dist-1.18.0.jar:1.18.0]
> at
> org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.lambda$channelRead0$0(LeaderRetrievalHandler.java:83)
> ~[flink-dist-1.18.0.jar:1.18.0]
> at java.util.Optional.ifPresent(Unknown Source) [?:?]
> at
> org.apache.flink.util.OptionalConsumer.ifPresent(OptionalConsumer.java:45)
> [flink-dist-1.18.0.jar:1.18.0]
> at
> org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.channelRead0(LeaderRetrievalHandler.java:80)
> [flink-dist-1.18.0.jar:1.18.0]
> at
> org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.channelRead0(LeaderRetrievalHandler.java:49)
> [flink-dist-1.18.0.jar:1.18.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
> [flink-dist-1.18.0.jar:1.18.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> [flink-dist-1.18.0.jar:1.18.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> [flink-dist-1.18.0.jar:1.18.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> [flink-dist-1.18.0.jar:1.18.0]
> at
> org.apache.flink.runtime.rest.handler.router.RouterHandler.routed(RouterHandler.java:115)
> [flink-dist-1.18.0.jar:1.18.0]
> at
> org.apache.flink.runtime.rest.handler.router.RouterHandler.channelRead0(RouterHandler.java:94)
> [flink-dist-1.18.0.jar:1.18.0]
> at
> org.apache.flink.runtime.rest.handler.router.RouterHandler.channelRead0(RouterHandler.java:55)
> [flink-dist-1.18.0.jar:1.18.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
> [flink-dist-1.18.0.jar:1.18.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> [flink-dist-1.18.0.jar:1.18.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> [flink-dist-1.18.0.jar:1.18.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> [flink-dist-1.18.0.jar:1.18.0]
> at
> org.apache.flink.shaded.netty4.io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
> [flink-dist-1.18.0.jar:1.18.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> [flink-dist-1.18.0.jar:1.18.0]
> at
> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> [flink-dist-1.18.0.jar:1.18.0]
> at
> org.apache

Re: Default Log4j properties in Native Kubernetes

2023-06-20 Thread Yang Wang
I assume you are using "*bin/flink run-application*" to submit a Flink
application to the K8s cluster. Then you can simply
update your local log4j-console.properties; it will be shipped and mounted
to the JobManager/TaskManager pods via a ConfigMap.
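A sketch of that workflow (FLINK_HOME and the jar path are placeholders):

```shell
# Edit the client-local conf; run-application ships this directory's
# log4j-console.properties to the pods via the flink-config ConfigMap.
FLINK_HOME=/tmp/flink-demo
mkdir -p "$FLINK_HOME/conf"
cat > "$FLINK_HOME/conf/log4j-console.properties" <<'EOF'
rootLogger.level = DEBUG
rootLogger.appenderRef.console.ref = ConsoleAppender
EOF
# Real submission (needs a Flink dist and a K8s cluster):
#   "$FLINK_HOME"/bin/flink run-application -t kubernetes-application \
#       -Dkubernetes.cluster-id=my-app local:///opt/flink/usrlib/my-job.jar
grep 'rootLogger.level' "$FLINK_HOME/conf/log4j-console.properties"
# -> rootLogger.level = DEBUG
```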

Best,
Yang

Vladislav Keda  wrote on Tue, Jun 20, 2023 at 22:15:

> Hi all again!
>
> Please tell me if you can answer my question, thanks.
>
> ---
>
> Best Regards,
> Vladislav Keda
>
> On Fri, Jun 16, 2023 at 16:12, Vladislav Keda <
> vladislav.k...@glowbyteconsulting.com> wrote:
>
>> Hi all!
>>
>> Is it possible to change Flink *log4j-console.properties* in Native
>> Kubernetes (for example in Kubernetes Application mode) without rebuilding
>> the application docker image?
>>
>> I was trying to inject a .sh script call (in the attachment) before
>> /docker-entrypoint.sh, but this workaround did not work (k8s gives me an
>> exception that the log4j* files are write-locked because there is a
>> configmap over them).
>>
>> Is there another way to change log4j* files?
>>
>> Thank you very much in advance!
>>
>> Best Regards,
>> Vladislav Keda
>>
>


Re: "Error while retrieving the leader gateway" when using Kubernetes HA

2023-01-30 Thread Yang Wang
I assume you are using the standalone mode, right?

For the native K8s mode, the leader address should be
*akka.tcp://flink@JM_POD_IP:6123/user/rpc/dispatcher_1* when HA is enabled.


Best,
Yang

Anton Ippolitov via user  wrote on Tue, Jan 31, 2023 at 00:21:

> This is actually what I'm already doing, I'm only setting high-availability:
> kubernetes myself. The other values are either defaults or set by the
> Operator:
> - jobmanager.rpc.port: 6123 is the default value (docs)
> - high-availability.jobmanager.port: 6123 is set by the Operator here
> - jobmanager.rpc.address: SERVICE-NAME-HERE.NAMESPACE-HERE is set by the
> Operator here (the actual code which gets executed is here)
>
> Looking at what the Lyft Operator is doing here, I thought
> this would be a common issue, but since you've never seen this error before,
> I'm not sure what to do 🤔
>
> On Fri, Jan 27, 2023 at 10:59 PM Gyula Fóra  wrote:
>
>> We never encountered this problem before but also we don't configure
>> those settings.
>> Can you simply try:
>>
>> high-availability: kubernetes
>>
>> And remove the other configs? I think that can only cause problems and
>> should not achieve anything :)
>>
>> Gyula
>>
>> On Fri, Jan 27, 2023 at 6:44 PM Anton Ippolitov via user <
>> user@flink.apache.org> wrote:
>>
>>> Hi everyone,
>>>
>>> I've been experimenting with Kubernetes HA and the Kubernetes Operator
>>> and ran into the following issue which is happening regularly on
>>> TaskManagers with Flink 1.16:
>>>
>>> Error while retrieving the leader gateway. Retrying to connect to 
>>> akka.tcp://flink@SERVICE-NAME-HERE.NAMESPACE-HERE:6123/user/rpc/dispatcher_1.
>>> org.apache.flink.util.concurrent.FutureUtils$RetryException: Could not 
>>> complete the operation. Number of retries has been exhausted.
>>>
>>> (The whole stacktrace is quite long, I put it in a Github Gist here
>>> .
>>> Note that I put placeholder values for the Kubernetes Service name and the
>>> Namespace name)
>>>
>>> The job configuration has the following values which should be relevant:
>>> high-availability: kubernetes
>>> high-availability.jobmanager.port: 6123
>>> jobmanager.rpc.address: SERVICE-NAME-HERE.NAMESPACE-HERE
>>> jobmanager.rpc.port: 6123
>>>
>>> Looking a bit more into the logs, I can see that the Akka Actor System
>>> is started with an external address pointing to the Kubernetes Service
>>> defined by jobmanager.rpc.address:
>>> Trying to start actor system, external
>>> address SERVICE-NAME-HERE.NAMESPACE-HERE:6123, bind address 0.0.0.0:6123
>>> .
>>> Actor system started at akka.tcp://flink@SERVICE-NAME-HERE.NAMESPACE-HERE
>>> :6123
>>>
>>> (I believe the external address for the Akka Actor System is set to
>>> jobmanager.rpc.address from this place
>>> in the code, but I might be wrong)
>>>
>>> I can also see these logs for the Dispatcher RPC endpoint:
>>> Starting RPC endpoint for
>>> org.apache.flink.runtime.dispatcher.StandaloneDispatcher at
>>> akka://flink/user/rpc/dispatcher_1 .
>>> Successfully wrote leader information
>>> LeaderInformation{leaderSessionID='8fd2bda3-1775-4b51-bf63-8da385247a18',
>>> leaderAddress=akka.tcp://flink@SERVICE-NAME-HERE.NAMESPACE-HERE:6123/user/rpc/dispatcher_1}
>>> for leader dispatcher into the config map JOB-NAME-HERE-cluster-config-map.
>>>
>>> I confirmed that the HA ConfigMap contains an address which also uses
>>> the Kubernetes Service defined by jobmanager.rpc.address:
>>> $ kubectl get cm JOB-NAME-HERE-cluster-config-map -o json | jq -r
>>> '.data["org.apache.flink.k8s.leader.dispatcher"]'
>>>
>>> ce33b6d4-a55f-475c-9b6e-b21d25c8e6b5,akka.tcp://flink@SERVICE-NAME-HERE.NAMESPACE-HERE
>>> :6123/user/rpc/dispatcher_1
>>>
>>> When looking at the code of the Operator and Flink itself, I can see
>>> that jobmanager.rpc.address is set automatically by the
>>> InternalServi

Re: Apache Beam MinimalWordCount on Flink on Kubernetes using Flink Kubernetes Operator on GCP

2023-01-17 Thread Yang Wang
The "JAR file does not exist" exception comes from the JobManager side, not
from the client.
Please be aware that the local:// scheme in the jarURI refers to a path
inside the JobManager pod.

You could use an init-container to download your user jar and mount it into
the JobManager main-container.
Refer to the examples[1] for more information.

Of course, you could also build your own Flink image (NOT the
flink-kubernetes-operator image) with the user jar bundled[2].

[1].
https://github.com/apache/flink-kubernetes-operator/blob/main/examples/pod-template.yaml#L65
[2].
https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/resource-providers/native_kubernetes/#application-mode
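A sketch of the init-container approach, modelled on the pod-template example linked above (image, download URL, and file names are placeholders):

```shell
# An init container fetches the user jar into a shared volume, which the
# main container mounts at /opt/flink/usrlib -- so jarURI can use local://.
cat > /tmp/pod-template.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: pod-template
spec:
  initContainers:
    - name: fetch-job-jar
      image: busybox:1.36
      command: ["wget", "-O", "/flink-artifact/job.jar", "https://example.com/job.jar"]
      volumeMounts:
        - mountPath: /flink-artifact
          name: flink-artifact
  containers:
    - name: flink-main-container
      volumeMounts:
        - mountPath: /opt/flink/usrlib
          name: flink-artifact
  volumes:
    - name: flink-artifact
      emptyDir: {}
EOF
# jarURI in the FlinkDeployment would then be: local:///opt/flink/usrlib/job.jar
grep -q 'flink-main-container' /tmp/pod-template.yaml && echo "template ok"
# -> template ok
```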


Best,
Yang


Lee Parayno  wrote on Wed, Jan 18, 2023 at 03:41:

> I have a Kubernetes cluster in GCP running the Flink Kubernetes Operator.
>
> I'm trying to package a project with the Apache Beam MinimalWordCount
> using the Flink Runner as a FlinkDeployment to the Kubernetes Cluster
>
> Job Docker image created with this Dockerfile:
>
> FROM flink
>
> ENV FLINK_CLASSPATH /opt/flink/lib/*
> ENV CLASSPATH /opt/flink/lib/*
>
> # Add Google Dependencies
> ADD
> https://repo1.maven.org/maven2/com/google/guava/guava/31.1-jre/guava-31.1-jre.jar
> /opt/flink/lib/
>
> # Add Google Cloud Platform Dependencies
> ADD
> https://repo1.maven.org/maven2/com/google/cloud/google-cloud-core/2.9.0/google-cloud-core-2.9.0.jar
> /opt/flink/lib/
> ADD
> https://repo1.maven.org/maven2/com/google/cloud/google-cloud-core-http/2.9.0/google-cloud-core-http-2.9.0.jar
> /opt/flink/lib/
> ADD
> https://repo1.maven.org/maven2/com/google/cloud/google-cloud-core-grpc/2.9.0/google-cloud-core-grpc-2.9.0.jar
> /opt/flink/lib/
>
> # Add dependencies for accessing Google Cloud Storage
>
> ADD
> https://repo1.maven.org/maven2/com/google/cloud/google-cloud-storage/2.9.3/google-cloud-storage-2.9.3.jar
> /opt/flink/lib/
> ADD
> https://repo1.maven.org/maven2/com/google/auth/google-auth-library-oauth2-http/1.9.0/google-auth-library-oauth2-http-1.9.0.jar
> /opt/flink/lib/
> ADD
> https://repo1.maven.org/maven2/com/google/http-client/google-http-client/1.42.3/google-http-client-1.42.3.jar
> /opt/flink/lib/
>
> # Apache Beam
> ADD
> https://repo1.maven.org/maven2/org/apache/beam/beam-sdks-java-core/2.43.0/beam-sdks-java-core-2.43.0.jar
> /opt/flink/lib/
> ADD
> https://repo1.maven.org/maven2/org/apache/beam/beam-runners-direct-java/2.43.0/beam-runners-direct-java-2.43.0.jar
> /opt/flink/lib/
> ADD
> https://repo1.maven.org/maven2/org/apache/beam/beam-sdks-java-extensions-google-cloud-platform-core/2.43.0/beam-sdks-java-extensions-google-cloud-platform-core-2.43.0.jar
> /opt/flink/lib/
> ADD
> https://repo1.maven.org/maven2/org/apache/beam/beam-runners-flink_2.11/2.16.0/beam-runners-flink_2.11-2.16.0.jar
> /opt/flink/lib/
>
>
> ADD target/helloworld-bundled-1.0-SNAPSHOT.jar /opt/flink/lib/
>
> This is the yaml for the FlinkDeployment:
> apiVersion: flink.apache.org/v1beta1
> kind: FlinkDeployment
> metadata:
>   name: minimal-word-count2
> spec:
>   #image: flink:1.15
>   image: /flink_with_minimal_word_count
>   flinkVersion: v1_16
>   flinkConfiguration:
>     taskmanager.numberOfTaskSlots: "1"
>   serviceAccount: flink
>   ingress:
>     template: "flink.k8s.io/{{namespace}}/{{name}}(/|$)(.*)"
>     className: "nginx"
>     annotations:
>       nginx.ingress.kubernetes.io/rewrite-target: "/$2"
>   jobManager:
>     resource:
>       memory: "1048m"
>       cpu: 0.75
>   taskManager:
>     # template:
>     #   spec:
>     #     env:
>     #       - name: CLASSPATH
>     #         value: "/opt/flink/lib/dependencies:/opt/flink/lib/*"
>     resource:
>       memory: "1048m"
>       cpu: 0.75
>   job:
>     #jarURI: local:///opt/flink/usrlib/helloworld-1.0-SNAPSHOT-jar-with-dependencies.jar
>     #jarURI: local:///opt/flink/usrlib/helloworld-bundled-1.0-SNAPSHOT.jar
>     jarURI: local:///opt/flink/lib/helloworld-bundled-1.0-SNAPSHOT.jar
>     parallelism: 2
>     upgradeMode: stateless
>
> When I apply the yaml, the pod crashes with this error:
> org.apache.flink.util.FlinkException: Could not load the provided
> entrypoint class.
> at
> org.apache.flink.client.program.DefaultPackagedProgramRetriever.getPackagedProgram(DefaultPackagedProgramRetriever.java:215)
> ~[flink-dist-1.16.0.jar:1.16.0]
> at
> org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint.getPackagedProgram(KubernetesApplicationClusterEntrypoint.java:100)
> ~[flink-dist-1.16.0.jar:1.16.0]
> at
> org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint.main(KubernetesApplicationClusterEntrypoint.java:70)
> [flink-dist-1.16.0.jar:1.16.0]
> Caused by: org.apache.flink.client.program.ProgramInvocationException: JAR
> file does not exist '/opt/flink/lib/helloworld-bundled-1.0-SNAPSHOT.jar'
> at
> org.apache.flink.client.program.PackagedProgram.checkJarFile(PackagedProgram.java:617)
> ~[flink-dist-1.16.0.jar:1.16.0]
> at
> org.apache.flink.client.program.PackagedProgram.loadJarFile(PackagedProgram.java:465)
> ~[flink-dist-1.16.0.jar:1.16.0

Re: Supplying jar stored at S3 to flink to run the job in kubernetes

2023-01-16 Thread Yang Wang
Do you build your own flink-kubernetes-operator image with the flink-s3-fs
plugin bundled[1]?

[1].
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.3/docs/custom-resource/overview/#flinksessionjob-spec-overview
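For reference, a bundling sketch along those lines (the version tags are assumptions — match them to your operator and Flink versions):

```shell
cat > /tmp/Dockerfile.s3 <<'EOF'
FROM apache/flink-kubernetes-operator:1.3.0
ENV FLINK_PLUGINS_DIR=/opt/flink/plugins
COPY flink-s3-fs-hadoop-1.16.0.jar $FLINK_PLUGINS_DIR/s3-fs-hadoop/
EOF
# Build with (placeholder tag): docker build -f /tmp/Dockerfile.s3 -t my-operator:s3 .
grep -q 'flink-s3-fs-hadoop' /tmp/Dockerfile.s3 && echo "plugin layer present"
# -> plugin layer present
```

Without the plugin in the operator image, an `s3a://` jarURI in a FlinkSessionJob cannot be resolved.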

Best,
Yang

Weihua Hu  wrote on Tue, Jan 17, 2023 at 10:47:

> Hi, Rahul
>
> User support and questions should be sent to the user mailing list (
> user@flink.apache.org)
>
> You can resend the issue to the user mailing list with a detailed error
> log.
>
> Best,
> Weihua
>
>
> On Mon, Jan 16, 2023 at 11:18 PM rahul sahoo 
> wrote:
>
> > I have been following the examples mentioned here:
> > flink-kubernetes-operator_examples.
> > I'm testing this on the local minikube. I have deployed minio for s3 and
> > flink operator.
> >
> > I have my application jar in s3(using minio for this). I have deployed
> the
> > flink session deployment in minikube and want to submit the job as
> > mentioned in basic-session-deployment-and-job.yaml
> > <https://github.com/apache/flink-kubernetes-operator/blob/main/examples/basic-session-deployment-and-job.yaml>
> >
> > I want to replace `https://` with `s3a://` in this line:
> > <https://github.com/apache/flink-kubernetes-operator/blob/92034fa912f39f5c8bd57632295c7ca85801f33a/examples/basic-session-deployment-and-job.yaml#L43>.
> > The final URL should look like
> > `s3a://local-bkt/flink-examples-streaming_2.12-1.16.0.jar`. I'm using
> > flink_2.12-1.16.0 with s3 plugin in docker image.
> >
> > Can anyone help me to solve this?
> >
> > Thank You,
> > Rahul Sahoo
> >
>


Re: Flink Job Manager Recovery from EKS Node Terminations

2023-01-11 Thread Yang Wang
First, the JobManager does not store any persistent data locally when
Kubernetes HA + S3 are used.
That means you do not need to mount a PV for the JobManager deployment.

Secondly, node failures or terminations should not cause
the CrashLoopBackOff status.
One possible reason I could imagine is bug FLINK-28265[1], which is fixed
in 1.15.3.

BTW, it will be great if you could share the logs of initial JobManager pod
and crashed JobManager pod.

[1]. https://issues.apache.org/jira/browse/FLINK-28265
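As a quick sanity check, the setup described above boils down to every durable path living on object storage (the bucket and paths below are placeholders):

```shell
cat > /tmp/flink-ha-conf.yaml <<'EOF'
high-availability: kubernetes
high-availability.storageDir: s3://my-bucket/flink/my-cluster/recovery
state.backend: rocksdb
state.checkpoints.dir: s3://my-bucket/flink/my-cluster/checkpoints
EOF
# Both recovery metadata and checkpoints must be on S3, never file:// or a
# node-local PV, so a node termination loses nothing the JM needs to restore.
grep -c 's3://' /tmp/flink-ha-conf.yaml
# -> 2
```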


Best,
Yang


Vijay Jammi  wrote on Fri, Jan 6, 2023 at 04:24:

> Hi,
>
> Have a query on the Job Manager HA for flink 1.15.
>
> We currently run a standalone flink cluster with a single JobManager and
> multiple TaskManagers, deployed on top of a kubernetes cluster (EKS
> cluster) in application mode (reactive mode).
>
> The Task Managers are deployed as a ReplicaSet and the single Job Manager
> is configured to be highly available using the Kubernetes HA services with
> recovery data being written to S3.
>   high-availability.storageDir:
> s3:///flink//recovery
>
> We also have configured our cluster for the rocksdb state backend with
> checkpoints being written to S3.
>   state.backend: rocksdb
>   state.checkpoints.dir:
> s3:///flink//checkpoints
>
> Now to test the Job Manager HA, when we delete the job manager deployment
> (to simulate job manager crash), we see that Kubernetes (EKS) detects
> the failure, launches a new Job Manager pod and is able to recover the
> application cluster from the last successful checkpoint (Restoring job
> 000 from Checkpoint 5 @ 167...3692 for 000 located at
> s3://.../checkpoints/0.../chk-5).
>
> However, if we terminate the underlying node (EC2 instance) on which the
> Job Manager pod is scheduled, the cluster is unable to recover from this
> scenario. What we are seeing is that Kubernetes as usual tries and retries
> repeatedly to launch a newer Job Manager but this time the job manager is
> unable to find the checkpoint to recover from (No checkpoint found during
> restore), eventually going into a CrashLoopBackOff status after max
> attempts of restart.
>
> Now the query is: will the Job Manager need to be configured to store its
> state in a local working directory on persistent volumes? Any pointers on
> how we can recover the cluster from such node failures or terminations?
>
> Vijay Jammi
>
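For reference, the kind of configuration being discussed can be sketched as a
flink-conf.yaml fragment like the one below. This is only an illustration:
the bucket names, paths, and cluster id are placeholders (the poster's actual
values were elided and are not reproduced here).

```yaml
# Hypothetical flink-conf.yaml fragment -- bucket, paths, and cluster id
# are placeholders, not the poster's elided values.
kubernetes.cluster-id: my-application-cluster
high-availability: kubernetes
high-availability.storageDir: s3://my-bucket/flink/recovery
state.backend: rocksdb
state.checkpoints.dir: s3://my-bucket/flink/checkpoints
```

Because both the HA metadata and the checkpoints live in S3 rather than on
the node, a JobManager pod rescheduled onto a different EC2 instance can
still resolve the latest checkpoint pointer from the HA storage directory.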


Re: The use of zookeeper in flink

2023-01-03 Thread Yang Wang
The reason why the running jobs try to fail over during a ZooKeeper outage is
that the JobManager loses leadership.
Having a standby JobManager or not makes no difference.

Best,
Yang

Matthias Pohl via user  wrote on Mon, Jan 2, 2023 at 20:51:

> And I screwed up the reply again. -.- Here's my previous response for the
> ML thread and not only spoon_lz:
>
> Hi spoon_lz,
> Thanks for reaching out to the community and sharing your use case. You're
> right about the fact that Flink's HA feature relies on the leader election.
> The HA backend not being responsive for too long might cause problems. I'm
> not sure I fully understand your point that the standby JobManagers
> struggling with the ZK outage shouldn't affect the running jobs. If ZK is
> not responding for the standby JMs, the actual JM leader should be affected
> as well which, as a consequence, would affect the job execution. But I
> might misunderstand your post. Logs would be helpful to get a better
> understanding of your post's context.
>
> Best,
> Matthias
>
> FYI: There is also (a kind of stalled) discussion in the dev ML [1] about
> recovery of too many jobs affecting Flink's performance.
>
> [1] https://lists.apache.org/thread/r3fnw13j5h04z87lb34l42nvob4pq2xj
>
> On Thu, Dec 29, 2022 at 8:55 AM spoon_lz  wrote:
>
>> Hi All,
>> We use ZooKeeper to achieve high availability of jobs. Recently, a
>> failure occurred in our Flink cluster: the ZooKeeper service went down
>> abnormally, and all the Flink jobs using that ZooKeeper failed over. The
>> failover restart of a large number of jobs in a short period of time put
>> too much pressure on the cluster, which in turn caused the cluster to
>> crash.
>> Afterwards, I reviewed what ZooKeeper provides for HA:
>> 1. Leader election
>> 2. Service discovery
>> 3. State persistence
>>
>> The unavailability of the ZooKeeper service leads to failover of the
>> Flink jobs. This seems to be because of the first point: the JobManager
>> cannot confirm whether it is Active or Standby; the other two points
>> should not be affected. But we don't use a standby JobManager.
>> So in my opinion, if no standby JobManager is used, the availability of
>> the ZooKeeper service should not affect jobs that are running normally
>> (of course, it is understandable that a job cannot be recovered
>> correctly if an exception occurs). Is there a way to achieve a similar
>> goal?
>>
>


Re: How to get failed streaming Flink job log in Flink Native K8s mode?

2023-01-03 Thread Yang Wang
I think you might need a sidecar container or a DaemonSet to collect the
Flink logs and store them in persistent storage.
You could find more information here[1].

[1].
https://www.alibabacloud.com/blog/best-practices-of-kubernetes-log-collection_596356

Best,
Yang
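As a hedged sketch of the sidecar variant (the shipper image, mount path, and
Fluent Bit choice are assumptions, not from this thread): the Flink main
container and a log-shipper sidecar share a volume, so the logs can be
forwarded to external storage before the pod disappears.

```yaml
# Hypothetical JobManager pod template; "flink-main-container" is the
# container name Flink expects, everything else is a placeholder.
apiVersion: v1
kind: Pod
metadata:
  name: jobmanager-pod-template
spec:
  containers:
    - name: flink-main-container
      volumeMounts:
        - name: flink-logs
          mountPath: /opt/flink/log
    # Sidecar that ships the shared log directory to external storage.
    - name: log-shipper
      image: fluent/fluent-bit:1.9
      volumeMounts:
        - name: flink-logs
          mountPath: /opt/flink/log
          readOnly: true
  volumes:
    - name: flink-logs
      emptyDir: {}
```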

hjw  wrote on Thu, Dec 22, 2022 at 23:28:

> On Flink Native K8s mode, the pod of JM and TM will disappear if the
> streaming job failed.Are there any ways to get the log of the failed
> Streaming job?
> I only think of a solution that is to mount job logs to NFS for
> persistence through pv-pvc defined in pod-template.
>
> ENV:
> Flink version:1.15.0
> Mode: Flink kubernetes Operator 1.2.0(Application Mode)
>
> --
> Best,
> Hjw
>


Re: Stand alone K8s HA mode with Static Tokens Used by Service Accounts

2022-11-24 Thread Yang Wang
IIUC, the fabric8 kubernetes-client 5.5.0 should already support reloading
the latest kube config when a 401 error is received.
Refer to the following PR[1] for more information.

Please share your feedback here if it still does not work.

[1]. https://github.com/fabric8io/kubernetes-client/pull/2731

Best,
Yang

Berkay Polat via user  wrote on Wed, Nov 23, 2022 at 01:57:

> Hi team,
>
> Bumping this up again, from the AWS docs, the suggested approach is to
> simply upgrade the K8s java SDK client (
> https://github.com/kubernetes-client/java/) being used. However, in
> Flink's case with the io.fabric8 K8s client, I am not sure how to handle
> it. Any help and guidance would be much appreciated.
>
> Thanks,
>
> -- Forwarded message -
> From: Berkay Polat 
> Date: Thu, Nov 17, 2022 at 12:36 PM
> Subject: Stand alone K8s HA mode with Static Tokens Used by Service
> Accounts
> To: 
>
>
> Hi,
>
> Our team has been using flink 1.15 and we have a stand alone K8s flink
> setup that uses K8s HA services for its HA mode. Recently, our organization
> is in the works of updating their EKS clusters' Kubernetes versions to 1.21
> or later. We received a request from our support team that the service
> accounts associated with our stand alone flink cluster have been using
> static tokens, which is not permitted for newer K8s versions. Instead, they
> requested us to switch to a refresh token approach (
> https://docs.aws.amazon.com/eks/latest/userguide/service-accounts.html#identify-pods-using-stale-tokens
> ).
>
> From what I understand, in flink 1.15, HA mode is using version 5.5.0 of
> io.fabric8's kubernetes client and it seems that it is compatible with K8s
> 1.21.1 and later (
> https://github.com/fabric8io/kubernetes-client#compatibility-matrix) so I
> am not sure what the underlying limitation/issue is here.
>
> The AWS doc link I referred to earlier recommends upgrading versions for
> Kubernetes Client SDKs but it refers to io.kubernetes's client SDKs, not
> io.fabric8.
>
> Could someone shed some light on it? Would it be worth it to request a
> change to upgrade the io.fabric8 kubernetes client version to a newer
> version?
>
> Thanks,
> --
> *BERKAY POLAT*
> Software Engineer SMTS | MuleSoft at Salesforce
> Mobile: 443-710-7021
>
> 
>
>
> --
> *BERKAY POLAT*
> Software Engineer SMTS | MuleSoft at Salesforce
> Mobile: 443-710-7021
>


Re: support escaping `#` in flink job spec in Flink-operator

2022-11-08 Thread Yang Wang
This is a known limitation of the current Flink options parser. Refer to
FLINK-15358[1] for more information.

[1]. https://issues.apache.org/jira/browse/FLINK-15358

Best,
Yang

Gyula Fóra  wrote on Tue, Nov 8, 2022 at 14:41:

> It is also possible that this is a problem of the Flink native Kubernetes
> integration, we have to check where exactly it goes wrong before we try to
> fix it .
>
> We simply set the args into a Flink config and pass it to the native
> deployment logic in the operator.
>
> Gyula
>
> On Tue, 8 Nov 2022 at 07:37, Gyula Fóra  wrote:
>
>> Hi!
>>
>> How do you submit your yaml?
>>
>> It’s possible that this is not operator problem. Did you try submitting
>> the deployment in json format instead?
>>
>> If it still doesn't work please open a JIRA ticket with the details to
>> reproduce and what you have tried :)
>>
>> Cheers
>> Gyula
>>
>> On Tue, 8 Nov 2022 at 04:56, liuxiangcao  wrote:
>>
>>> Hi,
>>>
>>> We have a job that contains `#` as part of mainArgs, and it used to work
>>> on Ververica. Now we are switching to our own control plane to deploy to
>>> the flink-operator, and the job started to fail because the main args
>>> string gets truncated at the `#` character when passed to the Flink
>>> application. I believe this is due to characters after `#` being
>>> interpreted as comments in a yaml file. To support having `#` in the
>>> mainArgs, the flink operator needs to escape `#` when generating the
>>> k8s yaml file.
>>>
>>> Assuming the mainArgs contain '\"xyz#abc\".
>>>
>>> Here is the stack-trace:
>>> {"exception":{"exception_class":"java.lang.IllegalArgumentException","exception_message":"Could
>>> not parse value '\"xyz' *(Note: truncated by #)*
>>>
>>> for key  '$internal.application.program-args'.\n\tat
>>> org.apache.flink.configuration.Configuration.getOptional(Configuration.java:720)\n\tat
>>> org.apache.flink.configuration.Configuration.get(Configuration.java:704)\n\tat
>>>  
>>> org.apache.flink.configuration.ConfigUtils.decodeListFromConfig(ConfigUtils.java:123)\n\tat
>>>  
>>> org.apache.flink.client.deployment.application.ApplicationConfiguration.fromConfiguration(ApplicationConfiguration.java:80)\n\tat
>>>  
>>> org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint.getPackagedProgram(KubernetesApplicationClusterEntrypoint.java:93)\n\tat
>>>  
>>> org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint.main(KubernetesApplicationClusterEntrypoint.java:70)\nCaused
>>>  by: *java.lang.IllegalArgumentException: Could not split string. Quoting 
>>> was not closed properly*.\n\tat 
>>> org.apache.flink.configuration.StructuredOptionsSplitter.consumeInQuotes(StructuredOptionsSplitter.java:163)\n\tat
>>>  
>>> org.apache.flink.configuration.StructuredOptionsSplitter.tokenize(StructuredOptionsSplitter.java:129)\n\tat
>>>  
>>> org.apache.flink.configuration.StructuredOptionsSplitter.splitEscaped(StructuredOptionsSplitter.java:52)\n\tat
>>>  
>>> org.apache.flink.configuration.ConfigurationUtils.convertToList(ConfigurationUtils.java:324)\n\tat
>>>  
>>> org.apache.flink.configuration.Configuration.lambda$getOptional$2(Configuration.java:714)\n\tat
>>>  java.base/java.util.Optional.map(Optional.java:265)\n\tat 
>>> org.apache.flink.configuration.Configuration.getOptional(Configuration.java:714)\n\t...
>>>  5 more\n"},"@version":1,"source_host":"xx","message":"Could not create 
>>> application 
>>> program.","thread_name":"main","@timestamp":"2022-11-07T18:40:03.369+00:00","level":"ERROR","logger_name":"org.apache.flink.runtime.entrypoint.ClusterEntrypoint"}
>>>
>>>
>>> Can someone take a look and help fix this issue? Or I can help fix it
>>> if someone points me in the right direction.
>>>
>>> --
>>> Best Wishes & Regards
>>> Shawn Xiangcao Liu
>>>
>>


Re: [DISCUSS ] add --jars to support users dependencies jars.

2022-10-27 Thread Yang Wang
Thanks Jacky Lau for starting this discussion.

I understand that you are trying to find a convenient way to specify
dependency jars along with the user jar. However,
let's try to narrow this down by differentiating deployment modes.

# Standalone mode
Whether you are using standalone mode on virtual machines or in a
Kubernetes cluster,
it is not very difficult to prepare the user jar and all the dependencies
under the $FLINK_HOME/usrlib directory.
After that, they will be loaded by the user classloader automatically.

# Yarn
We already have "--ship/-Dyarn.ship-files" to ship the dependency jars.

# Native K8s
Currently, only the local user jar in the image could be supported. And
users could not specify dependency jars.
A feasible solution is using the init-container(via pod template[1]) to
download the user jar and dependencies and then mount to usrlib directory.


All in all, I am trying to understand why we need "--jars" to specify the
dependency jars, and which deployment modes it would support.


Best,
Yang

[1].
https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/native_kubernetes/#pod-template
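The init-container workaround for native K8s mentioned above could look
roughly like the pod template below; the artifact URL and images are
placeholders, not from this thread.

```yaml
# Hypothetical pod template: an init container downloads dependency jars
# into an emptyDir mounted at /opt/flink/usrlib, where the user
# classloader picks them up. URL and image names are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: jobmanager-pod-template
spec:
  initContainers:
    - name: fetch-dependencies
      image: busybox:1.36
      command: ["wget", "-O", "/opt/flink/usrlib/my-deps.jar",
                "https://example.com/artifacts/my-deps.jar"]
      volumeMounts:
        - name: usrlib
          mountPath: /opt/flink/usrlib
  containers:
    - name: flink-main-container
      volumeMounts:
        - name: usrlib
          mountPath: /opt/flink/usrlib
  volumes:
    - name: usrlib
      emptyDir: {}
```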



Martijn Visser  wrote on Thu, Oct 27, 2022 at 14:49:

> Hi Jacky Lau,
>
> Since you've sent the email to multiple mailing lists, I've decided to
> reply to the one that you've sent to both the Dev and User ML.
>
> > but it is not possible for platform users to create fat jars to package
> all their dependencies into the final jar package
>
> Can you elaborate on why that's not possible?
>
> Best regards,
>
> Martijn
>
> On Thu, Oct 27, 2022 at 6:59 AM Jacky Lau  wrote:
>
> > Hi guys:
> >
> > I'd like to initiate a discussion about adding a command-line argument to
> > support user dependency jar packages.
> >
> > Currently Flink accepts the user's main jar through -jarfile; without
> > this option, the Flink client treats the first command-line argument
> > that cannot be parsed as the user's main jar package. But it is not
> > possible for platform users to create fat jars that package all their
> > dependencies into the final jar. In the meantime, the configuration
> > pipeline.jars is currently exposed, and this value is overridden by
> > command-line arguments such as -jarfile.
> >
> > And if the user uses both the command-line argument and the
> > pipeline.jars option, this can leave the user confused. In
> > addition, we should document the priority "command line parameter > -D
> > dynamic parameter > flink-conf.yaml configuration file parameter" in the
> > docs.
> >
>


Re: configMap value error when using flink-operator?

2022-10-26 Thread Yang Wang
Maybe we could change the values of *taskmanager.numberOfTaskSlots* and
*parallelism.default* in the flink-conf.yaml of the Kubernetes operator to 1,
so they are aligned with the default values in the Flink codebase.


Best,
Yang
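The per-resource override discussed in this thread can be sketched as the
fragment below: an explicit value in the resource's flinkConfiguration takes
precedence over the operator-side default flink-conf.yaml. This mirrors the
poster's spec with one assumed addition.

```yaml
# Hypothetical FlinkDeployment fragment: an explicit value in
# flinkConfiguration overrides the operator's default flink-conf.yaml.
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: basic-example
spec:
  image: flink:1.15
  flinkVersion: v1_15
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "1"  # overrides the operator default of 2
  serviceAccount: flink
```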

Gyula Fóra  wrote on Wed, Oct 26, 2022 at 15:17:

> Hi!
>
> I agree that this might be confusing but let me explain what happened.
>
> In the operator you can define default flink configuration. Currently it
> is
> https://github.com/apache/flink-kubernetes-operator/blob/main/helm/flink-kubernetes-operator/conf/flink-conf.yaml
> It contains numberOfTaskSlots=2.
>
> This is more just an example of how to control default cluster
> configuration from the operator. Users generally define numberOfTaskSlots
> for every Flink resource in the flinkConfiguration setting, that would
> override this default.
>
> You are also free to change the operator side default flink conf to not
> set this, then you will have 1. In any case nobody is running real
> applications with 1 task slots / parallelism so this hasn't caused any
> problems so far :)
>
> Cheers,
> Gyula
>
> On Wed, Oct 26, 2022 at 7:20 AM Liting Liu (litiliu) 
> wrote:
>
>> hi:
>> I'm trying to deploy a flink job with the flink-operator. The
>> flink-operator's version is 1.2.0. And the yaml I use is here:
>> 
>> apiVersion: flink.apache.org/v1beta1
>> kind: FlinkDeployment
>> metadata:
>>   name: basic-example
>> spec:
>>   image: flink:1.15
>>   flinkVersion: v1_15
>>   flinkConfiguration:
>>   serviceAccount: flink
>>   jobManager:
>> resource:
>>   memory: "2048m"
>>   cpu: 1
>>   taskManager:
>> resource:
>>   memory: "2048m"
>>   cpu: 1
>>   job:
>> jarURI: local:///opt/flink/examples/streaming/StateMachineExample.jar
>> parallelism: 2
>> upgradeMode: stateless
>> 
>>But I found that in the generated configMap there was a field named
>> "taskmanager.numberOfTaskSlots" set to 2, which is very weird, since
>> that field was not defined by the user. And according to the Flink docs,
>> the default value of "taskmanager.numberOfTaskSlots" should be 1.
>>
>


Re: Flink Native K8S RBAC

2022-10-20 Thread Yang Wang
I have created a ticket[1] to fill the missing part in the native K8s
documentation.

[1]. https://issues.apache.org/jira/browse/FLINK-29705

Best,
Yang

Gyula Fóra  wrote on Thu, Oct 20, 2022 at 13:37:

> Hi!
>
> As a reference you can look at how the Flink Kubernetes Operator manages
> RBAC settings:
>
>
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/rbac/
>
> https://github.com/apache/flink-kubernetes-operator/blob/main/helm/flink-kubernetes-operator/templates/rbac.yaml
>
> Cheers,
> Gyula
>
> On Wed, Oct 19, 2022 at 9:46 PM Calvin D Souza via user <
> user@flink.apache.org> wrote:
>
>> Hi,
>>
>> I am using custom service account for flink native k8s. These are the
>> rules for the clusterrole I’m using:
>>
>> rules:
>> - apiGroups: [""]
>> resources: ["pods"]
>> verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
>> - apiGroups: [""]
>> resources: ["configmaps"]
>> verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
>> - apiGroups: [""]
>> resources: ["services"]
>> verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
>> - apiGroups: ["apps"]
>> resources: ["deployments"]
>> verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
>> - apiGroups: [""]
>> resources: ["pods/log"]
>> verbs: ["get", "list", "watch"]
>> - apiGroups: ["extensions"]
>> resources: ["deployments"]
>> verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
>>
>>
>> Are there any that I am missing or that are not needed?
>>
>> Thanks,
>> Calvin
>>
>


Re: Activate Flink HA without checkpoints on k8S

2022-10-19 Thread Yang Wang
Add some more information to Gyula's comment.

For application mode without checkpoints, you do not need to activate HA,
since it will not take any effect and the Flink job will be submitted again
after the JobManager restarts, because the job submission happens on the
JobManager side.

For session mode without checkpoints, you still need to activate HA. If
not, all the running jobs will be lost after the JobManager restarts,
because the job submission happens on the client side and the job graphs
are stored by HA.

Best,
Yang

Gyula Fóra  wrote on Fri, Oct 14, 2022 at 04:27:

> Without HA, if the jobmanager goes down, job information is lost so the
> job won’t be restarted after the JM comes back up.
>
> Gyula
>
> On Thu, 13 Oct 2022 at 19:07, marco andreas 
> wrote:
>
>>
>>
>> Hello,
>>
>> Can someone explain to me what is the point of using HA when deploying an
>> application cluster with a single JM and checkpoints are not activated?
>>
>> AFAIK, when the pod of the JM goes down, Kubernetes will restart it anyway,
>> so we don't need to activate HA in this case.
>>
>> Maybe there's something else that I am missing here, so if someone could
>> give me an explanation, it would be great.
>>
>> Sincerely,
>>
>


Re: fail to mount hadoop-config-volume when using flink-k8s-operator

2022-10-13 Thread Yang Wang
Currently, exporting the env "HADOOP_CONF_DIR" only works for the native
K8s integration. The Flink client will try to create the
hadoop-config-volume automatically if the Hadoop env is found.

If you want to set the HADOOP_CONF_DIR in the docker image, please also
make sure the specified Hadoop conf directory exists in the image.

For the flink-k8s-operator, another feasible solution is to create a
hadoop-config ConfigMap manually and then use
"kubernetes.hadoop.conf.config-map.name" to mount it to the JobManager
and TaskManager pods.


Best,
Yang
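The manual-ConfigMap approach could be sketched as follows; the ConfigMap
name and the local Hadoop conf path are placeholders.

```yaml
# First create the ConfigMap from a local Hadoop conf directory, e.g.:
#   kubectl create configmap my-hadoop-conf --from-file=/path/to/hadoop/conf
# Then reference it in flink-conf.yaml (the name is a placeholder):
kubernetes.hadoop.conf.config-map.name: my-hadoop-conf
```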

Liting Liu (litiliu)  wrote on Wed, Oct 12, 2022 at 16:11:

> Hi, community:
>   I'm using flink-k8s-operator v1.2.0 to deploy a flink job. The
> "HADOOP_CONF_DIR" environment variable was set in the image that I
> built from flink:1.15. I found the taskmanager pod was trying to mount
> a volume named "hadoop-config-volume" from a configMap. But the configMap
> with the name "hadoop-config-volume" wasn't created.
>
> Do I need to remove the "HADOOP_CONF_DIR" environment variable in the
> dockerfile?
> If yes, what should I do to specify the Hadoop conf?
>
>


Re: serviceAccount permissions issue for high availability in operator 1.1

2022-09-20 Thread Yang Wang
Standalone mode will be supported in release 1.2, which is expected
at the beginning of October.

Best,
Yang

Javier Vegas  wrote on Mon, Sep 12, 2022 at 04:52:

> Hi, Yang!
>
> When you say the operator uses native k8s integration by default, does
> that mean there is a way to change that to use standalone K8s? I haven't
> seen anything about that in the docs, besides a mention that standalone
> support is coming in version 1.2 of the operator.
>
> Thanks,
>
> Javier
>
>
> On Thu, Sep 8, 2022, 22:50 Yang Wang  wrote:
>
>> Since the flink-kubernetes-operator is using native K8s integration[1] by
>> default, you need to give the permissions of pod and deployment as well as
>> ConfigMap.
>>
>> You could find more information about the RBAC here[2].
>>
>> [1].
>> https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/native_kubernetes/
>> [2].
>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.1/docs/operations/rbac/
>>
>> Best,
>> Yang
>>
>> Javier Vegas  wrote on Wed, Sep 7, 2022 at 04:17:
>>
>>> I am migrating a HA standalone Kubernetes app to use the Flink operator.
>>> The HA store is S3 using IRSA so the app needs to run with a serviceAccount
>>> that is authorized to access S3. In standalone mode HA worked once I gave
>>> the account permissions to edit configMaps. But when trying the operator
>>> with the custom serviceAccount, I am getting this error:
>>>
>>> io.fabric8.kubernetes.client.KubernetesClientException: Failure
>>> executing: GET at:
>>> https://172.20.0.1/apis/apps/v1/namespaces/MYNAMESPACE/deployments/MYAPPNAME.
>>> Message: Forbidden!Configured service account doesn't have access. Service
>>> account may have been revoked. deployments.apps "MYAPPNAME" is forbidden:
>>> User "system:serviceaccount:MYNAMESPACE:MYSERVICEACCOUNT" cannot get
>>> resource "deployments" in API group "apps" in the namespace "MYNAMESPACE".
>>>
>>>
>>> Does the serviceAccount needs additional permissions beside configMap
>>> edit to be able to run HA using the operator?
>>>
>>> Thanks,
>>>
>>> Javier Vegas
>>>
>>


Re: serviceAccount permissions issue for high availability in operator 1.1

2022-09-08 Thread Yang Wang
Since the flink-kubernetes-operator is using native K8s integration[1] by
default, you need to give the permissions of pod and deployment as well as
ConfigMap.

You could find more information about the RBAC here[2].

[1].
https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/native_kubernetes/
[2].
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.1/docs/operations/rbac/

Best,
Yang
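As a rough sketch only — the RBAC documentation linked above remains the
authoritative reference — a Role for native K8s with HA typically needs at
least pods, services, configmaps, and deployments. The namespace and names
below are placeholders.

```yaml
# Hypothetical Role; bind it to the custom serviceAccount with a RoleBinding.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: my-namespace
  name: flink-native-k8s
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
```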

Javier Vegas  wrote on Wed, Sep 7, 2022 at 04:17:

> I am migrating a HA standalone Kubernetes app to use the Flink operator.
> The HA store is S3 using IRSA so the app needs to run with a serviceAccount
> that is authorized to access S3. In standalone mode HA worked once I gave
> the account permissions to edit configMaps. But when trying the operator
> with the custom serviceAccount, I am getting this error:
>
> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing:
> GET at:
> https://172.20.0.1/apis/apps/v1/namespaces/MYNAMESPACE/deployments/MYAPPNAME.
> Message: Forbidden!Configured service account doesn't have access. Service
> account may have been revoked. deployments.apps "MYAPPNAME" is forbidden:
> User "system:serviceaccount:MYNAMESPACE:MYSERVICEACCOUNT" cannot get
> resource "deployments" in API group "apps" in the namespace "MYNAMESPACE".
>
>
> Does the serviceAccount needs additional permissions beside configMap edit
> to be able to run HA using the operator?
>
> Thanks,
>
> Javier Vegas
>


Re: [Flink 1.15.1 - Application mode native k8s Exception] - Exception occurred while acquiring lock 'ConfigMapLock

2022-09-08 Thread Yang Wang
You are right. Starting multiple JobManagers could help when the pod is
deleted and there are not enough resources in the cluster to start a new one.
In most cases, though, the JobManager container will be restarted locally
without scheduling a new Kubernetes pod[1].

The "already exists" error comes from the fabric8 kubernetes-client. It is
somewhat reasonable, because a ConfigMap with the same name might already
have been created manually beforehand.
In the Flink use case, we can simply ignore this error.

For the first exception "*Caused by: java.io.FileNotFoundException:
/opt/flink/.kube/config (No such file or directory)*", I think you need to
share the full log file of all the JobManagers.

[1].
https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#restart-policy


Best,
Yang


Tamir Sagi  wrote on Thu, Sep 8, 2022 at 14:28:

> Hey Yang,
>
> Thank you for the fast response.
>
> I get your point but, assuming 3 Job managers are up, in case the leader
> fails, one of the other 2 should become the new leader, no?
>
> If the cluster fails, the new leader should handle that.
>
> Another scenario could be that the Job manager stops (gets killed by k8s
> due to memory or CPU limitations, bugs, etc.) while the TMs are still
> operating and the cluster is active. In some cases, due to resource
> limitations, k8s will not be able to get a new instance right away until
> auto-scaling takes place (the pod remains in the Pending state). It seems
> like we do achieve resilience by having HA enabled in native k8s mode.
>
> What do you think?
>
> Given that you are running multiple JobManagers, it does not matter for
> the "already exists" exception during leader election.
>
> Should we ignore such an error? If so, it should be a warning then.
>
> What about the 1st error we encountered regarding the kube/config file
> exception?
>
>
> Thank you so much,
> Best,
> Tamir
>
> --
> *From:* Yang Wang 
> *Sent:* Thursday, September 8, 2022 7:08 AM
> *To:* Tamir Sagi 
> *Cc:* user@flink.apache.org ; Lihi Peretz <
> lihi.per...@niceactimize.com>
> *Subject:* Re: [Flink 1.15.1 - Application mode native k8s Exception] -
> Exception occurred while acquiring lock 'ConfigMapLock
>
>
> *EXTERNAL EMAIL*
>
>
> Given that you are running multiple JobManagers, it does not matter for
> the "already exists" exception during leader election.
>
> BTW, I think running multiple JobManagers does not bring much advantage
> when deploying Flink on Kubernetes, because a new JobManager will be
> started immediately once the old one crashes.
> And the Flink JobManager always needs to recover the job from the latest
> checkpoint no matter how many JobManagers are running.
>
> Best,
> Yang
>
> Tamir Sagi  wrote on Mon, Sep 5, 2022 at 21:48:
>
> Hey Yang,
>
> The flink-conf.yaml submitted to the cluster does not contain 
> "kubernetes.config.file"
> at all.
> In addition, I verified flink config maps under cluster's namespace do not
> contain "kubernetes.config.file".
>
> In addition, we also noticed the following exception (appears to happen
> sporadically)
>
> 2022-09-04T21:06:35,231][Error] {} [i.f.k.c.e.l.LeaderElector]: Exception
> occurred while acquiring lock 'ConfigMapLock: dev-0-flink-jobs -
> data-agg-events-insertion-cluster-config-map
> (fa3dbbc5-1753-46cd-afaf-0baf8ff0947f)'
> io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LockException:
> Unable to create ConfigMapLock
>
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure
> executing: POST at:
> https://172.20.0.1/api/v1/namespaces/dev-0-flink-jobs/configmaps.
> Message: configmaps "data-agg-events-insertion-cluster-config-map" already
> exists.
>
> Log file is enclosed.
>
> Thanks,
> Tamir.
>
> --
> *From:* Yang Wang 
> *Sent:* Monday, September 5, 2022 3:03 PM
> *To:* Tamir Sagi 
> *Cc:* user@flink.apache.org ; Lihi Peretz <
> lihi.per...@niceactimize.com>
> *Subject:* Re: [Flink 1.15.1 - Application mode native k8s Exception] -
> Exception occurred while acquiring lock 'ConfigMapLock
>
>
> *EXTERNAL EMAIL*
>
>
> Could you please check whether the "kubernetes.config.file" is configured
> to /opt/flink/.kube/config in the Flink configmap?
> It should be removed before creating the Flink configmap.
>
> Best,
> Yang
>
> Tamir Sagi  wrote on Sun, Sep 4, 2022 at 18:08:
>
> Hey All,
>
> We recently updated to Flink 1.15.1. We deploy a stream cluster in
> Application mode on native K8s (deployed on Amazon EKS). The cluster is
> configured with the Kubernetes HA service, minimum 3 replicas of Job manager
> 

Re: [Flink Kubernetes Operator] FlinkSessionJob crd spec jarURI

2022-09-08 Thread Yang Wang
Given that the "local://" scheme means the jar is available in the
image/container of the JobManager, it can only be supported in K8s
application mode.

If you configure the jarURI with the "file://" scheme for a session cluster,
it means that the jar file should be available in the
flink-kubernetes-operator container.
You could mount a PV for the flink-kubernetes-operator and then use "*kubectl
cp*" to copy a file into the pod.

Or you could specify an "http://" path for the jarURI.

Best,
Yang
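A hedged example of the "http://" option for a FlinkSessionJob; the
deployment name and jar URL below are placeholders, not from this thread.

```yaml
# Hypothetical FlinkSessionJob; the operator fetches the jar over HTTP,
# so it does not need to exist in any container image.
apiVersion: flink.apache.org/v1beta1
kind: FlinkSessionJob
metadata:
  name: example-session-job
spec:
  deploymentName: my-session-cluster  # the FlinkDeployment of the session cluster
  job:
    jarURI: https://example.com/artifacts/StateMachineExample.jar
    parallelism: 2
    upgradeMode: stateless
```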

Vignesh Kumar Kathiresan via user  wrote on Wed, Sep 7, 2022 at 09:31:

> Hi,
>
> Have a session cluster deployed in kubernetes. Trying to submit a job
> following the example given in docs.
>
> When I give
> 1) spec.job.jarURI:
> local:///opt/flink/examples/streaming/StateMachineExample.jar
>
> getting
> Error:  org.apache.flink.core.fs.UnsupportedFileSystemSchemeException:
> Could not find a file system implementation for scheme 'local'. The scheme
> is not directly supported by Flink and no Hadoop file system to support
> this scheme could be loaded. For a full list of supported file systems,
> please see
> https://nightlies.apache.org/flink/flink-docs-stable/ops/filesystems/.
> 2) when I change the scheme to file
> spec.job.jarURI:
> file:///opt/flink/examples/streaming/StateMachineExample.jar
>
> getting
>  Error:  java.io.FileNotFoundException:
> /opt/flink/examples/streaming/TopSpeedWindowing.jar (No such file or
> directory)
>
> Is there anything that I am missing? From the docs I gather that I do not
> need any extra fs plugin for referencing a local file system jar.
>


Re: Deploying Jobmanager on k8s as a Deployment

2022-09-07 Thread Yang Wang
For the native K8s integration, the Flink ResourceManager will delete the
JobManager K8s deployment as well as the HA data once the job reaches a
globally terminal state.

However, it is indeed a problem for standalone mode, since the JobManager
will be restarted again even after the job has finished. I think the
flink-kubernetes-operator could handle this situation by doing the cleanup.


Best,
Yang

Austin Cawley-Edwards  wrote on Thu, Sep 8, 2022 at 06:01:

> Hey Gil,
>
> I'm referring to when a pod exits on its own, not when being deleted.
> Deployments only support the "Always" restart policy [1].
>
> In my understanding, the JM only cleans up HA data when it is shut down[2],
> after which the process will exit which leads to the problem with k8s
> Deployment restart policies.
>
> Best,
> Austin
>
> [1]:
> https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#pod-template
> [2]:
> https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/ops/rest_api/#cluster
>
> On Wed, Sep 7, 2022 at 4:43 PM Gil De Grove 
> wrote:
>
>> Hello Austin,
>>
>> I'm not aware of any limitation of Deployments not letting pods exit
>> (correctly or incorrectly). What do you mean by that exactly? Would it be
>> possible for you to point to the piece of documentation that makes you
>> think that?
>>
>> A pod, if correctly set up, will exit when receiving a SIGTERM or
>> SIGKILL from the orchestrator.
>> So when "deleting" the deployment, the pods are terminated correctly. If
>> flink triggered a savepoint before, you can then restart from that
>> savepoint.
>> Usually, when a pod is not being terminated, it means that the signal is
>> not delivered to the correct process.
>>
>> Hopes this helps.
>>
>> Regards,
>> Gil
>>
>>
>> On Wed, Sep 7, 2022, 21:16 Austin CawleyEdwards 
>> wrote:
>>
>>> Cool, thanks! How does it clean up the HA data, if the cluster is never
>>> able to shut down (due to the k8s Deployment restriction)?
>>>
>>> Best,
>>> Austin
>>>
>>> On Mon, Sep 5, 2022 at 6:51 PM Gyula Fóra  wrote:
>>>
 Hi!

 The operator supports both Flink native and standalone deployment modes
 and in both cases the JM is deployed as k8s Deployment.

 During upgrade Flink/operator deletes the deployment after savepoint
 and waits for termination before it creates a new one with the updated
 spec.

 Cheers,
 Gyula

 On Mon, 5 Sep 2022 at 07:41, Austin Cawley-Edwards <
 austin.caw...@gmail.com> wrote:

> Hey Marco,
>
> Unfortunately there is no built in k8s API that models an application
> mode JM exactly but Deployments should be fine, in general. As Gyula 
> notes,
> where they can be difficult is during application upgrades as Deployments
> never let their pods exit, even if successful, so there is no way to stop
> the cluster gracefully.
>
> Is stopping your application with a savepoint and redeploying a
> workable solution for image upgrades? In this way a Job could still be
> used.
>
>
> @Gyula, how are JMs handled in the operator? Job, Deployment, or
> something custom?
>
>
> Best,
> Austin
>
>
>
> On Mon, Sep 5, 2022 at 6:15 AM Gyula Fóra 
> wrote:
>
>> You can use deployments of course , the operator and native k8s
>> integration does exactly that.
>>
>> Even then job updates can be tricky so I believe you are much better
>> off with the operator.
>>
>> Gyula
>>
>> On Sun, 4 Sep 2022 at 11:11, marco andreas 
>> wrote:
>>
>>> Hello,
>>>
>>> Thanks for the response, I will take a look at it.
>>>
>>> But if we aren't able to use the flink operator due to technical
>>> constraints is it possible to deploy the JM as deployment without any
>>> consequences that I am not aware of?
>>>
>>> Sincerely,
>>>
>>> Le sam. 3 sept. 2022 à 23:27, Gyula Fóra  a
>>> écrit :
>>>
 Hi!
 You should check out the Flink Kubernetes Operator. I think that
 covers all your needs .


 https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/

 Cheers,
 Gyula

 On Sat, 3 Sep 2022 at 13:45, marco andreas <
 marcoandreas...@gmail.com> wrote:

>
> We are deploying a flink application cluster on k8S. Following the
> official documentation the JM is deployed As a job resource , however 
> we
> are deploying a long running flink job that is not supposed to be
> terminated and also we need to update the image of the flink job.
>
>  The problem is that the Job is an immutable resource; we
> can't update it.
>
> So I'm wondering if it's possible to use a deployment resource for
> the jobmanager and if there will be any side effects or repercussions.
>
> Thanks,
>
>>>

Re: [Flink 1.15.1 - Application mode native k8s Exception] - Exception occurred while acquiring lock 'ConfigMapLock

2022-09-07 Thread Yang Wang
Given that you are running multiple JobManagers, the "already exists"
exception during leader election does not matter.

BTW, I think running multiple JobManagers does not bring much advantage
when deploying Flink on Kubernetes, because a new JobManager will be
started immediately once the old one crashes.
And the Flink JobManager always needs to recover the job from the latest
checkpoint no matter how many JobManagers are running.

Best,
Yang

Tamir Sagi  于2022年9月5日周一 21:48写道:

> Hey Yang,
>
> The flink-conf.yaml submitted to the cluster does not contain 
> "kubernetes.config.file"
> at all.
> In addition, I verified flink config maps under cluster's namespace do not
> contain "kubernetes.config.file".
>
> In addition, we also noticed the following exception (appears to happen
> sporadically)
>
> 2022-09-04T21:06:35,231][Error] {} [i.f.k.c.e.l.LeaderElector]: Exception
> occurred while acquiring lock 'ConfigMapLock: dev-0-flink-jobs -
> data-agg-events-insertion-cluster-config-map
> (fa3dbbc5-1753-46cd-afaf-0baf8ff0947f)'
> io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LockException:
> Unable to create ConfigMapLock
>
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException: Failure
> executing: POST at:
> https://172.20.0.1/api/v1/namespaces/dev-0-flink-jobs/configmaps.
> Message: configmaps "data-agg-events-insertion-cluster-config-map" already
> exists.
>
> Log file is enclosed.
>
> Thanks,
> Tamir.
>
> --
> *From:* Yang Wang 
> *Sent:* Monday, September 5, 2022 3:03 PM
> *To:* Tamir Sagi 
> *Cc:* user@flink.apache.org ; Lihi Peretz <
> lihi.per...@niceactimize.com>
> *Subject:* Re: [Flink 1.15.1 - Application mode native k8s Exception] -
> Exception occurred while acquiring lock 'ConfigMapLock
>
>
> *EXTERNAL EMAIL*
>
>
> Could you please check whether the "kubernetes.config.file" is configured
> to /opt/flink/.kube/config in the Flink configmap?
> It should be removed before creating the Flink configmap.
>
> Best,
> Yang
>
> Tamir Sagi  于2022年9月4日周日 18:08写道:
>
> Hey All,
>
> We recently updated to Flink 1.15.1. We deploy a streaming cluster in
> Application mode on native K8s (deployed on Amazon EKS). The cluster is
> configured with the Kubernetes HA service, a minimum of 3 JobManager
> replicas, and a pod template configured with topologySpreadConstraints to
> enable distribution across different availability zones.
> The HA storage directory is on S3.
>
> The cluster is deployed and running properly; however, after a while we
> noticed the following exception in a JobManager instance (the log file is
> enclosed):
>
> 2022-09-04T02:05:33,097][Error] {} [i.f.k.c.e.l.LeaderElector]: Exception
> occurred while acquiring lock 'ConfigMapLock: dev-0-flink-jobs -
> data-agg-events-insertion-cluster-config-map
> (b6da2ae2-ad2b-471c-801e-ea460a348fab)'
> io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]
>  for kind: [ConfigMap]  with name:
> [data-agg-events-insertion-cluster-config-map]  in namespace:
> [dev-0-flink-jobs]  failed.
> Caused by: java.io.FileNotFoundException: /opt/flink/.kube/config (No such
> file or directory)
> at java.io.FileInputStream.open0(Native Method) ~[?:?]
> at java.io.FileInputStream.open(Unknown Source) ~[?:?]
> at java.io.FileInputStream.(Unknown Source) ~[?:?]
> at
> org.apache.flink.kubernetes.shaded.com.fasterxml.jackson.dataformat.yaml.YAMLFactory.createParser(YAMLFactory.java:354)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> org.apache.flink.kubernetes.shaded.com.fasterxml.jackson.dataformat.yaml.YAMLFactory.createParser(YAMLFactory.java:15)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> org.apache.flink.kubernetes.shaded.com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3494)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> io.fabric8.kubernetes.client.internal.KubeConfigUtils.parseConfig(KubeConfigUtils.java:42)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> io.fabric8.kubernetes.client.utils.TokenRefreshInterceptor.intercept(TokenRefreshInterceptor.java:44)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> org.apache.flink.kubernetes.shaded.okhttp3.internal.http.Real

Re: [Flink 1.15.1 - Application mode native k8s Exception] - Exception occurred while acquiring lock 'ConfigMapLock

2022-09-05 Thread Yang Wang
Could you please check whether the "kubernetes.config.file" is configured
to /opt/flink/.kube/config in the Flink configmap?
It should be removed before creating the Flink configmap.

Best,
Yang

Tamir Sagi  于2022年9月4日周日 18:08写道:

> Hey All,
>
> We recently updated to Flink 1.15.1. We deploy a streaming cluster in
> Application mode on native K8s (deployed on Amazon EKS). The cluster is
> configured with the Kubernetes HA service, a minimum of 3 JobManager
> replicas, and a pod template configured with topologySpreadConstraints to
> enable distribution across different availability zones.
> The HA storage directory is on S3.
>
> The cluster is deployed and running properly; however, after a while we
> noticed the following exception in a JobManager instance (the log file is
> enclosed):
>
> 2022-09-04T02:05:33,097][Error] {} [i.f.k.c.e.l.LeaderElector]: Exception
> occurred while acquiring lock 'ConfigMapLock: dev-0-flink-jobs -
> data-agg-events-insertion-cluster-config-map
> (b6da2ae2-ad2b-471c-801e-ea460a348fab)'
> io.fabric8.kubernetes.client.KubernetesClientException: Operation: [get]
>  for kind: [ConfigMap]  with name:
> [data-agg-events-insertion-cluster-config-map]  in namespace:
> [dev-0-flink-jobs]  failed.
> Caused by: java.io.FileNotFoundException: /opt/flink/.kube/config (No such
> file or directory)
> at java.io.FileInputStream.open0(Native Method) ~[?:?]
> at java.io.FileInputStream.open(Unknown Source) ~[?:?]
> at java.io.FileInputStream.(Unknown Source) ~[?:?]
> at
> org.apache.flink.kubernetes.shaded.com.fasterxml.jackson.dataformat.yaml.YAMLFactory.createParser(YAMLFactory.java:354)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> org.apache.flink.kubernetes.shaded.com.fasterxml.jackson.dataformat.yaml.YAMLFactory.createParser(YAMLFactory.java:15)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> org.apache.flink.kubernetes.shaded.com.fasterxml.jackson.databind.ObjectMapper.readValue(ObjectMapper.java:3494)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> io.fabric8.kubernetes.client.internal.KubeConfigUtils.parseConfig(KubeConfigUtils.java:42)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> io.fabric8.kubernetes.client.utils.TokenRefreshInterceptor.intercept(TokenRefreshInterceptor.java:44)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> io.fabric8.kubernetes.client.utils.ImpersonatorInterceptor.intercept(ImpersonatorInterceptor.java:68)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> io.fabric8.kubernetes.client.utils.HttpClientUtils.lambda$createApplicableInterceptors$6(HttpClientUtils.java:290)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:142)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> org.apache.flink.kubernetes.shaded.okhttp3.internal.http.RealInterceptorChain.proceed(RealInterceptorChain.java:117)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> org.apache.flink.kubernetes.shaded.okhttp3.RealCall.getResponseWithInterceptorChain(RealCall.java:229)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> org.apache.flink.kubernetes.shaded.okhttp3.RealCall.execute(RealCall.java:81)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.retryWithExponentialBackoff(OperationSupport.java:585)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:558)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:521)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:488)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleGet(OperationSupport.java:470)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleGet(BaseOperation.java:830)
> ~[flink-dist-1.15.1.jar:1.15.1]
> at
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.getMandatory(BaseOperation.java:200)
> ~[flink-dist-1.15.1.jar:1.15.1]
> ... 12 more
>
> Why is Kube/config needed in Native K8s,  should not service account be
> checked instead?
>
> Are we missing something?
>
> Thanks,
> Tamir.
>
>

Re: How to open a Prometheus metrics port on the rest service when using the Kubernetes operator?

2022-09-05 Thread Yang Wang
I do not think we could add an additional port to the rest service since it
is created by Flink internally.

Actually, I do not suggest scraping the metrics from the rest service.
Instead, the port on the pod should be used,
because the metrics might not work correctly if multiple JobManagers are
running.
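One way to follow this advice is to have Prometheus discover the pods directly and scrape the metrics port there (9249 is the Prometheus reporter's default; the job name and relabeling below are assumptions):

```yaml
# Hypothetical Prometheus scrape config: scrape Flink pods, not the rest service.
scrape_configs:
  - job_name: flink-pods
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true".
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```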


Best,
Yang

Javier Vegas  于2022年9月5日周一 15:00写道:

> What I would need is to set
>
> ports:
>
> - name: metrics
>
>   port: 
>
>   protocol: TCP
>
>
>
> in the generated YAML fir the appname-rest service which properly
> aggregates the metrics from the pods, but I can't not figure out either
> from the job deployment file or modifying the operator templates in the
> Helm chart. Any way I can modify the ports in the Flink rest service?
>
>
> Thanks,
>
>
> Javier Vegas
>
>
>
> El dom, 4 sept 2022 a las 1:59, Javier Vegas ()
> escribió:
>
>> Hi, Biao!
>>
>> Thanks for the fast response! Setting that in the podTemplate opens the
>> metrics port on the pods, but unfortunately not on the rest service. Not
>> sure if that is standard procedure, but my Prometheus setup scrapes the
>> metrics port on services but not pods. On my previous non-operator
>> standalone setup, the metrics port on the service was aggregating all the
>> pod metrics and Prometheus was scraping that, so I was trying to
>> reproduce that by opening the port on the rest service.
>>
>>
>>
>> El dom, 4 sept 2022 a las 1:03, Geng Biao ()
>> escribió:
>>
>>> Hi Javier,
>>>
>>>
>>>
>>> You can use podTemplate to expose the port in the flink containers.
>>>
>>> Here is a snippet:
>>>
>>> spec:
>>>
>>>   flinkVersion: v1_15
>>>
>>>   flinkConfiguration:
>>>
>>> state.savepoints.dir: file:///flink-data/flink-savepoints
>>>
>>> state.checkpoints.dir: file:///flink-data/flink-checkpoints
>>>
>>> *metrics.reporter.prom.factory.class:
>>> org.apache.flink.metrics.prometheus.PrometheusReporterFactory*
>>>
>>>   serviceAccount: flink
>>>
>>>   podTemplate:
>>>
>>> metadata:
>>>
>>>   annotations:
>>>
>>> prometheus.io/path: /metrics
>>>
>>> prometheus.io/port: "9249"
>>>
>>> prometheus.io/scrape: "true"
>>>
>>> spec:
>>>
>>>   serviceAccount: flink
>>>
>>>   containers:
>>>
>>> - name: flink-main-container
>>>
>>>   volumeMounts:
>>>
>>> - mountPath: /flink-data
>>>
>>>   name: flink-volume
>>>
>>>  * ports:*
>>>
>>> *- containerPort: 9249*
>>>
>>> *  name: metrics*
>>>
>>> *  protocol: TCP*
>>>
>>>   volumes:
>>>
>>> - name: flink-volume
>>>
>>>   emptyDir: {}
>>>
>>>
>>>
>>> The bold lines show how to specify the metric reporter and expose
>>> the metric port. The annotations are not required if you use PodMonitor or
>>> ServiceMonitor. Hope it can help!
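For instance, with the prometheus-operator installed, a PodMonitor can target the named metrics port from the pod template above (the selector label is an assumption):

```yaml
# Hypothetical PodMonitor CRD: scrapes the named "metrics" containerPort (9249).
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: flink-metrics
spec:
  selector:
    matchLabels:
      app: flink            # assumed pod label
  podMetricsEndpoints:
    - port: metrics
```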
>>>
>>>
>>>
>>> Best,
>>>
>>> Biao Geng
>>>
>>>
>>>
>>> *From: *Javier Vegas 
>>> *Date: *Sunday, September 4, 2022 at 10:19 AM
>>> *To: *user 
>>> *Subject: *How to open a Prometheus metrics port on the rest service
>>> when using the Kubernetes operator?
>>>
>>> I am migrating my Flink app from standalone Kubernetes to the Kubernetes
>>> operator, it is going well but I ran into a problem, I can not figure out
>>> how to open a Prometheus metrics port in the rest-service to collect all my
>>> custom metrics from the task managers. Note that this is different from the
>>> instructions to "How to Enable Prometheus"
>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging/#how-to-enable-prometheus-example
>>> that example is to collect the operator pod metrics, but what I am trying
>>> to do is open a port on the rest service to make my job metrics available
>>> to Prometheus.
>>>
>>>
>>>
>>> Thanks,
>>>
>>>
>>>
>>> Javier Vegas
>>>
>>


Re: [E] Re: Kubernetes operator expose UI rest service as NodePort instead of default clusterIP

2022-09-05 Thread Yang Wang
I think we have no concrete reason to always override the
"REST_SERVICE_EXPOSED_TYPE" to "ClusterIP".
It was introduced to fix the default value for releases before 1.15, and I
believe we need to respect the user-configured values.

Best,
Yang

Vignesh Kumar Kathiresan  于2022年9月3日周六 05:07写道:

> Jacob,
> Thanks, I checked it out and it didn't work. This is the part that overrides
> the config to ClusterIP
> <https://github.com/apache/flink-kubernetes-operator/blob/468460275984bf1737640aa2fad912dc84da66ad/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/config/FlinkConfigBuilder.java#L186>
> that we were talking about. So it looks like it's always being set to
> ClusterIP now.
>
> Yang,
> Having the alb target type as ip works with a ClusterIP type service.
>
> On Fri, Sep 2, 2022 at 8:18 AM Jeesmon Jacob  wrote:
>
>> I remember testing the operator with the rest service exposed as
>> NodePort. NodePort requires rbac.nodeRoules.create: true (default is false)
>> in values.yaml. Maybe you missed that?
>>
>>
>> https://github.com/apache/flink-kubernetes-operator/blob/release-1.1/helm/flink-kubernetes-operator/values.yaml#L34-L38
>> <https://urldefense.com/v3/__https://github.com/apache/flink-kubernetes-operator/blob/release-1.1/helm/flink-kubernetes-operator/values.yaml*L34-L38__;Iw!!Op6eflyXZCqGR5I!DTn70pqhttQzBpwxuX_IzpnrchfomQ2-Qj8DIHnahai7tLLDx3MX9lmkcnZvRdz4f-LCTpuVvlqTdV-w$>
>>
>> On Thu, Sep 1, 2022 at 11:45 PM Vignesh Kumar Kathiresan via user <
>> user@flink.apache.org> wrote:
>>
>>> Hi Yang,
>>>
>>> Yeah, I gathered that from the operator code soon after posting. I am
>>> using the aws alb ingress class [1]. There under considerations it is
>>> mentioned if the alb target type is "instance" which is the default traffic
>>> mode, the kubernetes service type has to be nodeport or loadbalancer.
>>>
>>> Also, the ALB target type changed to "ip" might work. Let me try that. I
>>> believe there should be a reason the operator always overrides the
>>> "REST_SERVICE_EXPOSED_TYPE" to "ClusterIP".
>>>
>>> [1] https://docs.aws.amazon.com/eks/latest/userguide/alb-ingress.html
>>> <https://urldefense.com/v3/__https://docs.aws.amazon.com/eks/latest/userguide/alb-ingress.html__;!!Op6eflyXZCqGR5I!DTn70pqhttQzBpwxuX_IzpnrchfomQ2-Qj8DIHnahai7tLLDx3MX9lmkcnZvRdz4f-LCTpuVvjbzj4cE$>
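A sketch of that approach: with `target-type: ip` the ALB routes straight to pod IPs, so the ClusterIP rest service works as the backend (service name and port are assumptions):

```yaml
# Hypothetical Ingress for the Flink UI behind the AWS Load Balancer Controller.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: flink-rest-ingress
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/target-type: ip   # routes to pod IPs, not NodePorts
spec:
  rules:
    - http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-cluster-rest   # assumed "*-rest" service name
                port:
                  number: 8081
```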
>>>
>>> On Thu, Sep 1, 2022 at 7:01 PM Yang Wang  wrote:
>>>
>>>> I am afraid the current flink-kubernetes-operator always overrides the
>>>> "REST_SERVICE_EXPOSED_TYPE" to "ClusterIP".
>>>> Could you please share why the ingress[1] could not meet your
>>>> requirements? Compared with NodePort, I think it is a more graceful
>>>> implementation.
>>>>
>>>> [1].
>>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.1/docs/operations/ingress/
>>>> <https://urldefense.com/v3/__https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.1/docs/operations/ingress/__;!!Op6eflyXZCqGR5I!FZvr8XAEWiEa176d0PfqyLJQoxTGIsDkpV-xqs5JNRCJc3Kv43nm-sa2l275jTPk50K2mjrI3COxrj0op5P5cw$>
>>>>
>>>> Best,
>>>> Yang
>>>>
>>>> Vignesh Kumar Kathiresan via user  于2022年9月2日周五
>>>> 04:57写道:
>>>>
>>>>> Hello Flink community,
>>>>>
>>>>> Need some help with "flink kubernetes operator" based cluster setup.
>>>>>
>>>>> My flink cluster is set up using the flink-kubernetes-operator in AWS
>>>>> EKS. The required resources(deployments, pods, services, configmaps etc)
>>>>> are created as expected. But the service "*-rest" is created as a
>>>>> "ClusterIP" type. I would want it created as a NodePort type.
>>>>>
>>>>> I want to expose the UI to external viewing via ingress using the aws
>>>>> alb class. This aws-load balancer-controller requires my service to be of
>>>>> type NodePort.
>>>>>
>>>>> I have tried a few options but the service is always created as
>>>>> ClusterIP.
>>>>> 1) In the FlinkDeployment CRD, under spec.flinkConfiguration
>>>>> added kubernetes.rest-service.exposed.type: "NodePort"
>>>>> 2) In the operator helm values.yaml
>>>>>
>>>>> defaultConfiguration:
>>>>>   create: true
>>>>>   # Set append to false to replace configuration files
>>>>>   append: true
>>>>>   flink-conf.yaml: |+
>>>>> # Flink Config Overrides
>>>>> kubernetes.rest-service.exposed.type: NodePort
>>>>>
>>>>> Neither option gives me a NodePort type service for the UI.
>>>>> Any suggestions?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>


Re: Kubernetes operator expose UI rest service as NodePort instead of default clusterIP

2022-09-01 Thread Yang Wang
I am afraid the current flink-kubernetes-operator always overrides the
"REST_SERVICE_EXPOSED_TYPE" to "ClusterIP".
Could you please share why the ingress[1] could not meet your requirements?
Compared with NodePort, I think it is a more graceful implementation.

[1].
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.1/docs/operations/ingress/

Best,
Yang

Vignesh Kumar Kathiresan via user  于2022年9月2日周五
04:57写道:

> Hello Flink community,
>
> Need some help with "flink kubernetes operator" based cluster setup.
>
> My flink cluster is set up using the flink-kubernetes-operator in AWS EKS.
> The required resources(deployments, pods, services, configmaps etc) are
> created as expected. But the service "*-rest" is created as a "ClusterIP"
> type. I would want it created as a NodePort type.
>
> I want to expose the UI to external viewing via ingress using the aws alb
> class. This aws-load balancer-controller requires my service to be of type
> NodePort.
>
> I have tried a few options but the service is always created as ClusterIP.
> 1) In the FlinkDeployment CRD, under spec.flinkConfiguration
> added kubernetes.rest-service.exposed.type: "NodePort"
> 2) In the operator helm values.yaml
>
> defaultConfiguration:
>   create: true
>   # Set append to false to replace configuration files
>   append: true
>   flink-conf.yaml: |+
> # Flink Config Overrides
> kubernetes.rest-service.exposed.type: NodePort
>
> Neither option gives me a NodePort type service for the UI.
> Any suggestions?
>
>
>
>
>
>
>
>


Re: Error when run test case in Windows

2022-08-22 Thread Yang Wang
It is caused by the following assert. Maybe we could use *File.pathSeparator*
instead of "/".

*assertThat(optional.get()).isEqualTo(hadoopHome + "/conf");*
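A minimal sketch of the portable comparison; note that the field holding the in-path directory separator is `File.separator` (`File.pathSeparator` is the separator between entries of a PATH-style list):

```java
import java.io.File;

// Sketch: build the expected path with File.separator ("/" on Unix, "\" on
// Windows) instead of a hard-coded "/", so the equality assertion holds on
// both platforms.
public class SeparatorCheck {
    public static void main(String[] args) {
        String hadoopHome = System.getProperty("java.io.tmpdir"); // stand-in for the test's temp dir
        String expected = hadoopHome + File.separator + "conf";
        System.out.println(expected);
    }
}
```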

Would you like to create a ticket and attach a PR for this issue?

Best,
Yang

hjw <1010445...@qq.com> 于2022年8月21日周日 19:44写道:

> When I run mvn clean install, it runs the Flink test cases.
> However, I get these errors:
> [ERROR] Failures:
> [ERROR]
>  
> KubernetesClusterDescriptorTest.testDeployApplicationClusterWithNonLocalSchema:155
> Previous method call should have failed but it returned:
> org.apache.flink.kubernetes.KubernetesClusterDescriptor$$Lambda$839/1619964974@70e5737f
> [ERROR]
>  
> AbstractKubernetesParametersTest.testGetLocalHadoopConfigurationDirectoryFromHadoop1HomeEnv:132->runTestWithEmptyEnv:149->lambda$testGetLocalHadoopConfigurationDirectoryFromHadoop1HomeEnv$3:141
> Expected: is
> "C:\Users\10104\AppData\Local\Temp\junit5662202040601670287/conf"
>  but: was
> "C:\Users\10104\AppData\Local\Temp\junit5662202040601670287\conf"
> [ERROR]
>  
> AbstractKubernetesParametersTest.testGetLocalHadoopConfigurationDirectoryFromHadoop2HomeEnv:117->runTestWithEmptyEnv:149->lambda$testGetLocalHadoopConfigurationDirectoryFromHadoop2HomeEnv$2:126
> Expected: is
> "C:\Users\10104\AppData\Local\Temp\junit7094401822178578683/etc/hadoop"
>  but: was
> "C:\Users\10104\AppData\Local\Temp\junit7094401822178578683\etc\hadoop"
> [ERROR]
>  KubernetesUtilsTest.testLoadPodFromTemplateWithNonExistPathShouldFail:110
> Expected: Expected error message is "Pod template file
> /path/of/non-exist.yaml does not exist."
>  but: The throwable <Pod template file \path\of\non-exist.yaml does not exist.> does not contain the
> expected error message "Pod template file /path/of/non-exist.yaml does not
> exist."
>
> I believe the error occurred due to different file system (Unix, Windows,
> etc.) path separators.
>
>
>
>
>
> Env:
> Flink version :1.15
> Maven:3.2.5
> Jdk:1.8
> Environment:Win10
>


Re: Flink Operator Resources Requests and Limits

2022-07-27 Thread Yang Wang
We have the *kubernetes.jobmanager.cpu.limit-factor* and
*kubernetes.jobmanager.memory.limit-factor* to control the limit values.

The resource limit will be set to memory/cpu * limit-factor.
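For example (values are illustrative), requesting 0.5 CPU while allowing bursts up to 2 CPUs:

```yaml
# Hypothetical flink-conf fragment: limit = request * limit-factor.
kubernetes.jobmanager.cpu: 0.5
kubernetes.jobmanager.cpu.limit-factor: 4.0     # CPU limit becomes 2.0
kubernetes.jobmanager.memory.limit-factor: 1.0  # memory limit equals the request
```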


Best,
Yang

PACE, JAMES  于2022年7月28日周四 01:26写道:

> That does not seem to work.
>
>
>
> For instance:
>
>   jobManager:
>
> podTemplate:
>
>   spec:
>
> containers:
>
>   - resources:
>
>   requests:
>
> cpu: "0.5"
>
> memory: "2048m"
>
>   limits:
>
> cpu: "2"
>
> memory: "2048m"
>
>
>
> results in a pod like this:
>
> Limits:
>
>   cpu: 1
>
>   memory:  1600Mi
>
> Requests:
>
>   cpu: 1
>
>   memory:  1600Mi
>
>
>
> This appears to be overwritten by a default if cpu and memory do not
> appear in the jobManager resources.
>
>
>
> Jim
>
>
>
> *From:* Őrhidi Mátyás 
> *Sent:* Wednesday, July 27, 2022 11:16 AM
> *To:* PACE, JAMES 
> *Cc:* user@flink.apache.org
> *Subject:* Re: Flink Operator Resources Requests and Limits
>
>
>
> Hi James,
>
>
>
> Have you considered using pod templates already?
>
>
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/pod-template/
> 
>
>
>
> Regards,
>
> Matyas
>
>
>
> On Wed, Jul 27, 2022 at 3:21 PM PACE, JAMES  wrote:
>
> We are currently evaluating the apache flink operator (version 1.1.0) to
> replace the operator that we currently use.  Setting the memory and cpu
> resources sets both the request and the limit for the pod.  Previously, we
> were only setting request allowing pods to oversubscribe to CPU when needed
> to handle the burstiness of the traffic that we see into the jobs.
>
>
>
> Is there a way to set different values for cpu for resource requests and
> limits, or omit the limit specification?  If not, is this something that
> would be on the roadmap?
>
>
>
> Thanks.
>
>
>
> Jim
>
>


Re: NodePort conflict for multiple HA application-mode standalone Kubernetes deploys in same namespace

2022-07-24 Thread Yang Wang
Removing the hard-coded nodePort for each Flink application is necessary so
that Kubernetes can pick a random port.

Moreover, I believe you also need to change some other yamls. For example,
having a different name for JobManager/TaskManager yamls, update
the jobmanager-service.yaml and flink-configuration-configmap.yaml to use
the new name.
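A sketch of the per-application renaming, assuming an application called "job-a"; giving each deployment its own cluster-id and service name keeps the TaskManagers of one job from registering with another job's JobManager:

```yaml
# Hypothetical flink-configuration-configmap fragment; names are assumptions
# and must be unique per application in the namespace.
kubernetes.cluster-id: job-a                    # HA ConfigMaps derive from this
jobmanager.rpc.address: flink-jobmanager-job-a  # the renamed JM service
```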

An easy way is to use the flink-kubernetes-operator[1].

[1].
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.1/

Best,
Yang

Javier Vegas  于2022年7月24日周日 23:43写道:

> Partial answer to my own question: Removing the hardcoded `nodePort:
> 30081` entry from jobmanager-rest-service.yaml, Flink assigns random
> ports so there are no conflicts and multiple Flink application-mode jobs
> can be deployed. However the jobs seem to communicate with each other, when
> launching the second job, the first job taskmanagers start executing tasks
> sent by the second job jobmanager, and the second job taskmanagers execute
> jobs from both jobmanagers.
>
> El vie, 22 jul 2022 a las 12:03, Javier Vegas ()
> escribió:
>
>>
>> I am deploying a high-availability Flink job to Kubernetes in application
>> mode using Flink's standalone k8 deployment
>> https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/standalone/kubernetes/
>> All goes well when I deploy a job, but if I want to deploy a second
>> application-mode Flink job in the same K8s namespace I get a 
>> "spec.ports[0].nodePort:
>> Invalid value: 30081: provided port is already allocated" error. Is
>> there a way that nodePort can be allocated dynamically, or other way around
>> this (using Loadbalancer or Ingress instead of NodePort in
>> jobmanager-rest-service.yaml?) besides hard-coding different nodePorts
>> for different jobs running in same namespace?
>>
>> Thanks,
>>
>> Javier Vegas
>>
>


Re: [ANNOUNCE] Apache Flink Kubernetes Operator 1.1.0 released

2022-07-24 Thread Yang Wang
Congrats! Thanks Gyula for driving this release, and thanks to all
contributors!


Best,
Yang

Gyula Fóra  于2022年7月25日周一 10:44写道:

> The Apache Flink community is very happy to announce the release of Apache
> Flink Kubernetes Operator 1.1.0.
>
> The Flink Kubernetes Operator allows users to manage their Apache Flink
> applications and their lifecycle through native k8s tooling like kubectl.
>
> Please check out the release blog post for an overview of the release:
>
> https://flink.apache.org/news/2022/07/25/release-kubernetes-operator-1.1.0.html
>
> The release is available for download at:
> https://flink.apache.org/downloads.html
>
> Maven artifacts for Flink Kubernetes Operator can be found at:
>
> https://search.maven.org/artifact/org.apache.flink/flink-kubernetes-operator
>
> Official Docker image for the Flink Kubernetes Operator can be found at:
> https://hub.docker.com/r/apache/flink-kubernetes-operator
>
> The full release notes are available in Jira:
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351723
>
> We would like to thank all contributors of the Apache Flink community who
> made this release possible!
>
> Regards,
> Gyula Fora
>


Re: standalone mode support in the kubernetes operator (FLIP-25)

2022-07-18 Thread Yang Wang
I think we have at least the following advantages (in some cases).
* We do not need to configure a service account for the JobManager that
allows it to allocate/delete pods via the Kubernetes APIServer.
* The reactive mode[1] only works with a standalone cluster.

[1].
https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/elastic_scaling/#reactive-mode
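For reference, reactive mode is a single setting in the standalone cluster's flink-conf:

```yaml
# Reactive mode (standalone application clusters only): the job is rescaled to
# use all available TaskManager slots whenever TaskManagers join or leave.
scheduler-mode: reactive
```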

Best,
Yang

yidan zhao  于2022年7月15日周五 10:13写道:

> Hi all, does 'standalone mode support in the kubernetes operator'
> mean: using the flink-k8s-operator to manage jobs deployed in a
> standalone cluster?
> What is the advantage of doing so?
>
> Yang Wang  于2022年7月14日周四 10:55写道:
> >
> > I think the standalone mode support is expected to be done in the
> version 1.2.0[1], which will be released on Oct 1 (ETA).
> >
> > [1].
> https://cwiki.apache.org/confluence/display/FLINK/Release+Schedule+and+Planning
> >
> >
> > Best,
> > Yang
> >
> > Javier Vegas  于2022年7月14日周四 06:25写道:
> >>
> >> Hello! The operator docs
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/overview/
> say "The Operator does not support Standalone Kubernetes deployments yet"
> and mentions
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-225%3A+Implement+standalone+mode+support+in+the+kubernetes+operator
> as a "what's next" step. Is there a timeline for that to be released?
> >>
> >> Thanks,
> >>
> >> Javier Vegas
>


Re: standalone mode support in the kubernetes operator (FLIP-25)

2022-07-13 Thread Yang Wang
I think the standalone mode support is expected to be done in the version
1.2.0[1], which will be released on Oct 1 (ETA).

[1].
https://cwiki.apache.org/confluence/display/FLINK/Release+Schedule+and+Planning


Best,
Yang

Javier Vegas  于2022年7月14日周四 06:25写道:

> Hello! The operator docs
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/overview/
> say "The Operator does not support Standalone Kubernetes
> 
>  deployments yet" and mentions
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-225%3A+Implement+standalone+mode+support+in+the+kubernetes+operator
> as a "what's next" step. Is there a timeline for that to be released?
>
> Thanks,
>
> Javier Vegas
>


Re: [ANNOUNCE] Apache Flink Kubernetes Operator 1.0.1 released

2022-06-28 Thread Yang Wang
Thanks Gyula for working on the first patch release for the Flink
Kubernetes Operator project.


Best,
Yang



Gyula Fóra  于2022年6月28日周二 00:22写道:

> The Apache Flink community is very happy to announce the release of Apache
> Flink Kubernetes Operator 1.0.1.
>
> The Flink Kubernetes Operator allows users to manage their Apache Flink
> applications and their lifecycle through native k8s tooling like kubectl.
> <
> https://flink.apache.org/news/2022/04/03/release-kubernetes-operator-0.1.0.html
> >
>
> The release is available for download at:
> https://flink.apache.org/downloads.html
>
> Official Docker image for Flink Kubernetes Operator applications can be
> found at:
> https://hub.docker.com/r/apache/flink-kubernetes-operator
>
> The full release notes are available in Jira:
>
> https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351812
>
> We would like to thank all contributors of the Apache Flink community who
> made this release possible!
>
> Regards,
> Gyula Fora
>


Re: Flink k8s Operator on AWS?

2022-06-26 Thread Yang Wang
Could you please share the JobManager logs of failed deployment? It will
also help a lot if you could show the pending pod status via "kubectl
describe ".

Given that the current Flink Kubernetes Operator is built on top of native
K8s integration[1], the Flink ResourceManager should allocate enough
TaskManager pods automatically.
We need to find out what is wrong via the logs. Maybe the service account
or taint or something else.


[1]. https://flink.apache.org/2021/02/10/native-k8s-with-ha.html


Best,
Yang

Matt Casters  于2022年6月24日周五 23:48写道:

> Yes, of course.  I already feel a bit less intelligent for having asked the
> question ;-)
>
> The status now is that I managed to have it all puzzled together.  Copying
> the files from s3 to an ephemeral volume takes all of 2 seconds so it's
> really not an issue.  The cluster starts and our fat jar and Apache Hop
> MainBeam class is found and started.
>
> The only thing that remains is figuring out how to configure the Flink
> cluster itself.  I have a couple of m5.large ec2 instances in a node group
> on EKS and I set taskmanager.numberOfTaskSlots to "4".  However, the tasks
> in the pipeline can't seem to find resources to start.
>
> Caused by:
> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
> Slot request bulk is not fulfillable! Could not allocate the required slot
> within slot request timeout
>
> Parallelism was set to 1 for the runner and there are only 2 tasks in my
> first Beam pipeline so it should be simple enough but it just times out.
>
> Next step for me is to document the result which will end up on
> hop.apache.org.   I'll probably also want to demo this in Austin at the
> upcoming Beam summit.
>
> Thanks a lot for your time and help so far!
>
> Cheers,
> Matt
>
>
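For context on the symptom described above: the slot and resource settings involved live in the FlinkDeployment spec and look roughly like the sketch below (the memory/cpu values are assumptions). One common cause of `NoResourceAvailableException` on small node groups is that the requested TaskManager resources do not fit on the nodes (an m5.large has 2 vCPUs and 8 GiB, minus Kubernetes overhead), so the TaskManager pods stay Pending and the slot request times out.

```yaml
# Sketch only: where the slot count and TaskManager resources are configured.
spec:
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "4"
  taskManager:
    resource:
      memory: "2048m"  # assumed value; must fit on the node after K8s overhead
      cpu: 1           # assumed value
```

Running `kubectl describe` on a Pending TaskManager pod usually shows the scheduling reason directly.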


Re: Flink operator - ignore ssl cert validation

2022-06-23 Thread Yang Wang
Do you mean the HttpArtifactFetcher could not support HTTPS?

cc @Aitozi 

Best,
Yang

calvin beloy  于2022年6月22日周三 22:10写道:

> Sorry typo "jarring" should be "jar url".
>
> Sent from Yahoo Mail on Android
> 
>
> On Wed, Jun 22, 2022 at 10:07 AM, calvin beloy
>  wrote:
> Hi,
>
> We are using Flink Operator to deploy FlinkSessionJob. The jarring is
> pointing to our private internal repo which is using self signed ssl cert.
> What's the easiest way for the operator to ignore ssl cert validation?
>
> Thanks,
> Calvin
>
> Sent from Yahoo Mail on Android
> 
>
>


Re: Flink Operator - Support for k8s HA jobmanager

2022-06-23 Thread Yang Wang
Matyas's answer is on point.

You need to mount a shared volume for all the JobManager pods so that the
uploaded jars are visible for them all.

Best,
Yang
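The shared-volume setup described above could be sketched roughly as follows; the volume name, PVC name, and mount path are assumptions, and the PVC must support ReadWriteMany so that both JobManager replicas can mount it:

```yaml
# Hypothetical FlinkDeployment fragment: both JobManager replicas mount the
# same shared PVC, and the web upload directory points at it.
spec:
  flinkConfiguration:
    web.upload.dir: /opt/flink/uploads
  jobManager:
    replicas: 2
  podTemplate:
    spec:
      containers:
        - name: flink-main-container
          volumeMounts:
            - name: shared-uploads
              mountPath: /opt/flink/uploads
      volumes:
        - name: shared-uploads
          persistentVolumeClaim:
            claimName: flink-uploads-pvc  # assumed RWX-capable claim
```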

Őrhidi Mátyás  于2022年6月23日周四 04:34写道:

> I guess the problem here is that your JM pods do not have access to a
> common upload folder. You need to mount a shared volume for all the
> JobManagers and point to it using the 'web.upload.dir' property.
>
> On Wed, Jun 22, 2022 at 8:45 PM calvin beloy  wrote:
>
>> Hi,
>>
>> Trying to deploy FlinkSessionJob with Flink Operator on HA enabled
>> Jobmanager (2 replicas) but getting below error. Changing job manager
>> replica to 1 is working fine.
>> Is this a bug on Flink Operator that needs to support 2 Jobmanagers on HA
>> mode?
>>
>> at
>> org.apache.flink.kubernetes.operator.service.FlinkService.submitJobToSessionCluster(FlinkService.java:198)
>>
>>at
>> org.apache.flink.kubernetes.operator.reconciler.sessionjob.FlinkSessionJobReconciler.submitAndInitStatus(FlinkSessionJobReconciler.java:164)
>>
>>at
>> org.apache.flink.kubernetes.operator.reconciler.sessionjob.FlinkSessionJobReconciler.reconcile(FlinkSessionJobReconciler.java:88)
>>
>>at
>> org.apache.flink.kubernetes.operator.reconciler.sessionjob.FlinkSessionJobReconciler.reconcile(FlinkSessionJobReconciler.java:48)
>>
>>at
>> org.apache.flink.kubernetes.operator.controller.FlinkSessionJobController.reconcile(FlinkSessionJobController.java:115)
>>
>>... 13 more
>>
>> Caused by: java.util.concurrent.ExecutionException:
>> org.apache.flink.runtime.rest.util.RestClientException: [Internal server
>> error., >
>> java.util.concurrent.CompletionException:
>> org.apache.flink.runtime.rest.handler.RestHandlerException: Jar file
>> /tmp/flink-web-452cd7b2-6e13-4147-b5e8-a829c5be733a/flink-web-upload/bfdc0c10-91ba-4b6c-9d25-cef8fccbddd9_plt-realtime-nirvana-prices-1.0.5.jar
>> does not exist
>>
>>at
>> org.apache.flink.runtime.webmonitor.handlers.utils.JarHandlerUtils$JarHandlerContext.toPackagedProgram(JarHandlerUtils.java:172)
>>
>>at
>> org.apache.flink.runtime.webmonitor.handlers.utils.JarHandlerUtils$JarHandlerContext.applyToConfiguration(JarHandlerUtils.java:141)
>>
>>at
>> org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.handleRequest(JarRunHandler.java:100)
>>
>>at
>> org.apache.flink.runtime.webmonitor.handlers.JarRunHandler.handleRequest(JarRunHandler.java:57)
>>
>>at
>> org.apache.flink.runtime.rest.handler.AbstractRestHandler.respondToRequest(AbstractRestHandler.java:83)
>>
>>at
>> org.apache.flink.runtime.rest.handler.AbstractHandler.respondAsLeader(AbstractHandler.java:195)
>>
>>at
>> org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.lambda$channelRead0$0(LeaderRetrievalHandler.java:83)
>>
>>at java.base/java.util.Optional.ifPresent(Unknown Source)
>>
>>at
>> org.apache.flink.util.OptionalConsumer.ifPresent(OptionalConsumer.java:45)
>>
>>at
>> org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.channelRead0(LeaderRetrievalHandler.java:80)
>>
>>at
>> org.apache.flink.runtime.rest.handler.LeaderRetrievalHandler.channelRead0(LeaderRetrievalHandler.java:49)
>>
>>at
>> org.apache.flink.shaded.netty4.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>>
>>at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>>
>>at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>>
>>at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>>
>>at
>> org.apache.flink.runtime.rest.handler.router.RouterHandler.routed(RouterHandler.java:115)
>>
>>at
>> org.apache.flink.runtime.rest.handler.router.RouterHandler.channelRead0(RouterHandler.java:94)
>>
>>at
>> org.apache.flink.runtime.rest.handler.router.RouterHandler.channelRead0(RouterHandler.java:55)
>>
>>at
>> org.apache.flink.shaded.netty4.io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
>>
>>at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
>>
>>at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
>>
>>at
>> org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
>>
>

Re: HTTP 404 while creating resource with flink kubernetes operator and frabric8 client

2022-06-23 Thread Yang Wang
Have you installed the operator along with the CRD[1]?

[1].
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.0/docs/try-flink-kubernetes-operator/quick-start/#deploying-the-operator

Best,
Yang

yu'an huang  于2022年6月23日周四 13:04写道:

> Hi,
>
> It seems that you can't find the FlinkDeployment. I saw the api server
> address is maskedip. Can you confirm whether it is the correct api server
> address?
>
> Best,
> Yuan
>
> On Thu, 23 Jun 2022 at 11:52 AM, Kishore Pola 
> wrote:
>
>> Hello flink user group,
>>
>> When I am trying to create a Flink deployment with the operator
>> programmatically, the Kubernetes cluster returns an HTTP 404 message. Any
>> pointers/help?
>> I am constructing the context using the fabric8 client like this:
>>
>>
>> Config config =
>> new ConfigBuilder()
>> .withMasterUrl(masterUrl)
>> .withTrustCerts(true)
>> .withDisableHostnameVerification(true)
>> .build();
>>
>> this.client = new DefaultKubernetesClient(config);
>> this.flinkCrdContext = new ResourceDefinitionContext
>> .Builder()
>> .withGroup("flink.apache.org")
>> .withVersion(OPERATOR_API_VERSION)
>> .withPlural("flinkdeployments")
>> .withNamespaced(true)
>> .withKind("FlinkDeployment")
>> .build();
>>
>>
>>
>> io.fabric8.kubernetes.client.KubernetesClientException: Failure executing:
>> POST at:
>> https://maskedip/apis/flink.apache.org/v1beta1/namespaces/default/flinkdeployments.
>> Message: Not Found.
>>
>


Re: Flink k8s Operator on AWS?

2022-06-23 Thread Yang Wang
Thanks for your valuable input.
Making deploying Flink on K8s as easy as a normal Java application
is certainly the mission of the Flink Kubernetes Operator. Obviously, we are
still a little far from this mission.

Back to the user jar download: I think it makes sense to introduce an
artifact fetcher in the Flink Kubernetes Operator, or directly in Flink.
Then things become much easier.
Users just need to configure the S3 access key and secret.


Best,
Yang




Matt Casters  于2022年6月22日周三 15:52写道:

> Hi Yang,
>
> Thanks for the suggestion!  I looked into this volume sharing on EKS
> yesterday but I couldn't figure it out right away.
> The way that people come into the Apache Hop project is often with very
> little technical knowledge since that's sort of the goal of the project:
> make things easy.  Following page after page of complicated instructions
> just to get a few files into a pod container... I feel it's just a bit
> much.
> But again, this is my frustration with k8s, not with Flink ;-)
>
> Cheers,
> Matt
>
> On Wed, Jun 22, 2022 at 5:32 AM Yang Wang  wrote:
>
>> Matyas and Gyula have shared a lot of great information about how to make
>> the Flink Kubernetes Operator work on EKS.
>>
>> One more input about how to prepare the user jars: if you are more
>> familiar with K8s, you could use a persistent volume to provide the user jars
>> and then mount the volume to the JobManager and TaskManager.
>> I think EKS could support EBS, NFS, and other PVs.
>>
>> Best,
>> Yang
>>
>> Őrhidi Mátyás  于2022年6月21日周二 23:00写道:
>>
>>> Hi Matt,
>>>
>>> I believe an artifact fetcher (e.g
>>> https://hub.docker.com/r/agiledigital/s3-artifact-fetcher ) + the pod
>>> template (
>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/pod-template/#pod-template)
>>> is an elegant way to solve your problem.
>>>
>>> The operator uses K8s native integration under the hood:
>>> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/native_kubernetes/#application-mode
>>>  In
>>> application mode,  the main() method of the application is executed on the
>>> JobManager, hence we need the jar locally.
>>>
>>> You can launch a session cluster (without job spec) on the operator that
>>> allows submitting jars if you would like to avoid dealing with
>>> authentication, but the recommended and safe approach is to use
>>> sessionjobs for this purpose.
>>>
>>>
>>> Cheers,
>>> Matyas
>>>
>>> On Tue, Jun 21, 2022 at 4:03 PM Matt Casters <
>>> matt.cast...@neotechnology.com> wrote:
>>>
>>>> Thank you very much for the help Matyas and Gyula!
>>>>
>>>> I just saw a video today where you were presenting the FKO.  Really
>>>> nice stuff!
>>>>
>>>> So I'm guessing we're executing "flink run" at some point on the master
>>>> and that this is when we need the jar file to be local?
>>>> Am I right in assuming that this happens after the flink cluster in
>>>> question was started, as part of the job execution?
>>>>
>>>> On the one hand I agree with the underlying idea that authentication
>>>> and security should not be a responsibility of the operator.   On the other
>>>> hand I could add a flink-s3 driver but then I'd also have to configure it
>>>> and so on and it's just hard to get that configuration to be really clean.
>>>>
>>>> Do we have some service running on the flink cluster which would allow
>>>> us to post/copy files from the client (running kubectl) to the master?  If
>>>> so, could we add an option to the job specification to that effect?  Just
>>>> brainstorming ;-) (and forking apache/flink-kubernetes-operator)
>>>>
>>>> All the best,
>>>> Matt
>>>>
>>>> On Tue, Jun 21, 2022 at 2:52 PM Őrhidi Mátyás 
>>>> wrote:
>>>>
>>>>> Hi Matt,
>>>>>
>>>>> - In FlinkDeployments you can utilize an init container to download
>>>>> your artifact onto a shared volume, then you can refer to it as local:/..
>>>>> from the main container. FlinkDeployments comes with pod template support
>>>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/pod-template/#pod-template
>>>>>
>>>>> - FlinkSessionJobs 

Re: Flink k8s Operator on AWS?

2022-06-21 Thread Yang Wang
Matyas and Gyula have shared a lot of great information about how to make the
Flink Kubernetes Operator work on EKS.

One more input about how to prepare the user jars: if you are more familiar
with K8s, you could use a persistent volume to provide the user jars and then
mount the volume to the JobManager and TaskManager.
I think EKS could support EBS, NFS, and other PVs.

Best,
Yang
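The init-container approach mentioned in the quoted reply below could be sketched like this; the fetcher image and mount paths are assumptions, while the S3 path comes from the original question:

```yaml
# Hypothetical pod template: an init container copies the fat jar from S3
# onto an emptyDir volume that the Flink main container mounts under usrlib.
podTemplate:
  spec:
    initContainers:
      - name: fetch-jar
        image: amazon/aws-cli  # assumed image providing the aws CLI
        command: ["aws", "s3", "cp", "s3://hop-eks/hop/hop-2.1.0-fat.jar", "/flink-artifacts/"]
        volumeMounts:
          - name: artifacts
            mountPath: /flink-artifacts
    containers:
      - name: flink-main-container
        volumeMounts:
          - name: artifacts
            mountPath: /opt/flink/usrlib
    volumes:
      - name: artifacts
        emptyDir: {}
```

The job spec could then reference the jar as `local:///opt/flink/usrlib/hop-2.1.0-fat.jar`.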

Őrhidi Mátyás  于2022年6月21日周二 23:00写道:

> Hi Matt,
>
> I believe an artifact fetcher (e.g
> https://hub.docker.com/r/agiledigital/s3-artifact-fetcher ) + the pod
> template (
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/pod-template/#pod-template)
> is an elegant way to solve your problem.
>
> The operator uses K8s native integration under the hood:
> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/native_kubernetes/#application-mode
>  In
> application mode,  the main() method of the application is executed on the
> JobManager, hence we need the jar locally.
>
> You can launch a session cluster (without job spec) on the operator that
> allows submitting jars if you would like to avoid dealing with
> authentication, but the recommended and safe approach is to use
> sessionjobs for this purpose.
>
>
> Cheers,
> Matyas
>
> On Tue, Jun 21, 2022 at 4:03 PM Matt Casters <
> matt.cast...@neotechnology.com> wrote:
>
>> Thank you very much for the help Matyas and Gyula!
>>
>> I just saw a video today where you were presenting the FKO.  Really nice
>> stuff!
>>
>> So I'm guessing we're executing "flink run" at some point on the master
>> and that this is when we need the jar file to be local?
>> Am I right in assuming that this happens after the flink cluster in
>> question was started, as part of the job execution?
>>
>> On the one hand I agree with the underlying idea that authentication and
>> security should not be a responsibility of the operator.   On the other
>> hand I could add a flink-s3 driver but then I'd also have to configure it
>> and so on and it's just hard to get that configuration to be really clean.
>>
>> Do we have some service running on the flink cluster which would allow us
>> to post/copy files from the client (running kubectl) to the master?  If so,
>> could we add an option to the job specification to that effect?  Just
>> brainstorming ;-) (and forking apache/flink-kubernetes-operator)
>>
>> All the best,
>> Matt
>>
>> On Tue, Jun 21, 2022 at 2:52 PM Őrhidi Mátyás 
>> wrote:
>>
>>> Hi Matt,
>>>
>>> - In FlinkDeployments you can utilize an init container to download your
>>> artifact onto a shared volume, then you can refer to it as local:/.. from
>>> the main container. FlinkDeployments comes with pod template support
>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/pod-template/#pod-template
>>>
>>> - FlinkSessionJobs comes with an artifact fetcher, but it may need some
>>> tweaking to make it work on your environment:
>>>
>>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/overview/#flinksessionjob-spec-overview
>>>
>>> I hope it helps, let us know if you have further questions.
>>>
>>> Cheers,
>>> Matyas
>>>
>>>
>>>
>>> On Tue, Jun 21, 2022 at 2:35 PM Matt Casters <
>>> matt.cast...@neotechnology.com> wrote:
>>>
 Hi Flink team!

 I'm interested in getting the new Flink Kubernetes Operator to work on
 AWS EKS.  Following the documentation I got pretty far.  However, when
 trying to run a job I got the following error:

 Only "local" is supported as schema for application mode. This assumes
> that the jar is located in the image, not the Flink client. An example
> of such path is: local:///opt/flink/examples/streaming/WindowJoin.jar


  I have an Apache Hop/Beam fat jar capable of running the Flink
 pipeline in my yml file:

 jarURI: s3://hop-eks/hop/hop-2.1.0-fat.jar

 So how could I go about getting the fat jar in a desired location for
 the operator?

 Getting this to work would be really cool for both short and long-lived
 pipelines in the service of all sorts of data integration work.  It would
 do away with the complexity of setting up and maintaining your own Flink
 cluster.

 Thanks in advance!

 All the best,

 Matt (mcasters, Apache Hop PMC)




Re: Flink Kubernetes Operator with K8S + Istio + mTLS - port definitions

2022-06-16 Thread Yang Wang
Could you please try with high availability enabled[1]?

If HA is enabled, the internal JobManager RPC service will not be created.
Instead, the TaskManager retrieves the JobManager address via the HA services
and connects to it via the pod IP.

[1].
https://github.com/apache/flink-kubernetes-operator/blob/main/examples/basic-checkpoint-ha.yaml


Best,
Yang
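For reference, the HA settings in the linked example boil down to roughly the following `flinkConfiguration` entries (the storage path here is an assumption; any durable shared location such as S3 works):

```yaml
flinkConfiguration:
  high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
  high-availability.storageDir: file:///flink-data/ha  # assumed path
```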

Elisha, Moshe (Nokia - IL/Kfar Sava)  于2022年6月16日周四
15:24写道:

> Hello,
>
>
>
> We are launching Flink deployments using the Flink Kubernetes Operator
> 
> on a Kubernetes cluster with Istio and mTLS enabled.
>
>
>
> We found that the TaskManager is unable to communicate with the JobManager
> on the jobmanager-rpc port:
>
>
>
> 2022-06-15 15:25:40,508 WARN  akka.remote.ReliableDeliverySupervisor
>   [] - Association with remote system
> [akka.tcp://flink@amf-events-to-inference-and-central.nwdaf-edge:6123]
> has failed, address is now gated for [50] ms. Reason: [Association failed
> with [akka.tcp://flink@amf-events-to-inference-and-central.nwdaf-edge:6123]]
> Caused by: [The remote system explicitly disassociated (reason unknown).]
>
>
>
> The reason for the issue is that the JobManager service port definitions are
> not following the Istio guidelines
> https://istio.io/latest/docs/ops/configuration/traffic-management/protocol-selection/
> (see example below).
>
>
>
> We believe a change to the default port definitions is needed but for now,
> is there an immediate action we can take to work around the issue? Perhaps
> overriding the default port definitions somehow?
>
>
>
> Thanks.
>
>
>
>
>
> flink-kubernetes-operator 1.0.0
>
> Flink 1.14-java11
>
> Kubernetes v1.19.5
>
> Istio 1.7.6
>
>
>
>
>
> # k get service inference-results-to-analytics-engine -o yaml
>
> apiVersion: v1
>
> kind: Service
>
> metadata:
>
> ...
>
>   labels:
>
> app: inference-results-to-analytics-engine
>
> type: flink-native-kubernetes
>
>   name: inference-results-to-analytics-engine
>
> spec:
>
>   clusterIP: None
>
>   ports:
>
>   - name: jobmanager-rpc # should start with "tcp-" or add "appProtocol"
> property
>
> port: 6123
>
> protocol: TCP
>
> targetPort: 6123
>
>   - name: blobserver # should start with "tcp-" or add "appProtocol"
> property
>
> port: 6124
>
> protocol: TCP
>
> targetPort: 6124
>
>   selector:
>
> app: inference-results-to-analytics-engine
>
> component: jobmanager
>
> type: flink-native-kubernetes
>
>   sessionAffinity: None
>
>   type: ClusterIP
>
> status:
>
>   loadBalancer: {}
>
>
>
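Following the Istio protocol-selection guideline referenced above, a corrected port list for that service would look roughly like this (whether the operator currently allows such an override is the open question in this thread):

```yaml
ports:
  - name: tcp-jobmanager-rpc  # "tcp-" prefix lets Istio detect the protocol
    port: 6123
    protocol: TCP
    targetPort: 6123
    appProtocol: tcp          # alternatively, set appProtocol explicitly
  - name: tcp-blobserver
    port: 6124
    protocol: TCP
    targetPort: 6124
    appProtocol: tcp
```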


[ANNOUNCE] Apache Flink Kubernetes Operator 1.0.0 released

2022-06-05 Thread Yang Wang
The Apache Flink community is very happy to announce the release of Apache
Flink Kubernetes Operator 1.0.0.

The Flink Kubernetes Operator allows users to manage their Apache Flink
applications and their lifecycle through native k8s tooling like kubectl.
This is the first production-ready release and brings numerous improvements
and new features to almost every aspect of the operator.

Please check out the release blog post for an overview of the release:
https://flink.apache.org/news/2022/06/05/release-kubernetes-operator-1.0.0.html

The release is available for download at:
https://flink.apache.org/downloads.html

Maven artifacts for Flink Kubernetes Operator can be found at:
https://search.maven.org/artifact/org.apache.flink/flink-kubernetes-operator

Official Docker image for Flink Kubernetes Operator applications can be
found at:
https://hub.docker.com/r/apache/flink-kubernetes-operator

The full release notes are available in Jira:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12315522&version=12351500

We would like to thank all contributors of the Apache Flink community who
made this release possible!

Regards,
Gyula & Yang


Re: Flink Kubernetes Operator v1.0 ETA

2022-06-01 Thread Yang Wang
If everything goes well, I will close the VOTE for RC4 on Friday night,
which should run for more than 48 hours. And then finalize the release.

Best,
Yang

Gyula Fóra  于2022年6月1日周三 23:30写道:

> Hi Jeesmon!
>
> We are currently working through the release process. We are now in the
> middle of voting for RC4 (we have identified and fixed a number of blocker
> issues in earlier RCs).
>
> We are hopeful that this RC will be successful in which case you will have
> a release by the end of the week. If we hit any further blockers that might
> delay it 1-2 days, but I would say current ETA is end of this week.
>
> Cheers,
> Gyula
>
> On Wed, Jun 1, 2022 at 5:05 PM Jeesmon Jacob  wrote:
>
>> Hi there,
>>
>> Is there an ETA on v1.0 release of operator? We are prototyping with a CI
>> build from release-1.0 branch but would like to know the approximate ETA of
>> official 1.0 release so that we can plan accordingly.
>>
>> Thanks,
>> Jeesmon
>>
>


Re: multiple pipeline deployment using flink k8s operator

2022-06-01 Thread Yang Wang
The current application mode has the limitation that only one job can be
submitted when HA is enabled[1].
So a feasible solution is to use session mode[2], which will be supported
in the coming release 1.0.0.

However, I am afraid it still could not satisfy your requirement of "2 task
managers (one per job)", unless each TaskManager has only one slot.


[1].
https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/overview/#application-mode
[2].
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/overview/#flinksessionjob


Best,
Yang
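With session mode, the setup asked about below could be sketched as one session `FlinkDeployment` plus one `FlinkSessionJob` per job; the names and jar location are assumptions:

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: shared-session  # assumed name; no job spec => session cluster
spec:
  flinkConfiguration:
    taskmanager.numberOfTaskSlots: "1"  # one slot per TM -> roughly one TM per job
---
apiVersion: flink.apache.org/v1beta1
kind: FlinkSessionJob
metadata:
  name: job-a
spec:
  deploymentName: shared-session
  job:
    jarURI: https://example.com/job-a.jar  # assumed artifact location
    parallelism: 1
```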

Sigalit Eliazov  于2022年6月1日周三 14:54写道:

> Hi all,
> we just started using the flink k8s operator to deploy our flink cluster.
> From what we understand we are only able to start a flink cluster per job.
> So in our case when we have 2 jobs we have to create 2 different clusters.
> obviously we would prefer to deploy these 2 job which relate to the same
> use case in the same cluster with 1 job manager and 2 task managers (one
> per job)
>
> Is this possible via the operator?
> Did we miss something understanding the configuration?
>
> thanks
> Sigalit
>


Re: Deployment on k8s via API

2022-05-17 Thread Yang Wang
Maybe you could try the flink-kubernetes-operator[1]. It is
designed to manage Flink applications through Kubernetes CRDs.


[1].
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-0.1/

Best,
Yang
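With the operator, the CLI invocation quoted below translates roughly into a declarative resource like this; the image name and jar path mirror the example, while `flinkVersion`, the resources, and the service account are assumptions:

```yaml
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: my-first-application-cluster
spec:
  image: custom-image-name
  flinkVersion: v1_15    # assumed
  serviceAccount: flink  # assumed account with pod-management RBAC
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/usrlib/my-flink-job.jar
    parallelism: 1
    upgradeMode: stateless
```

Applying this with `kubectl apply -f` (or programmatically via any Kubernetes client) then replaces the `flink run-application` call.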

Devin Bost  于2022年5月18日周三 08:29写道:

> Hi,
>
> I'm looking at my options for automating the deployment of Flink jobs on
> k8s (ideally using application mode), and I noticed that most of the
> examples of deploying Flink jobs in the docs use calls to the Flink binary,
> such as:
>
> $ ./bin/flink run-application \--target kubernetes-application \
> -Dkubernetes.cluster-id=my-first-application-cluster \
> -Dkubernetes.container.image=custom-image-name \
> local:///opt/flink/usrlib/my-flink-job.jar
>
> However, my automation function won't be running in the same container as
> Flink, so I'm trying to determine what my options are here. Does Flink have
> an API available for submitting jobs?
> If not, how hard would it be to use the Kubernetes API to construct the
> deployment configs for new Flink applications? Is there a better way?
>
> Thanks,
>
> Devin G. Bost
>


Re: Running in application mode on YARN without fat jar

2022-05-16 Thread Yang Wang
The usrlib directory for YARN only works in Flink 1.15.0 and later versions.
Refer to the ticket[1] for more information.

[1]. https://issues.apache.org/jira/browse/FLINK-24897

Best,
Yang

Pavel Penkov  于2022年5月16日周一 22:59写道:

> I can't manage to run an application on YARN because of classpath issues.
> The Flink distribution is unpacked in $HOME/flink-1.14.4.
> $HOME/flink-1.14.4/usrlib contains all the dependency jars, excluding the
> main application jar, in a flat file structure.
> The application is started with
>
> ./bin/flink run-application -t yarn-application \
> -Dyarn.application.queue=production \
> hdfs:///tmp/flink-parts/ru.aliexpress.data.flink-parts-0.1.0-SNAPSHOT.jar
>
> And it still can't find required classes.
>


Re: Flink on Native K8s jobs turn in to `SUSPENDED` status unexpectedly.

2022-05-16 Thread Yang Wang
It will help a lot if you could share the logs of JobManager and
TaskManager for the unexpected `SUSPENDED` job.

Best,
Yang


Xiaolong Wang  于2022年5月16日周一 13:30写道:

> Sorry for the late reply.
>
> I checked the logs in both jobmanager & taskmanager.
>
> During that time, there were no more logs there.
>
> How can I reproduce the issue ?
>
> On Thu, May 12, 2022 at 10:35 AM Yang Wang  wrote:
>
>> The SUSPENDED state is usually caused by lost leadership. Maybe you could
>> find more information about leader in the JobManager and TaskManager logs.
>>
>> Best,
>> Yang
>>
>> Xiaolong Wang  于2022年5月11日周三 19:18写道:
>>
>>> Hello,
>>>
>>> Recently, our Flink jobs on native K8s have been entering the
>>> `SUSPENDED` status and getting restarted for no reason.
>>>
>>> Flink version: 1.13.2
>>>
>>> Logs:
>>> ```
>>> 2022-05-11 05:01:41
>>>
>>> 2022-05-10 21:01:41,771 INFO
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
>>> checkpoint 17921 (type=CHECKPOINT) @ 1652216501302 for job
>>> .\n
>>> 2022-05-11 05:01:43
>>>
>>> 2022-05-10 21:01:42,860 INFO
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
>>> checkpoint 17921 for job  (11840 bytes in
>>> 866 ms).\n
>>> 2022-05-11 05:04:34
>>>
>>> 2022-05-10 21:04:34,550 INFO
>>> org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Creating a
>>> new watch on TaskManager pods.\n
>>> 2022-05-11 05:06:43
>>>
>>> 2022-05-10 21:06:43,512 INFO
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
>>> checkpoint 17922 (type=CHECKPOINT) @ 1652216802860 for job
>>> .\n
>>> 2022-05-11 05:06:44
>>>
>>> 2022-05-10 21:06:44,441 INFO
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
>>> checkpoint 17922 for job  (11840 bytes in
>>> 977 ms).\n
>>> 2022-05-11 05:11:45
>>>
>>> 2022-05-10 21:11:44,826 INFO
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
>>> checkpoint 17923 (type=CHECKPOINT) @ 1652217104441 for job
>>> .\n
>>> 2022-05-11 05:11:45
>>>
>>> 2022-05-10 21:11:45,537 INFO
>>> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
>>> checkpoint 17923 for job  (11840 bytes in
>>> 646 ms).\n
>>> 2022-05-11 05:12:36
>>>
>>> 2022-05-10 21:12:36,746 INFO
>>> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess
>>> [] - Stopping SessionDispatcherLeaderProcess.\n
>>> 2022-05-11 05:12:36
>>>
>>> 2022-05-10 21:12:36,747 INFO
>>> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping
>>> dispatcher akka.tcp://flink@10.2.70.34:6123/user/rpc/dispatcher_1.\n
>>> 2022-05-11 05:12:36
>>>
>>> 2022-05-10 21:12:36,747 INFO
>>> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping all
>>> currently running jobs of dispatcher akka.tcp://
>>> flink@10.2.70.34:6123/user/rpc/dispatcher_1.\n
>>> 2022-05-11 05:12:36
>>>
>>> 2022-05-10 21:12:36,749 INFO
>>> org.apache.flink.runtime.jobmaster.JobMaster [] - Stopping the JobMaster
>>> for job
>>> insert-into_default_catalog.default_database.sn_fstore_location_cluster_raw_scylla_sink().\n
>>> 2022-05-11 05:12:36
>>>
>>> 2022-05-10 21:12:36,752 INFO
>>> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job
>>>  reached terminal state SUSPENDED.\n
>>> 2022-05-11 05:12:36
>>>
>>> 2022-05-10 21:12:36,752 INFO
>>> org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
>>> insert-xxx_sink () switched from state
>>> RUNNING to SUSPENDED.\n
>>> 2022-05-11 05:12:36
>>>
>>> org.apache.flink.util.FlinkException: Scheduler is being stopped.\n
>>> 2022-05-11 05:12:36
>>>
>>> at
>>> org.apache.flink.runtime.scheduler.SchedulerBase.closeAsync(SchedulerBase.java:607)
>>> ~[flink-dist_2.11-1.13.2.jar:1.13.2]\n
>>> 2022-05-11 05:12:36
>>>
>>> 

Re: Flink on Native K8s jobs turn in to `SUSPENDED` status unexpectedly.

2022-05-11 Thread Yang Wang
The SUSPENDED state is usually caused by lost leadership. Maybe you could
find more information about the leader election in the JobManager and TaskManager logs.

Best,
Yang

Xiaolong Wang  于2022年5月11日周三 19:18写道:

> Hello,
>
> Recently, our Flink jobs on native K8s have been entering the
> `SUSPENDED` status and getting restarted for no reason.
>
> Flink version: 1.13.2
>
> Logs:
> ```
> 2022-05-11 05:01:41
>
> 2022-05-10 21:01:41,771 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
> checkpoint 17921 (type=CHECKPOINT) @ 1652216501302 for job
> .\n
> 2022-05-11 05:01:43
>
> 2022-05-10 21:01:42,860 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
> checkpoint 17921 for job  (11840 bytes in
> 866 ms).\n
> 2022-05-11 05:04:34
>
> 2022-05-10 21:04:34,550 INFO
> org.apache.flink.kubernetes.KubernetesResourceManagerDriver [] - Creating a
> new watch on TaskManager pods.\n
> 2022-05-11 05:06:43
>
> 2022-05-10 21:06:43,512 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
> checkpoint 17922 (type=CHECKPOINT) @ 1652216802860 for job
> .
> 2022-05-10 21:06:44,441 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
> checkpoint 17922 for job  (11840 bytes in
> 977 ms).
> 2022-05-10 21:11:44,826 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering
> checkpoint 17923 (type=CHECKPOINT) @ 1652217104441 for job
> .
> 2022-05-10 21:11:45,537 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed
> checkpoint 17923 for job  (11840 bytes in
> 646 ms).
> 2022-05-10 21:12:36,746 INFO
> org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess
> [] - Stopping SessionDispatcherLeaderProcess.
> 2022-05-10 21:12:36,747 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping
> dispatcher akka.tcp://flink@10.2.70.34:6123/user/rpc/dispatcher_1.
> 2022-05-10 21:12:36,747 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Stopping all
> currently running jobs of dispatcher akka.tcp://
> flink@10.2.70.34:6123/user/rpc/dispatcher_1.
> 2022-05-10 21:12:36,749 INFO org.apache.flink.runtime.jobmaster.JobMaster
> [] - Stopping the JobMaster for job
> insert-into_default_catalog.default_database.sn_fstore_location_cluster_raw_scylla_sink().
> 2022-05-10 21:12:36,752 INFO
> org.apache.flink.runtime.dispatcher.StandaloneDispatcher [] - Job
>  reached terminal state SUSPENDED.
> 2022-05-10 21:12:36,752 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
> insert-xxx_sink () switched from state
> RUNNING to SUSPENDED.
> org.apache.flink.util.FlinkException: Scheduler is being stopped.
> at
> org.apache.flink.runtime.scheduler.SchedulerBase.closeAsync(SchedulerBase.java:607)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> at
> org.apache.flink.runtime.jobmaster.JobMaster.stopScheduling(JobMaster.java:962)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> at
> org.apache.flink.runtime.jobmaster.JobMaster.stopJobExecution(JobMaster.java:926)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> at org.apache.flink.runtime.jobmaster.JobMaster.onStop(JobMaster.java:398)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> at
> org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:563)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:186)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:26)
> [flink-dist_2.11-1.13.2.jar:1.13.2]
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:21)
> [flink-dist_2.11-1.13.2.jar:1.13.2]
> at scala.PartialFunction$class.applyOrElse(PartialFunction.scala:123)
> [flink-dist_2.11-1.13.2.jar:1.13.2]
> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:21)
> [flink-dist_2.11-1.13.2.jar:1.13.2]
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:170)
> [fli

Re: Flink Kubernetes operator not having a scale subresource

2022-05-06 Thread Yang Wang
Currently, the flink-kubernetes-operator is using Flink native K8s
integration[1], which means the Flink ResourceManager will dynamically allocate
TaskManagers on demand, so users do not need to specify the number of
TaskManager replicas.

Just like Gyula said, one possible solution to make "kubectl scale" work is
to change the parallelism of Flink job.

If the standalone mode[2] is introduced in the operator, then it is also
possible to directly change the replicas of TaskManager pods.


[1].
https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/native_kubernetes/
[2].
https://cwiki.apache.org/confluence/display/FLINK/FLIP-225%3A+Implement+standalone+mode+support+in+the+kubernetes+operator
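
As a rough illustration only (the field paths below are hypothetical, not the
operator's actual CRD schema), a scale subresource mapping "replicas" onto the
job parallelism could be declared along these lines:

```yaml
# Hypothetical sketch of a scale subresource on the FlinkDeployment CRD.
# Field paths are illustrative; the version schema is omitted for brevity.
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: flinkdeployments.flink.apache.org
spec:
  group: flink.apache.org
  names:
    kind: FlinkDeployment
    plural: flinkdeployments
  scope: Namespaced
  versions:
    - name: v1beta1
      served: true
      storage: true
      subresources:
        scale:
          specReplicasPath: .spec.job.parallelism
          statusReplicasPath: .status.jobStatus.parallelism
          labelSelectorPath: .status.labelSelector
```

With such a declaration, "kubectl scale" and the HorizontalPodAutoscaler would
read and write the parallelism field through the standard scale API.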

Best,
Yang

Gyula Fóra  于2022年5月7日周六 04:26写道:

> Hi Jay!
>
> Interesting question/proposal to add the scale-subresource.
>
> I am not an expert in this area, but we will look into it a little,
> give you some feedback, and see if we can incorporate something into the
> upcoming release if it makes sense.
>
> At a high level there is not a single replicas value for a
> FlinkDeployment that would be easy to map, but maybe we could use the
> parallelism value for this purpose for Application/Session jobs.
>
> Cheers,
> Gyula
>
> On Fri, May 6, 2022 at 8:04 PM Jay Ghiya  wrote:
>
>>  Hi Team,
>>
>>
>> I have been experimenting with the Flink Kubernetes operator. One of the
>> biggest misses for us is that it does not yet support the scale subresource
>> needed for reactive scaling. Without it, things become commercially very
>> difficult for products like ours that have widely varying load every hour.
>>
>>
>>
>> Can I get some direction on the same to contribute on
>> https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#scale-subresource
>>  for
>> our Kubernetes operator crd?
>>
>> I have been having a hard time reading -> 
>> *https://github.com/apache/flink-kubernetes-operator/blob/main/helm/flink-kubernetes-operator/crds/flinkdeployments.flink.apache.org-v1.yml
>> 
>>  to
>> figure out the replicas, status, and label-selector JSON paths of the task
>> manager. It may be due to my lack of knowledge, so a sense of direction will
>> help me.*
>>
>> *-Jay*
>> *GEHC*
>>
>


Re: Using the official flink operator and kubernetes secrets

2022-05-04 Thread Yang Wang
Thanks Meissner Dylan for the suggestion. I have created a ticket [1] to
track this requirement.


[1]. https://issues.apache.org/jira/browse/FLINK-27491

Best,
Yang




Francis Conroy  于2022年5月5日周四 06:06写道:

> Hi all,
>
> Thanks for looking into this. Yeah, I kept trying different variations of
> the replacement fields with no success. I'm trying to use the .getenv()
> technique now but our cluster is having problems and I haven't been able to
> reinstall the operator.
> I'll reply once it's all working.
>
> Thanks,
> Francis
>
> On Thu, 5 May 2022 at 03:23, Meissner, Dylan <
> dylan.t.meiss...@nordstrom.com> wrote:
>
>> Flink deployment resources support env interpolation natively using $()
>> syntax. I expected this to "just work" like other resources when using the
>> operator, but it does not.
>>
>>
>> https://kubernetes.io/docs/tasks/inject-data-application/_print/#use-environment-variables-to-define-arguments
>>
>> job:
>>   jarURI: local:///my.jar
>>   entryClass: my.JobMainKt
>>   args:
>> - "--kafka.bootstrap.servers"
>> - "my.kafka.host:9093"
>> - "--kafka.sasl.username"
>> - "$(KAFKA_SASL_USERNAME)"
>> - "--kafka.sasl.password"
>> - "$(KAFKA_SASL_PASSWORD)"
>> ​
>>
>> It would be a great addition, simplifying job startup decision-making
>> while following existing conventions.
>>
>> --
>> *From:* Yang Wang 
>> *Sent:* Tuesday, May 3, 2022 7:22 AM
>> *To:* Őrhidi Mátyás 
>> *Cc:* Francis Conroy ; user <
>> user@flink.apache.org>
>> *Subject:* Re: Using the official flink operator and kubernetes secrets
>>
>> Flink does not support environment variable replacement in the args. I think
>> you could access the env via "*System.getenv()*" in the user main method.
>> It should work since the user main method is executed on the JobManager
>> side.
>>
>> Best,
>> Yang
>>
>> Őrhidi Mátyás  于2022年4月28日周四 19:27写道:
>>
>> Also,
>>
>> just declaring it in the flink configs should be sufficient, no need to
>> define it in the pod templates:
>>
>> flinkConfiguration:
>> kubernetes.env.secretKeyRef: 
>> "env:DJANGO_TOKEN,secret:switchdin-django-token,key:token"
>>
>>
>> Best,
>> Matyas
>>
>> On Thu, Apr 28, 2022 at 1:17 PM Őrhidi Mátyás 
>> wrote:
>>
>> Hi Francis,
>>
>> I suggest accessing the environment variables directly, no need to pass
>> them as command arguments I guess.
>>
>> Best,
>> Matyas
>>
>> On Thu, Apr 28, 2022 at 11:31 AM Francis Conroy <
>> francis.con...@switchdin.com> wrote:
>>
>> Hi all,
>>
>> I'm trying to use a kubernetes secret as a command line argument in my
>> job and the text replacement doesn't seem to be happening. I've verified
>> passing the custom args via the command line on my local flink cluster but
>> can't seem to get the environment var replacement to work.
>>
>> apiVersion: flink.apache.org/v1alpha1
>> kind: FlinkDeployment
>> metadata:
>>   namespace: default
>>   name: http-over-mqtt
>> spec:
>>   image: flink:1.14.4-scala_2.12-java11
>>   flinkVersion: v1_14
>>   flinkConfiguration:
>> taskmanager.numberOfTaskSlots: "2"
>> kubernetes.env.secretKeyRef: 
>> "env:DJANGO_TOKEN,secret:switchdin-django-token,key:token"
>> #containerized.taskmanager.env.DJANGO_TOKEN: "$DJANGO_TOKEN"
>>   serviceAccount: flink
>>   jobManager:
>> replicas: 1
>> resource:
>>   memory: "1024m"
>>   cpu: 1
>>   taskManager:
>> resource:
>>   memory: "1024m"
>>   cpu: 1
>>   podTemplate:
>> spec:
>>   serviceAccount: flink
>>   containers:
>> - name: flink-main-container
>>   volumeMounts:
>> - mountPath: /flink-job
>>   name: flink-jobs
>>   env:
>> - name: DJANGO_TOKEN  # kubectl create secret generic 
>> switchdin-django-token --from-literal=token='[TOKEN]'
>>   valueFrom:
>> secretKeyRef:
>>   name: switchdin-django-token
>>   key: token
>>   optional: false
>>   initContainers:
>> - name: grab-mqtt-over-http-jar

Re: Using the official flink operator and kubernetes secrets

2022-05-03 Thread Yang Wang
Flink does not support environment variable replacement in the args. I think
you could access the env via "*System.getenv()*" in the user main method.
It should work since the user main method is executed on the JobManager
side.
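
To illustrate the fallback, here is a small self-contained sketch (the helper
is hypothetical, not Flink API): since Flink does not interpolate environment
variables in job args, the literal "$DJANGO_TOKEN" string arrives in main(),
so the job falls back to the environment (System.getenv() in a real job):

```java
import java.util.Map;

// Hypothetical helper: prefer a real CLI argument, but treat an empty value
// or an uninterpolated "$NAME" placeholder as missing and read the env map
// (System.getenv() in a real Flink job main method) instead.
public final class EnvFallback {
    static String resolve(String arg, Map<String, String> env, String key) {
        if (arg == null || arg.isEmpty() || arg.startsWith("$")) {
            return env.get(key);
        }
        return arg;
    }

    public static void main(String[] args) {
        Map<String, String> env = Map.of("DJANGO_TOKEN", "s3cret");
        // "$DJANGO_TOKEN" arrives literally, so the env value is used
        System.out.println(resolve("$DJANGO_TOKEN", env, "DJANGO_TOKEN"));
        // A real argument value is used as-is
        System.out.println(resolve("explicit-token", env, "DJANGO_TOKEN"));
    }
}
```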

Best,
Yang

Őrhidi Mátyás  于2022年4月28日周四 19:27写道:

> Also,
>
> just declaring it in the flink configs should be sufficient, no need to
> define it in the pod templates:
>
> flinkConfiguration:
> kubernetes.env.secretKeyRef: 
> "env:DJANGO_TOKEN,secret:switchdin-django-token,key:token"
>
>
> Best,
> Matyas
>
> On Thu, Apr 28, 2022 at 1:17 PM Őrhidi Mátyás 
> wrote:
>
>> Hi Francis,
>>
>> I suggest accessing the environment variables directly, no need to pass
>> them as command arguments I guess.
>>
>> Best,
>> Matyas
>>
>> On Thu, Apr 28, 2022 at 11:31 AM Francis Conroy <
>> francis.con...@switchdin.com> wrote:
>>
>>> Hi all,
>>>
>>> I'm trying to use a kubernetes secret as a command line argument in my
>>> job and the text replacement doesn't seem to be happening. I've verified
>>> passing the custom args via the command line on my local flink cluster but
>>> can't seem to get the environment var replacement to work.
>>>
>>> apiVersion: flink.apache.org/v1alpha1
>>> kind: FlinkDeployment
>>> metadata:
>>>   namespace: default
>>>   name: http-over-mqtt
>>> spec:
>>>   image: flink:1.14.4-scala_2.12-java11
>>>   flinkVersion: v1_14
>>>   flinkConfiguration:
>>> taskmanager.numberOfTaskSlots: "2"
>>> kubernetes.env.secretKeyRef: 
>>> "env:DJANGO_TOKEN,secret:switchdin-django-token,key:token"
>>> #containerized.taskmanager.env.DJANGO_TOKEN: "$DJANGO_TOKEN"
>>>   serviceAccount: flink
>>>   jobManager:
>>> replicas: 1
>>> resource:
>>>   memory: "1024m"
>>>   cpu: 1
>>>   taskManager:
>>> resource:
>>>   memory: "1024m"
>>>   cpu: 1
>>>   podTemplate:
>>> spec:
>>>   serviceAccount: flink
>>>   containers:
>>> - name: flink-main-container
>>>   volumeMounts:
>>> - mountPath: /flink-job
>>>   name: flink-jobs
>>>   env:
>>> - name: DJANGO_TOKEN  # kubectl create secret generic 
>>> switchdin-django-token --from-literal=token='[TOKEN]'
>>>   valueFrom:
>>> secretKeyRef:
>>>   name: switchdin-django-token
>>>   key: token
>>>   optional: false
>>>   initContainers:
>>> - name: grab-mqtt-over-http-jar
>>>   image: docker-push.k8s.local/test/switchdin/platform_flink:job-41
>>>   command: [ "/bin/sh", "-c",
>>>  "cp /opt/switchdin/* /tmp/job/." ]  # Copies the jar 
>>> in the init container to the flink-jobs volume
>>>   volumeMounts:
>>> - name: flink-jobs
>>>   mountPath: /tmp/job
>>>   volumes:
>>> - name: flink-jobs
>>>   emptyDir: { }
>>>   job:
>>> jarURI: local:///flink-job/switchdin-topologies-1.0-SNAPSHOT.jar
>>> args: ["--swit-django-token", "$DJANGO_TOKEN",
>>>"--swit-prod","false"]
>>> entryClass: org.switchdin.HTTPOverMQTT
>>> parallelism: 1
>>> upgradeMode: stateless
>>> state: running
>>>
>>> In the logs I can see:
>>>
>>> 2022-04-28 08:43:02,329 WARN org.switchdin.HTTPOverMQTT [] - ARGS ARE {}
>>> 2022-04-28 08:43:02,329 WARN org.switchdin.HTTPOverMQTT [] -
>>> --swit-django-token
>>> 2022-04-28 08:43:02,330 WARN org.switchdin.HTTPOverMQTT [] -
>>> $DJANGO_TOKEN
>>> 2022-04-28 08:43:02,330 WARN org.switchdin.HTTPOverMQTT [] - --swit-prod
>>> 2022-04-28 08:43:02,330 WARN org.switchdin.HTTPOverMQTT [] - false
>>>
>>> Anyone know how I can do this? I'm considering mounting it in a volume,
>>> but that seems like a lot of hassle for such a small thing.
>>>
>>> Thanks in advance!
>>>
>>>
>>> This email and any attachments are proprietary and confidential and are
>>> intended solely for the use of the individual to whom it is addressed. Any
>>> views or opinions expressed are solely those of the author and do not
>>> necessarily reflect or represent those of SwitchDin Pty Ltd. If you have
>>> received this email in error, please let us know immediately by reply email
>>> and delete it from your system. You may not use, disseminate, distribute or
>>> copy this message nor disclose its contents to anyone.
>>> SwitchDin Pty Ltd (ABN 29 154893857) PO Box 1165, Newcastle NSW 2300
>>> Australia
>>>
>>


Re: flink operator sometimes cannot start jobmanager after upgrading

2022-05-02 Thread Yang Wang
I am afraid we do not handle the scenario in which the JobManager deployment is
deleted externally.

Best,
Yang

Őrhidi Mátyás  于2022年5月2日周一 16:52写道:

> I filed a Jira for tracking this issue:
> https://issues.apache.org/jira/browse/FLINK-27468
>
> On Mon, May 2, 2022 at 10:31 AM Őrhidi Mátyás 
> wrote:
>
>> This can be reproduced simply by deleting the Kubernetes deployment. The
>> operator cannot recover from this state automatically; defining a
>> restartNonce on the deployment should recover the state.
>>
>> Regards,
>> Matyas
>>
>> On Mon, May 2, 2022 at 10:00 AM Márton Balassi 
>> wrote:
>>
>>> Hi ChangZhuo,
>>>
>>> Thanks for reporting this; I think I have just run into this myself too.
>>> I will try to reproduce it, but I do not fully comprehend it yet. If anyone
>>> has a way to reproduce it, that is more than welcome. :-)
>>>
>>> On Fri, Apr 29, 2022 at 12:16 PM ChangZhuo Chen (陳昌倬) 
>>> wrote:
>>>
 Hi,

 We found that flink operator [0] sometimes cannot start jobmanager after
 upgrading FlinkDeployment. We need to recreate FlinkDeployment to fix
 the problem. Anyone has this issue?

 The following is redacted log from flink operator. After status becomes
 MISSING, it keeps in MISSING status for at least 15 minutes.


 2022-04-29 09:41:15,141 o.a.f.c.d.a.c.ApplicationClusterDeployer
 [INFO ][namespace/flink-deployment-name] Submitting application in
 'Application Mode'.
 2022-04-29 09:41:15,145 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO
 ][namespace/flink-deployment-name] The derived from fraction jvm overhead
 memory (2.400gb (2576980416 bytes)) is greater than its max value
 1024.000mb (1073741824 bytes), max value will be used instead
 2022-04-29 09:41:15,146 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO
 ][namespace/flink-deployment-name] The derived from fraction jvm overhead
 memory (5.200gb (5583457568 bytes)) is greater than its max value
 1024.000mb (1073741824 bytes), max value will be used instead
 2022-04-29 09:41:15,146 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO
 ][namespace/flink-deployment-name] The derived from fraction network memory
 (5.050gb (5422396292 bytes)) is greater than its max value 4.000gb
 (4294967296 bytes), max value will be used instead
 2022-04-29 09:41:15,237 o.a.f.k.u.KubernetesUtils  [INFO
 ][namespace/flink-deployment-name] Kubernetes deployment requires a fixed
 port. Configuration high-availability.jobmanager.port will be set to 6123
 2022-04-29 09:41:15,508 o.a.f.k.KubernetesClusterDescriptor [WARN
 ][namespace/flink-deployment-name] Please note that Flink client
 operations(e.g. cancel, list, stop, savepoint, etc.) won't work from
 outside the Kubernetes cluster since 'kubernetes.rest-service.exposed.type'
 has been set to ClusterIP.
 2022-04-29 09:41:15,508 o.a.f.k.KubernetesClusterDescriptor [INFO
 ][namespace/flink-deployment-name] Create flink application cluster
 flink-deployment-name successfully, JobManager Web Interface:
 http://flink-deployment-name.namespace:8081
 2022-04-29 09:41:15,510 o.a.f.k.o.s.FlinkService   [INFO
 ][namespace/flink-deployment-name] Application cluster successfully 
 deployed
 2022-04-29 09:41:15,583 o.a.f.k.o.c.FlinkDeploymentController [INFO
 ][namespace/flink-deployment-name] Reconciliation successfully completed
 2022-04-29 09:41:15,684 o.a.f.k.o.c.FlinkDeploymentController [INFO
 ][namespace/flink-deployment-name] Starting reconciliation
 2022-04-29 09:41:15,686 o.a.f.k.o.o.JobObserver[INFO
 ][namespace/flink-deployment-name] Observing JobManager deployment.
 Previous status: DEPLOYING
 2022-04-29 09:41:15,792 o.a.f.k.o.o.JobObserver[INFO
 ][namespace/flink-deployment-name] JobManager is being deployed
 2022-04-29 09:41:15,792 o.a.f.k.o.c.FlinkDeploymentController [INFO
 ][namespace/flink-deployment-name] Reconciliation successfully completed
 2022-04-29 09:41:20,795 o.a.f.k.o.c.FlinkDeploymentController [INFO
 ][namespace/flink-deployment-name] Starting reconciliation
 2022-04-29 09:41:20,797 o.a.f.k.o.o.JobObserver[INFO
 ][namespace/flink-deployment-name] Observing JobManager deployment.
 Previous status: DEPLOYING
 2022-04-29 09:41:20,896 o.a.f.k.o.o.JobObserver[INFO
 ][namespace/flink-deployment-name] JobManager is being deployed
 2022-04-29 09:41:20,897 o.a.f.k.o.c.FlinkDeploymentController [INFO
 ][namespace/flink-deployment-name] Reconciliation successfully completed
 2022-04-29 09:41:25,899 o.a.f.k.o.c.FlinkDeploymentController [INFO
 ][namespace/flink-deployment-name] Starting reconciliation
 2022-04-29 09:41:25,901 o.a.f.k.o.o.JobObserver[INFO
 ][namespace/flink-deployment-name] Observing JobManager deployment.
 Previous status: DEPLOYING
 2022-04

Re: how to setup working dir in Flink operator

2022-04-25 Thread Yang Wang
Using the pod template to configure the local SSD (via host-path or local
PV) is the correct way.
After that, either "java.io.tmpdir" or "process.taskmanager.working-dir" in
CR should take effect.
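
As a rough sketch of that combination (all paths and volume names below are
illustrative), the CR could point Flink's temp dirs at a host-path mount:

```yaml
# Sketch: mount a local SSD via hostPath and point Flink's temp dirs at it.
# Paths are illustrative; adjust to the node's actual SSD mount point.
spec:
  flinkConfiguration:
    io.tmp.dirs: /mnt/local-ssd/tmp
  podTemplate:
    spec:
      containers:
        - name: flink-main-container
          volumeMounts:
            - name: local-ssd
              mountPath: /mnt/local-ssd
      volumes:
        - name: local-ssd
          hostPath:
            path: /mnt/disks/ssd0
            type: Directory
```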

Maybe you need to share the complete pod yaml and logs of failed
TaskManager.

nit: if the TaskManager pod crashed and was deleted too fast, you could
kill the JobManager first, then you will have enough time to get the logs
and yamls.

Best,
Yang

ChangZhuo Chen (陳昌倬)  于2022年4月25日周一 10:19写道:

> Hi,
>
> We try to migrate our application from `Flink on standalone Kubernetes`
> to `Application mode on Flink operator`. However, we cannot configure to
> use local SSD for RocksDB state successful. Any through?
>
>
> Detail:
>
> In original `Flink on standalone Kubernetes`:
> - set `io.tmp.dirs` to local SSD and Flink uses local SSD for its data.
>
> In new `Application mode on Flink operator`:
> - set `io.tmp.dirs` to local SSD causes a taskmanager crashloop. We are
>   still trying to get the exact error message since it disappears very
>   fast.
> - set `workingDir` in pod template does not work. Flink still uses /tmp
>   to store its data.
> - set `process.taskmanager.working-dir` does not work. Flink still uses
>   /tmp to store its data.
>
>
> --
> ChangZhuo Chen (陳昌倬) czchen@{czchen,debian}.org
> http://czchen.info/
> Key fingerprint = BA04 346D C2E1 FE63 C790  8793 CC65 B0CD EC27 5D5B
>


Re: JobManager doesn't bring up new TaskManager during failure recovery

2022-04-23 Thread Yang Wang
After more debugging, I think this issue is the same as FLINK-24315[1],
which is fixed in 1.13.3.

[1]. https://issues.apache.org/jira/browse/FLINK-24315

Best,
Yang

Zheng, Chenyu  于2022年4月22日周五 18:27写道:

> I created a JIRA ticket https://issues.apache.org/jira/browse/FLINK-27350
> to track this issue.
>
>
>
> BRs,
>
> Chenyu
>
>
>
> *From: *"Zheng, Chenyu" 
> *Date: *Friday, April 22, 2022 at 6:26 PM
> *To: *Yang Wang 
> *Cc: *"user@flink.apache.org" , "
> user...@flink.apache.org" 
> *Subject: *Re: JobManager doesn't bring up new TaskManager during failure
> recovery
>
>
>
> Thank you, Yang!
>
>
>
> In fact I have a fine-grained dashboard for Kubernetes cluster health
> (like apiserver qps/latency etc.), and I didn't find anything unusual…
> Also, the JobManager container cpu/memory usage is low.
>
>
>
> Besides, I did a deep dive into these logs and the Flink resource manager
> code, and found something interesting. I will use taskmanager-1-9 as an
> example:
>
>    1. I can see the log “Requesting new worker with resource spec
>    WorkerResourceSpec” at 2022-04-17 00:33:15,333. The code location is here
>    <https://github.com/apache/flink/blob/release-1.13.2/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/active/ActiveResourceManager.java#L283>
>    .
>    2. “Creating new TaskManager pod with name
>    stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9 and resource
>    <16384,4.0>” at 2022-04-17 00:33:15,376, code location
>    <https://github.com/apache/flink/blob/release-1.13.2/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/KubernetesResourceManagerDriver.java#L167>
>    .
>    3. “Pod stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9 is
>    created.” at 2022-04-17 00:33:15,412, code location
>    <https://github.com/apache/flink/blob/release-1.13.2/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/KubernetesResourceManagerDriver.java#L190>.
>    *The request was sent and the pod was created here, so I think the
>    apiserver was healthy at that moment.*
>    4. But I cannot find any logs printed from the lines at
>    <https://github.com/apache/flink/blob/release-1.13.2/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/active/ActiveResourceManager.java#L301>
>    and
>    <https://github.com/apache/flink/blob/release-1.13.2/flink-runtime/src/main/java/org/apache/flink/runtime/resourcemanager/active/ActiveResourceManager.java#L314>
>    .
>    5. “Discard registration from TaskExecutor
>    stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9” at 2022-04-17
>    00:33:32,39

Re: JobManager doesn't bring up new TaskManager during failure recovery

2022-04-22 Thread Yang Wang
The root cause might be that your APIServer is overloaded or not running
normally, so the pod events of
taskmanager-1-9 and taskmanager-1-10 are not delivered to the watch in the
FlinkResourceManager.
The two TaskManagers are therefore not recognized by the ResourceManager, and
their registrations are rejected.

The ResourceManager also did not receive the terminated pod events. That's
why it does not allocate new TaskManager pods.

All in all, I believe you need to check the K8s APIServer status.

Best,
Yang

Zheng, Chenyu  于2022年4月22日周五 12:54写道:

> Hi developers!
>
>
>
> I got a strange bug during failure recovery of Flink. It seems the
> JobManager doesn't bring up new TaskManager during failure recovery. Some
> logs and information of the Flink job are pasted below. Can you take a look
> and give me some guidance? Thank you so much!
>
>
>
> Flink version: 1.13.2
>
> Deploy mode: K8s native
>
> Timeline of the bug:
>
>    1. The Flink job started with 8 TaskManagers.
>    2. At *2022-04-17 00:28:15,286*, the job hit an error and the JobManager
>    decided to restart 2 tasks (pods
>    stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-1,
>    stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-7)
>    3. The two old pods were stopped and the JobManager created 2 new pods
>    (stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-9,
>    stream-1376a7c25e714f06b2ca818af964c45c-taskmanager-1-10) at *2022-04-17
>    00:33:15,376*
>    4. The JobManager discarded the two new pods’ registrations at *2022-04-17
>    00:33:32,393*
>    5. The new pods exited at *2022-04-17 00:33:32,396* due to the rejection
>    of their registration.
>    6. The JobManager didn’t bring up new pods and printed the error “Slot
>    request bulk is not fulfillable! Could not allocate the required slot
>    within slot request timeout” over and over
>
> Flink logs:
>
> 1.  JobManager:
> https://drive.google.com/file/d/1HuRQUFQrq9JIfrOzH9qBPCK1hMsyqFpJ/view?usp=sharing
>
> 2.  TaskManager:
> https://drive.google.com/file/d/1ReWR27VlXCkGCFN62__j0UpQlXV7Ensn/view?usp=sharing
>
>
>
>
>
> BRs,
>
> Chenyu
>


Re: Kubernetes killing TaskManager - Flink ignoring taskmanager.memory.process.size

2022-04-21 Thread Yang Wang
Could you please configure a bigger memory to avoid OOM and use
Native Memory Tracking (NMT)[1] to figure out the memory usage categories?

[1].
https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr007.html
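
For example (a sketch; verify the option names against your Flink and JVM
versions), NMT can be enabled on TaskManagers via the JVM options in
flink-conf.yaml and then inspected on the running process:

```yaml
# Sketch: enable JVM Native Memory Tracking on TaskManagers. NMT adds some
# overhead, so use it only while debugging. Inspect afterwards with:
#   jcmd <taskmanager-pid> VM.native_memory summary
env.java.opts.taskmanager: "-XX:NativeMemoryTracking=summary"
```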

Best,
Yang

Dan Hill  于2022年4月21日周四 07:42写道:

> Hi.
>
> I upgraded to Flink v1.14.4 and now my Flink TaskManagers are being killed
> by Kubernetes for exceeding the requested memory.  My Flink TM is using an
> extra ~5gb of memory over the tm.memory.process.size.
>
> Here are the flink-config values that I'm using
> taskmanager.memory.process.size: 25600mb
> # The default, 256mb, is too small.
> taskmanager.memory.jvm-metaspace.size: 320mb
> taskmanager.memory.network.fraction: 0.2
> taskmanager.memory.network.max: 2560m
>
> I'm requesting 26112Mi in my Kubernetes config (so there's some buffer).
>
> I re-read the Flink docs
> 
>  on
> setting memory.  This seems like it should be fine.  The diagrams and docs
> show that process.size is used.
>
> If it helps, the TMs are failing in a round robin once every ~30 minutes
> or so.  This isn't an issue with Flink v1.12.3 but is an issue with Flink
> v1.14.4.
>
> My text logs have a bunch of kafka connections in them.  I don't know if
> that's related to overallocating memory.
>
> ❯ kubectl -n flink-v1-14-4 get events
>
> LAST SEEN   TYPE  REASONOBJECT
>   MESSAGE
>
> 37m Warning   Evicted   pod/flink-taskmanager-3
>   The node was low on resource: memory. Container taskmanager was using
> 31457992Ki, which exceeds its request of 26112Mi.
>
> 37m NormalKilling   pod/flink-taskmanager-3
>   Stopping container taskmanager
>
> 37m NormalScheduled pod/flink-taskmanager-3
>   Successfully assigned
> hipcamp-prod-metrics-flink-v1-14-4/flink-taskmanager-3 to
> ip-10-12-104-15.ec2.internal
>
> 37m NormalPulledpod/flink-taskmanager-3
>   Container image "flink:1.14.4" already present on machine
>
> 37m NormalCreated   pod/flink-taskmanager-3
>   Created container taskmanager
>
> 37m NormalStarted   pod/flink-taskmanager-3
>   Started container taskmanager
>
> 37m NormalSuccessfulCreate  statefulset/flink-taskmanager
> create Pod flink-taskmanager-3 in StatefulSet flink-taskmanager successful
>
> 37m Warning   RecreatingFailedPod   statefulset/flink-taskmanager
> StatefulSet hipcamp-prod-metrics-flink-v1-14-4/flink-taskmanager is
> recreating failed Pod flink-taskmanager-3
>
> 37m NormalSuccessfulDelete  statefulset/flink-taskmanager
> delete Pod flink-taskmanager-3 in StatefulSet flink-taskmanager successful
>


Re: Enabling savepoints when deploying in Application Mode

2022-04-12 Thread Yang Wang
If you are trying to submit a job to an already-running application via
"flink run", it will not succeed, because this is the intended behavior
by design.

Please note that triggering a savepoint will also update the checkpoint
information in the HA ConfigMap, so deleting the deployment (with the HA
ConfigMap retained) and creating a new one could still work.
The job can recover from the savepoint without you needing to specify it
via the config options.
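
As a concrete illustration of the config-based restore that Gyula mentions
below, a flink-conf fragment might look like this (the savepoint path is
purely illustrative):

```yaml
# Sketch: restore the application from a savepoint via configuration
# instead of the -s CLI flag. The path below is illustrative.
execution.savepoint.path: s3://my-bucket/savepoints/savepoint-abc123
```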


Best,
Yang

Gyula Fóra  于2022年4月12日周二 13:19写道:

> Hi Lilli!
>
> I am not aware of any problems with savepoint restore in application mode.
> What you can try is to use the *execution.savepoint.path *configuration
> setting to control it instead of the CLI and see if it makes a difference
> for you.
>
> Otherwise, you could also check out the
> https://github.com/apache/flink-kubernetes-operator  (docs
> )
> which can help you manage your Flink Application Deployments in Kubernetes.
>
> Cheers,
> Gyula
>
> On Mon, Apr 11, 2022 at 8:09 PM Lilli Pearson 
> wrote:
>
>> Hi,
>>
>> Summary:
>> I've run into a number of issues trying to marry savepoints with running
>> Flink in Application Mode, and am wondering if anyone has suggestions on
>> how to resolve them, or if savepoints and Application Mode simply aren't
>> designed to work together.
>>
>> Context on app deployment:
>> For long-running processing of my Kafka streams, I'm running Flink 1.13.5
>> in application mode, using CI/CD to deploy the cluster to Kubernetes by
>> deleting and recreating the deployment. This approach has worked great with
>> checkpoints. However, since the savepoint Flink should start up with needs
>> to be specified on startup, this approach would need to change a bit.
>>
>> Details:
>> In experimenting with savepoints while running the app in Application
>> Mode, I've run into some issues that have made me suspect these two
>> features just don't work well together, at least in Flink 1.13, though I
>> can't find documentation that says so directly. (Maybe it's implied, as
>> considering how application mode is set up, it does seem reasonable to me
>> that savepoints wouldn't work.) For example:
>> * The entire /jars API (link:
>> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/ops/rest_api/#jars)
>> is simply unavailable on my cluster (404s), though the rest of the API
>> works fine. This means I can't use those endpoints to submit a jar to start
>> from
>> * using the CLI to run has been equally unsuccessful; when I run a
>> command like  `bin/flink run path/to/jar.jar -s path/to/savepoint`, it
>> fails and the root cause error is
>> `org.apache.flink.runtime.rest.util.RestClientException: [Not found.]`
>>
>>
>> Thanks in advance for any help or advice!
>>
>>


Re: Official Flink operator additional class paths

2022-04-07 Thread Yang Wang
It seems that you have a typo when specifying the pipeline classpath.
"file:///flink-jar/flink-connector-rabbitmq_2.12-1.14.4.jar" ->
"file:///flink-jars/flink-connector-rabbitmq_2.12-1.14.4.jar"

If this is not the root cause, maybe you could try downloading
the connector jars to /opt/flink/usrlib. The usrlib directory is loaded into
the user classloader automatically without any configuration.

BTW, I am not aware of any other bugs that would cause the pipeline classpath
not to take effect, except FLINK-21289[1].

[1]. https://issues.apache.org/jira/browse/FLINK-21289
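
A minimal sketch of the usrlib approach in a FlinkDeployment pod template
(image, URL, and volume names below are illustrative):

```yaml
# Sketch: fetch connector jars into /opt/flink/usrlib so Flink loads them into
# the user classloader automatically. Names and URLs are illustrative.
podTemplate:
  spec:
    containers:
      - name: flink-main-container
        volumeMounts:
          - name: usrlib
            mountPath: /opt/flink/usrlib
    initContainers:
      - name: fetch-connectors
        image: busybox
        command: [ "/bin/sh", "-c",
                   "wget -P /opt/flink/usrlib https://repo1.maven.org/maven2/org/apache/flink/flink-connector-rabbitmq_2.12/1.14.4/flink-connector-rabbitmq_2.12-1.14.4.jar" ]
        volumeMounts:
          - name: usrlib
            mountPath: /opt/flink/usrlib
    volumes:
      - name: usrlib
        emptyDir: { }
```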

Best,
Yang

Francis Conroy  于2022年4月7日周四 15:14写道:

> Hi all,
> thanks in advance for any tips.
>
> I've been trying to specify some additional classpaths in my kubernetes
> yaml file when using the official flink operator and nothing seems to work.
>
> I know the technique for getting my job jar works fine since it's finding
> the class ok, but I cannot get the RabbitMQ connector jar to load.
>
> apiVersion: flink.apache.org/v1alpha1
> kind: FlinkDeployment
> metadata:
>   namespace: default
>   name: http-over-mqtt
> spec:
>   image: flink:1.14.4-scala_2.12-java11
>   flinkVersion: v1_14
>   flinkConfiguration:
> taskmanager.numberOfTaskSlots: "2"
> pipeline.classpaths: 
> "file:///flink-jar/flink-connector-rabbitmq_2.12-1.14.4.jar"
>   serviceAccount: flink
>   jobManager:
> replicas: 1
> resource:
>   memory: "1024m"
>   cpu: 1
>   taskManager:
> resource:
>   memory: "1024m"
>   cpu: 1
>   podTemplate:
> spec:
>   serviceAccount: flink
>   containers:
> - name: flink-main-container
>   volumeMounts:
> - mountPath: /flink-job
>   name: flink-jobs
> - mountPath: /flink-jars
>   name: flink-jars
>   initContainers:
> - name: grab-mqtt-over-http-jar
>   image: busybox
>   command: [ '/bin/sh', '-c',
>  'cd /tmp/job; wget 
> https://jenkins/job/platform_flink/job/master/39/artifact/src-java/switchdin-topologies/target/switchdin-topologies-1.0-SNAPSHOT.jar
>  --no-check-certificate;',
>  'cd /tmp/jar; wget 
> https://repo1.maven.org/maven2/org/apache/flink/flink-connector-rabbitmq_2.12/1.14.4/flink-connector-rabbitmq_2.12-1.14.4.jar'
>  ]
>   volumeMounts:
> - name: flink-jobs
>   mountPath: /tmp/job
> - name: flink-jars
>   mountPath: /tmp/jar
>   volumes:
> - name: flink-jobs
>   emptyDir: { }
> - name: flink-jars
>   emptyDir: { }
>   job:
> jarURI: local:///flink-job/switchdin-topologies-1.0-SNAPSHOT.jar
> entryClass: org.switchdin.HTTPOverMQTT
> parallelism: 1
> upgradeMode: stateless
> state: running
>
> Any ideas? I've looked at the ConfigMaps that result and they also look
> fine.
> apiVersion: v1
> data:
>   flink-conf.yaml: "blob.server.port: 6124\nkubernetes.jobmanager.replicas:
> 1\njobmanager.rpc.address:
> http-over-mqtt.default\nkubernetes.taskmanager.cpu: 1.0\n
> kubernetes.service-account:
> flink\nkubernetes.cluster-id: http-over-mqtt\n
> $internal.application.program-args:
> \nkubernetes.container.image: flink:1.14.4-scala_2.12-java11\n
> parallelism.default:
> 1\nkubernetes.namespace: default\ntaskmanager.numberOfTaskSlots: 2\n
> kubernetes.rest-service.exposed.type:
> ClusterIP\n$internal.application.main: org.switchdin.HTTPOverMQTT\n
> taskmanager.memory.process.size:
> 1024m\nkubernetes.internal.jobmanager.entrypoint.class:
> org.apache.flink.kubernetes.entrypoint.KubernetesApplicationClusterEntrypoint
> \nkubernetes.pod-template-file:
> /tmp/podTemplate_11292791104169595925.yaml\n
> kubernetes.pod-template-file.taskmanager:
> /tmp/podTemplate_17362225267763549900.yaml\nexecution.target:
> kubernetes-application\njobmanager.memory.process.size:
> 1024m\njobmanager.rpc.port: 6123\ntaskmanager.rpc.port: 6122\n
> internal.cluster.execution-mode:
> NORMAL\nqueryable-state.proxy.ports: 6125\npipeline.jars:
> local:///flink-job/switchdin-topologies-1.0-SNAPSHOT.jar\n
> kubernetes.jobmanager.cpu:
> 1.0\npipeline.classpath:
> file:///flink-jars/flink-connector-rabbitmq_2.12-1.14.4.jar\n
> kubernetes.pod-template-file.jobmanager:
> /tmp/podTemplate_17029501154997462433.yaml\n"
>
>
>
> This email and any attachments are proprietary and confidential and are
> intended solely for the use of the individual to whom it is addressed. Any
> views or opinions expressed are solely those of the author and do not
> necessarily reflect or represent those of SwitchDin Pty Ltd. If you have
> received this email in error, please let us know immediately by reply email
> and delete it from your system. You may not use, disseminate, distribute or
> copy this message nor disclose its contents to anyone.
> SwitchDin Pty Ltd (ABN 29 154893857) PO Box 1165, Newcastle NSW 2300
> Australia
>


Re: The flink-kubernetes-operator vs third party flink operators

2022-04-05 Thread Yang Wang
Thanks for your interest in the flink-kubernetes-operator project. I believe
you could leave a comment on the ticket FLINK-27049.
If the reporter has not started working on this ticket, then you could be
assigned to it.

Best,
Yang

Hao t Chang  于2022年4月6日周三 06:30写道:

> Hi Gyula
>
>
>
> Thanks for the reply. I look forward to making contributions. I am
> assuming that if an issue, for example this one,
> has no assignee, then it’s probably a good candidate to work on?
>
>
>
> *From: *Gyula Fóra 
> *Date: *Saturday, April 2, 2022 at 2:19 AM
> *To: *Hao t Chang 
> *Cc: *"user@flink.apache.org" 
> *Subject: *[EXTERNAL] Re: The flink-kubernetes-operator vs third party
> flink operators
>
>
>
> Hi!
>
>
>
> The main difference at the moment is the programming language and the APIs
> used to interact with Flink.
>
>
>
> The flink-kubernetes-operator uses Java and interacts with Flink using
> the built-in (native) clients.
>
>
>
> The other operators have been around since earlier Flink versions. They
> all use Golang, and some of them have been already abandoned by the initial
> developers. In many cases they also do not support the latest Flink
> operational features.
>
>
>
> With the flink-kubernetes-operator project we aimed to take inspiration
> from the existing operators and create a project where Flink developers can
> easily contribute and could be maintained together with Flink itself while
> keeping up the high quality standards.
>
>
>
> We hope that developers of the other operators would start contributing
> soon :)
>
>
>
> Cheers,
>
> Gyula
>
>
>
>
>
>
>
> On Sat, 2 Apr 2022 at 11:01, Hao t Chang  wrote:
>
> Hi
>
>
>
> I started looking into Flink recently more specifically the
> flink-kubernetes-operator so I only know little about it. I found at least
> 3 other Flink K8s operators that Lyft, Google, and Spotify developed.
> Could someone please enlighten me what is the difference of these third
> party Flink k8s operators ? and why don’t these parties contribute to the
> same repo in the Flink community from the beginning ? Thanks.
>
>
>
> Ted
>
>


Re: flink cluster startup time

2022-03-30 Thread Yang Wang
@Gyula Fóra  is trying to prepare the preview
release (0.1) of flink-kubernetes-operator. It is now fully functional for
application mode.
You could give it a try and share more feedback with the community.

Release 1.0 aims to be production ready, and we are still missing some
important pieces (e.g. FlinkSessionJob, SQL jobs, observability improvements,
etc.).

Best,
Yang

Frank Dekervel  于2022年3月30日周三 23:40写道:

> Hello David,
>
> Thanks for the information! So the two main takeaways from your email are
> to
>
>- Move to something supporting application mode. Is
>https://github.com/apache/flink-kubernetes-operator already ready
>enough for production deployments ?
>- wait for flink 1.15
>
> thanks!
> Frank
>
>
> On Mon, Mar 28, 2022 at 9:16 AM David Morávek  wrote:
>
>> Hi Frank,
>>
>> I'm not really familiar with the internal workings of the Spotify's
>> operator, but here are few general notes:
>>
>> - You only need the JM process for the REST API to become available (TMs
>> can join in asynchronously). I'd personally aim for < 1m for this step; if
>> it takes longer, it could signal a problem with your infrastructure (eg.
>> images taking a long time to pull, incorrect setup of liveness / readiness
>> probes, not enough resources).
>>
>> The job is packaged as a fat jar, but it is already baked in the docker
>>> images we use (so technically there would be no need to "submit" it from a
>>> separate pod).
>>>
>>
>> That's where the application mode comes in. Please note that this might
>> be also one of the reasons for previous steps taking too long (as all pods
>> are pulling an image with your fat jar that might not be cached).
>>
>> Then the application needs to start up and load its state from the latest
>>> savepoint, which again takes a couple of minutes
>>>
>>
>> This really depends on the state size, state backend (eg. rocksdb restore
>> might take longer), object store throughput / rate limit. The
>> native-savepoint feature that will come out with 1.15 might help to shave
>> off some time here, as there is no conversion into the state backend
>> structures.
>>
>> Best,
>> D.
>>
>>
>>
>> On Fri, Mar 25, 2022 at 9:46 AM Frank Dekervel 
>> wrote:
>>
>>> Hello,
>>>
>>> We run flink using the spotify flink Kubernetes operator (job cluster
>>> mode). Everything works fine, including upgrades and crash recovery. We do
>>> not run the job manager in HA mode.
>>>
>>> One of the problems we have is that upon upgrades (or during testing),
>>> the startup time of the flink cluster takes a very long time:
>>>
>>>- First the operator needs to create the cluster (JM+TM), and wait
>>>for it to respond for api requests. This already takes a couple of 
>>> minutes.
>>>- Then the operator creates a job-submitter pod that submits the job
>>>to the cluster. The job is packaged as a fat jar, but it is already baked
>>>in the docker images we use (so technically there would be no need to
>>>"submit" it from a separate pod). The submission goes rather fast tho 
>>> (the
>>>time between the job submitter seeing the cluster is online and the 
>>> "hello"
>>>log from the main program is <1min)
>>>- Then the application needs to start up and load its state from the
>>>latest savepoint, which again takes a couple of minutes
>>>
>>> All steps take quite some time, and we are looking to reduce the startup
>>> time to allow for easier testing but also less downtime during upgrades. So
>>> i have some questions:
>>>
>>>- I wonder if the situation is the same for all kubernetes
>>>operators.  I really need some kind of operator because i otherwise i 
>>> have
>>>to set which savepoint to load from myself every startup.
>>>- What cluster startup time is considered to be acceptable / best
>>>practise ?
>>>- If there are other tricks to reduce startup time, i would be very
>>>interested in knowing them :-)
>>>
>>> There is also a discussion ongoing on running flink on spot nodes. I
>>> guess the startup time is relevant there too.
>>>
>>> Thanks already
>>> Frank
>>>
>>>
>>>
>>>
>>>
>>>
>
> --
> Frank Dekervel
> +32 473 94 34 21
> www.kapernikov.com
>


Re: "Native Kubernetes" sample in Flink documentation fails. JobManager Web Interface is wrongly generated. [Flink 1.14.4]

2022-03-28 Thread Yang Wang
By default, an idle TaskManager will be released after 30s (configured via
"resourcemanager.taskmanager-timeout").
If it could not be removed, you need to check the JobManager logs for the
root cause. Maybe it does not have enough permission or sth else.
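
For reference, a flink-conf.yaml sketch of that option — 30000 is the default, and the value is in milliseconds as far as I recall, so please double-check the configuration docs for your Flink version:

```yaml
# Session clusters release TaskManagers that have been idle longer than
# this timeout; raise it to keep TMs warm between job submissions.
resourcemanager.taskmanager-timeout: 30000
```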

Best,
Yang

Burcu Gul POLAT EGRI  于2022年3月29日周二 13:15写道:

> Thank you, I have tried the first suggestion and the sample job executed
> successfully (last executed command is like below).
>
>
>
> But I have another question. After executing the below command, a new task
> manager pod is created as expected but it is not removed automatically
> after the execution completed. Actually, for native kubernetes, I expect
> that the task manager pod should disappear after job completion.
>
> Do you have any comment for this? Are there any other configuration for
> task manager pod removal?
>
>
>
>
>
> ./bin/flink run --target kubernetes-session
> -Dkubernetes.service-account=flink-service-account
> -Dkubernetes.rest-service.exposed.type=NodePort
> -Dkubernetes.cluster-id=dproc-example-flink-cluster-id
> -Dkubernetes.namespace=sdt-dproc-flink-test
> -Dkubernetes.config.file=/home/devuser/.kube/config
> examples/batch/WordCount.jar
>
>
>
> Best regards,
>
> Burcu
>
>
>
> *From:* Yang Wang [mailto:danrtsey...@gmail.com]
> *Sent:* Saturday, March 26, 2022 7:48 AM
> *To:* Burcu Gul POLAT EGRI 
> *Cc:* user@flink.apache.org
> *Subject:* Re: "Native Kubernetes" sample in Flink documentation fails.
> JobManager Web Interface is wrongly generated. [Flink 1.14.4]
>
>
>
> The root cause might be the LoadBalancer could not really work in your
> environment. We already have a ticket to track this[1] and will try to get
> it resolved in the next release.
>
>
>
> For now, could you please have a try by adding
> "-Dkubernetes.rest-service.exposed.type=NodePort" to your session and
> submission commands?
>
>
>
> Maybe you are also interested in the new flink-kubernetes-operator
> project[2]. It should make it easier to run a Flink application on the K8s.
>
>
>
> [1]. https://issues.apache.org/jira/browse/FLINK-17231
>
> [2]. https://github.com/apache/flink-kubernetes-operator
>
>
>
> Best,
>
> Yang
>
>
>
> Burcu Gul POLAT EGRI  于2022年3月25日周五 21:39写道:
>
> I am getting the following error when I try to execute sample at Flink
> documentation - Native Kubernetes
> <https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/resource-providers/native_kubernetes/>
> .
>
> I have succeeded to execute the first command in the documentation by adding
> some extra parameters with the help of this post
> <https://cloudolife.com/2020/12/12/Cloud-Native/BIg-Data/Flink/Deploy-a-Apache-Flink-session-cluster-natively-on-Kubernetes-K8S/>
> .
>
> user@local:~/flink-1.14.4$ ./bin/kubernetes-session.sh \
>
> -Dkubernetes.cluster-id=dproc-example-flink-cluster-id \
>
> -Dtaskmanager.memory.process.size=4096m \
>
> -Dkubernetes.taskmanager.cpu=2 \
>
> -Dtaskmanager.numberOfTaskSlots=4 \
>
> -Dresourcemanager.taskmanager-timeout=360 \
>
> -Dkubernetes.namespace=sdt-dproc-flink-test \
>
> -Dkubernetes.config.file=/home/devuser/.kube/config \
>
> -Dkubernetes.jobmanager.service-account=flink-service-account
>
> After executing above command, I have listed the new pod like below.
>
> user@local:~/flink-1.14.4$ kubectl get pods
>
> NAME READY   STATUSRESTARTS   
> AGE
>
> dproc-example-flink-cluster-id-68c79bf67-mwh52   1/1 Running   0  
> 1m
>
> Then, I have executed the below command to submit example job.
>
> user@local:~/flink-1.14.4$ ./bin/flink run --target kubernetes-session \
>
> -Dkubernetes.service-account=flink-service-account \
>
> -Dkubernetes.cluster-id=dproc-example-flink-cluster-id \
>
> -Dkubernetes.namespace=sdt-dproc-flink-test \
>
> -Dkubernetes.config.file=/home/devuser/.kube/config
>
> examples/batch/WordCount.jar --input /home/user/sometexts.txt --output 
> /tmp/flinksample
>
> After a while, I received below logs:
>
> 2022-03-25 12:38:00,538 INFO  
> org.apache.flink.kubernetes.KubernetesClusterDescriptor  [] - Retrieve 
> flink cluster dproc-example-flink-cluster-id successfully, JobManager Web 
> Interface: http://10.150.140.248:8081
>
>
>
> 
>
>  The program finished with the following exception:
>
>
>
> org.apache.flink.client.program.ProgramInvocationException: The main method 
> caused an error: java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.

Re: JobManager failed to renew it's leadership (K8S HA)

2022-03-27 Thread Yang Wang
Could you please verify whether the JobManager is going through a long full
GC, and whether the Kubernetes APIServer was working well at that moment?

We are using the Kubernetes HA service in production and it seems stable;
we have not hit your issue.


Best,
Yang

marco andreas  于2022年3月27日周日 18:35写道:

>
> Hello,
>
> Does anyone have the same issue or have an idea why the jobmanager fails
> to renew its leadership when using kubernetes ha service.
>
> Configuration :
> kubernetes.namespace: flink-ps-flink-dev
> high-availability.kubernetes.leader-election.lease-duration: 200 s
> high-availability.kubernetes.leader-election.renew-deadline: 100 s
> high-availability.kubernetes.leader-election.retry-period: 15 s
>
> Attached is the log of the error.
>
> Best regards,
>


Re: Deploy a Flink session cluster natively on K8s with multi AZ

2022-03-27 Thread Yang Wang
> In the example, we can pass args in the command, is there a way to do it
by using the flink-conf.yaml?

Yes. All the changes in the $FLINK_HOME/conf/flink-conf.yaml at the client
side will also be picked up when deploying a native K8s cluster.
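
As an illustration, a client-side flink-conf.yaml fragment that a native K8s session deployment would pick up — the cluster id, namespace, and pod-template path are placeholders, and the topology spread constraints for multi-AZ would live inside that pod template file:

```yaml
kubernetes.cluster-id: my-session-cluster   # placeholder
kubernetes.namespace: flink                 # placeholder
kubernetes.taskmanager.cpu: 2
taskmanager.numberOfTaskSlots: 4
# Pod template (containing e.g. topologySpreadConstraints) applied to the
# JM/TM pods created by the native K8s integration:
kubernetes.pod-template-file: /path/to/pod-template.yaml
```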

For your use case, I am also suggesting the flink-kubernetes-operator. It
is a more k8s-native way.


Best,
Yang

Gyula Fóra  于2022年3月27日周日 15:13写道:

> Hi!
>
> I think the Flink Kubernetes Operator (
> https://github.com/apache/flink-kubernetes-operator) project is exactly
> what you are looking for.
>
> This is a relatively new addition to Flink that supports k8s application
> and session deployments with lifecycle management through kubernetes native
> tooling.
>
> Cheers,
> Gyula
>
> On Sun, 27 Mar 2022 at 08:50, Almog Rozencwajg <
> almog.rozencw...@niceactimize.com> wrote:
>
>> Hi,
>>
>>
>>
>> From the documentation, deploy a Flink session cluster natively on K8S is
>> by running a shell script.
>>
>>
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/resource-providers/native_kubernetes/#starting-a-flink-session-on-kubernetes
>>
>> In the example, we can pass args in the command, is there a way to do it
>> by using the flink-conf.yaml?
>> Is there another way to deploy a session cluster natively on K8S not only
>> by running a command?
>>
>>
>> We want to support multi availability zone. If we are deploying the
>> cluster in a standalone mode, we can configure the deployment of the job
>> manager and task manager and using  K8s pod topology spread constraints to
>> achieve it.
>> If we are working with native K8s mode, is there a way to do it?
>>
>>
>>
>> Thanks,
>>
>> Almog
>>
>>
>>
>>
>> Confidentiality: This communication and any attachments are intended for
>> the above-named persons only and may be confidential and/or legally
>> privileged. Any opinions expressed in this communication are not
>> necessarily those of NICE Actimize. If this communication has come to you
>> in error you must take no action based on it, nor must you copy or show it
>> to anyone; please delete/destroy and inform the sender by e-mail
>> immediately.
>> Monitoring: NICE Actimize may monitor incoming and outgoing e-mails.
>> Viruses: Although we have taken steps toward ensuring that this e-mail
>> and attachments are free from any virus, we advise that in keeping with
>> good computing practice the recipient should ensure they are actually virus
>> free.
>>
>


Re: flink-kubernetes-operator: Flink deployment stuck in scheduled state when increasing resource CPU above 1

2022-03-25 Thread Yang Wang
Could you please share the result of "kubectl describe pods" when it gets
stuck? It will be very useful for figuring out the root cause.

I guess it might be related to insufficient resources for minikube.


Best,
Yang

Őrhidi Mátyás  于2022年3月26日周六 03:12写道:

> It's worth checking the deployment->replica set->pod chain for error
> message.
>
> On Fri, Mar 25, 2022, 19:49 Makas Tzavellas 
> wrote:
>
>> Hi,
>>
>> I have been experimenting with flink-kubernetes-operator today and it's
>> very cool. However, the Flink application will be stuck in scheduled state
>> if I increase the CPU to anything above 1 for JobManager and TaskManager.
>> It works fine if I keep the CPU to 1.
>>
>> I am testing the Flink deployment with Minikube running with 4 CPUs and
>> 8GB RAM.
>>
>> Unfortunately, I am not sure what to look out for to figure out why it's
>> stuck.
>>
>> Appreciate it if someone could help point me in the right direction.
>>
>> Thanks!
>>
>


Re: "Native Kubernetes" sample in Flink documentation fails. JobManager Web Interface is wrongly generated. [Flink 1.14.4]

2022-03-25 Thread Yang Wang
The root cause might be the LoadBalancer could not really work in your
environment. We already have a ticket to track this[1] and will try to get
it resolved in the next release.

For now, could you please have a try by adding
"-Dkubernetes.rest-service.exposed.type=NodePort" to your session and
submission commands?

Maybe you are also interested in the new flink-kubernetes-operator
project[2]. It should make it easier to run a Flink application on the K8s.

[1]. https://issues.apache.org/jira/browse/FLINK-17231
[2]. https://github.com/apache/flink-kubernetes-operator
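
For concreteness, a sketch of the amended session and submission commands with the NodePort flag added (the other -D options from your original commands are omitted here for brevity):

```bash
# Session start, exposing the REST service via NodePort instead of LoadBalancer:
./bin/kubernetes-session.sh \
  -Dkubernetes.cluster-id=dproc-example-flink-cluster-id \
  -Dkubernetes.namespace=sdt-dproc-flink-test \
  -Dkubernetes.rest-service.exposed.type=NodePort

# Job submission against that session, with the same flag:
./bin/flink run --target kubernetes-session \
  -Dkubernetes.cluster-id=dproc-example-flink-cluster-id \
  -Dkubernetes.namespace=sdt-dproc-flink-test \
  -Dkubernetes.rest-service.exposed.type=NodePort \
  examples/batch/WordCount.jar
```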

Best,
Yang

Burcu Gul POLAT EGRI  于2022年3月25日周五 21:39写道:

> I am getting the following error when I try to execute sample at Flink
> documentation - Native Kubernetes
> <https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/resource-providers/native_kubernetes/>
> .
>
> I have succeeded to execute the first command in the documentation by adding
> some extra parameters with the help of this post
> <https://cloudolife.com/2020/12/12/Cloud-Native/BIg-Data/Flink/Deploy-a-Apache-Flink-session-cluster-natively-on-Kubernetes-K8S/>
> .
>
> user@local:~/flink-1.14.4$ ./bin/kubernetes-session.sh \
>
> -Dkubernetes.cluster-id=dproc-example-flink-cluster-id \
>
> -Dtaskmanager.memory.process.size=4096m \
>
> -Dkubernetes.taskmanager.cpu=2 \
>
> -Dtaskmanager.numberOfTaskSlots=4 \
>
> -Dresourcemanager.taskmanager-timeout=360 \
>
> -Dkubernetes.namespace=sdt-dproc-flink-test \
>
> -Dkubernetes.config.file=/home/devuser/.kube/config \
>
> -Dkubernetes.jobmanager.service-account=flink-service-account
>
> After executing above command, I have listed the new pod like below.
>
> user@local:~/flink-1.14.4$ kubectl get pods
>
> NAME READY   STATUSRESTARTS   
> AGE
>
> dproc-example-flink-cluster-id-68c79bf67-mwh52   1/1 Running   0  
> 1m
>
> Then, I have executed the below command to submit example job.
>
> user@local:~/flink-1.14.4$ ./bin/flink run --target kubernetes-session \
>
> -Dkubernetes.service-account=flink-service-account \
>
> -Dkubernetes.cluster-id=dproc-example-flink-cluster-id \
>
> -Dkubernetes.namespace=sdt-dproc-flink-test \
>
> -Dkubernetes.config.file=/home/devuser/.kube/config
>
> examples/batch/WordCount.jar --input /home/user/sometexts.txt --output 
> /tmp/flinksample
>
> After a while, I received below logs:
>
> 2022-03-25 12:38:00,538 INFO  
> org.apache.flink.kubernetes.KubernetesClusterDescriptor  [] - Retrieve 
> flink cluster dproc-example-flink-cluster-id successfully, JobManager Web 
> Interface: http://10.150.140.248:8081
>
>
>
> 
>
>  The program finished with the following exception:
>
>
>
> org.apache.flink.client.program.ProgramInvocationException: The main method 
> caused an error: java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.client.JobSubmissionException: Failed to submit 
> JobGraph.
>
> at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372)
>
> at 
> org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222)
>
> at 
> org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114)
>
> at 
> org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:812)
>
> at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:246)
>
> at 
> org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1054)
>
> at 
> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
>
> at 
> org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
>
> at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
>
> Caused by: java.lang.RuntimeException: 
> java.util.concurrent.ExecutionException: 
> org.apache.flink.runtime.client.JobSubmissionException: Failed to submit 
> JobGraph.
>
> at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:316)
>
> at 
> org.apache.flink.api.java.ExecutionEnvironment.executeAsync(ExecutionEnvironment.java:1061)
>
> at 
> org.apache.flink.client.program.ContextEnvironment.executeAsync(ContextEnvironment.java:131)
>
> at 
> org.apache.flink.client.program.ContextEnvironment.execute(ContextEnvironment.java:70)
>
> at 
> org.apache.flink.examples.java.wordcount.WordCount.main(WordCount.java:93)
>
> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
> Method)
>
> at 
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>
> at 
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>
> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
>
> at 
> org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:

Re: Kubernetes HA on an application cluster

2022-03-21 Thread Yang Wang
This log means the Flink internal leader elector failed to renew the leader
ConfigMap to keep its leadership. It might be caused by a network issue, a
long full GC, or an internal error in the K8s APIServer.

This blog[1] could help you to know how the Kubernetes HA works.

[1]. https://flink.apache.org/2021/02/10/native-k8s-with-ha.html
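
For reference, a minimal flink-conf.yaml sketch enabling Kubernetes HA — the storage directory and cluster id are placeholders you would point at your own durable storage and deployment:

```yaml
high-availability: org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
# Durable storage for job graphs and checkpoint pointers (placeholder path):
high-availability.storageDir: s3://my-bucket/flink/ha
kubernetes.cluster-id: my-flink-cluster   # placeholder
```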

Best,
Yang

marco andreas  于2022年3月22日周二 03:41写道:

> Hello everyone,
>
> I am deploying a flink application cluster using k8S HA .
>
> I notice this message in the log
>
> @timestamp":"2022-03-21T17:11:39.436+01:00","@version":"1","message":"Renew
> deadline reached after 200 seconds while renewing lock ConfigMapLock:
> flink-pushavoo-flink-rec-elifibre--jobmanager-leader
> (58de99f0-67dd-4a5c-9850-b26f9cb8e759)","logger_name":"io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector","thread_name":"pool-100902-thread-1","level":"DEBUG","level_value":1
>
>
> Can anyone explain what is the cause of this and how to prevent it,By the
> way, any useful documentation about the flink HA feature on k8S would be
> appreciated.
>
> flink version : 1.13.5
>
> Thanks,
>
>


Re: Submit job to a session cluster on Kubernetes via REST API

2022-03-06 Thread Yang Wang
If you want to use the RestClusterClient to do the job submission and
lifecycle management, the implementation in the
flink-kubernetes-operator[1] project may give you some insights.

You could also use /jars/:jarid/run[2] to run a Flink job. It is a pure
HTTP interface.


[1].
https://github.com/apache/flink-kubernetes-operator/blob/main/flink-kubernetes-operator/src/main/java/org/apache/flink/kubernetes/operator/service/FlinkService.java#L126
[2].
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/rest_api/#jars-jarid-run
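
A sketch of that pure-HTTP flow with curl — the host, port, jar path, jar id, and entry class are placeholders; the actual jar id comes back in the JSON response of the upload call:

```bash
# 1) Upload the fat jar; the JSON response contains the jar id ("filename").
curl -X POST -H "Expect:" \
  -F "jarfile=@/path/to/my-job.jar" \
  http://jobmanager-host:8081/jars/upload

# 2) Run the uploaded jar using the id from step 1.
curl -X POST \
  "http://jobmanager-host:8081/jars/my-jar-id/run?entry-class=org.example.MyJob"
```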

Best,
Yang

Almog Rozencwajg  于2022年3月6日周日 15:29写道:

> Hi,
>
>
>
> We deploy a Flink session cluster on Kubernetes.
>
> We want to submit jobs via java application using the REST API
> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/ops/rest_api/
> .
>
> I'm trying to use the RestClusterClient which comes with the flink-clients
> module but I couldn't find any examples in the documentation.
>
> What is the correct way to submit a job to a native Kubernetes session
> cluster using the REST API, work with the RestClusterClient or can we use
> any other REST client?
>
> Is there an example of how to work with the RestClusterClient?
>
>
>
> Thanks,
>
> Almog
>
>
>


Re: [Flink-1.14.3] Restart of pod due to duplicatejob submission

2022-02-24 Thread Yang Wang
This might be related with FLINK-21928 and seems already fixed in 1.14.0.
But it will have some limitations and users need to manually clean up the
HA entries.


Best,
Yang

Parag Somani  于2022年2月24日周四 13:42写道:

> Hello,
>
> Recently due to log4j vulnerabilities, we have upgraded to Apache Flink
> 1.14.3. What we observed we are getting following exception, and because of
> it pod gets in crashloopback. We have seen this issues esp. during the time
> of upgrade or deployment time when existing pod is already running.
>
> What would it be causing this issue during deployment time? Any assistance
> as a workaround would be much appreciated.
>
> Also, i am seeing this issue only after upgrade from 1.14.2 to 1.14.3 .
>
> Env:
> Deployed on : k8s
> Flink version: 1.14.3
> HA using zookeeper
>
> Logs:
> 2022-02-23 05:13:14.555 ERROR 45 --- [t-dispatcher-17]
> c.b.a.his.service.FlinkExecutorService   : Failed to execute job
>
> org.apache.flink.util.FlinkException: Failed to execute job 'events rates
> calculation'.
> at
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2056)
> ~[flink-streaming-java_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.client.program.StreamContextEnvironment.executeAsync(StreamContextEnvironment.java:137)
> ~[flink-clients_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:76)
> ~[flink-clients_2.12-1.14.0.jar:1.14.0]
> at
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1917)
> ~[flink-streaming-java_2.12-1.14.0.jar:1.14.0]
> at
> com.bmc.ade.his.service.FlinkExecutorService.init(FlinkExecutorService.java:37)
> ~[health-service-1.0.00.jar:1.0.00]
> at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method) ~[na:na]
> at
> java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> ~[na:na]
> at
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> ~[na:na]
> at java.base/java.lang.reflect.Method.invoke(Method.java:566)
> ~[na:na]
> at
> org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleElement.invoke(InitDestroyAnnotationBeanPostProcessor.java:389)
> ~[spring-beans-5.3.4.jar:5.3.4]
> at
> org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor$LifecycleMetadata.invokeInitMethods(InitDestroyAnnotationBeanPostProcessor.java:333)
> ~[spring-beans-5.3.4.jar:5.3.4]
> at
> org.springframework.beans.factory.annotation.InitDestroyAnnotationBeanPostProcessor.postProcessBeforeInitialization(InitDestroyAnnotationBeanPostProcessor.java:157)
> ~[spring-beans-5.3.4.jar:5.3.4]
> at
> org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.applyBeanPostProcessorsBeforeInitialization(AbstractAutowireCapableBeanFactory.java:422)
> ~[spring-beans-5.3.4.jar:5.3.4]
> at
> org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.initializeBean(AbstractAutowireCapableBeanFactory.java:1778)
> ~[spring-beans-5.3.4.jar:5.3.4]
> at
> org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.doCreateBean(AbstractAutowireCapableBeanFactory.java:602)
> ~[spring-beans-5.3.4.jar:5.3.4]
> at
> org.springframework.beans.factory.support.AbstractAutowireCapableBeanFactory.createBean(AbstractAutowireCapableBeanFactory.java:524)
> ~[spring-beans-5.3.4.jar:5.3.4]
> at
> org.springframework.beans.factory.support.AbstractBeanFactory.lambda$doGetBean$0(AbstractBeanFactory.java:335)
> ~[spring-beans-5.3.4.jar:5.3.4]
> at
> org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.getSingleton(DefaultSingletonBeanRegistry.java:234)
> ~[spring-beans-5.3.4.jar:5.3.4]
> at
> org.springframework.beans.factory.support.AbstractBeanFactory.doGetBean(AbstractBeanFactory.java:333)
> ~[spring-beans-5.3.4.jar:5.3.4]
> at
> org.springframework.beans.factory.support.AbstractBeanFactory.getBean(AbstractBeanFactory.java:208)
> ~[spring-beans-5.3.4.jar:5.3.4]
> at
> org.springframework.beans.factory.support.DefaultListableBeanFactory.preInstantiateSingletons(DefaultListableBeanFactory.java:944)
> ~[spring-beans-5.3.4.jar:5.3.4]
> at
> org.springframework.context.support.AbstractApplicationContext.finishBeanFactoryInitialization(AbstractApplicationContext.java:917)
> ~[spring-context-5.3.4.jar:5.3.4]
> at
> org.springframework.context.support.AbstractApplicationContext.refresh(AbstractApplicationContext.java:582)
> ~[spring-context-5.3.4.jar:5.3.4]
> at
> org.springframework.boot.SpringApplication.refresh(SpringApplication.java:754)
> ~[spring-boot-2.5.5.jar:2.5.5]
> at
> org.springfram

Re: No effect from --allowNonRestoredState or "execution.savepoint.ignore-unclaimed-state" in K8S application mode

2022-02-22 Thread Yang Wang
The config options configured via -D params should take effect, and that is
also the recommended way, rather than the CLI options (e.g. --fromSavepoint).
This is not limited to the K8s application mode; it also does not work for
the YARN application and YARN per-job modes.
I believe it is indeed a bug in the current implementation, and I have
created a ticket for it[1].

After that, you could start the Flink K8s application via the following
command:

$FLINK_HOME/bin/flink run-application -t kubernetes-application \
    -Dkubernetes.cluster-id=$CLUSTER_ID \
    -Dkubernetes.namespace=$NAMESPACE \
    -Dkubernetes.container.image=$IMAGE \
    -Dexecution.savepoint.ignore-unclaimed-state=true \
    -Dexecution.savepoint.path=oss://flink-debug-yiqi/flink-ha \
    local:///opt/flink/examples/streaming/StateMachineExample.jar


If you still want to use the CLI options, then I expect you will at least
need to set "--fromSavepoint" as well.

[1]. https://issues.apache.org/jira/browse/FLINK-26316


Best,
Yang

Andrey Bulgakov  wrote on Wed, Feb 23, 2022 at 04:09:

> Thank you, Yang. That was it! Specifying "--fromSavepoint" and
> "--allowNonRestoredState" for "run-application" together did the trick.
>
> I was a bit confused, because when you run "flink run-application --help",
> it only tells you about the "--executor" and "--target" options. So I
> assumed I should pass everything else as -D params. I had only tried
> passing "--allowNonRestoredState" on the CLI as the last resort but didn't
> think to do it together with "--fromSavepoint".
>
> Thanks again!
>
> On Sun, Feb 20, 2022 at 9:49 PM Yang Wang  wrote:
>
>> By design, we should support arbitrary config keys via the CLI when using
>> generic CLI mode.
>>
>> Have you also specified "--fromSavepoint" along with
>> "--allowNonRestoredState" when submitting a Flink job via "flink
>> run-application"?
>>
>> From the current code base, it seems that the CLI options (e.g.
>> --fromSavepoint, --allowNonRestoredState) have higher priority than the
>> Flink config options, and this causes the savepoint-related config options
>> to be overwritten wrongly. Refer to the implementation[1].
>>
>> [1].
>> https://github.com/apache/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/cli/ProgramOptions.java#L181
>>
>>
>> Best,
>> Yang
>>
Andrey Bulgakov  wrote on Sat, Feb 19, 2022 at 08:30:
>>
>>> Hi Austin,
>>>
>>> Thanks for the reply! Yeah, the docs aren't super explicit about this.
>>>
>>> But for what it's worth, I'm setting a few options unrelated to
>>> kubernetes this way and they all have effect:
>>> -Dstate.checkpoints.num-retained=100 \
>>>
>>> -Dfs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider
>>> \
>>> -Dio.tmp.dirs=/data/flink-local-data \
>>>     -Dqueryable-state.enable=true \
>>>
>>> The only one i'm having problems with is
>>> "execution.savepoint.ignore-unclaimed-state".
>>>
>>> On Fri, Feb 18, 2022 at 3:42 PM Austin Cawley-Edwards <
>>> austin.caw...@gmail.com> wrote:
>>>
>>>> Hi Andrey,
>>>>
>>>> It's unclear to me from the docs[1] if the flink native-kubernetes
>>>> integration supports setting arbitrary config keys via the CLI. I'm cc'ing
>>>> Yang Wang, who has worked a lot in this area and can hopefully help us out.
>>>>
>>>> Best,
>>>> Austin
>>>>
>>>> [1]:
>>>> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/resource-providers/native_kubernetes/#configuring-flink-on-kubernetes
>>>>
>>>> On Fri, Feb 18, 2022 at 5:14 PM Andrey Bulgakov 
>>>> wrote:
>>>>
>>>>> Hey all,
>>>>>
>>>>> I'm working on migrating our Flink job away from Hadoop session mode
>>>>> to K8S application mode.
>>>>> It's been going great so far but I'm hitting a wall with this
>>>>> seemingly simple thing.
>>>>>
>>>>> In the first phase of the migration I want to remove some operators
>>>>> (their state can be discarded) and focus on getting the primary pipeline
>>>>> running first.
>>>>> For that I have to start the cluster from a savepoint with the
>>>>> "allowNonRestoredState" parameter turned on.
>>>>>
>>>>> The problem is that I can't 

Re: No effect from --allowNonRestoredState or "execution.savepoint.ignore-unclaimed-state" in K8S application mode

2022-02-20 Thread Yang Wang
By design, we should support arbitrary config keys via the CLI when using
generic CLI mode.

Have you also specified "--fromSavepoint" along with
"--allowNonRestoredState" when submitting a Flink job via "flink
run-application"?

From the current code base, it seems that the CLI options (e.g.
--fromSavepoint, --allowNonRestoredState) have higher priority than the
Flink config options, and this causes the savepoint-related config options
to be overwritten wrongly. Refer to the implementation[1].

[1].
https://github.com/apache/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/cli/ProgramOptions.java#L181
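The priority inversion described above can be sketched in a few lines of Python. The option key is real; the two functions are purely illustrative and are not Flink's actual implementation:

```python
# Illustrative sketch (not Flink code) of the FLINK-26316 behaviour: the CLI
# layer writes its boolean default into the effective configuration even when
# the user never passed --allowNonRestoredState, clobbering the -D setting.
KEY = "execution.savepoint.ignore-unclaimed-state"

def effective_config_buggy(flink_conf, cli_flag=False):
    conf = dict(flink_conf)
    conf[KEY] = cli_flag          # bug: unconditional overwrite
    return conf

def effective_config_fixed(flink_conf, cli_flag=None):
    conf = dict(flink_conf)
    if cli_flag is not None:      # fix: only an explicitly passed flag wins
        conf[KEY] = cli_flag
    return conf

conf = {KEY: True}                # set via -Dexecution.savepoint.ignore-unclaimed-state=true
print(effective_config_buggy(conf)[KEY])   # False - the -D value is lost
print(effective_config_fixed(conf)[KEY])   # True  - the -D value survives
```

This is why passing both savepoint settings as CLI options works today: the CLI layer then writes the values the user actually intended.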


Best,
Yang

Andrey Bulgakov  wrote on Sat, Feb 19, 2022 at 08:30:

> Hi Austin,
>
> Thanks for the reply! Yeah, the docs aren't super explicit about this.
>
> But for what it's worth, I'm setting a few options unrelated to kubernetes
> this way and they all have effect:
> -Dstate.checkpoints.num-retained=100 \
>
> -Dfs.s3a.aws.credentials.provider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider
> \
> -Dio.tmp.dirs=/data/flink-local-data \
> -Dqueryable-state.enable=true \
>
> The only one i'm having problems with is
> "execution.savepoint.ignore-unclaimed-state".
>
> On Fri, Feb 18, 2022 at 3:42 PM Austin Cawley-Edwards <
> austin.caw...@gmail.com> wrote:
>
>> Hi Andrey,
>>
>> It's unclear to me from the docs[1] if the flink native-kubernetes
>> integration supports setting arbitrary config keys via the CLI. I'm cc'ing
>> Yang Wang, who has worked a lot in this area and can hopefully help us out.
>>
>> Best,
>> Austin
>>
>> [1]:
>> https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/resource-providers/native_kubernetes/#configuring-flink-on-kubernetes
>>
>> On Fri, Feb 18, 2022 at 5:14 PM Andrey Bulgakov  wrote:
>>
>>> Hey all,
>>>
>>> I'm working on migrating our Flink job away from Hadoop session mode to
>>> K8S application mode.
>>> It's been going great so far but I'm hitting a wall with this seemingly
>>> simple thing.
>>>
>>> In the first phase of the migration I want to remove some operators
>>> (their state can be discarded) and focus on getting the primary pipeline
>>> running first.
>>> For that I have to start the cluster from a savepoint with the
>>> "allowNonRestoredState" parameter turned on.
>>>
>>> The problem is that I can't set it in any way that I'm aware of. I tried
>>> 4 ways separately and simultaneously:
>>>
>>> 1) Adding --allowNonRestoredState to flink run-application
>>> -t kubernetes-application
>>> 2) Adding -Dexecution.savepoint.ignore-unclaimed-state=true to flink
>>> run-application -t kubernetes-application
>>> 3) Adding "execution.savepoint.ignore-unclaimed-state: true" to my local
>>> flink-conf.yaml where I'm running flink run-application
>>> 4) Overriding it in the application code:
>>> val sigh = new Configuration()
>>> sigh.setBoolean(SavepointConfigOptions.SAVEPOINT_IGNORE_UNCLAIMED_STATE,
>>> true)
>>> env.configure(sigh)
>>>
>>> Every time the resulting pod ends up with "false" value for this setting
>>> in its configmap:
>>> $ kc describe cm/flink-config-flink-test | grep ignore
>>> execution.savepoint.ignore-unclaimed-state: false
>>>
>>> And I get the exception:
>>> java.lang.IllegalStateException: Failed to rollback to
>>> checkpoint/savepoint . Cannot map checkpoint/savepoint state for
>>> operator 68895e9129981bfc6d96d1dad715298e to the new program, because the
>>> operator is not available in the new program. If you want to allow to skip
>>> this, you can set the --allowNonRestoredState option on the CLI.
>>>
>>> It seems like something overrides it to false and it never has any
>>> effect.
>>>
>>> Can this be a bug or am I doing something wrong?
>>>
>>> For context, the savepoint is produced by Flink 1.8.2 and the version
>>> I'm trying to run on K8S is 1.14.3.
>>>
>>> --
>>> With regards,
>>> Andrey Bulgakov
>>>
>>>
>
> --
> With regards,
> Andrey Bulgakov
>


Re: Flink 1.14.2 - Log4j2 -Dlog4j.configurationFile is ignored and falls back to default /opt/flink/conf/log4j-console.properties

2022-01-24 Thread Yang Wang
>
> I checked the image prior cluster creation; all logs' files are there.
> once the cluster is deployed, they are missing. (bug?)


I do not think it is a bug, since we ship all the config files (log4j
properties, flink-conf.yaml) via the ConfigMap.
It is then mounted directly onto an existing path (/opt/flink/conf), which
hides all the files that were already there.

Of course, we could use a subPath mount to avoid this issue, but such a
volume mount will not receive ConfigMap updates[1].


[1]. https://kubernetes.io/docs/concepts/storage/volumes/#configmap


Best,
Yang


Tamir Sagi  wrote on Sat, Jan 22, 2022 at 23:18:

> Hey Yang,
>
> I've created the ticket,
> https://issues.apache.org/jira/browse/FLINK-25762
>
> In addition,
>
> The /opt/flink/conf is cleaned up because we are mounting the conf files
> from K8s ConfigMap.
>
> I checked the image prior cluster creation; all logs' files are there.
> once the cluster is deployed, they are missing. (bug?)
>
> Best,
> Tamir.
> --
> *From:* Tamir Sagi 
> *Sent:* Friday, January 21, 2022 7:19 PM
> *To:* Yang Wang 
> *Cc:* user@flink.apache.org 
> *Subject:* Re: Flink 1.14.2 - Log4j2 -Dlog4j.configurationFile is ignored
> and falls back to default /opt/flink/conf/log4j-console.properties
>
> Yes,
>
> Thank you!
> I will handle that.
>
> Best,
> Tamir
> --
> *From:* Yang Wang 
> *Sent:* Friday, January 21, 2022 5:11 AM
> *To:* Tamir Sagi 
> *Cc:* user@flink.apache.org 
> *Subject:* Re: Flink 1.14.2 - Log4j2 -Dlog4j.configurationFile is ignored
> and falls back to default /opt/flink/conf/log4j-console.properties
>
>
> *EXTERNAL EMAIL*
>
>
> Changing the order of exec command makes sense to me. Would you please
> create a ticket for this?
>
> The /opt/flink/conf is cleaned up because we are mounting the conf files
> from K8s ConfigMap.
>
>
>
> Best,
> Yang
>
Tamir Sagi  wrote on Tue, Jan 18, 2022 at 17:48:
>
> Hey Yang,
>
> Thank you for confirming it.
>
> IMO, a better approach is to change the order of "log_setting", "ARGS" and
> "FLINK_ENV_JAVA_OPTS"
> in the exec command.
> That way we prioritize user-defined properties.
>
> From:
>
> exec "$JAVA_RUN" $JVM_ARGS ${FLINK_ENV_JAVA_OPTS} "${log_setting[@]}"
> -classpath "`manglePathList
> "$FLINK_TM_CLASSPATH:$INTERNAL_HADOOP_CLASSPATHS"`" ${CLASS_TO_RUN}
> "${ARGS[@]}"
>
> To
>
> exec "$JAVA_RUN" $JVM_ARGS "${log_setting[@]}" -classpath "`manglePathList
> "$FLINK_TM_CLASSPATH:$INTERNAL_HADOOP_CLASSPATHS"`" ${CLASS_TO_RUN}
> "${ARGS[@]}" "${FLINK_ENV_JAVA_OPTS}"
>
> Unless there are system configurations which are not supposed to be
> overridden by the user (and then having dedicated env variables is better).
> Does that make sense to you?
>
>
> In addition, any idea why /opt/flink/conf gets cleaned? (Only
> flink-conf.yaml is there.)
>
>
> Best,
> Tamir
>
>
> --
> *From:* Yang Wang 
> *Sent:* Tuesday, January 18, 2022 6:02 AM
> *To:* Tamir Sagi 
> *Cc:* user@flink.apache.org 
> *Subject:* Re: Flink 1.14.2 - Log4j2 -Dlog4j.configurationFile is ignored
> and falls back to default /opt/flink/conf/log4j-console.properties
>
>
> *EXTERNAL EMAIL*
>
>
> I think you are right. Before 1.13.0, if the log configuration file does
> not exist, the logging properties would not be added to the start command.
> That is why it could work in 1.12.2.
>
> However, from 1.13.0, we are not using
> "kubernetes.container-start-command-template" to generate the JM/TM start
> command, but the jobmanager.sh/taskmanager.sh. We do not
> have the same logic in the "flink-console.sh".
>
> Maybe we could introduce an environment variable for the log configuration
> file name in "flink-console.sh". The default value could be
> "log4j-console.properties", and it could be configured by users.
> If this makes sense to you, could you please create a ticket?
>
>
> Best,
> Yang
>
Tamir Sagi  wrote on Mon, Jan 17, 2022 at 22:53:
>
> Hey Yang,
>
> thanks for answering,
>
> TL;DR
>
> Assuming I have not missed anything , the way TM and JM are created is
> different between these 2 versions,
> but it does look like flink-console.sh gets called eventually with the
> same exec command.
>
> in 1.12.2 if org.apache.flink.kubernetes.kubeclient.parameters#hasLog4j
> returns false then logging args are not added to startCommand.
>
>
>1. why does the config dir gets cleaned once the cluster starts? Even
>when I pushed log4j-

Re: Flink 1.14.2 - Log4j2 -Dlog4j.configurationFile is ignored and falls back to default /opt/flink/conf/log4j-console.properties

2022-01-20 Thread Yang Wang
Changing the order of the exec command makes sense to me. Would you please
create a ticket for this?

The /opt/flink/conf directory is cleaned up because we mount the conf files
from a K8s ConfigMap.



Best,
Yang
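The proposed reordering matters because, on common JVMs, the last occurrence of a duplicated -D flag wins. A rough Python model of that resolution (illustrative only; this is neither Flink nor JVM code):

```python
# Illustrative model: the JVM keeps the LAST occurrence of a duplicated -D
# flag, so whichever group comes later in the exec command line wins -
# log_setting in the current order, FLINK_ENV_JAVA_OPTS in the proposed one.
def resolve_system_properties(argv):
    props = {}
    for arg in argv:
        if arg.startswith("-D") and "=" in arg:
            key, value = arg[2:].split("=", 1)
            props[key] = value    # later duplicates overwrite earlier ones
    return props

user_opts = "-Dlog4j.configurationFile=/opt/log4j2/log4j2.xml"
log_setting = "-Dlog4j.configurationFile=/opt/flink/conf/log4j-console.properties"

current = [user_opts, log_setting]    # user opts first: the default wins
proposed = [log_setting, user_opts]   # user opts last: the user's value wins
print(resolve_system_properties(current)["log4j.configurationFile"])   # default conf path
print(resolve_system_properties(proposed)["log4j.configurationFile"])  # user's path
```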

Tamir Sagi  wrote on Tue, Jan 18, 2022 at 17:48:

> Hey Yang,
>
> Thank you for confirming it.
>
> IMO, a better approach is to change the order of "log_setting", "ARGS" and
> "FLINK_ENV_JAVA_OPTS"
> in the exec command.
> That way we prioritize user-defined properties.
>
> From:
>
> exec "$JAVA_RUN" $JVM_ARGS ${FLINK_ENV_JAVA_OPTS} "${log_setting[@]}"
> -classpath "`manglePathList
> "$FLINK_TM_CLASSPATH:$INTERNAL_HADOOP_CLASSPATHS"`" ${CLASS_TO_RUN}
> "${ARGS[@]}"
>
> To
>
> exec "$JAVA_RUN" $JVM_ARGS "${log_setting[@]}" -classpath "`manglePathList
> "$FLINK_TM_CLASSPATH:$INTERNAL_HADOOP_CLASSPATHS"`" ${CLASS_TO_RUN}
> "${ARGS[@]}" "${FLINK_ENV_JAVA_OPTS}"
>
> Unless there are system configurations which are not supposed to be
> overridden by the user (and then having dedicated env variables is better).
> Does that make sense to you?
>
>
> In addition, any idea why /opt/flink/conf gets cleaned? (Only
> flink-conf.yaml is there.)
>
>
> Best,
> Tamir
>
>
> --
> *From:* Yang Wang 
> *Sent:* Tuesday, January 18, 2022 6:02 AM
> *To:* Tamir Sagi 
> *Cc:* user@flink.apache.org 
> *Subject:* Re: Flink 1.14.2 - Log4j2 -Dlog4j.configurationFile is ignored
> and falls back to default /opt/flink/conf/log4j-console.properties
>
>
> *EXTERNAL EMAIL*
>
>
> I think you are right. Before 1.13.0, if the log configuration file does
> not exist, the logging properties would not be added to the start command.
> That is why it could work in 1.12.2.
>
> However, from 1.13.0, we are not using
> "kubernetes.container-start-command-template" to generate the JM/TM start
> command, but the jobmanager.sh/taskmanager.sh. We do not
> have the same logic in the "flink-console.sh".
>
> Maybe we could introduce an environment variable for the log configuration
> file name in "flink-console.sh". The default value could be
> "log4j-console.properties", and it could be configured by users.
> If this makes sense to you, could you please create a ticket?
>
>
> Best,
> Yang
>
Tamir Sagi  wrote on Mon, Jan 17, 2022 at 22:53:
>
> Hey Yang,
>
> thanks for answering,
>
> TL;DR
>
> Assuming I have not missed anything , the way TM and JM are created is
> different between these 2 versions,
> but it does look like flink-console.sh gets called eventually with the
> same exec command.
>
> in 1.12.2 if org.apache.flink.kubernetes.kubeclient.parameters#hasLog4j
> returns false then logging args are not added to startCommand.
>
>
>1. why does the config dir gets cleaned once the cluster starts? Even
>when I pushed log4j-console.properties to the expected location
>(/opt/flink/conf) , the directory includes only flink-conf.yaml.
>2. I think by running exec command "...${FLINK_ENV_JAVA_OPTS}
>"${log_setting[@]}" "${ARGS[@]}" some properties might be ignored.
>IMO, it should first look for properties in java.opts provided by the
>user in flink-conf and falls back to default in case it's not present.
>
>
> Talking about the native Kubernetes mode:
>
> I checked the bash scripts in the flink-dist module; flink-console.sh looks
> similar in both 1.14.2 and 1.12.2 (in 1.14.2 there are more cases for the
> input argument).
>
> logging variable is the same
>
> https://github.com/apache/flink/blob/release-1.14.2/flink-dist/src/main/flink-bin/bin/flink-console.sh#L101
>
> https://github.com/apache/flink/blob/release-1.12.2/flink-dist/src/main/flink-bin/bin/flink-console.sh#L89
>
> Exec command is the same
>
> https://github.com/apache/flink/blob/release-1.14.2/flink-dist/src/main/flink-bin/bin/flink-console.sh#L114
>
> https://github.com/apache/flink/blob/release-1.12.2/flink-dist/src/main/flink-bin/bin/flink-console.sh#L99
>
> As for creating TM/JM, in *1.14.2* there is a usage of 2 bash scripts
>
>- kubernetes-jobmanager.sh
>- kubernetes-taskmanager.sh
>
> They get called while decorating the pod, referenced in startCommand.
>
> for instance, JobManager.
>
> https://github.com/apache/flink/blob/release-1.14.2/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/decorators/CmdJobManagerDecorator.java#L58-L59
>
> kubernetes-jobmanager.sh gets called once the container starts which calls
> flink-console.sh internally and pass the
> deploymentName(kubernete

Re: Flink 1.14.2 - Log4j2 -Dlog4j.configurationFile is ignored and falls back to default /opt/flink/conf/log4j-console.properties

2022-01-17 Thread Yang Wang
I think you are right. Before 1.13.0, if the log configuration file does
not exist, the logging properties would not be added to the start command.
That is why it could work in 1.12.2.

However, since 1.13.0 we are no longer using
"kubernetes.container-start-command-template" to generate the JM/TM start
command, but jobmanager.sh/taskmanager.sh instead. We do not
have the same logic in "flink-console.sh".
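The pre-1.13 behaviour described above can be modelled roughly as follows; this is an illustrative Python sketch, not the actual shell/Java logic:

```python
# Illustrative model (not Flink code) of the pre-1.13 "hasLog4j" behaviour:
# logging flags were appended to the start command only when the file really
# existed in the conf dir, so a user-supplied -Dlog4j.configurationFile in
# env.java.opts was left untouched when no log4j file was shipped.
import os

def start_command_args(conf_dir, user_java_opts):
    args = list(user_java_opts)
    log4j = os.path.join(conf_dir, "log4j-console.properties")
    if os.path.exists(log4j):    # the "hasLog4j" check
        # appended last, so it would override any user-supplied setting
        args.append(f"-Dlog4j.configurationFile={log4j}")
    return args

# No log4j file in the conf dir: the user's flag survives unchanged.
print(start_command_args("/nonexistent-conf",
                         ["-Dlog4j.configurationFile=/opt/log4j2/log4j2.xml"]))
```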

Maybe we could introduce an environment variable for the log configuration
file name in "flink-console.sh". The default value could be
"log4j-console.properties", and it could be configured by users.
If this makes sense to you, could you please create a ticket?


Best,
Yang

Tamir Sagi  wrote on Mon, Jan 17, 2022 at 22:53:

> Hey Yang,
>
> thanks for answering,
>
> TL;DR
>
> Assuming I have not missed anything , the way TM and JM are created is
> different between these 2 versions,
> but it does look like flink-console.sh gets called eventually with the
> same exec command.
>
> in 1.12.2 if org.apache.flink.kubernetes.kubeclient.parameters#hasLog4j
> returns false then logging args are not added to startCommand.
>
>
>1. why does the config dir gets cleaned once the cluster starts? Even
>when I pushed log4j-console.properties to the expected location
>(/opt/flink/conf) , the directory includes only flink-conf.yaml.
>2. I think by running exec command "...${FLINK_ENV_JAVA_OPTS}
>"${log_setting[@]}" "${ARGS[@]}" some properties might be ignored.
>IMO, it should first look for properties in java.opts provided by the
>user in flink-conf and falls back to default in case it's not present.
>
>
> Talking about the native Kubernetes mode:
>
> I checked the bash scripts in the flink-dist module; flink-console.sh looks
> similar in both 1.14.2 and 1.12.2 (in 1.14.2 there are more cases for the
> input argument).
>
> logging variable is the same
>
> https://github.com/apache/flink/blob/release-1.14.2/flink-dist/src/main/flink-bin/bin/flink-console.sh#L101
>
> https://github.com/apache/flink/blob/release-1.12.2/flink-dist/src/main/flink-bin/bin/flink-console.sh#L89
>
> Exec command is the same
>
> https://github.com/apache/flink/blob/release-1.14.2/flink-dist/src/main/flink-bin/bin/flink-console.sh#L114
>
> https://github.com/apache/flink/blob/release-1.12.2/flink-dist/src/main/flink-bin/bin/flink-console.sh#L99
>
> As for creating TM/JM, in *1.14.2* there is a usage of 2 bash scripts
>
>- kubernetes-jobmanager.sh
>- kubernetes-taskmanager.sh
>
> They get called while decorating the pod, referenced in startCommand.
>
> for instance, JobManager.
>
> https://github.com/apache/flink/blob/release-1.14.2/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/decorators/CmdJobManagerDecorator.java#L58-L59
>
> kubernetes-jobmanager.sh gets called once the container starts which calls
> flink-console.sh internally and pass the
> deploymentName(kubernetes-application in our case) and args.
>
> In *1.12.2* the decorator set /docker-entrypoint.sh
>
> https://github.com/apache/flink/blob/release-1.12.2/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/factory/KubernetesJobManagerFactory.java#L67
>
> and set the start command
>
> https://github.com/apache/flink/blob/release-1.12.2/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/configuration/KubernetesConfigOptions.java#L224
>
>
> https://github.com/apache/flink/blob/release-1.12.2/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/utils/KubernetesUtils.java#L333
>
>
> with additional logging parameter
>
> https://github.com/apache/flink/blob/release-1.12.2/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/utils/KubernetesUtils.java#L421-L425
>
> hasLog4j
>
> https://github.com/apache/flink/blob/release-1.12.2/flink-kubernetes/src/main/java/org/apache/flink/kubernetes/kubeclient/parameters/AbstractKubernetesParameters.java#L151-L155
> it checks if the file exists in conf dir.
>
> If the log4j is false, then the logging properties are not added to start
> command(Might be the case, which explains why it works in 1.12.2)
>
> It then passes 'jobmanager' as component.
> looking into /docker-entrypoint.sh it calls jobmanager.sh which calls
> flink-console.sh internally
>
> Have I missed anything?
>
>
> Best,
> Tamir
>
>
> --
> *From:* Yang Wang 
> *Sent:* Monday, January 17, 2022 1:05 PM
> *To:* Tamir Sagi 
> *Cc:* user@flink.a

Re: [E] Re: Orphaned job files in HDFS

2022-01-17 Thread Yang Wang
The clean-up of the staging directory is best-effort. If the JobManager
crashes or is killed externally, it does not get a chance to clean up the
staging directory.
AFAIK, we do not have a Flink option to guarantee the clean-up.


Best,
Yang
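Since there is no built-in guarantee, an external sweep is one workaround. The sketch below is hypothetical, not a Flink feature; in practice the directory listing and the set of running applications would come from `hdfs dfs -ls` and `yarn application -list`:

```python
# Hypothetical external clean-up sketch: find staging directories under
# /user/{name}/.flink whose YARN application is no longer running. Each
# result would then be removed with `hdfs dfs -rm -r /user/yarn/.flink/<dir>`.
def orphaned_staging_dirs(staging_entries, running_app_ids):
    return [d for d in staging_entries
            if d.startswith("application_") and d not in running_app_ids]

entries = ["application_1641800000000_0001",
           "application_1641800000000_0002",
           "application_1641800000000_0003"]
running = {"application_1641800000000_0003"}
print(orphaned_staging_dirs(entries, running))
# -> ['application_1641800000000_0001', 'application_1641800000000_0002']
```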

David Clutter  wrote on Tue, Jan 11, 2022 at 22:59:

> Ok, that makes sense. I did see some job failures, and failures could
> happen occasionally. Is there any option to have the JobManager clean up
> these directories when the job has failed?
>
> On Mon, Jan 10, 2022 at 8:58 PM Yang Wang  wrote:
>
>> IIRC, the staging directory (/user/{name}/.flink/application_xxx) will be
>> deleted automatically if the Flink job reaches a global terminal state
>> (e.g. FINISHED, CANCELED, FAILED).
>> So I assume you have stopped the YARN application via "yarn application
>> -kill", not via "bin/flink cancel".
>> If that is the case, then the residual staging directory is expected
>> behavior, since the Flink JobManager did not have a chance to do the
>> clean-up.
>>
>>
>>
>> Best,
>> Yang
>>
David Clutter  wrote on Tue, Jan 11, 2022 at 10:08:
>>
>>> I'm seeing files orphaned in HDFS and wondering how to clean them up
>>> when the job is completed.  The directory is /user/yarn/.flink so I am
>>> assuming this is created by flink?  The HDFS in my cluster eventually fills
>>> up.
>>>
>>> Here is my setup:
>>>
>>>- Flink 1.13.1 on AWS EMR
>>>- Executing flink in per-job mode
>>>- Job is submitted every 5m
>>>
>>> In HDFS under /user/yarn/.flink I see a directory created for every
>>> flink job submitted/yarn application.  Each application directory contains
>>> my user jar file, flink-dist jar, /lib with various flink jars,
>>> log4j.properties.
>>>
>>> Is there a property to tell flink to clean up this directory when the
>>> job is completed?
>>>
>>


Re: Flink 1.14.2 - Log4j2 -Dlog4j.configurationFile is ignored and falls back to default /opt/flink/conf/log4j-console.properties

2022-01-17 Thread Yang Wang
I think the root cause is that we are using "flink-console.sh" to start the
JobManager/TaskManager process for the native K8s integration since
FLINK-21128[1].
So it forces the log4j configuration file name to be
"log4j-console.properties".


[1]. https://issues.apache.org/jira/browse/FLINK-21128
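A possible escape hatch, sketched in Python for clarity: let an environment variable override the file name that flink-console.sh currently hard-codes. The variable name below is hypothetical, not a real Flink setting:

```python
# Hypothetical sketch of an env-var override for the forced log config name.
# FLINK_LOG4J_CONFIG_NAME is an illustrative name, not an actual Flink option.
import os

def log_config_path(conf_dir, env):
    name = env.get("FLINK_LOG4J_CONFIG_NAME", "log4j-console.properties")
    return os.path.join(conf_dir, name)

print(log_config_path("/opt/flink/conf", {}))   # the forced default
print(log_config_path("/opt/flink/conf", {"FLINK_LOG4J_CONFIG_NAME": "log4j2.xml"}))
```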


Best,
Yang

Tamir Sagi  wrote on Thu, Jan 13, 2022 at 20:30:

> Hey All
>
> I'm running Flink 1.14.2, and it seems to ignore the system
> property -Dlog4j.configurationFile,
> falling back to /opt/flink/conf/log4j-console.properties.
>
> I enabled debug log for log4j2  ( -Dlog4j2.debug)
>
> DEBUG StatusLogger Catching
>  java.io.FileNotFoundException:
> file:/opt/flink/conf/log4j-console.properties (No such file or directory)
> at java.base/java.io.FileInputStream.open0(Native Method)
> at java.base/java.io.FileInputStream.open(Unknown Source)
> at java.base/java.io.FileInputStream.(Unknown Source)
> at
> org.apache.logging.log4j.core.config.ConfigurationFactory.getInputFromString(ConfigurationFactory.java:370)
> at
> org.apache.logging.log4j.core.config.ConfigurationFactory$Factory.getConfiguration(ConfigurationFactory.java:513)
> at
> org.apache.logging.log4j.core.config.ConfigurationFactory$Factory.getConfiguration(ConfigurationFactory.java:499)
> at
> org.apache.logging.log4j.core.config.ConfigurationFactory$Factory.getConfiguration(ConfigurationFactory.java:422)
> at
> org.apache.logging.log4j.core.config.ConfigurationFactory.getConfiguration(ConfigurationFactory.java:322)
> at
> org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:695)
> at
> org.apache.logging.log4j.core.LoggerContext.reconfigure(LoggerContext.java:716)
> at
> org.apache.logging.log4j.core.LoggerContext.start(LoggerContext.java:270)
> at
> org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:155)
> at
> org.apache.logging.log4j.core.impl.Log4jContextFactory.getContext(Log4jContextFactory.java:47)
> at org.apache.logging.log4j.LogManager.getContext(LogManager.java:196)
> at
> org.apache.logging.log4j.spi.AbstractLoggerAdapter.getContext(AbstractLoggerAdapter.java:137)
> at
> org.apache.logging.slf4j.Log4jLoggerFactory.getContext(Log4jLoggerFactory.java:55)
> at
> org.apache.logging.log4j.spi.AbstractLoggerAdapter.getLogger(AbstractLoggerAdapter.java:47)
> at
> org.apache.logging.slf4j.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:33)
> at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:329)
> at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:349)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.(AkkaRpcServiceUtils.java:55)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcSystem.remoteServiceBuilder(AkkaRpcSystem.java:42)
> at
> org.apache.flink.runtime.rpc.akka.CleanupOnCloseRpcSystem.remoteServiceBuilder(CleanupOnCloseRpcSystem.java:77)
> at
> org.apache.flink.runtime.rpc.RpcUtils.createRemoteRpcService(RpcUtils.java:184)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.initializeServices(ClusterEntrypoint.java:300)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:243)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$1(ClusterEntrypoint.java:193)
> at
> org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:190)
> at
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runClusterEntrypoint(ClusterEntrypoint.java:617)
>
> Where I see the property is being loaded while deploying the cluster
>
> source:{
> class:org.apache.flink.configuration.GlobalConfiguration
> method:loadYAMLResource
> file:GlobalConfiguration.java
> line:213
> }
> message:Loading configuration property: env.java.opts,
> -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/dumps
> -Dlog4j.configurationFile=/opt/log4j2/log4j2.xml -Dlog4j2.debug=true
>
> In addition, following the documentation[1], it seems that Flink ships
> with default log4j properties files located in /opt/flink/conf.
>
> Looking into that dir once the cluster is deployed, only flink-conf.yaml
> is there.
>
>
>
> Docker file content
>
> FROM flink:1.14.2-scala_2.12-java11
> ARG JAR_FILE
> COPY target/${JAR_FILE} $FLINK_HOME/usrlib/flink-job.jar
> ADD log4j2.xml /opt/log4j2/log4j2.xml
>
>
>
> *It works perfectly in 1.12.2 with the same log4j2.xml file and system
> property.*
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/advanced/logging/#configuring-log4j-2
>
>
> Best,
> Tamir
>
>
>
> Confidentiality: This communication and any attachments are intended for
> the above-named persons only and may be confidential and/or legally
> privileged. Any opinions expressed in this communication are not
> necessarily those of NICE Actimize. If this communication has come to you
> in error you must take no action based on it, nor must you copy or show it
> to anyone; please delete/destroy 

Re: Flink native k8s integration vs. operator

2022-01-17 Thread Yang Wang
g called "native" (tooling,
>>> security
>>> >> > concerns).
>>> >> >
>>> >> > With respect to common lifecycle management operations: these
>>> features
>>> >> are
>>> >> > not available (within Apache Flink) for any of the other resource
>>> >> providers
>>> >> > (YARN, Standalone) either. From this perspective, I wouldn't
>>> consider
>>> >> this
>>> >> > a shortcoming of the Kubernetes integration. Instead, we have been
>>> >> focusing
>>> >> > our efforts in Apache Flink on the operations of a single Job, and
>>> left
>>> >> > orchestration and lifecycle management that spans multiple Jobs to
>>> >> > ecosystem projects. I still believe that we should keep this focus
>>> on
>>> >> low
>>> >> > level composable building blocks (like Jobs and Snapshots) in Apache
>>> >> Flink
>>> >> > to make it easy for everyone to build fitting higher level
>>> abstractions
>>> >> > like a FlinkApplication Custom Resource on top of it. For example,
>>> we
>>> >> are
>>> >> > currently contributing multiple improvements [1,2,3,4] to the REST
>>> API
>>> >> and
>>> >> > Application Mode that in our experience will make it easier to
>>> manage
>>> >> > Apache Flink with a Kubernetes operator. Given this background, I
>>> >> suspect a
>>> >> > Kubernetes Operator in Apache Flink would not be a priority for us
>>> at
>>> >> > Ververica - at least right now.
>>> >> >
>>> >> > Having said this, if others in the community have the capacity to
>>> push
>>> >> and
>>> >> > *maintain* a somewhat minimal "reference" Kubernetes Operator for
>>> Apache
>>> >> > Flink, I don't see any blockers. If or when this happens, I'd see
>>> some
>>> >> > clear benefits of using a separate repository (easier independent
>>> >> > versioning and releases, different build system & tooling (go, I
>>> >> assume)).
>>> >> >
>>> >> > Looking forward to your thoughts,
>>> >> >
>>> >> > Konstantin
>>> >> >
>>> >> > [1] https://issues.apache.org/jira/browse/FLINK-24275
>>> >> > [2] https://issues.apache.org/jira/browse/FLINK-24208
>>> >> > [3]
>>> >> >
>>> >>
>>> https://cwiki.apache.org/confluence/display/FLINK/FLIP-194%3A+Introduce+the+JobResultStore
>>> >> > [4] https://issues.apache.org/jira/browse/FLINK-24113
>>> >> >
>>> >> > On Mon, Jan 10, 2022 at 2:11 PM Gyula Fóra 
>>> wrote:
>>> >> >
>>> >> > > Hi All!
>>> >> > >
>>> >> > > This is a very interesting discussion.
>>> >> > >
>>> >> > > I think many users find it confusing what deployment mode to
>>> choose
>>> >> when
>>> >> > > considering a new production application on Kubernetes. With all
>>> the
>>> >> > > options of native, standalone and different operators this can get
>>> >> tricky :)
>>> >> > >
>>> >> > > I really like the idea that Thomas brought up to have at least a
>>> >> minimal
>>> >> > > operator implementation in Flink itself to cover the most common
>>> >> production
>>> >> > > job lifecycle management scenarios. I think the Flink community
>>> has a
>>> >> very
>>> >> > > strong experience in this area to create a successful
>>> implementation
>>> >> that
>>> >> > > would benefit most production users on Kubernetes.
>>> >> > >
>>> >> > > Cheers,
>>> >> > > Gyula
>>> >> > >
>>> >> > > On Mon, Jan 10, 2022 at 4:29 AM Yang Wang 
>>> >> wrote:
>>> >> > >
>>> >> > >> Thanks all for this fruitful discussion.
>>> >> > >>
>>> >> > >> I think Xintong has given a strong point why we introduced the

Re: Orphaned job files in HDFS

2022-01-10 Thread Yang Wang
IIRC, the staging directory (/user/{name}/.flink/application_xxx) will be
deleted automatically if the Flink job reaches a global terminal state (e.g.
FINISHED, CANCELED, FAILED).
So I assume you have stopped the yarn application via "yarn application
-kill", not via "bin/flink cancel".
If that is the case, then the residual staging directory is expected
behavior, since the Flink JobManager does not get a chance to do the
clean-up.
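
To illustrate the two shutdown paths (the application ID, job ID, and HDFS
user below are placeholders, and the exact CLI option names vary by Flink
version):

```shell
# Cancelling through Flink lets the JobManager reach a global terminal
# state and remove the staging directory itself:
bin/flink cancel -yid application_1600000000000_0001 <jobId>

# Killing the YARN application directly gives the JobManager no chance to
# clean up, so the staging directory is left behind:
yarn application -kill application_1600000000000_0001
hdfs dfs -rm -r /user/hadoop/.flink/application_1600000000000_0001  # manual cleanup
```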



Best,
Yang

David Clutter  于2022年1月11日周二 10:08写道:

> I'm seeing files orphaned in HDFS and wondering how to clean them up when
> the job is completed.  The directory is /user/yarn/.flink, so I am assuming
> this is created by Flink?  The HDFS in my cluster eventually fills up.
>
> Here is my setup:
>
>- Flink 1.13.1 on AWS EMR
>- Executing flink in per-job mode
>- Job is submitted every 5m
>
> In HDFS under /user/yarn/.flink I see a directory created for every flink
> job submitted/yarn application.  Each application directory contains my
> user jar file, flink-dist jar, /lib with various flink jars,
> log4j.properties.
>
> Is there a property to tell flink to clean up this directory when the job
> is completed?
>


Re: Unable to update logback configuration in Flink Native Kubernetes

2022-01-09 Thread Yang Wang
Sorry for the late reply.

The Flink client ships the log4j-console.properties and
logback-console.xml via a K8s ConfigMap, which is then mounted into the
JobManager/TaskManager pods.
So if you want to update the log settings or use Logback, all you need to
do is update the client-local files.
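
As a sketch (the cluster id and jar path below are placeholders), the
workflow might look like this, assuming the cluster is (re)deployed from a
client with the default conf directory:

```shell
# Edit the client-local logging config; it is shipped via ConfigMap
# when the cluster is deployed.
vi $FLINK_HOME/conf/logback-console.xml

# Redeploy so the ConfigMap is recreated from the updated local file.
$FLINK_HOME/bin/flink run-application -t kubernetes-application \
    -Dkubernetes.cluster-id=my-flink-cluster \
    local:///opt/flink/usrlib/my-job.jar
```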

Best,
Yang




Raghavendar T S  于2021年12月31日周五 12:47写道:

> Hi Sharon
>
> Thanks a lot. I just updated the files (flink-conf.yaml and
> logback-console.xml) in the local conf folder and it worked as expected.
>
> Thanks & Regards
> Raghavendar T S
> MERAS Plugins
>
>
>
> On Thu, Dec 30, 2021 at 12:54 AM Sharon Xie 
> wrote:
>
>> I've faced the same issue before.
>>
>> I figured out that there is an internal configuration
>> `$internal.deployment.config-dir`
>> which allows me to specify a local folder which contains the logback config
>> using file `logback-console.xml`. The content of the file is then used to
>> create the config map.
>>
>> Hope it helps.
>>
>>
>> Sharon
>>
>> On Wed, Dec 29, 2021 at 7:04 AM Raghavendar T S 
>> wrote:
>>
>>> Hi
>>>
>>> I have created a Flink Native Kubernetes (1.14.2) cluster, which was
>>> successful. I am trying to update the logback configuration, for which I
>>> am using the ConfigMap exposed by Flink Native Kubernetes. Flink Native
>>> Kubernetes creates this ConfigMap during the start of the cluster and
>>> deletes it when the cluster is stopped; this behavior is as per the
>>> official documentation.
>>>
>>> I updated the logback ConfigMap, which also succeeded, and this process
>>> even updates the actual logback files (conf folder) in the JobManager
>>> and TaskManager. But Flink is not loading (hot-reloading) this logback
>>> configuration.
>>>
>>> Also, I want to make sure that the logback ConfigMap configuration is
>>> persisted across cluster restarts. But Flink Native Kubernetes
>>> recreates the ConfigMap each time the cluster is started.
>>>
>>> What am I missing here? How can I make the updated logback
>>> configuration take effect?
>>>
>>>
>>> Thanks & Regards
>>> Raghavendar T S
>>>
>>>
>>>
>>
>
> --
> Raghavendar T S
> www.teknosrc.com
>


Re: Pod Disruption in Flink Kubernetes Cluster

2022-01-09 Thread Yang Wang
The Flink applications might run more stably if you configure enough
resources (e.g. memory, CPU, ephemeral-storage) for the JobManager and
TaskManager pods.

Best,
Yang

David Morávek  于2022年1月5日周三 16:46写道:

> Hi Tianyi,
>
> this really depends on your kubernetes setup (eg. if autoscaling is
> enabled, or you're using spot / preemptible instances). In general,
> applications that run on Kubernetes need to be resilient to these kinds of
> failures; Flink is no exception.
>
> In case of a failure, Flink needs to restart the job from the latest
> checkpoint to ensure consistency. In this kind of environment, you should
> be OK-ish with replaying one checkpoint's worth of data (you're able to
> adjust the checkpointing interval).
>
> Still, it would be worth looking into why these disruptions happen and
> fixing the cause. Even though you should be able to recover from these
> types of failures, that doesn't mean it's a good thing to do more often
> than necessary :) I think if you describe the pod / sts you should see the
> k8s events that resulted in the container being terminated.
>
> Also, we're currently working on several efforts to make the restarting
> experience smoother and the checkpointing interval shorter (eg. FLIP-198 [1],
> FLINK-25277 [2], FLIP-158 [3], ..).
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-198%3A+Working+directory+for+Flink+processes
> [2] https://issues.apache.org/jira/browse/FLINK-25277
> [3]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints
>
> Best,
> D.
>
> On Tue, Jan 4, 2022 at 7:23 PM Tianyi Deng  wrote:
>
>> Hello Flink community,
>>
>>
>>
>> We have a Flink cluster deployed to AWS EKS along with many other
>> applications. This cluster is managed by Spotify’s Flink operator. After
>> deployment I notice the Stateful pods of job manager and task managers
>> intermittently received *SIGTERM* to terminate themselves. I assume this
>> has something to do with the voluntary pod disruption from K8s’s
>> descheduler, perhaps because of node draining since other applications’
>> pods scale up and down, or for other reasons. It seems like this is
>> inevitable, as K8s usually moves pods around; however, it causes the Flink
>> job to restart every time. I feel this is quite unstable.
>>
>>
>>
>> Has anyone also seen this voluntary pod disruption in Flink cluster at
>> K8s? Is there any best practice or recommendation for the Flink operation
>> in K8s?
>>
>>
>>
>> Thanks,
>>
>> Tianyi
>>
>


Re: Flink native k8s integration vs. operator

2022-01-09 Thread Yang Wang
Thanks all for this fruitful discussion.

I think Xintong has made a strong point about why we introduced the native
K8s integration, which is active resource management.
I have a concrete example of this in production. When a K8s node goes
down, a standalone K8s deployment takes longer to recover,
bounded by the K8s eviction time (IIRC, the default is 5 minutes).
With the native K8s integration, the Flink RM can detect the
lost TM heartbeat and allocate a new one in a timely manner.

Also, when introducing the native K8s integration, another consideration was
that we should make it easy for users to migrate from a YARN deployment.
They already have production-ready job life-cycle management systems,
which use the Flink CLI to submit Flink jobs.
So we provide a consistent command "bin/flink run-application -t
kubernetes-application/yarn-application" to start a Flink application and
"bin/flink cancel/stop ..."
to terminate a Flink application.


Compared with a K8s operator, I know that this is not a K8s-native mechanism.
Hence, I also agree that we still need a powerful K8s operator which
could work with both standalone and native K8s modes. The major difference
between them is how to start the JM and TM pods. For standalone,
they are managed by K8s job/deployment. For native, maybe we could simply
create a submission carrying the "flink run-application" arguments,
which are derived from the Flink application CR.

Making Flink's active resource manager talk to the K8s operator is an
interesting option, which could support both standalone and native modes.
Then Flink RM just needs to declare the resource requirement(e.g. 2 * <2G,
1CPU>, 2 * <4G, 1CPU>) and defer the resource allocation/de-allocation
to the K8s operator. It feels like an intermediate form between native and
standalone mode :)



Best,
Yang



Xintong Song  于2022年1月7日周五 12:02写道:

> Hi folks,
>
> Thanks for the discussion. I'd like to share my two cents on this topic.
>
> Firstly, I'd like to clarify my understanding of the concepts "native k8s
> integration" and "active resource management".
> - Native k8s integration means Flink's master interacts with k8s' api
> server directly. It acts like embedding an operator inside Flink's master,
> which manages the resources (pod, deployment, configmap, etc.) and watches
> / reacts to related events.
> - Active resource management means Flink can actively start / terminate
> workers as needed. Its key characteristic is that the resource a Flink
> deployment uses is decided by the job's execution plan, unlike the opposite
> reactive mode (resource available to the deployment decides the execution
> plan) or the standalone mode (both execution plan and deployment resources
> are predefined).
>
> Currently, we have the yarn and native k8s deployments (and the recently
> removed mesos deployment) in active mode, due to their ability to request /
> release worker resources from the underlying cluster. And all the existing
> operators, AFAIK, work with a Flink standalone deployment, where Flink
> cannot request / release resources by itself.
>
> From this perspective, I think a large part of the native k8s integration
> advantages come from the active mode: being able to better understand the
> job's resource requirements and adjust the deployment resource accordingly.
> Both fine-grained resource management (customizing TM resources for
> different tasks / operators) and adaptive batch scheduler (rescale the
> deployment w.r.t. different stages) fall into this category.
>
> I'm wondering if we can have an operator that also works with the active
> mode. Instead of talking to the api server directly for adding / deleting
> resources, Flink's active resource manager can talk to the operator (via
> CR) about the resources the deployment needs, and let the operator to
> actually add / remove the resources. The operator should be able to work
> with (active) or without (standalone) the information of deployment's
> resource requirements. In this way, users are free to choose between active
> and reactive (e.g., HPA) rescaling, while always benefiting from the
> beyond-deployment lifecycle (upgrades, savepoint management, etc.) and
> alignment with the K8s ecosystem (Flink client free, operating via kubectl,
> etc.).
>
> Thank you~
>
> Xintong Song
>
>
>
> On Thu, Jan 6, 2022 at 1:06 PM Thomas Weise  wrote:
>
>> Hi David,
>>
>> Thank you for the reply and context!
>>
>> As for workload types and where native integration might fit: I think
>> that any k8s native solution that satisfies category 3) can also take
>> care of 1) and 2) while the native integration by itself can't achieve
>> that. Existence of [1] might serve as further indication.
>>
>> The k8s operator pattern would be an essential building block for a
>> k8s native solution that is interoperable with k8s ecosystem tooling
>> like kubectl, which is why [2] and subsequent derived art were
>> created. Specifically the CRD allows us to directly express the
>> concep

Re: Flink On Native K8s hostAliases in Pod-template

2021-12-23 Thread Yang Wang
Hi,

The pod template file when you submit a Flink application via "flink
run-application ...
-Dkubernetes.pod-template-file=/path/of/pod-template.yaml" is a
*client-local* file.
You do not need to bundle it into the docker image.
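
As a sketch (the IPs and hostnames below are placeholders), a client-local
pod-template.yaml carrying per-environment hostAliases might look like:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: pod-template
spec:
  hostAliases:
    - ip: "10.0.0.10"
      hostnames:
        - "kafka-broker-1.internal"
    - ip: "10.0.0.11"
      hostnames:
        - "schema-registry.internal"
```

Since the file is resolved on the client at submission time, a different
copy can be kept per environment without rebuilding the image.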


Best,
Yang

黄剑文  于2021年12月23日周四 23:00写道:

> Flink version: 1.13
> I want to define some hosts for the Flink JobManager and TaskManager. I
> consulted the official Flink documentation and found that hostAliases can
> be defined in a pod template to solve this:
>
>  
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#pod-template
> 
>
> However, the pod template must be bundled into the image. The hosts are
> different for different environments (development, test, prod, etc.). That
> means I would have to maintain different images for different environments.
> Is there a better way to solve this problem?
> Thanks.
>


Re: Flink fails to load class from configured classpath using PipelineOptions

2021-12-20 Thread Yang Wang
Yes. You need to set "pipeline.classpaths" via flink-conf.yaml or the CLI
option (-C/--classpath).
I do not think setting it in your main class could work. Just like you
said, the user classloader will not be updated after the user main class is
executed.
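
As a sketch, the equivalent client-side flink-conf.yaml entry would be
(the jar path is a placeholder):

```yaml
# flink-conf.yaml
pipeline.classpaths: file:///path/to/udf.jar
```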

Best,
Yang

Pouria Pirzadeh  于2021年12月18日周六 01:23写道:

> I have tried 'PipelineOptions.CLASSPATHS'; It also fails with
> ClassNotFoundException with the exact same error stack trace as
> PipelineOptions.JARS.
>
> FYI The Same application jar works fine if submitted via Flink CLI using
> 'flink run' with the "-C" option to update classpath:
> /bin/flink run --detached -C file:///path/to/udf.jar 
>
> The problem seems to be that the classpath of the ClassLoader used by the
> table planner's code generation is not updated according to the
> Configuration passed to the StreamExecutionEnvironment, and I am not sure
> how that can be done.
>
> Pouria
>
>
> On Thu, Dec 16, 2021 at 8:46 PM Yang Wang  wrote:
>
>> The config option "pipeline.jars" is used to specify the user jar, which
>> contains the main class.
>> I think what you need is "pipeline.classpaths".
>>
>> /**
>>  * A list of URLs that are added to the classpath of each user code
>>  * classloader of the program. Paths must specify a protocol (e.g. file://)
>>  * and be accessible on all nodes.
>>  */
>> public static final ConfigOption<List<String>> CLASSPATHS =
>>         key("pipeline.classpaths")
>>                 .stringType()
>>                 .asList()
>>                 .noDefaultValue()
>>                 .withDescription(
>>                         "A semicolon-separated list of the classpaths to package with the job jars to be sent to"
>>                                 + " the cluster. These have to be valid URLs.");
>>
>>
>> Best,
>> Yang
>>
>> Pouria Pirzadeh  于2021年12月17日周五 03:43写道:
>>
>>> I am developing a Java application which uses UDFs on Flink 1.14.
>>> It uses PipelineOptions.JARS config to add jar files, containing UDF
>>> classes dynamically to the user classpath in the main method; however,
>>> the application fails to load the UDF class from the configured jar files
>>> at job launch time and crashes with ClassNotFoundException.
>>>
>>> Is PipelineOptions.JARS the correct option to add files to classpath on
>>> Job manager and all task managers?
>>>
>>> Sample code snippet:
>>>
>>> final Configuration configuration = new Configuration();
>>>
>>> configuration.set(PipelineOptions.JARS,Collections.singletonList("file:///path/to/udf.jar"));
>>> StreamExecutionEnvironment streamEnv =
>>> StreamExecutionEnvironment.getExecutionEnvironment(configuration);
>>> StreamTableEnvironment tableEnv =
>>> StreamTableEnvironment.create(streamEnv);
>>> ...
>>> Class<?> udfClass = Class.forName("demo.MyUDF", ...);
>>> tableEnv.createTemporarySystemFunction("MyUDF", udfClass);
>>> ...
>>>
>>> Error stack trace:
>>> Exception in thread "main" java.lang.ClassNotFoundException: demo.MyUDF
>>> at
>>> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:582)
>>> at
>>> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
>>> at
>>> org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.resolveClass(InstantiationUtil.java:78)
>>> at
>>> java.base/java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1886)
>>> at
>>> java.base/java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1772)
>>> at
>>> java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
>>> at
>>> java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1594)
>>> at
>>> java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:430)
>>> at
>>> org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:617)
>>> at
>>> org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:602)
>>> at
>>> org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:589)
>>> at
>>> org.apache.flink.table.planner.codegen.CodeGeneratorContext.addReusableObjectInternal(CodeGeneratorContext.scala:692)
>>> 

Re: Flink fails to load class from configured classpath using PipelineOptions

2021-12-16 Thread Yang Wang
The config option "pipeline.jars" is used to specify the user jar, which
contains the main class.
I think what you need is "pipeline.classpaths".

/**
 * A list of URLs that are added to the classpath of each user code
 * classloader of the program. Paths must specify a protocol (e.g. file://)
 * and be accessible on all nodes.
 */
public static final ConfigOption<List<String>> CLASSPATHS =
        key("pipeline.classpaths")
                .stringType()
                .asList()
                .noDefaultValue()
                .withDescription(
                        "A semicolon-separated list of the classpaths to package with the job jars to be sent to"
                                + " the cluster. These have to be valid URLs.");


Best,
Yang

Pouria Pirzadeh  于2021年12月17日周五 03:43写道:

> I am developing a Java application which uses UDFs on Flink 1.14.
> It uses PipelineOptions.JARS config to add jar files, containing UDF
> classes dynamically to the user classpath in the main method; however, the
> application fails to load the UDF class from the configured jar files at
> job launch time and crashes with ClassNotFoundException.
>
> Is PipelineOptions.JARS the correct option to add files to classpath on
> Job manager and all task managers?
>
> Sample code snippet:
>
> final Configuration configuration = new Configuration();
>
> configuration.set(PipelineOptions.JARS,Collections.singletonList("file:///path/to/udf.jar"));
> StreamExecutionEnvironment streamEnv =
> StreamExecutionEnvironment.getExecutionEnvironment(configuration);
> StreamTableEnvironment tableEnv = StreamTableEnvironment.create(streamEnv);
> ...
> Class<?> udfClass = Class.forName("demo.MyUDF", ...);
> tableEnv.createTemporarySystemFunction("MyUDF", udfClass);
> ...
>
> Error stack trace:
> Exception in thread "main" java.lang.ClassNotFoundException: demo.MyUDF
> at
> java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:582)
> at
> java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
> at
> org.apache.flink.util.InstantiationUtil$ClassLoaderObjectInputStream.resolveClass(InstantiationUtil.java:78)
> at
> java.base/java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1886)
> at
> java.base/java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1772)
> at
> java.base/java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2060)
> at
> java.base/java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1594)
> at
> java.base/java.io.ObjectInputStream.readObject(ObjectInputStream.java:430)
> at
> org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:617)
> at
> org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:602)
> at
> org.apache.flink.util.InstantiationUtil.deserializeObject(InstantiationUtil.java:589)
> at
> org.apache.flink.table.planner.codegen.CodeGeneratorContext.addReusableObjectInternal(CodeGeneratorContext.scala:692)
> at
> org.apache.flink.table.planner.codegen.CodeGeneratorContext.addReusableFunction(CodeGeneratorContext.scala:714)
> at
> org.apache.flink.table.planner.codegen.calls.BridgingFunctionGenUtil$.generateFunctionAwareCall(BridgingFunctionGenUtil.scala:130)
> at
> org.apache.flink.table.planner.codegen.calls.BridgingFunctionGenUtil$.generateFunctionAwareCallWithDataType(BridgingFunctionGenUtil.scala:116)
> at
> org.apache.flink.table.planner.codegen.calls.BridgingFunctionGenUtil$.generateFunctionAwareCall(BridgingFunctionGenUtil.scala:73)
> at
> org.apache.flink.table.planner.codegen.calls.BridgingSqlFunctionCallGen.generate(BridgingSqlFunctionCallGen.scala:81)
> at
> org.apache.flink.table.planner.codegen.ExprCodeGenerator.generateCallExpression(ExprCodeGenerator.scala:825)
> at
> org.apache.flink.table.planner.codegen.ExprCodeGenerator.visitCall(ExprCodeGenerator.scala:503)
> at
> org.apache.flink.table.planner.codegen.ExprCodeGenerator.visitCall(ExprCodeGenerator.scala:58)
> at
> org.apache.flink.table.planner.delegation.StreamPlanner.translateToPlan(StreamPlanner.scala:70)
> at
> org.apache.flink.table.planner.delegation.PlannerBase.translate(PlannerBase.scala:185)
> at
> org.apache.flink.table.api.bridge.java.internal.StreamTableEnvironmentImpl.toStreamInternal(StreamTableEnvironmentImpl.java:437)
> at
> org.apache.flink.table.api.bridge.java.internal.StreamTableEnvironmentImpl.toStreamInternal(StreamTableEnvironmentImpl.java:432)
> at
> org.apache.flink.table.api.bridge.java.internal.StreamTableEnvironmentImpl.toDataStream(StreamTableEnvironmentImpl.java:356)
> ...
>


Re: Flink 1.13.3, k8s HA - ResourceManager was revoked leadership

2021-12-15 Thread Yang Wang
Could you please check whether the JobManager had a long full GC, which
could cause the leadership to be lost?

BTW, increasing the timeout should help.

high-availability.kubernetes.leader-election.lease-duration: 60s
high-availability.kubernetes.leader-election.renew-deadline: 60s

Best,
Yang


Alexey Trenikhun  于2021年12月14日周二 05:36写道:

> Hi David,
>
> Setup is application mode, single job, single JM (Kubernetes job), k8s
> v1.18.2. I'm attaching JM log.
>
>
> Thanks,
> Alexey
> --
> *From:* David Morávek 
> *Sent:* Monday, December 13, 2021 12:59 AM
> *To:* Alexey Trenikhun 
> *Cc:* Flink User Mail List 
> *Subject:* Re: Flink 1.13.3, k8s HA - ResourceManager was revoked
> leadership
>
> Hi Alexey,
>
> please be aware that the json-based logs in the mail may not make it pass
> the spam filter (at least for gmail they did not) :(
>
> K8s based leader election is based on optimistic locking of the underlying
> config-map (~ periodically updating the lease annotation of the
> config-map). If JM fails to update this lease within a deadline, the
> leadership is lost.
>
> Can you please elaborate a bit about your setup and your k8s related Flink
> configurations? Also could you share the whole JM log by any chance (gist /
> email attachment)?
>
> Best,
> D.
>
> On Sat, Dec 11, 2021 at 6:47 AM Alexey Trenikhun  wrote:
>
> Hello,
> I'm running Flink 1.13.3 with Kubernetes HA. The JM periodically restarts
> after some time; in the log below, the job runs for ~8 minutes, then
> leadership is suddenly revoked, the job reaches a terminal state, and K8s
> restarts the failed JM:
>
> {"timestamp":"2021-12-11T04:51:53.697Z","message":"Agent Info (1/1)
> (47e6706e52ad96111a3d722cc56b5752) switched from INITIALIZING to
> RUNNING.","logger_name":"org.apache.flink.runtime.executiongraph.ExecutionGraph","thread_name":"flink-akka.actor.default-dispatcher-2","level":"INFO","level_value":2}
>
> {"timestamp":"2021-12-11T05:06:10.483Z","message":"ResourceManager
> akka.tcp://flink@10.244.104.239:6123/user/rpc/resourcemanager_0 was
> revoked leadership. Clearing fencing
> token.","logger_name":"org.apache.flink.runtime.resourcemanager.StandaloneResourceManager","thread_name":"flink-akka.actor.default-dispatcher-4","level":"INFO","level_value":2}
> {"timestamp":"2021-12-11T05:06:10.484Z","message":"Stopping
> DefaultLeaderRetrievalService.","logger_name":"org.apache.flink.runtime.leaderretrieval.DefaultLeaderRetrievalService","thread_name":"flink-akka.actor.default-dispatcher-4","level":"INFO","level_value":2}
> {"timestamp":"2021-12-11T05:06:10.484Z","message":"Stopping
> KubernetesLeaderRetrievalDriver{configMapName='gsp--jobmanager-leader'}.","logger_name":"org.apache.flink.kubernetes.highavailability.KubernetesLeaderRetrievalDriver","thread_name":"flink-akka.actor.default-dispatcher-4","level":"INFO","level_value":2}
> {"timestamp":"2021-12-11T05:06:10.485Z","message":"The watcher is
> closing.","logger_name":"org.apache.flink.kubernetes.kubeclient.resources.KubernetesConfigMapWatcher","thread_name":"flink-akka.actor.default-dispatcher-4","level":"INFO","level_value":2}
> {"timestamp":"2021-12-11T05:06:10.487Z","message":"Suspending the slot
> manager.","logger_name":"org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager","thread_name":"flink-akka.actor.default-dispatcher-4","level":"INFO","level_value":2}
> {"timestamp":"2021-12-11T05:06:10.555Z","message":"DefaultDispatcherRunner
> was revoked the leadership with leader id
> 138b4029-88eb-409f-98cc-e296fe400eb8. Stopping the
> DispatcherLeaderProcess.","logger_name":"org.apache.flink.runtime.dispatcher.runner.DefaultDispatcherRunner","thread_name":"KubernetesLeaderElector-ExecutorService-thread-1","level":"INFO","level_value":2}
> {"timestamp":"2021-12-11T05:06:10.556Z","message":"Stopping
> SessionDispatcherLeaderProcess.","logger_name":"org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess","thread_name":"KubernetesLeaderElector-ExecutorService-thread-1","level":"INFO","level_value":2}
> {"timestamp":"2021-12-11T05:06:10.557Z","message":"Stopping dispatcher
> akka.tcp://flink@10.244.104.239:6123/user/rpc/dispatcher_1
> .","logger_name":"org.apache.flink.runtime.dispatcher.StandaloneDispatcher","thread_name":"flink-akka.actor.default-dispatcher-4","level":"INFO","level_value":2}
> {"timestamp":"2021-12-11T05:06:10.558Z","message":"Stopping all currently
> running jobs of dispatcher akka.tcp://
> flink@10.244.104.239:6123/user/rpc/dispatcher_1
> .","logger_name":"org.apache.flink.runtime.dispatcher.StandaloneDispatcher","thread_name":"flink-akka.actor.default-dispatcher-4","level":"INFO","level_value":2}
> {"timestamp":"2021-12-11T05:06:10.560Z","message":"Stopping the JobMaster
> for job
> gim().","logger_name":"org.apache.flink.runtime.jobmaster.JobMaster","thread_name":"flink-akka.actor.default-dispatcher-2","level":"INFO","level_value"

Re: [DISCUSS] Drop Zookeeper 3.4

2021-12-11 Thread Yang Wang
After FLINK-10052 [1], which was merged in 1.14.0, a rolling upgrade of
ZooKeeper will not affect a running Flink application.

1. https://issues.apache.org/jira/browse/FLINK-10052


Best,
Yang

Chesnay Schepler  于2021年12月7日周二 下午4:37写道:

> Since this is only relevant for 1.15, if you intend to migrate to 1.15
> close to the release, then somewhere around February.
>
> The only resource I could find for migrating Zookeeper is this FAQ:
> https://cwiki.apache.org/confluence/display/ZOOKEEPER/Upgrade+FAQ
>
> On 07/12/2021 04:02, Dongwon Kim wrote:
>
> When should I prepare for upgrading ZK to 3.5 or newer?
> We're operating a Hadoop cluster w/ ZK 3.4.6 for running only Flink jobs.
> Just hope that the rolling update is not that painful - any advice on this?
>
> Best,
>
> Dongwon
>
> On Tue, Dec 7, 2021 at 3:22 AM Chesnay Schepler 
> wrote:
>
>> Current users of ZK 3.4 and below would need to upgrade their Zookeeper
>> installation that is used by Flink to 3.5+.
>>
>> Whether K8s users are affected depends on whether they use ZK or not. If
>> they do, see above, otherwise they are not affected at all.
>>
>> On 06/12/2021 18:49, Arvid Heise wrote:
>>
>> Could someone please help me understand the implications of the upgrade?
>>
>> As far as I understood this upgrade would only affect users that have a
>> zookeeper shared across multiple services, some of which require ZK 3.4-? A
>> workaround for those users would be to run two ZKs with different versions,
>> eventually deprecating old ZK, correct?
>>
>> If that is the only limitation, I'm +1 for the proposal since ZK 3.4 is
>> already EOL.
>>
>> How are K8s users affected?
>>
>> Best,
>>
>> Arvid
>>
>> On Mon, Dec 6, 2021 at 2:00 PM Chesnay Schepler 
>> wrote:
>>
>>> ping @users; any input on how this would affect you is highly
>>> appreciated.
>>>
>>> On 25/11/2021 22:39, Chesnay Schepler wrote:
>>> > I included the user ML in the thread.
>>> >
>>> > @users Are you still using Zookeeper 3.4? If so, were you planning to
>>> > upgrade Zookeeper in the near future?
>>> >
>>> > I'm not sure about ZK compatibility, but we'd also upgrade Curator to
>>> > 5.x, which doesn't support Zookeeper 3.4 anymore.
>>> >
>>> > On 25/11/2021 21:56, Till Rohrmann wrote:
>>> >> Should we ask on the user mailing list whether anybody is still using
>>> >> ZooKeeper 3.4 and thus needs support for this version or can a
>>> ZooKeeper
>>> >> 3.5/3.6 client talk to a ZooKeeper 3.4 cluster? I would expect that
>>> >> not a
>>> >> lot of users depend on it but just to make sure that we aren't
>>> >> annoying a
>>> >> lot of our users with this change. Apart from that +1 for removing it
>>> if
>>> >> not a lot of users depend on it.
>>> >>
>>> >> Cheers,
>>> >> Till
>>> >>
>>> >> On Wed, Nov 24, 2021 at 11:03 AM Matthias Pohl <
>>> matth...@ververica.com>
>>> >> wrote:
>>> >>
>>> >>> Thanks for starting this discussion, Chesnay. +1 from my side. It's
>>> >>> time to
>>> >>> move forward with the ZK support considering the EOL of 3.4 you
>>> already
>>> >>> mentioned. The benefits we gain from upgrading Curator to 5.x as a
>>> >>> consequence is another plus point. Just for reference on the
>>> >>> inconsistent
>>> >>> state issue you mentioned: FLINK-24543 [1].
>>> >>>
>>> >>> Matthias
>>> >>>
>>> >>> [1] https://issues.apache.org/jira/browse/FLINK-24543
>>> >>>
>>> >>> On Wed, Nov 24, 2021 at 10:19 AM Chesnay Schepler <
>>> ches...@apache.org>
>>> >>> wrote:
>>> >>>
>>>  Hello,
>>> 
>>>  I'd like to drop support for Zookeeper 3.4 in 1.15, upgrading the
>>>  default to 3.5 with an opt-in for 3.6.
>>> 
>>>  Supporting Zookeeper 3.4 (which is already EOL) prevents us from
>>>  upgrading Curator to 5.x, which would allow us to properly fix an
>>>  issue
>>>  with inconsistent state. It is also required to eventually support
>>> ZK
>>> >>> 3.6.
>>> >
>>> >
>>>
>>>
>>
>


Re: [DISCUSS] Drop Zookeeper 3.4

2021-12-06 Thread Yang Wang
FYI:

We (Alibaba) are widely using ZooKeeper 3.5.5 for all YARN and some K8s
highly-available Flink applications.


Best,
Yang

Chesnay Schepler  于2021年12月7日周二 上午2:22写道:

> Current users of ZK 3.4 and below would need to upgrade their Zookeeper
> installation that is used by Flink to 3.5+.
>
> Whether K8s users are affected depends on whether they use ZK or not. If
> they do, see above, otherwise they are not affected at all.
>
> On 06/12/2021 18:49, Arvid Heise wrote:
> > Could someone please help me understand the implications of the upgrade?
> >
> > As far as I understood this upgrade would only affect users that have
> > a zookeeper shared across multiple services, some of which require ZK
> > 3.4-? A workaround for those users would be to run two ZKs with
> > different versions, eventually deprecating old ZK, correct?
> >
> > If that is the only limitation, I'm +1 for the proposal since ZK 3.4
> > is already EOL.
> >
> > How are K8s users affected?
> >
> > Best,
> >
> > Arvid
> >
> > On Mon, Dec 6, 2021 at 2:00 PM Chesnay Schepler 
> > wrote:
> >
> > ping @users; any input on how this would affect you is highly
> > appreciated.
> >
> > On 25/11/2021 22:39, Chesnay Schepler wrote:
> > > I included the user ML in the thread.
> > >
> > > @users Are you still using Zookeeper 3.4? If so, were you
> > planning to
> > > upgrade Zookeeper in the near future?
> > >
> > > I'm not sure about ZK compatibility, but we'd also upgrade
> > Curator to
> > 5.x, which doesn't support Zookeeper 3.4 anymore.
> > >
> > > On 25/11/2021 21:56, Till Rohrmann wrote:
> > >> Should we ask on the user mailing list whether anybody is still
> > using
> > >> ZooKeeper 3.4 and thus needs support for this version or can a
> > ZooKeeper
> > >> 3.5/3.6 client talk to a ZooKeeper 3.4 cluster? I would expect
> > that
> > >> not a
> > >> lot of users depend on it but just to make sure that we aren't
> > >> annoying a
> > >> lot of our users with this change. Apart from that +1 for
> > removing it if
> > >> not a lot of users depend on it.
> > >>
> > >> Cheers,
> > >> Till
> > >>
> > >> On Wed, Nov 24, 2021 at 11:03 AM Matthias Pohl
> > 
> > >> wrote:
> > >>
> > >>> Thanks for starting this discussion, Chesnay. +1 from my side.
> > It's
> > >>> time to
> > >>> move forward with the ZK support considering the EOL of 3.4
> > you already
> > >>> mentioned. The benefits we gain from upgrading Curator to 5.x as a
> > >>> consequence are another plus point. Just for reference on the
> > >>> inconsistent
> > >>> state issue you mentioned: FLINK-24543 [1].
> > >>>
> > >>> Matthias
> > >>>
> > >>> [1] https://issues.apache.org/jira/browse/FLINK-24543
> > >>>
> > >>> On Wed, Nov 24, 2021 at 10:19 AM Chesnay Schepler
> > 
> > >>> wrote:
> > >>>
> >  Hello,
> > 
> >  I'd like to drop support for Zookeeper 3.4 in 1.15, upgrading
> the
> >  default to 3.5 with an opt-in for 3.6.
> > 
> >  Supporting Zookeeper 3.4 (which is already EOL) prevents us from
> >  upgrading Curator to 5.x, which would allow us to properly
> > fix an
> >  issue
> >  with inconsistent state. It is also required to eventually
> > support ZK
> > >>> 3.6.
> > >
> > >
> >
>


Re: Kubernetes HA: New jobs stuck in Initializing for a long time after a certain number of existing jobs are running

2021-11-22 Thread Yang Wang
I believe this issue[1] is related and has been fixed in 1.13.0 and 1.12.3.


[1]. https://issues.apache.org/jira/browse/FLINK-22006


Best,
Yang

Matthias Pohl  于2021年11月22日周一 下午9:19写道:

> Hi Joey,
> that looks like a cluster configuration issue. The 192.168.100.79:6123 is
> not accessible from the JobManager pod (see line 1224f in the provided JM
> logs):
>2021-11-19 04:06:45,049 WARN
>  akka.remote.transport.netty.NettyTransport   [] - Remote
> connection to [null] failed with java.net.NoRouteToHostException: No route
> to host
>2021-11-19 04:06:45,067 WARN  akka.remote.ReliableDeliverySupervisor
> [] - Association with remote system [akka.tcp://
> flink@192.168.100.79:6123] has failed, address is now gated for [50] ms.
> Reason: [Association failed with [akka.tcp://flink@192.168.100.79:6123]]
> Caused by: [java.net.NoRouteToHostException: No route to host]
>
> The TaskManagers are able to communicate with the JobManager pod and are
> properly registered. The JobMaster, instead, tries to connect to the
> ResourceManager (both running on the JobManager pod) but fails.
> SlotRequests are triggered but never actually fulfilled. They are put in
> the queue for pending SlotRequests. The timeout kicks in after trying to
> reach the ResourceManager for some time. That's
> the NoResourceAvailableException you are experiencing.
>
> Matthias
>
> On Fri, Nov 19, 2021 at 7:02 AM Joey L  wrote:
>
>> Hi,
>>
>> I've set up a Flink 1.12.5 session cluster running on K8s with HA, and
>> came across an issue with creating new jobs once the cluster has reached 20
>> existing jobs. The first 20 jobs always gets initialized and start running
>> within 5 - 10 seconds.
>>
>> Any new job submission is stuck in Initializing state for a long time (10
>> - 30 mins), and eventually it goes to Running but the tasks are stuck in
>> Scheduled state despite there being free task slots available. The
>> Scheduled jobs will eventually start running, but the delay could be up to
>> an hour. Interestingly, this issue doesn't occur once I remove the HA
>> config.
>>
>> Each task manager is configured to have 4 task slots, and I can see via
>> the Flink UI that the task managers are registered correctly. (Refer to
>> attached screenshot).
>>
>> [image: Screen Shot 2021-11-19 at 3.08.11 pm.png]
>>
>> In the logs, I can see that jobs stuck in Scheduled throw this exception
>> after 5 minutes (even though there are slots available):
>>
>> ```
>> java.util.concurrent.CompletionException:
>> org.apache.flink.runtime.jobmanager.scheduler.NoResourceAvailableException:
>> Slot request bulk is not fulfillable! Could not allocate the required slot
>> within slot request timeout
>> ```
>>
>> I've also attached the full job manager logs below.
>>
>> Any help/guidance would be appreciated.
>>
>> Thanks,
>> Joey
>>
>


Re: Could not retrieve submitted JobGraph from state handle

2021-11-16 Thread Yang Wang
Hi Alexey,

If you deleted the HA data stored in S3 manually, or maybe configured
an automatic clean-up rule, then it could happen that
the ConfigMap has the pointers while the concrete data in S3 is missing.


> How to clean the state handle store?
Since the handle is stored in the ConfigMap, I think you could use the
following command to do the cleanup manually.
An easy way is to use a different cluster id.

kubectl delete cm
--selector='app=,configmap-type=high-availability'

Best,
Yang


Alexander Preuß  于2021年11月16日周二 下午10:30写道:

> Hi Alexey,
>
> Are you maybe reusing the cluster-id?
>
> Also, could you provide some more information on your setup and a more
> complete stacktrace?
> The ConfigMap contains pointers to the actual files on Azure.
>
> Best,
> Alexander
>
> On Tue, Nov 16, 2021 at 6:14 AM Alexey Trenikhun  wrote:
>
>> Hello,
>> We are using Kubernetes HA and Azure Blob storage and in rare cases I see
>> following error:
>>
>> Could not retrieve submitted JobGraph from state handle under
>> jobGraph-. This indicates that the
>> retrieved state handle is broken. Try cleaning the state handle store.
>>
>> Question is, how exactly can I clean "state handle store"? Delete
>> fsp-dispatcher-leader Config Map? Or some files (which one) in Azure
>> Blob storage?
>>
>> Thanks,
>> Alexey
>>
>


Re: Fabric8 does not support EC keys

2021-11-16 Thread Yang Wang
If you are using the following command to submit the job, I am afraid the
dynamic properties could not take effect on the client side.

/flink-1.14.0/bin/flink run-application ... -D
kubernetes.certs.client.key.algo=EC

Could you please export the environment
KUBERNETES_CLIENT_KEY_ALGO_SYSTEM_PROPERTY and have a try?

Best,
Yang
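
As a hedged sketch of the workaround suggested above: export the fabric8 algo override as an environment variable so it also takes effect on the client side, then submit. The Flink path, target, and job jar below are placeholders, not taken from the original report; the submit command is only printed here so the sketch can be inspected without a cluster.

```shell
# Override the fabric8 client key algorithm on the client side (assumption:
# the env var is read before any -D dynamic property is applied).
export KUBERNETES_CLIENT_KEY_ALGO_SYSTEM_PROPERTY=EC

# Build the submit command as a string; drop the printf wrapper to run it.
build_submit_cmd() {
  printf '%s\n' "/flink-1.14.0/bin/flink run-application \
--target kubernetes-application \
-Dkubernetes.certs.client.key.algo=EC \
local:///opt/flink/usrlib/my-job.jar"
}
build_submit_cmd
```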

Nicolás Ferrario  于2021年11月16日周二 上午1:58写道:

> Hi Yang, I tried that and *-Dkubernetes.certs.client.key.algo=EC* (
> https://github.com/fabric8io/kubernetes-client/blob/278ca235dc4ab5653e82dbe2960004ab62f021e4/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/Config.java#L79)
> but none seems to work :(
>
> I'm launching flink with this: /flink-1.14.0/bin/flink run-application ...
>
> Thanks!
>
> On Mon, Nov 15, 2021 at 4:08 AM Yang Wang  wrote:
>
>> It seems that "EC"[1] is already supported in Kubernetes client v5.5.0.
>> However, the default value is "RSA". Could you please export the
>> following environment first and have a try again?
>>
>> export KUBERNETES_CLIENT_KEY_ALGO_SYSTEM_PROPERTY=EC
>>
>> [1].
>> https://github.com/fabric8io/kubernetes-client/blob/v5.5.0/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/internal/CertUtils.java#L136
>>
>> Best,
>> Yang
>>
>> Nicolás Ferrario  于2021年11月12日周五 下午11:05写道:
>>
>>> Hi all, I am trying to run Flink on a K3s cluster and I'm getting this
>>> exception:
>>>
>>> io.fabric8.kubernetes.client.KubernetesClientException: An error has
>>>> occurred.
>>>>
>>>> at
>>>> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
>>>>
>>>> at
>>>> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53)
>>>>
>>>> at
>>>> io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:234)
>>>>
>>>> at
>>>> io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:66)
>>>>
>>>> at
>>>> io.fabric8.kubernetes.client.BaseClient.(BaseClient.java:51)
>>>>
>>>> at
>>>> io.fabric8.kubernetes.client.BaseKubernetesClient.(BaseKubernetesClient.java:145)
>>>>
>>>> at
>>>> io.fabric8.kubernetes.client.DefaultKubernetesClient.(DefaultKubernetesClient.java:40)
>>>>
>>>> at
>>>> org.apache.flink.kubernetes.kubeclient.FlinkKubeClientFactory.fromConfiguration(FlinkKubeClientFactory.java:95)
>>>>
>>>> at
>>>> org.apache.flink.kubernetes.KubernetesClusterClientFactory.createClusterDescriptor(KubernetesClusterClientFactory.java:61)
>>>>
>>>> at
>>>> org.apache.flink.kubernetes.KubernetesClusterClientFactory.createClusterDescriptor(KubernetesClusterClientFactory.java:39)
>>>>
>>>> at
>>>> org.apache.flink.client.deployment.application.cli.ApplicationClusterDeployer.run(ApplicationClusterDeployer.java:63)
>>>>
>>>> at
>>>> org.apache.flink.client.cli.CliFrontend.runApplication(CliFrontend.java:213)
>>>>
>>>> at
>>>> org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1057)
>>>>
>>>> at
>>>> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
>>>>
>>>> at
>>>> org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
>>>>
>>>> at
>>>> org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
>>>>
>>>> Caused by: java.io.IOException: Invalid DER: object is not integer
>>>>
>>>> at
>>>> io.fabric8.kubernetes.client.internal.PKCS1Util$Asn1Object.getInteger(PKCS1Util.java:125)
>>>>
>>>> at
>>>> io.fabric8.kubernetes.client.internal.PKCS1Util.next(PKCS1Util.java:55)
>>>>
>>>> at
>>>> io.fabric8.kubernetes.client.internal.PKCS1Util.decodePKCS1(PKCS1Util.java:46)
>>>>
>>>> at
>>>> io.fabric8.kubernetes.client.internal.CertUtils.handleOtherKeys(CertUtils.java:179)
>>>>
>>>> at
>>>> io.fabric8.kubernetes.client.internal.CertUtils.loadKey(CertUtils.java:139)
>>>>
>>>> at
>>>> io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:115)
>>>>
>>>> at
>>>> io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:251)
>>>>
>>>> at
>>>> io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:128)
>>>>
>>>> at
>>>> io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:122)
>>>>
>>>> at
>>>> io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:129)
>>>>
>>>
>>> This seems to be an old issue with Fabric8 Kubernetes Client that I
>>> guess should be fixed in newer releases. You can find more information here
>>> https://issues.jenkins.io/browse/JENKINS-64322.
>>>
>>> My K3s cluster is running on version 1.21.6+k3s1, and Flink 1.14.0
>>>
>>> Does anyone know a workaround that does not involve replacing the
>>> certificate with a token?
>>>
>>> Thanks
>>>
>>


Re: Fabric8 does not support EC keys

2021-11-14 Thread Yang Wang
It seems that "EC"[1] is already supported in Kubernetes client v5.5.0.
However, the default value is "RSA". Could you please export the following
environment first and have a try again?

export KUBERNETES_CLIENT_KEY_ALGO_SYSTEM_PROPERTY=EC

[1].
https://github.com/fabric8io/kubernetes-client/blob/v5.5.0/kubernetes-client/src/main/java/io/fabric8/kubernetes/client/internal/CertUtils.java#L136

Best,
Yang

Nicolás Ferrario  于2021年11月12日周五 下午11:05写道:

> Hi all, I am trying to run Flink on a K3s cluster and I'm getting this
> exception:
>
> io.fabric8.kubernetes.client.KubernetesClientException: An error has
>> occurred.
>>
>> at
>> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:64)
>>
>> at
>> io.fabric8.kubernetes.client.KubernetesClientException.launderThrowable(KubernetesClientException.java:53)
>>
>> at
>> io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:234)
>>
>> at
>> io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:66)
>>
>> at
>> io.fabric8.kubernetes.client.BaseClient.(BaseClient.java:51)
>>
>> at
>> io.fabric8.kubernetes.client.BaseKubernetesClient.(BaseKubernetesClient.java:145)
>>
>> at
>> io.fabric8.kubernetes.client.DefaultKubernetesClient.(DefaultKubernetesClient.java:40)
>>
>> at
>> org.apache.flink.kubernetes.kubeclient.FlinkKubeClientFactory.fromConfiguration(FlinkKubeClientFactory.java:95)
>>
>> at
>> org.apache.flink.kubernetes.KubernetesClusterClientFactory.createClusterDescriptor(KubernetesClusterClientFactory.java:61)
>>
>> at
>> org.apache.flink.kubernetes.KubernetesClusterClientFactory.createClusterDescriptor(KubernetesClusterClientFactory.java:39)
>>
>> at
>> org.apache.flink.client.deployment.application.cli.ApplicationClusterDeployer.run(ApplicationClusterDeployer.java:63)
>>
>> at
>> org.apache.flink.client.cli.CliFrontend.runApplication(CliFrontend.java:213)
>>
>> at
>> org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1057)
>>
>> at
>> org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132)
>>
>> at
>> org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28)
>>
>> at
>> org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132)
>>
>> Caused by: java.io.IOException: Invalid DER: object is not integer
>>
>> at
>> io.fabric8.kubernetes.client.internal.PKCS1Util$Asn1Object.getInteger(PKCS1Util.java:125)
>>
>> at
>> io.fabric8.kubernetes.client.internal.PKCS1Util.next(PKCS1Util.java:55)
>>
>> at
>> io.fabric8.kubernetes.client.internal.PKCS1Util.decodePKCS1(PKCS1Util.java:46)
>>
>> at
>> io.fabric8.kubernetes.client.internal.CertUtils.handleOtherKeys(CertUtils.java:179)
>>
>> at
>> io.fabric8.kubernetes.client.internal.CertUtils.loadKey(CertUtils.java:139)
>>
>> at
>> io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:115)
>>
>> at
>> io.fabric8.kubernetes.client.internal.CertUtils.createKeyStore(CertUtils.java:251)
>>
>> at
>> io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:128)
>>
>> at
>> io.fabric8.kubernetes.client.internal.SSLUtils.keyManagers(SSLUtils.java:122)
>>
>> at
>> io.fabric8.kubernetes.client.utils.HttpClientUtils.createHttpClient(HttpClientUtils.java:129)
>>
>
> This seems to be an old issue with Fabric8 Kubernetes Client that I guess
> should be fixed in newer releases. You can find more information here
> https://issues.jenkins.io/browse/JENKINS-64322.
>
> My K3s cluster is running on version 1.21.6+k3s1, and Flink 1.14.0
>
> Does anyone know a workaround that does not involve replacing the
> certificate with a token?
>
> Thanks
>


Re: Getting Errors in Standby Jobmanager pod during installation & after restart on k8s

2021-10-27 Thread Yang Wang
Roman's answer is on the point.

The exception is really confusing and it comes from fabric8
kubernetes-client. We might try to create a PR for the upstream project :)

Best,
Yang

Roman Khachatryan  于2021年10月25日周一 下午10:00写道:

> Hi Amit,
>
> AFAIK, these exceptions are normal in HA mode as different JM
> instances are trying to acquire the lease.
>
> Regards,
> Roman
>
> On Mon, Oct 25, 2021 at 1:45 PM Amit Bhatia 
> wrote:
> >
> > Hi,
> >
> > We have deployed two JobManagers in HA mode on Kubernetes using the K8s
> ConfigMap solution with a Deployment controller. During installation and
> after a restart we are getting the below errors in the standby JobManager.
> >
> > 2021-10-25 11:17:46,397 ERROR
> io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector []
> POD_NAME: eric-bss-em-sm-haflink-jobmanager-586d44dbbb-9v499 - Exception
> occurred while acquiring lock 'ConfigMapLock: gautam -
> eric-bss-em-sm-haflink-resourcemanager-leader
> (ebfdc2b3-1097-41fc-a377-b1d0a7916690)'
> >
> io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LockException:
> Unable to create ConfigMapLock
> > at
> io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.ConfigMapLock.create(ConfigMapLock.java:88)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> > at
> io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.tryAcquireOrRenew(LeaderElector.java:138)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> > at
> io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$acquire$0(LeaderElector.java:82)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> > at
> io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector.lambda$loop$3(LeaderElector.java:198)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> > at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> [?:1.8.0_281]
> > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> [?:1.8.0_281]
> > at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> [?:1.8.0_281]
> > at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> [?:1.8.0_281]
> > at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [?:1.8.0_281]
> > at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [?:1.8.0_281]
> > at java.lang.Thread.run(Thread.java:748) [?:1.8.0_281]
> > Caused by: io.fabric8.kubernetes.client.KubernetesClientException:
> Failure executing: POST at:
> https://10.96.0.1/api/v1/namespaces/gautam/configmaps. Message:
> configmaps "eric-bss-em-sm-haflink-resourcemanager-leader" already exists.
> Received status: Status(apiVersion=v1, code=409,
> details=StatusDetails(causes=[], group=null, kind=configmaps,
> name=eric-bss-em-sm-haflink-resourcemanager-leader, retryAfterSeconds=null,
> uid=null, additionalProperties={}), kind=Status, message=configmaps
> "eric-bss-em-sm-haflink-resourcemanager-leader" already exists,
> metadata=ListMeta(_continue=null, remainingItemCount=null,
> resourceVersion=null, selfLink=null, additionalProperties={}),
> reason=AlreadyExists, status=Failure, additionalProperties={}).
> > at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.requestFailure(OperationSupport.java:568)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> > at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.assertResponseCode(OperationSupport.java:507)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> > at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:471)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> > at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleResponse(OperationSupport.java:430)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> > at
> io.fabric8.kubernetes.client.dsl.base.OperationSupport.handleCreate(OperationSupport.java:251)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> > at
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.handleCreate(BaseOperation.java:815)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> > at
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.create(BaseOperation.java:333)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> > at
> io.fabric8.kubernetes.client.dsl.base.BaseOperation.lambda$createNew$0(BaseOperation.java:346)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> > at
> io.fabric8.kubernetes.api.model.DoneableConfigMap.done(DoneableConfigMap.java:26)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> > at
> io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.ConfigMapLock.create(ConfigMapLock.java:86)
> ~[flink-dist_2.11-1.13.2.jar:1.13.2]
> > ... 10 more
> >
> >
> >
> >
> > SLF4J: Class path contains multiple SLF4J bindings.
> > SLF4J: Found binding in
> [jar:file:/opt/flink/lib/log4j-slf4j-impl-2.12.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > SLF4J: Found binding in
> [jar:file:/opt/flink/lib/log4j-slf4j-impl-2.13.3.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > SL

Re: Not cleanup Kubernetes Configmaps after execution success

2021-10-27 Thread Yang Wang
Hi,

I think Roman is right. It seems that the JobManager is relaunched again by
K8s after Flink has already deregistered the application (i.e., deleted the
JobManager K8s deployment).

One possible reason might be that the kubelet notices too late that the
JobManager deployment is deleted,
so it relaunches the JobManager pod when it terminates with exit code 0.


Best,
Yang


Roman Khachatryan  于2021年10月26日周二 下午6:17写道:

> Thanks for sharing this,
> The sequence of events in the log seems strange to me:
>
> 2021-10-17 03:05:55,801 INFO
> org.apache.flink.runtime.jobmaster.JobMaster [] -
> Close ResourceManager connection c1092812cfb2853a5576ffd78e346189:
> Stopping JobMaster for job 'rt-match_12.4.5_8d48b21a'
> ().
> 2021-10-17 03:05:59,382 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint[] -
> Starting KubernetesApplicationClusterEntrypoint (Version: 1.14.0,
> Scala: 2.12, Rev:460b386, Date:2021-09-22T08:39:40+02:00)
> 2021-10-17 03:06:00,251 INFO
> org.apache.flink.runtime.entrypoint.ClusterEntrypoint[] -
> RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
> 2021-10-17 03:06:04,355 ERROR
> io.fabric8.kubernetes.client.extended.leaderelection.LeaderElector []
> - Exception occurred while acquiring lock 'ConfigMapLock: flink-ns -
> match-70958037-f414-4925-9d60-19e90d12abc0-restserver-leader
> (ef5c2463-2d66-4dce-a023-4b8a50d7acff)'
>
> io.fabric8.kubernetes.client.extended.leaderelection.resourcelock.LockException:
> Unable to create ConfigMapLock
> Caused by: io.fabric8.kubernetes.client.KubernetesClientException:
> Operation: [create]  for kind: [ConfigMap]  with name:
> [match-70958037-f414-4925-9d60-19e90d12abc0-restserver-leader]  in
> namespace: [flink-ns]  failed.
> Caused by: java.io.InterruptedIOException
>
> It looks like KubernetesApplicationClusterEntrypoint is re-started in
> the middle of shutdown and, as a result, the resources it (re)creates
> aren't cleaned up.
>
> Could you please also share Kubernetes logs and resource definitions
> to validate the above assumption?
>
> Regards,
> Roman
>
> On Mon, Oct 25, 2021 at 6:15 AM Hua Wei Chen 
> wrote:
> >
> > Hi all,
> >
> > We have Flink jobs run on batch mode and get the job status via
> JobHandler.onJobExecuted()[1].
> >
> > Based on the thread[2], we expected the ConfigMaps to be cleaned up
> after successful execution.
> >
> > But we found that some ConfigMaps were not cleaned up after the job
> succeeded, although the ConfigMaps' contents and labels were removed.
> >
> > Here is one of the Configmaps.
> >
> > ```
> > apiVersion: v1
> > kind: ConfigMap
> > metadata:
> >   name: match-6370b6ab-de17-4c93-940e-0ce06d05a7b8-resourcemanager-leader
> >   namespace: app-flink
> >   selfLink: >-
> >
>  
> /api/v1/namespaces/app-flink/configmaps/match-6370b6ab-de17-4c93-940e-0ce06d05a7b8-resourcemanager-leader
> >   uid: 80c79c87-d6e2-4641-b13f-338c3d3154b0
> >   resourceVersion: '578806788'
> >   creationTimestamp: '2021-10-21T17:06:48Z'
> >   annotations:
> > control-plane.alpha.kubernetes.io/leader: >-
> >
>  
> {"holderIdentity":"3da40a4a-0346-49e5-8d18-b04a68239bf3","leaseDuration":15.0,"acquireTime":"2021-10-21T17:06:48.092264Z","renewTime":"2021-10-21T17:06:48.092264Z","leaderTransitions":0}
> >   managedFields:
> > - manager: okhttp
> >   operation: Update
> >   apiVersion: v1
> >   time: '2021-10-21T17:06:48Z'
> >   fieldsType: FieldsV1
> >   fieldsV1:
> > 'f:metadata':
> >   'f:annotations':
> > .: {}
> > 'f:control-plane.alpha.kubernetes.io/leader': {}
> > data: {}
> > ```
> >
> >
> > Our Flink apps run on ver. 1.14.0.
> > Thanks!
> >
> > BR,
> > Oscar
> >
> >
> > Reference:
> > [1] JobListener (Flink : 1.15-SNAPSHOT API) (apache.org)
> > [2]
> https://lists.apache.org/list.html?user@flink.apache.org:lte=1M:High%20availability%20data%20clean%20up%20
> >
>


Re: Why we need again kubernetes flink operator?

2021-10-25 Thread Yang Wang
Hi Bhaskar,

IIUC, flink-k8s-operator and Flink native K8s mode are orthogonal. They do
not mean to replace other one.

The flink-k8s-operator is more like a Flink lifecycle management tool. It
could make deploying a Flink application on K8s easier:
we just need to apply a CR yaml, which is more friendly to K8s users.

The flink-k8s-operator could integrate with standalone mode[1], but also
native K8s mode[2].

[1]. https://github.com/GoogleCloudPlatform/flink-on-k8s-operator
[2]. https://github.com/wangyang0918/flink-native-k8s-operator

Best,
Yang

Vijay Bhaskar  于2021年10月22日周五 下午1:59写道:

> Understood that we have kubernetes HA configuration where we specify s3://
> or HDFS:/// persistent storage, as mentioned here:
> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/
>
> Regards
> Bhaskar
>
> On Fri, Oct 22, 2021 at 10:47 AM Vijay Bhaskar 
> wrote:
>
>> All,
> >> I have used Flink up to last year, on version 1.9. At that time we built
> >> our own cluster using ZooKeeper and monitored jobs ourselves. Now I am
> >> revisiting different applications and found that the community has come up
> >> with this native Kubernetes deployment:
>>
>> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/resource-providers/native_kubernetes/#application-mode
>>
> >> Now I am getting a question:
>>
> >> Why do we need anything else? Can we directly deploy in native
> >> Kubernetes in Application Mode?
>>
> >> The only thing is, if I add a little monitoring from outside, would that suffice?
>>
>> I believe this has eliminated the need for
> >> 1. A Kubernetes operator for Flink (provided we add monitoring on top
> >> using the Flink logs)
> >> 2. ZooKeeper usage and cluster mode.
>>
> >> Where is the state stored now? Is it stored in etcd in Kubernetes?
>>
>> Regards
>> Bhaskar
>>
>>
>>
>>
>>


Re: High availability data clean up

2021-10-25 Thread Yang Wang
Hi Weiqing,

> Why does Flink not set the owner reference of HA related ConfigMaps to
JobManager deployment? It is easier to clean up for users.
The major reason is that simply deleting the HA-related ConfigMaps would
leak the concrete HA data stored in the DFS.

> How to delete the HA ConfigMap from external tools(e.g. kubectl, K8s
operator)?
All the HA ConfigMaps have specific labels
"app=,configmap-type=high-availability", so it is easy to clean
them up manually.

kubectl delete cm
--selector='app=,configmap-type=high-availability'

Best,
Yang
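
A minimal sketch of the manual cleanup described above: build the label selector for a given cluster id and feed it to kubectl. The cluster id "my-flink-cluster" is a placeholder (Flink fills in the real id in the "app=" label); the kubectl command is only printed here, since actually deleting requires cluster access.

```shell
# Build the HA ConfigMap label selector for a given Flink cluster id.
ha_selector() {
  printf 'app=%s,configmap-type=high-availability\n' "$1"
}

CLUSTER_ID="my-flink-cluster"  # placeholder cluster id
# Print the command; remove the echo to actually delete the ConfigMaps.
echo "kubectl delete cm --selector=$(ha_selector "$CLUSTER_ID")"
```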

Weiqing Yang  于2021年10月23日周六 上午9:10写道:

> Thanks for the replies, Yangze and Vijay!
>
> We are using standalone Flink on K8s (we created a K8s operator to manage
> the life cycle of the flink clusters (session mode)). Seems there is no way
> for the operator to know when these HA related configMaps are created (if
> the operator somehow can know when these HA ConfigMaps are created, then we
> can add ownerRef for them). Please let me know if I missed anything and if
> you have any recommended way to clean these HA related data/configMaps when
> deleting a flink cluster.
>
> Best,
> Wq
>
> On Thu, Oct 21, 2021 at 11:05 PM Vijay Bhaskar 
> wrote:
>
>> In HA mode the configMap will be retained  after deletion of the
>> deployment:
>> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/ha/kubernetes_ha/
>>   ( Refer High availability data clean up)
>>
>>
>>
>> On Fri, Oct 22, 2021 at 8:13 AM Yangze Guo  wrote:
>>
>>> For application mode, when the job finishes normally or is canceled,
>>> the ConfigMaps will be cleaned up.
>>> For session mode, when you stop the session through [1], the
>>> ConfigMaps will be cleaned up.
>>>
>>> [1]
>>> https://ci.apache.org/projects/flink/flink-docs-release-1.13/docs/deployment/resource-providers/native_kubernetes/#stop-a-running-session-cluster
>>>
>>> Best,
>>> Yangze Guo
>>>
>>> On Thu, Oct 21, 2021 at 6:37 AM Weiqing Yang 
>>> wrote:
>>> >
>>> >
>>> > Hi,
>>> >
>>> > Per the doc, `kubernetes.jobmanager.owner.reference` can be used to
>>> set up the owners of the job manager Deployment. If the owner is deleted,
>>> then the job manager and its related pods will be deleted. How about the HA
>>> related ConfigMaps? Are they also deleted when deleting the owner of the
>>> job manager Deployment? Per the wiki here, the HA data will be retained
>>> when deleting jobmanager Deployment. If we want to delete these HA related
>>> configMaps as well when deleting the job manager, what is the suggested way
>>> to do that?
>>> >
>>> > Thanks,
>>> > weiqing
>>> >
>>>
>>


Re: Not cleanup Kubernetes Configmaps after execution success

2021-10-25 Thread Yang Wang
Hi Hua Wei,

I think you need to share the JobManager logs so that we could check
whether Flink had tried to clean up the HA related ConfigMaps.

Using the "kubectl logs  -f >/tmp/log.jm" could help
with dumping the logs.

Best,
Yang

Roman Khachatryan  于2021年10月25日周一 下午5:35写道:

> Hi Hua,
>
> It looks like the ConfigMap misses HA labels for some reason.
>
> Could you confirm that you are running in HA mode?
> Which deployment mode are you using? [1]
>
> I'm also pulling in Yang Wang who might know this area better.
>
> [1]
>
> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/standalone/kubernetes/#deployment-modes
>
> Regards,
> Roman
>
> On Mon, Oct 25, 2021 at 6:15 AM Hua Wei Chen 
> wrote:
> >
> > Hi all,
> >
> > We have Flink jobs run on batch mode and get the job status via
> JobHandler.onJobExecuted()[1].
> >
> > Based on the thread[2], we expected the ConfigMaps to be cleaned up
> after successful execution.
> >
> > But we found that some ConfigMaps were not cleaned up after the job
> succeeded, although the ConfigMaps' contents and labels were removed.
> >
> > Here is one of the Configmaps.
> >
> > ```
> > apiVersion: v1
> > kind: ConfigMap
> > metadata:
> >   name: match-6370b6ab-de17-4c93-940e-0ce06d05a7b8-resourcemanager-leader
> >   namespace: app-flink
> >   selfLink: >-
> >
>  
> /api/v1/namespaces/app-flink/configmaps/match-6370b6ab-de17-4c93-940e-0ce06d05a7b8-resourcemanager-leader
> >   uid: 80c79c87-d6e2-4641-b13f-338c3d3154b0
> >   resourceVersion: '578806788'
> >   creationTimestamp: '2021-10-21T17:06:48Z'
> >   annotations:
> > control-plane.alpha.kubernetes.io/leader: >-
> >
>  
> {"holderIdentity":"3da40a4a-0346-49e5-8d18-b04a68239bf3","leaseDuration":15.0,"acquireTime":"2021-10-21T17:06:48.092264Z","renewTime":"2021-10-21T17:06:48.092264Z","leaderTransitions":0}
> >   managedFields:
> > - manager: okhttp
> >   operation: Update
> >   apiVersion: v1
> >   time: '2021-10-21T17:06:48Z'
> >   fieldsType: FieldsV1
> >   fieldsV1:
> > 'f:metadata':
> >   'f:annotations':
> > .: {}
> > 'f:control-plane.alpha.kubernetes.io/leader': {}
> > data: {}
> > ```
> >
> >
> > Our Flink apps run on ver. 1.14.0.
> > Thanks!
> >
> > BR,
> > Oscar
> >
> >
> > Reference:
> > [1] JobListener (Flink : 1.15-SNAPSHOT API) (apache.org)
> > [2]
> https://lists.apache.org/list.html?user@flink.apache.org:lte=1M:High%20availability%20data%20clean%20up%20
> >
>


Re: Kubernetes HA - Reusing storage dir for different clusters

2021-10-08 Thread Yang Wang
Yes, if you delete the deployment directly, all the HA data will be
retained. And you could recover the Flink job by creating a new deployment.

You could also find this description in the documentation[1].


[1].
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/ha/kubernetes_ha/#high-availability-data-clean-up

Best,
Yang
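
As a sketch, the relevant flink-conf.yaml entries for such a batch application look roughly like the following; the cluster id and bucket are placeholders, and using a fresh cluster id per run avoids picking up stale HA data left behind by an earlier manually-deleted deployment:

```yaml
kubernetes.cluster-id: my-batch-job-run-42          # fresh id per run (placeholder)
high-availability: kubernetes
high-availability.storageDir: s3://my-bucket/flink/ha   # placeholder bucket
```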

Alexis Sarda-Espinosa  于2021年10月8日周五
下午10:47写道:

> Hi Yang,
>
> thanks for the confirmation. If I manually stop the job by deleting the
> Kubernetes deployment before it completes, I suppose the files will not be
> cleaned up, right? That's a somewhat non-standard scenario, so I wouldn't
> expect Flink to clean up, I just want to be sure.
>
> Regards,
> Alexis.
>
> ------
> *From:* Yang Wang 
> *Sent:* Friday, October 8, 2021 5:24 AM
> *To:* Alexis Sarda-Espinosa 
> *Cc:* Flink ML 
> *Subject:* Re: Kubernetes HA - Reusing storage dir for different clusters
>
> When the Flink job reached to global terminal state(FAILED, CANCELED,
> FINISHED), all the HA related data(including pointers in ConfigMap and
> concrete data in DFS) will be cleaned up automatically.
>
> Best,
> Yang
>
> Alexis Sarda-Espinosa 
> 于2021年10月4日周一 下午3:59写道:
>
> Hello,
>
>
>
> If I deploy a Flink-Kubernetes application with HA, I need to set
> high-availability.storageDir. If my application is a batch job that may run
> multiple times with the same configuration, do I need to manually clean up
> the storage dir between each execution?
>
>
>
> Regards,
>
> Alexis.
>
>
>
>


Re: Start Flink cluster, k8s pod behavior

2021-10-08 Thread Yang Wang
Did you use the "jobmanager.sh start-foreground" in your own
"run-job-manager.sh", just like what Flink has done
in the docker-entrypoint.sh[1]?

I strongly suggest starting the Flink session cluster with the official
YAMLs[2].

[1].
https://github.com/apache/flink-docker/blob/master/1.13/scala_2.11-java11-debian/docker-entrypoint.sh#L114
[2].
https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/resource-providers/standalone/kubernetes/#starting-a-kubernetes-cluster-session-mode

Best,
Yang
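
The distinction matters because a daemonizing start script exits as soon as it
has forked the JVM, while a foreground start blocks for the JVM's lifetime. A
small illustration using `sleep` as a stand-in for the JobManager process (the
Flink path in the comment is only indicative):

```shell
# Daemonizing entrypoint: the child is backgrounded, the script exits
# with code 0 right away, and Kubernetes marks the container Completed.
sh -c 'sleep 5 >/dev/null 2>&1 & echo "daemon started"'

# Foreground entrypoint: exec replaces the shell with the long-running
# process, so the container keeps running until the process dies.
# This is what "jobmanager.sh start-foreground" gives you, e.g.:
#   exec /opt/flink/bin/jobmanager.sh start-foreground "$@"
```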

Qihua Yang  wrote on Friday, October 1, 2021 at 2:59 AM:

> Looks like after the *flink-daemon.sh* script completes, it returns exit 0.
> Kubernetes regards it as done. Is that expected?
>
> Thanks,
> Qihua
>
> On Thu, Sep 30, 2021 at 11:11 AM Qihua Yang  wrote:
>
>> Thank you for your reply.
>> From the log, the exit code is 0, and the reason is Completed.
>> Looks like the cluster is fine. But why does Kubernetes restart the pod? As
>> you said, from the perspective of Kubernetes everything is done. Then how do
>> we prevent the restart?
>> It didn't even give us a chance to upload and run a jar.
>>
>> Ports: 8081/TCP, 6123/TCP, 6124/TCP, 6125/TCP
>> Host Ports:0/TCP, 0/TCP, 0/TCP, 0/TCP
>> Command:
>>   /opt/flink/bin/entrypoint.sh
>> Args:
>>   /opt/flink/bin/run-job-manager.sh
>> State:  Waiting
>>   Reason:   CrashLoopBackOff
>> Last State: Terminated
>>   Reason:   Completed
>>   Exit Code:0
>>   Started:  Wed, 29 Sep 2021 20:12:30 -0700
>>   Finished: Wed, 29 Sep 2021 20:12:45 -0700
>> Ready:  False
>> Restart Count:  131
>>
>> Thanks,
>> Qihua
>>
>> On Thu, Sep 30, 2021 at 1:00 AM Chesnay Schepler 
>> wrote:
>>
>>> Is the run-job-manager.sh script actually blocking?
>>> Since you (apparently) use that as an entrypoint, if that script exits
>>> after starting the JM, then from the perspective of Kubernetes everything
>>> is done.
>>>
>>> On 30/09/2021 08:59, Matthias Pohl wrote:
>>>
>>> Hi Qihua,
>>> I guess, looking into kubectl describe and the JobManager logs would
>>> help in understanding what's going on.
>>>
>>> Best,
>>> Matthias
>>>
>>> On Wed, Sep 29, 2021 at 8:37 PM Qihua Yang  wrote:
>>>
 Hi,
 I deployed Flink in session mode. I didn't run any jobs. I saw the logs
 below. That is normal, the same as what the Flink manual shows.

 + /opt/flink/bin/run-job-manager.sh
 Starting HA cluster with 1 masters.
 Starting standalonesession daemon on host job-manager-776dcf6dd-xzs8g.
 Starting taskexecutor daemon on host job-manager-776dcf6dd-xzs8g.


 But when I check kubectl, it shows status is Completed. After a while,
 status changed to CrashLoopBackOff, and pod restart.
 NAME                          READY   STATUS             RESTARTS   AGE
 job-manager-776dcf6dd-xzs8g   0/1     Completed          5          5m27s

 NAME                          READY   STATUS             RESTARTS   AGE
 job-manager-776dcf6dd-xzs8g   0/1     CrashLoopBackOff   5          7m35s

 Can anyone help me understand why?
 Why does Kubernetes regard this pod as completed and restart it? Should I
 configure something, on either the Flink side or the Kubernetes side? The
 Flink manual says that after the cluster is started, I can upload a jar to
 run the application.

 Thanks,
 Qihua

>>>
>>>

