Re: Custom Prometheus metrics disappeared in 1.16.2 => 1.17.1 upgrade
The reason is simple: I migrated to Flink a project that already had Prometheus metrics integrated. Thanks, Javier El mar, 3 oct 2023 a las 15:51, Mason Chen () escribió: > > Hi Javier, > > Is there a particular reason why you aren't leveraging the Flink metric API? It > seems that functionality was internal to the PrometheusReporter > implementation and your use case should've continued working if it had > depended on Flink's metric API. > > Best, > Mason > > On Thu, Sep 28, 2023 at 2:51 AM Javier Vegas wrote: >> >> Thanks! I saw the first change but missed the third one, which is the >> one that most probably explains my problem: most probably the metrics >> I was sending with the twitter/finagle statsReceiver ended up in the >> singleton default registry and were exposed by Flink with all the >> other Flink metrics, but now that Flink uses its own registry I have >> no idea where my custom metrics end up >> >> >> El mié, 27 sept 2023 a las 4:56, Kenan Kılıçtepe >> () escribió: >> > >> > Have you checked the metric changes in 1.17? >> > >> > From the 1.17 release notes: >> > https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-1.17/ >> > >> > Metric Reporters # >> > Only support reporter factories for instantiation # >> > FLINK-24235 # >> > Configuring reporters by their class is no longer supported. Reporter >> > implementations must provide a MetricReporterFactory, and all >> > configurations must be migrated to such a factory. >> > >> > UseLogicalIdentifier makes datadog consider metric as custom # >> > FLINK-30383 # >> > The Datadog reporter now adds a “flink.” prefix to metric identifiers if >> > “useLogicalIdentifier” is enabled. This is required for these metrics to >> > be recognized as Flink metrics, not custom ones. >> > >> > Use separate Prometheus CollectorRegistries # >> > FLINK-30020 # >> > The PrometheusReporters now use a separate CollectorRegistry for each >> > reporter instance instead of the singleton default registry. This >> > generally shouldn’t impact setups, but it may break code that indirectly >> > interacts with the reporter via the singleton instance (e.g., a test >> > trying to assert what metrics are reported). >> > >> > >> > >> > On Wed, Sep 27, 2023 at 11:11 AM Javier Vegas wrote: >> >> >> >> I implemented some custom Prometheus metrics that were working on >> >> 1.16.2, with my configuration >> >> >> >> metrics.reporter.prom.factory.class: >> >> org.apache.flink.metrics.prometheus.PrometheusReporterFactory >> >> metrics.reporter.prom.port: >> >> >> >> I could see both Flink metrics and my custom metrics on port of >> >> my task managers >> >> >> >> After upgrading to 1.17.1, using the same configuration, I can see >> >> only the Flink metrics on port of the task managers, >> >> the custom metrics are getting lost somewhere. >> >> >> >> The release notes for 1.17 mention >> >> https://issues.apache.org/jira/browse/FLINK-24235 >> >> that removes instantiating reporters by name and forces using a >> >> factory, which I was already doing in 1.16.2. Do I need to do >> >> anything extra after those changes so my metrics are aggregated with >> >> the Flink ones? >> >> >> >> I am also seeing this error message on application startup (which I >> >> was already seeing in 1.16.2): "Multiple implementations of the same >> >> reporter were found in 'lib' and/or 'plugins' directories for >> >> org.apache.flink.metrics.prometheus.PrometheusReporterFactory. It is >> >> recommended to remove redundant reporter JARs to resolve used >> >> versions' ambiguity."
Could that also explain the missing metrics? >> >> >> >> Thanks, >> >> >> >> Javier Vegas
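For reference, the Flink metric API route suggested above looks roughly like this in Scala. This is a minimal sketch with illustrative names; metrics registered through the RuntimeContext land in Flink's own metric registry, so every configured reporter (the PrometheusReporter included) exposes them without the job touching any Prometheus CollectorRegistry directly:

    import org.apache.flink.api.common.functions.RichMapFunction
    import org.apache.flink.configuration.Configuration
    import org.apache.flink.metrics.Counter

    class CountingMapper extends RichMapFunction[String, String] {
      // Created in open() so the metric is registered once per parallel instance
      @transient private var events: Counter = _

      override def open(parameters: Configuration): Unit = {
        events = getRuntimeContext.getMetricGroup.counter("customEvents")
      }

      override def map(value: String): String = {
        events.inc() // shows up as a task-scoped metric on the reporter's port
        value
      }
    }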
Re: Custom Prometheus metrics disappeared in 1.16.2 => 1.17.1 upgrade
Thanks! I saw the first change but missed the third one, which is the one that most probably explains my problem: most probably the metrics I was sending with the twitter/finagle statsReceiver ended up in the singleton default registry and were exposed by Flink with all the other Flink metrics, but now that Flink uses its own registry I have no idea where my custom metrics end up El mié, 27 sept 2023 a las 4:56, Kenan Kılıçtepe () escribió: > > Have you checked the metric changes in 1.17? > > From the 1.17 release notes: > https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-1.17/ > > Metric Reporters # > Only support reporter factories for instantiation # > FLINK-24235 # > Configuring reporters by their class is no longer supported. Reporter > implementations must provide a MetricReporterFactory, and all configurations > must be migrated to such a factory. > > UseLogicalIdentifier makes datadog consider metric as custom # > FLINK-30383 # > The Datadog reporter now adds a “flink.” prefix to metric identifiers if > “useLogicalIdentifier” is enabled. This is required for these metrics to be > recognized as Flink metrics, not custom ones. > > Use separate Prometheus CollectorRegistries # > FLINK-30020 # > The PrometheusReporters now use a separate CollectorRegistry for each > reporter instance instead of the singleton default registry. This generally > shouldn’t impact setups, but it may break code that indirectly interacts with > the reporter via the singleton instance (e.g., a test trying to assert what > metrics are reported). > > > > On Wed, Sep 27, 2023 at 11:11 AM Javier Vegas wrote: >> >> I implemented some custom Prometheus metrics that were working on >> 1.16.2, with my configuration >> >> metrics.reporter.prom.factory.class: >> org.apache.flink.metrics.prometheus.PrometheusReporterFactory >> metrics.reporter.prom.port: >> >> I could see both Flink metrics and my custom metrics on port of >> my task managers >> >> After upgrading to 1.17.1, using the same configuration, I can see >> only the Flink metrics on port of the task managers, >> the custom metrics are getting lost somewhere. >> >> The release notes for 1.17 mention >> https://issues.apache.org/jira/browse/FLINK-24235 >> that removes instantiating reporters by name and forces using a >> factory, which I was already doing in 1.16.2. Do I need to do >> anything extra after those changes so my metrics are aggregated with >> the Flink ones? >> >> I am also seeing this error message on application startup (which I >> was already seeing in 1.16.2): "Multiple implementations of the same >> reporter were found in 'lib' and/or 'plugins' directories for >> org.apache.flink.metrics.prometheus.PrometheusReporterFactory. It is >> recommended to remove redundant reporter JARs to resolve used >> versions' ambiguity." Could that also explain the missing metrics? >> >> Thanks, >> >> Javier Vegas
Re: Custom Prometheus metrics disappeared in 1.16.2 => 1.17.1 upgrade
Some more details on my problem: 1. The "Multiple implementations" problem was because I had the metrics-prometheus jar both in the plugins and lib directories. I tried putting it in only one, and in both cases (plugins or lib) the result was the same: I got only Flink metrics on my prom port. 2. My application extends https://github.com/twitter/twitter-server/blob/develop/server/src/main/scala/com/twitter/server/TwitterServer.scala and I was sending my custom stats via the statsReceiver provided there https://github.com/twitter/twitter-server/blob/33b3fb00635c4ab1102eb0c062cde6bb132d80c0/server/src/main/scala/com/twitter/server/Stats.scala#L14 3. I realized that my reporter configuration was: metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory metrics.reporter.prom.port: So I guess in 1.16.2 the Prometheus reporter could have been instantiated by class name, and perhaps that somehow allowed my metrics to be merged with the Flink ones, but in 1.17.1 the reporter gets instantiated by the factory and somehow that renders my metrics invisible. Do you have any suggestions to make my metrics work as they did in 1.16.2? Thanks again, Javier Vegas El mar, 26 sept 2023 a las 19:42, Javier Vegas () escribió: > > I implemented some custom Prometheus metrics that were working on > 1.16.2, with my configuration > > metrics.reporter.prom.factory.class: > org.apache.flink.metrics.prometheus.PrometheusReporterFactory > metrics.reporter.prom.port: > > I could see both Flink metrics and my custom metrics on port of > my task managers > > After upgrading to 1.17.1, using the same configuration, I can see > only the Flink metrics on port of the task managers, > the custom metrics are getting lost somewhere. > > The release notes for 1.17 mention > https://issues.apache.org/jira/browse/FLINK-24235 > that removes instantiating reporters by name and forces using a > factory, which I was already doing in 1.16.2. Do I need to do > anything extra after those changes so my metrics are aggregated with > the Flink ones? > > I am also seeing this error message on application startup (which I > was already seeing in 1.16.2): "Multiple implementations of the same > reporter were found in 'lib' and/or 'plugins' directories for > org.apache.flink.metrics.prometheus.PrometheusReporterFactory. It is > recommended to remove redundant reporter JARs to resolve used > versions' ambiguity." Could that also explain the missing metrics? > > Thanks, > > Javier Vegas
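Since FLINK-24235, the metrics.reporter.prom.class line is simply ignored and only the factory entry is honored, so the 1.17 configuration reduces to the factory form alone. A sketch, with a placeholder port value:

    # the port value below is a placeholder; PrometheusReporter also accepts a range, e.g. 9250-9260
    metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
    metrics.reporter.prom.port: 9249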
Custom Prometheus metrics disappeared in 1.16.2 => 1.17.1 upgrade
I implemented some custom Prometheus metrics that were working on 1.16.2, with my configuration metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory metrics.reporter.prom.port: I could see both Flink metrics and my custom metrics on port of my task managers After upgrading to 1.17.1, using the same configuration, I can see only the Flink metrics on port of the task managers, the custom metrics are getting lost somewhere. The release notes for 1.17 mention https://issues.apache.org/jira/browse/FLINK-24235 that removes instantiating reporters by name and forces using a factory, which I was already doing in 1.16.2. Do I need to do anything extra after those changes so my metrics are aggregated with the Flink ones? I am also seeing this error message on application startup (which I was already seeing in 1.16.2): "Multiple implementations of the same reporter were found in 'lib' and/or 'plugins' directories for org.apache.flink.metrics.prometheus.PrometheusReporterFactory. It is recommended to remove redundant reporter JARs to resolve used versions' ambiguity." Could that also explain the missing metrics? Thanks, Javier Vegas
Re: Error upgrading operator CRD
Additionally, when I try the next step I get another unexpected error: ✗ helm -n flink template flink-kubernetes-operator flink-operator-repo/flink-kubernetes-operator Error: failed to fetch https://downloads.apache.org/flink/flink-kubernetes-operator-1.0.1/flink-kubernetes-operator-1.0.1-helm.tgz : 404 Not Found. Not sure why helm wants to fetch 1.0.1, because I have 1.3.1 installed (but that would have resulted in a 404 too, since that downloads site only has versions 1.4.0 and 1.5.0 of the operator). El vie, 7 jul 2023 a las 10:59, Javier Vegas () escribió: > Somehow I was able in the past to upgrade the CRD when I upgraded the > operator to 1.2 and 1.3, but trying now to upgrade to 1.4 I am getting the > following error: > > ✗ kubectl replace -f > helm/flink-kubernetes-operator/crds/flinksessionjobs.flink.apache.org-v1.yml > > error: the path > "helm/flink-kubernetes-operator/crds/flinksessionjobs.flink.apache.org-v1.yml" > does not exist > > Do I need to pass more arguments to kubectl for it to find the path? How > can I verify the CRD path? > > Thanks, > > Javier Vegas >
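Each operator release is published under its own versioned download URL, so a helm repo added back in the 1.0.1 days keeps pointing at the old, now removed, tarball. Re-adding the repo against a current version is one way out; the version number here is an example:

    helm repo remove flink-operator-repo
    helm repo add flink-operator-repo https://downloads.apache.org/flink/flink-kubernetes-operator-1.5.0/
    helm repo update
    helm -n flink template flink-kubernetes-operator flink-operator-repo/flink-kubernetes-operator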
Error upgrading operator CRD
Somehow I was able in the past to upgrade the CRD when I upgraded the operator to 1.2 and 1.3, but trying now to upgrade to 1.4 I am getting the following error: ✗ kubectl replace -f helm/flink-kubernetes-operator/crds/flinksessionjobs.flink.apache.org-v1.yml error: the path "helm/flink-kubernetes-operator/crds/flinksessionjobs.flink.apache.org-v1.yml" does not exist Do I need to pass more arguments to kubectl for it to find the path? How can I verify the CRD path? Thanks, Javier Vegas
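The path in the kubectl replace command is relative to a local copy of the Helm chart, which explains the "does not exist" error when running it from anywhere else. One way to materialize it, assuming the flink-operator-repo repo is already added and points at the target version (the flinkdeployments file name is inferred from the flinksessionjobs one above):

    helm pull flink-operator-repo/flink-kubernetes-operator --untar
    kubectl replace -f flink-kubernetes-operator/crds/flinkdeployments.flink.apache.org-v1.yml
    kubectl replace -f flink-kubernetes-operator/crds/flinksessionjobs.flink.apache.org-v1.yml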
Re: DuplicateJobSubmissionException on restart after taskmanagers crash
My issue is described in https://issues.apache.org/jira/browse/FLINK-21928 where it says it was fixed in 1.14, but I am still seeing the problem. Although there it says: "Additionally, it is still required that the user cleans up the corresponding HA entries for the running jobs registry because these entries won't be reliably cleaned up when encountering the situation described by FLINK-21928 <https://issues.apache.org/jira/browse/FLINK-21928>." so I guess I need to do some manual cleanup of my S3 HA data before restarting El vie, 20 ene 2023 a las 4:58, Javier Vegas () escribió: > > I have a Flink app (Flink 1.16.0, deployed to Kubernetes via operator > 1.3.1 and using Kubernetes HighAvailability with storage in S3) that > depends on multiple Thrift services for data queries. When one of those > services is down (or throws exceptions) the Flink job managers end up > crashing and only the task managers remain up. Once the dependencies are > fixed, when I try to restart the Flink app I end up with a > "DuplicateJobSubmissionException: Job has already been submitted" (see > below for detailed log) and the task managers never start. The only > solution I have found is to delete the deployment from Kubernetes and then > deploy again as a new job. > > 1) Is there a better way to handle failures on dependencies than letting > the job managers crash while the task managers stay up, and restarting after > dependencies are fixed? > 2) If not, is there a way to handle the DuplicateJobSubmissionException so > the Flink app can be restarted without having to uninstall it first? > > Thanks, > > Javier Vegas > > > org.apache.flink.util.FlinkException: Failed to execute job > at > org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2203) > Caused by: > org.apache.flink.runtime.client.DuplicateJobSubmissionException: Job has > already been submitted.
> at > org.apache.flink.runtime.client.DuplicateJobSubmissionException.ofGloballyTerminated(DuplicateJobSubmissionException.java:35) > at > org.apache.flink.runtime.dispatcher.Dispatcher.submitJob(Dispatcher.java:449) > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native > Method) > at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown > Source) > at > java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown > Source) > at java.base/java.lang.reflect.Method.invoke(Unknown Source) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:309) > at > org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:307) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:222) > at > org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:84) > at > org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:168) > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) > at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) > at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) > at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) > at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) > at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) > at akka.actor.Actor.aroundReceive(Actor.scala:537) > at akka.actor.Actor.aroundReceive$(Actor.scala:535) > at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220) > at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580) > at akka.actor.ActorCell.invoke(ActorCell.scala:548) > at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270) > at akka.dispatch.Mailbox.run(Mailbox.scala:231) > at akka.dispatch.Mailbox.exec(Mailbox.scala:243) > ... 5 more > Exception thrown in main on startup > > >
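Regarding the manual cleanup mentioned above: with Kubernetes HA, Flink keeps the running-jobs registry in ConfigMaps labeled with the cluster-id, plus pointers under high-availability.storageDir. A sketch of clearing both before resubmitting (namespace, cluster-id, and S3 path are placeholders; the label selector is the one Flink's HA docs use for manual cleanup):

    kubectl -n <namespace> delete configmap \
      --selector='app=<cluster-id>,configmap-type=high-availability'
    # and the HA metadata under the configured storage dir, e.g.:
    aws s3 rm --recursive s3://<ha-bucket>/<ha-path>/<cluster-id>/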
DuplicateJobSubmissionException on restart after taskmanagers crash
I have a Flink app (Flink 1.16.0, deployed to Kubernetes via operator 1.3.1 and using Kubernetes HighAvailability with storage in S3) that depends on multiple Thrift services for data queries. When one of those services is down (or throws exceptions) the Flink job managers end up crashing and only the task managers remain up. Once the dependencies are fixed, when I try to restart the Flink app I end up with a "DuplicateJobSubmissionException: Job has already been submitted" (see below for detailed log) and the task managers never start. The only solution I have found is to delete the deployment from Kubernetes and then deploy again as a new job. 1) Is there a better way to handle failures on dependencies than letting the job managers crash while the task managers stay up, and restarting after dependencies are fixed? 2) If not, is there a way to handle the DuplicateJobSubmissionException so the Flink app can be restarted without having to uninstall it first? Thanks, Javier Vegas org.apache.flink.util.FlinkException: Failed to execute job at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2203) Caused by: org.apache.flink.runtime.client.DuplicateJobSubmissionException: Job has already been submitted. at org.apache.flink.runtime.client.DuplicateJobSubmissionException.ofGloballyTerminated(DuplicateJobSubmissionException.java:35) at org.apache.flink.runtime.dispatcher.Dispatcher.submitJob(Dispatcher.java:449) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:309) at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:307) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:222) at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:84) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:168) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at akka.actor.Actor.aroundReceive(Actor.scala:537) at akka.actor.Actor.aroundReceive$(Actor.scala:535) at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580) at akka.actor.ActorCell.invoke(ActorCell.scala:548) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270) at akka.dispatch.Mailbox.run(Mailbox.scala:231) at akka.dispatch.Mailbox.exec(Mailbox.scala:243) ... 5 more Exception thrown in main on startup
"Exec Failure java.io.EOFException null" message before taskmanagers crash
I have a Flink 1.15 app running in Kubernetes (v1.22) deployed via operator 1.2, using S3-based HA with 2 jobmanagers and 2 taskmanagers. The app consumes a high-traffic Kafka topic and writes to a Cassandra database. It had been running fine for 4 days, but at some point the taskmanagers crashed. Looking through my logs, the oldest messages I see are "Exec Failure java.io.EOFException null" from both taskmanagers at exactly the same time, but there is no associated stack trace. After that, the taskmanagers try to restart, but I see another "Exec Failure java.io.EOFException null" message from one jobmanager, and shortly after the jobmanager sets the newly started taskmanagers to a failed state with the message below. This repeats a couple more times until finally no more taskmanagers try to come up, and the jobmanagers sit there throwing RecipientUnreachableExceptions because there are no more taskmanagers around. Any idea what that "Exec Failure java.io.EOFException null" message is about, or what I can do to debug it if it happens again? Thanks, Javier Vegas Message from the jobmanager: Source: event-activity (2/4)#0 (090ef433d97011f8f595885a9bb39a28) switched from RUNNING to FAILED with failure cause: org.apache.flink.util.FlinkException: Disconnect from JobManager responsible for b7c72bb5b7570a7c981f761afa1b7ea6. at org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1679) at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$closeJob$18(TaskExecutor.java:1660) at java.base/java.util.Optional.ifPresent(Unknown Source) at org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJob(TaskExecutor.java:1658) at org.apache.flink.runtime.taskexecutor.TaskExecutor.onStop(TaskExecutor.java:462) at org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.lambda$terminate$0(AkkaRpcActor.java:568) at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:567) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:191) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at akka.actor.Actor.aroundReceive(Actor.scala:537) at akka.actor.Actor.aroundReceive$(Actor.scala:535) at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580) at akka.actor.ActorCell.invoke(ActorCell.scala:548) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270) at akka.dispatch.Mailbox.run(Mailbox.scala:231) at akka.dispatch.Mailbox.exec(Mailbox.scala:243) at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source) at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source) at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source) at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source) at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source) Caused
by: org.apache.flink.util.FlinkException: The TaskExecutor is shutting down. at org.apache.flink.runtime.taskexecutor.TaskExecutor.onStop(TaskExecutor.java:456) ... 25 more
Re: HA not working in standalone mode for operator 1.2
The jars that my build system creates have a version number, something like myapp-2.2.11.jar. I am lazy and want to avoid having to update the jarURI param (required in native mode) every time I deploy a new version of my app, and just update the Docker image I am using. Another solution would be to keep using native, and modify my build system to strip the version number in the packaged jar. Thanks! Javier El jue, 13 oct 2022 a las 13:22, Gyula Fóra () escribió: > Before we dive further into this, can you please explain the jarURI problem > you are trying to solve by switching to standalone? > > The native mode should work well in almost any setup. > > Gyula > > On Thu, 13 Oct 2022 at 21:41, Javier Vegas wrote: > >> Hi, I have an S3 HA Flink app that works as expected deployed via >> operator 1.2 in native mode, but I am seeing errors when switching to >> standalone mode (which I want to do mostly to save me having to set jarURI >> explicitly). >> I can see the job manager writes the JobGraph in S3, and in the web UI I >> can see it creates the jobs, but the taskmanager sits there doing nothing >> as if it could not communicate with the jobmanager. I can also see that the >> operator has created two services, while native mode creates only the rest >> service. After a while, the taskmanager closes with the following exception: >> >> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: >> Could not register at the ResourceManager within the specified maximum >> registration duration 30 ms. This indicates a problem with this >> instance. Terminating now. >> >> at >> org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1474) >> >> at >> org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$16(TaskExecutor.java:1459) >> >> at >> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRunAsync$4(AkkaRpcActor.java:443) >> >> at >> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) >> >> at >> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:443) >> >> at >> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:213) >> >> at >> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163) >> >> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) >> >> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) >> >> at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) >> >> at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) >> >> at >> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) >> >> at >> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) >> >> at >> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) >> >> at >> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) >> >> at akka.actor.Actor.aroundReceive(Actor.scala:537) >> >> at akka.actor.Actor.aroundReceive$(Actor.scala:535) >> >> at >> akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220) >> >> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580) >> >> at akka.actor.ActorCell.invoke(ActorCell.scala:548) >> >> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270) >> >> at akka.dispatch.Mailbox.run(Mailbox.scala:231) >> >> at akka.dispatch.Mailbox.exec(Mailbox.scala:243) >> >> at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown >> Source) >> >> at >>
java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown >> Source) >> >> at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown >> Source) >> >> at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown >> Source) >> at >> java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source) >> >>
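For the versioned-jar annoyance itself, an alternative to switching deployment modes is pinning the assembly jar name in the build so jarURI never changes across releases. A sketch assuming sbt-assembly:

    // build.sbt: always emit myapp.jar instead of myapp-2.2.11.jar
    assembly / assemblyJarName := s"${name.value}.jar"

With that in place, jarURI can stay at something like local:///opt/flink/lib/myapp.jar while only the Docker image changes per release.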
Re: Validation error trying to use standalone mode with operator 1.2.0
Thanks, that fixed the problem! Sadly I am now running into a different problem with S3 HA when running in standalone mode, see https://lists.apache.org/thread/rf62htkr6govpr41fj3br4mzplsg9vg8 Cheers, Javier El vie, 7 oct 2022 a las 22:02, Gyula Fóra () escribió: > Hi! > > Seems like you still have an older version CRD installed for the > FlinkDeployment which doesn’t contain the newly introduced mode setting. > > You can check > > https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.2/docs/operations/upgrade/ > for the upgrade process. > > Cheers > Gyula > > On Sat, 8 Oct 2022 at 00:00, Javier Vegas wrote: > >> >> I am following the operator quickstart >> >> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.2/docs/try-flink-kubernetes-operator/quick-start/ >> >> kubectl create -f >> https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.2/examples/basic.yaml >> >> >> works fine, but >> >> >> kubectl create -f >> https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.2/examples/basic-reactive.yaml >> >> >> which has the mode: standalone setting >> >> >> gives me this error: >> >> >> error: error validating " >> https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.2/examples/basic-reactive.yaml": >> error validating data: ValidationError(FlinkDeployment.spec): unknown field >> "mode" in org.apache.flink.v1beta1.FlinkDeployment.spec; if you choose to >> ignore these errors, turn validation off with --validate=false >> >> >> >>
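After upgrading the CRD per the linked guide, a quick way to confirm the installed schema actually knows the new field (this assumes the CRD publishes a structural schema, which the operator's charts do):

    kubectl explain flinkdeployment.spec.mode
    # or inspect the raw CRD:
    kubectl get crd flinkdeployments.flink.apache.org -o yaml | grep -n 'mode:'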
HA not working in standalone mode for operator 1.2
Hi, I have an S3 HA Flink app that works as expected deployed via operator 1.2 in native mode, but I am seeing errors when switching to standalone mode (which I want to do mostly to save me having to set jarURI explicitly). I can see the job manager writes the JobGraph in S3, and in the web UI I can see it creates the jobs, but the taskmanager sits there doing nothing as if it could not communicate with the jobmanager. I can also see that the operator has created two services, while native mode creates only the rest service. After a while, the taskmanager closes with the following exception: org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException: Could not register at the ResourceManager within the specified maximum registration duration 30 ms. This indicates a problem with this instance. Terminating now. at org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1474) at org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$16(TaskExecutor.java:1459) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRunAsync$4(AkkaRpcActor.java:443) at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:443) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:213) at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at akka.actor.Actor.aroundReceive(Actor.scala:537) at akka.actor.Actor.aroundReceive$(Actor.scala:535) at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220) at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580) at akka.actor.ActorCell.invoke(ActorCell.scala:548) at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270) at akka.dispatch.Mailbox.run(Mailbox.scala:231) at akka.dispatch.Mailbox.exec(Mailbox.scala:243) at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source) at java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown Source) at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source) at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source) at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
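A couple of checks that can help narrow down where registration breaks in standalone mode (names are placeholders; nc may not be present in the Flink image):

    kubectl -n <namespace> get svc            # standalone mode creates both an RPC service and a -rest service
    kubectl -n <namespace> logs <taskmanager-pod> | grep -i resourcemanager
    kubectl -n <namespace> exec <taskmanager-pod> -- nc -zv <jobmanager-service> 6123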
Validation error trying to use standalone mode with operator 1.2.0
I am following the operator quickstart https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.2/docs/try-flink-kubernetes-operator/quick-start/ kubectl create -f https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.2/examples/basic.yaml works fine, but kubectl create -f https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.2/examples/basic-reactive.yaml which has the mode: standalone setting gives me this error: error: error validating " https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.2/examples/basic-reactive.yaml": error validating data: ValidationError(FlinkDeployment.spec): unknown field "mode" in org.apache.flink.v1beta1.FlinkDeployment.spec; if you choose to ignore these errors, turn validation off with --validate=false
Missing logback-console.xml when submitting job via operator
My Flink app uses logback for logging; when submitting it from the operator I get this error: ERROR in ch.qos.logback.classic.joran.JoranConfigurator@7364985f - Could not open URL [file:/opt/flink/conf/logback-console.xml]. java.io.FileNotFoundException: /opt/flink/conf/logback-console.xml (No such file or directory) https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/advanced/logging/#configuring-logback says "The Flink distribution ships with the following logback configuration files in the conf directory, which are used automatically if logback is enabled" How do I "enable" logback? I don't see any relevant configuration param to enable it, either in flink-conf.yaml or in the operator config. I am also noticing that the log4j-console.properties that ends up in my deployed app configmap is the one from the operator Helm chart https://github.com/apache/flink-kubernetes-operator/tree/main/helm/flink-kubernetes-operator/conf not https://github.com/apache/flink/tree/master/flink-dist/src/main/flink-bin/conf (where logback-console.xml lives). Should the operator also have a logback-console.xml in the Helm chart?
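One possible workaround, sketched under the assumption that the operator mounts every spec.logConfiguration entry into /opt/flink/conf the same way it does log4j-console.properties, is to ship the logback file through the FlinkDeployment itself:

    spec:
      logConfiguration:
        "logback-console.xml": |
          <configuration>
            <appender name="console" class="ch.qos.logback.core.ConsoleAppender">
              <encoder>
                <pattern>%d{yyyy-MM-dd HH:mm:ss,SSS} %-5level %logger{60} - %msg%n</pattern>
              </encoder>
            </appender>
            <root level="INFO">
              <appender-ref ref="console"/>
            </root>
          </configuration>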
Re: Classloading issues with Flink Operator / Kubernetes Native
Version 1.15.2, there is no /opt/flink/usrlib folder created El mar, 20 sept 2022 a las 20:53, Yaroslav Tkachenko () escribió: > Interesting, do you see the /opt/flink/usrlib folder created as well? > Also, what Flink version do you use? > > Thanks. > > On Tue, Sep 20, 2022 at 4:04 PM Javier Vegas wrote: > >> >> jarURI: local:///opt/flink/lib/MYJARNAME.jar >> >> El mar, 20 sept 2022 a las 0:25, Yaroslav Tkachenko (< >> yaros...@goldsky.com>) escribió: >> >>> Hi Javier, >>> >>> What do you specify as a jarURI? >>> >>> On Mon, Sep 19, 2022 at 3:56 PM Javier Vegas wrote: >>> >>>> I am doing the same thing (migrating from standalone to operator in >>>> native mode) and also have my jar in /opt/flink/lib but for me it works >>>> fine, no class loading errors on app startup. >>>> >>>> El vie, 16 sept 2022 a las 9:28, Yaroslav Tkachenko (< >>>> yaros...@goldsky.com>) escribió: >>>> >>>>> Application mode. I've done a bit more research and created >>>>> https://issues.apache.org/jira/browse/FLINK-29288, planning to work >>>>> on a PR today. >>>>> >>>>> TLDR: currently Flink operator always creates /opt/flink/usrlib folder >>>>> and forces you to specify the jarURI parameter, which is passed as >>>>> pipeline.jars / pipeline.classpaths configuration options. This leads to >>>>> the jar being loaded twice by different classloaders (system and user >>>>> ones). >>>>> >>>>> On Fri, Sep 16, 2022 at 2:30 AM Matthias Pohl >>>>> wrote: >>>>> >>>>>> Are you deploying the job in session or application mode? Could you >>>>>> provide the stacktrace. I'm wondering whether that would be helpful to >>>>>> pin >>>>>> a code location for further investigation. >>>>>> So far, I couldn't come up with a definite answer about placing the >>>>>> jar in the lib directory. Initially, I would have thought that it's fine >>>>>> considering that all dependencies are included and the job jar itself >>>>>> ends >>>>>> up on the user classpath. I'm curious whether Chesnay (CC'd) has an >>>>>> answer >>>>>> to that one. >>>>>> >>>>>> On Tue, Sep 13, 2022 at 1:40 AM Yaroslav Tkachenko < >>>>>> yaros...@goldsky.com> wrote: >>>>>> >>>>>>> Hey everyone, >>>>>>> >>>>>>> I’m migrating a Flink Kubernetes standalone job to the Flink >>>>>>> operator (with Kubernetes native mode). >>>>>>> >>>>>>> I have a lot of classloading issues when trying to run with >>>>>>> the operator in native mode. For example, I have a Postgres driver as a >>>>>>> dependency (I can confirm the files are included in the uber jar), but I >>>>>>> still get "java.sql.SQLException: No suitable driver found for >>>>>>> jdbc:postgresql:..." exception. >>>>>>> >>>>>>> In the Kubernetes standalone setup my uber jar is placed in the >>>>>>> /opt/flink/lib folder, this is what I specify as "jarURI" in the >>>>>>> operator >>>>>>> config. Is this supported? Should I only be using /opt/flink/usrlib? >>>>>>> >>>>>>> Thanks for any suggestions. >>>>>>> >>>>>>
Re: Classloading issues with Flink Operator / Kubernetes Native
jarURI: local:///opt/flink/lib/MYJARNAME.jar El mar, 20 sept 2022 a las 0:25, Yaroslav Tkachenko () escribió: > Hi Javier, > > What do you specify as a jarURI? > > On Mon, Sep 19, 2022 at 3:56 PM Javier Vegas wrote: > >> I am doing the same thing (migrating from standalone to operator in >> native mode) and also have my jar in /opt/flink/lib but for me it works >> fine, no class loading errors on app startup. >> >> El vie, 16 sept 2022 a las 9:28, Yaroslav Tkachenko (< >> yaros...@goldsky.com>) escribió: >> >>> Application mode. I've done a bit more research and created >>> https://issues.apache.org/jira/browse/FLINK-29288, planning to work on >>> a PR today. >>> >>> TLDR: currently Flink operator always creates /opt/flink/usrlib folder >>> and forces you to specify the jarURI parameter, which is passed as >>> pipeline.jars / pipeline.classpaths configuration options. This leads to >>> the jar being loaded twice by different classloaders (system and user >>> ones). >>> >>> On Fri, Sep 16, 2022 at 2:30 AM Matthias Pohl >>> wrote: >>> >>>> Are you deploying the job in session or application mode? Could you >>>> provide the stacktrace. I'm wondering whether that would be helpful to pin >>>> a code location for further investigation. >>>> So far, I couldn't come up with a definite answer about placing the jar >>>> in the lib directory. Initially, I would have thought that it's fine >>>> considering that all dependencies are included and the job jar itself ends >>>> up on the user classpath. I'm curious whether Chesnay (CC'd) has an answer >>>> to that one. >>>> >>>> On Tue, Sep 13, 2022 at 1:40 AM Yaroslav Tkachenko < >>>> yaros...@goldsky.com> wrote: >>>> >>>>> Hey everyone, >>>>> >>>>> I’m migrating a Flink Kubernetes standalone job to the Flink operator >>>>> (with Kubernetes native mode). >>>>> >>>>> I have a lot of classloading issues when trying to run with >>>>> the operator in native mode. For example, I have a Postgres driver as a >>>>> dependency (I can confirm the files are included in the uber jar), but I >>>>> still get "java.sql.SQLException: No suitable driver found for >>>>> jdbc:postgresql:..." exception. >>>>> >>>>> In the Kubernetes standalone setup my uber jar is placed in the >>>>> /opt/flink/lib folder, this is what I specify as "jarURI" in the operator >>>>> config. Is this supported? Should I only be using /opt/flink/usrlib? >>>>> >>>>> Thanks for any suggestions. >>>>> >>>>
Re: Classloading issues with Flink Operator / Kubernetes Native
I am doing the same thing (migrating from standalone to operator in native mode) and also have my jar in /opt/flink/lib but for me it works fine, no class loading errors on app startup. El vie, 16 sept 2022 a las 9:28, Yaroslav Tkachenko () escribió: > Application mode. I've done a bit more research and created > https://issues.apache.org/jira/browse/FLINK-29288, planning to work on a > PR today. > > TLDR: currently Flink operator always creates /opt/flink/usrlib folder and > forces you to specify the jarURI parameter, which is passed as > pipeline.jars / pipeline.classpaths configuration options. This leads to > the jar being loaded twice by different classloaders (system and user > ones). > > On Fri, Sep 16, 2022 at 2:30 AM Matthias Pohl > wrote: > >> Are you deploying the job in session or application mode? Could you >> provide the stacktrace. I'm wondering whether that would be helpful to pin >> a code location for further investigation. >> So far, I couldn't come up with a definite answer about placing the jar >> in the lib directory. Initially, I would have thought that it's fine >> considering that all dependencies are included and the job jar itself ends >> up on the user classpath. I'm curious whether Chesnay (CC'd) has an answer >> to that one. >> >> On Tue, Sep 13, 2022 at 1:40 AM Yaroslav Tkachenko >> wrote: >> >>> Hey everyone, >>> >>> I’m migrating a Flink Kubernetes standalone job to the Flink operator >>> (with Kubernetes native mode). >>> >>> I have a lot of classloading issues when trying to run with the operator >>> in native mode. For example, I have a Postgres driver as a dependency (I >>> can confirm the files are included in the uber jar), but I still get >>> "java.sql.SQLException: No suitable driver found for jdbc:postgresql:..." >>> exception. >>> >>> In the Kubernetes standalone setup my uber jar is placed in the >>> /opt/flink/lib folder, this is what I specify as "jarURI" in the operator >>> config. Is this supported? Should I only be using /opt/flink/usrlib? >>> >>> Thanks for any suggestions. >>> >>
Re: serviceAccount permissions issue for high availability in operator 1.1
Hi, Yang! When you say the operator uses native k8s integration by default, does that mean there is a way to change that to use standalone K8s? I haven't seen anything about that in the docs, besides a mention that standalone support is coming in version 1.2 of the operator. Thanks, Javier On Thu, Sep 8, 2022, 22:50 Yang Wang wrote: > Since the flink-kubernetes-operator is using native K8s integration[1] by > default, you need to grant permissions for pods and deployments as well as > ConfigMaps. > > You can find more information about the RBAC here[2]. > > [1]. > https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/native_kubernetes/ > [2]. > https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.1/docs/operations/rbac/ > > Best, > Yang > > Javier Vegas 于2022年9月7日周三 04:17写道: > >> I am migrating an HA standalone Kubernetes app to use the Flink operator. >> The HA store is S3 using IRSA, so the app needs to run with a serviceAccount >> that is authorized to access S3. In standalone mode HA worked once I gave >> the account permissions to edit configMaps. But when trying the operator >> with the custom serviceAccount, I am getting this error: >> >> io.fabric8.kubernetes.client.KubernetesClientException: Failure >> executing: GET at: >> https://172.20.0.1/apis/apps/v1/namespaces/MYNAMESPACE/deployments/MYAPPNAME. >> Message: Forbidden!Configured service account doesn't have access. Service >> account may have been revoked. deployments.apps "MYAPPNAME" is forbidden: >> User "system:serviceaccount:MYNAMESPACE:MYSERVICEACCOUNT" cannot get >> resource "deployments" in API group "apps" in the namespace "MYNAMESPACE". >> >> >> Does the serviceAccount need additional permissions besides configMap >> edit to be able to run HA using the operator? >> >> Thanks, >> >> Javier Vegas >
serviceAccount permissions issue for high availability in operator 1.1
I am migrating an HA standalone Kubernetes app to use the Flink operator. The HA store is S3 using IRSA, so the app needs to run with a serviceAccount that is authorized to access S3. In standalone mode HA worked once I gave the account permissions to edit configMaps. But when trying the operator with the custom serviceAccount, I am getting this error: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: GET at: https://172.20.0.1/apis/apps/v1/namespaces/MYNAMESPACE/deployments/MYAPPNAME. Message: Forbidden!Configured service account doesn't have access. Service account may have been revoked. deployments.apps "MYAPPNAME" is forbidden: User "system:serviceaccount:MYNAMESPACE:MYSERVICEACCOUNT" cannot get resource "deployments" in API group "apps" in the namespace "MYNAMESPACE". Does the serviceAccount need additional permissions besides configMap edit to be able to run HA using the operator? Thanks, Javier Vegas
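Judging from the error message, the service account lacks read access to Deployments in the apps API group, which the native integration needs on top of the ConfigMap edit rights used by HA. A minimal sketch with placeholder names, modeled on the operator's RBAC examples:

    apiVersion: rbac.authorization.k8s.io/v1
    kind: Role
    metadata:
      name: flink-native          # placeholder
      namespace: MYNAMESPACE
    rules:
      - apiGroups: [""]
        resources: ["pods", "configmaps"]
        verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
      - apiGroups: ["apps"]
        resources: ["deployments"]
        verbs: ["get", "list", "watch"]

A matching RoleBinding then ties this Role to MYSERVICEACCOUNT.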
Re: How to open a Prometheus metrics port on the rest service when using the Kubernetes operator?
What I would need is to set ports: - name: metrics port: protocol: TCP in the generated YAML for the appname-rest service, which would properly aggregate the metrics from the pods, but I cannot figure out how to do it, either from the job deployment file or by modifying the operator templates in the Helm chart. Any way I can modify the ports in the Flink rest service? Thanks, Javier Vegas El dom, 4 sept 2022 a las 1:59, Javier Vegas () escribió: > Hi, Biao! > > Thanks for the fast response! Setting that in the podTemplate opens the > metrics port in the pods, but unfortunately not on the rest service. Not > sure if that is standard procedure, but my Prometheus setup scrapes the > metrics port on services but not pods. On my previous non-operator > standalone setup, the metrics port on the service was aggregating all the > pods' metrics and then Prometheus was scraping that, so I was trying to > reproduce that by opening the port on the rest service. > > > > El dom, 4 sept 2022 a las 1:03, Geng Biao () > escribió: > >> Hi Javier, >> >> >> >> You can use podTemplate to expose the port in the flink containers. >> >> Here is a snippet: >> >> spec: >> >> flinkVersion: v1_15 >> >> flinkConfiguration: >> >> state.savepoints.dir: file:///flink-data/flink-savepoints >> >> state.checkpoints.dir: file:///flink-data/flink-checkpoints >> >> *metrics.reporter.prom.factory.class: >> org.apache.flink.metrics.prometheus.PrometheusReporterFactory* >> >> serviceAccount: flink >> >> podTemplate: >> >> metadata: >> >> annotations: >> >> prometheus.io/path: /metrics >> >> prometheus.io/port: "9249" >> >> prometheus.io/scrape: "true" >> >> spec: >> >> serviceAccount: flink >> >> containers: >> >> - name: flink-main-container >> >> volumeMounts: >> >> - mountPath: /flink-data >> >> name: flink-volume >> >> * ports:* >> >> *- containerPort: 9249* >> >> * name: metrics* >> >> * protocol: TCP* >> >> volumes: >> >> - name: flink-volume >> >> emptyDir: {} >> >> >> >> The bold lines are about how to specify the metric reporter and expose the >> metric. The annotations are not required if you use PodMonitor or >> ServiceMonitor. Hope it can help! >> >> >> >> Best, >> >> Biao Geng >> >> >> >> *From: *Javier Vegas >> *Date: *Sunday, September 4, 2022 at 10:19 AM >> *To: *user >> *Subject: *How to open a Prometheus metrics port on the rest service >> when using the Kubernetes operator? >> >> I am migrating my Flink app from standalone Kubernetes to the Kubernetes >> operator. It is going well, but I ran into a problem: I cannot figure out >> how to open a Prometheus metrics port in the rest service to collect all my >> custom metrics from the task managers. Note that this is different from the >> instructions in "How to Enable Prometheus" >> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging/#how-to-enable-prometheus-example >> That example collects the operator pod metrics, but what I am trying >> to do is open a port on the rest service to make my job metrics available >> to Prometheus. >> >> >> >> Thanks, >> >> >> >> Javier Vegas >> >
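Since the operator owns the rest service spec, a less invasive route is a separate Service that selects the Flink pods directly and exposes only the metrics port. A sketch with placeholder names; the selector labels are an assumption, so verify the real ones with kubectl get pods --show-labels:

    apiVersion: v1
    kind: Service
    metadata:
      name: myapp-metrics          # placeholder
      namespace: flink
    spec:
      selector:
        app: myapp                 # native-mode pods are labeled with the cluster-id
        type: flink-native-kubernetes
      ports:
        - name: metrics
          port: 9249
          targetPort: 9249
          protocol: TCP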
Re: How to open a Prometheus metrics port on the rest service when using the Kubernetes operator?
Hi, Biao! Thanks for the fast response! Setting that in the podTemplate opens the metrics port in the pods, but unfortunately not on the rest service. Not sure if that is standard procedure, but my Prometheus setup scrapes the metrics port on services but not pods. On my previous non-operator standalone setup, the metrics port on the service was aggregating all the pods' metrics and then Prometheus was scraping that, so I was trying to reproduce that by opening the port on the rest service. El dom, 4 sept 2022 a las 1:03, Geng Biao () escribió: > Hi Javier, > > > > You can use podTemplate to expose the port in the flink containers. > > Here is a snippet: > > spec: > > flinkVersion: v1_15 > > flinkConfiguration: > > state.savepoints.dir: file:///flink-data/flink-savepoints > > state.checkpoints.dir: file:///flink-data/flink-checkpoints > > *metrics.reporter.prom.factory.class: > org.apache.flink.metrics.prometheus.PrometheusReporterFactory* > > serviceAccount: flink > > podTemplate: > > metadata: > > annotations: > > prometheus.io/path: /metrics > > prometheus.io/port: "9249" > > prometheus.io/scrape: "true" > > spec: > > serviceAccount: flink > > containers: > > - name: flink-main-container > > volumeMounts: > > - mountPath: /flink-data > > name: flink-volume > > * ports:* > > *- containerPort: 9249* > > * name: metrics* > > * protocol: TCP* > > volumes: > > - name: flink-volume > > emptyDir: {} > > > > The bold lines are about how to specify the metric reporter and expose the > metric. The annotations are not required if you use PodMonitor or > ServiceMonitor. Hope it can help! > > > > Best, > > Biao Geng > > > > *From: *Javier Vegas > *Date: *Sunday, September 4, 2022 at 10:19 AM > *To: *user > *Subject: *How to open a Prometheus metrics port on the rest service when > using the Kubernetes operator? > > I am migrating my Flink app from standalone Kubernetes to the Kubernetes > operator. It is going well, but I ran into a problem: I cannot figure out > how to open a Prometheus metrics port in the rest service to collect all my > custom metrics from the task managers. Note that this is different from the > instructions in "How to Enable Prometheus" > https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging/#how-to-enable-prometheus-example > That example collects the operator pod metrics, but what I am trying > to do is open a port on the rest service to make my job metrics available > to Prometheus. > > > > Thanks, > > > > Javier Vegas >
How to open a Prometheus metrics port on the rest service when using the Kubernetes operator?
I am migrating my Flink app from standalone Kubernetes to the Kubernetes operator. It is going well, but I ran into a problem: I cannot figure out how to open a Prometheus metrics port in the rest service to collect all my custom metrics from the task managers. Note that this is different from the instructions in "How to Enable Prometheus" https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging/#how-to-enable-prometheus-example That example collects the operator pod metrics, but what I am trying to do is open a port on the rest service to make my job metrics available to Prometheus. Thanks, Javier Vegas
Re: NodePort conflict for multiple HA application-mode standalone Kubernetes deploys in same namespace
Partial answer to my own question: After removing the hardcoded `nodePort: 30081` entry from jobmanager-rest-service.yaml, Kubernetes assigns random ports, so there are no conflicts and multiple Flink application-mode jobs can be deployed. However, the jobs seem to interfere with each other: when launching the second job, the first job's taskmanagers start executing tasks sent by the second job's jobmanager, and the second job's taskmanagers execute tasks from both jobmanagers. El vie, 22 jul 2022 a las 12:03, Javier Vegas () escribió: > > I am deploying a high-availability Flink job to Kubernetes in application > mode using Flink's standalone k8s deployment > https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/standalone/kubernetes/ > All goes well when I deploy a job, but if I want to deploy a second > application-mode Flink job in the same K8s namespace I get a > "spec.ports[0].nodePort: > Invalid value: 30081: provided port is already allocated" error. Is there > a way that nodePort can be allocated dynamically, or another way around this > (using LoadBalancer or Ingress instead of NodePort in > jobmanager-rest-service.yaml), besides hard-coding different nodePorts > for different jobs running in the same namespace? > > Thanks, > > Javier Vegas >
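The cross-talk is consistent with both deployments reusing the shared app: flink labels from the reference YAMLs, so each jobmanager Service selects every taskmanager pod in the namespace. Giving each job its own label value (and pointing each taskmanager's jobmanager.rpc.address at its own service) keeps them apart; a sketch, with joba as a placeholder suffix:

    # jobmanager-service.yaml, one copy per job
    metadata:
      name: flink-jobmanager-joba
    spec:
      selector:
        app: flink-joba            # must match the labels on that job's pods only
        component: jobmanager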
NodePort conflict for multiple HA application-mode standalone Kubernetes deploys in same namespace
I am deploying a high-availability Flink job to Kubernetes in application mode using Flink's standalone k8s deployment https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/standalone/kubernetes/ All goes well when I deploy a job, but if I want to deploy a second application-mode Flink job in the same K8s namespace I get a "spec.ports[0].nodePort: Invalid value: 30081: provided port is already allocated" error. Is there a way that nodePort can be allocated dynamically, or another way around this (using LoadBalancer or Ingress instead of NodePort in jobmanager-rest-service.yaml), besides hard-coding different nodePorts for different jobs running in the same namespace? Thanks, Javier Vegas
standalone mode support in the kubernetes operator (FLIP-25)
Hello! The operator docs https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/overview/ say "The Operator does not support Standalone Kubernetes <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/standalone/kubernetes/> deployments yet" and mentions https://cwiki.apache.org/confluence/display/FLINK/FLIP-225%3A+Implement+standalone+mode+support+in+the+kubernetes+operator as a "what's next" step. Is there a timeline for that to be released? Thanks, Javier Vegas
Re: IllegalArgumentException: URI is not hierarchical error when initializating jobmanager in cluster
Thanks, Robert! I tried the classloader.resolve.order: parent-first option but ran into SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder" errors; because I use logback, I followed https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/advanced/logging/#configuring-logback and removed log4j-slf4j-impl from the classpath. But putting all my classes in lib/ instead of usrlib/ fixed that problem, and everything now runs fine. Thanks! El vie, 28 ene 2022 a las 6:11, Robert Metzger () escribió: > Hi Javier, > > I suspect that TwitterServer is using some classloading / dependency > injection / service loading "magic" that is causing this. > I would try to find out, either by attaching a remote debugger (should be > possible when executing in cluster mode locally) or by adding log > statements in the code, what the URI it's trying to load looks like. > > On the cluster, Flink is using separate classloaders for the base Flink > system and the user code (as opposed to executing in the IDE, where > everything is loaded from the same loader). Check out this page and try out > the config arguments: > https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/ > > > > On Wed, Jan 26, 2022 at 4:13 AM Javier Vegas wrote: >> I am porting a Scala service to Flink in order to make it more scalable >> via running it in a cluster. All my Scala services extend a base Service >> class that extends TwitterServer ( >> https://github.com/twitter/twitter-server/blob/develop/server/src/main/scala/com/twitter/server/TwitterServer.scala) >> and that base class contains a lot of logic about resource initialization, >> logging, stats and error handling, monitoring, etc. that I want to keep >> using in my class. I ported my logic to Flink sources and sinks, and >> everything worked fine when I ran my class in single JVM mode either from >> sbt or my IDE; Flink's jobmanager and taskmanagers start and run my app. >> But when I try to run my application in cluster mode, when launching my >> class with "./bin/standalone-job.sh start --job-classname" the >> jobmanager runs into an "IllegalArgumentException: URI is not hierarchical" >> error on initialization, apparently because TwitterServer is trying to load >> something from the class path (see attached full log). >> >> Is there anything I can do to run a class that extends TwitterServer in a >> Flink cluster? I have tried making my class not extend it and it worked >> fine, but I really want to keep using all the common infrastructure logic >> that I have in my base class that extends TwitterServer. >> >> Thanks! >
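For the record, the working combination described above corresponds roughly to these flink-conf.yaml entries:

    # default is child-first; parent-first makes user classes load from the same
    # classloader as Flink's own, which is what sharing the jar via lib/ needs
    classloader.resolve.order: parent-first
    # when using logback, also keep exactly one SLF4J binding on the classpath
    # (i.e., no log4j-slf4j-impl jar in lib/)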
Mesos deploy starts Mesos framework but does not start task managers
I am trying to deploy a Flink cluster via Mesos following https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/ (I know Mesos support has been deprecated, and I am planning to migrate my deployment tools to Kubernetes, but for now I am stuck using Mesos). To deploy, I am using a custom Docker image that contains both Flink and my user binaries. The command I am using to start the cluster is /opt/flink/bin/mesos-appmaster.sh \ -Djobmanager.rpc.address=$HOST \ -Dmesos.resourcemanager.framework.user=flink \ -Dmesos.resourcemanager.framework.name=timeline-flink-populator \ -Dmesos.master=10.0.25.139:5050 \ -Dmesos.resourcemanager.tasks.cpus=4 \ -Dmesos.resourcemanager.tasks.container.type=docker \ -Dmesos.resourcemanager.tasks.container.image.name=docker.strava.com/strava/flink:jv-mesos \ -Dtaskmanager.numberOfTaskSlots=4 ; mesos-appmaster.sh is able to start a Mesos framework and a Flink job manager, but fails to start task managers. Looking in the Mesos syslog I see that the Mesos framework was sending offers that were being declined very quickly, and the agents ended up in the LOST state. I am attaching all the relevant lines from the syslog. Any ideas what the problem could be, or what else I could check to see what is happening? Thanks, Javier Vegas [attachment: syslog]
Re: Unable to connect to Mesos on mesos-appmaster.sh start
Thanks again, Matthias! Putting -Djobmanager.rpc.address=$HOST and -Djobmanager.rpc.port=$PORT0 as params for appmaster.sh, I see in the log that they resolve to the correct values -Djobmanager.rpc.address=10.0.23.35 -Djobmanager.rpc.port=31009 but a bit later the appmaster dies with this new error. It is unclear what address it is trying to bind; I added explicit params -Drest.bind-port=8081 and -Drest.port=8081 in case jobmanager.rpc.port was somehow interfering, but that didn't help. 2021-09-29 10:29:59.845 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting MesosSessionClusterEntrypoint down with application status FAILED. Diagnostics java.net.BindException: Cannot assign requested address at java.base/sun.nio.ch.Net.bind0(Native Method) at java.base/sun.nio.ch.Net.bind(Unknown Source) at java.base/sun.nio.ch.Net.bind(Unknown Source) at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(Unknown Source) at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134) at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550) at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334) at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506) at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491) at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973) at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248) at org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472) at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989) at org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) at org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) at java.base/java.lang.Thread.run(Unknown Source) . On Wed, Sep 29, 2021 at 2:36 AM Matthias Pohl wrote: > The port has its separate configuration parameter jobmanager.rpc.port [1] > > [1] > https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#jobmanager-rpc-port-1 > > On Wed, Sep 29, 2021 at 10:11 AM Javier Vegas wrote: > >> Matthias, thanks for the suggestion! I changed my jobmanager.rpc.address >> param from $HOSTNAME to $HOST:$PORT0 which in the log I see resolves >> properly to the host IP and port mapped to 8081 >> >> 2021-09-29 07:58:05.452 [main] INFO >> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - >> -Djobmanager.rpc.address=10.0.22.114:31894 >> >> which is very promising. But sadly a little bit later appmaster dies with >> this error: >> >> 2021-09-29 07:58:05.648 [main] INFO >> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Initializing >> cluster services.
>> 2021-09-29 07:58:05.674 [main] INFO >> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Shutting >> MesosSessionClusterEntrypoint down with application status FAILED. >> Diagnostics org.apache.flink.configuration.IllegalConfigurationException: The configured hostname is not valid >> at >> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179) >> at >> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197) >> at >> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207) >> at >> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152) >> at >> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370) >> at >> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344) >> at >> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRp
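The BindException above suggests separating the address Flink advertises from the address it binds: with $HOST the process advertises the agent's IP but also tries to bind it inside the container, where that IP is not a local interface. A hedged sketch of the appmaster flags (rest.address and rest.bind-address are standard Flink config keys; $HOST/$PORT0 come from Marathon; whether this resolves the failure in this particular Mesos setup is an assumption):

  /opt/flink/bin/mesos-appmaster.sh \
    -Djobmanager.rpc.address=$HOST \
    -Djobmanager.rpc.port=$PORT0 \
    -Drest.address=$HOST \
    -Drest.bind-address=0.0.0.0

Here rest.address is what gets advertised to clients, while rest.bind-address is what the server socket actually binds, so binding 0.0.0.0 avoids asking the container to bind an IP it does not own.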
Re: Unable to connect to Mesos on mesos-appmaster.sh start
(UserGroupInformation.java:1762) at org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186) ... 2 common frames omitted Caused by: java.lang.IllegalArgumentException: null at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122) at org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177) ... 17 common frames omitted On Wed, Sep 29, 2021 at 12:16 AM Matthias Pohl wrote: > One thing that was puzzling me yesterday when reading your post: Have you > tried $HOST instead of $HOSTNAME in the Marathon configuration? When I > played around with Mesos, I remember using HOST to resolve the host's IP > address instead of the host's name. It could be that the hostname itself > cannot be resolved to the right IP address. But I struggled to find proper > documentation to back that up. Only in the recipes section of the Marathon > docs [1], HOST was used as well. > > Matthias > > [1] > https://mesosphere.github.io/marathon/docs/recipes.html#command-executor-health-checks > > On Wed, Sep 29, 2021 at 3:37 AM Javier Vegas wrote: > >> Another update: Looking more carefully in my appmaster log, I see the >> following >> >> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO >> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - >> Registering as new framework. >> >> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO >> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - >> - >> >> --- >> >> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO >> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Mesos >> Info: >> >> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO >> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Master >> URL: 10.0.18.246:5050 >> >> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO >> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Framework >> Info: >> >> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO >> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - ID: >> (none) >> >> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO >> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Name: >> flink-test >> >> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO >> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Failover >> Timeout (secs): 604800.0 >> >> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO >> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Role: >> * >> >> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO >> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - >> Capabilities: >> (none) >> >> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO >> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Principal: >> (none) >> >> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO >> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Host: >> 311dcf7fd77c >> >> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO >> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - Web >> UI: http://311dcf7fd77c:8081 >> >> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO >> 
o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver - >> - >> >> --- >> >> >> which is picking up the mesos.master and >> mesos.resourcemanager.framework.name params I am passing to >> mesos-appmaster.sh >> >> >> In my Mesos dashboard I can see the framework has been created with the >> right name, but has no associated agents/tasks to it. So at least Flink has >> been able to connect to the Mesos master to create the framework >> >> >> Later in the mesos-appmaster log is when I see the Mesos connection >> errors: >> >> >> 2021-09-29 01:15:39.726 [flink-akka.actor.default-dispatcher-3] DEBUG >> o.a.f.r.resourcemanager.slotmanager.DeclarativeSlotManager - Starting >> the slot manager. >> >> 2021-09-29 01:15:39.815 [flink-a
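The NetUtils trace at the top of this message comes from passing host and port together in jobmanager.rpc.address; as Matthias points out in the reply quoted above, the port has its own configuration key. A sketch of the corrected flags, grounded in that reply:

  # rejected: jobmanager.rpc.address must be a bare hostname or IP
  #   -Djobmanager.rpc.address=$HOST:$PORT0
  # accepted: split host and port across the two options
  -Djobmanager.rpc.address=$HOST
  -Djobmanager.rpc.port=$PORT0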
Re: Unable to connect to Mesos on mesos-appmaster.sh start
ever it does later? One possible issue I see is that the framework is set with web UI at http://311dcf7fd77c:8081 which can not be resolved from the Mesos master. 311dcf7fd77c is the result of doing hostname on the Docker container, and the Mesos master can not resolve that name. I could try to replace the Docker container hostname with the Docker host hostname, but the host port that gets mapped to 8081 on the container is a random port that I can not know beforehand. Does Mesos master try to reach Flink using that Web UI setting? Could this be the issue causing my connection problem, or is this a red herring and the problem is a different one? Thanks, Javier Vegas On Tue, Sep 28, 2021 at 10:23 AM Javier Vegas wrote: > Thanks, Matthias! > > There are lots of apps deployed to the Mesos cluster, the task manager > itself is deployed to Mesos via Marathon. In the Mesos log I can see the > Job manager agent starting, but no error messages related to it. As you > say, TaskManagers don't even have the chance to get confused about > variables, since the Job Manager can not connect to the Mesos master to > tell it to start the Task Managers. > > Thanks, > > Javier > > On Tue, Sep 28, 2021 at 7:59 AM Matthias Pohl > wrote: > >> Hi Javier, >> I don't see anything that's configured in the wrong way based on the >> jobmanager logs you've provided. Have you been able to deploy other >> applications to this Mesos cluster? Do the Mesos master logs reveal >> anything? The variable resolution on the TaskManager side is a valid >> concern shared by Roman since it's easy to run into such an issue. But the >> JobManager logs indicate that the JobManager is not able to contact the >> Mesos master. Hence, I'd assume that it's not related to the TaskManagers >> not coming up. >> >> Best, >> Matthias >> >> On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan >> wrote: >> >>> Hi, >>> >>> No additional ports need to be open as far as I know. >>> >>> Probably, $HOSTNAME is substituted for something not resolvable on TMs? >>> >>> Please also make sure that the following gets executed before >>> mesos-appmaster.sh: >>> export HADOOP_CLASSPATH=$(hadoop classpath) >>> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so >>> (as per the documentation you linked) >>> >>> Regards, >>> Roman >>> >>> On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas wrote: >>> > >>> > I am trying to start Flink 1.13.2 on Mesos following the instructions >>> in >>> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/ >>> and using Marathon to deploy a Docker image with both the Flink and my >>> binaries.
>>> > >>> > My entrypoint for the Docker image is: >>> > >>> > >>> > /opt/flink/bin/mesos-appmaster.sh \ >>> > >>> > -Djobmanager.rpc.address=$HOSTNAME \ >>> > >>> > -Dmesos.resourcemanager.framework.user=flink \ >>> > >>> > -Dmesos.master=10.0.18.246:5050 \ >>> > >>> > -Dmesos.resourcemanager.tasks.cpus=6 >>> > >>> > >>> > >>> > When mesos-appmaster.sh starts, in the stderr I see this: >>> > >>> > >>> > I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3 >>> > >>> > I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered on >>> agent f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090 >>> > >>> > I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker >>> executor on 10.0.20.177 >>> > >>> > I0927 16:50:32.311394 801345 executor.cpp:186] Starting task >>> tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0 >>> > >>> > WARNING: Your kernel does not support swap limit capabilities or the >>> cgroup is not mounted. Memory limited without swap. >>> > >>> > WARNING: An illegal reflective access operation has occurred >>> > >>> > WARNING: Illegal reflective access by >>> org.apache.hadoop.security.authentication.util.KerberosUtil >>> (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method >>> sun.security.krb5.Config.getInstance() >>> > >>> > WARNING: Please consider reporting this to the maintainers of >>> org.apache.hadoop.security.authentication.util.KerberosUtil >>> > >>> > W
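One detail from Marathon that may help with the "random port I can not know beforehand" concern above: Marathon injects each mapped host port into the task environment, so the host port mapped to container port 8081 is available at runtime. A hedged sketch, assuming that mapping is the first entry in portMappings and therefore exposed as $PORT0 (adjust the index to the actual mapping order):

  -Drest.address=$HOST   # advertise the agent host instead of the container hostname
  -Drest.port=$PORT0     # advertise the host port Marathon mapped to 8081

Whether the Mesos master actually dials the advertised Web UI address, as the question asks, is not confirmed anywhere in this thread; this only removes the unresolvable container hostname from what gets advertised.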
Re: Unable to connect to Mesos on mesos-appmaster.sh start
Thanks, Matthias! There are lots of apps deployed to the Mesos cluster, the task manager itself is deployed to Mesos via Marathon. In the Mesos log I can see the Job manager agent starting, but no error messages related to it. As you say, TaskManagers don't even have the chance to get confused about variables, since the Job Manager can not connect to the Mesos master to tell it to start the Task Managers. Thanks, Javier On Tue, Sep 28, 2021 at 7:59 AM Matthias Pohl wrote: > Hi Javier, > I don't see anything that's configured in the wrong way based on the > jobmanager logs you've provided. Have you been able to deploy other > applications to this Mesos cluster? Do the Mesos master logs reveal > anything? The variable resolution on the TaskManager side is a valid > concern shared by Roman since it's easy to run into such an issue. But the > JobManager logs indicate that the JobManager is not able to contact the > Mesos master. Hence, I'd assume that it's not related to the TaskManagers > not coming up. > > Best, > Matthias > > On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan > wrote: > >> Hi, >> >> No additional ports need to be open as far as I know. >> >> Probably, $HOSTNAME is substituted for something not resolvable on TMs? >> >> Please also make sure that the following gets executed before >> mesos-appmaster.sh: >> export HADOOP_CLASSPATH=$(hadoop classpath) >> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so >> (as per the documentation you linked) >> >> Regards, >> Roman >> >> On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas wrote: >> > >> > I am trying to start Flink 1.13.2 on Mesos following the instructions in >> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/ >> and using Marathon to deploy a Docker image with both the Flink and my >> binaries. >> > >> > My entrypoint for the Docker image is: >> > >> > >> > /opt/flink/bin/mesos-appmaster.sh \ >> > >> > -Djobmanager.rpc.address=$HOSTNAME \ >> > >> > -Dmesos.resourcemanager.framework.user=flink \ >> > >> > -Dmesos.master=10.0.18.246:5050 \ >> > >> > -Dmesos.resourcemanager.tasks.cpus=6 >> > >> > >> > >> > When mesos-appmaster.sh starts, in the stderr I see this: >> > >> > >> > I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3 >> > >> > I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered on agent >> f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090 >> > >> > I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker >> executor on 10.0.20.177 >> > >> > I0927 16:50:32.311394 801345 executor.cpp:186] Starting task >> tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0 >> > >> > WARNING: Your kernel does not support swap limit capabilities or the >> cgroup is not mounted. Memory limited without swap.
>> > >> > WARNING: An illegal reflective access operation has occurred >> > >> > WARNING: Illegal reflective access by >> org.apache.hadoop.security.authentication.util.KerberosUtil >> (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method >> sun.security.krb5.Config.getInstance() >> > >> > WARNING: Please consider reporting this to the maintainers of >> org.apache.hadoop.security.authentication.util.KerberosUtil >> > >> > WARNING: Use --illegal-access=warn to enable warnings of further >> illegal reflective access operations >> > >> > WARNING: All illegal access operations will be denied in a future >> release >> > >> > I0927 16:50:43.622053 237 sched.cpp:232] Version: 1.7.3 >> > >> > I0927 16:50:43.624439 328 sched.cpp:336] New master detected at >> master@10.0.18.246:5050 >> > >> > I0927 16:50:43.624779 328 sched.cpp:356] No credentials provided. >> Attempting to register without authentication >> > >> > >> > where the "New master detected" line is promising. >> > >> > However, on the Flink UI I see only the jobmanager started, and there >> are no task managers. Getting into the Docker container, I see this in the >> log: >> > >> > WARN org.apache.flink.mesos.scheduler.ConnectionMonitor - Unable to >> connect to Mesos; still trying... >> > >> > >> > I have verified that from the container I can access the Mesos >> container 10
Re: Unable to connect to Mesos on mesos-appmaster.sh start
Thanks, Roman! Looking at the log, it seems that the TaskManager can resolve $HOSTNAME to its own hostname (07a6b681ee0f), as seen in these lines: 2021-09-27 22:02:41.067 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - -Djobmanager.rpc.address=*07a6b681ee0f* 2021-09-27 22:02:43.025 [main] INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Rest endpoint listening at *07a6b681ee0f*:8081 2021-09-27 22:02:43.025 [main] INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - http:// *07a6b681ee0f*:8081 was granted leadership with leaderSessionID=---- 2021-09-27 22:02:43.026 [main] INFO org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint - Web frontend listening at http://*07a6b681ee0f*:8081. I am deploying to Mesos with Marathon, so I have no way other than $HOSTNAME to indicate the host that will execute mesos-appmaster.sh The environment variables are set, this is what I can see if I hop into the Docker container: root@07a6b681ee0f:/opt/flink# echo $HADOOP_CLASSPATH /opt/flink/hadoop-3.2.2/etc/hadoop:/opt/flink/hadoop-3.2.2/share/hadoop/common/lib/*:/opt/flink/hadoop-3.2.2/share/hadoop/common/*:/opt/flink/hadoop-3.2.2/share/hadoop/hdfs:/opt/flink/hadoop-3.2.2/share/hadoop/hdfs/lib/*:/opt/flink/hadoop-3.2.2/share/hadoop/hdfs/*:/opt/flink/hadoop-3.2.2/share/hadoop/mapreduce/lib/*:/opt/flink/hadoop-3.2.2/share/hadoop/mapreduce/*:/opt/flink/hadoop-3.2.2/share/hadoop/yarn:/opt/flink/hadoop-3.2.2/share/hadoop/yarn/lib/*:/opt/flink/hadoop-3.2.2/share/hadoop/yarn/*:/opt/flink/lib root@07a6b681ee0f:/opt/flink# echo $MESOS_NATIVE_JAVA_LIBRARY /usr/lib/libmesos.so On Tue, Sep 28, 2021 at 5:45 AM Roman Khachatryan wrote: > Hi, > > No additional ports need to be open as far as I know. > > Probably, $HOSTNAME is substituted for something not resolvable on TMs? > > Please also make sure that the following gets executed before > mesos-appmaster.sh: > export HADOOP_CLASSPATH=$(hadoop classpath) > export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so > (as per the documentation you linked) > > Regards, > Roman > > On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas wrote: > > > > I am trying to start Flink 1.13.2 on Mesos following the instructions in > https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/ > and using Marathon to deploy a Docker image with both the Flink and my > binaries. > > > > My entrypoint for the Docker image is: > > > > > > /opt/flink/bin/mesos-appmaster.sh \ > > > > -Djobmanager.rpc.address=$HOSTNAME \ > > > > -Dmesos.resourcemanager.framework.user=flink \ > > > > -Dmesos.master=10.0.18.246:5050 \ > > > > -Dmesos.resourcemanager.tasks.cpus=6 > > > > > > > > When mesos-appmaster.sh starts, in the stderr I see this: > > > > > > I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3 > > > > I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered on agent > f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090 > > > > I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker > executor on 10.0.20.177 > > > > I0927 16:50:32.311394 801345 executor.cpp:186] Starting task > tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0 > > > > WARNING: Your kernel does not support swap limit capabilities or the > cgroup is not mounted. Memory limited without swap.
> > > > WARNING: An illegal reflective access operation has occurred > > > > WARNING: Illegal reflective access by > org.apache.hadoop.security.authentication.util.KerberosUtil > (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method > sun.security.krb5.Config.getInstance() > > > > WARNING: Please consider reporting this to the maintainers of > org.apache.hadoop.security.authentication.util.KerberosUtil > > > > WARNING: Use --illegal-access=warn to enable warnings of further illegal > reflective access operations > > > > WARNING: All illegal access operations will be denied in a future release > > > > I0927 16:50:43.622053 237 sched.cpp:232] Version: 1.7.3 > > > > I0927 16:50:43.624439 328 sched.cpp:336] New master detected at > master@10.0.18.246:5050 > > > > I0927 16:50:43.624779 328 sched.cpp:356] No credentials provided. > Attempting to register without authentication > > > > > > where the "New master detected" line is promising. > > > > However, on the Flink UI I see only the jobmanager started, and there > are no task managers. Getting into the Docker container, I see this in the > log: > > > > WARN org.apache.flink.mesos.schedu
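Matthias's suggestion elsewhere in this thread ties these logs together: Marathon's $HOST resolves to the agent's IP, while $HOSTNAME resolves to the container hostname (07a6b681ee0f here), which nothing outside the container can resolve. A sketch of the entrypoint with that one substitution, otherwise identical to the one quoted above (whether it alone fixes the Mesos connection is an assumption):

  /opt/flink/bin/mesos-appmaster.sh \
    -Djobmanager.rpc.address=$HOST \
    -Dmesos.resourcemanager.framework.user=flink \
    -Dmesos.master=10.0.18.246:5050 \
    -Dmesos.resourcemanager.tasks.cpus=6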
Unable to connect to Mesos on mesos-appmaster.sh start
I am trying to start Flink 1.13.2 on Mesos following the instructions in https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/ and using Marathon to deploy a Docker image with both the Flink and my binaries. My entrypoint for the Docker image is: /opt/flink/bin/mesos-appmaster.sh \ -Djobmanager.rpc.address=$HOSTNAME \ -Dmesos.resourcemanager.framework.user=flink \ -Dmesos.master=10.0.18.246:5050 \ -Dmesos.resourcemanager.tasks.cpus=6 When mesos-appmaster.sh starts, in the stderr I see this: I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3 I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered on agent f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090 I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker executor on 10.0.20.177 I0927 16:50:32.311394 801345 executor.cpp:186] Starting task tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0 WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap. WARNING: An illegal reflective access operation has occurred WARNING: Illegal reflective access by org.apache.hadoop.security.authentication.util.KerberosUtil (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method sun.security.krb5.Config.getInstance() WARNING: Please consider reporting this to the maintainers of org.apache.hadoop.security.authentication.util.KerberosUtil WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations WARNING: All illegal access operations will be denied in a future release I0927 16:50:43.622053 237 sched.cpp:232] Version: 1.7.3 I0927 16:50:43.624439 328 sched.cpp:336] New master detected at master@10.0.18.246:5050 I0927 16:50:43.624779 328 sched.cpp:356] No credentials provided. Attempting to register without authentication where the "New master detected" line is promising. However, on the Flink UI I see only the jobmanager started, and there are no task managers. Getting into the Docker container, I see this in the log: WARN org.apache.flink.mesos.scheduler.ConnectionMonitor - Unable to connect to Mesos; still trying... I have verified that from the container I can access the Mesos master at 10.0.18.246:5050 Does any other port besides the web UI port 5050 need to be open for mesos-appmaster to connect with the Mesos master? In the appmaster log (attached) I see one exception that I don't know if it is related to the Mesos connection problem: java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.
at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448) at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419) at org.apache.hadoop.util.Shell.<clinit>(Shell.java:496) at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79) at org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1555) at org.apache.hadoop.security.SecurityUtil.getLogSlowLookupsEnabled(SecurityUtil.java:497) at org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:90) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:289) at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:277) at org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:833) at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:803) at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:676) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.base/java.lang.reflect.Method.invoke(Unknown Source) at org.apache.flink.runtime.util.EnvironmentInformation.getHadoopUser(EnvironmentInformation.java:215) at org.apache.flink.runtime.util.EnvironmentInformation.logEnvironmentInfo(EnvironmentInformation.java:432) at org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:95) I am not trying (yet) to run in high availability mode, so I am not sure if I need to have HADOOP_HOME set or not, but I don't see anything about HADOOP_HOME in the Flink docs. Any tips on how I can fix my Docker+Marathon+Mesos environment so Flink can connect to my Mesos master? Thanks, Javier Vegas [attachment: flink--mesos-appmaster-6c49aa87e1d4.log]
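Roman's checklist from later in the thread, folded into a sketch of a Docker entrypoint wrapper (the libmesos path matches the one echoed in the container above; hadoop being on the PATH is an assumption):

  #!/bin/sh
  # make the Hadoop classes and the native Mesos library visible to Flink
  export HADOOP_CLASSPATH=$(hadoop classpath)
  export MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/libmesos.so
  exec /opt/flink/bin/mesos-appmaster.sh "$@"

Note the HADOOP_HOME FileNotFoundException in the trace above is thrown while Hadoop merely logs environment info; with HADOOP_CLASSPATH set it is often harmless when no HDFS or HA storage is in use, but that reading is an assumption, not something confirmed in this thread.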