Re: Custom Prometheus metrics disappeared in 1.16.2 => 1.17.1 upgrade

2023-12-04 Thread Javier Vegas
The reason is simple: I migrated to Flink a project that already had
Prometheus metrics integrated.

Thanks,

Javier

On Tue, Oct 3, 2023 at 15:51, Mason Chen () wrote:
>
> Hi Javier,
>
> Is there a particular reason why you aren't leveraging Flink's metric API?
> It seems that functionality was internal to the PrometheusReporter
> implementation, and your use case should have continued working if it had
> depended on Flink's metric API.
>
> Best,
> Mason
>
> On Thu, Sep 28, 2023 at 2:51 AM Javier Vegas  wrote:
>>
>> Thanks! I saw the first change but missed the third one, which is the one
>> that most probably explains my problem. Most likely the metrics
>> I was sending with the twitter/finagle statsReceiver ended up in the
>> singleton default registry and were exposed by Flink with all the
>> other Flink metrics, but now that Flink uses its own registry I have
>> no idea where my custom metrics end up.
>>
>>
>> On Wed, Sep 27, 2023 at 4:56, Kenan Kılıçtepe () wrote:
>> >
>> > Have you checked the metric changes in 1.17?
>> >
>> > From release notes 1.17:
>> > https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-1.17/
>> >
>> > Metric Reporters #
>> > Only support reporter factories for instantiation #
>> > FLINK-24235 #
>> > Configuring reporters by their class is no longer supported. Reporter 
>> > implementations must provide a MetricReporterFactory, and all 
>> > configurations must be migrated to such a factory.
>> >
>> > UseLogicalIdentifier makes datadog consider metric as custom #
>> > FLINK-30383 #
>> > The Datadog reporter now adds a “flink.” prefix to metric identifiers if 
>> > “useLogicalIdentifier” is enabled. This is required for these metrics to 
>> > be recognized as Flink metrics, not custom ones.
>> >
>> > Use separate Prometheus CollectorRegistries #
>> > FLINK-30020 #
>> > The PrometheusReporters now use a separate CollectorRegistry for each 
>> > reporter instance instead of the singleton default registry. This 
>> > generally shouldn’t impact setups, but it may break code that indirectly 
>> > interacts with the reporter via the singleton instance (e.g., a test 
>> > trying to assert what metrics are reported).
>> >
>> >
>> >
>> > On Wed, Sep 27, 2023 at 11:11 AM Javier Vegas  wrote:
>> >>
>> >> I implemented some custom Prometheus metrics that were working on
>> >> 1.16.2, with my configuration
>> >>
>> >> metrics.reporter.prom.factory.class:
>> >> org.apache.flink.metrics.prometheus.PrometheusReporterFactory
>> >> metrics.reporter.prom.port: 
>> >>
>> >> I could see both Flink metrics and my custom metrics on port  of
>> >> my task managers
>> >>
>> After upgrading to 1.17.1, using the same configuration, I can see
>> only the Flink metrics on port  of the task managers;
>> the custom metrics are getting lost somewhere.
>> >>
>> >> The release notes for 1.17 mention
>> >> https://issues.apache.org/jira/browse/FLINK-24235
>> >> that removes instantiating reporters by name and forces using a
>> >> factory, which I was already doing in 1.16.2. Do I need to do
>> >> anything extra after those changes so my metrics are aggregated with
>> >> the Flink ones?
>> >>
>> >> I am also seeing this error message on application startup (which I
>> >> was already seeing in 1.16.2): "Multiple implementations of the same
>> >> reporter were found in 'lib' and/or 'plugins' directories for
>> >> org.apache.flink.metrics.prometheus.PrometheusReporterFactory. It is
>> >> recommended to remove redundant reporter JARs to resolve used
>> >> versions' ambiguity." Could that also explain the missing metrics?
>> >>
>> >> Thanks,
>> >>
>> >> Javier Vegas


Re: Custom Prometheus metrics disappeared in 1.16.2 => 1.17.1 upgrade

2023-09-28 Thread Javier Vegas
Thanks! I saw the first change but missed the third one, which is the one
that most probably explains my problem. Most likely the metrics
I was sending with the twitter/finagle statsReceiver ended up in the
singleton default registry and were exposed by Flink with all the
other Flink metrics, but now that Flink uses its own registry I have
no idea where my custom metrics end up.
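For comparison, here is a minimal sketch (class and metric names invented) of
what registering a counter through Flink's metric API looks like in Scala.
Metrics registered this way live in Flink's own MetricRegistry, so they get
exported by whichever PrometheusReporter instance is configured, no matter
which CollectorRegistry that reporter uses internally:

import org.apache.flink.api.common.functions.RichMapFunction
import org.apache.flink.configuration.Configuration
import org.apache.flink.metrics.Counter

class CountingMapper extends RichMapFunction[String, String] {

  // Created in open() so the counter is registered once per task instance.
  @transient private var eventsSeen: Counter = _

  override def open(parameters: Configuration): Unit = {
    // Counters obtained from the RuntimeContext go through Flink's metric
    // system, so the FLINK-30020 registry change cannot hide them.
    eventsSeen = getRuntimeContext.getMetricGroup.counter("eventsSeen")
  }

  override def map(value: String): String = {
    eventsSeen.inc()
    value
  }
}

Metrics registered directly with the Prometheus client's singleton
CollectorRegistry.defaultRegistry (which is presumably what the finagle stats
bridge did) bypass Flink's metric system entirely, which would match the
behavior described above.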


On Wed, Sep 27, 2023 at 4:56, Kenan Kılıçtepe () wrote:
>
> Have you checked the metric changes in 1.17?
>
> From release notes 1.17:
> https://nightlies.apache.org/flink/flink-docs-master/release-notes/flink-1.17/
>
> Metric Reporters #
> Only support reporter factories for instantiation #
> FLINK-24235 #
> Configuring reporters by their class is no longer supported. Reporter 
> implementations must provide a MetricReporterFactory, and all configurations 
> must be migrated to such a factory.
>
> UseLogicalIdentifier makes datadog consider metric as custom #
> FLINK-30383 #
> The Datadog reporter now adds a “flink.” prefix to metric identifiers if 
> “useLogicalIdentifier” is enabled. This is required for these metrics to be 
> recognized as Flink metrics, not custom ones.
>
> Use separate Prometheus CollectorRegistries #
> FLINK-30020 #
> The PrometheusReporters now use a separate CollectorRegistry for each 
> reporter instance instead of the singleton default registry. This generally 
> shouldn’t impact setups, but it may break code that indirectly interacts with 
> the reporter via the singleton instance (e.g., a test trying to assert what 
> metrics are reported).
>
>
>
> On Wed, Sep 27, 2023 at 11:11 AM Javier Vegas  wrote:
>>
>> I implemented some custom Prometheus metrics that were working on
>> 1.16.2, with my configuration
>>
>> metrics.reporter.prom.factory.class:
>> org.apache.flink.metrics.prometheus.PrometheusReporterFactory
>> metrics.reporter.prom.port: 
>>
>> I could see both Flink metrics and my custom metrics on port  of
>> my task managers
>>
>> After upgrading to 1.17.1, using the same configuration, I can see
>> only the Flink metrics on port  of the task managers;
>> the custom metrics are getting lost somewhere.
>>
>> The release notes for 1.17 mention
>> https://issues.apache.org/jira/browse/FLINK-24235
>> that removes instantiating reporters by name and forces using a
>> factory, which I was already doing in 1.16.2. Do I need to do
>> anything extra after those changes so my metrics are aggregated with
>> the Flink ones?
>>
>> I am also seeing this error message on application startup (which I
>> was already seeing in 1.16.2): "Multiple implementations of the same
>> reporter were found in 'lib' and/or 'plugins' directories for
>> org.apache.flink.metrics.prometheus.PrometheusReporterFactory. It is
>> recommended to remove redundant reporter JARs to resolve used
>> versions' ambiguity." Could that also explain the missing metrics?
>>
>> Thanks,
>>
>> Javier Vegas


Re: Custom Prometheus metrics disappeared in 1.16.2 => 1.17.1 upgrade

2023-09-27 Thread Javier Vegas
Some more details on my problem:

1. The "Multiple implementations" problem was because I had the
metrics-prometheus jar both in the plugins and lib directories. I
tried putting it in only one,
and in both cases (plugins or lib), the result was the same, I got
only Flink metrics on my prom port.
2. My application extends
https://github.com/twitter/twitter-server/blob/develop/server/src/main/scala/com/twitter/server/TwitterServer.scala
and I was sending
my custom stats via the statsReceiver provided there
https://github.com/twitter/twitter-server/blob/33b3fb00635c4ab1102eb0c062cde6bb132d80c0/server/src/main/scala/com/twitter/server/Stats.scala#L14
3. I realized that my reporter configuration was:

metrics.reporter.prom.class:
org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.factory.class:
org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port: 

So I guess in 1.16.2 the Prometheus reporter could have been
instantiated by class name, and perhaps that somehow allowed my
metrics to be merged with the Flink
ones, but in 1.17.1 the reporter gets instantiated by the factory and
somehow that renders my metrics invisible. Do you have any suggestions
so my metrics work as in 1.16.2?
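For reference, my understanding of the FLINK-24235 migration is that the
class-based line should simply be dropped, leaving only the factory form
(the placeholder stands in for the real port value):

metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port: <port>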

Thanks again, Javier Vegas


On Tue, Sep 26, 2023 at 19:42, Javier Vegas () wrote:
>
> I implemented some custom Prometheus metrics that were working on
> 1.16.2, with my configuration
>
> metrics.reporter.prom.factory.class:
> org.apache.flink.metrics.prometheus.PrometheusReporterFactory
> metrics.reporter.prom.port: 
>
> I could see both Flink metrics and my custom metrics on port  of
> my task managers
>
> After upgrading to 1.17.1, using the same configuration, I can see
> only the Flink metrics on port  of the task managers;
> the custom metrics are getting lost somewhere.
>
> The release notes for 1.17 mention
> https://issues.apache.org/jira/browse/FLINK-24235
> that removes instantiating reporters by name and forces using a
> factory, which I was already doing in 1.16.2. Do I need to do
> anything extra after those changes so my metrics are aggregated with
> the Flink ones?
>
> I am also seeing this error message on application startup (which I
> was already seeing in 1.16.2): "Multiple implementations of the same
> reporter were found in 'lib' and/or 'plugins' directories for
> org.apache.flink.metrics.prometheus.PrometheusReporterFactory. It is
> recommended to remove redundant reporter JARs to resolve used
> versions' ambiguity." Could that also explain the missing metrics?
>
> Thanks,
>
> Javier Vegas


Custom Prometheus metrics disappeared in 1.16.2 => 1.17.1 upgrade

2023-09-26 Thread Javier Vegas
I implemented some custom Prometheus metrics that were working on
1.16.2, with my configuration

metrics.reporter.prom.factory.class:
org.apache.flink.metrics.prometheus.PrometheusReporterFactory
metrics.reporter.prom.port: 

I could see both Flink metrics and my custom metrics on port  of
my task managers

After upgrading to 1.17.1, using the same configuration, I can see
only the Flink metrics on port  of the task managers;
the custom metrics are getting lost somewhere.

The release notes for 1.17 mention
https://issues.apache.org/jira/browse/FLINK-24235
that removes instantiating reporters by name and forces using a
factory, which I was already doing in 1.16.2. Do I need to do
anything extra after those changes so my metrics are aggregated with
the Flink ones?

I am also seeing this error message on application startup (which I
was already seeing in 1.16.2): "Multiple implementations of the same
reporter were found in 'lib' and/or 'plugins' directories for
org.apache.flink.metrics.prometheus.PrometheusReporterFactory. It is
recommended to remove redundant reporter JARs to resolve used
versions' ambiguity." Could that also explain the missing metrics?

Thanks,

Javier Vegas


Re: Error upgrading operator CRD

2023-07-07 Thread Javier Vegas
Additionally, when I try the next step I get another unexpected error:

✗ helm -n flink template flink-kubernetes-operator
flink-operator-repo/flink-kubernetes-operator
Error: failed to fetch
https://downloads.apache.org/flink/flink-kubernetes-operator-1.0.1/flink-kubernetes-operator-1.0.1-helm.tgz
: 404 Not Found

Not sure why helm wants to fetch 1.0.1, because I have 1.3.1 installed (but
that would have resulted in a 404 too, since that downloads site only has
versions 1.4.0 and 1.5.0 of the operator).
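One thing worth trying (my guess: the local Helm repo index is stale and
still points at 1.0.1) is refreshing the index before templating; recent
Helm 3 releases accept a repo name here, older ones just update every
configured repo:

helm repo update flink-operator-repo
helm -n flink template flink-kubernetes-operator flink-operator-repo/flink-kubernetes-operator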

On Fri, Jul 7, 2023 at 10:59, Javier Vegas () wrote:

> Somehow I was able in the past to upgrade the CRD when I upgraded the
> operator to 1.2 and 1.3, but trying now to upgrade to 1.4 I am getting the
> following error:
>
> ✗ kubectl replace -f
> helm/flink-kubernetes-operator/crds/flinksessionjobs.flink.apache.org-v1.yml
>
> error: the path
> "helm/flink-kubernetes-operator/crds/flinksessionjobs.flink.apache.org-v1.yml"
> does not exist
>
> Do I need to pass more arguments to kubectl for it to find the path? How
> can I verify the CRD path?
>
> Thanks,
>
> Javier Vegas
>


Error upgrading operator CRD

2023-07-07 Thread Javier Vegas
Somehow I was able in the past to upgrade the CRD when I upgraded the
operator to 1.2 and 1.3, but trying now to upgrade to 1.4 I am getting the
following error:

✗ kubectl replace -f
helm/flink-kubernetes-operator/crds/flinksessionjobs.flink.apache.org-v1.yml

error: the path
"helm/flink-kubernetes-operator/crds/flinksessionjobs.flink.apache.org-v1.yml"
does not exist

Do I need to pass more arguments to kubectl for it to find the path? How
can I verify the CRD path?
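My best guess so far is that the path is relative to a local checkout of the
flink-kubernetes-operator repository, in which case the full sequence would
look like this sketch (assuming a release-1.4.0 tag exists in that repo):

git clone --depth 1 --branch release-1.4.0 https://github.com/apache/flink-kubernetes-operator.git
cd flink-kubernetes-operator
kubectl replace -f helm/flink-kubernetes-operator/crds/flinkdeployments.flink.apache.org-v1.yml
kubectl replace -f helm/flink-kubernetes-operator/crds/flinksessionjobs.flink.apache.org-v1.yml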

Thanks,

Javier Vegas


Re: DuplicateJobSubmissionException on restart after taskmanagers crash

2023-01-20 Thread Javier Vegas
My issue is described in https://issues.apache.org/jira/browse/FLINK-21928
where it says it was fixed in 1.14, but I am still seeing the problem.
Although it also says:

"Additionally, it is still required that the user cleans up the
corresponding HA entries for the running jobs registry because these
entries won't be reliably cleaned up when encountering the situation
described by FLINK-21928 <https://issues.apache.org/jira/browse/FLINK-21928>."


so I guess I need to do some manual cleanup of my S3 HA data before
restarting
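For anyone hitting the same thing, this is the cleanup I have in mind, as a
sketch only (the label selector is Flink's documented convention for its
Kubernetes HA ConfigMaps; list before deleting to be safe, and since Flink
1.15 the JobResultStore entries under high-availability.storageDir, in S3
here, may also need removing):

kubectl -n <namespace> get configmaps -l app=<cluster-id>,configmap-type=high-availability
kubectl -n <namespace> delete configmaps -l app=<cluster-id>,configmap-type=high-availability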

On Fri, Jan 20, 2023 at 4:58, Javier Vegas () wrote:

>
> I have a Flink app (Flink 1.16.0, deployed to Kubernetes via operator
> 1.3.1 and using Kubernetes HighAvailability with storage in S3) that
> depends on multiple Thrift services for data queries. When one of those
> services is down (or throws exceptions) the Flink job managers end up
> crashing and only the task managers remain up. Once the dependencies are
> fixed, when I try to restart the Flink app I end up with a
> "DuplicateJobSubmissionException: Job has already been submitted" (see
> below for detailed log) and the task managers never start. The only
> solution I have found is to delete the deployment from Kubernetes and then
> deploy again as a new job.
>
> 1) Is there a better way to handle failures on dependencies than letting
> task managers crash and keep job managers up, and restart after
> dependencies are fixed?
> 2) If not, is there a way to handle the DuplicateJobSubmissionException so
> the Flink app can be restarted without having to uninstall it first?
>
> Thanks,
>
> Javier Vegas
>
>
> org.apache.flink.util.FlinkException: Failed to execute job
> at
> org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2203)
> Caused by:
> org.apache.flink.runtime.client.DuplicateJobSubmissionException: Job has
> already been submitted.
> at
> org.apache.flink.runtime.client.DuplicateJobSubmissionException.ofGloballyTerminated(DuplicateJobSubmissionException.java:35)
> at
> org.apache.flink.runtime.dispatcher.Dispatcher.submitJob(Dispatcher.java:449)
> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
> Method)
> at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown
> Source)
> at
> java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown
> Source)
> at java.base/java.lang.reflect.Method.invoke(Unknown Source)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:309)
> at
> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:307)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:222)
> at
> org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:84)
> at
> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:168)
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
> at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
> at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
> at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
> at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
> at akka.actor.Actor.aroundReceive(Actor.scala:537)
> at akka.actor.Actor.aroundReceive$(Actor.scala:535)
> at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
> at akka.actor.ActorCell.invoke(ActorCell.scala:548)
> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
> at akka.dispatch.Mailbox.run(Mailbox.scala:231)
> at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
> ... 5 more
> Exception thrown in main on startup
>
>
>


DuplicateJobSubmissionException on restart after taskmanagers crash

2023-01-20 Thread Javier Vegas
I have a Flink app (Flink 1.16.0, deployed to Kubernetes via operator 1.3.1
and using Kubernetes HighAvailability with storage in S3) that depends on
multiple Thrift services for data queries. When one of those services is
down (or throws exceptions) the Flink job managers end up crashing and only
the task managers remain up. Once the dependencies are fixed, when I try to
restart the Flink app I end up with a "DuplicateJobSubmissionException: Job
has already been submitted" (see below for detailed log) and the task
managers never start. The only solution I have found is to delete the
deployment from Kubernetes and then deploy again as a new job.

1) Is there a better way to handle failures on dependencies than letting
task managers crash and keep job managers up, and restart after
dependencies are fixed?
2) If not, is there a way to handle the DuplicateJobSubmissionException so
the Flink app can be restarted without having to uninstall it first?

Thanks,

Javier Vegas


org.apache.flink.util.FlinkException: Failed to execute job
at
org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.executeAsync(StreamExecutionEnvironment.java:2203)
Caused by: org.apache.flink.runtime.client.DuplicateJobSubmissionException:
Job has already been submitted.
at
org.apache.flink.runtime.client.DuplicateJobSubmissionException.ofGloballyTerminated(DuplicateJobSubmissionException.java:35)
at
org.apache.flink.runtime.dispatcher.Dispatcher.submitJob(Dispatcher.java:449)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown
Source)
at
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown
Source)
at java.base/java.lang.reflect.Method.invoke(Unknown Source)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:309)
at
org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:307)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:222)
at
org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:84)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:168)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
at akka.actor.Actor.aroundReceive(Actor.scala:537)
at akka.actor.Actor.aroundReceive$(Actor.scala:535)
at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
at akka.actor.ActorCell.invoke(ActorCell.scala:548)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
at akka.dispatch.Mailbox.run(Mailbox.scala:231)
at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
... 5 more
Exception thrown in main on startup


"Exec Failure java.io.EOFException null" message before taskmanagers crash

2022-10-25 Thread Javier Vegas
I have a Flink 1.15 app running in Kubernetes (v1.22) deployed via operator
1.2, using S3-based HA with 2 jobmanagers and 2 taskmanagers.

The app consumes a high-traffic Kafka topic and writes to a Cassandra
database. It had been running fine for 4 days, but at some point
the taskmanagers crashed. Looking through my logs, the oldest messages I
see are "Exec Failure java.io.EOFException null" from both taskmanagers at
exactly the same time, but there is no associated stack trace. After that,
the taskmanagers try to restart, but I see another "Exec Failure
java.io.EOFException null" message from one jobmanager, and shortly after
the jobmanager sets the newly started taskmanagers to a failed state with
the message below. This repeats a couple more times until finally no more
taskmanagers try to come up, and the jobmanagers sit there throwing
RecipientUnreachableExceptions because there are no more
taskmanagers around.

Any idea what that "Exec Failure java.io.EOFException null" message is
about, or what can I do to debug it if it happens again?

Thanks,

Javier Vegas

message from jobmanager
Source: event-activity (2/4)#0 (090ef433d97011f8f595885a9bb39a28) switched
from RUNNING to FAILED with failure cause:
org.apache.flink.util.FlinkException: Disconnect from JobManager
responsible for b7c72bb5b7570a7c981f761afa1b7ea6. at
org.apache.flink.runtime.taskexecutor.TaskExecutor.disconnectJobManagerConnection(TaskExecutor.java:1679)
at
org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$closeJob$18(TaskExecutor.java:1660)
at java.base/java.util.Optional.ifPresent(Unknown Source) at
org.apache.flink.runtime.taskexecutor.TaskExecutor.closeJob(TaskExecutor.java:1658)
at
org.apache.flink.runtime.taskexecutor.TaskExecutor.onStop(TaskExecutor.java:462)
at
org.apache.flink.runtime.rpc.RpcEndpoint.internalCallOnStop(RpcEndpoint.java:214)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.lambda$terminate$0(AkkaRpcActor.java:568)
at
org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor$StartedState.terminate(AkkaRpcActor.java:567)
at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleControlMessage(AkkaRpcActor.java:191)
at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) at
akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) at
scala.PartialFunction.applyOrElse(PartialFunction.scala:123) at
scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) at
akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) at
akka.actor.Actor.aroundReceive(Actor.scala:537) at
akka.actor.Actor.aroundReceive$(Actor.scala:535) at
akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220) at
akka.actor.ActorCell.receiveMessage(ActorCell.scala:580) at
akka.actor.ActorCell.invoke(ActorCell.scala:548) at
akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270) at
akka.dispatch.Mailbox.run(Mailbox.scala:231) at
akka.dispatch.Mailbox.exec(Mailbox.scala:243) at
java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown Source) at
java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown
Source) at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)
at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown Source) at
java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
Caused by: org.apache.flink.util.FlinkException: The TaskExecutor is
shutting down. at
org.apache.flink.runtime.taskexecutor.TaskExecutor.onStop(TaskExecutor.java:456)
... 25 more


Re: HA not working in standalone mode for operator 1.2

2022-10-13 Thread Javier Vegas
The jars that my build creates have a version number, something
like myapp-2.2.11.jar. I am lazy and want to avoid having to update the
jarURI param (required in native mode) every time I deploy a new version of
my app, and instead just update the Docker image I am using. Another solution
would be to keep using native mode and modify my build system to strip the
version number from the packaged jar.
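For example, with sbt-assembly the artifact name can be pinned so it no
longer carries the version (a sketch; assemblyJarName is the sbt-assembly
1.x key, and the jar name is made up):

// build.sbt
assembly / assemblyJarName := "myapp.jar"

With that, jarURI could stay at local:///opt/flink/lib/myapp.jar across
releases.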

Thanks!

Javier

On Thu, Oct 13, 2022 at 13:22, Gyula Fóra () wrote:

> Before we dive further into this, can you please explain the jarURI problem
> you are trying to solve by switching to standalone?
>
> The native mode should work well in almost any setup.
>
> Gyula
>
> On Thu, 13 Oct 2022 at 21:41, Javier Vegas  wrote:
>
>> Hi, I have an S3 HA Flink app that works as expected deployed via
>> operator 1.2 in native mode, but I am seeing errors when switching to
>> standalone mode (which I want to do mostly to save me having to set jarURI
>> explicitly).
>> I can see the jobmanager writes the JobGraph to S3, and in the web UI I
>> can see it creates the jobs, but the taskmanager sits there doing nothing,
>> as if it could not communicate with the jobmanager. I can also see that the
>> operator has created two services, while native mode creates only the rest
>> service. After a while, the taskmanager closes with the following exception:
>>
>> org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
>> Could not register at the ResourceManager within the specified maximum
>> registration duration 30 ms. This indicates a problem with this
>> instance. Terminating now.
>>
>> at
>> org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1474)
>>
>> at
>> org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$16(TaskExecutor.java:1459)
>>
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRunAsync$4(AkkaRpcActor.java:443)
>>
>> at
>> org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)
>>
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:443)
>>
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:213)
>>
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163)
>>
>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)
>>
>> at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)
>>
>> at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)
>>
>> at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)
>>
>> at
>> akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)
>>
>> at
>> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)
>>
>> at
>> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
>>
>> at
>> scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)
>>
>> at akka.actor.Actor.aroundReceive(Actor.scala:537)
>>
>> at akka.actor.Actor.aroundReceive$(Actor.scala:535)
>>
>> at
>> akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)
>>
>> at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)
>>
>> at akka.actor.ActorCell.invoke(ActorCell.scala:548)
>>
>> at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)
>>
>> at akka.dispatch.Mailbox.run(Mailbox.scala:231)
>>
>> at akka.dispatch.Mailbox.exec(Mailbox.scala:243)
>>
>> at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown
>> Source)
>>
>> at
>> java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown
>> Source)
>>
>> at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown
>> Source)
>>
>> at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown
>> Source)
>> at
>> java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown Source)
>>
>>


Re: Validation error trying to use standalone mode with operator 1.2.0

2022-10-13 Thread Javier Vegas
Thanks, that fixed the problem! Sadly I am now running into a different
problem with S3 HA when running in standalone mode; see
https://lists.apache.org/thread/rf62htkr6govpr41fj3br4mzplsg9vg8

Cheers,

Javier

On Fri, Oct 7, 2022 at 22:02, Gyula Fóra () wrote:

> Hi!
>
> Seems like you still have an older version of the CRD installed for the
> FlinkDeployment, which doesn’t contain the newly introduced mode setting.
>
> You can check
>
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.2/docs/operations/upgrade/
> for the upgrade process.
>
> Cheers
> Gyula
>
> On Sat, 8 Oct 2022 at 00:00, Javier Vegas  wrote:
>
>>
>> I am following the operator quickstart
>>
>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.2/docs/try-flink-kubernetes-operator/quick-start/
>>
>> kubectl create -f
>> https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.2/examples/basic.yaml
>>
>>
>> works fine, but
>>
>>
>> kubectl create -f
>> https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.2/examples/basic-reactive.yaml
>>
>>
>> which has the mode: standalone setting
>>
>>
>> gives me this error:
>>
>>
>> error: error validating "
>> https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.2/examples/basic-reactive.yaml":
>> error validating data: ValidationError(FlinkDeployment.spec): unknown field
>> "mode" in org.apache.flink.v1beta1.FlinkDeployment.spec; if you choose to
>> ignore these errors, turn validation off with --validate=false
>>
>>
>>
>>


HA not working in standalone mode for operator 1.2

2022-10-13 Thread Javier Vegas
Hi, I have an S3 HA Flink app that works as expected deployed via
operator 1.2 in native mode, but I am seeing errors when switching to
standalone mode (which I want to do mostly to save me having to set jarURI
explicitly).
I can see the jobmanager writes the JobGraph to S3, and in the web UI I
can see it creates the jobs, but the taskmanager sits there doing nothing,
as if it could not communicate with the jobmanager. I can also see that the
operator has created two services, while native mode creates only the rest
service. After a while, the taskmanager closes with the following exception:

org.apache.flink.runtime.taskexecutor.exceptions.RegistrationTimeoutException:
Could not register at the ResourceManager within the specified maximum
registration duration 30 ms. This indicates a problem with this
instance. Terminating now.

at
org.apache.flink.runtime.taskexecutor.TaskExecutor.registrationTimeout(TaskExecutor.java:1474)

at
org.apache.flink.runtime.taskexecutor.TaskExecutor.lambda$startRegistrationTimeout$16(TaskExecutor.java:1459)

at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRunAsync$4(AkkaRpcActor.java:443)

at
org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:68)

at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRunAsync(AkkaRpcActor.java:443)

at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:213)

at
org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163)

at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24)

at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20)

at scala.PartialFunction.applyOrElse(PartialFunction.scala:123)

at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122)

at
akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20)

at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171)

at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)

at
scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172)

at akka.actor.Actor.aroundReceive(Actor.scala:537)

at akka.actor.Actor.aroundReceive$(Actor.scala:535)

at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220)

at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580)

at akka.actor.ActorCell.invoke(ActorCell.scala:548)

at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270)

at akka.dispatch.Mailbox.run(Mailbox.scala:231)

at akka.dispatch.Mailbox.exec(Mailbox.scala:243)

at java.base/java.util.concurrent.ForkJoinTask.doExec(Unknown
Source)

at
java.base/java.util.concurrent.ForkJoinPool$WorkQueue.topLevelExec(Unknown
Source)

at java.base/java.util.concurrent.ForkJoinPool.scan(Unknown Source)

at java.base/java.util.concurrent.ForkJoinPool.runWorker(Unknown
Source)
at java.base/java.util.concurrent.ForkJoinWorkerThread.run(Unknown
Source)


Validation error trying to use standalone mode with operator 1.2.0

2022-10-07 Thread Javier Vegas
I am following the operator quickstart
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.2/docs/try-flink-kubernetes-operator/quick-start/

kubectl create -f
https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.2/examples/basic.yaml


works fine, but


kubectl create -f
https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.2/examples/basic-reactive.yaml


which has the mode: standalone setting


gives me this error:


error: error validating "
https://raw.githubusercontent.com/apache/flink-kubernetes-operator/release-1.2/examples/basic-reactive.yaml":
error validating data: ValidationError(FlinkDeployment.spec): unknown field
"mode" in org.apache.flink.v1beta1.FlinkDeployment.spec; if you choose to
ignore these errors, turn validation off with --validate=false


Missing logback-console.xml when submitting job via operator

2022-09-29 Thread Javier Vegas
My Flink app uses logback for logging; when submitting it from the operator
I get this error:

ERROR in ch.qos.logback.classic.joran.JoranConfigurator@7364985f - Could
not open URL [file:/opt/flink/conf/logback-console.xml].
java.io.FileNotFoundException: /opt/flink/conf/logback-console.xml (No such
file or directory)

https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/advanced/logging/#configuring-logback
says

"The Flink distribution ships with the following logback configuration
files in the conf directory, which are used automatically if logback is
enabled"

How do I "enable" logback? I don't see any relevant configuration param to
enable it, either in flink-conf.yaml or in the operator config.

I am also noticing that the log4j-console.properties that ends up in my
deployed app's configmap is the one from the operator Helm chart
https://github.com/apache/flink-kubernetes-operator/tree/main/helm/flink-kubernetes-operator/conf
not
https://github.com/apache/flink/tree/master/flink-dist/src/main/flink-bin/conf
(where logback-console.xml lives). Should the operator also have a
logback-console.xml in the Helm chart?
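One avenue I still want to try (an unverified assumption on my side: the keys
of the FlinkDeployment logConfiguration map become file names under
/opt/flink/conf) is shipping a logback config through the spec itself:

spec:
  logConfiguration:
    "logback-console.xml": |
      <configuration>
        <appender name="console" class="ch.qos.logback.core.ConsoleAppender">
          <encoder>
            <pattern>%d{yyyy-MM-dd HH:mm:ss,SSS} %-5level %logger{60} - %msg%n</pattern>
          </encoder>
        </appender>
        <root level="INFO">
          <appender-ref ref="console"/>
        </root>
      </configuration>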


Re: Classloading issues with Flink Operator / Kubernetes Native

2022-09-21 Thread Javier Vegas
Version 1.15.2; there is no /opt/flink/usrlib folder created.

On Tue, Sep 20, 2022 at 20:53, Yaroslav Tkachenko () wrote:

> Interesting, do you see the /opt/flink/usrlib folder created as well?
> Also, what Flink version do you use?
>
> Thanks.
>
> On Tue, Sep 20, 2022 at 4:04 PM Javier Vegas  wrote:
>
>>
>> jarURI: local:///opt/flink/lib/MYJARNAME.jar
>>
>> On Tue, Sep 20, 2022 at 0:25, Yaroslav Tkachenko (yaros...@goldsky.com) wrote:
>>
>>> Hi Javier,
>>>
>>> What do you specify as a jarURI?
>>>
>>> On Mon, Sep 19, 2022 at 3:56 PM Javier Vegas  wrote:
>>>
>>>> I am doing the same thing (migrating from standalone to operator in
>>>> native mode) and also have my jar in /opt/flink/lib but for me it works
>>>> fine, no class loading errors on app startup.
>>>>
>>>> On Fri, Sep 16, 2022 at 9:28, Yaroslav Tkachenko (yaros...@goldsky.com) wrote:
>>>>
>>>>> Application mode. I've done a bit more research and created
>>>>> https://issues.apache.org/jira/browse/FLINK-29288, planning to work
>>>>> on a PR today.
>>>>>
>>>>> TLDR: currently Flink operator always creates /opt/flink/usrlib folder
>>>>> and forces you to specify the jarURI parameter, which is passed as
>>>>> pipeline.jars / pipeline.classpaths configuration options. This leads to
>>>>> the jar being loaded twice by different classloaders (system and user
>>>>> ones).
>>>>>
>>>>> On Fri, Sep 16, 2022 at 2:30 AM Matthias Pohl 
>>>>> wrote:
>>>>>
>>>>>> Are you deploying the job in session or application mode? Could you
>>>>>> provide the stacktrace. I'm wondering whether that would be helpful to 
>>>>>> pin
>>>>>> a code location for further investigation.
>>>>>> So far, I couldn't come up with a definite answer about placing the
>>>>>> jar in the lib directory. Initially, I would have thought that it's fine
>>>>>> considering that all dependencies are included and the job jar itself 
>>>>>> ends
>>>>>> up on the user classpath. I'm curious whether Chesnay (CC'd) has an 
>>>>>> answer
>>>>>> to that one.
>>>>>>
>>>>>> On Tue, Sep 13, 2022 at 1:40 AM Yaroslav Tkachenko <
>>>>>> yaros...@goldsky.com> wrote:
>>>>>>
>>>>>>> Hey everyone,
>>>>>>>
>>>>>>> I’m migrating a Flink Kubernetes standalone job to the Flink
>>>>>>> operator (with Kubernetes native mode).
>>>>>>>
>>>>>>> I have a lot of classloading issues when trying to run with
>>>>>>> the operator in native mode. For example, I have a Postgres driver as a
>>>>>>> dependency (I can confirm the files are included in the uber jar), but I
>>>>>>> still get "java.sql.SQLException: No suitable driver found for
>>>>>>> jdbc:postgresql:..." exception.
>>>>>>>
>>>>>>> In the Kubernetes standalone setup my uber jar is placed in the
>>>>>>> /opt/flink/lib folder, this is what I specify as "jarURI" in the 
>>>>>>> operator
>>>>>>> config. Is this supported? Should I only be using /opt/flink/usrlib?
>>>>>>>
>>>>>>> Thanks for any suggestions.
>>>>>>>
>>>>>>


Re: Classloading issues with Flink Operator / Kubernetes Native

2022-09-20 Thread Javier Vegas
jarURI: local:///opt/flink/lib/MYJARNAME.jar

On Tue, Sep 20, 2022 at 0:25, Yaroslav Tkachenko () wrote:

> Hi Javier,
>
> What do you specify as a jarURI?
>
> On Mon, Sep 19, 2022 at 3:56 PM Javier Vegas  wrote:
>
>> I am doing the same thing (migrating from standalone to operator in
>> native mode) and also have my jar in /opt/flink/lib but for me it works
>> fine, no class loading errors on app startup.
>>
>> On Fri, Sep 16, 2022 at 9:28, Yaroslav Tkachenko (yaros...@goldsky.com) wrote:
>>
>>> Application mode. I've done a bit more research and created
>>> https://issues.apache.org/jira/browse/FLINK-29288, planning to work on
>>> a PR today.
>>>
>>> TLDR: currently Flink operator always creates /opt/flink/usrlib folder
>>> and forces you to specify the jarURI parameter, which is passed as
>>> pipeline.jars / pipeline.classpaths configuration options. This leads to
>>> the jar being loaded twice by different classloaders (system and user
>>> ones).
>>>
>>> On Fri, Sep 16, 2022 at 2:30 AM Matthias Pohl 
>>> wrote:
>>>
>>>> Are you deploying the job in session or application mode? Could you
>>>> provide the stacktrace. I'm wondering whether that would be helpful to pin
>>>> a code location for further investigation.
>>>> So far, I couldn't come up with a definite answer about placing the jar
>>>> in the lib directory. Initially, I would have thought that it's fine
>>>> considering that all dependencies are included and the job jar itself ends
>>>> up on the user classpath. I'm curious whether Chesnay (CC'd) has an answer
>>>> to that one.
>>>>
>>>> On Tue, Sep 13, 2022 at 1:40 AM Yaroslav Tkachenko <
>>>> yaros...@goldsky.com> wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>> I’m migrating a Flink Kubernetes standalone job to the Flink operator
>>>>> (with Kubernetes native mode).
>>>>>
>>>>> I have a lot of classloading issues when trying to run with
>>>>> the operator in native mode. For example, I have a Postgres driver as a
>>>>> dependency (I can confirm the files are included in the uber jar), but I
>>>>> still get "java.sql.SQLException: No suitable driver found for
>>>>> jdbc:postgresql:..." exception.
>>>>>
>>>>> In the Kubernetes standalone setup my uber jar is placed in the
>>>>> /opt/flink/lib folder, this is what I specify as "jarURI" in the operator
>>>>> config. Is this supported? Should I only be using /opt/flink/usrlib?
>>>>>
>>>>> Thanks for any suggestions.
>>>>>
>>>>


Re: Classloading issues with Flink Operator / Kubernetes Native

2022-09-19 Thread Javier Vegas
I am doing the same thing (migrating from standalone to operator in native
mode) and also have my jar in /opt/flink/lib but for me it works fine, no
class loading errors on app startup.

On Fri, Sep 16, 2022 at 9:28, Yaroslav Tkachenko () wrote:

> Application mode. I've done a bit more research and created
> https://issues.apache.org/jira/browse/FLINK-29288, planning to work on a
> PR today.
>
> TLDR: currently Flink operator always creates /opt/flink/usrlib folder and
> forces you to specify the jarURI parameter, which is passed as
> pipeline.jars / pipeline.classpaths configuration options. This leads to
> the jar being loaded twice by different classloaders (system and user
> ones).
>
> On Fri, Sep 16, 2022 at 2:30 AM Matthias Pohl 
> wrote:
>
>> Are you deploying the job in session or application mode? Could you
>> provide the stacktrace. I'm wondering whether that would be helpful to pin
>> a code location for further investigation.
>> So far, I couldn't come up with a definite answer about placing the jar
>> in the lib directory. Initially, I would have thought that it's fine
>> considering that all dependencies are included and the job jar itself ends
>> up on the user classpath. I'm curious whether Chesnay (CC'd) has an answer
>> to that one.
>>
>> On Tue, Sep 13, 2022 at 1:40 AM Yaroslav Tkachenko 
>> wrote:
>>
>>> Hey everyone,
>>>
>>> I’m migrating a Flink Kubernetes standalone job to the Flink operator
>>> (with Kubernetes native mode).
>>>
>>> I have a lot of classloading issues when trying to run with the operator
>>> in native mode. For example, I have a Postgres driver as a dependency (I
>>> can confirm the files are included in the uber jar), but I still get
>>> "java.sql.SQLException: No suitable driver found for jdbc:postgresql:..."
>>> exception.
>>>
>>> In the Kubernetes standalone setup my uber jar is placed in the
>>> /opt/flink/lib folder, this is what I specify as "jarURI" in the operator
>>> config. Is this supported? Should I only be using /opt/flink/usrlib?
>>>
>>> Thanks for any suggestions.
>>>
>>


Re: serviceAccount permissions issue for high availability in operator 1.1

2022-09-11 Thread Javier Vegas
Hi, Yang!

When you say the operator uses native k8s integration by default, does that
mean there is a way to change that to use standalone K8s? I haven't seen
anything about that in the docs, besides a mention that standalone support
is coming in version 1.2 of the operator.

Thanks,

Javier


On Thu, Sep 8, 2022, 22:50 Yang Wang  wrote:

> Since the flink-kubernetes-operator uses the native K8s integration[1] by
> default, you need to grant permissions for pods and deployments as well as
> ConfigMaps.
>
> You could find more information about the RBAC here[2].
>
> [1].
> https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/native_kubernetes/
> [2].
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-release-1.1/docs/operations/rbac/
>
> Best,
> Yang
>
> On Wed, Sep 7, 2022 at 04:17, Javier Vegas  wrote:
>
>> I am migrating a HA standalone Kubernetes app to use the Flink operator.
>> The HA store is S3 using IRSA so the app needs to run with a serviceAccount
>> that is authorized to access S3. In standalone mode HA worked once I gave
>> the account permissions to edit configMaps. But when trying the operator
>> with the custom serviceAccount, I am getting this error:
>>
>> io.fabric8.kubernetes.client.KubernetesClientException: Failure
>> executing: GET at:
>> https://172.20.0.1/apis/apps/v1/namespaces/MYNAMESPACE/deployments/MYAPPNAME.
>> Message: Forbidden!Configured service account doesn't have access. Service
>> account may have been revoked. deployments.apps "MYAPPNAME" is forbidden:
>> User "system:serviceaccount:MYNAMESPACE:MYSERVICEACCOUNT" cannot get
>> resource "deployments" in API group "apps" in the namespace "MYNAMESPACE".
>>
>>
>> Does the serviceAccount need additional permissions besides configMap
>> edit to be able to run HA using the operator?
>>
>> Thanks,
>>
>> Javier Vegas
>>
>


serviceAccount permissions issue for high availability in operator 1.1

2022-09-06 Thread Javier Vegas
I am migrating a HA standalone Kubernetes app to use the Flink operator.
The HA store is S3 using IRSA so the app needs to run with a serviceAccount
that is authorized to access S3. In standalone mode HA worked once I gave
the account permissions to edit configMaps. But when trying the operator
with the custom serviceAccount, I am getting this error:

io.fabric8.kubernetes.client.KubernetesClientException: Failure executing:
GET at:
https://172.20.0.1/apis/apps/v1/namespaces/MYNAMESPACE/deployments/MYAPPNAME.
Message: Forbidden!Configured service account doesn't have access. Service
account may have been revoked. deployments.apps "MYAPPNAME" is forbidden:
User "system:serviceaccount:MYNAMESPACE:MYSERVICEACCOUNT" cannot get
resource "deployments" in API group "apps" in the namespace "MYNAMESPACE".

Does the serviceAccount need additional permissions besides configMap edit
to be able to run HA using the operator?
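For reference, this is the kind of extra Role I expect is needed, sketched
purely from the error above (the operator's RBAC documentation should be
treated as authoritative, and the verbs may be narrowable):

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: flink-ha
  namespace: MYNAMESPACE
rules:
  - apiGroups: [""]
    resources: ["pods", "configmaps"]
    verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch"]

bound to MYSERVICEACCOUNT with a matching RoleBinding.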

Thanks,

Javier Vegas


Re: How to open a Prometheus metrics port on the rest service when using the Kubernetes operator?

2022-09-05 Thread Javier Vegas
What I would need is to set

ports:
  - name: metrics
    port:
    protocol: TCP

in the generated YAML for the appname-rest service, which properly
aggregates the metrics from the pods, but I cannot figure out how to do
that, either from the job deployment file or by modifying the operator
templates in the Helm chart. Is there any way I can modify the ports in the
Flink rest service?
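An alternative I am considering is leaving the generated rest service alone
and defining a separate Service that selects the same taskmanager pods (a
sketch; the selector labels are my assumption about what the operator puts on
the pods, so verify with kubectl get pods --show-labels first):

apiVersion: v1
kind: Service
metadata:
  name: appname-metrics
spec:
  selector:
    app: appname
    component: taskmanager
  ports:
    - name: metrics
      port: 9249
      targetPort: 9249
      protocol: TCP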


Thanks,


Javier Vegas



On Sun, Sep 4, 2022 at 1:59, Javier Vegas () wrote:

> Hi, Biao!
>
> Thanks for the fast response! Setting that in the podTemplate opens the
> metrics port on the pods, but unfortunately not on the rest service. Not
> sure if that is standard procedure, but my Prometheus setup scrapes the
> metrics port on services but not pods. On my previous non-operator
> standalone setup, the metrics port on the service was aggregating all the
> pods' metrics and then Prometheus was scraping that, so I was trying to
> reproduce that by opening the port on the rest service.
>
>
>
> On Sun, Sep 4, 2022 at 1:03, Geng Biao () wrote:
>
>> Hi Javier,
>>
>>
>>
>> You can use podTemplate to expose the port in the flink containers.
>>
>> Here is a snippet:
>>
>> spec:
>>   flinkVersion: v1_15
>>   flinkConfiguration:
>>     state.savepoints.dir: file:///flink-data/flink-savepoints
>>     state.checkpoints.dir: file:///flink-data/flink-checkpoints
>>     *metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory*
>>   serviceAccount: flink
>>   podTemplate:
>>     metadata:
>>       annotations:
>>         prometheus.io/path: /metrics
>>         prometheus.io/port: "9249"
>>         prometheus.io/scrape: "true"
>>     spec:
>>       serviceAccount: flink
>>       containers:
>>         - name: flink-main-container
>>           volumeMounts:
>>             - mountPath: /flink-data
>>               name: flink-volume
>>           *ports:*
>>             *- containerPort: 9249*
>>               *name: metrics*
>>               *protocol: TCP*
>>       volumes:
>>         - name: flink-volume
>>           emptyDir: {}
>>
>> The bold lines are about how to specify the metric reporter and expose the
>> metric. The annotations are not required if you use PodMonitor or
>> ServiceMonitor. Hope it can help!
>>
>>
>>
>> Best,
>>
>> Biao Geng
>>
>>
>>
>> *From: *Javier Vegas 
>> *Date: *Sunday, September 4, 2022 at 10:19 AM
>> *To: *user 
>> *Subject: *How to open a Prometheus metrics port on the rest service
>> when using the Kubernetes operator?
>>
>> I am migrating my Flink app from standalone Kubernetes to the Kubernetes
>> operator. It is going well, but I ran into a problem: I cannot figure out
>> how to open a Prometheus metrics port on the rest service to collect all my
>> custom metrics from the task managers. Note that this is different from the
>> "How to Enable Prometheus" instructions at
>> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging/#how-to-enable-prometheus-example
>> That example collects the operator pod metrics, but what I am trying
>> to do is open a port on the rest service to make my job metrics available
>> to Prometheus.
>>
>>
>>
>> Thanks,
>>
>>
>>
>> Javier Vegas
>>
>


Re: How to open a Prometheus metrics port on the rest service when using the Kubernetes operator?

2022-09-04 Thread Javier Vegas
Hi, Biao!

Thanks for the fast response! Setting that in the podTemplate opens the
metrics port on the pods, but unfortunately not on the rest service. Not
sure if that is standard procedure, but my Prometheus setup scrapes the
metrics port on services but not pods. On my previous non-operator
standalone setup, the metrics port on the service was aggregating all the
pods' metrics and then Prometheus was scraping that, so I was trying to
reproduce that by opening the port on the rest service.



On Sun, Sep 4, 2022 at 1:03, Geng Biao () wrote:

> Hi Javier,
>
>
>
> You can use podTemplate to expose the port in the flink containers.
>
> Here is a snippet:
>
> spec:
>   flinkVersion: v1_15
>   flinkConfiguration:
>     state.savepoints.dir: file:///flink-data/flink-savepoints
>     state.checkpoints.dir: file:///flink-data/flink-checkpoints
>     *metrics.reporter.prom.factory.class: org.apache.flink.metrics.prometheus.PrometheusReporterFactory*
>   serviceAccount: flink
>   podTemplate:
>     metadata:
>       annotations:
>         prometheus.io/path: /metrics
>         prometheus.io/port: "9249"
>         prometheus.io/scrape: "true"
>     spec:
>       serviceAccount: flink
>       containers:
>         - name: flink-main-container
>           volumeMounts:
>             - mountPath: /flink-data
>               name: flink-volume
>           *ports:*
>             *- containerPort: 9249*
>               *name: metrics*
>               *protocol: TCP*
>       volumes:
>         - name: flink-volume
>           emptyDir: {}
>
> The bold lines are about how to specify the metric reporter and expose the
> metric. The annotations are not required if you use PodMonitor or
> ServiceMonitor. Hope it can help!
>
>
>
> Best,
>
> Biao Geng
>
>
>
> *From: *Javier Vegas 
> *Date: *Sunday, September 4, 2022 at 10:19 AM
> *To: *user 
> *Subject: *How to open a Prometheus metrics port on the rest service when
> using the Kubernetes operator?
>
> I am migrating my Flink app from standalone Kubernetes to the Kubernetes
> operator. It is going well, but I ran into a problem: I cannot figure out
> how to open a Prometheus metrics port on the rest service to collect all my
> custom metrics from the task managers. Note that this is different from the
> "How to Enable Prometheus" instructions at
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging/#how-to-enable-prometheus-example
> That example collects the operator pod metrics, but what I am trying
> to do is open a port on the rest service to make my job metrics available
> to Prometheus.
>
>
>
> Thanks,
>
>
>
> Javier Vegas
>


How to open a Prometheus metrics port on the rest service when using the Kubernetes operator?

2022-09-03 Thread Javier Vegas
I am migrating my Flink app from standalone Kubernetes to the Kubernetes
operator. It is going well, but I ran into a problem: I cannot figure out
how to open a Prometheus metrics port on the rest service to collect all my
custom metrics from the task managers. Note that this is different from the
"How to Enable Prometheus" instructions at
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/operations/metrics-logging/#how-to-enable-prometheus-example
That example collects the operator pod metrics, but what I am trying
to do is open a port on the rest service to make my job metrics available
to Prometheus.

Thanks,

Javier Vegas


Re: NodePort conflict for multiple HA application-mode standalone Kubernetes deploys in same namespace

2022-07-24 Thread Javier Vegas
Partial answer to my own question: after removing the hardcoded `nodePort:
30081` entry from jobmanager-rest-service.yaml, Kubernetes assigns random
ports, so there are no conflicts and multiple Flink application-mode jobs
can be deployed. However, the jobs seem to communicate with each other:
when launching the second job, the first job's taskmanagers start executing
tasks sent by the second job's jobmanager, and the second job's taskmanagers
execute tasks from both jobmanagers.
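For reference, the relevant part of jobmanager-rest-service.yaml after
dropping the fixed port looks like this (shape taken from the standalone
Kubernetes docs linked in the quoted message below):

apiVersion: v1
kind: Service
metadata:
  name: flink-jobmanager-rest
spec:
  type: NodePort
  ports:
    - name: rest
      port: 8081
      targetPort: 8081
  selector:
    app: flink
    component: jobmanager

My guess on the cross-talk is that both deployments keep the docs' shared
app: flink labels (and, with HA enabled, possibly the same
kubernetes.cluster-id), so taskmanagers of one job can register with the
other job's jobmanager; giving each job unique labels and a unique cluster-id
should isolate them, but I have not verified that yet.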

On Fri, Jul 22, 2022 at 12:03, Javier Vegas () wrote:

>
> I am deploying a high-availability Flink job to Kubernetes in application
> mode using Flink's standalone K8s deployment
> https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/standalone/kubernetes/
> All goes well when I deploy a job, but if I want to deploy a second
> application-mode Flink job in the same K8s namespace I get a "spec.ports[0].nodePort:
> Invalid value: 30081: provided port is already allocated" error. Is there
> a way that nodePort can be allocated dynamically, or another way around this
> (using LoadBalancer or Ingress instead of NodePort in
> jobmanager-rest-service.yaml?) besides hard-coding different nodePorts
> for different jobs running in the same namespace?
>
> Thanks,
>
> Javier Vegas
>


NodePort conflict for multiple HA application-mode standalone Kubernetes deploys in same namespace

2022-07-22 Thread Javier Vegas
I am deploying a high-availability Flink job to Kubernetes in application
mode using Flink's standalone K8s deployment
https://nightlies.apache.org/flink/flink-docs-release-1.15/docs/deployment/resource-providers/standalone/kubernetes/
All goes well when I deploy a job, but if I want to deploy a second
application-mode Flink job in the same K8s namespace I get a
"spec.ports[0].nodePort:
Invalid value: 30081: provided port is already allocated" error. Is there a
way that nodePort can be allocated dynamically, or another way around this
(using LoadBalancer or Ingress instead of NodePort in
jobmanager-rest-service.yaml?) besides hard-coding different nodePorts for
different jobs running in the same namespace?

Thanks,

Javier Vegas


standalone mode support in the kubernetes operator (FLIP-225)

2022-07-13 Thread Javier Vegas
Hello! The operator docs
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/overview/
say "The Operator does not support Standalone Kubernetes
<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/resource-providers/standalone/kubernetes/>
deployments yet" and mention
https://cwiki.apache.org/confluence/display/FLINK/FLIP-225%3A+Implement+standalone+mode+support+in+the+kubernetes+operator
as a "what's next" step. Is there a timeline for that to be released?

Thanks,

Javier Vegas


Re: IllegalArgumentException: URI is not hierarchical error when initializing jobmanager in cluster

2022-02-02 Thread Javier Vegas
Thanks, Robert!

I tried the classloader.resolve-order: parent-first option but ran into
SLF4J "Failed to load class org.slf4j.impl.StaticLoggerBinder" errors
(because I use logback, so I followed
https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/advanced/logging/#configuring-logback
and removed log4j-slf4j-impl from the classpath). But putting all my classes
in lib/ instead of usrlib/ fixed that problem, and everything now runs
fine. Thanks!

On Fri, Jan 28, 2022 at 6:11, Robert Metzger () wrote:

> Hi Javier,
>
> I suspect that TwitterServer is using some classloading / dependency
> injection / service loading "magic" that is causing this.
> I would try to find out, either by attaching a remote debugger (should be
> possible when executing in cluster mode locally) or by adding log
> statements in the code, what the URI it's trying to load looks like.
>
> On the cluster, Flink is using separate classloaders for the base flink
> system, and the user code (as opposed to executing in the IDE, where
> everything is loaded from the same loader). Check out this page and try out
> the config arguments:
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>
>
>
> On Wed, Jan 26, 2022 at 4:13 AM Javier Vegas  wrote:
>
>> I am porting a Scala service to Flink in order to make it more scalable
>> via running it in a cluster. All my Scala services extend a base Service
>> class that extends TwitterServer (
>> https://github.com/twitter/twitter-server/blob/develop/server/src/main/scala/com/twitter/server/TwitterServer.scala)
>> and that base class contains a lot of logic about resource initialization,
>> logging, stats and error handling, monitoring, etc that I want to keep
>> using in my class. I ported my logic to Flink sources and sinks, and
>> everything worked fine when I ran my class in single JVM mode either from
>> sbt or my IDE: Flink's jobmanager and taskmanagers start and run my app.
>> But when I try to run my application in cluster mode, when launching my
>> class with "./bin/standalone-job.sh start --job-classname" the
>> jobmanager runs into a "IllegalArgumentException: URI is not hierarchical"
>> error on initialization, apparently because TwitterServer is trying to load
>> something from the class path (see attached full log).
>>
>> Is there anything I can do to run a class that extends TwitterServer in a
>> Flink cluster? I have tried making my class not extend it and it worked
>> fine, but I really want to keep using all the common infrastructure logic
>> that I have in my base class that extends TwitterServer.
>>
>> Thanks!
>>
>
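
For reference, the remote-debugger route Robert mentions can be wired in
through the JVM options in flink-conf.yaml; a sketch (the port is
arbitrary, and suspend=y would instead pause the JVM until a debugger
attaches):

# flink-conf.yaml
env.java.opts: -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005

The IDE debugger then attaches to the jobmanager host on port 5005.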


Mesos deploy starts Mesos framework but does not start job managers

2021-10-20 Thread Javier Vegas
I am trying to deploy a Flink cluster via Mesos following
https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/
(I know Mesos support has been deprecated, and I am planning to migrate my
deployment tools to Kubernetes, but for now I am stuck using Mesos). To
deploy, I am using a custom Docker image that contains both Flink and my
user binaries. The command I am using to start the cluster is

/opt/flink/bin/mesos-appmaster.sh \
  -Djobmanager.rpc.address=$HOST \
  -Dmesos.resourcemanager.framework.user=flink \
  -Dmesos.resourcemanager.framework.name=timeline-flink-populator \
  -Dmesos.master=10.0.25.139:5050 \
  -Dmesos.resourcemanager.tasks.cpus=4 \
  -Dmesos.resourcemanager.tasks.container.type=docker \
  -Dmesos.resourcemanager.tasks.container.image.name=docker.strava.com/strava/flink:jv-mesos \
  -Dtaskmanager.numberOfTaskSlots=4 ;

mesos-appmaster.sh is able to start a Mesos framework and a Flink job
manager, but fails to start task managers. Looking in the Mesos syslog I
see that the Mesos framework was sending offers that were being declined
very quickly, and the agents ended in LOST state. I am attaching all the
relevant lines in the syslog.

Any ideas what the problem could be or what else I could check to see what
is happening?

Thanks,

Javier Vegas


syslog
Description: Binary data
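
One way to inspect the framework's state (and whether it is receiving and
declining offers) is the Mesos master's state endpoint; a sketch, assuming
jq is available and using the framework name from the command above (field
names per the Mesos v0 HTTP API):

curl -s http://10.0.25.139:5050/state \
  | jq '.frameworks[]
        | select(.name == "timeline-flink-populator")
        | {id, active, used_resources, offered_resources}'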


Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-29 Thread Javier Vegas
Thanks again, Matthias!

Putting -Djobmanager.rpc.address=$HOST and -Djobmanager.rpc.port=$PORT0
as params for appmaster.sh, I see in the log that they resolve to the
correct values

-Djobmanager.rpc.address=10.0.23.35 -Djobmanager.rpc.port=31009

but a bit later the appmaster dies with this new error. It is unclear what
address it is trying to bind; I added explicit params -Drest.bind-port=8081
and -Drest.port=8081 in case jobmanager.rpc.port was somehow interfering,
but that didn't help.
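
For what it's worth, a sketch of separating the advertised address from
the bind address for the REST endpoint, in case the BindException below
comes from trying to bind the Marathon-mapped host IP inside the bridged
container ($PORT1 assumes a second Marathon port mapping; untested):

# rest.address/rest.port: what is advertised to clients
# rest.bind-address/rest.bind-port: what the server binds inside the container
/opt/flink/bin/mesos-appmaster.sh \
  -Djobmanager.rpc.address=$HOST \
  -Djobmanager.rpc.port=$PORT0 \
  -Drest.address=$HOST \
  -Drest.port=$PORT1 \
  -Drest.bind-address=0.0.0.0 \
  -Drest.bind-port=8081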

2021-09-29 10:29:59.845 [main] INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting
MesosSessionClusterEntrypoint down with application status FAILED.
Diagnostics java.net.BindException: Cannot assign requested address
at java.base/sun.nio.ch.Net.bind0(Native Method)
at java.base/sun.nio.ch.Net.bind(Unknown Source)
at java.base/sun.nio.ch.Net.bind(Unknown Source)
at java.base/sun.nio.ch.ServerSocketChannelImpl.bind(Unknown Source)
at 
org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:134)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:550)
at 
org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1334)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:506)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:491)
at 
org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:973)
at 
org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel.bind(AbstractChannel.java:248)
at 
org.apache.flink.shaded.netty4.io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:356)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
at 
org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
at 
org.apache.flink.shaded.netty4.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
at 
org.apache.flink.shaded.netty4.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.base/java.lang.Thread.run(Unknown Source)




On Wed, Sep 29, 2021 at 2:36 AM Matthias Pohl 
wrote:

> The port has its separate configuration parameter jobmanager.rpc.port [1]
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#jobmanager-rpc-port-1
>
> On Wed, Sep 29, 2021 at 10:11 AM Javier Vegas  wrote:
>
>> Matthias, thanks for the suggestion! I changed my jobmanager.rpc.address
>> param from $HOSTNAME to $HOST:$PORT0 which in the log I see resolves
>> properly to the host IP and port mapped to 8081
>>
>> 2021-09-29 07:58:05.452 [main] INFO
>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -
>> -Djobmanager.rpc.address=10.0.22.114:31894
>>
>> which is very promising. But sadly a little bit later appmaster dies with
>> this error:
>>
>> 2021-09-29 07:58:05.648 [main] INFO
>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Initializing
>> cluster services.
>> 2021-09-29 07:58:05.674 [main] INFO
>>  org.apache.flink.runtime.entrypoint.ClusterEntrypoint  - Shutting
>> MesosSessionClusterEntrypoint down with application status FAILED.
>> Diagnostics org.apache.flink.configuration.IllegalConfigurationException:
>> The configured hostname is not valid
>> at
>> org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:179)
>> at
>> org.apache.flink.util.NetUtils.unresolvedHostAndPortToNormalizedString(NetUtils.java:197)
>> at
>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:207)
>> at
>> org.apache.flink.runtime.clusterframework.BootstrapTools.startRemoteActorSystem(BootstrapTools.java:152)
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:370)
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils$AkkaRpcServiceBuilder.createAndStart(AkkaRpcServiceUtils.java:344)
>> at
>> org.apache.flink.runtime.rpc.akka.AkkaRpcServiceUtils.createRemoteRpcService(AkkaRp

Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-29 Thread Javier Vegas
(UserGroupInformation.java:1762)
at
org.apache.flink.runtime.security.contexts.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
at
org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:186)
... 2 common frames omitted
Caused by: java.lang.IllegalArgumentException: null
at org.apache.flink.util.Preconditions.checkArgument(Preconditions.java:122)
at
org.apache.flink.util.NetUtils.unresolvedHostToNormalizedString(NetUtils.java:177)
... 17 common frames omitted



On Wed, Sep 29, 2021 at 12:16 AM Matthias Pohl 
wrote:

> One thing that was puzzling me yesterday when reading your post: Have you
> tried $HOST instead of $HOSTNAME in the Marathon configuration? When I
> played around with Mesos, I remember using HOST to resolve the host's IP
> address instead of the host's name. It could be that the hostname itself
> cannot be resolved to the right IP address. But I struggled to find proper
> documentation to back that up. Only in the recipes section of the Marathon
> docs [1], HOST was used as well.
>
> Matthias
>
> [1]
> https://mesosphere.github.io/marathon/docs/recipes.html#command-executor-health-checks
>
> On Wed, Sep 29, 2021 at 3:37 AM Javier Vegas  wrote:
>
>> Another update:  Looking more carefully in my appmaster log, I see the
>> following
>>
>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>> Registering as new framework.
>>
>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>> --------------------------------------------------------------------
>>
>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Mesos
>> Info:
>>
>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Master
>> URL: 10.0.18.246:5050
>>
>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -  Framework
>> Info:
>>
>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - ID:
>> (none)
>>
>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Name:
>> flink-test
>>
>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Failover
>> Timeout (secs): 604800.0
>>
>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Role:
>> *
>>
>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - 
>> Capabilities:
>> (none)
>>
>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Principal:
>> (none)
>>
>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Host:
>> 311dcf7fd77c
>>
>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  - Web
>> UI: http://311dcf7fd77c:8081
>>
>> 2021-09-29 01:15:39.680 [flink-akka.actor.default-dispatcher-3] INFO
>> o.a.f.m.runtime.clusterframework.MesosResourceManagerDriver  -
>> --------------------------------------------------------------------
>>
>>
>> which is picking up the mesos.master and
>> mesos.resourcemanager.framework.name params I am passing to
>> mesos-appmaster.sh
>>
>>
>> In my Mesos dashboard I can see the framework has been created with the
>> right name, but has no associated agents/tasks to it. So at least Flink has
>> been able to connect to the Mesos master to create the framework
>>
>>
>> Later in the mesos-appmaster log is when I see the Mesos connection
>> errors:
>>
>>
>> 2021-09-29 01:15:39.726 [flink-akka.actor.default-dispatcher-3] DEBUG
>> o.a.f.r.resourcemanager.slotmanager.DeclarativeSlotManager  - Starting
>> the slot manager.
>>
>> 2021-09-29 01:15:39.815 [flink-a

Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-28 Thread Javier Vegas
ever it does later?


One possible issue I see is that the framework is set with web UI at
http://311dcf7fd77c:8081, which cannot be resolved from the Mesos master.
311dcf7fd77c is the result of doing hostname on the Docker container, and
the Mesos master cannot resolve that name. I could try to replace the
Docker container hostname with the Docker host hostname, but the host port
that gets mapped to 8081 on the container is a random port that I cannot
know beforehand. Does the Mesos master try to reach Flink using that Web UI
setting? Could this be the issue causing my connection problem, or is this
a red herring and the problem is a different one?


Thanks,


Javier Vegas

On Tue, Sep 28, 2021 at 10:23 AM Javier Vegas  wrote:

> Thanks, Matthias!
>
> There are lots of apps deployed to the Mesos cluster, the task manager
> itself is deployed to Mesos via Marathon.  In the Mesos log I can see the
> Job manager agent starting, but no error messages related to it. As you
> say, TaskManagers don't even have the chance to get confused about
> variables, since the Job Manager can not connect to the Mesos master to
> tell it to start the Task Managers.
>
> Thanks,
>
> Javier
>
> On Tue, Sep 28, 2021 at 7:59 AM Matthias Pohl 
> wrote:
>
>> Hi Javier,
>> I don't see anything that's configured in the wrong way based on the
>> jobmanager logs you've provided. Have you been able to deploy other
>> applications to this Mesos cluster? Do the Mesos master logs reveal
>> anything? The variable resolution on the TaskManager side is a valid
>> concern shared by Roman since it's easy to run into such an issue. But the
>> JobManager logs indicate that the JobManager is not able to contact the
>> Mesos master. Hence, I'd assume that it's not related to the TaskManagers
>> not coming up.
>>
>> Best,
>> Matthias
>>
>> On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan 
>> wrote:
>>
>>> Hi,
>>>
>>> No additional ports need to be open as far as I know.
>>>
>>> Probably, $HOSTNAME is substituted for something not resolvable on TMs?
>>>
>>> Please also make sure that the following gets executed before
>>> mesos-appmaster.sh:
>>> export HADOOP_CLASSPATH=$(hadoop classpath)
>>> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
>>> (as per the documentation you linked)
>>>
>>> Regards,
>>> Roman
>>>
>>> On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas  wrote:
>>> >
>>> > I am trying to start Flink 1.13.2 on Mesos following the instructions
>>> in
>>> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/
>>> and using Marathon to deploy a Docker image with both the Flink and my
>>> binaries.
>>> >
>>> > My entrypoint for the Docker image is:
>>> >
>>> >
>>> > /opt/flink/bin/mesos-appmaster.sh \
>>> >
>>> >   -Djobmanager.rpc.address=$HOSTNAME \
>>> >
>>> >   -Dmesos.resourcemanager.framework.user=flink \
>>> >
>>> >   -Dmesos.master=10.0.18.246:5050 \
>>> >
>>> >   -Dmesos.resourcemanager.tasks.cpus=6
>>> >
>>> >
>>> >
>>> > When mesos-appmaster.sh starts, in the stderr I see this:
>>> >
>>> >
>>> > I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3
>>> >
>>> > I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered on
>>> agent f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090
>>> >
>>> > I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker
>>> executor on 10.0.20.177
>>> >
>>> > I0927 16:50:32.311394 801345 executor.cpp:186] Starting task
>>> tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0
>>> >
>>> > WARNING: Your kernel does not support swap limit capabilities or the
>>> cgroup is not mounted. Memory limited without swap.
>>> >
>>> > WARNING: An illegal reflective access operation has occurred
>>> >
>>> > WARNING: Illegal reflective access by
>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>> (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method
>>> sun.security.krb5.Config.getInstance()
>>> >
>>> > WARNING: Please consider reporting this to the maintainers of
>>> org.apache.hadoop.security.authentication.util.KerberosUtil
>>> >
>>> > W

Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-28 Thread Javier Vegas
Thanks, Matthias!

There are lots of apps deployed to the Mesos cluster, the task manager
itself is deployed to Mesos via Marathon.  In the Mesos log I can see the
Job manager agent starting, but no error messages related to it. As you
say, TaskManagers don't even have the chance to get confused about
variables, since the Job Manager can not connect to the Mesos master to
tell it to start the Task Managers.

Thanks,

Javier

On Tue, Sep 28, 2021 at 7:59 AM Matthias Pohl 
wrote:

> Hi Javier,
> I don't see anything that's configured in the wrong way based on the
> jobmanager logs you've provided. Have you been able to deploy other
> applications to this Mesos cluster? Do the Mesos master logs reveal
> anything? The variable resolution on the TaskManager side is a valid
> concern shared by Roman since it's easy to run into such an issue. But the
> JobManager logs indicate that the JobManager is not able to contact the
> Mesos master. Hence, I'd assume that it's not related to the TaskManagers
> not coming up.
>
> Best,
> Matthias
>
> On Tue, Sep 28, 2021 at 2:45 PM Roman Khachatryan 
> wrote:
>
>> Hi,
>>
>> No additional ports need to be open as far as I know.
>>
>> Probably, $HOSTNAME is substituted for something not resolvable on TMs?
>>
>> Please also make sure that the following gets executed before
>> mesos-appmaster.sh:
>> export HADOOP_CLASSPATH=$(hadoop classpath)
>> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
>> (as per the documentation you linked)
>>
>> Regards,
>> Roman
>>
>> On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas  wrote:
>> >
>> > I am trying to start Flink 1.13.2 on Mesos following the instructions in
>> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/
>> and using Marathon to deploy a Docker image with both the Flink and my
>> binaries.
>> >
>> > My entrypoint for the Docker image is:
>> >
>> >
>> > /opt/flink/bin/mesos-appmaster.sh \
>> >
>> >   -Djobmanager.rpc.address=$HOSTNAME \
>> >
>> >   -Dmesos.resourcemanager.framework.user=flink \
>> >
>> >   -Dmesos.master=10.0.18.246:5050 \
>> >
>> >   -Dmesos.resourcemanager.tasks.cpus=6
>> >
>> >
>> >
>> > When mesos-appmaster.sh starts, in the stderr I see this:
>> >
>> >
>> > I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3
>> >
>> > I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered on agent
>> f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090
>> >
>> > I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker
>> executor on 10.0.20.177
>> >
>> > I0927 16:50:32.311394 801345 executor.cpp:186] Starting task
>> tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0
>> >
>> > WARNING: Your kernel does not support swap limit capabilities or the
>> cgroup is not mounted. Memory limited without swap.
>> >
>> > WARNING: An illegal reflective access operation has occurred
>> >
>> > WARNING: Illegal reflective access by
>> org.apache.hadoop.security.authentication.util.KerberosUtil
>> (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method
>> sun.security.krb5.Config.getInstance()
>> >
>> > WARNING: Please consider reporting this to the maintainers of
>> org.apache.hadoop.security.authentication.util.KerberosUtil
>> >
>> > WARNING: Use --illegal-access=warn to enable warnings of further
>> illegal reflective access operations
>> >
>> > WARNING: All illegal access operations will be denied in a future
>> release
>> >
>> > I0927 16:50:43.622053   237 sched.cpp:232] Version: 1.7.3
>> >
>> > I0927 16:50:43.624439   328 sched.cpp:336] New master detected at
>> master@10.0.18.246:5050
>> >
>> > I0927 16:50:43.624779   328 sched.cpp:356] No credentials provided.
>> Attempting to register without authentication
>> >
>> >
>> > where the "New master detected" line is promising.
>> >
>> > However, on the Flink UI I see only the jobmanager started, and there
>> are no task managers.  Getting into the Docker container, I see this in the
>> log:
>> >
>> > WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor  - Unable to
>> connect to Mesos; still trying...
>> >
>> >
>> > I have verified that from the container I can access the Mesos
>> container 10

Re: Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-28 Thread Javier Vegas
Thanks, Roman!

Looking at the log, it seems that the TaskManager can resolve $HOSTNAME to its
own hostname (07a6b681ee0f), as seen in these lines:

2021-09-27 22:02:41.067 [main] INFO
org.apache.flink.runtime.entrypoint.ClusterEntrypoint  -
-Djobmanager.rpc.address=*07a6b681ee0f*

2021-09-27 22:02:43.025 [main] INFO
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint  - Rest endpoint
listening at *07a6b681ee0f*:8081

2021-09-27 22:02:43.025 [main] INFO
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint  - http://
*07a6b681ee0f*:8081 was granted leadership with
leaderSessionID=----

2021-09-27 22:02:43.026 [main] INFO
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint  - Web frontend
listening at http://*07a6b681ee0f*:8081.


I am deploying to Mesos with Marathon, so I have no way other than
$HOSTNAME to indicate the host that will execute mesos-appmaster.sh

The environment variables are set, this is what I can see if I hop into the
Docker container:

root@07a6b681ee0f:/opt/flink# echo $HADOOP_CLASSPATH

/opt/flink/hadoop-3.2.2/etc/hadoop:/opt/flink/hadoop-3.2.2/share/hadoop/common/lib/*:/opt/flink/hadoop-3.2.2/share/hadoop/common/*:/opt/flink/hadoop-3.2.2/share/hadoop/hdfs:/opt/flink/hadoop-3.2.2/share/hadoop/hdfs/lib/*:/opt/flink/hadoop-3.2.2/share/hadoop/hdfs/*:/opt/flink/hadoop-3.2.2/share/hadoop/mapreduce/lib/*:/opt/flink/hadoop-3.2.2/share/hadoop/mapreduce/*:/opt/flink/hadoop-3.2.2/share/hadoop/yarn:/opt/flink/hadoop-3.2.2/share/hadoop/yarn/lib/*:/opt/flink/hadoop-3.2.2/share/hadoop/yarn/*:/opt/flink/lib


root@07a6b681ee0f:/opt/flink# echo $MESOS_NATIVE_JAVA_LIBRARY

/usr/lib/libmesos.so
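
A sketch of an image entrypoint that guarantees both variables are set
before the appmaster starts (paths as shown above; the script itself is
illustrative):

#!/bin/sh
export HADOOP_CLASSPATH=$(hadoop classpath)
export MESOS_NATIVE_JAVA_LIBRARY=/usr/lib/libmesos.so
exec /opt/flink/bin/mesos-appmaster.sh "$@"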




On Tue, Sep 28, 2021 at 5:45 AM Roman Khachatryan  wrote:

> Hi,
>
> No additional ports need to be open as far as I know.
>
> Probably, $HOSTNAME is substituted for something not resolvable on TMs?
>
> Please also make sure that the following gets executed before
> mesos-appmaster.sh:
> export HADOOP_CLASSPATH=$(hadoop classpath)
> export MESOS_NATIVE_JAVA_LIBRARY=/path/to/lib/libmesos.so
> (as per the documentation you linked)
>
> Regards,
> Roman
>
> On Mon, Sep 27, 2021 at 7:38 PM Javier Vegas  wrote:
> >
> > I am trying to start Flink 1.13.2 on Mesos following the instructions in
> https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/
> and using Marathon to deploy a Docker image with both the Flink and my
> binaries.
> >
> > My entrypoint for the Docker image is:
> >
> >
> > /opt/flink/bin/mesos-appmaster.sh \
> >
> >   -Djobmanager.rpc.address=$HOSTNAME \
> >
> >   -Dmesos.resourcemanager.framework.user=flink \
> >
> >   -Dmesos.master=10.0.18.246:5050 \
> >
> >   -Dmesos.resourcemanager.tasks.cpus=6
> >
> >
> >
> > When mesos-appmaster.sh starts, in the stderr I see this:
> >
> >
> > I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3
> >
> > I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered on agent
> f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090
> >
> > I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker
> executor on 10.0.20.177
> >
> > I0927 16:50:32.311394 801345 executor.cpp:186] Starting task
> tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0
> >
> > WARNING: Your kernel does not support swap limit capabilities or the
> cgroup is not mounted. Memory limited without swap.
> >
> > WARNING: An illegal reflective access operation has occurred
> >
> > WARNING: Illegal reflective access by
> org.apache.hadoop.security.authentication.util.KerberosUtil
> (file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method
> sun.security.krb5.Config.getInstance()
> >
> > WARNING: Please consider reporting this to the maintainers of
> org.apache.hadoop.security.authentication.util.KerberosUtil
> >
> > WARNING: Use --illegal-access=warn to enable warnings of further illegal
> reflective access operations
> >
> > WARNING: All illegal access operations will be denied in a future release
> >
> > I0927 16:50:43.622053   237 sched.cpp:232] Version: 1.7.3
> >
> > I0927 16:50:43.624439   328 sched.cpp:336] New master detected at
> master@10.0.18.246:5050
> >
> > I0927 16:50:43.624779   328 sched.cpp:356] No credentials provided.
> Attempting to register without authentication
> >
> >
> > where the "New master detected" line is promising.
> >
> > However, on the Flink UI I see only the jobmanager started, and there
> are no task managers.  Getting into the Docker container, I see this in the
> log:
> >
> > WARN  org.apache.flink.mesos.schedu

Unable to connect to Mesos on mesos-appmaster.sh start

2021-09-27 Thread Javier Vegas
I am trying to start Flink 1.13.2 on Mesos following the instructions in
https://nightlies.apache.org/flink/flink-docs-release-1.13/docs/deployment/resource-providers/mesos/
and using Marathon to deploy a Docker image with both the Flink and my
binaries.

My entrypoint for the Docker image is:


/opt/flink/bin/mesos-appmaster.sh \

  -Djobmanager.rpc.address=$HOSTNAME \

  -Dmesos.resourcemanager.framework.user=flink \

  -Dmesos.master=10.0.18.246:5050 \

  -Dmesos.resourcemanager.tasks.cpus=6



When mesos-appmaster.sh starts, in the stderr I see this:


I0927 16:50:32.306691 801308 exec.cpp:164] Version: 1.7.3

I0927 16:50:32.310277 801345 exec.cpp:238] Executor registered on agent
f671d9ee-57f6-4f92-b1b2-3137676f6cdf-S6090

I0927 16:50:32.311120 801355 executor.cpp:130] Registered docker executor
on 10.0.20.177

I0927 16:50:32.311394 801345 executor.cpp:186] Starting task
tl_flink_prod.fb215c64-1fb2-11ec-9ce6-aaa2e9cb6ba0

WARNING: Your kernel does not support swap limit capabilities or the cgroup
is not mounted. Memory limited without swap.

WARNING: An illegal reflective access operation has occurred

WARNING: Illegal reflective access by
org.apache.hadoop.security.authentication.util.KerberosUtil
(file:/opt/flink/lib/flink-shaded-hadoop-2-uber-2.8.3-10.0.jar) to method
sun.security.krb5.Config.getInstance()

WARNING: Please consider reporting this to the maintainers of
org.apache.hadoop.security.authentication.util.KerberosUtil

WARNING: Use --illegal-access=warn to enable warnings of further illegal
reflective access operations

WARNING: All illegal access operations will be denied in a future release

I0927 16:50:43.622053   237 sched.cpp:232] Version: 1.7.3

I0927 16:50:43.624439   328 sched.cpp:336] New master detected at
master@10.0.18.246:5050

I0927 16:50:43.624779   328 sched.cpp:356] No credentials provided.
Attempting to register without authentication


where the "New master detected" line is promising.

However, on the Flink UI I see only the jobmanager started, and there are
no task managers.  Getting into the Docker container, I see this in the log:

WARN  org.apache.flink.mesos.scheduler.ConnectionMonitor  - Unable to
connect to Mesos; still trying...


I have verified that from the container I can access the Mesos container
10.0.18.246:5050


Does any other port besides the web UI port 5050 need to be open for
mesos-appmaster to connect with the Mesos master?
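
Besides 5050, the scheduler driver itself listens on a libprocess port
that the master must be able to connect back to; under bridged Docker
networking that usually means pinning and exposing it explicitly. A
sketch (LIBPROCESS_IP/LIBPROCESS_PORT are libmesos environment variables;
$HOST_IP is a hypothetical variable holding a master-routable address,
and the port value is arbitrary):

export LIBPROCESS_IP=$HOST_IP   # address the master can reach back on
export LIBPROCESS_PORT=31500    # fixed libprocess port; open/map it on the host
exec /opt/flink/bin/mesos-appmaster.sh "$@"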


In the appmaster log (attached) I see one exception that I don't know
whether it is related to the Mesos connection problem:

java.io.FileNotFoundException: HADOOP_HOME and hadoop.home.dir are unset.

at org.apache.hadoop.util.Shell.checkHadoopHomeInner(Shell.java:448)

at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:419)

at org.apache.hadoop.util.Shell.<clinit>(Shell.java:496)

at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)

at
org.apache.hadoop.conf.Configuration.getBoolean(Configuration.java:1555)

at
org.apache.hadoop.security.SecurityUtil.getLogSlowLookupsEnabled(SecurityUtil.java:497)

at
org.apache.hadoop.security.SecurityUtil.<clinit>(SecurityUtil.java:90)

at
org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:289)

at
org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:277)

at
org.apache.hadoop.security.UserGroupInformation.loginUserFromSubject(UserGroupInformation.java:833)

at
org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:803)

at
org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:676)

at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native
Method)

at
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown
Source)

at
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown
Source)

at java.base/java.lang.reflect.Method.invoke(Unknown Source)

at
org.apache.flink.runtime.util.EnvironmentInformation.getHadoopUser(EnvironmentInformation.java:215)

at
org.apache.flink.runtime.util.EnvironmentInformation.logEnvironmentInfo(EnvironmentInformation.java:432)

at
org.apache.flink.mesos.entrypoint.MesosSessionClusterEntrypoint.main(MesosSessionClusterEntrypoint.java:95)




I am not trying (yet) to run in high availability mode, so I am not sure if
I need to have HADOOP_HOME set or not, but I don't see anything about
HADOOP_HOME in the Flink docs.
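
If the warning needs silencing, pointing HADOOP_HOME at the Hadoop bundled
in the image should do; a sketch assuming the layout shown earlier in this
thread:

export HADOOP_HOME=/opt/flink/hadoop-3.2.2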



Any tips on how I can fix my Docker+Marathon+Mesos environment so Flink can
connect to my Mesos master?


Thanks,


Javier Vegas


flink--mesos-appmaster-6c49aa87e1d4.log
Description: Binary data