Flink Promethues Metricsreporter Question
Hi, Have a general question about Flink support for Prometheus metrics. We already have a Prometheus setup in our cluster with ServiceMonitor-s monitoring ports like 8080 etc. for scraping metrics. In a setup like this, if we deploy Flink Job managers/Task managers in the cluster, is there any need to have the PrometheusReporter configured as well? How does that coordinate with existing Prometheus ServiceMonitors if present? Is the PrometheusReporter based on "pull" model so that it can pull metrics from Flink and send to some Prometheus host system? Thanks Avijit
Re: Any change in behavior related to the "web.upload.dir" behavior between Flink 1.9 and 1.11
Thanks! It seems the problem went away when I started using 'ln -s $FLINK_HOME/usrlib $FLINK_HOME/flink-web-upload' in my Dockerfile! On Mon, Aug 3, 2020 at 3:09 PM Chesnay Schepler wrote: > From what I can tell we have not changed anything. > > Are you making any modifications to the image? This exception should only > be thrown if there is already a file with the same path, and I don't think > Flink would do that. > > On 03/08/2020 21:43, Avijit Saha wrote: > > Hello, > > Has there been any change in behavior related to the "web.upload.dir" > behavior between Flink 1.9 and 1.11? > > I have a failure case where when build an image using > "flink:1.11.0-scala_2.12" in Dockerfile, the job manager job submissions > fail with the following Exception but the same flow works fine (for the > same underlying Code image) when using > "flink:1.9.1-scala_2.12".. > > This is the Exception stack trace for 1.11 and not seen using 1.9: > > -- > Caused by: java.nio.file.FileAlreadyExistsException: > /opt/flink/flink-web-upload > at > sun.nio.fs.UnixException.translateToIOException(UnixException.java:88) > ~[?:1.8.0_262] > at > sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) > ~[?:1.8.0_262] > at > sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) > ~[?:1.8.0_262] > at > sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384) > ~[?:1.8.0_262] > at java.nio.file.Files.createDirectory(Files.java:674) > ~[?:1.8.0_262] > at java.nio.file.Files.createAndCheckIsDirectory(Files.java:781) > ~[?:1.8.0_262] > at java.nio.file.Files.createDirectories(Files.java:727) > ~[?:1.8.0_262] > at > org.apache.flink.runtime.rest.RestServerEndpoint.checkAndCreateUploadDir(RestServerEndpoint.java:478) > ~[flink-dist_2.12-1.11.0.jar:1.11.0] > at > org.apache.flink.runtime.rest.RestServerEndpoint.createUploadDir(RestServerEndpoint.java:462) > ~[flink-dist_2.12-1.11.0.jar:1.11.0] > at > org.apache.flink.runtime.rest.RestServerEndpoint.(RestServerEndpoint.java:114) > ~[flink-dist_2.12-1.11.0.jar:1.11.0] > at > org.apache.flink.runtime.webmonitor.WebMonitorEndpoint.(WebMonitorEndpoint.java:200) > ~[flink-dist_2.12-1.11.0.jar:1.11.0] > at > org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint.(DispatcherRestEndpoint.java:68) > ~[flink-dist_2.12-1.11.0.jar:1.11.0] > at > org.apache.flink.runtime.rest.SessionRestEndpointFactory.createRestEndpoint(SessionRestEndpointFactory.java:63) > ~[flink-dist_2.12-1.11.0.jar:1.11.0] > at > org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:152) > ~[flink-dist_2.12-1.11.0.jar:1.11.0] > at > org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:216) > ~[flink-dist_2.12-1.11.0.jar:1.11.0] > at > org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:169) > ~[flink-dist_2.12-1.11.0.jar:1.11.0] > at > org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30) > ~[flink-dist_2.12-1.11.0.jar:1.11.0] > at > org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:168) > ~[flink-dist_2.12-1.11.0.jar:1.11.0] > ... 2 more > > >
Any change in behavior related to the "web.upload.dir" behavior between Flink 1.9 and 1.11
Hello, Has there been any change in behavior related to the "web.upload.dir" behavior between Flink 1.9 and 1.11? I have a failure case where when build an image using "flink:1.11.0-scala_2.12" in Dockerfile, the job manager job submissions fail with the following Exception but the same flow works fine (for the same underlying Code image) when using "flink:1.9.1-scala_2.12".. This is the Exception stack trace for 1.11 and not seen using 1.9: -- Caused by: java.nio.file.FileAlreadyExistsException: /opt/flink/flink-web-upload at sun.nio.fs.UnixException.translateToIOException(UnixException.java:88) ~[?:1.8.0_262] at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:1.8.0_262] at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:1.8.0_262] at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384) ~[?:1.8.0_262] at java.nio.file.Files.createDirectory(Files.java:674) ~[?:1.8.0_262] at java.nio.file.Files.createAndCheckIsDirectory(Files.java:781) ~[?:1.8.0_262] at java.nio.file.Files.createDirectories(Files.java:727) ~[?:1.8.0_262] at org.apache.flink.runtime.rest.RestServerEndpoint.checkAndCreateUploadDir(RestServerEndpoint.java:478) ~[flink-dist_2.12-1.11.0.jar:1.11.0] at org.apache.flink.runtime.rest.RestServerEndpoint.createUploadDir(RestServerEndpoint.java:462) ~[flink-dist_2.12-1.11.0.jar:1.11.0] at org.apache.flink.runtime.rest.RestServerEndpoint.(RestServerEndpoint.java:114) ~[flink-dist_2.12-1.11.0.jar:1.11.0] at org.apache.flink.runtime.webmonitor.WebMonitorEndpoint.(WebMonitorEndpoint.java:200) ~[flink-dist_2.12-1.11.0.jar:1.11.0] at org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint.(DispatcherRestEndpoint.java:68) ~[flink-dist_2.12-1.11.0.jar:1.11.0] at org.apache.flink.runtime.rest.SessionRestEndpointFactory.createRestEndpoint(SessionRestEndpointFactory.java:63) ~[flink-dist_2.12-1.11.0.jar:1.11.0] at org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:152) ~[flink-dist_2.12-1.11.0.jar:1.11.0] at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:216) ~[flink-dist_2.12-1.11.0.jar:1.11.0] at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:169) ~[flink-dist_2.12-1.11.0.jar:1.11.0] at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30) ~[flink-dist_2.12-1.11.0.jar:1.11.0] at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:168) ~[flink-dist_2.12-1.11.0.jar:1.11.0] ... 2 more
Re: Between Flink 1.9 and 1.11 - any behavior change for web.upload.dir
Hello, Has there been any change in behavior related to the "web.upload.dir" behavior between Flink 1.9 and 1.11? I have a failure case where when build an image using "flink:1.11.0-scala_2.12" in Dockerfile, the job manager job submissions fail with the following Exception but the same flow works fine (for the same underlying Code image) when using "flink:1.9.1-scala_2.12".. This is the Exception stack trace for 1.11 and not seen using 1.9: -- Caused by: java.nio.file.FileAlreadyExistsException: /opt/flink/flink-web-upload at sun.nio.fs.UnixException.translateToIOException(UnixException.java:88) ~[?:1.8.0_262] at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:1.8.0_262] at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:1.8.0_262] at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384) ~[?:1.8.0_262] at java.nio.file.Files.createDirectory(Files.java:674) ~[?:1.8.0_262] at java.nio.file.Files.createAndCheckIsDirectory(Files.java:781) ~[?:1.8.0_262] at java.nio.file.Files.createDirectories(Files.java:727) ~[?:1.8.0_262] at org.apache.flink.runtime.rest.RestServerEndpoint.checkAndCreateUploadDir(RestServerEndpoint.java:478) ~[flink-dist_2.12-1.11.0.jar:1.11.0] at org.apache.flink.runtime.rest.RestServerEndpoint.createUploadDir(RestServerEndpoint.java:462) ~[flink-dist_2.12-1.11.0.jar:1.11.0] at org.apache.flink.runtime.rest.RestServerEndpoint.(RestServerEndpoint.java:114) ~[flink-dist_2.12-1.11.0.jar:1.11.0] at org.apache.flink.runtime.webmonitor.WebMonitorEndpoint.(WebMonitorEndpoint.java:200) ~[flink-dist_2.12-1.11.0.jar:1.11.0] at org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint.(DispatcherRestEndpoint.java:68) ~[flink-dist_2.12-1.11.0.jar:1.11.0] at org.apache.flink.runtime.rest.SessionRestEndpointFactory.createRestEndpoint(SessionRestEndpointFactory.java:63) ~[flink-dist_2.12-1.11.0.jar:1.11.0] at org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:152) ~[flink-dist_2.12-1.11.0.jar:1.11.0] at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:216) ~[flink-dist_2.12-1.11.0.jar:1.11.0] at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:169) ~[flink-dist_2.12-1.11.0.jar:1.11.0] at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30) ~[flink-dist_2.12-1.11.0.jar:1.11.0] at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:168) ~[flink-dist_2.12-1.11.0.jar:1.11.0] ... 2 more >
Re: Docker Taskmanager unable to connect to Flink JpbManager...Connection RefusedHi,
Thanks Yang! It worked as expected after I made the changes suggested by you! Avijit On Wed, Jul 22, 2020 at 11:05 PM Yang Wang wrote: > Hi Avijit, > > I think you need to create a network via "docker network create > flink-network". > And then use "docker run ... --name=jobmanager --network flink-network" to > set the hostname. Also > "jobmanager.rpc.address" need to be set the jobmanager. Refer the doc[1] > for more information. > > If you really do not want to create a network and still want to use the > port forward directly. Then you > need to set the "jobmanager.rpc.address" to a local ip address, for > example, 192.168.0.100, not 127.0.0.1. > Since 127.0.0.1 in docker container means a container local address, not > host machine local address. > > > [1]. > https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/docker.html#start-a-session-cluster > > Best, > Yang > > Avijit Saha 于2020年7月23日周四 上午1:14写道: > >> Hi, >> I have built a docker image containing both Flink 1.11 and the job jar as >> per instructions at: >> >> https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/docker.html >> >> >> The jobanager starts up fine as follows: >> >> - FLINK_PROPERTIES="jobmanager.rpc.address: 127.0.0.1" >> - docker run --env FLINK_PROPERTIES={"${FLINK_PROPERTIES}" -p >> 0.0.0.0:6123:6123/tcp flink_with_job_artifacts standalone-job >> --job-classname=org.apache.beam.examples.WordCount --runner=FlinkRunner >> --inputFile=/opt/flink/conf/flink-conf.yaml --output=/tmp/counts >> >> >> >> Now, when launching the taskmanager as follows: >> - FLINK_PROPERTIES="jobmanager.rpc.address: 127.0.0.1" >> - docker run --env FLINK_PROPERTIES="${FLINK_PROPERTIES}" >> flink_with_job_artifacts taskmanager, >> the taskmanager fails with the following: >> >> . >> 2020-07-22 16:55:25,974 INFO >> org.apache.flink.runtime.net.ConnectionUtils [] - Failed >> to connect from address '/127.0.0.1': Connection refused (Connection >> refused) >> >> 2020-07-22 16:55:32,709 INFO >> org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could >> not resolve ResourceManager address akka.tcp:// >> flink@127.0.0.1:6123/user/rpc/resourcemanager_*, retrying in 1 ms: >> Could not connect to rpc endpoint under address akka.tcp:// >> flink@127.0.0.1:6123/user/rpc/resourcemanager_*. >> >> >> Any pointer on How to fix this? >> >> Thanks >> Avijit >> >
Re: Docker Taskmanager unable to connect to Flink JpbManager...Connection RefusedHi,
Hi, I have built a docker image containing both Flink 1.11 and the job jar as per instructions at: https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/docker.html The jobanager starts up fine as follows: - FLINK_PROPERTIES="jobmanager.rpc.address: 127.0.0.1" - docker run --env FLINK_PROPERTIES={"${FLINK_PROPERTIES}" -p 0.0.0.0:6123:6123/tcp flink_with_job_artifacts standalone-job --job-classname=org.apache.beam.examples.WordCount --runner=FlinkRunner --inputFile=/opt/flink/conf/flink-conf.yaml --output=/tmp/counts Now, when launching the taskmanager as follows: - FLINK_PROPERTIES="jobmanager.rpc.address: 127.0.0.1" - docker run --env FLINK_PROPERTIES="${FLINK_PROPERTIES}" flink_with_job_artifacts taskmanager, the taskmanager fails with the following: . 2020-07-22 16:55:25,974 INFO org.apache.flink.runtime.net.ConnectionUtils [] - Failed to connect from address '/127.0.0.1': Connection refused (Connection refused) 2020-07-22 16:55:32,709 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp:// flink@127.0.0.1:6123/user/rpc/resourcemanager_*, retrying in 1 ms: Could not connect to rpc endpoint under address akka.tcp:// flink@127.0.0.1:6123/user/rpc/resourcemanager_*. Any pointer on How to fix this? Thanks Avijit
Beam Word Cound on Flink/kubernetes Issue with task manager ot getting picked up
Hi, I have a docker image of the Beam WordCount example that reads a status file and produces a output one time with word counts etc. This runs fine as a separate job-manager and task-manager when run from docker-compose locally. Now, I am trying to deploy and run this on my Kubernetes cluster as per instructions at https://github.com/apache/flink/tree/release-1.10/flink-container/kubernetes . Deployment of Job-cluster and task-manager goes thro fine but the task-manager never seems to be picked up - it always stays in 'Pending' status! Is this expected behavior for a one-time Job like Word-Count application or am I missing something? Thanks Avijit $ kubectl get pods NAME READY STATUS RESTARTS AGE flink-job-cluster-kw85v 2/2 Running 2 15m flink-task-manager-5cc79c5795-7mnqh 0/2 Pending 0 14m
TaskManager docker image for Beam WordCount failing with ClassNotFound Exception
Hi, I am trying the run the Beam WordCount example on Flink runner using docker container-s for 'Jobcluster' and 'TaskManager'. When I put the Beam Wordcount custom jar in the /opt/flink/usrlib/ dir - the 'taskmanager' docker image fails at runtime with ClassNotFound Exception for the following: Caused by: java.lang.ClassNotFoundException: org.apache.beam.runners.core.metrics.MetricUpdates$MetricUpdate: taskmanager_1 | Caused by: java.lang.ClassNotFoundException: org.apache.beam.runners.core.metrics.MetricUpdates$MetricUpdate taskmanager_1 |at java.net.URLClassLoader.findClass(URLClassLoader.java:382) taskmanager_1 |at java.lang.ClassLoader.loadClass(ClassLoader.java:424) taskmanager_1 |at org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:69) taskmanager_1 |at java.lang.ClassLoader.loadClass(ClassLoader.java:357) taskmanager_1 |... 68 more If I instead put the Beam WordCount jar in the "/opt/flink-1.10.1/lib" dir as follows, $ ls flink-dist_2.12-1.10.1.jar flink-table-blink_2.12-1.10.1.jar flink-table_2.12-1.10.1.jar log4j-1.2.17.jar slf4j-log4j12-1.7.15.jar word-count-beam-bundled-0.1.jar It runs without any Exception! Is this the expected behavior? Do we need to always bundle the job-jar in the same lib location as other flink jars?