Flink Prometheus MetricsReporter Question

2020-08-05 Thread Avijit Saha
Hi,

Have a general question about Flink support for Prometheus metrics. We
already have a Prometheus setup in our cluster, with ServiceMonitors
monitoring ports like 8080 for scraping metrics.

In a setup like this, if we deploy Flink JobManagers/TaskManagers in the
cluster, is there any need to have the PrometheusReporter configured as
well? How does that coordinate with the existing Prometheus ServiceMonitors,
if present?

Is the PrometheusReporter based on a "pull" model, i.e. does it pull
metrics from Flink and expose them for the Prometheus setup to scrape?
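
If we do end up needing the reporter, my assumption (please correct me if
wrong) is that it would be enabled with something like the following in
flink-conf.yaml, and a ServiceMonitor would then scrape the reporter's port
on each JobManager/TaskManager pod (the port range below is just an example):

    metrics.reporter.prom.class: org.apache.flink.metrics.prometheus.PrometheusReporter
    metrics.reporter.prom.port: 9249-9250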

Thanks
Avijit


Re: Any change in behavior related to "web.upload.dir" between Flink 1.9 and 1.11

2020-08-03 Thread Avijit Saha
Thanks!

It seems the problem went away when I started using 'ln -s
$FLINK_HOME/usrlib $FLINK_HOME/flink-web-upload' in my Dockerfile!
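
For anyone hitting the same thing, the relevant part of my Dockerfile now
looks roughly like this (base image and jar name are just from my setup):

    FROM flink:1.11.0-scala_2.12
    # job jar goes under usrlib, as before
    COPY word-count-beam-bundled-0.1.jar /opt/flink/usrlib/
    # workaround: pre-create flink-web-upload as a symlink to usrlib
    RUN ln -s $FLINK_HOME/usrlib $FLINK_HOME/flink-web-upload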


On Mon, Aug 3, 2020 at 3:09 PM Chesnay Schepler  wrote:

> From what I can tell we have not changed anything.
>
> Are you making any modifications to the image? This exception should only
> be thrown if there is already a file with the same path, and I don't think
> Flink would do that.
>
> On 03/08/2020 21:43, Avijit Saha wrote:
>
> Hello,
>
> Has there been any change in behavior related to "web.upload.dir"
> between Flink 1.9 and 1.11?
>
> I have a failure case where, when I build an image using
> "flink:1.11.0-scala_2.12" in the Dockerfile, job submissions to the job
> manager fail with the following exception, but the same flow works fine
> (for the same underlying code/image) when using "flink:1.9.1-scala_2.12".
>
> This is the exception stack trace for 1.11; it is not seen with 1.9:
>
> --
> Caused by: java.nio.file.FileAlreadyExistsException: /opt/flink/flink-web-upload
>     at sun.nio.fs.UnixException.translateToIOException(UnixException.java:88) ~[?:1.8.0_262]
>     at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:1.8.0_262]
>     at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:1.8.0_262]
>     at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384) ~[?:1.8.0_262]
>     at java.nio.file.Files.createDirectory(Files.java:674) ~[?:1.8.0_262]
>     at java.nio.file.Files.createAndCheckIsDirectory(Files.java:781) ~[?:1.8.0_262]
>     at java.nio.file.Files.createDirectories(Files.java:727) ~[?:1.8.0_262]
>     at org.apache.flink.runtime.rest.RestServerEndpoint.checkAndCreateUploadDir(RestServerEndpoint.java:478) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
>     at org.apache.flink.runtime.rest.RestServerEndpoint.createUploadDir(RestServerEndpoint.java:462) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
>     at org.apache.flink.runtime.rest.RestServerEndpoint.<init>(RestServerEndpoint.java:114) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
>     at org.apache.flink.runtime.webmonitor.WebMonitorEndpoint.<init>(WebMonitorEndpoint.java:200) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
>     at org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint.<init>(DispatcherRestEndpoint.java:68) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
>     at org.apache.flink.runtime.rest.SessionRestEndpointFactory.createRestEndpoint(SessionRestEndpointFactory.java:63) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
>     at org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:152) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:216) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:169) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
>     at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
>     at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:168) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
>     ... 2 more
>
>
>


Any change in behavior related to "web.upload.dir" between Flink 1.9 and 1.11

2020-08-03 Thread Avijit Saha
Hello,

Has there been any change in behavior related to "web.upload.dir"
between Flink 1.9 and 1.11?

I have a failure case where, when I build an image using
"flink:1.11.0-scala_2.12" in the Dockerfile, job submissions to the job
manager fail with the following exception, but the same flow works fine
(for the same underlying code/image) when using "flink:1.9.1-scala_2.12".

This is the exception stack trace for 1.11; it is not seen with 1.9:
--
Caused by: java.nio.file.FileAlreadyExistsException: /opt/flink/flink-web-upload
    at sun.nio.fs.UnixException.translateToIOException(UnixException.java:88) ~[?:1.8.0_262]
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:1.8.0_262]
    at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:1.8.0_262]
    at sun.nio.fs.UnixFileSystemProvider.createDirectory(UnixFileSystemProvider.java:384) ~[?:1.8.0_262]
    at java.nio.file.Files.createDirectory(Files.java:674) ~[?:1.8.0_262]
    at java.nio.file.Files.createAndCheckIsDirectory(Files.java:781) ~[?:1.8.0_262]
    at java.nio.file.Files.createDirectories(Files.java:727) ~[?:1.8.0_262]
    at org.apache.flink.runtime.rest.RestServerEndpoint.checkAndCreateUploadDir(RestServerEndpoint.java:478) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
    at org.apache.flink.runtime.rest.RestServerEndpoint.createUploadDir(RestServerEndpoint.java:462) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
    at org.apache.flink.runtime.rest.RestServerEndpoint.<init>(RestServerEndpoint.java:114) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
    at org.apache.flink.runtime.webmonitor.WebMonitorEndpoint.<init>(WebMonitorEndpoint.java:200) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
    at org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint.<init>(DispatcherRestEndpoint.java:68) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
    at org.apache.flink.runtime.rest.SessionRestEndpointFactory.createRestEndpoint(SessionRestEndpointFactory.java:63) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
    at org.apache.flink.runtime.entrypoint.component.DefaultDispatcherResourceManagerComponentFactory.create(DefaultDispatcherResourceManagerComponentFactory.java:152) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.runCluster(ClusterEntrypoint.java:216) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.lambda$startCluster$0(ClusterEntrypoint.java:169) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
    at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:30) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
    at org.apache.flink.runtime.entrypoint.ClusterEntrypoint.startCluster(ClusterEntrypoint.java:168) ~[flink-dist_2.12-1.11.0.jar:1.11.0]
    ... 2 more
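
One thing I may try in the meantime is pointing the upload directory at an
explicitly empty, writable location via flink-conf.yaml (the path below is
just an example), to see whether that avoids the clash:

    web.upload.dir: /tmp/flink-web-upload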


Re: Docker Taskmanager unable to connect to Flink JobManager...Connection Refused

2020-07-23 Thread Avijit Saha
Thanks Yang!

It worked as expected after I made the changes suggested by you!
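
Concretely, what I ended up running is roughly the following (the image name
and job arguments are from my earlier mail; port/volume mappings omitted):

    docker network create flink-network
    FLINK_PROPERTIES="jobmanager.rpc.address: jobmanager"
    docker run --name=jobmanager --network flink-network \
        --env FLINK_PROPERTIES="${FLINK_PROPERTIES}" \
        flink_with_job_artifacts standalone-job \
        --job-classname=org.apache.beam.examples.WordCount --runner=FlinkRunner \
        --inputFile=/opt/flink/conf/flink-conf.yaml --output=/tmp/counts
    docker run --network flink-network \
        --env FLINK_PROPERTIES="${FLINK_PROPERTIES}" \
        flink_with_job_artifacts taskmanager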

Avijit

On Wed, Jul 22, 2020 at 11:05 PM Yang Wang  wrote:

> Hi Avijit,
>
> I think you need to create a network via "docker network create
> flink-network", and then use "docker run ... --name=jobmanager --network
> flink-network" to set the hostname. "jobmanager.rpc.address" also needs
> to be set to "jobmanager". Refer to the doc [1] for more information.
>
> If you really do not want to create a network and still want to use port
> forwarding directly, then you need to set "jobmanager.rpc.address" to a
> local IP address, for example 192.168.0.100, not 127.0.0.1, since
> 127.0.0.1 inside a Docker container means the container's own local
> address, not the host machine's local address.
>
>
> [1].
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/docker.html#start-a-session-cluster
>
> Best,
> Yang
>
> On Thu, Jul 23, 2020 at 1:14 AM, Avijit Saha wrote:
>
>> Hi,
>> I have built a docker image containing both Flink 1.11 and the job jar as
>> per instructions at:
>>
>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/docker.html
>>
>>
>> The jobmanager starts up fine as follows:
>>
>> - FLINK_PROPERTIES="jobmanager.rpc.address: 127.0.0.1"
>> - docker run --env FLINK_PROPERTIES="${FLINK_PROPERTIES}" -p
>> 0.0.0.0:6123:6123/tcp flink_with_job_artifacts standalone-job
>> --job-classname=org.apache.beam.examples.WordCount --runner=FlinkRunner
>> --inputFile=/opt/flink/conf/flink-conf.yaml --output=/tmp/counts
>>
>>
>>
>> Now, when launching the taskmanager as follows:
>> - FLINK_PROPERTIES="jobmanager.rpc.address: 127.0.0.1"
>> - docker run --env FLINK_PROPERTIES="${FLINK_PROPERTIES}"
>> flink_with_job_artifacts taskmanager
>> the taskmanager fails with the following:
>>
>> .
>> 2020-07-22 16:55:25,974 INFO  org.apache.flink.runtime.net.ConnectionUtils [] - Failed to connect from address '/127.0.0.1': Connection refused (Connection refused)
>>
>> 2020-07-22 16:55:32,709 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink@127.0.0.1:6123/user/rpc/resourcemanager_*, retrying in 1 ms: Could not connect to rpc endpoint under address akka.tcp://flink@127.0.0.1:6123/user/rpc/resourcemanager_*.
>>
>>
>> Any pointer on how to fix this?
>>
>> Thanks
>> Avijit
>>
>


Re: Docker Taskmanager unable to connect to Flink JobManager...Connection Refused

2020-07-22 Thread Avijit Saha
Hi,
I have built a docker image containing both Flink 1.11 and the job jar as
per instructions at:
https://ci.apache.org/projects/flink/flink-docs-stable/ops/deployment/docker.html


The jobmanager starts up fine as follows:

- FLINK_PROPERTIES="jobmanager.rpc.address: 127.0.0.1"
- docker run --env FLINK_PROPERTIES="${FLINK_PROPERTIES}" -p
0.0.0.0:6123:6123/tcp flink_with_job_artifacts standalone-job
--job-classname=org.apache.beam.examples.WordCount --runner=FlinkRunner
--inputFile=/opt/flink/conf/flink-conf.yaml --output=/tmp/counts



Now, when launching the taskmanager as follows:
- FLINK_PROPERTIES="jobmanager.rpc.address: 127.0.0.1"
- docker run --env FLINK_PROPERTIES="${FLINK_PROPERTIES}"
flink_with_job_artifacts taskmanager
the taskmanager fails with the following:

.
2020-07-22 16:55:25,974 INFO  org.apache.flink.runtime.net.ConnectionUtils [] - Failed to connect from address '/127.0.0.1': Connection refused (Connection refused)

2020-07-22 16:55:32,709 INFO  org.apache.flink.runtime.taskexecutor.TaskExecutor [] - Could not resolve ResourceManager address akka.tcp://flink@127.0.0.1:6123/user/rpc/resourcemanager_*, retrying in 1 ms: Could not connect to rpc endpoint under address akka.tcp://flink@127.0.0.1:6123/user/rpc/resourcemanager_*.


Any pointer on how to fix this?

Thanks
Avijit


Beam Word Count on Flink/Kubernetes: issue with task manager not getting picked up

2020-07-13 Thread Avijit Saha
Hi,

I have a Docker image of the Beam WordCount example that reads an input
file and produces the word-count output once.

This runs fine as a separate job manager and task manager when run from
docker-compose locally.

Now, I am trying to deploy and run this on my Kubernetes cluster as per the
instructions at
https://github.com/apache/flink/tree/release-1.10/flink-container/kubernetes.

Deployment of the job cluster and the task manager goes through fine, but
the task-manager pod never seems to get picked up - it always stays in
'Pending' status!
Is this expected behavior for a one-time job like the WordCount application,
or am I missing something?

Thanks
Avijit

$ kubectl get pods
NAME                                  READY   STATUS    RESTARTS   AGE
flink-job-cluster-kw85v               2/2     Running   2          15m
flink-task-manager-5cc79c5795-7mnqh   0/2     Pending   0          14m
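
To dig into why it stays Pending, the next thing I plan to check is the pod
events; my assumption is that they will show why the scheduler cannot place
the pod (for example, insufficient CPU/memory or an unsatisfiable selector):

    kubectl describe pod flink-task-manager-5cc79c5795-7mnqh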


TaskManager Docker image for Beam WordCount failing with ClassNotFoundException

2020-07-07 Thread Avijit Saha
Hi,
I am trying to run the Beam WordCount example on the Flink runner, using
Docker containers for the 'Jobcluster' and 'TaskManager'.

When I put the Beam WordCount custom jar in the /opt/flink/usrlib/ dir, the
'taskmanager' Docker image fails at runtime with a ClassNotFoundException:

taskmanager_1  | Caused by: java.lang.ClassNotFoundException: org.apache.beam.runners.core.metrics.MetricUpdates$MetricUpdate
taskmanager_1  |     at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
taskmanager_1  |     at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
taskmanager_1  |     at org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:69)
taskmanager_1  |     at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
taskmanager_1  |     ... 68 more

If I instead put the Beam WordCount jar in the "/opt/flink-1.10.1/lib" dir
as follows,

$ ls
flink-dist_2.12-1.10.1.jar
flink-table-blink_2.12-1.10.1.jar
flink-table_2.12-1.10.1.jar
log4j-1.2.17.jar
slf4j-log4j12-1.7.15.jar
word-count-beam-bundled-0.1.jar

It runs without any Exception!
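
For now, the relevant part of my Dockerfile simply copies the bundled jar
into lib instead of usrlib, roughly (paths and the jar name are from my
setup above):

    COPY word-count-beam-bundled-0.1.jar /opt/flink-1.10.1/lib/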

Is this the expected behavior? Do we always need to put the job jar in the
same lib location as the other Flink jars?