Re: Operator metrics do not get unregistered after job finishes

2018-09-07 Thread Helmut Zechmann
Hi Vino,

The log shows no problems. The problem can be reproduced easily. I created 
https://issues.apache.org/jira/browse/FLINK-10300 
<https://issues.apache.org/jira/browse/FLINK-10300>.

Best,

Helmut

> On 18. Aug 2018, at 04:53, vino yang  wrote:
> 
> Hi Helmut,
> 
> Is the metrics of all the sub task instances of a job not unregistered, or 
> part of it is not unregistered. Is there any exception log information 
> available?
> 
> Please feel free to create a JIRA issue and clearly describe your problem.
> 
> Thanks, vino.
> 
> Helmut Zechmann mailto:hel...@adeven.com>> 于2018年8月17日周五 
> 下午11:14写道:
> Hi all,
> 
> 
> we are using flink 1.5.2 in batch mode with prometheus monitoring.
> 
> We noticed that a few metrics do not get unregistered after a job is finished:
> 
> flink_taskmanager_job_task_operator_numRecordsIn
> flink_taskmanager_job_task_operator_numRecordsInPerSecond
> flink_taskmanager_job_task_operator_numRecordsOut
> flink_taskmanager_job_task_operator_numRecordsOutPerSecond
> 
> 
> Those metrics stay in the taksmanager metrics list until the task manger gets 
> restarted.
> 
> Our metrics config is:
> 
> metrics.reporters: prom
> metrics.reporter.prom.class: 
> org.apache.flink.metrics.prometheus.PrometheusReporter
> metrics.reporter.prom.port: 7000-7001
> 
> metrics.scope.jm <http://metrics.scope.jm/>: flink..jobmanager
> metrics.scope.tm <http://metrics.scope.tm/>: flink..taskmanager.
> metrics.scope.jm.job: flink..jobmanager.
> metrics.scope.tm.job: flink..taskmanager..
> metrics.scope.task: 
> flink..taskmanager
> metrics.scope.operator: 
> flink..taskmanager
> 
> 
> Since we run many batch jobs, this makes prometheus monitoring unusable for 
> us. Is this a known issue?
> 
> 
> Best,
> 
> Helmut



Re: Flink Jobmanager Failover in HA mode

2018-08-20 Thread Helmut Zechmann
Hi Dominik,

all jobs on the cluster (batch only jobs without state) where in status
FINISHED.

Best,

Helmut

On Fri, Aug 17, 2018 at 8:04 PM Dominik Wosiński  wrote:

> I have faced this issue, but in 1.4.0 IIRC. This seems to be related to
> https://issues.apache.org/jira/browse/FLINK-10011. What was the status of
> the jobs when the main Job Manager has been stopped ?
>
> 2018-08-17 17:08 GMT+02:00 Helmut Zechmann :
>
>> Hi all,
>>
>> we have a problem with flink 1.5.2 high availability in standalone mode.
>>
>> We have two jobmanagers running. When I shut down the main job manager,
>> the failover job manager encounters an error during failover.
>>
>> Logs:
>>
>>
>> 2018-08-17 14:38:16,478 WARN  akka.remote.ReliableDeliverySupervisor
>>   - Association with remote system [akka.tcp://
>> fl...@seg-1.adjust.com:29095] has failed, address is now gated for [50]
>> ms. Reason: [Disassociated]
>> 2018-08-17 14:38:31,449 WARN  akka.remote.transport.netty.NettyTransport
>>   - Remote connection to [null] failed with
>> java.net.ConnectException: Connection refused:
>> seg-1.adjust.com/178.162.219.66:29095
>> 2018-08-17 <http://seg-1.adjust.com/178.162.219.66:29095%0D2018-08-17>
>> 14:38:31,451 WARN  akka.remote.ReliableDeliverySupervisor
>>   - Association with remote system [akka.tcp://
>> fl...@seg-1.adjust.com:29095] has failed, address is now gated for [50]
>> ms. Reason: [Association failed with [akka.tcp://
>> fl...@seg-1.adjust.com:29095]] Caused by: [Connection refused:
>> seg-1.adjust.com/178.162.219.66:29095]
>> 2018-08-17 14:38:41,379 ERROR
>> org.apache.flink.runtime.rest.handler.legacy.files.StaticFileServerHandler
>> - Could not retrieve the redirect address.
>> java.util.concurrent.CompletionException:
>> akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://
>> fl...@seg-1.adjust.com:29095/user/dispatcher#-1599908403]] after [1
>> ms]. Sender[null] sent message of type
>> "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
>> at
>> java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
>> [... shortened ...]
>> Caused by: akka.pattern.AskTimeoutException: Ask timed out on
>> [Actor[akka.tcp://
>> fl...@seg-1.adjust.com:29095/user/dispatcher#-1599908403]] after [1
>> ms]. Sender[null] sent message of type
>> "org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
>> at
>> akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
>> ... 9 more
>> 2018-08-17 14:38:48,005 INFO
>> org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint-
>> http://seg-2.adjust.com:8083 was granted leadership with
>> leaderSessionID=708d1a64-c353-448b-9101-7eb3f910970e
>> 2018-08-17 14:38:48,005 INFO
>> org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  -
>> ResourceManager akka.tcp://
>> fl...@seg-2.adjust.com:30169/user/resourcemanager was granted leadership
>> with fencing token 8de829de14876a367a80d37194b944ee
>> 2018-08-17 14:38:48,006 INFO
>> org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  -
>> Starting the SlotManager.
>> 2018-08-17 14:38:48,007 INFO
>> org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Dispatcher
>> akka.tcp://fl...@seg-2.adjust.com:30169/user/dispatcher was granted
>> leadership with fencing token 684f50f8-327c-47e1-a53c-931c4f4ea3e5
>> 2018-08-17 14:38:48,007 INFO
>> org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Recovering
>> all persisted jobs.
>> 2018-08-17 14:38:48,021 INFO
>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>> Recovered SubmittedJobGraph(b951bbf518bcf6cc031be6d2ccc441bb, null).
>> 2018-08-17 14:38:48,028 INFO
>> org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  -
>> Recovered SubmittedJobGraph(06ed64f48fa0a7cffde53b99cbaa073f, null).
>> 2018-08-17 14:38:48,035 ERROR
>> org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error
>> occurred in the cluster entrypoint.
>> java.lang.RuntimeException:
>> org.apache.flink.runtime.client.JobExecutionException: Could not set up
>> JobManager
>> at
>> org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
>> [... shortened ...]
>> Caused by: org.apache.flink.runtime.client.JobExecutionException: Could
>> not set up JobManager
>> at
>> org.apache.flink.runtime.jobmaster.JobManagerRunner.(Jo

Operator metrics do not get unregistered after job finishes

2018-08-17 Thread Helmut Zechmann
Hi all,


we are using flink 1.5.2 in batch mode with prometheus monitoring.

We noticed that a few metrics do not get unregistered after a job is finished:

flink_taskmanager_job_task_operator_numRecordsIn
flink_taskmanager_job_task_operator_numRecordsInPerSecond
flink_taskmanager_job_task_operator_numRecordsOut
flink_taskmanager_job_task_operator_numRecordsOutPerSecond


Those metrics stay in the taksmanager metrics list until the task manger gets 
restarted.

Our metrics config is:

metrics.reporters: prom
metrics.reporter.prom.class: 
org.apache.flink.metrics.prometheus.PrometheusReporter
metrics.reporter.prom.port: 7000-7001

metrics.scope.jm: flink..jobmanager
metrics.scope.tm: flink..taskmanager.
metrics.scope.jm.job: flink..jobmanager.
metrics.scope.tm.job: flink..taskmanager..
metrics.scope.task: 
flink..taskmanager
metrics.scope.operator: 
flink..taskmanager


Since we run many batch jobs, this makes prometheus monitoring unusable for us. 
Is this a known issue?


Best,

Helmut

Flink Jobmanager Failover in HA mode

2018-08-17 Thread Helmut Zechmann
Hi all,

we have a problem with flink 1.5.2 high availability in standalone mode.

We have two jobmanagers running. When I shut down the main job manager, the 
failover job manager encounters an error during failover.

Logs:


2018-08-17 14:38:16,478 WARN  akka.remote.ReliableDeliverySupervisor
- Association with remote system 
[akka.tcp://fl...@seg-1.adjust.com:29095] has failed, address is now gated for 
[50] ms. Reason: [Disassociated]
2018-08-17 14:38:31,449 WARN  akka.remote.transport.netty.NettyTransport
- Remote connection to [null] failed with 
java.net.ConnectException: Connection refused: 
seg-1.adjust.com/178.162.219.66:29095
2018-08-17 14:38:31,451 WARN  akka.remote.ReliableDeliverySupervisor
- Association with remote system 
[akka.tcp://fl...@seg-1.adjust.com:29095] has failed, address is now gated for 
[50] ms. Reason: [Association failed with 
[akka.tcp://fl...@seg-1.adjust.com:29095]] Caused by: [Connection refused: 
seg-1.adjust.com/178.162.219.66:29095]
2018-08-17 14:38:41,379 ERROR 
org.apache.flink.runtime.rest.handler.legacy.files.StaticFileServerHandler  - 
Could not retrieve the redirect address.
java.util.concurrent.CompletionException: akka.pattern.AskTimeoutException: Ask 
timed out on 
[Actor[akka.tcp://fl...@seg-1.adjust.com:29095/user/dispatcher#-1599908403]] 
after [1 ms]. Sender[null] sent message of type 
"org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
at 
java.util.concurrent.CompletableFuture.encodeThrowable(CompletableFuture.java:292)
[... shortened ...]
Caused by: akka.pattern.AskTimeoutException: Ask timed out on 
[Actor[akka.tcp://fl...@seg-1.adjust.com:29095/user/dispatcher#-1599908403]] 
after [1 ms]. Sender[null] sent message of type 
"org.apache.flink.runtime.rpc.messages.RemoteFencedMessage".
at 
akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:604)
... 9 more
2018-08-17 14:38:48,005 INFO  
org.apache.flink.runtime.dispatcher.DispatcherRestEndpoint- 
http://seg-2.adjust.com:8083 was granted leadership with 
leaderSessionID=708d1a64-c353-448b-9101-7eb3f910970e
2018-08-17 14:38:48,005 INFO  
org.apache.flink.runtime.resourcemanager.StandaloneResourceManager  - 
ResourceManager akka.tcp://fl...@seg-2.adjust.com:30169/user/resourcemanager 
was granted leadership with fencing token 8de829de14876a367a80d37194b944ee
2018-08-17 14:38:48,006 INFO  
org.apache.flink.runtime.resourcemanager.slotmanager.SlotManager  - Starting 
the SlotManager.
2018-08-17 14:38:48,007 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Dispatcher 
akka.tcp://fl...@seg-2.adjust.com:30169/user/dispatcher was granted leadership 
with fencing token 684f50f8-327c-47e1-a53c-931c4f4ea3e5
2018-08-17 14:38:48,007 INFO  
org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Recovering all 
persisted jobs.
2018-08-17 14:38:48,021 INFO  
org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
Recovered SubmittedJobGraph(b951bbf518bcf6cc031be6d2ccc441bb, null).
2018-08-17 14:38:48,028 INFO  
org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - 
Recovered SubmittedJobGraph(06ed64f48fa0a7cffde53b99cbaa073f, null).
2018-08-17 14:38:48,035 ERROR 
org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Fatal error 
occurred in the cluster entrypoint.
java.lang.RuntimeException: 
org.apache.flink.runtime.client.JobExecutionException: Could not set up 
JobManager
at org.apache.flink.util.ExceptionUtils.rethrow(ExceptionUtils.java:199)
[... shortened ...]
Caused by: org.apache.flink.runtime.client.JobExecutionException: Could not set 
up JobManager
at 
org.apache.flink.runtime.jobmaster.JobManagerRunner.(JobManagerRunner.java:176)
at 
org.apache.flink.runtime.dispatcher.Dispatcher$DefaultJobManagerRunnerFactory.createJobManagerRunner(Dispatcher.java:936)
at 
org.apache.flink.runtime.dispatcher.Dispatcher.createJobManagerRunner(Dispatcher.java:291)
at 
org.apache.flink.runtime.dispatcher.Dispatcher.runJob(Dispatcher.java:281)
at 
org.apache.flink.util.function.ConsumerWithException.accept(ConsumerWithException.java:38)
... 21 more
Caused by: java.lang.Exception: Cannot set up the user code libraries: 
/var/lib/flink/ceph/prod/1.5-batch/ha_state/1.5-batch/blob/job_b951bbf518bcf6cc031be6d2ccc441bb/blob_p-a26f62e3bbdcd8884dd18c42a3f6f202b9d2c6e7-0dc87a56862a1f799d515306ffeddfcb
 (No such file or directory)
at 
org.apache.flink.runtime.jobmaster.JobManagerRunner.(JobManagerRunner.java:134)
... 25 more
Caused by: java.io.FileNotFoundException: 
/var/lib/flink/ceph/prod/1.5-batch/ha_state/1.5-batch/blob/job_b951bbf518bcf6cc031be6d2ccc441bb/blob_p-a26f62e3bbdcd8884dd18c42a3f6f202b9d2c6e7-0dc87a56862a1f799d515306ffeddfcb
 (No such file or directory)
at java.io.FileInputStream.open0(Native Method)
[... shortened ...

Fwd: Processing Sorted Input Datasets

2018-05-08 Thread Helmut Zechmann
Hi all,
Helmut Zechmann
helmut.zechm...@mailbox.org
www.helmutzechmann.com
0151 27527950




we want to use flink batch to merge records from two or more datasets using 
groupBy.
The input datasets are already sorted since they have been written out sorted 
by some other job.

Is it possible to tell flink that it does not have to re-sort the data again?

Best,

Helmut