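Hi,

We performed an upgrade in production in the early morning. There are 13 jobs running, and one of them, named gift, failed while all the others stayed healthy, which is strange. After checking the logs I found the information below. How can this kind of problem be avoided? Does Flink need any special settings?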
JM log:
[2021-02-08 00:49:23.714][org.apache.flink.runtime.checkpoint.CheckpointCoordinator][]Completed checkpoint 1435986 for job d31fe16e0525e20a19dc79c88ab958a2 (10872 bytes in 41 ms).
[2021-02-08 00:49:23.859][org.apache.flink.runtime.executiongraph.ExecutionGraph][]Source: Custom Source -> Filter -> Flat Map (1/1) (cc7dc71b5a40ece6fd1383ae30c316a4) switched from RUNNING to FAILED.
org.apache.kafka.common.KafkaException: Error registering mbean kafka.consumer:type=consumer-node-metrics,client-id=consumer-2,node-id=node-2147483544
at org.apache.kafka.common.metrics.JmxReporter.reregister(JmxReporter.java:163)
at org.apache.kafka.common.metrics.JmxReporter.metricChange(JmxReporter.java:81)
at org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:504)
at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:255)
at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:240)
at org.apache.kafka.common.network.Selector$SelectorMetrics.maybeRegisterConnectionMetrics(Selector.java:817)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:353)
at org.apache.kafka.common.network.Selector.poll(Selector.java:326)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:433)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:232)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:208)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1096)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1043)
at org.apache.flink.streaming.connectors.kafka.internal.KafkaConsumerThread.run(KafkaConsumerThread.java:257)
Caused by: javax.management.InstanceAlreadyExistsException: kafka.consumer:type=consumer-node-metrics,client-id=consumer-2,node-id=node-2147483544
at com.sun.jmx.mbeanserver.Repository.addMBean(Repository.java:437)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(DefaultMBeanServerInterceptor.java:1898)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:966)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:900)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:324)
at com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:522)
at org.apache.kafka.common.metrics.JmxReporter.reregister(JmxReporter.java:161)
... 13 common frames omitted
[2021-02-08 00:49:23.859][org.apache.flink.runtime.executiongraph.ExecutionGraph][]Job ty-bi-flink:gift (36ecb2327a19f39c3a48e5f9467255b9) switched from state RUNNING to FAILING.
org.apache.kafka.common.KafkaException: Error registering mbean kafka.consumer:type=consumer-node-metrics,client-id=consumer-2,node-id=node-2147483544
at org.apache.kafka.common.metrics.JmxReporter.reregister(JmxReporter.java:163)
at org.apache.kafka.common.metrics.JmxReporter.metricChange(JmxReporter.java:81)
at org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:504)
at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:255)
at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:240)
at org.apache.kafka.common.network.Selector$SelectorMetrics.maybeRegisterConnectionMetrics(Selector.java:817)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:353)
at org.apache.kafka.common.network.Selector.poll(Selector.java:326)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:433)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:232)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:208)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1096)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1043)
at org.apache.flink.streaming.connectors.kafka.internal.KafkaConsumerThread.run(KafkaConsumerThread.java:257)
Caused by: javax.management.InstanceAlreadyExistsException: kafka.consumer:type=consumer-node-metrics,client-id=consumer-2,node-id=node-2147483544
at com.sun.jmx.mbeanserver.Repository.addMBean(Repository.java:437)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(DefaultMBeanServerInterceptor.java:1898)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:966)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:900)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:324)
at com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:522)
at org.apache.kafka.common.metrics.JmxReporter.reregister(JmxReporter.java:161)
... 13 common frames omitted
[2021-02-08 00:49:23.862][org.apache.flink.runtime.executiongraph.ExecutionGraph][]TriggerWindow(TumblingProcessingTimeWindows(5000), ListStateDescriptor{name=window-contents, defaultValue=null, serializer=org.apache.flink.api.common.typeutils.base.ListSerializer@503e2d76}, ProcessingTimeTrigger(), AllWindowedStream.addSinkByTopic(Application.java:254)) -> Sink: gift (1/1) (0bf38f64eef92b7f900e1a1510b5ac2a) switched from RUNNING to CANCELING.
[2021-02-08 00:49:23.963][org.apache.flink.runtime.executiongraph.ExecutionGraph][]TriggerWindow(TumblingProcessingTimeWindows(5000), ListStateDescriptor{name=window-contents, defaultValue=null, serializer=org.apache.flink.api.common.typeutils.base.ListSerializer@503e2d76}, ProcessingTimeTrigger(), AllWindowedStream.addSinkByTopic(Application.java:254)) -> Sink: gift (1/1) (0bf38f64eef92b7f900e1a1510b5ac2a) switched from CANCELING to CANCELED.
[2021-02-08 00:49:23.964][org.apache.flink.runtime.executiongraph.ExecutionGraph][]Try to restart or fail the job ty-bi-flink:gift (36ecb2327a19f39c3a48e5f9467255b9) if no longer possible.
[2021-02-08 00:49:23.964][org.apache.flink.runtime.executiongraph.ExecutionGraph][]Job ty-bi-flink:gift (36ecb2327a19f39c3a48e5f9467255b9) switched from state FAILING to FAILED.
org.apache.kafka.common.KafkaException: Error registering mbean kafka.consumer:type=consumer-node-metrics,client-id=consumer-2,node-id=node-2147483544
at org.apache.kafka.common.metrics.JmxReporter.reregister(JmxReporter.java:163)
at org.apache.kafka.common.metrics.JmxReporter.metricChange(JmxReporter.java:81)
at org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:504)
at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:255)
at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:240)
at org.apache.kafka.common.network.Selector$SelectorMetrics.maybeRegisterConnectionMetrics(Selector.java:817)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:353)
at org.apache.kafka.common.network.Selector.poll(Selector.java:326)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:433)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:232)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:208)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1096)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1043)
at org.apache.flink.streaming.connectors.kafka.internal.KafkaConsumerThread.run(KafkaConsumerThread.java:257)
Caused by: javax.management.InstanceAlreadyExistsException: kafka.consumer:type=consumer-node-metrics,client-id=consumer-2,node-id=node-2147483544
at com.sun.jmx.mbeanserver.Repository.addMBean(Repository.java:437)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(DefaultMBeanServerInterceptor.java:1898)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:966)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:900)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:324)
at com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:522)
at org.apache.kafka.common.metrics.JmxReporter.reregister(JmxReporter.java:161)
... 13 common frames omitted
---
[2021-02-08 00:49:23.966][org.apache.flink.runtime.checkpoint.CheckpointCoordinator][]Stopping checkpoint coordinator for job 36ecb2327a19f39c3a48e5f9467255b9.
[2021-02-08 00:49:23.966][org.apache.flink.runtime.checkpoint.StandaloneCompletedCheckpointStore][]Shutting down
[2021-02-08 00:49:23.972][org.apache.flink.runtime.dispatcher.StandaloneDispatcher][]Job 36ecb2327a19f39c3a48e5f9467255b9 reached globally terminal state FAILED.
[2021-02-08 00:49:23.989][org.apache.flink.runtime.jobmaster.JobMaster][]Stopping the JobMaster for job ty-bi-flink:gift(36ecb2327a19f39c3a48e5f9467255b9).
[2021-02-08 00:49:23.991][org.apache.flink.runtime.checkpoint.CheckpointCoordinator][]Triggering checkpoint 1435959 @ 1612716563991 for job 4bece886fb7bd17c2e67bbfca7c49ddb.
[2021-02-08 00:49:23.997][org.apache.flink.runtime.jobmaster.JobMaster][]Close ResourceManager connection a5c2e0aad7cfa0a7051cc0f673a82e94: JobManager is shutting down..
[2021-02-08 00:49:23.997][org.apache.flink.runtime.jobmaster.slotpool.SlotPool][]Suspending SlotPool.
[2021-02-08 00:49:23.997][org.apache.flink.runtime.jobmaster.slotpool.SlotPool][]Stopping SlotPool.
[2021-02-08 00:49:23.997][org.apache.flink.runtime.jobmaster.JobManagerRunner][]JobManagerRunner already shutdown.


TM log:
[2021-02-08 00:49:23.675][org.apache.kafka.clients.consumer.internals.AbstractCoordinator][]Marking the coordinator 172.20.23.243:9092 (id: 2147483546 rack: null) dead for group ty_bi_flink
[2021-02-08 00:49:23.721][org.apache.kafka.clients.consumer.internals.AbstractCoordinator][]Discovered coordinator 172.20.23.245:9092 (id: 2147483544 rack: null) for group ty_bi_flink.
[2021-02-08 00:49:23.721][org.apache.kafka.clients.consumer.internals.AbstractCoordinator][]Marking the coordinator 172.20.23.243:9092 (id: 2147483546 rack: null) dead for group ty_bi_flink
[2021-02-08 00:49:23.721][org.apache.kafka.clients.consumer.internals.AbstractCoordinator][]Discovered coordinator 172.20.23.245:9092 (id: 2147483544 rack: null) for group ty_bi_flink.
[2021-02-08 00:49:23.751][org.apache.flink.runtime.taskmanager.Task][]Source: Custom Source -> Filter -> Flat Map (1/1) (cc7dc71b5a40ece6fd1383ae30c316a4) switched from RUNNING to FAILED.
org.apache.kafka.common.KafkaException: Error registering mbean kafka.consumer:type=consumer-node-metrics,client-id=consumer-2,node-id=node-2147483544
at org.apache.kafka.common.metrics.JmxReporter.reregister(JmxReporter.java:163)
at org.apache.kafka.common.metrics.JmxReporter.metricChange(JmxReporter.java:81)
at org.apache.kafka.common.metrics.Metrics.registerMetric(Metrics.java:504)
at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:255)
at org.apache.kafka.common.metrics.Sensor.add(Sensor.java:240)
at org.apache.kafka.common.network.Selector$SelectorMetrics.maybeRegisterConnectionMetrics(Selector.java:817)
at org.apache.kafka.common.network.Selector.pollSelectionKeys(Selector.java:353)
at org.apache.kafka.common.network.Selector.poll(Selector.java:326)
at org.apache.kafka.clients.NetworkClient.poll(NetworkClient.java:433)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:232)
at org.apache.kafka.clients.consumer.internals.ConsumerNetworkClient.poll(ConsumerNetworkClient.java:208)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1096)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:1043)
at org.apache.flink.streaming.connectors.kafka.internal.KafkaConsumerThread.run(KafkaConsumerThread.java:257)
Caused by: javax.management.InstanceAlreadyExistsException: kafka.consumer:type=consumer-node-metrics,client-id=consumer-2,node-id=node-2147483544
at com.sun.jmx.mbeanserver.Repository.addMBean(Repository.java:437)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerWithRepository(DefaultMBeanServerInterceptor.java:1898)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerDynamicMBean(DefaultMBeanServerInterceptor.java:966)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerObject(DefaultMBeanServerInterceptor.java:900)
at com.sun.jmx.interceptor.DefaultMBeanServerInterceptor.registerMBean(DefaultMBeanServerInterceptor.java:324)
at com.sun.jmx.mbeanserver.JmxMBeanServer.registerMBean(JmxMBeanServer.java:522)
at org.apache.kafka.common.metrics.JmxReporter.reregister(JmxReporter.java:161)
... 13 common frames omitted
[2021-02-08 00:49:23.751][org.apache.flink.runtime.taskmanager.Task][]Freeing task resources for Source: Custom Source -> Filter -> Flat Map (1/1) (cc7dc71b5a40ece6fd1383ae30c316a4).
[2021-02-08 00:49:23.751][org.apache.kafka.clients.consumer.internals.AbstractCoordinator][]Marking the coordinator 172.20.23.245:9092 (id: 2147483544 rack: null) dead for group ty_bi_flink
[2021-02-08 00:49:23.752][org.apache.flink.runtime.taskmanager.Task][]Ensuring all FileSystem streams are closed for task Source: Custom Source -> Filter -> Flat Map (1/1) (cc7dc71b5a40ece6fd1383ae30c316a4) [FAILED]
[2021-02-08 00:49:23.752][org.apache.flink.streaming.connectors.kafka.internal.Kafka09Fetcher][]Committing offsets to Kafka failed. This does not compromise Flink's checkpoints.
org.apache.kafka.clients.consumer.RetriableCommitFailedException: Offset commit failed with a retriable exception. You should retry committing offsets. The underlying error was: This is not the correct coordinator.
[2021-02-08 00:49:23.752][org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase][]Async Kafka commit failed.
org.apache.kafka.clients.consumer.RetriableCommitFailedException: Offset commit failed with a retriable exception. You should retry committing offsets. The underlying error was: This is not the correct coordinator.
[2021-02-08 00:49:23.753][org.apache.kafka.clients.consumer.internals.AbstractCoordinator][]Marking the coordinator 172.20.23.245:9092 (id: 2147483544 rack: null) dead for group ty_bi_flink
[2021-02-08 00:49:23.776][org.apache.flink.runtime.taskexecutor.TaskExecutor][]Un-registering task and sending final execution state FAILED to JobManager for task Source: Custom Source -> Filter -> Flat Map cc7dc71b5a40ece6fd1383ae30c316a4.


freeza1...@outlook.com
