[
https://issues.apache.org/jira/browse/FLINK-23573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Xintong Song closed FLINK-23573.
--------------------------------
Resolution: Fixed
Fixed via
- master (1.14): e45723b503b9cc793317a6cad7c0d4c8075c0d16
> Tests fail with AdaptiveScheduler due to exceptions in logs trying to offer
> slots after JobMaster shutdown
> ----------------------------------------------------------------------------------------------------------
>
> Key: FLINK-23573
> URL: https://issues.apache.org/jira/browse/FLINK-23573
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination
> Affects Versions: 1.14.0
> Reporter: Xintong Song
> Assignee: Xintong Song
> Priority: Major
> Labels: pull-request-available, test-stability
> Fix For: 1.14.0
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20267&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=8d6b4dd3-4ca1-5611-1743-57a7d76b395a&l=412
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20396&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=8d6b4dd3-4ca1-5611-1743-57a7d76b395a&l=2390
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=20454&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=8d6b4dd3-4ca1-5611-1743-57a7d76b395a&l=371
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=21228&view=logs&j=4dd4dbdd-1802-5eb7-a518-6acd9d24d0fc&t=7c4a8fb8-eeee-5a77-f518-4176bfae300b&l=2437
> The test failed due to exceptions in logs. I executed the following command
> from flink-end-to-end-tests/test-scripts/common.sh on the logs, and it points
> to the RecipientUnreachableException in TM logs. The problem is that, TM
> received extra slot requests from RM after the tasks are finished and slots
> are freed, while the JobMaster it tried to offer slots to had already
> shutdown.
> {code}
> $ grep -rv "GroupCoordinatorNotAvailableException" . \
> | grep -v "RetriableCommitFailedException" \
> | grep -v "NoAvailableBrokersException" \
> | grep -v "Async Kafka commit failed" \
> | grep -v "DisconnectException" \
> | grep -v "Cannot connect to ResourceManager right now" \
> | grep -v "AskTimeoutException" \
> | grep -v "WARN akka.remote.transport.netty.NettyTransport" \
> | grep -v "WARN
> org.apache.flink.shaded.akka.org.jboss.netty.channel.DefaultChannelPipeline" \
> | grep -v 'INFO.*AWSErrorCode' \
> | grep -v "RejectedExecutionException" \
> | grep -v "CancellationException" \
> | grep -v "An exception was thrown by an exception handler" \
> | grep -v "Caused by: java.lang.ClassNotFoundException:
> org.apache.hadoop.yarn.exceptions.YarnException" \
> | grep -v "Caused by: java.lang.ClassNotFoundException:
> org.apache.hadoop.conf.Configuration" \
> | grep -v "java.lang.NoClassDefFoundError:
> org/apache/hadoop/yarn/exceptions/YarnException" \
> | grep -v "java.lang.NoClassDefFoundError:
> org/apache/hadoop/conf/Configuration" \
> | grep -v "java.lang.Exception: Execution was suspended" \
> | grep -v "java.io.InvalidClassException:
> org.apache.flink.formats.avro.typeutils.AvroSerializer" \
> | grep -v "Caused by: java.lang.Exception: JobManager is shutting down" \
> | grep -v "java.lang.Exception: Artificial failure" \
> | grep -v "org.apache.flink.runtime.checkpoint.CheckpointException" \
> | grep -v "org.elasticsearch.ElasticsearchException" \
> | grep -v "Elasticsearch exception" \
> | grep -v "org.apache.flink.runtime.JobException: Recovery is suppressed" \
> | grep -v "WARN akka.remote.ReliableDeliverySupervisor" \
> | grep -i "exception"
> ./flink-vsts-taskexecutor-0-fv-az217-107.log:org.apache.flink.runtime.rpc.exceptions.RecipientUnreachableException:
> Could not send message
> [RemoteFencedMessage(00000000000000000000000000000000,
> RemoteRpcInvocation(null.offerSlots(ResourceID, Collection, Time)))] from
> sender [Actor[akka.tcp://[email protected]:38955/temp/$0b]] to recipient
> [Actor[akka://flink/user/rpc/jobmanager_2#1483449133]], because the recipient
> is unreachable. This can either mean that the recipient has been terminated
> or that the remote RpcService is currently not reachable.
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)