[ https://issues.apache.org/jira/browse/FLINK-19791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17231456#comment-17231456 ]
Robert Metzger commented on FLINK-19791:
----------------------------------------
I'm not sure this problem has really been fixed. While testing RC1 of
Flink 1.12.0, I saw the following exception:
{code}
2020-11-13 14:39:15,566 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Co-Flat Map (1/4) (0602ab4f0306596872a928c6375bd153) switched from RUNNING to FAILED on org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@4102bd05.
org.apache.flink.runtime.io.network.partition.consumer.PartitionConnectionException: Connection for partition be51d31b9b1185e636f8b0e964615117#1@96cf744116e8d64d20ca53ccedac43c3 not reachable.
    at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:163) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.internalRequestPartitions(SingleInputGate.java:314) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:286) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.requestPartitions(InputGateWithMetrics.java:94) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:283) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:184) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:577) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:541) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:722) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:547) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_222]
Caused by: java.io.IOException: java.util.concurrent.ExecutionException: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager '/192.168.1.25:57359' has failed. This might indicate that the remote task manager has been lost.
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:95) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:67) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:160) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    ... 12 more
Caused by: java.util.concurrent.ExecutionException: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager '/192.168.1.25:57359' has failed. This might indicate that the remote task manager has been lost.
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_222]
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) ~[?:1.8.0_222]
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:88) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:67) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:160) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    ... 12 more
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager '/192.168.1.25:57359' has failed. This might indicate that the remote task manager has been lost.
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.connect(PartitionRequestClientFactory.java:134) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.connectWithRetries(PartitionRequestClientFactory.java:111) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:77) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:67) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:160) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    ... 12 more
Caused by: java.lang.NullPointerException
    at org.apache.flink.util.Preconditions.checkNotNull(Preconditions.java:61) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient.<init>(NettyPartitionRequestClient.java:73) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.connect(PartitionRequestClientFactory.java:126) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.connectWithRetries(PartitionRequestClientFactory.java:111) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:77) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:67) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:160) ~[flink-dist_2.11-1.12.0.jar:1.12.0]
    ... 12 more
{code}
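Read bottom-up, the trace shows a NullPointerException thrown by Preconditions.checkNotNull in the NettyPartitionRequestClient constructor (called from PartitionRequestClientFactory.connect, line 126). connect() rewraps it as a RemoteTransportException (line 134), CompletableFuture.get() wraps that in an ExecutionException (line 88), and createPartitionRequestClient() adds an IOException on top (line 95). As a result, the top-level message blames a lost task manager even though the innermost failure is a null argument. The Flink-free sketch below models only this wrapping, based on the layering inferred from the frames. CauseChainSketch, RemoteTransportExceptionSketch, and all of their methods are hypothetical stand-ins, not Flink's API.

```java
import java.io.IOException;
import java.util.Objects;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

// Hypothetical stand-in for Flink's RemoteTransportException; not the real class.
class RemoteTransportExceptionSketch extends IOException {
    RemoteTransportExceptionSketch(String message, Throwable cause) {
        super(message, cause);
    }
}

public class CauseChainSketch {

    // Plays the role of the NettyPartitionRequestClient constructor: a null
    // argument fails a checkNotNull-style precondition and throws an NPE.
    static Object newClient(Object handler) {
        return Objects.requireNonNull(handler);
    }

    // Plays the role of connect(): any failure is rewrapped with a message that
    // blames the remote task manager, pushing the real cause one level down.
    static Object connect(String address, Object handler) throws RemoteTransportExceptionSketch {
        try {
            return newClient(handler);
        } catch (RuntimeException e) {
            throw new RemoteTransportExceptionSketch(
                    "Connecting to remote task manager '" + address + "' has failed.", e);
        }
    }

    // Plays the role of createPartitionRequestClient(): the result is handed over
    // through a future, so get() adds an ExecutionException and this method then
    // adds an IOException on top.
    static Object createClient(String address, Object handler) throws IOException {
        CompletableFuture<Object> future = new CompletableFuture<>();
        try {
            future.complete(connect(address, handler));
        } catch (RemoteTransportExceptionSketch e) {
            future.completeExceptionally(e);
        }
        try {
            return future.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IOException(e);
        } catch (ExecutionException e) {
            throw new IOException(e);
        }
    }

    // Follows getCause() to the innermost throwable, i.e. the actual root cause.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        try {
            createClient("/192.168.1.25:57359", null);
        } catch (IOException e) {
            // The top-level message only mentions the transport failure.
            System.out.println(e.getMessage());
            // prints "root cause: java.lang.NullPointerException"
            System.out.println("root cause: " + rootCause(e).getClass().getName());
        }
    }
}
```

When triaging traces like this one, walking getCause() as rootCause() does gets past the misleading top-level message quickly.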
> PartitionRequestClientFactoryTest.testInterruptsNotCached fails with NullPointerException
> -----------------------------------------------------------------------------------------
>
> Key: FLINK-19791
> URL: https://issues.apache.org/jira/browse/FLINK-19791
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Network
> Affects Versions: 1.12.0
> Reporter: Robert Metzger
> Assignee: Roman Khachatryan
> Priority: Major
> Labels: pull-request-available, test-stability
> Fix For: 1.12.0
>
>
> https://dev.azure.com/rmetzger/Flink/_build/results?buildId=8517&view=logs&j=6e58d712-c5cc-52fb-0895-6ff7bd56c46b&t=f30a8e80-b2cf-535c-9952-7f521a4ae374
> {code}
> 2020-10-23T13:25:12.0774554Z [ERROR] testInterruptsNotCached(org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactoryTest) Time elapsed: 0.762 s <<< ERROR!
> 2020-10-23T13:25:12.0775695Z java.io.IOException: java.util.concurrent.ExecutionException: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager '934dfa03c743/172.18.0.2:8080' has failed. This might indicate that the remote task manager has been lost.
> 2020-10-23T13:25:12.0776455Z     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:95)
> 2020-10-23T13:25:12.0777038Z     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactoryTest.testInterruptsNotCached(PartitionRequestClientFactoryTest.java:72)
> 2020-10-23T13:25:12.0777465Z     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 2020-10-23T13:25:12.0777815Z     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 2020-10-23T13:25:12.0778221Z     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 2020-10-23T13:25:12.0778581Z     at java.lang.reflect.Method.invoke(Method.java:498)
> 2020-10-23T13:25:12.0778921Z     at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 2020-10-23T13:25:12.0779331Z     at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 2020-10-23T13:25:12.0779733Z     at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 2020-10-23T13:25:12.0780117Z     at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 2020-10-23T13:25:12.0780484Z     at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 2020-10-23T13:25:12.0780851Z     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 2020-10-23T13:25:12.0781236Z     at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 2020-10-23T13:25:12.0781600Z     at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 2020-10-23T13:25:12.0781937Z     at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 2020-10-23T13:25:12.0782431Z     at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 2020-10-23T13:25:12.0782877Z     at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 2020-10-23T13:25:12.0783223Z     at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 2020-10-23T13:25:12.0783541Z     at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 2020-10-23T13:25:12.0783905Z     at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:365)
> 2020-10-23T13:25:12.0784315Z     at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:273)
> 2020-10-23T13:25:12.0784718Z     at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:238)
> 2020-10-23T13:25:12.0785125Z     at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:159)
> 2020-10-23T13:25:12.0785552Z     at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:384)
> 2020-10-23T13:25:12.0785980Z     at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:345)
> 2020-10-23T13:25:12.0786379Z     at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:126)
> 2020-10-23T13:25:12.0786763Z     at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:418)
> 2020-10-23T13:25:12.0787922Z Caused by: java.util.concurrent.ExecutionException: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager '934dfa03c743/172.18.0.2:8080' has failed. This might indicate that the remote task manager has been lost.
> 2020-10-23T13:25:12.0788575Z     at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357)
> 2020-10-23T13:25:12.0788954Z     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
> 2020-10-23T13:25:12.0789431Z     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:88)
> 2020-10-23T13:25:12.0789808Z     ... 26 more
> 2020-10-23T13:25:12.0790546Z Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager '934dfa03c743/172.18.0.2:8080' has failed. This might indicate that the remote task manager has been lost.
> 2020-10-23T13:25:12.0791396Z     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.connect(PartitionRequestClientFactory.java:134)
> 2020-10-23T13:25:12.0791959Z     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.connectWithRetries(PartitionRequestClientFactory.java:111)
> 2020-10-23T13:25:12.0792732Z     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:77)
> 2020-10-23T13:25:12.0793118Z     ... 26 more
> 2020-10-23T13:25:12.0793342Z Caused by: java.lang.NullPointerException
> 2020-10-23T13:25:12.0793681Z     at org.apache.flink.util.Preconditions.checkNotNull(Preconditions.java:61)
> 2020-10-23T13:25:12.0794319Z     at org.apache.flink.runtime.io.network.netty.NettyPartitionRequestClient.<init>(NettyPartitionRequestClient.java:73)
> 2020-10-23T13:25:12.0794854Z     at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.connect(PartitionRequestClientFactory.java:126)
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)