[ https://issues.apache.org/jira/browse/FLINK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138999#comment-17138999 ]
Zhijiang commented on FLINK-18348:
----------------------------------

[~pnowojski] FLINK-16536 might cause this issue, but there is also another place that can cause it in [link|https://github.com/apache/flink/blob/2150533ac0b2a6cc00238041853bbb6ebf22cee9/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/netty/NettyPartitionRequestClient.java#L121]. So this issue was not newly introduced by release-1.11, and it is not a blocker either.

[~wind_ljy] Regarding the solution, I think the conservative way is to swap the order of the `checkError` and `checkState` calls, so that `checkError` runs first. `checkState` might still be a reasonable guard in some cases. E.g. if a logic bug causes us to miss calling `requestPartition` in advance, then `checkState` can help locate the issue.

> RemoteInputChannel should checkError before checking partitionRequestClient
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-18348
>                 URL: https://issues.apache.org/jira/browse/FLINK-18348
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.10.1, 1.11.0, 1.12.0
>            Reporter: Jiayi Liao
>            Priority: Critical
>             Fix For: 1.11.0
>
>
> The error will be set and {{partitionRequestClient}} will be null if
> a remote channel fails to request the partition at the beginning. The
> task will then fail
> [here|https://github.com/apache/flink/blob/2150533ac0b2a6cc00238041853bbb6ebf22cee9/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/consumer/RemoteInputChannel.java#L172]
> when the task thread tries to fetch data from the channels.
> We then get this error:
> {code:java}
> java.lang.IllegalStateException: Queried for a buffer before requesting a queue.
>     at org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>     at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.getNextBuffer(RemoteInputChannel.java:172) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.waitAndGetNextData(SingleInputGate.java:637) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:615) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>     at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNext(SingleInputGate.java:598) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
> {code}
> But the root cause is the {{PartitionConnectionException}} that was set when
> requesting the partition.
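
For illustration only, here is a minimal sketch of the ordering discussed in this thread. It assumes a heavily simplified stand-in class: `ChannelCheckOrderSketch`, its fields, and the plain `IOException` used in place of `PartitionConnectionException` are hypothetical and are not Flink's actual `RemoteInputChannel` code. It only shows that calling `checkError` before `checkState` surfaces the recorded root cause, while `checkState` still catches a missing `requestPartition` call.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical, simplified stand-in for a remote input channel (not Flink's class). */
class ChannelCheckOrderSketch {

    /** First error recorded for this channel, if any. */
    private final AtomicReference<Throwable> cause = new AtomicReference<>();

    /** Stays null when the partition request failed or was never issued. */
    private Object partitionRequestClient;

    /** Simulates requesting the partition; on failure only the error is recorded. */
    public void requestPartition(boolean connectionFails) {
        if (connectionFails) {
            // Assumption: a plain IOException stands in for PartitionConnectionException.
            cause.compareAndSet(null, new IOException("Connection to partition producer failed"));
            return;
        }
        partitionRequestClient = new Object();
    }

    /** Rethrows the recorded error, if there is one. */
    private void checkError() throws IOException {
        Throwable t = cause.get();
        if (t instanceof IOException) {
            throw (IOException) t;
        } else if (t != null) {
            throw new IOException(t);
        }
    }

    private static void checkState(boolean condition, String message) {
        if (!condition) {
            throw new IllegalStateException(message);
        }
    }

    public String getNextBuffer() throws IOException {
        // Error check first, so a failed partition request reports its root cause.
        checkError();
        // State check second, so a missing requestPartition() call is still detected.
        checkState(partitionRequestClient != null,
                "Queried for a buffer before requesting a queue.");
        return "buffer";
    }

    public static void main(String[] args) {
        ChannelCheckOrderSketch channel = new ChannelCheckOrderSketch();
        channel.requestPartition(true);
        try {
            channel.getNextBuffer();
        } catch (IOException e) {
            System.out.println("Root cause surfaced: " + e.getMessage());
        }
    }
}
```

With the two checks in the opposite order, the failed-request case in `main` would instead throw the `IllegalStateException` shown in the stack trace above and hide the connection error.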