[ 
https://issues.apache.org/jira/browse/FLINK-18348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17138999#comment-17138999
 ] 

Zhijiang commented on FLINK-18348:
----------------------------------

[~pnowojski] FLINK-16536 might cause this issue, but there was already another 
place that could cause it: 
[link|https://github.com/apache/flink/blob/2150533ac0b2a6cc00238041853bbb6ebf22cee9/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/netty/NettyPartitionRequestClient.java#L121].
 So this issue was not newly introduced by release-1.11, and it is not a 
blocker either.

[~wind_ljy] Regarding the solution, I think the conservative way is to swap 
the order of the `checkError` and `checkState` calls, so that `checkError` 
runs first. `checkState` is still a reasonable guard in some cases. E.g. if a 
logic bug causes `requestPartition` to be skipped beforehand, `checkState` 
can help locate the issue.
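
To illustrate the proposed ordering, here is a minimal sketch. It is a simplified stand-in and not the actual `RemoteInputChannel` code: the class name `RemoteChannelSketch`, the boolean parameter of `requestPartition`, and the use of a plain `IOException` in place of `PartitionConnectionException` are all assumptions made for illustration.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical, simplified stand-in for RemoteInputChannel (not the Flink class). */
class RemoteChannelSketch {

    // Set when requesting the partition fails.
    private final AtomicReference<Throwable> cause = new AtomicReference<>();

    // Stays null if the partition request failed or was never issued.
    private Object partitionRequestClient;

    /** The boolean flag is an assumption that simulates connection success or failure. */
    void requestPartition(boolean connectionSucceeds) {
        if (connectionSucceeds) {
            partitionRequestClient = new Object();
        } else {
            // Stand-in for the PartitionConnectionException mentioned in the issue.
            cause.compareAndSet(null, new IOException("Connecting to the partition failed"));
        }
    }

    String getNextBuffer() throws IOException {
        // 1. Surface the recorded root cause first.
        checkError();
        // 2. Only then guard against a missing requestPartition call (a logic bug).
        if (partitionRequestClient == null) {
            throw new IllegalStateException("Queried for a buffer before requesting a queue.");
        }
        return "buffer";
    }

    private void checkError() throws IOException {
        Throwable t = cause.get();
        if (t != null) {
            throw (t instanceof IOException) ? (IOException) t : new IOException(t);
        }
    }

    public static void main(String[] args) {
        RemoteChannelSketch channel = new RemoteChannelSketch();
        channel.requestPartition(false);
        try {
            channel.getNextBuffer();
        } catch (Exception e) {
            // With this ordering the connection failure is reported,
            // not the IllegalStateException.
            System.out.println(e);
        }
    }
}
```

With this ordering, a failed partition request reports its own exception, while a channel that never called `requestPartition` still hits the state check.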

> RemoteInputChannel should checkError before checking partitionRequestClient
> ---------------------------------------------------------------------------
>
>                 Key: FLINK-18348
>                 URL: https://issues.apache.org/jira/browse/FLINK-18348
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Network
>    Affects Versions: 1.10.1, 1.11.0, 1.12.0
>            Reporter: Jiayi Liao
>            Priority: Critical
>             Fix For: 1.11.0
>
>
> The error will be set and {{partitionRequestClient}} will be null if 
> a remote channel fails to request the partition at the beginning. The 
> task will then fail 
> [here|https://github.com/apache/flink/blob/2150533ac0b2a6cc00238041853bbb6ebf22cee9/flink-runtime/src/main/java/org/apache/flink/runtime/io/network/partition/consumer/RemoteInputChannel.java#L172]
>  when the task thread tries to fetch data from the channels.
> And then we get the error:
> {code:java}
> java.lang.IllegalStateException: Queried for a buffer before requesting a queue.
>         at org.apache.flink.util.Preconditions.checkState(Preconditions.java:195) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>         at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.getNextBuffer(RemoteInputChannel.java:172) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>         at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.waitAndGetNextData(SingleInputGate.java:637) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>         at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:615) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
>         at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNext(SingleInputGate.java:598) ~[flink-dist_2.11-1.11-SNAPSHOT.jar:1.11-SNAPSHOT]
> {code}
> But the root cause is the {{PartitionConnectionException}} we set when 
> requesting the partition.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
