Cai Liuyang created FLINK-26080:
-----------------------------------
Summary: PartitionRequest client use Netty's IdleStateHandler to
monitor channel's status
Key: FLINK-26080
URL: https://issues.apache.org/jira/browse/FLINK-26080
Project: Flink
Issue Type: Improvement
Components: Runtime / Network
Reporter: Cai Liuyang
In our production environment, we encountered an abnormal case: the upstream
task is backpressured while its downstream task is idle, and the job stays in
this state until the checkpoint times out (we use aligned checkpoints). After
analysing this case, we found the cause (the kernel on our machines may have a
bug that loses socket events):
1. NettyServer hits a ReadTimeoutException when reading data from the channel.
It then releases the NetworkSequenceViewReader (which is responsible for
sending data to the PartitionRequestClient) and writes an ErrorResponse to the
PartitionRequestClient.
2. The PartitionRequestClient never receives the ErrorResponse (possibly
because of the kernel bug on our machines).
3. After writing the ErrorResponse, NettyServer closes the channel (the socket
moves to the FIN_WAIT_1 state), but the client machine never receives the
server's FIN. The client therefore treats the channel as healthy and keeps
waiting for the server's BufferResponse, although the server has already
released the corresponding NetworkSequenceViewReader.
4. The server machine releases the socket once it has stayed in FIN_WAIT_1 for
too long, but the socket on the client machine remains in the ESTABLISHED
state.
To avoid this case, I think there are two options:
1. Enable TCP keep-alive on the client (Flink already does this). This would
also require lowering the machine's TCP keep-alive time (the default is 7200
seconds, which is too long).
2. Have the client use Netty's IdleStateHandler to detect whether the channel
is idle (for reads or writes). If the channel is idle, the client writes a
ping message to the server to check whether the channel is still alive.
Of the two options, I recommend option 2, because changing the machine's TCP
keep-alive time would also affect other services running on the same machine.
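
To make the idle-detection idea concrete, here is a minimal sketch of the
decision logic, written in plain Java without Netty so that it runs on its
own. The class IdleChannelMonitor, its methods, and both timeouts are
hypothetical names chosen for illustration; they are not part of Flink or
Netty. In a real client, Netty's IdleStateHandler would fire an IdleStateEvent
into userEventTriggered(), which would play the role of onTick() here, and the
ping would presumably need its own message type.

```java
// Hypothetical sketch, not Flink or Netty code: models when a client should
// ping the server and when it should give up on a silent channel.
final class IdleChannelMonitor {
    enum Action { NONE, SEND_PING, CLOSE }

    private final long idleTimeoutMs; // silence allowed before a ping is sent
    private final long pingTimeoutMs; // time allowed for any reply after a ping
    private long lastReadMs;          // time of the last data received
    private long pingSentMs = -1;     // -1 means no ping is outstanding

    IdleChannelMonitor(long idleTimeoutMs, long pingTimeoutMs, long nowMs) {
        this.idleTimeoutMs = idleTimeoutMs;
        this.pingTimeoutMs = pingTimeoutMs;
        this.lastReadMs = nowMs;
    }

    // Call whenever anything is read from the channel (data or ping reply).
    void onRead(long nowMs) {
        lastReadMs = nowMs;
        pingSentMs = -1;
    }

    // Call periodically; the caller performs the returned action.
    Action onTick(long nowMs) {
        if (pingSentMs >= 0) {
            // A ping is outstanding: close only if the server stayed silent.
            return (nowMs - pingSentMs >= pingTimeoutMs) ? Action.CLOSE : Action.NONE;
        }
        if (nowMs - lastReadMs >= idleTimeoutMs) {
            pingSentMs = nowMs;
            return Action.SEND_PING;
        }
        return Action.NONE;
    }
}
```

With this logic, a client stuck in the half-open state described above would
send a ping after the idle timeout, get no reply, and close the channel after
the ping timeout instead of waiting for the checkpoint to time out.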
--
This message was sent by Atlassian Jira
(v8.20.1#820001)