[jira] [Commented] (FLINK-16030) Add heartbeat between netty server and client to detect long connection alive
Flink Jira Bot commented on FLINK-16030:

This issue is assigned but has not received an update in 7 days, so it has been labeled "stale-assigned". If you are still working on the issue, please give an update and remove the label. If you are no longer working on the issue, please unassign it so someone else may work on it. In 7 days the issue will be automatically unassigned.

> Add heartbeat between netty server and client to detect long connection alive
>
> Key: FLINK-16030
> URL: https://issues.apache.org/jira/browse/FLINK-16030
> Project: Flink
> Issue Type: Improvement
> Components: Runtime / Network
> Affects Versions: 1.7.2, 1.8.3, 1.9.2, 1.10.0
> Reporter: begginghard
> Assignee: begginghard
> Priority: Major
> Labels: stale-assigned
>
> As reported on [the user mailing list|https://lists.apache.org/list.html?u...@flink.apache.org:lte=1M:Encountered%20error%20while%20consuming%20partitions]
> Network can fail in many ways, sometimes in pretty subtle ones (e.g. a high packet-loss ratio).
> When the long-lived TCP connection between the Netty client and server is lost, the server fails to send a response to the client and shuts down the channel. At the same time, the Netty client does not know that the connection has been lost, so it keeps waiting for two hours.
> To detect whether the long-lived TCP connection between the Netty client and server is still alive, there are two options: TCP keepalive and an application-level heartbeat.
> TCP keepalive defaults to 2 hours. When the connection dies, the Netty client waits those 2 hours before it triggers an exception and enters failover recovery.
> For faster detection, Netty provides IdleStateHandler, which uses a ping-pong mechanism: if the Netty client sends n consecutive ping messages and receives no pong message, it triggers an exception.
> -- This message was sent by Atlassian Jira (v8.3.4#803005)
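The missed-pong rule from the description above (fail once n consecutive pings go unanswered) can be sketched in plain Java, independent of Netty. The class name and counting scheme below are illustrative assumptions, not Flink's or Netty's actual implementation:

```java
/**
 * Minimal sketch of the ping-pong liveness rule from the ticket:
 * if n consecutive pings go unanswered, declare the connection dead.
 * Class name and threshold are illustrative, not Flink code.
 */
public class PingPongMonitor {
    private final int maxMissedPongs;
    private int missedPongs = 0;

    public PingPongMonitor(int maxMissedPongs) {
        this.maxMissedPongs = maxMissedPongs;
    }

    /** Called each time a ping is sent without a pong having arrived since. */
    public void onPingSent() {
        missedPongs++;
    }

    /** Called when a pong arrives; any pong resets the counter. */
    public void onPongReceived() {
        missedPongs = 0;
    }

    /** True once maxMissedPongs pings in a row went unanswered. */
    public boolean isConnectionDead() {
        return missedPongs >= maxMissedPongs;
    }
}
```

In a real Netty pipeline the ping scheduling would come from IdleStateHandler firing idle events, with the handler closing the channel once this kind of counter trips.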
Piotr Nowojski commented on FLINK-16030:

{quote}
It is probably fair to say that in cases of "non-recoverable pipelined" partitions, the sender should handle the exception directly as well.
{quote}

I think this is important to keep in mind here. Indeed, downstream failures (a timeout detected on the upstream node) should in some cases (retry-able partitions) just cause the downstream node to fail over, but in others (pipelined partitions) cause a failover of both the upstream and the downstream task.
Zhijiang commented on FLINK-16030:

Considering the direction of "richer exception handling" [~sewen] mentioned, I think we have already followed this way in some previous cases. E.g. `PartitionNotFoundException` is reported on the downstream side when requesting the upstream's partition fails; the JM can then check the upstream task's state to make a decision.

The current exception detection and report mechanisms are a bit different on the netty client and server sides.
* On the client side, any exception detected by a netty handler causes the respective tasks to enter the failed state, so the JM is aware of the exception via task state reporting. Besides that, the client also reports to the netty server side over the network via `CancelPartitionRequest` and `CloseRequest` messages. The netty server side releases view resources as a result, but does not fail the respective upstream tasks, which may be canceled by the JM if necessary.
* On the server side, any exception detected by a netty handler does not cause the respective tasks to fail, so it is also not reported to the JM at the moment. But the server notifies the netty client side via an `ErrorResponse` message in some cases (`PartitionRequestQueue#exceptionCaught`). If the client handler receives the error message, it fails the downstream task, which reports to the JM. The JM can then cancel the upstream task if necessary.

So for both the client and server sides of the network stack we already have an exception detection mechanism, but we are missing an effective report mechanism in some cases. The earlier proposal to add a ping message to the network stack is, in essence, also about the detection mechanism. But if it relies on task failure to realize the report mechanism, it would bring unnecessary job restarts for spurious ping timeouts.

Considering this ticket's case, I think we could still follow the "richer exception handling" direction to some extent. For confirmed exceptions, we can rely on task failure to report to the JM as before. For ambiguous exceptions, we should have a mechanism to report to the JM directly, as `PartitionNotFoundException` does; then the JM has the global information for the final decision. E.g. the JM can inquire about the other side's state, wait for a while to allow for the other side's report delay, or even send an RPC to ask the other side if necessary before making a decision.
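The split between confirmed and ambiguous exceptions described above could be modeled as a simple dispatch step. Everything here is a hypothetical sketch: the class, enum, and the choice of which exception types count as "ambiguous" are assumptions for illustration, not Flink API:

```java
/**
 * Hypothetical sketch of the "richer exception handling" split discussed
 * above: confirmed errors fail the task directly (reported to the JM via
 * task state), ambiguous ones are reported to the JM for a global decision.
 * Not Flink API; the classification below is an assumed example.
 */
public class ExceptionDispatch {
    public enum Action { FAIL_TASK, REPORT_TO_JM }

    /**
     * Connection-level timeouts are treated as ambiguous here (the other
     * side may still be healthy), so they go to the JM; everything else
     * fails the task as before.
     */
    public static Action classify(Throwable t) {
        if (t instanceof java.io.InterruptedIOException
                || t instanceof java.net.SocketTimeoutException) {
            return Action.REPORT_TO_JM;
        }
        return Action.FAIL_TASK;
    }
}
```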
Stephan Ewen commented on FLINK-16030:

The original network stack philosophy was that failures are handled by the receiver. The assumption was that there would be various "partition types" where the receiver could "re-try" to fetch the data: batch, buffered pipelines.

It is probably fair to say that in cases of "non-recoverable pipelined" partitions, the sender should handle the exception directly as well.
Piotr Nowojski commented on FLINK-16030:

Thanks for reporting back [~Jiangang]. As I wrote above, unfortunately this kind of keep-alive message wouldn't work well in all of the cases.

[~sewen] I think it's a bit different from what you were thinking. As [~zjwang] and I discussed, an alternative to the ping-pong is to simply report a connection timeout the same way as any other error, while currently (for reasons unknown to us) it is completely ignored on the "server"/upstream side ({{PartitionRequestQueue#channelInactive}} vs {{PartitionRequestQueue#exceptionCaught}}). What you are suggesting would require adding more context information to the exceptions and then processing that context in the JM.
Liu commented on FLINK-16030:

Sorry for the late reply. As a quick fix, I send a ping message to the server and expect to receive a pong message on the client side. If the client does not receive a pong message for some time, such as 3 seconds, it fails the job. It's good to see that so many people are interested in this bug. Looking forward to a better solution.
Stephan Ewen commented on FLINK-16030:

We had a discussion some time ago about "richer exception handling" on the Job Manager.

For example, when TM1 and TM2 are communicating and TM2 crashes, often the first exception is that TM1 reports a "loss of connection with TM2" from Netty. When recovery is started, the heartbeats have not yet timed out, so the JM tries to deploy to TM2 again. That deployment typically fails (ask timeout). Eventually the heartbeat times out, TM2 is removed, and the redeployment succeeds. Having to wait for a heartbeat timeout from TM2 to understand that it is lost prolongs recovery.

What we could do is make more use of exception information. For example, if TM1 reports a connection failure with TM2, we could use that to either cancel the corresponding task on TM2, or "graylist" TM2 until it reports a proper running status again.

Just bringing this up because these things seem to go in a similar direction.
Zhijiang commented on FLINK-16030:

After some offline discussions with [~pnowojski], we reached the agreement that it might be proper to enhance the server side to also trigger a failure once it detects any exception; the JM can then handle the whole job restart.

Double-checking the current code: once the netty client detects an exception, it notifies the server side in a best-effort way via `CancelPartition` and `ClosePartition` messages before closing the channel. Meanwhile, it also fails the respective task via `RemoteInputChannel#onError`. But on the netty server side, it only releases the view resources once it detects an inactive channel. If it could also trigger a task failure the way the client side does, the JM could handle it well.

We should also be careful to avoid misinterpretation: in the normal case, when the partition has been fully consumed by the downstream side, the inactive channel is caused by a normal channel close and should not trigger any failure.

[~begginghard] After you have thought it through in this way, we can sync further or discuss on the PR page.
begginghard commented on FLINK-16030:

[~zjwang] [~pnowojski] I agree with you. I'm going to look for other ways to solve this problem, for example notifying the JM to handle the exception.
Yingjie Cao commented on FLINK-16030:

I have also encountered this problem (though just once). In my case, the netty exception handler was called, and from the log I could see that the reason was a write timeout; it did not trigger any further failure such as a job failover. Just for your information.
Zhijiang commented on FLINK-16030:

I agree with [~pnowojski]'s concern. I had forgotten the earlier issue that the netty thread might get stuck in IO operations for a blocking partition while reading data in some serious scenarios. That might delay the response to a heartbeat ping message and cause an unnecessary failure. The current netty handlers in the Flink stack are unified for both pipelined and blocking partitions, so we cannot consider only the pipelined case.

To answer [~pnowojski]'s question above: the current heartbeat between TM/JM does not work well for this case. When the server side becomes aware of the network issue (a local machine iptables issue), it closes the channel on its side and releases all the partitions. But this can also happen in the normal case, for example when the client side sends a `CancelPartition|CloseRequest` message explicitly to close the channel, so the server side does not throw an exception to report to the JM. In short, the server side cannot distinguish the cases when it becomes aware of an inactive channel. When the server side closes its local channel, the client side only becomes aware of the issue after two hours (based on the default kernel keep-alive mechanism), so the whole job is stuck until it fails after two hours.

I guess there might be other options to work around this issue. If we can make the server side distinguish the different causes of inactive channels, it can perform different actions to notify the JM to trigger a job failure.
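The distinction asked for above (an inactive channel following a normal cancel/close request versus an unexpected drop) can be modeled with an "expected close" flag. This is a hypothetical sketch of the idea, not Flink's PartitionRequestQueue; all names are assumptions:

```java
/**
 * Hypothetical sketch of distinguishing channel-close causes on the server
 * side, as discussed above: a close that follows an explicit cancel/close
 * request from the client is normal; any other inactive channel is
 * suspicious and would be escalated (e.g. reported to the JM).
 * Not Flink's PartitionRequestQueue.
 */
public class ChannelCloseTracker {
    private boolean closeExpected = false;

    /** Called when a CancelPartitionRequest or CloseRequest arrives. */
    public void onCloseRequested() {
        closeExpected = true;
    }

    /**
     * Called from the channel-inactive path; returns true if this close
     * should be treated as a failure rather than a normal shutdown.
     */
    public boolean shouldReportFailure() {
        return !closeExpected;
    }
}
```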
Piotr Nowojski commented on FLINK-16030:

So the real-world scenario is that the network connection between the Task Managers and the Job Manager is working fine, heartbeats are going through, and there were no other exceptions? And just a single (or a bunch of) connection(s) between some two Task Managers is not working properly?

I'm trying to understand the severity/impact of this issue and whether we can solve it in some other way. As I wrote above, adding a heartbeat between Task Managers could open a different can of worms; for example, it probably wouldn't be stable for any job using {{BoundedBlockingSubpartition}} (and I'm not entirely sure whether {{PipelinedSubpartition}} is 100% non-blocking).
begginghard commented on FLINK-16030:

[~pnowojski] I have reproduced the problem in a test environment. The problem occurs after I disable data transfer from the netty server to the client with iptables, while keeping the client-to-server direction alive.
[jira] [Commented] (FLINK-16030) Add heartbeat between netty server and client to detect long connection alive
[ https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036178#comment-17036178 ] Piotr Nowojski commented on FLINK-16030: Could someone also explain the scenario in which the lack of this heartbeat between task managers causes issues? The setup is that there is an idling connection between an upstream TM and a downstream TM, and the upstream TM fails? Silently? Shouldn't the Job Manager detect this and trigger failover of the remaining TMs?
[jira] [Commented] (FLINK-16030) Add heartbeat between netty server and client to detect long connection alive
[ https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036108#comment-17036108 ] Piotr Nowojski commented on FLINK-16030: Currently our threading model and network stack cannot reliably support heartbeats on data network channels (we do have them on Akka). The reason is that we perform blocking operations inside Netty threads (we were recently discussing [this here|http://mail-archives.apache.org/mod_mbox/flink-dev/202002.mbox/browser]). Unless the keep-alive is set to a value like 1 hour, I would be afraid that if we add such a feature, we will get more false-positive connection timeouts, confusing users and causing us more new problems than solving old ones.
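The false-positive risk raised above can be illustrated with a deterministic back-of-the-envelope check. All names and numbers here are hypothetical; the sketch assumes the pong replies would be sent from the same Netty event loop that executes the blocking operation, as the comment describes.

```java
// Deterministic illustration of the false-positive concern: if the event
// loop that should answer pings is blocked for `blockedMillis`, the peer
// misses roughly blockedMillis / pingIntervalMillis consecutive pongs and
// may kill a perfectly healthy connection. Names and numbers are made up.
final class FalsePositiveCheck {
    /** Pings the peer sends, and gets no answer to, while we are blocked. */
    static long missedPongs(long blockedMillis, long pingIntervalMillis) {
        return blockedMillis / pingIntervalMillis;
    }

    /** True if the peer would wrongly declare a healthy connection dead. */
    static boolean falsePositive(long blockedMillis, long pingIntervalMillis,
                                 int maxMissedPongs) {
        return missedPongs(blockedMillis, pingIntervalMillis) >= maxMissedPongs;
    }

    public static void main(String[] args) {
        // A 90 s blocking operation with 10 s pings and a 3-miss limit:
        // 9 missed pongs, so the connection is torn down even though both
        // sides are alive and the network is fine.
        System.out.println(falsePositive(90_000, 10_000, 3)); // prints "true"
    }
}
```

This is why the comment suggests that only a very coarse interval (on the order of an hour) would be safe as long as blocking work can run on Netty threads.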
[jira] [Commented] (FLINK-16030) Add heartbeat between netty server and client to detect long connection alive
[ https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036085#comment-17036085 ] Zhijiang commented on FLINK-16030: -- Thanks for opening this issue, [~begginghard]. I think it makes sense for some unstable physical network environments. We actually did a similar thing in our private branch before. If you want to contribute, I can assign this ticket to you.