[jira] [Commented] (FLINK-16030) Add heartbeat between netty server and client to detect long connection alive

2021-04-16 Thread Flink Jira Bot (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17323093#comment-17323093
 ] 

Flink Jira Bot commented on FLINK-16030:


This issue is assigned but has not received an update in 7 days so it has been 
labeled "stale-assigned". If you are still working on the issue, please give an 
update and remove the label. If you are no longer working on the issue, please 
unassign so someone else may work on it. In 7 days the issue will be 
automatically unassigned.

> Add heartbeat between netty server and client to detect long connection alive
> -
>
> Key: FLINK-16030
> URL: https://issues.apache.org/jira/browse/FLINK-16030
> Project: Flink
>  Issue Type: Improvement
>  Components: Runtime / Network
>Affects Versions: 1.7.2, 1.8.3, 1.9.2, 1.10.0
>Reporter: begginghard
>Assignee: begginghard
>Priority: Major
>  Labels: stale-assigned
>
> As reported on [the user mailing 
> list|https://lists.apache.org/list.html?u...@flink.apache.org:lte=1M:Encountered%20error%20while%20consuming%20partitions]
> Network connections can fail in many ways, some of them subtle (e.g. a high 
> packet loss ratio).  
> When the long-lived TCP connection between the netty client and server is 
> lost, the server fails to send its response to the client and then shuts down 
> the channel. Meanwhile, the netty client does not know that the connection 
> has been dropped, so it keeps waiting for two hours.
> To detect whether the long-lived TCP connection between the netty client and 
> server is still alive, there are two options: TCP keepalive and heartbeat.
>  
> The TCP keepalive idle time is 2 hours by default. When the long-lived TCP 
> connection dies, the netty client keeps waiting for 2 hours before it 
> triggers an exception and enters failover recovery.
> For faster detection, netty provides IdleStateHandler, which can be used to 
> build a ping-pong mechanism: if the netty client sends n consecutive ping 
> messages and receives no pong message, it triggers an exception.
>  
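
The "n consecutive pings without a pong" rule from the description can be sketched independently of Netty. This is only a minimal sketch: `HeartbeatMonitor`, `tick`, and `onPongReceived` are hypothetical names, not Flink or Netty API, and the assumption is that `tick` is driven once per heartbeat interval (for example from an idle event).

```java
/** Hypothetical helper: tracks consecutive pings that received no pong. */
final class HeartbeatMonitor {
    private final int maxUnansweredPings;
    private int unansweredPings;

    HeartbeatMonitor(int maxUnansweredPings) {
        if (maxUnansweredPings < 1) {
            throw new IllegalArgumentException("maxUnansweredPings must be >= 1");
        }
        this.maxUnansweredPings = maxUnansweredPings;
    }

    /**
     * Called once per heartbeat interval. Returns true if the connection
     * should be declared dead; otherwise records that a new ping is sent.
     */
    boolean tick() {
        if (unansweredPings >= maxUnansweredPings) {
            return true; // n pings went out and no pong came back
        }
        unansweredPings++;
        return false;
    }

    /** Any pong proves the connection is alive, so the counter is reset. */
    void onPongReceived() {
        unansweredPings = 0;
    }
}
```

The threshold trades detection latency against spurious failovers: a peer whose I/O thread is temporarily stalled answers late as well, so a small n fails jobs that would otherwise have recovered.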



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (FLINK-16030) Add heartbeat between netty server and client to detect long connection alive

2020-02-18 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038880#comment-17038880
 ] 

Piotr Nowojski commented on FLINK-16030:


{quote}
It is probably fair to say that in cases of "non-recoverable pipelined" 
partitions, the sender should handle the exception directly as well.
{quote}
I think this is important to keep in mind here. Indeed, downstream failures 
(a timeout detected on the upstream node) should in some cases (retry-able 
partitions) just cause the downstream node to fail over, but in others 
(pipelined partitions) cause failover of both the upstream and downstream tasks.



[jira] [Commented] (FLINK-16030) Add heartbeat between netty server and client to detect long connection alive

2020-02-17 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038512#comment-17038512
 ] 

Zhijiang commented on FLINK-16030:
--

Regarding the direction of "richer exception handling" that [~sewen] mentioned, 
I think we have already followed it in some previous cases. E.g. 
`PartitionNotFoundException` is reported on the downstream side when requesting 
the upstream partition fails; the JM can then check the upstream task's state 
to make a decision. 

The current exception detection and reporting mechanisms differ somewhat 
between the netty client and server sides.
 * On the client side, any exception detected by a netty handler causes the 
respective tasks to enter the failed state, so the JM learns of the exception 
via task state reporting. In addition, the client reports to the netty server 
side over the network via `CancelPartitionRequest` and `CloseRequest` messages. 
The netty server side releases the view resources as a result, but does not 
fail the respective upstream tasks, which the JM may cancel if necessary.
 * On the server side, exceptions detected by a netty handler do not cause the 
respective tasks to fail, so they are also not reported to the JM at the 
moment. The server does notify the netty client side via an `ErrorResponse` 
message in some cases (`PartitionRequestQueue#exceptionCaught`). If the client 
handler receives the error message, it fails the downstream task, which 
reports to the JM. The JM can then cancel the upstream task if necessary.

So on both the client and server sides of the network stack we already have an 
exception detection mechanism, but in some cases we lack an effective 
reporting mechanism. The earlier proposal to add a ping message in the network 
stack is in essence also a detection mechanism. But if it relies on task 
failure for reporting, it would cause unnecessary job restarts on spurious 
ping timeouts.

For the case in this ticket, I think we could still follow the "richer 
exception handling" direction to some extent. For confirmed exceptions, we can 
rely on task failure to report to the JM as before. For ambiguous exceptions, 
we should have a mechanism to report to the JM directly, as is done for 
`PartitionNotFoundException`, so that the JM has the global information to 
make the final decision. E.g. the JM can check the state of the other side, 
wait for a while in case the other side's report is delayed, or even send an 
RPC to query the other side if necessary before making a decision.
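
The split between confirmed and ambiguous exceptions could look roughly like the sketch below. Everything here is hypothetical and for illustration only: the enums, `NetworkFailureTriage`, and the particular mapping from peer state to action are assumptions, not existing Flink classes or agreed behavior.

```java
/** Hypothetical classification of a network failure report sent to the JM. */
enum ReportKind { CONFIRMED, AMBIGUOUS }

/** Hypothetical view the JM has of the task on the other side of the connection. */
enum PeerState { FAILED, RUNNING, UNKNOWN }

/** Hypothetical actions the JM could take. */
enum JmAction { FAIL_TASK, WAIT_FOR_PEER_REPORT, QUERY_PEER_VIA_RPC }

final class NetworkFailureTriage {
    /** Confirmed reports fail the task as before; ambiguous ones consult the peer's state. */
    static JmAction decide(ReportKind kind, PeerState peer) {
        if (kind == ReportKind.CONFIRMED) {
            return JmAction.FAIL_TASK;
        }
        switch (peer) {
            case FAILED:
                return JmAction.FAIL_TASK;            // peer failure corroborates the report
            case RUNNING:
                return JmAction.WAIT_FOR_PEER_REPORT; // the peer's own report may be delayed
            default:
                return JmAction.QUERY_PEER_VIA_RPC;   // no information yet, so ask explicitly
        }
    }
}
```

The point of the sketch is only that the decision moves to the JM, which has the global view, instead of being forced by a task failure on one side.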



[jira] [Commented] (FLINK-16030) Add heartbeat between netty server and client to detect long connection alive

2020-02-17 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038508#comment-17038508
 ] 

Stephan Ewen commented on FLINK-16030:
--

The original network stack philosophy was that failures are handled by the 
receiver. The assumption was that there would be various "partition types" 
where the receiver could "re-try" to fetch the data. Batch, buffered pipelines.

It is probably fair to say that in cases of "non-recoverable pipelined" 
partitions, the sender should handle the exception directly as well.



[jira] [Commented] (FLINK-16030) Add heartbeat between netty server and client to detect long connection alive

2020-02-17 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17038300#comment-17038300
 ] 

Piotr Nowojski commented on FLINK-16030:


Thanks for reporting back [~Jiangang]. As I wrote above, unfortunately this 
kind of keep-alive message wouldn't work well in all cases.

[~sewen] I think it's a bit different from what you were thinking. As 
[~zjwang] and I discussed, an alternative to the ping-pong is to report a 
connection timeout the same way as any other error, whereas currently (for 
reasons unknown to us) it is completely ignored on the "server"/upstream side 
({{PartitionRequestQueue#channelInactive}} vs 
{{PartitionRequestQueue#exceptionCaught}}). What you are suggesting would 
require adding more context information to the exceptions and then processing 
this context in the JM.



[jira] [Commented] (FLINK-16030) Add heartbeat between netty server and client to detect long connection alive

2020-02-14 Thread Liu (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037105#comment-17037105
 ] 

Liu commented on FLINK-16030:
-

Sorry for the late reply. As a quick fix, I send a ping message to the server 
and expect to receive a pong message on the client side. If the client does 
not receive a pong message within some time, such as 3 seconds, it fails the 
job.


Thanks to everyone for the interest in this bug. I look forward to a better 
solution.





[jira] [Commented] (FLINK-16030) Add heartbeat between netty server and client to detect long connection alive

2020-02-14 Thread Stephan Ewen (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17037094#comment-17037094
 ] 

Stephan Ewen commented on FLINK-16030:
--

We had a discussion some time ago about "richer exception handling" on the Job 
Manager.

For example, when TM1 and TM2 are communicating and TM2 crashes, often the 
first exception is that TM1 reports a "loss of connection with TM2" from Netty. 
When recovery starts, the heartbeats have not yet timed out, so the JM tries to 
deploy again to TM2. That deploy typically fails (ask timeout). Then eventually 
the heartbeat times out and TM2 is removed. Then the redeploy is successful.

Having to wait for a heartbeat timeout from TM2 to understand that it is lost 
prolongs recovery time.
What we could do is make more use of exception information. For example, if TM1 
reports a connection failure with TM2, we can use that to either cancel the 
corresponding task on TM2, or we can "graylist" TM2 until it reports proper 
running status again.

Just bringing this up because these things seem to go in a similar direction.
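
To make the "graylist" idea concrete, here is a minimal sketch. `TaskManagerGraylist` and its method names are hypothetical and not part of Flink; the sketch assumes TMs are identified by a string ID and that a running-status report is enough to clear the suspicion.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical JM-side bookkeeping of TMs suspected to be lost. */
final class TaskManagerGraylist {
    private final Set<String> graylisted = new HashSet<>();

    /** Another TM reported a connection failure with the given TM. */
    void onConnectionFailureReported(String suspectTmId) {
        graylisted.add(suspectTmId);
    }

    /** The suspected TM reported a proper running status again. */
    void onRunningStatusReported(String tmId) {
        graylisted.remove(tmId);
    }

    /** The scheduler would skip graylisted TMs when redeploying tasks. */
    boolean isDeployable(String tmId) {
        return !graylisted.contains(tmId);
    }
}
```

With something like this, the first redeploy after TM1's report would avoid TM2 instead of running into the ask timeout and waiting for the heartbeat to expire.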




[jira] [Commented] (FLINK-16030) Add heartbeat between netty server and client to detect long connection alive

2020-02-14 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036830#comment-17036830
 ] 

Zhijiang commented on FLINK-16030:
--

After some offline discussions with [~pnowojski], we agreed that it might be 
appropriate to enhance the server side to also trigger a failure once it 
detects any exception, so that the JM can handle the whole job restart.

Reviewing the current code again: once the netty client detects any 
exception, it notifies the server side on a best-effort basis via 
`CancelPartition` and `ClosePartition` messages before closing the channel. 
Meanwhile, it also fails the respective task via 
`RemoteInputChannel#onError`.

But the netty server side only releases the view resources once it detects an 
inactive channel. If it also triggered task failure as the client side does, 
the JM could handle it well. We should also be careful to avoid false 
failures, because in the normal case, when the partition has been completely 
consumed by the downstream side, the inactive channel results from a normal 
channel close and should not trigger any failure.

[~begginghard] Once you have thought this approach through, we can sync 
further or discuss on the PR page.
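
One way the server side might tell a normal close from an unexpected inactive channel is to track which views are still open when the channel goes inactive. This is only a sketch under that assumption; `InactiveChannelClassifier` and its methods are hypothetical names, not existing Flink code.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical per-channel tracker of partition views that are still being served. */
final class InactiveChannelClassifier {
    private final Set<String> activeViews = new HashSet<>();

    /** The client requested a partition view over this channel. */
    void onViewRequested(String viewId) {
        activeViews.add(viewId);
    }

    /** The view was fully consumed, or the client cancelled or closed it explicitly. */
    void onViewReleased(String viewId) {
        activeViews.remove(viewId);
    }

    /**
     * Evaluated when the channel becomes inactive: views that were never
     * released suggest the connection was lost rather than closed normally.
     */
    boolean isUnexpectedInactive() {
        return !activeViews.isEmpty();
    }
}
```

Only the unexpected case would then trigger a task failure or a report to the JM; a channel that goes inactive with no open views is treated as a normal close.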



[jira] [Commented] (FLINK-16030) Add heartbeat between netty server and client to detect long connection alive

2020-02-14 Thread begginghard (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036813#comment-17036813
 ] 

begginghard commented on FLINK-16030:
-

[~zjwang] [~pnowojski]  I agree with you. I'm going to look for other ways to 
solve this problem, for example notifying the JM to handle the exception.



[jira] [Commented] (FLINK-16030) Add heartbeat between netty server and client to detect long connection alive

2020-02-14 Thread Yingjie Cao (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036811#comment-17036811
 ] 

Yingjie Cao commented on FLINK-16030:
-

I have also encountered this problem (though just once). In my case, the netty 
exception handler was called, and from the log I could see that the reason was 
a write timeout; it did not trigger any further failure such as job failover. 
Just for your information.



[jira] [Commented] (FLINK-16030) Add heartbeat between netty server and client to detect long connection alive

2020-02-13 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036692#comment-17036692
 ] 

Zhijiang commented on FLINK-16030:
--

I agree with [~pnowojski]'s concern. I had forgotten the earlier issue that the 
netty thread might get stuck in IO operations while reading data for a 
blocking partition in some severe scenarios. That could delay the response to 
a heartbeat ping message and cause unnecessary failures. The current netty 
handlers in the Flink stack are shared by both pipelined & blocking 
partitions, so we cannot consider only the pipelined case.

To answer [~pnowojski]'s question above: the current heartbeat between TM and 
JM does not help in this case. When the server side becomes aware of the 
network issue (a local machine iptables issue), it closes the channel on its 
side and releases all the partitions. But this can also happen in the normal 
case, e.g. when the client side explicitly sends a 
`CancelPartition|CloseRequest` message to close the channel, so the server 
side does not throw any exception to report to the JM. In short, the server 
side cannot distinguish these cases when it becomes aware of an inactive 
channel. 

When the server side closes its local channel, the client side only becomes 
aware of the issue after two hours (based on the default kernel keep-alive 
mechanism), so the whole job is stuck until it fails after two hours.
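
For reference, the two-hour figure follows from the kernel keep-alive settings. The small calculation below assumes the usual Linux defaults (`tcp_keepalive_time` = 7200 s, `tcp_keepalive_intvl` = 75 s, `tcp_keepalive_probes` = 9); the class and method names are made up for the example, and the actual values depend on the host configuration.

```java
import java.time.Duration;

/** Illustrative arithmetic only: how long keep-alive takes to declare a peer dead. */
final class KeepaliveDetectionTime {
    /** Idle time before the first probe, plus one interval per unanswered probe. */
    static Duration approximate(Duration idle, Duration probeInterval, int probeCount) {
        return idle.plus(probeInterval.multipliedBy(probeCount));
    }

    public static void main(String[] args) {
        // Assumed Linux defaults: 7200 s idle, 75 s between probes, 9 probes.
        Duration d = approximate(Duration.ofSeconds(7200), Duration.ofSeconds(75), 9);
        System.out.println(d.getSeconds() + " s"); // prints "7875 s"
    }
}
```

So with those defaults the client notices the dead connection after roughly 2 hours and 11 minutes, which matches the stuck period described above.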

I guess there might be other options to work around this issue. If we can make 
the server side distinguish the different causes of inactive channels, 
then it can take different actions, such as notifying the JM to trigger a job 
failure.



[jira] [Commented] (FLINK-16030) Add heartbeat between netty server and client to detect long connection alive

2020-02-13 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036217#comment-17036217
 ] 

Piotr Nowojski commented on FLINK-16030:


So the real-world scenario is that the network connection between the Task 
Managers and the Job Manager is working fine, heartbeats are going through, 
and there were no other exceptions? And just a single connection (or a bunch 
of them) between some two Task Managers is not working properly?

I'm trying to understand the severity/impact of this issue and whether we can 
solve it in some other way. As I wrote above, adding a heartbeat between Task 
Managers could open a different can of worms: it probably wouldn't be stable 
for any job using {{BoundedBlockingSubpartition}} (and I'm not entirely sure 
whether {{PipelinedSubpartition}} is 100% non-blocking).



[jira] [Commented] (FLINK-16030) Add heartbeat between netty server and client to detect long connection alive

2020-02-13 Thread begginghard (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036196#comment-17036196
 ] 

begginghard commented on FLINK-16030:
-

[~pnowojski]  I have reproduced the problem in the test environment. The 
problem occurs reliably after I use iptables to block data transfer from the 
netty server to the client while keeping the direction from the netty client 
to the server open.






[jira] [Commented] (FLINK-16030) Add heartbeat between netty server and client to detect long connection alive

2020-02-13 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036178#comment-17036178
 ] 

Piotr Nowojski commented on FLINK-16030:


Could someone also explain the scenario in which not having this heartbeat 
between task managers causes issues? 

Is the setup that there is an idle connection between an upstream TM and a 
downstream TM, and the upstream TM fails? Silently? Shouldn't the Job Manager 
detect this and trigger failover of the remaining TMs?






[jira] [Commented] (FLINK-16030) Add heartbeat between netty server and client to detect long connection alive

2020-02-13 Thread Piotr Nowojski (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036108#comment-17036108
 ] 

Piotr Nowojski commented on FLINK-16030:


Currently our threading model and network stack cannot reliably support 
heartbeats on data network channels (we do have them on akka). The reason is 
that we are performing blocking operations inside Netty threads (we were 
recently discussing [this 
here|http://mail-archives.apache.org/mod_mbox/flink-dev/202002.mbox/browser]).

Unless the keep alive is set to a value like 1 hour, I would be afraid that if 
we add such a feature, we will get more false positive connection timeouts, 
confusing users and creating more new problems than it solves.






[jira] [Commented] (FLINK-16030) Add heartbeat between netty server and client to detect long connection alive

2020-02-13 Thread Zhijiang (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-16030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17036085#comment-17036085
 ] 

Zhijiang commented on FLINK-16030:
--

Thanks for opening this issue, [~begginghard]. I think it makes sense for 
unstable physical network environments. We actually did something similar in 
our private branch before. If you want to contribute, I can assign this 
ticket to you.



