[ 
https://issues.apache.org/jira/browse/GEODE-5896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16655175#comment-16655175
 ] 

Dong Yang edited comment on GEODE-5896 at 10/18/18 1:17 PM:
------------------------------------------------------------

Root cause 

A client-side onRegion function invocation needs two kinds of metadata to be 
ready before the user-defined function executes. The first is static metadata, 
including colocateWith, bucketCount, partitionResolver, etc. The second is 
dynamic metadata that maps each bucketId to a ServerLocation.
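
The two kinds of metadata can be pictured with a small toy model. This is only 
a sketch, not Geode's actual client code: the class RoutingSketch, its fields, 
the key "order-42", and the bucket count value are all hypothetical, and it 
assumes a bucket is chosen by hashing the routing key modulo the bucket count.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of client-side routing metadata; these are not Geode classes.
class RoutingSketch {
    // Static metadata: fixed for the region (value assumed for illustration).
    static final int BUCKET_COUNT = 113;

    // Dynamic metadata: bucketId -> server the client believes hosts it.
    static final Map<Integer, String> bucketToServer = new HashMap<>();

    // Assumption: bucket = hash of the routing key modulo the bucket count.
    static int bucketFor(Object routingKey) {
        return Math.floorMod(routingKey.hashCode(), BUCKET_COUNT);
    }

    // Looks up the target server in the client's cached (possibly stale) map.
    static String serverFor(Object routingKey) {
        return bucketToServer.get(bucketFor(routingKey));
    }

    public static void main(String[] args) {
        int bucket = bucketFor("order-42");
        bucketToServer.put(bucket, "nodeB"); // client's cached view
        String actualHost = "nodeA";         // the bucket has since moved
        System.out.println("client sends to " + serverFor("order-42")
                + ", but the bucket lives on " + actualHost);
    }
}
```

If the cached map is stale, the lookup still succeeds but names the wrong 
server, which is the situation that leads to a redirect between servers.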

The client should send each request to the right server based on this 
metadata. But because GemFire is a dynamic cluster, network issues can occur, 
and nodes can go down or join. The client-side metadata cannot always keep up 
with these changes. When that happens, a request that should go to node A is 
instead sent to node B, and node B then redirects it to node A. When the 
function finishes its logic and streams the result with sendResult, the data is 
first sent from node A to node B, and then from node B to the client. If the 
client program is killed or cancelled, node B catches an exception:

java.net.SocketException: Connection reset by peer: socket write error.

Then the server-to-client channel is closed. But node A does not get any 
exception, because the channel between servers is shared. Node A detects 
nothing wrong, so it keeps sending data to node B until all of the data has 
been sent. Depending on the data volume, this can take minutes, hours, or even 
days. During the transfer, the client may send more requests to the server, and 
the same thing can happen again: the client is cancelled and resends. 
Eventually the server-side resources are exhausted. At that point, the only 
remedy is to restart the cluster.
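
The cost of this behaviour can be illustrated with a toy simulation of the 
relay path from node A through node B to the client. This is a sketch under 
stated assumptions, not Geode code: the class RelaySketch, the method 
wastedChunks, and the propagateFailure switch are hypothetical and do not 
correspond to any Geode setting. With the switch off, the model mirrors the 
behaviour described above, where node A never learns that the client is gone.

```java
// Toy model of the relay path nodeA -> nodeB -> client; not Geode code.
class RelaySketch {
    /**
     * Counts the chunks node A sends after the client has gone away.
     *
     * @param totalChunks      chunks the function wants to stream
     * @param clientDiesAfter  chunks delivered before the client is killed
     * @param propagateFailure hypothetical: node B tells node A to stop
     */
    static int wastedChunks(int totalChunks, int clientDiesAfter,
                            boolean propagateFailure) {
        int wasted = 0;
        for (int chunk = 0; chunk < totalChunks; chunk++) {
            // Node A's write to the shared server-to-server channel always
            // succeeds, so node A itself never sees an error here.
            boolean clientGone = chunk >= clientDiesAfter;
            if (clientGone) {
                wasted++; // node B cannot deliver this chunk to the client
                if (propagateFailure) {
                    break; // node A stops once node B reports the failure
                }
            }
        }
        return wasted;
    }

    public static void main(String[] args) {
        System.out.println("no propagation:   "
                + wastedChunks(1000, 10, false) + " wasted chunks");
        System.out.println("with propagation: "
                + wastedChunks(1000, 10, true) + " wasted chunks");
    }
}
```

In this model the wasted work grows with the size of the result when the 
failure is not propagated, and every resent request adds another such stream.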

 


was (Author: twosand):
Root cause 

A client-side onRegion function invocation needs two kinds of metadata to be 
ready before the user-defined function executes. The first is static metadata, 
including colocateWith, bucketCount, partitionResolver, etc. The second is 
dynamic metadata that maps each bucketId to a ServerLocation.

The client should send each request to the right server based on this 
metadata. But because GemFire is a dynamic cluster, network issues can occur, 
and nodes can go down or join. The client-side metadata cannot always keep up 
with these changes. Then a request that should go to node A is instead sent to 
node B, and node B redirects it to node A. When the function finishes its logic 
and sends the result with sendResult in streaming style, the result data stream 
is first sent from node A to node B, and then from node B to the client. If the 
client program is killed or cancelled, node B catches an exception:

java.net.SocketException: Connection reset by peer: socket write error.

Then the server-to-client channel is closed. But node A does not get any 
exception, because the channel between servers is shared. Node A detects 
nothing wrong, so it keeps sending data to node B until all of the data has 
been sent. Depending on the data volume, this can take minutes, hours, or even 
days. If the client program sends a new request and is killed again, the 
server-side resources are exhausted. At that point, the only remedy is to 
restart the cluster.

 

> Function sendResult can not finish correctly when client stop receive data
> --------------------------------------------------------------------------
>
>                 Key: GEODE-5896
>                 URL: https://issues.apache.org/jira/browse/GEODE-5896
>             Project: Geode
>          Issue Type: Bug
>          Components: functions
>            Reporter: Dong Yang
>            Priority: Major
>
> Scenario:
>  # TCP client-server mode
>  # onRegion invocation with a filter
>  # single-hop enabled on the client side
>  # large data transfers from server to client
>  # sendResult used to stream data from server to client
> Incident:
> The client program is killed or exits normally. The server side cannot 
> detect the exception, so it keeps sending data to the client. Resources are 
> sometimes occupied for a very long time, and the situation gets worse when 
> the client resends the request. As a result, the cluster appears to hang and 
> cannot respond to any request, including API invocations, gfsh commands, etc.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
