[ 
https://issues.apache.org/jira/browse/GEODE-10056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17601726#comment-17601726
 ] 

ASF subversion and git services commented on GEODE-10056:
---------------------------------------------------------

Commit e627e60bae087a2874f2439994ab6d745dbd66a1 in geode's branch 
refs/heads/develop from Jakov Varenina
[ https://gitbox.apache.org/repos/asf?p=geode.git;h=e627e60bae ]

GEODE-10056: Improve gateway-receiver load balance (#7378)

* GEODE-10056: Improve gateway-receiver load balance

The problem is that servers send an incorrect gateway-receiver connection
load to locators within CacheServerLoadMessage. Additionally, locators
do not refresh the gateway-receiver load with the load received in
CacheServerLoadMessage. The only time a locator increments the
gateway-receiver load is after it receives
ClientConnectionRequest{group=__recv_group...} and returns the selected
server in a ClientConnectionResponse message. This is done only by the
coordinator, so the other locators keep the initial load values, which
are never updated.

The solution is to correctly track the gateway-receiver acceptor
connection count and calculate the load from it when sending
CacheServerLoadMessage. Additionally, each locator will read the load
received in CacheServerLoadMessage and update the load for the
gateway-receiver location id in group __recv__group accordingly.

* Updates after the review

* Fix for the flaky test cases

* Updates after review

* Empty commit to trigger test

* Updates after review

* Fix failed distributed test

The test case testMultiUser failed because the WAN service is available
in geode-core distributed tests, and therefore the test now throws:

org.apache.geode.internal.cache.wan.GatewaySenderConfigurationException: Locators must be configured before starting gateway-sender.

instead of:

java.lang.IllegalStateException: WAN service is not available.

* Synchronize handling of receiver load

This commit synchronizes the getting and sending of gateway-receiver
load (CacheServerLoadMessage) on all servers.
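
The fix described in the commit message above can be sketched roughly as
follows. This is a minimal illustration, not Geode's implementation: the
classes ReceiverLoadTable and ReceiverLoadReporter, their methods, and the
"connections / max connections" load formula are all hypothetical names and
assumptions introduced here. The idea shown is that every locator overwrites
its stored receiver load whenever a server reports it, and that a server reads
its connection count and sends the report under a single lock.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical locator-side table of gateway-receiver load (not a Geode class). */
final class ReceiverLoadTable {
  private final Map<String, Double> loadByReceiver = new ConcurrentHashMap<>();

  /** Assumed load formula: tracked acceptor connection count over a maximum. */
  static double loadFor(int connectionCount, int maxConnections) {
    return (double) connectionCount / maxConnections;
  }

  /** Called on every locator when a server's load report arrives. */
  void onLoadMessage(String receiverId, double reportedLoad) {
    loadByReceiver.put(receiverId, reportedLoad);
  }

  /** Called only on the locator that answers a connection request. */
  Optional<String> selectReceiver(double loadPerConnection) {
    Optional<String> best = loadByReceiver.entrySet().stream()
        .min(Map.Entry.comparingByValue())
        .map(Map.Entry::getKey);
    // Bump the local estimate until the next report from the server arrives.
    best.ifPresent(id -> loadByReceiver.merge(id, loadPerConnection, Double::sum));
    return best;
  }

  double loadOf(String receiverId) {
    return loadByReceiver.getOrDefault(receiverId, 0.0);
  }
}

/** Hypothetical server-side reporter: reads the count and sends it under one lock. */
final class ReceiverLoadReporter {
  private int connectionCount;

  synchronized void connectionOpened() { connectionCount++; }

  synchronized void connectionClosed() { connectionCount--; }

  synchronized void report(ReceiverLoadTable locator, String receiverId, int maxConnections) {
    locator.onLoadMessage(receiverId, ReceiverLoadTable.loadFor(connectionCount, maxConnections));
  }
}
```

In this sketch, a locator that never answered a connection request still ends
up with the same load values as the one that did, because both receive the
same reports from the servers.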

> Gateway-receiver connection load maintained only on one locator
> ---------------------------------------------------------------
>
>                 Key: GEODE-10056
>                 URL: https://issues.apache.org/jira/browse/GEODE-10056
>             Project: Geode
>          Issue Type: Bug
>            Reporter: Jakov Varenina
>            Assignee: Jakov Varenina
>            Priority: Major
>              Labels: pull-request-available
>
> The first problem is that servers send an incorrect gateway-receiver connection
> load to locators in CacheServerLoadMessage. The second problem is that the
> locator doesn't refresh the gateway-receiver load per server in its local map
> with the load received in CacheServerLoadMessage. This seems to be a bug:
> there is already a mechanism to track and store the gateway-receiver connection
> load per server in the locator, but due to a fault that load is never refreshed
> when a CacheServerLoadMessage is received. Currently, the receiver load is only
> refreshed/increased on the locator that handles the
> ClientConnectionRequest{group=__recv_group...} and ClientConnectionResponse
> messages for a remote server that is trying to establish a gateway-sender
> connection. All other locators in the cluster never refresh the
> gateway-receiver connection load in this case. When the locator that was
> serving remote gateway-senders goes down, another locator takes over that
> job. The problem is that the new locator does not have the correct load (it was
> never refreshed), which in most situations results in new
> gateway-sender connections being established in an unbalanced way.
> Way to reproduce the issue:
> Start 2 clusters. Let's call site1 the sending site and site2 the receiving
> site. The receiving site should have at least 2 locators. Both sites have
> 2 servers. No regions are needed.
> Cluster-1 gfsh>list members
> Member Count : 3
> Name      | Id
> --------- | -------------------------------------------------------------
> locator10 | 10.0.2.15(locator10:7332:locator)<ec><v0>:41000 [Coordinator]
> server11  | 10.0.2.15(server11:8358)<v1>:41003
> server12  | 10.0.2.15(server12:8717)<v2>:41005
>  
> Cluster-2 gfsh>list members
> Member Count : 4
> Name      | Id
> --------- | -------------------------------------------------------------
> locator10 | 10.0.2.15(locator10:7562:locator)<ec><v0>:41001 [Coordinator]
> locator11 | 10.0.2.15(locator11:8103:locator)<ec><v1>:41002
> server11  | 10.0.2.15(server11:8547)<v2>:41004
> server12  | 10.0.2.15(server12:8908)<v3>:41006
>  
> Create GW receiver in Site2 on both servers.
>  
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section
> Member                             | Port | Sender Count | Senders Connected
> ---------------------------------- | ---- | ------------ | -----------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 0            |
> 10.0.2.15(server12:8908)<v3>:41006 | 5457 | 0            |
> Create GW sender in Site1 on both servers. Use 10 dispatcher threads for
> easier observation.
> Cluster-1 gfsh>list gateways
> GatewaySender Section
> GatewaySender Id | Member                             | Remote Cluster Id | Type     | Status                | Queued Events | Receiver Location
> ---------------- | ---------------------------------- | ----------------- | -------- | --------------------- | ------------- | -----------------
> senderTo2        | 10.0.2.15(server11:8358)<v1>:41003 | 2                 | Parallel | Running and Connected | 0             | 10.0.2.15:5457
> senderTo2        | 10.0.2.15(server12:8717)<v2>:41005 | 2                 | Parallel | Running and Connected | 0             | 10.0.2.15:5457
>  
> Observe balance in GW receiver connections in Site2. It will be perfect.
>  
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section
> Member                             | Port | Sender Count | Senders Connected
> ---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 12           | 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5457 | 12           | 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:..
>  
> 12 connections each - 10 payload + 2 ping connections.
> Now stop GW receiver in one server of site2. In Site1 do a stop/start 
> gateway-sender command - all connections will go to the only receiver in 
> site2 (as expected). Check it:
>  
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section
> Member                             | Port | Sender Count | Senders Connected
> ---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 22           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5457 | 0            |
>  
> Now 22 in just one receiver - 20 payload + 1 ping from each sender.
> Stop GW sender in one server in Site1. Connection drops in GW receiver to 
> half the value (also expected).
>  
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section
> Member                             | Port | Sender Count | Senders Connected
> ---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 11           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5457 | 0            |
>  
> Now 11 as one sender from Site1 is stopped.
> Start the GW receiver in the server of site2 that was stopped before. It will
> not receive new connections just yet.
> Start the GW sender in the server of Site1 that was stopped before. All of its
> connections will land in the receiver that was just restarted, so the balance
> is restored.
>  
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section
> Member                             | Port | Sender Count | Senders Connected
> ---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 11           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5182 | 11           | 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:..
>  
> 11 connections in each because we have perfect mapping server11 to server11 
> and server12 to server12 (i.e. there is just 1 ping connection in each 
> receiver). As expected - we see how balance was achieved. Stop GW sender in 
> same server in Site1 again. Again, no connections in receiver of Site2 we 
> just started (expected).
>  
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section
> Member                             | Port | Sender Count | Senders Connected
> ---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 11           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5182 | 0            |
>  
> Now stop one locator in Site2 - the one that was serving GW senders - it was 
> locator10 in my case. Start GW sender in that server of Site1 again. Check 
> the balance in Site2 GW receiver:
> Cluster-2 gfsh>list gateways
> GatewayReceiver Section
> Member                             | Port | Sender Count | Senders Connected
> ---------------------------------- | ---- | ------------ | ---------------------------------------------------------------------------------------------------------------------------------
> 10.0.2.15(server11:8547)<v2>:41004 | 5175 | 17           | 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:8358)<v1>:41003, 10.0.2.15(server11:..
> 10.0.2.15(server12:8908)<v3>:41006 | 5182 | 6            | 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:8717)<v2>:41005, 10.0.2.15(server12:..
> As you can see in the printout above, connections aren't balanced correctly
> when the connection request is sent to the new locator.
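
The effect of a never-refreshed load view can be illustrated with a toy model.
This is only a sketch under stated assumptions: the class StaleLoadDemo and its
assign method are hypothetical, load is modelled as a plain connection count,
the locator is assumed to pick the receiver it believes is least loaded and to
bump its own view after each pick, and ping connections are not modelled.

```java
import java.util.Arrays;

/** Toy model of least-loaded selection; names and numbers are illustrative only. */
final class StaleLoadDemo {
  /**
   * Assigns newConnections one at a time to the receiver the locator believes
   * is least loaded, incrementing the locator's own view after each pick.
   * Returns how many new connections each receiver was given.
   */
  static int[] assign(int[] locatorView, int newConnections) {
    int[] view = locatorView.clone();
    int[] assigned = new int[view.length];
    for (int c = 0; c < newConnections; c++) {
      int best = 0;
      for (int i = 1; i < view.length; i++) {
        if (view[i] < view[best]) {
          best = i;
        }
      }
      view[best]++;
      assigned[best]++;
    }
    return assigned;
  }

  public static void main(String[] args) {
    // Index 0: receiver on server11 (11 existing connections); index 1: receiver on server12 (0).
    System.out.println(Arrays.toString(assign(new int[] {11, 0}, 10))); // accurate view: [0, 10]
    System.out.println(Arrays.toString(assign(new int[] {0, 0}, 10)));  // never-refreshed view: [5, 5]
  }
}
```

With an accurate view, all 10 payload connections go to the idle receiver. With
a view that still holds equal initial values, they are split 5/5 even though
one receiver already carries 11 connections. If each receiver also holds one
ping connection from the restarted sender (an inference from the counts in the
report, not something it states for this step), a 5/5 split gives
11 + 5 + 1 = 17 and 0 + 5 + 1 = 6, in line with the last printout.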



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
