jvarenina commented on pull request #7378:
URL: https://github.com/apache/geode/pull/7378#issuecomment-1072272422
Hi @boglesby,
I also assumed that the same race condition is possible for the client
connections, but I haven't tried to reproduce it. Thanks for pointing this out
and lots of other valuable information. Also, thank you for the extensive
testing you have done.
If we decide to go with this solution, I agree that we should make the
load-poll-interval parameter configurable for gateway receivers. Lowering it
would slightly mitigate the effects of the race condition.
On the server, the load-balance gateways command works this way:
- pauses the gateway-sender
- destroys all connections and then relies on the mechanism used during
connection creation (ClientConnectionRequest/Response) to achieve better load
balancing
- resumes the gateway-sender
This command will again result in a burst of connection requests that
could hit the race condition.
Maybe instead of sending load information periodically from the servers, the
locator could scrape it (perhaps using CacheServerMXBean) from the servers and
apply it simultaneously for all receivers in the locator. The locator could
fetch the load whenever it receives a connection request while its current
connection load is stale (e.g., older than 200 ms), as we don't expect many
connections from gateway-senders. This way, the locator would at least have an up-to-date
connection load taken at a similar time on all servers. This solution should
even catch the change in connection load when the load-balance command destroys
all connections.
Maybe an algorithm could work this way:
- Connection request received, check if the connection load is stale (older
than new parameter load-update-frequency=200ms)
  - if yes, then try to get connection load from all servers asynchronously
  - if load is received from all servers, then apply it in the locator
  - if any get fails, then check profiles again and immediately retry for
  all servers
- Immediately use the current load
- If no connection request is received, then just get the load
periodically, e.g., every 5 seconds (load-poll-interval)
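To make the steps above concrete, here is a rough Java sketch of the staleness check and the all-or-nothing refresh. It is only an illustration under assumptions: OnDemandLoadCache, the scrape function, and the injected clock are hypothetical names and not existing Geode classes or APIs, the receiver list is fixed rather than re-read from profiles, and the retry is limited to one extra attempt to keep the sketch bounded.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical sketch only; none of these names exist in Apache Geode. */
class OnDemandLoadCache {
  private final List<String> receivers;
  // Stand-in for asynchronously scraping one receiver's connection load.
  private final Function<String, CompletableFuture<Float>> scrape;
  private final LongSupplier clockMillis;
  private final long loadUpdateFrequencyMillis; // proposed load-update-frequency
  private Map<String, Float> loads = new HashMap<>();
  private long lastRefreshMillis;
  private boolean everRefreshed;

  OnDemandLoadCache(List<String> receivers,
      Function<String, CompletableFuture<Float>> scrape,
      LongSupplier clockMillis, long loadUpdateFrequencyMillis) {
    this.receivers = receivers;
    this.scrape = scrape;
    this.clockMillis = clockMillis;
    this.loadUpdateFrequencyMillis = loadUpdateFrequencyMillis;
  }

  /** Called per connection request: refresh if stale, then use the current load. */
  synchronized Map<String, Float> loadsForConnectionRequest() {
    boolean stale = !everRefreshed
        || clockMillis.getAsLong() - lastRefreshMillis > loadUpdateFrequencyMillis;
    if (stale) {
      refreshAll();
    }
    return new HashMap<>(loads);
  }

  /** Called by a timer every load-poll-interval when no requests arrive. */
  synchronized void periodicPoll() {
    refreshAll();
  }

  private void refreshAll() {
    // Assumption: one immediate retry; the real retry policy is undecided.
    for (int attempt = 0; attempt < 2; attempt++) {
      try {
        // Start every scrape first so they can run concurrently.
        Map<String, CompletableFuture<Float>> pending = new HashMap<>();
        for (String receiver : receivers) {
          pending.put(receiver, scrape.apply(receiver));
        }
        Map<String, Float> fresh = new HashMap<>();
        for (Map.Entry<String, CompletableFuture<Float>> e : pending.entrySet()) {
          fresh.put(e.getKey(), e.getValue().join());
        }
        // Apply only a complete snapshot, so all loads are from a similar time.
        loads = fresh;
        lastRefreshMillis = clockMillis.getAsLong();
        everRefreshed = true;
        return;
      } catch (RuntimeException ex) {
        // A scrape failed: discard partial results and retry all receivers.
      }
    }
    // Both attempts failed: keep the previous (possibly stale) loads.
  }
}
```

The snapshot is replaced only when every receiver has answered, which is what keeps the loads comparable with each other; whether a partial update would be acceptable is an open design question.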
I'm not sure whether this makes sense, as I don't know how fast the locator can
scrape the load. I can create a prototype if you think this could work.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.