[ https://issues.apache.org/jira/browse/SOLR-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824591#comment-17824591 ]
Chris M. Hostetter commented on SOLR-17106: ------------------------------------------- Hi Aparna, This looks pretty good ... but it still seems like a few things have slipped through the cracks? * In the {{Builder}} classes, {{aliveCheckQuery}} needs to to default to the existing behavior which is currently in {{LBSolrClient}} 's {{private static final SolrQuery solrQuery}} ** I would rename/refactor that to something like {{protected static final SolrQuery DEFAULT_ALIVE_CHECK_QUERY}} and re-use it when initializing {{aliveCheckQuery}} in both {{Builder}} classes ** IMO we should change the default to {{null}} on the {{main}} branch, but first we need a patch that is cleanly backcompatible to commit & backport to 9x, then we can discuss changing the defaults on it's own merits * In the actual {{LBSolrClient}} subclasses, the {{aliveCheckSkipIters}} and {{aliveCheckQuery}} variables need to be final * In {{checkAZombieServer}} you are inspection & decrementing {{zombieServer.skipAliveCheckIters}} before doing the ping check – but in the event that the ping fails, you are never re-setting the value of {{zombieServer.skipAliveCheckIters}} ** So instead of only pinging ever Nth iter, it will ping on every iter starting with the Nth iter ** {{handleServerDown}} should reset {{zombieServer.skipAliveCheckIters = this.aliveCheckSkipIters;}} * The new private {{pingServer}} and {{isServerAlive}} methods you added feel like they are too intertwined to make sense as individual methods? ** Both have to explicitly check if {{aliveCheckQuery}} is null ** The return value of {{pingServer}} is only used by {{isServerAlive}} and no where else. ** I think it would make more sense to just merge these two methods into... {code:java} private boolean isServerAlive(ServerWrapper zombieServer) throws SolrServerException, IOException { if (null == aliveCheckQuery) { return true; } QueryRequest queryRequest = new QueryRequest(aliveCheckQuery); queryRequest.setBasePath(zombieServer.baseUrl); return queryRequest.process(getClient(zombieServer.getBaseUrl())).getStatus() == 0; } {code} ...those are my functionality concerns, i also have some nit-picky concerns... * Severally places in your patch modify lines in ways that only change formatting ** If you run {{gradle tidy}} before committing to your branch these should be automatically cleaned up * It looks like you also added an unused import of {{io.swagger.v3.oas.annotations.servers.Server}} ? ** {{gradle tidy}} (or maybe it's {{gradle check}} ?) should also catch this. * I it would be good to have some {{@see}} tags or {{@link}} tags in the javadocs of {{setAliveCheckSkipIters}} , {{{}setAliveCheckInterval{}}}, and {{setAliveCheckQuery}} explaining how they all relate to each other. * {{testReliability()}} is already not a very good test – it's certainly not worth copy/pasting exactly as is just to change the client slightly ** It would be cleaner to just refactor it's body into a helper method (ie: {{protected doTestReliability(LBSolrClient)}} ) that you call from both {{testReliabilityWithLivenessChecks()}} and your new {{testReliabilityWithMinimumZombieTimeAndDisabledQueries()}} ** Better still: we can make {{{}checkAZombieServer{}}}, {{isServerAlive}} do some {{DEBUG}} level logging about the particulars of what it's doing (when it skips a server because of the {{{}skipAliveCheckIters{}}}, when it assumes success because {{aliveCheckQuery}} is null, when it gets a success or failure result from a server, etc...) and then make tose tests use the {{@LogLevel}} annotation along with the {{LogListener}} helper class (restricted to substrings matching the name of the server we stop/restart) to assert it got the expected log messages (we just can't be too picky about the number of times those messages are logged) > LBSolrClient: Make it configurable to remove zombie ping checks > --------------------------------------------------------------- > > Key: SOLR-17106 > URL: https://issues.apache.org/jira/browse/SOLR-17106 > Project: Solr > Issue Type: Improvement > Reporter: Aparna Suresh > Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > Following discussion from a dev list discussion here: > [https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh] > The issue involves scalability challenges in SolrJ's *LBSolrClient* when a > pod with numerous cores experiences connectivity problems. The "zombie" > tracking mechanism, operating on a core basis, becomes a bottleneck during > distributed search on a massive multi shard collection. Threads attempting to > reach unhealthy cores contribute to a high computational load, causing > performance issues. > As suggested by Chris Hostetter: LBSolrClient could be configured to disable > zombie "ping" checks, but retain zombie tracking. Once a replica/endpoint is > identified as a zombie, it could be held in zombie jail for X seconds, before > being released - hoping that by this timeframe ZK would be updated to mark > this endpoint DOWN or the pod is back up and CloudSolrClient would avoid > querying it. In any event, only 1 failed query would be needed to send the > server back to zombie jail. > > There are benefits in doing this change: > * Eliminate the zombie ping requests, which would otherwise overload pod(s) > coming up after a restart > * Avoid memory leaks, in case a node/replica goes away permanently, but it > stays as zombie forever, with a background thread in LBSolrClient constantly > pinging it -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org