[ 
https://issues.apache.org/jira/browse/SOLR-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17824591#comment-17824591
 ] 

Chris M. Hostetter commented on SOLR-17106:
-------------------------------------------

Hi Aparna,

This looks pretty good ... but it still seems like a few things have slipped 
through the cracks?
 * In the {{Builder}} classes, {{aliveCheckQuery}} needs to to default to the 
existing behavior which is currently in {{LBSolrClient}} 's {{private static 
final SolrQuery solrQuery}}
 ** I would rename/refactor that to something like {{protected static final 
SolrQuery DEFAULT_ALIVE_CHECK_QUERY}} and re-use it when initializing 
{{aliveCheckQuery}} in both {{Builder}} classes
 ** IMO we should change the default to {{null}} on the {{main}} branch, but 
first we need a patch that is cleanly backcompatible to commit & backport to 
9x, then we can discuss changing the defaults on it's own merits
 * In the actual {{LBSolrClient}} subclasses, the {{aliveCheckSkipIters}} and 
{{aliveCheckQuery}} variables need to be final
 * In {{checkAZombieServer}} you are inspection & decrementing 
{{zombieServer.skipAliveCheckIters}} before doing the ping check – but in the 
event that the ping fails, you are never re-setting the value of 
{{zombieServer.skipAliveCheckIters}}
 ** So instead of only pinging ever Nth iter, it will ping on every iter 
starting with the Nth iter
 ** {{handleServerDown}} should reset {{zombieServer.skipAliveCheckIters = 
this.aliveCheckSkipIters;}}
 * The new private {{pingServer}} and {{isServerAlive}} methods you added feel 
like they are too intertwined to make sense as individual methods?
 ** Both have to explicitly check if {{aliveCheckQuery}} is null
 ** The return value of {{pingServer}} is only used by {{isServerAlive}} and no 
where else.
 ** I think it would make more sense to just merge these two methods into...
{code:java}
  private boolean isServerAlive(ServerWrapper zombieServer) throws 
SolrServerException, IOException {
    if (null == aliveCheckQuery) {
      return true;
    }
    QueryRequest queryRequest = new QueryRequest(aliveCheckQuery);
    queryRequest.setBasePath(zombieServer.baseUrl);
    return 
queryRequest.process(getClient(zombieServer.getBaseUrl())).getStatus() == 0;
  }
{code}

...those are my functionality concerns, i also have some nit-picky concerns...
 * Severally places in your patch modify lines in ways that only change 
formatting
 ** If you run {{gradle tidy}} before committing to your branch these should be 
automatically cleaned up
 * It looks like you also added an unused import of 
{{io.swagger.v3.oas.annotations.servers.Server}} ?
 ** {{gradle tidy}} (or maybe it's {{gradle check}} ?) should also catch this.
 * I it would be good to have some {{@see}} tags or {{@link}} tags in the 
javadocs of {{setAliveCheckSkipIters}} , {{{}setAliveCheckInterval{}}}, and 
{{setAliveCheckQuery}} explaining how they all relate to each other.
 * {{testReliability()}} is already not a very good test – it's certainly not 
worth copy/pasting exactly as is just to change the client slightly
 ** It would be cleaner to just refactor it's body into a helper method (ie: 
{{protected doTestReliability(LBSolrClient)}} ) that you call from both 
{{testReliabilityWithLivenessChecks()}} and your new 
{{testReliabilityWithMinimumZombieTimeAndDisabledQueries()}}
 ** Better still: we can make {{{}checkAZombieServer{}}}, {{isServerAlive}} do 
some {{DEBUG}} level logging about the particulars of what it's doing (when it 
skips a server because of the {{{}skipAliveCheckIters{}}}, when it assumes 
success because {{aliveCheckQuery}} is null, when it gets a success or failure 
result from a server, etc...) and then make tose tests use the {{@LogLevel}} 
annotation along with the {{LogListener}} helper class (restricted to 
substrings matching the name of the server we stop/restart) to assert it got 
the expected log messages (we just can't be too picky about the number of times 
those messages are logged)

> LBSolrClient: Make it configurable to remove zombie ping checks
> ---------------------------------------------------------------
>
>                 Key: SOLR-17106
>                 URL: https://issues.apache.org/jira/browse/SOLR-17106
>             Project: Solr
>          Issue Type: Improvement
>            Reporter: Aparna Suresh
>            Priority: Minor
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Following discussion from a dev list discussion here: 
> [https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh]
> The issue involves scalability challenges in SolrJ's *LBSolrClient* when a 
> pod with numerous cores experiences connectivity problems. The "zombie" 
> tracking mechanism, operating on a core basis, becomes a bottleneck during 
> distributed search on a massive multi shard collection. Threads attempting to 
> reach unhealthy cores contribute to a high computational load, causing 
> performance issues. 
> As suggested by Chris Hostetter: LBSolrClient could be configured to disable 
> zombie "ping" checks, but retain zombie tracking. Once a replica/endpoint is 
> identified as a zombie, it could be held in zombie jail for X seconds, before 
> being released - hoping that by this timeframe ZK would be updated to mark 
> this endpoint DOWN or the pod is back up and CloudSolrClient would avoid 
> querying it. In any event, only 1 failed query would be needed to send the 
> server back to zombie jail.
>  
> There are benefits in doing this change:
>  * Eliminate the zombie ping requests, which would otherwise overload pod(s) 
> coming up after a restart
>  * Avoid memory leaks, in case a node/replica goes away permanently, but it 
> stays as zombie forever, with a background thread in LBSolrClient constantly 
> pinging it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org

Reply via email to