[jira] [Commented] (SOLR-17106) LBSolrClient: Make it configurable to remove zombie ping checks

2024-04-16 Thread David Smiley (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837877#comment-17837877
 ] 

David Smiley commented on SOLR-17106:
-

Just want to mention that we should probably have these settings choose default 
values via EnvUtils (a new thing) so we can conveniently make these settings 
adjustments via system properties or env vars, whichever is convenient to the 
user.

> LBSolrClient: Make it configurable to remove zombie ping checks
> ---
>
> Key: SOLR-17106
> URL: https://issues.apache.org/jira/browse/SOLR-17106
> Project: Solr
>  Issue Type: Improvement
>Reporter: Aparna Suresh
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Following discussion from a dev list discussion here: 
> [https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh]
> The issue involves scalability challenges in SolrJ's *LBSolrClient* when a 
> pod with numerous cores experiences connectivity problems. The "zombie" 
> tracking mechanism, operating on a core basis, becomes a bottleneck during 
> distributed search on a massive multi shard collection. Threads attempting to 
> reach unhealthy cores contribute to a high computational load, causing 
> performance issues. 
> As suggested by Chris Hostetter: LBSolrClient could be configured to disable 
> zombie "ping" checks, but retain zombie tracking. Once a replica/endpoint is 
> identified as a zombie, it could be held in zombie jail for X seconds, before 
> being released - hoping that by this timeframe ZK would be updated to mark 
> this endpoint DOWN or the pod is back up and CloudSolrClient would avoid 
> querying it. In any event, only 1 failed query would be needed to send the 
> server back to zombie jail.
>  
> There are benefits in doing this change:
>  * Eliminate the zombie ping requests, which would otherwise overload pod(s) 
> coming up after a restart
>  * Avoid memory leaks, in case a node/replica goes away permanently, but it 
> stays as zombie forever, with a background thread in LBSolrClient constantly 
> pinging it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org



[jira] [Commented] (SOLR-17106) LBSolrClient: Make it configurable to remove zombie ping checks

2024-03-14 Thread Aparna Suresh (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827178#comment-17827178
 ] 

Aparna Suresh commented on SOLR-17106:
--

Hi [~hossman] ,

I made the changes you requested. I opted not to declare 
{{aliveCheckSkipIters}} and {{aliveCheckQuery}} as final variables in the 
Builder classes of {{LBHttp2SolrClient}} and {{{}LBHttpSolrClient{}}}. These 
variables are designed to be overridden using the setter methods within the 
Builder. The overrides are being tested in TestLBHttp2SolrClient.

 

Please let me know. Thank you!

> LBSolrClient: Make it configurable to remove zombie ping checks
> ---
>
> Key: SOLR-17106
> URL: https://issues.apache.org/jira/browse/SOLR-17106
> Project: Solr
>  Issue Type: Improvement
>Reporter: Aparna Suresh
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Following discussion from a dev list discussion here: 
> [https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh]
> The issue involves scalability challenges in SolrJ's *LBSolrClient* when a 
> pod with numerous cores experiences connectivity problems. The "zombie" 
> tracking mechanism, operating on a core basis, becomes a bottleneck during 
> distributed search on a massive multi shard collection. Threads attempting to 
> reach unhealthy cores contribute to a high computational load, causing 
> performance issues. 
> As suggested by Chris Hostetter: LBSolrClient could be configured to disable 
> zombie "ping" checks, but retain zombie tracking. Once a replica/endpoint is 
> identified as a zombie, it could be held in zombie jail for X seconds, before 
> being released - hoping that by this timeframe ZK would be updated to mark 
> this endpoint DOWN or the pod is back up and CloudSolrClient would avoid 
> querying it. In any event, only 1 failed query would be needed to send the 
> server back to zombie jail.
>  
> There are benefits in doing this change:
>  * Eliminate the zombie ping requests, which would otherwise overload pod(s) 
> coming up after a restart
>  * Avoid memory leaks, in case a node/replica goes away permanently, but it 
> stays as zombie forever, with a background thread in LBSolrClient constantly 
> pinging it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org



[jira] [Commented] (SOLR-17106) LBSolrClient: Make it configurable to remove zombie ping checks

2024-03-07 Thread Chris M. Hostetter (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824591#comment-17824591
 ] 

Chris M. Hostetter commented on SOLR-17106:
---

Hi Aparna,

This looks pretty good ... but it still seems like a few things have slipped 
through the cracks?
 * In the {{Builder}} classes, {{aliveCheckQuery}} needs to to default to the 
existing behavior which is currently in {{LBSolrClient}} 's {{private static 
final SolrQuery solrQuery}}
 ** I would rename/refactor that to something like {{protected static final 
SolrQuery DEFAULT_ALIVE_CHECK_QUERY}} and re-use it when initializing 
{{aliveCheckQuery}} in both {{Builder}} classes
 ** IMO we should change the default to {{null}} on the {{main}} branch, but 
first we need a patch that is cleanly backcompatible to commit & backport to 
9x, then we can discuss changing the defaults on it's own merits
 * In the actual {{LBSolrClient}} subclasses, the {{aliveCheckSkipIters}} and 
{{aliveCheckQuery}} variables need to be final
 * In {{checkAZombieServer}} you are inspection & decrementing 
{{zombieServer.skipAliveCheckIters}} before doing the ping check – but in the 
event that the ping fails, you are never re-setting the value of 
{{zombieServer.skipAliveCheckIters}}
 ** So instead of only pinging ever Nth iter, it will ping on every iter 
starting with the Nth iter
 ** {{handleServerDown}} should reset {{zombieServer.skipAliveCheckIters = 
this.aliveCheckSkipIters;}}
 * The new private {{pingServer}} and {{isServerAlive}} methods you added feel 
like they are too intertwined to make sense as individual methods?
 ** Both have to explicitly check if {{aliveCheckQuery}} is null
 ** The return value of {{pingServer}} is only used by {{isServerAlive}} and no 
where else.
 ** I think it would make more sense to just merge these two methods into...
{code:java}
  private boolean isServerAlive(ServerWrapper zombieServer) throws 
SolrServerException, IOException {
if (null == aliveCheckQuery) {
  return true;
}
QueryRequest queryRequest = new QueryRequest(aliveCheckQuery);
queryRequest.setBasePath(zombieServer.baseUrl);
return 
queryRequest.process(getClient(zombieServer.getBaseUrl())).getStatus() == 0;
  }
{code}

...those are my functionality concerns, i also have some nit-picky concerns...
 * Severally places in your patch modify lines in ways that only change 
formatting
 ** If you run {{gradle tidy}} before committing to your branch these should be 
automatically cleaned up
 * It looks like you also added an unused import of 
{{io.swagger.v3.oas.annotations.servers.Server}} ?
 ** {{gradle tidy}} (or maybe it's {{gradle check}} ?) should also catch this.
 * I it would be good to have some {{@see}} tags or {{@link}} tags in the 
javadocs of {{setAliveCheckSkipIters}} , {{{}setAliveCheckInterval{}}}, and 
{{setAliveCheckQuery}} explaining how they all relate to each other.
 * {{testReliability()}} is already not a very good test – it's certainly not 
worth copy/pasting exactly as is just to change the client slightly
 ** It would be cleaner to just refactor it's body into a helper method (ie: 
{{protected doTestReliability(LBSolrClient)}} ) that you call from both 
{{testReliabilityWithLivenessChecks()}} and your new 
{{testReliabilityWithMinimumZombieTimeAndDisabledQueries()}}
 ** Better still: we can make {{{}checkAZombieServer{}}}, {{isServerAlive}} do 
some {{DEBUG}} level logging about the particulars of what it's doing (when it 
skips a server because of the {{{}skipAliveCheckIters{}}}, when it assumes 
success because {{aliveCheckQuery}} is null, when it gets a success or failure 
result from a server, etc...) and then make tose tests use the {{@LogLevel}} 
annotation along with the {{LogListener}} helper class (restricted to 
substrings matching the name of the server we stop/restart) to assert it got 
the expected log messages (we just can't be too picky about the number of times 
those messages are logged)

> LBSolrClient: Make it configurable to remove zombie ping checks
> ---
>
> Key: SOLR-17106
> URL: https://issues.apache.org/jira/browse/SOLR-17106
> Project: Solr
>  Issue Type: Improvement
>Reporter: Aparna Suresh
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Following discussion from a dev list discussion here: 
> [https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh]
> The issue involves scalability challenges in SolrJ's *LBSolrClient* when a 
> pod with numerous cores experiences connectivity problems. The "zombie" 
> tracking mechanism, operating on a core basis, becomes a bottleneck during 
> distributed search on a massive multi shard collection. Threads attempting to 
> reach unhealthy cores contribute to a high computational load, causing 
> 

[jira] [Commented] (SOLR-17106) LBSolrClient: Make it configurable to remove zombie ping checks

2024-03-05 Thread Aparna Suresh (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823566#comment-17823566
 ] 

Aparna Suresh commented on SOLR-17106:
--

Hi [~hossman] , can you please provide your feedback on the requested changes?

 

Thanks

> LBSolrClient: Make it configurable to remove zombie ping checks
> ---
>
> Key: SOLR-17106
> URL: https://issues.apache.org/jira/browse/SOLR-17106
> Project: Solr
>  Issue Type: Improvement
>Reporter: Aparna Suresh
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Following discussion from a dev list discussion here: 
> [https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh]
> The issue involves scalability challenges in SolrJ's *LBSolrClient* when a 
> pod with numerous cores experiences connectivity problems. The "zombie" 
> tracking mechanism, operating on a core basis, becomes a bottleneck during 
> distributed search on a massive multi shard collection. Threads attempting to 
> reach unhealthy cores contribute to a high computational load, causing 
> performance issues. 
> As suggested by Chris Hostetter: LBSolrClient could be configured to disable 
> zombie "ping" checks, but retain zombie tracking. Once a replica/endpoint is 
> identified as a zombie, it could be held in zombie jail for X seconds, before 
> being released - hoping that by this timeframe ZK would be updated to mark 
> this endpoint DOWN or the pod is back up and CloudSolrClient would avoid 
> querying it. In any event, only 1 failed query would be needed to send the 
> server back to zombie jail.
>  
> There are benefits in doing this change:
>  * Eliminate the zombie ping requests, which would otherwise overload pod(s) 
> coming up after a restart
>  * Avoid memory leaks, in case a node/replica goes away permanently, but it 
> stays as zombie forever, with a background thread in LBSolrClient constantly 
> pinging it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org



[jira] [Commented] (SOLR-17106) LBSolrClient: Make it configurable to remove zombie ping checks

2024-02-13 Thread Aparna Suresh (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817128#comment-17817128
 ] 

Aparna Suresh commented on SOLR-17106:
--

Thanks for the feedback! Sorry I did not have a chance to respond for a few 
weeks - I was out sick with Covid initially and then tied up investigating 
issues in Production. Appreciate the detailed evaluation. 

 

I completely missed the point about backwards compatibility!
{quote}I'm guessing what you ment to do is have 
{{reduceRemainingZombieTime(...)}} subtract 
{{zombieStateMonitoringIntervalMillis}} from {{remainingTime}} ? ... but this 
approach still seems kind of confusing & misleading, because tracking & 
recording "remaining milliseconds" like this implies more granularity then here 
really is.

{{remainingTime=10 (ms)}} is meaningless if 
{{zombieStateMonitoringIntervalMillis=60_000}} – you're going to have to wait 
the full 60 seconds.
{quote}
I specified an override to zombieStateMonitoringIntervalMillis = 5s in my first 
commit on LBHttp2SolrClient, with remainingTime set to 10s. So the thread 
running periodically doesnt evict a zombie entry right away, I added the 
following if condition - but I agree that would keep some entries as zombies up 
to the next run. Agree 100% about the point that the time based approach doesnt 
provide a lot of flexibility compared to the numIters approach.

 
{code:java}
private void reduceRemainingZombieTime(ServerWrapper wrapper) {
    if(wrapper == null){
      return;
    }
    if (wrapper.remainingTime == 0) {
      //evict from zombieServers, add to aliveServers
      zombieServers.remove(wrapper.getBaseUrl());
      wrapper.failedPings = 0;
      if (wrapper.standard) {
        addToAlive(wrapper);
      }
    } else {
      wrapper.remainingTime = Math.max(0, (wrapper.remainingTime - 
minZombieReleaseTimeMillis));
    }
  }
 
{code}
 

 

Have updated the PR based on your comments here: 
[https://github.com/apache/solr/pull/2160/files]

 

 

 

> LBSolrClient: Make it configurable to remove zombie ping checks
> ---
>
> Key: SOLR-17106
> URL: https://issues.apache.org/jira/browse/SOLR-17106
> Project: Solr
>  Issue Type: Improvement
>Reporter: Aparna Suresh
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Following discussion from a dev list discussion here: 
> [https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh]
> The issue involves scalability challenges in SolrJ's *LBSolrClient* when a 
> pod with numerous cores experiences connectivity problems. The "zombie" 
> tracking mechanism, operating on a core basis, becomes a bottleneck during 
> distributed search on a massive multi shard collection. Threads attempting to 
> reach unhealthy cores contribute to a high computational load, causing 
> performance issues. 
> As suggested by Chris Hostetter: LBSolrClient could be configured to disable 
> zombie "ping" checks, but retain zombie tracking. Once a replica/endpoint is 
> identified as a zombie, it could be held in zombie jail for X seconds, before 
> being released - hoping that by this timeframe ZK would be updated to mark 
> this endpoint DOWN or the pod is back up and CloudSolrClient would avoid 
> querying it. In any event, only 1 failed query would be needed to send the 
> server back to zombie jail.
>  
> There are benefits in doing this change:
>  * Eliminate the zombie ping requests, which would otherwise overload pod(s) 
> coming up after a restart
>  * Avoid memory leaks, in case a node/replica goes away permanently, but it 
> stays as zombie forever, with a background thread in LBSolrClient constantly 
> pinging it



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org
For additional commands, e-mail: issues-h...@solr.apache.org



[jira] [Commented] (SOLR-17106) LBSolrClient: Make it configurable to remove zombie ping checks

2024-01-16 Thread Chris M. Hostetter (Jira)


[ 
https://issues.apache.org/jira/browse/SOLR-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807491#comment-17807491
 ] 

Chris M. Hostetter commented on SOLR-17106:
---

Hi Aprna, thanks for creating this jira!

 

I read through the changes on your branch, and I'm a little confused by a few 
things...

 

1. What was your motivation for replacing {{setAliveCheckInterval(int 
aliveCheckInterval, TimeUnit unit)}} with 
{{setZombieStateMonitoringIntervalMillis(long 
zombieStateMonitoringIntervalMillis)}} ?

It appears that they ultimately serve the same purpose of dictating the 
{{scheduleAtFixedRate()}} executor, but changing the name just seems like an 
unnecessary back compat break?

(We also, AFAIK, don't currently use the term "zombie" in any of the {{public}} 
APIs of SolrJ – only in the internal tracking of what servers are "not alive" 
 we probably want to keep the APIs in terms of being "alive" or "not alive")

Even if there is a strong reason I'm overlooking why it makes sense to change 
the method name: There's also been a push over the last few years to 
standardize on always taking in {{TimeUnit}} when dealing with method args that 
specify a time duration, rather then forcing the client to specify milliseconds 
(see SOLR-16595)

 

2. I don't think {{setMinZombieReleaseTimeMillis(long jailTimeInMillis)}} is 
being used the way you intended / expected?

>From the javadocs you added...
{quote}This param should be set if enableZombieChecks=false. This param 
configures the time a server should be regarded, at a minimum, as a zombie 
before being released
{quote}
...but from what I've seen of where & how the values are actually used:
 * In {{addZombie(...)}} it explicitly sets {{ServerWraper.remainingTime = 
zombieCheckIntervalMillis}}
 * In {{{}reduceRemainingZombieTime(...){}}}, if {{ServerWraper.remainingTime}} 
is non zero, explicitly set it to {{wrapper.remainingTime = Math.max(0, 
(wrapper.remainingTime - minZombieReleaseTimeMillis));}}

...that's it.

So no matter what the value {{minZombieReleaseTimeMillis}} is set to, the only 
thing that determines how long a URL is kept in the zombie list is 2x the 
{{zombieStateMonitoringIntervalMillis}} value used to configure the scheduled 
executor. The first time the executor calls {{reduceRemainingZombieTime(...)}} 
on a {{ServerWrapper}} it will reduce the {{remainingTime}} to zero. The Second 
time the executor calls {{reduceRemainingZombieTime(...)}} it will remove the 
URL from zombies and re-add it to alive.

I'm guessing what you ment to do is have {{reduceRemainingZombieTime(...)}} 
subtract {{zombieStateMonitoringIntervalMillis}} from {{remainingTime}} ? ... 
but this approach still seems kind of confusing & misleading, because tracking 
& recording "remaining milliseconds" like this implies more granularity then 
here really is.

{{remainingTime=10 (ms)}} is meaningless if 
{{zombieStateMonitoringIntervalMillis=60_000}} – you're going to have to wait 
the full 60 seconds.

I would suggest a "num iters" type configuration, rather then a "time" based 
configuration, so that we are very transparent that our "liveness checker" runs 
every {{aliveCheckInterval}} – and you can configure the "liveness checker" to 
only inspect a URL ever "N runs".

 

 

Thinking about the API of this change holistically, my inclination would be to 
support disabling the zombie check queries as a special case of making the 
specifics of those queries configurable.

 

Here's the API changes I would suggest:
 * Keep {{setAliveCheckInterval(...)}} the way it is
 * Add a new {{setAliveCheckSkipIters(int)}} that defaults to {{0}}
 ** Document that positive values can be used to ensure that the liveness 
checks for a given URL will only be run every "Nth" iteration of the 
{{setAliveCheckInterval}}
 * Add a new {{setAliveCheckQuery(SolrQuery)}} that defaults to the current 
static {{SolrQuery}}
 ** Document that if this is set to {{null}} the client will implicitly assume 
success anytime it would normally perform a a liveness check on a server.

The underlying functional changes would be:
 * Add an {{int skipAliveCheckIters}} to {{ServerWrapper}} that is initialized 
to whatever {{setAliveCheckSkipIters(int)}} is set to
 * In {{checkAZombieServer(...)}} ...
 ** Befor any existing logic: {{if (0 < zombieServer.skipAliveCheckIters)}} ... 
decrement and return immediately
 ** Short circuit the the current {{QueryRequest}} logic such that if 
{{setAliveCheckQuery(null)}} is in use, we don't execute any request, we just 
enter the "success" block

Unless I'm overlooking something, this would all be back-compatable, so we 
could merge it to the 9x branch, and then on the main branch we could change 
some of the defaults (like using {{setAliveCheckRequest(null)}} by default, 
increasing {{{}setAliveCheckSkipIters(int){}}}, decreasing 
{{{}setAliveCheckInterval(...){}}})

What do you