[jira] [Commented] (SOLR-17106) LBSolrClient: Make it configurable to remove zombie ping checks
[ https://issues.apache.org/jira/browse/SOLR-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17837877#comment-17837877 ] David Smiley commented on SOLR-17106: - Just want to mention that we should probably have these settings choose default values via EnvUtils (a new thing) so we can conveniently make these settings adjustments via system properties or env vars, whichever is convenient to the user. > LBSolrClient: Make it configurable to remove zombie ping checks > --- > > Key: SOLR-17106 > URL: https://issues.apache.org/jira/browse/SOLR-17106 > Project: Solr > Issue Type: Improvement >Reporter: Aparna Suresh >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > Following discussion from a dev list discussion here: > [https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh] > The issue involves scalability challenges in SolrJ's *LBSolrClient* when a > pod with numerous cores experiences connectivity problems. The "zombie" > tracking mechanism, operating on a core basis, becomes a bottleneck during > distributed search on a massive multi shard collection. Threads attempting to > reach unhealthy cores contribute to a high computational load, causing > performance issues. > As suggested by Chris Hostetter: LBSolrClient could be configured to disable > zombie "ping" checks, but retain zombie tracking. Once a replica/endpoint is > identified as a zombie, it could be held in zombie jail for X seconds, before > being released - hoping that by this timeframe ZK would be updated to mark > this endpoint DOWN or the pod is back up and CloudSolrClient would avoid > querying it. In any event, only 1 failed query would be needed to send the > server back to zombie jail. > > There are benefits in doing this change: > * Eliminate the zombie ping requests, which would otherwise overload pod(s) > coming up after a restart > * Avoid memory leaks, in case a node/replica goes away permanently, but it > stays as zombie forever, with a background thread in LBSolrClient constantly > pinging it -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Commented] (SOLR-17106) LBSolrClient: Make it configurable to remove zombie ping checks
[ https://issues.apache.org/jira/browse/SOLR-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17827178#comment-17827178 ] Aparna Suresh commented on SOLR-17106: -- Hi [~hossman] , I made the changes you requested. I opted not to declare {{aliveCheckSkipIters}} and {{aliveCheckQuery}} as final variables in the Builder classes of {{LBHttp2SolrClient}} and {{{}LBHttpSolrClient{}}}. These variables are designed to be overridden using the setter methods within the Builder. The overrides are being tested in TestLBHttp2SolrClient. Please let me know. Thank you! > LBSolrClient: Make it configurable to remove zombie ping checks > --- > > Key: SOLR-17106 > URL: https://issues.apache.org/jira/browse/SOLR-17106 > Project: Solr > Issue Type: Improvement >Reporter: Aparna Suresh >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > Following discussion from a dev list discussion here: > [https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh] > The issue involves scalability challenges in SolrJ's *LBSolrClient* when a > pod with numerous cores experiences connectivity problems. The "zombie" > tracking mechanism, operating on a core basis, becomes a bottleneck during > distributed search on a massive multi shard collection. Threads attempting to > reach unhealthy cores contribute to a high computational load, causing > performance issues. > As suggested by Chris Hostetter: LBSolrClient could be configured to disable > zombie "ping" checks, but retain zombie tracking. Once a replica/endpoint is > identified as a zombie, it could be held in zombie jail for X seconds, before > being released - hoping that by this timeframe ZK would be updated to mark > this endpoint DOWN or the pod is back up and CloudSolrClient would avoid > querying it. In any event, only 1 failed query would be needed to send the > server back to zombie jail. > > There are benefits in doing this change: > * Eliminate the zombie ping requests, which would otherwise overload pod(s) > coming up after a restart > * Avoid memory leaks, in case a node/replica goes away permanently, but it > stays as zombie forever, with a background thread in LBSolrClient constantly > pinging it -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Commented] (SOLR-17106) LBSolrClient: Make it configurable to remove zombie ping checks
[ https://issues.apache.org/jira/browse/SOLR-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17824591#comment-17824591 ] Chris M. Hostetter commented on SOLR-17106: --- Hi Aparna, This looks pretty good ... but it still seems like a few things have slipped through the cracks? * In the {{Builder}} classes, {{aliveCheckQuery}} needs to to default to the existing behavior which is currently in {{LBSolrClient}} 's {{private static final SolrQuery solrQuery}} ** I would rename/refactor that to something like {{protected static final SolrQuery DEFAULT_ALIVE_CHECK_QUERY}} and re-use it when initializing {{aliveCheckQuery}} in both {{Builder}} classes ** IMO we should change the default to {{null}} on the {{main}} branch, but first we need a patch that is cleanly backcompatible to commit & backport to 9x, then we can discuss changing the defaults on it's own merits * In the actual {{LBSolrClient}} subclasses, the {{aliveCheckSkipIters}} and {{aliveCheckQuery}} variables need to be final * In {{checkAZombieServer}} you are inspection & decrementing {{zombieServer.skipAliveCheckIters}} before doing the ping check – but in the event that the ping fails, you are never re-setting the value of {{zombieServer.skipAliveCheckIters}} ** So instead of only pinging ever Nth iter, it will ping on every iter starting with the Nth iter ** {{handleServerDown}} should reset {{zombieServer.skipAliveCheckIters = this.aliveCheckSkipIters;}} * The new private {{pingServer}} and {{isServerAlive}} methods you added feel like they are too intertwined to make sense as individual methods? ** Both have to explicitly check if {{aliveCheckQuery}} is null ** The return value of {{pingServer}} is only used by {{isServerAlive}} and no where else. ** I think it would make more sense to just merge these two methods into... {code:java} private boolean isServerAlive(ServerWrapper zombieServer) throws SolrServerException, IOException { if (null == aliveCheckQuery) { return true; } QueryRequest queryRequest = new QueryRequest(aliveCheckQuery); queryRequest.setBasePath(zombieServer.baseUrl); return queryRequest.process(getClient(zombieServer.getBaseUrl())).getStatus() == 0; } {code} ...those are my functionality concerns, i also have some nit-picky concerns... * Severally places in your patch modify lines in ways that only change formatting ** If you run {{gradle tidy}} before committing to your branch these should be automatically cleaned up * It looks like you also added an unused import of {{io.swagger.v3.oas.annotations.servers.Server}} ? ** {{gradle tidy}} (or maybe it's {{gradle check}} ?) should also catch this. * I it would be good to have some {{@see}} tags or {{@link}} tags in the javadocs of {{setAliveCheckSkipIters}} , {{{}setAliveCheckInterval{}}}, and {{setAliveCheckQuery}} explaining how they all relate to each other. * {{testReliability()}} is already not a very good test – it's certainly not worth copy/pasting exactly as is just to change the client slightly ** It would be cleaner to just refactor it's body into a helper method (ie: {{protected doTestReliability(LBSolrClient)}} ) that you call from both {{testReliabilityWithLivenessChecks()}} and your new {{testReliabilityWithMinimumZombieTimeAndDisabledQueries()}} ** Better still: we can make {{{}checkAZombieServer{}}}, {{isServerAlive}} do some {{DEBUG}} level logging about the particulars of what it's doing (when it skips a server because of the {{{}skipAliveCheckIters{}}}, when it assumes success because {{aliveCheckQuery}} is null, when it gets a success or failure result from a server, etc...) and then make tose tests use the {{@LogLevel}} annotation along with the {{LogListener}} helper class (restricted to substrings matching the name of the server we stop/restart) to assert it got the expected log messages (we just can't be too picky about the number of times those messages are logged) > LBSolrClient: Make it configurable to remove zombie ping checks > --- > > Key: SOLR-17106 > URL: https://issues.apache.org/jira/browse/SOLR-17106 > Project: Solr > Issue Type: Improvement >Reporter: Aparna Suresh >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > Following discussion from a dev list discussion here: > [https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh] > The issue involves scalability challenges in SolrJ's *LBSolrClient* when a > pod with numerous cores experiences connectivity problems. The "zombie" > tracking mechanism, operating on a core basis, becomes a bottleneck during > distributed search on a massive multi shard collection. Threads attempting to > reach unhealthy cores contribute to a high computational load, causing >
[jira] [Commented] (SOLR-17106) LBSolrClient: Make it configurable to remove zombie ping checks
[ https://issues.apache.org/jira/browse/SOLR-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17823566#comment-17823566 ] Aparna Suresh commented on SOLR-17106: -- Hi [~hossman] , can you please provide your feedback on the requested changes? Thanks > LBSolrClient: Make it configurable to remove zombie ping checks > --- > > Key: SOLR-17106 > URL: https://issues.apache.org/jira/browse/SOLR-17106 > Project: Solr > Issue Type: Improvement >Reporter: Aparna Suresh >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > Following discussion from a dev list discussion here: > [https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh] > The issue involves scalability challenges in SolrJ's *LBSolrClient* when a > pod with numerous cores experiences connectivity problems. The "zombie" > tracking mechanism, operating on a core basis, becomes a bottleneck during > distributed search on a massive multi shard collection. Threads attempting to > reach unhealthy cores contribute to a high computational load, causing > performance issues. > As suggested by Chris Hostetter: LBSolrClient could be configured to disable > zombie "ping" checks, but retain zombie tracking. Once a replica/endpoint is > identified as a zombie, it could be held in zombie jail for X seconds, before > being released - hoping that by this timeframe ZK would be updated to mark > this endpoint DOWN or the pod is back up and CloudSolrClient would avoid > querying it. In any event, only 1 failed query would be needed to send the > server back to zombie jail. > > There are benefits in doing this change: > * Eliminate the zombie ping requests, which would otherwise overload pod(s) > coming up after a restart > * Avoid memory leaks, in case a node/replica goes away permanently, but it > stays as zombie forever, with a background thread in LBSolrClient constantly > pinging it -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Commented] (SOLR-17106) LBSolrClient: Make it configurable to remove zombie ping checks
[ https://issues.apache.org/jira/browse/SOLR-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17817128#comment-17817128 ] Aparna Suresh commented on SOLR-17106: -- Thanks for the feedback! Sorry I did not have a chance to respond for a few weeks - I was out sick with Covid initially and then tied up investigating issues in Production. Appreciate the detailed evaluation. I completely missed the point about backwards compatibility! {quote}I'm guessing what you ment to do is have {{reduceRemainingZombieTime(...)}} subtract {{zombieStateMonitoringIntervalMillis}} from {{remainingTime}} ? ... but this approach still seems kind of confusing & misleading, because tracking & recording "remaining milliseconds" like this implies more granularity then here really is. {{remainingTime=10 (ms)}} is meaningless if {{zombieStateMonitoringIntervalMillis=60_000}} – you're going to have to wait the full 60 seconds. {quote} I specified an override to zombieStateMonitoringIntervalMillis = 5s in my first commit on LBHttp2SolrClient, with remainingTime set to 10s. So the thread running periodically doesnt evict a zombie entry right away, I added the following if condition - but I agree that would keep some entries as zombies up to the next run. Agree 100% about the point that the time based approach doesnt provide a lot of flexibility compared to the numIters approach. {code:java} private void reduceRemainingZombieTime(ServerWrapper wrapper) { if(wrapper == null){ return; } if (wrapper.remainingTime == 0) { //evict from zombieServers, add to aliveServers zombieServers.remove(wrapper.getBaseUrl()); wrapper.failedPings = 0; if (wrapper.standard) { addToAlive(wrapper); } } else { wrapper.remainingTime = Math.max(0, (wrapper.remainingTime - minZombieReleaseTimeMillis)); } } {code} Have updated the PR based on your comments here: [https://github.com/apache/solr/pull/2160/files] > LBSolrClient: Make it configurable to remove zombie ping checks > --- > > Key: SOLR-17106 > URL: https://issues.apache.org/jira/browse/SOLR-17106 > Project: Solr > Issue Type: Improvement >Reporter: Aparna Suresh >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > Following discussion from a dev list discussion here: > [https://lists.apache.org/thread/f0zfmpg0t48xrtppyfsmfc5ltzsq2qqh] > The issue involves scalability challenges in SolrJ's *LBSolrClient* when a > pod with numerous cores experiences connectivity problems. The "zombie" > tracking mechanism, operating on a core basis, becomes a bottleneck during > distributed search on a massive multi shard collection. Threads attempting to > reach unhealthy cores contribute to a high computational load, causing > performance issues. > As suggested by Chris Hostetter: LBSolrClient could be configured to disable > zombie "ping" checks, but retain zombie tracking. Once a replica/endpoint is > identified as a zombie, it could be held in zombie jail for X seconds, before > being released - hoping that by this timeframe ZK would be updated to mark > this endpoint DOWN or the pod is back up and CloudSolrClient would avoid > querying it. In any event, only 1 failed query would be needed to send the > server back to zombie jail. > > There are benefits in doing this change: > * Eliminate the zombie ping requests, which would otherwise overload pod(s) > coming up after a restart > * Avoid memory leaks, in case a node/replica goes away permanently, but it > stays as zombie forever, with a background thread in LBSolrClient constantly > pinging it -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@solr.apache.org For additional commands, e-mail: issues-h...@solr.apache.org
[jira] [Commented] (SOLR-17106) LBSolrClient: Make it configurable to remove zombie ping checks
[ https://issues.apache.org/jira/browse/SOLR-17106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17807491#comment-17807491 ] Chris M. Hostetter commented on SOLR-17106: --- Hi Aprna, thanks for creating this jira! I read through the changes on your branch, and I'm a little confused by a few things... 1. What was your motivation for replacing {{setAliveCheckInterval(int aliveCheckInterval, TimeUnit unit)}} with {{setZombieStateMonitoringIntervalMillis(long zombieStateMonitoringIntervalMillis)}} ? It appears that they ultimately serve the same purpose of dictating the {{scheduleAtFixedRate()}} executor, but changing the name just seems like an unnecessary back compat break? (We also, AFAIK, don't currently use the term "zombie" in any of the {{public}} APIs of SolrJ – only in the internal tracking of what servers are "not alive" we probably want to keep the APIs in terms of being "alive" or "not alive") Even if there is a strong reason I'm overlooking why it makes sense to change the method name: There's also been a push over the last few years to standardize on always taking in {{TimeUnit}} when dealing with method args that specify a time duration, rather then forcing the client to specify milliseconds (see SOLR-16595) 2. I don't think {{setMinZombieReleaseTimeMillis(long jailTimeInMillis)}} is being used the way you intended / expected? >From the javadocs you added... {quote}This param should be set if enableZombieChecks=false. This param configures the time a server should be regarded, at a minimum, as a zombie before being released {quote} ...but from what I've seen of where & how the values are actually used: * In {{addZombie(...)}} it explicitly sets {{ServerWraper.remainingTime = zombieCheckIntervalMillis}} * In {{{}reduceRemainingZombieTime(...){}}}, if {{ServerWraper.remainingTime}} is non zero, explicitly set it to {{wrapper.remainingTime = Math.max(0, (wrapper.remainingTime - minZombieReleaseTimeMillis));}} ...that's it. So no matter what the value {{minZombieReleaseTimeMillis}} is set to, the only thing that determines how long a URL is kept in the zombie list is 2x the {{zombieStateMonitoringIntervalMillis}} value used to configure the scheduled executor. The first time the executor calls {{reduceRemainingZombieTime(...)}} on a {{ServerWrapper}} it will reduce the {{remainingTime}} to zero. The Second time the executor calls {{reduceRemainingZombieTime(...)}} it will remove the URL from zombies and re-add it to alive. I'm guessing what you ment to do is have {{reduceRemainingZombieTime(...)}} subtract {{zombieStateMonitoringIntervalMillis}} from {{remainingTime}} ? ... but this approach still seems kind of confusing & misleading, because tracking & recording "remaining milliseconds" like this implies more granularity then here really is. {{remainingTime=10 (ms)}} is meaningless if {{zombieStateMonitoringIntervalMillis=60_000}} – you're going to have to wait the full 60 seconds. I would suggest a "num iters" type configuration, rather then a "time" based configuration, so that we are very transparent that our "liveness checker" runs every {{aliveCheckInterval}} – and you can configure the "liveness checker" to only inspect a URL ever "N runs". Thinking about the API of this change holistically, my inclination would be to support disabling the zombie check queries as a special case of making the specifics of those queries configurable. Here's the API changes I would suggest: * Keep {{setAliveCheckInterval(...)}} the way it is * Add a new {{setAliveCheckSkipIters(int)}} that defaults to {{0}} ** Document that positive values can be used to ensure that the liveness checks for a given URL will only be run every "Nth" iteration of the {{setAliveCheckInterval}} * Add a new {{setAliveCheckQuery(SolrQuery)}} that defaults to the current static {{SolrQuery}} ** Document that if this is set to {{null}} the client will implicitly assume success anytime it would normally perform a a liveness check on a server. The underlying functional changes would be: * Add an {{int skipAliveCheckIters}} to {{ServerWrapper}} that is initialized to whatever {{setAliveCheckSkipIters(int)}} is set to * In {{checkAZombieServer(...)}} ... ** Befor any existing logic: {{if (0 < zombieServer.skipAliveCheckIters)}} ... decrement and return immediately ** Short circuit the the current {{QueryRequest}} logic such that if {{setAliveCheckQuery(null)}} is in use, we don't execute any request, we just enter the "success" block Unless I'm overlooking something, this would all be back-compatable, so we could merge it to the 9x branch, and then on the main branch we could change some of the defaults (like using {{setAliveCheckRequest(null)}} by default, increasing {{{}setAliveCheckSkipIters(int){}}}, decreasing {{{}setAliveCheckInterval(...){}}}) What do you