>
> Looks like there's room for improvement.  I too would want the desired
> state to be reflected in ZK first before attempting to make it happen.
> Remove live_nodes first, then iterate the local replicas to be state=DOWN,
> then close down all the things.
>

I agree with this, but just to add to Jan's comments on the shutdown
logic: I've been hitting issues with Solr 9.0 where, when run in a Docker
image, it can take a long time for Solr to start up (over 30 seconds). The
operator will then try to kill Solr via the STOP_PORT, and this hangs
until the process is stopped via a kill. So I think we have an issue in
recent versions with the stop logic and how it coordinates between Jetty
and Solr.
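
For reference, here is a rough sketch (in Java, purely for illustration)
of what "bin/solr stop" boils down to, as I understand it: Jetty's
ShutdownMonitor listens on the STOP_PORT and expects the stop key followed
by a stop command. The port and key values below are just the usual
defaults, and this is only a sketch of the protocol, not the actual
script.

    // Rough sketch of what "bin/solr stop" does, as I understand it.
    // Jetty's ShutdownMonitor listens on STOP_PORT and expects the stop
    // key followed by a command. The port and key here are the usual Solr
    // defaults, used only for illustration.
    import java.io.OutputStream;
    import java.net.Socket;
    import java.nio.charset.StandardCharsets;

    public class JettyStopClient {
      public static void main(String[] args) throws Exception {
        int stopPort = 7983;          // typical default STOP_PORT (SOLR_PORT - 1000)
        String stopKey = "solrrocks"; // typical default STOP_KEY used by bin/solr
        try (Socket socket = new Socket("127.0.0.1", stopPort)) {
          OutputStream out = socket.getOutputStream();
          out.write((stopKey + "\r\nstop\r\n").getBytes(StandardCharsets.UTF_8));
          out.flush();
          // If Solr/Jetty never finishes shutting down after this command
          // is accepted, the caller is left waiting until someone kills
          // the JVM.
        }
      }
    }

When Solr's shutdown hooks hang after that command is accepted, there is
nothing left to do but wait and eventually force-kill the process, which
matches what I'm seeing.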

> The primary goal is to drain traffic right before shutting a node down, but
> it could also be designed as a generic Readiness Probe <
> https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes>
> modeled from Kubernetes?
>

I think that the idea of a "not ready" state is different and not entirely
related to shutdown behavior.
The Solr Operator has logic to evict all replicas on a node before
restarting it, if the data is ephemeral. (Since we don't want the node to
come back up with lost data.)
So it would be great for us to say that the node is "not ready" before
evicting the replicas, so that the eviction process goes as smoothly as
possible.
I think that this "not ready" state could also be used with other commands,
or just set directly by the user.

Basically there are multiple ways that this "not ready" command could be
triggered:

   - An explicit command from the user to set/unset this state on the node
   - An optional param on REPLACENODE and DELETENODE that will set this
   state before doing the replace/delete logic (see the sketch after this
   list).
   - I would imagine that on node startup this state is unset by default.
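
To make the first two bullets concrete, here is a rough SolrJ sketch. The
REPLACENODE action and its sourceNode param are real, but the
"markNotReady" param, the api/node/ready path, and its "ready" param are
hypothetical names made up purely for illustration.

    // Hypothetical sketch of the two triggers above, via SolrJ's generic
    // request support. The /admin/node/ready endpoint, its "ready" param,
    // and the "markNotReady" param on REPLACENODE do not exist in Solr
    // today; they are assumptions made for illustration only.
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrRequest;
    import org.apache.solr.client.solrj.impl.Http2SolrClient;
    import org.apache.solr.client.solrj.request.GenericSolrRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class NotReadyTriggers {
      public static void main(String[] args) throws Exception {
        try (SolrClient client =
            new Http2SolrClient.Builder("http://localhost:8983/solr").build()) {

          // 1) Explicit user command to mark a node "not ready"
          //    (hypothetical endpoint and param).
          ModifiableSolrParams mark = new ModifiableSolrParams();
          mark.set("node", "solr-1:8983_solr");
          mark.set("ready", "false");
          client.request(new GenericSolrRequest(
              SolrRequest.METHOD.POST, "/admin/node/ready", mark));

          // 2) Optional param on REPLACENODE that flips the state before
          //    moving replicas (the "markNotReady" param is hypothetical).
          ModifiableSolrParams replace = new ModifiableSolrParams();
          replace.set("action", "REPLACENODE");
          replace.set("sourceNode", "solr-1:8983_solr");
          replace.set("markNotReady", "true");
          client.request(new GenericSolrRequest(
              SolrRequest.METHOD.GET, "/admin/collections", replace));
        }
      }
    }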

There are reasons why you want to keep these nodes "live", since other Solr
nodes might still be interacting with them, but you want to avoid sending
updates and queries there.
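
On the routing side, it could end up being as simple as a filter like this
(purely illustrative; "notReadyNodes" is an assumed new piece of cluster
state, not anything that exists today):

    // Purely illustrative: keep "not ready" nodes in live_nodes so
    // intra-cluster traffic and recovery still work, but skip them when
    // picking nodes for new client updates/queries. "notReadyNodes" is a
    // hypothetical piece of cluster state.
    import java.util.List;
    import java.util.Set;
    import java.util.stream.Collectors;

    public class NodeRouting {
      public static List<String> routableNodes(Set<String> liveNodes,
                                               Set<String> notReadyNodes) {
        return liveNodes.stream()
            .filter(node -> !notReadyNodes.contains(node))
            .collect(Collectors.toList());
      }
    }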

I agree that I'm not entirely sure that the benefit is there, since this
would be a pretty big change. But I think it's definitely worth a
discussion.

- Houston

On Wed, Mar 29, 2023 at 8:43 PM David Smiley <dsmi...@apache.org> wrote:

> Looks like there's room for improvement.  I too would want the desired
> state to be reflected in ZK first before attempting to make it happen.
> Remove live_nodes first, then iterate the local replicas to be state=DOWN,
> then close down all the things.
>
> ~ David Smiley
> Apache Lucene/Solr Search Developer
> http://www.linkedin.com/in/davidwsmiley
>
>
> On Wed, Mar 29, 2023 at 9:16 AM Jan Høydahl <jan....@cominvent.com> wrote:
>
> > Hi,
> >
> > Trying to prevent traffic being sent to a Solr node that is going to shut
> > down, to avoid interruption of service as seen from various clients.
> > First part of the puzzle is signaling to any (external) load balancer to
> > stop sending requests to the node.
> > The other part is having SolrJ understand that the node is being stopped,
> > and not routing internal requests to cores on the node.
> >
> > Does anyone have a good command of the Shutdown logic in Solr?
> > My understanding is a bit sparse, but here's what I can see in the code:
> >
> > bin/solr stop will send a STOP command to Jetty's STOP_PORT with
> > (not-so-secret) stop key
> > Jetty starts the shutdown process, destroying all servlets and filters,
> > including Solr's dispatchFilter
> > Solr is notified about the shutdown through a callback in
> > CoreContainerProvider.
> > CoreContainerProvider#close() is called which calls CC#shutdown
> > CC shuts down every core on the node and then calls zkController#preClose
> > ZkController#preClose removes ephemeral live_nodes/myNode and then
> > publishes down state in state.json
> > Wait for shutdown of executors and let Jetty exit
> >
> > I could have got it wrong though.
> >
> > I was hoping that a Solr node would first publish itself as "not ready"
> > in ZK before rejecting requests, but it seems this is all reversed,
> > since shutdown is initiated by Jetty?
> > So could we instead register our own shutdown-port in Solr, and let our
> > bin/solr script trigger that one? There we could orchestrate the shutdown
> > as we want:
> >
> > Remove live_nodes znode in ZK
> > Publish itself as not ready on api/node/health handler (or a new
> > api/node/ready?)
> > Sleep for a few seconds (or longer with an optional &shutdownDelay
> > argument to our shutdown endpoint)
> > trigger server.stop() to take down Jetty and kill the servlet
> >
> > I filed https://issues.apache.org/jira/browse/SOLR-16722 to discuss a
> > technical solution.
> > The primary goal is to drain traffic right before shutting a node down,
> > but it could also be designed as a generic Readiness Probe <
> > https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-readiness-probes>
> > modeled from Kubernetes?
> > I'm also aware that any Solr client should be prepared to hit a dead node
> > due to network/power events, and retry. But it won't hurt to be graceful
> > whenever we can.
> >
> > Happy to hear your thoughts. Is this a made-up problem?
> >
> > Jan
>
