thanks, I assume there is some issue on my side, as I actually did not find any of the messages that the Solr script would log during the shutdown. The shutdown also happened much faster than the 5-second delay in the script, so I'm probably doing something wrong. Anyhow, thanks for the further details; that should give me enough to investigate further.

On 22.10.2016 15:22, Erick Erickson wrote:
bq:  Would a clean shutdown result in the node being flagged as down
in the cluster state straight away?

It should, if it's truly clean. HOWEVER..... a "clean shutdown" is
unfortunately not just a "bin/solr stop" because of the timeout Shawn
mentioned, see SOLR-9371. It's a simple edit to make the timeout much
longer, but the real fix should poll until the process exits. The
"smoking gun" would be a correlation between the node not being marked
as down in state.json and a message about "forcefully killing ....."
when you stop the instance with bin/solr.
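For anyone who wants to patch this locally until SOLR-9371 is fixed,
here is a minimal sketch of what "poll instead of sleep" could look
like; this is not the actual bin/solr code, and SOLR_PID and MAX_WAIT
are placeholders:

# Hypothetical sketch, not the actual bin/solr code: after the stop command
# has been sent, wait up to MAX_WAIT seconds for the JVM to exit instead of
# sleeping a fixed 5 seconds. SOLR_PID and MAX_WAIT are placeholders.
SOLR_PID="$1"
MAX_WAIT=60

waited=0
while kill -0 "$SOLR_PID" 2>/dev/null && [ "$waited" -lt "$MAX_WAIT" ]; do
  sleep 1
  waited=$((waited + 1))
done

if kill -0 "$SOLR_PID" 2>/dev/null; then
  echo "Solr process $SOLR_PID still running after ${MAX_WAIT}s; forcefully killing it."
  kill -9 "$SOLR_PID"
fi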

After only 5 seconds, that script forcefully kills the Solr instance,
which does _not_ flag the replicas it hosts as down. After an
interval, you should see the node disappear from the /live_nodes znode
though. The problem, of course, is that part of a graceful shutdown is
each replica updating the associated state.json, and they don't get a
chance. ZooKeeper will periodically ping the Solr instance and, if the
session times out, remove the associated ephemeral znode under
/live_nodes....
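If you want to watch that happen, the plain ZooKeeper CLI is enough. A
rough example, where localhost:2181 and mycollection are placeholders
(add your chroot to the paths if you use one):

# list the ephemeral znodes for the currently live Solr nodes
zkCli.sh -server localhost:2181 ls /live_nodes
# dump the per-collection replica states ("mycollection" is a placeholder)
zkCli.sh -server localhost:2181 get /collections/mycollection/state.json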

Solr checks both state.json and live_nodes to know whether a node is
truly functioning; being absent from live_nodes trumps whatever state
is in state.json.
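You can approximate that rule from the outside with the same two
znodes; a rough sketch, where ZK_HOST and NODE are placeholders:

# Rough sketch: treat a node as functioning only if it appears in
# /live_nodes, whatever state.json claims. ZK_HOST and NODE are placeholders.
ZK_HOST="localhost:2181"
NODE="host1:8983_solr"

if zkCli.sh -server "$ZK_HOST" ls /live_nodes 2>/dev/null | grep -q "$NODE"; then
  echo "$NODE is in /live_nodes; its replica states in state.json are meaningful."
else
  echo "$NODE is missing from /live_nodes; treat its replicas as down."
fi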

Best,
Erick




On Sat, Oct 22, 2016 at 1:00 AM, Hendrik Haddorp
<hendrik.hadd...@gmx.net> wrote:
Thanks, that was what I was hoping for; I just didn't see any indication of
that in the normal log output.

The reason for asking is that I have a SolrCloud 6.2.1 setup, and when
ripple-restarting the nodes I sometimes get errors. So far I have seen two
different things:
1) The node starts up again and is able to receive new replicas, but all
existing replicas are broken.
2) All nodes come up and no problems are seen in the cluster status, but the
admin UI on one node claims that a file for one config set is missing.
Restarting the node resolves the issue.

This looked to me like the node is not going down cleanly. Would a clean
shutdown result in the node being flagged as down in the cluster state
straight away? So far, the ZooKeeper data only gets updated once the node
comes up again and reports itself as down before the recovery starts.

On 21.10.2016 15:01, Shawn Heisey wrote:
On 10/21/2016 6:56 AM, Hendrik Haddorp wrote:
I'm running SolrCloud in foreground mode (-f). Does it make a
difference to Solr whether I stop it by pressing Ctrl-C, sending it a
SIGTERM, or using "solr stop"?
All of those should produce the same result in the end -- Solr's
shutdown hook will be called and a graceful shutdown will commence.
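In other words, all of the following should end up in the same
shutdown hook; the PID and port are placeholders:

# 1) foreground mode: press Ctrl-C in the terminal running "bin/solr start -f"
# 2) send SIGTERM to the Solr JVM (12345 is a placeholder PID)
kill -TERM 12345
# 3) ask the instance on port 8983 to stop (subject to the 5-second limit below)
bin/solr stop -p 8983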

Note that in the case of the "bin/solr stop" command, the default is to
only wait five seconds for a graceful shutdown before proceeding to a
forced kill, which, for a typical install, means that forced kills become
the norm rather than the exception.  We have an issue to increase the
max timeout, but it hasn't been done yet.

I strongly recommend that anyone going into production edit the script
to increase the timeout.  For the shell script I would use at least 60
seconds.  The Windows script just does a fixed pause, not an intelligent
wait, so going that high probably isn't advisable on Windows.

Thanks,
Shawn

