Hey John,
Assuming your cluster is still configured basically the same way as when I
helped you set it up, you're using local-disk replication. Each master forest
has at least one replica on another cluster node, which means that if a node
dies the cluster stays live and running (after a short failover delay while
the surviving nodes vote the dead guy out of the cluster).
The easiest way to deal with a lost cluster node is to build a new one, give
it the identity of the lost one and then have it join the cluster. Automatic
forest replication will take over and the new node will, after some amount of
time, be re-synchronized and everything will be back to normal.
The only thing left to do at that point will be to have the replacement node
take back the master role from the replica(s) that took over when it failed.
Springer has (or at least had) a nightly scheduled task that looks for forests
that have failed over and tries to flip them back if they're eligible, so you
don't even have to worry about that.
There is actually a README in your source tree that I wrote, named
node-recreate.txt, that documents how to do this. I kept a genericized
version, which is pasted in below in case it's useful to other people. As written
up it requires some manual steps, but I think it could be automated with your
favorite DevOps tool to configure a new box and perform the needed steps.
Hope this helps.
=============
Steps to recreate a lost cluster node
The MarkLogic clusters at XXXX are configured with
local-disk replication for all of the data forests. Each forest
has at least one replica on another host in the cluster. This means
that the cluster can tolerate one lost host and continue serving
requests as normal without data loss.
If a cluster node drops out of the cluster, the replica(s) of
that node's forests will take over. When the node re-enters the
cluster, its forests will sync up with the replicas.
The following is the procedure for dealing with a cluster node that
fails and loses all of its data. The cluster can carry on without
the lost node until the node is recreated (but it can't tolerate a
second lost node, so it's important to get that node back online).
A fresh install of MarkLogic cannot simply be booted up and
added to the cluster. It is necessary to recreate the MarkLogic
cluster configuration on the new node so that it will be recognized
when it attempts to join the cluster. Luckily, the complete cluster
configuration is shared by all nodes in the cluster.
Follow these steps to re-create the lost node.
o Install the operating system and MarkLogic on the new machine.
Any references to local filesystems in the original MarkLogic
setup must be identical, and permissions must be correct. For
example, if forests were on /volumes/raid/blah then that path
must be available and writable by MarkLogic.
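For example, if the forests live under /volumes/raid/blah and MarkLogic
runs as the daemon user (the default on most Linux installs; adjust both
to your setup), a quick sanity check might look like:

    # confirm the path exists and note its ownership/permissions
    ls -ld /volumes/raid/blah
    # confirm the MarkLogic user can actually write there
    sudo -u daemon touch /volumes/raid/blah/.write-test && \
        sudo -u daemon rm /volumes/raid/blah/.write-test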
o If you haven't started MarkLogic yet, go ahead and do so
(sh /etc/init.d/MarkLogic start). Point your browser at
localhost:8001 and go through the initial setup, including
entering the license key.
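If you're scripting this, a simple way to wait for the Admin UI to come
up before doing the browser steps (just a convenience, not required):

    # poll port 8001 until the Admin UI starts answering
    until curl -s -o /dev/null http://localhost:8001/; do sleep 5; done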
o Once that's done, shut down MarkLogic (via the Admin UI or
by running "sh /etc/init.d/MarkLogic stop").
o cd to /var/opt/MarkLogic
o You should see several XML files there. The only one you care
about is server.xml. Copy it to /tmp.
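In other words, on the new machine:

    # stash the new machine's own server.xml; it holds this box's license info
    cp -p /var/opt/MarkLogic/server.xml /tmp/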
o It is best to shut down the cluster before doing the following
step. But if this is difficult, it will probably be ok to do
it while the cluster is running. Just make sure no configuration
changes are made until all the following steps are completed.
o You now need to copy the cluster config from one of the functioning
nodes in the cluster. Pick one, it doesn't matter which, and log
into it. On that machine, cd to /var/opt/MarkLogic. You should
see several XML files there as well, but more of them. Use scp
to copy them to the machine you're re-creating:
scp *.xml new-machine:/var/opt/MarkLogic
o Back on the new machine, check that all those XML files are
now in /var/opt/MarkLogic
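For example:

    # the full set of cluster config files should be here now
    ls -l /var/opt/MarkLogic/*.xml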
o Edit server.xml (the one that was just copied over) and replace
the <license-key> field with the value in the server.xml that you
copied to /tmp in an earlier step. You may need to copy the
<licensee> field as well if it's not the same in both files.
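A quick way to see both sets of values side by side before editing
(the element names are as they appear in server.xml):

    # values saved from the new machine's own install
    grep -E '<license-key>|<licensee>' /tmp/server.xml
    # values currently in the copied-over config
    grep -E '<license-key>|<licensee>' /var/opt/MarkLogic/server.xml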
o Next, copy the <host-id> field from hosts.xml, which is one of
the files that you scp'ed from the other host. In hosts.xml,
search for a <host-name> element containing the host name of the
new machine that you're recreating. Copy the value of the <host-id>
element (usually immediately above the host name) into server.xml.
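For example, if the machine being recreated is ml-node3.example.com
(a made-up name, use your real one), this should print the <host-id>
line just above the matching <host-name>:

    grep -B1 '<host-name>ml-node3.example.com</host-name>' /var/opt/MarkLogic/hosts.xml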
o The host id in server.xml will now identify the new machine to the
cluster as the one that was lost.
o If you stopped the cluster earlier, restart all the nodes of the
cluster except the new one. There could be a delay of several
minutes before the Admin UI of the cluster begins responding as
normal. Wait until the cluster is back up.
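If it's more convenient to do that from one place, something along these
lines works (hostnames are made up, substitute your surviving nodes):

    # start MarkLogic on each surviving node (not the new one yet)
    for h in ml-node1.example.com ml-node2.example.com; do
        ssh $h 'sudo sh /etc/init.d/MarkLogic start'
    done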
o At this point, the Hosts list on the cluster Admin UI should
still show the failed node as Disconnected.
o Start MarkLogic on the new node and keep an eye on its log
file (tail -f /var/opt/MarkLogic/Logs/ErrorLog.txt). You should
see messages indicating that forests have been mounted from the
other nodes in the cluster, followed by messages saying that
forests are being synchronized. Depending on the amount of data
in the forests, it may take a while for them to fully synchronize.
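If the log is noisy, a filtered tail makes the forest activity easier
to spot (this only filters; the exact message wording varies by release):

    tail -f /var/opt/MarkLogic/Logs/ErrorLog.txt | grep -i forest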
o In the Admin UI, the node you've just created should now show
as Connected, and all of its forests should now be online. If
you list all forests, they should all be either "Open" or
"Sync Replicating". Once a day, at 1:00am, a script runs that
will attempt to re-balance the master/replicas. If the forests
have finished syncing by then they should flip around to their
proper master/replica relationships.
o That's it, you're done. Exhale.
Ron Hitchens, 2011-12-07
---
Ron Hitchens {[email protected]} +44 7879 358212
On Dec 11, 2014, at 10:30 AM, "Muth, John, Springer UK"
<[email protected]> wrote:
>
> Hello,
>
> I'm considering writing some XQuery to automatically update the configuration
> of a cluster where one host has died and you want to replace it with a new
> one.
> We already have scripts in place that automatically configure the new host,
> and join it to the cluster.
> The bit we're missing is updating the forests, to make it so the forests that
> were previously on the now-dead host are instead on the replacement host.
> I'm imagining a new script/endpoint that would be something like:
>
> /move-forests?old-host=dead-host.domain.com&new-host=replacement-host.domain.com
>
> It looks like it should be fairly straightforward to do using existing admin:
> functions, something like:
>
> - for each database,
> - for each forest
> - if on dead host
> - copy to new host and delete from dead host
>
> - delete dead host
>
> Does this sound reasonable?
> Has anybody done this kind of thing?
> Any gotchas, etc?
>
> Thanks,
> John
>