Hey John,

   Assuming your cluster is still configured basically the same way as when I 
helped you set it up, you're using local-disk replication.  Each master forest 
has at least one replica on another cluster node, which means that if a node 
dies the cluster is still live and running (after a short failover delay to 
vote the dead guy out of the cluster).

   The easiest way to deal with a lost cluster node is to build a new one, give 
it the identity of the lost one and then have it join the cluster.  Automatic 
forest replication will take over and the new node will, after some amount of 
time, be re-synchronized and everything will be back to normal.

   The only thing left to do at that point will be to have the replacement node 
take over as the master from the replica(s) that took over when it failed.  
Springer has (or at least had) a nightly scheduled task that looks for forests 
that have failed over and tries to flip them back if they're eligible, so you 
don't even have to worry about that.

  There is actually a README in your source tree that I wrote, named 
node-recreate.txt, which documents how to do this.  I kept a genericized 
version, pasted in below, since it could be useful to other people.  As written 
up it requires some manual steps, but I think it could be automated with your 
favorite DevOps tool to configure a new box and perform the needed steps.

   Hope this helps.

=============

   Steps to recreate a lost cluster node

   The MarkLogic clusters at XXXX are configured with
local-disk replication for all of the data forests.  Each forest
has at least one replica on another host in the cluster.  This means
that the cluster can tolerate one lost host and continue serving
requests as normal without data loss.

   If a cluster node drops out of the cluster, the replica(s) of
that node's forests will take over.  When the node re-enters the
cluster, its forests will sync up with the replicas.

   The following is the procedure for coping with a cluster node that
fails and loses all of its data.  The cluster can carry on without
the lost node until the node is recreated (but it can't tolerate a
second lost node, so it's important to get that node back online).

   A fresh install of MarkLogic cannot simply be booted up and
added to the cluster.  It is necessary to recreate the MarkLogic
cluster configuration on the new node so that it will be recognized
when it attempts to join the cluster.  Luckily, the complete cluster
configuration is shared by all nodes in the cluster.

   Follow these steps to re-create the lost node.

   o Install the operating system and MarkLogic on the new machine.
     Any local filesystem paths referenced in the original MarkLogic
     setup must exist at the same locations, and permissions must be
     correct.  For example, if forests were on /volumes/raid/blah then
     that path must be available and writable by MarkLogic.
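
     A minimal sketch of what that setup might look like, using the
     example path above and assuming MarkLogic runs as the default
     "daemon" user on Linux (adjust the path and ownership to match
     your install):

          # recreate the forest data directory and give MarkLogic ownership
          mkdir -p /volumes/raid/blah
          chown -R daemon:daemon /volumes/raid/blah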

   o If you haven't started MarkLogic yet, go ahead and do so
     (sh /etc/init.d/MarkLogic start).  Point your browser at
     localhost:8001 and go through the initial setup, including
     entering the license key.

   o Once that's done, shut down MarkLogic (via the Admin UI or
     by running "sh /etc/init.d/MarkLogic stop").

   o cd to /var/opt/MarkLogic

   o You should see several XML files there.  The only one you care
     about is server.xml.  Copy it to /tmp.
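
     In other words, something like:

          cd /var/opt/MarkLogic
          cp server.xml /tmp/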

   o It is best to shut down the cluster before doing the following
     step.  But if this is difficult, it will probably be OK to do
     it while the cluster is running.  Just make sure no configuration
     changes are made until all the following steps are completed.

   o You now need to copy the cluster config from one of the functioning
     nodes in the cluster.  Pick one, it doesn't matter which, and log
     into it.  On that machine, cd to /var/opt/MarkLogic.  You should
     see several XML files there as well, but more of them.  Use scp
     to copy them to the machine you're re-creating:

          scp *.xml new-machine:/var/opt/MarkLogic

   o Back on the new machine, check that all those XML files are
     now in /var/opt/MarkLogic.
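
     A quick way to sanity-check that they arrived:

          ls -l /var/opt/MarkLogic/*.xml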

   o Edit server.xml (the one that was just copied over) and replace
     the <license-key> field with the value in the server.xml that you
     copied to /tmp in an earlier step.  You may need to copy the
     <licensee> field as well, if it's not the same in both files.
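
     One way to eyeball the values in both files, assuming the elements
     are written one per line (which is how MarkLogic normally saves
     these files):

          grep -E '<license-key>|<licensee>' /tmp/server.xml
          grep -E '<license-key>|<licensee>' /var/opt/MarkLogic/server.xml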

   o Next, copy the <host-id> field from hosts.xml, which is one of
     the files that you scp'ed from the other host.  In hosts.xml,
     search for a <host-name> element containing the host name of the
     new machine that you're recreating.  Copy the value of the <host-id>
     element (usually immediately above the host name) into server.xml.
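
     For example, with new-machine standing in for the host name you're
     recreating, and assuming <host-id> sits on the line just above
     <host-name> as it usually does:

          grep -B1 '<host-name>new-machine' /var/opt/MarkLogic/hosts.xml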

   o The host id in server.xml will now identify the new machine to the
     cluster as the one that was lost.

   o If you stopped the cluster earlier, restart all the nodes of the
     cluster except the new one.  There could be a delay of several
     minutes before the Admin UI of the cluster begins responding as
     normal.  Wait until the cluster is back up.
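
     Starting each surviving node is the same command used earlier:

          sh /etc/init.d/MarkLogic start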

   o At this point, the Hosts list on the cluster Admin UI should
     still show the failed node as Disconnected.

   o Start MarkLogic on the new node and keep an eye on its log
     file (tail -f /var/opt/MarkLogic/Logs/ErrorLog.txt).  You should
     see messages indicating that forests have been mounted from the
     other nodes in the cluster, followed by messages saying that
     forests are being synchronized.  Depending on the amount of data
     in the forests, it may take a while for them to fully synchronize.

   o In the Admin UI, the node you've just created should now show
     as Connected, and all of its forests should now be online.  If
     you list all forests, they should all be either "Open" or
     "Sync Replicating".  Once a day, at 1:00am, a script runs that
     will attempt to re-balance the masters/replicas.  If the forests
     have finished syncing by then, they should flip around to their
     proper master/replica relationships.

   o That's it, you're done.  Exhale.


Ron Hitchens, 2011-12-07

---
Ron Hitchens {[email protected]}  +44 7879 358212

On Dec 11, 2014, at 10:30 AM, "Muth, John, Springer UK" 
<[email protected]> wrote:

> 
> Hello,
> 
> I'm considering writing some XQuery to automatically update the configuration 
> of a cluster where one host has died and you want to replace it with a new 
> one.
> We already have scripts in place that automatically configure the new host, 
> and join it to the cluster.
> The bit we're missing is updating the forests, to make it so the forests that 
> were previously on the now-dead host are instead on the replacement host.
> I'm imagining a new script/endpoint that would be something like:
> 
> /move-forests?old-host=dead-host.domain.com&new-host=replacement-host.domain.com
> 
> It looks like it should be fairly straightforward to do using existing admin: 
> functions, something like:
> 
> - for each database, 
>    - for each forest
>       - if on dead host
>       - copy to new host and delete from dead host
> 
> - delete dead host
> 
> Does this sound reasonable?
> Has anybody done this kind of thing?
> Any gotchas, etc?
> 
> Thanks,
> John 
> 
> _______________________________________________
> General mailing list
> [email protected]
> http://developer.marklogic.com/mailman/listinfo/general

_______________________________________________
General mailing list
[email protected]
http://developer.marklogic.com/mailman/listinfo/general
