Stas Oskin wrote:
Hi.
Could you share how exactly it didn't quite work? That would be valuable
information for the community.
The idea is to have a Xen machine dedicated to NN, and maybe to SNN, which
would be running over DRBD, as described here:
http://www.drbd.org/users-guide/ch-xen.html
The VM would be monitored by Heartbeat, which would restart it on another
node when it fails.
I wanted to go that way as I thought it's perfect for a small cluster,
as then the node can be re-used for other tasks.
Once the cluster grows reasonably, the VM could be live-migrated to a
dedicated machine with minimal downtime.
The problem is that it didn't work as expected. Xen over DRBD is just not
as reliable as described. The most basic operation, live domain migration,
works in only about 50% of cases. Most often the domain migration leaves
the DRBD device in read-only status, meaning the domain can't be cleanly
shut down, only killed. This in turn often leads to NN metadata corruption.
It's probably a quirk of virtualisation: all those clocks and things
cause trouble for any HA protocol running round the cluster. I would
not blame Xen, as VMware and VirtualBox are also tricky.
As you have a virtual infrastructure, why not have an image of the primary
NN, ready to bring up on demand when the NN goes down, pointed at a copy
of the NN datasets?
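
To make the "bring up on demand" idea concrete, here is a rough sketch of
the decision step only: probe the NN a few times and declare it down only if
every probe fails, so that a single missed check does not trigger a failover.
The function names, the retry count and the host/port in the comment are all
assumptions for illustration; nothing here comes from Hadoop, Xen or
Heartbeat, and actually starting the standby image is left out.

```python
import socket
import time
from typing import Callable


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def should_fail_over(probe: Callable[[], bool],
                     attempts: int = 3,
                     delay: float = 0.0) -> bool:
    """Return True only if `probe` fails `attempts` times in a row.

    `probe` is any zero-argument callable returning True when the NN
    looks healthy. `delay` is the pause in seconds between probes.
    """
    for i in range(attempts):
        if probe():
            return False          # one success is enough to stay put
        if i < attempts - 1:
            time.sleep(delay)     # wait before the next probe
    return True


# Hypothetical usage (host name and port are placeholders, not real values):
#
#   nn_down = should_fail_over(
#       lambda: is_reachable("nn.example.com", 8020), attempts=3, delay=5.0)
#   if nn_down:
#       ...  # bring up the standby NN image pointed at the metadata copy
```

A TCP connect only shows that something is listening on the port, not that
the NN is actually serving requests, so treat this as a starting point
rather than a real health check.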