Hi Denis,
I'm wondering if it would be possible to extend the LVM SR to integrate
DRBD primary/secondary redundancy for a two-node installation.
Let me explain the context and a possible solution.
We've been using DRBD for some years on iSCSI SANs with Xen/XCP, having
two redundant iSCSI SANs and two Xen/XCP nodes.
This kind of setup works great; however, for smaller setups it is
overkill.
We used to do exactly the same, using two switches and multi-path to
deal with switch redundancy. Feeling it was overkill, we switched to a
similar setup to yours.
So we started to integrate DRBD directly on the XCP nodes (there are
some docs at Linbit about it). However, being a little paranoid about
split-brain scenarios, we have always used primary/secondary setups
(SR1 primary on XCP1 and SR2 primary on XCP2).
We are using a similar setup, but with dual primary. The two servers
are connected directly via a cable, so there is little chance of the two
being disconnected. We have only run into a split-brain problem once,
when we were doing something rather silly: using vdi-copy to migrate
virtual disks to a local software RAID5 array while running VMs on the
same array. In short, the whole machine stopped responding to the
network for minutes at a time while the copy took place. Fortunately we
were able to recover pretty easily. Lesson learned: stay away from
software RAID5 on the hypervisors.
This kind of setup has a big drawback: VMs with VDIs on SR1 have to run
on XCP1 and VMs with VDIs on SR2 have to run on XCP2. There is a loss of
flexibility and a loss of transparency for XCP admins.
Generally, we have not had a problem with dual primary - we use
live-migrate and run VMs on either node. When the hypervisors boot, we
make sure everything is connected, switch both to primary and plug in
the PBDs.
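For reference, the boot-time sequence for one resource looks roughly
like the sketch below; the resource name "r0" and the PBD UUID are
placeholders, and dual primary also needs allow-two-primaries set in
the DRBD net configuration.

    import subprocess

    def promote_and_plug(resource, pbd_uuid):
        # Sketch of our manual boot procedure for one DRBD resource
        subprocess.check_call(["drbdadm", "up", resource])            # attach disk, connect to peer
        subprocess.check_call(["drbdadm", "wait-connect", resource])  # block until the peer is reachable
        subprocess.check_call(["drbdadm", "primary", resource])       # promote this side
        subprocess.check_call(["xe", "pbd-plug", "uuid=" + pbd_uuid]) # bring the SR online on this host

    promote_and_plug("r0", "<pbd-uuid>")  # placeholder values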
Our plan for recovering from split-brain is to pick the host with the
most changed VDIs (the most data) as the new primary, use DRBD with an
external meta-disk to replicate the changed VDIs to the new primary,
then invalidate the data on the junked host and run a full re-sync. Not
fun, but recoverable as long as you know which VMs have been running on
which host and you keep an eye out for split-brain (there are hooks you
can use to notify you when things happen to DRBD).
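As an illustration, such a hook could be as simple as the sketch below,
assuming it is wired up as the split-brain handler in drbd.conf; the
mail addresses and subject line are placeholders.

    #!/usr/bin/env python
    # Minimal split-brain notification hook (sketch). DRBD exports the
    # resource name to handler scripts in the DRBD_RESOURCE variable.
    import os
    import smtplib
    from email.mime.text import MIMEText

    resource = os.environ.get("DRBD_RESOURCE", "unknown")
    msg = MIMEText("DRBD split-brain detected on resource %s" % resource)
    msg["Subject"] = "DRBD split-brain: %s" % resource  # placeholder subject
    msg["From"] = "drbd@localhost"                       # placeholder address
    msg["To"] = "root@localhost"                         # placeholder address

    s = smtplib.SMTP("localhost")
    s.sendmail(msg["From"], [msg["To"]], msg.as_string())
    s.quit()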
So I want to extend the LVM SR to integrate DRBD primary/secondary, and
I'd like some input on this kind of scenario:
* On each LV, create a DRBD resource; when a VBD is brought up, the
  DRBD resource is switched to primary.
* When migrating a VM to the second node, switch DRBD on the first node
  to secondary, switch DRBD on the second node to primary, and get on
  with resuming the VM.
* When a VM is brought down, the PBD is brought down and the DRBD
  resource is switched to secondary (roughly sketched below).
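As a sketch only (the per-VDI resource names and the exact hook points
in the SR backend are assumptions at this stage), the host-side role
switching would look something like this:

    import subprocess

    def _drbdadm(action, resource):
        # Thin wrapper around the drbdadm CLI
        subprocess.check_call(["drbdadm", action, resource])

    def on_vbd_attach(resource):
        # VBD brought up on this host: promote the backing DRBD resource
        _drbdadm("primary", resource)

    def on_vbd_detach(resource):
        # VM brought down: demote so the peer can be promoted later
        _drbdadm("secondary", resource)

    def on_migrate_out(resource):
        # Source side of a migration: demote here; the destination host
        # runs on_vbd_attach() before resuming the VM
        _drbdadm("secondary", resource)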
My understanding is the following: an LVM SR is effectively an LVM
volume group attached via a PBD. If you unplug the PBD when shutting
down the VM, you are taking the whole SR offline. If you had one PBD per
VM, you would need one SR per VDI, and therefore one VG per VDI, which
wouldn't work: you wouldn't be able to do snapshots, resize VDIs, etc.
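For illustration, the SR/PBD/VDI relationship is visible over the
XenAPI; a sketch with the Python bindings, where the SR name-label is a
placeholder:

    import XenAPI

    session = XenAPI.xapi_local()
    session.xenapi.login_with_password("root", "")
    try:
        # "lvm-sr" is a placeholder name-label for this sketch
        sr = session.xenapi.SR.get_by_name_label("lvm-sr")[0]
        pbds = session.xenapi.SR.get_PBDs(sr)  # one PBD per host attaching the SR
        vdis = session.xenapi.SR.get_VDIs(sr)  # many VDIs (LVs) in the one VG
        print("SR has %d PBD(s) and %d VDI(s)" % (len(pbds), len(vdis)))
    finally:
        session.xenapi.session.logout()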
What you might do (which is what I think you're angling at) is the
following:
Both hosts handle their own LVM VG and corresponding LVs. Every time you
create a VDI, each host would create the LV locally, set up a DRBD
resource and start syncing. Both VDIs could sit in "Secondary" mode
while the VM was down, and would only switch to primary while the VM was
running on that particular host.
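A rough sketch of what VDI creation might then do on each host; the
names are placeholders and it assumes the matching DRBD resource stanza
already exists on both hosts:

    import subprocess

    def create_mirrored_vdi(vg, lv_name, size_mb, resource):
        # Create the local LV that will back this VDI
        subprocess.check_call(
            ["lvcreate", "-L", "%dM" % size_mb, "-n", lv_name, vg])
        # Initialise DRBD metadata and bring the resource up
        subprocess.check_call(["drbdadm", "create-md", resource])
        subprocess.check_call(["drbdadm", "up", resource])
        # The resource stays Secondary until a VM starts on this host;
        # the peer performs the same steps and the initial sync begins
        # once one side is promoted (e.g. forced primary on creation).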
Provisions would have to be made for keeping the DRBD/LVM config
consistent across the two hosts. Most of the DRBD config (sync rates,
protocol, passwords, data-integrity algorithm, etc.) could be stored in
the SR config, but one would need to be able to re-build a whole SR
should a disk on one of the hosts fail.
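For example, the shared settings could sit in the SR's sm-config and be
rendered into a per-resource stanza on both hosts. The key names below
are invented for the sketch; only the DRBD options themselves (protocol,
sync rate, shared-secret, data-integrity-alg) are real:

    # Hypothetical sm-config keys -- invented for this sketch
    sm_config = {
        "drbd-protocol": "C",
        "drbd-sync-rate": "40M",
        "drbd-shared-secret": "changeme",
        "drbd-integrity-alg": "crc32c",
    }

    def render_resource(name, minor, lv_path, local, peer, cfg):
        # Render a DRBD 8.3-style resource stanza from the SR-level
        # settings; local and peer are (hostname, "ip:port") tuples
        return """resource %s {
      protocol %s;
      syncer { rate %s; }
      net {
        cram-hmac-alg sha1;
        shared-secret "%s";
        data-integrity-alg %s;
      }
      on %s { device /dev/drbd%d; disk %s; address %s; meta-disk internal; }
      on %s { device /dev/drbd%d; disk %s; address %s; meta-disk internal; }
    }""" % (name, cfg["drbd-protocol"], cfg["drbd-sync-rate"],
            cfg["drbd-shared-secret"], cfg["drbd-integrity-alg"],
            local[0], minor, lv_path, local[1],
            peer[0], minor, lv_path, peer[1])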
How would you deal with split-brain? If XCP2 appeared down, would XCP1
put a VDI into primary without knowing the state of the VDI on XCP2?
Please feel free to correct me should my understanding be at fault in
any way.
That would make for a lot of DRBD resources when accounting for
snapshots and so on, but if it could be done, it would be a tremendous
addition for smaller SMB setups.
So far we have only thought about a two-node setup, but I suppose you
could go further. You would need a system for keeping track of which VDI
was mirrored on which host, automatic re-distribution of VDIs to another
host in the pool should one fail (or be removed from the pool), each VM
would be limited to two hosts in the pool (so you may run into problems
with multiple host failures), and each host would have to have a large
amount of local storage. There are some limitations, but it might be
workable.
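Something like the sketch below (the structure and names are invented)
might be a starting point for tracking which VDI is mirrored on which
pair of hosts and deciding what to rebuild after a host failure:

    # Sketch of a VDI placement map for a pool of more than two hosts.
    # In practice this would have to live somewhere shared (e.g. the
    # pool database or the SR's other-config); the data is made up.
    placement = {
        # vdi_uuid: (host_a, host_b) holding the two DRBD replicas
        "vdi-1234": ("xcp1", "xcp2"),
        "vdi-5678": ("xcp2", "xcp3"),
    }

    def vdis_needing_rebuild(failed_host, placement):
        # VDIs that lost a replica when failed_host died and must be
        # re-mirrored onto another host in the pool
        return [vdi for vdi, pair in placement.items() if failed_host in pair]

    print(vdis_needing_rebuild("xcp2", placement))  # -> ['vdi-1234', 'vdi-5678']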
My two cents!
I'd be glad to have some input from the devs if possible. By the way,
kudos to the devs for XCP 1.6, it really rocks.
Good to hear - thanks to the devs from this corner too. I'm looking
forward to playing with the new features soon.
Regards,
Tim
_______________________________________________
Xen-api mailing list
[email protected]
http://lists.xen.org/cgi-bin/mailman/listinfo/xen-api