Thought so, just trying to look at all the potential options right now.
We'll probably end up going the single-region route, but the idea of a
whole zone rebooting because of misbehaving HA worries me.  We're coming
from a traditional single-node, non-HA setup.  We're used to having a
single node go down (hence our interest in HA), but having a whole
datacenter's worth of customer VMs rebooting at once would be a nightmare.


Thank You,

Logan Barfield
Tranquil Hosting

On Wed, Jan 7, 2015 at 3:57 PM, Simon Weller <swel...@ena.com> wrote:

> Regions are designed to be completely separate from one another, so no, as
> far as I'm aware there is no way to sync secondary storage data between
> them. I don't think you'd want to do that anyway, as it defeats the purpose
> of keeping one cloud region isolated from another.
>
> - Si
>
>
> ________________________________________
> From: Logan Barfield <lbarfi...@tqhosting.com>
> Sent: Wednesday, January 07, 2015 2:00 PM
> To: dev@cloudstack.apache.org
> Cc: us...@cloudstack.apache.org
> Subject: Re: Multi-Datacenter Deployment
>
> A follow-up here:  You can't have secondary storage that spans regions
> (e.g., keeping templates/snapshots in sync), even with S3/Swift, correct?
> If not, that's another downside to regions, on top of the account sync.
>
> It seems like the best solution to prevent weird split-brain/HA issues
> would be to have at least 3 databases set up as master/master/master with
> quorum.  That way, if two sites lose contact and then re-establish it,
> there's a 2/1 majority saying the hosts are all reachable.  That would
> hopefully prevent the ones that lost contact from kicking off HA
> immediately.  I don't even know how feasible that would be; maybe with
> Galera?
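>
> If it did work, a three-node Galera cluster is mostly just a handful of
> wsrep settings in my.cnf.  A rough, untested sketch (hostnames/IPs made
> up, and I have no idea yet how CloudStack's schema behaves under Galera's
> row-based replication):
>
>     [mysqld]
>     binlog_format=ROW
>     default_storage_engine=InnoDB
>     innodb_autoinc_lock_mode=2
>     wsrep_provider=/usr/lib/galera/libgalera_smm.so
>     wsrep_cluster_name=cs_mgmt
>     wsrep_cluster_address=gcomm://10.0.1.10,10.0.2.10,10.0.3.10
>     wsrep_node_name=site1-db
>     wsrep_node_address=10.0.1.10
>     wsrep_sst_method=rsync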
>
> Even then it would have to be handled at the table level, since there
> would be conflicts.  For instance:
> - Given sites 1, 2, and 3, site 1 loses contact with site 2 and comes
>   back up.
> - Site 1: Thinks site 1 is up and site 2 is down.
> - Site 2: Thinks site 2 is up and site 1 is down.
> - Site 3: Thinks all sites are up.
>
> In the above case the least harmful thing would be to push site 3's view
> to the other two, but since all three sites have different data it may
> just hang instead.
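>
> To make the quorum idea concrete, here's a toy illustration (Python,
> purely illustrative -- a real clustering layer votes on transactions,
> not on host state like this):
>
>     # Each site reports which sites it can reach; take the majority view.
>     def majority_view(views):
>         hosts = set().union(*views)
>         return {h: sum(v.get(h, False) for v in views) > len(views) // 2
>                 for h in hosts}
>
>     views = [
>         {"site1": True, "site2": False, "site3": True},  # site 1's view
>         {"site1": False, "site2": True, "site3": True},  # site 2's view
>         {"site1": True, "site2": True, "site3": True},   # site 3's view
>     ]
>     print(majority_view(views))  # all three judged up, 2 votes to 1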
>
> This is going to drive me nuts. :D
>
>
> Thank You,
>
> Logan Barfield
> Tranquil Hosting
>
> On Wed, Jan 7, 2015 at 12:57 PM, Simon Weller <swel...@ena.com> wrote:
>
> > See inline.
> > ________________________________________
> > From: Logan Barfield <lbarfi...@tqhosting.com>
> > Sent: Wednesday, January 07, 2015 11:43 AM
> > To: dev@cloudstack.apache.org
> > Cc: us...@cloudstack.apache.org
> > Subject: Re: Multi-Datacenter Deployment
> >
> > I appreciate the explanation.  That seems to confirm what I was thinking,
> > that until regions are working 100% we'll just have to make sure the
> > DC-to-DC links are as stable/redundant as possible to prevent HA issues.
> > If we increase the HA delay it shouldn't be a major issue, and it will
> > still be better than nothing.
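> >
> > For the "HA delay", the knobs I'm looking at are the ping.interval and
> > ping.timeout global settings (ping.timeout being a multiplier on
> > ping.interval, if I'm reading the docs right).  E.g. via cloudmonkey
> > (values here are just an example, not a recommendation, and most global
> > settings need a management server restart to take effect):
> >
> >     update configuration name=ping.interval value=120
> >     update configuration name=ping.timeout value=3.0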
> >
> > For us it probably also makes sense to not worry about having management
> > servers in each DC for now.  If we have a big enough outage in our
> > primary DC to affect access to the management server, we probably have
> > bigger problems to worry about.
> >
> > > Yeah, I agree. Even with Mgmt down, it's not going to stop any existing
> > > services from running or functioning as long as the clusters are
> > > healthy.
> >
> > - Si
> >
> > Much appreciated!
> >
> >
> > Thank You,
> >
> > Logan Barfield
> > Tranquil Hosting
> >
> > On Wed, Jan 7, 2015 at 12:15 PM, Simon Weller <swel...@ena.com> wrote:
> >
> > > Logan,
> > >
> > > We currently run CS in multiple geographically separate DCs, and may be
> > > able to give you a little insight into things.
> > >
> > > We run KVM in advanced networking mode, with CLVM clusters backed onto
> > > Dell Compellent SANs. We currently have different DCs running different
> > > zones per DC, in a single region. We've been running CS in production
> > > since 4.0, prior to regions, so that functionality (along with its
> > > limitations) hasn't been something we've adopted yet. We run our
> > > management servers (multiple clustered nodes) out of one DC, and have a
> > > backup set of management nodes in another DC should we need to invoke
> > > BCDR in the event the primary management nodes become unavailable.
> > >
> > > Your concerns regarding HA problems are well founded. We run our own
> > > nationwide MPLS backbone, and therefore have multiple high-capacity
> > > bandwidth paths between our different DCs. Even with that capacity and
> > > fault-tolerant design, we've seen issues where Management has attempted
> > > to invoke HA due to brief losses of connectivity (typically due to
> > > maintenance or grooming activity), and this can be quite problematic.
> > > VPN tunnels are going to be very challenging for you; you really need
> > > to look at VPLS or some other technology that can layer on top of a
> > > resilient infrastructure with multiple paths and fast failover (e.g.
> > > MPLS Fast Reroute).
> > >
> > > Ideally, regions should solve this with dedicated local management
> > > nodes, but until the syncing is sorted out and those newer releases are
> > > stable, there isn't much option right now short of using a single
> > > region, or setting up a completely separate CS instance per DC.
> > >
> > > Hope this helps a little.
> > >
> > > - Si
> > >
> > > ________________________________________
> > > From: Logan Barfield <lbarfi...@tqhosting.com>
> > > Sent: Tuesday, January 06, 2015 1:45 PM
> > > To: dev@cloudstack.apache.org; us...@cloudstack.apache.org
> > > Subject: Multi-Datacenter Deployment
> > >
> > > We are currently running a single location CloudStack deployment:
> > > - 1 Hardware firewall
> > > - 1 Management/Database Server
> > > - 1 NFS staging store (for S3 secondary storage)
> > > - Ceph RBD for primary storage
> > > - 4 Hypervisors
> > > - 1 Zone/Pod/Cluster
> > >
> > > We are looking to expand our deployment to other datacenters, and I'm
> > > trying to determine the best way to go about it.  The documentation is
> > > a bit lacking for multi-site deployments.
> > >
> > > Our goal for the multi-site deployment is to have a zone for each site
> > > (e.g., US East, US West, Europe) that our customers can use to deploy
> > > instances in their preferred geographic area.
> > >
> > > Since we don't want to have different accounts for every datacenter, I
> > > don't think using Regions makes sense for us (and I'm not sure what
> > > they're actually good for without keeping accounts/users/domains in
> > > sync).
> > >
> > > Right now I'm thinking our setup will be as follows:
> > > - Firewall, Management Server, NFS staging server, primary storage, and
> > > Hypervisors in each datacenter.
> > > - All Management servers will be on the same management network.
> > > - Management servers will be connected via site-to-site VPN links over
> > > WAN.
> > > - MySQL replication (Percona?) will be set up on the management
> > > servers, with an odd number of servers to protect against split brain;
> > > redundant database backups will also be kept (rough sketch after this
> > > list).
> > > - One region (default)
> > > - One zone for each datacenter
> > > - Geo-enabled DNS to direct customers to the nearest Management server
> > > - Object storage for secondary storage across cloud.
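> > >
> > > For the replication piece, the classic multi-master my.cnf settings
> > > would be roughly the following on each node (server IDs and offsets
> > > are placeholders; something like Percona XtraDB Cluster would replace
> > > most of this with synchronous replication):
> > >
> > >     [mysqld]
> > >     server-id=1                  # unique per management server
> > >     log-bin=mysql-bin
> > >     log-slave-updates
> > >     auto_increment_increment=3   # total number of masters
> > >     auto_increment_offset=1      # this node's position (1-3)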
> > >
> > > My primary concerns with this setup are:
> > > - I haven't really seen multi-site deployment details anywhere.
> > > - Potential for split-brain.
> > > - How will HA be handled (e.g., if a VPN link goes down and one of the
> > > remote management servers can't contact a host, will it try to initiate
> > > HA?) - This sort of goes along with the split brain problem.
> > >
> > > Are my assumptions here sound, or is there a standard recommended way
> > > of doing multi-site deployments?
> > >
> > > Any suggestions are much appreciated.
> > >
> >
>
