I agree that the HA should be hidden from the user/tenant. IMHO a tenant should just use a load balancer as a "managed" black box, where the service is resilient in itself.
Our current Libra/LBaaS implementation in the HP public cloud uses a pool of standby LBs to replace failing tenant LBs. Our LBaaS service monitors itself and replaces LBs when they fail. This is done via a set of Admin API servers:
http://libra.readthedocs.org/en/latest/admin_api/index.html

The Admin server spawns several scheduled threads to run tasks such as building new devices for the pool, monitoring load balancer devices, and maintaining IP addresses:
http://libra.readthedocs.org/en/latest/pool_mgm/about.html

Susanne

On Thu, Apr 17, 2014 at 6:49 PM, Stephen Balukoff <sbaluk...@bluebox.net> wrote:

> Heyas, y'all!
>
> So, given both the prioritization and usage info on HA functionality for
> Neutron LBaaS here:
> https://docs.google.com/spreadsheet/ccc?key=0Ar1FuMFYRhgadDVXZ25NM2NfbGtLTkR0TDFNUWJQUWc&usp=sharing
>
> It's clear that:
>
> A. HA seems to be a top priority for most operators
> B. Almost all load balancer functionality deployed is done so in an
> Active/Standby HA configuration
>
> I know there's been some round-about discussion about this on the list in
> the past (which usually got stymied in "implementation details"
> disagreements), but it seems to me that with so many players putting a high
> priority on HA functionality, this is something we need to discuss and
> address.
>
> This is also apropos, as we're talking about doing a major revision of the
> API, and it probably makes sense to seriously consider if or how HA-related
> stuff should make it into the API. I'm of the opinion that almost all the
> HA stuff should be hidden from the user/tenant, but that the admin/operator
> at the very least is going to need to have some visibility into HA-related
> functionality. The hope here is to discover what things make sense to have
> as a "least common denominator" and what will have to be hidden behind a
> driver-specific implementation.
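[Inline aside: to illustrate the pool-manager pattern described above, here is a minimal sketch of a "build / monitor / replace" loop. All names, fields, and data structures are hypothetical, not the actual Libra code.]

```python
# Hypothetical sketch of a Libra-style pool manager: keep a pool of
# pre-built standby LB devices topped up, and swap a standby in
# whenever an active tenant device fails. Purely illustrative.

class PoolManager:
    def __init__(self, min_standby=5):
        self.min_standby = min_standby
        self.standby = []   # pre-built, idle LB devices
        self.active = {}    # tenant id -> LB device currently serving it

    def build_devices(self):
        # Scheduled task: keep the standby pool at its target size.
        while len(self.standby) < self.min_standby:
            self.standby.append({"id": object(), "state": "OFFLINE"})

    def monitor_devices(self):
        # Scheduled task: replace failed tenant devices from the pool.
        for tenant, device in list(self.active.items()):
            if device["state"] == "ERROR" and self.standby:
                self.active[tenant] = self.standby.pop()

    def run_forever(self, interval, stop_event):
        # The admin server would run tasks like these on timer threads
        # (stop_event is a threading.Event or similar).
        while not stop_event.is_set():
            self.build_devices()
            self.monitor_devices()
            stop_event.wait(interval)
```

A real implementation would also reclaim or rebuild the failed device and move its floating IP, which this sketch glosses over.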
>
> I certainly have a pretty good idea how HA stuff works at our
> organization, but I have almost no visibility into how this is done
> elsewhere, leastwise not enough detail to know what makes sense to write
> API controls for.
>
> So! Since gathering data about actual usage seems to have worked pretty
> well before, I'd like to try that again. Yes, I'm going to be asking about
> implementation details, but this is with the hope of discovering any "least
> common denominator" factors which make sense to build API around.
>
> For the purposes of this document, when I say "load balancer devices" I
> mean either physical or virtual appliances, or software executing on a host
> somewhere that actually does the load balancing. It need not directly
> correspond with anything physical... but probably does. :P
>
> And... all of these questions are meant to be interpreted from the
> perspective of the cloud operator.
>
> Here's what I'm looking to learn from those of you who are allowed to
> share this data:
>
> 1. Are your load balancer devices shared between customers / tenants, not
> shared, or some of both?
>
> 1a. If shared, what is your strategy to avoid or deal with collisions of
> customer rfc1918 address space on back-end networks? (For example, I know
> of no load balancer device that can balance traffic for both customer A and
> customer B if both are using the 10.0.0.0/24 subnet for their back-end
> networks containing the nodes to be balanced, unless an extra layer of
> NATing is happening somewhere.)
>
> 2. What kinds of metrics do you use in determining load balancing capacity?
>
> 3. Do you operate with a pool of unused load balancer device capacity
> (which a cloud OS would need to keep track of), or do you spin up new
> capacity (in the form of virtual servers, presumably) on the fly?
>
> 3a. If you're operating with an availability pool, can you describe how new
> load balancer devices are added to your availability pool?
> Specifically,
> are there any steps in the process that must be manually performed (ie. so
> no API could help with this)?
>
> 4. How are new devices 'registered' with the cloud OS? How are they
> removed or replaced?
>
> 5. What kind of visibility do you (or would you) allow your user base to
> see into the HA-related aspects of your load balancing services?
>
> 6. What kind of functionality and visibility do you need into the
> operations of your load balancer devices in order to maintain your
> services, troubleshoot, etc.? Specifically, are you managing the
> infrastructure outside the purview of the cloud OS? Are there certain
> aspects which would be easier to manage if done within the purview of the
> cloud OS?
>
> 7. What kind of network topology is used when deploying load balancing
> functionality? (ie. do your load balancer devices live inside or outside
> customer firewalls, directly on tenant networks? Are you using layer-3
> routing? etc.)
>
> 8. Is there any other data you can share which would be useful in
> considering features of the API that only cloud operators would be able to
> perform?
>
>
> And since we're one of these operators, here are my responses:
>
> 1. We have both shared load balancer devices and private load balancer
> devices.
>
> 1a. Our shared load balancers live outside customer firewalls, and we use
> IPv6 to reach individual servers behind the firewalls "directly." We have
> followed a careful deployment strategy across all our networks so that IPv6
> addresses between tenants do not overlap.
>
> 2.
> The most useful ones for us are "number of appliances deployed" and
> "number and type of load balancing services deployed", though we also pay
> attention to:
> * Load average per "active" appliance
> * Per appliance number and type of load balancing services deployed
> * Per appliance bandwidth consumption
> * Per appliance connections / sec
> * Per appliance SSL connections / sec
>
> Since our devices are software appliances running on Linux, we also track
> OS-level metrics, though these aren't used directly in the load
> balancing features in our cloud OS.
>
> 3. We operate with an availability pool that our current cloud OS pays
> attention to.
>
> 3a. Since the devices we use correspond to physical hardware, this must of
> course be rack-and-stacked by a datacenter technician, who also does
> initial configuration of these devices.
>
> 4. All of our load balancers are deployed in an active / standby
> configuration. Two machines which make up an active / standby pair are
> registered with the cloud OS as a single unit that we call a "load balancer
> cluster." Our availability pool consists of a whole bunch of these load
> balancer clusters. (The devices themselves are registered individually at
> the time the cluster object is created in our database.) There are a couple
> of manual steps in this process (currently handled by the datacenter techs
> who do the racking and stacking), but these could be automated via API. In
> fact, as we move to virtual appliances with these, we expect the entire
> process to become automated via API (first the cluster primitive is created,
> then "load balancer device objects" get attached to it, then the
> cluster gets added to our availability pool).
>
> Removal of a "cluster" object is handled by first evacuating any customer
> services off the cluster, then destroying the load balancer device objects,
> then the cluster object.
> Replacement of a single load balancer device
> entails removing the dead device, adding the new one, synchronizing
> configuration data to it, and starting services.
>
> 5. At the present time, all our load balancing services are deployed in an
> active / standby HA configuration, so the user has no choice or visibility
> into any HA details. As we move to Neutron LBaaS, we would like to give
> users the option of deploying non-HA load balancing capacity. Therefore,
> the only visibility we want the user to get is:
>
> * Choose whether a given load balancing service should be deployed in an
> HA configuration ("flavor" functionality could handle this)
> * See whether a running load balancing service is deployed in an HA
> configuration (and see the "hint" for which physical or virtual device(s)
> it's deployed on)
> * Give a "hint" as to which device(s) a new load balancing service should
> be deployed on (ie. for customers looking to deploy a bunch of test / QA /
> etc. environments on the same device(s) to reduce costs).
>
> Note that the "hint" above corresponds to the "load balancing cluster"
> alluded to above, not necessarily any specific physical or virtual device.
> This means we retain the ability to switch out the underlying hardware
> powering a given service at any time.
>
> Users may also see usage data, of course, but that's more of a generic
> stats / billing function (which doesn't have to do with HA at all, really).
>
> 6. We need to see the status of all our load balancing devices, including
> availability, current role (active or standby), and all the metrics listed
> under 2 above. Some of this data is used for creating trend graphs and
> business metrics, so being able to query the current metrics at any time
> via API is important. It would also be very handy to query specific device
> info (like the revision of software on it, etc.)
> Our current cloud OS does all
> this for us, and having Neutron LBaaS provide visibility into all of this
> as well would be ideal. We do almost no management of our load balancing
> services outside the purview of our current cloud OS.
>
> 7. Shared load balancers must live outside customer firewalls; private
> load balancers typically live within customer firewalls (sometimes in a
> DMZ). In any case, we use layer-3 routing (distributed using routing
> protocols on our core networking gear and static routes on customer
> firewalls) to route requests for "service IPs" to the "highly available
> routing IPs" which live on the load balancers themselves. (When a fail-over
> happens, at a low level, what's really going on is that the "highly
> available routing IPs" shift from the active to the standby load balancer.)
>
> We have contemplated using a layer-2 topology (ie. directly connected on
> the same vlan / broadcast domain) and are building a version of our
> appliance which can operate in this way, potentially reducing the reliance
> on layer-3 routes (and making things more friendly for the OpenStack
> environment, which we understand probably isn't ready for layer-3 routing
> just yet).
>
> 8. I wrote this survey, so none come to mind for me. :)
>
> Stephen
>
> --
> Stephen Balukoff
> Blue Box Group, LLC
> (800)613-4305 x807
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
>
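P.S. The fail-over mechanism Stephen describes in (7), where the "highly available routing IPs" shift from the active to the standby device, can be sketched as a toy state machine. All names here are hypothetical, and production pairs normally do this with VRRP/keepalived or similar rather than application code:

```python
# Toy sketch of active/standby fail-over: the "highly available routing
# IP" always sits on whichever device of the pair is healthy. Purely
# illustrative; real deployments typically use VRRP/keepalived.

class HAPair:
    def __init__(self, routing_ip, active, standby):
        self.routing_ip = routing_ip
        self.active = active      # device currently holding routing_ip
        self.standby = standby

    def check_and_failover(self, is_healthy):
        # is_healthy: callable(device) -> bool, e.g. a TCP health probe.
        if not is_healthy(self.active) and is_healthy(self.standby):
            # Fail over: routing_ip conceptually moves to the standby.
            self.active, self.standby = self.standby, self.active
        return self.active  # device now answering for routing_ip
```

Note the swap only happens when the standby itself is healthy, so a check that fails on both devices leaves the pair untouched.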
_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev