On Tue, Oct 6, 2015 at 4:22 PM, Vladimir Kuklin <vkuk...@mirantis.com> wrote:
> Eugene,
>
> For example, each time that you need to have one instance (e.g. a master
> instance) of something non-stateless running in the cluster.

Right. This is theoretical; in practice, there are no such services among
OpenStack components.

> You are right that currently lots of things are fixed already - heat engine
> is fine, for example. But I still see this issue with l3 agents, and I will
> not change my mind until we conduct complete scale and destructive testing
> with the new Neutron code.
>
> Secondly, if we cannot reliably identify when to engage, then we need to
> write the code that will tell us when to engage. If this code is already in
> place and we can trigger a couple of commands to figure out Neutron agent
> state, then we can add them to the OCF script's monitor action, and that is
> all. I agree that we have some issues with our OCF scripts, for example
> some suboptimal cleanup code that has issues at big scale, but I am almost
> sure we can fix it.
>
> Finally, let me show an example of when you need a centralized cluster
> manager to handle such situations: you have a temporary issue with
> connectivity to the Neutron server over the management network for some
> reason. Your agents are not cleaned up, and the Neutron server starts new
> l3 agent instances on a different node. In this case you will have IP
> duplication in the network and will bring down the whole cluster, as
> connectivity through the 'public' network will be working just fine. When
> we are using Pacemaker, such a node will either be fenced or will stop all
> the services controlled by Pacemaker, since it is part of a non-quorate
> partition of the cluster. When this happens, the l3 agent OCF script will
> run its cleanup section and purge all the stale IPs, thus saving us from
> the trouble. I may obviously be mistaken, so please correct me if this is
> not the case.

I think this deserves discussion in a separate thread, which I'll start soon.
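Vladimir's suggestion above, wiring agent-state checks into the OCF monitor action, could be sketched roughly as follows. This is a minimal illustration only, not the actual fuel-library resource agent: the pidfile path and the bare liveness check are assumptions, and the real script does considerably more (AMQP checks, namespace cleanup on stop).

```shell
#!/bin/sh
# Rough sketch of an OCF-style monitor action for neutron-l3-agent.
# Assumptions: pidfile location and the simple pid-liveness check; the
# real fuel-library OCF script is far more involved.

OCF_SUCCESS=0
OCF_NOT_RUNNING=7

PIDFILE="${PIDFILE:-/var/run/neutron/neutron-l3-agent.pid}"

l3_agent_monitor() {
    # The agent counts as running only if the pidfile exists and the
    # recorded process is still alive.
    [ -f "$PIDFILE" ] || return "$OCF_NOT_RUNNING"
    pid=$(cat "$PIDFILE")
    if kill -0 "$pid" 2>/dev/null; then
        return "$OCF_SUCCESS"
    fi
    return "$OCF_NOT_RUNNING"
}
```

Pacemaker would invoke monitor periodically; on OCF_NOT_RUNNING it can then run the stop/cleanup action (e.g. purging stale IPs from the router namespaces) before restarting the agent elsewhere, which is exactly the fencing-plus-cleanup scenario described above.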
My initial point was, to state it clearly, that I will be -2 on any new
additions of OpenStack services to the Pacemaker kingdom.

Thanks,
Eugene.

> On Tue, Oct 6, 2015 at 3:46 PM, Eugene Nikanorov <enikano...@mirantis.com>
> wrote:
>
>>> 2) I think you misunderstand what the difference is between
>>> upstart/systemd and Pacemaker in this case. There are many cases when
>>> you need to have a synchronized view of the cluster. Otherwise you will
>>> hit split-brain situations and have your cluster malfunctioning. Until
>>> OpenStack provides us with such means, there is no other way than using
>>> Pacemaker/ZooKeeper/etc.
>>
>> Could you please give some examples of those 'many cases' for OpenStack
>> specifically?
>> As for my 'misunderstanding': OpenStack services only need to be always
>> up, nothing more than that.
>> Upstart does a perfect job there.
>>
>>> 3) Regarding Neutron agents - we discussed it many times - you need to
>>> be able to control and clean up stuff after some service crashed.
>>> Currently, Neutron does not provide reliable ways to do it. If your
>>> agent dies and does not clean up IP addresses from the network
>>> namespace, you will get into a situation of ARP duplication, which is a
>>> kind of the split brain described in item #2. I personally, as a system
>>> architect and administrator, do not believe this will change in
>>> OpenStack for at least several years, so we will be using Pacemaker for
>>> a very long period of time.
>>
>> This has been changed already, and a while ago.
>> The OCF infrastructure around Neutron agents has never helped Neutron in
>> any meaningful way and is just an artifact from the dark past.
>> The reason is that pacemaker/OCF doesn't have enough intelligence to know
>> when to engage; as a result, any cleanup could only be achieved through
>> manual operations. I don't need to remind you how many bugs there were in
>> the OCF scripts, which brought whole clusters down after those manual
>> operations.
>> So it's just much better to go with simple standard tools with
>> fine-grained control.
>> The same applies to any other OpenStack service (again, not
>> rabbitmq/galera).
>>
>>> so we will be using Pacemaker for a very long period of time.
>>
>> Not for Neutron, sorry. As soon as we finish the last bit of such
>> cleanup, which is targeted for 8.0.
>>
>>> Now, back to the topic - we may decide to use some more sophisticated
>>> integral node health attribute which can be used with Pacemaker, as well
>>> as to put a node into some kind of maintenance mode. We can leverage the
>>> User Maintenance Mode feature here, or just simply stop particular
>>> services and disable particular haproxy backends.
>>
>> I think this kind of attribute, although being analyzed by pacemaker/OCF,
>> doesn't require any new OS service to be put under Pacemaker control.
>>
>> Thanks,
>> Eugene.
>>
>>> On Mon, Oct 5, 2015 at 11:57 PM, Eugene Nikanorov
>>> <enikano...@mirantis.com> wrote:
>>>
>>>>> Mirantis controls neither RabbitMQ nor Galera. Mirantis cannot assure
>>>>> their quality either.
>>>>
>>>> Correct, and rabbitmq has always been the pain in the back, preventing
>>>> any *real* enterprise usage of OpenStack where reliability does matter.
>>>>
>>>>>> 2) it has terrible UX
>>>>>
>>>>> That looks like a personal opinion. I'd like to see surveys or
>>>>> operator feedback. Also, this statement is not constructive, as it
>>>>> doesn't offer alternative solutions.
>>>>
>>>> The solution is to get rid of terrible UX wherever possible (I'm not
>>>> saying it is always possible, of course); upstart is just so much
>>>> better.
>>>> And yes, this is my personal opinion and a summary of the escalation
>>>> team's experience.
>>>>
>>>>>> 3) it is not reliable
>>>>>
>>>>> I would say OpenStack services are not HA-reliable, so OCF scripts are
>>>>> the operators' reaction to these problems.
>>>>> Many of them have childish issues from release to release; operators
>>>>> wrote OCF scripts to fix these problems. A lot of OpenStack services
>>>>> are stateful, so they require some kind of stickiness or
>>>>> synchronization. OpenStack services don't have simple health-check
>>>>> functionality, so it's hard to tell whether a service is running well
>>>>> or not. SIGHUP is still a problem for many OpenStack services, etc.
>>>>> So, let's be constructive here.
>>>>
>>>> Well, I prefer to be responsible for what I know and maintain. Thus, I
>>>> state that Neutron doesn't need to be managed by Pacemaker - neither
>>>> the server nor any kind of agent - and that's the path the Neutron
>>>> team will be taking.
>>>>
>>>> Thanks,
>>>> Eugene.
>>>>
>>>>>> I disagree with #1, as I do not agree that should be a criterion for
>>>>>> an open-source project. Considering Pacemaker is at the core of our
>>>>>> controller setup, I would argue that if these claims are in fact
>>>>>> true, we need to be using something else. I would agree that it is a
>>>>>> terrible UX, but all the clustering software I've used falls into
>>>>>> this category. I'd like more information on how it is not reliable.
>>>>>> Do we have numbers to back up these claims?
>>>>>>
>>>>>>> (3) is not an evaluation of the project itself, but just a logical
>>>>>>> consequence of (1) and (2).
>>>>>>> As part of the escalation team, I can say that it has cost our team
>>>>>>> thousands of man-hours of head-scratching, staring at pacemaker
>>>>>>> logs whose value is usually slightly below zero.
>>>>>>>
>>>>>>> Most OpenStack services (in fact, ALL API servers) are stateless;
>>>>>>> they don't require any cluster management (also, they don't need to
>>>>>>> be moved in case of lack of space).
>>>>>>> Stateful services like Neutron agents have their state as a
>>>>>>> function of DB state and are able to synchronize it with the server
>>>>>>> without external "help".
>>>>>>
>>>>>> So it's not an issue with moving services so much as being able to
>>>>>> stop the services when a condition is met. Have we tested all OS
>>>>>> services to ensure they function 100% when out of disk space? I
>>>>>> would assume that glance might have issues with image uploads if
>>>>>> there is no space to handle a request.
>>>>>>
>>>>>>> So now the usage of pacemaker can only be justified for cases where
>>>>>>> a service's clustering mechanism requires active monitoring
>>>>>>> (rabbitmq, galera).
>>>>>>> But even there, examples where we are better off without pacemaker
>>>>>>> are all around.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Eugene.
>>>>>>
>>>>>> After I sent this email, I had further discussions around the issues
>>>>>> that I'm facing, and they may not be completely related to disk
>>>>>> space. I think we might be relying on the expectation that the local
>>>>>> rabbitmq is always available, but I need to look into that. Either
>>>>>> way, I believe we should continue to discuss this issue, as we are
>>>>>> managing services in multiple ways on a single host. Additionally, I
>>>>>> do not believe that we really perform quality health checks on our
>>>>>> services.
>>>>>>
>>>>>> Thanks,
>>>>>> -Alex
>>>>>>
>>>>>>> On Mon, Oct 5, 2015 at 1:34 PM, Sergey Vasilenko
>>>>>>> <svasile...@mirantis.com> wrote:
>>>>>>>
>>>>>>>> On Mon, Oct 5, 2015 at 12:22 PM, Eugene Nikanorov
>>>>>>>> <enikano...@mirantis.com> wrote:
>>>>>>>>
>>>>>>>>> No pacemaker for OS services, please.
>>>>>>>>> We'll be moving Neutron agents out from under Pacemaker control
>>>>>>>>> in 8.0; other OS services don't need it either.
>>>>>>>>
>>>>>>>> Could you please provide your arguments?
>>>>>>>> /sv
>>>
>>> --
>>> Yours Faithfully,
>>> Vladimir Kuklin,
>>> Fuel Library Tech Lead,
>>> Mirantis, Inc.
>>> +7 (495) 640-49-04
>>> +7 (926) 702-39-68
>>> Skype kuklinvv
>>> 35bk3, Vorontsovskaya Str.
>>> Moscow, Russia,
>>> www.mirantis.com <http://www.mirantis.ru/>
>>> www.mirantis.ru
>>> vkuk...@mirantis.com
__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev