On Fri, Jun 9, 2017 at 5:25 AM, Dmitry Tantsur <[email protected]> wrote:
> This number of "300", does it come from your testing or from other sources?
> If the former, which driver were you using? What exactly problems have you
> seen approaching this number?

I haven't encountered this issue personally, but from talking to Joe Talerico and some operators at the summit, this is roughly the number at which a single conductor begins to fall behind on polling the out-of-band interfaces of the machines it's responsible for. You start to see what you would expect from polling running behind: incorrect power states listed for machines and a general inability to perform machine operations in a timely manner.

Having spent some time at the Ironic operators forum, I gather this is pretty normal and the correct response is just to scale out conductors. That is a problem for TripleO because we don't really have a scale-out option with a single-machine design. Fortunately, just increasing the time between interface polls acts as a pretty good stopgap and lets Ironic catch up.

I may get some time on a cloud of that scale in the future, at which point I will have hard numbers to give you. One of the reasons I made YODA was the frustrating prevalence of anecdotes instead of hard data when it came to one of the most important parts of the user experience. If it doesn't deploy, people don't use it, full stop.

> Could you please elaborate? (a bug could also help). What exactly were you
> doing?

https://bugs.launchpad.net/ironic/+bug/1680725

That bug describes exactly what I'm experiencing. Essentially, nodes can and do fail to PXE boot, then cleaning fails and you just lose the nodes. Users have to spend time going back and babysitting these nodes, and there are no good instructions on what to do with failed nodes anyway. The answer is to move them to manageable and then to available, at which point they go back into cleaning, and to repeat that until it finally works. As with introspection a year ago, this is a cavalcade of documentation problems and software issues.
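
For anyone stuck babysitting failed nodes, the manageable-then-available cycle described above can be sketched roughly as follows. This is only an illustration, assuming the standard `openstack baremetal node manage` and `openstack baremetal node provide` commands are installed. The helper names `recovery_commands` and `recover` are hypothetical, and a node that failed cleaning may need further steps (for example, clearing maintenance mode) that are not shown here.

```python
import subprocess


def recovery_commands(node):
    """Return the CLI calls that send one node back through cleaning.

    `node` is a node name or UUID. This helper is a hypothetical
    illustration, not part of Ironic or TripleO.
    """
    return [
        # Move the failed node back to the "manageable" state.
        ["openstack", "baremetal", "node", "manage", node],
        # Ask for "available"; the node goes through cleaning on the way.
        ["openstack", "baremetal", "node", "provide", node],
    ]


def recover(node, run=subprocess.run):
    """Run the recovery commands in order, stopping at the first failure.

    `run` is injectable so the sequence can be exercised without a cloud.
    """
    for cmd in recovery_commands(node):
        run(cmd, check=True)
```

If cleaning fails again, the node ends up back where it started and the same two steps have to be repeated, which is exactly the babysitting problem.
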
I mean, really, everything *works* technically, but both the documentation and the software act as if cleaning will work every time, leaving the user to figure out how to accommodate the realities of the situation without so much as a warning that a failure might happen. This comes across as more of a UX issue than a software one, but we can't just ignore it.

- Justin

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: [email protected]?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
