On Tue, Oct 30, 2018 at 12:25 PM Clark Boylan <cboy...@sapwetik.org> wrote:
>
> On Tue, Oct 30, 2018, at 10:42 AM, Alex Schultz wrote:
> > On Tue, Oct 30, 2018 at 11:36 AM Ben Nemec <openst...@nemebean.com> wrote:
> > >
> > > Tagging with tripleo since my suggestion below is specific to that project.
> > >
> > > On 10/30/18 11:03 AM, Clark Boylan wrote:
> > > > Hello everyone,
> > > >
> > > > A little while back I sent email explaining how the gate queues work and how fixing bugs helps us test and merge more code. All of this is still true, and we should keep pushing to improve our testing to avoid gate resets.
> > > >
> > > > Last week we migrated Zuul and Nodepool to a new Zookeeper cluster. In the process of doing this we had to restart Zuul, which brought in a new logging feature that exposes node resource usage by jobs. Using this data I've been able to generate some report information on where our node demand is going. This change [0] produces this report [1].
> > > >
> > > > As with optimizing software, we want to identify which changes will have the biggest impact and to be able to measure whether or not changes have had an impact once we have made them. Hopefully this information is a start at doing that. Currently we can only look back to the point Zuul was restarted, but we have a thirty day log rotation for this service and should be able to look at a month's worth of data going forward.
> > > >
> > > > Looking at the data you might notice that Tripleo is using many more node resources than our other projects. They are aware of this and have a plan [2] to reduce their resource consumption. We'll likely be using this report generator to check progress of this plan over time.
> > >
> > > I know at one point we had discussed reducing the concurrency of the tripleo gate to help with this. Since tripleo is still using >50% of the resources, it seems like maybe we should revisit that, at least for the short term until the more major changes can be made? Looking through the merge history for tripleo projects I don't see a lot of cases (any, in fact) where more than a dozen patches made it through anyway*, so I suspect it wouldn't have a significant impact on gate throughput, but it would free up quite a few nodes for other uses.
> >
> > It's the failures in gate and the resets. At this point I think it would be a good idea to turn down the concurrency of the tripleo queue in the gate if possible. As of late it's been timeouts, but we've been unable to track down why it's timing out specifically. I personally have a feeling it's the container download times, since we do not have a local registry available and are only able to leverage the mirrors for some level of caching. Unfortunately we don't get the best information about this out of docker (or the mirrors), and it's really hard to determine what exactly makes things run a bit slower.
>
> We actually tried this not too long ago
> https://git.openstack.org/cgit/openstack-infra/project-config/commit/?id=22d98f7aab0fb23849f715a8796384cffa84600b
> but decided to revert it because it didn't decrease the check queue backlog significantly. We were still running at several hours behind most of the time.
>
> If we want to set up better monitoring and measuring and try it again we can do that.
> But we probably want to measure queue sizes with and without a change like that, to better understand whether it helps.
>
> As for container image download times, can we quantify that via docker logs? Basically, sum up the amount of time spent by a job downloading images so that we can see what the impact is, and also measure whether changes improve it. As for other ideas for improving things, it seems like many of the images that tripleo uses are quite large. I recall seeing a >600MB image just for rsyslog. Wouldn't it be advantageous for both the gate and tripleo in the real world to trim the size of those images (which should improve download times)? In any case, quantifying the size of the downloads and trimming them where possible is likely also worthwhile.
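On the measurement side: a cheap way to compare the backlog with and without a window change would be to periodically sample the Zuul status endpoint and log per-pipeline queue depths. Something like this untested sketch would probably be enough (the exact URL and the pipelines/change_queues/heads layout of the status JSON are assumptions on my part):

#!/usr/bin/env python3
# Untested sketch: sample Zuul's status endpoint every few minutes and log
# per-pipeline queue depths, so backlog can be compared before and after a
# gate window change. The URL and the pipelines -> change_queues -> heads
# layout of the status JSON are assumptions.
import json
import time
import urllib.request

STATUS_URL = 'https://zuul.openstack.org/api/status'  # assumed endpoint

def pipeline_depths():
    with urllib.request.urlopen(STATUS_URL) as resp:
        status = json.load(resp)
    depths = {}
    for pipeline in status.get('pipelines', []):
        total = 0
        for queue in pipeline.get('change_queues', []):
            for head in queue.get('heads', []):
                total += len(head)  # each head is a list of queued changes
        depths[pipeline['name']] = total
    return depths

if __name__ == '__main__':
    while True:
        depths = pipeline_depths()
        print('%s check=%d gate=%d' % (time.strftime('%Y-%m-%dT%H:%M:%S'),
                                       depths.get('check', 0),
                                       depths.get('gate', 0)))
        time.sleep(300)

Running that across a window change and its revert would at least give us actual numbers to compare rather than gut feeling.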
On the download side, it's not that simple: we don't just download all the images in a distinct task, and there isn't any information provided around size/speed AFAIK. Additionally, we aren't doing anything special with the images (it's mostly kolla-built containers with a handful of tweaks), so that's just the size of the containers. I am currently working on reducing any tripleo-specific dependencies (i.e. removal of instack-undercloud, etc.) in hopes that we'll shave off some of the dependencies, but it seems there's a larger (bloat) issue around containers in general. I have no idea why the rsyslog container would be 600M, but yeah, that does seem excessive; a per-layer breakdown (sketch below) would at least show where it's going.

> Clark
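For the image bloat question, listing an image's layers by size shows which build steps are responsible. Rough, untested sketch, assuming the docker CLI is available on a host that has already pulled the image (pass whatever name/tag the image actually uses):

#!/usr/bin/env python3
# Untested sketch: print an image's layers sorted by size, to see where the
# bulk of something like the 600M rsyslog image comes from. Assumes the
# docker CLI is installed and the image has already been pulled locally.
import subprocess
import sys

def layer_sizes(image):
    out = subprocess.check_output(
        ['docker', 'history', '--no-trunc', '--human=false',
         '--format', '{{.Size}}\t{{.CreatedBy}}', image],
        text=True)
    layers = []
    for line in out.splitlines():
        size, _, created_by = line.partition('\t')
        layers.append((int(size), created_by.strip()))
    return sorted(layers, reverse=True)

if __name__ == '__main__':
    if len(sys.argv) != 2:
        sys.exit('usage: %s <image:tag>' % sys.argv[0])
    for size, created_by in layer_sizes(sys.argv[1]):
        print('%8.1f MB  %s' % (size / 1e6, created_by[:100]))

If one or two layers dominate (a fat base image, build dependencies that never get cleaned up, etc.), that's probably the place to start trimming, whether in kolla or in our overrides.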