Some (new?) data on the oom-kill issue in the gate. I filed a new bug and an E-R query for the issue [1][2], since it looks to me like the issue is not specific to mysqld: oom-kill just picks the best candidate, which in most cases happens to be mysqld. The next most likely candidate to show errors in the logs is keystone, since token requests are rather frequent, probably more than any other API call.
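As background on that "best candidate" selection: on Linux the kernel exposes its per-process badness score in /proc/<pid>/oom_score (higher means more likely to be chosen by the OOM killer). A minimal sketch, not from the original thread, for listing the current top candidates on a node (Linux-only; returns an empty list elsewhere):

```python
from pathlib import Path

def top_oom_candidates(n=5):
    """Return the n processes with the highest /proc/<pid>/oom_score,
    i.e. the kernel's current preferred OOM-kill victims (Linux only)."""
    proc = Path("/proc")
    scores = []
    if proc.is_dir():
        for pid_dir in proc.glob("[0-9]*"):
            try:
                score = int((pid_dir / "oom_score").read_text())
                name = (pid_dir / "comm").read_text().strip()
            except (OSError, ValueError):
                continue  # process exited, or files unreadable
            scores.append((score, pid_dir.name, name))
    # Highest score first: the first entry is the likeliest victim.
    return sorted(scores, reverse=True)[:n]
```

On a loaded devstack node this would typically rank mysqld near the top, consistent with it showing up in the oom-kill logs.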
According to logstash [3], all failures identified by [2] happen on RAX nodes, which I hadn't realised before. Comparing dstat data between a failed run and a successful one on an OVH node [4], the main difference I can spot is free memory: for the same test job, free memory tends to be much lower on the RAX node, quite close to zero for the majority of the time. My guess is that an unlucky scheduling of tests may cause a slightly higher peak in memory usage and trigger the oom-kill. I find it hard to relate lower free memory to a specific cloud provider / underlying virtualisation technology, but maybe someone has an idea about how that could be?

Andrea

[0] http://logs.openstack.org/93/432793/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/6f31320/logs/syslog.txt.gz#_Feb_14_00_32_28
[1] https://bugs.launchpad.net/tempest/+bug/1664953
[2] https://review.openstack.org/434238
[3] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Out%20of%20memory%3A%20Kill%20process%5C%22%20AND%20tags%3A%5C%22syslog.txt%5C%22
[4] http://logs.openstack.org/93/432793/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/1dfb4b7/logs/dstat-csv_log.txt.gz

On Mon, Feb 6, 2017 at 10:13 AM Miguel Angel Ajo Pelayo <majop...@redhat.com> wrote:

Jeremy Stanley wrote:
> It's an option of last resort, I think. The next consistent flavor
> up in most of the providers donating resources is double the one
> we're using (which is a fairly typical pattern in public clouds). As
> aggregate memory constraints are our primary quota limit, this would
> effectively halve our current job capacity.

Properly coordinated with all the cloud providers, they could create flavours which are private but available to our tenants, where 25-50% more RAM would be just enough. I agree that should probably be a last-resort tool, and we should keep looking for proper ways to find where we consume unnecessary RAM and make sure that's properly freed up.
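The free-memory comparison described above can be done programmatically against the dstat CSV logs. A minimal sketch, assuming dstat's usual CSV layout (a few title rows, then a row of column names including a memory-usage "free" column, then numeric sample rows); the column name and layout are assumptions, not verified against the linked logs:

```python
import csv

def free_memory_series(path):
    """Extract the memory 'free' column from a dstat CSV log.

    Assumes a row of column names containing 'free', followed by
    numeric sample rows.
    """
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    for i, row in enumerate(rows):
        if "free" in row:
            col = row.index("free")
            data_start = i + 1
            break
    else:
        raise ValueError("no 'free' column found in %s" % path)
    series = []
    for row in rows[data_start:]:
        try:
            series.append(float(row[col]))
        except (IndexError, ValueError):
            continue  # skip partial or non-numeric rows
    return series

def summarize(path):
    """Return (minimum, mean) free memory over the run."""
    series = free_memory_series(path)
    return min(series), sum(series) / len(series)
```

Running summarize() over the RAX and OVH logs side by side would quantify the "close to zero for the majority of the time" observation.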
It could be interesting to coordinate such flavour creation in the meantime; even if we don't use it now, we could eventually test it, or put it to work if we find ourselves trapped anytime later.

On Sun, Feb 5, 2017 at 8:37 PM, Matt Riedemann <mriede...@gmail.com> wrote:

On 2/5/2017 1:19 PM, Clint Byrum wrote:
> Also I wonder if there's ever been any serious consideration given to
> switching to protobuf? Feels like one could make oslo.versionedobjects
> a wrapper around protobuf relatively easily, but perhaps that's already
> been explored in a forum that I wasn't paying attention to.

I've never heard of anyone attempting that.

--
Thanks,

Matt Riedemann

__________________________________________________________________________
OpenStack Development Mailing List (not for usage questions)
Unsubscribe: openstack-dev-requ...@lists.openstack.org?subject:unsubscribe
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev