Some (new?) data on the oom kill issue in the gate [0].

I filed a new bug and an E-R query for the issue [1][2], since it looks to
me like the issue is not specific to mysqld: oom-kill just picks the best
candidate, which in most cases happens to be mysqld. The next most likely
candidate to show errors in the logs is keystone, since token requests are
rather frequent, probably more than any other API call.
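
For context, oom-kill ranks processes by the kernel's badness score
(visible as /proc/<pid>/oom_score) and kills the highest-scoring one. A
minimal sketch along these lines (assuming a Linux node and enough
permission to read /proc) shows that ordering, and why the largest
resident process - usually mysqld on a devstack node - gets picked:

    # Sketch only: list processes by the kernel's OOM badness score,
    # which is what oom-kill uses to choose its victim.
    import os

    def oom_candidates(top_n=5):
        scores = []
        for pid in filter(str.isdigit, os.listdir('/proc')):
            try:
                with open('/proc/%s/oom_score' % pid) as f:
                    score = int(f.read())
                with open('/proc/%s/comm' % pid) as f:
                    name = f.read().strip()
            except (IOError, ValueError):
                continue  # process exited between listdir() and open()
            scores.append((score, pid, name))
        # The highest score is the first victim; on a devstack node that
        # is usually mysqld, simply because of its large resident set.
        return sorted(scores, reverse=True)[:top_n]

    if __name__ == '__main__':
        for score, pid, name in oom_candidates():
            print('%6d %6s %s' % (score, pid, name))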

According to logstash [3], all failures identified by [2] happen on RAX
nodes, which I hadn't realised before.

Comparing dstat data between the failed run and a successful one on an OVH
node [4], the main difference I can spot is free memory.
For the same test job, free memory on the RAX node tends to be much lower,
quite close to zero for most of the run. My guess is that an unlucky
scheduling of tests may cause a slightly higher peak in memory usage and
trigger the oom-kill.
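
In case it helps with comparing runs, here is a rough sketch for pulling
the free-memory column out of those dstat CSV logs (assuming the layout
devstack's dstat produces: a few metadata rows, then a column-header row
containing "free", then data rows with values in bytes):

    # Rough sketch: summarise the "free" memory column of a dstat CSV
    # so two runs can be compared side by side.
    import csv
    import sys

    def free_memory_stats(path):
        with open(path) as f:
            rows = list(csv.reader(f))
        for i, row in enumerate(rows):
            if 'free' in row:          # the column-header row
                header, data = row, rows[i + 1:]
                break
        else:
            raise ValueError('no "free" column in %s' % path)
        idx = header.index('free')
        free = [float(r[idx]) for r in data if len(r) > idx and r[idx]]
        mib = 1024.0 * 1024.0
        return min(free) / mib, sum(free) / len(free) / mib

    if __name__ == '__main__':
        for path in sys.argv[1:]:
            low, avg = free_memory_stats(path)
            print('%s: min free %.0f MiB, mean %.0f MiB' % (path, low, avg))

Running it against the gunzipped dstat-csv_log.txt from a RAX run and an
OVH run should show the gap in free memory directly.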

I find it hard to relate lower free memory to a specific cloud provider /
underlying virtualisation technology, but maybe someone has an idea about
how that could be?

Andrea

[0] http://logs.openstack.org/93/432793/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/6f31320/logs/syslog.txt.gz#_Feb_14_00_32_28
[1] https://bugs.launchpad.net/tempest/+bug/1664953
[2] https://review.openstack.org/434238
[3] http://logstash.openstack.org/#dashboard/file/logstash.json?query=message%3A%5C%22Out%20of%20memory%3A%20Kill%20process%5C%22%20AND%20tags%3A%5C%22syslog.txt%5C%22
[4] http://logs.openstack.org/93/432793/1/check/gate-tempest-dsvm-neutron-full-ubuntu-xenial/1dfb4b7/logs/dstat-csv_log.txt.gz


On Mon, Feb 6, 2017 at 10:13 AM Miguel Angel Ajo Pelayo <majop...@redhat.com>
wrote:

Jeremy Stanley wrote:


> It's an option of last resort, I think. The next consistent flavor
> up in most of the providers donating resources is double the one
> we're using (which is a fairly typical pattern in public clouds). As
> aggregate memory constraints are our primary quota limit, this would
> effectively halve our current job capacity.

Properly coordinated with all the cloud providers, they could create
flavours which are private but available to our tenants, where 25-50% more
RAM would be just enough.

I agree that should probably be a last resort tool, and we should keep
looking for proper ways to find where we consume unnecessary RAM and make
sure that's properly freed up.

It could be interesting to coordinate such flavour creation in the
meantime: even if we don't use it now, we could eventually test it or put
it to work if we find ourselves trapped later.
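
To make that concrete, on the provider side it would be roughly the sketch
below (python-novaclient; the flavour name, sizes, endpoint and tenant id
are all placeholders, not anything agreed with anyone):

    # Sketch of the provider-side step: a private flavour with ~25% more
    # RAM, then access granted only to our CI tenant.
    from keystoneauth1 import loading, session
    from novaclient import client

    loader = loading.get_plugin_loader('password')
    auth = loader.load_from_options(
        auth_url='https://keystone.example.com/v3',   # placeholder
        username='admin', password='secret', project_name='admin',
        user_domain_name='Default', project_domain_name='Default')
    nova = client.Client('2.1', session=session.Session(auth=auth))

    # e.g. today's 8 GB flavour bumped to 10 GB, same vCPUs and disk.
    flavor = nova.flavors.create(name='infra-ci-10g', ram=10240,
                                 vcpus=8, disk=80, is_public=False)

    # A private flavour stays invisible until access is granted to the
    # tenant nodepool boots nodes in.
    nova.flavor_access.add_tenant_access(flavor, '<our-tenant-id>')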


On Sun, Feb 5, 2017 at 8:37 PM, Matt Riedemann <mriede...@gmail.com> wrote:

On 2/5/2017 1:19 PM, Clint Byrum wrote:


Also I wonder if there's ever been any serious consideration given to
switching to protobuf? Feels like one could make oslo.versionedobjects
a wrapper around protobuf relatively easily, but perhaps that's already
been explored in a forum that I wasn't paying attention to.
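
Just to make the idea concrete, the serialisation half could look roughly
like the sketch below, using protobuf's generic Struct. This is not the
oslo.versionedobjects API and it ignores everything o.vo actually does
(version negotiation, backports, typed fields); it only shows what putting
a versioned-object-style payload on the wire via protobuf might look like:

    # Sketch only: round-trip a versioned-object-style payload through
    # protobuf's generic Struct type.
    from google.protobuf.struct_pb2 import Struct
    from google.protobuf.json_format import MessageToDict

    payload = {
        'versioned_object.name': 'Instance',   # made-up example payload
        'versioned_object.version': '2.1',
        'versioned_object.data': {'uuid': 'abc-123', 'vm_state': 'active'},
    }

    msg = Struct()
    msg.update(payload)                # dict -> protobuf
    wire = msg.SerializeToString()     # compact binary for the RPC layer

    decoded = Struct()
    decoded.ParseFromString(wire)
    print(MessageToDict(decoded))      # back to a plain dict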


I've never heard of anyone attempting that.

-- 

Thanks,

Matt Riedemann

