On Thu, 25 Sep 2014 08:49:12 -0400 Sean Dague <s...@dague.net> wrote:
> > #1 - tried to get a lock, but someone else has it. Then we know we've > got lock contention. . > #2 - something is still holding a lock after some "long" amount of > time. +1 to both. > #2 turned out to be a critical bit in understanding one of the worst > recent gate impacting issues. > > You can write a tool today that analyzes the logs and shows you these > things. However, I wonder if we could actually do something creative > in the code itself to do this already. I'm curious if the creative > use of Timers might let us emit log messages under the conditions > above (someone with better understanding of python internals needs to > speak up here). Maybe it's too much overhead, but I think it's worth > at least asking the question. Even a simple log at the end when its finally released if it has been held for a long time would be handy. As matching up acquire/release by eye is not easy. I don't think we get a log message when an acquire is attempted but fails. That might help get a measure of lock contention? > The same issue exists when it comes to processutils I think, warning > that a command is still running after 10s might be really handy, > because it turns out that issue #2 was caused by this, and it took > quite a bit of decoding to figure that out. Also I think that perhaps a log message when a period task takes longer than the interval that the task is meant to be run would be a useful warning sign that something odd is going on. Chris _______________________________________________ OpenStack-dev mailing list OpenStack-dev@lists.openstack.org http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev