Logged as a high-priority bug: https://bugs.launchpad.net/oslo.concurrency/+bug/1374075
On Thu, Sep 25, 2014 at 1:57 PM, Jay Pipes <jaypi...@gmail.com> wrote:
> +1 for making those two changes. I also have been frustrated doing
> debugging in the gate recently, and any operational-ease-of-debugging
> things like this would be appreciated.
>
> -jay
>
> On 09/25/2014 08:49 AM, Sean Dague wrote:
>>
>> Spending a ton of time reading logs, oslo locking ends up basically
>> creating a ton of output at DEBUG that you have to mentally filter to
>> find problems:
>>
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Created new semaphore "iptables" internal_lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:206
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Acquired semaphore "iptables" lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:229
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Attempting to grab external lock "iptables" external_lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:178
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Got file lock "/opt/stack/data/nova/nova-iptables" acquire
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:93
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Got semaphore / lock "_do_refresh_provider_fw_rules" inner
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:271
>> 2014-09-24 18:44:49.244 DEBUG nova.compute.manager
>> [req-b91cb1c1-f211-43ef-9714-651eeb3b2302
>> DeleteServersAdminTestXML-1408641898
>> DeleteServersAdminTestXML-469708524] [instance:
>> 98eb8e6e-088b-4dda-ada5-7b2b79f62506] terminating bdm
>> BlockDeviceMapping(boot_index=0,connection_info=None,created_at=2014-09-24T18:44:42Z,delete_on_termination=True,deleted=False,deleted_at=None,destination_type='local',device_name='/dev/vda',device_type='disk',disk_bus=None,guest_format=None,id=43,image_id='262ab8a2-0790-49b3-a8d3-e8ed73e3ed71',instance=<?>,instance_uuid=98eb8e6e-088b-4dda-ada5-7b2b79f62506,no_device=False,snapshot_id=None,source_type='image',updated_at=2014-09-24T18:44:42Z,volume_id=None,volume_size=None)
>> _cleanup_volumes /opt/stack/new/nova/nova/compute/manager.py:2407
>> 2014-09-24 18:44:49.248 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Released file lock "/opt/stack/data/nova/nova-iptables" release
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:115
>> 2014-09-24 18:44:49.248 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Releasing semaphore "iptables" lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:238
>> 2014-09-24 18:44:49.249 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Semaphore / lock released "_do_refresh_provider_fw_rules" inner
>>
>> Also readable here:
>>
>> http://logs.openstack.org/01/123801/1/check/check-tempest-dsvm-full/b5f8b37/logs/screen-n-cpu.txt.gz#_2014-09-24_18_44_49_240
>>
>> (Yes, it's kind of ugly)
>>
>> What occurred to me is that in debugging locking issues, what we
>> actually care about is 2 things semantically:
>>
>> #1 - tried to get a lock, but someone else has it. Then we know we've
>> got lock contention.
>> #2 - something is still holding a lock after some "long" amount of time.
>>
>> #2 turned out to be a critical bit in understanding one of the worst
>> recent gate-impacting issues.
>>
>> You can write a tool today that analyzes the logs and shows you these
>> things. However, I wonder if we could actually do something creative in
>> the code itself to do this already. I'm curious whether the creative use
>> of Timers might let us emit log messages under the conditions above
>> (someone with a better understanding of Python internals needs to speak
>> up here). Maybe it's too much overhead, but I think it's worth at least
>> asking the question.
>>
>> The same issue exists when it comes to processutils, I think: warning
>> that a command is still running after 10s might be really handy, because
>> it turns out that issue #2 was caused by this, and it took quite a bit
>> of decoding to figure that out.
>>
>> -Sean
>>
>
> _______________________________________________
> OpenStack-dev mailing list
> OpenStack-dev@lists.openstack.org
> http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

--
Davanum Srinivas :: https://twitter.com/dims

_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev