I've been spending a ton of time reading logs, and oslo locking ends up creating a ton of output at DEBUG that you have to mentally filter to find problems:
2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290] Created new semaphore "iptables" internal_lock /opt/stack/new/nova/nova/openstack/common/lockutils.py:206
2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290] Acquired semaphore "iptables" lock /opt/stack/new/nova/nova/openstack/common/lockutils.py:229
2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290] Attempting to grab external lock "iptables" external_lock /opt/stack/new/nova/nova/openstack/common/lockutils.py:178
2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290] Got file lock "/opt/stack/data/nova/nova-iptables" acquire /opt/stack/new/nova/nova/openstack/common/lockutils.py:93
2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290] Got semaphore / lock "_do_refresh_provider_fw_rules" inner /opt/stack/new/nova/nova/openstack/common/lockutils.py:271
2014-09-24 18:44:49.244 DEBUG nova.compute.manager [req-b91cb1c1-f211-43ef-9714-651eeb3b2302 DeleteServersAdminTestXML-1408641898 DeleteServersAdminTestXML-469708524] [instance: 98eb8e6e-088b-4dda-ada5-7b2b79f62506] terminating bdm BlockDeviceMapping(boot_index=0,connection_info=None,created_at=2014-09-24T18:44:42Z,delete_on_termination=True,deleted=False,deleted_at=None,destination_type='local',device_name='/dev/vda',device_type='disk',disk_bus=None,guest_format=None,id=43,image_id='262ab8a2-0790-49b3-a8d3-e8ed73e3ed71',instance=<?>,instance_uuid=98eb8e6e-088b-4dda-ada5-7b2b79f62506,no_device=False,snapshot_id=None,source_type='image',updated_at=2014-09-24T18:44:42Z,volume_id=None,volume_size=None) _cleanup_volumes /opt/stack/new/nova/nova/compute/manager.py:2407
2014-09-24 18:44:49.248 DEBUG nova.openstack.common.lockutils [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290] Released file lock "/opt/stack/data/nova/nova-iptables" release /opt/stack/new/nova/nova/openstack/common/lockutils.py:115
2014-09-24 18:44:49.248 DEBUG nova.openstack.common.lockutils [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290] Releasing semaphore "iptables" lock /opt/stack/new/nova/nova/openstack/common/lockutils.py:238
2014-09-24 18:44:49.249 DEBUG nova.openstack.common.lockutils [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290] Semaphore / lock released "_do_refresh_provider_fw_rules" inner

Also readable here: http://logs.openstack.org/01/123801/1/check/check-tempest-dsvm-full/b5f8b37/logs/screen-n-cpu.txt.gz#_2014-09-24_18_44_49_240

(Yes, it's kind of ugly)

What occurred to me is that when debugging locking issues, there are really only two things we care about semantically:

#1 - we tried to get a lock, but someone else has it. Then we know we've got lock contention.

#2 - something is still holding a lock after some "long" amount of time.

#2 turned out to be a critical bit in understanding one of the worst recent gate-impacting issues.
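To make those two conditions concrete, here's a rough sketch of what emitting them from the locking code itself might look like. This is purely illustrative, not oslo.lockutils code; watched_lock, the message wording, and the 10-second threshold are all invented for the example:

    # Illustrative sketch only -- not oslo.lockutils. A lock wrapper that
    # logs the two conditions above and is otherwise silent.
    import contextlib
    import logging
    import threading
    import time

    LOG = logging.getLogger(__name__)
    HELD_TOO_LONG = 10.0  # seconds; arbitrary threshold for this sketch

    @contextlib.contextmanager
    def watched_lock(name, lock):
        # Condition #1: a failed non-blocking attempt means someone
        # else holds the lock, i.e. we have contention.
        if not lock.acquire(blocking=False):
            LOG.debug("lock contention: waiting for %s", name)
            start = time.time()
            lock.acquire()
            LOG.debug("lock %s acquired after %.2fs wait",
                      name, time.time() - start)

        # Condition #2: arm a timer that only fires if we are still
        # holding the lock when the threshold expires.
        timer = threading.Timer(
            HELD_TOO_LONG,
            lambda: LOG.warning("lock %s still held after %ss",
                                name, HELD_TOO_LONG))
        timer.start()
        try:
            yield
        finally:
            timer.cancel()  # no-op if it already fired
            lock.release()

Usage would just be wrapping whatever lock we take today, e.g. "with watched_lock("iptables", iptables_lock): ..." -- and nothing gets logged unless one of the two interesting conditions actually happens.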
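The same watchdog trick isn't specific to locks. Here is the equivalent hypothetical sketch for a spawned command (again, invented names and threshold, not any real processutils API), warning if the command is still running after a while without killing it:

    # Hypothetical sketch: warn about a long-running command.
    import logging
    import subprocess
    import threading

    LOG = logging.getLogger(__name__)

    def run_with_watchdog(cmd, warn_after=10.0):
        proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                                stderr=subprocess.PIPE)

        def _warn():
            # poll() returns None while the process is still running.
            if proc.poll() is None:
                LOG.warning("command %s still running after %ss",
                            cmd, warn_after)

        timer = threading.Timer(warn_after, _warn)
        timer.start()
        try:
            out, err = proc.communicate()
        finally:
            timer.cancel()  # harmless if the timer already fired
        return proc.returncode, out, err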
You can write a tool today that analyzes the logs and shows you these things. However, I wonder if we could actually do something creative in the code itself so we get this for free. I'm curious if the creative use of Timers might let us emit log messages under the conditions above, something along the lines of the sketches earlier in this message (someone with a better understanding of Python internals needs to speak up here). Maybe it's too much overhead, but I think it's worth at least asking the question.

The same issue exists when it comes to processutils, I think: warning that a command is still running after 10s might be really handy, because it turns out that issue #2 was caused by exactly that, and it took quite a bit of decoding to figure out.

	-Sean

--
Sean Dague
http://dague.net