+1 for making those two changes. I also have been frustrated doing debugging in the gate recently, and any operational-ease-of-debugging things like this would be appreciated.

-jay

On 09/25/2014 08:49 AM, Sean Dague wrote:
Spending a ton of time reading logs, oslo locking ends up basically
creating a ton of output at DEBUG that you have to mentally filter to
find problems:

2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
[req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
Created new semaphore "iptables" internal_lock
/opt/stack/new/nova/nova/openstack/common/lockutils.py:206
2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
[req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
Acquired semaphore "iptables" lock
/opt/stack/new/nova/nova/openstack/common/lockutils.py:229
2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
[req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
Attempting to grab external lock "iptables" external_lock
/opt/stack/new/nova/nova/openstack/common/lockutils.py:178
2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
[req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
Got file lock "/opt/stack/data/nova/nova-iptables" acquire
/opt/stack/new/nova/nova/openstack/common/lockutils.py:93
2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
[req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
Got semaphore / lock "_do_refresh_provider_fw_rules" inner
/opt/stack/new/nova/nova/openstack/common/lockutils.py:271
2014-09-24 18:44:49.244 DEBUG nova.compute.manager
[req-b91cb1c1-f211-43ef-9714-651eeb3b2302
DeleteServersAdminTestXML-1408641898
DeleteServersAdminTestXML-469708524] [instance:
98eb8e6e-088b-4dda-ada5-7b2b79f62506] terminating bdm
BlockDeviceMapping(boot_index=0,connection_info=None,created_at=2014-09-24T18:44:42Z,delete_on_termination=True,deleted=False,deleted_at=None,destination_type='local',device_name='/dev/vda',device_type='disk',disk_bus=None,guest_format=None,id=43,image_id='262ab8a2-0790-49b3-a8d3-e8ed73e3ed71',instance=<?>,instance_uuid=98eb8e6e-088b-4dda-ada5-7b2b79f62506,no_device=False,snapshot_id=None,source_type='image',updated_at=2014-09-24T18:44:42Z,volume_id=None,volume_size=None)
_cleanup_volumes /opt/stack/new/nova/nova/compute/manager.py:2407
2014-09-24 18:44:49.248 DEBUG nova.openstack.common.lockutils
[req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
Released file lock "/opt/stack/data/nova/nova-iptables" release
/opt/stack/new/nova/nova/openstack/common/lockutils.py:115
2014-09-24 18:44:49.248 DEBUG nova.openstack.common.lockutils
[req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
Releasing semaphore "iptables" lock
/opt/stack/new/nova/nova/openstack/common/lockutils.py:238
2014-09-24 18:44:49.249 DEBUG nova.openstack.common.lockutils
[req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
Semaphore / lock released "_do_refresh_provider_fw_rules" inner

Also readable here:
http://logs.openstack.org/01/123801/1/check/check-tempest-dsvm-full/b5f8b37/logs/screen-n-cpu.txt.gz#_2014-09-24_18_44_49_240

(Yes, it's kind of ugly)

What occured to me is that in debugging locking issues what we actually
care about is 2 things semantically:

#1 - tried to get a lock, but someone else has it. Then we know we've
got lock contention. .
#2 - something is still holding a lock after some "long" amount of time.

#2 turned out to be a critical bit in understanding one of the worst
recent gate impacting issues.

You can write a tool today that analyzes the logs and shows you these
things. However, I wonder if we could actually do something creative in
the code itself to do this already. I'm curious if the creative use of
Timers might let us emit log messages under the conditions above
(someone with better understanding of python internals needs to speak up
here). Maybe it's too much overhead, but I think it's worth at least
asking the question.

The same issue exists when it comes to processutils I think, warning
that a command is still running after 10s might be really handy, because
it turns out that issue #2 was caused by this, and it took quite a bit
of decoding to figure that out.

        -Sean


_______________________________________________
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev

Reply via email to