Re: [openstack-dev] [oslo] logging around oslo lockutils
On 09/25/2014 03:06 PM, Ben Nemec wrote:
> On 09/25/2014 07:49 AM, Sean Dague wrote:
>> Spending a ton of time reading logs, oslo locking ends up basically
>> creating a ton of output at DEBUG that you have to mentally filter to
>> find problems:
>>
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Created new semaphore "iptables" internal_lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:206
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Acquired semaphore "iptables" lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:229
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Attempting to grab external lock "iptables" external_lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:178
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Got file lock "/opt/stack/data/nova/nova-iptables" acquire
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:93
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Got semaphore / lock "_do_refresh_provider_fw_rules" inner
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:271
>> 2014-09-24 18:44:49.244 DEBUG nova.compute.manager
>> [req-b91cb1c1-f211-43ef-9714-651eeb3b2302
>> DeleteServersAdminTestXML-1408641898
>> DeleteServersAdminTestXML-469708524] [instance:
>> 98eb8e6e-088b-4dda-ada5-7b2b79f62506] terminating bdm
>> BlockDeviceMapping(boot_index=0,connection_info=None,created_at=2014-09-24T18:44:42Z,delete_on_termination=True,deleted=False,deleted_at=None,destination_type='local',device_name='/dev/vda',device_type='disk',disk_bus=None,guest_format=None,id=43,image_id='262ab8a2-0790-49b3-a8d3-e8ed73e3ed71',instance=,instance_uuid=98eb8e6e-088b-4dda-ada5-7b2b79f62506,no_device=False,snapshot_id=None,source_type='image',updated_at=2014-09-24T18:44:42Z,volume_id=None,volume_size=None)
>> _cleanup_volumes /opt/stack/new/nova/nova/compute/manager.py:2407
>> 2014-09-24 18:44:49.248 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Released file lock "/opt/stack/data/nova/nova-iptables" release
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:115
>> 2014-09-24 18:44:49.248 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Releasing semaphore "iptables" lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:238
>> 2014-09-24 18:44:49.249 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Semaphore / lock released "_do_refresh_provider_fw_rules" inner
>>
>> Also readable here:
>> http://logs.openstack.org/01/123801/1/check/check-tempest-dsvm-full/b5f8b37/logs/screen-n-cpu.txt.gz#_2014-09-24_18_44_49_240
>>
>> (Yes, it's kind of ugly)
>>
>> What occurred to me is that in debugging locking issues what we actually
>> care about is 2 things semantically:
>>
>> #1 - tried to get a lock, but someone else has it. Then we know we've
>> got lock contention.
>> #2 - something is still holding a lock after some "long" amount of time.

> We did just merge https://review.openstack.org/#/c/122166/ which adds
> some contention/timing information to the log messages and should at
> least be a step toward what you're talking about.
>
> For context, we had some bad logging that resulted in
> https://bugs.launchpad.net/oslo.concurrency/+bug/1367941 . That has
> been cleaned up to at least be accurate, but it did add an extra log
> message (created and acquired). The reason we cared about that is we
> thought there might be a bad interaction between our code and eventlet,
> so we wanted to know whether we were in fact locking the same semaphore
> twice or mistakenly creating two separate ones (as it turns out, neither
> - it was just the bad logging I mentioned earlier).
>
> So, given that I think everyone involved agrees that the double-locking
> thing was a cosmetic issue and not a functional one, we could probably
> just remove the created/using messages here:
> https://github.com/openstack/oslo-incubator/blob/master/openstack/common/lockutils.py#L202
> which would eliminate one message per lock operation without
> significantly impacting debug
Re: [openstack-dev] [oslo] logging around oslo lockutils
On 09/26/2014 12:03 AM, Christopher Yeoh wrote:
> On Thu, 25 Sep 2014 08:49:12 -0400 Sean Dague wrote:
>>
>> #1 - tried to get a lock, but someone else has it. Then we know we've
>> got lock contention.
>> #2 - something is still holding a lock after some "long" amount of
>> time.
>
> +1 to both.
>
>> #2 turned out to be a critical bit in understanding one of the worst
>> recent gate impacting issues.
>>
>> You can write a tool today that analyzes the logs and shows you these
>> things. However, I wonder if we could actually do something creative
>> in the code itself to do this already. I'm curious if the creative
>> use of Timers might let us emit log messages under the conditions
>> above (someone with better understanding of python internals needs to
>> speak up here). Maybe it's too much overhead, but I think it's worth
>> at least asking the question.
>
> Even a simple log at the end, when it's finally released, if it has been
> held for a long time would be handy, as matching up acquire/release by
> eye is not easy.
>
> I don't think we get a log message when an acquire is attempted but
> fails. That might help get a measure of lock contention?
>
>> The same issue exists when it comes to processutils I think, warning
>> that a command is still running after 10s might be really handy,
>> because it turns out that issue #2 was caused by this, and it took
>> quite a bit of decoding to figure that out.
>
> Also I think that perhaps a log message when a periodic task takes longer
> than the interval that the task is meant to run at would be a useful
> warning sign that something odd is going on.

Well, we do get some of the periodic tasks today when they overrun, IIRC. However, the surprising thing about the Nova resource tracker lock bug was that basically all the time was taken by blockdev failing (which took almost 2 minutes) and happened to be inside a lock that it probably didn't need to be in. I think if we find we're ever taking a lock for > 10s...
we're going to start impacting the rest of the system in very bad ways, especially given the RPC timeouts at 60s. Getting that information as close to the actual issue as possible would be super helpful.

-Sean

--
Sean Dague
http://dague.net

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
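The Timer idea Sean floats above can be done with the stdlib alone. A minimal sketch (all names here, like `held_warning`, are hypothetical, not oslo code), assuming plain threads rather than eventlet:

```python
import logging
import threading
from contextlib import contextmanager

LOG = logging.getLogger(__name__)

@contextmanager
def held_warning(lock, name, warn_after=10.0):
    """Acquire lock; emit a WARNING if it is still held after warn_after seconds."""
    def _warn():
        LOG.warning('lock "%s" still held after %.1fs', name, warn_after)

    lock.acquire()
    timer = threading.Timer(warn_after, _warn)
    timer.daemon = True
    timer.start()
    try:
        yield
    finally:
        timer.cancel()  # released in time: no warning is ever logged
        lock.release()
```

A real implementation would have to account for eventlet monkey-patching and the cost of one timer thread per lock acquisition, which is presumably the overhead Sean is asking about.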
Re: [openstack-dev] [oslo] logging around oslo lockutils
On Thu, 25 Sep 2014 08:49:12 -0400 Sean Dague wrote:
>
> #1 - tried to get a lock, but someone else has it. Then we know we've
> got lock contention.
> #2 - something is still holding a lock after some "long" amount of
> time.

+1 to both.

> #2 turned out to be a critical bit in understanding one of the worst
> recent gate impacting issues.
>
> You can write a tool today that analyzes the logs and shows you these
> things. However, I wonder if we could actually do something creative
> in the code itself to do this already. I'm curious if the creative
> use of Timers might let us emit log messages under the conditions
> above (someone with better understanding of python internals needs to
> speak up here). Maybe it's too much overhead, but I think it's worth
> at least asking the question.

Even a simple log at the end, when it's finally released, if it has been held for a long time would be handy, as matching up acquire/release by eye is not easy.

I don't think we get a log message when an acquire is attempted but fails. That might help get a measure of lock contention?

> The same issue exists when it comes to processutils I think, warning
> that a command is still running after 10s might be really handy,
> because it turns out that issue #2 was caused by this, and it took
> quite a bit of decoding to figure that out.

Also I think that perhaps a log message when a periodic task takes longer than the interval that the task is meant to run at would be a useful warning sign that something odd is going on.

Chris
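Chris's point, that a blocked acquire is itself the interesting signal, can be approximated by trying a non-blocking acquire first. A hedged sketch (the helper `acquire_logged` is made up, not the lockutils API):

```python
import logging
import threading
import time

LOG = logging.getLogger(__name__)

def acquire_logged(lock, name):
    """Acquire lock; log only when the acquire actually had to wait.

    A failed non-blocking attempt means someone else holds the lock,
    i.e. we have measurable contention worth logging.
    Returns the time spent waiting, in seconds.
    """
    if lock.acquire(blocking=False):
        return 0.0  # uncontended: nothing interesting to log
    LOG.debug('lock "%s" is contended, waiting...', name)
    start = time.monotonic()
    lock.acquire()
    waited = time.monotonic() - start
    LOG.debug('acquired lock "%s" after waiting %.3fs', name, waited)
    return waited
```

This keeps the quiet path quiet: an uncontended lock produces no log traffic at all, which is most of what the thread is complaining about.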
Re: [openstack-dev] [oslo] logging around oslo lockutils
+1 Exactly what I was thinking. Semaphore races and deadlocks are important to be able to trace, but the normal production cloud doesn't want to see those messages.

What might be even better would be to also put a counter on the semaphores so that if they are ever >1 or <0 they report an error on normal log levels. I'm assuming it would be an error; I can't see why it would be just a warn or info, but I don't know the guts of the code here.

--Rocky

-----Original Message-----
From: Joshua Harlow [mailto:harlo...@outlook.com]
Sent: Thursday, September 25, 2014 12:23 PM
To: openst...@nemebean.com; OpenStack Development Mailing List (not for usage questions)
Subject: Re: [openstack-dev] [oslo] logging around oslo lockutils

Or how about we add in a new log level? A few libraries I have come across support log level 5 (which is less than DEBUG (10) but greater than NOTSET (0))...

One usage of this is in the multiprocessing library in python itself @ https://hg.python.org/releasing/3.4/file/8671f89107c8/Lib/multiprocessing/util.py#l34

Kazoo calls it the 'BLATHER' level @ https://github.com/python-zk/kazoo/blob/master/kazoo/loggingsupport.py

Since these messages can actually be useful for lock_utils developers it could be useful to keep them [1]? Just a thought...

[1] One man's DEBUG is another man's garbage, ha.
On Sep 25, 2014, at 12:06 PM, Ben Nemec wrote:
> On 09/25/2014 07:49 AM, Sean Dague wrote:
>> Spending a ton of time reading logs, oslo locking ends up basically
>> creating a ton of output at DEBUG that you have to mentally filter to
>> find problems:
>>
>> [lockutils DEBUG log output snipped; quoted in full earlier in the thread]
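Rocky's "counter on the semaphores" idea could be sketched as a wrapper class (the class below is hypothetical, not oslo code): track how many holders the semaphore currently has and report at ERROR, so the message is visible at normal log levels, whenever the count leaves its legal range.

```python
import logging
import threading

LOG = logging.getLogger(__name__)

class CountedSemaphore:
    """Semaphore wrapper that sanity-checks its holder count.

    If the count ever exceeds the configured value (double-acquire bug)
    or drops below zero (release without acquire), an ERROR is logged.
    """

    def __init__(self, value=1):
        self._sem = threading.Semaphore(value)
        self._max = value
        self._holders = 0
        self._count_lock = threading.Lock()  # protects _holders

    def acquire(self):
        self._sem.acquire()
        with self._count_lock:
            self._holders += 1
            if self._holders > self._max:
                LOG.error('semaphore over-acquired: %d holders (max %d)',
                          self._holders, self._max)

    def release(self):
        with self._count_lock:
            self._holders -= 1
            if self._holders < 0:
                LOG.error('semaphore released more times than acquired')
        self._sem.release()
```

In normal operation the checks never fire and cost one extra lock round-trip per acquire/release, so this is cheap insurance against exactly the double-locking scenario Ben describes elsewhere in the thread.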
Re: [openstack-dev] [oslo] logging around oslo lockutils
Or how about we add in a new log level? A few libraries I have come across support log level 5 (which is less than DEBUG (10) but greater than NOTSET (0))...

One usage of this is in the multiprocessing library in python itself @ https://hg.python.org/releasing/3.4/file/8671f89107c8/Lib/multiprocessing/util.py#l34

Kazoo calls it the 'BLATHER' level @ https://github.com/python-zk/kazoo/blob/master/kazoo/loggingsupport.py

Since these messages can actually be useful for lock_utils developers it could be useful to keep them [1]? Just a thought...

[1] One man's DEBUG is another man's garbage, ha.

On Sep 25, 2014, at 12:06 PM, Ben Nemec wrote:
> On 09/25/2014 07:49 AM, Sean Dague wrote:
>> Spending a ton of time reading logs, oslo locking ends up basically
>> creating a ton of output at DEBUG that you have to mentally filter to
>> find problems:
>>
>> [lockutils DEBUG log output snipped; quoted in full earlier in the thread]
>>
>> Also readable here:
>> http://logs.openstack.org/01/123801/1/check/check-tempest-dsvm-full/b5f8b37/logs/screen-n-cpu.txt.gz#_2014-09-24_18_44_49_240
>>
>> (Yes, it's kind of ugly)
>>
>> What occurred to me is that in debugging locking issues what we actually
>> care about is 2 things semantically:
>>
>> #1 - tried to get a lock, but someone else has it. Then we know we've
>> got lock contention.
>> #2 - something is still holding a lock after some "long" amount of time.
>
> We did just merge https://review.openstack.org/#/c/122166/ which adds
> some contention/timing information to the log messages and should at
> least be a step toward what you're talking about.
>
> For context, we had some bad logging that resulted in
> https://bugs.launchpad.net/oslo.concurrency/+bug/1367941 . That has
> been cleaned up to at least be accurate, but it did add an extra log
> message (created and acquired). The reason we cared about that is we
> thought there might be a bad inter
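Registering such a sub-DEBUG level with the stdlib logging module takes only a few lines. A sketch borrowing kazoo's BLATHER name at level 5, as in the links Joshua gives; the `blather` helper function is made up for illustration:

```python
import logging

# Level 5 sits below DEBUG (10) but above NOTSET (0), the same slot
# multiprocessing uses for SUBDEBUG and kazoo uses for BLATHER.
BLATHER = 5
logging.addLevelName(BLATHER, 'BLATHER')

LOG = logging.getLogger('lockutils')

def blather(msg, *args, **kwargs):
    """Log lock chatter at the BLATHER level (hidden at DEBUG and above)."""
    if LOG.isEnabledFor(BLATHER):
        LOG.log(BLATHER, msg, *args, **kwargs)
```

With this in place, operators running at DEBUG never see the per-acquire chatter, while a lockutils developer can opt in with `LOG.setLevel(BLATHER)`.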
Re: [openstack-dev] [oslo] logging around oslo lockutils
On 09/25/2014 07:49 AM, Sean Dague wrote:
> Spending a ton of time reading logs, oslo locking ends up basically
> creating a ton of output at DEBUG that you have to mentally filter to
> find problems:
>
> [lockutils DEBUG log output snipped; quoted in full earlier in the thread]
>
> Also readable here:
> http://logs.openstack.org/01/123801/1/check/check-tempest-dsvm-full/b5f8b37/logs/screen-n-cpu.txt.gz#_2014-09-24_18_44_49_240
>
> (Yes, it's kind of ugly)
>
> What occurred to me is that in debugging locking issues what we actually
> care about is 2 things semantically:
>
> #1 - tried to get a lock, but someone else has it. Then we know we've
> got lock contention.
> #2 - something is still holding a lock after some "long" amount of time.
We did just merge https://review.openstack.org/#/c/122166/ which adds some contention/timing information to the log messages and should at least be a step toward what you're talking about.

For context, we had some bad logging that resulted in https://bugs.launchpad.net/oslo.concurrency/+bug/1367941 . That has been cleaned up to at least be accurate, but it did add an extra log message (created and acquired). The reason we cared about that is we thought there might be a bad interaction between our code and eventlet, so we wanted to know whether we were in fact locking the same semaphore twice or mistakenly creating two separate ones (as it turns out, neither - it was just the bad logging I mentioned earlier).

So, given that I think everyone involved agrees that the double-locking thing was a cosmetic issue and not a functional one, we could probably just remove the created/using messages here: https://github.com/openstack/oslo-incubator/blob/master/openstack/common/lockutils.py#L202 which would eliminate one message per lock operation without significantly impacting debuggability of our code. Those messages only exist because we didn't trust what was going on at a lower level. It would also be nice to reduc
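The contention/timing information Ben describes amounts to timestamping the acquire and the release. A hypothetical sketch of that shape (not the actual oslo patch, and `timed_lock` is an invented name):

```python
import logging
import time
from contextlib import contextmanager

LOG = logging.getLogger(__name__)

@contextmanager
def timed_lock(lock, name):
    """Log how long we waited for the lock and how long we held it."""
    start = time.monotonic()
    lock.acquire()
    waited = time.monotonic() - start
    LOG.debug('acquired "%s" (waited %.3fs)', name, waited)
    held_since = time.monotonic()
    try:
        yield
    finally:
        lock.release()
        LOG.debug('released "%s" (held %.3fs)',
                  name, time.monotonic() - held_since)
```

Two messages per lock operation instead of five, and each one carries the number (wait time, hold time) that actually matters when debugging contention.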
Re: [openstack-dev] [oslo] logging around oslo lockutils
Logged as a high priority bug - https://bugs.launchpad.net/oslo.concurrency/+bug/1374075

On Thu, Sep 25, 2014 at 1:57 PM, Jay Pipes wrote:
> +1 for making those two changes. I also have been frustrated doing debugging
> in the gate recently, and any operational-ease-of-debugging things like this
> would be appreciated.
>
> -jay
>
> On 09/25/2014 08:49 AM, Sean Dague wrote:
>>
>> Spending a ton of time reading logs, oslo locking ends up basically
>> creating a ton of output at DEBUG that you have to mentally filter to
>> find problems:
>>
>> [lockutils DEBUG log output snipped; quoted in full earlier in the thread]
>>
>> Also readable here:
>> http://logs.openstack.org/01/123801/1/check/check-tempest-dsvm-full/b5f8b37/logs/screen-n-cpu.txt.gz#_2014-09-24_18_44_49_240
>>
>> (Yes, it's kind of ugly)
>>
>> What occurred to me is that in debugging locking issues what we actually
>> care about is 2 things semantically:
>>
>> #1 - tried to get a lock, but someone else has it. Then we know we've
>> got lock contention.
>> #2 - something is still holding a lock after some "long" amount of time.
>>
>> #2 turned out to be a critical bit in understanding one of the worst
>> recent gate impacting issues.
>>
>> You can write a tool today that analyzes the logs and shows you these
>> things. However, I wonder if we could actually do something creative in
>> the code itself to do this already. I'm curious if the creative use of
>> Timers might let us emit log messages under the conditions above
>> (someone with better understanding of python internals needs to speak up
>> here). Maybe it's too much overhead, but I think it's worth at least
>> asking the question.
>>
>> The same issue exists when it comes to processutils I think, warning
>> that a command is still running after 10s might be really handy, because
>> it turns out that issue #2 was caused by this, and it took quite a bit
>> of decoding to figure that out.
Re: [openstack-dev] [oslo] logging around oslo lockutils
+1 for making those two changes. I also have been frustrated doing debugging in the gate recently, and any operational-ease-of-debugging things like this would be appreciated.

-jay

On 09/25/2014 08:49 AM, Sean Dague wrote:

Spending a ton of time reading logs, oslo locking ends up basically creating a ton of output at DEBUG that you have to mentally filter to find problems:

[lockutils DEBUG log output snipped; quoted in full earlier in the thread]

Also readable here: http://logs.openstack.org/01/123801/1/check/check-tempest-dsvm-full/b5f8b37/logs/screen-n-cpu.txt.gz#_2014-09-24_18_44_49_240

(Yes, it's kind of ugly)

What occurred to me is that in debugging locking issues what we actually care about is 2 things semantically:

#1 - tried to get a lock, but someone else has it. Then we know we've got lock contention.
#2 - something is still holding a lock after some "long" amount of time.

#2 turned out to be a critical bit in understanding one of the worst recent gate impacting issues.
You can write a tool today that analyzes the logs and shows you these things. However, I wonder if we could actually do something creative in the code itself to do this already. I'm curious whether creative use of Timers might let us emit log messages under the conditions above (someone with a better understanding of Python internals needs to speak up here). Maybe it's too much overhead, but I think it's worth at least asking the question.

The same issue exists with processutils, I think: warning that a command is still running after 10s might be really handy, because it turns out that issue #2 was caused by this, and it took quite a bit of decoding to figure that out.

-Sean

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev
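The processutils suggestion can be sketched the same way: a threading.Timer armed alongside the subprocess, firing only if the command outlives the threshold. The execute_with_warning helper below is a hypothetical illustration, not the real oslo processutils API:

```python
import logging
import subprocess
import threading

LOG = logging.getLogger(__name__)


def execute_with_warning(cmd, warn_after=10):
    """Run cmd (a list of argv strings); warn if it is still running
    after warn_after seconds, without interrupting it."""
    proc = subprocess.Popen(cmd,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)

    def _warn():
        # The timer may race with process exit, so re-check: poll()
        # returns None only while the process is still running.
        if proc.poll() is None:
            LOG.warning('Command %s still running after %s seconds',
                        cmd, warn_after)

    timer = threading.Timer(warn_after, _warn)
    timer.start()
    try:
        out, err = proc.communicate()
    finally:
        timer.cancel()
    return proc.returncode, out, err
```

As with the lock sketch, a command that finishes promptly produces no output at all; only the pathological case (the one that caused issue #2) shows up in the logs.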