Re: [openstack-dev] [oslo] logging around olso lockutils

2014-09-26 Thread Sean Dague
On 09/25/2014 03:06 PM, Ben Nemec wrote:
> On 09/25/2014 07:49 AM, Sean Dague wrote:
>> Spending a ton of time reading logs, oslo locking ends up basically
>> creating a ton of output at DEBUG that you have to mentally filter to
>> find problems:
>>
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Created new semaphore "iptables" internal_lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:206
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Acquired semaphore "iptables" lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:229
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Attempting to grab external lock "iptables" external_lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:178
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Got file lock "/opt/stack/data/nova/nova-iptables" acquire
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:93
>> 2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Got semaphore / lock "_do_refresh_provider_fw_rules" inner
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:271
>> 2014-09-24 18:44:49.244 DEBUG nova.compute.manager
>> [req-b91cb1c1-f211-43ef-9714-651eeb3b2302
>> DeleteServersAdminTestXML-1408641898
>> DeleteServersAdminTestXML-469708524] [instance:
>> 98eb8e6e-088b-4dda-ada5-7b2b79f62506] terminating bdm
>> BlockDeviceMapping(boot_index=0,connection_info=None,created_at=2014-09-24T18:44:42Z,delete_on_termination=True,deleted=False,deleted_at=None,destination_type='local',device_name='/dev/vda',device_type='disk',disk_bus=None,guest_format=None,id=43,image_id='262ab8a2-0790-49b3-a8d3-e8ed73e3ed71',instance=,instance_uuid=98eb8e6e-088b-4dda-ada5-7b2b79f62506,no_device=False,snapshot_id=None,source_type='image',updated_at=2014-09-24T18:44:42Z,volume_id=None,volume_size=None)
>> _cleanup_volumes /opt/stack/new/nova/nova/compute/manager.py:2407
>> 2014-09-24 18:44:49.248 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Released file lock "/opt/stack/data/nova/nova-iptables" release
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:115
>> 2014-09-24 18:44:49.248 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Releasing semaphore "iptables" lock
>> /opt/stack/new/nova/nova/openstack/common/lockutils.py:238
>> 2014-09-24 18:44:49.249 DEBUG nova.openstack.common.lockutils
>> [req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
>> ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
>> Semaphore / lock released "_do_refresh_provider_fw_rules" inner
>>
>> Also readable here:
>> http://logs.openstack.org/01/123801/1/check/check-tempest-dsvm-full/b5f8b37/logs/screen-n-cpu.txt.gz#_2014-09-24_18_44_49_240
>>
>> (Yes, it's kind of ugly)
>>
>> What occurred to me is that in debugging locking issues what we actually
>> care about is 2 things semantically:
>>
>> #1 - tried to get a lock, but someone else has it. Then we know we've
>> got lock contention.
>> #2 - something is still holding a lock after some "long" amount of time.
> 
> We did just merge https://review.openstack.org/#/c/122166/ which adds
> some contention/timing information to the log messages and should at
> least be a step toward what you're talking about.
> 
> For context, we had some bad logging that resulted in
> https://bugs.launchpad.net/oslo.concurrency/+bug/1367941 .  That has
> been cleaned up to at least be accurate, but it did add an extra log
> message (created and acquired).  The reason we cared about that is we
> thought there might be a bad interaction between our code and eventlet,
> so we wanted to know whether we were in fact locking the same semaphore
> twice or mistakenly creating two separate ones (as it turns out, neither
> - it was just the bad logging I mentioned earlier).
> 
> So, given that I think everyone involved agrees that the double-locking
> thing was a cosmetic issue and not a functional one, we could probably
> just remove the created/using messages here:
> https://github.com/openstack/oslo-incubator/blob/master/openstack/common/lockutils.py#L202
> which would eliminate one message per lock operation without
> significantly impacting debuggability of our code.

Re: [openstack-dev] [oslo] logging around olso lockutils

2014-09-26 Thread Sean Dague
On 09/26/2014 12:03 AM, Christopher Yeoh wrote:
> On Thu, 25 Sep 2014 08:49:12 -0400
> Sean Dague  wrote:
> 
>>
>> #1 - tried to get a lock, but someone else has it. Then we know we've
>> got lock contention.
>> #2 - something is still holding a lock after some "long" amount of
>> time.
> 
> +1 to both.
> 
>> #2 turned out to be a critical bit in understanding one of the worst
>> recent gate impacting issues.
>>
>> You can write a tool today that analyzes the logs and shows you these
>> things. However, I wonder if we could actually do something creative
>> in the code itself to do this already. I'm curious if the creative
>> use of Timers might let us emit log messages under the conditions
>> above (someone with better understanding of python internals needs to
>> speak up here). Maybe it's too much overhead, but I think it's worth
>> at least asking the question.
> 
> Even a simple log message when a lock is finally released after being
> held for a long time would be handy, as matching up acquire/release by
> eye is not easy.
> 
> I don't think we get a log message when an acquire is attempted but
> fails. That might help get a measure of lock contention?
> 
>> The same issue exists when it comes to processutils I think, warning
>> that a command is still running after 10s might be really handy,
>> because it turns out that issue #2 was caused by this, and it took
>> quite a bit of decoding to figure that out.
> 
> Also, I think a log message when a periodic task takes longer than the
> interval it is meant to run at would be a useful warning sign that
> something odd is going on.

Well, we do get log messages for some of the periodic tasks today when they overrun, IIRC.

However, the surprising thing about the Nova resource tracker lock
bug was that basically all the time was taken by blockdev failing (which
took almost 2 minutes), and that call happened to be inside a lock it
probably didn't need to be in.

I think if we find we're ever taking a lock for > 10s... we're going to
start impacting the rest of the system in very bad ways, especially
given the RPC timeouts at 60s. Getting that information as close to the
actual issue as possible would be super helpful.

-Sean

-- 
Sean Dague
http://dague.net

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [oslo] logging around olso lockutils

2014-09-25 Thread Christopher Yeoh
On Thu, 25 Sep 2014 08:49:12 -0400
Sean Dague  wrote:

> 
> #1 - tried to get a lock, but someone else has it. Then we know we've
> got lock contention.
> #2 - something is still holding a lock after some "long" amount of
> time.

+1 to both.

> #2 turned out to be a critical bit in understanding one of the worst
> recent gate impacting issues.
> 
> You can write a tool today that analyzes the logs and shows you these
> things. However, I wonder if we could actually do something creative
> in the code itself to do this already. I'm curious if the creative
> use of Timers might let us emit log messages under the conditions
> above (someone with better understanding of python internals needs to
> speak up here). Maybe it's too much overhead, but I think it's worth
> at least asking the question.

Even a simple log message when a lock is finally released after being
held for a long time would be handy, as matching up acquire/release by
eye is not easy.

I don't think we get a log message when an acquire is attempted but
fails. That might help get a measure of lock contention?

> The same issue exists when it comes to processutils I think, warning
> that a command is still running after 10s might be really handy,
> because it turns out that issue #2 was caused by this, and it took
> quite a bit of decoding to figure that out.

Also, I think a log message when a periodic task takes longer than the
interval it is meant to run at would be a useful warning sign that
something odd is going on.
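
Roughly the kind of thing I mean, as a totally untested sketch (made-up
names, not the actual lockutils code): try a non-blocking acquire first so
contention gets logged, and log how long the lock was held on release.

import logging
import threading
import time

LOG = logging.getLogger(__name__)


class InstrumentedLock(object):
    """Toy lock that logs contention and how long it was held."""

    def __init__(self, name, held_warn=10.0):
        self.name = name
        self.held_warn = held_warn      # complain if held longer than this
        self._lock = threading.Lock()
        self._acquired_at = None

    def acquire(self):
        if not self._lock.acquire(False):
            # Someone else already holds it -- that's the contention case.
            LOG.debug('Waiting for contended lock "%s"', self.name)
            self._lock.acquire()
        self._acquired_at = time.time()

    def release(self):
        held = time.time() - self._acquired_at
        self._lock.release()
        if held > self.held_warn:
            LOG.warning('Lock "%s" was held for %.2fs', self.name, held)
        else:
            LOG.debug('Lock "%s" released after %.2fs', self.name, held)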

Chris


___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [oslo] logging around olso lockutils

2014-09-25 Thread Rochelle.RochelleGrober
+1

Exactly what I was thinking.  Semaphore races and deadlocks are important to be 
able to trace, but the normal production cloud doesn't want to see those 
messages.  

What might be even better would be to also put a counter on the semaphores so
that if the count ever goes above 1 or below 0 they report an error at normal
log levels.  I'm assuming it would be an error; I can't see why it would be
just a warn or info, but I don't know the guts of the code here.
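
Something like this rough sketch is the shape I have in mind (a hypothetical
wrapper, not the real lockutils code):

import logging
import threading

LOG = logging.getLogger(__name__)


class CheckedLock(object):
    """Toy lock that screams if the holder count ever leaves [0, 1]."""

    def __init__(self, name):
        self.name = name
        self._lock = threading.Lock()
        self._holders = 0
        self._meta = threading.Lock()   # protects the counter itself

    def acquire(self):
        self._lock.acquire()
        with self._meta:
            self._holders += 1
            if self._holders > 1:
                LOG.error('Lock "%s" has %d holders!', self.name,
                          self._holders)

    def release(self):
        with self._meta:
            self._holders -= 1
            if self._holders < 0:
                LOG.error('Lock "%s" released more times than acquired',
                          self.name)
        self._lock.release()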

--Rocky

-Original Message-
From: Joshua Harlow [mailto:harlo...@outlook.com] 
Sent: Thursday, September 25, 2014 12:23 PM
To: openst...@nemebean.com; OpenStack Development Mailing List (not for usage 
questions)
Subject: Re: [openstack-dev] [oslo] logging around olso lockutils

Or how about we add in a new log level?

A few libraries I have come across support the log level 5 (which is less than 
debug (10) but greater than notset (0))...

One usage of this is in the multiprocessing library in python itself @

https://hg.python.org/releasing/3.4/file/8671f89107c8/Lib/multiprocessing/util.py#l34

Kazoo calls it the 'BLATHER' level @

https://github.com/python-zk/kazoo/blob/master/kazoo/loggingsupport.py

Since these messages can actually be useful for lockutils developers, it could 
be worthwhile to keep them[1]?

Just a thought...

[1] One man's DEBUG is another man's garbage, ha.

On Sep 25, 2014, at 12:06 PM, Ben Nemec  wrote:

> On 09/25/2014 07:49 AM, Sean Dague wrote:
>> Spending a ton of time reading logs, oslo locking ends up basically
>> creating a ton of output at DEBUG that you have to mentally filter to
>> find problems:
>> 
>> [long DEBUG log excerpt snipped; see Sean's original post at the end of this thread]

Re: [openstack-dev] [oslo] logging around olso lockutils

2014-09-25 Thread Joshua Harlow
Or how about we add in a new log level?

A few libraries I have come across support the log level 5 (which is less than 
debug (10) but greater than notset (0))...

One usage of this is in the multiprocessing library in python itself @

https://hg.python.org/releasing/3.4/file/8671f89107c8/Lib/multiprocessing/util.py#l34

Kazoo calls it the 'BLATHER' level @

https://github.com/python-zk/kazoo/blob/master/kazoo/loggingsupport.py

Since these messages can actually be useful for lockutils developers, it could 
be worthwhile to keep them[1]?

Just a thought...

[1] One man's DEBUG is another man's garbage, ha.
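
Wiring up such a level is pretty cheap; a minimal sketch (the BLATHER name
and the logger here are just placeholders):

import logging

BLATHER = 5                              # below DEBUG (10), above NOTSET (0)
logging.addLevelName(BLATHER, 'BLATHER')

LOG = logging.getLogger('lockutils')


def blather(msg, *args, **kwargs):
    # Skip the formatting cost unless someone actually opted in.
    if LOG.isEnabledFor(BLATHER):
        LOG.log(BLATHER, msg, *args, **kwargs)


if __name__ == '__main__':
    logging.basicConfig(level=BLATHER)   # opt in to the firehose
    blather('Created new semaphore "%s"', 'iptables')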

On Sep 25, 2014, at 12:06 PM, Ben Nemec  wrote:

> On 09/25/2014 07:49 AM, Sean Dague wrote:
>> Spending a ton of time reading logs, oslo locking ends up basically
>> creating a ton of output at DEBUG that you have to mentally filter to
>> find problems:
>> 
>> [long DEBUG log excerpt snipped; see Sean's original post at the end of this thread]
>> 
>> Also readable here:
>> http://logs.openstack.org/01/123801/1/check/check-tempest-dsvm-full/b5f8b37/logs/screen-n-cpu.txt.gz#_2014-09-24_18_44_49_240
>> 
>> (Yes, it's kind of ugly)
>> 
>> What occurred to me is that in debugging locking issues what we actually
>> care about is 2 things semantically:
>>
>> #1 - tried to get a lock, but someone else has it. Then we know we've
>> got lock contention.
>> #2 - something is still holding a lock after some "long" amount of time.
> 
> We did just merge https://review.openstack.org/#/c/122166/ which adds
> some contention/timing information to the log messages and should at
> least be a step toward what you're talking about.
> 
> For context, we had some bad logging that resulted in
> https://bugs.launchpad.net/oslo.concurrency/+bug/1367941 .  That has
> been cleaned up to at least be accurate, but it did add an extra log
> message (created and acquired).  The reason we cared about that is we
> thought there might be a bad inter

Re: [openstack-dev] [oslo] logging around olso lockutils

2014-09-25 Thread Ben Nemec
On 09/25/2014 07:49 AM, Sean Dague wrote:
> Spending a ton of time reading logs, oslo locking ends up basically
> creating a ton of output at DEBUG that you have to mentally filter to
> find problems:
> 
> [long DEBUG log excerpt snipped; see Sean's original post at the end of this thread]
> 
> Also readable here:
> http://logs.openstack.org/01/123801/1/check/check-tempest-dsvm-full/b5f8b37/logs/screen-n-cpu.txt.gz#_2014-09-24_18_44_49_240
> 
> (Yes, it's kind of ugly)
> 
> What occurred to me is that in debugging locking issues what we actually
> care about is 2 things semantically:
>
> #1 - tried to get a lock, but someone else has it. Then we know we've
> got lock contention.
> #2 - something is still holding a lock after some "long" amount of time.

We did just merge https://review.openstack.org/#/c/122166/ which adds
some contention/timing information to the log messages and should at
least be a step toward what you're talking about.

For context, we had some bad logging that resulted in
https://bugs.launchpad.net/oslo.concurrency/+bug/1367941 .  That has
been cleaned up to at least be accurate, but it did add an extra log
message (created and acquired).  The reason we cared about that is we
thought there might be a bad interaction between our code and eventlet,
so we wanted to know whether we were in fact locking the same semaphore
twice or mistakenly creating two separate ones (as it turns out, neither
- it was just the bad logging I mentioned earlier).

So, given that I think everyone involved agrees that the double-locking
thing was a cosmetic issue and not a functional one, we could probably
just remove the created/using messages here:
https://github.com/openstack/oslo-incubator/blob/master/openstack/common/lockutils.py#L202
which would eliminate one message per lock operation without
significantly impacting debuggability of our code.  Those messages only
exist because we didn't trust what was going on at a lower level.

It would also be nice to reduc

Re: [openstack-dev] [oslo] logging around olso lockutils

2014-09-25 Thread Davanum Srinivas
Logged as high priority bug -
https://bugs.launchpad.net/oslo.concurrency/+bug/1374075

On Thu, Sep 25, 2014 at 1:57 PM, Jay Pipes  wrote:
> +1 for making those two changes. I also have been frustrated doing debugging
> in the gate recently, and any operational-ease-of-debugging things like this
> would be appreciated.
>
> -jay
>
> On 09/25/2014 08:49 AM, Sean Dague wrote:
>>
>> Spending a ton of time reading logs, oslo locking ends up basically
>> creating a ton of output at DEBUG that you have to mentally filter to
>> find problems:
>>
>> [long DEBUG log excerpt snipped; see Sean's original post at the end of this thread]
>>
>> Also readable here:
>>
>> http://logs.openstack.org/01/123801/1/check/check-tempest-dsvm-full/b5f8b37/logs/screen-n-cpu.txt.gz#_2014-09-24_18_44_49_240
>>
>> (Yes, it's kind of ugly)
>>
>> What occurred to me is that in debugging locking issues what we actually
>> care about is 2 things semantically:
>>
>> #1 - tried to get a lock, but someone else has it. Then we know we've
>> got lock contention.
>> #2 - something is still holding a lock after some "long" amount of time.
>>
>> #2 turned out to be a critical bit in understanding one of the worst
>> recent gate impacting issues.
>>
>> You can write a tool today that analyzes the logs and shows you these
>> things. However, I wonder if we could actually do something creative in
>> the code itself to do this already. I'm curious if the creative use of
>> Timers might let us emit log messages under the conditions above
>> (someone with better understanding of python internals needs to speak up
>> here). Maybe it's too much overhead, but I think it's worth at least
>> asking the question.
>>
>> The same issue exists when it comes to processutils I think, warning
>> that a command is still running after 10s might be really handy, because
>> it turns out that issue #2 was caused by this, and it took quite a bit
>> of decoding to figure that out

Re: [openstack-dev] [oslo] logging around olso lockutils

2014-09-25 Thread Jay Pipes
+1 for making those two changes. I also have been frustrated debugging 
in the gate recently, and any operational-ease-of-debugging things like 
this would be appreciated.


-jay

On 09/25/2014 08:49 AM, Sean Dague wrote:

Spending a ton of time reading logs, oslo locking ends up basically
creating a ton of output at DEBUG that you have to mentally filter to
find problems:

[long DEBUG log excerpt snipped; see Sean's original post at the end of this thread]

Also readable here:
http://logs.openstack.org/01/123801/1/check/check-tempest-dsvm-full/b5f8b37/logs/screen-n-cpu.txt.gz#_2014-09-24_18_44_49_240

(Yes, it's kind of ugly)

What occurred to me is that in debugging locking issues what we actually
care about is 2 things semantically:

#1 - tried to get a lock, but someone else has it. Then we know we've
got lock contention.
#2 - something is still holding a lock after some "long" amount of time.

#2 turned out to be a critical bit in understanding one of the worst
recent gate impacting issues.

You can write a tool today that analyzes the logs and shows you these
things. However, I wonder if we could actually do something creative in
the code itself to do this already. I'm curious if the creative use of
Timers might let us emit log messages under the conditions above
(someone with better understanding of python internals needs to speak up
here). Maybe it's too much overhead, but I think it's worth at least
asking the question.

The same issue exists when it comes to processutils I think, warning
that a command is still running after 10s might be really handy, because
it turns out that issue #2 was caused by this, and it took quite a bit
of decoding to figure that out.

-Sean



___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [oslo] logging around olso lockutils

2014-09-25 Thread Sean Dague
Spending a ton of time reading logs, oslo locking ends up basically
creating a ton of output at DEBUG that you have to mentally filter to
find problems:

2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
[req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
Created new semaphore "iptables" internal_lock
/opt/stack/new/nova/nova/openstack/common/lockutils.py:206
2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
[req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
Acquired semaphore "iptables" lock
/opt/stack/new/nova/nova/openstack/common/lockutils.py:229
2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
[req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
Attempting to grab external lock "iptables" external_lock
/opt/stack/new/nova/nova/openstack/common/lockutils.py:178
2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
[req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
Got file lock "/opt/stack/data/nova/nova-iptables" acquire
/opt/stack/new/nova/nova/openstack/common/lockutils.py:93
2014-09-24 18:44:49.240 DEBUG nova.openstack.common.lockutils
[req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
Got semaphore / lock "_do_refresh_provider_fw_rules" inner
/opt/stack/new/nova/nova/openstack/common/lockutils.py:271
2014-09-24 18:44:49.244 DEBUG nova.compute.manager
[req-b91cb1c1-f211-43ef-9714-651eeb3b2302
DeleteServersAdminTestXML-1408641898
DeleteServersAdminTestXML-469708524] [instance:
98eb8e6e-088b-4dda-ada5-7b2b79f62506] terminating bdm
BlockDeviceMapping(boot_index=0,connection_info=None,created_at=2014-09-24T18:44:42Z,delete_on_termination=True,deleted=False,deleted_at=None,destination_type='local',device_name='/dev/vda',device_type='disk',disk_bus=None,guest_format=None,id=43,image_id='262ab8a2-0790-49b3-a8d3-e8ed73e3ed71',instance=,instance_uuid=98eb8e6e-088b-4dda-ada5-7b2b79f62506,no_device=False,snapshot_id=None,source_type='image',updated_at=2014-09-24T18:44:42Z,volume_id=None,volume_size=None)
_cleanup_volumes /opt/stack/new/nova/nova/compute/manager.py:2407
2014-09-24 18:44:49.248 DEBUG nova.openstack.common.lockutils
[req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
Released file lock "/opt/stack/data/nova/nova-iptables" release
/opt/stack/new/nova/nova/openstack/common/lockutils.py:115
2014-09-24 18:44:49.248 DEBUG nova.openstack.common.lockutils
[req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
Releasing semaphore "iptables" lock
/opt/stack/new/nova/nova/openstack/common/lockutils.py:238
2014-09-24 18:44:49.249 DEBUG nova.openstack.common.lockutils
[req-d7443a9c-1eb5-42c3-bf40-aadd20f0452f
ListImageFiltersTestXML-1316776531 ListImageFiltersTestXML-132181290]
Semaphore / lock released "_do_refresh_provider_fw_rules" inner

Also readable here:
http://logs.openstack.org/01/123801/1/check/check-tempest-dsvm-full/b5f8b37/logs/screen-n-cpu.txt.gz#_2014-09-24_18_44_49_240

(Yes, it's kind of ugly)

What occurred to me is that in debugging locking issues what we actually
care about is 2 things semantically:

#1 - tried to get a lock, but someone else has it. Then we know we've
got lock contention.
#2 - something is still holding a lock after some "long" amount of time.

#2 turned out to be a critical bit in understanding one of the worst
recent gate impacting issues.

You can write a tool today that analyzes the logs and shows you these
things. However, I wonder if we could actually do something creative in
the code itself to do this already. I'm curious if the creative use of
Timers might let us emit log messages under the conditions above
(someone with better understanding of python internals needs to speak up
here). Maybe it's too much overhead, but I think it's worth at least
asking the question.
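
Something along these lines is the shape I'm imagining; a completely
untested sketch with made-up names, and it hand-waves the question of
whether threading.Timer behaves sanely under eventlet monkey-patching
(which is part of the python-internals question above):

import logging
import threading
from contextlib import contextmanager

LOG = logging.getLogger(__name__)


@contextmanager
def watched_lock(lock, name, warn_after=10.0):
    """Take the lock, and warn if it is still held after warn_after secs."""
    lock.acquire()
    timer = threading.Timer(
        warn_after,
        lambda: LOG.warning('Lock "%s" still held after %ss',
                            name, warn_after))
    timer.daemon = True
    timer.start()
    try:
        yield
    finally:
        timer.cancel()
        lock.release()

# Usage:
#   iptables_lock = threading.Lock()
#   with watched_lock(iptables_lock, 'iptables'):
#       refresh_firewall_rules()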

The same issue exists when it comes to processutils, I think: warning
that a command is still running after 10s might be really handy, because
it turns out that issue #2 was caused by this, and it took quite a bit
of decoding to figure that out.
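
Same idea for the command case, again as a rough sketch (a made-up helper,
not the real processutils API):

import logging
import subprocess
import threading

LOG = logging.getLogger(__name__)


def execute_with_watchdog(cmd, warn_after=10.0):
    """Run cmd, and warn if it is still running after warn_after seconds."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)

    def _nag():
        if proc.poll() is None:          # still running
            LOG.warning('Command %s still running after %ss',
                        cmd, warn_after)

    timer = threading.Timer(warn_after, _nag)
    timer.daemon = True
    timer.start()
    try:
        out, err = proc.communicate()
    finally:
        timer.cancel()
    return proc.returncode, out, err

# e.g. execute_with_watchdog(['blockdev', '--flushbufs', '/dev/vda'])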

-Sean

-- 
Sean Dague
http://dague.net

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev