Re: [openstack-dev] [ceilometer][qa] Punting ceilometer from whitelist

2013-12-04 Thread Gordon Chung
 As a developer, think about the fact that when you log something as
 ERROR, you are expecting a cloud operator to be woken up in the middle
 of the night with an email alert to go fix the cloud immediately. You
 are intentionally ruining someone's weekend to fix this issue - RIGHT NOW!

i was going to ask what the CRITICAL level was for... good thing i googled 
first: http://docs.python.org/2/howto/logging.html gives a good enough 
definition for each level.
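
for reference, a rough sketch of how those levels map onto intent, using plain
stdlib logging rather than any actual ceilometer code (names here are just for
illustration):

  import logging

  logging.basicConfig(level=logging.DEBUG)
  LOG = logging.getLogger("ceilometer.example")

  LOG.debug("fine-grained diagnostics, mostly useful to developers")
  LOG.info("confirmation that things are working as expected")
  LOG.warning("something unexpected happened, but the service carries on")
  LOG.error("a function failed; per this thread, expect an operator to be paged")
  LOG.critical("the service itself can no longer run")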

cheers,
gordon chung

openstack, ibm software standards
email: chungg [at] ca.ibm.com
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [ceilometer][qa] Punting ceilometer from whitelist

2013-12-03 Thread Eoghan Glynn


- Original Message -
 On 12/02/2013 10:24 AM, Julien Danjou wrote:
  On Fri, Nov 29 2013, David Kranz wrote:
 
  In preparing to fail builds with log errors, I have been trying to make
  things easier for projects by maintaining a whitelist. But these bugs in
  ceilometer are coming in so fast that I can't keep up. So I am just putting
  .* in the whitelist for any cases I find before gate failing is turned
  on, hopefully early this week.
  Following the chat on IRC and the bug reports, it seems this might come
  from the Tempest tests that are under review, as currently I don't
  think Ceilometer generates any errors, since it's not tested.
 
  So I'm not sure we want to whitelist anything?
 So I tested this with https://review.openstack.org/#/c/59443/. There are
 flaky log errors coming from ceilometer. You
 can see that the build at 12:27 passed, but the last build failed twice,
 each with a different set of errors. So the whitelist needs to remain
 and the ceilometer team should remove each entry when it is believed to
 be unnecessary.

Hi David,

Just looking into this issue.

So when you say the build failed, do you mean that errors were detected
in the ceilometer log files? (as opposed to a specific Tempest testcase
having reported a failure)

If that interpretation of build failure is correct, I think there's a simple
explanation for the compute agent ERRORs seen in the log file for the CI
build related to your patch referenced above, specifically:

  ERROR ceilometer.compute.pollsters.disk [-] Requested operation is not valid: domain is not running

The problem, I suspect, is a side-effect of a nova test that suspends the
instance in question, combined with a race between the ceilometer logic that
discovers the local instances via the nova-api and the individual
pollsters that call into the libvirt daemon to gather the disk stats etc.
It appears that the libvirt virDomainBlockStats() call fails with "domain
is not running" for suspended instances.

This would only occur intermittently as it requires the instance to
remain in the suspended state across a polling interval boundary. 

So we need to tighten up our logic there to avoid spewing needless errors
when a very normal event occurs (i.e. instance suspension).

I've filed a bug[1] with some ideas for addressing the issue - this
will require a bit of discussion before agreeing on a way forward, but I'll
prioritize getting this knocked on the head asap.

Cheers,
Eoghan

[1] https://bugs.launchpad.net/ceilometer/+bug/1257302
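
For illustration only, here's a minimal sketch (not the actual ceilometer
pollster code; the function and device names are assumptions) of the kind of
guard that would avoid the spurious ERROR: check the libvirt domain state
before asking for block stats, and treat a non-running domain as a normal,
skippable condition rather than an error.

  import libvirt

  def poll_disk_stats(conn, uuid, device="vda"):
      # Sketch only: guard against the suspend/poll race by checking the
      # domain state first and treating "not running" as a normal event.
      dom = conn.lookupByUUIDString(uuid)
      state, _reason = dom.state()
      if state != libvirt.VIR_DOMAIN_RUNNING:
          # Instance was suspended/stopped between discovery and polling;
          # skip quietly (DEBUG at most) instead of logging an ERROR.
          return None
      try:
          # Returns (rd_req, rd_bytes, wr_req, wr_bytes, errs)
          return dom.blockStats(device)
      except libvirt.libvirtError:
          # The domain can still change state after the check above,
          # so the call itself must tolerate failure too.
          return None

Something like the above would be called with a read-only connection, e.g.
conn = libvirt.openReadOnly("qemu:///system").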



  The tricky part is going to be for us to fix Ceilometer on one side and
  re-run Tempest reviews on the other side once a potential fix is merged.
 This is another use case for the promised
 dependent-patch-between-projects thing.
 
   -David
 
 
 

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [ceilometer][qa] Punting ceilometer from whitelist

2013-12-03 Thread Sean Dague
On 12/03/2013 09:30 AM, Eoghan Glynn wrote:
 
 
 - Original Message -
 On 12/02/2013 10:24 AM, Julien Danjou wrote:
 On Fri, Nov 29 2013, David Kranz wrote:

 In preparing to fail builds with log errors, I have been trying to make
 things easier for projects by maintaining a whitelist. But these bugs in
 ceilometer are coming in so fast that I can't keep up. So I am just putting
 .* in the whitelist for any cases I find before gate failing is turned
 on, hopefully early this week.
 Following the chat on IRC and the bug reports, it seems this might come
 from the Tempest tests that are under review, as currently I don't
 think Ceilometer generates any errors, since it's not tested.

 So I'm not sure we want to whitelist anything?
 So I tested this with https://review.openstack.org/#/c/59443/. There are
 flaky log errors coming from ceilometer. You
 can see that the build at 12:27 passed, but the last build failed twice,
 each with a different set of errors. So the whitelist needs to remain
 and the ceilometer team should remove each entry when it is believed to
 be unnecessary.
 
 Hi David,
 
 Just looking into this issue.
 
 So when you say the build failed, do you mean that errors were detected
 in the ceilometer log files? (as opposed to a specific Tempest testcase
 having reported a failure)
 
 If that interpretation of build failure is correct, I think there's a simple
 explanation for the compute agent ERRORs seen in the log file for the CI
 build related to your patch referenced above, specifically:
 
   ERROR ceilometer.compute.pollsters.disk [-] Requested operation is not valid: domain is not running
 
 The problem, I suspect, is a side-effect of a nova test that suspends the
 instance in question, combined with a race between the ceilometer logic that
 discovers the local instances via the nova-api and the individual
 pollsters that call into the libvirt daemon to gather the disk stats etc.
 It appears that the libvirt virDomainBlockStats() call fails with "domain
 is not running" for suspended instances.
 
 This would only occur intermittently as it requires the instance to
 remain in the suspended state across a polling interval boundary. 
 
 So we need to tighten up our logic there to avoid spewing needless errors
 when a very normal event occurs (i.e. instance suspension).

Definitely need to tighten things up.

As a developer, think about the fact that when you log something as
ERROR, you are expecting a cloud operator to be woken up in the middle
of the night with an email alert to go fix the cloud immediately. You
are intentionally ruining someone's weekend to fix this issue - RIGHT NOW!

That's why we are going to start failing jobs that add new ERRORs. We
have a whitelist for the cases where an ERROR is expected. But assume
that's not the normal path.
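
For anyone curious what the check amounts to, a rough sketch follows. This is
not the actual devstack-gate/tempest script; the whitelist format and the log
file name are made up for illustration. The idea is just: scan each service
log for ERROR lines and fail unless every one matches a whitelisted pattern.

  import re
  import sys

  # Hypothetical whitelist: allowed-ERROR regexes keyed by log file name.
  WHITELIST = {
      "screen-ceilometer-acompute.txt": [
          r"Requested operation is not valid: domain is not running",
      ],
  }

  def unexpected_errors(logname, path):
      allowed = [re.compile(p) for p in WHITELIST.get(logname, [])]
      bad = []
      with open(path) as f:
          for line in f:
              if " ERROR " in line and not any(p.search(line) for p in allowed):
                  bad.append(line.rstrip())
      return bad

  if __name__ == "__main__":
      # usage: python check_logs.py screen-ceilometer-acompute.txt /path/to/log
      failures = unexpected_errors(sys.argv[1], sys.argv[2])
      for line in failures:
          print(line)
      sys.exit(1 if failures else 0)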

-Sean

-- 
Sean Dague
http://dague.net

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [ceilometer][qa] Punting ceilometer from whitelist

2013-12-03 Thread David Kranz

On 12/03/2013 09:30 AM, Eoghan Glynn wrote:


- Original Message -

On 12/02/2013 10:24 AM, Julien Danjou wrote:

On Fri, Nov 29 2013, David Kranz wrote:


In preparing to fail builds with log errors, I have been trying to make
things easier for projects by maintaining a whitelist. But these bugs in
ceilometer are coming in so fast that I can't keep up. So I am just putting
.* in the whitelist for any cases I find before gate failing is turned
on, hopefully early this week.

Following the chat on IRC and the bug reports, it seems this might come
from the Tempest tests that are under review, as currently I don't
think Ceilometer generates any errors, since it's not tested.

So I'm not sure we want to whitelist anything?

So I tested this with https://review.openstack.org/#/c/59443/. There are
flaky log errors coming from ceilometer. You
can see that the build at 12:27 passed, but the last build failed twice,
each with a different set of errors. So the whitelist needs to remain
and the ceilometer team should remove each entry when it is believed to
be unnecessary.

Hi David,

Just looking into this issue.

So when you say the build failed, do you mean that errors were detected
in the ceilometer log files? (as opposed to a specific Tempest testcase
having reported a failure)
Yes, exactly. This patch removed the whitelist entries for ceilometer 
and so those errors then failed the build.


If that interpretation of build failure is correct, I think there's a simple
explanation for the compute agent ERRORs seen in the log file for the CI
build related to your patch referenced above, specifically:

   ERROR ceilometer.compute.pollsters.disk [-] Requested operation is not valid: domain is not running

The problem, I suspect, is a side-effect of a nova test that suspends the
instance in question, combined with a race between the ceilometer logic that
discovers the local instances via the nova-api and the individual
pollsters that call into the libvirt daemon to gather the disk stats etc.
It appears that the libvirt virDomainBlockStats() call fails with "domain
is not running" for suspended instances.

This would only occur intermittently as it requires the instance to
remain in the suspended state across a polling interval boundary.

So we need to tighten up our logic there to avoid spewing needless errors
when a very normal event occurs (i.e. instance suspension).

I've filed a bug[1] with some ideas for addressing the issue - this
will require a bit of discussion before agreeing on a way forward, but I'll
prioritize getting this knocked on the head asap.
Great! Thanks. The change I pushed yesterday should help prevent this 
sort of thing from creeping in across all projects. But as Julien 
observed, the process of removing entries from the whitelist that are no 
longer needed due to bug fixes is neither easy nor automatic. I'm trying 
to put together a script that will check the whitelist entries against 
the last two weeks of builds using logstash, but that is not so simple 
since general regexps cannot be used with logstash.
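
To make the idea concrete, here's a rough sketch of one way around that
limitation: pull recent ERROR lines with a broad query_string search and apply
the actual whitelist regex client-side. The endpoint, index, and field names
below are guesses for illustration, not the real logstash.openstack.org layout.

  import re
  from elasticsearch import Elasticsearch

  es = Elasticsearch(["http://logstash.example.org:9200"])  # hypothetical endpoint

  def whitelist_entry_still_hit(pattern, days=14, size=1000):
      # query_string can't take an arbitrary regex, so fetch candidate
      # ERROR lines broadly and apply the real regex locally.
      resp = es.search(index="logstash-*", body={
          "size": size,
          "query": {"bool": {"must": [
              {"query_string": {"query": 'message:"ERROR"'}},
              {"range": {"@timestamp": {"gte": "now-%dd" % days}}},
          ]}},
      })
      rx = re.compile(pattern)
      return any(rx.search(hit["_source"].get("message", ""))
                 for hit in resp["hits"]["hits"])

An entry that gets no hits over the window would be a candidate for removal
from the whitelist.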



 -David

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


Re: [openstack-dev] [ceilometer][qa] Punting ceilometer from whitelist

2013-12-02 Thread David Kranz

On 12/02/2013 10:24 AM, Julien Danjou wrote:

On Fri, Nov 29 2013, David Kranz wrote:


In preparing to fail builds with log errors, I have been trying to make
things easier for projects by maintaining a whitelist. But these bugs in
ceilometer are coming in so fast that I can't keep up. So I am just putting
.* in the whitelist for any cases I find before gate failing is turned
on, hopefully early this week.

Following the chat on IRC and the bug reports, it seems this might come
from the Tempest tests that are under review, as currently I don't
think Ceilometer generates any errors, since it's not tested.

So I'm not sure we want to whitelist anything?
So I tested this with https://review.openstack.org/#/c/59443/. There are 
flaky log errors coming from ceilometer. You
can see that the build at 12:27 passed, but the last build failed twice, 
each with a different set of errors. So the whitelist needs to remain 
and the ceilometer team should remove each entry when it is believed to 
be unnecessary.


The tricky part is going to be for us to fix Ceilometer on one side and
re-run Tempest reviews on the other side once a potential fix is merged.
This is another use case for the promised 
dependent-patch-between-projects thing.


 -David





___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev


[openstack-dev] [ceilometer][qa] Punting ceilometer from whitelist

2013-11-29 Thread David Kranz
In preparing to fail builds with log errors, I have been trying to make 
things easier for projects by maintaining a whitelist. But these bugs in 
ceilometer are coming in so fast that I can't keep up. So I am just 
putting .* in the whitelist for any cases I find before gate failing 
is turned on, hopefully early this week.


 -David

___
OpenStack-dev mailing list
OpenStack-dev@lists.openstack.org
http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack-dev