Re: [openstack-dev] [ceilometer][qa] Punting ceilometer from whitelist
> As a developer think about the fact that when you log something as
> ERROR, you are expecting a cloud operator to be woken up in the middle
> of the night with an email alert to go fix the cloud immediately. You
> are intentionally ruining someone's weekend to fix this issue - RIGHT
> NOW!

I was going to ask what the CRITICAL level was for... good thing I
googled first: http://docs.python.org/2/howto/logging.html

That seems like a good enough definition for each level.

cheers,
gordon chung
openstack, ibm software standards
email: chungg [at] ca.ibm.com
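For reference, the level definitions in that howto map directly onto
the stdlib API. A minimal sketch of the distinction being drawn here
(the logger name, messages, and instance id are illustrative, not real
ceilometer code):

    import logging

    logging.basicConfig(level=logging.DEBUG)
    LOG = logging.getLogger("ceilometer.compute.pollsters.disk")

    instance_id = "fake-uuid"  # placeholder for the example

    # DEBUG/INFO: normal operation -- nobody gets paged.
    LOG.info("polling disk stats for instance %s", instance_id)

    # WARNING: something unexpected, but the service keeps working.
    # Arguably the right level for a suspended domain.
    LOG.warning("instance %s not running, skipping disk poll", instance_id)

    # ERROR: a real failure an operator should investigate.
    LOG.error("unable to connect to the libvirt daemon")

    # CRITICAL: the program itself may be unable to continue.
    LOG.critical("compute agent cannot start: no hypervisor connection")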
Re: [openstack-dev] [ceilometer][qa] Punting ceilometer from whitelist
----- Original Message -----
> On 12/02/2013 10:24 AM, Julien Danjou wrote:
>> On Fri, Nov 29 2013, David Kranz wrote:
>>> In preparing to fail builds with log errors I have been trying to
>>> make things easier for projects by maintaining a whitelist. But
>>> these bugs in ceilometer are coming in so fast that I can't keep
>>> up. So I am just putting .* in the whitelist for any cases I find
>>> before gate failing is turned on, hopefully early this week.
>>
>> Following the chat on IRC and the bug reports, it seems this might
>> come from the tempest tests that are under review, as currently I
>> don't think Ceilometer generates any error since it's not tested.
>> So I'm not sure we want to whitelist anything?
>
> So I tested this with https://review.openstack.org/#/c/59443/. There
> are flaky log errors coming from ceilometer. You can see that the
> build at 12:27 passed, but the last build failed twice, each with a
> different set of errors. So the whitelist needs to remain, and the
> ceilometer team should remove each entry when it is believed to be
> unnecessary.

Hi David,

Just looking into this issue.

So when you say the build failed, do you mean that errors were detected
in the ceilometer log files (as opposed to a specific Tempest testcase
having reported a failure)?

If that interpretation of build failure is correct, I think there's a
simple explanation for the compute agent ERRORs seen in the log file
for the CI build related to your patch referenced above, specifically:

  ERROR ceilometer.compute.pollsters.disk [-] Requested operation is
  not valid: domain is not running

The problem, I suspect, is a side-effect of a nova test that suspends
the instance in question, followed by a race between the ceilometer
logic that discovers the local instances via the nova-api and the
individual pollsters that call into the libvirt daemon to gather the
disk stats etc. It appears that the libvirt virDomainBlockStats() call
fails with "domain is not running" for suspended instances. This would
only occur intermittently, as it requires the instance to remain in
the suspended state across a polling interval boundary.

So we need to tighten up our logic there to avoid spewing needless
errors when a very normal event occurs (i.e. instance suspension).

I've filed a bug[1] with some ideas for addressing the issue - this
will require a bit of discussion before agreeing a way forward, but
I'll prioritize getting this knocked on the head asap.

Cheers,
Eoghan

[1] https://bugs.launchpad.net/ceilometer/+bug/1257302

> The tricky part is going to be for us to fix Ceilometer on one side
> and re-run Tempest reviews on the other side once a potential fix is
> merged. This is another use case for the promised
> dependent-patch-between-projects thing.
>
> -David
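The general shape of such a fix -- check liveness before polling, and
treat the residual race as a skip rather than an ERROR -- might look
like the sketch below. It uses the libvirt Python binding, but the
function name and structure are hypothetical; this is not the actual
ceilometer pollster code, nor necessarily the approach the bug will
settle on:

    import libvirt

    def poll_disk_stats(conn, instance_uuid, device="vda"):
        # conn is an open libvirt connection, e.g.
        # libvirt.open("qemu:///system")
        dom = conn.lookupByUUIDString(instance_uuid)

        # A suspended instance is a normal lifecycle state, not an
        # error: skip this polling cycle rather than letting
        # blockStats() raise.
        if not dom.isActive():
            return None

        try:
            # Returns (rd_req, rd_bytes, wr_req, wr_bytes, errs).
            return dom.blockStats(device)
        except libvirt.libvirtError as e:
            # The domain can stop between the isActive() check and the
            # stats call -- the race described above. "Requested
            # operation is not valid" maps to VIR_ERR_OPERATION_INVALID;
            # treat it as a skip, not an operator-visible ERROR.
            if e.get_error_code() == libvirt.VIR_ERR_OPERATION_INVALID:
                return None
            raise

A caller would treat a None return as "no sample this interval" and
log it at DEBUG, if at all.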
Re: [openstack-dev] [ceilometer][qa] Punting ceilometer from whitelist
On 12/03/2013 09:30 AM, Eoghan Glynn wrote:
> ----- Original Message -----
>> On 12/02/2013 10:24 AM, Julien Danjou wrote:
>>> On Fri, Nov 29 2013, David Kranz wrote:
>>>> In preparing to fail builds with log errors I have been trying to
>>>> make things easier for projects by maintaining a whitelist. But
>>>> these bugs in ceilometer are coming in so fast that I can't keep
>>>> up. So I am just putting .* in the whitelist for any cases I find
>>>> before gate failing is turned on, hopefully early this week.
>>>
>>> Following the chat on IRC and the bug reports, it seems this might
>>> come from the tempest tests that are under review, as currently I
>>> don't think Ceilometer generates any error since it's not tested.
>>> So I'm not sure we want to whitelist anything?
>>
>> So I tested this with https://review.openstack.org/#/c/59443/. There
>> are flaky log errors coming from ceilometer. You can see that the
>> build at 12:27 passed, but the last build failed twice, each with a
>> different set of errors. So the whitelist needs to remain, and the
>> ceilometer team should remove each entry when it is believed to be
>> unnecessary.
>
> Hi David,
>
> Just looking into this issue.
>
> So when you say the build failed, do you mean that errors were
> detected in the ceilometer log files (as opposed to a specific
> Tempest testcase having reported a failure)?
>
> If that interpretation of build failure is correct, I think there's a
> simple explanation for the compute agent ERRORs seen in the log file
> for the CI build related to your patch referenced above, specifically:
>
>   ERROR ceilometer.compute.pollsters.disk [-] Requested operation is
>   not valid: domain is not running
>
> The problem, I suspect, is a side-effect of a nova test that suspends
> the instance in question, followed by a race between the ceilometer
> logic that discovers the local instances via the nova-api and the
> individual pollsters that call into the libvirt daemon to gather the
> disk stats etc. It appears that the libvirt virDomainBlockStats()
> call fails with "domain is not running" for suspended instances. This
> would only occur intermittently, as it requires the instance to
> remain in the suspended state across a polling interval boundary.
>
> So we need to tighten up our logic there to avoid spewing needless
> errors when a very normal event occurs (i.e. instance suspension).

Definitely need to tighten things up.

As a developer, think about the fact that when you log something as
ERROR, you are expecting a cloud operator to be woken up in the middle
of the night with an email alert to go fix the cloud immediately. You
are intentionally ruining someone's weekend to fix this issue - RIGHT
NOW!

Hence why we are going to start failing jobs that add new ERRORs. We
have a whitelist for the times when this should be the case, but
assume that's not the normal path.

-Sean

--
Sean Dague
http://dague.net
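The mechanics behind failing those jobs are easy to picture: scan each
service log for ERROR lines and fail unless every one matches a
whitelisted pattern. A sketch only -- the whitelist schema and service
name here are illustrative, not the actual tempest log-checking tool:

    import re

    # Per-service regexes for ERROR lines that are known bugs and
    # tolerated for now; the format is hypothetical.
    WHITELIST = {
        "ceilometer-acompute": [
            re.compile(r"Requested operation is not valid: "
                       r"domain is not running"),
        ],
    }

    def new_errors(service, log_lines):
        # Return the ERROR lines not covered by the whitelist; a job
        # would fail if this is non-empty for any service log.
        allowed = WHITELIST.get(service, [])
        return [line for line in log_lines
                if " ERROR " in line
                and not any(rx.search(line) for rx in allowed)]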
Re: [openstack-dev] [ceilometer][qa] Punting ceilometer from whitelist
On 12/03/2013 09:30 AM, Eoghan Glynn wrote:
> ----- Original Message -----
>> On 12/02/2013 10:24 AM, Julien Danjou wrote:
>>> On Fri, Nov 29 2013, David Kranz wrote:
>>>> In preparing to fail builds with log errors I have been trying to
>>>> make things easier for projects by maintaining a whitelist. But
>>>> these bugs in ceilometer are coming in so fast that I can't keep
>>>> up. So I am just putting .* in the whitelist for any cases I find
>>>> before gate failing is turned on, hopefully early this week.
>>>
>>> Following the chat on IRC and the bug reports, it seems this might
>>> come from the tempest tests that are under review, as currently I
>>> don't think Ceilometer generates any error since it's not tested.
>>> So I'm not sure we want to whitelist anything?
>>
>> So I tested this with https://review.openstack.org/#/c/59443/. There
>> are flaky log errors coming from ceilometer. You can see that the
>> build at 12:27 passed, but the last build failed twice, each with a
>> different set of errors. So the whitelist needs to remain, and the
>> ceilometer team should remove each entry when it is believed to be
>> unnecessary.
>
> Hi David,
>
> Just looking into this issue.
>
> So when you say the build failed, do you mean that errors were
> detected in the ceilometer log files (as opposed to a specific
> Tempest testcase having reported a failure)?

Yes, exactly. This patch removed the whitelist entries for ceilometer,
and so those errors then failed the build.

> If that interpretation of build failure is correct, I think there's a
> simple explanation for the compute agent ERRORs seen in the log file
> for the CI build related to your patch referenced above, specifically:
>
>   ERROR ceilometer.compute.pollsters.disk [-] Requested operation is
>   not valid: domain is not running
>
> The problem, I suspect, is a side-effect of a nova test that suspends
> the instance in question, followed by a race between the ceilometer
> logic that discovers the local instances via the nova-api and the
> individual pollsters that call into the libvirt daemon to gather the
> disk stats etc. It appears that the libvirt virDomainBlockStats()
> call fails with "domain is not running" for suspended instances. This
> would only occur intermittently, as it requires the instance to
> remain in the suspended state across a polling interval boundary.
>
> So we need to tighten up our logic there to avoid spewing needless
> errors when a very normal event occurs (i.e. instance suspension).
>
> I've filed a bug[1] with some ideas for addressing the issue - this
> will require a bit of discussion before agreeing a way forward, but
> I'll prioritize getting this knocked on the head asap.

Great! Thanks. The change I pushed yesterday should help prevent this
sort of thing from creeping in across all projects. But as Julien
observed, the process of removing entries from the whitelist that are
no longer needed due to bug fixes is not so easy or automatic. I'm
trying to put together a script that will check the whitelist entries
against the last two weeks of builds using logstash, but that is not
simple to do, since general regexps cannot be used in logstash
queries.

-David
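For what it's worth, one way such a script could work around the
regexp limitation is to derive a literal fragment from each whitelist
pattern for the server-side full-text query, then apply the real regex
client-side to the hits that come back. A sketch under stated
assumptions: the endpoint URL, query shape, and field names are
guesses at the Elasticsearch service behind logstash, not a known API:

    import json
    import re
    import urllib2

    # Assumed endpoint; the real CI logstash/Elasticsearch URL may differ.
    LOGSTASH_URL = "http://logstash.openstack.org/elasticsearch/_search"

    def literal_fragment(pattern):
        # Longest metacharacter-free substring of the regex, usable as
        # a plain phrase query.
        pieces = re.split(r"[\\^\$\.\*\+\?\[\]\(\)\|\{\}]+", pattern)
        return max(pieces, key=len).strip()

    def still_firing(pattern):
        # Server side: literal phrase search only, no general regexps.
        query = {
            "query": {"query_string": {
                "query": '"%s"' % literal_fragment(pattern)}},
            "size": 200,
        }
        req = urllib2.Request(LOGSTASH_URL, json.dumps(query),
                              {"Content-Type": "application/json"})
        hits = json.load(urllib2.urlopen(req))["hits"]["hits"]
        # Client side: the real regex decides if the entry is still needed.
        rx = re.compile(pattern)
        return any(rx.search(h["_source"].get("message", ""))
                   for h in hits)

Entries that never fire across a couple of weeks of builds would then
be candidates for removal. (Pure .* entries defeat this, of course,
since they contain no literal fragment to search for.)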
Re: [openstack-dev] [ceilometer][qa] Punting ceilometer from whitelist
On 12/02/2013 10:24 AM, Julien Danjou wrote:
> On Fri, Nov 29 2013, David Kranz wrote:
>> In preparing to fail builds with log errors I have been trying to
>> make things easier for projects by maintaining a whitelist. But
>> these bugs in ceilometer are coming in so fast that I can't keep up.
>> So I am just putting .* in the whitelist for any cases I find before
>> gate failing is turned on, hopefully early this week.
>
> Following the chat on IRC and the bug reports, it seems this might
> come from the tempest tests that are under review, as currently I
> don't think Ceilometer generates any error since it's not tested. So
> I'm not sure we want to whitelist anything?

So I tested this with https://review.openstack.org/#/c/59443/. There
are flaky log errors coming from ceilometer. You can see that the
build at 12:27 passed, but the last build failed twice, each with a
different set of errors. So the whitelist needs to remain, and the
ceilometer team should remove each entry when it is believed to be
unnecessary.

The tricky part is going to be for us to fix Ceilometer on one side
and re-run Tempest reviews on the other side once a potential fix is
merged. This is another use case for the promised
dependent-patch-between-projects thing.

-David
[openstack-dev] [ceilometer][qa] Punting ceilometer from whitelist
In preparing to fail builds with log errors, I have been trying to
make things easier for projects by maintaining a whitelist. But these
bugs in ceilometer are coming in so fast that I can't keep up. So I am
just putting .* in the whitelist for any cases I find before gate
failing is turned on, hopefully early this week.

-David
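For concreteness, "putting .* in the whitelist" means an entry whose
message pattern matches every ERROR a given module emits -- a blanket
punt until the underlying bugs are fixed. A hypothetical illustration
of the shape of such entries (the real whitelist file the gate
consumes may be structured differently):

    # Blanket entries: match any ERROR message from these modules.
    # Each should be deleted once the corresponding bug is fixed.
    CEILOMETER_WHITELIST = [
        {"module": "ceilometer.compute.pollsters.disk", "message": ".*"},
        {"module": "ceilometer.compute.pollsters.net", "message": ".*"},
    ]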