On 9/28/2018 3:12 PM, Clark Boylan wrote:
I was asked to write a follow-up to this, as the long Zuul queues have persisted 
through this week, largely because the situation from last week hasn't changed 
much. The upgraded cloud region was out of rotation while we worked around a 
network configuration bug, and once that was addressed we ran into neutron port 
assignment and deletion issues. We think both of these are now fixed, and we are 
running in that region again as of today.

The other good news is that our classification rate is up significantly. We can 
use that information to go through the top identified gate bugs:

Network connectivity issues to test nodes [2]. This is currently at the top of 
the list, but I think its impact is relatively small. What happens here is that 
a job fails to connect to its test node early in the pre-run playbook; because 
the failure is in the pre-run step, Zuul reruns the job for us. Prior to Zuul v3 
we had nodepool run a ready script before marking test nodes as ready, and that 
script would have caught and filtered out these broken nodes early. We now only 
notice them late, during the pre-run of a job.
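
For illustration, here is a rough sketch (plain Python, not the actual nodepool 
ready script) of the kind of reachability probe that used to weed out broken 
nodes before jobs ever landed on them; the addresses and timeout are placeholder 
values:

    # Hypothetical probe of a test node's SSH port, roughly what the old
    # ready script gave us before nodes were marked ready.
    import socket

    def node_is_reachable(host, port=22, timeout=10):
        """Return True if a TCP connection to the node's SSH port succeeds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        # Placeholder addresses; a real check would iterate over freshly
        # booted nodes and recycle any that fail the probe.
        for node in ("198.51.100.10", "198.51.100.11"):
            print(node, "ok" if node_is_reachable(node) else "unreachable")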

Pip fails to find a distribution for a package [3]. Earlier in the week we had 
the in-region mirror fail in two different regions for unrelated reasons. Those 
mirrors were fixed, and the only other hits for this bug come from ARA, which 
tried to install the 'black' package on Python 3.5 even though that package 
requires Python >= 3.6.
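
To make that failure mode concrete, here is a minimal sketch (the package list 
and the runtime guard are illustrative, not how ARA actually declares its 
dependencies): on Python 3.5 pip will not consider 'black' releases whose 
Requires-Python metadata says >= 3.6, so an unconditional install fails with 
"No matching distribution found"; guarding the dependency on the interpreter 
version avoids that.

    # Illustrative only: skip 'black' on interpreters older than 3.6 so pip
    # never has to resolve a distribution that excludes this Python version.
    import subprocess
    import sys

    deps = ["ara"]  # placeholder base dependency
    if sys.version_info >= (3, 6):
        deps.append("black")  # publishes releases for Python >= 3.6 only

    subprocess.check_call([sys.executable, "-m", "pip", "install"] + deps)

An environment marker in a requirements file (restricting the entry to 
python_version >= "3.6") accomplishes the same thing declaratively.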

yum, no more mirrors to try [4]. At first glance this appears to be an 
infrastructure issue because the mirror isn't serving content to yum. On 
further investigation it turned out to be a DNS resolution issue caused by the 
installation of Designate in the TripleO jobs. TripleO is aware of this issue 
and is working to correct it.
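
A quick way to tell those two failure modes apart (this is just a diagnostic 
sketch, not the tooling TripleO is using, and the mirror hostname is a 
placeholder) is to check name resolution separately from the HTTP fetch:

    # Distinguish "mirror is down" from "node cannot resolve the mirror's name".
    import socket

    mirror = "mirror.regionone.example.org"  # placeholder hostname
    try:
        addr = socket.gethostbyname(mirror)
        print("DNS ok: %s resolves to %s; check HTTP next" % (mirror, addr))
    except socket.gaierror as err:
        print("DNS resolution failed for %s: %s" % (mirror, err))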

Stackviz failing on py3 [5]. This is a real bug in stackviz caused by subunit 
data being binary rather than UTF-8 encoded strings. I've written a fix for this 
problem at https://review.openstack.org/606184, but in doing so found that this 
was a known issue back in March and there was already a proposed fix, 
https://review.openstack.org/#/c/555388/3. It would be helpful if the QA team 
could take care of this project and get a fix in. Otherwise, we should consider 
disabling stackviz on our tempest jobs (though the output from stackviz is often 
useful).
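
The general shape of the fix is just to decode the bytes coming out of the 
subunit stream before treating them as text. This is a hedged sketch of that 
idea, not the actual stackviz patch; the helper name is made up:

    # Decode python-subunit byte values to text under Python 3.
    def as_text(value, encoding="utf-8"):
        """Return value as str, decoding bytes if necessary."""
        if isinstance(value, bytes):
            return value.decode(encoding, errors="replace")
        return value

    # e.g. a test id coming out of a subunit v2 stream as bytes
    print(as_text(b"tempest.api.compute.test_servers"))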

There are other bugs being tracked by elastic-recheck. Some are bugs in the 
OpenStack software, and I'm sure some are also bugs in the infrastructure. I 
have not yet had the time to work through the others, though. It would be 
helpful if project teams could prioritize debugging and fixing these issues.

[2] http://status.openstack.org/elastic-recheck/gate.html#1793370
[3] http://status.openstack.org/elastic-recheck/gate.html#1449136
[4] http://status.openstack.org/elastic-recheck/gate.html#1708704
[5] http://status.openstack.org/elastic-recheck/gate.html#1758054

Thanks for the update, Clark.

Another thing this week: logstash indexing is behind by at least half a day. That's because the indexing workers were hitting OOM errors on giant screen log files that aren't formatted properly for our filters, which are meant to index only INFO-and-above log lines; instead the workers were trying to index the entire files, some of which are 33MB *compressed*. Indexing of the identified problematic screen logs has therefore been disabled:

https://review.openstack.org/#/c/606197/
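
For context, the indexing pipeline is only meant to ship INFO-and-above lines 
from oslo-formatted logs; when a screen log doesn't match that format, nothing 
gets filtered and the whole file is indexed. This is a rough sketch of that 
filtering idea (the regex and function are illustrative, not the actual logstash 
configuration):

    import re

    # Matches an oslo.log style prefix, e.g.
    # "2018-09-28 15:12:01.123 12345 INFO nova.compute ..."
    OSLO_LINE = re.compile(
        r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+ \d+ "
        r"(?P<level>DEBUG|INFO|WARNING|ERROR|CRITICAL) "
    )

    def lines_to_index(log_lines):
        """Yield only properly formatted lines at INFO level or above."""
        for line in log_lines:
            match = OSLO_LINE.match(line)
            if match and match.group("level") != "DEBUG":
                yield line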

I've reported bugs against each related project.

--

Thanks,

Matt
