On 09/14/2016 11:05 PM, Mike Bayer wrote:

Are *these* errors also new as of version 4.13.3 of oslo.db?   Because
here I have more suspicion of one particular oslo.db change.

The version that actually changed the provisioning code, or anything else to do with this area, is 4.12.0. So if you didn't see any problem with 4.12, then oslo.db is almost certainly not the cause; the code changes subsequent to 4.12 have no relationship to any system used by the opportunistic test base. If oslo.db is the factor, I would expect 4.12 to be the version where things started changing, since that is where the small changes to the provisioning code landed.

But at the same time, I'm combing through the quite small adjustments to the provisioning code as of 4.12.0 and I'm not seeing what could introduce this issue. That said, we really should never see the kind of error we're seeing, with the "DROP DATABASE" failing because the database remains in use; however, that can be a side effect of the test itself leaving a different connection in a bad state, not closed and with its locks still held.
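
To illustrate the kind of interaction I mean, here's a standalone sketch
(not oslo.db code; hostnames, credentials, and database/table names are
made up) of a leaked connection keeping the test database "in use" so
that the teardown-time DROP DATABASE can't proceed:

    import pymysql

    # A connection the test never closed: with autocommit off (PyMySQL's
    # default), the SELECT opens a transaction and holds a metadata lock
    # on the table until COMMIT/ROLLBACK or close().
    leaked = pymysql.connect(host="localhost", user="openstack_citest",
                             password="openstack_citest",
                             db="test_migrations_abc")
    leaked.cursor().execute("SELECT * FROM instances LIMIT 1")

    # Teardown then tries to drop the test database on a different
    # connection.  The DROP has to wait on the metadata lock above, so it
    # blocks (and, depending on lock_wait_timeout, eventually errors out)
    # even though the test itself is long since finished.
    admin = pymysql.connect(host="localhost", user="root",
                            password="secret")
    admin.cursor().execute("DROP DATABASE test_migrations_abc")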

That is, there are certainly poor failure modes here; I just can't see anything in 4.13 or even 4.12 that would suddenly introduce them.

By all means, if these failures disappear when we go back from 4.12 to 4.11, that is where we need to look next cycle. From my point of view, the failures disappearing on the downgrade would be the best evidence that the oslo.db version is the factor.


On 09/14/2016 10:48 PM, Mike Bayer wrote:


On 09/14/2016 07:04 PM, Alan Pevec wrote:
Oslo.db 4.13.3 did hit the scene about the time this showed up. So I
think we need to strongly consider blocking it and revisiting these
issues post-Newton.

So that means reverting all stable/newton changes; previous 4.13.x
releases have already been blocked: https://review.openstack.org/365565
How would we proceed? Do we need to revert all backports on
stable/newton?

In case my previous email wasn't clear, I don't *yet* see evidence that
the recent 4.13.3 release of oslo.db is the cause of this problem.
However, that is based only on what I see in this stack trace, which is
that the test framework is acting predictably (though erroneously) in
response to the timeout condition that is occurring.  I don't (yet) see
a reason that the same effect would not also occur prior to 4.13.3 in
the face of a signal pre-empting the work of the PyMySQL driver
mid-stream.  However, this assumes that the timeout condition itself is
not a product of the current oslo.db version, and that is not yet known.
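
For what it's worth, the failure mode I have in mind doesn't need
oslo.db at all to demonstrate.  The following is a minimal standalone
sketch (credentials are made up) of a signal-driven timeout pre-empting
PyMySQL in the middle of a query, the way a test timeout fixture would,
after which the same connection is no longer usable:

    import signal

    import pymysql

    class FakeTestTimeout(Exception):
        pass

    def _on_alarm(signum, frame):
        raise FakeTestTimeout()

    signal.signal(signal.SIGALRM, _on_alarm)

    conn = pymysql.connect(host="localhost", user="openstack_citest",
                           password="openstack_citest",
                           db="openstack_citest")
    cur = conn.cursor()

    signal.alarm(1)  # fires while the statement below is still running
    try:
        cur.execute("SELECT SLEEP(10)")  # pre-empted mid-protocol
    except FakeTestTimeout:
        pass
    finally:
        signal.alarm(0)

    # The driver is now out of sync with the server; reusing the
    # connection typically fails with interface / packet-sequence style
    # errors (or returns the wrong result) rather than running the
    # statement, which at the SQLAlchemy level surfaces as the kind of
    # "resource closed" error seen in the trace.
    cur.execute("SELECT 1")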

There's a list of questions that should all be answerable and that
could give us some hints here.

There are two parts to the error in the logs.  There's the "timeout"
condition, and then there is the bad reaction of the PyMySQL driver and
the test framework once the operation is interrupted within the test.

* Prior to oslo.db 4.13.3, did we ever see this "timeout" condition
occur?   If so, was it also accompanied by the same "resource closed"
condition or did this second part of the condition only appear at 4.13.3?

* Did we see a similar "timeout" / "resource closed" combination prior
to 4.13.3, just with less frequency?

* Was the version of PyMySQL also recently upgraded (I'm assuming this
environment has been on PyMySQL for a long time at this point)?   What
was the version change if so?  Especially if we previously saw "timeout"
but no "resource closed", perhaps an older version of PyMySQL didn't
react in this way?

* Was the version of MySQL running in the CI environment changed?   What
was the version change if so?    Were there any configuration changes,
such as transaction isolation, memory, or process settings?

* Have there been changes to the "timeout" logic itself in the test
suite, e.g. whatever it is that sets up fixtures.Timeout()?  (A rough
sketch of the usual wiring follows this list.)  Or some change that
alters how teardown of tests occurs when a test is interrupted via this
timeout?

* What is the magnitude of the "timeout" this fixture is using: is it on
the order of seconds, minutes, or hours?

* If many minutes or hours, can the test suite be observed to be stuck
on this test?   Has someone tried to run a "SHOW PROCESSLIST" while this
condition is occurring, to see what SQL is pausing?  (There's a quick
snippet for this after the list.)

* Has there been some change such that the migration tests are running
against non-empty tables or tables with much more data than was present
before?

* Is this failure only present within the Nova test suite or has it been
observed in the test suites of other projects?

* Is this failure present only on the "database migration" test suite or
is it present in other opportunistic tests, for Nova and others?

* Have there been new database migrations added to Nova which are being
exercised here and may be involved?
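
Regarding the fixtures.Timeout() question above, the wiring I would
expect to find in the base test class looks roughly like the following
(a sketch of the common OpenStack pattern, not Nova's actual code); the
gentle=True mode delivers the timeout via SIGALRM and raises an
exception inside the test, which is exactly the kind of mid-query
pre-emption described earlier:

    import os

    import fixtures
    import testtools

    class BaseTestCase(testtools.TestCase):
        def setUp(self):
            super(BaseTestCase, self).setUp()
            try:
                timeout = int(os.environ.get('OS_TEST_TIMEOUT', 0))
            except ValueError:
                timeout = 0
            if timeout > 0:
                # gentle=True raises fixtures.TimeoutException in the
                # test via SIGALRM rather than killing the process.
                self.useFixture(fixtures.Timeout(timeout, gentle=True))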

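And for the "SHOW PROCESSLIST" question, something as simple as the
snippet below, run from a second process while the suite appears stuck,
would tell us which statement is actually waiting (credentials are made
up):

    import pymysql

    # Peek at what the server is doing while the test suite is stuck.
    conn = pymysql.connect(host="localhost", user="root",
                           password="secret")
    cur = conn.cursor()
    cur.execute("SHOW FULL PROCESSLIST")
    for row in cur.fetchall():
        # columns: Id, User, Host, db, Command, Time, State, Info
        print(row)
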
I'm not sure how much of an inconvenience it is to downgrade oslo.db. If
downgrading it is feasible, that would at least be a way to eliminate it
as a possibility if these same failures continue to occur, or a way to
confirm its involvement if they disappear.   But if downgrading is
disruptive then there are other things to look at in order to have a
better chance at predicting its involvement.




Cheers,
Alan
