On Tue, Nov 25, 2014 at 01:22:01PM -0800, Vishvananda Ishaya wrote:
> 
> On Nov 25, 2014, at 7:29 AM, Matt Riedemann <mrie...@linux.vnet.ibm.com> 
> wrote:
> 
> > 
> > 
> > On 11/25/2014 9:03 AM, Matt Riedemann wrote:
> >> 
> >> 
> >> On 11/25/2014 8:11 AM, Sean Dague wrote:
> >>> There is currently a review stream coming into Tempest to add Cinder v2
> >>> tests in addition to the Cinder v1 tests. At the same time the currently
> >>> biggest race fail in the gate related to the projects is
> >>> http://status.openstack.org/elastic-recheck/#1373513 - which is cinder
> >>> related.
> >>> 
> >>> I believe these 2 facts are coupled. The number of volume tests we have
> >>> in tempest is somewhat small, and as such the likelihood of them running
> >>> simultaneously is also small. However, the fact that we see more of
> >>> these race fails as the number of volume tests goes up suggests that
> >>> what's actually happening is that two volume operations that aren't
> >>> safe to run at the same time are in fact running concurrently.
> >>> 
> >>> This remains critical - https://bugs.launchpad.net/cinder/+bug/1373513 -
> >>> with no assignee.
> >>> 
> >>> So we really need someone dedicated to digging into this (the last bug
> >>> update with any code was a month ago); otherwise we need to stop adding
> >>> these tests to Tempest, and honestly start skipping the volume tests if
> >>> we can't get repeatable success.
> >>> 
> >>>    -Sean
> >>> 
> >> 
> >> I just put up an e-r query for a newly opened bug
> >> https://bugs.launchpad.net/cinder/+bug/1396186 this morning, it looks
> >> similar to bug 1373513 but without the blocked task error in syslog.
> >> 
> >> There is a three-minute gap between when the volume is being deleted in
> >> the c-vol logs and when we see the volume uuid logged again, at which
> >> point tempest has already timed out waiting for the delete to complete.
> >> 
> >> We should at least get some patches to add diagnostic logging in these
> >> delete flows (or in periodic tasks that use the same locks or low-level
> >> I/O-bound commands?) to try to pinpoint these failures.
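
For illustration, the kind of diagnostic timing being asked for here could be
as simple as the sketch below. This is a minimal example using only the
standard library, not Cinder's actual exec helpers; the function name and the
10-second threshold are made up:

    import logging
    import subprocess
    import time

    LOG = logging.getLogger(__name__)

    def run_and_time(cmd):
        """Run a command and log a warning if it is unexpectedly slow."""
        start = time.time()
        out = subprocess.check_output(cmd)
        elapsed = time.time() - start
        if elapsed > 10:
            LOG.warning("%s took %.1fs to complete", " ".join(cmd), elapsed)
        return out

    # e.g. run_and_time(['lvs', '--noheadings'])

That would at least tell us which command is stalling and for how long.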
> >> 
> >> I think I'm going to propose a skip patch for test_volume_boot_pattern,
> >> since that just seems to be a never-ending cause of pain until these
> >> root issues get fixed.
> >> 
> > 
> > I marked 1396186 as a duplicate of 1373513, since the e-r query for 1373513
> > had an OR'd message clause that is the same as the one for 1396186.
> > 
> > I went ahead and proposed a skip for test_volume_boot_pattern due to bug 
> > 1373513 [1] until people get on top of debugging it.
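
For reference, the skip patch is essentially just tagging the scenario test
with the bug number. A paraphrased sketch follows, assuming tempest's
skip_because helper and the ScenarioTest base class as used around this time
(not the literal patch); if the helper differs, a plain
unittest.skip("bug 1373513") does the same job:

    from tempest.scenario import manager
    from tempest import test


    class TestVolumeBootPattern(manager.ScenarioTest):

        # Skip until bug 1373513 is root-caused and fixed.
        @test.skip_because(bug="1373513")
        def test_volume_boot_pattern(self):
            ...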
> > 
> > I added some notes to bug 1396186; the 3-minute hang seems to be due to a
> > vgs call taking ~1 minute and an lvs call taking ~2 minutes.
> > 
> > I'm not sure whether those are hit in the volume delete flow or in some
> > periodic task, but if multiple concurrent worker processes could be hitting
> > those commands at the same time, can we look at off-loading one of them to
> > a separate thread or something?
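
The "separate thread" idea would look something like the sketch below: under
eventlet, a long-running blocking call can be pushed to the native thread pool
so the greenthreads handling other requests keep getting scheduled. This is
only a sketch under that assumption; the function names are illustrative, not
Cinder's actual API:

    import subprocess

    from eventlet import tpool


    def _slow_vgs_call():
        # The blocking LVM query that was observed taking ~1 minute.
        return subprocess.check_output(
            ['vgs', '--noheadings', '-o', 'vg_name,vg_size,vg_free'])


    def get_volume_group_info():
        # Run the blocking call in eventlet's OS thread pool instead of
        # directly in the greenthread running the delete flow or the
        # periodic task.
        return tpool.execute(_slow_vgs_call)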
> 
> Do we set up devstack to not zero volumes on delete
> (CINDER_SECURE_DELETE=False)? If not, the dd process could be hanging the
> system due to I/O load. This would get significantly worse with multiple
> deletes occurring simultaneously.

Yes, we do that:

http://git.openstack.org/cgit/openstack-infra/devstack-gate/tree/devstack-vm-gate.sh#n139

and

http://git.openstack.org/cgit/openstack-infra/devstack-gate/tree/devstack-vm-gate-wrap.sh#n170

It can be overridden, but I don't think any of the job definitions do that.

-Matt Treinish
