Re: [Openstack] nova-compute and cinder-scheduler HA

Jay Pipes Fri, 16 May 2014 06:46:39 -0700

On 05/14/2014 02:49 PM, Сергей Мотовиловец wrote:

Hello everyone!


Hi Motovilovets :) Comments and questions for you inline...

I'm facing some troubles with nova and cinder here.

I have 2 control nodes (active/active) in my testing environment with
Percona XtraDB cluster (Galera+xtrabackup) + garbd on a separate node
(to avoid split-brain) + OpenStack Icehouse, latest from Ubuntu 14.04
main repo.

The problem is horizontal scalability of nova-conductor and
cinder-scheduler services, seems like all active instances of these
services are trying to execute the same MySQL queries they get from
Rabbit, which leads to numerous deadlocks in a set-up with Galera.

Are you using RabbitMQ in clustered mode? Also, how are you doing your load balancing? Do you use HAProxy or some appliance? Do you have sticky sessions enabled for your load balancing?

In case when multiple nova-conductor services are running (and using
MySQL instances on corresponding control nodes) it appears as "Deadlock
found when trying to get lock; try restarting transaction" in log.
With cinder-scheduler it leads to "InvalidBDM: Block Device Mapping is
Invalid."

So, it's not actually a deadlock that is occurring... unless I'm mistaken (I've asked a Percona engineer to take a look at this thread to double-check me), the error about "Deadlock found..." is actually *not* a deadlock. It's just that Galera uses the same InnoDB error code as a normal deadlock to indicate that the WSREP certification process has timed out between the cluster nodes. Would you mind pastebin'ing your wsrep.cnf and my.cnf files for us to take a look at? I presume that you do not have much latency between the cluster nodes (i.e. they are not over a WAN)... let me know if that is not the case.

It would also be helpful to see your rabbit and load balancer configs if you can pastebin those, too.
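
Since the "Deadlock found..." error reaches the client with the ordinary MySQL deadlock code (1213, ER_LOCK_DEADLOCK), one common way to cope with it on the application side is to retry the whole transaction. The following is only a minimal sketch of that idea, not what nova or cinder actually do: TransientDBError and retry_on_deadlock are hypothetical names made up for illustration, and a real service would catch its database driver's own exception type instead.

```python
import functools
import time

# MySQL error code for "Deadlock found when trying to get lock".
MYSQL_ER_LOCK_DEADLOCK = 1213


class TransientDBError(Exception):
    """Hypothetical stand-in for a driver exception carrying a MySQL error code."""

    def __init__(self, code, message=""):
        super().__init__(message)
        self.code = code


def retry_on_deadlock(max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry the wrapped callable when it fails with error 1213.

    The delay doubles after every failed attempt. Any other error, or a
    deadlock error on the final attempt, is re-raised to the caller.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except TransientDBError as exc:
                    is_deadlock = exc.code == MYSQL_ER_LOCK_DEADLOCK
                    if not is_deadlock or attempt == max_attempts:
                        raise
                    # Back off before restarting the transaction.
                    sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator
```

The wrapped function has to contain the complete transaction (begin through commit), because a failed transaction has been rolled back and must be restarted from the beginning.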

Is there any possible way to make multiple instances of these services
running simultaneously and not duplicating queries?

Yes, it most certainly is. At AT&T, we ran Galera clusters of much bigger size with absolutely no problems due to this cert timeout problem that manifests itself as a deadlock, so I know it's definitely possible to have a clean, performant, multi-writer Galera solution for OpenStack. :)


Best,
-jay

(I don't really like the idea of handling this with Heartbeat+Pacemaker
or other similar stuff, mostly because I'm thinking about equal load
distribution across control nodes, but in this case it seems like it has
an opposite effect, multiplying load on MySQL)

Another thing that is extremely annoying: if an instance gets stuck in
the ERROR state because of a deadlock during its termination, it can no
longer be terminated in Horizon, only via nova-api with reset-state.
How can this be handled?

I'd really appreciate any help/advice/thoughts regarding these problems.


Best regards,
Motovilovets Sergey
Software Operation Engineer


_______________________________________________
Mailing list: http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack
Post to     : openstack@lists.openstack.org
Unsubscribe : http://lists.openstack.org/cgi-bin/mailman/listinfo/openstack



