Some comments inline

On Tue, Nov 18, 2014 at 3:18 PM, Andrew Beekhof <abeek...@redhat.com> wrote:
> Hi Everyone,
>
> I was reading the blueprints mentioned here and thought I'd take the 
> opportunity to introduce myself and ask a few questions.
> For those that don't recognise my name, Pacemaker is my baby - so I take a 
> keen interest helping people have a good experience with it :)
>
> A couple of items stood out to me (apologies if I repeat anything that is 
> already well understood):
>
> * Operations with CIB utilizes almost 100% of CPU on the Controller
>
>  We introduced a new CIB algorithm in 1.1.12 which is O(2) faster/less 
> resource hungry than prior versions.
>  I would be interested to hear your experiences with it if you are able to 
> upgrade to that version.

Pacemaker on CentOS 6.5 is 1.1.10-14.el6_5.3
https://review.fuel-infra.org/#/admin/projects/packages/centos6/pacemaker
Corosync on CentOS 6.5 is 1.4.6-26.2
https://review.fuel-infra.org/#/admin/projects/packages/centos6/corosync

>
> * Corosync shutdown process takes a lot of time
>
>  Corosync (and Pacemaker) can shut down incredibly quickly.
>  If corosync is taking a long time, it will be because it is waiting for 
> pacemaker, and pacemaker is almost always waiting for one of the 
> clustered services to shut down.
>
> * Current Fuel Architecture is limited to Corosync 1.x and Pacemaker 1.x
>
>  Corosync 2 is really the way to go.
>  Is there something in particular that is holding you back?

We try to keep close to the distro version when possible / reasonable.

>  Also, out of interest, are you using cman or the pacemaker plugin?
>
> *  Diff operations against the Corosync CIB require saving data to a file
>   rather than keeping all the data in memory
>
>  Can someone clarify this one for me?
>
>  Also, I notice that the corosync init script has been modified to set/unset 
> maintenance-mode with cibadmin.
>  Any reason not to use crm_attribute instead?  You might find it's a less 
> fragile solution than a hard-coded diff.
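
For reference, the crm_attribute variant would boil down to something
like this (untested sketch; maintenance-mode is the standard cluster
property, the init-script context is ours):

  # enter maintenance mode before stopping corosync/pacemaker
  crm_attribute --type crm_config --name maintenance-mode --update true
  # ... stop services / upgrade ...
  # leave maintenance mode again
  crm_attribute --type crm_config --name maintenance-mode --delete

That would indeed avoid carrying a hard-coded CIB diff around.
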
>
> * The debug process for OCF scripts is not unified and requires a lot of
>  actions from the Cloud Operator
>
>  Two things to mention here... the first is crm_resource 
> --force-(start|stop|check) which queries the cluster for the resource's 
> definition but runs the command directly.
>  Combined with -V, this means that you get to see everything the agent is 
> doing.
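
That sounds handy. If I follow, the workflow would be roughly this
(the resource name below is just an illustration):

  # run the agent's start action directly, with verbose agent output
  crm_resource --resource p_rabbitmq-server --force-start -V
  # same idea for a one-off health check outside the cluster's control
  crm_resource --resource p_rabbitmq-server --force-check -V
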
>
>  Also, pacemaker now supports the ability for agents to emit specially 
> formatted error messages that are stored in the cib and can be shown back to 
> users.
>  This can make things much less painful for admins. Look for 
> PCMK_OCF_REASON_PREFIX in the upstream resource-agents project.
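
If I read the resource-agents tree right, an agent would use the
ocf_exit_reason helper from ocf-shellfuncs for this; a rough sketch
(not taken from a real agent):

  if ! have_binary "$OCF_RESKEY_binary"; then
      ocf_exit_reason "required binary '$OCF_RESKEY_binary' is not installed"
      exit $OCF_ERR_INSTALLED
  fi

and the prefixed message is then stored in the CIB as the failure
reason, as you describe.
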
>
>
> * Openstack services are not managed by Pacemaker
>
>  Oh?

Fuel doesn't (currently) set up the OpenStack API services in Pacemaker.

>
> * Compute nodes aren't in the Pacemaker cluster and hence lack a viable
>  control plane for their compute/nova services.
>
>  pacemaker-remoted might be of some interest here.
>  
> http://clusterlabs.org/doc/en-US/Pacemaker/1.1-pcs/html/Pacemaker_Remote/index.html
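
Interesting. If we went that route, a compute node would run
pacemaker_remote and be defined from a full cluster member roughly
like this (pcs syntax as one possibility; names are made up):

  # on a controller, register the compute node as a remote node
  pcs resource create compute-001 ocf:pacemaker:remote server=compute-001
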
>
>
> * Creating and committing shadows not only adds constant pain with
> dependencies and unneeded complexity, but also rewrites cluster attributes
> and even other changes if you get the ordering wrong, and it's really hard
> to debug.
>
>  Is this still an issue?  I'm reasonably sure this is specific to the way 
> crmsh uses shadows.
>  Using the native tools it should be possible to commit only the delta, so 
> any other changes that occur while you're updating the shadow would not be an 
> issue, and existing attributes wouldn't be rewritten.
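
If that works it would simplify things a lot. Going by the man pages,
the delta-only flow would look roughly like this (a sketch, not what
the provider does today):

  crm_shadow --create bootstrap --batch     # make a shadow copy, no sub-shell
  export CIB_shadow=bootstrap
  crm_attribute --name maintenance-mode --update false   # example change, lands in the shadow
  crm_shadow --diff > /tmp/bootstrap.diff   # only the delta vs. the live CIB
  unset CIB_shadow
  cibadmin --patch --xml-file /tmp/bootstrap.diff   # apply just that delta
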
>
> * Restarting resources by Puppet’s pacemaker service provider restarts them 
> even if they are running on other nodes and it sometimes impacts the cluster.
>
>  Not available yet, but upstream there is now a smart --restart option for 
> crm_resource which can optionally take a --host parameter.
>  Sounds like it would be useful here.
>  
> http://blog.clusterlabs.org/blog/2014/feature-spotlight-smart-resource-restart-from-the-command-line/
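
That does look useful for the provider; per the blog post the call
would be roughly (the resource name is illustrative):

  # restart only the instance of the clone running on node-1
  crm_resource --restart --resource p_neutron-openvswitch-agent --host node-1

though it obviously needs a newer pacemaker than we currently ship.
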
>
> * An attempt to stop or restart corosync service brings down a lot of 
> resources and probably will fail and bring down the entire deployment.
>
>  That sounds deeply worrying.  Details?
>
> * Controllers other than the first download the configured CIB and
> immediately start all cloned resources before they are configured, so they
> have to be cleaned up later.
>
>  By this you mean clones are being started on nodes which do not have the 
> software? Or before the ordering/colocation constraints have been configured?

This is an issue because, on the first controller, corosync/pacemaker is
set up in one deployment stage and the software and services are set up
in a later stage. When we deploy the remaining controllers they join the
cluster in that same early stage, which causes the cloned services to be
started before the software is installed. This was worked around with a
banning method so that none of the clones start until the node actually
has the services.
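
Roughly, the ban boils down to a -INFINITY location constraint until the
node is ready, e.g. (pcs syntax as one way to express it; the resource
and node names are placeholders):

  # keep the clone off the freshly joined controller for now
  pcs constraint location clone_p_haproxy avoids node-2
  # the constraint is removed again once the packages/services exist on node-2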

>
>


-- 
Andrew
Mirantis
Ceph community
