On 8/18/2010 at 10:25 AM, Simon Horman <[email protected]> wrote: 
> On Tue, Aug 17, 2010 at 06:12:04PM -0600, Tim Serong wrote: 
> > On 8/18/2010 at 09:03 AM, Simon Horman <[email protected]> wrote:  
> > > On Tue, Aug 17, 2010 at 03:06:45PM +0200, Dejan Muhamedagic wrote:  
> > > > Hi,  
> > > >   
> > > > On Tue, Aug 17, 2010 at 04:50:27PM +0900, Simon Horman wrote:  
> > > > > On Wed, Jul 21, 2010 at 01:41:09AM -0600, Tim Serong wrote:  
> > > > > > Hi All,  
> > > > > >   
> > > > > > A while ago (April, from memory), there was an ABI change in  
> > > > > > clplumbing in cluster-glue.  Presumably this went mostly unnoticed  
> > > > > > in general usage, however I have twice seen systems where the
> > > > > > cluster could not run because of a missing (or incorrect) libglue2
> > > > > > package.
> > > > > >  
> > > > > > One was my development system, with a dodgy build, the other was  
> > > > > > mentioned on #linux-ha yesterday, and was the result of ignoring a  
> > > > > > conflict error when installing the pacemaker RPM on openSUSE.  So,  
> > > > > > let me be clear, this is not something anyone should need to worry  
> > > > > > about...  But I thought I'd mention it here, because the error  
> > > > > > messages you get are, IMO, not very obvious.  
> > > > > >   
> > > > > > Symptoms of a mismatched pacemaker/libglue build are errors like:  
> > > > > >   
> > > > > >   lrmd: [3004]: ERROR:  
> > > > > >     main: can not create wait connection for command.  
> > > > > >   lrmd: [3004]: ERROR:  
> > > > > >     Startup aborted (can't create comm channel).  Shutting down.  
> > > > > >   ...  
> > > > > >   pengine: [4011]: ERROR:  
> > > > > >     init_client_ipc_comms_nodispatch: Could not access channel on:  
> > > > > >     /var/run/crm/pengine  
> > > > > >   corosync[4000]: [pcmk  ] ERROR:  
> > > > > >     pcmk_wait_dispatch: Child process pengine exited (pid=4011, rc=1)
> > > > > >   corosync[4000]: [pcmk  ] notice:  
> > > > > >     pcmk_wait_dispatch: Respawning failed child process: pengine  
> > > > > >   
> > > > > > If your cluster won't start and you see this in /var/log/messages,
> > > > > > make sure libglue2 is up to date.  And now that I've mentioned this
> > > > > > here and it's made it to the mailing list archive, Google will know,
> > > > > > and nobody else will ever have this problem again.
> > > > > >   
> > > > > > This has been a public service announcement.  Thank you for reading.
> > > > >   
> > > > > Could we get the .so bumped accordingly in the next release of  
> > > > > cluster glue? That would at least help in managing the problem  
> > > > > once the new release has been made.  
> > > >   
> > > > I don't think that that is necessary. The ABI change in the  
> > > > _released_ cluster-glue packages was done in such a way as not to  
> > > > disturb the existing pacemaker installations, i.e. by adding  
> > > > fields to the end of the struct. Further, the library version has  
> > > > been bumped to 3:0:1 (with libtool's -version-info) at the time.  
> > > > For whatever reason that translates to so.2.1.0. Users of the new  
> > > > ABI are also using domain sockets of the new type if they want  
> > > > the new functionality.  
> > > >   
> > > > I guess that what Tim was seeing was Pacemaker built against the  
> > > > unreleased glue versions which did have different ABI, i.e. the  
> > > > fields were inserted somewhere in the middle of the struct.  
> > >   
> > > Ok, so no ABI incompatibility was introduced in 1.0.6. Great!  
> > > I will go ahead and close the related Debian bugs,  
> > > #593319, #593321, #593322 and #593323.  
> >  
> > I was seeing Pacemaker *built* against new glue, installed on a system 
> > that had *old* glue installed, because both libglue2 (new glue) and 
> > libheartbeat2 < 3.0 (old glue) provide what looks like the same DSO; 
> > so when Pacemaker was upgraded on this system, libheartbeat2 was not 
> > automatically upgraded to libglue2.  For reference, there's an 
> > openSUSE 11.3 bug for this: 
> >  
> >   https://bugzilla.novell.com/show_bug.cgi?id=628243 
> >  
> > I believe this may only be a problem on openSUSE 11.3, where heartbeat 
> > 2.99.3 still exists, providing old libheartbeat2. 
> >  
> > It shouldn't be a problem the other way around (i.e. old Pacemaker is 
> > meant to work with new glue, as Dejan said). 
>  
> Understood. 
>  
> Was the new glue that you used for building a released version 
> or an hg snapshot? 

The first time I saw it was with an odd build around the time
of glue 1.0.4 or 1.0.5 (with which there was definitely a problem,
see http://www.gossamer-threads.com/lists/linuxha/dev/63396). 

The issue on openSUSE 11.3 is with Pacemaker built against glue slightly
newer than 1.0.5 (changeset 1448deafdf79), but installed with libheartbeat2
2.99.x instead of libglue2.
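
To illustrate Dejan's point about where new fields go, here is a minimal
sketch using Python's ctypes.  The struct and field names are hypothetical
and are NOT the real clplumbing definitions; the only thing it shows is
that appending a field leaves the existing field offsets alone, while
inserting one in the middle shifts them.

```python
import ctypes


# Hypothetical layouts for illustration only; not the real clplumbing structs.
class OldChannel(ctypes.Structure):
    _fields_ = [("ch_status", ctypes.c_int),
                ("refcount", ctypes.c_int)]


class AppendedChannel(ctypes.Structure):
    # New field added at the end: offsets of existing fields are unchanged.
    _fields_ = [("ch_status", ctypes.c_int),
                ("refcount", ctypes.c_int),
                ("new_flag", ctypes.c_int)]


class InsertedChannel(ctypes.Structure):
    # New field inserted in the middle: "refcount" moves to a new offset.
    _fields_ = [("ch_status", ctypes.c_int),
                ("new_flag", ctypes.c_int),
                ("refcount", ctypes.c_int)]


def offsets(struct):
    """Map each field name of a ctypes Structure to its byte offset."""
    return {name: getattr(struct, name).offset for name, _ in struct._fields_}


def shared_offsets_match(old, new):
    """True if every field of `old` sits at the same offset in `new`."""
    old_off, new_off = offsets(old), offsets(new)
    return all(new_off.get(name) == off for name, off in old_off.items())


print(shared_offsets_match(OldChannel, AppendedChannel))
print(shared_offsets_match(OldChannel, InsertedChannel))
```

Note that even the appended layout changes the size of the struct, so this
only says something about code that reads the old fields; it says nothing
about code built against the new layout running on an old library.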

I have not tried Pacemaker built against glue 1.0.5, but installed with
an earlier glue (e.g. 1.0.4 or earlier).  I expect this would break in the
same way I mentioned originally.
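
For anyone who wants to check a log for the symptoms I mentioned
originally, a rough sketch follows.  The substrings are copied from the
error messages quoted earlier in this thread; the function name is made
up, and a match is only a hint, since I would not assume these messages
are unique to a glue mismatch.

```python
# Substrings taken from the lrmd and pengine errors quoted in this thread.
SYMPTOMS = (
    "can not create wait connection for command",
    "Startup aborted (can't create comm channel)",
    "init_client_ipc_comms_nodispatch: Could not access channel on",
)


def find_symptoms(lines):
    """Return (line_number, line) pairs for lines containing a known symptom."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if any(symptom in line for symptom in SYMPTOMS):
            hits.append((number, line.rstrip("\n")))
    return hits
```

It takes any iterable of lines, so it can be handed an open file object
for /var/log/messages (opened with errors="replace" in case the log
contains odd bytes).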

I had a quick look at the Debian bugs you mentioned.  If it's possible at
all on Debian to have glue < 1.0.5 installed with Pacemaker built against
glue >= 1.0.5, I expect there will be trouble.  However, a quick search
on packages.debian.org shows no glue earlier than 1.0.5, so hopefully
this means you're good.

Regards,

Tim


-- 
Tim Serong <[email protected]>
Senior Clustering Engineer, OPS Engineering, Novell Inc.


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
