Re: 0.14 cluster never survives more than an hour or so.

Pavel Moravec Fri, 13 Apr 2012 03:49:20 -0700

Hi Paul,
Both errors occur under very similar circumstances. I recommend enabling debug 
logging for the cluster component by adding:


log-enable=debug+:cluster
log-enable=notice+

to qpidd.conf, and then posting the logs to a new JIRA. (You can also try 
enabling trace logs, which give more verbose output, but running traces for 
half an hour would require a nontrivial amount of disk space.)
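
For example, assuming the trace level is accepted with the same 
level+:component syntax as the debug line above, the cluster line would 
become:

log-enable=trace+:cluster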

To alleviate the consequences, I think disabling management should help. Note 
that this only prevents the symptom and does not fix the root-cause bug, so 
other problems may show up elsewhere later on. Also, some QMF-based services 
(like qpid-tool) won't work with management disabled.

To disable management, add this to qpidd.conf:

mgmt-enable=no

Alternatively, you can lower the frequency of management updates (which are 
processed by the periodicProcessing task) with the mgmt-pub-interval option 
(10 seconds by default). If you set it to, e.g., 2 hours, your qpid cluster 
should run for at least 2 hours without the error. But again, some QMF-based 
services rely on those updates.
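
As a rough sketch, and assuming the option takes its value in seconds (as the 
10-second default suggests), a 2-hour interval would look like this in 
qpidd.conf:

mgmt-pub-interval=7200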


Kind regards,
Pavel Moravec


----- Original Message -----
> From: "Paul Colby" <p...@colby.id.au>
> To: users@qpid.apache.org
> Sent: Friday, April 13, 2012 11:02:14 AM
> Subject: Re: 0.14 cluster never survives more than an hour or so.
> 
> Alas, the patch at https://issues.apache.org/jira/browse/QPID-3369 has not
> fixed the issue.
> 
> Interestingly though, it did move the error to a different line, but with a
> very similar message, e.g.:
> 
> Apr 13 17:04:17 gateway02 qpidd[32258]: 2012-04-13 17:04:17 critical Error delivering frames: Cluster timer wakeup non-existent task ManagementAgent::periodicProcessing (qpid/cluster/ClusterTimer.cpp:112)
> 
> So it's moved from ClusterTimer::deliverDrop to ClusterTimer::deliverWakeup
> instead... but with the same effective result.
> 
> pc
> ----
> http://colby.id.au
> 
> 
> On Fri, Apr 13, 2012 at 9:30 AM, Paul Colby <p...@colby.id.au> wrote:
> 
> > Thanks Pavel and Gordon, I really appreciate you guys getting back to me
> > so quickly :)
> >
> > I'm not currently using cman.  I hadn't been using it on 0.12 either.  I
> > suspect that split-brain is not the case, since the test cluster in
> > question is on virtual machines all within a single host, with *very*
> > reliable virtual networking between them.  After reading your response, I
> > did have a quick look at setting up cman to verify either way, but that's
> > not proving to be quick and easy, so I'll come back to it shortly.
> >
> > The https://issues.apache.org/jira/browse/QPID-3369 issue does look
> > interesting.  I'll apply the patch suggested there and see what difference
> > it makes.
> >
> > Thanks again.  I'll let you know how it goes :)
> >
> > pc
> > ----
> > http://colby.id.au
> >
> >
> >
> > On Thu, Apr 12, 2012 at 9:39 PM, Pavel Moravec <pmora...@redhat.com> wrote:
> >
> >> Hi Paul,
> >> this usually happens as a consequence of cluster split-brain. Are you
> >> using CMAN (Cluster Manager)?
> >>
> >> (Technically, when split-brain occurs, two (or more) qpid brokers think
> >> they are the elder node (elder node = "the managing" node, usually the
> >> node that is oldest in the cluster). There can be only one elder node in
> >> a cluster, because the elder node periodically invokes the
> >> periodicProcessing task cluster-wide, and only one instance of that task
> >> can run at a time. When several elder nodes are present, each of them
> >> invokes the task on every cluster member, which would cause multiple
> >> instances of the task to run; the brokers shut down to prevent that.)
> >>
> >> Kind regards,
> >> Pavel Moravec
> >>
> >>
> >> ----- Original Message -----
> >> > From: "Paul Colby" <p...@colby.id.au>
> >> > To: users@qpid.apache.org
> >> > Sent: Thursday, April 12, 2012 5:08:01 AM
> >> > Subject: 0.14 cluster never survives more than an hour or so.
> >> >
> >> > Hi guys,
> >> >
> >> > I'm having an issue with my new 0.14 cluster, where the same
> >> > configuration was fine with 0.12.
> >> >
> >> > The cluster starts up, and all brokers are happy.  Then, with no client
> >> > activity at all, after some seemingly random amount of time (usually
> >> > around 30 minutes to an hour) all brokers in the cluster (three, in this
> >> > case) report the following error:
> >> >
> >> > critical Error delivering frames: Cluster timer drop non-existent task ManagementAgent::periodicProcessing (qpid/cluster/ClusterTimer.cpp:128)
> >> >
> >> > Then they all shut down, leaving their respective stores dirty :(
> >> >
> >> > Any ideas what might be going wrong here?
> >> >
> >> > Thanks,
> >> >
> >> > pc
> >> > ----
> >> > http://colby.id.au
> >> >
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: users-unsubscr...@qpid.apache.org
> >> For additional commands, e-mail: users-h...@qpid.apache.org
> >>
> >>
> >
> 

