Hi Andrew,

Thank you for your comment.

> During crmd startup, one could read all the values from attrd into the
> hashtable.
> So the hashtable would only do something if only attrd went down.

If attrd communicates with crmd when it starts and reads the data back from that
hash table, the problem seems solvable.

Would that change to attrd and crmd be difficult to make?
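
Just to make sure I am reading your suggestion correctly, here is a rough sketch of
what I have in mind (illustration only, not real crmd code; send_to_attrd() is a
made-up stand-in for the actual IPC call, and the attribute name is just an example):
crmd remembers every value it forwards in a GLib hash table, and when it notices that
attrd has been restarted it simply replays the whole table.

/* Illustration only, assuming GLib; send_to_attrd() is a hypothetical stub. */
#include <stdio.h>
#include <glib.h>

static GHashTable *attr_cache;

/* Stand-in for the real crmd->attrd IPC call. */
static void
send_to_attrd(const char *name, const char *value)
{
    printf("update attrd: %s = %s\n", name, value);
}

/* Forward a value to attrd and remember what we sent. */
static void
cache_and_send(const char *name, const char *value)
{
    g_hash_table_replace(attr_cache, g_strdup(name), g_strdup(value));
    send_to_attrd(name, value);
}

/* GHFunc used to replay one cached entry. */
static void
replay_entry(gpointer key, gpointer value, gpointer user_data)
{
    send_to_attrd(key, value);
}

int
main(void)
{
    attr_cache = g_hash_table_new_full(g_str_hash, g_str_equal,
                                       g_free, g_free);

    /* Normal operation: crmd forwards a fail-count update. */
    cache_and_send("fail-count-UmIPaddr", "2");

    /* attrd dies and is respawned; its own hash table is now empty,
     * so crmd replays everything it still remembers. */
    g_hash_table_foreach(attr_cache, replay_entry, NULL);

    g_hash_table_destroy(attr_cache);
    return 0;
}

The names and values above are only examples; I imagine the real crmd would also need
to carry the section and dampening parameters with each cached update.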


> I mean: did you see this behavior in a production system, or only
> during testing when you manually killed attrd?

We run the kill command manually as one of our tests for process failures.
Our users are very concerned about the behavior after a process failure.

Best Regards,
Hideo Yamauchi.

--- Andrew Beekhof <and...@beekhof.net> wrote:

> On Wed, Sep 29, 2010 at 3:59 AM,  <renayama19661...@ybb.ne.jp> wrote:
> > Hi Andrew,
> >
> > Thank you for comment.
> >
> >> The problem here is that attrd is supposed to be the authoritative
> >> source for this sort of data.
> >
> > Yes. I understand.
> >
> >> Additionally, you don't always want attrd reading from the status
> >> section - like after the cluster restarts.
> >
> > The problem also seems solvable if attrd retrieves the status section from
> > the cib after it reboots.
> > That is what "method 2", which I suggested, means.
> >> > method 2) When attrd starts, it communicates with the cib and receives
> >> > the fail-count.
> >
> >> For failcount, the crmd could keep a hashtable of the current values
> >> which it could re-send to attrd if it detects a disconnection.
> >> But that might not be a generic-enough solution.
> >
> > If crmd can keep the values in a hash table, that may be a good idea.
> > However, I have a feeling that the same problem occurs when crmd itself
> > fails and is restarted.
> 
> During crmd startup, one could read all the values from attrd into the
> hashtable.
> So the hashtable would only do something if only attrd went down.
> 
> >
> >> The chance that attrd dies _and_ there were relevant values for
> >> fail-count is pretty remote though... is this a real problem you've
> >> experienced or a theoretical one?
> >
> > I did not quite understand the meaning.
> > Does this mean that attrd's fail-count exists on the other node?
> 
> I mean: did you see this behavior in a production system, or only
> during testing when you manually killed attrd?
> 
> >
> > Best Regards,
> > Hideo Yamauchi.
> >
> > --- Andrew Beekhof <and...@beekhof.net> wrote:
> >
> >> On Mon, Sep 27, 2010 at 7:26 AM, <renayama19661...@ybb.ne.jp> wrote:
> >> > Hi,
> >> >
> >> > I discovered this phenomenon while investigating another problem.
> >> > If attrd fails and is not restarted, the problem does not occur.
> >> >
> >> > Step 1) After startup, cause a monitor error in UmIPaddr twice.
> >> >
> >> > Online: [ srv01 srv02 ]
> >> >
> >> >  Resource Group: UMgroup01
> >> >      UmVIPcheck  (ocf::heartbeat:Dummy):   Started srv01
> >> >      UmIPaddr    (ocf::heartbeat:Dummy2):  Started srv01
> >> >
> >> > Migration summary:
> >> > * Node srv02:
> >> > * Node srv01:
> >> >    UmIPaddr: migration-threshold=10 fail-count=2
> >> >
> >> > Step 2) Kill attrd; attrd is restarted.
> >> >
> >> > Online: [ srv01 srv02 ]
> >> >
> >> >  Resource Group: UMgroup01
> >> >      UmVIPcheck  (ocf::heartbeat:Dummy):   Started srv01
> >> >      UmIPaddr    (ocf::heartbeat:Dummy2):  Started srv01
> >> >
> >> > Migration summary:
> >> > * Node srv02:
> >> > * Node srv01:
> >> >    UmIPaddr: migration-threshold=10 fail-count=2
> >> >
> >> > Step 3) Cause another monitor error in UmIPaddr.
> >> >
> >> > Online: [ srv01 srv02 ]
> >> >
> >> >  Resource Group: UMgroup01
> >> >      UmVIPcheck  (ocf::heartbeat:Dummy):   Started srv01
> >> >      UmIPaddr    (ocf::heartbeat:Dummy2):  Started srv01
> >> >
> >> > Migration summary:
> >> > * Node srv02:
> >> > * Node srv01:
> >> >    UmIPaddr: migration-threshold=10 fail-count=1 -----> the fail-count
> >> >    has started over from the beginning.
> >> >
> >> > The problem is that attrd loses the fail-count when it is rebooted
> >> > (its hash tables are lost).
> >> > It is a serious problem that the failure count is reset.
> >> >
> >> > I think the following methods are possible.
> >> >
> >> > method 1) attrd maintains the fail-count in a file under the "/var/run"
> >> > directory and refers to it.
> >> >
> >> > method 2) When attrd starts, it communicates with the cib and receives
> >> > the fail-count.
> >> >
> >> > Is there a better method?
> >> >
> >> > Please think about the solution of this problem.
> >>
> >> Hmmmm... a tricky one.
> >>
> >> The problem here is that attrd is supposed to be the authoritative
> >> source for this sort of data.
> >> Additionally, you don't always want attrd reading from the status
> >> section - like after the cluster restarts.
> >>
> >> For failcount, the crmd could keep a hashtable of the current values
> >> which it could re-send to attrd if it detects a disconnection.
> >> But that might not be a generic-enough solution.
> >>
> >> The chance that attrd dies _and_ there were relevant values for
> >> fail-count is pretty remote though... is this a real problem you've
> >> experienced or a theoretical one?


_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker
