Hi Andrew,

Thank you for your comment.
> > Is the change to this attrd and crmd difficult?
>
> I don't think so.
> But it's not a huge priority because I've never heard of attrd actually
> crashing.
>
> So while I agree that it's theoretically a problem, in practice no-one
> is going to hit this in production.
> Even if they were unlucky enough to see it, at worst the resource is
> able to run on the node again - which doesn't seem that bad for an HA
> cluster :-)

All right. I will register this problem in Bugzilla as an enhancement request for now,
and wait a little while for opinions from other users.
(For reference, rough sketches of both ideas discussed here are appended below the quoted thread.)

Thanks,
Hideo Yamauchi.

--- Andrew Beekhof <and...@beekhof.net> wrote:

> On Fri, Oct 1, 2010 at 4:00 AM, <renayama19661...@ybb.ne.jp> wrote:
> > Hi Andrew,
> >
> > Thank you for your comment.
> >
> >> During crmd startup, one could read all the values from attrd into the
> >> hashtable.
> >> So the hashtable would only do something if only attrd went down.
> >
> > If attrd communicates with crmd at startup and reads back the data in the
> > hash table, the problem seems to be solved.
> >
> > Is the change to this attrd and crmd difficult?
>
> I don't think so.
> But it's not a huge priority because I've never heard of attrd actually
> crashing.
>
> So while I agree that it's theoretically a problem, in practice no-one
> is going to hit this in production.
> Even if they were unlucky enough to see it, at worst the resource is
> able to run on the node again - which doesn't seem that bad for an HA
> cluster :-)
>
> >
> >> I mean: did you see this behavior in a production system, or only
> >> during testing when you manually killed attrd?
> >
> > We run the kill command manually as one of our process-failure tests.
> > Our users care very much about the behavior after a process failure.
> >
> > Best Regards,
> > Hideo Yamauchi.
> >
> > --- Andrew Beekhof <and...@beekhof.net> wrote:
> >
> >> On Wed, Sep 29, 2010 at 3:59 AM, <renayama19661...@ybb.ne.jp> wrote:
> >> > Hi Andrew,
> >> >
> >> > Thank you for your comment.
> >> >
> >> >> The problem here is that attrd is supposed to be the authoritative
> >> >> source for this sort of data.
> >> >
> >> > Yes. I understand.
> >> >
> >> >> Additionally, you don't always want attrd reading from the status
> >> >> section - like after the cluster restarts.
> >> >
> >> > The problem also seems solvable if attrd retrieves the status section
> >> > from the cib after attrd reboots.
> >> > "method 2", which I suggested, means exactly that.
> >> >> > method 2) When attrd starts, attrd communicates with the cib and
> >> >> > receives fail-count.
> >> >
> >> >> For failcount, the crmd could keep a hashtable of the current values
> >> >> which it could re-send to attrd if it detects a disconnection.
> >> >> But that might not be a generic-enough solution.
> >> >
> >> > If the crmd can maintain such a hash table, that may be a good idea.
> >> > However, I have a feeling that the same problem happens when crmd
> >> > itself fails and is rebooted.
> >>
> >> During crmd startup, one could read all the values from attrd into the
> >> hashtable.
> >> So the hashtable would only do something if only attrd went down.
> >>
> >> >
> >> >> The chance that attrd dies _and_ there were relevant values for
> >> >> fail-count is pretty remote though... is this a real problem you've
> >> >> experienced or a theoretical one?
> >> >
> >> > I did not understand the meaning well.
> >> > Does this mean that there is a fail-count from attrd on the other node?
> >>
> >> I mean: did you see this behavior in a production system, or only
> >> during testing when you manually killed attrd?
> >>
> >> >
> >> > Best Regards,
> >> > Hideo Yamauchi.
> >> >
> >> > --- Andrew Beekhof <and...@beekhof.net> wrote:
> >> >
> >> >> On Mon, Sep 27, 2010 at 7:26 AM, <renayama19661...@ybb.ne.jp> wrote:
> >> >> > Hi,
> >> >> >
> >> >> > When I investigated another problem, I discovered this phenomenon.
> >> >> > If attrd fails and does not restart, the problem does not occur.
> >> >> >
> >> >> > Step 1) After startup, cause a monitor error in UmIPaddr twice.
> >> >> >
> >> >> > Online: [ srv01 srv02 ]
> >> >> >
> >> >> >  Resource Group: UMgroup01
> >> >> >      UmVIPcheck (ocf::heartbeat:Dummy):  Started srv01
> >> >> >      UmIPaddr   (ocf::heartbeat:Dummy2): Started srv01
> >> >> >
> >> >> > Migration summary:
> >> >> > * Node srv02:
> >> >> > * Node srv01:
> >> >> >    UmIPaddr: migration-threshold=10 fail-count=2
> >> >> >
> >> >> > Step 2) Kill attrd; attrd is respawned.
> >> >> >
> >> >> > Online: [ srv01 srv02 ]
> >> >> >
> >> >> >  Resource Group: UMgroup01
> >> >> >      UmVIPcheck (ocf::heartbeat:Dummy):  Started srv01
> >> >> >      UmIPaddr   (ocf::heartbeat:Dummy2): Started srv01
> >> >> >
> >> >> > Migration summary:
> >> >> > * Node srv02:
> >> >> > * Node srv01:
> >> >> >    UmIPaddr: migration-threshold=10 fail-count=2
> >> >> >
> >> >> > Step 3) Cause another monitor error in UmIPaddr.
> >> >> >
> >> >> > Online: [ srv01 srv02 ]
> >> >> >
> >> >> >  Resource Group: UMgroup01
> >> >> >      UmVIPcheck (ocf::heartbeat:Dummy):  Started srv01
> >> >> >      UmIPaddr   (ocf::heartbeat:Dummy2): Started srv01
> >> >> >
> >> >> > Migration summary:
> >> >> > * Node srv02:
> >> >> > * Node srv01:
> >> >> >    UmIPaddr: migration-threshold=10 fail-count=1  -----> fail-count starts over from 1.
> >> >> >
> >> >> > The problem is that attrd loses fail-count when it reboots (its hash table is lost).
> >> >> > It is a serious problem that the failure count is reset.
> >> >> >
> >> >> > I think the following methods are possible.
> >> >> >
> >> >> > method 1) attrd maintains fail-count in a file under the "/var/run"
> >> >> > directory and refers to it.
> >> >> >
> >> >> > method 2) When attrd starts, attrd communicates with the cib and
> >> >> > receives fail-count.
> >> >> >
> >> >> > Is there a better method?
> >> >> >
> >> >> > Please think about a solution to this problem.
> >> >>
> >> >> Hmmmm... a tricky one.
> >> >>
> >> >> The problem here is that attrd is supposed to be the authoritative
> >> >> source for this sort of data.
> >> >> Additionally, you don't always want attrd reading from the status
> >> >> section - like after the cluster restarts.
> >> >>
> >> >> For failcount, the crmd could keep a hashtable of the current values
> >> >> which it could re-send to attrd if it detects a disconnection.
> >> >> But that might not be a generic-enough solution.
> >> >>
> >> >> The chance that attrd dies _and_ there were relevant values for
> >> >> fail-count is pretty remote though... is this a real problem you've
> >> >> experienced or a theoretical one?
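For reference, here is a rough, self-contained sketch of Andrew's idea (plain GLib only, not actual Pacemaker code): crmd keeps its own hashtable of the fail-count values it has pushed to attrd, and replays them when it notices that attrd has reconnected after a restart. send_to_attrd() is only a hypothetical stand-in for the real crmd-to-attrd IPC call.

/*
 * Sketch only - not Pacemaker source.
 * Build: gcc crmd_cache_sketch.c $(pkg-config --cflags --libs glib-2.0)
 */
#include <glib.h>
#include <stdio.h>

static GHashTable *failcount_cache = NULL;

/* Hypothetical stand-in for the IPC update crmd normally sends to attrd. */
static void send_to_attrd(const char *name, const char *value)
{
    printf("attrd update: %s=%s\n", name, value);
}

/* Record the value in crmd's own cache, then forward it to attrd as usual. */
static void update_failcount(const char *name, const char *value)
{
    g_hash_table_replace(failcount_cache, g_strdup(name), g_strdup(value));
    send_to_attrd(name, value);
}

static void replay_one(gpointer key, gpointer value, gpointer user_data)
{
    send_to_attrd(key, value);
}

/* Called when crmd detects that the attrd connection dropped and came back. */
static void on_attrd_reconnect(void)
{
    g_hash_table_foreach(failcount_cache, replay_one, NULL);
}

int main(void)
{
    failcount_cache = g_hash_table_new_full(g_str_hash, g_str_equal,
                                            g_free, g_free);

    update_failcount("fail-count-UmIPaddr", "2");

    /* attrd is killed and respawned: its own hashtable is empty again,
     * but crmd still holds the values and can push them back. */
    on_attrd_reconnect();

    g_hash_table_destroy(failcount_cache);
    return 0;
}

As Andrew notes, crmd would still have to read the current values back from attrd during its own startup, so this cache would only matter in the case where attrd alone goes down.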
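And a similarly rough sketch of "method 1", assuming attrd simply dumps its fail-count entries to a small state file under /var/run and reloads it when it restarts. The file name and format here are illustrative assumptions only, not an existing Pacemaker feature; a real patch would also need to handle permissions, errors, and cleanup at cluster shutdown.

/*
 * Sketch only - not Pacemaker source.
 * Build: gcc attrd_state_sketch.c $(pkg-config --cflags --libs glib-2.0)
 */
#include <glib.h>
#include <stdio.h>

/* Assumed location, per "method 1"; writing here normally needs root. */
#define STATE_FILE "/var/run/attrd-failcounts"

/* Dump every "name value" pair to the state file. */
static void save_values(GHashTable *values)
{
    GHashTableIter iter;
    gpointer name, value;
    FILE *fp = fopen(STATE_FILE, "w");

    if (fp == NULL) {
        return;
    }
    g_hash_table_iter_init(&iter, values);
    while (g_hash_table_iter_next(&iter, &name, &value)) {
        fprintf(fp, "%s %s\n", (char *)name, (char *)value);
    }
    fclose(fp);
}

/* Re-populate the hashtable from the state file after a restart. */
static void load_values(GHashTable *values)
{
    char name[256], value[256];
    FILE *fp = fopen(STATE_FILE, "r");

    if (fp == NULL) {
        return;  /* first start, or nothing was saved */
    }
    while (fscanf(fp, "%255s %255s", name, value) == 2) {
        g_hash_table_replace(values, g_strdup(name), g_strdup(value));
    }
    fclose(fp);
}

int main(void)
{
    GHashTable *values = g_hash_table_new_full(g_str_hash, g_str_equal,
                                               g_free, g_free);

    load_values(values);  /* would run during attrd startup */
    g_hash_table_replace(values, g_strdup("fail-count-UmIPaddr"), g_strdup("2"));
    save_values(values);  /* would run whenever a value changes */

    g_hash_table_destroy(values);
    return 0;
}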
_______________________________________________
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker