----- Original Message -----
> Hello David,
>
> I think I am using the latest version from Ubuntu, which is version 1.1.10.
> Do you think it has a bug?
There have been a number of fixes to the lrmd since v1.1.10, and it is possible
that a couple of them address issues that could result in crashes. Again,
without a backtrace from the lrmd core dump, it is difficult for me to say
whether your specific issue has been fixed. Building from source could yield
better results for you. The pacemaker master branch is stable at the moment.

lrmd-related changes since 1.1.10:

# git log --oneline Pacemaker-1.1.10^..HEAD | grep -e "lrmd:"
71b429c Low: lrmd: fix regression test LSBdummy install
fb94901 Test: lrmd: Ensure the lsb script is executable
30d978e Low: lrmd: systemd stress tests
568e41d Fix: lrmd: Prevent glib assert triggered by timers being removed from mainloop more than once
977de97 High: lrmd: cancel pending async connection during disconnect
d2d0cba Low: lrmd: ensures systemd python package is available when systemd tests run
f0fe737 Fix: lrmd: fix rescheduling of systemd monitor op during start
c0e8e6a Low: lrmd: prevent \n from being printed in exit reason output
2342835 High: lrmd: pass exit reason prefix to ocf scripts as env variable
412631c High: lrmd: store failed operation exit reason in cib
ad083a8 Fix: lrmd: Log with the correct personality
718bf5b Test: lrmd: Update the systemd agent to test long running actions
c78b4b8 Fix: lrmd: Handle systemd reporting 'done' before a resource is actually stopped
3bd6c30 Fix: lrmd: Handle systemd reporting 'done' before a resource is actually stopped
574fc49 Fix: lrmd: Prevent OCF agents from logging to random files due to "value" of setenv() being NULL
155c6aa Low: lrmd: wider use of defined literals
fa8bd56 Fix: lrmd: Expose logging variables expected by OCF agents
d9cc751 Fix: lrmd: Provide stderr output from agents if available, otherwise fall back to stdout
3adc781 Low: lrmd: clean up the agent's entire process group
348bb51 Fix: lrmd: Cancel recurring operations before stop action is executed
fa2954e Low: lrmd: Warning msg to indicate duplicate op merge has occurred
b94d0e9 Low: lrmd: recurring op merger regression tests
c29ab27 High: lrmd: Merge duplicate recurring monitor operations
c1a326d Test: lrmd: Bump the lrmd test timeouts to avoid transient travis failures
deead39 Low: lrmd: Install ping agent during lrmd regression test.
aad79e2 Low: lrmd: Make ocf dummy agents executable with regression test in src tree
5c8c7a5 Test: lrmd: Kill uninstalled daemons by the correct name
8e90200 Test: lrmd: Fix upstart metadata test and install required OCF agents
bbdd6e1 Test: lrmd: Allow regression tests to run from the source tree
87f4091 Low: lrmd: Send event alerting estabilished clients that a new client connection is created.
644752e Fix: lrmd: Correctly calculate metadata for the 'service' class
ea7991f Fix: lrmd: Do not interrogate NULL replies from the server
1c14b9d Fix: lrmd: Correctly cancel monitor actions for lsb/systemd/service resources on cleaning up
eceeeea Doc: lrmd: Indicate which function recieves the proxied command
ad4056f Test: lrmd: Drop the default verbosity for lrmd regression tests
eb40d6a Fix: lrmd: Do not overwrite any existing operation status error

-- Vossel

> Should I compile from source?
>
> Best Regards,
>
>
> Ariee
>
>
> On Fri, Dec 19, 2014 at 8:27 PM, <pacemaker-requ...@oss.clusterlabs.org>
> wrote:
>
> Message: 2
> Date: Fri, 19 Dec 2014 14:21:59 -0500 (EST)
> From: David Vossel <dvos...@redhat.com>
> To: The Pacemaker cluster resource manager
>     <pacemaker@oss.clusterlabs.org>
> Subject: Re: [Pacemaker] pacemaker error after a couple week or month
> Message-ID:
>     <102420175.739708.1419016919246.javamail.zim...@redhat.com>
> Content-Type: text/plain; charset=utf-8
>
>
>
> ----- Original Message -----
> > Hello,
> >
> > I have two active-passive failover systems with corosync and drbd.
> > One system uses two Debian servers and the other uses two Ubuntu servers.
> > The Debian servers are for web server failover and the Ubuntu servers are
> > for database server failover.
> >
> > I applied the same pacemaker configuration to both systems. Everything
> > works fine: failover happens nicely, and so does the file system
> > synchronization. But on the Ubuntu servers, an error always occurs after a
> > couple of weeks or months. The pacemaker on ubuntu1 had a different status
> > from ubuntu2: ubuntu1 assumed that ubuntu2 was down, and ubuntu2 assumed
> > that something had happened to ubuntu1 but that it was still alive, and
> > took over the resources. As a result the drbd resource could not be taken
> > over, so no failover happened, and we had to restart the server manually
> > because restarting pacemaker and corosync didn't help. I have changed the
> > pacemaker configuration a couple of times, but the problem still exists.
> >
> > Has anyone experienced this? I use Ubuntu 14.04.1 LTS.
> >
> > I got this error in apport.log:
> >
> > ERROR: apport (pid 20361) Fri Dec 19 02:43:52 2014: executable:
> > /usr/lib/pacemaker/lrmd (command line "/usr/lib/pacemaker/lrmd")
>
> Wow, it looks like the lrmd is crashing on you. I haven't seen this occur
> in the wild before. Without a backtrace it will be nearly impossible to
> determine what is happening.
>
> Do you have the ability to upgrade pacemaker to a newer version?
>
> -- Vossel
>
>
> _______________________________________________
> Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
> http://oss.clusterlabs.org/mailman/listinfo/pacemaker
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org
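Vossel points out that a backtrace from the lrmd core dump is needed before anyone can tell whether this crash is already fixed. What follows is a minimal sketch of one way to get that backtrace on Ubuntu; it was not posted in the thread. It assumes that apport saved the crash as a report under /var/crash (the apport.log line above suggests apport caught it), that it runs as root, that gdb and apport-unpack are installed, and that pacemaker debug symbols are available; without symbols the backtrace will show raw addresses instead of function names. The helper names crash_report_path and print_backtrace are hypothetical.

```shell
#!/bin/sh
# Sketch: print a backtrace for a crashed lrmd from an apport crash report.
# Assumptions (not from the thread): run as root, gdb and apport-unpack are
# installed, and pacemaker debug symbols are available; without symbols the
# backtrace shows raw addresses instead of function names.
set -eu

LRMD_BIN=/usr/lib/pacemaker/lrmd

# Hypothetical helper: apport names a report after the crashed executable's
# path with every '/' replaced by '_', followed by the process uid and ".crash".
crash_report_path() {
    # $1: executable path, $2: uid of the crashed process
    printf '/var/crash/%s.%s.crash\n' "$(printf '%s' "$1" | tr '/' '_')" "$2"
}

# Hypothetical helper: dump a full backtrace of every thread in a core file.
print_backtrace() {
    # $1: executable, $2: core file
    gdb -batch -ex 'thread apply all bt full' "$1" "$2"
}

main() {
    report=$(crash_report_path "$LRMD_BIN" 0)   # lrmd runs as root (uid 0)
    if [ ! -f "$report" ]; then
        echo "no apport report found at $report" >&2
        return 1
    fi
    workdir=$(mktemp -d)
    # apport-unpack writes each report field to its own file, including CoreDump.
    apport-unpack "$report" "$workdir"
    print_backtrace "$LRMD_BIN" "$workdir/CoreDump"
    echo "unpacked report left in $workdir" >&2
}

# Only touch the system when explicitly asked; running without the "run"
# argument just defines the helpers.
if [ "${1:-}" = "run" ]; then
    main
fi
```

Saved as, say, lrmd-backtrace.sh (a hypothetical file name), it would be run as root with `sh lrmd-backtrace.sh run`. The gdb output is the kind of information that would let someone check the crash against the lrmd fixes listed above.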