Hello, I've just come across the howto http://www.clusterlabs.org/wiki/Nagios3_on_Pacemaker_DRBD and found some serious problems. I am experimenting with my first cluster configuration, which should end up as a stable and reliable production environment on a new blade server and an older existing machine. My Linux is Debian squeeze. I am using an active/active configuration with an OCFS2 filesystem.
Problems I found:

- The current LSB start script in Debian (version 3.2.1-2) is faulty. In addition to the pid-file fault mentioned in the wiki, it contains two 'status ()' sections. This looks like an overlooked modification made while debugging something. I have corrected that manually. Unfortunately I cannot post the patch here because the mailing list thinks I am top-posting.

- On a shared device, ownership is determined by uid/gid. nagios itself needs user:group nagios:nagios on /var/lib/nagios3/retention.dat and /var/lib/nagios3/spool/, so the numeric uid/gid must be the same on all nodes running nagios3. Otherwise nagios3 will not run on at least one node! I have solved that problem by creating uid:gid nagios:nagios identically on all nodes before actually doing apt-get install nagios3. This works.

- When corosync starts, before starting any resource it also checks whether all resources are stopped correctly. '/etc/init.d/nagios3 status' fails because it cannot find its config file '/etc/nagios3/nagios.cfg', which is on the shared device. This cannot be prevented by an order constraint, which I believe is correct.

I am currently thinking about how to solve this last problem most elegantly. I see the following methods:

- Patching the LSB script. I think this is no good, as config check
- Creating a /mnt_shared/etc/nagios3/nagios.cfg on each node before mounting the filesystem. Advantage: the file only has to be correct; it is only read when nagios3 is not started, so we don't have to sync it. Disadvantage: this can easily be forgotten, as you usually don't see this file.
- Linking the files in /etc/nagios3/ individually and leaving out nagios.cfg, which can then be synced via csync2 or similar. That could be done by a script.

None of them is easy. I am thinking of dropping the shared config of nagios, but I need the shared device anyway for apache. Does anyone have ideas for solving these problems elegantly?
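To make the uid/gid point and the individual-linking idea more concrete, here is a rough sketch in Python of what such a helper script might look like. It is only an illustration under assumptions, not a tested solution: the function names (parse_ids, mismatched_nodes, link_shared_config) are made up, the passwd lines are assumed to have been collected by hand from each node (for example with `getent passwd nagios`), and the node names and uid/gid numbers in the demo are invented sample values.

```python
import os
import tempfile


def parse_ids(passwd_line):
    """Return (uid, gid) from one passwd-format line."""
    fields = passwd_line.strip().split(":")
    return int(fields[2]), int(fields[3])


def mismatched_nodes(passwd_by_node):
    """Return the nodes whose (uid, gid) differ from the first node given."""
    ids = {node: parse_ids(line) for node, line in passwd_by_node.items()}
    reference = next(iter(ids.values()))
    return sorted(node for node, pair in ids.items() if pair != reference)


def link_shared_config(shared_dir, local_dir, skip=("nagios.cfg",)):
    """Symlink each entry of shared_dir into local_dir, except names in skip.

    Existing local files or links are left alone, so a locally kept
    (and separately synced) nagios.cfg is never replaced.
    """
    linked = []
    for name in sorted(os.listdir(shared_dir)):
        if name in skip:
            continue
        target = os.path.join(local_dir, name)
        if os.path.lexists(target):
            continue  # do not overwrite anything that is already there
        os.symlink(os.path.join(shared_dir, name), target)
        linked.append(name)
    return linked


if __name__ == "__main__":
    # Invented sample data: one passwd line per (hypothetical) node.
    sample = {
        "node1": "nagios:x:108:116::/var/lib/nagios:/bin/false",
        "node2": "nagios:x:109:116::/var/lib/nagios:/bin/false",
    }
    print("nodes with differing uid/gid:", mismatched_nodes(sample))
```

The linking function would have to run while the shared filesystem is mounted, since it lists the shared directory; afterwards the links simply dangle whenever the device is not mounted, while the local nagios.cfg stays readable for the status check.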
_______________________________________________ Pacemaker mailing list: Pacemaker@oss.clusterlabs.org http://oss.clusterlabs.org/mailman/listinfo/pacemaker Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://developerbugs.linux-foundation.org/enter_bug.cgi?product=Pacemaker