Everything is good now, so there is no stress about anything here, but I am poking at the log files with a stick after a strange incident. Perhaps this tripped over some problem, and discussing it might enlighten me or, less likely, improve things. Who knows?

The GNU Savannah software forge had a network outage in the data center lasting about eight hours. It happened in the dark of the night, and things were fixed quickly once the admins woke up at sunrise and went to the data center. Due to the timing this was an unusually long network outage.

I would like to describe four of the VMs of interest here. Two were okay after networking returned, but two were found afterward without postfix running. I am curious whether the why is somehow useful or interesting to know.

All of the systems have their root block storage on a Ceph network-attached storage pool, which of course meant that the root file system was unavailable for the full eight hours of the outage. If some bit of file data is cached and not expired then the Linux kernel can service the request. If not, and it needs to read the data, then it attempts a network read and of course blocks waiting for network I/O.

Of course cron jobs were still running, and stacking up processes blocked waiting on I/O. One server reached a load average of 520 and recovered perfectly well after networking was restored. Another reached a load of 68, but afterward the postfix daemons were found not running.

In summary:

    Two of the four had no discernible failures.
    Two of the four were found with postfix not running afterward.

Postfix seems to have been the only noticed failure, which I found rather unusual and perhaps noteworthy. But this is a very unusual situation, where the root file system is unavailable for an extended period of time. I am simply reviewing things afterward, trying to understand and perhaps improve things, since two of them failed. But again there is no stress here. Everything is all good now, and this is a highly unusual system event.

Because those two were found without postfix running, I subsequently rebooted all of the servers as preventative maintenance, even though the others seemed perfectly okay afterward.
I did so because almost certainly there would be other problems not yet found.

Any ideas on why postfix would not be running after such an event on two of the systems but okay on the others?

Bob

I am abbreviating everything because it was too large for the mailing list on the first sending of this message. :-)

Here is one of the config files for the two where postfix was found not running. I'll put the long details below. The other two are similar. This is on an older Trisquel OS. Trisquel is a fork of Ubuntu, so basically think Ubuntu here.

root@vcs1:~# postconf -n
alias_database = hash:/etc/aliases
alias_maps = hash:/etc/aliases
append_dot_mydomain = no
biff = no
canonical_maps = hash:/etc/postfix/canonical
compatibility_level = 2
inet_interfaces = loopback-only
inet_protocols = ipv4
mailbox_size_limit = 0
masquerade_domains = savannah.gnu.org
masquerade_exceptions = root
mydestination = $myhostname, localhost.$mydomain, localhost
myhostname = vcs1.savannah.gnu.org
mynetworks = 127.0.0.0/8 [::ffff:127.0.0.0]/104 [::1]/128
myorigin = /etc/mailname
non_smtpd_milters = unix:/var/run/opendkim/opendkim.sock
readme_directory = no
recipient_delimiter = +
relayhost =
smtp_tls_session_cache_database = btree:${data_directory}/smtp_scache
smtpd_banner = $myhostname ESMTP $mail_name (Debian/GNU)
smtpd_milters = unix:/var/run/opendkim/opendkim.sock
smtpd_relay_restrictions = permit_mynetworks permit_sasl_authenticated defer_unauth_destination
smtpd_tls_cert_file = /etc/ssl/certs/ssl-cert-snakeoil.pem
smtpd_tls_key_file = /etc/ssl/private/ssl-cert-snakeoil.key
smtpd_tls_session_cache_database = btree:${data_directory}/smtpd_scache
smtpd_use_tls = yes

And here is the smaller of the log events. Timezone is US/Eastern for both.

...everything is fine...
Dec 20 01:45:45 vcs1 postfix/qmgr[1996]: A25F0A908C: removed
Dec 20 09:43:31 vcs1 postfix/master[1983]: warning: unix_trigger_event: read timeout for service public/pickup
Dec 20 09:43:31 vcs1 postfix/master[1983]: warning: unix_trigger_event: read timeout for service public/pickup
Dec 20 09:43:31 vcs1 postfix/master[1983]: warning: unix_trigger_event: read timeout for service public/pickup
...I trimmed out 27 identical lines just to reduce this...
Dec 20 09:43:31 vcs1 postfix/master[1983]: warning: unix_trigger_event: read timeout for service public/pickup
Dec 20 09:43:32 vcs1 postfix/master[1983]: warning: unix_trigger_event: read timeout for service public/pickup
Dec 20 09:43:32 vcs1 postfix/master[1983]: warning: unix_trigger_event: read timeout for service public/pickup
Dec 20 09:43:32 vcs1 postfix/postfix-script[30251]: fatal: the Postfix mail system is not running

...I really rather arbitrarily trimmed this for mailing list size...

Dec 20 03:01:31 frontend1 postfix/qmgr[10024]: warning: problem talking to service rewrite: Connection timed out
Dec 20 03:01:58 frontend1 postfix/master[10022]: warning: unix_trigger_event: read timeout for service private/tlsmgr
Dec 20 03:02:13 frontend1 postfix/master[10022]: warning: unix_trigger_event: read timeout for service public/qmgr
...
Dec 20 04:51:34 frontend1 postfix/master[10022]: warning: unix_trigger_event: read timeout for service public/pickup
Dec 20 04:51:53 frontend1 postfix/master[10022]: warning: master_wakeup_timer_event: service pickup(public/pickup): Resource temporarily unavailable
Dec 20 04:52:14 frontend1 postfix/master[10022]: warning: unix_trigger_event: read timeout for service public/qmgr
Dec 20 04:52:34 frontend1 postfix/master[10022]: warning: unix_trigger_event: read timeout for service public/pickup
Dec 20 04:52:54 frontend1 postfix/master[10022]: warning: master_wakeup_timer_event: service pickup(public/pickup): Resource temporarily unavailable
...
Dec 20 06:46:58 frontend1 postfix/master[10022]: warning: master_wakeup_timer_event: service pickup(public/pickup): Resource temporarily unavailable
Dec 20 06:47:16 frontend1 postfix/master[10022]: warning: unix_trigger_event: read timeout for service public/qmgr
Dec 20 09:43:22 frontend1 postfix/qmgr[17568]: CA6EE20D2C: from=<www-d...@savannah.gnu.org>, size=1479, nrcpt=1 (queue active)
Dec 20 09:43:22 frontend1 postfix/qmgr[17568]: warning: connect #1 to subsystem private/rewrite: Connection refused
Dec 20 09:43:32 frontend1 postfix/qmgr[17568]: warning: connect #2 to subsystem private/rewrite: Connection refused
Dec 20 09:43:42 frontend1 postfix/qmgr[17568]: warning: connect #3 to subsystem private/rewrite: Connection refused
Dec 20 09:43:52 frontend1 postfix/qmgr[17568]: warning: connect #4 to subsystem private/rewrite: Connection refused
Dec 20 09:44:02 frontend1 postfix/qmgr[17568]: warning: connect #5 to subsystem private/rewrite: Connection refused
Dec 20 09:44:12 frontend1 postfix/qmgr[17568]: warning: connect #6 to subsystem private/rewrite: Connection refused
Dec 20 09:44:22 frontend1 postfix/qmgr[17568]: warning: connect #7 to subsystem private/rewrite: Connection refused
Dec 20 09:44:32 frontend1 postfix/qmgr[17568]: warning: connect #8 to subsystem private/rewrite: Connection refused
Dec 20 09:44:42 frontend1 postfix/qmgr[17568]: warning: connect #9 to subsystem private/rewrite: Connection refused
Dec 20 09:44:52 frontend1 postfix/qmgr[17568]: warning: connect #10 to subsystem private/rewrite: Connection refused
Dec 20 09:45:02 frontend1 postfix/qmgr[17568]: fatal: connect #11 to subsystem private/rewrite: Connection refused
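
For anyone who wants to poke at logs like these with a slightly sharper stick, here is a minimal sketch in Python. It is only an illustration, not how the excerpts above were produced. It finds the longest silence in the postfix log per host and lists which PIDs each daemon logged under, since a changed PID suggests the daemon was restarted somewhere along the way. The function names (parse, largest_gaps, pids_seen) are made up for this example, the regular expression only matches lines shaped like the ones quoted above, and the year is an assumption because these syslog timestamps do not carry one. Month names are parsed with %b, so this also assumes an English or C locale.

```python
import re
from collections import defaultdict
from datetime import datetime

# Assumption: the log lines carry no year, so one has to be supplied.
ASSUMED_YEAR = 2000

# Matches lines such as:
#   Dec 20 09:43:31 vcs1 postfix/master[1983]: warning: ...
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) postfix/(?P<daemon>[\w-]+)\[(?P<pid>\d+)\]: (?P<msg>.*)$"
)


def parse(lines, year=ASSUMED_YEAR):
    """Return (timestamp, host, daemon, pid, message) for each postfix line.

    Lines that do not match, such as the '...trimmed...' notes, are skipped.
    """
    entries = []
    for line in lines:
        m = LINE_RE.match(line.strip())
        if m is None:
            continue
        ts = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S")
        entries.append((ts, m["host"], m["daemon"], int(m["pid"]), m["msg"]))
    return entries


def largest_gaps(entries):
    """Per host, the longest pause between consecutive lines, in file order.

    Returns {host: (gap, time_before, time_after)}.
    """
    last = {}
    gaps = {}
    for ts, host, _daemon, _pid, _msg in entries:
        if host in last:
            gap = ts - last[host]
            if host not in gaps or gap > gaps[host][0]:
                gaps[host] = (gap, last[host], ts)
        last[host] = ts
    return gaps


def pids_seen(entries):
    """Map (host, daemon) to the set of PIDs that daemon logged under."""
    seen = defaultdict(set)
    for _ts, host, daemon, pid, _msg in entries:
        seen[(host, daemon)].add(pid)
    return seen


# A few lines copied from the excerpts above, just to exercise the functions.
SAMPLE = [
    "Dec 20 01:45:45 vcs1 postfix/qmgr[1996]: A25F0A908C: removed",
    "Dec 20 09:43:31 vcs1 postfix/master[1983]: warning: unix_trigger_event: read timeout for service public/pickup",
    "Dec 20 09:43:32 vcs1 postfix/postfix-script[30251]: fatal: the Postfix mail system is not running",
    "Dec 20 03:01:31 frontend1 postfix/qmgr[10024]: warning: problem talking to service rewrite: Connection timed out",
    "Dec 20 09:43:22 frontend1 postfix/qmgr[17568]: warning: connect #1 to subsystem private/rewrite: Connection refused",
]

if __name__ == "__main__":
    entries = parse(SAMPLE)
    for host, (gap, before, after) in largest_gaps(entries).items():
        print(f"{host}: silent for {gap} between {before:%H:%M:%S} and {after:%H:%M:%S}")
    for (host, daemon), pids in sorted(pids_seen(entries).items()):
        if len(pids) > 1:
            print(f"{host}: {daemon} logged under several PIDs: {sorted(pids)}")
```

Run against the full mail log instead of SAMPLE (for example by passing an open file to parse), this should make two things from the excerpts easier to spot: on vcs1 nothing was logged between 01:45:45 and 09:43:31, and on frontend1 qmgr logged as PID 10024 before the outage ended and as PID 17568 afterward.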