Everything is good, so no stress about anything here, but I am poking at
the log files with a stick after a strange incident.  Perhaps this
tripped over some problem, and discussing it might either enlighten me
or, less likely, improve things.  Who knows?

The GNU Savannah software forge had a network outage in the data
center lasting about eight hours.  It was the dark of the night, and
things were fixed quickly once the admins woke up at sunrise and went
to the data center.  Due to the timing this was an unusually long
network outage.

I would like to describe four of the VMs of interest here.  Two were
okay after networking returned.  But two were found afterward without
postfix running.  I am curious whether the reason why is useful or
interesting to know.

All of the systems have their root block storage on a Ceph network
attached storage pool, which of course meant that the root file
system was unavailable for the full eight hours of the outage.
During that time, if some bit of file data is cached and not expired
then the Linux kernel can service the request.  If not, and it needs
to read the data, then it attempts a network read and of course blocks
waiting for network I/O.

Of course cron jobs were still running, stacking up processes blocked
waiting on I/O.  One server reached a load average of 520 and
recovered perfectly fine after networking was restored.  Another
reached a load of 68, but afterward the postfix daemons were found
not running.  In summary: two of the four had no discernible
failures, and two of the four were found with postfix not running
afterward.  Postfix seems to have been the only noticed failure,
which I found rather unusual and perhaps noteworthy.  But this is a
very unusual situation where the root file system is unavailable for
an extended period of time.
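
For what it is worth, the high load numbers by themselves are expected
here: on Linux the load average counts tasks in uninterruptible sleep
(state D) as well as runnable ones, so every cron job stuck reading
from the missing root disk adds one.  A minimal sketch for counting
such tasks, assuming the procps version of ps; the function name
count_dstate is made up for illustration:

```shell
# count_dstate: read process state codes on stdin, one per line (the
# format produced by `ps -eo stat=`), and print how many of them are in
# uninterruptible sleep, i.e. have a state code beginning with "D".
count_dstate() {
    awk '$1 ~ /^D/ { n++ } END { print n + 0 }'
}

# Typical use on a live system:
#   ps -eo stat= | count_dstate
```

A count that keeps climbing while the CPUs sit idle points at blocked
I/O rather than real work.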

I am simply reviewing things afterward now, trying to understand and
perhaps improve things, since two of them failed.  But again there is
no stress here.  Everything is all good now, and this is a highly
unusual system event.  Because those two were found without postfix
running, I subsequently rebooted all of the servers as preventive
maintenance, even though the others seemed perfectly okay afterward,
because almost certainly there would be other as yet undiscovered
problems.

Any ideas on why postfix would not be running after such an event on
two of the systems but okay on the others?

Bob

I am abbreviating everything because it was too large for the mailing
list on the first sending of this message. :-)

Here is one of the config files for the two where postfix was found
not running.  I'll put the long details below.  The other two are
similar.  This is on an older Trisquel OS.  Trisquel is a fork of
Ubuntu.  So basically think Ubuntu here.

    root@vcs1:~# postconf -n
    alias_database = hash:/etc/aliases
    alias_maps = hash:/etc/aliases
    append_dot_mydomain = no
    biff = no
    canonical_maps = hash:/etc/postfix/canonical
    compatibility_level = 2
    inet_interfaces = loopback-only
    inet_protocols = ipv4
    mailbox_size_limit = 0
    masquerade_domains = savannah.gnu.org
    masquerade_exceptions = root
    mydestination = $myhostname, localhost.$mydomain, localhost
    myhostname = vcs1.savannah.gnu.org
    mynetworks = 127.0.0.0/8 [::ffff:127.0.0.0]/104 [::1]/128
    myorigin = /etc/mailname
    non_smtpd_milters = unix:/var/run/opendkim/opendkim.sock
    readme_directory = no
    recipient_delimiter = +
    relayhost =
    smtp_tls_session_cache_database = btree:${data_directory}/smtp_scache
    smtpd_banner = $myhostname ESMTP $mail_name (Debian/GNU)
    smtpd_milters = unix:/var/run/opendkim/opendkim.sock
    smtpd_relay_restrictions = permit_mynetworks permit_sasl_authenticated defer_unauth_destination
    smtpd_tls_cert_file = /etc/ssl/certs/ssl-cert-snakeoil.pem
    smtpd_tls_key_file = /etc/ssl/private/ssl-cert-snakeoil.key
    smtpd_tls_session_cache_database = btree:${data_directory}/smtpd_scache
    smtpd_use_tls = yes

And here is the smaller of the log events.  Timezone is US/Eastern for both.

    ...everything is fine...
    Dec 20 01:45:45 vcs1 postfix/qmgr[1996]: A25F0A908C: removed
    Dec 20 09:43:31 vcs1 postfix/master[1983]: warning: unix_trigger_event: read timeout for service public/pickup
    Dec 20 09:43:31 vcs1 postfix/master[1983]: warning: unix_trigger_event: read timeout for service public/pickup
    Dec 20 09:43:31 vcs1 postfix/master[1983]: warning: unix_trigger_event: read timeout for service public/pickup
    ...I trimmed out 27 identical lines just to reduce this...
    Dec 20 09:43:31 vcs1 postfix/master[1983]: warning: unix_trigger_event: read timeout for service public/pickup
    Dec 20 09:43:32 vcs1 postfix/master[1983]: warning: unix_trigger_event: read timeout for service public/pickup
    Dec 20 09:43:32 vcs1 postfix/master[1983]: warning: unix_trigger_event: read timeout for service public/pickup
    Dec 20 09:43:32 vcs1 postfix/postfix-script[30251]: fatal: the Postfix mail system is not running

    ...I really rather arbitrarily trimmed this for mailing list size...
    Dec 20 03:01:31 frontend1 postfix/qmgr[10024]: warning: problem talking to service rewrite: Connection timed out
    Dec 20 03:01:58 frontend1 postfix/master[10022]: warning: unix_trigger_event: read timeout for service private/tlsmgr
    Dec 20 03:02:13 frontend1 postfix/master[10022]: warning: unix_trigger_event: read timeout for service public/qmgr
    ...
    Dec 20 04:51:34 frontend1 postfix/master[10022]: warning: unix_trigger_event: read timeout for service public/pickup
    Dec 20 04:51:53 frontend1 postfix/master[10022]: warning: master_wakeup_timer_event: service pickup(public/pickup): Resource temporarily unavailable
    Dec 20 04:52:14 frontend1 postfix/master[10022]: warning: unix_trigger_event: read timeout for service public/qmgr
    Dec 20 04:52:34 frontend1 postfix/master[10022]: warning: unix_trigger_event: read timeout for service public/pickup
    Dec 20 04:52:54 frontend1 postfix/master[10022]: warning: master_wakeup_timer_event: service pickup(public/pickup): Resource temporarily unavailable
    ...
    Dec 20 06:46:58 frontend1 postfix/master[10022]: warning: master_wakeup_timer_event: service pickup(public/pickup): Resource temporarily unavailable
    Dec 20 06:47:16 frontend1 postfix/master[10022]: warning: unix_trigger_event: read timeout for service public/qmgr
    Dec 20 09:43:22 frontend1 postfix/qmgr[17568]: CA6EE20D2C: from=<www-d...@savannah.gnu.org>, size=1479, nrcpt=1 (queue active)
    Dec 20 09:43:22 frontend1 postfix/qmgr[17568]: warning: connect #1 to subsystem private/rewrite: Connection refused
    Dec 20 09:43:32 frontend1 postfix/qmgr[17568]: warning: connect #2 to subsystem private/rewrite: Connection refused
    Dec 20 09:43:42 frontend1 postfix/qmgr[17568]: warning: connect #3 to subsystem private/rewrite: Connection refused
    Dec 20 09:43:52 frontend1 postfix/qmgr[17568]: warning: connect #4 to subsystem private/rewrite: Connection refused
    Dec 20 09:44:02 frontend1 postfix/qmgr[17568]: warning: connect #5 to subsystem private/rewrite: Connection refused
    Dec 20 09:44:12 frontend1 postfix/qmgr[17568]: warning: connect #6 to subsystem private/rewrite: Connection refused
    Dec 20 09:44:22 frontend1 postfix/qmgr[17568]: warning: connect #7 to subsystem private/rewrite: Connection refused
    Dec 20 09:44:32 frontend1 postfix/qmgr[17568]: warning: connect #8 to subsystem private/rewrite: Connection refused
    Dec 20 09:44:42 frontend1 postfix/qmgr[17568]: warning: connect #9 to subsystem private/rewrite: Connection refused
    Dec 20 09:44:52 frontend1 postfix/qmgr[17568]: warning: connect #10 to subsystem private/rewrite: Connection refused
    Dec 20 09:45:02 frontend1 postfix/qmgr[17568]: fatal: connect #11 to subsystem private/rewrite: Connection refused
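
The vcs1 excerpt above goes silent between 01:45:45 and 09:43:31,
which lines up with the outage window.  A rough sketch for finding
such silent stretches in a mail log, assuming classic "Mon DD
HH:MM:SS" syslog timestamps and that all lines fall within one month
(a simplification); the function name log_gaps is made up for
illustration:

```shell
# log_gaps: read syslog lines on stdin and report every place where two
# consecutive lines are more than $1 seconds apart.
# Assumes "Mon DD HH:MM:SS ..." timestamps, all within the same month.
log_gaps() {
    awk -v limit="$1" '
    {
        split($3, t, ":")                       # HH:MM:SS into t[1..3]
        now = $2 * 86400 + t[1] * 3600 + t[2] * 60 + t[3]
        if (NR > 1 && now - prev > limit)
            printf "%d seconds of silence before: %s %s %s\n", now - prev, $1, $2, $3
        prev = now
    }'
}

# Typical use (the log path is an assumption, adjust for the system):
#   log_gaps 3600 < /var/log/mail.log
```

Fed the 01:45:45 and 09:43:31 lines from vcs1, this reports 28666
seconds, a little under eight hours.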
