On 12/22/21 8:26 PM, Bob Proulx wrote:
> Everything is good so no stress about anything here but I am poking at
> the log files with a stick after a strange incident.  Perhaps this
> tripped over some problem, and discussing it might either enlighten me
> or, less likely, improve things.  Who knows?
> 
> The GNU Savannah software forge had a network outage in the data
> center lasting about eight hours.  It was the dark of the night and
> things were fixed quickly once the admins woke up at sunrise and went
> to the data center to fix things.  Due to the timing this was an
> unusually long network outage event.
> 
> I would like to describe four of the VMs of interest here.  Two were
> okay after networking returned.  But two were found afterward without
> postfix running.  I am curious if the why is somehow useful or
> interesting to know.
> 
> All of the systems have their root block storage on a ceph network
> attached storage pool.  Which of course meant that the root file
> system was unavailable for the full time of the eight hour outage.
> Therefore if some bit of file data is cached and not expired then the
> Linux kernel can service the request.  If not and if it needs to read
> the data then it attempts a network read and of course blocks waiting
> for network I/O.
> 
> Of course cron jobs were still running.  And stacking up processes
> blocked on I/O waiting.  One server achieved a load average of 520 and
> was perfectly fine recovering after networking was restored.  Another
> reached a load of 68 but afterward the postfix daemons were found to
> be not running.  In summary: Two of the four had no discernible
> failures.  Two of the four were found with postfix not running
> afterward.  Postfix seems to have been the only noticed failure.
> Which I found rather unusual.  And perhaps noteworthy.  But this is a
> very unusual situation where the root file system is unavailable for
> an extended period of time.
> 
> I am simply reviewing things afterward now.  Trying to understand and
> perhaps improve things.  Since two of them failed.  But again there is
> no stress here.  Everything is all good now.  And this is a highly
> unusual system event.  Because those two were found without postfix
> running I rebooted all of the servers subsequently as a preventative
> maintenance action.  Even though others seemed perfectly okay
> afterward.  Because almost certainly there would be other as yet not
> found problems.
> 
> Any ideas on why postfix would not be running after such an event on
> two of the systems but okay on the others?
> 
> Bob

My intuition is that either some timeout somewhere got hit, or that
some I/O failed (rather than being queued forever) and caused an error
paging in some code.  In the latter case, Postfix would die with SIGBUS.
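
One way to tell the two cases apart after the fact is to look at how a
process ended: a death by signal shows up differently from a normal exit.
Below is a minimal Python sketch of that distinction, assuming the process
was started through the standard `subprocess` module, where a negative
return code means "killed by that signal number". The helper name
`describe_exit` is hypothetical and is not part of Postfix; the demo child
simply sends SIGBUS to itself to stand in for a failed page-in.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess-style return code into a readable description."""
    if returncode is None:
        return "still running"
    if returncode < 0:
        # subprocess reports death by signal N as return code -N.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return f"exited with status {returncode}"


if __name__ == "__main__":
    # Stand-in for a daemon hit by a failed page-in: the child process
    # sends SIGBUS to itself, and the parent inspects how it ended.
    child = subprocess.run(
        [sys.executable, "-c",
         "import os, signal; os.kill(os.getpid(), signal.SIGBUS)"]
    )
    print(describe_exit(child.returncode))
```

For a daemon that is not a child of your own script, the same information
would have to come from whatever supervises it or from the system logs,
if they could be written at all during the outage.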

Do you have Postfix set to automatically be restarted if it crashes?
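
If nothing restarts it today, a periodic check is one simple option. The
sketch below is only an illustration under assumptions: it assumes the
`postfix status` and `postfix start` commands are available and that
`postfix status` returns non-zero when the mail system is not running.
The function name `ensure_running` and its `run` parameter are
hypothetical, added so the logic can be exercised without touching a real
mail system. A service manager's own restart setting, where one is in
use, would likely be the more usual choice.

```python
import subprocess


def ensure_running(status_cmd, start_cmd, run=subprocess.run):
    """Run status_cmd; if it reports failure, run start_cmd.

    Returns True if a start was attempted, False if the service
    already appeared to be running.
    """
    if run(status_cmd).returncode == 0:
        return False
    run(start_cmd)
    return True


if __name__ == "__main__":
    # Assumed commands; adjust for the local installation.
    restarted = ensure_running(["postfix", "status"], ["postfix", "start"])
    print("start attempted" if restarted else "already running")
```

Note that a check like this would itself have blocked during the outage if
its files were not cached, so it only helps once storage is reachable again.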

-- 
Sincerely,
Demi Marie Obenour (she/her/hers)

