Re: [Mailman-Users] stuck bounce-events write

2011-11-22 Thread Richard Haas
On Tue, Nov 22, 2011 at 2:54 AM, Mark Sapiro m...@msapiro.net wrote:

 Richard Haas wrote:
 
 Last week, we ran into a rather odd event (first time we've seen it in a
 decade or so of Mailman ops), where a bounced message kept writing to its
 /var/mailman/data/bounce-events-PID.pck file, until it filled the disk
 [...]
 The first bounce seemed normal:
 
 Delivery failed for the following reason:
 Server entsrmbx01.teledyne.com[74.201.16.49] failed with: 550 #5.1.0
 Address rejected someone@teledyne.com
 This has been a permanent failure.  No further delivery attempts will be
 made.
 
 
 ... and then the first bounce was written to the
 /var/mailman/data/bounce-events pck file at:
 
 Received from localhost (localhost [127.0.0.1])
by [myserver] (Postfix) with ESMTP id 3272D601E21
for [somelist]-news-bounces@myserver; Wed, 16 Nov 2011 16:41:55
 -0500 (EST)
 
 
 ... but was written another 5000 times with it's PDF attachment.
 
 ... at 17:43 local, the disk filled with:
 
 Nov 16 17:43:13 [hostname] postfix/smtpd[27239]: NOQUEUE: reject: MAIL
 from
 localhost[127.0.0.1]: 452 4.3.1 Insufficient system storage; proto=ESMTP
 helo=localhost
 Nov 16 17:43:13 [hostname] postfix/smtpd[27239]: warning: not enough free
 space in mail queue: 36216832 bytes  1.5*message size limit


 So is it Postfix that is repeatedly delivering this bounce message to
 Mailman, and Mailman is just trying to deal with it? If so, the
 underlying issue is with Postfix, not Mailman.


There's nothing in mail.log for 127.0.0.1 deliveries related to this list,
during the timeframe, except clean passes through amavis. Certainly no 5000
redeliveries.

There is normal traffic sending the message outbound to the 130 or so list
members, with about 20 bounces.

The original address that led to the bounce was accepted outbound for
delivery, and then came back about 17 seconds later, after remote
translation of

someoneelse@someresearch.com to someone@telecdyne.com

... but cam back only once at:

Nov 16 16:41:55 myhost amavis[26807]: (26807-14) Passed CLEAN,
[66.18.18.67] [66.18.18.67]  - mylist-news-boun...@my.org, Message-ID:
2016214143.10704.6...@hm-mx2b.solinus.com, mail_id: PvH5EnpUQiLg,
Hits: -, size: 7808036, queued_as: 3272D601E21, 677 ms

... which was when the first bounce was written to the bounce-events file.

Anyhow, it doesn't look like Postfix processed the bounce 5000 times, and
was repeatedly handing it back to BounceRunner.


 Oddly, the bounce-events file's last write was considerably earlier:
 
 -rw-rw-r-- 1 mailman mailman 37867700224 2011-11-16 16:52
 bounce-events-01556.pck


 This is an orphaned bounce-events file that was left behind when
 BounceRunner PID 1556 died - see below.


 The only questionable log entries I can find after/around the message
 posting is for an abort of the Archrunner and Bouncerunner before the disk
 filled:
 
 Nov 16 17:06:08 2011 (1533) Master qrunner detected subprocess exit
 (pid: 1555, sig: 9, sts: None, class: ArchRunner, slice: 1/1) [restarting]
 Nov 16 17:06:14 2011 (27568) ArchRunner qrunner started.
 Nov 16 17:06:18 2011 (1533) Master qrunner detected subprocess exit
 (pid: 1556, sig: 9, sts: None, class: BounceRunner, slice: 1/1)
 [restarting]
 Nov 16 17:06:20 2011 (27573) BounceRunner qrunner started.


 BounceRunner PID 1556 died at 17:06 possibly while processing the
 bounces in qfiles/bounces or while processing the bounce-events file.
 This is within the normal 15 minute window for processing
 bounce-events.

 Where there any error log entries associated with the deaths of
 BounceRunner or ArchRunner?


No, other than the sig 9(s) above, there's nothing in the qrunner logs
since 6:33 that morning, and afterwards (18:43) when admins did a halt on
mailman and postfix before moving the 37 GB bounce-events-01556.pck file to
another filesystem.



 Since the [somelist]-news list in question doesn't have archiving enabled
 (never has), so I the ArchRunner restart. There were no other qrunner
 restarts since 6:33 that morning, and the only later ones were when admin
 staff went in to clear things up.
 
 So chronologically:
 
 - 16:41 first bounce occurs and is written to bounce-events by PID 1556
 - 16:52 PID 1556 makes (according to the file system) its last write to
 bounce-events-1556.pck
 - 17:02 ArchRunner and BounceRunner get restart signals, PID 27573 is the
 new BounceRunner
 - 17:43 the disk fills to the point postfix starts complaining
 
 
 ... that doesn't add up unless one presumes PID 1556 hung around and kept
 writing bounce-events-1556.pck from 17:02 to 17:43 (and probably beyond).


 PID 1556 was gone at 17:06:18. It is not clear whether the repeated
 bounce events were due to BounceRunner in some kind of loop or
 BounceRunner actually receiving bounces from Postfix.


Since there's nothing to indicate postfix was repeatedly passing the
bounce, I think we're stuck looking at the BounceRunner.


 Are there Postfix log entries for the delivery of 5000 

Re: [Mailman-Users] stuck bounce-events write

2011-11-22 Thread Mailman Admin
Hi Richard Haas


On 2011-11-22 15:14, Richard Haas wrote:
 ... 
 
 The original address that led to the bounce was accepted outbound for
 delivery, and then came back about 17 seconds later, after remote
 translation of
 
 someoneelse@someresearch.com to someone@telecdyne.com
 
 ... but cam back only once at:
 
 Nov 16 16:41:55 myhost amavis[26807]: (26807-14) Passed CLEAN,
 [66.18.18.67] [66.18.18.67]  - mylist-news-boun...@my.org, Message-ID:
 2016214143.10704.6...@hm-mx2b.solinus.com, mail_id: PvH5EnpUQiLg,
 Hits: -, size: 7808036, queued_as: 3272D601E21, 677 ms
 cut 

What do you get for the queued Message 3272D601E21 in your postfix.log?


Kind regards,
Christian Mack
--
Mailman-Users mailing list Mailman-Users@python.org
http://mail.python.org/mailman/listinfo/mailman-users
Mailman FAQ: http://wiki.list.org/x/AgA3
Security Policy: http://wiki.list.org/x/QIA9
Searchable Archives: http://www.mail-archive.com/mailman-users%40python.org/
Unsubscribe: 
http://mail.python.org/mailman/options/mailman-users/archive%40jab.org


[Mailman-Users] stuck bounce-events write

2011-11-21 Thread Richard Haas
Greetings.

Last week, we ran into a rather odd event (first time we've seen it in a
decade or so of Mailman ops), where a bounced message kept writing to its
/var/mailman/data/bounce-events-PID.pck file, until it filled the disk --
it wasn't a series of bounces (same delivery times on every received
header), it was the same bounce written over and over again (about 5000
times), unfortunately, that bounce contained a 7 MB base64 encoded
attachment.

I thought I'd toss the details out here, to see if anyone had a thought on
what caused this, or (other than implementing a disk watching cron job,
which we already have), how to avoid it in the future ... though it
certainly seems like a fluke.

Anyhow, the message was first sent to the list at 16:41:35 (after manual
list owner approval of the moderated posting), and was posted normally at:


/var/mailman/logs/post:Nov 16 16:41:35 2011 (1560) post to [somelist]-news
from [someone]@my.org size=7705563,
message-id=AB96A557B3C21F4298987419BF86FA1B672214168A@[my domain]v,
success


The message had a PDF attachment, and was from the list owner. The list had
about 120 members, and most of the members (there were some other
undeliverable addresses) got the message. The list owner got the message.
There appeared to be nothing bogus about the attachment.

The first bounce seemed normal:

Delivery failed for the following reason:
Server entsrmbx01.teledyne.com[74.201.16.49] failed with: 550 #5.1.0
Address rejected someone@teledyne.com
This has been a permanent failure.  No further delivery attempts will be
made.


... and then the first bounce was written to the
/var/mailman/data/bounce-events pck file at:

Received from localhost (localhost [127.0.0.1])
   by [myserver] (Postfix) with ESMTP id 3272D601E21
   for [somelist]-news-bounces@myserver; Wed, 16 Nov 2011 16:41:55
-0500 (EST)


... but was written another 5000 times with it's PDF attachment.

... at 17:43 local, the disk filled with:

Nov 16 17:43:13 [hostname] postfix/smtpd[27239]: NOQUEUE: reject: MAIL from
localhost[127.0.0.1]: 452 4.3.1 Insufficient system storage; proto=ESMTP
helo=localhost
Nov 16 17:43:13 [hostname] postfix/smtpd[27239]: warning: not enough free
space in mail queue: 36216832 bytes  1.5*message size limit


Oddly, the bounce-events file's last write was considerably earlier:

-rw-rw-r-- 1 mailman mailman 37867700224 2011-11-16 16:52
bounce-events-01556.pck


The only questionable log entries I can find after/around the message
posting is for an abort of the Archrunner and Bouncerunner before the disk
filled:

Nov 16 17:06:08 2011 (1533) Master qrunner detected subprocess exit
(pid: 1555, sig: 9, sts: None, class: ArchRunner, slice: 1/1) [restarting]
Nov 16 17:06:14 2011 (27568) ArchRunner qrunner started.
Nov 16 17:06:18 2011 (1533) Master qrunner detected subprocess exit
(pid: 1556, sig: 9, sts: None, class: BounceRunner, slice: 1/1) [restarting]
Nov 16 17:06:20 2011 (27573) BounceRunner qrunner started.


Since the [somelist]-news list in question doesn't have archiving enabled
(never has), so I the ArchRunner restart. There were no other qrunner
restarts since 6:33 that morning, and the only later ones were when admin
staff went in to clear things up.

So chronologically:

- 16:41 first bounce occurs and is written to bounce-events by PID 1556
- 16:52 PID 1556 makes (according to the file system) its last write to
bounce-events-1556.pck
- 17:02 ArchRunner and BounceRunner get restart signals, PID 27573 is the
new BounceRunner
- 17:43 the disk fills to the point postfix starts complaining


... that doesn't add up unless one presumes PID 1556 hung around and kept
writing bounce-events-1556.pck from 17:02 to 17:43 (and probably beyond).

Does the above add up to anything for anyone else? Is there something else
that should be checked?

Thanks for any insight.

--
 Richard Haas rh...@rhaas.us
 GnuPG public key ID: 1CB7F0E2
--
--
Mailman-Users mailing list Mailman-Users@python.org
http://mail.python.org/mailman/listinfo/mailman-users
Mailman FAQ: http://wiki.list.org/x/AgA3
Security Policy: http://wiki.list.org/x/QIA9
Searchable Archives: http://www.mail-archive.com/mailman-users%40python.org/
Unsubscribe: 
http://mail.python.org/mailman/options/mailman-users/archive%40jab.org


Re: [Mailman-Users] stuck bounce-events write

2011-11-21 Thread Mark Sapiro
Richard Haas wrote:

Last week, we ran into a rather odd event (first time we've seen it in a
decade or so of Mailman ops), where a bounced message kept writing to its
/var/mailman/data/bounce-events-PID.pck file, until it filled the disk [...]
The first bounce seemed normal:

Delivery failed for the following reason:
Server entsrmbx01.teledyne.com[74.201.16.49] failed with: 550 #5.1.0
Address rejected someone@teledyne.com
This has been a permanent failure.  No further delivery attempts will be
made.


... and then the first bounce was written to the
/var/mailman/data/bounce-events pck file at:

Received from localhost (localhost [127.0.0.1])
   by [myserver] (Postfix) with ESMTP id 3272D601E21
   for [somelist]-news-bounces@myserver; Wed, 16 Nov 2011 16:41:55
-0500 (EST)


... but was written another 5000 times with it's PDF attachment.

... at 17:43 local, the disk filled with:

Nov 16 17:43:13 [hostname] postfix/smtpd[27239]: NOQUEUE: reject: MAIL from
localhost[127.0.0.1]: 452 4.3.1 Insufficient system storage; proto=ESMTP
helo=localhost
Nov 16 17:43:13 [hostname] postfix/smtpd[27239]: warning: not enough free
space in mail queue: 36216832 bytes  1.5*message size limit


So is it Postfix that is repeatedly delivering this bounce message to
Mailman, and Mailman is just trying to deal with it? If so, the
underlying issue is with Postfix, not Mailman.


Oddly, the bounce-events file's last write was considerably earlier:

-rw-rw-r-- 1 mailman mailman 37867700224 2011-11-16 16:52
bounce-events-01556.pck


This is an orphaned bounce-events file that was left behind when
BounceRunner PID 1556 died - see below.


The only questionable log entries I can find after/around the message
posting is for an abort of the Archrunner and Bouncerunner before the disk
filled:

Nov 16 17:06:08 2011 (1533) Master qrunner detected subprocess exit
(pid: 1555, sig: 9, sts: None, class: ArchRunner, slice: 1/1) [restarting]
Nov 16 17:06:14 2011 (27568) ArchRunner qrunner started.
Nov 16 17:06:18 2011 (1533) Master qrunner detected subprocess exit
(pid: 1556, sig: 9, sts: None, class: BounceRunner, slice: 1/1) [restarting]
Nov 16 17:06:20 2011 (27573) BounceRunner qrunner started.


BounceRunner PID 1556 died at 17:06 possibly while processing the
bounces in qfiles/bounces or while processing the bounce-events file.
This is within the normal 15 minute window for processing
bounce-events.

Where there any error log entries associated with the deaths of
BounceRunner or ArchRunner?


Since the [somelist]-news list in question doesn't have archiving enabled
(never has), so I the ArchRunner restart. There were no other qrunner
restarts since 6:33 that morning, and the only later ones were when admin
staff went in to clear things up.

So chronologically:

- 16:41 first bounce occurs and is written to bounce-events by PID 1556
- 16:52 PID 1556 makes (according to the file system) its last write to
bounce-events-1556.pck
- 17:02 ArchRunner and BounceRunner get restart signals, PID 27573 is the
new BounceRunner
- 17:43 the disk fills to the point postfix starts complaining


... that doesn't add up unless one presumes PID 1556 hung around and kept
writing bounce-events-1556.pck from 17:02 to 17:43 (and probably beyond).


PID 1556 was gone at 17:06:18. It is not clear whether the repeated
bounce events were due to BounceRunner in some kind of loop or
BounceRunner actually receiving bounces from Postfix.

Are there Postfix log entries for the delivery of 5000 bounces to
Mailman? If so, it would seem Postfix is the culprit. If not, then it
would seem to be BounceRunner. I don't have a clue as to why
BounceRunner would do this.

-- 
Mark Sapiro m...@msapiro.netThe highway is for gamblers,
San Francisco Bay Area, Californiabetter use your sense - B. Dylan

--
Mailman-Users mailing list Mailman-Users@python.org
http://mail.python.org/mailman/listinfo/mailman-users
Mailman FAQ: http://wiki.list.org/x/AgA3
Security Policy: http://wiki.list.org/x/QIA9
Searchable Archives: http://www.mail-archive.com/mailman-users%40python.org/
Unsubscribe: 
http://mail.python.org/mailman/options/mailman-users/archive%40jab.org