Re: [Mailman-Users] stuck bounce-events write
On Tue, Nov 22, 2011 at 2:54 AM, Mark Sapiro m...@msapiro.net wrote: Richard Haas wrote: Last week, we ran into a rather odd event (first time we've seen it in a decade or so of Mailman ops), where a bounced message kept writing to its /var/mailman/data/bounce-events-PID.pck file, until it filled the disk [...] The first bounce seemed normal: Delivery failed for the following reason: Server entsrmbx01.teledyne.com[74.201.16.49] failed with: 550 #5.1.0 Address rejected someone@teledyne.com This has been a permanent failure. No further delivery attempts will be made. ... and then the first bounce was written to the /var/mailman/data/bounce-events pck file at: Received from localhost (localhost [127.0.0.1]) by [myserver] (Postfix) with ESMTP id 3272D601E21 for [somelist]-news-bounces@myserver; Wed, 16 Nov 2011 16:41:55 -0500 (EST) ... but was written another 5000 times with it's PDF attachment. ... at 17:43 local, the disk filled with: Nov 16 17:43:13 [hostname] postfix/smtpd[27239]: NOQUEUE: reject: MAIL from localhost[127.0.0.1]: 452 4.3.1 Insufficient system storage; proto=ESMTP helo=localhost Nov 16 17:43:13 [hostname] postfix/smtpd[27239]: warning: not enough free space in mail queue: 36216832 bytes 1.5*message size limit So is it Postfix that is repeatedly delivering this bounce message to Mailman, and Mailman is just trying to deal with it? If so, the underlying issue is with Postfix, not Mailman. There's nothing in mail.log for 127.0.0.1 deliveries related to this list, during the timeframe, except clean passes through amavis. Certainly no 5000 redeliveries. There is normal traffic sending the message outbound to the 130 or so list members, with about 20 bounces. The original address that led to the bounce was accepted outbound for delivery, and then came back about 17 seconds later, after remote translation of someoneelse@someresearch.com to someone@telecdyne.com ... but cam back only once at: Nov 16 16:41:55 myhost amavis[26807]: (26807-14) Passed CLEAN, [66.18.18.67] [66.18.18.67] - mylist-news-boun...@my.org, Message-ID: 2016214143.10704.6...@hm-mx2b.solinus.com, mail_id: PvH5EnpUQiLg, Hits: -, size: 7808036, queued_as: 3272D601E21, 677 ms ... which was when the first bounce was written to the bounce-events file. Anyhow, it doesn't look like Postfix processed the bounce 5000 times, and was repeatedly handing it back to BounceRunner. Oddly, the bounce-events file's last write was considerably earlier: -rw-rw-r-- 1 mailman mailman 37867700224 2011-11-16 16:52 bounce-events-01556.pck This is an orphaned bounce-events file that was left behind when BounceRunner PID 1556 died - see below. The only questionable log entries I can find after/around the message posting is for an abort of the Archrunner and Bouncerunner before the disk filled: Nov 16 17:06:08 2011 (1533) Master qrunner detected subprocess exit (pid: 1555, sig: 9, sts: None, class: ArchRunner, slice: 1/1) [restarting] Nov 16 17:06:14 2011 (27568) ArchRunner qrunner started. Nov 16 17:06:18 2011 (1533) Master qrunner detected subprocess exit (pid: 1556, sig: 9, sts: None, class: BounceRunner, slice: 1/1) [restarting] Nov 16 17:06:20 2011 (27573) BounceRunner qrunner started. BounceRunner PID 1556 died at 17:06 possibly while processing the bounces in qfiles/bounces or while processing the bounce-events file. This is within the normal 15 minute window for processing bounce-events. Where there any error log entries associated with the deaths of BounceRunner or ArchRunner? No, other than the sig 9(s) above, there's nothing in the qrunner logs since 6:33 that morning, and afterwards (18:43) when admins did a halt on mailman and postfix before moving the 37 GB bounce-events-01556.pck file to another filesystem. Since the [somelist]-news list in question doesn't have archiving enabled (never has), so I the ArchRunner restart. There were no other qrunner restarts since 6:33 that morning, and the only later ones were when admin staff went in to clear things up. So chronologically: - 16:41 first bounce occurs and is written to bounce-events by PID 1556 - 16:52 PID 1556 makes (according to the file system) its last write to bounce-events-1556.pck - 17:02 ArchRunner and BounceRunner get restart signals, PID 27573 is the new BounceRunner - 17:43 the disk fills to the point postfix starts complaining ... that doesn't add up unless one presumes PID 1556 hung around and kept writing bounce-events-1556.pck from 17:02 to 17:43 (and probably beyond). PID 1556 was gone at 17:06:18. It is not clear whether the repeated bounce events were due to BounceRunner in some kind of loop or BounceRunner actually receiving bounces from Postfix. Since there's nothing to indicate postfix was repeatedly passing the bounce, I think we're stuck looking at the BounceRunner. Are there Postfix log entries for the delivery of 5000
Re: [Mailman-Users] stuck bounce-events write
Hi Richard Haas On 2011-11-22 15:14, Richard Haas wrote: ... The original address that led to the bounce was accepted outbound for delivery, and then came back about 17 seconds later, after remote translation of someoneelse@someresearch.com to someone@telecdyne.com ... but cam back only once at: Nov 16 16:41:55 myhost amavis[26807]: (26807-14) Passed CLEAN, [66.18.18.67] [66.18.18.67] - mylist-news-boun...@my.org, Message-ID: 2016214143.10704.6...@hm-mx2b.solinus.com, mail_id: PvH5EnpUQiLg, Hits: -, size: 7808036, queued_as: 3272D601E21, 677 ms cut What do you get for the queued Message 3272D601E21 in your postfix.log? Kind regards, Christian Mack -- Mailman-Users mailing list Mailman-Users@python.org http://mail.python.org/mailman/listinfo/mailman-users Mailman FAQ: http://wiki.list.org/x/AgA3 Security Policy: http://wiki.list.org/x/QIA9 Searchable Archives: http://www.mail-archive.com/mailman-users%40python.org/ Unsubscribe: http://mail.python.org/mailman/options/mailman-users/archive%40jab.org
[Mailman-Users] stuck bounce-events write
Greetings. Last week, we ran into a rather odd event (first time we've seen it in a decade or so of Mailman ops), where a bounced message kept writing to its /var/mailman/data/bounce-events-PID.pck file, until it filled the disk -- it wasn't a series of bounces (same delivery times on every received header), it was the same bounce written over and over again (about 5000 times), unfortunately, that bounce contained a 7 MB base64 encoded attachment. I thought I'd toss the details out here, to see if anyone had a thought on what caused this, or (other than implementing a disk watching cron job, which we already have), how to avoid it in the future ... though it certainly seems like a fluke. Anyhow, the message was first sent to the list at 16:41:35 (after manual list owner approval of the moderated posting), and was posted normally at: /var/mailman/logs/post:Nov 16 16:41:35 2011 (1560) post to [somelist]-news from [someone]@my.org size=7705563, message-id=AB96A557B3C21F4298987419BF86FA1B672214168A@[my domain]v, success The message had a PDF attachment, and was from the list owner. The list had about 120 members, and most of the members (there were some other undeliverable addresses) got the message. The list owner got the message. There appeared to be nothing bogus about the attachment. The first bounce seemed normal: Delivery failed for the following reason: Server entsrmbx01.teledyne.com[74.201.16.49] failed with: 550 #5.1.0 Address rejected someone@teledyne.com This has been a permanent failure. No further delivery attempts will be made. ... and then the first bounce was written to the /var/mailman/data/bounce-events pck file at: Received from localhost (localhost [127.0.0.1]) by [myserver] (Postfix) with ESMTP id 3272D601E21 for [somelist]-news-bounces@myserver; Wed, 16 Nov 2011 16:41:55 -0500 (EST) ... but was written another 5000 times with it's PDF attachment. ... at 17:43 local, the disk filled with: Nov 16 17:43:13 [hostname] postfix/smtpd[27239]: NOQUEUE: reject: MAIL from localhost[127.0.0.1]: 452 4.3.1 Insufficient system storage; proto=ESMTP helo=localhost Nov 16 17:43:13 [hostname] postfix/smtpd[27239]: warning: not enough free space in mail queue: 36216832 bytes 1.5*message size limit Oddly, the bounce-events file's last write was considerably earlier: -rw-rw-r-- 1 mailman mailman 37867700224 2011-11-16 16:52 bounce-events-01556.pck The only questionable log entries I can find after/around the message posting is for an abort of the Archrunner and Bouncerunner before the disk filled: Nov 16 17:06:08 2011 (1533) Master qrunner detected subprocess exit (pid: 1555, sig: 9, sts: None, class: ArchRunner, slice: 1/1) [restarting] Nov 16 17:06:14 2011 (27568) ArchRunner qrunner started. Nov 16 17:06:18 2011 (1533) Master qrunner detected subprocess exit (pid: 1556, sig: 9, sts: None, class: BounceRunner, slice: 1/1) [restarting] Nov 16 17:06:20 2011 (27573) BounceRunner qrunner started. Since the [somelist]-news list in question doesn't have archiving enabled (never has), so I the ArchRunner restart. There were no other qrunner restarts since 6:33 that morning, and the only later ones were when admin staff went in to clear things up. So chronologically: - 16:41 first bounce occurs and is written to bounce-events by PID 1556 - 16:52 PID 1556 makes (according to the file system) its last write to bounce-events-1556.pck - 17:02 ArchRunner and BounceRunner get restart signals, PID 27573 is the new BounceRunner - 17:43 the disk fills to the point postfix starts complaining ... that doesn't add up unless one presumes PID 1556 hung around and kept writing bounce-events-1556.pck from 17:02 to 17:43 (and probably beyond). Does the above add up to anything for anyone else? Is there something else that should be checked? Thanks for any insight. -- Richard Haas rh...@rhaas.us GnuPG public key ID: 1CB7F0E2 -- -- Mailman-Users mailing list Mailman-Users@python.org http://mail.python.org/mailman/listinfo/mailman-users Mailman FAQ: http://wiki.list.org/x/AgA3 Security Policy: http://wiki.list.org/x/QIA9 Searchable Archives: http://www.mail-archive.com/mailman-users%40python.org/ Unsubscribe: http://mail.python.org/mailman/options/mailman-users/archive%40jab.org
Re: [Mailman-Users] stuck bounce-events write
Richard Haas wrote: Last week, we ran into a rather odd event (first time we've seen it in a decade or so of Mailman ops), where a bounced message kept writing to its /var/mailman/data/bounce-events-PID.pck file, until it filled the disk [...] The first bounce seemed normal: Delivery failed for the following reason: Server entsrmbx01.teledyne.com[74.201.16.49] failed with: 550 #5.1.0 Address rejected someone@teledyne.com This has been a permanent failure. No further delivery attempts will be made. ... and then the first bounce was written to the /var/mailman/data/bounce-events pck file at: Received from localhost (localhost [127.0.0.1]) by [myserver] (Postfix) with ESMTP id 3272D601E21 for [somelist]-news-bounces@myserver; Wed, 16 Nov 2011 16:41:55 -0500 (EST) ... but was written another 5000 times with it's PDF attachment. ... at 17:43 local, the disk filled with: Nov 16 17:43:13 [hostname] postfix/smtpd[27239]: NOQUEUE: reject: MAIL from localhost[127.0.0.1]: 452 4.3.1 Insufficient system storage; proto=ESMTP helo=localhost Nov 16 17:43:13 [hostname] postfix/smtpd[27239]: warning: not enough free space in mail queue: 36216832 bytes 1.5*message size limit So is it Postfix that is repeatedly delivering this bounce message to Mailman, and Mailman is just trying to deal with it? If so, the underlying issue is with Postfix, not Mailman. Oddly, the bounce-events file's last write was considerably earlier: -rw-rw-r-- 1 mailman mailman 37867700224 2011-11-16 16:52 bounce-events-01556.pck This is an orphaned bounce-events file that was left behind when BounceRunner PID 1556 died - see below. The only questionable log entries I can find after/around the message posting is for an abort of the Archrunner and Bouncerunner before the disk filled: Nov 16 17:06:08 2011 (1533) Master qrunner detected subprocess exit (pid: 1555, sig: 9, sts: None, class: ArchRunner, slice: 1/1) [restarting] Nov 16 17:06:14 2011 (27568) ArchRunner qrunner started. Nov 16 17:06:18 2011 (1533) Master qrunner detected subprocess exit (pid: 1556, sig: 9, sts: None, class: BounceRunner, slice: 1/1) [restarting] Nov 16 17:06:20 2011 (27573) BounceRunner qrunner started. BounceRunner PID 1556 died at 17:06 possibly while processing the bounces in qfiles/bounces or while processing the bounce-events file. This is within the normal 15 minute window for processing bounce-events. Where there any error log entries associated with the deaths of BounceRunner or ArchRunner? Since the [somelist]-news list in question doesn't have archiving enabled (never has), so I the ArchRunner restart. There were no other qrunner restarts since 6:33 that morning, and the only later ones were when admin staff went in to clear things up. So chronologically: - 16:41 first bounce occurs and is written to bounce-events by PID 1556 - 16:52 PID 1556 makes (according to the file system) its last write to bounce-events-1556.pck - 17:02 ArchRunner and BounceRunner get restart signals, PID 27573 is the new BounceRunner - 17:43 the disk fills to the point postfix starts complaining ... that doesn't add up unless one presumes PID 1556 hung around and kept writing bounce-events-1556.pck from 17:02 to 17:43 (and probably beyond). PID 1556 was gone at 17:06:18. It is not clear whether the repeated bounce events were due to BounceRunner in some kind of loop or BounceRunner actually receiving bounces from Postfix. Are there Postfix log entries for the delivery of 5000 bounces to Mailman? If so, it would seem Postfix is the culprit. If not, then it would seem to be BounceRunner. I don't have a clue as to why BounceRunner would do this. -- Mark Sapiro m...@msapiro.netThe highway is for gamblers, San Francisco Bay Area, Californiabetter use your sense - B. Dylan -- Mailman-Users mailing list Mailman-Users@python.org http://mail.python.org/mailman/listinfo/mailman-users Mailman FAQ: http://wiki.list.org/x/AgA3 Security Policy: http://wiki.list.org/x/QIA9 Searchable Archives: http://www.mail-archive.com/mailman-users%40python.org/ Unsubscribe: http://mail.python.org/mailman/options/mailman-users/archive%40jab.org