Re: Odd behaviour under load.
Charles Gregory a écrit : On Fri, 8 May 2009, Mark wrote: Okay, working from the idea that indeed the connecting client is timing out waiting for the 250 OK after sending the message, I would think DNS lookups are the most costly, time-wise. So, I would examine the RBL lookups first: it only takes the presence of one or two defunct DNSBL lookups, or an unruly 'rbl_timeout' setting, to muck things up. I run the 'standard' set of RBL lookups for SA. Though it's worth mentioning that my front end runs a half-dozen 'high trust' RBL's while processing the envelope RCPT command. So by the time DATA is accepted, those should be cached locally, does your front end parse the received headers and check the IPs found there against DNSBLs? if not, then caching will only work for the IP that sent the message to your front end. and not add to the time that SA is running. I think maybe I should look for a convenient way to limit the RBL checks to the 'most popular' (most reliable, good hits) RBL's you may want a two-step SA: - during smtp time, run a lightweight SA (minimize latency and minimize FPs, at the cost of increasing the ratio of FNs) - after mail has been accepted, pass it through a normal SA configuration
RE: Odd behaviour under load.
-Original Message- From: Charles Gregory [mailto:cgreg...@hwcn.org] Sent: donderdag 7 mei 2009 21:58 To: SpamAssassin Users List Subject: Odd behaviour under load. Do yahoo and python.org enforce a shorter time-out? Highly doubtful. RFC 2821, Section 4.5.3.2 (Timeouts) gives you a 2 minutes window while awaiting the 354 Start Input reply to a DATA command. But please note that timeouts in this Section are a SHOULD, not a MUST. Or does some characteristic of list mail make it take longer to process in sa? Headers are part of the DATA stream. Hence, at the time a connecting server is awaiting your 354 Start Input reply to their DATA command, the whole notion of a (possible) mailing-list is simply not within scope yet (as the headers have not been received yet). At the start of DATA, you've basically just concluded the envelope-stage of the SMTP dialogue (MAIL FROM, RCPT TO, etc). So, whatever characteristic of list mail *might* make things taking longer to process in SA, it's simply not yet relevant here. If you're doing Milter-type SPF-lookups right after MAIL FROM, for instance, you'd have a 5 minutes timeout window. Same after RCPT TO. But since both those actions are completed by the time the connecting server issues its DATA command, there's basically no valid reason for your server to wait unusually long to respond with its 354 reply. Not unless it's very broken, that is. At first I didn't associate this behaviour with SA because the duplicates were being sent by the originating system, not from within mine. I wasn't even certain that the fault was mine, given the way it mostly seemed to occur with certain list servers. The mail was getting delivered to my local recipient, but the sending system was resending a few minutes later (and would do so three or four times). It's quite possible for a connecting mail server to retry after it received, say, one of the many 4.x.x codes. However, what's definitely broken in your setup is that, in such a scenario, apparently your LDA already delivers (part of an) aborted mail. That seems too strange for words; so, likely, you're not sending out proper reply codes to connecting mail servers. So, the solution is simple: fix your mail server; or have someone do it for you. - Mark
RE: Odd behaviour under load.
On Fri, 8 May 2009, Mark wrote: From: Charles Gregory [mailto:cgreg...@hwcn.org] Do yahoo and python.org enforce a shorter time-out? Highly doubtful. RFC 2821, Section 4.5.3.2 (Timeouts) gives you a 2 minutes window while awaiting the 354 Start Input reply to a DATA command. ...are you sure that's the timeout he's referring to? How does SA come into play **before** the sending MTA sends the message body? I suspect the sender is timing out waiting for the 250 OK after sending the message, hence my (humorous) 100 Please hold... suggestion. (Jeeze, SM, lighten up!) Per RFC2821, the timeout for the 250 OK is ten minutes, but again, those are SHOULDs. And I can see a busy list server reducing those timeouts a bit - they are sending a _lot_ of mail. The mail was getting delivered to my local recipient, but the sending system was resending a few minutes later (and would do so three or four times). It's quite possible for a connecting mail server to retry after it received, say, one of the many 4.x.x codes. However, what's definitely broken in your setup is that, in such a scenario, apparently your LDA already delivers (part of an) aborted mail. That seems too strange for words; so, likely, you're not sending out proper reply codes to connecting mail servers. So, the solution is simple: fix your mail server; or have someone do it for you. More likely the client is disconnecting after the timeout and _assuming_ a 4xx state, and requeueing the message. If there's any problem with Charles' server it's that it continues delivery after the sending client has dropped the connection. I wouldn't really see that as a problem, though, as I'd rather get multiple deliveries in that case than have both sides make assumptions the other direction and _lose_ the message. Charles, here's something you could try: using telnet, connect to your MTA and pretend you're the yahoo listserv. Feed in one of the problematic messages, and time how long it takes from sending the final period to receiving the 250 OK. That will give you something to compare to how long it takes to scan the message outside of your MTA envrionment. -- John Hardin KA7OHZhttp://www.impsec.org/~jhardin/ jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- Homeland Security: Specializing in Tactical Band-aids for Strategic Problems. -- Eric K. in Bruce Schneier's blog --- Today: the 64th anniversary of VE day
Re: Odd behaviour under load.
Mark wrote: Do yahoo and python.org enforce a shorter time-out? Highly doubtful. RFC 2821, Section 4.5.3.2 (Timeouts) gives you a 2 minutes window while awaiting the 354 Start Input reply to a DATA command. Wich is of course irrelevant since SpamAssassin must be called after the client sends the data, not while the client if waiting for permission to send it. The relevant timeout is the after the client sends CRLF.CRLF to end the data and then waits for 250 or an error. RFC2821 4.5.3.2 specifies that this timeout SHOULD be 10 minutes. But please note that timeouts in this Section are a SHOULD, not a MUST. Exactly. But one should remember that RFC2821 has this to say about SHOULD: 2.2.2: ---8--- there may exist valid reasons in particular circumstances to ignore a particular item, but the full implications must be understood and carefully weighed before choosing a different course. ---8--- I'm not convinced all mail list operators satisfies that (see below). While I have no idea what Yahoo or python.org has set those timeouts to, I have seen mail from listservers with as low as a 1 (one) minute timeout. (The rationale for one such system was that they needed to empty their mail queue as fast as possible. When told that the low timeout in reality might make them empty their queue slower rather than faster (because their system had to retry mail that would otherwise get out on the first try), they simply didnät listen.) Or does some characteristic of list mail make it take longer to process in sa? Headers are part of the DATA stream. Hence, at the time a connecting server is awaiting your 354 Start Input reply to their DATA command, the whole notion of a (possible) mailing-list is simply not within scope yet (as the headers have not been received yet). But... SpamAssassin should not be called before the DATA is sent. SpamAssassin is a content filter that scans the data (including the headers). It's quite possible for a connecting mail server to retry after it received, say, one of the many 4.x.x codes. However, what's definitely broken in your setup is that, in such a scenario, apparently your LDA already delivers (part of an) aborted mail. The LDA delivers what it gets, and really can't know ether the mail is complete or not. It's the responibility of the server wether a non-complete mail should be discarded or send to the LDA. Once the client has sent data end (CRLF.CRLF) and the server has recieved it, it's normal (and correct) for the server to deliver the mail and then send a relevant response to the client. If the client disconnects (due to a timeout or other problem) after sending CRLF.CRLF but before getting a response from the server, this tend to result in the server delivering the message and the client retrying it. RFS2821 spoecifically mentions this. 4.5.3.2: ---8--- When the receiver gets the final period terminating the message data, it typically performs processing to deliver the message to a user mailbox. A spurious timeout at this point would be very wasteful and would typically result in delivery of multiple copies of the message, since it has been successfully sent and the server has accepted responsibility for delivery. ---8--- 6.1: ---8--- To avoid receiving duplicate messages as the result of timeouts, a receiver-SMTP MUST seek to minimize the time required to respond to the final CRLF.CRLF end of data indicator. ---8--- The resoning behind this behaviour is that the server doesn't know wether the client received the 250 response or not. If the client did receive it, the server MUST take responisibilty for delivering the message (or notifying the sender of a failure if it can't deliver it). So the server takes the safe way of assuming the client did receive the positive response. Similarly, since the client doesn't know wether the server managed to deliver the message or not, the client MUST take responsibility for retrying the message (or notifying the sender of a failure). The basic idea is that it's more important to not loose mail than to allways avoid duplicates. That seems too strange for words; so, likely, you're not sending out proper reply codes to connecting mail servers. Quite the opposite. It seems perfectly normal. His servers are most likely sending the correct responses to the client, but the client disconnects due to a timeout before getting the response and therefore doesn't know that the mail was eventually accepted for delivery. So, the solution is simple: fix your mail server; or have someone do it for you. There's two parts to the solution. Part One: Fix his mail server so that it does take overly long to respond to data end (CRLF.CRLF). This is a MUST. It is, of course, quiote possible that a normally perfectly good mail server is temporarily overloaded, has disk problems, network problems, or for some other temporary reason takes a long time to respond. This
RE: Odd behaviour under load.
-Original Message- From: John Hardin [mailto:jhar...@impsec.org] Sent: vrijdag 8 mei 2009 15:52 To: Mark Cc: users@spamassassin.apache.org Subject: RE: Odd behaviour under load. On Fri, 8 May 2009, Mark wrote: From: Charles Gregory [mailto:cgreg...@hwcn.org] Do yahoo and python.org enforce a shorter time-out? Highly doubtful. RFC 2821, Section 4.5.3.2 (Timeouts) gives you a 2 minutes window while awaiting the 354 Start Input reply to a DATA command. are you sure that's the timeout he's referring to? I suspect the sender is timing out waiting for the 250 OK after sending the message, hence my (humorous) 100 Please hold... suggestion. (Jeeze, SM, lighten up!) I suspect that's indeed what he meant. I guess I took his So obviously some systems were 'timing out' waiting for my server to respond to the DATA command. a bit too literal. How does SA come into play **before** the sending MTA sends the message body? It doesn't. :) That's why it did make no sense to me. More likely the client is disconnecting after the timeout and _assuming_ a 4xx state, and requeueing the message If there's any problem with Charles' server it's that it continues delivery after the sending client has dropped the connection. I wouldn't really see that as a problem, though, as I'd rather get multiple deliveries in that case than have both sides make assumptions the other direction and _lose_ the message. Well, _assuming_ a 4xx state is an assumption, too. :) And not one I like per se, to be honest (though one could make a good case for its rationale, saying an unexpected disconnect is precisely the kind of 'transient' error-state the 4.x.x codes are meant to cover). But resending a message, over and over, without the receiving server indicating such need, well, that could here and there raise an eyebrow or two as well. Okay, working from the idea that indeed the connecting client is timing out waiting for the 250 OK after sending the message, I would think DNS lookups are the most costly, time-wise. So, I would examine the RBL lookups first: it only takes the presence of one or two defunct DNSBL lookups, or an unruly 'rbl_timeout' setting, to muck things up. - Mark
RE: Odd behaviour under load.
Hi John, At 06:50 08-05-2009, John Hardin wrote: I suspect the sender is timing out waiting for the 250 OK after sending the message, hence my (humorous) 100 Please hold... suggestion. (Jeeze, SM, lighten up!) There has already been such a proposal. Someone might take your humorous suggestion seriously, hence my comment. Regards, -sm
Re: [sa] RE: Odd behaviour under load.
On Fri, 8 May 2009, Mark wrote: Headers are part of the DATA stream. Hence, at the time a connecting server is awaiting your 354 Start Input reply to their DATA command My apologies. I have misled with the phrase 'data command'. I was referring to the response that the sending server expects to see after it has finished sending data. - C
RE: Odd behaviour under load.
On Fri, 8 May 2009, John Hardin wrote: I suspect the sender is timing out waiting for the 250 OK after sending the message, hence my (humorous) 100 Please hold... suggestion. (Jeeze, SM, lighten up!) (nod) I should not have said data command. Apologies again. And I can see a busy list server reducing those timeouts a bit - they are sending a _lot_ of mail. That was my guess. Just wondering if anyone *knew* for certain. I suppose I could always test it, but as I said, I figured out how to stop the problem. So another time, perhaps Charles, here's something you could try: using telnet, connect to your MTA and pretend you're the yahoo listserv. Feed in one of the problematic messages, and time how long it takes from sending the final period to receiving the 250 OK. That will give you something to compare to how long it takes to scan the message outside of your MTA envrionment. Actually, my logs are sufficiently verbose that I can check the times it takes from DATA until the message is handed off to my LDA. I didn't make a note of exact timing, but it wasn't 10 minutes. More like 4 or 5 (though don't quote me on that - this is 'resolved' so I'm too busy to double-check those logs). As a side note, for people wondering why my SA is so slow, I thought I would mention that my SMTP front end (Mail Avenger) has a bug that prevents me from properly using 'spamc', so all load issues are compounded buy having to use 'spamassassin'. Fortunately, without those 'full' rules, it handles things quite well - Charles
RE: Odd behaviour under load.
On Fri, 8 May 2009, Mark wrote: Okay, working from the idea that indeed the connecting client is timing out waiting for the 250 OK after sending the message, I would think DNS lookups are the most costly, time-wise. So, I would examine the RBL lookups first: it only takes the presence of one or two defunct DNSBL lookups, or an unruly 'rbl_timeout' setting, to muck things up. I run the 'standard' set of RBL lookups for SA. Though it's worth mentioning that my front end runs a half-dozen 'high trust' RBL's while processing the envelope RCPT command. So by the time DATA is accepted, those should be cached locally, and not add to the time that SA is running. I think maybe I should look for a convenient way to limit the RBL checks to the 'most popular' (most reliable, good hits) RBL's - C
RE: Odd behaviour under load.
On Fri, 8 May 2009, Charles Gregory wrote: As a side note, for people wondering why my SA is so slow, I thought I would mention that my SMTP front end (Mail Avenger) has a bug that prevents me from properly using 'spamc', so all load issues are compounded buy having to use 'spamassassin'. Fortunately, without those 'full' rules, it handles things quite well That bites. You can probably work around it, though, by playing some PATH games and getting Mail Avenger to see a shell script named spamassassin (that actually runs spamc) before seeing the real spamassassin program. -- John Hardin KA7OHZhttp://www.impsec.org/~jhardin/ jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- It is not the place of government to make right every tragedy and woe that befalls every resident of the nation. --- Today: the 64th anniversary of VE day
RE: Odd behaviour under load.
On Fri, 8 May 2009, John Hardin wrote: ... my SMTP front end (Mail Avenger) has a bug that prevents me from properly using 'spamc' You can probably work around it, though, by playing some PATH games and getting Mail Avenger to see a shell script named spamassassin (that actually runs spamc) before seeing the real spamassassin program. The issue isn't getting spamc to run. The problem lays in the fact that mail avenger uses a wrapper named 'edinplace' that keeps the message in its original standard input pipe, and enables spamassassin's modified output message to replace that standard input. edinplace has a problem with 'rewinding' the pipe if it has been fed to 'spamc' but not if it is fed to 'spamassassin'. I have written to the mail avenger author to address this issue. - Charles
Odd behaviour under load.
Hallo! Just wanted to throw in an observation on my system's behaviour with spamassassin 'overloaded' Not really a complaint, as I know what I did 'wrong'. But curious about one of the effects During the recent run of image spams, I tried a couple of different pieces of code that involved 'full' tests. I knew they would slow things down a bit, but they had an unexpected side-effect: Various people on my system, including myself, started to receive duplicates for some mail. Particularly *list* mails like the sa list, or yahoo groups. At first I didn't associate this behaviour with SA because the duplicates were being sent by the originating system, not from within mine. I wasn't even certain that the fault was mine, given the way it mostly seemed to occur with certain list servers. The mail was getting delivered to my local recipient, but the sending system was resending a few minutes later (and would do so three or four times). Yes, I have my spamassassin in my SMTP front end, so I figured out that the timing of the appearance of these duplicates coincided with my adding those 'full' tests, and when I removed them, the duplicates stopped. So obviously some systems were 'timing out' waiting for my server to respond to the DATA command. But I am slightly curious as to why this only happens with mailing lists... Do yahoo and python.org enforce a shorter time-out? Or does some characteristic of list mail make it take longer to process in sa? - Charles
Re: Odd behaviour under load.
On Thu, 7 May 2009, Charles Gregory wrote: So obviously some systems were 'timing out' waiting for my server to respond to the DATA command. Heh. Does the SMTP protocol need a 100 Please hold... reply? -- John Hardin KA7OHZhttp://www.impsec.org/~jhardin/ jhar...@impsec.orgFALaholic #11174 pgpk -a jhar...@impsec.org key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79 --- Gun Control laws cannot reduce violent crime, because gun control laws assume a violent criminal will obey the law. --- Tomorrow: the 64th anniversary of VE day
Re: Odd behaviour under load.
At 13:15 07-05-2009, John Hardin wrote: Heh. Does the SMTP protocol need a 100 Please hold... reply? No. Fix the mail server instead of the protocol. Regards, -sm