Re: Huge active queue and system idle, not delivering
Patrick Chemla:
> Wietse:
> > OK, so you can turn back on that connection caching. Note that qmail creates and destroys two processes per SMTP session, so reusing a session is also a win from a CPU resource point of view.
>
> If I do so, will postfix open more than one connexion to each qmail for parallel deliveries?

Of course. Connection caching is a performance IMPROVEMENT feature. However, some qmail implementations are patched and turn on TARPIT delays when the client sends many messages or recipients over the same SMTP connection.

Wietse
Re: Huge active queue and system idle, not delivering
Wietse,

> Please try the following, as asked half a week ago:
>
>     postconf -e smtp_connection_cache_on_demand=no
>     postfix reload
>
> and report if this makes a difference.
>
> Wietse

I have been testing this since last night. I ran into the Linux per-user limit on the number of processes and fixed it. I also increased some delivery concurrency settings, and now I can see up to 1300 processes delivering email to the qmail servers.

Today I had a burst of a few minutes at a rate of 6300 emails per minute, and I ran a full hour at 180,000 emails per hour. The outbound line was saturated. CPU is about 30% loaded, with no I/O wait and no swapping, and there is plenty of memory.

I think I will reach about 600,000 emails per hour if I fix some timeouts on the qmail servers (or replace them with Postfix?). Maybe I could reach 1 million?

The full architecture I am planning will include 2 to 3 clustered Postfix relays and 50 second-level qmail (or Postfix) delivery servers, each with 3 to 5 IP addresses, plus an upgraded outbound internet connection.

With your help, I now better understand the impact of the timeout and concurrency parameters. Delivery was blocked because Postfix was trying to reuse connections, and so was waiting for each email to complete before sending the next one. Also, because hundreds of processes were created at start time to handle inbound messages, there were no slots left to fork delivery processes. The same problem caused the very slow DNS and EHLO, because no slots were available to fork.

Of course, if you want me to post my configuration, I will do so with pleasure.

Many thanks to you, to Victor and to Stan.

Patrick
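Editorial sketch: the per-user process limit mentioned in this message can be compared against the intended delivery concurrency before raising it. The snippet below is only an illustration under stated assumptions: it runs on a Linux or BSD host, the helper name `has_headroom` and the reserve of 200 processes are hypothetical, and the limit it reads applies to the account running the script, which is not necessarily the account the Postfix daemons run under.

```python
import resource


def has_headroom(soft_limit, wanted_processes, reserve=200):
    """Return True if wanted_processes plus a safety reserve fits under soft_limit.

    soft_limit is an RLIMIT_NPROC soft limit; resource.RLIM_INFINITY means unlimited.
    The reserve value is an arbitrary illustration, not a Postfix recommendation.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return wanted_processes + reserve <= soft_limit


if __name__ == "__main__":
    # Limit for the current user; check it as the user that runs the mail daemons.
    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    print("soft limit:", soft, "hard limit:", hard)
    # 1300 is the number of delivery processes reported in the message above.
    print("room for 1300 delivery processes:", has_headroom(soft, 1300))
```

If the check fails, the limit has to be raised by whatever mechanism the distribution uses for per-user limits; the script itself changes nothing.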
Re: Huge active queue and system idle, not delivering
Patrick Chemla put forth on 1/10/2010 3:00 PM: Wietse, Please try the following, as asked half a week ago: postconf -e smtp_connection_cache_on_demand=no postfix reload and report if this makes a difference. Wietse I have tested this since yesterday night. I got some problems with Linux per user number of processes limit. I fixed it. I also increased some delivery concurrency figures, and now I can see up to 1300 processes delivering emails to the qmail servers. I had a few minutes shot today at a rate of 6300 emails per minute. I ran a full hour at 180,000 emails per hour. The outbound line was saturated. CPU is about 30% loaded, no Wait I/O, no swap, memory is large. I think I will reach about 600,000 emails per hour if I fix some timeout on the qmails (replace by postfix?). Maybe I could reach 1 million? The full architecture that I plan will include 2 to 3 clustered postfix relays and 50 2nd level qmails(or postfix) delivery servers, each with 3 to 5 IP addresses, and upgraded outbound internet connection. With your help, I better understand now the impact of timeout and concurrency parameters. In fact, delivery was blocked because postfix was trying to reuse connections, so was waiting each email to complete to send the next one. Also, because hundreds processes were created at start time to manage inbound messages, there were no slots to fork processes to deliver messages on the other hand. Same problem caused very slow DNS and EHLO, because no available slots to fork. Of course, if you want me to post my conf, I will with pleasure. Many thanks to you, to Victor and Stan. Patrick On a technical level I'm happy you got it working. Just please tell us you're not sending mass spam with this setup. -- Stan
Re: Huge active queue and system idle, not delivering
Patrick Chemla:
> Wietse,
>
> > Please try the following, as asked half a week ago:
> >
> >     postconf -e smtp_connection_cache_on_demand=no
> >     postfix reload
> >
> > and report if this makes a difference.
>
> I have tested this since yesterday night. I got some problems with Linux per user number of processes limit. I fixed it. I also increased some delivery concurrency figures, and now I can see up to 1300 processes delivering emails to the qmail servers.
>
> I had a few minutes shot today at a rate of 6300 emails per minute. I ran a full hour at 180,000 emails per hour. The outbound line was saturated. CPU is about 30% loaded, no Wait I/O, no swap, memory is large.
>
> I think I will reach about 600,000 emails per hour if I fix some timeout on the qmails (replace by postfix?). Maybe I could reach 1 million?

OK, so you can turn back on that connection caching. Note that qmail creates and destroys two processes per SMTP session, so reusing a session is also a win from a CPU resource point of view.

1M/hour, or less than 300/s, should be possible (you mention the queue is on a solid-state disk). Barring brain damage such as synchronous syslogging by default on some Linux boxes, borked DNS, process/file/etc. resource limits, etc.

Perhaps this is a good time to mention that SecurityFocus was ezmlm-qmail based, and that they switched to Postfix for outbound deliveries, because qmail simply could not keep up with the volume:

    inbound mail - qmail - ezmlm - multiple postfix MTAs - internet

That was 2001 when I added QMQP support to Postfix, and this is still what they appear to be using now, if I must believe their own Received: message headers.

    Received: from lists2.securityfocus.com (lists2.securityfocus.com [205.206.231.20])
        by outgoing2.securityfocus.com (Postfix) with QMQP id 8AC0814370A;
        Thu, 7 Jan 2010 14:11:35 -0700 (MST)

My very first qmail/Postfix benchmarks showed that qmail was up to three times slower as a transit MTA, simply because qmail creates three queue files where Postfix creates one. Creating/deleting files involves more disk access operations than reading/writing files, and that hurts especially with small email messages.

Wietse
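Editorial sketch: the rates quoted in this exchange are easier to compare once they are all expressed per second. The figures are the ones from the messages above; the function name `per_second` and the labels are illustrative only.

```python
def per_second(count, seconds):
    """Convert a message count over a period into messages per second."""
    return count / seconds


rates = {
    "6300 per minute (short burst)": per_second(6300, 60),
    "180,000 per hour (sustained)": per_second(180_000, 3600),
    "600,000 per hour (hoped for)": per_second(600_000, 3600),
    "1,000,000 per hour (target)": per_second(1_000_000, 3600),
}

for label, rate in rates.items():
    print(f"{label}: {rate:.1f} msg/s")
```

The target of one million per hour works out to roughly 278 messages per second, which is the "less than 300/s" figure in the reply, and about five and a half times the sustained rate reported so far.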
Re: Huge active queue and system idle, not delivering
On 10/01/2010 23:58, Stan Hoeppner wrote:
> On a technical level I'm happy you got it working. Just please tell us you're not sending mass spam with this setup.
>
> --
> Stan

I am doing this for a customer who sends, as he says, only opt-in mass emails. He keeps a large blacklist database holding every unsubscribe request, and he says he has the right filters in place to avoid sending unwanted email.

Thanks
Patrick
Re: Huge active queue and system idle, not delivering
On 11/01/2010 01:13, Wietse Venema wrote:
> Patrick Chemla:
> > Wietse,
> >
> > > Please try the following, as asked half a week ago:
> > >
> > >     postconf -e smtp_connection_cache_on_demand=no
> > >     postfix reload
> > >
> > > and report if this makes a difference.
> >
> > I have tested this since yesterday night. I got some problems with Linux per user number of processes limit. I fixed it. I also increased some delivery concurrency figures, and now I can see up to 1300 processes delivering emails to the qmail servers.
> >
> > I had a few minutes shot today at a rate of 6300 emails per minute. I ran a full hour at 180,000 emails per hour. The outbound line was saturated. CPU is about 30% loaded, no Wait I/O, no swap, memory is large.
> >
> > I think I will reach about 600,000 emails per hour if I fix some timeout on the qmails (replace by postfix?). Maybe I could reach 1 million?
>
> OK, so you can turn back on that connection caching. Note that qmail creates and destroys two processes per SMTP session, so reusing a session is also a win from a CPU resource point of view.
>
> Wietse

If I do so, will Postfix open more than one connection to each qmail server for parallel deliveries?

I am afraid that using connection caching will create a single queue on each qmail server. As long as I have resources available, I think I prefer parallel deliveries.

Patrick
Re: Huge active queue and system idle, not delivering
Patrick Chemla put forth on 1/11/2010 1:02 AM: Le 10/01/2010 23:58, Stan Hoeppner a écrit : On a technical level I'm happy you got it working. Just please tell us you're not sending mass spam with this setup. -- Stan I have to do it for a customer who send as he said, only opt-in mass emails. He has a big blacklisted email database where he keeps all unsubscribe messages. He said he has the right filters not to send unwanted emails. Sigh... This doesn't pass the sniff test. I fear we've helped enable the sending of mass UBE. Patrick would you mind providing the IP netblock(s) you will be sending these mass mailings from? Or provide them to me off list please? Thanks. -- Stan
Re: Huge active queue and system idle, not delivering
Le 11/01/2010 09:27, Stan Hoeppner a écrit : Patrick Chemla put forth on 1/11/2010 1:02 AM: Le 10/01/2010 23:58, Stan Hoeppner a écrit : On a technical level I'm happy you got it working. Just please tell us you're not sending mass spam with this setup. -- Stan I have to do it for a customer who send as he said, only opt-in mass emails. He has a big blacklisted email database where he keeps all unsubscribe messages. He said he has the right filters not to send unwanted emails. Sigh... This doesn't pass the sniff test. I fear we've helped enable the sending of mass UBE. Patrick would you mind providing the IP netblock(s) you will be sending these mass mailings from? Or provide them to me off list please? Thanks. -- Stan Don't be afraid Stan. They work only on french market, maybe also on french people who have a mailbox overseas. You have very very very low chance to be concerned. Patrick
Re: Huge active queue and system idle, not delivering
Hi,

I will try all your advice, but something is still very strange to me. The Postfix logs show that the EHLO handshake is very slow through Postfix but very fast by hand. I have even recorded the traffic with tcpdump/Wireshark, and I can see that each message is sent very quickly, in about 1 second. Yet messages still go out at a rate of about a dozen every 10 seconds, which means they are being sent one by one.

Whether the connections to the qmail servers are slow, or the qmail servers are misconfigured, too slow, or anything else, when I run

    netstat -apn | grep :25

I get only a few connections from the Postfix server to the qmail servers. Even if DNS+EHLO are slow (and all the more so because DNS+EHLO seem to be slow), why don't I see hundreds of TCP connections in the ESTABLISHED state?

I expected Postfix to deliver to the 30 qmail servers at the same time, managing hundreds of parallel deliveries over hundreds of parallel connections. Is there some parameter or design rule that prevents it from doing so?

I expected that Postfix would either use up its own CPU and memory creating these parallel delivery processes, or wait on the qmail servers, but on all servers at the same time, with multiple connections to each one.

Am I correct? Or am I dreaming of another mail transport package?

Patrick
Re: Huge active queue and system idle, not delivering
Hi all,

I got these statistics:

Jan 9 19:15:21 postfix postfix/scache[18038]: statistics: start interval Jan 9 19:09:03
Jan 9 19:15:21 postfix postfix/scache[18038]: statistics: domain lookup hits=110 miss=89 success=55%
Jan 9 19:15:21 postfix postfix/scache[18038]: statistics: address lookup hits=0 miss=2492 success=0%
Jan 9 19:15:21 postfix postfix/scache[18038]: statistics: max simultaneous domains=1 addresses=4 connection=4

What do miss=89 success=55% and miss=2492 success=0% mean?

Thanks
Patrick
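Editorial sketch: the percentages in these scache(8) statistics lines are consistent with success = hits / (hits + miss), truncated to a whole percent. That formula is an assumption inferred from the logged numbers (110 hits and 89 misses give 55%), not something taken from the Postfix source; the regular expression and function names below are illustrative.

```python
import re

# Matches the "domain lookup" and "address lookup" statistics lines shown above.
STAT_RE = re.compile(r"(domain|address) lookup hits=(\d+) miss=(\d+) success=(\d+)%")


def success_percent(hits, miss):
    """Share of cache lookups that found a reusable connection, as a whole percent."""
    total = hits + miss
    return 0 if total == 0 else 100 * hits // total


def parse_scache_line(line):
    """Return (kind, hits, miss, logged_percent, recomputed_percent), or None."""
    m = STAT_RE.search(line)
    if m is None:
        return None
    kind, hits, miss, logged = m.group(1), int(m.group(2)), int(m.group(3)), int(m.group(4))
    return kind, hits, miss, logged, success_percent(hits, miss)


lines = [
    "postfix/scache[18038]: statistics: domain lookup hits=110 miss=89 success=55%",
    "postfix/scache[18038]: statistics: address lookup hits=0 miss=2492 success=0%",
]
for line in lines:
    print(parse_scache_line(line))
```

Read this way, a "miss" is a request for a cached connection that found none available, so 0% on address lookups means that none of the 2492 lookups by address was served from the cache.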
Re: Huge active queue and system idle, not delivering
Patrick Chemla put forth on 1/9/2010 11:17 AM: Hi all, I got these statistics: Jan 9 19:15:21 postfix postfix/scache[18038]: statistics: start interval Jan 9 19:09:03 Jan 9 19:15:21 postfix postfix/scache[18038]: statistics: domain lookup hits=110 miss=89 success=55% Jan 9 19:15:21 postfix postfix/scache[18038]: statistics: address lookup hits=0 miss=2492 success=0% Jan 9 19:15:21 postfix postfix/scache[18038]: statistics: max simultaneous domains=1 addresses=4 connection=4 What means miss=89 success=55%, miss=2492 success=0%? http://www.postfix.com/CONNECTION_CACHE_README.html -- Stan
Re: Huge active queue and system idle, not delivering
Hi Stan,

Thanks for your interest.

On 09/01/2010 20:21, Stan Hoeppner wrote:
> Patrick Chemla put forth on 1/9/2010 11:17 AM:
> > Hi all, I got these statistics:
> >
> > Jan 9 19:15:21 postfix postfix/scache[18038]: statistics: start interval Jan 9 19:09:03
> > Jan 9 19:15:21 postfix postfix/scache[18038]: statistics: domain lookup hits=110 miss=89 success=55%
> > Jan 9 19:15:21 postfix postfix/scache[18038]: statistics: address lookup hits=0 miss=2492 success=0%
> > Jan 9 19:15:21 postfix postfix/scache[18038]: statistics: max simultaneous domains=1 addresses=4 connection=4
> >
> > What means miss=89 success=55%, miss=2492 success=0%?
>
> http://www.postfix.com/CONNECTION_CACHE_README.html
>
> --
> Stan

I went there but did not find an explanation of the address lookup or domain lookup misses.

With 122,000 messages in the active queue, I still don't understand why the statistics show max simultaneous domains=1. It should be dozens, or hundreds.

Patrick
Re: Huge active queue and system idle, not delivering
Patrick Chemla put forth on 1/9/2010 11:07 AM: Hi, I will try all your advises, but something still very strange for me: We see that postfix logs show that ehlo process is very slow through postfix but very fast by hand. Even I have recorded through tcpdump/WireShark and I can see that messages are sent very very very quickly in about 1 second. But still messages are sent at a rate of a dozen in 10 seconds. That means that messages are sent 1 by one. If connexion to qmail servers are slow, or if qmails are mis-parameted, too slow or anything else, When I do netstat -apn |grep :25 I get only a few connexions from postfix server to qmail servers. Even if DNS+EHLO are slow, and more, because DNS+EHLO seem to be slow, why I don't see hundreds TCP connexions ESTABLISHED ? This behavior is likely a result of the connection cache: http://www.postfix.com/CONNECTION_CACHE_README.html If one has a large amount of mail destined for a single host, it is inefficient to open dozens or hundreds of TCP connections and SMTP connections due to the additional overhead of process/thread count and memory consumption. It is much more efficient to pipeline all the mail through a single connection. One can only pump so many bits down the wire between two hosts. If you can fill the pipe to near capacity with one TCP/SMTP stream, why open 100s of connections to do the same? I believe this is why you are not seeing dozens or hundreds of TCP connections. Postfix is intelligently designed to avoid this inefficiency. I expected that postfix will deliver on 30 qmail servers at the same time, and should manage hundreds parallel deliveries, hundreds parallel connexions. Is there some parameter or some conception rule that refrain him to do so? I expected that postfix will full up his own CPU/memory creating these parallel delivery processes or/and will wait after the qmail servers, but on all servers at the same time, on multiple connections to each one. Am I correct ? 
or I am dreaming of another mail transport package? Patrick As Victor and others have already stated: 1. In your previous configuration, you had multiple thousands of unique IP addresses (your customers) connecting directly to your 30 qmail servers to relay their mail. qmail performed fine with this configuration because no one qmail server was seeing thousands of delivery attempts per minute from any one single IP address. 2. In your current Postfix configuration, your qmail servers are seeing a single unique IP address attempting to send multiple thousands of messages per minute, and qmail is reacting with rate limiting countermeasures because of this. You need to figure out what settings in the qmail configuration are controlling this rate throttling and in what way. Once you find this and change it, you should see a dramatic improvement in Postfix's ability to quickly move the mail out of the queue to the 30 qmail servers, most likely using a single or only a few TCP connections to each qmail server. -- Stan
Re: Huge active queue and system idle, not delivering
Patrick Chemla put forth on 1/9/2010 12:37 PM: I wen t there but did not find explanations about miss address lookup or miss domain lookup. While I have 122,000 messages in active queue I still don't understand why statistics show max simultaneous domains=1. It should be dozens , or hundreds. Those are statistics relating to scache performance. It tells you how many domains or addresses were able to be delivered via scache reuse. I.e. how many emails Postfix was able to send through an already open SMTP connection to a given host. Since all of your qmail hosts are configured identically, and should be able to relay mail bound for any destination on the internet, you should never see anything less than ~100% in those statistics, _unless_ there is some other kind of problem. If your qmail servers are rate limiting via any method, and Postfix is attempting to send 2000 emails per minute down that one SMTP connection, when qmail blocks individual deliveries for any reason, those scache failure statistics will increase. -- Stan
Re: Huge active queue and system idle, not delivering
On 09/01/2010 20:54, Stan Hoeppner wrote:
> Patrick Chemla put forth on 1/9/2010 12:37 PM:
> > I went there but did not find explanations about miss address lookup or miss domain lookup. While I have 122,000 messages in active queue I still don't understand why statistics show max simultaneous domains=1. It should be dozens, or hundreds.
>
> Those are statistics relating to scache performance. It tells you how many domains or addresses were able to be delivered via scache reuse. I.e. how many emails Postfix was able to send through an already open SMTP connection to a given host. Since all of your qmail hosts are configured identically, and should be able to relay mail bound for any destination on the internet, you should never see anything less than ~100% in those statistics, _unless_ there is some other kind of problem.

You mean 100% success?

> If your qmail servers are rate limiting via any method, and Postfix is attempting to send 2000 emails per minute down that one SMTP connection, when qmail blocks individual deliveries for any reason, those scache failure statistics will increase.
>
> --
> Stan

Before I set up the Postfix relay to load balance across the 30 qmail servers, each of them was able to accept hundreds of thousands of emails into its own queue. Email was sent in campaigns of thousands of messages, balanced across 3 qmail servers, each one working hard to deliver at full CPU and memory.

Instead of sending each campaign through only 3 qmail servers, I thought that by sending each campaign through 30 of them I would cut the load on each one by a factor of ten and speed up deliveries. But now Postfix is holding the emails in its own queue instead of pushing them down to the qmail servers.

The Postfix server and the qmail servers are all about 90% idle. Only 1 to 9 connections exist at any time from Postfix to the qmail servers.

This is exactly what I would like to happen: instead of a queue of 122,000 on Postfix, I expect each qmail server to have a queue of about 4000. The qmail servers did this before I set up Postfix.

Patrick
Re: Huge active queue and system idle, not delivering
Patrick Chemla put forth on 1/9/2010 1:08 PM:
> You mean 100% success?

Yes.

> Before I set up the postfix relay to load balance between 30 qmail servers, each of them was able to accept in his own queue hundreds thousands email. Email were sent by campaigns of thousands balanced on 3 qmails servers, each one full in CPU/memory working hard to deliver. Instead of sending each campaign on only 3 qmails, I though that by sending each campaign on 30 qmails I will cut each one load by ten and speed up deliveries. But now, postfix is retaining the emails in his own queue, not pushing the queue down to the qmails.

An admirable technical goal. Can you elaborate on these campaigns? You said previously that you had hundreds of thousands of customers whose email you were relaying, as if you are an ISP. Now you are saying the mail load is generated by campaigns. What exactly are these campaigns?

> Postfix server and qmail servers are all about 90%cpu free. only 1 to 9 connexions exist at a time from postfix to qmails.

This is because the qmail servers won't let the Postfix server send any faster. We've been over this multiple times now. Multiple people have told you the same thing. For this to work correctly, you need to figure out why the qmail servers are rate limiting the Postfix server's deliveries.

> This is exactly what I would like to append: Instead of a queue of 122,000 on postfix, I expect to have each qmail with a queue of 4000. Qmails did this before I set up postfix.

All MTAs have unique performance characteristics. You've changed one of the MTAs in your architecture. Now you must re-tune your qmail farm servers to work with the new MTA, Postfix, which you have introduced. This is kinda IT 101 stuff. You can't automatically assume the problem lies with the new thing you introduced. Often, the new thing exposes problems or weaknesses that already existed in the old stuff.

--
Stan
Re: Huge active queue and system idle, not delivering
Patrick Chemla:
> Hi all, I got these statistics:
>
> Jan 9 19:15:21 postfix postfix/scache[18038]: statistics: start interval Jan 9 19:09:03
> Jan 9 19:15:21 postfix postfix/scache[18038]: statistics: domain lookup hits=110 miss=89 success=55%
> Jan 9 19:15:21 postfix postfix/scache[18038]: statistics: address lookup hits=0 miss=2492 success=0%
> Jan 9 19:15:21 postfix postfix/scache[18038]: statistics: max simultaneous domains=1 addresses=4 connection=4

Please try the following, as asked half a week ago:

    postconf -e smtp_connection_cache_on_demand=no
    postfix reload

and report if this makes a difference.

Wietse
Re: Huge active queue and system idle, not delivering
Wietse Venema:
> Patrick Chemla:
> > Hi all, I got these statistics:
> >
> > Jan 9 19:15:21 postfix postfix/scache[18038]: statistics: start interval Jan 9 19:09:03
> > Jan 9 19:15:21 postfix postfix/scache[18038]: statistics: domain lookup hits=110 miss=89 success=55%
> > Jan 9 19:15:21 postfix postfix/scache[18038]: statistics: address lookup hits=0 miss=2492 success=0%
> > Jan 9 19:15:21 postfix postfix/scache[18038]: statistics: max simultaneous domains=1 addresses=4 connection=4
>
> Please try the following, as asked half a week ago:
>
>     postconf -e smtp_connection_cache_on_demand=no
>     postfix reload
>
> and report if this makes a difference.

Oh, and please limit the discussion to people who understand the hard technical internals of Postfix. Other people please stay out of the way.

Wietse
Re: Huge active queue and system idle, not delivering
On 08/01/2010 03:03, Wietse Venema wrote:
> Patrick Chemla:
> > But the CPU of the box is idle more than 80%. It is clear that it is not a matter of CPU, nor memory, nor disk. Something in the number of processes/users/simultaneous tasks is blocking.
>
> Indeed, the symptom of blocking is in the third field of the Postfix delays logging. The format of the delays=a/b/c/d logging is as follows:
>
> o a = time from message arrival to last active queue entry
> o b = time from last active queue entry to connection setup
> o c = time in connection setup, including DNS, EHLO and TLS
> o d = time in message transmission
>
> In your case, it takes a minute or more to set up the connection including DNS lookup and EHLO handshake. That is holding up your mail.
>
> - Check if the qmail servers are responsive (telnet hostname 25).

The qmail servers are responsive. I made some changes to my DNS. DNS is better now, but connection setup is still very slow. This morning I saw c=285.

> - Check if your Postfix needs a /var/spool/postfix/etc/resolv.conf file, and if that file is consistent with /etc/resolv.conf. If Postfix needs /var/spool/postfix/etc/resolv.conf and the file is missing or contains a bogus server that will add time to your deliveries.

Hi Wietse,

How do I know if Postfix needs a /var/spool/postfix/etc/resolv.conf file? The directory /var/spool/postfix/etc doesn't exist.

> - If they aren't, increase the concurrency on the qmail side.

Concurrency is set to 100. That is already a large number, but I can increase it.

> Wietse

Thanks
Patrick
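Editorial sketch: the delays=a/b/c/d field described in this message can be pulled out of a log record and labelled, which makes it easy to see that the time is going into connection setup (field c). The stage names, the function names and the 5-second threshold are illustrative choices, not Postfix terminology; the sample record is abbreviated from a log line quoted elsewhere in this thread.

```python
import re

DELAYS_RE = re.compile(r"delays=([\d.]+)/([\d.]+)/([\d.]+)/([\d.]+)")

# a, b, c, d in the order described above; the names are this sketch's own.
STAGES = ("before_active_queue", "active_queue_wait", "connection_setup", "transmission")


def parse_delays(log_line):
    """Return a dict mapping stage name to seconds, or None if no delays= field."""
    m = DELAYS_RE.search(log_line)
    if m is None:
        return None
    return dict(zip(STAGES, (float(x) for x in m.groups())))


def slow_setup(log_line, threshold=5.0):
    """True when connection setup (DNS, EHLO, TLS) took longer than threshold seconds."""
    delays = parse_delays(log_line)
    return delays is not None and delays["connection_setup"] > threshold


sample = "conn_use=59, delay=61550, delays=17019/44435/96/0.17, dsn=2.0.0, status=sent"
print(parse_delays(sample))
print("slow connection setup:", slow_setup(sample))
```

For the sample record, transmission takes 0.17 seconds while connection setup takes 96, which is the imbalance the replies in this thread keep pointing at.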
Re: Huge active queue and system idle, not delivering
On 08/01/2010 00:43, Victor Duchovni wrote:
> On Fri, Jan 08, 2010 at 12:30:34AM +0200, Patrick Chemla wrote:
> > Jan 7 22:02:57 postfix postfix/qmgr[26441]: 5B91F873F6: removed
> > Jan 7 22:02:57 postfix postfix/smtp[27180]: 375DDD5923: to=lexoti...@gmail.com, relay=a139.localpc2105.com[10.0.0.139]:25, conn_use=59, delay=61550, delays=17019/44435/96/0.17, dsn=2.0.0, status=sent (250 ok 1262894577 qp 12113)
>
> This recipient does not match the destination that is clogging the queue. Is the queue clogged with postmaster notices? I never enable any postmaster notices, they don't scale.

    notify_classes =

Done, no change.

> This said, the 96 seconds of connection setup latency is an obvious and severe problem. Why on earth does it take 96 seconds to complete a HELO handshake with a139.localpc2105.com? You are not going to get much mail out if each delivery takes 96 seconds... Is your Postfix server's IP address resolvable on the qmail systems?

Should it be? The qmail servers accept relaying from every client on the local network.

> Are they doing some sort of pre-banner delay? ...

When I do telnet a139.localpc2105.com 25, I get an immediate response.

> > Jan 7 22:02:58 postfix postfix/smtp[27070]: 7F0F2943B3: to=gpo...@wanadoo.fr, relay=a70.localpc2105.com[10.0.0.70]:25, conn_use=10, delay=73795, delays=29264/44481/50/0.21, dsn=2.0.0, status=sent (250 ok 1262894577 qp 23067)
>
> Once again, 50 seconds is severely crippled.

When I telnet a70.localpc2105.com 25 I get an immediate response.

I have checked my local DNS. There were some problems, and I made some improvements. I now have 2 local caching DNS servers that respond quickly. All qmail server addresses are in /etc/hosts on the Postfix machine to avoid IP lookups.

I have checked the qmail servers. Nothing has changed since the time they were able to hold a queue of 200,000 messages, but they now hold only a few hundred.

I have calculated the average time to complete HELO. All qmail servers show the same kind of value, around 2 minutes. None is better than the others. Again, each one was handling a queue of hundreds of thousands of messages before I set up the Postfix relay to load balance.

I really don't have a clue. I don't know where to look.

> > Jan 7 22:02:58 postfix postfix/smtp[27050]: 32BB182182: to=gmarin-jardins-lois...@wanadoo.fr, relay=a139.localpc2105.com[10.0.0.139]:25, conn_use=48, delay=73799, delays=29268/44466/65/0.28, dsn=2.0.0, status=sent (250 ok 1262894578 qp 12121)
>
> This is enough. Fix this.

How can I fix it if it works fine through telnet?

> Where are the deliveries to the clogged destination???

Sorry, I don't understand this question. Please be clear.

Patrick
Re: Huge active queue and system idle, not delivering
Patrick Chemla:
> On 08/01/2010 00:43, Victor Duchovni wrote:
> > On Fri, Jan 08, 2010 at 12:30:34AM +0200, Patrick Chemla wrote:
> > > Jan 7 22:02:57 postfix postfix/qmgr[26441]: 5B91F873F6: removed
> > > Jan 7 22:02:57 postfix postfix/smtp[27180]: 375DDD5923: to=lexoti...@gmail.com, relay=a139.localpc2105.com[10.0.0.139]:25, conn_use=59, delay=61550, delays=17019/44435/96/0.17, dsn=2.0.0,

Note that this connection has been reused multiple times (conn_use=59; see below for what this means in Postfix). Why does it take 96 seconds to initialize a reused SMTP connection?

What happens when you set smtp_connection_cache_on_demand=no in main.cf (and do postfix reload)? If this makes a difference, then

a) you have a problem with smtp-scache communication.
b) qmail does not like RSET commands.
c) your machine is running low on memory and swapping out the scache process.
d) something else.

Wietse

Under high load, smtp(8) processes give their open connections to scache(8). Later, they ask scache(8) for an open connection to a specific destination. Once an smtp(8) client retrieves an open connection, it sends RSET to the remote server and waits for a 250 reply (i.e. the server is still happy). According to the logfile record this lookup/RSET/reply sequence is taking 96 seconds.
Re: Huge active queue and system idle, not delivering
On Fri, 08 Jan 2010 15:24:25 +0200, Patrick Chemla wrote:
> When I telnet a70.localpc2105.com 25 I get an immediate response.

I assume you are telnetting from the Postfix server with the queue delay problem. At this point, after you receive the 220, type:

    ehlo your.postfix-server.tld

then press Enter and time the delay of the 250 responses. Continue through a complete manual mail transaction over telnet, and time the completion of each SMTP command (wall clock is fine). Post the results here please.

--
Stan
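Editorial sketch: the manual timing suggested above can also be scripted, so that each SMTP step is timed the same way against every server. This uses Python's standard `smtplib`; the helper names `timed` and `probe`, the 300-second timeout and the choice of steps (banner, EHLO, RSET) are assumptions made for illustration, and the host and EHLO name must be supplied by whoever runs it. It does not send any mail.

```python
import smtplib
import sys
import time


def timed(label, func, *args):
    """Run func(*args) and return (label, elapsed_seconds, result)."""
    start = time.monotonic()
    result = func(*args)
    return label, time.monotonic() - start, result


def probe(host, helo_name, port=25, timeout=300):
    """Time the connect/banner, EHLO and RSET steps against one SMTP server."""
    timings = []
    server = smtplib.SMTP(timeout=timeout)  # no host given, so nothing connects yet
    try:
        # connect() returns only after the 220 banner has been read.
        timings.append(timed("connect + 220 banner", server.connect, host, port))
        timings.append(timed("EHLO", server.ehlo, helo_name))
        timings.append(timed("RSET", server.rset))
    finally:
        server.close()
    return timings


if __name__ == "__main__":
    # Usage (script name is arbitrary): python probe.py <qmail-host> <ehlo-name>
    if len(sys.argv) == 3:
        for label, seconds, reply in probe(sys.argv[1], sys.argv[2]):
            print(f"{label}: {seconds:.3f}s -> {reply[0]}")
```

Running it from the Postfix host against each qmail server gives per-step wall-clock times that can be compared directly with the third field of the delays= logging.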
Re: Huge active queue and system idle, not delivering
Wietse Venema:
> Patrick Chemla:
> > On 08/01/2010 00:43, Victor Duchovni wrote:
> > > On Fri, Jan 08, 2010 at 12:30:34AM +0200, Patrick Chemla wrote:
> > > > Jan 7 22:02:57 postfix postfix/qmgr[26441]: 5B91F873F6: removed
> > > > Jan 7 22:02:57 postfix postfix/smtp[27180]: 375DDD5923: to=lexoti...@gmail.com, relay=a139.localpc2105.com[10.0.0.139]:25, conn_use=59, delay=61550, delays=17019/44435/96/0.17, dsn=2.0.0,
>
> Note that this connection has been reused multiple times (conn_use=59; see below for what this means in Postfix). Why does it take 96 seconds to initialize a reused SMTP connection?
>
> What happens when you set smtp_connection_cache_on_demand=no in main.cf (and do postfix reload)? If this makes a difference, then

Check your qmail configuration for tarpit options. There exist qmail patches that will slow down the qmail SMTP server when the client sends lots of email, or lots of recipients.

Wietse

> a) you have a problem with smtp-scache communication.
> b) qmail does not like RSET commands.
> c) your machine is running low on memory and swapping out the scache process.
> d) something else.
>
> Wietse
>
> Under high load, smtp(8) processes give their open connections to scache(8). Later, they ask scache(8) for an open connection to a specific destination. Once an smtp(8) client retrieves an open connection, it sends RSET to the remote server and waits for a 250 reply (i.e. the server is still happy). According to the logfile record this lookup/RSET/reply sequence is taking 96 seconds.
Re: Huge active queue and system idle, not delivering
On Fri, Jan 08, 2010 at 03:24:25PM +0200, Patrick Chemla wrote:
> When I do telnet a139.localpc2105.com 25, I get immediate response.

What does "response" mean? Immediate connection completion means nothing. Do you get a 220 banner right away? Do you get all of it or just the first line in a multi-line banner, with the rest arriving later?

> I have checked my local DNS. There were some troubles, and I made some improvements. I have now 2 local caching DNS respawning fast. All qmail servers addresses are in the postfix /etc/hosts to avoid Ip lookup. I have checked qmails servers, nothing has changed since they were able to have a queue of 200,000 messages, but they have now a few hundreds only. I have calculated average times to complete HELO. All qmails are in the same kind of value around 2 minutes.

Why the heck does it take 2 minutes to respond to HELO??? There's your problem. Fix the qmail servers, to ensure that it takes a few milliseconds to respond to HELO. Right now your HELO response is 3-4 orders of magnitude too slow. The fix probably involves as much removal of counter-productive tweaks as adding specific changes to address the problem (disable any rate controls that deliberately slow the sender, or constrain resources). Start debugging on the qmail side, find out what it is doing for 2 minutes...

--
Viktor.
Re: Huge active queue and system idle, not delivering
Patrick Chemla: Hi, I am running Postfix 2.5.6 on a Fedora 11 Linux system on a hardware based Intel I5/750 Quad Core, 8 Gb memory, 160Gb SSD hard disk. Incoming messages are entering very fast (500 smtp processes declared) and the active queue is actually of 2 millions messages waiting for delivery. The delivery, for all messages should go through a farm of 30 MX servers from domain localpc2105.com, on load balancing through DNS resolution. DNS server is of course local. All 30 MX servers are running qmail. All of them are more than 90% idle. Before I set up my postfix server, email were sent directly to the qmail servers, and qmail was running at full CPU. So I am sure that qmail can handle much more faster. I have set up the postfix server to load balance the load between all the 30 qmail servers to avoid situation where some were running at full charge and others were not working. http://www.postfix.org/DEBUG_README.html#logging Wietse
Re: Huge active queue and system idle, not delivering
2010/1/8 Patrick Chemla patrick.che...@perfaction.net

Incoming messages are entering very fast (500 smtp processes declared) and the active queue is actually of 2 millions messages waiting for delivery.

snip

here is my main.cf file:

That's some very thorough information; you've provided plenty of context and a clear description, which is great. While I lack sufficient knowledge to provide thoughts on the bottlenecking, I *do* expect that people will want to see the output of `postconf -n`, instead of your main.cf (to ensure we see what postfix actually sees and uses).

Can you clarify what you mean by "500 smtp processes declared"? A sample output from qshape also wouldn't go astray (http://www.postfix.org/qshape.1.html).

You've provided some proportional figures (percentages), but some solid throughput numbers would be good too. E.g. "We're injecting 2 million messages to the postfix box, we expect to enqueue them in X hrs, but it takes Y hrs, and they're only leaving the postfix box at Z messages/sec."

I see you said "I just found that Postfix could send 1 million emails per hour when I send less than a half million in 24 hours", but I can't make sense of that, sorry.
Re: Huge active queue and system idle, not delivering
On 07/01/2010 20:03, Barney Desmond wrote:

2010/1/8 Patrick Chemla patrick.che...@perfaction.net

Incoming messages are entering very fast (500 smtp processes declared) and the active queue is actually of 2 millions messages waiting for delivery.

snip

here is my main.cf file:

That's some very thorough information; you've provided plenty of context and a clear description, which is great. While I lack sufficient knowledge to provide thoughts on the bottlenecking, I *do* expect that people will want to see the output of `postconf -n`, instead of your main.cf (to ensure we see what postfix actually sees and uses).

Here is postconf -n

    alias_database = hash:/etc/aliases
    alias_maps = hash:/etc/aliases
    command_directory = /usr/sbin
    config_directory = /etc/postfix
    daemon_directory = /usr/libexec/postfix
    data_directory = /var/lib/postfix
    debug_peer_level = 8
    debug_peer_list = orange.fr
    default_delivery_slot_cost = 30
    default_delivery_slot_discount = 100
    default_destination_concurrency_failed_cohort_limit = 10
    default_destination_concurrency_limit = 500
    default_destination_recipient_limit = 200
    default_minimum_delivery_slots = 30
    default_process_limit = 1000
    default_recipient_limit = 200
    html_directory = no
    inet_interfaces = all
    inet_protocols = all
    initial_destination_concurrency = 100
    lmtp_destination_concurrency_limit = $default_destination_concurrency_limit
    local_destination_concurrency_limit = 50
    local_destination_recipient_limit = 500
    mail_owner = postfix
    mailbox_size_limit = 512000
    mailq_path = /usr/bin/mailq.postfix
    manpage_directory = /usr/share/man
    max_use = 1000
    mime_nesting_limit = 100
    mydestination = $myhostname, localhost.$mydomain, localhost
    mydomain = localpc2105.com
    myhostname = postfix.proacti5.net
    mynetworks = 172.27.27.0/24, 10.0.0.0/24, 127.0.0.0/24
    newaliases_path = /usr/bin/newaliases.postfix
    qmgr_fudge_factor = 200
    qmgr_message_active_limit = 200
    qmgr_message_recipient_limit = 200
    queue_directory = /var/spool/postfix
    queue_file_attribute_count_limit = 250
    readme_directory = /usr/share/doc/postfix-2.5.6/README_FILES
    relay_destination_concurrency_limit = $default_destination_concurrency_limit
    relayhost = $mydomain
    sample_directory = /usr/share/doc/postfix-2.5.6/samples
    sendmail_path = /usr/sbin/sendmail.postfix
    setgid_group = postdrop
    smtp_connect_timeout = 10s
    smtp_data_done_timeout = 10s
    smtp_destination_concurrency_limit = $default_destination_concurrency_limit
    smtp_mail_timeout = 5s
    smtpd_history_flush_threshold = 100
    smtpd_junk_command_limit = 100
    smtpd_peername_lookup = no
    unknown_local_recipient_reject_code = 550

Can you clarify what you mean by 500 smtp processes declared? A sample output from qshape also wouldn't go astray either (http://www.postfix.org/qshape.1.html).

Here is qshape:

                       T  5 10 20 40 80  160   320   640 1280 1280+
    TOTAL         133000  0  0  0  0  0 2470 40538 80844 7167  1981
    wanadoo.fr     61955  0  0  0  0  0 2469 26830 31340 1260    56
    orange.fr       4171  0  0  0  0  0    0  1176  2144  540   311
    skynet.be       3286  0  0  0  0  0    0     1  3284    0     1
    aliceadsl.fr    3259  0  0  0  0  0    0    54  3169   25    11
    aol.com         3150  0  0  0  0  0    0  1545  1524   40    41
    free.fr         2138  0  0  0  0  0    0   453  1561   89    35
    sfr.fr           840  0  0  0  0  0    0    23   816    1     0
    hotmail.fr       679  0  0  0  0  0    0   150   420   12    97
    telenet.be       658  0  0  0  0  0    0     0   658    0     0
    gmail.com        358  0  0  0  0  0    0   157   145   11    45
    hotmail.com      325  0  0  0  0  0    0    44   220   20    41
    neuf.fr          252  0  0  0  0  0    0    62   176   14     0
    9online.fr       250  0  0  0  0  0    0     6   244    0     0
    cegetel.net      195  0  0  0  0  0    0    26   155    4    10
    laposte.net      183  0  0  0  0  0    0    51    93   15    24
    swing.be         141  0  0  0  0  0    0     2   139    0     0
    9business.fr     111  0  0  0  0  0    0    23    85    3     0
    sonepar.fr       107  0  0  0  0  0    0    33    72    2     0
    axa.fr           103  0  0  0  0  0    0    30    67    1     5

most of the messages stay in the queue for hours.

You've provided some proportional figures (percentages), but some solid throughput numbers would be good too. Eg. We're injecting 2 million messages to the postfix box, we expect to enqueue them in X hrs, but it takes Y hrs, and they're only leaving the postfix box at Z messages/sec.

I see you said I just found that Postfix could send 1 million emails per hour when I send less than a half million in 24 hours, but I can't make sense of
Re: Huge active queue and system idle, not delivering
On 07/01/2010 20:00, Wietse Venema wrote:

Patrick Chemla:

Hi, I am running Postfix 2.5.6 on a Fedora 11 Linux system on a hardware based Intel I5/750 Quad Core, 8 Gb memory, 160Gb SSD hard disk. Incoming messages are entering very fast (500 smtp processes declared) and the active queue is actually of 2 millions messages waiting for delivery. The delivery, for all messages should go through a farm of 30 MX servers from domain localpc2105.com, on load balancing through DNS resolution. DNS server is of course local. All 30 MX servers are running qmail. All of them are more than 90% idle. Before I set up my postfix server, email were sent directly to the qmail servers, and qmail was running at full CPU. So I am sure that qmail can handle much more faster. I have set up the postfix server to load balance the load between all the 30 qmail servers to avoid situation where some were running at full charge and others were not working.

http://www.postfix.org/DEBUG_README.html#logging

Wietse

Here the logs:

    Jan 6 23:12:48 postfix postfix/qmgr[31260]: warning: to turn off these warnings specify: qmgr_clog_warn_time = 0
    Jan 6 23:19:39 postfix postfix/qmgr[31260]: warning: mail for localpc2105.com is using up 461335 of 461335 active queue entries
    Jan 6 23:19:39 postfix postfix/qmgr[31260]: warning: you may need to increase the main.cf smtp_destination_concurrency_limit from 100
    Jan 6 23:19:39 postfix postfix/qmgr[31260]: warning: please avoid flushing the whole queue when you have
    Jan 6 23:19:39 postfix postfix/qmgr[31260]: warning: lots of deferred mail, that is bad for performance
    Jan 6 23:19:39 postfix postfix/qmgr[31260]: warning: to turn off these warnings specify: qmgr_clog_warn_time = 0
    Jan 6 23:24:51 postfix postfix/qmgr[31260]: warning: mail for localpc2105.com is using up 461086 of 461086 active queue entries
    Jan 6 23:24:51 postfix postfix/qmgr[31260]: warning: you may need to increase the main.cf smtp_destination_concurrency_limit from 100
    Jan 6 23:24:51 postfix postfix/qmgr[31260]: warning: please avoid flushing the whole queue when you have
    Jan 6 23:24:51 postfix postfix/qmgr[31260]: warning: lots of deferred mail, that is bad for performance
    Jan 6 23:24:51 postfix postfix/qmgr[31260]: warning: to turn off these warnings specify: qmgr_clog_warn_time = 0
    Jan 6 23:29:51 postfix postfix/qmgr[31260]: warning: mail for localpc2105.com is using up 460872 of 460872 active queue entries
    Jan 6 23:29:51 postfix postfix/qmgr[31260]: warning: you may need to increase the main.cf smtp_destination_concurrency_limit from 100
    Jan 6 23:29:51 postfix postfix/qmgr[31260]: warning: please avoid flushing the whole queue when you have
    Jan 6 23:29:51 postfix postfix/qmgr[31260]: warning: lots of deferred mail, that is bad for performance
    Jan 6 23:29:51 postfix postfix/qmgr[31260]: warning: to turn off these warnings specify: qmgr_clog_warn_time = 0
    Jan 6 23:35:51 postfix postfix/qmgr[31260]: warning: mail for localpc2105.com is using up 460025 of 460025 active queue entries
    Jan 6 23:35:51 postfix postfix/qmgr[31260]: warning: you may need to increase the main.cf smtp_destination_concurrency_limit from 100
    Jan 6 23:35:51 postfix postfix/qmgr[31260]: warning: please avoid flushing the whole queue when you have
    Jan 6 23:35:51 postfix postfix/qmgr[31260]: warning: lots of deferred mail, that is bad for performance
    Jan 6 23:35:51 postfix postfix/qmgr[31260]: warning: to turn off these warnings specify: qmgr_clog_warn_time = 0
    Jan 6 23:40:51 postfix postfix/qmgr[31260]: warning: mail for localpc2105.com is using up 460283 of 460283 active queue entries
    Jan 6 23:40:51 postfix postfix/qmgr[31260]: warning: you may need to increase the main.cf smtp_destination_concurrency_limit from 100
    Jan 6 23:40:51 postfix postfix/qmgr[31260]: warning: please avoid flushing the whole queue when you have
    Jan 6 23:40:51 postfix postfix/qmgr[31260]: warning: lots of deferred mail, that is bad for performance
    Jan 6 23:40:51 postfix postfix/qmgr[31260]: warning: to turn off these warnings specify: qmgr_clog_warn_time = 0
    Jan 6 23:47:21 postfix postfix/qmgr[31260]: warning: mail for localpc2105.com is using up 459714 of 459714 active queue entries
    Jan 6 23:47:21 postfix postfix/qmgr[31260]: warning: you may need to increase the main.cf smtp_destination_concurrency_limit from 100
    Jan 6 23:47:21 postfix postfix/qmgr[31260]: warning: please avoid flushing the whole queue when you have
    Jan 6 23:47:21 postfix postfix/qmgr[31260]: warning: lots of deferred mail, that is bad for performance
    Jan 6 23:47:21 postfix postfix/qmgr[31260]: warning: to turn off these warnings specify: qmgr_clog_warn_time = 0
    Jan 6 23:52:21 postfix postfix/qmgr[31260]: warning: mail for localpc2105.com is using up 459491 of 459491 active queue entries
    Jan 6 23:52:21 postfix postfix/qmgr[31260]: warning: you may need to increase the main.cf smtp_destination_concurrency_limit from 100
    Jan
Re: Huge active queue and system idle, not delivering
On Thu, Jan 07, 2010 at 07:43:55PM +0200, Patrick Chemla wrote:

CPU is more than 85% idle on my postfix I5/750 box, but the outbound queue is very very slow.

Throughput == Concurrency / Latency

What destination are most of the messages in the queue going to? What is the associated transport? Are you using any content filters? What is the destination concurrency limit for that transport? What is the delivery latency to that transport? Show a/b/c/d data averaged (mean, median, stddev) over a bunch of log entries.

It seems that something refrain qmgr to work at full range, despite the parameters

Most of your parameter tweaks are counter-productive. Do not tweak anything other than the destination concurrency limit for a transport that delivers to a high capacity destination you control, say:

    # The 200 is not a golden value, start at 50 and raise only
    # if throughput improves as a result...
    #
    relay_destination_concurrency_limit = 200

    myhostname = postfix.proacti5.net
    mydomain = localpc2105.com
    inet_interfaces = all
    mydestination = $myhostname, localhost.$mydomain, localhost
    unknown_local_recipient_reject_code = 550
    mynetworks = 172.27.27.0/24, 10.0.0.0/24, 127.0.0.0/24
    relayhost = $mydomain

With relayhost set, all remote mail goes to the MX hosts for $mydomain, so in this case, you can also raise:

    # The 200 is not a golden value, start at 50 and raise only
    # if throughput improves as a result...
    #
    smtp_destination_concurrency_limit = 200

if necessary.

    local_destination_recipient_limit = 500

Terrible idea.

    local_destination_concurrency_limit = 50

Terrible idea.

    debug_peer_level = 8

Absurd.

    debug_peer_list = orange.fr

I hope very little mail goes there...

    default_process_limit = 1000

Raise just the master.cf limits for the smtpd(8) and smtp(8) services. You don't need 1000 of everything.

    initial_destination_concurrency = 100

Too high.

    transport_initial_destination_concurrency = 100

You misunderstood the docs, this is useless.

    default_destination_concurrency_failed_cohort_limit = 10

Should not be necessary.

    default_destination_recipient_limit = 200

OK.

    transport_destination_recipient_limit = 100

You misunderstood the docs, this is useless.

    default_delivery_slot_cost = 30
    default_minimum_delivery_slots = 30
    default_delivery_slot_discount = 100
    qmgr_fudge_factor = 200

Don't mess with the nqmgr tunables, they are too subtle for mortals.

    smtpd_peername_lookup = no

When output is starved, why make the input even faster?

    default_recipient_limit = 200
    qmgr_message_active_limit = 200
    qmgr_message_recipient_limit = 200

The Postfix queue does not scale to arbitrarily large sizes, at some point, there is more to do than available capacity to process the backlog. 2 million active messages may be OK for a mass-mail engine that fires up periodically, and works as fast as it can, but it is terrible for a mail forwarding relay. Which use-case are you in?

    mailbox_size_limit = 512000

Why does this machine have any mailboxes at all? Isn't it a relay? What software performs well with 5GB mailboxes?

    default_destination_concurrency_limit = 500

Better to specify smtp, relay or both, but not default.

    lmtp_destination_concurrency_limit = $default_destination_concurrency_limit
    smtp_destination_concurrency_limit = $default_destination_concurrency_limit
    relay_destination_concurrency_limit = $default_destination_concurrency_limit
    mime_nesting_limit = 100

These are default settings, don't add them to main.cf

    max_use = 1000

Fine.

    queue_file_attribute_count_limit = 250
    smtpd_history_flush_threshold = 100

Why???

    smtpd_junk_command_limit = 100

Why so generous to the input side?

    smtp_connect_timeout = 10s

Reasonable for a large nearby MX pool, you can even use 1s if you want.

    smtp_data_done_timeout = 10s

Really not a good idea.

    smtp_mail_timeout = 5s

A bit aggressive...

    smtp inet n - n - - smtpd

Tune the process limit here

    qmgr fifo n - n 30 1 qmgr

Why re-scan the incoming queue every 30 seconds? The default is fine.

    smtp unix - - n - - smtp

Adjust the process limit here to the right number of smtp(8) delivery agents.

    relay unix - - n - - smtp -o smtp_fallback_relay=

Adjust this process limit if you service any relay domains.

I tried many combinations to speed up the delivery. Nothing help up to now.

LOGS!!!

I just found that Postfix could send 1 million emails per hour when I send less than a half million in 24 hours.

LOGS!!!

-- Viktor.
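To put rough numbers on "Throughput == Concurrency / Latency", here is a back-of-the-envelope Python sketch. The inputs are assumptions pulled from elsewhere in this thread, not measurements: a concurrency of 500 (the default_destination_concurrency_limit in the posted postconf -n output; the effective value may well be lower), a per-delivery latency of about 96 seconds (the connection setup time seen in the sample smtp(8) log lines), and a target of 4 million messages per day (the upper end of what the original poster says he must inject). The function names are invented for this example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def throughput_per_second(concurrency, latency_seconds):
    """Throughput == Concurrency / Latency, in deliveries per second."""
    return concurrency / latency_seconds


def required_concurrency(target_per_day, latency_seconds):
    """Parallel deliveries needed to sustain target_per_day at this latency."""
    return (target_per_day / SECONDS_PER_DAY) * latency_seconds


# Assumed inputs, see the paragraph above.
per_second = throughput_per_second(500, 96.0)
per_day = per_second * SECONDS_PER_DAY
print(f"500 parallel deliveries at 96 s each: {per_second:.2f} msg/s, "
      f"about {per_day:,.0f} msg/day")

# Same daily target at two latencies: the slow one seen in the logs, and
# a hypothetical 0.3 s per delivery if connection setup were near-instant.
for latency in (96.0, 0.3):
    need = required_concurrency(4_000_000, latency)
    print(f"4,000,000 msg/day at {latency} s latency needs "
          f"about {need:.0f} parallel deliveries")
```

With those assumed inputs the first figure works out to 450,000 messages per day, the same ballpark as the "less than a half million in 24 hours" reported in the thread, which suggests latency rather than CPU is the limiting term.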
Re: Huge active queue and system idle, not delivering
On Thu, Jan 07, 2010 at 08:29:44PM +0200, Patrick Chemla wrote: Here the logs: This is just the qmgr(8) warnings about a clogged queue. Other than telling us that all the mail is going to localpc2105.com, this is not very useful. Where are the logs from smtp(8)? What transport is localpc2105.com destined for? Any earlier logging about actual delivery attempts for this destination? -- Viktor.
Re: Huge active queue and system idle, not delivering
On Thu, Jan 7, 2010 at 1:25 PM, Patrick Chemla patrick.che...@perfaction.net wrote: said I just found that Postfix could send 1 million emails per hour when I send less than a half million in 24 hours, but I can't make sense of that, sorry. I have to inject 2 to 4 millions emails to the postfix box in 24 hours, and I expect to deliver within the same delay. Actually, I can't deliver more than 500,000 per 24h hours. It could be viewed that half a million delivered in 24 hours is fine. Are you signing the mail? This can help with delivery rates to the large webmailer mx destinations. Stef But the CPU of the box is idle more than 80%. It is clear that it is not a matter of CPU, nor memory, nor disk. Something in the number of processes/users/simultaneous tasks is blocking.
Re: Huge active queue and system idle, not delivering
On Thu, Jan 07, 2010 at 04:47:14PM -0500, Stefan Caunter wrote: I have to inject 2 to 4 millions emails to the postfix box in 24 hours, and I expect to deliver within the same delay. Actually, I can't deliver more than 500,000 per 24h hours. It could be viewed that half a million delivered in 24 hours is fine. No, it is too slow, when there is no content inspection involved, especially with a nearby farm of relayhosts. Are you signing the mail? This can help with delivery rates to the large webmailer mx destinations. This is unrelated to the OP's problem. -- Viktor.
Re: Huge active queue and system idle, not delivering
* Stefan Caunter s...@caunter.ca: It could be viewed that half a million delivered in 24 hours is fine. Are you signing the mail? This can help with delivery rates to the large webmailer mx destinations. There are many things to consider: * DKIM signing - which is the prerequisite for getting into feedback loops at major email providers * get into the feedback loops at major email providers * SPF * good reputation (e.g. SenderBase, senderscore) -- Ralf Hildebrandt Geschäftsbereich IT | Abteilung Netzwerk Charité - Universitätsmedizin Berlin Campus Benjamin Franklin Hindenburgdamm 30 | D-12203 Berlin Tel. +49 30 450 570 155 | Fax: +49 30 450 570 962 ralf.hildebra...@charite.de | http://www.charite.de
Re: Huge active queue and system idle, not delivering
On Thu, Jan 07, 2010 at 10:54:15PM +0100, Ralf Hildebrandt wrote: It could be viewed that half a million delivered in 24 hours is fine. Are you signing the mail? This can help with delivery rates to the large webmailer mx destinations. There are many things to consider: * DKIM signing - which is the prerequisite for getting into feedback loops at major email providers * get into the feedback loops at major email providers * SPF * good reputation (e.g. SenderBase, senderscore) None of these apply to the OP's problem. He is sending mail to a pool of 30 qmail hosts. -- Viktor.
Re: Huge active queue and system idle, not delivering
On 07/01/2010 23:47, Stefan Caunter wrote: On Thu, Jan 7, 2010 at 1:25 PM, Patrick Chemla patrick.che...@perfaction.net wrote: said I just found that Postfix could send 1 million emails per hour when I send less than a half million in 24 hours, but I can't make sense of that, sorry. I have to inject 2 to 4 millions emails to the postfix box in 24 hours, and I expect to deliver within the same delay. Actually, I can't deliver more than 500,000 per 24h hours. It could be viewed that half a million delivered in 24 hours is fine. Are you signing the mail? This can help with delivery rates to the large webmailer mx destinations. Stef Half a million is 4 times lower than what we have done with the qmail servers. Emails are signed, but not by Postfix. Postfix must only relay mail from clients to the local MXs. These local MXs will handle deliveries to the outside. The mail queue should be on these MXs, because they are dependent on the final destinations. But the CPU of the box is idle more than 80%. It is clear that it is not a matter of CPU, nor memory, nor disk. Something in the number of processes/users/simultaneous tasks is blocking.
Re: Huge active queue and system idle, not delivering
On 07/01/2010 20:37, Victor Duchovni wrote:

On Thu, Jan 07, 2010 at 08:29:44PM +0200, Patrick Chemla wrote:

Here the logs:

This is just the qmgr(8) warnings about a clogged queue. Other than telling us that all the mail is going to localpc2105.com, this is not very useful. Where are the logs from smtp(8)? What transport is localpc2105.com destined for? Any earlier logging about actual delivery attempts for this destination?

Victor, thank you for your interest. Daily logs are huge. Here is a sample of deliveries:

    Jan 7 22:02:57 postfix postfix/qmgr[26441]: 5B91F873F6: removed
    Jan 7 22:02:57 postfix postfix/smtp[27180]: 375DDD5923: to=lexoti...@gmail.com, relay=a139.localpc2105.com[10.0.0.139]:25, conn_use=59, delay=61550, delays=17019/44435/96/0.17, dsn=2.0.0, status=sent (250 ok 1262894577 qp 12113)
    Jan 7 22:02:57 postfix postfix/qmgr[26441]: 375DDD5923: removed
    Jan 7 22:02:58 postfix postfix/smtp[27070]: 7F0F2943B3: to=gpo...@wanadoo.fr, relay=a70.localpc2105.com[10.0.0.70]:25, conn_use=10, delay=73795, delays=29264/44481/50/0.21, dsn=2.0.0, status=sent (250 ok 1262894577 qp 23067)
    Jan 7 22:02:58 postfix postfix/qmgr[26441]: 7F0F2943B3: removed
    Jan 7 22:02:58 postfix postfix/smtp[27050]: 32BB182182: to=gmarin-jardins-lois...@wanadoo.fr, relay=a139.localpc2105.com[10.0.0.139]:25, conn_use=48, delay=73799, delays=29268/44466/65/0.28, dsn=2.0.0, status=sent (250 ok 1262894578 qp 12121)
    Jan 7 22:02:58 postfix postfix/qmgr[26441]: 32BB182182: removed
    Jan 7 22:02:58 postfix postfix/smtp[26758]: 577D6C7F7D: to=gerardtremb...@vinsdusiecle.com, relay=a139.localpc2105.com[10.0.0.139]:25, conn_use=60, delay=68451, delays=23920/44481/50/0.29, dsn=2.0.0, status=sent (250 ok 1262894578 qp 12122)
    Jan 7 22:02:58 postfix postfix/qmgr[26441]: 577D6C7F7D: removed
    Jan 7 22:02:58 postfix postfix/smtp[26935]: CDCE074F53: to=christian.lebe...@arcelor.com, relay=a139.localpc2105.com[10.0.0.139]:25, conn_use=49, delay=104597, delays=60065/44421/110/0.3, dsn=2.0.0, status=sent (250 ok 1262894578 qp 12135)
    Jan 7 22:02:58 postfix postfix/qmgr[26441]: CDCE074F53: removed
    Jan 7 22:02:58 postfix postfix/smtp[26708]: 4B0B6E77FD: to=m...@metaproductique.com, relay=a139.localpc2105.com[10.0.0.139]:25, conn_use=61, delay=46137, delays=1606/44461/70/0.31, dsn=2.0.0, status=sent (250 ok 1262894578 qp 12136)
    Jan 7 22:02:58 postfix postfix/qmgr[26441]: 4B0B6E77FD: removed
    Jan 7 22:02:58 postfix postfix/smtp[26794]: D2CB5DC84C: to=secretar...@mairie-charly.fr, relay=a70.localpc2105.com[10.0.0.70]:25, conn_use=11, delay=58160, delays=13628/44481/50/0.23, dsn=2.0.0, status=sent (250 ok 1262894578 qp 23076)
    Jan 7 22:02:58 postfix postfix/qmgr[26441]: D2CB5DC84C: removed
    Jan 7 22:02:58 postfix postfix/smtp[26968]: 1A651E17E0: to=davau.br...@orange.fr, relay=a74.localpc2105.com[10.0.0.74]:25, conn_use=2, delay=54426, delays=9894/44462/69/0.27, dsn=2.0.0, status=sent (250 ok 1262894578 qp 7411)
    Jan 7 22:02:58 postfix postfix/qmgr[26441]: 1A651E17E0: removed
    Jan 7 22:02:58 postfix postfix/smtp[27037]: 4CCC486B55: to=lenaerts.natuurst...@pandora.be, relay=a139.localpc2105.com[10.0.0.139]:25, conn_use=50, delay=45538, delays=1005/44407/125/0.17, dsn=2.0.0, status=sent (250 ok 1262894578 qp 12150)
    Jan 7 22:02:58 postfix postfix/qmgr[26441]: 4CCC486B55: removed
    Jan 7 22:02:58 postfix postfix/smtp[27188]: D130997201: to=cont...@afcmecanum.com, relay=a74.localpc2105.com[10.0.0.74]:25, conn_use=2, delay=71536, delays=27004/8/84/0.28, dsn=2.0.0, status=sent (250 ok 1262894578 qp 7412)
    Jan 7 22:02:58 postfix postfix/qmgr[26441]: D130997201: removed
    Jan 7 22:02:59 postfix postfix/smtp[27033]: 6BD743906A: to=copyboli...@orange.fr, relay=a139.localpc2105.com[10.0.0.139]:25, conn_use=62, delay=81473, delays=36941/44467/65/0.24, dsn=2.0.0, status=sent (250 ok 1262894579 qp 12157)
    Jan 7 22:02:59 postfix postfix/qmgr[26441]: 6BD743906A: removed
    Jan 7 22:02:59 postfix postfix/smtp[26793]: 84947C14B2: to=wgall...@saemshema.com, relay=a70.localpc2105.com[10.0.0.70]:25, conn_use=12, delay=69401, delays=24868/44469/63/0.2, dsn=2.0.0, status=sent (250 ok 1262894578 qp 23084)
    Jan 7 22:02:59 postfix postfix/qmgr[26441]: 84947C14B2: removed
    Jan 7 22:02:59 postfix postfix/smtp[26737]: 6023552F52: to=cont...@installation-spa-gard.com, relay=a139.localpc2105.com[10.0.0.139]:25, conn_use=51, delay=96132, delays=51599/8/84/0.3, dsn=2.0.0, status=sent (250 ok 1262894579 qp 12158)
    Jan 7 22:02:59 postfix postfix/qmgr[26441]: 6023552F52: removed
    Jan 7 22:02:59 postfix postfix/smtp[27134]: connect to a132.localpc2105.com[10.0.0.132]:25: Connection timed out
    Jan 7 22:02:59 postfix postfix/smtp[26717]: 96A447C426: to=alain.perignon.aulnaysousb...@reseau.renault.fr, relay=a139.localpc2105.com[10.0.0.139]:25, conn_use=63, delay=103800, delays=59267/44433/99/0.27, dsn=2.0.0, status=sent (250 ok 1262894579 qp 12166)
    Jan 7 22:02:59 postfix postfix/qmgr[26441]:
Re: Huge active queue and system idle, not delivering
On Fri, Jan 08, 2010 at 12:30:34AM +0200, Patrick Chemla wrote:

    Jan 7 22:02:57 postfix postfix/qmgr[26441]: 5B91F873F6: removed
    Jan 7 22:02:57 postfix postfix/smtp[27180]: 375DDD5923: to=lexoti...@gmail.com, relay=a139.localpc2105.com[10.0.0.139]:25, conn_use=59, delay=61550, delays=17019/44435/96/0.17, dsn=2.0.0, status=sent (250 ok 1262894577 qp 12113)

This recipient does not match the destination that is clogging the queue. Is the queue clogged with postmaster notices? I never enable any postmaster notices, they don't scale.

    notify_classes =

This said, the 96 seconds of connection setup latency is an obvious and severe problem. Why on earth does it take 96 seconds to complete a HELO handshake with a139.localpc2105.com? You are not going to get much mail out if each delivery takes 96 seconds... Is your Postfix server's IP address resolvable on the qmail systems? Are they doing some sort of pre-banner delay?

...

    Jan 7 22:02:58 postfix postfix/smtp[27070]: 7F0F2943B3: to=gpo...@wanadoo.fr, relay=a70.localpc2105.com[10.0.0.70]:25, conn_use=10, delay=73795, delays=29264/44481/50/0.21, dsn=2.0.0, status=sent (250 ok 1262894577 qp 23067)

Once again, 50 seconds is severely crippled.

    Jan 7 22:02:58 postfix postfix/smtp[27050]: 32BB182182: to=gmarin-jardins-lois...@wanadoo.fr, relay=a139.localpc2105.com[10.0.0.139]:25, conn_use=48, delay=73799, delays=29268/44466/65/0.28, dsn=2.0.0, status=sent (250 ok 1262894578 qp 12121)

This is enough. Fix this. Where are the deliveries to the clogged destination???

-- Viktor.
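The "mean, median, stddev over a bunch of log entries" that Viktor asked for earlier can be pulled out of smtp(8) log lines with a few lines of Python. This is only a sketch: the regular expression is written against the log format shown in this thread, the helper name is made up, and the three sample lines are the ones quoted in this message. On a real system you would feed it lines from the mail log instead.

```python
import re
import statistics
from collections import defaultdict

# Matches "relay=host[ip]:port, ... delays=a/b/c/d" as it appears in the
# smtp(8) log lines quoted in this thread.
LOG_RE = re.compile(
    r"relay=(?P<relay>[^\[,]+)\[[^\]]*\]:\d+,.*?"
    r"delays=(?P<a>[\d.]+)/(?P<b>[\d.]+)/(?P<c>[\d.]+)/(?P<d>[\d.]+)"
)


def setup_latency_by_relay(lines):
    """Summarise the third delays field (connection setup) per relay host."""
    by_relay = defaultdict(list)
    for line in lines:
        match = LOG_RE.search(line)
        if match:
            by_relay[match.group("relay")].append(float(match.group("c")))
    return {
        relay: {
            "n": len(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
            # stdev needs at least two samples.
            "stdev": statistics.stdev(values) if len(values) > 1 else 0.0,
        }
        for relay, values in by_relay.items()
    }


SAMPLE = [
    "Jan 7 22:02:57 postfix postfix/smtp[27180]: 375DDD5923: to=lexoti...@gmail.com, relay=a139.localpc2105.com[10.0.0.139]:25, conn_use=59, delay=61550, delays=17019/44435/96/0.17, dsn=2.0.0, status=sent (250 ok 1262894577 qp 12113)",
    "Jan 7 22:02:58 postfix postfix/smtp[27070]: 7F0F2943B3: to=gpo...@wanadoo.fr, relay=a70.localpc2105.com[10.0.0.70]:25, conn_use=10, delay=73795, delays=29264/44481/50/0.21, dsn=2.0.0, status=sent (250 ok 1262894577 qp 23067)",
    "Jan 7 22:02:58 postfix postfix/smtp[27050]: 32BB182182: to=gmarin-jardins-lois...@wanadoo.fr, relay=a139.localpc2105.com[10.0.0.139]:25, conn_use=48, delay=73799, delays=29268/44466/65/0.28, dsn=2.0.0, status=sent (250 ok 1262894578 qp 12121)",
]

for relay, stats in sorted(setup_latency_by_relay(SAMPLE).items()):
    print(relay, stats)
```

Grouping by relay makes it easier to see whether all qmail hosts are equally slow or only a few of them.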
Re: Huge active queue and system idle, not delivering
Patrick Chemla:

But the CPU of the box is idle more than 80%. It is clear that it is not a matter of CPU, nor memory, nor disk. Something in the number of processes/users/simultaneous tasks is blocking.

Indeed, the symptom of blocking is in the third field of the Postfix delays logging. The format of the delays=a/b/c/d logging is as follows:

o a = time from message arrival to last active queue entry
o b = time from last active queue entry to connection setup
o c = time in connection setup, including DNS, EHLO and TLS
o d = time in message transmission

In your case, it takes a minute or more to set up the connection including DNS lookup and EHLO handshake. That is holding up your mail.

- Check if the qmail servers are responsive (telnet hostname 25).
- If they aren't, increase the concurrency on the qmail side.
- Check if your Postfix needs a /var/spool/postfix/etc/resolv.conf file, and if that file is consistent with /etc/resolv.conf. If Postfix needs /var/spool/postfix/etc/resolv.conf and the file is missing or contains a bogus server, that will add time to your deliveries.

Wietse
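As an illustration of the delays=a/b/c/d breakdown described above, a small Python sketch that splits the field into named parts. The class and field names are invented for this example (Postfix itself only logs the four numbers), and the input string is one of the values from the logs posted in this thread.

```python
from typing import NamedTuple


class Delays(NamedTuple):
    """The four parts of a Postfix delays=a/b/c/d log field, in seconds."""
    a_arrival_to_active: float   # message arrival to last active queue entry
    b_active_to_connect: float   # last active queue entry to connection setup
    c_connection_setup: float    # connection setup, incl. DNS, EHLO and TLS
    d_transmission: float        # message transmission


def parse_delays(field):
    """Parse 'delays=a/b/c/d' (or just 'a/b/c/d') into a Delays tuple."""
    value = field.split("=", 1)[1] if "=" in field else field
    a, b, c, d = (float(part) for part in value.split("/"))
    return Delays(a, b, c, d)


sample = parse_delays("delays=17019/44435/96/0.17")
print(sample)
print(f"connection setup: {sample.c_connection_setup} s, "
      f"transmission: {sample.d_transmission} s")
print(f"sum of the four parts: {sum(sample):.2f} s")
```

For this log entry the four parts add up to about 61550 seconds, matching the delay=61550 value in the same line, and the connection setup part (96 s) is several hundred times the transmission part (0.17 s).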