Re: Possible reasons for lost connection after DATA
Hi Viktor, Am 11.09.2014 um 16:04 schrieb Viktor Dukhovni: Your PCAP files should demonstrate repeated retransmission of data, are the ACKs you're sending confirming receipt of packets that are sent repeatedly? In that case your ACKs are getting lost? Is there a sequence number gap in the data received from the server? In that case the remote server's data is getting lost. Does the capture confirm that window scaling is not in use? ... I've now collected a big enough data sample (I wanted to get more data from different hosts and such). What I am seeing is this: - window scaling is turned off - there's two cases: 1. I am simply not receiving any data anymore after a few packets, also no retransmissions (up to that point, everything seems normal, no sequence number gaps, no missing ACKs); after 20mins or so of complete silence, I receive a RST and the connection is killed. That is the less common case. That looks like the connection is suddenly completely dead and the remote hosts just keeps sending RST until one of them happens to go through after all. 2. The connection starts out OK (first few packets all OK), then I start receiving packets out of order or not at all. I then get a lot of retransmissions of packets that I didn't ACK because I never received them in the first place (that's the majority of cases). This goes on until the remote host is fed up and kills the connection. So that looks like this is an external problem I can't fix by changing my config... Regards, Sean
Re: Possible reasons for lost connection after DATA
Hi Wietse, Am 11.09.2014 um 17:10 schrieb Wietse Venema: That increases my suspicion of a data-dependent error - some marginal cable/switch/router, perhaps some middle box with a memory bit error that requires a power cycle to clear the problem. If the problem is caused by crosstalk defect, then only physical replacement will solve it. If it is a hardware error on my server (NIC, cable, anything on the physical layer...), wouldn't I be seeing CRC errors or something? Shouldn't I be seeing something i.e. in the ifconfig error counters or the like? If it's somewhere else in the data center (router/switch), then there's really no way for me to know. I could contact support, but that request would be much to vague to be any useful (You've probably got some broken hardware somewhere...) and they probably wouldn't care anyway unless more customers were affected. If it's a middle box somewhere along the way, that's even worse. Even more different people potentially involved... Try power cycling. Did that, no change. So, the plan now is to just sit it out, keeping an eye on things... Regards, Sean
Re: Possible reasons for lost connection after DATA
Hi Hannes, Am 11.09.2014 um 20:48 schrieb Hannes Erven: I remember a possibly similar situation back in 2008... the culprit was a not-fully-up-to-date Cisco ASA firewall that corrupted TCP SACK fields and hence had the remote site send RSET. Anyways on our end the connection seemed to starve, just as you describe it. We detected that by comparing tcpdumps from both affected ends. Of course we had been lucky enough to have that happen with a business partner with competent IT people who we got a hold of, spotted the problem and also temporarily switched the feature off on their side to prove that this actually is the problem. A firmware upgrade on my client's firewall then fixed the issue. With a server hosted somewhere and incoming connections from big clusters, you might not be as lucky as that... Yup. Looks like I'll just have to sit it out. This is just a small, private, low-traffic server, it's not like anyone at Amazon cares that I have problems. ;) And even if they did, I have neither the know-how nor the time and resources to do anything useful to fix it. I'll just keep my eyes open to see if it gets any worse and recommend my users have their Amazon and newsletter stuff sent to other Email addresses. The advantage of being small is that that really is a feasible option. :) Regards, Sean
Re: Possible reasons for lost connection after DATA
Hi Mark, Am 11.09.2014 um 22:59 schrieb L. Mark Stone: Any chance there is a UTM device in the email stream? Possible, but I wouldn't know. This is a rented rootserver in some data center. I don't know their topology, and they probably wouldn't tell me even if I asked. We see lots of these errors when our SonicWalls do an RBL lookup, don't like the data in the email stream etc. The SonicWalls then just drop the connection and Postfix logs the drop. I'll contact support and ask, that won't hurt. Regards, Sean
Re: Possible reasons for lost connection after DATA
On Fri, Sep 12, 2014 at 10:36:51AM +0200, Sean Durkin wrote: If it's a middle box somewhere along the way, that's even worse. Even more different people potentially involved... I would rent a backup MX server (deploy identical anti-spam policies, and lists of valid recipients, ...) at a different site, which relays mail to your primary. If no problems are not observed at that server, make it the primary (cut over) and stop paying for the original server. Let the vendor know that there is a network problem at the present location. You can refer them to this thread. http://archives.neohapsis.com/archives/postfix/2014-09/thread.html#118 It should be in their interest to fix the problem. They can deploy network sniffers upstream from your machine and work with you to find out what's really happening. Unless the problem is an NSL mandated network sniffer upstream of your machine. :-) -- Viktor.
Re: Possible reasons for lost connection after DATA
Hello Wietse, Am 10.09.2014 um 21:52 schrieb Wietse Venema: Slow performance is typical for TCP window scaling problems. Have you tried to turn it off in your kernel? Yes, Viktor suggested that also and I tried it. It does not make a difference, the problem persists. Regards, Sean
Re: Possible reasons for lost connection after DATA
Sean Durkin: Hello Wietse, Am 10.09.2014 um 21:52 schrieb Wietse Venema: Slow performance is typical for TCP window scaling problems. Have you tried to turn it off in your kernel? Yes, Viktor suggested that also and I tried it. It does not make a difference, the problem persists. What is the distribution of DATA sizes before failure? In your example I see numbers around 3kB, 9kB, 12kB. Some failures are triggered by packet content, and may be replaced only by replacing hardware that operates marginally. Does the problem go away when you - Replace the server (either the network card or the whole box) - Replace the cable that connects the server to the network switch - Replace the network switch that the server is plugged into. - Replace the cable that connects the switch to the router - Replace the router - And so on... If you think this is a stupid idea, then you haven't been around long enough. Wietse
Re: Possible reasons for lost connection after DATA
Hi Viktor, Am 10.09.2014 um 23:03 schrieb Viktor Dukhovni: This trace has an insane level of debugging turned on, to the point that syslogd is overwhelmed and is losing messages. PLEASE DISABLE ALL VERBOSE logging. NO -v options in master.cf, NO debug_peer_list, ... Yes, sorry, I cranked up the debug level, since normal logging looks like this: Sep 11 09:43:31 mail postfix/smtpd[25170]: connect from mail18-21.srv2.de[193.169.181.21] Sep 11 09:43:31 mail postfix/smtpd[25170]: 2C076C4026A: client=mail18-21.srv2.de[193.169.181.21] Sep 11 09:46:33 mail postfix/smtpd[25170]: lost connection after DATA (33290 bytes) from mail18-21.srv2.de[193.169.181.21] Sep 11 09:46:33 mail postfix/smtpd[25170]: disconnect from mail18-21.srv2.de[193.169.181.21] ... Sep 11 10:10:59 mail postfix/smtpd[25537]: connect from quattuorocto.psi.cust-cluster.com[195.140.187.48] Sep 11 10:10:59 mail postfix/smtpd[25537]: 8736FC40A7D: client=quattuorocto.psi.cust-cluster.com[195.140.187.48] Sep 11 10:36:44 mail postfix/smtpd[25537]: lost connection after DATA (36809 bytes) from quattuorocto.psi.cust-cluster.com[195.140.187.48] Sep 11 10:36:44 mail postfix/smtpd[25537]: disconnect from quattuorocto.psi.cust-cluster.com[195.140.187.48] .. Sep 11 10:38:48 mail postfix/smtpd[25913]: connect from smtp-out-127-108.amazon.com[176.32.127.108] Sep 11 10:38:49 mail postfix/smtpd[25913]: 2558DC40458: client=smtp-out-127-108.amazon.com[176.32.127.108] Sep 11 10:41:01 mail postfix/smtpd[25913]: lost connection after DATA (17511 bytes) from smtp-out-127-108.amazon.com[176.32.127.108] Sep 11 10:41:01 mail postfix/smtpd[25913]: disconnect from smtp-out-127-108.amazon.com[176.32.127.108] I didn't think that info alone was particularly useful... Please make sure that the /dev/log syslog socket is a dgram not a stream socket, and that mail logging is not synchronous. Logging is not synchronous, the socket is a datagram socket (it has all been set up that way all along). No change, still the same problem, see above. Meanwhile, I've managed to record a tcpdump of such a failed session. What exactly am I looking for there? I don't see anything out of the ordinary, except increasing delays between received packets from the external host, until the host sends a RST. It seems I simply do not receive any packets. The ones I get are immediately ACK'd, but then there's seconds and later minutes until the next one even arrives, until finally the remote host gives up and terminates the connection. I'll try to get more dumps for comparison, including some from hosts that have no problems delivering. There's no packet filtering or rate limiting on port 25, at least not on my machine. The hosting provider might have something there, I'd have to ask them... Regards, Sean
Re: Possible reasons for lost connection after DATA
On Thu, Sep 11, 2014 at 02:36:51PM +0200, Sean Durkin wrote: PLEASE DISABLE ALL VERBOSE logging. NO -v options in master.cf, NO debug_peer_list, Yes, sorry, I cranked up the debug level, since normal logging looks like this: Sep 11 09:43:31 mail postfix/smtpd[25170]: connect from mail18-21.srv2.de[193.169.181.21] Sep 11 09:43:31 mail postfix/smtpd[25170]: 2C076C4026A: client=mail18-21.srv2.de[193.169.181.21] Sep 11 09:46:33 mail postfix/smtpd[25170]: lost connection after DATA (33290 bytes) from mail18-21.srv2.de[193.169.181.21] Sep 11 09:46:33 mail postfix/smtpd[25170]: disconnect from mail18-21.srv2.de[193.169.181.21] That's sufficient. It shows you're likely not using TLS here, and the time beetween message start and connection loss. The number of samples is rather small now. I would expect the session duration for each sending host to be essentially constant over multiple deliveries (equal to the remote machine's TCP timeout). Possibilities include a broken network interface somewhere or a bad cable that corrupts IP or TCP packet headers given specific input patterns. If the problem is with the message payload, you could try enabling inbound TLS, perhaps these sending servers support it. Don't recall whether you already have TLS. If the problem is not with the payload, then TLS won't make any difference (some hosts will still fail even after TLS). I didn't think that info alone was particularly useful... It is sufficient, and the verbose logs just add noise. Meanwhile, I've managed to record a tcpdump of such a failed session. What exactly am I looking for there? Retransmissions from the sender of data with the same sequence number... Post tcpdump output (without packet content is fine), containing packets from just a single failed session. There's no packet filtering or rate limiting on port 25, at least not on my machine. The hosting provider might have something there, I'd have to ask them... They probably have middle boxes, which might be the cause of the problem. -- Viktor.
Re: Possible reasons for lost connection after DATA
Hi Wietse, Am 11.09.2014 um 13:49 schrieb Wietse Venema: What is the distribution of DATA sizes before failure? In your example I see numbers around 3kB, 9kB, 12kB. At the moment, I see these sizes: - always exactly 17511 bytes from smtp-out-127-*.amazon.com (today, seems to be only 3 different hosts trying) - always exactly 49116 bytes from *.psi.cust-cluster.com (I've seen about 60 different hosts from there today) - always exactly 33290 bytes from mail18-*.srv2.de (about a dozen different hosts) It seems those are always the same 3 messages being re-tried constantly (when I look at them in the incoming queue folder, it's the same recipient and sender and the same message-ID, as far as I can tell). I have problems only with messages from these clusters, everything else seems unaffected (at least I haven't seen any lost connection messages from any other hosts as far as my logfiles go back). Yesterday I had an additional message with exactly 17441 bytes on every try before failure from the Amazon-cluster. That one was finally delivered completely early this morning, and has since disappeared from the cycle. FWIW, I have received a handful of messages from the Amazon-cluster that did not have any delays/problems yesterday and today, one of them even from one of the problematic hosts that can't deliver the other message. Some failures are triggered by packet content, and may be replaced only by replacing hardware that operates marginally. Does the problem go away when you - Replace the server (either the network card or the whole box) - Replace the cable that connects the server to the network switch - Replace the network switch that the server is plugged into. - Replace the cable that connects the switch to the router - Replace the router - And so on... If you think this is a stupid idea, then you haven't been around long enough. By no means do I think that's stupid. :) I'm only doing this server stuff for fun in my spare time, but my real job is in microelectronics and hardware, so I've had my share of mysterious and seemingly unexplainable stuff (ISI, crosstalk, low-frequency jitter, ground bounce, ESD-induced phenomena, you know the drill...). Problem is that this box is a rented root server in a data center somwhere, so I don't have access to the hardware to try any of that. I can contact support, but they of course charge you for everything they do, and as long as I haven't ruled out that the reason is just some stupid configuration mistake on my part (or a routing/filtering issue at my hosting provider, or Amazon, or...), I don't want to start replacing hardware, obviously... Regards, Sean
Re: Possible reasons for lost connection after DATA
On Thu, Sep 11, 2014 at 03:25:57PM +0200, Sean Durkin wrote: I can contact support, but they of course charge you for everything they do, and as long as I haven't ruled out that the reason is just some stupid configuration mistake on my part (or a routing/filtering issue at my hosting provider, or Amazon, or...), I don't want to start replacing hardware, obviously... The Postfix configuration has no impact on the TCP layer, beyond optionally specifying the TCP window size. Since it is the TCP layer that fails, the problem is not related to the Postfix configuration. Your PCAP files should demonstrate repeated retransmission of data, are the ACKs you're sending confirming receipt of packets that are sent repeatedly? In that case your ACKs are getting lost? Is there a sequence number gap in the data received from the server? In that case the remote server's data is getting lost. Does the capture confirm that window scaling is not in use? ... -- Viktor.
Re: Possible reasons for lost connection after DATA
Sean Durkin: Hi Wietse, Am 11.09.2014 um 13:49 schrieb Wietse Venema: What is the distribution of DATA sizes before failure? In your example I see numbers around 3kB, 9kB, 12kB. At the moment, I see these sizes: - always exactly 17511 bytes from smtp-out-127-*.amazon.com (today, seems to be only 3 different hosts trying) - always exactly 49116 bytes from *.psi.cust-cluster.com (I've seen about 60 different hosts from there today) - always exactly 33290 bytes from mail18-*.srv2.de (about a dozen different hosts) It seems those are always the same 3 messages being re-tried constantly (when I look at them in the incoming queue folder, it's the same recipient and sender and the same message-ID, as far as I can tell). I have problems only with messages from these clusters, everything else seems unaffected (at least I haven't seen any lost connection messages from any other hosts as far as my logfiles go back). Yesterday I had an additional message with exactly 17441 bytes on every try before failure from the Amazon-cluster. That one was finally delivered completely early this morning, and has since disappeared from the cycle. That increases my suspicion of a data-dependent error - some marginal cable/switch/router, perhaps some middle box with a memory bit error that requires a power cycle to clear the problem. If the problem is caused by crosstalk defect, then only physical replacement will solve it. Problem is that this box is a rented root server in a data center somwhere, so I don't have access to the hardware to try any of that. I can contact support, but they of course charge you for everything they do, and as long as I haven't ruled out that the reason is just some stupid configuration mistake on my part (or a routing/filtering issue at my hosting provider, or Amazon, or...), I don't want to start replacing hardware, obviously... Try power cycling. Wietse
Re: Possible reasons for lost connection after DATA
Hi Sean, Meanwhile, I've managed to record a tcpdump of such a failed session. What exactly am I looking for there? I remember a possibly similar situation back in 2008... the culprit was a not-fully-up-to-date Cisco ASA firewall that corrupted TCP SACK fields and hence had the remote site send RSET. Anyways on our end the connection seemed to starve, just as you describe it. We detected that by comparing tcpdumps from both affected ends. Of course we had been lucky enough to have that happen with a business partner with competent IT people who we got a hold of, spotted the problem and also temporarily switched the feature off on their side to prove that this actually is the problem. A firmware upgrade on my client's firewall then fixed the issue. With a server hosted somewhere and incoming connections from big clusters, you might not be as lucky as that... best regards, -hannes
Re: Possible reasons for lost connection after DATA
Sean Durkin: Meanwhile, I've managed to record a tcpdump of such a failed session. What exactly am I looking for there? - The receiving host's window announcement in the tcp handshake and in subsequent ACKs. - Whether there is a gap in the sender packet sequence numbers as seen by the receiving host. Such a gap means that a particular packet is being dropped. Just to bore you with a few examples of bad middleboxes: - Shortly after the first Postfix release there was a problem with traffic corruption due to a buggy middlebox (a Packeteer traffic shaper). The error had a very distinct signature. - For many years, there were problems with CISCO PIX firewalls that inspected SMTP traffic but failed to properly handle the case that CRLF.CRLF happened to fall on a packet boundary. - http://www.arschkrebs.de/postfix/postfix_cisco_pix_bugs.shtml has other examples where CISCO PIX/ASA firewalls will mis-handle SMTP traffic in various ways. In your case, you may have to collaborate with someone who is willing to send large amounts of random email; hopefully some messages will trigger the bug, and then the sender and receiver can compare tcpdump recordings. Wietse
Re: Possible reasons for lost connection after DATA
Any chance there is a UTM device in the email stream? We see lots of these errors when our SonicWalls do an RBL lookup, don't like the data in the email stream etc. The SonicWalls then just drop the connection and Postfix logs the drop. Hope that helps, Mark
Possible reasons for lost connection after DATA
Hello, some of my users were complaining about losing incoming mail, namely Amazon shipping notifications, newsletters and such things that they were absolutely sure were sent out, but never reached their inbox. After doing some digging, increasing log verbosity and such, I found a lot of this: [... snip ...] Sep 10 00:06:37 mail postfix/smtpd[23095]: lost connection after DATA (17511 bytes) from smtp-out-127-108.amazon.com[176.32.127.108] Sep 10 00:06:48 mail postfix/smtpd[23111]: lost connection after DATA (22788 bytes) from mail18-92.srv2.de[193.169.181.92] Sep 10 00:13:35 mail postfix/smtpd[23348]: lost connection after DATA (17441 bytes) from smtp-out-127-108.amazon.com[176.32.127.108] Sep 10 00:27:49 mail postfix/smtpd[23454]: lost connection after DATA (22788 bytes) from mail18-97.srv2.de[193.169.181.97] Sep 10 00:31:03 mail postfix/smtpd[23103]: lost connection after DATA (49116 bytes) from quinqueunus.psi.cust-cluster.com[195.140.187.51] Sep 10 00:48:46 mail postfix/smtpd[23890]: lost connection after DATA (22788 bytes) from mail18-98.srv2.de[193.169.181.98] Sep 10 00:51:53 mail postfix/smtpd[23564]: lost connection after DATA (49116 bytes) from quattuorocto.psi.cust-cluster.com[195.140.187.48] Sep 10 01:09:16 mail postfix/smtpd[24565]: lost connection after DATA (17511 bytes) from smtp-out-127-108.amazon.com[176.32.127.108] Sep 10 01:09:44 mail postfix/smtpd[24290]: lost connection after DATA (22788 bytes) from mail18-99.srv2.de[193.169.181.99] Sep 10 01:16:14 mail postfix/smtpd[24674]: lost connection after DATA (17441 bytes) from smtp-out-127-107.amazon.com[176.32.127.107] Sep 10 01:30:44 mail postfix/smtpd[24782]: lost connection after DATA (22788 bytes) from mail18-100.srv2.de[193.169.181.100] Sep 10 01:51:42 mail postfix/smtpd[25198]: lost connection after DATA (22788 bytes) from mail18-105.srv2.de[193.169.181.105] Sep 10 01:54:29 mail postfix/smtpd[24966]: lost connection after DATA (49116 bytes) from quattuorocto.psi.cust-cluster.com[195.140.187.48] Sep 10 02:12:21 mail postfix/smtpd[25784]: lost connection after DATA (17511 bytes) from smtp-out-127-107.amazon.com[176.32.127.107] Sep 10 02:12:42 mail postfix/smtpd[25656]: lost connection after DATA (22788 bytes) from mail18-106.srv2.de[193.169.181.106] Sep 10 02:20:11 mail postfix/smtpd[25892]: lost connection after DATA (17441 bytes) from smtp-out-127-108.amazon.com[176.32.127.108] Sep 10 02:33:41 mail postfix/smtpd[26077]: lost connection after DATA (22788 bytes) from mail18-107.srv2.de[193.169.181.107] Sep 10 02:54:22 mail postfix/smtpd[26178]: lost connection after DATA (49116 bytes) from quinquenulla.psi.cust-cluster.com[195.140.187.50] Sep 10 02:54:40 mail postfix/smtpd[26490]: lost connection after DATA (22788 bytes) from mail18-108.srv2.de[193.169.181.108] Sep 10 03:14:16 mail postfix/smtpd[26585]: lost connection after DATA (49116 bytes) from quinquenulla.psi.cust-cluster.com[195.140.187.50] Sep 10 03:15:39 mail postfix/smtpd[26905]: lost connection after DATA (22788 bytes) from mail18-113.srv2.de[193.169.181.113] Sep 10 03:15:52 mail postfix/smtpd[27091]: lost connection after DATA (17511 bytes) from smtp-out-127-106.amazon.com[176.32.127.106] Sep 10 03:23:15 mail postfix/smtpd[27214]: lost connection after DATA (17441 bytes) from smtp-out-127-107.amazon.com[176.32.127.107] [... snip ...] So to me that looks as if either the external SMTP server closes its connection before it is done with the entire message (the transferred size does not match the size passed through MAIL FROM: SIZE=XYZ), or the connection times out. I can see in the logs and looking at the queue directories that these messages are put in the incoming queue by cleanup, are then found but skipped by qmgr (probably since they are not finished); they lurk in the incoming queue for awhile and disappear about the time the lost connection message is put in the logs. So this points to a timeout. The weird thing is that the data sizes are always the same for the same message-id being delivered, even if it is delivered via different servers from a cluster. If it were a timeout, network problem or such, I'd expect a more or less random value for the received data size, not always exactly the same. This seems to be a new problem (meaning I just recently got aware of it; I don't know when it started, but I do know everything was working fine for years before). This does not seem to be a problem of the sepcific hosts above; in between, I've been getting messages from the same hosts successfully. It seems that some messages go through on the first try, other messages from the same hosts are always losing connection, they never are delivered completely, ultimately resulting in the external host giving up; the message is then lost, and the user never knows about it. The first question is: Can I rule out it's my fault? I don't have traffic shaping or ICMP blocking running on that host, which maybe could cause something like
Re: Possible reasons for lost connection after DATA
Am 10.09.2014 um 09:56 schrieb Sean Durkin: The first question is: Can I rule out it's my fault? have you changed anything last days/month upgrades/updates software hardware ? please send you postfix config , search list archive lost connection after DATA Best Regards MfG Robert Schetterer -- [*] sys4 AG http://sys4.de, +49 (89) 30 90 46 64 Franziskanerstraße 15, 81669 München Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263 Vorstand: Patrick Ben Koetter, Marc Schiffbauer Aufsichtsratsvorsitzender: Florian Kirstein
Re: Possible reasons for lost connection after DATA
On Wed, Sep 10, 2014 at 09:56:48AM +0200, Sean Durkin wrote: Some of my users were complaining about losing incoming mail, namely Amazon shipping notifications, newsletters and such things that they were absolutely sure were sent out, but never reached their inbox. After doing some digging, increasing log verbosity and such, I found a lot of this: Have you tried disabling TCP window scaling? It might be confusing some middle-box (firewall, NAT device, ...) on path between the remote systems and your MTA. [... snip ...] Sep 10 00:06:37 mail postfix/smtpd[23095]: lost connection after DATA (17511 bytes) from smtp-out-127-108.amazon.com[176.32.127.108] Post the hostname/IP address of the receving system. Capture and examine a tcpdump recording of mail from one of the problem senders. Any sign of retransmission by the sender? For at least one such session, post all related messages from the postfix/smtpd[pid] that occur between connect from and disconnect from. -- Viktor.
Re: Possible reasons for lost connection after DATA
Hi Robert, Am 10.09.2014 um 10:11 schrieb Robert Schetterer: Am 10.09.2014 um 09:56 schrieb Sean Durkin: The first question is: Can I rule out it's my fault? have you changed anything last days/month upgrades/updates software hardware ? Hardware is unchanged. The Ubuntu postfix package was upgraded in August (2.9.6-1~12.04.2), but this problem seems to have started before that, looking at older logs. Except that, I don't see any updates directly related to the mail system in the past half year. There's of course other system/security updates, but how should I know which of these might possibly be responsible? I haven't changed the basic Postfix configuration lately. I did add OpenDKIM a few months back, but I removed that a few days ago to rule out that's the problem. I also removed Spamassassin, any RBLs and Postgrey, which I normally have running there; that does not seem to make a difference. So I'm now back to a very basic Postfix conf, but the problem persists. please send you postfix config , Anonymized postfinger-output is attached below. search list archive lost connection after DATA I did that, I couldn't find anything that really applies in my case... most problems there are either related to DATA size 0 or to weird MTU issues. Mostly this seems to happen for connections from spam bots or misconfigured clients and people tell you you should just ignore it, but that doesn't really apply here. I've tried getting a TCP dump of such an SMTP session, but since most of the interesting mail is coming from server clusters and the external hosts trying to deliver mail keep changing I'm still waiting to catch a good one... Regards, Sean Here, as promised, postfinger-output: --System Parameters-- mail_version = 2.9.6 hostname = mail uname = Linux mail 3.2.0-65-virtual #99-Ubuntu SMP Fri Jul 4 21:23:03 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux --Packaging information-- looks like this postfix comes from deb package: postfix-2.9.6-1~12.04.2 --main.cf non-default parameters-- alias_maps = $alias_database append_dot_mydomain = no biff = no broken_sasl_auth_clients = yes debug_peer_list = amazon.com, srv2.de, psi.cust-cluster.com, outbound.protection.outlook.com delay_warning_time = 4h disable_vrfy_command = yes html_directory = /usr/share/doc/postfix/html mailbox_size_limit = 0 mailbox_transport = lmtp:unix:/var/run/cyrus/socket/lmtp message_size_limit = 262144000 mydestination = localhost, localhost.$mydomain, $mydomain, mail.$mydomain, mysql:/etc/postfix/mysql-mydestination.cf myhostname = my.host.name mynetworks = 127.0.0.0/8, ip.add.re.ss myorigin = /etc/mailname proxy_interfaces = ip.add.re.ss recipient_delimiter = + sender_canonical_maps = mysql:/etc/postfix/mysql-canonical.cf smtp_destination_concurrency_limit = 1 smtp_destination_rate_delay = 1s smtpd_helo_required = yes smtpd_recipient_restrictions = permit_mynetworks, permit_sasl_authenticated, reject_unauth_pipelining, reject_unauth_destination, reject_unlisted_recipient reject_invalid_hostname, reject_non_fqdn_hostname, reject_non_fqdn_sender, reject_non_fqdn_recipient, reject_unknown_sender_domain, reject_unknown_recipient_domain, smtpd_sasl_auth_enable = yes smtpd_sender_restrictions = permit_sasl_authenticated, permit_mynetworks, reject_unauth_destination, reject_non_fqdn_sender, reject_unknown_sender_domain reject_unknown_recipient_domain, reject_unauth_pipelining smtpd_tls_auth_only = yes smtpd_tls_CAfile = /etc/postfix/ssl/ca.pem smtpd_tls_cert_file = /etc/postfix/ssl/my_cert.crt smtpd_tls_dh1024_param_file = /etc/postfix/ssl/dh_2048.pem smtpd_tls_dh512_param_file = /etc/postfix/ssl/dh_512.pem smtpd_tls_key_file = /etc/postfix/ssl/my_key.key smtpd_tls_protocols = !SSLv2 smtpd_tls_received_header = yes smtpd_tls_session_cache_database = btree:/var/lib/postfix/smtpd_scache smtpd_use_tls = yes smtp_tls_security_level = may smtp_tls_session_cache_database = btree:/var/lib/postfix/smtp_scache strict_rfc821_envelopes = yes tls_preempt_cipherlist = yes virtual_alias_maps = mysql:/etc/postfix/mysql-virtual.cf --master.cf-- smtp inet n - y -- smtpd submission inet n - y - - smtpd -o smtpd_etrn_restrictions=reject -o smtpd_enforce_tls=yes -o smtpd_sasl_auth_enable=yes -o smtpd_client_restrictions=permit_sasl_authenticated,reject smtpsinet n - y - - smtpd -o smtpd_etrn_restrictions=reject -o smtpd_tls_wrappermode=yes -o smtpd_sasl_auth_enable=yes pickupfifo n - - 60 1 pickup cleanup unix n - - - 0 cleanup qmgr fifo n - n 100 1 qmgr tlsmgrunix - - - 1000? 1 tlsmgr rewrite unix - - - - - trivial-rewrite bounceunix - - - - 0 bounce defer unix - -
Re: Possible reasons for lost connection after DATA
Hi Viktor, Am 10.09.2014 um 16:19 schrieb Viktor Dukhovni: Have you tried disabling TCP window scaling? It might be confusing some middle-box (firewall, NAT device, ...) on path between the remote systems and your MTA. I would not have thought of that... I've tried that now, but it does not seem to help. Post the hostname/IP address of the receving system. mail.tuxroot.de Capture and examine a tcpdump recording of mail from one of the problem senders. Any sign of retransmission by the sender? I'm trying to get a good dump and will post results once I get one. Not that easy since the external hosts keep changing all the time. All mail affected comes from mass mailers that use server clusters, so I keep getting those messages from lots of different remote hosts. I'm waiting for it to happen from one of the hosts I've seen before. Retransmission is tried numerous times, but for every retransmission the lost connection message is the same (identical number of bytes), as far as I can tell. That's one thing that puzzles me... So e.g. a message is delivered twice and each time the connection is lost after exactly 17441 bytes, even if it's different remote hosts trying, that's kind of odd. For at least one such session, post all related messages from the postfix/smtpd[pid] that occur between connect from and disconnect from. Here's one: http://pastebin.com/twb3Z8Eg And this seems to be the same message being redelivered later, from a different host, with the same result (connection lost after exactly 17441 bytes): http://pastebin.com/Qihbjz3w What I do notice there is that in fact the connection seems to be *very* slow. In the above example, the whole process takes several minutes. I don't have any throughput or network speed issues with other hosts, though. I've tried sending mail from Gmail, Yahoo, my workplace, my former university, GMX, whatever; everything goes through on the first attempt each and every time, and quickly. But it seems it is always slow for a few hosts. Regards, Sean
Re: Possible reasons for lost connection after DATA
Sean Durkin: [ Charset windows-1252 converted... ] Hi Viktor, Am 10.09.2014 um 16:19 schrieb Viktor Dukhovni: Have you tried disabling TCP window scaling? It might be confusing some middle-box (firewall, NAT device, ...) on path between the remote systems and your MTA. I would not have thought of that... I've tried that now, but it does not seem to help. Post the hostname/IP address of the receving system. mail.tuxroot.de Capture and examine a tcpdump recording of mail from one of the problem senders. Any sign of retransmission by the sender? I'm trying to get a good dump and will post results once I get one. Not that easy since the external hosts keep changing all the time. All mail affected comes from mass mailers that use server clusters, so I keep getting those messages from lots of different remote hosts. I'm waiting for it to happen from one of the hosts I've seen before. Retransmission is tried numerous times, but for every retransmission the lost connection message is the same (identical number of bytes), as far as I can tell. That's one thing that puzzles me... So e.g. a message is delivered twice and each time the connection is lost after exactly 17441 bytes, even if it's different remote hosts trying, that's kind of odd. No, it means the same problem is happening. Same error, same symptom. What I do notice there is that in fact the connection seems to be *very* slow. In the above example, the whole process takes several minutes. I don't have any throughput or network speed issues with other hosts, though. I've tried sending mail from Gmail, Yahoo, Slow performance is typical for TCP window scaling problems. Have you tried to turn it off in your kernel? # sysctl -w net.ipv4.tcp_window_scaling=0 To make it permanent: # echo 'net.ipv4.tcp_window_scaling = 0' /etc/sysctl.conf Wietse
Re: Possible reasons for lost connection after DATA
On Wed, Sep 10, 2014 at 09:19:58PM +0200, Sean Durkin wrote: For at least one such session, post all related messages from the postfix/smtpd[pid] that occur between connect from and disconnect from. Here's one: http://pastebin.com/twb3Z8Eg This trace has an insane level of debugging turned on, to the point that syslogd is overwhelmed and is losing messages. PLEASE DISABLE ALL VERBOSE logging. NO -v options in master.cf, NO debug_peer_list, ... Please make sure that the /dev/log syslog socket is a dgram not a stream socket, and that mail logging is not synchronous. Then if the problem persists, report just normal Postfix logging, not the flood of noise from verbose logging. -- Viktor.