Hello,

some of my users were complaining about losing incoming mail, namely Amazon 
shipping notifications, newsletters and such things that they were absolutely 
sure were sent out, but never reached their inbox. After doing some digging, 
increasing log verbosity and such, I found a lot of this:

[... snip ...]
Sep 10 00:06:37 mail postfix/smtpd[23095]: lost connection after DATA (17511 
bytes) from smtp-out-127-108.amazon.com[176.32.127.108]
Sep 10 00:06:48 mail postfix/smtpd[23111]: lost connection after DATA (22788 
bytes) from mail18-92.srv2.de[193.169.181.92]
Sep 10 00:13:35 mail postfix/smtpd[23348]: lost connection after DATA (17441 
bytes) from smtp-out-127-108.amazon.com[176.32.127.108]
Sep 10 00:27:49 mail postfix/smtpd[23454]: lost connection after DATA (22788 
bytes) from mail18-97.srv2.de[193.169.181.97]
Sep 10 00:31:03 mail postfix/smtpd[23103]: lost connection after DATA (49116 
bytes) from quinqueunus.psi.cust-cluster.com[195.140.187.51]
Sep 10 00:48:46 mail postfix/smtpd[23890]: lost connection after DATA (22788 
bytes) from mail18-98.srv2.de[193.169.181.98]
Sep 10 00:51:53 mail postfix/smtpd[23564]: lost connection after DATA (49116 
bytes) from quattuorocto.psi.cust-cluster.com[195.140.187.48]
Sep 10 01:09:16 mail postfix/smtpd[24565]: lost connection after DATA (17511 
bytes) from smtp-out-127-108.amazon.com[176.32.127.108]
Sep 10 01:09:44 mail postfix/smtpd[24290]: lost connection after DATA (22788 
bytes) from mail18-99.srv2.de[193.169.181.99]
Sep 10 01:16:14 mail postfix/smtpd[24674]: lost connection after DATA (17441 
bytes) from smtp-out-127-107.amazon.com[176.32.127.107]
Sep 10 01:30:44 mail postfix/smtpd[24782]: lost connection after DATA (22788 
bytes) from mail18-100.srv2.de[193.169.181.100]
Sep 10 01:51:42 mail postfix/smtpd[25198]: lost connection after DATA (22788 
bytes) from mail18-105.srv2.de[193.169.181.105]
Sep 10 01:54:29 mail postfix/smtpd[24966]: lost connection after DATA (49116 
bytes) from quattuorocto.psi.cust-cluster.com[195.140.187.48]
Sep 10 02:12:21 mail postfix/smtpd[25784]: lost connection after DATA (17511 
bytes) from smtp-out-127-107.amazon.com[176.32.127.107]
Sep 10 02:12:42 mail postfix/smtpd[25656]: lost connection after DATA (22788 
bytes) from mail18-106.srv2.de[193.169.181.106]
Sep 10 02:20:11 mail postfix/smtpd[25892]: lost connection after DATA (17441 
bytes) from smtp-out-127-108.amazon.com[176.32.127.108]
Sep 10 02:33:41 mail postfix/smtpd[26077]: lost connection after DATA (22788 
bytes) from mail18-107.srv2.de[193.169.181.107]
Sep 10 02:54:22 mail postfix/smtpd[26178]: lost connection after DATA (49116 
bytes) from quinquenulla.psi.cust-cluster.com[195.140.187.50]
Sep 10 02:54:40 mail postfix/smtpd[26490]: lost connection after DATA (22788 
bytes) from mail18-108.srv2.de[193.169.181.108]
Sep 10 03:14:16 mail postfix/smtpd[26585]: lost connection after DATA (49116 
bytes) from quinquenulla.psi.cust-cluster.com[195.140.187.50]
Sep 10 03:15:39 mail postfix/smtpd[26905]: lost connection after DATA (22788 
bytes) from mail18-113.srv2.de[193.169.181.113]
Sep 10 03:15:52 mail postfix/smtpd[27091]: lost connection after DATA (17511 
bytes) from smtp-out-127-106.amazon.com[176.32.127.106]
Sep 10 03:23:15 mail postfix/smtpd[27214]: lost connection after DATA (17441 
bytes) from smtp-out-127-107.amazon.com[176.32.127.107]
[... snip ...]

So to me that looks as if either the external SMTP server closes its connection 
before it is done with the entire message (the transferred size does not match 
the size passed through "MAIL FROM: SIZE=XYZ"), or the connection times out. I 
can see in the logs and looking at the queue directories that these messages 
are put in the incoming queue by cleanup, are then found but skipped by qmgr 
(probably since they are not finished); they lurk in the incoming queue for 
awhile and disappear about the time the "lost connection" message is put in the 
logs. So this points to a timeout.

The weird thing is that the data sizes are always the same for the same 
message-id being delivered, even if it is delivered via different servers from 
a cluster. If it were a timeout, network problem or such, I'd expect a more or 
less random value for the received data size, not always exactly the same.

This seems to be a "new" problem (meaning I just recently got aware of it; I 
don't know when it started, but I do know everything was working fine for years 
before). This does not seem to be a problem of the sepcific hosts above; in 
between, I've been getting messages from the same hosts successfully. It seems 
that some messages go through on the first try, other messages from the same 
hosts are always losing connection, they never are delivered completely, 
ultimately resulting in the external host giving up; the message is then lost, 
and the user never knows about it.

The first question is:
Can I rule out it's my fault? I don't have traffic shaping or ICMP blocking 
running on that host, which maybe could cause something like that. Doing a 
traceroute to the hosts above does not show anything out of the ordinary, so it 
doesn't look like a routing problem, either. Connection times are OK as well, 
connections should not be so slow they time out.

The second question is:
Is there anything I can do on my end? Since it looks as if I simply do not get 
a complete delivery, the sender just stops sending after a certain number of 
bytes, I don't see what I could do...

Regards,
Sean

Reply via email to