Re: Possible reasons for lost connection after DATA

2014-09-12 Thread Sean Durkin
Hi Viktor,

Am 11.09.2014 um 16:04 schrieb Viktor Dukhovni:
 Your PCAP files should demonstrate repeated retransmission of data,
 are the ACKs you're sending confirming receipt of packets that are
 sent repeatedly?  In that case your ACKs are getting lost?   Is
 there a sequence number gap in the data received from the server?
 In that case the remote server's data is getting lost.  Does the
 capture confirm that window scaling is not in use? ...

I've now collected a big enough data sample (I wanted to get more data from 
different hosts and such).

What I am seeing is this:
- window scaling is turned off
- there's two cases:
 1. I am simply not receiving any data anymore after a few packets, also no 
retransmissions (up to that point, everything seems normal, no sequence number 
gaps, no missing ACKs); after 20mins or so of complete silence, I receive a RST 
and the connection is killed. That is the less common case. That looks like the 
connection is suddenly completely dead and the remote hosts just keeps sending 
RST until one of them happens to go through after all.
 2. The connection starts out OK (first few packets all OK), then I start 
receiving packets out of order or not at all. I then get a lot of 
retransmissions of packets that I didn't ACK because I never received them in 
the first place (that's the majority of cases). This goes on until the remote 
host is fed up and kills the connection.

So that looks like this is an external problem I can't fix by changing my 
config...

Regards,
Sean


Re: Possible reasons for lost connection after DATA

2014-09-12 Thread Sean Durkin
Hi Wietse,

Am 11.09.2014 um 17:10 schrieb Wietse Venema:
 That increases my suspicion of a data-dependent error - some marginal
 cable/switch/router, perhaps some middle box with a memory bit error
 that requires a power cycle to clear the problem. If the problem is 
 caused by crosstalk defect, then only physical replacement will
 solve it.

If it is a hardware error on my server (NIC, cable, anything on the physical 
layer...), wouldn't I be seeing CRC errors or something?
Shouldn't I be seeing something i.e. in the ifconfig error counters or the like?

If it's somewhere else in the data center (router/switch), then there's really 
no way for me to know.
I could contact support, but that request would be much to vague to be any 
useful (You've probably got some broken hardware somewhere...) and they 
probably wouldn't care anyway unless more customers were affected.

If it's a middle box somewhere along the way, that's even worse. Even more 
different people potentially involved...
 
 Try power cycling.

Did that, no change.

So, the plan now is to just sit it out, keeping an eye on things...

Regards,
Sean


Re: Possible reasons for lost connection after DATA

2014-09-12 Thread Sean Durkin
Hi Hannes,

Am 11.09.2014 um 20:48 schrieb Hannes Erven:
 I remember a possibly similar situation back in 2008... the culprit was a 
 not-fully-up-to-date Cisco ASA firewall that corrupted TCP SACK fields and 
 hence had the remote site send RSET.
 Anyways on our end the connection seemed to starve, just as you describe it.
 
 We detected that by comparing tcpdumps from both affected ends. Of course we 
 had been lucky enough to have that happen with a business partner with 
 competent IT people who we got a hold of, spotted the problem and also 
 temporarily switched the feature off on their side to prove that this 
 actually is the problem.
 A firmware upgrade on my client's firewall then fixed the issue.
 
 With a server hosted somewhere and incoming connections from big clusters, 
 you might not be as lucky as that...

Yup. Looks like I'll just have to sit it out. This is just a small, private, 
low-traffic server, it's not like anyone at Amazon cares that I have problems. 
;)

And even if they did, I have neither the know-how nor the time and resources to 
do anything useful to fix it.

I'll just keep my eyes open to see if it gets any worse and recommend my users 
have their Amazon and newsletter stuff sent to other Email addresses. The 
advantage of being small is that that really is a feasible option. :)

Regards,
Sean


Re: Possible reasons for lost connection after DATA

2014-09-12 Thread Sean Durkin
Hi Mark,

Am 11.09.2014 um 22:59 schrieb L. Mark Stone:
 Any chance there is a UTM device in the email stream?
Possible, but I wouldn't know. This is a rented rootserver in some data center. 
I don't know their topology, and they probably wouldn't tell me even if I asked.

 We see lots of these errors when our SonicWalls do an RBL lookup, don't like 
 the data in the email stream etc.  The SonicWalls then just drop the 
 connection and Postfix logs the drop.
I'll contact support and ask, that won't hurt.

Regards,
Sean


Re: Possible reasons for lost connection after DATA

2014-09-12 Thread Viktor Dukhovni
On Fri, Sep 12, 2014 at 10:36:51AM +0200, Sean Durkin wrote:

 If it's a middle box somewhere along the way, that's even worse.
 Even more different people potentially involved...

I would rent a backup MX server (deploy identical anti-spam policies,
and lists of valid recipients, ...) at a different site, which
relays mail to your primary.  If no problems are not observed at
that server, make it the primary (cut over) and stop paying for
the original server.

Let the vendor know that there is a network problem at the present
location.  You can refer them to this thread.

http://archives.neohapsis.com/archives/postfix/2014-09/thread.html#118

It should be in their interest to fix the problem.  They can deploy
network sniffers upstream from your machine and work with you to
find out what's really happening.  Unless the problem is an NSL
mandated network sniffer upstream of your machine. :-)

-- 
Viktor.


Re: Possible reasons for lost connection after DATA

2014-09-11 Thread Sean Durkin
Hello Wietse,

Am 10.09.2014 um 21:52 schrieb Wietse Venema:

 Slow performance is typical for TCP window scaling problems. Have
 you tried to turn it off in your kernel?

Yes, Viktor suggested that also and I tried it. It does not make a difference, 
the problem persists.

Regards,
Sean


Re: Possible reasons for lost connection after DATA

2014-09-11 Thread Wietse Venema
Sean Durkin:
 Hello Wietse,
 
 Am 10.09.2014 um 21:52 schrieb Wietse Venema:
 
  Slow performance is typical for TCP window scaling problems. Have
  you tried to turn it off in your kernel?
 
 Yes, Viktor suggested that also and I tried it. It does not make
 a difference, the problem persists.

What is the distribution of DATA sizes before failure? In your
example I see numbers around 3kB, 9kB, 12kB.

Some failures are triggered by packet content, and may be replaced
only by replacing hardware that operates marginally. Does the problem
go away when you

- Replace the server (either the network card or the whole box)

- Replace the cable that connects the server to the network switch

- Replace the network switch that the server is plugged into.

- Replace the cable that connects the switch to the router

- Replace the router

- And so on...

If you think this is a stupid idea, then you haven't been around
long enough.

Wietse


Re: Possible reasons for lost connection after DATA

2014-09-11 Thread Sean Durkin
Hi Viktor,

Am 10.09.2014 um 23:03 schrieb Viktor Dukhovni:
 This trace has an insane level of debugging turned on, to the point
 that syslogd is overwhelmed and is losing messages.  PLEASE DISABLE
 ALL VERBOSE logging.  NO -v options in master.cf, NO debug_peer_list,
 ...
Yes, sorry, I cranked up the debug level, since normal logging looks like this:

Sep 11 09:43:31 mail postfix/smtpd[25170]: connect from 
mail18-21.srv2.de[193.169.181.21]
Sep 11 09:43:31 mail postfix/smtpd[25170]: 2C076C4026A: 
client=mail18-21.srv2.de[193.169.181.21]
Sep 11 09:46:33 mail postfix/smtpd[25170]: lost connection after DATA (33290 
bytes) from mail18-21.srv2.de[193.169.181.21]
Sep 11 09:46:33 mail postfix/smtpd[25170]: disconnect from 
mail18-21.srv2.de[193.169.181.21]
...
Sep 11 10:10:59 mail postfix/smtpd[25537]: connect from 
quattuorocto.psi.cust-cluster.com[195.140.187.48]
Sep 11 10:10:59 mail postfix/smtpd[25537]: 8736FC40A7D: 
client=quattuorocto.psi.cust-cluster.com[195.140.187.48]
Sep 11 10:36:44 mail postfix/smtpd[25537]: lost connection after DATA (36809 
bytes) from quattuorocto.psi.cust-cluster.com[195.140.187.48]
Sep 11 10:36:44 mail postfix/smtpd[25537]: disconnect from 
quattuorocto.psi.cust-cluster.com[195.140.187.48]
..
Sep 11 10:38:48 mail postfix/smtpd[25913]: connect from 
smtp-out-127-108.amazon.com[176.32.127.108]
Sep 11 10:38:49 mail postfix/smtpd[25913]: 2558DC40458: 
client=smtp-out-127-108.amazon.com[176.32.127.108]
Sep 11 10:41:01 mail postfix/smtpd[25913]: lost connection after DATA (17511 
bytes) from smtp-out-127-108.amazon.com[176.32.127.108]
Sep 11 10:41:01 mail postfix/smtpd[25913]: disconnect from 
smtp-out-127-108.amazon.com[176.32.127.108]

I didn't think that info alone was particularly useful...

 Please make sure that the /dev/log syslog socket is a dgram not
 a stream socket, and that mail logging is not synchronous.
Logging is not synchronous, the socket is a datagram socket (it has all been 
set up that way all along).

No change, still the same problem, see above.

Meanwhile, I've managed to record a tcpdump of such a failed session. What 
exactly am I looking for there?
I don't see anything out of the ordinary, except increasing delays between 
received packets from the external host, until the host sends a RST.
It seems I simply do not receive any packets. The ones I get are immediately 
ACK'd, but then there's seconds and later minutes until the next one even 
arrives, until finally the remote host gives up and terminates the connection.

I'll try to get more dumps for comparison, including some from hosts that have 
no problems delivering.

There's no packet filtering or rate limiting on port 25, at least not on my 
machine. The hosting provider might have something there, I'd have to ask 
them...

Regards,
Sean


Re: Possible reasons for lost connection after DATA

2014-09-11 Thread Viktor Dukhovni
On Thu, Sep 11, 2014 at 02:36:51PM +0200, Sean Durkin wrote:

  PLEASE DISABLE
  ALL VERBOSE logging.  NO -v options in master.cf, NO debug_peer_list,

 Yes, sorry, I cranked up the debug level, since normal logging looks like 
 this:
 
 Sep 11 09:43:31 mail postfix/smtpd[25170]: connect from 
 mail18-21.srv2.de[193.169.181.21]
 Sep 11 09:43:31 mail postfix/smtpd[25170]: 2C076C4026A: 
 client=mail18-21.srv2.de[193.169.181.21]
 Sep 11 09:46:33 mail postfix/smtpd[25170]: lost connection after DATA (33290 
 bytes) from mail18-21.srv2.de[193.169.181.21]
 Sep 11 09:46:33 mail postfix/smtpd[25170]: disconnect from 
 mail18-21.srv2.de[193.169.181.21]

That's sufficient.  It shows you're likely not using TLS here, and
the time beetween message start and connection loss.  The number
of samples is rather small now.  I would expect the session duration
for each sending host to be essentially constant over multiple
deliveries (equal to the remote machine's TCP timeout).

Possibilities include a broken network interface somewhere or a
bad cable that corrupts IP or TCP packet headers given specific
input patterns.

If the problem is with the message payload, you could try enabling
inbound TLS, perhaps these sending servers support it.  Don't recall
whether you already have TLS.  If the problem is not with the
payload, then TLS won't make any difference (some hosts will still
fail even after TLS).

 I didn't think that info alone was particularly useful...

It is sufficient, and the verbose logs just add noise.

 Meanwhile, I've managed to record a tcpdump of such a failed
 session. What exactly am I looking for there?

Retransmissions from the sender of data with the same sequence
number...

Post tcpdump output (without packet content is fine), containing
packets from just a single failed session.

 There's no packet filtering or rate limiting on port 25, at least
 not on my machine. The hosting provider might have something there,
 I'd have to ask them...

They probably have middle boxes, which might be the cause of the
problem.

-- 
Viktor.


Re: Possible reasons for lost connection after DATA

2014-09-11 Thread Sean Durkin
Hi Wietse,

Am 11.09.2014 um 13:49 schrieb Wietse Venema:
 What is the distribution of DATA sizes before failure? In your
 example I see numbers around 3kB, 9kB, 12kB.

At the moment, I see these sizes:

- always exactly 17511 bytes from smtp-out-127-*.amazon.com (today, seems to be 
only 3 different hosts trying)
- always exactly 49116 bytes from *.psi.cust-cluster.com (I've seen about 60 
different hosts from there today)
- always exactly 33290 bytes from mail18-*.srv2.de (about a dozen different 
hosts)

It seems those are always the same 3 messages being re-tried constantly (when I 
look at them in the incoming queue folder, it's the same recipient and sender 
and the same message-ID, as far as I can tell). I have problems only with 
messages from these clusters, everything else seems unaffected (at least I 
haven't seen any lost connection messages from any other hosts as far as my 
logfiles go back).

Yesterday I had an additional message with exactly 17441 bytes on every try 
before failure from the Amazon-cluster. That one was finally delivered 
completely early this morning, and has since disappeared from the cycle.

FWIW, I have received a handful of messages from the Amazon-cluster that did 
not have any delays/problems yesterday and today, one of them even from one of 
the problematic hosts that can't deliver the other message.

 Some failures are triggered by packet content, and may be replaced
 only by replacing hardware that operates marginally. Does the problem
 go away when you
 
 - Replace the server (either the network card or the whole box)
 
 - Replace the cable that connects the server to the network switch
 
 - Replace the network switch that the server is plugged into.
 
 - Replace the cable that connects the switch to the router
 
 - Replace the router
 
 - And so on...
 
 If you think this is a stupid idea, then you haven't been around
 long enough.

By no means do I think that's stupid. :)
I'm only doing this server stuff for fun in my spare time, but my real job is 
in microelectronics and hardware, so I've had my share of mysterious and 
seemingly unexplainable stuff (ISI, crosstalk, low-frequency jitter, ground 
bounce, ESD-induced phenomena, you know the drill...).

Problem is that this box is a rented root server in a data center somwhere, so 
I don't have access to the hardware to try any of that. I can contact support, 
but they of course charge you for everything they do, and as long as I haven't 
ruled out that the reason is just some stupid configuration mistake on my part 
(or a routing/filtering issue at my hosting provider, or Amazon, or...), I 
don't want to start replacing hardware, obviously...

Regards,
Sean


Re: Possible reasons for lost connection after DATA

2014-09-11 Thread Viktor Dukhovni
On Thu, Sep 11, 2014 at 03:25:57PM +0200, Sean Durkin wrote:

 I can contact support, but they of course charge you for
 everything they do, and as long as I haven't ruled out that the
 reason is just some stupid configuration mistake on my part (or a
 routing/filtering issue at my hosting provider, or Amazon, or...),
 I don't want to start replacing hardware, obviously...

The Postfix configuration has no impact on the TCP layer, beyond
optionally specifying the TCP window size.  Since it is the TCP
layer that fails, the problem is not related to the Postfix
configuration.

Your PCAP files should demonstrate repeated retransmission of data,
are the ACKs you're sending confirming receipt of packets that are
sent repeatedly?  In that case your ACKs are getting lost?   Is
there a sequence number gap in the data received from the server?
In that case the remote server's data is getting lost.  Does the
capture confirm that window scaling is not in use? ...

-- 
Viktor.


Re: Possible reasons for lost connection after DATA

2014-09-11 Thread Wietse Venema
Sean Durkin:
 Hi Wietse,
 
 Am 11.09.2014 um 13:49 schrieb Wietse Venema:
  What is the distribution of DATA sizes before failure? In your
  example I see numbers around 3kB, 9kB, 12kB.
 
 At the moment, I see these sizes:
 
 - always exactly 17511 bytes from smtp-out-127-*.amazon.com (today, seems to 
 be only 3 different hosts trying)
 - always exactly 49116 bytes from *.psi.cust-cluster.com (I've seen about 60 
 different hosts from there today)
 - always exactly 33290 bytes from mail18-*.srv2.de (about a dozen different 
 hosts)
 
 It seems those are always the same 3 messages being re-tried
 constantly (when I look at them in the incoming queue folder, it's
 the same recipient and sender and the same message-ID, as far as
 I can tell). I have problems only with messages from these clusters,
 everything else seems unaffected (at least I haven't seen any lost
 connection messages from any other hosts as far as my logfiles
 go back).
 
 Yesterday I had an additional message with exactly 17441 bytes on
 every try before failure from the Amazon-cluster. That one was
 finally delivered completely early this morning, and has since
 disappeared from the cycle.

That increases my suspicion of a data-dependent error - some marginal
cable/switch/router, perhaps some middle box with a memory bit error
that requires a power cycle to clear the problem. If the problem is 
caused by crosstalk defect, then only physical replacement will
solve it.

 Problem is that this box is a rented root server in a data center
 somwhere, so I don't have access to the hardware to try any of
 that. I can contact support, but they of course charge you for
 everything they do, and as long as I haven't ruled out that the
 reason is just some stupid configuration mistake on my part (or a
 routing/filtering issue at my hosting provider, or Amazon, or...),
 I don't want to start replacing hardware, obviously...

Try power cycling.

Wietse


Re: Possible reasons for lost connection after DATA

2014-09-11 Thread Hannes Erven

Hi Sean,


 Meanwhile, I've managed to record a tcpdump of such a failed session.
 What exactly am I looking for there?

I remember a possibly similar situation back in 2008... the culprit was 
a not-fully-up-to-date Cisco ASA firewall that corrupted TCP SACK fields 
and hence had the remote site send RSET.
Anyways on our end the connection seemed to starve, just as you describe 
it.


We detected that by comparing tcpdumps from both affected ends. Of 
course we had been lucky enough to have that happen with a business 
partner with competent IT people who we got a hold of, spotted the 
problem and also temporarily switched the feature off on their side to 
prove that this actually is the problem.

A firmware upgrade on my client's firewall then fixed the issue.

With a server hosted somewhere and incoming connections from big 
clusters, you might not be as lucky as that...



best regards,

-hannes


Re: Possible reasons for lost connection after DATA

2014-09-11 Thread Wietse Venema
Sean Durkin:
 Meanwhile, I've managed to record a tcpdump of such a failed
 session. What exactly am I looking for there?

- The receiving host's window announcement in the tcp handshake
  and in subsequent ACKs.

- Whether there is a gap in the sender packet sequence numbers
  as seen by the receiving host.

Such a gap means that a particular packet is being dropped.

Just to bore you with a few examples of bad middleboxes:

- Shortly after the first Postfix release there was a problem with
  traffic corruption due to a buggy middlebox (a Packeteer traffic
  shaper). The error had a very distinct signature.

- For many years, there were problems with CISCO PIX firewalls
  that inspected SMTP traffic but failed to properly handle the
  case that CRLF.CRLF happened to fall on a packet boundary.

- http://www.arschkrebs.de/postfix/postfix_cisco_pix_bugs.shtml
  has other examples where CISCO PIX/ASA firewalls will mis-handle
  SMTP traffic in various ways.

In your case, you may have to collaborate with someone who is willing
to send large amounts of random email; hopefully some messages will
trigger the bug, and then the sender and receiver can compare tcpdump
recordings.

Wietse


Re: Possible reasons for lost connection after DATA

2014-09-11 Thread L. Mark Stone
Any chance there is a UTM device in the email stream? 

We see lots of these errors when our SonicWalls do an RBL lookup, don't like 
the data in the email stream etc. The SonicWalls then just drop the connection 
and Postfix logs the drop. 

Hope that helps, 
Mark 


Possible reasons for lost connection after DATA

2014-09-10 Thread Sean Durkin
Hello,

some of my users were complaining about losing incoming mail, namely Amazon 
shipping notifications, newsletters and such things that they were absolutely 
sure were sent out, but never reached their inbox. After doing some digging, 
increasing log verbosity and such, I found a lot of this:

[... snip ...]
Sep 10 00:06:37 mail postfix/smtpd[23095]: lost connection after DATA (17511 
bytes) from smtp-out-127-108.amazon.com[176.32.127.108]
Sep 10 00:06:48 mail postfix/smtpd[23111]: lost connection after DATA (22788 
bytes) from mail18-92.srv2.de[193.169.181.92]
Sep 10 00:13:35 mail postfix/smtpd[23348]: lost connection after DATA (17441 
bytes) from smtp-out-127-108.amazon.com[176.32.127.108]
Sep 10 00:27:49 mail postfix/smtpd[23454]: lost connection after DATA (22788 
bytes) from mail18-97.srv2.de[193.169.181.97]
Sep 10 00:31:03 mail postfix/smtpd[23103]: lost connection after DATA (49116 
bytes) from quinqueunus.psi.cust-cluster.com[195.140.187.51]
Sep 10 00:48:46 mail postfix/smtpd[23890]: lost connection after DATA (22788 
bytes) from mail18-98.srv2.de[193.169.181.98]
Sep 10 00:51:53 mail postfix/smtpd[23564]: lost connection after DATA (49116 
bytes) from quattuorocto.psi.cust-cluster.com[195.140.187.48]
Sep 10 01:09:16 mail postfix/smtpd[24565]: lost connection after DATA (17511 
bytes) from smtp-out-127-108.amazon.com[176.32.127.108]
Sep 10 01:09:44 mail postfix/smtpd[24290]: lost connection after DATA (22788 
bytes) from mail18-99.srv2.de[193.169.181.99]
Sep 10 01:16:14 mail postfix/smtpd[24674]: lost connection after DATA (17441 
bytes) from smtp-out-127-107.amazon.com[176.32.127.107]
Sep 10 01:30:44 mail postfix/smtpd[24782]: lost connection after DATA (22788 
bytes) from mail18-100.srv2.de[193.169.181.100]
Sep 10 01:51:42 mail postfix/smtpd[25198]: lost connection after DATA (22788 
bytes) from mail18-105.srv2.de[193.169.181.105]
Sep 10 01:54:29 mail postfix/smtpd[24966]: lost connection after DATA (49116 
bytes) from quattuorocto.psi.cust-cluster.com[195.140.187.48]
Sep 10 02:12:21 mail postfix/smtpd[25784]: lost connection after DATA (17511 
bytes) from smtp-out-127-107.amazon.com[176.32.127.107]
Sep 10 02:12:42 mail postfix/smtpd[25656]: lost connection after DATA (22788 
bytes) from mail18-106.srv2.de[193.169.181.106]
Sep 10 02:20:11 mail postfix/smtpd[25892]: lost connection after DATA (17441 
bytes) from smtp-out-127-108.amazon.com[176.32.127.108]
Sep 10 02:33:41 mail postfix/smtpd[26077]: lost connection after DATA (22788 
bytes) from mail18-107.srv2.de[193.169.181.107]
Sep 10 02:54:22 mail postfix/smtpd[26178]: lost connection after DATA (49116 
bytes) from quinquenulla.psi.cust-cluster.com[195.140.187.50]
Sep 10 02:54:40 mail postfix/smtpd[26490]: lost connection after DATA (22788 
bytes) from mail18-108.srv2.de[193.169.181.108]
Sep 10 03:14:16 mail postfix/smtpd[26585]: lost connection after DATA (49116 
bytes) from quinquenulla.psi.cust-cluster.com[195.140.187.50]
Sep 10 03:15:39 mail postfix/smtpd[26905]: lost connection after DATA (22788 
bytes) from mail18-113.srv2.de[193.169.181.113]
Sep 10 03:15:52 mail postfix/smtpd[27091]: lost connection after DATA (17511 
bytes) from smtp-out-127-106.amazon.com[176.32.127.106]
Sep 10 03:23:15 mail postfix/smtpd[27214]: lost connection after DATA (17441 
bytes) from smtp-out-127-107.amazon.com[176.32.127.107]
[... snip ...]

So to me that looks as if either the external SMTP server closes its connection 
before it is done with the entire message (the transferred size does not match 
the size passed through MAIL FROM: SIZE=XYZ), or the connection times out. I 
can see in the logs and looking at the queue directories that these messages 
are put in the incoming queue by cleanup, are then found but skipped by qmgr 
(probably since they are not finished); they lurk in the incoming queue for 
awhile and disappear about the time the lost connection message is put in the 
logs. So this points to a timeout.

The weird thing is that the data sizes are always the same for the same 
message-id being delivered, even if it is delivered via different servers from 
a cluster. If it were a timeout, network problem or such, I'd expect a more or 
less random value for the received data size, not always exactly the same.

This seems to be a new problem (meaning I just recently got aware of it; I 
don't know when it started, but I do know everything was working fine for years 
before). This does not seem to be a problem of the sepcific hosts above; in 
between, I've been getting messages from the same hosts successfully. It seems 
that some messages go through on the first try, other messages from the same 
hosts are always losing connection, they never are delivered completely, 
ultimately resulting in the external host giving up; the message is then lost, 
and the user never knows about it.

The first question is:
Can I rule out it's my fault? I don't have traffic shaping or ICMP blocking 
running on that host, which maybe could cause something like 

Re: Possible reasons for lost connection after DATA

2014-09-10 Thread Robert Schetterer
Am 10.09.2014 um 09:56 schrieb Sean Durkin:
 The first question is:
 Can I rule out it's my fault? 

have you changed anything last days/month  upgrades/updates software
hardware ?
please send you postfix config , search list archive lost connection
after DATA


Best Regards
MfG Robert Schetterer

-- 
[*] sys4 AG

http://sys4.de, +49 (89) 30 90 46 64
Franziskanerstraße 15, 81669 München

Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263
Vorstand: Patrick Ben Koetter, Marc Schiffbauer
Aufsichtsratsvorsitzender: Florian Kirstein


Re: Possible reasons for lost connection after DATA

2014-09-10 Thread Viktor Dukhovni
On Wed, Sep 10, 2014 at 09:56:48AM +0200, Sean Durkin wrote:

 Some of my users were complaining about losing incoming mail,
 namely Amazon shipping notifications, newsletters and such things
 that they were absolutely sure were sent out, but never reached
 their inbox. After doing some digging, increasing log verbosity
 and such, I found a lot of this:

Have you tried disabling TCP window scaling?  It might be confusing
some middle-box (firewall, NAT device, ...) on path between the
remote systems and your MTA.

 [... snip ...]
 Sep 10 00:06:37 mail postfix/smtpd[23095]: lost connection after DATA (17511 
 bytes) from smtp-out-127-108.amazon.com[176.32.127.108]

Post the hostname/IP address of the receving system.

Capture and examine a tcpdump recording of mail from one of the
problem senders.  Any sign of retransmission by the sender?

For at least one such session, post all related messages from the
postfix/smtpd[pid] that occur between connect from and
disconnect from.

-- 
Viktor.


Re: Possible reasons for lost connection after DATA

2014-09-10 Thread Sean Durkin
Hi Robert,

Am 10.09.2014 um 10:11 schrieb Robert Schetterer:
 Am 10.09.2014 um 09:56 schrieb Sean Durkin:
 The first question is:
 Can I rule out it's my fault? 
 
 have you changed anything last days/month  upgrades/updates software
 hardware ?
Hardware is unchanged.
The Ubuntu postfix package was upgraded in August (2.9.6-1~12.04.2), but this 
problem seems to have started before that, looking at older logs. Except that, 
I don't see any updates directly related to the mail system in the past half 
year. There's of course other system/security updates, but how should I know 
which of these might possibly be responsible?

I haven't changed the basic Postfix configuration lately. I did add OpenDKIM a 
few months back, but I removed that a few days ago to rule out that's the 
problem. I also removed Spamassassin, any RBLs and Postgrey, which I normally 
have running there; that does not seem to make a difference. So I'm now back to 
a very basic Postfix conf, but the problem persists.

 please send you postfix config ,
Anonymized postfinger-output is attached below.

 search list archive lost connection after DATA
I did that, I couldn't find anything that really applies in my case... most 
problems there are either related to DATA size 0 or to weird MTU issues. Mostly 
this seems to happen for connections from spam bots or misconfigured clients 
and people tell you you should just ignore it, but that doesn't really apply 
here.

I've tried getting a TCP dump of such an SMTP session, but since most of the 
interesting mail is coming from server clusters and the external hosts trying 
to deliver mail keep changing I'm still waiting to catch a good one...

Regards,
Sean

Here, as promised, postfinger-output:

--System Parameters--
mail_version = 2.9.6
hostname = mail
uname = Linux mail 3.2.0-65-virtual #99-Ubuntu SMP Fri Jul 4 21:23:03 UTC 2014 
x86_64 x86_64 x86_64 GNU/Linux

--Packaging information--
looks like this postfix comes from deb package: postfix-2.9.6-1~12.04.2

--main.cf non-default parameters--
alias_maps = $alias_database
append_dot_mydomain = no
biff = no
broken_sasl_auth_clients = yes
debug_peer_list = amazon.com, srv2.de, psi.cust-cluster.com, 
outbound.protection.outlook.com
delay_warning_time = 4h
disable_vrfy_command = yes
html_directory = /usr/share/doc/postfix/html
mailbox_size_limit = 0
mailbox_transport = lmtp:unix:/var/run/cyrus/socket/lmtp
message_size_limit = 262144000
mydestination = localhost, localhost.$mydomain, $mydomain, mail.$mydomain, 
mysql:/etc/postfix/mysql-mydestination.cf
myhostname = my.host.name
mynetworks = 127.0.0.0/8, ip.add.re.ss
myorigin = /etc/mailname
proxy_interfaces = ip.add.re.ss
recipient_delimiter = +
sender_canonical_maps = mysql:/etc/postfix/mysql-canonical.cf
smtp_destination_concurrency_limit = 1
smtp_destination_rate_delay = 1s
smtpd_helo_required = yes
smtpd_recipient_restrictions = permit_mynetworks, permit_sasl_authenticated, 
reject_unauth_pipelining, reject_unauth_destination, reject_unlisted_recipient 
reject_invalid_hostname, reject_non_fqdn_hostname, reject_non_fqdn_sender, 
reject_non_fqdn_recipient, reject_unknown_sender_domain, 
reject_unknown_recipient_domain,
smtpd_sasl_auth_enable = yes
smtpd_sender_restrictions = permit_sasl_authenticated, permit_mynetworks, 
reject_unauth_destination, reject_non_fqdn_sender, reject_unknown_sender_domain 
reject_unknown_recipient_domain, reject_unauth_pipelining
smtpd_tls_auth_only = yes
smtpd_tls_CAfile = /etc/postfix/ssl/ca.pem
smtpd_tls_cert_file = /etc/postfix/ssl/my_cert.crt
smtpd_tls_dh1024_param_file = /etc/postfix/ssl/dh_2048.pem
smtpd_tls_dh512_param_file = /etc/postfix/ssl/dh_512.pem
smtpd_tls_key_file = /etc/postfix/ssl/my_key.key
smtpd_tls_protocols = !SSLv2
smtpd_tls_received_header = yes
smtpd_tls_session_cache_database = btree:/var/lib/postfix/smtpd_scache
smtpd_use_tls = yes
smtp_tls_security_level = may
smtp_tls_session_cache_database = btree:/var/lib/postfix/smtp_scache
strict_rfc821_envelopes = yes
tls_preempt_cipherlist = yes
virtual_alias_maps = mysql:/etc/postfix/mysql-virtual.cf

--master.cf--
smtp  inet  n   -   y   --  smtpd
submission inet n  -   y   -   -   smtpd
-o smtpd_etrn_restrictions=reject
-o smtpd_enforce_tls=yes
-o smtpd_sasl_auth_enable=yes
-o smtpd_client_restrictions=permit_sasl_authenticated,reject
smtpsinet  n   -   y   -   -   smtpd
-o smtpd_etrn_restrictions=reject
-o smtpd_tls_wrappermode=yes
-o smtpd_sasl_auth_enable=yes
pickupfifo  n   -   -   60  1   pickup
cleanup   unix  n   -   -   -   0   cleanup
qmgr  fifo  n   -   n   100 1   qmgr
tlsmgrunix  -   -   -   1000?   1   tlsmgr
rewrite   unix  -   -   -   -   -   trivial-rewrite
bounceunix  -   -   -   -   0   bounce
defer unix  -   -   

Re: Possible reasons for lost connection after DATA

2014-09-10 Thread Sean Durkin
Hi Viktor,

Am 10.09.2014 um 16:19 schrieb Viktor Dukhovni:
 Have you tried disabling TCP window scaling?  It might be confusing 
 some middle-box (firewall, NAT device, ...) on path between the 
 remote systems and your MTA.
I would not have thought of that... I've tried that now, but it does not seem 
to help. 
 
 Post the hostname/IP address of the receving system.
mail.tuxroot.de
 
 Capture and examine a tcpdump recording of mail from one of the 
 problem senders.  Any sign of retransmission by the sender?
I'm trying to get a good dump and will post results once I get one.
Not that easy since the external hosts keep changing all the time. All mail 
affected comes from mass mailers that use server clusters, so I keep getting 
those messages from lots of different remote hosts. I'm waiting for it to 
happen from one of the hosts I've seen before.

Retransmission is tried numerous times, but for every retransmission the lost 
connection message is the same (identical number of bytes), as far as I can 
tell. That's one thing that puzzles me... So e.g. a message is delivered twice 
and each time the connection is lost after exactly 17441 bytes, even if it's 
different remote hosts trying, that's kind of odd. 

 For at least one such session, post all related messages from the 
 postfix/smtpd[pid] that occur between connect from and 
 disconnect from.
Here's one: http://pastebin.com/twb3Z8Eg
And this seems to be the same message being redelivered later, from a different 
host, with the same result (connection lost after exactly 17441 bytes):
http://pastebin.com/Qihbjz3w

What I do notice there is that in fact the connection seems to be *very* slow. 
In the above example, the whole process takes several minutes. I don't have any 
throughput or network speed issues with other hosts, though. I've tried sending 
mail from Gmail, Yahoo, my workplace, my former university, GMX, whatever; 
everything goes through on the first attempt each and every time, and quickly. 
But it seems it is always slow for a few hosts. 

Regards,
Sean


Re: Possible reasons for lost connection after DATA

2014-09-10 Thread Wietse Venema
Sean Durkin:
[ Charset windows-1252 converted... ]
 Hi Viktor,
 
 Am 10.09.2014 um 16:19 schrieb Viktor Dukhovni:
  Have you tried disabling TCP window scaling?  It might be confusing 
  some middle-box (firewall, NAT device, ...) on path between the 
  remote systems and your MTA.
 I would not have thought of that... I've tried that now, but it does not seem 
 to help. 
  
  Post the hostname/IP address of the receving system.
 mail.tuxroot.de
  
  Capture and examine a tcpdump recording of mail from one of the 
  problem senders.  Any sign of retransmission by the sender?
 I'm trying to get a good dump and will post results once I get one.
 Not that easy since the external hosts keep changing all the time. All mail 
 affected comes from mass mailers that use server clusters, so I keep getting 
 those messages from lots of different remote hosts. I'm waiting for it to 
 happen from one of the hosts I've seen before.
 
 Retransmission is tried numerous times, but for every retransmission
 the lost connection message is the same (identical number of bytes),
 as far as I can tell. That's one thing that puzzles me... So e.g.
 a message is delivered twice and each time the connection is lost
 after exactly 17441 bytes, even if it's different remote hosts
 trying, that's kind of odd.

No, it means the same problem is happening. Same error,
same symptom.

 What I do notice there is that in fact the connection seems to be
 *very* slow. In the above example, the whole process takes several
 minutes. I don't have any throughput or network speed issues with
 other hosts, though. I've tried sending mail from Gmail, Yahoo,

Slow performance is typical for TCP window scaling problems. Have
you tried to turn it off in your kernel?

# sysctl -w net.ipv4.tcp_window_scaling=0

To make it permanent:

# echo 'net.ipv4.tcp_window_scaling = 0'  /etc/sysctl.conf

Wietse


Re: Possible reasons for lost connection after DATA

2014-09-10 Thread Viktor Dukhovni
On Wed, Sep 10, 2014 at 09:19:58PM +0200, Sean Durkin wrote:

  For at least one such session, post all related messages from the 
  postfix/smtpd[pid] that occur between connect from and 
  disconnect from.
 Here's one: http://pastebin.com/twb3Z8Eg

This trace has an insane level of debugging turned on, to the point
that syslogd is overwhelmed and is losing messages.  PLEASE DISABLE
ALL VERBOSE logging.  NO -v options in master.cf, NO debug_peer_list,
...

Please make sure that the /dev/log syslog socket is a dgram not
a stream socket, and that mail logging is not synchronous.

Then if the problem persists, report just normal Postfix logging,
not the flood of noise from verbose logging.

-- 
Viktor.