Dear Cyan,
you seem to have investigated the problem in detail. We have been using
aolserver and naviserver for many years with heavy traffic (for the last
two years just naviserver), but we have not seen this problem.
Some questions pop up:
- You say you have a legacy application: did the problem show up just
recently, or did you only become aware of it recently?
- Can you reproduce the problem on a bare metal machine?
- Do you have the option of replacing haproxy with nginx in front of
aolserver? nginx can use persistent connections for the internal traffic
to the backend (we could reduce the number of connections to naviserver
by an order of magnitude), and nginx has various retry options; maybe
haproxy has the same, I have not worked with it (see the sketch below).
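Roughly, the nginx setup I have in mind is the following (just a sketch;
the upstream name, port, and numbers are placeholders to adapt to your
setup):

    upstream naviserver_backend {
        server 127.0.0.1:8000;
        # keep a pool of idle connections open to naviserver instead of
        # opening a new backend connection for every request
        keepalive 32;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://naviserver_backend;
            # HTTP/1.1 and an empty Connection header are needed for
            # upstream keepalive to take effect
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            # pass failed attempts (connect errors, timeouts) on to the
            # next upstream server
            proxy_next_upstream error timeout;
        }
    }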
-gustaf neumann
On 28.08.14 13:51, Cyan Ogilvie wrote:
On Wed, Aug 27, 2014 at 9:34 PM, Torben Brosten <tor...@dekka.com> wrote:
>
> On 08/27/2014 12:02 PM, Cyan Ogilvie wrote:
> > ..
> > There doesn't seem to be a pattern to the failing requests,
> > sometimes it's small static files like favicon.ico, but mostly not
> > (although in our case we're not using fastpath for that - different
> > favicons are served based on the request context). At the moment I'm
> > leaning towards some sort of corrupted connection thread state - the
> > failures tend to cluster by time, server, user - so that, although
> > the failures are exceedingly rare overall (220 yesterday), it's often
> > the case that a given user will have to reload a page several times
> > before they get a successful response. The servers are fronted by
> > haproxy which will tend to send a given session back to the same
> > server.
> > ..
>
> Have you ruled out a router issue, such as from ipv4 exhaustion or
> localized network flooding?
I'm pretty sure it's not network related at this stage. To test this I
built a man-in-the-middle relay, listening on port 8008 and running on
the same server as nsd, which forwards all connection traffic to
127.0.0.1:80 and records the events and data it sees flowing in both
directions, and I also capture a packet trace with tcpdump of both the
requests arriving on eth0 and the relayed traffic on lo. An iptables
nat prerouting rule DNATs connections coming in on eth0 port 80 to the
relay's port 8008. Another process watches the haproxy logs for
indications of a failed request, then retrieves a dump of the ring
buffers from the relay and saves them for later analysis.
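In outline, the forwarding part of such a relay only needs a few lines
of Tcl (this is a simplified sketch without the ring-buffer recording;
the actual relay does more than this):

    # Accept connections on 8008 and splice them to the real nsd
    # listener, copying bytes in both directions until either side closes.
    proc accept {client addr port} {
        set server [socket 127.0.0.1 80]
        foreach chan [list $client $server] {
            fconfigure $chan -translation binary -buffering none
        }
        # background copies in both directions; the callback fires when
        # one direction finishes (EOF or error)
        fcopy $client $server -command [list done $client $server]
        fcopy $server $client -command [list done $client $server]
    }

    # Tear down both ends once one direction completes (a simplification:
    # a real relay might let the other direction drain first)
    proc done {client server args} {
        catch {close $client}
        catch {close $server}
    }

    socket -server accept 8008
    vwait forever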
Here is the packet trace of the relay -> nsd traffic on the lo
interface for a typical event (as captured by tcpdump, which indicated
that no packets had been dropped):
Time       Prot  Len   Info
34.750440  TCP     76  47576 > http [SYN] Seq=0 Win=32792 Len=0 MSS=16396 SACK_PERM=1 WS=32
34.750465  TCP     76  http > 47576 [SYN, ACK] Seq=0 Ack=1 Win=32768 Len=0 MSS=16396 SACK_PERM=1 WS=32
34.750479  TCP     68  47576 > http [ACK] Seq=1 Ack=1 Win=32800 Len=0
34.750720  TCP   4412  [TCP segment of a reassembled PDU]
34.750756  TCP     68  http > 47576 [ACK] Seq=1 Ack=4345 Win=32768 Len=0
34.751274  HTTP   439  GET /item/531138-RR-1012/Elgin-Art-Deco-Dial-Pocket-Watch HTTP/1.1
34.751295  TCP     68  http > 47576 [ACK] Seq=1 Ack=4716 Win=32768 Len=0
34.751377  TCP     68  http > 47576 [FIN, ACK] Seq=1 Ack=4716 Win=32768 Len=0
34.751492  TCP     68  47576 > http [FIN, ACK] Seq=4716 Ack=2 Win=32800 Len=0
34.751515  TCP     68  http > 47576 [ACK] Seq=2 Ack=4717 Win=32768 Len=0
The connection reaches the ESTABLISHED state and the HTTP request data
is sent to nsd and ACKed. Then 0.8 ms later the connection is closed
by nsd. The first Tcl code that should execute for this request is a
preauth filter, which starts by writing a log entry containing
[ns_conn request]. That log message isn't reached in cases like this.
In this example the request is quite large (around 4.7 KB) because of
some large cookies, but the same pattern happens for requests of
around 400 bytes.
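For context, that filter is of roughly this shape (a simplified sketch;
the proc name and URL pattern here are illustrative, not the actual code):

    # Register a preauth filter that logs the request line before any
    # other Tcl code for the request runs
    ns_register_filter preauth GET  /* log_request_filter
    ns_register_filter preauth POST /* log_request_filter

    proc log_request_filter {args} {
        # [ns_conn request] returns the request line, e.g.
        # "GET /item/... HTTP/1.1"
        ns_log notice "preauth: [ns_conn request]"
        return filter_ok
    }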
There appears to be a less common failure mode where the request
processing happens normally and the Tcl code generates a normal
response (HTTP code 200), but the response data never hits the
network. The network trace looks the same as the example I gave above,
except that the time between the ACK of the GET request and nsd's
FIN, ACK is longer - around 20-70 ms, which is in line with the normal
times for a successful request. I haven't yet caught the packet trace
between the relay and nsd for this case (only the trace on eth0
between haproxy and the relay), so I'm not 100% certain of my
interpretation of this failure mode yet.
There is usually a cluster of up to about 6 requests that fail in this
way for a given server in the space of around 2 seconds, with long
intervals between the clusters where no failures happen.
Cyan
--
Univ.Prof. Dr. Gustaf Neumann
WU Vienna
Institute of Information Systems and New Media
Welthandelsplatz 1, A-1020 Vienna, Austria