Dear Cyan,
you seem to have investigated the problem in detail. We have been using
aolserver and naviserver for many years with heavy traffic (for the last
two years just naviserver), but we have not seen this problem.
Some questions pop up:
- You say you have a legacy application: did the problem show up just
recently, or did you only become aware of it recently?
- Can you reproduce the problem on a bare metal machine?
- Do you have the option of replacing haproxy with nginx in front of
aolserver? nginx can use persistent connections for the internal traffic
to the backend (we could reduce the number of connections to naviserver
by an order of magnitude), and nginx has various retry options; maybe
haproxy has the same, I have not worked with it (see the sketch below).
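Roughly, the nginx setup I have in mind is the following (just a sketch;
the upstream name, port, and numbers are placeholders to adapt to your
setup):

    upstream naviserver_backend {
        server 127.0.0.1:8000;
        # keep a pool of idle connections open to naviserver instead of
        # opening a new backend connection for every request
        keepalive 32;
    }

    server {
        listen 80;

        location / {
            proxy_pass http://naviserver_backend;
            # HTTP/1.1 and an empty Connection header are needed for
            # upstream keepalive to take effect
            proxy_http_version 1.1;
            proxy_set_header Connection "";
            # pass failed attempts (connect errors, timeouts) on to the
            # next upstream server
            proxy_next_upstream error timeout;
        }
    }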
-gustaf neumann
On 28.08.14 13:51, Cyan Ogilvie wrote:
On Wed, Aug 27, 2014 at 9:34 PM, Torben Brosten <tor...@dekka.com> wrote:
>
> On 08/27/2014 12:02 PM, Cyan Ogilvie wrote:
> > ..
> > There doesn't seem to be a pattern to the failing requests,
> > sometimes it's small static files like favicon.ico, but mostly not
> > (although in our case we're not using fastpath for that - different
> > favicons are served based on the request context). At the moment I'm
> > leaning towards some sort of corrupted connection thread state - the
> > failures tend to cluster by time, server, user - so that, although
> > the failures are exceedingly rare overall (220 yesterday), it's often
> > the case that a given user will have to reload a page several times
> > before they get a successful response. The servers are fronted by
> > haproxy which will tend to send a given session back to the same
> > server.
> > ..
>
> Have you ruled out a router issue, such as from ipv4 exhaustion or
> localized network flooding?
I'm pretty sure it's not network related at this stage. To test this I
built a man-in-the-middle relay, listening on port 8008 and running on
the same server as nsd, which forwards all connection traffic to
127.0.0.1:80 and records the events and data it sees flowing in both
directions, and I also capture a packet trace with tcpdump of both the
requests arriving on eth0 and the relayed traffic on lo. An iptables
nat prerouting rule DNATs connections coming in on eth0 port 80 to the
relay's port 8008. Another process watches the haproxy logs for
indications of a failed request, then retrieves a dump of the ring
buffers from the relay and saves them for later analysis.
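In outline, the forwarding part of such a relay only needs a few lines
of Tcl (this is a simplified sketch without the ring-buffer recording;
the actual relay does more than this):

    # Accept connections on 8008 and splice them to the real nsd
    # listener, copying bytes in both directions until either side closes.
    proc accept {client addr port} {
        set server [socket 127.0.0.1 80]
        foreach chan [list $client $server] {
            fconfigure $chan -translation binary -buffering none
        }
        # background copies in both directions; the callback fires when
        # one direction finishes (EOF or error)
        fcopy $client $server -command [list done $client $server]
        fcopy $server $client -command [list done $client $server]
    }

    # Tear down both ends once one direction completes (a simplification:
    # a real relay might let the other direction drain first)
    proc done {client server args} {
        catch {close $client}
        catch {close $server}
    }

    socket -server accept 8008
    vwait forever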
Here is the packet trace of the relay -> nsd traffic on the lo
interface for a typical event (as captured by tcpdump, which indicated
that no packets had been dropped):
Time       Prot  Len   Info
34.750440  TCP     76  47576 > http [SYN] Seq=0 Win=32792 Len=0 MSS=16396 SACK_PERM=1 WS=32
34.750465  TCP     76  http > 47576 [SYN, ACK] Seq=0 Ack=1 Win=32768 Len=0 MSS=16396 SACK_PERM=1 WS=32
34.750479  TCP     68  47576 > http [ACK] Seq=1 Ack=1 Win=32800 Len=0
34.750720  TCP   4412  [TCP segment of a reassembled PDU]
34.750756  TCP     68  http > 47576 [ACK] Seq=1 Ack=4345 Win=32768 Len=0
34.751274  HTTP   439  GET /item/531138-RR-1012/Elgin-Art-Deco-Dial-Pocket-Watch HTTP/1.1
34.751295  TCP     68  http > 47576 [ACK] Seq=1 Ack=4716 Win=32768 Len=0
34.751377  TCP     68  http > 47576 [FIN, ACK] Seq=1 Ack=4716 Win=32768 Len=0
34.751492  TCP     68  47576 > http [FIN, ACK] Seq=4716 Ack=2 Win=32800 Len=0
34.751515  TCP     68  http > 47576 [ACK] Seq=2 Ack=4717 Win=32768 Len=0
The connection reaches the ESTABLISHED state and the HTTP request data
is sent to nsd and ACKed. Then 0.8 ms later the connection is closed
by nsd. The first Tcl code that should execute for this request is a
preauth filter, which starts by writing a log entry containing
[ns_conn request]. That log message isn't reached in cases like this.
In this example the request is quite large (around 4.7 KB) because of
some large cookies, but the same pattern happens for requests of
around 400 bytes.
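For context, that filter is of roughly this shape (a simplified sketch;
the proc name and URL pattern here are illustrative, not the actual code):

    # Register a preauth filter that logs the request line before any
    # other Tcl code for the request runs
    ns_register_filter preauth GET  /* log_request_filter
    ns_register_filter preauth POST /* log_request_filter

    proc log_request_filter {args} {
        # [ns_conn request] returns the request line, e.g.
        # "GET /item/... HTTP/1.1"
        ns_log notice "preauth: [ns_conn request]"
        return filter_ok
    }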
There appears to be a less common failure mode where the request
processing happens normally and the Tcl code generates a normal
response (HTTP code 200), but the response data never hits the
network. The network trace looks the same as the example I gave above,
except that the time between the ACK of the GET request and nsd's
FIN, ACK is longer - around 20-70 ms, which is in line with the normal
times for a successful request. I haven't yet caught the packet trace
between the relay and nsd for this case (only the trace on eth0
between haproxy and the relay), so I'm not 100% certain of my
interpretation of this failure mode yet.
There is usually a cluster of up to about 6 requests that fail in this
way for a given server in the space of around 2 seconds, with long
intervals between the clusters where no failures happen.
Cyan
--
Univ.Prof. Dr. Gustaf Neumann
WU Vienna
Institute of Information Systems and New Media
Welthandelsplatz 1, A-1020 Vienna, Austria