On Tue, Feb 28, 2012 at 2:50 PM, Tom Evans <tevans...@googlemail.com> wrote:
> Hi all
>
> [I'm re-reading this, and it is a bit of a convoluted setup - I
> appreciate any eyes that read this!]
>
> Hardware: 2 x Dell 2850, 2 x Xeon 5140 2.33 GHz, 4 GB RAM
> OS: FreeBSD 7.1-RELEASE-p4
> Server version: Apache/2.2.22 (FreeBSD)
> Server built:   Feb 13 2012 22:29:44
>
> At $JOB we use Apache to serve as a reverse proxy - we have a pair of
> servers which all our web requests are round robin routed to. These
> servers then provide SSL termination, serve static content and reverse
> proxy onto backend servers for dynamic content.
>
> We in fact run two httpd server instances on each server; one using
> worker MPM providing the SSL termination, and one using event MPM to
> serve static content and reverse proxy content.
>
> First off, I'm not a network guy; I can find out more about this
> routing stuff if you think it's relevant though. IIRC it works like
> this: both boxes have all the public IP addresses for our websites
> allocated on the loopback interface, and the edge routers round robin
> requests to a pair of CARP/VRRP IP addresses on the Apache boxes. By
> controlling which box has which CARP address, we can control which
> box(es) are receiving traffic.
>
> So our problems started when we put all traffic through one box,
> whilst we upgraded to 2.2.22. Some of our websites are served through
> a CDN, and we could observe from our office a significant proportion
> of requests that went via the CDN failed to ever reach our server. We
> can see from our squid proxy log that requests were made that did not
> reach or get recorded by Apache.
>
> We also had reports from our clients and users that the websites (even
> non CDN sites) were subjectively 'slow' once we were operating on just
> one box. We think that these were requests failing to reach our
> server, and then subsequently being retried.
>
> We can quite clearly identify when this happens with sites served from
> the CDN, as each timeout results in them returning a 503 to us. We can
> detect these in our squid proxy logs and so track how frequently it
> happens. When all the traffic was put through one of the Apache
> frontend proxies, the error rate we could detect was 5 times higher
> than when we spread the load through both frontend proxies.
>

To try to make it clearer, I've drawn our architecture as ASCII art:

>        +---------+
>        |  LAN    |
>        +---------+
>             |
>        +---------+
>        | SQUID   |
>        +---------+
>             |
>        ~~~~~~~~~~~
>        ~   inet  ~
>        ~~~~~~~~~~~
>             |
>        ~~~~~~~~~~~
>        ~  CDN    ~
>        ~~~~~~~~~~~
>             |
>        +---------+
>        |   FW    |
>        +---------+
>            /\
>     +-----+  +-------+
>     |                |
>  +-------+      +---------+
>  | FEP 1 |      |  FEP 2  |
>  +-------+      +---------+
>    |\_____          |\_____
>    |      \         |      \
>  +-----+  +-----+  +-----+ +-----+
>  | BE1 |  | BE2 |  | BE3 | | BE4 |
>  +-----+  +-----+  +-----+ +-----+
>

Key:
LAN - our corporate LAN
SQUID - our corporate squid proxy
inet - our corporate internet
CDN - our partner CDN's network
FW - our data centre edge firewall
FEP 1/2 - our front end proxies - httpd-event 2.2.21
BE 1/2/3/4 - our backend web servers, varied (mainly httpd-prefork 2.2).

Hopefully that will come through unmangled...

So, we've been trying to track disappearing requests. We see lots of
requests that go via the CDN to reach our data centre failing with
error code 503. This error message is produced by the CDN, and the
request is not logged in either of the FEPs.

We've been trying to track what happens with tcpdump running at SQUID
and at FW. At SQUID, we see a POST request for a resource, followed by
a long wait, and then a 503 generated by the CDN. Interestingly, 95%
of the failing requests are POST requests.
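
For anyone wanting to reproduce the per-method breakdown, here is a rough sketch of how the 503s could be tallied from a squid access log. It assumes squid's native log format (timestamp, elapsed, client, result/status, bytes, method, URL, ...); the function names are made up for illustration, and a custom logformat would need different field positions.

```python
from collections import Counter


def tally_503_by_method(lines):
    """Count 503 responses per HTTP method.

    Assumes squid's native access.log layout, where the 4th field is
    "RESULT_CODE/status" and the 6th field is the request method.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or truncated lines
        # e.g. "TCP_MISS/503" -> status "503"
        _, _, status = fields[3].partition("/")
        if status == "503":
            counts[fields[5]] += 1
    return counts


def share(counts, method):
    """Fraction of the counted 503s that used the given method."""
    total = sum(counts.values())
    return counts[method] / total if total else 0.0
```

Feeding it the open log file (`tally_503_by_method(open(path))`, with `path` pointing at wherever the access log lives) and then calling `share(counts, "POST")` gives the kind of POST proportion quoted above.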

Tracking that at FW, we see the request coming in, and no reply from
the FEP. The connection is a keep-alive connection, and had just
completed a similar request 4 seconds previously, to which we returned
a 200 and data. This (failing) request is made on the same connection;
we reply with an ACK, then send no data for 47 seconds (the same wait as
seen by squid), and finally the connection is closed with a FIN.
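
To make the pattern concrete, here is a small sketch of the check we are effectively doing by eye on a single connection's trace. The event tuples are a hypothetical simplification of tcpdump output (not something tcpdump emits directly), and the 30 second threshold is an arbitrary assumption.

```python
def find_stalled_requests(events, threshold=30.0):
    """Find requests on one connection that never got a timely response.

    events: list of (timestamp, direction, kind) sorted by time, where
    direction is "in" (towards the FEP) or "out" (from the FEP) and kind
    is "data", "ack" or "fin". This is a hypothetical, hand-reduced view
    of a packet trace.
    """
    stalls = []
    pending = None  # timestamp of the request still awaiting a response
    for ts, direction, kind in events:
        if direction == "in" and kind == "data":
            if pending is None:
                pending = ts  # first segment of a new request
        elif direction == "out" and kind == "data":
            if pending is not None and ts - pending >= threshold:
                stalls.append((pending, ts - pending, "late response"))
            pending = None
        elif kind == "fin":
            if pending is not None:
                stalls.append((pending, ts - pending, "closed without response"))
            pending = None
        # bare ACKs are ignored: they acknowledge receipt but carry no response
    return stalls
```

With a trace shaped like the one described (a good request, then a second request 4 seconds later that is only ACKed and ends in a FIN 47 seconds on), the second request is reported as "closed without response".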


Any ideas?

Cheers

Tom
