On Thu, May 03, 2018 at 12:08:37AM +0200, Tim Düsterhus wrote:
> Willy,
> 
> Am 02.05.2018 um 11:47 schrieb Willy Tarreau:
> > Nice one, though I'd argue that sites which do this know that they
> > are manipulating large contents (it's visible in the config file and
> > sometimes they are the ones asking to relax the config parsing rules).
> > So they're also aware of the need to increase maxrewrite.
> 
> Clearly I was not aware that I needed to increase maxrewrite, despite
> adding a ton of headers to my responses.
> Yes, I could have found out if I had read the complete configuration
> manual. But the tunables appear under a "Performance Tuning" headline. I
> don't have performance issues, I handle less than 15 r/s in peak
> situations. It did not even occur to me that there could possibly be a
> limit to header rewriting. In fact I found out about this whole issue
> by accident.

Oh, I can very well imagine. Long gone are the days when it was possible
to read the whole doc in 10 minutes!
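For reference, a minimal config sketch showing where these tunables live
alongside an "http-response add-header" rule (the values below are purely
illustrative, not recommendations; defaults and interactions vary by
version, so check the manual for yours):

```
global
    # tune.maxrewrite reserves headroom in the buffer for header
    # rewrites; its default is derived from tune.bufsize in some
    # versions. Example values only.
    tune.bufsize    16384
    tune.maxrewrite 2048

frontend www
    bind :80
    http-response add-header Content-Security-Policy "default-src 'self'"
    default_backend app
```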

> >> I'd start off with *logging* when the call fails for the short term.
> > 
> > Adding extra logs not related to traffic actually makes debugging much
> > worse because these logs don't end up in the correct file/storage/whatever.
> > It's much easier to filter on 400/502 and find all the request elements to
> > understand what happens than seeing a random error out of any context.
> > Part of the difficulty here is to properly indicate what was attempted
> > and failed, in complex configs where it's hard to even enumerate rulesets.
> 
> To me a message like: "Unable to add-header Content-Security-Policy to
> response. Possibly the amount of headers exceeds tune.maxrewrite." would
> have been more helpful than a random 502 without any further information.

We could possibly emit something like this with a warning level, just like
when a server goes down. However we need to rate-limit it as you don't want
your admin log to grow at 1000/s when you're flooded with large bogus
requests that repeatedly cause this to happen.

> Especially since the issue happens randomly: Sometimes the additional
> headers fit by chance. Sometimes they don't. I would start by
> investigating the connection to the backend services, not investigating
> some random tunables (See my paragraph above.).

Actually when you have an error, the termination flags are the most important
thing to look at as they indicate what was wrong and where/when.

> Actually I'm not sure
> whether a 502 would even be the correct response. The issue is not with
> the backend, but with the proxy. I'd expect a 500 here:
> 
> >    The 502 (Bad Gateway) status code indicates that the server, while
> >    acting as a gateway or proxy, received an *invalid response from an
> >    inbound server* it accessed while attempting to fulfill the request.
> 
> (highlighting mine)

It could indeed, but arguably it could also be said that the frontend
never received a valid response from the backend, since the proxy failed
to transform a valid response into another valid one.

> After digging into it I might be able to deduce that the addition of the
> new `http-response add-header` line caused the issues. But still I would
> be none the wiser. I would have to stumble upon the tunable by accident.
> Or ask on the list, like I did.

Just out of curiosity, what do you check most often, in order of priority,
among:
  - stats page
  - show info on CLI
  - traffic logs
  - admin logs
  - other

Because in fact that might help figure out where such painful failures would
need to be shown (possibly multiple places).

> > Maybe one solution could be to tag transactions that experienced such a
> > failure and to put a warning on them, that would be reported in the logs
> > with a "W" flag in the termination codes. It would not indicate what
> > failed, obviously, but would very quickly raise awareness of the fact
> > that there is a problem. Usually thanks to the second flag indicating
> > the transaction state and to the rest of the log, it's easier to follow
> > the processing and to figure what went wrong. Later once we add the
> > ability to reject again such failures, we could add an "F" (for "failure"
> > to process).
> > 
> > What do you think ?
> 
> It clearly would be better than silently failing, but it still would
> have wasted a few hours I suspect (again: see above paragraphs).
> 
> I want to note at this point that I'm not running haproxy at scale or
> with serious monitoring. The haproxy instance I'm experiencing this
> issue with is my personal server, not some company or business one. It
> runs my mail and some side / hobby projects. My needs or expectations
> might be different.

That's an important point. It's the same for me on 1wt.eu or haproxy.org,
sometimes I figure late that there are errors. Most often it's the component
we forget about because it works and we don't spend time looking at the logs.
Then probably, like me, you're looking at the stats page once in a while,
and only at the logs when the stats look suspicious?

We already have a "Warnings" column in the stats page which is unused for
the frontends; we could use it to report a count of such failures. Or we
could add an extra "rewrite" column under "warnings" to report such errors
where they were detected.

Willy
