Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities
FYI: also working fine for Debian Buster's and Ubuntu Bionic's current source packages (versions 2.4.38 and 2.4.29). Thanks!

To unsubscribe, e-mail: users-unsubscr...@httpd.apache.org
For additional commands, e-mail: users-h...@httpd.apache.org
Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities
That was swift! Still testing it, but it's working like a charm so far. Congratulations on the good job.

Thanks again,
Antonio
Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities
On Fri, 15 May 2020 09:12:30 +0200 (CEST) Antonio Suárez Pozuelo wrote:

> Sure! There you are:
> https://bz.apache.org/bugzilla/show_bug.cgi?id=64443
>
> Thanks for your support, Nick, really appreciate it. Best regards,

I've hacked up - but not tested - a simple patch (attached). If it works for you I'll think through whether it's fit to commit as-is or whether there's a case for something more general (and complex).

--
Nick Kew

Index: modules/filters/mod_proxy_html.c
===================================================================
--- modules/filters/mod_proxy_html.c (revision 1877795)
+++ modules/filters/mod_proxy_html.c (working copy)
@@ -674,6 +674,16 @@
             }
         }
     }
+    /* PR#64443: for <form>, insert accept-charset attribute if necessary */
+    if (!strcasecmp(name, "FORM")) {
+        const char *cenc;
+        xmlCharEncoding enc;
+        if (xml2enc_charset &&
+            (xml2enc_charset(ctx->f->r, &enc, &cenc) == APR_SUCCESS)) {
+            ap_fputstrs(ctx->f->next, ctx->bb, " accept-charset=\"",
+                        cenc, "\"", NULL);
+        }
+    }
     ctx->offset = 0;
     if (desc && desc->empty)
         ap_fputs(ctx->f->next, ctx->bb, ctx->etag);
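The effect of the patch on the rewritten markup can be illustrated outside httpd. The following is only a rough Python sketch, not how mod_proxy_html works internally: the function name `add_accept_charset` and the regex-based matching are assumptions made for illustration (the module uses libxml2's SAX parser, not regular expressions). Unlike the posted hunk, which does not visibly check for an existing attribute, the sketch leaves forms alone when they already declare `accept-charset`.

```python
import re

# Hypothetical helper for illustration only; mod_proxy_html does this
# inside its SAX start-element handler, not with a regular expression.
# Simplification: attribute values containing ">" are not handled.
FORM_TAG = re.compile(r"<(form)\b([^>]*)>", re.IGNORECASE)

def add_accept_charset(markup: str, charset: str) -> str:
    """Append accept-charset to every <form> start tag that lacks one."""
    def patch(match: re.Match) -> str:
        name, attrs = match.group(1), match.group(2)
        if re.search(r"\baccept-charset\s*=", attrs, re.IGNORECASE):
            return match.group(0)  # already declared, leave untouched
        return f'<{name}{attrs} accept-charset="{charset}">'
    return FORM_TAG.sub(patch, markup)

patched = add_accept_charset('<form action="/save" method="post">', "ISO-8859-1")
untouched = add_accept_charset('<form accept-charset="UTF-8">', "ISO-8859-1")
```

With the attribute in place, the browser is asked to submit form data in the backend's charset even though the page itself was delivered as UTF-8.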
Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities
Sure! There you are:

https://bz.apache.org/bugzilla/show_bug.cgi?id=64443

Thanks for your support, Nick, really appreciate it.

Best regards,
Antonio
Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities
I have added xml2EncDefault UTF-8 directive with something wrong when combining xml2enc_module and proxy_html_module.
Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities
> On 14 May 2020, at 13:36, Antonio Suárez Pozuelo wrote:
>
> Hi, Nick. I'm afraid we're still having some issue with this.
>
> Without ProxyHTMLCharsetOut, proxy_html is translating our backend ISO-8859-1
> response into UTF-8, which is fine. When submitting a form, I guess the
> browser will also encode its contents in UTF-8, but maybe proxy_html won't
> reverse-translate that into ISO-8859-1 before relaying it to the backend
> server.

Whoops, now you mention it, that may have figured in the thinking behind the very configuration you were trying to use. Yes, of course, mod_proxy_html doesn't touch your POST data.

> This can be enforced by adding an accept-charset="ISO-8859-1" attribute to
> the <form> tag (tested on Firefox 77.0b5), so: should proxy_html add that
> attribute to <form> tags automagically when parsing and translating HTML
> content?

Interesting suggestion. It would be straightforward to offer that as a configuration option (much easier than fixing the problem with ProxyHTMLCharsetOut, unless I'm missing something in the libxml2 API). Though it seems to me kind-of an ugly workaround.

I think this merits a bugzilla entry. Do you want to submit it?

--
Nick Kew
Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities
Hi, Nick. I'm afraid we're still having some issue with this.

Currently our conf is:

    ProxyPreserveHost on
    ProxyHTMLEnable on
    ProxyHTMLExtended on

And our pages are showing fine, but non-English characters fed into form fields are posted incorrectly (badly encoded) to our backend server. This won't happen with ProxyHTMLCharsetOut set to "*" or explicitly to "ISO-8859-1"; but that configuration, you know, takes us back to the starting point.

Without ProxyHTMLCharsetOut, proxy_html is translating our backend ISO-8859-1 response into UTF-8, which is fine. When submitting a form, I guess the browser will also encode its contents in UTF-8, but maybe proxy_html won't reverse-translate that into ISO-8859-1 before relaying it to the backend server.

This can be enforced by adding an accept-charset="ISO-8859-1" attribute to the <form> tag (tested on Firefox 77.0b5), so: should proxy_html add that attribute to <form> tags automagically when parsing and translating HTML content?

Just speculating, I really don't know the internals of it. But I guess you do :)

Thanks in advance.

Best regards,
Antonio
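The form-posting mismatch described in this message can be reproduced in a few lines. This is a minimal sketch under stated assumptions: the helper `submit_and_decode` is hypothetical, and it assumes the browser percent-encodes form values in the charset of the page it received (or the one named in accept-charset) while the backend decodes them as ISO-8859-1.

```python
from urllib.parse import quote, unquote

def submit_and_decode(value: str, browser_charset: str, backend_charset: str):
    """Percent-encode a form value as the browser would, then decode it
    the way the backend would. Returns (wire form, backend's view)."""
    wire = quote(value, encoding=browser_charset)
    return wire, unquote(wire, encoding=backend_charset)

# Page delivered as UTF-8, backend still expects ISO-8859-1: mojibake.
mismatch = submit_and_decode("ñ", "utf-8", "iso-8859-1")

# Browser told to submit in ISO-8859-1: the value survives.
matched = submit_and_decode("ñ", "iso-8859-1", "iso-8859-1")
```

In the first case the two UTF-8 bytes of "ñ" are read as two separate Latin-1 characters; in the second the single byte round-trips cleanly.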
Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities
> On 8 May 2020, at 07:28, Antonio Suárez Pozuelo wrote:
>
> Hi Nick,
>
> Your glass of wine was inspiring: just removed
>
>> ProxyHTMLCharsetOut * # Backend (Tomcat) charset is ISO-8859-1
>
> and the problem's gone!

OK, thanks for confirming it. I'm pretty sure now what's happening.

Libxml2 uses unicode (utf-8) internally, so for i18n to work, your iso-8859-1 gets converted before feeding to the parser. But HTML entities are not preserved: they get converted to their unicode representations.

ProxyHTMLCharsetOut is kind-of an afterthought: it converts unicode to your choice of encoding. But it doesn't deal with HTML entities. So when it encounters unicode sequences for your "→" et al, it just tries to convert unicode to latin-1, and fails when there is no latin-1 representation.

As far as I know this doesn't really matter: unicode support is pretty-near universal, so just leaving it in place has no real downside. I'll think about whether there's an easy fix to ProxyHTMLCharsetOut for cases like this, but will more likely just add a note to the docs about the limitation.

> FYI, by increasing LogLevel to INFO, error log shows:

Basically just shows the problem isn't your backend. My first reply was leading to "if the debug info doesn't tell us what's wrong, I'll ask for a test case to try and replicate the problem". No need for that now!

Thanks for the report!

--
Nick Kew
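The failure mode Nick describes can be shown with Python's standard library standing in for libxml2 and the output conversion. This is only an analogy, not the module's code: `html.unescape` plays the role of the parser resolving entities to Unicode, and a strict `encode("iso-8859-1")` plays the role of ProxyHTMLCharsetOut. The helper name `to_latin1` is made up for the example, and `&rarr;` is assumed to be the entity behind the arrow shown in the thread.

```python
import html

def to_latin1(markup: str) -> bytes:
    """Resolve HTML entities to Unicode, then encode strictly as ISO-8859-1."""
    text = html.unescape(markup)      # "&rarr;" becomes U+2192
    return text.encode("iso-8859-1")  # raises if a character has no Latin-1 form

# Entities whose characters exist in Latin-1 convert without trouble.
diacritics = to_latin1("&ntilde;&aacute;")

# Arrows have no Latin-1 representation, so the strict conversion fails.
try:
    to_latin1("&rarr;")
    arrow_failed = False
except UnicodeEncodeError:
    arrow_failed = True
```

This matches the observation in the original report that letters with diacritics pass through while the arrow entities break the response.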
Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities
Hi Nick,

Your glass of wine was inspiring: just removed

> ProxyHTMLCharsetOut * # Backend (Tomcat) charset is ISO-8859-1

and the problem's gone! Also commented out

> ProxyHTMLMeta on

with no noticeable change in behaviour. As per the docs "turning ProxyHTMLMeta Off will give a small performance boost", so off it goes. Thank you so much!

FYI, by increasing LogLevel to INFO, error log shows:

[Fri May 08 07:42:35.790051 2020] [xml2enc:info] [pid 13183:tid 139823008806656] [client _redacted_:55344] AH01431: Got charset ISO-8859-1 from HTTP headers

So our backend's stated charset is ISO-8859-1.

About your questions:

> Are you sure your backend is sending literally those entities, as opposed to
> their byte representations in its charset?
> Note that libxml2 is doing the hard work here: what version of libxml2 do you
> have?

"Faulty" entities are coded verbatim (i.e. "&rarr;") in the backend JSP pages, and are rendered exactly that way in non-proxied responses. libxml2 version is 2.9.4 (within Debian 10.3 amd64). I can do further testing, if you need it.

FYI 2 (side point):

> > ProxyHTMLURLMap "/backend-path/(.*)" "/$1" R

We had some previous experience with proxy URL mapping, and "/frontend-path/" <-> "/backend-path/" has always worked fine for us without the regexp. But mapping the root frontend path "/" gave us some trouble; maybe there's a better solution, but that regexp solved the issue.

Thank you again.

Best regards,
Antonio
Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities
> On 7 May 2020, at 17:52, Antonio Suárez Pozuelo wrote:
>
> Hi there,

Further to my last reply, I can see what may possibly be wrong:

> We have a Tomcat 8 backend server behind an Apache 2.4 proxy. Our Apache conf:
>
>    ProxyPreserveHost on
>    ProxyHTMLEnable on
>    ProxyHTMLExtended on

You probably don't want that.

>    ProxyHTMLCharsetOut * # Backend (Tomcat) charset is ISO-8859-1

I suspect that is very probably the culprit. Does removing it fix the problem?

>    ProxyHTMLMeta on

You probably also don't want that. I think the documentation of that is misleadingly out-of-date, but I don't want to check now (late, and after a glass of wine).

--
Nick Kew
Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities
> On 7 May 2020, at 17:52, Antonio Suárez Pozuelo wrote:
>
> Hi there,
>
>    ProxyHTMLURLMap "/backend-path/(.*)" "/$1" R

Minor point, no need for that regexp. Just /backend-path/ / (the remainder will be untouched).

> Everything works fine but for a few HTML entities; detected so far: → ← ↑ ↓ ▸.
> Whenever the backend response HTML includes one of those:

Are you sure your backend is sending literally those entities, as opposed to their byte representations in its charset?

> 1. Apache's response's erratic: it either drops parts of the HTML or resets
> the connection altogether.
>
> 2. Error log shows:
>
>    [Thu May 07 18:07:54.934922 2020] [xml2enc:error] [pid 12355:tid
> 139930604844800] [client (_redacted_):33206] AH01444: Skipping invalid
> byte(s) in input stream!, referer: (_redacted_)

That means there's something in the input stream that's a mismatch with the charset detected. Either that or you've found a bug.

Note that libxml2 is doing the hard work here: what version of libxml2 do you have? And is there any other filter involved?

> By the way, we've found that replacing
>
>    ProxyHTMLEnable on
>
> with
>
>    SetOutputFilter proxy-html

Yes, that configuration skips mod_xml2enc entirely, which means you have no i18n support in mod_proxy_html. So

> (although it has some drawbacks with non-english characters, so it's of
> no use for us).

is expected behaviour.

> Are we doing something wrong, maybe?

Nothing obviously wrong, though your backend/app may be misconfigured. If you increase LogLevel to INFO, mod_xml2enc will report what charset it detects, so you can check whether that's what you think it should be. Probably better to increase it to DEBUG, which will get quite a lot more info from mod_xml2enc.

If that doesn't help you figure it out, post the mod_xml2enc messages at level DEBUG here and I'll take a look.

--
Nick Kew
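Nick's remark that the regexp is unnecessary can be checked with a small comparison. This is a simplified Python sketch, not mod_proxy_html's matching code: it assumes a plain (non-regex) ProxyHTMLURLMap pattern is matched at the start of a link URL and that only the matched prefix is replaced. The function names `map_prefix` and `map_regex` are hypothetical.

```python
import re

def map_prefix(url: str, from_prefix: str, to_prefix: str) -> str:
    """Plain mapping: swap a leading prefix, leave the remainder untouched."""
    if url.startswith(from_prefix):
        return to_prefix + url[len(from_prefix):]
    return url

def map_regex(url: str) -> str:
    """The regex form from the original configuration, in Python syntax."""
    return re.sub(r"/backend-path/(.*)", r"/\1", url)

link = "/backend-path/app/index.jsp"
by_prefix = map_prefix(link, "/backend-path/", "/")
by_regex = map_regex(link)
```

For a link that starts with the backend path, both forms produce the same rewritten URL, which is the point of the simpler mapping.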
[users@httpd] proxy_html / xml2enc won't handle certain HTML entities
Hi there,

We have a Tomcat 8 backend server behind an Apache 2.4 proxy. Our Apache conf:

    ProxyPreserveHost on
    ProxyHTMLEnable on
    ProxyHTMLExtended on
    ProxyHTMLCharsetOut * # Backend (Tomcat) charset is ISO-8859-1
    ProxyHTMLMeta on
    ProxyHTMLDocType "" XML
    ProxyHTMLURLMap "/backend-path/(.*)" "/$1" R
    ProxyPass "http://backend-host:8080/backend-path/"
    ProxyPassReverse "http://backend-host:8080/backend-path/"
    ProxyPassReverseCookieDomain backend-host "%{HTTP_HOST}s"
    ProxyPassReverseCookiePath "/backend-path/" "/"

Everything works fine but for a few HTML entities; detected so far: → ← ↑ ↓ ▸. Whenever the backend response HTML includes one of those:

1. Apache's response is erratic: it either drops parts of the HTML or resets the connection altogether.

2. Error log shows:

[Thu May 07 18:07:54.934922 2020] [xml2enc:error] [pid 12355:tid 139930604844800] [client (_redacted_):33206] AH01444: Skipping invalid byte(s) in input stream!, referer: (_redacted_)

First experienced on version 2.4.38 (Debian-shipped); also verified on version 2.4.43 (just built from source on Debian 10.3 amd64).

As far as I know, those "faulty" HTML entities are fully standard. Some others such as or letters with diacritics (ñ, á...) pass through just fine.

By the way, we've found that replacing

    ProxyHTMLEnable on

with

    SetOutputFilter proxy-html

works fine for those HTML entities (although it has some drawbacks with non-English characters, so it's of no use for us).

Are we doing something wrong, maybe?

Thank you all in advance.

Best regards,
Antonio