Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities

2020-05-18 Thread Antonio Suárez Pozuelo
FYI: also working fine for Debian Buster's and Ubuntu Bionic's current source 
packages (versions 2.4.38 and 2.4.29).

Thanks!

----- Original Message -----
From: "Antonio Suárez Pozuelo"
To: "users"
Sent: Monday, 18 May 2020 9:58:14
Subject: Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities

That was swift! Still testing it, but it's working like a charm so far.

Congratulations on the good work. Thanks again,

Antonio

----- Original Message -----
From: "Nick Kew"
To: "users"
Sent: Saturday, 16 May 2020 1:20:57
Subject: Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities

On Fri, 15 May 2020 09:12:30 +0200 (CEST)
Antonio Suárez Pozuelo wrote:

> Sure! There you are:
> https://bz.apache.org/bugzilla/show_bug.cgi?id=64443
> 
> Thanks for your support, Nick, really appreciate it. Best regards,

I've hacked up - but not tested - a simple patch (attached).
If it works for you I'll think through whether it's fit to
commit as-is or whether there's a case for something more
general (and complex).

-- 
Nick Kew


-
To unsubscribe, e-mail: users-unsubscr...@httpd.apache.org
For additional commands, e-mail: users-h...@httpd.apache.org




Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities

2020-05-18 Thread Antonio Suárez Pozuelo
That was swift! Still testing it, but it's working like a charm so far.

Congratulations on the good work. Thanks again,

Antonio

----- Original Message -----
From: "Nick Kew"
To: "users"
Sent: Saturday, 16 May 2020 1:20:57
Subject: Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities

On Fri, 15 May 2020 09:12:30 +0200 (CEST)
Antonio Suárez Pozuelo wrote:

> Sure! There you are:
> https://bz.apache.org/bugzilla/show_bug.cgi?id=64443
> 
> Thanks for your support, Nick, really appreciate it. Best regards,

I've hacked up - but not tested - a simple patch (attached).
If it works for you I'll think through whether it's fit to
commit as-is or whether there's a case for something more
general (and complex).

-- 
Nick Kew





Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities

2020-05-15 Thread Nick Kew
On Fri, 15 May 2020 09:12:30 +0200 (CEST)
Antonio Suárez Pozuelo wrote:

> Sure! There you are:
> https://bz.apache.org/bugzilla/show_bug.cgi?id=64443
> 
> Thanks for your support, Nick, really appreciate it. Best regards,

I've hacked up - but not tested - a simple patch (attached).
If it works for you I'll think through whether it's fit to
commit as-is or whether there's a case for something more
general (and complex).

-- 
Nick Kew

Index: modules/filters/mod_proxy_html.c
===
--- modules/filters/mod_proxy_html.c	(revision 1877795)
+++ modules/filters/mod_proxy_html.c	(working copy)
@@ -674,6 +674,16 @@
             }
         }
     }
+    /* PR#64443: for <form>, insert accept-charset attribute if necessary */
+    if (!strcasecmp(name, "FORM")) {
+        const char *cenc;
+        xmlCharEncoding enc;
+        if (xml2enc_charset &&
+                (xml2enc_charset(ctx->f->r, &enc, &cenc) == APR_SUCCESS)) {
+            ap_fputstrs(ctx->f->next, ctx->bb, " accept-charset=\"",
+                        cenc, "\"", NULL);
+        }
+    }
     ctx->offset = 0;
     if (desc && desc->empty)
         ap_fputs(ctx->f->next, ctx->bb, ctx->etag);
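
To make the effect of the hunk concrete, here is a minimal Python sketch (not
the module's code) of the same transformation: when a start tag is serialised
and the element is a form, an accept-charset attribute carrying the backend
charset is appended. The function name start_tag and its arguments are
hypothetical, and the charset is passed in directly rather than obtained from
xml2enc_charset.

```python
from html import escape
from typing import Optional, Sequence, Tuple


def start_tag(name: str,
              attrs: Sequence[Tuple[str, str]],
              backend_charset: Optional[str] = None) -> str:
    """Serialise a start tag; for a form, append accept-charset if a charset is known."""
    parts = ["<" + name]
    for key, value in attrs:
        # Attribute values are escaped so quotes inside them cannot break the tag.
        parts.append(' {}="{}"'.format(key, escape(value, quote=True)))
    # Case-insensitive element match, mirroring strcasecmp(name, "FORM") in the hunk.
    if name.lower() == "form" and backend_charset:
        parts.append(' accept-charset="{}"'.format(backend_charset))
    parts.append(">")
    return "".join(parts)
```

For example, start_tag("form", [("action", "/save")], "ISO-8859-1") gives
<form action="/save" accept-charset="ISO-8859-1">, while non-form elements
come out unchanged.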



Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities

2020-05-15 Thread Antonio Suárez Pozuelo
Sure! There you are: https://bz.apache.org/bugzilla/show_bug.cgi?id=64443

Thanks for your support, Nick, really appreciate it. Best regards,

Antonio

----- Original Message -----
From: "Nick Kew"
To: "users"
Sent: Thursday, 14 May 2020 20:06:03
Subject: Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities

> On 14 May 2020, at 13:36, Antonio Suárez Pozuelo wrote:
> 
> Hi, Nick. I'm afraid we're still having some issue with this.
> 
> Without ProxyHTMLCharsetOut, proxy_html is translating our backend ISO-8859-1 
> response into UTF-8, which is fine. When submitting a form, I guess the 
> browser will also encode its contents in UTF-8, but maybe proxy_html won't 
> reverse-translate that into ISO-8859-1 before relaying it to the backend 
> server. 

Whoops, now you mention it, that may have figured in the thinking behind
the very configuration you were trying to use.  Yes, of course, mod_proxy_html
doesn't touch your POST data.

> This can be enforced by adding an accept-charset="ISO-8859-1" attribute to 
> the <form> tag (tested on Firefox 77.0b5), so: should proxy_html add that 
> attribute to <form> tags automagically when parsing and translating HTML 
> content?

Interesting suggestion.  It would be straightforward to offer that as a
configuration option (much easier than fixing the problem with
ProxyHTMLCharsetOut, unless I'm missing something in the libxml2 API).
Though it seems to me kind-of an ugly workaround.

I think this merits a bugzilla entry.  Do you want to submit it?

-- 
Nick Kew



Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities

2020-05-14 Thread Tatsuki Makino
When something went wrong while combining xml2enc_module and
proxy_html_module, I added the

xml2EncDefault UTF-8

directive.




Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities

2020-05-14 Thread Nick Kew



> On 14 May 2020, at 13:36, Antonio Suárez Pozuelo wrote:
> 
> Hi, Nick. I'm afraid we're still having some issue with this.
> 
> Without ProxyHTMLCharsetOut, proxy_html is translating our backend ISO-8859-1 
> response into UTF-8, which is fine. When submitting a form, I guess the 
> browser will also encode its contents in UTF-8, but maybe proxy_html won't 
> reverse-translate that into ISO-8859-1 before relaying it to the backend 
> server. 

Whoops, now you mention it, that may have figured in the thinking behind
the very configuration you were trying to use.  Yes, of course, mod_proxy_html
doesn't touch your POST data.

> This can be enforced by adding an accept-charset="ISO-8859-1" attribute to 
> the <form> tag (tested on Firefox 77.0b5), so: should proxy_html add that 
> attribute to <form> tags automagically when parsing and translating HTML 
> content?

Interesting suggestion.  It would be straightforward to offer that as a
configuration option (much easier than fixing the problem with
ProxyHTMLCharsetOut, unless I'm missing something in the libxml2 API).
Though it seems to me kind-of an ugly workaround.

I think this merits a bugzilla entry.  Do you want to submit it?

-- 
Nick Kew



Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities

2020-05-14 Thread Antonio Suárez Pozuelo
Hi, Nick. I'm afraid we're still having some issue with this.

Currently our conf is:

    ProxyPreserveHost   on
    ProxyHTMLEnable     on
    ProxyHTMLExtended   on

Our pages display fine, but non-English characters entered into form fields
are posted incorrectly (badly encoded) to our backend server. This won't
happen with ProxyHTMLCharsetOut set to "*" or explicitly to "ISO-8859-1";
but that configuration, you know, takes us back to the starting point.

Without ProxyHTMLCharsetOut, proxy_html is translating our backend ISO-8859-1 
response into UTF-8, which is fine. When submitting a form, I guess the browser 
will also encode its contents in UTF-8, but maybe proxy_html won't 
reverse-translate that into ISO-8859-1 before relaying it to the backend 
server. This can be enforced by adding an accept-charset="ISO-8859-1" attribute 
to the <form> tag (tested on Firefox 77.0b5), so: should proxy_html add that 
attribute to <form> tags automagically when parsing and translating HTML 
content?
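
The mismatch can be reproduced outside the proxy. Below is a small Python
sketch, assuming a browser that percent-encodes form values in the page's
charset unless accept-charset says otherwise, and a backend that always
decodes them as ISO-8859-1. The sample value is made up for illustration.

```python
from urllib.parse import quote_plus, unquote_plus

# Hypothetical form field value containing a non-ASCII letter.
value = "España"

# Sent when the page was served as UTF-8 and the form has no accept-charset.
sent_utf8 = quote_plus(value, encoding="utf-8")

# Sent when the form carries accept-charset="ISO-8859-1".
sent_latin1 = quote_plus(value, encoding="iso-8859-1")

# A backend that assumes ISO-8859-1 misreads the two UTF-8 bytes of "ñ"
# as two separate Latin-1 characters.
misread = unquote_plus(sent_utf8, encoding="iso-8859-1")

# The same backend reads the ISO-8859-1 submission correctly.
correct = unquote_plus(sent_latin1, encoding="iso-8859-1")

print(sent_utf8, sent_latin1, misread, correct)
```

The UTF-8 submission carries "ñ" as %C3%B1 and the ISO-8859-1 one as %F1;
only the latter decodes back to the original value on an ISO-8859-1 backend.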

Just speculating, I really don't know the internals of it. But I guess you do :)

Thanks in advance. Best regards,

Antonio



----- Original Message -----
From: "Nick Kew"
To: "users"
Sent: Friday, 8 May 2020 9:22:40
Subject: Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities

> On 8 May 2020, at 07:28, Antonio Suárez Pozuelo wrote:
> 
> Hi Nick,
> 
> Your glass of wine was inspiring: just removed
> 
>>   ProxyHTMLCharsetOut *   # Backend (Tomcat) charset is ISO-8859-1
> 
> and the problem's gone!

OK, thanks for confirming it.  I'm pretty sure now what's happening.

Libxml2 uses unicode (utf-8) internally, so for i18n to work, your iso-8859-1
gets converted before feeding to the parser.  But HTML entities are not
preserved: they get converted to their unicode representations.

ProxyHTMLCharsetOut is kind-of an afterthought: it converts unicode to
your choice of encoding.  But it doesn't deal with HTML entities.  So when
it encounters unicode sequences for your "&rarr;" et al, it just tries to
convert unicode to latin-1, and fails when there is no latin-1 representation.

As far as I know this doesn't really matter: unicode support is pretty-near
universal, so just leaving it in place has no real downside.  I'll think about
whether there's an easy fix to ProxyHTMLCharsetOut for cases like this,
but will more likely just add a note to the docs about the limitation.

> FYI, by increasing LogLevel to INFO, error log shows:

Basically just shows the problem isn't your backend.  My first reply was
leading to "if the debug info doesn't tell us what's wrong, I'll ask for a
test case to try and replicate the problem".  No need for that now!

Thanks for the report!

-- 
Nick Kew



Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities

2020-05-08 Thread Nick Kew



> On 8 May 2020, at 07:28, Antonio Suárez Pozuelo wrote:
> 
> Hi Nick,
> 
> Your glass of wine was inspiring: just removed
> 
>>   ProxyHTMLCharsetOut *   # Backend (Tomcat) charset is ISO-8859-1
> 
> and the problem's gone!

OK, thanks for confirming it.  I'm pretty sure now what's happening.

Libxml2 uses unicode (utf-8) internally, so for i18n to work, your iso-8859-1
gets converted before feeding to the parser.  But HTML entities are not
preserved: they get converted to their unicode representations.

ProxyHTMLCharsetOut is kind-of an afterthought: it converts unicode to
your choice of encoding.  But it doesn't deal with HTML entities.  So when
it encounters unicode sequences for your "&rarr;" et al, it just tries to
convert unicode to latin-1, and fails when there is no latin-1 representation.

As far as I know this doesn't really matter: unicode support is pretty-near
universal, so just leaving it in place has no real downside.  I'll think about
whether there's an easy fix to ProxyHTMLCharsetOut for cases like this,
but will more likely just add a note to the docs about the limitation.
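
The failure mode is easy to see in isolation. The following is a Python
illustration only, not what libxml2 or mod_proxy_html do internally: an
entity is decoded to its Unicode character, a strict conversion to
ISO-8859-1 then has nothing to map it to, and one conceivable alternative
(an assumption here, not a proposed fix) is to write such characters back as
numeric character references. The helper names are hypothetical.

```python
import html
from typing import Optional

# Markup as a backend page might contain it, with named entities.
source = "next &rarr; caf&eacute;"

# A parser working in Unicode replaces each entity with its character.
decoded = html.unescape(source)


def encode_strict(text: str, charset: str) -> Optional[bytes]:
    """Encode text in charset, or return None if a character has no representation."""
    try:
        return text.encode(charset)
    except UnicodeEncodeError:
        return None


def encode_with_references(text: str, charset: str) -> bytes:
    """Encode text, emitting numeric character references for unmappable characters."""
    return text.encode(charset, errors="xmlcharrefreplace")
```

Here encode_strict(decoded, "iso-8859-1") returns None because U+2192 has no
ISO-8859-1 code, whereas "é" alone converts without trouble.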

> FYI, by increasing LogLevel to INFO, error log shows:

Basically just shows the problem isn't your backend.  My first reply was
leading to "if the debug info doesn't tell us what's wrong, I'll ask for a
test case to try and replicate the problem".  No need for that now!

Thanks for the report!

-- 
Nick Kew



Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities

2020-05-07 Thread Antonio Suárez Pozuelo
Hi Nick,

Your glass of wine was inspiring: just removed

>ProxyHTMLCharsetOut *   # Backend (Tomcat) charset is ISO-8859-1

and the problem's gone!

Also commented out 

>ProxyHTMLMeta   on

with no noticeable change in behaviour. As per the docs "turning ProxyHTMLMeta 
Off will give a small performance boost", so off it goes.

Thank you so much!

FYI, by increasing LogLevel to INFO, error log shows:

[Fri May 08 07:42:35.790051 2020] [xml2enc:info] [pid 13183:tid 
139823008806656] [client _redacted_:55344] AH01431: Got charset ISO-8859-1 from 
HTTP headers

So our backend's stated charset is ISO-8859-1. 

About your questions:

> Are you sure your backend is sending literally those entities, as opposed to 
> their byte representations in its charset?
> Note that libxml2 is doing the hard work here: what version of libxml2 do you 
> have?

"Faulty" entities are coded verbatim (i.e. "&rarr;") in the backend JSP pages, 
and are rendered exactly that way in non-proxied responses. libxml2 version is 
2.9.4 (within Debian 10.3 amd64).

I can do further testing, if you need it.

FYI 2 (side point):

>
>ProxyHTMLURLMap "/backend-path/(.*)" "/$1" R

We had some previous experience with proxy URL mapping, and "/frontend-path/" 
<-> "/backend-path/" has always worked fine for us without the regexp. But 
mapping the root frontend path "/" gave us some trouble; maybe there's a better 
solution, but that regexp solved the issue.

Thank you again. Best regards,

Antonio

----- Original Message -----
From: "Nick Kew"
To: "users"
Sent: Friday, 8 May 2020 1:49:25
Subject: Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities

> On 7 May 2020, at 17:52, Antonio Suárez Pozuelo wrote:
> 
> Hi there,

Further to my last reply, I can see what may possibly be wrong:

> We have a Tomcat 8 backend server behind an Apache 2.4 proxy. Our Apache conf:
> 
>ProxyPreserveHost   on
>ProxyHTMLEnable on
>ProxyHTMLExtended   on

You probably don't want that.

>ProxyHTMLCharsetOut *   # Backend (Tomcat) charset is ISO-8859-1

I suspect that is very probably the culprit.
Does removing it fix the problem?


>ProxyHTMLMeta   on

You probably also don't want that.  I think the documentation of that
is misleadingly out-of-date, but I don't want to check now (late, and
after a glass of wine).

-- 
Nick Kew





Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities

2020-05-07 Thread Nick Kew



> On 7 May 2020, at 17:52, Antonio Suárez Pozuelo wrote:
> 
> Hi there,

Further to my last reply, I can see what may possibly be wrong:

> We have a Tomcat 8 backend server behind an Apache 2.4 proxy. Our Apache conf:
> 
>ProxyPreserveHost   on
>ProxyHTMLEnable on
>ProxyHTMLExtended   on

You probably don't want that.

>ProxyHTMLCharsetOut *   # Backend (Tomcat) charset is ISO-8859-1

I suspect that is very probably the culprit.
Does removing it fix the problem?


>ProxyHTMLMeta   on

You probably also don't want that.  I think the documentation of that
is misleadingly out-of-date, but I don't want to check now (late, and
after a glass of wine).

-- 
Nick Kew





Re: [users@httpd] proxy_html / xml2enc won't handle certain HTML entities

2020-05-07 Thread Nick Kew



> On 7 May 2020, at 17:52, Antonio Suárez Pozuelo wrote:
> 
> Hi there,


>
>ProxyHTMLURLMap "/backend-path/(.*)" "/$1" R

Minor point, no need for that regexp.  Just   /backend-path/  /
(the remainder will be untouched).
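
As a rough illustration of why the capture group adds nothing, here is a
Python sketch comparing a plain prefix replacement with the regexp form for
a link that starts with the backend path. It assumes start-of-URL matching
for the plain form and is not the module's matching code; the function names
and the sample link are hypothetical.

```python
import re


def map_prefix(url: str, from_prefix: str, to_prefix: str) -> str:
    """Swap a leading from_prefix for to_prefix; the rest of the URL is kept as-is."""
    if url.startswith(from_prefix):
        return to_prefix + url[len(from_prefix):]
    return url


def map_regex(url: str) -> str:
    """The regexp form: capture the remainder and re-emit it after the new prefix."""
    return re.sub(r"/backend-path/(.*)", r"/\1", url, count=1)


# Hypothetical link as it might appear in a backend page.
link = "/backend-path/app/page.jsp?id=7"
print(map_prefix(link, "/backend-path/", "/"), map_regex(link))
```

Both calls rewrite the sample link to /app/page.jsp?id=7, and a link that
does not contain the backend path is returned unchanged by either function.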

> Everything works fine but for a few HTML entities; detected so far: → 
> ← ↑ ↓ ▸. Whenever the backend response HTML includes 
> one of those:

Are you sure your backend is sending literally those entities, as opposed to
their byte representations in its charset?

> 1. Apache's response's erratic: it either drops parts of the HTML or resets 
> the connection altogether.
> 
> 2. Error log shows:
> 
>[Thu May 07 18:07:54.934922 2020] [xml2enc:error] [pid 12355:tid 
> 139930604844800] [client (_redacted_):33206] AH01444: Skipping invalid 
> byte(s) in input stream!, referer: (_redacted_)

That means there's something in the input stream that's a mismatch with the
charset detected.  Either that or you've found a bug.  Note that libxml2 is
doing the hard work here: what version of libxml2 do you have?

And is there any other filter involved?

> 
> By the way, we've found that replacing 
> 
>ProxyHTMLEnable on
> 
> with
> 
>SetOutputFilter proxy-html

Yes, that configuration skips mod_xml2enc entirely, which means you have no i18n
support in mod_proxy_html.  So

> (although it has some drawbacks with non-english characters, so it's of 
> no use for us).

is expected behaviour.

> Are we doing something wrong, maybe?

Nothing obviously wrong, though your backend/app may be misconfigured.

If you increase LogLevel to INFO, mod_xml2enc will report what charset it
detects, so you can check whether that's what you think it should be.
Probably better to increase it to DEBUG, which will get quite a lot more
info from mod_xml2enc.

If that doesn't help you figure it out, post the mod_xml2enc messages at level
DEBUG here and I'll take a look.

-- 
Nick Kew



[users@httpd] proxy_html / xml2enc won't handle certain HTML entities

2020-05-07 Thread Antonio Suárez Pozuelo
Hi there,

We have a Tomcat 8 backend server behind an Apache 2.4 proxy. Our Apache conf:

    ProxyPreserveHost   on
    ProxyHTMLEnable     on
    ProxyHTMLExtended   on
    ProxyHTMLCharsetOut *   # Backend (Tomcat) charset is ISO-8859-1
    ProxyHTMLMeta       on
    ProxyHTMLDocType    "" XML

    ProxyHTMLURLMap               "/backend-path/(.*)" "/$1" R
    ProxyPass                     "http://backend-host:8080/backend-path/"
    ProxyPassReverse              "http://backend-host:8080/backend-path/"
    ProxyPassReverseCookieDomain  backend-host "%{HTTP_HOST}s"
    ProxyPassReverseCookiePath    "/backend-path/" "/"


Everything works fine but for a few HTML entities; detected so far: → 
← ↑ ↓ ▸. Whenever the backend response HTML includes one 
of those:

1. Apache's response's erratic: it either drops parts of the HTML or resets the 
connection altogether.

2. Error log shows:

[Thu May 07 18:07:54.934922 2020] [xml2enc:error] [pid 12355:tid 
139930604844800] [client (_redacted_):33206] AH01444: Skipping invalid byte(s) 
in input stream!, referer: (_redacted_)

First experienced on version 2.4.38 (Debian-shipped); also verified on version 
2.4.43 (just built from source on Debian 10.3 amd64).

As far as I know, those "faulty" HTML entities are fully standard. Some others 
such as &nbsp; or letters with diacritics (ñ, á...) pass through 
just fine.

By the way, we've found that replacing 

ProxyHTMLEnable on

with

SetOutputFilter proxy-html

works fine for those HTML entities (although it has some drawbacks with 
non-english characters, so it's of no use for us).

Are we doing something wrong, maybe?

Thank you all in advance. Best regards,

Antonio
