Re: Tomcat 8.5.19 corrupts static text files encoded with UTF-8

2017-07-30 Thread Mark Thomas
On 30/07/17 13:39, Konstantin Preißer wrote:
> Hi Mark,
> 
>> -----Original Message-----
>> From: Mark Thomas [mailto:ma...@apache.org]
>> Sent: Sunday, July 30, 2017 12:40 PM
>>
>> (...)
>>
>>> Stuff breaking is unintentional and is a bug. Unfortunately, it appears
>>> that you have stumbled across a bug that wasn't detected in any of the
>>> last three attempted releases.
>>>
>>> I think (but I can't be sure without a test case) the problem stems from
>>> the case where a character set is not explicitly defined for the
>>> response. If that is the case, it should be a fairly simple fix.
>>>
>>> My preference is to keep the edge case handling I recently added if at
>>> all possible and prevent the conversion from applying when it is not
>>> required.
>>
>> Konstantin,
>>
>> If you can try one of the following patches and report back whether it
>> fixes the problem that would be very helpful.
>>
>> Tomcat 9.0.x
>> http://home.apache.org/~markt/patches/2017-07-30-default-servlet-encoding-tc9-v1.patch
>>
>> Tomcat 8.5.x
>> http://home.apache.org/~markt/patches/2017-07-30-default-servlet-encoding-tc85-v1.patch
> 
> Thank you very much for your fast feedback. I applied the patch for Tomcat 
> 8.5.x and it seems to fix the issue: Static text/JavaScript files are served 
> untouched (their encoding is not changed), which means JavaScript files 
> encoded as UTF-8 (without BOM) are working again in the browser.

Thanks for the confirmation.

The more I think about this, the more I'm leaning towards your
suggestion of reverting it for 8.5.x and below anyway.

It should just be addressing the edge cases it was intended to but there
have been too many unintended consequences for my liking.

I'll start that discussion on the dev list.

Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org



RE: Tomcat 8.5.19 corrupts static text files encoded with UTF-8

2017-07-30 Thread Konstantin Preißer
Hi Mark,

> -----Original Message-----
> From: Mark Thomas [mailto:ma...@apache.org]
> Sent: Sunday, July 30, 2017 12:40 PM
> 
> (...)
> 
> > Stuff breaking is unintentional and is a bug. Unfortunately, it appears
> > that you have stumbled across a bug that wasn't detected in any of the
> > last three attempted releases.
> >
> > I think (but I can't be sure without a test case) the problem stems from
> > the case where a character set is not explicitly defined for the
> > response. If that is the case, it should be a fairly simple fix.
> >
> > My preference is to keep the edge case handling I recently added if at
> > all possible and prevent the conversion from applying when it is not
> > required.
> 
> Konstantin,
> 
> If you can try one of the following patches and report back whether it
> fixes the problem that would be very helpful.
> 
> Tomcat 9.0.x
> http://home.apache.org/~markt/patches/2017-07-30-default-servlet-encoding-tc9-v1.patch
> 
> Tomcat 8.5.x
> http://home.apache.org/~markt/patches/2017-07-30-default-servlet-encoding-tc85-v1.patch

Thank you very much for your fast feedback. I applied the patch for Tomcat 
8.5.x and it seems to fix the issue: Static text/JavaScript files are served 
untouched (their encoding is not changed), which means JavaScript files encoded 
as UTF-8 (without BOM) are working again in the browser.


Thanks!

Regards,
Konstantin Preißer





Re: Tomcat 8.5.19 corrupts static text files encoded with UTF-8

2017-07-30 Thread Mark Thomas
On 30/07/17 10:50, Mark Thomas wrote:
> On 30/07/17 10:21, Rémy Maucherat wrote:
>> On Sun, Jul 30, 2017 at 10:59 AM, Konstantin Preißer 
>> wrote:



>>> I honestly don't understand that change. As a web developer, I expect a
>>> web server to serve static files exactly as-is, without trying to convert
>>> the files into another charset and without trying to detect the charset of
>>> the file (unless the server is configured to do so).
> 
> Tomcat is trying to handle various edge cases. These include:
> 
> - Response encoding defined as one charset when serving static content
> that has a different charset (Tomcat used to send the static bytes as-is
> which could result in a broken response in some cases).
> 
> - Static content in one encoding included into a response encoding in a
> different encoding. Again, depending on circumstances, the included
> content would be broken.
> 
>> It probably still does too much right now. Mark made a very complex change,
>> but there's encoding conversion in too many cases maybe. I think there
>> should be conversion only when a writer is used by the default servlet, but
>> we should let the user deal with the other cases.
>>
>> Right now, the code does its conversion when the resource is a text mime
>> type and its encoding doesn't match (which may be accurate, or not, it
>> seems), and in that case it's very broad and the behavior should be
>> optional (off by default IMO). Besides, it's going to perform much worse
>> all of a sudden.
> 
> I agree that the change is complex. I also agree that the conversion
> appears to be kicking in more often than expected.
> 
> I thought we had resolved most of the issues working through the
> problems reported by George Stanchev and that 8.5.19 was unlikely to
> cause further issues.
> 
> I think the key to fixing this is limiting when the conversion is applied.



>>> Further, as a system administrator, I would expect that I can update
>>> Tomcat from x.y.z to x.y.(z+n), without static JavaScript files suddenly
>>> getting broken (which isn't immediately obvious as mostly the script per se
>>> will work, only that some special string characters outside of ASCII are
>>> displayed incorrectly to the user).
>>> Shouldn't such behavior changes be reserved for the next major/minor
>>> version which is not yet stable, in this case Tomcat 9.0.0?
> 
> Stuff breaking is unintentional and is a bug. Unfortunately, it appears
> that you have stumbled across a bug that wasn't detected in any of the
> last three attempted releases.
> 
> I think (but I can't be sure without a test case) the problem stems from
> the case where a character set is not explicitly defined for the
> response. If that is the case, it should be a fairly simple fix.
> 
> My preference is to keep the edge case handling I recently added if at
> all possible and prevent the conversion from applying when it is not
> required.

Konstantin,

If you can try one of the following patches and report back whether it
fixes the problem that would be very helpful.

Tomcat 9.0.x
http://home.apache.org/~markt/patches/2017-07-30-default-servlet-encoding-tc9-v1.patch

Tomcat 8.5.x
http://home.apache.org/~markt/patches/2017-07-30-default-servlet-encoding-tc85-v1.patch

Remy,

The patch above should significantly reduce how often conversion is
applied, limiting it to two cases: where an encoding has been explicitly
defined and the fileEncoding attribute of the DefaultServlet is
configured differently, or when including, since we always need to
remove any BOM in that case.
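
The rule described in the paragraph above can be paraphrased as a small predicate. This is only a sketch of the stated intent, not the code in the patch: the class name, method name and parameters are all hypothetical, and a null response charset is assumed to mean "no encoding explicitly defined".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ConversionRuleSketch {
    /**
     * Hypothetical paraphrase of the rule described in the e-mail;
     * NOT taken from the DefaultServlet source.
     *
     * @param explicitResponseCharset charset explicitly set on the response,
     *                                or null if none was defined (assumption)
     * @param fileEncoding            configured fileEncoding of the servlet
     * @param included                true when the resource is being included
     */
    static boolean shouldConvert(Charset explicitResponseCharset,
                                 Charset fileEncoding,
                                 boolean included) {
        if (included) {
            // When including, any BOM always has to be removed.
            return true;
        }
        // Otherwise convert only if an encoding was explicitly defined
        // and it differs from the configured file encoding.
        return explicitResponseCharset != null
                && !explicitResponseCharset.equals(fileEncoding);
    }

    public static void main(String[] args) {
        // No explicit response charset: serve the bytes untouched.
        System.out.println(shouldConvert(null, StandardCharsets.UTF_8, false));
        // Explicit ISO-8859-1 response, UTF-8 file: conversion applies.
        System.out.println(shouldConvert(StandardCharsets.ISO_8859_1,
                StandardCharsets.UTF_8, false));
    }
}
```

Under this reading, a plain request for a static UTF-8 file with no charset set on the response would never be converted, which matches the behaviour Konstantin reported after applying the patch.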

Is that sufficient or would you still like to see an attemptConversion
attribute added to the DefaultServlet?

Mark




Re: Tomcat 8.5.19 corrupts static text files encoded with UTF-8

2017-07-30 Thread Mark Thomas
On 30/07/17 10:21, Rémy Maucherat wrote:
> On Sun, Jul 30, 2017 at 10:59 AM, Konstantin Preißer 
> wrote:
> 
>> Hi Mark,
>>
>>> -----Original Message-----
>>> From: Mark Thomas [mailto:ma...@apache.org]
>>> Sent: Saturday, July 29, 2017 2:56 PM
>>>
>>>> (...)
>>>>
>>>> Why would Tomcat want to modify static files, instead of just serving
>>>> them as-is?
>>>
>>> Because Tomcat now checks the response encoding and the file encoding
>>> and converts if necessary.
>>>
>>> You probably want to set the fileEncoding init param of the Default
>> servlet to
>>> UTF-8.
>>
>> Thanks. So I set the following parameter in web.xml:
>> <init-param>
>>   <param-name>fileEncoding</param-name>
>>   <param-value>utf-8</param-value>
>> </init-param>
>>
>> The result now is that Tomcat converts the static file without a BOM from
>> UTF-8 to ISO-8859-1, which means my JavaScript files included by the HTML
>> page will still be broken, as the browser expects them to be UTF-8-encoded
>> ...
>>
>> I honestly don't understand that change. As a web developer, I expect a
>> web server to serve static files exactly as-is, without trying to convert
>> the files into another charset and without trying to detect the charset of
>> the file (unless the server is configured to do so).

Tomcat is trying to handle various edge cases. These include:

- Response encoding defined as one charset when serving static content
that has a different charset (Tomcat used to send the static bytes as-is
which could result in a broken response in some cases).

- Static content in one encoding included into a response encoding in a
different encoding. Again, depending on circumstances, the included
content would be broken.
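
To make the first edge case concrete, here is a generic transcoding sketch using only JDK readers and writers. It is not Tomcat's implementation; the class and method names are made up, and the charsets in the example are assumptions chosen for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class TranscodeSketch {
    /** Decodes the file bytes with fileCharset and re-encodes them with responseCharset. */
    static byte[] transcode(byte[] fileBytes, Charset fileCharset, Charset responseCharset) {
        try (Reader in = new InputStreamReader(new ByteArrayInputStream(fileBytes), fileCharset);
             ByteArrayOutputStream sink = new ByteArrayOutputStream();
             Writer out = new OutputStreamWriter(sink, responseCharset)) {
            char[] buf = new char[1024];
            int n;
            while ((n = in.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            out.flush(); // push buffered characters into the byte sink
            return sink.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "Aß" stored as ISO-8859-1 (0x41 0xDF), served in a UTF-8 response.
        byte[] latin1File = {0x41, (byte) 0xDF};
        byte[] body = transcode(latin1File, StandardCharsets.ISO_8859_1, StandardCharsets.UTF_8);
        for (byte b : body) {
            System.out.printf("0x%02X ", b & 0xFF); // 0x41 0xC3 0x9F
        }
        System.out.println();
    }
}
```

The conversion is only correct when the declared file charset really is the encoding of the file, which is exactly the assumption that breaks in the report that started this thread.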

> It probably still does too much right now. Mark made a very complex change,
> but there's encoding conversion in too many cases maybe. I think there
> should be conversion only when a writer is used by the default servlet, but
> we should let the user deal with the other cases.
> 
> Right now, the code does its conversion when the resource is a text mime
> type and its encoding doesn't match (which may be accurate, or not, it
> seems), and in that case it's very broad and the behavior should be
> optional (off by default IMO). Besides, it's going to perform much worse
> all of a sudden.

I agree that the change is complex. I also agree that the conversion
appears to be kicking in more often than expected.

I thought we had resolved most of the issues working through the
problems reported by George Stanchev and that 8.5.19 was unlikely to
cause further issues.

I think the key to fixing this is limiting when the conversion is applied.

>> Bug 49464 [1] mentions that "As per spec the encoding of the page is
>> asssumed to be iso-8859-1.". Do I understand correctly that this refers to
>> the following section "3.7.1 Canonicalization and Text Defaults" of RFC2616?

No. That is the Servlet spec.


>> (...)
>>The "charset" parameter is used with some media types to define the
>>character set (section 3.4) of the data. When no explicit charset
>>parameter is provided by the sender, media subtypes of the "text"
>>type are defined to have a default charset value of "ISO-8859-1" when
>>received via HTTP.
>>
>>
>> But note that RFC7231 says in "Appendix B.  Changes from RFC 2616":
>>
>>The default charset of ISO-8859-1 for text media types has been
>>removed; the default is now whatever the media type definition says.
>>Likewise, special treatment of ISO-8859-1 has been removed from the
>>Accept-Charset header field.  (Section 3.1.1.3 and Section 5.3.3)
>>
>>
>> I found the following page that talks about this change [2] and mentions
>> RFC6657 [3], which describes text/* media registrations with charset handling.
>>
>> While RFC6657 seems to indicate that the default charset of text/plain is
>> US-ASCII (which is not what browsers do), it doesn't seem to indicate a
>> default charset for other types like text/html, text/javascript,
>> application/javascript etc.
>>
>> Browsers (I tested with IE, Firefox and Chrome) already handle the
>> encoding of text-based files where the Content-Type doesn't specify a
>> charset as the user would expect:
>> - For example, with text/html files that don't contain a BOM, they will
>> respect the <meta> element. If a UTF-8 BOM is present, they
>> will interpret it as UTF-8.
>> - If you directly open text/plain, text/css, application/javascript files
>> in a browser, they will check if the file has an UTF-8 BOM, and interpret
>> it as UTF-8 in that case; otherwise, they seem to interpret it as
>> ISO-8859-1/Windows-1252 (or maybe using the default system encoding, I'm
>> not exactly sure about that).
>> - However, if such files (.css and .js) are referenced by a HTML file,
>> browsers will interpret them in the same encoding as the HTML file (if
>> they don't have a BOM), which means if the HTML uses UTF-8, they will
>> interpret .js and .css also as UTF-8 (unless the HTML element uses a
>> 

Re: Tomcat 8.5.19 corrupts static text files encoded with UTF-8

2017-07-30 Thread Rémy Maucherat
On Sun, Jul 30, 2017 at 10:59 AM, Konstantin Preißer 
wrote:

> Hi Mark,
>
> > -----Original Message-----
> > From: Mark Thomas [mailto:ma...@apache.org]
> > Sent: Saturday, July 29, 2017 2:56 PM
> >
> >> (...)
> >>
> > >Why would Tomcat want to modify static files, instead of just serving
> > >them as-is?
> >
> > Because Tomcat now checks the response encoding and the file encoding
> > and converts if necessary.
> >
> > You probably want to set the fileEncoding init param of the Default
> servlet to
> > UTF-8.
>
> Thanks. So I set the following parameter in web.xml:
> <init-param>
>   <param-name>fileEncoding</param-name>
>   <param-value>utf-8</param-value>
> </init-param>
>
> The result now is that Tomcat converts the static file without a BOM from
> UTF-8 to ISO-8859-1, which means my JavaScript files included by the HTML
> page will still be broken, as the browser expects them to be UTF-8-encoded
> ...
>
> I honestly don't understand that change. As a web developer, I expect a
> web server to serve static files exactly as-is, without trying to convert
> the files into another charset and without trying to detect the charset of
> the file (unless the server is configured to do so).
>

It probably still does too much right now. Mark made a very complex change,
but there's encoding conversion in too many cases maybe. I think there
should be conversion only when a writer is used by the default servlet, but
we should let the user deal with the other cases.

Right now, the code does its conversion when the resource is a text mime
type and its encoding doesn't match (which may be accurate, or not, it
seems), and in that case it's very broad and the behavior should be
optional (off by default IMO). Besides, it's going to perform much worse
all of a sudden.

Rémy


>
> Bug 49464 [1] mentions that "As per spec the encoding of the page is
> asssumed to be iso-8859-1.". Do I understand correctly that this refers to
> the following section "3.7.1 Canonicalization and Text Defaults" of RFC2616?
>
> (...)
>The "charset" parameter is used with some media types to define the
>character set (section 3.4) of the data. When no explicit charset
>parameter is provided by the sender, media subtypes of the "text"
>type are defined to have a default charset value of "ISO-8859-1" when
>received via HTTP.
>
>
> But note that RFC7231 says in "Appendix B.  Changes from RFC 2616":
>
>The default charset of ISO-8859-1 for text media types has been
>removed; the default is now whatever the media type definition says.
>Likewise, special treatment of ISO-8859-1 has been removed from the
>Accept-Charset header field.  (Section 3.1.1.3 and Section 5.3.3)
>
>
> I found the following page that talks about this change [2] and mentions
> RFC6657 [3], which describes text/* media registrations with charset handling.
>
> While RFC6657 seems to indicate that the default charset of text/plain is
> US-ASCII (which is not what browsers do), it doesn't seem to indicate a
> default charset for other types like text/html, text/javascript,
> application/javascript etc.
>
> Browsers (I tested with IE, Firefox and Chrome) already handle the
> encoding of text-based files where the Content-Type doesn't specify a
> charset as the user would expect:
> - For example, with text/html files that don't contain a BOM, they will
> respect the <meta> element. If a UTF-8 BOM is present, they
> will interpret it as UTF-8.
> - If you directly open text/plain, text/css, application/javascript files
> in a browser, they will check if the file has an UTF-8 BOM, and interpret
> it as UTF-8 in that case; otherwise, they seem to interpret it as
> ISO-8859-1/Windows-1252 (or maybe using the default system encoding, I'm
> not exactly sure about that).
> - However, if such files (.css and .js) are referenced by a HTML file,
> browsers will interpret them in the same encoding as the HTML file (if
> they don't have a BOM), which means if the HTML uses UTF-8, they will
> interpret .js and .css also as UTF-8 (unless the HTML element uses a
> charset parameter, e.g. 

RE: Tomcat 8.5.19 corrupts static text files encoded with UTF-8

2017-07-30 Thread Konstantin Preißer
Hi Mark,

> -----Original Message-----
> From: Mark Thomas [mailto:ma...@apache.org]
> Sent: Saturday, July 29, 2017 2:56 PM
> 
>> (...)
>> 
> >Why would Tomcat want to modify static files, instead of just serving
> >them as-is?
> 
> Because Tomcat now checks the response encoding and the file encoding
> and converts if necessary.
> 
> You probably want to set the fileEncoding init param of the Default servlet to
> UTF-8.

Thanks. So I set the following parameter in web.xml:

<init-param>
  <param-name>fileEncoding</param-name>
  <param-value>utf-8</param-value>
</init-param>


The result now is that Tomcat converts the static file without a BOM from 
UTF-8 to ISO-8859-1, which means my JavaScript files included by the HTML page 
will still be broken, as the browser expects them to be UTF-8-encoded ...
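
A minimal sketch of why this still breaks, assuming the server now decodes the file correctly as UTF-8 but writes the response body as ISO-8859-1, while the browser keeps decoding it as UTF-8. The class and method names are invented for the example; only standard JDK calls are used.

```java
import java.nio.charset.StandardCharsets;

class Latin1ResponseDemo {
    /** Assumed server behaviour: decode the file as UTF-8, encode the body as ISO-8859-1. */
    static byte[] serveAsLatin1(byte[] utf8File) {
        return new String(utf8File, StandardCharsets.UTF_8)
                .getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Assumed browser behaviour: decode the received body as UTF-8. */
    static String browserDecodesAsUtf8(byte[] body) {
        return new String(body, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] file = {0x41, (byte) 0xC3, (byte) 0x9F}; // "Aß" in UTF-8
        byte[] body = serveAsLatin1(file);              // becomes 0x41 0xDF
        String seen = browserDecodesAsUtf8(body);
        // 0xDF on its own is not valid UTF-8, so Java substitutes U+FFFD.
        System.out.println(seen.indexOf('\uFFFD') >= 0); // true
    }
}
```

The body is valid ISO-8859-1, so nothing is lost on the wire; the damage happens because the consumer was never told about the new encoding.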

I honestly don't understand that change. As a web developer, I expect a web 
server to serve static files exactly as-is, without trying to convert the files 
into another charset and without trying to detect the charset of the file 
(unless the server is configured to do so).

Bug 49464 [1] mentions that "As per spec the encoding of the page is asssumed 
to be iso-8859-1.". Do I understand correctly that this refers to the following 
section "3.7.1 Canonicalization and Text Defaults" of RFC2616?

(...) 
   The "charset" parameter is used with some media types to define the
   character set (section 3.4) of the data. When no explicit charset
   parameter is provided by the sender, media subtypes of the "text"
   type are defined to have a default charset value of "ISO-8859-1" when
   received via HTTP.


But note that RFC7231 says in "Appendix B.  Changes from RFC 2616":

   The default charset of ISO-8859-1 for text media types has been
   removed; the default is now whatever the media type definition says.
   Likewise, special treatment of ISO-8859-1 has been removed from the
   Accept-Charset header field.  (Section 3.1.1.3 and Section 5.3.3)


I found the following page that talks about this change [2] and mentions RFC6657 
[3], which describes text/* media registrations with charset handling.

While RFC6657 seems to indicate that the default charset of text/plain is 
US-ASCII (which is not what browsers do), it doesn't seem to indicate a default 
charset for other types like text/html, text/javascript, application/javascript 
etc.
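
Since the whole question turns on whether a Content-Type carries an explicit charset parameter, a small helper for checking that may be useful when testing. This is a deliberately simplistic sketch (it does not handle semicolons inside quoted values), and the class and method names are hypothetical.

```java
import java.util.Locale;
import java.util.Optional;

class ContentTypeCharset {
    /** Returns the charset parameter of a Content-Type value, if one is present. */
    static Optional<String> charsetOf(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        String[] parts = contentType.split(";");
        // parts[0] is the media type itself; parameters start at index 1.
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String value = param.substring("charset=".length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? Optional.empty() : Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(charsetOf("application/javascript"));   // Optional.empty
        System.out.println(charsetOf("text/html; charset=UTF-8")); // Optional[UTF-8]
    }
}
```

An empty result is the interesting case here: it is the situation in which the receiver has to fall back on a default or on content sniffing.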

Browsers (I tested with IE, Firefox and Chrome) already handle the encoding of 
text-based files where the Content-Type doesn't specify a charset as the user 
would expect:
- For example, with text/html files that don't contain a BOM, they will respect 
the <meta> element. If a UTF-8 BOM is present, they will interpret 
it as UTF-8.
- If you directly open text/plain, text/css, application/javascript files in a 
browser, they will check if the file has an UTF-8 BOM, and interpret it as 
UTF-8 in that case; otherwise, they seem to interpret it as 
ISO-8859-1/Windows-1252 (or maybe using the default system encoding, I'm not 
exactly sure about that).
- However, if such files (.css and .js) are referenced by a HTML file, browsers 
will interpret them in the same encoding as the HTML file (if they don't have 
a BOM), which means if the HTML uses UTF-8, they will interpret .js and .css 
also as UTF-8 (unless the referencing HTML element specifies a charset attribute).
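
The BOM check that these observations rely on can be sketched in a few lines. This only illustrates the UTF-8 case mentioned above; the fallback charset is passed in by the caller because, as noted, the actual browser default is not certain. Class and method names are made up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class BomSniffer {
    /** True if the content starts with the UTF-8 byte order mark EF BB BF. */
    static boolean hasUtf8Bom(byte[] content) {
        return content.length >= 3
                && (content[0] & 0xFF) == 0xEF
                && (content[1] & 0xFF) == 0xBB
                && (content[2] & 0xFF) == 0xBF;
    }

    /** Picks UTF-8 when a BOM is present, otherwise the caller-supplied fallback. */
    static Charset guessCharset(byte[] content, Charset fallback) {
        return hasUtf8Bom(content) ? StandardCharsets.UTF_8 : fallback;
    }

    public static void main(String[] args) {
        byte[] withBom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 0x41};
        byte[] withoutBom = {0x41, (byte) 0xC3, (byte) 0x9F};
        System.out.println(guessCharset(withBom, StandardCharsets.ISO_8859_1));    // UTF-8
        System.out.println(guessCharset(withoutBom, StandardCharsets.ISO_8859_1)); // ISO-8859-1
    }
}
```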

Therefore, I don't see why Tomcat would want to convert static resources to 
other encodings. (I think it should also not try to detect the charset of files 
and then include a "; charset=..." parameter in the Content-Type, as this may 
override the browser's behavior and thus also lead to incorrect decoding of 
JavaScript files that are encoded with UTF-8 without a BOM).


Further, as a system administrator, I would expect that I can update Tomcat 
from x.y.z to x.y.(z+n), without static JavaScript files suddenly getting 
broken (which isn't immediately obvious as mostly the script per se will work, 
only that some special string characters outside of ASCII are displayed 
incorrectly to the user).
Shouldn't such behavior changes be reserved for the next major/minor version 
which is not yet stable, in this case Tomcat 9.0.0?


Thanks!

Regards,
Konstantin Preißer


[1] https://bz.apache.org/bugzilla/show_bug.cgi?id=49464
[2] https://github.com/requests/requests/issues/2086
[3] https://tools.ietf.org/html/rfc6657






Re: Tomcat 8.5.19 corrupts static text files encoded with UTF-8

2017-07-29 Thread Mark Thomas
On 28 July 2017 21:53:27 BST, "Konstantin Preißer" wrote:
>Hi all,
>
>after quite a while I'm reporting back here, because I faced a problem
>after updating to Tomcat 8.5.19: Suddenly, static text files (.txt, .js
>etc.) encoded with UTF-8 (without BOM) are getting corrupted when they
>are served to the browser. This didn't happen with Tomcat 8.5.16.
>
>To reproduce (I'm using Windows 10 Creators Update with Java
>1.8.0_141):
>
>1) Download apache-tomcat-8.5.19-windows-x64.zip and extract it
>2) Open Notepad++ [1] and paste the text "Aß" (without quotes) in a new
>text file. In the Encoding menu, select "UTF-8 without BOM" (if not
>already selected) and then save the textfile in the Tomcat directory to
>"webapps/ROOT/test.txt".
>3) Verify with a hex editor that the text file contains the following 3
>bytes: 0x41 0xC3 0x9F
>4) Now use a browser or some other download tool to make a request to
>"http://localhost:8080/test.txt" and save the text file.
>5) Open the file with a hex editor and notice that the last byte has
>changed: 0x41 0xC3 0x3F
>This means UTF-8 decoding will fail, as the last byte no longer has its
>highest bit set.
>
>In my case, this problem caused strings from (UTF-8) JavaScript files
>to be displayed incorrectly in the browser.
>
>If you do the same with Tomcat 8.5.16, you can see that the text file
>is served correctly.
>(Additionally, I found that Tomcat 8.5.19 uses "Transfer-Encoding:
>chunked" to serve the file, instead of using a "Content-Length: 3"
>header as Tomcat 8.5.16.)
>
>Why would Tomcat want to modify static files, instead of just serving
>them as-is?

Because Tomcat now checks the response encoding and the file encoding and 
converts if necessary.

You probably want to set the fileEncoding init param of the Default servlet to 
UTF-8.

Mark


>Note: Bisecting shows that the problem seems to have been introduced
>with r1800455 [2].
>
>Thanks!
>
>
>Regards,
>Konstantin Preißer
>
>[1] https://notepad-plus-plus.org/
>[2] https://svn.apache.org/viewvc?view=revision=r1800455
>
>
>





Tomcat 8.5.19 corrupts static text files encoded with UTF-8

2017-07-28 Thread Konstantin Preißer
Hi all,

after quite a while I'm reporting back here, because I faced a problem after 
updating to Tomcat 8.5.19: Suddenly, static text files (.txt, .js etc.) encoded 
with UTF-8 (without BOM) are getting corrupted when they are served to the 
browser. This didn't happen with Tomcat 8.5.16.

To reproduce (I'm using Windows 10 Creators Update with Java 1.8.0_141):

1) Download apache-tomcat-8.5.19-windows-x64.zip and extract it
2) Open Notepad++ [1] and paste the text "Aß" (without quotes) in a new text 
file. In the Encoding menu, select "UTF-8 without BOM" (if not already 
selected) and then save the textfile in the Tomcat directory to 
"webapps/ROOT/test.txt".
3) Verify with a hex editor that the text file contains the following 3 bytes: 
0x41 0xC3 0x9F
4) Now use a browser or some other download tool to make a request to 
"http://localhost:8080/test.txt" and save the text file.
5) Open the file with a hex editor and notice that the last byte has changed: 
0x41 0xC3 0x3F
This means UTF-8 decoding will fail, as the last byte no longer has its 
highest bit set.
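
One way to arrive at exactly these three bytes, offered as a guess rather than a diagnosis of Tomcat's code: if the file is decoded as windows-1252 (assumed here to be the platform default on this Windows install) and then re-encoded as ISO-8859-1, 0x9F becomes U+0178, which ISO-8859-1 cannot represent, so Java substitutes '?' (0x3F). The class and method names are invented for the example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class MisdecodeDemo {
    /** Assumption: bytes are decoded as windows-1252, then encoded as ISO-8859-1. */
    static byte[] roundTrip(byte[] fileBytes) {
        String decoded = new String(fileBytes, Charset.forName("windows-1252"));
        // Characters that ISO-8859-1 cannot encode are replaced with '?'.
        return decoded.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** Formats bytes the same way as the report, e.g. "0x41 0xC3 0x9F". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] original = {0x41, (byte) 0xC3, (byte) 0x9F}; // "Aß" in UTF-8
        System.out.println(hex(original));            // 0x41 0xC3 0x9F
        System.out.println(hex(roundTrip(original))); // 0x41 0xC3 0x3F
    }
}
```

The 0xC3 survives because windows-1252 and ISO-8859-1 agree on that byte; only bytes in the 0x80 to 0x9F range are at risk under this assumption.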

In my case, this problem caused strings from (UTF-8) JavaScript files to be 
displayed incorrectly in the browser.

If you do the same with Tomcat 8.5.16, you can see that the text file is served 
correctly.
(Additionally, I found that Tomcat 8.5.19 uses "Transfer-Encoding: chunked" to 
serve the file, instead of using a "Content-Length: 3" header as Tomcat 8.5.16.)
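
For anyone who wants to repeat the comparison without a browser and a hex editor, here is a rough client sketch using the JDK's HttpURLConnection. The URL is the one from the steps above; the class name is hypothetical, and whether the Transfer-Encoding header is exposed may depend on the HTTP client implementation.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

class FetchAndDump {
    /** Formats bytes as "0x41 0xC3 0x9F". */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    /** Reads the stream to the end and returns everything that was read. */
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String target = args.length > 0 ? args[0] : "http://localhost:8080/test.txt";
        HttpURLConnection conn = (HttpURLConnection) new URL(target).openConnection();
        try (InputStream in = conn.getInputStream()) {
            byte[] body = readAll(in);
            System.out.println("Content-Type:      " + conn.getContentType());
            System.out.println("Content-Length:    " + conn.getHeaderField("Content-Length"));
            System.out.println("Transfer-Encoding: " + conn.getHeaderField("Transfer-Encoding"));
            System.out.println("Body bytes:        " + hex(body));
        } finally {
            conn.disconnect();
        }
    }
}
```

Running it once against each Tomcat version should show both the header difference and the changed last byte side by side.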

Why would Tomcat want to modify static files, instead of just serving them 
as-is?

Note: Bisecting shows that the problem seems to have been introduced with 
r1800455 [2].

Thanks!


Regards,
Konstantin Preißer

[1] https://notepad-plus-plus.org/
[2] https://svn.apache.org/viewvc?view=revision=r1800455


