On 30/07/17 10:21, Rémy Maucherat wrote:
> On Sun, Jul 30, 2017 at 10:59 AM, Konstantin Preißer <[email protected]>
> wrote:
> 
>> Hi Mark,
>>
>>> -----Original Message-----
>>> From: Mark Thomas [mailto:[email protected]]
>>> Sent: Saturday, July 29, 2017 2:56 PM
>>>
>>>> (...)
>>>>
>>>> Why would Tomcat want to modify static files, instead of just serving
>>>> them as-is?
>>>
>>> Because Tomcat now checks the response encoding and the file encoding
>>> and converts if necessary.
>>>
>>> You probably want to set the fileEncoding init param of the Default
>>> servlet to UTF-8.
>>
>> Thanks. So I set the following parameter in web.xml:
>>         <init-param>
>>             <param-name>fileEncoding</param-name>
>>             <param-value>utf-8</param-value>
>>         </init-param>
>>
>> The result now is that Tomcat converts the static file without a BOM from
>> UTF-8 to ISO-8859-1, which means my JavaScript files included by the HTML
>> page will still be broken, as the browser expects them to be UTF-8-encoded
>> ...
>>
>> I honestly don't understand that change. As a web developer, I expect a
>> web server to serve static files exactly as-is, without trying to convert
>> the files into another charset and without trying to detect the charset of
>> the file (unless the server is configured to do so).

Tomcat is trying to handle various edge cases. These include:

- Response encoding defined as one charset when serving static content
that has a different charset (Tomcat used to send the static bytes as-is,
which could result in a broken response in some cases).

- Static content in one encoding included into a response that uses a
different encoding. Again, depending on circumstances, the included
content would be broken.
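
For illustration, the kind of conversion involved looks roughly like the
sketch below. This is not the actual DefaultServlet code and the class and
method names are hypothetical; it just shows bytes being decoded with the
configured file encoding and re-encoded by the response writer:

    import java.io.*;
    import java.nio.charset.Charset;

    // Hypothetical sketch only: re-encode a static file from the encoding it
    // is stored in to whatever charset the response writer uses. Not the
    // actual DefaultServlet implementation.
    public class TranscodeSketch {

        static void copyConverted(File file, Charset fileEncoding,
                                  Writer responseWriter) throws IOException {
            // Decode the stored bytes using the configured fileEncoding ...
            try (Reader in = new InputStreamReader(
                    new FileInputStream(file), fileEncoding)) {
                char[] buf = new char[8192];
                int n;
                // ... and let the response Writer re-encode the characters
                // using the response's character encoding.
                while ((n = in.read(buf)) != -1) {
                    responseWriter.write(buf, 0, n);
                }
            }
        }
    }

When the two encodings already match, the stored bytes can simply be streamed
as-is, which is what happened before the change.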

> It probably still does too much right now. Mark made a very complex change,
> but encoding conversion may now be happening in too many cases. I think there
> should only be conversion when a writer is used by the default servlet, and
> we should let the user deal with the other cases.
>
> Right now, the code does its conversion when the resource has a text MIME
> type and its encoding doesn't match (which may or may not be accurate), and
> in that case it's very broad and the behavior should be optional (off by
> default IMO). Besides, it's suddenly going to perform much worse.

I agree that the change is complex. I also agree that the conversion
appears to be kicking in more often than expected.

I thought we had resolved most of the issues while working through the
problems reported by George Stanchev, and that 8.5.19 was unlikely to
cause further issues.

I think the key to fixing this is limiting when the conversion is applied.

>> Bug 49464 [1] mentions that "As per spec the encoding of the page is
>> assumed to be iso-8859-1." Do I understand correctly that this refers to
>> the following section "3.7.1 Canonicalization and Text Defaults" of RFC2616?

No. That is the Servlet spec.


>>     (...)
>>    The "charset" parameter is used with some media types to define the
>>    character set (section 3.4) of the data. When no explicit charset
>>    parameter is provided by the sender, media subtypes of the "text"
>>    type are defined to have a default charset value of "ISO-8859-1" when
>>    received via HTTP.
>>
>>
>> But note that RFC7231 says in "Appendix B. Changes from RFC 2616":
>>
>>    The default charset of ISO-8859-1 for text media types has been
>>    removed; the default is now whatever the media type definition says.
>>    Likewise, special treatment of ISO-8859-1 has been removed from the
>>    Accept-Charset header field.  (Section 3.1.1.3 and Section 5.3.3)
>>
>>
>> I found the following page that talks about this change [2]; it mentions
>> RFC6657 [3], which describes charset handling for text/* media type
>> registrations.
>>
>> While RFC6657 seems to indicate that the default charset of text/plain is
>> US-ASCII (which is not what browsers do), it doesn't seem to indicate a
>> default charset for other types like text/html, text/javascript,
>> application/javascript etc.
>>
>> Browsers (I tested with IE, Firefox and Chrome) already handle the
>> encoding of text-based files where the Content-Type doesn't specify a
>> charset as the user would expect:
>> - For example, with text/html files that don't contain a BOM, they will
>> respect the <meta charset=...> element. If a UTF-8 BOM is present, they
>> will interpret the file as UTF-8.
>> - If you directly open text/plain, text/css or application/javascript
>> files in a browser, they will check whether the file has a UTF-8 BOM and
>> interpret it as UTF-8 in that case; otherwise, they seem to interpret it
>> as ISO-8859-1/Windows-1252 (or maybe the default system encoding, I'm not
>> exactly sure about that).
>> - However, if such files (.css and .js) don't have a BOM and are
>> referenced by an HTML file, browsers will interpret them in the same
>> encoding as the HTML file, which means that if the HTML uses UTF-8, they
>> will also interpret the .js and .css files as UTF-8 (unless the HTML
>> element specifies a charset, e.g. <script src="script.js"
>> charset="windows-1252"></script>).
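
As an aside on the BOM check described in the list above: detecting a UTF-8
BOM only requires looking at the first three bytes of the data. A rough
sketch (not what any particular browser or Tomcat actually does):

    import java.io.IOException;
    import java.io.PushbackInputStream;

    // Rough sketch of a UTF-8 BOM check: the BOM is the byte sequence
    // 0xEF 0xBB 0xBF at the very start of the data.
    public class BomSketch {

        // The PushbackInputStream must be created with a pushback buffer of
        // at least 3 bytes, e.g. new PushbackInputStream(stream, 3).
        static boolean startsWithUtf8Bom(PushbackInputStream in)
                throws IOException {
            byte[] first = new byte[3];
            int n = in.read(first);
            boolean bom = n == 3
                    && (first[0] & 0xFF) == 0xEF
                    && (first[1] & 0xFF) == 0xBB
                    && (first[2] & 0xFF) == 0xBF;
            if (!bom && n > 0) {
                // Not a BOM: push the bytes back so they are not lost.
                in.unread(first, 0, n);
            }
            return bom;
        }
    }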
>>
>> Therefore, I don't see why Tomcat would want to convert static resources
>> to other encodings. (I think it should also not try to detect the charset
>> of files and then include a "; charset=..." parameter in the Content-Type,
>> as this may override the browser's behavior and thus also lead to incorrect
>> decoding of JavaScript files that are encoded with UTF-8 without a BOM).
>>
>>
>> Further, as a system administrator, I would expect to be able to update
>> Tomcat from x.y.z to x.y.(z+n) without static JavaScript files suddenly
>> breaking (which isn't immediately obvious, as the script itself will mostly
>> still work; only some string characters outside of ASCII are displayed
>> incorrectly to the user).
>> Shouldn't such behavior changes be reserved for the next major/minor
>> version that is not yet stable, in this case Tomcat 9.0.0?

Stuff breaking is unintentional and is a bug. Unfortunately, it appears
that you have stumbled across a bug that wasn't detected in any of the
last three attempted releases.

I think (but I can't be sure without a test case) the problem stems from
the case where a character set is not explicitly defined for the
response. If that is the case, it should be a fairly simple fix.

My preference is to keep the edge case handling I recently added if at
all possible and prevent the conversion from applying when it is not
required.
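
If that is indeed the trigger, the guard would conceptually look something
like the sketch below. This is purely illustrative; the parameter names are
hypothetical and it is not the actual DefaultServlet logic:

    import java.nio.charset.Charset;

    // Hypothetical sketch of the guard being discussed: only convert when the
    // response charset was explicitly set and actually differs from the
    // encoding the file is stored in.
    public class ConversionGuardSketch {

        static boolean conversionNeeded(Charset fileEncoding,
                                        Charset responseCharset,
                                        boolean responseCharsetExplicitlySet) {
            if (!responseCharsetExplicitlySet) {
                // No explicit response charset: serve the stored bytes as-is.
                return false;
            }
            // Convert only when the two encodings really differ.
            return !fileEncoding.equals(responseCharset);
        }
    }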

Mark


>>
>>
>> Thanks!
>>
>> Regards,
>> Konstantin Preißer
>>
>>
>> [1] https://bz.apache.org/bugzilla/show_bug.cgi?id=49464
>> [2] https://github.com/requests/requests/issues/2086
>> [3] https://tools.ietf.org/html/rfc6657
>>
>>
>>
> 


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
