On 30/07/17 10:21, Rémy Maucherat wrote:
> On Sun, Jul 30, 2017 at 10:59 AM, Konstantin Preißer <[email protected]>
> wrote:
>
>> Hi Mark,
>>
>>> -----Original Message-----
>>> From: Mark Thomas [mailto:[email protected]]
>>> Sent: Saturday, July 29, 2017 2:56 PM
>>>
>>>> (...)
>>>>
>>>> Why would Tomcat want to modify static files, instead of just serving
>>>> them as-is?
>>>
>>> Because Tomcat now checks the response encoding and the file encoding
>>> and converts if necessary.
>>>
>>> You probably want to set the fileEncoding init param of the Default
>>> servlet to UTF-8.
>>
>> Thanks. So I set the following parameter in web.xml:
>> <init-param>
>>     <param-name>fileEncoding</param-name>
>>     <param-value>utf-8</param-value>
>> </init-param>
>>
>> The result now is that Tomcat converts the static file without a BOM
>> from UTF-8 to ISO-8859-1, which means my JavaScript files included by
>> the HTML page will still be broken, as the browser expects them to be
>> UTF-8-encoded ...
>>
>> I honestly don't understand that change. As a web developer, I expect a
>> web server to serve static files exactly as-is, without trying to
>> convert the files into another charset and without trying to detect the
>> charset of the file (unless the server is configured to do so).
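Before getting into the why: for anyone following the thread, here is a
minimal, standalone Java sketch of what that conversion amounts to when the
file is UTF-8 but the response charset falls back to ISO-8859-1. It is
purely illustrative (the class name and file content are made up), not the
actual DefaultServlet code:

import java.nio.charset.StandardCharsets;

public class EncodingConversionSketch {
    public static void main(String[] args) {
        // Pretend this is the on-disk content of a UTF-8 encoded .js file.
        String source = "var msg = \"Grüße, 10 €\";";
        byte[] onDisk = source.getBytes(StandardCharsets.UTF_8);

        // Roughly what happens when the file encoding is UTF-8 but the
        // response charset defaults to ISO-8859-1: decode, then re-encode.
        String decoded = new String(onDisk, StandardCharsets.UTF_8);
        byte[] sentToClient = decoded.getBytes(StandardCharsets.ISO_8859_1);

        // A browser that expects UTF-8 (because the including HTML page is
        // UTF-8) now mis-decodes the bytes: 'ü' and 'ß' turn into
        // replacement characters, and '€' was already lost as '?' during
        // re-encoding because it has no ISO-8859-1 mapping.
        System.out.println(new String(sentToClient, StandardCharsets.UTF_8));
    }
}

Run standalone, the non-ASCII characters in the output are mangled, which
matches the broken strings you are seeing in the served .js files.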
Tomcat is trying to handle various edge cases. These include:

- Response encoding defined as one charset when serving static content
  that has a different charset (Tomcat used to send the static bytes
  as-is, which could result in a broken response in some cases).
- Static content in one encoding included into a response that uses a
  different encoding. Again, depending on circumstances, the included
  content would be broken.

> It probably still does too much right now. Mark made a very complex
> change, but there's encoding conversion in too many cases maybe. I think
> there should be conversion only when a writer is used by the default
> servlet, but we should let the user deal with the other cases.
>
> Right now, the code does its conversion when the resource is a text mime
> type and its encoding doesn't match (which may be accurate, or not, it
> seems), and in that case it's very broad and the behavior should be
> optional (off by default IMO). Besides, it's going to perform much worse
> all of a sudden.

I agree that the change is complex. I also agree that the conversion
appears to be kicking in more often than expected. I thought we had
resolved most of the issues working through the problems reported by
George Stanchev, and that 8.5.19 was unlikely to cause further issues.

I think the key to fixing this is limiting when the conversion is applied.

>> Bug 49464 [1] mentions that "As per spec the encoding of the page is
>> asssumed to be iso-8859-1.". Do I understand correctly that this refers
>> to the following section "3.7.1 Canonicalization and Text Defaults" of
>> RFC 2616?

No. That is the Servlet spec.

>> (...)
>> The "charset" parameter is used with some media types to define the
>> character set (section 3.4) of the data. When no explicit charset
>> parameter is provided by the sender, media subtypes of the "text"
>> type are defined to have a default charset value of "ISO-8859-1" when
>> received via HTTP.
>>
>> But note that RFC 7231 says in "Appendix B. Changes from RFC 2616":
>>
>> The default charset of ISO-8859-1 for text media types has been
>> removed; the default is now whatever the media type definition says.
>> Likewise, special treatment of ISO-8859-1 has been removed from the
>> Accept-Charset header field. (Section 3.1.1.3 and Section 5.3.3)
>>
>> I found the following page that talks about this change [2]; it mentions
>> RFC 6657 [3], which describes the text/* media type registrations and
>> their charset handling.
>>
>> While RFC 6657 seems to indicate that the default charset of text/plain
>> is US-ASCII (which is not what browsers do), it doesn't seem to indicate
>> a default charset for other types like text/html, text/javascript,
>> application/javascript etc.
>>
>> Browsers (I tested with IE, Firefox and Chrome) already handle the
>> encoding of text-based files where the Content-Type doesn't specify a
>> charset as the user would expect:
>> - For example, with text/html files that don't contain a BOM, they will
>>   respect the <meta charset=...> element. If a UTF-8 BOM is present,
>>   they will interpret it as UTF-8.
>> - If you directly open text/plain, text/css or application/javascript
>>   files in a browser, they will check if the file has a UTF-8 BOM, and
>>   interpret it as UTF-8 in that case; otherwise, they seem to interpret
>>   it as ISO-8859-1/Windows-1252 (or maybe using the default system
>>   encoding, I'm not exactly sure about that).
>> - However, if such files (.css and .js) are referenced by an HTML file,
>>   browsers will interpret them in the same encoding as the HTML file (if
>>   they don't have a BOM), which means that if the HTML uses UTF-8, they
>>   will interpret the .js and .css files as UTF-8 as well (unless the
>>   HTML element uses a charset parameter, e.g.
>>   <script src="script.js" charset="windows-1252"></script>).
>>
>> Therefore, I don't see why Tomcat would want to convert static resources
>> to other encodings. (I think it should also not try to detect the
>> charset of files and then include a "; charset=..." parameter in the
>> Content-Type, as this may override the browser's behavior and thus also
>> lead to incorrect decoding of JavaScript files that are encoded as UTF-8
>> without a BOM.)
>>
>> Further, as a system administrator, I would expect that I can update
>> Tomcat from x.y.z to x.y.(z+n) without static JavaScript files suddenly
>> getting broken (which isn't immediately obvious, as mostly the script
>> per se will work, only that some special string characters outside of
>> ASCII are displayed incorrectly to the user).
>> Shouldn't such behavior changes be reserved for the next major/minor
>> version which is not yet stable, in this case Tomcat 9.0.0?

Stuff breaking is unintentional and is a bug. Unfortunately, it appears
that you have stumbled across a bug that wasn't detected in any of the last
three attempted releases.

I think (but I can't be sure without a test case) the problem stems from
the case where a character set is not explicitly defined for the response.
If that is the case, it should be a fairly simple fix.

My preference is to keep the edge case handling I recently added if at all
possible and prevent the conversion from applying when it is not required.

Mark

>> Thanks!
>>
>> Regards,
>> Konstantin Preißer
>>
>> [1] https://bz.apache.org/bugzilla/show_bug.cgi?id=49464
>> [2] https://github.com/requests/requests/issues/2086
>> [3] https://tools.ietf.org/html/rfc6657

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
