Re: distinction between resource charset and format octet decoding
On 2/6/2020 10:44 AM, Mark Thomas wrote:
> …
> As of Tomcat 10, conf/web.xml contains the following:
>
>     <request-character-encoding>UTF-8</request-character-encoding>
>     <response-character-encoding>UTF-8</response-character-encoding>
>
> That *should* have the effect you are looking for but I confess I
> haven't tested it in any great detail.

Yes! Oh, that is so wonderful. Thank you!

I brought this issue up on the list over a year ago, and I have since published my entire comprehensive software development course (still being expanded).

https://www.globalmentor.com/courses/softdev/

The course is centered around Tomcat as the server, and the lesson on HTML forms contains a section warning to use ``.

https://www.globalmentor.com/courses/softdev/html-forms

Once Tomcat 10 is released I'll be able to update this note as well.

Thanks again!

Garret

-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org
Re: distinction between resource charset and format octet decoding
On 06/02/2020 13:39, Garret Wilson wrote:
> On 2/6/2020 10:36 AM, Mark Thomas wrote:
>> …
>> Whether Tomcat should ship with this setting present in conf/web.xml
>> by default is something that should probably be discussed for Tomcat
>> 10. Given the current state of the web, there is a reasonable case for
>> doing so. I'll add that to the TOMCAT-NEXT discussion list.
>>> Is this still on the list for discussion for Tomcat 10?
>> No, because it has already been implemented for Tomcat 10 and is in the
>> milestone release currently being voted on.
>
> Waitasec. I'm not used to good news, so I want to make sure I understand
> what you're saying. Are you saying that the proposed Tomcat 10
> implementation already interprets encoded octets in web form submissions
> using UTF-8 by default?!! :O

As of Tomcat 10, conf/web.xml contains the following:

    <request-character-encoding>UTF-8</request-character-encoding>
    <response-character-encoding>UTF-8</response-character-encoding>

That *should* have the effect you are looking for but I confess I haven't tested it in any great detail.

Mark

> It will be a joy to update the FAQ when this is released.
>
> Garret
Re: distinction between resource charset and format octet decoding
On 2/6/2020 10:36 AM, Mark Thomas wrote:
>>> …
>>> Whether Tomcat should ship with this setting present in conf/web.xml
>>> by default is something that should probably be discussed for Tomcat
>>> 10. Given the current state of the web, there is a reasonable case for
>>> doing so. I'll add that to the TOMCAT-NEXT discussion list.
>> Is this still on the list for discussion for Tomcat 10?
> No, because it has already been implemented for Tomcat 10 and is in the
> milestone release currently being voted on.

Waitasec. I'm not used to good news, so I want to make sure I understand what you're saying. Are you saying that the proposed Tomcat 10 implementation already interprets encoded octets in web form submissions using UTF-8 by default?!! :O

It will be a joy to update the FAQ when this is released.

Garret
Re: distinction between resource charset and format octet decoding
On 06/02/2020 13:30, Garret Wilson wrote:
> On 1/8/2019 9:57 PM, Mark Thomas wrote:
>> …
>>
>> Yes, this default is now very out-dated. That is a side-effect of:
>> …
>> As of Servlet 4.0 there is a specification compliant configuration
>> option to change this default to any encoding of your choice.
>> Obviously, UTF-8 is one of the options. You can do this by adding the
>> following to your web.xml:
>> …
>>
>> Whether Tomcat should ship with this setting present in conf/web.xml
>> by default is something that should probably be discussed for Tomcat
>> 10. Given the current state of the web, there is a reasonable case for
>> doing so. I'll add that to the TOMCAT-NEXT discussion list.
>
> Is this still on the list for discussion for Tomcat 10?

No, because it has already been implemented for Tomcat 10 and is in the milestone release currently being voted on.

Mark

> In my opinion it would be a real shame if Tomcat 10 ships with a web
> form encoding default that goes against the WhatWG specifications and
> corrupts non ISO-8859-1 content under modern browsers.
>
> Garret
>
> P.S. Mark, please ignore the other email from my personal email address.
> Because the Tomcat users list doesn't include my name in the "To:"
> header, my email client didn't know to use the correct reply address.
Re: distinction between resource charset and format octet decoding
On 1/8/2019 9:57 PM, Mark Thomas wrote:
> …
> Yes, this default is now very out-dated. That is a side-effect of:
> …
> As of Servlet 4.0 there is a specification compliant configuration
> option to change this default to any encoding of your choice.
> Obviously, UTF-8 is one of the options. You can do this by adding the
> following to your web.xml:
> …
> Whether Tomcat should ship with this setting present in conf/web.xml
> by default is something that should probably be discussed for Tomcat
> 10. Given the current state of the web, there is a reasonable case for
> doing so. I'll add that to the TOMCAT-NEXT discussion list.

Is this still on the list for discussion for Tomcat 10?

In my opinion it would be a real shame if Tomcat 10 ships with a web form encoding default that goes against the WhatWG specifications and corrupts non ISO-8859-1 content under modern browsers.

Garret

P.S. Mark, please ignore the other email from my personal email address. Because the Tomcat users list doesn't include my name in the "To:" header, my email client didn't know to use the correct reply address.
Re: distinction between resource charset and format octet decoding
Sorry to bring up the non-UTF-8 escaped octets form POST problem again, but …

On 1/8/2019 3:57 PM, Mark Thomas wrote:
> …
> As of Servlet 4.0 there is a specification compliant configuration
> option to change this default to any encoding of your choice.
> Obviously, UTF-8 is one of the options. You can do this by adding the
> following to your web.xml:
>
>     <request-character-encoding>UTF-8</request-character-encoding>
>
> If you add it to conf/web.xml it applies to every web application
> deployed to Tomcat. Tomcat 9 uses this in the examples, manager and
> host-manager applications in place of the SetCharacterEncodingFilter.

As you know I've already updated the Tomcat FAQ with the options for forcing Tomcat to interpret form POSTs with any escaped characters using UTF-8 octet sequences (as modern browsers send, and as HTML5 requires) instead of ISO-8859-1 (as the Servlet 4 spec says).

But the problem is worse with the Spring community. If someone is using Spring Boot to create an executable JAR/WAR using embedded Tomcat, Spring Boot does something to configure Tomcat to send the POSTs correctly (that is, as the modern web likes it, not like the Servlet 4 spec says). Unfortunately, if I use Spring Boot to make a WAR which is both a self-contained executing WAR /and/ a WAR deployable on Tomcat, when I deploy the WAR on Tomcat the encoded characters are using escaped ISO-8859-1 octets, so my web app breaks. Yes, the WAR runs differently if using Spring Boot embedded Tomcat or deployed on standalone Tomcat as a WAR.

Spring Boot ignores any `web.xml` file. I guess I could create a `web.xml` file only for standalone Tomcat, but then this freezes Eclipse (as I posted elsewhere) because Eclipse doesn't understand ``.

So like so many things on the web, this is a mess. This is a serious issue, in my opinion. The Servlet 4 specification is out of step with everything else in the ecosystem!

> Whether Tomcat should ship with this setting present in conf/web.xml
> by default is something that should probably be discussed for Tomcat
> 10. Given the current state of the web, there is a reasonable case for
> doing so. I'll add that to the TOMCAT-NEXT discussion list.

Yes, can I just re-second (third?) that motion, and underscore the need for this to be changed in Tomcat 10?

Thanks,

Garret
Re: distinction between resource charset and format octet decoding
On 01/02/2019 17:58, Garret Wilson wrote:
> OK, Mark, I've made my initial edits to the
> https://wiki.apache.org/tomcat/FAQ/CharacterEncoding page. _Please check
> them over!_ This is my first edit to the wiki.
>
> That page has a lot of legacy information, some of which had to do with
> internal Tomcat stuff, and some of which had to do with minute details
> of obsolete RFCs and evolution of browser behavior. I didn't want to
> spend the entire day (week?) on this, so I tried surgically to address
> only the sections relating to POST of
> application/x-www-form-urlencoded and how percent-encoded octets are
> interpreted. I couldn't resist updating the specification links and
> changing just a little prose about URL percent encoding.
>
> There is the risk now that other sections of the page are still outdated
> and conflict with my changes, but most importantly the FAQ should
> provide more complete information on how Tomcat web applications can be
> made to work with modern browsers.
>
> Please let me know if I bungled anything or if I need to clarify something.

LGTM.

> Thanks for letting me participate.

No need to thank us. We should be thanking you. Thank you.

So, what do you want to work on next? ;)

Cheers,

Mark

> Garret
>
> On 1/23/2019 12:26 AM, Mark Thomas wrote:
>> On 23/01/2019 05:07, Garret Wilson wrote:
>>> On 1/15/2019 3:20 AM, Mark Thomas wrote:
>>>> …
>>>> Anything in PascalCase becomes a link to a wiki page of that name.
>>>> Usernames are created in this form so references to the user
>>>> automatically become links to that user's page in the wiki.
>>>
>>> Ah, OK, that explains it. Very good to know. Maybe a little semantic
>>> overloading, but as this is my first wiki account anywhere, I'm guessing
>>> it's typical with whatever software you're using.
>>>
>>> Anyway my account is created, with username `GarretWilson`. After I get
>>> permissions I'll update the info on octet encoding for
>>> application/x-www-form-urlencoded in relation to the servlet spec. It
>>> may not be immediately, but I'll slowly but surely get to it.
>> Karma granted. Happy editing.
>>
>> Cheers,
>>
>> Mark
Re: distinction between resource charset and format octet decoding
OK, Mark, I've made my initial edits to the https://wiki.apache.org/tomcat/FAQ/CharacterEncoding page. _Please check them over!_ This is my first edit to the wiki.

That page has a lot of legacy information, some of which had to do with internal Tomcat stuff, and some of which had to do with minute details of obsolete RFCs and evolution of browser behavior. I didn't want to spend the entire day (week?) on this, so I tried surgically to address only the sections relating to POST of application/x-www-form-urlencoded and how percent-encoded octets are interpreted. I couldn't resist updating the specification links and changing just a little prose about URL percent encoding.

There is the risk now that other sections of the page are still outdated and conflict with my changes, but most importantly the FAQ should provide more complete information on how Tomcat web applications can be made to work with modern browsers.

Please let me know if I bungled anything or if I need to clarify something.

Thanks for letting me participate.

Garret

On 1/23/2019 12:26 AM, Mark Thomas wrote:
> On 23/01/2019 05:07, Garret Wilson wrote:
>> On 1/15/2019 3:20 AM, Mark Thomas wrote:
>>> …
>>> Anything in PascalCase becomes a link to a wiki page of that name.
>>> Usernames are created in this form so references to the user
>>> automatically become links to that user's page in the wiki.
>>
>> Ah, OK, that explains it. Very good to know. Maybe a little semantic
>> overloading, but as this is my first wiki account anywhere, I'm guessing
>> it's typical with whatever software you're using.
>>
>> Anyway my account is created, with username `GarretWilson`. After I get
>> permissions I'll update the info on octet encoding for
>> application/x-www-form-urlencoded in relation to the servlet spec. It
>> may not be immediately, but I'll slowly but surely get to it.
> Karma granted. Happy editing.
>
> Cheers,
>
> Mark
Re: distinction between resource charset and format octet decoding
On 2/1/2019 9:38 AM, Christopher Schultz wrote:
>> Amazing. A close reading of RFC 3986 reveals that there is no clear
>> mandate for UTF-8 in existing URI schemes, even though recommended for
>> new schemes. Anyway, everyone seems to have settled on UTF-8 (Tomcat
>> included), so I'll try to indicate that.
> Wait... are you saying that _it's the Wild West out there?_ ;)
>
> Yes. The web is indeed held together with duct tape and baling wire.
> It's amazing that it works as well as it does.

Hahaha. I'm /so/ happy someone agrees with me!

Here's to improving things with a little JB Weld once in a while. (That's what my grandparents used on the farm when the baling wire and duct tape couldn't handle it.)

Garret
Re: distinction between resource charset and format octet decoding
Garret,

On 2/1/19 11:08, Garret Wilson wrote:
> On 2/1/2019 7:23 AM, Garret Wilson wrote:
>> …
>> * "There /is no default encoding for URIs/ specified anywhere,
>>   which is why there is a lot of confusion when it comes to
>>   decoding these values." Sheesh, this is ancient. I'll correct
>>   it as per https://tools.ietf.org/html/rfc3986#section-2.5 .
>
> Amazing. A close reading of RFC 3986 reveals that there is no
> clear mandate for UTF-8 in existing URI schemes, even though
> recommended for new schemes. Anyway, everyone seems to have settled
> on UTF-8 (Tomcat included), so I'll try to indicate that.

Wait... are you saying that _it's the Wild West out there?_ ;)

Yes. The web is indeed held together with duct tape and baling wire. It's amazing that it works as well as it does.

-chris
Re: distinction between resource charset and format octet decoding
On 2/1/2019 7:23 AM, Garret Wilson wrote:
> …
> * "There /is no default encoding for URIs/ specified anywhere, which
>   is why there is a lot of confusion when it comes to decoding these
>   values." Sheesh, this is ancient. I'll correct it as per
>   https://tools.ietf.org/html/rfc3986#section-2.5 .

Amazing. A close reading of RFC 3986 reveals that there is no clear mandate for UTF-8 in existing URI schemes, even though recommended for new schemes. Anyway, everyone seems to have settled on UTF-8 (Tomcat included), so I'll try to indicate that.

Garret
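As one small data point for the "everyone seems to have settled on UTF-8" observation, the JDK's own `java.net.URI` class interprets percent-encoded octets as UTF-8 when asked for a decoded component, while the raw accessors leave the octets untouched. The sketch below is only an illustration; the class name, method names and the example.com URI are made up for this note and do not come from the thread.

```java
import java.net.URI;

public class UriOctetDemo {

    /** Returns the path exactly as written, with %nn escapes left in place. */
    public static String rawPath(String uri) {
        return URI.create(uri).getRawPath();
    }

    /** Returns the path with %nn escapes decoded; java.net.URI treats the octets as UTF-8. */
    public static String decodedPath(String uri) {
        return URI.create(uri).getPath();
    }

    public static void main(String[] args) {
        // %C3%A9 is the UTF-8 octet sequence for U+00E9 (e-acute).
        String uri = "http://example.com/caf%C3%A9";
        System.out.println(rawPath(uri)); // /caf%C3%A9
        System.out.println(decodedPath(uri).equals("/caf\u00e9")); // true

        // A lone %E9 is e-acute in ISO-8859-1 but is not valid UTF-8,
        // so the decoded path does not contain U+00E9.
        System.out.println(decodedPath("http://example.com/caf%E9").equals("/caf\u00e9")); // false
    }
}
```

The raw form is the only one that is unambiguous; any decoded form bakes in an assumption about the charset, which is the point the old FAQ wording was circling around.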
Re: distinction between resource charset and format octet decoding
Good morning,

I'm just getting to the editing. I'm going to list some thoughts I have as I go through this, so you can verify things:

* The servlet spec links are way out of date. I'll update them.
* "There /is no default encoding for URIs/ specified anywhere, which is why there is a lot of confusion when it comes to decoding these values." Sheesh, this is ancient. I'll correct it as per https://tools.ietf.org/html/rfc3986#section-2.5 .
* "Most of the web uses ISO-8859-1 as the default for query strings." Is this still true?! In light of the above, I would think it is not true, but I wanted to ask, as you know better about what you've seen "in the wild".

Garret
Re: distinction between resource charset and format octet decoding
On 23/01/2019 05:07, Garret Wilson wrote:
> On 1/15/2019 3:20 AM, Mark Thomas wrote:
>> …
>> Anything in PascalCase becomes a link to a wiki page of that name.
>> Usernames are created in this form so references to the user
>> automatically become links to that user's page in the wiki.
>
> Ah, OK, that explains it. Very good to know. Maybe a little semantic
> overloading, but as this is my first wiki account anywhere, I'm guessing
> it's typical with whatever software you're using.
>
> Anyway my account is created, with username `GarretWilson`. After I get
> permissions I'll update the info on octet encoding for
> application/x-www-form-urlencoded in relation to the servlet spec. It
> may not be immediately, but I'll slowly but surely get to it.

Karma granted. Happy editing.

Cheers,

Mark
Re: distinction between resource charset and format octet decoding
On 1/15/2019 3:20 AM, Mark Thomas wrote:
> …
> Anything in PascalCase becomes a link to a wiki page of that name.
> Usernames are created in this form so references to the user
> automatically become links to that user's page in the wiki.

Ah, OK, that explains it. Very good to know. Maybe a little semantic overloading, but as this is my first wiki account anywhere, I'm guessing it's typical with whatever software you're using.

Anyway my account is created, with username `GarretWilson`. After I get permissions I'll update the info on octet encoding for application/x-www-form-urlencoded in relation to the servlet spec. It may not be immediately, but I'll slowly but surely get to it.

Cheers,

Garret
Re: distinction between resource charset and format octet decoding
On 15/01/2019 03:39, Garret Wilson wrote:
> On 1/9/2019 2:30 AM, Mark Thomas wrote:
>> …
>> Create yourself an account at https://wiki.apache.org/tomcat (click
>> login then create an account) and let the list know your ID. Then one of
>> the admins can add you to the allowed editors.
>
> I was just ready to create an account, but I want to verify the details
> so I don't screw things up.
>
> * It asks for a "Name". Is this a username, I suppose? So we don't
>   maintain our "name" separate from our "login username"?

Yes, it is your username. Any linkage from that to your "public name" would be maintained on your user page - if you wish.

> * It says to use "FirstnameLastName". Are you literally wanting us to
>   use "JohnDoe", or can we use "johndoe"? Sorry for the questions; as
>   one who works with protocols all the time, I automatically assume
>   this stuff is important. But I prefer to use lowercase on my
>   usernames; I'm a little confused about why this would want
>   PascalCase for a login username. (I can't think of another system
>   that I use that requires PascalCase usernames.)

Think of it as a SHOULD rather than a MUST.

> My guess is that it's trying to maintain a "human name" and a "username"
> but combine them both into one field or something. I can't say this
> approach is typical…

Anything in PascalCase becomes a link to a wiki page of that name. Usernames are created in this form so references to the user automatically become links to that user's page in the wiki.

It isn't a feature we use much at the moment. A quick check shows that most, but not all, contributors have created their user name in PascalCase. For example, take a look at https://wiki.apache.org/tomcat/AndrewCarr

Mark
Re: distinction between resource charset and format octet decoding
On 1/9/2019 2:30 AM, Mark Thomas wrote:
> …
> Create yourself an account at https://wiki.apache.org/tomcat (click
> login then create an account) and let the list know your ID. Then one of
> the admins can add you to the allowed editors.

I was just ready to create an account, but I want to verify the details so I don't screw things up.

* It asks for a "Name". Is this a username, I suppose? So we don't maintain our "name" separate from our "login username"?
* It says to use "FirstnameLastName". Are you literally wanting us to use "JohnDoe", or can we use "johndoe"? Sorry for the questions; as one who works with protocols all the time, I automatically assume this stuff is important. But I prefer to use lowercase on my usernames; I'm a little confused about why this would want PascalCase for a login username. (I can't think of another system that I use that requires PascalCase usernames.)

My guess is that it's trying to maintain a "human name" and a "username" but combine them both into one field or something. I can't say this approach is typical…

Garret
Re: distinction between resource charset and format octet decoding
On 09/01/2019 00:50, Garret Wilson wrote:
> Hi, Mark, and thanks for some quick response. You provided some info I
> wasn't aware of. Some responses below:
>
> On 1/8/2019 9:57 PM, Mark Thomas wrote:
>> On 08/01/2019 21:31, Garret Wilson wrote:
>>
>>> But as discussed above, this is completely wrong: the resource
>>> character encoding of a request sent in
>>> `application/x-www-form-urlencoded` should have absolutely no bearing
>>> on how the encoded octets within that resource are decoded.
>>
>> That is not the correct interpretation of section 3.12 of the Servlet
>> 4.0 specification (note the section numbers do vary between spec
>> versions). Tomcat implements the correct interpretation - i.e. the
>> charset from the request content-type defines how encoded octets are
>> decoded and, if none is specified, ISO-8859-1 is used as the default.
>
> Ah, I hadn't seen that in the servlet spec. Yes, it seems as if Tomcat
> is correctly following the spec, but I would still say the servlet spec
> is wrong to make any linkage at all between resource encoding and %nn
> interpretation. In fact reading the prose it's not clear to me that the
> servlet spec is even strongly tying the %nn interpretation to the
> encoding. It just seems to say that, unless otherwise specified, the %nn
> interpretation should be ISO-8859-1. And actually that's a step up from
> the HTML 4.01 spec, which in
> https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1 indicates
> that they should be interpreted as US-ASCII codes. :(
>
> You indicate that this is all out of date, and I think we're in
> agreement there. We really, really need to get the next servlet
> specification to remove this part. In fact the servlet specification
> should defer to the official `application/x-www-form-urlencoded`
> specification, which at this point I think is the W3C HTML5 spec, which
> in turn defers to the WHATWG spec (which clearly says that UTF-8 should
> be used). What makes all of this more of a mess is that there seems to be
> no way to work around this from the client side, e.g. by putting
> something in the HTML to indicate UTF-8, as
> `application/x-www-form-urlencoded` doesn't support a `charset` parameter.
>
> Anyway if there are any openings on the committee to update the servlet
> spec, let me know.

That has moved to Eclipse. The process to update the spec is still being defined. The Jakarta EE Servlet API project is the project to get involved in.

>> ...
>> As of Servlet 4.0 there is a specification compliant configuration
>> option to change this default to any encoding of your choice.
>> Obviously, UTF-8 is one of the options. You can do this by adding the
>> following to your web.xml:
>>
>>     <request-character-encoding>UTF-8</request-character-encoding>
>
> Oh, that is really good to know, thanks!! Still I say that the request
> character encoding is orthogonal to the %nn encoding, but, still, it's
> good to have an implementation-agnostic way to do it.
>
>> Whether Tomcat should ship with this setting present in conf/web.xml
>> by default is something that should probably be discussed for Tomcat
>> 10. Given the current state of the web, there is a reasonable case for
>> doing so. I'll add that to the TOMCAT-NEXT discussion list.
>
> Yes please! If I can help in any way, let me know.
>
>> The Tomcat Wiki also needs to be updated to take account of this new
>> configuration option (and the related <response-character-encoding>).
>> Since it is a wiki and this is clearly an issue you care about would
>> you like to tackle that?
>
> Yes, I'd love to. Let me know what permissions I need, etc.

Create yourself an account at https://wiki.apache.org/tomcat (click login then create an account) and let the list know your ID. Then one of the admins can add you to the allowed editors.

Apologies for the hoop jumping required but without the manual approval step for new accounts, the ASF project wikis were being deluged in spam.

Mark

> I have an international flight boarding right now so I have to go, and I
> may not reply for the next few hours, but definitely sign me up.
>
> Thanks,
>
> Garret
Re: distinction between resource charset and format octet decoding
Hi, Mark, and thanks for some quick response. You provided some info I wasn't aware of. Some responses below:

On 1/8/2019 9:57 PM, Mark Thomas wrote:
> On 08/01/2019 21:31, Garret Wilson wrote:
>> But as discussed above, this is completely wrong: the resource
>> character encoding of a request sent in
>> `application/x-www-form-urlencoded` should have absolutely no bearing
>> on how the encoded octets within that resource are decoded.
>
> That is not the correct interpretation of section 3.12 of the Servlet
> 4.0 specification (note the section numbers do vary between spec
> versions). Tomcat implements the correct interpretation - i.e. the
> charset from the request content-type defines how encoded octets are
> decoded and, if none is specified, ISO-8859-1 is used as the default.

Ah, I hadn't seen that in the servlet spec. Yes, it seems as if Tomcat is correctly following the spec, but I would still say the servlet spec is wrong to make any linkage at all between resource encoding and %nn interpretation. In fact reading the prose it's not clear to me that the servlet spec is even strongly tying the %nn interpretation to the encoding. It just seems to say that, unless otherwise specified, the %nn interpretation should be ISO-8859-1. And actually that's a step up from the HTML 4.01 spec, which in https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1 indicates that they should be interpreted as US-ASCII codes. :(

You indicate that this is all out of date, and I think we're in agreement there. We really, really need to get the next servlet specification to remove this part. In fact the servlet specification should defer to the official `application/x-www-form-urlencoded` specification, which at this point I think is the W3C HTML5 spec, which in turn defers to the WHATWG spec (which clearly says that UTF-8 should be used). What makes all of this more of a mess is that there seems to be no way to work around this from the client side, e.g. by putting something in the HTML to indicate UTF-8, as `application/x-www-form-urlencoded` doesn't support a `charset` parameter.

Anyway if there are any openings on the committee to update the servlet spec, let me know.

> ...
> As of Servlet 4.0 there is a specification compliant configuration
> option to change this default to any encoding of your choice.
> Obviously, UTF-8 is one of the options. You can do this by adding the
> following to your web.xml:
>
>     <request-character-encoding>UTF-8</request-character-encoding>

Oh, that is really good to know, thanks!! Still I say that the request character encoding is orthogonal to the %nn encoding, but, still, it's good to have an implementation-agnostic way to do it.

> Whether Tomcat should ship with this setting present in conf/web.xml
> by default is something that should probably be discussed for Tomcat
> 10. Given the current state of the web, there is a reasonable case for
> doing so. I'll add that to the TOMCAT-NEXT discussion list.

Yes please! If I can help in any way, let me know.

> The Tomcat Wiki also needs to be updated to take account of this new
> configuration option (and the related <response-character-encoding>).
> Since it is a wiki and this is clearly an issue you care about would
> you like to tackle that?

Yes, I'd love to. Let me know what permissions I need, etc.

I have an international flight boarding right now so I have to go, and I may not reply for the next few hours, but definitely sign me up.

Thanks,

Garret
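To make the mismatch discussed in this message concrete, here is a minimal Java sketch of what happens when a browser percent-encodes a form value as UTF-8 octets and the receiver interprets those octets as ISO-8859-1. It uses only the JDK's `URLEncoder` and `URLDecoder` (the `Charset` overloads need Java 10 or later) and is not Tomcat's actual parameter-parsing code; the class and method names are invented for this note.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class FormOctetDecodingDemo {

    /** Percent-encodes a form value as UTF-8 octets, as modern browsers do. */
    public static String encodeAsBrowser(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    /** Interprets the %nn octets as UTF-8 (what the WHATWG rules expect). */
    public static String decodeAsUtf8(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.UTF_8);
    }

    /** Interprets the %nn octets as ISO-8859-1 (the Servlet 4.0 default). */
    public static String decodeAsLatin1(String encoded) {
        return URLDecoder.decode(encoded, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // ends in e-acute, written as an escape
        String encoded = encodeAsBrowser(original);
        System.out.println(encoded); // caf%C3%A9

        // Same octets, two interpretations.
        System.out.println(original.equals(decodeAsUtf8(encoded)));   // true
        System.out.println(original.equals(decodeAsLatin1(encoded))); // false

        // The ISO-8859-1 reading turns the two octets C3 A9 into two characters.
        System.out.println(decodeAsLatin1(encoded).equals("caf\u00c3\u00a9")); // true
    }
}
```

Nothing in the encoded body itself says which reading is intended, which is why the choice has to come from the server's configuration or from the request's declared charset.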
Re: distinction between resource charset and format octet decoding
On 08/01/2019 21:31, Garret Wilson wrote:
> But as discussed above, this is completely wrong: the resource
> character encoding of a request sent in
> `application/x-www-form-urlencoded` should have absolutely no bearing
> on how the encoded octets within that resource are decoded.

That is not the correct interpretation of section 3.12 of the Servlet 4.0 specification (note the section numbers do vary between spec versions). Tomcat implements the correct interpretation - i.e. the charset from the request content-type defines how encoded octets are decoded and, if none is specified, ISO-8859-1 is used as the default.

Yes, this default is now very out-dated. That is a side-effect of:

- how long the Servlet specification has been around
- the very conservative approach taken by Java EE in terms of backwards compatibility (once set, defaults are very rarely - if ever - changed)
- arguably missed opportunities to address this issue prior to Servlet 4.0

As of Servlet 4.0 there is a specification compliant configuration option to change this default to any encoding of your choice. Obviously, UTF-8 is one of the options. You can do this by adding the following to your web.xml:

    <request-character-encoding>UTF-8</request-character-encoding>

If you add it to conf/web.xml it applies to every web application deployed to Tomcat. Tomcat 9 uses this in the examples, manager and host-manager applications in place of the SetCharacterEncodingFilter.

Whether Tomcat should ship with this setting present in conf/web.xml by default is something that should probably be discussed for Tomcat 10. Given the current state of the web, there is a reasonable case for doing so. I'll add that to the TOMCAT-NEXT discussion list.

The Tomcat Wiki also needs to be updated to take account of this new configuration option (and the related <response-character-encoding>). Since it is a wiki and this is clearly an issue you care about would you like to tackle that?

Mark
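The selection rule described in this message can be sketched in a few lines of Java: use the charset declared on the request's Content-Type if there is one, otherwise the configured default, otherwise ISO-8859-1. This is a simplified illustration under those assumptions, not Tomcat's implementation; `RequestCharsetResolver` and `resolve` are hypothetical names, and the header parsing is deliberately naive.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

public class RequestCharsetResolver {

    /**
     * Hypothetical helper that picks the charset used to decode form octets.
     *
     * @param contentType       value of the Content-Type header, or null
     * @param configuredDefault charset configured for the application
     *                          (the role played by request-character-encoding), or null
     */
    public static Charset resolve(String contentType, Charset configuredDefault) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String param = part.trim();
                if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = param.substring("charset=".length()).trim();
                    // Strip optional surrounding double quotes.
                    if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                        name = name.substring(1, name.length() - 1);
                    }
                    try {
                        return Charset.forName(name);
                    } catch (IllegalArgumentException e) {
                        // Illegal or unsupported charset name: fall through to the defaults.
                    }
                }
            }
        }
        // No usable charset on the request: configured default, else the spec's ISO-8859-1.
        return configuredDefault != null ? configuredDefault : StandardCharsets.ISO_8859_1;
    }

    public static void main(String[] args) {
        String form = "application/x-www-form-urlencoded";
        System.out.println(resolve(form + "; charset=UTF-8", null)); // UTF-8
        System.out.println(resolve(form, null));                     // ISO-8859-1
        System.out.println(resolve(form, StandardCharsets.UTF_8));   // UTF-8
    }
}
```

Browsers normally send the bare media type for form posts, so in practice the second and third cases are the ones that matter, and the configured default decides whether UTF-8 octets are read correctly.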