Re: distinction between resource charset and format octet decoding

2020-02-06 Thread Garret Wilson

On 2/6/2020 10:44 AM, Mark Thomas wrote:

…
As of Tomcat 10, conf/web.xml contains the following:


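<request-character-encoding>UTF-8</request-character-encoding>
<response-character-encoding>UTF-8</response-character-encoding>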

That *should* have the effect you are looking for but I confess I
haven't tested it in any great detail.


Yes! Oh, that is so wonderful. Thank you!

I brought this issue up on the list over a year ago, and I have since 
published my entire comprehensive software development course (still 
being expanded).


https://www.globalmentor.com/courses/softdev/

The course is centered around Tomcat as the server, and the lesson on 
HTML forms contains a section warning to use ``.


https://www.globalmentor.com/courses/softdev/html-forms

Once Tomcat 10 is released I'll be able to update this note as well.

Thanks again!

Garret


-
To unsubscribe, e-mail: users-unsubscr...@tomcat.apache.org
For additional commands, e-mail: users-h...@tomcat.apache.org



Re: distinction between resource charset and format octet decoding

2020-02-06 Thread Mark Thomas
On 06/02/2020 13:39, Garret Wilson wrote:
> On 2/6/2020 10:36 AM, Mark Thomas wrote:
>> …
>>>> Whether Tomcat should ship with this setting present in conf/web.xml
>>>> by default is something that should probably be discussed for Tomcat
>>>> 10. Given the current state of the web, there is a reasonable case for
>>>> doing so. I'll add that to the TOMCAT-NEXT discussion list.
>>> Is this still on the list for discussion for Tomcat 10?
>> No, because it has already been implemented for Tomcat 10 and is in the
>> milestone release currently being voted on.
> 
> Waitasec. I'm not used to good news, so I want to make sure I understand
> what you're saying. Are you saying that the proposed Tomcat 10
> implementation already interprets encoded octets in web form submissions
> using UTF-8 by default?!! :O

As of Tomcat 10, conf/web.xml contains the following:


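<request-character-encoding>UTF-8</request-character-encoding>
<response-character-encoding>UTF-8</response-character-encoding>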

That *should* have the effect you are looking for but I confess I
haven't tested it in any great detail.

Mark


> 
> It will be a joy to update the FAQ when this is released.
> 
> Garret
> 
> 
> 





Re: distinction between resource charset and format octet decoding

2020-02-06 Thread Garret Wilson

On 2/6/2020 10:36 AM, Mark Thomas wrote:

…

Whether Tomcat should ship with this setting present in conf/web.xml
by default is something that should probably be discussed for Tomcat
10. Given the current state of the web, there is a reasonable case for
doing so. I'll add that to the TOMCAT-NEXT discussion list.

Is this still on the list for discussion for Tomcat 10?

No, because it has already been implemented for Tomcat 10 and is in the
milestone release currently being voted on.


Waitasec. I'm not used to good news, so I want to make sure I understand 
what you're saying. Are you saying that the proposed Tomcat 10 
implementation already interprets encoded octets in web form submissions 
using UTF-8 by default?!! :O


It will be a joy to update the FAQ when this is released.

Garret





Re: distinction between resource charset and format octet decoding

2020-02-06 Thread Mark Thomas
On 06/02/2020 13:30, Garret Wilson wrote:
> On 1/8/2019 9:57 PM, Mark Thomas wrote:
>> …
>>
>> Yes, this default is now very out-dated. That is a side-effect of:
>> …
>> As of Servlet 4.0 there is a specification compliant configuration
>> option to change this default to any encoding of your choice.
>> Obviously, UTF-8 is one of the options. You can do this by adding the
>> following to your web.xml:
>> …
>>
>> Whether Tomcat should ship with this setting present in conf/web.xml
>> by default is something that should probably be discussed for Tomcat
>> 10. Given the current state of the web, there is a reasonable case for
>> doing so. I'll add that to the TOMCAT-NEXT discussion list.
> 
> Is this still on the list for discussion for Tomcat 10?

No, because it has already been implemented for Tomcat 10 and is in the
milestone release currently being voted on.

Mark


> 
> In my opinion it would be a real shame if Tomcat 10 ships with a web
> form encoding default that goes against the WhatWG specifications and
> corrupts non ISO-8859-1 content under modern browsers.
> 
> Garret
> 
> P.S. Mark, please ignore the other email from my personal email address.
> Because the Tomcat users list doesn't include my name in the "To:"
> header, my email client didn't know to use the correct reply address.
> 
> 
> 





Re: distinction between resource charset and format octet decoding

2020-02-06 Thread Garret Wilson

On 1/8/2019 9:57 PM, Mark Thomas wrote:

…

Yes, this default is now very out-dated. That is a side-effect of:
…
As of Servlet 4.0 there is a specification compliant configuration 
option to change this default to any encoding of your choice. 
Obviously, UTF-8 is one of the options. You can do this by adding the 
following to your web.xml:

…

Whether Tomcat should ship with this setting present in conf/web.xml 
by default is something that should probably be discussed for Tomcat 
10. Given the current state of the web, there is a reasonable case for 
doing so. I'll add that to the TOMCAT-NEXT discussion list.


Is this still on the list for discussion for Tomcat 10?

In my opinion it would be a real shame if Tomcat 10 ships with a web 
form encoding default that goes against the WhatWG specifications and 
corrupts non ISO-8859-1 content under modern browsers.


Garret

P.S. Mark, please ignore the other email from my personal email address. 
Because the Tomcat users list doesn't include my name in the "To:" 
header, my email client didn't know to use the correct reply address.






Re: distinction between resource charset and format octet decoding

2019-05-21 Thread Garret Wilson
Sorry to bring up the non-UTF-8 escaped octets form POST problem again, 
but …


On 1/8/2019 3:57 PM, Mark Thomas wrote:

…
As of Servlet 4.0 there is a specification compliant configuration 
option to change this default to any encoding of your choice. 
Obviously, UTF-8 is one of the options. You can do this by adding the 
following to your web.xml:


<request-character-encoding>UTF-8</request-character-encoding>

If you add it to conf/web.xml it applies to every web application 
deployed to Tomcat.


Tomcat 9 uses this in the examples, manager and host-manager 
applications in place of the SetCharacterEncodingFilter.



As you know I've already updated the Tomcat FAQ with the options for 
forcing Tomcat to interpret form POSTs with any escaped characters using 
UTF-8 octet sequences (as modern browsers send, and as HTML5 requires) 
instead of ISO-8859-1 (as the Servlet 4 spec says).


But the problem is worse with the Spring community. If someone is using 
Spring Boot to create an executable JAR/WAR using embedded tomcat, 
Spring Boot does something to configure Tomcat to send the POSTs 
correctly (that is, as the modern web likes it, not like the Servlet 4 
spec says). Unfortunately, if I use Spring Boot to make a WAR which is 
both a self-contained executing WAR /and/ a WAR deployable on Tomcat, 
when I deploy the WAR on Tomcat the encoded characters are using escaped 
ISO-8859-1 octets, so my web app breaks. Yes, the WAR runs differently 
if using Spring Boot embedded Tomcat or deployed on standalone Tomcat as 
a WAR.


Spring Boot ignores any `web.xml` file. I guess I could create a 
`web.xml` file only for standalone Tomcat, but then this freezes Eclipse 
(as I posted elsewhere) because Eclipse doesn't understand 
`<request-character-encoding>`. So like so many things on the web, this 
is a mess.


This is a serious issue, in my opinion. The Servlet 4 specification is 
out of step with everything else in the ecosystem!


Whether Tomcat should ship with this setting present in conf/web.xml 
by default is something that should probably be discussed for Tomcat 
10. Given the current state of the web, there is a reasonable case for 
doing so. I'll add that to the TOMCAT-NEXT discussion list.


Yes, can I just re-second (third?) that motion, and underscore the need 
for this to be changed in Tomcat 10?


Thanks,

Garret



Re: distinction between resource charset and format octet decoding

2019-02-01 Thread Mark Thomas
On 01/02/2019 17:58, Garret Wilson wrote:
> OK, Mark, I've made my initial edits to the
> https://wiki.apache.org/tomcat/FAQ/CharacterEncoding page. _Please check
> them over!_ This is my first edit to the wiki.
> 
> That page has a lot of legacy information, some of which had to do with
> internal Tomcat stuff, and some of which had to do with minute details
> of obsolete RFCs and evolution of browser behavior. I didn't want to
> spend the entire day (week?) on this, so I tried to surgically address
> only the sections relating to POST of
> application/x-www-form-urlencoded and how percent-encoded octets are
> interpreted. I couldn't resist updating the specification links and
> changing just a little prose about URL percent encoding.
> 
> There is the risk now that other sections of the page are still outdated
> and conflict with my changes, but most importantly the FAQ should
> provide more complete information on how Tomcat web applications can be
> made to work with modern browsers.
> 
> Please let me know if I bungled anything or if I need to clarify something.

LGTM.

> Thanks for letting me participate.

No need to thank us. We should be thanking you. Thank you.

So, what do you want to work on next? ;)

Cheers,

Mark


> 
> Garret
> 
> On 1/23/2019 12:26 AM, Mark Thomas wrote:
>> On 23/01/2019 05:07, Garret Wilson wrote:
>>> On 1/15/2019 3:20 AM, Mark Thomas wrote:
>>>> …
>>>> Anything in PascalCase becomes a link to a wiki page of that name.
>>>> Usernames are created in this form so references to the user
>>>> automatically become links to that user's page in the wiki.
>>>
>>> Ah, OK, that explains it. Very good to know. Maybe a little semantic
>>> overloading, but as this is my first wiki account anywhere, I'm guessing
>>> it's typical with whatever software you're using.
>>>
>>> Anyway my account is created, with username `GarretWilson`. After I get
>>> permissions I'll update the info on octet encoding for
>>> application/x-www-form-urlencoded in relation to the servlet spec. It
>>> may not be immediately, but I'll slowly but surely get to it.
>> Karma granted. Happy editing.
>>
>> Cheers,
>>
>> Mark
>>
>>
> 
> 





Re: distinction between resource charset and format octet decoding

2019-02-01 Thread Garret Wilson
OK, Mark, I've made my initial edits to the 
https://wiki.apache.org/tomcat/FAQ/CharacterEncoding page. _Please check 
them over!_ This is my first edit to the wiki.


That page has a lot of legacy information, some of which had to do with 
internal Tomcat stuff, and some of which had to do with minute details 
of obsolete RFCs and evolution of browser behavior. I didn't want to 
spend the entire day (week?) on this, so I tried to surgically address 
only the sections relating to POST of 
application/x-www-form-urlencoded and how percent-encoded octets are 
interpreted. I couldn't resist updating the specification links and 
changing just a little prose about URL percent encoding.


There is the risk now that other sections of the page are still outdated 
and conflict with my changes, but most importantly the FAQ should 
provide more complete information on how Tomcat web applications can be 
made to work with modern browsers.


Please let me know if I bungled anything or if I need to clarify something.

Thanks for letting me participate.

Garret

On 1/23/2019 12:26 AM, Mark Thomas wrote:

On 23/01/2019 05:07, Garret Wilson wrote:

On 1/15/2019 3:20 AM, Mark Thomas wrote:

…
Anything in PascalCase becomes a link to a wiki page of that name.
Usernames are created in this form so references to the user
automatically become links to that user's page in the wiki.


Ah, OK, that explains it. Very good to know. Maybe a little semantic
overloading, but as this is my first wiki account anywhere, I'm guessing
it's typical with whatever software you're using.

Anyway my account is created, with username `GarretWilson`. After I get
permissions I'll update the info on octet encoding for
application/x-www-form-urlencoded in relation to the servlet spec. It
may not be immediately, but I'll slowly but surely get to it.

Karma granted. Happy editing.

Cheers,

Mark







Re: distinction between resource charset and format octet decoding

2019-02-01 Thread Garret Wilson

On 2/1/2019 9:38 AM, Christopher Schultz wrote:

Amazing. A close reading of RFC 3986 reveals that there is no
clear mandate for UTF-8 in existing URI schemes, even though
recommended for new schemes. Anyway, everyone seems to have settled
on UTF-8 (Tomcat included), so I'll try to indicate that.

Wait... are you saying that _it's the Wild West out there?_ ;)

Yes. The web is indeed held together with duct-tape and baling wire.
It's amazing that it works as well as it does.



Hahaha. I'm /so/ happy someone agrees with me! Here's to improving 
things with a little JB Weld once in a while. (That's what my 
grandparents used on the farm when the baling wire and duct tape 
couldn't handle it.)


Garret



Re: distinction between resource charset and format octet decoding

2019-02-01 Thread Christopher Schultz

Garret,

On 2/1/19 11:08, Garret Wilson wrote:
> On 2/1/2019 7:23 AM, Garret Wilson wrote:
>> … * "There /is no default encoding for URIs/ specified anywhere,
>> which is why there is a lot of confusion when it comes to
>> decoding these values." Sheesh, this is ancient. I'll correct
>> it as per https://tools.ietf.org/html/rfc3986#section-2.5 .
> 
> 
> Amazing. A close reading of RFC 3986 reveals that there is no
> clear mandate for UTF-8 in existing URI schemes, even though
> recommended for new schemes. Anyway, everyone seems to have settled
> on UTF-8 (Tomcat included), so I'll try to indicate that.

Wait... are you saying that _it's the Wild West out there?_ ;)

Yes. The web is indeed held together with duct-tape and baling wire.
It's amazing that it works as well as it does.

-chris




Re: distinction between resource charset and format octet decoding

2019-02-01 Thread Garret Wilson

On 2/1/2019 7:23 AM, Garret Wilson wrote:

…
 * "There /is no default encoding for URIs/ specified anywhere, which
   is why there is a lot of confusion when it comes to decoding these
   values." Sheesh, this is ancient. I'll correct it as per
   https://tools.ietf.org/html/rfc3986#section-2.5 .



Amazing. A close reading of RFC 3986 reveals that there is no clear 
mandate for UTF-8 in existing URI schemes, even though recommended for 
new schemes. Anyway, everyone seems to have settled on UTF-8 (Tomcat 
included), so I'll try to indicate that.
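
As a rough illustration of that de facto convention, the sketch below uses the JDK's own `java.net.URI` class, which percent-encodes non-ASCII characters as UTF-8 octets and decodes them the same way. The host `example.com`, the sample path, and the class and method names are assumptions made up for this example; they do not come from the FAQ or from Tomcat.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical demo class: shows that java.net.URI treats percent-encoded
// octets in a URI as UTF-8, matching what most of the web has settled on.
final class UriUtf8Demo {

    /** Builds a URI on the assumed host example.com and returns its ASCII-only form. */
    public static String toAscii(String path) {
        try {
            // The multi-argument constructor quotes illegal characters;
            // toASCIIString() then escapes non-ASCII characters as UTF-8 octets.
            return new URI("https", "example.com", path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Parses an already-encoded URI and returns its decoded path component. */
    public static String decodedPath(String encodedUri) {
        return URI.create(encodedUri).getPath();
    }

    public static void main(String[] args) {
        // "\u00e9" is the letter e with an acute accent.
        System.out.println(toAscii("/caf\u00e9"));
        System.out.println(decodedPath("https://example.com/caf%C3%A9"));
    }
}
```

Other libraries and servers may behave differently, which is the point being made above: RFC 3986 recommends UTF-8 for new schemes but does not impose it on existing ones.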


Garret





Re: distinction between resource charset and format octet decoding

2019-02-01 Thread Garret Wilson
Good morning, I'm just getting to the editing. I'm going to list some 
thoughts I have as I go through this, so you can verify things:


 * The servlet spec links are way out of date. I'll update them.
 * "There /is no default encoding for URIs/ specified anywhere, which
   is why there is a lot of confusion when it comes to decoding these
   values." Sheesh, this is ancient. I'll correct it as per
   https://tools.ietf.org/html/rfc3986#section-2.5 .
 * "Most of the web uses ISO-8859-1 as the default for query strings."
   Is this still true?! In light of the above, I would think it is not
   true, but I wanted to ask, as you know better about what you've seen
   "in the wild".

Garret



Re: distinction between resource charset and format octet decoding

2019-01-23 Thread Mark Thomas
On 23/01/2019 05:07, Garret Wilson wrote:
> On 1/15/2019 3:20 AM, Mark Thomas wrote:
>> …
>> Anything in PascalCase becomes a link to a wiki page of that name.
>> Usernames are created in this form so references to the user
>> automatically become links to that user's page in the wiki.
> 
> 
> Ah, OK, that explains it. Very good to know. Maybe a little semantic
> overloading, but as this is my first wiki account anywhere, I'm guessing
> it's typical with whatever software you're using.
> 
> Anyway my account is created, with username `GarretWilson`. After I get
> permissions I'll update the info on octet encoding for
> application/x-www-form-urlencoded in relation to the servlet spec. It
> may not be immediately, but I'll slowly but surely get to it.

Karma granted. Happy editing.

Cheers,

Mark




Re: distinction between resource charset and format octet decoding

2019-01-22 Thread Garret Wilson

On 1/15/2019 3:20 AM, Mark Thomas wrote:

…
Anything in PascalCase becomes a link to a wiki page of that name.
Usernames are created in this form so references to the user
automatically become links to that user's page in the wiki.



Ah, OK, that explains it. Very good to know. Maybe a little semantic 
overloading, but as this is my first wiki account anywhere, I'm guessing 
it's typical with whatever software you're using.


Anyway my account is created, with username `GarretWilson`. After I get 
permissions I'll update the info on octet encoding for 
application/x-www-form-urlencoded in relation to the servlet spec. It 
may not be immediately, but I'll slowly but surely get to it.


Cheers,

Garret





Re: distinction between resource charset and format octet decoding

2019-01-15 Thread Mark Thomas
On 15/01/2019 03:39, Garret Wilson wrote:
> On 1/9/2019 2:30 AM, Mark Thomas wrote:
>> …
>> Create yourself an account at https://wiki.apache.org/tomcat (click
>> login then create an account) and let the list know your ID. Then one of
>> the admins can add you to the allowed editors.
> 
> 
> I was just ready to create an account, but I want to verify the details
> so I don't screw things up.
> 
>  * It asks for a "Name". Is this a username, I suppose? So we don't
>    maintain our "name" separate from our "login username"?

Yes, it is your username. Any linkage from that to your "public name"
would be maintained on your user page - if you wish.

>  * It says to use "FirstnameLastName". Are you literally wanting us to
>    use "JohnDoe", or can we use "johndoe"? Sorry for the questions; as
>    one who works with protocols all the time, I automatically assume
>    this stuff is important. But I prefer to use lowercase on my
>    usernames; I'm a little confused about why this would want
>    PascalCase for a login username. (I can't think of another system
>    that I use that requires PascalCase usernames.)

Think of it as a SHOULD rather than a MUST.

> My guess is that it's trying to maintain a "human name" and a "username"
> but combine them both into one field or something. I can't say this
> approach is typical…

Anything in PascalCase becomes a link to a wiki page of that name.
Usernames are created in this form so references to the user
automatically become links to that user's page in the wiki.

It isn't a feature we use much at the moment. A quick check shows that
most, but not all, contributors have created their user name in PascalCase.

For example, take a look at https://wiki.apache.org/tomcat/AndrewCarr

Mark




Re: distinction between resource charset and format octet decoding

2019-01-14 Thread Garret Wilson

On 1/9/2019 2:30 AM, Mark Thomas wrote:

…
Create yourself an account at https://wiki.apache.org/tomcat (click
login then create an account) and let the list know your ID. Then one of
the admins can add you to the allowed editors.



I was just ready to create an account, but I want to verify the details 
so I don't screw things up.


 * It asks for a "Name". Is this a username, I suppose? So we don't
   maintain our "name" separate from our "login username"?
 * It says to use "FirstnameLastName". Are you literally wanting us to
   use "JohnDoe", or can we use "johndoe"? Sorry for the questions; as
   one who works with protocols all the time, I automatically assume
   this stuff is important. But I prefer to use lowercase on my
   usernames; I'm a little confused about why this would want
   PascalCase for a login username. (I can't think of another system
   that I use that requires PascalCase usernames.)

My guess is that it's trying to maintain a "human name" and a "username" 
but combine them both into one field or something. I can't say this 
approach is typical…


Garret



Re: distinction between resource charset and format octet decoding

2019-01-09 Thread Mark Thomas
On 09/01/2019 00:50, Garret Wilson wrote:
> Hi, Mark, and thanks for the quick response. You provided some info I
> wasn't aware of. Some responses below:
> 
> On 1/8/2019 9:57 PM, Mark Thomas wrote:
>> On 08/01/2019 21:31, Garret Wilson wrote:
>>
>> 
>>
>>> But as discussed above, this is completely wrong: the resource
>>> character encoding of a request sent in
>>> `application/x-www-form-urlencoded` should have absolutely no bearing
>>> on how the encoded octets within that resource are decoded.
>>
>> That is not the correct interpretation of section 3.12 of the Servlet
>> 4.0 specification (note the section numbers do vary between spec
>> versions). Tomcat implements the correct interpretation - i.e. the
>> charset from the request content-type defines how encoded octets are
>> decoded and, if none is specified, ISO-8859-1 is used as the default.
> 
> 
> Ah, I hadn't seen that in the servlet spec. Yes, it seems as if Tomcat
> is correctly following the spec, but I would still say the servlet spec
> is wrong to make any linkage at all between resource encoding and %nn
> interpretation. In fact reading the prose it's not clear to me that the
> servlet spec is even strongly tying the %nn interpretation to the
> encoding. It just seems to say that, unless otherwise specified, the %nn
> interpretation should be ISO-8859-1. And actually that's a step up from
> the HTML 4.01 spec, which in
> https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1 indicates
> that they should be interpreted as US-ASCII codes. :(
> 
> You indicate that this is all out of date, and I think we're in
> agreement there. We really, really need to get the next servlet
> specification to remove this part. In fact the servlet specification
> should defer to the official `application/x-www-form-urlencoded`
> specification, which at this point I think is the W3C HTML5 spec, which
> in turn defers to the WHATWG spec (which clearly says that UTF-8 should
> be used). What makes all of this more of a mess is that there seems to be
> no way to work around this from the client side, e.g. by putting
> something in the HTML to indicate UTF-8, as
> `application/x-www-form-urlencoded` doesn't support a `charset` parameter.
> 
> Anyway if there are any openings on the committee to update the servlet
> spec, let me know.

That has moved to Eclipse. The process to update the spec is still being
defined. The Jakarta EE Servlet API project is the project to get
involved in.


>> ...
>> As of Servlet 4.0 there is a specification compliant configuration
>> option to change this default to any encoding of your choice.
>> Obviously, UTF-8 is one of the options. You can do this by adding the
>> following to your web.xml:
>>
>> <request-character-encoding>UTF-8</request-character-encoding>
> 
> Oh, that is really good to know, thanks!! Still I say that the request
> character encoding is orthogonal to the %nn encoding, but, still, it's
> good to have an implementation-agnostic way to do it.
> 
>>
>>
>> Whether Tomcat should ship with this setting present in conf/web.xml
>> by default is something that should probably be discussed for Tomcat
>> 10. Given the current state of the web, there is a reasonable case for
>> doing so. I'll add that to the TOMCAT-NEXT discussion list.
> 
> 
> Yes please! If I can help in any way, let me know.
> 
> 
>>
>> The Tomcat Wiki also needs to be updated to take account of this new
>> configuration option (and the related <response-character-encoding>).
>> Since it is a wiki and this is clearly an issue you care about would
>> you like to tackle that?
> 
> 
> Yes, I'd love to. Let me know what permissions I need, etc.

Create yourself an account at https://wiki.apache.org/tomcat (click
login then create an account) and let the list know your ID. Then one of
the admins can add you to the allowed editors.

Apologies for the hoop jumping required but without the manual approval
step for new accounts, the ASF project wikis were being deluged in spam.

Mark

> 
> I have an international flight boarding right now so I have to go, and I
> may not reply for the next few hours, but definitely sign me up.
> 
> Thanks,
> 
> Garret
> 
> 
> 





Re: distinction between resource charset and format octet decoding

2019-01-08 Thread Garret Wilson
Hi, Mark, and thanks for the quick response. You provided some info I 
wasn't aware of. Some responses below:


On 1/8/2019 9:57 PM, Mark Thomas wrote:

On 08/01/2019 21:31, Garret Wilson wrote:



But as discussed above, this is completely wrong: the resource 
character encoding of a request sent in 
`application/x-www-form-urlencoded` should have absolutely no bearing 
on how the encoded octets within that resource are decoded.


That is not the correct interpretation of section 3.12 of the Servlet 
4.0 specification (note the section numbers do vary between spec 
versions). Tomcat implements the correct interpretation - i.e. the 
charset from the request content-type defines how encoded octets are 
decoded and, if none is specified, ISO-8859-1 is used as the default.



Ah, I hadn't seen that in the servlet spec. Yes, it seems as if Tomcat 
is correctly following the spec, but I would still say the servlet spec 
is wrong to make any linkage at all between resource encoding and %nn 
interpretation. In fact reading the prose it's not clear to me that the 
servlet spec is even strongly tying the %nn interpretation to the 
encoding. It just seems to say that, unless otherwise specified, the %nn 
interpretation should be ISO-8859-1. And actually that's a step up from 
the HTML 4.01 spec, which in 
https://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1 indicates 
that they should be interpreted as US-ASCII codes. :(


You indicate that this is all out of date, and I think we're in 
agreement there. We really, really need to get the next servlet 
specification to remove this part. In fact the servlet specification 
should defer to the official `application/x-www-form-urlencoded` 
specification, which at this point I think is the W3C HTML5 spec, which 
in turn defers to the WHATWG spec (which clearly says that UTF-8 should 
be used). What makes all of this more of a mess is that there seems to be 
no way to work around this from the client side, e.g. by putting 
something in the HTML to indicate UTF-8, as 
`application/x-www-form-urlencoded` doesn't support a `charset` parameter.
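
To make the client side of this concrete, here is a minimal sketch of how a form body ends up on the wire when the percent-encoded octets are UTF-8, as the WHATWG rules call for. It uses the JDK's `java.net.URLEncoder` as a stand-in for a browser's serializer, which is an assumption for illustration only; the class name, method name, and field names are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical demo class: serializes form fields roughly the way a modern
// browser would for application/x-www-form-urlencoded, using UTF-8 octets.
final class FormBodyEncodingDemo {

    /** Joins name=value pairs with '&', percent-encoding each part as UTF-8. */
    public static String encodeForm(Map<String, String> fields) {
        return fields.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "="
                        + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("name", "Jos\u00e9");        // e with acute accent
        fields.put("city", "S\u00e3o Paulo");   // a with tilde, plus a space
        // The body carries no indication of which charset produced the %nn octets.
        System.out.println(encodeForm(fields));
    }
}
```

Nothing in the resulting body says that the `%nn` sequences are UTF-8, so the receiving side has to know or assume it.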


Anyway if there are any openings on the committee to update the servlet 
spec, let me know.




...
As of Servlet 4.0 there is a specification compliant configuration 
option to change this default to any encoding of your choice. 
Obviously, UTF-8 is one of the options. You can do this by adding the 
following to your web.xml:


<request-character-encoding>UTF-8</request-character-encoding>


Oh, that is really good to know, thanks!! Still I say that the request 
character encoding is orthogonal to the %nn encoding, but, still, it's 
good to have an implementation-agnostic way to do it.





Whether Tomcat should ship with this setting present in conf/web.xml 
by default is something that should probably be discussed for Tomcat 
10. Given the current state of the web, there is a reasonable case for 
doing so. I'll add that to the TOMCAT-NEXT discussion list.



Yes please! If I can help in any way, let me know.




The Tomcat Wiki also needs to be updated to take account of this new 
configuration option (and the related <response-character-encoding>). 
Since it is a wiki and this is clearly an issue you care about would 
you like to tackle that?



Yes, I'd love to. Let me know what permissions I need, etc.

I have an international flight boarding right now so I have to go, and I 
may not reply for the next few hours, but definitely sign me up.


Thanks,

Garret





Re: distinction between resource charset and format octet decoding

2019-01-08 Thread Mark Thomas

On 08/01/2019 21:31, Garret Wilson wrote:



But as discussed above, this is completely wrong: the resource character 
encoding of a request sent in `application/x-www-form-urlencoded` should 
have absolutely no bearing on how the encoded octets within that 
resource are decoded.


That is not the correct interpretation of section 3.12 of the Servlet 
4.0 specification (note the section numbers do vary between spec 
versions). Tomcat implements the correct interpretation - i.e. the 
charset from the request content-type defines how encoded octets are 
decoded and, if none is specified, ISO-8859-1 is used as the default.
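
The rule described above can be sketched in plain Java, without the Servlet API. This is only an illustration of the behaviour, not Tomcat's actual code: the class and method names are hypothetical, the Content-Type parsing is deliberately simplistic, and `java.net.URLDecoder` stands in for the container's parameter decoding.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical demo class: picks the charset for decoding %nn octets from the
// request Content-Type, falling back to a default when none is given.
final class RequestCharsetDemo {

    /** Returns the charset parameter of the Content-Type, or the fallback if absent. */
    public static Charset resolveCharset(String contentType, Charset fallback) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                    String name = p.substring("charset=".length()).trim().replace("\"", "");
                    return Charset.forName(name);
                }
            }
        }
        return fallback;
    }

    /** Decodes one percent-encoded form value using the given charset. */
    public static String decode(String encoded, Charset charset) {
        return URLDecoder.decode(encoded, charset);
    }

    public static void main(String[] args) {
        String value = "caf%C3%A9"; // UTF-8 octets, as a modern browser sends them
        // Browsers normally send no charset parameter, so the fallback decides.
        Charset legacy = resolveCharset("application/x-www-form-urlencoded",
                StandardCharsets.ISO_8859_1);
        Charset modern = resolveCharset("application/x-www-form-urlencoded",
                StandardCharsets.UTF_8);
        System.out.println(decode(value, legacy)); // two characters where one was sent
        System.out.println(decode(value, modern)); // the original text
    }
}
```

With the ISO-8859-1 fallback, the two UTF-8 octets are read as two separate characters, which is the corruption being discussed in this thread.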


Yes, this default is now very out-dated. That is a side-effect of:
- how long the Servlet specification has been around
- the very conservative approach taken by Java EE in terms of backwards
  compatibility (once set, defaults are very rarely - if ever - changed)
- arguably missed opportunities to address this issue prior to
  Servlet 4.0

As of Servlet 4.0 there is a specification compliant configuration 
option to change this default to any encoding of your choice. Obviously, 
UTF-8 is one of the options. You can do this by adding the following to 
your web.xml:


<request-character-encoding>UTF-8</request-character-encoding>

If you add it to conf/web.xml it applies to every web application 
deployed to Tomcat.


Tomcat 9 uses this in the examples, manager and host-manager 
applications in place of the SetCharacterEncodingFilter.


Whether Tomcat should ship with this setting present in conf/web.xml by 
default is something that should probably be discussed for Tomcat 10. 
Given the current state of the web, there is a reasonable case for doing 
so. I'll add that to the TOMCAT-NEXT discussion list.


The Tomcat Wiki also needs to be updated to take account of this new 
configuration option (and the related <response-character-encoding>). 
Since it is a wiki and this is clearly an issue you care about would you 
like to tackle that?


Mark
