DO NOT REPLY [Bug 26403] New: - double UTF-8 encoding of HTTP request parameters

bugzilla Sat, 24 Jan 2004 14:20:12 -0800

DO NOT REPLY TO THIS EMAIL, BUT PLEASE POST YOUR BUG 
RELATED COMMENTS THROUGH THE WEB INTERFACE AVAILABLE AT
<http://nagoya.apache.org/bugzilla/show_bug.cgi?id=26403>.
ANY REPLY MADE TO THIS MESSAGE WILL NOT BE COLLECTED AND 
INSERTED IN THE BUG DATABASE.


http://nagoya.apache.org/bugzilla/show_bug.cgi?id=26403

double UTF-8 encoding of HTTP request parameters

           Summary: double UTF-8 encoding of HTTP request parameters
           Product: Struts
           Version: Nightly Build
          Platform: Other
        OS/Version: Other
            Status: NEW
          Severity: Normal
          Priority: Other
         Component: Digester
        AssignedTo: [EMAIL PROTECTED]
        ReportedBy: [EMAIL PROTECTED]


I'm having a problem with properly processing UTF-8 encoded request parameters
through Struts. The effect is that international characters (non-ASCII
characters, which take multiple bytes in UTF-8) are encoded into UTF-8 twice.

As an example, let's see the examples webapp included in the jakarta-struts
source tree. It has the registration sample, reachable through

http://localhost:8080/struts-examples/validator/registration.do

if installed on localhost:8080. Suppose I wish to type:

small letter a with acute: á
unicode value hex:         00e1
unicode value binary:      11100001
UTF-8 binary:              11000011 10100001
UTF-8 in hex:              c3a1
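
The values in the table above can be reproduced with a short, language-neutral sketch. It is written in Python purely for illustration and is not part of Struts or Tomcat; the variable names are made up for this example.

```python
# Illustrative only: recompute the table for U+00E1 (small letter a with acute).
ch = "\u00e1"

code_point = ord(ch)
utf8_bytes = ch.encode("utf-8")

print(f"{code_point:04x}")                          # 00e1
print(f"{code_point:08b}")                          # 11100001
print(" ".join(f"{b:08b}" for b in utf8_bytes))     # 11000011 10100001
print(utf8_bytes.hex())                             # c3a1
```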

into the firstName field of the form. This can be simulated by requesting:

http://localhost:8080/struts-examples/validator/registration-submit.do?firstName=%C3%A1

(Typing the letter manually and submitting the form via POST has the same effect.)
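
As a rough illustration of what the query string carries, the sketch below percent-decodes the parameter and then interprets the resulting bytes under two different charsets. It uses Python's standard urllib.parse only as a stand-in; how the servlet container actually decodes the parameter is not shown in this report, and treating ISO-8859-1 as the alternative charset is an assumption made for the example.

```python
from urllib.parse import parse_qs, unquote_to_bytes

query = "firstName=%C3%A1"

# The percent-escapes carry exactly the two UTF-8 bytes of the letter.
raw = unquote_to_bytes("%C3%A1")
print(raw.hex())  # c3a1

# Interpreted as UTF-8, the two bytes form a single character.
as_utf8 = parse_qs(query, encoding="utf-8")["firstName"][0]
print(len(as_utf8))  # 1

# Assumption: a decoder that treats every byte as one ISO-8859-1
# character sees two characters instead (U+00C3 and U+00A1).
as_latin1 = parse_qs(query, encoding="iso-8859-1")["firstName"][0]
print(len(as_latin1))  # 2
```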

The resulting page reports a lot of form errors, as I didn't fill out most of the
fields, which is OK. More importantly, it also shows the entered letter in
the firstName input field. What is weird is that a different letter is shown
(actually two letters). Running xxd on the received page, here's the relevant part:


00003a0: 6e67 7468 3d22 3330 2220 7369 7a65 3d22  ngth="30" size="
00003b0: 3330 2220 7661 6c75 653d 22c3 83c2 a122  30" value="...."
00003c0: 3e0a 2020 2020 3c2f 7464 3e0a 2020 3c2f  >.    </td>.  </

with the important part at value="....", which is:

00003b0: 3330 2220 7661 6c75 653d 22c3 83c2 a122  30" value="...."
                                    ^^^^^^^^^^

The bytes presented are:

UTF-8 hex sequence: c383c2a1
UTF-8 binary:       11000011 10000011 11000010 10100001

which is actually two UTF-8 letters by now. What is funny is that if I 'decode'
each of them from UTF-8, I get one byte of the original UTF-8 sequence:

first part, as received: 11000011 10000011
de-coded:                11000011

second part, as received: 11000010 10100001
de-coded:                 10100001


and voila, the two parts make up the original UTF-8 sequence:

11000011 10100001

which actually is the UTF-8 sequence for the letter sent.
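
The observation above can be checked with a small sketch. It assumes that the doubling happens because each byte of the submitted UTF-8 sequence is treated as one character (equivalent to decoding as ISO-8859-1) and the result is then written out as UTF-8 again. That mechanism is a hypothesis consistent with the hex dump, not something confirmed from the Struts or Tomcat code; double_encode and undo_double_encode are hypothetical helper names.

```python
def double_encode(text: str) -> bytes:
    """Hypothesised failure: UTF-8 bytes are read as ISO-8859-1 characters,
    then the resulting string is encoded as UTF-8 a second time."""
    return text.encode("utf-8").decode("iso-8859-1").encode("utf-8")


def undo_double_encode(data: bytes) -> bytes:
    """Reverse one round: decode as UTF-8, then map each character back
    to the single byte with the same value."""
    return data.decode("utf-8").encode("iso-8859-1")


received = double_encode("\u00e1")
print(received.hex())  # c383c2a1, as in the xxd dump

original = undo_double_encode(received)
print(original.hex())  # c3a1, the sequence that was sent
```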

If I resubmit this page (which by now contains two UTF-8 letters), I get four
letters, then 8, etc. It seems that the engine doesn't recognize that the input
is already a UTF-8 sequence, and encodes it 'again'.


I'm using Mozilla as a browser and Tomcat 5.0.16. The encoding of the pages is UTF-8.

