On 03/10/18 10:22, Jean Pierre Urkens wrote:
> Hi everybody,
> 
> I am having an issue where Unicode characters (e.g. ł) are
> passed by the Apache Webserver 2.4 to Tomcat as UTF-8 encoded bytes while
> Tomcat seems to evaluate them as ISO-8859-15 encoded.
> 
> Having taken a network trace with TCPDUMP I see the following bytes for my
> header field (truncated the output after byte ‘72’): 
> 
> 0200   0a 48 54 54 50 5f 56 6f 6f 72 6e 61 61 6d 3a 20   .HTTP_Voornaam: 
> 0210   4d 61 c5 82 67 6f 72                              MaÅ.gor
> 
> Here the bytes C5 82 are the UTF-8 encoded value for the Unicode character
> ł (U+0142).
> 
> Now when inspecting the header value in Tomcat using:
> 
>                String headerValue = request.getHeader("HTTP_Voornaam");
> 
> I’m getting the value ‘MaÅ.gor’ which seems to be using the ISO-8859-15
> representation for the bytes C582. The byte string from the TCPDUMP seems to
> match the result of  headerValue.getBytes(Charset.forName("ISO-8859-15"))
> and not the result of headerValue.getBytes(Charset.forName("UTF-8")).
> 
> The FAQ (https://wiki.apache.org/tomcat/FAQ/CharacterEncoding) indicates
> that ‘headers are always in US-ASCII encoding. Anything outside of that
> needs to be encoded’, in this case it seems to be UTF-8 encoded.

From the HTTP spec (RFC 7230, section 3.2.4):

<quote>
   Historically, HTTP has allowed field content with text in the
   ISO-8859-1 charset [ISO-8859-1], supporting other charsets only
   through use of [RFC2047] encoding.  In practice, most HTTP header
   field values use only a subset of the US-ASCII charset [USASCII].
   Newly defined header fields SHOULD limit their field values to
   US-ASCII octets.  A recipient SHOULD treat other octets in field
   content (obs-text) as opaque data.
</quote>

Sending raw UTF-8 bytes and having them decoded as such has never been
part of the Servlet spec (and is discouraged by the HTTP spec).

Tomcat has never supported the use of RFC2047 encoding. It has been
considered in the past but I'm not aware of any mainstream client that
supports it.

Tomcat does allow raw UTF-8 in the cookie header (although neither the
Cookie nor the HTTP spec allows this) because most (all major?) browsers
send raw UTF-8 in the cookie header.

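If you know that the data is always going to be UTF-8 then you can do
the (fairly ugly) re-decode below. It works because Tomcat decodes
header bytes as ISO-8859-1, which maps every byte to exactly one
character, so the original bytes can be recovered without loss: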

String utf8Value = new String(
        headerValue.getBytes(StandardCharsets.ISO_8859_1),
        StandardCharsets.UTF_8);
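
To see the round trip on the bytes from your trace, here is a minimal
standalone sketch. The class and method names (HeaderRoundTrip,
recodeAsUtf8) are made up for illustration, and it assumes the header
reached the application decoded as ISO-8859-1:

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: HeaderRoundTrip and recodeAsUtf8 are hypothetical names.
final class HeaderRoundTrip {

    // Recover the original bytes (ISO-8859-1 is a 1:1 byte/char mapping)
    // and decode them again as UTF-8.
    static String recodeAsUtf8(String headerValue) {
        byte[] raw = headerValue.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Bytes from the network trace: 4d 61 c5 82 67 6f 72
        byte[] wire = {0x4d, 0x61, (byte) 0xc5, (byte) 0x82, 0x67, 0x6f, 0x72};

        // Simulates what getHeader() hands back: each byte becomes one char.
        String asReceived = new String(wire, StandardCharsets.ISO_8859_1);
        System.out.println(asReceived.length());   // prints 7

        // After re-decoding, c5 82 collapses into the single char U+0142.
        String fixed = recodeAsUtf8(asReceived);
        System.out.println(fixed.length());                // prints 6
        System.out.println(fixed.equals("Ma\u0142gor"));   // prints true
    }
}
```

Note that this silently produces replacement characters if the header
was not valid UTF-8 to begin with, so only use it where you control
what the front end sends.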

The servlet spec should probably provide a mechanism to obtain the
header data as bytes and/or decode them using a given encoding.
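
Until something like that exists, an application can approximate it
with a small helper of its own. The following is only a sketch:
decodeHeader is a hypothetical method, not part of Tomcat or the
Servlet API, and it assumes the value it is given came from
getHeader() and was therefore decoded as ISO-8859-1. Unlike the
one-liner above, it uses a strict decoder and returns the value
unchanged when the bytes are not valid in the requested charset:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Illustrative only: HeaderDecoding and decodeHeader are hypothetical names.
final class HeaderDecoding {

    static String decodeHeader(String headerValue, Charset charset) {
        if (headerValue == null) {
            return null;
        }
        // Recover the bytes as they were on the wire.
        byte[] raw = headerValue.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // REPORT makes the decoder throw instead of inserting U+FFFD.
            return charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not valid in the requested charset: keep the value as received.
            return headerValue;
        }
    }

    public static void main(String[] args) {
        // "Ma" + bytes c5 82 read as ISO-8859-1 + "gor"
        String received = "Ma\u00c5\u0082gor";
        String decoded = decodeHeader(received, StandardCharsets.UTF_8);
        System.out.println(decoded.equals("Ma\u0142gor")); // prints true
    }
}
```

In a servlet you would call it as
decodeHeader(request.getHeader("HTTP_Voornaam"), StandardCharsets.UTF_8).
Falling back to the received value is a design choice; throwing or
logging may suit you better if a non-UTF-8 header indicates a
misconfigured front end.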

> The headers are evaluated by a servlet 2.5 web application which has defined
> a ‘CharacterEncodingFilter’ as first filter performing the following
> actions:
> 
>              request.setCharacterEncoding("UTF-8");
>              response.setContentType("text/html; charset=UTF-8");
>              response.setCharacterEncoding("UTF-8");
>              filterChain.doFilter(request, response);

None of those apply to HTTP headers. They only control how the request
body is decoded and how the response body is encoded.

> Is there a way to tell Tomcat to decode the headers as being UTF-8 encoded
> bytes?

No.

> I am using Tomcat-version 8.5.32. 

Thanks for providing that information. A lot of people forget.

Mark

