Re: mod_jk: plus-character causes %-encoding problems

Konstantin Kolinko Thu, 14 Jan 2010 13:08:22 -0800

2010/1/14 Tero Karttunen <[email protected]>:
>> Why is '+' decoded to ' ' in the path part of the URL?
>> That is, I think, wrong.
>
> This is an interesting theory. If true, it could provide an
> explanation to the observed behavior, but I cannot completely follow
> it.
>
>> The '+' char has no special meaning in HTTP/1.1 (RFC 2616) [1], so in
>> the path part of the URL it just means itself, the plus sign.
>
> On the other hand, the same RFC provides a counter-example. Look at
> section 3.2.3 "URI comparison". It says that characters other than
> those in the "reserved" and "unsafe" sets are equivalent to their
> %-encoded counterparts. The reserved set as defined in RFC 2396 (and
> the later RFC 3986 that obsoletes it) includes the '+' character.
>
> I believe section 3.2.3 means that the characters in the reserved
> set are not equivalent to their %-encoded counterparts, and in this
> way, /contextroot/subcontext/sites/one+one%3cfive IS NOT equivalent to
> /contextroot/subcontext/sites/one%2Bone%3cfive when doing URI
> comparison.
>
>> It is the HTML Forms spec [2] that makes it special, defining
>> "urlencoding" used when submitting web forms through HTTP. It has
>> special meaning only in the query part of the URL and only because of
>> that part of HTML spec.
>
> The HTML Forms spec does define application/x-www-form-urlencoded,
> but I can't tell from the spec whether it is limited to just the
> query part.
>
>>> What my application actually sees after decoding: sites/one one<five
>>
>> What is your application code here? Where and how do you obtain the
>> "decoded" value?
>
> I am using Apache Commons URLCodec to decode the URL. This widely-used
> utility class does not make the distinction between path and query
> parts...
>
> Let me explain my application to you before I provide the code example
> to you. As you could guess from its name TeamCenterEmulator, my
> application emulates a set of former URLs, continuing to serve the
> pre-existing links while the legacy application is retired.
>
> My application is configured with a CSV file containing a mapping
> between a URL and the resource it is supposed to serve (dynamically,
> it is not a simple file). Say, the application could contain
> the following mapping:
>
> <former url>                   <response>
> /sites/foo                     file1
> /sites/bar                     file2
> /sites/one%2Bone%3cthree       file3
> /sites/one%2Bone%3cfour        file4
> /sites/one%2Bone%3cfive        file5
> ...
>
> Once the application initializes, it reads the mapping into memory,
> and if the request matches the former url EXACTLY, the matching
> response is returned. This is the application spec. Note here that by
> RFC 2616-compliant URI comparison, my application must regard request
> /sites/one+one%3cfive as a non-match!
>
> Here is doGet from my servlet. Note that I am trimming the URL to
> start from the "sites" part for obvious reasons...
>
>        protected void doGet(HttpServletRequest request,
>                        HttpServletResponse response)
>                        throws ServletException, IOException {
>                super.doGet(request, response);
>                // Lazily load the URL mapping on the first request.
>                if (config == null) {
>                        config = new ConfigurationFactory().createConfiguration(
>                                getServletContext().getInitParameter("teamCenterURLMapping"));
>                }
>                String urlSnippet = getServletContext().getContextPath()
>                                + "/" + getServletConfig().getServletName() + "/";
>                // getRequestURI() returns the raw, still %-encoded URI.
>                String url = "";
>                if (request.getRequestURI().length() > urlSnippet.length()) {
>                        url = request.getRequestURI().substring(urlSnippet.length());
>                }
>                try {
>                        TeamCenterConfigurationItem item = config.findByURL(url);
>                        [...]
>                } catch (UnknownUrlException e) {
>                        ...
>                }
>        }
>
> I am not going to post ConfigurationFactory, because it is not
> interesting. It basically builds a HashMap based on the CSV file that
> has URLCodec.decode()'d former urls as its keys, with the idea that if
> we URL-decode the incoming request, we can search the HashMap for
> matches.
>
> Here is how the above-mentioned findByURL method does just that:
>
>        public TeamCenterConfigurationItem findByURL(String url)
>                        throws UnknownUrlException {
>                URLCodec codec = new URLCodec("UTF8");
>                try {
>                        url = codec.decode(url);
>                        logger.info(url);
>                } catch (DecoderException e) {
>                        logger.error(e);
>                        throw new UnknownUrlException (url);
>                }
>                if (config.containsKey(url)) {
>                        return config.get(url);
>                }
>                throw new UnknownUrlException (url);
>        }
>
> What do you think? Is my approach valid? Am I somehow abusing
> URLCodec? Should the request be (partially) decoded in some other way?
>
> Best Regards,
> Tero Karttunen
>


Is UTF-8 the reason why you are using your custom decoding?

There is a URIEncoding attribute on the <Connector> element [1], and a
common (non-default) setting for it is URIEncoding="UTF-8". There is a
FAQ page about solving character encoding issues [2].

You should be able to use HttpServletRequest.getPathInfo() to get the
decoded value.
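
To illustrate the difference between the two decoding styles, here is a
minimal sketch that uses only JDK classes. It is not Tomcat's or Commons
Codec's code: java.net.URLDecoder stands in for form-style decoding
(where '+' becomes a space), and java.net.URI.getPath() stands in for
path-style decoding (where '+' is left alone). The class and method
names are made up for this example.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

public class PathDecodingDemo {

    // Form-style (application/x-www-form-urlencoded) decoding:
    // %XX escapes are decoded AND '+' is turned into a space.
    static String decodeAsForm(String raw) {
        try {
            return URLDecoder.decode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    // Path-style decoding: only %XX escapes are decoded,
    // a literal '+' stays a plus sign.
    static String decodeAsPath(String raw) {
        return URI.create(raw).getPath();
    }

    public static void main(String[] args) {
        String raw = "/sites/one+one%3cfive";
        System.out.println(decodeAsForm(raw)); // /sites/one one<five
        System.out.println(decodeAsPath(raw)); // /sites/one+one<five
    }
}
```

With path-style decoding, /sites/one+one%3cfive and
/sites/one%2Bone%3cfive both decode to the same string, so a lookup
keyed on decoded paths cannot tell them apart. If that distinction
matters, the comparison has to be done on the raw request URI.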

[1]
http://tomcat.apache.org/tomcat-6.0-doc/config/http.html
http://tomcat.apache.org/tomcat-6.0-doc/config/ajp.html

[2]
http://wiki.apache.org/tomcat/FAQ/CharacterEncoding

Best regards,
Konstantin Kolinko
