2010/1/14 Tero Karttunen <karttunen.mailingl...@gmail.com>:
>> Why is '+' decoded to ' ' in the path part of the URL?
>> That is, I think, wrong.
>
> This is an interesting theory. If true, it could provide an
> explanation to the observed behavior, but I cannot completely follow
> it.
>
>> The '+' char has no special meaning in HTTP/1.1 (RFC 2616) [1], so in
>> the path part of the URL it just means itself, the plus sign.
>
> On the other hand, the same RFC provides a counter-example. Look at
> section 3.2.3 "URI comparison". It says that characters other than
> those in the "reserved" and "unsafe" sets are equivalent to their
> %-encoded counterparts. The reserved set as defined in RFC 2396 (and
> the later RFC 3986 that obsoletes it) include '+' character.
>
> I believe the chapter 3.2.3 means that the characters in the reserved
> set are not equivalent to their %-encoded counterparts, and in this
> way, /contextroot/subcontext/sites/one+one%3cfive IS NOT equivalent to
> /contextroot/subcontext/sites/one%2Bone%3cfive when doing URI
> comparison.
>
>> It is the HTML Forms spec [2] that makes it special, defining
>> "urlencoding" used when submitting web forms through HTTP. It has
>> special meaning only in the query part of the URL and only because of
>> that part of HTML spec.
>
> HTML Forms spec does define www-form-urlencoding, but I can't tell
> from the spec whether it is limited to just the query part.
>
>>> What my application actually sees after decoding: sites/one one<five
>>
>> What is your application code here? Where and how do you obtain the
>> "decoded" value?
>
> I am using Apache Commons URLCodec to decode the URL. This widely-used
> utility class does not make the distinction between path and query
> parts...
>
> Let me explain my application to you before I provide the code example
> to you. As you could guess from its name TeamCenterEmulator, my
> application emulates a set of former URLs, continuing to serve the
> pre-existing links while the legacy application is retired.
>
> My application is configured with a CSV file containing a mapping
> between an URL and a resource it is supposed to serve (in a dynamic
> fashion, it is not a simple file). Say, the application could contain
> the following mapping:
>
> <former url>                 <response>
> /sites/foo                   file1
> /sites/bar                   file2
> /sites/one%2Bone%cthree      file3
> /sites/one%2Bone%cfour       file4
> /sites/one%2Bone%cfive       file5
> ...
>
> Once the application initializes, it reads the mapping into memory,
> and if the request matches the former url EXACTLY, the matching
> response is returned. This is the application spec. Note here that by
> RFC 2616-compliant URI comparison, my application must regard request
> /sites/one+one%cfive as a non-match!
>
> Here is doGet from my servlet. Note that I am trimming the URL to
> start from the "sites" part for obvious reasons...
>
> protected void doGet(HttpServletRequest request, HttpServletResponse response)
>         throws ServletException, IOException {
>     super.doGet(request, response);
>     if (config == null) {
>         config = new ConfigurationFactory().createConfiguration(
>                 getServletContext().getInitParameter("teamCenterURLMapping"));
>     }
>     String urlSnippet = (getServletContext().getContextPath() +
>             "/" + getServletConfig().getServletName() + "/");
>     String url = "";
>     if (request.getRequestURI().length() > urlSnippet.length()) {
>         url = request.getRequestURI().substring(urlSnippet.length());
>     }
>     try {
>         TeamCenterConfigurationItem item = config.findByURL(url);
>         [...]
>     catch (UnknownUrlException) {
>         ...
>     }
> }
>
> I am not going to post ConfigurationFactory, because it is not
> interesting. It basically builds a HashMap based on the CSV file that
> has URLCodec.decode()'d former urls as its keys, with the idea that if
> we URL-decode the incoming request, we can search the HashMap for
> matches.
>
> Here is how the abovementioned findUrl method does just that:
>
> public TeamCenterConfigurationItem findByURL (String url)
>         throws UnknownUrlException {
>     URLCodec codec = new URLCodec("UTF8");
>     try {
>         url = codec.decode(url);
>         logger.info(url);
>     } catch (DecoderException e) {
>         logger.error(e);
>         throw new UnknownUrlException (url);
>     }
>     if (config.containsKey(url)) {
>         return config.get(url);
>     }
>     throw new UnknownUrlException (url);
> }
>
> What do you think? Is my approach valid? Am I somehow abusing
> URLCodec? Should the request be (partially) decoded in some other way?
>
> Best Regards,
> Tero Karttunen
>
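
To illustrate the difference between the two kinds of decoding, here is a minimal sketch using only JDK classes. The class and method names (PathDecodeDemo, decodeAsForm, decodeAsPath) are made up for this example and are not part of Tomcat or Commons Codec. java.net.URLDecoder implements the form (www-form-urlencoded) rules, and as far as I know URLCodec.decode() treats '+' the same way; java.net.URI decodes only the %XX escapes in the path.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

public class PathDecodeDemo {

    // Form-style decoding (application/x-www-form-urlencoded):
    // %XX escapes are decoded AND '+' is turned into a space.
    public static String decodeAsForm(String raw) {
        try {
            return URLDecoder.decode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    // Path-style decoding: only %XX escapes are decoded,
    // a literal '+' in the path stays a '+'.
    public static String decodeAsPath(String raw) {
        return URI.create(raw).getPath();
    }

    public static void main(String[] args) {
        String raw = "/sites/one+one%3cfive";
        System.out.println(decodeAsForm(raw)); // /sites/one one<five
        System.out.println(decodeAsPath(raw)); // /sites/one+one<five
    }
}
```

Note that after path-style decoding, one%2Bone and one+one both become "one+one". If your spec really requires them to be different, you would have to compare the raw (undecoded) request URI against raw keys instead.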

Is UTF-8 the reason why you are using your custom decoding?

There is a URIEncoding attribute on the <Connector> element [1], and a
common (non-default) setting for it is URIEncoding="UTF-8". There is
also a FAQ page about solving character encoding issues [2].

You should be able to use HttpServletRequest.getPathInfo() to get the
decoded value.

[1] http://tomcat.apache.org/tomcat-6.0-doc/config/http.html
    http://tomcat.apache.org/tomcat-6.0-doc/config/ajp.html
[2] http://wiki.apache.org/tomcat/FAQ/CharacterEncoding

Best regards,
Konstantin Kolinko