Re: Semicolon URI encoding and RFC

Mindaugas Žakšauskas Tue, 10 May 2011 04:10:50 -0700

Hi,

Thanks very much for your answers. Just for reference, I will sum up
what I've managed to get out of this discussion. Please correct me if
I am wrong.


My problem wasn't charset incompatibility between client and server as
it is the same party which produces URLs and consumes them (and yes,
we do use UTF-8 everywhere and have useBodyEncodingForURL set to
true). Anyway, it was an interesting read to get the whole picture,
including Punycode. I hope others benefited from this, too.

What I wanted to clarify was the exact sets of characters needing %
encoding. Initially I thought that this all boils down to different
character classes, but that turned out to be incorrect (the semicolon
vs. bracket case).

My other concern was internationalized paths, and it was good advice
from Konstantin to have a look at Wikipedia. For example, a link to
"botánico" in Spanish Wikipedia is printed as <a
href="/wiki/Bot%C3%A1nica" title="Botánica"> and browsers seem to
be able to show it percent-decoded without any special effort. I only
slipped here because initially I used [1], which does not encode
(at least) some characters correctly. I ended up using a modified copy
of java.net.URI::appendEncoded(StringBuilder, char), as the original is
private there and doesn't escape semicolons [2].
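
To illustrate the Wikipedia link above, here is a minimal sketch of why
"Bot%C3%A1nica" displays as "Botánica": the two escapes are the UTF-8
bytes of "á". The class and method names are made up for this example.
Note that java.net.URLDecoder targets form encoding and also turns "+"
into a space, so it is only a rough stand-in for what a browser does
with a path.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class DecodeDemo {
    // Hypothetical helper: percent-decodes a string as UTF-8.
    // Caveat: URLDecoder also converts '+' to a space.
    static String decode(String encoded) {
        try {
            return URLDecoder.decode(encoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // %C3%A1 is the UTF-8 byte sequence for U+00E1 (a with acute)
        System.out.println(decode("/wiki/Bot%C3%A1nica"));
    }
}
```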

My conclusion is to percent-encode everything that is not unreserved.
It might be sub-optimal as some characters, such as brackets, do not
need encoding, but I would rather be safe than sorry.

[1] http://stackoverflow.com/questions/573184/java-convert-string-to-valid-uri-object/3332864#3332864
[2] The final code that does the escaping:

    // Requires java.nio.ByteBuffer, java.nio.CharBuffer,
    // java.nio.charset.CharacterCodingException and the JDK-internal
    // sun.nio.cs.ThreadLocalCoders.
    private static final String UNRESERVED =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~";

    private static final char[] hexDigits = {'0', '1', '2', '3', '4',
        '5', '6', '7', '8', '9', 'A', 'B', 'C', 'D', 'E', 'F'};

    // stolen from java.net.URI and modified to ensure semicolons, etc.
    // get encoded
    private static void appendEncoded(StringBuilder sb, char c) {
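        ByteBuffer bb;
        try {
            bb = ThreadLocalCoders.encoderFor("UTF-8")
                    .encode(CharBuffer.wrap("" + c));
        } catch (CharacterCodingException x) {
            // Thrown for input the encoder rejects, e.g. half of a
            // surrogate pair. Fail loudly even with assertions disabled
            // instead of hitting a NullPointerException below.
            throw new IllegalStateException(x);
        }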
        while (bb.hasRemaining()) {
            int b = bb.get() & 0xff;
            sb.append('%');
            sb.append(hexDigits[(b >> 4) & 0x0f]);
            sb.append(hexDigits[(b) & 0x0f]);
        }
    }

    // to escape, one needs to iterate over all characters and escape if
    // !isUnreserved(yourChar)
    private static boolean isUnreserved(char c) {
        return UNRESERVED.indexOf(c) != -1;
    }
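
For anyone who wants the whole loop in one place, below is a minimal
sketch of the "encode everything that is not unreserved" approach using
only public APIs. It is not the code I run in production: the class
name PercentEncoder and the method encode are made up for this example,
and it swaps ThreadLocalCoders for String.getBytes, which silently
substitutes a replacement byte for malformed input instead of throwing.
It walks code points rather than chars so that surrogate pairs are
encoded as one UTF-8 sequence. It is meant for a single path segment,
since "/" gets escaped too.

```java
import java.nio.charset.StandardCharsets;

public class PercentEncoder {
    private static final String UNRESERVED =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-._~";
    private static final char[] HEX = "0123456789ABCDEF".toCharArray();

    // Hypothetical helper: percent-encodes every character that is not
    // in the unreserved set, using UTF-8 for the byte representation.
    public static String encode(String s) {
        StringBuilder sb = new StringBuilder();
        int i = 0;
        while (i < s.length()) {
            int cp = s.codePointAt(i);
            int len = Character.charCount(cp); // 2 for a surrogate pair
            if (len == 1 && UNRESERVED.indexOf(cp) != -1) {
                sb.append((char) cp);
            } else {
                byte[] bytes = s.substring(i, i + len)
                                .getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    sb.append('%');
                    sb.append(HEX[(b >> 4) & 0x0f]);
                    sb.append(HEX[b & 0x0f]);
                }
            }
            i += len;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("Bot\u00e1nica")); // Bot%C3%A1nica
        System.out.println(encode("a;b[c]"));        // a%3Bb%5Bc%5D
    }
}
```

The semicolon and the brackets are all escaped here, which is the
"safe rather than sorry" behaviour described above.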

Regards,
Mindaugas
