On 11/03/2009 09:50 PM, Sidney Markowitz wrote:
Warren Togami wrote, On 4/11/09 3:27 PM:
It seems clear that we will need to flatten/encode any URI domain to
punycode for URIBL lookups.

I agree with that -- if something has non-ASCII characters then punycode
is the canonical form to use to look it up.
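(A minimal sketch of that flattening, using Python's built-in "idna" codec and the IDN test domain that appears later in this thread:)

```python
# Flatten an internationalized domain to its canonical ASCII-compatible
# (punycode) form -- the form a URIBL lookup key would need.
# Python's built-in "idna" codec (IDNA 2003) splits on dots and
# punycode-encodes each non-ASCII label.
domain = "日本語.テスト"  # the IDN test domain used later in this thread
ascii_form = domain.encode("idna").decode("ascii")
print(ascii_form)  # xn--wgv71a119e.xn--zckzah
```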

The unclear part is whether we will need to decode URIs prior to punycode
encoding. I suspect we will be forced to decode.

I'm not sure exactly what you mean, but the big issue that I see is how
to determine that a string is a URL (where it starts and where it stops)
that needs to be encoded to punycode. Is that what you are talking
about? The rule of thumb that I used when working on code to extract
URLs from plain text is that if some common MUA hot-links it, then we
want to treat it as a URL. Perhaps the answer is to wait until MUAs
support these URLs and then follow that rule of thumb.

-- sidney

http://日本語.テスト/

Did your MUA turn that into a clickable link?

Thunderbird     yes
Evolution       no
GMail           yes
Squirrelmail    no
Roundcubemail   no

Yes, it might be hard to figure out the beginning and end of a URL without decoding the entire message. Determining whether we can do it without full body decoding will be an important first step before deciding what else we will do.

I suspect we will be forced to decode arbitrary encodings before punycode flattening because URI domains can be encoded in different ways.

The following examples are not strictly correct, but they demonstrate the problem:

The domain sent as UTF-8, without decoding:
http://日本語.テスト/

The domain sent as ISO-2022-JP, without decoding:
http://$BF|K\8l(B.$B%F%9%H(B/

Both of these strings are the same domain name, but if they are not decoded before punycode flattening they will query as different strings in the URIBL lookup.
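As a sketch of the fix (in Python rather than SpamAssassin's Perl, and with the ESC bytes restored to the ISO-2022-JP sample above, since they are not printable in email): decoding the transport charset first makes both forms flatten to the same punycode key.

```python
# The same domain in two transport encodings. The ISO-2022-JP bytes
# include the ESC sequences that were stripped in the printable sample.
utf8_bytes = "日本語.テスト".encode("utf-8")
jis_bytes = b"\x1b$BF|K\\8l\x1b(B.\x1b$B%F%9%H\x1b(B"

# Decode each to Unicode first, then flatten with the idna codec.
a = utf8_bytes.decode("utf-8").encode("idna")
b = jis_bytes.decode("iso-2022-jp").encode("idna")

print(a)       # b'xn--wgv71a119e.xn--zckzah'
print(a == b)  # True: one canonical URIBL key for both encodings
```

Without the charset decode step, the raw byte strings would hash to two different lookup keys for one domain.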

Warren Togami
[email protected]
