On 11/03/2009 09:50 PM, Sidney Markowitz wrote:
Warren Togami wrote, On 4/11/09 3:27 PM:
It seems clear that we will need to flatten/encode any URI domain to
punycode for URIBL lookups.

I agree with that -- if something has non-ASCII characters then punycode
is the canonical form to use to look it up.
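(A minimal sketch of that flattening, using Python's built-in "idna" codec and the IDN test domain that appears later in this thread:)

```python
# Flatten an internationalized domain to its canonical ASCII-compatible
# (punycode) form -- the form a URIBL lookup key would need.
# Python's built-in "idna" codec (IDNA 2003) splits on dots and
# punycode-encodes each non-ASCII label.
domain = "日本語.テスト"  # the IDN test domain used later in this thread
ascii_form = domain.encode("idna").decode("ascii")
print(ascii_form)  # xn--wgv71a119e.xn--zckzah
```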

The unclear part is whether we will need to decode URIs prior to punycode
encoding. I suspect we will be forced to decode.

I'm not sure exactly what you mean, but the big issue that I see is how
to determine that a string is a URL (where it starts and where it stops)
that needs to be encoded to punycode. Is that what you are talking
about? The rule of thumb that I used when working on code to extract
URLs from plain text is that if some common MUA hot-links it, then we
want to treat it as a URL. Perhaps the answer is to wait until MUAs
support these URLs and then follow that rule of thumb.

-- sidney

http://日本語.テスト/

Did your MUA turn that into a clickable link?

Thunderbird     yes
Evolution       no
GMail           yes
Squirrelmail    no
Roundcubemail   no

Yes, it might be hard to figure out the beginning and end of a URL without decoding the entire message. Determining whether we can do it without full body decoding will be an important first step before deciding what else we will do.

I suspect we will be forced to decode arbitrary encodings before punycode flattening because URI domains can be encoded in different ways.

The following examples are not strictly correct, but they demonstrate the problem:

The domain sent as UTF-8, without decoding:
http://日本語.テスト/

The domain sent as ISO-2022-JP, without decoding:
http://$BF|K\8l(B.$B%F%9%H(B/

Both of these strings are the same domain name, but if they are not decoded before punycode flattening they will query as different strings in the URIBL lookup.
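As a sketch of the fix (in Python rather than SpamAssassin's Perl, and with the ESC bytes restored to the ISO-2022-JP sample above, since they are not printable in email): decoding the transport charset first makes both forms flatten to the same punycode key.

```python
# The same domain in two transport encodings. The ISO-2022-JP bytes
# include the ESC sequences that were stripped in the printable sample.
utf8_bytes = "日本語.テスト".encode("utf-8")
jis_bytes = b"\x1b$BF|K\\8l\x1b(B.\x1b$B%F%9%H\x1b(B"

# Decode each to Unicode first, then flatten with the idna codec.
a = utf8_bytes.decode("utf-8").encode("idna")
b = jis_bytes.decode("iso-2022-jp").encode("idna")

print(a)       # b'xn--wgv71a119e.xn--zckzah'
print(a == b)  # True: one canonical URIBL key for both encodings
```

Without the charset decode step, the raw byte strings would hash to two different lookup keys for one domain.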

Warren Togami
[email protected]
