On 2012/04/28 4:26, Mark Davis ☕ wrote:
> Actually, if the goal is to get as many characters in as possible, Punycode
> might be the best solution. That is the encoding used for internationalized
> domains. In that form, it uses a smaller number of bytes per character, but
> a parameterization allows use of all byte values.

Because punycode encodes differences between character numbers, not the character numbers themselves, it can indeed be quite efficient, in particular if the characters used are tightly packed (e.g. Greek, Hebrew, ...). For languages written in Latin script with accented characters, the question is how close together these accented characters are in Unicode.
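
As a rough way to check this, here is a minimal sketch that uses the "punycode" codec from Python's standard library to compute encoded bytes per character for a few short words. The function name and the sample words are my own choices for illustration, and single words are of course far too short to say anything about a whole language.

```python
def punycode_bytes_per_char(text: str) -> float:
    """Return the number of Punycode output bytes per input character."""
    encoded = text.encode("punycode")        # bare Punycode, no "xn--" prefix
    assert encoded.decode("punycode") == text  # sanity check: round trip
    return len(encoded) / len(text)

# Illustrative samples only (assumption: picked by hand, not a corpus).
samples = {
    "Greek": "ελληνικά",
    "Hebrew": "עברית",
    "Latin + accent": "bücher",
}

for name, word in samples.items():
    print(f"{name}: {punycode_bytes_per_char(word):.2f} bytes/char")
```

Note that ASCII characters are copied through literally, so for mostly-ASCII words the number mainly reflects the cost of the few non-ASCII insertions.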

However, punycode also encodes character positions. Because of this, it gets less efficient for longer text.

[Because punycode uses (circular) position differences rather than simple positions, this contribution is limited by the (rounded-up binary logarithm of the) weighted average distance between two same characters in the text/language.]

My guess is therefore that punycode won't necessarily be super-efficient for texts in the 100+ character range. It's difficult to test quickly because the punycode converters on the Web limit the output to 63 characters, the maximum length of a label in a domain name.
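
One way around the 63-character limit might be the "punycode" codec in Python's standard library, which as far as I can tell does not enforce the label length. The following is only a sketch: the function name is made up, and the sample sentence is a placeholder that should be replaced with real text of 100+ characters in the language of interest.

```python
def punycode_cost_by_length(text: str, lengths: list[int]) -> list[tuple[int, int, int]]:
    """For each prefix length n, return (n, Punycode bytes, UTF-8 bytes)."""
    rows = []
    for n in lengths:
        if n > len(text):
            continue  # skip prefix lengths longer than the sample
        prefix = text[:n]
        rows.append((n,
                     len(prefix.encode("punycode")),
                     len(prefix.encode("utf-8"))))
    return rows

# Placeholder sample (assumption): substitute a longer, representative text.
sample_text = "Ξεσκεπάζω την ψυχοφθόρα βδελυγμία"

for n, puny, utf8 in punycode_cost_by_length(sample_text, [10, 20, 30]):
    print(f"{n:4d} chars: punycode {puny} bytes, UTF-8 {utf8} bytes")
```

Comparing the two byte counts as n grows should show whether the position component starts to dominate for longer text.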

Regards,    Martin.
