Re: Unicode, SMS and year 2012

Martin J. Dürst Fri, 27 Apr 2012 22:14:16 -0700

On 2012/04/28 4:26, Mark Davis ☕ wrote:

Actually, if the goal is to get as many characters in as possible, Punycode
might be the best solution. That is the encoding used for internationalized
domains. In that form, it uses a smaller number of bytes per character, but
a parameterization allows use of all byte values.

Because punycode encodes differences between character numbers, not the character numbers themselves, it can indeed be quite efficient, in particular if the characters used are tightly packed (e.g. Greek, Hebrew, ...). For languages with Latin script and accented characters, the question is how close these accented characters are in Unicode.

However, punycode also codes character positions. Because of this, it gets less efficient for longer text.

[Because punycode uses (circular) position differences rather than simple positions, this contribution is limited by the (rounded-up binary logarithm of the) weighted average distance between two same characters in the text/language.]
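
To make the interplay between character-number differences and positions concrete, here is a minimal sketch of the delta computation from the encoding procedure in RFC 3492. It only produces the raw integer deltas; the variable-length integer encoding, bias adaptation and overflow checks are left out, and the function name punycode_deltas is made up for this illustration.

```python
def punycode_deltas(text):
    """Return the raw deltas that Punycode would encode for `text`.

    Sketch only: follows the delta bookkeeping of the RFC 3492 encoder,
    but skips the variable-length integer output and bias adaptation.
    """
    cps = [ord(c) for c in text]
    n = 128                                # first non-basic code point
    delta = 0
    h = sum(1 for c in cps if c < 128)     # basic (ASCII) characters are copied as-is
    deltas = []
    while h < len(cps):
        # Next larger code point that still has to be inserted.
        m = min(c for c in cps if c >= n)
        # A jump in code point costs (gap) * (number of insertion slots).
        delta += (m - n) * (h + 1)
        n = m
        for c in cps:
            if c < n:
                delta += 1                 # skip one already-handled position
            elif c == n:
                deltas.append(delta)       # one delta per inserted character
                delta = 0
                h += 1
        delta += 1
        n += 1
    return deltas


print(punycode_deltas("ü"))       # [124]
print(punycode_deltas("bücher"))  # [745]
print(punycode_deltas("αβ"))      # [817, 2]
```

The first delta pays for the jump from 128 up to the script's code range, multiplied by the number of possible insertion positions; later deltas for nearby code points are dominated by the position part, which grows with the length of the text.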

My guess is therefore that punycode won't necessarily be super-efficient for texts in the 100+ character range. It's difficult to test quickly because the punycode converters on the Web limit the output to 63 characters, the maximum length of a label in a domain name.
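
One way to try this without a Web converter is Python's built-in "punycode" codec, which works on whole strings and does not apply the 63-character label limit. The sketch below compares encoded sizes for growing text lengths; the helper name encoded_sizes and the Greek sample sentence are assumptions for illustration, and repeating one sentence is only a rough stand-in for real longer text.

```python
def encoded_sizes(text):
    """Return the length of `text` in characters and in bytes per encoding."""
    puny = text.encode("punycode")            # ASCII-only output
    assert puny.decode("punycode") == text    # sanity check: lossless round trip
    return {
        "chars": len(text),
        "punycode": len(puny),
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),  # no byte order mark
    }


# Hypothetical sample text; substitute real messages for a meaningful test.
sample = "Καλημέρα κόσμε, τι κάνεις σήμερα; "

for repeat in (1, 2, 4, 8):
    sizes = encoded_sizes(sample * repeat)
    print(
        f"{sizes['chars']:4d} chars: "
        f"punycode={sizes['punycode']} "
        f"utf-8={sizes['utf-8']} "
        f"utf-16={sizes['utf-16']} bytes"
    )
```

Comparing the punycode column against the others as the text grows shows whether the position overhead outweighs the savings from small code point differences for a given language.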


Regards,    Martin.
