----- Original Message -----
From: "Martin Duerst" <[EMAIL PROTECTED]>
To: "Soobok Lee" <[EMAIL PROTECTED]>; "James Seng/Personal" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Friday, October 19, 2001 5:58 PM
Subject: Re: [idn] call for comments for REORDERING

> At 17:39 01/10/19 +0900, Soobok Lee wrote:
>
> >----- Original Message -----
> >From: "Martin Duerst" <[EMAIL PROTECTED]>
> >
> > > >1) saturation in TLD namespaces would require longer names, for which
> > > >   REORDERING is designed to give greater benefits/compression ratio.
> > >
> > > No. What James referred to is that saturation tends to fill up the
> > > short name slots, and thus flatten the probability distribution.
> > > I.e. if somebody doesn't get the name they wanted, the chance is
> > > that they go for something like xq.com, because it's easy to
> > > remember because it's short. Neither x nor q are very frequent
> > > letters.
> >
> >Han/hangeul characters carry meanings, while Latin letters
> >denote phonemes. Therefore your analogy between Latin and han domains
> >may be false. Chinese people would rather register
> >digit-added variants of already taken desired domains in a saturated ML.com
> >than choose nonsensical, irrelevant, rare han characters.
>
> Some really rare and irrelevant han characters may indeed never
> be chosen. But still if you want to name a company, there are
> many different possibilities, and people will look for short,
> not yet used possibilities (which still make some sense)
> rather than use longer and longer names.
>

In most cases, they add Latin digits. CJK people would know what I am saying.

>
> >Later, I will provide some proof that SC and TC have only a
> >small set of frequent characters. That's already clear in the
> >SJIS and KSC5601 han character sets, whose size is less than 5000.
>
> Yes, this is true.
>
> > > >to avoid country-specific biases in the han reordering table.
> > > >
> > > >non-CJK scripts often have a small set of basic letters, and their
> > > >character usage patterns are more stable than those for han/hangeul.
> > >
> > > No, many other scripts are used for many more languages, with
> > > quite different usage patterns. (A lot of Han usage in Japan,
> > > and most of it in Korea, is due to loanwords from Chinese.)
> >
> >But, even without Urdu consideration in
> >arabic reordering, the efficiency of reordering is always better than
> >without it, because the lexicographic ordering in the un-reordered
> >arabic script block can be regarded as *RANDOM* ordering
> >in frequency measure (maximum entropy).
>
> It's probably not, because most alphabets contain a few
> 'late additions'.

If the reordering table for a script ever needs modification for added
characters, that can be done in the next version of nameprep/ACE, with a
new ACE prefix.

> And just using first order frequency
> to bring the most frequent characters to the front may
> not be the most efficient way for compression.
>

Do you have a good idea that could replace the current first-order frequency
reordering? Any changes to it are welcome. If someone devises a new ordering
scheme in the future, it may replace the current reordering scheme in the
next nameprep/ACE version, with a new ACE prefix.

>
> >Partial reordering (without Urdu consideration) is always better than
> >no reordering.
>
> I don't deny that you may be able to squeeze out a few bits.
> But I don't think that should be the aim of this exercise.
>
> >If Urdu text samples are available, my arabic reordering table may be
> >improved to reflect them, though.
>
> Which might then make it less efficient for Arabic.

Yes, but marginally.

>
>
> Regards,   Martin.
>
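
To make the "first-order frequency reordering" discussed in this thread concrete, here is a minimal Python sketch. It is not the reordering table or algorithm from any draft. The function names `build_reordering_table` and `reorder` and the toy sample text are hypothetical, chosen only for illustration. The sketch assumes a table is built by ranking characters by how often they occur in a sample, so that frequent characters map to small values.

```python
from collections import Counter


def build_reordering_table(sample):
    """Map each character in `sample` to its frequency rank (0 = most frequent)."""
    counts = Counter(sample)
    # Descending count; ties broken by code point so the table is deterministic.
    ranked = sorted(counts, key=lambda ch: (-counts[ch], ord(ch)))
    return {ch: rank for rank, ch in enumerate(ranked)}


def reorder(label, table):
    """Replace each character by its rank; raises KeyError for unseen characters."""
    return [table[ch] for ch in label]


# Toy sample standing in for a real corpus of names (an assumption).
table = build_reordering_table("banana")
print(table)                  # {'a': 0, 'n': 1, 'b': 2}
print(reorder("nab", table))  # [1, 0, 2]
```

Under this scheme a label made of common characters becomes a sequence of small integers, which a variable-length encoding can typically store in fewer bits. The sketch uses only single-character counts, so it does not address the objection that first-order frequency may not be the most efficient basis for compression.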
