This posting explains why REORDERING does something good only for non-latin scripts.
I have 47263 latin ML.com samples from various parts of Eastern and Western Europe.

  Total # of code points:       464915
  # of basic latin letters:     409922
  # of extended latin letters:   54993  (0x00a0 ~ 0x0370)

This data shows that extended latin letters account for about 12% (roughly 1/9) of the code points in typical latin labels. But the mean length of latin labels is about 10 code points, so a label carries on average about 10 * 0.12 = 1.2 extended latin letters. Therefore, in most cases, a latin label contains only 1 extended latin letter.

As you know, AMC-Z is designed to favor basic latin letters (called basic code points in AMC-Z): it encodes them in literal mode "as is", and for those letters reordering v2.0 does not do anything.

One typical latin domain of length 12:

  b<diaeresis u>stenhalter
  U+0062 U+00FC U+0073 U+0074 U+0065 U+006E U+0068 U+0061 U+006C U+0074 U+0065 U+0072

  AMC-Z:            bstenhalter-thB
  AMC-Z+REORDERING: bstenhalter-ymB  (the same length)

With AMC-Z, the non-basic (extended) latin letter <diaeresis u> is encoded into "-thB" in the latter part of the ACE label.

Reordering v2.0 reorders only the code point values of EXTENDED latin letters, based on frequency distribution data from ML.com samples from various countries. Since most labels contain just 1 extended latin letter, reordering the extended latin letters doesn't help, because reordering is designed to reduce the successive code distances between 2 or more non-basic code points.

I think it's fair to compensate for the disadvantage that AMC-Z imposes on non-latin scripts.

The next table, from my I-D, summarizes the reordering improvement for the latin ML.com samples of each length N.

(How to read the tables)

  N:            length of a domain label (# of code points)
  FREQ:         number of domains of length N
  N*FREQ:       sum of # of code points of domains of length N
  SUM OF AMCZ:  sum of lengths of AMCZ labels
  X:            SUM OF AMCZ / (N * FREQ)
  SUM OF LAMCZ: sum of lengths of LAMCZ labels
  Y:            SUM OF LAMCZ / (N * FREQ)
  COMP:         (SUM OF AMCZ - SUM OF LAMCZ) / SUM OF AMCZ * 100

16.
latin

|  N|   FREQ| N*FREQ|SUM OF AMCZ(X)|SUM OF LAMCZ(Y)| COMP|
|  1|     87|     87|     260(2.99)|      259(2.98)| 0.38|
|  2|   1043|   2086|    5140(2.46)|     5134(2.46)| 0.12|
|  3|   1046|   3138|    6274(2.00)|     6241(1.99)| 0.53|
|  4|   1812|   7248|   12750(1.76)|    12715(1.75)| 0.27|
|  5|   3238|  16190|   26129(1.61)|    26047(1.61)| 0.31|
|  6|   3956|  23736|   35894(1.51)|    35802(1.51)| 0.26|
|  7|   4340|  30380|   43756(1.44)|    43633(1.44)| 0.28|
|  8|   4639|  37112|   51351(1.38)|    51286(1.38)| 0.13|
|  9|   4551|  40959|   54994(1.34)|    54873(1.34)| 0.22|
| 10|   4289|  42890|   56159(1.31)|    56058(1.31)| 0.18|
| 11|   3778|  41558|   53227(1.28)|    53157(1.28)| 0.13|
| 12|   2967|  35604|   44820(1.26)|    44754(1.26)| 0.15|
| 13|   2501|  32513|   40264(1.24)|    40197(1.24)| 0.17|
| 14|   2058|  28812|   35212(1.22)|    35174(1.22)| 0.11|
| 15|   1653|  24795|   29947(1.21)|    29918(1.21)| 0.10|
| 16|   1372|  21952|   26264(1.20)|    26224(1.19)| 0.15|
| 17|   1094|  18598|   22053(1.19)|    21994(1.18)| 0.27|
| 18|    839|  15102|   17782(1.18)|    17722(1.17)| 0.34|
| 19|    632|  12008|   14045(1.17)|    13988(1.16)| 0.41|
| 20|    464|   9280|   10778(1.16)|    10721(1.16)| 0.53|
| 21|    312|   6552|    7539(1.15)|     7516(1.15)| 0.31|
| 22|    194|   4268|    4905(1.15)|     4876(1.14)| 0.59|
| 23|    124|   2852|    3242(1.14)|     3234(1.13)| 0.25|
| 24|     71|   1704|    1935(1.14)|     1925(1.13)| 0.52|
| 25|     71|   1775|    2011(1.13)|     2002(1.13)| 0.45|
| 26|     37|    962|    1083(1.13)|     1080(1.12)| 0.28|
| 27|     33|    891|    1004(1.13)|      996(1.12)| 0.80|
| 28|     17|    476|     535(1.12)|      529(1.11)| 1.12|
| 29|     13|    377|     422(1.12)|      420(1.11)| 0.47|
| 30|      9|    270|     298(1.10)|      299(1.11)|-0.34|
| 31|      7|    217|     243(1.12)|      238(1.10)| 2.06|
| 32|      9|    288|     321(1.11)|      316(1.10)| 1.56|
| 33|      4|    132|     146(1.11)|      144(1.09)| 1.37|
| 34|      2|     68|      76(1.12)|       74(1.09)| 2.63|
| 35|      1|     35|      38(1.09)|       38(1.09)| 0.00|

For arabic labels of length > 13, the compression ratio is close to 13%. Compare this result with the latin one above.

1.
arabic

|  N|   FREQ| N*FREQ|SUM OF AMCZ(X)|SUM OF LAMCZ(Y)| COMP|
|  1|     42|     42|     126(3.00)|      126(3.00)| 0.00|
|  2|     59|    118|     258(2.19)|      249(2.11)| 3.49|
|  3|    363|   1089|    2121(1.95)|     1992(1.83)| 6.08|
|  4|    888|   3552|    6359(1.79)|     5811(1.64)| 8.62|
|  5|   1122|   5610|    9550(1.70)|     8529(1.52)|10.69|
|  6|   1009|   6054|    9890(1.63)|     8620(1.42)|12.84|
|  7|    845|   5915|    9309(1.57)|     8134(1.38)|12.62|
|  8|    378|   3024|    4590(1.52)|     3992(1.32)|13.03|
|  9|    263|   2367|    3523(1.49)|     3063(1.29)|13.06|
| 10|    152|   1520|    2230(1.47)|     1941(1.28)|12.96|
| 11|    130|   1430|    2058(1.44)|     1787(1.25)|13.17|
| 12|    110|   1320|    1873(1.42)|     1614(1.22)|13.83|
| 13|     67|    871|    1230(1.41)|     1040(1.19)|15.45|
| 14|     61|    854|    1211(1.42)|     1015(1.19)|16.18|
| 15|     52|    780|    1085(1.39)|      924(1.18)|14.84|
| 16|     34|    544|     743(1.37)|      630(1.16)|15.21|
| 17|     11|    187|     256(1.37)|      218(1.17)|14.84|
| 18|     19|    342|     465(1.36)|      392(1.15)|15.70|
| 19|      8|    152|     201(1.32)|      175(1.15)|12.94|
| 20|     10|    200|     268(1.34)|      235(1.18)|12.31|
| 21|      3|     63|      85(1.35)|       75(1.19)|11.76|
| 22|      4|     88|     116(1.32)|       99(1.12)|14.66|
| 23|      3|     69|      89(1.29)|       76(1.10)|14.61|
| 24|      2|     48|      62(1.29)|       55(1.15)|11.29|
| 25|      5|    125|     165(1.32)|      143(1.14)|13.33|
| 26|      2|     52|      67(1.29)|       56(1.08)|16.42|
| 27|      2|     54|      73(1.35)|       61(1.13)|16.44|
| 33|      1|     33|      41(1.24)|       37(1.12)| 9.76|
| 34|      1|     34|      45(1.32)|       36(1.06)|20.00|
|All|   5646|  36537|   58089(1.59)|    51125(1.40)|11.99|

For hangul, the compression ratio reaches 31%.

8.
hangul-1024

|  N|   FREQ| N*FREQ|SUM OF AMCZ(X)|SUM OF LAMCZ(Y)| COMP|
|  1|   1953|   1953|    7812(4.00)|     7812(4.00)| 0.00|
|  2|  17149|  34298|  124782(3.64)|   106238(3.10)|14.86|
|  3|  39643| 118929|  403205(3.39)|   323801(2.72)|19.69|
|  4|  62285| 249140|  816093(3.28)|   622067(2.50)|23.77|
|  5|  39675| 198375|  636102(3.21)|   470174(2.37)|26.09|
|  6|  23891| 143346|  452483(3.16)|   326242(2.28)|27.90|
|  7|  12448|  87136|  271953(3.12)|   192139(2.21)|29.35|
|  8|   5441|  43528|  134600(3.09)|    94322(2.17)|29.92|
|  9|   2264|  20376|   62405(3.06)|    43266(2.12)|30.67|
| 10|    895|   8950|   27223(3.04)|    18764(2.10)|31.07|
| 11|    373|   4103|   12420(3.03)|     8511(2.07)|31.47|
| 12|    141|   1692|    5080(3.00)|     3505(2.07)|31.00|
| 13|     77|   1001|    2986(2.98)|     2039(2.04)|31.71|
| 14|     32|    448|    1331(2.97)|      911(2.03)|31.56|
| 15|     20|    300|     884(2.95)|      603(2.01)|31.79|
| 16|     10|    160|     460(2.88)|      337(2.11)|26.74|
| 17|      7|    119|     354(2.97)|      243(2.04)|31.36|
|All| 206304| 913854| 2960173(3.24)|  2220974(2.43)|24.97|

REORDERING consists only of character mappings. It resembles a legacy-to-UCS2 mapping, which preserves the character entity itself while assigning a new, different integer code value to it. It's clear that REORDERING adds just as much complexity as the legacy-to-UCS mappings done by IDNA-aware applications before the nameprep/ACE process.

Soobok Lee
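P.S. As a cross-check, the X, Y and COMP columns of the tables above can be recomputed from the raw sums. The following is only a minimal sketch: the names `ROWS` and `ratios` are illustrative and not taken from the I-D, and it assumes COMP is the relative saving (SUM OF AMCZ - SUM OF LAMCZ) / SUM OF AMCZ * 100, which is the sign convention the table values follow. The three rows are the N = 10 rows copied from the latin, arabic and hangul-1024 tables.

```python
# Recompute X, Y and COMP for a few rows copied from the tables above.
# Each row: (script, N, FREQ, SUM OF AMCZ, SUM OF LAMCZ)
ROWS = [
    ("latin", 10, 4289, 56159, 56058),
    ("arabic", 10, 152, 2230, 1941),
    ("hangul-1024", 10, 895, 27223, 18764),
]


def ratios(n, freq, sum_amcz, sum_lamcz):
    """Return (X, Y, COMP) for one table row.

    X and Y are encoded characters per code point; COMP is the percentage
    saved by reordering (assumed sign convention: positive means shorter).
    """
    code_points = n * freq
    x = sum_amcz / code_points
    y = sum_lamcz / code_points
    comp = (sum_amcz - sum_lamcz) / sum_amcz * 100
    return x, y, comp


if __name__ == "__main__":
    for script, n, freq, amcz, lamcz in ROWS:
        x, y, comp = ratios(n, freq, amcz, lamcz)
        print(f"{script:12s} N={n:2d} X={x:.2f} Y={y:.2f} COMP={comp:.2f}")
```

Rounded to two decimals, the three rows reproduce the table entries: the latin row saves under 0.2%, while the arabic and hangul-1024 rows of the same length save about 13% and 31%.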
