For the confusables, the presumption is that implementations have already either normalized the input to NFKC or rejected input that is not NFKC.
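
To illustrate that presumption, here is a minimal Python sketch showing why a fullwidth letter does not need its own confusables entry: NFKC folds it to the ASCII letter before any confusable lookup happens. The helper names `is_nfkc` and `prepare_identifier` are hypothetical, made up for this example; only `unicodedata.normalize` and `unicodedata.name` come from the standard library.

```python
import unicodedata


def is_nfkc(text: str) -> bool:
    """Return True if text is already in NFKC form (hypothetical helper)."""
    return unicodedata.normalize("NFKC", text) == text


def prepare_identifier(text: str, reject: bool = False) -> str:
    """Normalize to NFKC, or raise if reject=True and the input is not NFKC.

    Hypothetical helper covering the two options: normalize or reject.
    """
    if reject and not is_nfkc(text):
        raise ValueError(f"input is not NFKC: {text!r}")
    return unicodedata.normalize("NFKC", text)


fullwidth_m = "\uFF4D"  # FULLWIDTH LATIN SMALL LETTER M
folded = prepare_identifier(fullwidth_m)

# After NFKC the fullwidth letter is the plain ASCII letter.
print(f"U+{ord(folded):04X}", unicodedata.name(folded))
# prints: U+006D LATIN SMALL LETTER M
```

Either branch (normalize, or reject with `reject=True`) means U+FF4D never reaches the confusable check as itself.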
More broadly, in gathering data the main emphasis is on characters that fit the profile in http://www.unicode.org/reports/tr39/#Identifier_Characters, including scripts like Cyrillic (http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts). So while we do add characters outside of that profile, there has been no concerted effort to do so. In particular, you should not allow scripts like Buginese (http://www.unicode.org/reports/tr31/#Table_Candidate_Characters_for_Exclusion_from_Identifiers) or Lisu (http://www.unicode.org/reports/tr31/#Table_Limited_Use_Scripts) in your identifiers without recognizing that the confusable data will be sketchy for those.

It would probably be worth clarifying this in the text of http://www.unicode.org/reports/tr39/#Identifier_Characters. There is an upcoming UTC meeting at the start of November, so if you want to suggest that or any other improvements, you should use the reporting form at http://www.unicode.org/reporting.html.

Mark

On Sun, Oct 13, 2013 at 7:36 PM, Chris Weber <ch...@lookout.net> wrote:

> While looking closer at the current confusables data, I've noticed that
> several of the fullwidth code points seem to be missing from the
> confusables data. For example, U+FF4D FULLWIDTH LATIN SMALL LETTER M
> does not exist as a confusable for U+006D LATIN SMALL LETTER M, as well
> as several others I've noticed.
>
> Was this intentional?
>
> Also, I'm not clear on the difference between the confusables.txt and
> confusablesSummary.txt - are these meant to provide the same data in
> different formats?
>
> --
> Best regards,
> Chris Weber - ch...@lookout.net - http://www.lookout.net
> PGP: F18B 2F5D ED81 B30C 58F8 3E49 3D21 FD57 F04B BCF7