From: "Christopher John Fynn" <[EMAIL PROTECTED]>
> Anyone have a list of other standards, protocols, RFC's etc which specify
> Unicode (in any of it's encoding formats) as the base, default or preferred
> character set to be used?

For RFCs, it's not difficult to get this list using the RFCeditor.org built-in search engine.

However, a more interesting list would cover standards that were built on non-Unicode, non-ISO/IEC 10646 charsets registered in IANA and that have since been mapped onto Unicode, where these standards may perform some string processing that does not conform to Unicode processing rules. For example, these other standards may specify canonical equivalences which do not exist in Unicode:

- Some ETSI standards for Teletext, which may contain more combining marks than those currently encoded in Unicode, and may create some canonical or compatibility equivalences.
- Asian string processing algorithms, notably for Hangul, Han and Hiragana/Katakana.

These standards could be supported by documenting the additional equivalences as Unicode folding rules. So far, Unicode and ISO/IEC have focused on preserving the distinctions made in the supported character sets, but I think there is still some work to do on grapheme clusters that are now distinct in Unicode but equivalent or compatibility equivalent in other standards.

Documenting the folding algorithms that may be used with Unicode is probably a huge task, as complex as the unification of repertoires within the ISO/IEC 10646 assignments of code points, or within the Unicode canonical equivalences. Knowing them would certainly help to handle texts safely in Unicode when they were initially coded with legacy charsets.
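
As a rough illustration of what such a documented folding rule might look like in practice, here is a minimal Python sketch. It assumes, purely for the sake of example, that some other standard treats Hiragana and Katakana as equivalent; the table KANA_FOLD and the functions fold() and equivalent() are hypothetical names, not taken from Unicode or from any of the standards mentioned above. Only unicodedata.normalize() and the shift_jis codec come from the Python standard library.

```python
import unicodedata

# Illustrative folding table only (an assumption, not a rule from any standard):
# map each hiragana letter U+3041..U+3096 to the katakana letter 0x60 higher.
KANA_FOLD = {cp: cp + 0x60 for cp in range(0x3041, 0x3097)}


def fold(text: str) -> str:
    """Apply Unicode canonical normalization (NFC), then the extra folding."""
    return unicodedata.normalize("NFC", text).translate(KANA_FOLD)


def equivalent(a: str, b: str) -> bool:
    """Compare two strings under NFC plus the hypothetical kana folding."""
    return fold(a) == fold(b)


# Text initially coded in a legacy charset, decoded to Unicode before processing.
legacy_bytes = "かな".encode("shift_jis")
text = legacy_bytes.decode("shift_jis")

# Unicode normalization alone keeps the two scripts distinct.
print(unicodedata.normalize("NFC", text) == "カナ")  # False

# With the additional folding rule, the strings compare as equivalent.
print(equivalent(text, "カナ"))  # True
```

Keeping the extra equivalences in a separate table, applied after normalization, leaves the Unicode distinctions intact in the stored text and makes the non-Unicode rule explicit and optional.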