On Fri, 27 Jul 2012 09:01:13 -0700 Mark Davis ☕ <m...@macchiato.com> wrote:
> The key term is 'open interchange'.

XML documents are textual objects. It is therefore reasonable to look
at them using tools for displaying textual objects. However,

> "<snip> noncharacters are <snip>
> permanently reserved (unassigned) and have no interpretation
> whatsoever outside of their possible application-internal private
> uses."
>
> For CLDR collation data - *not open interchange, but specific to use
> in CLDR collation data* - these characters have specified use as
> sentinel characters, marking the boundaries for CJK 'buckets' for use
> in indexes.

I hope you're addressing a complaint I haven't made. I haven't
complained about tailoring involving non-characters, though it does
strike me as a least evil. Are you perhaps arguing that I become part
of some CLDR application when I read CLDR XML files?

> This is described in
> http://unicode.org/reports/tr35/#Collation_Elements. The
> noncharacters are chosen specifically so that they do not overlap
> with publicly interchanged private use characters. Of course,
> implementations of LDML can tailor the collations to remove them, or
> replace by other mechanisms.

I was going to ask when the LDML element suppress_contractions took
effect. At least I now have some idea of the answer.

> Unfortunately, some restrictions that were perfectly reasonable for
> use in document interchange become annoying flaws in a general
> structured data interchange format. The inability to interchange all
> Unicode scalar values is one.

The restrictions improve legibility. As it is, many of the
character-level elements in CLDR XML files tend to be unreadable. It
would be better for them not to require genuinely complex text
rendering.

In a related matter, it was very inconvenient to have to treat
collation test files as binary data because they could not be DOS text
files - ctrl/Z in the comments cut the files short.

Richard.