2012/7/27 Richard Wordingham <richard.wording...@ntlworld.com>:
> The restrictions improve legibility. As it is, many of the
> character-level elements in CLDR XML files tend to be unreadable. It
> would be better for them not to require genuinely complex text
> rendering. In a related matter, it was very inconvenient to have to
> treat collation test files as binary data because they could not be DOS
> text files - ctrl/Z in the comments cut the files short.

I do agree. Even if non-characters are used internally when processing CLDR data, that is a result of the internal conversion of the CLDR data files, which are still interchanged as text files. For this reason, the needed sentinels (which are perfectly valid elements to include in the data) should be re-encoded.

As these data are in XML format, it is easy to mark them up using specific XML elements. This allows the files to include not just text but also indirect references to the code points that will be used internally or, probably better, references that make no assumption that these specific non-character code points will be used at all. Since this processing is internal, let's hide the implementation detail and not expose it even indirectly with artefacts like <char cp="0000"/>, but use something more semantic like <sentinel type="xyz"/>.

The internal processing of these sentinels is not restricted to the code point encoding space. It could just as well use any internal integer type with negative values and bit packing of various flags in those values, or some other internal structure that needs no sentinel at all, such as TLV-encoded structures for variable-length data that mix text content with non-text data or meta-information.

The presence of ^Z controls, other XML-restricted controls, or Unicode non-characters in the CLDR data files is also undesirable and unnecessary. Even if they appear only in comments, those comments are expected to be plain text and should still respect the minimum plain-text requirements. These data files should remain fully compliant with plain-text and XML requirements so that they can safely be used by text-processing tools (including text editors, database import tools, spreadsheet processors, and various format converters).
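
To make the idea concrete, here is a rough Python sketch, not a description of how any CLDR tool actually works. The <row> element, the sentinel type names "low" and "high", and the negative values chosen for them are all hypothetical. The first function turns semantic <sentinel type="..."/> markup into internal integers that cannot collide with any code point; the second reports the control characters and non-characters that keep a file from being safe plain text and XML 1.0.

```python
import xml.etree.ElementTree as ET

# Hypothetical sentinel names mapped to negative integers. Unicode code
# points lie in 0..0x10FFFF, so a negative value can never be mistaken
# for real text.
SENTINELS = {"low": -1, "high": -2}


def tokenize(element):
    """Flatten an element into a list of ints: code points for text,
    negative values for <sentinel type="..."/> children."""
    out = [ord(ch) for ch in (element.text or "")]
    for child in element:
        if child.tag != "sentinel":
            raise ValueError(f"unexpected element: {child.tag}")
        out.append(SENTINELS[child.get("type")])
        # Text that follows the child element is stored in its tail.
        out.extend(ord(ch) for ch in (child.tail or ""))
    return out


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and the last two code points of each plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def unsafe_code_points(text):
    """Return (index, code point) pairs for C0 controls other than
    tab, LF and CR, and for Unicode non-characters."""
    bad = []
    for i, ch in enumerate(text):
        cp = ord(ch)
        is_control = cp < 0x20 and ch not in "\t\n\r"
        if is_control or is_noncharacter(cp):
            bad.append((i, cp))
    return bad


if __name__ == "__main__":
    row = ET.fromstring('<row>ab<sentinel type="high"/>c</row>')
    print(tokenize(row))
    # A comment containing ctrl/Z (U+001A) would be flagged here.
    print(unsafe_code_points("comment \x1a text"))
```

With something like this, the data file itself contains only ordinary characters, and the choice of internal representation (negative integers here, but it could be packed flags or a TLV structure) stays entirely on the tool's side.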