Steffen Daode Nurpmeso continued:

> Hmm. To me, this raises the question why these constraints were
> introduced at all. Imho either one adds constraints due to solid
> considerations, and enforces them after some period of backward
> compatibility, or there simply should be no constraints.

The notes you are talking about, on the case mapping fields in UnicodeData.txt, do not really constitute constraints, but rather are attempts to document clearly what the nature of the data is.

The Unicode Consortium does maintain true constraints on various aspects of the data files: those are generally referred to as the "stability guarantees" or the stability policy:

http://www.unicode.org/policies/stability_policy.html

See also:

http://www.unicode.org/policies/property_value_stability_table.html

There is no stability policy (yet) regarding the titlecase field in particular, although there could be, I suppose, if the Unicode Technical Committee (and the Unicode Consortium officers) decided there was a good enough reason to add one.

In the meantime, the Unicode Technical Committee also runs various tests on the UCD for each release, checking what are termed "invariants", to look for possible problems when adding new repertoire or changing properties for existing characters. Some of those invariants are the subject of stability policies and *must* be honored when changing the UCD. Others are simply existing patterns (like the relationship between the titlecase mapping and the uppercase mapping) which are checked to catch the inadvertent introduction of bonehead errors in the data.

>
> There are parsers (i know of one) which use *only* UnicodeData.txt
> for generating tables (using patterns like `SPACE' etc. to join
> characters into sets; which seems to have been common practice in
> the past -- as in [3], „Case Mappings“: „derivable from the
> presence of the terms "CAPITAL" or "SMALL" in the character
> name“).

That is very bad practice, and should be avoided. The UCD documentation warns against making assumptions about character properties based only on character names. It leads to many bad results.

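As a rough illustration of how name-based deduction goes wrong, the sketch below compares a naive "CAPITAL"/"SMALL" name heuristic against the General_Category value that Python's standard `unicodedata` module reports. The helper names `case_guess_from_name` and `case_from_category` are made up for this example, and General_Category is used here only as a convenient stand-in for the real case properties; it is not a complete case-mapping implementation.

```python
import unicodedata


def case_guess_from_name(ch):
    """Naive heuristic (the practice warned against): guess case from the name."""
    words = unicodedata.name(ch, "").split()
    has_capital = "CAPITAL" in words
    has_small = "SMALL" in words
    if has_capital and has_small:
        return "ambiguous"
    if has_capital:
        return "upper"
    if has_small:
        return "lower"
    return "none"


def case_from_category(ch):
    """Read the case class from General_Category (Lu, Ll, Lt) instead."""
    category = unicodedata.category(ch)
    return {"Lu": "upper", "Ll": "lower", "Lt": "title"}.get(category, "none")


# A few characters where the two approaches can be compared side by side.
for ch in ("A", "\u01c5", "\ufe50", "\u2118"):
    print(
        f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
        f"name heuristic={case_guess_from_name(ch)}, "
        f"General_Category={case_from_category(ch)}"
    )
```

U+FE50 SMALL COMMA and U+2118 SCRIPT CAPITAL P are not cased letters at all, and U+01C5 is a titlecase letter whose name contains both terms, so the name heuristic misclassifies all three.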
>
> If there is no such extensive guaranteed backward compatibility
> for UnicodeData.txt content already today then that should be
> noted (i wouldn't know where that is true?), but otherwise it
> cannot be that labour-intensive to drop these constraints again,
> since nothing had to be done at all?
> I.e., are these parsers already broken today?
> Just curious…

Parsers which deduce properties based on character names are definitely broken -- and that would include any case mapping information.

As regards actual constraints, please refer to the stability policies to see what the Unicode Consortium officially claims to be required constraints on data changes.

And if the odd edge cases for parsing the legacy data files (and UnicodeData.txt is the ur-data file with the most legacy status) seem problematical, the ultimate fix is just to refer to the UCD in XML:

http://www.unicode.org/Public/UCD/latest/ucdxml/

which has a fully rationalized and regular structure, well documented in UAX #42.

--Ken