On 8/31/2018 1:36 AM, Manuel Strehl via Unicode wrote:
> For codepoints.net I use that data to stuff everything in a MySQL
> database.

Well, for some sense of "everything", anyway. ;-)

People having this discussion should keep in mind a few significant points.

First, the UCD proper isn't "everything", extensive as it is. The UTC also maintains other significant sets of data about characters in other formats, including the data files associated with UTS #46 (IDNA-related), UTS #39 (confusables mapping, etc.), UTS #10 (collation), UTR #25 (a set of math-related property values), and UTS #51 (emoji-related). The emoji-related data has now strayed into the CLDR space, so a significant amount of the information about emoji characters is now carried as CLDR tags. And then there is various other information about individual characters (or small sets of characters) scattered in the core spec -- some in tables, some not -- as well as mappings to dozens of external standards. There is no definition anywhere of what "everything" actually is.

Further, it is a mistake to assume that every character property just associates a simple attribute with a code point. There are multiple types of mappings, complex relational and set properties, and so forth.
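To make the "not just simple attributes" point concrete, here is a minimal Python sketch. It assumes only the conventions shared by many UCD text files (semicolon-separated fields, `#` comments, `..` for code point ranges). The names `parse_ucd_lines` and `sequence_to_string` are hypothetical helpers made up for illustration, not part of any UCD tooling, and the two sample lines are written in the style of Scripts.txt and SpecialCasing.txt. One line maps a whole range to a single enumerated value; the other maps one code point to several multi-code-point sequences.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_ucd_lines(lines: Iterable[str]) -> Iterator[Tuple[int, int, List[str]]]:
    """Yield (first, last, fields) for each data line (hypothetical helper).

    Assumes '#' starts a comment, ';' separates fields, and the first
    field is either a single code point or a 'XXXX..YYYY' range.
    """
    for raw in lines:
        data = raw.split("#", 1)[0].strip()  # drop trailing comment
        if not data:
            continue  # blank or comment-only line
        fields = [f.strip() for f in data.split(";")]
        first, _, last = fields[0].partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start
        yield start, end, fields[1:]


def sequence_to_string(field: str) -> str:
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in field.split())


SAMPLE = [
    "# Scripts.txt style: a range mapped to one enumerated value",
    "0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z",
    "# SpecialCasing.txt style: one code point mapped to sequences",
    "00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S",
]

records = list(parse_ucd_lines(SAMPLE))
for start, end, fields in records:
    print(f"{start:04X}..{end:04X} -> {fields}")
```

Even in this tiny sample the two records have different shapes: the second one's uppercase mapping is a string of two characters, not a single attribute value, and a line ending in `;` yields an empty trailing field that a real parser would have to decide how to treat.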

The UTC attempts to keep a fairly clear line around what constitutes the "UCD proper" (including Unihan.zip), in part so that it is actually possible to run the tools that create the XML version of the UCD, for folks who want to consume a more consistent, single-file format version of the data. But be aware that that isn't everything -- nor would there be much sense in trying to keep expanding the UCD proper to actually represent "everything" in one giant DTD.

Second, one of the main obligations of a standards organization is *stability*. People may well object to the ad hoc nature of the UCD data files that have been added over the years -- but it is a *stable* ad-hockery. The worst thing the UTC could do, IMO, would be to keep tweaking formats of data files to meet complaints about one particular parsing inconvenience or another. That would create multiple points of discontinuity between versions -- worse than just having to deal with the ongoing growth in the number of assigned characters and the occasional addition of new data files and properties to the UCD.

Keep in mind that there is more to processing the UCD than just "latest". People who just focus on grabbing the very latest version of the UCD and updating whatever application they have are missing half the problem. There are multiple tools out there that parse and use multiple *versions* of the UCD. That includes the tooling that is used to maintain the UCD (which parses *all* versions), and the tooling that creates UCD in XML, which also parses all versions. Then there is tooling like unibook, to produce code charts, which also has to adapt to multiple versions, and bidi reference code, which also reads multiple versions of UCD data files. Those are just examples I know off the top of my head. I am sure there are many other instances out there that fit this profile. And none of the applications already built to handle multiple versions would welcome having to permanently build in tracking particular format anomalies between specific versions of the UCD.

Third, please remember that folks who come here complaining about the complications of parsing the UCD are a very small percentage of a very small percentage of a very small percentage of interested parties. Nearly everybody who needs UCD data should be consuming it as a secondary source (e.g. for reference via codepoints.net), or as a tertiary source (behind specialized APIs, regex, etc.), or as an end user (just getting behavior they expect for characters in applications). Programmers who actually *need* to consume the raw UCD data files and write parsers for them directly should be able to deal with the format complexity -- and, if anything, slowing them down to make them think about the reasons for the format complexity might be a good thing, as it tends to put the lie to the easy initial assumption that the UCD is nothing more than a bunch of simple attributes for all the code points.

--Ken
