Re: UCD in XML or in CSV?

Ken Whistler via Unicode Fri, 31 Aug 2018 10:54:01 -0700



On 8/31/2018 1:36 AM, Manuel Strehl via Unicode wrote:

For codepoints.net I use that data to stuff everything in a MySQL
database.


Well, for some sense of "everything", anyway. ;-)

People having this discussion should keep in mind a few significant points.

First, the UCD proper isn't "everything", extensive as it is. There are also other significant sets of data that the UTC maintains about characters in other formats, as well, including the data files associated with UTS #46 (IDNA-related), UTS #39 (confusables mapping, etc.), UTS #10 (collation), UTR #25 (a set of math-related property values), and UTS #51 (emoji-related). The emoji-related data has now strayed into the CLDR space, so a significant amount of the information about emoji characters is now carried as CLDR tags. And then there is various other information about individual characters (or small sets of characters) scattered in the core spec -- some in tables, some not, as well as mappings to dozens of external standards. There is no actual definition anywhere of what "everything" actually is. Further, it is a mistake to assume that every character property just associates a simple attribute with a code point. There are multiple types of mappings, complex relational and set properties, and so forth.
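To make that last point concrete, here is a minimal Python sketch of a property that is not a single attribute per code point: case mappings that expand to a sequence of code points and may carry a context condition. The sample lines follow the field layout of SpecialCasing.txt (code; lower; title; upper; optional condition list). The function name `parse_special_casing` and the dict keys are hypothetical, chosen only for illustration.

```python
# Illustrative lines in the style of SpecialCasing.txt:
#   code; lower; title; upper; (optional condition list;) # comment
SAMPLE = """\
00DF; 00DF; 0053 0073; 0053 0053; # LATIN SMALL LETTER SHARP S
03A3; 03C2; 03A3; 03A3; Final_Sigma; # GREEK CAPITAL LETTER SIGMA
"""


def _to_str(hex_sequence):
    """Turn a space-separated list of hex code points into a string."""
    return "".join(chr(int(cp, 16)) for cp in hex_sequence.split())


def parse_special_casing(text):
    """Hypothetical helper: yield one dict per mapping line."""
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comment
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        code, lower, title, upper = fields[:4]
        # A fifth non-empty field, if present, lists context conditions.
        conditions = fields[4].split() if len(fields) > 4 and fields[4] else []
        yield {
            "char": _to_str(code),
            "lower": _to_str(lower),
            "title": _to_str(title),
            "upper": _to_str(upper),
            "conditions": conditions,
        }


entries = list(parse_special_casing(SAMPLE))
```

The first entry maps one code point to a two-character string for both title case and upper case, and the second only applies when its condition holds, so neither fits a flat "code point, value" table.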

The UTC attempts to keep a fairly clear line around what constitutes the "UCD proper" (including Unihan.zip), in part so that it is actually possible to run the tools that create the XML version of the UCD, for folks who want to consume a more consistent, single-file format version of the data. But be aware that that isn't everything -- nor would there be much sense in trying to keep expanding the UCD proper to actually represent "everything" in one giant DTD.

Second, one of the main obligations of a standards organization is *stability*. People may well object to the ad hoc nature of the UCD data files that have been added over the years -- but it is a *stable* ad-hockery. The worst thing the UTC could do, IMO, would be to keep tweaking formats of data files to meet complaints about one particular parsing inconvenience or another. That would create multiple points of discontinuity between versions -- worse than just having to deal with the ongoing growth in the number of assigned characters and the occasional addition of new data files and properties to the UCD.

Keep in mind that there is more to processing the UCD than just "latest". People who just focus on grabbing the very latest version of the UCD and updating whatever application they have are missing half the problem. There are multiple tools out there that parse and use multiple *versions* of the UCD. That includes the tooling that is used to maintain the UCD (which parses *all* versions), and the tooling that creates UCD in XML, which also parses all versions. Then there is tooling like unibook, to produce code charts, which also has to adapt to multiple versions, and bidi reference code, which also reads multiple versions of UCD data files. Those are just examples I know off the top of my head. I am sure there are many other instances out there that fit this profile. And none of the applications already built to handle multiple versions would welcome having to permanently build in tracking particular format anomalies between specific versions of the UCD.

Third, please remember that folks who come here complaining about the complications of parsing the UCD are a very small percentage of a very small percentage of a very small percentage of interested parties. Nearly everybody who needs UCD data should be consuming it as a secondary source (e.g. for reference via codepoints.net), or as a tertiary source (behind specialized API's, regex, etc.), or as an end user (just getting behavior they expect for characters in applications). Programmers who actually *need* to consume the raw UCD data files and write parsers for them directly should actually be able to deal with the format complexity -- and, if anything, slowing them down to make them think about the reasons for the format complexity might be a good thing, as it tends to put the lie to the easy initial assumption that the UCD is nothing more than a bunch of simple attributes for all the code points.
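For readers who do fall into that small group, the following is a rough Python sketch of handling one common UCD file shape: semicolon-separated fields, `#` comments, and code point ranges written as `XXXX..YYYY`. The sample lines follow the layout of Scripts.txt. The names `parse_ranges` and `lookup` are hypothetical, and the "Unknown" default for unlisted code points is an assumption made explicit in the code, not something a parser gets for free.

```python
import bisect

# Illustrative lines in the style of Scripts.txt: "range ; value # comment"
SAMPLE = """\
# A comment line, ignored by the parser
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
00AA          ; Latin # Lo       FEMININE ORDINAL INDICATOR
0370..0373    ; Greek # L&   [4] GREEK CAPITAL LETTER HETA..GREEK SMALL LETTER ARCHAIC SAMPI
"""


def parse_ranges(text):
    """Hypothetical helper: return sorted (start, end, value) tuples."""
    ranges = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        span, value = (part.strip() for part in line.split(";", 1))
        first, _, last = span.partition("..")
        start = int(first, 16)
        end = int(last, 16) if last else start  # single code point line
        ranges.append((start, end, value))
    ranges.sort()
    return ranges


def lookup(ranges, ch, default="Unknown"):
    """Binary-search the ranges; `default` is an assumed fallback value."""
    cp = ord(ch)
    starts = [start for start, _, _ in ranges]
    i = bisect.bisect_right(starts, cp) - 1
    if i >= 0 and ranges[i][0] <= cp <= ranges[i][1]:
        return ranges[i][2]
    return default


ranges = parse_ranges(SAMPLE)
```

Even this simple case has to deal with ranges, single code point lines, comments, and a default for everything not listed, and it covers only one of the several file shapes in the UCD.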


--Ken
