On 31/08/18 07:27 Janusz S. Bień via Unicode wrote:
[…]
> > Given NamesList.txt / Code Charts comments are kept minimal by design,
> > one couldn’t simply pop them into XML or whatever, as the result would be
> > disappointing and call for completion in the aftermath. Yet another task
> > competing with CLDR survey.
>
> Please elaborate. It's not clear for me what do you mean.
These comments are designed for the Code Charts and as such must not be
disproportionately exhaustive. E.g. we have lists of related languages
ending in an ellipsis. Once this is popped into XML, i.e. extracted from
NamesList.txt and fed into an extensible, unconstrained format (with no
constraint as to available space, number and length of comments, and so
on), any gap is felt as discriminating neglect, and there will be a huge
rush to add data. Yet Unicode hasn’t set up products where that data could
be published: not the Code Charts (for the abovementioned reason), nor
ICU, insofar as the additional information does not match any known
demand on the user side (localizing software does not mean providing
scholarly, exhaustive information about supported characters). The use
would be in character pickers providing all available information about a
given character. That is why Unicode should prioritize CLDR for CLDR
users over extra information for the web.

>
> > Reviewing CLDR data is IMO top priority.
> > There are many flaws to be fixed in many languages including in English.
> > A lot of useful digest charts are extracted from XML there,
>
> Which XML? where?

More precisely it is LDML, the CLDR-specific XML. What I called “digest
charts” are the charts found here:
http://www.unicode.org/cldr/charts/34/

The access is via this page:
http://cldr.unicode.org/index/downloads
where the charts are in the Charts column, while the raw data is under
SVN Tag.

>
> > and we really
> > need to go through the data and correct the many many errors, please.
>
> Some time ago I tried to have a close look at the Polish locale and
> found the CLDR site prohibitively confusing.

I experienced some trouble too, mainly because "SVN Tag" is
counter-intuitive as the way to reach the XML data (unless one knows
about SubVersioN).
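Since the raw data is LDML, i.e. plain XML, it can also be read directly
rather than through the charts. What follows is only a minimal sketch in
Python: the fragment SAMPLE_LDML is made up for illustration (it is not
copied from any real CLDR locale file, and "xx" is a placeholder language
code), and the helper names exemplar_sets and naive_members are
hypothetical. The only thing taken from LDML itself is the
exemplarCharacters element with its optional type attribute. The set
splitting is deliberately naive and would not cope with the ranges,
escapes and {strings} that real exemplar sets may contain.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical LDML fragment, for illustration only: NOT real CLDR data.
# Non-ASCII characters are written as \u escapes to keep the source ASCII.
SAMPLE_LDML = """<ldml>
  <identity><language type="xx"/></identity>
  <characters>
    <exemplarCharacters>[a \u0105 b c \u0107]</exemplarCharacters>
    <exemplarCharacters type="punctuation">[, ; ! ? . \u2026 \u201d \u201e ( )]</exemplarCharacters>
  </characters>
</ldml>"""


def exemplar_sets(ldml_text):
    """Map each exemplarCharacters type to its raw set text.

    The element without a type attribute is stored under the key 'main'.
    """
    root = ET.fromstring(ldml_text)
    return {
        el.get("type", "main"): (el.text or "")
        for el in root.iter("exemplarCharacters")
    }


def naive_members(unicode_set):
    """Split a simple set such as '[a b c]' on whitespace.

    Assumption: no ranges, escapes or {multi-character strings}.
    """
    return set(unicode_set.strip().strip("[]").split())


# List every character of every set with its code point and name.
for set_type, raw in exemplar_sets(SAMPLE_LDML).items():
    print(set_type)
    for ch in sorted(naive_members(raw)):
        print(f"  U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```

Run against a real locale file, a listing of this kind would show the
same thing as the By-Type charts, only for one locale at a time.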
Polish data is found here:
https://www.unicode.org/cldr/charts/34/summary/pl.html

The access is via the top of the "Summary" index page (showing root data):
https://www.unicode.org/cldr/charts/34/summary/root.html

You may wish to check the By-Type charts in particular:
https://www.unicode.org/cldr/charts/34/by_type/index.html

Here I’d suggest focusing first on alphabetic information and on
punctuation:
https://www.unicode.org/cldr/charts/34/by_type/core_data.alphabetic_information.punctuation.html

Under Latin (table caption, without anchor) we find out what punctuation
Polish has compared to other locales using the same script. The exact
character appears when hovering over the header row. E.g. U+2011
NON-BREAKING HYPHEN is systematically missing, which is an error in
almost every locale that uses the hyphen. The TC is about to correct that.

Further you will see that while Polish uses the apostrophe
https://slowodnia.tumblr.com/post/136492530255/the-use-of-apostrophe-in-polish
CLDR does not have the correct apostrophe for Polish, as opposed e.g. to
French. You may wish to note that from now on, both U+0027 APOSTROPHE and
U+0022 QUOTATION MARK are ruled out in almost all locales, given that the
preferred characters in publishing are U+2019 and, for Polish, the U+201E
and U+201D that are already found in CLDR pl.

Note however that according to the information provided by English
Wikipedia:
https://en.wikipedia.org/wiki/Quotation_mark#Polish
Polish also uses single quotes, which by contrast are still missing in
CLDR.

Now you might understand what I meant when pointing out that there are
still many errors in many languages in CLDR, including in English.

Best regards,

Marcel

>
> Best regards
>
> Janusz
>
> --
> ,
> Janusz S. Bien
> emeryt (emeritus)
> https://sites.google.com/view/jsbien
>
>