UCD in XML or in CSV? (was: Re: Unicode Digest, Vol 56, Issue 20)

Marcel Schneider via Unicode Thu, 30 Aug 2018 22:01:57 -0700

On 30/08/18 23:34 Philippe Verdy via Unicode wrote:
>
> Well, an alternative to XML is JSON, which is more compact and faster/simpler
> to process;


Thanks for pointing out the problem and the solution alike. Indeed, the main
drawback of the XML format of UCD is that it results in an “insane” file size.
“Insane” was applied to the number of semicolons in UnicodeData.txt, but that
is irrelevant. What is really insane is the file size of the XML versions of
the UCD. Even without Unihan, such a file may take up to a minute or so to
load in a text editor.

> however JSON has no explicit schema, unless the schema is made part of the
> data itself, complicating its structure (with many levels of arrays of
> arrays, in which case it becomes less easy for humans to read, but better
> suited to automated processes for fast processing).
>
> I'd say that the XML alone is enough to generate any JSON-derived dataset
> that will conform to the schema an application expects to process fast
> (and with just the data it can process, excluding various extensions still
> not implemented).
> But the fastest implementations are also based on data tables encoded in code
> (such as DLLs or Java classes), or custom database formats (such as Berkeley
> DB), also generated automatically from the XML, without the processing cost
> of decompression schemes and parsers.
>
> Still today, even if XML is not the usual format used by applications, it is
> still the most interoperable format that allows building all sorts of
> applications in all sorts of languages: the cost of parsing is left to an
> application builder/compiler.
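
The derivation described in the quoted paragraph (XML as the source, a smaller
JSON dataset as the product) might look roughly like the following Python
sketch. It assumes the namespace given in UAX #42 and the short attribute names
na, gc and sc. The WANTED selection and the inline SAMPLE are purely
illustrative stand-ins for a real application's needs and for
ucd.nounihan.flat.xml.

```python
import io
import json
import xml.etree.ElementTree as ET

# Namespace of the UCD XML files, as given in UAX #42.
UCD_NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

# Hypothetical selection: the properties one particular application needs.
WANTED = ("na", "gc", "sc")

# Tiny made-up stand-in for ucd.nounihan.flat.xml so the sketch runs on its own.
SAMPLE = b"""<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
 <repertoire>
  <char cp="0041" na="LATIN CAPITAL LETTER A" gc="Lu" sc="Latn" jg="No_Joining_Group"/>
  <char cp="0020" na="SPACE" gc="Zs" sc="Zyyy" jg="No_Joining_Group"/>
 </repertoire>
</ucd>"""


def chars_to_json(source, wanted=WANTED):
    """Stream <char> elements and keep only the wanted attributes."""
    table = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        # Elements describing ranges carry first-cp/last-cp instead of cp;
        # this sketch simply skips them.
        if elem.tag == UCD_NS + "char" and "cp" in elem.attrib:
            table[elem.get("cp")] = {
                name: elem.get(name) for name in wanted if name in elem.attrib
            }
        elem.clear()  # drop attributes and children once the element is handled
    return json.dumps(table, sort_keys=True)


result = chars_to_json(io.BytesIO(SAMPLE))
```

Because iterparse reads the input incrementally, the same function can be
pointed at a file name instead of the in-memory sample, which avoids loading
the whole document into an editor or a browser.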

I’ve tried an online tool to get ucd.nounihan.flat.xml converted to CSV. The
tool is great and offers a lot of options, but given the “insane” file size,
my browser was up for over two hours of trouble until I shut down the computer
manually. From what I could see in the result field, there are many bogus
values, meaning that their presence is useless in the tags of most characters.
And while many attributes have cryptic names in order to keep the file size
minimal, some attributes have overlong values, i.e. the design is inconsistent.
E.g. in every character we read:
jg="No_Joining_Group"
That is bogus. One would need to take them off the tags of most characters,
and even in the characters where they are relevant, the value would be simply
"No". What’s the use of abbreviating "Joining Group" to "jg" in the attribute
name if in the value it is written out?
And I’m quoting from U+0000.
Further, many values are set to a crosshatch (#), instead of simply being
removed from the characters where they are empty. Then there are the many
instances of "undetermined script" resulting in *two* attributes with the
"Zyyy" value. Then in almost every character we’re told that it is not a
whitespace, not a dash, not a hyphen, and not a quotation mark:
Dash="N" WSpace="N" Hyphen="N" QMark="N"
One couldn’t tell that UCD actually benefits from the flexibility of XML,
given that many attributes are systematically present even where they are
useless.
Perhaps ucd-*.xml would be two thirds, half, or one third of their current
size if they were properly designed.
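
To get a feel for how much the always-present attributes cost, here is a small
Python sketch that strips attributes carrying an assumed default value from one
<char> element and compares lengths. The ASSUMED_DEFAULTS table is built only
from the examples quoted above, not from the UCD documentation, and the sample
element is abridged and made up for illustration. A consumer of such a slimmed
file would have to know the same defaults in order to restore them.

```python
import xml.etree.ElementTree as ET

# Assumed defaults, taken only from the examples in this message.
ASSUMED_DEFAULTS = {
    "jg": "No_Joining_Group",
    "Dash": "N",
    "WSpace": "N",
    "Hyphen": "N",
    "QMark": "N",
}

# Abridged, made-up <char> element in the style of the flat XML file.
SAMPLE_CHAR = (
    '<char cp="0021" gc="Po" sc="Zyyy" scx="Zyyy" jg="No_Joining_Group" '
    'Dash="N" WSpace="N" Hyphen="N" QMark="N"/>'
)


def strip_defaults(char_xml, defaults=ASSUMED_DEFAULTS):
    """Return the element with default-valued attributes removed."""
    elem = ET.fromstring(char_xml)
    for name, value in defaults.items():
        if elem.get(name) == value:
            del elem.attrib[name]
    # Assumption: scx can be omitted when it merely repeats sc.
    if elem.get("scx") is not None and elem.get("scx") == elem.get("sc"):
        del elem.attrib["scx"]
    return ET.tostring(elem, encoding="unicode")


slim = strip_defaults(SAMPLE_CHAR)
ratio = len(slim) / len(SAMPLE_CHAR)  # fraction of the original length kept
```

Running the same function over every <char> element of a real file and summing
the lengths would give an actual figure instead of a guess.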

> Some apps embed the compilers themselves and use a stored cache for faster
> processing: this approach allows easy updates by detecting changes in the
> XML source, and then downloading them.
>
> But in CLDR such updates are generally not automated: the general scheme
> evolves over time and there are complex dependencies to check so that some
> data becomes usable

Should probably read *un*usable.

> (frequently you need to implement some new algorithms to follow the
> processing rules documented in CLDR, or to use data not completely
> validated, or to allow applications to provide their overrides from
> insufficiently complete datasets in CLDR, even if CLDR provides a root
> locale and applications are supposed to follow the BCP47 fallback
> resolution rules;
> applications also have their own needs about which language codes they use
> or need, and CLDR provides many locales that many applications are still not
> prepared to render correctly, and many application users complain if an
> application is partly translated and contains too many fallbacks to another
> language, or worse to another script).

So the case is even worse than what I could see when looking into CLDR. Many
countries, including France, don’t care about the data of their own locale in
CLDR, but I’m not going to vent about that on Unicode Public, because that
involves language offices and authorities, and would have political
entanglements.

Staying technical, what I can say so far about the file header of
UnicodeData.txt is that I can see zero technical reasons not to add it.
Processes using the file to generate an overview of Unicode also use other
files and are thus able to process comments correctly, whereas those processes
using UnicodeData to look up character properties provided in the file would
start by searching for the code point. (Perhaps there are compilers building
DLLs from the file.)
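
As an illustration of a reader that would not be disturbed by a header, here is
a minimal Python sketch that skips lines starting with "#", the comment
delimiter used in the other UCD files, before splitting on semicolons. The
header line in SAMPLE is hypothetical, since the real UnicodeData.txt has none,
and the function name is invented for this example.

```python
SAMPLE = """\
# UnicodeData-11.0.0.txt (hypothetical header; the real file has none)
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
"""


def parse_unicode_data(lines):
    """Yield (code point, name, general category) per data line."""
    for line in lines:
        line = line.strip()
        # Skip blank lines and comment lines instead of choking on them.
        if not line or line.startswith("#"):
            continue
        fields = line.split(";")
        yield int(fields[0], 16), fields[1], fields[2]


rows = list(parse_unicode_data(SAMPLE.splitlines()))
```

A parser that instead assumes every line is a data record would fail on the
first line when converting it to a code point, which is the kind of breakage
the objection to a header is about.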

On Thu, 30 Aug 2018 at 20:38, Doug Ewell via Unicode wrote:

UnicodeData.txt was devised long before any of the other UCD data files. Though
it might seem like a simple enhancement to us, adding a header block, or even a
single line, would break a lot of existing processes that were built long ago
to parse this file.

So Unicode can't add a header to this file, and that is the reason the format
can never be changed (e.g. with more columns). That is why new files keep
getting created instead.

The XML format could indeed be expanded with more attributes and more
subsections. Any process that can parse XML can handle unknown stuff like this
without misinterpreting the stuff it does know.

That's why the only two reasonable options for getting UCD data are to read all
the tab- and semicolon-delimited files, and be ready for new files, or just
read the XML. Asking for changes to existing UCD file formats is kind of a
non-starter, given these two alternatives.

--
Doug Ewell | Thornton, CO, US | ewellic.org


-------- Original message --------
Message: 3

Date: Thu, 30 Aug 2018 02:27:33 +0200 (CEST)
From: Marcel Schneider via Unicode

Curiously, UnicodeData.txt is lacking the header line. That makes it inflexible.
I never wondered why the header line is missing, probably because compared
to the other UCD files, the file looks really odd without a file header showing
at least the version number and datestamp. It’s like the file was made up for
dumb parsers unable to handle comment delimiters, and never to be upgraded
to do so.

But I like the format, and that’s why at some point I submitted feedback asking
for an extension. [...]
