Re: Unicode Digest, Vol 56, Issue 20

Marcel Schneider via Unicode Thu, 30 Aug 2018 16:17:32 -0700

Thank you for looking into this. First, I’m unable to retrieve the
publication you are citing, but a February thread had nearly the same
subject, referring to Vol. 50. How did you compute these figures? Or is that
a code phrase to say: “The same questions over and over again; let’s settle
this on the record, as a reference for later inquiries”?

Also, "[email protected]" doesn’t appear to be a valid e-mail address.
Would that mean that I’d better send a proposal with an enhancement request
to [email protected], rather than contribute to the topic while it is being
discussed on the Unicode Public Mail List?

OK, I’ll try to get something out of this, because many people really want
things to improve:

On 30/08/18 20:37 Doug Ewell via Unicode wrote:
> 
> UnicodeData.txt was devised long before any of the other UCD data files.

I can’t think of any era in the computer age when file headers were
uncommon, or when a parser able to process semicolons couldn’t be directed
to make sense of crosshatches (#). If releasing a headerless file was ever a
mistake, implementers should have been able to anticipate that it might be
corrected at some point. As far as I can tell, implementations have to be
updated at every single Unicode release anyway, though I admit I know little
about the arcana of frozen APIs.

> Though it might seem like a simple enhancement to us, adding a header block, 
> or even a single line,
> would break a lot of existing processes that were built long ago to parse 
> this file.

They are hopelessly outdated anyway, and most of them will have been
replaced with something better long ago. The remainder might not be worth
burdening the rest of the world with headerless files.

> So Unicode can't add a header to this file, and that is the reason the format 
> can never be changed
> (e.g. with more columns). That is why new files keep getting created instead.

I had guessed at something like that rationale, and I can also understand
that Unicode isn’t going to keep releasing headerless files for years only
to suddenly add the missing header once somebody tells them to. Also, I
didn’t really ask for that; I suggested adding yet another *new* file, not
changing the data structure of the existing UnicodeData.txt.

As for the reference, a Google search for "unicodedataextended.txt" just
brought it up:
http://www.unicode.org/review/pri297/

Having said that, I still think that while not parsing a header line in a
process is a reasonable position if the field structure is known to be
stable, not being able to *skip* a header is sort of odd.
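
To illustrate how little it takes to skip a header, here is a minimal
Python sketch. It assumes that header and comment lines start with "#", as
they do in most other UCD data files; UnicodeData.txt itself has no such
lines today, so the header in the sample is purely hypothetical, as is the
function name.

```python
def read_ucd_rows(lines, comment_char="#"):
    """Yield the fields of each data line, skipping comments and blank lines."""
    for line in lines:
        # Keep only the part before any comment marker, then trim whitespace.
        data = line.split(comment_char, 1)[0].strip()
        if not data:
            continue  # header, comment-only or empty line
        yield [field.strip() for field in data.split(";")]


# Hypothetical file content: two invented header lines, then one real
# UnicodeData.txt record (15 fields separated by 14 semicolons).
SAMPLE = """\
# UnicodeData-with-header.txt (hypothetical)
# code;name;gc;ccc;bc;...

0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
"""

rows = list(read_ucd_rows(SAMPLE.splitlines()))
print(rows[0][:3])
```

A parser written this way reads the headerless file and a file with a
header in exactly the same manner, since lines without data are dropped
before any field is examined.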

> The XML format could indeed be expanded with more attributes and more 
> subsections.
> Any process that can parse XML can handle unknown stuff like this without 
> misinterpreting
> the stuff it does know.

Agreed. I’m not questioning XML. But I’m using spreadsheets. I don’t know
how many computer scientists use spreadsheets. Perhaps there aren’t many of
us looking up UnicodeData.txt that way (I use it in raw text, too, and I
look up ucd.nounihan.flat.xml). Generating code in a spreadsheet is
considered quick-and-dirty. I don’t agree it’s dirty, but it’s quick.

Above all, doing certain kinds of research in spreadsheets appears to be
the most efficient way to check whether character properties match
character identity. Using spreadsheet software is trivial, so it may be
looked down on and left to non-scientists, yet it is closer to human
experience and makes it possible to do research in next to no time by
adding columns, filters and formulae, where one would probably spend weeks
coding the same thing in C, Lisp, Perl or Python (which I cannot do, so I’m
biased).

> That's why the only two reasonable options for getting UCD data are to read 
> all the tab- and semicolon-delimited files,
> and be ready for new files, or just read the XML. Asking for changes to 
> existing UCD file formats is kind of a non-starter,
> given these two alternatives.

Given the above, one can easily understand why I do not agree with being
limited to these two alternatives.

If a process can be updated to grab a newly added small file from the UCD,
it can just as well be updated to skip file comments, and even to parse a
new *large* file from the UCD.

Conversely, since Unicode is ready to add new small semicolon-delimited
files, it might also wish to add a new *large* semicolon-delimited file to
the UCD. That file would have a file header and a header line, and would be
specified as flexible. It might have one hundred fields per line, delimited
by 99 semicolons. The resulting 5 million or so semicolons would still be
more lightweight than 5 million attribute names plus the XML tags.
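
The size argument can be made concrete with a small Python sketch that
serializes the same record both ways. It uses only five properties of
U+0041, keyed by what are assumed to be the short attribute names of the
UCD XML format; the helper names are invented for this illustration, and
the actual saving over a whole file depends on how many fields are included.

```python
# One character's properties, keyed by UCD XML attribute names.
# Only five properties are shown; a real record has many more.
RECORD = {
    "cp": "0041",
    "na": "LATIN CAPITAL LETTER A",
    "gc": "Lu",
    "ccc": "0",
    "bc": "L",
}


def as_semicolon_row(record):
    """Join the values only; the field names live once, in a header line."""
    return ";".join(record.values())


def as_xml_element(record):
    """Repeat every attribute name next to its value, as XML requires."""
    attrs = " ".join(f'{name}="{value}"' for name, value in record.items())
    return f"<char {attrs}/>"


row = as_semicolon_row(RECORD)
element = as_xml_element(RECORD)
print(len(row), row)
print(len(element), element)
```

In the delimited form each field costs one separator character, whereas
the XML form pays for the attribute name, an equals sign and two quotation
marks on every value of every line.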

The added value is that people using spreadsheets have a handy file to
import, rather than each individual having to convert a large XML file to a
large CSV file, for lack of the latter being readily provided by Unicode.
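
For reference, the conversion that each spreadsheet user currently has to
perform might look like the following Python sketch. It assumes the flat
UCD XML layout, with one "char" element per code point in the namespace
shown, and it runs on a two-character inline sample rather than on the real
ucd.nounihan.flat.xml; the function name and the choice of columns are
illustrative only.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace assumed for the UCD XML files.
UCD_NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

# Tiny stand-in for the real flat XML file.
SAMPLE_XML = """\
<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
 <repertoire>
  <char cp="0041" na="LATIN CAPITAL LETTER A" gc="Lu"/>
  <char cp="0061" na="LATIN SMALL LETTER A" gc="Ll"/>
 </repertoire>
</ucd>"""


def xml_to_semicolon(xml_text, columns):
    """Return semicolon-delimited text with a header line, one row per char."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, delimiter=";", lineterminator="\n")
    writer.writerow(columns)  # header line naming the fields
    for char in root.iter(UCD_NS + "char"):
        # Attributes absent from an element become empty fields.
        writer.writerow([char.get(col, "") for col in columns])
    return out.getvalue()


table = xml_to_semicolon(SAMPLE_XML, ["cp", "na", "gc"])
print(table)
```

For the full file, a streaming parser such as ET.iterparse would probably be
preferable to loading the whole document into memory at once.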

If this discussion meets with a positive response, I or somebody else may
submit an appropriate proposal. But I’d prefer not to repeat the mistake of
failing to discuss a topic on Unicode Public before submitting a proposal,
which was then kindly put on the agenda but discussed unfavorably and
dismissed twice at UTC meetings. And guess why I didn’t want a preliminary
discussion here? Because I was naively afraid that the mistakes thus
revealed could reflect badly on some people.

It turned out that nothing reflects badly on anybody.

(So UnicodeData.txt could as well get its missing header BTW.)

Regards,

Marcel
