Re: per-character "stories" in a database (derives from Re: geometric shapes)

William Overington Fri, 14 Mar 2003 09:09:53 -0800

Markus Scherer wrote as follows.

quote


It has been suggested many times to build a database (list, document, XML,
...) where each designated/assigned code point and each character gets its
"story": Comments on the glyphs, from what codepage it was inherited, usage
comments and examples, alternate names, etc.

I am talking about both code points and "characters" on purpose, and I
would go a step beyond documenting what's there. All the "characters" that
can be represented by a sequence of assigned Unicode characters should be
listed, with that sequence (or those sequences), and with further
explanation if necessary.

end quote

Yes, that is a very good point.  I have become interested in the languages
of the Indian subcontinent from the standpoint of trying to ensure that they
can be displayed properly on interactive television using portable font
technology.  However, I am not a linguist, and I find it strange that the
Unicode Standard does not codify the ligatures which can be produced at
display time, for the languages of the Indian subcontinent, from specific
sequences of regular Unicode characters, so that someone skilled in the art
of font design could design a font from the code charts.
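To make the idea of listing a "character" together with its code point
sequence concrete, here is a minimal sketch in Python.  The class name
SequenceStory and its fields are hypothetical, chosen only for illustration;
nothing like this is defined by the Unicode Standard.  The example entry is
the Devanagari sequence KA + VIRAMA + SSA, which fonts commonly display as a
single conjunct glyph.

```python
import unicodedata
from dataclasses import dataclass, field


@dataclass
class SequenceStory:
    """Hypothetical record: one 'character' and the sequence that represents it."""
    code_points: tuple          # Unicode scalar values, in order
    description: str            # free-text "story" for the sequence
    notes: list = field(default_factory=list)  # later comments from reviewers

    def names(self):
        # Look up the formal character names from Python's built-in Unicode data.
        return [unicodedata.name(chr(cp)) for cp in self.code_points]

    def text(self):
        # The sequence as a string, ready to hand to a rendering system.
        return "".join(chr(cp) for cp in self.code_points)


# Illustrative entry: Devanagari KA + VIRAMA + SSA.
ksha = SequenceStory(
    code_points=(0x0915, 0x094D, 0x0937),
    description="Devanagari conjunct usually drawn as one ligature glyph",
)

for cp, name in zip(ksha.code_points, ksha.names()):
    print(f"U+{cp:04X} {name}")
```

A font designer could read such an entry as "this sequence needs a glyph",
which is the kind of information the code charts alone do not give.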

Later he wrote.

quote

Now we just need to
- find someone to sponsor this effort technically and with humanpower
- squeeze the existing information out of the standard, the mailing lists,
FAQs, and of course out of the Unicode veterans before they retire by
Unicode 6...

end quote

Well, how about an approach like the one Project Gutenberg uses for
proofreading transcripts of classic books?  If there were a database where
people could post items about particular characters, and other people could
read them and either confirm what is said, put some other view, or just add
some other information, then maybe the database could just sort of gradually
become generated over a period of years.  How big would that be?  About 100
thousand code points at, say, 200 words each on average, at about 5 or 6
characters per word on average plus a space following each word, would be
about 130 megabytes in total.  I fully realize that the phrase "sort of
gradually" might easily be quoted in a response to this posting, yet if the
database facility were there, accessible directly from the web, there may
well be many people who would stop by for a while, review what has been
entered and add a little more to the database.
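As a rough check on the size figure above, the arithmetic can be sketched in
Python.  The function name estimate_bytes is made up for this illustration,
and the calculation assumes one byte per character (plain English text in a
single-byte encoding), which the posting does not state.

```python
def estimate_bytes(code_points, words_per_entry, avg_word_len, bytes_per_char=1):
    """Estimate total text size; each word is followed by one space."""
    chars_per_word = avg_word_len + 1  # the word plus its trailing space
    return code_points * words_per_entry * chars_per_word * bytes_per_char


# Figures from the posting: 100 thousand code points, 200 words each,
# "about 5 or 6" characters per word, taken here as 5.5.
low = estimate_bytes(100_000, 200, 5)
mid = estimate_bytes(100_000, 200, 5.5)
high = estimate_bytes(100_000, 200, 6)

for label, size in (("low", low), ("mid", mid), ("high", high)):
    print(f"{label}: {size / 1_000_000:.0f} megabytes")
```

With 5.5 characters per word the total comes to 130 million bytes, and the
range for 5 to 6 characters is 120 to 140 million bytes, so the figure of
about 130 megabytes holds under these assumptions.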

>PS: Sorry, I am not in a position to volunteer...

Well, it could be more of an informal thing.  If the facility were set up,
then people who are interested could simply visit the web site when they
felt like participating.  Certainly there might be a core of people who had
the ability to throw out rubbish and to convert fragments of text into a
good English narrative so that there was some overall structure to it all,
yet it does not necessarily need to be as formal and rigid as if it were a
commercial project with a time deadline, particularly if the alternative is
that it does not get done at all.

William Overington

14 March 2003
