RE: current version of unicode-font

2004-12-03 Thread Andrew C. West
On Fri, 03 Dec 2004 15:10:37 +0200, "Cristian Secară" wrote:
> 
> On Thu, 2 Dec 2004 07:51:42 -0800, Peter Constable wrote:
> 
> > Microsoft has never used the label 'OpenFont' for this or any of the
> > fonts that ship with their products.
> 
> However, the .ttf fonts that ship with their products are showing an OT
> icon. I don't know how it's done technically.

Because Arial Unicode MS includes OpenType tables, and is thus technically an
OpenType font.

OpenType is a font technology, and does not imply that any such font is free and
open source -- since when did "open" ever mean "free" anyway ?

The original term mentioned in this thread, "OpenFont", does not exist (other
than as a programming method in certain APIs), and so, as Peter rightly pointed
out, OpenFont does not apply to Arial Unicode MS or any Microsoft font or any
font that I know of.

Andrew




Re: Nicest UTF

2004-12-03 Thread Andrew C. West
On Thu, 2 Dec 2004 21:56:28 -0800, "Doug Ewell" wrote:
> 
> This thread amuses me.
> 

Me too, but then most threads on this list do ;)

> 
> I also think that as more and more Han characters are encoded in the
> supplementary space, corresponding to the ever-growing repertoires of
> Eastern standards, the story that UTF-16 is virtually a fixed-width
> encoding because "supplementary code points are very rare in most text"
> will gradually go away.
> 

More and more mostly very obscure and rarely used Han ideographs. It does not
matter how many tens of thousands of additional CJK ideographs you add to the
supplementary planes, the vast majority of CJK users will still get by quite
happily with only CJK and CJK-A, which, as they are inherited from the important
legacy CJK encoding standards, are what most CJK users have been living with for
many years now. Of course people on this list, such as Richard Cook and myself,
find endless use for obscure and archaic ideographs, but in writing day-to-day
Chinese/Japanese/Korean there is no need to resort to CJK-B or CJK-C, except for
certain idiosyncratic (U+24B62 CEI4 is my personal favourite) or dialectal
usages, which are not typical.
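(As an aside, supra-BMP code points such as U+24B62 are precisely the ones that force UTF-16 to use surrogate pairs. A minimal Python sketch of the standard UTF-16 encoding arithmetic, for anyone who wants to check the numbers:)

```python
def to_surrogate_pair(cp):
    """Split a supplementary code point (U+10000..U+10FFFF) into a
    UTF-16 high/low surrogate pair, per the UTF-16 encoding form."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)

high, low = to_surrogate_pair(0x24B62)
print(f"U+24B62 -> {high:04X} {low:04X}")  # U+24B62 -> D852 DF62
```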

Now that the number of allocated characters in planes 1, 2 and 14 (45,718
characters) is only a little less than the number of allocated characters in the BMP
(57,129), and will soon be greater, it is of course ridiculous to claim that
Unicode is basically a standard for 16-bit characters, but despite the large
number of supra-BMP characters they are, by definition, rarely used, and IMHO it
will remain true that "supplementary code points are very rare in most text". 
That is not to say that I think that it is OK for people to be lazy, and just
ignore everything outside the BMP. I strongly agree that all Unicode
implementations should cover all of Unicode, and not just the BMP, and it really
annoys me when they don't; but suggesting that you need to implement supra-BMP
characters because they are going to start popping up all over the place is
wrong in my opinion (not that Doug suggested that, but that's my extrapolation
of his point). Software developers need to implement supra-BMP characters
because some users (probably very few) will from time to time want to use them,
and software should allow people to do what they want.
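(The implementer's trap can be seen in miniature: code that counts UTF-16 code units will miscount characters the moment a supra-BMP character appears. A small Python illustration; Python 3 strings count code points, so the UTF-16 view has to be computed explicitly:)

```python
s = "A\U00024B62B"  # 'A', a CJK Extension B ideograph, 'B'

code_points = len(s)                           # Python counts code points
utf16_units = len(s.encode("utf-16-be")) // 2  # 2 bytes per UTF-16 code unit

print(code_points, utf16_units)  # 3 4
```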

Andrew



Re: current version of unicode-font

2004-12-02 Thread Andrew C. West
On Fri, 03 Dec 2004 00:38:25 +0700, Paul Hastings wrote:
> 
> John Cowan wrote:
> 
> > Googling for "free Unicode fonts" (no quotes) is useful.
> 
> sort of, when i've googled for this in the past, language-specific 
> (chinese seemed to be the most frequent) fonts turn up more often than 
> not. hey if you guys don't know, who does?
> 

As someone once said, Google is your friend, but if you don't have time to
google for yourself, these (and many other similar pages) may give you some useful
pointers :

http://www.alanwood.net/unicode/fonts.html
http://www.babelstone.co.uk/Fonts/Fonts.html

Andrew



Re: official languages of ISO / IEC (CIE)

2004-11-09 Thread Andrew C. West
On Mon, 8 Nov 2004 15:13:21 -0800 (PST), "E. Keown" wrote:
> 
> At the U.N. and in some countries, they have 'official
> languages.'  The U.N. has 5, I think. Singapore has 4,
> several African countries have 2-3, and so forth.  
> 
> Does either the ISO or the IEC have official
> languages?  Whether official or not, is French the
> 'second language' of the standards world?
> 
> And also, is there a bilingual or trilingual standards
> glossary?  
> 
> A glossary is a small-ish dictionary, frequently
> focused on a narrow topic.  
> 
> I'm about to translate something into technical
> French. I still didn't purchase a technical French
> dictionary because the ones I've seen didn't have
> enough computer terminology.  
> 
> Thanks Elaine
> 

If the document you are translating has anything to do with Unicode or character
encoding then you may find the "Unicode et ISO 10646 en français" site very
useful :

http://iquebec.ifrance.com/hapax/

This site comprises a French translation of the Unicode standard, including the
Unicode glossary, as well as a list of the official French versions of the
ISO/IEC 10646 character names.

You may also be interested to know that some recent character proposals for
ISO/IEC 10646 have been written in both French and English (e.g. N2739 for
Tifinagh and N2765 for N’Ko).

Andrew




Re: Public Review Issues Update

2004-10-22 Thread Andrew C. West
On Thu, 21 Oct 2004 12:06:23 -0700 (PDT), Kenneth Whistler wrote:
> 
> > Mark Davis wrote:
> > > All comments are reviewed at the next UTC meeting. Due to the volume, we
> > > don't reply to each and every one what the disposition was. If actions were
> > > taken, they are recorded in the minutes of the meetings.
> > 
> 
> Instead of expecting a bureaucratic response, as if from a
> governmental organization staffed up with clerks whose job it
> is to track this kind of stuff, a *practical* approach would be
> to:
> 
>   A. Check the public minutes when they become available.
>   

In the case of UTC 99, which was held June 15-18 2004, the minutes were not made
public until last week, which meant that we had to wait almost four months to
find out what happened.

>   B. Check the disposition of a Public Review Issue on the
>  website, when it becomes available.
>  

Or if you're interested in a character proposal check the Pipeline page and the
Rejected Characters and Scripts page, which are usually updated well before the
minutes become available.

>   C. If neither of those seems to have explicitly addressed some
>  item that you provided feedback on, then contact (offlist)
>  someone who did attend the meeting in question, and see
>  if they have information about the item in question.
>  

I agree that Ken, Asmus, Rick etc. are very helpful, and are happy to discuss
privately issues submitted to the UTC for consideration, but the last thing I
want to do is bombard them with "What happened to my comments/proposal ?"
questions.

I tend to agree with Theo that it would be very helpful if, when an issue or
proposal is discussed by the UTC, the relevant extracts of the minutes were
forwarded to the person who raised the issue as soon as possible after the UTC
meeting. This way, if the issue requires clarification or the originator of the
issue wants to submit a further document, it can be done before the next UTC
meeting, and does not need to wait until the UTC minutes are made public after
the next UTC meeting. On the other hand, as a serial offender, I personally
don't expect explicit feedback for every minor issue that I report on via the
Unicode reporting form -- I'm thinking more of script proposals and such like.

> If none of A, B, or C satisfies you, *then* submit another
> problem report, including a more explicit request in it
> for an explicit response regarding its disposition.
> 

Of course, the best solution is "Join Unicode", but I can't see much point in
the $120 p.a. individual membership, as that still does not get you access to
UTC documents or the Unicore list, and $1,200 p.a. for associate membership is a
little bit steep for most of us. Incidentally, I notice that the $600 p.a.
specialist (?) membership category has silently disappeared, which was the only
other possible option for individuals who wanted to get involved at the UTC
level.

Anyway, just my bent pennyworth, hope no-one's offended.

Andrew



Precomposed Glyphs (was Re: Saudi-Arabian Copyright sign )

2004-09-24 Thread Andrew C. West
On Thu, 23 Sep 2004 10:45:53 +0100, Peter Kirk wrote:
> 
> If there were such a list, font designers could indeed design 
> precomposed glyphs for each of the tens of thousands of graphemes on it. 
> But I suspect that they would prefer to specify a programmatic way of 
> making most of the combinations, except for rather common ones. And 
> users will prefer this as they won't want huge fonts mostly full of 
> extremely rare precomposed glyphs.

They will if they're Tibetan, as using precomposed glyphs is the only solution
if you want to produce professional quality Tibetan text display (cf. the recent
Unicode Tibetan fonts Ximalaya and Tibetan Machine Uni, which each have many
thousands of precomposed Tibetan glyphs).

And does anyone actually care what size a font is anyway, just as long as it
displays complex characters nicely ?

Andrew



Re: Public Review Issue: UAX #24 Proposed Update

2004-09-09 Thread Andrew C. West
On Thu, 9 Sep 2004 07:29:20 -0400, John Cowan wrote:
> 
> Jony Rosenne scripsit:
> 
> > The UTC refused to add Yiddish to the name, unlike the other Yiddish
> > specialties, and I am not aware of any other possibility.
> 
> Why should it?  Incorporating a language name into a character name,
> as in ABKHASIAN CHE and KHAKASSIAN CHE, is done because those languages
> have a letter named CHE distinct from the more usual, cross-linguistic
> Cyrillic CHE.  There is no such contrast in this case: we do not speak of
> LATIN SMALL LETTER ICELANDIC THORN, for example.

And indeed the Character Naming Guidelines specifically prohibit the
non-essential incorporation of a language name into a character name :

"In principle when a character of a given script is used in more than one
language, no language name is specified. Exceptions are tolerated where an
ambiguity would otherwise result." [N2652R Annex L Rule 9]

The usage of the language name "Yiddish" in 05F0..05F2 and FB1F contravenes this
rule, but these characters were inherited from Unicode 1.0, long before the rule
came into force.
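(The four names in question are easy to confirm against the Unicode Character Database with Python's unicodedata module; a quick sketch:)

```python
import unicodedata

# The Hebrew characters whose formal names incorporate "YIDDISH"
for cp in (0x05F0, 0x05F1, 0x05F2, 0xFB1F):
    print(f"U+{cp:04X}  {unicodedata.name(chr(cp))}")
```

For example, U+05F0 prints as HEBREW LIGATURE YIDDISH DOUBLE VAV.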

Andrew



Re: Ogham and Initialisms

2004-07-22 Thread Andrew C. West
On Thu, 22 Jul 2004 11:24:17 +0200, fantasai wrote:
> 
> If a Latin initialism appears in a bottom-to-top text
> and the characters are oriented upright rather than
> rotated, should the initialism read up or down?
> 
>UA
>S   or   S   ?
>AU
> 

In traditional monumental inscriptions which have both Ogham text and Latin or
Runic text, the Ogham text is always separate from the Latin/Runic text, and so
the issue of how to combine a vertical script (Ogham) with a horizontal script
(Latin/Runic) never arises.

If you're writing modern Ogham on paper or computer screen you would be best
advised to write it LTR, and then you won't have any problems with embedding
Latin text in it. If you insist on writing it bottom-to-top, then my guess would
be that embedded "USA" should read top-to-bottom as that's the preferred
vertical directionality of Latin text ... but does it really matter how it's
rendered, as the meaning should be obvious from context ? Anyway, the whole issue seems a
little too hypothetical, even for me.

Andrew



Medieval CJK race-horse names (was Re: Bantu click letters )

2004-06-11 Thread Andrew C. West
On Fri, 11 Jun 2004 03:04:17 +0100, Michael Everson wrote:
> 
> How many people use medieval CJK race-horse-name characters?
> 

Actually, the famous Song dynasty female poet Li Qingzhao (1084-c.1151) invented
a board game (da3 ma3 tu2 in Chinese) which involved racing around a course in
which each square was marked with the name of one of dozens of famous horses
ancient and modern, most of which are written using idiosyncratic ideographs. I
would have thought that Michael of all people would be in favour of encoding
characters used in board games !

Despite the oft-mentioned cutesy Hong Kong race horse names, idiosyncratic
invented Han ideographs are a negligible component of the encoded CJK
repertoire. In my opinion there are thousands, possibly tens of thousands, of
ideographs that should not really have been encoded individually as they are
simply minor glyph variants, frequently only attested in a single source because
the author simply wrote the character wrongly in the first place. This is the
real issue with the over-encoding of CJKV, not the occasional race horse name.

Andrew



Additional examples of the Phoenician script in use

2004-06-08 Thread Andrew C. West
At the risk of keeping the "thread from hell" alive, I'd like to point out a new
contribution by Michael Everson that may be of interest to participants in this
debate :

http://std.dkuug.dk/JTC1/SC2/WG2/docs/n2787-phoenician.pdf

To my untrained eyes this document provides some pretty compelling evidence in
support of the separate encoding of Phoenician.

Andrew



Re: Proposal to encode dominoes and other game symbols

2004-06-02 Thread Andrew C. West
On Wed, 2 Jun 2004 08:05:00 -0400, John Cowan wrote:
> 
> > H.7 Some criteria weaken the case for encoding
> 
> > -- the symbol is purely decorative
> 
> This would seem to exclude dingbats altogether.
> 

Or perhaps more apposite examples would be the shamrock and fleur-de-lis symbols
(see N2586R). Whilst the former symbol "is sometimes used in lexicography to
indicate botany or agriculture", and the latter symbol "symbolizes French
culture in general or the Francophonie specifically", I would think that most
people would consider them to be purely decorative.

If the shamrock and fleur-de-lis symbols pass the criteria outlined in Annex H,
it is hard to think of any symbol which would fail.

On the other hand, with absolutely no disrespect to Michael intended, the more
sceptical amongst us might be forgiven for thinking that the shamrock and
fleur-de-lis would never have been accepted for encoding if they had been
proposed by someone of lesser stature than Michael, especially given the minimal
examples of usage and justification for encoding provided in the proposal.

Andrew



Re: Proposal to encode dominoes and other game symbols

2004-06-02 Thread Andrew C. West
On Tue, 1 Jun 2004 22:50:53 -0700, "Doug Ewell" wrote:
> 
> I bet if someone took the trouble to look through enough children's
> literature and driver's testing materials, they could find at least one
> document that uses the STOP SIGN inline in a sentence, and that could be
> cited as sufficient evidence that it should be encoded.

And perhaps Michael would be kind enough to prepare a proposal for traffic signs
if you asked nicely ;)

> Everything I thought I knew about encoding symbols is wrong.
> 

I think that I agree with you on that.

I suggest that everyone interested in the question of encoding symbols have a
close read of Annex H "Criteria for encoding symbols" in N2652R ("Principles and
Procedures for Allocation of New Characters and Scripts and handling of Defect
Reports on Character Names"), as this details the criteria against which
Michael's dominos etc. proposal should be judged.

A few germane quotes from Annex H :

H.5 Discussion
Any proposal to encode additional symbols must be evaluated in terms of what the
benefit will be of cataloguing these entities and whether there is a realistic
expectation that users will be able to access them by the codes that we define.
This is especially an issue for non-notational, non-compatibility symbols.
...
As a conclusion, any successful proposal would need to contain a set of
non-notational symbols for which the benefits of a shared encoding are so
compelling that its existence would encourage a transition.

H.6 Some criteria that strengthen the case for encoding
The symbol 
-- is typically used as part of computer applications (e.g. CAD symbols)
-- has well defined user community / usage
-- always occurs together with text or numbers (unit, currency, estimated)
-- is required to be searchable or indexable
-- is customarily used in tabular lists as shorthand for characteristics (for
example, check mark, maru etc.)
-- is part of a notational system
-- is used in 'text-like' labels (even if applied to maps and 2D diagrams)
-- has well-defined semantics
-- has semantics that lend themselves to computer processing
-- completes a class of symbols already in the standard
-- is letter-like (i.e. ordinarily varies with the surrounding font style)
-- itself has a name, (for example, ampersand, hammer-and-sickle, caduceus)
-- is commonly used amidst text
-- is widespread, i.e. actually found used in materials of diverse
types/contexts by diverse publishers, including governmental

H.7 Some criteria weaken the case for encoding
There is evidence that
-- the symbol is primarily used free-standing (traffic signs)
-- the notational system is not widely used on computers (dance notation,
traffic signs)
-- the symbol is part of a set undergoing rapid changes (short-lived symbols)
-- the symbol is trademarked (unless encoding is requested by the owner) (logos,
Der grüne Punkt, CE symbol, UL symbol, etc)
-- the symbol is purely decorative
-- the symbol is an image of something, not a symbol for something
-- the symbol is only used in 2-Dimensional diagrams, (e.g. circuit components)
-- the symbol is composable (see diacritics for symbols)
-- the identity of the symbol is usually ignored in processing
-- font shifting is the preferred access and the user community is happy with
that (logos, etc.)

H.10 Perceived usefulness
The fact that a symbol merely seems to be useful or potentially useful is
precisely not a reason to code it. Demonstrated usage, or demonstrated demand,
on the other hand, does constitute a good reason to encode the symbol.




Re: Vertical BIDI

2004-05-28 Thread Andrew C. West
On Fri, 28 May 2004 06:51:27 -0700, "Mark Davis" wrote:
> 
> > As things now stand, Ogham must be wrapped in RLO...PDF brackets when
> > mixed with vertical Han or Mongolian.
> 
> Yes, that's true -- and I don't see any reason why people can't live with
> that... Those are the kinds of reasons we have the explicit controls.

Well I for one can live with that.

It was my original contention that using directional control characters such as
RLO and PDF were an appropriate mechanism (together with a suitable RTL Ogham
font) for producing BTT Ogham embedded within TTB Mongolian. And since Mark
seems to be confirming that this is indeed correct, it is the mechanism that I
will be using on my forthcoming web page in which I prove that Ogham was
invented by time-travelling warrior monks from Mongolia.
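(For anyone wanting to try the same trick, the mechanism is nothing more than bracketing the run with the explicit override characters, U+202E RIGHT-TO-LEFT OVERRIDE and U+202C POP DIRECTIONAL FORMATTING. A minimal Python sketch; how it actually displays of course depends on the renderer and the font:)

```python
RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

def force_rtl(run):
    """Bracket a run of text in RLO...PDF so that the bidi algorithm
    lays out its characters right-to-left regardless of their
    inherent directionality."""
    return RLO + run + PDF

ogham = "\u1681\u1682\u1683"  # OGHAM LETTER BEITH, LUIS, FEARN
embedded = force_rtl(ogham)   # renders RTL (or BTT once rotated) in context
```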

Andrew



White and Black Shogi Pieces [2616..2617] (was Re: Proposal to encode dominoes and other game symbols)

2004-05-27 Thread Andrew C. West
On Wed, 26 May 2004 04:34:21 -0700 (PDT), "Andrew C. West" wrote:
> 
> On Tue, 25 May 2004 10:08:26 -0700, John Hudson wrote:
> > 
> > Andrew C. West wrote:
> > 
> > > I've never quite worked out what purpose U+2616 [WHITE SHOGI PIECE] and
> > > U+2617 [BLACK SHOGI PIECE] are intended for.
> > 
> > I would like to know what the presumed purpose of U+2616 and U+2617 is.
> > 

Rummaging through some boxes in the attic I found an English book on shogi that
seems to show how U+2616 and U+2617 are intended to be used (unfortunately I
don't have any Japanese shogi books). See the attached picture which shows a
shogi game. A black shogi marker (= U+2617) and a white shogi marker (= U+2616)
are placed on the left of the board to indicate which side is nominally Black,
and which side is nominally White. When, as in this case, a player has captured
some pieces, the captured pieces are listed below the player's black or white
shogi marker (this is important as in shogi the capturer can throw the captured
pieces back onto the board as his own pieces, which is why shogi pieces are
differentiated by orientation not colour). Note that even when no pieces have
yet been captured, the two black and white shogi markers are still placed at the
side of the board to indicate which side is Black and which side is White.

Also note that the fact that the white shogi marker is inverted in the picture
is irrelevant, as board layout is obviously a higher level protocol.

Andrew

Re: Proposal to encode dominoes and other game symbols

2004-05-26 Thread Andrew C. West
On Wed, 26 May 2004 13:09:43 +0100, Michael Everson wrote:
> 
> At 04:40 -0700 2004-05-26, Andrew C. West wrote:
> 
> >But we're not encoding dominos per se, but rather encoding 
> >representations of domino pieces in textual contexts. Whilst 
> >pictures of domino sets are interesting, and provide useful 
> >background information, I would imagine that examples of the textual 
> >usage of domino glyphs is what is required in order for domino 
> >characters to be accepted for encoding by the UTC and WG2.
> 
> Be serious. It doesn't take a genius to see that if people are using 
> domino characters in text descriptions of domino rules and play and 
> that there will be a need for all the major varieties. The 15- and 
> 18-tile sets are used in tournament play. Just because someone hasn't 
> put them on a web page (in a clumsy graphic) yet doesn't  mean that 
> it isn't *un*reasonable to wait for them to do so.
> 

Hmm, pre-emptive encoding ... an interesting idea ... I might just have a use
for it in a proposal I'm working on.

Andrew



Re: Proposal to encode dominoes and other game symbols

2004-05-26 Thread Andrew C. West
On Tue, 25 May 2004 10:08:26 -0700, John Hudson wrote:
> 
> Andrew C. West wrote:
> 
> > I've never quite worked out what purpose U+2616 [WHITE SHOGI PIECE] and
> > U+2617 [BLACK SHOGI PIECE] are intended for.
> 
> > The standard game of shogi (Japanese Chess) has 20 uncoloured tiles on each
> > side, with a kanji inscription giving the piece's name on each tile.
> 
> In discussions of shogi games, one player is conventionally called 'Black' and
> the other 'White', but as you note this has nothing to do with the colour of
> the pieces. I would like to know what the presumed purpose of U+2616 and
> U+2617 is. If it is indeed to be able to represent shogi game pieces, then the
> glyph representation shown in the Unicode charts might be changed: both pieces
> should be white in colour, but facing in opposite directions.
> 

If any application did want to use U+2616 and U+2617 for representing actual
shogi pieces, the glyphs would have to have a blank foreground (sorry James) and
one of them would have to be inverted as John suggests. But then the application
would have to resize the glyph as appropriate for each piece (more important
pieces have larger sized tiles), and overlay the appropriate Kanji or kana
inscription (at the appropriate font size), and if the inscription was to
overlay the inverted shogi glyph the inscription would also have to be rotated
180 degrees. Clearly using plain images would be far far simpler, and so I
cannot believe that U+2616 and U+2617 were intended to be used like this.

My suspicion is that U+2616 and U+2617 are intended to be used in shogi notation
to indicate which side a particular move belongs to, and are not intended to be
used to actually represent specific shogi pieces. There is no real need to do so
anyway. Just as in Western Chess notation, which uses "K", "N", "B", etc. to
represent the king, knight and bishop (not U+2654..265F), in shogi notation the
pieces are represented by their Japanese names, not pictures of the pieces. So
U+2616 and U+2617 *may* be used something like this :

U+2616 : Gold to e4.
U+2617 : Silver to f5.

> > Each side's
> > 20 tiles are identical (differentiated by orientation not by colour) except
> > for the "general".
> 
> Not so. Both sides has four generals: two 'gold' and two 'silver'. The gold and
> silver generals differ from each other, but each side's pieces are entirely
> identical.
> 

By "general" I meant the piece that corresponds to the king in Western Chess (in
xiangqi and shogi the king is a general), not the gold and silver generals
(which correspond to the "Mandarin" in xiangqi). In my set, at least, one of the
tiles representing this piece is inscribed "Prince General" 王将
<738B, 5C06> (ô-shô ?) whilst the other tile is inscribed "Jade General"
玉将 <7389, 5C06> (gyoku-shô ?), with a single dot differentiating
the pieces. However, I do not know if this is universal or not. What does your
set have ?

> By the way, if any Unicoders play shogi, I could bring my travel set next time
I
> come to the conference.

I have played the odd game in the past ... and given my playing style, odd is
the operative word.

Andrew




Re: Proposal to encode dominoes and other game symbols

2004-05-26 Thread Andrew C. West
On Tue, 25 May 2004 17:30:37 -0700, Rick McGowan wrote:
> 
> > John that going beyond the double-twelve (for now) is just speculative
> > and not supported by actual use in dominoes books.
> 
> I don't think this is speculative. A photograph of production domino sets  
> above 12 is included in the proposal. We might as well add them now as  
> later.

But we're not encoding dominos per se, but rather encoding representations of
domino pieces in textual contexts. Whilst pictures of domino sets are
interesting, and provide useful background information, I would imagine that
examples of the textual usage of domino glyphs are required in order for
domino characters to be accepted for encoding by the UTC and WG2.

Andrew



Re: Proposal to encode dominoes and other game symbols

2004-05-25 Thread Andrew C. West
On Tue, 25 May 2004 13:00:51 +0100, Michael Everson wrote:

> 
> At 03:27 -0700 2004-05-25, Andrew C. West wrote:
> >On Tue, 25 May 2004 10:23:19 +0100, Michael Everson wrote:
> >  >
> >  > Now that you mention it, it could well be that Chaturunga and Chinese
> >  > Chess both could be considered extensions to a unified Chess
> >  > repertoire:
> >  >
> >  > WHITE CHATURANGA COUNSELLOR (-> white chess queen)
> >  > WHITE CHATURANGA ELEPHANT (-> white chess bishop)
> >  > BLACK CHATURANGA COUNSELLOR (-> black chess queen)
> >  > BLACK CHATURANGA ELEPHANT (-> black chess bishop)
> >  > WHITE XIANGQI MANDARIN (advisor, assistant, guard)
> >  > WHITE XIANGQI CANNON
> >  > BLACK XIANGQI MANDARIN
> >  > BLACK XIANGQI CANNON
> >  >
> >
> >I don't think that a unified chess repertoire would be useful. Although
> >individual pieces in chaturanga, chess, xiangqi and shogi may 
> >correspond to each other in function, they are represented 
> >differently (Western chess pieces are represented by pictures, 
> >xiangqi pieces by ideographs in a circle, shogi pieces by kanji 
> >inscriptions in a five-sided figure), so that I do not believe that 
> >there would be a single character of the "unified chess repertoire" 
> >which would be common to any two chess families. You would, I think, 
> >have to encode each set of characters used to represent games pieces 
> >separately for each chess family.
> 
> Andrew, if you look at the links in my original posting about 
> Chaturanga you will see that "generic" Chess pieces are indeed used 
> for these.

True, for that particular site for Chaturanga, and to a more limited extent for
xiangqi.

I can't speak for chaturanga, but in Chinese books on xiangqi and Japanese books
on shogi, generic chess images are not used to represent xiangqi and shogi
pieces. Although in some beginner-level books written for an English-speaking
audience the Chinese or Japanese text on the various pieces is replaced by
pictorial representations, I imagine that the native representation of xiangqi
and shogi pieces should have priority if symbols representing xiangqi and shogi
pieces were to be encoded.

I would suggest that any proposed encoding of an extended set of generic chess
symbols for use with all varieties of chess should be based on recognised
international usage, not the ad hoc usage of individual web sites and books.

Andrew



Re: Proposal to encode dominoes and other game symbols

2004-05-25 Thread Andrew C. West
On Tue, 25 May 2004 10:23:19 +0100, Michael Everson wrote:
>
> Now that you mention it, it could well be that Chaturunga and Chinese 
> Chess both could be considered extensions to a unified Chess 
> repertoire:
> 
> WHITE CHATURANGA COUNSELLOR (-> white chess queen)
> WHITE CHATURANGA ELEPHANT (-> white chess bishop)
> BLACK CHATURANGA COUNSELLOR (-> black chess queen)
> BLACK CHATURANGA ELEPHANT (-> black chess bishop)
> WHITE XIANGQI MANDARIN (advisor, assistant, guard)
> WHITE XIANGQI CANNON
> BLACK XIANGQI MANDARIN
> BLACK XIANGQI CANNON
> 

I don't think that a unified chess repertoire would be useful. Although
individual pieces in chaturanga, chess, xiangqi and shogi may correspond to each
other in function, they are represented differently (Western chess pieces are
represented by pictures, xiangqi pieces by ideographs in a circle, shogi pieces
by kanji inscriptions in a five-sided figure), so that I do not believe that
there would be a single character of the "unified chess repertoire" which would
be common to any two chess families. You would, I think, have to encode each set
of characters used to represent games pieces separately for each chess family.

Andrew



Re: Proposal to encode dominoes and other game symbols

2004-05-25 Thread Andrew C. West
On Mon, 24 May 2004 20:11:08 -0700, Patrick Andries wrote:
>
> >> Proposal to encode dominoes and other game symbols
> >
> > This could get out of hand very quickly. Chinese and Japanese (shogi) 
> > chess pieces? 
> 
> To complete U+2616 and U+2617 ?
> 

I've never quite worked out what purpose U+2616 [WHITE SHOGI PIECE] and U+2617
[BLACK SHOGI PIECE] are intended for.

The standard game of shogi (Japanese Chess) has 20 uncoloured tiles on each
side, with a kanji inscription giving the piece's name on each tile. Each side's
20 tiles are identical (differentiated by orientation not by colour) except for
the "general". However, most of the tiles can be turned over to reveal a
different inscription when the piece is promoted. This would mean that you would
need (from memory, I may be wrong) a total of 28 Unicode characters to represent
all possible tile patterns and orientations. However, there are a dozen or so
historical and modern varieties of shogi, some played on a large board with huge
numbers of tiles.

And then of course there's Chinese Chess (xiangqi), which requires 14 Unicode
characters to represent the different pieces. And it wouldn't be fair to ignore
the infamous "seven states chess" (affectionately nicknamed "stinky chess"),
that is a complex variation of Chinese Chess played on a go-board with seven
players, each with an army of about twenty pieces (I can't remember the exact
number offhand). That would require quite a few extra characters, and some way
of representing the seven colours of the pieces.

Oh, and what about Mahjong ?

I had assumed that representations of games pieces were largely outside the
scope of Unicode, but I am agnostic on this issue, and given that Western chess
pieces are already encoded at 2654..265F, if Michael's proposal for dominos etc.
does get accepted I do think that someone will need to propose a full set of
shogi, xiangqi and mahjong characters.

Andrew



Re: Vertical BIDI

2004-05-19 Thread Andrew C. West
Michael Everson wrote:
>
> Come on, people. Read the standard, please. It's on page 338. 

Michael is absolutely right to rebuke me for not reading the Standard. Of course
I have read the Ogham block intro before, and no doubt that is where I got the
notion of rendering Ogham BTT from, but I had forgotten that Ogham's BTT
directionality is explicitly mentioned there. If only I had reread the block
intro before joining this thread I wouldn't have ended up rambling down a dead
end in my recent postings.

But now that I'm back on the marked path the way forward is still as unclear as
ever.

The only thing that is certain is that Ogham must be rendered BTT in vertical
contexts. For Ogham text in isolation this is fairly easy to accomplish by
simple rotation, and one could expect "writing-mode : bt-rl" or "writing-mode :
bt-lr" to accomplish this in a CSS stylesheet. Whether the columns should run
LTR or RTL across the page is another question, although LTR would be simplest
to implement as it would only involve rotating a whole block of horizontal LTR
Ogham text 90 degrees anticlockwise. At any rate, vertical presentation is a
matter for a higher protocol, and not a Unicode matter.

However, Ogham text embedded in Mongolian may be a different matter. If a plain
text editor renders everything horizontally, as most do, then both Mongolian and
Ogham should be rendered LTR thus , but if you then
select vertical presentation (assuming your text editor has this option)
Mongolian should be rendered TTB and Ogham BTT thus .
I still have no idea as to how this should be achieved. My "hack" of using a
custom rotated Ogham font and RLO/PDF codes would achieve the desired result for
vertical presentation, but would make the Ogham text RTL for horizontal
presentation, which is apparently unacceptable. But what alternatives are there ?

Andrew



Re: Vertical BIDI

2004-05-18 Thread Andrew C. West
On Mon, 17 May 2004 22:59:50 -0400, John Cowan wrote:
> 
> It should not.  That's what makes Ogham different from standard
> horizontal scripts -- it does have a preferred vertical orientation,

It does ? I thought that the whole point of much of the recent discussion was
the uncertainty of how Ogham should be laid out in vertically formatted text,
such as when embedded in Mongolian or vertical Chinese.

> and because turning it upside-down generates different *characters*,
> you can't violate that.
> 

True, if you don't know the directionality of an Ogham text it may be difficult
or impossible to be sure how to read it (for example a bone plaque with an Ogham
inscription unearthed recently in the Outer Hebrides
 may
read either EIHNE-- or --EQBIE depending upon which end you should start from).
However, if you know the directionality of a line of Ogham text then it does not
matter whether it is "upside-down" or not (what is "upside-down" for Ogham?),
you just read it from start of line to end of line.

I guess what you're saying is that unless you know what the vertical orientation
is you may not know which end to start reading from, and so a fragment of Ogham
embedded in Mongolian may be readable either TTB or BTT, and there is nothing
that explicitly notifies the reader which way is correct.

But what is this preferred vertical orientation of Ogham that you speak of ? Is
it specified in the Unicode Standard ? And if not, should it be ?

Andrew



Re: Multiple Directions (was: Re: Coptic/Greek (Re: Phoenician))

2004-05-17 Thread Andrew C. West
On Mon, 17 May 2004 12:32:14 -0400, [EMAIL PROTECTED] wrote:
> 
> I follow you.  The question is, then, whether T2B Ogham is legible or
> not to someone who reads B2T Ogham fluently -- unfortunately, your texts
> are all pothooks and tick marks to me.
>

If you're used to reading Ogham LTR on the printed page, I would say the answer
is yes. And even if you're in the habit of reading inscriptions directly from
stones in field or churchyard then you probably tilt your head sideways to read
them LTR anyway.
 
> Still mysterious is the question of whether vertical Ogham columns should
> be laid out L2R or R2L across the page.  I suppose the inscriptions
> aren't really much help.

I doubt that any of the inscriptions are long enough to warrant multi-line
presentation on a full screen or page. If there were any long inscriptions I
personally would format them in LTR rows.

> 
> I think my point was that a plain text editor that claims to handle
> Mongolian had better be able to rotate the text to vertical orientation,
> or the users will discard it for one that doesn't give them sore necks
> (which is not at all the case with one claiming to handle Ogham).
> 

Hmm, you're right, and that's something that has been nagging me for some time
now.

> Am I right in thinking that in vertical layout, native R2L scripts
> are displayed with the baseline to the right, and therefore not
> bidirectionally?  If so, does Unicode require a LRO/PDF pair around them
> to do the Right Thing?

No idea. I've never tried such an experiment, and I can't read any RTL scripts.

Andrew



Re: Multiple Directions (was: Re: Coptic/Greek (Re: Phoenician))

2004-05-17 Thread Andrew C. West
On Mon, 17 May 2004 10:12:50 -0400, John Cowan wrote:
> 
> Andrew C. West scripsit:
> 
> > Thus, if "tb-lr" were supported, your browser would display the
> > following HTML line as vertical Mongolian with embedded Ogham reading
> > top-to-bottom, but in a plain text editor, the Mongolian and Ogham
> > would both read LTR, and everyone would be happy :
> 
> I don't know about that.  I wouldn't be too happy trying to read English
> with the Latin letters laid out bt-rl and lying on their left sides to boot.
> On paper is one thing, but on a non-rotatable screen?  I don't think so.

I think you may have misunderstood me. I'm now suggesting that perhaps Ogham
shouldn't be rendered bottom-to-top when embedded in vertical text such as
Mongolian, but top-to-bottom as is the case with other LTR scripts such as
Latin, as shown in the attached screen shot from an HTML page, where the
embedded Ogham and Latin text both read LTR down the page if you tilt your head
sideways. I don't know whether you're happy reading Latin text as displayed in
this example, but this is the normal way of embedding short sections of Latin
script (e.g. proper names) in vertical text, and I don't know of any better way
to deal with embedding horizontal scripts in vertical text.

Likewise, when you embed Mongolian in horizontal text (which is also quite
common), you have to tilt your head sideways to read the Mongolian. I don't
think there's any way round the head-tilting business when mixing vertical and
horizontal scripts. Note that I'm talking about embedding single words or short
phrases in text with a different orientation. Of course for long passages of
both vertical and horizontal text, each script should be laid out in separate
vertical and horizontal blocks.

Andrew



ᠡᠷᠲᠡ ᠤᠷᠢᠳᠠ ᠬ᠌ᠠᠪᠠᠯᠢᠬ᠌ Ogham  ᚒᚂᚉᚐᚌᚅᚔ  ᠪᠠᠯᠭᠠᠰᠤᠨ ᠳᠤᠷ ᠪᠢᠷᠠᠮᠠᠨ ᠤ ᠬᠠᠮᠤᠭ





Re: Multiple Directions (was: Re: Coptic/Greek (Re: Phoenician))

2004-05-17 Thread Andrew C. West
On Mon, 17 May 2004 12:15:55 +0100, Jon Hanna wrote:
> 
> It seems to me that as far as Ogham goes the positioning of successive glyphs
> is
> more comparable to the way a graphics program will position text along a path
> (allowing text to go in a circle, for example) than the differences between
> LTR, RTL, vertical and boustrophedon scripts. The text isn't composed of a BTT
> passage, a LTR passage and a TTB passage, but of a single passage which follows
> a path which changes through those three directions.
> 
> Paths are not a plain text matter.
> 

I agree entirely with this. A lump of rock (and what I'm interested in are
monumental Ogham inscriptions) is not comparable to a sheet of paper or a
computer screen, and it makes perfect sense to reformat the wandering path of
the original Ogham text as horizontal LTR lines for display or printing within a
text document.

What intrigues me is how Ogham text would be embedded within a vertical script
such as Mongolian. Perhaps Ogham isn't really that good an example of a
bottom-to-top writing system (are there any good examples ?), especially as very
few Ogham inscriptions extend to more than two or three words, and it would make
most sense for Ogham embedded in Mongolian to simply follow the directionality
of the surrounding text, i.e. read top-to-bottom, in which case it would simply
be rotated LTR text, no different to how Latin text embedded in Mongolian or
vertical Chinese is normally rendered.

Thus, if "tb-lr" were supported, your browser would display the following HTML
line as vertical Mongolian with embedded Ogham reading top-to-bottom, but in a
plain text editor, the Mongolian and Ogham would both read LTR, and everyone
would be happy :

Some Mongolian text some Ogham text some more
Mongolian text

Andrew



Re: Multiple Directions (was: Re: Coptic/Greek (Re: Phoenician))

2004-05-17 Thread Andrew C. West
On Sat, 15 May 2004 14:14:50 -0400, fantasai wrote:
> 
> That's a hack, not a solution.

There's a fine line between "hack" and "solution", and I'm not sure which side
of the line my proposed technique falls.

> Again, if you take the text out of the
> presentational context you've warped it into, it doesn't make any sense.

To my way of thinking, if a text (such as an Ogham inscription) was originally
written vertically bottom to top, it makes just as much sense to render and read
it RTL as it does to render and read it LTR .

> The text shouldn't depend on the font or text orientation switches being
> exactly right.

And yet forced RTL Ogham text rendered horizontally with an ordinary Ogham font
would be no more illegible than plain Unicode Mongolian or Arabic displayed on a
system that only supports LTR with fonts that only display the fixed code chart
glyphs. Correct rendering of any complex script does depend on the correct
combination of fonts, rendering system and control codes; and you can't expect
all Unicode text to be displayed correctly irrespective of the sophistication of
your rendering system and fonts.

> 
> Unicode directionality shouldn't be used as a presentational property;
> that's the problem CSS3 Text has right now.
>http://lists.w3.org/Archives/Public/www-style/2003Apr/0116.html

So, would your "solution" to embedding Ogham in vertical Mongolian be a
higher-level protocol, such as

some Mongolian text some Ogham
text some more Mongolian text

Great if it works, and probably superior to my "hack", but I still don't see
that there's anything intrinsically wrong with using directional control codes
to format plain text.

Andrew



Re: Multiple Directions (was: Re: Coptic/Greek (Re: Phoenician))

2004-05-15 Thread Andrew C. West
On Fri, 14 May 2004 18:44:10 +0100, Michael Everson wrote:

> 
> You can't play around with Ogham directionality like that. Reversing 
> it makes it read completely differently! The first example reads 
> INGACLU; the second reads ULCAGNI.

Well I disagree. As I said in the message, the RTL result does not work in *this
case* because the glyphs need to be rotated 180 degrees. As I said, if you had a
font designed specifically for RTL/TTB Ogham (not that hard to create), then the
glyphs in the font would be rotated 180 degrees compared with the glyphs in the
Unicode code charts, with the result that my sample Ogham text would read
ULCAGNI correctly from right to left. Then if you rotated the whole thing 90
degrees clockwise (either using a text editor or by printing it out and manually
rotating the printed output) you would have ULCAGNI reading upwards embedded in
Mongolian text reading downwards. If I wasn't preoccupied with more pressing
matters I would have a go at creating such a font to prove that this can be done.

Also, note that the point of RTL Ogham is NOT to render it RTL per se, but as a
step towards rendering it BTT. A similar trick is used for Mongolian. In order
to get vertical left-to-right layout of Mongolian text (when no systems
currently support left-to-right vertical layout), one technique is to use an RTL
Mongolian font with the glyphs rotated 180 degrees. Then the text is written RTL
in lines going top to bottom down the page. You print out the result, and rotate
the sheet of paper 90 degrees, and Hey Presto! you have vertical Mongolian text
reading left to right across the page.

Andrew



Re: Multiple Directions (was: Re: Coptic/Greek (Re: Phoenician))

2004-05-14 Thread Andrew C. West
On Fri, 14 May 2004 11:43:53 -0400, [EMAIL PROTECTED] wrote:
> 
> Andrew C. West scripsit:
> 
> > In bilingual Manchu-Chinese texts, which were common during
> > the Manchu Qing dynasty [1644-1911], the text normally follows the Manchu
> > page
> > layout, with vertical lines of Manchu and Chinese interleaved from left to
> > right
> > across the page, so that from a Chinese perspective the book reads backwards.
> 
> Most interesting.  What about codex binding?  When I see people reading
> Chinese newspapers on the subway, the binding appears to be on the left
> even though the columns of the text are RTL; at least, judging by what
> appears to be the front page.
> 

For vertically laid out Chinese books, the front cover corresponds to the back
cover of an English book, and the back cover corresponds to the front cover of
an English book, so from a Western perspective such books appear to be read
backwards. Traditionally-bound bilingual Manchu-Chinese books are read the same
way as Western books ... which is backwards from a Chinese perspective.

As to newspapers, these normally follow Western newspaper page order, regardless
of whether the content is laid out horizontally, vertically or mixed horizontal
and vertical.

> > As I suggested in a recent thread on mixed horizontal/vertical layout, if you
> > did have mixed Top-To-Bottom (TTB) and Bottom-To-Top (BTT) scripts such as
> > Mongolian and Ogham [...] then you
> > could deal with their conflicting directionality as if they were rotated LTR
> > and
> > RTL scripts by means of LRO, RLO and PDF control codes [202C..202E].
> 
> Surely that's not enough: you'd need to implement the full implicit bidi
> algorithm, giving Ogham a nonce bidi type of R.  Either that, or run the
> Ogham T2B instead of the normal direction.
> 

Have a look at the attached images which show Mongolian text with embedded Ogham
(laid out horizontally).

In "Mongolian with embedded LTR Ogham" both Mongolian and Ogham are LTR by
default.

In "Mongolian with embedded RTL Ogham" Mongolian is LTR by default, but the
Ogham has been made RTL by sandwiching it between an RLO control code and a PDF
control code (i.e. <202E, 1680, 1692, 1682, 1689, 1690, 168C, 1685, 1694, 1680,
202C>). Thanks to Uniscribe this magically has the effect of reversing the flow
of the embedded Ogham text. This does not quite work, however, as although the
glyph order has been reversed the glyphs have not been rotated 180 degrees, with
the result that they are the wrong way round. But an Ogham font that was
specifically designed for RTL/BTT usage would have glyphs that are rotated 180
degrees compared with those in the Unicode code chart, and with such a font this
embedded RTL Ogham test text would be rendered correctly. Now all you would need
is a text editor that rotated the whole thing 90 degrees, and you would have BTT
Ogham embedded in TTB Mongolian ... with no messy fiddling with the bidi
algorithm.
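The control-character sandwich described above is easy to reproduce programmatically. Here is a minimal Python sketch (the codepoints are those of the ULCAGNI sequence quoted in this message; the mirrored rendering is of course up to the font and the renderer, not to Python, which can only build the string and report the bidirectional classes):

```python
import unicodedata

RLO = "\u202E"  # RIGHT-TO-LEFT OVERRIDE
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING

# ULCAGNI between Ogham space marks, as in the quoted sequence
# <202E, 1680, 1692, 1682, 1689, 1690, 168C, 1685, 1694, 1680, 202C>
ogham = "\u1680\u1692\u1682\u1689\u1690\u168C\u1685\u1694\u1680"
forced_rtl = RLO + ogham + PDF

# Ogham letters are strong left-to-right; only the override reverses them
assert unicodedata.bidirectional("\u1692") == "L"    # OGHAM LETTER UR
assert unicodedata.bidirectional(RLO) == "RLO"
assert unicodedata.bidirectional(PDF) == "PDF"
```

A bidi-aware renderer fed `forced_rtl` reverses the glyph order exactly as described; the 180-degree glyph rotation still has to come from a purpose-made font.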

And if you really wanted to be clever, you could perform the same trick by
embedding LTR Ogham in RTL Mongolian (using an RTL Mongolian font ... which do
exist, although I don't know of any Unicode RTL Mongolian fonts yet), and
rotating the whole thing 270 degrees ... I think ... my head's started to spin
at this point ...

Andrew

Re: Multiple Directions (was: Re: Coptic/Greek (Re: Phoenician))

2004-05-14 Thread Andrew C. West
On Fri, 14 May 2004 11:09:19 +0100, Michael Everson wrote:
> 
> At 02:40 -0700 2004-05-14, Andrew C. West wrote:
> 
> >(not that Ogham's strictly BTT, but it is largely BTT in monumental 
> >inscriptions
> 
> I think it is always BTT in the inscriptions.

My understanding is that when written along the arris of a "memorial stone", the
inscription goes up one arris, and then either continues from the bottom of
another arris going upwards again, or continues directly down another arris so
that it follows an inverted V or U line; so technically inscriptional Ogham
would be either BTT or vertical boustrophedon. But, on the other hand, Pictish
Ogham inscriptions on flat slabs are largely adirectional.

> 
> >-- although for convenience it is almost always written LTR on paper 
> >and on screen ... and even in the Unicode code charts)
> 
> As it has been for centuries.

Which is eminently sensible.

I see no reason to mimic the original vertical orientation of Ogham inscriptions
in text. If I did want to reproduce the exact textual layout of an Ogham
inscription I would opt for a photograph or drawing of the inscription rather
than try to format the text into twisted lines.

Nevertheless, if a Mongolian script expert were to write a book on Ogham in
Mongolian, he would probably embed Ogham text vertically within the vertical
Mongolian text of the book. In which case, would he write the Ogham text TTB
following Mongolian directionality, or BTT following inscriptional
directionality ?

Andrew



Re: Multiple Directions (was: Re: Coptic/Greek (Re: Phoenician))

2004-05-14 Thread Andrew C. West
On Thu, 13 May 2004 16:33:51 -0400, [EMAIL PROTECTED] wrote:
>
> That's irrelevant.  L2R and R2L scripts are often mixed in the same
> sentence, whereas it's barely possible to mix horizontal and vertical
> scripts on the same page; when it must be done, the vertical script
> is generally rotated to become a L2R horizontal one (even Mongolian).
> A page that contained both Mongolian and vertical CJK might require
> a vertical bidirectional algorithm, but AFAIK that question has not
> yet arisen.

I'm a little confused by the last sentence. Mongolian and traditional Chinese
are both written vertically from top to bottom, and so a vertical bidirectional
algorithm is not relevant. Mixed Chinese/Mongolian and Chinese/Manchu texts are
extremely common. Traditionally vertical layout of mixed Mongolian/Chinese text
is preferred, but in modern Chinese books that have individual Mongolian words
and phrases embedded in Chinese text then horizontal layout is common (with any
complete passages of Mongolian text usually typeset vertically on a separate
page). Where Mongolian and Chinese do differ is that traditional Chinese is
written in vertical columns from right to left across the page, whereas
Mongolian and Manchu are written in vertical columns from left to right across
the page (because the Uighur script from which Mongolian derives is actually a
rotated RTL script). In bilingual Manchu-Chinese texts, which were common during
the Manchu Qing dynasty [1644-1911], the text normally follows the Manchu page
layout, with vertical lines of Manchu and Chinese interleaved from left to right
across the page, so that from a Chinese perspective the book reads backwards.

As I suggested in a recent thread on mixed horizontal/vertical layout, if you
did have mixed Top-To-Bottom (TTB) and Bottom-To-Top (BTT) scripts such as
Mongolian and Ogham (not that Ogham's strictly BTT, but it is largely BTT in
monumental inscriptions -- although for convenience it is almost always written
LTR on paper and on screen ... and even in the Unicode code charts), then you
could deal with their conflicting directionality as if they were rotated LTR and
RTL scripts by means of LRO, RLO and PDF control codes [202C..202E]. Peter and
Ken's remarks in this thread seem to suggest that this interpretation is
correct, and that the terms "Left-to-Right" and "Right-to-Left" are relative
directions not absolute directions with respect to plain text layout. A plain
text editor would normally lay out vertical text such as Mongolian horizontally
whether you like it or not; but if there were a plain text editor that laid
everything out vertically then presumably the bidi algorithm and bidi control
characters would not be invalidated simply because mixed LTR and RTL text were
laid out in vertical lines rather than horizontal lines.

As has been stated time and time again, mixing vertical and horizontal textual
orientation in the same document is beyond the scope of a plain text standard,
and rendering mixed horizontal/vertical text is certainly beyond the ability of
any plain text editor that I know of. Markup is the appropriate way to deal with
mixed horizontal/vertical/diagonal/circular/spiral text (Artemis Fowl has a
constructed "script" with spiral textual orientation), not dozens of new
directional control characters.

Andrew



RE: Script vs Writing System

2004-05-13 Thread Andrew C. West
On Wed, 12 May 2004 11:12:24 -0700, "Peter Constable" wrote:
> 
> My concern with using this in the taxonomy is that every other category
> in the taxonomy is structural in nature, having to do with the
> relationship between structural units of the writing system and
> structural units of the phonology (sign <> phoneme / syllable...). In
> describing Hangul as "featural", however, those structural issues are
> ignored, and instead the focus is on an iconic relationship between
> shapes of symbols and the shape of the vocal tract. I don't mind noting
> such a characteristic of a script, but I think it is not good science to
> create a taxonomy that mixes defining criteria in an ad hoc manner:
> categories in a taxonomy should be defined on a consistent basis. There
> is absolutely no reason why a purely structural taxonomy could not
> include Hangul. It just requires an additional category of like
> "alphasyllabary", which Peter Daniels simply refuses to accept.

Whilst not an "alphasyllabary" (as its script units are believed to be
polysyllabic), the small Khitan script also shares the same structural feature
as Hangul, that is that phonetic elements corresponding to a basic lexical unit
are laid out into a rectangular block. Indeed, it is quite possible that the
blocked layout of the Hangul script was inspired by the small Khitan script
(which was in use in the area of north-east China adjacent to Korea only a few
hundred years earlier).

In addition, traditional Chinese zither notation (qin pu) is also laid out in
ideographic-like square blocks. However, as this is a notational system rather
than a script, the constituent elements of each block represent string, finger
and plucking technique rather than phonetic values.

Perhaps a term could be devised that encompasses block layout (rather than
linear layout) scripts such as Hangul and small Khitan (and even Chinese zither
notation ?).

Andrew



Re: CJK(B) and IE6

2004-05-04 Thread Andrew C. West
On Sun, 2 May 2004 12:14:29 -0700, "Doug Ewell" wrote:
> 
>  wrote:
> 
> > The BabelPad editor can easily convert between UTF-8 and NCRs...
> 
> As can SC UniPad.

For $199 (unless you're only interested in editing files up to 1,000 characters
in length).
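For anyone who wants the conversion without a dedicated editor, the round trip between characters and NCRs is only a few lines of code. A hedged Python sketch (the function names are my own; it emits decimal NCRs and accepts both decimal and hexadecimal ones):

```python
import re

def text_to_ncrs(s):
    """Replace every non-ASCII character with a decimal NCR (&#NNNN;)."""
    return "".join(c if ord(c) < 0x80 else "&#%d;" % ord(c) for c in s)

def ncrs_to_text(s):
    """Replace decimal (&#NNNN;) and hex (&#xNNNN;) NCRs with characters."""
    def decode(m):
        return chr(int(m.group(1), 16) if m.group(1) else int(m.group(2)))
    return re.sub(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));", decode, s)

# Round-trip a CJK-B character, the kind of text discussed in this thread
assert text_to_ncrs("\U00020000") == "&#131072;"
assert ncrs_to_text("&#x20000;") == "\U00020000"
```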

Andrew



Re: Brahmic Unification (was Re: New contribution )

2004-04-30 Thread Andrew C. West
On Fri, 
> 
> Andrew C. West scripsit:
> 
> > For example, the excellent description of the Tocharian script
> > (surely the worst made-up name for a dead script ever) at
> > http://titus.fkidg1.uni-frankfurt.de/didact/idg/toch/tochbr.htm could
> > be the basis of a proposal for this important Brahmic script. There is
> > a considerable body of Tocharian material, and it would be much easier
> > to encode this material using a dedicated Tocharian block than using
> > a generic Brahmic encoding model.
> 
> I don't understand.  Are you proposing this be implemented as a full
> syllabary, Cherokee/Ethiopic style, rather than using the usual
> apparatus of consonants, independent vowels, and vowel marks (which in
> this case ligate fully with the consonants)?

No, not at all. The charts may show consonant-vowel syllables, but that does not
mean that I believe that they should be proposed to be encoded as syllables.

What I was saying was that all the glyphs needed for a proposal are nicely laid
out here, not that there is necessarily a one-to-one correspondence between
these glyphs and Unicode characters.

Andrew



Brahmic Unification (was Re: New contribution )

2004-04-30 Thread Andrew C. West
On Thu, 29 Apr 2004 12:35:55 -0700, Rick McGowan wrote:
> 
> The unified Brahmis proposal exactly proposes unification of systems with
> vastly different rendering behavior. That's part of the controversy with it.
> But that proposal is currently sitting on a siding waiting to be taken up
> by the Indic list.

Well, I'll kick things off if you like, by giving my opinion that the proposal
attempts to over-unify a too wide range of scripts with quite different encoding
and rendering requirements. What I would really like to see is individual
proposals for some of the major Brahmic scripts.

For example, the excellent description of the Tocharian script (surely the worst
made-up name for a dead script ever) at
http://titus.fkidg1.uni-frankfurt.de/didact/idg/toch/tochbr.htm could be the
basis of a proposal for this important Brahmic script. There is a considerable
body of Tocharian material, and it would be much easier to encode this material
using a dedicated Tocharian block than using a generic Brahmic encoding model.
It would probably take years to get any sort of consensus on a unified Brahmic
script proposal, but the issues involved with a single (and in the case of
Tocharian fairly uniform) script could be dealt with in a relatively short
period of time.

Andrew



Re: Unihan.txt and the four dictionary sorting algorithm

2004-04-21 Thread Andrew C. West
On Tue, 20 Apr 2004 22:36:48 +0100, "Raymond Mercier" wrote:
> 
> The problem of the size of Unihan has nothing at all to do with the cost of
> storage, and everything to do with the functioning of programs that might
> open and read it.
> Since the lines in Unihan are separated by 0x0A alone, not 0x0A0x0D, this
> means that when opened in notepad the lines are not separated. Notepad does
> have the advantage that the UTF-8 encoding is recognized, and the characters
> are displayed.
> 
> If opened in Wordpad the Chinese characters do not appear, perhaps the UTF-8
> encoding does not function.
> 
> If I try MS Word the machine grinds to a halt - and this is a good modern
> machine (XP with 120Mb HD and 512Mb RAM).
> 
> Similarly if I open in IE6, with UTF-8 encoding, the text opens up to around
> U+4C00, and then grinds to a halt.
> 
> I can open it in the HexWorkshop byte editor, or in the editor in Visual C
> 6, but these do not recognize UTF-8 encoding, and they hardly count as
> suitable readers for such a file.
> 

I've never managed to get either Notepad or Word to open Unihan.txt (or at least
I've never had the patience to wait for the operation to complete), and editing
very large files with Notepad is next to impossible as it rerenders the entire
file on every edit operation or window resizing operation.

As James mentioned, my BabelPad text editor for Windows will open and edit
Unihan.txt with no problem (tip - disable undo/redo functionality if you're
going to make global replacements) - it takes about 20 seconds to open on my
(rather old) machine. On the other hand, Visual Studio 7.1 opens Unihan
correctly (autodetecting as UTF-8) in less than 10 seconds, and has regular
expression find/replace functionality, which makes it quite powerful.
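Once you have a tool that can open the file, the format itself is trivial to handle: each data line is a codepoint, a tab, a field name, a tab, and a value. A minimal Python sketch of a streaming parser (so the file's size is not an issue; the sample values are illustrative, not quoted from the database):

```python
def parse_unihan(lines):
    """Yield (codepoint, field, value) from Unihan-style data lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):  # skip comments and blanks
            continue
        cp, field, value = line.split("\t", 2)
        yield int(cp[2:], 16), field, value   # "U+4E2D" -> 0x4E2D

sample = [
    "# Unihan database (comment line)",
    "U+4E2D\tkMandarin\tZHONG1 ZHONG4",     # illustrative values
    "U+20000\tkRSUnicode\t1.2",
]
records = list(parse_unihan(sample))
assert records[0] == (0x4E2D, "kMandarin", "ZHONG1 ZHONG4")
assert records[1] == (0x20000, "kRSUnicode", "1.2")
```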

Andrew



Re: French typographic thin space (was: Fixed Width Spaces)

2004-04-02 Thread Andrew C. West
Patrick Andries wrote:
>
>>Asmus Freytag <[EMAIL PROTECTED]> a écrit : 
>>
>> Have you folks noticed the addition of Narrow Non Break Space? 
>
>Yes, but I have not been able to find a font with a narrow enough glyph 
>(I just looked again at Code 2000). 
>
>Does anyone know of an appropriate font for French in this regard ? 
>

Code2000 and Doulos SIL both have NNBSP, which in both cases is narrower than
Space and NBSP, although Code2000's is the narrower of the two. Mind you, if I recall
correctly, in printed Mongolian texts the space that NNBSP is meant to represent
isn't appreciably thinner than an ordinary space, so maybe it shouldn't be too
narrow.

The problem with newly introduced characters such as NNBSP, is that although
they may in theory be just what you're looking for, you know that you can't
actually use them on web pages as all the standard fonts (Arial, Times New
Roman, or dare I say Comic Sans) will display a little rectangular box, which is
worse than nothing. And then because nobody uses it the major font vendors don't
see any point in adding support for the character ... vicious circle.

Andrew



Re: Unicode 4.0.1 Released

2004-04-02 Thread Andrew C. West
On Tue, 30 Mar 2004 15:49:53 -0800, Rick McGowan wrote:
> 
> Unicode 4.0.1 has been released!
> 
> The main new features in Unicode 4.0.1 are the following:
> 
> 1. The first significant update of the Unihan Database (Unihan.txt)
>   since Unicode 3.2.0, including a large number of fixes and
>   additional data items.
>

For me 4.0.1 was a big disappointment. The much vaunted update of the Unihan
database did not even clear up all the editorial errors in the database, let
alone deal with the real problems of content, such as incorrect or dubious
Mandarin, Cantonese, Korean and Japanese readings.

As to the 164 incorrect Vietnamese readings for basic CJKV ideographs, I notice
that although the correct readings for these characters have now been added to
the kVietnamese field, the original erroneous readings (agreed as such by the
relevant Vietnamese experts) have been retained as well; so that now each of the
164 characters in the CJK Unified Ideographs block with a kVietnamese key has a
spurious reading followed by the correct reading. Hardly much of an improvement.

For the record, the following is a list of easily fixed editorial errors
relating to the fields of interest to me that I submitted as part of the review
process, and which remain unfixed in the latest version of the Unihan database.
This means that I have to manually preprocess Unihan.txt to correct these errors
before I can put it through my parsing program -- which is a pain.
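That manual preprocessing step can at least be automated. A sketch in Python of the sort of patch-up I mean, using two of the kRSUnicode ranges listed below (the full table would simply extend the dictionary; the ranges and radical numbers come from my submitted list, not from any official source):

```python
# Ranges whose kRSUnicode values are missing the simplified-radical
# apostrophe; extend this table with the rest of the list below.
MISSING_APOSTROPHE = {
    (0x4336, 0x4341): "120",
    (0x4723, 0x4729): "149",
}

def fix_rsunicode(codepoint, value):
    """Insert the missing apostrophe after the radical number if needed."""
    for (lo, hi), radical in MISSING_APOSTROPHE.items():
        if lo <= codepoint <= hi and value.startswith(radical + "."):
            return radical + "'" + value[len(radical):]
    return value

assert fix_rsunicode(0x4336, "120.4") == "120'.4"   # patched
assert fix_rsunicode(0x4E00, "1.0") == "1.0"        # untouched elsewhere
```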

1. kRSUnicode Field

A. Missing Simplified Radical Marker
Simplified radicals are indicated by an apostrophe after the radical number in
the basic CJK block, but not in CJK-A or CJK-B.
U+4336..4341  Radical 120 should be 120'
U+4723..4729  Radical 149 should be 149'
U+478C..4790  Radical 154 should be 154'
U+4880..4882  Radical 159 should be 159'
U+497A..4986  Radical 167 should be 167'
U+49B6..49B8  Radical 169 should be 169'
U+4B6A        Radical 184 should be 184'
U+4BC3..4BC5  Radical 187 should be 187'
U+4C9D..4CA4  Radical 195 should be 195'
U+4D13..4D19  Radical 196 should be 196'
U+4DAD..4DAE  Radical 212 should be 212'
U+8D5C        154.11 should be 154'.11
U+8D5D        154.12 should be 154'.12
U+8F89        159.8 should be 159'.8
U+987C        181.4 should be 181'.4
U+9E6B        196.12 should be 196'.12
U+9EA6        199.0 should be 199'.0
U+9EFE        205.0 should be 205'.0
U+9F7F        211.0 should be 211'.0
U+26208..26221  Radical 120 should be 120'
U+27BAA         Radical 149 should be 149'
U+27E51..27E57  Radical 154 should be 154'
U+28405..2840A  Radical 159 should be 159'
U+28C3E..28C56  Radical 167 should be 167'
U+28DFF..28E0E  Radical 169 should be 169'
U+293FC..29400  Radical 178 should be 178'
U+29595..29597  Radical 181 should be 181'
U+29665..29670  Radical 182 should be 182'
U+297FE..2980F  Radical 184 should be 184'
U+299E6..29A10  Radical 187 should be 187'
U+29F79..29F8E  Radical 195 should be 195'
U+2A241..2A255  Radical 196 should be 196'
U+2A388..2A390  Radical 199 should be 199'
U+2A68F..2A690  Radical 211 should be 211'
U+2FA18         Radical 205 should be 205'

2. kMandarin Field

A. Invalid Application of U-UMLAUT
LÜN and LÜAN are invalid Mandarin pinyin spellings.
U+6523  kMandarin   LÜAN2 LUAN2 should be LUAN2
U+7674  kMandarin   LÜAN2 should be LUAN2
U+7D6F  kMandarin   LÜN4 GAI1 should be LUN4 GAI1

B. Invalid Non-Application of U-UMLAUT
LUE and NUE should always be written with an umlaut.
U+3A3C  kMandarin   LUE4 should be LÜE4
U+4588  kMandarin   NUE4 should be NÜE4
U+458B  kMandarin   NUE4 should be NÜE4
U+63A0  kMandarin   LÜE4 LUE3 should be LÜE4 LÜE3
U+7878  kMandarin   NUE4 should be NÜE4

C. Other Invalid Pinyin Spellings
IE, LIONG, YIAN, IONG and YIAO are invalid Mandarin pinyin spellings.
U+34C8  kMandarin   BEI4 BING4 FEI4 IE4 should be BEI4 BING4 FEI4 YE4 ?
U+3D88  kMandarin   LIONG3 YING2 should be YONG3 YING2 ?
U+66D5  kMandarin   YIAN4   should be YAN4
U+7867  kMandarin   IONG3 should be YONG3
U+9D01  kMandarin   YIAO1   should be YAO1
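The specific misspellings named in sections A-C could be flagged mechanically. This sketch encodes only the syllables stated above, not a complete pinyin validator:

```python
import re

# Sketch: flag the invalid pinyin spellings described in sections A-C
# above (LÜN/LÜAN, bare LUE/NUE, and IE/LIONG/YIAN/IONG/YIAO).
BAD_SYLLABLES = re.compile(
    r"^(LÜN|LÜAN|LUE|NUE|IE|LIONG|YIAN|IONG|YIAO)[1-5]$"
)

def invalid_readings(kmandarin_value):
    """Return the syllables in a kMandarin field that break rules A-C."""
    return [s for s in kmandarin_value.split() if BAD_SYLLABLES.match(s)]

print(invalid_readings("LÜAN2 LUAN2"))  # ['LÜAN2']
print(invalid_readings("NUE4"))         # ['NUE4']
print(invalid_readings("LÜE4 LUN4"))    # []
```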

D. Duplicate Readings
Whilst there may be a historical reason why some characters in the Unihan
database originally had multiple duplicate Mandarin readings, there is no reason
why they should still be there in the latest version.
U+3561  kMandarin   HE2 HE2 HE4 HE4 HUO4
U+3563  kMandarin   YAN3 YAN4 YAN4
U+356A  kMandarin   DAN3 DAN3
U+3613  kMandarin   LAN2 LAN2
U+363A  kMandarin   FA2 FA2
U+369C  kMandarin   XU4 XU4 YU4
U+38BD  kMandarin   ER3 ER3
U+39A9  kMandarin   YIN3 YIN3
U+39E5  kMandarin   XIAN3 XIAN3
U+3C34  kMandarin   PO2 POU3 POU3
U+3C76  kMandarin   BENG4 JIAO4 PENG2 PENG2 QIAO3 RU4
U+3C78  kMandarin   BI4 BI4 BIE2
U+3E2E  kMandarin   FEN2 FEN2
U+3F18  kMandarin   WA3 WA3
U+3FC5  kMandarin   XIAN3 XUAN3
U+400E  kMandarin  
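Removing such duplicates is mechanical; a sketch that normalizes a kMandarin value while preserving the order of first occurrence:

```python
# Sketch: drop duplicate readings from a kMandarin value, keeping the
# first occurrence of each, as section D above suggests.
def dedupe_readings(value):
    seen, out = set(), []
    for reading in value.split():
        if reading not in seen:
            seen.add(reading)
            out.append(reading)
    return " ".join(out)

print(dedupe_readings("HE2 HE2 HE4 HE4 HUO4"))  # HE2 HE4 HUO4
print(dedupe_readings("YAN3 YAN4 YAN4"))        # YAN3 YAN4
```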

Re: French typographic thin space (was: Fixed Width Spaces)

2004-04-01 Thread Andrew C. West
On Thu, 1 Apr 2004 18:37:49 +0200, "Antoine Leca" wrote:
> 
> On Thursday, April 01, 2004 12:37 AM
> Asmus Freytag <[EMAIL PROTECTED]> va escriure:
> 
> > Have you folks noticed the addition of Narrow Non Break Space?
> 
> Is it intended (in part) for French typography?
> 

No, it was introduced for Mongolian; but there's no reason why it can't be used
for French or any other language/script.

Andrew



[OT] BabelMap in French

2004-04-01 Thread Andrew C. West
Some of you may be interested to know that a French version of BabelMap (now
supporting Unicode 4.0.1) is available from :

http://uk.geocities.com/BabelStone1357/Software/BabelMap_fr.html

In this version all Unicode data (character names, block/plane names, UCD
properties, character notes/synonyms/cross-references, standardized variant
descriptions, ISO/IEC 10646 notes, etc.) is in French, based on the data
available at http://pages.infinit.net/hapax/ "Unicode et ISO 10646 en français".

Many thanks to Patrick Andries for his help in translating the BabelMap UI.

Andrew



Re: vertical direction control

2004-03-25 Thread Andrew C. West
On Thu, 25 Mar 2004 03:36:29 -0800, Peter Kirk wrote:
> 
> What about a cell phone or PDA for use in China. Some users may prefer 
> vertical display of text, but then the system needs to know what to do 
> with Latin etc text embedded in the Chinese. Isn't that a credible 
> scenario? Or are the Chinese to be forced to read their language 
> horizontally on all electronic devices?

The whole point is that you cannot sensibly mix vertically formatted text and
horizontally formatted text in the same line (except for the case of very short
words). When Mongolian text is embedded in Latin, Cyrillic or horizontal Chinese
it is rotated 90 degrees so that it reads LTR instead of vertically; and when
Latin or Cyrillic text is embedded in vertical Mongolian or vertical Chinese
text the alphabetic script is normally rotated 90 degrees so that it reads TTB
instead of LTR. There is absolutely no need for directional formatting controls
in these situations.

The only potential need for vertical formatting controls would be if you were
embedding a chunk of Bottom-To-Top text within text that was oriented
Top-To-Bottom (or vice versa), but that scenario is extremely unlikely, and even
if it did occur you could simply use the horizontal formatting controls (LRO,
RLO, PDF, etc.) - treating TTB as LTR and BTT as RTL - and use a higher level
protocol to rotate the whole thing 90 degrees.

Andrew



Re: OT? Languages with letters that always take diacriticals

2004-03-16 Thread Andrew C. West
Curtis Clark wrote:
>
>Are there any languages that use letters with diacriticals, 
>but *never* use the base letter without diacriticals?

The most common system of academic transcription for Mongolian uses LATIN SMALL
LETTER C WITH CARON [U+010D] and LATIN SMALL LETTER J WITH CARON [U+01F0] for
U+1834 and U+1835 respectively, but "c" and "j" are not used (although "c" is
sometimes used for U+183C instead of "ts").

Andrew



Re: LATIN SMALL LIGATURE CT

2004-03-02 Thread Andrew C. West
On Mon, 01 Mar 2004 20:02:45 -0800, "D. Starner" wrote:
> 
> Most importantly, you don't need
> to wander all over the PUA - with modern typesetting systems and good fonts,
> you just place a ct there and the software automatically ligatures it for you.
> You can use a ZWJ to ask for a ligature and ZWNJ to make sure there isn't one.

If you're using Windows, with the latest versions of Uniscribe and Code2000,
typing  into Notepad will produce a very pleasing ct ligature
(likewise for st, ff, fl and fi ligatures). As several people have stated, there
is no need to either encode the ct ligature or resort to the PUA. (PUA Bad !
Smart fonts and rendering systems Good !)

Andrew



Re: Codes for Individual Chinese Brushstrokes

2004-02-20 Thread Andrew C. West
On Thu, 19 Feb 2004 18:27:09 -0800 (PST), Kenneth Whistler wrote:
> 
> Of the 64 entities listed on the page:
> 
> http://www.chinavoc.com/arts/calligraphy/eightstroke.asp
> 
> *none* of them are encoded, and *none* of them are "standard"
> enough to merit consideration -- if by consideration you mean
> separate encoding as characters.
> 

I'm not sure about "*none* of them are encoded". As far as I can tell, pretty
much most of the basic ideographic stroke forms are either already encoded in
CJK and CJK-B or are proposed in CJK-C (where "encoded" here means "encoded in
their own right" or "can be represented by same-shaped ideographs").

See for example the IRG document

which states :


Although most ideographic strokes have been encoded in CJK (including Ext.A and
B) or submitted to CJK_C1 by IRG members, there are two ideographic strokes are
found missing. Ideographic strokes are important for ideograph decomposition,
analysis and for making ideographic strokes subset. Chinese linguists suggest to
add these two ideographic strokes to CJK_C1.


I also remember reading one WG2 document that explicitly raised the question of
how to deal with all the ideographic strokes proposed in CJK-C that are not
distinct ideographs in their own right, although I can't seem to locate that
document any more.

All except one of the eight basic strokes mentioned at
 are *representable*
using existing characters in the CJK and/or Kangxi Radicals blocks :

dot = U+4E36 or U+2F02 [KANGXI RADICAL DOT]
dash = U+4E00 or U+2F00 [KANGXI RADICAL ONE]
perpendicular downstroke = U+4E28 or U+2F01 [KANGXI RADICAL LINE]
downstroke to the left or left-falling stroke = U+4E3F or U+2F03 [KANGXI RADICAL
SLASH]
wavelike stroke or right-falling stroke = U+4E40
hook = U+4E85 or U+2F05 [KANGXI RADICAL HOOK], as well as U+4E5A and U+2010C
upstroke to the right = 
bend or twist = U+4E5B and U+200CC

I concur with Ken that the 8x8 stroke categorization given at this web site is
largely artificial. Whilst it may be useful to encode general ideographic stroke
forms to help in the analysis and decomposition of ideographs, in my opinion the
minute distinctions in the way that dots and dashes are written in various
individual ideographs are beyond the scope of a character encoding system as the
exact shape of a dot or length of a dash is irrelevant to any analysis of the
compositional structure of an ideograph.

Andrew



Re: interesting SIL-document

2004-02-04 Thread Andrew C. West
On Wed, 4 Feb 2004 11:12:41 +, Michael Everson wrote:
> 
> At 02:50 -0800 2004-02-04, Peter Kirk wrote:
> 
> >As for Birmingham, I like the idea of analysing 
> >it as a monosyllable [b?m©Øm] although I would 
> >tend to think of the eng and the second m as 
> >syllabic, but there is then a near minimal pair 
> >with the interjection [mhm] meaning "no".
> 
> [mhm] is a positive for me. It is [m?m] which is negative.

Hmm, my 2 1/2 year old daughter is going through a stage of saying "mm mm"
(rising tone on first syllable) for "yes" and "mm mm" (falling tone on first
syllable) for "no", which is a very subtle distinction, especially as in
Mandarin Chinese "m" with rising tone is an interrogative grunt and "m" with
falling tone is an affirmative grunt.

Andrew



Re: interesting SIL-document

2004-02-04 Thread Andrew C. West
On Tue, 03 Feb 2004 10:53:40 -0800, Peter Kirk wrote:
> 
> There are minimal pairs at the 
> syllable level between the British pronounciation of Birmingham (silent 
> h, stress on first syllable only) and many similar -ingham names, and 
> (rarer) place names like Odiham (Hampshire) - although I suspect the h 
> tends to be silent in the latter.

Pronounced "odium" locally. Offhand I can't think of any English placenames with
a -ham suffix that don't have a silent "h" (Farnham, Fareham, Wokingham ...),
although "h" is generally pronounced in other common placename suffixes such as
-hampton and -hurst.

Andrew



Re: Chinese FVS? (was: RE: Cuneiform Free Variation Selectors)

2004-01-22 Thread Andrew C. West
On Wed, 21 Jan 2004 11:13:33 -0700, John Jenkins wrote:
> 
> Granted, epigraphy is tough on plain text.  As Unicode starts to deal 
> with dead scripts, we have to deal with the issues it raises.  
> Variation selectors are one way of doing it.
> 

Yes, but I'm delighted to see from document N2684 "Draft Agreement on Old Hanzi
Encoding" that variation selectors are not the method proposed for dealing with
archaic forms of the Han script. I think that encoding the Oracle Bone, Bronze
Inscription and Small Seal pre-Han scripts separately from the modern Han script
is definitely the right thing to do, although as glyph variation is an even
bigger problem for the ancient unstandardised scripts than for the modern
script, I wonder whether variation selectors might not play a role in the end
anyway.

As I'm currently working on a proposal for the deceased Jurchen script, which
also has a problem with glyph variation (about a third of the 1,355 entries in
the most recent Jurchen dictionary are simple glyph variants, many almost
indistinguishable from one another), maybe someone on the UTC could give me some
advice ? Should I :

A. Stick to a strict character encoding model, and ignore glyph variants that
have no semantic distinctions (as I did for Phags-pa).
B. Indiscriminately code every glyph form that has ever been seen, on the basis
that glyph variants are given in a respected dictionary.
C. Propose distinct characters, but append a long list of proposed standardised
variants to cover the simple glyph variants (some missing a dot here or adding a
stroke there, some written in a more cursive manner, and some just differently
proportioned).

Andrew



Re: Mongolian Unicoding (was Re: Cuneiform Free Variation Selectors)

2004-01-21 Thread Andrew C. West
On Tue, 20 Jan 2004 16:33:24 -0500, [EMAIL PROTECTED] wrote:
> 
> Andrew C. West scripsit:
> 
> > These are glyph variants of Phags-pa letters that are used with semantic
> > distinctiveness in a single (but very important) text, _Menggu Ziyun_, a 14th
> > century rhyming dictionary of Chinese in which Chinese ideographs are listed by
> > their Phags-pa spellings. In this one text only, variant forms of the letters
> > FA, SHA, HA and YA are used contrastively in order to represent historical
> > phonetic differences between Chinese syllables that were pronounced the same in
> > early 14th century standard Chinese (Old Mandarin). 
> 
> In short, these are like the diacritics used in some English-language
> dictionaries
> to mark up English words to show how the vowels are pronounced, except they are
> "abstract diacritics" rather than shape-based ones.
> 

Yes,  I think that I could agree with that. So in a way the FVS acts as an
invisible diacritic mark that affects the shape of the character to which it is
applied. In some contexts you would want to keep the diacritic mark, and in
other contexts you may want to strip it off or ignore it (e.g. for text
matching). Thus if I were to make a transcription of the 14th century _Menggu
Ziyun_ manuscript I would want to preserve the glyph variants for the letters
FA, SHA, HA and YA, but if I were then to use this data to create a dictionary
of Phags-pa spellings for Chinese ideographs found in actual Yuan dynasty
manuscripts and monumental inscriptions I would want to strip off the FVSs, just
preserving the "base spelling" of the ideographs.
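The "strip off the FVSs" step amounts to filtering out a few code points. A sketch, assuming the proposed Phags-pa code point U+A85B and the generic variation selectors (U+FE00..FE0F, supplementary U+E0100..E01EF) plus the Mongolian FVSs (U+180B..180D):

```python
# Sketch: remove variation selectors from a string so that variant and
# base spellings compare equal, as described above.
VS_RANGES = ((0xFE00, 0xFE0F), (0xE0100, 0xE01EF), (0x180B, 0x180D))

def strip_vs(text):
    return "".join(
        ch for ch in text
        if not any(lo <= ord(ch) <= hi for lo, hi in VS_RANGES)
    )

# Proposed PHAGS-PA LETTER YA (U+A85B) + VARIATION SELECTOR-1 (U+FE00)
variant = "\uA85B\uFE00"
print(strip_vs(variant) == "\uA85B")  # True
```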

You might be interested to know that the variant nature of the Phags-pa letters
FA, SHA, HA and YA is mirrored in the transliteration system for _Menggu Ziyun_
devised by Professor Junast (China's foremost authority on the Phags-pa script),
which uses an ordinary letter "j" for the normal Phags-pa letter YA, but a form
of the letter "j" with a long tail for the variant form of the letter YA; and an
ordinary letter "h" for the normal Phags-pa letter HA but a crossed "h" [U+0127]
for the variant form of the letter HA. (He uses U+0161 with subscript 1 and 2 to
differentiate the normal and variant forms of the Phags-pa letter SHA).

Andrew



Re: Chinese FVS? (was: RE: Cuneiform Free Variation Selectors)

2004-01-21 Thread Andrew C. West
On Tue, 20 Jan 2004 10:32:06 -0700, John Jenkins wrote:
> 
> 1)  U+9CE6 is a traditional Chinese character (a kind of swallow) 
> without a SC counterpart encoded.  However, applying the usual rules 
> for simplifications, it would be easy to derive a simplified form which 
> one could conceivably see in a book printed in the PRC.  Rather than 
> encode the simplified form, the UTC would prefer to represent the SC 
> form using U+9CE6 + a variation selector.
> 

If a simplified form of a given CJK ideograph is used, then it deserves encoding
properly. There are newly-coined simplified forms in CJK-B and CJK-C, so why not
add newly used simplified forms to CJK-C or whereever if they are really needed
? To borrow Michael's term, this use of variation selectors is simply
pseudo-coding.

If a Chinese publishing house were going to print a book in simplified
characters that included a simplified form of U+9CE6, would they go the lengths
of applying to Unicode to define an appropriate standardised variant for U+9CE6,
and then trying to create a font that implemented variation selectors ? Or would
they simply use a font that mapped a simplified glyph form to U+9CE6 (or the
PUA) ? If it is so important to formally define the existence of a simplified
form of an existing character, then why not encode it properly ??

> 2) Your best friend has the last name of "turtle," but he doesn't use 
> any of the encoded forms for the turtle character to represent it.  He 
> insists on writing it in yet another way and wants to be able to 
> include his name as he writes it in the source code he edits.  The UTC 
> ends up accommodating him using U+2A6C9 (which is the closest turtle to 
> his last name) + a variation selector.

1. Unicode Design Principle 3 : "The Unicode Standard encodes characters, not
glyphs."
This is simple glyph variant. I insist on writing the "A" in my name with two
cross-bars. Will the UTC kindly accommodate me by providing an appropriate
standardised variant for U+0041 ? (In fact, come to think of it I have
idiosyncratic ways of writing all of the letters in my name ...)

The plain fact of the matter is that the *character* turtle is already encoded,
and if someone wants to use a different glyph form for this character then he or
she should design their own font with the appropriate glyph mapped to U+9F9C or
U+9F9F.

2. Unicode does not encode private-use characters.
I can't find chapter and verse for it, but I was always under the impression
that Unicode did not encode private-use characters.

> 3)  You're editing a critical edition of an ancient MS, and you find 
> that your author, who talks a lot about handkerchiefs, uses U+5E28 
> quite a bit, but varies between the "ears-in" form and the "ears-out" 
> form almost at random.  Rather than lose the distinction which *may* be 
> meaningful, you (with the UTC's blessing) use U+5E28 for the ears-in 
> form (as Unicode uses) and U+5E28 + a variation selector for the 
> ears-out form.

This example actually opens up the biggest can of worms.

As someone who has a passion for transcribing ancient manuscripts, in Chinese
and other scripts, I fully appreciate the desire to be able to represent every
little idiosyncrasy of a manuscript or inscription in plain text Unicode. But
the simple fact of the matter is that you can't. My apologies for repeating
myself, but Unicode Design Principle 3 states that "The Unicode Standard encodes
characters, not glyphs." (and Section 2.2 of TUS elaborates on this statement).

Unless Unicode becomes a Glyph Encoding Standard instead of a Character Encoding
Standard, then how on earth can the UTC allow VSs to be used for simple glyph
variants ? And if it's OK for CJK ideographs, then why not for every other
Unicoded script ?

Glyph variations are of paramount interest to textual scholars and epigraphers
of all scripts, not just Chinese. To take a random example from the Celtic
Inscribed Stones Project (CISP), this is a palaeographic description of a cross
slab at Kirk Maughold in the Isle of Man, inscribed [--]I IN CHRISTI NOMINE
CRUCIS CHRISTI IMAGENEM :

Kermode/1907, 112: `we have here the diamond-shaped O, the N like an H, and the
M like a double H, all characteristics of the Hiberno-Saxon manuscripts and
sculptured stones of the period. Other characteristic forms are the
square-shaped C and the peculiar G, the like of which I have not seen elsewhere.
But some of the letters are minuscules, as p, d, b, r, and a; while in the
contraction for CHRISTI, in each case the R differs from the ordinary small R in
CRUCIS, representing, in fact, the Greek Rho!'. 

[http://www.ucl.ac.uk/archaeology/cisp/database/stone/maugh_4.html]

If we go down the road of encoding epigraphic and palaeographic glyph variants
for CJK and other scripts I'm afraid that we'll soon find that 256 Variation
Selectors just isn't enough.

Andrew



Re: Mongolian Unicoding (was Re: Cuneiform Free Variation Selectors)

2004-01-20 Thread Andrew C. West
On Tue, 20 Jan 2004 00:36:54 -0800, Asmus Freytag wrote:
> 
> Currently, Variation Selectors work only one way. You could 'force' one 
> particular
> shape. Leaving the VS off, gives you no restriction, leaving the software free
> to give you either shape. W/o defining the use of two VSs you cannot 'force'
> the 'regular' shape.

Yes, I had forgotten this. Although in practice I would imagine that only the
most perverse font would use an unexpected glyph variant as the standard glyph
for a character. To go back to my simplistic example of the long s (which I hope
no-one is taking too seriously), I think that the user would be justified in
expecting an ordinary short s to be displayed for U+0073 in isolation, and I
doubt that many fonts would map a long s glyph directly to U+0073. Thus although
you cannot force the "regular" glyph shape you can force the font's default
glyph shape by the omission of a VS, and in most fonts the default glyph would
be the same as the "regular" Unicode code chart glyph.

> Also, the way most VSs are defined, their use does not depend
> on context the same way as the example suggests.
>

Absolutely. My understanding is that the Mongolian Free Variation Selectors (and
the hypothetical long-s FVS) function quite differently from the ordinary
variation selectors currently used for mathematical symbols, and proposed for
Phags-pa, and apparently coming soon for Han ideographs. In the case of
Mongolian the rendering system can determine the expected glyph form based on a
set of deterministic rules, and so an FVS needs only be applied when the rules
need to be broken. On the other hand, there are no rules that allow the
rendering system to know which particular Standardised Variant glyph form to use
for an unmarked Unicode character in a particular context, and the VS must be
applied manually by the user or IME.

My understanding of under what circumstances standard variation selectors are a
good idea is typified by the four proposed Phags-pa standardised variants :

A85B FE00 -- PHAGS-PA LETTER YA with rounded appearance
A860 FE00 -- PHAGS-PA LETTER HA without tail kink
A864 FE00 -- PHAGS-PA LETTER FA with tail kink
A85E FE00 -- PHAGS-PA LETTER SHA with sloping stroke

These are glyph variants of Phags-pa letters that are used with semantic
distinctiveness in a single (but very important) text, _Menggu Ziyun_ , a 14th
century rhyming dictionary of Chinese in which Chinese ideographs are listed by
their Phags-pa spellings. In this one text only, variant forms of the letters
FA, SHA, HA and YA are used contrastively in order to represent historical
phonetic differences between Chinese syllables that were pronounced the same in
early 14th century standard Chinese (Old Mandarin). For example :

A. The ideographs SHU 書 [U+66F8] and SHU 殊 [U+6B8A] were pronounced the same
in Old Mandarin, but were historically distinct (in the Chinese of the Tang
dynasty), the former with a reconstructed [U+0255] initial, the latter with a
reconstructed [U+0291] initial. In _Menggu Ziyun_ the former SHU is spelled
sheeu and the latter SHU spelled sh'eeu (where sh' is a glyph variant of sh).

B. The ideographs YIN 因 [U+56E0] and YIN 寅 [U+5BC5] were pronounced the same
in Old Mandarin (other than tone which is not represented in Phags-pa spelling
of Chinese), but were historically distinct, the former with a reconstructed
null initial, the latter with a reconstructed [j] initial. In _Menggu Ziyun_ the
former YIN is spelled yin and the latter YIN spelled y'in (where y' is a glyph
variant of y).

C. The ideographs XIAN 險 [U+96AA] and XIAN 嫌 [U+5ACC] were pronounced the same
in Old Mandarin (other than tone), but were historically distinct, the former
with a reconstructed [x] initial, the latter with a reconstructed [U+0263]
initial. In _Menggu Ziyun_ the former XIAN is spelled hyem and the latter XIAN
spelled h'yem (where h' is a glyph variant of h).

D. The ideographs FANG 方 [U+65B9] and FANG 房 [U+623F] were pronounced the same
in Old Mandarin (other than tone), but were historically distinct, the former
with a reconstructed [p] initial, the latter with a reconstructed [b] initial.
In _Menggu Ziyun_ the former FANG is spelled fang and the latter FANG spelled
f'ang (where f' is a glyph variant of f).

However, in actual Phags-pa manuscript/printed texts and epigraphic inscriptions
there is no distinction between pairs of ideographs such as these, and the same
glyph form is used for all occurrences of the letters FA, SHA, HA and YA
respectively.

Thus the Phags-pa letters FA, SHA, HA and YA represent "f", "sh", "h" and "y"
however they are written, but in one certain textual context glyph distinction
is used to carry additional historic phonetic information that you may or may
not want to preserve in electronic texts.

As Asmus says, "A VS approach is potentially indicated when its necessary to
manually select non-deterministic variants (or to override deterministic ones)
and at the same time it's desire

Re: Mongolian Unicoding (was Re: Cuneiform Free Variation Selectors)

2004-01-19 Thread Andrew C. West
On Mon, 19 Jan 2004 05:23:31 +, [EMAIL PROTECTED] wrote:
> 
> Dean Snyder wrote,
> 
> > Tom Gewecke wrote at 2:26 PM on Sunday, January 18, 2004:
> > ... 
> > >
> > >Agreed.  I can't imagine that anyone who has ever tried to actually do
> > >anything with Unicode Mongolian would recommend variation selectors as an
> > >encoding technique, unless perhaps they wanted to make sure the encoding
> > >was never implemented.
> > 
> > Could you please elaborate? Has this modle not been implemented? Either
> > via Unicode or otherwise?
> > 
> 
> Here's how it works:  there are three factions involved.  The OS and
> rendering-engine developers, the editor/processor/input developers, 
> and the font developers.  Each faction considers that the fancy stuff 
> needed for Mongolian rendering should properly be handled through
> a combination provided by the other two factions.
> 

An analogy for those not familiar with the Mongolian script is the much beloved
long s, which is a positional glyph variant of the ordinary letter s for some
languages at some periods of time. The long s does not need to be encoded as a
separate character as there are well-known rules for when an s should be written
long and when it should be written short (although these rules may vary from
locale to locale and from time to time). If, for example, the rule for a given
locale is short s finally and medially after another s, and long s initially and
medially except after another s, then the user could type in a word using the
ordinary letter s throughout, and the rendering system would select the long or
short s glyph as appropriate depending on its position within the word. But say
that the user wanted to go against the rendering rules, and write a long s in a
position that is normally rendered as a short s, or if he wanted to refer to the
long s in isolation, then this is where an FVS would come in. The FVS could be
applied to the letter s to override its normal glyph shape, and force a long s
even where the rules state that it should be a short s (and vice versa for short
s).
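The hypothetical locale's rule just described (long s initially and medially except after another s, short s finally and after another s) can be written down directly. This sketch is an illustration of the stated rule only, not any real rendering engine:

```python
# Sketch of the hypothetical long-s rendering rule described above:
# within a word, "s" is rendered as LATIN SMALL LETTER LONG S (U+017F)
# except word-finally or immediately after another "s".
LONG_S = "\u017F"

def render_long_s(word):
    out = []
    for i, ch in enumerate(word):
        # long s: not the last letter, and not preceded by another s
        if ch == "s" and i < len(word) - 1 and (i == 0 or word[i - 1] != "s"):
            out.append(LONG_S)
        else:
            out.append(ch)
    return "".join(out)

print(render_long_s("success"))  # 'ſucceſs' under the stated rule
```

Note that an isolated "s" comes out short under these rules, which is exactly the case where the user would reach for an FVS to force the long form.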

Now the Latin alphabet only has this one example (as far as I know) of a letter
that has positional or contextual variant forms, and so it is simpler to just
encode the long s separately. However, almost every letter in Mongolian and its
related scripts has at least two positional and/or contextual forms, and some
letters have up to four or five glyph forms. Encoding all the various glyph
forms of each letter separately would be an unnecessary burden on the user, who
would have to manually select the correct glyph form for each letter even though
they are conceived of as the same letter. It is far simpler (for the end-user at
least) to let the rendering engine apply a set of rules to determine which glyph
form is required in which position (isolate, initial, medial or final) or in
which context (e.g. in "feminine" or "masculine" words). As Asmus pointed out
the Mongolian FVSs would normally only be needed to override the rules, for
example to display a particular glyph form in isolation (e.g. in metalanguage 
descriptions
of the Mongolian script), or to write foreign words (which in Mongolian
typically use unexpected glyph forms for certain letters); and so in normal
running text with no foreign words the user would rarely need to use an FVS (and
with a good IME the user probably wouldn't even need to know of their existence).

The reason why everybody who has had anything to do with Mongolian encoding
(including myself) shudder in fear at the mere mention of "Free Variation
Selector" is not that they are a bad thing per se -- their use in Mongolian is
very fit and proper -- but that the rules for selecting the appropriate
positional or contextual form in Unicode have never been clearly formalised, and
without the rules it's impossible to know how to correctly render running text
or to know which FVS to apply to a given letter to override the rules. Once the
rules have been established (hopefully soon), and incorporated into the fonts,
rendering engines and IMEs, then everything should work like a well-oiled
machine.

Knowing nothing about Cuneiform, I can't say whether FVSs are a suitable option
for Cuneiform or not, but if Dean is thinking about using FVSs like ordinary
Variation Selectors (i.e. applied manually by the user to select a distinct
character), then I agree with Michael that this is "pseudo-coding" and probably
not appropriate.

Andrew



Re: U+0185 in Zhuang and Azeri (was Re: unicode Digest V4 #3)

2004-01-15 Thread Andrew C. West
On Wed, 14 Jan 2004 10:44:18 -0800, Peter Kirk wrote:
> 
> I received the following reply from a Zhuang researcher, which agrees 
> with what Andrew has written:
...
> 
> > There are two other orthographies in use in Zhuang. Most important, 
> > there is an ancient Zhuang square-character script that has never been 
> > standardized. If it ever is, maybe we can get a unicode font for it. 
> > Until then, I wouldn't worry about it much. Second, sometimes, very 
> > informally, people will use Chinese characters to write Zhuang. This 
> > happens rarely.
> 
> On this last paragraph, I note that this ancient Zhuang script has not 
> even been roadmapped. From a brief Internet search, I found that it 
> consists of about 10,000 characters. Or is it roadmapped under another 
> name? Or is it unified with CJK - despite the researcher clearly 
> distinguishing it from Chinese characters?

A dictionary of traditional Zhuang usage ideographs was published in 1989 as _Gu
Zhuangzi Zidian_ 古壮字字典 (Guangxi Minzu Chubanshe 广西民族出版社, 1989), but
I haven't been able to get hold of a
copy yet. It should provide plenty of material for a proposal, although at least
some Zhuang usage ideographs are already coincidentally encoded in CJK-B or in
the pipeline for CJK-C (e.g. U+28499 = Zhuang nak "heavy").

Zhuang ideographs are either Han ideographs borrowed for their pronunciation or
for their meaning, or modified Han ideographs in the manner of Vietnamese nom
characters. Therefore they do not need to be separately roadmapped, but belong
in the CJK Unified Ideographs blocks ... hmm perhaps CJKVZ Unified Ideographs is
more appropriate ?

For example, a Zhuang word might be written with a Han ideograph with the same
or similar pronunciation in the local Chinese dialect (generally either the
Liuzhou dialect of Mandarin Chinese in Northern Guangxi or the Guangxi dialect
of Cantonese in Southern Guangxi). Thus the Zhuang word kau1 meaning "I" might
be written using the Han ideograph 古 GU3 [U+53E4].

On the other hand, a new ideograph might be created from two or more existing
Han ideographs or ideographic components to represent a Zhuang word. Thus, the
Zhuang word na2 meaning "paddy field" might be written with a constructed
ideograph written with the Han ideograph 那 NA4 [U+90A3] above the Han ideograph
田 TIAN2 [U+7530] "field" (U+90A3 giving the pronunciation, and U+7530 giving
the meaning).

A new ideograph might also be created by decomposing or altering the form of an
existing Han ideograph. For example, the Zhuang word for side might be written
using one half of the Han ideograph 門 MEN2 [U+9580] "gate".

As Peter's correspondent mentions, this system of writing was never
standardised, so the actual Zhuang ideographs used for any given Zhuang word may
vary from place to place, and from manuscript to manuscript.

Incidentally the Zhuang ideographs were never widely used as a means of
communication, but were mostly used for writing down traditional Zhuang texts,
such as folk songs. To be educated meant (and still means) to be able to read,
write and speak Chinese, and so educated Zhuang would simply use Chinese for
personal correspondence. On the other hand, uneducated Zhuang would not be able
to read or write the Zhuang usage ideographs anyway.

Andrew



Re: Bhutanese marks

2004-01-06 Thread Andrew C. West
On Tue, 6 Jan 2004 05:29:51 -, "C J Fynn" wrote:
> 
> U+0F09 which was erroneously named BKA-SHOG YIG MGO (should have been ZHU-YIG
> GO RGYAN), is used for writing respectfully to a senior particularly when
> requesting something. e.g when writing to a government officer or minister
> requesting a licence or permit, when petitioning the court and so on.

And to make matters worse BKA- SHOG YIG MGO [U+0F0A] is actually the term for
the "starting flourish for giving a command", and should thus be the name of the
character proposed at U+0FD0, except that U+0FD0 now has to make do with a
subtly different name as its proper name has already been appropriated by the
character with the opposite meaning ... definitely one for the Unicode Book of
Bloopers !

Incidentally, shouldn't the proposed name for U+0FD0 be "TIBETAN MARK BKA- SHOG
GI MGO RGYAN" (BKA- + space) not "TIBETAN MARK BSKA-SHOG GI MGO RGYAN" (BSKA- +
no space) ?

Andrew



Re: U+0185 in Zhuang and Azeri (was Re: unicode Digest V4 #3)

2004-01-06 Thread Andrew C. West
On Mon, 5 Jan 2004 17:37:30 -0800 (PST), Kenneth Whistler wrote:
> 
> Perhaps someone on the list who knows more about the actual
> history of orthographic reform in the Zhuang Autonomous Region
> of Guangxi could chime in with more details.
> 

Well, I'm not really that knowledgeable about Zhuang, but my father-in-law
is a native Zhuang speaker, and I've visited "Zhuangland" many times, so I have
absorbed some knowledge on the subject by osmosis.

The original Zhuang alphabet devised in 1955 and officially promulgated in 1957
used an unwieldy mixture of Latin and Cyrillic letters together with the special
tone letters encoded at U+01A7/01A8 [tone 2], U+01BC/01BD [tone 5] and
U+0184/0185 [tone 6]. However, in 1982 the Zhuang alphabet was amended to use
basic Latin letters only, so that the old tone letters are now written as "z",
"j", "x", "q" and "h" :

Tone 1 [Low Rising] : [Tone 1 is implicit, marked by the absence of a tone
letter]
Tone 2 [Low Falling] : U+01A7/U+01A8 = z
Tone 3 [High Level] : U+0417/U+0437 = j
Tone 4 [High Falling] : U+0427/U+0447 = x
Tone 5 [Mid Rising] : U+01BC/U+01BD = q
Tone 6 [Mid Level] : U+0184/U+0185 = h
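
The table above can be turned into a trivial transliterator for modernising
old-orthography text. A minimal sketch (the function name is my own, and the
mapping is simply the table above with upper- and lower-case forms; the test
words used are synthetic, not attested Zhuang):

```python
# Map the 1957 Zhuang tone letters (and their capitals) to the basic
# Latin tone letters adopted in the 1982 reform, as tabulated above.
OLD_TO_NEW = {
    "\u01A7": "Z", "\u01A8": "z",   # tone 2 (low falling)
    "\u0417": "J", "\u0437": "j",   # tone 3 (high level)
    "\u0427": "X", "\u0447": "x",   # tone 4 (high falling)
    "\u01BC": "Q", "\u01BD": "q",   # tone 5 (mid rising)
    "\u0184": "H", "\u0185": "h",   # tone 6 (mid level)
}

def modernise(text: str) -> str:
    """Convert old-orthography Zhuang tone letters to 1982 letters."""
    return "".join(OLD_TO_NEW.get(ch, ch) for ch in text)
```

Tone 1 needs no handling, as it is marked by the absence of a tone letter in
both orthographies.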

For examples of this new Zhuang orthography on the internet see :

http://www.kam-tai.org/languages/zhuang/materials/thestoryofhuasan/lan_09_03.htm
http://www.pouchoong.com/cuengh.htm

As far as I'm aware the old Zhuang orthography is no longer in general use.
However, it is quite possible that books and newspapers printed in the old
orthography will be archived electronically using Unicode one day, so it cannot
be said that the tone letters are now redundant, and so can be blithely
reassigned to other uses.

I agree 100% with Ken that the Unicode letters Tone Two, Five and Six were
introduced to represent the Zhuang tones, and so they should not be hijacked for
other uses for which their glyph shapes are not quite appropriate. If the glyph
shape for U+0184/U+0185 is wrong for the context that Michael and Peter want to
use this character in, then I guess that this is probably not the right
character to use for that purpose, and they should propose a new character.

Andrew



Re: LATIN SOFT SIGN

2004-01-05 Thread Andrew C. West
On Mon, 5 Jan 2004 13:54:18 +, Michael Everson wrote:
> 
> LATIN LETTER TONE SIX **is** the SOFT SIGN clone into Latin, and 
> should be used for Pan-Turkic. I've suggested, but perhaps not loudly 
> enough, that the reference glyph be modified to be more soft-sign 
> like.

LATIN LETTER TONE SIX isn't a Latin clone of the Cyrillic soft sign per se, but
is simply a character that is based on the Cyrillic letter that looks most like
the digit "6". It was chosen to represent Zhuang Tone 6 purely on the shape of
the glyph (likewise the letters for Zhuang Tones 1-5 were chosen simply for
their resemblance to the digits "1" through "5"), and has no relation to the
original phonetic usage of the Cyrillic letter. Modifying the reference glyph to
be more soft-sign-like would simply make it less Zhuang-Tone-Six-like.

Andrew



Re: Aramaic unification and information retrieval

2003-12-23 Thread Andrew C. West
On Tue, 23 Dec 2003 01:59:06 -0800, "Doug Ewell" wrote:
> 
> >  I deliberately followed the roadmap codepoints for my recent
> > 'Phags-pa proposal even though I think 'Phags-pa probably belongs in
> > the SMP (but I don't really care where 'Phags-pa is encoded as long as
> > it is encoded, so I am happy to defer to Michael, Rick and Ken in this
> > regard); and then WG2 in their wisdom decided to reallocate the block
> > three rows north of the roadmapped codepoints ... so maybe you can't
> > assume that roadmap codepoints are carved in stone.
> 
> I didn't see the minutes of the meeting where that decision was made.
> What was the rationale for moving it?
> 

To annoy the Chinese apparently ;)

... but therein lies a story beyond my ken.

A.



Re: Aramaic unification and information retrieval

2003-12-23 Thread Andrew C. West
On Mon, 22 Dec 2003 21:36:25 -0800, "Doug Ewell" wrote:
> 
> > Ancient forms of Aramaic
> > aren't going to be taken up anytime soon for any consideration
> > for encoding. And the Roadmap cannot be taken as a predetermination
> > of the eventual decisions in this regard, in my opinion.
> 
> Maybe not as far as whether it will actually be encoded.  We do know
> that "Accordance with the Roadmap" is often the sole justification for
> the code positions specified in proposals, as discussed in a thread some
> months ago.

I don't recall that thread, but is there anything intrinsically wrong in
proposing to use the same codepoints for a proposal that are given in the
roadmaps ? I deliberately followed the roadmap codepoints for my recent
'Phags-pa proposal even though I think 'Phags-pa probably belongs in the SMP
(but I don't really care where 'Phags-pa is encoded as long as it is encoded, so
I am happy to defer to Michael, Rick and Ken in this regard); and then WG2 in
their wisdom decided to reallocate the block three rows north of the roadmapped
codepoints ... so maybe you can't assume that roadmap codepoints are carved in
stone.

Andrew



Re: Swastika to be banned by Microsoft?

2003-12-15 Thread Andrew C. West
On Sun, 14 Dec 2003 13:39:23 -0500 (EST), Thomas Chan wrote:
> 
> The entry for U+534D in the _Hanyu Da Zidian_, vol. 1, p. 51 (as indicated
> in unihan.txt) includes a quote that it was originally not a Han
> character, "wan ben fei zi ...", suggesting that it now is.  There are
> also serifs shown in that dictionary and the _Kangxi Zidian_ for both
> characters.
> 
> Couldn't the above two characters be considerd a "CJK" or "IDEOGRAPHIC"
> version (like the spaces, zero, punctuation, brackets, etc. in the "CJK
> Symbols and Punctuation" block)?
> 

If memory serves me, the swastika was formally designated a Chinese ideograph by
the redoubtable Empress Wu of the Tang dynasty during the late 7th century.
Empress Wu had a penchant for creating new ideographs, and decreed that the
Buddhist swastika symbol should henceforth be considered a Chinese ideograph to
be pronounced WAN4 (a deliberate homophone for U+842C "10,000"). This is why,
unexpectedly to some, the swastika symbols are found in the CJK Ideograph block
rather than elsewhere.

Incidentally, U+534D and U+5350 are rarely used within running text in Chinese.
In the decorative arts the swastika motif is generally described as WAN4ZI4
<842C, 5B57> "WAN ideograph", as in the word WAN4ZI4JIN1 <842C, 5B57, 5DFE>, a
type of turban with a swastika decoration that was the height of fashion during
the Ming dynasty.

Andrew



Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

2003-12-12 Thread Andrew C. West
On Fri, 12 Dec 2003 07:53:13 -0800, Peter Kirk wrote:
> 
> OK. In fact I suspect that the number "that have meaningful semantics 
> and effective usage" is actually rather small and could be fitted within 
> the higher PUA planes if one chose to do that. After all, not many 
> languages use large numbers of different grapheme clusters (i.e. more 
> than a few hundred consonant-vowel combinations)

About 10,000 for Tibetan.

But why on earth are we talking about mapping grapheme clusters to the PUA ?! I
thought we had heard the last of that sort of "heresy" when William softly and
suddenly vanished away.

Andrew



Re: Unihan kKorean pronunciations

2003-12-08 Thread Andrew C. West
On Sat, 6 Dec 2003 05:17:16 +0900 (KST), Jungshik Shin wrote:
> 
>   For the nice summary of various transliteration/transcription schemes
> for Korean, see
> 
>   http://www.asahi-net.or.jp/~ez3k-msym/charsets/roma-k.htm
> 

Thanks, this page seems to provide just the information I need to convert the
Unihan readings to M-R.

>
>   As he wrote, it's good as a transliteration scheme, but certainly
> is not a transcription scheme.  It'd be better if the same
> transliteration scheme had been used in Unicode Hangul syllable
> names and UniHan DB.

It would certainly have made it a lot easier to convert from romanisation to
Hangul syllables, which is something I might want to do.
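
Assuming the kKorean readings really are Yale romanisation, converting a
reading to a precomposed Hangul syllable is mostly mechanical: parse off the
onset, vowel and coda, then apply the standard U+AC00 composition formula. A
rough sketch (the jamo tables and longest-match parsing are simplified, and
only cover plain syllables like those in the list below):

```python
# Yale-romanised jamo in standard Johab order: index into each list is
# the L/V/T index used by the U+AC00 Hangul syllable arithmetic.
ONSETS = ["K", "KK", "N", "T", "TT", "L", "M", "P", "PP",
          "S", "SS", "", "C", "CC", "CH", "KH", "TH", "PH", "H"]
VOWELS = ["A", "AY", "YA", "YAY", "E", "EY", "YE", "YEY", "O", "WA",
          "WAY", "OY", "YO", "WU", "WE", "WEY", "WI", "YU", "U", "UY", "I"]
CODAS  = ["", "K", "KK", "KS", "N", "NC", "NH", "T", "L", "LK", "LM",
          "LP", "LS", "LTH", "LPH", "LH", "M", "P", "PS", "S", "SS",
          "NG", "C", "CH", "KH", "TH", "PH", "H"]

def yale_to_hangul(syl: str) -> str:
    """Compose one precomposed Hangul syllable from a Yale reading."""
    # Longest-match parse: onset, then vowel; the remainder is the coda.
    onset = max((o for o in ONSETS if syl.startswith(o)), key=len)
    rest = syl[len(onset):]
    vowel = max((v for v in VOWELS if rest.startswith(v)), key=len)
    coda = rest[len(vowel):]
    return chr(0xAC00 + (ONSETS.index(onset) * 21
                         + VOWELS.index(vowel)) * 28 + CODAS.index(coda))
```

For example, KANG parses as K + A + NG and composes to U+AC15.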

Andrew



Re: Unihan kKorean pronunciations

2003-12-08 Thread Andrew C. West
On Fri, 5 Dec 2003 11:20:02 -0700, John Jenkins wrote:
> 
> I checked with Lee Collins (who's the person who put the data in there 
> originally).  Quoth'a:
> 
> It's called Yale, since it appears in a number of Samuel Martin's works 
> published by Yale Press.

Oops, I guess I really ought to have known that. Still, it would be a good idea
to add an explanatory note about this to the next release of Unihan.

Andrew



Re: Ideographic Description Characters

2003-12-08 Thread Andrew C. West
On Sun, 7 Dec 2003 11:25:01 -0700, Tom Gewecke wrote:
> 
> Can anyone tell me whether ideographic description characters are ever 
> actually used? 

Well, I use them on a couple of my web pages to describe unencoded ideographs
(try viewing http://uk.geocities.com/BabelStone1357/Alphabets/Zhuang.html with
Code2000), but I can't recall ever having seen them used elsewhere.

> I recently ran into a Han (Vietnamese Nôm) character 
> which does not seem to be encoded yet, "slice" radical on left and 
> "heart" radical on right, and was wondering whether it would make 
> practical sense to encode this as U+2FF1, U+2F5A, U+2F3C (⿰⽚⼼).

Remember IDCs *describe* ideographs, they are not used to *encode* them. One of
the reasons why IDCs can't be used to formally encode an ideograph is that there
are usually several or even many different ways to describe the same ideograph
with IDCs depending upon how far you break down its constituent components, and
whether you use radicals or complete ideographs for the constituent components.
Even for your simple example, you could variously describe the character as
<2FF1, 2F5A, 2F3C>, <2FF1, 7247, 2F3C>, <2FF1, 2F5A, 5FC3> or
<2FF1, 7247, 5FC3> ... which all in all means that you cannot hope to
successfully search for an IDC-described ideograph or do most of the other
operations you would expect to be able to do with formally encoded ideographs.
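
To make the search problem concrete: each component of a left-right
composition of "slice" and "heart" can be written either as a Kangxi radical
or as the corresponding unified ideograph, giving several distinct code point
sequences for one described character. A small illustration (I use U+2FF0,
the left-to-right IDC; the particular component choices are illustrative
only):

```python
IDC_LR = "\u2FF0"  # IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO RIGHT

# Four distinct IDS strings, all describing the same composition:
# each component as a Kangxi radical (U+2F5A, U+2F3C) or as the
# corresponding unified ideograph (U+7247, U+5FC3).
descriptions = {
    IDC_LR + "\u2F5A" + "\u2F3C",  # radical + radical
    IDC_LR + "\u2F5A" + "\u5FC3",  # radical + ideograph
    IDC_LR + "\u7247" + "\u2F3C",  # ideograph + radical
    IDC_LR + "\u7247" + "\u5FC3",  # ideograph + ideograph
}

# All four are distinct code point sequences, so a naive string search
# for one form will miss text that uses any of the other three.
print(len(descriptions))  # 4
```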

Andrew



Re: Coloured diacritics (Was: Transcoding Tamil in the presence of markup)

2003-12-08 Thread Andrew C. West
On Sun, 7 Dec 2003 17:40:25 -0800, "Doug Ewell" wrote:
> There are plenty of things one can do with writing that aren't supported
> by computer encodings, and aren't really expected to be.  The idea of a
> black "i" with a red dot was mentioned.  Here's another: the
> piece-by-piece "exploded diagrams" used to teach young children how to
> write the letters of the alphabet.  For "a" you first draw a "c", then a
> half-height vertical stroke to the right of (and connected with) the
> "c".  Two diagrams.  For "b" you draw a full-height stroke, then a
> backwards "c" to its right.  Another two diagrams.  And so on.
> 
> For each letter, each diagram except the last shows an incomplete
> letter, which might or might not accidentally correspond to the glyph
> for a real letter.  Also, each diagram except the first might show the
> "new" stroke to be drawn in a different color from the strokes already
> drawn, to clarify the step-by-step process.  There might also be little
> arrows with dashed lines alongside each stroke, to show the direction in
> which to draw it.
> 

... and similar stroke-by-stroke incremental diagrams showing how to write CJK
ideographs are even more common in (Chinese, Japanese, etc.) pedagogical texts
intended for both native children and for foreigners. I've also seen such
diagrams in Tibetan pedagogical texts, and imagine you could find them for
almost any modern script, so I hope no-one's thinking of proposing a set of
semi-composed Latin letters (with combining directional arrows), or we'll be
sliding down a long, slippery slope.

Andrew



Unihan kKorean pronunciations

2003-12-05 Thread Andrew C. West
Does anyone know what is the system of transliteration used for the kKorean key
in the Unihan database ? The notes at the top of Unihan.txt simply state that
kKorean gives "The Korean pronunciation(s) of this character". However, the
readings are in some strange orthography that I am not familiar with.

If anyone can supply me with a mapping table between the system used for the
kKorean key of Unihan and either the McCune-Reischauer system or the official
South Korean transliteration system, or point me to a Web page that would help
me out, I would be most grateful. I append an alphabetical list of all the
kKorean syllables.

Andrew

A
AL
AM
AN
ANG
AP
AY
AYK
AYNG
CA
CAK
CAL
CAM
CAN
CANG
CAP
CAY
CAYNG
CE
CEK
CEL
CEM
CEN
CENG
CEP
CEY
CHA
CHAK
CHAL
CHAM
CHAN
CHANG
CHAY
CHAYK
CHE
CHEK
CHEL
CHEM
CHEN
CHENG
CHEP
CHEY
CHI
CHIK
CHIL
CHIM
CHIN
CHING
CHIP
CHO
CHOK
CHOL
CHON
CHONG
CHOY
CHUK
CHUM
CHUN
CHUNG
CHWAL
CHWAY
CHWEY
CHWI
CHWU
CHWUK
CHWUL
CHWUN
CHWUNG
CI
CIK
CIL
CIM
CIN
CING
CIP
CIS
CO
COK
COL
CON
CONG
COY
CUK
CUL
CUNG
CUP
CWA
CWU
CWUK
CWUL
CWUN
CWUNG
E
EK
EL
EM
EN
EP
ES
EY
HA
HAK
HAL
HAM
HAN
HANG
HAP
HAY
HAYK
HAYNG
HE
HEL
HEM
HEN
HI
HIL
HO
HOK
HOL
HON
HONG
HOY
HOYK
HOYNG
HUING
HUK
HUL
HUM
HUN
HUNG
HUP
HUY
HWA
HWAK
HWAL
HWAN
HWANG
HWAY
HWEN
HWEY
HWI
HWU
HWUL
HWUN
HWUNG
HYANG
HYE
HYEK
HYEL
HYEM
HYEN
HYENG
HYEP
HYEY
HYO
HYU
HYUK
HYUL
HYUNG
I
IK
IL
IM
IN
ING
IP
KA
KAK
KAL
KAM
KAN
KANG
KAP
KAY
KAYK
KAYNG
KE
KEK
KEL
KEM
KEN
KEP
KES
KEY
KHWAY
KI
KIL
KIM
KIN
KKIK
KKUTH
KO
KOC
KOK
KOL
KON
KONG
KOY
KOYK
KOYNG
KUK
KUL
KUM
KUN
KUNG
KUP
KWA
KWAK
KWAL
KWAN
KWANG
KWAY
KWEL
KWEN
KWEY
KWI
KWU
KWUK
KWUL
KWUN
KWUNG
KYAK
KYEK
KYEL
KYEM
KYEN
KYENG
KYEP
KYEY
KYO
KYU
KYUL
KYUN
LA
LAK
LAL
LAM
LAN
LANG
LAP
LAY
LAYNG
LEY
LI
LIM
LIN
LIP
LO
LOK
LON
LONG
LOY
LUK
LUM
LUNG
LWI
LWU
LYAK
LYANG
LYE
LYEK
LYEL
LYEM
LYEN
LYENG
LYEP
LYEY
LYO
LYONG
LYU
LYUK
LYUL
LYUN
LYUNG
MA
MAK
MAL
MAN
MANG
MAY
MAYK
MAYNG
MENG
MI
MIL
MIN
MO
MOK
MOL
MONG
MWU
MWUK
MWUL
MWUN
MYA
MYE
MYEK
MYEL
MYEN
MYENG
MYEY
MYO
NA
NAK
NAL
NAM
NAN
NANG
NAP
NAY
NAYNG
NI
NIK
NIL
NO
NOK
NON
NONG
NOY
NUC
NUK
NUM
NUNG
NWU
NWUL
NWUN
NYE
NYEK
NYEL
NYEM
NYEN
NYENG
NYEP
NYEY
NYO
NYU
NYUK
O
OAY
OK
OL
ON
ONG
OY
PAK
PAL
PAN
PANG
PAY
PAYK
PEL
PEM
PEN
PEP
PEY
PEYK
PHA
PHAL
PHAN
PHAS
PHAY
PHAYNG
PHI
PHIK
PHIL
PHIP
PHO
PHOK
PHOL
PHOS
PHWUM
PHWUN
PHWUNG
PHYAK
PHYEM
PHYEN
PHYENG
PHYEY
PHYO
PI
PIN
PING
PO
POK
PON
PONG
PPWUN
PWU
PWUK
PWUL
PWUN
PWUNG
PYEK
PYEL
PYEN
PYENG
SA
SAK
SAL
SAM
SAN
SANG
SAP
SAY
SAYK
SAYNG
SE
SEK
SEL
SEM
SEN
SENG
SEP
SEY
SI
SIK
SIL
SIM
SIN
SIP
SO
SOK
SOL
SON
SONG
SOY
SSANG
SSI
SUI
SUL
SUNG
SUP
SWA
SWAY
SWI
SWU
SWUK
SWUL
SWUN
SWUNG
SYA
TA
TAL
TAM
TAN
TANG
TAP
TAY
TAYK
TEK
THA
THAK
THAL
THAM
THAN
THANG
THAP
THAY
THAYK
THAYNG
THE
THO
THON
THONG
THOY
THUK
THWU
TO
TOK
TOL
TON
TONG
TUK
TUNG
TWU
TWUL
TWUN
UK
UL
UM
UN
UNG
UP
UY
UYS
WA
WAL
WAN
WANG
WAY
WEL
WEN
WI
WU
WUK
WUL
WUN
WUNG
YA
YAK
YANG
YE
YEK
YEL
YEM
YEN
YENG
YEP
YEY
YEYN
YO
YOK
YONG
YU
YUK
YUL
YUN
YUNG



RE: Complex Combining

2003-11-28 Thread Andrew C. West
On Fri, 28 Nov 2003 10:32:51 +, Arcane Jill wrote:
> 
> You are getting personal and indulging in ad hominem. I consider this 
> out of order.

Wow, people really are tetchy today.

The published "Mail List Rules and Etiquette" state that "Correspondents should
remain tolerably polite and consider carefully before launching into rabid ad
hominem attacks". Well I think that my musings were tolerably polite, only
obliquely ad hominem and certainly not rabid.

Hopefully no-one will be too deeply scarred by my mild remarks, and we can
return to discussing issues of interest to this forum.

Andrew



Re: Complex Combining

2003-11-28 Thread Andrew C. West
On Thu, 27 Nov 2003 08:11:55 -0800, Peter Kirk wrote:
> 
> This is all rather interesting speculation. There are surely a lot of 
> potential cases in scripts where some kind of combining mark can be 
> considered as applying to a sequence of an arbitrary number of 
> characters. For example:
> 
> Enclosing circles, squares and ellipses.
> Continuous underlines and overlines.
> Continuous tildes, slurs, contour tone marks etc which may apply to 
> several characters or whole words.
> The cartouche in Egyptian hieroglyphs, which surrounds a group of 
> several characters.
> A number of mathematical functions e.g. fraction dividers, extensions to 
> root signs.
> Combining marks which are supposed to be centred over or under two or 
> more characters or even a whole word, like the Hebrew masora circle.
> 

For music, formatting characters exist for marks that extend over multiple
characters :
U+1D173..1D174 MUSICAL SYMBOL BEGIN BEAM..END BEAM
U+1D175..1D176 MUSICAL SYMBOL BEGIN TIE..END TIE
U+1D177..1D178 MUSICAL SYMBOL BEGIN SLUR..END SLUR
U+1D179..1D17A MUSICAL SYMBOL BEGIN PHRASE..END PHRASE

For Egyptian Hieroglyphs similar formatting characters have been proposed to
deal with cartouches (see http://std.dkuug.dk/JTC1/SC2/WG2/docs/n1944.pdf) :

U+x307..x308 EGYPTIAN HIEROGLYPHIC BEGIN CARTOUCHE..END CARTOUCHE

These are all specialised cases that are strictly necessary in order to
represent the respective scripts. General text formatting such as underlining or
arbitrary encirclement of characters (or cartouchement of ideographs which is
common in traditional Chinese texts) is considered to be "rich text" and beyond
the scope of Unicode. Whenever I read threads like this one (and they resurface
with monotonous regularity) I do wonder whether the participants have ever read
TUS Section 2.2 "Unicode Design Principles".

Andrew



Re: numeric properties of Nl characters in the UCD

2003-11-26 Thread Andrew C. West
On Wed, 26 Nov 2003 08:04:33 -0800, Peter Kirk wrote:
> 
> On 26/11/2003 04:40, Andrew C. West wrote:
> >Is this perhaps because all the other Gothic letters
> >can also be used to represent numbers in exactly the same way that U+10341 and
> >U+1034A are used (these two letter were devised specifically to fill the gap
in
> >the series of numbers represented by the ordinary Gothic letters), ... 
> >
>
> Probably not. It doesn't take long to see that NINETY appears where one 
> might expect a Q and corresponds to the Greek koppa. Koppa was used as a 
> letter in very early Greek, but since then (and even to the present day) 
> as a numeral with the same value 90. See 
> http://www.tlg.uci.edu/~opoudjis/unicode/numerals.html#koppa. It is 
> clear from the value and the glyph that the Gothic NINETY is derived 
> from the Greek koppa. Similarly, the Gothic NINE HUNDRED is derived from 
> the Greek sampi (U+03E1).

No-one's disputing the origins of U+10341 and U+1034A. All I'm saying is that
these two letters are neither needed nor actually used for writing Gothic words,
but were devised (i.e. borrowed from Greek) with the sole purpose of
representing the numbers 90 and 900.

Andrew



Re: numeric properties of Nl characters in the UCD

2003-11-26 Thread Andrew C. West
On Tue, 25 Nov 2003 16:16:15 -0800, "Doug Ewell" wrote:
> 
> Well, one reason could be that there is no such character.  (Did you
> mean U+1034A GOTHIC LETTER NINE HUNDRED?)
> 

But why do U+10341 [GOTHIC LETTER NINETY] and U+1034A [GOTHIC LETTER NINE
HUNDRED], which are letters that are only ever used to represent the numbers 90
and 900 respectively (they have no intrinsic phonetic value), not have a numeric
value assigned to them ? Is this perhaps because all the other Gothic letters
can also be used to represent numbers in exactly the same way that U+10341 and
U+1034A are used (these two letters were devised specifically to fill the gap in
the series of numbers represented by the ordinary Gothic letters), and in this
respect the two Gothic letters Ninety and Nine Hundred are no different from the
other Gothic letters which can be used to represent numbers but that do not have
a numeric value assigned to them by Unicode ?

On the other hand, the Gothic situation is analogous to Runic, in which the
three runes U+16EE..16F0 [RUNIC ARLAUG/TVIMADUR/BELGTHOR SYMBOL], which were
added to the sixteen-letter futhark in order to represent the complete set of
nineteen golden numbers using Runic letters, are assigned numeric values of
17..19 by Unicode.
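
The Gothic scheme, like the Greek one it copies, assigns the 27 letters the
values 1-9, 10-90 and 100-900 in order. Assuming the code-chart order matches
the traditional alphabetic order (as it does for the two letters discussed
here), the numeric value of any Gothic letter falls out of a one-line formula:

```python
# Sketch: numeric value of a Gothic letter under the Greek-style
# alphabetic-numeral scheme -- 27 letters in codepoint order taking
# the values 1..9, 10..90, 100..900.
def gothic_value(cp: int) -> int:
    n = cp - 0x10330          # position 0..26 in the Gothic block
    if not 0 <= n <= 26:
        raise ValueError("not a Gothic letter")
    return (n % 9 + 1) * 10 ** (n // 9)

# U+10341 GOTHIC LETTER NINETY and U+1034A GOTHIC LETTER NINE HUNDRED
# fall out of the pattern at exactly 90 and 900.
print(gothic_value(0x10341), gothic_value(0x1034A))  # 90 900
```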

BTW I've just noticed that U+10341 has a general category of "Lo" (Letter,
Other), whereas U+1034A has a general category of "Nl" (Number, Letter), which
seems a little odd.

Andrew



Unihan Vietnamese Readings

2003-11-25 Thread Andrew C. West
I've been looking at the Vietnamese readings given in the Unihan database
recently, and although I don't know Vietnamese, I think there may be something
not quite right with some of them, and so I wondered if anyone on this list who
knows Vietnamese could confirm the validity of the Unihan Vietnamese readings.

Since Unicode 3.2 the Unihan database has included Vietnamese Nôm readings for
164 basic CJK ideographs (from U+66F2 up to U+9C31, which is odd in itself), 122
CJK-A ideographs, and 4,230 CJK-B ideographs. The Vietnamese readings for the
CJK-A and CJK-B ideographs look like phonetic variations on the original Chinese
pronunciations of the ideographs (as would be expected), but none of the
Vietnamese readings for the 164 basic CJK ideographs bear any correspondence
with the Chinese pronunciations for the same ideographs.

I used the excellent Nôm Lookup Tool provided by the Nôm Foundation
(http://www.nomfoundation.org/nomdb/lookup.php) to check the Vietnamese readings
given in the Unihan database, and found that the Nôm readings for a random
sample of CJK-A and CJK-B ideographs exactly matched the readings given in the
Unihan database. On the other hand, none of the readings given by the Nôm Lookup
Tool for basic CJK ideographs (between U+66F2 and U+9C31) matched the readings
given in the Unihan database.

For example, the Unihan database has the following readings for these three
basic CJK ideographs :

U+66F2  kVietnamese giả 
U+66F4  kVietnamese xâu 
U+6771  kVietnamese hốc 

On the other hand the Nôm Lookup Tool gives the following readings for the same
ideographs :

U+66F2 = khúc 
U+66F4 = canh 
U+6771 = đông 

And looking up the Unihan Vietnamese readings for these three ideographs with
the Nôm Lookup Tool gives the following results :
giả = U+4F3D or U+5047 or U+5056 or U+8005 or U+8D6D
xâu = U+507B or U+641C or U+22D1C or U+22E64 or U+26113
hốc = U+561D or U+21417

Can anyone tell me whether this discrepancy between the Unihan Vietnamese
readings and the readings given by the Nôm Lookup Tool is due to an error in the
Unihan database or due to my lack of understanding of Vietnamese ?

Andrew



Re: IE settings for surrogates support

2003-11-25 Thread Andrew C. West
On Mon, 24 Nov 2003 15:47:16 +0100 (CET), Philippe VERDY wrote:
> 
> > > [HKEY_CURRENT_USER\Software\Microsoft\Internet
> > > Explorer\International\Scripts\42]
> > > "IEFixedFontName"="Code2001"
> > > "IEPropFontName"="Code2001"
> 
> This setting is incorrect: the script IDs go between 3 and 40,

See
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_192r.asp

Andrew



Re: creating a test font w/ CJKV Extension B characters.

2003-11-24 Thread Andrew C. West
On Mon, 24 Nov 2003 10:12:52 +, [EMAIL PROTECTED] wrote:
> 
> Even with the registery changes that allow Uniscript to work with such 
> characters?

Oops, my mistake. I had forgotten that I had deliberately deleted the registry
settings that control how IE deals with surrogate pairs sometime ago in order to
prove a point (that IE won't display surrogate pairs without them ?). Anyway,
restore the registry to its original state and Frank's page displays OK without
any tweaking whatsoever - both NCR and GB18030 encoded CJK-B characters render
correctly with my preferred CJK-B font.

To install the registry keys necessary for IE to display surrogate pairs simply
copy the code below to a file named "something.reg" and double-click on it.
Replace "Code2001" with the name of your preferred Supra-BMP font if necessary.


Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Windows NT\CurrentVersion\LanguagePack]
"SURROGATE"=dword:0002

[HKEY_CURRENT_USER\Software\Microsoft\Internet Explorer\International\Scripts\42]
"IEFixedFontName"="Code2001"
"IEPropFontName"="Code2001"


Andrew



RE: creating a test font w/ CJKV Extension B characters.

2003-11-21 Thread Andrew C. West
On Fri, 21 Nov 2003 15:12:26 +0100, "Philippe Verdy" wrote:
> 
> Could an editor loading such incorrect but legacy GB-18030 file accept to
> load it and work with it using an internal-only UCS-4 mapping (or an
> extended UTF-8 mapping), to preserve those out of range sequences, as if
> they were mapped in a extra PUA range?
> 

An editor which stored data internally as extended UTF-32 or extended UTF-8
could easily preserve such invalid codepoints, but BabelPad stores data
internally as UTF-16 so it couldn't, and even if it could it wouldn't, as it's a
Unicode editor, and codepoints beyond U+10FFFF are not Unicode (nor for that
matter are codepoints beyond U+10FFFF valid GB-18030 as far as I'm aware).
The first thing I'll do this evening is change BabelPad so that GB-18030
codepoints beyond U+10FFFF are converted to U+FFFD.

> Of course saving the file into a UTF encoding would be forbidden, but saving
> the internal UCS-4 file back to GB-18030 would preserve those out-of-range
> GB-18030 sequences, without making any other interpretation, and without
> changing them arbitrarily into the GB18030 equivalent of U+FFFD?
> 
> The editor could still use the Unicode rules for all valid GB18030
> sequences. And the invalid characters could be then represented for example
> with a colored/highlighted glyph such as . As both the input and
> output are not a Unicode scheme, I don't think this invalidates the Unicode
> conformance: the behavior would just be conforming to GB18030 or other
> legacy GB PUAs mappings.
> 

I'm pretty sure that there are no such legacy GB mapping, and I doubt that China
will ever want to map characters to extra-Unicode codepoints in GB-18030 ...
they seem far more interested in trying to force everyone else to accept their
unwanted characters in the BMP than putting them in some limbo beyond Plane 16.

Andrew



Re: creating a test font w/ CJKV Extension B characters.

2003-11-21 Thread Andrew C. West
On Thu, 20 Nov 2003 21:02:49 -0800, "Doug Ewell" wrote:
> 
> An invalid GB18030 sequence, like , or a valid but out-of-range
> sequence, like , should be treated just like an invalid or
> out-of-range UTF-8 sequence.  Issue an error message, format the hard
> disk, whatever; just don't try to treat it like a normal character.
> 

Hmm, surely  is a valid GB-18030 sequence = U+FA0C according to my
reckoning (although Word fails to correctly convert  when told to open a
file as GB-18030, it does save U+FA0C as  when told to save as GB-18030).

In BabelPad I convert any invalid GB-18030 characters to U+FFFD ("used to
replace an incoming character whose value is unknown or unrepresentable in
Unicode"), and notify the user that the file has been opened with errors, which
I think is a compliant and sensible implementation. (Unfortunately I've just
noticed that BabelPad has a slight bug with out-of-range GB-18030 values such as
the sequence mapping to U+110000.)
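
For what it's worth, the convert-invalid-sequences-to-U+FFFD behaviour
described here can be reproduced with any GB-18030 codec that supports
replacement on error; a sketch using Python's built-in gb18030 codec (an
illustration, not BabelPad's actual implementation):

```python
# Decode GB-18030 bytes, substituting U+FFFD for invalid or
# out-of-range sequences instead of raising an error.
def decode_gb18030_lossy(data: bytes) -> str:
    return data.decode("gb18030", errors="replace")

valid = bytes([0x95, 0x32, 0x82, 0x36])    # four-byte sequence = U+20000
invalid = bytes([0xE3, 0x32, 0x9A, 0x36])  # one past U+10FFFF: invalid

assert decode_gb18030_lossy(valid) == "\U00020000"
assert "\ufffd" in decode_gb18030_lossy(invalid)
```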

Andrew



Re: creating a test font w/ CJKV Extension B characters.

2003-11-21 Thread Andrew C. West
On Thu, 20 Nov 2003 11:45:35 -0800, "Frank Yung-Fong Tang" wrote:
> 
> so.. in summary, how is your concusion about the quality of "GB18030" 
> support on IE6/Win2K ? If you run the same test on Mozilla / Netscape 
> 7.0, what is your conclusion about that quality of support?

For the benefit of those who seem willing to trash Frank's page without actually
having looked at it, it is indeed encoded as GB-18030 with a charset=GB18030 declaration in its meta tag. The SIP
characters are represented three ways on the page : in native GB-18030 encoding,
as hexadecimal NCR entities, and as gif images. It is therefore a fine test for
browser support of GB-18030.

As far as W2K/IE6 is concerned, if you have GB-18030 support installed (and it
is not installed by default) then it seems to open, display and save GB-18030
pages with no problem. The problem is that W2K/IE6 won't render supra-BMP
characters no matter what encoding you use unless they are represented as NCRs
(either as a single 32-bit value or as two 16-bit surrogates, in hex or decimal
format) and the encoding is set to "User Defined" (either in the encoding
declaration on the page or manually by the end-user).

Andrew



Re: creating a test font w/ CJKV Extension B characters.

2003-11-20 Thread Andrew C. West
On Thu, 20 Nov 2003 01:32:16 +, [EMAIL PROTECTED] wrote:
> 
> Frank Yung-Fong Tang wrote,
> > If you visit 
> > http://people.netscape.com/ftang/testscript/gb18030/gb18030.cgi?page=596 
> > and your machine have surrogate support install correctly and surrogate 
> > font install correctly then you should see surrogate characters show up 
> > match the gif. 
> 
> It isn't working, but I have surrogate support and a font correctly
> installed.
> 

Using W2K and IE6, if you have a CJK-B font configured for "User Defined"
scripts under the "Options : Fonts" settings, and manually select the encoding
for the page as "User Defined", then the second CJK-B character in each box
(just above the gif image) displays just fine.

The top character in each box appears to be encoded as GB-18030 (e.g. GB-18030
0x95328236 = U+20000), and the second character is encoded as hex NCR values
(e.g. &#x20000; for U+20000).
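
The four-byte supplementary-plane sequences are a simple linear mapping from
0x90308130 = U+10000: bytes 2 and 4 are restricted to 0x30..0x39 and byte 3
to 0x81..0xFE. A quick sketch of the arithmetic:

```python
# Map a four-byte GB-18030 sequence to a supplementary-plane codepoint.
# Sequences count linearly from 0x90308130 (= U+10000): byte 2 and
# byte 4 each carry 10 values (0x30..0x39), byte 3 carries 126
# (0x81..0xFE).
def gb18030_4byte_to_cp(b1: int, b2: int, b3: int, b4: int) -> int:
    offset = (((b1 - 0x90) * 10 + (b2 - 0x30)) * 1260
              + (b3 - 0x81) * 10 + (b4 - 0x30))
    return 0x10000 + offset

print(hex(gb18030_4byte_to_cp(0x95, 0x32, 0x82, 0x36)))  # 0x20000
```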

If GB-18030 is selected as the encoding for the page (as explicitly given in the
file), then IE won't display the CJK-B characters correctly (even if you
configure a CJK-B font as your default font for displaying Chinese), but you can
copy and paste them to a Unicode editor, where both the GB-18030 and NCR encoded
forms of CJK-B characters will display correctly with an appropriate CJK-B font.

If User Defined is selected as the encoding for the page (either manually or by
changing the meta tag in the file to charset="x-user-defined"), then the
GB-18030 encoded characters turn to gunk, but the NCR representations are
displayed using whatever font you have configured for user defined scripts, and
if that is a CJK-B font then hey presto !

Andrew



Re: Ewellic

2003-11-12 Thread Andrew C. West
On Wed, 12 Nov 2003 08:10:49 -0800, "Doug Ewell" wrote:
> 
> I don't think the "secrecy" criterion is sufficient to qualify a writing
> system as a cipher (whether it is necessary is another question).  Nüshü
> (sp?) was developed primarily for secrecy, if I'm not mistaken, and I
> doubt anyone would regard it as a cipher.

On the contrary, if Nüshu glyph A has the same pronunciation and semantics as
Han ideograph B, and the Nüshu script is just a set of glyphs that have a
one-to-one mapping to standard Han ideographs, then surely Nüshu really is no
more than a cipher ?

Andrew



Re: Handy table of combining character classes

2003-11-10 Thread Andrew C. West
On Fri, 7 Nov 2003 14:57:51 -0500, "John Cowan" wrote:
> 
> Here's a little table of the combining classes, showing the value, the
> number of characters in the class, and a handy name (typically the one
> used in the Unicode Standard, or a CODE POINT NAME if there is only one;
> sometimes of my own invention).
> 
> Class   Count   Name
> = =   
> 0   589 Class Zero

589 ? Aren't all characters that are not in classes 1-240 Combining Class 0 (i.e. Spacing,
split, enclosing, reordrant, and Tibetan subjoined) ? 235,617 (including 2,048
surrogate code points) by my reckoning.
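
The reckoning is easy to check against the UCD with Python's unicodedata
module (the exact counts depend on the Unicode version the interpreter was
built against):

```python
import unicodedata

# COMBINING ACUTE ACCENT is in class 230; ordinary characters (and
# unassigned or surrogate code points) default to class 0.
assert unicodedata.combining("a") == 0
assert unicodedata.combining("\u0301") == 230

# Count code points whose canonical combining class is non-zero; the
# remainder of the 1,114,112 code points are all Class Zero.
nonzero = sum(1 for cp in range(0x110000)
              if unicodedata.combining(chr(cp)) != 0)
print(0x110000 - nonzero, "code points in class 0")
```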

Andrew



Re: elided base character or obliterated character (was: Hebrew composition model, with cantillation marks)

2003-11-07 Thread Andrew C. West
On Thu, 6 Nov 2003 12:51:53 -0500, John Cowan wrote:
> 
> IIRC we talked about this a year or so ago, and kicked around the idea that
> the Chinese square could be treated as a glyph variant of U+3013 GETA MARK,
> which looks quite different but symbolizes the same thing.

I suspect that few Chinese would be happy to see a well-known, easily-recognised
and frequently-used symbol relegated to a glyph variant of a Japanese symbol
that is unknown and unrecognised in China. There would be puzzled faces if the
geta mark appeared within Chinese text if the "wrong" font was selected. And
given that most CJK fonts aim to cover both Chinese and Japanese characters, how
would the square missing ideograph glyph and the Japanese geta mark be
differentiated ? By means of variant selectors ? If you were going to use
variant selectors to differentiate the two glyphs (and neither glyph is a
variant of the other for that matter), then you might as well encode it
separately, and be done with it !

The CJK Symbols and Punctuation block is largely Japanocentric, and I do not
think that it would hurt to add a few Chinese-specific symbols and marks - after
all if there's room in Unicode for wheelchairs, hot beverages, umbrellas with
raindrops, hot springs, etc. etc., you would think that room could be made for
the Chinese missing ideograph symbol which is used with such great frequency in
modern reprints of old texts. Probably worthwhile making a proposal and letting
UTC/WG2 decide.

Andrew



Re: [hebrew] Re: Hebrew composition model, with cantillation marks

2003-11-06 Thread Andrew C. West
On Thu, 6 Nov 2003 08:30:24 -0800, "Doug Ewell" wrote:
> 
> I can't help thinking that other specialized lists, such as those
> for bidi and CJK, were created to resolve this exact type of problem.

CJK list ? Now if only there was a list of Unicode lists ...



Re: elided base character or obliterated character (was: Hebrew composition model, with cantillation marks)

2003-11-06 Thread Andrew C. West
On Wed, 5 Nov 2003 12:24:00 +0100, "Philippe Verdy" wrote:
> 
> The obliterated character needed for paleolitic studies, or to encode any
> texts in which the character is not recognizable already exists: isn't it
> the REPLACEMENT CHARACTER?
> 

The problem of how to represent missing/obliterated characters in Unicode when
transcribing manuscript/printed texts and inscriptions, etc. has always
perplexed me.

U+FFFD [Replacement Character] is "used to replace an incoming character whose
value is unknown or unrepresentable in Unicode", and is definitely not the
correct character to use to represent a missing or obliterated character in a
non-electronic source text.
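That documented role is easy to demonstrate: a decoder substitutes U+FFFD for input bytes it cannot map, which is quite different from marking a lacuna in a manuscript. A minimal Python illustration:

```python
# U+FFFD REPLACEMENT CHARACTER is what a decoder emits for input it
# cannot represent -- e.g. invalid bytes in a UTF-8 stream. Each of
# the two invalid bytes below becomes one replacement character.
bad_utf8 = b"abc\xff\xfedef"
decoded = bad_utf8.decode("utf-8", errors="replace")
print(decoded)  # 'abc\ufffd\ufffddef'
```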

For Chinese the standard glyph for a missing/obliterated/unclear ideograph is a
full-width hollow square (i.e. the same size as a CJK ideograph). This glyph is
very common in modern printed Chinese texts, from scholarly editions of ancient
texts unearthed from 2,000 year old tombs to popular typeset reprints of 19th
century novels. Several examples of the usage of this glyph in modern printed
texts from the PRC can be found at
http://uk.geocities.com/babelstone1357/CJK/missing.html

The problem is how to represent this glyph in electronic texts. Browsing the
internet there seem to be two, both unsatisfactory, ways of representing this
"missing ideograph" glyph :

1. Using U+25A1 □ [WHITE SQUARE] (although any of the other white square
graphic symbols encoded in Unicode, such as U+25A2, U+25FB or U+25FD, could also
be used I suppose). The problems with this character are :
a) it has the wrong character properties for use within running CJK text.
b) with CJK fonts such as SimSun U+25A1 is rendered the same height and width as
a CJK ideograph, but with non-Chinese fonts such as Arial Unicode MS U+25A1 may
be rendered much smaller than a CJK ideograph, which looks totally wrong.

2. Using U+56D7 囗 [a CJK ideograph, rarely used other than as a radical =
U+2F1E], which has the right character properties, and renders at the correct
size; but the glyph shape may not be completely square depending upon the font
style, and basically it is just the wrong character for the job.
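The sizing complaint in (1b) is visible in the character properties themselves: U+25A1 has East Asian Width "Ambiguous", so fonts may legitimately render it narrow or full-width, while U+56D7 is "Wide" and always occupies a full ideographic cell. A quick check (an illustration, not part of the original mail):

```python
import unicodedata

# U+25A1 WHITE SQUARE: East_Asian_Width = Ambiguous ('A'), so its
# rendered width varies by font -- hence the size problem described.
print(unicodedata.east_asian_width("\u25A1"))  # 'A'

# U+56D7 (the ideograph used as radical 31): Wide ('W'), so it fills
# a full CJK cell, but it is semantically the wrong character.
print(unicodedata.east_asian_width("\u56D7"))  # 'W'
```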

It would be extremely useful to have a dedicated Unicode character for "missing
CJK ideograph" with the right character properties, and I have considered making
a proposal for such a character, but have hesitated as if there really is such a
great need for it (and I personally have web pages which transcribe texts with
missing/obliterated ideographs where such a character is desperately needed)
then why does it not already exist in Unicode or pre-existing Chinese encoding
standards ?

Andrew



Re: FW: Web Form: Other Question, Problem, or Feedback

2003-10-24 Thread Andrew C. West
On Fri, 24 Oct 2003 01:58:03 -0700 (PDT), "Andrew C. West" wrote:
> 
> Try BabelPad at uk.geocities.com/BabelStone1357/Software/BabelPad.html 
> 
> Select the text, and click on "Convert : NCR to Unicode" from the menu.
> Or simply check the "Convert NCRs" checkbox on the file open dialog when you
> open the file, and the codes will be converted to Unicode automatically. You
> can then save the file as UTF-8 or UTF-16.
> 

Oh, and I forgot to mention that BabelPad can save line breaks as CR/LF, CR, LF,
LS [U+2028] or PS [U+2029] as per user choice (internally BabelPad does not use
line break characters, so \r and/or \n do not come into the equation).
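The kind of line-break conversion described can be sketched in a few lines of Python (an illustration of the general technique, not BabelPad's actual code):

```python
# Normalise all line breaks in a text to one chosen separator.
# str.splitlines() recognises CR, LF, CR/LF, LS (U+2028) and
# PS (U+2029), among others, so the input convention does not matter.
def convert_line_breaks(text: str, sep: str = "\r\n") -> str:
    return sep.join(text.splitlines())

sample = "one\r\ntwo\nthree\u2028four"
print(convert_line_breaks(sample, "\u2029"))
```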

Andrew



Re: [tibex] Re: TIBETAN DIGIT HALF ZERO

2003-10-24 Thread Andrew C. West
On Thu, 23 Oct 2003 13:05:05 -0700, Peter Lofting wrote:
> 
> The representation of slashed digits are problematic for two reasons.
> 
> (1) The notation is that a slash indicates half of the value. This is 
> different to the "less a half" interpretation Andrew describes, which 
> would only be true for the digit "1".
> 

I was pretty sure that the slash *normally* indicated "half less than",
corresponding to the way in which fractions are represented in written Tibetan
(phyed + integer = half less than integer). Certainly, the only printed example
of a half digit that I know of (the 1933 7 1/2 skar Tibetan postage stamp) is
written as slashed eight (i.e. phyed brgyad = half less than eight).

If the slash denoted "half of the value" then slashed even numbers would be
integers (e.g. slashed 8 = 4), and what would the point of that be ?

The problem is what slashed zero means. Following the "half less than" rule
UnicodeData.txt assigns it a value of "-1/2", but a single negative half number
seems very improbable to me (it would not be much use without negative half
digits through to -9 1/2 ... as well as negative integers, and even then who's
using negative half digits ??). On the other hand the value of 9 1/2 (or 19/2)
is needed to complete the series of half digits between zero and ten given by
slashed 1 through slashed 9 (1/2 through 17/2). It therefore seems logical to me
that slashed zero in fact represents "ten less a half". 
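The "half less than" reading can be set out as simple arithmetic (my rendering of the interpretation argued above, not normative Unicode data):

```python
from fractions import Fraction

# Under the "half less than" rule, slashed digit d (1-9) has the value
# d - 1/2, giving the series 1/2, 3/2, ... 17/2. Reading slashed zero
# as "ten less a half" (19/2) completes the series up to ten, whereas
# the -1/2 of UnicodeData.txt leaves a gap at 19/2.
def slashed_value(d: int) -> Fraction:
    return Fraction(19, 2) if d == 0 else Fraction(2 * d - 1, 2)

series = [slashed_value(d) for d in [1, 2, 3, 4, 5, 6, 7, 8, 9, 0]]
print(series)  # 1/2, 3/2, ... 17/2, 19/2
```

Note that slashed eight comes out as 15/2, matching the 7 1/2 skar value on the 1933 stamp.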

As Peter says, it would be really useful if the people responsible for the
encoding of these half digits could (even at this late stage) provide some
concrete examples of their usage, and clear indication of their numeric value.

Andrew



Re: FW: Web Form: Other Question, Problem, or Feedback

2003-10-24 Thread Andrew C. West
> > i'm looking for a tool or a tutorial to convert japanese 
> > signs in numeric unicode signs (e.g. 留). Can you help me?
> > 

Try BabelPad at uk.geocities.com/BabelStone1357/Software/BabelPad.html 

Select the text, and click on "Convert : NCR to Unicode" from the menu.
Or simply check the "Convert NCRs" checkbox on the file open dialog when you
open the file, and the codes will be converted to Unicode automatically. You can
then save the file as UTF-8 or UTF-16.
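The NCR-to-Unicode conversion described can be approximated with a regular expression (a sketch of the general technique, not BabelPad's implementation):

```python
import re

# Convert HTML/XML numeric character references -- decimal &#NNNN; and
# hexadecimal &#xHHHH; -- to the characters they denote.
NCR = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")

def ncr_to_unicode(text: str) -> str:
    def repl(m):
        hex_digits, dec_digits = m.groups()
        return chr(int(hex_digits, 16) if hex_digits else int(dec_digits))
    return NCR.sub(repl, text)

print(ncr_to_unicode("&#30041;"))  # U+7559, a CJK ideograph
```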

Andrew



Re: [OT] Meaning of U+24560?

2003-10-13 Thread Andrew C. West
On Sun, 12 Oct 2003, Patrick Andries wrote:
> 
> Would anyone know what U+24650 means? 

Probably only Yang Xiong really knows what U+24560 means in the context of
Tetragram #9, but unfortunately he's been dead for a couple of thousand years.

Rather than bravely attempt to translate the Chinese tetragram names yourself,
which you could not do properly without having studied the original text and
commentaries in detail over many years, it would probably be safer to seek out a
good French translation of the Tai Xuan Jing, and simply use the names given
there. (Off hand I can't recommend a French translation, but I feel sure that it
must have been translated by at least one of the great 19th or early 20th
century French Sinologists). Using names which are already familiar to French
readers - even if they are not necessarily considered "correct" now - is
probably better than mistranslating the original Chinese names or producing
stilted translations of Nylan's names for the tetragrams.

Andrew



Re: kMandarin and kCantonese in Unihan

2003-10-07 Thread Andrew C. West
On Tue, 7 Oct 2003 21:42:09 +0800, Anthony Fok wrote:
> 
> What is a good place for discussions on these issues?   And which
> personnel and which sources are involved with esp. the CJK-Ext-A
> kCantonese data?  It would be nice to talk with the original people to
> find out how these errors crept in, e.g. errors of the original source? 
> Systematic errors due to mistakes in conversion from e.g. Jyutping to
> Yale?  Inappropriate use of "Fanqie"?  Other human errors?  etc. so
> that we can find good ways to correct these mistakes.

The latest draft version of the Unihan database (Unihan-4.0.1d1.txt) is
currently subject to public review (see
http://www.unicode.org/versions/beta.html).

This forum is a suitable place for discussing the Unihan database, but in order
to ensure that your errata are taken note of you should report them using the
Unicode reporting form (http://www.unicode.org/unicode/reporting.html) by
October 27.

The failings of the Unihan database have been the subject of much discussion in
the past, especially the kMandarin field which got rather mangled in Unicode
3.1. Happily the 4.0.1d1 version of Unihan fixes most of the kMandarin problems,
although the quality of many of the provided Mandarin readings still leaves much
to be desired. (The Mandarin readings really need to be completely overhauled,
based on a single authoritative source such as _Hanyu Da Zidian_ ... but that's
just my personal opinion).

> Furthermore, is there something like CVS web or changelogs to see the
> history of modifications of Unihan?  (when, by whom, and why, from what
> source, etc.)  What other fixes have been done to Unihan.txt since
> 19 June 2003?

There is no public CVS repository, but the various incarnations of the Unihan
database may be downloaded from the "Official Unicode Online Data" site at
http://www.unicode.org/Public

I suppose there won't be another release of Unihan until after the public review
period ends at the end of this month.

Andrew



Re: Chinese "departing" tone marks

2003-09-30 Thread Andrew C. West
On Thu, 19 Jun 2003 17:38:06 -0500, [EMAIL PROTECTED] wrote: 

> That sounds, then, like these are *not* two of the left-stemmed tone 
> letters (mirrors of 02E5..02E9) that I'm going to be including in a 
> proposal for additional modifier characters for tone. 

Peter,

I notice from document N2626 posted on the WG2 website today
(http://std.dkuug.dk/JTC1/SC2/WG2/docs/n2626.pdf) that China is proposing a
whole host of "IPA Extensions and Combining Diacritical Marks", including 225
"five-degree" tone values, which cover both left-stemmed and right-stemmed tone
marks, as well as most of the possible combinations of complex stemmed tone
marks that you have suggested could be represented by means of ligatures (and
indeed which are so represented in your excellent DoulosSIL font).

Andrew



Re: TAI NÜA , TAI LE

2003-09-15 Thread Andrew C. West
Not knowing very much about the Tai script I consulted some Chinese reference
books to see how the Chinese designate the Tai script.

The "Languages and Scripts" volume of the _Ci Hai_ encyclopaedia (Shanghai,
1978) states that there are four main script traditions used for writing the Dai
(= Tai) language :

1. "Dehong Dai", aka "Dai Na" [U+50A3, U+54EA] = TAI NÜA ?
2. "Xishuangbanna Dai", aka "Dai Le" [U+50A3, U+4EC2] = TAI LE ?
3. "Dai Peng" [U+50A3, U+7EF7] or "Dai De" [U+50A3, U+5FB7]
4. "Jinping Dai", aka "Dai Duan" [U+50A3, U+7AEF]

A google for "Dai Le" (in Chinese) gets over a hundred Chinese sites, all of
which seem to give "Dai Le" as an alternative name for the Xishuangbanna Dai
(people and/or script), and "Dai Na" as an alternative name for Dehong Dai
(people and/or script).

The fact that Chinese sources are unanimous in using "Dai Le" to refer to the
rounded form of the script used in the Xishuangbanna region, and "Dai Na" for
the square form of the script used in the Dehong region is rather odd, given
that it is the square Dehong form of the script that is encoded under the block
name of TAI LE.

Andrew



Re: Damn'd fools

2003-07-26 Thread Andrew C. West
On Fri, 25 Jul 2003 21:28:30 +0100, Michael Everson wrote:

> >Presumably the name of the U.K. would change, however.
> 
> Why? It would be the United Kingdom of Great Britain, which comprises 
> England, Scotland, Wales, and the Duchy of Cornwall.

"United Kingdom of Great Britain" as opposed to the present "United Kingdom of
Great Britain and Northern Ireland". The wishful misnomer "United Kingdom" of
course refers to the union of the erstwhile independent kingdoms of England
(including the principality of Wales and various other territories) and
Scotland. "Ireland" and later the reduced "Northern Ireland" was tagged onto the
official name of the country as somewhat of an afterthought (no doubt reflecting
its status within the political union).

BTW the Duchy of Cornwall is part of England, even if the Cornish are not
necessarily English.

Andrew



Re: Nu Shu script

2003-07-15 Thread Andrew C. West
On Mon, 14 Jul 2003 15:15:44 -0700 (PDT), Kenneth Whistler wrote:

> NuShu (or Nüshu) is periodically "discovered" and raised for
> discussion on this list.

There has been considerable interest in Nü Shu (literally "women's writing") in
recent years, especially amongst feminist academics in Japan and the States. If
anyone is interested in finding out more, the best web site on the subject is
that of Professor ENDO Orie (http://www2.ttcn.ne.jp/~orie/home.htm).

As with other informal, non-standardised scripts, such as traditional Yi
characters and Zhuang ideographs, I suspect that the Nü Shu script repertoire is
going to vary considerably both geographically and chronologically, which will
cause a big headache for anyone trying to formulate a proposal.

Personally I think that Nü Shu is a borderline case for encoding in Unicode.
Certainly there are a number of mainstream East Asian scripts that are more in
need of encoding. At any rate, I don't think the Nü Shu script will be ripe for
encoding until comprehensive dictionaries of the script have been published (I
know that there are scholars in China who are compiling lists of Nü Shu
characters, but I don't think any dictionaries have been published yet).

Andrew



RE: Combining diacriticals and Cyrillic

2003-07-12 Thread Andrew C. West
On Fri, 11 Jul 2003 09:09:08 -0700, Rick Cameron wrote:

> Ah, but what you don't realise [and it's not surprising, because MSDN
> doesn't make it clear] is that when ScriptTextOut calls ExtTextOut, it
> passes glyph indices, and uses the ETO_GLYPH_INDEX option. 
> 
> Thus, the two statements are perfectly consistent.  For once, Philippe's
> bold statement of fact is right. ;^)
> 
> (BTW, the authority for my bold statement of fact above is a conversation
> with David Brown, the architect of Uniscribe)

Well, I had a sneaking suspicion that someone would prove me wrong on this one.
But having said that, I'm not entirely convinced. You seem to be saying that
ExtTextOut facilitates ScriptTextOut's use of it under Windows 2K and XP by
means of the ETO_GLYPH_INDEX option (surely the same would apply under NT4, 9X
and ME ?), but this is not the same as Philippe's assertion that under XP you
can call simply TextOut with a Unicode string and TextOut will utilise the
appropriate Uniscribe functions to render the text the same as if you had used
the Uniscribe API directly.

Andrew



Re: Combining diacriticals and Cyrillic

2003-07-11 Thread Andrew C. West
On Fri, 11 Jul 2003 13:15:14 +0200, "Philippe Verdy" wrote:

> The Win32 Text APIs (such as TextOut) actually DO support UniScribe
> transparently on Windows XP... In most applications, this means that the
> UniScribe support works without requiring explicit calls to the Uniscribe API.

Surely some mistake here.


Starting with Microsoft Windows 2000, these functions [TextOut, ExtTextOut,
TabbedTextOut, DrawText, and GetTextExtentExPoint] have been extended to support
complex scripts. In general, this support is transparent to the application.



The [Uniscribe] ScriptTextOut function takes the output of both ScriptShape and
ScriptPlace calls and calls the operating system ExtTextOut function
appropriately.


Now if Uniscribe's ScriptTextOut function calls ExtTextOut, and according to
Philippe ExtTextOut utilises Uniscribe to output text ...

No, I don't think so. There is a big difference between "support complex
scripts" (MSDN) and "support UniScribe" (Philippe). I don't know what the exact
implementation of complex script support is for ExtTextOut etc., but I'm pretty
sure that it is independent of Uniscribe. Maybe I'm wrong, but at least I'm not
going to dress up a wild guess as a statement of certain fact as Philippe so
likes to do (and it is disingenuous of him to pretend that we are all picking on
him because his English is not good enough - there's nothing ambiguous about his
misleading statements, and if he wants to repeat them in French they'll still be
misleading or just plain wrong).

Andrew



Re: Documents needed for proposal

2003-07-04 Thread Andrew C. West
On Thu, 3 Jul 2003 08:59:19 -0700, Rick McGowan wrote:

> That is section 2.2 of the WG2 Principles and Procedures document. It is  
> available on-line. Go here:
> 
> http://std.dkuug.dk/JTC1/SC2/WG2/docs/principles.html
> 

I'm familiar with this document (n2352r), and it does indeed list the Character
Categories. But unless it's very well hidden, it does not provide (or even
summarize) the Character Naming Guidelines, which a proposer must be familiar
with in order to make an honest proposal.

Andrew



Re: Documents needed for proposal

2003-07-03 Thread Andrew C. West
On Thu, 3 Jul 2003 12:19:55 +0100, "Anto'nio Martins-Tuva'lkin" wrote:

> Where can the average proposal author browse "section II, Character
> Categories" (needed for item B.3), "clause 14, ISO/IEC 10646-1: 2000" 
> (needed for B.4) and "Annex L of ISO/IEC 10646-1: 2000" (needed for 
> B.5a), short of buying the standard? 

This is a perennial question. As I have stated before on this list, it would
indeed be very useful if the relevant parts of the document could be made freely
available.

The best that I've been able to get hold of over the internet is a Working Draft
of ISO/IEC 10646:2003 that is freely downloadable via the ISO/IEC JTC 1/SC 2
Document Register
(http://lucia.itscj.ipsj.or.jp/itscj/servlets/ScmDoc10?Com_Id=02) :

http://std.dkuug.dk/jtc1/sc2/open/02n3639.pdf

It should be sufficient for working out how to fill in the Proposal form.

Andrew



hPhags-pa Proposal

2003-06-30 Thread Andrew C. West
I have made available for consultation a draft proposal for the encoding of the
'Phags-pa or hPhags-pa script (mainly used for writing Chinese and Mongolian
during the 13th and 14th centuries) at :

uk.geocities.com/BabelStone1357/hPhags-pa/N2352.html

A set of additional pages relating to the 'Phags-pa script is also available at :

uk.geocities.com/BabelStone1357/hPhags-pa/index.html

Comments and criticism would be welcome. I am especially interested in hearing
from any of the Tibetan experts on this list who may be familiar with recent
Tibetan usage of the 'Phags-pa script.

Andrew

(My apologies for the large size of the proposal. If anyone has difficulty
accessing it, a zip of all the files is available on request.)



Re: Mongolian Rant (was: Biblical Hebrew... was: Tibetan... was: ...)

2003-06-29 Thread Andrew C. West
Ken,

Thank you for your kind response to my latest rant. It has gone a long way to
reassuring me that my concerns over MFVSs and the already defined Mongolian
standardized variants are unfounded.

On Fri, 27 Jun 2003 17:25:50 -0700 (PDT), Kenneth Whistler wrote:

> The Mongolian variants are normatively defined in tables in both 10646 and in
> the Unicode Standard, but there is normative, and then there is immutable.
> 
> Fortunately, the definition of standardized variation sequences is not
> entangled in any of the stability guarantees for normalization. As far
> as I know, neither WG2 (in its guiding Principles and Procedures document)
> nor the Unicode Consortium, at:
> 
> http://www.unicode.org/standard/stability_policy.html
> 
> has made any public guarantee about the immutability of these tables.
> Of course, there is the general assumption that things that are
> standardized stay stable, but we are, as Andrew implies, still in
> early days in terms of understanding all the implications for
> Mongolian implementations. So enough with the "come hell or high
> water" rhetoric in this case.

Sorry, I got carried away (as usual).

> If it turns out that the actual implementations need to define the variants
> differently than in the current tables for the MFVS standardized variation
> sequences, then we just need to define the appropriate technical
> corrigenda for the standards, and then get on with our lives. In this
> case, if the only implementations need a revision, then the standard
> should reflect the consensus practice, rather than vice versa. And in *this*
> case, there are no stability guarantees (that I know of) standing in
> the way of fixing the problem.

That is distinctly not the impression I got from Asmus, but no doubt I got hold
of the wrong end of the stick.

> The only barrier is one of getting a sufficiently clear statement of the
> implementation requirements and a sufficiently detailed specification of
> the results in front of the committees to enable them to close the loop
> on this.

Hopefully the good members of this list who are working on Mongolian shaping
behaviour will be able to do this.

> BTW, for what it's worth, I tried arguing a different interpretation of
> how the MFVS characters had to interact with Mongolian glyph selection
> than the simple x + MFVS --> fixed glyph interpretation that ended up
> in the tables, but I personally had neither the intimate knowledge
> of the Mongolian writing system nor the bandwidth to really make the
> case. That was a fight for another day.

Indeed, it has been discovered that there is not a simple one-to-one
correspondence between MFVSs and variant glyphs, but that in most cases variant
glyphs will be selected contextually without recourse to any control codes, and
that MFVSs will normally only be needed to override the default shaping
behaviour - and also that what variant glyph a particular MFVS selects may vary
depending upon context.

> Don't deprecate the MFVSs. Why? Instead, create a technical corrigendum that
> fixes the table(s) and updates the definition of their semantics
> until it matches consensus implementation in actual practice.
> Note that you are a leg up on this already because the MFVS characters
> are not submerged in the collection of generic variation selectors.
> They were defined for Mongolian for a reason, and the fact that
> they are *Mongolian* variation selectors ought to give us some
> room for making them function as intended for Mongolian.

Agreed.

> Nobody that I know of is deliberately *trying* to maintain mistakes
> or unimplementable features in the standard out of some perverse
> pleasure in frustrating people who are faithfully trying to implement
> Unicode. 

Glad to hear that.

Regards,

Andrew



Re: Biblical Hebrew (Was: Major Defect in Combining Classes of Tibetan Vowels)

2003-06-27 Thread Andrew C. West
On Fri, 27 Jun 2003 04:22:30 -0500, [EMAIL PROTECTED] wrote:

> I just have a hard time believing that 50 years from now our grandchildren 
> won't look back, "What were they thinking? So it took them a couple of 
> years to figure out canonical ordering and normalization; why on earth 
> didn't they work that out first before setting things in stone, rather 
> than saddling us with this hodgepodge of ad hoc workarounds? How short 
> sighted." As Rick said, I know this will get shot down; don't bother 
> telling me so.

I have to agree 100% with Peter on this. The potential fiasco with regard to
Mongolian Free Variation Selectors is another area where our grandchildren are
going to be weeping with despair if we are not careful. The standardized
variants for Mongolian were set in stone by Unicode based on an unfortunate but
understandable misunderstanding of the infamous TR170. It is now apparent from
Chinese and Mongolian sources that Unicode got hold of completely the wrong end
of the stick: the defined standardized variants are actually intended for use in
isolation only, and the same MFVS that selects one variant form in isolation may
be used to select a completely different variant within running text (which of
course it can't, according to the Standardized Variants document). Yet instead
of just wiping the slate clean and redefining a new and consistent set of
standardized variants that correspond to actual usage within China and Mongolia,
Unicode is determined to preserve the original erroneous standardised variants
come hell or high water - even though no-one has ever seriously used them yet
(well, the Chinese and Mongolians will go ahead and do it their way whatever
Unicode decides).

And before Peter suggests it, I have already suggested elsewhere that if Unicode
can't fix past errors, the only course might be for Unicode to deprecate the
MFVSs, and start again from scratch - didn't go down too well!

Andrew



Re: Question about Unicode Ranges in TrueType fonts

2003-06-26 Thread Andrew C. West
On Thu, 26 Jun 2003 14:26:13 +0200, "Philippe Verdy" wrote:

> Isn't there a work-around with the following function (quote from Microsoft
> MSDN):
> (with the caveat that you first need to allocate and fill a Unicode string for
> the
> codepoints you want to test, and this can be lengthy if one wants to retreive
the
> full list of supported codepoints).
> However, this is still the best function to use to know if a string can
> effectively
> be rendered before drawing it...
> 
> _*GetGlyphIndices*_
> 

GetGlyphIndices() or Uniscribe's ScriptGetCMap() would be OK for checking
coverage for small Unicode blocks such as Gothic (27 codepoints) or even
Mathematical Alphanumeric Symbols (992 codepoints), but I suspect your
application would freeze if you tried to use it to work out exact codepoint
coverage of CJK-B (42,711 codepoints) and PUA-A and PUA-B (65,534 codepoints
each).

Andrew



Re: Question about Unicode Ranges in TrueType fonts

2003-06-26 Thread Andrew C. West
On Wed, 25 Jun 2003 21:58:28 -0700, "Elisha Berns" wrote:

> Some weeks back there were a number of postings about software for
> viewing Unicode Ranges in TrueType fonts and I had a few questions about
> that. Most viewers listed seemed to only check the Unicode Range bits of
> the fonts which can be misleading in certain cases.

For W2K and XP only, Microsoft provides an API for determining exactly which
Unicode codepoints a font covers.

GetFontUnicodeRanges() in the Platform SDK fills a GLYPHSET structure with
Unicode coverage information for the currently selected font in a given device
context.

The GLYPHSET structure has these members :

cGlyphsSupported - Total number of Unicode code points supported in the font
cRanges - Total number of Unicode ranges in ranges
ranges - Array of Unicode ranges that are supported in the font

Note that "cRanges" is not the number of Unicode blocks supported, and "ranges"
is not an array of Unicode blocks. Rather "ranges" is an array of WCRANGE
structures that specify contiguous clumps of Unicode codepoints, and "cRanges"
is the number of contiguous clumps of Unicode codepoints. The WCRANGE structure
has the following members :

wcLow - Low Unicode code point in the range of supported Unicode code points
cGlyphs - Number of supported Unicode code points in this range

By looping through the "ranges" array it is possible to determine exactly which
characters in which Unicode blocks a given font covers (as long as your software
has an array of Unicode blocks and their codepoint ranges).
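That loop can be sketched as follows. The (wcLow, cGlyphs) pairs mirror the WCRANGE layout described above, but the sample ranges and the block table are invented for illustration, not real font output:

```python
# Given WCRANGE-style (wcLow, cGlyphs) pairs from a GLYPHSET, count
# how many supported code points fall inside each Unicode block.
# Hypothetical sample data throughout.
BLOCKS = [("Basic Latin", 0x0000, 0x007F),
          ("Latin-1 Supplement", 0x0080, 0x00FF),
          ("Greek and Coptic", 0x0370, 0x03FF)]

def block_coverage(wcranges):
    coverage = {name: 0 for name, _, _ in BLOCKS}
    for wc_low, c_glyphs in wcranges:
        for cp in range(wc_low, wc_low + c_glyphs):
            for name, lo, hi in BLOCKS:
                if lo <= cp <= hi:
                    coverage[name] += 1
    return coverage

# e.g. a font covering A-Z and the Greek capitals Alpha-Rho
print(block_coverage([(0x0041, 26), (0x0391, 17)]))
```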

Note that unlike the Unicode Subfield Bitfield (USB) that is part of the
FONTSIGNATURE structure that is filled by GetTextCharsetInfo() etc. [available
to W9X and NT as well as 2K/XP), which is limited to a particular version of
Unicode (3.0 ?) and returns supersets of Unicode blocks, the GLYPHSET structure
is version-independent. As long as your software has an up-to-date list of the
Unicode blocks and their constituent codepoints for the latest version of
Unicode, you will always be able to get up to date information about Unicode
coverage of a font.

This is the method used in my BabelMap utility, and you will note that it is
therefore able to not only list what Unicode 4.0 blocks are covered by a
particular font, but also give the exact number of codepoints that are covered
in that block. If you want to determine language coverage for a particular font,
then all you need to do is define a minimum set of codepoints that must be
covered for a particular block or set of blocks to be considered as supporting
that language. (Just the little matter of deciding what the minimum set of
codepoints would be for every language that is supported by Unicode ...)

Now the caveat. The USB sets a Surrogates bit to indicate that the font contains
at least one codepoint beyond the Basic Multilingual Plane (BMP). Unfortunately
the "ranges" array of the GLYPHSET structure only lists contiguous clumps of
Unicode codepoints within the BMP (wcLow is a 16 bit value), and does not list
surrogate coverage. Therefore you cannot determine supra-BMP codepoint coverage
from the GLYPHSET structure. If anyone does know an easy way to do this under
Windows, please let me know.

Regards,

Andrew



Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Andrew C. West
On Wed, 25 Jun 2003 13:41:27 -0700 (PDT), Kenneth Whistler wrote:

> 
> Peter asked:
> 
> > How can things that are visually indistinguishable be lexically different? 
> 
> chat (en)
> chat (fr)

And if Unicode reordered vowels in front of consonants, then we wouldn't be able
to distinguish :

chat (en)
chat (fr)
acht (de)

Andrew



Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Andrew C. West
On Wed, 25 Jun 2003 19:47:26 +0400, "Valeriy E. Ushakov" wrote:

> And given that the two look identical in writing in the first place,
> this lexical difference had a chance to originate exactly *where*?
> You are putting the cart before the horse.

Well, unless the text has been scanned with OCR, a human user will have to enter
Tibetan text manually, and if the user encounters a base consonant with two
different vowel signs joined to it, they will have to make a choice as to which
order the vowel signs are entered.

For example, if the word "bcuig" (with the letter CA carrying both a shabkyu [u]
and gigu [i] sign) is encountered in a text that is being transcribed into
electronic form, and the user recognises it from its context as a contraction
for "bcu gcig" (eleven), then it would be natural to enter "b-c-u-i-g" <0F56,
0F45, 0F74, 0F72, 0F42>. On the other hand, if a syllable (tsheg bar) comprising
the base consonant GA with a shabkyu [u] sign below and a gigu [i] sign above is
encountered (this is a plausible but hypothetical contraction), and the user
recognises this from its context as a contraction for the word "gi gu" (the name
for the I vowel sign), then it would be natural to enter "g-i-u" <0F42, 0F72,
0F74>, even though when writing it by hand the shabkyu would be written before
the gigu (calligraphic order does not necessarily equate to logical order). In
the one case a base consonant plus shabkyu and gigu is entered as <0FXX, 0F74,
0F72>, in the other case as <0FXX, 0F72, 0F74>.
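[Editor's sketch, not part of the original message: the reordering at issue can be checked directly. U+0F72 (gigu) has canonical combining class 130 and U+0F74 (shabkyu) has class 132, so normalization always sorts the gigu before the shabkyu.]

```python
import unicodedata

# The combining classes at issue: gigu [i] U+0F72 is ccc 130,
# shabkyu [u] U+0F74 is ccc 132, so canonical ordering puts I before U.
assert unicodedata.combining('\u0F72') == 130
assert unicodedata.combining('\u0F74') == 132

bcuig = '\u0F56\u0F45\u0F74\u0F72\u0F42'  # b-c-u-i-g, as entered
bciug = '\u0F56\u0F45\u0F72\u0F74\u0F42'  # b-c-i-u-g

# Normalization reorders the vowel signs, collapsing the two spellings:
print(unicodedata.normalize('NFC', bcuig) == bciug)  # True
```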

Unfortunately it is precisely at this point that my argument starts to crumble,
and I am forced to throw in the towel, and admit defeat.

The key question is, if <0F56, 0F45, 0F74, 0F72, 0F42> (bcuig) gets normalised
to <0F56, 0F45, 0F72, 0F74, 0F42> (bciug), then so what? Well, so nothing,
unless <0F56, 0F45, 0F74, 0F72, 0F42> (bcuig) is a shared contraction for two
different words, and the order of the U and I distinguishes what the contraction
is. As Tibetan shorthand abbreviations are an informal, non-standardised method
of abbreviating words, it is hypothetically possible that two different scribes
could come up with the same contracted form for two differently spelled words,
but I very much doubt that this would ever happen in reality. If I do find such
a case, I will certainly let this list know, but in the meanwhile I agree that
perhaps it would be more productive to return to Chris's original question,
rather than travel too far down this detour, scenic though it is.

Regards,

Andrew



Re: Major Defect in Combining Classes of Tibetan Vowels

2003-06-25 Thread Andrew C. West
On Wed, 25 Jun 2003 15:05:26 +0400, "Valeriy E. Ushakov" wrote:

> Err, as in this particular case one vowel sign is above and the other
> one is below the stack - i.e. they don't interact spatially - you
> cannot really distinguish them. ;)

I know that the vowel signs do not interact with each other typographically, but
what's that got to do with anything? I'm talking about the logical ordering of
the Unicode codepoints used to encode some Tibetan text, not the physical
appearance of the glyphs that are used to render that sequence of codepoints.

What I'm suggesting is that although "cui" <0F45, 0F74, 0F72> and "ciu" <0F45,
0F72, 0F74> should be rendered identically, the logical ordering of the
codepoints representing the vowels may represent lexical differences that would
be lost during the process of normalisation.
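[Editor's sketch, not part of the original message: a minimal demonstration that the distinction cannot survive normalisation, since both orderings map to the same canonical sequence.]

```python
import unicodedata

cui = '\u0F45\u0F74\u0F72'  # CA + shabkyu [u] + gigu [i]
ciu = '\u0F45\u0F72\u0F74'  # CA + gigu [i] + shabkyu [u]

# After NFC both inputs are the same string, i.e. the original
# ordering of the vowel signs is unrecoverable.
print(unicodedata.normalize('NFC', cui) == unicodedata.normalize('NFC', ciu))  # True
```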

Andrew
