Looking for information on the UnicodeData file

2003-03-05 Thread Pim Blokland




I apologize if this question has been asked 
before, but I'm relatively new at this.
My question is: where can I find formal definitions 
of the terms used in the Character Name field of the UnicodeData.txt file? Most 
specifically, precise explanations of designations like "turned", "inverse", 
"inverted", "reversed", "rotated" etc. Also the difference between "digraph" and 
"ligature", etc.
Although I've searched the FAQ files and the rest 
of the unicode.org site, I haven't been able to find this info as yet. This site 
is huge! So can anyone provide me with an URL? 
Thanks.

Pim Blokland



Re: Caron / Hacek?

2003-03-05 Thread Pim Blokland
John Hudson wrote:

 In the Slovak orthography, the lowercase d, l and t are normally written
 with the 'apostrophe' form of the accent.

Then why does UnicodeData break them down as (e.g.) 0064 030C rather than
0064 0315?

Pim Blokland





Re: The display of *kholam* on PCs

2003-03-05 Thread Dean Snyder
Chris Jacobs wrote at 12:54 AM on Wednesday, March 5, 2003:

 But why do you call the kholam a high left dot?

 As far as I know it can appear high left or middle, to indicate that it
 should be pronounced after the consonant, or right, to pronounce it before.
 So the meaning of a shin with two dots above it is ambiguous,

In classical Hebrew KHOLEM always represents a trailing vowel, i.e. it is
always pronounced after the consonant over which it is written. [In fact
I can't think of ANY vowel sign in classical Hebrew which represents a
pronunciation that precedes the consonant to which it is associated,
ignoring, for obvious reasons, written/read (kethib/qere) orthographies,
where the vowels indicate what is to be read in spite of the consonants
that are written.] And so the graphemic sequence SHIN KHOLEM is never
ambiguous in classical Hebrew. (I don't know about modern Israeli Hebrew.)

About the only unusual orthographic phenomenon I can think of related
to KHOLEM is that when it occurs after SIN it shares the same dot with SIN.


Respectfully,

Dean A. Snyder
Scholarly Technology Specialist
Center For Scholarly Resources, Sheridan Libraries
Garrett Room, MSE Library, 3400 N. Charles St.
The Johns Hopkins University
Baltimore, Maryland, USA 21218

office: 410 516-6850 mobile: 410 245-7168 fax: 410-516-6229
Manager, Digital Hammurabi Project: www.jhu.edu/digitalhammurabi
Manager, Initiative for Cuneiform Encoding: www.jhu.edu/ice





Re: Khmer encoding model (had no subject)

2003-03-05 Thread Mijan
Quoting Marco Cimarosti:

 Mijan wrote:
  [...]
   3. There are no other cases of a Vowel+Virama combination in the
   Unicode encoding model.
   
   Yes, there are. Khmer.
  
  I do not understand Khmer but I see that it does not use the 
  same 'encoding model'. Please look, you will see that you
  were wrong to use Khmer as an example.
 
 What do you mean by not using the same encoding model?
 
 There are actually three Indic scripts that have been encoded with a
 different model: Tibetan (subscript letters are encoded separately, rather
 than as combinations of virama + consonant), and Thai/Lao (reordrant vowel
 marks are encoded in visual order, rather than in phonetic order).
 
 But, AFAIK, this is not the case of Unicode Khmer, which is encoded in the
 same way as the scripts of India.

Thank you for the correction. I said I do not understand Khmer. I had 
understood that scripts not based on ISCII used a different encoding 
model.

Mijan




-
This mail sent through http://www.bangladesh.net 



RE: Reph and Khmer encoding model

2003-03-05 Thread Mijan
Quoting Kent Karlsson:

 
  I understand that unicode is supposed to represent the 
  language, not the way it is written.
 
 No, Unicode is supposed to be able to represent the written
 form. (Of course.)

Yes, I was wrong! I think I wanted to say something like: Unicode is supposed 
to be able to represent the written language with logically equivalent code 
points.
(Because the argument is about what is logically equivalent to ya-phalaa.)

Mijan

 form 
 ...
  Let's consider the ra+virama+ya case. For the most part, ra+virama+ya is
  displayed as ya+reph. This obviously seems to be an instance of ambiguous
  interpretation, because ra+virama+ya could also represent ra+ja-phalaa.
  ya+reph and ra+ja-phalaa are used in different words and have different
  meanings.
  From this you see that ja-phalaa is not equivalent to virama-ya and is
  better as a separate letter in Unicode. We always thought of ya-phalaa as
  separate anyway.
 
 
   3. There are no other cases of a Vowel+Virama combination in the
   Unicode encoding model.
   
   Yes, there are. Khmer.
  
  I do not understand Khmer but I see that it does not use the 
  same 'encoding model'. Please look, you will see that you
  were wrong to use Khmer as an example.
 
 Khmer uses the same encoding model as most other Indic scripts,
 except for one point: the reph is represented via a combining
 character (which also means that it does not come in logical order
 in the text representation), so the ambiguity you refer to does
 not exist for Khmer.  Further, Khmer could have been represented
 in a Tibetan-like encoding model (but isn't).  Further, IIRC,
 independent vowels can both be subscripted (before virama/coeng)
 and be subscripts (after virama/coeng) in Khmer.  The latter is
 orthographically different from using dependent vowels.
 
   /kent k
 







Re: Caron / Hacek?

2003-03-05 Thread John Cowan
Pim Blokland scripsit:

 Then why does UnicodeData break them down as (e.g.) 0064 030C rather than
 0064 0315?

To keep the upper case and lower case characters in sync for decomposition,
they always have the same combining characters.  For another example, G with
cedilla gets the cedilla on top when it's a capital, but it still decomposes
to the ordinary combining cedilla.  These are essentially font-ligaturing
issues.
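
The point about case pairs sharing the same combining mark can be checked against the character database. What follows is a minimal sketch, assuming Python's standard `unicodedata` module (whose data comes from UnicodeData.txt); the helper name `combining_marks` is made up for illustration.

```python
import unicodedata

def combining_marks(ch):
    """Return the code points that follow the base letter in ch's
    canonical decomposition, as listed in UnicodeData.txt."""
    return unicodedata.decomposition(ch).split()[1:]

# (uppercase, lowercase) pairs: D/d with caron, G/g with cedilla.
PAIRS = [("\u010E", "\u010F"), ("\u0122", "\u0123")]

for upper, lower in PAIRS:
    print(unicodedata.name(lower), unicodedata.decomposition(lower))
    # Both cases decompose to the same combining character,
    # whatever shape the accent takes in a given font.
    assert combining_marks(upper) == combining_marks(lower)
```

The lowercase d with caron reports 030C (COMBINING CARON), not 0315, even though fonts usually draw it with the apostrophe-like form; the choice of shape is left to the font.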

-- 
John Cowan  http://www.ccil.org/~cowan
To say that Bilbo's breath was taken away is no description at all.  There are
no words left to express his staggerment, since Men changed the language that
they learned of elves in the days when all the world was wonderful. --The Hobbit



Ya-phalaa

2003-03-05 Thread Michael Everson
Mijan,

Unicode has a mechanism for producing the ya-phalaa conjunct, namely 
by preceding the ya with virama. This works also in the unusual 
situation where the consonant the ya-phalaa modifies is an 
independent vowel.

A + VIRAMA + YA + -AA  (this is aa-yaphalaa)
RA + VIRAMA + ZWJ + YA (this is the reph-ya)
RA + VIRAMA + YA (this is the ra-yaphalaa)

There are analogous examples of this use of ZWJ in Malayalam and Devanagari.
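
For readers who want to see the three sequences as code points, here is a minimal sketch in Python. It only spells out the sequences exactly as proposed in this message (the thread goes on to dispute the reph-ya and ra-yaphalaa assignments); the dictionary keys and the helper `describe` are illustrative names, not part of any standard.

```python
import unicodedata

A = "\u0985"       # BENGALI LETTER A
RA = "\u09B0"      # BENGALI LETTER RA
YA = "\u09AF"      # BENGALI LETTER YA
VIRAMA = "\u09CD"  # BENGALI SIGN VIRAMA
AA = "\u09BE"      # BENGALI VOWEL SIGN AA
ZWJ = "\u200D"     # ZERO WIDTH JOINER

# Sequences as written in the message above, not a settled recommendation.
SEQUENCES = {
    "aa-yaphalaa": A + VIRAMA + YA + AA,
    "reph-ya": RA + VIRAMA + ZWJ + YA,
    "ra-yaphalaa": RA + VIRAMA + YA,
}

def describe(text):
    """List each code point in text with its Unicode character name."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in text]

for label, seq in SEQUENCES.items():
    print(label)
    for line in describe(seq):
        print("   ", line)
```

How any of these sequences actually renders depends on the font and the shaping engine, which is exactly what the rest of the thread argues about.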

--
Michael Everson * * Everson Typography *  * http://www.evertype.com


FAQ entry (was: Looking for information on the UnicodeData file)

2003-03-05 Thread John Cowan
I've reformatted Pim Blokland's question as a Unicode FAQ.

Q: What do the terms turned, inverted, reversed, rotated,
inverse, digraph, and ligature used in the names of Unicode
characters mean?

A: These terms are basically typographical rather than Unicode-specific.

A turned character is one that has been rotated 180 degrees around its
center.  A turned e winds up with the opening in the upper left portion.
U+0259 LATIN SMALL LETTER SCHWA is a turned e.

An inverted character has been flipped along the horizontal axis.
An inverted e winds up with the opening in the upper right portion.
There is no Unicode character representing an inverted e.

A reversed character has been flipped along the vertical axis.
A reversed e winds up with the opening in the lower left portion.
U+0258 LATIN SMALL LETTER REVERSED E is a reversed e.

A rotated character has been rotated 90 degrees, but one can't tell
which way without looking at the glyph.  U+213A ROTATED CAPITAL Q is a
Q that has been rotated counterclockwise.

Inverse means that the white parts of the glyph are made black, and
vice versa.  An inverse e looks like a normal e but is white on a
black background.  There is no Unicode character representing an
inverse e.

Digraphs and ligatures are both made by combining two glyphs.  In a digraph,
the glyphs remain separate but are placed close together.  In a ligature,
the glyphs are fused into a single glyph.
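
The characters cited in this answer can be looked up by name. The sketch below assumes Python's standard `unicodedata` module; `names_with_word` is a hypothetical helper, and the scan range up to U+2FFF is an arbitrary choice made to keep it quick.

```python
import unicodedata

def names_with_word(word, start=0x0000, stop=0x3000):
    """Map code point -> name for characters whose name contains
    `word` as a whole space-separated token."""
    found = {}
    for cp in range(start, stop):
        name = unicodedata.name(chr(cp), "")  # "" for unnamed code points
        if word in name.split():
            found[cp] = name
    return found

# The three characters mentioned in the FAQ answer.
for cp in (0x0259, 0x0258, 0x213A):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")

# Count how many names in the scanned range use each term.
for term in ("TURNED", "INVERTED", "REVERSED", "ROTATED", "DIGRAPH", "LIGATURE"):
    print(term, len(names_with_word(term)))
```

Note that U+0259 is named SCHWA rather than TURNED E (a separate U+01DD LATIN SMALL LETTER TURNED E exists), so a search on the name alone will not find every character that fits a given typographical description.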

-- 
A mosquito cried out in his pain,   John Cowan
A chemist has poisoned my brain!  http://www.ccil.org/~cowan
The cause of his sorrow http://www.reutershealth.com
Was para-dichloro-
Diphenyltrichloroethane.(aka DDT)



RE: Ya-phalaa

2003-03-05 Thread Michael Everson
At 17:41 + 2003-03-05, Andy White wrote:

  Unicode has a mechanism for producing the ya-phalaa conjunct, namely
  by preceding the ya with virama. This works also in the unusual
  situation where the consonant the ya-phalaa modifies is an
  independent vowel.
 
  A + VIRAMA + YA + -AA  (this is aa-yaphalaa)
  RA + VIRAMA + ZWJ + YA (this is the reph-ya)
  RA + VIRAMA + YA (this is the ra-yaphalaa)
 I said that I was not going to discuss this with you any further. I can
 now no longer resist! :-)
 Saying RA + VIRAMA + ZWJ + YA =  reph-ya will not be acceptable.
 Implementing this will break all existing implementations. All current
 Fonts and Bengali Unicode texts rely on Ra+Virama+Ya as being
 representative of the more common reph-ya.
 Moreover, RA + VIRAMA + YA cannot represent Ra-yaphalaa as Ra+Virama
 is relied upon as being representative of Reph.
 For example, in the Indic OpenType specifications, you will see that a
 Ra+Virama is recognised as reph before any other processing is applied.

If this is the case (and one would like corroboration) then simply 
reverse the two. The solution is the same.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: The display of *kholam* on PCs

2003-03-05 Thread John Hudson
At 07:57 AM 3/5/2003, Dean Snyder wrote:

 About the only unusual orthographic phenomenon I can think of related
 to KHOLEM is that when it occurs after SIN it shares the same dot with SIN.

Not always. I have not done a close analysis of manuscript sources, but I 
wouldn't be surprised to find that this practice is largely due to 
technical limitations in older typesetting systems and/or the conventions 
of particular script styles. The question was raised recently during our 
development of a set of fonts for biblical scholarship: I told the clients 
they had a choice of whether to combine the holam and sin dots or to have 
them separate. The clear preference was to have them separate. This was 
possible because, following the convention of the sephardic style on which 
the new font is based, the sin and shin dots do not sit at the *extreme* 
left and right of the shin letter, so there is a little extra space into 
which to insert a holam. This would be more difficult in an ashkenazic 
style, and particularly difficult in older typesetting systems that would 
not allow dynamic adjustment of holam relative to other marks.

John Hudson

Tiro Typeworks  www.tiro.com
Vancouver, BC
It is necessary that by all means and cunning,
the cursed owners of books should be persuaded
to make them available to us, either by argument
or by force.  - Michael Apostolis, 1467



RE: Ya-phalaa

2003-03-05 Thread Andy White
Michael Everson wrote:
[...]
RA + VIRAMA + ZWJ + YA (this is the reph-ya)
RA + VIRAMA + YA (this is the ra-yaphalaa)
[...]
 ... in the
 Indic OpenType specifications, you will see that a
 Ra+Virama is recognised as reph before any other processing
 is applied.
[...]
 If this is the case (and one would like corroboration) then simply 
 reverse the two. The solution is the same.

Once a botch is implemented then others will surely follow. Replacing your
original botch with yet another will make the encoding model into
nothing other than a hack. Seeing as one would like corroboration,
there seems to be no point in me wasting time by going into details.

IMHO, TUS needs solid rules; Exceptions, hacks, patches, or workarounds
should definitely be avoided wherever possible. (If you care to look
back in the mailing list archives a few years, you will see that the
a+Virama+Ya+aa kludge was originally proposed as a workaround due to
the lack of a separate encoded letter)

Andy




RE: Ya-phalaa

2003-03-05 Thread Michael Everson
Andy, the ya-phalaa is a presentation form of conjoined YA, which is 
produced in Unicode by the sequence VIRAMA + YA. Encoding it as 
anything else makes very little sense at all. However it is 
pronounced today in Bengali, and however weird you feel about its 
being applied to an initial vowel, the fact is that it is still a 
presentation form of conjoined YA, and it should be encoded as such.

Consider the fact that the Bhagavadgita is available in Sanskrit in 
Bengali script. This will certainly contain many, many examples of 
consonant clusters in -YA. These will all be encoded as VIRAMA + YA, 
not as some independent form of ya-phalaa.

It is easy to point fingers about a mismatch that someone like me 
makes, but the Unicode encoding model for Indic scripts is very 
robust, and we do our best to apply it correctly.

Your proposed combining ya-phalaa will do Bengali no service, as it 
will introduce multiple spellings for consonant clusters in -YA. I 
have already stated on this forum:

For example, in Sanskrit and Bengali, we have the word pratyeka 
'each, every'. This is derived from the Sanskrit root prati 
(expressing likeness or comparison) plus eka 'one'. In Sanskrit 
orthography i + e becomes ye and is so written. Now in Bengali this 
word also exists and in both languages what is written is PA + VIRAMA 
+ RA + TA + VIRAMA + YA + E + KA.

It would be absurd -- and wrong -- to spell the Sanskrit word one way 
and the Bengali word another, especially as it is the same word.

 IMHO, TUS needs solid rules; Exceptions, hacks, patches, or workarounds
 should definitely be avoided wherever possible. (If you care to look
 back in the mailing list archives a few years, you will see that the
 a+Virama+Ya+aa kludge was originally proposed as a workaround due to
 the lack of a separate encoded letter)

It isn't a kludge. It is a consistent application of the rules. 
Ya-phalaa is a presentation form of YA in conjunction with a 
preceding consonant or -- a Bengali innovation -- an independent 
vowel.

In keeping this stance, Andy, I am defending the Unicode Standard's 
encoding principles. The Indic encoding model is constantly under 
attack from people who want explicit rephas, explicit half-forms, 
explicit ya-phalaas, and all sorts of other explicit things, which 
were we to encode them would make the standard very much worse than 
it is.

To reiterate our consistency in using this model, I will give you a 
Malayalam example.

NA + VIRAMA + MA -- NMA (a single conjunct)
NA + VIRAMA + ZWNJ + MA -- NMA (with a visible virama breve above and between)
NA + VIRAMA + ZWJ + MA -- NMA (with the cillaks.aram virama curl)

We prefer to apply this consistency to Bengali as well. Thank you for 
correcting my error earlier. That kind of feedback is helpful. 
Beating us up because you don't like our encoding model isn't.
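
The Malayalam example can be written out as code points. This is a minimal sketch in Python that encodes the three sequences exactly as listed in this message; the labels and the helpers `code_points` and `strip_joiners` are illustrative names only.

```python
NA = "\u0D28"      # MALAYALAM LETTER NA
VIRAMA = "\u0D4D"  # MALAYALAM SIGN VIRAMA
MA = "\u0D2E"      # MALAYALAM LETTER MA
ZWNJ = "\u200C"    # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"     # ZERO WIDTH JOINER

# The three spellings of NMA described above.
FORMS = {
    "single conjunct": NA + VIRAMA + MA,
    "visible virama": NA + VIRAMA + ZWNJ + MA,
    "cillaksharam": NA + VIRAMA + ZWJ + MA,
}

def code_points(text):
    """Format text as a space-separated list of U+XXXX values."""
    return " ".join(f"U+{ord(c):04X}" for c in text)

def strip_joiners(text):
    """Drop ZWJ and ZWNJ, leaving only the letters and the virama."""
    return text.replace(ZWNJ, "").replace(ZWJ, "")

for label, seq in FORMS.items():
    print(f"{label:16} {code_points(seq)}")
```

Once the joiners are stripped, all three forms reduce to the same NA + VIRAMA + MA sequence, which is the consistency being argued for: the joiners only request a presentation, they do not change the underlying letters.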
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: The display of *kholam* on PCs

2003-03-05 Thread Dean Snyder
Chris Jacobs wrote at 7:27 PM on Wednesday, March 5, 2003:

 Chris Jacobs wrote at 12:54 AM on Wednesday, March 5, 2003:

 But why do you call the kholam a high left dot?
 
 As far as I know it can appear high left or middle, to indicate that it
 should be pronounced after the consonant, or right, to pronounce it before.
 So the meaning of a shin with two dots above it is ambiguous,

 In classical Hebrew KHOLEM always represents a trailing vowel, i.e. it is
 always pronounced after the consonant over which it is written. [In fact
 I can't think of ANY vowel sign in classical Hebrew which represents a
 pronunciation that precedes the consonant to which it is associated,
 ignoring, for obvious reasons, written/read (kethib/qere) orthographies,
 where the vowels indicate what is to be read in spite of the consonants
 that are written.] And so the graphemic sequence SHIN KHOLEM is never
 ambiguous in classical Hebrew. (I don't know about modern Israeli Hebrew.)

When holem precedes ?,  the point is placed on the
upper right of the letter, as with ?? (yo¯'macaronr).
When it follows the ?, the point is placed on the
upper left, as in ? ('o¯bhe¯dh). When holem precedes
??, the points coincide, as with ?? (mo¯scaronecaronl).
When holem follows ??, the points again coincide as
with ?? (so¯t?e¯n). The letter ??? will be scarono¯ to
commence a syllabe, e.g., ?? (scarono¯macaron'), and o¯s in
other places.

[ R.K. Harrison, Teach Yourself Biblical Hebrew ]

The case of (written) Yo'MaR is not an exception. The pronunciation is
yomar, the aleph not being pronounced; and therefore the KHOLEM is
written after the consonant which directly precedes it in pronunciation.

In the examples 'oBeD, MoSHeL, and SoTeN the KHOLEM, as expected, follows
in pronunciation the letter with which it is associated.

I can't make out the transcription "The letter ??? will be scarono¯ to
commence a syllabe, e.g., ?? (scarono¯macaron'), and o¯s in other places."
and I don't have Harrison's grammar at work to check the reading; but it
sounds like an explanation of how SHIN + KHOLEM are written, which has
already been discussed.


In the Bagster Polyglot Bible, Hebrew-English Old Testament, translation
Everard van der Hooght,
Genesis 1.3 weyyomer elohiem  And God said
the holem is clearly above the aleph, not above the yod.

Same response given for YoMaR above.


I see in fact _another_ example of a holem to the right, which Harrison
did not mention:
the holem in elohiem is above the he, not above the lamed.

Due to innate complexity there is variation in Hebrew pointing in
manuscripts and printed editions, even leaving aside for the moment
discussion of the various Hebrew pointing traditions themselves. But,
although KHOLEM following LAMED is indeed orthographically a somewhat
special case (due to the fact that LAMED is the only Hebrew character to
extend above the scribal line and the extension is precisely from the
upper left corner of the glyph where you want to place the KHOLEM), I
have nevertheless always seen it written between the LAMED and the
following glyph but closer to the LAMED. This is certainly how it is
taught and printed these days. 

I don't have my Bagster here at work but I would suspect if you looked
closely, the location of the KHOLEM would be as I have suggested. If not
I suspect this is idiosyncratic to works printed on that press. 

[I did however misspeak technically when I said after the consonant OVER
which it is written. The KHOLEM pronounced after LAMED is indeed written
OVER the scribal line, but is written directly AFTER the LAMED.]


 About the only unusual orthographic phenomenon I can think of related
 to KHOLEM is that when it occurs after SIN it shares the same dot with SIN.

And if those dots were above different letters there were no reason why
they should share.

I must be missing your point here; this seems to support what I was saying.


But I'm surprised that no one has provided the one possible
counterexample to my statement about no vowel preceding its consonant (an
example I completely forgot about when writing my former post) -  furtive
pathach (as in the second a-vowel in SaMeaKH). 

Depending on your linguistic persuasion you might argue that the PATAKH
here is a vowel glide, both written and pronounced, which is merely
extending a non-a-vowel before guttural consonants in certain phonemic
contexts. Or you might want to posit that it is the only example of a
syllable in classical Hebrew beginning with a vowel - or an unwritten
consonant.

Probably more than we need to know about the originally posted problem,
but I have a feeling that readers of this list enjoy, like I do,
discussion of these orthographic quirks of the world's writing systems.


Respectfully,

Dean A. Snyder
Scholarly Technology Specialist
Center For Scholarly Resources, Sheridan Libraries
Garrett Room, MSE Library, 3400 N. Charles St.
The Johns Hopkins University
Baltimore, Maryland, USA 21218

office: 

[OT] The project is done

2003-03-05 Thread David Oftedal
Hello!

My keymap is done, and is working well. I just wanted to thank everyone 
who helped me during the construction of all the scripts and tidbits 
that made it work.

Thanks a lot!

-Dave Oftedal

--
Sonna ojamasan ni ha batsu-geemu namatako pantsu juppun!




RE: Ya-phalaa

2003-03-05 Thread Andy White
Michael, I do not wish to get into yet another long discussion
(argument) but I must reply to one point.

 Your proposed combining ya-phalaa will do Bengali no service, as it 
 will introduce multiple spellings for consonant clusters in -YA. 

Um, actually if you look, you will not find any place where I have
proposed a combining ya-phalaa. I have so-far avoided any mention of
such a thing due to the reasons you give above. (I think you will find
that it was Mijan that mentioned that.)

Andy





Re: Malayalam Cillaksharams (was Ya-phalaa)

2003-03-05 Thread Michael Everson
At 21:14 + 2003-03-05, Andy White wrote:
 I am replying to this portion of the reply as I feel it is a very
 important revelation.

We weren't hiding it. This is part of the improvements to Unicode 
that have been made for 4.0. One of the tasks I was given was to 
improve the block descriptions of the Indic scripts if I could. Most 
have been improved rather a lot considering the time constraints we 
have had. In each case we endeavoured to address some of the problem 
areas. We are still editing.

  To reiterate our consistency in using this model, I will give you a
  Malayalam example.

  NA + VIRAMA + MA -- NMA (a single conjunct)
  NA + VIRAMA + ZWNJ + MA -- NMA (with a visible virama breve above and between)
  NA + VIRAMA + ZWJ + MA -- NMA (with the cillaks.aram virama curl)
[...]
  Michael Everson
--
Michael Everson * * Everson Typography *  * http://www.evertype.com


Re: [OT] The project is done

2003-03-05 Thread Edward H Trager


On Wed, 5 Mar 2003, David Oftedal wrote:

 Hello!

 My keymap is done, and is working well. I just wanted to thank everyone
 who helped me during the construction of all the scripts and tidbits
 that made it work.

I'm curious what keymap and for what language/script that is?  Probably I
ignored the earlier posts regarding this.  Is this a keymap that is
generally available for people to use?


 Thanks a lot!

 -Dave Oftedal

 --
 Sonna ojamasan ni ha batsu-geemu namatako pantsu juppun!







RE: Ya-phalaa

2003-03-05 Thread jameskass
.
 Moreover, RA + VIRAMA + YA cannot represent Ra-yaphalaa as Ra+Virama
 is relied upon as being representative of Reph.
 For example, in the Indic OpenType specifications, you will see that a
 Ra+Virama is recognised as reph before any other processing is applied.

 If this is the case (and one would like corroboration) then simply 
 reverse the two. The solution is the same.

RA + VIRAMA is a pre-base substitution and pre-base stuff gets
processed first.

RA + ZWNJ + VIRAMA + YA might be the way to go in order to
disambiguate REPH + YA from RA + YA-PHALAA.

Whatever method is chosen, it will be invisible to the user.
The way text is stored on computers has nothing to do with
the way text is handwritten, typed, and printed or displayed.  

Computer characters consist of strings of ones and zeros.  The
binary string which is stored by a computer to represent the 
LATIN CAPITAL LETTER A doesn't look anything like the letter.  

The important matter is that each letter needs to have a
unique binary string which can be stored electronically.
Lengths of such strings vary.

Input methods and display need to match users' expectations,
but the underlying binary string encodings do not.  The users
never see this.
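
To make the point about stored binary strings concrete, here is a small sketch in Python. It uses only the built-in `str.encode`; the helper name `bits` is made up, and the choice of UTF-8 and UTF-16 (big-endian) is just for illustration.

```python
def bits(text, encoding):
    """Show the stored form of text as groups of eight binary digits."""
    return " ".join(f"{byte:08b}" for byte in text.encode(encoding))

LATIN_A = "A"          # U+0041 LATIN CAPITAL LETTER A
BENGALI_RA = "\u09B0"  # U+09B0 BENGALI LETTER RA

for label, ch in (("LATIN CAPITAL LETTER A", LATIN_A),
                  ("BENGALI LETTER RA", BENGALI_RA)):
    for encoding in ("utf-8", "utf-16-be"):
        print(f"{label:24} {encoding:10} {bits(ch, encoding)}")
```

The same character has a different stored length depending on the encoding form (one byte for A in UTF-8, two in UTF-16; three bytes for the Bengali letter in UTF-8, two in UTF-16), and none of these bit patterns resembles the letter on screen.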

Best regards,

James Kass
.



length of text by different languages

2003-03-05 Thread Yung-Fong Tang
I remember there was some study showing that, although UTF-8 encodes each 
Japanese/Chinese character in 3 bytes, Japanese/Chinese usually uses 
FEWER characters in writing to communicate information than alphabet-based 
languages.

Can anyone point me to such research? Martin, do you have a paper 
about that?

I would like to find out the average ratio between
English,
German,
French,
Japanese,
Chinese,
Korean
in terms of the number of characters, and in terms of the bytes needed to 
encode them in UTF-8.

If such research has not been done, maybe one way to figure out 
the result is to take translated Bibles for these languages from the SWORD 
Project, strip out the XML tags to leave the pure text, and measure 
the size. Since all the Bible translations communicate the same 
information and the volume is large enough, that could be a good way to 
find out the result. Of course, the markup needs to be taken out to 
reduce the noise.
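
The measurement described above could be sketched as follows in Python. This is only an outline under stated assumptions: the regular expression is a crude tag stripper, not a real XML parser, and the two sample strings are placeholders of different meaning, not matched translations and not results.

```python
import re

TAG = re.compile(r"<[^>]+>")

def strip_markup(xml_text):
    """Remove anything that looks like an XML tag (rough approximation)."""
    return TAG.sub("", xml_text)

def measure(text):
    """Return (number of code points, number of UTF-8 bytes)."""
    return len(text), len(text.encode("utf-8"))

# Placeholder samples only; a real comparison needs parallel texts.
SAMPLES = {
    "English": "<verse>Hello, world</verse>",
    "Japanese": "<verse>こんにちは世界</verse>",
}

for language, raw in SAMPLES.items():
    chars, nbytes = measure(strip_markup(raw))
    print(f"{language:10} {chars:4} characters {nbytes:4} UTF-8 bytes "
          f"({nbytes / chars:.2f} bytes per character)")
```

`len` counts code points rather than user-perceived characters, so combining sequences would need extra care in a serious study.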






RE: Ya-phalaa

2003-03-05 Thread Andy White
I once wrote:
  My thoughts were to put a ZWNJ after the Ra to indicate that it is not
  to form a Reph, e.g. Ra+ZWNJ+Virama+Ya = Ra+Jophola.
  Then I remembered that in some font designs, secondary forms such 
  as jophola can form a conjunct ligature with the preceding 
  consonant. 
  I think that a ZWNJ would imply that Ra and Ya should not ligate.

James Kass said:
 Exactly.  This would seem to work without breaking anything 
 existing and would not mean extending the semantics of ZWNJ.
 
 Have you since changed your mind about this?

No!
This is an example of stating something that can be read in two ways -
unfortunately you took an unintended meaning :-(

Re-iterating in reverse should get the point across, I hope:
I think that a ZWNJ would imply that Ra and Ya should not join together.
(ZWNonJoiner)
But I remembered that in some font designs Ra and Ya *do* join together
(they make a ligature.)
Therefore Ra+ZWNJ+Virama+Ya cannot represent Ra+Yaphalaa when they form
a ligature.


Andy




RE: Ya-phalaa

2003-03-05 Thread jameskass
.
Andy White wrote,

 No!
 This is an example of stating something that can be read in two ways -

Hmmm, kind of like RA+VIRAMA+YA in current implementations?

 unfortunately you took an unintended meaning :-(

Actually, I did get the intended meaning.  Unfortunately, though,
I didn't get it until after my reply was sent.  smile

 I think that a ZWNJ would imply that Ra and Ya should not join together.
 (ZWNonJoiner)
 But I remembered that in some font designs Ra and Ya *do* join together
 (they make a ligature.)
 Therefore Ra+ZWNJ+Virama+Ya cannot represent Ra+Yaphalaa when they form
 a ligature.
 

So, I've had a half hour to consider how to respond to your
anticipated response.  smile

If a font designer makes a special ligature form of RA+JOPHOLA,
then the easy solution would be to put a look-up in the font's
GSUB table:

RA + ZWNJ + VIRAMA + YA --- my special ligature form

The hard part of this, as you know, is getting something like
this to actually work.  But, as you also know, the people who
are working on Unicode font engines, like Paul Nelson of
Microsoft, are very diligent in following up on these special
cases.  Remember all of our talk about the KHANDA TA and note
that the current experimental version of Uniscribe now seems
to be properly substituting that form.

Best regards,

James Kass
.



Re: Looking for information on the UnicodeData file

2003-03-05 Thread Asmus Freytag
At 04:57 PM 3/5/03 +0100, Pim Blokland wrote:
I apologize if this question has been asked before, but I'm relatively new 
at this.
My question is: where can I find formal definitions of the terms used in 
the Character Name field of the UnicodeData.txt file? Most specifically, 
precise explanations of designations like turned, inverse, inverted, 
reversed, rotated etc. Also the difference between digraph and 
ligature, etc.
Although I've searched the FAQ files and the rest of the unicode.org site, 
I haven't been able to find this info as yet. This site is huge! So can 
anyone provide me with an URL? Thanks.

No such information exists. These are descriptive terms that have been 
applied somewhat consistently, but not strictly.

Officially, character names are (somewhat) arbitrary, but unique 
identifiers of characters. They are neither always a description of the 
appearance of a character, nor do they always match the street name for the 
corresponding elements of the writing systems.

A./






Re: Looking for information on the UnicodeData file

2003-03-05 Thread Rick McGowan
By the way, the FAQ was updated today, thanks to people on this list.
Rick

My question is: where can I find formal definitions of the terms used 
in the Character Name field of the UnicodeData.txt file? Most 





RE: Ya-phalaa

2003-03-05 Thread Andy White
Jameskass wrote:

 If a font designer makes a special ligature form of 
 RA+JOPHOLA, then the easy solution would be to put a look-up 
 in the font's GSUB table:
 
 RA + ZWNJ + VIRAMA + YA --- my special ligature form

Now that simplicity makes me smile :-)
I would be surprised if anyone (even diligent Paul Nelson of Microsoft)
would accept that a sequence containing a non-joiner should be allowed
to form a ligature - I could be wrong - I await further responses to
see.

Andy




RE: length of text by different languages

2003-03-05 Thread Francois Yergeau
[EMAIL PROTECTED] wrote:
 I remember there was some study showing that, although UTF-8 encodes each 
 Japanese/Chinese character in 3 bytes, Japanese/Chinese usually uses 
 FEWER characters in writing to communicate information than 
 alphabet-based languages.
 
 Can anyone point me to such research?

I don't know of exactly what you want, but I vaguely remember a paper given
at a Unicode conference long ago that compared various translations of the
charter (or some such) of the Voice of America in a couple or three
encodings.  Hmm, let's see... could be this:

http://www.unicode.org/iuc/iuc9/Friday2.html#b3
Reuters Compression Scheme for Unicode (RCSU) 
Misha Wolf

No paper online, alas.  I remember that Chinese was a clear winner in terms
of # of characters.  In fact, I kind of remember that Chinese was so much
denser that it still won after RCSU (now SCSU) compression, which would mean
that a Han character contains more than twice as much info on average as a
Latin letter as used in (say) English.

This is all on pretty shaky ground, distant memories.  Perhaps Misha still
has the figures (if that's in fact the right paper).

-- 
François