RE: traditional vs simplified chinese

2003-02-14 Thread jarkko.hietaniemi
> > I know little about Chinese, but I have the impression that it is much more
> > common for several traditional characters to correspond to one simplified
> > character than vice versa. If that's true, it seems to me that it would make
> > most sense to fold to simplified.
>
> Hmmm ... Suppose I'm searching for some relatively obscure traditional
> character that occurs mostly in Wen Yen (u+6587 u+8A00 : Classical
> Chinese) and has a very specific meaning in Classical Chinese.  This
> character gets "folded" or "mapped" to a fairly common character in modern
> bai hua (u+767D u+8BDD) Chinese, and then the search proceeds.  The result
> set contains hundreds or thousands of irrelevant results related to the
> modern meaning, and I still have to sift through them looking for the
> needles in the haystack.  I'll try to provide a concrete example once I
> think of one ... it's been a long time since I studied Classical Chinese.

A search interface can first show the exact or "more exact" matches, and only
after that show the "less exact" matches.  Of course how this works with the
user interfaces of the applications is a different story.
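The two-tier idea can be sketched as follows. This is a minimal illustration only, using a hypothetical two-entry traditional-to-simplified table; a real system would use a complete mapping (e.g. derived from the Unihan variant data):

```python
# Hypothetical toy folding table: traditional ge4/zhe4 -> simplified forms.
TRAD_TO_SIMP = {"\u500B": "\u4E2A", "\u9019": "\u8FD9"}

def fold(text: str) -> str:
    """Map traditional characters to their simplified counterparts."""
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)

def ranked_search(query: str, documents: list[str]) -> list[str]:
    """Return documents containing the exact query first, then documents
    that only match after both query and document are folded."""
    exact = [d for d in documents if query in d]
    folded_q = fold(query)
    folded = [d for d in documents if d not in exact and folded_q in fold(d)]
    return exact + folded

docs = ["\u4E00\u500B\u4EBA", "\u4E00\u4E2A\u4EBA"]  # trad / simp "one person"
print(ranked_search("\u500B", docs))  # exact traditional hit ranked first
```

How the two tiers are then presented (merged, separated, labelled) is the user-interface question raised above.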





Re: traditional vs simplified chinese

2003-02-14 Thread Andrew C. West
On Fri, 14 Feb 2003 01:23:42 -0800 (PST), "Zhang Weiwu" wrote:

> I never saw 500B and 4E2A in the same printed document in the 20 years I
> lived in China. (Well, minus the years before I could read:) Unless you have
> an obvious reason to do so, printing a book in Traditional characters was
> considered somewhat wrong in the past in China. There is a language council
> (YuWei) in charge of such issues. At some periods in the past people wanted to
> completely kill Traditional Chinese. I remember an advertisement on the street
> when I was a child, which said people should report public appearances of
> Traditional Chinese characters to the local culture ministry of some sort. (Oh
> it's very OT) So let me correct my words: if you find a 4E2A, the text may
> still be Traditional, but if you find a 500B it is very very likely to be
> Traditional Chinese. I think we can search for 500B; if it does not occur, the
> text is likely to be in simplified characters. 
> 

You're right of course that searching for U+500B or U+4E2A would work as a
simple test for traditional/simplified Chinese in the vast majority of Chinese
language web pages. The point that I was trying to get across is that
Traditional/Simplified is an artificial distinction formalised in the second
half of the 20th century by the PRC's adoption of a "simplified" character set,
and enforced by the coding standards that developed on either side of the Taiwan
Strait. With Unicode it is now possible to overcome this artificial divide if
one wants to.

It's also true that some people did want to wipe out all traditional form
characters -- or even replace Chinese characters altogether -- but they have
certainly failed. There are still plenty of books (albeit mainly academic) that
are printed using traditional characters in China to this very day. One
interesting example of modern mixed Simplified/Traditional printing is the
standard 13 volume Dictionary of Chinese (Hanyu Da Cidian) published 1986-1994,
that gives head words in traditional Chinese, the definitions in simplified
Chinese and the quotations in traditional and/or simplified Chinese (simplified
for modern sources, traditional for pre-modern sources). Take a look at Vol.1
p.1501 under ge4, where U+500B and U+4E2A occur in about equal numbers.
Interestingly, the dictionary quotes Zheng Xuan, writing in the 2nd century
A.D., as stating that U+4E2A (the modern "simplified" form) is the correct form
of the character, and that U+500B (the modern "traditional" form) is a vulgar
substitute !

Now if Hanyu Da Cidian were to be put onto the internet ...

Regards,

Andrew




Re: traditional vs simplified chinese

2003-02-14 Thread John Cowan
Andrew C. West scripsit:

> Interestingly, the dictionary quotes Zheng Xuan, writing in the 2nd century
> A.D., as stating that U+4E2A (the modern "simplified" form) is the correct form
> of the character, and that U+500B (the modern "traditional" form) is a vulgar
> substitute !

IIRC this is true of very many simplified characters: the simplification
process was not so much one of inventing simplified characters, but of
inventorying the existing repertoire of simplified forms, ancient and
modern, and deciding which ones were now to be official.

-- 
John Cowan   http://www.ccil.org/~cowan  [EMAIL PROTECTED]
To say that Bilbo's breath was taken away is no description at all.  There
are no words left to express his staggerment, since Men changed the language
that they learned of elves in the days when all the world was wonderful.
--_The Hobbit_




Re: Plane 14 Tag Deprecation Issue

2003-02-14 Thread William Overington
On the last day of the consultation period I wonder if I may add a few notes
about tags and plane 14.

An interesting point is that there exists the possibility of defining
additional types of tagging using codes U+E0002 through to U+E001A.

Yesterday evening I began wondering for what matters such additional types
of tagging could potentially be useful, within the constraint of tag
characters themselves being restricted to an ASCII-like set of
characters.

Books in libraries are often classified with a code consisting of digits and
a full stop character.  For example, the number 515.53 is on a label which
is still on the spine of a book which I bought in a sale of withdrawn books
from a library.  So, if U+E0002 were used to introduce a tag for the library
book classification code, then a sequence starting with U+E0002 and using
some other tag characters could be used to classify the subject matter of
any document which is stored in computerized form.
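The mechanics of such a tag sequence can be sketched, since Unicode tag characters mirror printable ASCII at an offset of 0xE0000 (ASCII '5', 0x35, becomes U+E0035). Note that U+E0002 as a "library classification" introducer is William's suggestion only, not a defined Unicode semantic:

```python
# Plane 14 tag characters mirror printable ASCII at offset 0xE0000.
TAG_OFFSET = 0xE0000
# Hypothetical introducer for a library classification tag (not defined
# by Unicode; this is the suggestion made above).
LIBRARY_TAG_INTRODUCER = "\U000E0002"

def to_tag_characters(ascii_text: str) -> str:
    """Spell an ASCII string using Plane 14 tag characters."""
    return "".join(chr(TAG_OFFSET + ord(ch)) for ch in ascii_text)

tagged = LIBRARY_TAG_INTRODUCER + to_tag_characters("515.53")
print([f"U+{ord(ch):05X}" for ch in tagged])
# → ['U+E0002', 'U+E0035', 'U+E0031', 'U+E0035', 'U+E002E', 'U+E0035', 'U+E0033']
```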

Editions of books are classified using International Standard book numbers.
A tag code could be used to state an International Standard Book Number
using tag characters.

New concepts could be introduced.  Suppose that a new system of codes were
introduced, perhaps called something like International Literary Work
Numbers and that any author could obtain some of these numbers, which
numbers would have a format carefully designed so as not to be confusable
with International Standard Book Numbers, perhaps by having a letter other
than X within them (X being used in International Standard Book Numbers).
Then if someone writes a poem, he or she could allocate an International
Literary Work Number to the poem.  In a document, the code U+E0004 could
introduce the International Literary Work Number which Work Number would be
expressed using tag characters.  If the poem were on the web, most present
day computer systems could ignore the tag characters, yet advanced
futuristic software could search databases for specific codes or ranges of
codes and hopefully find the poem.

Yes, this is a potentially far-reaching line of research and it needs to be
allowed the freedom to flourish.

Looking further at the matter of plane 14, I am wondering whether there is
scope for the eventual production of a vector graphics system to be encoded
in plane 14.  I have had some good success with my eutocode graphics system
which is produced using codes from the Private Use Area.

http://www.users.globalnet.co.uk/~ngo/ast03000.htm

http://www.users.globalnet.co.uk/~ngo/ast03100.htm

Eutocode graphics uses 10 bit data input.  If a system in plane 14 were
produced, then 12 bit data input could be used, perhaps using all of the
codes U+E2000 through to U+E2FFF for data input.  Some of the codes in the
range U+E1000 through to U+E1FFF could be used for control codes for the
system, though not that many of them.  At its present stage of development
eutocode graphics uses only a few codes for control, all of them within the
range U+EB00 through to U+EBFF of the Private Use Area.

An interesting matter upon which I would appreciate some help please is as
follows.

In early 2002 I learned of a system called ViOS, which is a
three-dimensional interface for the web.  I learned about it from the
newsgroup alt.binaries.education.distance which showed a graphic and
provided a link to the http://www.vios.com website.  Unfortunately, that
website is no longer accessible.  ViOS is a magnificent program, it still
works well in offline mode, and is about 90 Megabytes in size.  Inspired by
the three-dimensional setting out of web pages in related groups used in
ViOS, I am trying to devise a vios-inspired three-dimensional index system
for the DVB-MHP (Digital Video Broadcasting - Multimedia Home Platform)
system.  I am designing this as an optional part of the eutocode graphics
system.  However, this eutovios system is designed to be implementable
within a Java program of under 100 kilobytes when compiled, hopefully less.
So eutovios is nothing like as detailed as ViOS.  I am thinking in terms of
a plane populated with objects, each of which can have a string of Unicode
characters as a label and a string of Unicode characters as an action string
so that the program knows what to do when the object is entered.  The
objects at present consist of three types, namely a cylinder stood on the
plane, a cone stood on the plane and a sphere which can be at any specified
height.  My thinking is that the spheres will be markers for clusters of
objects, the cylinders will lead to viewing a document or obeying a program
and that the cones will be used for cross-referencing to related topics.
Thus a collection of learning programs for distance education will hopefully
be indexed in a three-dimensional visual-spatial setting so that related
topics can be placed in proximity to one another.  The eutovios system
allows a particular three-dimensional environment to be set up using Private
Use Area codes from the r

Re: traditional vs simplified chinese

2003-02-14 Thread Thomas Chan
On Thu, 13 Feb 2003, Zhang Weiwu wrote:
>Take it easy: if you find one 500B (the measure word) it is usually enough to
>say the text is traditional Chinese; one 4E2A (measure word) means simplified
>Chinese. They never happen together in a logically correct document.

Others have already given examples of logically correct documents with
both characters, but one cannot always have the luxury of assuming the
data is not deviant.  For example, there are many electronic texts online
that are a hybrid of simplified and traditional text, because they contain
erroneous conversions from a simplified source document (typically GB2312)
to a traditional one (typically Big5).

I think zhe4 'this' (simp U+8FD9 / trad U+9019) might be better for a very
simple heuristic for modern text, since it occupies position #11 in at
least one frequency list (compared to #15 for the above-cited ge4), and as
far as I know, U+8FD9 is not one of those ancient characters that have
been promoted/reused as a simplified form.


On Thu, 13 Feb 2003, Andrew C. West wrote:
>Take, for example, this Web page --
>http://uk.geocities.com/Morrison1782/Texts/TianguanCifu.html -- which
>transcribes a short one-act play from the Cantonese Opera tradition, published
>during the Qing dynasty (probably early 19th century). It has U+4E2A
>(simplified
>ge4) but not U+500B (traditional ge4), and yet is written mostly in
>"traditional" characters. How would your algorithm classify such a page ?

Aren't such texts by default "traditional"?  "Simplified" text, besides
using simplified form characters, usually also entails refraining from
using variant forms (according to PRC definitions of what is a variant).  
And depending on how far one wants to stretch the definition, PRC-style
vocabulary, etc., cf., http://www.cjk.org/cjk/reference/chinvar.htm and
http://www.cjk.org/cjk/c2c/c2cbasis.htm .


On Thu, 13 Feb 2003, Marco Cimarosti wrote:
>The easiest way to do it is "folding" both the user's query and the content
>being sought to the same form (either traditional or simplified, it doesn't
>matter). It may also help to "fold" other kinds of variants besides
>simplified and traditional.

It would help to at least fold the Unicode z-variants together.  For
example, with the possibility of Unicode data, authors have the choice of
U+6236, U+6237, and U+6238 for hu4 'door', but these are not meaningful
distinctions, and certainly a lot harder to detect than the typical
traditional/simplified case.
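A minimal sketch of z-variant folding before matching, using only the three-entry hu4 example above (a real system would build its table from the Unihan kZVariant data):

```python
# Collapse the hu4 'door' z-variants U+6236 and U+6238 onto U+6237 as an
# arbitrary representative; str.translate accepts an ordinal->ordinal map.
Z_VARIANT_FOLD = str.maketrans({0x6236: 0x6237, 0x6238: 0x6237})

def fold_z_variants(text: str) -> str:
    """Normalise z-variants so that equivalent spellings compare equal."""
    return text.translate(Z_VARIANT_FOLD)

print(fold_z_variants("\u6236") == fold_z_variants("\u6238"))  # → True
```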


On Thu, 13 Feb 2003, Edward H Trager wrote:
>And I've seen books printed in the beginning years of the PRC era using
>mostly simplified, but with smatterings of traditional characters here and
>there.  These books were printed in the days of lead type, so I

Those must be the ones printed before the final 1964 version of the
simplification (drafts dating back to 1956, and some earlier pre-1949
usages in Communist-occupied areas), so they do not use all the
simplified characters that eventually appeared in the 1964 version.

There are even some cases of semi-simplified forms where one half of a
character might have been simplified according to pre-1964 rules, but the
simplification rule for the other half had to wait until 1964.  But I
think these might've been missed by Unicode, like some of the
ultra-simplified forms in the short-lived 1977 scheme, and Singapore's
temporarily different (from the PRC's) schemes prior to 1976.


On Fri, 14 Feb 2003, Andrew C. West wrote:
>Now if Hanyu Da Cidian were to be put onto the internet ...

How about the one here?  http://202.109.114.220/


Thomas Chan
[EMAIL PROTECTED]






Re: Plane 14 Tag Deprecation Issue

2003-02-14 Thread Michael Everson
At 13:38 + 2003-02-14, William Overington wrote:


Books in libraries are often classified with a code consisting of digits and
a full stop character.  For example, the number 515.53 is on a label which
is still on the spine of a book which I bought in a sale of withdrawn books
from a library.  So, if U+E0002 were used to introduce a tag for the library
book classification code, then a sequence starting with U+E0002 and using
some other tag characters could be used to classify the subject matter of
any document which is stored in computerized form.


No, no, no, no, no. People should use XML or other forms of markup. 
You are headed into a dead end, William.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: Everson Mono

2003-02-14 Thread Roozbeh Pournader
On Thu, 13 Feb 2003, Michael Everson wrote:

> Anyway, while there are gaps in this font, of course there are gaps 
> in all the other fonts out there as well. Announcing, then, the 
> biggest monowidth font I'm aware of. Please see 
> http://www.evertype.com/emono/

Well, since many of us can't open that on a PC, would you tell us the 
number of glyphs, so we can correct you if we find out about any bigger 
monowidth font?

roozbeh





Byte order mark (?) mars Unicode homepage

2003-02-14 Thread Michael Everson
Under Mac OS X, Explorer 5.2.2 displays a euro sign above the red 
title bar, creating a white bar which pushes the red bar down. This 
doesn't occur on other pages.

Safari doesn't display the euro sign but the white bar is there. Same 
for OmniWeb. I tried to use UnicodeChecker in the OS X Services but 
it said the character which precedes  in the document is 
SPACE. I don't think I believe it
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: Everson Mono

2003-02-14 Thread Michael Everson
At 18:44 +0330 2003-02-14, Roozbeh Pournader wrote:


Well, since many of us can't open that on a PC, would you tell us 
the number of glyphs, so we can correct you if we find out about any 
bigger monowidth font?

7,072
--
Michael Everson * * Everson Typography *  * http://www.evertype.com




Re: Everson Mono

2003-02-14 Thread Roozbeh Pournader
On Fri, 14 Feb 2003, Michael Everson wrote:

> >Well, since many of us can't open that on a PC, would you tell us 
> >the number of glyphs, so we can correct you if we find out about any 
> >bigger monowidth font?
> 
> 7,072

Wow! The next biggest monowidth non-CJK font I know has just 5,013. It
has Latin, Greek, Cyrillic, Armenian, Georgian, Ethiopic, Hebrew, Arabic,
Thai, Ogham, Runic, Braille, and a lot of that phonetic and technical
stuff. No Indic, no non-BMP. But the very nice thing about it is that
it's public domain:

http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html

Well, to be honest, I know of a monowidth font that may have more glyphs
than Everson Mono. But I guess that doesn't count, since it's a
single-application font. The SC Unipad's internal font:

http://www.unipad.org/

roozbeh





Re: Plane 14 Tag Deprecation Issue

2003-02-14 Thread Rick McGowan
William Overington wrote:

> Books in libraries are often classified with a code consisting
> of digits and a full stop character.

And there are already long-established standards for library catalogs and  
computerization of same. Ask your local librarian about "MARC" for  
instance.

Rick






Re: traditional vs simplified chinese

2003-02-14 Thread Andrew C. West
On Fri, 14 Feb 2003 07:45:44 -0800 (PST), Thomas Chan wrote:

> I think zhe4 'this' (simp U+8FD9 / trad U+9019) might be better for a very
> simple heuristic for modern text, since it occupies position #11 in at
> least one frequency list (compared to #15 for the above-cited ge4), and as
> far as I know, U+8FD9 is not one of those ancient characters that have
> been promoted/reused as a simplified form.

On the other hand I don't think that zhe4 is used in Cantonese, whereas I think
that ge4 is, so it wouldn't be so good for pages written in Cantonese (not that
I have ever seen any, but I'm sure there must be some). Probably even a simple
heuristic would need to try several common characters such as ge4 and zhe4.
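Such a multi-character heuristic can be sketched as follows; the marker pairs (ge4 U+500B/U+4E2A, zhe4 U+9019/U+8FD9) come from the discussion above, and the all-or-nothing decision rule is illustrative only:

```python
# Distinctive marker characters from the discussion above.
TRAD_MARKERS = {"\u500B", "\u9019"}  # traditional ge4, zhe4
SIMP_MARKERS = {"\u4E2A", "\u8FD9"}  # simplified ge4, zhe4

def classify(text: str) -> str:
    """Rough traditional/simplified classification by marker counting."""
    trad = sum(text.count(ch) for ch in TRAD_MARKERS)
    simp = sum(text.count(ch) for ch in SIMP_MARKERS)
    if trad and not simp:
        return "traditional"
    if simp and not trad:
        return "simplified"
    if trad and simp:
        return "mixed"
    return "unknown"

print(classify("\u9019\u500B"))  # → traditional
print(classify("\u8FD9\u4E2A\u9019"))  # → mixed
```

As the Cantonese Opera page shows, any such heuristic will misclassify texts that legitimately mix forms, so the result should be treated as a hint, not a verdict.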

> Aren't such texts by default "traditional"?  "Simplified" text, besides
> using simplified form characters, usually also entails refraining from
> using variant forms (according to PRC definitions of what is a variant).  

Probably true, but the point that I was making is that the simplified ge4 in the
text would confuse a simple heuristic.

> There are even some cases of semi-simplified forms where one half of a
> character might have been simplified according to pre-1964 rules, but the
> simplification rule for the other half has to wait until 1964.  But I
> think these might've been missed by Unicode, like some of the
> ultra-simplified forms in the short-lived 1977 scheme, and Singapore's
> temporarily different (from the PRC's) schemes prior to 1976.

I think that most of the 1977 simplifications have already been encoded in
Unicode, but any that haven't, along with the hybrid semi-simplified forms found
in some printed books from the 50s and 60s, will probably be included in CJK-C
along with the rest of its unnecessary baggage (excuse my distaste for CJK-C,
but I think that the Ideographic Rapporteur Group is indiscriminately collecting
characters that in most cases probably do not need to be encoded, just for the
sake of encoding as many characters as possible - 24,000+ and counting - see the
"CJK Extension C Project" at http://www.cse.cuhk.edu.hk/~irg/irg/extc/CJK_Ext_C.htm
for details).

> >Now if Hanyu Da Cidian were to be put onto the internet ...
> 
> How about the one here?  http://202.109.114.220/

Yes, this is an excellent resource. Although the Hanyu Da Cidian look-up only
gives definitions, and none of the extremely useful quotations found in the
printed book, it still mixes traditional form head words with simplified
definitions, so that both ge4 simplified and traditional are found together on
the same page if you search under U+500B and look at the appended compound
words. I guess that according to Thomas's definition of Simplified Chinese, this
makes it a Traditional Chinese page, even though most of the text is in
simplified Chinese !?

Incidentally, for those interested in UTF-16 Chinese web pages, I noticed that
this site is encoded as UTF-16LE.

On a related matter, I was wondering about language tagging for Chinese. "zh-CN"
and "zh-TW" are used quite frequently, but what do they imply ? Is an HTML page
tagged as "zh-CN" expected to be composed of simplified characters, and a page
tagged as "zh-TW" composed of traditional characters ? Or does the CN or TW
imply nothing about the orthography of the text, in which case the CN or TW may
simply allow selection of an appropriate font ? What if I am writing a Chinese
page here in England - should I put "zh-UK" or should I make a political
decision as to whose side I'm on, and use "zh-CN" or "zh-TW" ?

On the other hand, "zh-simplified" and "zh-traditional" are sometimes found.
These tags are less politically charged, but miss out on mixed
simplified/traditional pages. Is there a "zh-mixed" ?

Andrew




Re: traditional vs simplified chinese

2003-02-14 Thread John Cowan
Andrew C. West scripsit:

> On a related matter, I was wondering about language tagging for Chinese. "zh-CN"
> and "zh-TW" are used quite frequently, but what do they imply ?

They are usually (mis)used to mean "Mandarin, simplified characters" and
"Mandarin, traditional characters" respectively.  IMHO, the language tagging
list needs to create zh-hant and zh-hans (and perhaps zh-latn) tags.

> What if I am writing a Chinese
> page here in England - should I put "zh-UK" or should I make a political
> decision as to whose side I'm on, and use "zh-CN" or "zh-TW" ?

If used correctly these should imply the variety of Chinese in use.
To overdo it slightly, the tag zh-yue-taishan-us-ny-nyc could mean "the
kind of Cantonese they speak in New York's Chinatown".  In fact this
tag would need to be registered (i.e. get past Michael Everson) before
it would be valid; so far we have zh-yue, but no further granularity
is currently valid.

zh-uk would therefore be the U.K. dialect of Chinese, probably but not
necessarily Mandarin Chinese.  Likewise nv-dk would be the Danish dialect
of Navajo.  :-)  These are allowed because tags with the forms xx, xxx,
xx-yy, and xxx-yy (where xx and xxx are ISO 639 codes and yy are ISO
3166 country codes) are in effect pre-registered.

> On the other hand, "zh-simplified" and "zh-traditional" are sometimes found.
> These tags are less politically charged, but miss out on mixed
> simplified/traditional pages. Is there a "zh-mixed" ?

These tags are not registered, i.e. bogus.

-- 
LEAR: Dost thou call me fool, boy?  John Cowan
FOOL: All thy other titles  http://www.ccil.org/~cowan
 thou hast given away:  [EMAIL PROTECTED]
  That thou wast born with. http://www.reutershealth.com




Converting old TrueType fonts to Unicode

2003-02-14 Thread Alan Wood
Two people have recently asked me how to convert TrueType fonts to make them
Unicode compliant.  One person wants to do this for Cyrillic, and the other
for Byzantine Musical Symbols.

I know nothing about creating or modifying fonts, so I hope one of you will
be willing to share your expertise.

Please copy your reply to the two people in the CC field of this message.

Thank you.

Alan Wood
http://www.alanwood.net (Unicode, special characters, pesticide names)





Re: Everson Mono

2003-02-14 Thread Elliotte Rusty Harold
I first made Everson Mono glyphs in 8-bit font sets in 1994. I've 
always been a perfectionist, but huge fonts are just so huge... 
there's never a good time to release, so why not now? (Several 
people have written to nag me about it.)


Very interesting. I've been using Code2000, but this might be more 
appropriate for the next time I need something like this. Is there 
any list anywhere of the ranges or code points that are included?
--

+---++---+
| Elliotte Rusty Harold | [EMAIL PROTECTED] | Writer/Programmer |
+---++---+
|   Processing XML with Java (Addison-Wesley, 2002)  |
|  http://www.cafeconleche.org/books/xmljava |
| http://www.amazon.com/exec/obidos/ISBN%3D0201771861/cafeaulaitA  |
+--+-+
|  Read Cafe au Lait for Java News:  http://www.cafeaulait.org/  |
|  Read Cafe con Leche for XML News: http://www.cafeconleche.org/|
+--+-+



Re: Everson Mono

2003-02-14 Thread John Hudson
At 07:25 AM 2/14/2003, Michael Everson wrote:


At 18:44 +0330 2003-02-14, Roozbeh Pournader wrote:


Well, since many of us can't open that on a PC, would you tell us the 
number of glyphs, so we can correct you if we find out about any bigger 
monowidth font?

7,072


Andale Mono WT from Monotype has 50,422 glyphs. At least, that's how many 
the slightly old version I have contains. My guess is that they have a 
newer version with more glyphs. Sorry, Michael.

By the way, since Mac OS X supports datafork TrueType fonts as well as 
traditional Apple resource fork fonts, you could easily make a single 
binary that will work equally well on the latest versions of the Mac, 
Windows and Linux operating systems.

John Hudson

Tiro Typeworks		www.tiro.com
Vancouver, BC		[EMAIL PROTECTED]

It is necessary that by all means and cunning,
the cursed owners of books should be persuaded
to make them available to us, either by argument
or by force.  - Michael Apostolis, 1467




Re: Converting old TrueType fonts to Unicode

2003-02-14 Thread Sayamindu Dasgupta
On Fri, 2003-02-14 at 22:30, Alan Wood wrote:
> Two people have recently asked me how to convert TrueType fonts to make them
> Unicode compliant.  One person wants to do this for Cyrillic, and the other
> for Byzantine Musical Symbols.
> 
> I know nothing about creating or modifying fonts, so I hope one of you will
> be willing to share your expertise.
> 
We have been working on that for some time
(www.nongnu.org/freebangfont/) - but we are not experts at it.
The usual procedure is to use a font editor like pfaedit
(http://pfaedit.sourceforge.net/) and cut and paste the glyphs so that
they occupy the proper places.
However, pfaedit has some problems with TTF files, so you may need to
use a commercial font development tool (I have no experience with those).
-regards-
Sayamindu

-- 
Sayamindu Dasgupta [ http://www.peacefulaction.org/sayamindu/ ]

=
Speak out on social and cultural issues
at
PeacefulAction.Org
http://www.peacefulaction.org
*


A black cat crossing your path signifies that the animal is going
somewhere.
-- Groucho Marx





Re: Converting old TrueType fonts to Unicode

2003-02-14 Thread John Hudson
At 09:00 AM 2/14/2003, Alan Wood wrote:


Two people have recently asked me how to convert TrueType fonts to make them
Unicode compliant.  One person wants to do this for Cyrillic, and the other
for Byzantine Musical Symbols.


The easiest way to do this is to invest in a commercial font development 
tool. The best one for re-encoding most fonts is FontLab, although I 
believe you might also be able to use the cheaper TypeTool from the same 
company. See http://www.fontlab.com. However, the current version of 
FontLab does not support Unicode supplementary plane codepoints needed for 
Byzantine Musical Symbols (see note below).

When you open the font in FontLab you will see the Font Window containing 
individual cells for each glyph in the font. In order to re-encode a font 
for Unicode, you can use one of two methods:

1. Manual.

Select the first glyph you wish to encode and open the Rename Glyph window 
(ctrl+\ on Windows). In this window you can both rename glyphs and assign 
Unicode values. In the Unicode field you can enter a hex Unicode value 
(e.g. 0049) or multiple Unicode values separated by commas if you want to 
map a glyph to more than one Unicode value (e.g. the 'space' glyph is 
typically mapped to the codepoints for both the breaking and non-breaking 
space characters). Note that FontLab will not permit you to assign the same 
name or Unicode value to two different glyphs. Note also that changing the 
name or Unicode value may cause the glyphs to change position in the Font 
Window; this is because the cells in the window are ordered by encoding 
(glyph name) or codepage (Unicode value) depending on the view selected. 
When you have assigned the correct Unicode value to the glyph, click Rename 
Next Glyph and proceed. Obviously this manual method can take quite a long 
time, and is not recommended if you need to rename or re-encode large 
numbers of glyphs in multiple fonts.

2. Automated.

In the FontLab/Mapping folder is a file with the extension .NAM. This is a 
naming file that maps glyph names to Unicode values. You can create your 
own .NAM files, using the standard one as a model, and this is recommended 
if you have, for example, a collection of fonts that all use the same 
non-Unicode encoding. Map the names in your existing fonts to correct 
Unicode values in the new .NAM file, save the file, and then re-open 
FontLab. You can now apply the Generate Unicode function to automatically 
encode/re-encode all the glyphs in the font in one easy step. Note that the 
.NAM file needs a unique name in the first line, e.g.:

%%FONTLAB NAMETABLE: My New Naming File

The format for the .NAM file entries is Unicode value + space + glyph name, 
e.g.

0x0449 c.shcha
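Generating such a .NAM file from an existing name-to-codepoint mapping can be sketched like this; the header and entry format follow the description above, while the mapping contents (including the second glyph name) are hypothetical examples:

```python
# Hypothetical glyph-name mapping for a legacy Cyrillic encoding;
# "c.shcha" is the example entry from the description above.
mapping = {0x0449: "c.shcha", 0x0448: "c.sha"}

# First line must be a unique nametable header, per the format described.
lines = ["%%FONTLAB NAMETABLE: My New Naming File"]
for codepoint, name in sorted(mapping.items()):
    # Entry format: Unicode value + space + glyph name, e.g. "0x0449 c.shcha"
    lines.append(f"0x{codepoint:04X} {name}")

nam_text = "\n".join(lines) + "\n"
print(nam_text)
```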


As noted above, the current version of FontLab (4.2.5) does not support 
supplementary plane codepoints internally, which means you cannot use 
either of the above methods. The company that makes FontLab also makes a 
tool called Asia Font Studio; this is considerably more expensive than 
FontLab, and is intended for making CJK fonts. It has a mechanism for 
generating cmap tables including supplementary plane characters, based on 
using a special glyph naming scheme.

Another option for re-encoding fonts is to hack the font cmap table itself. 
The easiest way to do this is probably with Just van Rossum's TTX tool. See 
http://sourceforge.net/projects/fonttools/. This is a Python-based open 
source tool that decompiles TTF and OTF fonts to a human-readable XML file, 
which can then be edited and recompiled to a font. I have used this tool 
for a variety of purposes, but do not have any experience working on fonts 
with supplementary plane codepoints, so cannot verify its usefulness for 
this purpose.

I hope some of this information helps.

John Hudson

Tiro Typeworks		www.tiro.com
Vancouver, BC		[EMAIL PROTECTED]

It is necessary that by all means and cunning,
the cursed owners of books should be persuaded
to make them available to us, either by argument
or by force.  - Michael Apostolis, 1467




Re: Everson Mono

2003-02-14 Thread Michael Everson
At 09:12 -0800 2003-02-14, John Hudson wrote:

At 07:25 AM 2/14/2003, Michael Everson wrote:


At 18:44 +0330 2003-02-14, Roozbeh Pournader wrote:


Well, since many of us can't open that on a PC, would you tell us 
the number of glyphs, so we can correct you if we find out about any 
bigger monowidth font?

7,072


Actually that's the number of characters. Not all the characters have 
glyphs yet though. Currently I'm churning through all those new 
supplemental arrows. Ugh. I have been enjoying working on the 
Ethiopic though, which isn't yet integrated. There are a lot of 
those


Andale Mono WT from Monotype has 50,422 glyphs. At least, that's how 
many the slightly old version I have contains. My guess is that they 
have a newer version with more glyphs. Sorry, Michael.

I'm not bothered. How can I compete with Monotype? But I made all my 
glyphs myself. I bet they had a team doing it. :-)

Anyway, I bet the size is all them pesky Han characters... Have they 
a nice Ogham? :-)
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: Converting old TrueType fonts to Unicode

2003-02-14 Thread John H. Jenkins

On Friday, February 14, 2003, at 01:12 PM, John Hudson wrote:


Another option for re-encoding fonts is to hack the font cmap table 
itself. The easiest way to do this is probably with Just van Rossum's 
TTX tool. See http://sourceforge.net/projects/fonttools/. This is a 
Python-based open source tool that decompiles TTF and OTF fonts to a 
human-readable XML file, which can then be edited and recompiled to a 
font. I have used this tool for a variety of purposes, but do not have 
any experience working on fonts with supplementary plane codepoints, 
so cannot verify its usefulness for this purpose.


For people on Mac OS X, there is a set of tools available for download 
from  which, like TTX, can decompile 
tables from TrueType and OpenType fonts and let the user edit the 
results.  These *do* support astral characters.

==
John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://www.tejat.net/




Everson Mono contents

2003-02-14 Thread Michael Everson
Basic Latin
Latin 1
Latin Extended-A
Latin Extended-B
IPA Extensions
Spacing Modifier Letters
Combining Diacritical Marks
Greek and Coptic
Cyrillic
Cyrillic Supplement
Armenian
Hebrew
Georgian
Cherokee
Unified Canadian Aboriginal Syllabics
Ogham
Runic
Phonetic Extensions
Latin Extended Additional
Greek Extended
General Punctuation
Superscripts and Subscripts
Currency Symbols
Combining Diacritical Marks for Symbols
Letterlike Symbols
Number Forms
Arrows
Mathematical Operators
Miscellaneous Technical
Control Pictures
Optical Character Recognition
Enclosed Alphanumerics
Box Drawing
Block Elements
Geometric Shapes
Miscellaneous Symbols
Dingbats
Braille Patterns (*** nearly)
Miscellaneous Mathematical Symbols-A
Supplemental Arrows-A
Supplemental Arrows-B
Miscellaneous Mathematical Symbols-B (*** nearly)
Supplemental Mathematical Operators (*** nearly)
CJK Symbols and Punctuation
Hiragana
Katakana
Katakana Phonetic Extensions (*** nearly)
Alphabetic Presentation Forms
Variation Selectors
Combining Half Marks
Specials
--
Michael Everson * * Everson Typography *  * http://www.evertype.com




Re: Everson Mono contents

2003-02-14 Thread Michael Everson
Oh yeah, and Yijing Hexagram Symbols (*** nearly)
--
Michael Everson * * Everson Typography *  * http://www.evertype.com




Re: Plane 14 Tag Deprecation Issue

2003-02-14 Thread jameskass
.
William Overington wrote on the subject of Plane Fourteen tags
and closed with a haiku.

Since the best arguments in favor of not deprecating Plane
Fourteen tags of necessity involve suggested or potential
uses for those characters, and it has been mentioned that
discussing such potential is frivolous at best and that such
potential uses aren't valid arguments against deprecation 
(at least to those who are prejudiced in this regard),
I guess the only thing I can do at this point is to close
with a haiku:

Wanting usefulness
Formally deprecated
Null and voided now

Best regards,

James Kass
.