Re: Question: the german umlaut

2002-11-08 Thread James E. Agenbroad
On Fri, 8 Nov 2002, Magda Danish (Unicode) wrote:

 
 
  -Original Message-
  
  Date/Time:Fri Nov  8 09:05:40 EST 2002
  Contact:  [EMAIL PROTECTED]
  Report Type:  Other Question, Problem, or Feedback
  
  Hello
  
  I just wanted to know how much space in bytes the Latin-1 
  characters such as the german umlaut characters take up in 
  UTF-8 encoding. Is it still just one byte or does it now 
  require 2 bytes?
  
  Regards,
  
  Magnus Rosenberg
  
  -- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
  (End of Report)
  
  
 
 
   Friday, November 8, 2002
Mr. Rosenberg,
 Without delving into the issues of separately encoded combining
characters vs. precomposed combinations, I think the short answer is that
in UTF-8 all Unicode characters except those with ASCII codes 00 to 7F
are two or more bytes long.  If I'm wrong others will correct me.
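For anyone who wants to check this, here is a minimal sketch in Python (my choice of language for illustration, not something used in the thread). It encodes a few sample characters and reports their UTF-8 length. The sample set and the helper name utf8_length are assumptions made for the example.

```python
# Encoded length, in bytes, of a few sample characters in UTF-8.
# The samples below are illustrative; any other characters can be added.
samples = {
    "A": "ASCII letter (U+0041)",
    "\u00e4": "a with diaeresis (U+00E4)",
    "\u00f6": "o with diaeresis (U+00F6)",
    "\u00fc": "u with diaeresis (U+00FC)",
    "\u00df": "sharp s (U+00DF)",
    "\u20ac": "euro sign (U+20AC)",
}


def utf8_length(ch: str) -> int:
    """Return the number of bytes needed to encode ch in UTF-8."""
    return len(ch.encode("utf-8"))


for ch, label in samples.items():
    print(f"{label}: {utf8_length(ch)} byte(s)")
```

The Latin-1 letters in the range U+0080 to U+00FF, including the German umlauts, each come out at two bytes, while the ASCII letter stays at one.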
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 It is not true that people stop pursuing their dreams because they
grow old, they grow old because they stop pursuing their dreams. Adapted
from a letter by Gabriel Garcia Marquez.
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
 Addresses: Office: Phone: 202 707-9612; Fax: 202 707-0955; US
mail: I.T.S. Sys.Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, 
Washington, D.C. 20540-9334 U.S.A.
Home: Phone: 301 946-7326; US mail: Box 291, Garrett Park, MD 20896.  





Small 's' with grave?

2002-09-25 Thread James E. Agenbroad

   Wednesday, September 25, 2002
A friend of a friend asked me if Unicode has a code for small s with a
grave.  I can't find one; am I overlooking it?  Has it been added
since 3.0? Thanks in advance.   

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: [OT] looking for electronic dictionaries

2002-08-30 Thread James E. Agenbroad

On Thu, 29 Aug 2002, Eric Muller wrote:

 For my personal use, I would like to acquire electronic dictionaries, 
 principally for the major European languages, with the following 
 characteristics:
 
 - reputable source
 
 - raw datafiles accessible - I appreciate the interfaces that 
 dictionary vendors may provide, but I want to be able to write my own 
 code to find the data I am looking for
 
 - the wordlist is the principal aspect; I can live without definitions.
 
 - markup about the structure of words, for things like hyphenation, 
 etc. (or from which hyphenation can be derived)
 
 - some form of frequency count would be nice
 
 For example, I'd like to compute something like: the average French 
 character occupies x bytes in UTF-8, with average defined in sync with 
 the frequency count. And I'd like to compute things like spelling 
 changes introduced by hyphenation in Dutch.
 
 Any pointers?
 
 Thanks,
 Eric.
 Friday, August 30, 2002
Eric,
I have no sources to suggest, just a comment.  The average UTF-8
length of a French word will depend to some extent on whether separate
codes are used for combining characters/diacritics or a single code for a 
precomposed letter + diacritic combination. It will matter more if you
want the average length of Czech or Polish words. Fortunately Vietnamese
isn't European.
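The effect can be measured with a short Python sketch (Python is an assumption for illustration). It compares the UTF-8 length of the same word in precomposed form (NFC) and in fully decomposed form (NFD), using the standard unicodedata module. The sample words are arbitrary picks, not taken from any frequency list.

```python
import unicodedata


def utf8_len(text: str, form: str) -> int:
    """UTF-8 byte length of text after normalizing to the given form."""
    return len(unicodedata.normalize(form, text).encode("utf-8"))


# Arbitrary sample words: French, Czech, Polish, Vietnamese.
words = ["\u00e9t\u00e9", "p\u0159\u00edli\u0161", "\u017c\u00f3\u0142\u0107", "Vi\u1ec7t"]

for word in words:
    nfc = utf8_len(word, "NFC")  # precomposed letter + diacritic codes
    nfd = utf8_len(word, "NFD")  # base letter followed by combining marks
    print(f"{word}: NFC {nfc} bytes, NFD {nfd} bytes")
```

With a frequency-weighted word list in place of the four samples, the same function would give the two averages Eric asked about.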

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: Recent changes to i18n standards

2002-08-26 Thread James E. Agenbroad

On Fri, 23 Aug 2002 [EMAIL PROTECTED] wrote:

 On 08/23/2002 04:54:58 AM Doug Ewell wrote:
 
 For those who like to keep up on such things, there have been recent
 changes to the code lists of two important standards related to
 internationalization -- ISO 639 (language codes) and ISO 3166-2 (codes
 for country subdivisions).
 
 In addition to the two new code elements in ISO 639-2, there's another 
 development of interest in relation to language coding: ISO/TC 37 has 
 begun working toward development of a new part to this standard, to be 
 designated ISO 639-3, that will provide 3-letter identifiers for all known 
 languages. The relationship to part 2 will be that the 
 individual-language code elements in part 2 will be a subset of part 3 
 (part 2 will continue to have collective-language identifiers but part 3 will 
 not). The reason for the subsetting relationship of part 2 to part 3 
 (rather than just adding a bunch of things to part 2) is that some user 
 communities (e.g. bibliographers) have indicated a need to restrict 
 individual-language identifiers to only developed languages with 
 significant bodies of literature. I'm anticipating a time frame of about 
 one year for this to be completed (assuming the process goes smoothly).
 
 
 
 - Peter
 
 
 ---
 Peter Constable
 
 Non-Roman Script Initiative, SIL International
 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
 Tel: +1 972 708 7485
 E-mail: [EMAIL PROTECTED]
 
Monday, August 26, 2002
Peter, 
 I congratulate you and others who reached this reasonable solution.
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: FW: New version of TR29:

2002-08-20 Thread James E. Agenbroad

On Tue, 20 Aug 2002, Andrew C. West wrote:

 On Tue, 20 August 2002, John Cowan wrote:
 
  It has no sound, but neither does Romance "h"; both exist as a
  marker of etymology.
  
 
 But in fact the apostrophe may have a sound in dialectal English, where it is
 used to represent a medial or final glottal stop (e.g. a drin' a wa'er for
 a drink of water in Cockney English). In this usage it is surely acting as
 a letter, not a punctuation mark.
 
 Andrew
 
 
Tuesday, August 20, 2002
There is also fo'c'sle, the abridged version of forecastle. :-)
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: FW: New version of TR29:

2002-08-20 Thread James E. Agenbroad

On Tue, 20 Aug 2002, Michael Everson wrote:

 At 10:10 -0700 2002-08-20, Andrew C. West wrote:
 On Tue, 20 August 2002, John Cowan wrote:
 
It has no sound, but neither does Romance "h"; both 
 exist as a marker of etymology.
 
 But in fact the apostrophe may have a sound in dialectal English, 
 where it is used to represent a medial or final glottal stop (e.g. a 
 drin' a wa'er for a drink of water in Cockney English). In this 
 usage it is surely acting as a letter, not a punctuation mark.
 
 It is acting, as it did in its origins, as a graphic symbol showing 
 the omission of a letter.
 -- 
 Michael Everson *** Everson Typography *** http://www.evertype.com
 
 
Tuesday, August 20, 2002
In English, at least in questions, the apostrophe signals more than just
omission of a letter.  A co-worker here has a sign that says, "What part
of No do you not understand?"  When written or spoken as a contraction
this becomes "What part of No don't you understand?" in which "you" gets
transposed to follow the short form of "not".  When I see a line break with
"n't" on the new line I wince but keep on reading.
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: Discrepancy between Names List Code Charts?

2002-08-16 Thread James E. Agenbroad

On Fri, 16 Aug 2002, John Cowan wrote:

 John Hudson scripsit:
 
  The newish Gagauz Turkish Latin-script orthography derives from both 
  Turkish and Romanian models. This has led to a peculiar hybrid, in which 
  the cedilla is used for the s and the commaaccent is used for the t.  
 
 ME's remarks in _The Alphabets of Europe_ seem downright bizarre to me:
 
 # Note that in
 # Romania, Gagauz uses the characters S WITH COMMA BELOW and T WITH COMMA BELOW.
 # In inferior Gagauz typography, the glyphs for these characters are sometimes
 # drawn with CEDILLAs, but it is strongly recommended to avoid this practice.
 # However, because Gagauz is a Turkic language, it may be left to the user to
 # decide whether S WITH COMMA BELOW (as in Romanian) or S WITH CEDILLA (as in
 # Turkish) is preferred.
 
 It seems that the last two sentences say that it may be left to the user
 to decide whether inferior or superior typography is preferred.
 
 -- 
 De plichten van een docent zijn divers, John Cowan
 die van het gehoor ook. [EMAIL PROTECTED]
   --Edsger Dijkstra http://www.ccil.org/~cowan
 
 
   Friday, August 16, 2002
If fools such as I who know no Gagauz may rush in:  It seems to me that
reading is a learned habit.  When different people learned to read Gagauz
they may have learned to expect different forms of glyphs because that's
what they were taught. Assuming that teaching different conventions isn't
based on an evil intent to pervert the minds of children, differing
conventions are not bad, only different. It may be that such different
conventions will gradually converge on one, but I think Unicode would be
wise to avoid attempting to impose standards on how written text appears
and should instead aim to facilitate presentation of text that is legible
under the conventions of current readers.
 We all live with two forms of lower case t (with and without the
curved bottom) and lower case g (with and without the closed descender). 
It's possible these different conventions will disappear but until they do
some will want one and some will want the other and I would hope Unicode
could permit rendering software to provide either. 
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: Unicode certification - quote correction and attribution

2002-07-26 Thread James E. Agenbroad

On Thu, 25 Jul 2002, Kenneth Whistler wrote:

 [snip]
 
 And the devil is in the details. Looking a bit at your suggestions,
 for example:
   [snip] 
 
Friday, July 26, 2002
No, "God is in the details," Ludwig Mies van der Rohe (1886-1969) said. And
that's the beauty of Unicode IMHO.
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: The standard disclaimer

2002-07-24 Thread James E. Agenbroad

On Wed, 24 Jul 2002, Tex Texin wrote:

 
 
 John Hudson wrote:
  
  At 08:41 AM 24-07-02, [EMAIL PROTECTED] wrote:
  
 from:Doug Ewell [EMAIL PROTECTED]
 subject: Re: The standard disclaimer
   
James Kass jameskass at worldnet dot att dot net wrote:
   
   However, just as
 no trinities have fourth persons (Zeppo Marx notwithstanding)

 What about Gummo?  (Or,... Karl?  or... Deutsche ?)
   
Stretch?
   
  Skid??
  
  Combining?
 
 Hall?
 Check?
 Re- ?
 Water?
 
 -- 
 -
 Tex Texin   cell: +1 781 789 1898   mailto:[EMAIL PROTECTED]
 Xen Master  http://www.i18nGuy.com
  
 XenCraft  http://www.XenCraft.com
 Making e-Business Work Around the World
 -
 
 
Wednesday, July 24, 2002
Depending on whether I'm at work or commuting, MARC = 1. MAchine Readable
Cataloging, or, 2. Maryland Alliance of Rail Commuters.
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: *Why* are precomposed characters required for backward compatibility?

2002-07-10 Thread James E. Agenbroad

On Tue, 9 Jul 2002, Kenneth Whistler wrote:

 David Hopwood wrote:
 
  Marco Cimarosti wrote:
 
  
 
  The only difficulty would have been if a pre-existing standard had supported
  both precomposed and decomposed encodings of the same combining mark. I don't
 ^^
 /character
  think there are any such standards (other than Unicode as it is now), are
  there?
 
 Not to my knowledge.
 
  
  
 
 --Ken
 
  
  - -- 
  David Hopwood [EMAIL PROTECTED]
 
 
  Wednesday, July 10, 2002
ISO 5426 - 1980, Extension of the Latin alphabet coded character set for
bibliographic interchange, and its similar US counterpart, ANSI Z39.64,
Extended Latin alphabet coded character set for bibliographic use
(ANSEL), do contain both separate codes for diacritics for use with any
letter (e.g. tilde, grave, cedilla, etc.) and codes for characters which
could have been further decomposed (L with stroke, O with stroke, D with
stroke (all with codes for both upper and lower case) etc.) but were not.  
In the US, libraries have used ANSEL since about 1968 because they needed
to support many different languages. 
I have to agree with Ken that ISO 10646 and Unicode probably would not
have gotten this far if they had excluded all precomposed combinations. 
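As a side check on the "could have been further decomposed but were not" letters, the following Python sketch (Python and the helper name canonical_decomposition are assumptions for illustration) asks the Unicode Character Database, via the standard unicodedata module, whether a character has a canonical decomposition. It suggests that Unicode treats the stroke letters the same way the bibliographic sets did, while letters with tilde or cedilla do decompose.

```python
import unicodedata


def canonical_decomposition(ch: str) -> list:
    """Return the canonical decomposition of ch as code points, or [] if none."""
    fields = unicodedata.decomposition(ch).split()
    if fields and fields[0].startswith("<"):
        return []  # only a compatibility mapping, not a canonical one
    return [int(field, 16) for field in fields]


for ch in ["\u0141", "\u00d8", "\u0110", "\u00f1", "\u00e7"]:
    parts = canonical_decomposition(ch)
    shown = " + ".join(f"U+{cp:04X}" for cp in parts) or "(none)"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {shown}")
```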
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: Codes for codes for codes for... (RE: Chromatic font research)

2002-07-10 Thread James E. Agenbroad

On Wed, 10 Jul 2002, Roozbeh Pournader wrote:

 On Thu, 27 Jun 2002, Marco Cimarosti wrote:
 
  Encoding the navy's flag alphabet or the Morse code would be exactly doing
  this: assigning a code to a code which represents a letter.
 
 BTW, which characters should be used to encode the dot and dash of Morse 
 in a typographically correct way?
 
 roozbeh
 
   Wednesday, July 10, 2002
Well, assigning codes to Braille, which Unicode does, seems similar to me,
but maybe it's to represent the visual image of the 8-bit Braille code.
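To illustrate the "visual image" reading: the Unicode Braille Patterns block (U+2800 to U+28FF) assigns one code point to each of the 256 combinations of eight dots, with dot n corresponding to bit n-1 of the offset from U+2800. A minimal Python sketch follows (Python and the function name braille_pattern are assumptions for illustration).

```python
import unicodedata

BRAILLE_BASE = 0x2800  # BRAILLE PATTERN BLANK


def braille_pattern(dots) -> str:
    """Return the Braille pattern character with the given raised dots (1-8)."""
    bits = 0
    for dot in dots:
        if not 1 <= dot <= 8:
            raise ValueError(f"dot number out of range: {dot}")
        bits |= 1 << (dot - 1)  # dot n sets bit n-1
    return chr(BRAILLE_BASE + bits)


cell = braille_pattern([1, 2])
print(f"U+{ord(cell):04X} {unicodedata.name(cell)}")
```

Note that the code points name dot patterns only; which letter a pattern stands for depends on the Braille system in use.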

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: *Why* are precomposed characters required for backward

2002-07-10 Thread James E. Agenbroad

On Wed, 10 Jul 2002, John Cowan wrote:

 James E. Agenbroad scripsit:
 
   The standards I cited use both
  techniques (precomposed and decomposed letter+diacritic) but they don't
  allow two ways of creating a single letter+diacritic combination the way
  ISO10646/Unicode do.
 
 Even Unicode doesn't go so far as to decompose WITH STROKE.
 In fact, I would argue that the COMBINING HORN was a mistake.  It would
 have been only slightly less efficient to include O WITH HORN and U WITH
 HORN (uc and lc) as undecomposable letters; HORN is really not a diacritic
 but a modification of the ordinary O and U.
 
 -- 
 John Cowan[EMAIL PROTECTED] 
 http://www.reutershealth.com  http://www.ccil.org/~cowan
 Yakka foob mog.  Grug pubbawup zink wattoom gazork.  Chumble spuzz.
 -- Calvin, giving Newton's First Law in his own words
 
  Wednesday, July 10, 2002
John,
 I think back in the late 60's the justification for how the
library character set was designed was that (except for cedilla) if the
diacritic/modifier visually touched the letter the combination had its own
code, if it didn't the diacritic had its own code. 
 It may also be worth noting that codes for separate diacritics
preceded the letter they modified, following manual typewriter dead key
practice. We weren't far-sighted enough to see that this would have led
to chaos with some other scripts. It also makes conversion between Unicode
and U.S. library encoding practices more challenging.  (The less said
about Vietnamese the better.)
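The ordering difference is the part of such a conversion that is easy to sketch. The Python example below (Python and the function name marks_after_base are assumptions for illustration) takes text in which each diacritic precedes its letter and moves it after the letter, as Unicode expects. It assumes the library codes have already been mapped one-to-one to Unicode combining marks, which a real converter would have to do first with a proper mapping table.

```python
import unicodedata


def marks_after_base(text: str) -> str:
    """Move combining marks that precede a base letter to follow it."""
    out = []
    pending = []  # marks seen since the last base character
    for ch in text:
        if unicodedata.category(ch).startswith("M"):
            pending.append(ch)
        else:
            out.append(ch)
            out.extend(pending)  # keep the marks in their original order
            pending.clear()
    out.extend(pending)  # marks with no following base are kept as-is
    return "".join(out)


dead_key_order = "se\u0303nor"  # tilde typed before the n
unicode_order = marks_after_base(dead_key_order)
print(unicodedata.normalize("NFC", unicode_order))
```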
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: Mongolian Ali Gali

2002-07-03 Thread James E. Agenbroad

On Wed, 3 Jul 2002, Michael Everson wrote:

 At 11:48 +0100 2002-07-03, Anthony Stone wrote:
 I should be very glad if someone could solve the mystery of what
 Sanskrit and/or Tibetan characters correspond to the following Unicode
 characters:
 
 1883 MONGOLIAN LETTER ALI GALI UBADAMA
 1884 MONGOLIAN LETTER ALI GALI INVERTED UBADAMA
 
 I suspect the Sanskrit word here is something like upadhama, which 
 would be a word related to breathing.
 -- 
 Michael Everson *** Everson Typography *** http://www.evertype.com
 
   Wednesday, July 3, 2002
Michael, et al.,
 In the ALA/LC romanization table for Sanskrit there is
upadhmaniya (with a macron over the 'i' and second 'a'). It is romanized
as 'h' with combining breve below (U+032E). The original character
resembles adjacent close and open parentheses )( if rotated 90 degrees.
It seems not to be in Unicode 3.0. (Might it be in 3.2?)  In ISCII (page
23), Annex G, Extended character set for Vedic, at G.16: "This is a
half-Visarga sound, and can come only before four consonants.  Before 'ka'
[U+0915] and 'kha' [U+0916] it is called Jihvamuliya [macron over 'i' and
first 'a'], while before pa [U+092A] and pha [U+092B] it is called
Upadhmaniya." I have no idea if it is the functional/phonetic equivalent
of either of (or both) the Mongolian characters mentioned.
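For reference, the code points mentioned can be looked up with Python's standard unicodedata module (Python is an assumption for illustration). The sketch below prints their names and shows that 'h' followed by U+032E composes to a single precomposed letter under NFC. It says nothing about whether that letter corresponds to either Mongolian character.

```python
import unicodedata

# Code points cited in this thread, plus 'h' + U+032E from the romanization.
for cp in (0x1883, 0x1884, 0x032E):
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")

romanized = "h\u032e"  # h followed by combining breve below
composed = unicodedata.normalize("NFC", romanized)
print(f"NFC of h + U+032E: U+{ord(composed):04X} {unicodedata.name(composed)}")
```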

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: Characters 0x80 - 0x9F in ISO 8859-1

2002-06-27 Thread James E. Agenbroad

On Thu, 27 Jun 2002, Keld Jørn Simonsen wrote:

 On Thu, Jun 27, 2002 at 11:59:14AM +0200, Lars Marius Garshol wrote:
  
  This list has previously told me that the characters 0x80 - 0x9F in
  ISO 8859-1 are a particular set of control characters from ISO 6429.
  [snip]
  
  I now see that ISO 8859-1 actually says
  
The shaded positions [0x00-0x1f and 0x7f-0x9f] correspond to
 bit combinations that do not represent graphic characters.
 Their use is outside the scope of ISO 8859; it is specified
 in other International Standards, for example ISO 646 or
 ISO 6429.
  [snip]
  I find this a little confusing and would like to know whether there
  really is a fixed, normative interpretation of this character range.
 
 What people usually use is ISO 6429, this is eg what is used in
 IETF charset definitions for the iso-8859 series.
 
 Kind regards
 keld
 
 
Thursday, June 27, 2002
There is also ISO 6630 Additional Control Functions for Bibliographic
Use. It defines 13 control codes which in 8 bit environments are in the
80 to 9F range.  I do not know how widely they are used.
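As a practical illustration of how software commonly treats this range, here is a Python sketch (Python is an assumption for illustration). Its "iso-8859-1" codec maps bytes 80 to 9F straight to U+0080 through U+009F, which Unicode classifies as control characters (general category Cc) without saying which control function set applies.

```python
import unicodedata

# The 32 byte values 0x80..0x9F decoded as ISO 8859-1.
c1_bytes = bytes(range(0x80, 0xA0))
decoded = c1_bytes.decode("iso-8859-1")

for byte, ch in zip(c1_bytes, decoded):
    print(f"byte {byte:02X} -> U+{ord(ch):04X}, category {unicodedata.category(ch)}")
```

Whether a given byte is then interpreted per ISO 6429, ISO 6630, or something else is up to the application; the decoding step itself does not decide it.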
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Printed Proceedings of the Dublin IUC?

2002-06-04 Thread James E. Agenbroad

Tuesday, June 4, 2002
Does anyone have a copy of the printed proceedings of the recent
International Unicode Conference held in Dublin that they would be willing
to part with? I could afford U.S. postage costs. Only the CD version is
available from the source and though they are sending me a copy I fear it
will become technologically inaccessible in a few years.  
 Thanks in advance.

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: Ie abbreviation character

2002-04-30 Thread James E. Agenbroad

On Tue, 30 Apr 2002, Michael Everson wrote:

 At 11:55 +0200 2002-04-30, Lars Marius Garshol wrote:
 * Stefan Persson
 |
 | Isn't the reversed lower-case c somewhere in the IPA block?
 
 Could be, but I need reversed lower-case 'c' followed by colon as a
 single character.
 
 Also, I am very curious if this character is used (or even known)
 outside Norway at all.
 
 It's a Latin abbreviation I imagine. It's found in older Irish texts 
 where it represents con.
 
 You aren't going to get this as a single character. We write i.e. 
 with four characters, we write .i. (used in Ireland) with three 
 characters; you can certainly write 9: with two characters.
 -- 
 Michael Everson *** Everson Typography *** http://www.evertype.com
 
 
   Tuesday, April 30, 2002
U+0254 is an IPA character that looks like a c rotated 180 degrees. Its
name is LATIN SMALL LETTER OPEN O.  Pullum and Ladusaw's Phonetic Symbol
Guide (page 117) calls it "open O" and says it is "Cardinal vowel
no. 6: lower-mid back rounded ... the vowel sound of the Scottish English
pronunciation of hot." In the section of Van Osterman's 1952 Manual of
Foreign Languages on the most common abbreviations used in Norwegian
commercial correspondence (page 181): "d.e. (det er), that is (i.e.)". So
at one time there was an alternative abbreviation for i.e.
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: How many printable characters in 3.2.0?

2002-04-23 Thread James E. Agenbroad

On Mon, 22 Apr 2002, Doug Ewell wrote:

 Zsigri Gyula [EMAIL PROTECTED] wrote:
 
  How many printable characters are there in Unicode 3.2.0? I tried
  desperately to find the answer at the Unicode web site but could
  not.
 
 There are 95,156 total assigned characters.
 
 To find the number of printable characters, you must first determine
 what you mean by printable and then subtract that number.  This is
 where it might get tricky.  Control characters, formatting characters,
 and such are obviously not printable, but what about things like
 spaces?  (Unicode 3.1 had about 20 of them.)
 
 You might try subtracting those characters in
 http://www.unicode.org/Public/UNIDATA/UnicodeData.txt that have specific
 properties, such as Cc.  Again, though, which properties are to be
 excluded is up to you.
 
 -Doug Ewell
  Fullerton, California
 
 
 
 
 Tuesday, April 23, 2002
There are also various combining characters.  For instance a tilde could
be printed with any letter A-Z, a-z and others.  Arabic and various South
and Southeast Asian scripts have many combinations of letters that appear
different from a simple linear string of such letters.  How many depends
on the level of quality one wants to achieve.
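Doug's suggestion of subtracting by property can be sketched in Python (an assumption for illustration). The standard unicodedata module reports the General Category of each code point, so the counts per category fall out of one loop. Two caveats: the module reflects whatever Unicode version ships with the interpreter, not 3.2.0, so the totals will not match the 95,156 quoted above; and the set of categories treated as non-printable below is my own arbitrary choice, which is exactly the judgment call the thread is about.

```python
import sys
import unicodedata
from collections import Counter


def category_counts() -> Counter:
    """Count code points per General Category (e.g. 'Lu', 'Mn', 'Cc')."""
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        counts[unicodedata.category(chr(cp))] += 1
    return counts


# Assumption: controls, format characters, surrogates, private use,
# unassigned code points, and line/paragraph separators are "not printable".
EXCLUDED = {"Cc", "Cf", "Cs", "Co", "Cn", "Zl", "Zp"}


def printable_estimate(counts: Counter, excluded=EXCLUDED) -> int:
    """Sum the counts of all categories not listed in excluded."""
    return sum(n for cat, n in counts.items() if cat not in excluded)


counts = category_counts()
print("Unicode version:", unicodedata.unidata_version)
print("combining marks (Mn + Mc + Me):", counts["Mn"] + counts["Mc"] + counts["Me"])
print("printable estimate:", printable_estimate(counts))
```

The figure counts encoded characters only. It does not attempt to count the letter-plus-diacritic combinations or the contextual forms mentioned above, which have no code points of their own.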
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: Inherent a

2002-04-01 Thread James E. Agenbroad

On Fri, 29 Mar 2002, Doug Ewell wrote:

 Avarangal [EMAIL PROTECTED] wrote:
 
  I need to allocate a U+codepoint for inherent a, to be used for
  Tamil research. Can anyone suggest a temporary location or is it
  possible to find such code point within the existing code point
  for Tamil.
 
 While we're waiting for someone with better knowledge of Indic scripts
 to reply...
 
 1.  An *inherent* A wouldn't have its own code point, would it?  I don't
 think of it as having an existence outside of the consonant it goes
 with.  Tamil KA is U+0B95, which represents K plus the inherent A.  If
 you wanted to represent only the K, you would use U+0B95 plus the Tamil
 virama, U+0BCD, to kill the A.  But how could you represent an inherent
 vowel by itself?
 
 2.  Assuming you have an answer to #1 above, the only way you could
 allocate a Unicode code point for this character would be to use the
 Private Use Area.  You could choose any code point from U+E000 to U+F8FF
 for this purpose.  (There are unofficial assignments for some of these,
 but you are perfectly free to ignore them.)  Do *not* assign a code
 point in the Tamil block, or anywhere else except the Private Use Area,
 even if it's only for temporary and internal use.  To do so would be
 very non-conformant.
 
 -Doug Ewell
  Fullerton, California
 
Monday, April 1, 2002
There is always U+0B85 for this vowel when it is not inherent in a
consonant.
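The three code points discussed in this thread can be inspected with a short Python sketch (Python is an assumption for illustration; the variable names are mine).

```python
import unicodedata

KA = "\u0b95"      # consonant carrying the inherent vowel
VIRAMA = "\u0bcd"  # sign that suppresses the inherent vowel
A = "\u0b85"       # the independent vowel letter

for ch in (KA, VIRAMA, A):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})")

bare_k = KA + VIRAMA  # "k" with no vowel, written as two code points
print("code points in bare k:", [f"U+{ord(c):04X}" for c in bare_k])
```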

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: UTR#9: Bidirection and UTR#14: Line Breaking

2002-03-25 Thread James E. Agenbroad

On Mon, 25 Mar 2002, Markus Scherer wrote:

 Chookij Vanatham wrote:
 
  UTR#14:Line Breaking says that, Interpretation of line breaking properties
  in bidirectional text takes place before applying rule L1 of the Unicode
  Bidirectional Algorithm.
  
  UTR#9:Bidirectional says that, [at the Reordering Resolved Levels section],
  As opposed to resolution phases, this algorithm Reordering Resolved Levels,
  acts on a per-line basis, and is applied after any line wrapping 
 
 
 You can work on the paragraph-level Bidi resolution in parallel with (i.e., in any 
order relative to) figuring out line breaks.
 
 Then you run line-level Bidi resolution/reordering.
 
 For soft line breaks you will need to iterate through several line breaks and 
reorderings until your glyph vector fits.
 
 
 markus
 
   Monday, March 25, 2002
I'd like to ask a related but more general question.  If a line break in
a paragraph coincides with a space between RTL and LTR text (or LTR and RTL)
does the new line always begin at the margin appropriate to the new line?
In other words does a line always begin at one margin or the other where a
reader's eye would expect it to?
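To make the per-line step concrete, here is a rough Python sketch (Python is an assumption for illustration) of rule L2 of the Bidirectional Algorithm only: from the highest level down to the lowest odd level, reverse each contiguous run at that level or higher. The embedding levels are supplied by hand, which a real implementation would compute in the paragraph-level resolution phase. Uppercase letters stand in for RTL characters, as is common in examples. The function name reorder_line is hypothetical.

```python
def reorder_line(text: str, levels) -> str:
    """Apply rule L2 to one line, given a resolved embedding level per character."""
    chars = list(text)
    levels = list(levels)
    odd_levels = [lvl for lvl in levels if lvl % 2]
    if not odd_levels:
        return text  # nothing right-to-left on this line
    for level in range(max(levels), min(odd_levels) - 1, -1):
        i = 0
        while i < len(chars):
            if levels[i] >= level:
                j = i
                while j < len(chars) and levels[j] >= level:
                    j += 1
                chars[i:j] = reversed(chars[i:j])  # reverse this run in place
                i = j
            else:
                i += 1
    return "".join(chars)


# Two lines of one left-to-right paragraph, already split by line breaking.
lines = [
    ("abc DEF", [0, 0, 0, 0, 1, 1, 1]),
    ("GHI jkl", [1, 1, 1, 0, 0, 0, 0]),
]
for logical, line_levels in lines:
    print(reorder_line(logical, line_levels))
```

Because each line is reordered on its own, a run of RTL text that wraps is reversed separately on each line. Where the finished line sits relative to the margins is a layout matter outside this sketch.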

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: Talk about Unicode Myths...

2002-03-20 Thread James E. Agenbroad

On Wed, 20 Mar 2002, John Cowan wrote:

 John H. Jenkins scripsit:
 
  (His point is that if you have kanji in an IDN you can't tell whether to 
  draw it the Japanese way or the Chinese way, of course, and since 
  civilization as we know it depends on Japanese people never being 
  confronted with Chinese writing styles, even when being used for Chinese, 
  this obviously means that Unicode is Satan incarnate.  Or something like 
  that.)
 
 I am now developing a patch for Mozilla that causes it to display all
 URLs in Fraktur fonts only.
 
 -- 
 John Cowan [EMAIL PROTECTED] http://www.reutershealth.com
 I amar prestar aen, han mathon ne nen,http://www.ccil.org/~cowan
 han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_
 
 
   Wednesday, March 20, 2002
A Japanese Fraktur font? :-) 
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: Synthetic scripts

2002-03-18 Thread James E. Agenbroad

On Sun, 17 Mar 2002, Miikka-Markus Alhonen wrote:

 
 On 17-Mar-02 Curtis Clark wrote:
  At 04:45 PM 3/16/02, Doug Ewell wrote:
 But right away that definition includes not only Shavian, Tengwar,
 Cirth, Klingon, and most of the contents of ConScript, but also
 Ethiopic, Cherokee, Canadian Syllabics, Gothic, Deseret, and maybe Yi
 Syllabics, all of which are already encoded in Unicode.
  
  And iirc Cyril and Methodius were people, although their script was based 
  on Greek and continued to evolve.
 
 What about a script that was invented by one person with the principal
 intention of representing an artificially constructed language?
 This would include Tengwar, Cirth and Klingon but not any of the other
 above-mentioned cases.
 
 Of course, Tolkien's scripts can be used to write English or other natural
 languages, too, but I strongly feel the author didn't intend Tengwar to replace
 the present Latin alphabet...
 
 Best regards,
 Miikka-Markus Alhonen
 
 Monday, March 18, 2002
I know very little about the scripts involved.  It seems to me that if the
scripts are defined at the same time as the language, this might help
in defining synthetic scripts.  Something along the lines of: any writing
system defined by one or a few individuals concurrently with their
definition of the language intended to use it.  I think it shouldn't
matter if it were later used to write other languages, unless it became
the main means of writing some long-spoken language.  
Math is a field about which I'm very ignorant.  Did Leibniz define calculus
and the means of expressing it at the same time?  Is it a language?
Esperanto doesn't have its own script, does it?  Cherokee and English
existed long before Sequoyah's script and Deseret, so this definition would
exclude them. 

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: Synthetic scripts (was: Re: Private Use Agreements and UnapprovedCharacters)

2002-03-18 Thread James E. Agenbroad

On Fri, 15 Mar 2002, Kenneth Whistler wrote:

 Dan Kogai continued:
 [snip] 
  His 
  favorite appears to be ISO-2022 but as Yet Another Perl Encoding Hacker, 
  ISO-2022 is pain in the arse
 
 You got that right!
 
 --Ken
 
  Monday, March 18, 2002
Is ISO 2022 a character set (characters with their codes) or a complex
(painful?) means to announce and negotiate among various sets? I thought
it was the latter; am I missing something?   
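
A minimal sketch of the "latter" reading, assuming Python 3 and its bundled
iso2022_jp codec (ISO-2022-JP is only one profile built on ISO 2022, chosen
here purely for illustration). The escape sequences in the output are the
announcing/switching mechanism; the characters themselves come from the
sets being switched between.

```python
# ISO-2022-JP does not define characters itself; it switches between
# existing coded character sets with escape sequences.
text = "A\u3042B"  # LATIN CAPITAL A, HIRAGANA LETTER A, LATIN CAPITAL B
encoded = text.encode("iso2022_jp")  # codec shipped with CPython

ESC = b"\x1b"
TO_JIS_X_0208 = ESC + b"$B"  # designate JIS X 0208 (two bytes per character)
TO_ASCII = ESC + b"(B"       # designate ASCII again

print(encoded)
print("switches to JIS X 0208:", TO_JIS_X_0208 in encoded)
print("switches back to ASCII:", TO_ASCII in encoded)
print("round trip ok:", encoded.decode("iso2022_jp") == text)
```

The three-character string needs more than three bytes because of the
switching overhead, which is a fair part of the pain being complained about.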

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: Private Use Agreements and Unapproved Characters

2002-03-13 Thread James E. Agenbroad

On Wed, 13 Mar 2002, William Overington wrote:

 Here is a system that I think would work.
 
 Consider please that there exists for the private use area the concept of
 the hexadecimal point.  The term hexadecimal point is similar to the
 concept of a decimal point, the difference being that a decimal point is for
 base 10 numbers and a hexadecimal point is for base 16 numbers.
 
  [snip]
 
 Wednesday, March 13, 2002
If we are to have a hexadecimal point, should it have a code of its own to
distinguish it from the . decimal point?  U+2394 is an open hexagon
with a flat side down which might serve. Its name, SOFTWARE-FUNCTION SYMBOL,
is unclear to me.   
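
For anyone who wants to check the name, a small sketch assuming Python 3
and its standard unicodedata module (the variable names are only
illustrative):

```python
import unicodedata

hexagon = "\u2394"   # the open hexagon mentioned above
point = "."          # the ordinary decimal point, U+002E

hexagon_name = unicodedata.name(hexagon)
point_name = unicodedata.name(point)
hexagon_category = unicodedata.category(hexagon)

print(f"U+{ord(hexagon):04X} {hexagon_name} (category {hexagon_category})")
print(f"U+{ord(point):04X} {point_name}")
```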

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: Private Use Agreements and Unapproved Characters

2002-03-13 Thread James E. Agenbroad

On Tue, 12 Mar 2002, John Cowan wrote:

 
   [snip] 
 
 (In truth neither of us has had much time to process new registrations
 lately.  Arse longa, vita brevis.)
 
   [snip]
 -- 
 John Cowan [EMAIL PROTECTED] http://www.reutershealth.com
 I amar prestar aen, han mathon ne nen,http://www.ccil.org/~cowan
 han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_
 
 
  Wednesday, March 13, 2002
I have a little Greek but no Latin, but should that be Ars longa ...? 

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: Private Use Agreements and Unapproved Characters

2002-03-13 Thread James E. Agenbroad

On Wed, 13 Mar 2002, Michael Everson wrote:

 Um,
 
 What I think is that *I* for one am certainly not going to invest any 
 effort in pseudo-coding scripts in a PreScript Unicode Registry. 
 The work to get scripts proposed and encoded is enough. If someone is 
 interested in a script, and wants to build fonts for it based on 
 script proposals, he or she can do that, assigning PUA numbers to the 
 glyphs for testing purposes. When the script is standardized then 
 real Unicode numbers can be added.
 
 If users need to exchange data in a not-yet-encoded script, then they 
 should agree what PUA numbers they are using between them. That's 
 what ConScript is for. (We don't get a lot of people beating down our 
 doors to get stuff encoded though, and no, Doug, I haven't forgotten 
 Ewellic.)
 -- 
 Michael Everson *** Everson Typography *** http://www.evertype.com
 
 
 Wednesday, March 13, 2002
Perhaps I'm having a senior moment; what is ConScript?  I didn't find it
in either index to 3.0.  Is there a write-up in the Unicode web page you
can point me to?  Thanks in advance.  
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





RE: Devanagari variations

2002-03-11 Thread James E. Agenbroad

On Fri, 8 Mar 2002, Marco Cimarosti wrote:

 Peter Constable wrote:
  On 03/07/2002 02:16:10 PM James E. Agenbroad wrote:
  
  A similar but not the same situation is found in the fourth example in
  figure 9-3 of Unicode 3.0 (page 214) where an independent vowel has the
  reph (an abridged form of the consonant 'ra') above it.  Unicode wants
  this encoded as consonant + halant + independent vowel. I believe it is
  better considered as a consonant + vowel sign combination which happens to
  have an odd display and at least one Sanskrit textbook agrees.
  
  I may be wrong, but I believe that example has  ra, halant, ra, 
  independent i . The first ra is the one that  transforms 
  into the reph.
 
 You are wrong, in fact, sorry. Although figure 9-3 does not show code point
 values, both the glyphs and the abbreviated letter names make it clear that
 the sequence is:
 
   U+0930 (DEVANAGARI LETTER RA)
   U+094D (DEVANAGARI SIGN VIRAMA)
   U+090B (DEVANAGARI LETTER VOCALIC R)
 
 James' idea is that the same graphemes could have been better represented
 with sequence:
 
   U+0930 (DEVANAGARI LETTER RA)
   U+0943 (DEVANAGARI VOWEL SIGN VOCALIC R)
 
 It is an interesting idea, because ra never occurs with matra r., so
 there is no danger of confusion. But it is probably too late for changing
 it: it would break compatibility with ISCII and existing Unicode fonts.
 
 _ Marco
 
 
 Monday, March 11, 2002
Ra as reph does occur with r.; cf. Monier Williams' Sanskrit-English
Dictionary, page 554, second column, where between niru_ha and nire (using
underscore for macron and  for circumflex) are nirr.i, nirr.ich and
nirr.ij. I believe ISCII is silent on this matter. If so, how can
compatibility with it be broken?  If fonts have this glyph can't they
allow two encodings to invoke it?  I do not advocate deletion or
deprecation of the encoding shown on page 214 of 3.0 for this glyph, I do
advocate saying somewhere in the Unicode Standard discussion of Devanagari
that there is another, more plausible and more Indian way to encode this
glyph.
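
The two candidate encodings discussed in this thread can be laid side by
side with a short sketch, assuming Python 3 and its standard unicodedata
module. It only shows that the sequences are distinct code point strings
that normalization does not unify; whether a font maps both to the same
glyph is a rendering matter that this code cannot test.

```python
import unicodedata

# Encoding shown in figure 9-3: RA + VIRAMA + independent VOCALIC R
standard = "\u0930\u094D\u090B"
# Alternative argued for here: RA + dependent VOWEL SIGN VOCALIC R
alternative = "\u0930\u0943"

def describe(sequence):
    """List each code point in the sequence with its Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in sequence]

for label, seq in (("standard", standard), ("alternative", alternative)):
    print(label)
    for line in describe(seq):
        print("  ", line)

# Normalization does not make the two spellings equal.
same_after_nfc = (unicodedata.normalize("NFC", standard)
                  == unicodedata.normalize("NFC", alternative))
print("equal after NFC:", same_after_nfc)
```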
   
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: Devanagari variations

2002-03-08 Thread James E. Agenbroad

On Fri, 8 Mar 2002, Michael Everson wrote:

 At 15:16 -0500 07/03/2002, James E. Agenbroad wrote:
 On Wed, 6 Mar 2002 [EMAIL PROTECTED] wrote:
 
   On 03/06/2002 08:25:18 AM Michael Everson wrote:
   [snip]
 
   In
   Cham, independent vowels can take dependent vowel signs. In
   Devanagari, I guess that doesn't occur, but the Brahmic model
   shouldn't be understood to preclude this behaviour.
  [snip]
   - Peter
 
 A similar but not the same situation is found in the fourth example in
 figure 9-3 of Unicode 3.0 (page 214) where an independent vowel has the
 reph (an abridged form of the consonant 'ra') above it.  Unicode wants
 this encoded as consonant + halant + independent vowel. I believe it is
 better considered as a consonant + vowel sign combination which happens to
 have an odd display and at least one Sanskrit textbook agrees.
 
 Is that the sample you showed me when I was a-photocopying at the 
 Library of Congress in August, James? You're saying that RA + virama 
 + INDEPENDENT VOCALIC R and RA + VOWEL SIGN VOCALIC R should both 
 produce the same glyph?
 -- 
 Michael Everson *** Everson Typography *** http://www.evertype.com
 
 
   Friday, March 8, 2002
Michael,
 Yes.  
 [Call me Jim]
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: Devanagari variations

2002-03-08 Thread James E. Agenbroad

On Fri, 8 Mar 2002 [EMAIL PROTECTED] wrote:

 Jim Agenbroad responded (off list):
 
 Not quite. On page 214 of 3.0 there is one RA vowel, a halant and a 
 RI
 vowel: RA(d) + RI(n) -- RI(n) + RA(sup)   (parens in lieu of subscript)
 
 I didn't realise that RI meant the vocalic R. I mistook it to mean 
 something else. I find it a weakness of that section that such notations 
 are not defined and prominently displayed in an easy-to-find location.
 
 Thanks for setting me straight. I should have known you knew what you were 
 talking about.
 
 
 Peter
 
 
   Friday, March 8, 2002
Peter,
 I agree there is a weakness there.  Maybe more than one. 
 I have mailed you (Peter) the Deshpande and Monier Williams examples
I cited.  
 Have a nice weekend all!
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: Devanagari variations

2002-03-07 Thread James E. Agenbroad

On Wed, 6 Mar 2002 [EMAIL PROTECTED] wrote:

 On 03/06/2002 08:25:18 AM Michael Everson wrote: 
 [snip] 
 
 In
 Cham, independent vowels can take dependent vowel signs. In
 Devanagari, I guess that doesn't occur, but the Brahmic model
 shouldn't be understood to preclude this behaviour.
[snip]
 - Peter

A similar but not the same situation is found in the fourth example in
figure 9-3 of Unicode 3.0 (page 214) where an independent vowel has the
reph (an abridged form of the consonant 'ra') above it.  Unicode wants
this encoded as consonant + halant + independent vowel. I believe it is
better considered as a consonant + vowel sign combination which happens to
have an odd display and at least one Sanskrit textbook agrees.  

Didn't Mark Twain say he didn't think much of a person who could spell a
word in only one way?   

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: Unicode and Bengali

2002-03-05 Thread James E. Agenbroad

On Tue, 5 Mar 2002, Doug Ewell wrote:

 Dhrubajyoti Banerjee [EMAIL PROTECTED] wrote:
 
 [quoting Akshor]
  I thing we need not be restrained by these so-called 'standards'.
 Because,
  they can't and will not serve our need (Bengali) in my humble view.
 Thats
  why we toke this project at our hand and working to impliment a
 universal
  input method..
 
 They can implement any input method they like.  Having discrete keys for
 consonants and half-forms does not mean those forms have to be encoded
 separately.
 
  In fact they have put a huge number of conjunct characters in the
 Private
  Use Area. Its a pity because it means that so many people still do not
 even
  understand the difference between characters and glyphs.
 
  I hope Unicode proliferates fast in these areas so people can
 understand it
  and use it without wasting time in such activities as reinventing the
 wheel.
 
 What a relief to hear someone within the Indic community who actually
 understands the character-glyph model.  You probably know that many,
 many users of Indic scripts believe Unicode is incomplete or
 inadequate without separately encoded conjuncts and glyph variants.
 Please do your best to share your knowledge!
 
 BTW, your post was anything but offtopic.
 
 -Doug Ewell
  Fullerton, California
 
  Tuesday, March 5, 2002
It may not be obvious to all so I'll say it:  Unicode is as complete and
adequate as ISCII because they both use the phonetic approach to encoding 
Indic scripts, as distinguished from the graphic approach.  Though I 
attended one meeting of its authors in 1982, ISCII is a purely Indian
standard (IS 13194: 1991) which also lacks separately encoded conjuncts
and glyph variants.
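
To make the phonetic approach concrete, here is a minimal sketch assuming
Python 3 and its standard unicodedata module. The Bengali conjunct kssa is
used only as an example: it is stored as consonant + virama + consonant,
and the conjunct shape is left to the font and rendering engine.

```python
import unicodedata

# Bengali conjunct "kssa" as a phonetic sequence, not a single conjunct code.
kssa = "\u0995\u09CD\u09B7"

names = [unicodedata.name(ch) for ch in kssa]
for ch, name in zip(kssa, names):
    print(f"U+{ord(ch):04X} {name}")

# Normalization does not fuse the sequence into one precomposed character.
nfc_length = len(unicodedata.normalize("NFC", kssa))
print("code points after NFC:", nfc_length)
```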

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: Standard Conventions and euro

2002-03-01 Thread James E. Agenbroad

Friday, March 1, 2002
Would I be correct in assuming that the Euro is also now the currency in
non-European dependencies such as the Netherlands Antilles, French
Polynesia, etc.?  Apologies in advance if either of these is now
independent. 

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: [OT beyond any repair] House numbers (was RE: ISO 3166 (countryc odes) Maintenance Agency Web pages move))

2002-03-01 Thread James E. Agenbroad

On Fri, 1 Mar 2002, Patrick Andries wrote:

 
 
 Marco Cimarosti wrote:
 
 John Cowan wrote:
 
 [...]  House numbers in North America (and in France
 also, it seems) have a few bits of meaning: the least-significant
 (numeric) bit tells you which side of the street the house is on,
 [...]
 
 
 It is the same in Italy. I was quite surprised to know that also in other
 countries even and odd numbers are on the opposite sides of the road.
 
 I am curious whether another rule valid in Italy also applies in other
 countries: here the numbering always starts on the end of the road which is
 nearer to the center. When visiting Italian cities, I know whether I am
 walking towards the suburbs or towards the center by the looking whether
 house numbers increase or decrease.
 
 This is the same in Belgium.
 
 Patrick Andries
 
 
Friday, March 1, 2002
I'd say this is generally true in the U.S. too.  It is probably a product
of urban expansion from the center outward.
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





RE: ISO 3166 (country codes) Maintenance Agency Web pages move

2002-02-25 Thread James E. Agenbroad


On Mon, 25 Feb 2002, Marco Cimarosti wrote:

 John Hudson wrote:
  At 06:33 2/25/2002, Marco Cimarosti wrote:
  
  Alain LaBonté wrote:
[...] Who knows? What is the word for gipsy in Romanian? [...]
  
  Rom, in fact: I just asked this to a Rumanian colleague.
  
  I presumed this was the reason for the code change 'request' from the 
  Romanian government. Bigotry against and persecution of the Rom is a 
  long-standing Romanian tradition: one of the few constants of 
  the country's political scene.
 
 I hope that you are wrong, although your suspect does not lack plausibility.
 
 But the present case could perhaps have a more positive explanation: many
 Romanians are indeed Rom (Gipsy) so, maybe, everybody felt it was
 inappropriate to call all Romanian nationals with the name belonging one
 single ethnical group.
 
 It would be as calling any USA citizens were called Californians: although
 there is nothing particularly good or bad in being a Californian, it is not
 generally deemed appropriate to call Texans Californians. (Or, anyway, I'm
 told that it's not advisable to should You're all Californians while
 entering a truck driver's diner in the middle of Texas).
 
 _ Marco
 
 
   Monday, February 25, 2002
The name Yankee has different connotations in parts of the U.S. and
outside the U.S.
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: Unicode Search Engines

2002-02-21 Thread James E. Agenbroad

On Tue, 19 Feb 2002, Asmus Freytag wrote:

 At 09:52 PM 2/18/02 -0800, Doug Ewell wrote:
 So if some language turns out to need
 a with horn in the future, its readers will have to cross its fingers
 that rendering engines become capable of displaying U+0061 U+031B
 properly.
 
 Support for such arbitrary combination is apparently in the works in several
 camps - it's needed in African languages for one.
 
 A./
 Thursday, February 21, 2002
Transliteration (converting texts from one writing system to another) is
of course an unnatural act (not in natural language) but it is sometimes
necessary.  Transliteration often uses letters with diacritics to
represent letters not found in the target writing system.  Many of these
letter + combining mark combinations do not have a precomposed equivalent
in Unicode (because they do not occur in normal text of the source
language) but they are needed when such texts must be presented in
another writing system.  Many can be found in the ALA/LC romanization
tables. John Jenkins has a list of several of such combinations found in
cataloging data to help in preparing a TR which I gather will tell those
concerned with rendering engines that such combinations can occur.
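
A small sketch of the point, assuming Python 3 and its standard unicodedata
module, using the a-with-horn example from the quoted message: o + horn has
a precomposed form, so NFC composes it, while a + horn has none and stays a
base letter followed by a combining mark that the rendering engine must
position.

```python
import unicodedata

HORN = "\u031B"  # COMBINING HORN

o_horn = unicodedata.normalize("NFC", "o" + HORN)  # precomposed form exists
a_horn = unicodedata.normalize("NFC", "a" + HORN)  # no precomposed form

def code_points(s):
    """Return the code points of s as U+XXXX strings."""
    return [f"U+{ord(ch):04X}" for ch in s]

print("o + horn after NFC:", code_points(o_horn))
print("a + horn after NFC:", code_points(a_horn))
```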

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 It is not true that people stop pursuing their dreams because they
grow old, they grow old because they stop pursuing their dreams. Adapted
[not by me] from a letter by Gabriel Garcia Marquez.
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
 Addresses: Office: Phone: 202 707-9612; Fax: 202 707-0955; US
mail: I.T.S. Sys.Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, 
Washington, D.C. 20540-9334 U.S.A.
Home: Phone: 301 946-7326; US mail: Box 291, Garrett Park, MD 20896.  





Re: Unicode and Security

2002-02-07 Thread James E. Agenbroad

 Thursday, February 7, 2002
Would making the about-to-be-misled respondent type the address of the
intended person (with a roman 'o', not a Greek omicron) and then having
the system check whether the two match detect and thwart such tricks?  The
respondent is already typing so it's not a large extra burden.
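
A rough sketch of that check, assuming Python 3 and its standard
unicodedata module. The function names are hypothetical, and taking the
first word of a character's name as its script is only a crude stand-in
for real script data.

```python
import unicodedata

genuine = "microsoft.com"
spoofed = "micr\u03BFsoft.com"  # fifth letter is GREEK SMALL LETTER OMICRON

def typed_matches(typed, displayed):
    """Compare what the user typed with the address that was displayed."""
    return typed == displayed

def scripts_used(domain):
    """Crude script guess: first word of each letter's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in domain if ch.isalpha()}

# The two strings may look alike on screen but they compare unequal.
print("typed address matches sender:", typed_matches(genuine, spoofed))
print("scripts in genuine:", sorted(scripts_used(genuine)))
print("scripts in spoofed:", sorted(scripts_used(spoofed)))
```

The comparison only helps if the respondent really types the address
rather than copying it from the forged message.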
 Regards,
  Jim Agenbroad (disclaimer and addresses at bottom)
On Thu, 7 Feb 2002, Michael Everson wrote:

 At 12:22 -0500 2002-02-07, Elliotte Rusty Harold wrote:
 
 For the sake of argument, let's call the company they work at 
 Microsoft, but this attack could hit most companies with a .com 
 address. Let's say I register microsoft.com, only the fifth letter 
 isn't a lower-case Latin o. It's actually a lower case Greek 
 omicron. I then forge a believable letter from [EMAIL PROTECTED] 
 to [EMAIL PROTECTED] saying Can you please update me on your 
 budget? Bob, noticing that the e-mail appears to come from Alice, 
 whom he knows and trusts, fires off a reply with his confidential 
 information. Only it doesn't go to Alice. It goes to me. I can then 
 reply to Bob, asking for clarification or more details. I can ask 
 him to attach the latest build of his software. I can carry on a 
 conversation in which Bob believes me to be Alice and spills his 
 guts. This is very, very bad.
 
 It isn't Unicode's fault that some letters look like others. That's a 
 fault of history.
 
 -- 
 Michael Everson *** Everson Typography *** http://www.evertype.com
 
 

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 It is not true that people stop pursuing their dreams because they
grow old, they grow old because they stop pursuing their dreams. Adapted
from a letter by Gabriel Garcia Marquez.
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
 Addresses: Office: Phone: 202 707-9612; Fax: 202 707-0955; US
mail: I.T.S. Sys.Dev.Gp.4, Library of Congress, 101 Independence Ave. SE, 
Washington, D.C. 20540-9334 U.S.A.
Home: Phone: 301 946-7326; US mail: Box 291, Garrett Park, MD 20896.  





Oops!

2002-02-06 Thread James E. Agenbroad

The ALA/LC romanization tables are at: lcweb.loc.gov/catdir/cpso/roman.html
( not .../romanization.html as in my earlier note)

 Sorry,
  Jim Agenbroad ( [EMAIL PROTECTED] )





ALA/LC Romanization Tables on the Web

2002-02-06 Thread James E. Agenbroad

 Wednesday, February 6, 2002
The scanned pages of the 1997 ALA/LC romanization tables are now available
on the Web:  http://lcweb.loc.gov/catdir/cpso/romanization.html
Note that in lieu of the Wade-Giles pages there is a note that pinyin
guidelines are pending.  

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: names of the control characters

2002-02-04 Thread James E. Agenbroad

On Mon, 4 Feb 2002, Michael Everson wrote:

 At 12:33 -0800 2002-02-03, Mark Davis wrote:
 This has bitten more than a few people. For political reasons, having
 to do with the synchronization of names to ISO 10646, the name fields
 are empty for the control characters. That is because (at least in
 theory) people could have other semantics for those characters.
 
 I would really favour challenging WG2 to accept reality and reference 
 the glyphs and names at least informatively in 10646. I'd support it 
 in committee.
 -- 
 Michael Everson *** Everson Typography *** http://www.evertype.com
 
 
   Monday, February 4, 2002
Cf. ISO 6603 (1989) Documentation - Bibliographic control
characters where among other control characters:
  X'88' is non-sorting character(s), beginning, and
  X'89' is non-sorting character(s), ending   
These two are in MARC 21 documentation, others in the ISO standard are not
but other countries may have adopted them.
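
The empty name fields mentioned in the quoted message can be seen with a
short sketch, assuming Python 3 and its standard unicodedata module, for
the two code positions cited above. The bibliographic meanings come from
the other standard, not from the Unicode data this code reads.

```python
import unicodedata

control_info = {}
for cp in (0x88, 0x89):
    ch = chr(cp)
    control_info[cp] = {
        # name() with a default avoids the ValueError raised for unnamed characters
        "name": unicodedata.name(ch, ""),
        "category": unicodedata.category(ch),
    }

for cp, info in control_info.items():
    print(f"U+{cp:04X} name={info['name']!r} category={info['category']}")
```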
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Cuneiform Article

2001-11-15 Thread James E. Agenbroad

 Thursday, November 15, 2001
On pages A14-15, the November 9 issue of the Chronicle of Higher Education
has an article, Silicon Babylon, by Scott McLemee on the Cuneiform Digital
Library Initiative.  It seems they're using digital images, not character
encoding: Most of the time we can put the tablet on  [or in?] a flatbed
scanner. (My notes aren't entirely legible.) 

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library
of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.  





Re: What constitutes character?

2001-11-12 Thread James E. Agenbroad

On Wed, 7 Nov 2001, Philipp Reichmuth wrote:

 
 Hello folks,
 
 I've been wondering a little bit recently about the definition of
 character vs. glyph variant that is applied during decision
 whether or not a given proposed character should go into Unicode.
 
 I'm thinking of all those highly academic cases such as the famous Han
 signs in medieval Korean Buddhist manuscripts (which we've had quite a
 lot of recently). What if it is a character where nobody knows for
 sure whether it is a character in its own right or a variant of some
 sort, in orthography, style or whatever? There must be some semiotic
 concept behind the idea of character here. Other examples might
 include some aspects of Mayan or Indus script or of Sumerian cuneiform
 when used to write Eblaite where we've got lots and lots of text, but
 we can't read it properly without confusion, either completely (Indus
 script) or in some more or less rare cases.
 
 What is necessary for two signs to constitute different characters in
 cases such as these?
 
 Greetings
  Philipp  mailto:[EMAIL PROTECTED]
 
 
 
 
   Wednesday, November 7, 2001
Maybe we need different standards for: 1. Living scripts, 2. Dead but
read scripts, e.g., cuneiform, and 3. Dead and unread scripts, e.g. rongo 
rongo/Easter Island.  
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )




Re: Cyrillic Q

2001-09-27 Thread James E. Agenbroad

On Thu, 27 Sep 2001, John Hudson wrote:

 At 02:48 9/27/2001, Marco Cimarosti wrote:
 
 A lot of time ago, someone on this list mentioned a language, written in the
 Cyrillic alphabet, which employed letter Q, taken from the Latin alphabet.
 
 Which language is it?
 
 Kurdish. The common Cyrillic orthography includes four Latin letterforms 
 that are, as far as I know, unique to Kurdish:
 
  U+0051, U+0071  Capital, Small Q
  U+0057, U+0077  Capital, Small W
 
 John Hudson
 
 Tiro Typeworks        www.tiro.com
 Vancouver, BC [EMAIL PROTECTED]
 
 Type is something that you can pick up and hold in your hand.
- Harry Carter
 
 
 Thursday, September 27, 2001
Besides Kurdish, the section on transliteration of non-Slavic languages
using Cyrillic in the ALA-LC romanization tables (1997) shows Q used with
four other languages: Aisor, Chechen (the 1862 and 1908 orthographies but
not the 1938 one), Dargwa (Uslar) and Lak (1864 but not 1938). For Kurdish,
Q also seems to have an alternative glyph that appears as O followed by
a vertical bar, which is also used with Lezghian (Uslar).

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





RE: discontent about Indic scripts and Unicode

2001-09-19 Thread James E. Agenbroad

On Wed, 19 Sep 2001, Carl W. Brown wrote:

 Ram,
 
 If ISCII is intended as a pan-Indic solution does it also support Urdu?
 
 Carl
 
  Wednesday, September 19, 2001
No, from the foreword to ISCII: "As Perso-Arabic scripts have a different
alphabet, a different standard is envisaged for them."

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: discontent about Indic scripts and Unicode

2001-09-19 Thread James E. Agenbroad

On Wed, 19 Sep 2001, Rick McGowan wrote:

  If ISCII is still being developed does this suggest that Unicode and its ISO 
  equivalent move too slowly?
 
 ISCII dates back to 1988 with a revision in 1990.  It's not still being  
 developed -- as far as I know, it's a stable standard that is under  
 routine maintenance.
 
 I wonder if anyone has yet corresponded with the people who put up the  
 almost unbelievable misconceptions on the two web pages mentioned  
 yesterday?  At least a note could go to the site owners, I would think.
 
   Rick
 Wednesday, September 19, 2001
I agree that the 1991 version of ISCII has been stable for representation
of Indian scripts of Indian origin.  I do not know if a standard for
encoding of Perso-Arabic script for Urdu, Sindhi, etc. has advanced beyond
being envisaged as mentioned in my earlier note. 
The term ISSCII (for Indian script standard code for information
interchange) dates back at least to the July 1983 report of the Government
of India's Sub-Committee on Standardization of Indian Scripts and Their
Codes for Information Processing, entitled "Standardization of Indian
script codes for information interchange." (I'm not sure when the second
'S' was dropped.) Page iv of the 1991 standard is devoted to history and
begins: "Since the 70s, different committees of the Department of Official
Languages and the Department of Electronics (DOE) have been evolving
different codes and keyboards which would cater to all the Indian scripts
due to their common phonetic structure."  Besides the 1983 version it
mentions ones of 1986 and 1988.  The 1983 report cites a March 1981
interim report.

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: FW: 6 questions

2001-09-18 Thread James E. Agenbroad

On Tue, 18 Sep 2001, Magda Danish (Unicode) wrote:

 
 
 -Original Message-
 From: Bernard Miller [mailto:[EMAIL PROTECTED]] 
 Sent: Monday, September 17, 2001 5:19 PM
 To: [EMAIL PROTECTED]
 Subject: 6 questions
 
 
 Hello,
 
 These are the questions I wanted to
 ask: 
 
 1. [snip] 
 6.Why does Unicode use capital vs small letter
 terminology instead of uppercase vs lowercase? It
 seems like lowercase is more descriptive than small
 letter. 
 
Tuesday, September 18, 2001
Probably because "uppercase" and "lowercase" hark back to manual
typesetting (pre-desktop, pre-photocomposition, pre-Linotype) as Gutenberg
and Ben Franklin did it: one case containing the more used a-z was
almost horizontal at a stand-up desk (the lower case), and behind it
and more nearly vertical was the upper case with the less used capital
letters, A-Z.  The person setting type picked the wanted letters from both
cases and inserted them into a composing stick, and they were then
transferred to the printing press.  When printing was done, the letters
were manually redistributed into their proper sections of the cases.

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





RE: [OT] o-circumflex

2001-09-07 Thread James E. Agenbroad

On Thu, 6 Sep 2001, Ayers, Mike wrote:

 
  From: David Starner [mailto:[EMAIL PROTECTED]] 
  Sent: Thursday, September 06, 2001 01:40 PM
 
  On Thu, Sep 06, 2001 at 04:03:07PM +0200, Thierry Sourbier wrote:
   The only little thing to know about French and diacritical 
  mark is that when
   doing a sort diacritical mark are evaluated from right to 
  left.  (e.g.
   cote < côte < coté vs the English order cote <
  coté < côte ).
  
  I'm not sure there is an established English sort order. It's not a 
  problem that comes up much in English. 
 
   I believe that there is an established sort order in English, which
 is to sort without regard to diacritics, or else we'd never find the words!
 In English (American English more than British English), diacritics are
 considered optional, and it is common to see "naïve" written "naive", "San
 José" written "San Jose", etc.  Especially amongst Americans, the two are
 considered equivalent, and I know of no word pair in all of English which is
 separated only by a diacritic.
 
 Friday, September 7, 2001
Librarians have *filing* rules--the American Library Association (ALA) and
the Library of Congress (LC) each issued some in, I think, 1980.  I
believe they both say to ignore diacritics because Americans do not
recognize that they have an order.  These days filing in vendor software
for libraries tends to follow neither one very closely--the phrase
more honored in the breach than the observance comes to mind.  I may be
wrong but I do not believe there is an established U.S. standard for
sorting/filing.  A few years ago a National Information Standards
Organization (NISO) committee drafted one but it didn't get the
votes needed to become an accepted standard.  
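
The three orderings discussed in this thread (ignore diacritics, compare accents left to right, compare accents right to left as in French) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and the accent weights are simply the code point order of the combining marks, not the weights of any filing rule or collation standard.

```python
import unicodedata


def split_marks(word):
    """Return (base letters, list holding the combining marks of each letter)."""
    bases, marks = [], []
    for ch in unicodedata.normalize("NFD", word):
        if unicodedata.combining(ch) and bases:
            marks[-1] += ch          # attach the mark to the preceding letter
        else:
            bases.append(ch.casefold())
            marks.append("")
    return "".join(bases), marks


def filing_key(word):
    """Ignore diacritics (and case) entirely."""
    return split_marks(word)[0]


def forward_key(word):
    """Base letters first, then accents compared from left to right."""
    base, marks = split_marks(word)
    return (base, marks)


def backward_key(word):
    """Base letters first, then accents compared from right to left."""
    base, marks = split_marks(word)
    return (base, marks[::-1])


words = ["côté", "coté", "côte", "cote"]
print(sorted(words, key=forward_key))
print(sorted(words, key=backward_key))
print(filing_key("San José") == filing_key("San Jose"))
```

With filing_key alone, all four French words compare equal, so their relative order is whatever the input order happened to be.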

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: Arial Unicode MS and Code2000

2001-07-06 Thread James E. Agenbroad

On Fri, 6 Jul 2001, Rajesh Chandrakar wrote:

 
 James Kass wrote:
 
  Adarsh wrote:
 
   [snip]
 
  Another problem has to do with searching/indexing.  Search/index 
  applications
  are broken by non-Standard encodings.
 
 but how far searching and indexing is possible for encoded standards?
 
 regards
 
 rajesh
 
 Friday, July 6, 2001
To searching and indexing one might add sorting/filing of the
retrieved items into some useful order since I believe this discussion
began with a bibliographic application.  (This is, I know, a bit remote 
from the fonts issue.)  
   

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





RE: informative due to variation across langauges

2001-06-19 Thread James E. Agenbroad

On Tue, 19 Jun 2001, Marco Cimarosti wrote:

 Peter Constable wrote:
  Can anyone think of other examples of informative properties 
  that are so
  because the property is typical but not true for all languages?
 
[snip]
I arrived late to this discussion.  Is culturally correct sorting/filing
such a property?  I believe the Japanese and Koreans sort/file Kanji/Hani
phonetically--as if they were written in kana and hangul. And that
software cannot be expected to derive the kana from the kanji. I think it
is also the case that good sorting of Latin, Cyrillic and Arabic scripts
is language dependent (and maybe other scripts too).
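
To make the kanji point concrete, here is a small Python sketch. It assumes each record carries a reading supplied by a person alongside the written form, since the software is not expected to derive it; the three surnames and their readings are illustrative sample data, and plain code point order of kana stands in for a real kana collation.

```python
# (written form, kana reading supplied by a cataloguer)
entries = [
    ("山田", "やまだ"),
    ("田中", "たなか"),
    ("鈴木", "すずき"),
]

# Sorting on the written form only follows the code point order of the kanji.
by_code_point = sorted(entries, key=lambda entry: entry[0])

# Sorting on the stored reading gives the phonetic order readers expect.
by_reading = sorted(entries, key=lambda entry: entry[1])

print([written for written, _ in by_code_point])
print([written for written, _ in by_reading])
```

The two lists come out in different orders, which is why the reading has to be stored as data rather than computed at sort time.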
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Re: New acquisition

2001-06-12 Thread James E. Agenbroad

   Tuesday, June 12, 2001
Did the Lion dip his thorn in ink?
 Jim Agenbroad (disclaimer and addresses at bottom)
On Mon, 11 Jun 2001, John Hudson wrote:

 At 15:56 6/11/2001 +0100, Michael Everson wrote:
 
 Shaw, Bernard. 1962. Androcles and the Lion. Printed in the Shaw Alphabet 
  [snip]
 Well, my copy is inscribed to Androcles from the Lion, so there!
 
 John Hudson
 
 Tiro Typeworks |
 Vancouver, BC  |
 www.tiro.com   |
 [EMAIL PROTECTED]  | 
 
 
 

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library
of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.  





RE: RECOMMENDATIONs( Term Asian is not used properly on Computersand NET)

2001-05-31 Thread James E. Agenbroad

  Thursday, May 31, 2001
We seem to have strayed from searching for a clearer term than Asian.  I
think part of the problem is that many language names are also national
adjectives, e.g., Chinese, Japanese and Korean.  Likewise names of scripts
(or writing systems) are also often names of languages, e.g., Arabic.
 I would hope that input methods (for Chinese or Amharic characters) remain
a separate issue: so long as the method results in a Unicode encoding that can
be unambiguously shared, it should not matter what keystrokes were used.  (An
analogy might be QWERTY vs. Dvorak input not affecting ASCII.) Input methods
are still an important issue, but a separate one.

 On Thu, 31 May 2001, Carl W. Brown wrote:

 Liwal,
 
 Such classifications are not easy.  For example Azeri can be written in both
 Latin and Cyrillic scripts.  The Latin script is much like Turkish which has
 the dotted and dot-less i.  This is not necessarily a big issue for fonts,
 but it requires special case shifting logic.
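
The special case shifting logic for dotted and dot-less i can be sketched in Python as below. This is a minimal illustration, not a full locale-aware implementation: the function names are hypothetical, and the code only handles the two letter pairs before falling back to the default case mapping.

```python
DOTTED_CAPITAL_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
DOTLESS_SMALL_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I


def turkic_upper(text):
    """Uppercase with i -> dotted capital I and dotless i -> I."""
    text = text.replace("i", DOTTED_CAPITAL_I).replace(DOTLESS_SMALL_I, "I")
    return text.upper()


def turkic_lower(text):
    """Lowercase with dotted capital I -> i and I -> dotless i."""
    text = text.replace(DOTTED_CAPITAL_I, "i").replace("I", DOTLESS_SMALL_I)
    return text.lower()


print("istanbul".upper())        # default mapping loses the dot
print(turkic_upper("istanbul"))  # keeps the dot on the capital
print(turkic_lower("ISI"))       # both capitals become dotless i
```

The font needs all four glyphs either way; what changes with the language is which capital pairs with which small letter.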
 
 What do you do about scripts that are not tied to a locale?  The Orthodox
 Church uses a special Cyrillic font that is different from standard
 Cyrillic.
 
 The classifications vary not only by script but by how it affects your
 specific field of interest and the implementation.  For example Unicode
 implements Ethiopian as fully formed syllabic characters.  Some
 implementations use decomposed syllables.  This allows 256 byte code pages
 but it requires glyph composition.  This would make it similar to SE Asian
 and Indic processing.  But with fully composed glyphs you would classify the
 language differently, probably as a large character set language with an
 input method editor like the CJK languages.
 
 Carl
 
 
 
 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]On
 Behalf Of N.R.Liwal
 Sent: Thursday, May 31, 2001 8:52 PM
 To: Jungshik Shin
 Cc: [EMAIL PROTECTED]
 Subject: Re: RECOMMENDATIONs( Term Asian is not used properly on
 Computers and NET)
 
 
 Dear Jungshik Shin;
 
 Thanks, good explanations, I hope those who are interested in Software and
 Web for Asia will be
 benefited.
 
 Thanks.
 
 Liwal
 
 - Original Message -  On Wed, 30 May 2001, N.R.Liwal wrote:
 
   TERM ASIA IN COMPUTER  INTERNET (RECOMMENDATIONS UNICODE LIST MAY
 2001)
  
   So far the recommendations are, that Asian Text Fonts can be called:
   -Han Fonts or Hanzi Fonts
 
 As already pointed out, this is not adequate to cover Korean
  and Japanese because other scripts are also used for them. Moreover,
  Japanese may not like 'Hanzi' even if you're talking about
  Hanzi/Kanji/Hanja alone. Even 'Han' (which is more neutral) could be
  balked at by some.
 
   -East Asian Unified Fonts
   -East Asian Fonts
 
If they mean fonts for Chinese, Japanese and Korean writing
  systems, I would pick 'East Asian fonts'.
 
 
   Script Can be classified as:
   -languages which Han ideographs
 
  you're talking not about language(s) but about script(s) , right?
 
   -'ideographic languages' SCRIPT
 
 A language cannot be ideographic as I wrote before. Has anybody else
  mentioned this term other than me? I mentioned it not because I think it's
  appropriate BUT because I think that the term (ideographic language)
  MUST NOT be used.
 
   -East Asian Unified SCRIPT
 
What's been 'unified' is Han 'ideographs' while there ARE other
  scripts in (more predominant) use in the region (even if you only mean
  Chinese,Japanese and Korean by 'East Asian').
 
   - East Asian SCRIPT
 
What 'script' (not 'scripts') are you talking about here?
  If you just mean 'Han ideographs', I don't think you  need to come up with
  new term(s). I think 'Han ideograph' (or CJK ideographs if it ONLY means
  Hanzi/Kanji/Hanja and nothing else)  is good enough (although certainly
  not perfect.)  On the other hand, if you're talking about all the scripts
  used in Northeast/East Asian countries (or China, Japan and Korea),
  you CANNOT use any of the above with the possible exception of the last
  (which can be used provided that they're made plural 'East Asian Scripts'
  to reflect that there are *multiple* scripts in use.)
 
 
   Asian geographic expressions are better:
   -Southeast Asia, East Asia CENTRAL ASIA
   WEST ASIA = Arabic Countries and  Neighborhood
 
I believe the following are widely used at least in 'geography
  text books' and 'encyclopedia'. Also, many US schools with regional
  studies programs use similar divisions (except for Southwest Asian which
  appears to be referred to as 'Middle East' most of the time). This division
  is bound to be arbitrary to some degree (Asian continent is not a circle
  or any definitive geometric shape which can be divided in an unambiguous
  way ;-) )
 
 
East Asia/Northeast Asia : Japan, Korea, China (it's a huge country,
 but)
'Far East' (in Western 

RE: Some Char. to Glyph Statistics, Pan/Single Font

2001-05-31 Thread James E. Agenbroad

   Thursday, May 31, 2001
My goal was never to give a specific number of glyphs needed to display a
particular Indian or other script.  As others have pointed out, this
depends among other things, on the particular display device and its font
processing software possibly including the operating system.  My goals
were to point out that Arabic and South and Southeast Asian scripts require:
1. Many more glyphs than character codes and, 2. As important, software to
render character codes legibly from the available glyphs. Discussions of a
single Unicode font that do not mention such software seem pointless, or
worse, managers might believe them.  I wonder if we could usefully define
levels of legibility for displaying a language or writing system, or is it
too subjective?  Is evoking a lam alef ligature when alef follows a lam the
minimal level for any language using Arabic script?  For languages using
Devanagari script is transposing the short i matra (U+093F) to precede the
consonant(s) it follows the minimum?
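
The two "minimal level" behaviours mentioned above can be sketched in Python as follows. This is a simplified illustration, not a rendering engine: the function names are hypothetical, only the basic consonants U+0915 to U+0939 are recognised (no nukta or other signs), and the lam alef case substitutes the isolated presentation form without looking at joining context.

```python
I_MATRA = "\u093F"            # DEVANAGARI VOWEL SIGN I
VIRAMA = "\u094D"             # DEVANAGARI SIGN VIRAMA (halant)
LAM, ALEF = "\u0644", "\u0627"
LAM_ALEF_ISOLATED = "\uFEFB"  # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM


def is_consonant(ch):
    """True for the basic Devanagari consonants KA..HA."""
    return "\u0915" <= ch <= "\u0939"


def cluster_start(out):
    """Index where the consonant cluster at the end of `out` begins."""
    pos = len(out)
    if pos == 0 or not is_consonant(out[pos - 1]):
        return pos
    pos -= 1
    # Extend leftwards over each preceding "consonant + virama" pair.
    while pos >= 2 and out[pos - 1] == VIRAMA and is_consonant(out[pos - 2]):
        pos -= 2
    return pos


def reorder_short_i(text):
    """Move each short i matra in front of the consonant(s) it follows."""
    out = []
    for ch in text:
        if ch == I_MATRA:
            out.insert(cluster_start(out), ch)
        else:
            out.append(ch)
    return "".join(out)


def ligate_lam_alef(text):
    """Replace lam followed by alef with a single ligature glyph code."""
    return text.replace(LAM + ALEF, LAM_ALEF_ISOLATED)
```

The stored character codes stay in logical order; only the sequence handed to the glyph stage is rearranged or substituted.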
 Regards,
  Jim Agenbroad (disclaimer and address at bottom)
 On Thu, 31 May 2001, Marco Cimarosti wrote:

 Mike Meir wrote:
  The problem with your glyph statistics is that they are based 
  on mould counts employed by the Monotype hot metal typesetters.
 
 I agree: no one will ever come up with *the* correct count.
 
 Such general evaluations simply depend on too many things to be useful.
 E.g.: which language(s) are targeted, what degree of typographic excellence
 is required, and (as Mike explained very well) the kind of technology
 involved and its limitations.
 
 The simple fact that software fonts can overlay glyphs can cause a great
 factor of reduction,  compared to lead type. Similarly, the fact that a
 software font technology has the capability of kerning glyphs vertically can
 reduce dramatically the inventory of glyphs needed for certain scripts.
 
 Moreover, different technologies may have totally different meanings for the
 word glyph. E.g., I have heard of Arabic fonts that analyze the Arabic
 script well under the level of a grapheme: segments of lines and
 individual dots were stored separately and assembled at display time.
 Comparing the number of glyphs in such a font with the inventory of a
 more traditional font is what we call "sum up apples and pears".
 
  Turning to Devanagari, our researches indicate that the total 
  number of script units (In Unicode terms, combinations of 
  consonants, halants, vowel signs and other signs),  excluding 
  the Unicode characters in the range 0951 to 0954, in use is 
  around the 5550 mark. It is actually greater than this, since 
  there are a number of characters relating to Sanskrit sandhi 
  for which we do not have any conjunct-vowel statistics.
 
 As an opposite example for Devanagari, I did a little research on my own on
 a minimal rendering scheme for Unicode Indic scripts. The scenario behind
 this evaluation was low-resolution displays or printers and simple bitmapped
 fonts.
 
 For Devanagari's 77 characters (non-decomposable L and M characters) my
 set of glyphs was just 82 pieces. Of course, such a ratio (about 1:1.06)
 requires dropping any typographical gracefulness: of all the complexity of
 Devanagari, just a handful of half-consonants and ligatures was preserved.
 
 Neither your 5550 nor my 82 are of much use to anyone who has even
 slightly different requirements. However, the contrast between these two
 figures perhaps says something about the difficulty of such a count.
 
 _ Marco
 
 

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )
 The above are purely personal opinions, not necessarily the official
views of any government or any agency of any.
Phone: 202 707-9612; Fax: 202 707-0955; US mail: I.T.S. Dev.Gp.4, Library
of Congress, 101 Independence Ave. SE, Washington, D.C. 20540-9334 U.S.A.  





Some Char. to Glyph Statistics, Pan/Single Font

2001-05-30 Thread James E. Agenbroad

 Wednesday, May 30, 2001
Attached is a note I wrote in September 1993 about the ratio of characters
to glyphs in several Indic scripts.  Much has changed on the Unicode
front since then, but I think the need for rendering software to decide
which of many glyphs to use to represent a given sequence of codes is
still with us.  A similar situation obtains with Arabic--unless one
requires the use of Arabic presentation forms.  If one excludes the
combining characters at U+0300 to 0362 European scripts tend to have a 1:1
character to glyph ratio; Chinese, Japanese and (maybe Korean) scripts
also tend to have a 1:1 character to glyph ratio.  But most scripts
between Europe and the Far East--Arabic, South and Southeast Asian ones do
not.  Unless the rendering software and the fonts are in synch the results
will be unsatisfactory.  A few postings in the 'single font' discussion
have mentioned this but I hope some data may be helpful.
 The story goes that back in Ancient Greece (I think) someone was
describing Utopia and a listener asked, "But who will do the work?" and
the reply was, "Oh, we will have slaves."  The computer now can be an
effective slave when given explicit instructions, but without consistent
instructions the results will not be satisfactory.
 This may be beyond the scope of Unicode which aims to unambiguously
encode text for the computer (and succeeds) but does not dwell on details
of its input or output--rendering it legible for humans to read.  

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )

-- Forwarded message --
Date: Fri, 10 Sep 93 14:12:07 -0400
From: jage (James E. Agenbroad)
To: [EMAIL PROTECTED]
Cc: jage@seq1
Subject: Some Character to Glyph Statistics

Friday, September 10, 1993
Glenn,
 Recent Internet discussions about fonts for ISO10646/Unicode prompted
me to do some counting.  The data are suggestive rather than definitive
at least in part because the counts of glyphs are based on only a single
source and it may not be up to date.  They do suggest that for various
writing systems of South (and maybe Southeast) Asia based on Indic scripts
the ratio of coded characters to glyphs is not 1:1 but 1:2 or even 1:3.
I'm sure this is no surprise to you but the Internet discussions make no
mention of it so I thought I would.  When a writing system has more glyphs
than characters I think there must be software to decide when which glyph
is wanted.  (This software may also need to know something about the
target device but that's not an issue I can shed any light on.)  
 As a preliminary assessment I have counted the number of character
codes ISO 10646 assigns for several writing systems and the number of
glyphs from synopses of the same writing systems as found in "Specimen
book of 'Monotype' non-latin faces", issued loose-leaf by Monotype
Corporation.  I give the number and date of each sheet.  In counting
I have omitted western style punctuation and numerals.

Writing System   Sheet, date       10646   Mono.    Rough
                                   chars   glyphs   ratio

Bengali          470, 5/65            89     331    1:3
Burmese          558, 5/64            76     213    1:3
Devanagari       155, 8/75           104     248    1:2.5
Gujarati         460, 7/71            75     232    1:3
Gurmukhi         601, 9/74            74     146    1:2
Kannada          588, 9/69            80     236    1:3
Malayalam        590, 7/75            78     590    1:7
Oriya            706, 3/70            78     371    1:4
Sinhalese        557, 1/64            90     348    1:3.5
Tamil            280, 1/64            61     171    1:3
Telugu           626, 3/71            80     312    1:4
Thai             577, 4/74            92     208    1:2
Tibetan          (Van Ostermann)      80     158    1:2

For Sinhalese and Tibetan (not in 10646 yet) the character count is from
Unicode Technical Report no. 2.  For Devanagari and Gurmukhi the Monotype
sheet has a note: "A special mould is required for these matrices."  The
relation of these fonts to current systems is unclear.  As noted, my
Monotype book does not include Tibetan; the glyphs are from George Van
Ostermann's "Manual of foreign languages", 4th ed., 1952--I counted the
letters, ligatures, numerals, vowel signs and punctuation.
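
For anyone who wants to recompute the rough ratios, here is a short Python sketch. The counts are read from the table above on the assumption that, where the sheet date and the character count run together, the last two or three digits are the character count; the function name is hypothetical.

```python
# (writing system, ISO 10646 character count, Monotype glyph count)
COUNTS = [
    ("Bengali", 89, 331),
    ("Burmese", 76, 213),
    ("Devanagari", 104, 248),
    ("Gujarati", 75, 232),
    ("Gurmukhi", 74, 146),
    ("Kannada", 80, 236),
    ("Malayalam", 78, 590),
    ("Oriya", 78, 371),
    ("Sinhalese", 90, 348),
    ("Tamil", 61, 171),
    ("Telugu", 80, 312),
    ("Thai", 92, 208),
    ("Tibetan", 80, 158),
]


def glyphs_per_character(chars, glyphs):
    """Glyphs needed per coded character, rounded to one decimal place."""
    return round(glyphs / chars, 1)


for name, chars, glyphs in COUNTS:
    ratio = glyphs_per_character(chars, glyphs)
    print(f"{name:<12}{chars:>5}{glyphs:>6}   1:{ratio}")
```

The computed figures differ somewhat from the rough column, which was rounded by eye, but the overall picture of two to several glyphs per character is the same.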

I would also like to express my agreement with the man from New South Wales
who said libraries will need to display lots of different characters.  I
do not know if this means one large font or many, so long as they are
all available when needed to display a string of character codes--without
the recipient knowing what will be needed and taking extra measures to
load the proper font.  The fonts for such purposes would not need to have
extremely

Coptic?

2001-05-22 Thread James E. Agenbroad

   Tuesday, May 22, 2001
My recollection is that assigning separate codes to all characters in
Coptic script rather than treating it as part of Greek script was under
consideration at one time. If so, is this effort's current status closer
to approved, rejected, dormant or still under consideration?  Not knowing 
exactly what is/was proposed I have no real opinion on its merits. Thanks
in advance.   

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





[unicode] Re: Not exactly

2001-03-28 Thread James E. Agenbroad

On Tue, 27 Mar 2001, Tony Graham wrote:

 At 27 Mar 2001 12:37 -0500, James E. Agenbroad wrote:
   On page 125 of the 2000 cumulation of 'Computer literature index' under 
   the subject heading 'Conversion' the annotation for "Unicode: a primer" by
   Tony Graham says: "Unicode is a programming standard and coding system for
   translating programs and applications into other languages with different
   character sets."
 
 Blame "Computer literature index", not me: I wouldn't have written
 that even if I could understand it.
 
 Regards,
 
 
 Tony Graham.
 
Wednesday, March 28, 2001
Tony,
 I totally agree and did not mean to suggest otherwise.  Your book is
very useful--I have a copy at my desk.   

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





[unicode] Not exactly

2001-03-27 Thread James E. Agenbroad

   Tuesday, March 27, 2001
On page 125 of the 2000 cumulation of 'Computer literature index' under 
the subject heading 'Conversion' the annotation for "Unicode: a primer" by
Tony Graham says: "Unicode is a programming standard and coding system for
translating programs and applications into other languages with different
character sets."

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





[unicode] Re: removing compromises from unicode (WCode)

2001-03-26 Thread James E. Agenbroad

On Fri, 23 Mar 2001, Jonathan Coxhead wrote:


 
It would be very entertaining to do the same job with the ideographs (down 
 to the radical level) and count the number of atoms. I suspect the resulting 
 "character set" would contain less than 2000 atoms altogether.
 
Please do feel free to share any thoughts on the "Atomic Theory" with me!
 
 /|
  o o o (_|/
 /|
(_/
 

Monday, March 26, 2001
Mr. Coxhead, 
 I am far from an expert on Chinese characters, but I suspect that
decomposing ideographs down to their radicals would sometimes require some
means of indicating the relative position of the component radicals. 
(Ideographic description characters, U+2FF0 to U+2FFB, described at
pages 268-271 of 3.0, are one such means.)  The 'code the strokes' approach
has the same difficulty but with greater frequency.  Both also assume some
means to indicate the end of a character.  These approaches or variants
of them have been used as means of character input where, after a person
resolves ambiguous cases, a unique code for the whole character is stored
and transmitted.
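
As a sketch of how the ideographic description characters address both relative position and the end-of-character question, the Python below reads one description written in prefix order. It assumes each description character is followed by a fixed number of components (three for U+2FF2 and U+2FF3, two for the others); the function names are hypothetical and no check is made that the components are real radicals.

```python
TERNARY = {"\u2FF2", "\u2FF3"}  # left-middle-right, above-middle-below


def is_idc(ch):
    """True for the ideographic description characters U+2FF0..U+2FFB."""
    return "\u2FF0" <= ch <= "\u2FFB"


def parse_ids(text, pos=0):
    """Return the index just past one complete description starting at pos."""
    if pos >= len(text):
        raise ValueError("truncated description")
    ch = text[pos]
    pos += 1
    if is_idc(ch):
        # The operator says how many component descriptions follow it.
        for _ in range(3 if ch in TERNARY else 2):
            pos = parse_ids(text, pos)
    return pos


def split_descriptions(text):
    """Split a run of descriptions into one string per described character."""
    pieces, pos = [], 0
    while pos < len(text):
        end = parse_ids(text, pos)
        pieces.append(text[pos:end])
        pos = end
    return pieces


# U+2FF0 (left to right) followed by two tree components describes "forest".
print(split_descriptions("\u2FF0\u6728\u6728\u65E5"))
```

Because each operator fixes the number of components that follow, the reader knows where one description ends without a separate terminator.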
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Indic Scripts Page

2001-03-13 Thread James E. Agenbroad

 Tuesday, March 13, 2001
Those interested in Indic and related scripts might want to consult:
http://www.cs.colostate.edu/~malaiya/scripts.html
[That's a tilde before malaiya.]  Not all the links from it are operational
but many are.

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )




RE: Final letters in Hebrew and Arabic

2001-03-12 Thread James E. Agenbroad

On Sat, 10 Mar 2001, Jonathan Rosenne wrote:

 Regarding Hebrew:
 
  -Original Message-
  From: Nick NICHOLAS [mailto:[EMAIL PROTECTED]]
  Sent: Friday, March 09, 2001 10:12 PM
  To: Unicode List
  Cc: Nick NICHOLAS
  Subject: Final letters in Hebrew and Arabic

  (1) When a letter with a final variant appears alone --- say as a numeral,
  or in discussion of the letter or phoneme --- does it under any
  circumstances appear in its final form, or is it always medial?
 
 Monday, March 12, 2001
When Hebrew letters are used as numbers (probably not a current
mainstream practice), the final forms of kaph, mem, nun, pe and ssadhe are
used to represent 500, 600, 700, 800 and 900. My source: "Alphabete und
Schriftzeichen des Morgen- und des Abendlandes." 2. Aufl. Berlin:
Bundesdruckerei, 1969.  Hence my use of German transliterated letter names.
Use of medial forms would thus change the numeric value; this would also
mean the final forms could appear in the middle of a number.  Nakanishi
(p. 32), Daniels and Bright (p. 490) and Van Ostermann (1952, p. 120) only
give numeric values for Hebrew letters through 400. I do not know if it is
safe to infer from their silence that use of final forms for 500 to 900
is a seldom used twig of a seldom used branch. 
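
[Editorial sketch, not part of the original message.]  The point that
swapping a final form for a medial one changes the numeric value can be
illustrated roughly as follows.  The values 1 to 400 are the usual ones;
the 500 to 900 values are the ones described above.  The function name
and the finals_as_hundreds switch are hypothetical, chosen only for this
illustration.

```python
# Usual letter values 1-400 (medial/ordinary forms).
HEBREW_VALUES = {
    "\u05D0": 1, "\u05D1": 2, "\u05D2": 3, "\u05D3": 4, "\u05D4": 5,
    "\u05D5": 6, "\u05D6": 7, "\u05D7": 8, "\u05D8": 9,
    "\u05D9": 10, "\u05DB": 20, "\u05DC": 30, "\u05DE": 40, "\u05E0": 50,
    "\u05E1": 60, "\u05E2": 70, "\u05E4": 80, "\u05E6": 90,
    "\u05E7": 100, "\u05E8": 200, "\u05E9": 300, "\u05EA": 400,
}

# Final forms carrying 500-900, as described in the message above.
FINAL_VALUES = {
    "\u05DA": 500,  # final kaf
    "\u05DD": 600,  # final mem
    "\u05DF": 700,  # final nun
    "\u05E3": 800,  # final pe
    "\u05E5": 900,  # final tsadi
}

# Each final form paired with its ordinary (medial) form.
FINAL_TO_MEDIAL = {
    "\u05DA": "\u05DB", "\u05DD": "\u05DE", "\u05DF": "\u05E0",
    "\u05E3": "\u05E4", "\u05E5": "\u05E6",
}


def numeral_value(letters, finals_as_hundreds=True):
    """Sum the letter values of a Hebrew numeral (hypothetical helper).

    With finals_as_hundreds=False a final form is read as its medial
    letter, which is the reading that would lose the 500-900 values.
    """
    total = 0
    for ch in letters:
        if ch in FINAL_VALUES:
            if finals_as_hundreds:
                total += FINAL_VALUES[ch]
            else:
                total += HEBREW_VALUES[FINAL_TO_MEDIAL[ch]]
        else:
            total += HEBREW_VALUES[ch]
    return total


# Final kaf followed by alef: 501 under one reading, 21 under the other.
print(numeral_value("\u05DA\u05D0"))
print(numeral_value("\u05DA\u05D0", finals_as_hundreds=False))
```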

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )




Re: languages on Web (was Re: Unicode market acceptance)

2001-03-09 Thread James E. Agenbroad

On Fri, 9 Mar 2001 [EMAIL PROTECTED] wrote:

 
 On 03/09/2001 11:01:53 AM "Tex Texin" wrote:
 
 We have estimates for (human) language usages on the web
 
 Do you mean the number of different languages used on the web? I'd be
 curious to know what such estimates are.
 
 
 
 - Peter
 
 
 ---
 Peter Constable
 
 Non-Roman Script Initiative, SIL International
 
   Friday, March 9, 2001
Me too.   
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )




Re: Article in Financial Times; Feb 7, 2001

2001-02-08 Thread James E. Agenbroad

On Thu, 8 Feb 2001, Michael Everson wrote:

 At 04:48 -0800 2001-02-08, J M Sykes quoted the FT:
 
 The International Standards Organisation (ISO) has now agreed to give
 standard meanings to these remaining codes.
 
 Which as everyone knows, is really the International Organization for
 Standardization (ISO).
 
 Sigh.
 --
 Michael Everson  **  Everson Gunn Teoranta  **   http://www.egt.ie
 15 Port Chaeimhghein Íochtarach; Baile Átha Cliath 2; Éire/Ireland
 Mob +353 86 807 9169 ** Fax +353 1 478 2597 ** Vox +353 1 478 2597
 27 Páirc an Fhéithlinn;  Baile an Bhóthair;  Co. Átha Cliath; Éire
 
 Thursday, February 8, 2001
And the next sentence: "The new standard is known as 'Latin-1' or
'extended ASCII' and includes accented characters."  I'd say 'includes
*some* accented characters' just as  Latin-2, Latin-3 etc. include other
repertoires of accented characters and other alphabets needed for a
particular group.  And later: "Double-byte codes are a very efficient
means of storing ideographic characters, such as Chinese, since a whole
word is stored in the equivalent of the space for two letters. Since each
word has a unique code there is also less of the ambiguity that is
inherent in, for example English ... Unicode's unambiguous meanings
..." This begs for a definition of a Chinese word and seems unaware that
Unicode assigns codes to characters, not to words or their meanings. Later
he accurately enough describes one approach to the character input issue
and then leaps to the domain names issue.  I was unable to find
www.worldnames.com which he cites. I join Michael with a sigh.  Feel free
to use these thoughts as part of a response, but please do not forward this
to him or the Financial Times.
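
[Editorial sketch, not part of the original message.]  Two of the points
above can be checked with a few lines of Python: Latin-1 holds only some
accented letters (others live in Latin-2 and its siblings), and Unicode
codes characters rather than words.  The sample letters and the
two-character word below are my own choices for illustration, and the
helper names are hypothetical.

```python
def encodable(ch, codec):
    """Return True if ch can be represented in the named codec."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


def code_points(text):
    """List the code point of every character in text."""
    return [f"U+{ord(ch):04X}" for ch in text]


E_ACUTE = "\u00E9"         # e with acute: present in Latin-1 and Latin-2
O_DOUBLE_ACUTE = "\u0151"  # o with double acute: Latin-2, not Latin-1

print(encodable(E_ACUTE, "iso-8859-1"))         # True
print(encodable(O_DOUBLE_ACUTE, "iso-8859-1"))  # False
print(encodable(O_DOUBLE_ACUTE, "iso-8859-2"))  # True

# A two-character Chinese word is two coded characters, not one code.
print(code_points("\u4E2D\u6587"))  # ['U+4E2D', 'U+6587']
```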
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )




Daniels and Bright Tibetan Query

2001-01-31 Thread James E. Agenbroad

   Wednesday, January 31, 2001
In the chapter on Tibetan in Daniels and Bright's The world's writing
systems (page 434) about prescript symbols: "There are six radicals that
never occur with a prescript: wa, ra, la, ha, and 'a chung." Does anyone
know what the sixth one is or should it be "five"?  Thanks in advance. 

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )




Order of bidi script numbers in ranges

2001-01-19 Thread James E. Agenbroad

Friday, January 19, 2001
In what order are ranges of numbers such as 15-23 expressed in a bidi
context? 1. What is wanted visually, if there is one consistent
expectation? 2. In what order should the codes then be stored in Unicode
for the bidi algorithm to provide the desired visual order?  My guess is
that visually the '15' would be wanted to the right of the dash and '23',
and that this would be the desired order in a Unicode string too, but it's
only a guess.  Thus the Unicode test string would contain:
1. The first string of Arabic or Hebrew characters
2. The code for '1' 
3. The code for '5'
4. The code for dash
5. The code for '2'
6. The code for '3'
7. The code for remaining Arabic or Hebrew text.
Apologies if this is obvious to everyone or explicit in 3.0.  May I assume
decimal numbers such as 3.1416 and time such as 10:30 are expressed in the
order used in the West (though the punctuation may differ)? TIA.
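
[Editorial sketch, not part of the original message.]  The test string
listed above can be built in logical (storage) order and inspected with
Python's unicodedata module.  This does not answer the question of the
resulting visual order; it only shows the bidirectional class of each
stored character, which is what the bidi algorithm works from.  Using a
single Hebrew alef to stand in for the surrounding text, and the helper
name bidi_classes, are assumptions made for the illustration.

```python
import unicodedata


def bidi_classes(text):
    """Pair each character's code point with its bidirectional class."""
    return [(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch)) for ch in text]


ALEF = "\u05D0"  # stands in for "Arabic or Hebrew text" on either side

# Logical order as listed in the message: text, 1, 5, dash, 2, 3, text.
logical = ALEF + " 15-23 " + ALEF

for code_point, bidi_class in bidi_classes(logical):
    print(code_point, bidi_class)

# Classes seen here: R (right-to-left letter), WS (whitespace),
# EN (European number), ES (European number separator).
# The separators in 3.1416 and 10:30 are class CS (common separator).
print(unicodedata.bidirectional("."), unicodedata.bidirectional(":"))
```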
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )




Re: conjucts beginning with independent vowel?

2001-01-17 Thread James E. Agenbroad

On Wed, 17 Jan 2001 [EMAIL PROTECTED] wrote:

 
 On 01/17/2001 05:13:25 AM Michael Everson wrote:
 
  A + Ldep
 
 No such thing as Ldep in our model, so you'd have to rely on A + virama +
 L.
 
 Well, if a script had such behaviour, one possibility could be to propose a
 combining CONSONANT SIGN L for what we would be choosing to think of as a
 dependent form of the consonant. I.e. it may not be in an existing model,
 but for a new script one could create a new model. I hear you saying,
 though, that you think it would be preferable to fit this into the existing
 model that uses a virama.
 
 
 
 - Peter
 
 
 ---
 Peter Constable
 
 Non-Roman Script Initiative, SIL International
 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
 Tel: +1 972 708 7485
 E-mail: [EMAIL PROTECTED]
 
 
 
  Wednesday, January 17, 2001
A virama after other than a consonant seems un-Indian.  My novice's
understanding of virama is that it means: If the available rendering
capabilities allow it, consider the implicit 'a' expunged and combine the
preceding consonant with the next one to form a conjunct; otherwise
(i.e. if the rendering capabilities do not allow this) insert the virama
glyph beneath the preceding consonant.  This would mean the last example
in Unicode 3.0 figure 9-3 could be ignored and instead RA + vocalic R
vowel sign (U+0930, U+0943 with no virama) would be rendered as
independent vocalic R (U+090B) with "reph hook" above it.  
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )




Re: Unicode Case Mappings UTR #21

2000-11-30 Thread James E. Agenbroad

On Thu, 30 Nov 2000, Antoine Leca wrote:

 Carl W. Brown wrote:
  
  #3 French also has other articles such as d'.
 
 Yes. But this one, contrary to "l'" can according to the context,
 either be the contraction (élidé) of "de", or can be a genuine
 part of a proper name... When it comes to titlecase, OTOH, "d'"
 never becomes "D'", unless it is the very first word in a sentence,
 while one can see "L'" inside a sentence.
 
  are there prescribed rules for capitalization?
 
 You mean, "title casing" (capitalization would mean "uppercasing").
 Yes, and they are quite complex, at least to my taste.
 
 Title case is a pretty arcane matter in French (i.e., to determine
 what should and what should not be written with an initial capital letter,
 named "majuscule"). Some detailled ways are given (in French) in
 URL:http://www.orthotypographie.fr.st/.
 
 
 Antoine
 
 Thursday, November 30, 2000
My French was never strong and now is minimal, but isn't "d'" a
preposition, not an article?  This probably doesn't change the essence of
the above message.  
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )




Lakota--Oops!

2000-11-15 Thread James E. Agenbroad

 Wednesday, November 15, 2000
Oh I see the long right leg is straight.  Sorry.

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )




RE: Devanagari Consonant RA Rule R2

2000-11-09 Thread James E. Agenbroad

On Wed, 8 Nov 2000, Apurva Joshi wrote:

 The RA[sup] is seen applied to the independent vowel Vocalic R (U+ 090B) in
 printed samples in Sanskrit.
 
 There are at least the following words that contain the above:
 NaiRiTa (the name of a demon)
 = 0928 090B Ra[sup] 0924
 NaiRiTi (the goddess Durga, slayer of demons)
 = 0928 090B Ra[sup] 0924 0940
 NaiRiTYa (south-west)
 = 0928 090B Ra[sup] 0924 094D 092F 
 
 The Devanagari shaping engine in Uniscribe currently recognises a 0930 094D
 preceding only consonants, to be duly reordered to the end of the syllable
 and replaced with Ra[sup]. Whether this be extended to independent vowels
 had figured in internal discussions when the shaping engine was being
 planned. To the best of my knowledge, extending this to be applicable to
 Vocalic R would be a special case, because Ra[sup] is not seen to be applied
 to any other Indic vowel in words that are native to Indic languages.  
 
 Would be glad to hear from any expert on this list, if there are
 phonemes/sounds in any language, which when transliterated into Devanagari,
 would require the Ra[sup] to be applied to an independent vowel. 
 eg. vowel E Ra[sup] etc.
 
 Thanks,
 -apurva
 
 -Original Message-
 From: Eric Mader/Cupertino/IBM [mailto:[EMAIL PROTECTED]]
 Sent: Wednesday, November 08, 2000 10:24 AM
 To: Unicode List
 Subject: Devanagari Consonant RA Rule R2
 
 
 Hello,
 
 In the Devanagari section of the standard, rule R2, on page 217 of the
 version 3.0 standard, states, "If the dead consonant RA[d] precedes either a
 consonant *or an independent vowel,* then it is replaced by the superscript
 nonspacing mark RA[sup]..."
 
 I've never seen a RA[sup] applied to an independent vowel, and none of the
 software I can find that renders Devanagari does this; they all render a
 dead RA followed by the vowel. Is the rule in error, or is it written to
 cover some obscure case that most software doesn't bother with?
 
 Eric Mader
 
 
Wednesday, November 8, 2000
First, I'm not an expert in Sanskrit but have done some work with
Devanagari.  I think at figure 9-3 (4) on page 214 and at R2 on page 217 
Unicode 3.0 overstates and misstates the situation a bit.  What is being
described is, I believe, a rendering issue, not an encoding issue.  
Instead of involving an independent vowel, it involves the r consonant,
U+0930, immediately followed by the R vowel sign (matra), U+0943, which
happens to get rendered as the independent vowel, U+090B, with the
superscript R, reph, above it--with no halant between the consonant
and the vowel sign.  On page 24 of Hester Lambert's Introduction to
Devanagari: "The vowel sign of [U+090B] is not written with [U+0930,
094D]. The character representing [U+0930, 094D] with [U+090B] is written
with the superscribed stroke used to represent [U+0930, 094D] when it is to
be realized before another consonant with character without an intervening
vowel [i.e. reph]. This stroke is placed over the vowel character
[U+090B], as in [U+0928, 093F, 090B, reph, 0924, 093F] nirrti."  The order
of filing 'nirri' (dot under second r) in Monier Williams Sanskrit-English
dictionary (page 554, column 2) tends to confirm this interpretation: After
nirUha it has nirri, nirrich and nirrij (with a dot below the second
r), followed by nire.  It is possible that this peculiar rendering practice
would extend to the RA followed by U+0944, 0962 or 0963 but they seem to
me too unlikely to dwell on.  I suppose (by analogy to having two ways to
encode many letters with diacritics) Unicode could allow two ways to
encode what looks like "R vowel with reph"; at present it describes the
one with a halant but is silent about the display when the r consonant is
immediately followed by the r matra, U+0930, 0943. 
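
[Editorial sketch, not part of the original message.]  The two candidate
encodings discussed above can be written out and named with Python's
unicodedata module.  The variable and function names are hypothetical;
the code only lists what is stored and shows that normalization does not
treat the two sequences as equivalent.  It says nothing about how any
renderer displays them.

```python
import unicodedata

# RA + virama (halant) + independent vocalic R: the form rule R2 describes.
WITH_HALANT = "\u0930\u094D\u090B"

# RA + vowel sign vocalic R, no halant: the alternative discussed above.
WITHOUT_HALANT = "\u0930\u0943"


def describe(sequence):
    """List each stored character with its code point and Unicode name."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in sequence]


for label, seq in (("with halant", WITH_HALANT), ("without halant", WITHOUT_HALANT)):
    print(label)
    for line in describe(seq):
        print("  " + line)

# The sequences stay distinct under normalization, so software that wants
# the same display for both has to handle that in rendering.
same = unicodedata.normalize("NFC", WITH_HALANT) == unicodedata.normalize("NFC", WITHOUT_HALANT)
print(same)  # False
```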
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )





Devanagari question

2000-11-09 Thread James E. Agenbroad

 Thursday, November 9, 2000
After sending a comment on the Ra(sup) + independent vowel discussion, two
more general Devanagari questions occurred to me:
1. Is a halant/virama ever valid following other than a consonant (or
consonant and nukta)?  My logic being that you cannot remove an inherent
vowel unless one is present, and they are only inherent to consonants.
2. Is a vowel (matra or independent) ever valid after a halant/virama? 
My logic being that it is pointless to say no vowel is present and then
give one, and a matra replaces the inherent vowel without the help of a
halant.  
I know Unicode prefers not to declare sequences of codes invalid so this
is not a suggestion that it do so, just a comment that, if true, those who
build Indic rendering software might find useful.
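
[Editorial sketch, not part of the original message.]  For those building
rendering software, the two questions above could be turned into a simple
diagnostic that flags, without rejecting, the sequences being asked about.
The code point ranges are a simplified assumption covering the main
Devanagari consonants, independent vowels and matras; the function names
are hypothetical.

```python
VIRAMA = "\u094D"
NUKTA = "\u093C"


def is_consonant(ch):
    # KA..HA plus the precomposed nukta consonants QA..YYA.
    return "\u0915" <= ch <= "\u0939" or "\u0958" <= ch <= "\u095F"


def is_independent_vowel(ch):
    return "\u0904" <= ch <= "\u0914" or ch in ("\u0960", "\u0961")


def is_matra(ch):
    return "\u093E" <= ch <= "\u094C" or ch in ("\u0962", "\u0963")


def questionable_viramas(text):
    """Return (index, reason) pairs for viramas matching question 1 or 2."""
    findings = []
    for i, ch in enumerate(text):
        if ch != VIRAMA:
            continue
        # Question 1: what precedes the virama (skipping one nukta)?
        prev = text[i - 1] if i > 0 else ""
        if prev == NUKTA and i > 1:
            prev = text[i - 2]
        if not (prev and is_consonant(prev)):
            findings.append((i, "virama not after a consonant"))
        # Question 2: is a vowel given right after the virama?
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if nxt and (is_matra(nxt) or is_independent_vowel(nxt)):
            findings.append((i, "vowel after virama"))
    return findings


print(questionable_viramas("\u0915\u094D\u0937"))  # KA virama SSA: []
print(questionable_viramas("\u0905\u094D"))        # A virama: flagged
print(questionable_viramas("\u0930\u094D\u090B"))  # RA virama vocalic R: flagged
```

Since Unicode does not declare such sequences invalid, a checker of this
kind would only be advisory, for example as a proofreading aid.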
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )




Re: Devanagari question

2000-11-09 Thread James E. Agenbroad

On Thu, 9 Nov 2000, Rick McGowan wrote:

  1. Is a halant/virama ever valid following other than a consonant (or
  consonant and nukta)?
 
 Legal?  In the sense of "any string is legal", yes; as is anything else.
  The implementation question to answer is whether it's useful or
 renderable,  and if so, how.
 
 The independent vowel followed by a halant could be interpreted as a
 graphical answer to the question, "What is the sound of one vowel
 silenced?"
 
   Rick
 
  
 Thursday, November 9, 2000
Rick,
 My question was about the validity of a vowel after a halant, but
your mention of a vowel followed by a halant would seem equally
illogical.  (Note that neither languages nor writing systems, being
artifacts of fallible humans, need to be logical.)  
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )




Re: Number separators

2000-10-31 Thread James E. Agenbroad

On Tue, 31 Oct 2000, James E. Agenbroad wrote:

 On Mon, 30 Oct 2000, Michael (michka) Kaplan wrote:
 
  Most of this happens to be in the Windows NLS database. See GetLocaleInfo in
  MSDN for details:
  
  http://msdn.microsoft.com/library/psdk/winbase/nls_34rz.htm
  
  Or more specifically, LCTypes like LOCALE_SGROUPING for this function,
  listed at
  
  http://msdn.microsoft.com/library/psdk/winbase/nls_8rse.htm
  
  
  michka
  
  a new book on internationalization in VB at
  http://www.i18nWithVB.com/
  
  - Original Message -
  From: Ayers; "Mike" [EMAIL PROTECTED]
  To: "Unicode List" [EMAIL PROTECTED]
  Sent: Monday, October 30, 2000 10:19 AM
  Subject: Number separators
  
  
  
   I discovered this weekend that Chinese, despite grouping large
   numbers by ten thousands (I think I'm explaining this poorly - what I mean
   is that the chinese language has numbers representing nx10^4, as opposed
  to
   the nx10^3 used in english), write their digits with comma separators
  every
   3 digits, apparently having learned this from the same place they got the
   digits themselves.
  
   I am aware that there are European languages (swiss and italian?)
   that group four digits, and am reasonably sure that japanese does.
  
   Before I go on a wild web search, does anyone know if there already
   exists a collection of information on the numbering systems of various
   languages, including the natural language ordering of the numbers, the
  digit
   grouping size, and the digit group separator character?  Since this is for
   informational purposes, I don't need code, just examples.
  
  
   TiA,
  
   /"\/|/|ike /+yers
   \ / ASCII Ribbon Campaign
X  Against HTML Mail   Test Engineer
   / \BMC Software, Inc.
  
  
  Tuesday, October 31, 2000
 You probably should check out what's done in India. They call hundred
 thousands "crores" and have a name I don't recall for tens of millions.
 I don't recall how they punctuate them but think it's not in triplets as
 is done in the U.S.   
 
Tuesday, October 31, 2000
Oops!  100,000 is a lakh and 10,000,000 is a crore.  I should have checked
my dictionary. Where and with what they are separated I still don't know. 
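
[Editorial sketch, not part of the original message.]  The message leaves
the separator question open.  The convention commonly used with lakh and
crore groups the last three digits and then pairs of digits to the left;
that convention is my addition, not something stated in this thread.  The
function name and the default comma separator are likewise assumptions.

```python
def group_indian(n, sep=","):
    """Group digits as 3, then 2, 2, ... from the right (hypothetical helper)."""
    digits = str(abs(int(n)))
    head, tail = digits[:-3], digits[-3:]
    groups = []
    # Peel pairs of digits off the right end of what precedes the last three.
    while head:
        groups.insert(0, head[-2:])
        head = head[:-2]
    groups.append(tail)
    return ("-" if n < 0 else "") + sep.join(groups)


print(group_indian(100000))    # one lakh:  1,00,000
print(group_indian(10000000))  # one crore: 1,00,00,000
```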

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )




RE: Giga Character Set: Nothing but noise

2000-10-18 Thread James E. Agenbroad

On Wed, 18 Oct 2000 [EMAIL PROTECTED] wrote:

 Jon Babcock wrote:
  It seems to me that if not for that, how could anyone
  make a Chinese font? Who is going to sit down and
  draw a *myriad* or more characters? Since elements
  recur, this reduces the amount of labour required
  greatly.
 
 I too would have bet that all CJK foundries used some form of (automatic?)
 composition to build their fonts.
 
 But, after a few enquiries, it seem that they don't (or they do, but
 zealously guard the secret).
 
 _ Marco
 
Wednesday, October 18, 2000
If I had to make a guess it would be that transforming the glyphs of parts
of characters so they will fit together in a pleasing fashion would take
as much effort as (or more than) designing separate glyphs for each new
character.
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )




RE: CJK combining components

2000-10-18 Thread James E. Agenbroad

On Wed, 18 Oct 2000 [EMAIL PROTECTED] wrote:

 Doug Ewell wrote:
  Marco Cimarosti [EMAIL PROTECTED] wrote:
   Carl W. Brown:
   An article in the October 12, 2000 issue of Linux Weekly News
   http://lwn.net/bigpage.php3 tries to explain the benefit...
  
  Actually, that quote from Linux Weekly News came from me, not Carl.
  (I'm not trying to take credit for the research, just deflecting any
  criticism away from Carl.)
 
 My mistake, sorry. And thanks to Doug for providing this info.
 
 However, I was not criticizing that article -- nor defending GCS! --, but
 rather annoying the list (once more!) about the pros and cons of CJK
 characters seen as atomic units, as opposed to composed graphemes.
 
 This topic is so boring probably because it is a chicken-egg problem: a CJK
 ideograph is in fact a "character", just like any alphabetic letter is, but
 it is also a "compound" that can be analyzed in smaller elements, pretty
 like the jamos in a Hangul syllable, or the letters (and diacritics) in a
 word.
 
 David Starner wrote:
  If you can decompose the CJK characters into pieces and automatically
  recompose them, what stops you from doing that for Unicode?
 
 Yeah! Nothing can stop me! (Well, apart maybe time and budget
 considerations, and the fact that I am not in the fonts business -- but
 that's nobody's problem :-)
 
  The only problem is that you have to decompose the Unicode CJK 
  characters yourself, and you still have the table look ups,
  but there's no need to carry around a huge font.
 
 OK. But, in a hypothetical encoding by components, this look up wouldn't be
 necessary at all.
 
 And in a hypothetical "mixed" encoding (i.e., having both precomposed
 ideographs and combining elements), it would only be needed for
 normalization (i.e. when you want the text to be either all precomposed or
 all decomposed).
 
  Even if you have to work with preexisting Unicode technology,
  you could still make the font using that technology instead of doing
  everything by hand.
 
 Yes, I see your point: provided that ideographic decomposition really has
 some utility, this utility is not necessarily in the encoding.
 
 This is true, and a good point, but not necessarily a definitive argument
 against the theoretical possibility of a decomposed encoding.
 
 Compatibility with the existing practice is the only argument that convinces
 me (sort of) that Unicode provides the best possible encoding for CJK
 logographs.
 
 _ Marco
 
   Wednesday, October 18, 2000
Doesn't one also need to somehow specify the relative position of the
parts to each other? Just specifying the components of a character won't
suffice if the top half has three components and the bottom half has one
component above another on one side and just one on the right.  There are
templates for this but I think it is not trivial. 
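
[Editorial sketch, not part of the original message.]  One existing set of
layout "templates" is the Ideographic Description Characters at U+2FF0 to
U+2FFB; treating them as the templates meant above is my assumption.  Each
one is a prefix operator taking two or three parts, so a description is a
small tree.  The parser below, with hypothetical names, shows how nesting
carries the relative positions; the two sample descriptions are my own.

```python
# Operators that take two parts, and the two that take three.
BINARY = set("\u2FF0\u2FF1\u2FF4\u2FF5\u2FF6\u2FF7\u2FF8\u2FF9\u2FFA\u2FFB")
TERNARY = set("\u2FF2\u2FF3")


def parse_ids(seq):
    """Parse a prefix-notation description into a nested tuple tree."""

    def parse(pos):
        ch = seq[pos]
        arity = 2 if ch in BINARY else 3 if ch in TERNARY else 0
        if arity == 0:
            return ch, pos + 1  # a plain component
        parts = []
        pos += 1
        for _ in range(arity):
            part, pos = parse(pos)
            parts.append(part)
        return (ch, *parts), pos

    tree, end = parse(0)
    if end != len(seq):
        raise ValueError("trailing characters after a complete description")
    return tree


# Left-to-right: U+5973 beside U+5B50 (the shape of U+597D).
print(parse_ids("\u2FF0\u5973\u5B50"))

# Above-to-below, where the top is itself a left-to-right pair
# (the shape of U+60F3): the nesting records which parts sit where.
print(parse_ids("\u2FF1\u2FF0\u6728\u76EE\u5FC3"))
```

Even with such operators, the proportions of each part still have to be
adjusted per character, which is the non-trivial part noted above.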
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )




Re: Clarification of Arabic joining classes

2000-10-10 Thread James E. Agenbroad

On Tue, 10 Oct 2000, Majid Bhurgri wrote:

 
 On Tues, 10 Oct 2000, Roozbeh Pournader wrote:
 
  It's somehow weird for me, and if it were me, I would have considered it
  non-joining. Why would it appear between two letters that would otherwise
  join? Arabic cannot be broken between the joining letters.
 
 There are scores of words and instances in Arabic and other languages which
 use Arabic script where a word is split in two parts by not letting two
 letters join which would normally be joined. Non breaking zero width space
 facilitates such structures. It is used where you want to split the word,
 without it being treated as two words.
 
 Majid Bhurgri
 
   Tuesday, October 10, 2000
Am I correct in thinking that the letter before the 'non breaking zero
width space' would appear in its final form (or in stand alone form if a
space preceded it)?
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )




Re: TATAP = TATAR

2000-09-19 Thread James E. Agenbroad

On Tue, 19 Sep 2000, Mark Davis wrote:

 If those can be confirmed, then the SpecialCasing file should be modified to add
 them. Could you verify this in time for the next UTC?
 
 Mark
 
 Cathy Wissink wrote:
 
  I believe Azeri also uses the dotless i/dotted i Turkish-style casing.
 
  Cathy
 
  -Original Message-
  From: Carl W. Brown [mailto:[EMAIL PROTECTED]]
  Sent: Tuesday, September 19, 2000 9:03 AM
  To: Unicode List
  Subject: RE: TATAP = TATAR
 
  -Original Message-
  From: Herman Ranes [mailto:[EMAIL PROTECTED]]
  Sent: Tuesday, September 19, 2000 6:30 AM
  To: Unicode List
  Cc: [EMAIL PROTECTED]
  Subject: Re: TATAP = TATAR
 
  Several Tatar language links here:
  http://members.tripod.com/~anttikoski/eng_tatar.html
 
  In particular, the Tatar-Bashkir latin alphabet is presented in RFE/RL's
  site at
  http://rferl.org/bd/tb/tatar/TATAR/abs.html
 
  Are all these characters supported in UNICODE?
 
  I was unaware that they were moving back to the Latin alphabet.
  What jumps out at me is that case conversion code like the code that I just
  submitted for inclusion into ICU is wrong.  Turkish is not the only language
  with dotted and dot less i.  I assume that Tatar and Bashkir should follow
  the same rules as Turkish. Are there other languages?
 
  So I guess that I should check for "ba", "tt"  "tr" for special case
  shifting.  I presume that the alphabet is listed in proper sort order?
 
  Carl
 
 
   Tuesday, September 19, 2000
The ALA/LC romanization table for converting Azeri from Arabic script to
roman uses 'I' both with and without a dot.  I know this proves nothing
about how Azeri is now supposed to be written in roman script.
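
[Editorial sketch, not part of the original message.]  The casing problem
raised in the quoted messages can be illustrated with a language-aware
wrapper around Python's default case conversion, which is not language
sensitive.  The set of language codes is an assumption: "tr" and "az" are
used here, and whether "tt" and "ba" belong in it is exactly what the
thread leaves to be confirmed.  Function names are hypothetical.

```python
DOTLESS_I_LANGS = {"tr", "az"}  # assumption; "tt"/"ba" pending confirmation

CAPITAL_DOTTED_I = "\u0130"  # LATIN CAPITAL LETTER I WITH DOT ABOVE
SMALL_DOTLESS_I = "\u0131"   # LATIN SMALL LETTER DOTLESS I


def upper_for(text, lang):
    """Uppercase text, pairing i with dotted capital I for listed languages."""
    if lang in DOTLESS_I_LANGS:
        text = text.replace("i", CAPITAL_DOTTED_I).replace(SMALL_DOTLESS_I, "I")
    return text.upper()


def lower_for(text, lang):
    """Lowercase text, pairing I with dotless small i for listed languages."""
    if lang in DOTLESS_I_LANGS:
        text = text.replace(CAPITAL_DOTTED_I, "i").replace("I", SMALL_DOTLESS_I)
    return text.lower()


print(upper_for("istanbul", "tr"))  # dotted capital I at the start
print(upper_for("istanbul", "en"))  # ISTANBUL
print(lower_for("ISPARTA", "tr"))   # dotless small i at the start
```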
 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )




Re: (iso639.184) Plane 14 redux (was: Same language, two locales)

2000-09-12 Thread James E. Agenbroad

  Tuesday, September 12, 2000
 Last Friday was International Literacy Day here at LC.  SIL was among
those distributing literature here.  From it I gather their goal is to
define and implement writing systems for many presently unwritten 
languages and dialects.  A worthy goal I'm sure.  My point is that
SIL's continued progress would mean that the two lists would tend
gradually to converge.  If that is true then care to coordinate the codes
in the ISO 639 and a 'spoken only' list would seem desirable.  

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )




Re: NUKTA

2000-08-24 Thread James E. Agenbroad

On Wed, 23 Aug 2000, Jaap Pranger wrote:

 At 18:05 +0200 2000.08.23, James E. Agenbroad wrote:
 
 
 In a list of Devanagari conjuncts if compiled a while ago there are at
 least two cases of conjuncts in which both consonants have a nukta:
 1. Ka + nukta + halant + ka + nukta = qqa
 2. Ka + nukta + halant + pha + nukta = qfa
 
 That should say "A list of conjuncts I compiled a while ago" sorry.

 I think:   
   1. Any consonant can have a nukta. But if a Unicode character includes a
 precomposed nukta, U+0929, 0931, 0934 and 0958 through 095F, and has a
 another nukta, U+093C, following it, I'd ignore the second nukta during
 rendering. 
 
 Mac typing behaviour (as far as I can see) is a bit different in that you
 can't type the precomposed nuqta characters with a single keystroke,
 and when you type a nuqta where you should not (e.g. as a second nuqta,
 or after a base character that shouldn't have a nuqta), the rendering 
 translates your (faulty) typing into a clearly visible spacing character. 
 (This spacing char is also a nuqta, but down below the baseline.) 
 
 I think ignoring an erroneously typed char during rendering is not 
 a good thing. Is rendering faulty data correctly not *as* bad as
 rendering correct data incorrectly?  

 You have a point.  What to do?  I guess two dots below in a
horizontal line would be better than two vertically.  Perhaps some
distinctions need to be made here:  1. Between rendering text
dynamically as keyed where the syllable boundary is not always known, and
rendering a more or less fixed text where syllable boundaries can be
determined.  ('more or less' here is meant to suggest a possible further
distinction between rendering for proofreading and correction and
rendering with no chance to alter the text.)
 
 Whether a vowel or vowel sign can have a nukta I do not know.  
 Don't think so.  
 
 In another posting Mr. Leca says ISCII 91 uses nukta with both vowel
and vowel signs to input certain uncommon cases.  So I guess it would be
safer to allow them, presumably before U+0901 to U+0903.  It would help to
know if these are just input conventions or are also how long vocalic rii
and both vocalic li and lii are stored too.
 
   2. A nukta should immediately follow a consonant--before a halant or
 vowel sign or 'various signs' = candrabindu, anusvara, visarga = U+901 to 
 U+903 only.
   3. These 'various signs' should follow a nukta, vowel sign (or
 halant?). I'm unsure if one of these 'various signs' after a halant
 would be valid; I doubt if 'various sign' followed by halant is.
 
 No 'various signs' after halant, and no 'various signs' followed by 
 halant, I would say. 

 Good.
 
 'Nuktated' consonants always (?) belong to Urdu words, and visarga
 "occurs almost exclusively in Sanskrit loanwords", thus the occurrence 
 of nuqta followed by visarga is highly unlikely, or non-existent. 
 (I don't consider U+095C, U+095D and U+095F as nuktated; should I?) 
 
 Well, Unicode 3.0 page 403 does say they are identical to the base
character followed by nukta.
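
[Editorial sketch, not part of the original message.]  The statement that
the precomposed letters are identical to base character plus nukta can be
checked through canonical decomposition (NFD) in Python's unicodedata
module.  The second helper illustrates only the "ignore the second nukta"
option raised earlier in this exchange, which the correspondent disputes;
it is not a recommendation.  Both function names are hypothetical.

```python
import unicodedata

NUKTA = "\u093C"

# The precomposed letters listed earlier: U+0929, 0931, 0934, 0958-095F.
PRECOMPOSED = ["\u0929", "\u0931", "\u0934"] + [chr(cp) for cp in range(0x0958, 0x0960)]


def split_nukta(ch):
    """Return the code points of ch after canonical decomposition."""
    return [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFD", ch)]


def collapse_repeated_nukta(text):
    """Decompose text, then reduce any run of nuktas to a single nukta."""
    decomposed = unicodedata.normalize("NFD", text)
    while NUKTA * 2 in decomposed:
        decomposed = decomposed.replace(NUKTA * 2, NUKTA)
    return decomposed


for ch in PRECOMPOSED:
    print(f"U+{ord(ch):04X}", "->", " ".join(split_nukta(ch)))

# U+0958 followed by an extra U+093C ends up as U+0915 U+093C.
print([f"U+{ord(c):04X}" for c in collapse_repeated_nukta("\u0958\u093C")])
```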

   6. [...] a vowel sign immediately after a vowel is unlikely.  
 
 Yess, there is a reason for U+0906, the 0905093E sequence for
 instance is invalid I guess. 

  Yes indeed.   Mr. Leca points out that ISCII uses two halants to
mean an 'explicit halant'--one not to be replaced by a more complicated
conjunct.  I guess I prefer the Unicode ZWJ.  
 
   7. Unicode 3.0 fig. 9-3 (4) to the contrary notwithstanding, halant
 immediately followed by a vowel sign or an independent vowel is highly
 questionable--just consonant + vowel sign would seem preferable.
 
 I would like to know in which word(s) this 'rare' sequence occurs,  
 in Sanskrit?   
 
 On page 554 (middle of second column) of Monier Williams
Sanskrit-English Dictionary there are three words beginning 'nirr.' where
'r.' is the vocalic ri. It displays as the independent ri vowel with a
reph above it; but they are filed after 'niru' and before 'nire' which
strongly suggests to me that the ra consonant + ri vowel sign are present
but display strangely.  The first means to go out or fall away from; the
second to go asunder or pass away; the last to let out, deliver.  The
first has several related words and citations. There may be others, I only
know of these; does anyone know if there is an automated version of
this work (like the OED) that one could search for all occurrences of ra
(consonant) + ri (vowel sign)?

 Also, the explanatory text: "When an independent vowel appears ... ... 
 ... ..., the independent vowel should not be depicted as a dependent vowel 
 sign, but as an independent vowel letterform", is a bit beyond me. 

 And me.  I tried to get the fourth example changed but I failed
because I couldn't point to ISCII practice for this--it says nothing about
this.  I take 'appears' to mean 'displays' or 'is desired to display as'
and 'depicted' to mean is 'encoded'.  My preference would

Re: Cost per character?

2000-07-31 Thread James E. Agenbroad

On Mon, 31 Jul 2000, Christopher J. Fynn wrote:

 Leaving aside implementation costs - has anyone ever come up with a good
 estimate of the cost per character for the development of the  Unicode / ISO
 10646 standards in terms of man hours of experts and their long-suffering
 secretaries, the office space they use, cost of attending and hosting UTC, WG2
 and other meetings, cost of producing and distributing documents for proposals
 etc, communications charges - and a whole host of other things?
 
 (And maybe there should also be an estimate of all additional the time  work
 many people have put in to these standards without getting paid for it.)
 
 - Chris
 
   Monday, July 31, 2000
Three quotations come to mind:

1. J.P. Morgan on the cost of yachting, "If you have to ask you can't
afford it."

2. Source unknown, "All pioneers get arrows in the back."

3. Thoreau or Emerson, I think, "What is the value of a child?"

  :-)  

 Regards,
  Jim Agenbroad ( [EMAIL PROTECTED] )