Re: Definition of character: Exegesis of SC2 nomenclature

2002-07-12 Thread Otto Stolz

Hello,

I wrote:

 And which character most resembles a Frenchman smoking his cigarette?
 
Marco Cimarosti wrote:
 I need to know, NOW! PLEASE!

U+A232

He has his beret.

Ciao,
   Otto





Re: Proposal: Ligatures w/ ZWJ in OpenType

2002-07-12 Thread James Kass


From Unicode 3.1 (On-line)
( http://www.unicode.org/unicode/reports/tr27/index.html )

QUOTE
U+200D ZERO WIDTH JOINER

The intended semantic is to produce a more connected rendering of adjacent characters 
than would otherwise be the case, if possible.
In particular:

1.  If the two characters could form a ligature, but do not normally,
 ZWJ requests that the ligature be used.
2.  Otherwise, if either of the characters could cursively connect, but
 do not normally, ZWJ requests that each of the characters take a
 cursive-connection form where possible.
  -  In a sequence like <X, ZWJ, Y>, where a cursive form exists for X,
 but not for Y, the presence of ZWJ requests a cursive form for X.
3.  Otherwise, where neither a ligature nor cursive connection are available,
 the ZWJ has no effect.
/QUOTE

Starting with Unicode 3.0.1, the definitions of ZWJ and ZWNJ were expanded
to allow for greater control over ligature formation.  A reason given for this
is:  "In some orthographies the same letters may either ligate or not, depending
on the intended reading."

 Thus, what John Hudson is wanting to do is to have f + ZWJ + i be
 required to make the fi ligature by using the rlig feature. Any font
 that does not have OpenType support, or some other smart font
 rendering, would ignore this and not render the ligature.

Right.  And any older font lacking a no-width no-contour glyph for ZWJ
would probably display a null box between the f and the i.

 Another example: a + ZWJ + combining acute + ZWJ + e would be
 required to produce an ae ligature with the combining acute over the
 a portion of the ligature. Is this reasonable?

AFAICT, ZWJ is not appropriate for combining glyphs like the combining
acute diacritic.   a + combining acute + ZWJ + e might be reasonably
expected to produce what you've described.

 Asmus is correct in needing to consider other languages. Saying that
 the ZWJ causes Arabic to ligate would not be correct. It already is
 defined to cause correct contextual shaping (isol, initial, medial, final)
 forms. In fact, LAM + ZWJ + ALEF breaks the required ligature
 formation because it sticks something in the middle of the context and
 proves what the Unicode book says, in some systems they may break
 up ligatures by interrupting the character sequence required to form
 the ligature. Should font vendors then have to not only code the normal
 ligature formation, but also have to code shaping rules to make the ZWJ
 work as well?


Yes, if font vendors want to provide this level of support.  According
to recent posts on the Unicode list, some font vendors are already doing
this because of Unicode's recommendations on the subject.  (Please see
Implementation Notes under Controlling Ligatures in TR27 linked
above.)

As far as 'interrupting the sequence on some systems', the Unicode
Standard may simply be referring to older, non-compliant systems
which don't ignore these formatting characters where appropriate
and/or have not yet implemented full support for Unicode 3.0.1 and up.

So, this is already a complicated nightmare for shaping engine
implementers.  Sometimes the character should be ignored, but
other times it needs to be a mandatory part of a look-up.  Font
developers seeking to follow the Unicode guidelines seem to be
doing so on a 'by gosh and by golly' basis.  John Hudson's proposal
offers sensible parameters along with intuitive justification.

Using 'rlig' for ZWJ based ligation is a clear choice.  If an author
takes the trouble to insert a ZWJ, a ligature is required if possible.
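
To make the intent concrete, here is a toy sketch in Python (an illustration
only, not a shaping engine; the function name is invented for the example).
It simply scans a character stream for X + ZWJ + Y sequences, i.e. the places
where an author has asked for a ligature that an 'rlig'-aware renderer would
then try to form:

ZWJ = "\u200D"

def requested_ligatures(text):
    # Yield (index, left, right) for each X + ZWJ + Y sequence,
    # where neither X nor Y is itself a ZWJ.
    i = 0
    while i + 2 < len(text):
        if text[i + 1] == ZWJ and text[i] != ZWJ and text[i + 2] != ZWJ:
            yield (i, text[i], text[i + 2])
            i += 3
        else:
            i += 1

sample = "f\u200Di ligature, please"
for pos, left, right in requested_ligatures(sample):
    print("ligate %r + %r at index %d (e.g. via 'rlig')" % (left, right, pos))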

Best regards,

James Kass.







RE: Unicode Devanagari Font in Mozilla

2002-07-12 Thread Alan Wood

Dipali Choudhary asked

 Every time, Mozilla is using the default Devanagari font for showing the
 characters. What should I do to change the default font?
 
Mozilla does not seem to allow you to choose a font for Devanagari.

Edit > Preferences... > Category > Appearance > Fonts

brings up a list of languages/scripts, but it does not include 
Devanagari.  Unicode is on the list, so you could try changing 
the font(s) for that.

Alan Wood
http://www.alanwood.net (Unicode, special characters, pesticide names)





Re: Proposal: Ligatures w/ ZWJ in OpenType

2002-07-12 Thread James Kass


James Kass carved in stone:

 There are a multitude of special cases such as paleontology, ...

Ooof.

Paleography.

A kind person called my attention to this gaffe off-list.  Thank
you, kind person.

Best regards,

James Kass.







What is TISI character Code?

2002-07-12 Thread Sreedhar.M



Hi,
 I would like to make my 
application compatible with the Thai language. In that connection I heard the term TISI 
character code. That's why I want to know about the TISI character code. Please 
let me know if anybody has an idea regarding this.
Thanks in Advance.
with Regards,
Sreedhar M.


Re: What is TISI character Code?

2002-07-12 Thread Samphan Raruenrom

Sreedhar.M wrote:
 I would like to make my application compatible with the Thai language. 
 In that connection I heard the term TISI character code. That's why I want to know 
 about the TISI character code. Please let me know if anybody has an idea 
 regarding this.

TISI is the name of the standards organization in Thailand, the Thai Industrial
Standards Institute. The character set name is TIS-620. It's an 8-bit character
set which extends 7-bit ASCII with Thai characters. See:

http://www.nectec.or.th/it-standards/
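
For what it's worth, a rough Python sketch of the byte-to-Unicode relationship
(an illustration only: it applies the fixed offset between the TIS-620 upper
half and the U+0E00 Thai block and does not check the unassigned gaps, so real
code should use a proper codec):

def tis620_to_str(data):
    # TIS-620 is plain ASCII below 0x80; the Thai letters in the upper
    # half line up with the Unicode Thai block at a fixed offset
    # (0xA1 -> U+0E01 KO KAI, 0xD2 -> U+0E32 SARA AA, and so on).
    out = []
    for b in data:
        if b < 0x80:
            out.append(chr(b))
        else:
            out.append(chr(0x0E00 + (b - 0xA0)))
    return "".join(out)

print(tis620_to_str(b"\xa1\xd2 ABC"))   # two Thai letters, then ASCII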

-- 
Samphan Raruenrom
Information Research and Development Division,
National Electronics and Computer Technology Center, Thailand.
http://www.nectec.or.th/home/index.html





Re: What is TISI character Code?

2002-07-12 Thread Arthit Suriyawongkul

Dear Sreedhar,

For the Thai Industrial Standard character set, it's TIS-620.

To make your apps support Thai, please consider conforming to
these standards:


TIS 620-2533 (1990) Standard for Thai Character Codes for Computers
UDC 681.3.04:003.62   ISBN 974-606-153-4

TIS 820-2538 (1995) Layout of Thai Character Keys on Computer Keyboards
UDC 681.3.02:003.62   ISBN 974-607-416-4

TIS 1566-2541 (1998) Thai Input/Output Methods for Computers
ICS 35.060ISBN 974-607-898-4


Thai Industrial Standards Institute, Ministry of Industry
http://www.tisi.go.th
[EMAIL PROTECTED]

regards,
Art

Sreedhar.M wrote:
 Hi,
 I would like to make my application compatible with the Thai language. 
 In that connection I heard the term TISI character code. That's why I want to know 
 about the TISI character code. Please let me know if anybody has an idea 
 regarding this.
 Thanks in Advance.
 with Regards,
 Sreedhar M.






Re: *Why* are precomposed characters required for backward compatibility?

2002-07-12 Thread Dan Oscarsson


From: David Hopwood [EMAIL PROTECTED]

For all of these characters, use as a spacing diacritic is actually much
less common than any of the other uses listed above. Even when they are used
to represent accents, it is usually as a fallback representation of a combining
accent, not as a true spacing accent.

So, there would have been no practical problem with disunifying spacing
circumflex, grave, and tilde from the above US-ASCII characters, so that the
preferred representation of all spacing diacritics would have been the
combining diacritic applied to U+0020.

Apart from the problems Kenneth Whistler mentioned.
You would get the same problems with the ISO 8859-1 spacing accents, but
fewer people use them than those in ASCII.
One problem is that some characters can be used both as an accent and as
a normal base character, and some characters for which Unicode defines
a decomposition are not regarded as composed characters in some countries.
So in some contexts it is wrong to decompose characters that would be
fine to decompose in others.
That is one reason I prefer NFC, as it does not decompose characters.
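
As a quick illustration (Python and its unicodedata module, nothing more):
NFD splits a precomposed letter into base plus combining mark, while NFC
composes rather than decomposes.

import unicodedata

s = "\u00E9"                              # LATIN SMALL LETTER E WITH ACUTE
nfd = unicodedata.normalize("NFD", s)     # 'e' followed by U+0301 COMBINING ACUTE ACCENT
nfc = unicodedata.normalize("NFC", nfd)   # recomposed back to the single character

print([hex(ord(c)) for c in nfd])         # ['0x65', '0x301']
print([hex(ord(c)) for c in nfc])         # ['0xe9']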



 For a lot of text handling precomposed characters are much easier to
 handle, especially when the combining character comes after instead of
 before the base character.

I thought you said approximately the opposite in relation to T.61 above :-)

Sorry, got the last part wrong in my haste. I meant it is easier when
the combining character comes before the base character.

   Dan




RE: Saying characters out loud (derives from hash, pound, octothorpe?)

2002-07-12 Thread Suzanne M. Topping


 -Original Message-
 From: David Possin [mailto:[EMAIL PROTECTED]]
 
 so now we have a chromatic audio attribute for each character?

Don't be ridiculous. Sounds don't have chroma. 

There will however be a need for tone and accent variation so that
proper localization can be executed. 

;^P




[Fwd: Re: Unicode Devanagari Font in Mozilla]

2002-07-12 Thread Prabhat



resend.

 Original Message 

Subject: Re: Unicode Devanagari Font in Mozilla
Date: Thu, 11 Jul 2002 20:58:03 -0700 (PDT)
From: Prabhat Hegde [EMAIL PROTECTED]
Reply-To: Prabhat Hegde [EMAIL PROTECTED]
To: [EMAIL PROTECTED], [EMAIL PROTECTED]
CC: [EMAIL PROTECTED]


hi dipali,

There are numerous changes needed to position/shape Devanagari and other
Indic text. Additional changes are needed to support caret and selection 
operations. And finally, it would depend on the nature of the font that you 
use (intelligent vs. dumb). To complicate matters further:

* Indic scripts do not have any standard (as in published/registered/
  recognized) font encoding that I know of. 
* Indian language web-sites misuse the charset tag "x-user-defined".

Please look at :
http://bugzilla.mozilla.org/show_bug.cgi?id=85204

And let me know if you need additional info.

prabhat.

Date: Thu, 11 Jul 2002 22:45:47 +0530 (IST)
From: Dipali Choudhary 
Subject: Unicode Devanagari Font in Mozilla
To: [EMAIL PROTECTED]


Hello,

	I am a newbie in this area. I am using Mozilla 0.7 on Linux 7.2. I can
see Devanagari text in it, but there is a problem of shifted matras.
What do I need to do to position it correctly?

Every time, Mozilla is using the default Devanagari font for showing the
characters. What should I do to change the default font?


Thanks in advance

regards,
dipali




  Dipali Choudhary
  M.Tech.CSE dept.
  IIT Bombay.








?? Unicode ??

2002-07-12 Thread Felipe Boita

How do I use Unicode characters?  
Which text editors support Unicode?  
To use Unicode, what is it necessary to do: download, upgrade...?

Thanks





RE: Inappropriate Proposals FAQ

2002-07-12 Thread Barry Caplan

At 01:27 PM 7/11/2002 -0400, Suzanne M. Topping wrote:
Unicode is a character set. Period. 


Well, maybe. But in a much broader sense than the character sets it subsumes in its 
listings. Each character has numerous properties in Unicode, whereas they generally 
don't in legacy character sets.

Maybe Unicode is more of a shared set of rules that apply to low-level data structures 
surrounding text and its algorithms than a character set.

The Unicode consortium very wisely keeps its focus narrow. It provides
a mechanism for specifying characters. Not for manipulating them, not
for describing them, not for making them twinkle.

All true, except for some special cases (BOM, bidi issues and algorithms, vertical 
variants, etc.). Not saying those shouldn't be in there, just that they are useful only 
in the use of algorithms that are explicit (bidi) or assumed (upper case/lower case, 
vertical/horizontal), etc.

In many cases, these algorithms are not well known, even amongst the cognoscenti, or 
generally available in nice libraries. Anyone for an open-source Japanese word-splitting 
library? (I know not taking a look at ICU before I press send is going to 
come back to haunt me on this, but if it is in there, then substitute something that 
isn't :)

Barry Caplan
www.i18n.com





RE: Saying characters out loud (derives from hash, pound, octothorpe?)

2002-07-12 Thread Barry Caplan

At 09:43 AM 7/12/2002 -0400, Suzanne M. Topping wrote:

 -Original Message-
 From: David Possin [mailto:[EMAIL PROTECTED]]
 
 so now we have a chromatic audio attribute for each character?

Don't be ridiculous. Sounds don't have chroma. 

There will however be a need for tone and accent variation so that
proper localization can be executed. 

;^P

I have been dreaming of the idea of synaesthetic applications for years but haven't 
come up with a way to do it yet. But sounds absolutely will need chroma, that much I 
know. And when you say it with feeling, the fonts will literally be perceived as 
feeling

Such an application better not be written for Windows, because the blue screen of 
death will be felt rather than seen :)

Barry Caplan
www.i18n.com





RE: Saying characters out loud (derives from hash, pound, octothorpe?)

2002-07-12 Thread David Possin

OK, while we are at it: smelly fonts, anyone?

(actually I can imagine how some fonts smell)

Dave
--- Barry Caplan [EMAIL PROTECTED] wrote:
 At 09:43 AM 7/12/2002 -0400, Suzanne M. Topping wrote:
 
  -Original Message-
  From: David Possin [mailto:[EMAIL PROTECTED]]
  
  so now we have a chromatic audio attribute for each character?
 
 Don't be ridiculous. Sounds don't have chroma. 
 
 There will however be a need for tone and accent variation so that
 proper localization can be executed. 
 
 ;^P
 
 I have been dreaming of the idea of synaesthetic applications for
 years but haven't come up with a way to do it yet. But sounds
 absolutely will need chroma, that much I know. And when you say it
 with feeling, the fonts will literally be perceived as feeling
 
 Such an application better not be written for Windows, because the
 blue screen of death will be felt rather than seen :)
 
 Barry Caplan
 www.i18n.com
 
 


=
Dave Possin
Globalization Consultant
www.Welocalize.com
http://groups.yahoo.com/group/locales/

__
Do You Yahoo!?
Sign up for SBC Yahoo! Dial - First Month Free
http://sbc.yahoo.com




RE: Inappropriate Proposals FAQ

2002-07-12 Thread Suzanne M. Topping

 -Original Message-
 From: Barry Caplan [mailto:[EMAIL PROTECTED]]
 
 At 01:27 PM 7/11/2002 -0400, Suzanne M. Topping wrote:
 Unicode is a character set. Period. 
 
 Each character has numerous 
 properties in Unicode, whereas they generally don't in legacy 
 character sets.

Each character, or some characters?

 Maybe Unicode is more of a shared set of rules that apply to 
 low level data structures surrounding text and its algorithms 
 then a character set.

Sounds like the start of a philosophical debate. 

If Unicode is described as a set of rules, we'll be in a world of hurt.

 The Unicode consortium very wisely keeps its focus narrow. 
 It provides
 a mechanism for specifying characters. Not for manipulating them, not
 for describing them, not for making them twinkle.
 
 All true, except for some special cases (BOM, bidi issues and 
 algorithms, vertical variants, etc.). Not saying those 
 shouldn't be in there, just that they are useful only in the 
 use of algorithms that are explicit (bidi) or assumed (upper 
 case/lower case, vertical/horizontal), etc.

<humour>
Why mess up a nice clean statement simply because of a few hard facts? 
</humour>

I choose to look at this stuff as the exceptions that make the rule.

(On a serious note, these exceptions are exactly what make writing some
sort of "is and isn't" FAQ pretty darned hard. I can't very well say
that Unicode manipulates characters given certain historical/legacy
conditions and under duress. If I did, people would be scurrying around
trying to figure out how to foment the duress.)




re smelly fonts (was: Saying characters out loud (derives from hash, pound, octothorpe?))

2002-07-12 Thread Tex Texin

I saw the movie Polyester in Odorama. A John Waters film with Divine.
They handed out scratch and sniff cards and the movie flashed numbers at
the right time for you to scratch the numbered dot on the card which
released the odor. It was a great technique and effect. They played a
few tricks on the audience as well, substituting unexpected smells for
the ones you were anticipating. It was great.

Anyone for scratch and sniff fonts?

It could be a problem though to scratch the character of the smoking
Frenchman, where smoking is prohibited...
;-)

Hey too bad we don't have a Gun character. Scratch it, and the gun goes
off and becomes the smoking gun everyone is looking for!
It could be the first animated, noisy, smelly font!
(Probably shoots the dot off the dotted i, making it Turkish.)

;-) (OK, it's Friday!)

David Possin wrote:
 
 OK, while we are at it: smelly fonts, anyone?
 
 (actually I can imagine how some fonts smell)
 
 Dave
 --- Barry Caplan [EMAIL PROTECTED] wrote:
  At 09:43 AM 7/12/2002 -0400, Suzanne M. Topping wrote:
 
   -Original Message-
   From: David Possin [mailto:[EMAIL PROTECTED]]
  
   so now we have a chromatic audio attribute for each character?
  
  Don't be ridiculous. Sounds don't have chroma.
  
  There will however be a need for tone and accent variation so that
  proper localization can be executed.
  
  ;^P
 
  I have been dreaming of the idea of synaesthetic applications for
  years but haven't come up with a way to do it yet. But sounds
  absolutely will need chroma, that much I know. And when you say it
  with feeling, the fonts will literally be perceived as feeling
 
  Such an application better not be written for Windows, because the
  blue screen of death will be felt rather than seen :)
 
  Barry Caplan
  www.i18n.com
 
 
 
 =
 Dave Possin
 Globalization Consultant
 www.Welocalize.com
 http://groups.yahoo.com/group/locales/
 
 __
 Do You Yahoo!?
 Sign up for SBC Yahoo! Dial - First Month Free
 http://sbc.yahoo.com

-- 
-
Tex Texin   cell: +1 781 789 1898   mailto:[EMAIL PROTECTED]
Xen Master  http://www.i18nGuy.com
 
XenCrafthttp://www.XenCraft.com
Making e-Business Work Around the World
-




Status update re. Inappropriate Proposals FAQ

2002-07-12 Thread Suzanne M. Topping

I'm nearly done playing catchup after vacation and hope to begin
extracting concepts for the FAQ next week. 

Thanks to all who've submitted input, as conflicting and varied as it
is/was.

 -Original Message-
 From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
 Sent: Wednesday, July 03, 2002 8:53 
 
 
 I would like to once again suggest that we refocus this 'FAQ' 
 
 AWAY from a repetition of the Principles and Procedures 
 document maintained
 by WG2 and containing the explanation of what constitutes a 
 valid *formal*
 proposal.
 




Hmm, this evolved into an editorial when I wasn't looking :) was: RE: Inappropriate Proposals FAQ

2002-07-12 Thread Barry Caplan

At 05:13 PM 7/12/2002 -0400, Suzanne M. Topping wrote:
 -Original Message-
 From: Barry Caplan [mailto:[EMAIL PROTECTED]]
 
 At 01:27 PM 7/11/2002 -0400, Suzanne M. Topping wrote:
 Unicode is a character set. Period. 
 
 Each character has numerous 
 properties in Unicode, whereas they generally don't in legacy 
 character sets.

Each character, or some characters?


For all intents and purposes, each character. Chapter 4.5 of my Unicode 3.0 book says 
"The Unicode Character Database on the CDROM defines a General Category for all 
Unicode characters."

So, each character has at least one attribute. One could easily say that each 
character also has an attribute for isUpperCase of either true or false, and so on.

There are usually no corresponding features in other character sets.
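
A small illustration of that, using the unicodedata module that ships with
Python (which is just a view onto the Unicode Character Database, for whatever
Unicode version the interpreter was built with): every character reports a
General Category, and yes/no attributes like "is uppercase" follow from the
properties.

import unicodedata

# Latin capital, Latin small, a digit, and Thai KO KAI
for ch in ("A", "a", "1", "\u0e01"):
    print(ch, unicodedata.category(ch), ch.isupper())
# prints: Lu True / Ll False / Nd False / Lo False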


 Maybe Unicode is more of a shared set of rules that apply to 
 low level data structures surrounding text and its algorithms 
 then a character set.

Sounds like the start of a philosophical debate. 

Not really. I have been giving presentations for years, and I have seen many others 
give similar presentations. A common definition of character set is a list of 
character you are interested in assigned to codepoints. That fits most legacy 
character sets pretty well, but Unicode is sooo much more than that.



If Unicode is described as a set of rules, we'll be in a world of hurt.


Yeah, one of the heaviest books I own is Unicode 3.0. I keep it on a low shelf so the 
book of rules describing Unicode doesn't fall on me for just that reason. This is 
earthquake country, after all  :)


I choose to look at this stuff as the exceptions that make the rule.


I don't really know if it is possible to break down Unicode into more fundamental 
units if you started over. Its complexity is inherent in the nature of the task. My 
own interest is more in getting things done with data and algorithms that use the type 
of material represented by the Unicode standard, more so than the arcana of the 
standard itself. So it doesn't bother me so much that there are exceptions - as long 
as we have the exceptions that everyone agrees on, that is fine by me because it means 
my data and at least some of my algorithms are likely to be preservable across systems.


(On a serious note, these exceptions are exactly what make writing some
sort of is and isn't FAQ pretty darned hard. 

<humor>
Be careful what you ask for :)
</humor>

I can't very well say
that Unicode manipulates characters given certain historical/legacy
conditions and under duress. 

Why not? It is true.

But what if we took a look at it from a different point of view: that the standard is 
an agreed-upon set of rules and building blocks for text-oriented algorithms? Would 
people start to publish algorithms that extend the base data provided, so we don't 
have to reinvent wheels all the time?

I'm just brainstorming here, this is all just coming to me now. 

If I were to stand in front of a college comp sci class, where the future is all ahead 
of the students, what proportion of time would I want to invest in how much they knew 
about legacy encodings versus how much I could inspire them to build from and extend 
what Unicode provides them?

Seriously, most of the folks on this list that I know personally, and I include myself 
in this category, are approaching or past the halfway point in our careers. What would 
we want the folks who are just starting their careers now to know about Unicode and do 
with it by the time they reach the end of theirs, long after we have stopped working?

For many applications, people are not going to specialize in i18n/l10n issues. They 
need to know what the appropriate building text based blocks are, and how they can 
expand on them while still building whatever they are working on.

Unicode at least hints at this with the bidi algorithm. Moving forward, should other 
algorithms be codified into Unicode, or as separate standards or de facto standards? I 
am thinking of a Japanese word-splitting algorithm. There are proprietary products 
that do this today with reasonable but not perfect results. Are they good enough that 
the rules can be encoded into a standard? If so, then someone would build an open 
implementation, and then there would always be this building block available for 
people to use.

I am sure everyone on this list can think of their own favorite algorithms of this 
type, based on the part of Unicode that interests you the most. My point is that the 
raw information already in Unicode *does* suggest the next level of usage, and the 
repeated newbie questions that inspired this thread suggest the need for a 
comprehensive solution at a higher level than a character set provides. Maybe part of 
this means including or at least facilitating the description of low-level text 
handling algorithms.

If I did, people would be scurrying around
trying to figure out how to foment the duress.)


The accomplishments of the Unicode 

What Unicode Is (was RE: Inappropriate Proposals FAQ)

2002-07-12 Thread Kenneth Whistler

Suzanne responded:

  Maybe Unicode is more of a shared set of rules that apply to 
  low level data structures surrounding text and its algorithms 
  then a character set.
 
 Sounds like the start of a philosophical debate. 
 
 If Unicode is described as a set of rules, we'll be in a world of hurt.

 (On a serious note, these exceptions are exactly what make writing some
 sort of "is and isn't" FAQ pretty darned hard. 

Hmm. Since the discussion which started out trying to specify a
few examples of what kinds of entities would be inappropriate to
proffer for encoding as Unicode characters seems to be in danger
of mutating into the recurrent "What is Unicode?" question,
perhaps it's time to start a new thread for the latter.

And now for some ontological ground rules.

When trying to decide what a thing is, it helps not to use
an attribute nominatively, since that encourages people to
privately visualize the noun the attribute is applied to,
but to do so in different ways -- and then to argue past each
other because they are, in the end, talking about different
things.

Unicode is used attributively of a number of things, and
if we are going to start arguing/discussing what "it" is, it
would be better to lay out the alternative "it"s a little
more specifically first.

1. The Unicode *Consortium* is a standardization organization.
It started out with a charter to produce a single standard,
but along the way has expanded that charter, in response to
the desire of its membership. In addition to The Unicode
Standard, it now has adopted a terminology that refers to
some of its other publications as Unicode Technical Standards
[UTS], of which two formally exist now: UTS #6 SCSU, and
UTS #10 Unicode Collation Algorithm [UCA].

It is important to keep this straight, because some people,
when they say "Unicode" are talking about the *organization*,
rather than the Unicode Standard per se. And when people talk
about the standard, they are generally referring to The
Unicode Standard, but the Unicode Consortium is actually
responsible for several standards.

2. The Unicode *Standard* itself is a very complex standard, consisting
of many pieces now. To keep track of just what something like
The Unicode Standard, Version 3.2 means, we now have to
keep web pages enumerating all the parts exactly -- like
components in an assemble-your-own-furniture kit. See:
http://www.unicode.org/unicode/standard/versions/

In any one particular version, the Unicode Standard now consists
of a book publication, some number of web publications
(referred to as Unicode Standard Annexes [UAX]), and a
large number of contributory data files -- some normative and
some informative, some data and some documentation. These
definitions, including the exact list of contributory
data files and their versions, are themselves under tight
control by the Unicode Technical Committee, as they constitute
the very *definition* of the Unicode Standard. It is not
by accident that the version definitions start off now with
the following wording:

The Unicode Standard, Version 3.2.0 is defined by the following
list...

and so on for earlier versions.

3. The Unicode *Book* is a periodic publication, constituting the
central document for any given version of the Unicode *Standard*,
but is by no means the entire standard. The book, in turn,
is very complex, consisting of many chapters and parts, some
of which constitute tightly controlled, normative specification,
and some of which is informative, editorial content.

The book now also exists in an online version (pdf files):
http://www.unicode.org/unicode/uni2book/u2.html
which is *almost* identical to the published hardcover book,
but not quite. (The Introduction is slightly restructured,
the online glossary is restructured and has been added to,
the charts are constructed slightly differently and have
introductory pages of their own, etc.)

4. The Unicode *CCS* [coded character set] is the mapping of the
set of abstract characters contained in the Unicode repertoire
(at any given version) to a bunch of code points in the
Unicode codespace (0x0000..0x10FFFF). Technically speaking, it
is the Unicode *CCS* which is synchronized closely with
ISO/IEC 10646, rather than the Unicode *Standard*. 10646 and
the Unicode CCS have exactly the same coded characters (at
various key synchronization points in their joint publication
histories), but the *text* of the ISO/IEC 10646 standard doesn't
look anything like the *text* of the Unicode Standard, and the
Unicode Standard [sensum #2 above] contains all kinds of
material, both textual and data, that goes far beyond the scope
of 10646. 

There are other standards produced by some national
bodies that are effectively just translations of 10646
(GB 13000 in China, JIS X 0221 in Japan), but the Unicode Standard
is nothing like those.

Finally, the attribute Unicode ... can be applied to all
kinds of other things characteristic of the Unicode Standard,
including algorithms for the 

Re: What Unicode Is (was RE: Inappropriate Proposals FAQ)

2002-07-12 Thread Barry Caplan

At 03:54 PM 7/12/2002 -0700, Kenneth Whistler wrote:
Suzanne responded:

  Maybe Unicode is more of a shared set of rules that apply to 
  low level data structures surrounding text and its algorithms 
  then a character set.

O.k., so now before asserting or denying that Unicode ... is
a shared set of rules, it would be helpful to pin down
first what you are referring to. That might make the ensuing
debate more fruitful.

Actually, it was me, not Suzanne, that called Unicode "a shared set of rules". As 
Ferris Bueller once said, "I'll take the heat for this." 

I was aware of all of the uses of Unicode that you listed. I have no quarrels with any 
of them. They do point to the fact that the word is overloaded with definitions, which 
means that readers have to choose the appropriate one from the context. The context of 
the statement above is that the Unicode referred to is the Standard, and all 
associated documentation. Not Unicode the Consortium, which manages the Standard. Not 
Unicode the way of life :)

I did intend to throw open a debate about the long-term future of Unicode the Standard 
and, by extension, Unicode the Consortium. Since Suzanne is writing a "What Unicode is 
and is not" FAQ, I think the answer to that is going to be very definitely colored 
by the answer to the related question "What will Unicode become?", e.g. Unicode 6.0, 
7.0, 8.0, etc. 

See my previous msg, subject line "Hmm, this evolved into an editorial when I wasn't 
looking :)", for some thoughts on that subject.


Barry Caplan
www.i18n.com





Re: [OpenType] Proposal: Ligatures w/ ZWJ in OpenType

2002-07-12 Thread Eric Muller



The mechanism proposed by John to handle ZWJ/ZWNJ makes the implicit assumption
that those characters are transformed into glyphs (via the usual 'cmap' mechanism)
and that this is the avenue to transfer the intent of those characters to
the shaping code in the font (i.e. some kind of ligature lookup). I'd like
to revisit that assumption.

The ZWJ/ZWNJ characters are formatting characters. Their function is definitely
different from the function of the "regular" characters (such as "A"): they
are a way to control the rendering of regular characters around them, and
to express that control in plain text. The debate so far shows that there
is no strong objection to that mechanism by itself.

In an environment richer than plain text, there is obviously the possibility
that this control could be expressed by other means than characters. In the
OpenType world, and in particular in the interface between the layout engine
and the shaping code in fonts, we have more than plain text, or rather plain
glyphs; we also have a description of which features should be applied to
which glyphs. So instead of having glyphs that stand for ZWJ/ZWNJ, can we
use these features?

In fact, we already do that every day. For example, an InDesign user can
insert the two characters x and y, and apply a ligature feature (let's say
'dlig') to them. It seems to me that this is just what ZWJ is about. So InDesign
could do the following given the character sequence x ZWJ y: map it to the glyph
sequence cmap(x) cmap(y), with 'dlig' applied on those two glyphs. This 'dlig'
application takes precedence over one via UI, i.e. it happens regardless of
whether the user requested 'dlig' explicitly. The ZWJ character is simply
not mapped to the glyph stream, since the feature application does the job
of ZWJ.

We can handle ZWNJ in the same way: the sequence x ZWNJ y is transformed
to the glyph sequence cmap(x) cmap(y), with 'dlig' not applied on those two
glyphs. This 'dlig' non-application takes precedence over one via UI, i.e.
'dlig' is not applied to these two glyphs regardless of whether the user
requested 'dlig' explicitly.

[Maybe a better way of thinking about the precedence stuff is to think entirely
in markup terms: 
<ligatures-on> ... x ZWNJ y ... </ligatures-on> is transformed
in the glyph stream to <dlig> ... cmap(x) </dlig> <dlig> cmap(y)
... </dlig>, i.e. dlig is off on the pair x y; hold your objection that
a feature is applied to a position rather than a range for a minute.]
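
A toy sketch of that mapping (Python, invented names, not actual OpenType or
layout-engine code): walk the character sequence, keep ZWJ/ZWNJ out of the
glyph stream, and instead record a per-pair request to force 'dlig' on or off.
cmap() is faked as the identity here.

ZWJ, ZWNJ = "\u200D", "\u200C"

def to_glyph_run(chars):
    # Returns (glyphs, overrides); overrides is a list of
    # (index_of_left_glyph, "on" | "off") feature requests for the
    # glyph pair starting at that index.
    glyphs, overrides = [], []
    for ch in chars:
        if ch in (ZWJ, ZWNJ):
            if glyphs:
                overrides.append((len(glyphs) - 1, "on" if ch == ZWJ else "off"))
        else:
            glyphs.append(ch)             # stand-in for cmap(ch)
    return glyphs, overrides

print(to_glyph_run("x\u200Dy"))   # (['x', 'y'], [(0, 'on')])  -- force 'dlig'
print(to_glyph_run("x\u200Cy"))   # (['x', 'y'], [(0, 'off')]) -- suppress 'dlig'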

With this approach, we gain two things. First, not having a "formatting"
glyph for ZWJ is IMHO a huge conceptual win, even bigger than not having
a "formatting" character ZWJ would be. Second, what John's proposal did not
mention (or maybe I missed it) is that it's not just the ligature features
that have to deal with this glyph, it is all the features; compound
that by all the formatting characters, and you will start to understand Paul's
reaction.

It's interesting to note that this approach can be applied to other formatting
characters as well. Either their intent can be achieved by the layout engine
alone, without help of the font, in which case there is no need to show anything
to the code in the font; no glyph and no feature are consequence of those
characters. Or their intent needs help of the font, and the OpenType way
to ask for this help is to apply (or not) features.

All that takes care of selecting a ligature, but it does not quite take care
of selecting cursive forms. I can see how we could define 'dlig' to do that
(or define a 'zwj' feature that invokes the ligature lookups plus some single
substitution lookup), but I am not sure I am happy with that. In fact, I
am not sure I am happy with that clause in Unicode. 


Eric.

[About the features applied to ranges rather than positions: think about
it and it should be obvious 8-) It does not make sense to apply a ligature
at a position; what makes sense is to apply a ligature on a range. Think about
1-n substitutions; whatever lookups apply to the source glyph should
also apply to all the replacement glyphs - ranges again. I even believe that
this approach is compatible with the current OpenType spec. More details
on demand.] 





Re: Hmm, this evolved into an editorial when I wasn't looking :) was: RE: Inappropriate Proposals FAQ

2002-07-12 Thread Kenneth Whistler

Barry Caplan wrote:

  At 01:27 PM 7/11/2002 -0400, Suzanne M. Topping wrote:
  Unicode is a character set. Period. 
  
  Each character has numerous 
  properties in Unicode, whereas they generally don't in legacy 
  character sets.
 
 Each character, or some characters?
 
 
 For all intents and purposes, each character. 
 So, each character has at least one attribute. 

Yes. The implications of the Unicode Character Database include
the determination that the UTC has normatively assigned properties
(multiple) to all Unicode encoded characters.

Actually, it is a little more subtle than that. There are some
properties which accrue to code points. The General Category and
the Bidirectional Category are good examples, since they constitute
enumerated partitions of the entire codespace, and APIs need to 
return meaningful values for any code point, including unassigned ones.

Other properties accrue more directly to characters, per se.
They attach to the abstract character, and get associated with
a code point more indirectly by virtue of the encoding of that
character. The numeric value of a character would be a good example
of this. No one expects an unassigned code point or an assigned
dingbat character or a left bracket to have a numeric value property
(except perhaps a future generation of Unicabbalists).
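
A quick Python check of that distinction (unicodedata reflects whichever
Unicode version the interpreter was built against): General Category answers
for any code point, even an unassigned one, while a numeric value only exists
for the characters that actually carry one.

import unicodedata

print(unicodedata.category("\u0378"))   # 'Cn' -- an unassigned code point still has a category
print(unicodedata.numeric("\u00bd"))    # 0.5  -- VULGAR FRACTION ONE HALF has a numeric value
print(unicodedata.numeric("[", None))   # None -- a left bracket has no numeric value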
 
 There are no corresponding features in other character sets usually.

Correct. Before the development of the Unicode Standard, character
encoding committees tended to leave property assignments
either up to implementations (considering them obvious) or up
to standardization committees whose charter was character
processing -- e.g. SC22/WG15 POSIX in the ISO context.

The development of a Universal character encoding necessitated
changing that, bringing character property development and
standardization under the same roof as character encoding.

Note that not everyone agrees about that, however. We are
still having some rather vigorous disagreements in SC22 about
who owns the problem of standardization of character properties.

 A common definition of "character set" is a list of characters 
 you are interested in, assigned to codepoints. That fits most 
 legacy character sets pretty well, but Unicode is sooo much 
 more than that.

Roughly the distinction I was drawing between the Unicode CCS
and the Unicode Standard.

 But what if we took a look at it from a different point of view, 
 that the standard is an agreed-upon set of rules and building 
 blocks for text-oriented algorithms? Would people start to 
 publish algorithms that extend on the base data provided so 
 we don't have to reinvent wheels all the time?

Well, the Unicode Standard isn't that, although it contains
both formal and informal algorithms for accomplishing various
tasks with text, and even more general guidelines for how to
do things.

The members of the Unicode Technical Committee are always
casting about for areas of Unicode implementation behavior
where commonly defined, public algorithms would be mutually
beneficial for everyone's implementations and would assist
general interoperability with Unicode data.

To date, it seems to me that the members, as well as other
participants in the larger effort of implementing the Unicode
Standard, have been rather generous in contributing time
and brainpower to this development of public algorithms. The
fact that ICU is an Open Source development effort is enormously
helpful in this regard.

 If I were to stand in front of a college comp sci class, 
 where the future is all ahead of the students, what proportion 
 of time would I want to invest in how much they knew about legacy 
 encodings versus how much I could inspire them to build from and 
 extend what Unicode provides them?

This problem, of Unicode in the computer science curriculum,
intrigues me -- and I don't think it has received enough attention
on this list.

One of my concerns is that even now it seems to be that CS
curricula not only don't teach enough about Unicode -- they basically
don't teach much about characters, or text handling, or anything
in the field of internationalization. It just isn't an area that
people get Ph.D.'s in or do research in, and it tends to get
overlooked in people's education until they go out, get a job
in industry and discover that in the *real* world of software
development, they have to learn about that stuff to make software
work in real products. (Just like they have to do a lot of
seat-of-the-pants learning about a lot of other topics: building,
maintaining, and bug-fixing for large, legacy systems; software
life cycle; large team cooperative development process;
backwards compatibility -- almost nothing is really built from
scratch!)

 
 The major work ahead is no longer in the context of building 
 a character standard. Time is fast approaching to decide to keep 
 it small and apply a bit of polish, or focus on the use and usage 
 of what is already there in Unicode by those who