Re: How to type sporadic Unicode (was: User interface for keyboard input)

2002-07-22 Thread Doug Ewell

Martin Kochanski unicode at cardbox dot net wrote:

 Microsoft's Alt+X method: unfortunately, there is no such thing. I
 have seen at least two different Alt+X methods in Microsoft software:

I should have said "one of Microsoft's Alt+X methods."

 Methods specified by ISO 14755: unfortunately, there are no such
 things. As you and others have said, the ISO 14755 specification
 merely specifies properties that conforming methods should have, it
 does not specify the methods themselves [you can imagine a predecessor
 standard specifying that characters should be typed by hitting keys
 but not specifying the keyboard layout itself].

I should have said "methods conforming to ISO 14755."  My language may
not have been precise, but my point should have been clear: until the
real world settles on a standard for entry of arbitrary Unicode
characters, as universal as Ctrl+C for copy and Ctrl+V for paste, there
is no need for an application to support only one method.

 Is there any sign of an emerging consensus as to what the beginning
 and ending sequences might be? Addison Phillips mentioned \uX, but
 that was in a programming context and he says himself it wouldn't be
 suitable for running text. It would be nice if an innocent user faced
 with a new software package did not have to look up manuals or
 experiment to see what the beginning sequence was.

I wouldn't mind seeing the ISO 14755 suggestions, "press and hold
Ctrl+Shift" and "release Ctrl+Shift," take hold.  But Ctrl+Shift
sequences could already be assigned by users, as you note.  I use
Ctrl+Shift+C myself to launch Character Map.
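
Whatever the beginning and ending sequences turn out to be, the step in
between is the same: the user types hexadecimal digits and the method
commits the corresponding character. A minimal sketch in Python of that
commit step, assuming the digits have already been collected into a
string. The function name commit_hex_entry and the six-digit limit are
illustrative assumptions, not taken from ISO 14755 or from any Microsoft
product:

```python
import string

MAX_CODE_POINT = 0x10FFFF


def commit_hex_entry(digits):
    """Return the character named by a run of hex digits, or None if invalid.

    Hypothetical helper: a real input method would call something like this
    when the ending sequence (for example, releasing Ctrl+Shift) is seen.
    """
    # Accept only plain hex digits; int() alone would also allow "0x", "_", "+".
    if not digits or len(digits) > 6:
        return None
    if any(c not in string.hexdigits for c in digits):
        return None
    code_point = int(digits, 16)
    # Reject values outside the codespace and surrogate code points.
    if code_point > MAX_CODE_POINT or 0xD800 <= code_point <= 0xDFFF:
        return None
    return chr(code_point)
```

For example, commit_hex_entry("e5") gives U+00E5, while an empty string,
non-hex input, or a surrogate value such as "D800" gives None, so the
application can leave the text unchanged.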

-Doug Ewell
 Fullerton, California





Re: Unicode mention (Urdu)

2002-07-22 Thread Martin Kochanski

At 02:07 22/07/02 +0100, Alistair Vining wrote:
The cross-platform message somewhat dulled by the font [Urdu Naskh Asiatype] download
being a Windows .exe file with (judging by a message that popped up) a copy of the
uniscribe .dll...

If this is true, then are the BBC pirating Microsoft's software? And if that is the 
case, could Microsoft please either sue or not sue?

I'm not saying this out of perversity. The non-redistributability of Uniscribe is an 
enormous inconvenience to software developers like us, because either we have to do 
without Uniscribe or we have to force our users to install a large and irrelevant 
software package (such as a web browser) in order to make sure that they have it on 
their systems. So if the BBC has found a way to redistribute Uniscribe legally, we 
want to hear about it; or if Microsoft have decided to take no notice if people do 
distribute Uniscribe, we want to hear about that too!





TIM - A Table-base Input Method Module

2002-07-22 Thread Arthit Suriyawongkul

Is anybody here interested in this Table-based Input Method?

  http://sourceforge.net/projects/wenju/


I got this link from gtk-i18n-list.

:)

regards,
Art


-------- Original Message --------
Subject: Re: TIM - A Table-base Input Method Module
Date: Sun, 21 Jul 2002 09:03:06 -0400
From: Daniel Yacob [EMAIL PROTECTED]
To: [EMAIL PROTECTED], [EMAIL PROTECTED]

many months later...


  Now I just finished such an IM module, which you can find at

http://sourceforge.net/projects/wenju/

  I call it TIM (Table-based Input Method).  I haven't released a package
  yet, but it is in the CVS.


I do like this idea. If I were to give a wish list of features I'd like
to see in an IM description file, I'd no doubt end up describing what
Keyman uses, perhaps because it is what I'm most familiar with, but it
also has some nice expressive syntax.

It has occurred to me before that it would be nice to be able to import
Keyman .kmn files directly.  Has an XML definition for IMs ever been
developed?  It would *really* be nice to have some kind of universal,
vendor-independent IM definition, as Unicode is for charsets.

Could TIM be taken in this direction?  Towards a XIM?  I'd be happy to
participate in defining an XML schema for it.  Anyone interested?

cheers,

/Daniel





Re: TIM - A Table-base Input Method Module

2002-07-22 Thread Hideki Hiura
 From: Arthit Suriyawongkul [EMAIL PROTECTED]
 anybody here interesting in this Table-based Input Method ?
   http://sourceforge.net/projects/wenju/
 i've got this site from gtk-i18n-list.

I have not looked at this one yet, but you may also want to take a look at
IIIMF (http://www.li18nux.org/subgroup/im/IIIMF), which has something similar
called ude (user defined engine) as a table-based IM.
Also, an XML-based IM, EIMIL (Extensible IM Interface Language), was recently
introduced to IIIMF; with it you can combine the table-based IM, the
portable XML-based logic, and a backend dictionary lookup server.
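
To give a rough idea of what a table-based IM does with such a table, here
is a minimal sketch in Python. It is not ude or TIM code: the table holds
only a handful of romaji-to-hiragana pairs, and the longest-match strategy
and the name convert are assumptions made for illustration.

```python
# Hypothetical mini table: a few romaji -> hiragana pairs.
ROMAKANA = {
    "a": "あ", "i": "い", "u": "う", "e": "え", "o": "お",
    "ka": "か", "ki": "き", "ku": "く", "ke": "け", "ko": "こ",
    "kya": "きゃ", "kyu": "きゅ", "kyo": "きょ",
}


def convert(text, table=ROMAKANA):
    """Convert text by greedy longest-match lookup in the table.

    Characters that start no table entry are passed through unchanged.
    """
    longest = max(len(key) for key in table)
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so "kya" wins over "k" + "ya".
        for n in range(min(longest, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in table:
                out.append(table[chunk])
                i += n
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)
```

A real engine also has to handle keystrokes arriving one at a time, so it
must keep pending input such as a lone "k" around until the next key
decides the match; the sketch above only converts a complete string.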

You can retrieve the source as follows:

cvs -d :pserver:[EMAIL PROTECTED]:/cvsroot co -r exp-EIMIL-1 im-sdk

The following is the sample XML based IM definition, which you can
find in im-sdk/server/programs/language_engines/canna.

This sample shows how you can combine the table, the logic, and a backend
dictionary lookup server (in this case, the Japanese Canna dictionary
lookup server).

---
<?xml version="1.0"?>
<!DOCTYPE ccdef PUBLIC "-//Li18nux//DTD CCDEF 1.0//EN"
"ccdef.dtd">

<ccdef name="default" class="org.li18nux.CannaLE" revision="0.1">
  <interface>
    <langinfo xml:lang="ja"></langinfo>
    <decldata name="edittext" type="mtext"/>
    <declop name="convert">
      <dependency depend="edittext" affect="edittext"/>
    </declop>
    <declop name="fixate">
      <dependency depend="edittext" affect="edittext"/>
    </declop>
  </interface>
  <engine name="ja-romakana" class="com.sun.iiim.pce1.s1">
    <PCE>
      <deftable name="romakana" from="mtext" to="mtext">
        "a"    "あ"
        "i"    "い"
        "u"    "う"
        "e"    "え"
        "o"    "お"

        "xa"   "ぁ"
        "xi"   "ぃ"
        "xu"   "ぅ"
        "xe"   "ぇ"
        "xo"   "ぉ"

        "ka"   "か"
        "ki"   "き"
        "ku"   "く"
        "ke"   "け"
        "ko"   "こ"

        "kya"  "きゃ"
        "kyi"  "きぃ"
        "kyu"  "きゅ"
        "kye"  "きぇ"
        "kyo"  "きょ"

        "ga"   "が"
        "gi"   "ぎ"
        "gu"   "ぐ"
        "ge"   "げ"
        "go"   "ご"

        "gya"  "ぎゃ"
        "gyi"  "ぎぃ"
        "gyu"  "ぎゅ"
        "gye"  "ぎぇ"
        "gyo"  "ぎょ"

        "sa"   "さ"
        "si"   "し"
        "su"   "す"
        "se"   "せ"
        "so"   "そ"

        "sha"  "しゃ"
        "shi"  "し"
        "shu"  "しゅ"
        "she"  "しぇ"
        "sho"  "しょ"

        "sya"  "しゃ"
        "syi"  "しぃ"
        "syu"  "しゅ"
        "sye"  "しぇ"
        "syo"  "しょ"

        "za"   "ざ"
        "zi"   "じ"
        "zu"   "ず"
        "ze"   "ぜ"
        "zo"   "ぞ"

        "ja"   "じゃ"
        "ji"   "じ"
        "ju"   "じゅ"
        "je"   "じぇ"
        "jo"   "じょ"

        "zya"  "じゃ"
        "zyi"  "じぃ"
        "zyu"  "じゅ"
        "zye"  "じぇ"
        "zyo"  "じょ"

        "ta"   "た"
        "ti"   "ち"
        "tu"   "つ"
        "te"   "て"
        "to"   "と"

        "cha"  "ちゃ"
        "chi"  "ち"
        "chu"  "ちゅ"
        "che"  "ちぇ"
        "cho"  "ちょ"

        "tya"  "ちゃ"
        "tyi"  "ちぃ"
        "tyu"  "ちゅ"
        "tye"  "ちぇ"
        "tyo"  "ちょ"

        "da"   "だ"
        "di"   "ぢ"
        "du"   "づ"
        "de"   "で"
        "do"   "ど"

        "dha"  "でゃ"
        "dhi"  "でぃ"
        "dhu"  "でゅ"
        "dhe"  "でぇ"
        "dho"  "でょ"

        "dya"  "ぢゃ"
        "dyi"  "ぢぃ"
        "dyu"  "ぢゅ"
        "dye"  "ぢぇ"
        "dyo"  "ぢょ"

        "na"   "な"
        "ni"   "に"
        "nu"   "ぬ"
        "ne"   "ね"
        "no"   "の"

        "nya"  "にゃ"
        "nyi"  "にぃ"
        "nyu"  "にゅ"
        "nye"  "にぇ"
        "nyo"  "にょ"

        "ha"   "は"
        "hi"   "ひ"
        "hu"   "ふ"
        "he"   "へ"
        "ho"   "ほ"

        "fa"   "ふぁ"
        "fi"   "ふぃ"
        "fu"

Re: UTS #10: Unicode Collation Algorithm (UCA)

2002-07-22 Thread Michael Everson

At 06:15 -0700 2002-07-21, Michael (michka) Kaplan wrote:

The UCA provides a very nice framework. But if you already have a house, who
needs a new frame?

Because your already nice house isn't very friendly. It isn't 
tailorable by anyone but you, which means, in effect, unless you're 
an invited guest, you won't be able to enjoy the house. Of course 
it's understandable that you did something else while things were 
under ballot and so on. Perhaps you should migrate your current 
ordering support to a UCA-based one. That would leave you more 
flexible in future, particularly for the support of smaller 
communities.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Normalization

2002-07-22 Thread Debmalya Biswas

Hi,

Before getting to the question, let me explain the scenarios first:

Scenario 1: I need to compare strings containing Japanese/French characters 
entered from the command line against strings stored in a SQL Server database (stored 
through an ASP application) as an nvarchar datatype. The application accepting the 
command line arguments and doing the comparison is a C++ console application.

Scenario 2: A C++ application is posting data (strings containing 
French/Japanese characters) to a Java Servlet. The C++ application exists on both 
Windows and Mac.

My question is the same for both scenarios: do I need to do anything special 
with respect to normalization when comparing the strings, or are the C++ (first scenario) and 
Java (second scenario) string comparison functions capable enough to work properly?

Thanks in advance.
Regards,
Debmalya Biswas
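
The concern behind the question can be illustrated with a short sketch. It
is in Python rather than the C++ and Java of the two scenarios, and it
assumes both sides have already been decoded to Unicode strings; the helper
name same_text is hypothetical. A plain code-point comparison treats the
precomposed and decomposed spellings of the same text as different, while
comparing after normalizing both sides to one form does not:

```python
import unicodedata


def same_text(a, b, form="NFC"):
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# The same French word spelled two ways.
composed = "caf\u00e9"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

plain_equal = composed == decomposed                 # code-point comparison
normalized_equal = same_text(composed, decomposed)   # normalized comparison
```

Here plain_equal is False and normalized_equal is True. Whether the data
in either scenario actually arrives in mixed forms depends on the input
sources, which this sketch does not try to model.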




Re: ISO/IEC 10646 versus Unicode

2002-07-22 Thread Michael Everson

Dear colleagues,

I was biting my tongue there for a bit, but as this list is both 
public and archived, I am afraid that I have little choice but to 
respond to Marion Gunn's revisionist history, as it reflects on my 
own activities working for the Universal Character Set.

I will begin by reminding readers of this list that I have had no 
interest in EGT since September 2001, when I ceased to be an 
owner-director of that limited company. However, as Marion refers to 
the period of time when I *was* involved, it seems to me proper that 
I set the record straight.

At 11:36 +0100 2002-07-18, Marion Gunn wrote:

EGT was one of the first companies to give (almost) unqualified 
support to the setting up of Unicode.

This could not possibly be considered to be true. As told in 
http://www.unicode.org/unicode/history/ the UTC meetings are counted 
from February 1989. I didn't come to Ireland until September 1989. 
The Unicode Consortium was officially incorporated in January 1991. 
EGT wasn't incorporated until February 1991.

Further, although EGT did become aware of the 10646 ballot in time 
to influence Ireland's vote on the DIS in June 1991, it was afterward 
that EGT made contact with the Unicode Consortium, when I wrote a 
number of responses to UTR #1 and UTR #2: Burmese (April 1993), 
Ethiopic (May 1993), Sinhala and Tibetan (September 1993).

The formal involvement of EGT in standards development began with 
my attendance at a CEN/TC304 meeting in Paris in 1994.

In October 1994 I attended my first meeting of ISO/IEC JTC1/SC2/WG2 
in San Francisco, and it was there that I first met members of the 
Unicode Technical Committee. (Asmus Freytag and I hit it off rather 
badly, in the spirit of cautious distrust which was, it has to be 
admitted, present in the 10646-vs-Unicode spirit of those times. Now, 
of course, we work closely together as co-editors and are fast 
friends; I have the honour of being godfather to his daughter 
Brianna.)

When it became clear that 10646 was getting unwieldy, EGT took a 
2-pronged approach, consisting of establishing new Irish National 
Standards and adding to the 8859-series, which proved a lot more 
productive than trusting to 10646 alone (both of which aims EGT 
successfully achieved).

EGT did not propose the development of I.S. 434 (8-bit code for 
Ogham) and ISO/IEC 8859-14 (Latin 8, Celtic) because 10646 was 
getting unwieldy. I had developed Ogham and Gaelic fonts for use on 
8-bit operating systems, and it seemed that support for Celtic text 
written with those character sets would be likelier if there were 
formal standards available. That is the reason those standards were 
developed. I was the editor of both of those standards on behalf of 
NSAI/AGITS/WG6 (now NSAI/ICTSCC/SC4) and ISO/IEC JTC1/SC2/WG3.

I, for one, am still a believer in the vision of Unicode, and still 
monitor/support its mailing list/other activities, and hope to live 
long enough to see it succeed, although I have to admit to getting 
so very many things wrong about Unicode in the past: [...] I 
thought, for example, that involvement in it would cost EGT very 
little, in terms of working hours (wrong) and in terms of money 
(wrong) [...]

Marion writes about EGT as though it were more than the sum of its 
parts. From February 1991 to September 2001, in any case, it was 
certainly not so; during that period, EGT consisted of two people, 
myself and Marion, and no more. What money was spent on 
standardization was chiefly for JTC1/SC2/WG2 and CEN/TC304 
activities, in point of fact, and it was spent with the agreement of 
the two co-directors who both signed the cheques. It ought not to be 
made to look otherwise.

For my part, I regret not one penny of the money we chose to spend on 
standardization travel, nor one minute of the time I invested in 
drawing up script and character proposals. Consider, for instance, 
the living scripts which have been encoded to date with at least some 
of my input (Buhid, Cherokee, Canadian Syllabics, Ethiopic, Hanunóo, 
Khmer, Limbu, Myanmar, Sinhala, Tagbanwa, Tai Le, Thaana, Tibetan, 
and Yi). These are used to write languages spoken by some 63 million 
people on our planet. The investment has, to be sure, enabled me to 
come into the fullness of my ability to do what has become my own 
life's work. If I may be so bold as to say so, the Unicode Standard and 
ISO/IEC 10646 -- and computer users worldwide -- are better off for 
the investment which EGT made between 1994 and 2001 than they would 
have been otherwise.

When, after all the years of receiving Irish support, I saw 
Unicode's 2002 conference in Dublin being advertised as more of a 
showcase for German than native interests, I decided not to attend, 
but that does not mean any withdrawal of EGT's initial and 
longstanding support of Unicode, in principle (although it seems to 
have produced only one thing to date, viz., a book called The 
Unicode Standard (where I expected to read  

Re: Corporate influence on Unicode development (long)

2002-07-22 Thread Peter_Constable


On 07/21/2002 07:30:33 PM Doug Ewell wrote:

First of all, the figure that William (or any other individual) really
should be looking at is not $12,000 for a full membership, but $600 for
a specialist membership or $120 for an individual membership.  (BTW,
I would be interested in hearing -- perhaps off-line -- from individuals
who hold or have held such memberships, to find out how they felt their
memberships benefited them and Unicode.)

Only the $12,000 membership makes you a candidate to vote on UTC. Associate
and Specialist memberships give you access to the insider's mailing list
(where real proposals get discussed) and documents, though, which has been
very useful.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]







Re: ISO/IEC 10646 versus Unicode

2002-07-22 Thread Marion Gunn

Kenneth Whistler said:
 
 Marion Gunn wrote:
 
  How many years does it take to get ISO/IEC work item accepted, then
  develop the corresponding Standard to publication stage, Ken?
 
 In the case of 10646, approximately 10 years, Marion.
 ...

10 years? And Unicode, after eleven long years, has yet to produce the
promised Universal Character Set/Implementations of 10646. Any fool can
chuck missiles from the discarded rockheaps of history, but I do know
what my company understood itself to be investing in through many
expensive years of supporting Unicode. It was in the Universal Character
Set and 10646 Implementations, which I still hope to see Unicode produce,
or at least a reasonable timetable offered. Does Unicode have a
reasonable timetable to offer?
mg

-- 
Marion Gunn * E G T (Estab.1991) vox: +353-1-2839396 * [EMAIL PROTECTED]
27 Páirc an Fhéithlinn; Baile an Bhóthair; Contae Átha Cliath; Éire




Re: ISO/IEC 10646 versus Unicode

2002-07-22 Thread Peter_Constable


On 07/22/2002 10:15:37 AM Marion Gunn wrote:

I do know
what my company understood itself to be investing in through many
expensive years of supporting Unicode. It was in the Universal Character
Set and 10646 Implementations, which I still hope to see Unicode produce,
or at least a reasonable timetable offered. Does Unicode have a
reasonable timetable to offer?

I'm not sure I get this: 10646 implementations to be produced by [The]
Unicode [Consortium]? My understanding is that TUC does not produce
implementations; they only produce a standard known as The Unicode
Standard. As for producing the Universal Character Set, they are in the
process of doing that in concert with WG2, and are working with exactly the
same timetable as WG2.

The point of this thread escapes me.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: [EMAIL PROTECTED]







Re: ISO/IEC 10646 versus Unicode

2002-07-22 Thread Michael Everson

At 16:15 +0100 2002-07-22, Marion Gunn wrote:
Kenneth Whistler wrote:
  
   Marion Gunn wrote:
   
How many years does it take to get ISO/IEC work item accepted, then
develop the corresponding Standard to publication stage, Ken?

   In the case of 10646, approximately 10 years, Marion.

10 years? And Unicode, after eleven long years, has yet to produce the
promised Universal Character Set/Implementations of 10646.

It is absolutely astonishing to me that after all these years 
you don't know what it is that Unicode is meant to produce. Unicode 
produces a character set standard, equivalent to the character set of 
ISO/IEC 10646. Unicode also produces some other standards which guide 
implementation.

The implementation is done by Apple, HP, IBM, JustSystem, Microsoft, 
Oracle, SAP, Sun, Sybase, Unisys and many other companies.

If you wish to learn more, start at 
http://www.unicode.org/unicode/standard/WhatIsUnicode.html. You may 
also be interested to see how many products are Unicode-enabled. See 
http://www.unicode.org/unicode/onlinedat/products.html.

Any fool can chuck missiles from the discarded rockheaps of history,

Truer words were never spoken.

but I do know what my company understood itself to be investing in 
through many expensive years of supporting Unicode.

I don't know what you were understanding during the period at which 
EGT was investing in travel to ISO and CEN meetings, but I 
understood it perfectly well. I would prefer it very much if you 
would not speak for me with regard to that time period on this or 
other lists.

It was in the Universal Character Set and 10646 Implementations, 
which I still hope to see Unicode produce, or at least a reasonable 
timetable offered. Does Unicode have a reasonable timetable to offer?

I wonder how many characters I actually helped encode so far? It must 
be approaching four and a half thousand.
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Abstract character?

2002-07-22 Thread Lars Marius Garshol


I'm trying to find out what an abstract character is. I've been
looking at chapter 3 of Unicode 3.0, without really achieving
enlightenment. 

The term Unicode scalar value (apparently synonymous with code point)
seems clear. It is the identifying number assigned to assigned
Unicode characters.

So far, so good. Some questions:

 - are all assigned Unicode characters also abstract characters?

 - it seems that not all abstract characters have code points (since
   abstract characters can be formed using combining characters). Is
   that correct?

 - do U+00C5 (Å) and U+0041, U+030A (A followed by combining ring
   above) represent the same abstract character?

Would be good if someone could clear this up.

-- 
Lars Marius Garshol, Ontopian         URL: http://www.ontopia.net 
ISO SC34/WG3, OASIS GeoLang TC        URL: http://www.garshol.priv.no 





Re: Abstract character?

2002-07-22 Thread Kenneth Whistler

Lars Marius Garshol asked:

 I'm trying to find out what an abstract character is. I've been
 looking at chapter 3 of Unicode 3.0, without really achieving
 enlightenment. 
 
 The term Unicode scalar value (apparently synonymous with code point)
 seems clear. It is the identifying number assigned to assigned
 Unicode characters.

Here is one of my attempts at a more rigorous term rectification:

Abstract character

   that which is encoded; an element of the repertoire (existing
   independent of the character encoding standard, and often
   identifiable in other character encoding standards, as well
   as the Unicode Standard); the implicit basis of transcodings.

   Note that while in some sense abstract characters exist a
   priori by virtue of the nature of the units of various writing
   systems, their exact nature is only pinned down at the point
   that an actual encoding is done. They are not always obvious,
   and many new abstract characters may arise as the result of
   particular textual processing needs that can be addressed by
   characters. (E.g. WORD JOINER, OBJECT REPLACEMENT CHARACTER,
   etc., etc.)

Code point

   A number from 0..10FFFF; a point in the codespace 0..10FFFF.

Encoded character

   An *association* of an abstract character with a code point.

Unicode scalar value

   A number from 0..D7FF, E000..10FFFF; the domain of the
   functions which define UTF's. The Unicode scalar value
   definitionally excludes D800..DFFF, which are only code unit
   values used in UTF-16, and which are not code points associated
   with any well-formed UTF code unit sequences.

Assignment (of code points)

   Refers to the process of associating abstract characters with
   code points. Mathematically a code point is
   "assigned to" an abstract character and an abstract
   character is "mapped to" a code point.

   This is distinguished from the vaguer sense of assigned
   in general parlance as meaning a code point given some
   designated function by the standard, which would include
   noncharacters and surrogates.
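
The numeric ranges in the definitions above can be written down as a small
sketch in Python. The function names are illustrative only and are not
terms from the standard:

```python
MAX_CODE_POINT = 0x10FFFF
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF


def is_code_point(n):
    """A code point is any integer in the codespace 0..10FFFF."""
    return 0 <= n <= MAX_CODE_POINT


def is_surrogate(n):
    """Surrogate values D800..DFFF, used as code units in UTF-16."""
    return SURROGATE_FIRST <= n <= SURROGATE_LAST


def is_scalar_value(n):
    """A scalar value is 0..D7FF or E000..10FFFF (no surrogates)."""
    return is_code_point(n) and not is_surrogate(n)
```

With these, is_code_point(0xD800) is true but is_scalar_value(0xD800) is
false, which is the distinction the two definitions draw.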

 
 So far, so good. Some questions:
 
  - are all assigned Unicode characters also abstract characters?

Yes. Or rather: all encoded characters are assigned to abstract
characters.

(See above for my distinction between assigned and
designated, which would apply to noncharacters and surrogate
code points -- neither of which classes of code points get
assigned to abstract characters.)

 
  - it seems that not all abstract characters have code points (since
abstract characters can be formed using combining characters). Is
that correct?

Yes. (Note above -- abstract characters are also a concept which
applies to other character encodings besides the Unicode Standard,
and not all encoded characters in other character encodings automatically
make it into the Unicode Standard, for various architectural reasons.)

 
  - do U+00C5 (Å) and U+0041, U+030A (A followed by combining ring
above) represent the same abstract character?

Yes. That is the implicit claim behind a specification of canonical
equivalence.
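
The equivalence can be checked against the character database; a small
sketch using Python's standard unicodedata module (the variable names are
illustrative):

```python
import unicodedata

precomposed = "\u00C5"       # LATIN CAPITAL LETTER A WITH RING ABOVE
sequence = "\u0041\u030A"    # A followed by COMBINING RING ABOVE

# The canonical decomposition mapping recorded for U+00C5.
mapping = unicodedata.decomposition(precomposed)

# Normalizing in either direction turns one spelling into the other.
decomposed_form = unicodedata.normalize("NFD", precomposed)
composed_form = unicodedata.normalize("NFC", sequence)
```

Here mapping is the string "0041 030A", decomposed_form equals sequence,
and composed_form equals precomposed.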

--Ken

 
 Would be good if someone could clear this up.
 
 -- 
 Lars Marius Garshol, Ontopian         URL: http://www.ontopia.net 
 ISO SC34/WG3, OASIS GeoLang TC        URL: http://www.garshol.priv.no 
 
 
 





Re: Abstract character?

2002-07-22 Thread Barry Caplan

I usually define an abstract character in talks I give as an element of a writing 
system that you care about, independent of glyphs, and certainly independent of 
encodings or specific code points. 

If it could be described more precisely than that, it wouldn't be abstract, would 
it? :)

This is usually brought up in a series of definitions leading from character (what 
we are referring to here as abstract character), and then:

- character list - a list of characters one is interested in
- character set - a list of character lists, which may or may not be ordered, but 
still has no code points
- encoding scheme - an algorithm for assigning code points to a character set
- code point - the representation of an abstract character in an encoding scheme
- font - a series of glyphs that are used to display characters represented by 
code points, in their immediate context

All of this is filled with examples - building to an explanation of Unicode. For 
example, wrt abstract character, I ask the audience to ponder if upper case A and 
lower case a, are the same abstract character. Also, I ask them to ponder if 
lower case a displayed in Helvetica is the same character as lower case a in  
Times Roman. Finally, how about  lower case a in 9 point Helvetica and lower case 
a in 18 point Helvetica?

And apropos a thread from last week, Unicode introduces new concepts such as 
character properties which means the anticipation and intrigue I spend time building 
in the audience that there is a neat solution to the historical morass I just spent 40 
minutes describing, gets thoroughly dashed! Joy!

Implicit in this set of definitions is of course that a character may or may not be 
of interest to all character lists, and therefore may or may not end up represented 
in more than one encoding. Also note that even when it does end up in more than one, 
this model in no way implies a round trip capability.
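
One narrow case of this can be sketched in Python: checking whether text
survives a trip through a given encoding and back. The helper name
round_trips is hypothetical, and the sketch covers only the path from
Unicode through one legacy encoding, not conversion between two legacy
encodings:

```python
def round_trips(text, encoding):
    """Return True if text survives encode/decode through the encoding."""
    try:
        return text.encode(encoding).decode(encoding) == text
    except UnicodeEncodeError:
        # The encoding has no representation for some character.
        return False


euro_in_latin1 = round_trips("\u20ac", "latin-1")   # EURO SIGN is not in ISO 8859-1
euro_in_cp1252 = round_trips("\u20ac", "cp1252")    # but is in Windows-1252

# A lossy conversion silently substitutes "?" and cannot be undone.
lossy = "\u20ac".encode("latin-1", errors="replace")
```

Here euro_in_latin1 is False, euro_in_cp1252 is True, and lossy is the
single byte b"?".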

This leads nicely into a discussion about some very important aspects of 
internationalizing code and working with 3rd party components.

Barry Caplan
www.i18n.com


Tamil Text Messaging in Mobile Phones

2002-07-22 Thread Sinnathurai Srivas


http://www.gbizg.com/tamil/Unicode/Tamil_Text_Messaging.htm

see the above for a sample of typical modern Tamil designed for mobile
texting without rendering support.


A typical product:

http://sms.gt.com.ua/

Text messaging in Tamil on Mobile phones. Would they only work with my
proposed reformed Tamil characters?

see http://www.geocities.com/avarangal
for using ancient Tamil writing logic and reforming current alphabets






Re: Tamil Text Messaging in Mobile Phones

2002-07-22 Thread Michael (michka) Kaplan

For those who are interested in what is behind this message, a little
background...

Sinnathurai Srivas is a member of INFITT's WG02 (Working Group 02, Unicode
Tamil) who has long been advocating changes to Unicode Tamil that would be
done in a linear manner that would remove the requirement of complex
rendering. It would of course require many changes to rendering rules and
character properties.

At this point you might wonder how it would be possible to do this without
breaking compatibility -- well, no need to wonder, it would not be
possible. Compatibility would have to be sacrificed.

Several members of the committee pointed out that these reforms would not be
possible without invalidating existing implementations. After some
discussion, the chairman noted that he saw no way that such a proposal could
actually be accomplished.

The committee let the matter drop after this. I know I am not the only one
who thought that was the end of it -- until this very post was sent out today to
at least half a dozen lists.


MichKa

Michael Kaplan
Trigeminal Software, Inc.  -- http://www.trigeminal.com/









Re: Tamil Text Messaging in Mobile Phones

2002-07-22 Thread Doug Ewell

Sinnathurai Srivas avarangal at hotmail dot com wrote:

 http://www.gbizg.com/tamil/Unicode/Tamil_Text_Messaging.htm

 see the above for a sample of typical modern Tamil designed for mobile
 texting without rendering support.

Rendering is the process of mapping character codes to displayable
glyphs on a screen or printer.  Rendering support is always required,
even for English.  You probably mean "without complex rendering
support," e.g. contextual glyph forms and glyph reordering.

 Text messaging in Tamil on Mobile phones. Would they only work with my
 proposed reformed Tamil characters?

Unicode is not the place to propose reforms in scripts or orthography.
Your proposed characters must first achieve popular usage before they
will be encoded.

-Doug Ewell
 Fullerton, California





Re: Abstract character?

2002-07-22 Thread Mark Davis

A small correction to Ken's message:

The Unicode scalar value
definitionally excludes D800..DFFF, which are only code unit
values used in UTF-16, and which are not code points associated
with any well-formed UTF code unit sequences.

The UTC has decided to make "scalar value" mean unambiguously the
code points 0000..D7FF, E000..10FFFF, i.e., everything but surrogate
code points. While surrogate code points cannot be represented in
UTF-8 (as of Unicode 3.2), the UTC has not decided that the surrogate
code points are illegal in all UTFs; notably, they are legal in
UTF-16.
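
The ranges above can be written down as simple predicates. This is only a
sketch in Python; the function names are invented for illustration, and
utf8_accepts merely reports what Python's own strict UTF-8 encoder does,
which happens to agree with the Unicode 3.2 rule that surrogate code points
cannot be represented in UTF-8.

```python
SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF
MAX_CODE_POINT = 0x10FFFF

def is_code_point(n):
    """Any integer in the codespace 0..10FFFF."""
    return 0 <= n <= MAX_CODE_POINT

def is_surrogate(n):
    """Code points D800..DFFF."""
    return SURROGATE_FIRST <= n <= SURROGATE_LAST

def is_scalar_value(n):
    """Code points 0000..D7FF and E000..10FFFF."""
    return is_code_point(n) and not is_surrogate(n)

def utf8_accepts(n):
    """True if Python's strict UTF-8 encoder will encode code point n."""
    try:
        chr(n).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False

for n in (0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10FFFF):
    print(f"{n:06X}", is_scalar_value(n), utf8_accepts(n))
```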

Ken is pushing for this change; I believe it would be a very bad idea.
(I think the reasons have already appeared on this list, so I am not
trying to reopen the discussion; just state the current situation.)

Mark
__
http://www.macchiato.com
◄  “Eppur si muove” ►

- Original Message -
From: Kenneth Whistler [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Monday, July 22, 2002 13:38
Subject: Re: Abstract character?


 Lars Marius Garshol asked:

  I'm trying to find out what an abstract character is. I've been
  looking at chapter 3 of Unicode 3.0, without really achieving
  enlightenment.
 
  The term Unicode scalar value (apparently synonymous with code point)
  seems clear. It is the identifying number assigned to assigned
  Unicode characters.

 Here is one of my attempts at a more rigorous term rectification:

 Abstract character

that which is encoded; an element of the repertoire (existing
independent of the character encoding standard, and often
identifiable in other character encoding standards, as well
as the Unicode Standard); the implicit basis of transcodings.

Note that while in some sense abstract characters exist a
priori by virtue of the nature of the units of various writing
systems, their exact nature is only pinned down at the point
that an actual encoding is done. They are not always obvious,
and many new abstract characters may arise as the result of
particular textual processing needs that can be addressed by
characters. (E.g. WORD JOINER, OBJECT REPLACEMENT CHARACTER,
etc., etc.)

 Code point

A number from 0..10FFFF; a point in the codespace 0..10FFFF.

 Encoded character

An *association* of an abstract character with a code point.

 Unicode scalar value

A number from 0..D7FF, E000..10FFFF; the domain of the
functions which define UTF's. The Unicode scalar value
definitionally excludes D800..DFFF, which are only code unit
values used in UTF-16, and which are not code points associated
with any well-formed UTF code unit sequences.

 Assignment (of code points)

Refers to the process of associating abstract characters with
code points. Mathematically, a code point is
assigned to an abstract character and an abstract
character is mapped to a code point.

This is distinguished from the vaguer sense of "assigned"
in general parlance as meaning a code point given some
designated function by the standard, which would include
noncharacters and surrogates.
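
One way to see the assigned/designated distinction in practice is to look at
General_Category values. The sketch below uses Python's standard unicodedata
module; the helper name general_category and the choice of sample code
points are assumptions made for this example. Note that Cn covers both
noncharacters and ordinary unassigned code points, so the category alone
does not separate those two cases.

```python
import unicodedata

def general_category(code_point):
    """Return the two-letter General_Category of a code point."""
    return unicodedata.category(chr(code_point))

samples = {
    0x0041: "LATIN CAPITAL LETTER A (encoded character)",
    0x2060: "WORD JOINER (encoded character)",
    0xFFFC: "OBJECT REPLACEMENT CHARACTER (encoded character)",
    0xD800: "surrogate code point (designated, not assigned)",
    0xFFFE: "noncharacter (designated, not assigned)",
}

categories = {cp: general_category(cp) for cp in samples}

for cp, note in samples.items():
    print(f"U+{cp:04X}  {categories[cp]}  {note}")
```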

 
  So far, so good. Some questions:
 
   - are all assigned Unicode characters also abstract characters?

 Yes. Or rather: all encoded characters are assigned to abstract
 characters.

 (See above for my distinction between "assigned" and
 "designated", which would apply to noncharacters and surrogate
 code points -- neither of which classes of code points get
 assigned to abstract characters.)

 
   - it seems that not all abstract characters have code points (since
     abstract characters can be formed using combining characters). Is
     that correct?

 Yes. (Note above -- abstract characters are also a concept which
 applies to other character encodings besides the Unicode Standard,
 and not all encoded characters in other character encodings automatically
 make it into the Unicode Standard, for various architectural reasons.)

 
   - do U+00C5 (Å) and U+0041, U+030A (A followed by combining ring
     above) represent the same abstract character?

 Yes. That is the implicit claim behind a specification of canonical
 equivalence.
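
 The equivalence can be checked mechanically. A minimal sketch in Python,
 using the standard unicodedata module (the function name
 canonically_equivalent is made up for this example): two strings are
 canonically equivalent when their canonical decompositions (NFD) are
 identical.

```python
import unicodedata

precomposed = "\u00C5"        # LATIN CAPITAL LETTER A WITH RING ABOVE
combining = "\u0041\u030A"    # A followed by COMBINING RING ABOVE

def canonically_equivalent(a, b):
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(precomposed == combining)                        # code point sequences differ
print(canonically_equivalent(precomposed, combining))  # same abstract character
```

 The two strings differ as sequences of code points, which is why a plain
 comparison fails while the normalized comparison succeeds.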

 --Ken

 
  Would be good if someone could clear this up.
 
  --
  Lars Marius Garshol, Ontopian         URL: http://www.ontopia.net
  ISO SC34/WG3, OASIS GeoLang TC        URL: http://www.garshol.priv.no
 
 
 








Dublin Conference: Re: ISO/IEC 10646 versus Unicode

2002-07-22 Thread Lisa Moore


Dear Marion,

After checking the mail lists upon returning from vacation/holiday, I found
the following comment on the most recent Unicode conference in Dublin
rather surprising:

  When, after all the years of receiving Irish support, I saw Unicode's
  2002 conference in Dublin being advertised as more of a showcase for
  German than native interests, I decided not to attend, but that does not
  mean any withdrawal of EGT's initial and longstanding support of
  Unicode, in principal (although it seems to have produced only one thing
  to date, viz., a book called 'The Unicode Standard' (where I expected to
  read 'Implementation').

As a matter of fact, we specifically designed the Dublin Unicode Conference
to tie in with the substantial Dublin localization industry. I am quite
sorry if this purpose was misunderstood. The keynote speaker you refer to
was from the Localization Research Institute of the University of Limerick.

It is too bad that you were not able to attend, particularly since you have
a great interest in Unicode implementations (as do I).  We were able to
showcase implementations ranging from top US IT businesses, to many
interesting worldwide case studies, localization, etc. I think you would
have enjoyed it (in addition to the local pub:-).

Implementation is truly "where the rubber meets the road," to use an
American idiom.  In this regard, the conferences aim to champion
leading-edge Unicode implementations.  I particularly enjoyed hearing from
a British mobile phone company at the Dublin conference - Unicode is
popping up everywhere, it seems.

Best regards,

Lisa Moore
Co-Chair, IUC







Re: Abstract character?

2002-07-22 Thread Doug Ewell

Mark Davis mark at macchiato dot com wrote:

 The UTC has decided to make "scalar value" mean unambiguously the
 code points 0000..D7FF, E000..10FFFF, i.e., everything but surrogate
 code points. While surrogate code points cannot be represented in
 UTF-8 (as of Unicode 3.2), the UTC has not decided that the surrogate
 code points are illegal in all UTFs; notably, they are legal in
 UTF-16.

They are not legal in UTF-16 unless you believe that the two code points
(0xD800, 0xDC00) are fundamentally equivalent to the single code point
0x10000 -- that is, unless you believe Unicode *is* UTF-16.

UTF-16 does not allow the representation of an unpaired surrogate 0xD800
followed by another, coincidental unpaired surrogate 0xDC00.  (It maps
the two to U+10000.)  Among the standard UTFs, only UTF-32 allows the
two to be treated as unpaired surrogates.  In fact, before UTF-8 was
tightened up in 3.2, the only UTF that DID NOT permit these two
coincidental unpaired surrogates was UTF-16.

UTF-8:   D800 DC00 == ED A0 80 ED B0 80 (no longer legal)
UTF-32:  D800 DC00 == 0000D800 0000DC00
- but -
UTF-16:  D800 DC00 == D800 DC00 == 10000
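
The three rows can be reproduced in Python as a sketch. Two caveats: the
"surrogatepass" error handler is a Python-specific escape hatch that
deliberately emits byte sequences the strict codecs reject, so it only
imitates the pre-3.2 behavior; and combine_surrogates is a hypothetical
helper written for this example, not a library function.

```python
def combine_surrogates(high, low):
    """Map a UTF-16 high/low surrogate pair to the code point it encodes."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a high/low surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# Two separate surrogate code points, D800 followed by DC00.
two_surrogates = "\ud800\udc00"

as_utf8 = two_surrogates.encode("utf-8", "surrogatepass")
as_utf32 = two_surrogates.encode("utf-32-be", "surrogatepass")
as_utf16 = two_surrogates.encode("utf-16-be", "surrogatepass")

# Reading the UTF-16 bytes back cannot recover two code points:
# the decoder sees a well-formed pair and returns one character.
round_trip = as_utf16.decode("utf-16-be")

print(as_utf8.hex(" "), as_utf32.hex(" "), as_utf16.hex(" "))
print(len(two_surrogates), len(round_trip), hex(ord(round_trip)))
```

UTF-8 and UTF-32 keep the two values distinct, while the UTF-16 round trip
collapses them into the single code point computed by combine_surrogates.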

 Ken is pushing for this change; I believe it would be a very bad idea.
 (I think the reasons have already appeared on this list, so I am not
 trying to reopen the discussion; just state the current situation.)

I don't recall seeing the reasons conclusively discussed on this list;
I'd be happy to hear them again.  I've been complaining about the
paragraph after D29 for two years now.

-Doug Ewell
 Fullerton, California