Re: Internal Representation of Unicode

2003-10-01 Thread John Cowan
[EMAIL PROTECTED] scripsit:

> First they'd want numeric value properties added to the Hebrew and
> Greek letters, then when they came to do the same for the Latin letters
> the ensuing flamewar would bring the whole effort to a standstill.

Numeric values for Hebrew, Greek, and Cyrillic make a lot of sense, actually.

-- 
I am expressing my opinion.  When my      John Cowan
honorable and gallant friend is called,   [EMAIL PROTECTED]
he will express his opinion.  This is     http://www.ccil.org/~cowan
the process which we call Debate.   --Winston Churchill



RE: Internal Representation of Unicode

2003-10-01 Thread Jill Ramonsky




Yeah, but dude, wasting time on stupid ideas goes with the territory if
you happen to be a creative genius. Some of your ideas won't work.
Others will be magnificent. I'd put good money on the notion that if
Newton had been prevented from pursuing astrology or numerology, this
restriction would have had serious negative consequences for his
genius, and then maybe we wouldn't have calculus or the laws of motion
either. Creativity is all about playing with your mind, not about
playing in a sandbox. Anyone who doesn't understand that is doomed to
constantly cripple the expression of genius, to the detriment of
society as a whole.

I can see why someone would want to make a console or terminal emulator
work with Unicode. I've messed around myself with ideas to make it work
(haven't come up with anything yet though). I say, go for it Johann. If
it turns out to have been a good idea, people will use it. If not, you
will have learned a great deal. It's definitely a no-lose situation.

Jill


> -Original Message-
> Isaac Newton spent an unconscionable amount of time, by our
> standards, messing about with astrology and numerology -- far more than
> he ever put into physics or calculus.  The "standardization" of science
> since his day has helped reduce such effects.






Re: Internal Representation of Unicode

2003-10-01 Thread jon
> > At 11:15 AM 9/30/03 -0400, John Cowan wrote:
> >> Isaac Newton spent an unconscionable amount
> >> of time, by our standards, messing about with astrology and
> >> numerology
> >
> > One of the aspects of character encoding and standardization that 
> > seems to have an unholy fascination for people is its numerical 
> > aspect. It starts with the catalog number for 10646, which was 
> > deliberately jiggered to incorporate the number 646, which is the 
> > catalog number for the 7-bit standards. It continues with the desire 
> > to see certain characters at specific code locations (for example the
> > byte order mark) and continues with the never-ending stream of 
> > (re-)encoding forms.
> >
> > It's just human nature, I guess.
> >
> 
> Maybe we should add something to the submission form:  "Has this 
> proposal been approved by a numerologist?"
> 

First they'd want numeric value properties added to the Hebrew and Greek letters, then 
when they came to do the same for the Latin letters the ensuing flamewar would bring 
the whole effort to a standstill.

Still, there are good reasons for the BOM being where it is...
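
One commonly cited reason, offered here as a gloss rather than as what the poster had in mind: the byte-swapped form of U+FEFF is U+FFFE, which is a noncharacter, so a reader that guesses the wrong byte order can notice the mistake. A small Python sketch of that check (the variable names are only illustrative):

```python
import unicodedata

BOM = "\ufeff"

# The same code point serialized in the two UTF-16 byte orders.
big_endian = BOM.encode("utf-16-be")
little_endian = BOM.encode("utf-16-le")

# What a reader sees if it decodes little-endian bytes as big-endian.
swapped = little_endian.decode("utf-16-be")

print(big_endian.hex(), little_endian.hex(), f"U+{ord(swapped):04X}")
# General category "Cn" means the swapped value is not an assigned character.
print(unicodedata.category(swapped))
```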







Re: Internal Representation of Unicode

2003-09-30 Thread John Jenkins
On 30 September 2003, at 12:01 PM, Asmus Freytag wrote:

> At 11:15 AM 9/30/03 -0400, John Cowan wrote:
>> Isaac Newton spent an unconscionable amount
>> of time, by our standards, messing about with astrology and numerology
>
> One of the aspects of character encoding and standardization that
> seems to have an unholy fascination for people is its numerical
> aspect. It starts with the catalog number for 10646, which was
> deliberately jiggered to incorporate the number 646, which is the
> catalog number for the 7-bit standards. It continues with the desire
> to see certain characters at specific code locations (for example the
> byte order mark) and continues with the never-ending stream of
> (re-)encoding forms.
>
> It's just human nature, I guess.

Maybe we should add something to the submission form:  "Has this 
proposal been approved by a numerologist?"


John H. Jenkins
[EMAIL PROTECTED]
[EMAIL PROTECTED]
http://homepage.mac.com/jhjenkins/



Re: Internal Representation of Unicode

2003-09-30 Thread Asmus Freytag
At 11:15 AM 9/30/03 -0400, John Cowan wrote:
> Isaac Newton spent an unconscionable amount
> of time, by our standards, messing about with astrology and numerology

One of the aspects of character encoding and standardization that seems to
have an unholy fascination for people is its numerical aspect. It starts
with the catalog number for 10646, which was deliberately jiggered to
incorporate the number 646, which is the catalog number for the 7-bit
standards. It continues with the desire to see certain characters at
specific code locations (for example the byte order mark) and continues
with the never-ending stream of (re-)encoding forms.

It's just human nature, I guess.

A./



Re: Internal Representation of Unicode

2003-09-30 Thread John Cowan
Jill Ramonsky scripsit:

> Ludwig, this Pastoral Symphony of yours all seems to me like something
> of a pointless exercise.
> And Albert, this "Theory of Relativity" of yours all seems to me like
> something of a pointless exercise.
> 
> Never discourage someone else's creativity.

The whole point of standardization is to *redirect* (not discourage) people's
creativity into useful channels.  Isaac Newton spent an unconscionable amount
of time, by our standards, messing about with astrology and numerology --
far more than he ever put into physics or calculus.  The "standardization"
of science since his day has helped reduce such effects.

-- 
John Cowan    http://www.ccil.org/~cowan    [EMAIL PROTECTED]
Please leave your values                Check your assumptions.  In fact,
   at the front desk.                      check your assumptions at the door.
     --sign in Paris hotel                   --Cordelia Vorkosigan



RE: Internal Representation of Unicode

2003-09-30 Thread Jill Ramonsky
Ludwig, this Pastoral Symphony of yours all seems to me like something
of a pointless exercise.
And Albert, this "Theory of Relativity" of yours all seems to me like
something of a pointless exercise.

Never discourage someone else's creativity.
Jill
> -Original Message-
> From: Rick McGowan [mailto:[EMAIL PROTECTED]
> Sent: Friday, September 26, 2003 5:05 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Internal Representation of Unicode
>
> This all seems to me like something of a pointless exercise.




Re: Internal Representation of Unicode

2003-09-26 Thread Peter_Constable
James Kass wrote on 09/26/2003 12:03:42 AM:

> Peter Constable (IIRC) reported on this list a while ago that there was
> a Latin-based writing system used for an indigenous South American
> language which stacks up to three marks above.

Good memory, James! The language is Ticuna.


Peter



Re: Internal Representation of Unicode

2003-09-26 Thread Rick McGowan
myrkraverk...sourceforge wrote:

> In a plain text environment, there is often a need to encode more than
> just the plain character.
...
> Since I'm using 64 bits, I call it Excessive Memory Usage Encoding, or
> EMUE.
...
> I thought of dividing the 64 bit code space into 32 variably wide
> plains, one for control characters, one for latin characters, one for
> han characters, and so on;

This all seems to me like something of a pointless exercise. Or maybe
you're not making clear who your intended audience of users is and which
problems you're trying to solve.

Decent libraries exist that already do nice things with strings having
attributes. And that, in my opinion, is a better model than bit-hacking in
a 64-bit space with vague implementation-defined attributes that change
depending on the "script" of a character. Such "attributed strings" are
easy to work with and provide a much higher-level model than this.
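
A minimal sketch of the attributed-string idea, in Python rather than Cocoa. The class and method names are made up for illustration and are not Apple's API; the point is only that attributes live in runs beside the text, so the character codes themselves stay plain Unicode:

```python
from dataclasses import dataclass, field


@dataclass
class AttributedString:
    """Plain text plus attribute runs stored outside the character codes."""
    text: str
    # Each run is (start, end, attributes); end is exclusive.
    runs: list = field(default_factory=list)

    def set_attributes(self, start, end, **attrs):
        """Attach attributes to the half-open range [start, end)."""
        self.runs.append((start, end, dict(attrs)))

    def attributes_at(self, index):
        """Merge every run covering index; later runs override earlier ones."""
        merged = {}
        for start, end, attrs in self.runs:
            if start <= index < end:
                merged.update(attrs)
        return merged


s = AttributedString("error: disk full")
s.set_attributes(0, 5, color="red", bold=True)
s.set_attributes(0, len(s.text), underline=False)
print(s.attributes_at(0))
print(s.attributes_at(10))
```

Comparing two such strings for textual equality is just a comparison of their `text` fields, whatever the attributes say.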

You might want to check out Apple's Cocoa environment, particularly the
definitions of the attributed string classes. For example...
http://developer.apple.com/documentation/Cocoa/Reference/Foundation/Java/Classes/NSAttributedString.html
or even the intro:
http://developer.apple.com/documentation/Cocoa/Conceptual/AttributedStrings/index.html

I'm sure there are libraries with similar capabilities for storing
characters + attributes in Java and other languages, I'm just not familiar
with them. Maybe some of the developers can chime in with their favorite
attributed string libraries. Even if you don't use one, you might find the
attributed string model educational.

(All of the above of course reflects only my personal opinion.)

Rick




RE: Internal Representation of Unicode

2003-09-26 Thread Marco Cimarosti
[EMAIL PROTECTED] wrote:
> In a plain text environment, there is often a need to encode more than
> just the plain character.  A console, or terminal emulator, is such an
> environment.  Therefore I propose the following as a technical report
> for internal encoding of unicode characters; with one goal in mind:
> character equalence is binary equalence.

I guess you meant "equivalence".

Q1: But what are "character equivalence" and "binary equivalence", and
why did you choose them as your goals?

> I thought of dividing the 64 bit code space into 32 variably wide
> plains,

Q2: What are these "plains" for? Why are there 32 of them?

> one for control characters, one for latin characters, one for
> han characters,

Q3: Why do you want to treat Latin characters and Han characters
differently?

There is nothing special about Latin or Han characters in Unicode: they are
just 2 of the about 50 scripts currently supported in Unicode. (see
http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and
http://www.unicode.org/Public/UNIDATA/Scripts.txt)

Q4: And how do you plan to distinguish them?

Both Latin and Han characters are scattered all over the Unicode space, so
you need to check many ranges to determine which character belongs to which
category.
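
To illustrate the scattering, here is a rough Python sketch. Python's standard unicodedata module does not expose the Script property, so the check below is only a name-based heuristic of my own (an assumption, not how Scripts.txt defines scripts); a real implementation would load the ranges from Scripts.txt.

```python
import unicodedata


def looks_latin(ch):
    """Heuristic only: the character's Unicode name mentions LATIN."""
    return "LATIN" in unicodedata.name(ch, "")


def looks_han(ch):
    """Heuristic only: the character's Unicode name is a CJK ideograph name."""
    name = unicodedata.name(ch, "")
    return name.startswith("CJK") and "IDEOGRAPH" in name


# Sample code points taken from widely separated parts of the code space.
latin_samples = [0x0041, 0x00E9, 0x0153, 0x1E00, 0xFF5A]
han_samples = [0x3400, 0x4E00, 0xF900, 0x20000]

for cp in latin_samples + han_samples:
    ch = chr(cp)
    print(f"U+{cp:04X}", looks_latin(ch), looks_han(ch), unicodedata.name(ch))
```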

Q5: And what about all the characters which are neither Latin nor Han?

> and so on; using 5 bits and the next 3 fixed to zero
> (for future expansion and alignment to an octet).
> I call plain 0 control characters and won't discuss it further.

Q6: Why do control characters get special handling?

Q7: Don't control characters have properties attached like any other
characters?

One example of a property which could be useful to attach to control
characters is directionality. E.g., a TAB is always a TAB but, after it
has passed through the Bidirectional Algorithm, its directionality can be
resolved to be either LTR or RTL.
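
As a small illustration in Python: the character database only records TAB's bidirectional class, which is the neutral "S" (segment separator). The resolved direction is not in the database; it comes out of running the algorithm on the surrounding text, which is why an application might want to store it per character.

```python
import unicodedata

# Bidirectional classes straight from the character database.
for ch in ("\t", "A", "\u05d0"):
    print(f"U+{ord(ch):04X}", unicodedata.bidirectional(ch))

tab_class = unicodedata.bidirectional("\t")
```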

> Plain 1, I had intended for latin characters with the following
> encoding method in mind:
> 
> bits 63..59  58..56 55..40 39..32 31..24 23..16 15..8  7..0
>      +-------+------+------+------+------+------+------+------+
>      | plain | zero | attr | res  | uacc | lacc | res  | char |
>      +-------+------+------+------+------+------+------+------+
> 
> * Plain   Plain                (5 bits)
> * Zero    Zero bits            (3 bits)
> * Attr    Attributes           (16 bits)

Q8: What kind of information are these three fields for?

Q9: In case your answer to Q8 is "they are application-defined", then
what is the rationale for defining and naming more than one field? I mean:
if they are application-defined, why not leave the task of defining
sub-fields to the application?

> * Res     Reserved             (8 bits)
> * Uacc    Upper Accent         (8 bits)
> * Lacc    Lower Accent         (8 bits)

Q10: Why do you treat "accents" specially?

They are just characters like any others. In Unicode there is no special
limitation as to how many "accents" can be applied to a base character.
There is also no obligation for accents to have a base character.

> * Res     Reserved             (8 bits)
> * Char    Character            (8 bits)

Q11: How can you store a Latin character in 8 bits?

Unicode has 938 Latin characters, and their codes range from U+0041 to
U+FF5A.
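
A sketch of the problem in Python, packing the proposed plain-1 layout exactly as drawn in the quoted diagram. The function name and the choice to raise an error on overflow are my own assumptions; the proposal does not say what should happen.

```python
def pack_plain1(attr, uacc, lacc, char, plain=1):
    """Pack fields into the 64-bit plain-1 layout from the quoted diagram."""
    fields = (
        ("plain", plain, 5),
        ("attr", attr, 16),
        ("uacc", uacc, 8),
        ("lacc", lacc, 8),
        ("char", char, 8),
    )
    for name, value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{name}={value:#x} does not fit in {bits} bits")
    # Bit positions: plain 63..59, attr 55..40, uacc 31..24, lacc 23..16,
    # char 7..0. The zero and reserved fields are left as 0.
    return (plain << 59) | (attr << 40) | (uacc << 24) | (lacc << 16) | char


print(f"{pack_plain1(0, 0, 0, 0x41):#018x}")  # U+0041 fits in the char field

try:
    pack_plain1(0, 0, 0, 0xFF5A)  # U+FF5A needs 16 bits
except ValueError as err:
    print(err)
```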

> All of these fields are actually implementation defined, with just one
> rule for char: don't include characters that can be made with
> combinations, that's what the accent fields are for.

But characters are not necessarily decomposed into one "Latin character"
with one "upper accent" and one "lower accent". E.g., U+01D5 (LATIN CAPITAL
LETTER U WITH DIAERESIS AND MACRON) decomposes to U+0055 U+0308 U+0304
(LATIN CAPITAL LETTER U, COMBINING DIAERESIS, COMBINING MACRON). Both
COMBINING DIAERESIS and COMBINING MACRON are "upper accents".
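
The decomposition can be checked with Python's unicodedata module. Treating canonical combining class 230 ("above") as the equivalent of an "upper accent" is my reading of the proposal, not something it defines.

```python
import unicodedata

# Full canonical decomposition (NFD) of U+01D5.
decomposed = unicodedata.normalize("NFD", "\u01d5")

codes = [f"U+{ord(c):04X}" for c in decomposed]
# Canonical combining class: 0 for the base letter, 230 for marks above.
classes = [unicodedata.combining(c) for c in decomposed]

print(codes)
print(classes)
```

Two of the three code points have class 230, so a single uacc field cannot hold them both without some extra table of combinations.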

Q12: How are you going to deal with a combination of, e.g., a base letter
+ 5 "upper accents" + 3 "lower accents"?

> This allows for 255 upper and lower accents which should be enough --
> for now.

I counted 129 "upper accents". But their codes range from U+0300 to U+1D1AD.

Q13: How are you going to compress these codes into 8 bits? Are you
planning to use a conversion table from the Unicode code to your internal
8-bit code?
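
A rough Python sketch of what such a conversion table would look like. It assumes that "upper accent" means canonical combining class 230, which is my approximation, and the count it produces depends on the Unicode version bundled with the Python in use, so it will not match the figure above.

```python
import sys
import unicodedata

# Every code point whose canonical combining class is 230 ("above").
above = [cp for cp in range(sys.maxunicode + 1)
         if unicodedata.combining(chr(cp)) == 230]

# Hypothetical internal codes: 0 is reserved for "no accent".
to_index = {cp: i + 1 for i, cp in enumerate(above)}

print(len(above), "marks, from", f"U+{above[0]:04X}", "to", f"U+{above[-1]:04X}")
print("fits in an 8-bit field:", len(to_index) < 256)
```

Even where the single marks fit, the table covers one mark per field only; sequences of several marks above would need separate entries for each combination.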

> For Han characters I thought of the following encoding method (with no
> particular plain in mind):
> 
> bits 63..59  58..56 55..40 39..32  31 ..0
>      +-------+------+------+-------+------------+
>      | plain | zero | attr | style |    char    |
>      +-------+------+------+-------+------------+
> 
> * Plain   Plain                (5 bits)
> * Zero    Zero bits            (3 bits)
> * Attr    Attributes           (16 bits)
> * Style   Stylistic Variation  (8 bits)

Q14: What kind of information is in field "Style"?

Q15: Why do only Han characters have 

Re: Internal Representation of Unicode

2003-09-26 Thread Peter Kirk
On 25/09/2003 20:52, [EMAIL PROTECTED] wrote:

> Hi,
>
> John Cowan writes:
> > The problem is that multiple accents above are quite common -- Vietnamese
> > depends on them heavily.  There may also be multiple accents below,
> > for all I know.
>
> That does not have to be a problem, as long as there are no more than
> 255 accents and combinations of them.  As for vietnamese, I just don't
> know how many there are, or how many characters they use.
>
> Johann

 

In Hebrew there are more than 255 accents and combinations of them, if 
you count vowel points, dagesh, shin and sin dots as accents. There are 
potentially many thousands of combinations. From a quick search, 
something like 700-800 are in actual use in the Hebrew Bible.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: Internal Representation of Unicode

2003-09-26 Thread jameskass
.
Jóhann Gunnar Óskarsson wrote,

> That does not have to be a problem, as long as there are no more than
> 255 accents and combinations of them.  As for vietnamese, I just don't
> know how many there are, or how many characters they use.

The Combining Diacritical Marks range of Unicode 4.0 lists 107
combining marks which can be used in any combination.  Some
combining marks are supposed to span two base characters.

Peter Constable (IIRC) reported on this list a while ago that there was
a Latin-based writing system used for an indigenous South American
language which stacks up to three marks above.

Best regards,

James Kass
.



Re: Internal Representation of Unicode

2003-09-25 Thread Doug Ewell
Johann  wrote:

> That does not have to be a problem, as long as there are no more than
> 255 accents and combinations of them.  As for vietnamese, I just don't
> know how many there are, or how many characters they use.

You'll need UTF-8 and a fairly comprehensive font to read the following.

For Vietnamese, you should count on supporting the following vowels:

a à ả ã á ạ ă ằ ẳ ẵ ắ ặ â ầ ẩ ẫ ấ ậ e è ẻ ẽ é ẹ 
ê ề ể ễ ế ệ i ì ỉ ĩ í ị
o ò ỏ õ ó ọ ô ồ ổ ỗ ố ộ ơ ờ ở ỡ ớ ợ u ù ủ ũ ú ụ ư 
ừ ử ữ ứ ự y ỳ ỷ ỹ ý ỵ

the following consonant (in addition to most other English consonants):

đ

and this currency sign:

₫

For purposes of your mechanism, you can think of each vowel as having up
to 2 accents: (upper, right-attached, or none) plus (upper, lower, or
none).  The way Vietnamese think of it is that the circumflex, breve,
and horn are part of the base letter (making a total of 12 base vowels),
whereas the grave, hook above, tilde, acute, and dot below are
considered diacritics.  Each base vowel takes one of those five marks or
none, giving 6 × 12 = 72 total vowels.  All combinations are possible.
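
The count can be reproduced with a short Python sketch that builds the lowercase vowels from the 12 base letters and the tone marks, then composes each pair with NFC. The constant names are only illustrative.

```python
import unicodedata

# Twelve base vowels: a, a-breve, a-circumflex, e, e-circumflex, i,
# o, o-circumflex, o-horn, u, u-horn, y.
BASE_VOWELS = "a\u0103\u00e2e\u00eaio\u00f4\u01a1u\u01b0y"

# No mark, grave, hook above, tilde, acute, dot below.
TONE_MARKS = ["", "\u0300", "\u0309", "\u0303", "\u0301", "\u0323"]

# NFC composes each base + mark pair into a single precomposed letter.
vowels = [unicodedata.normalize("NFC", base + mark)
          for base in BASE_VOWELS for mark in TONE_MARKS]

print(len(vowels), "".join(vowels))
```

Every one of the 72 results is a single code point, so Unicode already has a precomposed letter for each combination.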

Of course, all of the letters (not the dong sign) come in both uppercase
and lowercase.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Re: Internal Representation of Unicode

2003-09-25 Thread myrkraverk
Hi,

John Cowan writes:
 > The problem is that multiple accents above are quite common -- Vietnamese
 > depends on them heavily.  There may also be multiple accents below,
 > for all I know.

That does not have to be a problem, as long as there are no more than
255 accents and combinations of them.  As for vietnamese, I just don't
know how many there are, or how many characters they use.


Johann

-- 
Emacs is not a text editor -- it's a way of life




Re: Internal Representation of Unicode

2003-09-25 Thread John Cowan
[EMAIL PROTECTED] scripsit:

> All of these fields are actually implementation defined, with just one
> rule for char: don't include characters that can be made with
> combinations, that's what the accent fields are for.  This allows for
> 255 upper and lower accents which should be enough -- for now.

The problem is that multiple accents above are quite common -- Vietnamese
depends on them heavily.  There may also be multiple accents below,
for all I know.

-- 
John Cowan  http://www.ccil.org/~cowan  [EMAIL PROTECTED]
Be yourself.  Especially do not feign a working knowledge of RDF where
no such knowledge exists.  Neither be cynical about RELAX NG; for in
the face of all aridity and disenchantment in the world of markup,
James Clark is as perennial as the grass.  --DeXiderata, Sean McGrath