RE: Speaking of Plane 1 characters...

2002-11-12 Thread Dominikus Scherkl
Hi!

For those of you who _are_ programmers (or at least
know a little C), there is a somewhat easier formula
to convert between UTF-16 and UTF-32 for Plane 1 and above
(the offset 0x10000 in the high surrogate can be shifted
in advance and folded into the constant term, since
0xD800 - (0x10000 >> 10) = 0xD7C0):

utf16high = 0xD7C0 + (utf32 >> 10);
utf16low  = 0xDC00 + (utf32 & 1023);

this is very easy to invert:

utf32 = ((utf16high - 0xD7C0) << 10) + (utf16low & 1023);

Here utf16high and utf16low are 16-bit surrogates, and
utf32 is of course a 32-bit value.
The bit-shift operators >> and << can be replaced
by ordinary division or multiplication by powers of two, and
the bitwise AND (&) is equivalent to a modulo operation.
But that is slower (relevant only for really high-speed
converters ;-).
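
As a minimal sketch, the two formulas can be wrapped in a pair of C functions. The function names and the range checks via assert are assumptions added for illustration and are not part of the formulas themselves; the mask 0x3FF is the same value as decimal 1023.

```c
#include <assert.h>
#include <stdint.h>

/* Split a supplementary code point (0x10000..0x10FFFF) into a surrogate pair. */
void utf32_to_utf16(uint32_t utf32, uint16_t *high, uint16_t *low)
{
    assert(utf32 >= 0x10000 && utf32 <= 0x10FFFF);
    /* 0xD7C0 is 0xD800 with the pre-shifted offset (0x10000 >> 10) removed. */
    *high = (uint16_t)(0xD7C0 + (utf32 >> 10));
    *low  = (uint16_t)(0xDC00 + (utf32 & 0x3FF));
}

/* Combine a high and a low surrogate back into a 32-bit code point. */
uint32_t utf16_to_utf32(uint16_t high, uint16_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return (((uint32_t)high - 0xD7C0) << 10) + (low & 0x3FF);
}
```

For U+10312 this gives the pair D800 DF12, and feeding that pair to the second function returns 0x10312 again.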

Best regards.
-- 
Dominikus Scherkl
[EMAIL PROTECTED]




Re: Speaking of Plane 1 characters...

2002-11-12 Thread Andrew C. West
On Tue, 12 Nov 2002 06:13:07 -0800 (PST), John Cowan wrote:

> The Right Thing in HTML terms is to say #&x10312; and *not* use the
> surrogate pair representation.

Or #&66322;
Or #&55296;#&57106;
Or #&xD800;#&xDF12;

(where I've followed John in deliberately reversing the ampersand and the hash
to stop them being converted)
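
A small C sketch of producing the hexadecimal and decimal forms from a scalar value follows. The helper names format_hex_ncr and format_dec_ncr are hypothetical. As an assumption in line with John's remark, they refuse surrogate code points, because a numeric character reference is meant to name the scalar value itself rather than its UTF-16 code units.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: writes a reference such as "&#x10312;" into buf.
   Returns the value of snprintf (the length that was or would be written). */
int format_hex_ncr(uint32_t scalar, char *buf, size_t size)
{
    /* Only scalar values are accepted: in range and not a surrogate. */
    assert(scalar <= 0x10FFFF && !(scalar >= 0xD800 && scalar <= 0xDFFF));
    return snprintf(buf, size, "&#x%lX;", (unsigned long)scalar);
}

/* Hypothetical helper: the same reference in decimal, such as "&#66322;". */
int format_dec_ncr(uint32_t scalar, char *buf, size_t size)
{
    assert(scalar <= 0x10FFFF && !(scalar >= 0xD800 && scalar <= 0xDFFF));
    return snprintf(buf, size, "&#%lu;", (unsigned long)scalar);
}
```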

Andrew 




Re: Speaking of Plane 1 characters...

2002-11-12 Thread Doug Ewell
Dominikus Scherkl <Dominikus dot Scherkl at glueckkanja dot com> wrote:

> utf16high = 0xD7C0 + (utf32 >> 10);
> utf16low  = 0xDC00 + (utf32 & 1023);
>
> this is very easy to invert:
>
> utf32 = ((utf16high - 0xD7C0) << 10) + (utf16low & 1023);

This is good, but I'd write hexadecimal 0x3FF instead of decimal 1023,
as it shows the purpose of the bitmask a little more clearly.
(Apologies to Karl Pentzlin, if he is still on this list.)

-Doug Ewell
 Fullerton, California





Speaking of Plane 1 characters...

2002-11-11 Thread John Hudson
One of the tools I use for building fonts requires that codepoints for 
Plane 1 characters be expressed as surrogate pairs, rather than as scalar 
values. I'm hoping this will change on the next release, since the scalar 
values are a lot easier to work with, but in the meantime I need to figure 
out the easiest way to find the correct surrogate pair values for any given 
scalar value. Is there a comprehensive list somewhere, or an easy 
algorithm (easy for a non-programmer)? How about a web-based form, into 
which someone could enter scalar values and receive back surrogate pairs?

John Hudson

Tiro Typeworks		www.tiro.com
Vancouver, BC		[EMAIL PROTECTED]

It is necessary that by all means and cunning,
the cursed owners of books should be persuaded
to make them available to us, either by argument
or by force.  - Michael Apostolis, 1467




Re: Speaking of Plane 1 characters...

2002-11-11 Thread John Cowan
John Hudson scripsit:
 
> One of the tools I use for building fonts requires that codepoints for
> Plane 1 characters be expressed as surrogate pairs, rather than as scalar
> values. I'm hoping this will change on the next release, since the scalar
> I need to figure
> out the easiest way to find the correct surrogate pair values for any given
> scalar value.

If you have access to any Windows box, you can use the Windows Calculator
(Start/Programs/Accessories/Calculator).  Choose View/Scientific and
click on the Hex radio button.  Then enter your 5-digit Unicode scalar value.
(You must type hex digits in lower case.)  To get the high surrogate, type:

- 1 0 0 0 0 = / 4 0 0 + d 8 0 0 =

To get the low surrogate, enter the scalar value again and type:

- 1 0 0 0 0 = % 4 0 0 + d c 0 0 =

You can also use the mouse, in which case % above represents the MOD key.

On *ix systems, use the bc command; type obase=16 and ibase=16.
For this program, you must use capital letters for the hex digits.
To get the high surrogate, type (x-10000)/400+DC00 for the high
surrogate (x is the scalar value); to get the low surrogate,
type (x-10000)%400+DC00.
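
The same arithmetic, written with plain division and remainder as in the calculator recipes, can be sketched in C. The function names are hypothetical; the sketch assumes base 0xD800 for the high surrogate and base 0xDC00 for the low surrogate.

```c
#include <assert.h>
#include <stdint.h>

/* High surrogate: subtract 0x10000, divide by 0x400, add the base 0xD800. */
uint16_t high_surrogate(uint32_t scalar)
{
    assert(scalar >= 0x10000 && scalar <= 0x10FFFF);
    return (uint16_t)((scalar - 0x10000) / 0x400 + 0xD800);
}

/* Low surrogate: subtract 0x10000, take the remainder mod 0x400, add 0xDC00. */
uint16_t low_surrogate(uint32_t scalar)
{
    assert(scalar >= 0x10000 && scalar <= 0x10FFFF);
    return (uint16_t)((scalar - 0x10000) % 0x400 + 0xDC00);
}
```

For the scalar value 10312 (hex), the two functions return D800 and DF12.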

On the Macintosh, I have no clue.

-- 
John Cowan   [EMAIL PROTECTED]
You need a change: try Canada  You need a change: try China
--fortune cookies opened by a couple that I know




Re: Speaking of Plane 1 characters...

2002-11-11 Thread John Hudson
Many thanks to the various people who recommended Michael Kaplan's 
calculator at http://trigeminal.com/16to32AndBack.asp

This is excellent and solves my problem.

John Hudson





Re: Speaking of Plane 1 characters...

2002-11-11 Thread Tom Gewecke

> On the Macintosh, I have no clue.

On Mac OS X, the Character Palette or the add-on UnicodeChecker will give
the surrogates for any given codepoint.

For a web page that calculates both ways, see

http://www.trigeminal.com/16to32AndBack.asp






Re: Speaking of Plane 1 characters...

2002-11-11 Thread Michael Everson
At 13:55 -0700 2002-11-11, Tom Gewecke wrote:

>> On the Macintosh, I have no clue.
>
> On Mac OS X, the Character Palette or the add-on UnicodeChecker will give
> the surrogates for any given codepoint.


If you can get it to work. It still breaks for me so constantly I 
don't even try to use it. :-(
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: Speaking of Plane 1 characters...

2002-11-11 Thread Michael Everson
At 13:11 -0800 2002-11-11, Michael \(michka\) Kaplan wrote:


>> Perhaps it is just me, but terms like "scalar value" just don't mean
>> anything to me. It rather reminds me of reptilian skin shedding.
>
> Since I do not use that term on my site, I assume you are referring to
> someone else's resource? :-)

It was related to this thread but in a previous post. Nevertheless a 
little gentle user-friendliness on your page would help me to use it 
more easily. Just a teensy tutorialette and a weensy example at the 
top? A little hand-holding?

>> I visited MichKa's page and tried typing in 10312 (OLD ITALIC LETTER
>> KU) and it did convert to a surrogate pair. I wonder what would
>> happen if I pasted it into an HTML document. Hmm but I couldn't do
>> that until I converted them to UTF-8
>
> Well, since the page advertises itself as a UTF-16/UTF-32 sort of converter,
> I would hope that the lack of UTF-8 byte conversion would be expected.

Gee, what I really need is a UTF-8/UTF-16/UTF-32 sort of converter 
that handles surrogates ;-) There isn't such a thing and there 
ought to be. :-)

>> By the way MichKa if you make the boxes a bit wider the whole string
>> of numbers would display.
>
> What numbers did not display for you? They all fit for me


The surrogate pair shows three digits and a tiny little popup 
triangle to tell you that there's a fourth digit. If you need to I 
can send you a screenshot.
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: Speaking of Plane 1 characters...

2002-11-11 Thread Michael \(michka\) Kaplan
From: Michael Everson [EMAIL PROTECTED]
> At 12:10 -0700 2002-11-11, John Hudson wrote:
>
>> Many thanks to the various people who recommended Michael Kaplan's
>> calculator at http://trigeminal.com/16to32AndBack.asp
>>
>> This is excellent and solves my problem.

Glad you like it, John -- I am sure James Kass remembers when I put it up,
it was actually because of a complaint that there wasn't such a thing and
there ought to be. <grin>

> Perhaps it is just me, but terms like "scalar value" just don't mean
> anything to me. It rather reminds me of reptilian skin shedding.

Since I do not use that term on my site, I assume you are referring to
someone else's resource? :-)

> I visited MichKa's page and tried typing in 10312 (OLD ITALIC LETTER
> KU) and it did convert to a surrogate pair. I wonder what would
> happen if I pasted it into an HTML document. Hmm but I couldn't do
> that until I converted them to UTF-8

Well, since the page advertises itself as a UTF-16/UTF-32 sort of converter,
I would hope that the lack of UTF-8 byte conversion would be expected.

> By the way MichKa if you make the boxes a bit wider the whole string
> of numbers would display.

What numbers did not display for you? They all fit for me

MichKa





Re: Speaking of Plane 1 characters...

2002-11-11 Thread John Hudson
At 13:50 11/11/2002, Michael Everson wrote:


> By the way MichKa if you make the boxes a bit wider the whole string of
> numbers would display.

I noticed the same problem in Opera. It's okay in IE.

John Hudson






Re: Speaking of Plane 1 characters...

2002-11-11 Thread John Cowan
Michael Everson scripsit:

> Perhaps it is just me, but terms like "scalar value" just don't mean
> anything to me. It rather reminds me of reptilian skin shedding.

The scale in question is analogous to a temperature scale, not a
reptilian one.

> I visited MichKa's page and tried typing in 10312 (OLD ITALIC LETTER
> KU) and it did convert to a surrogate pair. I wonder what would
> happen if I pasted it into an HTML document. Hmm but I couldn't do
> that until I converted them to UTF-8

The Right Thing in HTML terms is to say #&x10312; and *not* use the
surrogate pair representation.

-- 
Deshil Holles eamus.  Deshil Holles eamus.  Deshil Holles eamus.
Send us, bright one, light one, Horhorn, quickening, and wombfruit. (3x)
Hoopsa, boyaboy, hoopsa!  Hoopsa, boyaboy, hoopsa!  Hoopsa, boyaboy, hoopsa!
  -- Joyce, _Ulysses_, Oxen of the Sun   [EMAIL PROTECTED]




Re: Speaking of Plane 1 characters...

2002-11-11 Thread John Colby
At 13:18 11/11/2002 -0700, John Hudson wrote:


> At 13:50 11/11/2002, Michael Everson wrote:
>
>> By the way MichKa if you make the boxes a bit wider the whole string of
>> numbers would display.
>
> I noticed the same problem in Opera. It's okay in IE.

That's the default font size mismatch - IE does things differently (they 
would!). Do they fit in Mozilla and Phoenix?

John


Re: Speaking of Plane 1 characters...

2002-11-11 Thread Michael Everson
At 13:20 -0800 2002-11-11, Mark Davis wrote:

> If you look at http://www.macchiato.com/ under Unicode Charts, you can type
> in the code point (scalar value) for a character, then Enter, and you will
> get a chart. The UTF-8, 16, and 32 numbers are given in the chart for each
> value.

Why do you call it a "scalar value" if it is really a "code point"? I 
thought it was bad enough Unicode calls it "code point" while 10646 
calls it "code position".

For the Terminology Police,
--
Michael Everson * * Everson Typography *  * http://www.evertype.com



Re: Speaking of Plane 1 characters...

2002-11-11 Thread Michael \(michka\) Kaplan
From: John Hudson [EMAIL PROTECTED]

> At 13:50 11/11/2002, Michael Everson wrote:
>
>> By the way MichKa if you make the boxes a bit wider the whole string of
>> numbers would display.
>
> I noticed the same problem in Opera. It's okay in IE.

Ah, if I called *that* by design, someone might accuse me of global
conspiracy. :-)

Never mind, it wasn't that funny. I went ahead and updated the page, it
should work well in Opera Compatibility mode. <g,d&r>

Michael, in answer to your request for a UTF-8 converter, that will have to
be another day (it's a bit more complicated, and I spend most of my time in
UTF-16 and UTF-32 so I can't really pretend it's work related). If you wanted
to provide the code in VBScript or JScript I will add it to the page (and
give you credit, of course).

MichKa





Re: Speaking of Plane 1 characters...

2002-11-11 Thread Michael Everson
At 13:34 -0800 2002-11-11, Michael \(michka\) Kaplan wrote:


> Michael, in answer to your request for a UTF-8 converter, that will
> have to be another day (it's a bit more complicated, and I spend most
> of my time in UTF-16 and UTF-32 so I can't really pretend it's work
> related). If you wanted to provide the code in VBScript or JScript I
> will add it to the page (and give you credit, of course).

Sir, you mistake me for a programmer! :-)
--
Michael Everson * * Everson Typography *  * http://www.evertype.com




Re: Speaking of Plane 1 characters...

2002-11-11 Thread John Cowan
Michael Everson scripsit:

>> The scale in question is analogous to a temperature scale, not a
>> reptilian one.
>
> Now I very *seriously* don't get it.

A temperature scale enumerates the degrees -273, -272, -271, ..., 0, 1, 2, ...
in order.  When you ask "What is the temperature?", you are actually asking
"What is the scalar value of the temperature?"

The Unicode scale enumerates the characters 0, 1, 2, ... 10FFFF.  Unicode
scalar values are points on this scale, just as temperature scalar values
are points on the (Celsius) temperature scale.

-- 
Winter:  MIT,   John Cowan
Keio, INRIA,[EMAIL PROTECTED]
Issue lots of Drafts.   http://www.ccil.org/~cowan
So much more to understand! http://www.reutershealth.com
Might simplicity return?(A tanka, or extended haiku)




Re: Speaking of Plane 1 characters...

2002-11-11 Thread Mark Davis
According to the new 4.0 definitions:

- code points go from 0..10FFFF, inclusive
- scalar value == non-surrogate code point, so they are simply a
restriction of code points to the ranges 0..D7FF, E000..10FFFF

Since surrogate code points can never represent characters, for a given
character you can refer to its code point or to its scalar value; in
that circumstance there is no effective difference in the terms.
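
These definitions translate directly into range checks. A minimal C sketch follows; all function names are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Code points: 0..10FFFF inclusive. */
bool is_code_point(uint32_t v)
{
    return v <= 0x10FFFF;
}

/* Surrogate code points: D800..DFFF. */
bool is_surrogate(uint32_t v)
{
    return v >= 0xD800 && v <= 0xDFFF;
}

/* Scalar values: code points that are not surrogates,
   that is 0..D7FF and E000..10FFFF. */
bool is_scalar_value(uint32_t v)
{
    return is_code_point(v) && !is_surrogate(v);
}

/* Pass a value through unchanged, aborting if it is not a scalar value. */
uint32_t require_scalar_value(uint32_t v)
{
    assert(is_scalar_value(v));
    return v;
}
```

Any value that denotes a character passes both is_code_point and is_scalar_value, which is why the two terms are interchangeable in that case.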

Mark
__
http://www.macchiato.com
►  “Eppur si muove” ◄

- Original Message -
From: Michael Everson [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Sent: Monday, November 11, 2002 13:37
Subject: Re: Speaking of Plane 1 characters...


> At 13:20 -0800 2002-11-11, Mark Davis wrote:
>> If you look at http://www.macchiato.com/ under Unicode Charts, you can type
>> in the code point (scalar value) for a character, then Enter, and you will
>> get a chart. The UTF-8, 16, and 32 numbers are given in the chart for each
>> value.
>
> Why do you call it a "scalar value" if it is really a "code point"? I
> thought it was bad enough Unicode calls it "code point" while 10646
> calls it "code position".
>
> For the Terminology Police,
> --
> Michael Everson * * Everson Typography *  * http://www.evertype.com







Re: Speaking of Plane 1 characters...

2002-11-11 Thread Barry Caplan
At 05:47 PM 11/11/2002 -0500, John Cowan wrote:
> Michael Everson scripsit:
>
>>> The scale in question is analogous to a temperature scale, not a
>>> reptilian one.
>>
>> Now I very *seriously* don't get it.
>
> A temperature scale enumerates the degrees -273, -272, -271, ..., 0, 1, 2, ...
> in order.  When you ask "What is the temperature?", you are actually asking
> "What is the scalar value of the temperature?"
>
> The Unicode scale enumerates the characters 0, 1, 2, ... 10FFFF.  Unicode
> scalar values are points on this scale, just as temperature scalar values
> are points on the (Celsius) temperature scale.

Well, not exactly... temperature is an arbitrary but standard measure of a continuous 
physical property. The multiple well-known scales attest to that. But code points are 
absolute points, not continuous. And a character having a greater encoding 
value does not make it greater in any useful sense. 

Basically, we are talking about continuous ordinal scales vs discrete cardinal scales. 
Hardly analogous at all IMM.

Barry Caplan
www.i18n.com






Re: Speaking of Plane 1 characters...

2002-11-11 Thread Jungshik Shin


On Mon, 11 Nov 2002, John Cowan wrote:

> On *ix systems, use the bc command; type obase=16 and ibase=16.

  Thank you for this. I should have read the man page of bc more
carefully. (or I used to know it but forgot...)

> For this program, you must use capital letters for the hex digits.
> To get the high surrogate, type (x-10000)/400+DC00 for the high

  s/DC00/D800/

> surrogate (x is the scalar value); to get the low surrogate,
> type (x-10000)%400+DC00.

And one can define a function

> On the Macintosh, I have no clue.

  As you know so well,  MacOS X is a Unix and 'bc' should be available
there, too.  If not by default, one can certainly grab the source and
compile it or get a precompiled binary somewhere.

  It seems to me a waste of bandwidth (however abundant it may have
become recently; I have heard several times on this list that it is not in a
certain country in Europe ;-) ) to go all the way across the Atlantic or
the continent to convert between UCVs and surrogate pairs.  There are
several ways to do it locally, including the two suggested above. On *nix,
including Mac OS X (http://developer.apple.com/internet/macosx/perl.html),
one can open up a small terminal window (yes, Mac OS X has a
terminal window!) and run a script like the following (assuming Perl
is installed; if a GUI is desired, make one up in Perl/Tk, Tcl/Tk,
pdksh, Python+Tk, ...). This should also work in a Windows command
prompt. Alternatively, I guess a local HTML file with ECMAScript should
also work.

----Cut-here----
#!/usr/bin/perl -w
# use the full path of your perl binary in place of /usr/bin/perl

$| = 1;   # unbuffer STDOUT so the prompt appears before we read input

while ( 1 ) {
  print "** Enter Unicode code point in hexadecimal \n" .
        "   (to end, press [enter]) : ";
  $ucs = <STDIN>;
  last unless defined $ucs;   # also stop at end of input
  chomp $ucs;

  last if $ucs eq "";

  if ( $ucs =~ /[^a-f0-9A-F]/ ) {
    printf "  Error: %s is invalid. Try again\n", $ucs;
    next;
  }

  $usv = hex $ucs;
  if ( 0xFFFF < $usv && $usv < 0x110000 ) {
    # supplementary planes: print high and low surrogate
    printf "UTF-16: %04x %04x\n", int(($usv - 0x10000) / 0x400) + 0xd800,
                                   ($usv - 0x10000) % 0x400 + 0xdc00;
  }
  elsif ( $usv < 0xd800 || 0xdfff < $usv && $usv < 0x10000 ) {
    # BMP outside the surrogate range: a single 16-bit unit
    printf "UTF-16: %04x\n", $usv;
  }
  else {
    printf "Your input %s is not valid. Try again\n", $ucs;
  }
}

print "Bye !!\n";
----Cut-here----

  Jungshik





Re: Speaking of Plane 1 characters...

2002-11-11 Thread Markus Scherer
Michael (michka) Kaplan wrote:

> Michael, in answer to your request for a UTF-8 converter, that will have to
> be another day (it's a bit more complicated, and I spend most of my time in
> UTF-16 and UTF-32 so I can't really pretend it's work related). If you wanted
> to provide the code in VBScript or JScript I will add it to the page (and
> give you credit, of course).


Mark has it all in his UTF Converter and Charts at http://www.macchiato.com/unicode/convert.html
markus