Re: Speaking of Plane 1 characters...
Dominikus Scherkl wrote:

> utf16high = 0xD7C0 + (utf32 >> 10);
> utf16low = 0xDC00 + (utf32 & 1023);
>
> this is very easy to invert:
>
> utf32 = ((utf16high - 0xD7C0) <<10) + (utf16low & 1023);

This is good, but I'd write hexadecimal 0x3FF instead of decimal 1023, as it shows the purpose of the bitmask a little more clearly.

(Apologies to Karl Pentzlin, if he is still on this list.)

-Doug Ewell
Fullerton, California
Re: Speaking of Plane 1 characters...
On Tue, 12 Nov 2002 06:13:07 -0800 (PST), John Cowan wrote: > The Right Thing in HTML terms is to say #&x10312; and *not* use the > surrogate pair representation. > Or #&66322; Or #&55296;#&57106; Or #&xD800;#&xDF12; (where I've followed John in deliberately reversing the ampersand and the hash to stop them being converted) Andrew
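The four spellings listed in this message can be produced mechanically. Below is a minimal Python sketch; the helper name `ncr_forms` and its dictionary keys are invented for illustration and do not come from the thread. The surrogate-pair forms are included only to reproduce the list, since the quoted advice is not to use them in HTML.

```python
def ncr_forms(cp):
    """Return the numeric character reference spellings for a
    supplementary-plane code point (hypothetical helper)."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("expected a code point in 10000..10FFFF")
    v = cp - 0x10000
    hi = 0xD800 + (v >> 10)      # high (leading) surrogate
    lo = 0xDC00 + (v & 0x3FF)    # low (trailing) surrogate
    return {
        "hex": "&#x%X;" % cp,
        "decimal": "&#%d;" % cp,
        "surrogates_decimal": "&#%d;&#%d;" % (hi, lo),
        "surrogates_hex": "&#x%X;&#x%X;" % (hi, lo),
    }


forms = ncr_forms(0x10312)   # OLD ITALIC LETTER KU
for name in ("hex", "decimal", "surrogates_decimal", "surrogates_hex"):
    print(name, forms[name])
```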
RE: Speaking of Plane 1 characters...
Hi!

For those of you who _are_ programmers (or at least know a little C), there is a somewhat easier formula to convert between UTF-16 and UTF-32 for Plane 1 and above (the offset 0x10000 subtracted for the high surrogate can be shifted once and folded into the constant term, since 0xD800 - (0x10000 >> 10) = 0xD7C0):

  utf16high = 0xD7C0 + (utf32 >> 10);
  utf16low = 0xDC00 + (utf32 & 1023);

This is very easy to invert:

  utf32 = ((utf16high - 0xD7C0) << 10) + (utf16low & 1023);

Here utf16high and utf16low are 16-bit surrogates, and utf32 is of course a 32-bit value. The bit-shift operators >> and << can be replaced by ordinary division or multiplication by a power of two (here 1024), and the bitwise AND with 1023 is equivalent to a modulo-1024 operation. But that is slower (relevant only for really high-speed converters ;-).

Best regards.
-- 
Dominikus Scherkl
[EMAIL PROTECTED]
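The folded-constant formulas above can be tried out directly. This is a rough Python sketch rather than a definitive converter; the function names `to_surrogates` and `from_surrogates` and the range checks are assumptions added for illustration, and the mask is written as 0x3FF (equal to decimal 1023).

```python
def to_surrogates(utf32):
    """Split a code point in 10000..10FFFF into (high, low) surrogates."""
    if not 0x10000 <= utf32 <= 0x10FFFF:
        raise ValueError("only Plane 1 and above need a surrogate pair")
    high = 0xD7C0 + (utf32 >> 10)     # 0xD7C0 == 0xD800 - (0x10000 >> 10)
    low = 0xDC00 + (utf32 & 0x3FF)    # keep the low 10 bits
    return high, low


def from_surrogates(high, low):
    """Inverse of to_surrogates."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return ((high - 0xD7C0) << 10) + (low & 0x3FF)


high, low = to_surrogates(0x10312)
print("%04X %04X" % (high, low))
print("%X" % from_surrogates(high, low))
```

For non-negative integers the shift and mask agree with division and modulo by 1024, so `utf32 // 1024` and `utf32 % 1024` would give the same results.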
Re: Speaking of Plane 1 characters...
Michael Kaplan wrote,

> Glad you like it, John -- I am sure James Kass remembers when I put it up,

Indeed.

John Cowan wrote,

> The Right Thing in HTML terms is to say #&x10312; and *not* use the
> surrogate pair representation.

How about 𐌒 (or xF0,x90,x8C,x92) ?

Tex Texin wrote,

> Hmmm. I just reviewed Andrew's comment that he can get support for
> surrogates via uniscribe on windows 9x.
> I guess I have to think about extending this to include those systems. I
> guess if I get confirmation (or disconfirmation) from John or other
> Microsofties I will update the page accordingly.

Got non-BMP working here on MSIE 5.0 on Win 98. Had to fix the Registry per the note on:
(all on one line:)
http://msdn.microsoft.com/library/default.asp?url=/library/en-us/intl/unicode_192r.asp

Best regards,
James Kass.
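The byte sequence xF0,x90,x8C,x92 mentioned in this message is the UTF-8 form of U+10312. A small Python sketch of how the four bytes come out of the bit layout follows; the function name `utf8_four_bytes` is made up for illustration, and it only handles the four-byte case (code points 10000..10FFFF).

```python
def utf8_four_bytes(cp):
    """Encode a supplementary-plane code point as four UTF-8 bytes."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("four-byte UTF-8 covers 10000..10FFFF only")
    return bytes([
        0xF0 | (cp >> 18),            # lead byte carries the top 3 bits
        0x80 | ((cp >> 12) & 0x3F),   # next 6 bits
        0x80 | ((cp >> 6) & 0x3F),    # next 6 bits
        0x80 | (cp & 0x3F),           # last 6 bits
    ])


encoded = utf8_four_bytes(0x10312)
print(",".join("x%02X" % b for b in encoded))
# Cross-check against Python's built-in encoder.
print(encoded == chr(0x10312).encode("utf-8"))
```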
Re: Speaking of Plane 1 characters...
Michael (michka) Kaplan wrote:

> Michael, in answer to your request for a UTF-8 converter, that will have to be another day (it's a bit more complicated, and I spend most of my time in UTF-16 and UTF-32 so I can't really pretend it's work related). If you wanted to provide the code in VBScript or JScript I will add it to the page (and give you credit, of course).

Mark has it all in his UTF Converter and Charts at
http://www.macchiato.com/unicode/convert.html

markus
Re: Speaking of Plane 1 characters...
On Mon, 11 Nov 2002, John Cowan wrote:

> On *ix systems, use the "bc" command; type "obase=16" and "ibase=16".

Thank you for this. I should have read the man page of bc more carefully. (or I used to know it but forgot...)

> For this program, you must use capital letters for the hex digits.
> To get the high surrogate, type "(x-10000)/400+DC00" for the high

s/DC00/D800/

> surrogate ("x" is the scalar value); to get the low surrogate,
> type "(x-10000)%400+DC00".

And one can define a function

> On the Macintosh, I have no clue.

As you know so well, Mac OS X is a Unix and 'bc' should be available there, too. If not by default, one can certainly grab the source and compile it or get a precompiled binary somewhere.

It seems to me a waste of bandwidth (however abundant it may have become recently. I heard several times on this list that it's not in a certain country in Europe ;-) ) to go all the way across the Atlantic or the continent to convert between UCVs and surrogate pairs. There are several ways to do it locally, including the two suggested above. On *nix including Mac OS X (http://developer.apple.com/internet/macosx/perl.html), one can open up a small terminal window (yes, Mac OS X has a terminal window!) and run a script like the following (assuming Perl is installed. If a GUI is desired, make one up in Perl/Tk, Tcl/Tk, pdksh, Python+Tk?...). This should also work in a command prompt of Windows. Alternatively, I guess a local HTML file with ECMAScript should also work.

-- Cut here --
#!/usr/bin/perl -w
# use the full path of your perl binary in place of /usr/bin/perl
while ( 1 ) {
    print "** Enter Unicode code point in hexadecimal \n" .
          "   (to end, press [enter]) : ";
    $| = 1; # force a flush after our print
    $ucs = <STDIN>;
    last unless defined $ucs;   # stop cleanly at end of input
    chomp $ucs;
    last if $ucs eq "";
    if ( $ucs =~ /[^a-f0-9A-F]/ ) {
        printf "  Error: %s is invalid. Try again\n", $ucs;
        next;
    }
    $usv = hex $ucs;
    if ( 0xFFFF < $usv && $usv < 0x110000 ) {
        # supplementary planes: print the surrogate pair
        printf "UTF-16: %04x %04x\n",
            int(($usv-0x10000) / 0x400) + 0xd800,
            ($usv-0x10000) % 0x400 + 0xdc00;
    } elsif ( $usv < 0xd800 || 0xdfff < $usv && $usv < 0x10000 ) {
        # BMP, excluding the surrogate range itself
        printf "UTF-16: %04x\n", $usv;
    } else {
        printf "Your input %s is not valid. Try again\n", $ucs;
    }
}
print "Bye !!\n";
-- Cut here --

Jungshik
Re: Speaking of Plane 1 characters...
At 05:47 PM 11/11/2002 -0500, John Cowan wrote:
>Michael Everson scripsit:
>
>> >The scale in question is analogous to a temperature scale, not a
>> >reptilian one.
>>
>> Now I very *seriously* don't get it.
>
>A temperature scale enumerates the degrees -273, -272, -271, ..., 0, 1, 2, ...
>in order. When you ask "What is the temperature?", you are actually asking
>"What is the scalar value of the temperature?"
>
>The Unicode scale enumerates the characters 0, 1, 2, ... 10FFFF. Unicode
>scalar values are points on this scale, just as temperature scalar values
>are points on the (Celsius) temperature scale.

Well, not exactly... temperature is an arbitrary but standard measure of a continuous physical property. The multiple well-known scales attest to that.

But code points are absolute points, not continuous. And one character having a greater encoding value does not make it greater in any useful sense.

Basically, we are talking about continuous ordinal scales vs discrete cardinal scales. Hardly analogous at all IMM.

Barry Caplan
www.i18n.com
Re: Speaking of Plane 1 characters...
According to the new 4.0 definitions:

- code points go from 0..10FFFF, inclusive
- "scalar value" == "non-surrogate code point", so they are simply a restriction of code points to the ranges 0..D7FF, E000..10FFFF

Since surrogate code points can never represent characters, for a given character you can refer to "its code point" or to "its scalar value"; in that circumstance there is no effective difference in the terms.

Mark
__
http://www.macchiato.com
► “Eppur si muove” ◄

- Original Message -
From: "Michael Everson" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, November 11, 2002 13:37
Subject: Re: Speaking of Plane 1 characters...

> At 13:20 -0800 2002-11-11, Mark Davis wrote:
> >If you look at http://www.macchiato.com/ under "Unicode Charts", you can type
> >in the code point (scalar value) for a character, then Enter, and you will
> >get a chart. The UTF-8, 16, and 32 numbers are given in the chart for each
> >value.
>
> Why do you call it a scalar value if it is really a code point? I
> thought it was bad enough Unicode calls it code point while 10646
> calls it code position
>
> For the Terminology Police,
> --
> Michael Everson * * Everson Typography * * http://www.evertype.com
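The distinction drawn in this message (code points cover 0..10FFFF; scalar values are the code points outside the surrogate range D800..DFFF) fits in a few lines. A minimal Python sketch, with predicate names chosen here for illustration only:

```python
def is_code_point(n):
    """True for any integer in the Unicode code space 0..10FFFF."""
    return 0 <= n <= 0x10FFFF


def is_surrogate(n):
    """True for surrogate code points D800..DFFF."""
    return 0xD800 <= n <= 0xDFFF


def is_scalar_value(n):
    """True for code points that are not surrogates."""
    return is_code_point(n) and not is_surrogate(n)


for n in (0x41, 0xD7FF, 0xD800, 0xDFFF, 0xE000, 0x10312, 0x10FFFF, 0x110000):
    print("%06X code point=%s scalar value=%s"
          % (n, is_code_point(n), is_scalar_value(n)))
```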
Re: Speaking of Plane 1 characters...
Michael Everson scripsit:

> >The scale in question is analogous to a temperature scale, not a
> >reptilian one.
>
> Now I very *seriously* don't get it.

A temperature scale enumerates the degrees -273, -272, -271, ..., 0, 1, 2, ... in order. When you ask "What is the temperature?", you are actually asking "What is the scalar value of the temperature?"

The Unicode scale enumerates the characters 0, 1, 2, ... 10FFFF. Unicode scalar values are points on this scale, just as temperature scalar values are points on the (Celsius) temperature scale.

-- 
Winter: MIT,                    John Cowan
Keio, INRIA,                    [EMAIL PROTECTED]
Issue lots of Drafts.           http://www.ccil.org/~cowan
So much more to understand!     http://www.reutershealth.com
Might simplicity return?        (A "tanka", or extended haiku)
Re: Speaking of Plane 1 characters...
At 16:31 -0500 2002-11-11, John Cowan wrote:
>Michael Everson scripsit:
>
>> Perhaps it is just me, but terms like scalar value just don't mean
>> anything to me. It rather reminds me of reptilian skin shedding.
>
>The scale in question is analogous to a temperature scale, not a
>reptilian one.

Now I very *seriously* don't get it.
-- 
Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Speaking of Plane 1 characters...
At 13:34 -0800 2002-11-11, Michael \(michka\) Kaplan wrote:
>Michael, in answer to your request for a UTF-8 converter, that will have to be another day (it's a bit more complicated, and I spend most of my time in UTF-16 and UTF-32 so I can't really pretend it's work related). If you wanted to provide the code in VBScript or JScript I will add it to the page (and give you credit, of course).

Sir, you mistake me for a programmer! :-)
-- 
Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Speaking of Plane 1 characters...
From: "John Hudson" <[EMAIL PROTECTED]>

> At 13:50 11/11/2002, Michael Everson wrote:
>
> >By the way MichKa if you make the boxes a bit wider the whole string of
> >numbers would display.
>
> I noticed the same problem in Opera. It's okay in IE.

Ah, if I called *that* by design, someone might accuse me of global conspiracy. :-)

Never mind, it wasn't that funny. I went ahead and updated the page; it should work well in "Opera Compatibility" mode.

Michael, in answer to your request for a UTF-8 converter, that will have to be another day (it's a bit more complicated, and I spend most of my time in UTF-16 and UTF-32 so I can't really pretend it's work related). If you wanted to provide the code in VBScript or JScript I will add it to the page (and give you credit, of course).

MichKa
Re: Speaking of Plane 1 characters...
At 13:20 -0800 2002-11-11, Mark Davis wrote:
>If you look at http://www.macchiato.com/ under "Unicode Charts", you can type
>in the code point (scalar value) for a character, then Enter, and you will
>get a chart. The UTF-8, 16, and 32 numbers are given in the chart for each
>value.

Why do you call it a scalar value if it is really a code point? I thought it was bad enough Unicode calls it code point while 10646 calls it code position

For the Terminology Police,
-- 
Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Speaking of Plane 1 characters...
At 13:18 11/11/2002 -0700, John Hudson wrote:
>At 13:50 11/11/2002, Michael Everson wrote:
>
>>By the way MichKa if you make the boxes a bit wider the whole string of
>>numbers would display.
>
>I noticed the same problem in Opera. It's okay in IE.

That's the default font size mismatch - IE do things differently (they would!). In Mozilla and Phoenix do they fit?

John
Re: Speaking of Plane 1 characters...
Michael Everson scripsit: > Perhaps it is just me, but terms like scalar value just don't mean > anything to me. It rather reminds me of reptilian skin shedding. The scale in question is analogous to a temperature scale, not a reptilian one. > I visited MichKa's page and tried typing in 10312 (OLD ITALIC LETTER > KU) and it did convert to a surrogate pair. I wonder what would > happen if I pasted it into an HTML document. Hmm but I couldn't do > that until I converted them to UTF-8 The Right Thing in HTML terms is to say #&x10312; and *not* use the surrogate pair representation. -- Deshil Holles eamus. Deshil Holles eamus. Deshil Holles eamus. Send us, bright one, light one, Horhorn, quickening, and wombfruit. (3x) Hoopsa, boyaboy, hoopsa! Hoopsa, boyaboy, hoopsa! Hoopsa, boyaboy, hoopsa! -- Joyce, _Ulysses_, "Oxen of the Sun" [EMAIL PROTECTED]
Re: Speaking of Plane 1 characters...
At 13:50 11/11/2002, Michael Everson wrote: By the way MichKa if you make the boxes a bit wider the whole string of numbers would display. I noticed the same problem in Opera. It's okay in IE. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] It is necessary that by all means and cunning, the cursed owners of books should be persuaded to make them available to us, either by argument or by force. - Michael Apostolis, 1467
Re: Speaking of Plane 1 characters...
From: "Michael Everson" <[EMAIL PROTECTED]> > At 12:10 -0700 2002-11-11, John Hudson wrote: > >Many thanks to the various people who recommended Michael Kaplan's > >calculator at http://trigeminal.com/16to32AndBack.asp > > > >This is excellent and solves my problem. Glad you like it, John -- I am sure James Kass remembers when I put it up, it was actually because of a complaint that there wasn't such a thing and there ought to be. > Perhaps it is just me, but terms like scalar value just don't mean > anything to me. It rather reminds me of reptilian skin shedding. Since I do not use that term on my site, I assume you are referring to someone else's resource? :-) > I visited MichKa's page and tried typing in 10312 (OLD ITALIC LETTER > KU) and it did convert to a surrogate pair. I wonder what would > happen if I pasted it into an HTML document. Hmm but I couldn't do > that until I converted them to UTF-8 Well, since the page advertises itself as a UTF-16/UTF-32 sort of converter, I would hope that the lack of UTF-8 byte conversion would be expected. > By the way MichKa if you make the boxes a bit wider the whole string > of numbers would display. What numbers did not display for you? They all fit for me MichKa
Re: Speaking of Plane 1 characters...
If you look at http://www.macchiato.com/ under "Unicode Charts", you can type in the code point (scalar value) for a character, then Enter, and you will get a chart. The UTF-8, 16, and 32 numbers are given in the chart for each value.

Mark
__
http://www.macchiato.com
► “Eppur si muove” ◄

- Original Message -
From: "John Hudson" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Monday, November 11, 2002 10:08
Subject: Speaking of Plane 1 characters...

> One of the tools I use for building fonts requires that codepoints for
> Plane 1 characters be expressed as surrogate pairs, rather than as scalar
> values. I'm hoping this will change on the next release, since the scalar
> values are a lot easier to work with, but in the meantime I need to figure
> out the easiest way to find the correct surrogate pair values for any given
> scalar value. Is there a comprehensive list somewhere, or an easy
> algorithm (easy for a non-programmer)? How about a web-based form, into
> which someone could enter scalar values and receive back surrogate pairs?
>
> John Hudson
>
> Tiro Typeworks www.tiro.com
> Vancouver, BC [EMAIL PROTECTED]
>
> It is necessary that by all means and cunning,
> the cursed owners of books should be persuaded
> to make them available to us, either by argument
> or by force. - Michael Apostolis, 1467
Re: Speaking of Plane 1 characters...
At 13:11 -0800 2002-11-11, Michael \(michka\) Kaplan wrote:
>> Perhaps it is just me, but terms like scalar value just don't mean
>> anything to me. It rather reminds me of reptilian skin shedding.
>
>Since I do not use that term on my site, I assume you are referring to someone else's resource? :-)

It was related to this thread but in a previous post. Nevertheless a little gentle user-friendliness on your page would help me to use it more easily. Just a teensy tutorialette and a weensy example at the top? A little hand-holding?

>> I visited MichKa's page and tried typing in 10312 (OLD ITALIC LETTER KU) and it did convert to a surrogate pair. I wonder what would happen if I pasted it into an HTML document. Hmm but I couldn't do that until I converted them to UTF-8
>
>Well, since the page advertises itself as a UTF-16/UTF-32 sort of converter, I would hope that the lack of UTF-8 byte conversion would be expected.

Gee, what I really need is a UTF-8/UTF-16/UTF-32 sort of converter that handles surrogates ;-) "There isn't such a thing and there ought to be." :-)

>> By the way MichKa if you make the boxes a bit wider the whole string of numbers would display.
>
>What numbers did not display for you? They all fit for me

The surrogate pair shows three digits and a tiny little popup triangle to tell you that there's a fourth digit. If you need to I can send you a screenshot.
-- 
Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Speaking of Plane 1 characters...
At 13:55 -0700 2002-11-11, Tom Gewecke wrote:
>>On the Macintosh, I have no clue.
>
>On Mac OS X, the Character Palette or the add-on UnicodeChecker will give the surrogates for any given codepoint.

If you can get it to work. It still breaks for me so constantly I don't even try to use it. :-(
-- 
Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Speaking of Plane 1 characters...
>On the Macintosh, I have no clue. On Mac OS X, the Character Palette or the add-on UnicodeChecker will give the surrogates for any given codepoint. For a web page that calculates both ways, see http://www.trigeminal.com/16to32AndBack.asp
Re: Speaking of Plane 1 characters...
At 12:10 -0700 2002-11-11, John Hudson wrote:
>Many thanks to the various people who recommended Michael Kaplan's
>calculator at http://trigeminal.com/16to32AndBack.asp
>
>This is excellent and solves my problem.

Perhaps it is just me, but terms like scalar value just don't mean anything to me. It rather reminds me of reptilian skin shedding.

I visited MichKa's page and tried typing in 10312 (OLD ITALIC LETTER KU) and it did convert to a surrogate pair. I wonder what would happen if I pasted it into an HTML document. Hmm but I couldn't do that until I converted them to UTF-8

By the way MichKa if you make the boxes a bit wider the whole string of numbers would display.
-- 
Michael Everson * * Everson Typography * * http://www.evertype.com
Re: Speaking of Plane 1 characters...
Many thanks to the various people who recommended Michael Kaplan's calculator at http://trigeminal.com/16to32AndBack.asp This is excellent and solves my problem. John Hudson Tiro Typeworks www.tiro.com Vancouver, BC [EMAIL PROTECTED] It is necessary that by all means and cunning, the cursed owners of books should be persuaded to make them available to us, either by argument or by force. - Michael Apostolis, 1467
Re: Speaking of Plane 1 characters...
John Hudson scripsit:
>
> One of the tools I use for building fonts requires that codepoints for
> Plane 1 characters be expressed as surrogate pairs, rather than as scalar
> values. I'm hoping this will change on the next release, since the scalar
> values are a lot easier to work with, but in the meantime I need to figure
> out the easiest way to find the correct surrogate pair values for any given
> scalar value.

If you have access to any Windows box, you can use the Windows Calculator (Start/Programs/Accessories/Calculator). Choose View/Scientific and click on the Hex radio button. Then enter your 5-digit Unicode scalar value. (You must type hex digits in lower case.) To get the high surrogate, type:

  - 1 0 0 0 0 = / 4 0 0 + d 8 0 0 =

To get the low surrogate, enter the scalar value again and type:

  - 1 0 0 0 0 = % 4 0 0 + d c 0 0 =

You can also use the mouse, in which case "%" above represents the MOD key.

On *ix systems, use the "bc" command; type "obase=16" and "ibase=16". For this program, you must use capital letters for the hex digits. To get the high surrogate, type "(x-10000)/400+D800" ("x" is the scalar value); to get the low surrogate, type "(x-10000)%400+DC00".

On the Macintosh, I have no clue.

-- 
John Cowan [EMAIL PROTECTED]
"You need a change: try Canada" "You need a change: try China"
--fortune cookies opened by a couple that I know
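The calculator recipe in this message (subtract 10000 hex, then divide or take the remainder by 400 hex, then add D800 or DC00) translates almost keystroke for keystroke into Python. The following is only a sketch of that arithmetic; the function name `surrogate_pair` is an assumption for illustration, and integer division stands in for the calculator's hex-mode division.

```python
def surrogate_pair(x):
    """Return (high, low) surrogates for a scalar value x in 10000..10FFFF,
    using the subtract / divide / remainder steps of the calculator recipe."""
    if not 0x10000 <= x <= 0x10FFFF:
        raise ValueError("scalar value must be in 10000..10FFFF")
    offset = x - 0x10000
    high = offset // 0x400 + 0xD800   # quotient selects the high surrogate
    low = offset % 0x400 + 0xDC00     # remainder selects the low surrogate
    return high, low


for x in (0x10312, 0x1D11E):
    high, low = surrogate_pair(x)
    print("%X -> %04X %04X" % (x, high, low))
```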