viewing jawi proposal

2003-12-19 Thread Hasbullah Bin Pit
Hi all,
I'm new to this mailing list.
Referring to http://www.unicode.org/alloc/Pipeline.html,
several characters have been proposed.
Among them are "Extended Arabic letters (for West African languages, 
Jawi, and Moroccan Arabic)".

Question:

Is there any way for me to view what has been proposed?
I would like to ensure that the Jawi characters are correct.




Re: [hebrew] Re: Aramaic unification and information retrieval

2003-12-19 Thread Peter Kirk
On 19/12/2003 13:39, Philippe Verdy wrote:

Jony Rosenne wrote:
 

Michael Everson
   

Samaritan and Phoenician are not font variants of Hebrew/Square 
Hebrew/Jewish or whatever else you want to call it.
 

But Square Hebrew IS a font variant of Ancient Hebrew or Phoenician or
Canaanite, whatever you want to call it, and so is Samaritan.
   

Do not mix script families (or genetic history) with their actual use.
Each time a script has evolved in a parallel way for other languages,
it has introduced its own distinctive features.
...
 

Do not mix lists. I have replied to this on the Hebrew list.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




RE: [hebrew] Re: Aramaic unification and information retrieval

2003-12-19 Thread Jony Rosenne
So what about Chinese, Japanese and Korean? Was it wrong to unify them?

Jony

>  -Original Message-
> From: Philippe Verdy [mailto:[EMAIL PROTECTED] 
> Sent: Friday, December 19, 2003 11:40 PM
> To:   Jony Rosenne
> Cc:   [EMAIL PROTECTED]
> Subject:  RE: [hebrew] Re: Aramaic unification and 
> information retrieval
> 
> Jony Rosenne wrote:
> > Michael Everson
> > > Samaritan and Phoenician are not font variants of Hebrew/Square 
> > > Hebrew/Jewish or whatever else you want to call it.
> > 
> > But Square Hebrew IS a font variant of Ancient Hebrew or 
> Phoenician or
> > Canaanite, whatever you want to call it, and so is Samaritan.
> 
> Do not mix script families (or genetic history) with their actual use.
> Each time a script has evolved in a parallel way for other languages,
> it has introduced its own distinctive features.
> 
> With your argument, we would have to unify the Latin, Greek and
> Cyrillic scripts, because they have the same origin. Now move on to
> their common Phoenician origin and we would have to unify them with the
> Semitic scripts... What disunified them was the writing direction, which
> was not fixed in early scripts that allowed boustrophedon ordering
> and that had simpler designs with more independent glyphs, and the
> way the various glyphs combine, sometimes creating new letters.
> 
> For me, two scripts that are different enough that a text written
> in one will match only imprecisely in the other, and will hardly be
> recognizable by its readers, are candidates for separate encoding,
> because each starts its own family of supplementary letters specific
> to the families of languages needing these extensions.
> 
> Some of these extensions have no equivalent in the original
> script, and sometimes (often?) their usage starts to split, with
> distinct semantics (see for example the various forms of
> the so-called "Tamazigh" script, which is certainly better
> represented as a family of scripts rather than a single script,
> with as many differences among them as between Greek and
> Cyrillic).
> 
> 

RE: [hebrew] Re: Aramaic unification and information retrieval

2003-12-19 Thread Philippe Verdy
Jony Rosenne wrote:
> Michael Everson
> > Samaritan and Phoenician are not font variants of Hebrew/Square 
> > Hebrew/Jewish or whatever else you want to call it.
> 
> But Square Hebrew IS a font variant of Ancient Hebrew or Phoenician or
> Canaanite, whatever you want to call it, and so is Samaritan.

Do not mix script families (or genetic history) with their actual use.
Each time a script has evolved in a parallel way for other languages,
it has introduced its own distinctive features.

With your argument, we would have to unify the Latin, Greek and
Cyrillic scripts, because they have the same origin. Now move on to
their common Phoenician origin and we would have to unify them with the
Semitic scripts... What disunified them was the writing direction, which
was not fixed in early scripts that allowed boustrophedon ordering
and that had simpler designs with more independent glyphs, and the
way the various glyphs combine, sometimes creating new letters.

For me, two scripts that are different enough that a text written
in one will match only imprecisely in the other, and will hardly be
recognizable by its readers, are candidates for separate encoding,
because each starts its own family of supplementary letters specific
to the families of languages needing these extensions.

Some of these extensions have no equivalent in the original
script, and sometimes (often?) their usage starts to split, with
distinct semantics (see for example the various forms of
the so-called "Tamazigh" script, which is certainly better
represented as a family of scripts rather than a single script,
with as many differences among them as between Greek and
Cyrillic).



Re: [OT] Keyboards (was: American English translation of character names)

2003-12-19 Thread Peter Kirk
On 19/12/2003 12:26, [EMAIL PROTECTED] wrote:

...

IrfanView is freeware for non-commercial use and inexpensive to
register for other purposes.
 

Good point. He might get more registrations if he set up a PayPal 
account like yours, James, and didn't rely on sending cash by 
international mail.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




Re: [OT] Keyboards (was: American English translation of character names)

2003-12-19 Thread jameskass
.
Peter Kirk wrote,

> You don't have to buy it. Use IrfanView, a free download (for Windows) 
> from http://www.irfanview.com/. The screen capture is not at all 
> sophisticated, but it sure beats the print screen key.

And, once the screen has been captured, IrfanView lets the user
"crop" the image as well as save the image in a plethora of
graphic formats.

IrfanView is freeware for non-commercial use and inexpensive to
register for other purposes.

Best regards,

James Kass
.



Re: [OT] Keyboards (was: American English translation of character names)

2003-12-19 Thread Peter Kirk
On 19/12/2003 10:39, Curtis Clark wrote:

...

(Okay, so  is used for "screen capture to clipboard" but who 
needs a button for that?). 


I use it all the time. Saves buying screen capture software.

You don't have to buy it. Use IrfanView, a free download (for Windows) 
from http://www.irfanview.com/. The screen capture is not at all 
sophisticated, but it sure beats the print screen key.

--
Peter Kirk
[EMAIL PROTECTED] (personal)
[EMAIL PROTECTED] (work)
http://www.qaya.org/




johab compound letters reference for Hangul?

2003-12-19 Thread Philippe Verdy
Is there a definitive reference for johab compound composition in Hangul?

As an example, I use these extra decompositions (not defined in Unicode) in my 
application, but I'd like advice notably for KAPYEOUN choseongs: Is it decomposable as 
the horizontal 114C (so that it stacks below), or as the vertical 110B (which normally 
stacks side-by-side)?

My set of overridden UnicodeData lines for CHOSEONGs is as below:

(...)
# add canonical de/recomposition of "Johab" compound Hangul jamos
#1100;HANGUL CHOSEONG KIYEOK;Lo;0;L;N;;g *;;;
1101;HANGUL CHOSEONG SSANGKIYEOK;Lo;0;L; 1100 1100N;;gg *;;;
#1102;HANGUL CHOSEONG NIEUN;Lo;0;L;N;;n *;;;
1113;HANGUL CHOSEONG NIEUN-KIYEOK;Lo;0;L; 1102 1100N;;ng *;;;
1114;HANGUL CHOSEONG SSANGNIEUN;Lo;0;L; 1102 1102N;;nn *;;;
1115;HANGUL CHOSEONG NIEUN-TIKEUT;Lo;0;L; 1102 1103N;;nd *;;;
1116;HANGUL CHOSEONG NIEUN-PIEUP;Lo;0;L; 1102 1107N;;nb *;;;
#1103;HANGUL CHOSEONG TIKEUT;Lo;0;L;N;;d *;;;
1117;HANGUL CHOSEONG TIKEUT-KIYEOK;Lo;0;L; 1103 1100N;;dg *;;;
1104;HANGUL CHOSEONG SSANGTIKEUT;Lo;0;L; 1103 1103N;;dd *;;;
#1105;HANGUL CHOSEONG RIEUL;Lo;0;L;N;;r *;;;
1118;HANGUL CHOSEONG RIEUL-NIEUN;Lo;0;L; 1105 1102N;;rn *;;;
1119;HANGUL CHOSEONG SSANGRIEUL;Lo;0;L; 1105 1105N;;rr *;;;
111A;HANGUL CHOSEONG RIEUL-HIEUH;Lo;0;L; 1105 1112N;;rh *;;;
111B;HANGUL CHOSEONG KAPYEOUNRIEUL;Lo;0;L; 1105 114CN;;rq *;;;
#1106;HANGUL CHOSEONG MIEUM;Lo;0;L;N;;m *;;;
111C;HANGUL CHOSEONG MIEUM-PIEUP;Lo;0;L; 1106 1107N;;mb *;;;
111D;HANGUL CHOSEONG KAPYEOUNMIEUM;Lo;0;L; 1106 114CN;;mq *;;;
#1107;HANGUL CHOSEONG PIEUP;Lo;0;L;N;;b *;;;
111E;HANGUL CHOSEONG PIEUP-KIYEOK;Lo;0;L; 1107 1100N;;bg *;;;
111F;HANGUL CHOSEONG PIEUP-NIEUN;Lo;0;L; 1107 1102N;;bn *;;;
1120;HANGUL CHOSEONG PIEUP-TIKEUT;Lo;0;L; 1107 1103N;;bd *;;;
1108;HANGUL CHOSEONG SSANGPIEUP;Lo;0;L; 1107 1107N;;bb *;;;
112C;HANGUL CHOSEONG KAPYEOUNSSANGPIEUP;Lo;0;L; 1108 114CN;;bbq *;;;
1121;HANGUL CHOSEONG PIEUP-SIOS;Lo;0;L; 1107 1109N;;bs *;;;
1122;HANGUL CHOSEONG PIEUP-SIOS-KIYEOK;Lo;0;L; 1107 1109 1100N;;bsg *;;;
1123;HANGUL CHOSEONG PIEUP-SIOS-TIKEUT;Lo;0;L; 1107 1109 1103N;;bsd *;;;
1124;HANGUL CHOSEONG PIEUP-SIOS-PIEUP;Lo;0;L; 1107 1109 1107N;;bsb *;;;
1125;HANGUL CHOSEONG PIEUP-SSANGSIOS;Lo;0;L; 1107 110AN;;bss *;;;
1126;HANGUL CHOSEONG PIEUP-SIOS-CIEUC;Lo;0;L; 1107 1109 110CN;;bsj *;;;
1127;HANGUL CHOSEONG PIEUP-CIEUC;Lo;0;L; 1107 110CN;;bj *;;;
1128;HANGUL CHOSEONG PIEUP-CHIEUCH;Lo;0;L; 1107 110EN;;bc *;;;
1129;HANGUL CHOSEONG PIEUP-THIEUTH;Lo;0;L; 1107 1110N;;bt *;;;
112A;HANGUL CHOSEONG PIEUP-PHIEUPH;Lo;0;L; 1107 N;;bp *;;;
112B;HANGUL CHOSEONG KAPYEOUNPIEUP;Lo;0;L; 1107 114CN;;bq *;;;
#1109;HANGUL CHOSEONG SIOS;Lo;0;L;N;;s *;;;
112D;HANGUL CHOSEONG SIOS-KIYEOK;Lo;0;L; 1109 1100N;;sg *;;;
112E;HANGUL CHOSEONG SIOS-NIEUN;Lo;0;L; 1109 1102N;;sn *;;;
112F;HANGUL CHOSEONG SIOS-TIKEUT;Lo;0;L; 1109 1103N;;sd *;;;
1130;HANGUL CHOSEONG SIOS-RIEUL;Lo;0;L; 1109 1105N;;sr *;;;
1131;HANGUL CHOSEONG SIOS-MIEUM;Lo;0;L; 1109 1106N;;sm *;;;
1132;HANGUL CHOSEONG SIOS-PIEUP;Lo;0;L; 1109 1107N;;sb *;;;
1133;HANGUL CHOSEONG SIOS-PIEUP-KIYEOK;Lo;0;L; 1109 1107 1100N;;sbg *;;;
110A;HANGUL CHOSEONG SSANGSIOS;Lo;0;L; 1109 1109N;;ss *;;;
1134;HANGUL CHOSEONG SIOS-SSANGSIOS;Lo;0;L; 1109 110AN;;sss *;;;
1135;HANGUL CHOSEONG SIOS-IEUNG;Lo;0;L; 1109 110BN;;s' *;;;
1136;HANGUL CHOSEONG SIOS-CIEUC;Lo;0;L; 1109 110CN;;sj *;;;
1137;HANGUL CHOSEONG SIOS-CHIEUCH;Lo;0;L; 1109 110EN;;sc *;;;
1138;HANGUL CHOSEONG SIOS-KHIEUKH;Lo;0;L; 1109 110FN;;sk *;;;
1139;HANGUL CHOSEONG SIOS-THIEUTH;Lo;0;L; 1109 1110N;;st *;;;
113A;HANGUL CHOSEONG SIOS-PHIEUPH;Lo;0;L; 1109 N;;sp *;;;
113B;HANGUL CHOSEONG SIOS-HIEUH;Lo;0;L; 1109 1112N;;sh *;;;
#113C;HANGUL CHOSEONG CHITUEUMSIOS;Lo;0;L;N;;zs *;;;
113D;HANGUL CHOSEONG CHITUEUMSSANGSIOS;Lo;0;L; 113C 113CN;;zss *;;;
#113E;HANGUL CHOSEONG CEONGCHIEUMSIOS;Lo;0;L;N;;sz *;;;
113F;HANGUL CHOSEONG CEONGCHIEUMSSANGSIOS;Lo;0;L; 113E 113EN;;ssz *;;;
#1140;HANGUL CHOSEONG PANSIOS;Lo;0;L;N;;zz *;;;
#110B;HANGUL CHOSEONG IEUNG;Lo;0;L;N;;' *;;;
1141;HANGUL CHOSEONG IEUNG-KIYEOK;Lo;0;L; 110B 1100N;;'g *;;;
1142;HANGUL CHOSEONG IEUNG-TIKEUT;Lo;0;L; 110B 1103N;;'d *;;;
1143;HANGUL CHOSEONG IEUNG-MIEUM;Lo;0;L; 110B 1106N;;'m *;;;
1144;HANGUL CHOSEONG IEUNG-PIEUP;Lo;0;L; 110B 1107N;;'b *;;;
1145;HANGUL CHOSEONG IEUNG-SIOS;Lo;0;L; 110B 1109N;;'s *;;;
1146;HANGUL CHOSEONG IEUNG-PANSIOS;Lo;0;L; 110B 1140N;;'zz *;;;
1147;HANGUL CHOSEONG SSANGIEUNG;Lo;0;L; 110B 110BN;;'';;;
1148;HANGUL CHOSEONG IEUNG-CIEUC;Lo;0;L; 110B 110CN;;'j *;;;
1149;HANGUL CHOSEONG IEUNG-CHIEUCH;Lo;0;L; 110B 110EN;;'c *;;;
114A;HANGUL CHOSEONG IEUNG-THIEUTH;Lo;0;L; 110B 1110N;;'t *;;;
114B;HANGUL CHOSEONG IEUNG-PHIEUPH;Lo;0;L; 110B N;;'p *;;;
#114C;HANGUL CHOSEONG YESIEUNG;Lo;0;L;N;;q *;;;
#110C;HANGUL C
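
As a rough illustration of how extra decompositions like these might be applied, the Python sketch below recursively expands a compound choseong into simple jamos. The table copies only a handful of the private mappings listed in this message (they are not defined in Unicode), and the names `JOHAB_DECOMP` and `expand` are made up for the example.

```python
# Hypothetical excerpt of the private (non-Unicode) decompositions listed above.
JOHAB_DECOMP = {
    0x1101: (0x1100, 0x1100),          # SSANGKIYEOK -> KIYEOK KIYEOK
    0x1108: (0x1107, 0x1107),          # SSANGPIEUP -> PIEUP PIEUP
    0x112B: (0x1107, 0x114C),          # KAPYEOUNPIEUP -> PIEUP YESIEUNG
    0x112C: (0x1108, 0x114C),          # KAPYEOUNSSANGPIEUP -> SSANGPIEUP YESIEUNG
    0x1122: (0x1107, 0x1109, 0x1100),  # PIEUP-SIOS-KIYEOK
}

def expand(cp):
    """Recursively expand a jamo code point into jamos with no further mapping."""
    parts = JOHAB_DECOMP.get(cp)
    if parts is None:
        return [cp]
    out = []
    for part in parts:
        out.extend(expand(part))
    return out

print([f"{c:04X}" for c in expand(0x112C)])
```

Because 112C maps to 1108, which itself has a mapping, the expansion has to recurse; whether 114C or 110B is the right second element is exactly the open question above and is not settled by this sketch.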

Re: [OT] Keyboards (was: American English translation of character names)

2003-12-19 Thread Curtis Clark
on 2003-12-19 00:05 Arcane Jill wrote:

The left and right 
 keys are functionally identical anyway, and the  key is 
functionally identical to a right mouse click. 
It's handy, though, for people who cannot use a mouse.

(Okay, so  is used for "screen capture to 
clipboard" but who needs a button for that?). 
I use it all the time. Saves buying screen capture software.

They could have just used, 
for example,  for  and  for , without 
then having to scrunch up the  and  keys and shrink the 
space bar. 
With this I agree, and the keys could have retained their meaning in DOS 
windows. Perhaps the older versions of Windows weren't up to the task.

Vaguely ob Unicode, SC Unipad has keyboard layouts for many languages, 
but has the euro at Alt-Gr w on the "English (British)" keyboard.

--
Curtis Clark  http://www.csupomona.edu/~jcclark/
Mockingbird Font Works  http://www.mockfont.com/


Re: [OT] Keyboards (was: American English translation of character names)

2003-12-19 Thread jcowan
Doug Ewell scripsit:

> This was the standard typewriter arrangement, which in turn was actually
> the reason why that pair (and others) differs by only one bit in ASCII.

Correct.  It made the electromechanical design of Teletypes much easier to do.
However, it was the standard *manual* typewriter keyboard.  When IBM started
making Selectrics, they introduced the apostrophe/quote key on the home row
and remapped the shifted digits to pretty much the U.S. standard computer
keyboard today.

-- 
John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  www.ccil.org/~cowan
"You cannot enter here.  Go back to the abyss prepared for you!  Go back!
Fall into the nothingness that awaits you and your Master.  Go!" --Gandalf



Re: [OT] Keyboards (was: American English translation of character names)

2003-12-19 Thread Doug Ewell
Mark E. Shoulson  wrote:

> I remember my old TRS-80 had double-quotes on shift-2 as well.  I
> half-remember that it had something to do with the bit-patterns, so
> the shift key could work by applying a fairly trivial transformation
> to the unshifted characters.  "2" is 0032 and """ is 0022, so I may
> not be completely mistaken... yeah, I think it looks right: the rest
> of the shifted-number symbols match the number symbols minus 0010.

This was the standard typewriter arrangement, which in turn was actually
the reason why that pair (and others) differs by only one bit in ASCII.

Mackenzie (1980) noted that Criterion 17 for determining code positions
in the original ASCII was that "Graphics that are normally paired on
typewriter keytops should differ in only a common single bit position."
Interestingly, a few pages later he described the assignment of ! and "
to positions 2/1 and 2/2 as "more or less arbitrary," even though those
two did satisfy Criterion 17.
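
The single-bit relationship can be checked directly. The following Python sketch only demonstrates the ASCII arithmetic (digits 1-9 versus the symbols 0x10 below them); it makes no claim about which pairs appeared on any particular typewriter, and the helper name `bit_paired` is invented for the example.

```python
# In ASCII, the digits 1-9 (0x31-0x39) and the symbols !"#$%&'() (0x21-0x29)
# differ only in the 0x10 bit.
SHIFT_BIT = 0x10

def bit_paired(ch):
    """Return the ASCII character that differs from ch only in bit 0x10."""
    return chr(ord(ch) ^ SHIFT_BIT)

for digit in "123456789":
    print(digit, bit_paired(digit))
```

Flipping the bit again gives back the digit, which is why a shift key could be implemented as a one-bit change.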

As Frank said, it's a fascinating book for character geeks.
Unfortunately, it's long out of print and has to be special-ordered, and
isn't cheap; my slightly used copy cost me 67 USD four years ago.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Re: Unicode->ASCII approximate conversion

2003-12-19 Thread Hallvard B Furuseth
D. Starner writes:
>> The result is much better if you allow the ASCII conversion to be a string.
>> This allows you to, e.g., "©" = "(c)", "½" = "1/2", and so on. This is also
>> good for letters: "ß" = "ss", "å" = "aa", etc.
> 
> etcetera? I think he needs more direction than that, especially as most naïve 
> algorithms are going to produce "a" from "å". Digraphs can be treated
> as titlecase or capital or intelligently.

Hm.  Actually I'll want a mode which generates "a" rather than "aa" for
that one, to mimic local practice for how to generate e-mail addresses.
Though that can be tacked on with an extra hack afterwards.

One question, unless it has been answered already - I need to read up on
Unicode before I'll understand all the answers:

I'd like to translate 'ø' to 'o' or maybe 'oe'.  'o' at least when used
for matching, since it should match Swedish 'ö'.  However,
UnicodeData.txt has no decomposition property for that character:

00F8;LATIN SMALL LETTER O WITH STROKE;Ll;0;L;;;;;N;LATIN SMALL LETTER O SLASH;;00D8;;00D8

Is there some other property I can use?  Or is this a rare special case
to handle by hand?
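
For what it is worth, the empty decomposition is easy to confirm with Python's standard `unicodedata` module, which exposes the same UCD field. The sketch below shows one way to handle such letters by hand; the `MANUAL` table and the choice of "o" for "ø" are assumptions made for illustration, not something the UCD provides.

```python
import unicodedata

# Hand-maintained fallbacks for letters that have no decomposition in the UCD.
# These particular choices are illustrative assumptions.
MANUAL = {"ø": "o", "Ø": "O"}

def base_letter(ch):
    """Return ch without diacritics, using MANUAL when the UCD cannot help."""
    if ch in MANUAL:
        return MANUAL[ch]
    decomposed = unicodedata.normalize("NFD", ch)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(repr(unicodedata.decomposition("ø")))  # empty string: no mapping
print(unicodedata.decomposition("ö"))        # base letter plus diaeresis
print(base_letter("ø"), base_letter("ö"))
```

With the manual entry in place, "ø" and "ö" reduce to the same key, which is what the matching use case needs.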

-- 
Hallvard



Re: [OT] Keyboards (was: American English translation of character names)

2003-12-19 Thread Doug Ewell
Philippe Verdy  wrote:

> Arcane Jill did not speak about the 101-keys US English keyboard
> layout, but there DO exist US English keyboards with this extra
> key, which allows them to be used ALSO with non US keyboard layouts
> that need that key (for example the standard French keyboard needs
> it to allow input of "<" and ">", in addition to the other "OEM"
> key located near Enter on the third row or near Backspace on the
> first row, to input "*" and optionally "µ" on PC keyboards)...

I want one.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/




Re: [OT] Keyboards (was: American English translation of character names)

2003-12-19 Thread Mark E. Shoulson
On 12/19/03 03:05, Arcane Jill wrote:

 Another minor US/UK difference is that  is double quotes in 
England, not @.
I remember my old TRS-80 had double-quotes on shift-2 as well.  I 
half-remember that it had something to do with the bit-patterns, so the 
shift key could work by applying a fairly trivial transformation to the 
unshifted characters.  "2" is 0032 and """ is 0022, so I may not be 
completely mistaken... yeah, I think it looks right: the rest of the 
shifted-number symbols match the number symbols minus 0010.

I never understood why a certain computer company saw fit to squeeze 
three extra keys onto an already crowded keyboard.
Not sure I do either, but I have to admit it comes in handy to have a 
couple of extra shift keys.  I map them to things like 
compose-character, Meta, stuff like that.  But then, I don't use that 
company's software much.

As another aside (and a possibly useful tip), I once had a keyboard 
with black keys, on which the letters were printed in white. Being a 
touch-typist, I painted all the key legends out with black paint, 
leaving only legendless keys.
Didn't Big Julie in Guys and Dolls try the same thing with dice?  He 
remembered where the spots formerly were...

~mark




RE: Unicode->ASCII approximate conversion

2003-12-19 Thread D. Starner
> The result is much better if you allow the ASCII conversion to be a string.
> This allows you to, e.g., "©" = "(c)", "½" = "1/2", and so on. This is also
> good for letters: "ß" = "ss", "å" = "aa", etc.

etcetera? I think he needs more direction than that, especially as most naïve 
algorithms are going to produce "a" from "å". Digraphs can be treated
as titlecase or capital or intelligently.

00FE - "th"
00DE - "TH"
00F0 - "dh" ("th"?)
00D0 - "DH" ("TH"?)
0108 - "CH" (Esperanto)
0109 - "ch"
011C, 011D - "GH", "gh" (E-o)
0124, 0125 - "HH", "hh" (")
0134, 0135 - "JH", "jh" (")
015C, 015D - "SH", "sh" (")
017F - "s"

Depending on your goals, 015F & 0161 could be "sh", 0163 "ts",
017D "zh", etc. 

0195 - "hw"
01A3 - "gh"(?)
01BF - "w"
01C0 - "|" ("c"?)
01C1 - "||"? ("x"?)
01C3 - "!" ("q"?)
0223 - "w" ("ou"? "8"?)

I omitted most capitals and those that can be found by decomposition
or name stripping, as well as a bunch I don't know anything about.




Re: Unicode->ASCII approximate conversion

2003-12-19 Thread Jungshik Shin
On Fri, 19 Dec 2003 [EMAIL PROTECTED] wrote:

> Quoting Hallvard B Furuseth <[EMAIL PROTECTED]>:
>
> > I need a function which converts Latin Unicode characters to the closest
> > equivalent ASCII characters, e.g. "é" -> "e".

> 1. Produce the NFD normalisation of the text.
> 2. Remove all characters with a non-zero combining class.
> 3. Some non-ASCII characters may remain (particularly those from non-Latin
> scripts). Some of these can be handled nicely, but others may require you to
> raise an exception or output a replacement character.

> on your application. Specialised handling of some characters is possible, for
> instance you could convert the trademark sign to "(TM)" to avoid confusion,

  For Korean syllables (U+AC00 - U+Dxxx), you can use 'Hangul Syllable
Short Names' that can be algorithmically derived with small tables.
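
A minimal Python sketch of that derivation, assuming the usual arithmetic decomposition of the precomposed syllable block (19 leading consonants, 21 vowels, 28 trailing slots starting at U+AC00) and the jamo short names from the Unicode Standard. The function name `hangul_short_name` and the "_" placeholder for an empty short name are choices made for this example.

```python
S_BASE = 0xAC00
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588 syllables per leading consonant

# Jamo short names; "_" stands for an empty name (silent IEUNG, no trailing consonant).
L_NAMES = "G GG N D DD R M B BB S SS _ J JJ C K T P H".split()
V_NAMES = "A AE YA YAE EO E YEO YE O WA WAE OE YO U WEO WE WI YU EU YI I".split()
T_NAMES = ("_ G GG GS N NJ NH D L LG LM LB LS LT LP LH "
           "M B BS S SS NG J C K T P H").split()

def hangul_short_name(ch):
    """Derive the short name of a precomposed Hangul syllable arithmetically."""
    index = ord(ch) - S_BASE
    if not 0 <= index < L_COUNT * N_COUNT:
        raise ValueError("not a precomposed Hangul syllable")
    lead, rest = divmod(index, N_COUNT)
    vowel, trail = divmod(rest, T_COUNT)
    name = L_NAMES[lead] + V_NAMES[vowel] + T_NAMES[trail]
    return name.replace("_", "")

print(hangul_short_name("\ud55c"), hangul_short_name("\uae00"))
```

The three small tables are all the data needed; everything else is integer division, so no per-syllable lookup table is required.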



Re: Unicode->ASCII approximate conversion

2003-12-19 Thread jon
Quoting Hallvard B Furuseth <[EMAIL PROTECTED]>:

> I need a function which converts Latin Unicode characters to the closest
> equivalent ASCII characters, e.g. "é" -> "e".
> 
> Before I reinvent the wheel, does any public domain or GPL code for this
> already exist?
> 
> If not,
> for the most part I expect I can make the mapping from the character
> names, e.g. ignore 'WITH ACUTE' in 'LATIN CAPITAL LETTER O WITH ACUTE'
> in .
> Punctuation and other non-letters will be worse, but they are less
> important to me anyway.
> 

1. Produce the NFD normalisation of the text.
2. Remove all characters with a non-zero combining class.
3. Some non-ASCII characters may remain (particularly those from non-Latin 
scripts). Some of these can be handled nicely, but others may require you to 
raise an exception or output a replacement character.
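
The three steps above can be sketched in Python with the standard `unicodedata` module. This is only one possible reading of the recipe: the function name `strip_marks` and the choice of "?" as the replacement character are assumptions.

```python
import unicodedata

def strip_marks(text, replacement="?"):
    """NFD-normalise, drop combining marks, replace leftover non-ASCII."""
    out = []
    for ch in unicodedata.normalize("NFD", text):   # step 1
        if unicodedata.combining(ch):                # step 2: non-zero combining class
            continue
        out.append(ch if ord(ch) < 128 else replacement)  # step 3
    return "".join(out)

print(strip_marks("é Ó naïve"))
```

Characters with no decomposition, such as "ø", fall through to the replacement branch, so they still need separate handling.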

This can be done efficiently with a streaming processor if the size of the 
source text is large.

You may want to use NFKD rather than NFD. NFKD would, for example, convert the 
trademark symbol to "TM" and superscript 2 to "2". This lets you convert more 
characters, but the loss of semantics may be problematic depending on your 
application. Specialised handling of some characters is possible: for instance, 
you could convert the trademark sign to "(TM)" to avoid confusion. An existing 
normalisation API won't do that for you, but if the number of characters handled 
specially is small, you could handle them in a first pass.
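
A small Python sketch of the two-pass idea, again using the standard `unicodedata` module. The `SPECIAL` table is hypothetical; only its "(TM)" entry comes from the text above.

```python
import unicodedata

# Hypothetical first-pass table; only the "(TM)" entry comes from the message.
SPECIAL = {"\u2122": "(TM)"}

def to_ascii_nfkd(text):
    # First pass: replace specially handled characters before NFKD flattens them.
    text = "".join(SPECIAL.get(ch, ch) for ch in text)
    # Second pass: compatibility decomposition, then drop combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(to_ascii_nfkd("Acme\u2122 x\u00b2 caf\u00e9"))
```

The first pass has to run before normalisation, because after NFKD the trademark sign is indistinguishable from the plain letters "TM".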

--
Jon Hanna   | Toys and books
 | for hospitals:
| 



RE: Unicode->ASCII approximate conversion

2003-12-19 Thread Marco Cimarosti
Hallvard B Furuseth wrote:
> I need a function which converts Latin Unicode characters to 
> the closest equivalent ASCII characters, e.g. "é" -> "e".
> 
> Before I reinvent the wheel, does any public domain or GPL 
> code for this already exist?

I don't know, sorry.

> If not,
> for the most part I expect I can make the mapping from the character
> names, e.g. ignore 'WITH ACUTE' in 'LATIN CAPITAL LETTER O WITH ACUTE'
> in .

Why the name!?

The decomposition property (field 5 on each line, counting from 0) is much
better for this. E.g.:

00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9

The decomposition field tells you that "é" (code 00E9 hex) is composed of
ASCII "e" (code 0065 hex) and the combining acute accent (code 0301 hex):
you keep the ASCII character and drop the combining accent.
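
As a sketch of how that field could be read programmatically, the Python below splits the 00E9 record, written as it appears in the published UnicodeData.txt, on semicolons and keeps the leading ASCII code point. The function name `ascii_base` is invented for the example, and reading the whole file is left out.

```python
# The 00E9 record as published in UnicodeData.txt (fields separated by ';').
LINE = ("00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;"
        "LATIN SMALL LETTER E ACUTE;;00C9;;00C9")

def ascii_base(line):
    """Return the leading ASCII character of a record's decomposition, or None."""
    fields = line.split(";")
    decomposition = fields[5].split()   # field 5, counting from 0
    # Skip a compatibility tag such as <compat> if one is present.
    codes = [c for c in decomposition if not c.startswith("<")]
    if codes and int(codes[0], 16) < 128:
        return chr(int(codes[0], 16))
    return None

print(ascii_base(LINE))
```

Records with an empty decomposition field return None, which marks them as cases to handle by hand.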

> Punctuation and other non-letters will be worse, but they are less
> important to me anyway.

The result is much better if you allow the ASCII conversion to be a string.
This allows you to, e.g., "©" = "(c)", "½" = "1/2", and so on. This is also
good for letters: "ß" = "ss", "å" = "aa", etc.
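
A minimal Python sketch of a string-valued mapping, using only the four examples given above; the table name `REPLACEMENTS` and function name `approximate` are made up, and a real table would need many more entries.

```python
# Replacement strings taken from the examples above; extend as needed.
REPLACEMENTS = str.maketrans({
    "©": "(c)",
    "½": "1/2",
    "ß": "ss",
    "å": "aa",
})

def approximate(text):
    """Replace each mapped character with its ASCII string approximation."""
    return text.translate(REPLACEMENTS)

print(approximate("© 2003 Straße, ½ på"))
```

Because the values are strings rather than single characters, the output can be longer than the input, which matters if the result goes into a fixed-width field.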

_ Marco




RE: Unicode->ASCII approximate conversion

2003-12-19 Thread Philippe Verdy
Hallvard B Furuseth wrote:
> I need a function which converts Latin Unicode characters to the closest
> equivalent ASCII characters, e.g. "é" -> "e".
> 
> Before I reinvent the wheel, does any public domain or GPL code for this
> already exist?

Please don't use character names for that conversion.
Instead, use the NFKD decompositions from the UCD, then see if the first
character is an ASCII character, and if so, remove the diacritics in the 03xx
block (those that have an "Mn" general category and a non-zero
combining class). If non-ASCII characters remain, use a default
replacement like '?'. But you need some other custom rules
(look at the sharp-s compatibility decomposition: it's best to
map it to "ss" rather than "?", which can be done by looking at the
case foldings of "Ll" lowercase letters).

This will be less tricky, as there's no guarantee that names will be
consistent.
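
A loose Python sketch of this recipe, assuming Python 3.7 or later for `str.isascii`. It simplifies the "first character is ASCII" test by normalising the whole string and filtering afterwards, and it applies the case-folding rule up front; the function name `to_ascii` and that ordering are choices made for the example.

```python
import unicodedata

def to_ascii(text, default="?"):
    # Custom rule first: full case folding rescues lowercase letters such as
    # sharp s, whose folding is the ASCII string "ss".
    rescued = "".join(
        ch.casefold()
        if (unicodedata.category(ch) == "Ll"
            and not ch.isascii()
            and ch.casefold().isascii())
        else ch
        for ch in text
    )
    out = []
    for ch in unicodedata.normalize("NFKD", rescued):
        if unicodedata.category(ch) == "Mn":
            continue                      # drop nonspacing diacritics
        out.append(ch if ch.isascii() else default)
    return "".join(out)

print(to_ascii("Straße, é, ø"))
```

Letters with neither a decomposition nor an ASCII case folding, such as "ø", still come out as the default replacement.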



Unicode->ASCII approximate conversion

2003-12-19 Thread Hallvard B Furuseth
I need a function which converts Latin Unicode characters to the closest
equivalent ASCII characters, e.g. "é" -> "e".

Before I reinvent the wheel, does any public domain or GPL code for this
already exist?

If not,
for the most part I expect I can make the mapping from the character
names, e.g. ignore 'WITH ACUTE' in 'LATIN CAPITAL LETTER O WITH ACUTE'
in .
Punctuation and other non-letters will be worse, but they are less
important to me anyway.

-- 
Hallvard



RE: American English translation of character names

2003-12-19 Thread jarkko.hietaniemi
> particularly the 1930s, 40s, 50s, and 60s sections, and 
> follow the many
> links from each entry.  In particular, you can see the basic 
> character set
> of the IBM 360 (as generated by the IBM 29 Card Punch) here:
> 
>   http://www.columbia.edu/acis/history/029.html
> 
> (scroll down a bit after the photo).

http://www.unicode.org/Public/MAPPINGS/VENDORS/IBM/IBM360.TXT

404 Not Found

:-)



RE: [OT] Keyboards (was: American English translation of character names)

2003-12-19 Thread Arcane Jill





Yeh, £ is  in Britain, right where # is in America (I
think). Another minor US/UK difference is that  is
double quotes in England, not @. In England, € is definitely . I guess keyboard designers couldn't use  because
it was already in use. On English keyboards,  + a, e, i,
o and u give á, é, í, ó and ú respectively. The other letter keys have
no  assignments at all, although obviously I can edit the
keyboard layout to change that.

I never understood why a certain computer company saw fit to squeeze
three extra keys onto an already crowded keyboard. The left and right
 keys are functionally identical anyway, and the
 key is functionally identical to a right mouse click. In
fact, if you have a working mouse, you'll never use  or
 anyway. Okay, I accept that that particular OS needed TWO
(not three) extra keys so that people without a mouse could still use
it, but there were already three unused keys on the keyboard.
Unused in Windows anyway. The keys , 
and  may have been needed for DOS, but have no use at all
in Windows. (Okay, so  is used for "screen capture to
clipboard" but who needs a button for that?). They could have just
used, for example,  for  and
 for , without then having to scrunch up the
 and  keys and shrink the space bar. Maybe
they just wanted to make more money by persuading everyone they had to
buy a new keyboard; maybe they wanted to spread their logo around a bit
further, who knows? But it was an extremely silly idea from the end
users' point of view.

As another aside (and a possibly useful tip), I once had a keyboard
with black keys, on which the letters were printed in white. Being a
touch-typist, I painted all the key legends out with black paint,
leaving only legendless keys. It wasn't the greatest of security
devices, but no-one else in my household would even dream of
using my computer thus configured. Passwords? They wouldn't know where
to start! (Of course, they could have just unplugged the keyboard and
plugged in a different one, but it definitely stopped the casual
curious).

Jill

> -Original Message-
> From: Marco Cimarosti [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, December 18, 2003 5:44 PM
> To: 'Arcane Jill'; [EMAIL PROTECTED]
> Subject: RE: [OT] Keyboards (was: American English translation of
> character names)
> 
> 
> Arcane Jill wrote:
> > Yeah, everything's shifted around, I know. But I think we 
> > have one extra key, all told, to make room for the GBP
> > currency symbol (£).
> 
> Isn't that  on the UK keyboard? That's where it is
> in Italy.
> 
> All non-US keyboards have an extra key to the right of
> the left shift key; what that key is used for depends on the locale.
> 
> > They didn't add an extra key for the Euro though. We access that as
> > .
> 
> What OS is it? Most European keyboards I have seen have the euro
> on .
> 
> > Guess my  is smaller than yours.
> 
> Americans'  is bigger than anybody else's. :-)
> 
> _ Marco
>