Re: Braille rendering of Unicode [OT 50%]

2000-08-10 Thread JC Helary

> The background of the question is that current braille software is capable
> of dot-displaying many national text encodings, using the corresponding
> national braille. So I was wondering if anybody is thinking to paste all
> these local conventions together to represent Unicode (presumably, using
> "escape sequences" when the script/language change).
> 
> (Probably it was just a silly question. F8-)

I don't see what is silly in the question. After all, if Unicode allows the
display of multilingual pages on the web where the full contents are
accessible to people who understand all the displayed languages, it is not so
far-fetched to imagine that blind people may want to enjoy the ability to
access different languages at the same time thanks to Unicode. Or am I
missing something?

I am pretty sure we have braille-using members on the list. What do they think
about this? What problems do they face when they try to display
different languages in a Unicode text?

jc helary




Which languages are supported in basic latin

2000-08-10 Thread Halldor G. Gestsson

Hi there!

Can I find a list of all languages supported in the basic latin
(0x0000-0x00FF)?
E.g:
Basic Latin support
English
German
Spanish
French
Danish
Icelandic
... etc. etc.

Which languages use the Latin extensions A, B and C?

Kind regards,
Halldor

===
Halldor G. Gestsson (Mr.)
Software Engineer, M.Sc.E.E. 
E-mail: [EMAIL PROTECTED] 
Direct phone: +45 96355562, Mobile phone: +45 21205343

Maxon Cellular Systems (Denmark) A/S
Niels Jernes Vej 8
P.O. Box 8440
DK-9220 Aalborg Øst
Denmark
Phone: +45 96 35 55 00,  Fax: +45 96 35 56 00
E-mail: [EMAIL PROTECTED]
Homepage: www.maxon.dk



RE: Which languages are supported in basic latin

2000-08-10 Thread Marco . Cimarosti

Halldor G. Gestsson:
> Can I find a list of all languages supported in the basic latin
> (0x0000-0x00FF)?
> [...]
> Which languages use the Latin extensions A, B and C?

Page  contains the information to build your
lists.

_ Marco



Swiss numerical format (was: What is ` (U+0060) for?)

2000-08-10 Thread Jörg Knappen

As an aside:

Are there good (authoritative) references on the so-called
Swiss numerical format with its peculiar thousands separator?

I only know about a manual shipped with some Aldus software product
as a reference. I own several books printed in Switzerland and they
show the typical Swiss orthography (lack of ß), but all show one of
the two usual German number formats (. or \, (thin space) as thousands
separator).

--J"org Knappen



RE: Swiss numerical format [OT]

2000-08-10 Thread Marco . Cimarosti

Jörg Knappen wrote:
> Are there good (authoritative) references on the so-called
> Swiss numerical format with its peculiar thousands separator?

Why not compare the locale settings of the main operating systems? I think
that at least WinNT, Apple, Linux, and other Unixes are widely represented
on this list.

This is what I see on Windows NT's Swiss (German) locale:

(Notice that the decimal separator seems to be "," for money, but "." for
other numbers.)

Number (e.g. "-123'456'789.00")
  Decimal symbol:               .
  No. of digits after decimal:  2
  Digit grouping symbol:        '
  No. of digits in group:       3
  Negative sign symbol:         -
  Negative number format:       -1.1
  Display leading zeros:        0.7 (i.e., yes)
  Measurement system:           Metric
  List separator:               ;

Currency (e.g. "SFr. -123'456'789,00")
  Currency symbol:              "SFr."
  Positive currency format:     "SFr. 1.1"
  Negative currency format:     "SFr. -1.1"
  Decimal symbol:               ","
  No. of digits after decimal:  "2"
  Digit grouping symbol:        "'"
  No. of digits in group:       "3"

Time (e.g. "23.57.46")
  Time style:                   "HH.mm.ss" (HH = 24h clock)
  Time separator:               "."
  AM symbol:                    none
  PM symbol:                    none

Short date: (e.g. "10.08.00")
  Short date style:             "dd.MM.yy"
  Date separator:               "."

Long date: (e.g. "Donnerstag, 10. August 2000")
Long date style: ", d.  "
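
As a rough illustration of the Number and Currency settings listed above (and not an authoritative reference for the Swiss format), here is a small Python sketch that formats a value with ' as the grouping symbol and a configurable decimal symbol. The function name format_swiss and its parameters are invented for this example.

```python
def format_swiss(value, decimal_symbol=".", grouping_symbol="'", digits=2):
    """Format a number with 3-digit groups, e.g. -123'456'789.00."""
    # Python's "," format option inserts a separator every three digits;
    # we then swap in the symbols taken from the locale settings.
    text = f"{value:,.{digits}f}"
    integer_part, _, fraction = text.partition(".")
    integer_part = integer_part.replace(",", grouping_symbol)
    if fraction:
        return integer_part + decimal_symbol + fraction
    return integer_part


# Number style: "." as decimal symbol.
print(format_swiss(-123456789))
# Currency style: "," as decimal symbol, with the currency symbol in front.
print("SFr. " + format_swiss(-123456789, decimal_symbol=","))
```

The two calls reproduce the example strings shown in the Number and Currency headings; only the decimal symbol differs between them.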

_ Marco



RE: Braille rendering of Unicode [OT 50%]

2000-08-10 Thread Marco . Cimarosti

Steven R. Loomis wrote:
> [...] Presumably the unicode codepoints in braille
> would make a great format for these translations on their way to a
> printer.  One would hope they would get such use and not simply for
> braille-looking characters on paper or screen.

You are right, I didn't catch it: the primary usage of these codes is
probably to allow (intelligent) braille software to communicate with (stupid)
braille hardware.

But then I was probably not so wrong in calling these "presentation glyphs"
for braille.

> [...] Is there a standard file format for those devices?

I heard of something called "Braille ASCII": I think it is a de facto standard
for sending data to braille devices, and for exchanging print-ready documents.

It is a 1-to-1 mapping between 6-dot braille cells and ASCII characters 0x20
to 0x5F.

The problem, as I understand it, is that this approach cannot be used with
8-dot braille, because a byte is not enough. In addition to the 256 braille
"glyphs", the printers need control characters for line break, page break,
device controls, etc. That's probably where the Unicode braille block comes
in, as you suggested.
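
To make the size argument concrete, here is a small Python sketch. It relies only on the layout of the Unicode braille patterns block (U+2800 to U+28FF, where dot n sets bit n-1 of the offset from U+2800); the helper name braille_cell is invented for the example, and nothing here reproduces the actual Braille ASCII table.

```python
BRAILLE_BASE = 0x2800  # U+2800 BRAILLE PATTERN BLANK


def braille_cell(*dots):
    """Return the Unicode braille pattern with the given dots (1-8) raised."""
    offset = 0
    for dot in dots:
        if not 1 <= dot <= 8:
            raise ValueError(f"dot number out of range: {dot}")
        offset |= 1 << (dot - 1)  # dot n corresponds to bit n-1
    return chr(BRAILLE_BASE + offset)


# 6-dot braille has 2**6 = 64 cells, exactly as many as there are ASCII
# characters in the range 0x20..0x5F, so a 1-to-1 mapping fits.
print(2 ** 6, 0x5F - 0x20 + 1)

# 8-dot braille has 2**8 = 256 cells: every byte value would be used up,
# leaving nothing for line breaks or other control characters.
print(2 ** 8)

print(braille_cell(1), braille_cell(1, 2), braille_cell(1, 4))
```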

>  It sounded to me like a transliteration problem at first.

It is not a transliteration problem, in theory. But in practice the issues
involved must be very similar. E.g., the spelling rules could be slightly
different from print to braille (or even very different, in the case of CJK
ideographs, which have to be rendered phonetically). Moreover, as with
transliteration, the conventions are per language, not per script.

> I have somewhere a first cut at a unicode <-> braille mapping.

Wow! So something like this exists, somewhere...

_ Marco



Organizing your CD collection

2000-08-10 Thread 11digitboy

How do you sort text with some in Roman and some
in non-Roman alphabets? Currently, I'm just romanizing
everything but I don't know if that is that good.
Should I just kanize Japanese?
I would love a system that just goes by characters,
and I would much prefer it if the Han digits collated
in numerical order.
So far, I have 3 alphabets: Latin, Greek, and Japanese.
It is probably bad to kanize digits, because they
would sort 1, 9, 5, and so on, or some other mixed-up
order.
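
The digit problem can be seen, and worked around, in a few lines of Python. This is only a sketch: it looks at single Han digits rather than whole titles, and it uses unicodedata.numeric, which reports the numeric value recorded in the Unicode database for a character.

```python
import unicodedata

han_digits = ["一", "二", "三", "四", "五", "六", "七", "八", "九"]

# Plain code point order scrambles the digits.
by_code_point = sorted(han_digits)
print([int(unicodedata.numeric(ch)) for ch in by_code_point])

# Sorting on the numeric value restores the order 1..9.
by_value = sorted(han_digits, key=unicodedata.numeric)
print("".join(by_value))
```

A real title sort would need to pick the digits out of longer strings first, which this sketch does not attempt.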

--
Robert Lozyniak
Accusplit pedometer manufactures can go suck eggs
My page: http://walk.to/11
[EMAIL PROTECTED] - email
(917) 421-3909 x1133 - voicemail/fax



___
Get your own FREE Bolt Onebox - FREE voicemail, email, and
fax, all in one place - sign up at http://www.bolt.com




Re: Off-topic: digraphs and trigraphs

2000-08-10 Thread Otto Stolz

On Thu, 3 Aug 2000 18:22:50 -0800 (GMT-0800) Doug Ewell has written:
> Does anyone know of a commonly used or commonly accepted collective
> term for multi-character sequences (e.g. digraphs, trigraphs, etc.)?

If it comes to coining a new term, I'd like to propose "oligograph",
from Greek "oligoi" (a few), as in "oligarchy" and "oligocene",
and "graphein" (to write), as in "photography" and "grapheme".

There are really just a few, not many, characters to form such a
compound -- and we would avoid those puns on lying-detectors ;-)

Best wishes,
  Otto Stolz




Re: Organizing your CD collection

2000-08-10 Thread Michael \(michka\) Kaplan

See UTR #10 (Unicode Collation Algorithm) at

http://www.unicode.org/unicode/reports/tr10/

for a very firm report on how you should indeed be handling collation.

At some level, as I mentioned to you earlier, there are many Latin sorts
that contradict other Latin sorts; in THOSE cases a decision must be made in
sorting multilingual text.

michka






Re: Organizing your CD collection

2000-08-10 Thread Peter_Constable


Could we *please* use subjects that will tell us what topic of interest the
message is *really* about?








RE: Braille rendering of Unicode [OT 50%]

2000-08-10 Thread Asmus Freytag

At 11:10 AM 8/9/00 -0800, [EMAIL PROTECTED] wrote:
>. (But I think
>that the main reason why these patterns are in Unicode is to encode runs of
>braille-looking characters in didactic texts for *sighted* people).

No, they are provided for any case that you want to transmit 'final-form' 
output for Braille, i.e. *after* the appropriate mapping has been done from 
Unicode characters to the Braille system the user is using.

Being able to unambiguously 'freeze' the result of a Braille conversion and 
to communicate that result to another device was explicitly requested by 
the proponents of the Braille encoding in Unicode.

A./



RFC 1766 (was: Summary: xml:lang validity and RFC 1766 refs

2000-08-10 Thread Doug Ewell

Mike Brown <[EMAIL PROTECTED]> wrote:

>> I don't see anything in RFC 1766 that hardcodes it to the 
>> 1988 versions of either 639 or 3166.
>
> I have taken this discussion off the Unicode list. I only started the
> thread here because I was referencing an earlier post and because ISO
> 639 language code updates were topical a couple months ago. It's not
> particularly relevant to Unicode.

Actually, this is quite relevant to Unicode, and that is why I'm
bringing it back on the list.

Unicode Technical Report #7, "Plane 14 Characters for Language Tags,"
makes a direct reference to RFC 1766:

> A Plane 14 tag string prefixed by U-000E0001 LANGUAGE TAG is specified
> to constitute a language tag.  Furthermore, the tag values for the
> language tag are to be spelled out as specified in RFC 1766, making
> use only of registered tag values or of user-defined language tags
> starting with the characters "x-".

If RFC 1766 can be construed by the XML spec as requiring the 1988
revisions of ISO 639 and ISO 3166, rather than newer revisions, then
it can be construed that way by UTR #7 as well.  And that means I cannot
conformantly use some of the more recently created language codes such
as "ae" for "Avestan" or country codes such as "ps" for "Palestinian
Territory, Occupied" in Plane 14 language tags.
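
For readers who have not looked at UTR #7, a tag string is mechanical to build: U+E0001 LANGUAGE TAG followed by tag characters, each one being the code of an ASCII character plus 0xE0000. The Python sketch below shows just that. The function names are invented for this example, and it does no validation against RFC 1766 or the ISO code lists, which is exactly the open question here.

```python
LANGUAGE_TAG = 0xE0001  # U+E0001 LANGUAGE TAG
TAG_OFFSET = 0xE0000    # tag characters U+E0020..U+E007E mirror ASCII


def make_language_tag(tag):
    """Encode a tag such as "fr-CH" as a Plane 14 tag string."""
    if not all(0x20 <= ord(ch) <= 0x7E for ch in tag):
        raise ValueError("tag must consist of printable ASCII characters")
    return chr(LANGUAGE_TAG) + "".join(chr(TAG_OFFSET + ord(ch)) for ch in tag)


def read_language_tag(text):
    """Decode a string produced by make_language_tag back to ASCII."""
    if not text or ord(text[0]) != LANGUAGE_TAG:
        raise ValueError("string does not start with U+E0001 LANGUAGE TAG")
    return "".join(chr(ord(ch) - TAG_OFFSET) for ch in text[1:])


encoded = make_language_tag("ae")
print([f"U+{ord(ch):05X}" for ch in encoded])
print(read_language_tag(encoded))
```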

Can anyone comment on this?  If RFC 1766 can realistically be read as
requiring outdated versions of ISO 639 and 3166, then it seems that UTR
#7 should be updated to bypass RFC 1766 entirely and refer directly to
ISO 639 and 3166.

-Doug Ewell
 Fullerton, California



Re: Which languages are supported in basic latin

2000-08-10 Thread Antoine Leca

Halldor G. Gestsson wrote:
> 
> Can I find a list of all languages supported in the basic latin
> (0x0000-0x00FF)?
> E.g:
> Basic Latin support
> English
> German
> Spanish
> French
> Danish
> Icelandic
> ... etc. etc.

What is your definition of "language" ?

For example, do you mean French as printed with any usual typewriter
(so U+0000 - U+00FF qualifies), or do you believe that it should have
all the characters needed to properly write the 500 most used French
words, in which case you need to add U+0153, œ? (U+0152 and U+0178
are also seen, but are much rarer.)
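
A quick way to test whether a given word fits in that range is to try encoding it as Latin-1, which covers exactly U+0000 to U+00FF. In this Python sketch the function name and the sample words are chosen just for illustration.

```python
def fits_latin1(word):
    """Return True if every character of word lies in U+0000..U+00FF."""
    try:
        word.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


for word in ["été", "garçon", "cœur", "œuvre"]:
    print(word, fits_latin1(word))
```

Words spelled with œ fail the test, while the accented letters of the other two words pass.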

Also punctuation may need to be considered. First the apostrophe,
which Unicode prefers to see encoded as U+2019 (IIRC); the same
character is also used in Spain to note the decimal separator.
Next, the German quotes, or French (among others) list hyphens
(em-dash). &c.

On the other hand, are you interested in knowing about "javanais",
which is a form of French Parisian slang used (among others) in
the 50's and then again in the 75's, and which consists of
systematically adding "av" between the consonant(s) and the
vowel (so "jamais" becomes "javamavais", etc.)? Since in Parisian
French "œ" is more or less the same as "eu", and since the
orthography of "javanais" is not fixed, it may well qualify...
;-)

On a similar point, Russian pioneers designed a system
which allowed them to read and write Cyrillic with ASCII-only
terminals, using KOI-7 as the character set, which features a rough
transliteration scheme (the switch from Latin to Cyrillic is
noted by inversion of the case, so when it reads as upper-case,
it is in fact lower-case Cyrillic). Does Russian qualify?


I believe that the whole Unicode effort is here to show us that
restricted character sets are *not* a solution for writing a given
language, at least not without some restriction upon its use
(which you were not talking about). Particularly with a character
set as restricted as iso-8859-1.


Antoine



Re: Off-topic: digraphs and trigraphs

2000-08-10 Thread Peter_Constable


>If it comes to coining a new term, I'd like to propose "oligograph"...

>There are really just a few, not many, characters to form such a
>compound -- and we would avoid those puns on lying-detectors ;-)

Not a bad idea, methinks.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





Re: Organizing your CD collection

2000-08-10 Thread Antoine Leca

Robert Lozyniak wrote:
>
> How do you sort text with some in Roman and some
> in non-Roman alphabets?

I never sort texts, only lists of items (words,
names, titles, whatever).

Depending on the ratios, I see two main solutions:

- if Latin is the most common, _and_ only other Greek-
derived scripts are used, _and_ the intended audience
is proficient enough, I may intersperse the non-Roman
letters as if all the Greek-derived alphabets shared
a common order (so Greek alpha sorts just after Latin a,
Cyrillic ve after Cyrillic be which follows Greek beta
which follows Latin b, Greek xi after the o's and before
the p's, etc.)

- in other cases, I sort the scripts separately.


> Currently, I'm just romanizing
> everything but I don't know if that is that good.

Hmmm. I wouldn't do that. It would take me much too long
to find something that begins with beta in the V section,
or something that begins with mu+pi in the B section...
For Cyrillic, I expect U+0427 to romanize as tcha,
and U+0429 as chtcha, and I am not sure you will (or
vice-versa).

Things are different if you actually transliterate,
i.e. if the items are presented in Latin script.


> It is probably bad to kanize digits, because they
> would sort 1, 9, 5, and so on, or some other mixed-up
> order.

It is always a problem to sort the digits, anyway.
Since there are usually only a few of them, I believe the
best place is at the very front, so the search does not take
too long. But if there are more than a handful, that is
pretty much always brain damage.


Antoine



Re: RFC 1766 (was: Summary: xml:lang validity and RFC 1766 refs

2000-08-10 Thread John Cowan

Doug Ewell wrote:

> Can anyone comment on this?  If RFC 1766 can realistically be read as
> requiring outdated versions of ISO 639 and 3166, then it seems that UTR
> #7 should be updated to bypass RFC 1766 entirely and refer directly to
> ISO 639 and 3166.

With all due respect to all participants, the claim that RFC 1766 freezes
obsolete versions reminds me of (future Cardinal) Newman's argument in Tract #90,
back when he was still part of the Church of England.  Although the 39
Articles denounced "the Romish doctrine of Purgatory", they did not denounce
the Roman doctrine, and clearly Romish != Roman.  So believers in the Roman
doctrine might continue to belong to the Anglican Church.

Anyhow, RFC 1766 has other useful things besides pass-through of the ISO
standards, specifically its very own registry that handles languages like
zh-yue (Cantonese) and i-mingo (Mingo).

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan <[EMAIL PROTECTED]>
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,   || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.-- Coleridge (tr. Politzer)



RE: RFC 1766

2000-08-10 Thread Mike Brown

> the claim that RFC 1766 freezes obsolete versions 

Actually, the claim is that RFC 1766 could be interpreted that way, not that
it is actually trying to say so. The RFC author's recent statement of intent
clarifies that a more lenient interpretation is prudent.

The reason it is important to have this statement of intent is because, in
the world of specifications, some (many?) would argue that the letter of the
'law' must be followed when the intent -- some higher, more abstract 'truth'
-- cannot be unambiguously divined from the materials at hand. How obviously
sensible one interpretation is over another is a subjective assessment that
inevitably results in differing implementations and translations
(translations like when I give xml:lang advice to coworkers).

> Although the 39 Articles denounced "the Romish doctrine of
> Purgatory", they did not denounce the Roman doctrine, and
> clearly Romish != Roman.  So believers in the Roman
> doctrine might continue to belong to the Anglican Church

'Although RFC 1766 references old versions of ISO 639 and ISO 3166, it does
not explicitly prohibit using newer versions of those standards, and clearly
old versions != new versions. So people who use the newer standards are
justified...'

In my opinion, it is highly preferable to seek clarifications from the
authors and/or experts on the literature in question than to assume that a
loose interpretation must be what was intended just because it's easier for
one to deal with the consequences of that interpretation.

   - Mike

Mike J. Brown, software engineer at My XML/XSL resources:
webb.net in Denver, Colorado, USA   http://www.skew.org/xml/



Re: Zero-width ligator

2000-08-10 Thread Roozbeh Pournader


That seems problematic to me, when used for Arabic. How should one use
ZWNJ between two Arabic letters to stop the ligature? They'll get
disconnected!

--roozbeh

On Tue, 8 Aug 2000 [EMAIL PROTECTED] wrote:

> 
> I inquired about that recently on the unicoRe list, and was told that the
> semantics of ZWJ/ZWNJ will be extended in 3.0.1 (or maybe it was 3.1). You
> mentioned that this decision was made at the meeting in February.
> Interestingly, I was at that meeting, and my recollection was that
> extending the semantics of ZWJ/ZWNJ was going to be given further
> consideration, after some people investigated the implications of extending
> the semantics of ZWJ, particularly for Indic scripts. But I left before the
> meeting was over, and the minutes reflect that a decision was in fact made
> (although the weasel word "provisionally" is used).
> 
> 
> 
> - Peter




Re: Braille rendering of Unicode [OT 50%]

2000-08-10 Thread Rick McGowan

Marco said:

> These are at most the building blocks for braille. A better parallel
> would be to consider these "presentation glyphs" for braille. (But I think
> that the main reason why these patterns are in Unicode is to encode runs of
> braille-looking characters in didactic texts for *sighted* people).

Just FYI...  Let me elaborate.

I was involved in the UTC decision to encode the Braille pattern symbols, and in the 
end I argued strongly FOR the encoding.

In the early history of Unicode, Braille was considered somewhat naively as a mere 
presentation form, and this was shown later to be an incorrect view.

The main point upon which the UTC encoding decision turned was the understanding that 
conversion from English (or whatever) text to Braille is algorithmically easy.  But a 
conversion from Braille back to some other system is "hard" because of escape 
sequences, locking shifts, and special usages.  In effect, Braille must be understood 
as a writing system that stands on its own apart from others.  The complexity of the 
escapes and multi-lingual usages make it possible to render "puns" and other forms of 
ambiguity in Braille that cannot be transcribed directly and unambiguously into 
another writing system such as the Latin script.

I.e., once you have Braille text, rendering in Latin or some other script can, at the 
extreme end, become a matter of _translation_ or _re-interpretation_ not one of mere 
_transcription_.

UTC does envision a usage like the one Steven Loomis pointed out:

> Presumably the unicode codepoints in braille would make a great
> format for these translations on their way to a printer.

Furthermore, this encoding makes it possible to treat Braille as a "first class" 
writing system in Unicode, and use it directly for the encoding of primary Braille 
texts.  Braille is comparable to any other complex writing system.

Rick




 


RE: Zero-width ligator

2000-08-10 Thread Marco . Cimarosti

Roozbeh Pournader wrote:
> That seems problematic to me, when used for Arabic. How should one use
> ZWNJ between two Arabic letters to stop the ligature? They'll get
> disconnected!

Good point.

ZWJ+ZWNJ+ZWJ comes to mind, but it is really not the maximum of elegance...

_ Marco



Re: (PRIV) RE: Zero-width ligator

2000-08-10 Thread Roozbeh Pournader


On Thu, 10 Aug 2000 [EMAIL PROTECTED] wrote:

> ZWJ+ZWNJ+ZWJ comes to mind, but it is really not the maximum of elegance...

No! Please! I have a lot of difficulties forcing old staff to use Unicode,
add this and they will escape. ;)

This surely creates many many problems.

--roozbeh




http://www.eki.ee/letter/ (was: Which languages are supported

2000-08-10 Thread Gary P. Grosso

This site would be worth inclusion in the Useful Resources
section of the Unicode site, I think.  Actually, there is a
link to http://www.eki.ee/itstandard/ladina as "Database of
Latin letters and languages" under Linguistics and Script
Specialty Sites, which turns out, it seems, to be a redirect
to the excellent page at http://www.eki.ee/letter/.  So I guess
the description and URI could stand to be updated.  This page
goes well beyond Latin scripts (e.g., Cyrillic, Greek, Thai,
Vietnamese).

However, information on 16-bit Asian charsets is not included.
Does anyone know of a resource equal to this one in usability
and completeness for the CJK scripts? Thanks.


---
Gary Grosso
[EMAIL PROTECTED]
Arbortext, Inc.
Ann Arbor, MI, USA




Mixing alphabets (was: sorting my CD collection)

2000-08-10 Thread 11digitboy

You have a good point:  does nu-alpha-tau-alpha-sigma-alpha
spell "Natasa" or "Natasha"? The Greek letters given
are obviously an attempt to write "Natasha" in Greek,
but they romanize to "Natasa".

And a, b, c, d, e, f, g, h, ... HATES a, i, u, e,
o, ka, ki, ku, ...

Maybe I should just capitalize everything (except
Georgian? ... not that I have any Georgian CDs, or
am likely to... I bet few things would be rarer than,
say, a Georgian female rap CD in the US!!) and from
there, just sort by codepoint number... no good,
"Á" would come after "Z"...

Would somebody PLEASE tell me, IN THE DEFAULT UNICODE
COLLATION ALGORITHM, WHAT COMES AFTER WHAT?! I could
use a list of Unicode characters in proper collation
order, with "ties" labeled!!
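
The problem with raw code point order is easy to reproduce. The Python sketch below also shows a crude workaround: sort on the string with accents and case removed, and use the original string as a tie-breaker. This is not the Unicode Collation Algorithm, only a stand-in to show the idea of comparing base letters first; the function name and the sample names are made up.

```python
import unicodedata


def rough_key(text):
    """Crude sort key: compare without accents or case first."""
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # The original text breaks "ties" between strings with the same base.
    return (base.casefold(), text)


names = ["Zebra", "Ángel", "Abba", "Ärzte"]

# Code point order: the accented capitals land after "Z".
print(sorted(names))
# Rough key: the accented capitals sort next to plain "A".
print(sorted(names, key=rough_key))
```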

--
Robert Lozyniak
Accusplit pedometer manufactures can go suck eggs
My page: http://walk.to/11
[EMAIL PROTECTED] - email
(917) 421-3909 x1133 - voicemail/fax







Re: Mixing alphabets (was: sorting my CD collection)

2000-08-10 Thread Rick McGowan

> I bet few things would be rarer than,
> say, a Georgian female rap CD in the US!!

Tobacco chewing killer whales in Piccadilly Circus, surely.

> Would somebody PLEASE tell me, IN THE DEFAULT UNICODE
> COLLATION ALGORITHM, WHAT COMES AFTER WHAT?!

Read the technical report!  (It's available on-line.)

Rick

 



RE: Arabic shaping behavior questions

2000-08-10 Thread Roozbeh Pournader


On Thu, 10 Aug 2000 [EMAIL PROTECTED] wrote:

> Thanks! The - Iran System -: that's what I meant. It was on the tip of my
> tongue, but I could not recall the name (BTW, is it also called ISIRI?).

ISIRI is the Iranian standards organization. ISIRI 2900 was the old
charset standard, and ISIRI 3342 is the latest. ISIRI 3342 is cited
as the source of the Unicode characters ZWJ and ZWNJ, known as Pseudo
Connection and Pseudo Space there.

About character to letter ratio, it is 1 to 4 characters per letter in
Iran System, 2 char/letter in ISIRI 2900, and 1 char/letter in ISIRI 3342.

--roozbeh





Re: Mixing alphabets (was: sorting my CD collection)

2000-08-10 Thread Michael \(michka\) Kaplan

Once again, if collation info is what you want, see

http://www.unicode.org/unicode/reports/tr10/

Beyond that, it is unclear what you are looking for, really. But if you were
to actually read and try to understand that document, I am fairly certain
that one of two things will happen:

1) You will find the answer to your question, or

2) You will be able to frame the question more clearly

I am betting on #1, actually, as the most likely outcome. :-)

michka

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/






RE: Mixing alphabets (was: sorting my CD collection) [Collate][CJ

2000-08-10 Thread Ayers, Mike


> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]]
 
> You have a good point:  does nu-alpha-tau-alpha-sigma-alpha
> spell "Natasa" or "Natasha"? The Greek letters given
> are obviously an attempt to write "Natasha" in Greek,
> but they romanize to "Natasa".
> 
> And a, b, c, d, e, f, g, h, ... HATES a, i, u, e,
> o, ka, ki, ku, ...
> 

To throw a big fat monkey wrench into the whole thing, what if you
buy a CD from Jacky Cheung?  He records albums in both Cantonese, where
Cheung is the standard Romanization of his last name, and Mandarin, where
the name is romanized as Zhang!  Which Romanization will you choose?  With
Jacky, you could opt for putting his Mandarin records under Zhang and his
Cantonese records under Cheung, but what if you buy the latest from Faye
Wang/Wong (M/C), which has both Mandarin and Cantonese songs on it?

For reasons like this, I don't think there will ever be any
all-encompassing collation sequences that get much use.  What I think would
work much better would be the warehouse method: number each CD without
regard to order and log the CD in a database which permits multiple
language-tagged entries for title and artist (so Japan's Dreams Come True can be
logged under both their English and Katakana spellings), then look up CDs
through localized search engines.  This would definitely take a bit more
time, but at least you could find everything.
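
A minimal sketch of that warehouse idea, assuming an in-memory index
rather than a real database. The Catalog class, its method names, and
the exact-match lookup are all made up for illustration; the language
tags are just labels attached to each entry.

```python
from collections import defaultdict


class Catalog:
    """Warehouse-style CD catalog: shelf numbers carry no ordering."""

    def __init__(self):
        self._next_number = 1
        self._discs = {}                # shelf number -> list of entries
        self._index = defaultdict(set)  # (lang, folded text) -> shelf numbers

    def add(self, entries):
        """Log a CD; entries is a list of (language_tag, field, text)."""
        number = self._next_number
        self._next_number += 1
        self._discs[number] = list(entries)
        for lang, _field, text in entries:
            self._index[(lang, text.casefold())].add(number)
        return number

    def find(self, lang, text):
        """Exact-match lookup among the entries tagged with one language."""
        return sorted(self._index.get((lang, text.casefold()), ()))


catalog = Catalog()
catalog.add([
    ("en", "artist", "Dreams Come True"),
    ("ja", "artist", "ドリームズ・カム・トゥルー"),
])
print(catalog.find("en", "dreams come true"))
print(catalog.find("ja", "ドリームズ・カム・トゥルー"))
```

Both lookups land on the same shelf number, which is the whole point:
the physical order never has to reconcile the two spellings.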


/|/|ike



Re: Zero-width ligator

2000-08-10 Thread Asmus Freytag

At 09:36 AM 8/10/00 -0800, Roozbeh Pournader wrote:
>That seems problematic to me, when used for Arabic. How should one use
>ZWNJ between two Arabic letters to stop the ligature? They'll get
>disconnected!

(in those rare cases...)

Use ZWJ ZWNJ ZWJ and you will get the intended effect.
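
As a rough illustration of what that sequence looks like in a string
(the helper name is made up, and whether a given renderer honors the
sequence is a separate question):

```python
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200D"   # ZERO WIDTH JOINER


def break_ligature_keep_joining(first, second):
    """Insert ZWJ ZWNJ ZWJ between two letters, as suggested above."""
    return first + ZWJ + ZWNJ + ZWJ + second


# ARABIC LETTER LAM followed by ARABIC LETTER ALEF normally ligates.
text = break_ligature_keep_joining("\u0644", "\u0627")
print([f"U+{ord(ch):04X}" for ch in text])
```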

A./

Technical Vice President
The Unicode Consortium



Re: Zero-width ligator

2000-08-10 Thread Roozbeh Pournader



On Thu, 10 Aug 2000, Asmus Freytag wrote:

> Use ZWJ ZWNJ ZWJ and you will get the intended effect.
>
> Technical Vice President
> The Unicode Consortium

Official answer?! too bad for us...




Windows codepages

2000-08-10 Thread Peter_Constable

Anybody happen to know: Is there no Win32 API that takes a LANGID and
returns a codepage? Or, alternately, that takes a charset value and returns
a codepage? (I.e. takes one of the two parameters provided by
WM_INPUTLANGCHANGE.)



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





codepages on Windows

2000-08-10 Thread Peter_Constable

Anybody happen to know: Is there no Win32 API that allows you to determine
a codepage given a LANGID or a charset value (i.e. one of the two
parameters provided by WM_INPUTLANGCHANGE)?



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





Re: Windows codepages

2000-08-10 Thread Michael \(michka\) Kaplan

Yes, there is a way to extract this info using GetLocaleInfo. Pass the LCID
or langid with one of the following params:

LOCALE_IDEFAULTANSICODEPAGE
LOCALE_IDEFAULTCODEPAGE
LOCALE_IDEFAULTEBCDICCODEPAGE (win2k only)
LOCALE_IDEFAULTMACCODEPAGE
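
A minimal sketch of such a call, written in Python with ctypes only so
it can be tried quickly. The helper name default_codepage is made up,
the constant values are the ones I believe winnls.h defines, and a
LANGID is passed directly as the LCID on the assumption that the
default sort order is wanted. Off Windows the helper just returns None.

```python
import ctypes
import sys

# LCTYPE values as defined in the Windows SDK header winnls.h.
LOCALE_IDEFAULTCODEPAGE = 0x000B      # default OEM code page
LOCALE_IDEFAULTANSICODEPAGE = 0x1004  # default ANSI code page


def default_codepage(lcid, lctype=LOCALE_IDEFAULTANSICODEPAGE):
    """Return the code page number for lcid, or None if unavailable."""
    if sys.platform != "win32":
        return None  # GetLocaleInfo exists only on Windows
    buf = ctypes.create_unicode_buffer(16)
    written = ctypes.windll.kernel32.GetLocaleInfoW(lcid, lctype, buf, len(buf))
    # The API hands the code page back as a string of decimal digits.
    return int(buf.value) if written else None


# 0x0409 is the LANGID for US English.
print(default_codepage(0x0409))
```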

michka

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/






Re: Windows codepages

2000-08-10 Thread Peter_Constable


Sorry about the duplicate message. Notes was hanging on me, and I thought
the first one was lost.







Is there a keyboard layout driver supported for Georgian?

2000-08-10 Thread Magda Danish (Unicode)



-Original Message-
From: Irakli Tskhvedadze [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, August 09, 2000 11:42 AM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: ASRIOS-Georgia


Dear Sir,

ASRIOS is a Georgian company. For our projects we need Georgian language
which is supported by Unicode. We have standard Arial Unicode MS that
has Georgian language inside, BUT there is no keyboard layout driver
supported for Georgian language.

Could you please support us with the answer on this question.

Sincerely,

Irakli Tskhvedadze
Vice-President of the Association of System Reforms and
Informatization of Organizational Structures





Re: Is there a keyboard layout driver supported for Georgian?

2000-08-10 Thread Michael \(michka\) Kaplan

Windows 2000 actually has a Georgian keyboard layout, and a font that many
people in Georgia (and also I) find more visually pleasing than Arial
Unicode MS (Sylfaen). See

http://www.microsoft.com/globaldev/keyboards/keyboards.asp

for the layout. Note that the keyboard layout and font only support
mkhedruli, not khutsuri (though for one client I needed to support khutsuri
and did so through an owner draw control that simply took the
non-functioning shift key and gave it a function).

If I thought there was more demand, I would be trying to sell it. 

michka

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/






Re: Is there a keyboard layout driver supported for Georgian?

2000-08-10 Thread Michael \(michka\) Kaplan

I should add that for that project, since they did not have Windows 2000 but
instead had NT4, the actual work ended up being to support a Georgian
keyboard layout on NT4 (the argument was that as long as it was mucking with
the uppercase, why not muck with the lowercase, too? ). Mainly liturgical
documents (I do not think khutsuri has any other real usage), but they used
the faux layout for modern ones as well once they found out I could support
it.

It happened during a time that I was learning about the language and proved
to be a wonderful technical project to immerse myself in. I do admit
"stealing" the layout from Windows 2000, but I believe they got it from
standard Georgian usage, anyway. :-)

michka






Re: RFC 1766

2000-08-10 Thread Doug Ewell

I did a little homework on this topic and made some discoveries.

First, a colleague of mine at work checked with his friend, the ANSI
representative to ISO TC212, who claimed that all references to ISO
standards in standards-track documents are to the most current version
of that standard.  I don't know if he meant in *any* standards-track
documents or only in other ISO standards, but the point was that nobody
is supposed to be forced to follow ISO standards that have been
superseded.

Second, and more to the point, I poked around the Everson Gunn Teoranta
site for a while -- amazing, the things you find in there! -- and
discovered that an Internet Draft revision of RFC 1766 is in the works.
(Look for "draft-alvestrand-lang-tags-v2-01.txt" at IETF or other fine
FTP sites.)  There is lots of new stuff to read: you will find, for
instance, that ISO 639-2 three-letter language codes, as well as ISO
639(-1) two-letter language codes, will soon be allowed in language
tags, and that these tags will also support ISO 3166-2 region codes and
ISO/DIS 15924 script codes (presumably after 15924 gets out of the DIS
stage).

This is also where I learned the rather surprising news that the ISO
639(-1) list of two-letter codes will soon be frozen, meaning in
particular that no new two-letter codes will be assigned to languages
that already have a three-letter code.  This means that some major,
significant languages like Turkish and Yoruba will never get two-letter
codes, which seems odd somehow.  Fortunately, the expansion of this I-D
to include ISO 639-2 means that it can finally be used to encode Turkish
et al. after all.

What is most relevant to this discussion, though, is the way the I-D
handles the issue of updated ISO standards.  Note the following wording
from the I-D:

> All 2-letter tags are interpreted according to *ISO 639:1988*, "Code
> for the representation of names of languages" [ISO 639] *and
> subsequent additions made by its Registration Authority*.
>
> *Note: this is currently under revision as ISO/DIS 639-1:2000.*
>
> All 3-letter tags are interpreted according to *ISO 639-2:1998*,
> "Code for the representation of names of languages -- Part 2: Alpha-3
> code" [ISO 639-2] *and subsequent additions made by its Registration
> Authority*.
(all emphasis original)

This means Alvestrand and colleagues recognized the ambiguity of
specifying ISO 639:1988 in the original RFC 1766 and have explicitly
permitted codes from updated lists in the new I-D.
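
To make the rule concrete, here is a small sketch of how an
implementation might decide which code list governs a tag's primary
subtag. The function name and the returned labels are my own, and the
sketch looks only at the length of the subtag; it does not check the
code against the actual ISO lists.

```python
def primary_subtag_source(tag):
    """Name the code list that governs a language tag's primary subtag."""
    primary = tag.split("-")[0]
    if not (primary.isascii() and primary.isalpha()):
        return "invalid"
    if len(primary) == 2:
        return "ISO 639 two-letter code"
    if len(primary) == 3:
        return "ISO 639-2 three-letter code"
    return "not covered by the quoted rules"


for tag in ("en-US", "haw", "x-klingon"):
    print(tag, "->", primary_subtag_source(tag))
```

Because the draft defers to "subsequent additions made by its
Registration Authority", a real validator would need to load the
current code lists rather than hard-code the 1988 or 1998 tables.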

Of course, the real hazards come whenever a standard (XML specification,
Unicode Technical Report, or whatever) relies on an Internet RFC.  For
one thing, an RFC is not "updated"; rather, it is "obsoleted" by a new
RFC with a new number.  If your standard or spec references an RFC, it
is outdated as soon as the referenced RFC is replaced.  How many
documents are still out there that claim that MIME is defined by RFCs
1521 and 1522?  It was, once upon a time, but those RFCs were replaced
almost four years ago by RFCs 2045 through 2049.

Another problem is that RFCs are not necessarily written with the same
attention to detail, precision, and completeness as ISO or national
standards.  Some are written very well indeed, but there are no
guarantees.  The present problem with imprecise wording in RFC 1766 is
evidence of this.

The mere fact that a document exists as an RFC should mean little.
Remember that RFCs used to be written to announce the postponement of
local meetings due to schedule conflicts and such (and of course there
are the famous "joke" RFCs like "ARPAWOCKY," which are nonetheless part
of the "official" RFC series right alongside RFCs 2152, 2279, and our
friend 1766).

Eric Raymond writes glowingly in the Jargon File about the advantage
that RFCs have over the "more formal, committee-driven process" of ISO
and national standards, but the flip side of that coin is that the less
formal RFC process sometimes results in documents with holes and
implicit assumptions of the kind that drive me, Mike Brown, and probably
many more of you crazy.

Reading RFC 1766, the new I-D that is destined to replace 1766 in the
near future, and UTR #7 all leads me to conclude that UTRs should
*definitely* be revised to refer directly to ISO standards whenever
possible, instead of Internet RFCs, even if the idea comes from an RFC
and significant wording must be cut and pasted from an RFC.  If that
happens, then the UTC (which we know is up to the task) can take the
responsibility for updates and clarifications, so that ambiguities of
the type Mike has been experiencing with the XML spec will not plague
implementors of the Unicode Standard.

-Doug Ewell
 Fullerton, California