Re: Bengali: variants of same conjunct

2000-06-22 Thread Antoine Leca

Michael Kaplan wrote:
 
 Thus far it is something that has been implemented in the fonts, rather than
 anywhere else; for example, there are several ligatures in Tamil that will
 display one way with the Latha font and the other way with Monotype Tamil
 Arial (the way set out in Unicode 3.0 is done in the latter).
 
 Thus since people who write the language sent both,
<cut>

Do you mean that Tamil writers *purposely* use both the "ancient" and the
"modern" forms in the same document?
What is the intent?


I can see a similar (but far less acute) problem with Latin lowercase a,
which can have two forms (and similarly for the g or the ae ligature).
For a, one can at the extreme limit use U+0251 for the alternate, but
outside IPA I do not see any use for this distinction.
For g or æ, I do not see any way to specify that one wants the rounded
(script, italic) form for the left part, or the printed-like or upright
form.
OTOH, I do not see anyone having a problem with that. In fact, I myself
don't mix them (except for IPA), even if depending on context I may use
one or another form when writing.
And I believe this is entirely a rendering problem that is (far) outside
Unicode's scope.


Antoine



RE: Case mapping errors?

2000-06-22 Thread Karlsson Kent - keka

(This message is sent in UTF-8.  Flames regarding that fact
will be deleted without response.)


No, those case mappings are not in error.  Nor are their
canonical mappings in error.  (The MICRO SIGN would have
had a canonical mapping to Greek mu, if it had not been
included in such much-used repertoires as Latin-1.)

For the PROSGEGRAMMENI it's my understanding that it is
customary (in e.g. dictionaries) to capitalise it the way
it is done in Unicode.  (But I don't know classical Greek.)

The MICRO, OHM, KELVIN, and ANGSTROM (ÅNGSTRÖM, really) SIGNs
are included in Unicode for compatibility reasons only.  You
should not use them but use the characters that they canonically
(or 'near canonically' in the case of MICRO SIGN) decompose to.
Note that there are many (SI or other) unit names that are *not*
included as separate characters, like symbols for Watt, Volt, etc.
Nor is there any need to include them.  Those symbols are just
letters reused for unit symbols.  The case mappings for these
signs derive from the characters that they (near) canonically
map to.  It's true that you should never case change a unit
symbol or unit prefix symbol, but that goes for W, V, m, M, etc.
too, even though those can only be represented by "LETTER"
characters.

As far as I know, the inclusion of the MICRO and OHM signs
derives from their inclusion in repertoires that otherwise
contain only Latin letters (and punctuation); apparently
someone found these Greek letters important enough (for use
in writing unit designations) to include those two Greek letters
with a name reflecting why they were included.  This does
not remove the fact that they are really just ordinary Greek
letters.  For the Kelvin and Ångström I can only speculate as
to why they were included in a source (for Unicode) Korean
encoding.  My theory is that the Kelvin sign started out as
a DEGREE KELVIN in analogy with the DEGREE CELSIUS and
DEGREE FAHRENHEIT signs (which have a (small) justification as
ligatures, especially in CJK typography), until someone pointed
out that it's not called (nor written) "degree Kelvin" but just
"Kelvin". My theory about the Ångström sign's original inclusion
in that Korean encoding is that someone might have thought
that the A with a ring was not just a letter, but some
specially invented symbol (an easy mistake to make if you only know
that unit as "angstrom").  It's not a specially invented symbol,
it's just the first letter in Mr Ångström's name, just like for
Watt, Volt, Kelvin, ...

The case mappings are correct, but you should never apply
any case mapping to unit symbols that are letters.  Getting
software to "understand" what is a unit symbol (without
special markup) and what is not might be tricky when the
unit symbols are written with letters (as all SI units, except
the degree symbol, and many other units are)...  And no,
re-including all letters (or letter combinations) as "signs"
for each and every reuse letters have been put to (e.g.
unit signs) is not an appropriate solution.
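
To see this concretely, here is a minimal, illustrative Java sketch
(Java's Character class implements the Unicode case mappings; exact
results can depend on the Unicode version a given JDK tracks):

    // Illustrative: the compatibility "SIGN" characters case-map like
    // the ordinary letters they (near) canonically decompose to.
    public class SignCaseDemo {
        public static void main(String[] args) {
            System.out.println((int) Character.toLowerCase('\u212A')); // KELVIN SIGN -> 0x006B 'k'
            System.out.println((int) Character.toLowerCase('\u212B')); // ANGSTROM SIGN -> 0x00E5 'å'
            System.out.println((int) Character.toUpperCase('\u00B5')); // MICRO SIGN -> 0x039C capital mu
        }
    }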

Please, never use those "SIGN"s, except when mapping those
letters to character repertoires which do not contain the
proper Greek letters, but do contain those "SIGN"s.  Nor
should you use any of the other "squared" unit characters,
except when you absolutely have to get the "squared"
typographic effect (ugly, in my eyes) in CJK
typography from plain text.  Note, still, that there are
many (composite) (SI) unit designations that do not have
any "squared" character associated with them.  The "squared"
unit characters are a rather random collection, best forgotten.

Kind regards
/kent k


 -Original Message-
 From: John O'Conner [mailto:[EMAIL PROTECTED]]
 Sent: Thursday, June 22, 2000 12:15 AM
 To: Unicode List
 Subject: Case mapping errors?
 
 
 There are 5 characters that are giving me a little discomfort 
 because of their case mappings:
 
* U+00B5 MICRO SIGN
* U+1FBE GREEK PROSGEGRAMMENI
* U+2126 OHM SIGN
* U+212A KELVIN SIGN
* U+212B ANGSTROM SIGN
 
 Each of these has case mappings...and I really don't understand why. It
 appears that all of these have no "round-trip" capability to map back
 from another case. I suppose this can be argued for a lot of mappings.
 
 The most difficult cases are 2126, 212A, and 212B. These characters are
 "letter-like" in their glyph appearance, but it seems that their actual
 semantics are not. It seems like someone may have looked at KELVIN SIGN,
 for example, decided it looked like a Latin-1 'K' and gave it the same
 lowercase mapping. Still, would you really expect to lowercase a KELVIN
 SIGN to a small 'k'? I can't imagine...but I may not be as imaginative
 as some. I have the same argument for OHM SIGN and ANGSTROM SIGN.
 Although they have case mappings, are they expected by most people? If I
 were using the OHM, ANGSTROM, or KELVIN SIGN in my work, I would be very
 surprised if a case operation changed them...maybe I would 

Chinese characters in Java Applet

2000-06-22 Thread Parvinder Singh(EHPT)

Hello, 


I am trying to display Chinese characters, stored in Unicode format in an Oracle database, through a Java applet in the browser. The applet uses JDBC calls and the thin driver.

The Oracle database resides on a Sun Solaris server, but the applet is not showing the characters correctly. My browser has Chinese fonts.

Do I need to have something else at the client side? What additional things are needed to accomplish the Chinese character display in the applet?

Thanks and Rgds, 
Parvinder 





Re: UTF-8N?

2000-06-22 Thread Antoine Leca

John Cowan wrote:
 
 Now suppose we have a character sequence beginning with U+FEFF U+0020.
 This would be encoded as follows:
 
 US-ASCII: (not possible)
 UTF-16:   0xFE 0xFF 0xFE 0xFF 0x00 0x20 ...
 UTF-16:   0xFF 0xFE 0xFF 0xFE 0x20 0x00 ...
 UTF-16BE: 0xFE 0xFF 0x00 0x20 ...
 UTF-16LE: 0xFF 0xFE 0x20 0x00 ...
 UTF-8N:   0xEF 0xBB 0xBF 0x20 ...
 UTF-8B:   0xEF 0xBB 0xBF 0xEF 0xBB 0xBF 0x20 ...

There is something I must have missed.

It was my understanding that U+FEFF when received as first character should
be seen as BOM and not as a character, and handled accordingly.

So I expected:
  US-ASCII: 0x20
  UTF-16:   0xFE 0xFF 0x00 0x20 ...
  UTF-16:   0xFF 0xFE 0x20 0x00 ...
  UTF-16BE: 0xFE 0xFF 0x00 0x20 ...
  UTF-16LE: 0xFF 0xFE 0x20 0x00 ...
  UTF-8N:   0xEF 0xBB 0xBF 0x20 ...
  UTF-8B:   0xEF 0xBB 0xBF 0x20 ...
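
(A minimal sketch of that reading convention, in Java, purely as an
illustration: drop U+FEFF only when it is the very first character read.
Note that some decoders, e.g. Java's "UTF-16" charset, already consume
the BOM themselves, while UTF-8 decoders typically do not.)

    // Illustrative sketch: treat a leading U+FEFF as a BOM and drop it;
    // any later U+FEFF stays, as a real ZERO WIDTH NO-BREAK SPACE.
    import java.io.FileInputStream;
    import java.io.InputStreamReader;
    import java.io.Reader;

    public class BomStrippingReader {
        public static String readAll(String file, String enc) throws Exception {
            Reader r = new InputStreamReader(new FileInputStream(file), enc);
            StringBuffer sb = new StringBuffer();
            int c = r.read();
            if (c != -1 && c != 0xFEFF) sb.append((char) c); // keep a non-BOM first char
            while ((c = r.read()) != -1) sb.append((char) c);
            r.close();
            return sb.toString();
        }
    }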


Antoine



RE: Bengali: variants of same conjunct

2000-06-22 Thread Michael Kaplan (Trigeminal Inc.)

  Thus since people who write the language sent both,
 <cut>
 
 Do you mean that Tamil writers *purposely* use both the "ancient" and the
 "modern" forms in the same document?
 What is the intent?
 
yes, that is what I am saying. If you go to several of the Tamil resource
sites on the web, you can see both of them used, often in the same
documents. This is VERY easy to do with the hack fonts, significantly more
difficult if you are using Unicode-enabled fonts.


 And I believe this is entirely a rendering problem that is (far) outside
 Unicode's scope.
 
I do not see, if BOTH forms are in use and one form is not renderable in
a font that is Unicode compliant, how this would NOT be considered a Unicode
issue. It is crucial that language as used should be possible to render with
Unicode, should it not? The ligatures you mention do not really fall into
the same category as the Tamil case, since all of them can be rendered using
the 3.0 (or even the 2.0!) standard.

I do know that the Tamil Nadu government has specific issues with the Unicode
standard; is this not one of the issues? Or do they prefer only the usage
outlined in the standard, in order to encourage people to use it? And would
this then be a case of the standard being more involved in politics than
might be good?

Michael



Re: UTF-8N?

2000-06-22 Thread Peter_Constable




On 06/21/2000 03:09:43 PM [EMAIL PROTECTED] wrote:


Appropriate or not, users (you know, those people who don't read the
documentation that the programmers don't write) will use text editors
to split files.  They will then concatenate the files using a
non-Unicode-aware tool.  And they will complain that the checksums
mismatch.

I can't argue against that. I think the suggestion that BOM and ZWNBSP be
de-unified, which I have heard before, may make the best sense.



Peter Constable




Re: Case mapping errors?

2000-06-22 Thread Mark Davis

These characters are purely coded for compatibility. Unicode does not distinguish 
letters by the abbreviations that they happen to be used in. There is no difference in 
semantics between the "g" in "go" vs. the "g" in "12g", nor between the "Å" in "Århus" 
vs. the "Å" in "15Å", nor -- for that matter -- the "U" in "Underwood" vs the "U" in 
"UTF-8".

Mark

John O'Conner wrote:

 There are 5 characters that are giving me a little discomfort because of
 their case mappings:

* U+00B5 MICRO SIGN
* U+1FBE GREEK PROSGEGRAMMENI
* U+2126 OHM SIGN
* U+212A KELVIN SIGN
* U+212B ANGSTROM SIGN

 Each of these has case mappings...and I really don't understand why. It
 appears that all of these have no "round-trip" capability to map back
 from another case. I suppose this can be argued for a lot of mappings.

 The most difficult cases are 2126, 212A, and 212B. These characters are
 "letter-like" in their glyph appearance, but it seems that their actual
 semantics are not. It seems like someone may have looked at KELVIN SIGN,
 for example, decided it looked like a Latin-1 'K' and gave it the same
 lowercase mapping. Still, would you really expect to lowercase a KELVIN
 SIGN to a small 'k'? I can't imagine...but I may not be as imaginative
 as some. I have the same argument for OHM SIGN and ANGSTROM SIGN.
 Although they have case mappings, are they expected by most people? If I
 were using the OHM, ANGSTROM, or KELVIN SIGN in my work, I would be very
 surprised if a case operation changed them...maybe I would be
 disappointed or frustrated even. Are these bugs in the spec? Or do I
 just need to think about them a little differently?

 Best regards,
 John O'Conner




Re: UTF-8N?

2000-06-22 Thread Christopher John Fynn


[EMAIL PROTECTED] wrote:

 ... I think the suggestion that BOM and ZWNBSP be
 de-unified, which I have heard before, may make the best sense.

*If* that's the solution, it should be done yesterday. The longer it takes, the
more implementations (and data) there will be that need to be changed.

- Chris




Re: Chinese characters in Java Applet

2000-06-22 Thread Valeriy E. Ushakov

On Thu, Jun 22, 2000 at 02:20:39 -0800, Parvinder Singh(EHPT) wrote:

 I am trying to display Chinese characters stored in Unicode format in
 an Oracle database through a Java applet in the browser. The applet uses JDBC
 calls and the thin driver.
 The Oracle database resides on a Sun Solaris server, but the applet is not
 showing the characters correctly. My browser has Chinese fonts.
 
 Do I need to have something else at the client side? What additional things
 are needed to accomplish the Chinese character display in the applet?

Yes, you need to tell the client-side AWT which platform fonts to use.  I
posted sample font.properties entries for win32 just a few days ago;
Solaris is not very different.

If you missed that post of mine, just drop me a note and I'll forward
it to you.
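
(In case it helps, here is a minimal diagnostic sketch, assuming
Java 2 / JDK 1.2+ on the client; the class name and sample string are
just illustrative. It prints the font families that claim to be able
to display a Chinese sample string.)

    import java.awt.Font;
    import java.awt.GraphicsEnvironment;

    public class FontCheck {
        public static void main(String[] args) {
            String sample = "\u4E2D\u6587";  // "zhongwen" (Chinese)
            GraphicsEnvironment ge =
                GraphicsEnvironment.getLocalGraphicsEnvironment();
            String[] families = ge.getAvailableFontFamilyNames();
            for (int i = 0; i < families.length; i++) {
                Font f = new Font(families[i], Font.PLAIN, 12);
                // canDisplayUpTo returns -1 when the whole string is displayable
                if (f.canDisplayUpTo(sample) == -1)
                    System.out.println(families[i]);
            }
        }
    }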

SY, Uwe
-- 
[EMAIL PROTECTED] |   Zu Grunde kommen
http://www.ptc.spbu.ru/~uwe/|   Ist zu Grunde gehen



RE: How to distinguish UTF-8 from Latin-* ?

2000-06-22 Thread Robert A. Rosenberg

At 12:12 PM 06/20/2000 -0800, Kenneth Whistler wrote:
Bob Rosenberg wrote:

  
  This was my concern, there is no way to distinguish UTF-8 from Latin-1 in
  case of upper ASCII characters here.
 
  Yes there is - it's called a "Sanity Check". You parse the file looking for
  High-ASCII. If you find none - you are US-ASCII (or ISO-8859-1). Once you
  find one, you use the UTF-8 Suffix method to see how long the string should
  be IF it is UTF-8. Look at the next x characters to see if they have the
  correct suffix. If not, count as a Bad-UTF-8. If so, count as one
  Good-UTF-8. Once you roll off the end of the string resume scanning for
  another High-ASCII and do the check again. After finding 12 strings that
  start with High-ASCII (or bopping off the end of the file) check your
  GOOD/BAD counts. All BAD means ISO-8859-1. All GOOD means UTF-8.

Well, not necessarily. Granted, the distribution of precedent bytes and
successor bytes in UTF-8, when interpreted as ISO 8859-1, mostly results
in gibberish that is unlikely to appear in real text. The first byte of
a two-byte UTF-8 sequence consists essentially of an accented capital
letter in 8859-1 (0xC0..0xDF). And the successor bytes are either C1
controls or come from the set of miscellaneous symbols, currency signs,
punctuation, etc., that are rather unlikely to occur directly following
an uppercase accented Latin letter.

But if I invented a hoity-toity company name with extra accents for
"class", such as, L·DÏ·DÀ® Productions, Inc. and sent this to you in
ISO 8859-1, as I am currently doing, your sanity check will fail in
this case and identify this file as UTF-8, with 3 characters misinterpreted.
(i.e., L<bullet>D<Greek letter eta>D. Productions, Inc.) Of course, a
further check for irregular sequence UTF-8 would discover that 0xC0 0xAE
== U+002E is not shortest form UTF-8, and might, therefore, not actually
be UTF-8, but even that cannot really be relied on.

True, you can FAKE an incorrect evaluation by plugging a trick string into
an otherwise low-ASCII file/message. My comment was aimed at normal (not
faked) files. I agree that I missed the extra sanity check of looking for
shortest form, but if I remember the rules correctly, there is no
requirement that the shortest form be emitted - only a strong suggestion
to do so (with a stronger suggestion to accept it [ie: "Be liberal with
what you accept and conservative with what you create"]). I doubt that a
real ISO-8859-1 file could be mistaken for a UTF-8 one without it being
specially constructed to trick the sanity check. Note that the 12-string
"universe" is just an attempt to check for false positives and could be
adjusted for circumstances.
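
(To make the heuristic concrete, a sketch in Java with illustrative
names and the 12-sequence cap from above; this is a sniffing heuristic,
not a full UTF-8 validator:)

    // Count high-byte sequences that parse as well-formed UTF-8 ("good")
    // versus ones that do not ("bad"). All bad -> ISO-8859-1; all good
    // -> probably UTF-8.
    public class Utf8Sniffer {
        public static String guessEncoding(byte[] data) {
            int good = 0, bad = 0;
            for (int i = 0; i < data.length && good + bad < 12; i++) {
                int b = data[i] & 0xFF;
                if (b < 0x80) continue;                    // low ASCII, skip
                int len;                                   // expected sequence length
                if (b >= 0xC0 && b <= 0xDF) len = 2;
                else if (b >= 0xE0 && b <= 0xEF) len = 3;
                else if (b >= 0xF0 && b <= 0xF7) len = 4;
                else { bad++; continue; }                  // bare suffix byte etc.
                boolean ok = i + len <= data.length;
                for (int j = 1; ok && j < len; j++)
                    ok = (data[i + j] & 0xC0) == 0x80;     // 10xxxxxx suffix check
                if (ok) { good++; i += len - 1; } else bad++;
            }
            if (good + bad == 0) return "US-ASCII";
            if (bad == 0) return "UTF-8";
            if (good == 0) return "ISO-8859-1";
            return "ambiguous";
        }
    }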

  Mixed
  (with most being BAD) is ISO-8859-1 (the Goods are "noise"). Mostly Good
  with a few Bad are either malformed UTF-8 or ISO-8859-1 (with the bad luck
  of finding 2 byte strings that LOOK LIKE UTF-8).

Even entirely GOOD can have that bad luck, as this email itself
demonstrates.

Since this is a special message that was designed to spoof, not a real
message, I do not regard it as bad luck. If you can supply a set of normal
text that would give a false reading, I'd be much more willing to say that
my claim of just doing a sanity check was overly simplistic.


--Ken




RE: How to distinguish UTF-8 from Latin-* ?

2000-06-22 Thread Karlsson Kent - keka



 -Original Message-
 From: Robert A. Rosenberg [mailto:[EMAIL PROTECTED]]
...

[on overlong UTF-8 sequences, a few lines down:]
 faked) files. I agree that I missed the extra sanity check of looking
 for shortest form, but if I remember the rules correctly, there is no
 requirement that the shortest form be emitted - only a strong
 suggestion to do so (with a stronger suggestion to accept it [ie: "Be
 liberal with what you accept and conservative with what you create"]).


Well, there is a security aspect to this: sometimes given texts
need to be scanned to try to determine if they are "harmless"
or may trigger some undesirable interpretation (as interpreted
program code, like shell script, for instance).  A hacker may
try to hide characters that trigger the undesired, and potentially
dangerous, interpretation by using overlong UTF-8 sequences.
If the security scanner does not "decode" overlong UTF-8
sequences, but the interpreter accepts them as if nothing
were wrong, things you would not like to happen might happen.
So overlong UTF-8 sequences should be regarded as errors, and
not as a coding for any character at all.  Yes, you may regard
systems that have "escapes" into "execute this" mode at all
as ill-designed.  But they are around.
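
(A minimal illustration of catching the classic overlong forms rather
than decoding them; the method name is hypothetical. For instance,
0xC0 0xAE, the overlong encoding of "." mentioned earlier in this
thread, is caught by the first test.)

    // Flag overlong UTF-8 lead/continuation combinations as errors
    // instead of decoding them to a character. Assumes i indexes a
    // lead byte and a continuation byte follows where tested.
    static boolean isOverlong(byte[] s, int i) {
        int b0 = s[i] & 0xFF;
        if (b0 == 0xC0 || b0 == 0xC1) return true;                // 2 bytes for U+0000..U+007F
        if (b0 == 0xE0 && (s[i + 1] & 0xFF) < 0xA0) return true;  // 3 bytes for < U+0800
        if (b0 == 0xF0 && (s[i + 1] & 0xFF) < 0x90) return true;  // 4 bytes for < U+10000
        return false;
    }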

Kind regards
/kent k



Re: UTF-8N?

2000-06-22 Thread John Cowan

"Ayers, Mike" wrote:

 Am I reading this wrong?  Here's what I get:
 
 I hand you a UTF-16 document.  This document is:
 
 FE FF 00 48 00 65 00 6C 00 6C 00 6F
 
 ..so it says "Hello".  Then I say, "Oh, by the way, that's
 big-endian."  *POOF*  The content of the document has changed, and there is
 now a 'ZERO WIDTH NO BREAK SPACE' at the beginning.  Smells pretty skunky...

No, what you have said is that this document is in "UTF-16BE" encoding.
That's a name for an encoding that is known a priori to be BE, and does
not permit a BOM.  It is not the name for an encoding that has a BOM but
just happens to be BE.

Since you have changed the encoding, the content has naturally
changed too, just as if you had declared an 8859-1 document
to be 8859-2.

 BTW, what is a ZWNBSP anyway?  From here it seems like a
 non-character.  Is there an actual use for it? 

Yes.  It indicates that a line break may not be introduced at this point.
It is similar to the NO-BREAK SPACE (U+00A0), which you may be familiar
with under its HTML name of &nbsp;, except that it doesn't produce any actual
whitespace.  ZWNBSP is useful in languages that don't use whitespace, and
in strings like "M.T.A." where a line breaker might be tempted to break after
a period.

Its opposite number is ZWSP (U+200B), which likewise doesn't generate any
actual whitespace, but indicates that line breaking *is* permitted here.
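
(A tiny, purely illustrative example: in a Java string literal,
putting U+FEFF after each period asks the renderer not to break
there, without adding any visible space.)

    String s = "M.\uFEFFT.\uFEFFA.";  // discourage line breaks after the periods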

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan [EMAIL PROTECTED]
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,   || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.-- Coleridge (tr. Politzer)



Re: UTF-8N?

2000-06-22 Thread John Cowan

Kenneth Whistler wrote:

 Now we are pushing through the long, bureaucratic process of getting
 this accepted into 10646-1, so that we maintain synchronicity with a
 joint publication of it as a *standard* character.

So a fair statement of what you hope to achieve is: U+2060 will be
the zero-width non-breaking space, or zero-width word joiner depending on
how you look at it, and U+FEFF will be a byte order mark, which MAY
(but SHOULD NOT) be used with the same semantics as U+2060.

-- 

Schlingt dreifach einen Kreis um dies! || John Cowan [EMAIL PROTECTED]
Schliesst euer Aug vor heiliger Schau,  || http://www.reutershealth.com
Denn er genoss vom Honig-Tau,   || http://www.ccil.org/~cowan
Und trank die Milch vom Paradies.-- Coleridge (tr. Politzer)



RE: UTF-8 BOM Nonsense

2000-06-22 Thread Michael Kaplan (Trigeminal Inc.)

I agree, Gary.

Windows 2000 Notepad, however, does not agree, and writes one.

Since Notepad in prior versions of Windows was in fact the de facto standard
HTML editor <g>, clearly it is a program to be reckoned with. People
should be aware of the fact that there are going to be MANY files out there
that are UTF-8 and do have a BOM.

I do not believe that this will require it to be added to a standard, and
this is non-standard usage, but life is about dealing with things as they
are (and this is how they are!).

Michael

 --
 From: Gary L. Wade[SMTP:[EMAIL PROTECTED]]
 Sent: Thursday, June 22, 2000 9:08 AM
 To:   Unicode List
 Subject:  UTF-8 BOM Nonsense
 
 Please!
 
 After hundreds of e-mails on this topic, let it die!
 
 The BOM is only useful with UTF-16 or UCS-4 characters.
 
 There is no such thing as byte ordering when each character is a byte or
 a multibyte sequence with a well-documented ordering denoting how to
 interpret this!  For further reference, turn to page 20 in the Unicode
 3.0 book and let us get back to more important things, such as how to
 represent the price of tea in China!  ;-)
 -- 
 Gary L. Wade
 Product Development Consultant
 DesiSoft Systems | Voice:   214-642-6883
 9619 E. Valley Ranch Parkway | Fax: 972-506-7478
 Suite 2125   | E-Mail:  [EMAIL PROTECTED]
 Irving, TX 75063 |
 



Java, SQL, Unicode and Databases

2000-06-22 Thread Tex Texin

I want to write an application in Java that will store information
in a database using Unicode. Ideally the application will run
with any database that supports Unicode. One would presume that the
JDBC driver would take care of any differences between databases
so my application could be independent of database.
(OK, I know it is a naive view.)

However, I am hearing that databases from different vendors require
use of different datatypes or limit you to using certain datatypes
if you want to store Unicode. Changing datatypes would, I presume, make
a significant difference in my programming of the application...

So, I want to make a list of the changes I need to make to 
my Java, SQL application in the event I want to
support each of the major databases (Oracle 8i, MS SQL Server 7,
etc.) with respect to Unicode data storage.

(I am sure there are other differences programming to different
databases, independent of Unicode data, but those issues are
understood.)

So, if you can help me by identifying specific changes you would make
to query or update a major vendor's database with respect to Unicode
support, I would be very appreciative. If I get a good list, I'll
post it back here. I am most interested in Oracle and MS SQL Server,
but will collect info on any database.

As an example, I am hearing that some databases would require varchar,
others nchar, for Unicode data.
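
(A hedged sketch of the kind of divergence involved; the table,
column names, and type choices below are illustrative only and depend
on database version and character-set configuration.)

    // Same logical table, different column types per vendor for
    // Unicode data. All names here are hypothetical.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class CreateUnicodeTable {
        public static void main(String[] args) throws Exception {
            Connection con = DriverManager.getConnection(args[0]); // JDBC URL
            Statement st = con.createStatement();
            String vendor = con.getMetaData().getDatabaseProductName();
            if (vendor.startsWith("Oracle")) {
                // Oracle: NCHAR/NVARCHAR2, or plain VARCHAR2 if the
                // database character set is already UTF8
                st.executeUpdate(
                    "CREATE TABLE names (id NUMBER, name NVARCHAR2(100))");
            } else {
                // MS SQL Server: NCHAR/NVARCHAR (stored as UCS-2)
                st.executeUpdate(
                    "CREATE TABLE names (id INT, name NVARCHAR(100))");
            }
            st.close();
            con.close();
        }
    }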

tex


-- 

Tex Texin Director, International Products
 
Progress Software Corp.   +1-781-280-4271
14 Oak Park   +1-781-280-4655 (Fax)
Bedford, MA 01730  USA[EMAIL PROTECTED]

http://www.progress.com   The #1 Embedded Database
http://www.SonicMQ.comJMS Compliant Messaging- Best Middleware
Award
http://www.aspconnections.com Leading provider in the ASP marketplace

Progress Globalization Program (New URL)
http://www.progress.com/partners/globalization.htm

Come to the Panel on Open Source Approaches to Unicode Libraries at
the Sept. Unicode Conference
http://www.unicode.org/iuc/iuc17



English as she is spoke

2000-06-22 Thread mark . davis



I got some amusing results when I tried out the Altavista translation
service on segments of the new language descriptions in
http://www.unicode.org/unicode/standard/WhatIsUnicode.html

Original (English):

   What is Unicode? Unicode provides a unique number for every character,
   no matter what the platform, no matter what the program, no matter what
   the language.

   Fundamentally, computers just deal with numbers. They store letters and
   other characters by assigning a number for each one. Before Unicode was
   invented, there were hundreds of different encoding systems for
   assigning these numbers. No single encoding could contain enough
   characters: for example, the European Union alone requires several
   different encodings to cover all its languages. Even for a single
   language like English no single encoding was adequate for all the
   letters, punctuation, and technical symbols in common use.

   These encoding systems also conflict with one another. That is, two
   encodings can use the same number for two different characters, or use
   different numbers for the same character. Any given computer (especially
   servers) needs to support many different encodings; yet whenever data is
   passed between different encodings or platforms, that data always runs
   the risk of corruption.

Hand-translated into German, on that page:

   Was ist Unicode? Unicode gibt jedem Zeichen seine eigene Nummer,
   platformunabhängig, programmunabhängig, sprachunabhängig.

   Grundsätzlich arbeiten Computer nur mit Zahlen. Buchstaben und andere
   Zeichen werden daher Zahlen zugeordnet, um sie zu speichern. Vor der
   Erfindung von Unicode gab es hunderte unterschiedlicher
   Kodierungssysteme. Keines dieser Kodierungssyteme umfasste je genug
   Zeichen: so braucht die Europäische Union allein mehrere
   Kodierungssysteme, um damit den Bedarf für die Sprachen aller
   Mitgliedsländer abzudecken. Nicht einmal für eine einzelne Sprache wie
   Englisch oder Deutsch gab es ein Kodierungssystem das wirklich alle
   Buchstaben, Interpunktionszeichen und alle gebräuchlichen technischen
   Zeichen umfasste.

   Diese Kodierungssysteme sind untereinander unverträglich, denn
   unterschiedliche Kodierungen können dieselbe Zahl für verschiedene
   Zeichen benutzen, oder verschiedene Zahlen für dasselbe Zeichen. Jeder
   Rechner (vor allem Server) muß viele verschiedene Kodierungssysteme
   unterstützen; und wenn Text zwischen verschiedenen Kodierungssystemen
   oder Rechnersystemen ausgetauscht wird, läuft dieser Text Gefahr,
   verstümmelt zu werden.


Altavista: German to English:

   What is university code? University code gives its own number,
   platformunabhängig, to each character programmunabhängig,
   sprachunabhängig. Grundsätzlich operate computers only with numbers.
   Letters and other characters are assigned numbers in order to store it.
   Before the invention of university code there were hundred different
   coding systems. None this Kodierungssyteme covered ever enough
   characters: thus the Europäi union needs alone several coding systems,
   in order to cover with it the requirement für the languages of all
   Mitgliedsländer. Not even für an individual language such as English or
   German gave it a coding system that really all letters, punctuation
   characters and all gebräuchlichen technical characters covered. These
   coding systems are among themselves unverträglich, because different
   coding können the same number für different characters use, or different
   numbers für the same character. Each computer (above all server) muß
   many different coding systems unterstützen; and if text between
   different coding systems or computer systems is exchanged, this text
   danger läuft to be verstümmelt.


Altavista: English to German:

   Was ist Unicode? Unicode stellt eine eindeutige Zahl für jedes Zeichen,
   egal was die Plattform, egal was das Programm, egal was die Sprache zur
   Verfügung. Grundlegend beschäftigen Computer gerade Zahlen. Sie
   speichern Zeichen und andere Zeichen, indem sie eine Zahl für jede
   zuweisen. Bevor Unicode erfunden wurde, gab es Hunderte der
   unterschiedlichen verschlüsselsysteme für das Zuweisen dieser Zahlen.
   Keine kodierung konnte genügende Zeichen enthalten: z.B. benötigt der
   europäische Anschluß alleine einige unterschiedliche Encodings, alle
   seine Sprachen zu umfassen. Sogar für eine einzelne Sprache wie Englisch
   war keine kodierung für alle Zeichen, Interpunktion und technischen
   Symbole in allgemeinem Gebrauch ausreichend. Diese verschlüsselsysteme
   widersprechen auch miteinander. Das heißt, können zwei Encodings die
   gleiche Zahl für zwei unterschiedliche Zeichen verwenden, oder verwenden
   Sie unterschiedliche Zahlen für das gleiche Zeichen. Irgendwelche
   gegebenen Notwendigkeiten des Computers (besonders Servers),viele
   unterschiedliche Encodings zu unterstützen; dennoch, wann immer Daten
   zwischen 

Re: UTF-8N?

2000-06-22 Thread Peter_Constable




On 06/21/2000 06:33:57 PM [EMAIL PROTECTED] wrote:

 The standard doesn't ever discuss the BOM in the context of UTF-8,

See section 13.6 (page 324).

Sure enough. Well, there you go: the confusion is officially sanctioned!



Peter Constable




Re: Bengali: variants of same conjunct

2000-06-22 Thread Antoine Leca

Michael Kaplan wrote:
 
   Thus since people who write the language sent both,
  <cut>
 
  Do you mean that Tamil writers *purposely* use both the "ancient" and the
  "modern" forms in the same document?
  What is the intent?
 
 yes, that is what I am saying.

Okay, I did not know (and I did not notice any example thereof; but I do not
read Tamil either ;-)).

But what is the semantic intent, then?
In other words, what might the use of the "elephant-trunk" ai vs. the "normal" one mean?
What might the use of the rounded naa vs. the "normal", two-part, one mean?

Are we talking about that, by the way? And are there any other differences?


[The different forms for Latin a, g or æ]

  And I believe this is entirely a rendering problem that is (far) outside
  Unicode's scope.

 I do not see how, if BOTH forms are in use and one form is not renderable in
 a font that is Unicode compliant, how this would NOT be considered a Unicode
 issue.

Because there is no semantic difference between them.

Similarly, if you use a font like Poetica, there are a vast number of
different glyphs for "&". Does anyone consider encoding this in Unicode?


 It is crucial that language as used should be possible to render with
 Unicode, should it not?

I disagree.
For example, when I want to insist on one point, I use several techniques.
When I speak, I speak louder and a bit slower; when I write a note,
I use a bolder font; on the Internet, I use asterisks. All of these are part
of the language, and as such are to be kept with the text. But I do not
believe they have to be encoded in Unicode: this would simply lead too far
in a multi-language world.

Usage of glyphic variations is in my mind even less significant, so
should also be dropped.


 The ligatures you mention do not really fall into the same category as
 the Tamil case, since all of them can be rendered using the 3.0 (or
 even the 2.0!) standard.

Please explain to me how you render the script form of æ using a
standard upright font like Helvetica (not the expert variation).
Or else the two-bowl form of g with Courier?

Or did I miss your point?

 
 I do know that the Tamil Nadu government has specific issues with the
 Unicode standard, is this not one of the issues?

Perhaps; I do not know.
In fact, I cannot figure out what issues the TN government really has.

 Or do they prefer only the usage outlined in the standard, in order
 to encourage people to use it?

Please do not forget that while Tamil Nadu is the principal place where
Tamil is spoken, it is not the only one, as Tamil is spoken all around
the Indian Ocean.

When I speak about French usage, I can only give testimonies. The various
French official agencies in charge of the language have a bit more power,
but it is far from things like "thou shalt use this rendering form"...
(for example, if a bill were passed to eradicate \oe or ÿ in French,
usage would survive for years, and Unicode would have to continue to
support it, not to mention the other French-speaking countries that might
easily choose to _not_ apply the bill themselves).


Antoine