Re: Persian developers (was Re: Detecting installed fonts in

2000-07-13 Thread Bob Hallissy




That only adds KEHEH. I still lack:
   
U+066B ARABIC DECIMAL SEPARATOR
U+06C0 ARABIC LETTER HEH WITH YEH ABOVE
U+06CC ARABIC LETTER FARSI YEH

   I looked at two of the docs; it looks like they were using
   U+002C for the decimal separator even when they were using
   Unicode.

   FWIW, In Microsoft Word 2000, when you type a period in the
   midst of a digit sequence (so that it is to be the decimal
   separator), it is *stored* in the document as U+002E, but how
   it is *rendered* (on screen or printer) depends on a Word
   setting that controls digit display. If the user sets the
   control so the digits are displayed using U+0030 and following,
   then the period is rendered using U+002E. Conversely, if the
   user sets the control so that the digits are displayed using
   U+0660 and following, then the period appears to be rendered as
   U+066B.

   Thus it isn't necessary for U+066B to be present in the
   codepage.
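
   For illustration, a minimal Python sketch of that kind of
   display-time substitution (the function name and mapping are
   illustrative, not Word's actual code):

# Stored text keeps U+002E; only the rendering swaps in Arabic-Indic
# digits and U+066B ARABIC DECIMAL SEPARATOR.
ASCII_TO_ARABIC_INDIC = {chr(0x30 + i): chr(0x0660 + i) for i in range(10)}
ASCII_TO_ARABIC_INDIC["."] = "\u066B"

def render_digits(stored, use_arabic_indic):
    # A real renderer would substitute the period only inside digit runs.
    if not use_arabic_indic:
        return stored
    return "".join(ASCII_TO_ARABIC_INDIC.get(ch, ch) for ch in stored)

print(render_digits("3.14", True))   # Arabic-Indic digits with U+066B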

   Bob







Re: Persian developers (was Re: Detecting installed fonts in

2000-07-13 Thread Michael (michka) Kaplan

- Original Message -
From: "Bob Hallissy" [EMAIL PROTECTED]
FWIW, In Microsoft Word 2000,
<snip>
Thus it isn't necessary for U+066B to be present in the
codepage.

Word 2000 is a Unicode application, which makes code pages a lot less
relevant.

Michael




Re: Subset of Unicode to represent Japanese Kanji?

2000-07-13 Thread Kevin Bracey

In message [EMAIL PROTECTED]
  [EMAIL PROTECTED] wrote:

 Technically, a Japanese document can be written in all Roman characters,
 but this is not a true Japanese document. It is very difficult to read and
 it leads to ambiguity and misunderstanding. It was only used back in the
 Telex days, when people had no choice.
 

It is acceptable for a limited-capability device to display Japanese just
using katakana characters (under 64 8x16 glyphs). I've seen this in Japan
in such things as shop tills, and minidisc players displaying track names.

Anything more advanced than that (such as the funky digital oscilloscope
we've just obtained) will display the basic Kanji set (6500-odd 16x16
glyphs). That should need less than 256K of storage space.
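
A rough back-of-the-envelope check in Python, assuming uncompressed 1-bit
bitmaps:

glyphs = 6500
bytes_per_glyph = 16 * 16 // 8        # 32 bytes per 16x16 1-bit bitmap
total = glyphs * bytes_per_glyph      # 208,000 bytes
print(total, total / 1024)            # ~203 KiB, comfortably under 256K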

-- 
Kevin Bracey, Principal Software Engineer
Pace Micro Technology plc             Tel: +44 (0) 1223 518566
645 Newmarket Road                    Fax: +44 (0) 1223 518526
Cambridge, CB5 8PB, United Kingdom    WWW: http://www.acorn.co.uk/



Re: Eudora?

2000-07-13 Thread Jaap Pranger

At 19:36 +0200 2000.07.12, Pete Resnick wrote:


The latest version of Mac Eudora lets you read UTF-8. If I can get my 
act together, the next version may let you write. I'm not sure what 
we'll be able to get into Windows for the next version. 

What about the present Win Eudora? Can it send any CP125x 
text as UTF-8? Single CP, multiple?

I'm not after any secrets but could you briefly explain what 
kind of things one needs to get that writing act together? (MLTE?)

If the latest Eudora works with TEC, what is the function of the
still built-in Eudora Tables? Do they take over on TEC-less systems?
Can I still get external Tables to work with 4.3 under OS 9.x?
(The reason I ask is a perhaps misguided wish for control over
translations.)


I suppose that whatever UTF-8 text Mac Eudora receives, it can 
only display the repertoire of a single Mac script/encoding. 

What is it that makes the difference between a) a browser that
can display characters from several Mac scripts at the same time, and
b) an application like Eudora that cannot? Is it dependence on
text drawing engines like MLTE or WASTE versus QuickDraw, or
is it (much) more? (I hope the question is as clear as the
evidence for my ignorance ..)


If the incoming UTF-8 in a Mac Eudora message represents a larger 
repertoire than that of a single Mac script/encoding, is it possible to 
somehow copy the UTF-8 bytes in order that the full text of the message 
can be displayed in a browser? Or, better yet, with an AppleScript and 
TEC OSAX, could I get the text in the right fonts in a WP? 
(provided fonts, language kits etc. are in place.)




Please educate me where the 
terminology was wrong. 



Jaap  

-- 





Re: Subset of Unicode to represent Japanese Kanji?

2000-07-13 Thread Kevin Bracey

In message [EMAIL PROTECTED]
  Otto Stolz [EMAIL PROTECTED] wrote:

 On 2000-07-13 at 13:28 h UCT, Kevin Bracey wrote:
  It is acceptable for a limited-capability device to display Japanese just
  using katakana characters (under 64 8x16 glyphs).
 ...
  Anything more advanced than that [...] will display the basic Kanji set
 
 and Hiragana, I suppose?
 
 I understand the wording in TUS 3.0, sections 10.2 and 10.3 (pages 272
 and 274) to the effect that Hiragana is required together with Kanji to
 write Japanese (and that Katakana is used in normal text only for foreign
 words or visual emphasis). So, I guess, a limited-capability device can
 support Katakana only, and an advanced one has to support Kanji + Hiragana
 + Katakana.
 
 Is that correct?

Quite right. The standard Japanese repertoire (as originally defined in JIS X
0208) contains 6355 kanji, 83 hiragana, 86 katakana and a couple of hundred
other symbols. You'd use that in addition to the basic latin + halfwidth
katakana set defined in JIS X 0201.

In summary:

   Level          Repertoires                                         Glyphs
   -------------  --------------------------------------------------  ------
   Useless        Basic Latin only                                        95
   Limited        Basic Latin + halfwidth katakana                       158
   Standard       Basic Latin, halfwidth katakana + JIS X 0208          7037
   Above average  Basic Latin, halfwidth katakana, JIS X 0208 + 0212   13104

Our Japanese systems (internet access terminals) use a Japanese font with the
"standard" repertoire (with the addition of the all-important (C) and TM
characters :) ).
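
The glyph counts in the table are just running totals of the repertoire
sizes; a quick Python check, using the counts quoted above and the 6,067
JIS X 0212 characters mentioned elsewhere in this thread:

basic_latin        = 95
halfwidth_katakana = 63
jis_x_0208         = 6879   # 6355 kanji + 83 hiragana + 86 katakana + symbols
jis_x_0212         = 6067

limited  = basic_latin + halfwidth_katakana   # 158
standard = limited + jis_x_0208               # 7037
above    = standard + jis_x_0212              # 13104
print(limited, standard, above)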

-- 
Kevin Bracey, Principal Software Engineer
Pace Micro Technology plc             Tel: +44 (0) 1223 518566
645 Newmarket Road                    Fax: +44 (0) 1223 518526
Cambridge, CB5 8PB, United Kingdom    WWW: http://www.acorn.co.uk/



Re: Subject lines in UTF-8 mssgs? [was:

2000-07-13 Thread Michael (michka) Kaplan

 I forced the encoding to UTF-8 (it is supposed to be the
 default in my setting, but most of my messages arrive as
 charset="windows-1252"), and I am using some Chinese
 characters that are certainly not in my system's default
 code page:

 你好、雅朴。
 _馬可。

Note that this may not necessarily have forced UTF-8, since OE supports
encodings for Chinese characters that it could also have used to send the
message.

UTF-8 *is* required for languages that do not have such a legacy encoding,
like Tamil.

<showing_off>
உலகம் பேச நினைக்கும் போது Unicode பேசுகிறது
</showing_off>

On the whole, I would not recommend sending mail using those other
encodings; I believe that people using OE 5.0 and later will be prompted to
install language support just by opening the e-mail! :-)

michka

(the sentence is right, by the way <g>).





Re: Using Unicode in XML

2000-07-13 Thread Michael (michka) Kaplan

Actually, the XML spec is very clear on this: it is handled through the use
of a BOM, to help the parser know that it is UTF-16 text.

If there is no BOM, then UTF-8 is assumed, unless the encoding tag is
present. However, the encoding tag is not required and parsers are not
required to support it.

In other words, a valid parser supports UTF-16 and UTF-8. If it does not, it
is not an XML parser.

You can see

http://www.w3.org/TR/REC-xml#charencoding

for more details.
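
A minimal Python sketch of that detection order (illustrative only, not a
conforming parser, and the function name is made up): a BOM means UTF-16,
otherwise read the declaration if present, otherwise assume UTF-8.

import re

def sniff_xml_encoding(data):
    # BOM first: FF FE or FE FF means UTF-16 (little/big endian).
    if data.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if data.startswith(b"\xfe\xff"):
        return "utf-16-be"
    # No BOM: the declaration, if any, is readable as ASCII/UTF-8 bytes.
    m = re.match(br'<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']', data)
    return m.group(1).decode("ascii") if m else "utf-8"

print(sniff_xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>'))
print(sniff_xml_encoding('<?xml version="1.0"?><a/>'.encode("utf-16")))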

michka


- Original Message -
From: "Paul Deuter" [EMAIL PROTECTED]
To: "Unicode List" [EMAIL PROTECTED]
Sent: Thursday, July 13, 2000 8:47 AM
Subject: Using Unicode in XML


 I know that XML can contain Unicode by using the declaration

 <?xml version="1.0" encoding="ISO-10646-UCS-2"?>

 But there seems to be a chicken and egg dilemma here.  If
 I encode my whole XML stream as Unicode, then the parser
 will need to know that the stream is Unicode in order to be able
 to parse the declaration which tells it that it is Unicode.

 If the parser cannot figure out that the stream is Unicode, then
 it won't be able to read the declaration.  But if it can recognize
 the Unicode, then the declaration would seem to be superfluous.

 How do systems handle this?

 Thanks,
 Paul








RE: Using Unicode in XML

2000-07-13 Thread Vaintroub, Wladislav

XML parsers check the BOM at the beginning of the document.
If an XML document starts with
0xFEFF it is encoded in UTF-16 (or UCS-2),
0xFFFE UTF-16 byte-swapped,
0x0000FEFF UCS-4, and
0xFFFE0000 UCS-4 byte-swapped.


Wlad

 -Original Message-
 From: Paul Deuter [mailto:[EMAIL PROTECTED]]
 Sent: Thursday, July 13, 2000 5:47 PM
 To: Unicode List
 Subject: Using Unicode in XML
 
 
 I know that XML can contain Unicode by using the declaration 
 
 <?xml version="1.0" encoding="ISO-10646-UCS-2"?>
 
 But there seems to be a chicken and egg dilemma here.  If
 I encode my whole XML stream as Unicode, then the parser
 will need to know that the stream is Unicode in order to be able
 to parse the declaration which tells it that it is Unicode.
 
 If the parser cannot figure out that the stream is Unicode, then
 it won't be able to read the declaration.  But if it can recognize
 the Unicode, then the declaration would seem to be superfluous.
 
 How do systems handle this?
 
 Thanks,
 Paul
 
 
 



Re: Proposal to make the unicode list more transparent! (Sender:

2000-07-13 Thread N.R.Liwal


- Original Message - 
From: Doug Ewell [EMAIL PROTECTED]
To: Unicode List [EMAIL PROTECTED]
Sent: Wednesday, July 12, 2000 6:47 PM
Subject: Proposal to make the unicode list more transparent! (Sender:


 Jens Siebert [EMAIL PROTECTED] wrote:
 
  However, because of the tremendous amount of mails
  I would like to suggest splitting the list into
  various lists, divided by main-topics.
 
  These could be sorted by "groups of languages",
  such as CJK(+V) and other groups.
  Another sector could be "technical issues", such
  as encoding-related mails, statements about
  programm-code source-samples etc. !

I think mailing lists based on script would be more appropriate.

Liwal




Re: Eudora?

2000-07-13 Thread Pete Resnick

On 7/12/00 at 5:19 PM -0800, Piotr Trzcionkowski wrote:

Is it default encoding ?

No. As I said, Mac Eudora only reads it; it can't yet write it. Its
default encoding is still ISO-8859-1 (munged to deal with special Mac 
Roman characters).

What about other IANA encodings?

It can interpret anything that the Apple Text Encoding Converter can 
handle (which is most, if not all, of the registered IANA encodings).

Does it able to produce structuralized text/html or text/xml part in 
multipart/alternative messages or alone ?

I'm not sure what you're asking. Eudora (on both platforms) generates 
text/html within multipart/related and can generate both text/plain 
and text/html within multipart/alternative.

pr
-- 
Pete Resnick mailto:[EMAIL PROTECTED]
Eudora Engineering - QUALCOMM Incorporated



Re: Using Unicode in XML

2000-07-13 Thread addison

Actually, you do NOT need to declare UCS-2/UTF-16 with an encoding
tag: it's supposed to be the default character set. It is, of course, not
illegal to declare it, but it is superfluous to do so (for the reason that
you suggest).

You do need to include a Byte Order Mark character as the first pair of
bytes in the file (that would be character U+FEFF), if you encode the file
as UTF-16. Many Unicode-aware text editors will do this for you (for
example, Notepad on WindowsNT does this), so this will be essentially
invisible to you.

Some XML parsers are not (alas) Unicode enabled--that is, they
can't handle a file encoded as UTF-16. There is usually a
disclaimer about their being able to handle only Latin-1 somewhere. They
can still handle Unicode (it's a requirement), but only as numeric
entities: the text stream, though, has to be Latin-1. If you have such a
beast, consider replacing it (please).

I should stress that most parsers have been written responsibly and will
handle your UTF-16 files just fine.
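
For what it's worth, a small Python illustration (the file name and content
are made up) of writing a UTF-16 file with a leading BOM and checking the
first two bytes:

xml = '<?xml version="1.0"?>\n<greeting>hello</greeting>\n'
with open("greeting.xml", "w", encoding="utf-16") as f:
    f.write(xml)

with open("greeting.xml", "rb") as f:
    print(f.read(2))   # b'\xff\xfe' on a little-endian writer: the BOM, U+FEFF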

Regards,

Addison

===
Addison P. Phillips Principal Consultant
Inter-Locale LLChttp://www.inter-locale.com
Globalization Engineering  Consulting Services

+1 408.210.3569 (mobile)+1 408.904.4762 (fax)
===

On Thu, 13 Jul 2000, Paul Deuter wrote:

 I know that XML can contain Unicode by using the declaration 
 
 <?xml version="1.0" encoding="ISO-10646-UCS-2"?>
 
 But there seems to be a chicken and egg dilemma here.  If
 I encode my whole XML stream as Unicode, then the parser
 will need to know that the stream is Unicode in order to be able
 to parse the declaration which tells it that it is Unicode.
 
 If the parser cannot figure out that the stream is Unicode, then
 it won't be able to read the declaration.  But if it can recognize
 the Unicode, then the declaration would seem to be superfluous.
 
 How do systems handle this?
 
 Thanks,
 Paul
 
 
 
 




Re: C1 controls and terminals (was: Re: Euro character in ISO)

2000-07-13 Thread Erik van der Poel

Frank da Cruz wrote:
 
 Doug Ewell wrote:
  
  That last paragraph echoes what Frank said about "reversing the layers,"
  performing the UTF-8 conversion first and then looking for escape
  sequences.  True UTF-8 support, in terminal emulators and in other
  software as well, really should depend on UTF-8 conversion being
  performed first.
 
 The irony is, when using ISO 2022 character-set designation and invocation,
 you have to handle the escape sequences first to know if you're in UTF-8.
 Therefore, this pushes the burden onto the end-user to preconfigure their
 emulator for UTF-8 if that is what is being used, when ideally this should
 happen automatically and transparently.

I may be misunderstanding the above, but ISO 2022 says:

  ESC 2/5 F shall mean that the other coding system uses
  ESC 2/5 4/0 to return;

  ESC 2/5 2/15 F shall mean that the other coding system
  does not use ESC 2/5 4/0 to return (it may have an alternative
  means to return or none at all).

Registration number 196 is for UTF-8 without implementation level, and
its escape sequence is ESC 2/5 4/7. I believe that ISO 2022 was designed
that way so that a decoder that does not know UTF-8 (or any other coding
system invoked by ESC 2/5 F) could simply "skip" the octets in that
encoding until it gets to the octets ESC 2/5 4/0.

This means that it does not need to decode UTF-8 just to find the escape
sequence ESC 2/5 4/0. UTF-8 does not do anything special with characters
below U+0080 anyway (they're just single-byte ASCII), so it works, no?

Of course, if you wanted to include any C1 controls inside the UTF-8
segment, they would have to be encoded in UTF-8, but ESC 2/5 4/0 is
entirely in the ASCII range (less than 128), so those octets are encoded
as is.
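
A tiny Python sketch of that skip-without-decoding approach (illustrative
only; ESC 2/5 4/7 is ESC % G, ESC 2/5 4/0 is ESC % @). It works precisely
because the return sequence is pure ASCII and UTF-8 never reuses those byte
values inside multibyte sequences:

ENTER_UTF8  = b"\x1b%G"   # ESC 2/5 4/7, registration 196 (UTF-8)
RETURN_2022 = b"\x1b%@"   # ESC 2/5 4/0, return to ISO 2022

def skip_utf8_segment(stream, pos):
    # pos points just past ESC % G; return the index just past ESC % @.
    end = stream.find(RETURN_2022, pos)
    return len(stream) if end == -1 else end + len(RETURN_2022)

data  = b"abc" + ENTER_UTF8 + "h\u00e9llo".encode("utf-8") + RETURN_2022 + b"xyz"
start = data.find(ENTER_UTF8) + len(ENTER_UTF8)
print(data[skip_utf8_segment(data, start):])   # b'xyz'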

Erik



Re: Miscellaneous comments/questions.

2000-07-13 Thread Asmus Freytag

At 07:50 AM 7/13/00 -0800, Antoine Leca wrote:
Alex Bochannek wrote:
 
  A similar issue was very interesting to observe in France and
  Germany. The use of the English language in advertisement seems to run
  rampant in Germany while almost all ads that include English in France
  (mostly tag lines) are followed by an asterisk and the literal French
  translation somewhere near the edge of the sign.
Thanks for the nice trip-report, Alex.

There always seems to be one language exerting that kind of pressure
on the other European languages. It just depends on time and circumstances.
Latin used to have that role for centuries, it still does in a limited way,
together with Greek in creating new scientific/medical terminology.
French had this role for some time, perhaps more on the continent.
German had this role, briefly and in a limited way, at the beginning of the
century for scientific terms.

Two things will happen: The words in question can lose their 'foreign'
feeling and become part of the language - usually by some adjustment in
spelling or grammatical forms. (Example: En: cake (pl. cakes) - De:
Keks (new pl. Kekse). This is now a word that most untrained native
speakers would not recognize as borrowed.).

Or the foreign word can be displaced by a neologism based on native roots.
This is often more successful in the case when there are phonemes in the
foreign word that are very hard to pronounce. It's also one area where
Government-led efforts have had some success over time. Iceland, by the
way, is particularly strict in this regard.

Since English is essentially a Germanic language (one that incorporated a
large set of Norman French derived words), its pressure on speakers of other
Germanic languages tends to be higher, since not only words but phrases
can be borrowed (verbatim or translated word-for-word). The strain between
these borrowed pieces and the native language is in a way less than it would
be for unrelated languages.

A./



Re: C1 controls and terminals (was: Re: Euro character in ISO)

2000-07-13 Thread Frank da Cruz

Erik van der Poel wrote:
 Frank da Cruz wrote:
  The irony is, when using ISO 2022 character-set designation and invocation,
  you have to handle the escape sequences first to know if you're in UTF-8.
  Therefore, this pushes the burden onto the end-user to preconfigure their
  emulator for UTF-8 if that is what is being used, when ideally this should
  happen automatically and transparently.
 
 I may be misunderstanding the above, but ISO 2022 says:
 
   ESC 2/5 F shall mean that the other coding system uses
   ESC 2/5 4/0 to return;
 
   ESC 2/5 2/15 F shall mean that the other coding system
   does not use ESC 2/5 4/0 to return (it may have an alternative
   means to return or none at all).
 
 Registration number 196 is for UTF-8 without implementation level, and
 its escape sequence is ESC 2/5 4/7. I believe that ISO 2022 was designed
 that way so that a decoder that does not know UTF-8 (or any other coding
 system invoked by ESC 2/5 F) could simply "skip" the octets in that
 encoding until it gets to the octets ESC 2/5 4/0.
 
 This means that it does not need to decode UTF-8 just to find the escape
 sequence ESC 2/5 4/0. UTF-8 does not do anything special with characters
 below U+0080 anyway (they're just single-byte ASCII), so it works, no?
 
Yes, but I was thinking more about the ISO 2022 invocation features than the
designation ones:  LS2, LS3, LS1R, LS2R, LS3R, SS2, and SS3 are C1 controls.
The situation *could* arise where these would be used prior to announcing
(or switching to) UTF-8.  In this case, the end-user would have to configure
the software in advance to know whether the incoming byte stream is UTF-8.

Not a big deal; just an illustration of what happens when we can't use the
normal layering.

- Frank




Re: C1 controls and terminals (was: Re: Euro character in ISO)

2000-07-13 Thread Erik van der Poel

Frank da Cruz wrote:
 
 Yes, but I was thinking more about the ISO 2022 invocation features than the
 designation ones:  LS2, LS3, LS1R, LS2R, LS3R, SS2, and SS3 are C1 controls.
 The situation *could* arise where these would be used prior to announcing
 (or switching to) UTF-8.  In this case, the end-user would have to configure
 the software in advance to know whether the incoming byte stream is UTF-8.

Shouldn't the UTF-8 segment switch back to ISO 2022 before invoking any
of those C1 controls? This way, the decoder wouldn't have to know UTF-8,
and could skip over it reliably.

Erik



Re: Subset of Unicode to represent Japanese Kanji?

2000-07-13 Thread foster . feng

1. Not the extended kanji. It is the basic kanji (or standard kanji, as defined
in JIS X 0208-1990) that is a MUST. Even Japanese Windows 95 can only display
the basic kanji, not the extended kanji.

2. Both hiragana and katakana are nothing but symbols for the pronunciation of
Japanese: hiragana is the cursive style and katakana the print style. Every
hiragana has its equivalent katakana, and its equivalent Roman character. An
all-katakana document is not much better than an all-Roman-character document.

The problem with an all-kana (or all-Roman-character) document is that there
are so many words with the same pronunciation. For example, the Roman
characters "KAMI" may mean God, or hair, or paper, or above. "HASHI" may mean
bridge or chopsticks. If written in kanji, God, hair, paper, above, bridge, and
chopsticks are all represented by different kanji, so there is no ambiguity.

Whether it is practical or not to have an all-kana display depends on your
application. As Kevin Bracey said, things such as shop tills and minidisc
players displaying track names may be OK, since the content is focused.

Foster





Antoine Leca [EMAIL PROTECTED] on 2000/07/13 10:43:45

To:   Foster Feng/TYO/NIC@NIC
cc:   Unicode List [EMAIL PROTECTED], [EMAIL PROTECTED]

Subject:  Re: Subset of Unicode to represent Japanese Kanji?



I am NOT a Japanese speaker (I can only poorly read kana, and with help).
So here is my supplementary question.

[EMAIL PROTECTED] wrote:

 Japanese document must consist of:


 hiragana: less than 100 characters
 katakana: less than 100 characters
 kanji: basic kanji has 6,879 characters as defined in JIS X 0208-1990
   extended kanji has 6,067 characters as defined in JIS X 0212-1990

You mean, extended kanji is an absolute requirement for any device which
is intended to display some Japanese text?


 Technically, a Japanese document can be written in all Roman characters, but
 this is not a true Japanese document.

I understand easily that this is _not_ the solution (it always takes me quite
some time when I see my name written in kana or Cyrillic or whatever).


But: What about a document written only with kanas, without any kanji?

I know this is far from perfect, that it will hurt (or upset?) the reader
quite a lot, and will reduce his reading speed to a small fraction of
normal, perhaps a tenth (but that's much better than romaji, anyway).
But is it practical, for example for a small display? (say, 3 lines of
20 characters)


Regards,
Antoine







ODBC/JDBC Drivers

2000-07-13 Thread EnsoCompany



Does anyone know about any Unicode-enabled
ODBC/JDBC drivers for Microsoft SQL that will run with Linux and
Apache?

Thanks in advance,
Beverly Corwin, President
Enso Company Ltd.
The Westin Building
2001 Sixth Avenue, Suite 3403
Seattle WA 98121 USA
Tel: 206.390.0743  Fax: 206.443.5758
www.enso-company.com


Re: JDBC drivers that support databases using Unicode for storage

2000-07-13 Thread Linus Toshihiro Tanaka

Dear Tex,

Have you checked the below?
http://technet.oracle.com/doc/oracle8i_816/server.816/a76966/ch6.htm#7371

Best Regards,
++
| Linus Toshihiro Tanaka500 Oracle Parkway M/S 4op7  |
| NLS Consulting Team   Redwood Shores, CA 94065 USA |
| Server Globalization Technology   email: [EMAIL PROTECTED] |
| Oracle Corporation |
++


Tex Texin wrote:
 
 Hi,
 
 I am Unicode-enabling an application that works with Oracle and
 Microsoft SQL Server among other databases. I need to replace the
 current JDBC driver since it doesn't support Unicode going in/out of
 the database.
 
 Any recommendations for good performing JDBC drivers that work with
 the above databases storing/retrieving Unicode?
 
 tex




Re: Subject lines....../ Lost Header?? Re: [nothing]

2000-07-13 Thread Jaap Pranger


My previous message of a few minutes ago 
with the empty   "Re:   "  --only Header (at least as I got 
it back from the listserver) left my home with a Header 
as shown below. 

Any information about the whereabouts 
of my lost Head   leading to its recovery ... 



Re: =?utf-8?B?UkU6IFN1YmplY3QgbGluZXMgaW4gVVRGLTggbXNzZ3M/IFt3YXM6?=
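
That header is an RFC 2047 encoded-word, and it does decode cleanly; for
example, with Python's email package:

from email.header import decode_header, make_header

raw = "=?utf-8?B?UkU6IFN1YmplY3QgbGluZXMgaW4gVVRGLTggbXNzZ3M/IFt3YXM6?="
print(str(make_header(decode_header(raw))))
# -> RE: Subject lines in UTF-8 mssgs? [was: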





Jaap

--