Re: Persian developers (was Re: Detecting installed fonts in
That only adds KEHEH. I still lack:
U+066B ARABIC DECIMAL SEPARATOR
U+06C0 ARABIC LETTER HEH WITH YEH ABOVE
U+06CC ARABIC LETTER FARSI YEH
I looked at two of the docs; it looks like they were using U+002C for the decimal separator even when they were using Unicode. FWIW, in Microsoft Word 2000, when you type a period in the midst of a digit sequence (so that it is to be the decimal separator), it is *stored* in the document as U+002E, but how it is *rendered* (on screen or printer) depends on a Word setting that controls digit display. If the user sets the control so the digits are displayed using U+0030 and following, then the period is rendered using U+002E. Conversely, if the user sets the control so that the digits are displayed using U+0660 and following, then the period appears to be rendered as U+066B. Thus it isn't necessary for U+066B to be present in the codepage. Bob
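The render-time substitution Bob describes can be sketched in a few lines. This is a hypothetical illustration of the behavior, not Word's actual code: the stored text keeps ASCII U+002E and U+0030..U+0039, and a display setting swaps in U+0660..U+0669 and U+066B only at rendering time.

```python
# Hypothetical sketch of render-time digit substitution: the stored
# text stays ASCII; the display layer maps digits to U+0660.. and the
# period to U+066B ARABIC DECIMAL SEPARATOR.
DISPLAY_MAP = {ord(str(d)): chr(0x0660 + d) for d in range(10)}
DISPLAY_MAP[ord(".")] = "\u066B"

def render_arabic_digits(stored: str) -> str:
    """Map ASCII digits and the decimal point for display only."""
    return stored.translate(DISPLAY_MAP)

print(render_arabic_digits("3.14"))  # stored text remains "3.14"
```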
Re: Persian developers (was Re: Detecting installed fonts in
- Original Message - From: "Bob Hallissy" [EMAIL PROTECTED] FWIW, In Microsoft Word 2000, snip Thus it isn't necessary for U+066B to be present in the codepage. Word 2000 is a Unicode application, which makes code pages a lot less relevant. Michael
Re: Subset of Unicode to represent Japanese Kanji?
In message [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: Technically, a Japanese document can be written in all Roman characters, but this is not a true Japanese document. It is very difficult to read and it leads to ambiguity and misunderstanding. It was only used back in the Telex days, when people had no choice. It is acceptable for a limited-capability device to display Japanese just using katakana characters (under 64 8x16 glyphs). I've seen this in Japan in such things as shop tills, and minidisc players displaying track names. Anything more advanced than that (such as the funky digital oscilloscope we've just obtained) will display the basic Kanji set (6500-odd 16x16 glyphs). That should need less than 256K of storage space.
--
Kevin Bracey, Principal Software Engineer
Pace Micro Technology plc            Tel: +44 (0) 1223 518566
645 Newmarket Road                   Fax: +44 (0) 1223 518526
Cambridge, CB5 8PB, United Kingdom   WWW: http://www.acorn.co.uk/
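Kevin's 256K figure checks out under the obvious assumption of uncompressed 1-bit-per-pixel bitmap glyphs (the arithmetic below is mine, not his):

```python
# Storage estimate for uncompressed 1-bpp bitmap fonts (assumption:
# no compression, no metrics tables -- just raw glyph bitmaps).
glyph_bytes_16x16 = 16 * 16 // 8            # 32 bytes per kanji glyph
kanji_store = 6500 * glyph_bytes_16x16      # the "6500-odd" basic kanji
print(kanji_store)                          # 208000 bytes, under 256K

glyph_bytes_8x16 = 8 * 16 // 8              # 16 bytes per kana glyph
kana_store = 64 * glyph_bytes_8x16          # the 64-glyph katakana case
print(kana_store)                           # 1024 bytes
```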
Re: Eudora?
At 19:36 +0200 2000.07.12, Pete Resnick wrote: The latest version of Mac Eudora lets you read UTF-8. If I can get my act together, the next version may let you write. I'm not sure what we'll be able to get into Windows for the next version. What about the present Win Eudora? Can it send any CP125x text as UTF-8? Single CP, multiple? I'm not after any secrets but could you briefly explain what kind of things one needs to get that writing act together? (MLTE?) If the latest Eudora works with TEC what is the function of the still built-in Eudora Tables? Do they take over in TEC-less Systems? Can I still get external Tables to work with 4.3 under OS 9.x? (The reason I ask is a maybe misguided wish for control over translations.) I suppose that whatever UTF-8 text Mac Eudora receives, it can only display the repertoire of a single Mac script/encoding. What is it that makes the difference for a) a browser that can display chars from several Mac scripts at the same time, and b) an application like Eudora that can not. Dependence on text drawing engines like MLTE or WASTE versus QuickDraw or is it (much) more? (I hope the question is as clear as the evidence for my ignorance .. ) If the incoming UTF-8 in a Mac Eudora message represents a larger repertoire than that of a single Mac script/encoding, is it possible to somehow copy the UTF-8 bytes in order that the full text of the message can be displayed in a browser? Or, better yet, with an AppleScript and TEC OSAX, could I get the text in the right fonts in a WP? (provided fonts, language kits etc. are in place.) Please educate me where the terminology was wrong. Jaap --
Re: Subset of Unicode to represent Japanese Kanji?
In message [EMAIL PROTECTED] Otto Stolz [EMAIL PROTECTED] wrote: On 2000-07-13 at 13:28 UTC, Kevin Bracey wrote: It is acceptable for a limited-capability device to display Japanese just using katakana characters (under 64 8x16 glyphs). ... Anything more advanced than that [...] will display the basic Kanji set and Hiragana, I suppose? I understand the wording in TUS 3.0, sections 10.2 and 10.3 (pages 272 and 274) to the effect that Hiragana is required together with Kanji to write Japanese (and that Katakana is used in normal text only for foreign words or visual emphasis). So, I guess, a limited-capability device can support Katakana only, and an advanced one has to support Kanji + Hiragana + Katakana. Is that correct? Quite right. The standard Japanese repertoire (as originally defined in JIS X 0208) contains 6355 kanji, 83 hiragana, 86 katakana and a couple of hundred other symbols. You'd use that in addition to the basic latin + halfwidth katakana set defined in JIS X 0201. In summary:

Level          Repertoires                                        Glyphs
-------------  ------------------------------------------------  ------
Useless        Basic Latin only                                      95
Limited        Basic Latin + halfwidth katakana                     158
Standard       Basic Latin, halfwidth katakana + JIS X 0208        7037
Above average  Basic Latin, halfwidth katakana, JIS X 0208+0212   13104

Our Japanese systems (internet access terminals) use a Japanese font with the "standard" repertoire (with the addition of the all-important (C) and TM characters :) ).
--
Kevin Bracey, Principal Software Engineer
Pace Micro Technology plc            Tel: +44 (0) 1223 518566
645 Newmarket Road                   Fax: +44 (0) 1223 518526
Cambridge, CB5 8PB, United Kingdom   WWW: http://www.acorn.co.uk/
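The glyph totals in Kevin's table can be cross-checked against the repertoire sizes quoted in this thread (the 63 halfwidth katakana and the 524 non-kanji JIS X 0208 characters are my inferred breakdown, not figures from the message):

```python
# Cross-check of the cumulative glyph counts in the table above.
basic_latin = 95           # printable ASCII
halfwidth_katakana = 63    # JIS X 0201 right half (inferred: 158 - 95)
jis_x_0208 = 6879          # 6355 kanji + 524 kana and symbols
jis_x_0212 = 6067          # supplementary ("extended") kanji

levels = {
    "Useless": basic_latin,
    "Limited": basic_latin + halfwidth_katakana,
    "Standard": basic_latin + halfwidth_katakana + jis_x_0208,
    "Above average": basic_latin + halfwidth_katakana + jis_x_0208 + jis_x_0212,
}
for name, glyphs in levels.items():
    print(f"{name}: {glyphs}")
```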
Re: Subject lines in UTF-8 mssgs? [was:
I forced the encoding to UTF-8 (it is supposed to be the default in my setting, but most of my messages arrive as charset="windows-1252"), and I am using some Chinese characters that are certainly not in my system's default code page: 你好、雅朴。 _馬可。 Note that this may not necessarily have forced UTF-8, since OE supports encodings for Chinese characters that you could also have used to send the message. UTF-8 *is* required for languages that have no such encoding, like Tamil. showing_off உலகம் பேச நினைக்கும் போது Unicode பேசுகிறது /showing_off On the whole, I would not recommend sending mail using those other encodings; I believe that people using OE 5.0 and later will be prompted to install language support just by opening the e-mail! :-) michka (the sentence is right, by the way g).
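Michka's point, that Tamil has no legacy code page while Chinese has several, is easy to demonstrate. A sketch, using Latin-1 as a stand-in for any single-byte legacy Western charset:

```python
# Tamil text cannot round-trip through a single-byte legacy code page,
# so a mailer has no choice but a Unicode encoding such as UTF-8.
tamil = "\u0b89\u0bb2\u0b95\u0bae\u0bcd"  # the word from the message above
try:
    tamil.encode("iso-8859-1")  # stand-in for any legacy single-byte charset
    legacy_ok = True
except UnicodeEncodeError:
    legacy_ok = False

print(legacy_ok)                    # False: no legacy route exists
print(len(tamil.encode("utf-8")))   # UTF-8 handles it: 3 bytes per character
```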
Re: Using Unicode in XML
Actually, the XML spec is very clear on this: it is handled through the use of a BOM, to help the parser know that it is UTF-16 text. If there is no BOM, then UTF-8 is assumed, unless the encoding declaration is present. However, the encoding declaration is not required for UTF-8 or UTF-16, and parsers are not required to support other encodings. In other words, a valid parser supports UTF-16 and UTF-8. If it does not, it is not an XML parser. You can see http://www.w3.org/TR/REC-xml#charencoding for more details. michka - Original Message - From: "Paul Deuter" [EMAIL PROTECTED] To: "Unicode List" [EMAIL PROTECTED] Sent: Thursday, July 13, 2000 8:47 AM Subject: Using Unicode in XML I know that XML can contain Unicode by using the declaration <?xml version="1.0" encoding="ISO-10646-UCS-2"?> But there seems to be a chicken-and-egg dilemma here. If I encode my whole XML stream as Unicode, then the parser will need to know that the stream is Unicode in order to be able to parse the declaration which tells it that it is Unicode. If the parser cannot figure out that the stream is Unicode, then it won't be able to read the declaration. But if it can recognize the Unicode, then the declaration would seem to be superfluous. How do systems handle this? Thanks, Paul
RE: Using Unicode in XML
XML parsers check the BOM at the beginning of the document. If an XML document starts with 0xFEFF it is encoded in UTF-16 (or UCS-2); 0xFFFE means UTF-16 from a byte-swapped architecture; the byte sequence 00 00 FE FF means UCS-4; and FF FE 00 00 means byte-swapped UCS-4. Wlad -Original Message- From: Paul Deuter [mailto:[EMAIL PROTECTED]] Sent: Thursday, July 13, 2000 5:47 PM To: Unicode List Subject: Using Unicode in XML I know that XML can contain Unicode by using the declaration <?xml version="1.0" encoding="ISO-10646-UCS-2"?> But there seems to be a chicken-and-egg dilemma here. If I encode my whole XML stream as Unicode, then the parser will need to know that the stream is Unicode in order to be able to parse the declaration which tells it that it is Unicode. If the parser cannot figure out that the stream is Unicode, then it won't be able to read the declaration. But if it can recognize the Unicode, then the declaration would seem to be superfluous. How do systems handle this? Thanks, Paul
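The sniffing logic Wlad describes might look like this, with the four-byte checks done before the two-byte ones (a sketch; the returned strings are descriptive labels, not IANA charset names):

```python
def sniff_encoding(data: bytes) -> str:
    """Guess an XML document's encoding family from its BOM.
    The longer (UCS-4) prefixes must be tested before the UTF-16
    ones, because the UTF-16 BOM bytes are a prefix of the UCS-4
    little-endian BOM."""
    if data.startswith(b"\x00\x00\xfe\xff"):
        return "UCS-4 (big-endian)"
    if data.startswith(b"\xff\xfe\x00\x00"):
        return "UCS-4 (little-endian)"
    if data.startswith(b"\xfe\xff"):
        return "UTF-16 (big-endian)"
    if data.startswith(b"\xff\xfe"):
        return "UTF-16 (little-endian)"
    return "UTF-8"  # no BOM: UTF-8 unless an encoding declaration says otherwise

print(sniff_encoding(b"<?xml version='1.0'?>"))  # UTF-8
```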
Re: Proposal to make the unicode list more transparent! (Sender:
- Original Message - From: Doug Ewell [EMAIL PROTECTED] To: Unicode List [EMAIL PROTECTED] Sent: Wednesday, July 12, 2000 6:47 PM Subject: Proposal to make the unicode list more transparent! (Sender: Jens Siebert [EMAIL PROTECTED] wrote: However, because of the tremendous amount of mail I would like to suggest splitting the list into various lists, divided by main topics. These could be sorted by "groups of languages", such as CJK(+V) and other groups. Another sector could be "technical issues", such as encoding-related mails, statements about program code, source samples etc.! I think mailing lists based on script would be more appropriate. Liwal
Re: Eudora?
On 7/12/00 at 5:19 PM -0800, Piotr Trzcionkowski wrote: Is it the default encoding? No. As I said, Mac Eudora only reads it; it can't yet write it. Its default encoding is still ISO-8859-1 (munged to deal with special Mac Roman characters). What about other IANA encodings? It can interpret anything that the Apple Text Encoding Converter can handle (which is most, if not all, of the registered IANA encodings). Is it able to produce a structured text/html or text/xml part in multipart/alternative messages or alone? I'm not sure what you're asking. Eudora (on both platforms) generates text/html within multipart/related and can generate both text/plain and text/html within multipart/alternative. pr -- Pete Resnick mailto:[EMAIL PROTECTED] Eudora Engineering - QUALCOMM Incorporated
Re: Using Unicode in XML
Actually, you do NOT need to declare UCS-2/UTF-16 with an encoding declaration: it's supposed to be the default character set. It is, of course, not illegal to declare it, but it is superfluous to do so (for the reason that you suggest). You do need to include a Byte Order Mark character as the first pair of bytes in the file (that would be character U+FEFF), if you encode the file as UTF-16. Many Unicode-aware text editors will do this for you (for example, Notepad on Windows NT does this), so this will be essentially invisible to you. Some XML parsers are not (alas) Unicode enabled--that is, they can't handle a file encoded as UTF-16. There is usually a disclaimer about their being able to handle only Latin-1 somewhere. They can still handle Unicode (it's a requirement), but only as numeric entities: the text stream, though, has to be Latin-1. If you have such a beast, consider replacing it (please). I should stress that most parsers have been written responsibly and will handle your UTF-16 files just fine. Regards, Addison
===
Addison P. Phillips
Principal Consultant
Inter-Locale LLC            http://www.inter-locale.com
Globalization Engineering Consulting Services
+1 408.210.3569 (mobile)    +1 408.904.4762 (fax)
===
On Thu, 13 Jul 2000, Paul Deuter wrote: I know that XML can contain Unicode by using the declaration <?xml version="1.0" encoding="ISO-10646-UCS-2"?> But there seems to be a chicken-and-egg dilemma here. If I encode my whole XML stream as Unicode, then the parser will need to know that the stream is Unicode in order to be able to parse the declaration which tells it that it is Unicode. If the parser cannot figure out that the stream is Unicode, then it won't be able to read the declaration. But if it can recognize the Unicode, then the declaration would seem to be superfluous. How do systems handle this? Thanks, Paul
Re: C1 controls and terminals (was: Re: Euro character in ISO)
Frank da Cruz wrote: Doug Ewell wrote: That last paragraph echoes what Frank said about "reversing the layers," performing the UTF-8 conversion first and then looking for escape sequences. True UTF-8 support, in terminal emulators and in other software as well, really should depend on UTF-8 conversion being performed first. The irony is, when using ISO 2022 character-set designation and invocation, you have to handle the escape sequences first to know if you're in UTF-8. Therefore, this pushes the burden onto the end-user to preconfigure their emulator for UTF-8 if that is what is being used, when ideally this should happen automatically and transparently. I may be misunderstanding the above, but ISO 2022 says: ESC 2/5 F shall mean that the other coding system uses ESC 2/5 4/0 to return; ESC 2/5 2/15 F shall mean that the other coding system does not use ESC 2/5 4/0 to return (it may have an alternative means to return or none at all). Registration number 196 is for UTF-8 without implementation level, and its escape sequence is ESC 2/5 4/7. I believe that ISO 2022 was designed that way so that a decoder that does not know UTF-8 (or any other coding system invoked by ESC 2/5 F) could simply "skip" the octets in that encoding until it gets to the octets ESC 2/5 4/0. This means that it does not need to decode UTF-8 just to find the escape sequence ESC 2/5 4/0. UTF-8 does not do anything special with characters below U+0080 anyway (they're just single-byte ASCII), so it works, no? Of course, if you wanted to include any C1 controls inside the UTF-8 segment, they would have to be encoded in UTF-8, but ESC 2/5 4/0 is entirely in the ASCII range (less than 128), so those octets are encoded as is. Erik
Re: Miscellaneous comments/questions.
At 07:50 AM 7/13/00 -0800, Antoine Leca wrote: Alex Bochannek wrote: A similar issue was very interesting to observe in France and Germany. The use of the English language in advertising seems to run rampant in Germany, while almost all ads that include English in France (mostly tag lines) are followed by an asterisk and the literal French translation somewhere near the edge of the sign. Thanks for the nice trip report, Alex. There always seems to be one language exerting that kind of pressure on the other European languages; it just depends on the time and circumstances. Latin had that role for centuries, and still does in a limited way, together with Greek, in creating new scientific/medical terminology. French had this role for some time, perhaps more on the continent. German had this role, briefly and in a limited way, at the beginning of the century for scientific terms. Two things will happen: the words in question can lose their 'foreign' feeling and become part of the language - usually by some adjustment in spelling or grammatical forms. (Example: En: cake (pl. cakes) - De: Keks (new pl. Kekse). This is now a word that most untrained native speakers would not recognize as borrowed.) Or the foreign word can be displaced by a neologism based on native roots. This is often more successful when there are phonemes in the foreign word that are very hard to pronounce. It's also one area where government-led efforts have had some success over time. Iceland, by the way, is particularly strict in this regard. Since English is essentially a Germanic language (one that incorporated a large set of Norman-French-derived words), its pressure on speakers of other Germanic languages tends to be higher, since not only words but phrases can be borrowed (verbatim or translated word-for-word). The strain between these borrowed pieces and the native language is in a way less than it would be for unrelated languages. A./
Re: C1 controls and terminals (was: Re: Euro character in ISO)
Erik van der Poel wrote: Frank da Cruz wrote: The irony is, when using ISO 2022 character-set designation and invocation, you have to handle the escape sequences first to know if you're in UTF-8. Therefore, this pushes the burden onto the end-user to preconfigure their emulator for UTF-8 if that is what is being used, when ideally this should happen automatically and transparently. I may be misunderstanding the above, but ISO 2022 says: ESC 2/5 F shall mean that the other coding system uses ESC 2/5 4/0 to return; ESC 2/5 2/15 F shall mean that the other coding system does not use ESC 2/5 4/0 to return (it may have an alternative means to return or none at all). Registration number 196 is for UTF-8 without implementation level, and its escape sequence is ESC 2/5 4/7. I believe that ISO 2022 was designed that way so that a decoder that does not know UTF-8 (or any other coding system invoked by ESC 2/5 F) could simply "skip" the octets in that encoding until it gets to the octets ESC 2/5 4/0. This means that it does not need to decode UTF-8 just to find the escape sequence ESC 2/5 4/0. UTF-8 does not do anything special with characters below U+0080 anyway (they're just single-byte ASCII), so it works, no? Yes, but I was thinking more about the ISO 2022 invocation features than the designation ones: LS2, LS3, LS1R, LS2R, LS3R, SS2, and SS3 are C1 controls. The situation *could* arise where these would be used prior to announcing (or switching to) UTF-8. In this case, the end-user would have to configure the software in advance to know whether the incoming byte stream is UTF-8. Not a big deal; just an illustration of what happens when we can't use the normal layering. - Frank
Re: C1 controls and terminals (was: Re: Euro character in ISO)
Frank da Cruz wrote: Yes, but I was thinking more about the ISO 2022 invocation features than the designation ones: LS2, LS3, LS1R, LS2R, LS3R, SS2, and SS3 are C1 controls. The situation *could* arise where these would be used prior to announcing (or switching to) UTF-8. In this case, the end-user would have to configure the software in advance to know whether the incoming byte stream is UTF-8. Shouldn't the UTF-8 segment switch back to ISO 2022 before invoking any of those C1 controls? This way, the decoder wouldn't have to know UTF-8, and could skip over it reliably. Erik
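Erik's "skip without decoding" argument can be made concrete. In ISO 2022 notation, 2/5 4/7 is the bytes 0x25 0x47 ("% G") and 2/5 4/0 is 0x25 0x40 ("% @"); since every octet of both sequences is below 0x80, and UTF-8 never uses octets below 0x80 inside a multibyte sequence, a decoder that knows nothing about UTF-8 can find the return sequence by raw byte search. A sketch:

```python
ESC = b"\x1b"
ENTER_UTF8 = ESC + b"%G"      # ESC 2/5 4/7, ISO-IR 196: switch to UTF-8
RETURN_ISO2022 = ESC + b"%@"  # ESC 2/5 4/0: return to ISO 2022

def extract_utf8_segment(stream: bytes):
    """Find a UTF-8 segment bracketed by ESC % G ... ESC % @ using raw
    byte search -- no UTF-8 decoding is needed to locate the boundary."""
    start = stream.find(ENTER_UTF8)
    if start < 0:
        return None
    body = stream[start + len(ENTER_UTF8):]
    end = body.find(RETURN_ISO2022)
    payload = body if end < 0 else body[:end]
    return payload.decode("utf-8")

stream = (b"ASCII text " + ENTER_UTF8
          + "caf\u00e9 \u20ac".encode("utf-8")
          + RETURN_ISO2022 + b" more ASCII")
print(extract_utf8_segment(stream))  # café €
```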
Re: Subset of Unicode to represent Japanese Kanji?
1. Not the extended kanji. It is the basic kanji (or standard kanji, as defined in JIS X 0208-1990) that is a MUST. Even Japanese Windows 95 can only display the basic kanji, not the extended kanji. 2. Both hiragana and katakana are nothing but symbols for the pronunciation of Japanese: hiragana is the cursive style and katakana the print style. Every hiragana has its equivalent katakana, and its equivalent Roman character. An all-katakana document is not much better than an all-Roman-character document. The problem with an all-kana (or all-Roman-character) document is that there are so many words with the same pronunciation. For example, the Roman characters "KAMI" may mean God, or hair, or paper, or above. "HASHI" may mean bridge or chopsticks. If it is written in kanji, God, hair, paper, above, bridge, and chopsticks are all represented by different kanji, thus no ambiguity. Whether it's practical or not to have an all-kana display depends on your application. As Kevin Bracey said, things such as shop tills and minidisc players displaying track names may be OK, since the contents are focused. Foster Antoine Leca [EMAIL PROTECTED] on 2000/07/13 10:43:45 To: Foster Feng/TYO/NIC@NIC cc: Unicode List [EMAIL PROTECTED], [EMAIL PROTECTED] Subject: Re: Subset of Unicode to represent Japanese Kanji? I am NOT a Japanese speaker (I can only poorly read kana, and with help). So here is my supplementary question. [EMAIL PROTECTED] wrote: Japanese document must consist of: hiragana: less than 100 characters katakana: less than 100 characters kanji: basic kanji has 6,879 characters as defined in JIS X 0208-1990 extended kanji has 6,067 characters as defined in JIS X 0212-1990 You mean extended kanji is an absolute requirement for any device intended to display some Japanese text? Technically, a Japanese document can be written in all Roman characters, but this is not a true Japanese document. 
I understand easily that this is _not_ the solution (it always takes me quite some time when I see my name written in kana or Cyrillic or whatever). But: what about a document written only with kana, without any kanji? I know this is far from perfect, that it will hurt (or upset?) the reader quite a lot, and will reduce his reading speed to a small fraction of normal, perhaps a tenth (but that's much better than romaji, anyway). But is it practical, for example for a small display? (say, 3 lines of 20 characters) Regards, Antoine
ODBC/JDBC Drivers
Does anyone know about any Unicode enabled ODBC/JDBC drivers for Microsoft SQL that will run with Linux Apache? Thanks in advance,
Beverly Corwin, President
Enso Company Ltd.
The Westin Building
2001 Sixth Avenue, Suite 3403
Seattle WA 98121 USA
Tel: 206.390.0743  Fax: 206.443.5758
www.enso-company.com
Re: JDBC drivers that support databases using Unicode for storage
Dear Tex, Have you checked the below? http://technet.oracle.com/doc/oracle8i_816/server.816/a76966/ch6.htm#7371 Best Regards,
+----------------------------------------------------------------+
| Linus Toshihiro Tanaka           500 Oracle Parkway M/S 4op7   |
| NLS Consulting Team              Redwood Shores, CA 94065 USA  |
| Server Globalization Technology  email: [EMAIL PROTECTED]      |
| Oracle Corporation                                             |
+----------------------------------------------------------------+
Tex Texin wrote: Hi, I am Unicode-enabling an application that works with Oracle and Microsoft SQL Server among other databases. I need to replace the current JDBC driver since it doesn't support Unicode going in/out of the database. Any recommendations for good-performing JDBC drivers that work with the above databases storing/retrieving Unicode? tex
Re: Subject lines....../ Lost Header?? Re: [nothing]
My previous message of a few minutes ago with the empty "Re:"-only header (at least as I got it back from the listserver) left my home with a header as shown below. Any information about the whereabouts of my lost Head leading to its recovery ...
Re: =?utf-8?B?UkU6IFN1YmplY3QgbGluZXMgaW4gVVRGLTggbXNzZ3M/IFt3YXM6?=
Jaap --
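The surviving header is an RFC 2047 "encoded-word": between the =? and ?= delimiters come the charset, the encoding ("B" for base64), and the payload. Decoding it by hand recovers the lost subject (Python's stdlib email.header module would do the same):

```python
import base64

# The encoded-word from the header above.
word = "=?utf-8?B?UkU6IFN1YmplY3QgbGluZXMgaW4gVVRGLTggbXNzZ3M/IFt3YXM6?="
# Strip "=?" and "?=", then split on "?" (base64 never contains "?").
charset, enc, payload = word[2:-2].split("?")
subject = base64.b64decode(payload).decode(charset)
print(subject)  # RE: Subject lines in UTF-8 mssgs? [was:
```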