Re: Unicode on a non-Unicode web page
Perhaps a typo only, but possibly with dire consequences; so I'd rather set it right.

On Thu, 7 Sep 2000 07:16 (GMT-0800), Herman Ranes wrote:

> to make the HTML code 'understandable' to Netscape Navigator 4, without actually encoding in UTF-8:
> - Meta tag the document as UTF-8
> - Encode characters beyond U+00FF as decimal NCRs (&#232;).

This must be:

  Encode characters beyond U+007F as decimal NCRs

UTF-8 uses the byte values beyond 7F for its multi-byte sequences; 8-bit, single-byte coded characters beyond 7F, interspersed in a UTF-8 datastream, would be misunderstood by the receiver.

Best wishes,
Otto Stolz
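To illustrate the corrected rule, here is a minimal Python sketch. The helper names to_ascii_with_ncrs and is_valid_utf8 are made up for this example; the only library feature relied on is the standard "xmlcharrefreplace" codec error handler. It shows that every character beyond U+007F becomes a decimal NCR, and that a raw single byte such as E8 is not acceptable inside a UTF-8 stream while its NCR form is.

```python
def to_ascii_with_ncrs(text):
    """Return ASCII bytes; every character above U+007F becomes a decimal NCR."""
    return text.encode("ascii", errors="xmlcharrefreplace")


def is_valid_utf8(data):
    """Report whether a byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


if __name__ == "__main__":
    word = "\u010dest"  # Czech word starting with U+010D, beyond U+007F
    print(to_ascii_with_ncrs(word))  # b'&#269;est'

    # A raw Latin-1 byte E8 (U+00E8) is not valid UTF-8 on its own ...
    print(is_valid_utf8("\u00e8".encode("latin-1")))  # False

    # ... but its NCR form is plain ASCII, and ASCII is always valid UTF-8.
    print(is_valid_utf8(to_ascii_with_ncrs("\u00e8")))  # True
```

The output of to_ascii_with_ncrs can be served under a UTF-8 label because it never contains a byte above 7F.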
Re: Unicode on a non-Unicode web page
[Sorry Paul, I didn't particularly intend to send this privately. I notice that the Unicode list no longer sets a Reply-To: header. Ô Sarasvati, might I humbly request that this behaviour be reinstated (though of course not overriding any Reply-To that individual subscribers may wish to set).]

On Thu, 7 Sep 2000 12:46:56 -0800 (GMT-0800), Paul Deuter wrote:

> Finally you also have the solution already suggested of encoding everything as UTF-8 and using that as your main character set. I don't know of an easy way of transliterating 8859-2 to UTF-8. The hard ways are using Notepad on Windows 2000 on a machine that has 8859-2 as the ANSI character set and saving to UTF-8.

One 'easy' way is to open the file as coded text using Word 2000, selecting Central Europe (ISO) when opening and UTF-8 when saving.

John.
--
-- Over 1200 webcams from ski resorts around the world - http://www.tradoc.fr/john/webcams/
-- Translate your technical documents and web pages - http://www.tradoc.fr/en/
RE: Unicode on a non-Unicode web page
John Cowan wrote:

> Versions of Netscape before 4.7 had this bug: character references greater than &#255; only worked if the transmission character set was UTF-8.

This bug is still present in the Windows version of Netscape 4.75.

Use Edit, Preferences, Fonts to make both Western and Unicode encoding use Times New Roman and then look at:

http://www.hclrss.demon.co.uk/demos/wgl4.html

Now use View, Character Set to switch between Western (ISO-8859-1) and Unicode (UTF-8). With Western, most characters above 255 display as question marks, but with Unicode they all appear correctly.

Alan Wood
Documentation Writer / Web Master
Context Limited
mailto:[EMAIL PROTECTED]
http://www.context.co.uk/
http://www.alanwood.net/ (Unicode, special characters, pesticide names)
Re: Unicode on a non-Unicode web page
Take a look at the Unicode FAQ on the web, at www.unicode.org

"Gary P. Grosso" wrote:

> Hi Unicoders,
>
> I am working on software to emit HTML in the encoding and character set of the user's choice, from SGML/XML documents which can contain any Plane 1 Unicode character. The question is what to do with characters outside the selected encoding. I thought I would use the "numeric" character entity reference and IE5 at least seems to render that well, but Netscape Communicator 4.6 doesn't.
>
> One way to look at this is: how do I use unicode as an "escape" to include some isolated content on a web page of arbitrary encoding?
>
> For example, I have something such as:
>
> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
> <html><head><title>Unicode in a Latin 2 page</title>
> <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-2">
> </head>
> <body style="line-height: 16pt"><div class="pgbrk" style="padding-top: 48pt">
> <p>Èlánek Úvod ®ádný èest èin èinìn èinù èinùm èinnost èinnosti jakmile jako jako¾ jako¾to jazyka je¾ jediné jednat jednotkou jednotlivec</p>
> <p>CYRILLIC CAPITAL LETTER DJE: &#1026;</p>
> <p>CAPITAL LETTER GAMMA: &#x0393;</p>
> <p>HIRAGANA LETTER KA: &#12363;</p>
> <p>jeho jejich jemu jimi jiného jinému jiných jiným jinými jsou ka¾dému ka¾dý </p>
> </body>
> </html>
>
> which probably looks awful since your email client is not likely set to display Latin 2, but which can also be seen at:
>
> http://www.angelfire.com/mi/virtualattic/latin2_test.html
>
> If I change the meta tag to:
>
> <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
>
> then Netscape does slightly better (still stumbles over &#x-anything and doesn't display the hiragana, but does display the DJE and GAMMA if I use decimal values) but of course now the Czech words are not displayed properly.
>
> My question(s): Is there some way I can nudge Netscape's browser to display these? Is there a better way to write this admittedly mongrel HTML content?
>
> I have heard somewhere that it is possible to change charset choice "on the fly" and if that would work, I would appreciate a pointer to somewhere that says how best to do this.
>
> Thanks in advance for any insights.
>
> ---
> Gary Grosso
> [EMAIL PROTECTED]
> Arbortext, Inc.
> Ann Arbor, MI, USA
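The "escape" approach described in the question (write what the page encoding can represent directly, and fall back to numeric character references for everything else) can be sketched in Python as follows. The function name encode_for_page is hypothetical, and the sketch assumes the text has already been escaped for HTML markup characters such as & and <. It emits decimal references only, which happens to match the observation that Netscape handled decimal values better than hexadecimal ones. It does nothing about the browser bug itself.

```python
def encode_for_page(text, charset):
    """Encode text in the page's charset; unencodable characters become decimal NCRs."""
    return text.encode(charset, errors="xmlcharrefreplace")


if __name__ == "__main__":
    # U+010C and U+00E1 exist in ISO-8859-2; DJE, GAMMA and KA do not.
    body = "\u010cl\u00e1nek: \u0402 \u0393 \u304b"
    print(encode_for_page(body, "iso-8859-2"))
    # b'\xc8l\xe1nek: &#1026; &#915; &#12363;'
```

The same call with "utf-8" as the charset produces no references at all, since every character is then directly encodable.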
Re: Surrogate support in *ML?
Good point. In the past, I have used "surrogate characters" to refer to the characters encoded above U+FFFF, and surrogate code units to refer to the UTF-16 units D800-DFFF. However, I think that leads to confusion.

Nobody has come up with a good term for all characters above U+FFFF. "Plane 1-16 characters" is clunky and requires explanation, as does "non-BMP characters". Another possibility is "surrogate-pair characters". My personal favorite is "astral characters" (don't remember who came up with that).

Mark

Karlsson Kent - keka wrote:

> From: Brendan Murray/DUB/Lotus [mailto:[EMAIL PROTECTED]]
> ...
> > Karlsson Kent - keka [EMAIL PROTECTED] wrote:
> > > At the level of XML the number of bits is irrelevant. The "high and low surrogate" code points are excluded from being used as NCRs. A character (not UTF-16 code units) can be referenced by NCRs. See (XML) production 66 (CharRef) and its well-formedness constraint (and production 2 (Char), though they failed to exclude a number of other non-character code points in that production).
> >
> > I know that XML explicitly excludes surrogates. My question really refers to what one can do to encode the non-BMP data in the new Han unification data that will become part of 10646 and Unicode in the not too distant future: is this huge block of characters regarded as irrelevant, or has anyone proposed an encoding that can be used?
>
> What was apparently not clear enough from my answer is that you refer to the code point for the character. Thus, assuming the following example characters pass and stay at the currently suggested code points, &#x10330; will refer to GOTHIC LETTER AHSA in plane 1, &#x2A718; will refer to CJK UNIFIED IDEOGRAPH-2A718 (which is in extension B on plane 2), and so on.
>
> This should be clear from (XML) production 66 (CharRef) and its well-formedness constraint, which refers to (XML) production 2 (Char), which in turn does include planes 01-10 (hex) (even though that production mistakenly includes 32 not-a-character code points on the supplementary planes).
>
> In addition, XML processors must 'support' both UTF-8 and UTF-16 (not just UCS-2). However, independently of document encoding, character references (CharRef) always refer to UCS code points (a.k.a. scalar values), not (UTF-16, UTF-8, or other) code units.
>
> What is confusing is that sometimes "surrogates" refer to certain code units (for UTF-16) that are reserved as code points, and sometimes "surrogates" is used to refer to 'characters on planes 01-10'. I think the latter is a misuse.
>
> /kent k
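The distinction drawn above, between a character reference (one code point) and the UTF-16 encoding of that character (two surrogate code units), can be made concrete with a short Python sketch. The helper names char_ref and utf16_code_units are invented for the example, and the two code points are the ones quoted in the message, assumed to stay where they were proposed.

```python
def char_ref(ch):
    """Hexadecimal character reference for one character: always its code point."""
    return "&#x{:X};".format(ord(ch))


def utf16_code_units(ch):
    """The 16-bit code units that UTF-16 uses to encode one character."""
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


if __name__ == "__main__":
    for ch in ("\U00010330", "\U0002A718"):
        units = " ".join("{:04X}".format(u) for u in utf16_code_units(ch))
        print(char_ref(ch), "is encoded in UTF-16 as", units)
    # &#x10330; is encoded in UTF-16 as D800 DF30
    # &#x2A718; is encoded in UTF-16 as D869 DF18
```

A reference such as &#xD800; would name a surrogate code point rather than a character, which is what the Char production excludes.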
RE: Win32: Commandline/batch ANSI-UTF8-UTF16-UTF8-ANSI conversion
Sure: uniconv.exe by Basis Technology. It is distributed for free as a demo of the Rosette library; download from http://rosette.basistech.com/demo.html.

The version I have (quite old) does not support UTF-16, but it has UCS-2, which should be indistinguishable if you just need cp 1252.

Call it without command line arguments, and it will output a long usage help that starts like this:

    usage: uniconv [-debug] input-encoding input-file output-encoding output-file property | transform*
    Version 1.1RC2, 4/13/98
    Copyright (c) Basis Technology Corp. 1995-1998. All rights reserved.
    Type "uniconv -help" for more information.
    Encodings: Arabic, ASCII, Big5, BMP, cp1251, cp1252, cp437, cp850, EUC-J, EUC-KR, GB2312, Greek, Hebrew, ISO-2022-JP, ISO-2022-KR, ISOLatinCyrillic, JapaneseAutoDetect, JIS_X0201, JIS_X_0208, KoreanAutoDetect, Latin1, Latin2, Latin3, Latin4, Latin5, Latin6, Shift-JIS, Thai, UCS2, Unicode11UCS2, Unicode11UTF7, Unicode11UTF8, UTF7, UTF8
    Properties: [...snip...]

_ Marco

-----Original Message-----
From: Mikko Lahti [mailto:[EMAIL PROTECTED]]
Sent: 08 Sep 2000, Fri 03.31
To: Unicode List
Subject: Win32: Commandline/batch ANSI-UTF8-UTF16-UTF8-ANSI conversion too

> Are there any Win32 command line or batch ANSI to Unicode conversion tools out there?
>
> Desired conversions are:
> - Windows-1252 to UTF-8
> - Windows-1252 to UTF-16
> - UTF-8 to Windows-1252
> - UTF-16 to Windows-1252
> - UTF-8 to UTF-16
> - UTF-16 to UTF-8
>
> Later,
> Mikko
>
> Globalization Specialist
> Onyx Software - Bringing e-business and business together
> [EMAIL PROTECTED]
> www.onyx.com
> 425.519.4172
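For readers without uniconv, the six requested conversions can also be sketched with Python's standard codecs. This is not a description of how uniconv works. The function names transcode and convert_file are made up, and the codec names "cp1252", "utf-8" and "utf-16" are Python's. The "utf-16" codec writes a byte order mark when encoding and honours one when decoding. A conversion to Windows-1252 raises UnicodeEncodeError if the text contains a character that code page lacks, rather than silently losing it.

```python
def transcode(data, src, dst):
    """Decode bytes from the src encoding and re-encode them in dst (strict)."""
    return data.decode(src).encode(dst)


def convert_file(src_path, src, dst_path, dst):
    """File-to-file conversion, argument order modelled loosely on uniconv's."""
    with open(src_path, "rb") as f:
        data = f.read()
    with open(dst_path, "wb") as f:
        f.write(transcode(data, src, dst))


if __name__ == "__main__":
    euro_cp1252 = b"\x80"  # EURO SIGN in Windows-1252
    as_utf8 = transcode(euro_cp1252, "cp1252", "utf-8")
    print(as_utf8)  # b'\xe2\x82\xac'

    # Round trip through UTF-16 (with BOM) and back to Windows-1252.
    as_utf16 = transcode(as_utf8, "utf-8", "utf-16")
    print(transcode(as_utf16, "utf-16", "cp1252") == euro_cp1252)  # True
```

Calling convert_file("in.txt", "cp1252", "out.txt", "utf-8") would cover the first item on the list; the file names here are placeholders.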
Re: Surrogate support in *ML?
Mark Davis wrote:

> My personal favorite is "astral characters" (don't remember who came up with that).

I did. Or at least I came up with "Astral Planes" as opposed to the "Basic Multilingual Plane". Somebody got mighty offended, though ("Those planes are *real*!"), so I dropped it.

--
There is / one art       || John Cowan [EMAIL PROTECTED]
no more / no less        || http://www.reutershealth.com
to do / all things       || http://www.ccil.org/~cowan
with art- / lessness     \\ -- Piet Hein
Re: Plane 14 redux
On Wed, 6 Sep 2000, Doug Ewell wrote:

> I have suggested on this list using Plane 14 tags to assist in glyph selection between C, J, and K or between Russian italics and Serbian italics because I thought they would provide a nice, all-Unicode solution *without* resorting to higher protocols. Other Unicode mechanisms, like LTR and RTL directional overrides and ligation control via ZWJ and ZWNJ (to name only two), seem to have been invented for exactly that purpose.

I don't agree with the last comment. ZWJ and ZWNJ are not only for visual appearance. While the difference between Chinese and Japanese glyphs makes no difference in meaning, leaving out or using a ZWJ or ZWNJ sometimes changes the word meaning. That's at least true for Persian, I don't know about Indic languages.

--roozbeh
Reply-To mess opinion [was Re: Unicode on a non-Unicode web page]
Look out! Hot button political issue! Delete if uninterested in opinion.

John> And I even more humbly request that it *not* be reinstated. Rather
John> than reiterating the arguments, I will point to Chip Rosenthal's
John> "Reply-To Munging Considered Harmful" at
John> http://www.unicom.com/pw/reply-to-harmful.html , which is hereby
John> incorporated by reference, as the lawyers say.

Totally unconvincing aside from possibility of problems introduced by munging (some might argue this is another sign email has become over-complicated).

John> In the interests of fair play, I will also point to Simon Hill's
John> "Reply-To Munging Considered Useful" at
John> http://www.metasystema.org/essays/reply-to-useful.mhtml .

Not only simpler and logical, it just feels right.

-
Mark Leisher
Computing Research Lab          Cinema, radio, television, magazines are a
New Mexico State University     school of inattention: people look without
Box 30001, Dept. 3CRL           seeing, listen without hearing.
Las Cruces, NM 88003                -- Robert Bresson