Re: Unicode on a non-Unicode web page

2000-09-08 Thread Otto Stolz

Perhaps a typo only, but possibly with dire consequences; so I'd
rather set it right.

On Thu, 7 Sep 2000 07:16 (GMT-0800), Herman Ranes wrote:
 to make the HTML code 'understandable' to Netscape Navigator 4, without
 actually encoding in UTF-8:
 -Meta tag the document as UTF-8
 -Encode characters beyond U+00FF as decimal NCRs (&#232;).

This must be:
  Encode characters beyond U+007F as decimal NCRs

UTF-8 uses the byte values beyond 7F for its multi-byte sequences;
8-bit, single-byte coded characters beyond 7F, interspersed in a UTF-8
datastream, would be misunderstood by the receiver.
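To illustrate the point (a minimal Python sketch, not from the original
message; the particular character is just an example):

    text = "\u00e8"                    # è (U+00E8), beyond U+007F
    print(text.encode("utf-8"))        # b'\xc3\xa8' -- two-byte UTF-8 sequence
    print(text.encode("latin-1"))      # b'\xe8'     -- single byte beyond 7F

    # A lone Latin-1 byte 0xE8 inside data declared as UTF-8 is invalid:
    # 0xE8 announces a three-byte sequence, and no valid continuation follows.
    try:
        b"\xe8".decode("utf-8")
    except UnicodeDecodeError as err:
        print("misunderstood by the receiver:", err)

    # Bytes 00-7F mean the same in ASCII and UTF-8, which is why only
    # characters up to U+007F may be left unescaped; everything above
    # must be written as an NCR (e.g. &#232;) when the page is declared
    # as UTF-8 but not actually encoded in UTF-8.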

Best wishes,
   Otto Stolz



Re: Unicode on a non-Unicode web page

2000-09-08 Thread John Wilcock

[Sorry Paul, I didn't particularly intend to send this privately. 
I notice that the Unicode list no longer sets a Reply-To: header. 
Ô Sarasvati, might I humbly request that this behaviour be reinstated
(though of course not overriding any Reply-To that individual
subscribers may wish to set).]

On Thu, 7 Sep 2000 12:46:56 -0800 (GMT-0800), Paul Deuter wrote:
 Finally you also have the solution already suggested of encoding everything
 as UTF-8 and using that as your main character set.  I don't know of an easy
 way of transcoding 8859-2 to UTF-8.  One hard way is to use Notepad on
 Windows 2000, on a machine that has 8859-2 as the ANSI character set, and
 save to UTF-8.

One 'easy' way is to open the file as coded text using Word 2000,
selecting Central Europe (ISO) when opening and UTF-8 when saving.
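Another possibility (a sketch only, not mentioned in the thread; the file
names are made up) is any tool or scripting language that knows both
encodings, e.g. in Python:

    # Re-encode an ISO-8859-2 (Latin-2) file as UTF-8; file names are hypothetical.
    with open("latin2.html", encoding="iso-8859-2") as src:
        data = src.read()
    with open("utf8.html", "w", encoding="utf-8") as dst:
        dst.write(data)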

John.

-- 
-- Over 1200 webcams from ski resorts around the world - 
http://www.tradoc.fr/john/webcams/
-- Translate your technical documents and web pages - http://www.tradoc.fr/en/



RE: Unicode on a non-Unicode web page

2000-09-08 Thread Alan Wood

 John Cowan  wrote: 
 
 Versions of Netscape before 4.7 had this bug: character references
 greater than &#255; only worked if the transmission character set
 was UTF-8.

This bug is still present in the Windows version of Netscape 4.75.

Use Edit, Preferences, Fonts to make both Western and Unicode encoding use
Times New Roman and then look at:
http://www.hclrss.demon.co.uk/demos/wgl4.html

Now use View, Character Set to switch between Western (ISO-8859-1) and
Unicode (UTF-8).  With Western, most characters above 255 display as
question marks, but with Unicode they all appear correctly.

Alan Wood
Documentation Writer / Web Master
Context Limited
mailto:[EMAIL PROTECTED]
http://www.context.co.uk/
http://www.alanwood.net/ (Unicode, special characters, pesticide names)




Re: Unicode on a non-Unicode web page

2000-09-08 Thread Mark Davis

Take a look at the Unicode FAQ on the web, at www.unicode.org

"Gary P. Grosso" wrote:

 Hi Unicoders,

 I am working on software to emit HTML in the encoding
 and character set of the user's choice, from SGML/XML
 documents which can contain any Plane 1 Unicode character.
 The question is what to do with characters outside the
 selected encoding.  I thought I would use the "numeric"
 character entity reference and IE5 at least seems to
 render that well, but Netscape Communicator 4.6 doesn't.

 One way to look at this is: how do I use unicode as an
 "escape" to include some isolated content on a web page
 of arbitrary encoding?
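 [A minimal sketch of that NCR-fallback approach, purely as an illustration
 and not the actual Arbortext code: encode the text in the chosen charset and
 substitute decimal character references for anything outside its repertoire.
 In Python, for example:]

     # Hypothetical example: emit ISO-8859-2 output, replacing anything the
     # charset cannot represent with a decimal character reference.
     text = "\u010cl\u00e1nek \u0402 \u0393 \u304b"   # Clanek, DJE, GAMMA, HIRAGANA KA
     print(text.encode("iso-8859-2", errors="xmlcharrefreplace"))
     # b'\xc8l\xe1nek &#1026; &#915; &#12363;'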

 For example, I have something such as:

 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
 <html><head><title>Unicode in a Latin 2 page</title>
 <meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-2">
 </head>
 <body style="line-height: 16pt"><div class="pgbrk" style="padding-top: 48pt">
 <p>Èlánek Úvod ®ádný èest èin èinìn èinù èinùm èinnost èinnosti
 jakmile jako jako¾ jako¾to jazyka je¾ jediné jednat jednotkou jednotlivec</p>
 <p>CYRILLIC CAPITAL LETTER DJE: &#1026;</p>
 <p>CAPITAL LETTER GAMMA: &#x0393;</p>
 <p>HIRAGANA LETTER KA: &#12363;</p>
 <p>jeho jejich jemu jimi jiného jinému jiných jiným jinými jsou ka¾dému ka¾dý
 </p>
 </body>
 </html>

 which probably looks awful since your email client is not likely
 set to display Latin 2, but which can also be seen at:

 http://www.angelfire.com/mi/virtualattic/latin2_test.html

 If I change the meta tag to:
 <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
 then Netscape does slightly better (still stumbles over &#x-anything
 and doesn't display the hiragana, but does display the DJE and GAMMA
 if I use decimal values) but of course now the Czech words are not
 displayed properly.

 My question(s):

 Is there some way I can nudge Netscape's browser to display these?

 Is there a better way to write this admittedly mongrel HTML content?
 I have heard somewhere that it is possible to change the charset choice
 "on the fly"; if that would work, I would appreciate a pointer to
 somewhere that says how best to do this.

 Thanks in advance for any insights.

 ---
 Gary Grosso
 [EMAIL PROTECTED]
 Arbortext, Inc.
 Ann Arbor, MI, USA




Re: Surrogate support in *ML?

2000-09-08 Thread Mark Davis

Good point. In the past, I have used "surrogate characters" to refer to the
characters encoded above U+FFFF, and "surrogate code units" to refer to the
UTF-16 units D800-DFFF. However, I think that leads to confusion. Nobody has
come up with a good term for all characters above U+FFFF. "Plane 1-16
characters" is clunky and requires explanation, as does "non-BMP characters".
Another possibility is "surrogate-pair characters". My personal favorite is
"astral characters" (I don't remember who came up with that).

Mark

Karlsson Kent - keka wrote:

  From: Brendan Murray/DUB/Lotus [mailto:[EMAIL PROTECTED]]
 ...
  Karlsson Kent - keka [EMAIL PROTECTED] wrote:
   At the level of XML the number of bits is irrelevant.
   The "high and low surrogate" code points are excluded
   from being used as NCRs.  A character (not UTF-16 code
   units) can be referenced by NCRs. See (XML) production 66
   (CharRef) and its well-formedness constraint (and
   production 2 (Char), though it neglects to exclude a number
   of other non-character code points in that production).
 
  I know that XML explicitly excludes surrogates. My question really refers
  to what one can do to encode the non-BMP data in the new Han unification
  data that will become part of 10646 and Unicode in the not too distant
  future: is this huge block of characters regarded as irrelevant, or has
  anyone proposed an encoding that can be used?

 What was apparently not clear enough from my answer is that
 you refer to the code point of the character.  Thus,
 assuming the following example characters pass and stay at
 the currently suggested code points, &#x10330; will refer
 to GOTHIC LETTER AHSA in plane 1, &#x2A718; will refer to
 CJK UNIFIED IDEOGRAPH-2A718 (which is in Extension B on
 plane 2), and so on.

 This should be clear from (XML) production 66 (CharRef)
 and its well-formedness constraint, that refers to
 (XML) production 2 (Char), that in turn does include planes
 01-10 (hex) (even though that production mistakenly includes
 32 not-a-character code points on the supplementary planes).

 In addition, XML processors must 'support' both UTF-8 and
 UTF-16 (not just UCS-2).  However, independently of document
 encoding, character references (CharRef) always refer to UCS
 code points (a.k.a. scalar values), not (UTF-16, UTF-8, or other)
 code units.
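 [As an illustration of that distinction (a hedged sketch, not from the
 original message): the reference names a code point, and surrogate code
 units appear only in the UTF-16 serialization of the result. In Python:]

     import xml.etree.ElementTree as ET

     # A hexadecimal character reference names a code point (scalar value),
     # independent of the document's own encoding.
     doc = "<p>&#x10330;</p>"                # GOTHIC LETTER AHSA, plane 1
     ahsa = ET.fromstring(doc).text

     print(hex(ord(ahsa)))                   # 0x10330 -- one code point
     print(ahsa.encode("utf-16-be").hex())   # d800df30 -- two UTF-16 code units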

 What is confusing is that "surrogates" sometimes refers to
 certain code units (for UTF-16) that are also reserved as code points,
 and sometimes to 'characters on planes 01-10'.  I think the
 latter is a misuse.

 /kent k




RE: Win32: Commandline/batch ANSI-UTF8-UTF16-UTF8-ANSI conversion

2000-09-08 Thread Marco . Cimarosti

Sure: uniconv.exe by Basis Technology.

It is distributed for free as a demo of the Rosette library; download from
http://rosette.basistech.com/demo.html.

The version I have (quite old) does not support UTF-16, but it has UCS-2,
which should be indistinguishable from UTF-16 if you just need cp 1252.

Call it without command line arguments, and it will output a long usage help
that starts like this:

  usage: uniconv [-debug] input-encoding input-file output-encoding output-file
         property | transform*

  Version 1.1RC2, 4/13/98
  Copyright (c) Basis Technology Corp. 1995-1998. All rights reserved.
  Type "uniconv -help" for more information.

  Encodings:  Arabic, ASCII, Big5, BMP, cp1251, cp1252, cp437, cp850,
              EUC-J, EUC-KR, GB2312, Greek, Hebrew, ISO-2022-JP,
              ISO-2022-KR, ISOLatinCyrillic, JapaneseAutoDetect,
              JIS_X0201, JIS_X_0208, KoreanAutoDetect, Latin1, Latin2,
              Latin3, Latin4, Latin5, Latin6, Shift-JIS, Thai, UCS2,
              Unicode11UCS2, Unicode11UTF7, Unicode11UTF8, UTF7, UTF8
  Properties: [...snip...]

_ Marco
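[Judging only from the usage text above (the exact command syntax has not
been verified against the tool itself), the conversions requested in the
quoted message below would presumably be invoked along these lines:

    uniconv cp1252 input.txt UTF8 output.txt
    uniconv UTF8 input.txt cp1252 output.txt

with UCS2 in place of UTF8 where 16-bit files are wanted.]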


  -----Original Message-----
  From: Mikko Lahti [mailto:[EMAIL PROTECTED]]
  Sent: 08 Sep 2000, Fri 03.31
  To: Unicode List
  Subject: Win32: Commandline/batch ANSI-UTF8-UTF16-UTF8-ANSI conversion tools

  Are there any Win32 command line or batch ANSI to Unicode conversion
  tools out there?

  Desired conversions are:
  - Windows-1252 to UTF-8
  - Windows-1252 to UTF-16
  - UTF-8 to Windows-1252
  - UTF-16 to Windows-1252
  - UTF-8 to UTF-16
  - UTF-16 to UTF-8

  Later,
  Mikko
  Globalization Specialist
  Onyx Software - Bringing e-business and business together
  [EMAIL PROTECTED]
  www.onyx.com
  425.519.4172


Re: Surrogate support in *ML?

2000-09-08 Thread John Cowan

Mark Davis wrote:

 My personal favorite is "astral
 characters" (don't remember who came up with that).

I did. Or at least I came up with "Astral Planes" as opposed to the
"Basic Multilingual Plane".

Somebody got mighty offended, though ("Those planes are *real*!"),
so I dropped it.

-- 
There is / one art   || John Cowan [EMAIL PROTECTED]
no more / no less|| http://www.reutershealth.com
to do / all things   || http://www.ccil.org/~cowan
with art- / lessness \\ -- Piet Hein



Re: Plane 14 redux

2000-09-08 Thread Roozbeh Pournader



On Wed, 6 Sep 2000, Doug Ewell wrote:

 I have suggested on this list using Plane 14 tags to assist in glyph
 selection between C, J, and K or between Russian italics and Serbian
 italics because I thought they would provide a nice, all-Unicode
 solution *without* resorting to higher protocols.  Other Unicode
 mechanisms, like LTR and RTL directional overrides and ligation control
 via ZWJ and ZWNJ (to name only two), seem to have been invented for
 exactly that purpose.

I don't agree with the last comment. ZWJ and ZWNJ are not only about visual
appearance. While the difference between Chinese and Japanese glyphs makes no
difference in meaning, leaving out or using a ZWJ or ZWNJ sometimes changes
the meaning of a word. That is at least true for Persian; I don't know about
Indic languages.

--roozbeh





Reply-To mess opinion [was Re: Unicode on a non-Unicode web page]

2000-09-08 Thread Mark Leisher

Look out!  Hot button political issue!  Delete if uninterested in opinion.

John And I even more humbly request that it *not* be reinstated.  Rather
John than reiterating the arguments, I will point to Chip Rosenthal's
John "Reply-To Munging Considered Harmful" at
John http://www.unicom.com/pw/reply-to-harmful.html , which is hereby
John incorporated by reference, as the lawyers say.

Totally unconvincing, aside from the possibility of problems introduced by
munging (some might argue this is another sign that email has become
over-complicated).

John In the interests of fair play, I will also point to Simon Hill's
John "Reply-To Munging Considered Useful" at
John http://www.metasystema.org/essays/reply-to-useful.mhtml .

Not only is it simpler and more logical, it just feels right.
-
Mark Leisher
Computing Research LabCinema, radio, television, magazines are a
New Mexico State University   school of inattention: people look without
Box 30001, Dept. 3CRL seeing, listen without hearing.
Las Cruces, NM  88003-- Robert Bresson