Variations of UTF-16 (was: Re: UNICODE BOMBER STRIKES AGAIN)

2002-04-24 Thread Doug Ewell

Mark Davis [EMAIL PROTECTED] wrote:

 You can determine that that particular text is not legal UTF-32*,
 since there would be illegal code points in any of the three forms. IF you
 exclude null code points, again heuristically, that also excludes
 UTF-8, and almost all non-Unicode encodings. That leaves UTF-16,
 16BE, 16LE as the only remaining possibilities. So look at those:

 1. In UTF-16LE, the text is perfectly legal "Ken".
 2. In UTF-16BE or UTF-16, the text is the perfectly legal "䬀攀渀".

 Thus there are two legal interpretations of the text, if the only
 thing you know is that it is untagged. IF you have some additional
 information, such as that it could not be UTF-16LE, then you can
 limit it further.

OK, let me try to understand this again.  I'm sorry, you guys should
know that I'm not just trying to be a gadfly, but despite my efforts I
am still confused over whether an unlabeled, BOM-free sequence may or
may not be treated as little-endian UTF-16.

I think what Mark is saying is that, given Ken's byte sequence:

0x4B 0x00 0x65 0x00 0x6E 0x00

and some reason (heuristics, knowledge of platform, divine guidance,
etc.) to believe that this is Unicode text represented in some flavor of
UTF-16, I have my choice of:

(a) treating it as either UTF-16BE or UTF-16 and decoding it as
U+4B00 U+6500 U+6E00 (䬀攀渀), or

(b) treating it as UTF-16LE and decoding it as U+004B U+0065 U+006E
(Ken),

*BUT*

I must not *call* the sequence UTF-16, since that term is officially
reserved for BOM-marked text which can be either little- or big-endian,
or BOMless text which must be big-endian.

Is that what I have been missing all along?  It's perfectly OK for the
text to be encoded and decoded this way, so long as nobody actually
calls it UTF-16?  If so, then I've probably been arguing over nothing.

-Doug Ewell
 Fullerton, California






Re: unidata is big

2002-04-24 Thread Theo Veenker

andreas palsson wrote:
 
 Hi.
 
 I would just like to know if someone could give me a tip on how to
 structure all the unicode-information in memory?
 
 All the UNIDATA does contain quite a bit of information and I can't see
 any obvious method which is memory-efficient and gives fast access.

You might want to evaluate some of the open source libraries 
mentioned under Enabled Products on the unicode site. For my
own lib (http://www.let.uu.nl/~Theo.Veenker/personal/projects/ucp/)
I've created a separate table builder tool for each property or 
mapping. The tools organize data in planes, and for each plane
all possible trie setups are determined (about 80 combinations
of one-, two- or three-stage tables). Then the cheapest setup
is used. This still requires over 230kb to store all data 
(except character names and comments) from the following files:
UnicodeData.txt, EastAsianWidth.txt, LineBreak.txt, ArabicShaping.txt,
Scripts.txt, Blocks.txt, SpecialCasing.txt, CaseFolding.txt,
BidiMirroring.txt, PropList.txt, DerivedCoreProperties.txt,
DerivedNormalizationProperties.txt, and DerivedJoiningType.txt.
For some mappings I've stored 32 bit code points where 16 bit
would have been enough, but I decided API uniformity is more
important than memory efficiency. 
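
As a minimal sketch of such a two-stage (trie-style) lookup in Python (not
Theo's actual code; the 256-entry block size and the function names are my
own assumptions):

    BLOCK_SIZE = 256

    def build_two_stage(values):
        """Compact a flat per-code-point value list into an index plus blocks.
        Identical 256-entry blocks are stored only once and shared."""
        index, blocks, seen = [], [], {}
        for start in range(0, len(values), BLOCK_SIZE):
            block = tuple(values[start:start + BLOCK_SIZE])
            if block not in seen:
                seen[block] = len(blocks)
                blocks.append(block)
            index.append(seen[block])
        return index, blocks

    def lookup(index, blocks, cp):
        """Return the stored property value for code point cp."""
        return blocks[index[cp // BLOCK_SIZE]][cp % BLOCK_SIZE]

Because most blocks repeat (unassigned ranges, long runs with the same
property value), the shared blocks are what buy the compactness described
above.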

I wouldn't bother too much about memory efficiency; it's irrelevant
these days. Even your mobile phone has enough memory to store all 
unicode data 10..20 times. Same thing for lookup speed. All you have
to do to get it fast is to wait (a few seasons).

Theo




Whence UniData.txt? (was Re: unidata is big)

2002-04-24 Thread Bob_Hallissy


Theo's comment leads me to a question I've pondered recently:

Assumptions:

   Many apps, from independent sources, need to access the Unicode
   character data,

   A lot of these apps aren't overly concerned with the slight overhead of
   parsing the data as needed from Unicode-supplied data files directly.

   Similarly, such apps benefit from being able to easily upgrade to new
   Unicode releases by simply replacing the data files.

   It isn't very user-friendly for every such app to store its own
   private copy of the character data files when a single shared copy would
   take up less space and be easier to maintain.

It would seem to me that there is some value in establishing either (1) a
standard location where programs can expect to find (or install) a local
copy of the Unicode data files, or (2) a standard way to discover where
such a local copy of these files exists. My preference would be (2), which
would make it easy to configure a network of machines to share a single
copy of the data files. Something as simple as an environment variable
could work if developers were to agree on its name and semantics.
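
As a minimal sketch of option (2) in Python (the variable name
UNICODE_DATA_DIR is hypothetical, not an existing convention), an
application might resolve the shared location like this:

    import os

    def find_ucd_file(name):
        """Locate a UCD file via an agreed-upon environment variable,
        falling back to an application-private copy."""
        for base in (os.environ.get("UNICODE_DATA_DIR"), "./unidata"):
            if base and os.path.exists(os.path.join(base, name)):
                return os.path.join(base, name)
        raise FileNotFoundError(name)

    # e.g. path = find_ucd_file("UnicodeData.txt")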

(I understand there may be different mechanisms for different platforms,
but it would be even better if a standard mechanism were cross platform).

So, are there any conventions for this evolving?  Or would anyone like to
propose one?

Bob



On 24/04/2002 09:26:55 Theo Veenker wrote:

andreas palsson wrote:

I wouldn't bother too much about memory efficiency; it's irrelevant
these days. Even your mobile phone has enough memory to store all
unicode data 10..20 times. Same thing for lookup speed. All you have
to do to get it fast is to wait (a few seasons).

Theo






Re: Variations of UTF-16 (was: Re: UNICODE BOMBER STRIKES AGAIN)

2002-04-24 Thread Mark Davis

below
—

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

- Original Message -
From: Doug Ewell [EMAIL PROTECTED]
To: Mark Davis [EMAIL PROTECTED]; [EMAIL PROTECTED]
Cc: Kenneth Whistler [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Tuesday, April 23, 2002 23:02
Subject: Variations of UTF-16 (was: Re: UNICODE BOMBER STRIKES
AGAIN)


 Mark Davis [EMAIL PROTECTED] wrote:

  You can determine that that particular text is not legal UTF-32*,
  since there would be illegal code points in any of the three forms.
  IF you exclude null code points, again heuristically, that also
  excludes UTF-8, and almost all non-Unicode encodings. That leaves
  UTF-16, 16BE, 16LE as the only remaining possibilities. So look at
  those:
 
  1. In UTF-16LE, the text is perfectly legal "Ken".
  2. In UTF-16BE or UTF-16, the text is the perfectly legal "䬀攀渀".
 
  Thus there are two legal interpretations of the text, if the only
  thing you know is that it is untagged. IF you have some additional
  information, such as that it could not be UTF-16LE, then you can
  limit it further.

 OK, let me try to understand this again.  I'm sorry, you guys should
 know that I'm not just trying to be a gadfly, but despite my efforts I
 am still confused over whether an unlabeled, BOM-free sequence may or
 may not be treated as little-endian UTF-16.

 I think what Mark is saying is that, given Ken's byte sequence:

 0x4B 0x00 0x65 0x00 0x6E 0x00

 and some reason (heuristics, knowledge of platform, divine guidance,
 etc.) to believe that this is Unicode text represented in some flavor
 of UTF-16, I have my choice of:

 (a) treating it as either UTF-16BE or UTF-16 and decoding it as
 U+4B00 U+6500 U+6E00 (䬀攀渀), or

 (b) treating it as UTF-16LE and decoding it as U+004B U+0065 U+006E
 (Ken),

 *BUT*

 I must not *call* the sequence UTF-16, since that term is officially
 reserved for BOM-marked text which can be either little- or big-endian,
 or BOMless text which must be big-endian.

Yes, assuming the BUT clause applies to (b). That is, the untagged
byte sequence

0x4B 0x00 0x65 0x00 0x6E 0x00

could be
(a) U+4B00 U+6500 U+6E00 (䬀攀渀): UTF-16BE or UTF-16
(b) U+004B U+0065 U+006E (Ken): UTF-16LE
(c) U+004B U+0000 U+0065 U+0000 U+006E U+0000
(K<null>e<null>n<null>): ASCII, UTF-8, CP-1252, etc.
(d) ...: EBCDIC

If I really wanted to find out all the things it could be, I could run
it through the 700+ converters in ICU and capture all the cases that
don't detect illegal byte sequences. Except that the vast majority of
these are very unlikely because they would produce nulls in the code
point sequence.
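
A rough Python stand-in for that kind of winnowing (using Python's codecs
rather than ICU's 700+ converters, and only a handful of candidates, purely
as an illustration):

    CANDIDATES = ["utf-32", "utf-32-be", "utf-32-le",
                  "utf-16", "utf-16-be", "utf-16-le",
                  "utf-8", "ascii", "cp1252"]

    def plausible_encodings(data):
        """Keep candidates that decode without error and, heuristically,
        without producing null code points."""
        survivors = []
        for enc in CANDIDATES:
            try:
                text = data.decode(enc)
            except UnicodeDecodeError:
                continue          # illegal byte sequence for this encoding
            if "\x00" not in text:
                survivors.append(enc)
        return survivors

    print(plausible_encodings(b"\x4B\x00\x65\x00\x6E\x00"))
    # leaves only the UTF-16 family; the null test removes UTF-8, ASCII, cp1252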


 Is that what I have been missing all along?  It's perfectly OK for the
 text to be encoded and decoded this way, so long as nobody actually
 calls it UTF-16?  If so, then I've probably been arguing over nothing.

Not really arguing, just exploring the issues. But one key is that if
you are in an environment where untagged data is being exchanged (a
bad idea, anyway), *and* the convention for that environment is to use
the BOM (in either UTF-8, UTF-16, or UTF-32) thus excluding the
possibility of the explicit LE or BE forms, then that would further
winnow down the number of possible interpretations of untagged text.
In this case, that would select the (a) interpretation.

One real problem we have is that the *encoding form* UTF-16 and the
*encoding scheme* UTF-16 are very different, but have the same name.
If we had an explicit name for one or the other that would help to
reduce the confusion. (We also don't have a name to distinguish the
BOMed UTF-8 from the unBOMed, but that seems to cause less confusion.)


 -Doug Ewell
  Fullerton, California









Re: Variations of UTF-16 (was: Re: UNICODE BOMBER STRIKES AGAIN)

2002-04-24 Thread Doug Ewell

Mark Davis [EMAIL PROTECTED] wrote:

 I must not *call* the sequence UTF-16, since that term is officially
 reserved for BOM-marked text which can be either little- or big-endian,
 or BOMless text which must be big-endian.

 Yes, assuming the BUT clause applies to (b). That is, the untagged
 byte sequence

 0x4B 0x00 0x65 0x00 0x6E 0x00

 could be
 (a) U+4B00 U+6500 U+6E00 (䬀攀渀): UTF-16BE or UTF-16
 (b) U+004B U+0065 U+006E (Ken): UTF-16LE
 (c) U+004B U+0000 U+0065 U+0000 U+006E U+0000
 (K<null>e<null>n<null>): ASCII, UTF-8, CP-1252, etc.
 (d) ...: EBCDIC

Yes, that's what I meant to say.

 Not really arguing, just exploring the issues. But one key is that if
 you are in an environment where untagged data is being exchanged (a
 bad idea, anyway),

But not all mechanisms for exchanging data allow tagging.  (Bumper
sticker: UNTAGGED TEXT HAPPENS)

Here's what caused me to exhume this discussion.  Ken made a joke:

 -- K '\0' e '\0' n '\0'

(which I enjoyed) in response to the UNICODE BOMBER STRIKES AGAIN
satire about blank squares infiltrating otherwise good text.  This
representation of Ken in untagged, little-endian UTF-16,
misinterpreted as a sequence of 8-bit characters, corresponds to Mark's
example (c) above.  It *is* a misinterpretation, right?  You're not
really supposed to read this sequence of six bytes as K '\0' e '\0' n
'\0'.  That was the whole joke.

And in fact, there is only one correct interpretation in this example
(that is, only one interpretation that matches the sender's intent), and
that is U+004B U+0065 U+006E.  I contend that U+4B00 U+6500 U+6E00,
whether it makes sense semantically in Chinese or not, is just as
incorrect in this context as an ASCII, EBCDIC, FIELDATA, or BOCU-1
reading.

Note that everything I said before about this example is true:

- there is no BOM
- there is no external tagging as UTF-16LE (or anything else)
- we don't know the native byte orientation of the sender's machine

There's a lot of text like this out there, not all of which is intended
as jokes or even illustrations.  The Unix and Linux world is very
opposed to the use of BOM in plain-text files, and if they feel that way
about UTF-8 they probably feel the same about UTF-16.

Note also that heuristics in an example like this can be deceiving.  A
famous heuristic that applies to this example is to notice that every
other byte is 0, and therefore treat the text as UTF-16LE.  A different
heuristic could take the big-endian interpretation (U+4B00 U+6500
U+6E00), notice that all of these characters are CJK ideographs, and use
that to deduce (incorrectly) that the text should be UTF-16BE.  What if
the text were reversed?  ('\0' K '\0' e '\0' n)  The latter heuristic
would then suggest that the text should be UTF-16LE.  Heuristics are not
perfect, but sometimes they're all we've got.
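
A minimal sketch of that zero-byte heuristic in Python (my illustration; it
assumes the text is mostly Latin-1-range characters, which is exactly the
assumption that can mislead it):

    def guess_utf16_byte_order(data):
        """Guess byte order by checking which half of each 16-bit unit is zero."""
        even_zeros = sum(1 for b in data[0::2] if b == 0)  # high bytes if big-endian
        odd_zeros = sum(1 for b in data[1::2] if b == 0)   # high bytes if little-endian
        if odd_zeros > even_zeros:
            return "UTF-16LE"   # 4B 00 65 00 6E 00 lands here: "Ken"
        if even_zeros > odd_zeros:
            return "UTF-16BE"   # 00 4B 00 65 00 6E (the reversed example) lands here
        return "unknown"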

So Ken's joke is encoded in BOMless, little-endian,
non-externally-tagged UTF-16.  It's a perfectly legal Unicode
representation, but we can't call it UTF-16 because that term implies
big-endian.  This sounds legalistic, sort of like the warnings on the
Unicode Web site about the correct use of the word Unicode.  But at
least I think I understand the issues a little better, and so the
exploration effort paid off.

-Doug Ewell
 Fullerton, California






variations of UTF-16/UTF-32 and browsers' interpretation (was Re: browsers and unicode surrogates)

2002-04-24 Thread Jungshik Shin

On Mon, 22 Apr 2002, Stefan Persson wrote:

 I haven't added plane 1 characters, yet (Tex let me do that, thanks !).
However, my test pages can be used to test how various web browsers
interpret various forms of UTF-16 and UTF-32 with or without BOM and
with or without external info. (such as MIME charset in http C-T header).
This is not of practical importance/interest (UTF-8 is much less ambiguous
and better supported than UTF-16/32 by various web browsers), but it's
interesting nonetheless because the way various forms of UTF-16/32 have
to be interpreted has been discussed recently.

 - Original Message -
 From: [EMAIL PROTECTED]
 Sent: den 22 april 2002 20:24

Thank you for this tip. I didn't know this and ended up
  'cluttering' my filenames with charset suffixes at
  http://jshin.net/i18n/utftest.

 The following pages display Korean text:

 * All UTF-16 with BOM
 * All UTF-32LE with BOM
 * UTF-16LE without BOM, encoding specified as UTF-16

 The following pages are displayed as Latin-1 gibberish, ASCII displayed
 properly:
 * UTF-16 without BOM, encoding specified as UTF-16LE, UTF-16BE, or not
 specified at all
 * All UTF-32BE
 * All UTF-32LE without BOM

 This page is misinterpreted as UTF-16LE without line breaking:
 * UTF-16BE without BOM, encoding specified as UTF-16

 I'm using IE 5.5 under Windows 98.

  Thank you for your test result. MS IE 5.5 seems to *ignore* the
MIME charset specified in the http header. It appears to rely *solely* on
the presence of a BOM. If no BOM is present, it assumes the platform
byte order.  Is this behavior compatible with what Mark and
Ken described as to how to interpret various forms
of UTF-16 and UTF-32 last week and this week again?  It doesn't seem to be.
The way Mozilla interprets various forms of UTF-16|32 appears
to be more in line with what Mark and Ken have written although
there are some issues to be resolved as well. It'll be interesting
to see how Opera does.

  Here's the test result with Mozilla 0.9.9 on ix86 Linux (that is,
the platform byte order is the same as your case).

 * The following pages always get displayed  as intended

   - All UTF-16's and UTF-32's with MIME charset (*with* the endianness
 at the end, i.e. UTF-32(LE|BE), UTF-16(LE|BE))
 specified in http header regardless of the endianness and the presence
 of BOM
 (In UTF-32 pages, the BOM is NOT ignored and is rendered as a
  'ZWNBSP' enclosed by a dotted square)  : 8 cases

   - UTF-16BE  with BOM but without MIME charset specified
 : 1 case

   - UTF-16BE and UTF-32BE without BOM but MIME charset specified
 as UTF-16 and UTF-32  : 2 cases

   - UTF-16BE and UTF-32BE with BOM but MIME charset specified
 as UTF-16 and UTF-32  : 2 cases

 * For the following pages, auto-detection sometimes works but not
   always.

   - UTF-16LE and UTF-32LE with BOM but without MIME charset specified
 : 2 cases

   - UTF-32BE  with BOM but without MIME charset specified
 : 1 case


  * The following pages are recognized as Latin-1. US-ASCII
characters are rendered correctly with one or three hollow
boxes before or after each of them depending on the endianness (BE/LE)
and the size (16/32)

- UTF-16LE and UTF-32LE without BOM and without MIME charset
  (2 cases)

- UTF-16BE and UTF-32BE without BOM and without MIME charset
  (2 cases)

  * The following pages are recognized as UTF-16BE and UTF-32BE.

- UTF-16LE and UTF-32LE without BOM but with MIME charset specified
  as UTF-16 and UTF-32  (2 cases)

- UTF-16LE and UTF-32LE with BOM but with MIME charset specified
  as UTF-16 and UTF-32  (2 cases)


  Jungshik Shin






Re: Variations of UTF-16 (was: Re: UNICODE BOMBER STRIKES AGAIN)

2002-04-24 Thread David Starner

On Wed, Apr 24, 2002 at 09:00:17AM -0700, Doug Ewell wrote:
 The Unix and Linux world is very
 opposed to the use of BOM in plain-text files, and if they feel that way
 about UTF-8 they probably feel the same about UTF-16.

Why? The problems with a BOM in UTF-8 have to do with it being an
ASCII-compatible encoding. (I'd guess that if there are any Unixes that
use EBCDIC, the same problems would apply to UTF-EBCDIC.) Pretty much
the only reason one would use UTF-16 is to be compatible with a foreign
system, and then you use the conventions of that system.

Also, look at the output of file:

n2404r.doc: Microsoft Office document data
file.utf8:  UTF-8 Unicode English text
file.utf16: Little-endian UTF-16 Unicode English character data
file.iso:   data
file_list:  ASCII text

There are basically two categories here: data or text. But UTF-16 is not
considered text; it's considered data, like a Word file. Most Unix users
would treat a UTF-16 encoded file the same way: as a format to be
converted from, or edited in a word processor only.

-- 
David Starner - [EMAIL PROTECTED]
It's not a habit; it's cool; I feel alive. 
If you don't have it you're on the other side. 
- K's Choice (probably referring to the Internet)




Re: Whence UniData.txt? (was Re: unidata is big)

2002-04-24 Thread Theo Veenker

[EMAIL PROTECTED] wrote:
 
 Theo's comment leads me to a question I've pondered recently:
 
 Assumptions:
 
Many apps, from independent sources, need to access the Unicode
character data,
 
A lot of these apps aren't overly concerned with the slight overhead of
parsing the data as needed from Unicode-supplied data files directly.
 
Similarly, such apps benefit from being able to easily upgrade to new
Unicode releases by simply replacing the data files.
 
It isn't very user-friendly for every such app to store its own
private copy of the character data files when a single shared copy would
take up less space and be easier to maintain.
 
 It would seem to me that there is some value in establishing either (1) a
 standard location where programs can expect to find (or install) a local
 copy of the Unicode data files, or (2) a standard way to discover where
such a local copy of these files exists. My preference would be (2), which
 would make it easy to configure a network of machines to share a single
 copy of the data files. Something as simple as an environment variable
 could work if developers were to agree on its name and semantics.

For applications that eat raw UCD files, this shouldn't be too difficult
to achieve. Any well designed app will/should have some parameter or env.
variable that you can set (no?). But for apps/libraries that like their UCD 
files cooked it is a different story because there is no recommended binary 
format for representing (compact) unicode character data. Personally I
would appreciate seeing such a recommendation including your point (2).
However apps/libs which enrich the character data with custom properties, 
would still need their own copy of the data.

The subject reminds me of the TZ database. Here you have a large text based 
database containing information on time zones and daylight saving times.
You can compile the data into a binary format by running a utility included
with the tz sources. Well, they don't give any recommendation on where to 
store the (text and/or binary) data, but at least there is a 'standard' 
format, which allows for sharing data. Would be nice to have something like
this for the UCD.

 (I understand there may be different mechanisms for different platforms,
 but it would be even better if a standard mechanism were cross platform).
 
 So, are there any conventions for this evolving?  Or would anyone like to
 propose one?

Please, go ahead :o)

Theo




RE: Variations of UTF-16 (was: Re: UNICODE BOMBER STRIKES AGAIN)

2002-04-24 Thread jarkko . hietaniemi

 Why? The problems with a BOM in UTF-8 have to do with it being an
 ASCII-compatible encoding.

Err, no.  That's not the point, AFAIK.  The point is that traditionally
in UNIX there hasn't been any sort of marker or tag in the beginning,
UNIX files being flat streams of bytes.  The UNIX toolset has been built
with this principle in mind.  No metadata in the files.  BOM breaks this.

  cat file1 file2 file3 > file4

will have three BOMs, two of them in the middle of file4.

  wc -c file1

would have to skip the BOM to avoid getting a wrong byte count.

  sort -o file5 file1

would have to strip the BOM from file1 (but put it back into file5?)

And so forth.
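
A sketch of what a BOM-aware cat would have to do (my illustration, assuming
UTF-8 input files):

    import sys

    BOM = b"\xef\xbb\xbf"   # the BOM as encoded in UTF-8

    def bom_aware_cat(paths, out=sys.stdout.buffer):
        """Concatenate files, dropping a leading BOM from each input so no
        stray BOMs end up in the middle of the output."""
        for path in paths:
            with open(path, "rb") as f:
                data = f.read()
            if data.startswith(BOM):
                data = data[len(BOM):]
            out.write(data)

    # bom_aware_cat(["file1", "file2", "file3"])   # instead of: cat file1 file2 file3 > file4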

If you have a multifork filesystem, you can do tagging like this easily
since the real payload doesn't get mixed with the metadata.  But traditional
UNIX systems do not have multifork filesystems.





RE: UNICODE BOMBER STRIKES AGAIN

2002-04-24 Thread Yves Arrouye


 You can determine that that particular text is not legal UTF-32*,
 since there would be illegal code points in any of the three forms. IF you
 exclude null code points, again heuristically, that also excludes
 UTF-8, and almost all non-Unicode encodings. That leaves UTF-16, 16BE,
 16LE as the only remaining possibilities. So look at those:
 
 1. In UTF-16LE, the text is perfectly legal "Ken".
 2. In UTF-16BE or UTF-16, the text is the perfectly legal "䬀攀渀".
 
 Thus there are two legal interpretations of the text, if the only
 thing you know is that it is untagged. IF you have some additional
 information, such as that it could not be UTF-16LE, then you can limit
 it further.

Actually, I also think that without any external information about the
encoding except that it is some UTF-16, it *has to* be interpreted as being
most significant byte first. I agree that it could be either UTF-16LE or
UTF-16BE/UTF-16, but in the absence of any other information, at this point
in time, it is ruled by the text of 3.1 C3 of TUS 3.0 and the reader has no
choice but to declare it UTF-16.

Now what about auto-detection in relation to this conformance clause?
Readers that first try to be smart by auto-detecting encodings could of
course pick any of these as the 'auto-detected' one. Does that violate 3.1
C3's interpretation of bytes? I would say that as long as the auto-detector
is seen as a separate process/step, one can get away with it, since by the
time you look at the bytes to process the data, their encoding has been set
by the auto-detector.
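
A sketch of that separation in Python (hypothetical structure, limited to the
UTF-16 forms for brevity; nothing here is mandated by the standard):

    def detect(data):
        """Step 1: the auto-detector; its output is an explicit encoding label."""
        if data.startswith(b"\xff\xfe"):
            return "utf-16-le", 2          # skip the BOM when decoding
        if data.startswith(b"\xfe\xff"):
            return "utf-16-be", 2
        return "utf-16-be", 0              # untagged, no BOM: big-endian per C3

    def read_text(data):
        """Step 2: by the time the bytes are interpreted, the encoding is already set."""
        encoding, start = detect(data)
        return data[start:].decode(encoding)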

YA





L2 / UTC document register updated

2002-04-24 Thread Rick McGowan

The document register has been updated again...
http://www.unicode.org/L2/L-curdoc.htm

Several new documents:

L2/02-150 Status of Mapping between Characters of ISO 5426-2...
L2/02-151 Comparison of Characters of ISO 6861 and Those Proposed...
L2/02-152 Status of Mapping between Characters of ISO 8957 - Table 2...
L2/02-153 Status of Mapping between Characters of ISO 10574...
L2/02-154 Draft minutes of WG 2 meeting 41 (Singapore)
L2/02-155 Proposal to add 1 Hanja code of D P R of Korea... (1)
L2/02-156 Proposal to add 1 Hanja code of D P R of Korea... (2)
L2/02-157 Status in Myanmar on n2033
L2/02-158 WG2 - Pre-Meeting M42 Action Items List


Zipdocs for 121-140 are also available.

http://www.unicode.org/L2/Zipdocs/zipdocs.htm


Rick





Re: Variations of UTF-16 (was: Re: UNICODE BOMBER STRIKES AGAIN)

2002-04-24 Thread David Starner

On Wed, Apr 24, 2002 at 01:37:39PM -0400, [EMAIL PROTECTED] wrote:
 Err, no.  That's not the point, AFAIK.  The point is that traditionally
 in UNIX there hasn't been any sort of marker or tag in the beginning,
 UNIX files being flat streams of bytes.  The UNIX toolset has been built
 with this principle in mind.  No metadata in the files.  BOM breaks this.

Not at all true. Look at the head of a PNM file, a quintessentially Unix
file format. PNM, MP3 or PNG files all have metadata identifying them,
and don't break under Unix systems.
 
   wc -c file1
 
  would have to skip the BOM to avoid getting a wrong byte count.
 
   sort -o file5 file1
 
  would have to strip the BOM from file1 (but put it back into file5?)

The wrong byte count? wc -c file1 is basically meaningless on a Unicode
file, but at least you can assume it gives the _byte count_ (including
extraneous things like BOMs). 

More importantly, how do these programs handle newlines? wc -l counts the
number of \x0A's in the file; sort splits the file based on \x0A. This
will produce nothing of value on a UTF-16 file. They could be changed to
work with UTF-16, but they won't be, as UTF-8 works just fine.
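
A quick worked example (mine, not from the thread) of why a byte-oriented
line count goes wrong on UTF-16 but not on UTF-8:

    line = "\u010a\u010a\u010a\n"                     # three U+010A characters, one newline
    print(line.count("\n"))                           # 1 real line break
    print(line.encode("utf-8").count(b"\x0a"))        # 1: UTF-8 never reuses byte 0x0A
    print(line.encode("utf-16-le").count(b"\x0a"))    # 4: each U+010A contributes a stray 0x0A byte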

The point about file calling it data, not text, was just this: you can't
expect to throw UTF-16 through text tools and get a meaningful result.
That's why UTF-8 was created. The only sane thing to do with a UTF-16
file on Unix is to treat it as binary data, just like you would a
word-processor file. (Which are stunningly non-Unix, but coming
nonetheless. Probably for the best, though.)

-- 
David Starner - [EMAIL PROTECTED]
It's not a habit; it's cool; I feel alive. 
If you don't have it you're on the other side. 
- K's Choice (probably referring to the Internet)




Re: UNICODE BOMBER STRIKES AGAIN

2002-04-24 Thread Mark Davis

Unfortunately, the language in C3.1 is a bit archaic; it is referring
specifically to the UTF-16 encoding scheme. If you know you are
working with UTF-16, and you have no other information, then you do
have to use big-endian.

If, however, you only know that it is one of UTF-16BE, UTF-16LE, or
UTF-16 (plain), then there are more choices.

Similarly, if you know that the text is limited to one of UTF-32LE or
UTF-16LE, then you actually know that the text must be little-endian.

Mark
—

Γνῶθι σαυτόν — Θαλῆς
[For transliteration, see http://oss.software.ibm.com/cgi-bin/icu/tr]

http://www.macchiato.com

- Original Message -
From: Yves Arrouye [EMAIL PROTECTED]
To: 'Mark Davis' [EMAIL PROTECTED]; Doug Ewell
[EMAIL PROTECTED]; [EMAIL PROTECTED]
Cc: Kenneth Whistler [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Wednesday, April 24, 2002 10:39
Subject: RE: UNICODE BOMBER STRIKES AGAIN



 You can determine that that particular text is not legal UTF-32*,
 since there would be illegal code points in any of the three forms.
 IF you exclude null code points, again heuristically, that also
 excludes UTF-8, and almost all non-Unicode encodings. That leaves
 UTF-16, 16BE, 16LE as the only remaining possibilities. So look at
 those:

 1. In UTF-16LE, the text is perfectly legal "Ken".
 2. In UTF-16BE or UTF-16, the text is the perfectly legal "䬀攀渀".

 Thus there are two legal interpretations of the text, if the only
 thing you know is that it is untagged. IF you have some additional
 information, such as that it could not be UTF-16LE, then you can
 limit it further.

Actually, I also think that without any external information about the
encoding except that it is some UTF-16, it *has to* be interpreted as
being most significant byte first. I agree that it could be either
UTF-16LE or UTF-16BE/UTF-16, but in the absence of any other
information, at this point in time, it is ruled by the text of 3.1 C3
of TUS 3.0 and the reader has no choice but to declare it UTF-16.

Now what about auto-detection in relation to this conformance clause?
Readers that first try to be smart by auto-detecting encodings could
of course pick any of these as the 'auto-detected' one. Does that
violate 3.1 C3's interpretation of bytes? I would say that as long as
the auto-detector is seen as a separate process/step, one can get away
with it, since by the time you look at the bytes to process the data,
their encoding has been set by the auto-detector.

YA







Re: Variations of UTF-16 (was: Re: UNICODE BOMBER STRIKES AGAIN)

2002-04-24 Thread Jungshik Shin


On Wed, 24 Apr 2002, David Starner wrote:

 On Wed, Apr 24, 2002 at 09:00:17AM -0700, Doug Ewell wrote:
  The Unix and Linux world is very
  opposed to the use of BOM in plain-text files, and if they feel that way
  about UTF-8 they probably feel the same about UTF-16.

 The reason we're not so fond of  UTF-8 with BOM is that it 'breaks' a
lot of time-honored Unix command line text-processing tools. The simplest
example is concatenating multiple files with 'cat'. With BOM at the
beginning, the following doesn't work as intended.

  $ cat f1 f2 f3 f4 | sort | uniq | sed '' > f5

Sure, by typing a couple more commands (enclosing 'cat' in a 'for'
loop, for instance), we can work around that, but ...

 Why? The problems with a BOM in UTF-8 have to do with it being an
 ASCII-compatible encoding. (I'd guess that if there are any Unixes that
 use EBCDIC, the same problems would apply to UTF-EBCDIC.) Pretty much
 the only reason one would use UTF-16 is to be compatible with a foreign
 system, and then you use the conventions of that system.

 I  totally agree with you. We don't expect text tools
to work on files in UTF-16 the same way as we would expect them to work
on files in UTF-8 or other ASCII-compatible encodings.

  Jungshik Shin






Re: Variations of UTF-16 (was: Re: UNICODE BOMBER STRIKES AGAIN)

2002-04-24 Thread John Cowan

Doug Ewell scripsit:

 The Unix and Linux world is very
 opposed to the use of BOM in plain-text files, and if they feel that way
 about UTF-8 they probably feel the same about UTF-16.

I doubt it.  The trouble with BOMizing is that it makes ASCII not a
subset of UTF-8, but ASCII cannot be a subset of UTF-16 anyhow.
(I mean at the byte level, of course.)

-- 
John Cowan [EMAIL PROTECTED] http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_




Re: variations of UTF-16/UTF-32 and browsers' interpretation (was Re: browsers and unicode surrogates)

2002-04-24 Thread Tom Gewecke

Following is the result of Mac OS X and the OmniWeb browser on
http://jshin.net/i18n/utftest

5 cases of proper display:

+Y, BE, 16, UTF-16 and UTF-16BE
+Y, LE, 16, UTF-16
+N, BE, 16, UTF-16 and UTF-16BE

All the rest showed only the ASCII correctly.







Re: variations of UTF-16/UTF-32 and browsers' interpretation (was Re: browsers and unicode surrogates)

2002-04-24 Thread Michael Everson

At 01:42 +0100 2002-04-25, Michael Everson wrote:

On http://jshin.net/i18n/utftest/bom_utf16be.utf16.html under OS X 
you don't see just question marks, though -- you see the Last Resort 
font showing that Korean characters not present in the font are in 
the text. Awesome.

In OmniWeb at least. (Forgot to mention it.) OmniWeb does a very nice 
job on 
http://www.evertype.com/standards/iso15924/document/scriptbib.html by 
the way (I know I have to edit the Arabic still, and no, Omniweb 
doesn't order it properly in RTL though it does try to apply shaping 
behaviour.)
-- 
Michael Everson *** Everson Typography *** http://www.evertype.com




Re: Variations of UTF-16

2002-04-24 Thread David Starner

On Wed, Apr 24, 2002 at 05:12:43PM -0700, Jonathan Coxhead wrote:
But a BOM in every UTF-16 plain text file would make this completely 
 hopeless. If we ever think we might want to do UNIX-style text processing on 
 UTF-16, we have to resist that!

So the Unix people, because they might someday want to use UTF-16 plain
text (why? may as well go to UTF-32), should object to somebody else
using a BOM on a file format they actually use?

-- 
David Starner - [EMAIL PROTECTED]
It's not a habit; it's cool; I feel alive. 
If you don't have it you're on the other side. 
- K's Choice (probably referring to the Internet)




Re: Variations of UTF-16/UTF-32 and browsers interpretation

2002-04-24 Thread Tom Gewecke

At 01:42 +0100 2002-04-25, Michael Everson wrote:

 On http://jshin.net/i18n/utftest/bom_utf16be.utf16.html under OS X
 you don't see just question marks, though -- you see the Last Resort
 font showing that Korean characters not present in the font are in
 the text. Awesome.

Not sure which font is doing it (Code2000 perhaps), but I can see all of 
them.

 In OmniWeb at least. (Forgot to mention it.) OmniWeb does a very nice
 job on
 http://www.evertype.com/standards/iso15924/document/scriptbib.html

If you try it with Mozilla I think the Arabic will come out much better 
(but disable the font Code2000 first if you have it.)





Re: Variations of UTF-16

2002-04-24 Thread Shlomi Tal

{{ But a BOM in every UTF-16 plain text file would make this completely 
hopeless. If we ever think we might want to do UNIX-style text processing on 
UTF-16, we have to resist that! }}

If you're going to take the trouble of making text tools 16-bit aware, then 
you can afford to make them BOM-aware too.

type a.txt b.txt c.txt > d.txt

on Windows 2000, assuming that they are all UTF-16 (with an FFFE at the 
beginning of each, as is usual in MS-Windows Unicode files), strips every 
BOM except the first, so that d.txt has only the usual one initial FFFE. So 
it's not an immovable obstacle.

Concerning text files: nearly all of plain-text Unicode I've ever seen is in 
UTF-8. However, the ubiquitous MS-Office documents, from Office 2000 
onwards, are all in UTF-16 (little-endian, without BOM).

_
Join the world’s largest e-mail service with MSN Hotmail. 
http://www.hotmail.com