RE: Byte Order Marks

2001-04-20 Thread Yves Arrouye

  Then why is ICU mapping UTF-16 to UTF16_PlatformEndian and not
  UTF16_BigEndian?
 
 ICU does not do Unicode-signature or other encoding detection 
 as part of a converter. When you get text from some protocol, 
 you need to instantiate a converter according to what you 
 know about the encoding.

So I can't pass it some text with a BOM and say "utf-16" and let it run
through that. I guess that explains why I also didn't find converters that
write a BOM at the start of the conversion. Is that something that would be
added to ICU in the future? It would be very nice to have a converter that
would pick up the BOM (and write it back).

And yes, most of the time, when you stay on a given platform, it is very
convenient to use the platform's endianness. My question was rather "why
isn't UTF-16 the one that detects the BOM and defaults to an externalized
form, BE, and then people on a given platform would just use UTF-16PE (of
which UTF-16 is an alias in ICU)?". That would facilitate interchange of
information.

YA




RE: Byte Order Marks

2001-04-20 Thread Yves Arrouye


 On Thu, Apr 19, 2001 at 06:24:47PM -0700, Markus Scherer wrote:
  On the other hand, if you get a file from your platform and 
 it is in 16-bit Unicode, then you would appreciate the 
 convenience of the auto-endian alias.
 
 But nothing should be spitting out platform-endian UTF-16! In the
 case that there's a lot of unmarked big-endian UTF-16 around (as I
 understand the ISO-10646 standard recommends), then that assumption
 that everything emits unmarked platform-dependent UTF-16 will be
 wrong.

And for reference, on Windows, Unicode files are recognized because they
have a BOM. Write plain UTF-16LE w/o a BOM, and your file won't be
recognized properly. Manipulation of these files w/ ICU today is a bit
painful, since one needs to strip the BOM on input (if I understand Markus
correctly) and write a BOM at output. So these cannot be manipulated using
applications like uconv which blindly uses the raw converters.
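
The strip-on-input, write-on-output handling described above can be sketched in a few lines. This is only an illustration in Python, not ICU or uconv code; the function name rewrite_utf16_with_bom and the transform callback are hypothetical, and the sketch assumes the input really does start with a UTF-16 BOM.

```python
import codecs


def rewrite_utf16_with_bom(data: bytes, transform) -> bytes:
    """Strip the UTF-16 BOM, transform the text, and write the same BOM back.

    Hypothetical helper: `transform` is any str -> str function.
    """
    if data.startswith(codecs.BOM_UTF16_LE):
        bom, codec = codecs.BOM_UTF16_LE, "utf-16-le"
    elif data.startswith(codecs.BOM_UTF16_BE):
        bom, codec = codecs.BOM_UTF16_BE, "utf-16-be"
    else:
        raise ValueError("no UTF-16 BOM found")
    # Decode everything after the BOM with the byte order the BOM announced.
    text = data[len(bom):].decode(codec)
    # Re-encode in the same byte order and put the original BOM back in front.
    return bom + transform(text).encode(codec)
```

For example, rewrite_utf16_with_bom(data, str.upper) keeps a little-endian file little-endian and still marked.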

YA




Re: Byte Order Marks

2001-04-20 Thread Markus Scherer

Yves, we are thinking about a general API for encoding detection that could initially 
just check for BOM/Unicode signatures. I believe we have a feature request for this 
already. Mark and I just brainstormed about what we may want an API to look like.

The reason for doing what ICU is doing currently is simple pragmatism. None of our 
converters auto-detects anything, and they write only what you tell them to write.
When you deal with serialized data structures and fields in files or databases, that 
is exactly what you want.
With signature-carrying files and transmission protocols, there is more work necessary.

It seems to me that a converter API with its ability to take one byte at a time, and 
no other way to pass additional information ("I know the language of the text..."), is 
not the best way to implement this.

On output, writing a BOM/signature is easy: if you know you need one, then just call 
the converter once with U+FEFF.
Although, with this one feature, I could imagine having an API "emit a Unicode 
signature if you are a converter for a Unicode encoding".
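
A rough sketch of that idea, using Python's codecs rather than ICU: encoding U+FEFF once with a byte-order-specific Unicode encoding yields the signature bytes. The names UNICODE_ENCODINGS, signature_for and encode_with_signature are made up for illustration, and the set of encodings is an assumption, not an ICU list.

```python
# Assumed set of byte-order-specific Unicode encodings (Python codec names).
UNICODE_ENCODINGS = {"utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le"}


def signature_for(encoding: str) -> bytes:
    """Return the Unicode signature for `encoding`, or b"" if it has none."""
    if encoding.lower() in UNICODE_ENCODINGS:
        # "Call the converter once with U+FEFF".
        return "\ufeff".encode(encoding)
    return b""


def encode_with_signature(text: str, encoding: str) -> bytes:
    """Encode `text`, prefixed by a signature when the encoding is Unicode."""
    return signature_for(encoding) + text.encode(encoding)
```

Non-Unicode encodings such as latin-1 simply get no signature here.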

markus




Byte Order Marks

2001-04-19 Thread Tomas McGuinness

Hi,

A quick question relating to the Byte Order Mark of UCS-2. If it's absent, is
it safe to assume any particular order (i.e. big or little endian)?

I am writing a function to rearrange from Big to little endian but without a
byte order mark I'm not sure what the order is. Is there any
specification I could refer to?
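
For reference, the rearranging step itself is independent of knowing which order the input is in: it just swaps the two bytes of every 16-bit unit. A minimal sketch in Python (the function name swap_utf16_byte_order is only illustrative):

```python
def swap_utf16_byte_order(data: bytes) -> bytes:
    """Swap the two bytes of every 16-bit code unit (BE <-> LE)."""
    if len(data) % 2:
        raise ValueError("16-bit data must have an even number of bytes")
    swapped = bytearray(len(data))
    swapped[0::2] = data[1::2]  # second byte of each unit moves to the front
    swapped[1::2] = data[0::2]  # first byte of each unit moves to the back
    return bytes(swapped)
```

Applying it twice returns the original bytes.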

Thanks.

Tom

Tomas McGuinness   Consultant
 --
 
 University Technology Park*   +353 21 4933 277 
  Curraheen Rd, Cork  *+353 21 4933 201
 * [EMAIL PROTECTED]
 --
 
 CMG   Telecom Products Division
   Product Development, Cork 
 --
 
 
 
 




Re: Byte Order Marks

2001-04-19 Thread Markus Scherer

There is an RFC about UTF-16 (RFC 2781) that explains this:

If the text is labeled by the protocol as
charset=UTF-16 then the first two bytes are the byte order mark
charset=UTF-16BE then it is big-endian and the first two bytes are just text
charset=UTF-16LE then it is little-endian and the first two bytes are just text

If you don't have any clue about the byte order, but you know it is UTF-16, then 
assume BE.

Similar for UTF-32[BE/LE].

If you don't know anything about your text, then you may start some heuristics or 
reject the text...
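
The labeling rules above can be sketched as follows. This is an illustration using Python's standard codecs, not ICU; the function name decode_utf16 is hypothetical, and only the three UTF-16 labels are handled.

```python
import codecs


def decode_utf16(data: bytes, charset: str) -> str:
    """Decode UTF-16 bytes according to the protocol's charset label."""
    label = charset.lower()
    if label == "utf-16be":
        # Explicit byte order: a leading FE FF is text, not a signature.
        return data.decode("utf-16-be")
    if label == "utf-16le":
        return data.decode("utf-16-le")
    if label == "utf-16":
        if data.startswith(codecs.BOM_UTF16_BE):
            return data[2:].decode("utf-16-be")
        if data.startswith(codecs.BOM_UTF16_LE):
            return data[2:].decode("utf-16-le")
        # No clue about the byte order: assume big-endian.
        return data.decode("utf-16-be")
    raise ValueError("unsupported charset label: " + charset)
```

Note that with the UTF-16BE label the result keeps U+FEFF as its first character, since the first two bytes are "just text".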

markus

Tomas McGuinness wrote:
 A quick question relating to the Byte Order Mark of UCS-2. If it's absent, is
 it safe to assume any particular order (i.e. big or little endian)?




RE: Byte Order Marks

2001-04-19 Thread Yves Arrouye

 If you don't have any clue about the byte order, but you know it is
UTF-16, then assume BE.

Then why is ICU mapping UTF-16 to UTF16_PlatformEndian and not
UTF16_BigEndian? I know that was a difference between ICU and my library,
and when I asked this question a while ago I was told that despite what some
literature suggests, w/o any clue, platform endianness should be used.
That's contradictory.

YA




Fwd: Re: Byte Order Marks

2001-04-19 Thread Asmus Freytag


Date: Thu, 19 Apr 2001 12:59:43 -0700
To: Tomas McGuinness [EMAIL PROTECTED]
From: Asmus Freytag [EMAIL PROTECTED]
Subject: Re: Byte Order Marks

At 02:58 PM 4/19/01 +0200, you wrote:
If it's absent, is it safe to assume any particular order (i.e. big or
little endian)?


The default order is Big endian, but I wouldn't call that a 'safe' 
assumption. In the most general case I would attempt an autorecognition in 
the unlabelled case. Where a particular protocol's specification reinforces 
that the default order SHALL apply for the unlabelled case, the assumption 
becomes that much stronger, of course.
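
One very crude form of autorecognition, shown only as a sketch: it is not a compression-based method, just a count of zero bytes. It assumes the text contains a fair share of characters below U+0100, whose high byte is 0x00; the function name guess_utf16_byte_order is made up for illustration.

```python
def guess_utf16_byte_order(data: bytes):
    """Guess "BE" or "LE" from where the zero bytes fall, or None if unclear."""
    # In big-endian text the high (first) byte of each unit is often 0x00;
    # in little-endian text it is the second byte.
    even_zeros = data[0::2].count(0)
    odd_zeros = data[1::2].count(0)
    if even_zeros > odd_zeros:
        return "BE"
    if odd_zeros > even_zeros:
        return "LE"
    return None
```

For text with no such characters (pure Han ideographs, for instance) the counts give no usable signal and the function returns None.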

A./

PS: as an aside: the SCSU encoder can be used to do this form of 
autorecognition. If text shows much better compression in one byte order 
than the other, that byte order is overwhelmingly likely to be the true 
one. The exception would be strings of pure Han ideographs. For these it's 
necessary to





Re: Byte Order Marks

2001-04-19 Thread Markus Scherer

Yves Arrouye wrote:
  If you don't have any clue about the byte order, but you know it is
 UTF-16, then assume BE.

 Then why is ICU mapping UTF-16 to UTF16_PlatformEndian and not
 UTF16_BigEndian?

ICU does not do Unicode-signature or other encoding detection as part of a converter. 
When you get text from some protocol, you need to instantiate a converter according to 
what you know about the encoding.

Note that guessing big-endian is only the last, desperate part of detecting the 
encoding. It is not the first choice. If the text is properly tagged (including maybe 
a signature), then you will never have to open a "UTF-16" converter.

On the other hand, if you get a file from your platform and it is in 16-bit Unicode, 
then you would appreciate the convenience of the auto-endian alias.

markus




Re: Byte Order Marks

2001-04-19 Thread David Starner

On Thu, Apr 19, 2001 at 06:24:47PM -0700, Markus Scherer wrote:
 On the other hand, if you get a file from your platform and it is in 16-bit Unicode, 
then you would appreciate the convenience of the auto-endian alias.

But nothing should be spitting out platform-endian UTF-16! In the
case that there's a lot of unmarked big-endian UTF-16 around (as I
understand the ISO-10646 standard recommends), then that assumption
that everything emits unmarked platform-dependent UTF-16 will be
wrong. (It's never right to have a program emit
platform-dependent-endian UTF-16 except in the case of system-local
cache files. That breaks interoperating between your program on
different systems.)

-- 
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
"I don't care if Bill personally has my name and reads my email and 
laughs at me. In fact, I'd be rather honored." - Joseph_Greg




Byte Order Marks

2001-04-10 Thread Tomas McGuinness

Hi,

When looking at a document would it be safe to assume that if you found any
of the following Byte Order Marks 
*   0xFFFE (UCS-2 Little Endian)
*   0xFEFE (UCS-2 Big Endian)
*   0xEFBBBF (UTF-8)
that the document is encoded in that encoding format? That is, if I found
the first 3 octets to be EF BB BF, could I assume I am dealing with a
UTF-8 document?

Apart from UTF and Unicode/UCS encoding formats do any other "legacy"
character sets use Byte Order Marks?

Regards,

Tom.





Re: Byte Order Marks

2001-04-10 Thread DougEwell2

In a message dated 2001-04-10 3:04:09 Pacific Daylight Time, 
[EMAIL PROTECTED] writes:

  When looking at a document would it be safe to assume that if you found any
  of the following Byte Order Marks 
  *0xFFFE (UCS-2 Little Endian)
  *0xFEFE (UCS-2 Big Endian)

should be 0xFEFF

  *0xEFBBBF (UTF-8)
  That the document is encoded with that encoding format. That means that if I
  found the first 3 octets to be EF BB BF could I assume I am dealing with a
  UTF-8 Document.

That is usually a safe assumption and a good practice, except that if the 
first two bytes are 0xFF 0xFE, you should check the next two to see if they 
are 0x00 0x00 (which would signify little-endian UCS-4).
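
A sketch of such a check in Python, testing the 4-byte signatures before the 2-byte ones so that FF FE 00 00 is not mistaken for a UTF-16 BOM. The names SIGNATURES and sniff_signature are illustrative; note that FF FE 00 00 could in principle also be little-endian UTF-16 starting with U+0000, which this sketch does not try to resolve.

```python
import codecs

# Longest signatures first: FF FE 00 00 must be tried before FF FE.
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),  # FF FE 00 00 (little-endian UCS-4)
    (codecs.BOM_UTF32_BE, "UTF-32BE"),  # 00 00 FE FF
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
]


def sniff_signature(data: bytes):
    """Return (encoding name, signature length), or None if no signature."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name, len(signature)
    return None
```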

Also, think in terms of UTF-16, not UCS-2.

  Apart from UTF and Unicode/UCS encoding formats do any other "legacy"
  character sets use Byte Order Marks?

Good question.  I have not heard of any.

To follow up, what about signatures that are not necessarily byte order 
marks?  UTF-8 does not need a BOM, so the signature 0xEF 0xBB 0xBF is useful 
for the purpose Tomas mentioned, to indicate the encoding.  Do any other 
character sets have such signatures?

-Doug Ewell
 Fullerton, California