RE: Names for UTF-8 with and without BOM - pragmatic

2002-11-20 Thread Joseph Boyle
Working in a large organization whose product includes a large number of configuration 
and data files in text formats, I can say something about what we have found to work 
during development, localization, and release engineering, across multiple platforms.

We have eliminated UTF-16 text file formats in favor of UTF-8 because the standard Unix 
toolkit and other Unix-based tools deal poorly with UTF-16. On the other 
hand, the BOM on UTF-8 has been useful and has not caused problems with Unix tools' 
processing, including pipe sequences. Raw concatenation of files, which would produce 
internal ZWNBSPs, is not part of any of our processing as far as I know.
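The tolerance described above can be sketched in a few lines (a Python illustration, not part of the original toolchain; Python's "utf-8-sig" codec decodes UTF-8 with or without a leading signature):

```python
# Illustrative sketch: the "utf-8-sig" codec accepts UTF-8 input with or
# without a signature and strips it if present, which is why a leading BOM
# tends to pass harmlessly through text-processing pipelines.
import codecs

signed = codecs.BOM_UTF8 + "key=value\n".encode("utf-8")
unsigned = "key=value\n".encode("utf-8")

# Both decode to the same text; the signature never reaches the consumer.
assert signed.decode("utf-8-sig") == "key=value\n"
assert unsigned.decode("utf-8-sig") == "key=value\n"
```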

-----Original Message-----
From: David Starner [mailto:[EMAIL PROTECTED]] 
Sent: Thursday, November 07, 2002 12:14 PM
To: Markus Scherer
Cc: unicode
Subject: Re: Names for UTF-8 with and without BOM - pragmatic


On Wed, Nov 06, 2002 at 09:47:43AM -0800, Markus Scherer wrote:
> The fact is that Windows uses UTF-8 and UTF-16 plain text files with 
> signatures (BOMs) very simply, gracefully, and successfully. It has 
> applied what I called the "pragmatic" approach here for about 10 
> years. It just works.

It just works in an environment where relatively few documents are plain text, and 
that doesn’t use pipes of text as universal glue. C has been described as a 
(C)haracter processing language; whether or not that’s accurate, Awk and Perl 
certainly are; these are all Unix programming languages, and at the heart of what Unix 
is. The simple Unix program has a stream of text coming in and a stream of text going 
out, whereas the simple Windows program has a window. What works for Windows may very 
well not work for Unix. 

-- 
David Starner - [EMAIL PROTECTED]
Great is the battle-god, great, and his kingdom--
A field where a thousand corpses lie. 
  -- Stephen Crane, "War is Kind"





Re: Names for UTF-8 with and without BOM - pragmatic

2002-11-07 Thread David Starner
On Wed, Nov 06, 2002 at 09:47:43AM -0800, Markus Scherer wrote:
> The fact is that Windows uses UTF-8 and UTF-16 plain text files with 
> signatures (BOMs) very simply, gracefully, and successfully. It has applied 
> what I called the "pragmatic" approach here for about 10 years. It just 
> works.

It just works in an environment where relatively few documents are plain
text, and that doesn’t use pipes of text as universal glue. C has been
described as a (C)haracter processing language; whether or not that’s
accurate, Awk and Perl certainly are; these are all Unix programming
languages, and at the heart of what Unix is. The simple Unix program has
a stream of text coming in and a stream of text going out, whereas the
simple Windows program has a window. What works for Windows may very
well not work for Unix. 

-- 
David Starner - [EMAIL PROTECTED]
Great is the battle-god, great, and his kingdom--
A field where a thousand corpses lie. 
  -- Stephen Crane, "War is Kind"




RE: Names for UTF-8 with and without BOM - pragmatic

2002-11-07 Thread Kent Karlsson

> Initial for each piece, as each is assumed to be a complete 
> text file before concatenation. Nothing 
> prevents copy/cp/cat and other commands from recognizing 
> Unicode signatures, for as long as they 
> don't claim to preserve initial U+FEFF.

Yes, there is, in a formal sense, for cat and cp.  See
http://www.opengroup.org/onlinepubs/007904975/utilities/cat.html
which states "The standard output shall contain the sequence of
*bytes* read from the input files. Nothing else shall be written
to the standard output." (my emphasis) and
http://www.opengroup.org/onlinepubs/007904975/utilities/cp.html
which is not so explicit, but silently assumes that copying
does not change the bytes of the file content in any way.

cat and copy/cp are very agnostic programs. They just copy
(or concatenate) the byte strings, regardless of whether the
content is pictures, sound, or text.  So 'cat' can "meaningfully"
concatenate only text files of the *same* encoding serialisation,
*without* BOM/signature, and properly terminated (in the case of
stateful serialisations).  Trying to get 'cat' to do more than
that for text files would be just as bad as trying to get 'cat'
to join (in some "useful" way) picture files (of possibly
different formats) or sound or video files. Don't expect cat to
catenate such file types, when each is "complete", into a useful
result. 'cat' is *supposed* to be simple, and just string byte
sequences together.  If you want something more, use another
program that does the "more" you're looking for (or write one);
that program is not the Unix/Linux utility 'cat', nor cp.
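The byte-transparency contract quoted above can be sketched as follows (Python used purely for illustration; `cat_bytes` is a made-up name, not a real utility):

```python
# Sketch of cat's contract as quoted from the spec: the output is exactly
# the sequence of bytes read from the inputs -- nothing added, nothing
# stripped. A signature at the start of a later input therefore survives
# as an interior U+FEFF in the result.
import codecs

def cat_bytes(*chunks: bytes) -> bytes:
    # The whole contract: the byte strings, verbatim, in order.
    return b"".join(chunks)

first = codecs.BOM_UTF8 + "first\n".encode("utf-8")
second = codecs.BOM_UTF8 + "second\n".encode("utf-8")
joined = cat_bytes(first, second)

# The second file's signature is now embedded mid-stream.
assert codecs.BOM_UTF8 in joined[3:]
```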

/Kent K






Re: Names for UTF-8 with and without BOM - pragmatic

2002-11-06 Thread Markus Scherer
Lars Kristan wrote:

> Markus Scherer wrote:
>
> > If software claims that it does not modify the contents of a
> > document *except* for initial U+FEFF
> > then it can do with initial U+FEFF what it wants. If the
> > whole discussion hinges on what is allowed
> > if software claims to not modify text then one need
> > not claim that so absolutely.
>
> That seems pretty straightforward, but only as long as your "software" is an
> editor and your "document" is a single file. How about a case where
> "software" is a copy or cat command, and instead of a document you have
> several (plain?) text files that you concat? What does "initial" mean here?


Initial for each piece, as each is assumed to be a complete text file before concatenation. Nothing 
prevents copy/cp/cat and other commands from recognizing Unicode signatures, as long as they don't 
claim to preserve initial U+FEFF.

> What happens next is: some software lets an initial BOM get through and
> appends such string to a file or a stream. If other software treats it as a
> character, the data has been modified. On the other hand, if we want to
> allow software to disregard BOMs in the middle of character streams then we
> have another set of security issues. And not removing is equally bad because
> of many consequences (in the end, we could end up with every character being
> preceded by a BOM).


All true, and all well known, and the reason why the UTC and WG2 added U+2060 Word Joiner. This 
becomes less of an issue if and when they decide to remove/deprecate the ZWNBSP semantics from U+FEFF.

However, in a situation where you cannot be sure about the intended purpose of an initial U+FEFF, I 
don't think that the "pragmatic" approach is any less safe than any other, while it increases usability.

> > .txt	UTF-8	require	We want plain text files to
> > 			have BOM to distinguish
> > 			from legacy codepage files
>
> Hmmm, what does "plain" mean?! ...

Your response to this takes it out of context. I am not trying to prescribe general semantics of 
.txt plain text files.

If you read the thread carefully, you will see that I am just taking the file checker configuration 
file from Joseph Boyle and suggesting a modification to its format that makes it not rely on having 
charset names that indicate any particular BOM handling. I am sorry to not have made this clearer.

> True, UTF-16 files do need a signature. Well, we just need to abandon the
> idea that UTF-16 can be used for plain text files. Let's have plain text
> files in UTF-8. Look at it as the most universal code page. Plain text files
> never contained information about the code page, why would there be such
> information in UTF-8 plain text files?!


UTF-16 files do not *need* a signature per se. However, it is very useful to prepend Unicode plain 
text *files* with Unicode signatures so that tools have a chance to figure out if those files are in 
Unicode at all - and which Unicode charset - or in some legacy charset. With "plain text files" I 
mean plain text documents without any markup or other meta information.

The fact is that Windows uses UTF-8 and UTF-16 plain text files with signatures (BOMs) very simply, 
gracefully, and successfully. It has applied what I called the "pragmatic" approach here for about 
10 years. It just works.
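The signature sniffing described here can be sketched as follows (a Python illustration; the function name is hypothetical, and longer signatures must be tested before shorter ones because BOM_UTF32_LE begins with the same bytes as BOM_UTF16_LE):

```python
# Sketch of Unicode signature detection as described above. Order matters:
# the UTF-32 LE signature FF FE 00 00 starts with the UTF-16 LE signature
# FF FE, so the longer signatures are checked first.
import codecs

SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_signature(data: bytes):
    """Return (encoding, signature length), or (None, 0) if unsigned."""
    for bom, name in SIGNATURES:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0

assert sniff_signature(b"\xef\xbb\xbfhello") == ("utf-8", 3)
assert sniff_signature(b"\xff\xfe\x00\x00rest") == ("utf-32-le", 4)
assert sniff_signature(b"hello") == (None, 0)
```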

markus

--
Opinions expressed here may not reflect my company's positions unless otherwise noted.




RE: Names for UTF-8 with and without BOM - pragmatic

2002-11-06 Thread Kent Karlsson

> True, UTF-16 files do need a signature. 

Eh, no!  "UTF-16BE" and "UTF-16LE" files (or whatever kind of text
data element) do not have any signature/BOM. Not even files (somehow)
labelled "UTF-16" need have a signature/BOM; without a BOM they are
the same as if they were labelled "UTF-16BE".  (Formally, XML
"requires" a BOM for UTF-16 XML documents, but then goes on to
exemplify that it is not needed for XML documents...)

I do agree, however, that the idea of having a BOM/signature at
the beginning of a file (or other text data element) is a bad one.

/Kent K





RE: Names for UTF-8 with and without BOM - pragmatic

2002-11-06 Thread Marco Cimarosti
Lars Kristan wrote:
> > .txt	UTF-8	require	We want plain text files to
> > have BOM to distinguish
> > from legacy codepage files
> 
> Hmmm, what does "plain" mean?! Perhaps files with a BOM 
> should be called "text" files (or .txt files;) as
> opposed to "plain text" files, which in my opinion should
> be just that - _plain_ text. No ASCII plain text file had
> an ASCII signature. I believe "plain text" should be
> something that will be as easy to use (and handle) as
> ASCII plain text files were.

"Plain" per se means nothing, in this context. The term "plain text", in
Unicode jargon, means the opposite of "rich text".

"Rich text" (or "fancy text") is another Unicode jargon term, meaning text
containing *mark-up*, such as HTML, XML, RTF, troff, TeX, proprietary
word-processor formats, etc.

Unicode text not containing mark-up is called "plain text", regardless of
the fact that it might be quite "complicated" by the presence of BOM's, bidi
controls, etc.

_ Marco




RE: Names for UTF-8 with and without BOM - pragmatic

2002-11-06 Thread Lars Kristan
Markus Scherer wrote:
> If software claims that it does not modify the contents of a 
> document *except* for initial U+FEFF 
> then it can do with initial U+FEFF what it wants. If the 
> whole discussion hinges on what is allowed 
> if software claims to not modify text then one need 
> not claim that so absolutely.

That seems pretty straightforward, but only as long as your "software" is an
editor and your "document" is a single file. How about a case where
"software" is a copy or cat command, and instead of a document you have
several (plain?) text files that you concat? What does "initial" mean here?

What happens next is: some software lets an initial BOM get through and
appends such string to a file or a stream. If other software treats it as a
character, the data has been modified. On the other hand, if we want to
allow software to disregard BOMs in the middle of character streams then we
have another set of security issues. And not removing is equally bad because
of many consequences (in the end, we could end up with every character being
preceded by a BOM).

> .txt  UTF-8   require We want plain text files to
>   have BOM to distinguish
>   from legacy codepage files

Hmmm, what does "plain" mean?! Perhaps files with a BOM should be called
"text" files (or .txt files;) as opposed to "plain text" files, which in my
opinion should be just that - _plain_ text. No ASCII plain text file had an
ASCII signature. I believe "plain text" should be something that will be as
easy to use (and handle) as ASCII plain text files were.

True, UTF-16 files do need a signature. Well, we just need to abandon the
idea that UTF-16 can be used for plain text files. Let's have plain text
files in UTF-8. Look at it as the most universal code page. Plain text files
never contained information about the code page, why would there be such
information in UTF-8 plain text files?!

How about this:
* BOM makes a file stateful.
* Plain text should NOT be stateful (or, we should make it as stateless as
possible)
* If a text file is stateful, it is no longer a "plain text file", it
becomes a "text document".

BTW, since I may be tempted to process text documents with plain text tools,
I would rather see that the text documents would NOT have the BOM (yes, that
effectively makes them plain text files). Since it seems that many people
will insist that they want the option to have the BOM in text documents, it
seems that it will need to be allowed. But I would not make it "required".


Lars Kristan




Re: Names for UTF-8 with and without BOM - pragmatic

2002-11-05 Thread Markus Scherer
Mark Davis wrote:

> Little probability that right double quote would appear at the start of a
> document either. Doesn't mean that you are free to delete it (*and* say that
> you are not modifying the contents).


This points to a pragmatic way to deal with this issue:

If software claims that it does not modify the contents of a document *except* for initial U+FEFF 
then it can do with initial U+FEFF what it wants. If the whole discussion hinges on what is allowed 
if software claims to not modify text then one need not claim that so absolutely.

Similarly, software may claim to not modify text contents _except_ that it may transform line 
endings into LS or any other convention.

Not all software claims to not modify text, nor needs to claim that, and a lot of software does 
modify text.

> I agree that when the UTC decides that a BOM is *only* to be used as a
> signature, and that it would be ok to delete it anywhere in a document (like
> a non-character), then we are in much better shape. This was, as a matter of
> fact proposed for 3.2, but not approved. If we did that for 4.0, then there
> would be much less reason to distinguish UTF-8 'withBOM' from UTF-8
> 'withoutBOM'.


This would be good. The above would still be useful.

Joseph's request is actually different from the discussion of what is "the right thing": he mostly 
wants labels that distinguish between different things to be done. If there is no consensus for such 
labels here, then Joseph may need to use selectors in his configuration file that are separate from 
charset labels.

For example:

Type	charset	BOM	Comment
.txt	UTF-8	require	We want plain text files to
			have BOM to distinguish
			from legacy codepage files
.xml	UTF-8	forbid	Some XML processors may not cope with BOM
.htm	UTF-8	maybe	We want HTML to be UTF-8 but
			will not insist on BOM
.rc	not UTF	n/a	Unfortunately compiler insists on
			these being codepage.
.rc	UTF-16	require	Alternative to the previous line.
.swt	ASCII	n/a	Nonlocalizable internal format, must be ASCII.
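Selectors like these could drive the checker roughly as follows (an illustrative Python sketch; the policy table and function name are mine, taken from the example rows above, not from Joseph's actual tool):

```python
# Sketch of a file checker driven by per-extension (charset, BOM-policy)
# selectors like the table above. BOM policy is "require", "forbid", or
# "maybe", and is kept separate from the charset label itself.
import codecs

POLICY = {
    ".txt": ("utf-8", "require"),
    ".xml": ("utf-8", "forbid"),
    ".htm": ("utf-8", "maybe"),
}

def check_bom(ext: str, data: bytes) -> bool:
    _charset, policy = POLICY[ext]
    has_bom = data.startswith(codecs.BOM_UTF8)
    if policy == "require":
        return has_bom
    if policy == "forbid":
        return not has_bom
    return True  # "maybe": either form is acceptable

assert check_bom(".txt", codecs.BOM_UTF8 + b"hello")
assert check_bom(".xml", b"<doc/>")
assert not check_bom(".xml", codecs.BOM_UTF8 + b"<doc/>")
```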

markus

--
Opinions expressed here may not reflect my company's positions unless otherwise noted.




Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Mark Davis
> So even if it were in there, who cares? I mean, can anyone explain why it
> would make a difference?

I personally wouldn't care if every instance of "Michael Kaplan" at the
start of a file were deleted. Not the point.

The actual point is that currently, as defined -- not as you would wish it to
be -- FEFF is an actual character, and in circumstances where it is not
clearly defined for use as a BOM, it cannot be removed without altering the
content of the text.

As I said in another message, the UTC could change this situation by
completely deprecating the use of FEFF as anything but BOM. But it hasn't
done it yet.

Mark
__
http://www.macchiato.com
►  “Eppur si muove” ◄

- Original Message -
From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>; "Unicode Mailing List"
<[EMAIL PROTECTED]>
Sent: Sunday, November 03, 2002 13:02
Subject: Re: Names for UTF-8 with and without BOM


> From: "Mark Davis" <[EMAIL PROTECTED]>
>
> Ironic that for the purpose of dealing with THREE bytes that so many bytes
> are being wasted. :-)
>
> > Little probability that right double quote would appear at the start of a
> > document either. Doesn't mean that you are free to delete it (*and* say that
> > you are not modifying the contents).
>
> Interesting strawman there, Mark -- but there is a huge difference there.
> But even if we leave in the notion of it as a character and just deprecate
> its usage and people ignore that, then we are talking about a ZERO WIDTH NO
> BREAK SPACE. This character has the job of:
>
> 1) being invisible
> 2) not breaking text with it
>
> So even if it were in there, who cares? I mean, can anyone explain why it
> would make a difference?
>
> The one thing that no one has ever come up with is a reasonable case where
> it would be at the beginning of the document *yet* it was not a BOM.
>
> So we have a clear semantic for it at the beginning of a file -- its a BOM.
> Period.
>
> If there is a higher level protocol as well and the protocol and the BOM
> both match, then that is great! Considering how much redundancy there is in
> the Unicode standard about some definitions, a redundant marker for a file
> seems a very trivial issue.
>
> If there is a higher level protocol as well and they do not match, then we
> are in fantasy land bizarro world, inventing edge cases because we have
> nothing better to do. :-)  But for the sake of argument, lets pretend its a
> real scenario -- in which case we treat it the same way as if your higher
> level protocol claims its ISO-8859-1 and the BOM says its UTF-32. Its an
> error.
>
> Problem solved!
>
> > I agree that when the UTC decides that a BOM is *only* to be used as a
> > signature, and that it would be ok to delete it anywhere in a document (like
> > a non-character), then we are in much better shape. This was, as a matter of
> > fact proposed for 3.2, but not approved. If we did that for 4.0, then there
> > would be much less reason to distinguish UTF-8 'withBOM' from UTF-8
> > 'withoutBOM'.
>
> There is no reason to worry about this case and no need to delete anything.
> This is a ZERO WIDTH NO BREAK SPACE we are talking about. The burden is on
> the people who think this is a scenario to bring proof that anyone is doing
> anything as unrealistic as this.
>
> There is an easy, clear, and unambigous plan that can be used here which
> will always work. For ones lets not opt to complicate it without reason.
>
> MichKa





Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Mark Davis


Mark
__
http://www.macchiato.com
►  “Eppur si muove” ◄

- Original Message -
From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>; "Unicode Mailing List"
<[EMAIL PROTECTED]>
Sent: Sunday, November 03, 2002 13:02
Subject: Re: Names for UTF-8 with and without BOM


> From: "Mark Davis" <[EMAIL PROTECTED]>
>
> Ironic that for the purpose of dealing with THREE bytes that so many bytes
> are being wasted. :-)
>
> > Little probability that right double quote would appear at the start of a
> > document either. Doesn't mean that you are free to delete it (*and* say that
> > you are not modifying the contents).
>
> Interesting strawman there, Mark -- but there is a huge difference there.
> But even if we leave in the notion of it as a character and just deprecate
> its usage and people ignore that, then we are talking about a ZERO WIDTH NO
> BREAK SPACE. This character has the job of:
>
> 1) being invisible
> 2) not breaking text with it
>
> So even if it were in there, who cares? I mean, can anyone explain why it
> would make a difference?
>
> The one thing that no one has ever come up with is a reasonable case where
> it would be at the beginning of the document *yet* it was not a BOM.
>
> So we have a clear semantic for it at the beginning of a file -- its a BOM.
> Period.
>
> If there is a higher level protocol as well and the protocol and the BOM
> both match, then that is great! Considering how much redundancy there is in
> the Unicode standard about some definitions, a redundant marker for a file
> seems a very trivial issue.
>
> If there is a higher level protocol as well and they do not match, then we
> are in fantasy land bizarro world, inventing edge cases because we have
> nothing better to do. :-)  But for the sake of argument, lets pretend its a
> real scenario -- in which case we treat it the same way as if your higher
> level protocol claims its ISO-8859-1 and the BOM says its UTF-32. Its an
> error.
>
> Problem solved!
>
> > I agree that when the UTC decides that a BOM is *only* to be used as a
> > signature, and that it would be ok to delete it anywhere in a document (like
> > a non-character), then we are in much better shape. This was, as a matter of
> > fact proposed for 3.2, but not approved. If we did that for 4.0, then there
> > would be much less reason to distinguish UTF-8 'withBOM' from UTF-8
> > 'withoutBOM'.
>
> There is no reason to worry about this case and no need to delete anything.
> This is a ZERO WIDTH NO BREAK SPACE we are talking about. The burden is on
> the people who think this is a scenario to bring proof that anyone is doing
> anything as unrealistic as this.
>
> There is an easy, clear, and unambigous plan that can be used here which
> will always work. For ones lets not opt to complicate it without reason.
>
> MichKa







Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Doug Ewell
Mark Davis  wrote:

> Little probability that right double quote would appear at the start
> of a document either. Doesn't mean that you are free to delete it
> (*and* say that you are not modifying the contents).

True, but right double quote:

(a) has a visible glyph with a well-defined human-readable meaning,
(b) isn't defined by Unicode as having a text-processing influence on
adjoining characters (leaving the question wide open of what to do when
there are fewer than two adjoining characters),
(c) doesn't have a second meaning as a signature that under certain
conditions can be stripped.

> I agree that when the UTC decides that a BOM is *only* to be used as a
> signature, and that it would be ok to delete it anywhere in a document
> (like a non-character), then we are in much better shape. This was, as
> a matter of fact proposed for 3.2, but not approved. If we did that
> for 4.0, then there would be much less reason to distinguish UTF-8
> 'withBOM' from UTF-8 'withoutBOM'.

Every one of us will be grateful when that day comes.

-Doug Ewell
 Fullerton, California





Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Michael \(michka\) Kaplan
From: "Mark Davis" <[EMAIL PROTECTED]>

Ironic that for the purpose of dealing with THREE bytes that so many bytes
are being wasted. :-)

> Little probability that right double quote would appear at the start of a
> document either. Doesn't mean that you are free to delete it (*and* say that
> you are not modifying the contents).

Interesting strawman there, Mark -- but there is a huge difference there.
But even if we leave in the notion of it as a character and just deprecate
its usage and people ignore that, then we are talking about a ZERO WIDTH NO
BREAK SPACE. This character has the job of:

1) being invisible
2) not breaking text with it

So even if it were in there, who cares? I mean, can anyone explain why it
would make a difference?

The one thing that no one has ever come up with is a reasonable case where
it would be at the beginning of the document *yet* it was not a BOM.

So we have a clear semantic for it at the beginning of a file -- it's a BOM.
Period.

If there is a higher level protocol as well and the protocol and the BOM
both match, then that is great! Considering how much redundancy there is in
the Unicode standard about some definitions, a redundant marker for a file
seems a very trivial issue.

If there is a higher level protocol as well and they do not match, then we
are in fantasy land bizarro world, inventing edge cases because we have
nothing better to do. :-)  But for the sake of argument, let's pretend it's a
real scenario -- in which case we treat it the same way as if your higher
level protocol claims it's ISO-8859-1 and the BOM says it's UTF-32. It's an
error.

Problem solved!

> I agree that when the UTC decides that a BOM is *only* to be used as a
> signature, and that it would be ok to delete it anywhere in a document (like
> a non-character), then we are in much better shape. This was, as a matter of
> fact proposed for 3.2, but not approved. If we did that for 4.0, then there
> would be much less reason to distinguish UTF-8 'withBOM' from UTF-8
> 'withoutBOM'.

There is no reason to worry about this case and no need to delete anything.
This is a ZERO WIDTH NO BREAK SPACE we are talking about. The burden is on
the people who think this is a scenario to bring proof that anyone is doing
anything as unrealistic as this.

There is an easy, clear, and unambiguous plan that can be used here which
will always work. For once let's not opt to complicate it without reason.

MichKa





Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Mark Davis
I don't know what you are trying to say. Perhaps you could explain it at the
meeting next week.

Mark
__
http://www.macchiato.com
►  “Eppur si muove” ◄

- Original Message -
From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>; "Murray Sargent"
<[EMAIL PROTECTED]>; "Joseph Boyle" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Saturday, November 02, 2002 04:18
Subject: Re: Names for UTF-8 with and without BOM


> From: "Mark Davis" <[EMAIL PROTECTED]>
>
> > That is not sufficient. The first three bytes could represent a real content
> > character, ZWNBSP or they could be a BOM. The label doesn't tell you.
>
> There are several problems with this supposition -- most notably the fact
> that there are cases that specifically claim this is not recommended and
> that U+2060 is prefered?
>
> > This is similar to UTF-16 CES vs UTF-16BE CES. In the first case, 0xFE 0xFF
> > represents a BOM, and is not part of the content. In the second case, it
> > does *not* represent a BOM -- it represents a ZWNBSP, and must not be
> > stripped. The difference here is that the encoding name tells you exactly
> > what the situation is.
>
> I do not see this as a realistic scenario.  I would argue that if the BOM
> matches the encoding scheme, perhaps this was an intentional effort to make
> sure that applications which may not understand the higher level protocol
> can also see what the encoding scheme is.
>
> But even if we assume that someone has gone to the trouble of calling
> something UTF16BE and has 0xFE 0xFF at the beginning of the file. What kind
> of content *is* such a code point that this is even worth calling out as a
> special case?
>
> If the goal is to clear and unambiguous text then the best way would to
> simplify ALL of this. It was previously decided to always call it a BOM, why
> not stick with that?
>
> MichKa





Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Mark Davis
Little probability that right double quote would appear at the start of a
document either. Doesn't mean that you are free to delete it (*and* say that
you are not modifying the contents).

I agree that when the UTC decides that a BOM is *only* to be used as a
signature, and that it would be ok to delete it anywhere in a document (like
a non-character), then we are in much better shape. This was, as a matter of
fact proposed for 3.2, but not approved. If we did that for 4.0, then there
would be much less reason to distinguish UTF-8 'withBOM' from UTF-8
'withoutBOM'.

Mark
__
http://www.macchiato.com
►  “Eppur si muove” ◄

- Original Message -
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: "Unicode Mailing List" <[EMAIL PROTECTED]>
Cc: "Mark Davis" <[EMAIL PROTECTED]>; "Murray Sargent"
<[EMAIL PROTECTED]>; "Joseph Boyle" <[EMAIL PROTECTED]>
Sent: Saturday, November 02, 2002 13:27
Subject: Re: Names for UTF-8 with and without BOM


> Mark Davis  wrote:
>
> > That is not sufficient. The first three bytes could represent a real
> > content character, ZWNBSP or they could be a BOM. The label doesn't
> > tell you.
>
> I have never understood under what circumstances a ZWNBSP would ever
> appear as the first character of a file.  It wouldn't make any sense.  A
> ZWNBSP prevents a word break between the preceding and following
> characters.  If there *is* no preceding character, then what is the
> point of the ZWNBSP?
>
> Every time this topic comes up, I have asked why a true ZWNBSP would
> ever appear as the first character of a file.  The only responses I've
> heard are:
>
> 1.  It might not be a discrete file, but the second (or successive)
> piece of a file that was split up for some reason (transmission, etc.).
>
> In that case, the interpreting process should take its encoding cue from
> the first fragment, and should NEVER reinterpret fragments broken up at
> arbitrary points.  (Imagine a process modifying a GIF or JPEG file, or
> converting CR/LF, based on fragments!)  But this is not the point being
> discussed anyway; the point is whole files.
>
> 2.  It could happen; Unicode allows any character to appear anywhere.
>
> Well, almost anywhere.  But even so, the likelihood of a U+FEFF as
> ZWNBSP appearing at the start of an unsigned UTF-8 file is vanishingly
> small compared to the likelihood that the U+FEFF was intended to be a
> signature.  The rare case is just too rare to invalidate the heuristic
> for the much more common case.
>
> In addition, as Michka points out, we now have U+2060 WORD JOINER, whose
> entire purpose in life is to be used as U+FEFF was formerly used, as a
> ZWNBSP.  Any new Unicode text should use U+2060 and not U+FEFF as a word
> joiner.  It's hard to imagine that UTC and WG2 would have standardized
> this if there was a lot of real-world text that used U+FEFF as ZWNBSP.
>
> -Doug Ewell
>  Fullerton, California
>
>
>





Re: Names for UTF-8 with and without BOM

2002-11-03 Thread John Cowan
[EMAIL PROTECTED] scripsit:

> I find it interesting, then, to see Michael saying that, since Notepad 
> sticks a BOM-cum-signature at the start of its UTF-8, the rest of the 
> world should support it.

There is another argument, viz. ISO/IEC 10646, which plainly proclaims
that the 8-BOM is a valid signature for UTF-8 files.

-- 
Even a refrigerator can conform to the XML  John Cowan
Infoset, as long as it has a door sticker   [EMAIL PROTECTED]
saying "No information items inside".   http://www.reutershealth.com
--Eve Maler http://www.ccil.org/~cowan




Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Michael \(michka\) Kaplan
From: <[EMAIL PROTECTED]>

> In particular, I'm thinking of a situation about a year and a half ago
> (IIRC) in which Michael (and I and others) were strongly opposed to a
> suggestion that the Unicode Consortium should document a certain variation
> (perversion, some would say) of one of the Unicode encoding forms that a
> certain vendor had implemented in their software. On that occasion,
> Michael (and I and others) were arguing that, just because they had done
> something in their software, that shouldn't mean that the rest of the
> world should be forced to support their encoding form.
>
> I find it interesting, then, to see Michael saying that, since Notepad
> sticks a BOM-cum-signature at the start of its UTF-8, the rest of the
> world should support it.

I do not see the conflict, or the irony. Remember that what Notepad and
others do is present mainly because it *is* in the XML standard. What was
being done by those others with UTF-8 was not a part of the UTF-8 "standard"
and was in fact specifically disallowed. In the end, note that UTF-8 was not
compromised; they got their own [non-preferred] encoding scheme for their
backcompat requirement, and they now have the "job" of making their products
use it in name.

If someone has a bug or problem in their software, then it is of course
their responsibility to fix it. On the other hand, if one pays attention to
a possible (optional) recommendation in a standard, is it not the standard's
responsibility not to make people regret that they paid attention?

(Which is not to say that they got the "idea" from XML; I am not sure where
the idea came from. I figure that there was a strong interest in making sure
that when someone saved a file as UTF-8, it would still be considered UTF-8
when reloaded, rather than ASCII or ANSI [sic]. This is a good reason for
such a decision in plain text -- and the fact that XML is after all "just
text" is lost on no one...)

Given the strong lack of interest that XML has had in the notion of breaking
old parsers or valid XML 1.0 streams, it seems unlikely (to me) that they
would make such a breaking change in a future version of XML.

MichKa





Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Peter_Constable
On 11/02/2002 12:15:54 PM "Michael \(michka\) Kaplan" wrote:

>> .xml UTF-8N Some XML processors may not cope with BOM
>
>Maybe they need to upgrade? Since people often edit the files in notepad,
>many files are going to have it. A parser that cannot accept this reality
>is not going to make it very long.

Ah, now here's an interesting twist. I'm not saying I disagree with 
Michael. I'm just acknowledging my own need for intellectual honesty, and 
realising that sometimes we take opposite sides of an opinion because of 
other factors that we may or may not be conscious of.

In particular, I'm thinking of a situation about a year and a half ago 
(IIRC) in which Michael (and I and others) were strongly opposed to a 
suggestion that the Unicode Consortium should document a certain variation 
(perversion, some would say) of one of the Unicode encoding forms that a 
certain vendor had implemented in their software. On that occasion, 
Michael (and I and others) were arguing that, just because they had done 
something in their software, that shouldn't mean that the rest of the 
world should be forced to support their encoding form.

I find it interesting, then, to see Michael saying that, since Notepad 
sticks a BOM-cum-signature at the start of its UTF-8, the rest of the 
world should support it.

Again, this is just an observation on the particular argument being used, 
but not on the suggestion being made.



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





RE: Names for UTF-8 with and without BOM

2002-11-03 Thread Peter_Constable
On 11/02/2002 11:59:24 AM "Joseph Boyle" wrote:

>The first time I thought of UTF-8Y it sounded too flippant, but actually it
>is fairly self-explanatory if UTF-8 is taken as a given, and has the virtue
>of being short.

UTF-8Y (and UTF-8J) is not at all intuitive. "UTF-8-yuk"? The better 
counterpart IMO to UTF-8N[o BOM], if we need these labels at all, would be 
UTF-8B[OM].



- Peter


---
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>





Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Tex Texin


John Cowan wrote:
> 
> Tex Texin scripsit:
> 
> > Interestingly, although I didn't study it in detail, looking at rfc 2376
> > for prioritization over charset conflicts, it seems to recommend
> > stripping the BOM when converting from utf-16 to other charsets (and
> > without considering that ucs-4 would like to keep it). (section 5).
> 
> The point is not to try to convert it into a U+FEFF character or some
> replacement thereof, like say "?".

That may be the intent, but it doesn't say that. It should say to convert
the BOM to the equivalent BOM for the target encoding, if there is one.
Instead it says to strip it for other encodings.
(I wish it were called a signature rather than a BOM for most of these
usages.)

 
> > Also, in considering charset conflicts, 2376 fails to consider conflicts
> > between signature and the encoding declaration. (I have a utf-16BE BOM
> > and the encoding declaration is for utf-8...).
> 
> The encoding declaration is supposed to trump all.  So it is UTF-8, and
> since 0xFF is illegal in UTF-8, you blow chunks...

OK, but where is that written?

 
> > I'll have to check for a more up-to-date rfc.
> 
> There is none.

OK. Sorry if I seem to be difficult. I am just rereading a few things
with my new understanding to put the picture back together again.

tex
> 
> --
> John Cowan <[EMAIL PROTECTED]> http://www.reutershealth.com
> I amar prestar aen, han mathon ne nen,http://www.ccil.org/~cowan
> han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_

-- 
-
Tex Texin   cell: +1 781 789 1898   mailto:Tex@;XenCraft.com
Xen Master  http://www.i18nGuy.com
 
XenCraft    http://www.XenCraft.com
Making e-Business Work Around the World
-




Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Tex Texin
Doug,

Doug Ewell wrote:
> 
> Tex Texin  wrote:
> 
> > However, I didn't realize that parsers were to allow for the
> > possibility of different signatures.
> > So a parser has to worry about scsu signatures, etc
> 
> A parser only *has* to read UTF-8 without signature and UTF-16 with
> signature.

Yes, I thought so until I saw Michka's note. And I thought that gave me
100% utf-8 coverage.
Apparently I would be leaving out the thousands ;-) that edit xml with
notepad.

> It *may* read other encodings of its own choosing, including
> ISO 8859-1, SCSU, JOECODE, or US-BSCII.  (However, I can't find anything
> that allows for SCSU with signature, which is a shame since UTS #6
> encourages the signature.)


Can I stand on the other side of the fence now and refer to market
forces when it comes to ISO 8859 etc.? ;-)

Anyway, I think you understood the context of my whines-- It was just
reaction to this silliness with open-ended signatures...

tex

-- 
-
Tex Texin   cell: +1 781 789 1898   mailto:Tex@;XenCraft.com
Xen Master  http://www.i18nGuy.com
 
XenCraft    http://www.XenCraft.com
Making e-Business Work Around the World
-




Re: Names for UTF-8 with and without BOM

2002-11-02 Thread John Cowan
Tex Texin scripsit:

> Interestingly, although I didn't study it in detail, looking at rfc 2376
> for prioritization over charset conflicts, it seems to recommend
> stripping the BOM when converting from utf-16 to other charsets (and
> without considering that ucs-4 would like to keep it). (section 5).

The point is not to try to convert it into an FFEF character or some
replacement thereof, like say "?".

> Also, in considering charset conflicts, 2376 fails to consider conflicts
> between signature and the encoding declaration. (I have a utf-16BE BOM
> and the encoding declaration is for utf-8...).

The encoding declaration is supposed to trump all.  So it is UTF-8, and
since 0xFF is illegal in UTF-8, you blow chunks...

> I'll have to check for a more up-to-date rfc.

There is none.

-- 
John Cowan <[EMAIL PROTECTED]> http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_




Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Tex Texin
John Cowan wrote:
> 
> Tex Texin scripsit:
> 
> > So when the parser gets JOECODE, I can understand ignoring the signature
> > and autodetection, but exactly how does it find the first "<"?
> 
> Well, if it begins with a 0x00 byte, it can't be UTF-8 or UTF-16 (it might
> be UTF-32 big-endian, but we'll suppose the parser can't handle that).
> JOECODE is what's left.  At worst it is in some other encoding and/or
> not well-formed, in which case you expect an error and you get one.
> Of course the processor knows that "<" is encoded as 0xFF in JOECODE
> 
> The point is that signatures don't decode to a character: processors in
> general, not just XML processors, are expected to skip them.
> 
> > It must have to try all of the encodings known to it... ugh.
> 
> In such a bad case, that's all you can do.

John,

The bad case is what I was whinging about, since most processors deal
with more than 3 encodings. Ultimately, because the initial characters
are fixed, autodetection is not as bad as it is for plain text; I realize
that.

Interestingly, although I didn't study it in detail, looking at rfc 2376
for prioritization over charset conflicts, it seems to recommend
stripping the BOM when converting from utf-16 to other charsets (and
without considering that ucs-4 would like to keep it). (section 5).

Also, in considering charset conflicts, 2376 fails to consider conflicts
between signature and the encoding declaration. (I have a utf-16BE BOM
and the encoding declaration is for utf-8...).

I'll have to check for a more up-to-date rfc.

All in all I agree with you and Michka (yes you were right, I was wrong,
Michael!) that it isn't that big a deal to support a variety of BOMs, but
the world did not need yet another way to sometimes (maybe it's there),
almost (maybe it's unique), redundantly (one hopes it's redundant and not
conflicting) declare an encoding.


tex



-- 
-
Tex Texin   cell: +1 781 789 1898   mailto:Tex@;XenCraft.com
Xen Master  http://www.i18nGuy.com
 
XenCraft    http://www.XenCraft.com
Making e-Business Work Around the World
-




Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Doug Ewell
Tex Texin  wrote:

> However, I didn't realize that parsers were to allow for the
> possibility of different signatures.
> So a parser has to worry about scsu signatures, etc

A parser only *has* to read UTF-8 without signature and UTF-16 with
signature.  It *may* read other encodings of its own choosing, including
ISO 8859-1, SCSU, JOECODE, or US-BSCII.  (However, I can't find anything
that allows for SCSU with signature, which is a shame since UTS #6
encourages the signature.)

-Doug Ewell
 Fullerton, California





Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Michael \(michka\) Kaplan
You are mistaken about this -- XML claimed originally that it was valid but
was not required.

The notion that XML parsers would update to handle a new encoding form to
strip off three bytes but would not conditionally strip those three bytes if
they were the first three bytes of the file is an unrealistic one.

MichKa

- Original Message -
From: "Tex Texin" <[EMAIL PROTECTED]>
To: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
Cc: "Mark Davis" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Saturday, November 02, 2002 11:08 AM
Subject: Re: Names for UTF-8 with and without BOM


> "Michael (michka) Kaplan" wrote:
> > > .xml UTF-8N Some XML processors may not cope with BOM
> >
> > Maybe they need to upgrade? Since people often edit the files in
> > notepad, many files are going to have it. A parser that cannot accept
> > this reality is not going to make it very long.
>
> I didn't think the XML standard allowed for utf-8 files to have a BOM.
> The standard is quite clear about requiring 0xFEFF for utf-16.
> I would have thought a proper parser would reject a non-utf-16 file
> beginning with something other than "<".
>
> (The fact that notepad puts it there should be irrelevant.)
>
> Am I wrong about XML and the utf-8 signature?
>
> tex
>
>
> --
> -
> Tex Texin   cell: +1 781 789 1898   mailto:Tex@;XenCraft.com
> Xen Master  http://www.i18nGuy.com
>
> XenCraft http://www.XenCraft.com
> Making e-Business Work Around the World
> -
>
>





Re: Names for UTF-8 with and without BOM

2002-11-02 Thread John Cowan
Tex Texin scripsit:

> So when the parser gets JOECODE, I can understand ignoring the signature
> and autodetection, but exactly how does it find the first "<"?

Well, if it begins with a 0x00 byte, it can't be UTF-8 or UTF-16 (it might
be UTF-32 big-endian, but we'll suppose the parser can't handle that).
JOECODE is what's left.  At worst it is in some other encoding and/or
not well-formed, in which case you expect an error and you get one.
Of course the processor knows that "<" is encoded as 0xFF in JOECODE

The point is that signatures don't decode to a character: processors in
general, not just XML processors, are expected to skip them.

> It must have to try all of the encodings known to it... ugh.

In such a bad case, that's all you can do.

-- 
John Cowan  [EMAIL PROTECTED]  www.reutershealth.com  www.ccil.org/~cowan
Promises become binding when there is a meeting of the minds and consideration
is exchanged. So it was at King's Bench in common law England; so it was
under the common law in the American colonies; so it was through more than
two centuries of jurisprudence in this country; and so it is today. 
   --_Specht v. Netscape_




Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Tex Texin
John,
I understand the flexibility of XML to use different encodings.

However, I didn't realize that parsers were to allow for the possibility
of different signatures.
So a parser has to worry about scsu signatures, etc

Whereas XML is so fussy about which characters it accepts, I am
surprised at its flexibility for signatures.
So when the parser gets JOECODE, I can understand ignoring the signature
and autodetection, but exactly how does it find the first "<"?
It must have to try all of the encodings known to it... ugh.

tex


John Cowan wrote:
> 
> Tex Texin scripsit:
> 
> > However, that leaves open the question whether only the Unicode
> > transform signatures are acceptable or other signatures are also
> > allowed. So if a vendor defines a code page, and defines a signature
> > (perhaps mapping BOM/ZWNBSP specifically to some code point or byte
> > string) does that then become acceptable?
> 
> IMHO yes.  XML documents are not *required* to be in one of the character
> sets that can be automatically detected by the methods of Appendix F.
> You can encode your documents in (hypothetical) JOECODE, which uses leading
> 00 as a signature (ignored by the XML parser) and then A=01, B=02, C=03, and so on.
> Autodetection will not work here, but it is perfectly conformant to have
> a processor that understands only UTF-8, UTF-16, and JOECODE.
> 
> Of course some encodings, such as US-BSCII, which looks just like US-ASCII
> except that A=0x42, B=0x41, a=0x62, b=0x61 will cause problems for anybody.
> :-)
> 
> I am a member of, but not speaking for, the XML Core WG.
> 
> --
> John Cowan  [EMAIL PROTECTED]  www.ccil.org/~cowan  www.reutershealth.com
> "The competent programmer is fully aware of the strictly limited size of his own
> skull; therefore he approaches the programming task in full humility, and among
> other things he avoids clever tricks like the plague."  --Edsger Dijkstra

-- 
-
Tex Texin   cell: +1 781 789 1898   mailto:Tex@;XenCraft.com
Xen Master  http://www.i18nGuy.com
 
XenCraft    http://www.XenCraft.com
Making e-Business Work Around the World
-




Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Tex Texin
Hi John,
I meant the character "<".

As for notepad, what I should have either stated more completely or bit
my tongue, is that where there is a standard in place (and where it is
unambiguous) the mistakes of particular products shouldn't hold sway,
unless they are tantamount to a de facto standard. I (personally) don't
hold notepad in that class. In particular with respect to Michka's
comment that parsers should upgrade to accommodate notepad's BOM,  I
rather thought notepad should be changed. But I certainly don't want to
get into a debate on notepad's influence on the market, so let's pretend
I bit my tongue in the last mail, and once again in this mail. ;-)

tex

John Cowan wrote:
> 
> Tex Texin scripsit:
> 
> > I didn't think the XML standard allowed for utf-8 files to have a BOM.
> 
> This capability was never actually excluded, and was added by erratum
> (and force-majeure, when it became clear that BOMful UTF-8 was going to
> start becoming common).  XML files are intended to be plain text, and
> if a large source of plain text insists on a BOM, so be it.
> 
> > The standard is quite clear about requiring 0xFEFF for utf-16.
> > I would have thought a proper parser would reject a non-utf-16 file
> > beginning with something other than "<".
> 
> If by "<" you mean the *character* "<", then yes.  If you mean the *byte*
> 0x3C, then no: well-formed XML files can begin with any of 0x00 (UTF-32),
> 0x3C (ASCII-compatible), 0x4C (EBCDIC), 0xEF (UTF-8 with BOM), 0xFE (UTF-16
> in BE order), or 0xFF (UTF-16 in LE order).  In principle they could begin with
> some other byte: 0x2B in UTF-7, e.g.
> 
> > (The fact that notepad puts it there should be irrelevant.)
> 
> Actual practice is never quite irrelevant.
> 
> --
> John Cowan   [EMAIL PROTECTED]   http://www.reutershealth.com
> "Mr. Lane, if you ever wish anything that I can do, all you will have
> to do will be to send me a telegram asking and it will be done."
> "Mr. Hearst, if you ever get a telegram from me asking you to do
> anything, you can put the telegram down as a forgery."

-- 
-
Tex Texin   cell: +1 781 789 1898   mailto:Tex@;XenCraft.com
Xen Master  http://www.i18nGuy.com
 
XenCraft    http://www.XenCraft.com
Making e-Business Work Around the World
-




Re: Names for UTF-8 with and without BOM

2002-11-02 Thread John Cowan
Tex Texin scripsit:

> However, that leaves open the question whether only the Unicode
> transform signatures are acceptable or other signatures are also
> allowed. So if a vendor defines a code page, and defines a signature
> (perhaps mapping BOM/ZWNBSP specifically to some code point or byte
> string) does that then become acceptable?

IMHO yes.  XML documents are not *required* to be in one of the character
sets that can be automatically detected by the methods of Appendix F.
You can encode your documents in (hypothetical) JOECODE, which uses leading
00 as a signature (ignored by the XML parser) and then A=01, B=02, C=03, and so on.
Autodetection will not work here, but it is perfectly conformant to have
a processor that understands only UTF-8, UTF-16, and JOECODE.

Of course some encodings, such as US-BSCII, which looks just like US-ASCII
except that A=0x42, B=0x41, a=0x62, b=0x61 will cause problems for anybody.
:-)

I am a member of, but not speaking for, the XML Core WG.

-- 
John Cowan  [EMAIL PROTECTED]  www.ccil.org/~cowan  www.reutershealth.com
"The competent programmer is fully aware of the strictly limited size of his own
skull; therefore he approaches the programming task in full humility, and among
other things he avoids clever tricks like the plague."  --Edsger Dijkstra




Re: Names for UTF-8 with and without BOM

2002-11-02 Thread John Cowan
Tex Texin scripsit:

> I didn't think the XML standard allowed for utf-8 files to have a BOM.

This capability was never actually excluded, and was added by erratum
(and force-majeure, when it became clear that BOMful UTF-8 was going to
start becoming common).  XML files are intended to be plain text, and
if a large source of plain text insists on a BOM, so be it.

> The standard is quite clear about requiring 0xFEFF for utf-16.
> I would have thought a proper parser would reject a non-utf-16 file
> beginning with something other than "<".

If by "<" you mean the *character* "<", then yes.  If you mean the *byte*
0x3C, then no: well-formed XML files can begin with any of 0x00 (UTF-32),
0x3C (ASCII-compatible), 0x4C (EBCDIC), 0xEF (UTF-8 with BOM), 0xFE (UTF-16
in BE order), or 0xFF (UTF-16 in LE order).  In principle they could begin with
some other byte: 0x2B in UTF-7, e.g.
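The byte list above amounts to a small first-byte detector, along the lines of XML 1.0 Appendix F. A sketch (the function name and return strings are mine; the encoding declaration, if any, still has the final say):

```python
# Sketch of leading-byte encoding detection for XML entities,
# modeled on the families John lists (cf. XML 1.0 Appendix F).

def sniff_xml_encoding(head: bytes) -> str:
    """Classify the encoding family from the first bytes of a document."""
    if head.startswith(b"\xef\xbb\xbf"):
        return "UTF-8 with signature"
    if head.startswith(b"\xfe\xff"):
        return "UTF-16 big-endian"
    if head.startswith(b"\xff\xfe"):
        return "UTF-16 little-endian"
    if head.startswith(b"\x00"):
        return "UTF-32 or other multi-byte big-endian family"
    if head.startswith(b"\x3c"):          # '<' in ASCII-compatible encodings
        return "ASCII-compatible (UTF-8 without signature, ISO 8859, ...)"
    if head.startswith(b"\x4c"):          # '<' in EBCDIC
        return "EBCDIC family"
    return "unknown; must rely on external information"
```

This only narrows the document to a family; the parser still reads the `encoding=` declaration (in that family's encoding of ASCII characters) to pin down the exact charset.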

> (The fact that notepad puts it there should be irrelevant.)

Actual practice is never quite irrelevant.

-- 
John Cowan   [EMAIL PROTECTED]   http://www.reutershealth.com
"Mr. Lane, if you ever wish anything that I can do, all you will have
to do will be to send me a telegram asking and it will be done."
"Mr. Hearst, if you ever get a telegram from me asking you to do
anything, you can put the telegram down as a forgery."




Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Tex Texin
Thanks Doug. I had looked at the standard not at the appendix. 

I think that (non-normative) appendix is unfortunate. It seems to imply
(to my mind) that if other character sets define BOMs, it is OK to use
them as XML signatures.
My reasoning is that the standard itself only says that UTF-16 must have
a signature, and that everything else except UTF-8 must declare its
encoding. The standard doesn't say whether other encodings should or
should not be allowed to use signatures. Appendix F, by defining the
other Unicode signatures, implies they are acceptable (without
specifically stating so).

The text of the standard, however, doesn't suggest even that UCS-4 would
use a signature: it doesn't include UCS-4 with UTF-16 when speaking about
requiring a BOM, and it specifically gives the name of UCS-4 to use in
the declaration, as with other encodings.

However, that leaves open the question whether only the Unicode
transform signatures are acceptable or other signatures are also
allowed. So if a vendor defines a code page, and defines a signature
(perhaps mapping BOM/ZWNBSP specifically to some code point or byte
string) does that then become acceptable?

Of course we hope not, and I am sure the authors did not intend so, but
without a statement about which signatures are allowed or not allowed
beyond UTF-16, I think the can of worms is opened.

OK, having raised the issue I'll take it up with the w3c i18n group to
get their understanding and then the xml group if needed.

tex


Doug Ewell wrote:
> 
> Tex Texin  wrote:
> 
> > I didn't think the XML standard allowed for utf-8 files to have a BOM.
> > The standard is quite clear about requiring 0xFEFF for utf-16.
> > I would have thought a proper parser would reject a non-utf-16 file
> > beginning with something other than "<".
> 
> The standard explicitly allows UCS-4, UTF-16, and UTF-8 files to begin
> with a BOM.  See Appendix F.1, "Detection Without External Encoding
> Information":
> 
> http://www.w3.org/TR/REC-xml#sec-guessing
> 
> -Doug Ewell
>  Fullerton, California

-- 
-
Tex Texin   cell: +1 781 789 1898   mailto:Tex@;XenCraft.com
Xen Master  http://www.i18nGuy.com
 
XenCraft    http://www.XenCraft.com
Making e-Business Work Around the World
-




Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Doug Ewell
Tex Texin  wrote:

> I didn't think the XML standard allowed for utf-8 files to have a BOM.
> The standard is quite clear about requiring 0xFEFF for utf-16.
> I would have thought a proper parser would reject a non-utf-16 file
> beginning with something other than "<".

The standard explicitly allows UCS-4, UTF-16, and UTF-8 files to begin
with a BOM.  See Appendix F.1, "Detection Without External Encoding
Information":

http://www.w3.org/TR/REC-xml#sec-guessing

-Doug Ewell
 Fullerton, California





Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Doug Ewell
Mark Davis  wrote:

> That is not sufficient. The first three bytes could represent a real
> content character, ZWNBSP or they could be a BOM. The label doesn't
> tell you.

I have never understood under what circumstances a ZWNBSP would ever
appear as the first character of a file.  It wouldn't make any sense.  A
ZWNBSP prevents a word break between the preceding and following
characters.  If there *is* no preceding character, then what is the
point of the ZWNBSP?

Every time this topic comes up, I have asked why a true ZWNBSP would
ever appear as the first character of a file.  The only responses I've
heard are:

1.  It might not be a discrete file, but the second (or successive)
piece of a file that was split up for some reason (transmission, etc.).

In that case, the interpreting process should take its encoding cue from
the first fragment, and should NEVER reinterpret fragments broken up at
arbitrary points.  (Imagine a process modifying a GIF or JPEG file, or
converting CR/LF, based on fragments!)  But this is not the point being
discussed anyway; the point is whole files.

2.  It could happen; Unicode allows any character to appear anywhere.

Well, almost anywhere.  But even so, the likelihood of a U+FEFF as
ZWNBSP appearing at the start of an unsigned UTF-8 file is vanishingly
small compared to the likelihood that the U+FEFF was intended to be a
signature.  The rare case is just too rare to invalidate the heuristic
for the much more common case.
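The heuristic Doug defends here can be sketched in a few lines: treat a leading EF BB BF in UTF-8 data as a signature and drop it, rather than decoding it to a content ZWNBSP. This is an illustration of the heuristic, not a claim about what any particular parser does.

```python
# Sketch: decode UTF-8 bytes, treating a leading EF BB BF as a
# signature to discard rather than as a content U+FEFF (ZWNBSP).

UTF8_SIG = b"\xef\xbb\xbf"

def decode_utf8(data: bytes) -> str:
    """Decode UTF-8, discarding a leading signature if present."""
    if data.startswith(UTF8_SIG):
        data = data[len(UTF8_SIG):]  # signature, not a ZWNBSP
    return data.decode("utf-8")
```

A U+FEFF anywhere *after* the first position is left alone, since there it is unambiguously content.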

In addition, as Michka points out, we now have U+2060 WORD JOINER, whose
entire purpose in life is to be used as U+FEFF was formerly used, as a
ZWNBSP.  Any new Unicode text should use U+2060 and not U+FEFF as a word
joiner.  It's hard to imagine that UTC and WG2 would have standardized
this if there was a lot of real-world text that used U+FEFF as ZWNBSP.

-Doug Ewell
 Fullerton, California





Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Tex Texin
"Michael (michka) Kaplan" wrote:
> > .xml UTF-8N Some XML processors may not cope with BOM
> 
> Maybe they need to upgrade? Since people often edit the files in notepad,
> many files are going to have it. A parser that cannot accept this reality is
> not going to make it very long.

I didn't think the XML standard allowed for utf-8 files to have a BOM.
The standard is quite clear about requiring 0xFEFF for utf-16.
I would have thought a proper parser would reject a non-utf-16 file
beginning with something other than "<".

(The fact that notepad puts it there should be irrelevant.)

Am I wrong about XML and the utf-8 signature?

tex


-
Tex Texin   cell: +1 781 789 1898   mailto:Tex@;XenCraft.com
Xen Master  http://www.i18nGuy.com
 
XenCraft    http://www.XenCraft.com
Making e-Business Work Around the World
-




Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Michael \(michka\) Kaplan
From: "Joseph Boyle" <[EMAIL PROTECTED]>

> These are listed as examples to demonstrate the idea of a configuration
> file listing encoding constraints. The fact that each constraint is
> arguable is a good reason to make the constraints configurable, and
> therefore to have names to distinguish BOM and non-BOM UTF-8.

Yes, but the fact that every one of them can have it or not, and that only
inadequate parsers will ever really have a problem with them, is a good
indication that it is not really required for the users who care about
separate charset names.

MichKa





RE: Names for UTF-8 with and without BOM

2002-11-02 Thread Joseph Boyle
These are listed as examples to demonstrate the idea of a configuration file
listing encoding constraints. The fact that each constraint is arguable is a
good reason to make the constraints configurable, and therefore to have
names to distinguish BOM and non-BOM UTF-8.
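The configurable checker described above can be sketched as a table from file extension to an encoding constraint, using the thread's proposed labels ("UTF-8BOM", "UTF-8N", plain "UTF-8"). The table contents and function name are illustrative, not a proposed standard.

```python
# Sketch: enforce a per-file-type encoding constraint, distinguishing
# UTF-8 with signature ("UTF-8BOM"), without ("UTF-8N"), and either ("UTF-8").
# The RULES table is an example configuration, not a recommendation.

RULES = {".txt": "UTF-8BOM", ".xml": "UTF-8N", ".htm": "UTF-8"}

def check(name: str, data: bytes) -> bool:
    """Return True if `data` satisfies the constraint for `name`'s type."""
    ext = name[name.rfind("."):]
    rule = RULES.get(ext)
    if rule is None:
        return True                       # no constraint configured
    try:
        data.decode("utf-8")              # all three labels require valid UTF-8
    except UnicodeDecodeError:
        return False
    has_sig = data.startswith(b"\xef\xbb\xbf")
    if rule == "UTF-8BOM":
        return has_sig
    if rule == "UTF-8N":
        return not has_sig
    return True                           # plain "UTF-8": signature optional
```

The point of the named variants is exactly this: a checker (or producer) can be told per file type which flavor to enforce or emit.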

-Original Message-
From: Michael (michka) Kaplan [mailto:michka@;trigeminal.com] 
Sent: Saturday, November 02, 2002 10:16 AM
To: Joseph Boyle; Mark Davis; Murray Sargent
Cc: [EMAIL PROTECTED]
Subject: Re: Names for UTF-8 with and without BOM


From: "Joseph Boyle" <[EMAIL PROTECTED]>

> Type Encoding Comment
> .txt UTF-8BOM We want plain text files to have BOM to distinguish from 
> legacy codepage files

Not really required, but optional -- the performance hit of making sure it's
valid UTF-8 is pretty minor. But people do open some *huge* text files in
things like notepad.

> .xml UTF-8N Some XML processors may not cope with BOM

Maybe they need to upgrade? Since people often edit the files in notepad,
many files are going to have it. A parser that cannot accept this reality is
not going to make it very long.

> .htm UTF-8 We want HTML to be UTF-8 but will not insist on BOM

Same as text, with the bonus of the possibility of a higher-level protocol.
It can still go either way.

> .rc Codepage Unfortunately compiler insists on these being codepage.

They can be UTF-16, too (at least on Win32!).

> .swt ASCII Nonlocalizable internal format, must be ASCII.

Haven't run across these -- but note that if it's not UTF-8 then it does not
apply.






Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Michael \(michka\) Kaplan
From: "Joseph Boyle" <[EMAIL PROTECTED]>

> Type Encoding Comment
> .txt UTF-8BOM We want plain text files to have BOM to distinguish
> from legacy codepage files

Not really required, but optional -- the performance hit of making sure it's
valid UTF-8 is pretty minor. But people do open some *huge* text files in
things like notepad.

> .xml UTF-8N Some XML processors may not cope with BOM

Maybe they need to upgrade? Since people often edit the files in notepad,
many files are going to have it. A parser that cannot accept this reality is
not going to make it very long.

> .htm UTF-8 We want HTML to be UTF-8 but will not insist on BOM

Same as text, with the bonus of the possibility of a higher-level protocol.
It can still go either way.

> .rc Codepage Unfortunately compiler insists on these being
> codepage.

They can be UTF-16, too (at least on Win32!).

> .swt ASCII Nonlocalizable internal format, must be ASCII.

Haven't run across these -- but note that if it's not UTF-8 then it does not
apply.





RE: Names for UTF-8 with and without BOM

2002-11-02 Thread Joseph Boyle
The first time I thought of UTF-8Y it sounded too flippant, but actually it
is fairly self-explanatory if UTF-8 is taken as a given, and has the virtue
of being short.

UTF-8S for signature would also make sense, but is the same as the name of
Toby Phipps's proposal which eventually became CESU-8.

UTF-8J will certainly make sense, after UTC changes all the character names
to Esperanto, conducts its meetings in Esperanto, and publishes TUS in
Esperanto.

If we want to be really explicit, there's UTF-8EFBBBF.

-Original Message-
From: William Overington [mailto:WOverington@;ngo.globalnet.co.uk] 
Sent: Friday, November 01, 2002 10:37 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: Names for UTF-8 with and without BOM


As you have UTF-8N where the N stands for the word "no" one could possibly
have UTF-8Y where the Y stands for the word "yes".

Thus one could have the name of the format answering, or not answering, the
following question.

Is there a BOM encoded?

However, using the letter Y has three disadvantages for widespread use.  The
letter Y could be confused with the word "why", the word "yes" is English,
so the designation would be anglocentric, and the letter Y sorts
alphabetically after the letter N.

However, if one considers the use of the international language Esperanto,
then the N would mean "ne", that is, the Esperanto word for "no" and thus
one could use the letter J to stand for the Esperanto word "jes" which is
the Esperanto word for "yes" and which, in fact, is pronounced exactly the
same as the English word "yes".

Thus, I suggest that the three formats could be UTF-8, UTF-8J and UTF-8N,
which would solve the problem in a manner which, being based upon a neutral
language, will hopefully be acceptable to all.

William Overington

2 November 2002








RE: Names for UTF-8 with and without BOM

2002-11-02 Thread Joseph Boyle
The main need I see is not to tell a consumer whether a leading U+FEFF is a
BOM or ZWNBSP, but:

* for producers (telling whether to emit a BOM or not), and 
* normative (a checker enforcing an encoding standard per file type, defined
in a table like the one below)

Type    Encoding    Comment
.txt    UTF-8BOM    We want plain text files to have BOM to distinguish
                    from legacy codepage files
.xml    UTF-8N      Some XML processors may not cope with BOM
.htm    UTF-8       We want HTML to be UTF-8 but will not insist on BOM
.rc     Codepage    Unfortunately compiler insists on these being
                    codepage.
.swt    ASCII       Nonlocalizable internal format, must be ASCII.

Please consider the proposal for separate charset names on that basis and
not on the basis of utility for telling a consumer whether U+FEFF is a BOM,
which I agree is by now a nonissue.
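The normative-checker idea above can be sketched as a small table-driven
check. This is only an illustration of the proposal, not any shipped tool;
the POLICY table, rule names, and function name are all hypothetical:

```python
import os

BOM_UTF8 = b'\xef\xbb\xbf'  # the UTF-8 encoding of U+FEFF

# Hypothetical policy table modeled on the one above.
POLICY = {
    '.txt': 'bom-required',   # UTF-8BOM: distinguish from legacy codepage files
    '.xml': 'bom-forbidden',  # UTF-8N: some XML processors may not cope
    '.htm': 'bom-optional',   # UTF-8: BOM allowed but not insisted on
}

def check_signature(filename, data):
    """Return True if data's UTF-8 signature matches the policy for filename."""
    ext = os.path.splitext(filename)[1].lower()
    rule = POLICY.get(ext, 'bom-optional')
    has_bom = data.startswith(BOM_UTF8)
    if rule == 'bom-required':
        return has_bom
    if rule == 'bom-forbidden':
        return not has_bom
    return True
```

A release-engineering checker along these lines only needs the first three
bytes of each file, so it stays cheap even over a large source tree.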

-Original Message-
From: Michael (michka) Kaplan [mailto:michka@;trigeminal.com] 
Sent: Saturday, November 02, 2002 4:18 AM
To: Mark Davis; Murray Sargent; Joseph Boyle
Cc: [EMAIL PROTECTED]
Subject: Re: Names for UTF-8 with and without BOM


From: "Mark Davis" <[EMAIL PROTECTED]>

> That is not sufficient. The first three bytes could represent a real
> content character, ZWNBSP or they could be a BOM. The label doesn't tell
> you.

There are several problems with this supposition -- most notably the fact
that there are cases that specifically claim this is not recommended and
that U+2060 is preferred.

> This is similar to UTF-16 CES vs UTF-16BE CES. In the first case, 0xFE
> 0xFF
> represents a BOM, and is not part of the content. In the second case, 
> it does *not* represent a BOM -- it represents a ZWNBSP, and must not 
> be stripped. The difference here is that the encoding name tells you 
> exactly what the situation is.

I do not see this as a realistic scenario.  I would argue that if the BOM
matches the encoding scheme, perhaps this was an intentional effort to make
sure that applications which may not understand the higher level protocol
can also see what the encoding scheme is.

But even if we assume that someone has gone to the trouble of calling
something UTF16BE and has 0xFE 0xFF at the beginning of the file: what kind
of content *is* such a code point, that this is even worth calling out as a
special case?

If the goal is clear and unambiguous text, then the best way would be to
simplify ALL of this. It was previously decided to always call it a BOM;
why not stick with that?

MichKa






Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Michael (michka) Kaplan
From: "Mark Davis" <[EMAIL PROTECTED]>

> That is not sufficient. The first three bytes could represent a real
> content character, ZWNBSP or they could be a BOM. The label doesn't tell
> you.

There are several problems with this supposition -- most notably the fact
that there are cases that specifically claim this is not recommended and
that U+2060 is preferred.

> This is similar to UTF-16 CES vs UTF-16BE CES. In the first case, 0xFE
> 0xFF
> represents a BOM, and is not part of the content. In the second case, it
> does *not* represent a BOM -- it represents a ZWNBSP, and must not be
> stripped. The difference here is that the encoding name tells you exactly
> what the situation is.

I do not see this as a realistic scenario.  I would argue that if the BOM
matches the encoding scheme, perhaps this was an intentional effort to make
sure that applications which may not understand the higher level protocol
can also see what the encoding scheme is.

But even if we assume that someone has gone to the trouble of calling
something UTF16BE and has 0xFE 0xFF at the beginning of the file: what kind
of content *is* such a code point, that this is even worth calling out as a
special case?

If the goal is clear and unambiguous text, then the best way would be to
simplify ALL of this. It was previously decided to always call it a BOM;
why not stick with that?

MichKa





Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Mark Davis
That is not sufficient. The first three bytes could represent a real content
character, ZWNBSP or they could be a BOM. The label doesn't tell you.

This is similar to UTF-16 CES vs UTF-16BE CES. In the first case, 0xFE 0xFF
represents a BOM, and is not part of the content. In the second case, it
does *not* represent a BOM -- it represents a ZWNBSP, and must not be
stripped. The difference here is that the encoding name tells you exactly
what the situation is.
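The UTF-16 vs UTF-16BE distinction is easy to demonstrate with any decoder
that implements both encoding schemes; a quick Python sketch, using the
standard codecs (the sample bytes are mine):

```python
data = b'\xfe\xff\x00\x41'  # the bytes FE FF, then big-endian U+0041 'A'

# Labeled "UTF-16": FE FF is a BOM. The decoder consumes it, uses it
# to choose byte order, and it is not part of the content.
assert data.decode('utf-16') == 'A'

# Labeled "UTF-16BE": FE FF is content -- it decodes to U+FEFF
# (ZWNBSP) and must not be stripped.
assert data.decode('utf-16-be') == '\ufeffA'
```

The same bytes, two different interpretations; only the encoding name
disambiguates them.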

Mark
__
http://www.macchiato.com
►  “Eppur si muove” ◄

- Original Message -
From: "Murray Sargent" <[EMAIL PROTECTED]>
To: "Joseph Boyle" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Friday, November 01, 2002 12:42
Subject: RE: Names for UTF-8 with and without BOM


> Joseph Boyle says: "It would be useful to have official names to
> distinguish UTF-8 with and without BOM."
>
> To see if a UTF-8 file has no BOM, you can just look at the first three
> bytes. Is this a problem? Typically when you care about a file's
> encoding form, you plan to read the file.
>
> Thanks
> Murray
>
>
>





Re: Names for UTF-8 with and without BOM

2002-11-01 Thread William Overington
As you have UTF-8N where the N stands for the word "no" one could possibly
have UTF-8Y where the Y stands for the word "yes".

Thus one could have the name of the format answering, or not answering, the
following question.

Is there a BOM encoded?

However, using the letter Y has three disadvantages for widespread use.  The
letter Y could be confused with the word "why", the word "yes" is English,
so the designation would be anglocentric, and the letter Y sorts
alphabetically after the letter N.

However, if one considers the use of the international language Esperanto,
then the N would mean "ne", that is, the Esperanto word for "no" and thus
one could use the letter J to stand for the Esperanto word "jes" which is
the Esperanto word for "yes" and which, in fact, is pronounced exactly the
same as the English word "yes".

Thus, I suggest that the three formats could be UTF-8, UTF-8J and UTF-8N,
which would solve the problem in a manner which, being based upon a neutral
language, will hopefully be acceptable to all.

William Overington

2 November 2002






RE: Names for UTF-8 with and without BOM

2002-11-01 Thread Murray Sargent
Joseph Boyle says: "It would be useful to have official names to
distinguish UTF-8 with and without BOM."

To see if a UTF-8 file has no BOM, you can just look at the first three
bytes. Is this a problem? Typically when you care about a file's
encoding form, you plan to read the file.
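The three-byte check is trivial in any language; a minimal Python sketch
(the function name is mine):

```python
BOM_UTF8 = b'\xef\xbb\xbf'  # UTF-8 encoding of U+FEFF

def utf8_file_has_bom(path):
    """Read only the first three bytes and compare them to EF BB BF."""
    with open(path, 'rb') as f:
        return f.read(3) == BOM_UTF8
```

Since only three bytes are read, this costs almost nothing even before
you decide whether to read the rest of the file.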

Thanks
Murray





Re: Names for UTF-8 with and without BOM

2002-11-01 Thread Kenneth Whistler
> Perhaps it
> is time to think of three other words starting with B, O, M that make a
> better explanation.)

Bollixed Operational Muddle ;-)

--Ken