RE: Names for UTF-8 with and without BOM - pragmatic
Working in a large organization whose product includes a large number of configuration and data files in text formats, I can say something about what we have found to work during development, localization, and release engineering, across multiple platforms.

We have eliminated UTF-16 text file formats in favor of UTF-8 because of the Unix standard toolkit's and other Unix-based tools' poor ability to deal with UTF-16. On the other hand, the BOM on UTF-8 has been useful and has not caused problems with Unix tools processing, including pipe sequences. Raw concatenation of files which would produce internal ZWNBSPs is not part of any of our processing as far as I know.

-----Original Message-----
From: David Starner [mailto:[EMAIL PROTECTED]]
Sent: Thursday, November 07, 2002 12:14 PM
To: Markus Scherer
Cc: unicode
Subject: Re: Names for UTF-8 with and without BOM - pragmatic

On Wed, Nov 06, 2002 at 09:47:43AM -0800, Markus Scherer wrote:
> The fact is that Windows uses UTF-8 and UTF-16 plain text files with
> signatures (BOMs) very simply, gracefully, and successfully. It has
> applied what I called the "pragmatic" approach here for about 10
> years. It just works.

It just works in an environment where relatively few documents are plain text, and that doesn’t use pipes of text as universal glue. C has been described as a (C)haracter processing language; whether or not that’s accurate, Awk and Perl certainly are; these are all Unix programming languages, and at the heart of what Unix is. The simple Unix program has a stream of text coming in and a stream of text going out, whereas the simple Windows program has a window. What works for Windows may very well not work for Unix.

--
David Starner - [EMAIL PROTECTED]
Great is the battle-god, great, and his kingdom--
A field where a thousand corpses lie.
  -- Stephen Crane, "War is Kind"
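[Editorial note: the claim above — that a UTF-8 BOM "has not caused problems with Unix tools processing, including pipe sequences" — holds only when each tool in the pipe either ignores or tolerates a leading signature. A minimal sketch, in Python and purely illustrative (the function name is invented, not from the thread), of the tolerant-reader behavior such tools need:]

```python
UTF8_SIG = b"\xef\xbb\xbf"  # the UTF-8 encoding of U+FEFF, used as a signature

def strip_utf8_signature(data: bytes) -> bytes:
    """Drop a leading UTF-8 signature, if present; leave every other byte intact.

    A non-initial EF BB BF is a real ZWNBSP character and is preserved.
    """
    if data.startswith(UTF8_SIG):
        return data[len(UTF8_SIG):]
    return data
```

A filter built this way could wrap `sys.stdin.buffer`/`sys.stdout.buffer` and behave identically on signed and unsigned UTF-8 input, which is what makes signed files safe in pipe sequences.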
Re: Names for UTF-8 with and without BOM - pragmatic
On Wed, Nov 06, 2002 at 09:47:43AM -0800, Markus Scherer wrote:
> The fact is that Windows uses UTF-8 and UTF-16 plain text files with
> signatures (BOMs) very simply, gracefully, and successfully. It has applied
> what I called the "pragmatic" approach here for about 10 years. It just
> works.

It just works in an environment where relatively few documents are plain text, and that doesn’t use pipes of text as universal glue. C has been described as a (C)haracter processing language; whether or not that’s accurate, Awk and Perl certainly are; these are all Unix programming languages, and at the heart of what Unix is. The simple Unix program has a stream of text coming in and a stream of text going out, whereas the simple Windows program has a window. What works for Windows may very well not work for Unix.

--
David Starner - [EMAIL PROTECTED]
Great is the battle-god, great, and his kingdom--
A field where a thousand corpses lie.
  -- Stephen Crane, "War is Kind"
RE: Names for UTF-8 with and without BOM - pragmatic
> Initial for each piece, as each is assumed to be a complete
> text file before concatenation. Nothing
> prevents copy/cp/cat and other commands from recognizing
> Unicode signatures, for as long as they
> don't claim to preserve initial U+FEFF.

Yes there is, in a formal sense, for cat and cp. See
http://www.opengroup.org/onlinepubs/007904975/utilities/cat.html
which states "The standard output shall contain the sequence of *bytes* read from the input files. Nothing else shall be written to the standard output." (my emphasis) and
http://www.opengroup.org/onlinepubs/007904975/utilities/cp.html
which is not so explicit, but silently assumes that copying does not change the bytes of the file content in any way.

cat and copy/cp are very agnostic programs. They just copy (or concatenate) the byte strings, regardless of whether the content is pictures, sound, or text. So 'cat' can "meaningfully" concatenate only text files of the *same* encoding serialisation, *without* BOM/signature, and where the text files are properly terminated (in the case of stateful serialisations).

Trying to get 'cat' to do more than that for text files would be just as bad as trying to get 'cat' to join (in some "useful" way) picture files (of possibly different formats) or sound or video files. Don't expect cat to catenate those file types, if they are "complete", and to get a useful result. 'cat' is *supposed* to be simple, and just string byte sequences together. If you want something more, use another program that does the "more" you're looking for (or write one). That is not the Unix/Linux utility program 'cat', nor cp.

/Kent K
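[Editorial note: the "other program" Kent suggests — signature-aware concatenation that cat itself must never do — is easy to sketch. The following Python function is purely illustrative (its name and interface are invented here); it keeps a signature only on the first piece, which avoids the internal-ZWNBSP problem mentioned earlier in the thread:]

```python
UTF8_SIG = b"\xef\xbb\xbf"

def bom_aware_concat(files):
    """Concatenate UTF-8 file contents, keeping a signature only on the first piece.

    Each element of `files` is assumed to be the complete byte content of a
    UTF-8 text file that may or may not begin with the EF BB BF signature.
    Unlike POSIX cat, this deliberately alters bytes: non-initial signatures
    are dropped so they cannot turn into spurious ZWNBSP characters.
    """
    out = bytearray()
    for i, data in enumerate(files):
        if i > 0 and data.startswith(UTF8_SIG):
            data = data[len(UTF8_SIG):]
        out += data
    return bytes(out)
```

Such a tool would be the text-format analogue of a video joiner: it understands the format just enough to merge complete pieces, which is exactly the job Kent argues does not belong in cat.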
Re: Names for UTF-8 with and without BOM - pragmatic
Lars Kristan wrote:
> Markus Scherer wrote:
>> If software claims that it does not modify the contents of a document
>> *except* for initial U+FEFF then it can do with initial U+FEFF what it
>> wants. If the whole discussion hinges on what is allowed if software
>> claims to not modify text then one need not claim that so absolutely.
>
> That seems pretty straightforward, but only as long as your "software" is
> an editor and your "document" is a single file. How about a case where
> "software" is a copy or cat command, and instead of a document you have
> several (plain?) text files that you concat? What does "initial" mean here?

Initial for each piece, as each is assumed to be a complete text file before concatenation. Nothing prevents copy/cp/cat and other commands from recognizing Unicode signatures, for as long as they don't claim to preserve initial U+FEFF.

> What happens next is: some software lets an initial BOM get through and
> appends such a string to a file or a stream. If other software treats it as
> a character, the data has been modified. On the other hand, if we want to
> allow software to disregard BOMs in the middle of character streams then we
> have another set of security issues. And not removing is equally bad
> because of many consequences (in the end, we could end up with every
> character being preceded by a BOM).

All true, and all well known, and the reason why the UTC and WG2 added U+2060 Word Joiner. This becomes less of an issue if and when they decide to remove/deprecate the ZWNBSP semantics from U+FEFF. However, in a situation where you cannot be sure about the intended purpose of an initial U+FEFF, I don't think that the "pragmatic" approach is any less safe than any other, while it increases usability.

>> .txt  UTF-8  require  We want plain text files to have BOM to distinguish
>> from legacy codepage files
>
> Hmm, what does "plain" mean?! ...

Your response to this takes it out of context. I am not trying to prescribe general semantics of .txt plain text files. If you read the thread carefully, you will see that I am just taking the file checker configuration file from Joseph Boyle and suggesting a modification to its format that makes it not rely on having charset names that indicate any particular BOM handling. I am sorry to not have made this clearer.

> True, UTF-16 files do need a signature. Well, we just need to abandon the
> idea that UTF-16 can be used for plain text files. Let's have plain text
> files in UTF-8. Look at it as the most universal code page. Plain text
> files never contained information about the code page, why would there be
> such information in UTF-8 plain text files?!

UTF-16 files do not *need* a signature per se. However, it is very useful to prepend Unicode plain text *files* with Unicode signatures so that tools have a chance to figure out if those files are in Unicode at all - and which Unicode charset - or in some legacy charset. With "plain text files" I mean plain text documents without any markup or other meta information.

The fact is that Windows uses UTF-8 and UTF-16 plain text files with signatures (BOMs) very simply, gracefully, and successfully. It has applied what I called the "pragmatic" approach here for about 10 years. It just works.

markus
--
Opinions expressed here may not reflect my company's positions unless otherwise noted.
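[Editorial note: the detection Markus describes — tools using the signature to figure out whether a file is Unicode at all, and which Unicode charset — amounts to a prefix match against the well-known signature byte sequences. A minimal illustrative sketch (the function name is invented here):]

```python
# Longer signatures must be tested first: UTF-32LE's signature (FF FE 00 00)
# begins with UTF-16LE's (FF FE), so order matters.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf",     "UTF-8"),
    (b"\xfe\xff",         "UTF-16BE"),
    (b"\xff\xfe",         "UTF-16LE"),
]

def sniff_signature(data: bytes):
    """Return (charset, signature_length) if data starts with a Unicode
    signature, else (None, 0). A miss suggests a legacy charset or an
    unsigned Unicode file."""
    for sig, name in SIGNATURES:
        if data.startswith(sig):
            return name, len(sig)
    return None, 0
```

Note the residual ambiguity that runs through this whole thread: FF FE 00 00 could equally be UTF-16LE text whose first character is U+0000, so the signature is a heuristic, not proof.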
RE: Names for UTF-8 with and without BOM - pragmatic
> True, UTF-16 files do need a signature.

Eh, no! "UTF-16BE" and "UTF-16LE" files (or whatever kind of text data element) do not have any signature/BOM. Not even files (somehow) labelled "UTF-16" need have a signature/BOM; without a BOM they are the same as if they were labelled "UTF-16BE". (Formally, XML "requires" a BOM for UTF-16 XML documents, but then goes on exemplifying that it is not needed for XML documents...)

I do agree, however, that the idea of having a BOM/signature at the beginning of a file (or other text data element) is a bad one.

/Kent K
RE: Names for UTF-8 with and without BOM - pragmatic
Lars Kristan wrote:
> > .txt  UTF-8  require  We want plain text files to
> > have BOM to distinguish
> > from legacy codepage files
>
> Hmm, what does "plain" mean?! Perhaps files with a BOM
> should be called "text" files (or .txt files;) as
> opposed to "plain text" files, which in my opinion should
> be just that - _plain_ text. No ASCII plain text file had
> an ASCII signature. I believe "plain text" should be
> something that will be as easy to use (and handle) as
> ASCII plain text files were.

"Plain" per se means nothing, in this context. The term "plain text", in Unicode jargon, means the opposite of "rich text". "Rich text" (or "fancy text") is another Unicode jargon term, meaning text containing *mark-up*, such as HTML, XML, RTF, troff, TeX, proprietary word-processor formats, etc. Unicode text not containing mark-up is called "plain text", regardless of the fact that it might be quite "complicated" by the presence of BOMs, bidi controls, etc.

_ Marco
RE: Names for UTF-8 with and without BOM - pragmatic
Markus Scherer wrote:
> If software claims that it does not modify the contents of a
> document *except* for initial U+FEFF
> then it can do with initial U+FEFF what it wants. If the
> whole discussion hinges on what is allowed
> if software claims to not modify text then one need
> not claim that so absolutely.

That seems pretty straightforward, but only as long as your "software" is an editor and your "document" is a single file. How about a case where "software" is a copy or cat command, and instead of a document you have several (plain?) text files that you concat? What does "initial" mean here?

What happens next is: some software lets an initial BOM get through and appends such a string to a file or a stream. If other software treats it as a character, the data has been modified. On the other hand, if we want to allow software to disregard BOMs in the middle of character streams, then we have another set of security issues. And not removing is equally bad because of many consequences (in the end, we could end up with every character being preceded by a BOM).

> .txt  UTF-8  require  We want plain text files to
> have BOM to distinguish
> from legacy codepage files

Hmm, what does "plain" mean?! Perhaps files with a BOM should be called "text" files (or .txt files;) as opposed to "plain text" files, which in my opinion should be just that - _plain_ text. No ASCII plain text file had an ASCII signature. I believe "plain text" should be something that will be as easy to use (and handle) as ASCII plain text files were.

True, UTF-16 files do need a signature. Well, we just need to abandon the idea that UTF-16 can be used for plain text files. Let's have plain text files in UTF-8. Look at it as the most universal code page. Plain text files never contained information about the code page; why would there be such information in UTF-8 plain text files?!

How about this:
* BOM makes a file stateful.
* Plain text should NOT be stateful (or, we should make it as stateless as possible).
* If a text file is stateful, it is no longer a "plain text file"; it becomes a "text document".

BTW, since I may be tempted to process text documents with plain text tools, I would rather see that text documents would NOT have the BOM (yes, that effectively makes them plain text files). Since it seems that many people will insist that they want the option to have the BOM in text documents, it seems that it will need to be allowed. But I would not make it "required".

Lars Kristan
Re: Names for UTF-8 with and without BOM - pragmatic
Mark Davis wrote:
> Little probability that right double quote would appear at the start of a
> document either. Doesn't mean that you are free to delete it (*and* say
> that you are not modifying the contents).

This points to a pragmatic way to deal with this issue: If software claims that it does not modify the contents of a document *except* for initial U+FEFF, then it can do with initial U+FEFF what it wants. If the whole discussion hinges on what is allowed if software claims to not modify text, then one need not claim that so absolutely. Similarly, software may claim to not modify text contents _except_ that it may transform line endings into LS or any other convention. Not all software claims to not modify text, nor needs to claim that, and a lot of software does modify text.

> I agree that when the UTC decides that a BOM is *only* to be used as a
> signature, and that it would be ok to delete it anywhere in a document
> (like a non-character), then we are in much better shape. This was, as a
> matter of fact proposed for 3.2, but not approved. If we did that for 4.0,
> then there would be much less reason to distinguish UTF-8 'withBOM' from
> UTF-8 'withoutBOM'.

This would be good. The above would still be useful.

Joseph's request is actually different from the discussion of what is "the right thing": He mostly wants to have labels that distinguish between different things to be done. If there is no consensus for such labels here, then Joseph may need to use in his configuration file selectors that are separate from charset labels. For example:

Type  charset  BOM      Comment
.txt  UTF-8    require  We want plain text files to have BOM to distinguish
                        from legacy codepage files
.xml  UTF-8    forbid   Some XML processors may not cope with BOM
.htm  UTF-8    maybe    We want HTML to be UTF-8 but will not insist on BOM
.rc   not UTF  n/a      Unfortunately compiler insists on these being codepage.
.rc   UTF-16   require  Alternative to the previous line.
.swt  ASCII    n/a      Nonlocalizable internal format, must be ASCII.

markus
--
Opinions expressed here may not reflect my company's positions unless otherwise noted.
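[Editorial note: a file checker driven by a table like the one Markus sketches is straightforward to implement. The following Python fragment is an invented illustration of the idea — the table contents mirror three rows of Markus's example, but the function and its interface are not from the thread:]

```python
UTF8_SIG = b"\xef\xbb\xbf"

# Per-extension policy: (expected charset, BOM rule), after Markus's example.
POLICY = {
    ".txt": ("UTF-8", "require"),  # BOM distinguishes from legacy codepage files
    ".xml": ("UTF-8", "forbid"),   # some XML processors may not cope with a BOM
    ".htm": ("UTF-8", "maybe"),    # BOM acceptable but not insisted on
}

def check_bom_policy(ext: str, data: bytes) -> bool:
    """Return True if the file's leading bytes satisfy the BOM rule for its type."""
    _charset, rule = POLICY.get(ext, (None, "maybe"))
    has_sig = data.startswith(UTF8_SIG)
    if rule == "require":
        return has_sig
    if rule == "forbid":
        return not has_sig
    return True  # "maybe": either way is acceptable
```

The point of the design is exactly Markus's: the BOM rule is a selector of its own, so the charset column never needs names like "UTF-8 withBOM".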
Re: Names for UTF-8 with and without BOM
> So even if it were in there, who cares? I mean, can anyone explain why it
> would make a difference?

I personally wouldn't care if every instance of "Michael Kaplan" at the start of a file were deleted. Not the point. The actual point is that currently, as defined -- not as you would wish for it to be -- the FEFF is an actual character, and in circumstances where it is not clearly defined for use as a BOM, it cannot be removed without altering the content of the text. As I said in another message, the UTC could change this situation by completely deprecating the use of FEFF as anything but BOM. But it hasn't done it yet.

Mark
__
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>; "Unicode Mailing List" <[EMAIL PROTECTED]>
Sent: Sunday, November 03, 2002 13:02
Subject: Re: Names for UTF-8 with and without BOM

> From: "Mark Davis" <[EMAIL PROTECTED]>
>
> Ironic that for the purpose of dealing with THREE bytes that so many bytes
> are being wasted. :-)
>
> > Little probability that right double quote would appear at the start of a
> > document either. Doesn't mean that you are free to delete it (*and* say
> > that you are not modifying the contents).
>
> Interesting strawman there, Mark -- but there is a huge difference there.
> But even if we leave in the notion of it as a character and just deprecate
> its usage and people ignore that, then we are talking about a ZERO WIDTH NO
> BREAK SPACE. This character has the job of:
>
> 1) being invisible
> 2) not breaking text with it
>
> So even if it were in there, who cares? I mean, can anyone explain why it
> would make a difference?
>
> The one thing that no one has ever come up with is a reasonable case where
> it would be at the beginning of the document *yet* it was not a BOM.
>
> So we have a clear semantic for it at the beginning of a file -- it's a BOM.
> Period.
>
> If there is a higher level protocol as well and the protocol and the BOM
> both match, then that is great! Considering how much redundancy there is in
> the Unicode standard about some definitions, a redundant marker for a file
> seems a very trivial issue.
>
> If there is a higher level protocol as well and they do not match, then we
> are in fantasy land bizarro world, inventing edge cases because we have
> nothing better to do. :-) But for the sake of argument, let's pretend it's a
> real scenario -- in which case we treat it the same way as if your higher
> level protocol claims it's ISO-8859-1 and the BOM says it's UTF-32. It's an
> error.
>
> Problem solved!
>
> > I agree that when the UTC decides that a BOM is *only* to be used as a
> > signature, and that it would be ok to delete it anywhere in a document
> > (like a non-character), then we are in much better shape. This was, as a
> > matter of fact proposed for 3.2, but not approved. If we did that for 4.0,
> > then there would be much less reason to distinguish UTF-8 'withBOM' from
> > UTF-8 'withoutBOM'.
>
> There is no reason to worry about this case and no need to delete anything.
> This is a ZERO WIDTH NO BREAK SPACE we are talking about. The burden is on
> the people who think this is a scenario to bring proof that anyone is doing
> anything as unrealistic as this.
>
> There is an easy, clear, and unambiguous plan that can be used here which
> will always work. For once let's not opt to complicate it without reason.
>
> MichKa
Re: Names for UTF-8 with and without BOM
Mark Davis wrote:
> Little probability that right double quote would appear at the start
> of a document either. Doesn't mean that you are free to delete it
> (*and* say that you are not modifying the contents).

True, but right double quote:

(a) has a visible glyph with a well-defined human-readable meaning,
(b) isn't defined by Unicode as having a text-processing influence on adjoining characters (leaving the question wide open of what to do when there are fewer than two adjoining characters),
(c) doesn't have a second meaning as a signature that under certain conditions can be stripped.

> I agree that when the UTC decides that a BOM is *only* to be used as a
> signature, and that it would be ok to delete it anywhere in a document
> (like a non-character), then we are in much better shape. This was, as
> a matter of fact proposed for 3.2, but not approved. If we did that
> for 4.0, then there would be much less reason to distinguish UTF-8
> 'withBOM' from UTF-8 'withoutBOM'.

Every one of us will be grateful when that day comes.

-Doug Ewell
Fullerton, California
Re: Names for UTF-8 with and without BOM
From: "Mark Davis" <[EMAIL PROTECTED]>

Ironic that for the purpose of dealing with THREE bytes that so many bytes are being wasted. :-)

> Little probability that right double quote would appear at the start of a
> document either. Doesn't mean that you are free to delete it (*and* say that
> you are not modifying the contents).

Interesting strawman there, Mark -- but there is a huge difference there. But even if we leave in the notion of it as a character and just deprecate its usage and people ignore that, then we are talking about a ZERO WIDTH NO BREAK SPACE. This character has the job of:

1) being invisible
2) not breaking text with it

So even if it were in there, who cares? I mean, can anyone explain why it would make a difference?

The one thing that no one has ever come up with is a reasonable case where it would be at the beginning of the document *yet* it was not a BOM.

So we have a clear semantic for it at the beginning of a file -- it's a BOM. Period.

If there is a higher level protocol as well and the protocol and the BOM both match, then that is great! Considering how much redundancy there is in the Unicode standard about some definitions, a redundant marker for a file seems a very trivial issue.

If there is a higher level protocol as well and they do not match, then we are in fantasy land bizarro world, inventing edge cases because we have nothing better to do. :-) But for the sake of argument, let's pretend it's a real scenario -- in which case we treat it the same way as if your higher level protocol claims it's ISO-8859-1 and the BOM says it's UTF-32. It's an error.

Problem solved!

> I agree that when the UTC decides that a BOM is *only* to be used as a
> signature, and that it would be ok to delete it anywhere in a document (like
> a non-character), then we are in much better shape. This was, as a matter of
> fact proposed for 3.2, but not approved. If we did that for 4.0, then there
> would be much less reason to distinguish UTF-8 'withBOM' from UTF-8
> 'withoutBOM'.

There is no reason to worry about this case and no need to delete anything. This is a ZERO WIDTH NO BREAK SPACE we are talking about. The burden is on the people who think this is a scenario to bring proof that anyone is doing anything as unrealistic as this.

There is an easy, clear, and unambiguous plan that can be used here which will always work. For once let's not opt to complicate it without reason.

MichKa
Re: Names for UTF-8 with and without BOM
I don't know what you are trying to say. Perhaps you could explain it at the meeting next week.

Mark
__
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>; "Murray Sargent" <[EMAIL PROTECTED]>; "Joseph Boyle" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Saturday, November 02, 2002 04:18
Subject: Re: Names for UTF-8 with and without BOM

> From: "Mark Davis" <[EMAIL PROTECTED]>
>
> > That is not sufficient. The first three bytes could represent a real
> > content character, ZWNBSP, or they could be a BOM. The label doesn't tell
> > you.
>
> There are several problems with this supposition -- most notably the fact
> that there are cases that specifically claim this is not recommended and
> that U+2060 is preferred.
>
> > This is similar to UTF-16 CES vs UTF-16BE CES. In the first case, 0xFE
> > 0xFF represents a BOM, and is not part of the content. In the second
> > case, it does *not* represent a BOM -- it represents a ZWNBSP, and must
> > not be stripped. The difference here is that the encoding name tells you
> > exactly what the situation is.
>
> I do not see this as a realistic scenario. I would argue that if the BOM
> matches the encoding scheme, perhaps this was an intentional effort to make
> sure that applications which may not understand the higher level protocol
> can also see what the encoding scheme is.
>
> But even if we assume that someone has gone to the trouble of calling
> something UTF16BE and has 0xFE 0xFF at the beginning of the file: what kind
> of content *is* such a code point that this is even worth calling out as a
> special case?
>
> If the goal is clear and unambiguous text, then the best way would be to
> simplify ALL of this. It was previously decided to always call it a BOM;
> why not stick with that?
>
> MichKa
Re: Names for UTF-8 with and without BOM
Little probability that right double quote would appear at the start of a document either. Doesn't mean that you are free to delete it (*and* say that you are not modifying the contents).

I agree that when the UTC decides that a BOM is *only* to be used as a signature, and that it would be ok to delete it anywhere in a document (like a non-character), then we are in much better shape. This was, as a matter of fact, proposed for 3.2, but not approved. If we did that for 4.0, then there would be much less reason to distinguish UTF-8 'withBOM' from UTF-8 'withoutBOM'.

Mark
__
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: "Unicode Mailing List" <[EMAIL PROTECTED]>
Cc: "Mark Davis" <[EMAIL PROTECTED]>; "Murray Sargent" <[EMAIL PROTECTED]>; "Joseph Boyle" <[EMAIL PROTECTED]>
Sent: Saturday, November 02, 2002 13:27
Subject: Re: Names for UTF-8 with and without BOM

> Mark Davis wrote:
>
> > That is not sufficient. The first three bytes could represent a real
> > content character, ZWNBSP or they could be a BOM. The label doesn't
> > tell you.
>
> I have never understood under what circumstances a ZWNBSP would ever
> appear as the first character of a file. It wouldn't make any sense. A
> ZWNBSP prevents a word break between the preceding and following
> characters. If there *is* no preceding character, then what is the
> point of the ZWNBSP?
>
> Every time this topic comes up, I have asked why a true ZWNBSP would
> ever appear as the first character of a file. The only responses I've
> heard are:
>
> 1. It might not be a discrete file, but the second (or successive)
> piece of a file that was split up for some reason (transmission, etc.).
>
> In that case, the interpreting process should take its encoding cue from
> the first fragment, and should NEVER reinterpret fragments broken up at
> arbitrary points. (Imagine a process modifying a GIF or JPEG file, or
> converting CR/LF, based on fragments!) But this is not the point being
> discussed anyway; the point is whole files.
>
> 2. It could happen; Unicode allows any character to appear anywhere.
>
> Well, almost anywhere. But even so, the likelihood of a U+FEFF as
> ZWNBSP appearing at the start of an unsigned UTF-8 file is vanishingly
> small compared to the likelihood that the U+FEFF was intended to be a
> signature. The rare case is just too rare to invalidate the heuristic
> for the much more common case.
>
> In addition, as Michka points out, we now have U+2060 WORD JOINER, whose
> entire purpose in life is to be used as U+FEFF was formerly used, as a
> ZWNBSP. Any new Unicode text should use U+2060 and not U+FEFF as a word
> joiner. It's hard to imagine that UTC and WG2 would have standardized
> this if there was a lot of real-world text that used U+FEFF as ZWNBSP.
>
> -Doug Ewell
> Fullerton, California
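[Editorial note: the migration Doug describes — new text should use U+2060 WORD JOINER for the joining function, leaving U+FEFF to mean only a signature — can be sketched as a normalization step. This is an invented illustration of the behavior that would become fully legitimate only if the UTC deprecated the ZWNBSP semantics of U+FEFF, as discussed above; under the standard as it stood in 2002, rewriting non-initial U+FEFF does modify content:]

```python
BOM = "\ufeff"          # U+FEFF: signature when initial, legacy ZWNBSP otherwise
WORD_JOINER = "\u2060"  # U+2060: the dedicated word-joining character

def normalize_feff(text: str) -> str:
    """Drop a leading U+FEFF (treating it as a signature) and rewrite any
    remaining U+FEFF as U+2060, preserving the word-joining behavior."""
    if text.startswith(BOM):
        text = text[1:]
    return text.replace(BOM, WORD_JOINER)
```

This keeps Doug's heuristic explicit: initial U+FEFF is read as a signature, while interior occurrences keep their joining semantics via U+2060 rather than being silently deleted.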
Re: Names for UTF-8 with and without BOM
[EMAIL PROTECTED] scripsit:

> I find it interesting, then, to see Michael saying that, since Notepad
> sticks a BOM-cum-signature at the start of its UTF-8, the rest of the
> world should support it.

There is another argument, viz. ISO/IEC 10646, which plainly proclaims that the 8-BOM is a valid signature for UTF-8 files.

--
John Cowan                                   [EMAIL PROTECTED]
http://www.reutershealth.com                 http://www.ccil.org/~cowan
Even a refrigerator can conform to the XML Infoset, as long as it has
a door sticker saying "No information items inside".  --Eve Maler
Re: Names for UTF-8 with and without BOM
From: <[EMAIL PROTECTED]>

> In particular, I'm thinking of a situation about a year and a half ago
> (IIRC) in which Michael (and I and others) were strongly opposed to a
> suggestion that the Unicode Consortium should document a certain variation
> (perversion, some would say) of one of the Unicode encoding forms that a
> certain vendor had implemented in their software. On that occasion,
> Michael (and I and others) were arguing that, just because they had done
> something in their software, that shouldn't mean that the rest of the
> world should be forced to support their encoding form.
>
> I find it interesting, then, to see Michael saying that, since Notepad
> sticks a BOM-cum-signature at the start of its UTF-8, the rest of the
> world should support it.

I do not see the conflict, or the irony. Remember that what Notepad and others do is present mainly because it *is* in the XML standard. What was being done by those others with UTF-8 was not a part of the UTF-8 "standard" and was in fact specifically disallowed. In the end, note that UTF-8 was not compromised; they got their own [non-preferred] encoding scheme for their backcompat requirement, and they now have the "job" of making their products use it by name.

If someone has a bug or problem in their software, then it is of course their responsibility to fix it. On the other hand, if one pays attention to a possible (optional) recommendation in a standard, it is the standard's responsibility not to make people regret that they paid attention.

(Which is not to say that they got the "idea" from XML; I am not sure where the idea came from. I figure that there was a strong interest in making sure that when someone saved a file as UTF-8, it would still be considered UTF-8 when reloaded, rather than ASCII or ANSI [sic]. This is a good reason for such a decision in plain text -- and the fact that XML is after all "just text" is lost on no one...)
Given the strong lack of interest that XML has had in the notion of breaking old parsers or valid XML 1.0 streams, it seems unlikely (to me) that they would make such a breaking change in a future version of XML. MichKa
Re: Names for UTF-8 with and without BOM
On 11/02/2002 12:15:54 PM "Michael \(michka\) Kaplan" wrote: >> .xml UTF-8N Some XML processors may not cope with BOM > >Maybe they need to upgrade? Since people often edit the files in notepad, >many files are going to have it. A parser that cannot accept this reality is >not going to make it very long. Ah, now here's an interesting twist. I'm not saying I disagree with Michael. I'm just acknowledging my own need for intellectual honesty, and realising that sometimes we take opposite sides of an opinion because of other factors that we may or may not be conscious of. In particular, I'm thinking of a situation about a year and a half ago (IIRC) in which Michael (and I and others) were strongly opposed to a suggestion that the Unicode Consortium should document a certain variation (perversion, some would say) of one of the Unicode encoding forms that a certain vendor had implemented in their software. On that occasion, Michael (and I and others) were arguing that, just because they had done something in their software, that shouldn't mean that the rest of the world should be forced to support their encoding form. I find it interesting, then, to see Michael saying that, since Notepad sticks a BOM-cum-signature at the start of its UTF-8, the rest of the world should support it. Again, this is just an observation on the particular argument being used, but not on the suggestion being made. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: <[EMAIL PROTECTED]>
RE: Names for UTF-8 with and without BOM
On 11/02/2002 11:59:24 AM "Joseph Boyle" wrote: >The first time I thought of UTF-8Y it sounded too flippant, but actually it >is fairly self-explanatory if UTF-8 is taken as a given, and has the virtue >of being short. UTF-8Y (and UTF-8J) is not at all intuitive. "UTF-8-yuk"? The better counterpart IMO to UTF-8N[o BOM], if we need these labels at all, would be UTF-8B[OM]. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: <[EMAIL PROTECTED]>
Re: Names for UTF-8 with and without BOM
John Cowan wrote:
>
> Tex Texin scripsit:
>
> > Interestingly, although I didn't study it in detail, looking at rfc 2376
> > for prioritization over charset conflicts, it seems to recommend
> > stripping the BOM when converting from utf-16 to other charsets (and
> > without considering that ucs-4 would like to keep it). (section 5).
>
> The point is not to try to convert it into an FEFF character or some
> replacement thereof, like say "?".

That may be the intent, but it doesn't say that. It should say convert the BOM to the equivalent BOM for the target encoding, if there is one. Instead it says to strip it for other encodings. (I wish it were called a signature rather than a BOM for most of these usages.)

> > Also, in considering charset conflicts, 2376 fails to consider conflicts
> > between signature and the encoding declaration. (I have a utf-16BE BOM
> > and the encoding declaration is for utf-8...).
>
> The encoding declaration is supposed to trump all. So it is UTF-8, and
> since 0xFF is illegal in UTF-8, you blow chunks...

OK, but where is that written?

> > I'll have to check for a more up-to-date rfc.
>
> There is none.

OK. Sorry if I seem to be difficult. I am just rereading a few things with my new understanding to put the picture back together again.

tex

> --
> John Cowan <[EMAIL PROTECTED]>            http://www.reutershealth.com
> I amar prestar aen, han mathon ne nen,    http://www.ccil.org/~cowan
> han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_

--
Tex Texin   cell: +1 781 789 1898   mailto:Tex@;XenCraft.com
Xen Master                          http://www.i18nGuy.com
XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
Re: Names for UTF-8 with and without BOM
Doug,

Doug Ewell wrote:
>
> Tex Texin wrote:
>
> > However, I didn't realize that parsers were to allow for the
> > possibility of different signatures.
> > So a parser has to worry about scsu signatures, etc
>
> A parser only *has* to read UTF-8 without signature and UTF-16 with
> signature.

Yes, I thought so until I saw Michka's note. And I thought that gave me 100% utf-8 coverage. Apparently I would be leaving out the thousands ;-) that edit xml with notepad.

> It *may* read other encodings of its own choosing, including
> ISO 8859-1, SCSU, JOECODE, or US-BSCII. (However, I can't find anything
> that allows for SCSU with signature, which is a shame since UTS #6
> encourages the signature.)

Can I stand on the other side of the fence now and refer to market forces when it comes to ISO 8859 etc.? ;-)

Anyway, I think you understood the context of my whines -- it was just reaction to this silliness with open-ended signatures...

tex

--
Tex Texin   cell: +1 781 789 1898   mailto:Tex@;XenCraft.com
Xen Master                          http://www.i18nGuy.com
XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
Re: Names for UTF-8 with and without BOM
Tex Texin scripsit:

> Interestingly, although I didn't study it in detail, looking at rfc 2376
> for prioritization over charset conflicts, it seems to recommend
> stripping the BOM when converting from utf-16 to other charsets (and
> without considering that ucs-4 would like to keep it). (section 5).

The point is not to try to convert it into an FEFF character or some replacement thereof, like say "?".

> Also, in considering charset conflicts, 2376 fails to consider conflicts
> between signature and the encoding declaration. (I have a utf-16BE BOM
> and the encoding declaration is for utf-8...).

The encoding declaration is supposed to trump all. So it is UTF-8, and since 0xFF is illegal in UTF-8, you blow chunks...

> I'll have to check for a more up-to-date rfc.

There is none.

--
John Cowan <[EMAIL PROTECTED]>            http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,    http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_
Re: Names for UTF-8 with and without BOM
John Cowan wrote: > > Tex Texin scripsit: > > > So when the parser gets JOECODE, I can understand ignoring the signature > > and autodetection, but exactly how does it find the first "<"? > > Well, if it begins with an 00 byte, it can't be UTF-8 or UTF-16 (it might > be UTF-32 big-endian, but we'll suppose the parser can't handle that). > JOECODE is what's left. At worst it is in some other encoding and/or > not well-formed, in which case you expect an error and you get one. > Of course the processor knows that "<" is encoded as 0xFF in JOECODE > > The point is that signatures don't decode to a character: processors in > general, not just XML processors, are expected to skip them. > > > It must have to try all of the encodings known to it... ugh. > > In such a bad case, that's all you can do. John, The bad case is what I was whinging about, since more processors deal with more than 3 encodings. Ultimately, because the initial characters are fixed, autodetection is not as bad as it is for plaintext, I realize that. Interestingly, although I didn't study it in detail, looking at rfc 2376 for prioritization over charset conflicts, it seems to recommend stripping the BOM when converting from utf-16 to other charsets (and without considering that ucs-4 would like to keep it). (section 5). Also, in considering charset conflicts, 2376 fails to consider conflicts between signature and the encoding declaration. (I have a utf-16BE BOM and the encoding declaration is for utf-8...). I'll have to check for a more up-to-date rfc. All in all I agree with you and Michka (yes you were right, I was wrong Michael!) that it isn't that big a deal to support a variety of BOMs but the world did not need yet another way to sometimes (maybe its there), almost (maybe its unique), redundantly (one hopes its redundant and not conflicting) declare an encoding. 
tex

--
Tex Texin   cell: +1 781 789 1898   mailto:Tex@;XenCraft.com
Xen Master                          http://www.i18nGuy.com
XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
Re: Names for UTF-8 with and without BOM
Tex Texin wrote: > However, I didn't realize that parsers were to allow for the > possibility of different signatures. > So a parser has to worry about scsu signatures, etc A parser only *has* to read UTF-8 without signature and UTF-16 with signature. It *may* read other encodings of its own choosing, including ISO 8859-1, SCSU, JOECODE, or US-BSCII. (However, I can't find anything that allows for SCSU with signature, which is a shame since UTS #6 encourages the signature.) -Doug Ewell Fullerton, California
Re: Names for UTF-8 with and without BOM
You are mistaken about this -- XML claimed originally that it was valid but was not required. The notion that XML parsers would update to handle a new encoding form to strip off three bytes but would not conditionally strip those three bytes if they were the first three bytes of the file is an unrealistic one. MichKa - Original Message - From: "Tex Texin" <[EMAIL PROTECTED]> To: "Michael (michka) Kaplan" <[EMAIL PROTECTED]> Cc: "Mark Davis" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: Saturday, November 02, 2002 11:08 AM Subject: Re: Names for UTF-8 with and without BOM > "Michael (michka) Kaplan" wrote: > > > .xml UTF-8N Some XML processors may not cope with BOM > > > > Maybe they need to upgrade? Since people often edit the files in notepad, > > many files are going to have it. A parser that cannot accept this reality is > > not going to make it very long. > > I didn't think the XML standard allowed for utf-8 files to have a BOM. > The standard is quite clear about requiring 0xFEFF for utf-16. > I would have thought a proper parser would reject a non-utf-16 file > beginning with something other than "<". > > (The fact that notepad puts it there should be irrelevant.) > > Am I wrong about XML and the utf-8 signature? > > tex > > > -- > - > Tex Texin cell: +1 781 789 1898 mailto:Tex@;XenCraft.com > Xen Master http://www.i18nGuy.com > > XenCraft http://www.XenCraft.com > Making e-Business Work Around the World > - > >
Re: Names for UTF-8 with and without BOM
Tex Texin scripsit: > So when the parser gets JOECODE, I can understand ignoring the signature > and autodetection, but exactly how does it find the first "<"? Well, if it begins with an 00 byte, it can't be UTF-8 or UTF-16 (it might be UTF-32 big-endian, but we'll suppose the parser can't handle that). JOECODE is what's left. At worst it is in some other encoding and/or not well-formed, in which case you expect an error and you get one. Of course the processor knows that "<" is encoded as 0xFF in JOECODE The point is that signatures don't decode to a character: processors in general, not just XML processors, are expected to skip them. > It must have to try all of the encodings known to it... ugh. In such a bad case, that's all you can do. -- John Cowan [EMAIL PROTECTED] www.reutershealth.com www.ccil.org/~cowan Promises become binding when there is a meeting of the minds and consideration is exchanged. So it was at King's Bench in common law England; so it was under the common law in the American colonies; so it was through more than two centuries of jurisprudence in this country; and so it is today. --_Specht v. Netscape_
Re: Names for UTF-8 with and without BOM
John, I understand the flexibility of XML to use different encodings. However, I didn't realize that parsers were to allow for the possibility of different signatures. So a parser has to worry about scsu signatures, etc Whereas XML is so fussy about which characters it accepts, I am surprised at its flexibility for signatures. So when the parser gets JOECODE, I can understand ignoring the signature and autodetection, but exactly how does it find the first "<"? It must have to try all of the encodings known to it... ugh. tex John Cowan wrote: > > Tex Texin scripsit: > > > However, that leaves open the question whether only the Unicode > > transform signatures are acceptable or other signatures are also > > allowed. So if a vendor defines a code page, and defines a signature > > (perhaps mapping BOM/ZWNSP specifically to some code point or byte > > string) does that then become acceptable? > > IMHO yes. XML documents are not *required* to be in one of the character > sets that can be automatically detected by the methods of Appendix F. > You can encode your documents in (hypothetical) JOECODE, which uses leading > 00 as a signature (ignored by the XML parser) and then A=01, B=02, C=03, and so on. > Autodetection will not work here, but it is perfectly conformant to have > a processor that understands only UTF-8, UTF-16, and JOECODE. > > Of course some encodings, such as US-BSCII, which looks just like US-ASCII > except that A=0x42, B=0x41, a=0x62, b=0x61 will cause problems for anybody. > :-) > > I am a member of, but not speaking for, the XML Core WG. > > -- > John Cowan [EMAIL PROTECTED] www.ccil.org/~cowan www.reutershealth.com > "The competent programmer is fully aware of the strictly limited size of his own > skull; therefore he approaches the programming task in full humility, and among > other things he avoids clever tricks like the plague." 
--Edsger Dijkstra

--
Tex Texin   cell: +1 781 789 1898   mailto:Tex@;XenCraft.com
Xen Master                          http://www.i18nGuy.com
XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
Re: Names for UTF-8 with and without BOM
Hi John, I meant the character "<". As for notepad, what I should have either stated more completely or bit my tongue, is that where there is a standard in place (and where it is unambiguous) the mistakes of particular products shouldn't hold sway, unless they are tantamount to a de facto standard. I (personally) don't hold notepad in that class. In particular with respect to Michka's comment that parsers should upgrade to accommodate notepad's BOM, I rather thought notepad should be changed. But I certainly don't want to get into a debate on notepad's influence on the market, so let's pretend I bit my tongue in the last mail, and once again in this mail. ;-) tex John Cowan wrote: > > Tex Texin scripsit: > > > I didn't think the XML standard allowed for utf-8 files to have a BOM. > > This capability was never actually excluded, and was added by erratum > (and force-majeure, when it became clear that BOMful UTF-8 was going to > start becoming common). XML files are intended to be plain text, and > if a large source of plain text insists on a BOM, so be it. > > > The standard is quite clear about requiring 0xFEFF for utf-16. > > I would have thought a proper parser would reject a non-utf-16 file > > beginning with something other than "<". > > If by "<" you mean the *character* "<", then yes. If you mean the *byte* > 0x3C, then no: well-formed XML files can begin with any of 0x00 (UTF-32), > 0x3C (ASCII-compatible), 0x4C (EBCDIC), 0xEF (UTF-8 with BOM), 0xFE (UTF-16 > in BE order), or 0xFF (UTF-16 in LE order). In principle they could begin with > some other byte: 0x2B in UTF-7, e.g. > > > (The fact that notepad puts it there should be irrelevant.) > > Actual practice is never quite irrelevant. > > -- > John Cowan [EMAIL PROTECTED] http://www.reutershealth.com > "Mr. Lane, if you ever wish anything that I can do, all you will have > to do will be to send me a telegram asking and it will be done." > "Mr. 
Hearst, if you ever get a telegram from me asking you to do anything, you can put the telegram down as a forgery."

--
Tex Texin   cell: +1 781 789 1898   mailto:Tex@;XenCraft.com
Xen Master                          http://www.i18nGuy.com
XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
Re: Names for UTF-8 with and without BOM
Tex Texin scripsit: > However, that leaves open the question whether only the Unicode > transform signatures are acceptable or other signatures are also > allowed. So if a vendor defines a code page, and defines a signature > (perhaps mapping BOM/ZWNSP specifically to some code point or byte > string) does that then become acceptable? IMHO yes. XML documents are not *required* to be in one of the character sets that can be automatically detected by the methods of Appendix F. You can encode your documents in (hypothetical) JOECODE, which uses leading 00 as a signature (ignored by the XML parser) and then A=01, B=02, C=03, and so on. Autodetection will not work here, but it is perfectly conformant to have a processor that understands only UTF-8, UTF-16, and JOECODE. Of course some encodings, such as US-BSCII, which looks just like US-ASCII except that A=0x42, B=0x41, a=0x62, b=0x61 will cause problems for anybody. :-) I am a member of, but not speaking for, the XML Core WG. -- John Cowan [EMAIL PROTECTED] www.ccil.org/~cowan www.reutershealth.com "The competent programmer is fully aware of the strictly limited size of his own skull; therefore he approaches the programming task in full humility, and among other things he avoids clever tricks like the plague." --Edsger Dijkstra
Re: Names for UTF-8 with and without BOM
Tex Texin scripsit: > I didn't think the XML standard allowed for utf-8 files to have a BOM. This capability was never actually excluded, and was added by erratum (and force-majeure, when it became clear that BOMful UTF-8 was going to start becoming common). XML files are intended to be plain text, and if a large source of plain text insists on a BOM, so be it. > The standard is quite clear about requiring 0xFEFF for utf-16. > I would have thought a proper parser would reject a non-utf-16 file > beginning with something other than "<". If by "<" you mean the *character* "<", then yes. If you mean the *byte* 0x3C, then no: well-formed XML files can begin with any of 0x00 (UTF-32), 0x3C (ASCII-compatible), 0x4C (EBCDIC), 0xEF (UTF-8 with BOM), 0xFE (UTF-16 in BE order), or 0xFF (UTF-16 in LE order). In principle they could begin with some other byte: 0x2B in UTF-7, e.g. > (The fact that notepad puts it there should be irrelevant.) Actual practice is never quite irrelevant. -- John Cowan [EMAIL PROTECTED] http://www.reutershealth.com "Mr. Lane, if you ever wish anything that I can do, all you will have to do will be to send me a telegram asking and it will be done." "Mr. Hearst, if you ever get a telegram from me asking you to do anything, you can put the telegram down as a forgery."
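[Editor's aside: the leading-byte dispatch John describes can be sketched in a few lines. This is a minimal illustration of the idea, not a complete implementation of XML's Appendix F autodetection; the function name is invented here.]

```python
def guess_xml_encoding(data: bytes) -> str:
    """Guess the encoding family of an XML byte stream from its first bytes."""
    if data.startswith(b"\xef\xbb\xbf"):   # U+FEFF as UTF-8
        return "UTF-8 with BOM"
    if data.startswith(b"\xfe\xff"):       # BOM, big-endian
        return "UTF-16 big-endian"
    if data.startswith(b"\xff\xfe"):       # BOM, little-endian
        return "UTF-16 little-endian"
    if data.startswith(b"\x00"):           # '<' padded with NULs, or similar
        return "UTF-32 big-endian or other 00-leading encoding"
    if data.startswith(b"\x3c"):           # '<' in any ASCII-compatible encoding
        return "ASCII-compatible"
    if data.startswith(b"\x4c"):           # '<' in EBCDIC
        return "EBCDIC family"
    if data.startswith(b"\x2b"):           # '+' could open a UTF-7 sequence
        return "possibly UTF-7"
    return "unknown"
```

In the ASCII-compatible and EBCDIC cases the parser must still read the encoding declaration to narrow down the exact charset; the first byte only picks the family.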
Re: Names for UTF-8 with and without BOM
Thanks Doug. I had looked at the standard not at the appendix. I think that (non-normative) appendix is unfortunate. It seems to imply (to my mind) that if other character sets define BOMs that it is ok to use them as XML signatures. My reasoning is that the standard itself only says that UTF-16 must have a signature and everything else except utf-8 must declare their encoding. The standard doesn't say whether other encodings should or should not be allowed to use signatures. The appendix F by defining the other Unicode signatures implies they are acceptable (without specifically stating so). The text of the standard however doesn't suggest even that UCS-4 would use a signature, as it doesn't include it with utf-16 when speaking about it requiring a BOM, and specifically says the name of UCS-4 to use in the declaration, as with other encodings. However, that leaves open the question whether only the Unicode transform signatures are acceptable or other signatures are also allowed. So if a vendor defines a code page, and defines a signature (perhaps mapping BOM/ZWNSP specifically to some code point or byte string) does that then become acceptable? Of course we hope not, and I am sure the authors did not intend so, but without a statement about which signatures are allowed or not allowed beyond UTF-16, I think the can of worms is opened. OK, having raised the issue I'll take it up with the w3c i18n group to get their understanding and then the xml group if needed. tex Doug Ewell wrote: > > Tex Texin wrote: > > > I didn't think the XML standard allowed for utf-8 files to have a BOM. > > The standard is quite clear about requiring 0xFEFF for utf-16. > > I would have thought a proper parser would reject a non-utf-16 file > > beginning with something other than "<". > > The standard explicitly allows UCS-4, UTF-16, and UTF-8 files to begin > with a BOM. 
> See Appendix F.1, "Detection Without External Encoding
> Information":
>
> http://www.w3.org/TR/REC-xml#sec-guessing
>
> -Doug Ewell
> Fullerton, California

--
Tex Texin   cell: +1 781 789 1898   mailto:Tex@;XenCraft.com
Xen Master                          http://www.i18nGuy.com
XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
Re: Names for UTF-8 with and without BOM
Tex Texin wrote: > I didn't think the XML standard allowed for utf-8 files to have a BOM. > The standard is quite clear about requiring 0xFEFF for utf-16. > I would have thought a proper parser would reject a non-utf-16 file > beginning with something other than "<". The standard explicitly allows UCS-4, UTF-16, and UTF-8 files to begin with a BOM. See Appendix F.1, "Detection Without External Encoding Information": http://www.w3.org/TR/REC-xml#sec-guessing -Doug Ewell Fullerton, California
Re: Names for UTF-8 with and without BOM
Mark Davis wrote: > That is not sufficient. The first three bytes could represent a real > content character, ZWNBSP or they could be a BOM. The label doesn't > tell you. I have never understood under what circumstances a ZWNBSP would ever appear as the first character of a file. It wouldn't make any sense. A ZWNBSP prevents a word break between the preceding and following characters. If there *is* no preceding character, then what is the point of the ZWNBSP? Every time this topic comes up, I have asked why a true ZWNBSP would ever appear as the first character of a file. The only responses I've heard are: 1. It might not be a discrete file, but the second (or successive) piece of a file that was split up for some reason (transmission, etc.). In that case, the interpreting process should take its encoding cue from the first fragment, and should NEVER reinterpret fragments broken up at arbitrary points. (Imagine a process modifying a GIF or JPEG file, or converting CR/LF, based on fragments!) But this is not the point being discussed anyway; the point is whole files. 2. It could happen; Unicode allows any character to appear anywhere. Well, almost anywhere. But even so, the likelihood of a U+FEFF as ZWNBSP appearing at the start of an unsigned UTF-8 file is vanishingly small compared to the likelihood that the U+FEFF was intended to be a signature. The rare case is just too rare to invalidate the heuristic for the much more common case. In addition, as Michka points out, we now have U+2060 WORD JOINER, whose entire purpose in life is to be used as U+FEFF was formerly used, as a ZWNBSP. Any new Unicode text should use U+2060 and not U+FEFF as a word joiner. It's hard to imagine that UTC and WG2 would have standardized this if there was a lot of real-world text that used U+FEFF as ZWNBSP. -Doug Ewell Fullerton, California
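[Editor's aside: Doug's heuristic, treating a leading U+FEFF in UTF-8 as a signature rather than content, is easy to state as code. A minimal sketch; the helper name is invented for illustration.]

```python
import codecs  # codecs.BOM_UTF8 is the three bytes EF BB BF

def split_utf8_signature(data: bytes) -> tuple[bool, bytes]:
    """If the byte stream starts with EF BB BF, treat it as a signature
    and return the content without it; otherwise return it unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return True, data[len(codecs.BOM_UTF8):]
    return False, data
```

Under Doug's argument, the rare file that really begins with a content ZWNBSP loses nothing important, since new text should use U+2060 WORD JOINER for that purpose anyway.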
Re: Names for UTF-8 with and without BOM
"Michael (michka) Kaplan" wrote:

> > .xml UTF-8N Some XML processors may not cope with BOM
>
> Maybe they need to upgrade? Since people often edit the files in notepad,
> many files are going to have it. A parser that cannot accept this reality is
> not going to make it very long.

I didn't think the XML standard allowed for utf-8 files to have a BOM. The standard is quite clear about requiring 0xFEFF for utf-16. I would have thought a proper parser would reject a non-utf-16 file beginning with something other than "<".

(The fact that notepad puts it there should be irrelevant.)

Am I wrong about XML and the utf-8 signature?

tex

--
Tex Texin   cell: +1 781 789 1898   mailto:Tex@;XenCraft.com
Xen Master                          http://www.i18nGuy.com
XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
Re: Names for UTF-8 with and without BOM
From: "Joseph Boyle" <[EMAIL PROTECTED]>

> These are listed as examples to demonstrate the idea of a configuration file
> listing encoding constraints. The fact that each constraint is arguable is a
> good reason to make the constraints configurable, and therefore to have
> names to distinguish BOM and non-BOM UTF-8.

Yes, but the fact that every one of them can have it or not, and that only inadequate parsers will ever really have a problem with them, is a good indication that it is not really required for the users who care about separate charset names.

MichKa
RE: Names for UTF-8 with and without BOM
These are listed as examples to demonstrate the idea of a configuration file listing encoding constraints. The fact that each constraint is arguable is a good reason to make the constraints configurable, and therefore to have names to distinguish BOM and non-BOM UTF-8.

-----Original Message-----
From: Michael (michka) Kaplan [mailto:michka@;trigeminal.com]
Sent: Saturday, November 02, 2002 10:16 AM
To: Joseph Boyle; Mark Davis; Murray Sargent
Cc: [EMAIL PROTECTED]
Subject: Re: Names for UTF-8 with and without BOM

From: "Joseph Boyle" <[EMAIL PROTECTED]>

> Type Encoding Comment
> .txt UTF-8BOM We want plain text files to have BOM to distinguish from
> legacy codepage files

Not really required, but optional -- the performance hit of making sure it's valid UTF-8 is pretty minor. But people do open some *huge* text files in things like notepad

> .xml UTF-8N Some XML processors may not cope with BOM

Maybe they need to upgrade? Since people often edit the files in notepad, many files are going to have it. A parser that cannot accept this reality is not going to make it very long.

> .htm UTF-8 We want HTML to be UTF-8 but will not insist on BOM

Same as text, with the bonus of the possibility of a higher level protocol. It can still go either way.

> .rc Codepage Unfortunately compiler insists on these being codepage.

They can be UTF-16, too (at least on Win32!).

> .swt ASCII Nonlocalizable internal format, must be ASCII.

Haven't run across these -- but note that if it's not UTF-8 then it does not apply.
Re: Names for UTF-8 with and without BOM
From: "Joseph Boyle" <[EMAIL PROTECTED]>

> Type Encoding Comment
> .txt UTF-8BOM We want plain text files to have BOM to distinguish
> from legacy codepage files

Not really required, but optional -- the performance hit of making sure it's valid UTF-8 is pretty minor. But people do open some *huge* text files in things like notepad

> .xml UTF-8N Some XML processors may not cope with BOM

Maybe they need to upgrade? Since people often edit the files in notepad, many files are going to have it. A parser that cannot accept this reality is not going to make it very long.

> .htm UTF-8 We want HTML to be UTF-8 but will not insist on BOM

Same as text, with the bonus of the possibility of a higher level protocol. It can still go either way.

> .rc Codepage Unfortunately compiler insists on these being
> codepage.

They can be UTF-16, too (at least on Win32!).

> .swt ASCII Nonlocalizable internal format, must be ASCII.

Haven't run across these -- but note that if it's not UTF-8 then it does not apply.
RE: Names for UTF-8 with and without BOM
The first time I thought of UTF-8Y it sounded too flippant, but actually it is fairly self-explanatory if UTF-8 is taken as a given, and has the virtue of being short. UTF-8S for signature would also make sense, but is the same as the name of Toby Phipps's proposal which eventually became CESU-8. UTF-8J will certainly make sense, after UTC changes all the character names to Esperanto, conducts its meetings in Esperanto, and publishes TUS in Esperanto. If we want to be really explicit, there's UTF-8EFBBBF. -Original Message- From: William Overington [mailto:WOverington@;ngo.globalnet.co.uk] Sent: Friday, November 01, 2002 10:37 PM To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: Re: Names for UTF-8 with and without BOM As you have UTF-8N where the N stands for the word "no" one could possibly have UTF-8Y where the Y stands for the word "yes". Thus one could have the name of the format answering, or not answering, the following question. Is there a BOM encoded? However, using the letter Y has three disadvantages for widespread use. The letter Y could be confused with the word "why", the word "yes" is English, so the designation would be anglocentric, and the letter Y sorts alphabetically after the letter N. However, if one considers the use of the international language Esperanto, then the N would mean "ne", that is, the Esperanto word for "no" and thus one could use the letter J to stand for the Esperanto word "jes" which is the Esperanto word for "yes" and which, in fact, is pronounced exactly the same as the English word "yes". Thus, I suggest that the three formats could be UTF-8, UTF-8J and UTF-8N, which would solve the problem in a manner which, being based upon a neutral language, will hopefully be acceptable to all. William Overington 2 November 2002
RE: Names for UTF-8 with and without BOM
The main need I see is not to tell a consumer whether a leading U+FEFF is a BOM or ZWNBSP, but:

* for producers (telling whether to emit a BOM or not), and
* normative (a checker enforcing an encoding standard per file type, defined in a table like the one below)

Type   Encoding   Comment
.txt   UTF-8BOM   We want plain text files to have BOM to distinguish from legacy codepage files
.xml   UTF-8N     Some XML processors may not cope with BOM
.htm   UTF-8      We want HTML to be UTF-8 but will not insist on BOM
.rc    Codepage   Unfortunately compiler insists on these being codepage.
.swt   ASCII      Nonlocalizable internal format, must be ASCII.

Please consider the proposal for separate charset names on that basis and not on the basis of utility for telling a consumer whether U+FEFF is a BOM, which I agree is by now a nonissue.

-----Original Message-----
From: Michael (michka) Kaplan [mailto:michka@;trigeminal.com]
Sent: Saturday, November 02, 2002 4:18 AM
To: Mark Davis; Murray Sargent; Joseph Boyle
Cc: [EMAIL PROTECTED]
Subject: Re: Names for UTF-8 with and without BOM

From: "Mark Davis" <[EMAIL PROTECTED]>

> That is not sufficient. The first three bytes could represent a real content
> character, ZWNBSP or they could be a BOM. The label doesn't tell you.

There are several problems with this supposition -- most notably the fact that there are cases that specifically claim this is not recommended and that U+2060 is preferred.

> This is similar to UTF-16 CES vs UTF-16BE CES. In the first case, 0xFE 0xFF
> represents a BOM, and is not part of the content. In the second case,
> it does *not* represent a BOM -- it represents a ZWNBSP, and must not
> be stripped. The difference here is that the encoding name tells you
> exactly what the situation is.

I do not see this as a realistic scenario. I would argue that if the BOM matches the encoding scheme, perhaps this was an intentional effort to make sure that applications which may not understand the higher level protocol can also see what the encoding scheme is.
But even if we assume that someone has gone to the trouble of calling something UTF-16BE and has 0xFE 0xFF at the beginning of the file: what kind of content *is* such a code point that this is even worth calling out as a special case?

If the goal is clear and unambiguous text, then the best way would be to simplify ALL of this. It was previously decided to always call it a BOM; why not stick with that?

MichKa
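The "normative checker" Joseph describes can be sketched in a few lines. This is a minimal illustration only, assuming a policy table like his; the file names, rule labels, and the `check` function are hypothetical, not part of any real tool.

```python
# Sketch of a checker enforcing a per-file-type encoding policy like the
# table above. The extensions and rule names mirror the table; "UTF-8BOM"
# requires a BOM, "UTF-8N" forbids one, "UTF-8" allows either, and
# "ASCII" requires 7-bit bytes with no BOM.

BOM_UTF8 = b"\xef\xbb\xbf"

POLICY = {
    ".txt": "UTF-8BOM",
    ".xml": "UTF-8N",
    ".htm": "UTF-8",
    ".swt": "ASCII",
}

def check(name: str, data: bytes) -> bool:
    """Return True if the raw bytes satisfy the policy for this file type."""
    if "." not in name:
        return True                      # no extension, no rule
    rule = POLICY.get(name[name.rfind("."):])
    if rule is None:
        return True                      # no rule for this type
    has_bom = data.startswith(BOM_UTF8)
    if rule == "ASCII":
        return not has_bom and all(b < 0x80 for b in data)
    body = data[len(BOM_UTF8):] if has_bom else data
    try:
        body.decode("utf-8")             # must be well-formed UTF-8
    except UnicodeDecodeError:
        return False
    if rule == "UTF-8BOM":
        return has_bom
    if rule == "UTF-8N":
        return not has_bom
    return True                          # "UTF-8": BOM optional
```

Run over a source tree, such a check lets release engineering enforce the table mechanically rather than by convention.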
Re: Names for UTF-8 with and without BOM
From: "Mark Davis" <[EMAIL PROTECTED]>

> That is not sufficient. The first three bytes could represent a real content
> character, ZWNBSP or they could be a BOM. The label doesn't tell you.

There are several problems with this supposition -- most notably the fact that there are cases that specifically state this is not recommended and that U+2060 is preferred.

> This is similar to UTF-16 CES vs UTF-16BE CES. In the first case, 0xFE 0xFF
> represents a BOM, and is not part of the content. In the second case, it
> does *not* represent a BOM -- it represents a ZWNBSP, and must not be
> stripped. The difference here is that the encoding name tells you exactly
> what the situation is.

I do not see this as a realistic scenario. I would argue that if the BOM matches the encoding scheme, perhaps this was an intentional effort to make sure that applications which may not understand the higher-level protocol can also see what the encoding scheme is.

But even if we assume that someone has gone to the trouble of calling something UTF-16BE and has 0xFE 0xFF at the beginning of the file: what kind of content *is* such a code point that this is even worth calling out as a special case?

If the goal is clear and unambiguous text, then the best way would be to simplify ALL of this. It was previously decided to always call it a BOM; why not stick with that?

MichKa
Re: Names for UTF-8 with and without BOM
That is not sufficient. The first three bytes could represent a real content character, a ZWNBSP, or they could be a BOM. The label doesn't tell you.

This is similar to UTF-16 CES vs UTF-16BE CES. In the first case, 0xFE 0xFF represents a BOM, and is not part of the content. In the second case, it does *not* represent a BOM -- it represents a ZWNBSP, and must not be stripped. The difference here is that the encoding name tells you exactly what the situation is.

Mark
__
http://www.macchiato.com
► "Eppur si muove" ◄

----- Original Message -----
From: "Murray Sargent" <[EMAIL PROTECTED]>
To: "Joseph Boyle" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Friday, November 01, 2002 12:42
Subject: RE: Names for UTF-8 with and without BOM

> Joseph Boyle says: "It would be useful to have official names to
> distinguish UTF-8 with and without BOM."
>
> To see if a UTF-8 file has no BOM, you can just look at the first three
> bytes. Is this a problem? Typically when you care about a file's
> encoding form, you plan to read the file.
>
> Thanks
> Murray
Re: Names for UTF-8 with and without BOM
As you have UTF-8N, where the N stands for the word "no", one could possibly have UTF-8Y, where the Y stands for the word "yes". Thus one could have the name of the format answering, or not answering, the question: is there a BOM encoded?

However, using the letter Y has three disadvantages for widespread use: the letter Y could be confused with the word "why"; the word "yes" is English, so the designation would be anglocentric; and the letter Y sorts alphabetically after the letter N.

However, if one considers the use of the international language Esperanto, then the N would mean "ne", the Esperanto word for "no", and one could use the letter J to stand for "jes", the Esperanto word for "yes", which is in fact pronounced exactly the same as the English word "yes".

Thus, I suggest that the three formats could be UTF-8, UTF-8J and UTF-8N, which would solve the problem in a manner which, being based upon a neutral language, will hopefully be acceptable to all.

William Overington
2 November 2002
RE: Names for UTF-8 with and without BOM
Joseph Boyle says: "It would be useful to have official names to distinguish UTF-8 with and without BOM."

To see if a UTF-8 file has no BOM, you can just look at the first three bytes. Is this a problem? Typically when you care about a file's encoding form, you plan to read the file.

Thanks
Murray
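Murray's check is a three-byte comparison. A minimal sketch (the function name is illustrative):

```python
# A UTF-8 stream "has a BOM" exactly when its first three bytes are the
# UTF-8 encoding of U+FEFF: EF BB BF.
def has_utf8_bom(data: bytes) -> bool:
    return data[:3] == b"\xef\xbb\xbf"
```

This is trivially cheap, which is Murray's point; the open question in the thread is not detection but what the charset *name* should promise about those bytes.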
Re: Names for UTF-8 with and without BOM
> Perhaps it is time to think of three other words starting with B, O, M
> that make a better explanation.

Bollixed Operational Muddle ;-)

--Ken