produce
internal ZWNBSPs is not part of any of our processing as far as I know.
-Original Message-
From: David Starner [mailto:[EMAIL PROTECTED]]
Sent: Thursday, November 07, 2002 12:14 PM
To: Markus Scherer
Cc: unicode
Subject: Re: Names for UTF-8 with and without BOM - pragmatic
On Wed
Initial for each piece, as each is assumed to be a complete
text file before concatenation. Nothing
prevents copy/cp/cat and other commands from recognizing
Unicode signatures, for as long as they
don't claim to preserve initial U+FEFF.
Yes there is, in a formal sense, for cat and cp.
On Wed, Nov 06, 2002 at 09:47:43AM -0800, Markus Scherer wrote:
The fact is that Windows uses UTF-8 and UTF-16 plain text files with
signatures (BOMs) very simply, gracefully, and successfully. It has applied
what I called the pragmatic approach here for about 10 years. It just
works.
It
Markus Scherer wrote:
If software claims that it does not modify the contents of a
document *except* for initial U+FEFF
then it can do with initial U+FEFF what it wants. If the
whole discussion hinges on what is allowed
*if software claims to not modify text* then one need
not claim
Lars Kristan wrote:
.txt UTF-8 require We want plain text files to
have BOM to distinguish
from legacy codepage files
H, what does plain mean?! Perhaps files with a BOM
should be called text files (or .txt files;) as
opposed to plain
True, UTF-16 files do need a signature.
Eh, no! UTF-16BE and UTF-16LE files (or whatever kind of text
data element) do not have any signature/BOM. Not even files (somehow)
labelled UTF-16 need have a signature/BOM, without a BOM they are
then the same as if it was labelled UTF-16BE.
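
The rule described above (data labelled plain "UTF-16" may or may not start with a BOM, and is read as big-endian when it does not) can be sketched roughly as below. This is only an illustration: the function name `decode_labelled_utf16` is hypothetical, and it assumes the whole text is already in memory as bytes. Note that Python's own "utf-16" codec falls back to the platform's native byte order when no BOM is present, so the sketch picks the byte order explicitly instead.

```python
import codecs


def decode_labelled_utf16(data: bytes) -> str:
    """Decode bytes labelled "UTF-16" (hypothetical helper).

    A leading BOM selects the byte order and is consumed as a signature;
    without a BOM the data is treated as big-endian.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[len(codecs.BOM_UTF16_BE):].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[len(codecs.BOM_UTF16_LE):].decode("utf-16-le")
    # No signature: same as if the data had been labelled UTF-16BE.
    return data.decode("utf-16-be")
```

Data labelled UTF-16BE or UTF-16LE would instead be passed straight to the "utf-16-be" or "utf-16-le" codec, with no signature handling at all.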
Lars Kristan wrote:
Markus Scherer wrote:
If software claims that it does not modify the contents of a
document *except* for initial U+FEFF
then it can do with initial U+FEFF what it wants. If the
whole discussion hinges on what is allowed
*if software claims to not modify text* then one
Mark Davis wrote:
Little probability that right double quote would appear at the start of a
document either. Doesn't mean that you are free to delete it (*and* say that
you are not modifying the contents).
This points to a pragmatic way to deal with this issue:
If software claims that it does
On 11/02/2002 12:15:54 PM Michael (michka) Kaplan wrote:
.xml UTF-8N Some XML processors may not cope with BOM
Maybe they need to upgrade? Since people often edit the files in notepad,
many files are going to have it. A parser that cannot accept this reality
is
not going to make it very
On 11/02/2002 11:59:24 AM Joseph Boyle wrote:
The first time I thought of UTF-8Y it sounded too flippant, but actually
it
is fairly self-explanatory if UTF-8 is taken as a given, and has the
virtue
of being short.
UTF-8Y (and UTF-8J) is not at all intuitive. UTF-8-yuk? The better
counterpart
From: [EMAIL PROTECTED]
In particular, I'm thinking of a situation about a year and a half ago
(IIRC) in which Michael (and I and others) were strongly opposed to a
suggestion that the Unicode Consortium should document a certain variation
(perversion, some would say) of one of the Unicode
[EMAIL PROTECTED] scripsit:
I find it interesting, then, to see Michael saying that, since Notepad
sticks a BOM-cum-signature at the start of its UTF-8, the rest of the
world should support it.
There is another argument, viz. ISO/IEC 10646, which plainly proclaims
that the 8-BOM is a valid
, 2002 13:27
Subject: Re: Names for UTF-8 with and without BOM
Mark Davis mark dot davis at jtcsv dot com wrote:
That is not sufficient. The first three bytes could represent a real
content character, ZWNBSP or they could be a BOM. The label doesn't
tell you.
I have never understood under
]; Murray Sargent
[EMAIL PROTECTED]; Joseph Boyle [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Saturday, November 02, 2002 04:18
Subject: Re: Names for UTF-8 with and without BOM
From: Mark Davis [EMAIL PROTECTED]
That is not sufficient. The first three bytes could represent a real
content
From: Mark Davis [EMAIL PROTECTED]
Ironic that for the purpose of dealing with THREE bytes that so many bytes
are being wasted. :-)
Little probability that right double quote would appear at the start of a
document either. Doesn't mean that you are free to delete it (*and* say
that
you are
Mark Davis mark dot davis at jtcsv dot com wrote:
Little probability that right double quote would appear at the start
of a document either. Doesn't mean that you are free to delete it
(*and* say that you are not modifying the contents).
True, but right double quote:
(a) has a visible glyph
for UTF-8 with and without BOM
From: Mark Davis [EMAIL PROTECTED]
Ironic that for the purpose of dealing with THREE bytes that so many bytes
are being wasted. :-)
Little probability that right double quote would appear at the start of
a
document either. Doesn't mean that you are free
13:02
Subject: Re: Names for UTF-8 with and without BOM
From: Mark Davis [EMAIL PROTECTED]
Ironic that for the purpose of dealing with THREE bytes that so many bytes
are being wasted. :-)
Little probability that right double quote would appear at the start of
a
document either. Doesn't
Sargent [EMAIL PROTECTED]
To: Joseph Boyle [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Friday, November 01, 2002 12:42
Subject: RE: Names for UTF-8 with and without BOM
Joseph Boyle says: It would be useful to have official names to
distinguish UTF-8 with and without BOM.
To see if a UTF-8 file
From: Mark Davis [EMAIL PROTECTED]
That is not sufficient. The first three bytes could represent a real
content
character, ZWNBSP or they could be a BOM. The label doesn't tell you.
There are several problems with this supposition -- most notably the fact
that there are cases that specifically
;ngo.globalnet.co.uk]
Sent: Friday, November 01, 2002 10:37 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Subject: Re: Names for UTF-8 with and without BOM
As you have UTF-8N where the N stands for the word no one could possibly
have UTF-8Y where the Y stands for the word yes.
Thus one could have the name
From: Joseph Boyle [EMAIL PROTECTED]
Type Encoding Comment
.txt UTF-8BOM We want plain text files to have BOM to distinguish
from legacy codepage files
Not really required, but optional -- the performance hit of making sure it's
valid UTF-8 is pretty minor. But people do open some *huge* text
-
From: Michael (michka) Kaplan [mailto:michka;trigeminal.com]
Sent: Saturday, November 02, 2002 10:16 AM
To: Joseph Boyle; Mark Davis; Murray Sargent
Cc: [EMAIL PROTECTED]
Subject: Re: Names for UTF-8 with and without BOM
From: Joseph Boyle [EMAIL PROTECTED]
Type Encoding Comment
.txt UTF-8BOM
From: Joseph Boyle [EMAIL PROTECTED]
These are listed as examples to demonstrate the idea of a configuration
file
listing encoding constraints. The fact that each constraint is arguable is
a
good reason to make the constraints configurable, and therefore to have
names to distinguish BOM and
Mark Davis mark dot davis at jtcsv dot com wrote:
That is not sufficient. The first three bytes could represent a real
content character, ZWNBSP or they could be a BOM. The label doesn't
tell you.
I have never understood under what circumstances a ZWNBSP would ever
appear as the first
Thanks Doug. I had looked at the standard not at the appendix.
I think that (non-normative) appendix is unfortunate. It seems to imply
(to my mind) that if other character sets define BOMs that it is ok to
use them as XML signatures.
My reasoning is that the standard itself only says that
Tex Texin scripsit:
I didn't think the XML standard allowed for utf-8 files to have a BOM.
This capability was never actually excluded, and was added by erratum
(and force-majeure, when it became clear that BOMful UTF-8 was going to
start becoming common). XML files are intended to be plain
Tex Texin scripsit:
However, that leaves open the question whether only the Unicode
transform signatures are acceptable or other signatures are also
allowed. So if a vendor defines a code page, and defines a signature
(perhaps mapping BOM/ZWNBSP specifically to some code point or byte
string)
Hi John,
I meant the character .
As for notepad, what I should have either stated more completely or bit
my tongue, is that where there is a standard in place (and where it is
unambiguous) the mistakes of particular products shouldn't hold sway,
unless they are tantamount to a de facto standard.
John,
I understand the flexibility of XML to use different encodings.
However, I didn't realize that parsers were to allow for the possibility
of different signatures.
So a parser has to worry about scsu signatures, etc
Whereas XML is so fussy about which characters it accepts, I am
Tex Texin scripsit:
So when the parser gets JOECODE, I can understand ignoring the signature
and autodetection, but exactly how does it find the first ?
Well, if it begins with an 00 byte, it can't be UTF-8 or UTF-16 (it might
be UTF-32 big-endian, but we'll suppose the parser can't handle
is an unrealistic one.
MichKa
- Original Message -
From: Tex Texin [EMAIL PROTECTED]
To: Michael (michka) Kaplan [EMAIL PROTECTED]
Cc: Mark Davis [EMAIL PROTECTED]; [EMAIL PROTECTED]
Sent: Saturday, November 02, 2002 11:08 AM
Subject: Re: Names for UTF-8 with and without BOM
Michael (michka
Tex Texin tex at i18nguy dot com wrote:
However, I didn't realize that parsers were to allow for the
possibility of different signatures.
So a parser has to worry about scsu signatures, etc
A parser only *has* to read UTF-8 without signature and UTF-16 with
signature. It *may* read other
John Cowan wrote:
Tex Texin scripsit:
So when the parser gets JOECODE, I can understand ignoring the signature
and autodetection, but exactly how does it find the first ?
Well, if it begins with an 00 byte, it can't be UTF-8 or UTF-16 (it might
be UTF-32 big-endian, but we'll suppose
Tex Texin scripsit:
Interestingly, although I didn't study it in detail, looking at rfc 2376
for prioritization over charset conflicts, it seems to recommend
stripping the BOM when converting from utf-16 to other charsets (and
without considering that ucs-4 would like to keep it). (section
Doug,
Doug Ewell wrote:
Tex Texin tex at i18nguy dot com wrote:
However, I didn't realize that parsers were to allow for the
possibility of different signatures.
So a parser has to worry about scsu signatures, etc
A parser only *has* to read UTF-8 without signature and UTF-16
John Cowan wrote:
Tex Texin scripsit:
Interestingly, although I didn't study it in detail, looking at rfc 2376
for prioritization over charset conflicts, it seems to recommend
stripping the BOM when converting from utf-16 to other charsets (and
without considering that ucs-4 would
Perhaps it
is time to think of three other words starting with B, O, M that make a
better explanation.)
Bollixed Operational Muddle ;-)
--Ken
Joseph Boyle says: It would be useful to have official names to
distinguish UTF-8 with and without BOM.
To see if a UTF-8 file has no BOM, you can just look at the first three
bytes. Is this a problem? Typically when you care about a file's
encoding form, you plan to read the file.
Thanks
Murray
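
The first-three-bytes check mentioned above might look something like this in Python. It is a minimal sketch, not anyone's actual implementation: `has_utf8_bom` is a hypothetical name, and the check only reports that the bytes EF BB BF are present. It cannot say whether that initial U+FEFF was meant as a signature or as a content character.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the bytes EF BB BF (hypothetical helper)."""
    return data.startswith(codecs.BOM_UTF8)


sample = codecs.BOM_UTF8 + "abc".encode("utf-8")

# The plain "utf-8" codec keeps the initial U+FEFF as a character,
# while "utf-8-sig" treats it as a signature and drops it.
kept = sample.decode("utf-8")
dropped = sample.decode("utf-8-sig")
```

Here `kept` begins with U+FEFF and `dropped` does not, which is exactly the difference the proposed names are meant to label.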
As you have UTF-8N where the N stands for the word no one could possibly
have UTF-8Y where the Y stands for the word yes.
Thus one could have the name of the format answering, or not answering, the
following question.
Is there a BOM encoded?
However, using the letter Y has three disadvantages