RE: Names for UTF-8 with and without BOM - pragmatic
Working in a large organization whose product includes a large number of configuration and data files in text formats, I can say something about what we have found to work during development, localization, and release engineering, across multiple platforms.

We have eliminated UTF-16 text file formats in favor of UTF-8 because of the Unix standard toolkit's and other Unix-based tools' poor ability to deal with UTF-16. On the other hand, the BOM on UTF-8 has been useful and has not caused problems with Unix tools processing, including pipe sequences. Raw concatenation of files which would produce internal ZWNBSPs is not part of any of our processing as far as I know.

-----Original Message-----
From: David Starner [mailto:[EMAIL PROTECTED]]
Sent: Thursday, November 07, 2002 12:14 PM
To: Markus Scherer
Cc: unicode
Subject: Re: Names for UTF-8 with and without BOM - pragmatic

On Wed, Nov 06, 2002 at 09:47:43AM -0800, Markus Scherer wrote:
> The fact is that Windows uses UTF-8 and UTF-16 plain text files with
> signatures (BOMs) very simply, gracefully, and successfully. It has
> applied what I called the "pragmatic" approach here for about 10
> years. It just works.

It just works in an environment where relatively few documents are plain text, and that doesn’t use pipes of text as universal glue. C has been described as a (C)haracter processing language; whether or not that’s accurate, Awk and Perl certainly are; these are all Unix programming languages, and at the heart of what Unix is. The simple Unix program has a stream of text coming in and a stream of text going out, whereas the simple Windows program has a window. What works for Windows may very well not work for Unix.

--
David Starner - [EMAIL PROTECTED]
Great is the battle-god, great, and his kingdom--
A field where a thousand corpses lie.
  -- Stephen Crane, "War is Kind"
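[Editorial note: the claim above — that a UTF-8 BOM "has not caused problems with Unix tools processing, including pipe sequences" — holds only when each tool in the pipe either ignores or tolerates a leading signature. A minimal sketch, in Python and purely illustrative (the function name is invented, not from the thread), of the tolerant-reader behavior such tools need:]

```python
UTF8_SIG = b"\xef\xbb\xbf"  # the UTF-8 encoding of U+FEFF, used as a signature

def strip_utf8_signature(data: bytes) -> bytes:
    """Drop a leading UTF-8 signature, if present; leave every other byte intact.

    A non-initial EF BB BF is a real ZWNBSP character and is preserved.
    """
    if data.startswith(UTF8_SIG):
        return data[len(UTF8_SIG):]
    return data
```

A filter built this way could wrap `sys.stdin.buffer`/`sys.stdout.buffer` and behave identically on signed and unsigned UTF-8 input, which is what makes signed files safe in pipe sequences.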
Re: Names for UTF-8 with and without BOM - pragmatic
On Wed, Nov 06, 2002 at 09:47:43AM -0800, Markus Scherer wrote:
> The fact is that Windows uses UTF-8 and UTF-16 plain text files with
> signatures (BOMs) very simply, gracefully, and successfully. It has applied
> what I called the "pragmatic" approach here for about 10 years. It just
> works.

It just works in an environment where relatively few documents are plain text, and that doesn’t use pipes of text as universal glue. C has been described as a (C)haracter processing language; whether or not that’s accurate, Awk and Perl certainly are; these are all Unix programming languages, and at the heart of what Unix is. The simple Unix program has a stream of text coming in and a stream of text going out, whereas the simple Windows program has a window. What works for Windows may very well not work for Unix.

--
David Starner - [EMAIL PROTECTED]
Great is the battle-god, great, and his kingdom--
A field where a thousand corpses lie.
  -- Stephen Crane, "War is Kind"
RE: Names for UTF-8 with and without BOM - pragmatic
> Initial for each piece, as each is assumed to be a complete
> text file before concatenation. Nothing
> prevents copy/cp/cat and other commands from recognizing
> Unicode signatures, for as long as they
> don't claim to preserve initial U+FEFF.

Yes there is, in a formal sense, for cat and cp. See
http://www.opengroup.org/onlinepubs/007904975/utilities/cat.html
which states "The standard output shall contain the sequence of *bytes* read from the input files. Nothing else shall be written to the standard output." (my emphasis) and
http://www.opengroup.org/onlinepubs/007904975/utilities/cp.html
which is not so explicit, but silently assumes that copying does not change the bytes of the file content in any way.

cat and copy/cp are very agnostic programs. They just copy (or concatenate) the byte strings, regardless of whether the content is pictures, sound, or text. So 'cat' can "meaningfully" concatenate only text files of the *same* encoding serialisation, *without* BOM/signature, and where the text files are properly terminated (in the case of stateful serialisations).

Trying to get 'cat' to do more than that for text files would be just as bad as trying to get 'cat' to join (in some "useful" way) picture files (of possibly different formats) or sound or video files. Don't expect cat to catenate those file types, if they are "complete", and to get a useful result. 'cat' is *supposed* to be simple, and just string byte sequences together. If you want something more, use another program that does the "more" you're looking for (or write one). That is not the Unix/Linux utility program 'cat', nor cp.

/Kent K
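[Editorial note: the "other program" Kent suggests — signature-aware concatenation that cat itself must never do — is easy to sketch. The following Python function is purely illustrative (its name and interface are invented here); it keeps a signature only on the first piece, which avoids the internal-ZWNBSP problem mentioned earlier in the thread:]

```python
UTF8_SIG = b"\xef\xbb\xbf"

def bom_aware_concat(files):
    """Concatenate UTF-8 file contents, keeping a signature only on the first piece.

    Each element of `files` is assumed to be the complete byte content of a
    UTF-8 text file that may or may not begin with the EF BB BF signature.
    Unlike POSIX cat, this deliberately alters bytes: non-initial signatures
    are dropped so they cannot turn into spurious ZWNBSP characters.
    """
    out = bytearray()
    for i, data in enumerate(files):
        if i > 0 and data.startswith(UTF8_SIG):
            data = data[len(UTF8_SIG):]
        out += data
    return bytes(out)
```

Such a tool would be the text-format analogue of a video joiner: it understands the format just enough to merge complete pieces, which is exactly the job Kent argues does not belong in cat.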
Re: Names for UTF-8 with and without BOM - pragmatic
Lars Kristan wrote:
> Markus Scherer wrote:
>> If software claims that it does not modify the contents of a document
>> *except* for initial U+FEFF then it can do with initial U+FEFF what it
>> wants. If the whole discussion hinges on what is allowed if software
>> claims to not modify text then one need not claim that so absolutely.
>
> That seems pretty straightforward, but only as long as your "software" is
> an editor and your "document" is a single file. How about a case where
> "software" is a copy or cat command, and instead of a document you have
> several (plain?) text files that you concat? What does "initial" mean here?

Initial for each piece, as each is assumed to be a complete text file before concatenation. Nothing prevents copy/cp/cat and other commands from recognizing Unicode signatures, for as long as they don't claim to preserve initial U+FEFF.

> What happens next is: some software lets an initial BOM get through and
> appends such a string to a file or a stream. If other software treats it as
> a character, the data has been modified. On the other hand, if we want to
> allow software to disregard BOMs in the middle of character streams then we
> have another set of security issues. And not removing is equally bad
> because of many consequences (in the end, we could end up with every
> character being preceded by a BOM).

All true, and all well known, and the reason why the UTC and WG2 added U+2060 Word Joiner. This becomes less of an issue if and when they decide to remove/deprecate the ZWNBSP semantics from U+FEFF. However, in a situation where you cannot be sure about the intended purpose of an initial U+FEFF, I don't think that the "pragmatic" approach is any less safe than any other, while it increases usability.

>> .txt  UTF-8  require  We want plain text files to have BOM to distinguish
>> from legacy codepage files
>
> Hmm, what does "plain" mean?! ...

Your response to this takes it out of context. I am not trying to prescribe general semantics of .txt plain text files. If you read the thread carefully, you will see that I am just taking the file checker configuration file from Joseph Boyle and suggesting a modification to its format that makes it not rely on having charset names that indicate any particular BOM handling. I am sorry to not have made this clearer.

> True, UTF-16 files do need a signature. Well, we just need to abandon the
> idea that UTF-16 can be used for plain text files. Let's have plain text
> files in UTF-8. Look at it as the most universal code page. Plain text
> files never contained information about the code page, why would there be
> such information in UTF-8 plain text files?!

UTF-16 files do not *need* a signature per se. However, it is very useful to prepend Unicode plain text *files* with Unicode signatures so that tools have a chance to figure out if those files are in Unicode at all - and which Unicode charset - or in some legacy charset. With "plain text files" I mean plain text documents without any markup or other meta information.

The fact is that Windows uses UTF-8 and UTF-16 plain text files with signatures (BOMs) very simply, gracefully, and successfully. It has applied what I called the "pragmatic" approach here for about 10 years. It just works.

markus
--
Opinions expressed here may not reflect my company's positions unless otherwise noted.
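[Editorial note: the detection Markus describes — tools using the signature to figure out whether a file is Unicode at all, and which Unicode charset — amounts to a prefix match against the well-known signature byte sequences. A minimal illustrative sketch (the function name is invented here):]

```python
# Longer signatures must be tested first: UTF-32LE's signature (FF FE 00 00)
# begins with UTF-16LE's (FF FE), so order matters.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf",     "UTF-8"),
    (b"\xfe\xff",         "UTF-16BE"),
    (b"\xff\xfe",         "UTF-16LE"),
]

def sniff_signature(data: bytes):
    """Return (charset, signature_length) if data starts with a Unicode
    signature, else (None, 0). A miss suggests a legacy charset or an
    unsigned Unicode file."""
    for sig, name in SIGNATURES:
        if data.startswith(sig):
            return name, len(sig)
    return None, 0
```

Note the residual ambiguity that runs through this whole thread: FF FE 00 00 could equally be UTF-16LE text whose first character is U+0000, so the signature is a heuristic, not proof.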
RE: Names for UTF-8 with and without BOM - pragmatic
> True, UTF-16 files do need a signature.

Eh, no! "UTF-16BE" and "UTF-16LE" files (or whatever kind of text data element) do not have any signature/BOM. Not even files (somehow) labelled "UTF-16" need have a signature/BOM; without a BOM they are the same as if they were labelled "UTF-16BE". (Formally, XML "requires" a BOM for UTF-16 XML documents, but then goes on exemplifying that it is not needed for XML documents...)

I do agree, however, that the idea of having a BOM/signature at the beginning of a file (or other text data element) is a bad one.

/Kent K
RE: Names for UTF-8 with and without BOM - pragmatic
Lars Kristan wrote:
> > .txt  UTF-8  require  We want plain text files to
> > have BOM to distinguish
> > from legacy codepage files
>
> Hmm, what does "plain" mean?! Perhaps files with a BOM
> should be called "text" files (or .txt files;) as
> opposed to "plain text" files, which in my opinion should
> be just that - _plain_ text. No ASCII plain text file had
> an ASCII signature. I believe "plain text" should be
> something that will be as easy to use (and handle) as
> ASCII plain text files were.

"Plain" per se means nothing, in this context. The term "plain text", in Unicode jargon, means the opposite of "rich text". "Rich text" (or "fancy text") is another Unicode jargon term, meaning text containing *mark-up*, such as HTML, XML, RTF, troff, TeX, proprietary word-processor formats, etc. Unicode text not containing mark-up is called "plain text", regardless of the fact that it might be quite "complicated" by the presence of BOMs, bidi controls, etc.

_ Marco
RE: Names for UTF-8 with and without BOM - pragmatic
Markus Scherer wrote:
> If software claims that it does not modify the contents of a
> document *except* for initial U+FEFF
> then it can do with initial U+FEFF what it wants. If the
> whole discussion hinges on what is allowed
> if software claims to not modify text then one need
> not claim that so absolutely.

That seems pretty straightforward, but only as long as your "software" is an editor and your "document" is a single file. How about a case where "software" is a copy or cat command, and instead of a document you have several (plain?) text files that you concat? What does "initial" mean here?

What happens next is: some software lets an initial BOM get through and appends such a string to a file or a stream. If other software treats it as a character, the data has been modified. On the other hand, if we want to allow software to disregard BOMs in the middle of character streams, then we have another set of security issues. And not removing is equally bad because of many consequences (in the end, we could end up with every character being preceded by a BOM).

> .txt  UTF-8  require  We want plain text files to
> have BOM to distinguish
> from legacy codepage files

Hmm, what does "plain" mean?! Perhaps files with a BOM should be called "text" files (or .txt files;) as opposed to "plain text" files, which in my opinion should be just that - _plain_ text. No ASCII plain text file had an ASCII signature. I believe "plain text" should be something that will be as easy to use (and handle) as ASCII plain text files were.

True, UTF-16 files do need a signature. Well, we just need to abandon the idea that UTF-16 can be used for plain text files. Let's have plain text files in UTF-8. Look at it as the most universal code page. Plain text files never contained information about the code page; why would there be such information in UTF-8 plain text files?!

How about this:
* BOM makes a file stateful.
* Plain text should NOT be stateful (or, we should make it as stateless as possible).
* If a text file is stateful, it is no longer a "plain text file"; it becomes a "text document".

BTW, since I may be tempted to process text documents with plain text tools, I would rather see that text documents would NOT have the BOM (yes, that effectively makes them plain text files). Since it seems that many people will insist that they want the option to have the BOM in text documents, it seems that it will need to be allowed. But I would not make it "required".

Lars Kristan
Re: Names for UTF-8 with and without BOM - pragmatic
Mark Davis wrote:
> Little probability that right double quote would appear at the start of a
> document either. Doesn't mean that you are free to delete it (*and* say
> that you are not modifying the contents).

This points to a pragmatic way to deal with this issue: If software claims that it does not modify the contents of a document *except* for initial U+FEFF, then it can do with initial U+FEFF what it wants. If the whole discussion hinges on what is allowed if software claims to not modify text, then one need not claim that so absolutely. Similarly, software may claim to not modify text contents _except_ that it may transform line endings into LS or any other convention. Not all software claims to not modify text, nor needs to claim that, and a lot of software does modify text.

> I agree that when the UTC decides that a BOM is *only* to be used as a
> signature, and that it would be ok to delete it anywhere in a document
> (like a non-character), then we are in much better shape. This was, as a
> matter of fact proposed for 3.2, but not approved. If we did that for 4.0,
> then there would be much less reason to distinguish UTF-8 'withBOM' from
> UTF-8 'withoutBOM'.

This would be good. The above would still be useful.

Joseph's request is actually different from the discussion of what is "the right thing": He mostly wants to have labels that distinguish between different things to be done. If there is no consensus for such labels here, then Joseph may need to use in his configuration file selectors that are separate from charset labels. For example:

Type  charset  BOM      Comment
.txt  UTF-8    require  We want plain text files to have BOM to distinguish
                        from legacy codepage files
.xml  UTF-8    forbid   Some XML processors may not cope with BOM
.htm  UTF-8    maybe    We want HTML to be UTF-8 but will not insist on BOM
.rc   not UTF  n/a      Unfortunately compiler insists on these being codepage.
.rc   UTF-16   require  Alternative to the previous line.
.swt  ASCII    n/a      Nonlocalizable internal format, must be ASCII.

markus
--
Opinions expressed here may not reflect my company's positions unless otherwise noted.
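[Editorial note: a file checker driven by a table like the one Markus sketches is straightforward to implement. The following Python fragment is an invented illustration of the idea — the table contents mirror three rows of Markus's example, but the function and its interface are not from the thread:]

```python
UTF8_SIG = b"\xef\xbb\xbf"

# Per-extension policy: (expected charset, BOM rule), after Markus's example.
POLICY = {
    ".txt": ("UTF-8", "require"),  # BOM distinguishes from legacy codepage files
    ".xml": ("UTF-8", "forbid"),   # some XML processors may not cope with a BOM
    ".htm": ("UTF-8", "maybe"),    # BOM acceptable but not insisted on
}

def check_bom_policy(ext: str, data: bytes) -> bool:
    """Return True if the file's leading bytes satisfy the BOM rule for its type."""
    _charset, rule = POLICY.get(ext, (None, "maybe"))
    has_sig = data.startswith(UTF8_SIG)
    if rule == "require":
        return has_sig
    if rule == "forbid":
        return not has_sig
    return True  # "maybe": either way is acceptable
```

The point of the design is exactly Markus's: the BOM rule is a selector of its own, so the charset column never needs names like "UTF-8 withBOM".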
Re: Names for UTF-8 with and without BOM
> So even if it were in there, who cares? I mean, can anyone explain why it
> would make a difference?

I personally wouldn't care if every instance of "Michael Kaplan" at the start of a file were deleted. Not the point. The actual point is that currently, as defined -- not as you would wish for it to be -- the FEFF is an actual character, and in circumstances where it is not clearly defined for use as a BOM, it cannot be removed without altering the content of the text. As I said in another message, the UTC could change this situation by completely deprecating the use of FEFF as anything but BOM. But it hasn't done it yet.

Mark
__
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>; "Unicode Mailing List" <[EMAIL PROTECTED]>
Sent: Sunday, November 03, 2002 13:02
Subject: Re: Names for UTF-8 with and without BOM

> From: "Mark Davis" <[EMAIL PROTECTED]>
>
> Ironic that for the purpose of dealing with THREE bytes that so many bytes
> are being wasted. :-)
>
> > Little probability that right double quote would appear at the start of a
> > document either. Doesn't mean that you are free to delete it (*and* say
> > that you are not modifying the contents).
>
> Interesting strawman there, Mark -- but there is a huge difference there.
> But even if we leave in the notion of it as a character and just deprecate
> its usage and people ignore that, then we are talking about a ZERO WIDTH NO
> BREAK SPACE. This character has the job of:
>
> 1) being invisible
> 2) not breaking text with it
>
> So even if it were in there, who cares? I mean, can anyone explain why it
> would make a difference?
>
> The one thing that no one has ever come up with is a reasonable case where
> it would be at the beginning of the document *yet* it was not a BOM.
>
> So we have a clear semantic for it at the beginning of a file -- it's a BOM.
> Period.
>
> If there is a higher level protocol as well and the protocol and the BOM
> both match, then that is great! Considering how much redundancy there is in
> the Unicode standard about some definitions, a redundant marker for a file
> seems a very trivial issue.
>
> If there is a higher level protocol as well and they do not match, then we
> are in fantasy land bizarro world, inventing edge cases because we have
> nothing better to do. :-) But for the sake of argument, let's pretend it's a
> real scenario -- in which case we treat it the same way as if your higher
> level protocol claims it's ISO-8859-1 and the BOM says it's UTF-32. It's an
> error.
>
> Problem solved!
>
> > I agree that when the UTC decides that a BOM is *only* to be used as a
> > signature, and that it would be ok to delete it anywhere in a document
> > (like a non-character), then we are in much better shape. This was, as a
> > matter of fact proposed for 3.2, but not approved. If we did that for 4.0,
> > then there would be much less reason to distinguish UTF-8 'withBOM' from
> > UTF-8 'withoutBOM'.
>
> There is no reason to worry about this case and no need to delete anything.
> This is a ZERO WIDTH NO BREAK SPACE we are talking about. The burden is on
> the people who think this is a scenario to bring proof that anyone is doing
> anything as unrealistic as this.
>
> There is an easy, clear, and unambiguous plan that can be used here which
> will always work. For once let's not opt to complicate it without reason.
>
> MichKa
Re: Names for UTF-8 with and without BOM
Mark Davis wrote:
> Little probability that right double quote would appear at the start
> of a document either. Doesn't mean that you are free to delete it
> (*and* say that you are not modifying the contents).

True, but right double quote:

(a) has a visible glyph with a well-defined human-readable meaning,
(b) isn't defined by Unicode as having a text-processing influence on adjoining characters (leaving the question wide open of what to do when there are fewer than two adjoining characters),
(c) doesn't have a second meaning as a signature that under certain conditions can be stripped.

> I agree that when the UTC decides that a BOM is *only* to be used as a
> signature, and that it would be ok to delete it anywhere in a document
> (like a non-character), then we are in much better shape. This was, as
> a matter of fact proposed for 3.2, but not approved. If we did that
> for 4.0, then there would be much less reason to distinguish UTF-8
> 'withBOM' from UTF-8 'withoutBOM'.

Every one of us will be grateful when that day comes.

-Doug Ewell
Fullerton, California
Re: Names for UTF-8 with and without BOM
From: "Mark Davis" <[EMAIL PROTECTED]>

Ironic that for the purpose of dealing with THREE bytes that so many bytes are being wasted. :-)

> Little probability that right double quote would appear at the start of a
> document either. Doesn't mean that you are free to delete it (*and* say that
> you are not modifying the contents).

Interesting strawman there, Mark -- but there is a huge difference there. But even if we leave in the notion of it as a character and just deprecate its usage and people ignore that, then we are talking about a ZERO WIDTH NO BREAK SPACE. This character has the job of:

1) being invisible
2) not breaking text with it

So even if it were in there, who cares? I mean, can anyone explain why it would make a difference?

The one thing that no one has ever come up with is a reasonable case where it would be at the beginning of the document *yet* it was not a BOM.

So we have a clear semantic for it at the beginning of a file -- it's a BOM. Period.

If there is a higher level protocol as well and the protocol and the BOM both match, then that is great! Considering how much redundancy there is in the Unicode standard about some definitions, a redundant marker for a file seems a very trivial issue.

If there is a higher level protocol as well and they do not match, then we are in fantasy land bizarro world, inventing edge cases because we have nothing better to do. :-) But for the sake of argument, let's pretend it's a real scenario -- in which case we treat it the same way as if your higher level protocol claims it's ISO-8859-1 and the BOM says it's UTF-32. It's an error.

Problem solved!

> I agree that when the UTC decides that a BOM is *only* to be used as a
> signature, and that it would be ok to delete it anywhere in a document (like
> a non-character), then we are in much better shape. This was, as a matter of
> fact proposed for 3.2, but not approved. If we did that for 4.0, then there
> would be much less reason to distinguish UTF-8 'withBOM' from UTF-8
> 'withoutBOM'.

There is no reason to worry about this case and no need to delete anything. This is a ZERO WIDTH NO BREAK SPACE we are talking about. The burden is on the people who think this is a scenario to bring proof that anyone is doing anything as unrealistic as this.

There is an easy, clear, and unambiguous plan that can be used here which will always work. For once let's not opt to complicate it without reason.

MichKa
Re: Names for UTF-8 with and without BOM
I don't know what you are trying to say. Perhaps you could explain it at the meeting next week.

Mark
__
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Michael (michka) Kaplan" <[EMAIL PROTECTED]>
To: "Mark Davis" <[EMAIL PROTECTED]>; "Murray Sargent" <[EMAIL PROTECTED]>; "Joseph Boyle" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Saturday, November 02, 2002 04:18
Subject: Re: Names for UTF-8 with and without BOM

> From: "Mark Davis" <[EMAIL PROTECTED]>
>
> > That is not sufficient. The first three bytes could represent a real
> > content character, ZWNBSP, or they could be a BOM. The label doesn't tell
> > you.
>
> There are several problems with this supposition -- most notably the fact
> that there are cases that specifically claim this is not recommended and
> that U+2060 is preferred.
>
> > This is similar to UTF-16 CES vs UTF-16BE CES. In the first case, 0xFE
> > 0xFF represents a BOM, and is not part of the content. In the second
> > case, it does *not* represent a BOM -- it represents a ZWNBSP, and must
> > not be stripped. The difference here is that the encoding name tells you
> > exactly what the situation is.
>
> I do not see this as a realistic scenario. I would argue that if the BOM
> matches the encoding scheme, perhaps this was an intentional effort to make
> sure that applications which may not understand the higher level protocol
> can also see what the encoding scheme is.
>
> But even if we assume that someone has gone to the trouble of calling
> something UTF16BE and has 0xFE 0xFF at the beginning of the file: what kind
> of content *is* such a code point that this is even worth calling out as a
> special case?
>
> If the goal is clear and unambiguous text, then the best way would be to
> simplify ALL of this. It was previously decided to always call it a BOM;
> why not stick with that?
>
> MichKa
Re: Names for UTF-8 with and without BOM
Little probability that right double quote would appear at the start of a document either. Doesn't mean that you are free to delete it (*and* say that you are not modifying the contents).

I agree that when the UTC decides that a BOM is *only* to be used as a signature, and that it would be ok to delete it anywhere in a document (like a non-character), then we are in much better shape. This was, as a matter of fact, proposed for 3.2, but not approved. If we did that for 4.0, then there would be much less reason to distinguish UTF-8 'withBOM' from UTF-8 'withoutBOM'.

Mark
__
http://www.macchiato.com
► “Eppur si muove” ◄

----- Original Message -----
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: "Unicode Mailing List" <[EMAIL PROTECTED]>
Cc: "Mark Davis" <[EMAIL PROTECTED]>; "Murray Sargent" <[EMAIL PROTECTED]>; "Joseph Boyle" <[EMAIL PROTECTED]>
Sent: Saturday, November 02, 2002 13:27
Subject: Re: Names for UTF-8 with and without BOM

> Mark Davis wrote:
>
> > That is not sufficient. The first three bytes could represent a real
> > content character, ZWNBSP or they could be a BOM. The label doesn't
> > tell you.
>
> I have never understood under what circumstances a ZWNBSP would ever
> appear as the first character of a file. It wouldn't make any sense. A
> ZWNBSP prevents a word break between the preceding and following
> characters. If there *is* no preceding character, then what is the
> point of the ZWNBSP?
>
> Every time this topic comes up, I have asked why a true ZWNBSP would
> ever appear as the first character of a file. The only responses I've
> heard are:
>
> 1. It might not be a discrete file, but the second (or successive)
> piece of a file that was split up for some reason (transmission, etc.).
>
> In that case, the interpreting process should take its encoding cue from
> the first fragment, and should NEVER reinterpret fragments broken up at
> arbitrary points. (Imagine a process modifying a GIF or JPEG file, or
> converting CR/LF, based on fragments!) But this is not the point being
> discussed anyway; the point is whole files.
>
> 2. It could happen; Unicode allows any character to appear anywhere.
>
> Well, almost anywhere. But even so, the likelihood of a U+FEFF as
> ZWNBSP appearing at the start of an unsigned UTF-8 file is vanishingly
> small compared to the likelihood that the U+FEFF was intended to be a
> signature. The rare case is just too rare to invalidate the heuristic
> for the much more common case.
>
> In addition, as Michka points out, we now have U+2060 WORD JOINER, whose
> entire purpose in life is to be used as U+FEFF was formerly used, as a
> ZWNBSP. Any new Unicode text should use U+2060 and not U+FEFF as a word
> joiner. It's hard to imagine that UTC and WG2 would have standardized
> this if there was a lot of real-world text that used U+FEFF as ZWNBSP.
>
> -Doug Ewell
> Fullerton, California
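[Editorial note: the migration Doug describes — new text should use U+2060 WORD JOINER for the joining function, leaving U+FEFF to mean only a signature — can be sketched as a normalization step. This is an invented illustration of the behavior that would become fully legitimate only if the UTC deprecated the ZWNBSP semantics of U+FEFF, as discussed above; under the standard as it stood in 2002, rewriting non-initial U+FEFF does modify content:]

```python
BOM = "\ufeff"          # U+FEFF: signature when initial, legacy ZWNBSP otherwise
WORD_JOINER = "\u2060"  # U+2060: the dedicated word-joining character

def normalize_feff(text: str) -> str:
    """Drop a leading U+FEFF (treating it as a signature) and rewrite any
    remaining U+FEFF as U+2060, preserving the word-joining behavior."""
    if text.startswith(BOM):
        text = text[1:]
    return text.replace(BOM, WORD_JOINER)
```

This keeps Doug's heuristic explicit: initial U+FEFF is read as a signature, while interior occurrences keep their joining semantics via U+2060 rather than being silently deleted.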
Re: Names for UTF-8 with and without BOM
[EMAIL PROTECTED] scripsit:

> I find it interesting, then, to see Michael saying that, since Notepad
> sticks a BOM-cum-signature at the start of its UTF-8, the rest of the
> world should support it.

There is another argument, viz. ISO/IEC 10646, which plainly proclaims that the 8-BOM is a valid signature for UTF-8 files.

--
John Cowan                                   [EMAIL PROTECTED]
http://www.reutershealth.com                 http://www.ccil.org/~cowan
Even a refrigerator can conform to the XML Infoset, as long as it has
a door sticker saying "No information items inside".  --Eve Maler
Re: Names for UTF-8 with and without BOM
From: <[EMAIL PROTECTED]>

> In particular, I'm thinking of a situation about a year and a half ago
> (IIRC) in which Michael (and I and others) were strongly opposed to a
> suggestion that the Unicode Consortium should document a certain variation
> (perversion, some would say) of one of the Unicode encoding forms that a
> certain vendor had implemented in their software. On that occasion,
> Michael (and I and others) were arguing that, just because they had done
> something in their software, that shouldn't mean that the rest of the
> world should be forced to support their encoding form.
>
> I find it interesting, then, to see Michael saying that, since Notepad
> sticks a BOM-cum-signature at the start of its UTF-8, the rest of the
> world should support it.

I do not see the conflict, or the irony. Remember that what Notepad and others do is present mainly because it *is* in the XML standard. What was being done by those others with UTF-8 was not a part of the UTF-8 "standard" and was in fact specifically disallowed. In the end, note that UTF-8 was not compromised; they got their own [non-preferred] encoding scheme for their backcompat requirement, and they now have the "job" of making their products use it by name.

If someone has a bug or problem in their software, then it is of course their responsibility to fix it. On the other hand, if one pays attention to a possible (optional) recommendation in a standard, it is the standard's responsibility not to make people regret that they paid attention.

(Which is not to say that they got the "idea" from XML; I am not sure where the idea came from. I figure that there was a strong interest in making sure that when someone saved a file as UTF-8, it would still be considered UTF-8 when reloaded, rather than ASCII or ANSI [sic]. This is a good reason for such a decision in plain text -- and the fact that XML is after all "just text" is lost on no one...)
Given the strong lack of interest that XML has had in the notion of breaking old parsers or valid XML 1.0 streams, it seems unlikely (to me) that they would make such a breaking change in a future version of XML. MichKa
Re: Names for UTF-8 with and without BOM
On 11/02/2002 12:15:54 PM "Michael \(michka\) Kaplan" wrote: >> .xml UTF-8N Some XML processors may not cope with BOM > >Maybe they need to upgrade? Since people often edit the files in notepad, >many files are going to have it. A parser that cannot accept this reality is >not going to make it very long. Ah, now here's an interesting twist. I'm not saying I disagree with Michael. I'm just acknowledging my own need for intellectual honesty, and realising that sometimes we take opposite sides of an opinion because of other factors that we may or may not be conscious of. In particular, I'm thinking of a situation about a year and a half ago (IIRC) in which Michael (and I and others) were strongly opposed to a suggestion that the Unicode Consortium should document a certain variation (perversion, some would say) of one of the Unicode encoding forms that a certain vendor had implemented in their software. On that occasion, Michael (and I and others) were arguing that, just because they had done something in their software, that shouldn't mean that the rest of the world should be forced to support their encoding form. I find it interesting, then, to see Michael saying that, since Notepad sticks a BOM-cum-signature at the start of its UTF-8, the rest of the world should support it. Again, this is just an observation on the particular argument being used, but not on the suggestion being made. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: <[EMAIL PROTECTED]>
RE: Names for UTF-8 with and without BOM
On 11/02/2002 11:59:24 AM "Joseph Boyle" wrote: >The first time I thought of UTF-8Y it sounded too flippant, but actually it >is fairly self-explanatory if UTF-8 is taken as a given, and has the virtue >of being short. UTF-8Y (and UTF-8J) is not at all intuitive. "UTF-8-yuk"? The better counterpart IMO to UTF-8N[o BOM], if we need these labels at all, would be UTF-8B[OM]. - Peter --- Peter Constable Non-Roman Script Initiative, SIL International 7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA Tel: +1 972 708 7485 E-mail: <[EMAIL PROTECTED]>
Re: Names for UTF-8 with and without BOM
John Cowan wrote:
>
> Tex Texin scripsit:
>
> > Interestingly, although I didn't study it in detail, looking at rfc 2376
> > for prioritization over charset conflicts, it seems to recommend
> > stripping the BOM when converting from utf-16 to other charsets (and
> > without considering that ucs-4 would like to keep it). (section 5).
>
> The point is not to try to convert it into an FEFF character or some
> replacement thereof, like say "?".

That may be the intent, but it doesn't say that. It should say convert the BOM to the equivalent BOM for the target encoding, if there is one. Instead it says to strip it for other encodings. (I wish it were called a signature rather than a BOM for most of these usages.)

> > Also, in considering charset conflicts, 2376 fails to consider conflicts
> > between signature and the encoding declaration. (I have a utf-16BE BOM
> > and the encoding declaration is for utf-8...).
>
> The encoding declaration is supposed to trump all. So it is UTF-8, and
> since 0xFF is illegal in UTF-8, you blow chunks...

OK, but where is that written?

> > I'll have to check for a more up-to-date rfc.
>
> There is none.

OK. Sorry if I seem to be difficult. I am just rereading a few things with my new understanding to put the picture back together again.

tex

> --
> John Cowan <[EMAIL PROTECTED]>            http://www.reutershealth.com
> I amar prestar aen, han mathon ne nen,    http://www.ccil.org/~cowan
> han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_

--
Tex Texin   cell: +1 781 789 1898   mailto:Tex@;XenCraft.com
Xen Master                          http://www.i18nGuy.com
XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
Re: Names for UTF-8 with and without BOM
Doug,

Doug Ewell wrote:
>
> Tex Texin wrote:
>
> > However, I didn't realize that parsers were to allow for the
> > possibility of different signatures.
> > So a parser has to worry about scsu signatures, etc
>
> A parser only *has* to read UTF-8 without signature and UTF-16 with
> signature.

Yes, I thought so until I saw Michka's note. And I thought that gave me 100% utf-8 coverage. Apparently I would be leaving out the thousands ;-) that edit xml with notepad.

> It *may* read other encodings of its own choosing, including
> ISO 8859-1, SCSU, JOECODE, or US-BSCII. (However, I can't find anything
> that allows for SCSU with signature, which is a shame since UTS #6
> encourages the signature.)

Can I stand on the other side of the fence now and refer to market forces when it comes to ISO 8859 etc.? ;-)

Anyway, I think you understood the context of my whines -- it was just reaction to this silliness with open-ended signatures...

tex

--
Tex Texin   cell: +1 781 789 1898   mailto:Tex@;XenCraft.com
Xen Master                          http://www.i18nGuy.com
XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
Re: Names for UTF-8 with and without BOM
Tex Texin scripsit:

> Interestingly, although I didn't study it in detail, looking at rfc 2376
> for prioritization over charset conflicts, it seems to recommend
> stripping the BOM when converting from utf-16 to other charsets (and
> without considering that ucs-4 would like to keep it). (section 5).

The point is not to try to convert it into an FEFF character or some replacement thereof, like say "?".

> Also, in considering charset conflicts, 2376 fails to consider conflicts
> between signature and the encoding declaration. (I have a utf-16BE BOM
> and the encoding declaration is for utf-8...).

The encoding declaration is supposed to trump all. So it is UTF-8, and since 0xFF is illegal in UTF-8, you blow chunks...

> I'll have to check for a more up-to-date rfc.

There is none.

--
John Cowan <[EMAIL PROTECTED]>            http://www.reutershealth.com
I amar prestar aen, han mathon ne nen,    http://www.ccil.org/~cowan
han mathon ne chae, a han noston ne 'wilith.  --Galadriel, _LOTR:FOTR_
Re: Names for UTF-8 with and without BOM
John Cowan wrote: > > Tex Texin scripsit: > > > So when the parser gets JOECODE, I can understand ignoring the signature > > and autodetection, but exactly how does it find the first "<"? > > Well, if it begins with an 00 byte, it can't be UTF-8 or UTF-16 (it might > be UTF-32 big-endian, but we'll suppose the parser can't handle that). > JOECODE is what's left. At worst it is in some other encoding and/or > not well-formed, in which case you expect an error and you get one. > Of course the processor knows that "<" is encoded as 0xFF in JOECODE > > The point is that signatures don't decode to a character: processors in > general, not just XML processors, are expected to skip them. > > > It must have to try all of the encodings known to it... ugh. > > In such a bad case, that's all you can do. John, The bad case is what I was whinging about, since more processors deal with more than 3 encodings. Ultimately, because the initial characters are fixed, autodetection is not as bad as it is for plaintext, I realize that. Interestingly, although I didn't study it in detail, looking at rfc 2376 for prioritization over charset conflicts, it seems to recommend stripping the BOM when converting from utf-16 to other charsets (and without considering that ucs-4 would like to keep it). (section 5). Also, in considering charset conflicts, 2376 fails to consider conflicts between signature and the encoding declaration. (I have a utf-16BE BOM and the encoding declaration is for utf-8...). I'll have to check for a more up-to-date rfc. All in all I agree with you and Michka (yes you were right, I was wrong Michael!) that it isn't that big a deal to support a variety of BOMs but the world did not need yet another way to sometimes (maybe its there), almost (maybe its unique), redundantly (one hopes its redundant and not conflicting) declare an encoding. 
tex

--
Tex Texin   cell: +1 781 789 1898   mailto:Tex@;XenCraft.com
Xen Master                          http://www.i18nGuy.com
XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
Re: Names for UTF-8 with and without BOM
Tex Texin wrote: > However, I didn't realize that parsers were to allow for the > possibility of different signatures. > So a parser has to worry about scsu signatures, etc A parser only *has* to read UTF-8 without signature and UTF-16 with signature. It *may* read other encodings of its own choosing, including ISO 8859-1, SCSU, JOECODE, or US-BSCII. (However, I can't find anything that allows for SCSU with signature, which is a shame since UTS #6 encourages the signature.) -Doug Ewell Fullerton, California
Re: Names for UTF-8 with and without BOM
You are mistaken about this -- XML claimed originally that it was valid but was not required. The notion that XML parsers would update to handle a new encoding form to strip off three bytes but would not conditionally strip those three bytes if they were the first three bytes of the file is an unrealistic one. MichKa - Original Message - From: "Tex Texin" <[EMAIL PROTECTED]> To: "Michael (michka) Kaplan" <[EMAIL PROTECTED]> Cc: "Mark Davis" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: Saturday, November 02, 2002 11:08 AM Subject: Re: Names for UTF-8 with and without BOM > "Michael (michka) Kaplan" wrote: > > > .xml UTF-8N Some XML processors may not cope with BOM > > > > Maybe they need to upgrade? Since people often edit the files in notepad, > > many files are going to have it. A parser that cannot accept this reality is > > not going to make it very long. > > I didn't think the XML standard allowed for utf-8 files to have a BOM. > The standard is quite clear about requiring 0xFEFF for utf-16. > I would have thought a proper parser would reject a non-utf-16 file > beginning with something other than "<". > > (The fact that notepad puts it there should be irrelevant.) > > Am I wrong about XML and the utf-8 signature? > > tex > > > -- > - > Tex Texin cell: +1 781 789 1898 mailto:Tex@;XenCraft.com > Xen Master http://www.i18nGuy.com > > XenCraft http://www.XenCraft.com > Making e-Business Work Around the World > - > >
Re: Names for UTF-8 with and without BOM
Tex Texin scripsit: > So when the parser gets JOECODE, I can understand ignoring the signature > and autodetection, but exactly how does it find the first "<"? Well, if it begins with an 00 byte, it can't be UTF-8 or UTF-16 (it might be UTF-32 big-endian, but we'll suppose the parser can't handle that). JOECODE is what's left. At worst it is in some other encoding and/or not well-formed, in which case you expect an error and you get one. Of course the processor knows that "<" is encoded as 0xFF in JOECODE The point is that signatures don't decode to a character: processors in general, not just XML processors, are expected to skip them. > It must have to try all of the encodings known to it... ugh. In such a bad case, that's all you can do. -- John Cowan [EMAIL PROTECTED] www.reutershealth.com www.ccil.org/~cowan Promises become binding when there is a meeting of the minds and consideration is exchanged. So it was at King's Bench in common law England; so it was under the common law in the American colonies; so it was through more than two centuries of jurisprudence in this country; and so it is today. --_Specht v. Netscape_
Re: Names for UTF-8 with and without BOM
John, I understand the flexibility of XML to use different encodings. However, I didn't realize that parsers were to allow for the possibility of different signatures. So a parser has to worry about scsu signatures, etc Whereas XML is so fussy about which characters it accepts, I am surprised at its flexibility for signatures. So when the parser gets JOECODE, I can understand ignoring the signature and autodetection, but exactly how does it find the first "<"? It must have to try all of the encodings known to it... ugh. tex John Cowan wrote: > > Tex Texin scripsit: > > > However, that leaves open the question whether only the Unicode > > transform signatures are acceptable or other signatures are also > > allowed. So if a vendor defines a code page, and defines a signature > > (perhaps mapping BOM/ZWNSP specifically to some code point or byte > > string) does that then become acceptable? > > IMHO yes. XML documents are not *required* to be in one of the character > sets that can be automatically detected by the methods of Appendix F. > You can encode your documents in (hypothetical) JOECODE, which uses leading > 00 as a signature (ignored by the XML parser) and then A=01, B=02, C=03, and so on. > Autodetection will not work here, but it is perfectly conformant to have > a processor that understands only UTF-8, UTF-16, and JOECODE. > > Of course some encodings, such as US-BSCII, which looks just like US-ASCII > except that A=0x42, B=0x41, a=0x62, b=0x61 will cause problems for anybody. > :-) > > I am a member of, but not speaking for, the XML Core WG. > > -- > John Cowan [EMAIL PROTECTED] www.ccil.org/~cowan www.reutershealth.com > "The competent programmer is fully aware of the strictly limited size of his own > skull; therefore he approaches the programming task in full humility, and among > other things he avoids clever tricks like the plague." 
--Edsger Dijkstra

--
Tex Texin   cell: +1 781 789 1898   mailto:Tex@;XenCraft.com
Xen Master                          http://www.i18nGuy.com
XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
Re: Names for UTF-8 with and without BOM
Hi John, I meant the character "<". As for notepad, what I should have either stated more completely or bit my tongue, is that where there is a standard in place (and where it is unambiguous) the mistakes of particular products shouldn't hold sway, unless they are tantamount to a de facto standard. I (personally) don't hold notepad in that class. In particular with respect to Michka's comment that parsers should upgrade to accommodate notepad's BOM, I rather thought notepad should be changed. But I certainly don't want to get into a debate on notepad's influence on the market, so let's pretend I bit my tongue in the last mail, and once again in this mail. ;-) tex John Cowan wrote: > > Tex Texin scripsit: > > > I didn't think the XML standard allowed for utf-8 files to have a BOM. > > This capability was never actually excluded, and was added by erratum > (and force-majeure, when it became clear that BOMful UTF-8 was going to > start becoming common). XML files are intended to be plain text, and > if a large source of plain text insists on a BOM, so be it. > > > The standard is quite clear about requiring 0xFEFF for utf-16. > > I would have thought a proper parser would reject a non-utf-16 file > > beginning with something other than "<". > > If by "<" you mean the *character* "<", then yes. If you mean the *byte* > 0x3C, then no: well-formed XML files can begin with any of 0x00 (UTF-32), > 0x3C (ASCII-compatible), 0x4C (EBCDIC), 0xEF (UTF-8 with BOM), 0xFE (UTF-16 > in BE order), or 0xFF (UTF-16 in LE order). In principle they could begin with > some other byte: 0x2B in UTF-7, e.g. > > > (The fact that notepad puts it there should be irrelevant.) > > Actual practice is never quite irrelevant. > > -- > John Cowan [EMAIL PROTECTED] http://www.reutershealth.com > "Mr. Lane, if you ever wish anything that I can do, all you will have > to do will be to send me a telegram asking and it will be done." > "Mr. 
Hearst, if you ever get a telegram from me asking you to do anything, you can put the telegram down as a forgery."

--
Tex Texin   cell: +1 781 789 1898   mailto:Tex@;XenCraft.com
Xen Master                          http://www.i18nGuy.com
XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
Re: Names for UTF-8 with and without BOM
Tex Texin scripsit: > However, that leaves open the question whether only the Unicode > transform signatures are acceptable or other signatures are also > allowed. So if a vendor defines a code page, and defines a signature > (perhaps mapping BOM/ZWNSP specifically to some code point or byte > string) does that then become acceptable? IMHO yes. XML documents are not *required* to be in one of the character sets that can be automatically detected by the methods of Appendix F. You can encode your documents in (hypothetical) JOECODE, which uses leading 00 as a signature (ignored by the XML parser) and then A=01, B=02, C=03, and so on. Autodetection will not work here, but it is perfectly conformant to have a processor that understands only UTF-8, UTF-16, and JOECODE. Of course some encodings, such as US-BSCII, which looks just like US-ASCII except that A=0x42, B=0x41, a=0x62, b=0x61 will cause problems for anybody. :-) I am a member of, but not speaking for, the XML Core WG. -- John Cowan [EMAIL PROTECTED] www.ccil.org/~cowan www.reutershealth.com "The competent programmer is fully aware of the strictly limited size of his own skull; therefore he approaches the programming task in full humility, and among other things he avoids clever tricks like the plague." --Edsger Dijkstra
Re: Names for UTF-8 with and without BOM
Tex Texin scripsit: > I didn't think the XML standard allowed for utf-8 files to have a BOM. This capability was never actually excluded, and was added by erratum (and force-majeure, when it became clear that BOMful UTF-8 was going to start becoming common). XML files are intended to be plain text, and if a large source of plain text insists on a BOM, so be it. > The standard is quite clear about requiring 0xFEFF for utf-16. > I would have thought a proper parser would reject a non-utf-16 file > beginning with something other than "<". If by "<" you mean the *character* "<", then yes. If you mean the *byte* 0x3C, then no: well-formed XML files can begin with any of 0x00 (UTF-32), 0x3C (ASCII-compatible), 0x4C (EBCDIC), 0xEF (UTF-8 with BOM), 0xFE (UTF-16 in BE order), or 0xFF (UTF-16 in LE order). In principle they could begin with some other byte: 0x2B in UTF-7, e.g. > (The fact that notepad puts it there should be irrelevant.) Actual practice is never quite irrelevant. -- John Cowan [EMAIL PROTECTED] http://www.reutershealth.com "Mr. Lane, if you ever wish anything that I can do, all you will have to do will be to send me a telegram asking and it will be done." "Mr. Hearst, if you ever get a telegram from me asking you to do anything, you can put the telegram down as a forgery."
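[Editor's aside: the leading-byte dispatch John describes can be sketched in a few lines. This is a minimal illustration of the idea, not a complete implementation of XML's Appendix F autodetection; the function name is invented here.]

```python
def guess_xml_encoding(data: bytes) -> str:
    """Guess the encoding family of an XML byte stream from its first bytes."""
    if data.startswith(b"\xef\xbb\xbf"):   # U+FEFF as UTF-8
        return "UTF-8 with BOM"
    if data.startswith(b"\xfe\xff"):       # BOM, big-endian
        return "UTF-16 big-endian"
    if data.startswith(b"\xff\xfe"):       # BOM, little-endian
        return "UTF-16 little-endian"
    if data.startswith(b"\x00"):           # '<' padded with NULs, or similar
        return "UTF-32 big-endian or other 00-leading encoding"
    if data.startswith(b"\x3c"):           # '<' in any ASCII-compatible encoding
        return "ASCII-compatible"
    if data.startswith(b"\x4c"):           # '<' in EBCDIC
        return "EBCDIC family"
    if data.startswith(b"\x2b"):           # '+' could open a UTF-7 sequence
        return "possibly UTF-7"
    return "unknown"
```

In the ASCII-compatible and EBCDIC cases the parser must still read the encoding declaration to narrow down the exact charset; the first byte only picks the family.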
Re: Names for UTF-8 with and without BOM
Thanks Doug. I had looked at the standard not at the appendix. I think that (non-normative) appendix is unfortunate. It seems to imply (to my mind) that if other character sets define BOMs that it is ok to use them as XML signatures. My reasoning is that the standard itself only says that UTF-16 must have a signature and everything else except utf-8 must declare their encoding. The standard doesn't say whether other encodings should or should not be allowed to use signatures. The appendix F by defining the other Unicode signatures implies they are acceptable (without specifically stating so). The text of the standard however doesn't suggest even that UCS-4 would use a signature, as it doesn't include it with utf-16 when speaking about it requiring a BOM, and specifically says the name of UCS-4 to use in the declaration, as with other encodings. However, that leaves open the question whether only the Unicode transform signatures are acceptable or other signatures are also allowed. So if a vendor defines a code page, and defines a signature (perhaps mapping BOM/ZWNSP specifically to some code point or byte string) does that then become acceptable? Of course we hope not, and I am sure the authors did not intend so, but without a statement about which signatures are allowed or not allowed beyond UTF-16, I think the can of worms is opened. OK, having raised the issue I'll take it up with the w3c i18n group to get their understanding and then the xml group if needed. tex Doug Ewell wrote: > > Tex Texin wrote: > > > I didn't think the XML standard allowed for utf-8 files to have a BOM. > > The standard is quite clear about requiring 0xFEFF for utf-16. > > I would have thought a proper parser would reject a non-utf-16 file > > beginning with something other than "<". > > The standard explicitly allows UCS-4, UTF-16, and UTF-8 files to begin > with a BOM. 
> See Appendix F.1, "Detection Without External Encoding
> Information":
>
> http://www.w3.org/TR/REC-xml#sec-guessing
>
> -Doug Ewell
> Fullerton, California

--
Tex Texin   cell: +1 781 789 1898   mailto:Tex@;XenCraft.com
Xen Master                          http://www.i18nGuy.com
XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
Re: Names for UTF-8 with and without BOM
Tex Texin wrote: > I didn't think the XML standard allowed for utf-8 files to have a BOM. > The standard is quite clear about requiring 0xFEFF for utf-16. > I would have thought a proper parser would reject a non-utf-16 file > beginning with something other than "<". The standard explicitly allows UCS-4, UTF-16, and UTF-8 files to begin with a BOM. See Appendix F.1, "Detection Without External Encoding Information": http://www.w3.org/TR/REC-xml#sec-guessing -Doug Ewell Fullerton, California
Re: Names for UTF-8 with and without BOM
Mark Davis wrote: > That is not sufficient. The first three bytes could represent a real > content character, ZWNBSP or they could be a BOM. The label doesn't > tell you. I have never understood under what circumstances a ZWNBSP would ever appear as the first character of a file. It wouldn't make any sense. A ZWNBSP prevents a word break between the preceding and following characters. If there *is* no preceding character, then what is the point of the ZWNBSP? Every time this topic comes up, I have asked why a true ZWNBSP would ever appear as the first character of a file. The only responses I've heard are: 1. It might not be a discrete file, but the second (or successive) piece of a file that was split up for some reason (transmission, etc.). In that case, the interpreting process should take its encoding cue from the first fragment, and should NEVER reinterpret fragments broken up at arbitrary points. (Imagine a process modifying a GIF or JPEG file, or converting CR/LF, based on fragments!) But this is not the point being discussed anyway; the point is whole files. 2. It could happen; Unicode allows any character to appear anywhere. Well, almost anywhere. But even so, the likelihood of a U+FEFF as ZWNBSP appearing at the start of an unsigned UTF-8 file is vanishingly small compared to the likelihood that the U+FEFF was intended to be a signature. The rare case is just too rare to invalidate the heuristic for the much more common case. In addition, as Michka points out, we now have U+2060 WORD JOINER, whose entire purpose in life is to be used as U+FEFF was formerly used, as a ZWNBSP. Any new Unicode text should use U+2060 and not U+FEFF as a word joiner. It's hard to imagine that UTC and WG2 would have standardized this if there was a lot of real-world text that used U+FEFF as ZWNBSP. -Doug Ewell Fullerton, California
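[Editor's aside: Doug's heuristic, treating a leading U+FEFF in UTF-8 as a signature rather than content, is easy to state as code. A minimal sketch; the helper name is invented for illustration.]

```python
import codecs  # codecs.BOM_UTF8 is the three bytes EF BB BF

def split_utf8_signature(data: bytes) -> tuple[bool, bytes]:
    """If the byte stream starts with EF BB BF, treat it as a signature
    and return the content without it; otherwise return it unchanged."""
    if data.startswith(codecs.BOM_UTF8):
        return True, data[len(codecs.BOM_UTF8):]
    return False, data
```

Under Doug's argument, the rare file that really begins with a content ZWNBSP loses nothing important, since new text should use U+2060 WORD JOINER for that purpose anyway.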
Re: Names for UTF-8 with and without BOM
"Michael (michka) Kaplan" wrote:

> > .xml UTF-8N Some XML processors may not cope with BOM
>
> Maybe they need to upgrade? Since people often edit the files in notepad,
> many files are going to have it. A parser that cannot accept this reality is
> not going to make it very long.

I didn't think the XML standard allowed for utf-8 files to have a BOM. The standard is quite clear about requiring 0xFEFF for utf-16. I would have thought a proper parser would reject a non-utf-16 file beginning with something other than "<".

(The fact that notepad puts it there should be irrelevant.)

Am I wrong about XML and the utf-8 signature?

tex

--
Tex Texin   cell: +1 781 789 1898   mailto:Tex@;XenCraft.com
Xen Master                          http://www.i18nGuy.com
XenCraft                            http://www.XenCraft.com
Making e-Business Work Around the World
Re: Names for UTF-8 with and without BOM
From: "Joseph Boyle" <[EMAIL PROTECTED]>

> These are listed as examples to demonstrate the idea of a configuration file
> listing encoding constraints. The fact that each constraint is arguable is a
> good reason to make the constraints configurable, and therefore to have
> names to distinguish BOM and non-BOM UTF-8.

Yes, but the fact that every one of them can have it or not, and that only inadequate parsers will ever really have a problem with them, is a good indication that it is not really required for the users who care about separate charset names.

MichKa
RE: Names for UTF-8 with and without BOM
These are listed as examples to demonstrate the idea of a configuration file listing encoding constraints. The fact that each constraint is arguable is a good reason to make the constraints configurable, and therefore to have names to distinguish BOM and non-BOM UTF-8.

-----Original Message-----
From: Michael (michka) Kaplan [mailto:michka@;trigeminal.com]
Sent: Saturday, November 02, 2002 10:16 AM
To: Joseph Boyle; Mark Davis; Murray Sargent
Cc: [EMAIL PROTECTED]
Subject: Re: Names for UTF-8 with and without BOM

From: "Joseph Boyle" <[EMAIL PROTECTED]>

> Type Encoding Comment
> .txt UTF-8BOM We want plain text files to have BOM to distinguish from
> legacy codepage files

Not really required, but optional -- the performance hit of making sure it's valid UTF-8 is pretty minor. But people do open some *huge* text files in things like notepad

> .xml UTF-8N Some XML processors may not cope with BOM

Maybe they need to upgrade? Since people often edit the files in notepad, many files are going to have it. A parser that cannot accept this reality is not going to make it very long.

> .htm UTF-8 We want HTML to be UTF-8 but will not insist on BOM

Same as text, with the bonus of the possibility of a higher level protocol. It can still go either way.

> .rc Codepage Unfortunately compiler insists on these being codepage.

They can be UTF-16, too (at least on Win32!).

> .swt ASCII Nonlocalizable internal format, must be ASCII.

Haven't run across these -- but note that if it's not UTF-8 then it does not apply.
Re: Names for UTF-8 with and without BOM
From: "Joseph Boyle" <[EMAIL PROTECTED]>

> Type Encoding Comment
> .txt UTF-8BOM We want plain text files to have BOM to distinguish
> from legacy codepage files

Not really required, but optional -- the performance hit of making sure it's valid UTF-8 is pretty minor. But people do open some *huge* text files in things like notepad

> .xml UTF-8N Some XML processors may not cope with BOM

Maybe they need to upgrade? Since people often edit the files in notepad, many files are going to have it. A parser that cannot accept this reality is not going to make it very long.

> .htm UTF-8 We want HTML to be UTF-8 but will not insist on BOM

Same as text, with the bonus of the possibility of a higher level protocol. It can still go either way.

> .rc Codepage Unfortunately compiler insists on these being
> codepage.

They can be UTF-16, too (at least on Win32!).

> .swt ASCII Nonlocalizable internal format, must be ASCII.

Haven't run across these -- but note that if it's not UTF-8 then it does not apply.
RE: Names for UTF-8 with and without BOM
The first time I thought of UTF-8Y it sounded too flippant, but actually it is fairly self-explanatory if UTF-8 is taken as a given, and has the virtue of being short. UTF-8S for signature would also make sense, but is the same as the name of Toby Phipps's proposal which eventually became CESU-8. UTF-8J will certainly make sense, after UTC changes all the character names to Esperanto, conducts its meetings in Esperanto, and publishes TUS in Esperanto. If we want to be really explicit, there's UTF-8EFBBBF. -Original Message- From: William Overington [mailto:WOverington@;ngo.globalnet.co.uk] Sent: Friday, November 01, 2002 10:37 PM To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: Re: Names for UTF-8 with and without BOM As you have UTF-8N where the N stands for the word "no" one could possibly have UTF-8Y where the Y stands for the word "yes". Thus one could have the name of the format answering, or not answering, the following question. Is there a BOM encoded? However, using the letter Y has three disadvantages for widespread use. The letter Y could be confused with the word "why", the word "yes" is English, so the designation would be anglocentric, and the letter Y sorts alphabetically after the letter N. However, if one considers the use of the international language Esperanto, then the N would mean "ne", that is, the Esperanto word for "no" and thus one could use the letter J to stand for the Esperanto word "jes" which is the Esperanto word for "yes" and which, in fact, is pronounced exactly the same as the English word "yes". Thus, I suggest that the three formats could be UTF-8, UTF-8J and UTF-8N, which would solve the problem in a manner which, being based upon a neutral language, will hopefully be acceptable to all. William Overington 2 November 2002
RE: Names for UTF-8 with and without BOM
The main need I see is not to tell a consumer whether a leading U+FEFF is a BOM or ZWNBSP, but:

* for producers (telling whether to emit a BOM or not), and
* normative (a checker enforcing an encoding standard per file type, defined in a table like the one below)

Type   Encoding   Comment
.txt   UTF-8BOM   We want plain text files to have BOM to distinguish from legacy codepage files
.xml   UTF-8N     Some XML processors may not cope with BOM
.htm   UTF-8      We want HTML to be UTF-8 but will not insist on BOM
.rc    Codepage   Unfortunately compiler insists on these being codepage.
.swt   ASCII      Nonlocalizable internal format, must be ASCII.

Please consider the proposal for separate charset names on that basis and not on the basis of utility for telling a consumer whether U+FEFF is a BOM, which I agree is by now a nonissue.

-----Original Message-----
From: Michael (michka) Kaplan [mailto:michka@;trigeminal.com]
Sent: Saturday, November 02, 2002 4:18 AM
To: Mark Davis; Murray Sargent; Joseph Boyle
Cc: [EMAIL PROTECTED]
Subject: Re: Names for UTF-8 with and without BOM

From: "Mark Davis" <[EMAIL PROTECTED]>

> That is not sufficient. The first three bytes could represent a real content
> character, ZWNBSP or they could be a BOM. The label doesn't tell you.

There are several problems with this supposition -- most notably the fact that there are cases that specifically claim this is not recommended and that U+2060 is preferred.

> This is similar to UTF-16 CES vs UTF-16BE CES. In the first case, 0xFE 0xFF
> represents a BOM, and is not part of the content. In the second case,
> it does *not* represent a BOM -- it represents a ZWNBSP, and must not
> be stripped. The difference here is that the encoding name tells you
> exactly what the situation is.

I do not see this as a realistic scenario. I would argue that if the BOM matches the encoding scheme, perhaps this was an intentional effort to make sure that applications which may not understand the higher level protocol can also see what the encoding scheme is.
But even if we assume that someone has gone to the trouble of calling something UTF-16BE and has 0xFE 0xFF at the beginning of the file: what kind of content *is* such a code point that this is even worth calling out as a special case?

If the goal is clear and unambiguous text, then the best way would be to simplify ALL of this. It was previously decided to always call it a BOM; why not stick with that?

MichKa
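The "normative checker" Joseph describes can be sketched in a few lines. This is a minimal illustration only, assuming a policy table like his; the file names, rule labels, and the `check` function are hypothetical, not part of any real tool.

```python
# Sketch of a checker enforcing a per-file-type encoding policy like the
# table above. The extensions and rule names mirror the table; "UTF-8BOM"
# requires a BOM, "UTF-8N" forbids one, "UTF-8" allows either, and
# "ASCII" requires 7-bit bytes with no BOM.

BOM_UTF8 = b"\xef\xbb\xbf"

POLICY = {
    ".txt": "UTF-8BOM",
    ".xml": "UTF-8N",
    ".htm": "UTF-8",
    ".swt": "ASCII",
}

def check(name: str, data: bytes) -> bool:
    """Return True if the raw bytes satisfy the policy for this file type."""
    if "." not in name:
        return True                      # no extension, no rule
    rule = POLICY.get(name[name.rfind("."):])
    if rule is None:
        return True                      # no rule for this type
    has_bom = data.startswith(BOM_UTF8)
    if rule == "ASCII":
        return not has_bom and all(b < 0x80 for b in data)
    body = data[len(BOM_UTF8):] if has_bom else data
    try:
        body.decode("utf-8")             # must be well-formed UTF-8
    except UnicodeDecodeError:
        return False
    if rule == "UTF-8BOM":
        return has_bom
    if rule == "UTF-8N":
        return not has_bom
    return True                          # "UTF-8": BOM optional
```

Run over a source tree, such a check lets release engineering enforce the table mechanically rather than by convention.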
Re: Names for UTF-8 with and without BOM
From: "Mark Davis" <[EMAIL PROTECTED]>

> That is not sufficient. The first three bytes could represent a real content
> character, ZWNBSP or they could be a BOM. The label doesn't tell you.

There are several problems with this supposition -- most notably the fact that there are cases that specifically state this is not recommended and that U+2060 is preferred.

> This is similar to UTF-16 CES vs UTF-16BE CES. In the first case, 0xFE 0xFF
> represents a BOM, and is not part of the content. In the second case, it
> does *not* represent a BOM -- it represents a ZWNBSP, and must not be
> stripped. The difference here is that the encoding name tells you exactly
> what the situation is.

I do not see this as a realistic scenario. I would argue that if the BOM matches the encoding scheme, perhaps this was an intentional effort to make sure that applications which may not understand the higher-level protocol can also see what the encoding scheme is.

But even if we assume that someone has gone to the trouble of calling something UTF-16BE and has 0xFE 0xFF at the beginning of the file: what kind of content *is* such a code point that this is even worth calling out as a special case?

If the goal is clear and unambiguous text, then the best way would be to simplify ALL of this. It was previously decided to always call it a BOM; why not stick with that?

MichKa
Re: Names for UTF-8 with and without BOM
That is not sufficient. The first three bytes could represent a real content character, a ZWNBSP, or they could be a BOM. The label doesn't tell you.

This is similar to UTF-16 CES vs UTF-16BE CES. In the first case, 0xFE 0xFF represents a BOM, and is not part of the content. In the second case, it does *not* represent a BOM -- it represents a ZWNBSP, and must not be stripped. The difference here is that the encoding name tells you exactly what the situation is.

Mark
__
http://www.macchiato.com
► "Eppur si muove" ◄

----- Original Message -----
From: "Murray Sargent" <[EMAIL PROTECTED]>
To: "Joseph Boyle" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Friday, November 01, 2002 12:42
Subject: RE: Names for UTF-8 with and without BOM

> Joseph Boyle says: "It would be useful to have official names to
> distinguish UTF-8 with and without BOM."
>
> To see if a UTF-8 file has no BOM, you can just look at the first three
> bytes. Is this a problem? Typically when you care about a file's
> encoding form, you plan to read the file.
>
> Thanks
> Murray
Re: Names for UTF-8 with and without BOM
As you have UTF-8N, where the N stands for the word "no", one could possibly have UTF-8Y, where the Y stands for the word "yes". Thus one could have the name of the format answering, or not answering, the question: is there a BOM encoded?

However, using the letter Y has three disadvantages for widespread use: the letter Y could be confused with the word "why"; the word "yes" is English, so the designation would be anglocentric; and the letter Y sorts alphabetically after the letter N.

However, if one considers the use of the international language Esperanto, then the N would mean "ne", the Esperanto word for "no", and one could use the letter J to stand for "jes", the Esperanto word for "yes", which is in fact pronounced exactly the same as the English word "yes".

Thus, I suggest that the three formats could be UTF-8, UTF-8J and UTF-8N, which would solve the problem in a manner which, being based upon a neutral language, will hopefully be acceptable to all.

William Overington
2 November 2002
RE: Names for UTF-8 with and without BOM
Joseph Boyle says: "It would be useful to have official names to distinguish UTF-8 with and without BOM."

To see if a UTF-8 file has no BOM, you can just look at the first three bytes. Is this a problem? Typically when you care about a file's encoding form, you plan to read the file.

Thanks
Murray
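Murray's check is a three-byte comparison. A minimal sketch (the function name is illustrative):

```python
# A UTF-8 stream "has a BOM" exactly when its first three bytes are the
# UTF-8 encoding of U+FEFF: EF BB BF.
def has_utf8_bom(data: bytes) -> bool:
    return data[:3] == b"\xef\xbb\xbf"
```

This is trivially cheap, which is Murray's point; the open question in the thread is not detection but what the charset *name* should promise about those bytes.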
Re: Names for UTF-8 with and without BOM
> Perhaps it is time to think of three other words starting with B, O, M
> that make a better explanation.

Bollixed Operational Muddle ;-)

--Ken