RE: Names for UTF-8 with and without BOM - pragmatic

2002-11-20 Thread Joseph Boyle
produce internal ZWNBSPs is not part of any of our processing as far as I know. -Original Message- From: David Starner [mailto:[EMAIL PROTECTED]] Sent: Thursday, November 07, 2002 12:14 PM To: Markus Scherer Cc: unicode Subject: Re: Names for UTF-8 with and without BOM - pragmatic On Wed

RE: Names for UTF-8 with and without BOM - pragmatic

2002-11-07 Thread Kent Karlsson
Initial for each piece, as each is assumed to be a complete text file before concatenation. Nothing prevents copy/cp/cat and other commands from recognizing Unicode signatures, for as long as they don't claim to preserve initial U+FEFF. Yes there is, in a formal sense, for cat and cp.
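The concatenation case discussed in this message can be sketched in a few lines. This is only an illustration of a tool that recognizes the UTF-8 signature and therefore does not claim to preserve initial U+FEFF. The function name `concat_utf8` and the `keep_first_signature` parameter are hypothetical, not taken from any existing command.

```python
import codecs

def concat_utf8(pieces, keep_first_signature=True):
    """Concatenate UTF-8 byte strings, treating a leading EF BB BF in each
    piece as a signature rather than as content (an assumption)."""
    out = bytearray()
    for index, piece in enumerate(pieces):
        if piece.startswith(codecs.BOM_UTF8):
            # Drop the signature of every piece after the first, and of the
            # first piece too when the caller asks for unsigned output.
            if index > 0 or not keep_first_signature:
                piece = piece[len(codecs.BOM_UTF8):]
        out += piece
    return bytes(out)
```

A byte-exact tool such as cat or cp would skip this step entirely, which is the formal objection raised in the message.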

Re: Names for UTF-8 with and without BOM - pragmatic

2002-11-07 Thread David Starner
On Wed, Nov 06, 2002 at 09:47:43AM -0800, Markus Scherer wrote: The fact is that Windows uses UTF-8 and UTF-16 plain text files with signatures (BOMs) very simply, gracefully, and successfully. It has applied what I called the pragmatic approach here for about 10 years. It just works. It

RE: Names for UTF-8 with and without BOM - pragmatic

2002-11-06 Thread Lars Kristan
Markus Scherer wrote: If software claims that it does not modify the contents of a document *except* for initial U+FEFF then it can do with initial U+FEFF what it wants. If the whole discussion hinges on what is allowed *if software claims to not modify text* then one need not claim

RE: Names for UTF-8 with and without BOM - pragmatic

2002-11-06 Thread Marco Cimarosti
Lars Kristan wrote: .txt UTF-8 require We want plain text files to have BOM to distinguish from legacy codepage files H, what does plain mean?! Perhaps files with a BOM should be called text files (or .txt files;) as opposed to plain

RE: Names for UTF-8 with and without BOM - pragmatic

2002-11-06 Thread Kent Karlsson
True, UTF-16 files do need a signature. Eh, no! UTF-16BE and UTF-16LE files (or whatever kind of text data element) do not have any signature/BOM. Not even files (somehow) labelled UTF-16 need have a signature/BOM; without a BOM they are then the same as if they were labelled UTF-16BE.
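The rule stated in this message (data labelled UTF-16 with no BOM is read as big-endian) can be written out explicitly. This is a minimal sketch, and the helper name `decode_utf16` is hypothetical. It makes the byte-order decision itself rather than relying on whatever default a codec applies to BOM-less input.

```python
import codecs

def decode_utf16(data: bytes) -> str:
    """Decode data labelled 'UTF-16': honour a BOM if present,
    otherwise treat the bytes as big-endian."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return data[2:].decode("utf-16-le")
    # No signature: same as if the data had been labelled UTF-16BE.
    return data.decode("utf-16-be")
```

Data labelled UTF-16BE or UTF-16LE would bypass this function, since under those labels an initial U+FEFF is not a signature.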

Re: Names for UTF-8 with and without BOM - pragmatic

2002-11-06 Thread Markus Scherer
Lars Kristan wrote: Markus Scherer wrote: If software claims that it does not modify the contents of a document *except* for initial U+FEFF then it can do with initial U+FEFF what it wants. If the whole discussion hinges on what is allowed *if software claims to not modify text* then one

Re: Names for UTF-8 with and without BOM - pragmatic

2002-11-05 Thread Markus Scherer
Mark Davis wrote: Little probability that right double quote would appear at the start of a document either. Doesn't mean that you are free to delete it (*and* say that you are not modifying the contents). This points to a pragmatic way to deal with this issue: If software claims that it does

Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Peter_Constable
On 11/02/2002 12:15:54 PM Michael (michka) Kaplan wrote: .xml UTF-8N Some XML processors may not cope with BOM Maybe they need to upgrade? Since people often edit the files in notepad, many files are going to have it. A parser that cannot accept this reality is not going to make it very

RE: Names for UTF-8 with and without BOM

2002-11-03 Thread Peter_Constable
On 11/02/2002 11:59:24 AM Joseph Boyle wrote: The first time I thought of UTF-8Y it sounded too flippant, but actually it is fairly self-explanatory if UTF-8 is taken as a given, and has the virtue of being short. UTF-8Y (and UTF-8J) is not at all intuitive. UTF-8-yuk? The better counterpart

Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Michael (michka) Kaplan
From: [EMAIL PROTECTED] In particular, I'm thinking of a situation about a year and a half ago (IIRC) in which Michael (and I and others) were strongly opposed to a suggestion that the Unicode Consortium should document a certain variation (perversion, some would say) of one of the Unicode

Re: Names for UTF-8 with and without BOM

2002-11-03 Thread John Cowan
[EMAIL PROTECTED] scripsit: I find it interesting, then, to see Michael saying that, since Notepad sticks a BOM-cum-signature at the start of its UTF-8, the rest of the world should support it. There is another argument, viz. ISO/IEC 10646, which plainly proclaims that the 8-BOM is a valid

Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Mark Davis
, 2002 13:27 Subject: Re: Names for UTF-8 with and without BOM Mark Davis mark dot davis at jtcsv dot com wrote: That is not sufficient. The first three bytes could represent a real content character, ZWNBSP or they could be a BOM. The label doesn't tell you. I have never understood under

Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Mark Davis
]; Murray Sargent [EMAIL PROTECTED]; Joseph Boyle [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Saturday, November 02, 2002 04:18 Subject: Re: Names for UTF-8 with and without BOM From: Mark Davis [EMAIL PROTECTED] That is not sufficient. The first three bytes could represent a real content

Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Michael (michka) Kaplan
From: Mark Davis [EMAIL PROTECTED] Ironic that for the purpose of dealing with THREE bytes that so many bytes are being wasted. :-) Little probability that right double quote would appear at the start of a document either. Doesn't mean that you are free to delete it (*and* say that you are

Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Doug Ewell
Mark Davis mark dot davis at jtcsv dot com wrote: Little probability that right double quote would appear at the start of a document either. Doesn't mean that you are free to delete it (*and* say that you are not modifying the contents). True, but right double quote: (a) has a visible glyph

Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Mark Davis
for UTF-8 with and without BOM From: Mark Davis [EMAIL PROTECTED] Ironic that for the purpose of dealing with THREE bytes that so many bytes are being wasted. :-) Little probability that right double quote would appear at the start of a document either. Doesn't mean that you are free

Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Mark Davis
for UTF-8 with and without BOM From: Mark Davis [EMAIL PROTECTED] Ironic that for the purpose of dealing with THREE bytes that so many bytes are being wasted. :-) Little probability that right double quote would appear at the start of a document either. Doesn't mean that you are free

Re: Names for UTF-8 with and without BOM

2002-11-03 Thread Mark Davis
13:02 Subject: Re: Names for UTF-8 with and without BOM From: Mark Davis [EMAIL PROTECTED] Ironic that for the purpose of dealing with THREE bytes that so many bytes are being wasted. :-) Little probability that right double quote would appear at the start of a document either. Doesn't

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Mark Davis
Sargent [EMAIL PROTECTED] To: Joseph Boyle [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Friday, November 01, 2002 12:42 Subject: RE: Names for UTF-8 with and without BOM Joseph Boyle says: It would be useful to have official names to distinguish UTF-8 with and without BOM. To see if a UTF-8 file

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Michael (michka) Kaplan
From: Mark Davis [EMAIL PROTECTED] That is not sufficient. The first three bytes could represent a real content character, ZWNBSP or they could be a BOM. The label doesn't tell you. There are several problems with this supposition -- most notably the fact that there are cases that specifically

RE: Names for UTF-8 with and without BOM

2002-11-02 Thread Joseph Boyle
;ngo.globalnet.co.uk] Sent: Friday, November 01, 2002 10:37 PM To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: Re: Names for UTF-8 with and without BOM As you have UTF-8N where the N stands for the word no one could possibly have UTF-8Y where the Y stands for the word yes. Thus one could have the name

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Michael (michka) Kaplan
From: Joseph Boyle [EMAIL PROTECTED] Type Encoding Comment .txt UTF-8BOM We want plain text files to have BOM to distinguish from legacy codepage files Not really required, but optional -- the performance hit of making sure it's valid UTF-8 is pretty minor. But people do open some *huge* text
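The alternative to a signature mentioned here, checking that the bytes are valid UTF-8, might look like the sketch below. The function name `looks_like_utf8` is hypothetical. Note that pure ASCII passes the check as well, so it separates UTF-8 from most legacy codepage text only when non-ASCII bytes are present.

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the whole byte string decodes as strict UTF-8."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True
```

The check reads every byte, which is the cost the message refers to for very large files.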

RE: Names for UTF-8 with and without BOM

2002-11-02 Thread Joseph Boyle
- From: Michael (michka) Kaplan [mailto:michka;trigeminal.com] Sent: Saturday, November 02, 2002 10:16 AM To: Joseph Boyle; Mark Davis; Murray Sargent Cc: [EMAIL PROTECTED] Subject: Re: Names for UTF-8 with and without BOM From: Joseph Boyle [EMAIL PROTECTED] Type Encoding Comment .txt UTF-8BOM

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Michael (michka) Kaplan
From: Joseph Boyle [EMAIL PROTECTED] These are listed as examples to demonstrate the idea of a configuration file listing encoding constraints. The fact that each constraint is arguable is a good reason to make the constraints configurable, and therefore to have names to distinguish BOM and

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Doug Ewell
Mark Davis mark dot davis at jtcsv dot com wrote: That is not sufficient. The first three bytes could represent a real content character, ZWNBSP or they could be a BOM. The label doesn't tell you. I have never understood under what circumstances a ZWNBSP would ever appear as the first

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Tex Texin
Thanks Doug. I had looked at the standard not at the appendix. I think that (non-normative) appendix is unfortunate. It seems to imply (to my mind) that if other character sets define BOMs that it is ok to use them as XML signatures. My reasoning is that the standard itself only says that

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread John Cowan
Tex Texin scripsit: I didn't think the XML standard allowed for utf-8 files to have a BOM. This capability was never actually excluded, and was added by erratum (and force-majeure, when it became clear that BOMful UTF-8 was going to start becoming common). XML files are intended to be plain

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread John Cowan
Tex Texin scripsit: However, that leaves open the question whether only the Unicode transform signatures are acceptable or other signatures are also allowed. So if a vendor defines a code page, and defines a signature (perhaps mapping BOM/ZWNBSP specifically to some code point or byte string)

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Tex Texin
Hi John, I meant the character . As for notepad, what I should have either stated more completely or bit my tongue, is that where there is a standard in place (and where it is unambiguous) the mistakes of particular products shouldn't hold sway, unless they are tantamount to a de facto standard.

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Tex Texin
John, I understand the flexibility of XML to use different encodings. However, I didn't realize that parsers were to allow for the possibility of different signatures. So a parser has to worry about scsu signatures, etc Whereas XML is so fussy about which characters it accepts, I am

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread John Cowan
Tex Texin scripsit: So when the parser gets JOECODE, I can understand ignoring the signature and autodetection, but exactly how does it find the first ? Well, if it begins with an 00 byte, it can't be UTF-8 or UTF-16 (it might be UTF-32 big-endian, but we'll suppose the parser can't handle

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Michael (michka) Kaplan
is an unrealistic one. MichKa - Original Message - From: Tex Texin [EMAIL PROTECTED] To: Michael (michka) Kaplan [EMAIL PROTECTED] Cc: Mark Davis [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Saturday, November 02, 2002 11:08 AM Subject: Re: Names for UTF-8 with and without BOM Michael (michka

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Doug Ewell
Tex Texin tex at i18nguy dot com wrote: However, I didn't realize that parsers were to allow for the possibility of different signatures. So a parser has to worry about scsu signatures, etc A parser only *has* to read UTF-8 without signature and UTF-16 with signature. It *may* read other

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Tex Texin
John Cowan wrote: Tex Texin scripsit: So when the parser gets JOECODE, I can understand ignoring the signature and autodetection, but exactly how does it find the first ? Well, if it begins with an 00 byte, it can't be UTF-8 or UTF-16 (it might be UTF-32 big-endian, but we'll suppose

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread John Cowan
Tex Texin scripsit: Interestingly, although I didn't study it in detail, looking at rfc 2376 for prioritization over charset conflicts, it seems to recommend stripping the BOM when converting from utf-16 to other charsets (and without considering that ucs-4 would like to keep it). (section

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Tex Texin
Doug, Doug Ewell wrote: Tex Texin tex at i18nguy dot com wrote: However, I didn't realize that parsers were to allow for the possibility of different signatures. So a parser has to worry about scsu signatures, etc A parser only *has* to read UTF-8 without signature and UTF-16

Re: Names for UTF-8 with and without BOM

2002-11-02 Thread Tex Texin
John Cowan wrote: Tex Texin scripsit: Interestingly, although I didn't study it in detail, looking at rfc 2376 for prioritization over charset conflicts, it seems to recommend stripping the BOM when converting from utf-16 to other charsets (and without considering that ucs-4 would

Re: Names for UTF-8 with and without BOM

2002-11-01 Thread Kenneth Whistler
Perhaps it is time to think of three other words starting with B, O, M that make a better explanation.) Bollixed Operational Muddle ;-) --Ken

RE: Names for UTF-8 with and without BOM

2002-11-01 Thread Murray Sargent
Joseph Boyle says: It would be useful to have official names to distinguish UTF-8 with and without BOM. To see if a UTF-8 file has no BOM, you can just look at the first three bytes. Is this a problem? Typically when you care about a file's encoding form, you plan to read the file. Thanks Murray
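Looking at the first three bytes, as suggested here, is a one-line test. A minimal sketch follows, with the helper name `has_utf8_bom` being hypothetical. It reports only whether the bytes EF BB BF are present and cannot say whether they were meant as a signature or as a content ZWNBSP, which is the objection raised elsewhere in the thread.

```python
import codecs

def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 signature EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)
```

For a file on disk, reading just the first three bytes in binary mode and passing them to the function is enough.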

Re: Names for UTF-8 with and without BOM

2002-11-01 Thread William Overington
As you have UTF-8N where the N stands for the word no one could possibly have UTF-8Y where the Y stands for the word yes. Thus one could have the name of the format answering, or not answering, the following question. Is there a BOM encoded? However, using the letter Y has three disadvantages