On Sat, 02 Feb 2008 07:24:36 +0100, Stefan Behnel wrote:

> Steven D'Aprano wrote:
>> The same way it knows that "<?xml" is "<?xml" before it sees the
>> encoding. If the parser knows that the hex bytes
>>
>> 3c 3f 78 6d 6c
>>
>> (or 3c 00 3f 00 78 00 6d 00 6c 00 if you prefer UTF-16, and feel free
>> to swap the byte order)
>>
>> mean "<?xml"
>>
>> then it can equally know that bytes
>>
>> 20 09 0a
>>
>> are whitespace. According to the XML standard, what else could they be?
>
> So, what about all the other unicode whitespace characters?

What about them? They aren't part of the XML spec, which defines
whitespace as the code points #x20, #x9, #xD and #xA. (Okay, I forgot
carriage return. Oops.) You don't have to support arbitrary whitespace,
only those four characters.

> And what
> about different encodings and byte orders that move the bytes around?

What about them? The Byte Order Mark is optional in the case of UTF-8,
and compulsory in the case of UTF-16. I quote:

"Entities encoded in UTF-16 must and entities encoded in UTF-8 may begin
with the Byte Order Mark described by Annex H of [ISO/IEC 10646:2000],
section 2.4 of [Unicode], and section 2.7 of [Unicode3] (the ZERO WIDTH
NO-BREAK SPACE character, #xFEFF). This is an encoding signature, not
part of either the markup or the character data of the XML document. XML
processors must be able to use this character to differentiate between
UTF-8 and UTF-16 encoded documents."

So if your XML document is written in UTF-8, you don't need a BOM
(although you can use one if you wish) and if it is in UTF-16 you *must*
have one, even before the '<?xml'. If you don't, how will the parser
recognise the characters '<?xml', not to mention the characters
'encoding' and 'utf-16'?

> Is
> it ok for a byte stream to start with "00 20" or does it have to start
> with "20 00"?

If you're using UTF-16, the byte stream MUST start with the BOM, so no,
the above is illegal. If the BOM has already been seen, then it will
tell the XML parser which order is legal, depending on whether the BOM
was FF FE or FE FF.

If you're using UTF-8, the byte streams "00 20" and "20 00" would both
be illegal: in UTF-8, the null byte is the Unicode code point #x0, which
is illegal in XML.

Support for any other encoding is entirely optional. A parser may choose
to support other encodings, or not, and deal with them appropriately.
But whatever encodings you support, the same issue comes up: if you can
recognise '<?xml' before seeing the encoding, why can't you recognise
whitespace?
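
To make that concrete, here is a minimal sketch in Python, not a real
parser. The name sniff_prolog is made up for illustration, and it
assumes only the two encodings the spec requires (UTF-8 and UTF-16),
treating input without a BOM as UTF-8. It picks the encoding from the
BOM, then uses the same decoded view of the bytes both to count the four
XML whitespace characters and to look for '<?xml'.

```python
import codecs

# The four code points XML 1.0 treats as whitespace: #x20, #x9, #xD, #xA.
XML_WHITESPACE = "\x20\x09\x0d\x0a"


def sniff_prolog(data: bytes):
    """Hypothetical helper: return (encoding, leading_ws_count, has_xml_decl).

    Assumption: only UTF-8 and UTF-16 are handled, and input without a
    BOM is treated as UTF-8 (where the BOM is optional).
    """
    if data.startswith(codecs.BOM_UTF8):
        encoding, body = "utf-8", data[len(codecs.BOM_UTF8):]
    elif data.startswith(codecs.BOM_UTF16_LE):    # FF FE
        encoding, body = "utf-16-le", data[2:]
    elif data.startswith(codecs.BOM_UTF16_BE):    # FE FF
        encoding, body = "utf-16-be", data[2:]
    else:
        encoding, body = "utf-8", data

    # Decoding everything up front is wasteful but keeps the sketch short.
    text = body.decode(encoding)
    stripped = text.lstrip(XML_WHITESPACE)
    return encoding, len(text) - len(stripped), stripped.startswith("<?xml")
```

For example, sniff_prolog(b" \t\n<?xml version='1.0'?><a/>") reports
three leading whitespace characters followed by an XML declaration,
which is the case the spec tells a parser to reject. Nothing about
finding the whitespace needed information that finding '<?xml' didn't.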

> What about "00 20 00 00" and "00 00 00 20"? Are you sure
> that means 0x20 encoded in 4 bytes, or is it actually the unicode
> character 0x2000? What complexity do you want to put into the parser
> here?

I'm not putting any complexity into the parser that the XML standard
doesn't already demand. Perhaps you should read it yourself:

http://www.w3.org/TR/xml/

In particular, note that a parser must be prepared to accept leading
whitespace at the start of a document, and only reject it if it comes
across an XML declaration.

> "In the face of ambiguity, refuse the temptation to guess"

What ambiguity, and what guess? My earlier question wasn't rhetorical. I
asked "According to the XML standard, what else could they [whitespace]
be?". Just implying that they are ambiguous doesn't actually make them
ambiguous.

I don't believe there is an ambiguity at all. That's what makes the
prohibition on leading whitespace before the '<?xml' declaration all the
more puzzling: there doesn't seem to be any good reason for it.

If I am wrong, then will somebody please put me out of my misery and
tell me what leading whitespace could be mistaken for, in what
circumstances?

--
Steven