On Fri, 01 Feb 2008 07:51:56 +1100, Ben Finney wrote:

> Steven D'Aprano <[EMAIL PROTECTED]> writes:
>
>> On Fri, 01 Feb 2008 00:40:01 +1100, Ben Finney wrote:
>>
>> > Quite apart from a human thinking it's pretty or not pretty, it's
>> > *not valid XML* if the XML declaration isn't immediately at the start
>> > of the document <URL:http://www.w3.org/TR/xml/#sec-prolog-dtd>. Many
>> > XML parsers will (correctly) reject such a document.
>>
>> You know, I'd really like to know what the designers were thinking when
>> they made this decision.
>
> Probably much the same that the designers of the Unix shebang ("#!") or
> countless other "figure out whether the bitstream is a specific type"
> were thinking:

There's no real comparison with the shebang '#!'. It is important that
the shell can recognise a shebang with a single look-up for speed, and
the shell doesn't have to deal with the complexities of Unicode: if you
write your script in UTF-16, bash will complain that it can't execute
the binary file. The shell cares whether or not the first two bytes are
23 21.

An XML parser doesn't care about bytes, it cares about tags. It isn't
good enough for an XML parser to grab the first five bytes of a file and
say "That's legal XML!" in the same way that the shell can look at the
first two bytes of a script and say "That's a shebang!". An XML parser
must actually *parse*, even to determine whether or not it is looking at
XML. Any such parser must be prepared to accept leading whitespace at
the beginning of a file, and only reject it once it reaches an XML
declaration tag, if any.

When parsing a stream of bytes like this:

    ef bb bf 20 20 20 20 0a 09 3c 3f 78 6d 6c

the parser doesn't know it is illegal until it has seen the fourteenth
byte.

That's the worst of both worlds: you have to provisionally accept
whitespace just in case the XML declaration is missing, so you don't
save any complexity, but if the declaration is there, you reject a
perfectly fine document for an apparently arbitrary reason.

> It's better to be as precise as possible so that failure can be
> unambiguous, than to have more-complex parsing rules that lead to
> ambiguity in implementation.

Precision and complexity are orthogonal attributes. "All valid documents
must begin with the sequence of bytes representing the first 8093 digits
of pi to the power of e in base 256" is very precise and completely
unambiguous. There's one and only one byte sequence that satisfies such
a requirement. But it is also very complex.

On the other hand, "valid documents must begin with a number" is not
complex at all, but very imprecise: what counts as a number? Is the word
"one" a number?

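To make the earlier shebang-versus-XML point concrete, here is a rough
sketch that can be run against a real parser. It assumes Python 3 and
the expat-based xml.etree.ElementTree from the standard library; the
helper names is_shebang and parses, and the <doc/> sample document, are
made up for illustration. Other parsers may word the error differently.

```python
import xml.etree.ElementTree as ET

# The byte stream quoted earlier, up to but not including "<?xml":
# a UTF-8 BOM, four spaces, a newline and a tab.
BOM_AND_WHITESPACE = bytes.fromhex("ef bb bf 20 20 20 20 0a 09")
DECLARATION = b'<?xml version="1.0" encoding="UTF-8"?>'
BODY = b"<doc/>"  # hypothetical one-element document


def is_shebang(data):
    """The shell-style test: compare the first two bytes with 23 21."""
    return data[:2] == b"\x23\x21"


def parses(data):
    """The XML test: no shortcut, the parser actually has to parse."""
    try:
        ET.fromstring(data)
    except ET.ParseError:
        return False
    return True


if __name__ == "__main__":
    print(is_shebang(b"#!/bin/sh\necho hello\n"))
    print(parses(DECLARATION + BODY))
    print(parses(BOM_AND_WHITESPACE + BODY))
    print(parses(BOM_AND_WHITESPACE + DECLARATION + BODY))
```

Only the last call should print False: the same leading whitespace is
accepted when the declaration is absent and rejected when it is present.
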
A good example of how precision doesn't need to be the enemy of
flexibility and simplicity: Python's rule dealing with imports from
__future__ is precise. Any import from __future__ must be the first
executable line in a module:

(1) There's no ambiguity. The first executable line is well-defined in
the context of a Python program.

(2) The restriction is not arbitrary. There's a good technical reason
for it, and the rule doesn't needlessly restrict what you can do.

(3) It is human-friendly: you can precede the import by a shebang line,
a doc string, any other bare strings (so long as they aren't assigned to
a name), comments and empty lines.

-- 
Steven
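
P.S. A minimal sketch of the __future__ rule described above, using the
built-in compile() to ask the compiler directly. The two sample module
sources and the helper name compiles are invented for illustration, and
"division" is chosen only because it is a feature name every current
Python version recognises.

```python
# A __future__ import preceded only by a shebang line, a doc string,
# a comment and an empty line.
ACCEPTED = '''#!/usr/bin/env python
"""Module doc string."""
# an ordinary comment

from __future__ import division
x = 1
'''

# The same import, but placed after an executable statement.
REJECTED = '''import os
from __future__ import division
'''


def compiles(source):
    """Return True if the compiler accepts source as a module."""
    try:
        compile(source, "<example>", "exec")
    except SyntaxError:
        return False
    return True


if __name__ == "__main__":
    print(compiles(ACCEPTED))
    print(compiles(REJECTED))
```

The first source should compile and the second should be refused with a
SyntaxError, which is the whole rule in two cases.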