[Edward Loper]
> I've been working on epydoc, and the question has come up of how I
> should treat non-unicode docstrings that contain non-ascii
> characters.  An example of such a file is
> "python2.4/encodings/string_escape.py", whose module docstring
> contains an 'o' with an umlaut.
>
> In particular, the question is whether I should assume that the
> docstring is encoded with the encoding specified by the "-*- coding
> -*-" directive at the top of the file.

I think that although it's the only possible assumption, it's also
potentially a wrong assumption.  IOW, don't assume anything.

> The reason why we *wouldn't* use the encoding is that PEP 263 [1],
> which defines the coding directive, says that it does *not* apply to
> non-unicode string literals.  In particular, PEP 263 says that the
> entire file should be read & tokenized using the specified coding,
> but once string objects are created, they should be reencoded back
> into 8-bit strings using the file encoding.

One reason is that the module code may expect such string literals to
have their original encoding.  String literals can contain arbitrary
8-bit data (strings are bytes, not characters).  Attempting to decode
such strings is inviting misinterpretation.

Another reason is simple: "In the face of ambiguity, refuse the
temptation to guess."

> So the "correct" fix is for the author of the module to use unicode
> literals instead of string literals for docstrings that contain
> non-ascii characters.  This has the advantage that if a user tries
> to look at the docstring via introspection, it will be correct.
>
> On the other hand, epydoc is often used by people other than the
> author of a module, and requiring them to go through and replace all
> string literal docstrings with unicode literals seems a bit
> unreasonable.

Yes, it's unreasonable.  But such code is buggy IMO.  It's also
unreasonable to expect Epydoc to correctly interpret garbage input.
Don't do it.

> So the question is..  Should epydoc (and other tools like it) be
> compliant with PEP 263 (and consistent with Python); or should they
> "do what I mean, not what I say" and treat non-ascii docstrings as
> if they were encoded using the module's encoding?

Be compliant with PEP 263, issue a warning (PEP 263, Implementation,
step 1), and either ignore such string literals or represent them as
strings of bytes (using "\xYY" notation).

-- 
David Goodger <http://python.net/~goodger>
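
To illustrate the last suggestion (warn, then show the bytes in "\xYY"
notation rather than guess an encoding), here is a minimal sketch.  It
is written for present-day Python 3, where an 8-bit string is a bytes
object.  The function name render_docstring_bytes and its parameters are
hypothetical and are not part of Epydoc.

```python
import warnings


def render_docstring_bytes(raw: bytes, name: str = "<unknown>") -> str:
    """Return printable text for an 8-bit docstring without guessing
    its encoding.  (Hypothetical helper, for illustration only.)"""
    try:
        # Pure-ASCII docstrings are unambiguous, so pass them through.
        return raw.decode("ascii")
    except UnicodeDecodeError:
        # Do not guess: warn, then keep every non-ASCII byte visible.
        warnings.warn(
            f"{name}: non-ASCII bytes in a non-unicode docstring; "
            "showing them in \\xYY notation",
            UserWarning,
            stacklevel=2,
        )
        # 'backslashreplace' turns each undecodable byte into \xYY.
        return raw.decode("ascii", errors="backslashreplace")


# The byte 0xF6 is an 'o' with an umlaut in Latin-1, but the function
# never assumes that; it only reports the raw byte value.
print(render_docstring_bytes(b"K\xf6nig", name="example_module"))
```

Ignoring such docstrings entirely, the other option mentioned above,
would amount to returning an empty string from the except branch.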
_______________________________________________
Doc-SIG maillist - [email protected]
http://mail.python.org/mailman/listinfo/doc-sig
