Hello all,

While attempting to make a wrapper for opening multiple types of UTF-encoded files (more on that later, in a separate post, I guess), I ran into some oddities with the `codecs` module, specifically to do with `register`ing `CodecInfo` objects. I'd like to report a bug or something, but there are several intertwined issues here and I'm not really sure how to report them, so I thought I'd open the discussion. Apologies in advance if I get a bit rant-y, and a warning that this is fairly long.

Observe what happens when you `register` the wrong function:

>>> import codecs
>>> def ham(name):
...     # Very obviously wrong, just for demonstration purposes
...     if name == 'spam': return 'eggs'
...
>>> codecs.register(ham)

Already there is a problem, in that there is no error. There is no realistic way to catch this at registration time, of course, but IMHO it points to an issue with the interface. I don't want to register a codec lookup function; I want to register *a codec*. The built-in lookup process would be just fine if I could just somehow tell it about this one new codec I have... I really don't see the use case for the added flexibility of the current interface, and it means that every time I have a new codec, I need to either create a new lookup function as well (to register it), or hook into an existing one that's still of my own creation.

Anyway, moving on, let's see what happens when we try to use the faulty codec:

>>> codecs.getencoder('spam')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\lib\codecs.py", line 939, in getencoder
    return lookup(encoding).encode
TypeError: codec search functions must return 4-tuples

Ehh?! That's odd. I thought I was supposed to return a `CodecInfo` object, not a 4-tuple! Although, as an aside, AFAICT the documentation *doesn't actually document the CodecInfo class*; it just says what attributes CodecInfo objects are supposed to have.

A bit of digging around with Google and old bugs on the tracker suggests that this comes about due to backwards-compatibility: in 2.4 and below, they *were* 4-tuples. But now CodecInfo objects are expected to provide 6 functions (and a name), not 4. Clearly that won't fit in a 4-tuple, and anyway I thought we had gotten rid of all this deprecated stuff.

Regardless, let's see what happens if we do try to register a 4-tuple-lookup-er:

>>> def spam(name):
...     # As long as we return a 4-tuple, it doesn't really matter what the functions are;
...     # errors shouldn't happen until we actually attempt to encode/decode. Right?
...     if name == 'spam': return (spam, spam, spam, spam)

Oops, we need to restart the interpreter, or otherwise reset global state somehow, because the old lookup function has priority over this one, and *there is no way to unregister it*. But once that's fixed:

>>> codecs.getencoder('spam')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\lib\codecs.py", line 939, in getencoder
    return lookup(encoding).encode
AttributeError: 'tuple' object has no attribute 'encode'

That's quite odd indeed. We can't actually trust the error message we got before! Plain 4-tuples don't work any more like they used to, so our backwards-compatibility concession doesn't even work. Meanwhile, we're left wondering how CodecInfo objects work at all. Is the error message wrong? Nope, well, not really. Let's grab a known-good CodecInfo object and see what we can find out...

>>> utf8 = codecs.lookup('utf-8')
>>> utf8.__class__.__bases__
(<class 'tuple'>,)
>>> # not collections.namedtuple, which is understandable, since that wasn't available until 2.6...
>>> len(utf8)
4
>>> # OK, apparently it magically actually is a tuple of length 4 despite needing 7 attributes. I wonder which ones are included:
>>> tuple(utf8)
(<built-in function utf_8_encode>, <function decode at 0x01993390>, <class 'encodings.utf_8.StreamReader'>, <class 'encodings.utf_8.StreamWriter'>)
>>> # Unsurprising: the ones mandated by the original PEP (100! That long ago...)

... and if we try `help` (or look at examples in the standard library or find them with Google - but I sure don't see any in the webpage docs), we can at least find out how to construct a CodecInfo object properly - although, curiously, it's implemented using `__new__` rather than `__init__`.

You *can* hack around with `collections.namedtuple` and create something that basically works:

# restarting again...
>>> import codecs, collections
>>> my_codecinfo = collections.namedtuple('my_codecinfo', 'encode decode streamreader streamwriter')
>>> def spam(name):
...     if name == 'spam': return my_codecinfo(spam, spam, spam, spam)

And now the error correctly doesn't occur until we actually attempt to encode or decode something. Except we still don't have an incremental decoder/encoder, and in fact those attributes are missing entirely, rather than being `None` as the `CodecInfo` class defaults them. (Of course, we can subclass the namedtuple class to fix this, but then we're basically reverse-engineering the `codecs.CodecInfo` class wholesale...)

Speaking of which, one last thing:

>>> # Another restart, of course
>>> import codecs
>>> def spam(name):
...     if name == 'spam': return codecs.CodecInfo(spam, spam)
...
>>> codecs.register(spam)
>>> codecs.getincrementaldecoder('spam')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python32\lib\codecs.py", line 976, in getincrementaldecoder
    raise LookupError(encoding)
LookupError: spam

That seems wrong to me too: the codec is certainly *there*, it just doesn't support incremental decoding. I would expect the error message to be more specific.

--
~Zahlman {:>
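
P.S. To make the "register *a codec*, not a lookup function" idea concrete, here is a minimal sketch of the kind of helper I have in mind. Everything named here (`register_codec`, `_registry`, `_search`, `_normalize`, and the toy `spam` codec that just wraps UTF-8) is made up for illustration and is not part of the `codecs` module. The name normalization is my own assumption, meant to tolerate whatever `codecs.lookup` does to the name before calling search functions.

```python
import codecs

# Hypothetical registry: normalized codec name -> CodecInfo.
_registry = {}

def _normalize(name):
    # Assumption: treat case, spaces and hyphens as insignificant,
    # so the key matches regardless of how lookup() pre-processes names.
    return name.lower().replace('-', '_').replace(' ', '_')

def _search(name):
    # A single search function serves every codec added via register_codec().
    # Returning None tells the codec machinery to try the next search function.
    return _registry.get(_normalize(name))

codecs.register(_search)

def register_codec(info):
    """Register one CodecInfo object under its own .name (hypothetical helper)."""
    if not isinstance(info, codecs.CodecInfo):
        raise TypeError('expected a codecs.CodecInfo, got %r' % type(info))
    if not info.name:
        raise ValueError('CodecInfo needs a name to be registered')
    _registry[_normalize(info.name)] = info

# Toy codec for demonstration: 'spam' is just UTF-8 under another name.
def spam_encode(text, errors='strict'):
    return text.encode('utf-8', errors), len(text)

def spam_decode(data, errors='strict'):
    return bytes(data).decode('utf-8', errors), len(data)

register_codec(codecs.CodecInfo(spam_encode, spam_decode, name='spam'))
```

With that in place, `'ham'.encode('spam')` goes through the toy codec, and passing anything other than a `CodecInfo` fails at registration time instead of at first use. It does nothing about the missing unregister, of course.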