M.-A. Lemburg wrote:
> Walter Dörwald wrote:
>> M.-A. Lemburg wrote:
>>> Walter Dörwald wrote:
>>>> [...]
>>>>> Perhaps we should also deprecate codecs.lookup() in Py 2.5 ?!
>>>> +1, but I'd like to have a replacement for this, i.e. a function that
>>>> returns all info the registry has about an encoding:
>>>>
>>>> 1. Name
>>>> 2. Encoder function
>>>> 3. Decoder function
>>>> 4. Stateful encoder factory
>>>> 5. Stateful decoder factory
>>>> 6. Stream writer factory
>>>> 7. Stream reader factory
>>>>
>>>> and if this is an object with attributes, we won't have any problems if we
>>>> extend it in the future.
>>> Shouldn't be a problem: just expose the registry dictionary
>>> via the _codecs module.
>>>
>>> The rest can then be done in a Python function defined in
>>> codecs.py using a CodecInfo class.
>>
>> This would require the Python code to call codecs.lookup() and then look
>> into the codecs dictionary (normalizing the encoding name again). Maybe we
>> should make a version of _PyCodec_Lookup() that allows 4- and 6-tuples
>> available to Python and use that? The official PyCodec_Lookup() would then
>> have to downgrade the 6-tuples to 4-tuples.
>
> Hmm, you're right: the dictionary may not have the requested codec info yet
> (it's only used as cache) and only a call to _PyCodec_Lookup() would fill it.
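To make the seven-item list quoted above concrete, here is a minimal sketch of such a record as an object with attributes. It is only an illustration: the names CodecRecord and get_codec_record are hypothetical, and it assumes a current Python 3 codecs module (with its getencoder()/getincrementalencoder()/getwriter() helpers) rather than the 2.x code under discussion.

```python
import codecs
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CodecRecord:
    """Hypothetical record; NOT part of the codecs module."""
    name: str
    encoder: Callable            # stateless encode function
    decoder: Callable            # stateless decode function
    stateful_encoder: Callable   # factory for stateful encoders
    stateful_decoder: Callable   # factory for stateful decoders
    stream_writer: Callable      # stream writer factory
    stream_reader: Callable      # stream reader factory


def get_codec_record(encoding: str) -> CodecRecord:
    """Collect everything the registry knows about one encoding."""
    return CodecRecord(
        name=codecs.lookup(encoding).name,
        encoder=codecs.getencoder(encoding),
        decoder=codecs.getdecoder(encoding),
        stateful_encoder=codecs.getincrementalencoder(encoding),
        stateful_decoder=codecs.getincrementaldecoder(encoding),
        stream_writer=codecs.getwriter(encoding),
        stream_reader=codecs.getreader(encoding),
    )


rec = get_codec_record("latin-1")
encoded, consumed = rec.encoder("abc")  # stateless call: (result, count)
```

Because callers go through attribute names rather than tuple positions, a further field could be added later without breaking them, which is the point made in the quoted text.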

I'm now trying a different approach: codecs.lookup() returns a subclass of tuple. We could deprecate calling __getitem__() in 2.5/2.6 and then remove the tuple subclassing later.

>>>> BTW, if we change the API, can we fix the return value of the stateless
>>>> functions? As the stateless function always encodes/decodes the complete
>>>> string, returning the length of the string doesn't make sense.
>>>> codecs.getencoder() and codecs.getdecoder() would have to continue to
>>>> return the old variant of the functions, but
>>>> codecs.getinfo("latin-1").encoder would be the new encoding function.
>>> No: you can still write stateless encoders or decoders that do
>>> not process the whole input string. Just because we don't have
>>> any of those in Python, doesn't mean that they can't be written and used.
>>> A stateless codec might want to leave the work
>>> of buffering bytes at the end of the input data which cannot
>>> be processed to the caller.
>>
>> But what would the caller do with that info? It can't retry encoding/decoding
>> the rejected input, because the state of the codec has been thrown away
>> already.
>
> This depends a lot on the nature of the codec. It may well be
> possible to work on chunks of input data in a stateless way,
> e.g. say you have a string of 4-byte hex values, then the decode
> function would be able to work on 4 bytes each and let the caller
> buffer any remaining bytes for the next call. There'd be no need for keeping
> state in the decoder function.

So incomplete byte sequences would be silently ignored.

>>> It is also possible to write
>>> stateful codecs on top of such stateless encoding and decoding
>>> functions.
>>
>> That's what the codec helper functions from Python/_codecs.c are for.
>
> I'm not sure what you mean here.

_codecs.utf_8_decode() etc. use (result, count) tuples as the return value, because those functions are the building blocks of the codecs themselves.
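
The 4-byte hex case from the quoted text can be sketched roughly as follows. This is an illustration under assumptions, not code from any patch: hex4_decode and decode_chunks are hypothetical names, and the input is assumed to be ASCII hex digits. The decode function keeps no state and returns a (result, count) pair; the caller uses the count to keep the unprocessed tail for the next call.

```python
from typing import Iterable, List, Tuple


def hex4_decode(data: bytes) -> Tuple[List[int], int]:
    """Stateless: decode every complete 4-digit hex group in data.

    Returns (values, consumed); a trailing incomplete group is not
    consumed and is left for the caller to deal with.
    """
    usable = len(data) - len(data) % 4
    values = [int(data[i:i + 4].decode("ascii"), 16)
              for i in range(0, usable, 4)]
    return values, usable


def decode_chunks(chunks: Iterable[bytes]) -> Tuple[List[int], bytes]:
    """Caller-side buffering on top of the stateless function."""
    pending = b""
    out: List[int] = []
    for chunk in chunks:
        pending += chunk
        values, consumed = hex4_decode(pending)
        out.extend(values)
        pending = pending[consumed:]  # keep the unprocessed tail
    return out, pending


values, leftover = decode_chunks([b"0041004", b"2", b"00"])
```

If the caller ignores the count, the trailing bytes are simply dropped, which is the "silently ignored" case mentioned above.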

>> Anyway, I've started implementing a patch that just adds
>> codecs.StatefulEncoder/codecs.StatefulDecoder. UTF8, UTF8-Sig,
>> UTF-16, UTF-16-LE and UTF-16-BE are already working.
>
> Nice :-)

gencodec.py is updated now too. The rest should be manageable too. I'll leave updating the CJKV codecs to Hye-Shik though.

Bye,
   Walter Dörwald