Guido van Rossum wrote:

> On 4/12/07, Walter Dörwald <[EMAIL PROTECTED]> wrote:
>> Guido van Rossum wrote:
>> > On 4/11/07, Walter Dörwald <[EMAIL PROTECTED]> wrote:
>> >> Would it make sense to make the state of the decoder public, e.g. by
>> >> adding setstate() and getstate() methods? This would give a cleaner
>> >> API.
>> >
>> > I've been thinking of the same thing!
>> >
>> > I wonder if it would be possible to return the state as a pair
>> > (unread, flags) where unread is a (byte) string of unprocessed bytes
>> > and flags is some other state, with the constraint that in the initial
>> > state the flags must be zero. Then I can optimize the case where flags
>> > is returned as zero by subtracting len(unread) from the current
>> > position and that'd be the correct seek position.
>>
>> I'd say that bytestream.tell() is the correct position.
>>
>> Or should seek() return to the last position where the codec was in a
>> default state without anything buffered? (This can't work for UTF-16,
>> because the codec almost never is in the default state.)
>
> That was my hope, yes (and I realize that UTF-16 is an exception).

We could designate natural endianness as the default state, but that
would mean that a codec state can't be transferred to a different
machine (or we could declare little (or big) endianness to be the
default state).

> Consider UTF-8 though. If the chunk we read from the byte stream ended
> in the middle of a multi-byte character, the codec will have the first
> part of that character buffered. In general we want to subtract
> buffered data from the byte stream's position when reporting the
> position of the text stream. The idea is that if we later seek to the
> reported position, we should be reading the same character data. This
> can be accomplished in two ways: by backing up the byte stream to the
> previous character boundary, and resetting the decoder to neutral; or
> by positioning the byte stream to where it was originally and setting
> the state of the decoder to what it was before. However, backing up
> the byte stream has the advantage that no decoder state needs to be
> encoded in the position cookie.

OK, so for decoders getstate() should always return a tuple, with the
first entry being the buffered byte string (or bytes object?) and the
second being additional state data.

Do we need any specification for encoders?

>> > I imagine most
>> > decoders have only very few flags they care about. (The worst might be
>> > the utf-16 decoder which must have a flag to remember whether it
>> > already saw a byte order marker, and another indicating the byte
>> > order. Maybe we'll have to special-case that one, so don't worry too
>> > much about it.)
>> >
>> >> Should I work on a patch?
>> >
>> > That would be great!
>>
>> OK, here's the patch: http://bugs.python.org/1698994
>>
>> The state returned from getstate() should be treated as an opaque value
>> (e.g. for the buffered incremental codecs it is the buffered string, for
>> the UTF-16 encoder it's the flag indicating whether a BOM has been
>> written etc.).
>> The codecs try to return None, if they are in some kind
>> of default state (e.g. there's nothing buffered).
>
> I would like to await completion of those unit tests;

The second version of the patch includes the unit tests (and fixes the
utf-8-sig codec).

> there seem to be
> some subtle issues.

Can you be more concrete?

> I wonder if setstate() should call self.reset()
> first.

Calling reset() and calling setstate() with the initial state should
have the same effect.

> I'd also like to ask if setstate() could default to "" only if
> the argument is None, not if it is empty; I'd like to use it to change
> the buffer to be a bytes object.

I'd say for Python 3000 it should always be a bytes object. Will this
interoperate seamlessly with the C part of the codec machinery?

> (And yes, I need to maintain more
> hacks for that, alas).

I'll try to update the patch tomorrow or over the weekend.

Servus,
   Walter
_______________________________________________
Python-3000 mailing list
[EMAIL PROTECTED]
http://mail.python.org/mailman/listinfo/python-3000
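
For illustration only: the (buffered bytes, flags) state and the two ways of returning to a reported position discussed in this thread can be sketched with the incremental decoder API of the codecs module as it exists in current Python 3, where getstate() and setstate() are available. This is not the patch referenced above. The sample string, the chunk size, and all variable names are assumptions chosen for the example.

```python
import codecs
import io

# Assumed sample data: "a", a euro sign (3 bytes in UTF-8), then "b".
data = "a\u20acb".encode("utf-8")          # b'a\xe2\x82\xacb'
stream = io.BytesIO(data)

make_decoder = codecs.getincrementaldecoder("utf-8")
decoder = make_decoder()
initial_state = decoder.getstate()         # state before any input

# Read a chunk that ends in the middle of the multi-byte character.
text = decoder.decode(stream.read(3))      # feeds b'a\xe2\x82'
unread, flags = decoder.getstate()         # buffered bytes plus extra state

# When flags is zero, the text position is the byte position
# minus the length of the bytes the decoder is still holding.
boundary = stream.tell() - len(unread)

# Way 1: back the byte stream up to the character boundary and
# start from a decoder in its neutral state.
fresh = make_decoder()
stream.seek(boundary)
rest_via_backup = fresh.decode(stream.read(), final=True)

# Way 2: return to the original byte position and restore the
# decoder state that was saved there.
restored = make_decoder()
restored.setstate((unread, flags))
stream.seek(boundary + len(unread))
rest_via_state = restored.decode(stream.read(), final=True)

# reset() should leave the decoder in the same state as the initial one.
decoder.reset()
reset_matches_initial = decoder.getstate() == initial_state
```

Both ways yield the same remaining text. The first needs only an integer position, which is the advantage Guido mentions for the position cookie; the second needs the saved state as well.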