Bugs item #1156259, was opened at 2005-03-03 23:29
Message generated for change (Settings changed) made by loewis
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1156259&group_id=5470

Category: Extension Modules
Group: Python 2.4
Status: Open
Resolution: None
Priority: 7
Submitted By: Matthias Klose (doko)
>Assigned to: Walter Dörwald (doerwalter)
Summary: [2.4 regression] seeking in codecs.reader broken

Initial Comment:
[forwarded from https://bugzilla.ubuntu.com/show_bug.cgi?id=6972 ]

This is a regression; the following script (call as "scriptname some_textfile") fails. It is obvious that the file starts with a number of random bytes from the previous run.

Uncommenting the two #XXX lines makes the bug go away. So does running it with Python 2.3.5

    import sys
    import codecs
    from random import random

    data = codecs.getreader("utf-8")(open(sys.argv[1]))
    df = data.read()

    for t in range(30):
        #XXX data.seek(0,1)
        #XXX data.read()

        data.seek(0,0)
        dn=""
        for l in data:
            dn += l
            if random() < 0.1:
                break

        if not df.startswith(dn):
            print "OUCH",t
            print "BAD:", dn[0:100]
            print "GOOD:", df[0:100]
            sys.exit(1)

    print "OK",len(df)
    sys.exit(0)

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-03-08 15:00

Message:
Logged In: YES
user_id=38388

Walter: the patch looks good. Please also add a doc-string mentioning the resetting of the codec in case .seek() is used.

Whether .seek() causes a mess or not is not within the responsibility of the codec - it's an application space decision to make, otherwise we would have to introduce the notion of seeking code points (rather than bytes) which I'd rather not like to do since this can break existing applications in many ways.

----------------------------------------------------------------------

Comment By: Matthias Urlichs (smurf)
Date: 2005-03-08 14:20

Message:
Logged In: YES
user_id=10327

Ahem -- seek(0,*whatever*) should still be allowed, whatever else you do, please.

Reading UTF-16 from an odd position in a file isn't always an error -- sometimes text is embedded in weird on-disk data structures.

As long as tell() returns something you can seek() back to, nobody's got a right to complain -- file position arithmetic in general is nonportable.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2005-03-04 12:44

Message:
Logged In: YES
user_id=89016

How about the following patch? Unfortunately this breaks the codec in more obscure cases. Calling seek(0, 1) should have no effect, but with this patch it does. Maybe calling seek() should be prohibited? Calling a seek(1, 1) in a UTF-16 stream completely messes up the decoded text.

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2005-03-04 10:56

Message:
Logged In: YES
user_id=38388

This is obviously related to the buffer logic that Walter added to support .readline(). In order to fix the problem, a .seek() method must be implemented that resets the buffers whenever called (before asking the stream to seek to the specified stream position).

----------------------------------------------------------------------

_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
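
For readers of the archive: the fix described in the 2005-03-04 comment (reset the reader's buffers, then ask the stream to seek) can be illustrated roughly as follows. This is only a sketch under stated assumptions, not the patch attached to the tracker item. The class name ResettingReader, its attributes, and the chunk_size parameter are hypothetical and exist only for illustration, and the code uses Python 3 syntax rather than the Python 2.4 of the report.

```python
import codecs
import io


class ResettingReader:
    """Hypothetical, simplified stand-in for a buffering codec reader."""

    def __init__(self, stream, encoding="utf-8", chunk_size=16):
        self.stream = stream
        self.decoder = codecs.getincrementaldecoder(encoding)()
        self.chunk_size = chunk_size
        self.charbuffer = ""  # decoded text read ahead of the caller

    def readline(self):
        # Read ahead in chunks until a newline is buffered or the stream ends.
        while "\n" not in self.charbuffer:
            data = self.stream.read(self.chunk_size)
            if not data:
                self.charbuffer += self.decoder.decode(b"", final=True)
                break
            self.charbuffer += self.decoder.decode(data)
        line, sep, rest = self.charbuffer.partition("\n")
        self.charbuffer = rest
        return line + sep

    def seek(self, offset, whence=0):
        # Drop the read-ahead text and any partial decoder state first,
        # then move the underlying byte stream.
        self.charbuffer = ""
        self.decoder.reset()
        self.stream.seek(offset, whence)


raw = io.BytesIO("first line\nsecond line\nthird line\n".encode("utf-8"))
reader = ResettingReader(raw)
first = reader.readline()   # the 16-byte chunk reads past the first newline
reader.seek(0)              # without the reset, stale text would be returned next
again = reader.readline()
following = reader.readline()
```

Note that in this sketch seek(0, 1) is not a no-op either: the underlying stream is already positioned past the text the caller has consumed, so discarding the read-ahead loses data. That is the same kind of side effect raised in the 2005-03-04 12:44 comment.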