Package: python3-apt
Version: 0.8.3
Severity: normal

In Python 3, I can find no way to get apt_pkg.TagFile to read a file that isn't encoded in UTF-8:

  >>> import sys
  >>> import apt_pkg
  >>> sys.version
  '3.2.2+ (default, Jan 8 2012, 07:26:18) \n[GCC 4.6.2]'
  >>> with open("test", "w", encoding="iso-8859-1") as test:
  ...     print("Package: test", file=test)
  ...     print("Maintainer: M\xe4intainer <t...@example.org>", file=test)
  ...     print(file=test)
  ...
  >>> tagfile = apt_pkg.TagFile(open("test", "rb"))
  >>> next(tagfile)["Maintainer"]
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 1: invalid continuation byte
  >>> tagfile = apt_pkg.TagFile(open("test", encoding="iso-8859-1"))
  >>> next(tagfile)["Maintainer"]
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  UnicodeDecodeError: 'utf8' codec can't decode byte 0xe4 in position 1: invalid continuation byte

Whereas in Python 2:

  >>> import sys
  >>> import apt_pkg
  >>> sys.version
  '2.7.2+ (default, Jan 13 2012, 23:15:17) \n[GCC 4.6.2]'
  >>> tagfile = apt_pkg.TagFile(open("test", "rb"))
  >>> tagfile.next()["Maintainer"]
  'M\xe4intainer <t...@example.org>'

This breaks part of the python-debian test suite (I'm currently trying to port python-debian to Python 3), which is interested in such things as making sure that it's possible to parse old Sources files from before Debian switched to UTF-8.

A fix is tricky. We can't do anything actually nice using Python 3's I/O facilities, because python-apt just pokes around to find the file descriptor and passes that directly to apt. However, one idea that comes to mind is that if you open a file with the 'encoding' parameter then python-apt could spot that in the file object, remember it, and decode bytes using that encoding any time it wants to return a Unicode string.

python-debian's test suite also tests that it's possible to parse old Sources files in *mixed* encodings. This is going to be harder because it basically means having apt_pkg.TagSection return bytes, which I don't think is desirable in general.
Maybe this could be optional somehow?

Thanks,

--
Colin Watson [cjwat...@debian.org]
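
To make the "remember the encoding" idea a bit more concrete, here is a rough sketch in pure Python. It is not python-apt code and does not touch apt_pkg at all: iter_paragraphs and encoding_of are hypothetical names invented for illustration, and the parser is deliberately minimal (it skips continuation lines). The point is only the shape of the interface: decode with the encoding found on the file object when there is one, and hand back bytes when there is not, which is one way the mixed-encoding case might be made optional.

```python
import io


def encoding_of(file_obj):
    """Return the encoding a text-mode file was opened with, else None.

    Binary file objects have no 'encoding' attribute, so they yield None.
    """
    return getattr(file_obj, "encoding", None)


def iter_paragraphs(binary_file, encoding=None):
    """Yield one dict per blank-line-separated paragraph.

    Hypothetical stand-in for a TagFile-like reader (assumption, not the
    apt_pkg API). Values are decoded with 'encoding' when it is given and
    are left as bytes when it is None.
    """
    paragraph = {}
    for raw_line in binary_file:
        line = raw_line.rstrip(b"\r\n")
        if not line.strip():
            # A blank line ends the current paragraph.
            if paragraph:
                yield paragraph
                paragraph = {}
            continue
        if line[:1] in (b" ", b"\t"):
            continue  # continuation line; ignored in this sketch
        key, sep, value = line.partition(b":")
        if not sep:
            continue  # not a "Key: value" line; ignored in this sketch
        value = value.strip()
        # Field names are ASCII, so only the value needs the encoding.
        paragraph[key.decode("ascii")] = (
            value.decode(encoding) if encoding is not None else value
        )
    if paragraph:
        yield paragraph


if __name__ == "__main__":
    data = "Package: test\nMaintainer: M\xe4intainer\n\n".encode("iso-8859-1")

    # Caller names the encoding: values come back as str.
    text = next(iter_paragraphs(io.BytesIO(data), encoding="iso-8859-1"))
    print(repr(text["Maintainer"]))

    # No encoding: values come back as bytes, for mixed-encoding files.
    raw = next(iter_paragraphs(io.BytesIO(data)))
    print(repr(raw["Maintainer"]))
```

In python-apt itself the decoding would have to happen on the C++ side when a field is returned, so this only illustrates the behaviour a caller would see, not how it would be implemented.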