Bugs item #1701389, was opened at 2007-04-16 12:05
Message generated for change (Comment added) made by doerwalter
You can respond by visiting:
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1701389&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Unicode
Group: Python 2.5
Status: Closed
Resolution: Remind
Priority: 5
Private: No
Submitted By: Iceberg Luo (iceberg4ever)
Assigned to: M.-A. Lemburg (lemburg)
Summary: utf-16 codec problems with multiple file append

Initial Comment:
This bug is similar to, but not exactly the same as, bug215974.
(http://sourceforge.net/tracker/?group_id=5470&atid=105470&aid=215974&func=detail)

In my test, even multiple write() calls within one open()~close() lifespan do not cause the multi BOM phenomenon mentioned in bug215974. Maybe that is because bug 215974 was somehow fixed during the past 7 years, although Lemburg classified it as WontFix.

However, if a file is appended to more than once via "codecs.open('file.txt', 'a', 'utf16')", the multi BOM appears.

At the same time, the statement "(Extra unnecessary) BOM marks are removed from the input stream by the Python UTF-16 codec" in bug215974 does not hold even today, on Python2.4.4 and Python2.5.1c1 on Windows XP.

Iceberg
------------------

PS: Did not find the "File Upload" checkbox mentioned in this web page, so I think I'd better paste the code right here...

import codecs, os

filename = "test.utf-16"
if os.path.exists(filename): os.unlink(filename)  # reset

def myOpen():
    return codecs.open(filename, "a", 'UTF-16')

def readThemBack():
    return list( codecs.open(filename, "r", 'UTF-16') )

def clumsyPatch(raw): # you can read it after your first run of this program
    for line in raw:
        if line[0] in (u'\ufffe', u'\ufeff'): # get rid of the BOMs
            yield line[1:]
        else:
            yield line

fout = myOpen()
fout.write(u"ab\n") # to simplify the problem, I only use ASCII chars here
fout.write(u"cd\n")
fout.close()
print readThemBack()
assert readThemBack() == [ u'ab\n', u'cd\n' ]
assert os.stat(filename).st_size == 14  # Only one BOM in the file

fout = myOpen()
fout.write(u"ef\n")
fout.write(u"gh\n")
fout.close()
print readThemBack()
#print list( clumsyPatch( readThemBack() ) )  # later you can enable this fix
assert readThemBack() == [ u'ab\n', u'cd\n', u'ef\n', u'gh\n' ] # fails here
assert os.stat(filename).st_size == 26  # not to mention here: multi BOM appears

----------------------------------------------------------------------

>Comment By: Walter Dörwald (doerwalter)
Date: 2007-05-03 17:03

Message:
Logged In: YES
user_id=89016
Originator: NO

>BTW, even the official document of Python2.4, chapter "7.3.2.1 Built-in
> Codecs", mentions that the:
> PyObject* PyUnicode_DecodeUTF16( const char *s, int size, const char
> *errors, int *byteorder)
> can "switches according to all byte order marks (BOM) it finds in the
> input data. BOMs are not copied into the resulting Unicode string". I
> don't know whether it is the BOM-less decoder we talked for long time.

This seems to be wrong. Looking at the source code
(Objects/unicodeobject.c) reveals that only the first BOM is skipped.

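The observation that only the first BOM is skipped can be checked without touching the file system. What follows is a minimal sketch, assuming a current Python 3 interpreter (the thread itself concerns Python 2.4/2.5) and assuming little-endian data; the variable names are illustrative only. It builds the bytes that two separate append sessions would leave behind and decodes them in one go:

```python
import codecs

# Bytes as two separate append sessions would produce them:
# each session's encoder emits its own BOM before the text.
first = codecs.BOM_UTF16_LE + "ab\ncd\n".encode("utf-16-le")
second = codecs.BOM_UTF16_LE + "ef\ngh\n".encode("utf-16-le")
data = first + second

# The "utf-16" decoder consumes only the leading BOM; the second one
# stays in the text as U+FEFF (ZERO WIDTH NO-BREAK SPACE).
decoded = data.decode("utf-16")
lines = decoded.splitlines(keepends=True)
print(len(data), lines)
```

The third line comes back with a leading u'\ufeff', which is what makes the second pair of asserts in the report fail.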
----------------------------------------------------------------------

Comment By: Iceberg Luo (iceberg4ever)
Date: 2007-05-03 16:08

Message:
Logged In: YES
user_id=1770538
Originator: YES

The long-debated ZWNBSP is deprecated nowadays (
http://www.unicode.org/unicode/faq/utf_bom.html#24 suggests a "U+2060
WORD JOINER" instead of ZWNBSP ). However I can understand that "backwards
compatibility" is always a good concern, and that's why StreamReader seems
reluctant to change.

In practice, a ZWNBSP inside a file is rarely intended (please also refer
to the topic "Q: What should I do with U+FEFF in the middle of a file?" in
the same URL above). IMHO, it is very likely caused by a multi-append file
operation or the like.

Well, at least, the asymmetric "what you write is NOT what you get/read"
effect between "codecs.open(filename, 'a', 'UTF-16')" and
"codecs.open(filename, 'r', 'UTF-16')" is not elegant enough.

Aiming at the asymmetry, I finally came up with a wrapper function for
codecs.open(), which solves (or you may say "bypasses") the problem well in
my case. I'll post the code as an attachment.

BTW, even the official document of Python2.4, chapter "7.3.2.1 Built-in
Codecs", mentions that the:
PyObject* PyUnicode_DecodeUTF16( const char *s, int size, const char
*errors, int *byteorder)
can "switches according to all byte order marks (BOM) it finds in the
input data. BOMs are not copied into the resulting Unicode string". I
don't know whether it is the BOM-less decoder we talked for long time.
//shrug

Hope the information above can be some kind of recipe for those who
encounter the same problem. That's it. Thanks for your patience.

Best regards,
Iceberg
File Added: _codecs.py

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2007-04-23 12:56

Message:
Logged In: YES
user_id=89016
Originator: NO

But BOMs *may* appear in normal content: Then their meaning is that of
ZERO WIDTH NO-BREAK SPACE (see
http://docs.python.org/lib/encodings-overview.html for more info).

----------------------------------------------------------------------

Comment By: Iceberg Luo (iceberg4ever)
Date: 2007-04-20 05:39

Message:
Logged In: YES
user_id=1770538
Originator: YES

If such a bug is to be fixed, either StreamWriter or StreamReader should
do something.

I can understand Doerwalter's point that it is somewhat uncomfortable for a
StreamWriter to detect whether there is already a BOM at the current file
header, especially when operating in append mode. But, IMHO, the
StreamReader should be able to detect multiple BOMs during its life span
and automatically ignore every one after the first, provided that a BOM is
never supposed to occur in normal content. Not to mention that such a
Reader seems to have existed for a while, according to the "(extra
unnecessary) BOM marks are removed from the input stream by the Python
UTF-16 codec" in bug215974
(http://sourceforge.net/tracker/?group_id=5470&atid=105470&aid=215974&func=detail).

Therefore I don't think a WontFix will be the proper FINAL solution for
this case.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2007-04-19 13:30

Message:
Logged In: YES
user_id=89016
Originator: NO

Closing as "won't fix"

----------------------------------------------------------------------

Comment By: M.-A. Lemburg (lemburg)
Date: 2007-04-19 12:35

Message:
Logged In: YES
user_id=38388
Originator: NO

I suggest you close this as wont fix.

----------------------------------------------------------------------

Comment By: Walter Dörwald (doerwalter)
Date: 2007-04-19 12:30

Message:
Logged In: YES
user_id=89016
Originator: NO

append mode is simply not supported for codecs. How would the codec find
out the codec state that was active after the last characters were written
to the file?

----------------------------------------------------------------------
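For readers who hit the same problem, the wrapper idea mentioned in the thread can be sketched as follows. This is not the attached _codecs.py, which is not reproduced in this message. It is a hypothetical helper, `open_utf16_append`, written for Python 3 under the assumption that for UTF-16 the only state carried between sessions is the byte order, and that the byte order can be read back from the BOM at the start of the file. It uses the built-in open() with an explicit byte-order encoding; the same encoding choice could presumably be passed to codecs.open() instead. Rejecting a non-empty file without a BOM is a design choice of this sketch, not something the thread prescribes.

```python
import codecs
import os

def open_utf16_append(path):
    """Hypothetical helper: append UTF-16 text without writing a second BOM."""
    encoding = "utf-16"  # new or empty file: let the codec write the BOM
    if os.path.exists(path) and os.path.getsize(path) > 0:
        # Recover the byte order from the BOM written by an earlier session.
        with open(path, "rb") as f:
            head = f.read(2)
        if head == codecs.BOM_UTF16_LE:
            encoding = "utf-16-le"  # explicit byte order: no BOM is emitted
        elif head == codecs.BOM_UTF16_BE:
            encoding = "utf-16-be"
        else:
            raise ValueError("existing file does not start with a UTF-16 BOM")
    # newline="" keeps "\n" untranslated, as a binary-mode codecs stream would.
    return open(path, "a", encoding=encoding, newline="")
```

With this helper, two append sessions writing u"ab\ncd\n" and u"ef\ngh\n" should leave a 26-byte file that reads back as four clean lines. The approach does not generalize to codecs with richer internal state, which is the concern raised in the comment above.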