[issue11461] Reading UTF-16 with codecs.readline() breaks on surrogate pairs

Yuriy Pilgun Thu, 10 Mar 2011 02:20:11 -0800

New submission from Yuriy Pilgun <p...@ukrpost.net>:

Reading a UTF-16 text file with the 'codecs' module fails if a surrogate pair 
is located at the 72-character boundary.


The attached Python script fails with the message:
UnicodeDecodeError: 'utf16' codec can't decode bytes in position 70-71: 
unexpected end of data

The cause is that readline() splits the input data into chunks, so a chunk
can end between the two halves of a surrogate pair. The chunk size is set by
  readsize = size or 72
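
A minimal sketch of the chunk-boundary problem, written for a current Python 3
interpreter rather than the 2.7 release this report is about. It does not
reproduce the attached testutf16.py; the sample text and the variable names
are assumptions chosen for illustration. It cuts UTF-16 data after 72 bytes,
shows that decoding the cut chunk as complete input fails, and shows that an
incremental decoder holds the dangling half of the pair until the rest arrives.

```python
import codecs

# Assumed sample data: 35 BMP characters (70 bytes) followed by one
# non-BMP character (4 bytes), so the surrogate pair occupies bytes 70-73
# and is split by a cut at byte 72.
text = "a" * 35 + "\U00010000" + "\n"
data = text.encode("utf-16-le")

chunk = data[:72]  # a chunk that ends in the middle of the surrogate pair

# Decoding the chunk as if it were the whole input raises an error,
# because the high surrogate has no low surrogate after it.
try:
    chunk.decode("utf-16-le")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# An incremental decoder buffers the incomplete pair instead of failing.
decoder = codecs.getincrementaldecoder("utf-16-le")()
first = decoder.decode(chunk)                  # only the 35 complete characters
rest = decoder.decode(data[72:], final=True)   # the pair plus the newline

print(strict_failed, len(first), first + rest == text)
```

The sketch only illustrates why a fixed read size is unsafe for UTF-16 unless
the decoder carries state between chunks; it makes no claim about how
readline() should be changed.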

----------
components: Library (Lib), Unicode
files: testutf16.py
messages: 130498
nosy: ply
priority: normal
severity: normal
status: open
title: Reading UTF-16 with codecs.readline() breaks on surrogate pairs
type: behavior
versions: Python 2.7
Added file: http://bugs.python.org/file21070/testutf16.py

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue11461>
_______________________________________
