Alexander Belopolsky <belopol...@users.sourceforge.net> added the comment:

haypo> See also #2382: I wrote patches two years ago for this issue.

Yes, this is the same issue.  I don't want to close this one as a duplicate 
because #2382 contains a much more ambitious set of patches.  What I am trying 
to achieve here is similar to what adjust_offset.patch does there.

I am attaching a patch that takes an alternative approach and computes the 
number of characters in the parser itself.  I strongly believe that the buffer 
in the tokenizer always contains UTF-8-encoded text.  If that is not already 
the case, I would consider making it so by replacing the call to 
_PyUnicode_AsDefaultEncodedString() with a call to PyUnicode_AsUTF8String(), 
if that matters.
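
For illustration, here is a minimal sketch of the kind of conversion involved 
(this is not the attached patch, and the function name is made up): in UTF-8, 
every byte that is not a continuation byte (0b10xxxxxx) starts a new code 
point, so a byte offset into the buffer can be mapped to a character offset 
in a single pass.

#include <stddef.h>

/* Map a byte offset in a UTF-8 buffer to a character (code point)
   offset by counting the bytes that start a code point. */
static size_t
utf8_char_offset(const char *buf, size_t byte_offset)
{
    size_t chars = 0;
    for (size_t i = 0; i < byte_offset; i++) {
        /* Continuation bytes have the form 10xxxxxx. */
        if (((unsigned char)buf[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}

For example, in the UTF-8 bytes of "h\xc3\xa9llo" ("héllo"), byte offset 3 
maps to character offset 2, because "é" occupies two bytes.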

The patch still needs unit tests and may have some off-by-one issues, but I 
would first like to reach agreement that this is the right level at which to 
fix the problem.

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue10382>
_______________________________________