Alexander Belopolsky <belopol...@users.sourceforge.net> added the comment:

haypo> See also #2382: I wrote patches two years ago for this issue.

Yes, this is the same issue. I don't want to close this one as a duplicate, because #2382 contains a much more ambitious set of patches. What I am trying to achieve here is similar to the adjust_offset.patch there.

I am attaching a patch that takes an alternative approach and computes the number of characters in the parser. I strongly believe that the buffer in the tokenizer always contains UTF-8-encoded text. If that is not already the case, I would consider making it so by replacing the call to _PyUnicode_AsDefaultEncodedString() with a call to PyUnicode_AsUTF8String() (if that matters).

The patch still needs unit tests and may have some off-by-one issues, but first I would like to reach agreement that this is the right level at which to fix the problem.
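To make the idea concrete, the character counting I have in mind looks roughly like the sketch below. This is illustrative only, not code from the patch, and utf8_char_offset is a made-up name; it relies on the assumption above that the tokenizer buffer holds UTF-8.

    #include <stddef.h>

    /* Map a byte offset into a UTF-8 buffer to a character (code
       point) offset.  In UTF-8, continuation bytes have the form
       10xxxxxx, so every byte that is *not* a continuation byte
       starts a new code point. */
    static size_t
    utf8_char_offset(const char *buf, size_t byte_offset)
    {
        size_t chars = 0;
        for (size_t i = 0; i < byte_offset; i++) {
            if (((unsigned char)buf[i] & 0xC0) != 0x80)
                chars++;   /* this byte starts a new code point */
        }
        return chars;
    }

For example, with the UTF-8 bytes of "héllo" ("é" takes two bytes), a byte offset of 3 maps to a character offset of 2, which is what the reported error offset should be.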
haypo> See also #2382: I wrote patches two years ago for this issue. Yes, this is the same issue. I don't want to close this as a duplicate because #2382 contains a much more ambitious set of patches. What I am trying to achieve here is similar to the adjust_offset.patch there. I am attaching a patch that takes an alternative approach and computes the number of characters in the parser. I strongly believe that the buffer in the tokenizer always contains UTF-8 encoded text. If it is not so already, I would consider making it so by replacing a call to _PyUnicode_AsDefaultEncodedString() with a call to PyUnicode_AsUTF8String(). (if that matters) The patch still needs unittests and possibly has some off-by-one issues, but I would like to get to an agreement that this is the right level at which the problem should be fixed first. ---------- _______________________________________ Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue10382> _______________________________________ _______________________________________________ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com