[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Stephen J. Turnbull added the comment: Please do not add the rehandle functions to codecs. They do not change the (duck-typed) representation of data while maintaining the semantics, they change the semantics of data while retaining the representation. I suggest a validation submodule of the unicodedata package, or perhaps a new unicodeutils package, for these functions, as well as those that just detect the surrogates, etc. Because they change the semantics of data they should be documented as potentially dangerous because they can't be inverted back to bytes without knowledge of the history of transformations they perform (and not even then in the case of the replace error handler). This matters in applications where the input bytes may have been digitally signed, for example. -- nosy: +sjt ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue18814 ___ ___ Python-bugs-list mailing list Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
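To illustrate the invertibility concern, here is a minimal sketch. The sample bytes and the use of SHA-256 as a stand-in for a signature check are assumptions made for illustration; they are not part of any proposed patch. A string decoded with surrogateescape still round-trips to the original bytes, but once a different error handler has been applied, the original bytes can no longer be recovered from the string alone:

```python
import hashlib

# Hypothetical input: UTF-8 text followed by one byte that is not valid UTF-8.
raw = "caf\u00e9".encode("utf-8") + b"\xff"
digest = hashlib.sha256(raw).hexdigest()  # stand-in for a signature check

# surrogateescape smuggles the bad byte through as the lone surrogate U+DCFF.
text = raw.decode("utf-8", "surrogateescape")
round_trip = text.encode("utf-8", "surrogateescape")
assert hashlib.sha256(round_trip).hexdigest() == digest

# "Rehandling" with the replace handler swaps the smuggled byte for U+FFFD,
# so re-encoding no longer yields the signed bytes.
cleaned = round_trip.decode("utf-8", "replace")
lossy = cleaned.encode("utf-8")
assert hashlib.sha256(lossy).hexdigest() != digest
```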
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Nick Coghlan added the comment: surrogateescape and surrogatepass data *already* can't be inverted back to bytes reliably without knowing the original encoding - if you encode them as something else when they contain surrogates, you'll either get an exception (the default) or mojibake (if you use surrogateescape/surrogatepass as the output error handler). They only work as a transparent pass through if the input and output encodings match. I'd be fine with putting these data scrubbing functions somewhere other than in codecs, though (I'm not sure unicodedata is the right place, but a new module like string.internals might be, as these functions have more to do with Python's internal text representation than they do anything else. A module like the latter could also be a home for things like a chunking utility that splits a string up into substrings that use as little memory as possible for feeding into a StringIO instance before throwing the original away). I also don't think they're urgent - the introduction of /etc/locale.conf makes modern Linux far more consistent in getting locale settings right, and even older platforms tend to get the locale right for user processes.
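The three outcomes described above (pass-through, exception, mojibake) can be sketched as follows. The sample bytes and the choice of Latin-1 as the mismatched output encoding are assumptions for illustration only:

```python
# Hypothetical input: valid UTF-8 for "é" followed by a stray 0xFF byte.
raw = "\u00e9".encode("utf-8") + b"\xff"
text = raw.decode("utf-8", "surrogateescape")  # 'é' plus lone surrogate U+DCFF

# Same encoding on the way out: a transparent pass-through.
same = text.encode("utf-8", "surrogateescape")

# Different encoding with the default (strict) handler: an exception.
try:
    text.encode("latin-1")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# Different encoding with surrogateescape: the smuggled byte is written out
# unchanged next to re-encoded text, which a Latin-1 reader sees as mojibake.
mixed = text.encode("latin-1", "surrogateescape")
```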
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Nick Coghlan added the comment: Oh, and yes, I agree a python-dev discussion would be a good idea. From my perspective, rehandle_surrogateescape is the key function for making it easier to check for malformed input data from operating system interfaces. The other items I don't personally have a use case for, but they seem potentially valuable in making some key Unicode concepts a bit more discoverable.
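For the narrower task of just checking whether a string from an operating system interface contains smuggled bytes, a minimal sketch is shown below. The helper name contains_surrogates is hypothetical and not part of any patch in this issue. It relies on the fact that the strict UTF-8 encoder rejects surrogate code points, so it flags any surrogate, not only those produced by surrogateescape:

```python
def contains_surrogates(text):
    """Hypothetical helper: True if text holds any surrogate code point."""
    try:
        text.encode("utf-8")  # strict UTF-8 refuses to encode surrogates
    except UnicodeEncodeError:
        return True
    return False

# Assumed example: a Latin-1 encoded filename read on a UTF-8 system.
name = b"report-\xe9.txt".decode("utf-8", "surrogateescape")
malformed = contains_surrogates(name)
```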
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Serhiy Storchaka added the comment: I uploaded the patch just before your comment, Nick. Here is an updated patch. The functions are renamed as Nick suggested, and two more functions are added: decompose_astrals() and compose_surrogate_pairs(). They are here mainly as examples and can be committed in a separate issue. I am hesitant about the rehandle_surrogatepass name. This function handles surrogates that can be created not only with the surrogatepass handler, but also in other ways, e.g. with the surrogateescape handler, with chr(), handle_astrals() or decompose_astrals(). What it actually does is check that the string is valid Unicode (contains no surrogates) and, if any are found, handle them with the specified error handler. Maybe it is time for a wider discussion on Python-Dev. I especially want to hear the opinions of Ezio and Martin.
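As a rough idea of what the two extra functions do, here is a sketch based on the standard UTF-16 surrogate arithmetic. It is not the code from the attached patch, and the _sketch suffixes mark the names as hypothetical:

```python
import re

_ASTRAL = re.compile("[\U00010000-\U0010ffff]")
_PAIR = re.compile("([\ud800-\udbff])([\udc00-\udfff])")

def decompose_astrals_sketch(text):
    """Replace each non-BMP code point with its surrogate pair."""
    def split(match):
        offset = ord(match.group()) - 0x10000
        return chr(0xD800 + (offset >> 10)) + chr(0xDC00 + (offset & 0x3FF))
    return _ASTRAL.sub(split, text)

def compose_surrogate_pairs_sketch(text):
    """Replace each high+low surrogate pair with the code point it encodes."""
    def join(match):
        high, low = (ord(c) for c in match.groups())
        return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))
    return _PAIR.sub(join, text)
```

Lone surrogates are left untouched by the composing function, since only a high surrogate immediately followed by a low surrogate forms a pair.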
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Changes by Serhiy Storchaka storch...@gmail.com: Added file: http://bugs.python.org/file38520/codecs_convert_escapes_2.patch
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Nick Coghlan added the comment: I'd wondered about that with respect to rehandle_surrogatepass. The current implementation looks like it processes *all* surrogates (even valid surrogate pairs), so handle_surrogates might be a suitable name. If the intent is for it to be handle_lone_surrogates, I'm not sure the current implementation achieves that, as a valid surrogate pair will match re.compile('[\ud800-\uefff]+'). The rest looks OK to me, including the decompose_astrals() and compose_surrogate_pairs() functions. Regardless of any practical utility, the latter two seem useful for *educational* purposes when it comes to unicode, by making it clear how to switch between the single code point and dual code point representations of the astrals.
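To make the distinction between "any surrogate" and "lone surrogate" concrete, here is one possible pattern that skips valid pairs. It is a sketch, not the regular expression used in the patch, and the constant names are hypothetical:

```python
import re

# Matches every surrogate code point, paired or not.
ANY_SURROGATE = re.compile("[\ud800-\udfff]")

# A surrogate is "lone" when a high surrogate is not followed by a low one,
# or a low surrogate is not preceded by a high one.
LONE_SURROGATE = re.compile(
    "[\ud800-\udbff](?![\udc00-\udfff])"
    "|(?<![\ud800-\udbff])[\udc00-\udfff]"
)

pair = "\ud83d\ude00"  # a valid surrogate pair (stands for U+1F600)
pair_is_flagged = LONE_SURROGATE.search(pair) is not None
```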
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Serhiy Storchaka added the comment: Note that the provided Python implementations are only a proof of concept. After discussion I'll provide more efficient C implementations that should be 1-2 orders of magnitude faster (and infinitely fast for the common case of ASCII strings).
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Nick Coghlan added the comment: (Serhiy, did you miss uploading the new patch?) Regarding the names, we may need to think about the use cases a bit more explicitly to clarify that in terms of the Python codecs API rather than expecting folks to understand the underlying representation. In the case of handling lone surrogates and escaped surrogates, what about:

    rehandle_surrogatepass(data, errors="strict")
    rehandle_surrogateescape(data, errors="strict")

That is, we know we have data that was decoded with either surrogatepass or surrogateescape (respectively) as the error handler, and we want to process the results of that with a different error handler. I believe those two would be enough to address the specific cases this issue was raised to cover, so it may make sense to file a separate issue to discuss the use cases for the custom astral handling. Since astrals aren't actually errors in the first place, that could become:

    handle_astrals(data, errors="strict")

As in "pass every astral code point in this string through the named error handler". The astral -> surrogate pair and surrogate pair -> astral converters do sound potentially interesting, but as noted above, I think they may call for a separate issue that better explains the specific use cases.
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Serhiy Storchaka added the comment: The proposed preliminary patch adds three functions to the codecs module:

convert_surrogates(data, errors) -- handle lone surrogates with the specified error handler.

    >>> codecs.convert_surrogates('a\u20ac\udca4', 'backslashreplace')
    'a€\\udca4'

convert_surrogateescape(data, errors) -- handle surrogateescaped bytes with the specified error handler.

    >>> codecs.convert_surrogateescape('a\u20ac\udca4', 'backslashreplace')
    'a€\\xa4'

convert_astrals(data, errors) -- handle astral (non-BMP) characters with the specified error handler.

    >>> codecs.convert_astrals('a\u20ac\U000e007f', 'backslashreplace')
    'a€\\U000e007f'

The names are open to discussion. I am also thinking about adding two functions or error handlers (that can be used with convert_surrogates and convert_astrals) for composing astral characters from surrogate pairs and vice versa. -- components: +Library (Lib) versions: +Python 3.5 -Python 3.4
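The general mechanism, passing a run of characters to a named error handler via a UnicodeTranslateError, can be sketched for the astral case as below. This is an illustration rather than the patch itself: the name convert_astrals_sketch and the reason string are assumptions, and it only works with handlers that accept UnicodeTranslateError (on Python 3.5 and later that includes 'strict', 'ignore', 'replace' and 'backslashreplace'):

```python
import codecs
import re

_ASTRAL_RUN = re.compile("[\U00010000-\U0010ffff]+")

def convert_astrals_sketch(data, errors="strict"):
    """Pass each run of non-BMP characters through the named error handler."""
    handler = codecs.lookup_error(errors)
    pieces = []
    pos = 0
    while True:
        match = _ASTRAL_RUN.search(data, pos)
        if match is None:
            break
        pieces.append(data[pos:match.start()])
        error = UnicodeTranslateError(
            data, match.start(), match.end(), "astral character")
        # The handler returns the replacement text and where to resume.
        replacement, pos = handler(error)
        pieces.append(replacement)
    pieces.append(data[pos:])
    return "".join(pieces)
```

With 'strict', the handler simply raises the UnicodeTranslateError it was given, so the function doubles as a validity check.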
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Changes by Serhiy Storchaka storch...@gmail.com: -- keywords: +patch Added file: http://bugs.python.org/file38506/codecs_convert_escapes.patch
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Changes by Serhiy Storchaka storch...@gmail.com: -- dependencies: +Add support of UnicodeTranslateError in standard error handlers
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Nick Coghlan added the comment: Updated issue title to reflect current proposal. -- title: Add tools for cleaning surrogate escaped strings -> Add codecs.convert_surrogateescape to clean surrogate escaped strings
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Marc-Andre Lemburg added the comment: I don't like the function name :-) How about codecs.filter_non_utf8_data(), since that's closer to what the function is really doing and doesn't require knowledge about what surrogateescape is. -- nosy: +lemburg
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Nick Coghlan added the comment: The error handler is called "surrogateescape". That means "convert_surrogateescape" is always only a single step away from thinking "I want to remove the smuggled bytes from a surrogateescape'd string", without needing to assume any knowledge on the part of the user other than the name of the error handler and the fact that it is used to smuggle arbitrary bytes through the Python 3 str type. Getting from "this string was decoded with the surrogateescape handler and may contain smuggled bytes" to "filter_non_utf8_data" as the relevant cleanup function is a much bigger leap that requires more assumed knowledge on the part of the user. It also confuses the conceptual purpose of the function (cleaning up the output of the surrogateescape error handler to ensure it is a pure Unicode string) with the internal details of the proposed approach to implementing that cleanup operation (encoding to UTF-8 with surrogateescape, and then decoding again with a different error handler).
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Nick Coghlan added the comment: The function definition again, this time with a draft docstring:

    def convert_surrogateescape(data, errors='replace'):
        """Convert escaped raw bytes by applying a different error handler

        Uses the "replace" error handler by default, but any input
        error handler may be specified.
        """
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Nick Coghlan added the comment: Note I would also be OK with "convert_surrogates", as that's the term that appears in the relevant error message:

    >>> b'\xe9'.decode('ascii', 'surrogateescape').encode()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 0: surrogates not allowed
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Antoine Pitrou added the comment: On 23/09/2014 12:57, Nick Coghlan wrote:
> The function definition again, this time with a draft docstring:
>
>     def convert_surrogateescape(data, errors='replace'):
>         """Convert escaped raw bytes by applying a different error handler
>
>         Uses the "replace" error handler by default, but any input
>         error handler may be specified.
>         """
>         return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

'utf-8' is hardcoded?
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Nick Coghlan added the comment: Draft docstring for that version:

    def convert_surrogates(data, errors='replace'):
        """Convert escaped surrogates by applying a different error handler

        Uses the "replace" error handler by default, but any input
        error handler may be specified.
        """
        return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Nick Coghlan added the comment: Antoine: what would be the use case for using a different encoding for the temporary bytes object? It's discarded anyway, so the encoding used isn't externally visible.
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Antoine Pitrou added the comment: The encoding used impacts the result:

    >>> s = 'abc\udcc3\udca9'
    >>> s.encode('ascii', 'surrogateescape').decode('ascii', 'replace')
    'abc��'
    >>> s.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')
    'abcé'

The original string ('abc\udcc3\udca9') was obtained by decoding a valid utf-8 string with the 'ascii' codec and the 'surrogateescape' error handler. If anything, the default encoding should probably be sys.getfilesystemencoding().
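One way to act on this observation would be to make the intermediate encoding a parameter. The sketch below is only an illustration: the name rehandle_escaped_bytes and the encoding keyword are hypothetical, and defaulting to sys.getfilesystemencoding() follows the suggestion above rather than any committed API:

```python
import sys

def rehandle_escaped_bytes(data, errors="replace", encoding=None):
    """Hypothetical variant of the draft function with a selectable encoding."""
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    # Restore the smuggled bytes, then decode again with the new handler.
    return data.encode(encoding, "surrogateescape").decode(encoding, errors)

s = "abc\udcc3\udca9"
as_ascii = rehandle_escaped_bytes(s, encoding="ascii")
as_utf8 = rehandle_escaped_bytes(s, encoding="utf-8")
```

The same input gives two replacement characters with 'ascii' but a single 'é' with 'utf-8', so the caller has to know which encoding the data was originally decoded with.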
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Marc-Andre Lemburg added the comment: On 23.09.2014 13:12, Nick Coghlan wrote:
> Draft docstring for that version:
>
>     def convert_surrogates(data, errors='replace'):
>         """Convert escaped surrogates by applying a different error handler
>
>         Uses the "replace" error handler by default, but any input
>         error handler may be specified.
>         """
>         return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

Nick, the docstring is not correct. It is not working on escaped surrogates. Instead it is working on lone surrogates that were used to encode undecodable bytes from some input data.

The longer story goes like this: The surrogateescape error handler in the .decode() call that led up to the data you want this function to take as input will convert undecodable bytes to lone low surrogates. The function then encodes the string to UTF-8 (which may well not be the original encoding, as Antoine has already pointed out, but that's not really important for the use case), turning these surrogates back into the original undecodable bytes, and then decodes the result again with the UTF-8 codec using a new error handler. So in summary, the function is supposed to retroactively apply a different error handler to the input data, undoing the effects of the surrogateescape error handler. The name still doesn't match this functionality.

BTW: There's a catch in the approach. The encoding used to decode the original data may well be 'ascii'. Now, if the original input data was in fact UTF-8, the input decoding would have mapped the non-ASCII UTF-8 bytes to lone surrogates. The above function would then turn these back into UTF-8, redecode and get a completely different string back (since the error handlers would not trigger).

I'm not sure whether adding such a small function with so many unclear implications is a good idea. Either it should be made more specific, e.g. be reserved for use on data from input streams with known encoding, or be put into the documentation as an example for people to use and adapt as necessary.
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
R. David Murray added the comment: And indeed my use case for this has instances of both cases: originally decoded using ASCII and the non-ascii bytes must end up as replaced characters, and originally decoded using utf-8. I'm also not sure that it is worth adding this. If you know what you are doing the solution is obvious, and if you don't know what you are doing you shouldn't be using surrogateescape in the first place :) Now, if there were or there is intended to be a more efficient C level implementation, that answer might be different.
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
R. David Murray added the comment: Oh, wait, I forgot that the context for this was dealing with unix filenames and/or stdio. So, a function that just uses the fsencoding to do the replace might indeed be appropriate, but in that case should probably live in the os module. os.convert_surrogates?
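A filesystem-oriented version could lean on the existing os.fsencode() and os.fsdecode() functions, which already apply the filesystem encoding and its error handler. The sketch below assumes a POSIX system where that handler is surrogateescape; the name clean_filename is hypothetical and nothing like it exists in the os module:

```python
import os
import sys

def clean_filename(name, errors="replace"):
    """Hypothetical helper: re-decode a filename with a different error handler."""
    raw = os.fsencode(name)  # restores any bytes smuggled in by os.fsdecode()
    return raw.decode(sys.getfilesystemencoding(), errors)

# Assumed example: a filename containing a byte that is invalid in UTF-8.
name = os.fsdecode(b"caf\xff.txt")
cleaned = clean_filename(name)
```

What the cleaned name looks like depends on the filesystem encoding in effect, which is exactly why such a helper would belong next to the other filesystem-encoding functions.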
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Serhiy Storchaka added the comment: Good catch, Antoine! Here is a sample of a more complicated implementation. -- title: Add a convert_surrogates function to clean surrogate escaped strings -> Add codecs.convert_surrogateescape to clean surrogate escaped strings Added file: http://bugs.python.org/file36700/convert_surrogates.py

    import codecs
    import re

    def convert_surrogates(data, errors='strict'):
        handler = None
        # Runs of surrogate code points (U+D800..U+DFFF).
        p = re.compile('[\ud800-\udfff]+')
        pos = 0
        res = []
        while True:
            m = p.search(data, pos)
            if m:
                if handler is None:
                    handler = codecs.lookup_error(errors)
                res.append(data[pos: m.start()])
                # The handler returns the replacement and the resume position.
                repl, pos = handler(UnicodeTranslateError(data, m.start(), m.end(),
                                                          'lone surrogates'))
                res.append(repl)
            elif pos:
                res.append(data[pos:])
                return ''.join(res)
            else:
                # No surrogates at all: return the input unchanged.
                return data

    def convert_surrogateescape(data, errors='strict'):
        handler = None
        p = re.compile('[\ud800-\udfff]+')
        pos = 0
        res = []
        while True:
            m = p.search(data, pos)
            if m:
                if handler is None:
                    handler = codecs.lookup_error(errors)
                start = m.start()
                res.append(data[pos: start])
                try:
                    # Recover the raw bytes smuggled in by surrogateescape.
                    baddata = data[start: m.end()].encode('ascii', 'surrogateescape')
                except UnicodeEncodeError as err:
                    raise UnicodeTranslateError(data,
                            err.start + start, err.end + start,
                            r'surrogates not in range \udc80-\udcff') from None
                try:
                    repl, pos = handler(UnicodeDecodeError('unicode', baddata,
                                                           0, len(baddata),
                                                           'lone surrogates'))
                except UnicodeDecodeError as err:
                    raise UnicodeTranslateError(data,
                            err.start + start, err.end + start,
                            err.reason) from None
                # The handler's position is relative to baddata.
                pos += start
                res.append(repl)
            elif pos:
                res.append(data[pos:])
                return ''.join(res)
            else:
                return data
[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings
Nick Coghlan added the comment: Ah, Serhiy's approach of avoiding the encode/decode dance entirely is an even better idea - replacing the lone surrogates directly with the output of the alternative error handler avoids any need to worry about the original encoding.
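A stripped-down illustration of that idea: operate on the string itself and substitute each escaped byte directly, with no intermediate bytes object. This sketch uses a fixed replacement string instead of a pluggable error handler, so it is a simplification of the attached file rather than a summary of it, and the name replace_escaped_bytes is hypothetical:

```python
import re

# surrogateescape maps undecodable bytes 0x80..0xFF to U+DC80..U+DCFF.
_ESCAPED_BYTE = re.compile("[\udc80-\udcff]")

def replace_escaped_bytes(data, replacement="\ufffd"):
    """Swap each surrogateescape'd byte for a fixed replacement string."""
    # A function is used so backslashes in `replacement` are taken literally.
    return _ESCAPED_BYTE.sub(lambda match: replacement, data)

cleaned = replace_escaped_bytes("abc\udcc3\udca9")
```

Because no codec is involved, the result is the same regardless of which encoding originally produced the surrogates.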