[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2015-05-09 Thread Stephen J. Turnbull

Stephen J. Turnbull added the comment:

Please do not add the rehandle functions to codecs.  They do not change the 
(duck-typed) representation of data while maintaining the semantics, they 
change the semantics of data while retaining the representation.

I suggest a validation submodule of the unicodedata package, or perhaps a new 
unicodeutils package, for these functions, as well as those that just detect 
the surrogates, etc.

Because they change the semantics of data they should be documented as 
potentially dangerous because they can't be inverted back to bytes without 
knowledge of the history of transformations they perform (and not even then in 
the case of the replace error handler).  This matters in applications where 
the input bytes may have been digitally signed, for example.
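
A minimal sketch of the inversion problem, using only the built-in str and bytes methods. The two sample byte strings are made up for illustration. Both survive a surrogateescape round trip, but once the escaped bytes are rehandled with the replace handler they collapse to the same text, so the original bytes (and any signature over them) can no longer be recovered.

```python
# Two distinct (hypothetical) byte strings, neither of them valid UTF-8.
raw_a = b'key=\xe9'
raw_b = b'key=\xff'

# surrogateescape keeps enough information to get the bytes back...
text_a = raw_a.decode('utf-8', 'surrogateescape')
text_b = raw_b.decode('utf-8', 'surrogateescape')
assert text_a.encode('utf-8', 'surrogateescape') == raw_a
assert text_b.encode('utf-8', 'surrogateescape') == raw_b

# ...but rehandling the escaped bytes with 'replace' does not:
# both inputs end up as the same string.
clean_a = text_a.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')
clean_b = text_b.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')
assert clean_a == clean_b == 'key=\ufffd'
```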

--
nosy: +sjt

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18814
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2015-05-09 Thread Nick Coghlan

Nick Coghlan added the comment:

surrogateescape and surrogatepass data *already* can't be inverted back to
bytes reliably without knowing the original encoding - if you encode them
as something else when they contain surrogates, you'll either get an
exception (the default) or mojibake (if you use
surrogateescape/surrogatepass as the output error handler). They only work
as a transparent pass through if the input and output encodings match.
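
A small sketch of the three outcomes described above, assuming a made-up input that mixes valid UTF-8 with one stray Latin-1 byte. Only standard str and bytes methods are used.

```python
# Valid UTF-8 for U+00E9, a space, then a stray Latin-1 byte.
raw = b'\xc3\xa9 \xe9'

text = raw.decode('utf-8', 'surrogateescape')
assert text == '\xe9 \udce9'     # the stray byte became a lone surrogate

# Same encoding in and out: transparent pass-through.
assert text.encode('utf-8', 'surrogateescape') == raw

# Different output encoding, default (strict) handler: an exception.
try:
    text.encode('latin-1')
except UnicodeEncodeError:
    strict_failed = True
else:
    strict_failed = False
assert strict_failed

# Different output encoding with surrogateescape: no error, but the real
# character and the stray byte are now indistinguishable in the output.
mixed = text.encode('latin-1', 'surrogateescape')
assert mixed == b'\xe9 \xe9'
assert mixed != raw
```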

I'd be fine with putting these data scrubbing functions somewhere other
than in codecs, though (I'm not sure unicodedata is the right place, but a
new module like string.internals might be, as these functions have more
to do with Python's internal text representation than they do anything
else. A module like the latter could also be a home for things like a
chunking utility that splits a string up into substrings that use as little
memory as possible for feeding into a StringIO instance before throwing the
original away).

I also don't think they're urgent - the introduction of /etc/locale.conf
makes modern Linux far more consistent in getting locale settings right,
and even older platforms tend to get the locale right for user processes.

--




[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2015-03-17 Thread Nick Coghlan

Nick Coghlan added the comment:

Oh, and yes, I agree a python-dev discussion would be a good idea.

From my perspective, rehandle_surrogateescape is the key function for making 
it easier to check for malformed input data from operating system interfaces.

The other items I don't personally have a use case for, but they seem 
potentially valuable in making some key Unicode concepts a bit more discoverable.

--




[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2015-03-17 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

I uploaded the patch just before your comment, Nick.

Here is an updated patch. The functions are renamed as Nick suggested, and two 
more functions are added: decompose_astrals() and compose_surrogate_pairs(). 
They are here mainly as examples and can be committed in a separate issue.

I am hesitant about the rehandle_surrogatepass name. This function handles 
surrogates that can be created not only with the surrogatepass handler, but 
also in other ways, e.g. with the surrogateescape handler, with chr(), 
handle_astrals() or decompose_astrals(). What it actually does is check that 
the string is valid Unicode (containing no surrogates) and, if any are found, 
handle them with the specified error handler.

Maybe it is time for a wider discussion on Python-Dev. I especially want to 
hear the opinions of Ezio and Martin.

--




[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2015-03-17 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


Added file: http://bugs.python.org/file38520/codecs_convert_escapes_2.patch




[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2015-03-17 Thread Nick Coghlan

Nick Coghlan added the comment:

I'd wondered about that with respect to rehandle_surrogatepass.

The current implementation looks like it processes *all* surrogates (even valid 
surrogate pairs), so handle_surrogates might be a suitable name.

If the intent is for it to be handle_lone_surrogates, I'm not sure the 
current implementation achieves that, as a valid surrogate pair will match 
re.compile('[\ud800-\uefff]+').
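
One way to match only *lone* surrogates is to add lookaround assertions, as in the sketch below. This is an illustration, not the pattern from the patch; the names LONE_SURROGATE and find_lone_surrogates are hypothetical.

```python
import re

# Matches a low surrogate not preceded by a high one, or a high surrogate
# not followed by a low one. Well-formed high/low pairs are left alone.
LONE_SURROGATE = re.compile(
    '(?<![\ud800-\udbff])[\udc00-\udfff]'
    '|[\ud800-\udbff](?![\udc00-\udfff])'
)

def find_lone_surrogates(text):
    """Return the indexes of surrogates that are not part of a pair."""
    return [m.start() for m in LONE_SURROGATE.finditer(text)]

paired = 'a\ud83d\ude00b'      # a high/low pair stored as two code points
escaped = 'a\u20ac\udca4'      # surrogateescape-style lone low surrogate

assert find_lone_surrogates(paired) == []
assert find_lone_surrogates(escaped) == [2]
assert find_lone_surrogates('\ud83d') == [0]
```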

The rest looks OK to me, including the decompose_astrals() and 
compose_surrogate_pairs() functions. Regardless of any practical utility, the 
latter two seem useful for *educational* purposes when it comes to unicode, by 
making it clear how to switch between the single code point and dual code point 
representations of the astrals.
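
For readers who have not seen the conversion spelled out, here is a rough sketch of how the two representations relate. The function names follow the thread, but the bodies are an assumption about the intended behaviour, not the code from the patch.

```python
import re

def decompose_astrals(text):
    """Replace each non-BMP code point with its UTF-16 surrogate pair."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            cp -= 0x10000
            out.append(chr(0xD800 + (cp >> 10)))    # high (lead) surrogate
            out.append(chr(0xDC00 + (cp & 0x3FF)))  # low (trail) surrogate
        else:
            out.append(ch)
    return ''.join(out)

_PAIR = re.compile('([\ud800-\udbff])([\udc00-\udfff])')

def compose_surrogate_pairs(text):
    """Replace each high/low surrogate pair with one astral code point."""
    def join(m):
        hi = ord(m.group(1)) - 0xD800
        lo = ord(m.group(2)) - 0xDC00
        return chr(0x10000 + (hi << 10) + lo)
    return _PAIR.sub(join, text)

astral = '\U0001F600'
pair = decompose_astrals(astral)
assert pair == '\ud83d\ude00' and len(pair) == 2
assert compose_surrogate_pairs(pair) == astral
```

Lone surrogates pass through compose_surrogate_pairs() unchanged in this sketch, since only complete pairs match the pattern.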

--




[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2015-03-17 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Note that the provided Python implementations are rather a proof of concept. 
After discussion I'll provide more efficient C implementations, which should be 
1-2 orders of magnitude faster (and infinitely fast for the common case of 
ASCII strings).

--




[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2015-03-16 Thread Nick Coghlan

Nick Coghlan added the comment:

(Serhiy, did you miss uploading the new patch?)

Regarding the names, we may need to think about the use cases a bit more 
explicitly to clarify that in terms of the Python codecs API rather than 
expecting folks to understand the underlying representation. In the case of 
handling lone surrogates and escaped surrogates, what about:

rehandle_surrogatepass(data, errors='strict')
rehandle_surrogateescape(data, errors='strict')

That is, we know we have data that was decoded with either surrogatepass or 
surrogateescape (respectively) as the error handler, and we want to process the 
results of that with a different error handler.

I believe those two would be enough to address the specific cases this issue 
was raised to cover, so it may make sense to file a separate issue to discuss 
the use cases for the custom astral handling.

Since astrals aren't actually errors in the first place, that could become:

handle_astrals(data, errors='strict')

As in "pass every astral code point in this string through the named error 
handler".

The astral -> surrogate pair and surrogate pair -> astral converters do sound 
potentially interesting, but as noted above, I think they may call for a 
separate issue that better explains the specific use cases.
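
To make the handle_astrals() idea concrete, here is a rough pure-Python sketch. The signature follows the proposal above, but the body is an assumption: it wraps each run of non-BMP code points in a UnicodeTranslateError and passes it to the handler returned by codecs.lookup_error().

```python
import codecs

def handle_astrals(data, errors='strict'):
    """Pass each run of non-BMP code points through the named error handler."""
    handler = codecs.lookup_error(errors)
    out = []
    pos = 0
    n = len(data)
    while pos < n:
        if ord(data[pos]) <= 0xFFFF:
            out.append(data[pos])
            pos += 1
            continue
        # Find the end of this run of astral code points.
        end = pos
        while end < n and ord(data[end]) > 0xFFFF:
            end += 1
        exc = UnicodeTranslateError(data, pos, end, 'astral code point')
        # Handlers return (replacement, position to resume at); this sketch
        # assumes the resume position is at or after the end of the run.
        repl, pos = handler(exc)
        out.append(repl)
    return ''.join(out)

assert handle_astrals('a\u20ac\U000e007f', 'replace') == 'a\u20ac\ufffd'
assert handle_astrals('a\u20ac\U000e007f', 'ignore') == 'a\u20ac'
```

With the default errors='strict' the UnicodeTranslateError is raised to the caller. Other registered handlers can be passed as long as they accept a UnicodeTranslateError.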

--




[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2015-03-16 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

The proposed preliminary patch adds three functions to the codecs module:

convert_surrogates(data, errors) -- handle lone surrogates with the specified 
error handler.

>>> codecs.convert_surrogates('a\u20ac\udca4', 'backslashreplace')
'a€\\udca4'

convert_surrogateescape(data, errors) -- handle surrogateescaped bytes with 
the specified error handler.

>>> codecs.convert_surrogateescape('a\u20ac\udca4', 'backslashreplace')
'a€\\xa4'

convert_astrals(data, errors) -- handle astral (non-BMP) characters with the 
specified error handler.

>>> codecs.convert_astrals('a\u20ac\U000e007f', 'backslashreplace')
'a€\\U000e007f'

The names are open to discussion.

I am also thinking about adding two functions or error handlers (that can be 
used with convert_surrogates and convert_astrals) for composing astral 
characters from surrogate pairs and vice versa.

--
components: +Library (Lib)
versions: +Python 3.5 -Python 3.4




[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2015-03-16 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
keywords: +patch
Added file: http://bugs.python.org/file38506/codecs_convert_escapes.patch




[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2015-03-15 Thread Serhiy Storchaka

Changes by Serhiy Storchaka storch...@gmail.com:


--
dependencies: +Add support of UnicodeTranslateError in standard error handlers




[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Nick Coghlan

Nick Coghlan added the comment:

Updated issue title to reflect current proposal.

--
title: Add tools for cleaning surrogate escaped strings -> Add 
codecs.convert_surrogateescape to clean surrogate escaped strings




[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

Don't like the function name :-)

How about codecs.filter_non_utf8_data(), since that's closer
to what the function is really doing and doesn't require
knowledge about what surrogateescape is.

--
nosy: +lemburg




[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Nick Coghlan

Nick Coghlan added the comment:

The error handler is called surrogateescape. That means 
convert_surrogateescape is always only a single step away from thinking "I 
want to remove the smuggled bytes from a surrogateescape'd string", without 
needing to assume any knowledge on the part of the user other than the name of 
the error handler and the fact that it is used to smuggle arbitrary bytes 
through the Python 3 str type.

Getting from "this string was decoded with the surrogateescape handler and may 
contain smuggled bytes" to "filter_non_utf8_data" as the relevant cleanup 
function is a much bigger leap that requires more assumed knowledge on the part 
of the user, and also one that confuses the conceptual purpose of the function 
(cleaning up the output of the surrogateescape error handler to ensure it is a 
pure Unicode string) with the internal details of the proposed approach to 
implementing that cleanup operation (encoding to UTF-8 with surrogateescape, 
and then decoding again with a different error handler).

--




[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Nick Coghlan

Nick Coghlan added the comment:

The function definition again, this time with a draft docstring:

def convert_surrogateescape(data, errors='replace'):
    """Convert escaped raw bytes by applying a different error handler

    Uses the replace error handler by default, but any input
    error handler may be specified.
    """
    return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

--




[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Nick Coghlan

Nick Coghlan added the comment:

Note I would also be OK with convert_surrogates, as that's the term that 
appears in the relevant error message:

>>> b'\xe9'.decode('ascii', 'surrogateescape').encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udce9' in position 0: surrogates not allowed
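
That error can also be used as a cheap check for whether a string is safe to encode. The helper below is a hypothetical sketch (contains_surrogates is not an existing API). It relies only on the fact that strict UTF-8 encoding rejects surrogate code points, whether lone or paired.

```python
def contains_surrogates(text):
    """Return True if text cannot be encoded as strict UTF-8."""
    try:
        text.encode('utf-8')
    except UnicodeEncodeError:
        return True
    return False

# A made-up file name containing one byte that is not valid UTF-8.
name = b'report-\xe9.txt'.decode('utf-8', 'surrogateescape')
assert contains_surrogates(name)
assert not contains_surrogates('report-\xe9.txt')
```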

--




[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Antoine Pitrou

Antoine Pitrou added the comment:

On 23/09/2014 12:57, Nick Coghlan wrote:
> The function definition again, this time with a draft docstring:
>
> def convert_surrogateescape(data, errors='replace'):
>     """Convert escaped raw bytes by applying a different error handler
>
>     Uses the replace error handler by default, but any input
>     error handler may be specified.
>     """
>     return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

'utf-8' is hardcoded?

--




[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Nick Coghlan

Nick Coghlan added the comment:

Draft docstring for that version

def convert_surrogates(data, errors='replace'):
    """Convert escaped surrogates by applying a different error handler

    Uses the replace error handler by default, but any input
    error handler may be specified.
    """
    return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

--




[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Nick Coghlan

Nick Coghlan added the comment:

Antoine: what would be the use case for using a different encoding for the 
temporary bytes object? It's discarded anyway, so the encoding used isn't 
externally visible.

--




[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Antoine Pitrou

Antoine Pitrou added the comment:

The encoding used impacts the result:

>>> s = 'abc\udcc3\udca9'
>>> s.encode('ascii', 'surrogateescape').decode('ascii', 'replace')
'abc��'
>>> s.encode('utf-8', 'surrogateescape').decode('utf-8', 'replace')
'abcé'

The original string ('abc\udcc3\udca9') was obtained by decoding a valid utf-8 
string with the 'ascii' codec and the 'surrogateescape' error handler.

If anything, the default encoding should probably be 
sys.getfilesystemencoding().
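
A sketch of what that could look like, assuming an extra encoding parameter (hypothetical, not part of the draft under discussion) that falls back to sys.getfilesystemencoding() when omitted:

```python
import sys

def convert_surrogateescape(data, errors='replace', encoding=None):
    """Re-decode surrogateescape'd text with a different error handler.

    encoding should be the codec the data was originally decoded with.
    Falling back to the filesystem encoding is an assumption based on
    the suggestion in this thread.
    """
    if encoding is None:
        encoding = sys.getfilesystemencoding()
    raw = data.encode(encoding, 'surrogateescape')
    return raw.decode(encoding, errors)

s = 'abc\udcc3\udca9'
# The same input gives different results depending on the codec used.
assert convert_surrogateescape(s, encoding='ascii') == 'abc\ufffd\ufffd'
assert convert_surrogateescape(s, encoding='utf-8') == 'abc\xe9'
```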

--




[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Marc-Andre Lemburg

Marc-Andre Lemburg added the comment:

On 23.09.2014 13:12, Nick Coghlan wrote:
>
> Nick Coghlan added the comment:
>
> Draft docstring for that version
>
> def convert_surrogates(data, errors='replace'):
>     """Convert escaped surrogates by applying a different error handler
>
>     Uses the replace error handler by default, but any input
>     error handler may be specified.
>     """
>     return data.encode('utf-8', 'surrogateescape').decode('utf-8', errors)

Nick, the doc string is not correct. It is not working on escaped
surrogates. Instead it is working on lone surrogates that were used
to encode undecodable bytes from some input data.

The longer story goes like this:

The surrogateescape error handler in the .decode() call that led up
to the data you want this function to take as input will convert
undecodable bytes to lone low surrogates.

The function then converts these surrogates back into bytes using UTF-8
(which may well not be the original encoding, as Antoine has already
pointed out, but that's not really important for the use case), recreating
the undecodable bytes, and then decodes the result again with the UTF-8
codec using a new error handler.

So in summary, the function is supposed to retroactively apply
a different error handler to the input data, undoing the effects
of the surrogateescape error handler.

The name still doesn't match this functionality.

BTW: There's a catch in the approach. The encoding used to decode
the original data may well be 'ascii'. Now, if the original input
data was in fact UTF-8, the input decoding would have mapped the
non-ASCII UTF-8 bytes to lone surrogates. The above function would then
turn these back into UTF-8, redecode and get a completely different
string back (since the error handlers would not trigger).

I'm not sure whether adding such a small function with so many
unclear implications is a good idea. Either it should be
made more specific, e.g. be reserved for use on data from input
streams with known encoding, or be put into the documentation as
example for people to use and adapt as necessary.

--




[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread R. David Murray

R. David Murray added the comment:

And indeed my use case for this has instances of both cases: originally decoded 
using ASCII and the non-ascii bytes must end up as replaced characters, and 
originally decoded using utf-8.

I'm also not sure that it is worth adding this.  If you know what you are doing 
the solution is obvious, and if you don't know what you are doing you shouldn't 
be using surrogateescape in the first place :)

Now, if there were or there is intended to be a more efficient C level 
implementation, that answer might be different.

--




[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread R. David Murray

R. David Murray added the comment:

Oh, wait, I forgot that the context for this was dealing with unix filenames 
and/or stdio.  So, a function that just uses the fsencoding to do the replace 
might indeed be appropriate, but in that case should probably live in the os 
module.  os.convert_surrogates?

--




[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Serhiy Storchaka

Serhiy Storchaka added the comment:

Good catch, Antoine!

Here is a sample of a more complicated implementation.

--
title: Add a convert_surrogates function to clean surrogate escaped strings 
-> Add codecs.convert_surrogateescape to clean surrogate escaped strings
Added file: http://bugs.python.org/file36700/convert_surrogates.py

import codecs
import re

def convert_surrogates(data, errors='strict'):
    handler = None
    # Pattern kept as posted. Note that the surrogate block proper is
    # \ud800-\udfff, so this range also matches \ue000-\uefff.
    p = re.compile('[\ud800-\uefff]+')
    pos = 0
    res = []
    while True:
        m = p.search(data, pos)
        if m:
            if handler is None:
                handler = codecs.lookup_error(errors)
            res.append(data[pos: m.start()])
            repl, pos = handler(UnicodeTranslateError(data, m.start(), m.end(),
                                                      'lone surrogates'))
            res.append(repl)
        elif pos:
            res.append(data[pos:])
            return ''.join(res)
        else:
            return data

def convert_surrogateescape(data, errors='strict'):
    handler = None
    p = re.compile('[\ud800-\uefff]+')
    pos = 0
    res = []
    while True:
        m = p.search(data, pos)
        if m:
            if handler is None:
                handler = codecs.lookup_error(errors)
            start = m.start()
            res.append(data[pos: start])
            try:
                # Only \udc80-\udcff (the surrogateescape range) encode here.
                baddata = data[start: m.end()].encode('ascii',
                                                      'surrogateescape')
            except UnicodeEncodeError as err:
                # Message text kept as posted.
                raise UnicodeTranslateError(data,
                        err.start + start, err.end + start,
                        r'surrogates not in range \ud880-\ud8ff') from None
            try:
                repl, pos = handler(UnicodeDecodeError('unicode', baddata,
                                                       0, len(baddata),
                                                       'lone surrogates'))
            except UnicodeDecodeError as err:
                raise UnicodeTranslateError(data,
                        err.start + start,
                        err.end + start,
                        err.reason) from None
            # The handler's position is relative to baddata; shift it back.
            pos += start
            res.append(repl)
        elif pos:
            res.append(data[pos:])
            return ''.join(res)
        else:
            return data



[issue18814] Add codecs.convert_surrogateescape to clean surrogate escaped strings

2014-09-23 Thread Nick Coghlan

Nick Coghlan added the comment:

Ah, Serhiy's approach of avoiding the encode/decode dance entirely is an even 
better idea - replacing the lone surrogates directly with the output of the 
alternative error handler avoids any need to worry about the original encoding.
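
A stripped-down illustration of the direct approach: each escaped byte is rewritten in place, with no intermediate encode/decode step and therefore no codec to choose. The helper name show_escaped_bytes is hypothetical, and the fixed \xNN output stands in for whatever the chosen error handler would produce.

```python
import re

# Code points produced by surrogateescape for the bytes 0x80..0xFF.
_ESCAPED_BYTE = re.compile('[\udc80-\udcff]')

def show_escaped_bytes(text):
    """Replace each escaped byte with a visible \\xNN marker."""
    return _ESCAPED_BYTE.sub(
        lambda m: '\\x%02x' % (ord(m.group()) - 0xDC00), text)

assert show_escaped_bytes('a\u20ac\udca4') == 'a\u20ac\\xa4'
assert show_escaped_bytes('plain') == 'plain'
```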

--
