[issue18814] Add utilities to "clean" surrogate code points from strings

2018-03-30 Thread Nick Coghlan

Nick Coghlan  added the comment:

With PEPs 538 and 540 implemented for 3.7, my thinking on this has evolved a 
bit.

A recent discussion on python-ideas [1] also introduced me to the third party 
library, "ftfy", which offers a wide range of tools for cleaning up improperly 
decoded data: https://ftfy.readthedocs.io/en/latest/

That includes a lone surrogate fixer: 
https://ftfy.readthedocs.io/en/latest/#ftfy.fixes.fix_surrogates

So a potential way to go here would be to add a section on "Handling Improperly 
Decoded Text Data" to the codecs module documentation, and include ftfy as a 
See Also link in that new section.

If folks think that would be a reasonable way to go, then I think the clearest 
way to handle it would be to close this issue as "later" (which still implies 
"maybe never", but not as strongly as "rejected" does), and open a new issue 
for the suggested new section in the docs.

[1] https://mail.python.org/pipermail/python-ideas/2018-January/048583.html

--
versions: +Python 3.8 -Python 3.6

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18814] Add utilities to "clean" surrogate code points from strings

2015-09-29 Thread R. David Murray

R. David Murray added the comment:

Done: issue 25269.




[issue18814] Add utilities to "clean" surrogate code points from strings

2015-09-29 Thread STINNER Victor

STINNER Victor added the comment:

> I also want "detect if there are any surrogates".

Could you please open a separate issue for this function/method?

I believe that it's very different from the other proposed functions/methods.

It was proposed before to add methods like "is_ascii()", but the request was 
rejected because other Python implementations don't implement Unicode strings 
using PEP 393, so the method would be less efficient than expected, depending 
on the Python implementation.

Well, I don't know whether we should add methods relying on PEP 393 or not. 
It's probably better to discuss such a "political" question on python-dev.




[issue18814] Add utilities to "clean" surrogate code points from strings

2015-09-27 Thread R. David Murray

R. David Murray added the comment:

I also want "detect if there are any surrogates".




[issue18814] Add utilities to "clean" surrogate code points from strings

2015-09-27 Thread Nick Coghlan

Nick Coghlan added the comment:

As far as the rationale for adding the functions at all goes, my main interest 
is still in having somewhere in the codecs module documentation to *define the 
problem*, and to my mind that entails also offering a simple way to do the 
relevant pre-/post-processing.

The nice aspect of building any related capabilities atop the standard error 
handlers is that it also means that third party modules can provide custom 
error handlers to support further escaping techniques, and those will also be 
available for use in decoding and encoding operations, rather than being 
specific to pre-/post-processing of the data.
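
As a rough illustration of that extension point, here is a minimal sketch of a 
third-party-style error handler registered through codecs.register_error(). 
The handler name "percentescape" and the function percent_escape are 
hypothetical, chosen only for this example; they are not part of the stdlib or 
of any patch attached to this issue.

```python
import codecs

def percent_escape(exc):
    """Hypothetical handler: render each undecodable byte as %XX."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # Return the replacement text and the position to resume decoding at.
        return "".join(f"%{b:02X}" for b in bad), exc.end
    raise exc  # this sketch only handles decoding errors

codecs.register_error("percentescape", percent_escape)

# The registered name can now be used wherever an errors= argument is accepted.
text = b"abc\x80def".decode("utf-8", "percentescape")
```

Because the handler is looked up by name, the same escaping would be usable 
directly in decoding calls as well as in any pre-/post-processing helper that 
accepts an error handler name.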

However, the problem only arises with the combination of an encoding 
misconfiguration *and* processing data that may be corrupted by that 
misconfiguration *and* doing something with it that isn't already handled by a 
surrogateescape round-trip. That's why I suspect most applications will in 
practice be able to get away with ignoring the problem entirely (especially 
with C.UTF-8 support coming to Fedora 24, so the Fedora/RHEL/CentOS ecosystem 
will be joining the Debian/Ubuntu ecosystem in offering that by default).
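
For reference, the surrogateescape round-trip mentioned above can be sketched 
with stdlib calls only. The byte string is an invented example of Latin-1 data 
being decoded as UTF-8; the variable names are illustrative.

```python
# Latin-1 encoded bytes that are not valid UTF-8.
raw = b"r\xe9sum\xe9.txt"

# Decoding with surrogateescape smuggles each bad byte 0xNN through as the
# lone surrogate U+DCNN instead of raising an error.
name = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
round_tripped = name.encode("utf-8", "surrogateescape")

# A strict encode is where the trouble shows up: lone surrogates are rejected.
try:
    name.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
```

As long as the data only flows back out through an interface that also uses 
surrogateescape, nothing needs cleaning; the utilities discussed here matter 
only when the string reaches a strict encoder or is shown to a user.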




[issue18814] Add utilities to "clean" surrogate code points from strings

2015-09-27 Thread Steven D'Aprano

Steven D'Aprano added the comment:

On Sun, Sep 27, 2015 at 04:17:45PM +, R. David Murray wrote:
> 
> I also want "detect if there are any surrogates".

I think that's useful enough it should be a str method. Here's a 
pure-Python implementation:

def is_surrogate(s):
    # True if any code point in s lies in the surrogate range.
    return any(0xD800 <= ord(c) <= 0xDFFF for c in s)

The actual Flexible String Representation implementation can be even 
more efficient. All-ASCII and all-Latin1 strings cannot possibly include 
surrogates, so for those strings is_surrogate will be a constant-time 
operation.

(If we care about making this a constant-time operation for all 
strings, the constructor could set a flag on the string object. Since 
the constructor has to walk the string anyway, I don't think that will 
cost much time-wise.)
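
A Python-level approximation of that fast path might look like the sketch 
below. It assumes Python 3.7 or later for str.isascii(), and the name 
contains_surrogates is hypothetical. A real C implementation could inspect the 
string's internal kind directly, which pure Python cannot do.

```python
import re

# Matches any code point in the surrogate range U+D800..U+DFFF.
_SURROGATE = re.compile(r"[\ud800-\udfff]")

def contains_surrogates(s):
    """Return True if s contains at least one surrogate code point."""
    if s.isascii():
        # Cheap shortcut: pure ASCII text cannot contain surrogates.
        return False
    return _SURROGATE.search(s) is not None
```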




[issue18814] Add utilities to "clean" surrogate code points from strings

2015-09-27 Thread STINNER Victor

STINNER Victor added the comment:

Hum, I suggest putting these functions in a package on PyPI, or as recipes on a
website like Stack Overflow, and closing the issue.

I'm still not convinced that these functions are useful. Usually we take a
function from an existing project used in applications and put it in the
stdlib. Here the use case still looks artificial. For example, which
application needs to escape non-BMP characters? How does it handle them
currently?

There are too many ways to handle surrogate characters. The common ways to
show undecodable bytes are not supported by the functions proposed by Serhiy.
Example: %80 on Mac OS X. Gnome uses something else.

It was said that one reason to add new functions is performance. I'm not
convinced either that such a function is the bottleneck in any application.

I prefer to wait until users experiment with their own implementations and
see if a common function can be extracted from them to put in the stdlib.




[issue18814] Add utilities to "clean" surrogate code points from strings

2015-09-27 Thread Nick Coghlan

Nick Coghlan added the comment:

I think moving this forward mainly needs someone with the time and energy to 
wrangle a python-ideas/dev discussion to get some additional feedback on the 
API design. As I see it, there are 2 main questions to be resolved:

1. Where to expose these functions

The default location would be the codecs module, as they're closely related to 
the error handlers in that module, and the main reasons for needing to clean 
data at all are handling dirty data produced by an interface that uses 
surrogatepass or surrogateescape when decoding (handle_surrogates, 
handle_surrogateescape), or encoding data for use in a context which doesn't 
correctly handle code points outside the basic multilingual plane 
(handle_astrals).

If added to the codecs module, they could be documented in new sections on 
"Postprocessing decoded text" and "Preprocessing text for encoding".

The main argument against that would be Stephen's one, which is that these 
aren't themselves encoding or decoding operations, but rather internal state 
manipulations on Python strings.

2. The exact function set to be provided.

The three potential data cleaning cases currently being considered:

* process_surrogates: reprocessing all surrogates in the string, including lone 
surrogates and valid surrogate pairs. Such strings may be produced by using the 
"surrogatepass" handler when decoding, or by decomposing astral characters to 
surrogate pairs.
* process_surrogateescape: reprocessing only lone surrogates in the U+DC80 to 
U+DCFF range, with other surrogate pairs or lone surrogates triggering 
UnicodeTranslateError. Such strings may be produced by using the 
"surrogateescape" error handler when decoding.
* process_astrals: reprocessing all code points in the astral plane.

These seem to cover the essentials to me, and I changed the proposed prefix to 
"process_*" based on the idea of documenting them as preprocessing and 
postprocessing steps for encoding and decoding.
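
To make the three cases concrete, here is a minimal regex-based sketch. Only 
the function names come from the list above; the signatures, the "repl" 
parameter, its U+FFFD default, and the error message are assumptions made for 
illustration, and the actual proposal may differ (for instance by accepting an 
error handler name instead of a replacement string).

```python
import re

_SURROGATES = re.compile(r"[\ud800-\udfff]")
_NOT_ESCAPES = re.compile(r"[\ud800-\udc7f\udd00-\udfff]")
_ESCAPES = re.compile(r"[\udc80-\udcff]")
_ASTRALS = re.compile(r"[\U00010000-\U0010ffff]")

def process_surrogates(text, repl="\ufffd"):
    """Replace every surrogate code point, paired or lone."""
    return _SURROGATES.sub(repl, text)

def process_surrogateescape(text, repl="\ufffd"):
    """Replace only U+DC80..U+DCFF; reject any other surrogate."""
    bad = _NOT_ESCAPES.search(text)
    if bad:
        raise UnicodeTranslateError(
            text, bad.start(), bad.end(),
            "surrogate outside the surrogateescape range")
    return _ESCAPES.sub(repl, text)

def process_astrals(text, repl="\ufffd"):
    """Replace every code point above the basic multilingual plane."""
    return _ASTRALS.sub(repl, text)
```

For example, a string produced by b"abc\xff".decode("utf-8", 
"surrogateescape") would pass through process_surrogateescape with the 
escaped byte replaced, while a string containing U+D800 would raise 
UnicodeTranslateError.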




[issue18814] Add utilities to "clean" surrogate code points from strings

2015-09-26 Thread Martin Panter

Martin Panter added the comment:

I think my suggested colours for the bikeshed would be handle_surrogates() and 
handle_surrogateescape(). “Rehandle” seems awkward and too assuming to me. And 
I agree with Serhiy that surrogates are a Unicode thing, not just related to 
the “surrogatepass” handler.

Adding them to “codecs” makes sense to me. The most important one, 
handle_surrogateescape() or equivalent, is closely related to the error handler 
of that module.

Having handle_surrogateescape or equivalent would probably be useful for Issue 
25184 (displaying an arbitrary file path in a UTF-8 HTML file).

--
nosy: +martin.panter




[issue18814] Add utilities to clean surrogate code points from strings

2015-06-07 Thread Steven D'Aprano

Changes by Steven D'Aprano steve+pyt...@pearwood.info:


--
nosy: +steven.daprano

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue18814
___



[issue18814] Add utilities to clean surrogate code points from strings

2015-05-11 Thread Nick Coghlan

Nick Coghlan added the comment:

I suggest we defer this one to 3.6 - I still think it's worth doing, but I 
don't think it's a major barrier to migration, and it would be good to get some 
real world experience with the new sys.stdin behaviour of defaulting to using 
surrogateescape in the POSIX locale in 3.5 before committing to a particular 
design for the surrogate cleaning API.

I do like the idea of a string.internals submodule as a possible home for 
exposing the Python level API.

--
title: Add codecs.convert_surrogateescape to clean surrogate escaped strings 
-> Add utilities to clean surrogate code points from strings
versions: +Python 3.6 -Python 3.5
