#30481: force_text() allows lone surrogates
-------------------------------------+-------------------------------------
Reporter: Adam Hooper                | Owner: nobody
Type: Uncategorized                  | Status: new
Component: Utilities                 | Version: 2.2
Severity: Normal                     | Keywords: force_text unicode
Triage Stage: Unreviewed             | Has patch: 0
Needs documentation: 0               | Needs tests: 0
Patch needs improvement: 0           | Easy pickings: 0
UI/UX: 0                             |
-------------------------------------+-------------------------------------

{{{
$ python3
Python 3.7.3 (default, Mar 27 2019, 13:36:35)
[GCC 9.0.1 20190227 (Red Hat 9.0.1-0.8)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> invalid_text = '\ud802\udf12'
>>> print(invalid_text)  # we'd expect this to fail
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
>>> import django.utils.encoding
>>> django.VERSION
(2, 2, 0, 'alpha', 1)
>>> valid_text = django.utils.encoding.force_text(invalid_text)
>>> print(valid_text)  # we'd expect this to succeed?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 0-1: surrogates not allowed
>>> valid_text
'\ud802\udf12'
}}}

Perhaps this is a flaw in my expectations? I'd expect `force_text()`'s output to always be valid text, even though Python allows me to create _non-text_ `str` objects. (In this case, I'd expect maybe `\ufffd\ufffd`: two Unicode replacement characters.)

Unicode primer: `\ud802` is a "lone surrogate" in this context. A lone surrogate is a valid Unicode _code point_, but it does not represent _text_. (Lone surrogates can crop up if someone decodes valid UCS-2 as UTF-16.)

I don't think any caller of `force_text()` expects it to ever return a non-textual Unicode string.

--
Ticket URL: <https://code.djangoproject.com/ticket/30481>
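
For illustration only, here is a minimal sketch of the replacement behaviour I had in mind, under the assumption that every surrogate code point in a `str` should become U+FFFD. The helper names `has_surrogates` and `replace_surrogates` are hypothetical; they are not part of Django, and this is not a proposed patch.

```python
# Hypothetical helpers, not part of Django or django.utils.encoding.
# In a Python 3 str every surrogate code point (U+D800..U+DFFF) is "lone",
# because str stores code points rather than UTF-16 code units.

SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF


def has_surrogates(text: str) -> bool:
    """Return True if text contains any surrogate code point."""
    return any(SURROGATE_FIRST <= ord(ch) <= SURROGATE_LAST for ch in text)


def replace_surrogates(text: str, replacement: str = "\ufffd") -> str:
    """Return text with each surrogate code point swapped for replacement."""
    return "".join(
        replacement if SURROGATE_FIRST <= ord(ch) <= SURROGATE_LAST else ch
        for ch in text
    )


invalid_text = "\ud802\udf12"
cleaned = replace_surrogates(invalid_text)

# The cleaned string can be encoded, so print() would no longer raise
# UnicodeEncodeError on a UTF-8 terminal.
encoded = cleaned.encode("utf-8")
```

Whether `force_text()` should do something like this itself, raise an error instead, or leave the input alone is exactly the question this ticket is asking.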