New submission from Serhiy Storchaka:

RFC 4627 specifies a method to determine the encoding (one of UTF-8, 
UTF-16(BE|LE) or UTF-32(BE|LE)) of encoded JSON text. The proposed preliminary 
patch (it doesn't include the documentation yet) allows the load() and loads() 
functions to accept bytes data when it is encoded with a standard Unicode 
encoding. Data with a BOM is also accepted (this isn't specified in RFC 4627, 
but is widely used).
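
For illustration, here is a minimal sketch of the RFC 4627 detection method, 
assuming the null-byte patterns given in section 3 of the RFC. It is not the 
code from the attached patch: the function name detect_json_encoding() is 
hypothetical, and the BOM check is an extension beyond the RFC.

```python
import codecs
import json


def detect_json_encoding(data):
    """Guess the Unicode encoding of JSON bytes (hypothetical helper)."""
    # BOM check first. This is not part of RFC 4627. UTF-32 must be tested
    # before UTF-16 because the UTF-32LE BOM starts with the UTF-16LE BOM.
    if data.startswith((codecs.BOM_UTF32_BE, codecs.BOM_UTF32_LE)):
        return "utf-32"
    if data.startswith((codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)):
        return "utf-16"
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    # An object or array has at least two ASCII characters, so a UTF-16 or
    # UTF-32 text has at least four bytes. Shorter input is treated as UTF-8.
    if len(data) < 4:
        return "utf-8"
    # RFC 4627, section 3: look at which of the first four bytes are zero.
    pattern = tuple(byte == 0 for byte in data[:4])
    table = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    return table.get(pattern, "utf-8")            # xx xx xx xx


# Usage: decode with the detected encoding, then parse as usual.
raw = '{"key": [1, 2, 3]}'.encode("utf-16-be")
parsed = json.loads(raw.decode(detect_json_encoding(raw)))
```

The sketch only inspects the first four bytes, so it relies on the RFC 4627 
guarantee that the text starts with two ASCII characters.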

There is only one case where the method can misfire. A serialized string 
"\x00..." encoded in UTF-16LE may be erroneously detected as encoded in 
UTF-32LE. This case violates two rules of RFC 4627: a string was 
serialized instead of an object or an array, and the control character U+0000 
was not escaped. Standard-conforming encoded JSON is always detected correctly.
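
The misfire can be reproduced by looking at the raw bytes. The snippet below 
is only an illustration; null_pattern() is a hypothetical helper that reports 
which of the first four bytes are zero, which is what the RFC 4627 method 
examines.

```python
import json


def null_pattern(data):
    """Return which of the first four bytes are zero (hypothetical helper)."""
    return tuple(byte == 0 for byte in data[:4])


# A bare string whose first character is an unescaped U+0000.
# Its UTF-16LE bytes are 22 00 00 00 22 00.
raw = '"\x00"'.encode("utf-16-le")

# The same string serialized by json.dumps(), which escapes U+0000 as \u0000.
# Its UTF-16LE bytes start with 22 00 5C 00.
escaped = json.dumps("\x00").encode("utf-16-le")

# xx 00 00 00 is the UTF-32LE pattern, so the unescaped form is misdetected.
print(null_pattern(raw))
# xx 00 xx 00 is the UTF-16LE pattern, so the escaped form is detected correctly.
print(null_pattern(escaped))
```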

This patch requires the "surrogatepass" error handler for utf-16/32 (see 
issue12892 and issue13916).

----------
assignee: serhiy.storchaka
components: Library (Lib), Unicode
files: json_detect_encoding.patch
keywords: patch
messages: 188442
nosy: ezio.melotti, pitrou, rhettinger, serhiy.storchaka
priority: normal
severity: normal
status: open
title: Autodetecting JSON encoding
type: enhancement
versions: Python 3.4
Added file: http://bugs.python.org/file30133/json_detect_encoding.patch

_______________________________________
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue17909>
_______________________________________