Edit report at https://bugs.php.net/bug.php?id=65082&edit=1
ID: 65082
User updated by: masakielastic at gmail dot com
Reported by: masakielastic at gmail dot com
Summary: json_encode's option for replacing ill-formed byte
sequences with substitute cha
Status: Assigned
Type: Feature/Change Request
Package: JSON related
Operating System: All
PHP Version: 5.5.0
Assigned To: remi
Block user comment: N
Private report: N
New Comment:
I agree with you on isolated surrogate pairs.
Test cases for json_decode with JSON_NOTUTF8_SUBSTITUTE and
JSON_NOTUTF8_IGNORE must be included,
since json_decode uses json_utf8_to_utf16.
https://github.com/php/php-src/blob/master/ext/json/json.c#L673
I already posted the test cases.
https://gist.github.com/masakielastic/5973095#file-04-test-php-L26
"a\xEF\xBF\xBD" === json_decode('"'."a\x80".'"', false, 512, JSON_NOTUTF8_SUBSTITUTE),
"a" === json_decode('"'."a\x80".'"', false, 512, JSON_NOTUTF8_IGNORE)
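The behaviour these two test cases expect can be sketched in plain C. This is only an illustration under stated assumptions, not the code in json.c or in the patches: the names bad_utf8_mode, utf8_seq_len and utf8_sanitize are hypothetical, and the policy of handling ill-formed input one byte at a time is an assumption of this sketch.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical mode flags for this sketch; these are not the PHP constants. */
enum bad_utf8_mode { BAD_UTF8_SUBSTITUTE, BAD_UTF8_IGNORE };

/* Returns the length (1 to 4) of the well-formed UTF-8 sequence that starts
 * at s, or 0 if the bytes at s are ill-formed or truncated. */
static size_t utf8_seq_len(const unsigned char *s, size_t n)
{
    unsigned char c;

    if (n == 0) return 0;
    c = s[0];
    if (c < 0x80) return 1;
    if (c >= 0xC2 && c <= 0xDF) {
        return (n >= 2 && (s[1] & 0xC0) == 0x80) ? 2 : 0;
    }
    if (c >= 0xE0 && c <= 0xEF) {
        if (n < 3 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80) return 0;
        if (c == 0xE0 && s[1] < 0xA0) return 0; /* overlong form */
        if (c == 0xED && s[1] > 0x9F) return 0; /* surrogate range */
        return 3;
    }
    if (c >= 0xF0 && c <= 0xF4) {
        if (n < 4 || (s[1] & 0xC0) != 0x80 || (s[2] & 0xC0) != 0x80
                  || (s[3] & 0xC0) != 0x80) return 0;
        if (c == 0xF0 && s[1] < 0x90) return 0; /* overlong form */
        if (c == 0xF4 && s[1] > 0x8F) return 0; /* above U+10FFFF */
        return 4;
    }
    return 0; /* stray continuation byte or invalid lead byte */
}

/* Copies in[0..n) to out. Each ill-formed byte is either replaced with
 * U+FFFD (EF BF BD) or dropped, depending on mode. out must have room for
 * 3 * n + 1 bytes. Returns the number of bytes written, excluding the NUL. */
size_t utf8_sanitize(const char *in, size_t n, char *out,
                     enum bad_utf8_mode mode)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t i = 0, o = 0;

    while (i < n) {
        size_t len = utf8_seq_len(s + i, n - i);
        if (len > 0) {
            memcpy(out + o, s + i, len); /* well-formed: copy unchanged */
            o += len;
            i += len;
        } else {
            if (mode == BAD_UTF8_SUBSTITUTE) {
                memcpy(out + o, "\xEF\xBF\xBD", 3);
                o += 3;
            }
            i += 1; /* skip exactly one ill-formed byte */
        }
    }
    out[o] = '\0';
    return o;
}
```

With the input "a\x80", the substitute mode of this sketch yields "a\xEF\xBF\xBD" and the ignore mode yields "a", matching the two expectations above.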
One way to improve performance is to add json_utf8_to_utf32.
I posted another patch.
https://gist.github.com/masakielastic/5973095#file-02-json_unescaped_unicode-patch
I added an unsigned int *utf32 variable
so that the unsigned short *utf16 data type does not have to change.
If you want to provide a common variable
for json_utf8_to_utf16 and json_utf8_to_utf32,
JSON_parser.c also needs to be modified.
One candidate for the variable name is
unsigned int *code_codes.
http://www.unicode.org/glossary/#code_unit
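To illustrate why a 32-bit buffer can be simpler than a 16-bit one, here is a small C sketch of the two representations. The function names utf16_encode and utf16_combine are hypothetical and are not taken from json.c or the patches. A code point above U+FFFF needs two UTF-16 code units (a surrogate pair) but fits in a single unsigned int.

```c
/* Splits code point cp into UTF-16 code units.
 * Returns the number of units written to out (1 or 2).
 * Assumes cp is a valid scalar value (<= 0x10FFFF, not a surrogate). */
int utf16_encode(unsigned int cp, unsigned short out[2])
{
    if (cp < 0x10000) {
        out[0] = (unsigned short)cp;
        return 1;
    }
    cp -= 0x10000;
    out[0] = (unsigned short)(0xD800 + (cp >> 10));   /* high surrogate */
    out[1] = (unsigned short)(0xDC00 + (cp & 0x3FF)); /* low surrogate */
    return 2;
}

/* Recombines a surrogate pair into one code point. With a 32-bit buffer
 * this extra step is not needed before re-encoding as UTF-8. */
unsigned int utf16_combine(unsigned short hi, unsigned short lo)
{
    return 0x10000 + (((unsigned int)(hi - 0xD800)) << 10)
                   + (unsigned int)(lo - 0xDC00);
}
```

For example, U+1F600 becomes the pair 0xD83D 0xDE00 in UTF-16, while a UTF-32 buffer stores the single value 0x1F600.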
I also updated the previous patch.
https://gist.github.com/masakielastic/5973095#file-01-json_unescaped_unicode-patch
 if (options & PHP_JSON_UNESCAPED_UNICODE) {
+    if (us < 0x20) {
+        smart_str_appendl(buf, "\\u", 2);
+        smart_str_appendc(buf, digits[(us >> 12) & 0xf]);
+        smart_str_appendc(buf, digits[(us >> 8) & 0xf]);
+        smart_str_appendc(buf, digits[(us >> 4) & 0xf]);
+        smart_str_appendc(buf, digits[(us & 0xf)]);
+    } else if (us < 0x80) {
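The added branch escapes control characters below 0x20 as \uXXXX even when unescaped Unicode output is requested, since JSON does not allow raw control characters inside strings. A standalone C sketch of that step follows. It writes into a plain char buffer instead of a smart_str; the function name json_escape_u is hypothetical, and the lowercase hex table is an assumption of this sketch.

```c
/* Hex digit lookup table (lowercase is an assumption of this sketch). */
static const char digits[] = "0123456789abcdef";

/* Writes the six-character JSON escape \uXXXX for the 16-bit code unit us
 * into out and NUL-terminates it. out must have room for 7 bytes. */
void json_escape_u(unsigned short us, char *out)
{
    out[0] = '\\';
    out[1] = 'u';
    out[2] = digits[(us >> 12) & 0xf]; /* most significant nibble first */
    out[3] = digits[(us >> 8) & 0xf];
    out[4] = digits[(us >> 4) & 0xf];
    out[5] = digits[us & 0xf];
    out[6] = '\0';
}

/* Returns 1 for control characters, which must never appear raw in a
 * JSON string. The quotation mark and backslash also need escaping but
 * are outside the scope of this sketch. */
int json_is_control(unsigned short us)
{
    return us < 0x20;
}
```

For instance, the code unit 0x1f is written as the six characters \u001f.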
Previous Comments:
------------------------------------------------------------------------
[2013-07-15 07:31:49] [email protected]
> Hi remi, could you test my patch for PHP_JSON_UNESCAPED_UNICODE option?
> The patch adopts JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE options.
PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_IGNORE already works with my
patch.
Yes, PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_SUBSTITUTE doesn't work for now,
but converting to utf16, then back to utf8 seems really... messy. We need
something simpler.
Note: this bug is only for json_encode. Other issues have their own bugs for
tracking (especially the json_decode one, as I don't plan to alter it).
------------------------------------------------------------------------
[2013-07-14 12:45:47] masakielastic at gmail dot com
As for JSON_NOTUTF8_IGNORE, the manual needs a security note
like the one for htmlspecialchars's ENT_IGNORE:
http://www.php.net/manual/en/function.htmlspecialchars.php
That's why I didn't suggest JSON_IGNORE in the draft and linked to the
Escaping RFC as a resource.
UNICODE SECURITY CONSIDERATIONS
http://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters
IDS11-J. Eliminate noncharacter code points before validation
https://www.securecoding.cert.org/confluence/display/java/IDS11-J.+Eliminate+noncharacter+code+points+before+validation
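The risk these references describe can be shown with a short C sketch: a string that passes a filter can turn into a dangerous one after ill-formed bytes are silently deleted. The function drop_stray_continuations is hypothetical and deliberately simplified. It removes only continuation bytes that do not follow a lead byte, as a stand-in for an "ignore" mode, and it is not a full UTF-8 validator.

```c
#include <stddef.h>

/* Copies the NUL-terminated string in to out, dropping continuation bytes
 * (0x80..0xBF) that are not expected after a lead byte. Simplified on
 * purpose: truncated sequences and overlong forms are not detected.
 * out must be at least as large as in. Returns the output length. */
size_t drop_stray_continuations(const char *in, char *out)
{
    const unsigned char *p;
    size_t o = 0;
    int pending = 0; /* continuation bytes still expected */

    for (p = (const unsigned char *)in; *p != 0; p++) {
        unsigned char c = *p;
        if ((c & 0xC0) == 0x80) {
            if (pending == 0) continue; /* stray byte: silently deleted */
            pending--;
        } else if (c >= 0xC2 && c <= 0xDF) {
            pending = 1;
        } else if (c >= 0xE0 && c <= 0xEF) {
            pending = 2;
        } else if (c >= 0xF0 && c <= 0xF4) {
            pending = 3;
        } else {
            pending = 0;
        }
        out[o++] = (char)c;
    }
    out[o] = '\0';
    return o;
}
```

A filter that searches "<scr\x80ipt>" for the substring "<script" finds nothing, but once the stray byte 0x80 is deleted the result is "<script>". Any check therefore has to run after the deletion, not before it.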
------------------------------------------------------------------------
[2013-07-14 12:31:29] masakielastic at gmail dot com
Hi nikic, sorry, please ignore my last comment.
I made a small change in json.c:
https://gist.github.com/masakielastic/5973095#file-02-small_refactaring-patch
------------------------------------------------------------------------
[2013-07-14 08:48:01] masakielastic at gmail dot com
I propose alternative names for consistency with JSON_ERROR_UTF8:
JSON_UTF8_SUBSTITUTE
JSON_UTF8_IGNORE
------------------------------------------------------------------------
[2013-07-14 08:44:02] masakielastic at gmail dot com
Hi nikic, I posted a documentation request for the missing option and error codes.
https://bugs.php.net/bug.php?id=65259
I would like your opinion on the consistency among
JSON_PARTIAL_OUTPUT_ON_ERROR, JSON_NOTUTF8_SUBSTITUTE,
and JSON_NOTUTF8_IGNORE.
------------------------------------------------------------------------
The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
https://bugs.php.net/bug.php?id=65082