Edit report at https://bugs.php.net/bug.php?id=65082&edit=1
ID: 65082
User updated by: masakielastic at gmail dot com
Reported by: masakielastic at gmail dot com
Summary: json_encode's option for replacing ill-formed byte
sequences with substitute cha
Status: Assigned
Type: Feature/Change Request
Package: JSON related
Operating System: All
PHP Version: 5.5.0
Assigned To: remi
Block user comment: N
Private report: N
New Comment:
I created a repo for the patches and the benchmark report:
https://github.com/masakielastic/patches/tree/master/php_bugs_65082
The benchmarks show no noticeable difference between json_utf8_to_utf16
and json_utf8_to_utf32.
Using json_utf8_to_utf32, or calling php_next_utf8_char directly
in json_escape_string, is the better choice for
JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_SUBSTITUTE|JSON_UNESCAPED_UNICODE.
Calling php_next_utf8_char in json_escape_string is a bit faster than
json_utf8_to_utf32 for JSON_NOTUTF8_SUBSTITUTE.
https://github.com/masakielastic/patches/blob/master/php_bugs_65082/04_php_next_utf8_char_in_json_escape_string.patch
https://github.com/masakielastic/patches/blob/master/php_bugs_65082/04_php_next_utf8_char_in_json_escape_string.c
Previous Comments:
------------------------------------------------------------------------
[2013-07-19 16:46:49] masakielastic at gmail dot com
Another way to improve performance is to call php_next_utf8_char directly
in json_escape_string when PHP_JSON_NOTUTF8_SUBSTITUTE
or PHP_JSON_NOTUTF8_IGNORE is set.
This saves one loop compared with using json_utf8_to_utf16.
------------------------------------------------------------------------
[2013-07-19 16:33:24] masakielastic at gmail dot com
I agree with you on isolated surrogate pairs.
Test cases for json_decode with JSON_NOTUTF8_SUBSTITUTE and
JSON_NOTUTF8_IGNORE must be included,
since json_decode uses json_utf8_to_utf16.
https://github.com/php/php-src/blob/master/ext/json/json.c#L673
I already posted the test cases.
https://gist.github.com/masakielastic/5973095#file-04-test-php-L26
"a\xEF\xBF\xBD" === json_decode('"'."a\x80".'"', false, 512,
JSON_NOTUTF8_SUBSTITUTE),
"a" === json_decode('"'."a\x80".'"', false, 512, JSON_NOTUTF8_IGNORE)
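The two expectations above can be illustrated outside of PHP. The following is only a sketch in plain C, not the code from the patch: next_utf8, sanitize_utf8 and the MODE_* names are hypothetical, and emitting one U+FFFD per rejected byte is an assumption (other implementations may substitute once per ill-formed sequence).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical mode names, standing in for the proposed JSON_NOTUTF8_* flags. */
enum bad_utf8_mode { MODE_SUBSTITUTE, MODE_IGNORE };

/* Hypothetical helper (not PHP's php_next_utf8_char): decodes one code point
 * starting at s[*pos]. On success it returns the code point and moves *pos
 * past the sequence. On ill-formed input it returns -1 and moves *pos
 * forward by exactly one byte. */
static int32_t next_utf8(const unsigned char *s, size_t len, size_t *pos)
{
    unsigned char c = s[*pos];
    size_t need, i;
    int32_t cp, min;

    if (c < 0x80) {
        (*pos)++;
        return c;
    } else if (c >= 0xC2 && c <= 0xDF) {
        need = 1; cp = c & 0x1F; min = 0x80;
    } else if (c >= 0xE0 && c <= 0xEF) {
        need = 2; cp = c & 0x0F; min = 0x800;
    } else if (c >= 0xF0 && c <= 0xF4) {
        need = 3; cp = c & 0x07; min = 0x10000;
    } else {
        (*pos)++;                 /* stray continuation or invalid lead byte */
        return -1;
    }
    if (len - *pos <= need) {     /* sequence is cut off by end of input */
        (*pos)++;
        return -1;
    }
    for (i = 1; i <= need; i++) {
        unsigned char cc = s[*pos + i];
        if ((cc & 0xC0) != 0x80) {
            (*pos)++;
            return -1;
        }
        cp = (cp << 6) | (cc & 0x3F);
    }
    /* Reject overlong forms, surrogates and values above U+10FFFF. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        (*pos)++;
        return -1;
    }
    *pos += need + 1;
    return cp;
}

/* Copies well-formed sequences unchanged. Each rejected byte is replaced by
 * U+FFFD (EF BF BD) or dropped, depending on mode. The caller must provide
 * room for 3 * len bytes in out. Returns the number of bytes written. */
static size_t sanitize_utf8(const unsigned char *in, size_t len,
                            enum bad_utf8_mode mode, unsigned char *out)
{
    size_t pos = 0, n = 0;

    assert(in != NULL && out != NULL);
    while (pos < len) {
        size_t start = pos;
        if (next_utf8(in, len, &pos) >= 0) {
            memcpy(out + n, in + start, pos - start);
            n += pos - start;
        } else if (mode == MODE_SUBSTITUTE) {
            memcpy(out + n, "\xEF\xBF\xBD", 3);
            n += 3;
        }
    }
    return n;
}
```

With the input "a\x80", this sketch writes "a\xEF\xBF\xBD" in substitute mode and "a" in ignore mode, which matches the two test expressions.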
One way to improve performance is to add json_utf8_to_utf32.
I posted another patch:
https://gist.github.com/masakielastic/5973095#file-02-json_unescaped_unicode-patch
I added an unsigned int *utf32 variable
so that the unsigned short *utf16 variable does not have to change.
If you want to provide a common variable
for json_utf8_to_utf16 and json_utf8_to_utf32,
JSON_parser.c must be modified as well.
One candidate for the name of that variable is
unsigned int *code_codes.
http://www.unicode.org/glossary/#code_unit
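To illustrate why the element type matters here: a 32-bit buffer holds one element per code point, while a 16-bit buffer needs two code units (a surrogate pair) for anything above U+FFFF. The sketch below is only an illustration; the function name to_utf16_units is hypothetical and does not appear in the patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the UTF-16 code units for one code point into
 * out and returns how many units were needed (1 or 2). The caller must pass
 * a valid scalar value (at most U+10FFFF and not a surrogate). */
static size_t to_utf16_units(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;    /* BMP: one code unit, same value */
        return 1;
    }
    cp -= 0x10000;                /* 20 bits remain, split 10 / 10 */
    out[0] = (uint16_t)(0xD800 | (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 | (cp & 0x3FF));  /* low surrogate */
    return 2;
}
```

For example, U+1F600 becomes the two units 0xD83D 0xDE00 here, whereas in an unsigned int buffer it would stay a single element.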
I also updated the previous patch.
https://gist.github.com/masakielastic/5973095#file-01-json_unescaped_unicode-patch
 if (options & PHP_JSON_UNESCAPED_UNICODE) {
+    if (us < 0x20) {
+        smart_str_appendl(buf, "\\u", 2);
+        smart_str_appendc(buf, digits[(us >> 12) & 0xf]);
+        smart_str_appendc(buf, digits[(us >> 8) & 0xf]);
+        smart_str_appendc(buf, digits[(us >> 4) & 0xf]);
+        smart_str_appendc(buf, digits[(us & 0xf)]);
+    } else if (us < 0x80) {
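
The added lines in the excerpt write a control character as a six-character \uXXXX escape, one hex digit per nibble. A standalone sketch of that step follows. It assumes that digits is a lowercase hex table, and it uses a plain char buffer in place of smart_str; the name write_u_escape is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the smart_str calls in the excerpt: writes the
 * JSON escape \uXXXX for a 16-bit code unit into out (6 bytes, not
 * NUL-terminated) and returns the number of bytes written. */
static size_t write_u_escape(unsigned int us, char out[6])
{
    /* Assumed to match the digits table used in the patch. */
    static const char digits[] = "0123456789abcdef";

    assert(us <= 0xFFFF);
    memcpy(out, "\\u", 2);
    out[2] = digits[(us >> 12) & 0xf];
    out[3] = digits[(us >> 8) & 0xf];
    out[4] = digits[(us >> 4) & 0xf];
    out[5] = digits[us & 0xf];
    return 6;
}
```

Under that assumption, the code unit 0x1f is written as \u001f.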
------------------------------------------------------------------------
[2013-07-15 07:31:49] [email protected]
> Hi remi, could you test my patch for PHP_JSON_UNESCAPED_UNICODE option?
> The patch adopts JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE options.
The PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_IGNORE already works with my
patch.
Yes, PHP_JSON_UNESCAPED_UNICODE + JSON_NOTUTF8_SUBSTITUTE doesn't work for now,
but converting to utf16, then back to utf8 seems really... messy. Need
something simpler.
Note: this bug is only about json_encode. Other issues have their own bugs for
tracking (especially the json_decode one, as I don't plan to alter it).
------------------------------------------------------------------------
[2013-07-14 12:45:47] masakielastic at gmail dot com
As for JSON_NOTUTF8_IGNORE, the manual needs a description of the security
implications, like the one for htmlspecialchars's ENT_IGNORE:
http://www.php.net/manual/en/function.htmlspecialchars.php
That's why I didn't suggest JSON_IGNORE in the draft and linked to the
Escaping RFC as a resource.
UNICODE SECURITY CONSIDERATIONS
http://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters
IDS11-J. Eliminate noncharacter code points before validation
https://www.securecoding.cert.org/confluence/display/java/IDS11-J.+Eliminate+noncharacter+code+points+before+validation
------------------------------------------------------------------------
[2013-07-14 12:31:29] masakielastic at gmail dot com
Hi nikic, sorry, please ignore my last comment.
I made a small change in json.c:
https://gist.github.com/masakielastic/5973095#file-02-small_refactaring-patch
------------------------------------------------------------------------
The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
https://bugs.php.net/bug.php?id=65082