Req #65082 [Asn]: json_encode's option for replacing ill-formed byte sequences with substitute cha

masakielastic at gmail dot com Sun, 14 Jul 2013 01:29:24 -0700

 ID:                 65082
 User updated by:    masakielastic at gmail dot com
 Reported by:        masakielastic at gmail dot com
 Summary:            json_encode's option for replacing ill-formed byte
                     sequences with substitute cha
 Status:             Assigned
 Type:               Feature/Change Request
 Package:            JSON related
 Operating System:   All
 PHP Version:        5.5.0
 Assigned To:        remi
 Block user comment: N
 Private report:     N

 New Comment:

I created a new feature request for preventing XSS attacks, and I withdraw my
opinion about changing the default behavior.

New function for preventing XSS attacks
https://bugs.php.net/bug.php?id=65257


Previous Comments:
------------------------------------------------------------------------
[2013-07-12 18:19:09] masakielastic at gmail dot com

I posted a patch for handling surrogate pairs, since code points in the
range U+D800 to U+DFFF are not allowed in UTF-8 (RFC 3629).
I need someone's help with handling high surrogates and the options.

https://gist.github.com/masakielastic/5985383

json_decode produces invalid byte-sequences
https://bugs.php.net/bug.php?id=62010
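
The surrogate restriction mentioned above can be illustrated with a small
sketch. It is written in Python rather than PHP and is not taken from the
patch; the helper names are hypothetical. It assumes only that a strict UTF-8
decoder rejects the byte sequence that would encode U+D800, and that a
substituting decoder emits U+FFFD in its place.

```python
def is_surrogate(code_point: int) -> bool:
    """Return True for code points in the surrogate range U+D800..U+DFFF."""
    return 0xD800 <= code_point <= 0xDFFF


def is_well_formed_utf8(raw: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the byte string."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def decode_substituting(raw: bytes) -> str:
    """Decode as UTF-8, replacing ill-formed bytes with U+FFFD."""
    return raw.decode("utf-8", errors="replace")


# ED A0 80 is what a naive encoder would emit for the surrogate U+D800.
encoded_surrogate = b"\xed\xa0\x80"
print(is_surrogate(0xD800))                    # surrogate code point
print(is_well_formed_utf8(encoded_surrogate))  # rejected by a strict decoder
print(decode_substituting(encoded_surrogate))  # only U+FFFD characters remain
```

How many U+FFFD characters a decoder emits for one ill-formed sequence can
differ between implementations, so the sketch does not rely on a count.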

------------------------------------------------------------------------
[2013-07-11 09:48:54] masakielastic at gmail dot com

Hi, I fixed my patch and added a test case for json_decode.

------------------------------------------------------------------------
[2013-07-11 08:37:51] masakielastic at gmail dot com

Hi remi, could you test my patch for the PHP_JSON_UNESCAPED_UNICODE option?
The patch adds the JSON_NOTUTF8_SUBSTITUTE and JSON_NOTUTF8_IGNORE options.

https://gist.github.com/masakielastic/5973095

------------------------------------------------------------------------
[2013-07-11 04:59:02] [email protected]

I don't think changing the current behavior is a good idea, which is why I
really prefer adding some new options.

------------------------------------------------------------------------
[2013-07-11 04:27:19] masakielastic at gmail dot com

Hi, thanks nikic and remi.

After some consideration, I changed my mind.
I think substituting U+FFFD for ill-formed sequences
should be the default behavior.

What do you think?

We might need a discussion about consistency with the Escaper API.
htmlspecialchars's ENT_SUBSTITUTE option is adopted
by Symfony and Zend Framework.

https://wiki.php.net/rfc/escaper

Although the change breaks two test suites, it doesn't break users' codebases.

Judging from a search on GitHub, a lot of people don't use any options.

https://github.com/search?l=PHP&q=json_encode&ref=advsearch&type=Code
https://github.com/search?l=PHP&q=json_decode&ref=advsearch&type=Code

The same problem can be seen in htmlspecialchars.

https://github.com/search?l=PHP&q=htmlspecialchars&ref=advsearch&type=Code

New options complicate the situation
when the JSON_UNESCAPED_UNICODE option and json_decode are used.

[two options]
json_encode
  JSON_NOTUTF8_SUBSTITUTE
  JSON_NOTUTF8_IGNORE
  JSON_UNESCAPED_UNICODE | JSON_NOTUTF8_SUBSTITUTE
  JSON_UNESCAPED_UNICODE | JSON_NOTUTF8_IGNORE

json_decode
  JSON_NOTUTF8_SUBSTITUTE
  JSON_NOTUTF8_IGNORE


If JSON_NOTUTF8_SUBSTITUTE is the default behavior,
the only option we need to consider is JSON_NOTUTF8_IGNORE.

[one option]
json_encode
  JSON_NOTUTF8_IGNORE
  JSON_UNESCAPED_UNICODE | JSON_NOTUTF8_IGNORE

json_decode
  JSON_NOTUTF8_IGNORE
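
The combinations listed above can be sketched as follows. This is an
illustration in Python, not the PHP patch: encode_json and its parameters are
hypothetical names. "substitute" stands in for the proposed
JSON_NOTUTF8_SUBSTITUTE, "ignore" for the proposed JSON_NOTUTF8_IGNORE, and
unescaped=True for JSON_UNESCAPED_UNICODE.

```python
import json


def encode_json(raw: bytes, on_error: str = "substitute",
                unescaped: bool = False) -> str:
    """Encode a byte string as a JSON string literal.

    on_error="substitute" replaces ill-formed UTF-8 with U+FFFD;
    on_error="ignore" drops the ill-formed bytes.
    unescaped=True keeps non-ASCII characters literal instead of \\uXXXX.
    """
    handlers = {"substitute": "replace", "ignore": "ignore"}
    text = raw.decode("utf-8", errors=handlers[on_error])
    return json.dumps(text, ensure_ascii=not unescaped)


# 0xE9 is a Latin-1 "e with acute"; on its own it is ill-formed UTF-8.
raw = b"caf\xe9"

print(encode_json(raw))                            # "caf\ufffd" (escaped)
print(encode_json(raw, "ignore"))                  # "caf"
print(encode_json(raw, unescaped=True))            # literal U+FFFD in output
print(encode_json(raw, "ignore", unescaped=True))  # "caf"
```

Making "substitute" the default in the sketch mirrors the one-option
proposal: callers then only have to choose whether to opt into "ignore".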

------------------------------------------------------------------------


The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at

    https://bugs.php.net/bug.php?id=65082

