ID:               40506
 User updated by:  php at koterov dot ru
-Summary:          Suggestion: json_encode() and non-UTF8 strings
 Reported By:      php at koterov dot ru
-Status:           Bogus
+Status:           Open
 Bug Type:         Feature/Change Request
 Operating System: all
 PHP Version:      5.2.1
 New Comment:

I understand that JSON is a UTF-8-based format. But the question was
different: why does json_encode() spend CPU time analyzing the input
data instead of passing it through?

And a second thought. Assume the output of json_encode() must be
UTF-8, fine. But why should that restrict us to UTF-8 for its input
parameter as well? Conceptually, input != output.

The main disadvantage is that I cannot iterate over all of the input
data and call iconv() on each element before passing the resulting
array to json_encode(): it is very CPU-expensive (e.g. when I transfer
more than 500 strings of about 30 characters each, the slowdown is
significant).

In principle json_encode() is valuable precisely for its speed and low
CPU cost, yet it is completely unusable on non-UTF-8 sites. When speed
does not matter, it is easy enough to use a pure-PHP version of this
function instead.

I think that if we want to follow the RFC literally, it would be
better to implement json_encode() without any encoding analysis, and
afterwards call iconv() ONE TIME to convert the resulting string to
UTF-8. That is much faster than calling iconv() for each input string.
Alternatively, json_encode() could take a second optional parameter,
$src_encoding, to specify the input encoding.
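A minimal sketch of that one-shot idea. naive_json() below is a
hypothetical passthrough encoder for flat maps of strings, invented
here purely for illustration (it is not PHP's json_encode()); the
point is that a single iconv() call at the end replaces per-string
conversion:

```php
<?php
// Sketch of the proposal: emit string bytes as-is, without any UTF-8
// analysis, then convert the whole JSON text with ONE iconv() call.
// naive_json() is a hypothetical minimal encoder for flat maps of
// strings -- just enough to illustrate the idea.
function naive_json(array $a)
{
    $parts = array();
    foreach ($a as $k => $v) {
        // Escape only the JSON metacharacters " and \ in keys/values.
        $parts[] = '"' . addcslashes($k, "\"\\")
                 . '":"' . addcslashes($v, "\"\\") . '"';
    }
    return '{' . implode(',', $parts) . '}';
}

// "проверка" in CP1251 bytes (a single-byte encoding).
$cp1251 = array('a' => "\xEF\xF0\xEE\xE2\xE5\xF0\xEA\xE0");

// One conversion for the entire result instead of one per string;
// this works because all JSON syntax characters are plain ASCII,
// which CP1251 maps identically.
$json = iconv('CP1251', 'UTF-8', naive_json($cp1251));
```

After the single conversion, $json is valid UTF-8 JSON that decodes
back to the original (now UTF-8) strings.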


Previous Comments:
------------------------------------------------------------------------

[2007-02-17 15:50:14] [EMAIL PROTECTED]

http://www.ietf.org/rfc/rfc4627.txt?number=4627
see section 3

------------------------------------------------------------------------

[2007-02-16 10:47:31] php at koterov dot ru

Description:
------------
Could you please explain why json_encode() cares about the encoding at
all? Why not treat all string data as a binary stream? The current
behaviour is very inconvenient and makes json_encode() unusable on
non-UTF-8 sites! :-(

I have written a small substitution for json_encode(), but note that
it is, of course, much slower than json_encode() on big data arrays.

    /**
     * Convert a PHP scalar, array or hash to a JS scalar/array/hash.
     * Strings are emitted as raw bytes, so the output stays in the
     * source encoding.
     */
    function php2js($a)
    {
        if (is_null($a)) return 'null';
        if ($a === false) return 'false';
        if ($a === true) return 'true';
        if (is_scalar($a)) {
            // Escape quotes and backslashes; other bytes pass through as-is.
            $a = addslashes($a);
            $a = str_replace("\n", '\n', $a);
            $a = str_replace("\r", '\r', $a);
            // Break up "</script" so the output is safe inside a <script> block.
            $a = preg_replace('{(</)(script)}i', "$1'+'$2", $a);
            return "'$a'";
        }
        // An array maps to a JS list only if its keys are 0..count()-1 in order.
        $isList = true;
        for ($i = 0, reset($a); $i < count($a); $i++, next($a))
            if (key($a) !== $i) { $isList = false; break; }
        $result = array();
        if ($isList) {
            foreach ($a as $v) $result[] = php2js($v);
            return '[ ' . join(', ', $result) . ' ]';
        } else {
            foreach ($a as $k => $v)
                $result[] = php2js($k) . ': ' . php2js($v);
            return '{ ' . join(', ', $result) . ' }';
        }
    }

So, my suggestion is to remove all string analysis from the
json_encode() code. That would also make the function faster.

Reproduce code:
---------------
<?php
$a = array('a' => 'проверка',
           'b' => array('слуха', 'глухого'));
echo json_encode($a);
?>

Expected result:
----------------
A correctly encoded JSON string in the source single-byte encoding.

Actual result:
--------------
Empty strings everywhere (and sometimes notices that a string contains
non-UTF8 characters).
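The failure can also be reproduced with raw bytes, no site setup
needed (a sketch; note the exact behaviour depends on the PHP version:
PHP 5.2 produced empty/null values plus notices, while PHP >= 5.5
rejects the input outright):

```php
<?php
// CP1251 bytes for "проверка" -- not a valid UTF-8 sequence.
$a = array('a' => "\xEF\xF0\xEE\xE2\xE5\xF0\xEA\xE0");

// Current PHP refuses the input; PHP 5.2 emitted empty/null values
// and "Invalid UTF-8 sequence" notices instead.
var_dump(json_encode($a));                        // bool(false) on PHP >= 5.5
var_dump(json_last_error() === JSON_ERROR_UTF8);  // bool(true)
```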


------------------------------------------------------------------------


-- 
Edit this bug report at http://bugs.php.net/?id=40506&edit=1
