Edit report at https://bugs.php.net/bug.php?id=45993&edit=1
ID: 45993
Comment by: Apollo880 at gmail dot com
Reported by: mtrojan at transline dot de
Summary: mb_detect_encoding and mb_check_encoding results are dissonant
Status: Open
Type: Bug
Package: mbstring related
Operating System: Windows XP
PHP Version: 5.2.6
Block user comment: N
Private report: N
New Comment:
Detecting the correct encoding with mb_check_encoding fails as well:
function detect_enc($str)
{
    // Drop the first three entries of mb_list_encodings().
    $awe = mb_list_encodings();
    unset($awe[0], $awe[1], $awe[2]);
    foreach ($awe as $enctype)
    {
        // Return the first encoding for which the string validates.
        if (mb_check_encoding($str, $enctype) === true) return $enctype;
    }
    return false;
}
echo detect_enc('String_encoded_to_Windows-1251'); // Returns 'byte2be', which is wrong.
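The loop above returns the first encoding that accepts the bytes, and encodings such as byte2be presumably accept almost any input. A minimal sketch of a narrower check follows, assuming the plausible encodings are known in advance and listed strictest first. The helper name first_valid_encoding and the sample bytes are hypothetical and not part of mbstring.

```php
<?php
// Hypothetical helper (not part of mbstring): returns the first candidate
// encoding for which mb_check_encoding() accepts the bytes, or false.
// List strict encodings such as UTF-8 before permissive single-byte ones.
function first_valid_encoding($str, array $candidates)
{
    foreach ($candidates as $encoding) {
        if (mb_check_encoding($str, $encoding)) {
            return $encoding;
        }
    }
    return false;
}

// Assumed sample: the word "Привет" as raw Windows-1251 bytes.
$cp1251 = "\xCF\xF0\xE8\xE2\xE5\xF2";
```

Calling first_valid_encoding($cp1251, array('UTF-8', 'Windows-1251')) returns 'Windows-1251', because the bytes are not valid UTF-8. This only narrows the choice; it cannot tell two single-byte encodings apart.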
Previous Comments:
------------------------------------------------------------------------
[2008-11-10 07:30:32] mtrojan at transline dot de
Of course, comparing the beginning of a file with the UTF-16 BOM can be used to
detect UTF-16 encoding. But what do you do with UTF-16 encoded files where no
BOM is set?
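For files without a BOM, one common heuristic is to count NUL bytes at even and odd offsets. This is a rough sketch and not something mbstring provides. The function name guess_utf16_byte_order and the 0.3 threshold are assumptions, and the approach only works for text dominated by characters below U+0100, such as Latin-script text.

```php
<?php
// Hypothetical helper (not part of mbstring): guesses the byte order of
// BOM-less UTF-16 text from the position of NUL bytes. For mostly Latin
// text, the high byte of each 16-bit unit is NUL, so NULs cluster at odd
// offsets in little-endian data and at even offsets in big-endian data.
function guess_utf16_byte_order($bytes, $threshold = 0.3)
{
    $len = strlen($bytes);
    // UTF-16 data has an even, non-zero byte length.
    if ($len < 2 || $len % 2 !== 0) {
        return false;
    }
    $evenNul = 0; // NULs at offsets 0, 2, 4, ...
    $oddNul = 0;  // NULs at offsets 1, 3, 5, ...
    for ($i = 0; $i < $len; $i++) {
        if ($bytes[$i] === "\x00") {
            if ($i % 2 === 0) {
                $evenNul++;
            } else {
                $oddNul++;
            }
        }
    }
    $units = $len / 2; // number of 16-bit code units
    if ($oddNul > $evenNul && $oddNul / $units >= $threshold) {
        return 'UTF-16LE';
    }
    if ($evenNul > $oddNul && $evenNul / $units >= $threshold) {
        return 'UTF-16BE';
    }
    return false; // not enough evidence either way
}
```

A false result means only that the heuristic found no evidence; text in scripts whose code units rarely contain a NUL byte will not be recognized this way.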
------------------------------------------------------------------------
[2008-11-08 02:20:46] [email protected]
mb_detect_encoding does not support detection of UTF-16/UTF-16BE.
Because UTF-16 is not a byte-stream encoding like UTF-8, it cannot be
detected the way byte-stream encodings are.
A file encoded in UTF-16 can be detected easily by its BOM,
for example:
if ($content[0]==chr(0xff) && $content[1]==chr(0xfe)) {
    // FF FE: little-endian byte order mark
    echo 'UTF-16';
} else if ($content[0]==chr(0xfe) && $content[1]==chr(0xff)) {
    // FE FF: big-endian byte order mark
    echo 'UTF-16BE';
}
------------------------------------------------------------------------
[2008-10-26 23:01:49] [email protected]
Assigned to the mbstring maintainer.
------------------------------------------------------------------------
[2008-09-04 11:47:39] mtrojan at transline dot de
Description:
------------
mb_detect_encoding does not seem to recognize UTF-16 encoded files properly.
Even when mb_check_encoding confirms that a file is valid UTF-16LE,
mb_detect_encoding does not detect the same file as UTF-16 and returns
ISO-8859-1 instead. Enabling or disabling strict mode has no influence on the
result.
Reproduce code:
---------------
$content = file_get_contents($src_path);
$encodings = array('UTF-16', 'UTF-16LE', 'UTF-16BE', 'UTF-8', 'UNICODE',
                   'ISO-8859-1');
$enc = mb_detect_encoding($content, $encodings);
print "encoding: $enc\n";
print 'checked: ' . intval(mb_check_encoding($content, 'UTF-16LE'));
Expected result:
----------------
encoding: UTF-16LE
checked: 1
Actual result:
--------------
encoding: ISO-8859-1
checked: 1
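The reproduce code reads its input from $src_path, which is not included in the report. A self-contained variant might look like the sketch below. The function name repro_45993 and the in-memory sample are assumptions, and the 'UNICODE' alias is left out of the candidate list for brevity.

```php
<?php
// Hypothetical wrapper around the report's reproduce code, so the two
// results can be compared without an input file.
function repro_45993($content)
{
    $encodings = array('UTF-16', 'UTF-16LE', 'UTF-16BE', 'UTF-8', 'ISO-8859-1');
    return array(
        // What mb_detect_encoding picks from the candidate list.
        'detected' => mb_detect_encoding($content, $encodings),
        // Whether the same bytes validate as UTF-16LE (1 or 0).
        'checked'  => intval(mb_check_encoding($content, 'UTF-16LE')),
    );
}

// Assumed sample input: "hi" as UTF-16LE without a BOM, as raw bytes.
$sample = "h\x00i\x00";
```

Running print_r(repro_45993($sample)) shows both values side by side; the detected value may differ between PHP versions.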
------------------------------------------------------------------------