ID: 49350
User updated by: soapergem at gmail dot com
Reported By: soapergem at gmail dot com
Status: Bogus
Bug Type: Filesystem function related
Operating System: Windows XP
PHP Version: 5.3.0
New Comment:
But generally speaking this isn't desired behavior from the user
standpoint. When you open a file in Notepad that has this character at
the front, you never see it. I never knew it was there until trying to
read it raw through PHP, since it is clearly not intended to be part of
the content, but instead part of the file meta-data.
It would be unwise not to expect that Unicode will eventually become
the standard. And currently the burden is on the PHP user to account for
it, when I think it should be a language feature. So I suggest adding a
letter to the "mode" argument of fopen(), for instance:
$fp = fopen('utf8_text_file.txt', 'ru');
The "u" would indicate that the file *may* be encoded in UTF-8, and if
so, throw out the BOM at the front. This would mean that fseek'ing to 0
would effectively start just after the BOM (if present), and the file
would be initialized to this seek position. This would provide backwards
compatibility, since you would have to change the fopen() mode for it to
detect the BOM. And it'd make things for PHP users like myself a lot
easier.
Just a thought.
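For comparison, the behavior described above can be approximated in
userland today. The following is only a rough sketch: the helper name
fopen_skip_utf8_bom() is hypothetical and is not part of PHP, and no "u"
mode exists in fopen().

```php
<?php
// Hypothetical helper (NOT a PHP built-in): opens a file for reading and,
// if it starts with the UTF-8 BOM (EF BB BF), leaves the file pointer
// positioned just past those three bytes.
function fopen_skip_utf8_bom($path)
{
    $fp = fopen($path, 'r');
    if ($fp === false) {
        return false;
    }
    // Read up to three bytes and compare them against the UTF-8 BOM.
    if (fread($fp, 3) !== "\xEF\xBB\xBF") {
        // No BOM found: return to the true start of the file.
        rewind($fp);
    }
    return $fp;
}
```

Unlike the proposed mode, this sketch does not change what fseek($fp, 0)
means: seeking to 0 afterwards still lands on the BOM, so the caller
would have to skip it again.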
Previous Comments:
------------------------------------------------------------------------
[2009-08-25 06:44:45] [email protected]
Of course it does. If it didn't, it would be broken.
------------------------------------------------------------------------
[2009-08-24 22:31:38] soapergem at gmail dot com
Description:
------------
When text files are saved with UTF-8 encoding, a few bytes may be
saved at the front called the "Byte Order Mark" (read more about it on
Wikipedia). They are supposed to remain hidden and just be used as
meta-data to indicate that the file is saved with UTF-8 encoding.
Their hex values are EF BB BF, which display as "ï»¿" when the bytes
are interpreted as Windows-1252 (or ISO-8859-1) instead of UTF-8.
The trouble is that when you read in a UTF-8 text file with either
fgets or fgetcsv, PHP misinterprets the BOM as literal text and includes
it with all the other text.
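As an illustration of the bytes involved, here is a minimal sketch that
inspects the start of a string and drops the BOM if present. The helper
name strip_utf8_bom() and the sample string are assumptions made up for
this example; they are not taken from PHP or from the file in question.

```php
<?php
// Hypothetical helper (NOT a PHP built-in): removes a leading UTF-8 BOM
// (the bytes EF BB BF) from a string, if one is present.
function strip_utf8_bom($line)
{
    if (strncmp($line, "\xEF\xBB\xBF", 3) === 0) {
        // The cast keeps the result a string even when nothing follows the BOM.
        return (string) substr($line, 3);
    }
    return $line;
}

// Sample data standing in for the first line returned by fgets().
$first = "\xEF\xBB\xBFname,email\n";
echo bin2hex(substr($first, 0, 3)), "\n"; // prints "efbbbf"
echo strip_utf8_bom($first);              // prints "name,email"
```

This only helps when the first line is already in a string, as with
fgets(); for fgetcsv() the bytes would have to be skipped on the stream
before the first call.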
Reproduce code:
---------------
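<?php
// fopen() requires a mode argument; 'r' opens the file read-only.
if ( $fp = fopen('utf8_text_file.txt', 'r') ) {
    // The first line is returned with the BOM bytes still attached.
    echo fgets($fp);
    fclose($fp);
}
?>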
Expected result:
----------------
Whatever text is saved on the first line of the text file.
Actual result:
--------------
ï»¿Whatever text is saved on the first line of the text file.
------------------------------------------------------------------------