#49350 [Bgs]: fgets reads the UTF-8 Byte Order Mark literally

soapergem at gmail dot com Tue, 25 Aug 2009 09:12:47 -0700

 ID:               49350
 User updated by:  soapergem at gmail dot com
 Reported By:      soapergem at gmail dot com
 Status:           Bogus
 Bug Type:         Filesystem function related
 Operating System: Windows XP
 PHP Version:      5.3.0
 New Comment:


But generally speaking this isn't the behavior a user wants. When you
open a file in Notepad that has this character at the front, you never
see it. I never knew it was there until I tried to read the file raw
through PHP. It is clearly not intended to be part of the content, but
rather part of the file's meta-data.

It would be unwise not to expect that Unicode will eventually become
the standard. And currently the burden is on the PHP user to account for
it, when I think it should be a language feature. So I suggest doing
something like adding a letter to the "mode" of fopen(), for instance
something like this:

$fp = fopen('utf8_text_file.txt', 'ru');

The "u" would indicate that the file *may* be encoded in UTF-8, and if
so, throw out the BOM at the front. This would mean that fseek'ing to 0
would effectively start just after the BOM (if present), and the file
would be initialized to this seek position. This would provide backwards
compatibility, since you would have to change the fopen() mode for it to
detect the BOM. And it'd make things for PHP users like myself a lot
easier.

Just a thought.
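
In the meantime, roughly the same effect can be had in userland. Below
is a minimal sketch, assuming a hypothetical helper named
fopen_skip_bom() (not a PHP built-in): it opens the file, peeks at the
first three bytes, and rewinds only if they are not the UTF-8 BOM.

```php
<?php
// Hypothetical userland helper; this is NOT part of PHP itself.
// Opens a file and leaves the position just past a UTF-8 BOM, if any.
function fopen_skip_bom($filename, $mode = 'r')
{
    $fp = fopen($filename, $mode);
    if ($fp === false) {
        return false;
    }

    // Read the first three bytes and compare them to the UTF-8 BOM.
    if (fread($fp, 3) !== "\xEF\xBB\xBF") {
        // No BOM: go back to the very start of the file.
        rewind($fp);
    }

    // If a BOM was found, the position is already just after it.
    return $fp;
}
```

Unlike the proposed "u" mode, this does not change what fseek($fp, 0)
or rewind($fp) do: both still land on the BOM, so the caller would have
to seek to offset 3 instead when a BOM is present.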


Previous Comments:
------------------------------------------------------------------------

[2009-08-25 06:44:45] [email protected]

Of course it does. If it didn't, it would be broken.

------------------------------------------------------------------------

[2009-08-24 22:31:38] soapergem at gmail dot com

Description:
------------
When a text file is saved as UTF-8 by some editors (Notepad, for
example), three bytes are written at the front, called the "Byte Order
Mark" (read more about it on Wikipedia). They are supposed to remain
hidden and serve only as meta-data indicating that the file is UTF-8
encoded. Their hex values are EF BB BF, which show up as "ï»¿" when the
bytes are interpreted as Windows-1252 (Latin-1).
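
To check whether a given file actually starts with these bytes, its
first three bytes can be dumped as hex. A small sketch, assuming a
hypothetical helper name has_utf8_bom():

```php
<?php
// Hypothetical helper: reports whether a file starts with the UTF-8 BOM.
function has_utf8_bom($filename)
{
    // Binary mode so no newline translation happens on Windows.
    $fp = fopen($filename, 'rb');
    if ($fp === false) {
        return false;
    }

    // Only the first three bytes matter.
    $head = fread($fp, 3);
    fclose($fp);

    // bin2hex() returns lowercase hex digits.
    return bin2hex($head) === 'efbbbf';
}
```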

The trouble is that when you read in a UTF-8 text file with either
fgets or fgetcsv, PHP misinterprets the BOM as literal text and includes
it with all the other text.

Reproduce code:
---------------
<?php

if ( $fp = fopen('utf8_text_file.txt', 'r') ) {

    echo fgets($fp);
    fclose($fp);

}

?>

Expected result:
----------------
Whatever text is saved on the first line of the text file.

Actual result:
--------------
ï»¿Whatever text is saved on the first line of the text file.
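
A possible workaround on the caller's side is to strip the three BOM
bytes from the first line after reading it. A minimal sketch, assuming
a hypothetical helper named strip_utf8_bom():

```php
<?php
// Hypothetical helper: removes a leading UTF-8 BOM from a string, if present.
function strip_utf8_bom($line)
{
    // Compare only the first three bytes against EF BB BF.
    if (strncmp($line, "\xEF\xBB\xBF", 3) === 0) {
        // Cast to string: older PHP versions return false from substr()
        // when the start offset equals the string length.
        return (string) substr($line, 3);
    }
    return $line;
}
```

With this in place, the reproduce code would call
echo strip_utf8_bom(fgets($fp)); for the first line only. Later lines
should be left untouched.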


------------------------------------------------------------------------


-- 
Edit this bug report at http://bugs.php.net/?id=49350&edit=1
