Edit report at http://bugs.php.net/bug.php?id=48507&edit=1
ID: 48507 Comment by: pahan at hubbitus dot spb dot su Reported by: krynble at yahoo dot com dot br Summary: fgetcsv() ignoring special characters Status: Bogus Type: Bug Package: Filesystem function related Operating System: Unix PHP Version: 5.* New Comment: > Quote from the docs: > Note: Locale setting is taken into account by this function. If LANG is e.g. > en_US.UTF-8, files in one-byte encoding are read wrong by this function. Ok, bug documented as "are read wrong by this function" is better then nothing. But do you plan fix this wrong behaviour? Previous Comments: ------------------------------------------------------------------------ [2010-05-18 11:03:42] m...@php.net Thank you for taking the time to write to us, but this is not a bug. Please double-check the documentation available at http://www.php.net/manual/ and the instructions on how to report a bug at http://bugs.php.net/how-to-report.php Quote from the docs: Note: Locale setting is taken into account by this function. If LANG is e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this function. ------------------------------------------------------------------------ [2009-12-12 11:40:29] pahan at hubbitus dot spb dot su Sorry for duplicate (#50456 is my), but in it, additionally to there described problem in fgetcsv I also suggest fix fputcvs to allow [force] enclosing single words in field. Off course it does *not* solve this problem of incorrect fgetcsv parsing, because RFC allow not quoted values ( http://www.faqs.org/rfcs/rfc4180.html , section 2.5 ), but, it is make pair fputcsv/fgetcsv as minimum compatible in PHP implementation. ------------------------------------------------------------------------ [2009-12-12 01:33:51] j...@php.net See also bug #50456 ------------------------------------------------------------------------ [2009-09-22 15:09:20] phofstetter at sensational dot ch below you'll find a small script which shows how to implement a user filter that can be used to on-the-fly utf8-encode the data so that fgetcsv is happy and returns correct output even if the first character in a field has its high-bit set and is not valid utf-8: Remember: This is a workaround and impacts performance. This is not a valid fix for the bug. I didn't yet have time to deeply look into the C implementation for fgetcsv, but all these calls to php_mblen() feel suspicious to me. I'll try and have a look into this later today, but for now, I'm just glad I have this workaround (quickly hacked together - keep that in mind): <?php class utf8encode_filter extends php_user_filter { function is_utf8($string){ return preg_match('%(?: [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte |\xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte |\xED[\x80-\x9F][\x80-\xBF] # excluding surrogates |\xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3 |[\xF1-\xF3][\x80-\xBF]{3} # planes 4-15 |\xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16 )+%xs', $string); } function filter($in, $out, &$consumed, $closing) { while ($bucket = stream_bucket_make_writeable($in)) { if (!$this->is_utf8($bucket->data)) $bucket->data = utf8_encode($bucket->data); $consumed += $bucket->datalen; stream_bucket_append($out, $bucket); } return PSFS_PASS_ON; } } /* Register our filter with PHP */ stream_filter_register("utf8encode", "utf8encode_filter") or die("Failed to register filter"); $fp = fopen($_SERVER['argv'][1], "r"); /* Attach the registered filter to the stream just opened */ stream_filter_prepend($fp, "utf8encode"); while($data = fgetcsv($fp, 0, ';', '"')) print_r($data); fclose($fp); ------------------------------------------------------------------------ [2009-09-22 14:45:22] phofstetter at sensational dot ch I was looking into this (after having been bitten by it) and I can add another tidbit that might help tracking this down: The bug doesn't happen if the file fgetcsv() is reading is in UTF-8-format. I have created a test-file in ISO-8859-1 and then used file_put_contents(utf8encode(file_get_contents())) to create the UTF8-version of it (explaining this here because I'm not sure whether this would write a BOM or not - probably not though). That version could be read correctly. I'm now writing a stream filter that does the UTF-8 conversion on the fly to hook that in between the file and fgetcsv() - while I would lose a bit of performance, in my case, this is the cleanest workaround. ------------------------------------------------------------------------ The remainder of the comments for this report are too long. To view the rest of the comments, please view the bug report online at http://bugs.php.net/bug.php?id=48507 -- Edit this bug report at http://bugs.php.net/bug.php?id=48507&edit=1