ID: 48507 Comment by: phofstetter at sensational dot ch Reported By: krynble at yahoo dot com dot br Status: Verified Bug Type: Filesystem function related Operating System: Unix PHP Version: 5.2.9 New Comment:
below you'll find a small script which shows how to implement a user filter that can be used to on-the-fly utf8-encode the data so that fgetcsv is happy and returns correct output even if the first character in a field has its high-bit set and is not valid utf-8: Remember: This is a workaround and impacts performance. This is not a valid fix for the bug. I didn't yet have time to deeply look into the C implementation for fgetcsv, but all these calls to php_mblen() feel suspicious to me. I'll try and have a look into this later today, but for now, I'm just glad I have this workaround (quickly hacked together - keep that in mind): <?php class utf8encode_filter extends php_user_filter { function is_utf8($string){ return preg_match('%(?: [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte |\xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte |\xED[\x80-\x9F][\x80-\xBF] # excluding surrogates |\xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3 |[\xF1-\xF3][\x80-\xBF]{3} # planes 4-15 |\xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16 )+%xs', $string); } function filter($in, $out, &$consumed, $closing) { while ($bucket = stream_bucket_make_writeable($in)) { if (!$this->is_utf8($bucket->data)) $bucket->data = utf8_encode($bucket->data); $consumed += $bucket->datalen; stream_bucket_append($out, $bucket); } return PSFS_PASS_ON; } } /* Register our filter with PHP */ stream_filter_register("utf8encode", "utf8encode_filter") or die("Failed to register filter"); $fp = fopen($_SERVER['argv'][1], "r"); /* Attach the registered filter to the stream just opened */ stream_filter_prepend($fp, "utf8encode"); while($data = fgetcsv($fp, 0, ';', '"')) print_r($data); fclose($fp); Previous Comments: ------------------------------------------------------------------------ [2009-09-22 14:45:22] phofstetter at sensational dot ch I was looking into this (after having been bitten by it) and I can add another tidbit that might help tracking this down: The bug doesn't happen if the file fgetcsv() is reading is in UTF-8-format. I have created a test-file in ISO-8859-1 and then used file_put_contents(utf8encode(file_get_contents())) to create the UTF8-version of it (explaining this here because I'm not sure whether this would write a BOM or not - probably not though). That version could be read correctly. I'm now writing a stream filter that does the UTF-8 conversion on the fly to hook that in between the file and fgetcsv() - while I would lose a bit of performance, in my case, this is the cleanest workaround. ------------------------------------------------------------------------ [2009-09-21 18:11:47] dmulryan at calendarwiz dot com Note: Previous comment has error where URL is shown in array element. This is not a bug but my error in the example. Bug is in special characters. ------------------------------------------------------------------------ [2009-09-21 18:07:42] dmulryan at calendarwiz dot com Similar problem when parsing the following line: 0909211132,1,ØÊááàÑ,äÆæç,CForm,Y,1,1,1,97.95.176.240,2530 which produces empty array elements for fields with special characters: Array ( [0] => 0909211132 [1] => 1 [2] => [3] => [4] => URL [5] => Y [6] => 1 [7] => 1 [8] => 1 [9] => 97.95.176.240 [10] => 2530 ) ------------------------------------------------------------------------ [2009-06-26 19:35:22] sjoerd-php at linuxonly dot nl Could reproduce with php 5.2.10, php 5.2.11-dev (200906261830) and php 5.3rc4. Example code: <?php $fp = tmpfile(); $str = "WEIRD#\xD3TICA#BEHAVIOR"; fwrite($fp, $str); fseek($fp, 0); $arr = fgetcsv($fp, 100, '#'); var_dump($arr[1]); fclose($fp); ?> Expected: string(5) "?TICA" Actual: string(4) "TICA" ------------------------------------------------------------------------ [2009-06-13 18:10:03] krynble at yahoo dot com dot br Unfortunately I'm unable to test it because the server is running in a Datacenter. If someone can give a feedback about it, I would apreciate. Still, thanks for the help! ------------------------------------------------------------------------ The remainder of the comments for this report are too long. To view the rest of the comments, please view the bug report online at http://bugs.php.net/48507 -- Edit this bug report at http://bugs.php.net/?id=48507&edit=1