ID:               48507
 Comment by:       phofstetter at sensational dot ch
 Reported By:      krynble at yahoo dot com dot br
 Status:           Verified
 Bug Type:         Filesystem function related
 Operating System: Unix
 PHP Version:      5.2.9
 New Comment:

Below you'll find a small script that shows how to implement a user
filter which utf8-encodes the data on the fly, so that fgetcsv is happy
and returns correct output even if the first character in a field has
its high bit set and is not valid UTF-8:

Remember: This is a workaround and impacts performance. This is not a
valid fix for the bug.

I haven't yet had time to look deeply into the C implementation of
fgetcsv, but all these calls to php_mblen() look suspicious to me.

I'll try to have a look at this later today, but for now, I'm just
glad I have this workaround (quickly hacked together, so keep that in
mind):

<?php

class utf8encode_filter extends php_user_filter {
  function is_utf8($string){
      return preg_match('%(?:
          [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
          |\xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
          |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
          |\xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
          |\xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
          |[\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
          |\xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
      )+%xs', $string);
  }
      
  function filter($in, $out, &$consumed, $closing)
  {
    while ($bucket = stream_bucket_make_writeable($in)) {
      if (!$this->is_utf8($bucket->data))
          $bucket->data = utf8_encode($bucket->data);
      $consumed += $bucket->datalen;
      stream_bucket_append($out, $bucket);
    }
    return PSFS_PASS_ON;
  }
}

/* Register our filter with PHP */
stream_filter_register("utf8encode", "utf8encode_filter")
    or die("Failed to register filter");

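$fp = fopen($_SERVER['argv'][1], "r")
    or die("Failed to open file");

/* Attach the registered filter to the stream just opened */
stream_filter_prepend($fp, "utf8encode");

/* Compare against false so the loop only stops at end of file or error */
while (($data = fgetcsv($fp, 0, ';', '"')) !== false)
    print_r($data);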

fclose($fp);
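
Note that is_utf8() above returns true as soon as the data contains at
least one valid multibyte sequence, so a chunk that mixes UTF-8 and
ISO-8859-1 bytes would be passed through unchanged. A minimal sketch of
a stricter check follows, assuming it is acceptable to rely on PCRE's
own validation via the /u modifier. The function name is_strict_utf8 is
hypothetical and not part of the script above.

```php
<?php
// Hypothetical helper (not part of the original script): returns true
// only if the WHOLE string is well-formed UTF-8. With the /u modifier,
// preg_match() validates the subject and returns false when the
// subject contains malformed UTF-8.
function is_strict_utf8($string)
{
    return preg_match('//u', $string) === 1;
}

var_dump(is_strict_utf8("TICA"));          // plain ASCII
var_dump(is_strict_utf8("\xC3\x93TICA"));  // O-acute encoded as UTF-8
var_dump(is_strict_utf8("\xD3TICA"));      // O-acute as a single ISO-8859-1 byte
var_dump(is_strict_utf8("\xC3\x93 \xD3")); // mixed: one valid sequence, one stray byte
```

Swapping this in would change the behaviour of the filter for mixed
data (the valid sequences would be encoded a second time), so whether
it is an improvement depends on the input files.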


Previous Comments:
------------------------------------------------------------------------

[2009-09-22 14:45:22] phofstetter at sensational dot ch

I was looking into this (after having been bitten by it) and I can add
another tidbit that might help tracking this down:

The bug doesn't happen if the file fgetcsv() is reading is in
UTF-8-format.

I have created a test-file in ISO-8859-1 and then used
file_put_contents(utf8_encode(file_get_contents())) to create the
UTF-8 version of it (explaining this here because I'm not sure whether
this would write a BOM or not - probably not though).

That version could be read correctly.

I'm now writing a stream filter that does the UTF-8 conversion on the
fly to hook that in between the file and fgetcsv() - while I would lose
a bit of performance, in my case, this is the cleanest workaround.
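
The one-off conversion described above can be sketched roughly as
follows. This is only an illustration: the helper latin1_to_utf8 and
the temporary file names are hypothetical, and the conversion is done
by hand here instead of calling utf8_encode() so that the sketch does
not depend on any extension. file_put_contents() writes the bytes it is
given as-is, so no BOM is added.

```php
<?php
// Hypothetical helper: convert ISO-8859-1 bytes to UTF-8 by hand.
// Every byte >= 0x80 maps to U+0080..U+00FF, i.e. a 2-byte sequence.
function latin1_to_utf8($data)
{
    $out = '';
    $len = strlen($data);
    for ($i = 0; $i < $len; $i++) {
        $byte = ord($data[$i]);
        if ($byte < 0x80) {
            $out .= $data[$i];                       // ASCII stays as it is
        } else {
            $out .= chr(0xC0 | ($byte >> 6))         // lead byte (C2 or C3)
                  . chr(0x80 | ($byte & 0x3F));      // continuation byte
        }
    }
    return $out;
}

// Hypothetical file names, created in the system temp directory.
$source = tempnam(sys_get_temp_dir(), 'latin1_');
$target = tempnam(sys_get_temp_dir(), 'utf8_');

file_put_contents($source, "WEIRD#\xD3TICA#BEHAVIOR\n");
file_put_contents($target, latin1_to_utf8(file_get_contents($source)));

$converted = file_get_contents($target);
echo bin2hex($converted), "\n";

// Check whether the converted file starts with a UTF-8 BOM.
var_dump(substr($converted, 0, 3) === "\xEF\xBB\xBF");

unlink($source);
unlink($target);
```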

------------------------------------------------------------------------

[2009-09-21 18:11:47] dmulryan at calendarwiz dot com

Note: The previous comment has an error where URL is shown in an array
element. This is not a bug but my error in the example. The bug is in
the special characters.

------------------------------------------------------------------------

[2009-09-21 18:07:42] dmulryan at calendarwiz dot com

Similar problem when parsing the following line:

0909211132,1,ØÊááàÑ,äÆæç,CForm,Y,1,1,1,97.95.176.240,2530

which produces empty array elements for fields with special
characters:

Array ( [0] => 0909211132 [1] => 1 [2] => [3] => [4] => URL [5] => Y
[6] => 1 [7] => 1 [8] => 1 [9] => 97.95.176.240 [10] => 2530 )

------------------------------------------------------------------------

[2009-06-26 19:35:22] sjoerd-php at linuxonly dot nl

Could reproduce with php 5.2.10, php 5.2.11-dev (200906261830) and php
5.3rc4. Example code:

<?php
$fp = tmpfile();
$str = "WEIRD#\xD3TICA#BEHAVIOR";
fwrite($fp, $str);
fseek($fp, 0);
$arr = fgetcsv($fp, 100, '#');
var_dump($arr[1]);
fclose($fp);
?>

Expected: string(5) "?TICA"
Actual: string(4) "TICA"
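
For comparison, here is a sketch of the same reproduction that also
feeds the field in UTF-8 form (\xC3\x93 instead of \xD3), which other
comments in this report say is read correctly. The helper name
read_second_field is hypothetical, and php://memory is used instead of
tmpfile() only to keep the sketch self-contained. The escape argument
is passed explicitly, which assumes PHP 5.3 or later.

```php
<?php
// Hypothetical helper: write one line to an in-memory stream and
// return the second '#'-separated field as parsed by fgetcsv().
function read_second_field($line)
{
    $fp = fopen('php://memory', 'w+');
    fwrite($fp, $line);
    rewind($fp);
    $arr = fgetcsv($fp, 100, '#', '"', '\\');
    fclose($fp);
    return $arr[1];
}

// ISO-8859-1 input from the report. What comes back here depends on
// the PHP version and the locale in use.
echo bin2hex(read_second_field("WEIRD#\xD3TICA#BEHAVIOR")), "\n";

// The same text with the O-acute encoded as UTF-8 (bytes C3 93).
echo bin2hex(read_second_field("WEIRD#\xC3\x93TICA#BEHAVIOR")), "\n";
```

Printing the fields through bin2hex() makes it easy to see whether the
leading byte was dropped, independent of the terminal encoding.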

------------------------------------------------------------------------

[2009-06-13 18:10:03] krynble at yahoo dot com dot br

Unfortunately I'm unable to test it because the server is running in a
datacenter.

If someone can give feedback about it, I would appreciate it.

Still, thanks for the help!

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
    http://bugs.php.net/48507
