Edit report at http://bugs.php.net/bug.php?id=48507&edit=1

 ID:               48507
 Comment by:       pahan at hubbitus dot spb dot su
 Reported by:      krynble at yahoo dot com dot br
 Summary:          fgetcsv() ignoring special characters
 Status:           Bogus
 Type:             Bug
 Package:          Filesystem function related
 Operating System: Unix
 PHP Version:      5.*

 New Comment:

> Quote from the docs:
>
> Note: Locale setting is taken into account by this function. If LANG
> is e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this
> function.

OK, documenting the bug as "are read wrong by this function" is better
than nothing.

But do you plan to fix this wrong behaviour?


Previous Comments:
------------------------------------------------------------------------
[2010-05-18 11:03:42] m...@php.net

Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

Quote from the docs:



Note: Locale setting is taken into account by this function. If LANG is
e.g. en_US.UTF-8, files in one-byte encoding are read wrong by this
function.
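
Since the note ties parsing to the locale, one way to experiment is to run
fgetcsv() under a temporarily switched LC_CTYPE. The sketch below is only an
illustration: with_ctype_locale() is a hypothetical helper, not a PHP
function, and it needs try/finally (PHP 5.5 or later). Whether any particular
locale actually avoids the misparse depends on the platform and is not
confirmed anywhere in this report.

```php
<?php
// Hypothetical helper: run $fn with LC_CTYPE switched to $locale,
// then restore the previous setting even if $fn throws.
function with_ctype_locale($locale, $fn)
{
    // Passing "0" queries the current locale without changing it.
    $previous = setlocale(LC_CTYPE, '0');
    if (setlocale(LC_CTYPE, $locale) === false) {
        throw new RuntimeException("Locale not available: $locale");
    }
    try {
        return $fn();
    } finally {
        setlocale(LC_CTYPE, $previous);
    }
}

// Usage (the file name and the 'C' locale are placeholders):
// $rows = with_ctype_locale('C', function () {
//     $fp = fopen('data.csv', 'r');
//     $rows = array();
//     while (($row = fgetcsv($fp, 0, ';', '"', '\\')) !== false) {
//         $rows[] = $row;
//     }
//     fclose($fp);
//     return $rows;
// });
```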

------------------------------------------------------------------------
[2009-12-12 11:40:29] pahan at hubbitus dot spb dot su

Sorry for the duplicate (#50456 is mine), but in it, in addition to the
fgetcsv problem described here, I also suggest fixing fputcsv to allow
[forcing] the enclosure of single-word fields.

Of course, that does *not* solve this problem of incorrect fgetcsv
parsing, because the RFC allows unquoted values (
http://www.faqs.org/rfcs/rfc4180.html , section 2.5 ), but it would make
the fputcsv/fgetcsv pair at least minimally compatible within the PHP
implementation.
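
The forced enclosing suggested here can be approximated in user code. The
following is a minimal sketch, not the proposed fputcsv change itself:
fputcsv_quoted() is a hypothetical name, and it assumes that an enclosure
character inside a field is escaped by doubling it.

```php
<?php
// Hypothetical helper (not part of PHP): write one CSV record with every
// field enclosed, doubling any enclosure character inside a field.
function fputcsv_quoted($fp, array $fields, $delimiter = ',', $enclosure = '"')
{
    $quoted = array();
    foreach ($fields as $field) {
        $escaped = str_replace($enclosure, $enclosure . $enclosure, (string) $field);
        $quoted[] = $enclosure . $escaped . $enclosure;
    }
    // Returns the number of bytes written, or false on failure.
    return fwrite($fp, implode($delimiter, $quoted) . "\n");
}

// Write one record to an in-memory stream and read the raw line back.
$fp = fopen('php://memory', 'w+');
fputcsv_quoted($fp, array('word', 'say "hi"'), ';');
rewind($fp);
$line = stream_get_contents($fp);
fclose($fp);
```

Here even the single word in the first field is written with its enclosure.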

------------------------------------------------------------------------
[2009-12-12 01:33:51] j...@php.net

See also bug #50456

------------------------------------------------------------------------
[2009-09-22 15:09:20] phofstetter at sensational dot ch

Below you'll find a small script which shows how to implement a user
filter that utf8-encodes the data on the fly, so that fgetcsv is happy
and returns correct output even if the first character in a field has
its high bit set and is not valid UTF-8.



Remember: This is a workaround and impacts performance. This is not a
valid fix for the bug.



I haven't yet had time to look deeply into the C implementation of
fgetcsv, but all these calls to php_mblen() feel suspicious to me.



I'll try and have a look into this later today, but for now, I'm just
glad I have this workaround (quickly hacked together - keep that in
mind):



<?php

class utf8encode_filter extends php_user_filter {

  // Heuristic: returns 1 if the data contains at least one well-formed
  // multi-byte UTF-8 sequence, 0 otherwise.
  function is_utf8($string){
      return preg_match('%(?:
          [\xC2-\xDF][\x80-\xBF]             # non-overlong 2-byte
          |\xE0[\xA0-\xBF][\x80-\xBF]        # excluding overlongs
          |[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
          |\xED[\x80-\x9F][\x80-\xBF]        # excluding surrogates
          |\xF0[\x90-\xBF][\x80-\xBF]{2}     # planes 1-3
          |[\xF1-\xF3][\x80-\xBF]{3}         # planes 4-15
          |\xF4[\x80-\x8F][\x80-\xBF]{2}     # plane 16
      )+%xs', $string);
  }

  function filter($in, $out, &$consumed, $closing)
  {
    while ($bucket = stream_bucket_make_writeable($in)) {
      // Treat data without any valid multi-byte sequence as ISO-8859-1
      if (!$this->is_utf8($bucket->data))
          $bucket->data = utf8_encode($bucket->data);
      $consumed += $bucket->datalen;
      stream_bucket_append($out, $bucket);
    }
    return PSFS_PASS_ON;
  }
}

/* Register our filter with PHP */
stream_filter_register("utf8encode", "utf8encode_filter")
    or die("Failed to register filter");

$fp = fopen($_SERVER['argv'][1], "r");

/* Attach the registered filter to the stream just opened */
stream_filter_prepend($fp, "utf8encode");

/* fgetcsv() returns false at end of file, so compare explicitly */
while (($data = fgetcsv($fp, 0, ';', '"')) !== false)
    print_r($data);

fclose($fp);
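
The is_utf8() check in the script only tests whether a bucket contains at
least one well-formed multi-byte sequence. As a possible variation, the whole
string can be validated with PCRE's "u" modifier, under which preg_match()
fails on a malformed UTF-8 subject. This is a sketch only: is_valid_utf8() is
a hypothetical name, and a stream bucket may still end in the middle of a
multi-byte sequence, which neither check handles.

```php
<?php
// Hypothetical helper: true only if $string is entirely well-formed UTF-8.
// With the "u" modifier, preg_match() does not return 1 for a malformed
// subject, so the empty pattern acts as a validity test.
function is_valid_utf8($string)
{
    return preg_match('//u', $string) === 1;
}

var_dump(is_valid_utf8("caf\xC3\xA9")); // well-formed UTF-8
var_dump(is_valid_utf8("caf\xE9"));     // ISO-8859-1 byte, not valid UTF-8
var_dump(is_valid_utf8("plain ascii")); // ASCII is also valid UTF-8
```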

------------------------------------------------------------------------
[2009-09-22 14:45:22] phofstetter at sensational dot ch

I was looking into this (after having been bitten by it) and I can add
another tidbit that might help tracking this down:



The bug doesn't happen if the file fgetcsv() is reading is encoded in
UTF-8.



I have created a test file in ISO-8859-1 and then used
file_put_contents(utf8_encode(file_get_contents())) to create the
UTF-8 version of it (explaining this here because I'm not sure whether
this would write a BOM or not - probably not though).



That version could be read correctly.
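
The one-off conversion described above could look roughly like this. It is a
sketch under assumptions: latin1_file_to_utf8() is a hypothetical name, the
file is assumed to fit in memory, and iconv() is used for the conversion,
which is a substitution made for this example and requires the iconv
extension. The demo also checks for a BOM in the result.

```php
<?php
// Hypothetical helper: re-encode a whole ISO-8859-1 file as UTF-8.
// Assumes the iconv extension is available and the file fits in memory.
function latin1_file_to_utf8($source, $target)
{
    $raw = file_get_contents($source);
    if ($raw === false) {
        return false;
    }
    $utf8 = iconv('ISO-8859-1', 'UTF-8', $raw);
    if ($utf8 === false) {
        return false;
    }
    return file_put_contents($target, $utf8) !== false;
}

// Demo with generated temporary files.
$src = tempnam(sys_get_temp_dir(), 'csv');
$dst = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($src, "\xE9t\xE9;caf\xE9\n"); // ISO-8859-1 sample bytes
$ok = latin1_file_to_utf8($src, $dst);
$converted = file_get_contents($dst);
// True only if the output starts with the UTF-8 byte order mark.
$has_bom = strncmp($converted, "\xEF\xBB\xBF", 3) === 0;
unlink($src);
unlink($dst);
```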



I'm now writing a stream filter that does the UTF-8 conversion on the
fly to hook that in between the file and fgetcsv() - while I would lose
a bit of performance, in my case, this is the cleanest workaround.
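
As an alternative to a hand-written filter, PHP's built-in convert.iconv.*
stream filters can do the same conversion on the fly. The sketch below assumes
the iconv extension is loaded and that the input really is ISO-8859-1; the
sample data and temporary file are made up for the demonstration.

```php
<?php
// Assumes the iconv extension is loaded; it provides convert.iconv.* filters.
$path = tempnam(sys_get_temp_dir(), 'csv');
file_put_contents($path, "caf\xE9;\"na\xEFve\"\n"); // ISO-8859-1 sample data

$fp = fopen($path, 'r');
// Convert ISO-8859-1 to UTF-8 on the read side of the stream.
$filter = stream_filter_append($fp, 'convert.iconv.ISO-8859-1/UTF-8', STREAM_FILTER_READ);
if ($filter === false) {
    die("convert.iconv filter not available");
}
// Everything read from $fp from now on is already UTF-8.
$decoded = stream_get_contents($fp);
fclose($fp);
unlink($path);
```

In the real use case the filtered $fp would be handed to fgetcsv() instead of
stream_get_contents().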

------------------------------------------------------------------------


The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at

    http://bugs.php.net/bug.php?id=48507

