On 2003/12/13, at 5:09, Ilia Alshanetsky wrote:

> I'm mentioning this now because we are considering changes to the
> function in the development branch, which is a fine time to resolve any
> deficiencies.

Okay, fine :)


> The added functionality, which if I understand correctly is support for
> multibyte delimiters and enclosures, is great. But it hardly explains a

The change was not for multibyte delimiters and enclosures. The current
implementation still allows only single-byte characters for the delimiter
and the enclosure. I could have added such a capability as well, but I
didn't, because it appeared to slow things down considerably.


Several multibyte encodings such as CP932, CP936, CP949, CP950 and
Shift_JIS may map a value in the range 0x40-0xfe to the second byte of a
character, which has been a problem: a plain byte-wise search can mistake
such a trail byte for a delimiter, enclosure or escape character. So we
need to check whether an octet at a given position belongs to a multibyte
character or not, and that is what motivated me to bring a scanner-like
finite-state machine implementation into fgetcsv() (and basename()).
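
To make the failure mode concrete, here is a minimal sketch of such a
scanner-like state machine. This is my illustration, not the actual
fgetcsv() code, and it assumes only the two-byte CP932 case with lead
bytes in 0x81-0x9f and 0xe0-0xfc; a naive byte search for a backslash
(0x5c) would otherwise match inside a character such as katakana "so",
which CP932 encodes as 0x83 0x5c:

#include <stdio.h>
#include <stddef.h>

/* Hypothetical helper, not PHP's code: nonzero if c can start a
 * two-byte CP932 (Shift_JIS) sequence. */
static int sjis_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9f) || (c >= 0xe0 && c <= 0xfc);
}

int main(void)
{
    /* 0x83 0x5c is katakana "so"; the bare 0x5c is a real backslash. */
    const unsigned char buf[] = { 0x83, 0x5c, ',', 'a', 0x5c, '"', 0 };
    enum { LEAD, TRAIL } state = LEAD;
    size_t i;

    for (i = 0; buf[i] != 0; i++) {
        if (state == TRAIL) {
            /* Second byte of a multibyte character: must not be
             * treated as a delimiter, enclosure or escape. */
            printf("offset %zu: 0x%02x is a trail byte, skipped\n",
                   i, buf[i]);
            state = LEAD;
        } else if (sjis_lead(buf[i])) {
            state = TRAIL;  /* the next octet belongs to this char */
        } else {
            printf("offset %zu: 0x%02x is single-byte, safe to match\n",
                   i, buf[i]);
        }
    }
    return 0;
}

Because of the per-octet state, the scan can no longer be one library
call over the whole buffer, which is presumably where much of the
slowdown comes from.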


See http://www.microsoft.com/globaldev/reference/WinCP.mspx for details.

> significant performance disparity I am seeing. I believe much of the
> problem can be solved by moving from manual string iteration to scanning
> with C library functions such as memchr(). When parsing non-multibyte
> text there shouldn't be more than a 10-15% performance loss.
> I should mention that the benchmarks were made using the time utility,
> so the advantages offered by PHP 5's speedups were discounted. Had they
> been taken into account, the speed loss would have been 300% or more.
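
For illustration, here is a sketch of the kind of memchr()-based scan
suggested above. It is a hypothetical simplification, not the actual
patch: it splits on a single-byte delimiter and ignores enclosures,
escapes and multibyte text entirely:

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char line[] = "foo,bar,baz,qux";
    const char *p = line;
    const char *end = line + sizeof(line) - 1;  /* points at the '\0' */
    const char *hit;

    /* memchr() is typically a tuned, word-at-a-time library routine,
     * so on single-byte text it can beat a per-character loop. */
    while ((hit = memchr(p, ',', (size_t)(end - p))) != NULL) {
        printf("field: %.*s\n", (int)(hit - p), p);
        p = hit + 1;  /* resume after the delimiter */
    }
    printf("field: %s\n", p);  /* last field, no trailing delimiter */
    return 0;
}

A real CSV scanner would still need stateful handling once an enclosure
character appears, which is exactly where it collides with the multibyte
problem described above.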

If we limited the support to UTF-8 or EUC encodings only, we'd be able
to gain drastically better performance, since in those encodings every
byte of a multibyte character has the high bit set and can never collide
with an ASCII delimiter. But that wouldn't solve the practical problems
in the environments where the function is actually used.
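
For example (again my illustration, not part of the change), a single
memchr() over UTF-8 text is already character-safe, because an ASCII
delimiter byte can never occur inside a multibyte character:

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* "日本," in UTF-8 is e6 97 a5 e6 9c ac 2c: every byte of the two
     * multibyte characters is >= 0x80, so the 0x2c that memchr() finds
     * is guaranteed to be a genuine comma, not a trail byte. */
    const unsigned char text[] = { 0xe6, 0x97, 0xa5,
                                   0xe6, 0x9c, 0xac, ',', 0 };
    const unsigned char *hit = memchr(text, ',', sizeof(text) - 1);

    printf("delimiter at offset %d\n", (int)(hit - text));  /* 6 */
    return 0;
}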

Moriyoshi



