> I mention this now because we are considering changes to the function in
> the development branch, which is a fine time to resolve any deficiencies.
Okay, fine :)
> The added functionality, which if I understand correctly is support for
> multibyte delimiters and enclosures, is great.

The change was not for multibyte delimiters and enclosures. The current
implementation still allows only single-byte characters for the delimiter
and enclosure. I was able to add such a capability as well, but I didn't,
because it appeared to slow things down considerably.

Several multibyte encodings, such as CP932, CP936, CP949, CP950
and Shift_JIS, may map a value in the range 0x40 - 0xfe to the second byte
of a character, which had been a problem: we need to check whether the
octet at a given position belongs to a multibyte character or not, and this
fact motivated me to bring a scanner-like finite-state machine
implementation into fgetcsv() (and basename()).

See http://www.microsoft.com/globaldev/reference/WinCP.mspx for details.

> But it hardly explains the significant performance disparity I am seeing.
> I believe much of the problem can be solved by moving from manual string
> iteration to one using C library functions such as memchr(). When parsing
> non-multibyte text there shouldn't be more than a 10-15% performance loss.
> I should mention that the benchmarks were made using the time utility, so
> the advantages offered by PHP 5's speedups were discounted. Had they been
> considered, the speed loss would've been 300% or more.
If we limited the support to UTF-8 or EUC encodings only, we'd be able to
gain much better performance. But that wouldn't actually solve the
practical problems where this feature is needed.
Moriyoshi
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php