Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-15 Thread Stig S. Bakken
On Fri, 2003-12-12 at 23:28, Ilia Alshanetsky wrote:
 On December 12, 2003 04:18 pm, Moriyoshi Koizumi wrote:
  I disagree, because of the following reasons:
 
  1) Not a few people *actually* use fgetcsv() commonly
  with multibyte characters indeed. Regarding this,
  applications made by those who don't use
  such characters don't (and won't) use multibyte specific
  functions and that's the problem. This greatly prevents
  them from being portable.
 
 People have lived without multibyte support in fgetcsv() for many years now, 
 and I did not see a single request on bugs.php.net for fgetcsv() multi-byte 
 support. So, while this is certainly useful functionality I do not believe it 
 is as widely needed as you say it is. We also have a multibyte extension that 
 already implements multi-byte safe variants of common functions, why make 
 exception for fgetcsv() and add multibyte code into core?

Just an observation: it seems that the PHP users who need multibyte
support are generally self-supplied by default.  It's often hard to
convince programmers to change their code as fundamentally as you often
need to do to support not just UTF-8 but the whole range of CJK
charsets, it adds complexity and can slow things down.  These users are
used to maintaining their own patches for all kinds of software.  The
process of merging in multibyte character features often takes several
years.

Because of this (if my observation is correct), you can't really tell
for example how many Japanese users are having issues with fgetcsv() by
counting requests on bugs.php.net.

I agree with Moriyoshi Koizumi that performance is not necessarily the
primary factor here.  IMHO performance is important, but generality and
realibility is more so.

With all due respect to everyone, I think that we should be a bit more
welcoming to people who offer help in making PHP a better language for
CJK websites.  There's still a huge amount of marketshare waiting for
PHP in Asia. :-)

 - Stig

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-15 Thread Stig S. Bakken
On Sun, 2003-12-14 at 00:28, Ilia Alshanetsky wrote:
 On December 13, 2003 05:52 pm, Moriyoshi Koizumi wrote:
  I haven't denied it. That said, multibyte facility is not so fancy
  as XML, but quite essential so as to enable most applications to work
  well under every environment.
 
 Bullshit. Only application that need to support multibyte strings need the 
 multibyte facility.
 
  Let's stop doing such a stupid thing any more. As I pointed out already,
  having different versions for each function doesn't solve problems at
  all.
 
 It sure does, those who need to slower (multibyte) version use that and those 
 who don't use the standard version which works nice and fast for 
 non-multibyte strings.

So you think the right solution is to dismiss multibyte users and direct
them to the hacks (mbstring etc) that have been used previously instead
of thinking ahead?

If I were starting a language from scratch today, I would make character
encoding part of the string zval structure.  IMHO that's where it
belongs.  As an alternative for PHP 5[.1], there is room for a
multibyte bit in the zval that various functions can use to choose
between sizeof(byte)==sizeof(char) and sizeof(byte)  sizeof(char)
implementations.

 - Stig

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-15 Thread Ilia Alshanetsky
On December 15, 2003 05:37 am, Stig S. Bakken wrote:
 So you think the right solution is to dismiss multibyte users and direct
 them to the hacks (mbstring etc) that have been used previously instead
 of thinking ahead?

IMHO calling multibyte a hack would be great disservice to the developers of 
that extension. We don't call ext/pgsql a hack, simply because it's not 
builtin, do we?

 If I were starting a language from scratch today, I would make character
 encoding part of the string zval structure.  IMHO that's where it
 belongs.  As an alternative for PHP 5[.1], there is room for a
 multibyte bit in the zval that various functions can use to choose
 between sizeof(byte)==sizeof(char) and sizeof(byte)  sizeof(char)
 implementations.

If you were designing a new language you wouldn't have legacy users who'd 
suffer (significantly) because of features added for other users. 

Ilia

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-15 Thread Moriyoshi Koizumi
On 2003/12/16, at 0:32, Ilia Alshanetsky wrote:

On December 15, 2003 05:37 am, Stig S. Bakken wrote:
So you think the right solution is to dismiss multibyte users and 
direct
them to the hacks (mbstring etc) that have been used previously 
instead
of thinking ahead?
IMHO calling multibyte a hack would be great disservice to the 
developers of
that extension. We don't call ext/pgsql a hack, simply because it's not
builtin, do we?
The extension is virtually a hack. Again, the developers of mbstring
had to choose the option, adding support for multiple encodings to PHP
by separating it as an extension, instead of integrating it into the
core implementation because we always ought to manage backwards
compatibilities.
If I were starting a language from scratch today, I would make 
character
encoding part of the string zval structure.  IMHO that's where it
belongs.  As an alternative for PHP 5[.1], there is room for a
multibyte bit in the zval that various functions can use to choose
between sizeof(byte)==sizeof(char) and sizeof(byte)  sizeof(char)
implementations.
If you were designing a new language you wouldn't have legacy users 
who'd
suffer (significantly) because of features added for other users.
Well, the legacy users of PHP4 will significantly suffer for
PHP5's new features.
Moriyoshi

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php


Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-15 Thread Derick Rethans
On Tue, 16 Dec 2003, Moriyoshi Koizumi wrote:

  If you were designing a new language you wouldn't have legacy users
  who'd suffer (significantly) because of features added for other
  users.

 Well, the legacy users of PHP4 will significantly suffer for
 PHP5's new features.

Uh? Where does this wisdom comes form?

Derick

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-15 Thread Moriyoshi Koizumi
On 2003/12/16, at 0:42, Derick Rethans wrote:

On Tue, 16 Dec 2003, Moriyoshi Koizumi wrote:

If you were designing a new language you wouldn't have legacy users
who'd suffer (significantly) because of features added for other
users.
Well, the legacy users of PHP4 will significantly suffer for
PHP5's new features.
Uh? Where does this wisdom comes form?
README.PHP4-TO-PHP5-THIN-CHANGES and anything I spotted in
the current code.
Moriyoshi

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php


Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-15 Thread Derick Rethans
On Tue, 16 Dec 2003, Moriyoshi Koizumi wrote:


 On 2003/12/16, at 0:42, Derick Rethans wrote:

  On Tue, 16 Dec 2003, Moriyoshi Koizumi wrote:
 
  If you were designing a new language you wouldn't have legacy users
  who'd suffer (significantly) because of features added for other
  users.
 
  Well, the legacy users of PHP4 will significantly suffer for
  PHP5's new features.
 
  Uh? Where does this wisdom comes form?

 README.PHP4-TO-PHP5-THIN-CHANGES and anything I spotted in
 the current code.

I know there are changes, I was inquiring about the significantly in
your statement.

Derick

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-15 Thread Ilia Alshanetsky
On December 15, 2003 10:36 am, Moriyoshi Koizumi wrote:
 Well, the legacy users of PHP4 will significantly suffer for
 PHP5's new features.

How so? PHP 5 does break BC (especially for objects) but this is something 
that was talked about for years and the consensus is/was that the change is 
for the better. To my knowledge, majority of the new features in PHP5 are 
just that and have no side effects.

Another alternative to moving fgetcsv()'s current implementation would be to 
add #ifdef HAVE_MBSTRING around php_mblen, which would do 
#define php_mblen(ptr, len) 1
if mbstring is not enabled.

Personally, I'd prefer to have the function in mbstring and modified further 
to support multibyte delimiters and enclosures, which it does not do now due 
to performance considerations.

Ilia

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-15 Thread Derek Ford
Stig S. Bakken wrote:

On Sun, 2003-12-14 at 00:28, Ilia Alshanetsky wrote:
 

On December 13, 2003 05:52 pm, Moriyoshi Koizumi wrote:
   

I haven't denied it. That said, multibyte facility is not so fancy
as XML, but quite essential so as to enable most applications to work
well under every environment.
 

Bullshit. Only application that need to support multibyte strings need the 
multibyte facility.

   

Let's stop doing such a stupid thing any more. As I pointed out already,
having different versions for each function doesn't solve problems at
all.
 

It sure does, those who need to slower (multibyte) version use that and those 
who don't use the standard version which works nice and fast for 
non-multibyte strings.
   

So you think the right solution is to dismiss multibyte users and direct
them to the hacks (mbstring etc) that have been used previously instead
of thinking ahead?
 

I see no example of him implying he wanted to dismiss multibyte users, 
he simply suggested mb_* versions of the string manipulation functions 
and pointed available facilities that people can use already. I support 
that idea, as having a mb_ version and a version without multibyte 
support gives everyone what they want. People who want multibyte strings 
have it, and people who want speed without multibyte strings still have 
that; everyone should be happy. Those who don't need multibyte strings 
(the majority, by a long shot) don't have to suffer any performance 
loss, while those in Asia can open that marketshare you speak of.

If I were starting a language from scratch today, I would make character
encoding part of the string zval structure.  IMHO that's where it
belongs.  As an alternative for PHP 5[.1], there is room for a
multibyte bit in the zval that various functions can use to choose
between sizeof(byte)==sizeof(char) and sizeof(byte)  sizeof(char)
implementations.
- Stig

 

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php


Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-15 Thread Ilia Alshanetsky
On December 15, 2003 06:36 pm, Rasmus Lerdorf wrote:
 As Stig says, the correct solution would be to always store the encoding
 of the string right alongside the length of the string in the guts of PHP.
 Anything short of that is going to be a hack.  PHP6 here we come...

Then here is our first TODO for PHP6 :). But until then, please, let's try to 
avoid adding hacks that implement partial mb support in select functions. 
These hacks serve no one, many users loose scalability and other users get 
partial support that will likely prevent them from using the new 
functionality.

Ilia

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-13 Thread Derick Rethans
On Fri, 12 Dec 2003, Rasmus Lerdorf wrote:

 We need to move towards a uniform platform that works for everyone without
 putting undue strain on either side.

Sure we do, but not at a 200-250% performance loss.

Derick

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-13 Thread Derick Rethans
On Fri, 12 Dec 2003, Ilia Alshanetsky wrote:

 On December 12, 2003 08:54 pm, Moriyoshi Koizumi wrote:
  And overloading
  cannot be turned on in scripts, this prevents us from writing portable
  scripts.

 Not entirely true, while you cannot enable it from with a script you can
 enable it for a particular directory via .htaccess or equivalent.

Actually, this going to open a whole can of worms... I would call that a
hack if it's done (though it might be a cool one :).

Derick

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-13 Thread Jan Schneider
Zitat von Moriyoshi Koizumi [EMAIL PROTECTED]:

  The cool thing that mbstring provides is transparent overloading of
  some
  of the common string manipulation functions.  This means that at least
  for
  a subset of applications, even though they may not have been written
  with
  multibyte support in mind, they may in fact work perfectly in a
  multibyte
  environment with mbstring enabled and overloading turned on.

 Overloading is evil, because functions like substr() are often
 used to splice a certain length of octets byte-wise while mb_substr()
 treats the sequence of octets on a character-basis. And overloading
 cannot be turned on in scripts, this prevents us from writing portable
 scripts. There're virtually no cleaner way to do the tasks elegantly.

I have to agree. While in the past it helped mb users to turn on overloading
if they wanted to use our framework, it will now break it. This is because
we now explicitely use the str*() function for byte-wise string
manipulation and their mb_*() equivalents for character-wise manipulation.
This is the only way to predict the results, the magic that is done by
overloading or transparent charset conversion is not suitable for real
production environments.

Jan.

--
http://www.horde.org - The Horde Project
http://www.ammma.de - discover your knowledge
http://www.tip4all.de - Deine private Tippgemeinschaft

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-13 Thread Jan Schneider
Zitat von Derick Rethans [EMAIL PROTECTED]:

 On Sat, 13 Dec 2003, Moriyoshi Koizumi wrote:

  Overloading is evil, because functions like substr() are often
  used to splice a certain length of octets byte-wise while mb_substr()
  treats the sequence of octets on a character-basis. And overloading
  cannot be turned on in scripts, this prevents us from writing portable
  scripts. There're virtually no cleaner way to do the tasks elegantly.

 I also think overloading is evil, w've seen before what the problems can
 be because of difference between an overloaded and a non-overloaded
 version. It is however perfectly possible to tune this behavior in the
 php.ini file; not sure if we want that though.

I guess it depends on your pov. If you want to write portable scripts,
relying on overloading or even special php.ini tuning is a nightmare.

Jan.

--
http://www.horde.org - The Horde Project
http://www.ammma.de - discover your knowledge
http://www.tip4all.de - Deine private Tippgemeinschaft

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-13 Thread Rasmus Lerdorf
On Sat, 13 Dec 2003, Jan Schneider wrote:
 I have to agree. While in the past it helped mb users to turn on overloading
 if they wanted to use our framework, it will now break it. This is because
 we now explicitely use the str*() function for byte-wise string
 manipulation and their mb_*() equivalents for character-wise manipulation.
 This is the only way to predict the results, the magic that is done by
 overloading or transparent charset conversion is not suitable for real
 production environments.

Using str*() functions for octet manipulation is fundamentally wrong.  
str*() functions by definition work on character boundaries.  If we need 
to operate on byte boundaries we need to introduce a set of mem*() 
functions.

-Rasmus

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-13 Thread Jan Schneider
Zitat von Rasmus Lerdorf [EMAIL PROTECTED]:

 On Sat, 13 Dec 2003, Jan Schneider wrote:
  I have to agree. While in the past it helped mb users to turn on
 overloading
  if they wanted to use our framework, it will now break it. This is
 because
  we now explicitely use the str*() function for byte-wise string
  manipulation and their mb_*() equivalents for character-wise
 manipulation.
  This is the only way to predict the results, the magic that is done by
  overloading or transparent charset conversion is not suitable for real
  production environments.

 Using str*() functions for octet manipulation is fundamentally wrong.
 str*() functions by definition work on character boundaries.  If we need
 to operate on byte boundaries we need to introduce a set of mem*()
 functions.

Maybe. Due to PHP lacking byte stream functions, working with str* is the
only solution atm.

Jan.

--
http://www.horde.org - The Horde Project
http://www.ammma.de - discover your knowledge
http://www.tip4all.de - Deine private Tippgemeinschaft

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-13 Thread Jan Schneider
Zitat von Rasmus Lerdorf [EMAIL PROTECTED]:

 On Sat, 13 Dec 2003, Jan Schneider wrote:
  Maybe. Due to PHP lacking byte stream functions, working with str* is
 the
  only solution atm.

 And my contention is that there is no way to do this right now.  If you
 rely on a str*() function to do this your application is broken since you
 cannot reasonably expect a character to always be 8 bits wide.

With the current implemention and assuming that mbstring overloading is
turned off, I can. This not documentated, but I'd still consider a change
of this behaviour an huge bc break.

Jan.

--
http://www.horde.org - The Horde Project
http://www.ammma.de - discover your knowledge
http://www.tip4all.de - Deine private Tippgemeinschaft

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-13 Thread Moriyoshi Koizumi
On 2003/12/14, at 1:07, Rasmus Lerdorf wrote:

On Sat, 13 Dec 2003, Jan Schneider wrote:
I have to agree. While in the past it helped mb users to turn on 
overloading
if they wanted to use our framework, it will now break it. This is 
because
we now explicitely use the str*() function for byte-wise string
manipulation and their mb_*() equivalents for character-wise 
manipulation.
This is the only way to predict the results, the magic that is done by
overloading or transparent charset conversion is not suitable for real
production environments.
Using str*() functions for octet manipulation is fundamentally wrong.
str*() functions by definition work on character boundaries.  If we 
need
to operate on byte boundaries we need to introduce a set of mem*()
functions.
I think single-byte users are prone to have a general assumption
(indeed a superstition) that strlen() returns the number of octets
and substr() cuts a portion of a string in a specified range of
bytes, regardless of your conception of str*(). This sort of
tendency applies to other programming languages.
Moriyoshi

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php


Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-13 Thread Moriyoshi Koizumi
On 2003/12/13, at 9:36, Ilia Alshanetsky wrote:

There is a good chance you are correct. However my assumption is not 
without
bases, please consider the following statistic:

Google finds 185,000 (or so) phpinfo() pages, when mbstring is added 
to the
search query only 8150 pages are found. That leads me to believe that 
1/2% of
that userbase uses mbstring. Even if we were to say that of all the 
people
who have mbstring compiled use it (which is highly unlikely) it's 
still only
1/2%.
As a sidenote, this unrealistic statistics appear to be quite unreal.

phpinfo() = 186,000 (pages) [1]
phpinfo() mbstring = 8,330
phpinfo() Server API Configure Command = 16,800
phpinfo() Server API Configure Command mbstring = 4,510
[1] includes the number of pages that are not of actual phpinfo() and
merely contain the keyword phpinfo.
Moriyoshi

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php


Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-13 Thread Ilia Alshanetsky
On December 13, 2003 03:27 pm, Moriyoshi Koizumi wrote:
 As a sidenote, this unrealistic statistics appear to be quite unreal.

 phpinfo() = 186,000 (pages) [1]
 phpinfo() mbstring = 8,330
 phpinfo() Server API Configure Command = 16,800
 phpinfo() Server API Configure Command mbstring = 4,510

Even so this still represents a MINORITY.

Ilia

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-13 Thread Moriyoshi Koizumi
On 2003/12/14, at 5:55, Ilia Alshanetsky wrote:

On December 13, 2003 03:27 pm, Moriyoshi Koizumi wrote:
As a sidenote, this unrealistic statistics appear to be quite unreal.

phpinfo() = 186,000 (pages) [1]
phpinfo() mbstring = 8,330
phpinfo() Server API Configure Command = 16,800
phpinfo() Server API Configure Command mbstring = 4,510
Even so this still represents a MINORITY.
Could a quarter be a minority? And the fact is there are lots of
people who don't know what is the problem when they've got to
use multibyte strings. They don't realise why they should not
use UTF-8 with their mysql server set to handle ISO-8859-1
for instance.
Moriyoshi

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php


Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-13 Thread Ilia Alshanetsky
On December 13, 2003 03:53 pm, Moriyoshi Koizumi wrote:
 Could a quarter be a minority?

Unless the rules of mathematics had changed 25% is still a minority. You also 
forget that there are plenty of people who compile extensions and never end 
up using them.

The critical point of this entire discussion is about NOT forcing choices on 
people who do not want/need them. There is no good reason to force multibyte 
version of fgetcsv() on every single user, when there are not one but two PHP 
extensions designed explicitly for multibyte support. If fgetcsv() in PHP 5 
cannot be designed in such a way as to have no significant performance 
penalties for non-multibyte strings the function should be introduced as 
mb_fgetcsv() or iconv_fgetcsv().

Ilia

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-13 Thread Moriyoshi Koizumi
On 2003/12/14, at 6:19, Ilia Alshanetsky wrote:

On December 13, 2003 03:53 pm, Moriyoshi Koizumi wrote:
Could a quarter be a minority?
Unless the rules of mathematics had changed 25% is still a minority. 
You also
forget that there are plenty of people who compile extensions and 
never end
up using them.
A similar fact applies to your assumption. It's also true that PHP can
handle multibyte strings without mbstring or iconv in some cases where
the users are just fortunate enough to not get in trouble, most likely
because they just don't use such multibyte characters that are known
to cause problems due to its structure. Those user question [1][2] 
exactly
describes when it goes wrong.

[1] http://marc.theaimsgroup.com/?l=php-devm=103828989330521w=2
[2] http://news.php.net/article.php?group=php.i18narticle=633
The critical point of this entire discussion is about NOT forcing 
choices on
people who do not want/need them. There is no good reason to force 
multibyte
version of fgetcsv() on every single user, when there are not one but 
two PHP
extensions designed explicitly for multibyte support.
On the other hand, the chances are very limited to users familiar
to multibyte. First of all, flexibility on the configuration has been
causing lots of confusion. I'd be happy if every existing application
used mb_*() instead of their counterpart at approproate places, but
it's unlikely. This is because we have two versions of string 
manipulation
functions. And again, it's prevented users to write multibyte safe
applications because multibyte-flavor extensions are currently not 
enabled
by default though this fact is not my point.

If fgetcsv() in PHP 5 cannot be designed in such a way as to have no
significant performance penalties for non-multibyte strings the 
function
should be introduced as mb_fgetcsv() or iconv_fgetcsv().
So, why not begin thinking of how it could be bearably fast
even with multibyte support enabled? While I think the current stuff
I made is the best portable and the fastest code, it's probable
that there are a far better code.
Moriyoshi





--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php


Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-13 Thread Moriyoshi Koizumi
On 2003/12/14, at 6:46, Moriyoshi Koizumi wrote:

I made is the best portable and the fastest code, it's probable
that there are a far better code.
s/there are/there'd be/

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php


Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-13 Thread Moriyoshi Koizumi
On 2003/12/14, at 7:33, Ilia Alshanetsky wrote:

Percentages aside you cannot deny the fact that not every application 
needs
multibyte support (whether this is a majority or 50/50 does not 
matter).
If a user needs to use multibyte they may need to do a little 
searching to
find a provider that supports it, fortunately for them PHP is a very 
common
scripting language with many hosting providers.
I haven't denied it. That said, multibyte facility is not so fancy
as XML, but quite essential so as to enable most applications to work
well under every environment.
So, why not begin thinking of how it could be bearably fast
even with multibyte support enabled? While I think the current stuff
I made is the best portable and the fastest code, it's probable
that there are a far better code.
If your code as indeed as fast as it can be then the only alternative 
it would
seem is to seperate the function into 'normal' and 'multibyte' variants
allowing the user and not the developer to choose the one most suited 
to
their needs.
Let's stop doing such a stupid thing any more. As I pointed out already,
having different versions for each function doesn't solve problems at 
all.

Moriyoshi

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php


Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-13 Thread Ilia Alshanetsky
On December 13, 2003 05:52 pm, Moriyoshi Koizumi wrote:
 I haven't denied it. That said, multibyte facility is not so fancy
 as XML, but quite essential so as to enable most applications to work
 well under every environment.

Bullshit. Only application that need to support multibyte strings need the 
multibyte facility.

 Let's stop doing such a stupid thing any more. As I pointed out already,
 having different versions for each function doesn't solve problems at
 all.

It sure does, those who need to slower (multibyte) version use that and those 
who don't use the standard version which works nice and fast for 
non-multibyte strings.

Ilia

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Rasmus Lerdorf
On Fri, 12 Dec 2003, Ilia Alshanetsky wrote:
 That said, the whole space trimming behavior seems a little unusual since it 
 will corrupt content especially if said content contains binary data. IMHO 
 the data read by fgetcsv() should be fetched in such a manner so that the 
 original string can be recreated.

I agree that it would be a good idea to provide a mechanism to do that, 
but at this point I don't think we should be changing the behaviour of 
fgetcsv() in neither the stable branch nor the HEAD branch.  I'd add a new 
binary-safe version of the function instead for this.  Or an optional arg, 
but fgetcsv() already has 2 optional args.

-Rasmus

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Moriyoshi Koizumi
On 2003/12/13, at 4:02, Rasmus Lerdorf wrote:

On Fri, 12 Dec 2003, Ilia Alshanetsky wrote:
That said, the whole space trimming behavior seems a little unusual 
since it
will corrupt content especially if said content contains binary data. 
IMHO
the data read by fgetcsv() should be fetched in such a manner so that 
the
original string can be recreated.
I agree that it would be a good idea to provide a mechanism to do that,
but at this point I don't think we should be changing the behaviour of
fgetcsv() in neither the stable branch nor the HEAD branch.  I'd add a 
new
binary-safe version of the function instead for this.  Or an optional 
arg,
but fgetcsv() already has 2 optional args.
My opinion is basically the same as Ilia's. And I think it'd also be
a good idea to introduce a few more option to modify the escaping
behaviour. Escape characters like \ are treated specially
at the moment, while the de facto specification, of Microsoft,
adopts dubbed-quotes style instead of it. IMO we should be
able to choose the behaviour.
Then, where do we go from here?  Eventually we'll need to change the
spec of its arguments, or add a new function as Rasmus said...
Moriyoshi

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php


Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Ilia Alshanetsky
On December 12, 2003 02:02 pm, Rasmus Lerdorf wrote:
 I agree that it would be a good idea to provide a mechanism to do that,
 but at this point I don't think we should be changing the behaviour of
 fgetcsv() in neither the stable branch nor the HEAD branch.  I'd add a new
 binary-safe version of the function instead for this.  Or an optional arg,
 but fgetcsv() already has 2 optional args.

I think we could add another optional argument (bitmask) that could be used to 
control various capabilities of fgetcsv(). So, if another tuneable behavior 
is necessary it could be easily added without breaking BC.

On a related note I should mention that fgetcsv() in 4.3.X is currently 2.5 
times faster then it's equivalent in 5.X.

Ilia

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Ilia Alshanetsky
On December 12, 2003 02:40 pm, Moriyoshi Koizumi wrote:
 I don't know why you're mentioning this at this time,
 but I can say it is a sort of necessary evil :) Because the HEAD
 version is capable of handling various encodings, and
 less intricate IMO. Rather, I was surprised about that result,
 it's only 2.5 times slower :)

I mentioning this now because we are considering changes to the function in 
the development branch, which is a fine time to resolve any deficiencies.

The added functionality, which if I understand correctly is support for 
multibyte delimeters and enclosures is great. But it hardly explains a 
significant performance disparity I am seeing. I believe much of the problem 
can be solved by moving from manual string iteration to one using C library 
functions such as memchr(). When parsing non-multibyte text there shouldn't 
be more then 10-15% performance loss. 
I should mention that benchmarks were made using time utility, so advantages 
offered by PHP 5's speedups were discounted. Had they been considered the 
speed loss would've been 300% or more.

Ilia

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Moriyoshi Koizumi
On 2003/12/13, at 5:51, Ilia Alshanetsky wrote:

How about we add mb_fgetcsv(), which would have full multi-byte support
(including delimeters). I'd imagine for people who need to parse 
multi-byte
csv files, full functionality is more important then speed. As for the
fgetcsv() in ext/standard/, we can port the 4.3.X code (copy  paste 
really)
and let PHP 5 users benefit from a faster fgetcsv() for common 
applications.
What do you think?
I disagree, because of the following reasons:

1) Not a few people *actually* use fgetcsv() commonly
   with multibyte characters indeed. Regarding this,
   applications made by those who don't use
   such characters don't (and won't) use multibyte specific
   functions and that's the problem. This greatly prevents
   them from being portable.
2) IMO speed is not a key factor here. People rather wants
   trust-worthy behaviour.
3) fgetcsv() implementation in the stable branch is
   now too complicated to add a new feature to
   and also hard to maintain. We should be able to
   eliminate the mblen() calls for acceptable performance.
   See the attached result.
Moriyoshi

p.s. fgetcsv() in the stable branch still seems to segfault with
 the attached test case (segfault.php.txt).
[The benchmark result]
My code with mblen() (on php5-csv):
real0m1.389s
user0m1.330s
sys 0m0.060s
Ditto without mblen():

real0m0.396s
user0m0.350s
sys 0m0.040s
Your code (on php4-csv):

real0m0.332s
user0m0.270s
sys 0m0.060s
?php
$file = '/tmp/test.csv';
$fp = fopen($file, 'w');
fwrite($fp, , 
bb, 
,d\n);
fclose($fp);
for ($i = 0; $i  4000; $i++) {
$fp = fopen($file, 'r');
fseek($fp, SEEK_SET, 0);
fgetcsv($fp, filesize($file));
fclose($fp);
}
?
Index: ext/standard/php_string.h
===
RCS file: /repository/php-src/ext/standard/php_string.h,v
retrieving revision 1.83
diff -u -r1.83 php_string.h
--- ext/standard/php_string.h   10 Dec 2003 21:23:35 -  1.83
+++ ext/standard/php_string.h   12 Dec 2003 21:16:09 -
@@ -144,15 +144,7 @@
 #define strerror php_strerror
 #endif
 
-#ifndef HAVE_MBLEN
-# define php_mblen(ptr, len) 1
-#else
-# if defined(_REENTRANT)  defined(HAVE_MBRLEN)  defined(HAVE_MBSTATE_T)
-#  define php_mblen(ptr, len) ((ptr) == NULL ? mbsinit(BG(mblen_state)): 
(int)mbrlen(ptr, len, BG(mblen_state)))
-# else
-#  define php_mblen(ptr, len) mblen(ptr, len)
-# endif
-#endif
+#define php_mblen(ptr, len) 1
 
 void register_string_constants(INIT_FUNC_ARGS);
 
?php
$file = '/tmp/test.csv';
$fp = fopen($file, 'w');
fwrite($fp, , 
bb, 
,
 
\d\\\\\n);
fclose($fp);
$fp = fopen($file, 'r');
fseek($fp, SEEK_SET, 0);
fgetcsv($fp, filesize($file));
fclose($fp);
?


-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php

Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Moriyoshi Koizumi
On 2003/12/13, at 5:45, Derick Rethans wrote:

I would call that rather unacceptable actually. Isn't it possible 
create
a new function for this which handles this MB 'crap' (and the same for
basename) so that we don't have to lose performance because of those
issues?
Don't you think crap sounds too disgusting and inappropriate?
Stop such wording here.
Moriyoshi

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php


Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Ilia Alshanetsky
On December 12, 2003 04:18 pm, Moriyoshi Koizumi wrote:
 I disagree, because of the following reasons:

 1) Not a few people *actually* use fgetcsv() commonly
 with multibyte characters indeed. Regarding this,
 applications made by those who don't use
 such characters don't (and won't) use multibyte specific
 functions and that's the problem. This greatly prevents
 them from being portable.

People have lived without multibyte support in fgetcsv() for many years now, 
and I did not see a single request on bugs.php.net for fgetcsv() multi-byte 
support. So, while this is certainly useful functionality I do not believe it 
is as widely needed as you say it is. We also have a multibyte extension that 
already implements multi-byte safe variants of common functions, why make 
exception for fgetcsv() and add multibyte code into core?

 2) IMO speed is not a key factor here. People rather wants
 trust-worthy behaviour.

When it's a few percent and the changes offer significant improvements yes, 
but when were are talking about a performance loss of 250-300% or more then 
performance must become a consideration as well.

 3) fgetcsv() implementation in the stable branch is
 now too complicated to add a new feature to
 and also hard to maintain. We should be able to
 eliminate the mblen() calls for acceptable performance.
 See the attached result.

What features are we talking about here? The only 2 features I can see we may 
wish to add are 1 char long enclosures and separators and the binary thing. 
Both of these features would be fairly trivial to add. 

 p.s. fgetcsv() in the stable branch still seems to segfault with
   the attached test case (segfault.php.txt).

Writing a fix now, thanks for the heads-up. If you have any more please let me 
know.

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Moriyoshi Koizumi
On 2003/12/13, at 7:28, Ilia Alshanetsky wrote:

On December 12, 2003 04:18 pm, Moriyoshi Koizumi wrote:
I disagree, because of the following reasons:

1) Not a few people *actually* use fgetcsv() commonly
with multibyte characters indeed. Regarding this,
applications made by those who don't use
such characters don't (and won't) use multibyte specific
functions and that's the problem. This greatly prevents
them from being portable.
People have lived without multibyte support in fgetcsv() for many 
years now,
and I did not see a single request on bugs.php.net for fgetcsv() 
multi-byte
support. So, while this is certainly useful functionality I do not 
believe it
is as widely needed as you say it is. We also have a multibyte 
extension that
already implements multi-byte safe variants of common functions, why 
make
exception for fgetcsv() and add multibyte code into core?
I admit that few requests are in sight, but it's the case that
those who really want that support don't advocate it in English.
And I don't think fgetcsv() is an exception, since htmlentities() can
be referred to as an example that is placed in core and
supports multibyte strings. As I mentioned, purging that kind of
functionality into the mbstring extension doesn't solve the problem
in practice by any means.
2) IMO speed is not a key factor here. People rather wants
trust-worthy behaviour.
When it's a few percent and the changes offer significant improvements 
yes,
but when were are talking about a performance loss of 250-300% or more 
then
performance must become a consideration as well.
If there are virtually no ways to improve it, it'd be natural to me
we dismiss the issue.
3) fgetcsv() implementation in the stable branch is
now too complicated to add a new feature to
and also hard to maintain. We should be able to
eliminate the mblen() calls for acceptable performance.
See the attached result.
What features are we talking about here? The only 2 features I can see 
we may
wish to add are 1 char long enclosures and separators and the binary 
thing.
Both of these features would be fairly trivial to add.
One thing I'm talking about here is escaping behaviour, which I 
mentioned
in the previous mail.

Moriyoshi

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php


Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Ilia Alshanetsky
On December 12, 2003 05:38 pm, Moriyoshi Koizumi wrote:
 And I don't think fgetcsv() is an exception, since htmlentities() can
 be referred to as an example that is placed in core and
 supports multibyte strings. As I mentioned, purging that kind of
 functionality into the mbstring extension doesn't solve the problem
 in practice by any means.

htmlentities() is a rather special function it handles not only multibyte but 
a whole lot of diffrent  unusual things. I do not think you can fairly 
compare it to fgetcsv(). We have a multibyte extension for people who need 
that functionality, why force it on everyone else?

  2) IMO speed is not a key factor here. People rather wants
  trust-worthy behaviour.
 
  When it's a few percent and the changes offer significant improvements
  yes,
  but when were are talking about a performance loss of 250-300% or more
  then
  performance must become a consideration as well.

 If there are virtually no ways to improve it, it'd be natural to me
 we dismiss the issue.

Why does a vast majority of users have to endure degredation in performance 
for functionality that are needed by a few? It's as simple as that. Same 
argument applies to basename().


 One thing I'm talking about here is escaping behaviour, which I
 mentioned in the previous mail.

I believe it would be possible to implement in the 4.3.X code, however it 
sounds specific to multibyte implementation.

Ilia

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Moriyoshi Koizumi
On 2003/12/13, at 8:23, Ilia Alshanetsky wrote:

On December 12, 2003 05:38 pm, Moriyoshi Koizumi wrote:
And I don't think fgetcsv() is an exception, since htmlentities() can
be referred to as an example that is placed in core and
supports multibyte strings. As I mentioned, purging that kind of
functionality into the mbstring extension doesn't solve the problem
in practice by any means.
htmlentities() is a rather special function it handles not only 
multibyte but
a whole lot of diffrent  unusual things. I do not think you can fairly
compare it to fgetcsv().
What are you referring to as a whole lot of different  unusual thing?

We have a multibyte extension for people who need
that functionality, why force it on everyone else?
Because it's a bug. The multibyte extension we have is not provided
to make easier the lives of those who don't use multibyte encodings.
It exists as an extension since we had to do so in the past.

2) IMO speed is not a key factor here. People rather wants
trust-worthy behaviour.
When it's a few percent and the changes offer significant 
improvements
yes,
but when were are talking about a performance loss of 250-300% or 
more
then
performance must become a consideration as well.
If there are virtually no ways to improve it, it'd be natural to me
we dismiss the issue.
Why does a vast majority of users have to endure degredation in 
performance
for functionality that are needed by a few? It's as simple as that. 
Same
argument applies to basename().
You should be underestimating the number of the people who *actually*
need it.

One thing I'm talking about here is escaping behaviour, which I
mentioned in the previous mail.
I believe it would be possible to implement in the 4.3.X code, however 
it
sounds specific to multibyte implementation.
Escaping behaviour is totally irrelevant to the multibyte issue.
I think users should be able to choose by an optional argument
whether \ has to be treated as a escaped quote or a simple
sequence of a backslash and a quote.
Moriyoshi

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php


Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Rasmus Lerdorf
 Why does a vast majority of users have to endure degredation in performance 
 for functionality that are needed by a few? It's as simple as that. Same 
 argument applies to basename().

Ilia, we need to try to avoid this sort of thinking.  This vast majority 
is most likely only a vocal majority these days.  It is very likely that 
the non-mb users are actually the few and if we continue along your way 
of thinking then we need to have an ext/singlebyte that implements all 
these weird singlebyte string manipulation functions.

We need to move towards a uniform platform that works for everyone without 
putting undue strain on either side.

-Rasmus

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Jan Schneider
Zitat von Ilia Alshanetsky [EMAIL PROTECTED]:

 On December 12, 2003 05:38 pm, Moriyoshi Koizumi wrote:
  And I don't think fgetcsv() is an exception, since htmlentities() can
  be referred to as an example that is placed in core and
  supports multibyte strings. As I mentioned, purging that kind of
  functionality into the mbstring extension doesn't solve the problem
  in practice by any means.

 htmlentities() is a rather special function it handles not only multibyte
 but
 a whole lot of diffrent  unusual things. I do not think you can fairly
 compare it to fgetcsv(). We have a multibyte extension for people who
 need
 that functionality, why force it on everyone else?

   2) IMO speed is not a key factor here. People rather wants
   trust-worthy behaviour.
  
   When it's a few percent and the changes offer significant
 improvements
   yes,
   but when were are talking about a performance loss of 250-300% or
 more
   then
   performance must become a consideration as well.
 
  If there are virtually no ways to improve it, it'd be natural to me
  we dismiss the issue.

 Why does a vast majority of users have to endure degredation in
 performance
 for functionality that are needed by a few? It's as simple as that. Same
 argument applies to basename().

Just a general note on this discussion becoming sort of a meta-topic:
From a PHP developers POV, complete charset support should be a key
technology for ZE and the standard extensions, as is now XML for PHP5 as a
whole. While the comparison might be a bit strange, it even reminds on the
relation of these two: The standard encoding for XML data is a multibyte
charset.

But the real problem is, that it's *really* hard for developers outside of
the multibyte world to understand the ins and outs of these charsets and
how to handle them correctly. It was a PITA to make the whole Horde
framework charset independent without knowing anything on mb charsets and
their support in php.
I did this due to popular demand, because there are a *lot* of people
using/needing mb charsets. It would be great if others developers wouldn't
have to take this steep road, because php would support these out of the
box.

While writing this message, Rasmus got my point in fewer words. ;-)

Jan.

--
http://www.horde.org - The Horde Project
http://www.ammma.de - discover your knowledge
http://www.tip4all.de - Deine private Tippgemeinschaft

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



RE: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Steph
blimey, I just agreed with Rasmus..

 -Original Message-
 From: Rasmus Lerdorf [mailto:[EMAIL PROTECTED]
 Sent: 13 December 2003 00:03
 To: Ilia Alshanetsky
 Cc: Moriyoshi Koizumi; PHP Internals
 Subject: Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv()
 (stable branch)


  Why does a vast majority of users have to endure degredation in
 performance
  for functionality that are needed by a few? It's as simple as
 that. Same
  argument applies to basename().

 Ilia, we need to try to avoid this sort of thinking.  This vast
 majority
 is most likely only a vocal majority these days.  It is very
 likely that
 the non-mb users are actually the few and if we continue along your way
 of thinking then we need to have an ext/singlebyte that implements all
 these weird singlebyte string manipulation functions.

 We need to move towards a uniform platform that works for
 everyone without
 putting undue strain on either side.

 -Rasmus

 --
 PHP Internals - PHP Runtime Development Mailing List
 To unsubscribe, visit: http://www.php.net/unsub.php

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Ilia Alshanetsky
On December 12, 2003 07:02 pm, Rasmus Lerdorf wrote:
 Ilia, we need to try to avoid this sort of thinking.  This vast majority
 is most likely only a vocal majority these days.  It is very likely that
 the non-mb users are actually the few and if we continue along your way
 of thinking then we need to have an ext/singlebyte that implements all
 these weird singlebyte string manipulation functions.

There is a good chance you are correct. However my assumption is not without 
bases, please consider the following statistic:

Google finds 185,000 (or so) phpinfo() pages, when mbstring is added to the 
search query only 8150 pages are found. That leads me to believe that 1/2% of 
that userbase uses mbstring. Even if we were to say that of all the people 
who have mbstring compiled use it (which is highly unlikely) it's still only 
1/2%.

Ilia

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Ilia Alshanetsky
Before we get too far offtopic ( fgetcsv() ) let me just quickly summarize my 
position. I think it's great that multi-byte support exists in PHP, we have 
mbstring, iconv and recode extensions that all help make PHP work with 
multibyte strings.
As with most things with PHP the user has a choice of whether or not to enable 
this functionality depending on their present and future needs. By putting 
this functionality into core/standard this choice is taken away more over it 
penalizes (rather significantly) every single user of affected functionality. 
That is what I'd like to avoid.

Without mbstring enabled, you would not be able to effectively work with 
multibyte characters. Therefor even if fgetcsv() would work as you may expect 
with multibyte strings, that data would not be manageable in most cases. 
Which is why I propose that multibyte version of fgetcsv() be placed inside 
ext/mbstring as mb_fgetcsv().

Ilia

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Ilia Alshanetsky
On December 12, 2003 03:15 pm, Moriyoshi Koizumi wrote:
 If we limited the support to UTF-8 or EUC encoding only, we'd be
 able to drastically gain much better performance. But it won't
 actually solve practical problems where it is in action.

Could iconv stream filters be used to convert various encoding (if needed) to 
UTF-8 thus addressing the problem?

Ilia

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Moriyoshi Koizumi
On 2003/12/13, at 9:47, Ilia Alshanetsky wrote:

Without mbstring enabled, you would not be able to effectively work 
with
multibyte characters. Therefor even if fgetcsv() would work as you may 
expect
with multibyte strings, that data would not be manageable in most 
cases.
That's a bogus argument. In what case do you think they should be
effectively managed with mbstring functions..? As PHP's functions
are mostly highly-completed on its role and they don't need any
additional operations in most cases, I don't have to use mbstring
functions there.
String manipulation functions provided by mbstring are atomic
and so they are used just in the same way as
their standard counterparts like substr() and strpos().
I see a few cases that the strings fetched by fgetcsv()
have to be modified by substr() or whatever..
Moriyoshi

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php


Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Moriyoshi Koizumi
On 2003/12/13, at 9:56, Ilia Alshanetsky wrote:

On December 12, 2003 03:15 pm, Moriyoshi Koizumi wrote:
If we limited the support to UTF-8 or EUC encoding only, we'd be
able to drastically gain much better performance. But it won't
actually solve practical problems where it is in action.
Could iconv stream filters be used to convert various encoding (if 
needed) to
UTF-8 thus addressing the problem?
Actually it might do, but doing so leads to great overheads, because
you have to reconvert the strings to restore the initial form.
Besides the conversion is sometimes irreversible. And even it wouldn't
make sense unless iconv extension becomes built-in.
Moriyoshi

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php


Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Ilia Alshanetsky
On December 12, 2003 07:54 pm, Moriyoshi Koizumi wrote:
 On 2003/12/13, at 9:47, Ilia Alshanetsky wrote:
  Without mbstring enabled, you would not be able to effectively work
  with
  multibyte characters. Therefor even if fgetcsv() would work as you may
  expect
  with multibyte strings, that data would not be manageable in most
  cases.

What if you want to alter the case of the data or use it a basis for sending 
e-mail or check it against another string via regular expression? Sure, if 
you just want to read them and echo them back in the form of a table you 
don't need mbstring. 
It also seems to me that if you are going to be multibyte inputs you'd have 
either iconv or mbstring extension enabled.

Ilia

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Moriyoshi Koizumi
On 2003/12/13, at 10:13, Ilia Alshanetsky wrote:

It also seems to me that if you are going to be multibyte inputs you'd 
have
either iconv or mbstring extension enabled.
Which input? If you are talking about form inputs, we rarely need the
functionality (mbstring.encoding_conversion).
That is needed when the encoding in which a script is written and
the one the form uses to submit to the script. Such situation might look
weird to you, but it happens in case the output of the script
is automagically translated by mb_output_handler().
Please take a look at the relevant documentations.

Moriyoshi

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php


Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Moriyoshi Koizumi
On 2003/12/13, at 10:11, Moriyoshi Koizumi wrote:

That is needed when the encoding in which a script is written and
the one the form uses to submit to the script.
A few more words were missing:

That is needed when the encoding in which a script is written and
the one the form uses to submit to the script are different.
Moriyoshi

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php


Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Ilia Alshanetsky
On December 12, 2003 08:11 pm, Moriyoshi Koizumi wrote:
 Which input?

If you get multibyte data from a form and want to perform string operations on 
it such as strlen(), ereg(), etc... would you not need mb_* or iconv_* 
functions?

Ilia

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Moriyoshi Koizumi
On 2003/12/13, at 10:35, Ilia Alshanetsky wrote:

On December 12, 2003 08:11 pm, Moriyoshi Koizumi wrote:
Which input?
If you get multibyte data from a form and want to perform string 
operations on
it such as strlen(), ereg(), etc... would you not need mb_* or iconv_*
functions?
In such a case, yes, I do. But I don't think that's directly related
to the issue..?
Moriyoshi

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php


RE: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Steph

  If you get multibyte data from a form and want to perform string
  operations on
  it such as strlen(), ereg(), etc... would you not need mb_* or iconv_*
  functions?

 In such a case, yes, I do. But I don't think that's directly related
 to the issue..?

The real question is, what does mbstring do that iconv fails to do?  And why
do we need to restore the initial form in any sense?  And if iconv were
built-in, would mbstring still be needed, and if so, why and where?

/Steph goes away to read the relevant mbstring documentation

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



RE: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Rasmus Lerdorf
On Sat, 13 Dec 2003, Steph wrote:
   If you get multibyte data from a form and want to perform string
   operations on
   it such as strlen(), ereg(), etc... would you not need mb_* or iconv_*
   functions?
 
  In such a case, yes, I do. But I don't think that's directly related
  to the issue..?
 
 The real question is, what does mbstring do that iconv fails to do?  And why
 do we need to restore the initial form in any sense?  And if iconv were
 built-in, would mbstring still be needed, and if so, why and where?

The cool thing that mbstring provides is transparent overloading of some
of the common string manipulation functions.  This means that at least for
a subset of applications, even though they may not have been written with
multibyte support in mind, they may in fact work perfectly in a multibyte
environment with mbstring enabled and overloading turned on.

This is at the heart of this discussion, I think.  We are trying to make 
as many functions as possible behave correctly in both cases.  Simply 
saying that we should just have an mb_* version of a function to deal with 
the mb issues doesn't give us the power of being able to run the same code 
in both environments.  If we can make an mb_ version of a function that is 
similar enough argument and functionality-wise that we can cleanly 
overload the non-mb version of the function causing it to do the right 
thing, then I think we are most of the way there.

-Rasmus

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Moriyoshi Koizumi
On 2003/12/13, at 10:43, Steph wrote:

The real question is, what does mbstring do that iconv fails to do?  
And why
do we need to restore the initial form in any sense?  And if iconv were
built-in, would mbstring still be needed, and if so, why and where?
I wrote we need to restore it to the initial form because we want
to emulate exactly the same behaviour as the current fgetcsv() works.
As fgetcsv() without a stream filter doesn't perform any conversion,
so once we've turned the input into another encoding, then we have to
turn it back. In this case, we're trying to convert strings encoded
in an encoding other than UTF-8 into UTF-8 first, and parse it through
fgetcsv(). Here what you get are UTF-8 version of these, which
is unwanted. You might say the products will suffice even if they
are UTF-8 encoded, however the conversion is sometimes irreversible,
so we then need to avoid any conversion stuff.
(irreversible means you would be supposed to lost some information
during the conversion.)
Moriyoshi

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php


Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Moriyoshi Koizumi
On 2003/12/13, at 10:47, Rasmus Lerdorf wrote:

On Sat, 13 Dec 2003, Steph wrote:
If you get multibyte data from a form and want to perform string
operations on
it such as strlen(), ereg(), etc... would you not need mb_* or 
iconv_*
functions?
In such a case, yes, I do. But I don't think that's directly related
to the issue..?
The real question is, what does mbstring do that iconv fails to do?  
And why
do we need to restore the initial form in any sense?  And if iconv 
were
built-in, would mbstring still be needed, and if so, why and where?
The cool thing that mbstring provides is transparent overloading of 
some
of the common string manipulation functions.  This means that at least 
for
a subset of applications, even though they may not have been written 
with
multibyte support in mind, they may in fact work perfectly in a 
multibyte
environment with mbstring enabled and overloading turned on.
Overloading is evil, because functions like substr() are often
used to splice a certain length of octets byte-wise while mb_substr()
treats the sequence of octets on a character-basis. And overloading
cannot be turned on in scripts, this prevents us from writing portable
scripts. There're virtually no cleaner way to do the tasks elegantly.
Moriyoshi

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php


RE: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Steph
 Here what you get are UTF-8 version of these, which
 is unwanted. 

This is the part I don't understand.

You might say the products will suffice even if they
 are UTF-8 encoded, however the conversion is sometimes irreversible,
 so we then need to avoid any conversion stuff.
 
 (irreversible means you would be supposed to lost some information
 during the conversion.)
 
 Moriyoshi
 

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Ilia Alshanetsky
On December 12, 2003 08:54 pm, Moriyoshi Koizumi wrote:
 And overloading
 cannot be turned on in scripts, this prevents us from writing portable
 scripts.

Not entirely true, while you cannot enable it from with a script you can 
enable it for a particular directory via .htaccess or equivalent.

Ilia

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)

2003-12-12 Thread Moriyoshi Koizumi
On 2003/12/13, at 11:09, Ilia Alshanetsky wrote:

On December 12, 2003 08:54 pm, Moriyoshi Koizumi wrote:
And overloading
cannot be turned on in scripts, this prevents us from writing portable
scripts.
Not entirely true, while you cannot enable it from with a script you 
can
enable it for a particular directory via .htaccess or equivalent.
That's correct, but that's not my entire point on the overloading
functionality either.
Moriyoshi

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php