Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On December 15, 2003 06:36 pm, Rasmus Lerdorf wrote:
> As Stig says, the correct solution would be to always store the encoding of the string right alongside the length of the string in the guts of PHP. Anything short of that is going to be a hack. PHP6 here we come...

Then here is our first TODO for PHP6 :). But until then, please, let's try to avoid adding hacks that implement partial mb support in select functions. These hacks serve no one: many users lose scalability, and other users get partial support that will likely prevent them from using the new functionality.

Ilia

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On Mon, 15 Dec 2003, Derek Ford wrote:
> I see no example of him implying he wanted to "dismiss" multibyte users, he simply suggested mb_* versions of the string manipulation functions and pointed available facilities that people can use already. I support that idea, as having a mb_ version and a version without multibyte support gives everyone what they want. People who want multibyte strings have it, and people who want speed without multibyte strings still have that; everyone should be happy. Those who don't need multibyte strings (the majority, by a long shot) don't have to suffer any performance loss, while those in Asia can open that marketshare you speak of.

It is a dismissal in the sense that existing apps not written explicitly for multibyte support will not work for nearly half the users of PHP. We are not talking about a small group of users here.

As Stig says, the correct solution would be to always store the encoding of the string right alongside the length of the string in the guts of PHP. Anything short of that is going to be a hack. PHP6 here we come...

-Rasmus
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
Stig S. Bakken wrote:
> On Sun, 2003-12-14 at 00:28, Ilia Alshanetsky wrote:
> > On December 13, 2003 05:52 pm, Moriyoshi Koizumi wrote:
> > > I haven't denied it. That said, multibyte facility is not so fancy as XML, but quite essential so as to enable most applications to work well under every environment.
> >
> > Bullshit. Only applications that need to support multibyte strings need the multibyte facility.
> >
> > > Let's stop doing such a stupid thing any more. As I pointed out already, having different versions for each function doesn't solve problems at all.
> >
> > It sure does, those who need the slower (multibyte) version use that and those who don't use the standard version which works nice and fast for non-multibyte strings.
>
> So you think the right solution is to dismiss multibyte users and direct them to the hacks (mbstring etc) that have been used previously instead of thinking ahead?

I see no example of him implying he wanted to "dismiss" multibyte users, he simply suggested mb_* versions of the string manipulation functions and pointed to available facilities that people can use already. I support that idea, as having a mb_ version and a version without multibyte support gives everyone what they want. People who want multibyte strings have it, and people who want speed without multibyte strings still have that; everyone should be happy. Those who don't need multibyte strings (the majority, by a long shot) don't have to suffer any performance loss, while those in Asia can open that marketshare you speak of.

> If I were starting a language from scratch today, I would make character encoding part of the string "zval" structure. IMHO that's where it belongs. As an alternative for PHP 5[.1], there is room for a "multibyte bit" in the zval that various functions can use to choose between "sizeof(byte)==sizeof(char)" and "sizeof(byte) < sizeof(char)" implementations.
>
> - Stig
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On December 15, 2003 10:36 am, Moriyoshi Koizumi wrote:
> Well, the legacy users of PHP4 will significantly suffer for PHP5's new features.

How so? PHP 5 does break BC (especially for objects), but this is something that was talked about for years and the consensus is/was that the change is for the better. To my knowledge, the majority of the new features in PHP5 are just that and have no side effects.

Another alternative to moving fgetcsv()'s current implementation would be to wrap php_mblen in #ifdef HAVE_MBSTRING and fall back to #define php_mblen(ptr, len) 1 when mbstring is not enabled. Personally, I'd prefer to have the function in mbstring and modified further to support multibyte delimiters and enclosures, which it does not do now due to performance considerations.

Ilia
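The compile-time fallback Ilia describes could look roughly like the C sketch below. It is only an illustration: `sketch_mblen` and `sketch_count_chars` are hypothetical names, not PHP's actual `php_mblen` code, and the multibyte branch assumes the standard C `mblen()` (whose result depends on the current locale).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the php_mblen idea: when HAVE_MBSTRING is not
 * defined, every character is assumed to be exactly one byte wide. */
#ifdef HAVE_MBSTRING
# define sketch_mblen(ptr, len) mblen((ptr), (len))
#else
# define sketch_mblen(ptr, len) 1
#endif

/* Count characters in buf by advancing sketch_mblen() bytes at a time. */
static size_t sketch_count_chars(const char *buf, size_t len)
{
    size_t pos = 0, count = 0;

    assert(buf != NULL);
    while (pos < len) {
        int step = sketch_mblen(buf + pos, len - pos);
        if (step < 1) {
            step = 1; /* invalid sequence or NUL byte: consume one byte */
        }
        pos += (size_t)step;
        count++;
    }
    return count;
}
```

Without `HAVE_MBSTRING` the macro collapses to the constant 1, so the loop degenerates to a plain byte count and a compiler can remove the multibyte branch entirely. That is the performance argument being made in the thread.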
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On Tue, 16 Dec 2003, Moriyoshi Koizumi wrote:
> On 2003/12/16, at 0:42, Derick Rethans wrote:
> > On Tue, 16 Dec 2003, Moriyoshi Koizumi wrote:
> > > > If you were designing a new language you wouldn't have legacy users who'd suffer (significantly) because of features added for other users.
> > >
> > > Well, the legacy users of PHP4 will significantly suffer for PHP5's new features.
> >
> > Uh? Where does this wisdom come from?
>
> README.PHP4-TO-PHP5-THIN-CHANGES and anything I spotted in the current code.

I know there are changes, I was inquiring about the "significantly" in your statement.

Derick
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/16, at 0:42, Derick Rethans wrote:
> On Tue, 16 Dec 2003, Moriyoshi Koizumi wrote:
> > > If you were designing a new language you wouldn't have legacy users who'd suffer (significantly) because of features added for other users.
> >
> > Well, the legacy users of PHP4 will significantly suffer for PHP5's new features.
>
> Uh? Where does this wisdom come from?

README.PHP4-TO-PHP5-THIN-CHANGES and anything I spotted in the current code.

Moriyoshi
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On Tue, 16 Dec 2003, Moriyoshi Koizumi wrote:
> > If you were designing a new language you wouldn't have legacy users who'd suffer (significantly) because of features added for other users.
>
> Well, the legacy users of PHP4 will significantly suffer for PHP5's new features.

Uh? Where does this wisdom come from?

Derick
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/16, at 0:32, Ilia Alshanetsky wrote:
> On December 15, 2003 05:37 am, Stig S. Bakken wrote:
> > So you think the right solution is to dismiss multibyte users and direct them to the hacks (mbstring etc) that have been used previously instead of thinking ahead?
>
> IMHO calling multibyte a hack would be a great disservice to the developers of that extension. We don't call ext/pgsql a hack, simply because it's not builtin, do we?

The extension is virtually a hack. Again, the developers of mbstring had to add support for multiple encodings as a separate extension, instead of integrating it into the core implementation, because we always have to maintain backwards compatibility.

> > If I were starting a language from scratch today, I would make character encoding part of the string "zval" structure. IMHO that's where it belongs. As an alternative for PHP 5[.1], there is room for a "multibyte bit" in the zval that various functions can use to choose between "sizeof(byte)==sizeof(char)" and "sizeof(byte) < sizeof(char)" implementations.
>
> If you were designing a new language you wouldn't have legacy users who'd suffer (significantly) because of features added for other users.

Well, the legacy users of PHP4 will significantly suffer for PHP5's new features.

Moriyoshi
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On December 15, 2003 05:37 am, Stig S. Bakken wrote:
> So you think the right solution is to dismiss multibyte users and direct them to the hacks (mbstring etc) that have been used previously instead of thinking ahead?

IMHO calling multibyte a hack would be a great disservice to the developers of that extension. We don't call ext/pgsql a hack, simply because it's not builtin, do we?

> If I were starting a language from scratch today, I would make character encoding part of the string "zval" structure. IMHO that's where it belongs. As an alternative for PHP 5[.1], there is room for a "multibyte bit" in the zval that various functions can use to choose between "sizeof(byte)==sizeof(char)" and "sizeof(byte) < sizeof(char)" implementations.

If you were designing a new language you wouldn't have legacy users who'd suffer (significantly) because of features added for other users.

Ilia
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On Sun, 2003-12-14 at 00:28, Ilia Alshanetsky wrote:
> On December 13, 2003 05:52 pm, Moriyoshi Koizumi wrote:
> > I haven't denied it. That said, multibyte facility is not so fancy as XML, but quite essential so as to enable most applications to work well under every environment.
>
> Bullshit. Only applications that need to support multibyte strings need the multibyte facility.
>
> > Let's stop doing such a stupid thing any more. As I pointed out already, having different versions for each function doesn't solve problems at all.
>
> It sure does, those who need the slower (multibyte) version use that and those who don't use the standard version which works nice and fast for non-multibyte strings.

So you think the right solution is to dismiss multibyte users and direct them to the hacks (mbstring etc) that have been used previously instead of thinking ahead?

If I were starting a language from scratch today, I would make character encoding part of the string "zval" structure. IMHO that's where it belongs. As an alternative for PHP 5[.1], there is room for a "multibyte bit" in the zval that various functions can use to choose between "sizeof(byte)==sizeof(char)" and "sizeof(byte) < sizeof(char)" implementations.

- Stig
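Stig's "multibyte bit" can be pictured with a small C sketch. The `sketch_string` struct below is hypothetical and is not PHP's real zval layout; the multibyte path assumes UTF-8 purely for illustration, whereas the thread is concerned with the whole range of CJK encodings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical string value carrying a one-bit encoding hint. */
typedef struct {
    const char *val;               /* raw bytes */
    size_t len;                    /* length in bytes */
    unsigned int is_multibyte : 1; /* 0: one byte per char, 1: UTF-8 (assumed) */
} sketch_string;

/* Count UTF-8 characters: every byte that is not a continuation byte
 * (bit pattern 10xxxxxx) starts a new character. */
static size_t utf8_char_count(const char *s, size_t len)
{
    size_t i, count = 0;

    for (i = 0; i < len; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}

/* Length in characters; single-byte strings take the O(1) fast path. */
static size_t sketch_strlen(const sketch_string *s)
{
    assert(s != NULL);
    if (!s->is_multibyte) {
        return s->len;
    }
    return utf8_char_count(s->val, s->len);
}
```

The design point is that the branch is taken per value rather than per function name: single-byte strings pay for one bit test, and only strings flagged as multibyte take the slower scan.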
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On Fri, 2003-12-12 at 23:28, Ilia Alshanetsky wrote:
> On December 12, 2003 04:18 pm, Moriyoshi Koizumi wrote:
> > I disagree, because of the following reasons:
> >
> > 1) Not a few people *actually* use fgetcsv() commonly with multibyte characters indeed. Regarding this, applications made by those who don't use such characters don't (and won't) use multibyte specific functions and that's the problem. This greatly prevents them from being portable.
>
> People have lived without multibyte support in fgetcsv() for many years now, and I did not see a single request on bugs.php.net for fgetcsv() multi-byte support. So, while this is certainly useful functionality I do not believe it is as widely needed as you say it is. We also have a multibyte extension that already implements multi-byte safe variants of common functions, why make exception for fgetcsv() and add multibyte code into core?

Just an observation: it seems that the PHP users who need multibyte support are generally self-supplied by default. It's often hard to convince programmers to change their code as fundamentally as you often need to do to support not just UTF-8 but the whole range of CJK charsets; it adds complexity and can slow things down. These users are used to maintaining their own patches for all kinds of software. The process of merging in multibyte character features often takes several years.

Because of this (if my observation is correct), you can't really tell, for example, how many Japanese users are having issues with fgetcsv() by counting requests on bugs.php.net.

I agree with Moriyoshi Koizumi that performance is not necessarily the primary factor here. IMHO performance is important, but generality and reliability are more so.

With all due respect to everyone, I think that we should be a bit more welcoming to people who offer help in making PHP a better language for CJK websites. There's still a huge amount of marketshare waiting for PHP in Asia. :-)

- Stig
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On December 13, 2003 05:52 pm, Moriyoshi Koizumi wrote:
> I haven't denied it. That said, multibyte facility is not so fancy as XML, but quite essential so as to enable most applications to work well under every environment.

Bullshit. Only applications that need to support multibyte strings need the multibyte facility.

> Let's stop doing such a stupid thing any more. As I pointed out already, having different versions for each function doesn't solve problems at all.

It sure does, those who need the slower (multibyte) version use that and those who don't use the standard version which works nice and fast for non-multibyte strings.

Ilia
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/14, at 7:33, Ilia Alshanetsky wrote:
> Percentages aside you cannot deny the fact that not every application needs multibyte support (whether this is a majority or 50/50 does not matter). If a user needs to use multibyte they may need to do a little searching to find a provider that supports it, fortunately for them PHP is a very common scripting language with many hosting providers.

I haven't denied it. That said, multibyte facility is not so fancy as XML, but quite essential so as to enable most applications to work well under every environment.

> > So, why not begin thinking of how it could be bearably fast even with multibyte support enabled? While I think the current stuff I made is the best portable and the fastest code, it's probable that there are a far better code.
>
> If your code is indeed as fast as it can be then the only alternative it would seem is to separate the function into 'normal' and 'multibyte' variants allowing the user and not the developer to choose the one most suited to their needs.

Let's stop doing such a stupid thing any more. As I pointed out already, having different versions for each function doesn't solve problems at all.

Moriyoshi
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On December 13, 2003 04:46 pm, Moriyoshi Koizumi wrote:
> > The critical point of this entire discussion is about NOT forcing choices on people who do not want/need them. There is no good reason to force multibyte version of fgetcsv() on every single user, when there are not one but two PHP extensions designed explicitly for multibyte support.
>
> On the other hand, the chances are very limited to users familiar to multibyte. First of all, flexibility on the configuration has been causing lots of confusion. I'd be happy if every existing application used mb_*() instead of their counterpart at appropriate places, but it's unlikely. This is because we have two versions of string manipulation functions. And again, it's prevented users to write multibyte safe applications because multibyte-flavor extensions are currently not enabled by default though this fact is not my point.

Percentages aside you cannot deny the fact that not every application needs multibyte support (whether this is a majority or 50/50 does not matter). If a user needs to use multibyte they may need to do a little searching to find a provider that supports it; fortunately for them PHP is a very common scripting language with many hosting providers.

> > If fgetcsv() in PHP 5 cannot be designed in such a way as to have no significant performance penalties for non-multibyte strings the function should be introduced as mb_fgetcsv() or iconv_fgetcsv().
>
> So, why not begin thinking of how it could be bearably fast even with multibyte support enabled? While I think the current stuff I made is the best portable and the fastest code, it's probable that there are a far better code.

If your code is indeed as fast as it can be then the only alternative it would seem is to separate the function into 'normal' and 'multibyte' variants allowing the user and not the developer to choose the one most suited to their needs.

Ilia
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/14, at 6:46, Moriyoshi Koizumi wrote:
> I made is the best portable and the fastest code, it's probable that there are a far better code.

s/there are/there'd be/
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/14, at 6:19, Ilia Alshanetsky wrote:
> On December 13, 2003 03:53 pm, Moriyoshi Koizumi wrote:
> > Could a quarter be a minority?
>
> Unless the rules of mathematics had changed 25% is still a minority. You also forget that there are plenty of people who compile extensions and never end up using them.

A similar fact applies to your assumption. It's also true that PHP can handle multibyte strings without mbstring or iconv in some cases where the users are just fortunate enough to not get in trouble, most likely because they just don't use the multibyte characters that are known to cause problems due to their structure. These user questions [1][2] describe exactly when it goes wrong.

[1] http://marc.theaimsgroup.com/?l=php-dev&m=103828989330521&w=2
[2] http://news.php.net/article.php?group=php.i18n&article=633

> The critical point of this entire discussion is about NOT forcing choices on people who do not want/need them. There is no good reason to force multibyte version of fgetcsv() on every single user, when there are not one but two PHP extensions designed explicitly for multibyte support.

On the other hand, the chances are very limited to users familiar with multibyte. First of all, flexibility in the configuration has been causing lots of confusion. I'd be happy if every existing application used mb_*() instead of their counterparts at appropriate places, but it's unlikely. This is because we have two versions of string manipulation functions. And again, it has prevented users from writing multibyte safe applications because multibyte-flavor extensions are currently not enabled by default, though this fact is not my point.

> If fgetcsv() in PHP 5 cannot be designed in such a way as to have no significant performance penalties for non-multibyte strings the function should be introduced as mb_fgetcsv() or iconv_fgetcsv().

So, why not begin thinking of how it could be bearably fast even with multibyte support enabled? While I think the current stuff I made is the best portable and the fastest code, it's probable that there are a far better code.

Moriyoshi
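The structural problem mentioned above can be illustrated in C. The sketch assumes Shift_JIS, where a two-byte character's trail byte may fall in the ASCII range (0x40 to 0x7E), so a byte-wise scan can mistake half of a character for a delimiter. The function names are hypothetical and this is not the fgetcsv() patch under discussion.

```c
#include <assert.h>
#include <stddef.h>

/* Shift_JIS lead bytes: 0x81-0x9F and 0xE0-0xFC (extensions included). */
static int sjis_is_lead(unsigned char c)
{
    return (c >= 0x81 && c <= 0x9F) || (c >= 0xE0 && c <= 0xFC);
}

/* Naive field count: every byte equal to delim splits the line. */
static size_t count_fields_bytewise(const char *line, size_t len, char delim)
{
    size_t i, fields = 1;

    assert(line != NULL);
    for (i = 0; i < len; i++) {
        if (line[i] == delim) {
            fields++;
        }
    }
    return fields;
}

/* Encoding-aware field count: skip both bytes of a two-byte character
 * so that its trail byte is never compared against the delimiter. */
static size_t count_fields_sjis(const char *line, size_t len, char delim)
{
    size_t i = 0, fields = 1;

    assert(line != NULL);
    while (i < len) {
        if (sjis_is_lead((unsigned char)line[i]) && i + 1 < len) {
            i += 2;
            continue;
        }
        if (line[i] == delim) {
            fields++;
        }
        i++;
    }
    return fields;
}
```

For the byte sequence `a|\x83\x7C|b` with `|` as the delimiter, the pair 0x83 0x7C is one two-byte character whose trail byte equals ASCII `|`. The byte-wise version therefore sees a spurious extra field, while the lead-byte-aware version does not, at the cost of an additional test per byte.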
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On December 13, 2003 03:53 pm, Moriyoshi Koizumi wrote:
> Could a quarter be a minority?

Unless the rules of mathematics had changed 25% is still a minority. You also forget that there are plenty of people who compile extensions and never end up using them.

The critical point of this entire discussion is about NOT forcing choices on people who do not want/need them. There is no good reason to force multibyte version of fgetcsv() on every single user, when there are not one but two PHP extensions designed explicitly for multibyte support.

If fgetcsv() in PHP 5 cannot be designed in such a way as to have no significant performance penalties for non-multibyte strings the function should be introduced as mb_fgetcsv() or iconv_fgetcsv().

Ilia
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/14, at 5:55, Ilia Alshanetsky wrote:
> On December 13, 2003 03:27 pm, Moriyoshi Koizumi wrote:
> > As a sidenote, this unrealistic statistics appear to be quite unreal.
> >
> > phpinfo() => 186,000 (pages) [1]
> > phpinfo() mbstring => 8,330
> > phpinfo() Server API Configure Command => 16,800
> > phpinfo() Server API Configure Command mbstring => 4,510
>
> Even so this still represents a MINORITY.

Could a quarter be a minority? And the fact is there are lots of people who don't know what the problem is when they've got to use multibyte strings. They don't realise why they should not use UTF-8 with their mysql server set to handle ISO-8859-1 for instance.

Moriyoshi
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On December 13, 2003 03:27 pm, Moriyoshi Koizumi wrote:
> As a sidenote, this unrealistic statistics appear to be quite unreal.
>
> phpinfo() => 186,000 (pages) [1]
> phpinfo() mbstring => 8,330
> phpinfo() Server API Configure Command => 16,800
> phpinfo() Server API Configure Command mbstring => 4,510

Even so this still represents a MINORITY.

Ilia
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/13, at 9:36, Ilia Alshanetsky wrote:
> There is a good chance you are correct. However my assumption is not without basis, please consider the following statistic: Google finds 185,000 (or so) phpinfo() pages, when mbstring is added to the search query only 8150 pages are found. That leads me to believe that 1/2% of that userbase uses mbstring. Even if we were to say that of all the people who have mbstring compiled use it (which is highly unlikely) it's still only 1/2%.

As a sidenote, this unrealistic statistics appear to be quite unreal.

phpinfo() => 186,000 (pages) [1]
phpinfo() mbstring => 8,330
phpinfo() Server API Configure Command => 16,800
phpinfo() Server API Configure Command mbstring => 4,510

[1] includes the number of pages that are not of actual phpinfo() and merely contain the keyword "phpinfo".

Moriyoshi
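For reference, the ratios implied by the counts quoted above can be checked with a few lines of C. `share_percent` is a hypothetical helper, and the inputs are search-hit counts taken from the messages, not measurements of actual mbstring usage.

```c
#include <assert.h>

/* Percentage of `subset` within `total`. */
static double share_percent(unsigned long subset, unsigned long total)
{
    assert(total > 0);
    return 100.0 * (double)subset / (double)total;
}
```

With the quoted figures, 8,330 of 186,000 is about 4.5%, 4,510 of 16,800 is about 26.8% (the "quarter" referred to in the follow-ups), and 8,150 of 185,000 is about 4.4%.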
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/14, at 1:07, Rasmus Lerdorf wrote:
> On Sat, 13 Dec 2003, Jan Schneider wrote:
> > I have to agree. While in the past it helped mb users to turn on overloading if they wanted to use our framework, it will now break it. This is because we now explicitly use the str*() function for byte-wise string manipulation and their mb_*() equivalents for character-wise manipulation. This is the only way to predict the results, the magic that is done by overloading or transparent charset conversion is not suitable for real production environments.
>
> Using str*() functions for octet manipulation is fundamentally wrong. str*() functions by definition work on character boundaries. If we need to operate on byte boundaries we need to introduce a set of mem*() functions.

I think single-byte users are prone to a general assumption (indeed a superstition) that strlen() returns the number of octets and substr() cuts a portion of a string in a specified range of bytes, regardless of your conception of str*(). This sort of tendency applies to other programming languages as well.

Moriyoshi
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On Sat, 13 Dec 2003, Jan Schneider wrote:
> With the current implementation and assuming that mbstring overloading is turned off, I can. This is not documented, but I'd still consider a change of this behaviour a huge bc break.

The documentation states "characters" and nowhere does it say the size of a character. I would not consider this a large BC break.

-Rasmus
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
Quoting Rasmus Lerdorf <[EMAIL PROTECTED]>:
> On Sat, 13 Dec 2003, Jan Schneider wrote:
> > Maybe. Due to PHP lacking byte stream functions, working with str* is the only solution atm.
>
> And my contention is that there is no way to do this right now. If you rely on a str*() function to do this your application is broken since you cannot reasonably expect a character to always be 8 bits wide.

With the current implementation and assuming that mbstring overloading is turned off, I can. This is not documented, but I'd still consider a change of this behaviour a huge bc break.

Jan.

--
http://www.horde.org - The Horde Project
http://www.ammma.de - discover your knowledge
http://www.tip4all.de - Deine private Tippgemeinschaft
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On Sat, 13 Dec 2003, Jan Schneider wrote:
> Maybe. Due to PHP lacking byte stream functions, working with str* is the only solution atm.

And my contention is that there is no way to do this right now. If you rely on a str*() function to do this your application is broken since you cannot reasonably expect a character to always be 8 bits wide.

-Rasmus
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
Quoting Rasmus Lerdorf <[EMAIL PROTECTED]>:
> On Sat, 13 Dec 2003, Jan Schneider wrote:
> > I have to agree. While in the past it helped mb users to turn on overloading if they wanted to use our framework, it will now break it. This is because we now explicitly use the str*() function for byte-wise string manipulation and their mb_*() equivalents for character-wise manipulation. This is the only way to predict the results, the magic that is done by overloading or transparent charset conversion is not suitable for real production environments.
>
> Using str*() functions for octet manipulation is fundamentally wrong. str*() functions by definition work on character boundaries. If we need to operate on byte boundaries we need to introduce a set of mem*() functions.

Maybe. Due to PHP lacking byte stream functions, working with str* is the only solution atm.

Jan.
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On Sat, 13 Dec 2003, Jan Schneider wrote:
> I have to agree. While in the past it helped mb users to turn on overloading if they wanted to use our framework, it will now break it. This is because we now explicitly use the str*() function for byte-wise string manipulation and their mb_*() equivalents for character-wise manipulation. This is the only way to predict the results, the magic that is done by overloading or transparent charset conversion is not suitable for real production environments.

Using str*() functions for octet manipulation is fundamentally wrong. str*() functions by definition work on character boundaries. If we need to operate on byte boundaries we need to introduce a set of mem*() functions.

-Rasmus
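The byte-boundary versus character-boundary distinction can be made concrete with a C sketch. It assumes UTF-8 for illustration only, and the helper names are hypothetical; they are neither existing PHP functions nor the proposed mem*() family.

```c
#include <assert.h>
#include <stddef.h>

/* A UTF-8 continuation byte has the bit pattern 10xxxxxx. */
static int utf8_is_continuation(unsigned char c)
{
    return (c & 0xC0) == 0x80;
}

/* Returns 1 if cutting the string at byte offset `off` keeps every
 * character whole, 0 if the cut would land inside a character. */
static int utf8_is_boundary(const char *s, size_t len, size_t off)
{
    assert(s != NULL);
    if (off >= len) {
        return 1;
    }
    return !utf8_is_continuation((unsigned char)s[off]);
}

/* Byte offset at which the n-th character (0-based) starts, or len
 * if the string holds fewer than n + 1 characters. */
static size_t utf8_offset_of_char(const char *s, size_t len, size_t n)
{
    size_t off, seen = 0;

    assert(s != NULL);
    for (off = 0; off < len; off++) {
        if (!utf8_is_continuation((unsigned char)s[off])) {
            if (seen == n) {
                return off;
            }
            seen++;
        }
    }
    return len;
}
```

For a six-byte string made of two three-byte characters, a byte-wise cut at offset 4 lands inside the second character, whereas a character-wise cut after one character resolves to byte offset 3. A byte-oriented function would accept offset 4 as given; a character-oriented one would have to translate positions as `utf8_offset_of_char` does.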
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
Quoting Derick Rethans <[EMAIL PROTECTED]>:
> On Sat, 13 Dec 2003, Moriyoshi Koizumi wrote:
> > Overloading is evil, because functions like substr() are often used to splice a certain length of octets byte-wise while mb_substr() treats the sequence of octets on a character-basis. And overloading cannot be turned on in scripts, this prevents us from writing portable scripts. There're virtually no cleaner way to do the tasks elegantly.
>
> I also think overloading is evil, we've seen before what the problems can be because of difference between an overloaded and a non-overloaded version. It is however perfectly possible to tune this behavior in the php.ini file; not sure if we want that though.

I guess it depends on your pov. If you want to write portable scripts, relying on overloading or even special php.ini tuning is a nightmare.

Jan.
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
Quoting Moriyoshi Koizumi <[EMAIL PROTECTED]>:
> > The cool thing that mbstring provides is transparent overloading of some of the common string manipulation functions. This means that at least for a subset of applications, even though they may not have been written with multibyte support in mind, they may in fact work perfectly in a multibyte environment with mbstring enabled and overloading turned on.
>
> Overloading is evil, because functions like substr() are often used to splice a certain length of octets byte-wise while mb_substr() treats the sequence of octets on a character-basis. And overloading cannot be turned on in scripts, this prevents us from writing portable scripts. There're virtually no cleaner way to do the tasks elegantly.

I have to agree. While in the past it helped mb users to turn on overloading if they wanted to use our framework, it will now break it. This is because we now explicitly use the str*() function for byte-wise string manipulation and their mb_*() equivalents for character-wise manipulation. This is the only way to predict the results; the magic that is done by overloading or transparent charset conversion is not suitable for real production environments.

Jan.
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On Fri, 12 Dec 2003, Ilia Alshanetsky wrote:
> On December 12, 2003 08:54 pm, Moriyoshi Koizumi wrote:
> > And overloading cannot be turned on in scripts, this prevents us from writing portable scripts.
>
> Not entirely true, while you cannot enable it from within a script you can enable it for a particular directory via .htaccess or equivalent.

Actually, this is going to open a whole can of worms... I would call that a hack if it's done (though it might be a cool one :).

Derick
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On Fri, 12 Dec 2003, Rasmus Lerdorf wrote:

> We need to move towards a uniform platform that works for everyone without putting undue strain on either side.

Sure we do, but not at a 200-250% performance loss.

Derick
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On Sat, 13 Dec 2003, Moriyoshi Koizumi wrote:

> Overloading is evil, because functions like substr() are often used to splice a certain length of octets byte-wise while mb_substr() treats the sequence of octets on a character basis. And overloading cannot be turned on in scripts, which prevents us from writing portable scripts. There is virtually no cleaner way to do the tasks elegantly.

I also think overloading is evil; we've seen before what problems can arise from differences between an overloaded and a non-overloaded version. It is however perfectly possible to tune this behavior in the php.ini file; not sure if we want that though.

Derick
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/13, at 11:12, Rasmus Lerdorf wrote:

> On Sat, 13 Dec 2003, Moriyoshi Koizumi wrote:
> > Overloading is evil, because functions like substr() are often used to splice a certain length of octets byte-wise while mb_substr() treats the sequence of octets on a character basis.
>
> I don't know about this happening often. In singlebyte apps I don't see this happening often. People splice on character boundaries. Perhaps in multibyte apps people are using substr this way, but that would seem to be an incorrect usage of the function. People who do this are taking advantage of a side-effect of a function designed for a different character set. substr as documented works on characters, not octets.

"Often" was indeed slightly too strong a word, but you can see a typical example in the PEAR package Net_DNS. No matter how substr() and its family are documented, they are very likely misused in many places because of the common mistaken assumption that every character maps to a single byte.

Moriyoshi
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On Sat, 13 Dec 2003, Moriyoshi Koizumi wrote:

> Overloading is evil, because functions like substr() are often used to splice a certain length of octets byte-wise while mb_substr() treats the sequence of octets on a character basis.

I don't know about this happening often. In singlebyte apps I don't see this happening often. People splice on character boundaries. Perhaps in multibyte apps people are using substr this way, but that would seem to be an incorrect usage of the function. People who do this are taking advantage of a side-effect of a function designed for a different character set. substr as documented works on characters, not octets.

-Rasmus
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/13, at 11:09, Ilia Alshanetsky wrote:

> On December 12, 2003 08:54 pm, Moriyoshi Koizumi wrote:
> > And overloading cannot be turned on in scripts, this prevents us from writing portable scripts.
>
> Not entirely true, while you cannot enable it from within a script you can enable it for a particular directory via .htaccess or equivalent.

That's correct, but that's not my entire point on the overloading functionality either.

Moriyoshi
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On December 12, 2003 08:54 pm, Moriyoshi Koizumi wrote:

> And overloading cannot be turned on in scripts, this prevents us from writing portable scripts.

Not entirely true, while you cannot enable it from within a script you can enable it for a particular directory via .htaccess or equivalent.

Ilia
RE: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
> Here what you get are UTF-8 versions of these, which is unwanted.

This is the part I don't understand.

> You might say the products will suffice even if they are UTF-8 encoded, however the conversion is sometimes irreversible, so we then need to avoid any conversion stuff.
>
> ("irreversible" means you would lose some information during the conversion.)
>
> Moriyoshi
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/13, at 10:47, Rasmus Lerdorf wrote:

> On Sat, 13 Dec 2003, Steph wrote:
> > > > If you get multibyte data from a form and want to perform string operations on it such as strlen(), ereg(), etc... would you not need mb_* or iconv_* functions?
> > >
> > > In such a case, yes, I do. But I don't think that's directly related to the issue..?
> >
> > The real question is, what does mbstring do that iconv fails to do? And why do we need to restore the initial form in any sense? And if iconv were built-in, would mbstring still be needed, and if so, why and where?
>
> The cool thing that mbstring provides is transparent overloading of some of the common string manipulation functions. This means that at least for a subset of applications, even though they may not have been written with multibyte support in mind, they may in fact work perfectly in a multibyte environment with mbstring enabled and overloading turned on.

Overloading is evil, because functions like substr() are often used to splice a certain length of octets byte-wise while mb_substr() treats the sequence of octets on a character basis. And overloading cannot be turned on in scripts, which prevents us from writing portable scripts. There is virtually no cleaner way to do the tasks elegantly.

Moriyoshi
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/13, at 10:43, Steph wrote:

> The real question is, what does mbstring do that iconv fails to do? And why do we need to restore the initial form in any sense? And if iconv were built-in, would mbstring still be needed, and if so, why and where?

I wrote that we need to restore it to the initial form because we want to emulate exactly the behaviour of the current fgetcsv(). As fgetcsv() without a stream filter doesn't perform any conversion, once we've turned the input into another encoding we have to turn it back.

In this case, we're trying to convert strings encoded in an encoding other than UTF-8 into UTF-8 first, and parse them through fgetcsv(). Here what you get are UTF-8 versions of these, which is unwanted. You might say the products will suffice even if they are UTF-8 encoded, however the conversion is sometimes irreversible, so we then need to avoid any conversion stuff.

("irreversible" means you would lose some information during the conversion.)

Moriyoshi
RE: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On Sat, 13 Dec 2003, Steph wrote:

> > > If you get multibyte data from a form and want to perform string operations on it such as strlen(), ereg(), etc... would you not need mb_* or iconv_* functions?
> >
> > In such a case, yes, I do. But I don't think that's directly related to the issue..?
>
> The real question is, what does mbstring do that iconv fails to do? And why do we need to restore the initial form in any sense? And if iconv were built-in, would mbstring still be needed, and if so, why and where?

The cool thing that mbstring provides is transparent overloading of some of the common string manipulation functions. This means that at least for a subset of applications, even though they may not have been written with multibyte support in mind, they may in fact work perfectly in a multibyte environment with mbstring enabled and overloading turned on.

This is at the heart of this discussion, I think. We are trying to make as many functions as possible behave correctly in both cases. Simply saying that we should just have an mb_* version of a function to deal with the mb issues doesn't give us the power of being able to run the same code in both environments. If we can make an mb_ version of a function that is similar enough argument and functionality-wise that we can cleanly overload the non-mb version of the function causing it to do the right thing, then I think we are most of the way there.

-Rasmus
RE: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
> > If you get multibyte data from a form and want to perform string operations on it such as strlen(), ereg(), etc... would you not need mb_* or iconv_* functions?
>
> In such a case, yes, I do. But I don't think that's directly related to the issue..?

The real question is, what does mbstring do that iconv fails to do? And why do we need to restore the initial form in any sense? And if iconv were built-in, would mbstring still be needed, and if so, why and where?

/Steph goes away to read the relevant mbstring documentation
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/13, at 10:35, Ilia Alshanetsky wrote:

> On December 12, 2003 08:11 pm, Moriyoshi Koizumi wrote:
> > Which input?
>
> If you get multibyte data from a form and want to perform string operations on it such as strlen(), ereg(), etc... would you not need mb_* or iconv_* functions?

In such a case, yes, I do. But I don't think that's directly related to the issue..?

Moriyoshi
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On December 12, 2003 08:11 pm, Moriyoshi Koizumi wrote:

> Which input?

If you get multibyte data from a form and want to perform string operations on it such as strlen(), ereg(), etc... would you not need mb_* or iconv_* functions?

Ilia
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/13, at 10:11, Moriyoshi Koizumi wrote:

> That is needed when the encoding in which a script is written and the one the form uses to submit to the script.

A few more words were missing:

That is needed when the encoding in which a script is written and the one the form uses to submit to the script are different.

Moriyoshi
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/13, at 10:13, Ilia Alshanetsky wrote:

> It also seems to me that if you are going to be handling multibyte inputs you'd have either the iconv or the mbstring extension enabled.

Which input? If you are talking about form inputs, we rarely need the functionality (mbstring.encoding_conversion). That is needed when the encoding in which a script is written and the one the form uses to submit to the script. Such a situation might look weird to you, but it happens when the output of the script is automagically translated by mb_output_handler(). Please take a look at the relevant documentation.

Moriyoshi
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On December 12, 2003 07:54 pm, Moriyoshi Koizumi wrote:

> On 2003/12/13, at 9:47, Ilia Alshanetsky wrote:
> > Without mbstring enabled, you would not be able to effectively work with multibyte characters. Therefore even if fgetcsv() would work as you may expect with multibyte strings, that data would not be manageable in most cases.

What if you want to alter the case of the data, or use it as a basis for sending e-mail, or check it against another string via a regular expression? Sure, if you just want to read them and echo them back in the form of a table you don't need mbstring.

It also seems to me that if you are going to be handling multibyte inputs you'd have either the iconv or the mbstring extension enabled.

Ilia
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/13, at 9:56, Ilia Alshanetsky wrote:

> On December 12, 2003 03:15 pm, Moriyoshi Koizumi wrote:
> > If we limited the support to UTF-8 or EUC encoding only, we'd be able to drastically gain much better performance. But it won't actually solve practical problems where it is in action.
>
> Could iconv stream filters be used to convert various encodings (if needed) to UTF-8, thus addressing the problem?

Actually it might do, but doing so leads to great overhead, because you have to reconvert the strings to restore the initial form. Besides, the conversion is sometimes irreversible. And even then it wouldn't make sense unless the iconv extension becomes built-in.

Moriyoshi
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/13, at 9:47, Ilia Alshanetsky wrote:

> Without mbstring enabled, you would not be able to effectively work with multibyte characters. Therefore even if fgetcsv() would work as you may expect with multibyte strings, that data would not be manageable in most cases.

That's a bogus argument. In what case do you think they should be effectively managed with mbstring functions..?

As PHP's functions are for the most part complete in their roles and need no additional operations in most cases, I don't have to use mbstring functions there. String manipulation functions provided by mbstring are atomic and so they are used in just the same way as their standard counterparts like substr() and strpos(). I see few cases where the strings fetched by fgetcsv() have to be modified by substr() or whatever..

Moriyoshi
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On December 12, 2003 03:15 pm, Moriyoshi Koizumi wrote:

> If we limited the support to UTF-8 or EUC encoding only, we'd be able to drastically gain much better performance. But it won't actually solve practical problems where it is in action.

Could iconv stream filters be used to convert various encodings (if needed) to UTF-8, thus addressing the problem?

Ilia
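The stream-filter idea asked about above can be sketched roughly as follows. This is only an illustration under stated assumptions: it relies on the iconv extension providing the convert.iconv.* stream filters, the source encoding (ISO-8859-1) and the sample row are made up, and it is not the patch under discussion.

```php
<?php
// Illustration only: assumes the iconv extension is available; sample data is made up.
$fp = fopen('php://temp', 'w+b');
fwrite($fp, "caf\xE9,na\xEFve\n"); // one CSV row encoded as ISO-8859-1
rewind($fp);

// Transcode on the fly while reading: ISO-8859-1 in, UTF-8 out.
stream_filter_append($fp, 'convert.iconv.ISO-8859-1/UTF-8', STREAM_FILTER_READ);

// fgetcsv() now only ever sees UTF-8 bytes.
$row = fgetcsv($fp, 0, ',', '"', '\\');
fclose($fp);

// Compare against the UTF-8 byte sequences for the same two words.
var_dump($row === ["caf\xC3\xA9", "na\xC3\xAFve"]);
```

Note that the fields come back as UTF-8. A caller who needs the data in its original encoding has to convert it again afterwards, and not every conversion can be undone without loss.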
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
Before we get too far off-topic (fgetcsv()), let me just quickly summarize my position. I think it's great that multi-byte support exists in PHP; we have the mbstring, iconv and recode extensions that all help make PHP work with multibyte strings. As with most things in PHP, the user has a choice of whether or not to enable this functionality depending on their present and future needs. By putting this functionality into core/standard this choice is taken away; moreover, it penalizes (rather significantly) every single user of the affected functionality. That is what I'd like to avoid.

Without mbstring enabled, you would not be able to effectively work with multibyte characters. Therefore even if fgetcsv() would work as you may expect with multibyte strings, that data would not be manageable in most cases. Which is why I propose that a multibyte version of fgetcsv() be placed inside ext/mbstring as mb_fgetcsv().

Ilia
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On December 12, 2003 07:02 pm, Rasmus Lerdorf wrote:

> Ilia, we need to try to avoid this sort of thinking. This "vast majority" is most likely only a "vocal majority" these days. It is very likely that the non-mb users are actually the "few" and if we continue along your way of thinking then we need to have an ext/singlebyte that implements all these weird singlebyte string manipulation functions.

There is a good chance you are correct. However, my assumption is not without basis; please consider the following statistic: Google finds 185,000 (or so) phpinfo() pages, and when mbstring is added to the search query only 8150 pages are found. That leads me to believe that 1/2% of that userbase uses mbstring. Even if we were to say that all the people who have mbstring compiled in use it (which is highly unlikely), it's still only 1/2%.

Ilia
RE: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
blimey, I just agreed with Rasmus..

> -----Original Message-----
> From: Rasmus Lerdorf [mailto:[EMAIL PROTECTED]
> Sent: 13 December 2003 00:03
> To: Ilia Alshanetsky
> Cc: Moriyoshi Koizumi; PHP Internals
> Subject: Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
>
> > Why does a vast majority of users have to endure degradation in performance for functionality that is needed by a few? It's as simple as that. Same argument applies to basename().
>
> Ilia, we need to try to avoid this sort of thinking. This "vast majority" is most likely only a "vocal majority" these days. It is very likely that the non-mb users are actually the "few" and if we continue along your way of thinking then we need to have an ext/singlebyte that implements all these weird singlebyte string manipulation functions.
>
> We need to move towards a uniform platform that works for everyone without putting undue strain on either side.
>
> -Rasmus
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
Quoting Ilia Alshanetsky <[EMAIL PROTECTED]>:

> On December 12, 2003 05:38 pm, Moriyoshi Koizumi wrote:
> > And I don't think fgetcsv() is an exception, since htmlentities() can be referred to as an example that is placed in core and supports multibyte strings. As I mentioned, purging that kind of functionality into the mbstring extension doesn't solve the problem in practice by any means.
>
> htmlentities() is a rather special function; it handles not only multibyte but a whole lot of different & unusual things. I do not think you can fairly compare it to fgetcsv(). We have a multibyte extension for people who need that functionality, why force it on everyone else?
>
> > >> 2) IMO speed is not a key factor here. People rather want trustworthy behaviour.
> > >
> > > When it's a few percent and the changes offer significant improvements yes, but when we are talking about a performance loss of 250-300% or more then performance must become a consideration as well.
> >
> > If there are virtually no ways to improve it, it'd be natural to me we dismiss the issue.
>
> Why does a vast majority of users have to endure degradation in performance for functionality that is needed by a few? It's as simple as that. Same argument applies to basename().

Just a general note on this discussion becoming sort of a "meta"-topic:

From a PHP developer's POV, complete charset support should be a key technology for ZE and the standard extensions, as XML now is for PHP5 as a whole. While the comparison might be a bit strange, it even reminds me of the relation between these two: the "standard" encoding for XML data is a multibyte charset. But the real problem is that it's *really* hard for developers outside of the "multibyte" world to understand the ins and outs of these charsets and how to handle them correctly.

It was a PITA to make the whole Horde framework charset independent without knowing anything about mb charsets and their support in PHP. I did this due to popular demand, because there are a *lot* of people using/needing mb charsets. It would be great if other developers didn't have to take this steep road, because PHP would support these out of the box.

While I was writing this message, Rasmus made my point in fewer words. ;-)

Jan.
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
> Why does a vast majority of users have to endure degradation in performance for functionality that is needed by a few? It's as simple as that. Same argument applies to basename().

Ilia, we need to try to avoid this sort of thinking. This "vast majority" is most likely only a "vocal majority" these days. It is very likely that the non-mb users are actually the "few" and if we continue along your way of thinking then we need to have an ext/singlebyte that implements all these weird singlebyte string manipulation functions.

We need to move towards a uniform platform that works for everyone without putting undue strain on either side.

-Rasmus
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/13, at 8:23, Ilia Alshanetsky wrote:

> On December 12, 2003 05:38 pm, Moriyoshi Koizumi wrote:
> > And I don't think fgetcsv() is an exception, since htmlentities() can be referred to as an example that is placed in core and supports multibyte strings. As I mentioned, purging that kind of functionality into the mbstring extension doesn't solve the problem in practice by any means.
>
> htmlentities() is a rather special function; it handles not only multibyte but a whole lot of different & unusual things. I do not think you can fairly compare it to fgetcsv().

What are you referring to as "a whole lot of different & unusual things"?

> We have a multibyte extension for people who need that functionality, why force it on everyone else?

Because it's a bug. The multibyte extension we have is not provided to make life easier for those who don't use multibyte encodings. It exists as an extension because we had to do it that way in the past.

> > > > 2) IMO speed is not a key factor here. People rather want trustworthy behaviour.
> > >
> > > When it's a few percent and the changes offer significant improvements yes, but when we are talking about a performance loss of 250-300% or more then performance must become a consideration as well.
> >
> > If there are virtually no ways to improve it, it'd be natural to me we dismiss the issue.
>
> Why does a vast majority of users have to endure degradation in performance for functionality that is needed by a few? It's as simple as that. Same argument applies to basename().

You must be underestimating the number of people who *actually* need it.

> > One thing I'm talking about here is escaping behaviour, which I mentioned in the previous mail.
>
> I believe it would be possible to implement in the 4.3.X code, however it sounds specific to multibyte implementation.

Escaping behaviour is totally irrelevant to the multibyte issue. I think users should be able to choose by an optional argument whether \" has to be treated as an escaped quote or as a simple sequence of a backslash and a quote.

Moriyoshi
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On December 12, 2003 05:38 pm, Moriyoshi Koizumi wrote:

> And I don't think fgetcsv() is an exception, since htmlentities() can be referred to as an example that is placed in core and supports multibyte strings. As I mentioned, purging that kind of functionality into the mbstring extension doesn't solve the problem in practice by any means.

htmlentities() is a rather special function; it handles not only multibyte but a whole lot of different & unusual things. I do not think you can fairly compare it to fgetcsv(). We have a multibyte extension for people who need that functionality, why force it on everyone else?

> >> 2) IMO speed is not a key factor here. People rather want trustworthy behaviour.
> >
> > When it's a few percent and the changes offer significant improvements yes, but when we are talking about a performance loss of 250-300% or more then performance must become a consideration as well.
>
> If there are virtually no ways to improve it, it'd be natural to me we dismiss the issue.

Why does a vast majority of users have to endure degradation in performance for functionality that is needed by a few? It's as simple as that. Same argument applies to basename().

> One thing I'm talking about here is escaping behaviour, which I mentioned in the previous mail.

I believe it would be possible to implement in the 4.3.X code, however it sounds specific to multibyte implementation.

Ilia
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/13, at 7:28, Ilia Alshanetsky wrote:

> On December 12, 2003 04:18 pm, Moriyoshi Koizumi wrote:
> > I disagree, because of the following reasons:
> >
> > 1) Not a few people *actually* use fgetcsv() commonly with multibyte characters indeed. Regarding this, applications made by those who don't use such characters don't (and won't) use multibyte specific functions and that's the problem. This greatly prevents them from being portable.
>
> People have lived without multibyte support in fgetcsv() for many years now, and I did not see a single request on bugs.php.net for fgetcsv() multi-byte support. So, while this is certainly useful functionality I do not believe it is as widely needed as you say it is. We also have a multibyte extension that already implements multi-byte safe variants of common functions, why make an exception for fgetcsv() and add multibyte code into core?

I admit that few requests are in sight, but it's the case that those who really want that support don't advocate it in English.

And I don't think fgetcsv() is an exception, since htmlentities() can be referred to as an example that is placed in core and supports multibyte strings. As I mentioned, purging that kind of functionality into the mbstring extension doesn't solve the problem in practice by any means.

> > 2) IMO speed is not a key factor here. People rather want trustworthy behaviour.
>
> When it's a few percent and the changes offer significant improvements yes, but when we are talking about a performance loss of 250-300% or more then performance must become a consideration as well.

If there are virtually no ways to improve it, it'd be natural to me we dismiss the issue.

> > 3) fgetcsv() implementation in the stable branch is now too complicated to add a new feature to and also hard to maintain. We should be able to eliminate the mblen() calls for acceptable performance. See the attached result.
>
> What features are we talking about here? The only 2 features I can see we may wish to add are >1 char long enclosures and separators and the binary thing. Both of these features would be fairly trivial to add.

One thing I'm talking about here is escaping behaviour, which I mentioned in the previous mail.

Moriyoshi
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On December 12, 2003 04:18 pm, Moriyoshi Koizumi wrote:

> I disagree, because of the following reasons:
>
> 1) Not a few people *actually* use fgetcsv() commonly with multibyte characters indeed. Regarding this, applications made by those who don't use such characters don't (and won't) use multibyte specific functions and that's the problem. This greatly prevents them from being portable.

People have lived without multibyte support in fgetcsv() for many years now, and I did not see a single request on bugs.php.net for fgetcsv() multi-byte support. So, while this is certainly useful functionality I do not believe it is as widely needed as you say it is. We also have a multibyte extension that already implements multi-byte safe variants of common functions, why make an exception for fgetcsv() and add multibyte code into core?

> 2) IMO speed is not a key factor here. People rather want trustworthy behaviour.

When it's a few percent and the changes offer significant improvements yes, but when we are talking about a performance loss of 250-300% or more then performance must become a consideration as well.

> 3) fgetcsv() implementation in the stable branch is now too complicated to add a new feature to and also hard to maintain. We should be able to eliminate the mblen() calls for acceptable performance. See the attached result.

What features are we talking about here? The only 2 features I can see we may wish to add are >1 char long enclosures and separators and the binary thing. Both of these features would be fairly trivial to add.

> p.s. fgetcsv() in the stable branch still seems to segfault with the attached test case (segfault.php.txt).

Writing a fix now, thanks for the heads-up. If you have any more please let me know.
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On Fri, 12 Dec 2003, Derick Rethans wrote:

> On Sat, 13 Dec 2003, Moriyoshi Koizumi wrote:
>
> > On 2003/12/13, at 4:42, Ilia Alshanetsky wrote:
> >
> > > On a related note I should mention that fgetcsv() in 4.3.X is currently 2.5 times faster than its equivalent in 5.X.
> >
> > I don't know why you're mentioning this at this time, but I can say it is a sort of necessary evil :) Because the HEAD version is capable of handling various encodings, and less intricate IMO. Rather, I was surprised about that result, it's only 2.5 times slower :)
>
> I would call that rather unacceptable actually. Isn't it possible to create a new function for this which handles this MB 'crap' (and the same for

"crap" is a poor choice of words; I had no plans to insult you in any way. My apologies for that.

Derick
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/13, at 5:45, Derick Rethans wrote:

> I would call that rather unacceptable actually. Isn't it possible to create a new function for this which handles this MB 'crap' (and the same for basename) so that we don't have to lose performance because of those issues?

Don't you think "crap" sounds too disgusting and inappropriate? Stop such wording here.

Moriyoshi
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/13, at 5:51, Ilia Alshanetsky wrote:
> How about we add mb_fgetcsv(), which would have full multi-byte support
> (including delimiters). I'd imagine for people who need to parse multi-byte
> csv files, full functionality is more important than speed.
>
> As for the fgetcsv() in ext/standard/, we can port the 4.3.X code (copy &
> paste really) and let PHP 5 users benefit from a faster fgetcsv() for common
> applications.
>
> What do you think?

I disagree, because of the following reasons:

1) Not a few people *actually* use fgetcsv() commonly with multibyte characters indeed. Regarding this, applications made by those who don't use such characters don't (and won't) use multibyte specific functions and that's the problem. This greatly prevents them from being portable.

2) IMO speed is not a key factor here. People rather want trustworthy behaviour.

3) fgetcsv() implementation in the stable branch is now too complicated to add a new feature to and also hard to maintain. We should be able to eliminate the mblen() calls for acceptable performance. See the attached result.

Moriyoshi

p.s. fgetcsv() in the stable branch still seems to segfault with the attached test case (segfault.php.txt).

[The benchmark result]

My code with mblen() (on php5-csv):
real    0m1.389s
user    0m1.330s
sys     0m0.060s

Ditto without mblen():
real    0m0.396s
user    0m0.350s
sys     0m0.040s

Your code (on php4-csv):
real    0m0.332s
user    0m0.270s
sys     0m0.060s

Index: ext/standard/php_string.h
===
RCS file: /repository/php-src/ext/standard/php_string.h,v
retrieving revision 1.83
diff -u -r1.83 php_string.h
--- ext/standard/php_string.h 10 Dec 2003 21:23:35 - 1.83
+++ ext/standard/php_string.h 12 Dec 2003 21:16:09 -
@@ -144,15 +144,7 @@
 #define strerror php_strerror
 #endif
 
-#ifndef HAVE_MBLEN
-# define php_mblen(ptr, len) 1
-#else
-# if defined(_REENTRANT) && defined(HAVE_MBRLEN) && defined(HAVE_MBSTATE_T)
-# define php_mblen(ptr, len) ((ptr) == NULL ? mbsinit(&BG(mblen_state)): (int)mbrlen(ptr, len, &BG(mblen_state)))
-# else
-# define php_mblen(ptr, len) mblen(ptr, len)
-# endif
-#endif
+#define php_mblen(ptr, len) 1
 
 void register_string_constants(INIT_FUNC_ARGS);
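For readers who want to see what the patch above trades away, here is a minimal C sketch (not the php-src code) of the two scanning styles being compared: one asks the C library for the byte length of every character through mblen(), the other assumes one byte per character, which is what `#define php_mblen(ptr, len) 1` amounts to. The function names are hypothetical, and the sketch says nothing about where the time actually goes inside fgetcsv(); it only shows that the first loop makes one locale-dependent library call per character. For scale, the "real" timings quoted above are 1.389s with mblen() and 0.396s without, roughly a factor of 3.5.

```c
#include <stdlib.h>
#include <stddef.h>

/* Hypothetical helper: walk buf one character at a time, asking mblen()
 * (whose answer depends on the current locale) how many bytes each
 * character occupies. Returns the number of steps taken. */
size_t count_steps_mblen(const char *buf, size_t len)
{
    size_t i = 0, steps = 0;

    (void)mblen(NULL, 0);              /* reset any shift state */
    while (i < len) {
        int n = mblen(buf + i, len - i);
        if (n < 1) {
            n = 1;                     /* NUL byte or invalid sequence: advance one byte */
        }
        i += (size_t)n;
        steps++;
    }
    return steps;
}

/* Hypothetical helper: the same walk with the character length fixed at 1,
 * i.e. every byte is treated as a complete character. */
size_t count_steps_fixed(const char *buf, size_t len)
{
    size_t i = 0, steps = 0;

    (void)buf;                         /* contents are never inspected */
    while (i < len) {
        i += 1;
        steps++;
    }
    return steps;
}
```

In the default "C" locale both functions take the same number of steps on ASCII input; they only diverge once a multibyte locale is active and the input contains multibyte characters.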
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On Sat, 13 Dec 2003, Moriyoshi Koizumi wrote:
> On 2003/12/13, at 4:42, Ilia Alshanetsky wrote:
>
> > On a related note I should mention that fgetcsv() in 4.3.X is
> > currently 2.5
> > times faster than its equivalent in 5.X.
>
> I don't know why you're mentioning this at this time,
> but I can say it is a sort of necessary evil :) Because the HEAD
> version is capable of handling various encodings, and
> less intricate IMO. Rather, I was surprised about that result,
> it's only 2.5 times slower :)

I would call that rather unacceptable actually. Isn't it possible to create a new function for this which handles this MB 'crap' (and the same for basename) so that we don't have to lose performance because of those issues?

regards,
Derick
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
How about we add mb_fgetcsv(), which would have full multi-byte support (including delimiters). I'd imagine for people who need to parse multi-byte csv files, full functionality is more important than speed.

As for the fgetcsv() in ext/standard/, we can port the 4.3.X code (copy & paste really) and let PHP 5 users benefit from a faster fgetcsv() for common applications.

What do you think?

Ilia
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/13, at 5:09, Ilia Alshanetsky wrote:
> I am mentioning this now because we are considering changes to the function
> in the development branch, which is a fine time to resolve any deficiencies.

Okay, fine :)

> The added functionality, which if I understand correctly is support for
> multibyte delimiters and enclosures is great. But it hardly explains a

The change was not for multibyte delimiters and enclosures. The current implementation still allows only single-byte characters for the delimiter and enclosure. I could have added that capability as well, but I didn't because it appeared to slow things down considerably.

Several multibyte encodings, such as CP932, CP936, CP949, CP950 and Shift_JIS, may use a value in the range 0x40 - 0xfe as the second byte of a character, so a trailing byte can be mistaken for a single-byte character, and that had been a problem. We therefore need to check whether the octet at a given position belongs to a multibyte character or not, and this motivated me to bring a scanner-like finite-state machine implementation into fgetcsv() (and basename()). See http://www.microsoft.com/globaldev/reference/WinCP.mspx for details.

> significant performance disparity I am seeing. I believe much of the problem
> can be solved by moving from manual string iteration to one using C library
> functions such as memchr(). When parsing non-multibyte text there shouldn't
> be more than 10-15% performance loss.
>
> I should mention that benchmarks were made using the time utility, so
> advantages offered by PHP 5's speedups were discounted. Had they been
> considered the speed loss would've been 300% or more.

If we limited the support to UTF-8 or EUC encodings only, we could get drastically better performance. But that would not solve the practical problems in the places where the function is actually used.

Moriyoshi
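The trailing-byte problem described above can be illustrated with a small C sketch. It is not the finite-state machine from the patch; it is a simplified two-state scan that handles Shift_JIS only, and it assumes the commonly documented lead-byte ranges 0x81-0x9F and 0xE0-0xFC. The function names are hypothetical. The example input relies on the Shift_JIS encoding of the character 表, which is the byte pair 0x95 0x5C, so its second byte is identical to an ASCII backslash.

```c
#include <stddef.h>

/* Assumption: Shift_JIS/CP932 lead bytes fall in 0x81-0x9F or 0xE0-0xFC. */
int sjis_is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Naive scan: returns the index of the first byte equal to target,
 * or -1 if there is none. It cannot tell a trailing byte from a
 * real single-byte character. */
long find_byte_naive(const unsigned char *s, size_t len, unsigned char target)
{
    size_t i;

    for (i = 0; i < len; i++) {
        if (s[i] == target) {
            return (long)i;
        }
    }
    return -1;
}

/* Encoding-aware scan: whenever a lead byte is seen, skip the trailing
 * byte that follows it, so only real single-byte characters are compared
 * against target. */
long find_byte_sjis(const unsigned char *s, size_t len, unsigned char target)
{
    size_t i = 0;

    while (i < len) {
        if (sjis_is_lead(s[i]) && i + 1 < len) {
            i += 2;                    /* skip lead byte and its trailing byte */
            continue;
        }
        if (s[i] == target) {
            return (long)i;
        }
        i++;
    }
    return -1;
}
```

Given the bytes 0x95 0x5C 0x5C 'a' (表 followed by a real backslash), the naive scan reports the backslash at index 1, inside the two-byte character, while the encoding-aware scan reports it at index 2.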
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On December 12, 2003 02:40 pm, Moriyoshi Koizumi wrote:
> I don't know why you're mentioning this at this time,
> but I can say it is a sort of necessary evil :) Because the HEAD
> version is capable of handling various encodings, and
> less intricate IMO. Rather, I was surprised about that result,
> it's only 2.5 times slower :)

I am mentioning this now because we are considering changes to the function in the development branch, which is a fine time to resolve any deficiencies.

The added functionality, which if I understand correctly is support for multibyte delimiters and enclosures, is great. But it hardly explains the significant performance disparity I am seeing. I believe much of the problem can be solved by moving from manual string iteration to one using C library functions such as memchr(). When parsing non-multibyte text there shouldn't be more than 10-15% performance loss.

I should mention that benchmarks were made using the time utility, so advantages offered by PHP 5's speedups were discounted. Had they been considered, the speed loss would've been 300% or more.

Ilia
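As an illustration of the memchr() suggestion, here is a minimal C sketch that finds single-byte delimiters by handing the search to the C library instead of testing each byte in a hand-written loop. It is only a sketch under stated assumptions: the function name is hypothetical, enclosures and escape characters are ignored, and the input is assumed to be single-byte text, so it is not a replacement for the real parser.

```c
#include <string.h>
#include <stddef.h>

/* Hypothetical helper: counts the delimiter-separated fields in
 * buf[0..len) by repeatedly calling memchr() to jump to the next
 * delimiter. An empty buffer counts as one (empty) field. */
size_t count_fields_memchr(const char *buf, size_t len, char delim)
{
    const char *p = buf;
    const char *end = buf + len;
    size_t fields = 1;

    while (p < end) {
        const char *hit = memchr(p, delim, (size_t)(end - p));
        if (hit == NULL) {
            break;                     /* no more delimiters in the rest of the buffer */
        }
        fields++;
        p = hit + 1;                   /* resume just past the delimiter */
    }
    return fields;
}
```

For example, the six bytes `a,b,,c` split on `,` yield four fields, the third of which is empty.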
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/13, at 4:42, Ilia Alshanetsky wrote:
> On a related note I should mention that fgetcsv() in 4.3.X is currently 2.5
> times faster than its equivalent in 5.X.

I don't know why you're mentioning this at this time, but I can say it is a sort of necessary evil :) Because the HEAD version is capable of handling various encodings, and less intricate IMO. Rather, I was surprised about that result, it's only 2.5 times slower :)

Moriyoshi
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On December 12, 2003 02:02 pm, Rasmus Lerdorf wrote:
> I agree that it would be a good idea to provide a mechanism to do that,
> but at this point I don't think we should be changing the behaviour of
> fgetcsv() in either the stable branch or the HEAD branch. I'd add a new
> binary-safe version of the function instead for this. Or an optional arg,
> but fgetcsv() already has 2 optional args.

I think we could add another optional argument (a bitmask) that could be used to control various capabilities of fgetcsv(). So, if another tunable behavior is necessary, it could easily be added without breaking BC.

On a related note I should mention that fgetcsv() in 4.3.X is currently 2.5 times faster than its equivalent in 5.X.

Ilia
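To make the bitmask idea concrete, here is a minimal C sketch of how a single flags argument could switch behaviours on and off without changing the function signature again. Everything in it is hypothetical: the flag names, the helper, and the assumption that the default behaviour trims trailing spaces and tabs from a field are illustrative only and are not taken from fgetcsv() itself.

```c
#include <stddef.h>

/* Hypothetical option bits; each new tunable gets the next free bit,
 * so existing callers that pass 0 keep the default behaviour. */
enum {
    CSV_KEEP_WHITESPACE = 1 << 0,      /* do not trim the field */
    CSV_DOUBLED_QUOTES  = 1 << 1       /* reserved here to show a second option */
};

/* Hypothetical helper: returns the length of the field after the
 * (assumed) default trimming of trailing spaces and tabs, or the
 * untouched length when CSV_KEEP_WHITESPACE is set. */
size_t csv_field_length(const char *field, size_t len, unsigned flags)
{
    if (flags & CSV_KEEP_WHITESPACE) {
        return len;                    /* binary-safe: leave the data as read */
    }
    while (len > 0 && (field[len - 1] == ' ' || field[len - 1] == '\t')) {
        len--;
    }
    return len;
}
```

Options combine with bitwise OR, for instance `CSV_KEEP_WHITESPACE | CSV_DOUBLED_QUOTES`, which is what keeps such an argument extensible without breaking backward compatibility.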
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On 2003/12/13, at 4:02, Rasmus Lerdorf wrote:
> On Fri, 12 Dec 2003, Ilia Alshanetsky wrote:
> > That said, the whole space trimming behavior seems a little unusual since
> > it will corrupt content especially if said content contains binary data.
> > IMHO the data read by fgetcsv() should be fetched in such a manner so that
> > the original string can be recreated.
>
> I agree that it would be a good idea to provide a mechanism to do that,
> but at this point I don't think we should be changing the behaviour of
> fgetcsv() in either the stable branch or the HEAD branch. I'd add a new
> binary-safe version of the function instead for this. Or an optional arg,
> but fgetcsv() already has 2 optional args.

My opinion is basically the same as Ilia's. I also think it would be a good idea to introduce a few more options to modify the escaping behaviour. Escape characters like \ are treated specially at the moment, while Microsoft's de facto specification uses the doubled-quotes style instead. IMO we should be able to choose the behaviour.

Then, where do we go from here? Eventually we'll need to change the spec of its arguments, or add a new function as Rasmus said...

Moriyoshi
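For reference, the doubled-quotes style mentioned above represents a literal quote character inside a quoted field by writing it twice. The following C sketch decodes the inside of such a field; the function name is hypothetical, the surrounding enclosure quotes are assumed to have been removed already, and the caller is assumed to provide an output buffer at least as large as the input.

```c
#include <stddef.h>

/* Hypothetical helper: copies in[0..len) to out, collapsing every
 * doubled quote ("") into a single quote ("). Returns the number of
 * bytes written; out must have room for at least len bytes. */
size_t unescape_doubled_quotes(const char *in, size_t len, char *out)
{
    size_t i = 0, o = 0;

    while (i < len) {
        if (in[i] == '"' && i + 1 < len && in[i + 1] == '"') {
            out[o++] = '"';            /* "" stands for one literal quote */
            i += 2;
        } else {
            out[o++] = in[i++];        /* every other byte is copied unchanged */
        }
    }
    return o;
}
```

With this convention a backslash is an ordinary byte, which is the difference from the backslash-escape handling described in the message.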
Re: [PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On Fri, 12 Dec 2003, Ilia Alshanetsky wrote:
> That said, the whole space trimming behavior seems a little unusual since it
> will corrupt content especially if said content contains binary data. IMHO
> the data read by fgetcsv() should be fetched in such a manner so that the
> original string can be recreated.

I agree that it would be a good idea to provide a mechanism to do that, but at this point I don't think we should be changing the behaviour of fgetcsv() in either the stable branch or the HEAD branch. I'd add a new binary-safe version of the function instead for this. Or an optional arg, but fgetcsv() already has 2 optional args.

-Rasmus
[PHP-DEV] Re: Regarding the latest patch on fgetcsv() (stable branch)
On December 12, 2003 01:36 pm, Moriyoshi Koizumi wrote:
> What do you think of this?

I'll apply a fix momentarily; it wouldn't do to break BC in the stable branch.

That said, the whole space trimming behavior seems a little unusual since it will corrupt content, especially if said content contains binary data. IMHO the data read by fgetcsv() should be fetched in such a manner that the original string can be recreated.

Ilia