RE: [PHP-DEV] foreach() for strings
> -Original Message-
> From: Jan Schneider [mailto:j...@horde.org]
>
> And if that very same string that's supposed to be an array is
> processed using the $var[$n] syntax nowadays is any different? It's
> not, you won't get an error message for that either, and it's the same
> amount of work to track this down. Granted, making PHP behaving the
> same in foreach gives you one more place to track down such errors,
> but making it easier to track down developer errors is not anything
> that should keep PHP from adding new features.
>
> Jan.

In theory, yes, but in practice this doesn't seem to happen with any frequency (actually, I'm having a hard time thinking of a time when this has EVER happened to me). On the other hand, warnings about foreach getting something that wasn't iterable are commonplace for me, and more often than not, it is a string.

I think it's perfectly appropriate for any language to avoid "high risk" features (features that are likely to result in buggy code, or features that are likely to result in bugs evading detection). My code has enough bugs already, so any language feature that finds my bugs for me is more than welcome.

Consider implicit vs. explicit returns. If a function always returns the value of the last statement (implicit), this is likely to result in unpredictable behavior and hidden bugs, when a warning could have been issued instead. Typing "return" clarifies intent and is a very small price to pay to avoid those errors.

In this case, typing "new TextIterator()" in the handful of cases where you actually needed to iterate a string is a VERY small price to pay for:

1. The ability to get meaningful warnings when you didn't intend to iterate the string (by far the more likely scenario)
2. The ability to easily fix your code when you decide that a universal character set really is valuable
3. The ability to clearly see the intent of the code

John Crenshaw
Priacta, Inc.
-- PHP Internals - PHP Runtime Development Mailing List To unsubscribe, visit: http://www.php.net/unsub.php
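The asymmetry argued over in this exchange can be reproduced in a few lines. This is a minimal sketch, assuming PHP 8, where foreach over a string raises a warning while `$var[0]` on the same string silently yields its first byte. The variable names and the error-handler wiring are illustrative and do not come from the thread.

```php
<?php
// Collect diagnostics instead of printing them (illustrative wiring only).
$warnings = [];
set_error_handler(function (int $no, string $msg) use (&$warnings): bool {
    $warnings[] = $msg;
    return true; // treat the diagnostic as handled
});

$rows = 'oops';      // a string where an array was intended

$first = $rows[0];   // offset access: silently yields the first byte, 'o'

$iterations = 0;
foreach ($rows as $row) {   // foreach: emits a warning and skips the loop body
    $iterations++;
}

restore_error_handler();
```

After this runs, `$first` holds `'o'` with no diagnostic, while `$warnings` holds the single message produced by the foreach. That one message is the signal John Crenshaw wants to keep.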
Re: [PHP-DEV] foreach() for strings
Quoting Larry Garfield:

> On 06/20/2011 10:25 AM, John Crenshaw wrote:
>> Doing this with an explicit iterator object is a fine idea. The syntax becomes something like:
>>
>> foreach(new TextIterator($s, 'UTF8') as $pos=>$c) { ... }
>>
>> On the other hand, I think that trying to support iteration without using an iterator object to mediate would be a disaster, and I'm opposed to doing something like that because:
>>
>> 1. The code just looks wrong. PHP developers are generally insulated from the char-arrayness of strings. In addition, since PHP isn't typesafe, the code becomes highly ambiguous. Is the code iterating an array, or a string? It is very hard to tell just by looking. It may be convenient to write, but it's certainly not convenient to read or maintain later. On the other hand, with a mediating iterator object, the intent becomes obvious, and the code is highly readable.
>>
>> 2. The odds of iterating any given string are slim at best. Supporting current, key, next, etc. would require the string object internally to get bloated with additional unnecessary data that is almost never used. This bloat isn't a single int either. For optimal performance it would need to consist of no less than two size_t (char position and binary position), and one encoding indicator.
>>
>> 3. Iteration cannot work without knowing which encoding to use for the string. Is it UTF8? UTF16? UTF7? Binary or some single byte encoding? Some other exotic wide encoding? Without an iterator object in the middle, there is no way to specify this encoding. Always treating this as binary would also be a mistake, since this is almost certainly never actually the correct behavior, even though it may often appear to behave correctly with simple inputs.
>>
>> 4. I've had simple mistakes caught numerous times when foreach complains about getting a scalar rather than an array. So far, it has been exactly right every time. Allowing strings to be iterated would, in the name of convenience, increase the probability of stupid mistakes evading detection. Even worse, the code itself would look logically correct until the developer finally realizes that they have a string and not an array. Errors like this are probably far more common in most projects than the need to iterate a string, so making this change hurts debugging in the common case, for the sake of syntactic sugar in the rare case. Not a good trade.
>>
>> John Crenshaw
>> Priacta, Inc.
>
> I would echo John's statements here. foreach() directly iterating a string is going to make my life substantially harder. I work in array-heavy systems, and "bad first argument for foreach()" is already a hard enough error to track down. It means "somewhere, somehow, you put a string where you meant to put an array. GLWT." Adding automatic string iteration would take away even that error message and leave me with no way to figure out why my code is randomly misbehaving. Just looking at the code, I would have no way of knowing that such a bug lurks within. That's the downside of a weakly typed but still typed language.

And how is that any different from that very same string, which is supposed to be an array, being processed using the $var[$n] syntax nowadays? It's not: you won't get an error message for that either, and it's the same amount of work to track this down. Granted, making PHP behave the same in foreach gives you one more place to track down such errors, but making it easier to track down developer errors is not anything that should keep PHP from adding new features.

Jan.

--
Do you need professional PHP or Horde consulting?
http://horde.org/consulting/
Re: [PHP-DEV] foreach() for strings
On 06/20/2011 10:25 AM, John Crenshaw wrote:
> Doing this with an explicit iterator object is a fine idea. The syntax becomes something like:
>
> foreach(new TextIterator($s, 'UTF8') as $pos=>$c) { ... }
>
> On the other hand, I think that trying to support iteration without using an iterator object to mediate would be a disaster, and I'm opposed to doing something like that because:
>
> 1. The code just looks wrong. PHP developers are generally insulated from the char-arrayness of strings. In addition, since PHP isn't typesafe, the code becomes highly ambiguous. Is the code iterating an array, or a string? It is very hard to tell just by looking. It may be convenient to write, but it's certainly not convenient to read or maintain later. On the other hand, with a mediating iterator object, the intent becomes obvious, and the code is highly readable.
>
> 2. The odds of iterating any given string are slim at best. Supporting current, key, next, etc. would require the string object internally to get bloated with additional unnecessary data that is almost never used. This bloat isn't a single int either. For optimal performance it would need to consist of no less than two size_t (char position and binary position), and one encoding indicator.
>
> 3. Iteration cannot work without knowing which encoding to use for the string. Is it UTF8? UTF16? UTF7? Binary or some single byte encoding? Some other exotic wide encoding? Without an iterator object in the middle, there is no way to specify this encoding. Always treating this as binary would also be a mistake, since this is almost certainly never actually the correct behavior, even though it may often appear to behave correctly with simple inputs.
>
> 4. I've had simple mistakes caught numerous times when foreach complains about getting a scalar rather than an array. So far, it has been exactly right every time. Allowing strings to be iterated would, in the name of convenience, increase the probability of stupid mistakes evading detection. Even worse, the code itself would look logically correct until the developer finally realizes that they have a string and not an array. Errors like this are probably far more common in most projects than the need to iterate a string, so making this change hurts debugging in the common case, for the sake of syntactic sugar in the rare case. Not a good trade.
>
> John Crenshaw
> Priacta, Inc.

I would echo John's statements here. foreach() directly iterating a string is going to make my life substantially harder. I work in array-heavy systems, and "bad first argument for foreach()" is already a hard enough error to track down. It means "somewhere, somehow, you put a string where you meant to put an array. GLWT." Adding automatic string iteration would take away even that error message and leave me with no way to figure out why my code is randomly misbehaving. Just looking at the code, I would have no way of knowing that such a bug lurks within. That's the downside of a weakly typed but still typed language.

A proper iterator class, however, makes a great deal of sense. It could be implemented in user space fairly easily, no doubt, but for strings of any appreciable size (like the OP seems to be talking about for code parsing) I suspect performance and memory usage would be far better if implemented in C. Whether it's a byte-based or character-set-sensitive-character-based iterator... honestly I don't care as long as it's documented properly.

--Larry Garfield
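For illustration, a user-space iterator along the lines discussed here might look like the sketch below. The thread only names `TextIterator` and shows its call syntax, so everything else is an assumption: the PHP 8 syntax, UTF-8 as the only supported encoding, keys being character positions rather than byte positions, and rejecting malformed input in the constructor. It does keep the two positions (character and binary) that John's point 2 mentions.

```php
<?php
// Hypothetical TextIterator: only the class name and constructor arguments
// appear in the thread; the behaviour below is an assumed design.
final class TextIterator implements Iterator
{
    private string $s;
    private int $bytePos = 0; // binary position in the string
    private int $charPos = 0; // character (code point) position

    public function __construct(string $s, string $encoding = 'UTF8')
    {
        if ($encoding !== 'UTF8') {
            throw new InvalidArgumentException('only UTF8 is sketched here');
        }
        // With the /u modifier preg_match() returns false on malformed UTF-8.
        if (preg_match('//u', $s) !== 1) {
            throw new InvalidArgumentException('input is not valid UTF-8');
        }
        $this->s = $s;
    }

    // Byte width of the character at the current position, read from its
    // lead byte. Safe only because the input was validated above.
    private function width(): int
    {
        $b = ord($this->s[$this->bytePos]);
        if ($b < 0x80) { return 1; }
        if ($b < 0xE0) { return 2; }
        if ($b < 0xF0) { return 3; }
        return 4;
    }

    public function current(): string
    {
        return substr($this->s, $this->bytePos, $this->width());
    }

    public function key(): int
    {
        return $this->charPos;
    }

    public function next(): void
    {
        $this->bytePos += $this->width();
        $this->charPos++;
    }

    public function rewind(): void
    {
        $this->bytePos = 0;
        $this->charPos = 0;
    }

    public function valid(): bool
    {
        return $this->bytePos < strlen($this->s);
    }
}
```

With this in place, `foreach (new TextIterator($s, 'UTF8') as $pos => $c)` walks `$s` one code point at a time, while a plain `foreach ($s as ...)` keeps raising its warning. Larry's performance concern still applies: a C implementation would likely be faster on large inputs.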
Re: [PHP-DEV] foreach() for strings
2011.06.22 14:14 Reindl Harald wrote:
> On 22.06.2011 07:24, Tomas Kuliavas wrote:
>> 2011.06.21 23:27 Reindl Harald wrote:
>>> i do not understand any word and miss a simple str_is_utf8()
>>
>> Such function uses six lines in PHP.
>
> so why do you not post them?

My lines are not public domain. They are GPLed. I am not sure about their performance, but they are executed 15 times on every mailbox listing I do and I don't see problems with that.

I point at issues I have with the available PHP functions, and you start insulting me by calling my complaints "low-lvevel bla" and fail to politely ask me to explain everything in detail. Sorry if my tone insulted you and you decided to start a war instead of investigating things further.

--
Tomas
RE: [PHP-DEV] foreach() for strings
On Wed Jun 22 11:25 AM, Reindl Harald wrote:
>
> and php as primary web-language is missing UTF8 support in the core

You have a valid point.

Now on to "foreach() for strings": any other opinions? It seems the discussion is closed and should most likely move to an RFC, with some consideration of how character iteration could work.
Re: [PHP-DEV] foreach() for strings
FIRST: it is terrible that one person posts in HTML, the next answers on top and another answers at the bottom in the same thread, and that most do not understand that replying to the list address is enough.

On 22.06.2011 17:07, Robert Eisele wrote:
> 1. The number of CHARs isn't unrelevant in a general manner

It is, if you calculate positions for substr(), as a simple example.

> It depents on the application

It depends on luck whether the application does everything as expected with multibyte input.

> even if the trend goes towards UTF8 for websites

And PHP, as the primary web language, is missing UTF8 support in the core.

> 2. Within 10 years, you could have come to a working solution which could
> please us all.

Many years ago UTF8 support was announced, so I waited for PHP.

> 3. Stop flaming and focus on your other day-job instead

Announcing multibyte support, AFAIK 6 years ago, and finally stopping it makes my other day job hard over the long term.
Re: [PHP-DEV] foreach() for strings
1. The number of CHARs isn't irrelevant in general. It depends on the application, even if the trend goes towards UTF8 for websites.
2. Within 10 years, you could have come to a working solution which could please us all.
3. Stop flaming and focus on your other day job instead.

2011/6/22 Reindl Harald
> On 22.06.2011 16:49, Ferenc Kovacs wrote:
> >     after 10 years i want solutions where it is not the road to hell using any
> >     string function on user-input and since PHP6 seems to be quite dead
> >     it feels there will never be a trustable solution
> >
> > you made my day.
> > boy :)
>
> after fetch a mail-client which does not convert plain-text to html
> tell me how long should we wait until as example strlen() does return
> the number of CHARS instead bytes because this low level information
> is not needed on the script side and simply wrong by multibyte input
>
> what happended with PHP6?
Re: [PHP-DEV] foreach() for strings
On Wed, Jun 22, 2011 at 4:52 PM, Reindl Harald wrote: > > > Am 22.06.2011 16:49, schrieb Ferenc Kovacs: > > after 10 years i want solutions where it is not the road to hell > using any > > string function on user-input and since PHP6 seems to be quite dead > > it feels there will never be a trustable solution > > > > > > you made my day. > > boy :) > > after fetch a mail-client which does not convert plain-text to html > tell me how long should we wait until as example strlen() does return > the number of CHARS instead bytes because this low level information > is not needed on the script side and simply wrong by multibyte input > > what happended with PHP6? aside the trolling parts, I'm pretty surprised that you are so uninformed about how utf-8 works in general and what happened with php6 when your are lurking on this list for years and especially the what happened with PHP6 was discussed on this list and in this very thread. Tyrael
Re: [PHP-DEV] foreach() for strings
On 22.06.2011 16:49, Ferenc Kovacs wrote:
>     after 10 years i want solutions where it is not the road to hell using any
>     string function on user-input and since PHP6 seems to be quite dead
>     it feels there will never be a trustable solution
>
> you made my day.
> boy :)

After you fetch a mail client that does not convert plain text to HTML, tell me: how long should we wait until, for example, strlen() returns the number of CHARS instead of bytes? This low-level information is not needed on the script side and is simply wrong for multibyte input.

What happened with PHP6?
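The byte-versus-character gap complained about here is easy to show without mbstring, which the thread says is not always available. This is a minimal sketch that counts code points with PCRE's `/u` modifier. The helper name `utf8_char_count` is hypothetical, and "character" here means a Unicode code point, not a user-perceived grapheme.

```php
<?php
// Hypothetical helper: counts code points in a UTF-8 string using PCRE only.
function utf8_char_count(string $s): int
{
    // '.' with /s matches any code point, /u makes PCRE treat $s as UTF-8.
    // preg_match_all() returns the number of matches, or false on bad UTF-8.
    $n = preg_match_all('/./su', $s);
    if ($n === false) {
        throw new InvalidArgumentException('input is not valid UTF-8');
    }
    return $n;
}

$s = "\u{e9} A";                // "é A": é takes two bytes in UTF-8
$bytes = strlen($s);            // 4
$chars = utf8_char_count($s);   // 3
```

Offsets passed to substr() are in the same byte units as strlen(), which is why character positions computed elsewhere cannot be handed to it directly for multibyte input.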
Re: [PHP-DEV] foreach() for strings
On Wed, Jun 22, 2011 at 4:08 PM, Reindl Harald wrote: > > > Am 22.06.2011 16:06, schrieb Rasmus Lerdorf: > > On 06/22/2011 07:01 AM, Reindl Harald wrote: > >> > >> > >> Am 22.06.2011 15:57, schrieb Rasmus Lerdorf: > >> > >>> There is obviously no way to tell if "hello" is UTF-8 or ISO-8859-1. It > >>> is technically impossible since the two are identical in that range. > >> > >> yes and so this will not work > >> > >> and as long PHP has on million places troubles with UTF8 it would > >> be hardly needed find real soultions as long as functions which > >> are not working with UTF8 input are not throwing a fatal error > >> > >>> And please keep things polite. It is very rare that I warn people about > >>> that here, and I will only do it once. > >> > >> this is my tone if somebody believes he is genius and can solve > >> problems which are since years existing with a sinlge line > > > > Then please stop posting to this list. There is no excuse for this tone. > > Especially since you basically implied that you wanted a str_is_ut8() > > function that can detect whether "hello" is UTF8 or ISO-8859-1. That's a > > completely nonsensical request > > after 10 years i want solutions where it is not the road to hell using any > string function on user-input and since PHP6 seems to be quite dead > it feels there will never be a trustable solution > > you made my day. boy :) Tyrael
Re: [PHP-DEV] foreach() for strings
Am 22.06.2011 16:06, schrieb Rasmus Lerdorf: > On 06/22/2011 07:01 AM, Reindl Harald wrote: >> >> >> Am 22.06.2011 15:57, schrieb Rasmus Lerdorf: >> >>> There is obviously no way to tell if "hello" is UTF-8 or ISO-8859-1. It >>> is technically impossible since the two are identical in that range. >> >> yes and so this will not work >> >> and as long PHP has on million places troubles with UTF8 it would >> be hardly needed find real soultions as long as functions which >> are not working with UTF8 input are not throwing a fatal error >> >>> And please keep things polite. It is very rare that I warn people about >>> that here, and I will only do it once. >> >> this is my tone if somebody believes he is genius and can solve >> problems which are since years existing with a sinlge line > > Then please stop posting to this list. There is no excuse for this tone. > Especially since you basically implied that you wanted a str_is_ut8() > function that can detect whether "hello" is UTF8 or ISO-8859-1. That's a > completely nonsensical request after 10 years i want solutions where it is not the road to hell using any string function on user-input and since PHP6 seems to be quite dead it feels there will never be a trustable solution signature.asc Description: OpenPGP digital signature
Re: [PHP-DEV] foreach() for strings
On 06/22/2011 07:01 AM, Reindl Harald wrote:
> On 22.06.2011 15:57, Rasmus Lerdorf wrote:
>
>> There is obviously no way to tell if "hello" is UTF-8 or ISO-8859-1. It
>> is technically impossible since the two are identical in that range.
>
> yes and so this will not work
>
> and as long PHP has on million places troubles with UTF8 it would
> be hardly needed find real soultions as long as functions which
> are not working with UTF8 input are not throwing a fatal error
>
>> And please keep things polite. It is very rare that I warn people about
>> that here, and I will only do it once.
>
> this is my tone if somebody believes he is genius and can solve
> problems which are since years existing with a sinlge line

Then please stop posting to this list. There is no excuse for this tone. Especially since you basically implied that you wanted a str_is_utf8() function that can detect whether "hello" is UTF8 or ISO-8859-1. That's a completely nonsensical request.

-Rasmus
Re: [PHP-DEV] foreach() for strings
Am 22.06.2011 15:57, schrieb Rasmus Lerdorf: > There is obviously no way to tell if "hello" is UTF-8 or ISO-8859-1. It > is technically impossible since the two are identical in that range. yes and so this will not work and as long PHP has on million places troubles with UTF8 it would be hardly needed find real soultions as long as functions which are not working with UTF8 input are not throwing a fatal error > And please keep things polite. It is very rare that I warn people about > that here, and I will only do it once. this is my tone if somebody believes he is genius and can solve problems which are since years existing with a sinlge line signature.asc Description: OpenPGP digital signature
Re: [PHP-DEV] foreach() for strings
On 22.06.2011 15:55, Olivier Hill wrote:
> Please change your tone.
> Thank you

The tone is quite correct for people who think a quick 3-liner will do the job of UTF8 detection, which it does not, and there you can read the manual as often as you want.
Re: [PHP-DEV] foreach() for strings
On 06/22/2011 06:52 AM, Reindl Harald wrote:
> On 22.06.2011 15:45, Lars Schultz wrote:
>> On 22.06.2011 15:40, Reindl Harald wrote:
>>> and why this will not return true if $str is ISO-8859-1?
>> If you RTFM (in your jargon) you would know.
>>
>> http://ch.php.net/manual/en/function.htmlspecialchars.php (Return value
>> Section)
>
> i read the fucking manual
>
>> If the input string contains an invalid code unit sequence within the given charset
>> and the ENT_IGNORE flag is not set, then htmlspecialchars() will return an empty string.
>
> so damend NOT ALL CHARACTERS are multibyte and so it will return true for "hello"
> so what will you tell me above boy?

There is obviously no way to tell if "hello" is UTF-8 or ISO-8859-1. It is technically impossible since the two are identical in that range.

And please keep things polite. It is very rare that I warn people about that here, and I will only do it once.

-Rasmus
Re: [PHP-DEV] foreach() for strings
Please change your tone.
Thank you.

Olivier
--
http://www.olivierhill.ca/
Re: [PHP-DEV] foreach() for strings
On 22.06.2011 15:45, Lars Schultz wrote:
> On 22.06.2011 15:40, Reindl Harald wrote:
>> and why this will not return true if $str is ISO-8859-1?
> If you RTFM (in your jargon) you would know.
>
> http://ch.php.net/manual/en/function.htmlspecialchars.php (Return value
> Section)

i read the fucking manual

> If the input string contains an invalid code unit sequence within the given charset
> and the ENT_IGNORE flag is not set, then htmlspecialchars() will return an empty string.

so, damned, NOT ALL CHARACTERS are multibyte, and so it will return true for "hello"
so what are you telling me above, boy?
Re: [PHP-DEV] foreach() for strings
> and why this will not return true if $str is ISO-8859-1?

For lower 7-bit characters (code points <= 127) it would return true. But if there is a single higher character (outside of ASCII), it would only return true if the byte sequences follow UTF-8 semantics. So it would return false for such an ISO-8859-1 string.

For example, the character é is 0xe9 (code point 233) in ISO-8859-1, but the byte sequence 0xc3a9 in UTF-8. So if it encountered a byte stream such as 0xe92041 ("é A" in ISO-8859-1), it knows it cannot be UTF-8, since 0xe920 is not a valid byte sequence. But if it saw 0xc3a92041 ("é A" in UTF-8), it knows it is valid UTF-8 (it could be another character set, but it is valid in UTF-8)...

Please note that it's not checking if the string **is** UTF-8, just if the byte sequences in the string are valid when interpreted as UTF-8. You could have the Latin-1 string 0xc3a92041 ("Ã© A"), which parses as valid UTF-8...

On Wed, Jun 22, 2011 at 9:40 AM, Reindl Harald wrote:
> On 22.06.2011 15:30, Gustavo Lopes wrote:
>> On Wed, 22 Jun 2011 13:21:10 +0100, Reindl Harald wrote:
>>
>>> On 22.06.2011 14:14, Gustavo Lopes wrote:
>>>> It's actually 3 lines:
>>>>
>>>> function str_is_utf8($str) {
>>>>     return $str == "" || htmlspecialchars($str, 0, "UTF-8");
>>>> }
>>>
>>> WTF should this do?
>>> this won't return boolean
>>
>> The reason it works is that
>> 1) || coerces the operands into booleans (if they get to be evaluated)
>> 2) htmlspecialchars returns "" on bad input sequence
>> 3) (bool) "" === false
>>
>> But even if you didn't know these things, you should have bothered to at
>> least test it before sending this response
>
> and why this will not return true if $str is ISO-8859-1?
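The byte sequences from this explanation can be checked directly. The sketch below uses the preg_match variant of `str_is_utf8()` posted elsewhere in this thread. The byte strings are the ones named in the message above, and the variable names are illustrative.

```php
<?php
// The preg_match-based variant from the thread: with the /u modifier,
// preg_match() returns false when the subject is not well-formed UTF-8.
function str_is_utf8($str)
{
    return preg_match('//u', $str) !== false;
}

$ascii  = "hello";        // identical bytes in ISO-8859-1 and UTF-8
$latin1 = "\xE9\x20\x41"; // 0xe92041: "é A" encoded as ISO-8859-1
$utf8   = "\xC3\xA9\x20\x41"; // 0xc3a92041: "é A" encoded as UTF-8

$r1 = str_is_utf8($ascii);  // true: pure ASCII is always valid UTF-8
$r2 = str_is_utf8($latin1); // false: 0xe9 must be followed by continuation bytes
$r3 = str_is_utf8($utf8);   // true: well formed, whatever the sender intended
```

The third result is the caveat the message ends on: the same four bytes are also the Latin-1 text "Ã© A", so a `true` only says the bytes are valid when read as UTF-8.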
Re: [PHP-DEV] foreach() for strings
On 22.06.2011 15:40, Reindl Harald wrote:
> and why this will not return true if $str is ISO-8859-1?

If you RTFM (in your jargon) you would know.

http://ch.php.net/manual/en/function.htmlspecialchars.php (Return value Section)
Re: [PHP-DEV] foreach() for strings
On 22.06.2011 15:30, Gustavo Lopes wrote:
> On Wed, 22 Jun 2011 13:21:10 +0100, Reindl Harald wrote:
>
>> On 22.06.2011 14:14, Gustavo Lopes wrote:
>>> It's actually 3 lines:
>>>
>>> function str_is_utf8($str) {
>>>     return $str == "" || htmlspecialchars($str, 0, "UTF-8");
>>> }
>>
>> WTF should this do?
>> this won't return boolean
>
> The reason it works is that
> 1) || coerces the operands into booleans (if they get to be evaluated)
> 2) htmlspecialchars returns "" on bad input sequence
> 3) (bool) "" === false
>
> But even if you didn't know these things, you should have bothered to at
> least test it before sending this response

And why will this not return true if $str is ISO-8859-1?
Re: [PHP-DEV] foreach() for strings
On Wed, 22 Jun 2011 13:21:10 +0100, Reindl Harald wrote:

> On 22.06.2011 14:14, Gustavo Lopes wrote:
>> It's actually 3 lines:
>>
>> function str_is_utf8($str) {
>>     return $str == "" || htmlspecialchars($str, 0, "UTF-8");
>> }
>
> WTF should this do?
> this won't return boolean

The reason it works is that
1) || coerces the operands into booleans (if they get to be evaluated)
2) htmlspecialchars returns "" on bad input sequence
3) (bool) "" === false

But even if you didn't know these things, you should have bothered to at least test it before sending this response.

--
Gustavo Lopes
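The three steps listed above can be checked one at a time. In this sketch the function body is the one from the thread, renamed to the hypothetical `str_is_utf8_html` so it does not clash with the preg_match variant. The last check is an addition not raised in the thread: the string "0" is valid UTF-8 but is itself falsy in PHP, so this variant reports it as invalid.

```php
<?php
// Body taken from the thread; the name is changed for this illustration.
function str_is_utf8_html($str)
{
    return $str == "" || htmlspecialchars($str, 0, "UTF-8");
}

$bad = "\xE9 A"; // ISO-8859-1 "é A", not well-formed UTF-8

// Step 2: htmlspecialchars() returns "" for an invalid sequence
// (no ENT_IGNORE / ENT_SUBSTITUTE flag is passed here).
$step2 = htmlspecialchars($bad, 0, "UTF-8") === "";

// Step 3: the empty string converts to false.
$step3 = (bool) "" === false;

// Step 1: || turns whatever it evaluates into a boolean.
$invalid = str_is_utf8_html($bad);     // false
$valid   = str_is_utf8_html("hello");  // true

// Edge case (editorial assumption, not from the thread): "0" is falsy,
// so this variant rejects it; preg_match('//u', "0") does not.
$zeroViaHtml = str_is_utf8_html("0");            // false
$zeroViaPcre = preg_match('//u', "0") !== false; // true
```

If that edge case matters, comparing the htmlspecialchars() result against `""` explicitly, or using the preg_match variant, avoids it.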
Re: [PHP-DEV] foreach() for strings
2011/6/22 Reindl Harald:
> WTF should this do?
> this won't return boolean

It's an expression, so its evaluated result is a boolean, which is why it makes sense.

--
regards,

Kalle Sommer Nielsen
ka...@php.net
Re: [PHP-DEV] foreach() for strings
Am 22.06.2011 14:14, schrieb Gustavo Lopes: > Em Wed, 22 Jun 2011 12:14:40 +0100, Reindl Harald > escreveu: >> Am 22.06.2011 07:24, schrieb Tomas Kuliavas: >>> 2011.06.21 23:27 Reindl Harald rašė: i do not understand any word and miss a simple str_is_utf8() >>> >>> Such function uses six lines in PHP. >> >> so why do you not post them?t >> > > It's actually 3 lines: > > function str_is_utf8($str) { > return $str == "" || htmlspecialchars($str, 0, "UTF-8"); > } WTF should this do? this won't return boolean signature.asc Description: OpenPGP digital signature
Re: [PHP-DEV] foreach() for strings
On Wed, 22 Jun 2011 12:14:40 +0100, Reindl Harald wrote:

> On 22.06.2011 07:24, Tomas Kuliavas wrote:
>> 2011.06.21 23:27 Reindl Harald wrote:
>>> i do not understand any word and miss a simple str_is_utf8()
>>
>> Such function uses six lines in PHP.
>
> so why do you not post them?

It's actually 3 lines:

function str_is_utf8($str) {
    return $str == "" || htmlspecialchars($str, 0, "UTF-8");
}

or

function str_is_utf8($str) {
    return preg_match('//u', $str) !== false;
}

But I agree it wouldn't hurt to have a str_is_utf8.

--
Gustavo Lopes
Re: [PHP-DEV] foreach() for strings
Am 22.06.2011 07:24, schrieb Tomas Kuliavas: > 2011.06.21 23:27 Reindl Harald rašė: >> i do not understand any word and miss a simple str_is_utf8() > > Such function uses six lines in PHP. so why do you not post them? > You can write your own. no i can not as said > I need locale insensitive casecmp, typecasting to unsigned 32bit int and > bunch of other functions in PHP. as said: i do not understand any word of this low-lvevel bla and so PHP and UTF8 is a real problem > Do I have wait for PHP implementation or just write my > own functions? if i have no background-knowledge about UTF8 and it does not interest me really necause i have enough jobs for two lifes signature.asc Description: OpenPGP digital signature
Re: [PHP-DEV] foreach() for strings
2011.06.21 23:27 Reindl Harald wrote:
> i do not understand any word and miss a simple str_is_utf8()

Such a function takes six lines in PHP. You can write your own. I need a locale-insensitive casecmp, typecasting to unsigned 32-bit int and a bunch of other functions in PHP. Do I have to wait for a PHP implementation or just write my own functions?

--
Tomas
Re: [PHP-DEV] foreach() for strings
Am 21.06.2011 22:19, schrieb Tomas Kuliavas: > 2011.06.21 20:51 Reindl Harald rašė: >>> utf-8 is strict format. If you expect utf-8 and someone submits >>> something >>> else, you can tell that without any string function. You can verify >>> utf-8 >>> strings in pcre. You can convert nbspace to regular space, if you want. >>> utf-8 does not have any byte sequence that can collide with nbspace byte >>> sequence in utf-8 >> >> show me a practicable way to detect if some input data contains UTF8 >> mb_string-functions are out of the game because there are many servers >> even of real big companies where they are not available > > :) I've said pcre and not mbstring. If you read fine utf-8 manual like I > did about 8 years ago, you would know how to detect 8bit inputs that are > not in utf-8. utf-8 is variable byte length character set which has very > specific rules about the way bytes are arranged. You can tell length of > symbol in bytes based on first byte. You can tell what kind of byte values > should be used for second, third, fourth, fifth or sixth byte. If you > eliminate five valid utf-8 8bit byte sequences and still have 8bit data, > it is not utf-8 i do not understand any word and miss a simple str_is_utf8() or call it as you like which can do this native and performant on a given variable and would offer the possibility to stop a script with not expected input without degrade performance signature.asc Description: OpenPGP digital signature
Re: [PHP-DEV] foreach() for strings
2011.06.21 20:51 Reindl Harald wrote:
>> utf-8 is strict format. If you expect utf-8 and someone submits something
>> else, you can tell that without any string function. You can verify utf-8
>> strings in pcre. You can convert nbspace to regular space, if you want.
>> utf-8 does not have any byte sequence that can collide with nbspace byte
>> sequence in utf-8
>
> show me a practicable way to detect if some input data contains UTF8
> mb_string-functions are out of the game because there are many servers
> even of real big companies where they are not available

:) I said pcre, not mbstring. If you had read the fine utf-8 manual like I did about 8 years ago, you would know how to detect 8bit inputs that are not in utf-8. utf-8 is a variable byte length character set which has very specific rules about the way bytes are arranged. You can tell the length of a symbol in bytes based on its first byte. You can tell what kind of byte values should be used for the second, third, fourth, fifth or sixth byte. If you eliminate five valid utf-8 8bit byte sequences and still have 8bit data, it is not utf-8.

--
Tomas
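The byte-arrangement rules described here can be written out as a short validator. This is a sketch under stated assumptions, not Tomas's GPLed code, which was never posted. The function name `utf8_is_well_formed` is hypothetical, and the ranges follow the current UTF-8 definition (RFC 3629), which caps sequences at four bytes, so the five- and six-byte forms of the older definition are rejected.

```php
<?php
// Hypothetical validator: checks well-formedness byte by byte, using the
// lead byte to decide how many continuation bytes must follow.
function utf8_is_well_formed(string $s): bool
{
    $len = strlen($s);
    $i = 0;
    while ($i < $len) {
        $b = ord($s[$i]);
        if ($b < 0x80) {            // single-byte (ASCII)
            $i++;
            continue;
        }
        // $need = number of continuation bytes; [$min, $max] = allowed range
        // of the FIRST continuation byte (tightened to exclude overlong
        // forms, surrogates and values above U+10FFFF).
        if ($b >= 0xC2 && $b <= 0xDF) {
            $need = 1; $min = 0x80; $max = 0xBF;
        } elseif ($b >= 0xE0 && $b <= 0xEF) {
            $need = 2;
            $min = ($b === 0xE0) ? 0xA0 : 0x80;
            $max = ($b === 0xED) ? 0x9F : 0xBF;
        } elseif ($b >= 0xF0 && $b <= 0xF4) {
            $need = 3;
            $min = ($b === 0xF0) ? 0x90 : 0x80;
            $max = ($b === 0xF4) ? 0x8F : 0xBF;
        } else {
            return false;           // stray continuation byte or invalid lead
        }
        if ($i + $need >= $len) {
            return false;           // sequence truncated at end of input
        }
        $c = ord($s[$i + 1]);
        if ($c < $min || $c > $max) {
            return false;
        }
        for ($k = 2; $k <= $need; $k++) {
            $c = ord($s[$i + $k]);
            if ($c < 0x80 || $c > 0xBF) {
                return false;
            }
        }
        $i += $need + 1;
    }
    return true;
}
```

In practice `preg_match('//u', $s)` performs an equivalent check in C. The loop above is mainly useful for seeing which rule a given input violates.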
Re: [PHP-DEV] foreach() for strings
Am 21.06.2011 19:12, schrieb Tomas Kuliavas: and this naive attitude is the root of most security problems! why do you believe that every client submission is coming over your form or generally over anything you can control? >>> that doesn't matter here, Tomas just corrected John, that his statement >>> that >>> chrome will always use utf-8 encoding for some special character isn't >>> true. >>> browsers will adhere the >>> http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset >>> of course you can't trust user input, and you have to validate it, but >>> this >>> has nothing to do with this topic >> >> it has >> >> how du you validate input if the string-functions having undefined results >> which you probably use for your validation? > > I've never said that he should trust user input. I've only said that his > valid user inputs depend on html form format. and i told you that this in the real world is utopic there is a world outside of forms show me FIVE php-apps which are using "accept-charset" not one of mine - they do and even there i can not be sure that all of the thousands of scipts/websites i wrote use it realy everywhere > utf-8 is strict format. If you expect utf-8 and someone submits something > else, you can tell that without any string function. You can verify utf-8 > strings in pcre. You can convert nbspace to regular space, if you want. > utf-8 does not have any byte sequence that can collide with nbspace byte > sequence in utf-8 show me a practicable way to detect if some input data contains UTF8 mb_string-functions are out of the game because there are many servers even of real big companies where they are not available so the problem is simply that you can not really write portable and well performing code that is aware of UTF8 signature.asc Description: OpenPGP digital signature
Re: [PHP-DEV] foreach() for strings
2011.06.21 19:24 Reindl Harald wrote:
> On 21.06.2011 18:22, Ferenc Kovacs wrote:
>> On Tue, Jun 21, 2011 at 6:14 PM, Reindl Harald wrote:
>>> On 21.06.2011 17:55, Tomas Kuliavas wrote:
>>>> They submit it in utf-8 only if your html form allows them to do that or
>>>> they don't follow html specification and try to exploit your form. Set
>>>> form input charset to iso-8859-1 and your nbspace will take only one byte.
>>>
>>> and this naive attitude is the root of most security problems!
>>>
>>> why do you believe that every client submission is coming over
>>> your form or generally over anything you can control?
>>
>> that doesn't matter here, Tomas just corrected John, that his statement that
>> chrome will always use utf-8 encoding for some special character isn't true.
>> browsers will adhere to the
>> http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset
>> of course you can't trust user input, and you have to validate it, but this
>> has nothing to do with this topic
>
> it has
>
> how do you validate input if the string-functions which you probably
> use for your validation have undefined results?

I've never said that he should trust user input. I've only said that his valid user inputs depend on html form format.

utf-8 is strict format. If you expect utf-8 and someone submits something else, you can tell that without any string function. You can verify utf-8 strings in pcre. You can convert nbspace to regular space, if you want. utf-8 does not have any byte sequence that can collide with nbspace byte sequence in utf-8.

--
Tomas
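A rough sketch of the nbspace conversion mentioned above, assuming the input has already been checked to be valid utf-8 (the function name is hypothetical). The non-breaking space is U+00A0, which is the two bytes C2 A0 in utf-8 and the single byte A0 in iso-8859-1. In valid utf-8, C2 can only be a lead byte and A0 only a continuation byte, so a plain bytewise str_replace() cannot hit the middle of some other character.

```php
<?php
// Hypothetical helper: replace UTF-8 encoded NBSP (bytes C2 A0) with an ASCII space.
// Assumes $utf8 has already been validated as UTF-8.
function utf8_nbsp_to_space($utf8)
{
    return str_replace("\xC2\xA0", ' ', $utf8);
}
```

For iso-8859-1 input the search string would be the single byte "\xA0" instead, which is why the encoding has to be known before the replacement is done.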
RE: [PHP-DEV] foreach() for strings
> From: Pierre Joye [mailto:pierre@gmail.com]
>
> On Tue, Jun 21, 2011 at 4:38 PM, John Crenshaw wrote:
>
> > This mindset is fundamentally broken. You can call it a byte array all you
> > want, but the truth is that 99.999% of the time, when a developer is using
> > a string they need it for characters, not for bytes
>
> Let me rephrase:
>
> For backward compatibility reasons we cannot change this behavior.
>
> Any serious text processing should be done using intl, mbstring,
> transliterator (pecl) or other similar solutions.
>
> Cheers,
> --
> Pierre

Right, I totally agree. We can't fix the multibyte string issue today; I'm just saying that we *can* (and should) avoid making it much worse.

John Crenshaw
Priacta, Inc.
RE: RE: [PHP-DEV] foreach() for strings
> They submit it in utf-8 only if your html form allows them to do that or
> they don't follow html specification and try to exploit your form.

If no explicit encoding is given, all modern browsers will attempt to "autodetect" the encoding based on the page contents, often with unpredictable results. Most web developers really don't understand the whole encoding thing, and many aren't aware of it at all. If they aren't taking care of the encoding question in their server side code, what makes anyone believe that they are specifying the encoding in their response headers, or HTML? I can tell you for certain that if no encoding is specified, Chrome can and will decide that the data is UTF8, at least under certain conditions (because I watched it recently when working on an encoding problem in some legacy code.)

> Set form input charset to iso-8859-1

I can't believe I just saw someone recommend that ;)

Yes, you *could* use Latin-1...for which the Euro sign, ellipsis, decorative quotes, trademark, em dash, and a number of other frequently pasted characters are still out of range. Then, when you eventually decide that Latin-1 isn't meeting your needs, you'll get to go through the wonderful process of trying to convert all of your legacy data to UTF8.

Single byte just doesn't cut the mustard anymore, especially on the web. The world is too small. We should be trying to move PHP *away* from this, not towards it.

John Crenshaw
Priacta, Inc.
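To make the "out of range" point concrete, here is a small sketch that tests whether a UTF8 string could be stored in Latin-1 at all. The function name is an assumption for illustration; the check relies only on preg_match() with the /u modifier, under which \x{..} refers to code points. ISO-8859-1 covers exactly U+0000 through U+00FF, so the Euro sign (U+20AC) and the ellipsis (U+2026) fail while a non-breaking space (U+00A0) passes.

```php
<?php
// Hypothetical helper: true if every code point of a UTF-8 string is <= U+00FF,
// i.e. the text is representable in ISO-8859-1. Invalid UTF-8 also yields false,
// because preg_match() returns false for it under /u.
function fits_in_latin1($utf8)
{
    return preg_match('/^[\x{00}-\x{FF}]*$/u', $utf8) === 1;
}
```

A check like this only says whether a conversion is lossless; it does not perform the conversion.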
Re: [PHP-DEV] foreach() for strings
On Tue, Jun 21, 2011 at 6:24 PM, Reindl Harald wrote:
> On 21.06.2011 18:22, Ferenc Kovacs wrote:
>> On Tue, Jun 21, 2011 at 6:14 PM, Reindl Harald wrote:
>>> On 21.06.2011 17:55, Tomas Kuliavas wrote:
>>>> They submit it in utf-8 only if your html form allows them to do that or
>>>> they don't follow html specification and try to exploit your form. Set
>>>> form input charset to iso-8859-1 and your nbspace will take only one byte.
>>>
>>> and this naive attitude is the root of most security problems!
>>>
>>> why do you believe that every client submission is coming over
>>> your form or generally over anything you can control?
>>
>> that doesn't matter here, Tomas just corrected John, that his statement that
>> chrome will always use utf-8 encoding for some special character isn't true.
>> browsers will adhere to the
>> http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset
>> of course you can't trust user input, and you have to validate it, but this
>> has nothing to do with this topic
>
> it has
>
> how do you validate input if the string-functions which you probably
> use for your validation have undefined results?

what do you mean by undefined?
if you use iso-8859-1 in your whole app and database, it doesn't matter from the security POV if somebody sends you crafted utf-8 data.
if you mix up your encodings or you don't escape with the proper encoding, then that can hit you ( http://shiflett.org/blog/2006/jan/addslashes-versus-mysql-real-escape-string )
the multibyte support in the php core isn't undefined, just unsupported. :/
use intl or mbstring for handling multibyte encodings.

Tyrael
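A small illustration of "unsupported, not undefined", as a sketch only: the core strlen() has a perfectly well defined result, it just counts bytes, while mb_strlen() counts characters for a named encoding. mbstring is an optional extension, so the call is guarded here; the variable names are made up for the example.

```php
<?php
$text = "na\xC3\xAFve"; // "naïve" encoded as UTF-8: 5 characters, 6 bytes

// Core string functions work on bytes.
$byteCount = strlen($text);

// mbstring may not be installed, so check before calling it.
$charCount = function_exists('mb_strlen') ? mb_strlen($text, 'UTF-8') : null;
```

On a server without mbstring, $charCount stays null and the script has to fall back to something else, which is the portability complaint raised earlier in the thread.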
Re: [PHP-DEV] foreach() for strings
On 21.06.2011 18:22, Ferenc Kovacs wrote:
> On Tue, Jun 21, 2011 at 6:14 PM, Reindl Harald wrote:
>> On 21.06.2011 17:55, Tomas Kuliavas wrote:
>>> They submit it in utf-8 only if your html form allows them to do that or
>>> they don't follow html specification and try to exploit your form. Set
>>> form input charset to iso-8859-1 and your nbspace will take only one byte.
>>
>> and this naive attitude is the root of most security problems!
>>
>> why do you believe that every client submission is coming over
>> your form or generally over anything you can control?
>
> that doesn't matter here, Tomas just corrected John, that his statement that
> chrome will always use utf-8 encoding for some special character isn't true.
> browsers will adhere to the
> http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset
> of course you can't trust user input, and you have to validate it, but this
> has nothing to do with this topic

it has

how do you validate input if the string-functions which you probably
use for your validation have undefined results?
Re: [PHP-DEV] foreach() for strings
On Tue, Jun 21, 2011 at 6:14 PM, Reindl Harald wrote:
> On 21.06.2011 17:55, Tomas Kuliavas wrote:
>> They submit it in utf-8 only if your html form allows them to do that or
>> they don't follow html specification and try to exploit your form. Set
>> form input charset to iso-8859-1 and your nbspace will take only one byte.
>
> and this naive attitude is the root of most security problems!
>
> why do you believe that every client submission is coming over
> your form or generally over anything you can control?

that doesn't matter here, Tomas just corrected John, that his statement that chrome will always use utf-8 encoding for some special character isn't true.
browsers will adhere to the http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset
of course you can't trust user input, and you have to validate it, but this has nothing to do with this topic.

Tyrael
Re: [PHP-DEV] foreach() for strings
On 21.06.2011 17:55, Tomas Kuliavas wrote:
> They submit it in utf-8 only if your html form allows them to do that or
> they don't follow html specification and try to exploit your form. Set
> form input charset to iso-8859-1 and your nbspace will take only one byte.

and this naive attitude is the root of most security problems!

why do you believe that every client submission is coming over
your form or generally over anything you can control?
Re: RE: [PHP-DEV] foreach() for strings
As a userland developer, because of where I live I constantly have to work with 3 languages: English, Russian (Cyrillic) and Latvian (which has its own share of non-Latin characters). I end up using utf-8 in every project, and some of them give me a headache when dealing with text parsing. mb_string covers just part of the functionality and can be turned off.

I personally think something has to be done about unicode handling in php after 5.4 so that we have an official method of dealing with it in the core. Probably it can be done in a namespace of its own and be new functionality to which people should migrate.

my 2 cents.

21.06.2011 17:56, "Tomas Kuliavas" wrote:
> 2011.06.21 17:38 John Crenshaw wrote:
>> Pierre Joye wrote:
>>> On Tue, Jun 21, 2011 at 1:33 PM, Lester Caine wrote:
>>>> Pierre Joye wrote:
>>>>>> It depended on ICU there, and I would be against making a core thing in
>>>>>> PHP 5.x depend on ICU.
>>>>>
>>>>> It can and should be done as part of intl, actually.
>>>>>
>>>>> But that's somehow unrelated to the proposal here, as it is about
>>>>> byte, not characters :)
>>>>
>>>> I believe this may be where some of the new niggles may be coming from? With
>>>> browsers returning unicode, it may be that some of the 'extra' characters
>>>> are being returned as multibyte rather than as single bytes? Such as the
>>>> problem reported on the general list currently. How do we ensure that we are
>>>> dealing with single byte character strings nowadays?
>>>
>>> As it has been stated numerous times in this thread and other, we do
>>> not do anything with multi bytes systems, unicode, etc. mbstring and
>>> intl do, but php's string as of now is all about bytes, array of bytes
>>> if I may describe them this way.
>>>
>>> And we can't change this behavior.
>>
>> This mindset is fundamentally broken. You can call it a byte array all you
>> want, but the truth is that 99.999% of the time, when a developer is using
>> a string they need it for characters, not for bytes, and characters are
>> not single byte. Even English users tend to submit Unicode range
>> characters at an alarming rate. If you're using a WYSIWYG editor, Chrome
>> will submit non-breaking-spaces as the actual UTF8 encoded character, not
>> as an HTML encoded entity. Whether developers like it, or even know it,
>> supporting an extended universal character set is not really optional.
>
> They submit it in utf-8 only if your html form allows them to do that or
> they don't follow html specification and try to exploit your form. Set
> form input charset to iso-8859-1 and your nbspace will take only one byte.
>
> --
> Tomas
RE: [PHP-DEV] foreach() for strings
2011.06.21 17:38 John Crenshaw wrote:
> Pierre Joye wrote:
>> On Tue, Jun 21, 2011 at 1:33 PM, Lester Caine wrote:
>>> Pierre Joye wrote:
>>>>> It depended on ICU there, and I would be against making a core thing in
>>>>> PHP 5.x depend on ICU.
>>>>
>>>> It can and should be done as part of intl, actually.
>>>>
>>>> But that's somehow unrelated to the proposal here, as it is about
>>>> byte, not characters :)
>>>
>>> I believe this may be where some of the new niggles may be coming from? With
>>> browsers returning unicode, it may be that some of the 'extra' characters
>>> are being returned as multibyte rather than as single bytes? Such as the
>>> problem reported on the general list currently. How do we ensure that we are
>>> dealing with single byte character strings nowadays?
>>
>> As it has been stated numerous times in this thread and other, we do
>> not do anything with multi bytes systems, unicode, etc. mbstring and
>> intl do, but php's string as of now is all about bytes, array of bytes
>> if I may describe them this way.
>>
>> And we can't change this behavior.
>
> This mindset is fundamentally broken. You can call it a byte array all you
> want, but the truth is that 99.999% of the time, when a developer is using
> a string they need it for characters, not for bytes, and characters are
> not single byte. Even English users tend to submit Unicode range
> characters at an alarming rate. If you're using a WYSIWYG editor, Chrome
> will submit non-breaking-spaces as the actual UTF8 encoded character, not
> as an HTML encoded entity. Whether developers like it, or even know it,
> supporting an extended universal character set is not really optional.

They submit it in utf-8 only if your html form allows them to do that or they don't follow html specification and try to exploit your form. Set form input charset to iso-8859-1 and your nbspace will take only one byte.

--
Tomas
Re: [PHP-DEV] foreach() for strings
On Tue, Jun 21, 2011 at 4:38 PM, John Crenshaw wrote:

> This mindset is fundamentally broken. You can call it a byte array all you
> want, but the truth is that 99.999% of the time, when a developer is using a
> string they need it for characters, not for bytes

Let me rephrase:

For backward compatibility reasons we cannot change this behavior.

Any serious text processing should be done using intl, mbstring, transliterator (pecl) or other similar solutions.

Cheers,
--
Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org
RE: [PHP-DEV] foreach() for strings
Pierre Joye wrote:
> On Tue, Jun 21, 2011 at 1:33 PM, Lester Caine wrote:
>> Pierre Joye wrote:
>>>> It depended on ICU there, and I would be against making a core thing in
>>>> PHP 5.x depend on ICU.
>>>
>>> It can and should be done as part of intl, actually.
>>>
>>> But that's somehow unrelated to the proposal here, as it is about
>>> byte, not characters :)
>>
>> I believe this may be where some of the new niggles may be coming from? With
>> browsers returning unicode, it may be that some of the 'extra' characters
>> are being returned as multibyte rather than as single bytes? Such as the
>> problem reported on the general list currently. How do we ensure that we are
>> dealing with single byte character strings nowadays?
>
> As it has been stated numerous times in this thread and other, we do
> not do anything with multi bytes systems, unicode, etc. mbstring and
> intl do, but php's string as of now is all about bytes, array of bytes
> if I may describe them this way.
>
> And we can't change this behavior.

This mindset is fundamentally broken. You can call it a byte array all you want, but the truth is that 99.999% of the time, when a developer is using a string they need it for characters, not for bytes, and characters are not single byte. Even English users tend to submit Unicode range characters at an alarming rate. If you're using a WYSIWYG editor, Chrome will submit non-breaking-spaces as the actual UTF8 encoded character, not as an HTML encoded entity. Whether developers like it, or even know it, supporting an extended universal character set is not really optional.

PHP makes this bad enough with the whole collection of bytewise string functions, including many with no appropriate multibyte aware replacement, but at least this can be avoided, quickly audited, and in the future can even be fixed in any number of ways with only a nominal BC impact. Hard coding this single byte idiocy into a language construct (foreach) though would be an incredibly awful idea. This would create a trap for new naive PHP developers, and create a character set problem that the language could NEVER recover from without a massive BC break.

This proposal is really about adding a feature which, whenever it is used, is almost guaranteed to be an error. It probably won't look to the developer like an error during simple testing, but will almost certainly show up as an error in production. Is it really worth all that for a bit of syntax sugar that the developer will have to strip out anyway to fix their bug?

If string iteration needs to be addressed in the core (and IMO it doesn't because it can be handled at the script level, but if it does) why not use iterator classes? This gives the same functionality and prevents the language from encouraging hidden bugs.

John Crenshaw
Priacta, Inc.
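One example of the bytewise functions mentioned above, as a sketch rather than a recommendation: strrev() reverses bytes, so it swaps the two bytes of a UTF8 encoded "é" and leaves a string that is no longer valid UTF8. The helper utf8_strrev() below is hypothetical userland code (it is not part of PHP) and reverses by code point using preg_split() with the /u modifier.

```php
<?php
// Hypothetical helper: reverse a UTF-8 string by code point instead of by byte.
function utf8_strrev($utf8)
{
    // '//u' splits between code points; preg_split() returns false on invalid UTF-8.
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return false;
    }
    return implode('', array_reverse($chars));
}

$word     = "caf\xC3\xA9";                        // "café" in UTF-8
$bytewise = strrev($word);                         // bytes reversed: "\xA9\xC3fac"
$broken   = preg_match('//u', $bytewise) !== 1;    // true: result is not valid UTF-8
$charwise = utf8_strrev($word);                    // "\xC3\xA9fac", still valid UTF-8
```

With plain ASCII test data both versions return the same thing, which is why this kind of bug tends to pass simple testing and only show up in production. Note that the sketch works on code points, not on grapheme clusters, so combining marks would still end up detached.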
Re: [PHP-DEV] foreach() for strings
Pierre Joye wrote:
> On Tue, Jun 21, 2011 at 1:33 PM, Lester Caine wrote:
>> Pierre Joye wrote:
>>>> It depended on ICU there, and I would be against making a core thing in
>>>> PHP 5.x depend on ICU.
>>>
>>> It can and should be done as part of intl, actually.
>>>
>>> But that's somehow unrelated to the proposal here, as it is about
>>> byte, not characters :)
>>
>> I believe this may be where some of the new niggles may be coming from? With
>> browsers returning unicode, it may be that some of the 'extra' characters
>> are being returned as multibyte rather than as single bytes? Such as the
>> problem reported on the general list currently. How do we ensure that we are
>> dealing with single byte character strings nowadays?
>
> As it has been stated numerous times in this thread and other, we do
> not do anything with multi bytes systems, unicode, etc. mbstring and
> intl do, but php's string as of now is all about bytes, array of bytes
> if I may describe them this way.
>
> And we can't change this behavior.

That is exactly the point. I suppose what I am asking is how people ensure that what they are feeding into simple strings is single byte when cut and paste nowadays does not make a distinction?

--
Lester Caine - G8HFL
-
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk//
Firebird - http://www.firebirdsql.org/index.php
Re: [PHP-DEV] foreach() for strings
On Tue, Jun 21, 2011 at 1:33 PM, Lester Caine wrote:
> Pierre Joye wrote:
>>> It depended on ICU there, and I would be against making a core thing in
>>> PHP 5.x depend on ICU.
>>
>> It can and should be done as part of intl, actually.
>>
>> But that's somehow unrelated to the proposal here, as it is about
>> byte, not characters :)
>
> I believe this may be where some of the new niggles may be coming from? With
> browsers returning unicode, it may be that some of the 'extra' characters
> are being returned as multibyte rather than as single bytes? Such as the
> problem reported on the general list currently. How do we ensure that we are
> dealing with single byte character strings nowadays?

As it has been stated numerous times in this thread and other, we do not do anything with multi bytes systems, unicode, etc. mbstring and intl do, but php's string as of now is all about bytes, array of bytes if I may describe them this way.

And we can't change this behavior.

Cheers,
--
Pierre
Re: [PHP-DEV] foreach() for strings
Pierre Joye wrote:
>> It depended on ICU there, and I would be against making a core thing in
>> PHP 5.x depend on ICU.
>
> It can and should be done as part of intl, actually.
>
> But that's somehow unrelated to the proposal here, as it is about
> byte, not characters :)

I believe this may be where some of the new niggles may be coming from? With browsers returning unicode, it may be that some of the 'extra' characters are being returned as multibyte rather than as single bytes? Such as the problem reported on the general list currently. How do we ensure that we are dealing with single byte character strings nowadays?

--
Lester Caine
Re: [PHP-DEV] foreach() for strings
On Tue, Jun 21, 2011 at 12:53 PM, Derick Rethans wrote:

> It depended on ICU there, and I would be against making a core thing in
> PHP 5.x depend on ICU.

It can and should be done as part of intl, actually.

But that's somehow unrelated to the proposal here, as it is about byte, not characters :)

--
Pierre
Re: [PHP-DEV] foreach() for strings
On Mon, 20 Jun 2011, Stas Malyshev wrote:

> On 6/20/11 9:15 AM, John Crenshaw wrote:
> > > From: Ilia Alshanetsky [mailto:i...@prohost.org]
> > >
> > > As long as it works on a premise that a "string" is a byte array
> > > and each element represents 1 byte, +1 from me.
> >
> > Code written on this premise is almost always bug central when people
> > finally get around to realizing why they really do need to support
> > wide characters (and everybody does, because people like to paste
> > stuff containing non-break-spaces, and decorative quotes). I really
> > don't think this single byte character mentality should be
> > encouraged.
>
> I think you're right, TextIterator would be better (and also much easier to
> implement, I think). Didn't we have it in Unicode branch? We could port it
> back or we could have something along the lines of grapheme_extract...

It depended on ICU there, and I would be against making a core thing in PHP 5.x depend on ICU.

cheers,
Derick

--
http://derickrethans.nl | http://xdebug.org
Like Xdebug? Consider a donation: http://xdebug.org/donate.php
twitter: @derickr and @xdebug
Re: [PHP-DEV] foreach() for strings
Hi!

On 6/20/11 9:15 AM, John Crenshaw wrote:
>> From: Ilia Alshanetsky [mailto:i...@prohost.org]
>>
>> As long as it works on a premise that a "string" is a byte array and
>> each element represents 1 byte, +1 from me.
>
> Code written on this premise is almost always bug central when people
> finally get around to realizing why they really do need to support wide
> characters (and everybody does, because people like to paste stuff
> containing non-break-spaces, and decorative quotes). I really don't think
> this single byte character mentality should be encouraged.

I think you're right, TextIterator would be better (and also much easier to implement, I think). Didn't we have it in Unicode branch? We could port it back or we could have something along the lines of grapheme_extract...

> Also, how do you think this will work with the Unicode conversion in PHP 6?
> Guaranteed, this will break stuff. Some people will have

I don't think we need to worry about PHP 6 now... If we ever get back to Unicode support, it probably will be different.

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227
RE: [PHP-DEV] foreach() for strings
> From: Ilia Alshanetsky [mailto:i...@prohost.org]
>
> As long as it works on a premise that a "string" is a byte array and
> each element represents 1 byte, +1 from me.

Code written on this premise is almost always bug central when people finally get around to realizing why they really do need to support wide characters (and everybody does, because people like to paste stuff containing non-break-spaces, and decorative quotes). I really don't think this single byte character mentality should be encouraged.

Also, how do you think this will work with the Unicode conversion in PHP 6? Guaranteed, this will break stuff. Some people will have written code to iterate characters, assuming single byte characters, some people will have written code ACTUALLY intending to iterate as a byte array. Sadly, we can almost certainly assume that the single byte characters assumption (which is wrong) will also be, by far, the most common. Supporting that most common case when moving to PHP 6 would require breaking the binary case (which was the only properly written code in the first place.) On the other hand, supporting the binary case means breaking the most common case.

John Crenshaw
Priacta, Inc.
Re: [PHP-DEV] foreach() for strings
Hi!

> foreach() has many functions, looping over arrays, objects and implementing
> the iterator interface. I think it's also quite intuitive to use foreach()
> for strings, too.

I'm not sure how you'd implement such thing, but then I think things like next(), end(), etc. should work too...

--
Stanislav Malyshev
RE: [PHP-DEV] foreach() for strings
> -Original Message-
> From: Lee davis [mailto:leedavi...@gmail.com]
> Sent: Monday, June 20, 2011 9:12 AM
> To: Robert Eisele
> Cc: internals@lists.php.net
> Subject: Re: [PHP-DEV] foreach() for strings
>
> I think this would be quite a useful feature, and am In favor of it.
> However, I think caution should be taken when shifting array utilities out
> of their remit and allowing them to manipulate / traverse other data types.
> You may see the floodgates opening for more request to adapt array functions
> for other uses.
>
> Say for instance..
>
> Could we also use current(), next() and key() for iteration of strings?
>
> $string = 'string';
> while ($char = current($string))
> {
>     echo key($string); // Would output the offset position I assume 0,1,2 etc??
>     echo $char; // outputs each letter of string
>     next($string);
> }
>
> Lee
>
> On Mon, Jun 20, 2011 at 12:27 PM, Robert Eisele wrote:
>
> > foreach() has many functions, looping over arrays, objects and implementing
> > the iterator interface. I think it's also quite intuitive to use foreach()
> > for strings, too.
> >
> > If you want to implement a parser in PHP, you have to go the way with for +
> > strlen + substr() or $x[$i] to address one character of the string. We could
> > overdo the functionality of foreach()
> > by implementing LVAL's, too, in order to access single bits but this is
> > really uncommon, even if the way of thinking could be, that foreach() gives
> > a single attribute of each value, no matter
> > if it's a complex object with the iterator interface or a primitive. What do
> > you think about this one? My point of view is, that foreach() is very
> > useful, which was acknowledged by many ppl via the comments of my article.
> >
> > I think, adding features like this persuades the one or the other PHP user
> > to upgrade to 5.4.
> >
> > Robert

Doing this with an explicit iterator object is a fine idea. The syntax becomes something like:

foreach(new TextIterator($s, 'UTF8') as $pos=>$c)
{
    ...
}

On the other hand, I think that trying to support iteration without using an iterator object to mediate would be a disaster, and I'm opposed to doing something like that because:

1. The code just looks wrong. PHP developers are generally insulated from the char-arrayness of strings. In addition, since PHP isn't typesafe, the code becomes highly ambiguous. Is the code iterating an array, or a string? It is very hard to tell just by looking. It may be convenient to write, but it's certainly not convenient to read or maintain later. On the other hand, with a mediating iterator object, the intent becomes obvious, and the code is highly readable.

2. The odds of iterating any given string are slim at best. Supporting current, key, next, etc. would require the string object internally to get bloated with additional unnecessary data that is almost never used. This bloat isn't a single int either. For optimal performance it would need to consist of no less than two size_t (char position and binary position), and one encoding indicator.

3. Iteration cannot work without knowing which encoding to use for the string. Is it UTF8? UTF16? UTF7? Binary or some single byte encoding? Some other exotic wide encoding? Without an iterator object in the middle, there is no way to specify this encoding. Always treating this as binary would also be a mistake, since this is almost certainly never actually the correct behavior, even though it may often appear to behave correctly with simple inputs.

4. I've had simple mistakes caught numerous times when foreach complains about getting a scalar rather than an array. So far, it has been exactly right every time. Allowing strings to be iterated would, in the name of convenience, increase the probability of stupid mistakes evading detection. Even worse, the code itself would look logically correct until the developer finally realizes that they have a string and not an array. Errors like this are probably far more common in most projects than the need to iterate a string, so making this change hurts debugging in the common case, for the sake of syntactic sugar in the rare case. Not a good trade.

John Crenshaw
Priacta, Inc.
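The mediating iterator object argued for above can be approximated in userland. What follows is only a sketch under stated assumptions: the class name Utf8TextIterator is hypothetical (it is not the TextIterator from the old Unicode branch), it handles UTF8 only rather than taking an encoding argument, it yields code points rather than grapheme clusters, and it is written for current PHP versions, not the 5.x releases under discussion.

```php
<?php
// Hypothetical userland stand-in for the TextIterator discussed in the thread.
final class Utf8TextIterator implements IteratorAggregate
{
    /** @var string[] one element per code point */
    private $chars;

    public function __construct($text)
    {
        // '//u' splits between code points; preg_split() returns false
        // when the subject is not valid UTF-8.
        $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
        if ($chars === false) {
            throw new InvalidArgumentException('Input is not valid UTF-8');
        }
        $this->chars = $chars;
    }

    // Keys are character positions (0, 1, 2, ...), values are characters.
    public function getIterator(): Traversable
    {
        return new ArrayIterator($this->chars);
    }
}
```

Usage then looks like foreach (new Utf8TextIterator($s) as $pos => $c) { ... }, while passing a plain string to foreach still produces the warning described in point 4.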
Re: [PHP-DEV] foreach() for strings
As long as it works on a premise that a "string" is a byte array and each element represents 1 byte, +1 from me.

On Mon, Jun 20, 2011 at 7:27 AM, Robert Eisele wrote:
> foreach() has many functions, looping over arrays, objects and implementing
> the iterator interface. I think it's also quite intuitive to use foreach()
> for strings, too.
>
> If you want to implement a parser in PHP, you have to go the way with for +
> strlen + substr() or $x[$i] to address one character of the string. We could
> overdo the functionality of foreach()
> by implementing LVAL's, too, in order to access single bits but this is
> really uncommon, even if the way of thinking could be, that foreach() gives
> a single attribute of each value, no matter
> if it's a complex object with the iterator interface or a primitive. What do
> you think about this one? My point of view is, that foreach() is very
> useful, which was acknowledged by many ppl via the comments of my article.
>
> I think, adding features like this persuades the one or the other PHP user
> to upgrade to 5.4.
>
> Robert
RE: [PHP-DEV] foreach() for strings
On Mon Jun 20 09:11 AM, Lee davis wrote:
>
> Could we also use current(), next() and key() for iteration of strings?
>
> $string = 'string';
> while ($char = current($string))
> {
>     echo key($string); // Would output the offset position I assume 0,1,2 etc??
>     echo $char; // outputs each letter of string
>     next($string);
> }
>

Hopefully it can be supported without sacrificing too much performance.

Like others mentioned, it seems important to distinguish between binary/byte and text iteration.

$string = new ByteIterator('string é');
foreach($string as $i => $byte)
    ...

$string = new TextIterator('string é');
foreach($string as $i => $char)
    ...

When most developers get a 'string' from a database, my hunch is they assume they would be iterating over the 'characters' (utf8, iso.. encoding) and not individual bytes.

So +1 to string iteration as long as there's byte iteration and some plan for text iteration / by character (with icu or not).
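The byte versus text distinction above can be sketched with two small generator functions. The names iterate_bytes() and iterate_chars() are made up for the example (neither ByteIterator nor TextIterator exists in core), UTF8 is assumed for the text case, and generators need PHP 5.5 or later. The text version yields the byte offset as key, so the difference between the two positions is visible.

```php
<?php
// Hypothetical stand-in for ByteIterator: byte offset => single byte.
function iterate_bytes($s)
{
    for ($i = 0, $n = strlen($s); $i < $n; $i++) {
        yield $i => $s[$i];
    }
}

// Hypothetical stand-in for TextIterator (UTF-8 only): byte offset => character.
function iterate_chars($utf8)
{
    // PREG_OFFSET_CAPTURE reports offsets in bytes, even with /u.
    if (preg_match_all('/./su', $utf8, $m, PREG_OFFSET_CAPTURE) === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    foreach ($m[0] as $match) {
        yield $match[1] => $match[0];
    }
}
```

For a UTF8 encoded 'string é', the byte version yields 9 items and the text version 8, with the final é found at byte offset 7.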
Re: [PHP-DEV] foreach() for strings
I think this would be quite a useful feature, and am in favor of it. However, I think caution should be taken when shifting array utilities out of their remit and allowing them to manipulate / traverse other data types. You may see the floodgates opening for more requests to adapt array functions for other uses.

Say for instance.. Could we also use current(), next() and key() for iteration of strings?

$string = 'string';
while ($char = current($string))
{
    echo key($string); // Would output the offset position I assume 0,1,2 etc??
    echo $char; // outputs each letter of string
    next($string);
}

Lee

On Mon, Jun 20, 2011 at 12:27 PM, Robert Eisele wrote:
> foreach() has many functions, looping over arrays, objects and implementing
> the iterator interface. I think it's also quite intuitive to use foreach()
> for strings, too.
>
> If you want to implement a parser in PHP, you have to go the way with for +
> strlen + substr() or $x[$i] to address one character of the string. We could
> overdo the functionality of foreach()
> by implementing LVAL's, too, in order to access single bits but this is
> really uncommon, even if the way of thinking could be, that foreach() gives
> a single attribute of each value, no matter
> if it's a complex object with the iterator interface or a primitive. What do
> you think about this one? My point of view is, that foreach() is very
> useful, which was acknowledged by many ppl via the comments of my article.
>
> I think, adding features like this persuades the one or the other PHP user
> to upgrade to 5.4.
>
> Robert
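For comparison, the current()/next()/key() loop asked about above can be approximated today without any language change. This is a sketch, not the proposed feature: current(), next() and key() do not accept strings, so an ArrayIterator over str_split() stands in for them, and the function name walk_string() is made up for this example. It also sidesteps a trap in the loop as written: a truthiness test on current() would stop at the byte "0", because "0" is falsy in PHP.

```php
<?php
// Hypothetical helper: walks a string byte by byte with iterator methods
// that mirror current(), key() and next().
function walk_string($string)
{
    $seen = [];
    $it = new ArrayIterator(str_split($string));

    // Use valid() instead of testing the value itself, so that a "0"
    // byte does not end the loop early.
    while ($it->valid()) {
        $seen[$it->key()] = $it->current();
        $it->next();
    }

    return $seen;
}

print_r(walk_string('a0b')); // offsets 0, 1, 2 with bytes a, 0, b
```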
Re: [PHP-DEV] foreach() for strings
Hi Robert,

I would go with an RFC for that one, at least to document/cover edge cases and to help the doc team properly document this change if it gets approved.

Thanks for your work so far!

On Mon, Jun 20, 2011 at 1:27 PM, Robert Eisele wrote:
> foreach() has many functions, looping over arrays, objects and implementing
> the iterator interface. I think it's also quite intuitive to use foreach()
> for strings, too.
>
> If you want to implement a parser in PHP, you have to go the way with for +
> strlen + substr() or $x[$i] to address one character of the string. We could
> overdo the functionality of foreach()
> by implementing LVAL's, too, in order to access single bits but this is
> really uncommon, even if the way of thinking could be, that foreach() gives
> a single attribute of each value, no matter
> if it's a complex object with the iterator interface or a primitive. What do
> you think about this one? My point of view is, that foreach() is very
> useful, which was acknowledged by many ppl via the comments of my article.
>
> I think, adding features like this persuades the one or the other PHP user
> to upgrade to 5.4.
>
> Robert

--
Pierre
@pierrejoye | http://blog.thepimp.net | http://www.libgd.org
Re: [PHP-DEV] foreach() for strings
2011/6/20 Johannes Schlüter:
> On Mon, 2011-06-20 at 13:27 +0200, Robert Eisele wrote:
>> foreach() has many functions, looping over arrays, objects and implementing
>> the iterator interface. I think it's also quite intuitive to use foreach()
>> for strings, too.
>
> I would prefer a TextIterator as we had in the old PHP 6 as this allows
> more powerful filtering etc. using iterator semantics even though this
> might be a bit slower.

A foreach over a string should treat it as a binary buffer, with no knowledge of its content, and only fetch it byte by byte. TextIterator can be smarter and support Unicode when ICU is available.

Cheers,
--
Pierre
@pierrejoye | http://blog.thepimp.net | http://www.libgd.org
Re: [PHP-DEV] foreach() for strings
On Mon, 20 Jun 2011, Robert Eisele wrote:

> 2011/6/20 Derick Rethans
>
> > On Mon, 20 Jun 2011, Robert Eisele wrote:
> >
> > > foreach() has many functions, looping over arrays, objects and
> > > implementing the iterator interface. I think it's also quite
> > > intuitive to use foreach() for strings, too.
> >
> > > If you want to implement a parser in PHP, you have to go the way
> > > with for + strlen + substr() or $x[$i] to address one character of
> > > the string.
> >
> > Yes, this sounds like a good addition to me. One question though,
> > what to do with an object that implements __toString() ?
> >
>
> That's the question, maybe one must force __toString() via an explicit
> string-cast:
>
> foreach( (string) $obj as $k=>$v)

That's a sensible thing indeed. We just need to make sure we document this with an example and a test then.

cheers,
Derick

--
http://derickrethans.nl | http://xdebug.org
Like Xdebug? Consider a donation: http://xdebug.org/donate.php
twitter: @derickr and @xdebug
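The explicit-cast idea for __toString() objects could look roughly like the sketch below. It rests on assumptions: the Label class and the bytes_of() helper are hypothetical names invented for illustration, and because foreach does not accept a plain string today, str_split() stands in for the proposed string iteration after the cast.

```php
<?php
// Hypothetical class used only to illustrate the __toString() question.
class Label
{
    private $text;

    public function __construct($text)
    {
        $this->text = $text;
    }

    public function __toString()
    {
        return $this->text;
    }
}

// Hypothetical helper: the (string) cast invokes __toString() on objects,
// then str_split() yields one element per byte of the result.
function bytes_of($value)
{
    return str_split((string) $value);
}

$obj = new Label('ab');

// Iterating $obj directly would walk its visible properties, not its
// string form, so the cast makes the intent explicit.
foreach (bytes_of($obj) as $k => $v) {
    echo $k, ' => ', $v, "\n";
}
```

Requiring the cast keeps the existing meaning of foreach over an object intact and makes the string conversion visible at the call site.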
Re: [PHP-DEV] foreach() for strings
On Mon, 20 Jun 2011, Johannes Schlüter wrote:

> On Mon, 2011-06-20 at 13:27 +0200, Robert Eisele wrote:
> > foreach() has many functions, looping over arrays, objects and implementing
> > the iterator interface. I think it's also quite intuitive to use foreach()
> > for strings, too.
>
> I would prefer a TextIterator as we had in the old PHP 6 as this allows
> more powerful filtering etc. using iterator semantics even though this
> might be a bit slower.

I think TextIterator is another good addition, but it will make PHP depend on ICU. Therefore, I think just a foreach for strings seems valuable to me.

Derick
Re: [PHP-DEV] foreach() for strings
On Mon, 2011-06-20 at 13:27 +0200, Robert Eisele wrote:
> foreach() has many functions, looping over arrays, objects and implementing
> the iterator interface. I think it's also quite intuitive to use foreach()
> for strings, too.

I would prefer a TextIterator as we had in the old PHP 6 as this allows more powerful filtering etc. using iterator semantics even though this might be a bit slower.

johannes
Re: [PHP-DEV] foreach() for strings
2011/6/20 Derick Rethans

> On Mon, 20 Jun 2011, Robert Eisele wrote:
>
> > foreach() has many functions, looping over arrays, objects and implementing
> > the iterator interface. I think it's also quite intuitive to use foreach()
> > for strings, too.
>
> > If you want to implement a parser in PHP, you have to go the way with for +
> > strlen + substr() or $x[$i] to address one character of the string.
>
> Yes, this sounds like a good addition to me. One question though, what
> to do with an object that implements __toString() ?
>

That's the question, maybe one must force __toString() via an explicit string-cast:

foreach( (string) $obj as $k=>$v)

> > We could
> > overdo the functionality of foreach()
> > by implementing LVAL's, too, in order to access single bits but this is
> > really uncommon, even if the way of thinking could be, that foreach() gives
> > a single attribute of each value, no matter
> > if it's a complex object with the iterator interface or a primitive. What do
> > you think about this one? My point of view is, that foreach() is very
> > useful, which was acknowledged by many ppl via the comments of my article.
>
> I don't think we should do it for bits, as nothing in PHP really does do
> anything with that. If you want to do stuff with bits, I think the
> "bitset" package (http://pecl.php.net/package/Bitset) is the way
> forwards.
>

yep, i totally agree.

>
> cheers,
> Derick
>
> --
> http://derickrethans.nl | http://xdebug.org
> Like Xdebug? Consider a donation: http://xdebug.org/donate.php
> twitter: @derickr and @xdebug
Re: [PHP-DEV] foreach() for strings
On Mon, 20 Jun 2011, Robert Eisele wrote:

> foreach() has many functions, looping over arrays, objects and implementing
> the iterator interface. I think it's also quite intuitive to use foreach()
> for strings, too.

> If you want to implement a parser in PHP, you have to go the way with for +
> strlen + substr() or $x[$i] to address one character of the string.

Yes, this sounds like a good addition to me. One question though, what to do with an object that implements __toString() ?

> We could
> overdo the functionality of foreach()
> by implementing LVAL's, too, in order to access single bits but this is
> really uncommon, even if the way of thinking could be, that foreach() gives
> a single attribute of each value, no matter
> if it's a complex object with the iterator interface or a primitive. What do
> you think about this one? My point of view is, that foreach() is very
> useful, which was acknowledged by many ppl via the comments of my article.

I don't think we should do it for bits, as nothing in PHP really does do anything with that. If you want to do stuff with bits, I think the "bitset" package (http://pecl.php.net/package/Bitset) is the way forwards.

cheers,
Derick

--
http://derickrethans.nl | http://xdebug.org
Like Xdebug? Consider a donation: http://xdebug.org/donate.php
twitter: @derickr and @xdebug
[PHP-DEV] foreach() for strings
foreach() has many functions, looping over arrays, objects and implementing the iterator interface. I think it's also quite intuitive to use foreach() for strings, too.

If you want to implement a parser in PHP, you have to go the way with for + strlen + substr() or $x[$i] to address one character of the string. We could overdo the functionality of foreach() by implementing LVAL's, too, in order to access single bits but this is really uncommon, even if the way of thinking could be, that foreach() gives a single attribute of each value, no matter if it's a complex object with the iterator interface or a primitive. What do you think about this one? My point of view is, that foreach() is very useful, which was acknowledged by many ppl via the comments of my article.

I think, adding features like this persuades the one or the other PHP user to upgrade to 5.4.

Robert
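For reference, the for + strlen + $x[$i] pattern described above looks like this in practice, next to the closest approximation of the proposed foreach that works without a language change. It is a minimal sketch: the function name bytes_with_offsets() is invented for illustration, and both forms operate on bytes, not on multi-byte characters.

```php
<?php
// Hypothetical helper showing the existing idiom: an index loop over
// strlen(), reading one byte at a time with $input[$i].
function bytes_with_offsets($input)
{
    $out = [];
    for ($i = 0, $len = strlen($input); $i < $len; $i++) {
        $out[$i] = $input[$i]; // one byte, not necessarily one character
    }
    return $out;
}

// Approximation of the proposed foreach over a string: str_split()
// builds an array of single bytes that foreach can already traverse.
foreach (str_split('abc') as $offset => $byte) {
    echo $offset, ' => ', $byte, "\n";
}
```

The str_split() form reads much like the proposal but allocates a full array up front, which is one practical difference from native iteration.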