RE: [PHP-DEV] foreach() for strings

2011-06-23 Thread John Crenshaw
> -----Original Message-----
> From: Jan Schneider [mailto:j...@horde.org]
>
> And is it any different when that very same string that's supposed to
> be an array is processed using the $var[$n] syntax nowadays? It's
> not; you won't get an error message for that either, and it's the same
> amount of work to track this down. Granted, making PHP behave the
> same in foreach gives you one more place to track down such errors,
> but making it easier to track down developer errors is not something
> that should keep PHP from adding new features.
>
> Jan.

In theory, yes, but in practice this doesn't seem to happen with any frequency 
(actually, I'm having a hard time thinking of a time when this has EVER 
happened to me). On the other hand, warnings about foreach getting something 
that wasn't iterable are commonplace for me, and more often than not, it is a 
string.

I think it's perfectly appropriate for any language to avoid "high risk" 
features (features that are likely to result in buggy code, or features that 
are likely to result in bugs evading detection.) My code has enough bugs 
already, so any language feature that finds my bugs for me is more than welcome.

Consider implicit vs. explicit returns. If a function always returns the value 
of the last statement (implicit) this is likely to result in unpredictable 
behavior and hidden bugs, when a warning could have been issued instead. Typing 
"return" clarifies intent and is a very small price to pay to avoid those 
errors. In this case, typing "new TextIterator()" in the handful of cases where 
you actually needed to iterate a string is a VERY small price to pay for:
1. The ability to get meaningful warnings when you didn't intend to iterate the 
string (by far the more likely scenario)
2. The ability to easily fix your code when you decide that a universal 
character set really is valuable
3. The ability to clearly see the intent of the code
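
As an illustration of point 1, here is a minimal sketch of how that warning surfaces when a string reaches foreach. The helper name collectForeachWarnings() is hypothetical, and the exact warning text differs between PHP versions, so the sketch only collects the messages instead of matching their wording.

```php
<?php
// Hypothetical helper: runs foreach over $value and returns any warnings raised.
function collectForeachWarnings($value): array
{
    $warnings = [];
    // Temporarily capture E_WARNING messages instead of printing them.
    set_error_handler(function (int $errno, string $errstr) use (&$warnings): bool {
        $warnings[] = $errstr;
        return true; // mark the warning as handled
    }, E_WARNING);

    try {
        foreach ($value as $item) {
            // The body is skipped entirely when $value is a string.
        }
    } finally {
        restore_error_handler();
    }

    return $warnings;
}

// A string where an array was expected is reported, not silently iterated.
$forString = collectForeachWarnings('abc');      // one warning message
$forArray  = collectForeachWarnings(['a', 'b']); // no warnings
```

If strings became directly iterable, the first call would presumably return an empty list as well, which is the loss of diagnostics being argued about here.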

John Crenshaw
Priacta, Inc.

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] foreach() for strings

2011-06-23 Thread Jan Schneider


Quoting Larry Garfield:


On 06/20/2011 10:25 AM, John Crenshaw wrote:
Doing this with an explicit iterator object is a fine idea. The  
syntax becomes something like:

foreach(new TextIterator($s, 'UTF8') as $pos=>$c)
{
...
}

On the other hand, I think that trying to support iteration without  
using an iterator object to mediate would be a disaster, and I'm  
opposed to doing something like that because:
1. The code just looks wrong. PHP developers are generally  
insulated from the char-arrayness of strings. In addition, since  
PHP isn't typesafe, the code becomes highly ambiguous. Is the code  
iterating an array, or a string? It is very hard to tell just by  
looking. It may be convenient to write, but it's certainly not  
convenient to read or maintain later. On the other hand, with a  
mediating iterator object, the intent becomes obvious, and the code  
is highly readable.
2. The odds of iterating any given string are slim at best.  
Supporting current, key, next, etc. would require the string object  
internally to get bloated with additional unnecessary data that is  
almost never used. This bloat isn't a single int either. For  
optimal performance it would need to consist of no less than two  
size_t (char position and binary position), and one encoding  
indicator.
3. Iteration cannot work without knowing which encoding to use for  
the string. Is it UTF8? UTF16? UTF7? Binary or some single byte  
encoding? Some other exotic wide encoding? Without an iterator  
object in the middle, there is no way to specify this encoding.  
Always treating this as binary would also be a mistake, since this  
is almost certainly never actually the correct behavior, even  
though it may often appear to behave correctly with simple inputs.
4. I've had simple mistakes caught numerous times when foreach  
complains about getting a scalar rather than an array. So far, it  
has been exactly right every time. Allowing strings to be iterated  
would, in the name of convenience, increase the probability of  
stupid mistakes evading detection. Even worse, the code itself  
would look logically correct until the developer finally realizes  
that they have a string and not an array. Errors like this are  
probably far more common in most projects than the need to iterate  
a string, so making this change hurts debugging in the common case,  
for the sake of syntactic sugar in the rare case. Not a good trade.


John Crenshaw
Priacta, Inc.


I would echo John's statements here.  foreach() directly iterating a  
string is going to make my life substantially harder.  I work in  
array-heavy systems, and "bad first argument for foreach()" is  
already a hard enough error to track down.  It means "somewhere,  
somehow, you put a string where you meant to put an array.  GLWT."   
Adding automatic string iteration would take away even that error  
message and leave me with no way to figure out why my code is  
randomly misbehaving.  Just looking at the code, I would have no way  
of knowing that such a bug lurks within.  That's the downside of a  
weakly typed but still typed language.


And is it any different when that very same string that's supposed to
be an array is processed using the $var[$n] syntax nowadays? It's
not; you won't get an error message for that either, and it's the same
amount of work to track this down. Granted, making PHP behave the
same in foreach gives you one more place to track down such errors,
but making it easier to track down developer errors is not something
that should keep PHP from adding new features.
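
For comparison, a small sketch of the behavior described above: offset access on a string succeeds silently. firstElement() is a hypothetical function used only for illustration.

```php
<?php
// Hypothetical function that expects an array of rows.
function firstElement($rows)
{
    return $rows[0];
}

$fromArray  = firstElement(['alpha', 'beta']); // "alpha"
$fromString = firstElement('alpha');           // "a": the first byte, and no warning is raised
```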


Jan.

--
Do you need professional PHP or Horde consulting?
http://horde.org/consulting/





Re: [PHP-DEV] foreach() for strings

2011-06-22 Thread Larry Garfield

On 06/20/2011 10:25 AM, John Crenshaw wrote:
Doing this with an explicit iterator object is a fine idea. The syntax 
becomes something like:

foreach(new TextIterator($s, 'UTF8') as $pos=>$c)
{
 ...
}

On the other hand, I think that trying to support iteration without using an 
iterator object to mediate would be a disaster, and I'm opposed to doing 
something like that because:
1. The code just looks wrong. PHP developers are generally insulated from the 
char-arrayness of strings. In addition, since PHP isn't typesafe, the code 
becomes highly ambiguous. Is the code iterating an array, or a string? It is 
very hard to tell just by looking. It may be convenient to write, but it's 
certainly not convenient to read or maintain later. On the other hand, with a 
mediating iterator object, the intent becomes obvious, and the code is highly 
readable.
2. The odds of iterating any given string are slim at best. Supporting current, 
key, next, etc. would require the string object internally to get bloated with 
additional unnecessary data that is almost never used. This bloat isn't a 
single int either. For optimal performance it would need to consist of no less 
than two size_t (char position and binary position), and one encoding indicator.
3. Iteration cannot work without knowing which encoding to use for the string. 
Is it UTF8? UTF16? UTF7? Binary or some single byte encoding? Some other exotic 
wide encoding? Without an iterator object in the middle, there is no way to 
specify this encoding. Always treating this as binary would also be a mistake, 
since this is almost certainly never actually the correct behavior, even though 
it may often appear to behave correctly with simple inputs.
4. I've had simple mistakes caught numerous times when foreach complains about 
getting a scalar rather than an array. So far, it has been exactly right every 
time. Allowing strings to be iterated would, in the name of convenience, 
increase the probability of stupid mistakes evading detection. Even worse, the 
code itself would look logically correct until the developer finally realizes 
that they have a string and not an array. Errors like this are probably far 
more common in most projects than the need to iterate a string, so making this 
change hurts debugging in the common case, for the sake of syntactic sugar in 
the rare case. Not a good trade.
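
Points 2 and 3 can be illustrated with a short sketch that tracks the byte position of each character. It assumes the input is UTF-8 and relies on PCRE's /u modifier; the helper name utf8Chars() is hypothetical and not part of PHP.

```php
<?php
// Hypothetical helper: yields byte offset => character for a UTF-8 string.
function utf8Chars(string $text): Generator
{
    // With the /u modifier, "." matches one whole UTF-8 character (the /s
    // modifier lets it match newlines too), and preg_match_all() returns
    // false if $text is not valid UTF-8.
    if (preg_match_all('/./su', $text, $matches, PREG_OFFSET_CAPTURE) === false) {
        throw new InvalidArgumentException('Input is not valid UTF-8');
    }
    foreach ($matches[0] as [$char, $byteOffset]) {
        yield $byteOffset => $char;
    }
}

$text = "\u{E9}A";                                 // "é" followed by "A"
$byteLength = strlen($text);                       // 3 bytes for 2 characters
$characters = iterator_to_array(utf8Chars($text)); // keys are byte offsets 0 and 2
```

Treating the same string as binary would instead yield three one-byte pieces, two of which are not characters on their own.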

John Crenshaw
Priacta, Inc.


I would echo John's statements here.  foreach() directly iterating a 
string is going to make my life substantially harder.  I work in 
array-heavy systems, and "bad first argument for foreach()" is already a 
hard enough error to track down.  It means "somewhere, somehow, you put 
a string where you meant to put an array.  GLWT."  Adding automatic 
string iteration would take away even that error message and leave me 
with no way to figure out why my code is randomly misbehaving.  Just 
looking at the code, I would have no way of knowing that such a bug 
lurks within.  That's the downside of a weakly typed but still typed 
language.


A proper iterator class, however, makes a great deal of sense.  It could 
be implemented user-space fairly easily, no doubt, but for strings of 
any appreciable size (like the OP seems to be talking about for code 
parsing) I suspect performance and memory usage would be far better if 
implemented in C.
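
As a rough idea of what a user-space version might look like, here is a minimal sketch. TextIterator is not a built-in PHP class; this version is an assumption that handles UTF-8 only, splits the whole string up front (so it uses more memory than a C implementation would), and uses the character index as the key.

```php
<?php
// Hypothetical user-space TextIterator; handles UTF-8 input only.
final class TextIterator implements Iterator
{
    /** @var string[] one entry per character */
    private $chars;
    /** @var int current character index */
    private $index = 0;

    public function __construct(string $text, string $encoding = 'UTF-8')
    {
        // Accept both "UTF-8" and "UTF8" spellings; reject everything else.
        if (strtoupper(str_replace('-', '', $encoding)) !== 'UTF8') {
            throw new InvalidArgumentException('Only UTF-8 is supported in this sketch');
        }
        // With /u, preg_split() cuts on character boundaries and returns
        // false when $text is not valid UTF-8.
        $chars = preg_split('//u', $text, -1, PREG_SPLIT_NO_EMPTY);
        if ($chars === false) {
            throw new InvalidArgumentException('Input is not valid UTF-8');
        }
        $this->chars = $chars;
    }

    public function current(): string { return $this->chars[$this->index]; }
    public function key(): int { return $this->index; }
    public function next(): void { $this->index++; }
    public function rewind(): void { $this->index = 0; }
    public function valid(): bool { return $this->index < count($this->chars); }
}

$seen = [];
foreach (new TextIterator("h\u{E9}!", 'UTF8') as $pos => $c) {
    $seen[$pos] = $c; // three characters, although the string is four bytes long
}
```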


Whether it's a byte-based or character-set-sensitive-character-based 
iterator... honestly I don't care as long as it's documented properly.


--Larry Garfield




Re: [PHP-DEV] foreach() for strings

2011-06-22 Thread Tomas Kuliavas
2011.06.22 14:14 Reindl Harald wrote:
> On 22.06.2011 07:24, Tomas Kuliavas wrote:
>> 2011.06.21 23:27 Reindl Harald wrote:
>>> i do not understand any word and miss a simple str_is_utf8()
>>
>> Such function uses six lines in PHP.
>
> so why do you not post them?

My lines are not public domain. They are GPLed. I am not sure about their
performance, but they are executed 15 times on every mailbox listing I do
and I don't see problems with that.

I point at issues I have with available PHP functions and you start
insulting me by calling my complaints "low-level bla" and fail to politely
ask me to explain everything in detail. Sorry if my tone insulted you and
you decided to start a war instead of investigating things further.

-- 
Tomas






RE: [PHP-DEV] foreach() for strings

2011-06-22 Thread Jonathan Bond-Caron
On Wed Jun 22 11:25 AM, Reindl Harald wrote:
> 
> and php as primary web-language is missing UTF8 support in the core

You have a valid point.

Now on to "foreach() for strings": any other opinions?
Seems like the discussion is closed and most likely should move to an RFC
with some consideration of how character iteration could work.





Re: [PHP-DEV] foreach() for strings

2011-06-22 Thread Reindl Harald
FIRST:
it is terrible that one posts in HTML, the next answers on top
and another answers on bottom in the same thread, and that most
do not understand that replying to the list address is enough


On 22.06.2011 17:07, Robert Eisele wrote:
> 1. The number of CHARs isn't irrelevant in a general manner

it is if you calculate positions for substr() as simple example

> It depends on the application

it depends on luck whether the application does everything as expected
with multibyte input

> even if the trend goes towards UTF8 for websites

and php as primary web-language is missing UTF8 support in the core

> 2. Within 10 years, you could have come to a working solution which could
> please us all.

many years ago UTF8 support was announced so i waited for PHP

> 3. Stop flaming and focus on your other day-job instead

announcing multibyte support AFAIK 6 years ago and finally stopping
it makes my other day-job hard over the long term





Re: [PHP-DEV] foreach() for strings

2011-06-22 Thread Robert Eisele
1. The number of CHARs isn't irrelevant in a general manner.
It depends on the application, even if the trend goes towards UTF8 for
websites.

2. Within 10 years, you could have come to a working solution which could
please us all.

3. Stop flaming and focus on your other day-job instead.

2011/6/22 Reindl Harald wrote:

> On 22.06.2011 16:49, Ferenc Kovacs wrote:
> > after 10 years i want solutions where it is not the road to hell using
> > any string function on user-input and since PHP6 seems to be quite dead
> > it feels there will never be a trustable solution
> >
> >
> > you made my day.
> > boy :)
>
> after fetching a mail-client which does not convert plain-text to html
> tell me how long should we wait until, as an example, strlen() returns
> the number of CHARS instead of bytes, because this low-level information
> is not needed on the script side and is simply wrong for multibyte input
>
> what happened with PHP6?


Re: [PHP-DEV] foreach() for strings

2011-06-22 Thread Ferenc Kovacs
On Wed, Jun 22, 2011 at 4:52 PM, Reindl Harald wrote:

> On 22.06.2011 16:49, Ferenc Kovacs wrote:
> > after 10 years i want solutions where it is not the road to hell using
> > any string function on user-input and since PHP6 seems to be quite dead
> > it feels there will never be a trustable solution
> >
> >
> > you made my day.
> > boy :)
>
> after fetching a mail-client which does not convert plain-text to html
> tell me how long should we wait until, as an example, strlen() returns
> the number of CHARS instead of bytes, because this low-level information
> is not needed on the script side and is simply wrong for multibyte input
>
> what happened with PHP6?


aside from the trolling parts, I'm pretty surprised that you are so uninformed
about how utf-8 works in general and what happened with php6 when you have been
lurking on this list for years, especially since what happened with PHP6
was discussed on this list and in this very thread.

Tyrael


Re: [PHP-DEV] foreach() for strings

2011-06-22 Thread Reindl Harald


On 22.06.2011 16:49, Ferenc Kovacs wrote:
> after 10 years i want solutions where it is not the road to hell using any
> string function on user-input and since PHP6 seems to be quite dead
> it feels there will never be a trustable solution
> 
> 
> you made my day.
> boy :)

after fetching a mail-client which does not convert plain-text to html
tell me how long should we wait until, as an example, strlen() returns
the number of CHARS instead of bytes, because this low-level information
is not needed on the script side and is simply wrong for multibyte input

what happened with PHP6?








Re: [PHP-DEV] foreach() for strings

2011-06-22 Thread Ferenc Kovacs
On Wed, Jun 22, 2011 at 4:08 PM, Reindl Harald wrote:

>
>
> On 22.06.2011 16:06, Rasmus Lerdorf wrote:
> > On 06/22/2011 07:01 AM, Reindl Harald wrote:
> >>
> >>
> >> On 22.06.2011 15:57, Rasmus Lerdorf wrote:
> >>
> >>> There is obviously no way to tell if "hello" is UTF-8 or ISO-8859-1. It
> >>> is technically impossible since the two are identical in that range.
> >>
> >> yes and so this will not work
> >>
> >> and as long as PHP has trouble with UTF8 in a million places, real
> >> solutions are badly needed, as long as functions which
> >> are not working with UTF8 input are not throwing a fatal error
> >>
> >>> And please keep things polite. It is very rare that I warn people about
> >>> that here, and I will only do it once.
> >>
> >> this is my tone if somebody believes he is a genius and can solve
> >> problems which have existed for years with a single line
> >
> > Then please stop posting to this list. There is no excuse for this tone.
> > Especially since you basically implied that you wanted a str_is_utf8()
> > function that can detect whether "hello" is UTF8 or ISO-8859-1. That's a
> > completely nonsensical request
>
> after 10 years i want solutions where it is not the road to hell using any
> string function on user-input and since PHP6 seems to be quite dead
> it feels there will never be a trustable solution
>
>
you made my day.
boy :)

Tyrael


Re: [PHP-DEV] foreach() for strings

2011-06-22 Thread Reindl Harald


On 22.06.2011 16:06, Rasmus Lerdorf wrote:
> On 06/22/2011 07:01 AM, Reindl Harald wrote:
>>
>>
>> On 22.06.2011 15:57, Rasmus Lerdorf wrote:
>>
>>> There is obviously no way to tell if "hello" is UTF-8 or ISO-8859-1. It
>>> is technically impossible since the two are identical in that range.
>>
>> yes and so this will not work
>>
>> and as long as PHP has trouble with UTF8 in a million places, real
>> solutions are badly needed, as long as functions which
>> are not working with UTF8 input are not throwing a fatal error
>>
>>> And please keep things polite. It is very rare that I warn people about
>>> that here, and I will only do it once.
>>
>> this is my tone if somebody believes he is a genius and can solve
>> problems which have existed for years with a single line
> 
> Then please stop posting to this list. There is no excuse for this tone.
> Especially since you basically implied that you wanted a str_is_utf8()
> function that can detect whether "hello" is UTF8 or ISO-8859-1. That's a
> completely nonsensical request

after 10 years i want solutions where it is not the road to hell using any
string function on user-input and since PHP6 seems to be quite dead
it feels there will never be a trustable solution





Re: [PHP-DEV] foreach() for strings

2011-06-22 Thread Rasmus Lerdorf
On 06/22/2011 07:01 AM, Reindl Harald wrote:
> 
> 
> On 22.06.2011 15:57, Rasmus Lerdorf wrote:
> 
>> There is obviously no way to tell if "hello" is UTF-8 or ISO-8859-1. It
>> is technically impossible since the two are identical in that range.
> 
> yes and so this will not work
> 
> and as long as PHP has trouble with UTF8 in a million places, real
> solutions are badly needed, as long as functions which
> are not working with UTF8 input are not throwing a fatal error
> 
>> And please keep things polite. It is very rare that I warn people about
>> that here, and I will only do it once.
> 
> this is my tone if somebody believes he is a genius and can solve
> problems which have existed for years with a single line

Then please stop posting to this list. There is no excuse for this tone.
Especially since you basically implied that you wanted a str_is_utf8()
function that can detect whether "hello" is UTF8 or ISO-8859-1. That's a
completely nonsensical request.

-Rasmus





Re: [PHP-DEV] foreach() for strings

2011-06-22 Thread Reindl Harald


On 22.06.2011 15:57, Rasmus Lerdorf wrote:

> There is obviously no way to tell if "hello" is UTF-8 or ISO-8859-1. It
> is technically impossible since the two are identical in that range.

yes and so this will not work

and as long as PHP has trouble with UTF8 in a million places, real
solutions are badly needed, as long as functions which
are not working with UTF8 input are not throwing a fatal error

> And please keep things polite. It is very rare that I warn people about
> that here, and I will only do it once.

this is my tone if somebody believes he is a genius and can solve
problems which have existed for years with a single line





Re: [PHP-DEV] foreach() for strings

2011-06-22 Thread Reindl Harald


On 22.06.2011 15:55, Olivier Hill wrote:
> Please change your tone.
> Thank you

the tone is quite correct for people who think a quick 3 liner will do
the job of UTF8 detection, which it does not, and there you can read the manual
as often as you want







Re: [PHP-DEV] foreach() for strings

2011-06-22 Thread Rasmus Lerdorf
On 06/22/2011 06:52 AM, Reindl Harald wrote:
> 
> 
> On 22.06.2011 15:45, Lars Schultz wrote:
>> On 22.06.2011 15:40, Reindl Harald wrote:
>>> and why this will not return true if $str is ISO-8859-1?
>> If you RTFM (in your jargon) you would know.
>>
>> http://ch.php.net/manual/en/function.htmlspecialchars.php (Return value
>> Section)
> 
> i read the fucking manual
> 
>> If the input string contains an invalid code unit sequence within the
>> given charset and the ENT_IGNORE flag is not set, then htmlspecialchars()
>> will return an empty string.
> 
> so damned, NOT ALL CHARACTERS are multibyte and so it will return true for
> "hello"
> so what will you tell me above boy?

There is obviously no way to tell if "hello" is UTF-8 or ISO-8859-1. It
is technically impossible since the two are identical in that range.

And please keep things polite. It is very rare that I warn people about
that here, and I will only do it once.

-Rasmus




Re: [PHP-DEV] foreach() for strings

2011-06-22 Thread Olivier Hill
Please change your tone.

Thank you.

Olivier


-- 
http://www.olivierhill.ca/




Re: [PHP-DEV] foreach() for strings

2011-06-22 Thread Reindl Harald


On 22.06.2011 15:45, Lars Schultz wrote:
> On 22.06.2011 15:40, Reindl Harald wrote:
>> and why this will not return true if $str is ISO-8859-1?
> If you RTFM (in your jargon) you would know.
> 
> http://ch.php.net/manual/en/function.htmlspecialchars.php (Return value
> Section)

i read the fucking manual

> If the input string contains an invalid code unit sequence within the
> given charset and the ENT_IGNORE flag is not set, then htmlspecialchars()
> will return an empty string.

so damned, NOT ALL CHARACTERS are multibyte and so it will return true for
"hello"
so what will you tell me above boy?





Re: [PHP-DEV] foreach() for strings

2011-06-22 Thread Anthony Ferrara
> and why this will not return true if $str is ISO-8859-1?

For lower 7 bit characters (code points <= 127) it would return true.
But if there is a single higher character (outside of ASCII), it would
only return true if the byte sequences follow UTF-8 semantics.  So it
would generally return false for ISO-8859-1 input containing such characters.

For example, character é is 0xe9 (code point 233) in ISO-8859-1, but
is encoded as 0xc3a9 in UTF-8.  So if it encountered a byte stream such as
0xe92041 ("é A"), it knows it cannot be UTF-8 since 0xe920 is not a
valid byte sequence.  But if it saw 0xc3a92041 ("é A"), it knows it
is valid UTF-8 (it could be another character set, but it is valid in
UTF-8)...

Please note that it's not checking if the string **is** UTF-8, just if
the byte sequences in the string are valid when interpreted as UTF-8.
You could have the Latin-1 string 0xc3a92041 ("Ã© A"), which parses as
valid UTF-8...
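
The two byte streams above can be checked with a short sketch. The helper name isValidUtf8() is hypothetical; it relies on PCRE's /u modifier, which makes preg_match() fail on input that is not well-formed UTF-8.

```php
<?php
// Hypothetical helper name; relies on PCRE's /u modifier to validate UTF-8.
function isValidUtf8(string $bytes): bool
{
    // preg_match() returns 1 for the empty pattern on valid input and
    // false when the subject is not well-formed UTF-8.
    return preg_match('//u', $bytes) === 1;
}

$latin1 = "\xE9\x20\x41";     // "é A" encoded as ISO-8859-1
$utf8   = "\xC3\xA9\x20\x41"; // "é A" encoded as UTF-8

$latin1IsValid = isValidUtf8($latin1); // false: 0xE9 0x20 is not a valid sequence
$utf8IsValid   = isValidUtf8($utf8);   // true, although the same bytes also read as "Ã© A" in ISO-8859-1
```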

On Wed, Jun 22, 2011 at 9:40 AM, Reindl Harald wrote:
>
> On 22.06.2011 15:30, Gustavo Lopes wrote:
>> On Wed, 22 Jun 2011 13:21:10 +0100, Reindl Harald wrote:
>>
>>> On 22.06.2011 14:14, Gustavo Lopes wrote:
>>>> It's actually 3 lines:
>>>>
>>>> function str_is_utf8($str) {
>>>>     return $str == "" || htmlspecialchars($str, 0, "UTF-8");
>>>> }
>>>
>>> WTF should this do?
>>> this won't return boolean
>>
>> The reason it works is that
>> 1) || coerces the operands into booleans (if they get to be evaluated)
>> 2) htmlspecialchars returns "" on bad input sequence
>> 3) (bool) "" === false
>>
>> But even if you didn't know these things, you should have bothered to at
>> least test it before sending this response
>
> and why this will not return true if $str is ISO-8859-1?




Re: [PHP-DEV] foreach() for strings

2011-06-22 Thread Lars Schultz

On 22.06.2011 15:40, Reindl Harald wrote:
> and why this will not return true if $str is ISO-8859-1?

If you RTFM (in your jargon) you would know.

http://ch.php.net/manual/en/function.htmlspecialchars.php (Return value
Section)






Re: [PHP-DEV] foreach() for strings

2011-06-22 Thread Reindl Harald


On 22.06.2011 15:30, Gustavo Lopes wrote:
> On Wed, 22 Jun 2011 13:21:10 +0100, Reindl Harald wrote:
> 
>> On 22.06.2011 14:14, Gustavo Lopes wrote:
>>> It's actually 3 lines:
>>>
>>> function str_is_utf8($str) {
>>>     return $str == "" || htmlspecialchars($str, 0, "UTF-8");
>>> }
>>
>>
>> WTF should this do?
>> this won't return boolean
>>
> 
> The reason it works is that
> 1) || coerces the operands into booleans (if they get to be evaluated)
> 2) htmlspecialchars returns "" on bad input sequence
> 3) (bool) "" === false
> 
> But even if you didn't know these things, you should have bothered to at 
> least test it 
> before sending this response

and why this will not return true if $str is ISO-8859-1?





Re: [PHP-DEV] foreach() for strings

2011-06-22 Thread Gustavo Lopes
On Wed, 22 Jun 2011 13:21:10 +0100, Reindl Harald wrote:

> On 22.06.2011 14:14, Gustavo Lopes wrote:
>> It's actually 3 lines:
>>
>> function str_is_utf8($str) {
>>     return $str == "" || htmlspecialchars($str, 0, "UTF-8");
>> }
>
> WTF should this do?
> this won't return boolean

The reason it works is that
1) || coerces the operands into booleans (if they get to be evaluated)
2) htmlspecialchars returns "" on bad input sequence
3) (bool) "" === false

But even if you didn't know these things, you should have bothered to at
least test it before sending this response.
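
The three points can be checked with a short sketch. The first function is the one from the thread; str_is_utf8_strict() is an assumed variant added here for illustration, because boolean coercion also treats the string "0" as false even though it is valid UTF-8.

```php
<?php
// The three-line function from the thread, reproduced for illustration.
function str_is_utf8($str)
{
    return $str == "" || htmlspecialchars($str, 0, "UTF-8");
}

// Assumed stricter variant (not from the thread): compare against "" explicitly
// instead of relying on boolean coercion of the escaped string.
function str_is_utf8_strict(string $str): bool
{
    return $str === '' || htmlspecialchars($str, 0, 'UTF-8') !== '';
}

$valid   = str_is_utf8("caf\xC3\xA9");     // true: the || expression yields a boolean
$invalid = str_is_utf8("caf\xE9");         // false: htmlspecialchars() returned ""
$zero    = str_is_utf8("0");               // false: "0" is valid UTF-8 but a falsy string
$zeroOk  = str_is_utf8_strict("0");        // true
```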


--
Gustavo Lopes




Re: [PHP-DEV] foreach() for strings

2011-06-22 Thread Kalle Sommer Nielsen
2011/6/22 Reindl Harald:
> WTF should this do?
> this won't return boolean

It's an expression whose evaluated result is a boolean, which is why it
makes sense


-- 
regards,

Kalle Sommer Nielsen
ka...@php.net




Re: [PHP-DEV] foreach() for strings

2011-06-22 Thread Reindl Harald


On 22.06.2011 14:14, Gustavo Lopes wrote:
> On Wed, 22 Jun 2011 12:14:40 +0100, Reindl Harald wrote:
>> On 22.06.2011 07:24, Tomas Kuliavas wrote:
>>> 2011.06.21 23:27 Reindl Harald wrote:
>>>> i do not understand any word and miss a simple str_is_utf8()
>>>
>>> Such function uses six lines in PHP.
>>
>> so why do you not post them?
>>
> 
> It's actually 3 lines:
> 
> function str_is_utf8($str) {
>     return $str == "" || htmlspecialchars($str, 0, "UTF-8");
> }


WTF should this do?
this won't return boolean





Re: [PHP-DEV] foreach() for strings

2011-06-22 Thread Gustavo Lopes
On Wed, 22 Jun 2011 12:14:40 +0100, Reindl Harald wrote:

> On 22.06.2011 07:24, Tomas Kuliavas wrote:
>> 2011.06.21 23:27 Reindl Harald wrote:
>>> i do not understand any word and miss a simple str_is_utf8()
>>
>> Such function uses six lines in PHP.
>
> so why do you not post them?

It's actually 3 lines:

function str_is_utf8($str) {
    return $str == "" || htmlspecialchars($str, 0, "UTF-8");
}

or

function str_is_utf8($str) {
    return preg_match('//u', $str) !== false;
}

But I agree it wouldn't hurt to have a str_is_utf8.
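
Until such a function exists, one possible wrapper is sketched below. The name is_utf8_string() is hypothetical; the sketch assumes mb_check_encoding() is preferable when the optional mbstring extension is loaded and falls back to the PCRE check otherwise, on the assumption that both agree on well-formedness for ordinary input.

```php
<?php
// Hypothetical wrapper name; not a built-in PHP function.
function is_utf8_string(string $str): bool
{
    // mbstring is an optional extension, so only use it when it is loaded.
    if (function_exists('mb_check_encoding')) {
        return mb_check_encoding($str, 'UTF-8');
    }
    // Fallback: PCRE rejects subjects that are not well-formed UTF-8.
    return preg_match('//u', $str) === 1;
}

$ok  = is_utf8_string("\xC3\xA9"); // "é" in UTF-8
$bad = is_utf8_string("\xE9");     // "é" in ISO-8859-1, not valid UTF-8
```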

--
Gustavo Lopes




Re: [PHP-DEV] foreach() for strings

2011-06-22 Thread Reindl Harald

On 22.06.2011 07:24, Tomas Kuliavas wrote:
> 2011.06.21 23:27 Reindl Harald wrote:
>> i do not understand any word and miss a simple str_is_utf8()
> 
> Such function uses six lines in PHP.

so why do you not post them?

> You can write your own.

no i can not as said

> I need locale insensitive casecmp, typecasting to unsigned 32bit int and
> bunch of other functions in PHP.

as said: i do not understand any word of this low-level bla
and so PHP and UTF8 is a real problem

> Do I have to wait for PHP implementation or just write my
> own functions?

if i have no background-knowledge about UTF8 and it does not
really interest me because i have enough jobs for two lives





Re: [PHP-DEV] foreach() for strings

2011-06-21 Thread Tomas Kuliavas
2011.06.21 23:27 Reindl Harald wrote:
> i do not understand any word and miss a simple str_is_utf8()

Such function uses six lines in PHP. You can write your own. I need locale
insensitive casecmp, typecasting to unsigned 32bit int and a bunch of other
functions in PHP. Do I have to wait for a PHP implementation or just write my
own functions?

-- 
Tomas






Re: [PHP-DEV] foreach() for strings

2011-06-21 Thread Reindl Harald


On 21.06.2011 22:19, Tomas Kuliavas wrote:
> 2011.06.21 20:51 Reindl Harald wrote:
>>> utf-8 is strict format. If you expect utf-8 and someone submits something
>>> else, you can tell that without any string function. You can verify utf-8
>>> strings in pcre. You can convert nbspace to regular space, if you want.
>>> utf-8 does not have any byte sequence that can collide with nbspace byte
>>> sequence in utf-8
>>
>> show me a practicable way to detect if some input data contains UTF8
>> mb_string-functions are out of the game because there are many servers
>> even of real big companies where they are not available
> 
> :) I've said pcre and not mbstring. If you read fine utf-8 manual like I
> did about 8 years ago, you would know how to detect 8bit inputs that are
> not in utf-8. utf-8 is variable byte length character set which has very
> specific rules about the way bytes are arranged. You can tell length of
> symbol in bytes based on first byte. You can tell what kind of byte values
> should be used for second, third, fourth, fifth or sixth byte. If you
> eliminate five valid utf-8 8bit byte sequences and still have 8bit data,
> it is not utf-8

i do not understand any word and miss a simple str_is_utf8(), or call it
as you like, which can do this natively and efficiently on a given variable
and would offer the possibility to stop a script on unexpected input
without degrading performance






Re: [PHP-DEV] foreach() for strings

2011-06-21 Thread Tomas Kuliavas
2011.06.21 20:51 Reindl Harald wrote:
>> utf-8 is strict format. If you expect utf-8 and someone submits something
>> else, you can tell that without any string function. You can verify utf-8
>> strings in pcre. You can convert nbspace to regular space, if you want.
>> utf-8 does not have any byte sequence that can collide with nbspace byte
>> sequence in utf-8
>
> show me a practicable way to detect if some input data contains UTF8
> mb_string-functions are out of the game because there are many servers
> even of real big companies where they are not available

:) I said pcre, not mbstring. If you had read the fine utf-8 manual like I
did about 8 years ago, you would know how to detect 8bit inputs that are
not in utf-8. utf-8 is a variable byte length character encoding which has
very specific rules about the way bytes are arranged. You can tell the
length of a symbol in bytes from its first byte. You can tell what kind of
byte values should be used for the second, third, fourth, fifth or sixth
byte. If you eliminate the five valid utf-8 8bit byte sequences and still
have 8bit data, it is not utf-8.
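
The byte rules described above can be sketched as a single pcre pattern.
This is only an illustration, not code from the thread: the function name
looks_like_utf8() is made up, and the pattern follows RFC 3629, which
limits utf-8 to sequences of one to four bytes, so the older five and six
byte forms are rejected.

```php
<?php
// Sketch only: checks a byte string against the UTF-8 byte layout rules.
// The function name is hypothetical.
function looks_like_utf8($bytes)
{
    $pattern = '/\A(?:
          [\x00-\x7F]                        # 1 byte: plain ASCII
        | [\xC2-\xDF][\x80-\xBF]             # 2 bytes, no overlong forms
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3 bytes, excluding overlongs
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3 bytes, regular
        | \xED[\x80-\x9F][\x80-\xBF]         # 3 bytes, excluding surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4 bytes, planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4 bytes, planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4 bytes, plane 16
    )*+\z/x';

    // No /u modifier: the pattern deliberately works on raw bytes.
    return preg_match($pattern, $bytes) === 1;
}

var_dump(looks_like_utf8("caf\xC3\xA9")); // valid 2-byte sequence: bool(true)
var_dump(looks_like_utf8("caf\xE9"));     // lone Latin-1 byte: bool(false)
```

The first byte of each alternative fixes how many continuation bytes must
follow, which is the "length from the first byte" rule. For very long
inputs a built-in check is likely to be cheaper than this pattern.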

-- 
Tomas


-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] foreach() for strings

2011-06-21 Thread Reindl Harald


On 21.06.2011 19:12, Tomas Kuliavas wrote:
>>>> and this naive attitude is the root of most security problems!
>>>>
>>>> why do you believe that every client submission is coming over
>>>> your form or generally over anything you can control?
>>>
>>> that doesn't matter here, Tomas just corrected John, that his statement
>>> that chrome will always use utf-8 encoding for some special character
>>> isn't true.
>>> browsers will adhere the
>>> http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset
>>> of course you can't trust user input, and you have to validate it, but
>>> this has nothing to do with this topic
>>
>> it has
>>
>> how du you validate input if the string-functions having undefined results
>> which you probably use for your validation?
> 
> I've never said that he should trust user input. I've only said that his
> valid user inputs depend on html form format.

and I told you that in the real world this is utopian
there is a world outside of forms

show me FIVE PHP apps which are using "accept-charset"

not counting mine - they do, and even there I cannot be sure that
all of the thousands of scripts/websites I wrote really use it everywhere

> utf-8 is strict format. If you expect utf-8 and someone submits something
> else, you can tell that without any string function. You can verify utf-8
> strings in pcre. You can convert nbspace to regular space, if you want.
> utf-8 does not have any byte sequence that can collide with nbspace byte
> sequence in utf-8

show me a practical way to detect whether some input data contains UTF-8
the mbstring functions are out of the game because there are many servers,
even at really big companies, where they are not available

so the problem is simply that you cannot really write portable and well
performing code that is aware of UTF-8








Re: [PHP-DEV] foreach() for strings

2011-06-21 Thread Tomas Kuliavas
2011.06.21 19:24 Reindl Harald wrote:
>
>
> Am 21.06.2011 18:22, schrieb Ferenc Kovacs:
>> On Tue, Jun 21, 2011 at 6:14 PM, Reindl Harald wrote:
>>
>>> Am 21.06.2011 17:55, schrieb Tomas Kuliavas:
>>>
>>>> They submit it in utf-8 only if your html form allows them to do that
>>>> or they don't follow html specification and try to exploit your form.
>>>> Set form input charset to iso-8859-1 and your nbspace will take only
>>>> one byte.
>>>
>>> and this naive attitude is the root of most security problems!
>>>
>>> why do you believe that every client submission is coming over
>>> your form or generally over anything you can control?
>>
>> that doesn't matter here, Tomas just corrected John, that his statement
>> that chrome will always use utf-8 encoding for some special character
>> isn't true.
>> browsers will adhere the
>> http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset
>> of course you can't trust user input, and you have to validate it, but
>> this has nothing to do with this topic
>
> it has
>
> how du you validate input if the string-functions having undefined results
> which you probably use for your validation?

I've never said that he should trust user input. I've only said that his
valid user inputs depend on html form format.

utf-8 is strict format. If you expect utf-8 and someone submits something
else, you can tell that without any string function. You can verify utf-8
strings in pcre. You can convert nbspace to regular space, if you want.
utf-8 does not have any byte sequence that can collide with nbspace byte
sequence in utf-8.
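
To illustrate the nbspace point, here is a small sketch (the function name
nbsp_to_space() is made up for this example). It assumes the input has
already been checked to be valid utf-8; in that case the two bytes of
U+00A0 can be replaced with plain str_replace(), because no other utf-8
character contains that byte pair.

```php
<?php
// Sketch: replace the UTF-8 encoded no-break space (U+00A0, bytes C2 A0)
// with a regular space. The function name is hypothetical.
function nbsp_to_space($utf8)
{
    // In valid UTF-8, 0xC2 only ever appears as a lead byte, so the pair
    // C2 A0 cannot start in the middle of another character.
    return str_replace("\xC2\xA0", ' ', $utf8);
}

echo nbsp_to_space("10\xC2\xA0km"), "\n"; // prints "10 km"
```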

-- 
Tomas





RE: [PHP-DEV] foreach() for strings

2011-06-21 Thread John Crenshaw
> From: Pierre Joye [mailto:pierre@gmail.com] 
> > On Tue, Jun 21, 2011 at 4:38 PM, John Crenshaw  
> > wrote:
> 
> > This mindset is fundamentally broken. You can call it a byte array all you 
> > want, but the truth is that 99.999% of the time, when a developer is using 
> > a string they need it for characters, not for bytes
> 
> Let me rephrase:
> 
> For backward compatibility reasons we cannot change this behavior.
> 
> Any serious text processing should be done using intl, mbstring,
> transliterator (pecl) or other similar solutions.
> 
> Cheers,
> --
> Pierre

Right, I totally agree. We can't fix the multibyte string issue today; I'm just 
saying that we *can* (and should) avoid making it much worse.

John Crenshaw
Priacta, Inc.




RE: RE: [PHP-DEV] foreach() for strings

2011-06-21 Thread John Crenshaw
> They submit it in utf-8 only if your html form allows them to do that or
> they don't follow html specification and try to exploit your form.

If no explicit encoding is given, all modern browsers will attempt to 
"autodetect" the encoding based on the page contents, often with unpredictable 
results. Most web developers really don't understand the whole encoding thing, 
and many aren't aware of it at all. If they aren't taking care of the encoding 
question in their server side code, what makes anyone believe that they are 
specifying the encoding in their response headers, or HTML?

I can tell you for certain that if no encoding is specified, Chrome can and 
will decide that the data is UTF8, at least under certain conditions (because I 
watched it recently when working on an encoding problem in some legacy code.)

> Set form input charset to iso-8859-1

I can't believe I just saw someone recommend that ;)

Yes, you *could* use Latin-1...for which the Euro sign, ellipsis, decorative 
quotes, trademark, em dash, and a number of other frequently pasted characters 
are still out of range.

Then, when you eventually decide that latin1 isn't meeting your needs, you'll 
get to go through the wonderful process of trying to convert all of your legacy 
data to UTF8.

Single byte just doesn't cut the mustard anymore, especially on the web. The 
world is too small. We should be trying to move PHP *away* from this, not 
towards it.

John Crenshaw
Priacta, Inc.


Re: [PHP-DEV] foreach() for strings

2011-06-21 Thread Ferenc Kovacs
On Tue, Jun 21, 2011 at 6:24 PM, Reindl Harald wrote:

> Am 21.06.2011 18:22, schrieb Ferenc Kovacs:
> > On Tue, Jun 21, 2011 at 6:14 PM, Reindl Harald wrote:
> >
> >> Am 21.06.2011 17:55, schrieb Tomas Kuliavas:
> >>
> >>> They submit it in utf-8 only if your html form allows them to do that
> >>> or they don't follow html specification and try to exploit your form.
> >>> Set form input charset to iso-8859-1 and your nbspace will take only
> >>> one byte.
> >>
> >> and this naive attitude is the root of most security problems!
> >>
> >> why do you believe that every client submission is coming over
> >> your form or generally over anything you can control?
> >
> > that doesn't matter here, Tomas just corrected John, that his statement
> > that chrome will always use utf-8 encoding for some special character
> > isn't true.
> > browsers will adhere the
> > http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset
> > of course you can't trust user input, and you have to validate it, but
> > this has nothing to do with this topic
>
> it has
>
> how du you validate input if the string-functions having undefined results
> which you probably use for your validation?

what do you mean by undefined?
if you use iso-8859-1 in your whole app and database, it doesn't matter from
the security POV if somebody sends you crafted utf-8 data.
if you mix up your encodings or you don't escape with the proper encoding,
then that can hit you (
http://shiflett.org/blog/2006/jan/addslashes-versus-mysql-real-escape-string
 )

the multibyte support in the php core isn't undefined, just unsupported. :/
use intl or mbstring for handling multibyte encodings.

Tyrael


Re: [PHP-DEV] foreach() for strings

2011-06-21 Thread Reindl Harald


On 21.06.2011 18:22, Ferenc Kovacs wrote:
> On Tue, Jun 21, 2011 at 6:14 PM, Reindl Harald wrote:
> 
>>
>>
>> Am 21.06.2011 17:55, schrieb Tomas Kuliavas:
>>
>>> They submit it in utf-8 only if your html form allows them to do that or
>>> they don't follow html specification and try to exploit your form. Set
>>> form input charset to iso-8859-1 and your nbspace will take only one
>> byte.
>>
>> and this naive attitude is the root of most security problems!
>>
>> why do you believe that every client submission is coming over
>> your form or generally over anything you can control?
>>
>>
> that doesn't matter here, Tomas just corrected John, that his statement that
> chrome will always use utf-8 encoding for some special character isn't true.
> browsers will adhere the
> http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset
> of course you can't trust user input, and you have to validate it, but this
> has nothing to do with this topic

it has

how do you validate input if the string functions which you probably use
for your validation have undefined results?





Re: [PHP-DEV] foreach() for strings

2011-06-21 Thread Ferenc Kovacs
On Tue, Jun 21, 2011 at 6:14 PM, Reindl Harald wrote:

>
>
> Am 21.06.2011 17:55, schrieb Tomas Kuliavas:
>
> > They submit it in utf-8 only if your html form allows them to do that or
> > they don't follow html specification and try to exploit your form. Set
> > form input charset to iso-8859-1 and your nbspace will take only one
> byte.
>
> and this naive attitude is the root of most security problems!
>
> why do you believe that every client submission is coming over
> your form or generally over anything you can control?
>
>
that doesn't matter here, Tomas just corrected John: his statement that
chrome will always use utf-8 encoding for some special character isn't true.
browsers will adhere to the accept-charset attribute:
http://www.w3.org/TR/html401/interact/forms.html#adef-accept-charset
of course you can't trust user input, and you have to validate it, but this
has nothing to do with this topic.

Tyrael


Re: [PHP-DEV] foreach() for strings

2011-06-21 Thread Reindl Harald


On 21.06.2011 17:55, Tomas Kuliavas wrote:

> They submit it in utf-8 only if your html form allows them to do that or
> they don't follow html specification and try to exploit your form. Set
> form input charset to iso-8859-1 and your nbspace will take only one byte.

and this naive attitude is the root of most security problems!

why do you believe that every client submission is coming over
your form or generally over anything you can control?





Re: RE: [PHP-DEV] foreach() for strings

2011-06-21 Thread Arvids Godjuks
As a userland developer, due to my geographical location I have to work with
3 languages constantly: English, Russian (Cyrillic) and Latvian (which has
its own share of non-Latin characters). I end up using utf-8 in every
project. And some give me a headache when dealing with text parsing. mbstring
covers just part of the functionality and can be turned off.

I personally think something has to be done about unicode handling in php
after 5.4 so that we have an official method of dealing with it in the core.
Probably it can be done in a namespace of its own and be new functionality
to which people should migrate.

my 2 cents.
On 21.06.2011 17:56, "Tomas Kuliavas" wrote:
> 2011.06.21 17:38 John Crenshaw rašė:
>> Pierre Joye wrote:
>>> On Tue, Jun 21, 2011 at 1:33 PM, Lester Caine wrote:
>>>> Pierre Joye wrote:
>>>>>> It depended on ICU there, and I would be against making a core thing
>>>>>> in PHP 5.x depend on ICU.
>>>>>
>>>>> It can and should be done as part of intl, actually.
>>>>>
>>>>> But that's somehow unrelated to the proposal here, as it is about
>>>>> byte, not characters :)
>>>>
>>>> I believe this may be where some of the new niggles may be coming from?
>>>> With browsers returning unicode, it may be that some of the 'extra'
>>>> characters are being returned as multibyte rather than as single bytes?
>>>> Such as the problem reported on the general list currently. How do we
>>>> ensure that we are dealing with single byte character strings nowadays?
>>>
>>> As it has been stated numerous times in this thread and other, we do
>>> not do anything with multi bytes systems, unicode, etc. mbstring and
>>> intl do, but php's string as of now is all about bytes, array of bytes
>>> if I may describe them this way.
>>>
>>> And we can't change this behavior.
>>
>> This mindset is fundamentally broken. You can call it a byte array all
>> you want, but the truth is that 99.999% of the time, when a developer is
>> using a string they need it for characters, not for bytes, and characters are
>> not single byte. Even English users tend to submit Unicode range
>> characters at an alarming rate. If you're using a WYSIWYG editor, Chrome
>> will submit non-breaking-spaces as the actual UTF8 encoded character, not
>> as an HTML encoded entity. Whether developers like it, or even know it,
>> supporting an extended universal character set is not really optional.
>
> They submit it in utf-8 only if your html form allows them to do that or
> they don't follow html specification and try to exploit your form. Set
> form input charset to iso-8859-1 and your nbspace will take only one byte.
>
> --
> Tomas


RE: [PHP-DEV] foreach() for strings

2011-06-21 Thread Tomas Kuliavas
2011.06.21 17:38 John Crenshaw wrote:
> Pierre Joye wrote:
>> On Tue, Jun 21, 2011 at 1:33 PM, Lester Caine wrote:
>>> Pierre Joye wrote:
>>>>> It depended on ICU there, and I would be against making a core thing
>>>>> in PHP 5.x depend on ICU.
>>>>
>>>> It can and should be done as part of intl, actually.
>>>>
>>>> But that's somehow unrelated to the proposal here, as it is about
>>>> byte, not characters :)
>>>
>>> I believe this may be where some of the new niggles may be coming from?
>>> With browsers returning unicode, it may be that some of the 'extra'
>>> characters are being returned as multibyte rather than as single bytes?
>>> Such as the problem reported on the general list currently. How do we
>>> ensure that we are dealing with single byte character strings nowadays?
>>
>> As it has been stated numerous times in this thread and other, we do
>> not do anything with multi bytes systems, unicode, etc. mbstring and
>> intl do, but php's string as of now is all about bytes, array of bytes
>> if I may describe them this way.
>>
>> And we can't change this behavior.
>
> This mindset is fundamentally broken. You can call it a byte array all you
> want, but the truth is that 99.999% of the time, when a developer is using
> a string they need it for characters, not for bytes, and characters are
> not single byte. Even English users tend to submit Unicode range
> characters at an alarming rate. If you're using a WYSIWYG editor, Chrome
> will submit non-breaking-spaces as the actual UTF8 encoded character, not
> as an HTML encoded entity. Whether developers like it, or even know it,
> supporting an extended universal character set is not really optional.

They submit it in utf-8 only if your html form allows them to do that or
they don't follow html specification and try to exploit your form. Set
form input charset to iso-8859-1 and your nbspace will take only one byte.

-- 
Tomas






Re: [PHP-DEV] foreach() for strings

2011-06-21 Thread Pierre Joye
On Tue, Jun 21, 2011 at 4:38 PM, John Crenshaw  wrote:

> This mindset is fundamentally broken. You can call it a byte array all you 
> want, but the truth is that 99.999% of the time, when a developer is using a 
> string they need it for characters, not for bytes

Let me rephrase:

For backward compatibility reasons we cannot change this behavior.

Any serious text processing should be done using intl, mbstring,
transliterator (pecl) or other similar solutions.

Cheers,
--
Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org




RE: [PHP-DEV] foreach() for strings

2011-06-21 Thread John Crenshaw
Pierre Joye wrote:
> On Tue, Jun 21, 2011 at 1:33 PM, Lester Caine  wrote:
>> Pierre Joye wrote:

>>>> It depended on ICU there, and I would be against making a core thing in
>>>> PHP 5.x depend on ICU.
>>>
>>> It can and should be done as part of intl, actually.
>>>
>>> But that's somehow unrelated to the proposal here, as it is about
>>> byte, not characters :)
>>
>> I believe this may be where some of the new niggles may be coming from? With
>> browsers returning unicode, it may be that some of the 'extra' characters
>> are being returned as multibyte rather than as single bytes? Such as the
>> problem reported on the general list currently. How do we ensure that we are
>> dealing with single byte character strings nowadays?
>
> As it has been stated numerous times in this thread and other, we do
> not do anything with multi bytes systems, unicode, etc. mbstring and
> intl do, but php's string as of now is all about bytes, array of bytes
> if I may describe them this way.
>
> And we can't change this behavior.

This mindset is fundamentally broken. You can call it a byte array all you 
want, but the truth is that 99.999% of the time, when a developer is using a 
string they need it for characters, not for bytes, and characters are not 
single byte. Even English users tend to submit Unicode range characters at an 
alarming rate. If you're using a WYSIWYG editor, Chrome will submit 
non-breaking-spaces as the actual UTF8 encoded character, not as an HTML 
encoded entity. Whether developers like it, or even know it, supporting an 
extended universal character set is not really optional.

PHP makes this bad enough with the whole collection of bytewise string 
functions, including many with no appropriate multibyte aware replacement, but 
at least this can be avoided, quickly audited, and in the future can even be 
fixed in any number of ways with only a nominal BC impact. Hard coding this 
single byte idiocy into a language construct (foreach) though would be an 
incredibly awful idea. This would create a trap for new naive PHP developers, 
and create a character set problem that the language could NEVER recover from 
without a massive BC break.

This proposal is really about adding a feature which, whenever it is used, is
almost guaranteed to be an error. It probably won't look to the developer like
an error during simple testing, but will almost certainly show up as an error
in production. Is it really worth all that for a bit of syntax sugar that the
developer will have to strip out anyway to fix their bug?
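
A short sketch of the failure mode, with an example string chosen for
illustration (it is not from the thread): the same UTF-8 text split byte by
byte, which is what a bytewise foreach would see, and character by
character. The character split assumes valid UTF-8 input and a PCRE build
with UTF-8 support.

```php
<?php
// "naïve" encoded as UTF-8: 6 bytes, 5 characters.
$s = "na\xC3\xAFve";

// Byte-wise view: the two bytes of "ï" arrive as separate elements.
$bytes = str_split($s);

// Character-wise view, relying on the /u modifier of PCRE.
$chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);

echo count($bytes), " bytes, ", count($chars), " characters\n";
// prints "6 bytes, 5 characters"
```

With pure ASCII test data both counts are equal, which is why the problem
tends to go unnoticed until real input arrives.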

If string iteration needs to be addressed in the core (and IMO it doesn't 
because it can be handled at the script level, but if it does) why not use 
iterator classes? This gives the same functionality and prevents the language 
from encouraging hidden bugs.

John Crenshaw
Priacta, Inc.




Re: [PHP-DEV] foreach() for strings

2011-06-21 Thread Lester Caine

Pierre Joye wrote:
> On Tue, Jun 21, 2011 at 1:33 PM, Lester Caine wrote:
>> Pierre Joye wrote:
>>>> It depended on ICU there, and I would be against making a core thing in
>>>> PHP 5.x depend on ICU.
>>>
>>> It can and should be done as part of intl, actually.
>>>
>>> But that's somehow unrelated to the proposal here, as it is about
>>> byte, not characters :)
>>
>> I believe this may be where some of the new niggles may be coming from? With
>> browsers returning unicode, it may be that some of the 'extra' characters
>> are being returned as multibyte rather than as single bytes? Such as the
>> problem reported on the general list currently. How do we ensure that we are
>> dealing with single byte character strings nowadays?
>
> As it has been stated numerous times in this thread and other, we do
> not do anything with multi bytes systems, unicode, etc. mbstring and
> intl do, but php's string as of now is all about bytes, array of bytes
> if I may describe them this way.
>
> And we can't change this behavior.


That is exactly the point. I suppose what I am asking is how people ensure that 
what they are feeding into simple strings is single byte when cut and paste 
nowadays does not make a distinction?


--
Lester Caine - G8HFL
-
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk//
Firebird - http://www.firebirdsql.org/index.php




Re: [PHP-DEV] foreach() for strings

2011-06-21 Thread Pierre Joye
On Tue, Jun 21, 2011 at 1:33 PM, Lester Caine  wrote:
> Pierre Joye wrote:
>>>
>>> It depended on ICU there, and I would be against making a core thing in
>>> >  PHP 5.x depend on ICU.
>>
>> It can and should be done as part of intl, actually.
>>
>> But that's somehow unrelated to the proposal here, as it is about
>> byte, not characters :)
>
> I believe this may be where some of the new niggles may be coming from? With
> browsers returning unicode, it may be that some of the 'extra' characters
> are being returned as multibyte rather than as single bytes? Such as the
> problem reported on the general list currently. How do we ensure that we are
> dealing with single byte character strings nowadays?

As it has been stated numerous times in this thread and other, we do
not do anything with multi bytes systems, unicode, etc. mbstring and
intl do, but php's string as of now is all about bytes, array of bytes
if I may describe them this way.

And we can't change this behavior.

Cheers,
-- 
Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org




Re: [PHP-DEV] foreach() for strings

2011-06-21 Thread Lester Caine

Pierre Joye wrote:
>> It depended on ICU there, and I would be against making a core thing in
>> PHP 5.x depend on ICU.
>
> It can and should be done as part of intl, actually.
>
> But that's somehow unrelated to the proposal here, as it is about
> byte, not characters :)


I believe this may be where some of the new niggles may be coming from? With 
browsers returning unicode, it may be that some of the 'extra' characters are 
being returned as multibyte rather than as single bytes? Such as the problem 
reported on the general list currently. How do we ensure that we are dealing 
with single byte character strings nowadays?


--
Lester Caine - G8HFL
-
Contact - http://lsces.co.uk/wiki/?page=contact
L.S.Caine Electronic Services - http://lsces.co.uk
EnquirySolve - http://enquirysolve.com/
Model Engineers Digital Workshop - http://medw.co.uk//
Firebird - http://www.firebirdsql.org/index.php




Re: [PHP-DEV] foreach() for strings

2011-06-21 Thread Pierre Joye
On Tue, Jun 21, 2011 at 12:53 PM, Derick Rethans  wrote:

> It depended on ICU there, and I would be against making a core thing in
> PHP 5.x depend on ICU.

It can and should be done as part of intl, actually.

But that's somehow unrelated to the proposal here, as it is about
byte, not characters :)

-- 
Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org




Re: [PHP-DEV] foreach() for strings

2011-06-21 Thread Derick Rethans
On Mon, 20 Jun 2011, Stas Malyshev wrote:

> On 6/20/11 9:15 AM, John Crenshaw wrote:
> > > From: Ilia Alshanetsky [mailto:i...@prohost.org]
> > > 
> > > As long as it works on a premise that a "string" is a byte array
> > > and each element represents 1 byte, +1 from me.
> > 
> > Code written on this premise is almost always bug central when people
> > finally get around to realizing why they really do need to support
> > wide characters (and everybody does, because people like to paste
> > stuff containing non-break-spaces, and decorative quotes). I really
> > don't think this single byte character mentality should be
> > encouraged.
> 
> I think you're right, TextIterator would be better (and also much easier to
> implement, I think). Didn't we have it in Unicode branch? We could port it
> back or we could have something along the lines of grapheme_extract...

It depended on ICU there, and I would be against making a core thing in 
PHP 5.x depend on ICU.

cheers,
Derick

-- 
http://derickrethans.nl | http://xdebug.org
Like Xdebug? Consider a donation: http://xdebug.org/donate.php
twitter: @derickr and @xdebug




Re: [PHP-DEV] foreach() for strings

2011-06-20 Thread Stas Malyshev

Hi!

On 6/20/11 9:15 AM, John Crenshaw wrote:
>> From: Ilia Alshanetsky [mailto:i...@prohost.org]
>>
>> As long as it works on a premise that a "string" is a byte array
>> and each element represents 1 byte, +1 from me.
>
> Code written on this premise is almost always bug central when people
> finally get around to realizing why they really do need to support
> wide characters (and everybody does, because people like to paste
> stuff containing non-break-spaces, and decorative quotes). I really
> don't think this single byte character mentality should be
> encouraged.

I think you're right, TextIterator would be better (and also much easier 
to implement, I think). Didn't we have it in Unicode branch? We could 
port it back or we could have something along the lines of 
grapheme_extract...

> Also, how do you think this will work with the Unicode conversion in
> PHP 6? Guaranteed, this will break stuff. Some people will have


I don't think we need to worry about PHP 6 now... If we ever get back to 
Unicode support, it probably will be different.

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227




RE: [PHP-DEV] foreach() for strings

2011-06-20 Thread John Crenshaw
> From: Ilia Alshanetsky [mailto:i...@prohost.org] 
> 
> As long as it works on a premise that a "string" is a byte array and
> each element represents 1 byte, +1 from me.

Code written on this premise is almost always bug central when people finally 
get around to realizing why they really do need to support wide characters (and 
everybody does, because people like to paste stuff containing non-break-spaces, 
and decorative quotes). I really don't think this single byte character 
mentality should be encouraged.

Also, how do you think this will work with the Unicode conversion in PHP 6? 
Guaranteed, this will break stuff. Some people will have written code to 
iterate characters, assuming single byte characters, some people will have 
written code ACTUALLY intending to iterate as a byte array. Sadly, we can 
almost certainly assume that the single byte characters assumption (which is 
wrong) will also be, by far, the most common. Supporting that most common case 
when moving to PHP 6 would require breaking the binary case (which was the only 
properly written code in the first place.) On the other hand, supporting the 
binary case means breaking the most common case.

John Crenshaw
Priacta, Inc.




Re: [PHP-DEV] foreach() for strings

2011-06-20 Thread Stas Malyshev

Hi!


> foreach() has many functions, looping over arrays, objects and implementing
> the iterator interface. I think it's also quite intuitive to use foreach()
> for strings, too.


I'm not sure how you'd implement such thing, but then I think things 
like next(), end(), etc. should work too...

--
Stanislav Malyshev, Software Architect
SugarCRM: http://www.sugarcrm.com/
(408)454-6900 ext. 227




RE: [PHP-DEV] foreach() for strings

2011-06-20 Thread John Crenshaw
> -Original Message-
> From: Lee davis [mailto:leedavi...@gmail.com] 
> Sent: Monday, June 20, 2011 9:12 AM
> To: Robert Eisele
> Cc: internals@lists.php.net
> Subject: Re: [PHP-DEV] foreach() for strings
> 
> I think this would be quite a useful feature, and am In favor of it.
> However, I think caution should be taken when shifting array utilities out
> of their remit and allowing them to manipulate / traverse other data types.
> You may see the floodgates opening for more request to adapt array functions
> for other uses.
> 
> Say for instance..
> 
> Could we also use current(), next() and key() for iteration of strings?
> 
> $string = 'string';
> while ($char = current($string))
> {
>     echo key($string);  // Would output the offset position I assume 0,1,2 etc??
>     echo $char;         // outputs each letter of string
>     next($string);
> }
> 
> Lee
> 
> On Mon, Jun 20, 2011 at 12:27 PM, Robert Eisele  wrote:
> 
> > foreach() has many functions, looping over arrays, objects and implementing
> > the iterator interface. I think it's also quite intuitive to use foreach()
> > for strings, too.
> >
> > If you want to implement a parser in PHP, you have to go the way with for +
> > strlen + substr() or $x[$i] to address one character of the string. We
> > could overdo the functionality of foreach() by implementing LVAL's, too,
> > in order to access single bits but this is really uncommon, even if the
> > way of thinking could be, that foreach() gives a single attribute of each
> > value, no matter if it's a complex object with the iterator interface or
> > a primitive. What do you think about this one? My point of view is, that
> > foreach() is very useful, which was acknowledged by many ppl via the
> > comments of my article.
> >
> > I think, adding features like this persuades the one or the other PHP user
> > to upgrade to 5.4.
> >
> > Robert
> >

Doing this with an explicit iterator object is a fine idea. The syntax becomes 
something like:

foreach(new TextIterator($s, 'UTF8') as $pos=>$c)
{
...
}

On the other hand, I think that trying to support iteration without using an 
iterator object to mediate would be a disaster, and I'm opposed to doing 
something like that because:
1. The code just looks wrong. PHP developers are generally insulated from the 
char-arrayness of strings. In addition, since PHP isn't typesafe, the code 
becomes highly ambiguous. Is the code iterating an array, or a string? It is 
very hard to tell just by looking. It may be convenient to write, but it's 
certainly not convenient to read or maintain later. On the other hand, with a 
mediating iterator object, the intent becomes obvious, and the code is highly 
readable.
2. The odds of iterating any given string are slim at best. Supporting current, 
key, next, etc. would require the string object internally to get bloated with 
additional unnecessary data that is almost never used. This bloat isn't a 
single int either. For optimal performance it would need to consist of no less 
than two size_t (char position and binary position), and one encoding indicator.
3. Iteration cannot work without knowing which encoding to use for the string. 
Is it UTF8? UTF16? UTF7? Binary or some single byte encoding? Some other exotic 
wide encoding? Without an iterator object in the middle, there is no way to 
specify this encoding. Always treating this as binary would also be a mistake, 
since this is almost certainly never actually the correct behavior, even though 
it may often appear to behave correctly with simple inputs.
4. I've had simple mistakes caught numerous times when foreach complains about 
getting a scalar rather than an array. So far, it has been exactly right every 
time. Allowing strings to be iterated would, in the name of convenience, 
increase the probability of stupid mistakes evading detection. Even worse, the 
code itself would look logically correct until the developer finally realizes 
that they have a string and not an array. Errors like this are probably far 
more common in most projects than the need to iterate a string, so making this 
change hurts debugging in the common case, for the sake of syntactic sugar in 
the rare case. Not a good trade.
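
A minimal userland sketch of the explicit-iterator approach, assuming UTF-8 input, a PCRE build with Unicode support, and a PHP version with scalar type declarations. The class name Utf8TextIterator is hypothetical (it is not the PHP 6 TextIterator), and unlike the two-argument call shown earlier it hard-codes the encoding:

```php
<?php
// Hypothetical stand-in for an explicit text iterator; UTF-8 only.
final class Utf8TextIterator implements IteratorAggregate
{
    private $text;

    // The string type declaration keeps the "you passed the wrong thing"
    // check: handing this an array raises a TypeError.
    public function __construct(string $text)
    {
        $this->text = $text;
    }

    public function getIterator(): Iterator
    {
        // The /u modifier makes PCRE split on code points rather than bytes.
        $chars = preg_split('//u', $this->text, -1, PREG_SPLIT_NO_EMPTY);
        if ($chars === false) {
            throw new InvalidArgumentException('Input is not valid UTF-8');
        }
        return new ArrayIterator($chars);
    }
}

$seen = array();
foreach (new Utf8TextIterator("h\xC3\xA9") as $pos => $c) {
    $seen[$pos] = $c; // $pos is the code point index, $c is one UTF-8 character
}
```

The foreach line states both the intent (iterate text) and the encoding assumption, which is the readability argument made in points 1 and 3.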

John Crenshaw
Priacta, Inc.

--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php



Re: [PHP-DEV] foreach() for strings

2011-06-20 Thread Ilia Alshanetsky
As long as it works on a premise that a "string" is a byte array and
each element represents 1 byte, +1 from me.

On Mon, Jun 20, 2011 at 7:27 AM, Robert Eisele  wrote:
> foreach() has many functions, looping over arrays, objects and implementing
> the iterator interface. I think it's also quite intuitive to use foreach()
> for strings, too.
>
> If you want to implement a parser in PHP, you have to go the way with for +
> strlen + substr() or $x[$i] to address one character of the string. We could
> overdo the functionality of foreach()
> by implementing LVAL's, too, in order to access single bits but this is
> really uncommon, even if the way of thinking could be, that foreach() gives
> a single attribute of each value, no matter
> if it's a complex object with the iterator interface or a primitive. What do
> you think about this one? My point of view is, that foreach() is very
> useful, which was acknowledged by many ppl via the comments of my article.
>
> I think, adding features like this persuades the one or the other PHP user
> to upgrade to 5.4.
>
> Robert
>




RE: [PHP-DEV] foreach() for strings

2011-06-20 Thread Jonathan Bond-Caron
On Mon Jun 20 09:11 AM, Lee davis wrote:
> 
> Could we also use current(), next() and key() for iteration of strings?
> 
> $string = 'string';
> while ($char = current($string))
> {
> echo key($string);  // Would output the offset position I assume 0,1,2 etc??
> echo $char;  // outputs each letter of string
> next($string);
> }
> 

Hopefully it can be supported without sacrificing too much performance.
As others mentioned, it seems important to distinguish between binary/byte
and text iteration.

$string = new ByteIterator('string é');
foreach($string as $i => $byte) 
  ...

$string = new TextIterator('string é');
foreach($string as $i => $char) 
  ...

When most developers get a 'string' from a database, my hunch is they assume
they would be iterating over the 'characters' (utf8, iso.. encoding) and not
individual bytes.

So +1 to string iteration as long as there's byte iteration and some plan
for text iteration / by character (with icu or not).
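
To make the byte/character distinction concrete, here is a small sketch using functions that exist in PHP today rather than the hypothetical ByteIterator and TextIterator classes. It assumes the string is UTF-8 and that PCRE was built with Unicode support:

```php
<?php
$s = "string \xC3\xA9"; // "string é" encoded as UTF-8; the é takes two bytes

// Byte view: str_split() with its default chunk length of 1 yields single bytes.
$bytes = str_split($s);

// Character view: the /u modifier makes PCRE split on UTF-8 code points.
$chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);

foreach ($bytes as $i => $byte) {
    printf("byte %d: 0x%02X\n", $i, ord($byte));
}
foreach ($chars as $i => $char) {
    printf("char %d: %s\n", $i, $char);
}
```

The two loops do not run the same number of times: the byte view has one more element than the character view because of the two-byte é.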






Re: [PHP-DEV] foreach() for strings

2011-06-20 Thread Lee davis
I think this would be quite a useful feature, and I am in favor of it.
However, I think caution should be taken when shifting array utilities out
of their remit and allowing them to manipulate / traverse other data types.
You may see the floodgates opening for more requests to adapt array functions
for other uses.

Say for instance..

Could we also use current(), next() and key() for iteration of strings?

$string = 'string';
while ($char = current($string))
{
echo key($string);  // Would output the offset position I assume 0,1,2 etc??
echo $char;  // outputs each letter of string
next($string);
}
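
For comparison, a sketch of how the same loop can be written against current PHP by splitting the string into an array first, so that current(), key() and next() operate on a real array. This is only an illustration of existing behavior, not the proposed feature; str_split() works on bytes, not characters:

```php
<?php
$string = 'a0b';
$chars = str_split($string); // one array element per byte

$out = '';
// Strict comparison matters here: a loose truthiness test would stop
// at the character "0", because "0" is falsy in PHP.
while (($char = current($chars)) !== false) {
    $out .= key($chars) . '=' . $char . ' ';
    next($chars);
}
echo $out, "\n";
```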

Lee

On Mon, Jun 20, 2011 at 12:27 PM, Robert Eisele  wrote:

> foreach() has many functions, looping over arrays, objects and implementing
> the iterator interface. I think it's also quite intuitive to use foreach()
> for strings, too.
>
> If you want to implement a parser in PHP, you have to go the way with for +
> strlen + substr() or $x[$i] to address one character of the string. We could
> overdo the functionality of foreach()
> by implementing LVAL's, too, in order to access single bits but this is
> really uncommon, even if the way of thinking could be, that foreach() gives
> a single attribute of each value, no matter
> if it's a complex object with the iterator interface or a primitive. What do
> you think about this one? My point of view is, that foreach() is very
> useful, which was acknowledged by many ppl via the comments of my article.
>
> I think, adding features like this persuades the one or the other PHP user
> to upgrade to 5.4.
>
> Robert
>


Re: [PHP-DEV] foreach() for strings

2011-06-20 Thread Pierre Joye
Hi Robert,

I would go with an RFC for that one, at least to document/cover edge
cases to help the doc team to properly document this change if it gets
approved.

Thanks for your work so far!

On Mon, Jun 20, 2011 at 1:27 PM, Robert Eisele  wrote:
> foreach() has many functions, looping over arrays, objects and implementing
> the iterator interface. I think it's also quite intuitive to use foreach()
> for strings, too.
>
> If you want to implement a parser in PHP, you have to go the way with for +
> strlen + substr() or $x[$i] to address one character of the string. We could
> overdo the functionality of foreach()
> by implementing LVAL's, too, in order to access single bits but this is
> really uncommon, even if the way of thinking could be, that foreach() gives
> a single attribute of each value, no matter
> if it's a complex object with the iterator interface or a primitive. What do
> you think about this one? My point of view is, that foreach() is very
> useful, which was acknowledged by many ppl via the comments of my article.
>
> I think, adding features like this persuades the one or the other PHP user
> to upgrade to 5.4.
>
> Robert
>



-- 
Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org




Re: [PHP-DEV] foreach() for strings

2011-06-20 Thread Pierre Joye
2011/6/20 Johannes Schlüter :
> On Mon, 2011-06-20 at 13:27 +0200, Robert Eisele wrote:
>> foreach() has many functions, looping over arrays, objects and implementing
>> the iterator interface. I think it's also quite intuitive to use foreach()
>> for strings, too.
>
> I would prefer a TextIterator as we had in the old PHP 6 as this allows
> more powerful filtering etc. using iterator semantics even though this
> might be a bit slower.

A foreach over a string should treat it as a binary buffer, with no
knowledge of its content, and only fetch it byte by byte.

TextIterator can be smarter and support Unicode when ICU is available.

Cheers,
-- 
Pierre

@pierrejoye | http://blog.thepimp.net | http://www.libgd.org




Re: [PHP-DEV] foreach() for strings

2011-06-20 Thread Derick Rethans
On Mon, 20 Jun 2011, Robert Eisele wrote:

> 2011/6/20 Derick Rethans 
> 
> > On Mon, 20 Jun 2011, Robert Eisele wrote:
> >
> > > foreach() has many functions, looping over arrays, objects and 
> > > implementing the iterator interface. I think it's also quite 
> > > intuitive to use foreach() for strings, too.
> >
> > > If you want to implement a parser in PHP, you have to go the way 
> > > with for + strlen + substr() or $x[$i] to address one character of 
> > > the string.
> >
> > Yes, this sounds like a good addition to me. One question though, 
> > what to do with an object that implements __toString() ?
> >
> 
> That's the question, maybe one must force __toString() via an explicit 
> string-cast:
> 
> foreach( (string) $obj as $k=>$v)

That's a sensible thing indeed. We just need to make sure we document 
this with an example and a test then.
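
A sketch of what such an example might look like. Current PHP cannot iterate a string directly, so str_split() stands in for the proposed behavior (byte by byte); the Label class is hypothetical and exists only for illustration:

```php
<?php
// Hypothetical value object used only for illustration.
final class Label
{
    private $text;

    public function __construct(string $text)
    {
        $this->text = $text;
    }

    public function __toString(): string
    {
        return $this->text;
    }
}

$obj = new Label('abc');

// Without the cast, foreach keeps its existing object semantics and walks
// the accessible properties. There are none visible from here.
$props = array();
foreach ($obj as $k => $v) {
    $props[$k] = $v;
}

// With the explicit cast, __toString() is invoked and the result is iterated.
$pairs = array();
foreach (str_split((string) $obj) as $k => $v) {
    $pairs[$k] = $v;
}
```

Because objects already have a meaning in foreach, an implicit __toString() call would be ambiguous, which is one reason to require the cast.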

cheers,
Derick

-- 
http://derickrethans.nl | http://xdebug.org
Like Xdebug? Consider a donation: http://xdebug.org/donate.php
twitter: @derickr and @xdebug




Re: [PHP-DEV] foreach() for strings

2011-06-20 Thread Derick Rethans
On Mon, 20 Jun 2011, Johannes Schlüter wrote:

> On Mon, 2011-06-20 at 13:27 +0200, Robert Eisele wrote:
> > foreach() has many functions, looping over arrays, objects and implementing
> > the iterator interface. I think it's also quite intuitive to use foreach()
> > for strings, too.
> 
> I would prefer a TextIterator as we had in the old PHP 6 as this allows
> more powerful filtering etc. using iterator semantics even though this
> might be a bit slower.

I think TextIterator is another good addition, but it will make PHP 
depend on ICU. Therefore, just a foreach for strings seems 
valuable to me.

Derick

Re: [PHP-DEV] foreach() for strings

2011-06-20 Thread Johannes Schlüter
On Mon, 2011-06-20 at 13:27 +0200, Robert Eisele wrote:
> foreach() has many functions, looping over arrays, objects and implementing
> the iterator interface. I think it's also quite intuitive to use foreach()
> for strings, too.

I would prefer a TextIterator as we had in the old PHP 6 as this allows
more powerful filtering etc. using iterator semantics even though this
might be a bit slower.

johannes






Re: [PHP-DEV] foreach() for strings

2011-06-20 Thread Robert Eisele
2011/6/20 Derick Rethans 

> On Mon, 20 Jun 2011, Robert Eisele wrote:
>
> > foreach() has many functions, looping over arrays, objects and implementing
> > the iterator interface. I think it's also quite intuitive to use foreach()
> > for strings, too.
>
> > If you want to implement a parser in PHP, you have to go the way with for +
> > strlen + substr() or $x[$i] to address one character of the string.
>
> Yes, this sounds like a good addition to me. One question though, what
> to do with an object that implements __toString() ?
>

That's the question. Maybe one must force __toString() via an explicit
string cast:

foreach( (string) $obj as $k=>$v)


>
> > We could
> > overdo the functionality of foreach()
> > by implementing LVAL's, too, in order to access single bits but this is
> > really uncommon, even if the way of thinking could be, that foreach() gives
> > a single attribute of each value, no matter
> > if it's a complex object with the iterator interface or a primitive. What do
> > you think about this one? My point of view is, that foreach() is very
> > useful, which was acknowledged by many ppl via the comments of my article.
>
> I don't think we should do it for bits, as nothing in PHP really does do
> anything with that. If you want to do stuff with bits, I think the
> "bitset" package (http://pecl.php.net/package/Bitset) is the way
> forwards.
>

Yep, I totally agree.

>
> cheers,
> Derick
>
> --
> http://derickrethans.nl | http://xdebug.org
> Like Xdebug? Consider a donation: http://xdebug.org/donate.php
> twitter: @derickr and @xdebug
>


Re: [PHP-DEV] foreach() for strings

2011-06-20 Thread Derick Rethans
On Mon, 20 Jun 2011, Robert Eisele wrote:

> foreach() has many functions, looping over arrays, objects and implementing
> the iterator interface. I think it's also quite intuitive to use foreach()
> for strings, too.

> If you want to implement a parser in PHP, you have to go the way with for +
> strlen + substr() or $x[$i] to address one character of the string.

Yes, this sounds like a good addition to me. One question though, what 
to do with an object that implements __toString() ?

> We could
> overdo the functionality of foreach()
> by implementing LVAL's, too, in order to access single bits but this is
> really uncommon, even if the way of thinking could be, that foreach() gives
> a single attribute of each value, no matter
> if it's a complex object with the iterator interface or a primitive. What do
> you think about this one? My point of view is, that foreach() is very
> useful, which was acknowledged by many ppl via the comments of my article.

I don't think we should do it for bits, as nothing in PHP really does 
anything with them. If you want to do stuff with bits, I think the 
"bitset" package (http://pecl.php.net/package/Bitset) is the way 
forward.
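
For the rare case where single bits of an integer are needed, a plain userland sketch with shifts and masks is enough. This does not use the Bitset extension (its API is not reproduced here); bitsOf() is a hypothetical helper, and generators and scalar type declarations assume a newer PHP than the one discussed in this thread:

```php
<?php
// Hypothetical helper: yields the low $width bits of $value,
// least significant bit first, keyed by bit position.
function bitsOf(int $value, int $width = 8): Generator
{
    for ($i = 0; $i < $width; $i++) {
        yield $i => ($value >> $i) & 1;
    }
}

$bits = array();
foreach (bitsOf(5, 4) as $pos => $bit) {
    $bits[$pos] = $bit; // 5 is binary 0101
}
```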

cheers,
Derick

-- 
http://derickrethans.nl | http://xdebug.org
Like Xdebug? Consider a donation: http://xdebug.org/donate.php
twitter: @derickr and @xdebug




[PHP-DEV] foreach() for strings

2011-06-20 Thread Robert Eisele
foreach() has many uses: looping over arrays, over objects, and over anything
implementing the Iterator interface. I think it would also be quite intuitive
to use foreach() for strings.

If you want to implement a parser in PHP, you have to use for + strlen() +
substr() or $x[$i] to address one character of the string. We could even
overdo the functionality of foreach() by supporting LVALs, too, in order to
access single bits, but this is really uncommon, even if the way of thinking
could be that foreach() yields a single element of each value, no matter
whether it's a complex object with the Iterator interface or a primitive.
What do you think about this? My point of view is that foreach() is very
useful, which many people acknowledged in the comments on my article.
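
For reference, a sketch of the for + strlen() + $x[$i] idiom as it looks in a tiny tokenizer. The tokenize() function and its token rules (runs of digits, single non-space bytes) are hypothetical and chosen only to keep the example short:

```php
<?php
// Hypothetical tokenizer: splits input into runs of digits and
// single non-space bytes, using the index-based loop described above.
function tokenize(string $x): array
{
    $digits = '0123456789';
    $tokens = array();
    $len = strlen($x);
    for ($i = 0; $i < $len; $i++) {
        $c = $x[$i]; // one byte, not necessarily one character
        if ($c === ' ') {
            continue;
        }
        if (strpos($digits, $c) !== false) {
            // Consume the rest of the digit run by peeking one byte ahead.
            $num = $c;
            while ($i + 1 < $len && strpos($digits, $x[$i + 1]) !== false) {
                $num .= $x[++$i];
            }
            $tokens[] = $num;
        } else {
            $tokens[] = $c;
        }
    }
    return $tokens;
}

$tokens = tokenize('12 + 345*6');
```

The look-ahead at $x[$i + 1] is the kind of access a plain foreach over the string would not give directly, so parsers may still need the index.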

I think adding features like this would persuade some PHP users
to upgrade to 5.4.

Robert