Re: [PHP-DEV] What is the use of "unicode.semantics" in PHP 6?

Matt Wilmas Thu, 05 Jul 2007 21:36:40 -0700

Hi Richard,

----- Original Message ----- 
From: "Richard Lynch"
Sent: Thursday, July 05, 2007 10:43 PM

> On Fri, June 29, 2007 1:21 am, Tomas Kuliavas wrote:
> >> If unicode semantics are "on" what exactly is borked in PHP 5?
> >
> > In Unicode mode \[0-7]{1,3} and \x[0-9A-Fa-f]{1,2} refer to unicode
> > code points and not to octal or hexadecimal byte values. Fix is not
> > backwards compatible.
>
> Gak.
>
> You mean this will break:
>
> <?php
>   $mask = 0xf0;
>   $value = $_POST['foo'] & $mask;
> ?>
>
> because of Unicode?
>
> That's nuts.
>
> That can't be right...

No, that shouldn't break.  $mask is an int, so with & (and the other
bitwise operators) the other operand is converted to int as well.  The
result should be the same whether $_POST['foo'] is a binary string or
Unicode.
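A minimal sketch of that conversion, as it behaves in PHP 5 and later.
$input is a hypothetical stand-in for $_POST['foo'], and the string-only
case at the end is included just for contrast:

```php
<?php
// Stand-in for $_POST['foo']; form input always arrives as a string.
$input = '255';
$mask  = 0xf0;

// int & string: the numeric string is converted to int first.
$value = $input & $mask;
var_dump($value); // int(240)

// Only when BOTH operands are strings does & work byte by byte.
var_dump('12' & '6'); // string(1) "0"
```

Since one operand in Richard's example is an int, only the first case
applies to it.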

And I don't understand the previous message about \[0-7]{1,3} and
\x[0-9A-Fa-f]{1,2} (inside of strings, that means) referring to Unicode code
points.  I think octal and hex escapes work the same in Unicode mode...
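For reference, this is what the escapes do in an ordinary byte string
today (PHP 5).  It is only a sketch of the current behaviour; whether the
same holds with unicode.semantics=On is exactly the open question here,
and the code does not settle it:

```php
<?php
// Octal \101 and hex \x41 both produce byte 0x41, the letter "A".
var_dump("\101" === 'A'); // bool(true)
var_dump("\x41" === 'A'); // bool(true)

// Above 0x7F the escape is still a single byte in a byte string.
$s = "\xE9";
var_dump(strlen($s)); // int(1)
var_dump(ord($s));    // int(233)
```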

> > Scripts can't match bytes. How are they supposed to check if string is
> > in plain ascii or in 8bit? Do conversion to ASCII and check for errors
> > instead of looking for 8bit byte values? How can scripts replace 8bit
> > bytes with some other strings? ISO-8859-2 decoding table contains 95
> > entries written and evaluated as binary strings. Same thing applies to
> > other iso-8859 and windows-125x character sets. iso-8859-1 and utf-8
> > decoding does not use mapping tables and performs complex calculations
> > with byte values. multibyte character set decoding might actually
> > benefit from unicode_encode(), if Table 325 (http://www.php.net/unicode)
> > provides more information about U_INVALID_SUBSTITUTE and other unicode.
> > settings.
>
> I don't even understand this.
>
> But if I haven't done something new-fangled to make a string be some
> new-fangled Unicode thingie, then it's just plain old ASCII, no?
>
> Or PHP can just assume that anyway...

No, that's basically the issue that this thread is about -- that when
unicode.semantics=On, even though you *haven't done* anything new-fangled
with Unicode, it IS Unicode regardless (unless binary strings are explicitly
used).  That's how things may behave differently all of a sudden.
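By "explicitly used" I mean something like the b prefix on a literal.  A
rough sketch follows; the prefix is accepted from PHP 5.2.1 onward, where
it changes nothing.  The comments about Unicode mode are my reading of
this thread, not something the snippet itself demonstrates:

```php
<?php
// The b prefix marks a binary (byte) string literal explicitly.
$bytes = b"caf\xC3\xA9"; // "café" encoded as UTF-8: 5 bytes

// Length and offsets are counted in bytes for a binary string.
var_dump(strlen($bytes));     // int(5)
var_dump(bin2hex($bytes[3])); // string(2) "c3"

// Assumption: with unicode.semantics=On an unprefixed literal would be
// a Unicode string instead, and these operations would count code points.
```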

Did you see my message a couple weeks ago?:
http://marc.info/?l=php-dev&m=118234541809801&w=2  Seems to me it would be
great if any new Unicode stuff had to be explicitly specified, though
internally Unicode would always be there ready to use, regardless of a
setting, and old code would continue to work as before.

What do you think?  I'd hoped for some replies about it, since I also have
some ideas about possible internals concerns...

[...]
> > PHP6 could introduce new Unicode aware functions, but Unicode
> > implementation chose to modify existing ones. All low level string
> > operations ($string[1]) are Unicode aware by default and not when
> > script actually asks for it. Such implementation is designed for
> > developers, who don't care about Unicode support and want it out of
> > the box without any changes in their Unicode unaware scripts. It is
> > not designed for developers that actually need it and want to have
> > code working in PHP6 and PHP4/5.
>
> But an old script ought to just work...

Again, not necessarily if the Unicode switch is on.
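To make that concrete, here is the kind of byte-oriented check Tomas was
describing, sketched for PHP 4/5.  is_plain_ascii() is a hypothetical
helper name, not an existing function.  It relies on strings being bytes,
which is the assumption the switch would change:

```php
<?php
// Hypothetical helper: true when $s contains no byte >= 0x80.
function is_plain_ascii($s)
{
    // No /u modifier, so PCRE compares raw bytes, not code points.
    return preg_match('/[\x80-\xFF]/', $s) === 0;
}

var_dump(is_plain_ascii('hello'));   // bool(true)
var_dump(is_plain_ascii("caf\xE9")); // bool(false)
```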

Matt

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
