Hi Richard,

----- Original Message -----
From: "Richard Lynch"
Sent: Thursday, July 05, 2007 10:43 PM

> On Fri, June 29, 2007 1:21 am, Tomas Kuliavas wrote:
> >> If unicode semantics are "on" what exactly is borked in PHP 5?
> >
> > In Unicode mode \[0-7]{1,3} and \x[0-9A-Fa-f]{1,2} refer to unicode
> > code points and not to octal or hexadecimal byte values. Fix is not
> > backwards compatible.
>
> Gak.
>
> You mean this will break:
>
> <?php
> $mask = 0xf0;
> $value = $_POST['foo'] & $mask;
> ?>
>
> because of Unicode?
>
> That's nuts.
>
> That can't be right...

No, that shouldn't break. $mask is an int, so the other operand of &
(and of the other bitwise operators) would also be converted to int.
The result should be the same whether $_POST['foo'] is a binary string
or Unicode.

And I don't understand the previous message about \[0-7]{1,3} and
\x[0-9A-Fa-f]{1,2} (inside of strings, that means) referring to Unicode
code points. I think octal and hex escapes work the same in Unicode
mode...

> > Scripts can't match bytes. How they are supposed to check if string
> > is in plain ascii or in 8bit? Do conversion to ASCII and check for
> > errors instead of looking for 8bit byte values? How can scripts
> > replace 8bit bytes with some other strings? ISO-8859-2 decoding
> > table contains 95 entries written and evaluated as binary strings.
> > Same thing applies to other iso-8859 and windows-125x character
> > sets. iso-8859-1 and utf-8 decoding does not use mapping tables and
> > performs complex calculations with byte values. multibyte character
> > set decoding might actually benefit from unicode_encode(), if
> > Table 325 (http://www.php.net/unicode) provides more information
> > about U_INVALID_SUBSTITUTE and other unicode settings.
>
> I don't even understand this.
>
> But if I haven't done something new-fangled to make a string be some
> new-fangled Unicode thingie, then it's just plain old ASCII, no?
>
> Or PHP can just assume that anyway...

No, that's basically the issue that this thread is about -- that when
unicode.semantics=On, even though you *haven't done* anything
new-fangled with Unicode, it IS Unicode regardless (unless binary
strings are explicitly used). That's how things may suddenly behave
differently.

Did you see my message from a couple of weeks ago?
http://marc.info/?l=php-dev&m=118234541809801&w=2

It seems to me it would be great if any new Unicode behavior had to be
requested explicitly. Internally, Unicode would always be there and
ready to use regardless of a setting, and old code would continue to
work as before. What do you think? I'd hoped for some replies about it,
since I also have some ideas about possible internals concerns...

[...]

> > PHP6 could introduce new Unicode aware functions, but Unicode
> > implementation chose to modify existing ones. All low level string
> > operations ($string[1]) are Unicode aware by default and not when
> > script actually asks for it. Such implementation is designed for
> > developers, who don't care about Unicode support and want it out of
> > the box without any changes in their Unicode unaware scripts. It is
> > not designed for developers that actually need it and want to have
> > code working in PHP6 and PHP4/5.
>
> But an old script ought to just work...

Again, not necessarily if the Unicode switch is on.

Matt

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
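
For reference, a minimal sketch of the & case discussed earlier in this message. It shows ordinary PHP 5 behavior only and has not been checked against a build with unicode.semantics=On. $input is a hypothetical stand-in for $_POST['foo'], and the second half is added only for contrast with the int case:

```php
<?php
// Hypothetical stand-in for $_POST['foo']; form input arrives as a string.
$input = '255';
$mask  = 0xf0;

// One operand is an int, so the numeric string is converted to int
// before the bitwise AND is applied.
$value = $input & $mask;
var_dump($value); // int(240)

// For contrast: when BOTH operands are strings, & works on the byte
// values of the characters instead of converting to int.
$bytes = 'a' & 'A'; // 0x61 & 0x41 = 0x41
var_dump($bytes); // string(1) "A"
```

Only the second form depends on byte values, so it is the kind of code that a change in string semantics could plausibly affect; the int-mask form from the quoted script goes through integer conversion first.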