ID:               49687
Updated by:       scott...@php.net
Reported By:      sird at rckc dot at
Status:           Open
Bug Type:         *Unicode Issues
Operating System: *
PHP Version:      5.2.11
-Assigned To:
+Assigned To:     scottmac
New Comment:

PHP 5 has binary strings, not utf-8 strings. It does not attempt to do
any validation on input, so expecting addslashes to magically validate
things as utf-8 is wrong, simple as.

I agree that utf8_decode should do proper validation here though the
overhead of doing that validation is going to be slow.

So I've coded up a utf8_validate function. Still need to sort out some
of the behaviour first.


Previous Comments:
------------------------------------------------------------------------

[2009-10-16 03:41:30] sird at rckc dot at

oops! you are right, :) the code before was unsigned short.

still, the other vulnerabilities remain. I've made a blogpost that
explains the other issues ;)

http://sirdarckcat.blogspot.com/2009/10/couple-of-unicode-issues-on-php-and.html

I updated the post to note the last bug was fixed on 5.2.11

Greetings!!

------------------------------------------------------------------------

[2009-10-16 03:32:19] scott...@php.net

On a 16-bit processor an int might be 16-bit, if you can get PHP to
compile then well done :-)

Did you even try running the test code?

------------------------------------------------------------------------

[2009-10-16 01:36:27] sird at rckc dot at

: ras...@php.net

It has come to my attention that this hasn't been fixed..

unsigned int has a size of 16 bits, don't take my word for it

http://www.acm.uiuc.edu/webmonkeys/book/c_guide/1.2.html
Section: 1.2.2 Variables
unsigned int 16 bits

I just downloaded PHP 5.2.11, and I quote the code:

// php-5.2.11.tar.bz2/php-5.2.11/ext/xml/xml.c#558
PHPAPI char *xml_utf8_decode( // ...
{
    int pos = len;
    char *newbuf = emallo // ...
    unsigned int c; // sizeof(unsigned int)==16 bits
    char (*decoder)(unsig // ...
    xml_encoding *enc = x // ...
    // ...
    // #580
    c = (unsigned char)(*s);
    if (c >= 0xf0) { /* four bytes encoded, 21 bits */
        if(pos-4 >= 0) {
            c = ((s[0]&7)<<18) | ((s[1]&63)<<12) | ((s[2]&63)<<6) | (s[3]&63);
        } else {
            c = '?';
        }
        s += 4;
        pos -= 4;
        // ...

Also no checking at ALL is made on the continuation bytes (they should
be in the form 10xx xxxx). A check is very easy: to check whether a
byte s[i] has the correct form, you do an AND with 1100 0000 and then
compare it with 1000 0000.

(s[i] & 0xC0) == 0x80

Also, overlong UTF-8 is not being taken care of, and that is yet
another vulnerability.

Greetings!!

------------------------------------------------------------------------

[2009-09-29 05:29:22] sird at rckc dot at

the rest is still dangerous.. eating chars without the 10xx xxxx is
against the spec, and overlong UTF.

------------------------------------------------------------------------

[2009-09-29 04:56:08] ras...@php.net

> there are several bugs in the code, one of them is that a variable
> holding the value of the char is overflowed (trying to put 21 bits
> in a 16 bits int)

That was fixed in 5.2.11

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
    http://bugs.php.net/49687
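
The checks discussed in this thread (continuation bytes of the form
10xx xxxx, rejection of overlong encodings, and a code point variable
wide enough for 21 bits) can be sketched in C roughly as follows. This
is only an illustration: it is not PHP's xml_utf8_decode and not the
utf8_validate function mentioned in the new comment. The names
utf8_decode_one and utf8_is_valid are hypothetical. The sketch uses
uint32_t so the result does not depend on the width of int, and it also
rejects surrogates and values above U+10FFFF, which goes slightly
beyond what the comments ask for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of PHP): decode one UTF-8 sequence from
 * s, with len bytes available. On success, store the code point in *out
 * and return the number of bytes consumed (1 to 4). Return 0 if the
 * sequence is malformed or truncated. */
size_t utf8_decode_one(const unsigned char *s, size_t len, uint32_t *out)
{
    /* Smallest code point that legitimately needs N bytes; anything
     * below it encoded in N bytes is an overlong form. */
    static const uint32_t min_for_len[5] = {0, 0, 0x80, 0x800, 0x10000};
    size_t need, i;
    uint32_t c; /* fixed 32-bit type: always wide enough for 21 bits */

    assert(s != NULL && out != NULL);
    if (len == 0) return 0;

    /* Classify the lead byte and keep its payload bits. */
    if (s[0] < 0x80)                { need = 1; c = s[0]; }
    else if ((s[0] & 0xE0) == 0xC0) { need = 2; c = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { need = 3; c = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { need = 4; c = s[0] & 0x07; }
    else return 0; /* stray continuation byte or invalid lead byte */

    if (len < need) return 0; /* truncated sequence */

    /* Every following byte must look like 10xx xxxx. */
    for (i = 1; i < need; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        c = (c << 6) | (uint32_t)(s[i] & 0x3F);
    }

    if (c < min_for_len[need]) return 0;      /* overlong encoding */
    if (c > 0x10FFFF) return 0;               /* outside Unicode range */
    if (c >= 0xD800 && c <= 0xDFFF) return 0; /* UTF-16 surrogate */

    *out = c;
    return need;
}

/* Hypothetical helper: return 1 if the whole buffer is well-formed
 * UTF-8 according to utf8_decode_one, otherwise 0. */
int utf8_is_valid(const unsigned char *s, size_t len)
{
    while (len > 0) {
        uint32_t cp;
        size_t n = utf8_decode_one(s, len, &cp);
        if (n == 0) return 0;
        s += n;
        len -= n;
    }
    return 1;
}
```

With this sketch, the byte pair 0xC0 0xAF (an overlong encoding of '/')
is rejected, and so is 0xC3 followed by 0x27, where a lead byte would
otherwise "eat" a following quote character.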