Hi,
Here is a patch I am submitting to fix bug #42396 (PHP 5).
The problem: when PHP is configured with the '--enable-zend-multibyte' option,
it tries to autodetect Unicode-encoded scripts. If a script contains null
bytes after a __halt_compiler() directive, it is treated as UTF-16 or
UTF-32, and execution typically produces a lot of '?' garbage. In practice,
this makes PHK and PHAR incompatible with the zend-multibyte feature.
The only workaround so far is to turn off the (undocumented) 'detect_unicode'
flag. That is not a real solution, as people may want to use Unicode detection
along with PHK/PHAR packages, and there is no logical reason to keep them
incompatible.
The patch I am submitting assumes that a document encoded in UTF-8, UTF-16, or
UTF-32 cannot contain a sequence of four 0xff bytes. So, it adds a small
detection loop before scanning the script for null bytes. If a sequence of four
0xff bytes is found, Unicode detection is aborted and the script is treated as
non-Unicode, whatever other binary data it contains. Of course, this
detection happens after looking for a byte-order mark.
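
For readers who want to try the check outside the engine, here is a minimal
standalone sketch of the same scan. The name has_ff_marker() is hypothetical
and not part of the patch; it only assumes a byte buffer and its length, and
it guards against short buffers itself instead of relying on the earlier
size check in zend_multibyte.c.

```c
#include <stddef.h>

/* Hypothetical helper (not in the patch): returns 1 if buf contains
 * four consecutive 0xff bytes, 0 otherwise. */
int has_ff_marker(const unsigned char *buf, size_t len)
{
    size_t i;

    if (len < 4) {
        return 0; /* too short to hold the four-byte marker */
    }
    /* Stop at the last position where four bytes are still available. */
    for (i = 0; i + 4 <= len; i++) {
        if (buf[i] == 0xff && buf[i + 1] == 0xff &&
            buf[i + 2] == 0xff && buf[i + 3] == 0xff) {
            return 1;
        }
    }
    return 0;
}
```

A caller would skip the null-byte heuristic whenever this returns 1, which is
what the "return NULL" in the patch does.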
Now, I can modify the PHK_Creator tool to write four 0xff bytes right after the
__halt_compiler() directive, which makes the generated PHK archives compatible
with zend-multibyte. The same can be done for PHAR.
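
To illustrate the resulting file layout, here is a rough sketch of how an
archive generator could assemble its output. It is not PHK_Creator's actual
code: build_archive() and its parameters are made up for the example, and the
stub is assumed to end with the __halt_compiler() directive.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical layout: <PHP stub ending in __halt_compiler();>
 *                      <four 0xff bytes> <binary payload>
 * Returns the number of bytes written, or 0 if out is too small. */
size_t build_archive(unsigned char *out, size_t out_cap,
                     const char *stub,
                     const unsigned char *payload, size_t payload_len)
{
    static const unsigned char marker[4] = { 0xff, 0xff, 0xff, 0xff };
    size_t stub_len = strlen(stub);
    size_t total = stub_len + sizeof(marker) + payload_len;

    if (total > out_cap) {
        return 0;
    }
    memcpy(out, stub, stub_len);
    /* The marker tells the patched detection code to give up early. */
    memcpy(out + stub_len, marker, sizeof(marker));
    if (payload_len > 0) {
        memcpy(out + stub_len + sizeof(marker), payload, payload_len);
    }
    return total;
}
```

With this layout, the payload may contain any number of null bytes: the scan
for four 0xff bytes succeeds first, so the null-byte heuristic is never
reached.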
It would be better if we could scan the script for null bytes only up to the
__halt_compiler() directive, but I suspect that is impossible, as the script
has not been compiled yet at that point...
Regards
Francois
--- zend_multibyte.c.old 2007-01-01 10:35:46.000000000 +0100
+++ zend_multibyte.c 2007-08-23 17:22:24.000000000 +0200
@@ -1035,6 +1035,7 @@
zend_encoding *script_encoding = NULL;
int bom_size;
char *script;
+ unsigned char *p,*p_end;
if (LANG_SCNG(script_org_size) < sizeof(BOM_UTF32_LE)-1) {
return NULL;
@@ -1069,6 +1070,18 @@
return script_encoding;
}
+ /* Search for four 0xff bytes - if found, script cannot be unicode */
+
+ p=(unsigned char *)LANG_SCNG(script_org);
+ p_end=(p+LANG_SCNG(script_org_size)-3);
+ while (p < p_end) {
+ if ( ((* p) ==(unsigned char)0x0ff)
+ && ((*(p+1))==(unsigned char)0x0ff)
+ && ((*(p+2))==(unsigned char)0x0ff)
+ && ((*(p+3))==(unsigned char)0x0ff)) return NULL;
+ p++;
+ }
+
/* script contains NULL bytes -> auto-detection */
if (memchr(LANG_SCNG(script_org), 0, LANG_SCNG(script_org_size))) {
/* make best effort if BOM is missing */
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php