Hi,
IMHO, #42396 is not a bug; it is the specified behavior.
A normal script does not contain a null byte unless it is encoded in Unicode.
Adding detection of a unique byte sequence such as '0xFFFFFFFF'
to support PHAR/PHK is understandable,
but that is a change that adds a new feature.
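
To make the two behaviors concrete, here is a minimal standalone sketch of
the current rule (a null byte triggers auto-detection) and the proposed
rule (four 0xFF bytes suppress it). The helper names has_ff_marker() and
needs_unicode_autodetect() are hypothetical and do not exist in the Zend
source; this only illustrates the logic under discussion, it is not the
actual zend_multibyte.c code.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper (not in the Zend source): returns 1 if buf
 * contains four consecutive 0xFF bytes, 0 otherwise. */
static int has_ff_marker(const unsigned char *buf, size_t len)
{
    size_t i;

    /* The bound i + 4 <= len keeps buf[i + 3] inside the buffer and
     * also covers buffers shorter than four bytes. */
    for (i = 0; i + 4 <= len; i++) {
        if (buf[i] == 0xFF && buf[i + 1] == 0xFF &&
            buf[i + 2] == 0xFF && buf[i + 3] == 0xFF) {
            return 1;
        }
    }
    return 0;
}

/* Hypothetical helper: returns 1 if the script would be handed to
 * Unicode auto-detection under the proposed rule, 0 otherwise.
 * BOM handling is deliberately left out of this sketch. */
static int needs_unicode_autodetect(const unsigned char *buf, size_t len)
{
    if (has_ff_marker(buf, len)) {
        return 0; /* marker present: treat the script as non-Unicode */
    }
    /* Existing rule: any null byte triggers auto-detection. */
    return memchr(buf, 0, len) != NULL;
}
```

Without the marker check, needs_unicode_autodetect() reduces to the
memchr() test alone, which is the behavior I consider to be the
specification.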
Rui
On Thu, 23 Aug 2007 18:58:52 +0200
LAUPRETRE François (P) <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Here is a patch I am submitting to fix bug #42396 (PHP 5).
>
> The problem: when PHP is configured with the '--enable-zend-multibyte'
> option, it tries to autodetect Unicode-encoded scripts. If a script
> contains null bytes after an __halt_compiler() directive, it is
> treated as UTF-16 or UTF-32, and execution typically produces a lot of
> '?' garbage. In practice, this makes PHK and PHAR incompatible with the
> zend-multibyte feature.
>
> The only workaround was to turn off the (undocumented) 'detect_unicode' flag.
> But it is not a real solution, as people may want to use unicode detection
> along with PHK/PHAR packages, and there's no logical reason to keep them
> incompatible.
>
> The patch I am submitting assumes that a document encoded in UTF-8, UTF-16,
> or UTF-32 cannot contain a sequence of four 0xff bytes. So, it adds a small
> detection loop before scanning the script for null bytes. If a sequence of
> four 0xff bytes is found, Unicode detection is aborted and the script is
> treated as non-Unicode, whatever other binary data it contains. Of course,
> this detection happens after looking for a byte-order mark.
>
> Now, I can modify the PHK_Creator tool to write four 0xff bytes after the
> __halt_compiler() directive, which makes the generated PHK archives
> compatible with zend-multibyte. The same applies to PHAR.
>
> It would be better if we could scan the script for null bytes only up to the
> __halt_compiler() directive, but I suspect that is impossible, as the script
> is not yet compiled at that point...
>
> Regards
>
> Francois
>
> --- zend_multibyte.c.old 2007-01-01 10:35:46.000000000 +0100
> +++ zend_multibyte.c 2007-08-23 17:22:24.000000000 +0200
> @@ -1035,6 +1035,7 @@
> zend_encoding *script_encoding = NULL;
> int bom_size;
> char *script;
> + unsigned char *p,*p_end;
>
> if (LANG_SCNG(script_org_size) < sizeof(BOM_UTF32_LE)-1) {
> return NULL;
> @@ -1069,6 +1070,18 @@
> return script_encoding;
> }
>
> + /* Search for four 0xff bytes - if found, script cannot be unicode */
> +
> + p=(unsigned char *)LANG_SCNG(script_org);
> + p_end=(p+LANG_SCNG(script_org_size)-3);
> + while (p < p_end) {
> + if ( ((* p) ==(unsigned char)0x0ff)
> + && ((*(p+1))==(unsigned char)0x0ff)
> + && ((*(p+2))==(unsigned char)0x0ff)
> + && ((*(p+3))==(unsigned char)0x0ff)) return NULL;
> + p++;
> + }
> +
> /* script contains NULL bytes -> auto-detection */
> if (memchr(LANG_SCNG(script_org), 0, LANG_SCNG(script_org_size))) {
> /* make best effort if BOM is missing */
>
--
Rui Hirokawa <[EMAIL PROTECTED]>
--
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php