Re: [PHP-DEV] [PATCH] zend-multibyte unicode detection vs. __halt_compiler()

Rui Hirokawa Sun, 26 Aug 2007 05:50:57 -0700

Hi,

IMHO, #42396 is not a bug; it is the specified behavior.
A normal script does not contain null bytes unless it is encoded in Unicode.


I understand the motivation for detecting a unique byte sequence
(0xFFFFFFFF) to support PHAR/PHK,
but that change adds a new feature rather than fixing a bug.
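
To make the proposed behavior concrete, here is a minimal standalone
sketch of the decision the patch introduces. It is not the code from
zend_multibyte.c: the function names has_ff_marker() and
needs_unicode_autodetect() are hypothetical, and the sketch assumes the
byte-order-mark check has already been done by the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of the Zend engine: returns 1 if buf
 * contains four consecutive 0xFF bytes, 0 otherwise. */
static int has_ff_marker(const unsigned char *buf, size_t len)
{
    size_t run = 0; /* length of the current run of 0xFF bytes */

    assert(buf != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        run = (buf[i] == 0xFF) ? run + 1 : 0;
        if (run == 4) {
            return 1;
        }
    }
    return 0;
}

/* Hypothetical helper: returns 1 if the script would go on to the
 * null-byte based Unicode auto-detection, 0 if it would be treated as
 * non-Unicode. Assumes the BOM check has already failed. */
static int needs_unicode_autodetect(const unsigned char *buf, size_t len)
{
    if (has_ff_marker(buf, len)) {
        return 0; /* marker found: skip auto-detection entirely */
    }
    return len > 0 && memchr(buf, 0, len) != NULL;
}
```

With this sketch, a script whose binary payload starts with four 0xFF
bytes is never auto-detected, even if null bytes follow the marker. A
script without the marker is auto-detected as soon as it contains a
single null byte, which is the current behavior.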

Rui

On Thu, 23 Aug 2007 18:58:52 +0200
LAUPRETRE François (P) <[EMAIL PROTECTED]> wrote:

> Hi,
> 
> Here is a patch I am submitting to fix bug #42396 (PHP 5).
> 
> The problem: when PHP is configured with the '--enable-zend-multibyte' 
> option, it tries to autodetect unicode-encoded scripts. Then, if a script 
> contains null bytes after an __halt_compiler() directive, it will be 
> considered as UTF-16 or 32, and the execution typically results in a lot of 
> '?' garbage. In practice, it makes PHK and PHAR incompatible with the 
> zend-multibyte feature.
> 
> The only workaround was to turn off the (undocumented) 'detect_unicode' flag. 
> But it is not a real solution, as people may want to use unicode detection 
> along with PHK/PHAR packages, and there's no logical reason to keep them 
> incompatible.
> 
> The patch I am submitting assumes that a document encoded in UTF-8, UTF-16, 
> or UTF-32 cannot contain a sequence of four 0xff bytes. So, it adds a small 
> detection loop before scanning the script for null bytes. If a sequence of 
> four 0xff bytes is found, the unicode detection is aborted and the script is 
> treated as non-unicode, whatever other binary data it contains. Of course, 
> this detection happens after looking for a byte-order mark.
> 
> Now, I can modify the PHK_Creator tool to set 4 0xff bytes after the 
> __halt_compiler() directive, which makes the generated PHK archives 
> compatible with zend-multibyte. The same for PHAR.
> 
> It would be better if we could scan the script for null bytes only up to the 
> __halt_compiler() directive, but I suspect that is impossible, as the script 
> is not yet compiled at that point...
> 
> Regards
> 
> Francois
> 
> --- zend_multibyte.c.old        2007-01-01 10:35:46.000000000 +0100
> +++ zend_multibyte.c    2007-08-23 17:22:24.000000000 +0200
> @@ -1035,6 +1035,7 @@
>         zend_encoding *script_encoding = NULL;
>         int bom_size;
>         char *script;
> +       unsigned char *p,*p_end;
>  
>         if (LANG_SCNG(script_org_size) < sizeof(BOM_UTF32_LE)-1) {
>                 return NULL;
> @@ -1069,6 +1070,18 @@
>                 return script_encoding;
>         }
>  
> +       /* Search for four 0xff bytes - if found, script cannot be unicode */
> +
> +       p=(unsigned char *)LANG_SCNG(script_org);
> +       p_end=(p+LANG_SCNG(script_org_size)-3);
> +       while (p < p_end) {
> +               if (   ((* p)   ==(unsigned char)0x0ff)
> +                       && ((*(p+1))==(unsigned char)0x0ff)
> +                       && ((*(p+2))==(unsigned char)0x0ff)
> +                       && ((*(p+3))==(unsigned char)0x0ff)) return NULL;
> +               p++;
> +       }
> +
>         /* script contains NULL bytes -> auto-detection */
>         if (memchr(LANG_SCNG(script_org), 0, LANG_SCNG(script_org_size))) {
>                 /* make best effort if BOM is missing */
> 

-- 
Rui Hirokawa <[EMAIL PROTECTED]>

-- 
PHP Internals - PHP Runtime Development Mailing List
To unsubscribe, visit: http://www.php.net/unsub.php
