Req #42396 [Asn]: Followup to #36711: __halt_compiler() and unicode detection

2010-11-18 Thread cataphract
Edit report at http://bugs.php.net/bug.php?id=42396&edit=1

 ID:                 42396
 Updated by:         cataphr...@php.net
 Reported by:        francois at tekwire dot net
 Summary:            Followup to #36711: __halt_compiler() and unicode detection
 Status:             Assigned
 Type:               Feature/Change Request
-Package:            Feature/Change Request
+Package:            *General Issues
 Operating System:   all
 PHP Version:        5.2.3
 Assigned To:        hirokawa
 Block user comment: N
 Private report:     N

 New Comment:

This bug describes more accurately the problem I attempted to solve with
the patch for bug #53199.


Previous Comments:

[2010-06-25 10:11:56] phofstetter at sensational dot ch

Somehow, recently, the default value of detect_unicode seems to have
changed.

With detect_unicode enabled, it's impossible to run any PHAR file,
neither through the CLI nor through the web server. IMHO, this should
really be looked into.


[2007-08-27 08:38:58] j...@php.net

IMHO, #42396 is not a bug; the behavior is as specified.

A normal script does not contain a null byte unless it is encoded in
Unicode.

Adding detection of a unique byte sequence ('0x') to support PHAR/PHK
is understandable, but it is a change that adds a new feature.

Rui




[2007-08-24 10:30:12] j...@php.net

Patch posted to internals: http://news.php.net/php.internals/31870




[2007-08-24 10:29:05] j...@php.net

The same folks who maintain mbstring added that support, so it's not
such a wrong choice. Reclassified though, and assigned to the maintainer.


[2007-08-23 16:24:33] francois at tekwire dot net

Not sure it should be reclassified as mbstring related, as the bug is in
Zend/zend_multibyte.c and has nothing to do with mbstring.



PHP5 has a little unicode part in the engine. It even has an
(undocumented) 'detect_unicode' option.




The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at

http://bugs.php.net/bug.php?id=42396




#42396 [Asn]: Followup to #36711: __halt_compiler() and unicode detection

2007-08-27 Thread jani
 ID:   42396
 Updated by:   [EMAIL PROTECTED]
 Reported By:  francois at tekwire dot net
 Status:   Assigned
-Bug Type: Scripting Engine problem
+Bug Type: Feature/Change Request
 Operating System: all
 PHP Version:  5.2.3
 Assigned To:  hirokawa
 New Comment:

IMHO, #42396 is not a bug; the behavior is as specified.
A normal script does not contain a null byte unless it is encoded in
Unicode.

Adding detection of a unique byte sequence ('0x') to support PHAR/PHK
is understandable, but it is a change that adds a new feature.

Rui



Previous Comments:


[2007-08-24 10:30:12] [EMAIL PROTECTED]

Patch posted to internals: http://news.php.net/php.internals/31870




[2007-08-24 10:29:05] [EMAIL PROTECTED]

The same folks who maintain mbstring added that support, so it's
not such a wrong choice. Reclassified though, and assigned to the
maintainer.



[2007-08-23 16:24:33] francois at tekwire dot net

Not sure it should be reclassified as mbstring related, as the bug is
in Zend/zend_multibyte.c and has nothing to do with mbstring.

PHP5 has a little unicode part in the engine. It even has an
(undocumented) 'detect_unicode' option.



[2007-08-23 14:08:09] [EMAIL PROTECTED]

Reclassified: There is no unicode in PHP 5. Just mbstring.



[2007-08-23 12:16:17] francois at tekwire dot net

Description:

Reopening bug #36711 because it is NOT a documentation problem. Setting
'detect_unicode=Off' is NOT a solution, just a workaround.

In practice, because of this bug, PHK or PHAR packages cannot run on
zend-multibyte-enabled environments unless detect_unicode is turned
off, which makes them unusable in environments running unicode-encoded
scripts. As a side effect, it also makes it impossible to include a
unicode-encoded script inside a PHAR/PHK package, as it cannot be run.

There is no logical reason to bind the __halt_compiler() feature with
the zend-multibyte unicode detection capability. Everything after an
__halt_compiler() directive must be considered as binary data and should
not be scanned for unicode detection. If this data contains a unicode
script, it will be scanned and detected when include()d through the
stream wrapper.

My (humble) suggestions to fix the problem:

In zend_multibyte_detect_unicode(), the BOM detection does not have to
be modified, but the script is then scanned for null bytes:

    return zend_multibyte_detect_utf_encoding(LANG_SCNG(script_org),
        LANG_SCNG(script_org_size) TSRMLS_CC);

There, the size should not be LANG_SCNG(script_org_size), but the
offset of the __halt_compiler() directive. But I don't know where to
find the __COMPILER_HALT_OFFSET__ constant for the script. I even
suspect it is not available at that point...
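
The idea of limiting the scan to the part of the script that precedes
__halt_compiler() could look roughly like the C sketch below. It is only an
illustration under stated assumptions: find_halt_offset() and
has_null_byte_before_halt() are hypothetical helpers, not Zend engine
functions, and the directive is located with a plain case-sensitive byte
search instead of the real lexer (so text inside a string or comment would
also match).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: return the offset just past the first occurrence of
 * the text "__halt_compiler" in buf, or len if it does not occur.
 * Assumption: a plain, case-sensitive byte search is good enough for the
 * illustration; the engine itself would have to rely on its lexer. */
static size_t find_halt_offset(const unsigned char *buf, size_t len)
{
    static const char needle[] = "__halt_compiler";
    const size_t nlen = sizeof(needle) - 1;
    size_t i;

    assert(buf != NULL);
    for (i = 0; i + nlen <= len; i++) {
        if (memcmp(buf + i, needle, nlen) == 0) {
            return i + nlen;
        }
    }
    return len;
}

/* Hypothetical stand-in for the null-byte scan: only the bytes before the
 * directive are inspected, so binary data after it is ignored. */
static int has_null_byte_before_halt(const unsigned char *buf, size_t len)
{
    size_t limit = find_halt_offset(buf, len);

    return memchr(buf, 0, limit) != NULL;
}
```

With such a limit, a script whose only null bytes follow the directive
would no longer look Unicode-encoded to the scan, while a null byte in the
PHP source itself would still be seen.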

Another way, if the previous one is not possible, would be to scan for
a binary string that cannot correspond to any unicode encoding. This
way, PHK and PHAR could insert this string after the __halt_compiler()
directive, and it could be detected by
zend_multibyte_detect_utf_encoding() as a stop string. I am ready to
implement it if somebody provides a sequence of bytes that cannot be
found in any unicode-encoded document.
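
The stop-string alternative might be sketched in C as follows. Everything
named here is an assumption made for illustration: STOP_MARKER is a
placeholder value (the report leaves the actual byte sequence open, and
this one is not claimed to be absent from every Unicode-encoded document),
and scan_limit() and null_byte_seen() are hypothetical helpers, not part of
zend_multibyte.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Placeholder marker, for illustration only. The real sequence would have
 * to be agreed on by the engine and by PHK/PHAR. */
static const unsigned char STOP_MARKER[] = { 0xFF, 0xFF, 0xFF, 0xFF };

/* Hypothetical helper: return the offset of the first stop marker in buf,
 * or len if the marker does not occur. */
static size_t scan_limit(const unsigned char *buf, size_t len)
{
    const size_t mlen = sizeof(STOP_MARKER);
    size_t i;

    assert(buf != NULL);
    for (i = 0; i + mlen <= len; i++) {
        if (memcmp(buf + i, STOP_MARKER, mlen) == 0) {
            return i;
        }
    }
    return len;
}

/* Hypothetical stand-in for the null-byte scan: it stops at the marker, so
 * bytes after it never influence the result. */
static int null_byte_seen(const unsigned char *buf, size_t len)
{
    return memchr(buf, 0, scan_limit(buf, len)) != NULL;
}
```

Compared with cutting the scan at the directive's offset, this variant
needs no knowledge of the directive at detection time, at the cost of
requiring package builders to emit the marker.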

Reproduce code:
---
<?php
echo "OK\n";
__halt_compiler();<null-byte>

Expected result:

OK

Actual result:
--
??







#42396 [Asn]: Followup to #36711: __halt_compiler() and unicode detection

2007-08-24 Thread jani
 ID:   42396
 Updated by:   [EMAIL PROTECTED]
 Reported By:  francois at tekwire dot net
 Status:   Assigned
 Bug Type: Scripting Engine problem
 Operating System: all
 PHP Version:  5.2.3
 Assigned To:  hirokawa
 New Comment:

Patch posted to internals: http://news.php.net/php.internals/31870


