#22108 [Asn]: php doesn't ignore the utf-8 BOM

2005-01-06 Thread techtonik
 ID:   22108
 Updated by:   [EMAIL PROTECTED]
 Reported By:  bugzilla at jellycan dot com
 Status:   Assigned
 Bug Type: Feature/Change Request
 Operating System: *
 PHP Version:  *
 Assigned To:  moriyoshi
 New Comment:

How about making --enable-zend-multibyte a default option?
Is it possible to port this support to Windows too?
And to the 4.3.x branch?
Should this be marked open again?



Previous Comments:


[2004-05-25 12:33:30] lapo at lapo dot it

Adding '--enable-zend-multibyte' to the latest PHP5 port for FreeBSD
definitely solves the problem:

All files contain:


cyberx [~] $ php /usr/tmp/utf8-bom.php 
à èéìòù
cyberx [~] $ php /usr/tmp/utf8Y-bom.php 
àèéìòù
cyberx [~] $ php /usr/tmp/utf16-bom.php 
àèéìòù
cyberx [~] $ php /usr/tmp/utf16BE-bom.php 
àèéìòù
cyberx [~] $ php /usr/tmp/utf16LE-bom.php 
àèéìòù

Except for "UTF8 without BOM", which is of course not distinguishable
from ISO8859-15 (the default here), all the other formats are correctly
interpreted and output.
(Notice that the 'header' instruction before the 'echo' one would
choke with a non-BOM-aware PHP build.)

I wonder if and when this great multibyte support will be available by
default in Win32 builds. I would really use it for work and am not
willing to buy Visual C just to compile that ;-)
(though I'm trying to compile it with cygwin's gcc using the
'-mno-cygwin' option, we'll see...)



[2003-11-09 16:12:50] a9c83cd8bb41db324db5b449352f183 at arcor dot de

Thought about it... Now I think it's better when the BOM isn't part of
the output because that would cause trouble if you want to output
images or PDF or something like that...



[2003-11-08 06:45:22] a9c83cd8bb41db324db5b449352f183 at arcor dot de

I think it would be best if PHP recognized the BOM and output it
before the document body (but after the HTTP headers, of course)
so that the document can still be recognized as UTF-8 when it's saved
to disk (where no Content-Type header with a charset specification is
available).



[2003-10-31 11:12:06] [EMAIL PROTECTED]

I added i18n support to Zend Engine 2 (though it's still only
partial...), and one of its features is awareness of the BOM. So now you
can gracefully parse scripts with a BOM if you use PHP 5.0.0b2 and
configure it with the option '--enable-zend-multibyte'.

These features are still experimental and under testing, so I have
not documented them yet, but I'll add entries to the manual,
ZEND_CHANGES and so on once I feel certain of the stability and
robustness of my patch, though I do not know when that will be :)

Anyway, I'll close this bug if the '--enable-zend-multibyte' option in
PHP 5.0.0b2 is confirmed to work well for this problem. Comments are
welcome.



[2003-02-07 23:13:07] bugzilla at jellycan dot com

The BOM (byte order mark) is a few bytes at the very front of a file
that act as a signature denoting what type of encoding has been used,
and in UTF-16/32 it also marks the byte order (LE or BE). Although UTF-8
is byte-order independent, it has become popular on Windows (perhaps not
so on Unix) to use the BOM encoded in UTF-8 to flag the file as
being in UTF-8 format. This allows editors to determine the type of the
file from the first few bytes instead of trying to guess what type
the file is. Ref: Textpad 4.6 (http://textpad.com)

See the Unicode FAQ for details of the utf-8 BOM...
http://www.unicode.org/unicode/faq/utf_bom.html#25

The use of this should be obvious, you have to leave the
my-language-only mindset that afflicts too many programmers (myself
included before this job) and think about the growing multiplicity of
languages on the web. I am writing web applications in Japan, with
European language and CJK (Chinese/Japanese/Korean) language processing
and interfaces. Thus I have php files where variable values are strings
of all sorts of languages - hence utf-8 encoding.

I feel that this is definitely a bug in PHP, considering that:
* PHP is slowly growing into a language-neutral (i18n/l10n capable)
language
* PHP is designed such that PHP commands can be liberally sprinkled
through HTML, and HTML is increasingly encoded in UTF-8 these days
* the UTF-8 BOM is becoming increasingly popular as a way of
identifying the file's character encoding
* if the UTF-8 BOM exists, PHP actually outputs it incorrectly and in
doing so prevents header output

I request that you don't see this as a feature request, but as a bug in
the handling of utf-8 files. Whether the output generator is the correct
characterization of this bug or not I leave up to you.

Regards,
Brodie.
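
For readers following the thread, the signature idea described in the
report can be sketched in a few lines of PHP. This is only an
illustration: detect_bom() is a hypothetical helper name, not a PHP
built-in, and it checks only the five common BOM byte sequences.

```php
<?php
// Sketch: identify a byte order mark from the first bytes of a file.
// detect_bom() is a hypothetical helper, not part of PHP itself.

function detect_bom(string $bytes): ?string
{
    // Longer signatures come first because the UTF-32LE BOM (FF FE 00 00)
    // begins with the same two bytes as the UTF-16LE BOM (FF FE).
    $signatures = [
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16BE' => "\xFE\xFF",
        'UTF-16LE' => "\xFF\xFE",
    ];

    foreach ($signatures as $encoding => $bom) {
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $encoding;
        }
    }

    // No BOM: the encoding has to be guessed or declared some other way.
    return null;
}

// Example: a script saved by an editor that writes a UTF-8 BOM.
var_dump(detect_bom("\xEF\xBB\xBF<?php echo 'hello';"));
```

Only the first four bytes of a file are needed, so a caller could pass
something like file_get_contents($path, false, null, 0, 4).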


#22108 [Asn]: php doesn't ignore the utf-8 BOM

2003-10-31 Thread fujimoto
 ID:   22108
 Updated by:   [EMAIL PROTECTED]
 Reported By:  bugzilla at jellycan dot com
 Status:   Assigned
 Bug Type: Feature/Change Request
 Operating System: *
 PHP Version:  *
 Assigned To:  moriyoshi
 New Comment:

I added i18n support to Zend Engine 2 (though it's still only
partial...), and one of its features is awareness of the BOM. So now you
can gracefully parse scripts with a BOM if you use PHP 5.0.0b2 and
configure it with the option '--enable-zend-multibyte'.

These features are still experimental and under testing, so I have
not documented them yet, but I'll add entries to the manual,
ZEND_CHANGES and so on once I feel certain of the stability and
robustness of my patch, though I do not know when that will be :)

Anyway, I'll close this bug if the '--enable-zend-multibyte' option in
PHP 5.0.0b2 is confirmed to work well for this problem. Comments are
welcome.


Previous Comments:


[2003-10-27 09:24:21] kamor at worldonline dot fr

Here is a new and better solution for including UTF-8 files without
emitting their BOM characters. The previous solution works well with IE
and Mozilla, but XHTML validation fails.

The following solution is not the best, but with some restrictions
on code organisation it gives good results:

Solution:
Insert 'ob_start()' before the include() commands and 'ob_end_clean()'
after them. This discards all direct output produced by the inclusions.

example:


This works well for pure function/class libraries.

Other solutions may be explored with other output buffer functions
(ob_*).
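
A minimal sketch of the ob_start()/ob_end_clean() approach described
above, assuming the included file is a pure function/class library.
include_quietly() is a hypothetical helper name chosen for this
illustration.

```php
<?php
// Sketch of the output-buffering workaround: include a library file and
// throw away anything it prints directly, such as a leading UTF-8 BOM.
// include_quietly() is a hypothetical helper name.

function include_quietly(string $path): void
{
    ob_start();            // capture direct output from the included file
    try {
        include $path;     // functions and classes it defines stay available
    } finally {
        ob_end_clean();    // discard the captured bytes, BOM included
    }
}
```

Because the include happens inside a function, plain variables set by
the included file stay local to that function; functions and classes
are still defined globally, which is why this suits library files only.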



[2003-05-05 03:40:23] tokiee at sayclub dot com

For those who are not familiar with UTF-8:

UTF-8 (UCS Transformation Format 8) is compatible with ASCII: if you
write English text in UTF-8, every byte is identical to the same text
in ASCII (and the UTF-8 BOM is optional).

This is not an exact explanation, but UTF-8 extends ASCII to cover the
full Unicode character set without disturbing the existing ASCII byte
values or their order. So for plain English text UTF-8 is ASCII, and
you don't have to think of it as something totally new and strange.

When a Unicode-aware OS reads UTF-8, it typically converts it to its
internal Unicode representation. In that sense UTF-8 is an encoding of
Unicode, somewhat like URL encoding is an encoding of arbitrary bytes.

In ASCII, each byte corresponds to one character, with 128 characters
in total (single-byte extensions such as ISO-8859-1 go up to 256).

In 16-bit Unicode (UCS-2), two bytes correspond to one character, for
up to 65536 characters. That is a completely different layout from
ASCII.

In UTF-8, interestingly, a character can take one, two, three, or four
bytes. That sounds complicated, but the rule is simple, and most of the
time you don't have to handle it yourself: the OS or a library does it
for you.

Even software that does not understand UTF-8 usually copes, because
UTF-8 is ASCII-transparent. So it "can be used with normal string
comparison functions for sorting and such." (quoted from the PHP.NET
reference for utf8_encode())
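
To make the variable-length point concrete, here is a small sketch that
counts the bytes UTF-8 uses for four sample characters. The function
name utf8_byte_lengths() and the choice of samples are assumptions made
for this illustration; the characters are written as byte escapes so
the result does not depend on how the script file itself is saved.

```php
<?php
// Sketch: how many bytes UTF-8 uses for characters from different ranges.
// utf8_byte_lengths() is a hypothetical helper name.

function utf8_byte_lengths(): array
{
    $samples = [
        'U+0041 LATIN CAPITAL LETTER A'          => "A",
        'U+00E9 LATIN SMALL LETTER E WITH ACUTE' => "\xC3\xA9",
        'U+20AC EURO SIGN'                       => "\xE2\x82\xAC",
        'U+1F600 GRINNING FACE'                  => "\xF0\x9F\x98\x80",
    ];

    // strlen() counts bytes, not characters, which is what we want here.
    return array_map('strlen', $samples);
}

foreach (utf8_byte_lengths() as $name => $bytes) {
    echo $name, ': ', $bytes, " byte(s)\n";
}
```

The ASCII letter takes one byte, exactly as it would in an ASCII file,
while the other three take two, three and four bytes respectively.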



[2003-04-06 00:53:04] tronxoe at hotpop dot com

The BOM is still fine when the PHP file does not include another
Unicode file (by using @include()).

Another problem: if a PHP file is saved as Unicode, sessions and
cookies cannot be used because of "headers already sent ...". I think
the first 3 bytes have already been sent in this case.
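
The three bytes in question are the UTF-8 BOM (EF BB BF), which sit in
front of the opening tag and are therefore treated as ordinary output.
One way around this, sketched below under the assumption that you are
free to rewrite the source files, is to strip the BOM from them.
strip_utf8_bom() is a hypothetical helper name.

```php
<?php
// Sketch: remove a leading UTF-8 BOM (bytes EF BB BF) from a string.
// strip_utf8_bom() is a hypothetical helper name.

function strip_utf8_bom(string $contents): string
{
    $bom = "\xEF\xBB\xBF";

    if (strncmp($contents, $bom, 3) === 0) {
        return substr($contents, 3);   // drop exactly the three BOM bytes
    }

    return $contents;                  // no BOM: leave the data untouched
}
```

Applied to a file, that would look like
file_put_contents($path, strip_utf8_bom(file_get_contents($path))),
ideally on a copy first, since some editors re-add the BOM on save.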



[2003-02-08 06:10:51] [EMAIL PROTECTED]

OK, the UTF-8 BOM was new to me.
If I find the time I'll have a look at it over the weekend.
I think the solution would be somewhere in Zend's multibyte support,
since I fear that adding the BOM to mbstring alone does not do the
trick.




#22108 [Asn]: php doesn't ignore the utf-8 BOM

2003-06-04 Thread moriyoshi
 ID:   22108
 Updated by:   [EMAIL PROTECTED]
 Reported By:  bugzilla at jellycan dot com
 Status:   Assigned
 Bug Type: Feature/Change Request
 Operating System: Any
 PHP Version:  All (as of the current implementation)
 Assigned To:  moriyoshi
 New Comment:

That script appears to be written in UTF-16. As for UTF-16, it could
actually be a parser problem as well, but this report addresses the
issue related to UTF-8.
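
A rough sketch of why a UTF-16 script is a separate problem: every
ASCII character is stored as two bytes, so the five bytes of the
opening tag never appear next to each other. to_utf16be_ascii() is a
hypothetical helper that handles ASCII input only, which is enough for
this illustration.

```php
<?php
// Sketch: the opening tag cannot be found in a UTF-16 encoded script.
// to_utf16be_ascii() is a hypothetical helper; it handles ASCII input only.

function to_utf16be_ascii(string $ascii): string
{
    $out = '';
    for ($i = 0, $n = strlen($ascii); $i < $n; $i++) {
        $out .= "\x00" . $ascii[$i];   // each ASCII character becomes 00 xx
    }
    return $out;
}

// A UTF-16BE BOM followed by the script text, as an editor might save it.
$utf16 = "\xFE\xFF" . to_utf16be_ascii('<?php echo 1;');

// Searching for the plain five-byte tag fails because of the NUL bytes.
var_dump(strpos($utf16, '<?php'));          // bool(false)

// First bytes of the file: BOM, then 00 3c 00 3f 00 70 00 68 00 70.
echo bin2hex(substr($utf16, 0, 12)), "\n";  // feff003c003f007000680070
```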


Previous Comments:


[2003-06-04 02:59:33] [EMAIL PROTECTED]

Actually, not totally. A friend mailed me a PHP script, which had the
annoying BOM AND the whole file was in double byte (saved by
notepad)... which definitely makes it a parser problem too (\0 < \0 ?
\0 p   doesn't match "

http://bugs.php.net/22108

-- 
Edit this bug report at http://bugs.php.net/?id=22108&edit=1



#22108 [Asn]: php doesn't ignore the utf-8 BOM

2003-06-04 Thread derick
 ID:   22108
 Updated by:   [EMAIL PROTECTED]
 Reported By:  bugzilla at jellycan dot com
 Status:   Assigned
 Bug Type: Feature/Change Request
 Operating System: Any
 PHP Version:  All (as of the current implementation)
 Assigned To:  moriyoshi
 New Comment:

Actually, not totally. A friend mailed me a PHP script, which had the
annoying BOM AND the whole file was in double byte (saved by
notepad)... which definitely makes it a parser problem too (\0 < \0 ?
\0 p   doesn't match "

> [8 Feb 4:24am CST] [EMAIL PROTECTED]

> PHP doesn't want UNICODE scripts, but just ASCII ones. Not 
> a bug -> bogus.

Not bogus.  

PHP is embedded in HTML, the surrounding document determines the
encoding.  You can't just specify this problem out of existence.



The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
http://bugs.php.net/22108




#22108 [Asn]: php doesn't ignore the utf-8 BOM

2003-06-04 Thread moriyoshi
 ID:   22108
 Updated by:   [EMAIL PROTECTED]
 Reported By:  bugzilla at jellycan dot com
 Status:   Assigned
 Bug Type: Feature/Change Request
 Operating System: Any
 PHP Version:  All (as of the current implementation)
 Assigned To:  moriyoshi
 New Comment:

And just for clarification, this is a scanner problem, irrelevant to
the parser.



Previous Comments:


[2003-06-04 02:45:43] [EMAIL PROTECTED]

It wasn't assigned, just set to open (and I didn't notice your name in
the "Assign to" field).



[2003-06-04 02:40:30] [EMAIL PROTECTED]

Derick,

Please do not change the status of a bug that is already assigned to
someone.

It makes no sense to say that PHP can only handle ASCII documents: if
you want to use German in PHP, for example, you have to use at least
ISO-8859-1 or ISO-8859-15, neither of which is part of ASCII.




[2003-06-03 14:17:22] [EMAIL PROTECTED]

Feel free to rewrite the parser, but that's just not going to happen.
We want ascii import, not unicode.



[2003-06-03 14:07:16] gump at hotmail dot com

> [8 Feb 4:24am CST] [EMAIL PROTECTED]

> PHP doesn't want UNICODE scripts, but just ASCII ones. Not 
> a bug -> bogus.

Not bogus.  

PHP is embedded in HTML, the surrounding document determines the
encoding.  You can't just specify this problem out of existence.






#22108 [Asn]: php doesn't ignore the utf-8 BOM

2003-06-04 Thread derick
 ID:   22108
 Updated by:   [EMAIL PROTECTED]
 Reported By:  bugzilla at jellycan dot com
 Status:   Assigned
 Bug Type: Feature/Change Request
 Operating System: Any
 PHP Version:  All (as of the current implementation)
 Assigned To:  moriyoshi
 New Comment:

It wasn't assigned, just set to open (and I didn't notice your name in
the "Assign to" field).


Previous Comments:


[2003-04-14 12:17:37] [EMAIL PROTECTED]

As a short-term workaround (yes, I know it's not a solution), can you
try using output buffering? That should at least solve the problem of
sneaking the headers in prior to the BOM, even if it doesn't solve the
underlying problem of recognizing document encodings properly.
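
One possible shape for that workaround, as a sketch only: start an
output buffer with a callback that drops a leading UTF-8 BOM before the
output is sent. drop_leading_bom() is a hypothetical callback name. The
sketch assumes the buffer is already active when the BOM is emitted,
for example via the output_buffering or auto_prepend_file settings in
php.ini, because a BOM at the top of the main script is output before
any code in that script can run.

```php
<?php
// Sketch: buffer output so headers can still be set later, and drop one
// leading UTF-8 BOM from the buffered output when it is flushed.
// drop_leading_bom() is a hypothetical callback name.

function drop_leading_bom(string $buffer): string
{
    return strncmp($buffer, "\xEF\xBB\xBF", 3) === 0
        ? substr($buffer, 3)
        : $buffer;
}

ob_start('drop_leading_bom');

echo "\xEF\xBB\xBF";     // stands in for the BOM of an included file
// header() calls would still be accepted here: nothing has been sent yet.
echo "page body\n";

ob_end_flush();          // the callback runs now and removes the BOM
```

Without a chunk size the callback sees the whole buffer at once, so
only a BOM at the very start of the output is removed.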


