#22108 [Opn]: php doesn't ignore the utf-8 BOM

2003-02-08 Thread moriyoshi
 ID:   22108
 Updated by:   [EMAIL PROTECTED]
 Reported By:  [EMAIL PROTECTED]
 Status:   Open
 Bug Type: Feature/Change Request
-Operating System: windows 2000
+Operating System: Any
-PHP Version:  4.2.3
+PHP Version:  All (as of the current implementation)
-Assigned To:  
+Assigned To:  moriyoshi
 New Comment:

And assigning this task to me.



Previous Comments:


[2003-02-08 01:48:15] [EMAIL PROTECTED]

Yes, I suppose this might be a bug, but most of the developers involved in
PHP are just not as aware of this issue as you expected (and as I had
expected). So I thought that changing the category was a better choice
than bogusing.




[2003-02-07 23:13:07] [EMAIL PROTECTED]

The BOM (byte order mark) is a few bytes at the very front of a file
that act as a signature denoting what type of encoding has been used,
and in UTF-16/32 it also marks the byte order (LE or BE). Although utf-8
is byte order independent, it has become popular on windows (perhaps
not so on unix) to make use of the BOM encoded in UTF-8 to flag the
file as being in UTF-8 format. This allows editors to determine the
type of the file from the first few characters instead of trying to
guess what type the file is. Ref: Textpad 4.6 (http://textpad.com)

See the Unicode FAQ for details of the utf-8 BOM...
http://www.unicode.org/unicode/faq/utf_bom.html#25

The use of this should be obvious: you have to leave the
my-language-only mindset that afflicts too many programmers (myself
included before this job) and think about the growing multiplicity of
languages on the web. I am writing web applications in Japan, with
European language and CJK (Chinese/Japanese/Korean) language processing
and interfaces. Thus I have php files where variable values are strings
of all sorts of languages - hence utf-8 encoding.

I feel that this is definitely a bug in php. Considering that:
* php is slowly growing into a language-neutral (i18n/l10n possible)
language
* php is designed such that php commands can be liberally sprinkled
through html, and html is increasingly encoded in utf-8 these days
* the utf-8 bom is becoming increasingly popular for reasons of
identifying the file character format
* if the utf-8 bom exists php actually outputs it incorrectly and in
doing so prevents header output
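
As an illustration of the last point (not part of the original comment): PHP sends anything that appears outside its tags straight to the client, so bytes sitting in front of the opening tag count as output, and headers can no longer be sent once output has started. The Python sketch below is only a simplified model of that rule. The helper name bytes_before_open_tag is hypothetical, and only the "<?php" form of the opening tag is considered.

```python
OPEN_TAG = b"<?php"
UTF8_BOM = b"\xef\xbb\xbf"


def bytes_before_open_tag(source: bytes) -> bytes:
    """Return everything that precedes the first <?php tag.

    In this simplified model, these are the bytes that would be sent
    as literal output before any PHP code runs. If there is no opening
    tag at all, the whole source is literal output.
    """
    index = source.find(OPEN_TAG)
    return source if index == -1 else source[:index]


plain = b"<?php header('X-Test: 1'); ?>"
with_bom = UTF8_BOM + plain

# Without a BOM nothing precedes the tag; with a BOM, three bytes do.
print(bytes_before_open_tag(plain))
print(bytes_before_open_tag(with_bom))
```

In this model a file saved with a BOM always has three bytes of leading output, which is why a header call on the first line of such a script fails.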

I request that you don't see this as a feature request, but as a bug in
the handling of utf-8 files. Whether the output generator is the
correct characterization of this bug or not I leave up to you.

Regards,
Brodie.



[2003-02-07 21:41:23] [EMAIL PROTECTED]

Because the BOM issue has been referenced repeatedly as a header output
preventer and we should be more aware of this, I don't see any reason
we have to mark this report as bogus.

Changing category from output control to a kind of feature
request.




[2003-02-07 13:57:22] [EMAIL PROTECTED]

Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.php.net/manual/ and the instructions on how to report
a bug at http://bugs.php.net/how-to-report.php

BOM = Byte Order Mark for UCS-2 encoding
This value should not be used in UTF-8 since the only
reason besides detecting the byte order of UCS-2 was a 
special non breaking space. And newer Unicode versions 
have another representation for the same thing.

Anyhow BOM = FE FF
That makes depending on the byte order:
UCS-2BE - \xFE\xFF
UCS-2LE - \xFF\xFE

Therefore a sequence of EF BB is another character and 
must not be ignored.
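
For reference (an illustration, not part of the original comment): the byte sequences discussed here can be checked by encoding the code point U+FEFF in each encoding. The Python sketch below does that. The helper name bom_bytes is hypothetical, and Python's "utf-16-be" and "utf-16-le" codecs stand in for the UCS-2BE and UCS-2LE labels used above, since both produce the same two bytes for this code point.

```python
# U+FEFF is the code point used as the byte order mark.
BOM_CHAR = "\ufeff"


def bom_bytes(encoding: str) -> str:
    """Return the bytes of U+FEFF in the given encoding as spaced hex."""
    return " ".join(f"{byte:02X}" for byte in BOM_CHAR.encode(encoding))


for name in ("utf-16-be", "utf-16-le", "utf-8"):
    print(f"{name:10} {bom_bytes(name)}")
```

The UTF-8 line shows three bytes, EF BB BF, which is the same code point as FE FF in UTF-16 rather than a different character.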




[2003-02-07 10:42:16] [EMAIL PROTECTED]

sniper,

imagine someone wanted to echo some text in, e.g., French.
In that case, if you'd save it as ascii, you would get corrupted
output. So instead you'd have to save as utf-8. Which seems to cause
problems (or so [EMAIL PROTECTED] tells us)



The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
http://bugs.php.net/22108

-- 
Edit this bug report at http://bugs.php.net/?id=22108&edit=1




#22108 [Opn]: php doesn't ignore the utf-8 BOM

2003-02-08 Thread helly
 ID:   22108
 Updated by:   [EMAIL PROTECTED]
 Reported By:  [EMAIL PROTECTED]
 Status:   Open
 Bug Type: Feature/Change Request
 Operating System: Any
 PHP Version:  All (as of the current implementation)
 New Comment:

OK, the UTF-8 BOM was new to me.
If I find the time I'll have a look at it over the weekend.
I think the solution would be somewhere in Zend's multibyte support,
since I fear adding that BOM to mbstring alone does not do the trick.


Previous Comments:


[2003-02-08 05:43:14] [EMAIL PROTECTED]

derick, assuming that you wanted to create a version of the example
at http://www.php.net/manual/en/introduction.php#intro-whatis which
displayed the text "Hi, I'm a PHP script" in multiple languages, how
would you propose doing it?

The only way is to use a form of unicode encoding. The least intrusive
of these ways is utf-8 because it encodes the text in such a way that
ascii characters (7 bit characters) are still plain ascii characters,
and the bytes of all other encoded characters are always >= 128 and will
never be mistaken for ascii.

I haven't seen any documentation which states that php can only handle
ascii text, please direct me to it if it exists.  If there is some
known problem with PHP parsing UTF-8 scripts, I haven't found it yet in
a multitude of different files with different languages which PHP is
parsing happily.

The only problem that I have had is that for any file which has a UTF-8
BOM, PHP mistakenly outputs the BOM as if it were content. This is a bug
in PHP. The solution is easy: on loading a file, strip the BOM if it
exists. Make it optional processing via a php.ini config argument if
necessary.
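
The proposed fix can be sketched as follows (an illustration, not part of the original comment, and written in Python rather than in the engine's own C). It assumes the script is already in memory as raw bytes. The name strip_utf8_bom is hypothetical and does not refer to any existing PHP or Zend function.

```python
UTF8_BOM = b"\xef\xbb\xbf"


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM.

    Only one BOM at the very start is removed; input that does not
    begin with the BOM is returned unchanged.
    """
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data


script = UTF8_BOM + b"<?php echo 'Hi'; ?>"
print(strip_utf8_bom(script))
```

Checking only the first three bytes keeps the cost negligible and leaves any later occurrence of the same bytes in the file untouched.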

Don't be US-centric in your thinking, there is far more world existing
outside those borders.

Regards,
Brodie.



[2003-02-08 04:24:12] [EMAIL PROTECTED]

PHP doesn't want UNICODE scripts, but just ASCII ones. Not a bug -
bogus.



[2003-02-08 02:01:11] [EMAIL PROTECTED]

And assigning this task to me.








#22108 [Opn]: php doesn't ignore the utf-8 BOM

2003-02-08 Thread moriyoshi
 ID:   22108
 Updated by:   [EMAIL PROTECTED]
 Reported By:  [EMAIL PROTECTED]
 Status:   Open
 Bug Type: Feature/Change Request
 Operating System: Any
 PHP Version:  All (as of the current implementation)
-Assigned To:  
+Assigned To:  moriyoshi
 New Comment:

reassigning


Previous Comments:


[2003-02-08 06:10:51] [EMAIL PROTECTED]

OK, the UTF-8 BOM was new to me.
If I find the time I'll have a look at it over the weekend.
I think the solution would be somewhere in Zend's multibyte support,
since I fear adding that BOM to mbstring alone does not do the trick.







#22108 [Opn]: php doesn't ignore the utf-8 BOM

2003-02-07 Thread bugzilla
 ID:   22108
 User updated by:  [EMAIL PROTECTED]
 Reported By:  [EMAIL PROTECTED]
 Status:   Open
 Bug Type: Feature/Change Request
 Operating System: windows 2000
 PHP Version:  4.2.3
 New Comment:

The BOM (byte order mark) is a few bytes at the very front of a file
that act as a signature denoting what type of encoding has been used,
and in UTF-16/32 it also marks the byte order (LE or BE). Although utf-8
is byte order independent, it has become popular on windows (perhaps
not so on unix) to make use of the BOM encoded in UTF-8 to flag the
file as being in UTF-8 format. This allows editors to determine the
type of the file from the first few characters instead of trying to
guess what type the file is. Ref: Textpad 4.6 (http://textpad.com)

See the Unicode FAQ for details of the utf-8 BOM...
http://www.unicode.org/unicode/faq/utf_bom.html#25

The use of this should be obvious: you have to leave the
my-language-only mindset that afflicts too many programmers (myself
included before this job) and think about the growing multiplicity of
languages on the web. I am writing web applications in Japan, with
European language and CJK (Chinese/Japanese/Korean) language processing
and interfaces. Thus I have php files where variable values are strings
of all sorts of languages - hence utf-8 encoding.

I feel that this is definitely a bug in php. Considering that:
* php is slowly growing into a language-neutral (i18n/l10n possible)
language
* php is designed such that php commands can be liberally sprinkled
through html, and html is increasingly encoded in utf-8 these days
* the utf-8 bom is becoming increasingly popular for reasons of
identifying the file character format
* if the utf-8 bom exists php actually outputs it incorrectly and in
doing so prevents header output

I request that you don't see this as a feature request, but as a bug in
the handling of utf-8 files. Whether the output generator is the
correct characterization of this bug or not I leave up to you.

Regards,
Brodie.


Previous Comments:


[2003-02-07 08:58:21] [EMAIL PROTECTED]

And why on earth would you save PHP files in any other
format than ascii?




[2003-02-07 08:53:10] [EMAIL PROTECTED]

What is a BOM?

Derick







#22108 [Opn]: php doesn't ignore the utf-8 BOM

2003-02-07 Thread moriyoshi
 ID:   22108
 Updated by:   [EMAIL PROTECTED]
 Reported By:  [EMAIL PROTECTED]
 Status:   Open
 Bug Type: Feature/Change Request
 Operating System: windows 2000
 PHP Version:  4.2.3
 New Comment:

Yes, I suppose this might be a bug, but most of the developers involved in
PHP are just not as aware of this issue as you expected (and as I had
expected). So I thought that changing the category was a better choice
than bogusing.


