ID: 22108
Comment by: lapo at lapo dot it
Reported By: bugzilla at jellycan dot com
Status: Assigned
Bug Type: Feature/Change Request
Operating System: *
PHP Version: *
Assigned To: moriyoshi
New Comment:

Adding '--enable-zend-multibyte' to the latest PHP5 port for FreeBSD definitely solves the problem.

All files contain:

    <?
    header("Content-Language: it");
    echo "àèéìòù\n";
    ?>

cyberx [~] $ php /usr/tmp/utf8-bom.php
à èéìòù
cyberx [~] $ php /usr/tmp/utf8Y-bom.php
àèéìòù
cyberx [~] $ php /usr/tmp/utf16-bom.php
àèéìòù
cyberx [~] $ php /usr/tmp/utf16BE-bom.php
àèéìòù
cyberx [~] $ php /usr/tmp/utf16LE-bom.php
àèéìòù

Except for "UTF8 without BOM", which is, of course, not distinguishable from ISO8859-15 (the default here), all the other formats are correctly interpreted and output. (Notice that the 'header' call before the 'echo' would fail with a non-BOM-aware PHP build.)

I wonder if and when this great multibyte support will be available by default in Win32 builds. I would really use it for work and am not willing to buy Visual C just to compile it ;-) (though I'm trying to compile it with Cygwin's gcc using the '-mno-cygwin' option, we'll see...)


Previous Comments:
------------------------------------------------------------------------

[2003-11-09 16:12:50] a9c83cd8bb41db324db5b449352f183 at arcor dot de

Thought about it... Now I think it's better when the BOM isn't part of the output, because that would cause trouble if you want to output images or PDFs or something like that...

------------------------------------------------------------------------

[2003-11-08 06:45:22] a9c83cd8bb41db324db5b449352f183 at arcor dot de

I think it would be best if PHP recognized the BOM and output it before the document (but after the HTTP headers, of course), so that the document can still be recognized as UTF-8 when it's saved to disk (where no Content-Type header with a charset specification is available).

------------------------------------------------------------------------

[2003-10-31 11:12:06] [EMAIL PROTECTED]

I added i18n support to Zend Engine 2 (though it's still only partial...), and one of its features is BOM awareness.
So now you can gracefully parse scripts with a BOM if you use PHP 5.0.0b2 and configure it with the option '--enable-zend-multibyte'.

These features are still experimental and under testing, so I have not documented them yet, but I'll add entries to the manual, ZEND_CHANGES and so on once I feel certain of the stability and robustness of my patch, though I do not know when that will be :)

Anyway, I'll close this bug once the '--enable-zend-multibyte' option in PHP 5.0.0b2 is confirmed to work well for this problem. Comments are welcome.

------------------------------------------------------------------------

[2003-02-07 23:13:07] bugzilla at jellycan dot com

The BOM (byte order mark) is a few bytes at the very front of a file that act as a signature denoting what type of encoding has been used, and in UTF-16/32 it also marks the byte order (LE or BE). Although utf-8 is byte order independent, it has become popular on Windows (perhaps not so on Unix) to use the BOM encoded in UTF-8 to flag the file as being in UTF-8 format. This allows editors to determine the type of the file from the first few characters instead of trying to guess what type the file is. Ref: Textpad 4.6 (http://textpad.com)

See the Unicode FAQ for details of the utf-8 BOM...
http://www.unicode.org/unicode/faq/utf_bom.html#25

The use of this should be obvious: you have to leave the my-language-only mindset that afflicts too many programmers (myself included before this job) and think about the growing multiplicity of languages on the web. I am writing web applications in Japan, with European language and CJK (Chinese/Japanese/Korean) language processing and interfaces. Thus I have php files where variable values are strings in all sorts of languages, hence the utf-8 encoding.

I feel that this is definitely a bug in php.
Considering that:

* php is slowly growing into a language-neutral (i18n/l10n possible) language
* php is designed such that php commands can be liberally sprinkled through html, and html is increasingly encoded in utf-8 these days
* the utf-8 bom is becoming increasingly popular for identifying the file character format
* if the utf-8 bom exists, php actually outputs it incorrectly and in doing so prevents header output

I request that you don't see this as a feature request, but as a bug in the handling of utf-8 files. Whether the output generator is the correct characterization of this bug or not I leave up to you.

Regards,
Brodie.

------------------------------------------------------------------------

[2003-02-07 08:46:36] bugzilla at jellycan dot com

Problem:
When a php file is saved in utf-8 format with the UTF-8 BOM as the first three bytes of the file (EF BB BF), PHP doesn't ignore these bytes when loading and compiling the file, but instead treats them as output that precedes the opening <?php tag. This causes incorrect display of the page and failure of any http header output. It does this even when the internal character format is set in php.ini to be utf-8.

Desired outcome:
PHP recognizes the utf-8 bom and disregards it.

------------------------------------------------------------------------

--
Edit this bug report at http://bugs.php.net/?id=22108&edit=1
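
For readers following the thread, the "recognize the BOM and disregard it" behaviour requested in the original report can be illustrated with a small sketch. It is written in Python purely for illustration and is not how the Zend Engine implements '--enable-zend-multibyte'; the names `split_bom` and `BOMS` are hypothetical. It only assumes the standard BOM byte sequences (EF BB BF for UTF-8, plus the UTF-16 and UTF-32 marks mentioned above).

```python
import codecs

# Known BOM signatures. The UTF-32 LE mark (FF FE 00 00) begins with the
# UTF-16 LE mark (FF FE), so the UTF-32 entries must be tested first.
# (Assumption: a UTF-16 LE file whose first character is U+0000 would be
# misread as UTF-32 LE; this sketch ignores that corner case.)
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def split_bom(data: bytes):
    """Return (encoding, payload); encoding is None when no BOM is found."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            # Drop the signature so it is never treated as script output.
            return encoding, data[len(bom):]
    return None, data


if __name__ == "__main__":
    # A UTF-8 script saved with a BOM, as in the original report.
    script = codecs.BOM_UTF8 + b'<?php header("Content-Language: it"); ?>'
    encoding, payload = split_bom(script)
    print(encoding, payload[:5])
```

A file without a BOM comes back unchanged with `None` as the encoding, which matches the observation in the thread that UTF-8 without a BOM cannot be told apart from ISO8859-15 by signature alone.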