#39404 [Opn]: Support "entity" as substitute_character setting
ID: 39404
User updated by: martin dot t dot kutschker at blackbox dot net
-Summary: Support "entitiy" as substitute_character setting
Reported By: martin dot t dot kutschker at blackbox dot net
Status: Open
Bug Type: mbstring related
PHP Version: 5.2.0

New Comment:

Fix spelling of "entity" in the summary.

Previous Comments:

[2006-11-06 16:56:19] martin dot t dot kutschker at blackbox dot net

Description:
It would be great if the charset conversion could also output SGML/HTML entities for characters that are missing in the output charset. The option "long" is not very HTML-friendly, but with "entity" any Unicode-aware browser could deal with the missing character.

e.g.
mbstring.substitute_character=long => U+3000
mbstring.substitute_character=entity =>
#39404 [NEW]: Support "entitiy" as substitute_character setting
From: martin dot t dot kutschker at blackbox dot net
Operating system:
PHP version: 5.2.0
PHP Bug Type: mbstring related
Bug description: Support "entitiy" as substitute_character setting

Description:
It would be great if the charset conversion could also output SGML/HTML entities for characters that are missing in the output charset. The option "long" is not very HTML-friendly, but with "entity" any Unicode-aware browser could deal with the missing character.

e.g.
mbstring.substitute_character=long => U+3000
mbstring.substitute_character=entity =>
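The difference between the two substitution styles requested above can be sketched outside PHP. The snippet below is only an illustration in Python, not mbstring code. It uses Python's built-in `xmlcharrefreplace` error handler as a stand-in for the proposed "entity" mode, and a hypothetical handler named `long_style` to imitate the `U+XXXX` output of "long". Python emits decimal character references; the report does not say whether mbstring should use decimal or hexadecimal ones.

```python
import codecs


def long_style(error):
    """Hypothetical handler imitating the "long" style (U+XXXX)."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    # Replace every unencodable character in the reported span.
    bad = error.object[error.start:error.end]
    replacement = "".join("U+%04X" % ord(ch) for ch in bad)
    return replacement, error.end


codecs.register_error("long_style", long_style)

# U+3000 IDEOGRAPHIC SPACE has no ISO-8859-1 equivalent.
text = "a\u3000b"

as_long = text.encode("iso-8859-1", errors="long_style")
as_entity = text.encode("iso-8859-1", errors="xmlcharrefreplace")

print(as_long)    # b'aU+3000b'
print(as_entity)  # b'a&#12288;b'
```

A browser renders the second result as the intended character, while the first shows up as literal text, which is the point of the feature request.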
#28220 [NoF->Opn]: mb_strwidth() returns wrong width values for some Hangul characters.
ID: 28220
User updated by: martin dot t dot kutschker at blackbox dot net
Reported By: martin dot t dot kutschker at blackbox dot net
-Status: No Feedback
+Status: Open
Bug Type: mbstring related
PHP Version: Irrelevant

New Comment:

I never tried the original code (I only noticed the problem from reading the docs), so I did not test the diff. Anyway, I'm offline for two weeks, so I won't be able to give the fix a try for some time.

Previous Comments:

[2004-07-07 01:00:04] php-bugs at lists dot php dot net

No feedback was provided for this bug for over a week, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".

[2004-06-29 14:25:26] [EMAIL PROTECTED]

Try this patch and see if it works.

http://www.voltex.jp/patches/bug28220-preliminary.patch.diff

This patch is only applicable for PHP 4.3.2 or later.

~/src/php-4.3.7 $ patch -p0 -R < bug28220-preliminary.patch.diff

[2004-05-04 11:53:53] martin dot t dot kutschker at blackbox dot net

I rechecked EastAsianWidth and have found two more wide chars, and noticed that the range 2E80..4DB5 is in fact split by a single half-width filler space char:

1100..115F  Hangul Choseong
2329        LEFT-POINTING ANGLE BRACKET
232A        RIGHT-POINTING ANGLE BRACKET
2E80..303E  CJK and Kangxi radicals, ideographic chars
3041..4DB5  Hiragana, Katakana, Bopomofo and Hangul letters
4E00..D7A3  CJK ideographs, Yi and Hangul syllables
F900..FA6A  CJK compatibility ideographs
FE30..FE6B  presentation forms, punctuations, etc.
FF01..FF60  full-width Latin letters
FFE0        FULLWIDTH CENT SIGN
FFE1        FULLWIDTH POUND SIGN
FFE2        FULLWIDTH NOT SIGN
FFE3        FULLWIDTH MACRON
FFE4        FULLWIDTH BROKEN BAR
FFE5        FULLWIDTH YEN SIGN
FFE6        FULLWIDTH WON SIGN

Please also note that Unicode knows about "ambiguous" (A) chars. See quotes from http://www.unicode.org/reports/tr11/

"In a broad sense, wide characters include W, F, and A (when in EA context), while narrow characters include N, Na, H, and A (when not in EA context)."

"Ambiguous characters behave like wide or narrow characters depending on context (language tag, script identification, associated font, source of data, or explicit markup; all can provide the context). If the context cannot be established reliably they should be treated as narrow characters by default."

So mb_strwidth could try to auto-detect the context (e.g. by locale) or have an optional East Asian context argument.

[2004-05-01 15:30:09] [EMAIL PROTECTED]

This is a valid bug.
# thanks Nuno.

--------------------

[2004-04-29 18:48:17] martin dot t dot kutschker at blackbox dot net

Description:
The table describing the width of the characters is wrong if you compare it with the table for Unicode 4.0:
http://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt

For the BMP the wide/full-width chars are:

1100..115F  Hangul Choseong
2E80..4DB5  CJK radicals and CJK Ideograph Extension A
4E00..D7A3  CJK Ideographs, Yi syll. and Hangul syll.
F900..FA6A  CJK compatibility ideographs
FE30..FE6B  presentation forms, punctuations, etc.
FF01..FF60  full-width Latin letters
FFE0        FULLWIDTH CENT SIGN
FFE1        FULLWIDTH POUND SIGN
FFE2        FULLWIDTH NOT SIGN
FFE3        FULLWIDTH MACRON
FFE4        FULLWIDTH BROKEN BAR
FFE5        FULLWIDTH YEN SIGN
FFE6        FULLWIDTH WON SIGN

I didn't check what the actual implementation does, but the docs are certainly wrong (if they mean Unicode codepoints).
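The suggestion of an optional East Asian context argument can be sketched as follows. This is a rough illustration in Python, not the mbstring implementation: the function name `str_width` and its `east_asian_context` parameter are hypothetical, and the width classes come from whatever Unicode version Python's `unicodedata` module ships with, not from Unicode 4.0. Combining marks and control characters are naively counted as one column.

```python
import unicodedata


def str_width(text, east_asian_context=False):
    """Sum display columns based on the East_Asian_Width property."""
    width = 0
    for ch in text:
        category = unicodedata.east_asian_width(ch)
        if category in ("W", "F"):
            # Wide and Fullwidth always take two columns.
            width += 2
        elif category == "A" and east_asian_context:
            # Ambiguous chars are wide only in an East Asian context.
            width += 2
        else:
            # N, Na, H, and Ambiguous without context: narrow by default.
            width += 1
    return width


print(str_width("abc"))
print(str_width("\u00b1"), str_width("\u00b1", east_asian_context=True))
```

Defaulting the argument to the narrow interpretation follows the TR11 recommendation quoted above for cases where the context cannot be established reliably.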
#28220 [Opn]: mb_strwidth() returns wrong width values for some Hangul characters.
ID: 28220
User updated by: martin dot t dot kutschker at blackbox dot net
Reported By: martin dot t dot kutschker at blackbox dot net
Status: Open
Bug Type: mbstring related
PHP Version: Irrelevant

New Comment:

I rechecked EastAsianWidth and have found two more wide chars, and noticed that the range 2E80..4DB5 is in fact split by a single half-width filler space char:

1100..115F  Hangul Choseong
2329        LEFT-POINTING ANGLE BRACKET
232A        RIGHT-POINTING ANGLE BRACKET
2E80..303E  CJK and Kangxi radicals, ideographic chars
3041..4DB5  Hiragana, Katakana, Bopomofo and Hangul letters
4E00..D7A3  CJK ideographs, Yi and Hangul syllables
F900..FA6A  CJK compatibility ideographs
FE30..FE6B  presentation forms, punctuations, etc.
FF01..FF60  full-width Latin letters
FFE0        FULLWIDTH CENT SIGN
FFE1        FULLWIDTH POUND SIGN
FFE2        FULLWIDTH NOT SIGN
FFE3        FULLWIDTH MACRON
FFE4        FULLWIDTH BROKEN BAR
FFE5        FULLWIDTH YEN SIGN
FFE6        FULLWIDTH WON SIGN

Please also note that Unicode knows about "ambiguous" (A) chars. See quotes from http://www.unicode.org/reports/tr11/

"In a broad sense, wide characters include W, F, and A (when in EA context), while narrow characters include N, Na, H, and A (when not in EA context)."

"Ambiguous characters behave like wide or narrow characters depending on context (language tag, script identification, associated font, source of data, or explicit markup; all can provide the context). If the context cannot be established reliably they should be treated as narrow characters by default."

So mb_strwidth could try to auto-detect the context (e.g. by locale) or have an optional East Asian context argument.

Previous Comments:

[2004-05-01 15:30:09] [EMAIL PROTECTED]

This is a valid bug.
# thanks Nuno.

----

[2004-04-29 18:48:17] martin dot t dot kutschker at blackbox dot net

Description:
The table describing the width of the characters is wrong if you compare it with the table for Unicode 4.0:
http://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt

For the BMP the wide/full-width chars are:

1100..115F  Hangul Choseong
2E80..4DB5  CJK radicals and CJK Ideograph Extension A
4E00..D7A3  CJK Ideographs, Yi syll. and Hangul syll.
F900..FA6A  CJK compatibility ideographs
FE30..FE6B  presentation forms, punctuations, etc.
FF01..FF60  full-width Latin letters
FFE0        FULLWIDTH CENT SIGN
FFE1        FULLWIDTH POUND SIGN
FFE2        FULLWIDTH NOT SIGN
FFE3        FULLWIDTH MACRON
FFE4        FULLWIDTH BROKEN BAR
FFE5        FULLWIDTH YEN SIGN
FFE6        FULLWIDTH WON SIGN

I didn't check what the actual implementation does, but the docs are certainly wrong (if they mean Unicode codepoints).
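For reference, a range table like the revised one in this comment could be consulted with a binary search. The sketch below is in Python and is not the actual mbstring lookup code. The names `WIDE_RANGES`, `is_wide` and `table_width` are made up for illustration, and the ranges are copied from the comment as written (BMP only, Unicode 4.0 era), so they will not match current Unicode data.

```python
from bisect import bisect_right

# Inclusive (start, end) ranges from the revised list, sorted by start.
WIDE_RANGES = [
    (0x1100, 0x115F),
    (0x2329, 0x232A),
    (0x2E80, 0x303E),
    (0x3041, 0x4DB5),
    (0x4E00, 0xD7A3),
    (0xF900, 0xFA6A),
    (0xFE30, 0xFE6B),
    (0xFF01, 0xFF60),
    (0xFFE0, 0xFFE6),
]
_STARTS = [start for start, _ in WIDE_RANGES]


def is_wide(codepoint):
    """Return True if the code point falls inside one of the ranges."""
    # Find the last range whose start is <= codepoint.
    i = bisect_right(_STARTS, codepoint) - 1
    return i >= 0 and codepoint <= WIDE_RANGES[i][1]


def table_width(text):
    """Count two columns for wide chars and one for everything else."""
    return sum(2 if is_wide(ord(ch)) else 1 for ch in text)


print(is_wide(0x303E), is_wide(0x303F), is_wide(0x3041))
```

The gap between 303E and 3041 is what keeps the half-width filler space at U+303F out of the wide set.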