#28220 [Opn->Fbk]: mb_strwidth() returns wrong width values for some Hangul characters.

moriyoshi Tue, 29 Jun 2004 05:25:29 -0700

 ID:          28220
 Updated by:  [EMAIL PROTECTED]
 Reported By: martin dot t dot kutschker at blackbox dot net
-Status:      Open
+Status:      Feedback
 Bug Type:    mbstring related
 PHP Version: Irrelevant
 New Comment:


Try this patch and see if it works.

http://www.voltex.jp/patches/bug28220-preliminary.patch.diff

This patch applies only to PHP 4.3.2 or later.


~/src/php-4.3.7 $ patch -p0 -R < bug28220-preliminary.patch.diff



Previous Comments:
------------------------------------------------------------------------

[2004-05-04 11:53:53] martin dot t dot kutschker at blackbox dot net

I rechecked EastAsianWidth and found two more wide chars, and
noticed that the range 2E80..4DB5 is in fact split by a single
half-width filler space char:

1100..115F  Hangul Choseong
2329        LEFT-POINTING ANGLE BRACKET
232A        RIGHT-POINTING ANGLE BRACKET
2E80..303E  CJK and Kangxi radicals, ideographic chars
3041..4DB5  Hiragana, Katakana, Bopomofo and Hangul letters
4E00..D7A3  CJK ideographs, Yi and Hangul syllables
F900..FA6A  CJK compatibility ideographs
FE30..FE6B  presentation forms, punctuations, etc.
FF01..FF60  full-width Latin letters
FFE0        FULLWIDTH CENT SIGN
FFE1        FULLWIDTH POUND SIGN
FFE2        FULLWIDTH NOT SIGN
FFE3        FULLWIDTH MACRON
FFE4        FULLWIDTH BROKEN BAR
FFE5        FULLWIDTH YEN SIGN
FFE6        FULLWIDTH WON SIGN
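
To make the list above concrete, here is a minimal Python sketch of a
width lookup built from exactly those ranges. It is not the mbstring
implementation (which is C); the names WIDE_RANGES, char_width and
str_width are made up for illustration, and it only covers the BMP
ranges quoted in this comment.

```python
from bisect import bisect_right

# Wide/full-width BMP ranges as listed in the comment above (inclusive).
# FFE0..FFE6 merges the seven single code points at the end of the list.
WIDE_RANGES = [
    (0x1100, 0x115F),
    (0x2329, 0x232A),
    (0x2E80, 0x303E),
    (0x3041, 0x4DB5),
    (0x4E00, 0xD7A3),
    (0xF900, 0xFA6A),
    (0xFE30, 0xFE6B),
    (0xFF01, 0xFF60),
    (0xFFE0, 0xFFE6),
]
_STARTS = [start for start, _ in WIDE_RANGES]


def char_width(ch):
    """Return 2 if ch falls in one of the listed ranges, else 1."""
    cp = ord(ch)
    # Find the last range whose start is <= cp, then check its end.
    i = bisect_right(_STARTS, cp) - 1
    if i >= 0 and cp <= WIDE_RANGES[i][1]:
        return 2
    return 1


def str_width(s):
    """Sum of per-character widths (hypothetical mb_strwidth stand-in)."""
    return sum(char_width(ch) for ch in s)
```

With this table, U+303F (the half-width filler char that splits the
old 2E80..4DB5 range) comes out as width 1, while U+303E and U+3041 on
either side come out as width 2.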

Please also note that Unicode knows about "ambiguous" (A) chars. See
quotes from http://www.unicode.org/reports/tr11/

"In a broad sense, wide characters include W, F, and A (when in EA
 context), while narrow characters include N, Na, H, and A (when not in
 EA context)."

"Ambiguous characters behave like wide or narrow characters depending
 on context (language tag, script identification, associated font,
 source of data, or explicit markup; all can provide the context). If
 the context cannot be established reliably they should be treated as
 narrow characters by default."

So mb_strwidth could try to auto-detect the context (e.g. by locale) or
take an optional East Asian context argument.
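
As a rough illustration of the optional-argument idea, here is a
Python sketch. It relies on the standard unicodedata.east_asian_width()
function, which reflects whatever Unicode version the Python build
ships (not Unicode 4.0), and the east_asian_context parameter is a
hypothetical name, not an existing mb_strwidth argument. Combining
characters are not treated specially here.

```python
import unicodedata


def str_width(s, east_asian_context=False):
    """Display width of s; ambiguous (A) chars are wide only in EA context."""
    total = 0
    for ch in s:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            # Wide and full-width characters always take two columns.
            total += 2
        elif eaw == "A" and east_asian_context:
            # Ambiguous characters are wide only when the caller says
            # the text is in an East Asian context.
            total += 2
        else:
            # N, Na, H, and A outside EA context are narrow (TR11 default).
            total += 1
    return total
```

For example, GREEK SMALL LETTER ALPHA (U+03B1) is classed as ambiguous,
so str_width("\u03b1") gives 1 and
str_width("\u03b1", east_asian_context=True) gives 2.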

------------------------------------------------------------------------

[2004-05-01 15:30:09] [EMAIL PROTECTED]

This is a valid bug.
# thanks Nuno.


------------------------------------------------------------------------

[2004-04-29 18:48:17] martin dot t dot kutschker at blackbox dot net

Description:
------------
The table describing the width of the characters is wrong if you
compare it with the table for Unicode 4.0:

http://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt

For the BMP the wide/full-width chars are:

1100..115F  Hangul Choseong
2E80..4DB5  CJK radicals and CJK Ideograph Extension A
4E00..D7A3  CJK Ideographs, Yi syll. and Hangul syll.
F900..FA6A  CJK compatibility ideographs
FE30..FE6B  presentation forms, punctuations, etc.
FF01..FF60  full-width Latin letters
FFE0        FULLWIDTH CENT SIGN
FFE1        FULLWIDTH POUND SIGN
FFE2        FULLWIDTH NOT SIGN
FFE3        FULLWIDTH MACRON
FFE4        FULLWIDTH BROKEN BAR
FFE5        FULLWIDTH YEN SIGN
FFE6        FULLWIDTH WON SIGN

I didn't check what the actual implementation does, but the docs are
certainly wrong (if they mean Unicode codepoints).



------------------------------------------------------------------------


-- 
Edit this bug report at http://bugs.php.net/?id=28220&edit=1
