#28220 [Fbk->NoF]: mb_strwidth() returns wrong width values for some Hangul characters.

2004-07-19 Thread php-bugs
 ID:  28220
 Updated by:  [EMAIL PROTECTED]
 Reported By: martin dot t dot kutschker at blackbox dot net
-Status:  Feedback
+Status:  No Feedback
 Bug Type:  mbstring related
 PHP Version: Irrelevant
 New Comment:

No feedback was provided for this bug for over a week, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".


Previous Comments:


[2004-07-12 08:52:33] [EMAIL PROTECTED]

We're still waiting for feedback, so leave it in that state.



[2004-07-10 20:41:42] martin dot t dot kutschker at blackbox dot net

I never tried the original code (only noticed the problem from reading
the docs), so I did not test the diff. Anyway I'm offline for two
weeks, so I won't be able to give the fix a try for some time.



[2004-07-07 01:00:04] php-bugs at lists dot php dot net

No feedback was provided for this bug for over a week, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".



[2004-06-29 14:25:26] [EMAIL PROTECTED]

Try this patch and see if it works.

http://www.voltex.jp/patches/bug28220-preliminary.patch.diff

This patch is only applicable for PHP 4.3.2 or later.


~/src/php-4.3.7 $ patch -p0 -R < bug28220-preliminary.patch.diff




[2004-05-04 11:53:53] martin dot t dot kutschker at blackbox dot net

I rechecked EastAsianWidth and found two more wide chars. I also
noticed that the range 2E80..4DB5 is in fact split by a single
half-width filler space char (U+303F):

1100..115F  Hangul Choseong
2329        LEFT-POINTING ANGLE BRACKET
232A        RIGHT-POINTING ANGLE BRACKET
2E80..303E  CJK and Kangxi radicals, ideographic chars
3041..4DB5  Hiragana, Katakana, Bopomofo and Hangul letters
4E00..D7A3  CJK ideographs, Yi and Hangul syllables
F900..FA6A  CJK compatibility ideographs
FE30..FE6B  presentation forms, punctuation, etc.
FF01..FF60  full-width Latin letters
FFE0        FULLWIDTH CENT SIGN
FFE1        FULLWIDTH POUND SIGN
FFE2        FULLWIDTH NOT SIGN
FFE3        FULLWIDTH MACRON
FFE4        FULLWIDTH BROKEN BAR
FFE5        FULLWIDTH YEN SIGN
FFE6        FULLWIDTH WON SIGN

Please also note that Unicode knows about "ambiguous" (A) chars. See
these quotes from http://www.unicode.org/reports/tr11/

"In a broad sense, wide characters include W, F, and A (when in EA
 context), while narrow characters include N, Na, H, and A (when not
 in EA context)."

"Ambiguous characters behave like wide or narrow characters depending
 on context (language tag, script identification, associated font,
 source of data, or explicit markup; all can provide the context). If
 the context cannot be established reliably they should be treated as
 narrow characters by default."

So mb_strwidth could try to auto-detect the context (e.g. by locale) or
take an optional East Asian context argument.
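
A minimal sketch of the optional East Asian context argument suggested
above. It is written in Python for illustration and is not mbstring's
implementation: display_width and its east_asian_context parameter are
hypothetical names, and the width classes come from Python's
unicodedata.east_asian_width, whose data follows whichever Unicode
version the interpreter ships with. Combining marks and control
characters are simply counted as one column here, which a real
implementation would probably handle differently.

```python
import unicodedata


def display_width(text: str, east_asian_context: bool = False) -> int:
    """Sum column widths using the East_Asian_Width property (UAX #11).

    W and F always count as 2 columns. A (ambiguous) counts as 2 only
    when the caller declares an East Asian context; otherwise it is
    treated as narrow, the default that TR11 recommends.
    """
    width = 0
    for ch in text:
        eaw = unicodedata.east_asian_width(ch)
        if eaw in ("W", "F"):
            width += 2
        elif eaw == "A" and east_asian_context:
            width += 2
        else:
            # N, Na, H, and A outside an East Asian context
            width += 1
    return width
```

With this sketch, the section sign U+00A7 (class A) counts as 1 column
by default and as 2 columns when east_asian_context=True, while a
Hangul syllable such as U+D55C counts as 2 columns either way.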



The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
http://bugs.php.net/28220

-- 
Edit this bug report at http://bugs.php.net/?id=28220&edit=1


#28220 [Fbk->NoF]: mb_strwidth() returns wrong width values for some Hangul characters.

2004-07-06 Thread php-bugs
 ID:  28220
 Updated by:  [EMAIL PROTECTED]
 Reported By: martin dot t dot kutschker at blackbox dot net
-Status:  Feedback
+Status:  No Feedback
 Bug Type:  mbstring related
 PHP Version: Irrelevant
 New Comment:

No feedback was provided for this bug for over a week, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".


Previous Comments:


[2004-06-29 14:25:26] [EMAIL PROTECTED]

Try this patch and see if it works.

http://www.voltex.jp/patches/bug28220-preliminary.patch.diff

This patch is only applicable for PHP 4.3.2 or later.


~/src/php-4.3.7 $ patch -p0 -R < bug28220-preliminary.patch.diff




[2004-05-04 11:53:53] martin dot t dot kutschker at blackbox dot net

I rechecked EastAsianWidth and found two more wide chars. I also
noticed that the range 2E80..4DB5 is in fact split by a single
half-width filler space char (U+303F):

1100..115F  Hangul Choseong
2329        LEFT-POINTING ANGLE BRACKET
232A        RIGHT-POINTING ANGLE BRACKET
2E80..303E  CJK and Kangxi radicals, ideographic chars
3041..4DB5  Hiragana, Katakana, Bopomofo and Hangul letters
4E00..D7A3  CJK ideographs, Yi and Hangul syllables
F900..FA6A  CJK compatibility ideographs
FE30..FE6B  presentation forms, punctuation, etc.
FF01..FF60  full-width Latin letters
FFE0        FULLWIDTH CENT SIGN
FFE1        FULLWIDTH POUND SIGN
FFE2        FULLWIDTH NOT SIGN
FFE3        FULLWIDTH MACRON
FFE4        FULLWIDTH BROKEN BAR
FFE5        FULLWIDTH YEN SIGN
FFE6        FULLWIDTH WON SIGN

Please also note that Unicode knows about "ambiguous" (A) chars. See
these quotes from http://www.unicode.org/reports/tr11/

"In a broad sense, wide characters include W, F, and A (when in EA
 context), while narrow characters include N, Na, H, and A (when not
 in EA context)."

"Ambiguous characters behave like wide or narrow characters depending
 on context (language tag, script identification, associated font,
 source of data, or explicit markup; all can provide the context). If
 the context cannot be established reliably they should be treated as
 narrow characters by default."

So mb_strwidth could try to auto-detect the context (e.g. by locale) or
take an optional East Asian context argument.



[2004-05-01 15:30:09] [EMAIL PROTECTED]

This is a valid bug.
# thanks Nuno.




[2004-04-29 18:48:17] martin dot t dot kutschker at blackbox dot net

Description:

The table describing the width of the characters is wrong if you
compare it with the table for Unicode 4.0:

http://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt

For the BMP the wide/full-width chars are:

1100..115F  Hangul Choseong
2E80..4DB5  CJK radicals and CJK Ideograph Extension A
4E00..D7A3  CJK Ideographs, Yi syll. and Hangul syll.
F900..FA6A  CJK compatibility ideographs
FE30..FE6B  presentation forms, punctuation, etc.
FF01..FF60  full-width Latin letters
FFE0        FULLWIDTH CENT SIGN
FFE1        FULLWIDTH POUND SIGN
FFE2        FULLWIDTH NOT SIGN
FFE3        FULLWIDTH MACRON
FFE4        FULLWIDTH BROKEN BAR
FFE5        FULLWIDTH YEN SIGN
FFE6        FULLWIDTH WON SIGN

I didn't check what the actual implementation does, but the docs are
certainly wrong (if they mean Unicode codepoints).
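
To illustrate how a width table like the one in this description could
be consulted, here is a rough Python sketch of a range lookup. It is
not the mbstring code: WIDE_RANGES, is_wide and str_width are
hypothetical names, and the ranges are copied as listed in this
description (BMP only), so they do not include the refinements from
the follow-up comment dated 2004-05-04.

```python
from bisect import bisect_right

# Inclusive code point ranges as listed in the description (BMP only).
# The seven single entries FFE0..FFE6 are merged into one range.
WIDE_RANGES = [
    (0x1100, 0x115F),  # Hangul Choseong
    (0x2E80, 0x4DB5),  # CJK radicals and CJK Ideograph Extension A
    (0x4E00, 0xD7A3),  # CJK Ideographs, Yi syllables, Hangul syllables
    (0xF900, 0xFA6A),  # CJK compatibility ideographs
    (0xFE30, 0xFE6B),  # presentation forms, punctuation, etc.
    (0xFF01, 0xFF60),  # full-width Latin letters
    (0xFFE0, 0xFFE6),  # FULLWIDTH CENT SIGN .. FULLWIDTH WON SIGN
]
_STARTS = [lo for lo, _ in WIDE_RANGES]


def is_wide(cp: int) -> bool:
    """Return True if code point cp falls inside one of WIDE_RANGES."""
    # Find the last range whose start is <= cp, then check its end.
    i = bisect_right(_STARTS, cp) - 1
    return i >= 0 and cp <= WIDE_RANGES[i][1]


def str_width(text: str) -> int:
    """Count 2 columns for wide characters and 1 for everything else."""
    return sum(2 if is_wide(ord(ch)) else 1 for ch in text)
```

Keeping the ranges sorted and searching them with bisect keeps each
lookup logarithmic in the table size, which matters little for seven
entries but scales if the table is regenerated from a full
EastAsianWidth.txt.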





