#39404 [Opn]: Support "entity" as substitute_character setting

2006-11-06 Thread martin dot t dot kutschker at blackbox dot net
 ID:  39404
 User updated by: martin dot t dot kutschker at blackbox dot net
-Summary: Support "entitiy" as substitute_character setting
 Reported By:     martin dot t dot kutschker at blackbox dot net
 Status:  Open
 Bug Type:    mbstring related
 PHP Version: 5.2.0
 New Comment:

Fix spelling of "entity" in the summary.


Previous Comments:


[2006-11-06 16:56:19] martin dot t dot kutschker at blackbox dot net

Description:

It would be great if the charset conversion could also output SGML/HTML
entities for characters that are missing in the output charset. The option
"long" is not very HTML-friendly. But with "entity" any Unicode-aware
browser could deal with the missing character.

e.g.
mbstring.substitute_character=long => U+3000
mbstring.substitute_character=entity => &#x3000;
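
To make the request concrete, here is a minimal sketch of the two substitution styles. It is written in Python and is not mbstring code; the handler names "hex_entity" and "long_form" and the helper convert() are made up for this illustration. It only assumes that the "entity" form would be a hexadecimal numeric character reference.

```python
import codecs


def _entity_handler(err):
    """Replace each unencodable character with a hex character reference."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(f"&#x{ord(ch):X};" for ch in bad), err.end


def _long_handler(err):
    """Imitate the "long" style from the report: U+XXXX per character."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(f"U+{ord(ch):04X}" for ch in bad), err.end


# Hypothetical handler names, registered only for this sketch.
codecs.register_error("hex_entity", _entity_handler)
codecs.register_error("long_form", _long_handler)


def convert(text, charset, substitute="hex_entity"):
    """Encode text, substituting characters the target charset lacks."""
    return text.encode(charset, errors=substitute)
```

Python's built-in "xmlcharrefreplace" error handler does the same job with decimal references, which browsers treat the same way as the hexadecimal form.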









#39404 [NEW]: Support "entitiy" as substitute_character setting

2006-11-06 Thread martin dot t dot kutschker at blackbox dot net
From: martin dot t dot kutschker at blackbox dot net
Operating system: 
PHP version:  5.2.0
PHP Bug Type: mbstring related
Bug description:  Support "entitiy" as substitute_character setting

Description:

It would be great if the charset conversion could also output SGML/HTML
entities for characters that are missing in the output charset. The option
"long" is not very HTML-friendly. But with "entity" any Unicode-aware
browser could deal with the missing character.

e.g.
mbstring.substitute_character=long => U+3000
mbstring.substitute_character=entity => &#x3000;





#28220 [NoF->Opn]: mb_strwidth() returns wrong width values for some Hangul characters.

2004-07-10 Thread martin dot t dot kutschker at blackbox dot net
 ID:  28220
 User updated by: martin dot t dot kutschker at blackbox dot net
 Reported By: martin dot t dot kutschker at blackbox dot net
-Status:  No Feedback
+Status:  Open
 Bug Type:    mbstring related
 PHP Version: Irrelevant
 New Comment:

I never tried the original code (only noticed the problem from reading
the docs), so I did not test the diff. Anyway I'm offline for two
weeks, so I won't be able to give the fix a try for some time.


Previous Comments:


[2004-07-07 01:00:04] php-bugs at lists dot php dot net

No feedback was provided for this bug for over a week, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".



[2004-06-29 14:25:26] [EMAIL PROTECTED]

Try this patch and see if it works.

http://www.voltex.jp/patches/bug28220-preliminary.patch.diff

This patch is only applicable for PHP 4.3.2 or later.


~/src/php-4.3.7 $ patch -p0 -R < bug28220-preliminary.patch.diff




[2004-05-04 11:53:53] martin dot t dot kutschker at blackbox dot net

I rechecked EastAsianWidth and found two more wide chars, and noticed
that the range 2E80..4DB5 is in fact split by a single half-width filler
space char:

1100..115F  Hangul Choseong
2329        LEFT-POINTING ANGLE BRACKET
232A        RIGHT-POINTING ANGLE BRACKET
2E80..303E  CJK and Kangxi radicals, ideographic chars
3041..4DB5  Hiragana, Katakana, Bopomofo and Hangul letters
4E00..D7A3  CJK ideographs, Yi and Hangul syllables
F900..FA6A  CJK compatibility ideographs
FE30..FE6B  presentation forms, punctuation, etc.
FF01..FF60  full-width Latin letters
FFE0        FULLWIDTH CENT SIGN
FFE1        FULLWIDTH POUND SIGN
FFE2        FULLWIDTH NOT SIGN
FFE3        FULLWIDTH MACRON
FFE4        FULLWIDTH BROKEN BAR
FFE5        FULLWIDTH YEN SIGN
FFE6        FULLWIDTH WON SIGN
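
For anyone who wants to check individual code points against the list above, here is a small lookup sketch in Python. It is not the mbstring table; the names WIDE_RANGES, is_wide() and str_width() are invented for this example, and the ranges are copied from the list as posted (BMP only, with adjacent single code points merged into ranges).

```python
from bisect import bisect_right

# (first, last) code points, inclusive, taken from the list in this comment.
# 2329/232A and FFE0..FFE6 are merged into ranges for brevity.
WIDE_RANGES = [
    (0x1100, 0x115F),
    (0x2329, 0x232A),
    (0x2E80, 0x303E),
    (0x3041, 0x4DB5),
    (0x4E00, 0xD7A3),
    (0xF900, 0xFA6A),
    (0xFE30, 0xFE6B),
    (0xFF01, 0xFF60),
    (0xFFE0, 0xFFE6),
]
_STARTS = [first for first, _ in WIDE_RANGES]


def is_wide(ch):
    """Return True if ch falls into one of the wide ranges listed above."""
    cp = ord(ch)
    # Find the last range whose start is <= cp, then check its upper bound.
    i = bisect_right(_STARTS, cp) - 1
    return i >= 0 and cp <= WIDE_RANGES[i][1]


def str_width(text):
    """Count 2 columns per wide character and 1 for everything else."""
    return sum(2 if is_wide(ch) else 1 for ch in text)
```

The sketch counts every non-wide character as one column, so combining and control characters are not handled.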

Please also note that Unicode knows about "ambiguous" (A) chars. See these
quotes from http://www.unicode.org/reports/tr11/

"In a broad sense, wide characters include W, F, and A (when in EA
 context), while narrow characters include N, Na, H, and A (when not in
 EA context)."

"Ambiguous characters behave like wide or narrow characters depending on
 context (language tag, script identification, associated font, source of
 data, or explicit markup; all can provide the context). If the context
 cannot be established reliably they should be treated as narrow
 characters by default."

So mb_strwidth could try to auto-detect the context (e.g. by locale) or
have an optional East Asian context argument.
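
A rough sketch of what such an optional context argument could look like, again in Python rather than PHP. The function names and the east_asian_context parameter are hypothetical; the width classes come from Python's bundled unicodedata module, which follows a newer Unicode version than 4.0, so individual code points may be classified differently from the list above.

```python
import unicodedata


def char_width(ch, east_asian_context=False):
    """Return 2 for W/F, 2 for A only in East Asian context, else 1."""
    eaw = unicodedata.east_asian_width(ch)
    if eaw in ("W", "F"):
        return 2
    if eaw == "A":
        # Ambiguous: narrow by default, wide when the caller says the
        # text is in an East Asian context.
        return 2 if east_asian_context else 1
    return 1  # N, Na and H are all treated as narrow


def str_width_ctx(text, east_asian_context=False):
    """Sum the per-character widths for the whole string."""
    return sum(char_width(ch, east_asian_context) for ch in text)
```

Defaulting the argument to narrow follows the TR11 advice quoted above for cases where the context cannot be established.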



[2004-05-01 15:30:09] [EMAIL PROTECTED]

This is a valid bug.
# thanks Nuno.


--------------------

[2004-04-29 18:48:17] martin dot t dot kutschker at blackbox dot net

Description:

The table describing the width of the characters is wrong if you
compare it with the table for Unicode 4.0:

http://www.unicode.org/Public/UNIDATA/EastAsianWidth.txt

For the BMP the wide/full-width chars are:

1100..115F  Hangul Choseong
2E80..4DB5  CJK radicals and CJK Ideograph Extension A
4E00..D7A3  CJK Ideographs, Yi syll. and Hangul syll.
F900..FA6A  CJK compatibility ideographs
FE30..FE6B  presentation forms, punctuation, etc.
FF01..FF60  full-width Latin letters
FFE0        FULLWIDTH CENT SIGN
FFE1        FULLWIDTH POUND SIGN
FFE2        FULLWIDTH NOT SIGN
FFE3        FULLWIDTH MACRON
FFE4        FULLWIDTH BROKEN BAR
FFE5        FULLWIDTH YEN SIGN
FFE6        FULLWIDTH WON SIGN

I didn't check what the actual implementation does, but the docs are
certainly wrong (if they mean Unicode codepoints).








#28220 [Opn]: mb_strwidth() returns wrong width values for some Hangul characters.

2004-05-04 Thread martin dot t dot kutschker at blackbox dot net
 ID:  28220
 User updated by: martin dot t dot kutschker at blackbox dot net
 Reported By: martin dot t dot kutschker at blackbox dot net
 Status:  Open
 Bug Type:    mbstring related
 PHP Version: Irrelevant
 New Comment:

I rechecked EastAsianWidth and found two more wide chars, and noticed
that the range 2E80..4DB5 is in fact split by a single half-width filler
space char:

1100..115F  Hangul Choseong
2329        LEFT-POINTING ANGLE BRACKET
232A        RIGHT-POINTING ANGLE BRACKET
2E80..303E  CJK and Kangxi radicals, ideographic chars
3041..4DB5  Hiragana, Katakana, Bopomofo and Hangul letters
4E00..D7A3  CJK ideographs, Yi and Hangul syllables
F900..FA6A  CJK compatibility ideographs
FE30..FE6B  presentation forms, punctuation, etc.
FF01..FF60  full-width Latin letters
FFE0        FULLWIDTH CENT SIGN
FFE1        FULLWIDTH POUND SIGN
FFE2        FULLWIDTH NOT SIGN
FFE3        FULLWIDTH MACRON
FFE4        FULLWIDTH BROKEN BAR
FFE5        FULLWIDTH YEN SIGN
FFE6        FULLWIDTH WON SIGN

Please also note that Unicode knows about "ambiguous" (A) chars. See these
quotes from http://www.unicode.org/reports/tr11/

"In a broad sense, wide characters include W, F, and A (when in EA
 context), while narrow characters include N, Na, H, and A (when not in
 EA context)."

"Ambiguous characters behave like wide or narrow characters depending on
 context (language tag, script identification, associated font, source of
 data, or explicit markup; all can provide the context). If the context
 cannot be established reliably they should be treated as narrow
 characters by default."

So mb_strwidth could try to auto-detect the context (e.g. by locale) or
have an optional East Asian context argument.

