From: kobrien at kiva dot org
Operating system: Ubuntu 12.04
PHP version: 5.5.0alpha1
Package: *Unicode Issues
Bug Type: Feature/Change Request
Bug description: Create an mb_str_word_count() function which is multi-byte aware
Description:
------------
Create an mb_str_word_count() function which will properly count the number
of words in a string that contains multi-byte characters. This is currently
not possible with str_word_count() because it uses the isalpha() C function,
which does not properly handle multi-byte characters.
As suggested by aharvey, this new function would replace usage of isalpha()
with iswalpha().
A naive patch (written without real knowledge of this code and untested)
would look like:
diff --git a/ext/standard/string.c b/ext/standard/string.c
index 7a4ae2e..9ab6b5f 100644
--- a/ext/standard/string.c
+++ b/ext/standard/string.c
@@ -5202,7 +5202,7 @@ PHP_FUNCTION(str_word_count)
     while (p < e) {
         s = p;
-        while (p < e && (isalpha((unsigned char)*p) || (char_list && ch[(unsigned char)*p]) || *p == '\'' || *p == '-')) {
+        while (p < e && (iswalpha((unsigned char)*p) || (char_list && ch[(unsigned char)*p]) || *p == '\'' || *p == '-')) {
             p++;
         }
         if (p > s) {
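One caveat with the naive patch: the loop still advances one byte at a time, so iswalpha() would be handed individual bytes of a multi-byte sequence rather than whole characters. A minimal sketch of the decoding step that would likely also be needed follows, assuming the input is UTF-8 and that a UTF-8 LC_CTYPE locale is installed. The names use_utf8_locale() and mb_word_count_sketch() are hypothetical and not part of PHP; this is an illustration of the idea, not a proposed implementation.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: try a few common UTF-8 locale names.
 * Returns 1 if one could be selected, 0 if none is installed. */
static int use_utf8_locale(void)
{
    static const char *names[] = { "C.UTF-8", "en_US.UTF-8", "en_US.utf8" };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++) {
        if (setlocale(LC_CTYPE, names[i]) != NULL) {
            return 1;
        }
    }
    return 0;
}

/* Hypothetical sketch (not the str_word_count() code): counts runs of
 * letters, apostrophes and hyphens in a NUL-terminated multi-byte string. */
static int mb_word_count_sketch(const char *str)
{
    const char *p = str;
    const char *e = str + strlen(str);
    mbstate_t state;
    int in_word = 0;
    int count = 0;

    memset(&state, 0, sizeof state);
    while (p < e) {
        wchar_t wc;
        /* Decode one whole character instead of looking at a single byte. */
        size_t n = mbrtowc(&wc, p, (size_t)(e - p), &state);
        int is_word_char;

        if (n == (size_t)-1 || n == (size_t)-2) {
            /* Invalid or truncated sequence: skip one byte, reset state. */
            memset(&state, 0, sizeof state);
            n = 1;
            is_word_char = 0;
        } else {
            if (n == 0) {
                n = 1; /* defensive: decoded a NUL character */
            }
            is_word_char = iswalpha((wint_t)wc) || wc == L'\'' || wc == L'-';
        }
        if (is_word_char && !in_word) {
            count++; /* a new word starts here */
        }
        in_word = is_word_char;
        p += n;
    }
    return count;
}
```

Once use_utf8_locale() has succeeded, mb_word_count_sketch("Хабилло филолог") counts two words. Digits and punctuation act as separators, as they do in str_word_count() without a charlist.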
Test script:
---------------
<?php
// existing str_word_count function for comparison
print str_word_count("PHP function str_word_count does not properly handle non-latin characters") . "\n";
// returns 11
print str_word_count("Хабилло житель Яванского района. Ему 70 лет. Он женат. У него четверо детей. Хабилло филолог. Он более двадцати лет работает по профессии. Также Хабилло занимается виноградарством. У него имеется небольшой виноградник. Этим видом деятельности Хабилло занимается 15 лет.");
// returns 0
// new function mb_str_word_count
print mb_str_word_count("Хабилло житель Яванского района. Ему 70 лет. Он женат. У него четверо детей. Хабилло филолог. Он более двадцати лет работает по профессии. Также Хабилло занимается виноградарством. У него имеется небольшой виноградник. Этим видом деятельности Хабилло занимается 15 лет.");
// returns 37
Expected result:
----------------
Using mb_str_word_count() will return the number of words in a string
containing multi-byte characters.
Actual result:
--------------
Currently there is no mb_str_word_count() function. Using str_word_count()
on a string with multi-byte characters returns 0.
--
Edit bug report at https://bugs.php.net/bug.php?id=63671&edit=1