Edit report at https://bugs.php.net/bug.php?id=63663&edit=1
ID: 63663
User updated by: kobrien at kiva dot org
Reported by: kobrien at kiva dot org
Summary: str_word_count does not properly handle non-latin characters
Status: Analyzed
Type: Bug
Package: Strings related
Operating System: Ubuntu 12.04
PHP Version: 5.3.20-dev
Block user comment: N
Private report: N
New Comment:
Thanks for the reply. Given your comments about the problem, would it be
helpful for me to also file a feature request for newer versions of PHP to add
an mb_str_word_count function that handles this case properly? I haven't dug
into the C code enough to understand why isalpha() fails on multibyte input,
but I imagine there is an alternative available that handles multibyte
characters properly. I could potentially even create a patch if pointed in the
right direction.
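
As a rough illustration of what such a function might do, here is a minimal
userland sketch. The name mb_str_word_count_sketch() is hypothetical and is not
part of PHP or mbstring. It assumes UTF-8 input and uses PCRE's \p{L} letter
class with the /u modifier. It only approximates str_word_count()'s treatment
of apostrophes and hyphens, and it does not count digits as words.

```php
<?php
// Hypothetical helper: this name is made up for illustration and does not
// exist in PHP core or in the mbstring extension.
function mb_str_word_count_sketch($text)
{
    // \p{L} matches any Unicode letter; the /u modifier makes PCRE treat
    // both pattern and subject as UTF-8. Apostrophes and hyphens are kept
    // inside words, loosely mirroring str_word_count() on ASCII input.
    $count = preg_match_all("/[\\p{L}'-]+/u", $text, $matches);

    // preg_match_all() returns false on error, e.g. invalid UTF-8 input.
    return $count === false ? 0 : $count;
}

print mb_str_word_count_sketch("Хабилло филолог. Он женат.") . "\n"; // 4
```

A regex is used here only because it is available from userland; a native
implementation would more likely walk the string with a multibyte-aware
character classifier.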
Previous Comments:
------------------------------------------------------------------------
[2012-12-03 02:29:16] [email protected]
This is due to the use of isalpha() internally, which doesn't play well with
multibyte encodings like UTF-8, regardless of the locale setting.
Fundamentally, this is the same issue as bug #27668. I'm not sure there's a
lot we can do about this in PHP 5.x, but it's worth noting if and when we
revisit Unicode string handling internally.
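
To illustrate the point about isalpha(), the sketch below models a byte-wise
letter check in userland PHP. The helper ascii_alpha_bytes() is hypothetical
and only approximates isalpha() in the "C" locale, where A-Z and a-z are the
only letters. It is not the actual internal code. Each Cyrillic letter in
UTF-8 is two bytes, both 0x80 or above, so no individual byte is ever
classified as a letter.

```php
<?php
// Hypothetical model of a per-byte isalpha() test in the "C" locale:
// only the ASCII ranges A-Z and a-z are treated as letters.
function ascii_alpha_bytes($text)
{
    $count = 0;
    $length = strlen($text); // strlen() counts bytes, not characters
    for ($i = 0; $i < $length; $i++) {
        $byte = ord($text[$i]);
        if (($byte >= 0x41 && $byte <= 0x5A) || ($byte >= 0x61 && $byte <= 0x7A)) {
            $count++;
        }
    }
    return $count;
}

$word = "лет"; // three Cyrillic letters, UTF-8 encoded

print strlen($word) . "\n";            // 6: two bytes per letter
print bin2hex($word) . "\n";           // d0bbd0b5d182
print ascii_alpha_bytes($word) . "\n"; // 0: no byte is an ASCII letter
print ascii_alpha_bytes("PHP") . "\n"; // 3
```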
------------------------------------------------------------------------
[2012-12-01 02:29:17] kobrien at kiva dot org
Description:
------------
The function str_word_count() does not work properly on non-Latin characters:
it returns zero. On Latin characters it works properly and returns the number
of words in the string.
Test script:
---------------
<?php
print str_word_count("PHP function str_word_count does not properly handle
non-latin characters") . "\n";
// returns 11
print str_word_count("Хабилло житель Яванского
района. Ему 70 лет. Он женат. У него четверо
детей. Хабилло филолог. Он более двадцати
лет работает по профессии. Также Хабилло
занимается виноградарством. У него имеется
небольшой виноградник. Этим видом
деятельности Хабилло занимается 15 лет.");
// returns 0, but should return 37
Expected result:
----------------
The second instruction should return 37
Actual result:
--------------
The second instruction returns 0
------------------------------------------------------------------------