Bug #63663 [Ana]: str_word_count does not properly handle non-latin characters

kobrien at kiva dot org Sun, 02 Dec 2012 19:10:58 -0800

Edit report at https://bugs.php.net/bug.php?id=63663&edit=1


 ID:                 63663
 User updated by:    kobrien at kiva dot org
 Reported by:        kobrien at kiva dot org
 Summary:            str_word_count does not properly handle non-latin
                     characters
 Status:             Analyzed
 Type:               Bug
 Package:            Strings related
 Operating System:   Ubuntu 12.04
 PHP Version:        5.3.20-dev
 Block user comment: N
 Private report:     N

 New Comment:

OK, feature request filed here: https://bugs.php.net/bug.php?id=63671
First time doing that, so hopefully it's correctly filed.


Previous Comments:
------------------------------------------------------------------------
[2012-12-03 02:47:28] [email protected]

Yeah, a feature request for mb_str_word_count() might be a good idea.

The isalpha() issue isn't really PHP-specific: the underlying C function simply 
takes a single byte as its input, so it can't ascertain whether a multibyte 
character is actually alphabetic or not (since it only ever sees one byte of 
the sequence at a time). There's an iswalpha() function that would do the right 
thing, but PHP was written before it was widely available, and using it in 
str_word_count() alone would be inconsistent with the rest of the language: 
it's something we'd need to think about as part of making the whole language 
more multibyte-aware.
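
As a rough illustration of the single-byte limitation, here is a minimal C 
sketch. The helper name count_words_bytewise() is hypothetical, and the code is 
a simplified stand-in for byte-wise word counting, not PHP's actual 
str_word_count() implementation. It assumes the default "C" locale, in which 
isalpha() accepts only A-Z and a-z.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts runs of consecutive bytes that isalpha()
 * accepts. Each byte is classified on its own, so the function never
 * sees a multibyte character as a whole. */
size_t count_words_bytewise(const char *s)
{
    size_t words = 0;
    int in_word = 0;

    for (; *s != '\0'; s++) {
        /* Cast to unsigned char: passing a negative char value to
         * isalpha() is undefined behaviour. */
        if (isalpha((unsigned char)*s)) {
            if (!in_word) {
                words++;      /* first alphabetic byte starts a new word */
                in_word = 1;
            }
        } else {
            in_word = 0;      /* any other byte ends the current word */
        }
    }
    return words;
}
```

Under that assumption, count_words_bytewise("PHP function") gives 2, while a 
UTF-8 encoded Cyrillic string gives 0: every byte of each Cyrillic letter is 
0x80 or above, so no single byte is ever classified as alphabetic.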

------------------------------------------------------------------------
[2012-12-03 02:36:37] kobrien at kiva dot org

Thanks for the reply. Given your comments about the problems, would it be 
helpful for me to also file a feature request for newer versions of PHP to have 
an mb_str_word_count function which could properly handle this case? I haven't 
dug into the C code enough to understand why isalpha() fails on multibyte, but 
I'd have to imagine there is an alternative available that will handle 
multibyte characters properly. I could potentially even create a patch if 
pointed in the right direction.

------------------------------------------------------------------------
[2012-12-03 02:29:16] [email protected]

This is due to the use of isalpha() internally, which doesn't play well with 
multibyte encodings like UTF-8, regardless of the locale setting.

Fundamentally, this is the same issue as bug #27668. I'm not sure there's a 
lot we can do about this in PHP 5.x, but it's worth noting if and when we 
revisit Unicode string handling internally.

------------------------------------------------------------------------
[2012-12-01 02:29:17] kobrien at kiva dot org

Description:
------------
The function str_word_count() does not work properly on non-Latin characters: 
it returns a value of zero. On Latin characters, str_word_count() works 
properly and returns the number of words in the string.

Test script:
---------------
<?php
print str_word_count("PHP function str_word_count does not properly handle non-latin characters") . "\n";

// returns 11

print str_word_count("Хабилло житель Яванского района. Ему 70 лет. Он женат.
У него четверо детей. Хабилло филолог. Он более двадцати лет работает
по профессии. Также Хабилло занимается виноградарством. У него имеется
небольшой виноградник. Этим видом деятельности Хабилло занимается 15 лет.");

// returns 0, but should return 37

Expected result:
----------------
The second instruction should return 37

Actual result:
--------------
The second instruction returns 0


------------------------------------------------------------------------


