#19257 [Bgs->Fbk]: strtolower & strtoupper does not work for UTF-8 strings

wez Wed, 25 Sep 2002 17:52:56 -0700

 ID:               19257
 Updated by:       [EMAIL PROTECTED]
 Reported By:      [EMAIL PROTECTED]
-Status:           Bogus
+Status:           Feedback
 Bug Type:         Strings related
 Operating System: Linux
 PHP Version:      4.2.2
 New Comment:


I've added a new function to the mbstring extension in CVS.
This function will be in PHP 4.3.

I would appreciate your feedback.
Try a snapshot from http://snaps.php.net/php4-latest.tar.gz
dated after this message.

usage:
proto string mb_convert_case(string str, int mode [, string encoding]);

mode can be one of MB_CASE_UPPER, MB_CASE_LOWER or MB_CASE_TITLE.
encoding specifies the encoding of str; if omitted, the
mbstring.internal_encoding value will be used.
The return value is str with the appropriate case folding applied.

The function works by internally converting the string into UCS-4 format,
applying php_unicode_to(upper|lower|title) to each Unicode character,
and then converting the string back into the original encoding.
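
The decode / map / re-encode idea can be sketched in a few lines of C.
This is only an illustration under stated assumptions, not the mbstring
code: the names utf8_decode, utf8_encode, toy_toupper and
utf8_toupper_sketch are hypothetical, and toy_toupper is a tiny stand-in
for a real Unicode case table (it only knows ASCII, Latin-1 letters and
basic Cyrillic).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Decode one UTF-8 sequence into a code point (the "UCS-4" value).
   Returns the number of bytes consumed, or 0 on malformed input.
   Sketch only: overlong forms and surrogates are not rejected. */
static size_t utf8_decode(const unsigned char *s, size_t len, uint32_t *cp)
{
    size_t n, i;

    if (len == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0)      { *cp = s[0] & 0x1F; n = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; n = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; n = 4; }
    else return 0;                   /* stray continuation or invalid lead byte */
    if (len < n) return 0;           /* truncated sequence */
    for (i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return n;
}

/* Encode one code point as UTF-8; out must have room for 4 bytes. */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* ASSUMPTION: toy per-code-point mapping, standing in for a full table. */
static uint32_t toy_toupper(uint32_t cp)
{
    if (cp >= 'a' && cp <= 'z') return cp - 0x20;                  /* ASCII */
    if (cp >= 0xE0 && cp <= 0xFE && cp != 0xF7) return cp - 0x20;  /* Latin-1 */
    if (cp >= 0x430 && cp <= 0x44F) return cp - 0x20;              /* Cyrillic */
    return cp;
}

/* Upper-case a NUL-terminated UTF-8 string into out (capacity cap bytes).
   Returns 0 on success, -1 on malformed input or insufficient space. */
int utf8_toupper_sketch(const char *in, char *out, size_t cap)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t len, pos = 0, o = 0;

    assert(in != NULL && out != NULL);
    len = strlen(in);
    while (pos < len) {
        uint32_t cp;
        unsigned char buf[4];
        size_t n = utf8_decode(s + pos, len - pos, &cp);
        size_t m;

        if (n == 0) return -1;
        m = utf8_encode(toy_toupper(cp), buf);
        if (o + m + 1 > cap) return -1;
        memcpy(out + o, buf, m);
        o += m;
        pos += n;
    }
    out[o] = '\0';
    return 0;
}
```

Because the mapping is applied to whole code points rather than to
individual bytes, the result is always re-encoded as valid UTF-8.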

The code for your test case would look like this
(and works for me):

<?php
$str = "Test".utf8_encode("\xFC");

$strU = mb_convert_case($str, MB_CASE_UPPER, "utf-8");
$strL = mb_convert_case($str, MB_CASE_LOWER, "utf-8");
?>
<PRE>
str  = '<?php echo $str;  ?>'
strU = '<?php echo $strU; ?>'
strL = '<?php echo $strL; ?>'
</PRE>


Previous Comments:
------------------------------------------------------------------------

[2002-09-10 09:20:07] [EMAIL PROTECTED]

As I understand it, toupper()/tolower() work only for single-byte
encodings. So the right way is to use the 'wide' versions of
toupper()/tolower(): towupper()/towlower().
Example:

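#include <stdio.h>
#include <wchar.h>
#include <wctype.h>
#include <locale.h>

int main(void) {
    /* setlocale() returns NULL if the requested locale is unavailable */
    const char *loc = setlocale(LC_ALL, "UTF-8");
    printf("locale set to '%s'\n", loc ? loc : "(null)");

    /* %lc prints a wint_t as a multibyte character in the current locale */
    printf("0x00DC C='%lc'\n", towlower(0x00DC));
    printf("0x042F C='%lc'\n", towlower(0x042F));

    return 0;
}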

And it works fine for UCS-2 (UTF-16).
In PHP I can convert UTF-8 to UTF-16 by using iconv().
But PHP has no 'wide' version of strtolower()/strtoupper().
So, what can I do?

------------------------------------------------------------------------

[2002-09-10 08:54:19] [EMAIL PROTECTED]

I forgot to add that you should feed your utf8 data to the
input of that little program.

------------------------------------------------------------------------

[2002-09-10 08:52:58] [EMAIL PROTECTED]

This is not a bug in PHP; it's down to whether your system
can support this and has the appropriate locales installed.

A quick and dirty example might look like this in C:

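#include <ctype.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
   char buff[1024];

   /* use the locale from the environment (LANG / LC_ALL); without this
      call the program stays in the "C" locale */
   setlocale(LC_CTYPE, "");

   while (fgets(buff, sizeof(buff), stdin)) {
      size_t i, l;
      l = strlen(buff);
      for (i = 0; i < l; i++)
          /* toupper() needs a value representable as unsigned char */
          buff[i] = toupper((unsigned char)buff[i]);
      /* fgets() keeps the newline, so don't let puts() add another */
      fputs(buff, stdout);
   }
   return 0;
}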

If that little program works, your system supports
this conversion.  If it doesn't, then PHP doesn't
either.


------------------------------------------------------------------------

[2002-09-10 08:44:36] [EMAIL PROTECTED]

> So you didn't try it..?
Yes, I set LC_ALL/LANG to 'en_US' and tried it.

> I only tried your test script and got the expected result.
> Whatever the characters are.. I've no idea of them anyway..
I think you're confused by looking at the result of the test script with
the encoding set to 'ISO-8859-x' instead of 'UTF-8'.
In that case it looks as if some characters changed to lower/upper case.
BUT they are not UTF-8 lower/upper case characters:
1) 0xC39C changed to 0xE39C, should be 0xC3BC
2) 0xD0AF changed to 0xF0AF, should be 0xD18F
As a result we have not a UTF-8 string but garbage.
If you really want to test this issue, you should set
'default_charset=utf-8' in php.ini or set the encoding to 'UTF-8' in your
browser.
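
The byte values listed above can be reproduced with a small C sketch.
It assumes the conversion is done one byte at a time with an
ISO-8859-1 style case table, which is what the observed output
suggests; the function names latin1_tolower and bytewise_tolower are
made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* ASSUMPTION: lower-case a single byte the way an ISO-8859-1 table would:
   A-Z and 0xC0-0xDE (except 0xD7, the multiplication sign) gain 0x20. */
static unsigned char latin1_tolower(unsigned char b)
{
    if ((b >= 'A' && b <= 'Z') || (b >= 0xC0 && b <= 0xDE && b != 0xD7))
        return (unsigned char)(b + 0x20);
    return b;
}

/* Apply the single-byte mapping to every byte of a buffer, ignoring the
   fact that in UTF-8 several bytes may belong to one character. */
void bytewise_tolower(unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < len; i++)
        buf[i] = latin1_tolower(buf[i]);
}
```

Fed the two bytes 0xC3 0x9C, this changes only the lead byte and yields
0xE3 0x9C; 0xD0 0xAF likewise becomes 0xF0 0xAF. In both cases the lead
byte now announces a longer sequence than the one that actually follows,
so the result is no longer valid UTF-8.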

> btw. AFAIK, setting LANG / LC_ALL to UTF-8 is not correct
> way to do it.. 
> According to that HOWTO, it should be something like
> ru_RU.UTF-8 (and only if you really have UTF-8 locales)
I tried en_US.UTF-8, de_DE.UTF-8, ru_RU.UTF-8 - no luck.

> I'm bogusing this since it really isn't anything PHP can
> affect..
So, is there no way in PHP to convert a UTF-8 string to lower/upper case?

------------------------------------------------------------------------

[2002-09-09 17:06:43] [EMAIL PROTECTED]

So you didn't try it..? I only tried your test script and
got the expected result. Whatever the characters are..I've no idea of
them anyway..

btw. AFAIK, setting LANG / LC_ALL to UTF-8 is not correct
way to do it.. 

http://melkor.dnp.fmph.uniba.sk/~garabik/debian-utf8/howto.html

According to that HOWTO, it should be something like ru_RU.UTF-8 (and
only if you really have UTF-8 locales)

I'm bogusing this since it really isn't anything PHP can affect..


------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
    http://bugs.php.net/19257

-- 
Edit this bug report at http://bugs.php.net/?id=19257&edit=1
