ID:               19257
Updated by:       [EMAIL PROTECTED]
Reported By:      [EMAIL PROTECTED]
-Status:          Bogus
+Status:          Feedback
Bug Type:         Strings related
Operating System: Linux
PHP Version:      4.2.2
New Comment:

I've added a new function to the mbstring extension in CVS.
This function will be in PHP 4.3.

I would appreciate your feedback. Try a snapshot from
http://snaps.php.net/php4-latest.tar.gz dated after this message.

usage:

    proto string mb_convert_case(string str, int mode [, string encoding]);

mode can be one of MB_CASE_UPPER, MB_CASE_LOWER or MB_CASE_TITLE.
encoding specifies the encoding of str; if omitted, the
mbstring.internal_encoding value will be used.
The return value is str with the appropriate case folding applied.

The function works by internally converting the string into UCS-4
format, applying php_unicode_to(upper|lower|title) to each Unicode
character, and then converting the string back into the original
encoding.

The code for your test case would look like this (and works for me):

    <?
    $str = "Test".utf8_encode("\xFC");
    $strU = mb_convert_case($str, MB_CASE_UPPER, "utf-8");
    $strL = mb_convert_case($str, MB_CASE_LOWER, "utf-8");
    ?>
    <PRE>
    str = '<? echo $str; ?>'
    strU = '<? echo $strU; ?>'
    strL = '<? echo $strL; ?>'
    </PRE>


Previous Comments:
------------------------------------------------------------------------

[2002-09-10 09:20:07] [EMAIL PROTECTED]

As I understand it, toupper()/tolower() work only for one-byte
encodings. So the right way is to use the 'wide' versions of
toupper()/tolower(): towupper()/towlower().

Example:

    #include <stdio.h>
    #include <wctype.h>
    #include <locale.h>

    int main()
    {
        printf("locale set to '%s'\n", setlocale(LC_ALL, "UTF-8"));
        printf("0x00DC C='%C'\n", towlower(0x00DC));
        printf("0x042F C='%C'\n", towlower(0x042F));
        return(0);
    }

And it's working fine for UCS2 (UTF-16).

In PHP I can convert UTF-8 to UTF-16 by using iconv(). But PHP has no
'wide' version of strtolower()/strtoupper(). So, what can I do?

------------------------------------------------------------------------

[2002-09-10 08:54:19] [EMAIL PROTECTED]

I forgot to add that you should feed your utf8 data to the input of
that little program.
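
For illustration, the decode, map, re-encode approach described for
mb_convert_case() above can be sketched in C. This is only a sketch
under stated assumptions, not the mbstring implementation: the names
toy_tolower, utf8_decode, utf8_encode and utf8_lower are hypothetical,
and the small hand-written mapping (ASCII, Latin-1, basic Cyrillic)
merely stands in for real Unicode case tables.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy lowercase mapping for a few ranges only. Illustrative stand-in,
 * NOT PHP's Unicode tables (assumption made for this sketch). */
static uint32_t toy_tolower(uint32_t cp)
{
    if (cp >= 'A' && cp <= 'Z') return cp + 0x20;            /* ASCII */
    if (cp >= 0x00C0 && cp <= 0x00DE && cp != 0x00D7)
        return cp + 0x20;                                    /* Latin-1 */
    if (cp >= 0x0410 && cp <= 0x042F) return cp + 0x20;      /* Cyrillic */
    return cp;
}

/* Decode one UTF-8 sequence from s (n bytes available) into *cp.
 * Returns bytes consumed, or 0 on malformed or truncated input.
 * Minimal: overlong forms and surrogates are not rejected. */
static size_t utf8_decode(const unsigned char *s, size_t n, uint32_t *cp)
{
    size_t len, i;
    uint32_t v;

    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0)      { len = 2; v = s[0] & 0x1F; }
    else if ((s[0] & 0xF0) == 0xE0) { len = 3; v = s[0] & 0x0F; }
    else if ((s[0] & 0xF8) == 0xF0) { len = 4; v = s[0] & 0x07; }
    else return 0;
    if (n < len) return 0;
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;  /* not a continuation byte */
        v = (v << 6) | (s[i] & 0x3F);
    }
    *cp = v;
    return len;
}

/* Encode cp as UTF-8 into out (at least 4 bytes). Returns bytes written. */
static size_t utf8_encode(uint32_t cp, unsigned char *out)
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

/* Lowercase a NUL-terminated UTF-8 string into out (capacity cap bytes).
 * Returns 0 on success, -1 on malformed input or insufficient space. */
int utf8_lower(const char *in, char *out, size_t cap)
{
    const unsigned char *s = (const unsigned char *)in;
    size_t n = strlen(in), o = 0;

    if (cap == 0) return -1;
    while (n > 0) {
        uint32_t cp;
        unsigned char buf[4];
        size_t used = utf8_decode(s, n, &cp);
        size_t w;

        if (used == 0) return -1;
        w = utf8_encode(toy_tolower(cp), buf);
        if (o + w + 1 > cap) return -1;   /* keep room for the NUL */
        memcpy(out + o, buf, w);
        o += w;
        s += used;
        n -= used;
    }
    out[o] = '\0';
    return 0;
}
```

With this sketch, lowering "Test\xC3\x9C" gives "test\xC3\xBC" and
lowering "\xD0\xAF" gives "\xD1\x8F", i.e. the 0xC3BC and 0xD18F values
the reporter expected, because the mapping is applied to whole code
points rather than to individual bytes.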

------------------------------------------------------------------------

[2002-09-10 08:52:58] [EMAIL PROTECTED]

This is not a bug in PHP; it's down to whether your system can support
this and has the appropriate locales installed.

A quick and dirty example might look like this in C:

    #include <ctype.h>
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char buff[1024];

        /* use the locale from the environment (LANG / LC_ALL) */
        setlocale(LC_CTYPE, "");
        while (fgets(buff, sizeof(buff), stdin)) {
            size_t i, l = strlen(buff);
            for (i = 0; i < l; i++)
                buff[i] = toupper((unsigned char)buff[i]);
            fputs(buff, stdout);
        }
        return 0;
    }

If that little program works, your system supports this conversion.
If it doesn't, then PHP doesn't either.

------------------------------------------------------------------------

[2002-09-10 08:44:36] [EMAIL PROTECTED]

> So you didn't try it..?

Yes, I set LC_ALL/LANG to 'en_US' and tried it.

> I only tried your test script and got the expected result.
> Whatever the characters are.. I've no idea of them anyway..

I think you're confused by looking at the result of the test script
with the encoding set to 'ISO-8859-x' instead of 'UTF-8'. In that case
it looks as if some characters changed to lower/upper case. BUT they
are not UTF-8 lower/upper case characters:

1) 0xC39C changed to 0xE39C, should be 0xC3BC
2) 0xD0AF changed to 0xF0AF, should be 0xD18F

As a result we get not a UTF-8 string but garbage.
If you really want to test this issue, you should set
'default_charset=utf-8' in php.ini or set the encoding to 'UTF-8' in
your browser.

> btw. AFAIK, setting LANG / LC_ALL to UTF-8 is not correct
> way to do it..
> According to that HOWTO, it should be something like
> ru_RU.UTF-8 (and only if you really have UTF-8 locales)

I tried en_US.UTF-8, de_DE.UTF-8, ru_RU.UTF-8 - no luck.

> I'm bogusing this since it really isn't anything PHP can
> affect..

So, is there no way in PHP to convert a UTF-8 string to lower/upper
case?

------------------------------------------------------------------------

[2002-09-09 17:06:43] [EMAIL PROTECTED]

So you didn't try it..?
I only tried your test script and got the expected result.
Whatever the characters are.. I've no idea of them anyway..

btw. AFAIK, setting LANG / LC_ALL to UTF-8 is not correct
way to do it..

http://melkor.dnp.fmph.uniba.sk/~garabik/debian-utf8/howto.html

According to that HOWTO, it should be something like
ru_RU.UTF-8 (and only if you really have UTF-8 locales)

I'm bogusing this since it really isn't anything PHP can
affect..

------------------------------------------------------------------------

The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
    http://bugs.php.net/19257

-- 
Edit this bug report at http://bugs.php.net/?id=19257&edit=1
