php-i18n Digest 11 Apr 2007 22:05:22 -0000 Issue 353

Topics (messages 1056 through 1058):

Re: PHP + UTF-8 + mbstring extension issue.
        1056 by: Norbert Lindenberg
        1057 by: Anirudh Zala

PHP and AJAX
        1058 by: lazaros

----------------------------------------------------------------------
--- Begin Message ---
Hello,

mb_strlen, when run in UTF-8 mode, counts the Unicode characters in the string. ઝાલા is \u0a9d\u0abe\u0ab2\u0abe (using Java notation), i.e., 4 Unicode characters. It's actually 12 bytes in UTF-8.

It's not unusual for the user perception of a character and the definition used in the computer representation to be different. For example, users perceive zälä as 4 characters, but in Unicode the string can be represented either using precomposed forms, z\u00e4l\u00e4, or using combining marks, za\u0308la\u0308. The first representation would count as 4 (Unicode) characters, the second as 6. For Gujarati, where Unicode doesn't have precomposed forms, the problem is just visible more often than with Latin characters.
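
The counts above can be checked directly. The following is a minimal sketch in JavaScript rather than PHP; the helper names `utf8Bytes` and `codePoints` are made up for illustration. It measures the same strings in UTF-8 bytes and in Unicode code points, and shows that NFC normalization turns the combining-mark form of zälä back into the 4-character precomposed form.

```javascript
// Number of bytes the string occupies when encoded as UTF-8.
function utf8Bytes(s) {
  return new TextEncoder().encode(s).length;
}

// Number of Unicode code points (string iteration is per code point).
function codePoints(s) {
  return [...s].length;
}

var gujarati = "\u0a9d\u0abe\u0ab2\u0abe"; // ઝાલા
var precomposed = "z\u00e4l\u00e4";        // zälä, precomposed ä
var combining = "za\u0308la\u0308";        // zälä, a + combining diaeresis

console.log(utf8Bytes(gujarati), codePoints(gujarati));      // 12 4
console.log(codePoints(precomposed), codePoints(combining)); // 4 6
console.log(codePoints(combining.normalize("NFC")));         // 4
```

In PHP terms, `strlen` corresponds to the byte count and `mb_strlen` in UTF-8 mode to the code point count.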

The ICU library has character break iterators that better approximate the user perception of characters. If this is important for your project, you may want to take a look at it:
http://icu-project.org/userguide/boundaryAnalysis.html
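
As a rough illustration of what a character break iterator does, here is a sketch in JavaScript using `Intl.Segmenter`. It assumes a runtime that provides `Intl.Segmenter` (recent Node.js and browsers; in common engines it is built on ICU), and the function name `graphemeCount` is hypothetical. It counts user-perceived characters (grapheme clusters) instead of code points.

```javascript
// Count user-perceived characters (extended grapheme clusters).
// Assumes the runtime provides Intl.Segmenter.
function graphemeCount(s) {
  var segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(s)].length;
}

console.log(graphemeCount("\u0a9d\u0abe\u0ab2\u0abe")); // 2 (ઝા, લા)
console.log(graphemeCount("za\u0308la\u0308"));         // 4 (z, ä, l, ä)
console.log(graphemeCount("zala"));                     // 4
```

Later PHP releases ship an ICU binding, the intl extension, whose `grapheme_strlen()` serves the same purpose.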

Norbert


On Mar 21, 2007, at 1:37 AM, Anirudh Zala wrote:

Hello Everybody,

While building a truly multilingual project, I am running into an interesting problem with PHP 5 + UTF-8 + the mbstring functions. Please study the table below carefully. I have taken one word in three languages, English, Finnish (Finland) and Gujarati (India), to test PHP's Unicode handling of single-byte and multibyte strings using the mbstring extension.

The word on the left of the "=" sign is the actual string whose length is to be counted. What I have tried here is to count the length of the word in each language. For English and Finnish I get correct results, but for Gujarati it seems that the mbstring functions(?) are not working properly.

=======================================================
zala = 1 word; 4 bytes; 4 characters (z, a, l, a); 4 key-strokes (z, a, l,
 a); "strlen" should be 4 and is 4 also.

zälä = 1 word; 4 bytes; 4 characters (z, ä, l, ä); 4 key-strokes (z, ä, l,
 ä); "strlen" should be 4 and is 4 also.

ઝાલા = 1 word; 4 bytes; 2 characters (ઝા, લા); 4 key-strokes (ઝ, ા, લ, ા);
"strlen" should be 2 but is 4.
=======================================================

The question is why PHP is not able to count the length of the given string in a practical way. I am aware that current PHP versions are not aware of strings as such; they just deal with bytes. In that case the output is correct, but this is not a practical solution, as the length of the word in Gujarati is only "2" and not "4", even if it requires 4 bytes to store the data. (In Indic languages we have primary characters like "ઝ" and secondary characters like "ા", and secondary characters should not be counted when calculating length.)

I am sure that I am not missing any setting at the server, PHP, or client level needed to make this work correctly. English and Finnish are different languages, but they share the same character set (i.e. Latin) and their glyphs are the same, while Gujarati has a different character set and its glyphs are different. But this should not create this problem if the "mbstring functions" are capable of handling strings properly.

I have tested the same thing using the "iconv" extension, with the same results. It looks like this is the behavior of PHP + the mb_* functions.

Thanks,
Anirudh Zala

--
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php


-------------------------------------
Norbert Lindenberg
Yahoo! Internationalization Architect

--- End Message ---
--- Begin Message ---
On Wednesday 21 March 2007 23:54, you wrote:
> Hello,
>
> mb_strlen, when run in UTF-8 mode, counts the Unicode characters in
> the string. ઝાલા is \u0a9d\u0abe\u0ab2\u0abe (using Java
> notation), i.e., 4 Unicode characters. It's actually 12 bytes in UTF-8.
>
> It's not unusual for the user perception of a character and the
> definition used in the computer representation to be different. For
> example, users perceive zälä as 4 characters, but in Unicode the
> string can be represented either using precomposed forms,
> z\u00e4l\u00e4, or using combining marks, za\u0308la\u0308. The first
> representation would count as 4 (Unicode) characters, the second as
> 6. For Gujarati, where Unicode doesn't have precomposed forms, the
> problem is just visible more often than with Latin characters.

It seems that characters of Gujarati and other Indic languages are not in
precomposed forms, unlike Latin characters. But then the question is, why
aren't they in precomposed forms? I am sorry if you are not the right person
to ask.

As you may know, Indic languages have separate tables for vowels and
consonants, unlike English. Hence when a vowel is used with a consonant, the
vowel should not be counted when calculating the length of the string.
So "ઝ" should be 1 and "ઝા" should also be 1, even though they require 1 and 2
characters in Unicode respectively.

To cope with this problem, strings would have to be represented in
precomposed forms. We have 11 vowels and almost 60 consonants, hence there
could be roughly 11 x 60 = 660 precomposed forms. And I assume that, to save
space in Unicode, vowels and consonants are stored separately.

And I wonder why nobody has faced this problem until now. When PHP 6
arrives, how is it going to deal with this situation? If the problem lies at
the Unicode level, then I assume PHP can't do much.

>
> The ICU library has character break iterators that better approximate
> the user perception of characters. If this is important for your
> project, you may want to take a look at it:
> http://icu-project.org/userguide/boundaryAnalysis.html

Thanks for this suggestion, but this library is in C/C++ and Java, hence it
can't be used easily with PHP. I suggest that such a library be provided as
an extension for PHP and other scripting languages.

Moreover, this solves the problem only at the string comparison level. There
are higher-level problems too, when storing strings in a database. For
example, if the length of a field (for MySQL specifically) is 12 characters,
then there is no problem storing such a string of Latin characters, but for
Indic languages, if the user enters 7 consonants with 7 vowels, then even
though the perceived string length is just 7, the last 2 characters will get
truncated.
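
The arithmetic behind this example can be sketched as follows, in JavaScript and under the assumption that the column limit is enforced per Unicode code point. The repeated syllable ઝા and the helper `clusters` are illustrative only, and `Intl.Segmenter` is assumed to be available in the runtime.

```javascript
// 7 syllables, each one consonant (U+0A9D) plus one vowel sign (U+0ABE).
var word = "\u0a9d\u0abe".repeat(7);

// Count user-perceived characters (grapheme clusters).
function clusters(s) {
  var segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(s)].length;
}

// Simulate a 12-character column that truncates by code point (assumption).
var codePointsInWord = [...word];
var stored = codePointsInWord.slice(0, 12).join("");

console.log(codePointsInWord.length); // 14 code points
console.log(clusters(word));          // 7 perceived characters
console.log(clusters(stored));        // 6 perceived characters survive
```

Under that assumption, the user loses one whole perceived character even though only 7 were typed into a 12-character field.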

There could be other areas where this problem is even more severe.

Anirudh Zala

--- End Message ---
--- Begin Message ---
I have this code that sends some information to a PHP page, which searches a
database and returns the results. The code is:

//showCustomer.js
...
  
var url="showCustomer.php?flag="+flag
url=url+"&initial="+str
url=url+"&customer_radio="+customer_radio
url=url+"&is_prospect="+is_prospect

xmlHttp.onreadystatechange=stateChanged 
xmlHttp.open("GET",url,true)
xmlHttp.send(null)
...

//The php page(showCustomer.php)
...
$initial = $_GET['initial'];
...
$show_customer_query = mysql_query("SELECT * FROM customer WHERE
customer_name LIKE '$initial%' ORDER BY customer_name");
...

In Firefox it works perfectly with both English and Greek.
But in IE6 and IE7 it works only with English. The $initial variable is not
passed to showCustomer.php properly, so the query fails. If I pass a letter
such as α (like a) or σ (like s), $initial comes out as something like � or a
square. This happens with all the Greek letters. What is the problem? I don't
mind working only with Firefox (it is the only browser I use), but what about
the other users?
Everything is UTF-8, and the problem appears when I send the parameters
through GET.
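
A common cause of this symptom, stated here as an assumption and not a confirmed diagnosis, is that the query string is built from raw non-ASCII characters, which browsers may encode differently before sending. Below is a minimal sketch of percent-encoding each parameter as UTF-8 with `encodeURIComponent` before the URL is passed to `xmlHttp.open`. The helper `buildUrl` and the sample parameter values are hypothetical.

```javascript
// Build a query string with every key and value percent-encoded as UTF-8.
function buildUrl(base, params) {
  var query = Object.keys(params)
    .map(function (key) {
      return encodeURIComponent(key) + "=" + encodeURIComponent(params[key]);
    })
    .join("&");
  return base + "?" + query;
}

// Sample values are made up for illustration.
var url = buildUrl("showCustomer.php", {
  flag: 1,
  initial: "α",
  customer_radio: "all",
  is_prospect: 0
});

console.log(url);
// showCustomer.php?flag=1&initial=%CE%B1&customer_radio=all&is_prospect=0
```

PHP percent-decodes `$_GET` values automatically, so with this encoding `$_GET['initial']` would hold the UTF-8 bytes of the letter whichever browser sent the request.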

-- 
View this message in context: 
http://www.nabble.com/PHP-and-AJAX-tf3562272.html#a9949352
Sent from the Php - Internationalization (i18n) mailing list archive at 
Nabble.com.

--- End Message ---
