php-i18n Digest 11 Apr 2007 22:05:22 -0000 Issue 353

php-i18n-digest-help Wed, 11 Apr 2007 15:05:41 -0700

Topics (messages 1056 through 1058):


Re: PHP + UTF-8 + mbstring extension issue.
        1056 by: Norbert Lindenberg
        1057 by: Anirudh Zala

PHP and AJAX
        1058 by: lazaros

Administrivia:

To subscribe to the digest, e-mail:
        [EMAIL PROTECTED]

To unsubscribe from the digest, e-mail:
        [EMAIL PROTECTED]

To post to the list, e-mail:
        [email protected]


----------------------------------------------------------------------

--- Begin Message ---
Hello,
mb_strlen, when run in UTF-8 mode, counts the Unicode characters in the string. ઝાલા is \u0a9d\u0abe\u0ab2\u0abe (using Java notation), i.e., 4 Unicode characters. It's actually 12 bytes in UTF-8.

It's not unusual for the user perception of a character and the definition used in the computer representation to be different. For example, users perceive zälä as 4 characters, but in Unicode the string can be represented either using precomposed forms, z\u00e4l\u00e4, or using combining marks, za\u0308la\u0308. The first representation would count as 4 (Unicode) characters, the second as 6. For Gujarati, where Unicode doesn't have precomposed forms, the problem is just visible more often than with Latin characters.

The ICU library has character break iterators that better approximate the user perception of characters. If this is important for your project, you may want to take a look at it:
http://icu-project.org/userguide/boundaryAnalysis.html
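
The three counts discussed above (UTF-8 bytes, Unicode characters, user-perceived characters) can be checked with a small sketch. It is written in JavaScript, the language of the showCustomer.js snippet later in this digest, not PHP, and it assumes a runtime that provides TextEncoder and Intl.Segmenter (for example a recent Node.js). The name countUnits is made up for illustration; this is not the mbstring or ICU API itself, only an approximation of the same idea.

```javascript
// Hypothetical helper (not part of any library): counts a string three ways.
function countUnits(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return {
    // Number of bytes in the UTF-8 encoding.
    utf8Bytes: new TextEncoder().encode(text).length,
    // Number of Unicode code points (iterating a string yields code points).
    codePoints: Array.from(text).length,
    // Number of grapheme clusters, an approximation of user-perceived characters.
    graphemes: Array.from(segmenter.segment(text)).length,
  };
}

const gujarati = "\u0a9d\u0abe\u0ab2\u0abe"; // ઝાલા
const precomposed = "z\u00e4l\u00e4";        // zälä, precomposed forms
const combining = "za\u0308la\u0308";        // zälä, combining marks

console.log(countUnits(gujarati));    // 12 bytes, 4 code points, 2 graphemes
console.log(countUnits(precomposed)); // 6 bytes, 4 code points, 4 graphemes
console.log(countUnits(combining));   // 8 bytes, 6 code points, 4 graphemes
```

Only the grapheme count matches what a reader would call the length in all three cases, which is the gap the break iterators are meant to close.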

Norbert


On Mar 21, 2007, at 1:37 AM, Anirudh Zala wrote:
Hello Everybody,
While building a truly multilingual project, I am running into an interesting problem with php5 + utf-8 + mbstring functions. Please study the table below carefully. I have taken 1 word in 3 different languages, English, Finnish (of Finland) and Gujarati (of India), to test PHP's Unicode character set handling with single-byte and multibyte strings using the mbstring extension.
The word appearing on the left of the "=" sign is the actual string whose length is to be counted. What I have tried here is to count the length of the word in each language. For English and Finnish I got correct results, but for Gujarati it seems that the mbstring functions(?) are not working properly.

=======================================================
zala = 1 word; 4 bytes; 4 characters (z, a, l, a); 4 key-strokes (z, a, l, a); "strlen" should be 4 and is 4 also.
zälä = 1 word; 4 bytes; 4 characters (z, ä, l, ä); 4 key-strokes (z, ä, l, ä); "strlen" should be 4 and is 4 also.
ઝાલા = 1 word; 4 bytes; 2 characters (ઝા, લા); 4 key-strokes (ઝ, ા, લ, ા); "strlen" should be 2 but is 4.
=======================================================
The question is why PHP is not able to count the length of the given string in a practical way. I am aware that current PHP versions are not aware of strings; instead they just deal with bytes. In that case the output is correct, but this is not a practical solution, as the length of the word in Gujarati is only "2" and not "4", even if it requires 4 bytes to store the data. (In Indic languages we have primary characters like "ઝ" and secondary characters like "ા", but secondary characters should not be counted while calculating length.)
I am sure that I am not missing any settings at the server, PHP or client level to make this work correctly. English and Finnish are different languages but they are part of the same character set (i.e. Latin) and their glyphs are also the same, while Gujarati has a different character set and its glyphs are also different. But this should not create this problem if the "mbstring functions" are capable of handling strings properly.
I have tested the same thing using the "iconv" extension, with the same results. It looks like this is the behavior of php + mb_* functions.

Thanks,
Anirudh Zala

--
PHP Unicode & I18N Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php
-------------------------------------
Norbert Lindenberg
Yahoo! Internationalization Architect
--- End Message ---

--- Begin Message ---

On Wednesday 21 March 2007 23:54, you wrote:
> Hello,
>
> mb_strlen, when run in UTF-8 mode, counts the Unicode characters in
> the string. ઝાલા is \u0a9d\u0abe\u0ab2\u0abe (using Java
> notation), i.e., 4 Unicode characters. It's actually 12 bytes in UTF-8.
>
> It's not unusual for the user perception of a character and the
> definition used in the computer representation to be different. For
> example, users perceive zälä as 4 characters, but in Unicode the
> string can be represented either using precomposed forms,
> z\u00e4l\u00e4, or using combining marks, za\u0308la\u0308. The first
> representation would count as 4 (Unicode) characters, the second as
> 6. For Gujarati, where Unicode doesn't have precomposed forms, the
> problem is just visible more often than with Latin characters.

It seems that the characters of Gujarati and other Indic languages are not in 
precomposed forms, unlike Latin characters. But then the question is: why 
aren't they in precomposed forms? I am sorry if you are not the right person 
to ask.

As you may know, Indic languages, unlike English, have separate tables for 
vowels and consonants. Hence when a vowel is used with a consonant, the vowel 
should not be counted while calculating the length of the string. So "ઝ" 
should be 1 and "ઝા" should also be 1, even though they require 1 and 2 
characters in Unicode respectively.

To cope with this problem, strings would have to be represented in precomposed 
forms. We have 11 vowels and almost 60 consonants, hence there could be 
(11 x 60) about 660 precomposed forms. I assume that, to save space in 
Unicode, vowels and consonants are stored separately instead.

And I wonder why nobody has faced this problem until now. When PHP6 arrives, 
how is it going to deal with this situation? If the problem lies at the 
Unicode level, then I assume PHP can't do much.

>
> The ICU library has character break iterators that better approximate
> the user perception of characters. If this is important for your
> project, you may want to take a look at it:
> http://icu-project.org/userguide/boundaryAnalysis.html

Thanks for this suggestion, but this library is in C/C++ and Java and hence 
can't be used easily with PHP. I suggest that such a library be provided as an 
extension for PHP and other scripting languages.

Moreover, this solves the problem only at the string comparison level. There 
are also higher-level problems when storing strings in a database. For 
example, if the length of a field (for MySQL specifically) is 12 characters, 
then there is no problem storing a string of Latin characters, but for Indic 
languages, if the user enters 7 consonants with 7 vowels, then even though the 
perceived string length is just 7, the last 2 characters will get truncated.

There could be more areas where this problem is even more severe.
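
The arithmetic in this scenario can be sketched as follows. This is an illustration in JavaScript and not a model of MySQL itself: the 12-character limit is taken from the example above, the test word and the helper truncateByGrapheme are hypothetical, and Intl.Segmenter is assumed to be available. How a real database reacts to an over-long value (truncation or an error) depends on its configuration.

```javascript
// Hypothetical test word: 7 consonant + vowel-sign pairs ("ઝા" repeated 7 times).
const word = "\u0a9d\u0abe".repeat(7);
const limit = 12; // assumed field length, counted in Unicode characters

const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
const graphemeCount = (text) => Array.from(segmenter.segment(text)).length;

// Naive truncation: keep the first `limit` code points.
const truncated = Array.from(word).slice(0, limit).join("");

// Hypothetical helper: keep whole grapheme clusters while they fit in the limit,
// so that a consonant is never separated from its vowel sign.
function truncateByGrapheme(text, maxCodePoints) {
  let result = "";
  let used = 0;
  for (const { segment } of segmenter.segment(text)) {
    const size = Array.from(segment).length;
    if (used + size > maxCodePoints) break;
    result += segment;
    used += size;
  }
  return result;
}

console.log(Array.from(word).length);   // 14 code points
console.log(graphemeCount(word));       // 7 perceived characters
console.log(graphemeCount(truncated));  // 6 perceived characters survive
console.log(Array.from(truncateByGrapheme(word, 11)).length); // 10 code points
```

With an odd limit such as 11, cutting by code point would leave a consonant without its vowel sign, which is why the helper stops at the last whole cluster.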

Anirudh Zala

--- End Message ---

--- Begin Message ---

I have this code that sends some information to a PHP page that searches a
database and returns the results. The code is:

//showCustomer.js
...
  
var url="showCustomer.php?flag="+flag
url=url+"&initial="+str
url=url+"&customer_radio="+customer_radio
url=url+"&is_prospect="+is_prospect

xmlHttp.onreadystatechange=stateChanged 
xmlHttp.open("GET",url,true)
xmlHttp.send(null)
...

//The php page(showCustomer.php)
...
$initial = $_GET['initial'];
...
$show_customer_query = mysql_query("SELECT * FROM customer WHERE
customer_name LIKE '$initial%' ORDER BY customer_name");
...

In Firefox it works perfectly with both English and Greek.
But in IE6 and IE7 it works only with English. The $initial variable is not
passed to showCustomer.php properly, so the query fails. If I pass a letter
such as α (like a) or σ (like s), $initial becomes something like � or a
square. This happens with all the Greek letters. What is the problem? I don't
mind working only with Firefox (I am working only with it), but what about
the other users?
Everything is UTF-8, and the problem appears when I send the parameters
through GET.
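
One common cause of this symptom is a query string built from raw, unencoded text, which leaves each browser free to choose how non-ASCII characters are sent. A minimal sketch, assuming that is the cause here, percent-encodes every value as UTF-8 with encodeURIComponent before concatenating. buildUrl is a hypothetical helper, and the parameter values below are placeholders; only the parameter names come from the snippet above.

```javascript
// Hypothetical helper: builds a GET URL with each name and value
// percent-encoded as UTF-8 via encodeURIComponent.
function buildUrl(base, params) {
  const query = Object.keys(params)
    .map((key) => encodeURIComponent(key) + "=" + encodeURIComponent(params[key]))
    .join("&");
  return base + "?" + query;
}

// Placeholder values for illustration only.
const url = buildUrl("showCustomer.php", {
  flag: 1,
  initial: "α",
  customer_radio: "all",
  is_prospect: 0,
});

console.log(url);
// showCustomer.php?flag=1&initial=%CE%B1&customer_radio=all&is_prospect=0
```

The resulting URL contains only ASCII, so it no longer depends on how the browser encodes Greek letters. On the PHP side, $_GET values are already percent-decoded, so $initial would then hold the UTF-8 bytes of the letter.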

-- 
View this message in context: 
http://www.nabble.com/PHP-and-AJAX-tf3562272.html#a9949352
Sent from the Php - Internationalization (i18n) mailing list archive at 
Nabble.com.

--- End Message ---
