#47480 [Com]: preg_replace with /i is not case insensitive

2009-03-09 Thread mmcnickle at gmail dot com
 ID:   47480
 Comment by:   mmcnickle at gmail dot com
 Reported By:  sehh at ionos dot gr
 Status:   Open
 Bug Type: PCRE related
 Operating System: Linux
 PHP Version:  5.2.8
 New Comment:

The test case is wrong and the bug should be closed. The uppercase
search target is misspelled.

$target1 = "ÊÉÍÇÔÇÑÁ";
$target2 = "êéíçôÞñá";
should read
$target1 = "ÊÉÍÇÔÞÑÁ";
$target2 = "êéíçôÞñá";

(Note the replacement of the second Ç with a capital thorn, U+00DE.)

With this change I get the expected result:

Actual Result
-

Searching for: ÊÉÍÇÔÞÑÁ
Result string: Ôï êõñßùò ôìÞìá ôïõ itworks, áõôü ðïõ ðåñéëáìâÜíåé ôïõò
êõëßíäñïõò
Found and replaced: 1

Searching for: êéíçôþñá
Result string: Ôï êõñßùò ôìÞìá ôïõ itworks, áõôü ðïõ ðåñéëáìâÜíåé ôïõò
êõëßíäñïõò
Found and replaced: 1


Previous Comments:


[2009-02-23 13:32:39] sehh at ionos dot gr

Description:

preg_replace with the /i (case-insensitive) modifier does not perform a
case-insensitive search for UTF-8 Greek characters, although it works
fine for English characters.


Reproduce code:
---
<?php
$string = "Ôï êõñßùò ôìÞìá ôïõ êéíçôÞñá, áõôü ðïõ ðåñéëáìâÜíåé ôïõò êõëßíäñïõò"; // UTF-8 string in Greek language
$target1 = "ÊÉÍÇÔÇÑÁ"; // Target string to search for (capitalized)
$target2 = "êéíçôÞñá"; // Target string to search for (small letters)
$replace = "itworks"; // Replace with this string

// Execute search for target1 and replace
$rc = preg_replace("/$target1/imsUu", $replace, $string, -1, $counter);

echo "\nSearching for: ".$target1."\n"; // Report output
echo "Result string: ".$rc."\n";
echo "Found and replaced: ".$counter."\n";

// Execute search for target2 and replace
$rc = preg_replace("/$target2/imsUu", $replace, $string, -1, $counter);

echo "\nSearching for: ".$target2."\n"; // Report output
echo "Result string: ".$rc."\n";
echo "Found and replaced: ".$counter."\n\n";
?>

Expected result:

I expect both "Found and replaced" counts to be 1, since the expression
is case-insensitive.

Actual result:
--
$ php -f test.php 

Searching for: ÊÉÍÇÔÇÑÁ
Result string: Ôï êõñßùò ôìÞìá ôïõ êéíçôÞñá, áõôü ðïõ ðåñéëáìâÜíåé ôïõò
êõëßíäñïõò
Found and replaced: 0

Searching for: êéíçôÞñá
Result string: Ôï êõñßùò ôìÞìá ôïõ itworks, áõôü ðïõ ðåñéëáìâÜíåé ôïõò
êõëßíäñïõò
Found and replaced: 1






-- 
Edit this bug report at http://bugs.php.net/?id=47480&edit=1



#47480 [Com]: preg_replace with /i is not case insensitive

2009-03-09 Thread mmcnickle at gmail dot com
 ID:   47480
 Comment by:   mmcnickle at gmail dot com
 Reported By:  sehh at ionos dot gr
 Status:   Open
 Bug Type: PCRE related
 Operating System: Linux
 PHP Version:  5.2.8
 New Comment:

You're absolutely correct, I do not speak Greek. But neither does the
PCRE library. It determines the uppercase/lowercase relationship between
characters solely using Unicode properties.

The lowercase of Ç is defined in Unicode as ç [1], not Þ. Therefore the
case-insensitive search will not match.

[1]http://www.fileformat.info/info/unicode/char/00c7/index.htm
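
As a rough illustration of the point above, the following sketch checks a few character pairs with preg_match and the /iu modifiers. It assumes PHP 7.0 or later (for the "\u{...}" escapes, which are newer than the PHP 5.2.8 in this report) and a PCRE build with Unicode property support. The helper name caselessEqual() is hypothetical, and the Greek code points in the second half are an assumption about the letters the report intended, not something stated in the thread.

```php
<?php
// Assumes PHP >= 7.0 and PCRE built with Unicode property support.
// Returns true when $a, taken as a literal pattern, matches all of $b caselessly.
function caselessEqual(string $a, string $b): bool
{
    return preg_match('/^' . preg_quote($a, '/') . '$/iu', $b) === 1;
}

// Latin-1 range: U+00C7 (Ç) pairs with U+00E7 (ç), not with U+00DE (Þ).
var_dump(caselessEqual("\u{00C7}", "\u{00E7}")); // bool(true)
var_dump(caselessEqual("\u{00C7}", "\u{00DE}")); // bool(false)
// U+00DE (Þ) pairs with U+00FE (þ).
var_dump(caselessEqual("\u{00DE}", "\u{00FE}")); // bool(true)

// Greek block (assumed intent of the report): U+0397 (Η) pairs with U+03B7 (η),
// but not with the accented U+03AE (ή), whose uppercase is U+0389 (Ή).
var_dump(caselessEqual("\u{0397}", "\u{03B7}")); // bool(true)
var_dump(caselessEqual("\u{0397}", "\u{03AE}")); // bool(false)
var_dump(caselessEqual("\u{0389}", "\u{03AE}")); // bool(true)
```

Under these assumptions the engine treats accented and unaccented letters as distinct characters in either script, which is consistent with the zero-match result in the original report.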


Previous Comments:


[2009-03-09 12:16:43] sehh at ionos dot gr

Obviously you have no idea what you are talking about and obviously you
don't speak Greek or know anything about the Greek language.

The word êéíçôÞñá is capitalized as ÊÉÍÇÔÇÑÁ.

What you are suggesting is like capitalizing the word engine as
ENGiNE.

Obviously, there is no word ENGiNE, same way there is no word
ÊÉÍÇÔÞÑÁ :)






#47480 [Com]: preg_replace with /i is not case insensitive

2009-03-09 Thread mmcnickle at gmail dot com
 ID:   47480
 Comment by:   mmcnickle at gmail dot com
 Reported By:  sehh at ionos dot gr
 Status:   Open
 Bug Type: PCRE related
 Operating System: Linux
 PHP Version:  5.2.8
 New Comment:

Yes, unfortunately, trying to include locale- and language-specific
cases is next to impossible for regular expression engine developers.

The best that can be done, though far from ideal, is for the user to
take these variations into account when crafting the regex:

$target1 = "ÊÉÍÇÔ[ÇÞ]ÑÁ"; // Greek

$target1 = "Stra(?:ss|ß)enbahn"; // German


Previous Comments:


[2009-03-09 15:00:25] sehh at ionos dot gr

I forgot the capital accented characters, so the above should read:

Ç == Þ == ç == ¹
Á == Ü == á == ¶
etc..

Remember that in Greek, the accent may be omitted from capital letters
or may be included for the first letter only. So that should produce
proper case-insensitive results.
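
One way to express the equivalences listed above in user code, sketched here under assumptions rather than offered as anything PCRE or PHP provides: fold both strings to an unaccented uppercase form before comparing them. The sketch assumes PHP 7.0 or later with the mbstring extension, assumes the letters in this thread are Greek eta and alpha with and without tonos, and uses a hypothetical helper named foldGreek().

```php
<?php
// Assumes PHP >= 7.0 with the mbstring extension; foldGreek() is a hypothetical helper.

// Maps tonos-accented Greek vowels (both cases) to their unaccented forms,
// then uppercases, so that Η, η, ή and Ή all fold to the same letter.
function foldGreek(string $s): string
{
    static $map = [
        "\u{0386}" => "\u{0391}", "\u{03AC}" => "\u{03B1}", // Ά ά -> Α α
        "\u{0388}" => "\u{0395}", "\u{03AD}" => "\u{03B5}", // Έ έ -> Ε ε
        "\u{0389}" => "\u{0397}", "\u{03AE}" => "\u{03B7}", // Ή ή -> Η η
        "\u{038A}" => "\u{0399}", "\u{03AF}" => "\u{03B9}", // Ί ί -> Ι ι
        "\u{038C}" => "\u{039F}", "\u{03CC}" => "\u{03BF}", // Ό ό -> Ο ο
        "\u{038E}" => "\u{03A5}", "\u{03CD}" => "\u{03C5}", // Ύ ύ -> Υ υ
        "\u{038F}" => "\u{03A9}", "\u{03CE}" => "\u{03C9}", // Ώ ώ -> Ω ω
    ];
    return mb_strtoupper(strtr($s, $map), 'UTF-8');
}

$lower = "\u{03BA}\u{03B9}\u{03BD}\u{03B7}\u{03C4}\u{03AE}\u{03C1}\u{03B1}"; // κινητήρα
$upper = "\u{039A}\u{0399}\u{039D}\u{0397}\u{03A4}\u{0397}\u{03A1}\u{0391}"; // ΚΙΝΗΤΗΡΑ

// Plain /iu matching does not bridge the accent difference.
var_dump(preg_match('/' . $upper . '/iu', $lower)); // int(0)

// After folding, both spellings compare equal.
var_dump(foldGreek($lower) === foldGreek($upper)); // bool(true)

// Containment test on folded copies of a longer text.
$text = 'abc ' . $lower . ' xyz';
var_dump(mb_strpos(foldGreek($text), foldGreek($upper), 0, 'UTF-8') !== false); // bool(true)
```

The map covers only the seven tonos vowels; letters with dialytika or polytonic accents would need further entries.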



[2009-03-09 14:54:32] sehh at ionos dot gr

The PCRE library is wrong then.

Ç is correctly defined in Unicode as ç, but the library should also
understand the meaning of Ç == Þ == ç.

This applies to all Greek accents:

Á == Ü == á
etc...

Otherwise, the /i modifier is useless for the Greek language, and
that's why the current implementation does not work for Greek.

Thank you for taking the time to look into this issue, much
appreciated.




The remainder of the comments for this report are too long. To view
the rest of the comments, please view the bug report online at
http://bugs.php.net/47480




#47480 [Com]: preg_replace with /i is not case insensitive

2009-03-09 Thread mmcnickle at gmail dot com
 ID:   47480
 Comment by:   mmcnickle at gmail dot com
 Reported By:  sehh at ionos dot gr
 Status:   Open
 Bug Type: PCRE related
 Operating System: Linux
 PHP Version:  5.2.8
 New Comment:

It wouldn't be impossible, no. But to someone without detailed
knowledge of Greek it would be. The unicode.org article on regular
expressions [1] has this to say:

"All of the above deals with a default specification for a regular
expression. However, a regular expression engine also may want to
support tailored specifications, typically tailored for a particular
language or locale. This may be important when the regular expression
engine is being used by end-users instead of programmers, such as in a
word-processor allowing some level of regular expressions in
searching."

Earlier in the document, it explains that basic regex engines are only
required to support the default Unicode uppercase/lowercase matching.

Looking through the source code of the PCRE library, it does seem
possible to generate locale-specific character tables; this may be an
avenue to look into.

Perhaps the best thing to do would be to drop a message in the
internationalization mailing list (http://marc.info/?l=php-i18n) and see
what they have to say.

[1] http://unicode.org/reports/tr18/#Tailored_Support


Previous Comments:


[2009-03-09 16:01:59] sehh at ionos dot gr

Indeed, that's far from ideal. From my development point of view, it's
impossible to rewrite every single accented character with its possible
equivalents for the entire string, for every string in the regex.

For example, this:
/Âáëâßäåò åéóáãùãÞò-åîáãùãÞò/i

Would become a monster like this:
/Âáëâ[Éßº]ä[Åå¸]ò åéóáãùã[ÇÞ¹]ò-åîáãùã[ÇÞ¹]ò/i

We would need a regex to create the regex! Or at least a text
search/replace method in PHP.

Are you sure it's impossible to add a few exceptions within the PCRE
library?
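
The "regex to create the regex" idea above can be sketched in a few lines of PHP. This is only an illustration under assumptions, not an existing PHP or PCRE facility: it assumes PHP 7.0 or later with mbstring, assumes the text is Greek with tonos accents, and uses a hypothetical helper named buildGreekPattern(). Each vowel is replaced by a character class holding its unaccented and accented lowercase forms, and the /iu modifiers then cover the uppercase forms.

```php
<?php
// Assumes PHP >= 7.0 with mbstring; buildGreekPattern() is a hypothetical helper.
function buildGreekPattern(string $text): string
{
    // Unaccented and tonos-accented lowercase forms of each vowel.
    $groups = [
        "\u{03B1}\u{03AC}", // α ά
        "\u{03B5}\u{03AD}", // ε έ
        "\u{03B7}\u{03AE}", // η ή
        "\u{03B9}\u{03AF}", // ι ί
        "\u{03BF}\u{03CC}", // ο ό
        "\u{03C5}\u{03CD}", // υ ύ
        "\u{03C9}\u{03CE}", // ω ώ
    ];

    // Look-up table: any member of a group maps to the class for the whole group.
    $classFor = [];
    foreach ($groups as $group) {
        foreach (preg_split('//u', $group, -1, PREG_SPLIT_NO_EMPTY) as $char) {
            $classFor[$char] = '[' . $group . ']';
        }
    }

    // Lowercase first so that uppercase vowels hit the table as well.
    $pattern = '';
    $lower = mb_strtolower($text, 'UTF-8');
    foreach (preg_split('//u', $lower, -1, PREG_SPLIT_NO_EMPTY) as $char) {
        $pattern .= $classFor[$char] ?? preg_quote($char, '/');
    }
    return '/' . $pattern . '/iu';
}

$lower = "\u{03BA}\u{03B9}\u{03BD}\u{03B7}\u{03C4}\u{03AE}\u{03C1}\u{03B1}"; // κινητήρα
$upper = "\u{039A}\u{0399}\u{039D}\u{0397}\u{03A4}\u{0397}\u{03A1}\u{0391}"; // ΚΙΝΗΤΗΡΑ

$result = preg_replace(buildGreekPattern($upper), 'itworks', 'x ' . $lower . ' y', -1, $count);
var_dump($result); // string(11) "x itworks y"
var_dump($count);  // int(1)
```

The caller keeps writing ordinary search terms and the helper expands them, so the hand-written "monster" pattern never has to appear in application code.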


