php-i18n Digest 19 Jan 2005 07:19:49 -0000 Issue 271

Topics (messages 833 through 834):

mb_ereg_replace problem with non a-z characters
        833 by: McAjvar
        834 by: Moriyoshi Koizumi

----------------------------------------------------------------------
--- Begin Message ---
hi!

i hope this is the right group for my question, if i got it wrong, please accept my apologies and tell me where to go next.

i have a form which needs to accept strings with specific central european, japanese, russian, etc. characters. the site accepts utf-8, all headers are sent for utf-8 and on windows, where i'm testing it, all works ok. eg. i enter something in japanese, korean, russian, slovene, etc. and if i put everything through this:

ereg_replace ("[^[:alnum:]\n]", "", $t)

all japanese, russian, etc. characters remain in the string, as i'd like them to.

but when i test this same function on dragonflybsd or freebsd, all those characters get stripped out.

php.ini settings, mbstring section has the overload setting set to overload everything, so php should basically be using mb_ereg_replace. just to be sure, i tried that function specifically.

my test environments are:
win2k, sp4, apache 2.0.50, php 4.3.8 (here, it works ok)
freebsd 5.2.1, fully patched, apache-1.3.31_4, mod_php4-4.3.8_2,1 (doesn't work)
dragonflybsd, latest version, today freshly installed apache and php from ports, also doesn't work


could someone please help me, point me in the right direction, tell me what i am doing wrong or just explain why such a difference on windows and the bsd platforms? i couldn't test it on a linux platform since i don't have access to one. am i expecting too much from ereg* functions if i want them to tell me if some foreign characters are valid in some alphabet or should i take a different approach?

thank you for any information!

i have mbstring extension installed and here are my php.ini mbstring settings:
[mbstring]
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.http_input = UTF-8
mbstring.http_output = UTF-8
mbstring.encoding_translation = On
mbstring.substitute_character = _
mbstring.func_overload = 7


this is the (very simple) script i am using for testing:
___________________________
<?php

header("content-type:text/html; charset=utf-8");

unset ($t);
if (isset ($_POST['t']))
{
        $t = $_POST['t'];
}

?>
<?xml version="1.0" encoding="utf-8"?>
<html>
<head>
<title>mad scientists not a work</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>

<body>
<div align="center">
<?php

if (isset ($t))
{
echo "<pre>Got this:\n" . $t . "\n</pre>";
echo "<hr><br><pre>put through preg_replace:\n" . preg_replace ("/[^[:alnum:]\n]/", "", $t) . "\n</pre>";
echo "<hr><br><pre>put through ereg_replace:\n" . ereg_replace ("[^[:alnum:]\n]", "", $t) . "\n</pre>";
}


?>

<form action="<?php echo $_SERVER["PHP_SELF"]; ?>" method="post">
<textarea name="t" rows="7" cols="60"><?php if (isset ($t)) echo $t; ?></textarea>
<br />
<input type="submit">
</form>
</div>
</body>


</html>
___________________________
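
A side note on the test script: its preg_replace() call has no /u modifier, so PCRE reads the input as single bytes rather than as UTF-8 characters. What follows is a minimal sketch of a character-aware variant, not the poster's code. It assumes a PCRE build with UTF-8 and Unicode property support (older PHP 4 builds may lack the latter), and keep_letters_and_digits() is a hypothetical helper name. The classes \p{L} and \p{N} are also not an exact equivalent of [:alnum:]; combining marks, for instance, are removed.

```php
<?php
// Hypothetical helper: keep Unicode letters, Unicode numbers and newlines,
// and remove everything else.
function keep_letters_and_digits($text)
{
    // \p{L} = any letter, \p{N} = any number.
    // The /u modifier makes PCRE treat pattern and subject as UTF-8.
    // preg_replace() returns NULL if the subject is not valid UTF-8.
    return preg_replace('/[^\p{L}\p{N}\n]/u', '', $text);
}

echo keep_letters_and_digits("日本語 abc-123!"), "\n";
```

With valid UTF-8 input this should leave the Japanese letters in place and drop the space and the punctuation.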

regards,
McAjvar

--- End Message ---
--- Begin Message ---
Hi,

On 2005/01/18, at 0:55, McAjvar wrote:

i hope this is the right group for my question, if i got it wrong, please accept my apologies and tell me where to go next.

I think this is the right list, although it isn't really active :)

ereg_replace ("[^[:alnum:]\n]", "", $t)

all japanese, russian, etc. characters remain in the string, as i'd like them to.

but when i test this same function on dragonflybsd or freebsd, all those characters get stripped out.

php.ini settings, mbstring section has the overload setting set to overload everything, so php should basically be using mb_ereg_replace. just to be sure, i tried that function specifically.

There are three possibilities:

a. Function overloading doesn't work at all on those platforms for
   an unknown reason.

   Reportedly the plain ereg_replace() malfunctions with UTF-8.

b. You are running FreeBSD on a 64 bit architecture and
   mb_ereg_replace() doesn't behave well with it.

   Recently some 64bit related bugs were addressed in mbstring,
   and there are probably ones missed still.

c. BSD's libc implementation is borked and improperly handles
   the argument of isalnum().
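
Possibility (a) can be checked from inside PHP. The sketch below is one way to do it, not something taken from the thread. The helper names regex_overload_enabled() and mbstring_overload_report() are hypothetical. The only fact it relies on is that the value 4 in mbstring.func_overload is the flag that maps the ereg*() functions to their mb_ereg*() counterparts.

```php
<?php
// Returns true when the regex flag (value 4) is set in a
// mbstring.func_overload value such as "7".
function regex_overload_enabled($value)
{
    return ((int) $value & 4) !== 0;
}

// Hypothetical helper: collect what this PHP build reports about mbstring.
function mbstring_overload_report()
{
    // ini_get() returns false when the setting does not exist in this build.
    $raw = ini_get('mbstring.func_overload');

    return array(
        'mbstring_loaded'        => extension_loaded('mbstring'),
        'mb_ereg_replace_exists' => function_exists('mb_ereg_replace'),
        'func_overload'          => $raw,
        'regex_overloaded'       => regex_overload_enabled($raw),
    );
}

print_r(mbstring_overload_report());
```

If 'regex_overloaded' comes back false on the BSD machines, the ereg_replace() call in the test script is not being routed to mbstring there.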


To confirm those, please try the following steps:

1. Replace every ereg_replace() by mb_ereg_replace() and then run it through.

2. Run the one-liner below on a shell prompt and include the output in the reply
to this mail:


( echo "#include <stdio.h>"; echo "#include <ctype.h>"; echo 'int main(void) { int i; for (i = 0; i < 256; i++) { printf("%d", isalnum(i) ? 1 : 0); } printf("\n"); return 0; }' ) > /var/tmp/test.c && gcc -o /var/tmp/test /var/tmp/test.c && /var/tmp/test && rm /var/tmp/test.c /var/tmp/test
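
Step 1 could look roughly like the sketch below. It is an illustration under stated assumptions, not code from the thread: mbstring must be loaded with multibyte regex support, strip_non_alnum_mb() is a hypothetical wrapper name, and the regex encoding is set explicitly so the result does not depend on php.ini.

```php
<?php
// Assumption: the mbstring extension with multibyte regex support is loaded.
mb_internal_encoding('UTF-8');
mb_regex_encoding('UTF-8');

// Hypothetical wrapper: the same pattern as in the test script, but calling
// mb_ereg_replace() directly instead of relying on function overloading.
function strip_non_alnum_mb($text)
{
    return mb_ereg_replace("[^[:alnum:]\n]", '', $text);
}

$sample = "日本語 abc-123!";
echo strip_non_alnum_mb($sample), "\n";
// Hex dump of the input, to confirm the bytes really are UTF-8.
echo bin2hex($sample), "\n";
```

Running this on both the Windows and the BSD machines and comparing the two output lines would show whether the difference is in mb_ereg_replace() itself or in how the input reaches it.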

my test environments are:
win2k, sp4, apache 2.0.50, php 4.3.8 (here, it works ok)
freebsd 5.2.1, fully patched, apache-1.3.31_4, mod_php4-4.3.8_2,1 (doesn't work)
dragonflybsd, latest version, today freshly installed apache and php from ports, also doesn't work

BTW,

mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.http_input = UTF-8
mbstring.http_output = UTF-8
mbstring.encoding_translation = On
mbstring.substitute_character = _
mbstring.func_overload = 7

If http_input equals internal_encoding, you don't need to turn on encoding_translation.
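
As a sketch of that suggestion, the relevant part of the php.ini section quoted above might then read as follows (only the encoding_translation line changes; the other values are the poster's):

```ini
[mbstring]
mbstring.internal_encoding = UTF-8
mbstring.http_input = UTF-8
mbstring.http_output = UTF-8
; http_input already matches internal_encoding, so no input conversion is needed
mbstring.encoding_translation = Off
```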

Moriyoshi
--- End Message ---