php-i18n Digest 19 Jan 2005 07:19:49 -0000 Issue 271
Topics (messages 833 through 834):
mb_ereg_replace problem with non a-z characters
833 by: McAjvar
834 by: Moriyoshi Koizumi
----------------------------------------------------------------------
--- Begin Message ---
hi!
i hope this is the right group for my question, if i got it wrong,
please accept my apologies and tell me where to go next.
i have a form which needs to accept strings with specific central
european, japanese, russian, etc. characters. the site accepts utf-8,
all headers are sent for utf-8 and on windows, where i'm testing it, all
works ok. eg. i enter something in japanese, korean, russian, slovene,
etc. and if i put everything through this:
ereg_replace ("[^[:alnum:]\n]", "", $t)
all japanese, russian, etc. characters remain in the string, as i'd like
them to.
but when i test this same function on dragonflybsd or freebsd, all those
characters get stripped out.
php.ini settings, mbstring section has the overload setting set to
overload everything, so php should basically be using mb_ereg_replace.
just to be sure, i tried that function specifically.
my test environments are:
win2k, sp4, apache 2.0.50, php 4.3.8 (here, it works ok)
freebsd 5.2.1, fully patched, apache-1.3.31_4, mod_php4-4.3.8_2,1
(doesn't work)
dragonflybsd, latest version, today freshly installed apache and php
from ports, also doesn't work
could someone please help me, point me in the right direction, tell me
what i am doing wrong or just explain why such a difference on windows
and the bsd platforms? i couldn't test it on a linux platform since i
don't have access to one. am i expecting too much from ereg* functions
if i want them to tell me if some foreign characters are valid in some
alphabet or should i take a different approach?
thank you for any information!
i have mbstring extension installed and here are my php.ini mbstring
settings:
[mbstring]
mbstring.language = Neutral
mbstring.internal_encoding = UTF-8
mbstring.http_input = UTF-8
mbstring.http_output = UTF-8
mbstring.encoding_translation = On
mbstring.substitute_character = _
mbstring.func_overload = 7
this is the (very simple) script i am using for testing:
___________________________
<?php
header("content-type:text/html; charset=utf-8");
unset ($t);
if (isset ($_POST['t']))
{
    $t = $_POST['t'];
}
?>
<?xml version="1.0" encoding="utf-8"?>
<html>
<head>
<title>mad scientists not a work</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<div align="center">
<?php
if (isset ($t))
{
    echo "<pre>Got this:\n" . $t . "\n</pre>";
    echo "<hr><br><pre>put through preg_replace:\n"
        . preg_replace ("/[^[:alnum:]\n]/", "", $t) . "\n</pre>";
    echo "<hr><br><pre>put through ereg_replace:\n"
        . ereg_replace ("[^[:alnum:]\n]", "", $t) . "\n</pre>";
}
?>
<form action="<?php echo $_SERVER["PHP_SELF"]; ?>" method="post">
<textarea name="t" rows="7" cols="60"><?php if (isset ($t)) echo $t;
?></textarea>
<br />
<input type="submit">
</form>
</div>
</body>
</html>
___________________________
regards,
McAjvar
--- End Message ---
--- Begin Message ---
Hi,
On 2005/01/18, at 0:55, McAjvar wrote:
> i hope this is the right group for my question, if i got it wrong,
> please accept my apologies and tell me where to go next.
I think this is the right list, although it isn't very active :)
> ereg_replace ("[^[:alnum:]\n]", "", $t)
> all japanese, russian, etc. characters remain in the string, as i'd
> like them to.
> but when i test this same function on dragonflybsd or freebsd, all
> those characters get stripped out.
> php.ini settings, mbstring section has the overload setting set to
> overload everything, so php should basically be using mb_ereg_replace.
> just to be sure, i tried that function specifically.
There are three possibilities:
a. Function overloading doesn't work at all on those platforms
   for some unknown reason.
   Plain ereg_replace() reportedly malfunctions with UTF-8.
b. You are running FreeBSD on a 64-bit architecture and
   mb_ereg_replace() doesn't behave well on it.
   Some 64-bit-related bugs were fixed in mbstring recently,
   and others have probably been missed.
c. BSD's libc implementation is borked and improperly handles
   the argument of isalnum().
To narrow these down, please try the following steps:
1. Replace every ereg_replace() with mb_ereg_replace() and then run the
   script again.
2. Run the one-liner below at a shell prompt and include the output in
   your reply to this mail:
printf '%s\n' '#include <stdio.h>' '#include <ctype.h>' 'int main(void) {
int i; for (i = 0; i < 256; i++) { printf("%d", isalnum(i) ? 1 : 0); }
printf("\n"); return 0; }' > /var/tmp/test.c &&
gcc -o /var/tmp/test /var/tmp/test.c && /var/tmp/test &&
rm /var/tmp/test.c /var/tmp/test
> my test environments are:
> win2k, sp4, apache 2.0.50, php 4.3.8 (here, it works ok)
> freebsd 5.2.1, fully patched, apache-1.3.31_4, mod_php4-4.3.8_2,1
> (doesn't work)
> dragonflybsd, latest version, today freshly installed apache and php
> from ports, also doesn't work
BTW,
> mbstring.language = Neutral
> mbstring.internal_encoding = UTF-8
> mbstring.http_input = UTF-8
> mbstring.http_output = UTF-8
> mbstring.encoding_translation = On
> mbstring.substitute_character = _
> mbstring.func_overload = 7
If http_input is the same as internal_encoding, you don't need to turn
on encoding_translation.
Moriyoshi
--- End Message ---