UTF-8 in PHP?

2008-02-28 Thread Dotan Cohen
I'm trying to make a regex that will change kaf, mem, nun, pe, and
tzadik to their final form if they are at the end of a word. On the
php mailing list, we came to this regex, which works to replace a
letter only at the end of a word, and it works well:

$text=preg_replace('/\b([^\s]+)a\b.*/U', '$1A', $text);

This replaces a final a with A, as a test case. It works.

However, when I change the a and A to Hebrew, it does not:

$text=preg_replace('/\b([^\s]+)כ\b.*/U', '$1ך', $text); // Lo Oved

You can see two test cases and the code here:
http://gibberish.co.il/test.html
http://gibberish.co.il/test2.html

Why should the regex above not work? Is it a UTF-8 problem? Can anyone
elaborate? PHP is Israeli homebrew, so I would expect that it is used
on quite a few Israeli sites.

Thanks in advance.

Dotan Cohen

http://what-is-what.com
http://gibberish.co.il
א-ב-ג-ד-ה-ו-ז-ח-ט-י-ך-כ-ל-ם-מ-ן-נ-ס-ע-ף-פ-ץ-צ-ק-ר-ש-ת

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?


subscribing to the list

2008-02-28 Thread Tzafrir Cohen
Is there a page with instructions for subscription to this mailing list?

How about archives?

Any chance of standard mailing list headers? 

-- 
Tzafrir Cohen | [EMAIL PROTECTED] | VIM is
http://tzafrir.org.il || a Mutt's
[EMAIL PROTECTED] ||  best
ICQ# 16849754 || friend

=
To unsubscribe, send mail to [EMAIL PROTECTED] with
the word unsubscribe in the message body, e.g., run the command
echo unsubscribe | mail [EMAIL PROTECTED]



Re: UTF-8 in PHP?

2008-02-28 Thread shimi
On 2/28/08, Dotan Cohen [EMAIL PROTECTED] wrote:

 I'm trying to make a regex that will change kaf, mem, nun, pe, and
 tzadik to their final form if they are at the end of a word. On the
 php mailing list, we came to this regex, which works to replace a
 letter only at the end of a word, and it works well:

 $text=preg_replace('/\b([^\s]+)a\b.*/U', '$1A', $text);

 This replaces a final a with A, as a test case. It works.

 However, when I change the a and A to Hebrew, it does not:

 $text=preg_replace('/\b([^\s]+)כ\b.*/U', '$1ך', $text); // Lo Oved

 You can see two test cases and the code here:
 http://gibberish.co.il/test.html
 http://gibberish.co.il/test2.html

 Why should the regex above not work? Is it a UTF-8 problem? Can anyone
 elaborate? PHP is Israeli homebrew, so I would expect that it is used
 on quite a few Israeli sites.


It can be a UTF-8 problem in general - PHP has many functions that are not
UTF-8 aware, which is why we have the mbstring functions... which are
equivalent to historical PHP functions, but work well on multibyte
strings... there's even an option to overload the mbstring functions on top
of the old functions, see:
http://il.php.net/manual/en/ref.mbstring.php#mbstring.overload

However, I can't see an mbstring equivalent for preg_replace (while
ereg_replace does have one...) - which might suggest one of two options: a)
preg_replace is utf-8 ready or b) mbstring functionality doesn't support a
function for preg_replace... I know this might not be a too helpful comment,
but I tried my best...

-- Shimi
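
The difference between byte-oriented and multibyte-aware functions described above can be seen in a short sketch. It assumes PHP 7 or later (for the \u{...} escapes) and uses an example word, shalom, that does not come from the thread; mb_strlen is only called if the mbstring extension happens to be loaded.

```php
<?php
// Example word (an assumption, not from the thread): shin, lamed, vav, final mem.
$word = "\u{05E9}\u{05DC}\u{05D5}\u{05DD}";

// strlen() counts bytes; each Hebrew letter takes two bytes in UTF-8.
$byteCount = strlen($word);

// mb_strlen() counts characters, but needs the mbstring extension.
$charCount = function_exists('mb_strlen') ? mb_strlen($word, 'UTF-8') : null;

// The PCRE functions have their own switch: with the /u modifier, "."
// matches one character, so this counts characters without mbstring.
$pcreCount = preg_match_all('/./u', $word);

printf("bytes=%d chars=%s pcre=%d\n", $byteCount, var_export($charCount, true), $pcreCount);
```

The four-letter word occupies eight bytes, which is why functions that work byte by byte can give surprising results on Hebrew text.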


Re: subscribing to the list

2008-02-28 Thread Shlomi Fish
On Thursday 28 February 2008, Tzafrir Cohen wrote:
 Is there a page with instructions for subscription to this mailing list?


There's http://www.hamakor.org.il/mailing-lists/linux-il.html

 How about archives?

It also lists archives.


 Any chance of standard mailing list headers?

I talked about it with the list admins a while ago. Linux-IL provides
an X-list: linux-il header, which is non-standard, but KMail supports
it if you enter the header manually. A better option would be to implement
the Mailman (or at least the Ezmlm) standard headers. But I don't know how
much progress has been made on that.

Regards,

Shlomi Fish

-
Shlomi Fish  [EMAIL PROTECTED]
Homepage:http://www.shlomifish.org/

I'm not an actor - I just play one on T.V.




Re: UTF-8 in PHP?

2008-02-28 Thread Dotan Cohen
On 28/02/2008, shimi [EMAIL PROTECTED] wrote:
 It can be a UTF-8 problem in general - PHP has many functions that are not
 UTF-8 aware, which is why we have the mbstring functions... which are
 equivalent to historical PHP functions, but work well on multibyte
 strings... there's even an option to overload the mbstring functions on top
 of the old functions, see:
 http://il.php.net/manual/en/ref.mbstring.php#mbstring.overload

 However, I can't see an mbstring equivalent for preg_replace (while
 ereg_replace does have one...) - which might suggest one of two options: a)
 preg_replace is utf-8 ready or b) mbstring functionality doesn't support a
 function for preg_replace... I know this might not be a too helpful comment,
 but I tried my best...

Thanks, Shimi. It seems that preg_replace does not work on multibyte
(utf-8) strings because that would be too slow. I'm looking for an
alternative, and you may have just found it. Thanks.

Dotan Cohen

http://what-is-what.com
http://gibberish.co.il
א-ב-ג-ד-ה-ו-ז-ח-ט-י-ך-כ-ל-ם-מ-ן-נ-ס-ע-ף-פ-ץ-צ-ק-ר-ש-ת

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?


http://wiki.osdc.org.il/ is back online

2008-02-28 Thread Shlomi Fish
Hi all!

The Israeli OSDC (Open Source Developers' Conference) wiki is back online:

http://wiki.osdc.org.il/index.php/Main_Page

Thanks should go to:

1. Andy Armstrong ( http://hexten.net/ ) for providing hosting and
setting up the wiki.

2. Issac Goldstand for setting up the DNS.

3. Gabor Szabo for some co-ordination effort.

I provided the original MediaWiki MySQL dumps.

Also note that if you're interested the Perl-IL wiki is back as well:

http://wiki.perl.org.il/index.php/Main_Page

Regards,

  Shlomi Fish

-
Shlomi Fish  [EMAIL PROTECTED]
Homepage:http://www.shlomifish.org/

I'm not an actor - I just play one on T.V.




Re: UTF-8 in PHP?

2008-02-28 Thread Meir Kriheli

Dotan Cohen wrote:

On 28/02/2008, shimi [EMAIL PROTECTED] wrote:

It can be a UTF-8 problem in general - PHP has many functions that are not
UTF-8 aware, which is why we have the mbstring functions... which are
equivalent to historical PHP functions, but work well on multibyte
strings... there's even an option to overload the mbstring functions on top
of the old functions, see:
http://il.php.net/manual/en/ref.mbstring.php#mbstring.overload

However, I can't see an mbstring equivalent for preg_replace (while
ereg_replace does have one...) - which might suggest one of two options: a)
preg_replace is utf-8 ready or b) mbstring functionality doesn't support a
function for preg_replace... I know this might not be a too helpful comment,
but I tried my best...


Thanks, Shimi. It seems that preg_replace does not work on multibyte
(utf-8) strings because that would be too slow. I'm looking for an
alternative, and you may have just found it. Thanks.

Dotan Cohen


It's not that: preg_replace has a modifier for UTF-8 (u). The problem
seems to be detecting the boundaries (\b), since the following works
(it is simpler and not equivalent, e.g. it does not match at the end
of a line):


$test=preg_replace('/([^\s]+)כ(\W)/Uu', '$1ך$2', $test);

Cheers
--
Meir Kriheli
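
One way to avoid relying on \b altogether is sketched below. It is not the method from the thread: it assumes PHP 7 or later, uses the u modifier together with \p{L} lookarounds, and the function name toFinalForms and its letter map are hypothetical. Text with niqqud (combining marks after the letter) is not handled.

```php
<?php
// Hypothetical helper; the name and the letter map are illustrative.
function toFinalForms(string $text): string
{
    // Regular form => final form, written as code points.
    $finals = [
        "\u{05DB}" => "\u{05DA}", // kaf
        "\u{05DE}" => "\u{05DD}", // mem
        "\u{05E0}" => "\u{05DF}", // nun
        "\u{05E4}" => "\u{05E3}", // pe
        "\u{05E6}" => "\u{05E5}", // tzadik
    ];

    // (?<=\p{L}) : preceded by a letter, so a one-letter word is left alone
    // [...]      : one of the five letters above
    // (?!\p{L})  : not followed by a letter (end of word or end of string)
    $pattern = '/(?<=\p{L})[' . implode('', array_keys($finals)) . '](?!\p{L})/u';

    $result = preg_replace_callback(
        $pattern,
        function (array $m) use ($finals): string {
            return $finals[$m[0]];
        },
        $text
    );

    // preg_replace_callback() returns null on error (e.g. invalid UTF-8 input).
    return $result ?? $text;
}

// Two words: mem-lamed-kaf, then kaf-lamed-bet.
$input  = "\u{05DE}\u{05DC}\u{05DB} \u{05DB}\u{05DC}\u{05D1}";
$output = toFinalForms($input);
echo $output, "\n";
```

Only the kaf that ends the first word is changed; the mem and kaf that start a word are left as they are, and the lookahead also covers a word at the very end of the string.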




Re: UTF-8 in PHP?

2008-02-28 Thread Dan Kenigsberg
On Thu, Feb 28, 2008 at 12:34:42PM +0200, Dotan Cohen wrote:
 I'm trying to make a regex that will change kaf, mem, nun, pe, and
 tzadik to their final form if they are at the end of a word. On the
 php mailing list, we came to this regex, which works to replace a
 letter only at the end of a word, and it works well:
 
 $text=preg_replace('/\b([^\s]+)a\b.*/U', '$1A', $text);
 
 This replaces a final a with A, as a test case. It works.
 
 However, when I change the a and A to Hebrew, it does not:
 
 $text=preg_replace('/\b([^\s]+)כ\b.*/U', '$1ך', $text); // Lo Oved
 
 You can see two test cases and the code here:
 http://gibberish.co.il/test.html
 http://gibberish.co.il/test2.html
 
 Why should the regex above not work? Is it a UTF-8 problem? Can anyone
 elaborate? PHP is Israeli homebrew, so I would expect that it is used
 on quite a few Israeli sites.

I'm no PHP expert, but it seems that \b does not catch a UTF-8 Hebrew word
boundary - it's probably implemented byte-by-byte.

When you do
 $text = preg_replace('/([^\s]+)כ($|\s)/', '$1ך$2', $text);
the non-final kaf is converted.

-- 
Dan Kenigsberg    http://www.cs.technion.ac.il/~danken    ICQ 162180901
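
The byte-by-byte guess can be checked with a small sketch. It assumes PHP 7 or later (for the \u{...} escape) and the default "C" locale; the variable names are illustrative.

```php
<?php
$kaf = "\u{05DB}"; // HEBREW LETTER KAF

// In UTF-8 the letter is stored as two bytes.
$hex = bin2hex($kaf);

// Without /u the pattern sees bytes, and neither byte counts as a word
// character (assuming the default "C" locale), so \w and \b never match.
$wordCharBytes = preg_match('/\w/', $kaf);
$boundaryBytes = preg_match('/\b/', $kaf);

// For comparison, \b does find a boundary after an ASCII letter.
$boundaryAscii = preg_match('/a\b/', 'a ');

// With /u, \p{L} recognises the same two bytes as a single letter.
$letterUtf8 = preg_match('/^\p{L}$/u', $kaf);

var_dump($hex, $wordCharBytes, $boundaryBytes, $boundaryAscii, $letterUtf8);
```

Since \b only matches between a word character and a non-word character, a Hebrew letter followed by a space gives it nothing to match on when the pattern is applied to raw bytes.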
