UTF-8 in PHP?
I'm trying to write a regex that changes kaf, mem, nun, pe, and tzadik to their final forms when they appear at the end of a word. On the php mailing list, we arrived at this regex, which replaces a letter only at the end of a word:

    $text=preg_replace('/\b([^\s]+)a\b.*/U', '$1A', $text);

This replaces a final a with A, as a test case, and it works. However, when I change the a and A to Hebrew, it does not:

    $text=preg_replace('/\b([^\s]+)כ\b.*/U', '$1ך', $text); // does not work

You can see two test cases and the code here:
http://gibberish.co.il/test.html
http://gibberish.co.il/test2.html

Why doesn't the regex above work? Is it a UTF-8 problem? Can anyone elaborate? PHP is Israeli homebrew, so I would expect it to be used on quite a few Israeli sites.

Thanks in advance.

Dotan Cohen
http://what-is-what.com
http://gibberish.co.il
א-ב-ג-ד-ה-ו-ז-ח-ט-י-ך-כ-ל-ם-מ-ן-נ-ס-ע-ף-פ-ץ-צ-ק-ר-ש-ת

A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
subscribing to the list
Is there a page with instructions for subscribing to this mailing list?

How about archives?

Any chance of standard mailing list headers?

-- 
Tzafrir Cohen         | [EMAIL PROTECTED] | VIM is
http://tzafrir.org.il |                   | a Mutt's
[EMAIL PROTECTED]     |                   | best
ICQ# 16849754         |                   | friend

=
To unsubscribe, send mail to [EMAIL PROTECTED] with the word unsubscribe in the message body, e.g., run the command
echo unsubscribe | mail [EMAIL PROTECTED]
Re: UTF-8 in PHP?
On 2/28/08, Dotan Cohen [EMAIL PROTECTED] wrote:
> Why should the regex above not work? Is it a UTF-8 problem? Can anyone elaborate?

It can be a UTF-8 problem in general. PHP has many functions that are not UTF-8 aware, which is why we have the mbstring functions: they are equivalent to the historical PHP functions, but work correctly on multibyte strings. There is even an option to overload the old functions with their mbstring counterparts, see:
http://il.php.net/manual/en/ref.mbstring.php#mbstring.overload

However, I can't see an mbstring equivalent for preg_replace (while ereg_replace does have one), which suggests one of two things:
a) preg_replace is already UTF-8 ready, or
b) mbstring simply does not provide a counterpart for preg_replace.

I know this might not be a very helpful comment, but I tried my best...

-- 
Shimi
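To illustrate the difference Shimi describes between the byte-oriented string functions and their mbstring counterparts, here is a minimal sketch. It assumes the file is saved as UTF-8 and that the mbstring extension is loaded; the sample word and variable names are chosen only for illustration.

```php
<?php
// Assumes this file is saved as UTF-8 and the mbstring extension is loaded.
$word = 'שלום';

// strlen() counts bytes; each Hebrew letter takes two bytes in UTF-8.
$bytes = strlen($word);

// mb_strlen() counts characters when told the encoding.
$chars = mb_strlen($word, 'UTF-8');

// substr() cuts at a byte offset, so this is only half of the first letter.
$firstByte = substr($word, 0, 1);

// mb_substr() cuts at a character offset and returns the whole letter.
$firstChar = mb_substr($word, 0, 1, 'UTF-8');

printf("%d bytes, %d characters, first letter: %s\n", $bytes, $chars, $firstChar);
```

The same mismatch between bytes and characters is what makes byte-oriented regex features behave unexpectedly on Hebrew text.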
Re: subscribing to the list
On Thursday 28 February 2008, Tzafrir Cohen wrote:
> Is there a page with instructions for subscription to this mailing list?

There's http://www.hamakor.org.il/mailing-lists/linux-il.html

> How about archives?

That page also lists the archives.

> Any chance of standard mailing list headers?

I talked about it with the list admins a while ago. Linux-IL provides an "X-list: linux-il" MIME header, which is non-standard, but which KMail supports if you enter the header manually. A better option would be to implement the Mailman (or at least the Ezmlm) standard headers, but I don't know how much progress has been made on that.

Regards,

Shlomi Fish

-----
Shlomi Fish [EMAIL PROTECTED]
Homepage: http://www.shlomifish.org/

I'm not an actor - I just play one on T.V.
Re: UTF-8 in PHP?
On 28/02/2008, shimi [EMAIL PROTECTED] wrote:
> However, I can't see an mbstring equivalent for preg_replace (while ereg_replace does have one...) - which might suggest one of two options: a) preg_replace is utf-8 ready or b) mbstring functionality doesn't support a function for preg_replace...

Thanks, Shimi. It seems that preg_replace does not work on multibyte (utf-8) strings because that would be too slow. I'm looking for an alternative, and you may have just found it. Thanks.

Dotan Cohen
http://what-is-what.com
http://gibberish.co.il
http://wiki.osdc.org.il/ is back online
Hi all!

The Israeli OSDC (Open Source Developers' Conference) wiki is back online:
http://wiki.osdc.org.il/index.php/Main_Page

Thanks should go to:
1. Andy Armstrong ( http://hexten.net/ ) for providing hosting and setting up the wiki.
2. Issac Goldstand for setting up the DNS.
3. Gabor Szabo for some coordination effort.

I provided the original MediaWiki MySQL dumps.

Also note that, if you're interested, the Perl-IL wiki is back as well:
http://wiki.perl.org.il/index.php/Main_Page

Regards,

Shlomi Fish

-----
Shlomi Fish [EMAIL PROTECTED]
Homepage: http://www.shlomifish.org/

I'm not an actor - I just play one on T.V.
Re: UTF-8 in PHP?
Dotan Cohen wrote:
> It seems that preg_replace does not work on multibyte (utf-8) strings because that would be too slow. I'm looking for an alternative, and you may have just found it.

It's not that: preg_replace has a modifier for UTF-8 (u). The problem seems to be in detecting the word boundaries (\b), because the following works. Note that it is simpler and not equivalent to your pattern; for example, it does not handle a word at the end of a line:

    $test=preg_replace('/([^\s]+)כ(\W)/Uu', '$1ך$2', $test);

Cheers
-- 
Meir Kriheli
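Building on Meir's suggestion of the u modifier, the following is a minimal sketch that covers all five letters and avoids \b altogether by using a Unicode property lookahead. The helper name hebrew_final_forms() is hypothetical, not a PHP built-in, and the sketch assumes UTF-8 source and input and a PCRE build with Unicode property support. It is one possible approach, not necessarily what anyone on the list ended up using.

```php
<?php
// Sketch only: hebrew_final_forms() is a hypothetical helper, not a PHP built-in.
// Assumes UTF-8 source and input, and PCRE built with Unicode property support.
function hebrew_final_forms($text)
{
    // Regular letter => final (sofit) form.
    $finals = array(
        'כ' => 'ך',
        'מ' => 'ם',
        'נ' => 'ן',
        'פ' => 'ף',
        'צ' => 'ץ',
    );

    // Match one of the five letters unless it is followed (after any
    // combining marks such as niqqud) by another letter. The /u modifier
    // makes PCRE treat pattern and subject as UTF-8, so the character
    // class contains five letters rather than ten separate bytes.
    $pattern = '/[כמנפצ](?!\p{M}*\p{L})/u';

    return preg_replace_callback(
        $pattern,
        function ($m) use ($finals) {
            return $finals[$m[0]];
        },
        $text
    );
}

echo hebrew_final_forms('שלומ עולמ'), "\n";
```

Because the lookahead consumes nothing, a word at the end of a line or followed by punctuation is handled as well. Cases such as a letter followed by an apostrophe or geresh inside a word are not treated specially here and would need extra rules.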
Re: UTF-8 in PHP?
On Thu, Feb 28, 2008 at 12:34:42PM +0200, Dotan Cohen wrote:
> However, when I change the a and A to Hebrew, it does not:
> $text=preg_replace('/\b([^\s]+)כ\b.*/U', '$1ך', $text); // does not work
> Why should the regex above not work? Is it a UTF-8 problem? Can anyone elaborate?

I'm no PHP expert, but it seems that \b does not catch a UTF-8 Hebrew word boundary; it's probably implemented byte-by-byte. When you do

    $text = preg_replace('/([^\s]+)כ($|\s)/', '$1ך$2', $text);

the non-final kaf is converted.

-- 
Dan Kenigsberg
http://www.cs.technion.ac.il/~danken
ICQ 162180901
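A small sketch of the byte-level view Dan describes, assuming the file is saved as UTF-8. Without the u modifier, PCRE compares the subject byte by byte, and the letter kaf is two bytes outside the ASCII range. Those bytes are most likely not classed as word characters by PCRE's default tables, which would explain why \b sees no boundary after the letter. Dan's pattern sidesteps the issue because it only relies on literal bytes, whitespace and end of string. The sample text and variable names are illustrative.

```php
<?php
// Assumes this file is saved as UTF-8.
$kaf      = 'כ';
$kafBytes = bin2hex($kaf);   // hex dump of the two bytes that encode the letter

// Dan's pattern, with no /u modifier: it compares raw bytes and looks
// for whitespace or end of string, so it never depends on \b.
$text   = 'מלכ דרכ';
$result = preg_replace('/([^\s]+)כ($|\s)/', '$1ך$2', $text);

echo $kafBytes, "\n";
echo $result, "\n";
```

Note that this pattern requires whitespace or end of string after the letter, so a word followed directly by punctuation is left unchanged.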