Hi Dotan,

> In order to determine if a string is RTL this (untested) PHP regex was
> suggested: preg_match( "|[ا-يא-ת]|", substr($text, 0, 1))
>
> 1) Is grepping the first letter of a string for a Hebrew or Arabic
> character considered good enough to determine if a string should be
> treated as RTL for purposes of setting directionality in HTML output?

You have to skip things like whitespace and punctuation at the beginning
of the string. These are "weak" characters that do not define
directionality.

Only the first "strong" character in the string, such as a letter,
controls the directionality.

For example, assuming SHALOM is written in Hebrew, this string is RTL
even though it doesn't begin with a Hebrew letter:
"...SHALOM".

You can add a few of those punctuation characters to the regex, but that
would be a hack. For "true" directionality, you need a database of
all Unicode characters and their directionality.
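
To make the hack concrete, here is a rough Python sketch of it (not a
recommendation). The names looks_rtl and _RTL_START are made up for
illustration, and the letter ranges simply mirror the ones in the
suggested PHP regex, so other RTL scripts and presentation forms are
ignored:

```python
import re

# Assumption: only the letter ranges from the suggested regex count as RTL
# (Hebrew U+05D0-U+05EA, Arabic U+0627-U+064A).
# [\W\d_]* skips leading whitespace, punctuation, digits and underscores.
_RTL_START = re.compile(r'^[\W\d_]*[\u05D0-\u05EA\u0627-\u064A]')

def looks_rtl(text):
    """Return True if the first letter in text is Hebrew or Arabic."""
    return _RTL_START.match(text) is not None
```

A string that starts with a Latin letter does not match, because the
skipped prefix may not contain letters.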

Fribidi has such a database, and is embedded into PHP, but unfortunately
there is no interface to the fribidi_get_type() function. Nothing a
feature request can't fix :)

And as a side note, you can anchor your regex with "^" instead of calling
substr($text, 0, 1). It may be a bit slower, but it's cleaner.

**************

For a better algorithm that covers languages other than Hebrew, based on
such a database, Pango has a simple implementation:

http://svn.gnome.org/viewvc/pango/trunk/pango/pango-utils.c?revision=2767&view=markup#l1390

or via google code search:
http://www.google.com/codesearch/p?hl=en#rgtz0aU8Yys/pango-1.4.1/pango/pango-utils.c&q=pango_find_base_dir

pango_unichar_direction is taken from a trimmed-down fribidi, found here:
http://www.google.com/codesearch/p?hl=en#rgtz0aU8Yys/pango-1.4.1/pango/mini-fribidi/fribidi.c&q=pango_unichar_direction&exact_package=ftp://ftp.gtk.org/pub/gtk/v2.4/pango-1.4.1.tar.bz2

> 2) Does Python have a better way of doing this?

There's no function that gives you the answer directly.
Python does have the Unicode character database, but you have to work a
bit to write a function similar to the one in Pango:

The function "bidirectional" in the unicodedata[1] module returns
one of the codes in the table in [2]. Run it on each character in the
string until the answer is one of 'L', 'LRE' or 'LRO' (text is LTR)
or one of 'R', 'AL', 'RLE' or 'RLO' (text is RTL). If no such character
is found, pick a default direction :)

[1] http://docs.python.org/library/unicodedata.html
[2] http://www.unicode.org/reports/tr9/tr9-18.html#Bidirectional_Character_Types
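
The steps above can be sketched roughly like this. It is a minimal
sketch, not a port of the Pango code. The function name base_direction,
the "ltr"/"rtl" return values and the default argument are my own
choices; only unicodedata.bidirectional() comes from the standard
library:

```python
import unicodedata

# Bidi classes that settle the direction, as listed above.
LTR_TYPES = {"L", "LRE", "LRO"}
RTL_TYPES = {"R", "AL", "RLE", "RLO"}

def base_direction(text, default="ltr"):
    """Return "ltr" or "rtl" based on the first strong character in text.

    Weak and neutral characters (whitespace, punctuation, digits) are
    skipped. If no strong character is found, return default.
    """
    for ch in text:
        bidi = unicodedata.bidirectional(ch)
        if bidi in LTR_TYPES:
            return "ltr"
        if bidi in RTL_TYPES:
            return "rtl"
    return default
```

The result can then be used to set dir="rtl" or dir="ltr" on the HTML
element. The default matters for strings that contain only digits or
punctuation, so pick it to match the surrounding page.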

-Ori

On 01/03/2009 01:08 PM, Dotan Cohen wrote:
> Please keep this post in English, as I am forwarding it to a developer
> who does not speak Hebrew. Thanks.
>
> Here is the related bug:
> https://bugs.launchpad.net/zim/+bug/303108
> 
> Thanks!
> 


_______________________________________________
Python-il mailing list
[email protected]
http://hamakor.org.il/cgi-bin/mailman/listinfo/python-il
