Hi Dotan,

> In order to determine if a string is RTL this (untested) PHP regex was
> suggested: preg_match( "|[ا-يא-ת]|", substr($text, 0, 1))
>
> 1) Is grepping the first letter of a string for a Hebrew or Arabic
> character considered good enough to determine if a string should be
> treated as RTL for purposes of setting directionality in HTML output?

You have to skip things like whitespace and punctuation at the beginning
of the string. These are "weak" characters that do not define
directionality. Only the first "strong" character in the string, such as
a letter, controls the directionality. For example, assuming SHALOM is in
Hebrew, this string is RTL even though it doesn't begin with a Hebrew
letter: "...SHALOM".

You can add a few of those punctuation characters to the regex, but that
would be a hack. For "true" directionality, you need a database of all
Unicode characters and their directionality. FriBidi has such a database,
and is embedded into PHP, but unfortunately there is no interface to the
fribidi_get_type() function. Nothing a feature request can't fix :)

As a side note, you can use "^" in your regex instead of
substr($text, 0, 1). It may be a bit slower, but it's cleaner.

**************

For a better algorithm that covers languages other than Hebrew, based on
such a database, Pango has a simple implementation:
http://svn.gnome.org/viewvc/pango/trunk/pango/pango-utils.c?revision=2767&view=markup#l1390
or via Google Code Search:
http://www.google.com/codesearch/p?hl=en#rgtz0aU8Yys/pango-1.4.1/pango/pango-utils.c&q=pango_find_base_dir

pango_unichar_direction is taken from a trimmed-down FriBidi, found here:
http://www.google.com/codesearch/p?hl=en#rgtz0aU8Yys/pango-1.4.1/pango/mini-fribidi/fribidi.c&q=pango_unichar_direction&exact_package=ftp://ftp.gtk.org/pub/gtk/v2.4/pango-1.4.1.tar.bz2

> 2) Does Python have a better way of doing this?

There's no function that gives you the answer directly. Python does have
the Unicode character database, but you have to work a bit to write a
function similar to the one in Pango: the function "bidirectional" in the
unicodedata[1] module returns one of the codes in the table in [2]. Run
it on each character in the string until the answer is one of
['L', 'LRE', 'LRO'] (text is LTR) or ['R', 'AL', 'RLE', 'RLO'] (text is
RTL).

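Something along these lines is what I have in mind. This is only a rough
sketch: it assumes Python 3 (where str is Unicode), and the name
find_base_dir, the "default" parameter, and the "ltr"/"rtl" return values
are my own choices for illustration, not part of unicodedata or Pango.

```python
import unicodedata

# Bidi classes treated as direction-defining, following the two lists above.
LTR_TYPES = {"L", "LRE", "LRO"}
RTL_TYPES = {"R", "AL", "RLE", "RLO"}


def find_base_dir(text, default="ltr"):
    """Return "ltr" or "rtl" based on the first direction-defining character.

    Hypothetical helper; "default" is what the caller wants back when the
    string contains no such character.
    """
    for ch in text:
        bidi_type = unicodedata.bidirectional(ch)
        if bidi_type in LTR_TYPES:
            return "ltr"
        if bidi_type in RTL_TYPES:
            return "rtl"
    # Only weak or neutral characters (or an empty string) were seen.
    return default
```

The return values were picked so they can go straight into the HTML "dir"
attribute. Leading dots, spaces and digits are skipped because their
bidirectional codes are in neither list.
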
If no such character is found, pick a default direction :)

[1] http://docs.python.org/library/unicodedata.html
[2] http://www.unicode.org/reports/tr9/tr9-18.html#Bidirectional_Character_Types

-Ori

On 01/03/2009 01:08 PM, Dotan Cohen wrote:
> Please keep this post in English, as I am forwarding it to a developer
> who does not speak Hebrew. Thanks.
>
> Here is the related bug:
> https://bugs.launchpad.net/zim/+bug/303108
>
> Thanks!

_______________________________________________
Python-il mailing list
http://hamakor.org.il/cgi-bin/mailman/listinfo/python-il
