On 2013-03-03 03:06, yomnasala...@gmail.com wrote:
I have Python code that takes an Arabic word, extracts its root, and removes diacritics, but I have a problem with the output. For 
example: when the input is "العربيه" the output is "عرب", which is the right answer, but when the input is "كاتب" 
the output is "ب", and when the input is "يخاف" the output is " خف".

This is my code:

# -*- coding=utf-8 -*-

import re
from arabic_const import *
import Tashaphyne
from Tashaphyne import *
import enum
from enum import Enum
search_type=Enum('unvoc_word','voc_word','root_word')

HARAKAT_pat = re.compile(ur"[" + u"".join([FATHATAN, DAMMATAN, KASRATAN, FATHA, DAMMA,
    KASRA, SUKUN, SHADDA]) + u"]")
HAMZAT_pat = re.compile(ur"[" + u"".join([WAW_HAMZA, YEH_HAMZA]) + u"]")
ALEFAT_pat = re.compile(ur"[" + u"".join([ALEF_MADDA, ALEF_HAMZA_ABOVE, ALEF_HAMZA_BELOW,
    HAMZA_ABOVE, HAMZA_BELOW]) + u"]")
LAMALEFAT_pat = re.compile(ur"[" + u"".join([LAM_ALEF, LAM_ALEF_HAMZA_ABOVE,
    LAM_ALEF_HAMZA_BELOW, LAM_ALEF_MADDA_ABOVE]) + u"]")

[snip]
When you're using Unicode with re in Python 2, you should include the
re.UNICODE flag. For example:

HARAKAT_pat = re.compile(ur"[" + u"".join([FATHATAN, DAMMATAN, KASRATAN, FATHA, DAMMA, KASRA, SUKUN, SHADDA]) + u"]", flags=re.UNICODE)

or:

HARAKAT_pat = re.compile(ur"(?u)[" + u"".join([FATHATAN, DAMMATAN, KASRATAN, FATHA, DAMMA, KASRA, SUKUN, SHADDA]) + u"]")

I don't know whether that will make a difference in this case because I
don't know Tashaphyne or Arabic.
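
For anyone who wants to check this without Tashaphyne installed, here is a
minimal sketch. It assumes the arabic_const names stand for the standard
Unicode code points of the diacritics (defined inline below as stand-ins, not
imported from the real module), and it uses plain u"" literals so the same
file runs on Python 2.7 and Python 3. In Python 2, re.UNICODE changes how
\w, \W, \b, \B, \d, \D, \s and \S behave; a character class that lists its
members explicitly, like HARAKAT_pat, should strip the same characters with
or without the flag.

```python
# -*- coding: utf-8 -*-
import re

# Stand-ins for the arabic_const names (assumption: the standard Unicode
# code points for these marks, not values read from the real module).
FATHATAN, DAMMATAN, KASRATAN = u"\u064b", u"\u064c", u"\u064d"
FATHA, DAMMA, KASRA = u"\u064e", u"\u064f", u"\u0650"
SHADDA, SUKUN = u"\u0651", u"\u0652"

HARAKAT = u"".join([FATHATAN, DAMMATAN, KASRATAN, FATHA, DAMMA,
                    KASRA, SUKUN, SHADDA])

# The same character class compiled three ways.
plain_pat = re.compile(u"[" + HARAKAT + u"]")
flag_pat = re.compile(u"[" + HARAKAT + u"]", flags=re.UNICODE)
inline_pat = re.compile(u"(?u)[" + HARAKAT + u"]")

# kaf, teh, beh with a fatha after each letter, and the same word unvoweled.
voweled = u"\u0643\u064e\u062a\u064e\u0628\u064e"
bare = u"\u0643\u062a\u0628"

stripped_plain = plain_pat.sub(u"", voweled)
stripped_flag = flag_pat.sub(u"", voweled)
stripped_inline = inline_pat.sub(u"", voweled)
```

If all three results equal the unvoweled word, the diacritic removal is not
what is eating letters, and the wrong roots are more likely to come from the
snipped stemming code.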

--
http://mail.python.org/mailman/listinfo/python-list
