Re: arabic analyzer

Robert Muir Fri, 24 Jul 2009 05:39:53 -0700

walid, it is true that some of what you mentioned (from aramorph) works in
the light stemming version and some does not.
The problem is that it's not clear to me that what aramorph is doing is
really the best approach.


From the paper I sent you:

The best stemmer in our experiments, light8-s was very simple and did
not try to find roots or take into account most of Arabic morphology.
It is probably not essential for the stemmer to yield the correct
forms, whether stems or roots.
It is sufficient for it to group most of the forms that belong together.

This is what is being used in Lucene: light8-s. If you read section
5.2.1 of the paper, you will see that this method outperforms the
morphological analysis method you speak of (using the same Buckwalter
dictionary).
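
To make the "group most of the forms that belong together" idea concrete,
here is a minimal sketch of light stemming in Python. It is an assumption
for illustration only: the prefix and suffix lists and the names
normalize/light_stem are hypothetical, and this is neither the exact
light8-s rule set from the paper nor the code in the Lucene analyzer.

```python
import re

# Illustrative affix lists only (an assumption for this sketch);
# NOT the exact light8-s lists and NOT Lucene's implementation.
PREFIXES = ("وال", "بال", "كال", "فال", "لل", "ال")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي")

# Tanween, short vowels, shadda and sukun (U+064B..U+0652).
DIACRITICS = re.compile("[\u064B-\u0652]")


def normalize(word):
    """Drop diacritics and fold a few letter variants together."""
    word = DIACRITICS.sub("", word)
    word = re.sub("[أإآ]", "ا", word)  # hamza/madda forms of alef -> bare alef
    word = word.replace("ى", "ي")      # alef maqsura -> yeh
    return word


def light_stem(word, min_len=2):
    """Strip at most one prefix, then suffixes repeatedly; no root finding."""
    word = normalize(word)
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
                word = word[:-len(suffix)]
                stripped = True
                break
    return word


forms = ["الحياة", "للحياة", "حياة", "بالحياة", "كالحياة"]
print({light_stem(w) for w in forms})
```

With these lists the five forms above collapse to a single index key, even
though that key is not a linguistically correct stem. A form such as كتب is
not conflated with كاتب, since nothing here looks for roots.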

But I also understand this is just a general-text IR relevance
measurement (results on your specific text might vary), and it does not
take into account some human factors (it can be better on average but
still make users angry, that type of thing).

Another problem I have with this situation is that I'm not sure the
morphological analysis method is really wrong, just that perhaps
aramorph / that paper might be indexing the wrong thing. For example,
aramorph indexes Arabic stems, but the latest Buckwalter dictionary has
lemmaID; why not index that?
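
A toy sketch of why the choice of index key could matter. Everything here
is hypothetical: the ANALYSES table stands in for analyzer output, the
Buckwalter-style transliterations and the lemma ID are illustrative (other
readings of the same surface forms are ignored), and build_index/search are
made-up helper names, not aramorph or Lucene APIs.

```python
from collections import defaultdict

# Hypothetical analyzer output: surface form -> (stem, lemma ID).
# Illustrative values only; other readings of these forms are ignored.
ANALYSES = {
    "كتاب": ("kitAb", "kitAb_1"),  # singular "book"
    "كتب": ("kutub", "kitAb_1"),   # broken plural "books"
}


def index_key(token, key):
    """Return the term to index for a token: its stem or its lemma ID."""
    stem, lemma_id = ANALYSES[token]
    return stem if key == "stem" else lemma_id


def build_index(docs, key):
    """Build an inverted index mapping index terms to document IDs."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[index_key(token, key)].add(doc_id)
    return index


def search(index, token, key):
    """Look up a query token using the same key choice as the index."""
    return index.get(index_key(token, key), set())


DOCS = {1: ["كتاب"], 2: ["كتب"]}

stem_index = build_index(DOCS, "stem")
lemma_index = build_index(DOCS, "lemma")
print(search(stem_index, "كتاب", "stem"))
print(search(lemma_index, "كتاب", "lemma"))
```

In this toy setup, the stem-keyed index finds only the document containing
the singular form, while the lemma-keyed index also finds the one with the
broken plural, because both forms share one lemma ID.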

Anyway, I hope that in the future there will be more options; that would
be a good thing!

On Fri, Jul 24, 2009 at 4:06 AM, walid<[email protected]> wrote:
> We were using the aramorph library for some time and so we mapped out
> the set of features it provides, they come as follows:
>
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
> * The ء and ~ are considered unique characters:
>
>     * أ, آ, ا, and إ are distinct
>     * و and ؤ are distinct
>     * ى and ئ are distinct
>
> * The ا and ة (denoting the feminine adjective) at the end of a
>   word are optional.
> * The ال, ب, ل, ك, بال, كال, لل at the beginning of a word are
>   optional.
> * All حركات as well as the ّ (شدّة) are ignored.
> * The ي, و, ات, ون denoting the plural form of a word are
>   optional. If the indexed word ends with a ة, its plural, which
>   replaces the ة with ات, is recognized.
>
> The following examples illustrate these rules:
>
>
> Indexed Word | Search Term | Success
> -------------+-------------+--------
> الحياة | للحياة | True
> الحياة | حياة | True
> الحياة | حيا | False
> الحياة | ألحياة | False
> الحياة | إلحياة | False
> الحياة | كالحياة | True
> الحياة | بالحياة | True
> الحياة | بحياة | True
> الحياة | لحياة | True
> دولارا | دولار | True
> دولارا | بدولار | True
> دولارا | بالدولار | True
> دولارا | الدولار | True
> دولارا | دؤلارا | False
> دولارا | دولأرا | False
> دولارا | دولارأ | False
> الكاتب | كاتب | True
> الكاتب | لكاتب | True
> الكاتب | كاتبة | True
> الكاتب | الكاتبة | True
> الكاتب | الكاتبات | True
> الكاتب | كاتبون | True
> الكاتب | كاتبو \ كاتبي | True
> الكاتب | كتب | False
> جميلة | جميلات | True
> جميلة | جميل | True
> جميلة | الجمال | False
> بنت | ابنة | False
> بنت | بن | True
> بنت | ابن | True
> بنت | ابنت | False
>
> ----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
> while with the new one, we only got matches for:
> فّ فُ فٌ فف فِِ فٍ ف
> and the like.
>
> -walid
>
> On Thu, 2009-07-23 at 09:33 -0400, Robert Muir wrote:
>
>> walid, can you provide any more information other than "very poor result"?
>>
>> Others have not measured much difference between morphological
>> analysis and light stemming:
>> http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
>>
>>
>> On Thu, Jul 23, 2009 at 7:34 AM, walid<[email protected]> wrote:
>> > http://issues.apache.org/jira/browse/LUCENE-1406
>> > http://issues.apache.org/jira/browse/LUCENE-153
>> >
>> > based on this, there are two options:
>> > 1- using the aramorph library
>> > 2- moving the code from trunk to the current release and using the
>> > provided arabic analyzer
>> >
>> > 1- the library works very well in indexing, tokenizing, stemming and
>> > everything, but causes memory leaks
>> > 2- the provided library has a very poor result compared to the aramorph
>> > library.
>> >
>> > Is there a plan to have better arabic support with full morphological
>> > analysis support?
>> >
>> > walid
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: [email protected]
>> > For additional commands, e-mail: [email protected]
>> >
>> >
>>
>>
>>
>



-- 
Robert Muir
[email protected]

