We were using the aramorph library for some time and so we mapped out
the set of features it provides, they come as follows:
----------------------------------------------------------------------------------------------------------------------------------------------------------------
The ء and ~ are considered unique characters.
* أ , آ, ا, and إ are distinct
* و and ؤ are distinct
* ى and ئ are distinct
* The ا and ة (denoting the feminine adjective) at the end of a
word are optional.
* The ال, ب, ل, ك, بال, كال, لل at the beginning of a word are
optional
* All حركات as well as the ّ (شدّة) are ignored.
* The ي , و , ات , ون denoting the plural form of a word are
optional. If the indexed word ends with a ة its plural, which
replaces the ة with ات , is recognized.
The following examples illustrate these rules:
Indexed Word
Search Term
Success
الحياة
للحياة
True
حياة
True
حيا
False
ألحياة
False
إلحياة
False
كالحياة
True
بالحياة
True
بحياة
True
لحياة
True
دولارا
دولار
True
بدولار
True
بالدولار
True
الدولار
True
دؤلارا
False
دولأرا
False
دولارأ
False
الكاتب
كاتب
True
لكاتب
True
كاتبة
True
الكاتبة
True
الكاتبات
True
كاتبون
True
كاتبو \ كاتبي
True
كتب
False
جميلة
جميلات
True
جميل
True
الجمال
False
بنت
ابنة
False
بن
True
ابن
True
ابنت
False
----------------------------------------------------------------------------------------------------------------------------------------------------------------
while with the new one, we only got matches for:
|
فّ فُ فٌ فف فِِ فٍ ف
and the likes of that.
-walid
On Thu, 2009-07-23 at 09:33 -0400, Robert Muir wrote:
> walid, can you provide any more information other than "very poor result"?
>
> Others have not measured much difference between morphological
> analysis and light stemming:
> http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
>
>
> On Thu, Jul 23, 2009 at 7:34 AM, walid<[email protected]> wrote:
> > http://issues.apache.org/jira/browse/LUCENE-1406
> > http://issues.apache.org/jira/browse/LUCENE-153
> >
> > based on this, there are two options:
> > 1- using the aramorph library
> > 2- moving the code from trunk to the current release and using the
> > provided arabic analyzer
> >
> > 1- the library works very well in indexing, tokenizing, stemming and
> > everything, but causes memory leaks
> > 2- the provided library has a very poor result compared to the aramorph
> > library.
> >
> > Is there a plan to have better arabic support with full morphological
> > analysis support?
> >
> > walid
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: [email protected]
> > For additional commands, e-mail: [email protected]
> >
> >
>
>
>