[jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)

Robert Muir (JIRA) Fri, 26 Sep 2008 10:46:20 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12634944#action_12634944
 ]


Robert Muir commented on LUCENE-1406:
-------------------------------------

Thought I would add the following comments:

I tried to stick to the basics to start. For the record, here are some things 
that kept bugging me:

1) In many places the stemming rules only require the stemmed token to have 2 
characters. This seems incorrect (triliteral root, anyone?) and too 
aggressive. Yet at the same time, many common prefix/suffix combinations are 
not stemmed by the light8 algorithm. But it's TREC-tested...

2) There is no decomposition of Unicode presentation forms. These characters 
show up typically when text is extracted out of PDF. The easiest way to deal 
with this is Unicode normalization, but that requires Java 6 or ICU.
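
As a rough illustration of the normalization route, here is a minimal sketch 
using java.text.Normalizer from Java 6. The class and method names are made up 
for this example, and the choice of NFKC (compatibility composition) is an 
assumption; it is not something the attached patch does.

```java
import java.text.Normalizer;

// Hypothetical helper, not part of the attached patch.
class PresentationFormFolder {
    // NFKC applies compatibility decompositions, which map Arabic
    // presentation forms (U+FB50..U+FDFF, U+FE70..U+FEFF) back to the
    // base letters they stand for.
    static String fold(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+FEFB is ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM.
        String folded = fold("\uFEFB");
        for (char c : folded.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

Note that NFKC also folds many non-Arabic compatibility characters, so a real 
filter might want to restrict it to the presentation-form blocks.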

3) There is no enhanced parsing. Academics typically index high-quality news 
text, but in less perfect text you often see words written without spaces 
between them when the characters do not join (to the human reader there is a 
space!). To really solve this you need a lot of special machinery, including 
morphological data, but you can partially solve some of the common cases by 
splitting words at 100% certain cases such as medial teh marbuta, medial alef 
maksura, double alef, ... I didn't do this because I wanted to keep it 
simple, but it's important; see here: 
http://papers.ldc.upenn.edu/COLING2004/Buckwalter_Arabic-orthography-morphology.pdf
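
To make the idea concrete, here is a minimal sketch of the splitting 
heuristic for two of the cases named above (medial teh marbuta and medial 
alef maksura). The class name and the exact rule are assumptions made for 
illustration: it splits after either character whenever another letter 
follows. The double alef case is left out because the comment does not say 
where the split point falls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the attached patch.
class RunOnSplitter {
    private static final char TEH_MARBUTA = '\u0629';
    private static final char ALEF_MAKSURA = '\u0649';

    // Splits a token after any teh marbuta or alef maksura that is
    // directly followed by another letter. Both characters are normally
    // word-final, so a following letter suggests a missing space.
    static List<String> split(String token) {
        List<String> parts = new ArrayList<String>();
        int start = 0;
        for (int i = 0; i < token.length() - 1; i++) {
            char c = token.charAt(i);
            boolean finalOnly = (c == TEH_MARBUTA || c == ALEF_MAKSURA);
            if (finalOnly && Character.isLetter(token.charAt(i + 1))) {
                parts.add(token.substring(start, i + 1));
                start = i + 1;
            }
        }
        parts.add(token.substring(start));
        return parts;
    }
}
```

Because diacritics are combining marks rather than letters, a teh marbuta 
followed only by harakat is not treated as medial in this sketch.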
 
4) It is simply a stemmer, but I read in the Lucene docs that it is possible 
to inject synonym-like information (multiple tokens for one word) and boost 
the score for certain ones. This seems like it would be better than simply 
stemming: at least index and boost the normalized surface form for better 
precision. I'd want to set up TREC tests to actually measure this, though.


> new Arabic Analyzer (Apache license)
> ------------------------------------
>
>                 Key: LUCENE-1406
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1406
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Robert Muir
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: LUCENE-1406.patch
>
>
> I've noticed there is no Arabic analyzer for Lucene, most likely because Tim 
> Buckwalter's morphological dictionary is GPL.
> However, it is not necessary to have a full morphological analysis engine 
> for quality Arabic search. 
> This implementation implements the light-8s algorithm presented in the 
> following paper: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
> As you can see from the paper, improvement via this method over searching 
> surface forms (as Lucene currently does) is significant, with almost 100% 
> improvement in average precision.
> While I personally don't think all the choices were the best, and some easy 
> improvements are still possible, the major motivation for implementing it 
> exactly the way it is presented in the paper is that the algorithm is 
> TREC-tested, so the precision/recall improvements to Lucene are already 
> documented.
> For a stopword list, I used the list available at 
> http://members.unine.ch/jacques.savoy/clef/index.html simply because the 
> creator of this list documents the data as BSD-licensed.
> This implementation (Analyzer) consists of the above-mentioned stopword list 
> plus two filters:
>  ArabicNormalizationFilter: performs orthographic normalization (such as 
> hamza seated on alif, alif maksura, teh marbuta, removal of harakat, tatweel, 
> etc.)
>  ArabicStemFilter: performs Arabic light stemming
> Both filters operate directly on the termbuffer for maximum performance. 
> There is no object creation in this Analyzer.
> There are no external dependencies. I've indexed about half a billion words 
> of Arabic text and tested against that.
> If there are any issues with this implementation I am willing to fix them. I 
> use Lucene on a daily basis and would like to give something back. Thanks.
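
For readers who have not opened the patch, the following is a rough sketch of 
what in-place orthographic normalization on a character buffer can look like. 
It is not the code from LUCENE-1406.patch: the class name is made up, and the 
specific mappings (hamza-carrying alef forms to bare alef, alef maksura to 
yeh, teh marbuta to heh) are assumptions about what "normalization" means 
here.

```java
// Hypothetical sketch, not the code from the attached patch.
class ArabicNormalizeSketch {
    // Normalizes buf[0..len) in place and returns the new length.
    // No objects are allocated; characters are overwritten or skipped.
    static int normalize(char[] buf, int len) {
        int out = 0;
        for (int i = 0; i < len; i++) {
            char c = buf[i];
            // Drop tatweel (U+0640) and harakat (U+064B..U+0652).
            if (c == '\u0640' || (c >= '\u064B' && c <= '\u0652')) {
                continue;
            }
            switch (c) {
                case '\u0622': // alef with madda above
                case '\u0623': // alef with hamza above
                case '\u0625': // alef with hamza below
                    c = '\u0627'; // bare alef
                    break;
                case '\u0649': // alef maksura
                    c = '\u064A'; // yeh
                    break;
                case '\u0629': // teh marbuta
                    c = '\u0647'; // heh
                    break;
                default:
                    break;
            }
            buf[out++] = c;
        }
        return out;
    }
}
```

A caller would pass the token's character array and current length, then set 
the token length to the returned value.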
