[jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)

Grant Ingersoll (JIRA) Tue, 30 Sep 2008 04:37:50 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635723#action_12635723
 ]


Grant Ingersoll commented on LUCENE-1406:
-----------------------------------------

I'll commit once 2.4 is released.

> new Arabic Analyzer (Apache license)
> ------------------------------------
>
>                 Key: LUCENE-1406
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1406
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Robert Muir
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: LUCENE-1406.patch
>
>
> I've noticed there is no Arabic analyzer for Lucene, most likely because Tim 
> Buckwalter's morphological dictionary is GPL.
> However, it is not necessary to have a full morphological analysis engine 
> for quality Arabic search. 
> This patch implements the light-8s algorithm presented in the 
> following paper: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
> As you can see from the paper, the improvement from this method over 
> searching surface forms (as Lucene currently does) is significant, with 
> almost 100% improvement in average precision.
> While I personally don't think all the choices were the best, and some easy 
> improvements are still possible, the major motivation for implementing it 
> exactly the way it is presented in the paper is that the algorithm is 
> TREC-tested, so the precision/recall improvements to Lucene are already 
> documented.
> For a stopword list, I used a list present at 
> http://members.unine.ch/jacques.savoy/clef/index.html simply because the 
> creator of this list documents the data as BSD-licensed.
> This implementation (Analyzer) consists of the above-mentioned stopword list 
> plus two filters:
>  ArabicNormalizationFilter: performs orthographic normalization (such as 
> hamza seated on alif, alif maksura, teh marbuta, removal of harakat, tatweel, 
> etc.)
>  ArabicStemFilter: performs Arabic light stemming
> Both filters operate directly on the term buffer for maximum performance. 
> There is no object creation in this Analyzer.
> There are no external dependencies. I've indexed about half a billion words 
> of Arabic text and tested against that.
> If there are any issues with this implementation, I am willing to fix them. I 
> use Lucene on a daily basis and would like to give something back. Thanks.
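
For readers unfamiliar with the normalization step described in the quoted text, the following is a minimal sketch of in-place orthographic normalization on a character buffer. It is not taken from the attached patch: the class name, method name, and the exact replacement targets (alif variants folded to bare alif, alif maksura to yeh, teh marbuta to heh) are assumptions chosen for illustration, and the sketch does not depend on Lucene at all.

```java
// Hypothetical illustration only; not the code in LUCENE-1406.patch.
final class ArabicNormalizeSketch {
    private static final char ALEF = '\u0627';
    private static final char ALEF_MADDA = '\u0622';
    private static final char ALEF_HAMZA_ABOVE = '\u0623';
    private static final char ALEF_HAMZA_BELOW = '\u0625';
    private static final char ALEF_MAKSURA = '\u0649';
    private static final char YEH = '\u064A';
    private static final char TEH_MARBUTA = '\u0629';
    private static final char HEH = '\u0647';
    private static final char TATWEEL = '\u0640';
    // Harakat (short-vowel and related marks) occupy U+064B..U+0652.
    private static final char FIRST_HARAKA = '\u064B';
    private static final char LAST_HARAKA = '\u0652';

    /**
     * Normalizes buf[0..len) in place and returns the new length.
     * No objects are allocated; removed characters are squeezed out
     * by copying the kept characters toward the front of the buffer.
     */
    static int normalize(char[] buf, int len) {
        int out = 0;
        for (int in = 0; in < len; in++) {
            char c = buf[in];
            // Drop tatweel (elongation) and harakat entirely.
            if (c == TATWEEL || (c >= FIRST_HARAKA && c <= LAST_HARAKA)) {
                continue;
            }
            switch (c) {
                case ALEF_MADDA:
                case ALEF_HAMZA_ABOVE:
                case ALEF_HAMZA_BELOW:
                    c = ALEF;       // assumed target: bare alif
                    break;
                case ALEF_MAKSURA:
                    c = YEH;        // assumed target: yeh
                    break;
                case TEH_MARBUTA:
                    c = HEH;        // assumed target: heh
                    break;
                default:
                    break;
            }
            buf[out++] = c;
        }
        return out;
    }

    public static void main(String[] args) {
        // Alif with hamza above, plus fatha and sukun marks.
        char[] buf = "\u0623\u064E\u062D\u0652\u0645\u064E\u062F".toCharArray();
        int len = normalize(buf, buf.length);
        System.out.println(new String(buf, 0, len));
    }
}
```

Because the write index never overtakes the read index, the same array serves as input and output, which is the property the description refers to when it says the filters work directly on the term buffer.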

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

