On Sep 30, 2008, at 8:19 AM, Robert Muir wrote:
Cool. Is there interest in similar basic functionality for Hebrew?
I'm interested, as I use Lucene for biblical research.
The same rules apply: without using GPL data (i.e., the Hspell data)
you can't do it right, but you can do a lot of the common stuff, just
as with Arabic. Tokenization is a bit more complex, and out-of-the-box
Western behavior is probably annoying at the least (splitting words
on punctuation where it shouldn't, etc.).
Robert
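The tokenization concern above can be made concrete: Hebrew acronyms carry a gershayim (often typed as a plain quote) inside the word, and a naive letters-only tokenizer splits them. A minimal sketch, with an illustrative sample word and split pattern (not code from the patch):

```java
public class HebrewTokenSplitSketch {
    public static void main(String[] args) {
        // Hebrew acronym with an internal quote mark (gershayim typed as ")
        String word = "\u05E6\u05D4\"\u05DC";
        // a naive "split on non-letters" tokenizer breaks the acronym in two
        String[] parts = word.split("[^\\p{L}]+");
        System.out.println(parts.length); // 2 pieces instead of 1 token
    }
}
```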
On Tue, Sep 30, 2008 at 7:36 AM, Grant Ingersoll (JIRA) <[EMAIL PROTECTED]> wrote:
[ https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635723#action_12635723 ]
Grant Ingersoll commented on LUCENE-1406:
-----------------------------------------
I'll commit once 2.4 is released.
> new Arabic Analyzer (Apache license)
> ------------------------------------
>
> Key: LUCENE-1406
> URL: https://issues.apache.org/jira/browse/LUCENE-1406
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Analysis
> Reporter: Robert Muir
> Assignee: Grant Ingersoll
> Priority: Minor
> Attachments: LUCENE-1406.patch
>
>
> I've noticed there is no Arabic analyzer for Lucene, most likely
> because Tim Buckwalter's morphological dictionary is GPL.
> However, a full morphological analysis engine is not necessary for
> quality Arabic search.
> This implementation applies the light-8s algorithm presented in the
> following paper: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
> As the paper shows, the improvement this method gives over searching
> surface forms (as Lucene currently does) is significant, with almost
> 100% improvement in average precision.
> While I personally don't think all the choices were the best, and
> some easy improvements are still possible, the major motivation for
> implementing it exactly as presented in the paper is that the
> algorithm is TREC-tested, so the precision/recall improvements to
> Lucene are already documented.
> For a stopword list, I used the list available at
> http://members.unine.ch/jacques.savoy/clef/index.html simply because
> its creator documents the data as BSD-licensed.
> This implementation (Analyzer) consists of the above-mentioned
> stopword list plus two filters:
> ArabicNormalizationFilter: performs orthographic normalization (such
> as normalizing hamza seated on alif, alif maksura, and teh marbuta,
> and removing harakat and tatweel)
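The orthographic normalization described here can be sketched in plain Java. The character mappings follow the standard Arabic Unicode code points; the class and method names are illustrative, not the patch's API:

```java
public class ArabicNormalizeSketch {
    // Hypothetical stand-in for the normalization the filter performs:
    // fold alif variants to bare alif, alif maksura to yeh, teh marbuta
    // to heh, and drop tatweel and harakat (short-vowel diacritics).
    public static String normalize(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\u0622': // alif with madda above
                case '\u0623': // alif with hamza above
                case '\u0625': // alif with hamza below
                    out.append('\u0627'); // bare alif
                    break;
                case '\u0649': // alif maksura
                    out.append('\u064A'); // yeh
                    break;
                case '\u0629': // teh marbuta
                    out.append('\u0647'); // heh
                    break;
                case '\u0640': // tatweel: drop entirely
                    break;
                default:
                    // harakat (fathatan .. sukun) are dropped; everything else kept
                    if (c < '\u064B' || c > '\u0652') {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("\u0623\u064E\u0628")); // alif-hamza + fatha + beh
    }
}
```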
> ArabicStemFilter: performs Arabic light stemming
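Light stemming of this kind can be sketched roughly as follows. The prefix and suffix tables are a simplified approximation of the light-8/light10 family described in the cited paper, not a copy of the patch:

```java
public class LightStemSketch {
    // Definite-article prefixes (al, wal, bal, kal, fal, lil) plus waw.
    static final String[] PREFIXES = {
        "\u0627\u0644", "\u0648\u0627\u0644", "\u0628\u0627\u0644",
        "\u0643\u0627\u0644", "\u0641\u0627\u0644", "\u0644\u0644", "\u0648"
    };
    // Common inflectional suffixes (ha, an, at, un, in, iya, ih, ta-marbuta, heh, yeh).
    static final String[] SUFFIXES = {
        "\u0647\u0627", "\u0627\u0646", "\u0627\u062A", "\u0648\u0646",
        "\u064A\u0646", "\u064A\u0629", "\u064A\u0647",
        "\u0629", "\u0647", "\u064A"
    };

    public static String stem(String s) {
        // strip at most one prefix, keeping at least two characters of stem
        for (String p : PREFIXES) {
            if (s.startsWith(p) && s.length() - p.length() >= 2) {
                s = s.substring(p.length());
                break;
            }
        }
        // strip at most one suffix, under the same minimum-length guard
        for (String suf : SUFFIXES) {
            if (s.endsWith(suf) && s.length() - suf.length() >= 2) {
                s = s.substring(0, s.length() - suf.length());
                break;
            }
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(stem("\u0627\u0644\u0643\u062A\u0627\u0628")); // "al-kitab"
    }
}
```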
> Both filters operate directly on the term buffer for maximum
> performance; there is no object creation in this Analyzer.
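The no-allocation style described above can be illustrated with a helper that edits a char buffer in place and returns the new logical length, the way token filters of this era mutated term buffers. The name and scope are illustrative:

```java
public class TermBufferSketch {
    // Remove harakat (\u064B-\u0652) from buf[0..len) in place, returning
    // the new length. No temporary String or array is allocated, which is
    // the point of working on the buffer directly.
    public static int deleteHarakat(char[] buf, int len) {
        int w = 0; // write position
        for (int r = 0; r < len; r++) {
            char c = buf[r];
            if (c < '\u064B' || c > '\u0652') {
                buf[w++] = c; // keep non-diacritic characters, compacting left
            }
        }
        return w;
    }

    public static void main(String[] args) {
        char[] b = "\u0643\u064E\u062A".toCharArray(); // kaf + fatha + teh
        int n = deleteHarakat(b, b.length);
        System.out.println(new String(b, 0, n));
    }
}
```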
> There are no external dependencies. I've indexed about half a
> billion words of Arabic text and tested against that.
> If there are any issues with this implementation, I am willing to
> fix them. I use Lucene on a daily basis and would like to give
> something back. Thanks.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
--
Robert Muir
[EMAIL PROTECTED]