Can you provide any more information on your use case? I had originally
imagined Modern Hebrew, ktiv male (plene) spelling only, but your use case is
interesting.

Are you currently indexing Biblical Hebrew text? Dotted or undotted?


On Tue, Sep 30, 2008 at 8:54 AM, DM Smith <[EMAIL PROTECTED]> wrote:

>
> On Sep 30, 2008, at 8:19 AM, Robert Muir wrote:
>
> Cool. Is there interest in similar basic functionality for Hebrew?
>
>
> I'm interested, as I use Lucene for biblical research.
>
>
>
> The same rules apply: without using GPL data (i.e., the Hspell data) you
> can't do it right, but you can do a lot of the common stuff, just as with
> Arabic. Tokenization is a tad more complex, and out-of-the-box western
> behavior is probably annoying at the least (splitting words on punctuation
> where it shouldn't, etc.).
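>
> To make the punctuation issue concrete, here is a minimal sketch against
> the Lucene 2.x CharTokenizer API (the class name is hypothetical). Note
> that naively keeping the ASCII stand-ins for geresh/gershayim also glues
> ordinary quotation marks onto words, so this is a starting point, not a
> fix:
>
>   import java.io.Reader;
>   import org.apache.lucene.analysis.CharTokenizer;
>
>   // Sketch: treat geresh/gershayim (and their common ASCII stand-ins
>   // ' and ") as word-internal, so Hebrew abbreviations and acronyms
>   // are not split the way a western tokenizer splits them.
>   public class HebrewLetterTokenizer extends CharTokenizer {
>     public HebrewLetterTokenizer(Reader in) {
>       super(in);
>     }
>
>     protected boolean isTokenChar(char c) {
>       return Character.isLetter(c)
>           || c == '\u05F3'  // HEBREW PUNCTUATION GERESH
>           || c == '\u05F4'  // HEBREW PUNCTUATION GERSHAYIM
>           || c == '\''      // apostrophe used as geresh
>           || c == '"';      // double quote used as gershayim
>     }
>   }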
>
> Robert
>
> On Tue, Sep 30, 2008 at 7:36 AM, Grant Ingersoll (JIRA) <[EMAIL PROTECTED]> wrote:
>
>>
>>    [
>> https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635723#action_12635723]
>>
>> Grant Ingersoll commented on LUCENE-1406:
>> -----------------------------------------
>>
>> I'll commit once 2.4 is released.
>>
>> > new Arabic Analyzer (Apache license)
>> > ------------------------------------
>> >
>> >                 Key: LUCENE-1406
>> >                 URL: https://issues.apache.org/jira/browse/LUCENE-1406
>> >             Project: Lucene - Java
>> >          Issue Type: New Feature
>> >          Components: Analysis
>> >            Reporter: Robert Muir
>> >            Assignee: Grant Ingersoll
>> >            Priority: Minor
>> >         Attachments: LUCENE-1406.patch
>> >
>> >
>> > I've noticed there is no Arabic analyzer for Lucene, most likely because
>> > Tim Buckwalter's morphological dictionary is GPL.
>> > However, a full morphological analysis engine is not necessary for
>> > quality Arabic search.
>> > This implementation implements the light-8s algorithm presented in the
>> > following paper: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
>> > As you can see from the paper, improvement via this method over
>> > searching surface forms (as Lucene currently does) is significant, with
>> > almost 100% improvement in average precision.
>> > While I personally don't think all the choices were the best, and some
>> > easy improvements are still possible, the major motivation for
>> > implementing it exactly the way it is presented in the paper is that the
>> > algorithm is TREC-tested, so the precision/recall improvements to Lucene
>> > are already documented.
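>> > To give a rough idea of what light stemming in this family does, here
>> > is a hedged sketch on Strings using light8-style affix sets (the patch
>> > itself works on the term buffer, and its exact affix lists may differ):
>> >
>> >   // Sketch of light8-style Arabic light stemming (illustrative only).
>> >   static final String[] PREFIXES = {
>> >     "\u0648\u0627\u0644", // wal-
>> >     "\u0628\u0627\u0644", // bal-
>> >     "\u0643\u0627\u0644", // kal-
>> >     "\u0641\u0627\u0644", // fal-
>> >     "\u0627\u0644",       // al- (definite article)
>> >     "\u0644\u0644",       // lil-
>> >   };
>> >   static final String[] SUFFIXES = {
>> >     "\u0647\u0627", "\u0627\u0646", "\u0627\u062A", "\u0648\u0646",
>> >     "\u064A\u0646", "\u064A\u0647", "\u064A\u0629",
>> >     "\u0647", "\u0629", "\u064A",
>> >   };
>> >
>> >   static String lightStem(String word) {
>> >     // strip the conjunction waw only if enough of the word remains
>> >     if (word.length() > 3 && word.charAt(0) == '\u0648') {
>> >       word = word.substring(1);
>> >     }
>> >     // strip at most one definite-article prefix
>> >     for (String p : PREFIXES) {
>> >       if (word.startsWith(p) && word.length() > p.length() + 1) {
>> >         word = word.substring(p.length());
>> >         break;
>> >       }
>> >     }
>> >     // strip matching suffixes while at least two letters remain
>> >     for (String s : SUFFIXES) {
>> >       if (word.endsWith(s) && word.length() > s.length() + 1) {
>> >         word = word.substring(0, word.length() - s.length());
>> >       }
>> >     }
>> >     return word;
>> >   }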
>> > For a stopword list, I used the list available at
>> > http://members.unine.ch/jacques.savoy/clef/index.html simply because the
>> > creator of this list documents the data as BSD-licensed.
>> > This implementation (Analyzer) consists of the above-mentioned stopword
>> > list plus two filters:
>> >  ArabicNormalizationFilter: performs orthographic normalization (such as
>> > hamza seated on alif, alif maksura, teh marbuta, removal of harakat,
>> > tatweel, etc.)
>> >  ArabicStemFilter: performs Arabic light stemming
>> > Both filters operate directly on the term buffer for maximum
>> > performance. There is no object creation in this Analyzer.
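>> > As a concrete illustration, here is a hedged sketch of the kind of
>> > character mapping such a normalization applies (the exact table in the
>> > patch may differ):
>> >
>> >   // Sketch of Arabic orthographic normalization (illustrative only).
>> >   static char normalize(char c) {
>> >     switch (c) {
>> >       case '\u0622': // ALEF WITH MADDA ABOVE
>> >       case '\u0623': // ALEF WITH HAMZA ABOVE
>> >       case '\u0625': // ALEF WITH HAMZA BELOW
>> >         return '\u0627'; // bare ALEF
>> >       case '\u0649':     // ALEF MAKSURA
>> >         return '\u064A'; // YEH
>> >       case '\u0629':     // TEH MARBUTA
>> >         return '\u0647'; // HEH
>> >       default:
>> >         return c;
>> >     }
>> >   }
>> >   // harakat (\u064B-\u0652) and tatweel (\u0640) are not mapped but
>> >   // removed from the term buffer entirely.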
>> > There are no external dependencies. I've indexed about half a billion
>> > words of Arabic text and tested against that.
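>> > Putting the pieces together, the chain described above would compose
>> > roughly like this against the Lucene 2.x Analyzer API (a sketch: the
>> > tokenizer and the stopword constant are placeholders, and the two
>> > filter classes are the ones named above, not a released API):
>> >
>> >   import java.io.Reader;
>> >   import org.apache.lucene.analysis.Analyzer;
>> >   import org.apache.lucene.analysis.LetterTokenizer;
>> >   import org.apache.lucene.analysis.StopFilter;
>> >   import org.apache.lucene.analysis.TokenStream;
>> >
>> >   public class ArabicAnalyzerSketch extends Analyzer {
>> >     // placeholder; the real list is the BSD-licensed one linked above
>> >     private static final String[] STOP_WORDS = {};
>> >
>> >     public TokenStream tokenStream(String fieldName, Reader reader) {
>> >       TokenStream result = new LetterTokenizer(reader); // placeholder
>> >       result = new StopFilter(result, STOP_WORDS);
>> >       result = new ArabicNormalizationFilter(result); // from the patch
>> >       result = new ArabicStemFilter(result);          // from the patch
>> >       return result;
>> >     }
>> >   }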
>> > If there are any issues with this implementation I am willing to fix
>> > them. I use Lucene on a daily basis and would like to give something
>> > back. Thanks.
>>
>
>
> --
> Robert Muir
> [EMAIL PROTECTED]
>
>
>


-- 
Robert Muir
[EMAIL PROTECTED]
