Re: [jira] Commented: (LUCENE-1406) new Arabic Analyzer (Apache license)

DM Smith Tue, 30 Sep 2008 07:16:39 -0700

Robert Muir wrote:

can you provide any more information on your use case? I had originally imagined MH, ktiv male spelling only, but your use case is interesting.
Are you currently indexing biblical hebrew text? dotted or undotted?

Biblical Hebrew. Variety of texts. Some unpointed. Others w/ points and cantillation. All are NFC.

IMHO, I think it is important to document whether an analyzer works with NFC, NFD or whatever. And leave it to the program to normalize to that form.
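
To illustrate the suggestion above, here is a minimal sketch of how a calling program might normalize text to NFC before handing it to an analyzer. It uses only the JDK's java.text.Normalizer; the class name NfcSketch and the method toNfc are hypothetical names chosen for this example and are not part of Lucene or of the patch under discussion.

```java
import java.text.Normalizer;

/** Hypothetical helper: brings text into NFC before it is analyzed. */
public final class NfcSketch {
    private NfcSketch() {}

    /** Returns the input unchanged if it is already NFC, otherwise an NFC copy. */
    public static String toNfc(String text) {
        if (Normalizer.isNormalized(text, Normalizer.Form.NFC)) {
            return text; // avoid allocating a new string for already-normalized input
        }
        return Normalizer.normalize(text, Normalizer.Form.NFC);
    }
}
```

An analyzer documented as expecting NFC could then be fed NfcSketch.toNfc(rawText), so the choice of normalization form stays with the application rather than the analyzer.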

On Tue, Sep 30, 2008 at 8:54 AM, DM Smith <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>> wrote:



    On Sep 30, 2008, at 8:19 AM, Robert Muir wrote:

    cool. is there interest in similar basic functionality for Hebrew?


    I'm interested as I use lucene for biblical research.



    same rules apply: without using GPL data (i.e. Hspell data) you
    can't do it right, but you can do a lot of the common stuff just
    like Arabic. Tokenization is a tad bit more complex, and out of
    box western behavior is probably annoying at the least (splitting
    words on punctuation where it shouldn't, etc).

    Robert

    On Tue, Sep 30, 2008 at 7:36 AM, Grant Ingersoll (JIRA)
    <[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>> wrote:


           [
        
https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635723#action_12635723
        
        ]

        Grant Ingersoll commented on LUCENE-1406:
        -----------------------------------------

        I'll commit once 2.4 is released.

        > new Arabic Analyzer (Apache license)
        > ------------------------------------
        >
        >                 Key: LUCENE-1406
        >                 URL:
        https://issues.apache.org/jira/browse/LUCENE-1406
        >             Project: Lucene - Java
        >          Issue Type: New Feature
        >          Components: Analysis
        >            Reporter: Robert Muir
        >            Assignee: Grant Ingersoll
        >            Priority: Minor
        >         Attachments: LUCENE-1406.patch
        >
        >
        > I've noticed there is no Arabic analyzer for Lucene, most
        likely because Tim Buckwalter's morphological dictionary is GPL.
        > However, it is not necessary to have a full morphological
        analysis engine for a quality Arabic search.
        > This implementation implements the light-8s algorithm
        present in the following paper:
        http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
        > As you can see from the paper, improvement via this method
        over searching surface forms (as lucene currently does) is
        significant, with almost 100% improvement in average precision.
        > While I personally don't think all the choices were the
        best, and some easy improvements are still possible, the
        major motivation for implementing it exactly the way it is
        presented in the paper is that the algorithm is TREC-tested,
        so the precision/recall improvements to lucene are already
        documented.
        > For a stopword list, I used a list present at
        http://members.unine.ch/jacques.savoy/clef/index.html simply
        because the creator of this list documents the data as
        BSD-licensed.
        > This implementation (Analyzer) consists of above mentioned
        stopword list plus two filters:
        >  ArabicNormalizationFilter: performs orthographic
        normalization (such as hamza seated on alif, alif maksura,
        teh marbuta, removal of harakat, tatweel, etc)
        >  ArabicStemFilter: performs arabic light stemming
        > Both filters operate directly on termbuffer for maximum
        performance. There is no object creation in this Analyzer.
        > There are no external dependencies. I've indexed about half
        a billion words of arabic text and tested against that.
        > If there are any issues with this implementation I am
        willing to fix them. I use lucene on a daily basis and would
        like to give something back. Thanks.
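
The orthographic normalization described in the issue can be sketched roughly as follows. This is not the code from LUCENE-1406.patch: the class name ArabicNormalizationSketch is hypothetical, and the specific character mappings (hamza-carrying alef forms to bare alef, alef maksura to yeh, teh marbuta to heh, harakat and tatweel dropped) are assumptions based on the categories the description lists. The sketch works in place on a char array, in the spirit of the "no object creation" remark.

```java
/** Hypothetical in-place normalizer; the mappings are assumptions, not the patch's code. */
public final class ArabicNormalizationSketch {
    private ArabicNormalizationSketch() {}

    /**
     * Normalizes the first len chars of buf in place.
     * Returns the new length, which may be shorter because marks are dropped.
     */
    public static int normalize(char[] buf, int len) {
        int out = 0;
        for (int i = 0; i < len; i++) {
            char c = buf[i];
            switch (c) {
                case '\u0622': // alef with madda above
                case '\u0623': // alef with hamza above
                case '\u0625': // alef with hamza below
                    buf[out++] = '\u0627'; // bare alef
                    break;
                case '\u0649': // alef maksura
                    buf[out++] = '\u064A'; // yeh
                    break;
                case '\u0629': // teh marbuta
                    buf[out++] = '\u0647'; // heh
                    break;
                case '\u0640': // tatweel (elongation mark): dropped
                    break;
                default:
                    // U+064B..U+0652 are the harakat (fathatan through sukun): dropped
                    if (c < '\u064B' || c > '\u0652') {
                        buf[out++] = c;
                    }
            }
        }
        return out;
    }
}
```

Because the output index never runs ahead of the input index, the same buffer can serve as both source and destination, and no intermediate strings are allocated per token.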

        --
        This message is automatically generated by JIRA.
        -
        You can reply to this email to add a comment to the issue online.


        ---------------------------------------------------------------------
        To unsubscribe, e-mail:
        [EMAIL PROTECTED]
        <mailto:[EMAIL PROTECTED]>
        For additional commands, e-mail:
        [EMAIL PROTECTED]
        <mailto:[EMAIL PROTECTED]>

--Robert Muir

    [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>





--
Robert Muir
[EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>


