On Sep 30, 2008, at 8:19 AM, Robert Muir wrote:
Cool. Is there interest in similar basic functionality for Hebrew?
I'm interested, as I use Lucene for biblical research.
The same rules apply: without using GPL data (i.e., the Hspell data)
you can't do it right, but you can do a lot of the common stuff, just
as with Arabic. Tokenization is a bit more complex, and out-of-the-box
Western behavior is probably annoying at the least (splitting words
on punctuation where it shouldn't, etc.).
Robert
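The tokenization concern above can be made concrete: Hebrew acronyms carry a gershayim (often typed as a plain quote) inside the word, and a naive letters-only tokenizer splits them. A minimal sketch, with an illustrative sample word and split pattern (not code from the patch):

```java
public class HebrewTokenSplitSketch {
    public static void main(String[] args) {
        // Hebrew acronym with an internal quote mark (gershayim typed as ")
        String word = "\u05E6\u05D4\"\u05DC";
        // a naive "split on non-letters" tokenizer breaks the acronym in two
        String[] parts = word.split("[^\\p{L}]+");
        System.out.println(parts.length); // 2 pieces instead of 1 token
    }
}
```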
On Tue, Sep 30, 2008 at 7:36 AM, Grant Ingersoll (JIRA) <[EMAIL PROTECTED]> wrote:
[ https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635723#action_12635723 ]
Grant Ingersoll commented on LUCENE-1406:
-----------------------------------------
I'll commit once 2.4 is released.
> new Arabic Analyzer (Apache license)
> ------------------------------------
>
> Key: LUCENE-1406
> URL: https://issues.apache.org/jira/browse/LUCENE-1406
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Analysis
> Reporter: Robert Muir
> Assignee: Grant Ingersoll
> Priority: Minor
> Attachments: LUCENE-1406.patch
>
>
> I've noticed there is no Arabic analyzer for Lucene, most likely
> because Tim Buckwalter's morphological dictionary is GPL.
> However, a full morphological analysis engine is not necessary for
> quality Arabic search.
> This implementation applies the light-8s algorithm presented in the
> following paper: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
> As the paper shows, the improvement this method gives over searching
> surface forms (as Lucene currently does) is significant, with almost
> 100% improvement in average precision.
> While I personally don't think all the choices were the best, and
> some easy improvements are still possible, the major motivation for
> implementing it exactly as presented in the paper is that the
> algorithm is TREC-tested, so the precision/recall improvements to
> Lucene are already documented.
> For a stopword list, I used the list available at
> http://members.unine.ch/jacques.savoy/clef/index.html simply because
> its creator documents the data as BSD-licensed.
> This implementation (Analyzer) consists of the above-mentioned
> stopword list plus two filters:
> ArabicNormalizationFilter: performs orthographic normalization (such
> as normalizing hamza seated on alif, alif maksura, and teh marbuta,
> and removing harakat and tatweel)
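The orthographic normalization described here can be sketched in plain Java. The character mappings follow the standard Arabic Unicode code points; the class and method names are illustrative, not the patch's API:

```java
public class ArabicNormalizeSketch {
    // Hypothetical stand-in for the normalization the filter performs:
    // fold alif variants to bare alif, alif maksura to yeh, teh marbuta
    // to heh, and drop tatweel and harakat (short-vowel diacritics).
    public static String normalize(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '\u0622': // alif with madda above
                case '\u0623': // alif with hamza above
                case '\u0625': // alif with hamza below
                    out.append('\u0627'); // bare alif
                    break;
                case '\u0649': // alif maksura
                    out.append('\u064A'); // yeh
                    break;
                case '\u0629': // teh marbuta
                    out.append('\u0647'); // heh
                    break;
                case '\u0640': // tatweel: drop entirely
                    break;
                default:
                    // harakat (fathatan .. sukun) are dropped; everything else kept
                    if (c < '\u064B' || c > '\u0652') {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("\u0623\u064E\u0628")); // alif-hamza + fatha + beh
    }
}
```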
> ArabicStemFilter: performs Arabic light stemming
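Light stemming of this kind can be sketched roughly as follows. The prefix and suffix tables are a simplified approximation of the light-8/light10 family described in the cited paper, not a copy of the patch:

```java
public class LightStemSketch {
    // Definite-article prefixes (al, wal, bal, kal, fal, lil) plus waw.
    static final String[] PREFIXES = {
        "\u0627\u0644", "\u0648\u0627\u0644", "\u0628\u0627\u0644",
        "\u0643\u0627\u0644", "\u0641\u0627\u0644", "\u0644\u0644", "\u0648"
    };
    // Common inflectional suffixes (ha, an, at, un, in, iya, ih, ta-marbuta, heh, yeh).
    static final String[] SUFFIXES = {
        "\u0647\u0627", "\u0627\u0646", "\u0627\u062A", "\u0648\u0646",
        "\u064A\u0646", "\u064A\u0629", "\u064A\u0647",
        "\u0629", "\u0647", "\u064A"
    };

    public static String stem(String s) {
        // strip at most one prefix, keeping at least two characters of stem
        for (String p : PREFIXES) {
            if (s.startsWith(p) && s.length() - p.length() >= 2) {
                s = s.substring(p.length());
                break;
            }
        }
        // strip at most one suffix, under the same minimum-length guard
        for (String suf : SUFFIXES) {
            if (s.endsWith(suf) && s.length() - suf.length() >= 2) {
                s = s.substring(0, s.length() - suf.length());
                break;
            }
        }
        return s;
    }

    public static void main(String[] args) {
        System.out.println(stem("\u0627\u0644\u0643\u062A\u0627\u0628")); // "al-kitab"
    }
}
```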
> Both filters operate directly on the term buffer for maximum
> performance; there is no object creation in this Analyzer.
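The no-allocation style described above can be illustrated with a helper that edits a char buffer in place and returns the new logical length, the way token filters of this era mutated term buffers. The name and scope are illustrative:

```java
public class TermBufferSketch {
    // Remove harakat (\u064B-\u0652) from buf[0..len) in place, returning
    // the new length. No temporary String or array is allocated, which is
    // the point of working on the buffer directly.
    public static int deleteHarakat(char[] buf, int len) {
        int w = 0; // write position
        for (int r = 0; r < len; r++) {
            char c = buf[r];
            if (c < '\u064B' || c > '\u0652') {
                buf[w++] = c; // keep non-diacritic characters, compacting left
            }
        }
        return w;
    }

    public static void main(String[] args) {
        char[] b = "\u0643\u064E\u062A".toCharArray(); // kaf + fatha + teh
        int n = deleteHarakat(b, b.length);
        System.out.println(new String(b, 0, n));
    }
}
```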
> There are no external dependencies. I've indexed about half a
> billion words of Arabic text and tested against that.
> If there are any issues with this implementation, I am willing to
> fix them. I use Lucene on a daily basis and would like to give
> something back. Thanks.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
--
Robert Muir
[EMAIL PROTECTED]