Can you provide any more information on your use case? I had originally imagined MH (Modern Hebrew), ktiv male spelling only, but your use case is interesting.
Are you currently indexing biblical Hebrew text? Dotted or undotted?

On Tue, Sep 30, 2008 at 8:54 AM, DM Smith <[EMAIL PROTECTED]> wrote:

> On Sep 30, 2008, at 8:19 AM, Robert Muir wrote:
>
> > cool. is there interest in similar basic functionality for Hebrew?
>
> I'm interested as I use lucene for biblical research.
>
> > same rules apply: without using GPL data (i.e. Hspell data) you can't do it
> > right, but you can do a lot of the common stuff just like Arabic.
> > Tokenization is a tad more complex, and out-of-box western behavior is
> > probably annoying at the least (splitting words on punctuation where it
> > shouldn't, etc).
> >
> > Robert
> >
> > On Tue, Sep 30, 2008 at 7:36 AM, Grant Ingersoll (JIRA) <[EMAIL PROTECTED]> wrote:
> >
> > > [ https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12635723#action_12635723 ]
> > >
> > > Grant Ingersoll commented on LUCENE-1406:
> > > -----------------------------------------
> > >
> > > I'll commit once 2.4 is released.
> > >
> > > > new Arabic Analyzer (Apache license)
> > > > ------------------------------------
> > > >
> > > >          Key: LUCENE-1406
> > > >          URL: https://issues.apache.org/jira/browse/LUCENE-1406
> > > >      Project: Lucene - Java
> > > >   Issue Type: New Feature
> > > >   Components: Analysis
> > > >     Reporter: Robert Muir
> > > >     Assignee: Grant Ingersoll
> > > >     Priority: Minor
> > > >  Attachments: LUCENE-1406.patch
> > > >
> > > > I've noticed there is no Arabic analyzer for Lucene, most likely because Tim Buckwalter's morphological dictionary is GPL.
> > > > However, a full morphological analysis engine is not necessary for quality Arabic search.
> > > > This implementation uses the light-8s algorithm presented in the following paper: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
> > > > As you can see from the paper, the improvement from this method over searching surface forms (as Lucene currently does) is significant, with almost 100% improvement in average precision.
> > > > While I personally don't think all the choices were the best, and some easy improvements are still possible, the major motivation for implementing it exactly as presented in the paper is that the algorithm is TREC-tested, so the precision/recall improvements to Lucene are already documented.
> > > > For a stopword list, I used the list at http://members.unine.ch/jacques.savoy/clef/index.html simply because its creator documents the data as BSD-licensed.
> > > > This implementation (Analyzer) consists of the above-mentioned stopword list plus two filters:
> > > > ArabicNormalizationFilter: performs orthographic normalization (such as hamza seated on alif, alif maksura, teh marbuta, removal of harakat, tatweel, etc.)
> > > > ArabicStemFilter: performs Arabic light stemming
> > > > Both filters operate directly on the term buffer for maximum performance. There is no object creation in this Analyzer.
> > > > There are no external dependencies. I've indexed about half a billion words of Arabic text and tested against that.
> > > > If there are any issues with this implementation I am willing to fix them. I use Lucene on a daily basis and would like to give something back. Thanks.
> > >
> > > --
> > > This message is automatically generated by JIRA.
> > > You can reply to this email to add a comment to the issue online.
> >
> > --
> > Robert Muir
> > [EMAIL PROTECTED]

--
Robert Muir
[EMAIL PROTECTED]
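To make the two filters described in the JIRA comment concrete, here is a minimal standalone sketch of what they do. This is not the LUCENE-1406 patch itself: the normalization rules below are just the ones the comment enumerates (hamza-seated alif, alif maksura, teh marbuta, harakat, tatweel), and the prefix/suffix lists in the stemming pass are representative examples only, not the exact light-8s lists, which are defined in the linked paper. Class and method names are hypothetical.

```java
// Hypothetical sketch of the two analysis steps discussed in the thread.
// Not the actual LUCENE-1406 code; affix lists are illustrative examples.
public class ArabicSketch {

    // Orthographic normalization, per the rules listed in the JIRA comment.
    public static String normalize(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '\u0622': // alif with madda above
                case '\u0623': // alif with hamza above
                case '\u0625': // alif with hamza below
                    sb.append('\u0627'); break;   // fold to bare alif
                case '\u0649':                    // alif maksura
                    sb.append('\u064A'); break;   // fold to yeh
                case '\u0629':                    // teh marbuta
                    sb.append('\u0647'); break;   // fold to heh
                case '\u0640':                    // tatweel: remove
                    break;
                default:
                    // U+064B..U+0652 are the harakat (short-vowel marks): remove
                    if (c < '\u064B' || c > '\u0652') sb.append(c);
            }
        }
        return sb.toString();
    }

    // Example definite-article prefixes and common suffixes; the real
    // light-8s lists come from the paper, not from this sketch.
    private static final String[] PREFIXES = {
        "\u0648\u0627\u0644", "\u0628\u0627\u0644", "\u0643\u0627\u0644",
        "\u0641\u0627\u0644", "\u0627\u0644"
    };
    private static final String[] SUFFIXES = {
        "\u0647\u0627", "\u0627\u0646", "\u0627\u062A", "\u0648\u0646",
        "\u064A\u0646", "\u0647", "\u0629", "\u064A"
    };

    // Light stemming: strip at most one prefix, then suffixes greedily,
    // but only while enough of the word remains to stay meaningful.
    public static String lightStem(String term) {
        for (String p : PREFIXES) {
            if (term.startsWith(p) && term.length() - p.length() >= 2) {
                term = term.substring(p.length());
                break;
            }
        }
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String s : SUFFIXES) {
                if (term.endsWith(s) && term.length() - s.length() >= 2) {
                    term = term.substring(0, term.length() - s.length());
                    changed = true;
                }
            }
        }
        return term;
    }
}
```

In the real filters these transforms would run in-place on the token's term buffer (as the comment notes, to avoid object creation per token); the sketch uses `String` only for readability.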