[ https://issues.apache.org/jira/browse/LUCENE-1406?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Grant Ingersoll reassigned LUCENE-1406:
---------------------------------------

    Assignee: Grant Ingersoll

> new Arabic Analyzer (Apache license)
> ------------------------------------
>
>                 Key: LUCENE-1406
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1406
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis
>            Reporter: Robert Muir
>            Assignee: Grant Ingersoll
>            Priority: Minor
>         Attachments: arabic.zip
>
>
> I've noticed there is no Arabic analyzer for Lucene, most likely because
> Tim Buckwalter's morphological dictionary is GPL.
> However, it is not necessary to have a full morphological analysis engine
> for quality Arabic search.
> This implementation implements the light-8s algorithm presented in the
> following paper: http://ciir.cs.umass.edu/pubfiles/ir-249.pdf
> As you can see from the paper, the improvement from this method over
> searching surface forms (as Lucene currently does) is significant, with
> almost 100% improvement in average precision.
> While I personally don't think all the choices were the best, and some easy
> improvements are still possible, the major motivation for implementing it
> exactly the way it is presented in the paper is that the algorithm is
> TREC-tested, so the precision/recall improvements to Lucene are already
> documented.
> For a stopword list, I used a list available at
> http://members.unine.ch/jacques.savoy/clef/index.html simply because the
> creator of this list documents the data as BSD-licensed.
> This implementation (Analyzer) consists of the above-mentioned stopword
> list plus two filters:
>   ArabicNormalizationFilter: performs orthographic normalization (such as
>     hamza seated on alif, alif maksura, teh marbuta, removal of harakat,
>     tatweel, etc.)
>   ArabicStemFilter: performs Arabic light stemming
> Both filters operate directly on the term buffer for maximum performance.
> There is no object creation in this Analyzer.
> There are no external dependencies.
> I've indexed about half a billion words of Arabic text and tested against
> that.
> If there are any issues with this implementation I am willing to fix them.
> I use Lucene on a daily basis and would like to give something back. Thanks.
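
For readers without the attachment, here is a minimal sketch of what the two
filters described in the issue might do. It is not the code in arabic.zip.
The class name ArabicSketch, the character mappings, the prefix and suffix
lists, and the minimum-length thresholds are all assumptions, loosely based
on the light stemmers in the paper cited above. It uses no Lucene classes and
works in place on a char[] to mirror the term-buffer approach the issue
mentions.

```java
// Hypothetical sketch; not the implementation attached to LUCENE-1406.
final class ArabicSketch {
    // Assumed affix lists (longest prefixes first), written as Unicode escapes.
    private static final String[] PREFIXES = {
        "\u0648\u0627\u0644", // waw + alef + lam
        "\u0628\u0627\u0644", // beh + alef + lam
        "\u0643\u0627\u0644", // kaf + alef + lam
        "\u0641\u0627\u0644", // feh + alef + lam
        "\u0627\u0644",       // alef + lam (definite article)
        "\u0648"              // waw (conjunction)
    };
    private static final String[] SUFFIXES = {
        "\u0647\u0627", "\u0627\u0646", "\u0627\u062A", "\u0648\u0646",
        "\u064A\u0646", "\u064A\u0647", "\u064A\u0629",
        "\u0647", "\u0629", "\u064A"
    };

    /** Normalizes s[0..len) in place and returns the new length. */
    static int normalize(char[] s, int len) {
        int out = 0;
        for (int i = 0; i < len; i++) {
            char c = s[i];
            switch (c) {
                case '\u0622': case '\u0623': case '\u0625':
                    s[out++] = '\u0627'; break;   // alef variants -> bare alef
                case '\u0649':
                    s[out++] = '\u064A'; break;   // alef maksura -> yeh
                case '\u0629':
                    s[out++] = '\u0647'; break;   // teh marbuta -> heh
                case '\u0640':
                    break;                        // drop tatweel
                default:
                    // Drop harakat (fathatan through sukun); keep the rest.
                    if (c < '\u064B' || c > '\u0652') s[out++] = c;
            }
        }
        return out;
    }

    /** Light-stems s[0..len) in place and returns the new length. */
    static int stem(char[] s, int len) {
        // Strip at most one prefix. Assumed thresholds: a lone waw must
        // leave 3 characters, the other prefixes must leave 2.
        for (String p : PREFIXES) {
            int min = (p.length() == 1) ? 3 : 2;
            if (startsWith(s, len, p) && len - p.length() >= min) {
                System.arraycopy(s, p.length(), s, 0, len - p.length());
                len -= p.length();
                break;
            }
        }
        // One pass over the suffix list, stripping each suffix that matches
        // as long as at least 2 characters remain.
        for (String x : SUFFIXES) {
            if (endsWith(s, len, x) && len - x.length() >= 2) {
                len -= x.length();
            }
        }
        return len;
    }

    private static boolean startsWith(char[] s, int len, String p) {
        if (p.length() > len) return false;
        for (int i = 0; i < p.length(); i++) {
            if (s[i] != p.charAt(i)) return false;
        }
        return true;
    }

    private static boolean endsWith(char[] s, int len, String x) {
        int off = len - x.length();
        if (off < 0) return false;
        for (int i = 0; i < x.length(); i++) {
            if (s[off + i] != x.charAt(i)) return false;
        }
        return true;
    }

    /** Convenience wrapper for experiments; unlike the filters, it allocates. */
    static String analyze(String term) {
        char[] buf = term.toCharArray();
        int len = normalize(buf, buf.length);
        len = stem(buf, len);
        return new String(buf, 0, len);
    }
}
```

Running normalization before stemming means the stemmer only ever sees the
folded forms (for example, a final teh marbuta has already become heh), which
is presumably why the issue lists the normalization filter first.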