Regarding ArabicLetterTokenizer and the StandardTokenizer - best of both worlds!

2009-02-20 Thread Yusuf Aaji
Hi everyone, my question is related to the Arabic analysis package under org.apache.lucene.analysis.ar. It is cool and does a great job, but it uses a special tokenizer: ArabicLetterTokenizer. The problem with this tokenizer is that it fails to handle emails, URLs, and acronyms.
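
A minimal sketch of the failure mode described above, assuming the Lucene 2.9-era contrib analyzers and the TermAttribute/incrementToken API; the class name and sample text are illustrative, not from this thread. ArabicLetterTokenizer splits on every non-letter character, so '@', '.', ':' and '/' shred an e-mail address or URL, while the classic StandardTokenizer keeps e-mail addresses and host names as single tokens.

```java
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ar.ArabicLetterTokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class TokenizerComparison {

  // Prints every token the stream produces, one per line.
  static void dump(String label, TokenStream ts) throws Exception {
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    System.out.println(label + ":");
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println("  [" + term.term() + "]");
    }
    ts.close();
  }

  public static void main(String[] args) throws Exception {
    // Arabic words followed by an e-mail address and a URL (source file must be UTF-8).
    String text = "مرحبا بالعالم someone@example.com http://lucene.apache.org";

    // Splits on every non-letter character, breaking the e-mail and URL into fragments.
    dump("ArabicLetterTokenizer",
         new ArabicLetterTokenizer(new StringReader(text)));

    // Classic grammar keeps e-mail addresses and host names together as single tokens.
    dump("StandardTokenizer",
         new StandardTokenizer(Version.LUCENE_29, new StringReader(text)));
  }
}
```

Building this needs lucene-core plus the contrib analyzers jar that ships org.apache.lucene.analysis.ar.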

Re: Regarding ArabicLetterTokenizer and the StandardTokenizer - best of both worlds!

2009-02-20 Thread Grant Ingersoll
It's been a few years since I've worked on Arabic, but it sounds reasonable. Care to submit a patch with unit tests showing the StandardTokenizer properly handling all Arabic characters? http://wiki.apache.org/lucene-java/HowToContribute
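
A rough sketch of the kind of test such a patch could include, assuming JUnit 3 and the Lucene 2.9 attribute API; the expected token lists are my assumptions about the classic StandardTokenizer grammar (Arabic letters fall inside its LETTER range, and e-mail addresses are kept whole), not output verified in this thread.

```java
import java.io.StringReader;

import junit.framework.TestCase;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;
import org.apache.lucene.util.Version;

public class TestStandardTokenizerArabic extends TestCase {

  // Tokenizes the text and checks the terms against the expected sequence.
  private void assertTokens(String text, String[] expected) throws Exception {
    TokenStream ts = new StandardTokenizer(Version.LUCENE_29, new StringReader(text));
    TermAttribute term = ts.addAttribute(TermAttribute.class);
    ts.reset();
    for (String e : expected) {
      assertTrue("missing token: " + e, ts.incrementToken());
      assertEquals(e, term.term());
    }
    assertFalse("unexpected extra token", ts.incrementToken());
    ts.close();
  }

  public void testArabicWords() throws Exception {
    // Plain Arabic words should come through unchanged as single tokens.
    assertTokens("مرحبا بالعالم", new String[] { "مرحبا", "بالعالم" });
  }

  public void testArabicWithEmail() throws Exception {
    // The e-mail address should survive as one token next to the Arabic words.
    assertTokens("راسلني على test@example.com",
                 new String[] { "راسلني", "على", "test@example.com" });
  }
}
```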

Re: Regarding ArabicLetterTokenizer and the StandardTokenizer - best of both worlds!

2009-02-20 Thread Robert Muir
Yusuf, you are 100% correct; it is bad that this uses a custom tokenizer. This was my motivation for attacking it from this angle: https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel (unfinished). Otherwise, at some point jflex
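
For reference, one possible "best of both worlds" composition, sketched as an assumption rather than the committed fix: StandardTokenizer supplies the token boundaries (e-mails, hosts, acronyms) and the existing filters from org.apache.lucene.analysis.ar do the Arabic-specific work. Class name, filter choice, and ordering are illustrative.

```java
import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.ar.ArabicNormalizationFilter;
import org.apache.lucene.analysis.ar.ArabicStemFilter;
import org.apache.lucene.analysis.standard.StandardFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Arabic analyzer built on StandardTokenizer instead of ArabicLetterTokenizer.
public class StandardArabicAnalyzer extends Analyzer {
  public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(Version.LUCENE_29, reader);
    result = new StandardFilter(result);            // drops trailing 's and dots in acronyms
    result = new LowerCaseFilter(result);           // affects Latin-script tokens only
    result = new ArabicNormalizationFilter(result); // normalizes alef/yeh/teh-marbuta forms, strips diacritics
    result = new ArabicStemFilter(result);          // light stemming of common prefixes and suffixes
    return result;
  }
}
```

A stop filter could be slotted in right after the tokenizer, much as the stock ArabicAnalyzer does.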