Hi everyone,
My question is about the Arabic analysis package under:
org.apache.lucene.analysis.ar
It is cool and does a great job, but it uses a special tokenizer:
ArabicLetterTokenizer
The problem with this tokenizer is that it fails to handle emails, URLs,
and acronyms.
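To illustrate the problem: a letter-only tokenizer emits a token for each maximal run of letters and discards everything else, so an email address or URL shatters into fragments. The sketch below is not the actual ArabicLetterTokenizer code, just a minimal stand-in for the letter-splitting behavior it inherits:

```java
import java.util.ArrayList;
import java.util.List;

public class LetterTokenizerDemo {

    // Split on every non-letter character, the way a letter-based
    // tokenizer does. Punctuation like '@' and '.' acts as a break,
    // so "user@example.com" becomes three separate tokens.
    static List<String> letterTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (Character.isLetter(cp)) {
                current.appendCodePoint(cp);
            } else if (current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The email address is split into "user", "example", "com"
        // instead of surviving as a single token.
        System.out.println(letterTokenize("contact me at user@example.com"));
    }
}
```

A tokenizer with grammar rules for emails and URLs (which is what StandardTokenizer's jflex grammar provides) would keep `user@example.com` intact as one token.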
It's been a few years since I've worked on Arabic, but it sounds
reasonable. Care to submit a patch with unit tests showing the
StandardTokenizer properly handling all Arabic characters? http://wiki.apache.org/lucene-java/HowToContribute
On Feb 20, 2009, at 6:22 AM, Yusuf Aaji wrote:
Hi Yusuf,
You are 100% correct; it is bad that this uses a custom tokenizer.
This was my motivation for attacking it from this angle:
https://issues.apache.org/jira/browse/LUCENE-1488?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
(unfinished)
otherwise, at some point jflex