Re: Regarding ArabicLetterTokenizer and the StandardTokenizer - best of both worlds!

Grant Ingersoll Fri, 20 Feb 2009 04:04:13 -0800

It's been a few years since I've worked on Arabic, but it sounds reasonable. Care to submit a patch with unit tests showing the StandardTokenizer properly handling all Arabic characters? http://wiki.apache.org/lucene-java/HowToContribute


On Feb 20, 2009, at 6:22 AM, Yusuf Aaji wrote:

Hi Everyone,
My question is related to the Arabic analysis package under: org.apache.lucene.analysis.ar
It is cool and it is doing a great job, but it uses a special tokenizer: ArabicLetterTokenizer
The problem with this tokenizer is that it fails to handle emails, URLs and acronyms the same way the StandardTokenizer does.
The problem with the StandardTokenizer, in turn, is that it fails to handle Arabic diacritics correctly, so it splits words which shouldn't be split.
Arabic diacritics are (as listed in the class org.apache.lucene.analysis.ar.ArabicNormalizer):
FATHATAN = '\u064B';
DAMMATAN = '\u064C';
KASRATAN = '\u064D';
FATHA = '\u064E';
DAMMA = '\u064F';
KASRA = '\u0650';
SHADDA = '\u0651';
SUKUN = '\u0652';


so it is the range [\u064B-\u0652]
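A small check of why these characters break words, assuming (as I understand the JFlex 1.4 manual) that [:letter:] follows Java's Character.isLetter(). The class and method names below are made up for illustration only. Each code point in the range is a non-spacing mark rather than a letter, so a letter-based rule ends the token at the diacritic:

```java
public class ArabicDiacriticCheck {

    /**
     * Returns true if every code point in [first, last] is classified by Java
     * as a non-spacing mark and is NOT a letter.
     */
    static boolean allMarksNotLetters(int first, int last) {
        for (int cp = first; cp <= last; cp++) {
            boolean isMark = Character.getType(cp) == Character.NON_SPACING_MARK;
            if (Character.isLetter(cp) || !isMark) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // FATHATAN (0x064B) through SUKUN (0x0652), the range quoted above.
        for (int cp = 0x064B; cp <= 0x0652; cp++) {
            System.out.printf("U+%04X isLetter=%b%n", cp, Character.isLetter(cp));
        }
        System.out.println("all marks, none letters: " + allMarksNotLetters(0x064B, 0x0652));
    }
}
```

By contrast, a base letter such as BEH (0x0628) does satisfy Character.isLetter(), which fits the behaviour described: the letters are kept and the token is cut at each diacritic.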
Is it possible to modify StandardTokenizerImpl to treat these diacritics as normal letters?
I guess it should be done the same way it is done for Chinese and Japanese in this line in the file StandardTokenizerImpl.jflex:
// Chinese and Japanese (but NOT Korean, which is included in [:letter:])
CJ = [\u3100-\u312f\u3040-\u309F\u30A0-\u30FF\u31F0-\u31FF\u3300-\u337f\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff\uff65-\uff9f]
so it can be something like:

AR = [\u064B-\u0652]


then modify this line also to include our new group of characters:
// From the JFlex manual: "the expression that matches everything of <a> not matched by <b> is !(!<a>|<b>)"
LETTER     = !(![:letter:]|{CJ}|{AR})
Am I right, and am I going in the right direction? Comments are very welcome.
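One way to sanity-check the LETTER expression before touching the grammar is to model it with plain boolean predicates. This is only a sketch: it again assumes [:letter:] behaves like Character.isLetter(), and the class and method names are hypothetical. Read per character, !(!<a>|<b>|<c>) means "in a, but not in b and not in c", so putting {AR} inside the negation removes the diacritics from LETTER instead of adding them. A union outside the negation, for example LETTER = !(![:letter:]|{CJ}) | {AR}, is one untested alternative.

```java
public class LetterClassModel {

    // Ranges copied from the CJ macro quoted above.
    static boolean isCJ(int c) {
        return (c >= 0x3100 && c <= 0x312F) || (c >= 0x3040 && c <= 0x309F)
            || (c >= 0x30A0 && c <= 0x30FF) || (c >= 0x31F0 && c <= 0x31FF)
            || (c >= 0x3300 && c <= 0x337F) || (c >= 0x3400 && c <= 0x4DBF)
            || (c >= 0x4E00 && c <= 0x9FFF) || (c >= 0xF900 && c <= 0xFAFF)
            || (c >= 0xFF65 && c <= 0xFF9F);
    }

    // The proposed AR macro: Arabic diacritics FATHATAN..SUKUN.
    static boolean isAR(int c) {
        return c >= 0x064B && c <= 0x0652;
    }

    // Model of: !(![:letter:]|{CJ}|{AR})  i.e. letter AND NOT CJ AND NOT AR
    static boolean proposedLetter(int c) {
        return !(!Character.isLetter(c) || isCJ(c) || isAR(c));
    }

    // Model of: !(![:letter:]|{CJ}) | {AR}  i.e. (letter AND NOT CJ) OR AR
    static boolean alternativeLetter(int c) {
        return !(!Character.isLetter(c) || isCJ(c)) || isAR(c);
    }

    public static void main(String[] args) {
        int fatha = 0x064E;
        System.out.println("proposed accepts FATHA:    " + proposedLetter(fatha));
        System.out.println("alternative accepts FATHA: " + alternativeLetter(fatha));
    }
}
```

In this model both forms still accept ordinary letters and still reject CJ characters; they differ only on the diacritic range. The real grammar would need to be regenerated with JFlex and covered by unit tests to confirm the tokenizer behaves the same way.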
Regards..


Yusuf



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:

http://www.lucidimagination.com/search
