I am working on indexing Arabic documents containing Arabic diacritics and 
dotless characters (old Arabic characters). I am running Apache Tomcat, and I 
am using my own modified version of the AraMorph analyzer as the Arabic 
analyzer. In the development environment I managed to normalize the Arabic 
diacritics and dotless characters (the same concept as in 
solr.ArabicNormalizationFilterFactory), and I can verify that the analyzer 
works fine and produces the correct stem for Arabic words. The input text file 
used for testing is UTF-8 encoded.
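
The normalization step works conceptually like the sketch below (this is not 
the actual AraMorph code; the class name and the character range are 
illustrative, and it assumes the Lucene 3.x attribute API):

import java.io.IOException;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Illustrative filter: strips the Arabic harakat (U+064B..U+0652) from each
// token in place, so e.g. "حِباً" and "حبا" index to the same term.
public final class DiacriticStripFilter extends TokenFilter {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public DiacriticStripFilter(TokenStream input) {
        super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        char[] buf = termAtt.buffer();
        int len = termAtt.length();
        int out = 0;
        for (int i = 0; i < len; i++) {
            char c = buf[i];
            // Keep everything except the fathatan..sukun range.
            if (c < '\u064B' || c > '\u0652') {
                buf[out++] = c;
            }
        }
        termAtt.setLength(out);
        return true;
    }
}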

When I build the AraMorph jar file and place it under the Solr lib directory, 
the diacritics and dotless characters split the word. I made sure that 
server.xml contains URIEncoding="UTF-8" on the connector.
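
The HTTP connector entry looks like this (the other attributes shown are the 
stock Tomcat defaults and may differ; URIEncoding is the relevant addition):

<Connector port="8080" protocol="HTTP/1.1"
           connectionTimeout="20000"
           redirectPort="8443"
           URIEncoding="UTF-8"/>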

I also made sure that the text sent to Solr via SolrJ is UTF-8 encoded, for 
example:

solr.addBean(new Doc("4", new String("حِباًَ".getBytes("UTF8"))));
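
For completeness, the indexing call is made roughly like this (a minimal 
sketch; the URL is my local one, Doc is my annotated bean class, and it 
assumes the standard SolrJ CommonsHttpSolrServer client):

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class IndexTest {
    public static void main(String[] args) throws Exception {
        // Local Solr instance from my dev setup.
        SolrServer solr = new CommonsHttpSolrServer("http://localhost:8983/solr");
        // Doc is my @Field-annotated bean with an id and a text field.
        solr.addBean(new Doc("4", new String("حِباًَ".getBytes("UTF8"))));
        solr.commit();
    }
}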

But nothing has worked so far.

I also tried the analysis page in the Solr admin UI, for both indexing and 
querying, and both show that the Arabic word is split wherever a diacritic or 
dotless character is found.

Do you have any idea what the problem might be?


Schema snippet:

<fieldType name="text" class="solr.TextField">
  <analyzer type="index"
            class="gpl.pierrick.brihaye.aramorph.lucene.ArabicNormalizeStemmer"/>
  <analyzer type="query"
            class="gpl.pierrick.brihaye.aramorph.lucene.ArabicNormalizeStemmer"/>
</fieldType>

I also added the following parameter to the JVM: -Dfile.encoding=UTF-8
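
For reference, it is passed through Tomcat's JAVA_OPTS (assuming catalina.sh 
is used to start the server; the exact startup script may differ):

JAVA_OPTS="$JAVA_OPTS -Dfile.encoding=UTF-8"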

Thanks,
engy
