Re: TokenFilter not working at index time

Ahmet Arslan Tue, 24 Jun 2014 08:46:22 -0700

Hi Erlend,

After a quick look: I have implemented a similar TokenFilter that injects 
several tokens at the same position.


Please see the source code of Zemberek2DeasciifyFilter in 
https://github.com/iorixxx/lucene-solr-analysis-turkish 
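
As a rough illustration of what "several tokens at the same position" means, 
here is a Lucene-free model in plain Java. The Token class, the expand method 
and the lemma map are hypothetical names made up for this sketch; they are not 
code from the repository above. In a real TokenFilter the same effect is 
typically achieved with captureState()/restoreState() and by setting the 
PositionIncrementAttribute to 0 on the extra tokens.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SamePositionInjection {

    /** Hypothetical stand-in for a token: a term plus its position increment. */
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    /**
     * Replaces each term by its lemmas. The first lemma advances the position
     * by 1; every further lemma gets increment 0, i.e. it is stacked on the
     * same position. Terms without an entry are passed through unchanged.
     */
    static List<Token> expand(List<String> terms, Map<String, String[]> lemmas) {
        List<Token> out = new ArrayList<>();
        for (String term : terms) {
            String[] values = lemmas.get(term);
            if (values == null || values.length == 0) {
                values = new String[] { term };
            }
            for (int i = 0; i < values.length; i++) {
                out.add(new Token(values[i], i == 0 ? 1 : 0));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String[]> lemmas = new HashMap<>();
        lemmas.put("studenter", new String[] { "student" });
        // Placeholder entry with two lemmas, invented for the demonstration.
        lemmas.put("word", new String[] { "lemmaA", "lemmaB" });

        for (Token t : expand(Arrays.asList("studenter", "word"), lemmas)) {
            System.out.println(t.term + " (+" + t.positionIncrement + ")");
        }
        // prints: student (+1), lemmaA (+1), lemmaB (+0), one per line
    }
}
```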


You can insert your line: final String[] values = 
stemmer.stem(termAtt.buffer()); into it.
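
One thing that may be worth checking when you do that, and this is only a 
guess since I have not run your code: termAtt.buffer() returns the whole 
internal char[] of the CharTermAttribute, and only the first termAtt.length() 
characters belong to the current term. The array is reused between tokens, so 
anything after that can be left over from an earlier, longer token. The 
plain-Java sketch below imitates this with an ordinary char[]; WORDLIST, fill 
and the two lookup methods are hypothetical stand-ins, not your stemmer.

```java
import java.util.HashMap;
import java.util.Map;

class TermBufferPitfall {

    /** Hypothetical one-entry dictionary standing in for the real wordlist. */
    static final Map<String, String> WORDLIST = new HashMap<>();
    static {
        WORDLIST.put("studenter", "student");
    }

    /** Copies a term into the start of a reused buffer and returns its length. */
    static int fill(char[] buffer, String term) {
        term.getChars(0, term.length(), buffer, 0);
        return term.length();
    }

    /** Looks up the entire array; stale characters after the term break the match. */
    static String lookupWholeBuffer(char[] buffer) {
        return WORDLIST.get(new String(buffer).trim());
    }

    /** Looks up only the valid prefix of the array. */
    static String lookupValidPrefix(char[] buffer, int length) {
        return WORDLIST.get(new String(buffer, 0, length));
    }

    public static void main(String[] args) {
        // Fresh buffer: the tail is all '\0', which trim() removes, so both work.
        char[] fresh = new char[16];
        int freshLen = fill(fresh, "studenter");
        System.out.println(lookupWholeBuffer(fresh));           // student
        System.out.println(lookupValidPrefix(fresh, freshLen)); // student

        // Reused buffer: a longer term was stored first and leaves a stale tail.
        char[] reused = new char[16];
        fill(reused, "universitetet");
        int len = fill(reused, "studenter");
        System.out.println(lookupWholeBuffer(reused));          // null
        System.out.println(lookupValidPrefix(reused, len));     // student
    }
}
```

If this turns out to be the cause, passing termAtt.length() along with 
termAtt.buffer() (or using termAtt.toString()) should avoid it. It would also 
be consistent with a single word looking fine on the Analysis page while 
documents indexed as a stream of many tokens do not match.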


Another note: you can use o.a.l.analysis.util.CharArrayMap<String> instead of 
Map<String, String> wordlist for efficiency.

Please see TurkishDeasciifyFilter for example usage.

Let us know if that works for you.

Ahmet


On Tuesday, June 24, 2014 3:00 PM, Erlend Garåsen <e.f.gara...@usit.uio.no> 
wrote:

I'm trying to create a Norwegian lemmatizer based on a dictionary, but 
for some odd reason I don't get any search results even though the 
Analysis page in Solr Admin shows that it does the right thing. It works at 
query time if I have reindexed everything based on another stemmer, e.g. 
NorwegianMinimalStemmer.

Here's a screenshot of how it lemmatizes the Norwegian word "studenter" 
(masculine indefinite noun, plural - English: "students"). The stem is 
"student". So far so good:
http://folk.uio.no/erlendfg/solr/lemmatizer.png

But I get few or no results if I search for "studenter" compared to 
"student". If I switch to solr.NorwegianMinimalStemFilterFactory in 
schema.xml at index time and reindex everything, it works as it should:
<analyzer type="index">
   <filter class="solr.NorwegianMinimalStemFilterFactory" variant="no"/>

What is wrong with my TokenFilter and/or how can I debug this further? I 
have tried a lot of different things without any luck, for example 
explicitly decoding everything to UTF-8 (the wordlist is in ISO-8859-1, but 
I'm reading it properly by setting the correct character set) and trimming 
all the words, neither of which helped. The byte sequence also seems to be 
correct for the stemmed word. My lemmatizer shows [73 74 75 64 65 6e 
74], exactly the same as when I have configured 
NorwegianMinimalStemFilterFactory in schema.xml.

Here's the source code of my lemmatizer. Please note that it is not 
finished:
http://folk.uio.no/erlendfg/solr/

Here's the line in my wordlist which contains the word "studenter":
66235    student    studenter    subst mask appell fl ub normert    700    3

The following line returns the stem (input is "studenter"):
final String[] values = stemmer.stem(termAtt.buffer());

The rest of the code is in NorwegianLemmatizerFilter. If several stems 
are returned, they are all added.

Erlend
