I have a requirement to handle synonyms differently based on the first word
(token) in the text field of the document. I have implemented a custom
SynonymFilterFactory which loads synonyms per language when the Solr core is
started.
Now, in the MySynonymFilterFactory#create(TokenStream input) method, I have
to read the first token from the input TokenStream. Based on that token's
value, the corresponding SynonymMap will be used to create the SynonymFilter.
Here are my documents:
doc1 <text>lang_eng this is English language text</text>
doc2 <text>lang_fra this is French language text</text>
doc3 <text>lang_spa this is Spanish language text</text>
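To make the intent concrete, the key selection I do on that first token boils down to the following self-contained sketch (langKey is just an illustrative helper, not part of my factory; in the factory the fallback to "generic" happens against the keys of langSynMap):

```java
import java.util.Set;

// Stand-alone sketch of the key selection done on the first token:
// "lang_eng" -> "eng" if we have synonyms loaded for it, else "generic".
class LangKey {
    static String langKey(String firstToken, Set<String> knownLangs) {
        if (firstToken != null && firstToken.startsWith("lang_")) {
            String lang = firstToken.substring("lang_".length());
            if (knownLangs.contains(lang)) {
                return lang;
            }
        }
        return "generic";
    }
}
```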
MySynonymFilterFactory creates MySynonymFilter. The logic of the create()
method is below:
@Override
public TokenStream create(TokenStream input) {
    // if the fst is null, it means there are actually no synonyms... just return
    // the original stream as there is nothing to do here.
    // return map.fst == null ? input : new MySynonymFilter(input, map, ignoreCase);
    System.out.println("input=" + input);
    // somehow read the TokenStream here to capture the lang value
    SynonymMap synonyms = null;
    try {
        CharTermAttribute termAtt = input.addAttribute(CharTermAttribute.class);
        boolean first = false;
        input.reset();
        while (!first && input.incrementToken()) {
            String term = new String(termAtt.buffer(), 0, termAtt.length());
            System.out.println("termAtt=" + term);
            if (StringUtils.startsWith(term, "lang_")) {
                String[] split = StringUtils.split(term, "_");
                String lang = split[1];
                String key = (langSynMap.containsKey(lang)) ? lang : "generic";
                synonyms = langSynMap.get(key);
                System.out.println("synonyms=" + synonyms);
            }
            first = true;
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return synonyms == null ? input : new SynonymFilter(input, synonyms, ignoreCase);
}
This code compiles, and the new analysis works fine in the Solr admin
analysis screen. But the same setup fails with the exception below when I
try to index a document:
30273 ERROR (qtp1689843956-18) [ x:gcom] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Exception writing document id id1 to the index; possible analysis error.
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:180)
	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:68)
	at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:934)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1089)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:712)
	at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
Caused by: java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
	at org.apache.lucene.analysis.Tokenizer$1.read(Tokenizer.java:109)
	at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:527)
	at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:738)
	at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:159)
	at com.synonyms.poc.synpoc.MySynonymFilterFactory.create(MySynonymFilterFactory.java:94)
	at org.apache.solr.analysis.TokenizerChain.createComponents(TokenizerChain.java:91)
	at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:101)
	at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:101)
	at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:176)
	at org.apache.lucene.document.Field.tokenStream(Field.java:562)
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:628)
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:365)
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:321)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1477)
	at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:282)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:214)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:169)
	... 37 more
Any idea how I can read a token stream without violating the TokenStream
contract? I see a similar discussion here:
https://lucene.472066.n3.nabble.com/how-to-reuse-a-tokenStream-td850767.html,
but it doesn't help solve my problem.
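One direction I'm experimenting with, shown here only as a plain-Java model (not real Lucene code; a real version would extend TokenFilter and do this inside incrementToken(), after the consumer has already called reset()), is to defer the language detection into the filter itself instead of consuming the stream in create(). TokenSource stands in for TokenStream, and a Map<String, String> stands in for the per-language SynonymMap:

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Model of deferred language detection: nothing is read from the wrapped
// source until the consumer pulls the first token, so create() never touches
// the stream and the reset()/incrementToken() contract is not violated.
interface TokenSource {
    String next(); // null means exhausted (stand-in for incrementToken())
}

class LazyLangFilter implements TokenSource {
    private final TokenSource input;
    private final Map<String, Map<String, String>> synByLang;
    private Map<String, String> active; // chosen lazily from the first token

    LazyLangFilter(TokenSource input, Map<String, Map<String, String>> synByLang) {
        this.input = input;
        this.synByLang = synByLang;
    }

    @Override
    public String next() {
        String tok = input.next();
        if (tok == null) {
            return null;
        }
        if (active == null) { // first token: pick the per-language map here
            if (tok.startsWith("lang_")) {
                String lang = tok.substring("lang_".length());
                active = synByLang.getOrDefault(lang, synByLang.get("generic"));
                return next(); // swallow the marker token itself
            }
            active = synByLang.get("generic");
        }
        return active.getOrDefault(tok, tok); // substitute if a synonym exists
    }
}

class LazyLangDemo {
    public static void main(String[] args) {
        Iterator<String> it = List.of("lang_fra", "maison", "rouge").iterator();
        TokenSource source = () -> it.hasNext() ? it.next() : null;
        TokenSource filtered = new LazyLangFilter(source,
                Map.of("fra", Map.of("maison", "domicile"), "generic", Map.of()));
        for (String t = filtered.next(); t != null; t = filtered.next()) {
            System.out.println(t); // prints "domicile", then "rouge"
        }
    }
}
```

Whether swallowing the lang_ marker token is right is a separate design choice; the point is only that no read happens before the consumer's own reset().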
Also, how come the same error is not reported when analyzing the field value
using the Solr admin console analysis screen?
Thanks