I have a requirement to handle synonyms differently based on the first word
(token) in the text field of the document. I have implemented a custom
SynonymFilterFactory which loads synonyms per language when the Solr core is
started.

Now, in the MySynonymFilterFactory#create(TokenStream input) method, I have
to read the first token from the input TokenStream. Based on that token
value, the corresponding SynonymMap will be used to create the SynonymFilter.

Here are my documents:

doc1 <text>lang_eng this is English language text</text>
doc2 <text>lang_fra this is French language text</text>
doc3 <text>lang_spa this is Spanish language text</text>
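
For reference, the factory is wired into the field type's analyzer chain
roughly like this (the field type name and attribute values are illustrative;
the tokenizer and factory class match what appears in the stack trace below):

```xml
<fieldType name="text_lang_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- StandardTokenizer produces the tokens the filter factory consumes -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- custom factory that picks a SynonymMap based on the lang_* token -->
    <filter class="com.synonyms.poc.synpoc.MySynonymFilterFactory" ignoreCase="true"/>
  </analyzer>
</fieldType>
```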

MySynonymFilterFactory creates MySynonymFilter. The create() method logic is
below:

@Override
public TokenStream create(TokenStream input) {
    // If the fst is null, it means there are actually no synonyms... just
    // return the original stream, as there is nothing to do here.
    // return map.fst == null ? input : new MySynonymFilter(input, map, ignoreCase);

    System.out.println("input=" + input);

    // Somehow read the TokenStream here to capture the lang value
    SynonymMap synonyms = null;
    try {
        CharTermAttribute termAtt = input.addAttribute(CharTermAttribute.class);
        boolean first = false;
        input.reset();
        while (!first && input.incrementToken()) {
            String term = new String(termAtt.buffer(), 0, termAtt.length());
            System.out.println("termAtt=" + term);
            if (StringUtils.startsWith(term, "lang_")) {
                String[] split = StringUtils.split(term, "_");
                String lang = split[1];
                String key = langSynMap.containsKey(lang) ? lang : "generic";
                synonyms = langSynMap.get(key);
                System.out.println("synonyms=" + synonyms);
            }
            first = true;
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }

    return synonyms == null ? input : new SynonymFilter(input, synonyms, ignoreCase);
}

This code compiles, and the new analysis works fine in the Solr admin
analysis screen. But the same fails with the exception below when I try to
index a document:
30273 ERROR (qtp1689843956-18) [   x:gcom] o.a.s.h.RequestHandlerBase
org.apache.solr.common.SolrException: Exception writing document id id1 to
the index; possible analysis error.
        at
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:180)
        at
org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:68)
        at
org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
        at
org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:934)
        at
org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1089)
        at
org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:712)
        at
org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
Caused by: java.lang.IllegalStateException: TokenStream contract violation:
reset()/close() call missing, reset() called multiple times, or subclass
does not call super.reset(). Please see Java
docs of TokenStream class for more information about the correct consuming
workflow.
        at org.apache.lucene.analysis.Tokenizer$1.read(Tokenizer.java:109)
        at
org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:527)
        at
org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:738)
        at
org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:159)
        at
com.synonyms.poc.synpoc.MySynonymFilterFactory.create(MySynonymFilterFactory.java:94)
        at
org.apache.solr.analysis.TokenizerChain.createComponents(TokenizerChain.java:91)
        at
org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:101)
        at
org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:101)
        at
org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:176)
        at org.apache.lucene.document.Field.tokenStream(Field.java:562)
        at
org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:628)
        at
org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:365)
        at
org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:321)
        at
org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
        at
org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
        at
org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1477)
        at
org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:282)
        at
org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:214)
        at
org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:169)
        ... 37 more

Any idea how I can read a token stream without violating the token stream
contract? I see a similar discussion here:
https://lucene.472066.n3.nabble.com/how-to-reuse-a-tokenStream-td850767.html,
but it doesn't help solve my problem.

Also, how come the same error is not reported when analyzing the field value
using the Solr admin console analysis screen?
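
One direction I am wondering about: instead of pre-consuming the stream in
create() (which seems to be what forces the extra reset()), pick the
per-language synonym map lazily inside the filter itself, when it sees the
first token. Below is a Lucene-free sketch of just that selection logic; all
names (analyze, langSynMap, the Map-based "synonym maps") are illustrative
stand-ins, not the actual Lucene API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: the first "lang_*" token selects the language's synonym map;
// subsequent tokens are rewritten through it. In a real TokenFilter this
// selection would happen on the first incrementToken() call, so the stream
// is only consumed once by its normal consumer.
public class LangSynSketch {

    static List<String> analyze(List<String> tokens,
                                Map<String, Map<String, String>> langSynMap) {
        List<String> out = new ArrayList<>();
        Map<String, String> synonyms = null; // chosen lazily, on the first token
        for (String term : tokens) {
            if (synonyms == null) {
                // First token: pick the language map, emit the token unchanged
                String lang = term.startsWith("lang_")
                        ? term.substring("lang_".length()) : "generic";
                synonyms = langSynMap.containsKey(lang)
                        ? langSynMap.get(lang) : langSynMap.get("generic");
                out.add(term);
            } else {
                out.add(synonyms.getOrDefault(term, term));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> langSynMap = new HashMap<>();
        langSynMap.put("generic", Map.of());
        langSynMap.put("eng", Map.of("fast", "quick"));
        System.out.println(analyze(List.of("lang_eng", "fast", "car"), langSynMap));
        // prints [lang_eng, quick, car]
    }
}
```

Would something along these lines, done inside the filter's own token loop,
avoid the double reset() on the underlying Tokenizer?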

Thanks
