I have a requirement to handle synonyms differently based on the first word
(token) in the text field of the document. I have implemented a custom
SynonymFilterFactory which loads synonyms per language when the Solr core is
started.
Now, in the MySynonymFilterFactory#create(TokenStream input) method, I have
to read the first token from the input TokenStream. Based on that token's
value, the corresponding SynonymMap will be used to create the SynonymFilter.
Here are my documents:
doc1 <text>lang_eng this is English language text</text>
doc2 <text>lang_fra this is French language text</text>
doc3 <text>lang_spa this is Spanish language text</text>
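To make the intent concrete, the key selection I do on that first token boils down to the following self-contained sketch (langKey is just an illustrative helper, not part of my factory; in the factory the fallback to "generic" happens against the keys of langSynMap):

```java
import java.util.Set;

// Stand-alone sketch of the key selection done on the first token:
// "lang_eng" -> "eng" if we have synonyms loaded for it, else "generic".
class LangKey {
    static String langKey(String firstToken, Set<String> knownLangs) {
        if (firstToken != null && firstToken.startsWith("lang_")) {
            String lang = firstToken.substring("lang_".length());
            if (knownLangs.contains(lang)) {
                return lang;
            }
        }
        return "generic";
    }
}
```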
MySynonymFilterFactory creates MySynonymFilter. The logic of the create()
method is below:
@Override
public TokenStream create(TokenStream input) {
    // if the fst is null, it means there are actually no synonyms... just return
    // the original stream as there is nothing to do here.
    // return map.fst == null ? input : new MySynonymFilter(input, map, ignoreCase);
    System.out.println("input=" + input);
    // somehow read the TokenStream here to capture the lang value
    SynonymMap synonyms = null;
    try {
        CharTermAttribute termAtt = input.addAttribute(CharTermAttribute.class);
        boolean first = false;
        input.reset();
        while (!first && input.incrementToken()) {
            String term = new String(termAtt.buffer(), 0, termAtt.length());
            System.out.println("termAtt=" + term);
            if (StringUtils.startsWith(term, "lang_")) {
                String[] split = StringUtils.split(term, "_");
                String lang = split[1];
                String key = (langSynMap.containsKey(lang)) ? lang : "generic";
                synonyms = langSynMap.get(key);
                System.out.println("synonyms=" + synonyms);
            }
            first = true;
        }
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
    return synonyms == null ? input : new SynonymFilter(input, synonyms, ignoreCase);
}
This code compiles, and the new analysis works fine in the Solr admin
analysis screen. But the same setup fails with the exception below when I
try to index a document:
30273 ERROR (qtp1689843956-18) [ x:gcom] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Exception writing document id id1 to the index; possible analysis error.
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:180)
	at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:68)
	at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:934)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1089)
	at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:712)
	at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
Caused by: java.lang.IllegalStateException: TokenStream contract violation: reset()/close() call missing, reset() called multiple times, or subclass does not call super.reset(). Please see Javadocs of TokenStream class for more information about the correct consuming workflow.
	at org.apache.lucene.analysis.Tokenizer$1.read(Tokenizer.java:109)
	at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:527)
	at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:738)
	at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:159)
	at com.synonyms.poc.synpoc.MySynonymFilterFactory.create(MySynonymFilterFactory.java:94)
	at org.apache.solr.analysis.TokenizerChain.createComponents(TokenizerChain.java:91)
	at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:101)
	at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:101)
	at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:176)
	at org.apache.lucene.document.Field.tokenStream(Field.java:562)
	at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:628)
	at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:365)
	at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:321)
	at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
	at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
	at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1477)
	at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:282)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:214)
	at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:169)
	... 37 more
Any idea how I can read a token stream without violating the TokenStream
contract? I see a similar discussion here:
https://lucene.472066.n3.nabble.com/how-to-reuse-a-tokenStream-td850767.html,
but it doesn't help solve my problem.
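One direction I'm experimenting with, shown here only as a plain-Java model (not real Lucene code; a real version would extend TokenFilter and do this inside incrementToken(), after the consumer has already called reset()), is to defer the language detection into the filter itself instead of consuming the stream in create(). TokenSource stands in for TokenStream, and a Map<String, String> stands in for the per-language SynonymMap:

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Model of deferred language detection: nothing is read from the wrapped
// source until the consumer pulls the first token, so create() never touches
// the stream and the reset()/incrementToken() contract is not violated.
interface TokenSource {
    String next(); // null means exhausted (stand-in for incrementToken())
}

class LazyLangFilter implements TokenSource {
    private final TokenSource input;
    private final Map<String, Map<String, String>> synByLang;
    private Map<String, String> active; // chosen lazily from the first token

    LazyLangFilter(TokenSource input, Map<String, Map<String, String>> synByLang) {
        this.input = input;
        this.synByLang = synByLang;
    }

    @Override
    public String next() {
        String tok = input.next();
        if (tok == null) {
            return null;
        }
        if (active == null) { // first token: pick the per-language map here
            if (tok.startsWith("lang_")) {
                String lang = tok.substring("lang_".length());
                active = synByLang.getOrDefault(lang, synByLang.get("generic"));
                return next(); // swallow the marker token itself
            }
            active = synByLang.get("generic");
        }
        return active.getOrDefault(tok, tok); // substitute if a synonym exists
    }
}

class LazyLangDemo {
    public static void main(String[] args) {
        Iterator<String> it = List.of("lang_fra", "maison", "rouge").iterator();
        TokenSource source = () -> it.hasNext() ? it.next() : null;
        TokenSource filtered = new LazyLangFilter(source,
                Map.of("fra", Map.of("maison", "domicile"), "generic", Map.of()));
        for (String t = filtered.next(); t != null; t = filtered.next()) {
            System.out.println(t); // prints "domicile", then "rouge"
        }
    }
}
```

Whether swallowing the lang_ marker token is right is a separate design choice; the point is only that no read happens before the consumer's own reset().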
Also, how come the same error is not reported when analyzing the field value
using the Solr admin console analysis screen?
Thanks