Thanks for your input, Michael. It looks like ConditionalTokenFilter was introduced in Lucene 7.4. I have implemented a similar approach in which I read the first token from the input stream in MySynonymFilterFactory and load the language-specific SynonymMap to create MySynonymFilter. This approach seems to work when I test the field analysis on the Solr admin console analysis page, but the same code reports an error when I index a document.
When I print the TokenStream input parameter, I see that it has a different class type in each case:

**Admin console Analysis**
input=ListBasedTokenStream@3a90e594 term=,bytes=[],startOffset=13,endOffset=13,positionIncrement=0,positionLength=1,type=word,position=3,positionHistory=[Ljava.lang.Integer;@34571cb6

**Solr indexing**
input=StandardTokenizer@41448310 term=,bytes=[],startOffset=0,endOffset=0,positionIncrement=1,positionLength=1,type=word

I have added input.reset()/close()/end() calls in the constructor of MySynonymFilterFactory, but that didn't work.

So I need a way to consume a token stream twice without breaking the contract:
- consume it once to read the first token
- then reset it to its original state
- consume it in the SynonymFilter, where synonyms are applied according to the first token, which is saved as a variable in that SynonymFilter

Thanks

On 2019/10/31 08:54:02, Michael Sokolov <msoko...@gmail.com> wrote:
> Are you able to:
> 1) create a custom attribute encoding the language
> 2) create a filter that sets the attribute when it reads the first token
> 3) wrap your synonym filters (one for each language) in a
> ConditionalTokenFilter that filters based on the language attribute
>
> On Wed, Oct 30, 2019 at 11:16 PM Shyamsunder Mutcha <sjh...@gmail.com> wrote:
> >
> > I have a requirement to handle synonyms differently based on the first word
> > (token) in the text field of the document. I have implemented custom
> > SynFilterFactory which loads synonyms per languages when core/solr is
> > started.
> >
> > Now in the MySynonymFilterFactory#create(TokenStream input) method, I have
> > to read the first token from the input TokenStream. Based on that token
> > value, corresponding SynonymMap will be used for SynonymFilter creation.
> >
> > Here are my documents
> > doc1 <text>lang_eng this is English language text</text>
> > doc2 <text>lang_fra this is French language text</text>
> > doc3 <text>lang_spa this is Spanish language text</text>
> >
> > MySynonymFilterFactory creates MySynonymFilter. Method create() logic is
> > below...
> >
> > @Override
> > public TokenStream create(TokenStream input) {
> >     // if the fst is null, it means there's actually no synonyms... just return the
> >     // original stream as there is nothing to do here.
> >     // return map.fst == null ? input : new MySynonymFilter(input, map, ignoreCase);
> >     System.out.println("input=" + input);
> >     // some how read the TokenStream here to capture the lang value
> >     SynonymMap synonyms = null;
> >     try {
> >         CharTermAttribute termAtt = input.addAttribute(CharTermAttribute.class);
> >         boolean first = false;
> >         input.reset();
> >         while (!first && input.incrementToken()) {
> >             String term = new String(termAtt.buffer(), 0, termAtt.length());
> >             System.out.println("termAtt=" + term);
> >             if (StringUtils.startsWith(term, "lang_")) {
> >                 String[] split = StringUtils.split(term, "_");
> >                 String lang = split[1];
> >                 String key = (langSynMap.containsKey(lang)) ? lang : "generic";
> >                 synonyms = langSynMap.get(key);
> >                 System.out.println("synonyms=" + synonyms);
> >             }
> >             first = true;
> >         }
> >     } catch (IOException e) {
> >         // TODO Auto-generated catch block
> >         e.printStackTrace();
> >     }
> >
> >     return synonyms == null ? input : new SynonymFilter(input, synonyms, ignoreCase);
> > }
> >
> > This code compiles and this new analysis works fine in the Solr admin
> > analysis screen.
> > But same fails with below exception when I try to index a
> > document
> > 30273 ERROR (qtp1689843956-18) [ x:gcom] o.a.s.h.RequestHandlerBase
> > org.apache.solr.common.SolrException: Exception writing document id id1 to
> > the index; possible analysis error.
> >     at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:180)
> >     at org.apache.solr.update.processor.RunUpdateProcessor.processAdd(RunUpdateProcessorFactory.java:68)
> >     at org.apache.solr.update.processor.UpdateRequestProcessor.processAdd(UpdateRequestProcessor.java:48)
> >     at org.apache.solr.update.processor.DistributedUpdateProcessor.doLocalAdd(DistributedUpdateProcessor.java:934)
> >     at org.apache.solr.update.processor.DistributedUpdateProcessor.versionAdd(DistributedUpdateProcessor.java:1089)
> >     at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:712)
> >     at org.apache.solr.update.processor.LogUpdateProcessorFactory$LogUpdateProcessor.processAdd(LogUpdateProcessorFactory.java:103)
> > Caused by: java.lang.IllegalStateException: TokenStream contract violation:
> > reset()/close() call missing, reset() called multiple times, or subclass
> > does not call super.reset(). Please see Java
> > docs of TokenStream class for more information about the correct consuming
> > workflow.
> >     at org.apache.lucene.analysis.Tokenizer$1.read(Tokenizer.java:109)
> >     at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:527)
> >     at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:738)
> >     at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:159)
> >     at com.synonyms.poc.synpoc.MySynonymFilterFactory.create(MySynonymFilterFactory.java:94)
> >     at org.apache.solr.analysis.TokenizerChain.createComponents(TokenizerChain.java:91)
> >     at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:101)
> >     at org.apache.lucene.analysis.AnalyzerWrapper.createComponents(AnalyzerWrapper.java:101)
> >     at org.apache.lucene.analysis.Analyzer.tokenStream(Analyzer.java:176)
> >     at org.apache.lucene.document.Field.tokenStream(Field.java:562)
> >     at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:628)
> >     at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:365)
> >     at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:321)
> >     at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234)
> >     at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:450)
> >     at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1477)
> >     at org.apache.solr.update.DirectUpdateHandler2.doNormalUpdate(DirectUpdateHandler2.java:282)
> >     at org.apache.solr.update.DirectUpdateHandler2.addDoc0(DirectUpdateHandler2.java:214)
> >     at org.apache.solr.update.DirectUpdateHandler2.addDoc(DirectUpdateHandler2.java:169)
> >     ... 37 more
> >
> > Any idea how can I read a token stream with out violating the token stream
> > contract.
> > I see a similar discussion here
> > https://lucene.472066.n3.nabble.com/how-to-reuse-a-tokenStream-td850767.html,
> > but doesn't help solve my problem.
> >
> > Also how come same error is not reported when analyzing the field value
> > using Solr admin console analysis screen.
> >
> > Thanks
> >
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
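
For reference, the idea behind the three numbered steps quoted above (pick the language while tokens are flowing, instead of reading the stream in the factory) can be sketched without Lucene at all. The sketch below is only a toy model in plain Java and makes several assumptions: LazyLanguageSynonymFilter, chooseSynonyms and drain are hypothetical names, java.util.Iterator stands in for TokenStream, and a one-word-to-one-word Map stands in for SynonymMap. It is not Lucene API and not a drop-in fix; it only shows the control flow in which the constructor reads nothing and the first token is inspected on the first call that produces a token, so the upstream source is consumed exactly once.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Toy model only: Iterator<String> stands in for a token stream and a plain
// Map stands in for a synonym table. None of this is Lucene API.
final class LazyLanguageSynonymFilter implements Iterator<String> {
    private final Iterator<String> input;
    private final Map<String, Map<String, String>> synonymsByLang;
    private final Deque<String> pending = new ArrayDeque<>();
    private Map<String, String> synonyms; // stays null until the first token is seen

    LazyLanguageSynonymFilter(Iterator<String> input,
                              Map<String, Map<String, String>> synonymsByLang) {
        // Nothing is read here, so the upstream source is never consumed early.
        this.input = input;
        this.synonymsByLang = synonymsByLang;
    }

    @Override
    public boolean hasNext() {
        return !pending.isEmpty() || input.hasNext();
    }

    @Override
    public String next() {
        if (!pending.isEmpty()) {
            return pending.poll(); // emit a queued synonym before reading more input
        }
        String term = input.next();
        if (synonyms == null) {
            synonyms = chooseSynonyms(term); // decided once, on the first token
        }
        String synonym = synonyms.get(term);
        if (synonym != null) {
            pending.add(synonym);
        }
        return term;
    }

    // Same selection rule as the create() method in the thread: "lang_xxx" picks
    // the table for xxx, unknown languages fall back to "generic", and a stream
    // without the marker gets no synonyms at all.
    private Map<String, String> chooseSynonyms(String firstTerm) {
        if (!firstTerm.startsWith("lang_")) {
            return Map.of();
        }
        String lang = firstTerm.substring("lang_".length());
        String key = synonymsByLang.containsKey(lang) ? lang : "generic";
        return synonymsByLang.getOrDefault(key, Map.of());
    }

    // Helper that collects every token into a list.
    static List<String> drain(Iterator<String> tokens) {
        List<String> out = new ArrayList<>();
        while (tokens.hasNext()) {
            out.add(tokens.next());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> table = Map.of(
                "eng", Map.of("text", "document"),
                "generic", Map.of("text", "txt"));
        Iterator<String> source = List.of("lang_eng", "this", "text").iterator();
        // prints [lang_eng, this, text, document]
        System.out.println(drain(new LazyLanguageSynonymFilter(source, table)));
    }
}
```

In an actual Lucene filter the equivalent decision would presumably sit in incrementToken(), with the per-stream state cleared in reset(); that part is not shown here and would need to be checked against the TokenStream Javadocs.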