The problem is the CharFilter, which cannot be reused. To implement the Analyzer correctly, wrap the incoming Reader in the protected initReader() method: http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/analysis/Analyzer.html#initReader(java.lang.String, java.io.Reader). In createComponents(), only take the Reader from the parameter and create the Tokenizer + TokenFilters (which can be reused). initReader() ensures that every call to tokenStream() creates a new Reader and passes it to the reused Tokenizer.
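A minimal sketch of the corrected analyzer, following the advice above — the HTMLStripCharFilter moves from createComponents() into initReader(). This assumes the same Lucene 4.x classes and package layout as the quoted code below; treat it as illustrative, not tested against every 4.x release:

```java
import java.io.Reader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;

public final class DemoAnalyzer extends StopwordAnalyzerBase {
    public DemoAnalyzer() {
        super(Version.LUCENE_42);
    }

    // initReader() is called on every tokenStream() invocation, so each
    // reuse of the Analyzer gets a fresh (non-reusable) CharFilter here.
    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        return new HTMLStripCharFilter(reader);
    }

    // createComponents() now builds only the reusable parts
    // (Tokenizer + TokenFilters) from the already-wrapped Reader.
    @Override
    protected TokenStreamComponents createComponents(String fieldName,
                                                     Reader reader) {
        final Tokenizer source =
            new StandardTokenizer(Version.LUCENE_42, reader);
        TokenStream result = new LowerCaseFilter(Version.LUCENE_42, source);
        return new TokenStreamComponents(source, result);
    }
}
```

With this split, calling tokenStream() repeatedly on one DemoAnalyzer instance should strip the HTML tags every time, not just on the first call.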
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

> -----Original Message-----
> From: Scott Smith [mailto:[email protected]]
> Sent: Thursday, December 05, 2013 9:36 PM
> To: [email protected]
> Subject: Analyzers aren't reusable?? (lucene 4.2.1)
>
> I wrote the following to demonstrate what for me was surprising behavior
> (this is Lucene 4.2.1). If you want to run this yourself, you should be
> able to, since there are no references to anything other than standard
> Lucene and Java libraries. Basically, this is an analyzer that makes
> everything lowercase and strips all of the HTML tags.
>
> public final class DemoAnalyzer extends StopwordAnalyzerBase {
>     public DemoAnalyzer()
>     {
>         super(Version.LUCENE_42);
>     }
>
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName,
>                                                      Reader reader)
>     {
>         final Tokenizer source = new StandardTokenizer(Version.LUCENE_42,
>             new HTMLStripCharFilter(reader));
>         TokenStream result = new LowerCaseFilter(Version.LUCENE_42, source);
>         return new TokenStreamComponents(source, result);
>     }
>
>     // this is just a debug routine to display some results.
>     public static String getTokenStream(String a_zText, Analyzer a_zAnalyzer)
>         throws IOException
>     {
>         TokenStream stream;
>         CharTermAttribute attr;
>         stream = a_zAnalyzer.tokenStream(null, new StringReader(a_zText));
>         stream.reset();
>         StringBuffer sb = new StringBuffer();
>         sb.append(a_zAnalyzer.toString());
>         sb.append("::");
>         while (stream.incrementToken())
>         {
>             attr = stream.getAttribute(CharTermAttribute.class);
>             if (sb.length() > 0)
>             {
>                 sb.append(' ');
>             }
>             sb.append(attr.toString());
>         }
>
>         return "original String: " + a_zText + "\n" + sb.toString();
>     }
>
>     public static void main(String[] args) throws IOException
>     {
>         String text = "<p>This is a <b>TEST</b> of the demo analyzer</p>";
>         Analyzer a = new DemoAnalyzer();
>
>         System.out.println(getTokenStream(text, a));
>         System.out.println(getTokenStream(text, a));
>         System.out.println(getTokenStream(text, new DemoAnalyzer()));
>     }
> }
>
> When I run this, I get the following output:
>
> original String: <p>This is a <b>TEST</b> of the demo analyzer</p>
> com.somedomain.DemoAnalyzer@5d3f79f7:: this is a test of the demo analyzer
>
> original String: <p>This is a <b>TEST</b> of the demo analyzer</p>
> com.somedomain.DemoAnalyzer@5d3f79f7:: p this is a b test b of the demo
> analyzer p
>
> original String: <p>This is a <b>TEST</b> of the demo analyzer</p>
> com.somedomain.DemoAnalyzer@138532dc:: this is a test of the demo analyzer
>
> The critical line is the second of each of the 3 pairs. Note the line in
> case 2 (of 3): rather than stripping the entire HTML tag, it's just
> stripping the "<" and "/>". Is this expected behavior? I thought analyzers
> were thread-safe and reusable. Am I wrong on that point? I would expect the
> output of all three to be the same.
>
> Can someone explain to me what's going on? What am I missing?
>
> Scott
