The problem is the CharFilter, which cannot be reused. To implement the Analyzer correctly, wrap the incoming Reader in the protected initReader() method: http://lucene.apache.org/core/4_6_0/core/org/apache/lucene/analysis/Analyzer.html#initReader(java.lang.String, java.io.Reader). In createComponents(), only take the Reader from the parameter and create the Tokenizer + TokenFilters (which can be reused). initReader() ensures that every call to tokenStream() creates a new Reader and passes it to the reused Tokenizer.
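A minimal sketch of the corrected analyzer, following the advice above — the HTMLStripCharFilter moves from createComponents() into initReader(). This assumes the same Lucene 4.x classes and package layout as the quoted code below; treat it as illustrative, not tested against every 4.x release:

```java
import java.io.Reader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.charfilter.HTMLStripCharFilter;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.util.StopwordAnalyzerBase;
import org.apache.lucene.util.Version;

public final class DemoAnalyzer extends StopwordAnalyzerBase {
    public DemoAnalyzer() {
        super(Version.LUCENE_42);
    }

    // initReader() is called on every tokenStream() invocation, so each
    // reuse of the Analyzer gets a fresh (non-reusable) CharFilter here.
    @Override
    protected Reader initReader(String fieldName, Reader reader) {
        return new HTMLStripCharFilter(reader);
    }

    // createComponents() now builds only the reusable parts
    // (Tokenizer + TokenFilters) from the already-wrapped Reader.
    @Override
    protected TokenStreamComponents createComponents(String fieldName,
                                                     Reader reader) {
        final Tokenizer source =
            new StandardTokenizer(Version.LUCENE_42, reader);
        TokenStream result = new LowerCaseFilter(Version.LUCENE_42, source);
        return new TokenStreamComponents(source, result);
    }
}
```

With this split, calling tokenStream() repeatedly on one DemoAnalyzer instance should strip the HTML tags every time, not just on the first call.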
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

> -----Original Message-----
> From: Scott Smith [mailto:[email protected]]
> Sent: Thursday, December 05, 2013 9:36 PM
> To: [email protected]
> Subject: Analyzers aren't reusable?? (lucene 4.2.1)
>
> I wrote the following to demonstrate what for me was surprising behavior
> (this is Lucene 4.2.1). If you want to run this yourself, you should be
> able to, since there are no references to anything other than standard
> Lucene and Java libraries. Basically, this is an analyzer that makes
> everything lowercase and strips all of the HTML tags.
>
> public final class DemoAnalyzer extends StopwordAnalyzerBase {
>     public DemoAnalyzer()
>     {
>         super(Version.LUCENE_42);
>     }
>
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName,
>                                                      Reader reader)
>     {
>         final Tokenizer source = new StandardTokenizer(Version.LUCENE_42,
>             new HTMLStripCharFilter(reader));
>         TokenStream result = new LowerCaseFilter(Version.LUCENE_42, source);
>         return new TokenStreamComponents(source, result);
>     }
>
>     // this is just a debug routine to display some results.
>     public static String getTokenStream(String a_zText, Analyzer a_zAnalyzer)
>         throws IOException
>     {
>         TokenStream stream;
>         CharTermAttribute attr;
>         stream = a_zAnalyzer.tokenStream(null, new StringReader(a_zText));
>         stream.reset();
>         StringBuffer sb = new StringBuffer();
>         sb.append(a_zAnalyzer.toString());
>         sb.append("::");
>         while (stream.incrementToken())
>         {
>             attr = stream.getAttribute(CharTermAttribute.class);
>             if (sb.length() > 0)
>             {
>                 sb.append(' ');
>             }
>             sb.append(attr.toString());
>         }
>
>         return "original String: " + a_zText + "\n" + sb.toString();
>     }
>
>     public static void main(String[] args) throws IOException
>     {
>         String text = "<p>This is a <b>TEST</b> of the demo analyzer</p>";
>         Analyzer a = new DemoAnalyzer();
>
>         System.out.println(getTokenStream(text, a));
>         System.out.println(getTokenStream(text, a));
>         System.out.println(getTokenStream(text, new DemoAnalyzer()));
>     }
> }
>
> When I run this, I get the following output:
>
> original String: <p>This is a <b>TEST</b> of the demo analyzer</p>
> com.somedomain.DemoAnalyzer@5d3f79f7:: this is a test of the demo analyzer
>
> original String: <p>This is a <b>TEST</b> of the demo analyzer</p>
> com.somedomain.DemoAnalyzer@5d3f79f7:: p this is a b test b of the demo
> analyzer p
>
> original String: <p>This is a <b>TEST</b> of the demo analyzer</p>
> com.somedomain.DemoAnalyzer@138532dc:: this is a test of the demo analyzer
>
> The critical line is the second of each of the 3 pairs. Note the line in
> case 2 (of 3): rather than stripping the entire HTML tag, it's just
> stripping the "<" and "/>". Is this expected behavior? I thought analyzers
> were thread-safe and reusable. Am I wrong on that point? I would expect the
> output of all three to be the same.
>
> Can someone explain to me what's going on? What am I missing?
>
> Scott
