RE: HTMLStripReader, HTMLStripCharFilter

Uwe Schindler Mon, 26 Apr 2010 22:01:04 -0700

To reset this token stream you have to wrap it with a CachingTokenFilter.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]



> -----Original Message-----
> From: Justin [mailto:[email protected]]
> Sent: Tuesday, April 27, 2010 1:16 AM
> To: [email protected]
> Subject: Re: HTMLStripReader, HTMLStripCharFilter
> 
> Thanks for the update!  I appreciate the hard work.
> 
> Perhaps someone can help me with the use of HTMLStripCharFilter...
> 
> 
> I get an exception (3.1-dev) similar to the one reported here (2.9):
> 
> https://issues.apache.org/jira/browse/LUCENE-1695
> 
> 
> With the following code:
> 
>     Analyzer htmlStripAnalyzer = new ReusableAnalyzerBase() {
>         @Override
>         protected TokenStreamComponents createComponents(
>                 final String fieldName, final Reader reader) {
>             return new TokenStreamComponents(new
> StandardTokenizer(Version.LUCENE_30,
>                     new HTMLStripCharFilter(CharReader.get(reader))));
>         }
>     };
>     String content = reader.document(id, fieldSelector).get(field);
>     TokenStream ts = htmlStripAnalyzer.tokenStream(field, new
> StringReader(content));
>     String best = highlighter.getBestFragments(ts, content,
>       DEFAULT_EXCERPT_FRAGS, DEFAULT_EXCERPT_SEPARATOR);
>     OffsetAttribute off = ts.addAttribute(OffsetAttribute.class);
>     ts.reset();
>     ts.incrementToken();
> 
> 
> java.io.IOException: Stream closed
>         at java.io.StringReader.ensureOpen(StringReader.java:39)
>         at java.io.StringReader.read(StringReader.java:73)
>         at
> org.apache.lucene.analysis.CharReader.read(CharReader.java:54)
>         at java.io.Reader.read(Reader.java:104)
>         at
> org.apache.solr.analysis.HTMLStripCharFilter.next(HTMLStripCharFilter.j
> ava:92)
>         at
> org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.j
> ava:690)
>         at
> org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.j
> ava:748)
>         at
> org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(Stan
> dardTokenizerImpl.java:453)
>         at
> org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(
> StandardTokenizerImpl.java:639)
>         at
> org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(St
> andardTokenizer.java:167)
> 
> 
> Looking at the source, I wonder if Tokenizer should override reset():
> 
>   public void reset() throws IOException {
>     if (input != null) input.reset(); // would reset CharReader,
> StringReader
>   }
> 
> 
> 
> 
> 
> ----- Original Message ----
> From: Robert Muir <[email protected]>
> To: [email protected]
> Sent: Sat, April 24, 2010 9:03:02 AM
> Subject: Re: HTMLStripReader, HTMLStripCharFilter
> 
> On Fri, Apr 23, 2010 at 4:48 PM, Justin <[email protected]> wrote:
> 
> > Just out of curiousity, why does LUCENE-1377 have a minor priorty?
> >
> > https://issues.apache.org/jira/browse/LUCENE-1377
> >
> > Don't people index, filter, search HTML, perhaps more than any other
> > format?
> >
> >
> Rest assured we are working on this... but it unfortunately won't
> happen
> overnight. First of all, the development of Lucene and Solr was merged
> such
> that there is now one team working on this stuff. This way, both Solr
> and
> Lucene developers can maintain this stuff.
> 
> There is now the practical issue to combine all Lucene and Solr
> analyzers
> (not just the two components listed on that issue) into one package
> that can
> then be used by both Lucene and Solr users:
> https://issues.apache.org/jira/browse/LUCENE-2413
> 
> --
> Robert Muir
> [email protected]
> 
> 
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

RE: HTMLStripReader, HTMLStripCharFilter

Reply via email to