To reset this token stream you have to wrap it with a CachingTokenFilter. ----- Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de
> -----Original Message----- > From: Justin [mailto:cry...@yahoo.com] > Sent: Tuesday, April 27, 2010 1:16 AM > To: java-user@lucene.apache.org > Subject: Re: HTMLStripReader, HTMLStripCharFilter > > Thanks for the update! I appreciate the hard work. > > Perhaps someone can help me with the use of HTMLStripCharFilter... > > > I get an exception (3.1-dev) similar to the one reported here (2.9): > > https://issues.apache.org/jira/browse/LUCENE-1695 > > > With the following code: > > Analyzer htmlStripAnalyzer = new ReusableAnalyzerBase() { > @Override > protected TokenStreamComponents createComponents( > final String fieldName, final Reader reader) { > return new TokenStreamComponents(new > StandardTokenizer(Version.LUCENE_30, > new HTMLStripCharFilter(CharReader.get(reader)))); > } > }; > String content = reader.document(id, fieldSelector).get(field); > TokenStream ts = htmlStripAnalyzer.tokenStream(field, new > StringReader(content)); > String best = highlighter.getBestFragments(ts, content, > DEFAULT_EXCERPT_FRAGS, DEFAULT_EXCERPT_SEPARATOR); > OffsetAttribute off = ts.addAttribute(OffsetAttribute.class); > ts.reset(); > ts.incrementToken(); > > > java.io.IOException: Stream closed > at java.io.StringReader.ensureOpen(StringReader.java:39) > at java.io.StringReader.read(StringReader.java:73) > at > org.apache.lucene.analysis.CharReader.read(CharReader.java:54) > at java.io.Reader.read(Reader.java:104) > at > org.apache.solr.analysis.HTMLStripCharFilter.next(HTMLStripCharFilter.j > ava:92) > at > org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.j > ava:690) > at > org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.j > ava:748) > at > org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(Stan > dardTokenizerImpl.java:453) > at > org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken( > StandardTokenizerImpl.java:639) > at > org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(St > andardTokenizer.java:167) > > > Looking at the source, I wonder if Tokenizer should override reset(): > > public void reset() throws IOException { > if (input != null) input.reset(); // would reset CharReader, > StringReader > } > > > > > > ----- Original Message ---- > From: Robert Muir <rcm...@gmail.com> > To: java-user@lucene.apache.org > Sent: Sat, April 24, 2010 9:03:02 AM > Subject: Re: HTMLStripReader, HTMLStripCharFilter > > On Fri, Apr 23, 2010 at 4:48 PM, Justin <cry...@yahoo.com> wrote: > > > Just out of curiousity, why does LUCENE-1377 have a minor priorty? > > > > https://issues.apache.org/jira/browse/LUCENE-1377 > > > > Don't people index, filter, search HTML, perhaps more than any other > > format? > > > > > Rest assured we are working on this... but it unfortunately won't > happen > overnight. First of all, the development of Lucene and Solr was merged > such > that there is now one team working on this stuff. This way, both Solr > and > Lucene developers can maintain this stuff. > > There is now the practical issue to combine all Lucene and Solr > analyzers > (not just the two components listed on that issue) into one package > that can > then be used by both Lucene and Solr users: > https://issues.apache.org/jira/browse/LUCENE-2413 > > -- > Robert Muir > rcm...@gmail.com > > > > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org For additional commands, e-mail: java-user-h...@lucene.apache.org