Re: HTMLStripReader, HTMLStripCharFilter

Herbert Roitblat Tue, 27 Apr 2010 12:03:49 -0700

Oops, sorry. I replied to the wrong message.

----- Original Message -----
From: "Herbert Roitblat" <[email protected]>
To: <[email protected]>
Sent: Tuesday, April 27, 2010 12:01 PM
Subject: Re: HTMLStripReader, HTMLStripCharFilter

Great, I will look forward to it.
Thanks,
Herb

----- Original Message -----
From: "Justin" <[email protected]>
To: <[email protected]>
Sent: Tuesday, April 27, 2010 11:47 AM
Subject: Re: HTMLStripReader, HTMLStripCharFilter

Thanks for the help. No more exception. It seems odd that I need to add a filter to make reset apply to the stream's underlying reader.





----- Original Message ----
From: Uwe Schindler <[email protected]>
To: [email protected]
Sent: Tue, April 27, 2010 12:00:31 AM
Subject: RE: HTMLStripReader, HTMLStripCharFilter

To reset this token stream you have to wrap it with a CachingTokenFilter.
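
The idea behind that advice can be sketched without Lucene: drain a one-shot source into a cache on the first pass, then replay the cache after each reset. The class below is a hypothetical stand-in written only for illustration (ReplayableTokens is not a Lucene class, and the real CachingTokenFilter caches attribute states rather than strings), so treat it as a rough model rather than the library's implementation.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for a caching wrapper around a one-shot token source. */
class ReplayableTokens {
    private final Iterator<String> source; // can only be consumed once
    private List<String> cache;            // filled on the first call to next()
    private int pos;                       // replay position within the cache

    ReplayableTokens(Iterator<String> source) {
        this.source = source;
    }

    /** Returns the next token, or null once all tokens have been returned. */
    String next() {
        if (cache == null) {
            // First pass: drain the whole source so it never has to be re-read.
            cache = new ArrayList<>();
            while (source.hasNext()) {
                cache.add(source.next());
            }
        }
        return pos < cache.size() ? cache.get(pos++) : null;
    }

    /** Rewinds the replay position; the underlying source is not touched. */
    void reset() {
        pos = 0;
    }
}
```

With a wrapper of this shape, a second pass never reaches the underlying reader, which is presumably why the exception goes away once the stream is cached.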

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [email protected]

-----Original Message-----
From: Justin [mailto:[email protected]]
Sent: Tuesday, April 27, 2010 1:16 AM
To: [email protected]
Subject: Re: HTMLStripReader, HTMLStripCharFilter

Thanks for the update!  I appreciate the hard work.

Perhaps someone can help me with the use of HTMLStripCharFilter...


I get an exception (3.1-dev) similar to the one reported here (2.9):

https://issues.apache.org/jira/browse/LUCENE-1695


With the following code:

    Analyzer htmlStripAnalyzer = new ReusableAnalyzerBase() {
        @Override
        protected TokenStreamComponents createComponents(
                final String fieldName, final Reader reader) {
            return new TokenStreamComponents(
                    new StandardTokenizer(Version.LUCENE_30,
                            new HTMLStripCharFilter(CharReader.get(reader))));
        }
    };
    String content = reader.document(id, fieldSelector).get(field);
    TokenStream ts = htmlStripAnalyzer.tokenStream(field,
            new StringReader(content));
    String best = highlighter.getBestFragments(ts, content,
      DEFAULT_EXCERPT_FRAGS, DEFAULT_EXCERPT_SEPARATOR);
    OffsetAttribute off = ts.addAttribute(OffsetAttribute.class);
    ts.reset();
    ts.incrementToken();


java.io.IOException: Stream closed
        at java.io.StringReader.ensureOpen(StringReader.java:39)
        at java.io.StringReader.read(StringReader.java:73)
        at org.apache.lucene.analysis.CharReader.read(CharReader.java:54)
        at java.io.Reader.read(Reader.java:104)
        at org.apache.solr.analysis.HTMLStripCharFilter.next(HTMLStripCharFilter.java:92)
        at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:690)
        at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:748)
        at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:453)
        at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:639)
        at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:167)


Looking at the source, I wonder if Tokenizer should override reset():

  public void reset() throws IOException {
    if (input != null) input.reset(); // would reset CharReader, StringReader
  }
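
One caveat about this proposal can be checked with plain java.io. Resetting an open StringReader rewinds it to the start, but resetting one that has already been closed fails with the same "Stream closed" message seen in the trace. The sketch below assumes the reader was closed by an earlier consumer of the stream before reset() was called. The class and method names are made up for the demonstration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

/** Hypothetical demo class; only java.io behavior is exercised. */
class ReaderResetDemo {
    /** Reads the text, calls reset() on the still-open reader, and reads it again. */
    static String readTwiceWithReset(String text) {
        try {
            Reader reader = new StringReader(text);
            StringBuilder out = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) out.append((char) c);
            reader.reset(); // no mark was set, so this rewinds to the beginning
            while ((c = reader.read()) != -1) out.append((char) c);
            return out.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Closes the reader first, then calls reset(); returns the exception message. */
    static String resetAfterClose(String text) {
        Reader reader = new StringReader(text);
        try {
            reader.close();
            reader.reset(); // StringReader checks that it is open before resetting
            return "no exception";
        } catch (IOException e) {
            return e.getMessage();
        }
    }
}
```

So an overridden reset() along these lines would only help if nothing closes the underlying reader between the two passes.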





----- Original Message ----
From: Robert Muir <[email protected]>
To: [email protected]
Sent: Sat, April 24, 2010 9:03:02 AM
Subject: Re: HTMLStripReader, HTMLStripCharFilter

On Fri, Apr 23, 2010 at 4:48 PM, Justin <[email protected]> wrote:

> Just out of curiosity, why does LUCENE-1377 have a minor priority?
>
> https://issues.apache.org/jira/browse/LUCENE-1377
>
> Don't people index, filter, search HTML, perhaps more than any other
> format?
>
>
Rest assured we are working on this... but it unfortunately won't happen
overnight. First of all, the development of Lucene and Solr was merged such
that there is now one team working on this stuff. This way, both Solr and
Lucene developers can maintain this stuff.

There is now the practical issue to combine all Lucene and Solr analyzers
(not just the two components listed on that issue) into one package that can
then be used by both Lucene and Solr users:
https://issues.apache.org/jira/browse/LUCENE-2413

--
Robert Muir
[email protected]





---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]




