Oops, sorry. I replied to the wrong message.
----- Original Message -----
From: "Herbert Roitblat" <h...@orcatec.com>
To: <java-user@lucene.apache.org>
Sent: Tuesday, April 27, 2010 12:01 PM
Subject: Re: HTMLStripReader, HTMLStripCharFilter


Great, I will look forward to it.
Thanks,
Herb
----- Original Message -----
From: "Justin" <cry...@yahoo.com>
To: <java-user@lucene.apache.org>
Sent: Tuesday, April 27, 2010 11:47 AM
Subject: Re: HTMLStripReader, HTMLStripCharFilter


Thanks for the help. No more exception. Seems odd that I need to add a filter to make reset apply to the stream's underlying reader.




----- Original Message ----
From: Uwe Schindler <u...@thetaphi.de>
To: java-user@lucene.apache.org
Sent: Tue, April 27, 2010 12:00:31 AM
Subject: RE: HTMLStripReader, HTMLStripCharFilter

To reset this token stream you have to wrap it with a CachingTokenFilter.
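
As a rough illustration of why caching helps (a plain-JDK sketch, not Lucene's CachingTokenFilter itself): once the first consumer has read and closed the underlying reader, a second pass can only succeed if the tokens were kept in memory. The names below (TokenReplaySketch, consume, isClosed) are hypothetical, and the whitespace splitting is only a stand-in for a real tokenizer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Lucene: caches tokens so they can be replayed.
final class TokenReplaySketch {

    // Reads whitespace-separated tokens once, closes the reader, returns the cache.
    static List<String> consume(Reader reader) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        try (Reader in = reader) {
            int c;
            while ((c = in.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (current.length() > 0) {
                        tokens.add(current.toString());
                        current.setLength(0);
                    }
                } else {
                    current.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return Collections.unmodifiableList(tokens);
    }

    // True if the reader refuses further reads, as a closed StringReader does.
    // Note: on an open reader this consumes one character.
    static boolean isClosed(Reader reader) {
        try {
            reader.read();
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Reader reader = new StringReader("some stripped html text");
        List<String> cached = consume(reader);   // first pass closes the reader
        System.out.println("reader closed: " + isClosed(reader));
        for (int pass = 1; pass <= 2; pass++) {   // later passes use the cache only
            System.out.println("pass " + pass + ": " + cached);
        }
    }
}
```

The design point is that the second and later passes never touch the reader again, which is the same reason a caching wrapper around the token stream can be reset safely.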

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de


-----Original Message-----
From: Justin [mailto:cry...@yahoo.com]
Sent: Tuesday, April 27, 2010 1:16 AM
To: java-user@lucene.apache.org
Subject: Re: HTMLStripReader, HTMLStripCharFilter

Thanks for the update!  I appreciate the hard work.

Perhaps someone can help me with the use of HTMLStripCharFilter...


I get an exception (3.1-dev) similar to the one reported here (2.9):

https://issues.apache.org/jira/browse/LUCENE-1695


With the following code:

    Analyzer htmlStripAnalyzer = new ReusableAnalyzerBase() {
        @Override
        protected TokenStreamComponents createComponents(
                final String fieldName, final Reader reader) {
            return new TokenStreamComponents(
                    new StandardTokenizer(Version.LUCENE_30,
                            new HTMLStripCharFilter(CharReader.get(reader))));
        }
    };
    String content = reader.document(id, fieldSelector).get(field);
    TokenStream ts = htmlStripAnalyzer.tokenStream(field,
            new StringReader(content));
    String best = highlighter.getBestFragments(ts, content,
            DEFAULT_EXCERPT_FRAGS, DEFAULT_EXCERPT_SEPARATOR);
    OffsetAttribute off = ts.addAttribute(OffsetAttribute.class);
    ts.reset();
    ts.incrementToken();


java.io.IOException: Stream closed
        at java.io.StringReader.ensureOpen(StringReader.java:39)
        at java.io.StringReader.read(StringReader.java:73)
        at org.apache.lucene.analysis.CharReader.read(CharReader.java:54)
        at java.io.Reader.read(Reader.java:104)
        at org.apache.solr.analysis.HTMLStripCharFilter.next(HTMLStripCharFilter.java:92)
        at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:690)
        at org.apache.solr.analysis.HTMLStripCharFilter.read(HTMLStripCharFilter.java:748)
        at org.apache.lucene.analysis.standard.StandardTokenizerImpl.zzRefill(StandardTokenizerImpl.java:453)
        at org.apache.lucene.analysis.standard.StandardTokenizerImpl.getNextToken(StandardTokenizerImpl.java:639)
        at org.apache.lucene.analysis.standard.StandardTokenizer.incrementToken(StandardTokenizer.java:167)


Looking at the source, I wonder if Tokenizer should override reset():

  public void reset() throws IOException {
    // would reset CharReader, StringReader
    if (input != null) input.reset();
  }
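
One caveat about this idea, sketched with plain JDK classes only: the trace above reports "Stream closed", and a StringReader that has been closed rejects reset() just as it rejects read(). So, assuming the reader had already been closed before ts.reset() was called, forwarding reset() to it would probably still fail. ClosedReaderResetSketch and resetSucceeds are hypothetical names used only for this illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical demo class: shows reset() behaviour on open vs. closed readers.
final class ClosedReaderResetSketch {

    // Returns true if reset() completes, false if the reader throws IOException.
    static boolean resetSucceeds(Reader reader) {
        try {
            reader.reset();
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        StringReader reader = new StringReader("<p>hello</p>");
        reader.read();                                  // consume one character
        System.out.println("open:   " + resetSucceeds(reader));
        reader.close();                                 // what a finished consumer may do
        System.out.println("closed: " + resetSucceeds(reader));
    }
}
```

While the reader is open, reset() rewinds it to the start of the string; after close() it throws, so rewinding only helps when nothing has closed the reader in between.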





----- Original Message ----
From: Robert Muir <rcm...@gmail.com>
To: java-user@lucene.apache.org
Sent: Sat, April 24, 2010 9:03:02 AM
Subject: Re: HTMLStripReader, HTMLStripCharFilter

On Fri, Apr 23, 2010 at 4:48 PM, Justin <cry...@yahoo.com> wrote:

> Just out of curiosity, why does LUCENE-1377 have a minor priority?
>
> https://issues.apache.org/jira/browse/LUCENE-1377
>
> Don't people index, filter, search HTML, perhaps more than any other
> format?
>
>
Rest assured we are working on this... but it unfortunately won't happen
overnight. First of all, the development of Lucene and Solr was merged such
that there is now one team working on this stuff. This way, both Solr and
Lucene developers can maintain this stuff.

There is now the practical issue to combine all Lucene and Solr analyzers
(not just the two components listed on that issue) into one package that can
then be used by both Lucene and Solr users:
https://issues.apache.org/jira/browse/LUCENE-2413

--
Robert Muir
rcm...@gmail.com





---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



