http://www.thetaphi.de
eMail: u...@thetaphi.de
> -Original Message-
> From: Karolina Bernat [mailto:karolina.ber...@googlemail.com]
> Sent: Tuesday, January 25, 2011 1:45 PM
> To: java-user@lucene.apache.org
> Subject: Re: Preserving original HTML file offsets for highlighting
Hi Uwe,
thanks for this hint. I'm not sure how much of the Solr functionality I
need to implement in order to use the HTMLStripCharFilter. I'm using Apache
Tika for HTML parsing, and I use the StandardAnalyzer to initialize my
IndexWriter. I don't use a Tokenizer - this would be the Solr app
You can use HTMLStripCharFilter, which is plugged into the chain before the
Tokenizer. It strips all HTML but preserves the original token positions
(character offsets in the source file), so you can later highlight using
those positions.
This filter is currently only released through Apache Solr, but in Lucene
4.0 it's part of the analysis
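
To make the idea concrete, here is a minimal sketch in plain Java of what "strip the HTML but keep the original offsets" means. It is not the HTMLStripCharFilter implementation and does not use any Lucene or Solr API; the class name HtmlOffsetSketch and the methods tokenize and highlight are hypothetical names chosen for illustration. It assumes simple, well-formed tags and ignores entities, comments and script blocks, all of which a real filter has to handle.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: NOT Lucene's HTMLStripCharFilter.
class HtmlOffsetSketch {

    /** A token together with its character offsets in the ORIGINAL HTML. */
    static final class Token {
        final String text;
        final int start; // inclusive offset into the original HTML
        final int end;   // exclusive offset into the original HTML

        Token(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }
    }

    /** Skips everything between '<' and '>' and emits runs of letters/digits. */
    static List<Token> tokenize(String html) {
        List<Token> tokens = new ArrayList<>();
        boolean inTag = false;
        int tokenStart = -1; // -1 means "not inside a token"
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            boolean wordChar = false;
            if (inTag) {
                if (c == '>') inTag = false;   // tag ends, the char itself is markup
            } else if (c == '<') {
                inTag = true;                  // tag starts
            } else {
                wordChar = Character.isLetterOrDigit(c);
            }
            if (wordChar && tokenStart < 0) {
                tokenStart = i;                // token begins at an original offset
            } else if (!wordChar && tokenStart >= 0) {
                tokens.add(new Token(html.substring(tokenStart, i), tokenStart, i));
                tokenStart = -1;
            }
        }
        if (tokenStart >= 0) {                 // flush a token that runs to the end
            tokens.add(new Token(html.substring(tokenStart), tokenStart, html.length()));
        }
        return tokens;
    }

    /** Wraps every token equal to term in <em>...</em> inside the original HTML. */
    static String highlight(String html, List<Token> tokens, String term) {
        StringBuilder sb = new StringBuilder(html);
        // Work backwards so that earlier offsets stay valid after each insertion.
        for (int i = tokens.size() - 1; i >= 0; i--) {
            Token t = tokens.get(i);
            if (t.text.equalsIgnoreCase(term)) {
                sb.insert(t.end, "</em>");
                sb.insert(t.start, "<em>");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>world</b></p>";
        List<Token> tokens = tokenize(html);
        // prints <p>Hello <b><em>world</em></b></p>
        System.out.println(highlight(html, tokens, "world"));
    }
}
```

Because each token carries offsets into the original file rather than into the stripped text, the highlighter can insert its markers straight into the HTML without re-parsing it.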
Fred Toth wrote:
I'm thinking we need something like "HTMLTokenizer" which bridges the
gap between StandardAnalyzer and an external HTML parser. Since so
many of us are dealing with HTML, I would think this would be generally
useful for many problems. It could work this way:
Given this input:
H