Re: Preserving original HTML file offsets for highlighting

2011-01-26 Thread Karolina Bernat

RE: Preserving original HTML file offsets for highlighting

2011-01-25 Thread Uwe Schindler

Re: Preserving original HTML file offsets for highlighting

2011-01-25 Thread Karolina Bernat
Hi Uwe, thanks for this hint. I'm not sure how much of the Solr functionality I need to implement in order to use the HTMLStripCharFilter. I'm using Apache Tika for HTML parsing. Furthermore, I use the StandardAnalyzer to initialize my IndexWriter. I don't use a Tokenizer - this would be the Solr app

RE: Preserving original HTML file offsets for highlighting

2011-01-24 Thread Uwe Schindler
You can use HTMLStripCharFilter, which is plugged into the chain before the Tokenizer. It strips all HTML but preserves the tokens' offsets in the original text, so you can later highlight using those offsets. This filter is currently only released through Apache Solr, but in Lucene 4.0 it's part of the analysis
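
The offset-preserving idea described here can be illustrated without Lucene or Solr on the classpath. The following is only a minimal sketch in plain Java, not the HTMLStripCharFilter implementation: the class `OffsetPreservingStripper`, its nested `Stripped` type, and the methods `strip` and `originalSpan` are hypothetical names made up for this example. It assumes that tags are simply everything between `<` and `>` and ignores entities, comments, and script blocks, all of which a real filter would have to handle.

```java
import java.util.Arrays;

// Illustrative only: a toy tag stripper that remembers where each kept
// character came from, so matches in the stripped text can be mapped
// back to offsets in the original HTML for highlighting.
final class OffsetPreservingStripper {

    /** Stripped text plus, for every kept character, its index in the original HTML. */
    static final class Stripped {
        final String text;
        final int[] originalIndex; // originalIndex[i] = index in the HTML of text.charAt(i)

        Stripped(String text, int[] originalIndex) {
            this.text = text;
            this.originalIndex = originalIndex;
        }

        /** Maps a non-empty range [start, end) of the stripped text to [start, end) in the HTML. */
        int[] originalSpan(int start, int end) {
            if (start < 0 || end > text.length() || start >= end) {
                throw new IllegalArgumentException("invalid range: " + start + ".." + end);
            }
            // End is derived from the last kept character, so trailing tags are excluded.
            return new int[] { originalIndex[start], originalIndex[end - 1] + 1 };
        }
    }

    /** Drops everything between '<' and '>' and records the source index of each kept char. */
    static Stripped strip(String html) {
        StringBuilder out = new StringBuilder();
        int[] map = new int[html.length()];
        boolean inTag = false;
        for (int i = 0; i < html.length(); i++) {
            char c = html.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (inTag) {
                if (c == '>') {
                    inTag = false;
                }
            } else {
                map[out.length()] = i;
                out.append(c);
            }
        }
        return new Stripped(out.toString(), Arrays.copyOf(map, out.length()));
    }

    public static void main(String[] args) {
        String html = "<b>Hello</b> world";
        Stripped s = strip(html);
        int start = s.text.indexOf("world");
        int[] span = s.originalSpan(start, start + "world".length());
        System.out.println(s.text);                             // Hello world
        System.out.println(html.substring(span[0], span[1]));   // world
    }
}
```

In this sketch a match found in the stripped text ("world" at 6..11) maps back to 13..18 in the original HTML, which is the range a highlighter would need to wrap. If a match spans an inline tag, the mapped range includes that tag, so a real highlighter would need extra care there.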

Re: Preserving original HTML file offsets for highlighting, need HTMLTokenizer?

2005-06-03 Thread Doug Cutting
Fred Toth wrote: I'm thinking we need something like "HTMLTokenizer" which bridges the gap between StandardAnalyzer and an external HTML parser. Since so many of us are dealing with HTML, I would think this would be generally useful for many problems. It could work this way: Given this input: H