Joshua,

Thanks for the comments - you might have something there. What I do is clean up the HTML with JTidy and then parse it into a DOM. Then I use selected parts to create a new DOM which I write out as an XML file. I then use Lucene to index the XML files. Upon retrieval, I once again parse the XML, format it and render it to a browser.
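A minimal sketch of the text-extraction step in a pipeline like the one described above, assuming the stored XML is available as a string. It uses only the JDK's built-in XML parser; the class and method names (XmlTextExtractor, extractText) are hypothetical and are not part of Lucene or JTidy. Walking the DOM's text nodes yields tag-free text with entities already decoded, which could then be handed to an analyzer.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

/** Hypothetical helper: pulls indexable plain text out of a stored XML document. */
final class XmlTextExtractor {

    private XmlTextExtractor() {}

    /**
     * Parses the XML and returns the text of all text nodes, separated by
     * single spaces. The XML parser decodes entities such as &amp; and &lt;
     * while building the DOM.
     */
    static String extractText(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            StringBuilder out = new StringBuilder();
            collectText(doc.getDocumentElement(), out);
            // Collapse runs of whitespace so words from adjacent elements
            // stay separated by exactly one space.
            return out.toString().replaceAll("\\s+", " ").trim();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    /** Depth-first walk that appends every text or CDATA node to the buffer. */
    private static void collectText(Node node, StringBuilder out) {
        short type = node.getNodeType();
        if (type == Node.TEXT_NODE || type == Node.CDATA_SECTION_NODE) {
            out.append(node.getNodeValue()).append(' ');
            return;
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
    }
}
```

Note that markup stored in escaped form (for example '&lt;b&gt;') comes back from the parser as literal '<b>' characters in the text, so it would still need to be stripped before indexing.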
The conversion from brackets to entities is necessary for the browser (which will subsequently view it) to render it properly. Maybe, in the indexing process, I could convert the entities back to brackets, but I'm not sure what to do with the text then - in other words, how to bring an HTML parser into the picture. If you have ideas on this, I'd very much appreciate hearing them.

Regards,

Terry

----- Original Message -----
From: "Joshua O'Madadhain" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, October 21, 2002 5:49 PM
Subject: Re: Tags Screwing up Searches

> On Mon, 21 Oct 2002, Terry Steichen wrote:
>
> > I discovered that the actual text that I was dealing with already
> > had the '<' converted to '&lt;', and so forth.  So the problem
> > is that with something like '&lt;b&gt;College Soccer&lt;/b&gt;',
> > Lucene recognizes the trailing semi-colon ';' as a word separator, so
> > it can find the term 'college', but it does not see the ending of
> > 'soccer'.  I did confirm that it *will* match on 'soccer&lt' just
> > fine.
> >
> > I've proceeded to add a string substitution method which replaces
> > '&lt;' with '    ' (four spaces, in order to hopefully keep the offsets
> > straight).  It appears to work, though I believe it slows down the
> > indexing.
> >
> > I don't know enough about the inner design of Lucene to figure this
> > out, but it seems logical that there would be a much more efficient
> > way to handle this than string operations.
> >
> > PS: I've had no responses from the list, so perhaps this is a unique
> > problem and doesn't justify a formal fix effort.
>
> A few questions and comments; please pardon me if I am asking questions
> answered in previous email:
>
> (1) Are you using an analyzer that is designed to handle (a) HTML, or
> (b) plain text?
>
> (2) If (b), that's probably why you've been getting this kind of behavior,
> and you may want to look at the HTMLParser sample code in the
> distribution.  The StandardAnalyzer, I'm pretty sure, is not designed to
> handle HTML.
>
> (3) A quick and dirty solution for indexing HTML if you are running on
> some flavor of Unix and don't want to figure out how to parse HTML
> tags: the text web browser "lynx".  lynx can 'dump' the text from a web
> page out as follows:
>
>     cat foo.html | lynx -dump -nolist > foo.txt
>
> This effectively strips the HTML tags out of foo.html and writes the text
> of the page to the file foo.txt.
>
> Once you've done this, of course, you can use the same analyzers that you
> use for any unformatted text file.
>
> Good luck--
>
> Joshua O'Madadhain
>
> [EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
> Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
> It's that moment of dawning comprehension that I live for--Bill Watterson
> My opinions are too rational and insightful to be those of any organization.
>
> --
> To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@;jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-user-help@;jakarta.apache.org>
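The string substitution described in the quoted message can be sketched as follows. This is an illustration only, not Terry's actual code: it assumes each escaped tag starts with '&lt;' and ends with the next '&gt;', and the class and method names (EscapedTagBlanker, blankEscapedTags) are hypothetical. Every escaped tag is overwritten with the same number of spaces, so the character offsets of the remaining words are unchanged and tag names such as 'b' are not left behind as terms.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: blanks out entity-escaped tags before indexing. */
final class EscapedTagBlanker {

    // Matches an escaped tag such as "&lt;b&gt;" or "&lt;/b&gt;".
    // Assumption: a tag runs from "&lt;" to the nearest following "&gt;".
    private static final Pattern ESCAPED_TAG =
            Pattern.compile("&lt;.*?&gt;", Pattern.DOTALL);

    private EscapedTagBlanker() {}

    /**
     * Returns a copy of the text in which every escaped tag is replaced by
     * an equal number of spaces, so the result has the same length as the
     * input and all other characters keep their positions.
     */
    static String blankEscapedTags(String text) {
        Matcher m = ESCAPED_TAG.matcher(text);
        StringBuilder out = new StringBuilder(text.length());
        int last = 0;
        while (m.find()) {
            // Copy the untouched text before this tag.
            out.append(text, last, m.start());
            // Overwrite the tag itself with spaces of the same length.
            for (int i = m.start(); i < m.end(); i++) {
                out.append(' ');
            }
            last = m.end();
        }
        // Copy whatever follows the final tag.
        out.append(text, last, text.length());
        return out.toString();
    }
}
```

A caveat of this sketch: a literal less-than sign stored as '&lt;' in running text, followed later by an unrelated '&gt;', would be blanked along with everything between them.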