On Mon, 21 Oct 2002, Terry Steichen wrote: > I discovered that the actual text that I was dealing with already > converted the '<' converted to '<', and so forth. So the problem > is that with something like '<b>College Soccer</b>', > Lucene recognizes the trailing semi-colon ';' as a word separator, so > it can find the term 'college', but it does not see the ending of > 'soccer'. I did confirm that it *will* match on 'soccer<' just > fine. > > I've proceeded to add a string substitution method which replaces > '<' with ' ' (four spaces, in order to hopefully keep the offsets > straight). It appears to work, though I believe it slows down the > indexing. > > I don't know enough about the inner design of Lucene to figure this > out, but it seems logical that there would be a much more efficient > way to handle this than string operations. > > PS: I've had no responses from the list, so perhaps this is a unique > problem and doesn't justify a formal fix effort.
A few questions and comments; please pardon me if I am asking questions answered in previous email: (1) Are you using an analyzer that is designed to handle (a) HTML, or (b) plain text? (2) If (b), that's probably why you've been getting this kind of behavior, and you may want to look at the HTMLParser sample code in the distribution. The StandardAnalyzer, I'm pretty sure, is not designed to handle HTML. (3) A quick and dirty solution for indexing HTML if you are running on some flavor of Unix and don't want to figure out how to do parse HTML tags: the text web browser "lynx". lynx can 'dump' the text from a web page out as follows: cat foo.html | lynx -dump -nolist > foo.txt This effectively strips the HTML tags out of foo.html and writes the text of the page to the file foo.txt. Once you've done this, of course, you can use the same analyzers that you use for any unformatted text file. Good luck-- Joshua O'Madadhain [EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall It's that moment of dawning comprehension that I live for--Bill Watterson My opinions are too rational and insightful to be those of any organization. -- To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@;jakarta.apache.org> For additional commands, e-mail: <mailto:lucene-user-help@;jakarta.apache.org>