Otis,

I discovered that the actual text I was dealing with already had the '<' converted to '&lt;', and so forth. So the problem is that with something like '&lt;b&gt;College Soccer&lt;/b&gt;', Lucene recognizes the trailing semi-colon ';' as a word separator, so it can find the term 'college', but it does not see the ending of 'soccer'. I did confirm that it *will* match on 'soccer&lt' just fine.
I've proceeded to add a string substitution method which replaces '&lt;' with '    ' (four spaces, in order to hopefully keep the offsets straight). It appears to work, though I believe it slows down the indexing. I don't know enough about the inner design of Lucene to figure this out, but it seems logical that there would be a much more efficient way to handle this than string operations.

Anyway, thought I'd bring you up to date.

Regards,

Terry

PS: I've had no responses from the list, so perhaps this is a unique problem and doesn't justify a formal fix effort.

----- Original Message -----
From: "Terry Steichen" <[EMAIL PROTECTED]>
To: "Lucene Users Group" <[EMAIL PROTECTED]>
Sent: Friday, October 18, 2002 11:39 AM
Subject: Tags Screwing up Searches

Some content I'm indexing contains certain HTML tags, like <p>, <b>, <i>, etc. What I find is that when a term I'm searching for touches one of these tags (which is fairly typical), the term isn't recognized and the search fails. For example, <b>College Soccer</b> doesn't match on either "college" or "soccer".

I seem to recall someone else bringing up a similar problem with a word that ends a sentence (and is thus treated as if the period were part of the word), but I don't recall what the response was and I can't find that thread.

Does anyone have some ideas on what's the best way to handle this? Filter out the tags in the process of creating the Document for indexing? Or through a modification to the Analyzer (I'm using the StandardAnalyzer)? Or something else?

TIA,

Terry

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@;jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@;jakarta.apache.org>
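
For anyone hitting the same thing, the same-length substitution described in the reply above could be sketched roughly as follows. This is only an illustration, not the code actually used: the class name EntityBlanker and the method blankEntities are made up for the example, it assumes that only the escaped angle brackets '&lt;' and '&gt;' need handling, and it uses nothing from Lucene itself. It makes a single pass over the text and pads each entity with as many spaces as it had characters, so the character offsets of the remaining words do not move.

```java
import java.util.Arrays;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: blanks escaped angle brackets
// in the text before it is handed to the indexer.
class EntityBlanker {
    // Assumption: only &lt; and &gt; occur; other entities are left alone.
    private static final Pattern ENTITY = Pattern.compile("&(?:lt|gt);");

    // Replaces every matched entity with the same number of spaces,
    // so the result has exactly the same length as the input.
    static String blankEntities(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer(text.length());
        while (m.find()) {
            char[] pad = new char[m.end() - m.start()];
            Arrays.fill(pad, ' ');
            // The replacement is only spaces, so no '$' or '\' escaping is needed.
            m.appendReplacement(out, new String(pad));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "&lt;b&gt;College Soccer&lt;/b&gt;";
        // Brackets make the leading and trailing padding visible.
        System.out.println("[" + blankEntities(raw) + "]");
    }
}
```

Note that the tag names themselves ('b' and '/b' in the example) are still left in the text as stray characters. If that matters, the pattern would have to be widened to cover the whole escaped tag, which is a further assumption about what the content looks like.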