Otis,

I discovered that the actual text that I was dealing with already converted
the '<' converted to '&lt;', and so forth.  So the problem is that with
something like '&lt;b&gt;College Soccer&lt;/b&gt;', Lucene recognizes the
trailing semi-colon ';' as a word separator, so it can find the term
'college', but it does not see the ending of 'soccer'.  I did confirm that
it *will* match on 'soccer&lt;' just fine.

I've proceeded to add a string substitution method which replaces '&lt;'
with '    ' (four spaces, in order to hopefully keep the offsets straight).
It appears to work, though I believe it slows down the indexing.

I don't know enough about the inner design of Lucene to figure this out, but
it seems logical that there would be a much more efficient way to handle
this than string operations.

Anyway, thought I'd bring you up to date.

Regards,

Terry

PS: I've had no responses from the list, so perhaps this is a unique problem
and doesn't justify a formal fix effort.

----- Original Message -----
From: "Terry Steichen" <[EMAIL PROTECTED]>
To: "Lucene Users Group" <[EMAIL PROTECTED]>
Sent: Friday, October 18, 2002 11:39 AM
Subject: Tags Screwing up Searches


Some content I'm indexing contains certain HTML tags, like <p>, <b>, <i>,
etc.  What I find is that when a term I'm searching for touches one of these
tags (which is fairly typical), the term isn't recognized and the search
fails.  For example, <b>College Soccer</b> doesn't match on either "college"
or "soccer".  I seem to recall someone else bring up a similar problem with
a word that ends a sentence (and is thus treated as if the period was part
of the word), but don't recall what the response was and I can't find that
thread.

Does anyone have some ideas on what's the best way to handle this?  Filter
out the tags in the process of creating the Document for indexing? Or
through a modification to the Analyzer (I'm using the StandardAnalyzer)? Or
something else?

TIA,

Terry




--
To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@;jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@;jakarta.apache.org>

Reply via email to