Si, with StandardAnalyzer, I believe, since neither < nor > are alphabetical characters.
Otis --- Terry Steichen <[EMAIL PROTECTED]> wrote: > How should this be done (the translation, that is)? If it were left > as '<' > and '>', would Lucene parse it properly? > > Terry > > ----- Original Message ----- > From: "Otis Gospodnetic" <[EMAIL PROTECTED]> > To: "Lucene Users List" <[EMAIL PROTECTED]> > Sent: Monday, October 21, 2002 5:40 PM > Subject: Re: Tags Screwing up Searches > > > > Thanks for the update. > > This all sounds right (no bugs). The problem is the code that you > have > > that translates those < and > characters. > > > > Otis > > > > --- Terry Steichen <[EMAIL PROTECTED]> wrote: > > > Otis, > > > > > > I discovered that the actual text that I was dealing with already > > > converted > > > the '<' converted to '<', and so forth. So the problem is > that > > > with > > > something like '<b>College Soccer</b>', Lucene > recognizes > > > the > > > trailing semi-colon ';' as a word separator, so it can find the > term > > > 'college', but it does not see the ending of 'soccer'. I did > confirm > > > that > > > it *will* match on 'soccer<' just fine. > > > > > > I've proceeded to add a string substitution method which replaces > > > '<' > > > with ' ' (four spaces, in order to hopefully keep the offsets > > > straight). > > > It appears to work, though I believe it slows down the indexing. > > > > > > I don't know enough about the inner design of Lucene to figure > this > > > out, but > > > it seems logical that there would be a much more efficient way to > > > handle > > > this than string operations. > > > > > > Anyway, thought I'd bring you up to date. > > > > > > Regards, > > > > > > Terry > > > > > > PS: I've had no responses from the list, so perhaps this is a > unique > > > problem > > > and doesn't justify a formal fix effort. > > > > > > ----- Original Message ----- > > > From: "Terry Steichen" <[EMAIL PROTECTED]> > > > To: "Lucene Users Group" <[EMAIL PROTECTED]> > > > Sent: Friday, October 18, 2002 11:39 AM > > > Subject: Tags Screwing up Searches > > > > > > > > > Some content I'm indexing contains certain HTML tags, like <p>, > <b>, > > > <i>, > > > etc. What I find is that when a term I'm searching for touches > one > > > of these > > > tags (which is fairly typical), the term isn't recognized and the > > > search > > > fails. For example, <b>College Soccer</b> doesn't match on > either > > > "college" > > > or "soccer". I seem to recall someone else bring up a similar > > > problem with > > > a word that ends a sentence (and is thus treated as if the period > was > > > part > > > of the word), but don't recall what the response was and I can't > find > > > that > > > thread. > > > > > > Does anyone have some ideas on what's the best way to handle > this? > > > Filter > > > out the tags in the process of creating the Document for > indexing? Or > > > through a modification to the Analyzer (I'm using the > > > StandardAnalyzer)? Or > > > something else? > > > > > > TIA, > > > > > > Terry > > > > > > > > > > > > > > > -- > > > To unsubscribe, e-mail: > > > <mailto:lucene-user-unsubscribe@;jakarta.apache.org> > > > For additional commands, e-mail: > > > <mailto:lucene-user-help@;jakarta.apache.org> > > > > > > > > > __________________________________________________ > > Do you Yahoo!? > > Y! Web Hosting - Let the expert host your web site > > http://webhosting.yahoo.com/ > > > > -- > > To unsubscribe, e-mail: > <mailto:lucene-user-unsubscribe@;jakarta.apache.org> > > For additional commands, e-mail: > <mailto:lucene-user-help@;jakarta.apache.org> > > > > > > > -- > To unsubscribe, e-mail: > <mailto:lucene-user-unsubscribe@;jakarta.apache.org> > For additional commands, e-mail: > <mailto:lucene-user-help@;jakarta.apache.org> > > __________________________________________________ Do you Yahoo!? Y! Web Hosting - Let the expert host your web site http://webhosting.yahoo.com/ -- To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@;jakarta.apache.org> For additional commands, e-mail: <mailto:lucene-user-help@;jakarta.apache.org>