Joshua,

Thanks for the comments - you might have something there.  What I do is
clean up the HTML with JTidy and then parse it into a DOM.  Then I use
selected parts to create a new DOM which I write out as an XML file.  I then
use Lucene to index the XML files.  Upon retrieval, I once again parse the
XML, format it and render it to a browser.
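
For illustration, here is a minimal sketch of the "selected parts into a new DOM, written out as XML" step. It is not my actual code. It assumes JTidy has already produced well-formed XHTML, and the class name DomExtractSketch and the output elements article/title/body are hypothetical. It uses only the standard JDK XML APIs.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: copies selected parts of a tidied page into a new DOM.
final class DomExtractSketch {

    private DomExtractSketch() {}

    // Assumes the input is already well-formed XHTML (e.g. JTidy output).
    static String extract(String xhtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document source = builder.parse(new InputSource(new StringReader(xhtml)));

            // Build the new, smaller document from the selected parts.
            Document target = builder.newDocument();
            Element article = target.createElement("article");
            target.appendChild(article);

            Element title = target.createElement("title");
            title.setTextContent(firstText(source, "title"));
            article.appendChild(title);

            Element body = target.createElement("body");
            body.setTextContent(firstText(source, "p"));
            article.appendChild(body);

            // Serialize the new DOM to an XML string.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(target), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("could not convert document", e);
        }
    }

    // Text of the first element with the given tag name, or "" if there is none.
    private static String firstText(Document doc, String tag) {
        NodeList nodes = doc.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent();
    }

    public static void main(String[] args) {
        System.out.println(extract(
                "<html><head><title>Scores</title></head>"
                + "<body><p><b>College Soccer</b> results</p></body></html>"));
    }
}
```

Note that any '<' in the text set with setTextContent is written out as '&lt;' by the serializer, so the brackets turn into entities at this stage whether or not anything converts them explicitly.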

The conversion from brackets to entities is necessary so that the browser,
which will later display the text, renders it properly.

In the indexing process, though, I could perhaps convert the entities back
to brackets.  I'm not sure what to do with the text then - in other words,
how to bring an HTML parser into the picture.  If you have ideas on this,
I'd very much appreciate hearing them.
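
To make the question concrete, here is a rough sketch of what "convert back, then strip the markup" might look like before the text is handed to the analyzer. The class EntityTagStripper is hypothetical and not part of Lucene or JTidy, and the regex is only a crude stand-in for a real HTML parser. For example, it would also swallow the text between a literal '<' and a later '>' in a sentence like "a < b and c > d".

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene or JTidy.
final class EntityTagStripper {

    // Matches anything that looks like a tag, e.g. <b>, </b>, <a href="x">.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    private EntityTagStripper() {}

    // Turn entity-encoded brackets back into literal brackets.
    // &amp; is decoded last so that "&amp;lt;" becomes the text "&lt;", not "<".
    static String decodeBrackets(String s) {
        return s.replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&amp;", "&");
    }

    // Replace each tag with one space so neighbouring words stay separate,
    // then trim the leading and trailing whitespace.
    static String stripTags(String s) {
        return TAG.matcher(s).replaceAll(" ").trim();
    }

    // Text suitable for a plain-text analyzer.
    static String toIndexableText(String stored) {
        return stripTags(decodeBrackets(stored));
    }

    public static void main(String[] args) {
        // prints "College Soccer"
        System.out.println(toIndexableText("&lt;b&gt;College Soccer&lt;/b&gt;"));
    }
}
```

The result could then be indexed with the same analyzer used for any unformatted text. Character offsets into the stored XML are not preserved by this approach.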

Regards,

Terry

----- Original Message -----
From: "Joshua O'Madadhain" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, October 21, 2002 5:49 PM
Subject: Re: Tags Screwing up Searches


> On Mon, 21 Oct 2002, Terry Steichen wrote:
>
> > I discovered that the actual text that I was dealing with already had
> > the '<' converted to '&lt;', and so forth.  So the problem
> > is that with something like '&lt;b&gt;College Soccer&lt;/b&gt;',
> > Lucene recognizes the trailing semi-colon ';' as a word separator, so
> > it can find the term 'college', but it does not see the ending of
> > 'soccer'.  I did confirm that it *will* match on 'soccer&lt;' just
> > fine.
> >
> > I've proceeded to add a string substitution method which replaces
> > '&lt;' with '    ' (four spaces, in order to hopefully keep the offsets
> > straight). It appears to work, though I believe it slows down the
> > indexing.
> >
> > I don't know enough about the inner design of Lucene to figure this
> > out, but it seems logical that there would be a much more efficient
> > way to handle this than string operations.
> >
> > PS: I've had no responses from the list, so perhaps this is a unique
> > problem and doesn't justify a formal fix effort.
>
> A few questions and comments; please pardon me if I am asking questions
> answered in previous email:
>
> (1) Are you using an analyzer that is designed to handle (a) HTML, or
> (b) plain text?
>
> (2) If (b), that's probably why you've been getting this kind of behavior,
> and you may want to look at the HTMLParser sample code in the
> distribution.  The StandardAnalyzer, I'm pretty sure, is not designed to
> handle HTML.
>
> (3) A quick and dirty solution for indexing HTML if you are running on
> some flavor of Unix and don't want to figure out how to parse HTML
> tags: the text web browser "lynx".  lynx can 'dump' the text from a web
> page out as follows:
>
> lynx -dump -nolist foo.html > foo.txt
>
> This effectively strips the HTML tags out of foo.html and writes the text
> of the page to the file foo.txt.
>
> Once you've done this, of course, you can use the same analyzers that you
> use for any unformatted text file.
>
> Good luck--
>
> Joshua O'Madadhain
>
>  [EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
>   Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
>  It's that moment of dawning comprehension that I live for--Bill Watterson
> My opinions are too rational and insightful to be those of any organization.
>
> --
> To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
>

