
To clarify: I need to run precise, structure-sensitive searches. Plain HTML
doesn't support that well, because a simple full-text search won't suffice.
The content for the XML 'semantic tags' is extracted from the original HTML
with some complex, XPath-assisted logic. In other words, that content isn't
conveniently wrapped in tags in the original HTML. I don't recall the exact
number of elements in the resulting XML structure, but it's around 30
(including some metadata that I add). That's one (of several) reasons why
the XML/DOM step is necessary.
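
To give a rough idea of what "XPath-assisted" extraction means here, a
minimal sketch follows. It assumes the tidied page is well-formed XHTML and
uses the standard javax.xml.xpath API. The field names (headline, byline)
and the XPath expressions are hypothetical, made up for illustration; the
real rules are more involved.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class FieldExtractor {
    // Pulls named fields out of tidied, well-formed XHTML.
    // Field names and XPath expressions are hypothetical examples.
    static Map<String, String> extract(String xhtml) {
        try {
            Document dom = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            Map<String, String> fields = new LinkedHashMap<>();
            // The content is located by position and attribute,
            // not by a dedicated tag in the original HTML.
            fields.put("headline",
                    xpath.evaluate("//td[@class='hd']/b[1]", dom).trim());
            fields.put("byline",
                    xpath.evaluate("//td[@class='hd']/i[1]", dom).trim());
            return fields;
        } catch (Exception e) {
            throw new IllegalStateException("could not extract fields", e);
        }
    }
}
```

Each map entry would then become one element of the output XML (and, later,
one searchable field), which is what makes field-level searches possible.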



PS: The problem that caused me to ask my original question stems from the
fact that some of the extracted content (stored in a couple of the XML
sections) sometimes contains HTML tags (in the form of entities), and the
StandardTokenizer (which I'm using) doesn't ignore/remove them.
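
One possible workaround, sketched below under stated assumptions: decode the
entities and strip the tags from the field text before it reaches the
tokenizer. The class and method names (MarkupStripper, toPlainText) are
hypothetical, and the regex is a deliberately simple approximation, not a
real HTML parser; a tag whose attribute value contains '>' would defeat it.

```java
import java.util.regex.Pattern;

final class MarkupStripper {
    // Matches things that look like tags: '<', optional '/', a letter,
    // then anything up to the next '>'. Text such as "a < b" is left alone.
    private static final Pattern TAG = Pattern.compile("</?[A-Za-z][^>]*>");

    // Returns the field text with entity-encoded markup removed.
    static String toPlainText(String fieldText) {
        // Decode only the basic entities; &amp; goes last so that a
        // double-escaped "&amp;lt;" is decoded one level only.
        String decoded = fieldText
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"")
                .replace("&amp;", "&");
        // Replace each tag with a space so adjacent words do not merge.
        String withoutTags = TAG.matcher(decoded).replaceAll(" ");
        // Collapse the whitespace left behind.
        return withoutTags.replaceAll("\\s+", " ").trim();
    }
}
```

The cleaned string would be used only for indexing; the stored XML can keep
the entities so that the browser still renders the content properly.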

----- Original Message -----
From: "Joshua O'Madadhain" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, October 21, 2002 8:57 PM
Subject: Re: Tags Screwing up Searches

> On Mon, 21 Oct 2002, Terry Steichen wrote:
> > Thanks for the comments - you might have something there.  What I do
> > is clean up the HTML with JTidy and then parse it into a DOM.  Then I
> > use selected parts to create a new DOM which I write out as an XML
> > file.  I then use Lucene to index the XML files.  Upon retrieval, I
> > once again parse the XML, format it and render it to a browser.
> >
> > The conversion from brackets to entities is necessary in order for the
> > browser (which will subsequently view it) to render it properly.
> >
> > But maybe, in the indexing process, I could convert it back again (to
> > brackets), but I'm not sure what to do with it then - in other words,
> > how to bring an HTML parser into the picture.  If you have ideas on
> > this, I'd very much appreciate hearing them.
>
> Perhaps there is some reason for the conversion to XML that I'm not
> understanding (and this isn't really within my area of expertise).
> But if your purpose is to index HTML files and then display them later in
> response to a search, why not just use JTidy and then index the HTML
> instead (skipping the DOM and XML stages entirely), and then return the
> (cleaned-up) HTML later when asked for?  The basis of any 'semantic' tags
> that you might be putting in the XML (perhaps to define Lucene fields)
> must be there in the HTML anyway, so I'm not sure what the DOM and XML
> representations get you.
>
> Regards,
> Joshua O'Madadhain
>   Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
>  It's that moment of dawning comprehension that I live for--Bill Watterson
> My opinions are too rational and insightful to be those of any

To unsubscribe, e-mail:   <mailto:lucene-user-unsubscribe@;>
For additional commands, e-mail: <mailto:lucene-user-help@;>
