Joshua,

To clarify: I need to perform precise, structure-sensitive searches. You can't do that very well with plain HTML, because a simple full-text search won't suffice. The content for the XML 'semantic tags' is extracted from the original HTML with some complex, XPath-assisted logic. In other words, that content isn't conveniently wrapped in tags in the original HTML. I don't recall the exact number of elements in the resulting XML structure, but it's around 30 (including some metadata that I add). That's one of several reasons why the XML/DOM step is necessary.
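To make the XPath-assisted extraction step concrete, here is a minimal sketch using only the JDK's built-in DOM and XPath support. The class name, the XPath expressions, and the output element names (`article`, `headline`, `byline`) are all hypothetical; the real rules are more involved and produce around 30 elements. The input is assumed to be well-formed markup, such as JTidy output.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class SemanticExtractor {

    /** Parses tidied (well-formed) HTML and copies selected parts into a new DOM. */
    public static Document extract(String tidiedHtml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document html = builder.parse(new InputSource(new StringReader(tidiedHtml)));
            XPath xpath = XPathFactory.newInstance().newXPath();

            // Hypothetical rules: the source page has no <headline> or <byline>
            // tag, so position and text patterns are used to locate the content.
            String headline = xpath.evaluate("//div[@class='story']/h1", html);
            String byline =
                    xpath.evaluate("//div[@class='story']/p[starts-with(., 'By ')]", html);

            // Build the new DOM that will be written out as the XML file.
            Document out = builder.newDocument();
            Element root = out.createElement("article");
            out.appendChild(root);
            append(out, root, "headline", headline.trim());
            append(out, root, "byline", byline.trim());
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input as XML", e);
        }
    }

    private static void append(Document doc, Element parent, String name, String text) {
        Element child = doc.createElement(name);
        child.setTextContent(text);
        parent.appendChild(child);
    }
}
```

Each element of the resulting document can then be mapped to its own index field, which is what makes the structure-sensitive searches possible.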
Regards,

Terry

PS: The problem that caused me to ask my original question stems from the fact that some of the extracted content (stored in a couple of the XML sections) sometimes contains HTML tags (in the form of entities), and the StandardTokenizer (which I'm using) doesn't ignore/remove them.

----- Original Message -----
From: "Joshua O'Madadhain" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Monday, October 21, 2002 8:57 PM
Subject: Re: Tags Screwing up Searches

> On Mon, 21 Oct 2002, Terry Steichen wrote:
>
> > Thanks for the comments - you might have something there. What I do
> > is clean up the HTML with JTidy and then parse it into a DOM. Then I
> > use selected parts to create a new DOM which I write out as an XML
> > file. I then use Lucene to index the XML files. Upon retrieval, I
> > once again parse the XML, format it and render it to a browser.
> >
> > The conversion from brackets to entities is necessary in order for the
> > browser (which will subsequently view it) to render it properly.
> >
> > But maybe, in the indexing process, I could convert it back again (to
> > brackets), but I'm not sure what to do with it then - in other words,
> > how to bring an HTML parser into the picture. If you have ideas on
> > this, I'd very much appreciate hearing them.
>
> Perhaps there is some reason for the conversion to XML that I'm not
> understanding (and this isn't really within my area of expertise).
>
> But if your purpose is to index HTML files and then display them later in
> response to a search, why not just use JTidy and then index the HTML
> instead (skipping the DOM and XML stages entirely), and then return the
> (cleaned-up) HTML later when asked for? The basis of any 'semantic' tags
> that you might be putting in the XML (perhaps to define Lucene fields)
> must be there in the HTML anyway, so I'm not sure what the DOM and XML
> representations get you.
>
> Regards,
>
> Joshua O'Madadhain
>
> [EMAIL PROTECTED] Per Obscurius...www.ics.uci.edu/~jmadden
> Joshua O'Madadhain: Information Scientist, Musician, Philosopher-At-Tall
> It's that moment of dawning comprehension that I live for--Bill Watterson
> My opinions are too rational and insightful to be those of any organization.
>
> --
> To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
> For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
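One possible way around the problem described in the PS is to clean the field text before it reaches the tokenizer: convert the entities back to brackets, then drop anything that looks like a tag. The sketch below is plain Java with no Lucene calls, and the class and method names are hypothetical. It uses a regular expression rather than a real HTML parser, so it is only a rough approximation.

```java
import java.util.regex.Pattern;

public class MarkupStripper {

    // Anything between '<' and the next '>' is treated as a tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    /** Turns entity-encoded markup back into brackets, then removes the tags. */
    public static String toPlainText(String fieldText) {
        String s = fieldText
                .replace("&lt;", "<")
                .replace("&gt;", ">")
                .replace("&quot;", "\"");
        // Replace each tag with a space so adjacent words do not run together.
        s = TAG.matcher(s).replaceAll(" ");
        // Decode "&amp;" last, so that "&amp;lt;" ends up as a literal "&lt;".
        s = s.replace("&amp;", "&");
        return s.trim().replaceAll("\\s+", " ");
    }
}
```

For example, `MarkupStripper.toPlainText("The &lt;b&gt;quick&lt;/b&gt; fox")` returns `The quick fox`. The cleaned string would be used only for the indexed field, while the stored XML keeps its entities for rendering. One known limitation: text that legitimately contains both a `&lt;` and a later `&gt;` (as in a comparison) would be mistaken for a tag and removed.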